
Vision Models

Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To handle this, you can use visual language models (VLMs), which can recognize images and extract textual descriptions from them. For this purpose, the library provides a model module whose implementations are built on the BaseVisionModel class, presented below.

Which model should I use?

The choice of model depends on your cloud provider, available API keys, and desired level of integration. All models inherit from BaseVisionModel and expose the same interface for extracting text and descriptions from images; a quick-start sketch follows the table below.

| Model | When to use | Requirements | Features |
|---|---|---|---|
| OpenAIVisionModel | If you have an OpenAI API key and want to use the OpenAI cloud | OPENAI_API_KEY (optional: OPENAI_MODEL, defaults to "gpt-4o") | Simple setup; standard OpenAI chat API |
| AzureOpenAIVisionModel | For Azure OpenAI Services users | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT, AZURE_OPENAI_API_VERSION | Integrates with Azure; enterprise controls |
| GrokVisionModel | If you have access to xAI's Grok multimodal model | XAI_API_KEY (optional: XAI_MODEL, defaults to "grok-4") | Supports data URIs; optional image quality (detail) setting |
| GeminiVisionModel | If you want Google's Gemini Vision models | GEMINI_API_KEY + multimodal extra: pip install 'splitter-mr[multimodal]' | Google Gemini API; multimodal; high-quality extraction |
| HuggingFaceVisionModel | Local/open-source/offline inference | Multimodal extra: pip install 'splitter-mr[multimodal]' (optional: HF_ACCESS_TOKEN for models that require it) | Runs locally; uses HF AutoProcessor + chat templates |
| AnthropicVisionModel | If you have an Anthropic key and want Claude Vision | ANTHROPIC_API_KEY (optional: ANTHROPIC_MODEL, defaults to "claude-sonnet-4-20250514") | Uses OpenAI SDK with Anthropic base URL; data-URI (base64) image input; OpenAI-compatible chat.completions |
| BaseVisionModel | Abstract base, not used directly | — | Template to build your own adapters |
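
As a quick start, the sketch below instantiates one of the adapters from the table and asks it to describe an image. It is a minimal sketch, assuming OPENAI_API_KEY is exported and using a placeholder file name; any other adapter from the table can be swapped in, since they all expose analyze_content.

```python
# Minimal quick-start sketch (assumes OPENAI_API_KEY is set; "invoice.png" is a placeholder).
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()  # picks up OPENAI_API_KEY (and optionally OPENAI_MODEL)

with open("invoice.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")  # base64 WITHOUT the data: prefix

print(model.analyze_content(img_b64, prompt="Describe the images in this page."))
```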

Models

BaseVisionModel

BaseVisionModel

Bases: ABC

Abstract base for vision models that extract text from images.

Subclasses encapsulate local or API-backed implementations (e.g., OpenAI, Azure OpenAI, or on-device models). Implementations should handle encoding, request construction, and response parsing while exposing a uniform interface for clients of the library.

Source code in src/splitter_mr/model/base_model.py
class BaseVisionModel(ABC):
    """
    Abstract base for vision models that extract text from images.

    Subclasses encapsulate local or API-backed implementations (e.g., OpenAI,
    Azure OpenAI, or on-device models). Implementations should handle encoding,
    request construction, and response parsing while exposing a uniform
    interface for clients of the library.
    """

    @abstractmethod
    def __init__(self, model_name) -> Any:
        """Initialize the model.

        Args:
            model_name (Any): Identifier of the underlying model. For hosted APIs
                this could be a model name or deployment name; for local models,
                it could be a path or configuration object.

        Raises:
            ValueError: If required configuration or credentials are missing.
        """

    @abstractmethod
    def get_client(self) -> Any:
        """Return the underlying client or handle.

        Returns:
            Any: A client/handle that the implementation uses to perform
                inference (e.g., an SDK client instance, session object, or
                lightweight wrapper). May be ``None`` for pure-local implementations.
        """

    @abstractmethod
    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str],
        **parameters: Dict[str, Any],
    ) -> str:
        """Extract text from an image using the provided prompt.

        Encodes the image (provided as base64 **without** the
        ``data:<mime>;base64,`` prefix), sends it with an instruction prompt to
        the underlying vision model, and returns the model's textual output.

        Args:
            prompt (str): Instruction or task description guiding the extraction
                (e.g., *"Read all visible text"* or *"Summarize the receipt"*).
            file (Optional[bytes]): Base64-encoded image bytes **without** the
                header/prefix. Must not be ``None`` for remote/API calls that
                require an image payload.
            file_ext (Optional[str]): File extension (e.g., ``"png"``, ``"jpg"``)
                used to infer the MIME type when required by the backend.
            **parameters (Dict[str, Any]): Additional backend-specific options
                forwarded to the implementation (e.g., timeouts, user tags,
                temperature, etc.).

        Returns:
            str: The extracted text or the model's textual response.

        Raises:
            ValueError: If ``file`` is ``None`` when required, or if the file
                type is unsupported by the implementation.
            RuntimeError: If the inference call fails or returns an unexpected
                response shape.
        """
__init__(model_name) abstractmethod

Initialize the model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | Any | Identifier of the underlying model. For hosted APIs this could be a model name or deployment name; for local models, it could be a path or configuration object. | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If required configuration or credentials are missing. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def __init__(self, model_name) -> Any:
    """Initialize the model.

    Args:
        model_name (Any): Identifier of the underlying model. For hosted APIs
            this could be a model name or deployment name; for local models,
            it could be a path or configuration object.

    Raises:
        ValueError: If required configuration or credentials are missing.
    """
analyze_content(prompt, file, file_ext, **parameters) abstractmethod

Extract text from an image using the provided prompt.

Encodes the image (provided as base64 without the data:<mime>;base64, prefix), sends it with an instruction prompt to the underlying vision model, and returns the model's textual output.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Instruction or task description guiding the extraction (e.g., "Read all visible text" or "Summarize the receipt"). | required |
| file | Optional[bytes] | Base64-encoded image bytes without the header/prefix. Must not be None for remote/API calls that require an image payload. | required |
| file_ext | Optional[str] | File extension (e.g., "png", "jpg") used to infer the MIME type when required by the backend. | required |
| **parameters | Dict[str, Any] | Additional backend-specific options forwarded to the implementation (e.g., timeouts, user tags, temperature, etc.). | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The extracted text or the model's textual response. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None when required, or if the file type is unsupported by the implementation. |
| RuntimeError | If the inference call fails or returns an unexpected response shape. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def analyze_content(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str],
    **parameters: Dict[str, Any],
) -> str:
    """Extract text from an image using the provided prompt.

    Encodes the image (provided as base64 **without** the
    ``data:<mime>;base64,`` prefix), sends it with an instruction prompt to
    the underlying vision model, and returns the model's textual output.

    Args:
        prompt (str): Instruction or task description guiding the extraction
            (e.g., *"Read all visible text"* or *"Summarize the receipt"*).
        file (Optional[bytes]): Base64-encoded image bytes **without** the
            header/prefix. Must not be ``None`` for remote/API calls that
            require an image payload.
        file_ext (Optional[str]): File extension (e.g., ``"png"``, ``"jpg"``)
            used to infer the MIME type when required by the backend.
        **parameters (Dict[str, Any]): Additional backend-specific options
            forwarded to the implementation (e.g., timeouts, user tags,
            temperature, etc.).

    Returns:
        str: The extracted text or the model's textual response.

    Raises:
        ValueError: If ``file`` is ``None`` when required, or if the file
            type is unsupported by the implementation.
        RuntimeError: If the inference call fails or returns an unexpected
            response shape.
    """
get_client() abstractmethod

Return the underlying client or handle.

Returns:

| Name | Type | Description |
|---|---|---|
| Any | Any | A client/handle that the implementation uses to perform inference (e.g., an SDK client instance, session object, or lightweight wrapper). May be None for pure-local implementations. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def get_client(self) -> Any:
    """Return the underlying client or handle.

    Returns:
        Any: A client/handle that the implementation uses to perform
            inference (e.g., an SDK client instance, session object, or
            lightweight wrapper). May be ``None`` for pure-local implementations.
    """

OpenAIVisionModel


OpenAIVisionModel

Bases: BaseVisionModel

Implementation of BaseModel leveraging OpenAI's Chat Completions API.

Uses the client.chat.completions.create() method to send base64-encoded images along with text prompts in a single multimodal request.

Source code in src/splitter_mr/model/models/openai_model.py
class OpenAIVisionModel(BaseVisionModel):
    """
    Implementation of BaseModel leveraging OpenAI's Chat Completions API.

    Uses the `client.chat.completions.create()` method to send base64-encoded
    images along with text prompts in a single multimodal request.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        model_name: str = os.getenv("OPENAI_MODEL", "gpt-4o"),
    ) -> None:
        """
        Initialize the OpenAIVisionModel.

        Args:
            api_key (str, optional): OpenAI API key. If not provided, uses the
                ``OPENAI_API_KEY`` environment variable.
            model_name (str): Vision-capable model name (e.g., ``"gpt-4o"``).

        Raises:
            ValueError: If no API key is provided or ``OPENAI_API_KEY`` is not set.
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name

    def get_client(self) -> OpenAI:
        """
        Get the underlying OpenAI client instance.

        Returns:
            OpenAI: The initialized API client.
        """
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        *,
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using OpenAI's Chat Completions API.

        Encodes the provided image bytes as a base64 data URI and sends it
        along with a textual prompt to the specified vision-capable model.
        The model processes the image and returns extracted text.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``,
                ``"jpeg"``, ``"webp"``, ``"gif"``) used to determine the MIME type.
                Defaults to ``"png"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the OpenAI client ``chat.completions.create()`` method. Consult documentation
                [here](https://platform.openai.com/docs/api-reference/chat/create).

        Returns:
            str: Extracted text returned by the model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            from splitter_mr.model import OpenAIVisionModel
            import base64

            model = OpenAIVisionModel(api_key="sk-...")
            with open("example.png", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")

            text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided for text extraction.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}"
                    ),
                ),
            ],
        )
        payload = payload_obj.model_dump(exclude_none=True)

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[payload],
            **parameters,
        )
        return response.choices[0].message.content
__init__(api_key=None, model_name=os.getenv('OPENAI_MODEL', 'gpt-4o'))

Initialize the OpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | OpenAI API key. If not provided, uses the OPENAI_API_KEY environment variable. | None |
| model_name | str | Vision-capable model name (e.g., "gpt-4o"). | getenv('OPENAI_MODEL', 'gpt-4o') |

Raises:

| Type | Description |
|---|---|
| ValueError | If no API key is provided or OPENAI_API_KEY is not set. |

Source code in src/splitter_mr/model/models/openai_model.py
def __init__(
    self,
    api_key: Optional[str] = None,
    model_name: str = os.getenv("OPENAI_MODEL", "gpt-4o"),
) -> None:
    """
    Initialize the OpenAIVisionModel.

    Args:
        api_key (str, optional): OpenAI API key. If not provided, uses the
            ``OPENAI_API_KEY`` environment variable.
        model_name (str): Vision-capable model name (e.g., ``"gpt-4o"``).

    Raises:
        ValueError: If no API key is provided or ``OPENAI_API_KEY`` is not set.
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)

Extract text from an image using OpenAI's Chat Completions API.

Encodes the provided image bytes as a base64 data URI and sends it along with a textual prompt to the specified vision-capable model. The model processes the image and returns extracted text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | bytes | Base64-encoded image content without the data:image/...;base64, prefix. Must not be None. | required |
| prompt | str | Instruction text guiding the extraction. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT. | DEFAULT_IMAGE_CAPTION_PROMPT |
| file_ext | str | File extension (e.g., "png", "jpg", "jpeg", "webp", "gif") used to determine the MIME type. Defaults to "png". | 'png' |
| **parameters | Any | Additional keyword arguments passed directly to the OpenAI client chat.completions.create() method. Consult the documentation [here](https://platform.openai.com/docs/api-reference/chat/create). | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Extracted text returned by the model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None or the file extension is not compatible. |
| OpenAIError | If the API request fails. |

Example
from splitter_mr.model import OpenAIVisionModel
import base64

model = OpenAIVisionModel(api_key="sk-...")
with open("example.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
print(text)
Source code in src/splitter_mr/model/models/openai_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    *,
    file_ext: Optional[str] = "png",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using OpenAI's Chat Completions API.

    Encodes the provided image bytes as a base64 data URI and sends it
    along with a textual prompt to the specified vision-capable model.
    The model processes the image and returns extracted text.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``,
            ``"jpeg"``, ``"webp"``, ``"gif"``) used to determine the MIME type.
            Defaults to ``"png"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the OpenAI client ``chat.completions.create()`` method. Consult documentation
            [here](https://platform.openai.com/docs/api-reference/chat/create).

    Returns:
        str: Extracted text returned by the model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        from splitter_mr.model import OpenAIVisionModel
        import base64

        model = OpenAIVisionModel(api_key="sk-...")
        with open("example.png", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")

        text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided for text extraction.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}"
                ),
            ),
        ],
    )
    payload = payload_obj.model_dump(exclude_none=True)

    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=[payload],
        **parameters,
    )
    return response.choices[0].message.content
get_client()

Get the underlying OpenAI client instance.

Returns:

| Name | Type | Description |
|---|---|---|
| OpenAI | OpenAI | The initialized API client. |

Source code in src/splitter_mr/model/models/openai_model.py
def get_client(self) -> OpenAI:
    """
    Get the underlying OpenAI client instance.

    Returns:
        OpenAI: The initialized API client.
    """
    return self.client
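
Because **parameters is forwarded verbatim to chat.completions.create(), standard OpenAI options can be passed straight through analyze_content. A hedged sketch follows (file name and option values are illustrative; it assumes OPENAI_API_KEY is set):

```python
# Sketch: forwarding chat.completions options through **parameters.
# Assumes OPENAI_API_KEY is set and "diagram.jpg" exists.
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()

with open("diagram.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(
    img_b64,
    prompt="List the labels that appear in this diagram.",
    file_ext="jpg",   # selects the image/jpeg MIME type
    max_tokens=300,   # forwarded to chat.completions.create()
    temperature=0.0,  # forwarded as well
)
print(text)
```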

AzureOpenAIVisionModel


AzureOpenAIVisionModel

Bases: BaseVisionModel

Implementation of BaseModel for Azure OpenAI Vision using the Responses API.

Utilizes Azure’s preview responses API, which supports base64-encoded images and stateful multimodal calls.

Source code in src/splitter_mr/model/models/azure_openai_model.py
class AzureOpenAIVisionModel(BaseVisionModel):
    """
    Implementation of BaseModel for Azure OpenAI Vision using the Responses API.

    Utilizes Azure’s preview `responses` API, which supports
    base64-encoded images and stateful multimodal calls.
    """

    def __init__(
        self,
        api_key: str = None,
        azure_endpoint: str = None,
        azure_deployment: str = None,
        api_version: str = None,
    ) -> None:
        """
        Initializes the AzureOpenAIVisionModel.

        Args:
            api_key (str, optional): Azure OpenAI API key.
                If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
            azure_endpoint (str, optional): Azure endpoint.
                If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
            azure_deployment (str, optional): Azure deployment name.
                If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
            api_version (str, optional): API version string.
                If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.

        Raises:
            ValueError: If no connection details are provided or environment variables
                are not set.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
                )
        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )
        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided or 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
                )
        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment

    def get_client(self) -> AzureOpenAI:
        """Returns the AzureOpenAI client instance."""
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using the Azure OpenAI Vision model.

        Encodes the given image as a data URI with an appropriate MIME type based on
        ``file_ext`` and sends it along with a prompt to the Azure OpenAI Vision API.
        The API processes the image and returns extracted text in the response.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
                used to determine the MIME type for the image. Defaults to ``"png"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the Azure OpenAI client ``chat.completions.create()`` method. Consult
                documentation [here](https://platform.openai.com/docs/api-reference/chat/create).

        Returns:
            str: The extracted text returned by the vision model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            model = AzureOpenAIVisionModel(...)
            with open("image.jpg", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")
            text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided to be analyzed with the VLM.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}"
                    ),
                ),
            ],
        )
        payload = payload_obj.model_dump(exclude_none=True)

        response = self.client.chat.completions.create(
            model=self.get_client()._azure_deployment,
            messages=[payload],
            **parameters,
        )
        return response.choices[0].message.content
__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None)

Initializes the AzureOpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | Azure OpenAI API key. If not provided, uses the 'AZURE_OPENAI_API_KEY' env var. | None |
| azure_endpoint | str | Azure endpoint. If not provided, uses the 'AZURE_OPENAI_ENDPOINT' env var. | None |
| azure_deployment | str | Azure deployment name. If not provided, uses the 'AZURE_OPENAI_DEPLOYMENT' env var. | None |
| api_version | str | API version string. If not provided, uses the 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If no connection details are provided or environment variables are not set. |

Source code in src/splitter_mr/model/models/azure_openai_model.py
def __init__(
    self,
    api_key: str = None,
    azure_endpoint: str = None,
    azure_deployment: str = None,
    api_version: str = None,
) -> None:
    """
    Initializes the AzureOpenAIVisionModel.

    Args:
        api_key (str, optional): Azure OpenAI API key.
            If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
        azure_endpoint (str, optional): Azure endpoint.
            If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
        azure_deployment (str, optional): Azure deployment name.
            If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
        api_version (str, optional): API version string.
            If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.

    Raises:
        ValueError: If no connection details are provided or environment variables
            are not set.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
            )
    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )
    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided or 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
            )
    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, file_ext='png', **parameters)

Extract text from an image using the Azure OpenAI Vision model.

Encodes the given image as a data URI with an appropriate MIME type based on file_ext and sends it along with a prompt to the Azure OpenAI Vision API. The API processes the image and returns extracted text in the response.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | bytes | Base64-encoded image content without the data:image/...;base64, prefix. Must not be None. | required |
| prompt | str | Instruction text guiding the extraction. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT. | DEFAULT_IMAGE_CAPTION_PROMPT |
| file_ext | str | File extension (e.g., "png", "jpg") used to determine the MIME type for the image. Defaults to "png". | 'png' |
| **parameters | Any | Additional keyword arguments passed directly to the Azure OpenAI client chat.completions.create() method. Consult the documentation [here](https://platform.openai.com/docs/api-reference/chat/create). | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The extracted text returned by the vision model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None or the file extension is not compatible. |
| OpenAIError | If the API request fails. |

Example
model = AzureOpenAIVisionModel(...)
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
print(text)
Source code in src/splitter_mr/model/models/azure_openai_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    file_ext: Optional[str] = "png",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using the Azure OpenAI Vision model.

    Encodes the given image as a data URI with an appropriate MIME type based on
    ``file_ext`` and sends it along with a prompt to the Azure OpenAI Vision API.
    The API processes the image and returns extracted text in the response.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
            used to determine the MIME type for the image. Defaults to ``"png"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the Azure OpenAI client ``chat.completions.create()`` method. Consult
            documentation [here](https://platform.openai.com/docs/api-reference/chat/create).

    Returns:
        str: The extracted text returned by the vision model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        model = AzureOpenAIVisionModel(...)
        with open("image.jpg", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")
        text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided to be analyzed with the VLM.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}"
                ),
            ),
        ],
    )
    payload = payload_obj.model_dump(exclude_none=True)

    response = self.client.chat.completions.create(
        model=self.get_client()._azure_deployment,
        messages=[payload],
        **parameters,
    )
    return response.choices[0].message.content
get_client()

Returns the AzureOpenAI client instance.

Source code in src/splitter_mr/model/models/azure_openai_model.py
def get_client(self) -> AzureOpenAI:
    """Returns the AzureOpenAI client instance."""
    return self.client
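
In practice, the Azure adapter is usually configured entirely from environment variables. A minimal sketch, assuming the variables listed above are already exported and that "scan.png" exists:

```python
# Sketch: AzureOpenAIVisionModel configured purely from environment variables
# (AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT,
#  and optionally AZURE_OPENAI_API_VERSION).
import base64

from splitter_mr.model import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel()  # all connection details resolved from env vars

with open("scan.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

print(model.analyze_content(img_b64, prompt="Transcribe any visible text."))
```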

GrokVisionModel


GrokVisionModel

Bases: BaseVisionModel

Implementation of BaseModel for Grok Vision using the xAI API.

Provides methods to interact with Grok’s multimodal models that support base64-encoded images and natural language instructions. This class is designed to extract structured text descriptions or captions from images.

Source code in src/splitter_mr/model/models/grok_model.py
class GrokVisionModel(BaseVisionModel):
    """
    Implementation of BaseModel for Grok Vision using the xAI API.

    Provides methods to interact with Grok’s multimodal models that support
    base64-encoded images and natural language instructions. This class is
    designed to extract structured text descriptions or captions from images.
    """

    def __init__(
        self,
        api_key: Optional[str] = os.getenv("XAI_API_KEY"),
        model_name: str = os.getenv("XAI_MODEL", "grok-4"),
    ) -> None:
        """
        Initializes the GrokVisionModel.

        Args:
            api_key (str, optional): Grok API key. If not provided, uses the
                ``XAI_API_KEY`` environment variable.
            model_name (str, optional): Model identifier to use. If not provided,
                defaults to ``XAI_MODEL`` environment variable or ``"grok-4"``.

        Raises:
            ValueError: If ``api_key`` is not provided or cannot be resolved
                from environment variables.
        """
        api_key = api_key or os.getenv("XAI_API_KEY")
        model_name = model_name or os.getenv("XAI_MODEL") or "grok-4"

        if not api_key:
            raise ValueError(
                "Grok API key not provided or 'XAI_API_KEY' env var is not set."
            )

        self.model_name = model_name
        self.client = Client(
            api_key=api_key,
            base_url="https://api.x.ai/v1",
        )  # TODO: Change to xAI SDK

    def get_client(self) -> Client:
        """
        Returns the underlying Grok API client.

        Returns:
            Client: The initialized Grok ``Client`` instance.
        """
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: Optional[str] = None,
        *,
        file_ext: Optional[str] = "png",
        detail: str = "auto",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using the Grok Vision model.

        Encodes the given image as a data URI with an appropriate MIME type based on
        ``file_ext`` and sends it along with a prompt to the Grok API. The API
        processes the image and returns extracted text in the response.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
                used to determine the MIME type for the image. Defaults to ``"png"``.
            detail (str, optional): Level of detail to request for the image
                analysis. Options typically include ``"low"``, ``"high"`` or ``"auto"``.
                Defaults to ``"auto"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the Grok client ``chat.completions.create()`` method.

        Returns:
            str: The extracted text returned by the vision model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            from splitter_mr.model import GrokVisionModel

            model = GrokVisionModel()
            with open("image.jpg", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")

            text = model.analyze_content(
                img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
            )
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided for text extraction.")

        ext = (file_ext or "png").lower()
        mime_type = (
            GROK_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_GROK_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        prompt = prompt or DEFAULT_IMAGE_CAPTION_PROMPT

        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}",
                        detail=detail,
                    ),
                ),
            ],
        )

        response = self.client.chat.completions.create(
            model=self.model_name, messages=[payload_obj], **parameters
        )

        return response.choices[0].message.content
__init__(api_key=os.getenv('XAI_API_KEY'), model_name=os.getenv('XAI_MODEL', 'grok-4'))

Initializes the GrokVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | Grok API key. If not provided, uses the XAI_API_KEY environment variable. | getenv('XAI_API_KEY') |
| model_name | str | Model identifier to use. If not provided, defaults to the XAI_MODEL environment variable or "grok-4". | getenv('XAI_MODEL', 'grok-4') |

Raises:

| Type | Description |
|---|---|
| ValueError | If api_key is not provided or cannot be resolved from environment variables. |

Source code in src/splitter_mr/model/models/grok_model.py
def __init__(
    self,
    api_key: Optional[str] = os.getenv("XAI_API_KEY"),
    model_name: str = os.getenv("XAI_MODEL", "grok-4"),
) -> None:
    """
    Initializes the GrokVisionModel.

    Args:
        api_key (str, optional): Grok API key. If not provided, uses the
            ``XAI_API_KEY`` environment variable.
        model_name (str, optional): Model identifier to use. If not provided,
            defaults to ``XAI_MODEL`` environment variable or ``"grok-4"``.

    Raises:
        ValueError: If ``api_key`` is not provided or cannot be resolved
            from environment variables.
    """
    api_key = api_key or os.getenv("XAI_API_KEY")
    model_name = model_name or os.getenv("XAI_MODEL") or "grok-4"

    if not api_key:
        raise ValueError(
            "Grok API key not provided or 'XAI_API_KEY' env var is not set."
        )

    self.model_name = model_name
    self.client = Client(
        api_key=api_key,
        base_url="https://api.x.ai/v1",
    )  # TODO: Change to xAI SDK
analyze_content(file, prompt=None, *, file_ext='png', detail='auto', **parameters)

Extract text from an image using the Grok Vision model.

Encodes the given image as a data URI with an appropriate MIME type based on file_ext and sends it along with a prompt to the Grok API. The API processes the image and returns extracted text in the response.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | bytes | Base64-encoded image content without the data:image/...;base64, prefix. Must not be None. | required |
| prompt | str | Instruction text guiding the extraction. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT. | None |
| file_ext | str | File extension (e.g., "png", "jpg") used to determine the MIME type for the image. Defaults to "png". | 'png' |
| detail | str | Level of detail to request for the image analysis. Options typically include "low", "high", or "auto". Defaults to "auto". | 'auto' |
| **parameters | Any | Additional keyword arguments passed directly to the Grok client chat.completions.create() method. | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The extracted text returned by the vision model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None or the file extension is not compatible. |
| OpenAIError | If the API request fails. |

Example
from splitter_mr.model import GrokVisionModel

model = GrokVisionModel()
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(
    img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
)
print(text)
Source code in src/splitter_mr/model/models/grok_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: Optional[str] = None,
    *,
    file_ext: Optional[str] = "png",
    detail: str = "auto",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using the Grok Vision model.

    Encodes the given image as a data URI with an appropriate MIME type based on
    ``file_ext`` and sends it along with a prompt to the Grok API. The API
    processes the image and returns extracted text in the response.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
            used to determine the MIME type for the image. Defaults to ``"png"``.
        detail (str, optional): Level of detail to request for the image
            analysis. Options typically include ``"low"``, ``"high"`` or ``"auto"``.
            Defaults to ``"auto"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the Grok client ``chat.completions.create()`` method.

    Returns:
        str: The extracted text returned by the vision model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        from splitter_mr.model import GrokVisionModel

        model = GrokVisionModel()
        with open("image.jpg", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")

        text = model.analyze_content(
            img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
        )
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided for text extraction.")

    ext = (file_ext or "png").lower()
    mime_type = (
        GROK_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_GROK_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    prompt = prompt or DEFAULT_IMAGE_CAPTION_PROMPT

    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}",
                    detail=detail,
                ),
            ),
        ],
    )

    response = self.client.chat.completions.create(
        model=self.model_name, messages=[payload_obj], **parameters
    )

    return response.choices[0].message.content
get_client()

Returns the underlying Grok API client.

Returns:

| Name | Type | Description |
|---|---|---|
| Client | Client | The initialized Grok Client instance. |

Source code in src/splitter_mr/model/models/grok_model.py
def get_client(self) -> Client:
    """
    Returns the underlying Grok API client.

    Returns:
        Client: The initialized Grok ``Client`` instance.
    """
    return self.client
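
To complement the docstring example above, the sketch below requests a lower-fidelity analysis via detail="low" and forwards a token limit through **parameters. File name and option values are illustrative; it assumes XAI_API_KEY is set.

```python
# Sketch: Grok with a low-detail image analysis and a forwarded token limit.
# Assumes XAI_API_KEY is set and "photo.jpg" exists.
import base64

from splitter_mr.model import GrokVisionModel

model = GrokVisionModel()

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

caption = model.analyze_content(
    img_b64,
    prompt="Give a one-sentence caption.",
    file_ext="jpg",
    detail="low",    # lower-fidelity image analysis
    max_tokens=60,   # forwarded to chat.completions.create()
)
print(caption)
```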

GeminiVisionModel


GeminiVisionModel

Bases: BaseVisionModel

Implementation of BaseVisionModel using Google's Gemini Image Understanding API.

Source code in src/splitter_mr/model/models/gemini_model.py
class GeminiVisionModel(BaseVisionModel):
    """Implementation of `BaseVisionModel` using Google's Gemini Image Understanding API."""

    def __init__(
        self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"
    ) -> None:
        """
        Initialize the GeminiVisionModel.

        Args:
            api_key: Gemini API key. If not provided, uses 'GEMINI_API_KEY' env var.
            model_name: Vision-capable Gemini model name.

        Raises:
            ImportError: If `google-generativeai` is not installed.
            ValueError: If no API key is provided or 'GEMINI_API_KEY' not set.
        """

        if api_key is None:
            api_key = os.getenv("GEMINI_API_KEY")
        if not api_key:
            raise ValueError(
                "Google Gemini API key not provided or 'GEMINI_API_KEY' not set."
            )

        self.api_key = api_key
        self.model_name = model_name
        self.client = genai.Client(api_key=self.api_key)
        self.model = self.client.models
        self._types = types  # keep handle for analyze_content

    def get_client(self) -> Any:
        """Return the underlying Gemini SDK client."""
        return self.client

    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str] = None,
        **parameters: Any,
    ) -> str:
        """Extract text from an image using Gemini's image understanding API."""
        if file is None:
            raise ValueError("No image file provided for extraction.")

        ext = (file_ext or "jpg").lower()
        mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")

        img_b64 = file.decode("utf-8") if isinstance(file, (bytes, bytearray)) else file
        try:
            img_bytes = base64.b64decode(img_b64)
        except Exception as e:
            raise ValueError(f"Failed to decode base64 image data: {e}")

        # Build Gemini-compatible parts (using lazy-imported types)
        image_part = self._types.Part.from_bytes(data=img_bytes, mime_type=mime_type)
        text_part = prompt
        contents = [image_part, text_part]

        try:
            response = self.model.generate_content(
                model=self.model_name,
                contents=contents,
                **parameters,
            )
            return response.text
        except Exception as e:
            raise RuntimeError(f"Gemini model inference failed: {e}")
__init__(api_key=None, model_name='gemini-2.5-flash')

Initialize the GeminiVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | Gemini API key. If not provided, uses the 'GEMINI_API_KEY' env var. | None |
| model_name | str | Vision-capable Gemini model name. | 'gemini-2.5-flash' |

Raises:

| Type | Description |
|---|---|
| ImportError | If google-generativeai is not installed. |
| ValueError | If no API key is provided or 'GEMINI_API_KEY' is not set. |

Source code in src/splitter_mr/model/models/gemini_model.py
def __init__(
    self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"
) -> None:
    """
    Initialize the GeminiVisionModel.

    Args:
        api_key: Gemini API key. If not provided, uses 'GEMINI_API_KEY' env var.
        model_name: Vision-capable Gemini model name.

    Raises:
        ImportError: If `google-generativeai` is not installed.
        ValueError: If no API key is provided or 'GEMINI_API_KEY' not set.
    """

    if api_key is None:
        api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise ValueError(
            "Google Gemini API key not provided or 'GEMINI_API_KEY' not set."
        )

    self.api_key = api_key
    self.model_name = model_name
    self.client = genai.Client(api_key=self.api_key)
    self.model = self.client.models
    self._types = types  # keep handle for analyze_content
analyze_content(prompt, file, file_ext=None, **parameters)

Extract text from an image using Gemini's image understanding API.

Source code in src/splitter_mr/model/models/gemini_model.py
def analyze_content(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str] = None,
    **parameters: Any,
) -> str:
    """Extract text from an image using Gemini's image understanding API."""
    if file is None:
        raise ValueError("No image file provided for extraction.")

    ext = (file_ext or "jpg").lower()
    mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")

    img_b64 = file.decode("utf-8") if isinstance(file, (bytes, bytearray)) else file
    try:
        img_bytes = base64.b64decode(img_b64)
    except Exception as e:
        raise ValueError(f"Failed to decode base64 image data: {e}")

    # Build Gemini-compatible parts (using lazy-imported types)
    image_part = self._types.Part.from_bytes(data=img_bytes, mime_type=mime_type)
    text_part = prompt
    contents = [image_part, text_part]

    try:
        response = self.model.generate_content(
            model=self.model_name,
            contents=contents,
            **parameters,
        )
        return response.text
    except Exception as e:
        raise RuntimeError(f"Gemini model inference failed: {e}")
get_client()

Return the underlying Gemini SDK client.

Source code in src/splitter_mr/model/models/gemini_model.py
def get_client(self) -> Any:
    """Return the underlying Gemini SDK client."""
    return self.client
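
Unlike the other adapters, GeminiVisionModel.analyze_content takes the prompt first and the base64 payload second. A hedged usage sketch, assuming GEMINI_API_KEY is set, the multimodal extra is installed, and "chart.png" exists:

```python
# Sketch: GeminiVisionModel usage; note the prompt-first argument order.
import base64

from splitter_mr.model import GeminiVisionModel  # assumed import path, like the other adapters

model = GeminiVisionModel()  # defaults to "gemini-2.5-flash"

with open("chart.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

summary = model.analyze_content(
    "Summarize what this chart shows.",  # prompt comes first for this adapter
    img_b64,
    file_ext="png",
)
print(summary)
```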

AnthropicVisionModel


AnthropicVisionModel

Bases: BaseVisionModel

Implementation of BaseVisionModel using Anthropic's Claude Vision API via OpenAI SDK.

Sends base64-encoded images + prompts to the Claude multimodal endpoint.

Source code in src/splitter_mr/model/models/anthropic_model.py
class AnthropicVisionModel(BaseVisionModel):
    """
    Implementation of BaseVisionModel using Anthropic's Claude Vision API via OpenAI SDK.

    Sends base64-encoded images + prompts to the Claude multimodal endpoint.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        model_name: str = os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
    ) -> None:
        """
        Initialize the AnthropicVisionModel.

        Args:
            api_key (str, optional): Anthropic API key. Uses ANTHROPIC_API_KEY env var if not provided.
            model_name (str): Vision-capable Claude model name.

        Raises:
            ValueError: If no API key provided or found in environment.
        """
        if api_key is None:
            api_key = os.getenv("ANTHROPIC_API_KEY")
            if not api_key:
                raise ValueError(
                    "Anthropic API key not provided and 'ANTHROPIC_API_KEY' env var not set."
                )

        base_url: str = ("https://api.anthropic.com/v1/",)
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model_name = model_name

    def get_client(self) -> OpenAI:
        """
        Get the underlying Anthropic API client instance.

        Returns:
            OpenAI: The initialized API client.
        """
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        *,
        file_ext: Optional[str] = "png",
        **parameters: Dict[str, Any],
    ) -> str:
        """
        Extract text from an image using Anthropic's Claude Vision API.

        Args:
            prompt (str): Task or instruction (e.g. "Describe the image contents").
            file (bytes): Base64-encoded image content, no prefix/header.
            file_ext (str, optional): File extension (e.g. "png", "jpg").
            **parameters: Extra arguments to client.chat.completions.create().

        Returns:
            str: Extracted text or model response.

        Raises:
            ValueError: If file is None or unsupported file type.
            RuntimeError: For failed/invalid responses.
        """
        if file is None:
            raise ValueError("No file content provided for vision model.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )
        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type for Anthropic: {mime_type}")

        # Build multimodal payload in OpenAI/Anthropic-compatible format
        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}"
                    ),
                ),
            ],
        )
        payload = payload_obj.model_dump(exclude_none=True)

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[payload],
            **parameters,
        )
        try:
            return response.choices[0].message.content
        except Exception as e:
            raise RuntimeError(f"Failed to extract response: {e}")
__init__(api_key=None, model_name=os.getenv('ANTHROPIC_MODEL', 'claude-sonnet-4-20250514'))

Initialize the AnthropicVisionModel.

Parameters:

- api_key (str, optional): Anthropic API key. Uses ANTHROPIC_API_KEY env var if not provided. Default: None.
- model_name (str): Vision-capable Claude model name. Default: getenv('ANTHROPIC_MODEL', 'claude-sonnet-4-20250514').

Raises:

- ValueError: If no API key provided or found in environment.

Source code in src/splitter_mr/model/models/anthropic_model.py
def __init__(
    self,
    api_key: Optional[str] = None,
    model_name: str = os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
) -> None:
    """
    Initialize the AnthropicVisionModel.

    Args:
        api_key (str, optional): Anthropic API key. Uses ANTHROPIC_API_KEY env var if not provided.
        model_name (str): Vision-capable Claude model name.

    Raises:
        ValueError: If no API key provided or found in environment.
    """
    if api_key is None:
        api_key = os.getenv("ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError(
                "Anthropic API key not provided and 'ANTHROPIC_API_KEY' env var not set."
            )

    base_url: str = "https://api.anthropic.com/v1/"
    self.client = OpenAI(api_key=api_key, base_url=base_url)
    self.model_name = model_name
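
Because the model_name default is evaluated when the module is imported, set ANTHROPIC_MODEL before importing, or pass model_name explicitly. A short sketch (the key value is a placeholder):

```python
from splitter_mr.model.models.anthropic_model import AnthropicVisionModel

model = AnthropicVisionModel(
    api_key="sk-ant-...",                    # placeholder; omit to fall back to ANTHROPIC_API_KEY
    model_name="claude-sonnet-4-20250514",   # explicit override of the ANTHROPIC_MODEL default
)
```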
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)

Extract text from an image using Anthropic's Claude Vision API.

Parameters:

- prompt (str): Task or instruction (e.g. "Describe the image contents"). Default: DEFAULT_IMAGE_CAPTION_PROMPT.
- file (bytes): Base64-encoded image content, no prefix/header. Required.
- file_ext (str, optional): File extension (e.g. "png", "jpg"). Default: 'png'.
- **parameters (Dict[str, Any]): Extra arguments to client.chat.completions.create(). Default: {}.

Returns:

- str: Extracted text or model response.

Raises:

- ValueError: If file is None or unsupported file type.
- RuntimeError: For failed/invalid responses.

Source code in src/splitter_mr/model/models/anthropic_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    *,
    file_ext: Optional[str] = "png",
    **parameters: Dict[str, Any],
) -> str:
    """
    Extract text from an image using Anthropic's Claude Vision API.

    Args:
        prompt (str): Task or instruction (e.g. "Describe the image contents").
        file (bytes): Base64-encoded image content, no prefix/header.
        file_ext (str, optional): File extension (e.g. "png", "jpg").
        **parameters: Extra arguments to client.chat.completions.create().

    Returns:
        str: Extracted text or model response.

    Raises:
        ValueError: If file is None or unsupported file type.
        RuntimeError: For failed/invalid responses.
    """
    if file is None:
        raise ValueError("No file content provided for vision model.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )
    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type for Anthropic: {mime_type}")

    # Build multimodal payload in OpenAI/Anthropic-compatible format
    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}"
                ),
            ),
        ],
    )
    payload = payload_obj.model_dump(exclude_none=True)

    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=[payload],
        **parameters,
    )
    try:
        return response.choices[0].message.content
    except Exception as e:
        raise RuntimeError(f"Failed to extract response: {e}")
get_client()

Get the underlying Anthropic API client instance.

Returns:

- OpenAI: The initialized API client.

Source code in src/splitter_mr/model/models/anthropic_model.py
def get_client(self) -> OpenAI:
    """
    Get the underlying Anthropic API client instance.

    Returns:
        OpenAI: The initialized API client.
    """
    return self.client

HuggingFaceVisionModel

Warning

HuggingFaceVisionModel does NOT currently support every model available on Hugging Face.

For example, closed models (e.g., Microsoft Florence 2 large) and models that use uncommon architectures (e.g., NanoNets) are not supported. We strongly recommend using SmolDocling, as it has been exhaustively tested.


HuggingFaceVisionModel

Bases: BaseVisionModel

Vision-language model wrapper using Hugging Face Transformers.

This implementation loads a local or Hugging Face Hub model that supports image-to-text or multimodal tasks. It accepts a prompt and an image as base64 (without the data URI header) and returns the model's generated text. Pydantic schema models are used for message validation.

Example
import base64, requests
from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

# Encode an image as base64
img_bytes = requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/p-blog/candy.JPG"
).content
img_b64 = base64.b64encode(img_bytes).decode("utf-8")

model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
result = model.analyze_content("What animal is on the candy?", file=img_b64)
print(result)  # e.g., "A small green thing."
Source code in src/splitter_mr/model/models/huggingface_model.py
class HuggingFaceVisionModel(BaseVisionModel):
    """
    Vision-language model wrapper using Hugging Face Transformers.

    This implementation loads a local or Hugging Face Hub model that supports
    image-to-text or multimodal tasks. It accepts a prompt and an image as
    base64 (without the data URI header) and returns the model's generated text.
    Pydantic schema models are used for message validation.

    Example:
        ```python
        import base64, requests
        from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

        # Encode an image as base64
        img_bytes = requests.get(
            "https://huggingface.co/datasets/huggingface/documentation-images/"
            "resolve/main/p-blog/candy.JPG"
        ).content
        img_b64 = base64.b64encode(img_bytes).decode("utf-8")

        model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
        result = model.analyze_content("What animal is on the candy?", file=img_b64)
        print(result)  # e.g., "A small green thing."
        ```
    """

    DEFAULT_EXT: str = "jpg"
    FALLBACKS: List[Tuple[str, Optional[Any]]] = [
        ("AutoModelForVision2Seq", None),
        ("AutoModelForImageTextToText", None),
        ("AutoModelForCausalLM", None),
        ("AutoModelForPreTraining", None),
        ("AutoModel", None),
    ]

    def __init__(self, model_name: str = "ds4sd/SmolDocling-256M-preview") -> None:
        """
        Initialize a HuggingFaceVisionModel.

        Args:
            model_name (str, optional): Model repo ID or local path
                (e.g., ``"ds4sd/SmolDocling-256M-preview"``).

        Raises:
            ImportError: If the 'multimodal' extra (transformers) is not installed.
            RuntimeError: If processor or model loading fails after all attempts.
        """

        transformers = importlib.import_module("transformers")

        AutoProcessor = transformers.AutoProcessor
        AutoImageProcessor = transformers.AutoImageProcessor
        AutoConfig = transformers.AutoConfig

        self.model_id = model_name
        self.model = None
        self.processor = None

        # Load processor
        try:
            self.processor = AutoProcessor.from_pretrained(
                self.model_id, trust_remote_code=True
            )
        except Exception:
            try:
                self.processor = AutoImageProcessor.from_pretrained(
                    self.model_id, trust_remote_code=True
                )
            except Exception as e:
                raise RuntimeError("All processor loading attempts failed.") from e

        # Load model
        config = AutoConfig.from_pretrained(self.model_id)
        errors: List[str] = []

        try:
            arch_name = config.architectures[0]
            ModelClass = getattr(transformers, arch_name)
            self.model = ModelClass.from_pretrained(
                self.model_id, trust_remote_code=True
            )
        except Exception as e:
            errors.append(f"[AutoModel by architecture] {e}")

        if self.model is None:
            resolved: List[Tuple[str, Any]] = []
            for name, cls in self.FALLBACKS:
                resolved.append((name, cls or getattr(transformers, name)))
            for name, cls in resolved:
                try:
                    self.model = cls.from_pretrained(
                        self.model_id, trust_remote_code=True
                    )
                    break
                except Exception as e:
                    errors.append(f"[{name}] {e}")

        if self.model is None:
            raise RuntimeError(
                "All model loading attempts failed:\n" + "\n".join(errors)
            )

    def get_client(self) -> Any:
        """Return the underlying HuggingFace model instance.

        Returns:
            Any: The instantiated HuggingFace model object.
        """
        return self.model

    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str] = None,
        **parameters: Dict[str, Any],
    ) -> str:
        """
        Extract text from an image using the vision-language model.

        This method encodes an image as a data URI, builds a validated
        message using schema models, prepares inputs, and calls the model
        to generate a textual response.

        Args:
            prompt (str): Instruction or question for the model
                (e.g., ``"Describe this image."``).
            file (Optional[bytes]): Image as a base64-encoded string (without prefix).
            file_ext (Optional[str], optional): File extension (e.g., ``"jpg"`` or ``"png"``).
                Defaults to ``"jpg"`` if not provided.
            **parameters (Dict[str, Any]): Extra keyword arguments passed directly
                to the model's ``generate()`` method (e.g., ``max_new_tokens``,
                ``temperature``).

        Returns:
            str: The extracted or generated text.

        Raises:
            ValueError: If ``file`` is None.
            RuntimeError: If input preparation or inference fails.
        """
        if file is None:
            raise ValueError("No image file provided for extraction.")

        ext = (file_ext or self.DEFAULT_EXT).lower()
        mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")
        img_b64 = file if isinstance(file, str) else file.decode("utf-8")
        img_data_uri = f"data:{mime_type};base64,{img_b64}"

        text_content = HFChatTextContent(type="text", text=prompt)
        image_content = HFChatImageContent(type="image", image=img_data_uri)
        chat_msg = HFChatMessage(role="user", content=[image_content, text_content])
        messages = [chat_msg.model_dump(exclude_none=True)]

        try:
            inputs = self.processor.apply_chat_template(
                messages,
                add_generation_prompt=True,
                tokenize=True,
                return_dict=True,
                return_tensors="pt",
                truncation=True,
            ).to(self.model.device)
        except Exception as e:
            raise RuntimeError(f"Failed to prepare input: {e}")

        try:
            max_new_tokens = parameters.pop("max_new_tokens", 40)
            outputs = self.model.generate(
                **inputs, max_new_tokens=max_new_tokens, **parameters
            )
            output_text = self.processor.decode(
                outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
            )
            return output_text
        except Exception as e:
            raise RuntimeError(f"Model inference failed: {e}")
__init__(model_name='ds4sd/SmolDocling-256M-preview')

Initialize a HuggingFaceVisionModel.

Parameters:

- model_name (str, optional): Model repo ID or local path (e.g., "ds4sd/SmolDocling-256M-preview"). Default: 'ds4sd/SmolDocling-256M-preview'.

Raises:

- ImportError: If the 'multimodal' extra (transformers) is not installed.
- RuntimeError: If processor or model loading fails after all attempts.

Source code in src/splitter_mr/model/models/huggingface_model.py
def __init__(self, model_name: str = "ds4sd/SmolDocling-256M-preview") -> None:
    """
    Initialize a HuggingFaceVisionModel.

    Args:
        model_name (str, optional): Model repo ID or local path
            (e.g., ``"ds4sd/SmolDocling-256M-preview"``).

    Raises:
        ImportError: If the 'multimodal' extra (transformers) is not installed.
        RuntimeError: If processor or model loading fails after all attempts.
    """

    transformers = importlib.import_module("transformers")

    AutoProcessor = transformers.AutoProcessor
    AutoImageProcessor = transformers.AutoImageProcessor
    AutoConfig = transformers.AutoConfig

    self.model_id = model_name
    self.model = None
    self.processor = None

    # Load processor
    try:
        self.processor = AutoProcessor.from_pretrained(
            self.model_id, trust_remote_code=True
        )
    except Exception:
        try:
            self.processor = AutoImageProcessor.from_pretrained(
                self.model_id, trust_remote_code=True
            )
        except Exception as e:
            raise RuntimeError("All processor loading attempts failed.") from e

    # Load model
    config = AutoConfig.from_pretrained(self.model_id)
    errors: List[str] = []

    try:
        arch_name = config.architectures[0]
        ModelClass = getattr(transformers, arch_name)
        self.model = ModelClass.from_pretrained(
            self.model_id, trust_remote_code=True
        )
    except Exception as e:
        errors.append(f"[AutoModel by architecture] {e}")

    if self.model is None:
        resolved: List[Tuple[str, Any]] = []
        for name, cls in self.FALLBACKS:
            resolved.append((name, cls or getattr(transformers, name)))
        for name, cls in resolved:
            try:
                self.model = cls.from_pretrained(
                    self.model_id, trust_remote_code=True
                )
                break
            except Exception as e:
                errors.append(f"[{name}] {e}")

    if self.model is None:
        raise RuntimeError(
            "All model loading attempts failed:\n" + "\n".join(errors)
        )
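
For gated Hub repositories you may need to authenticate before instantiation. A hedged sketch using huggingface_hub.login with the HF_ACCESS_TOKEN variable mentioned in the model table; whether the wrapper reads this token on its own is not shown here, and SmolDocling itself is openly available.

```python
import os

from huggingface_hub import login

from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

# Authenticate against the Hub only if a token is available (needed for gated repos)
token = os.getenv("HF_ACCESS_TOKEN")
if token:
    login(token=token)

model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
```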
analyze_content(prompt, file, file_ext=None, **parameters)

Extract text from an image using the vision-language model.

This method encodes an image as a data URI, builds a validated message using schema models, prepares inputs, and calls the model to generate a textual response.

Parameters:

- prompt (str): Instruction or question for the model (e.g., "Describe this image."). Required.
- file (Optional[bytes]): Image as a base64-encoded string (without prefix). Required.
- file_ext (Optional[str], optional): File extension (e.g., "jpg" or "png"). Defaults to "jpg" if not provided. Default: None.
- **parameters (Dict[str, Any]): Extra keyword arguments passed directly to the model's generate() method (e.g., max_new_tokens, temperature). Default: {}.

Returns:

- str: The extracted or generated text.

Raises:

- ValueError: If file is None.
- RuntimeError: If input preparation or inference fails.

Source code in src/splitter_mr/model/models/huggingface_model.py
def analyze_content(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str] = None,
    **parameters: Dict[str, Any],
) -> str:
    """
    Extract text from an image using the vision-language model.

    This method encodes an image as a data URI, builds a validated
    message using schema models, prepares inputs, and calls the model
    to generate a textual response.

    Args:
        prompt (str): Instruction or question for the model
            (e.g., ``"Describe this image."``).
        file (Optional[bytes]): Image as a base64-encoded string (without prefix).
        file_ext (Optional[str], optional): File extension (e.g., ``"jpg"`` or ``"png"``).
            Defaults to ``"jpg"`` if not provided.
        **parameters (Dict[str, Any]): Extra keyword arguments passed directly
            to the model's ``generate()`` method (e.g., ``max_new_tokens``,
            ``temperature``).

    Returns:
        str: The extracted or generated text.

    Raises:
        ValueError: If ``file`` is None.
        RuntimeError: If input preparation or inference fails.
    """
    if file is None:
        raise ValueError("No image file provided for extraction.")

    ext = (file_ext or self.DEFAULT_EXT).lower()
    mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")
    img_b64 = file if isinstance(file, str) else file.decode("utf-8")
    img_data_uri = f"data:{mime_type};base64,{img_b64}"

    text_content = HFChatTextContent(type="text", text=prompt)
    image_content = HFChatImageContent(type="image", image=img_data_uri)
    chat_msg = HFChatMessage(role="user", content=[image_content, text_content])
    messages = [chat_msg.model_dump(exclude_none=True)]

    try:
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
            truncation=True,
        ).to(self.model.device)
    except Exception as e:
        raise RuntimeError(f"Failed to prepare input: {e}")

    try:
        max_new_tokens = parameters.pop("max_new_tokens", 40)
        outputs = self.model.generate(
            **inputs, max_new_tokens=max_new_tokens, **parameters
        )
        output_text = self.processor.decode(
            outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
        )
        return output_text
    except Exception as e:
        raise RuntimeError(f"Model inference failed: {e}")
get_client()

Return the underlying HuggingFace model instance.

Returns:

- Any: The instantiated HuggingFace model object.

Source code in src/splitter_mr/model/models/huggingface_model.py
def get_client(self) -> Any:
    """Return the underlying HuggingFace model instance.

    Returns:
        Any: The instantiated HuggingFace model object.
    """
    return self.model