Visual Models

Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To handle this, you can use vision language models (VLMs), which can recognize images and extract descriptions from them. For this purpose, the library provides a model module whose implementations are based on the BaseModel class, presented below.

Which model should I use?

The choice of model depends on your cloud provider, available API keys, and desired level of integration. All models inherit from BaseModel and provide the same interface for extracting text and descriptions from images.

| Model | When to use | Requirements | Features |
| --- | --- | --- | --- |
| `OpenAIVisionModel` | Use if you have an OpenAI API key and want to use the OpenAI cloud | OpenAI account & API key | No Azure setup; easy to get started |
| `AzureOpenAIVisionModel` | Use if your organization uses Azure OpenAI Services | Azure OpenAI deployment, API key, endpoint | Azure integration; enterprise security |
| `BaseModel` | Abstract base; not used directly | — | Use as a template for building your own model |
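
For example, a minimal instantiation sketch (assuming both classes are exported from `splitter_mr.model`, as the examples further below suggest, and that the relevant credentials are set in your environment):

```python
from splitter_mr.model import AzureOpenAIVisionModel, OpenAIVisionModel

# OpenAI cloud: reads OPENAI_API_KEY from the environment when api_key is omitted.
openai_model = OpenAIVisionModel(model_name="gpt-4.1")

# Azure OpenAI: reads AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and
# AZURE_OPENAI_DEPLOYMENT from the environment when arguments are omitted.
azure_model = AzureOpenAIVisionModel()
```

Either instance exposes the same extract_text() interface, so you can swap providers without changing calling code.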

Models

BaseModel

Bases: ABC

Abstract base for vision models that extract text from images.

Subclasses encapsulate local or API-backed implementations (e.g., OpenAI, Azure OpenAI, or on-device models). Implementations should handle encoding, request construction, and response parsing while exposing a uniform interface for clients of the library.

Source code in src/splitter_mr/model/base_model.py
class BaseModel(ABC):
    """
    Abstract base for vision models that extract text from images.

    Subclasses encapsulate local or API-backed implementations (e.g., OpenAI,
    Azure OpenAI, or on-device models). Implementations should handle encoding,
    request construction, and response parsing while exposing a uniform
    interface for clients of the library.
    """

    @abstractmethod
    def __init__(self, model_name) -> Any:
        """Initialize the model.

        Args:
            model_name (Any): Identifier of the underlying model. For hosted APIs
                this could be a model name or deployment name; for local models,
                it could be a path or configuration object.

        Raises:
            ValueError: If required configuration or credentials are missing.
        """

    @abstractmethod
    def get_client(self) -> Any:
        """Return the underlying client or handle.

        Returns:
            Any: A client/handle that the implementation uses to perform
                inference (e.g., an SDK client instance, session object, or
                lightweight wrapper). May be ``None`` for pure-local implementations.
        """

    @abstractmethod
    def extract_text(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str],
        **parameters: Dict[str, Any],
    ) -> str:
        """Extract text from an image using the provided prompt.

        Encodes the image (provided as base64 **without** the
        ``data:<mime>;base64,`` prefix), sends it with an instruction prompt to
        the underlying vision model, and returns the model's textual output.

        Args:
            prompt (str): Instruction or task description guiding the extraction
                (e.g., *"Read all visible text"* or *"Summarize the receipt"*).
            file (Optional[bytes]): Base64-encoded image bytes **without** the
                header/prefix. Must not be ``None`` for remote/API calls that
                require an image payload.
            file_ext (Optional[str]): File extension (e.g., ``"png"``, ``"jpg"``)
                used to infer the MIME type when required by the backend.
            **parameters (Dict[str, Any]): Additional backend-specific options
                forwarded to the implementation (e.g., timeouts, user tags,
                temperature, etc.).

        Returns:
            str: The extracted text or the model's textual response.

        Raises:
            ValueError: If ``file`` is ``None`` when required, or if the file
                type is unsupported by the implementation.
            RuntimeError: If the inference call fails or returns an unexpected
                response shape.
        """
__init__(model_name) abstractmethod

Initialize the model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model_name` | `Any` | Identifier of the underlying model. For hosted APIs this could be a model name or deployment name; for local models, it could be a path or configuration object. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If required configuration or credentials are missing. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def __init__(self, model_name) -> Any:
    """Initialize the model.

    Args:
        model_name (Any): Identifier of the underlying model. For hosted APIs
            this could be a model name or deployment name; for local models,
            it could be a path or configuration object.

    Raises:
        ValueError: If required configuration or credentials are missing.
    """
get_client() abstractmethod

Return the underlying client or handle.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `Any` | `Any` | A client/handle that the implementation uses to perform inference (e.g., an SDK client instance, session object, or lightweight wrapper). May be `None` for pure-local implementations. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def get_client(self) -> Any:
    """Return the underlying client or handle.

    Returns:
        Any: A client/handle that the implementation uses to perform
            inference (e.g., an SDK client instance, session object, or
            lightweight wrapper). May be ``None`` for pure-local implementations.
    """
extract_text(prompt, file, file_ext, **parameters) abstractmethod

Extract text from an image using the provided prompt.

Encodes the image (provided as base64 without the data:<mime>;base64, prefix), sends it with an instruction prompt to the underlying vision model, and returns the model's textual output.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `prompt` | `str` | Instruction or task description guiding the extraction (e.g., *"Read all visible text"* or *"Summarize the receipt"*). | *required* |
| `file` | `Optional[bytes]` | Base64-encoded image bytes **without** the header/prefix. Must not be `None` for remote/API calls that require an image payload. | *required* |
| `file_ext` | `Optional[str]` | File extension (e.g., `"png"`, `"jpg"`) used to infer the MIME type when required by the backend. | *required* |
| `**parameters` | `Dict[str, Any]` | Additional backend-specific options forwarded to the implementation (e.g., timeouts, user tags, temperature, etc.). | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The extracted text or the model's textual response. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `file` is `None` when required, or if the file type is unsupported by the implementation. |
| `RuntimeError` | If the inference call fails or returns an unexpected response shape. |
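
Note the payload convention used throughout: `file` carries raw base64 text, not a full data URI. A sketch of preparing such a payload (file name is illustrative):

```python
import base64

with open("page.png", "rb") as f:
    # Raw base64, *without* the "data:image/png;base64," prefix.
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
```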

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def extract_text(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str],
    **parameters: Dict[str, Any],
) -> str:
    """Extract text from an image using the provided prompt.

    Encodes the image (provided as base64 **without** the
    ``data:<mime>;base64,`` prefix), sends it with an instruction prompt to
    the underlying vision model, and returns the model's textual output.

    Args:
        prompt (str): Instruction or task description guiding the extraction
            (e.g., *"Read all visible text"* or *"Summarize the receipt"*).
        file (Optional[bytes]): Base64-encoded image bytes **without** the
            header/prefix. Must not be ``None`` for remote/API calls that
            require an image payload.
        file_ext (Optional[str]): File extension (e.g., ``"png"``, ``"jpg"``)
            used to infer the MIME type when required by the backend.
        **parameters (Dict[str, Any]): Additional backend-specific options
            forwarded to the implementation (e.g., timeouts, user tags,
            temperature, etc.).

    Returns:
        str: The extracted text or the model's textual response.

    Raises:
        ValueError: If ``file`` is ``None`` when required, or if the file
            type is unsupported by the implementation.
        RuntimeError: If the inference call fails or returns an unexpected
            response shape.
    """

OpenAIVisionModel

Bases: BaseModel

Implementation of BaseModel leveraging OpenAI's Chat Completions API.

Uses the client.chat.completions.create() method to send base64-encoded images along with text prompts in a single multimodal request.

Source code in src/splitter_mr/model/models/openai_model.py
class OpenAIVisionModel(BaseModel):
    """
    Implementation of BaseModel leveraging OpenAI's Chat Completions API.

    Uses the `client.chat.completions.create()` method to send base64-encoded
    images along with text prompts in a single multimodal request.
    """

    def __init__(
        self, api_key: Optional[str] = None, model_name: str = "gpt-4.1"
    ) -> None:
        """
        Initialize the OpenAIVisionModel.

        Args:
            api_key (str, optional): OpenAI API key. If not provided, uses the
                ``OPENAI_API_KEY`` environment variable.
            model_name (str): Vision-capable model name (e.g., ``"gpt-4.1"``).

        Raises:
            ValueError: If no API key is provided or ``OPENAI_API_KEY`` is not set.
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name

    def get_client(self) -> OpenAI:
        """
        Get the underlying OpenAI client instance.

        Returns:
            OpenAI: The initialized API client.
        """
        return self.client

    def extract_text(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        *,
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using OpenAI's Chat Completions API.

        Encodes the provided image bytes as a base64 data URI and sends it
        along with a textual prompt to the specified vision-capable model.
        The model processes the image and returns extracted text.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``,
                ``"jpeg"``, ``"webp"``, ``"gif"``) used to determine the MIME type.
                Defaults to ``"png"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the OpenAI client ``chat.completions.create()`` method. Consult documentation
                [here](https://platform.openai.com/docs/api-reference/chat/create).

        Returns:
            str: Extracted text returned by the model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            from splitter_mr.model import OpenAIVisionModel
            import base64

            model = OpenAIVisionModel(api_key="sk-...")
            with open("example.png", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")

            text = model.extract_text(img_b64, prompt="Describe the content of this image.")
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided for text extraction.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        payload_obj = ClientPayload(
            role="user",
            content=[
                ClientTextContent(type="text", text=prompt),
                ClientImageContent(
                    type="image_url",
                    image_url=ClientImageUrl(url=f"data:{mime_type};base64,{file}"),
                ),
            ],
        )
        payload = payload_obj.model_dump()

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[payload],
            **parameters,
        )
        return response.choices[0].message.content
__init__(api_key=None, model_name='gpt-4.1')

Initialize the OpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `api_key` | `str` | OpenAI API key. If not provided, uses the `OPENAI_API_KEY` environment variable. | `None` |
| `model_name` | `str` | Vision-capable model name (e.g., `"gpt-4.1"`). | `'gpt-4.1'` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If no API key is provided or `OPENAI_API_KEY` is not set. |

Source code in src/splitter_mr/model/models/openai_model.py
def __init__(
    self, api_key: Optional[str] = None, model_name: str = "gpt-4.1"
) -> None:
    """
    Initialize the OpenAIVisionModel.

    Args:
        api_key (str, optional): OpenAI API key. If not provided, uses the
            ``OPENAI_API_KEY`` environment variable.
        model_name (str): Vision-capable model name (e.g., ``"gpt-4.1"``).

    Raises:
        ValueError: If no API key is provided or ``OPENAI_API_KEY`` is not set.
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
get_client()

Get the underlying OpenAI client instance.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `OpenAI` | `OpenAI` | The initialized API client. |

Source code in src/splitter_mr/model/models/openai_model.py
def get_client(self) -> OpenAI:
    """
    Get the underlying OpenAI client instance.

    Returns:
        OpenAI: The initialized API client.
    """
    return self.client
extract_text(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)

Extract text from an image using OpenAI's Chat Completions API.

Encodes the provided image bytes as a base64 data URI and sends it along with a textual prompt to the specified vision-capable model. The model processes the image and returns extracted text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file` | `bytes` | Base64-encoded image content **without** the `data:image/...;base64,` prefix. Must not be `None`. | *required* |
| `prompt` | `str` | Instruction text guiding the extraction. | `DEFAULT_IMAGE_CAPTION_PROMPT` |
| `file_ext` | `str` | File extension (e.g., `"png"`, `"jpg"`, `"jpeg"`, `"webp"`, `"gif"`) used to determine the MIME type. | `'png'` |
| `**parameters` | `Any` | Additional keyword arguments passed directly to the OpenAI client `chat.completions.create()` method; see the [API reference](https://platform.openai.com/docs/api-reference/chat/create). | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | Extracted text returned by the model. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `file` is `None` or the file extension is not compatible. |
| `OpenAIError` | If the API request fails. |

Example
from splitter_mr.model import OpenAIVisionModel
import base64

model = OpenAIVisionModel(api_key="sk-...")
with open("example.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.extract_text(img_b64, prompt="Describe the content of this image.")
print(text)
Source code in src/splitter_mr/model/models/openai_model.py
def extract_text(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    *,
    file_ext: Optional[str] = "png",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using OpenAI's Chat Completions API.

    Encodes the provided image bytes as a base64 data URI and sends it
    along with a textual prompt to the specified vision-capable model.
    The model processes the image and returns extracted text.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``,
            ``"jpeg"``, ``"webp"``, ``"gif"``) used to determine the MIME type.
            Defaults to ``"png"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the OpenAI client ``chat.completions.create()`` method. Consult documentation
            [here](https://platform.openai.com/docs/api-reference/chat/create).

    Returns:
        str: Extracted text returned by the model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        from splitter_mr.model import OpenAIVisionModel
        import base64

        model = OpenAIVisionModel(api_key="sk-...")
        with open("example.png", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")

        text = model.extract_text(img_b64, prompt="Describe the content of this image.")
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided for text extraction.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    payload_obj = ClientPayload(
        role="user",
        content=[
            ClientTextContent(type="text", text=prompt),
            ClientImageContent(
                type="image_url",
                image_url=ClientImageUrl(url=f"data:{mime_type};base64,{file}"),
            ),
        ],
    )
    payload = payload_obj.model_dump()

    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=[payload],
        **parameters,
    )
    return response.choices[0].message.content
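
Because `**parameters` is forwarded verbatim to `chat.completions.create()`, any Chat Completions option can be set per call. Continuing the example above (option values are illustrative):

```python
text = model.extract_text(
    img_b64,
    prompt="Transcribe every line of visible text.",
    file_ext="png",
    temperature=0.0,  # forwarded directly to chat.completions.create()
    max_tokens=512,
)
```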

AzureOpenAIVisionModel

Bases: BaseModel

Implementation of BaseModel for Azure OpenAI Vision.

Uses the client.chat.completions.create() method of an Azure OpenAI deployment to send base64-encoded images along with text prompts in a single multimodal request.

Source code in src/splitter_mr/model/models/azure_openai_model.py
class AzureOpenAIVisionModel(BaseModel):
    """
    Implementation of BaseModel for Azure OpenAI Vision.

    Uses the `client.chat.completions.create()` method of an Azure OpenAI
    deployment to send base64-encoded images along with text prompts in a
    single multimodal request.
    """

    def __init__(
        self,
        api_key: str = None,
        azure_endpoint: str = None,
        azure_deployment: str = None,
        api_version: str = None,
    ) -> None:
        """
        Initializes the AzureOpenAIVisionModel.

        Args:
            api_key (str, optional): Azure OpenAI API key.
                If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
            azure_endpoint (str, optional): Azure endpoint.
                If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
            azure_deployment (str, optional): Azure deployment name.
                If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
            api_version (str, optional): API version string.
                If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.

        Raises:
            ValueError: If no connection details are provided or environment variables
                are not set.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
                )
        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )
        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided or 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
                )
        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment

    def get_client(self) -> AzureOpenAI:
        """Returns the AzureOpenAI client instance."""
        return self.client

    def extract_text(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using the Azure OpenAI Vision model.

        Encodes the given image as a data URI with an appropriate MIME type based on
        ``file_ext`` and sends it along with a prompt to the Azure OpenAI Vision API.
        The API processes the image and returns extracted text in the response.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
                used to determine the MIME type for the image. Defaults to ``"png"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the Azure OpenAI client ``chat.completions.create()`` method. Consult
                documentation [here](https://platform.openai.com/docs/api-reference/chat/create).

        Returns:
            str: The extracted text returned by the vision model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            model = AzureOpenAIVisionModel(...)
            with open("image.jpg", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")
            text = model.extract_text(img_b64, prompt="Describe this image", file_ext="jpg")
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided to be analyzed with the VLM.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        payload_obj = ClientPayload(
            role="user",
            content=[
                ClientTextContent(type="text", text=prompt),
                ClientImageContent(
                    type="image_url",
                    image_url=ClientImageUrl(url=f"data:{mime_type};base64,{file}"),
                ),
            ],
        )
        payload = payload_obj.model_dump()

        response = self.client.chat.completions.create(
            model=self.get_client()._azure_deployment,
            messages=[payload],
            **parameters,
        )
        return response.choices[0].message.content
__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None)

Initializes the AzureOpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `api_key` | `str` | Azure OpenAI API key. If not provided, uses the `AZURE_OPENAI_API_KEY` env var. | `None` |
| `azure_endpoint` | `str` | Azure endpoint. If not provided, uses the `AZURE_OPENAI_ENDPOINT` env var. | `None` |
| `azure_deployment` | `str` | Azure deployment name. If not provided, uses the `AZURE_OPENAI_DEPLOYMENT` env var. | `None` |
| `api_version` | `str` | API version string. If not provided, uses the `AZURE_OPENAI_API_VERSION` env var or defaults to `'2025-04-14-preview'`. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If no connection details are provided or environment variables are not set. |

Source code in src/splitter_mr/model/models/azure_openai_model.py
def __init__(
    self,
    api_key: str = None,
    azure_endpoint: str = None,
    azure_deployment: str = None,
    api_version: str = None,
) -> None:
    """
    Initializes the AzureOpenAIVisionModel.

    Args:
        api_key (str, optional): Azure OpenAI API key.
            If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
        azure_endpoint (str, optional): Azure endpoint.
            If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
        azure_deployment (str, optional): Azure deployment name.
            If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
        api_version (str, optional): API version string.
            If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.

    Raises:
        ValueError: If no connection details are provided or environment variables
            are not set.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
            )
    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )
    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided or 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
            )
    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
get_client()

Returns the AzureOpenAI client instance.

Source code in src/splitter_mr/model/models/azure_openai_model.py
def get_client(self) -> AzureOpenAI:
    """Returns the AzureOpenAI client instance."""
    return self.client
extract_text(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, file_ext='png', **parameters)

Extract text from an image using the Azure OpenAI Vision model.

Encodes the given image as a data URI with an appropriate MIME type based on file_ext and sends it along with a prompt to the Azure OpenAI Vision API. The API processes the image and returns extracted text in the response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `file` | `bytes` | Base64-encoded image content **without** the `data:image/...;base64,` prefix. Must not be `None`. | *required* |
| `prompt` | `str` | Instruction text guiding the extraction. | `DEFAULT_IMAGE_CAPTION_PROMPT` |
| `file_ext` | `str` | File extension (e.g., `"png"`, `"jpg"`) used to determine the MIME type for the image. | `'png'` |
| `**parameters` | `Any` | Additional keyword arguments passed directly to the Azure OpenAI client `chat.completions.create()` method; see the [API reference](https://platform.openai.com/docs/api-reference/chat/create). | `{}` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `str` | `str` | The extracted text returned by the vision model. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `file` is `None` or the file extension is not compatible. |
| `OpenAIError` | If the API request fails. |

Example
model = AzureOpenAIVisionModel(...)
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.extract_text(img_b64, prompt="Describe this image", file_ext="jpg")
print(text)
Source code in src/splitter_mr/model/models/azure_openai_model.py
def extract_text(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    file_ext: Optional[str] = "png",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using the Azure OpenAI Vision model.

    Encodes the given image as a data URI with an appropriate MIME type based on
    ``file_ext`` and sends it along with a prompt to the Azure OpenAI Vision API.
    The API processes the image and returns extracted text in the response.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
            used to determine the MIME type for the image. Defaults to ``"png"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the Azure OpenAI client ``chat.completions.create()`` method. Consult
            documentation [here](https://platform.openai.com/docs/api-reference/chat/create).

    Returns:
        str: The extracted text returned by the vision model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        model = AzureOpenAIVisionModel(...)
        with open("image.jpg", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")
        text = model.extract_text(img_b64, prompt="Describe this image", file_ext="jpg")
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided to be analyzed with the VLM.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    payload_obj = ClientPayload(
        role="user",
        content=[
            ClientTextContent(type="text", text=prompt),
            ClientImageContent(
                type="image_url",
                image_url=ClientImageUrl(url=f"data:{mime_type};base64,{file}"),
            ),
        ],
    )
    payload = payload_obj.model_dump()

    response = self.client.chat.completions.create(
        model=self.get_client()._azure_deployment,
        messages=[payload],
        **parameters,
    )
    return response.choices[0].message.content
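
As with the OpenAI variant, construction can rely entirely on environment variables. A sketch using the variable names documented above (file name is illustrative; assumes `AzureOpenAIVisionModel` is importable from `splitter_mr.model`):

```python
# Assumes AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and
# AZURE_OPENAI_DEPLOYMENT are set (optionally AZURE_OPENAI_API_VERSION).
import base64

from splitter_mr.model import AzureOpenAIVisionModel  # assumed export path

model = AzureOpenAIVisionModel()

with open("diagram.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

print(model.extract_text(img_b64, prompt="List the labeled components."))
```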