Visual Models

Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To address this, you can use visual language models (VLMs), which can recognize images and extract text or descriptions from them. This section documents the model module developed for that purpose; its implementations are based on the BaseModel class, presented below.

Which model should I use?

The choice of model depends on your cloud provider, available API keys, and desired level of integration. All models inherit from BaseModel and expose the same interface for extracting text and descriptions from images, so they are interchangeable (see the sketch after the table).

| Model | When to use | Requirements | Features |
|---|---|---|---|
| `OpenAIVisionModel` | You have an OpenAI API key and want to use the OpenAI cloud. | OpenAI account & API key | No Azure setup; easy to get started. |
| `AzureOpenAIVisionModel` | Your organization uses Azure OpenAI Services. | Azure OpenAI deployment, API key, endpoint | Integration with Azure; enterprise security. |
| `BaseModel` | Abstract base; not used directly. | — | Use as a template for building your own. |
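
Because every concrete model implements the same `BaseModel` interface, the two cloud models are drop-in replacements for one another. A minimal sketch, assuming `OPENAI_API_KEY` is set and an `example.png` file exists:

```python
import base64

from splitter_mr.model import OpenAIVisionModel

# Any BaseModel implementation exposes the same extract_text() method, so
# swapping in AzureOpenAIVisionModel requires no other code changes.
model = OpenAIVisionModel()  # reads OPENAI_API_KEY from the environment

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read())  # extract_text expects base64 data

print(model.extract_text(image_b64))
```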

Models

BaseModel

BaseModel

Bases: ABC

Source code in src/splitter_mr/model/base_model.py
class BaseModel(ABC):
    @abstractmethod
    def __init__(self, model_name: str) -> None:
        pass

    @abstractmethod
    def get_client(self) -> Any:
        pass

    @abstractmethod
    def extract_text(
        self, file: Optional[bytes], prompt: str, **parameters: Any
    ) -> str:
        """
        Extracts text from the provided image (base64-encoded string) using the prompt.
        """
        pass
extract_text(file, prompt, **parameters) abstractmethod

Extracts text from the provided image (base64-encoded string) using the prompt.

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def extract_text(
    self, file: Optional[bytes], prompt: str, **parameters: Any
) -> str:
    """
    Extracts text from the provided image (base64-encoded string) using the prompt.
    """
    pass
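
`BaseModel` is meant as a template for plugging in other providers. The following is a toy sketch of a custom implementation; the class and its behavior are invented here purely to illustrate the contract (the import path is assumed):

```python
from typing import Any, Optional

from splitter_mr.model import BaseModel  # assumed public import path


class EchoVisionModel(BaseModel):
    """Toy implementation that only illustrates the BaseModel contract."""

    def __init__(self, model_name: str = "echo") -> None:
        self.model_name = model_name
        self.client = None  # a real implementation would hold an SDK client

    def get_client(self) -> Any:
        return self.client

    def extract_text(
        self,
        file: Optional[bytes],
        prompt: str = "Extract the text from this resource.",
        **parameters: Any,
    ) -> str:
        # A real implementation would send `prompt` plus the base64 image in
        # `file` to a vision model; this stub just reports what it received.
        size = len(file) if file else 0
        return f"[{self.model_name}] prompt={prompt!r}, image_b64_len={size}"
```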

OpenAIVisionModel

OpenAIVisionModel

Bases: BaseModel

Implementation of BaseModel built on OpenAI's Chat Completions API.

Uses the `client.chat.completions.create()` method to send base64-encoded images along with text prompts in a single multimodal request.

Source code in src/splitter_mr/model/models/openai_model.py
class OpenAIVisionModel(BaseModel):
    """
    Implementation of BaseModel built on OpenAI's Chat Completions API.

    Uses the `client.chat.completions.create()` method to send base64-encoded
    images along with text prompts in a single multimodal request.
    """

    def __init__(self, api_key: Optional[str] = None, model_name: str = "gpt-4.1"):
        """
        Initializes the OpenAIVisionModel.

        Args:
            api_key (str, optional): OpenAI API key. If not provided, uses environment variable 'OPENAI_API_KEY'.
            model_name (str): Vision-capable model name (e.g., "gpt-4.1").
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided and 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name

    def get_client(self) -> OpenAI:
        """Returns the underlying OpenAI client instance."""
        return self.client

    def extract_text(
        self,
        file: Optional[bytes],
        prompt: str = "Extract the text from this resource in the original language. Return the result in markdown code format.",
        **parameters: Any,
    ) -> str:
        """
        Extracts text from a base64-encoded image using OpenAI's Chat Completions API.

        Args:
            file (bytes): Base64-encoded image.
            prompt (str): Instructions for text extraction.
            **parameters: Additional parameters for `client.chat.completions.create()`.

        Returns:
            str: The extracted text from the image.

        Example:
            ```python
            import base64

            from splitter_mr.model import OpenAIVisionModel

            # Initialize with your OpenAI API key (set as env variable or pass directly)
            model = OpenAIVisionModel(api_key="sk-...")

            with open("example.png", "rb") as f:
                image_b64 = base64.b64encode(f.read())

            markdown = model.extract_text(image_b64)
            print(markdown)
            ```
            ```python
            This picture shows ...
            ```
        """
        # `file` holds base64 data; decode bytes to str so the data URL is not
        # rendered as "b'...'".
        if isinstance(file, bytes):
            file = file.decode("utf-8")
        payload = {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{file}"},
                },
            ],
        }
        response = self.client.chat.completions.create(
            model=self.model_name, messages=[payload], **parameters
        )
        return response.choices[0].message.content
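
Keyword arguments are forwarded verbatim to `client.chat.completions.create()`, so standard Chat Completions parameters can be tuned per call. A short sketch (the parameter values here are illustrative, not recommendations):

```python
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()  # reads OPENAI_API_KEY from the environment

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read())

# temperature and max_tokens are standard Chat Completions parameters,
# passed through via **parameters.
text = model.extract_text(
    image_b64,
    prompt="Transcribe any visible text, preserving the layout in Markdown.",
    temperature=0.0,
    max_tokens=1024,
)
print(text)
```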
__init__(api_key=None, model_name='gpt-4.1')

Initializes the OpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str`, optional | OpenAI API key. If not provided, uses environment variable `OPENAI_API_KEY`. | `None` |
| `model_name` | `str` | Vision-capable model name (e.g., `"gpt-4.1"`). | `'gpt-4.1'` |
Source code in src/splitter_mr/model/models/openai_model.py
def __init__(self, api_key: Optional[str] = None, model_name: str = "gpt-4.1"):
    """
    Initializes the OpenAIVisionModel.

    Args:
        api_key (str, optional): OpenAI API key. If not provided, uses environment variable 'OPENAI_API_KEY'.
        model_name (str): Vision-capable model name (e.g., "gpt-4.1").
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided and 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
get_client()

Returns the underlying OpenAI client instance.

Source code in src/splitter_mr/model/models/openai_model.py
def get_client(self) -> OpenAI:
    """Returns the underlying OpenAI client instance."""
    return self.client
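
The returned object is the regular `openai.OpenAI` client, so it can be reused for calls outside this wrapper. A brief sketch:

```python
from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()  # reads OPENAI_API_KEY from the environment
client = model.get_client()

# Reuse the same authenticated client for an ordinary text-only request.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```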
extract_text(file, prompt='Extract the text from this resource in the original language. Return the result in markdown code format.', **parameters)

Extracts text from a base64-encoded image using OpenAI's Chat Completions API.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | Base64-encoded image. | *required* |
| `prompt` | `str` | Instructions for text extraction. | `'Extract the text from this resource in the original language. Return the result in markdown code format.'` |
| `**parameters` | `Any` | Additional parameters for `client.chat.completions.create()`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `str` | The extracted text from the image. |

Example

import base64

from splitter_mr.model import OpenAIVisionModel

# Initialize with your OpenAI API key (set as env variable or pass directly)
model = OpenAIVisionModel(api_key="sk-...")

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read())

markdown = model.extract_text(image_b64)
print(markdown)
This picture shows ...

Source code in src/splitter_mr/model/models/openai_model.py
def extract_text(
    self,
    file: Optional[bytes],
    prompt: str = "Extract the text from this resource in the original language. Return the result in markdown code format.",
    **parameters: Any,
) -> str:
    """
    Extracts text from a base64-encoded image using OpenAI's Chat Completions API.

    Args:
        file (bytes): Base64-encoded image.
        prompt (str): Instructions for text extraction.
        **parameters: Additional parameters for `client.chat.completions.create()`.

    Returns:
        str: The extracted text from the image.

    Example:
        ```python
        import base64

        from splitter_mr.model import OpenAIVisionModel

        # Initialize with your OpenAI API key (set as env variable or pass directly)
        model = OpenAIVisionModel(api_key="sk-...")

        with open("example.png", "rb") as f:
            image_b64 = base64.b64encode(f.read())

        markdown = model.extract_text(image_b64)
        print(markdown)
        ```
        ```python
        This picture shows ...
        ```
    """
    # `file` holds base64 data; decode bytes to str so the data URL is not
    # rendered as "b'...'".
    if isinstance(file, bytes):
        file = file.decode("utf-8")
    payload = {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{file}"},
            },
        ],
    }
    response = self.client.chat.completions.create(
        model=self.model_name, messages=[payload], **parameters
    )
    return response.choices[0].message.content
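
The default prompt asks for a transcription, but `prompt` can be overridden to request a description instead, which is how image captions for documents are typically produced. A minimal sketch, assuming `OPENAI_API_KEY` is set (the input file name is hypothetical):

```python
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()  # reads OPENAI_API_KEY from the environment

with open("diagram.png", "rb") as f:  # hypothetical input image
    image_b64 = base64.b64encode(f.read())

# Override the default transcription prompt to get a description instead.
description = model.extract_text(
    image_b64,
    prompt="Describe this image in one sentence, for use as alt text.",
)
print(description)
```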

AzureOpenAIVisionModel

AzureOpenAIVisionModel

Bases: BaseModel

Implementation of BaseModel for Azure OpenAI Vision using the Chat Completions API.

Sends base64-encoded images and text prompts to an Azure OpenAI deployment in a single multimodal request.

Source code in src/splitter_mr/model/models/azure_openai_model.py
class AzureOpenAIVisionModel(BaseModel):
    """
    Implementation of BaseModel for Azure OpenAI Vision using the Chat Completions API.

    Sends base64-encoded images and text prompts to an Azure OpenAI
    deployment in a single multimodal request.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        azure_endpoint: Optional[str] = None,
        azure_deployment: Optional[str] = None,
        api_version: Optional[str] = None,
    ):
        """
        Initializes the AzureOpenAIVisionModel.

        Args:
            api_key (str, optional): Azure OpenAI API key.
                If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
            azure_endpoint (str, optional): Azure endpoint.
                If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
            azure_deployment (str, optional): Azure deployment name.
                If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
            api_version (str, optional): API version string.
                If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided and 'AZURE_OPENAI_API_KEY' env var is not set."
                )
        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided and 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )
        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided and 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
                )
        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment

    def get_client(self) -> AzureOpenAI:
        """Returns the AzureOpenAI client instance."""
        return self.client

    def extract_text(
        self,
        file: Optional[bytes],
        prompt: str = "Extract the text from this resource in the original language. Return the result in markdown code format.",
        **parameters: Any,
    ) -> str:
        """
        Extracts text from a base64-encoded image using Azure OpenAI's Chat Completions API.

        Args:
            file (bytes): Base64-encoded image.
            prompt (str): Instruction prompt for text extraction.
            **parameters: Extra params passed to `client.chat.completions.create()`.

        Returns:
            str: Extracted text from the image.

        Example:
            ```python
            import base64

            from splitter_mr.model import AzureOpenAIVisionModel

            # Ensure required Azure environment variables are set, or pass parameters directly
            model = AzureOpenAIVisionModel(
                api_key="...",
                azure_endpoint="https://...azure.com/",
                azure_deployment="deployment-name"
            )

            with open("example.png", "rb") as f:
                image_b64 = base64.b64encode(f.read())

            markdown = model.extract_text(image_b64)
            print(markdown)
            ```
        """
        # `file` holds base64 data; decode bytes to str so the data URL is not
        # rendered as "b'...'".
        if isinstance(file, bytes):
            file = file.decode("utf-8")
        payload = {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{file}"},
                },
            ],
        }
        response = self.client.chat.completions.create(
            model=self.model_name, messages=[payload], **parameters
        )
        return response.choices[0].message.content
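
All four constructor arguments fall back to environment variables, so credentials can be kept out of code entirely. A minimal sketch (the values are placeholders; in practice, set them in your shell or secret store rather than in code):

```python
import base64
import os

from splitter_mr.model import AzureOpenAIVisionModel

# Placeholder values for illustration only.
os.environ["AZURE_OPENAI_API_KEY"] = "..."
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://my-resource.openai.azure.com/"
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "my-gpt-4.1-deployment"

model = AzureOpenAIVisionModel()  # everything is read from the environment

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read())

print(model.extract_text(image_b64))
```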
__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None)

Initializes the AzureOpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str`, optional | Azure OpenAI API key. If not provided, uses `AZURE_OPENAI_API_KEY` env var. | `None` |
| `azure_endpoint` | `str`, optional | Azure endpoint. If not provided, uses `AZURE_OPENAI_ENDPOINT` env var. | `None` |
| `azure_deployment` | `str`, optional | Azure deployment name. If not provided, uses `AZURE_OPENAI_DEPLOYMENT` env var. | `None` |
| `api_version` | `str`, optional | API version string. If not provided, uses `AZURE_OPENAI_API_VERSION` env var or defaults to `'2025-04-14-preview'`. | `None` |
Source code in src/splitter_mr/model/models/azure_openai_model.py
def __init__(
    self,
    api_key: Optional[str] = None,
    azure_endpoint: Optional[str] = None,
    azure_deployment: Optional[str] = None,
    api_version: Optional[str] = None,
):
    """
    Initializes the AzureOpenAIVisionModel.

    Args:
        api_key (str, optional): Azure OpenAI API key.
            If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
        azure_endpoint (str, optional): Azure endpoint.
            If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
        azure_deployment (str, optional): Azure deployment name.
            If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
        api_version (str, optional): API version string.
            If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided and 'AZURE_OPENAI_API_KEY' env var is not set."
            )
    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided and 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )
    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided and 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
            )
    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
get_client()

Returns the AzureOpenAI client instance.

Source code in src/splitter_mr/model/models/azure_openai_model.py
def get_client(self) -> AzureOpenAI:
    """Returns the AzureOpenAI client instance."""
    return self.client
extract_text(file, prompt='Extract the text from this resource in the original language. Return the result in markdown code format.', **parameters)

Extracts text from a base64-encoded image using Azure OpenAI's Chat Completions API.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | Base64-encoded image. | *required* |
| `prompt` | `str` | Instruction prompt for text extraction. | `'Extract the text from this resource in the original language. Return the result in markdown code format.'` |
| `**parameters` | `Any` | Extra params passed to `client.chat.completions.create()`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `str` | Extracted text from the image. |

Example

import base64

from splitter_mr.model import AzureOpenAIVisionModel

# Ensure required Azure environment variables are set, or pass parameters directly
model = AzureOpenAIVisionModel(
    api_key="...",
    azure_endpoint="https://...azure.com/",
    azure_deployment="deployment-name"
)

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read())

markdown = model.extract_text(image_b64)
print(markdown)
Source code in src/splitter_mr/model/models/azure_openai_model.py
def extract_text(
    self,
    file: Optional[bytes],
    prompt: str = "Extract the text from this resource in the original language. Return the result in markdown code format.",
    **parameters: Any,
) -> str:
    """
    Extracts text from a base64-encoded image using Azure OpenAI's Chat Completions API.

    Args:
        file (bytes): Base64-encoded image.
        prompt (str): Instruction prompt for text extraction.
        **parameters: Extra params passed to `client.chat.completions.create()`.

    Returns:
        str: Extracted text from the image.

    Example:
        ```python
        import base64

        from splitter_mr.model import AzureOpenAIVisionModel

        # Ensure required Azure environment variables are set, or pass parameters directly
        model = AzureOpenAIVisionModel(
            api_key="...",
            azure_endpoint="https://...azure.com/",
            azure_deployment="deployment-name"
        )

        with open("example.png", "rb") as f:
            image_b64 = base64.b64encode(f.read())

        markdown = model.extract_text(image_b64)
        print(markdown)
        ```
    """
    # `file` holds base64 data; decode bytes to str so the data URL is not
    # rendered as "b'...'".
    if isinstance(file, bytes):
        file = file.decode("utf-8")
    payload = {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{file}"},
            },
        ],
    }
    response = self.client.chat.completions.create(
        model=self.model_name, messages=[payload], **parameters
    )
    return response.choices[0].message.content