
Vision Models

Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To handle this, you can use visual language models (VLMs), which can recognize images and extract textual descriptions from them. For this purpose, the library provides a model module whose implementations are built on the BaseVisionModel class, presented below.

Which model should I use?

The choice of model depends on your cloud provider, available API keys, and desired level of integration. All models inherit from BaseVisionModel and expose the same interface for extracting text and descriptions from images; a quick-start sketch follows the table below.

| Model | When to use | Requirements | Features |
|---|---|---|---|
| OpenAIVisionModel | If you have an OpenAI API key and want to use the OpenAI cloud | OPENAI_API_KEY (optional: OPENAI_MODEL, defaults to "gpt-4o") | Simple setup; standard OpenAI chat API |
| AzureOpenAIVisionModel | For Azure OpenAI Services users | AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT, AZURE_OPENAI_API_VERSION | Integrates with Azure; enterprise controls |
| GrokVisionModel | If you have access to xAI's Grok multimodal model | XAI_API_KEY (optional: XAI_MODEL, defaults to "grok-4") | Supports data URIs; optional image quality (detail) setting |
| GeminiVisionModel | If you want Google's Gemini Vision models | GEMINI_API_KEY + multimodal extra: pip install 'splitter-mr[multimodal]' | Google Gemini API; multimodal; high-quality extraction |
| HuggingFaceVisionModel | Local/open-source/offline inference | Multimodal extra: pip install 'splitter-mr[multimodal]' (optional: HF_ACCESS_TOKEN for models that require it) | Runs locally; uses HF AutoProcessor + chat templates |
| AnthropicVisionModel | If you have an Anthropic key and want Claude Vision | ANTHROPIC_API_KEY (optional: ANTHROPIC_MODEL, defaults to "claude-sonnet-4-20250514") | Uses OpenAI SDK with Anthropic base URL; data-URI (base64) image input; OpenAI-compatible chat.completions |
| BaseVisionModel | Abstract base, not used directly | — | Template to build your own adapters |
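
As a quick start, the sketch below instantiates one of the adapters from the table and asks it to describe an image. It is a minimal sketch, assuming OPENAI_API_KEY is exported and using a placeholder file name; any other adapter from the table can be swapped in, since they all expose analyze_content.

```python
# Minimal quick-start sketch (assumes OPENAI_API_KEY is set; "invoice.png" is a placeholder).
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()  # picks up OPENAI_API_KEY (and optionally OPENAI_MODEL)

with open("invoice.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")  # base64 WITHOUT the data: prefix

print(model.analyze_content(img_b64, prompt="Describe the images in this page."))
```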

Models

BaseVisionModel

BaseVisionModel

Bases: ABC

Abstract base for vision models that extract text from images.

Subclasses encapsulate local or API-backed implementations (e.g., OpenAI, Azure OpenAI, or on-device models). Implementations should handle encoding, request construction, and response parsing while exposing a uniform interface for clients of the library.

Source code in src/splitter_mr/model/base_model.py
class BaseVisionModel(ABC):
    """
    Abstract base for vision models that extract text from images.

    Subclasses encapsulate local or API-backed implementations (e.g., OpenAI,
    Azure OpenAI, or on-device models). Implementations should handle encoding,
    request construction, and response parsing while exposing a uniform
    interface for clients of the library.
    """

    @abstractmethod
    def __init__(self, model_name) -> Any:
        """Initialize the model.

        Args:
            model_name (Any): Identifier of the underlying model. For hosted APIs
                this could be a model name or deployment name; for local models,
                it could be a path or configuration object.

        Raises:
            ValueError: If required configuration or credentials are missing.
        """

    @abstractmethod
    def get_client(self) -> Any:
        """Return the underlying client or handle.

        Returns:
            Any: A client/handle that the implementation uses to perform
                inference (e.g., an SDK client instance, session object, or
                lightweight wrapper). May be ``None`` for pure-local implementations.
        """

    @abstractmethod
    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str],
        **parameters: Dict[str, Any],
    ) -> str:
        """Extract text from an image using the provided prompt.

        Encodes the image (provided as base64 **without** the
        ``data:<mime>;base64,`` prefix), sends it with an instruction prompt to
        the underlying vision model, and returns the model's textual output.

        Args:
            prompt (str): Instruction or task description guiding the extraction
                (e.g., *"Read all visible text"* or *"Summarize the receipt"*).
            file (Optional[bytes]): Base64-encoded image bytes **without** the
                header/prefix. Must not be ``None`` for remote/API calls that
                require an image payload.
            file_ext (Optional[str]): File extension (e.g., ``"png"``, ``"jpg"``)
                used to infer the MIME type when required by the backend.
            **parameters (Dict[str, Any]): Additional backend-specific options
                forwarded to the implementation (e.g., timeouts, user tags,
                temperature, etc.).

        Returns:
            str: The extracted text or the model's textual response.

        Raises:
            ValueError: If ``file`` is ``None`` when required, or if the file
                type is unsupported by the implementation.
            RuntimeError: If the inference call fails or returns an unexpected
                response shape.
        """
__init__(model_name) abstractmethod

Initialize the model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_name | Any | Identifier of the underlying model. For hosted APIs this could be a model name or deployment name; for local models, it could be a path or configuration object. | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If required configuration or credentials are missing. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def __init__(self, model_name) -> Any:
    """Initialize the model.

    Args:
        model_name (Any): Identifier of the underlying model. For hosted APIs
            this could be a model name or deployment name; for local models,
            it could be a path or configuration object.

    Raises:
        ValueError: If required configuration or credentials are missing.
    """
analyze_content(prompt, file, file_ext, **parameters) abstractmethod

Extract text from an image using the provided prompt.

Encodes the image (provided as base64 without the data:<mime>;base64, prefix), sends it with an instruction prompt to the underlying vision model, and returns the model's textual output.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| prompt | str | Instruction or task description guiding the extraction (e.g., "Read all visible text" or "Summarize the receipt"). | required |
| file | Optional[bytes] | Base64-encoded image bytes without the header/prefix. Must not be None for remote/API calls that require an image payload. | required |
| file_ext | Optional[str] | File extension (e.g., "png", "jpg") used to infer the MIME type when required by the backend. | required |
| **parameters | Dict[str, Any] | Additional backend-specific options forwarded to the implementation (e.g., timeouts, user tags, temperature, etc.). | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The extracted text or the model's textual response. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None when required, or if the file type is unsupported by the implementation. |
| RuntimeError | If the inference call fails or returns an unexpected response shape. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def analyze_content(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str],
    **parameters: Dict[str, Any],
) -> str:
    """Extract text from an image using the provided prompt.

    Encodes the image (provided as base64 **without** the
    ``data:<mime>;base64,`` prefix), sends it with an instruction prompt to
    the underlying vision model, and returns the model's textual output.

    Args:
        prompt (str): Instruction or task description guiding the extraction
            (e.g., *"Read all visible text"* or *"Summarize the receipt"*).
        file (Optional[bytes]): Base64-encoded image bytes **without** the
            header/prefix. Must not be ``None`` for remote/API calls that
            require an image payload.
        file_ext (Optional[str]): File extension (e.g., ``"png"``, ``"jpg"``)
            used to infer the MIME type when required by the backend.
        **parameters (Dict[str, Any]): Additional backend-specific options
            forwarded to the implementation (e.g., timeouts, user tags,
            temperature, etc.).

    Returns:
        str: The extracted text or the model's textual response.

    Raises:
        ValueError: If ``file`` is ``None`` when required, or if the file
            type is unsupported by the implementation.
        RuntimeError: If the inference call fails or returns an unexpected
            response shape.
    """
get_client() abstractmethod

Return the underlying client or handle.

Returns:

| Name | Type | Description |
|---|---|---|
| Any | Any | A client/handle that the implementation uses to perform inference (e.g., an SDK client instance, session object, or lightweight wrapper). May be None for pure-local implementations. |

Source code in src/splitter_mr/model/base_model.py
@abstractmethod
def get_client(self) -> Any:
    """Return the underlying client or handle.

    Returns:
        Any: A client/handle that the implementation uses to perform
            inference (e.g., an SDK client instance, session object, or
            lightweight wrapper). May be ``None`` for pure-local implementations.
    """

OpenAIVisionModel


OpenAIVisionModel

Bases: BaseVisionModel

Implementation of BaseModel leveraging OpenAI's Chat Completions API.

Uses the client.chat.completions.create() method to send base64-encoded images along with text prompts in a single multimodal request.

Source code in src/splitter_mr/model/models/openai_model.py
class OpenAIVisionModel(BaseVisionModel):
    """
    Implementation of BaseModel leveraging OpenAI's Chat Completions API.

    Uses the `client.chat.completions.create()` method to send base64-encoded
    images along with text prompts in a single multimodal request.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        model_name: str = os.getenv("OPENAI_MODEL", "gpt-4o"),
    ) -> None:
        """
        Initialize the OpenAIVisionModel.

        Args:
            api_key (str, optional): OpenAI API key. If not provided, uses the
                ``OPENAI_API_KEY`` environment variable.
            model_name (str): Vision-capable model name (e.g., ``"gpt-4o"``).

        Raises:
            ValueError: If no API key is provided or ``OPENAI_API_KEY`` is not set.
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name

    def get_client(self) -> OpenAI:
        """
        Get the underlying OpenAI client instance.

        Returns:
            OpenAI: The initialized API client.
        """
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        *,
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using OpenAI's Chat Completions API.

        Encodes the provided image bytes as a base64 data URI and sends it
        along with a textual prompt to the specified vision-capable model.
        The model processes the image and returns extracted text.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``,
                ``"jpeg"``, ``"webp"``, ``"gif"``) used to determine the MIME type.
                Defaults to ``"png"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the OpenAI client ``chat.completions.create()`` method. Consult documentation
                [here](https://platform.openai.com/docs/api-reference/chat/create).

        Returns:
            str: Extracted text returned by the model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            from splitter_mr.model import OpenAIVisionModel
            import base64

            model = OpenAIVisionModel(api_key="sk-...")
            with open("example.png", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")

            text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided for text extraction.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}"
                    ),
                ),
            ],
        )
        payload = payload_obj.model_dump(exclude_none=True)

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[payload],
            **parameters,
        )
        return response.choices[0].message.content
__init__(api_key=None, model_name=os.getenv('OPENAI_MODEL', 'gpt-4o'))

Initialize the OpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | OpenAI API key. If not provided, uses the OPENAI_API_KEY environment variable. | None |
| model_name | str | Vision-capable model name (e.g., "gpt-4o"). | getenv('OPENAI_MODEL', 'gpt-4o') |

Raises:

| Type | Description |
|---|---|
| ValueError | If no API key is provided or OPENAI_API_KEY is not set. |

Source code in src/splitter_mr/model/models/openai_model.py
def __init__(
    self,
    api_key: Optional[str] = None,
    model_name: str = os.getenv("OPENAI_MODEL", "gpt-4o"),
) -> None:
    """
    Initialize the OpenAIVisionModel.

    Args:
        api_key (str, optional): OpenAI API key. If not provided, uses the
            ``OPENAI_API_KEY`` environment variable.
        model_name (str): Vision-capable model name (e.g., ``"gpt-4o"``).

    Raises:
        ValueError: If no API key is provided or ``OPENAI_API_KEY`` is not set.
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)

Extract text from an image using OpenAI's Chat Completions API.

Encodes the provided image bytes as a base64 data URI and sends it along with a textual prompt to the specified vision-capable model. The model processes the image and returns extracted text.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | bytes | Base64-encoded image content without the data:image/...;base64, prefix. Must not be None. | required |
| prompt | str | Instruction text guiding the extraction. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT. | DEFAULT_IMAGE_CAPTION_PROMPT |
| file_ext | str | File extension (e.g., "png", "jpg", "jpeg", "webp", "gif") used to determine the MIME type. Defaults to "png". | 'png' |
| **parameters | Any | Additional keyword arguments passed directly to the OpenAI client chat.completions.create() method. Consult the documentation [here](https://platform.openai.com/docs/api-reference/chat/create). | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Extracted text returned by the model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None or the file extension is not compatible. |
| OpenAIError | If the API request fails. |

Example
from splitter_mr.model import OpenAIVisionModel
import base64

model = OpenAIVisionModel(api_key="sk-...")
with open("example.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
print(text)
Source code in src/splitter_mr/model/models/openai_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    *,
    file_ext: Optional[str] = "png",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using OpenAI's Chat Completions API.

    Encodes the provided image bytes as a base64 data URI and sends it
    along with a textual prompt to the specified vision-capable model.
    The model processes the image and returns extracted text.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``,
            ``"jpeg"``, ``"webp"``, ``"gif"``) used to determine the MIME type.
            Defaults to ``"png"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the OpenAI client ``chat.completions.create()`` method. Consult documentation
            [here](https://platform.openai.com/docs/api-reference/chat/create).

    Returns:
        str: Extracted text returned by the model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        from splitter_mr.model import OpenAIVisionModel
        import base64

        model = OpenAIVisionModel(api_key="sk-...")
        with open("example.png", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")

        text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided for text extraction.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}"
                ),
            ),
        ],
    )
    payload = payload_obj.model_dump(exclude_none=True)

    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=[payload],
        **parameters,
    )
    return response.choices[0].message.content
get_client()

Get the underlying OpenAI client instance.

Returns:

| Name | Type | Description |
|---|---|---|
| OpenAI | OpenAI | The initialized API client. |

Source code in src/splitter_mr/model/models/openai_model.py
def get_client(self) -> OpenAI:
    """
    Get the underlying OpenAI client instance.

    Returns:
        OpenAI: The initialized API client.
    """
    return self.client
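
Because **parameters is forwarded verbatim to chat.completions.create(), standard OpenAI options can be passed straight through analyze_content. A hedged sketch follows (file name and option values are illustrative; it assumes OPENAI_API_KEY is set):

```python
# Sketch: forwarding chat.completions options through **parameters.
# Assumes OPENAI_API_KEY is set and "diagram.jpg" exists.
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel()

with open("diagram.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(
    img_b64,
    prompt="List the labels that appear in this diagram.",
    file_ext="jpg",   # selects the image/jpeg MIME type
    max_tokens=300,   # forwarded to chat.completions.create()
    temperature=0.0,  # forwarded as well
)
print(text)
```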

AzureOpenAIVisionModel


AzureOpenAIVisionModel

Bases: BaseVisionModel

Implementation of BaseModel for Azure OpenAI Vision using the Responses API.

Utilizes Azure’s preview responses API, which supports base64-encoded images and stateful multimodal calls.

Source code in src/splitter_mr/model/models/azure_openai_model.py
class AzureOpenAIVisionModel(BaseVisionModel):
    """
    Implementation of BaseModel for Azure OpenAI Vision using the Responses API.

    Utilizes Azure’s preview `responses` API, which supports
    base64-encoded images and stateful multimodal calls.
    """

    def __init__(
        self,
        api_key: str = None,
        azure_endpoint: str = None,
        azure_deployment: str = None,
        api_version: str = None,
    ) -> None:
        """
        Initializes the AzureOpenAIVisionModel.

        Args:
            api_key (str, optional): Azure OpenAI API key.
                If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
            azure_endpoint (str, optional): Azure endpoint.
                If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
            azure_deployment (str, optional): Azure deployment name.
                If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
            api_version (str, optional): API version string.
                If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.

        Raises:
            ValueError: If no connection details are provided or environment variables
                are not set.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
                )
        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )
        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided or 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
                )
        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment

    def get_client(self) -> AzureOpenAI:
        """Returns the AzureOpenAI client instance."""
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using the Azure OpenAI Vision model.

        Encodes the given image as a data URI with an appropriate MIME type based on
        ``file_ext`` and sends it along with a prompt to the Azure OpenAI Vision API.
        The API processes the image and returns extracted text in the response.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
                used to determine the MIME type for the image. Defaults to ``"png"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the Azure OpenAI client ``chat.completions.create()`` method. Consult
                documentation [here](https://platform.openai.com/docs/api-reference/chat/create).

        Returns:
            str: The extracted text returned by the vision model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            model = AzureOpenAIVisionModel(...)
            with open("image.jpg", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")
            text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided to be analyzed with the VLM.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}"
                    ),
                ),
            ],
        )
        payload = payload_obj.model_dump(exclude_none=True)

        response = self.client.chat.completions.create(
            model=self.get_client()._azure_deployment,
            messages=[payload],
            **parameters,
        )
        return response.choices[0].message.content
__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None)

Initializes the AzureOpenAIVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | Azure OpenAI API key. If not provided, uses the 'AZURE_OPENAI_API_KEY' env var. | None |
| azure_endpoint | str | Azure endpoint. If not provided, uses the 'AZURE_OPENAI_ENDPOINT' env var. | None |
| azure_deployment | str | Azure deployment name. If not provided, uses the 'AZURE_OPENAI_DEPLOYMENT' env var. | None |
| api_version | str | API version string. If not provided, uses the 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'. | None |

Raises:

| Type | Description |
|---|---|
| ValueError | If no connection details are provided or environment variables are not set. |

Source code in src/splitter_mr/model/models/azure_openai_model.py
def __init__(
    self,
    api_key: str = None,
    azure_endpoint: str = None,
    azure_deployment: str = None,
    api_version: str = None,
) -> None:
    """
    Initializes the AzureOpenAIVisionModel.

    Args:
        api_key (str, optional): Azure OpenAI API key.
            If not provided, uses 'AZURE_OPENAI_API_KEY' env var.
        azure_endpoint (str, optional): Azure endpoint.
            If not provided, uses 'AZURE_OPENAI_ENDPOINT' env var.
        azure_deployment (str, optional): Azure deployment name.
            If not provided, uses 'AZURE_OPENAI_DEPLOYMENT' env var.
        api_version (str, optional): API version string.
            If not provided, uses 'AZURE_OPENAI_API_VERSION' env var or defaults to '2025-04-14-preview'.

    Raises:
        ValueError: If no connection details are provided or environment variables
            are not set.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
            )
    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )
    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided or 'AZURE_OPENAI_DEPLOYMENT' env var is not set."
            )
    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, file_ext='png', **parameters)

Extract text from an image using the Azure OpenAI Vision model.

Encodes the given image as a data URI with an appropriate MIME type based on file_ext and sends it along with a prompt to the Azure OpenAI Vision API. The API processes the image and returns extracted text in the response.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | bytes | Base64-encoded image content without the data:image/...;base64, prefix. Must not be None. | required |
| prompt | str | Instruction text guiding the extraction. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT. | DEFAULT_IMAGE_CAPTION_PROMPT |
| file_ext | str | File extension (e.g., "png", "jpg") used to determine the MIME type for the image. Defaults to "png". | 'png' |
| **parameters | Any | Additional keyword arguments passed directly to the Azure OpenAI client chat.completions.create() method. Consult the documentation [here](https://platform.openai.com/docs/api-reference/chat/create). | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The extracted text returned by the vision model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None or the file extension is not compatible. |
| OpenAIError | If the API request fails. |

Example
model = AzureOpenAIVisionModel(...)
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
print(text)
Source code in src/splitter_mr/model/models/azure_openai_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    file_ext: Optional[str] = "png",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using the Azure OpenAI Vision model.

    Encodes the given image as a data URI with an appropriate MIME type based on
    ``file_ext`` and sends it along with a prompt to the Azure OpenAI Vision API.
    The API processes the image and returns extracted text in the response.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
            used to determine the MIME type for the image. Defaults to ``"png"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the Azure OpenAI client ``chat.completions.create()`` method. Consult
            documentation [here](https://platform.openai.com/docs/api-reference/chat/create).

    Returns:
        str: The extracted text returned by the vision model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        model = AzureOpenAIVisionModel(...)
        with open("image.jpg", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")
        text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided to be analyzed with the VLM.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}"
                ),
            ),
        ],
    )
    payload = payload_obj.model_dump(exclude_none=True)

    response = self.client.chat.completions.create(
        model=self.get_client()._azure_deployment,
        messages=[payload],
        **parameters,
    )
    return response.choices[0].message.content
get_client()

Returns the AzureOpenAI client instance.

Source code in src/splitter_mr/model/models/azure_openai_model.py
def get_client(self) -> AzureOpenAI:
    """Returns the AzureOpenAI client instance."""
    return self.client
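
In practice, the Azure adapter is usually configured entirely from environment variables. A minimal sketch, assuming the variables listed above are already exported and that "scan.png" exists:

```python
# Sketch: AzureOpenAIVisionModel configured purely from environment variables
# (AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT,
#  and optionally AZURE_OPENAI_API_VERSION).
import base64

from splitter_mr.model import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel()  # all connection details resolved from env vars

with open("scan.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

print(model.analyze_content(img_b64, prompt="Transcribe any visible text."))
```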

GrokVisionModel


GrokVisionModel

Bases: BaseVisionModel

Implementation of BaseModel for Grok Vision using the xAI API.

Provides methods to interact with Grok’s multimodal models that support base64-encoded images and natural language instructions. This class is designed to extract structured text descriptions or captions from images.

Source code in src/splitter_mr/model/models/grok_model.py
class GrokVisionModel(BaseVisionModel):
    """
    Implementation of BaseModel for Grok Vision using the xAI API.

    Provides methods to interact with Grok’s multimodal models that support
    base64-encoded images and natural language instructions. This class is
    designed to extract structured text descriptions or captions from images.
    """

    def __init__(
        self,
        api_key: Optional[str] = os.getenv("XAI_API_KEY"),
        model_name: str = os.getenv("XAI_MODEL", "grok-4"),
    ) -> None:
        """
        Initializes the GrokVisionModel.

        Args:
            api_key (str, optional): Grok API key. If not provided, uses the
                ``XAI_API_KEY`` environment variable.
            model_name (str, optional): Model identifier to use. If not provided,
                defaults to ``XAI_MODEL`` environment variable or ``"grok-4"``.

        Raises:
            ValueError: If ``api_key`` is not provided or cannot be resolved
                from environment variables.
        """
        api_key = api_key or os.getenv("XAI_API_KEY")
        model_name = model_name or os.getenv("XAI_MODEL") or "grok-4"

        if not api_key:
            raise ValueError(
                "Grok API key not provided or 'XAI_API_KEY' env var is not set."
            )

        self.model_name = model_name
        self.client = Client(
            api_key=api_key,
            base_url="https://api.x.ai/v1",
        )  # TODO: Change to xAI SDK

    def get_client(self) -> Client:
        """
        Returns the underlying Grok API client.

        Returns:
            Client: The initialized Grok ``Client`` instance.
        """
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: Optional[str] = None,
        *,
        file_ext: Optional[str] = "png",
        detail: str = "auto",
        **parameters: Any,
    ) -> str:
        """
        Extract text from an image using the Grok Vision model.

        Encodes the given image as a data URI with an appropriate MIME type based on
        ``file_ext`` and sends it along with a prompt to the Grok API. The API
        processes the image and returns extracted text in the response.

        Args:
            file (bytes, optional): Base64-encoded image content **without** the
                ``data:image/...;base64,`` prefix. Must not be None.
            prompt (str, optional): Instruction text guiding the extraction.
                Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
            file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
                used to determine the MIME type for the image. Defaults to ``"png"``.
            detail (str, optional): Level of detail to request for the image
                analysis. Options typically include ``"low"``, ``"high"`` or ``"auto"``.
                Defaults to ``"auto"``.
            **parameters (Any): Additional keyword arguments passed directly to
                the Grok client ``chat.completions.create()`` method.

        Returns:
            str: The extracted text returned by the vision model.

        Raises:
            ValueError: If ``file`` is None or the file extension is not compatible.
            openai.OpenAIError: If the API request fails.

        Example:
            ```python
            from splitter_mr.model import GrokVisionModel

            model = GrokVisionModel()
            with open("image.jpg", "rb") as f:
                img_b64 = base64.b64encode(f.read()).decode("utf-8")

            text = model.analyze_content(
                img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
            )
            print(text)
            ```
        """
        if file is None:
            raise ValueError("No file content provided for text extraction.")

        ext = (file_ext or "png").lower()
        mime_type = (
            GROK_MIME_BY_EXTENSION.get(ext)  # noqa: W503
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )

        if mime_type not in SUPPORTED_GROK_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type: {mime_type}")

        prompt = prompt or DEFAULT_IMAGE_CAPTION_PROMPT

        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}",
                        detail=detail,
                    ),
                ),
            ],
        )

        response = self.client.chat.completions.create(
            model=self.model_name, messages=[payload_obj], **parameters
        )

        return response.choices[0].message.content
__init__(api_key=os.getenv('XAI_API_KEY'), model_name=os.getenv('XAI_MODEL', 'grok-4'))

Initializes the GrokVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | str | Grok API key. If not provided, uses the XAI_API_KEY environment variable. | getenv('XAI_API_KEY') |
| model_name | str | Model identifier to use. If not provided, defaults to the XAI_MODEL environment variable or "grok-4". | getenv('XAI_MODEL', 'grok-4') |

Raises:

| Type | Description |
|---|---|
| ValueError | If api_key is not provided or cannot be resolved from environment variables. |

Source code in src/splitter_mr/model/models/grok_model.py
def __init__(
    self,
    api_key: Optional[str] = os.getenv("XAI_API_KEY"),
    model_name: str = os.getenv("XAI_MODEL", "grok-4"),
) -> None:
    """
    Initializes the GrokVisionModel.

    Args:
        api_key (str, optional): Grok API key. If not provided, uses the
            ``XAI_API_KEY`` environment variable.
        model_name (str, optional): Model identifier to use. If not provided,
            defaults to ``XAI_MODEL`` environment variable or ``"grok-4"``.

    Raises:
        ValueError: If ``api_key`` is not provided or cannot be resolved
            from environment variables.
    """
    api_key = api_key or os.getenv("XAI_API_KEY")
    model_name = model_name or os.getenv("XAI_MODEL") or "grok-4"

    if not api_key:
        raise ValueError(
            "Grok API key not provided or 'XAI_API_KEY' env var is not set."
        )

    self.model_name = model_name
    self.client = Client(
        api_key=api_key,
        base_url="https://api.x.ai/v1",
    )  # TODO: Change to xAI SDK
analyze_content(file, prompt=None, *, file_ext='png', detail='auto', **parameters)

Extract text from an image using the Grok Vision model.

Encodes the given image as a data URI with an appropriate MIME type based on file_ext and sends it along with a prompt to the Grok API. The API processes the image and returns extracted text in the response.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| file | bytes | Base64-encoded image content without the data:image/...;base64, prefix. Must not be None. | required |
| prompt | str | Instruction text guiding the extraction. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT. | None |
| file_ext | str | File extension (e.g., "png", "jpg") used to determine the MIME type for the image. Defaults to "png". | 'png' |
| detail | str | Level of detail to request for the image analysis. Options typically include "low", "high", or "auto". Defaults to "auto". | 'auto' |
| **parameters | Any | Additional keyword arguments passed directly to the Grok client chat.completions.create() method. | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | The extracted text returned by the vision model. |

Raises:

| Type | Description |
|---|---|
| ValueError | If file is None or the file extension is not compatible. |
| OpenAIError | If the API request fails. |

Example
from splitter_mr.model import GrokVisionModel

model = GrokVisionModel()
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(
    img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
)
print(text)
Source code in src/splitter_mr/model/models/grok_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: Optional[str] = None,
    *,
    file_ext: Optional[str] = "png",
    detail: str = "auto",
    **parameters: Any,
) -> str:
    """
    Extract text from an image using the Grok Vision model.

    Encodes the given image as a data URI with an appropriate MIME type based on
    ``file_ext`` and sends it along with a prompt to the Grok API. The API
    processes the image and returns extracted text in the response.

    Args:
        file (bytes, optional): Base64-encoded image content **without** the
            ``data:image/...;base64,`` prefix. Must not be None.
        prompt (str, optional): Instruction text guiding the extraction.
            Defaults to ``DEFAULT_IMAGE_CAPTION_PROMPT``.
        file_ext (str, optional): File extension (e.g., ``"png"``, ``"jpg"``)
            used to determine the MIME type for the image. Defaults to ``"png"``.
        detail (str, optional): Level of detail to request for the image
            analysis. Options typically include ``"low"``, ``"high"`` or ``"auto"``.
            Defaults to ``"auto"``.
        **parameters (Any): Additional keyword arguments passed directly to
            the Grok client ``chat.completions.create()`` method.

    Returns:
        str: The extracted text returned by the vision model.

    Raises:
        ValueError: If ``file`` is None or the file extension is not compatible.
        openai.OpenAIError: If the API request fails.

    Example:
        ```python
        from splitter_mr.model import GrokVisionModel

        model = GrokVisionModel()
        with open("image.jpg", "rb") as f:
            img_b64 = base64.b64encode(f.read()).decode("utf-8")

        text = model.analyze_content(
            img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
        )
        print(text)
        ```
    """
    if file is None:
        raise ValueError("No file content provided for text extraction.")

    ext = (file_ext or "png").lower()
    mime_type = (
        GROK_MIME_BY_EXTENSION.get(ext)  # noqa: W503
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )

    if mime_type not in SUPPORTED_GROK_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type: {mime_type}")

    prompt = prompt or DEFAULT_IMAGE_CAPTION_PROMPT

    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}",
                    detail=detail,
                ),
            ),
        ],
    )

    response = self.client.chat.completions.create(
        model=self.model_name, messages=[payload_obj], **parameters
    )

    return response.choices[0].message.content
get_client()

Returns the underlying Grok API client.

Returns:

| Name | Type | Description |
|---|---|---|
| Client | Client | The initialized Grok Client instance. |

Source code in src/splitter_mr/model/models/grok_model.py
def get_client(self) -> Client:
    """
    Returns the underlying Grok API client.

    Returns:
        Client: The initialized Grok ``Client`` instance.
    """
    return self.client
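
To complement the docstring example above, the sketch below requests a lower-fidelity analysis via detail="low" and forwards a token limit through **parameters. File name and option values are illustrative; it assumes XAI_API_KEY is set.

```python
# Sketch: Grok with a low-detail image analysis and a forwarded token limit.
# Assumes XAI_API_KEY is set and "photo.jpg" exists.
import base64

from splitter_mr.model import GrokVisionModel

model = GrokVisionModel()

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

caption = model.analyze_content(
    img_b64,
    prompt="Give a one-sentence caption.",
    file_ext="jpg",
    detail="low",    # lower-fidelity image analysis
    max_tokens=60,   # forwarded to chat.completions.create()
)
print(caption)
```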

GeminiVisionModel


GeminiVisionModel

Bases: BaseVisionModel

Implementation of BaseVisionModel using Google's Gemini Image Understanding API.

Source code in src/splitter_mr/model/models/gemini_model.py
class GeminiVisionModel(BaseVisionModel):
    """Implementation of `BaseVisionModel` using Google's Gemini Image Understanding API."""

    def __init__(
        self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"
    ) -> None:
        """
        Initialize the GeminiVisionModel.

        Args:
            api_key: Gemini API key. If not provided, uses 'GEMINI_API_KEY' env var.
            model_name: Vision-capable Gemini model name.

        Raises:
            ImportError: If `google-generativeai` is not installed.
            ValueError: If no API key is provided or 'GEMINI_API_KEY' not set.
        """

        if api_key is None:
            api_key = os.getenv("GEMINI_API_KEY")
        if not api_key:
            raise ValueError(
                "Google Gemini API key not provided or 'GEMINI_API_KEY' not set."
            )

        self.api_key = api_key
        self.model_name = model_name
        self.client = genai.Client(api_key=self.api_key)
        self.model = self.client.models
        self._types = types  # keep handle for analyze_content

    def get_client(self) -> Any:
        """Return the underlying Gemini SDK client."""
        return self.client

    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str] = None,
        **parameters: Any,
    ) -> str:
        """Extract text from an image using Gemini's image understanding API."""
        if file is None:
            raise ValueError("No image file provided for extraction.")

        ext = (file_ext or "jpg").lower()
        mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")

        img_b64 = file.decode("utf-8") if isinstance(file, (bytes, bytearray)) else file
        try:
            img_bytes = base64.b64decode(img_b64)
        except Exception as e:
            raise ValueError(f"Failed to decode base64 image data: {e}")

        # Build Gemini-compatible parts (using lazy-imported types)
        image_part = self._types.Part.from_bytes(data=img_bytes, mime_type=mime_type)
        text_part = prompt
        contents = [image_part, text_part]

        try:
            response = self.model.generate_content(
                model=self.model_name,
                contents=contents,
                **parameters,
            )
            return response.text
        except Exception as e:
            raise RuntimeError(f"Gemini model inference failed: {e}")
__init__(api_key=None, model_name='gemini-2.5-flash')

Initialize the GeminiVisionModel.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| api_key | Optional[str] | Gemini API key. If not provided, uses the 'GEMINI_API_KEY' env var. | None |
| model_name | str | Vision-capable Gemini model name. | 'gemini-2.5-flash' |

Raises:

| Type | Description |
|---|---|
| ImportError | If google-generativeai is not installed. |
| ValueError | If no API key is provided or 'GEMINI_API_KEY' is not set. |

Source code in src/splitter_mr/model/models/gemini_model.py
def __init__(
    self, api_key: Optional[str] = None, model_name: str = "gemini-2.5-flash"
) -> None:
    """
    Initialize the GeminiVisionModel.

    Args:
        api_key: Gemini API key. If not provided, uses 'GEMINI_API_KEY' env var.
        model_name: Vision-capable Gemini model name.

    Raises:
        ImportError: If `google-generativeai` is not installed.
        ValueError: If no API key is provided or 'GEMINI_API_KEY' not set.
    """

    if api_key is None:
        api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise ValueError(
            "Google Gemini API key not provided or 'GEMINI_API_KEY' not set."
        )

    self.api_key = api_key
    self.model_name = model_name
    self.client = genai.Client(api_key=self.api_key)
    self.model = self.client.models
    self._types = types  # keep handle for analyze_content
analyze_content(prompt, file, file_ext=None, **parameters)

Extract text from an image using Gemini's image understanding API.

Source code in src/splitter_mr/model/models/gemini_model.py
def analyze_content(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str] = None,
    **parameters: Any,
) -> str:
    """Extract text from an image using Gemini's image understanding API."""
    if file is None:
        raise ValueError("No image file provided for extraction.")

    ext = (file_ext or "jpg").lower()
    mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")

    img_b64 = file.decode("utf-8") if isinstance(file, (bytes, bytearray)) else file
    try:
        img_bytes = base64.b64decode(img_b64)
    except Exception as e:
        raise ValueError(f"Failed to decode base64 image data: {e}")

    # Build Gemini-compatible parts (using lazy-imported types)
    image_part = self._types.Part.from_bytes(data=img_bytes, mime_type=mime_type)
    text_part = prompt
    contents = [image_part, text_part]

    try:
        response = self.model.generate_content(
            model=self.model_name,
            contents=contents,
            **parameters,
        )
        return response.text
    except Exception as e:
        raise RuntimeError(f"Gemini model inference failed: {e}")
get_client()

Return the underlying Gemini SDK client.

Source code in src/splitter_mr/model/models/gemini_model.py
def get_client(self) -> Any:
    """Return the underlying Gemini SDK client."""
    return self.client
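
Unlike the other adapters, GeminiVisionModel.analyze_content takes the prompt first and the base64 payload second. A hedged usage sketch, assuming GEMINI_API_KEY is set, the multimodal extra is installed, and "chart.png" exists:

```python
# Sketch: GeminiVisionModel usage; note the prompt-first argument order.
import base64

from splitter_mr.model import GeminiVisionModel  # assumed import path, like the other adapters

model = GeminiVisionModel()  # defaults to "gemini-2.5-flash"

with open("chart.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

summary = model.analyze_content(
    "Summarize what this chart shows.",  # prompt comes first for this adapter
    img_b64,
    file_ext="png",
)
print(summary)
```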

AnthropicVisionModel


AnthropicVisionModel

Bases: BaseVisionModel

Implementation of BaseVisionModel using Anthropic's Claude Vision API via OpenAI SDK.

Sends base64-encoded images + prompts to the Claude multimodal endpoint.

Source code in src/splitter_mr/model/models/anthropic_model.py
class AnthropicVisionModel(BaseVisionModel):
    """
    Implementation of BaseVisionModel using Anthropic's Claude Vision API via OpenAI SDK.

    Sends base64-encoded images + prompts to the Claude multimodal endpoint.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        model_name: str = os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
    ) -> None:
        """
        Initialize the AnthropicVisionModel.

        Args:
            api_key (str, optional): Anthropic API key. Uses ANTHROPIC_API_KEY env var if not provided.
            model_name (str): Vision-capable Claude model name.

        Raises:
            ValueError: If no API key provided or found in environment.
        """
        if api_key is None:
            api_key = os.getenv("ANTHROPIC_API_KEY")
            if not api_key:
                raise ValueError(
                    "Anthropic API key not provided and 'ANTHROPIC_API_KEY' env var not set."
                )

        base_url: str = ("https://api.anthropic.com/v1/",)
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model_name = model_name

    def get_client(self) -> OpenAI:
        """
        Get the underlying Anthropic API client instance.

        Returns:
            OpenAI: The initialized API client.
        """
        return self.client

    def analyze_content(
        self,
        file: Optional[bytes],
        prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
        *,
        file_ext: Optional[str] = "png",
        **parameters: Dict[str, Any],
    ) -> str:
        """
        Extract text from an image using Anthropic's Claude Vision API.

        Args:
            prompt (str): Task or instruction (e.g. "Describe the image contents").
            file (bytes): Base64-encoded image content, no prefix/header.
            file_ext (str, optional): File extension (e.g. "png", "jpg").
            **parameters: Extra arguments to client.chat.completions.create().

        Returns:
            str: Extracted text or model response.

        Raises:
            ValueError: If file is None or unsupported file type.
            RuntimeError: For failed/invalid responses.
        """
        if file is None:
            raise ValueError("No file content provided for vision model.")

        ext = (file_ext or "png").lower()
        mime_type = (
            OPENAI_MIME_BY_EXTENSION.get(ext)
            or mimetypes.types_map.get(f".{ext}")  # noqa: W503
            or "image/png"  # noqa: W503
        )
        if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
            raise ValueError(f"Unsupported image MIME type for Anthropic: {mime_type}")

        # Build multimodal payload in OpenAI/Anthropic-compatible format
        payload_obj = OpenAIClientPayload(
            role="user",
            content=[
                OpenAIClientTextContent(type="text", text=prompt),
                OpenAIClientImageContent(
                    type="image_url",
                    image_url=OpenAIClientImageUrl(
                        url=f"data:{mime_type};base64,{file}"
                    ),
                ),
            ],
        )
        payload = payload_obj.model_dump(exclude_none=True)

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[payload],
            **parameters,
        )
        try:
            return response.choices[0].message.content
        except Exception as e:
            raise RuntimeError(f"Failed to extract response: {e}")
__init__(api_key=None, model_name=os.getenv('ANTHROPIC_MODEL', 'claude-sonnet-4-20250514'))

Initialize the AnthropicVisionModel.

Parameters:

- api_key (str, optional): Anthropic API key. Uses ANTHROPIC_API_KEY env var if not provided. Default: None.
- model_name (str): Vision-capable Claude model name. Default: getenv('ANTHROPIC_MODEL', 'claude-sonnet-4-20250514').

Raises:

- ValueError: If no API key provided or found in environment.

Source code in src/splitter_mr/model/models/anthropic_model.py
def __init__(
    self,
    api_key: Optional[str] = None,
    model_name: str = os.getenv("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
) -> None:
    """
    Initialize the AnthropicVisionModel.

    Args:
        api_key (str, optional): Anthropic API key. Uses ANTHROPIC_API_KEY env var if not provided.
        model_name (str): Vision-capable Claude model name.

    Raises:
        ValueError: If no API key provided or found in environment.
    """
    if api_key is None:
        api_key = os.getenv("ANTHROPIC_API_KEY")
        if not api_key:
            raise ValueError(
                "Anthropic API key not provided and 'ANTHROPIC_API_KEY' env var not set."
            )

    base_url: str = "https://api.anthropic.com/v1/"
    self.client = OpenAI(api_key=api_key, base_url=base_url)
    self.model_name = model_name
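
Because the model_name default is evaluated when the module is imported, set ANTHROPIC_MODEL before importing, or pass model_name explicitly. A short sketch (the key value is a placeholder):

```python
from splitter_mr.model.models.anthropic_model import AnthropicVisionModel

model = AnthropicVisionModel(
    api_key="sk-ant-...",                    # placeholder; omit to fall back to ANTHROPIC_API_KEY
    model_name="claude-sonnet-4-20250514",   # explicit override of the ANTHROPIC_MODEL default
)
```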
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)

Extract text from an image using Anthropic's Claude Vision API.

Parameters:

- prompt (str): Task or instruction (e.g. "Describe the image contents"). Default: DEFAULT_IMAGE_CAPTION_PROMPT.
- file (bytes): Base64-encoded image content, no prefix/header. Required.
- file_ext (str, optional): File extension (e.g. "png", "jpg"). Default: 'png'.
- **parameters (Dict[str, Any]): Extra arguments to client.chat.completions.create(). Default: {}.

Returns:

- str: Extracted text or model response.

Raises:

- ValueError: If file is None or unsupported file type.
- RuntimeError: For failed/invalid responses.

Source code in src/splitter_mr/model/models/anthropic_model.py
def analyze_content(
    self,
    file: Optional[bytes],
    prompt: str = DEFAULT_IMAGE_CAPTION_PROMPT,
    *,
    file_ext: Optional[str] = "png",
    **parameters: Dict[str, Any],
) -> str:
    """
    Extract text from an image using Anthropic's Claude Vision API.

    Args:
        prompt (str): Task or instruction (e.g. "Describe the image contents").
        file (bytes): Base64-encoded image content, no prefix/header.
        file_ext (str, optional): File extension (e.g. "png", "jpg").
        **parameters: Extra arguments to client.chat.completions.create().

    Returns:
        str: Extracted text or model response.

    Raises:
        ValueError: If file is None or unsupported file type.
        RuntimeError: For failed/invalid responses.
    """
    if file is None:
        raise ValueError("No file content provided for vision model.")

    ext = (file_ext or "png").lower()
    mime_type = (
        OPENAI_MIME_BY_EXTENSION.get(ext)
        or mimetypes.types_map.get(f".{ext}")  # noqa: W503
        or "image/png"  # noqa: W503
    )
    if mime_type not in SUPPORTED_OPENAI_MIME_TYPES:
        raise ValueError(f"Unsupported image MIME type for Anthropic: {mime_type}")

    # Build multimodal payload in OpenAI/Anthropic-compatible format
    payload_obj = OpenAIClientPayload(
        role="user",
        content=[
            OpenAIClientTextContent(type="text", text=prompt),
            OpenAIClientImageContent(
                type="image_url",
                image_url=OpenAIClientImageUrl(
                    url=f"data:{mime_type};base64,{file}"
                ),
            ),
        ],
    )
    payload = payload_obj.model_dump(exclude_none=True)

    response = self.client.chat.completions.create(
        model=self.model_name,
        messages=[payload],
        **parameters,
    )
    try:
        return response.choices[0].message.content
    except Exception as e:
        raise RuntimeError(f"Failed to extract response: {e}")
get_client()

Get the underlying Anthropic API client instance.

Returns:

- OpenAI: The initialized API client.

Source code in src/splitter_mr/model/models/anthropic_model.py
def get_client(self) -> OpenAI:
    """
    Get the underlying Anthropic API client instance.

    Returns:
        OpenAI: The initialized API client.
    """
    return self.client

HuggingFaceVisionModel

Warning

HuggingFaceVisionModel does NOT currently support every model available on Hugging Face.

For example, closed models (e.g., Microsoft Florence 2 large) and models that use uncommon architectures (e.g., NanoNets) are not supported. We strongly recommend using SmolDocling, as it has been exhaustively tested.


HuggingFaceVisionModel

Bases: BaseVisionModel

Vision-language model wrapper using Hugging Face Transformers.

This implementation loads a local or Hugging Face Hub model that supports image-to-text or multimodal tasks. It accepts a prompt and an image as base64 (without the data URI header) and returns the model's generated text. Pydantic schema models are used for message validation.

Example
import base64, requests
from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

# Encode an image as base64
img_bytes = requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/p-blog/candy.JPG"
).content
img_b64 = base64.b64encode(img_bytes).decode("utf-8")

model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
result = model.analyze_content("What animal is on the candy?", file=img_b64)
print(result)  # e.g., "A small green thing."
Source code in src/splitter_mr/model/models/huggingface_model.py
class HuggingFaceVisionModel(BaseVisionModel):
    """
    Vision-language model wrapper using Hugging Face Transformers.

    This implementation loads a local or Hugging Face Hub model that supports
    image-to-text or multimodal tasks. It accepts a prompt and an image as
    base64 (without the data URI header) and returns the model's generated text.
    Pydantic schema models are used for message validation.

    Example:
        ```python
        import base64, requests
        from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

        # Encode an image as base64
        img_bytes = requests.get(
            "https://huggingface.co/datasets/huggingface/documentation-images/"
            "resolve/main/p-blog/candy.JPG"
        ).content
        img_b64 = base64.b64encode(img_bytes).decode("utf-8")

        model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
        result = model.analyze_content("What animal is on the candy?", file=img_b64)
        print(result)  # e.g., "A small green thing."
        ```
    """

    DEFAULT_EXT: str = "jpg"
    FALLBACKS: List[Tuple[str, Optional[Any]]] = [
        ("AutoModelForVision2Seq", None),
        ("AutoModelForImageTextToText", None),
        ("AutoModelForCausalLM", None),
        ("AutoModelForPreTraining", None),
        ("AutoModel", None),
    ]

    def __init__(self, model_name: str = "ds4sd/SmolDocling-256M-preview") -> None:
        """
        Initialize a HuggingFaceVisionModel.

        Args:
            model_name (str, optional): Model repo ID or local path
                (e.g., ``"ds4sd/SmolDocling-256M-preview"``).

        Raises:
            ImportError: If the 'multimodal' extra (transformers) is not installed.
            RuntimeError: If processor or model loading fails after all attempts.
        """

        transformers = importlib.import_module("transformers")

        AutoProcessor = transformers.AutoProcessor
        AutoImageProcessor = transformers.AutoImageProcessor
        AutoConfig = transformers.AutoConfig

        self.model_id = model_name
        self.model = None
        self.processor = None

        # Load processor
        try:
            self.processor = AutoProcessor.from_pretrained(
                self.model_id, trust_remote_code=True
            )
        except Exception:
            try:
                self.processor = AutoImageProcessor.from_pretrained(
                    self.model_id, trust_remote_code=True
                )
            except Exception as e:
                raise RuntimeError("All processor loading attempts failed.") from e

        # Load model
        config = AutoConfig.from_pretrained(self.model_id)
        errors: List[str] = []

        try:
            arch_name = config.architectures[0]
            ModelClass = getattr(transformers, arch_name)
            self.model = ModelClass.from_pretrained(
                self.model_id, trust_remote_code=True
            )
        except Exception as e:
            errors.append(f"[AutoModel by architecture] {e}")

        if self.model is None:
            resolved: List[Tuple[str, Any]] = []
            for name, cls in self.FALLBACKS:
                resolved.append((name, cls or getattr(transformers, name)))
            for name, cls in resolved:
                try:
                    self.model = cls.from_pretrained(
                        self.model_id, trust_remote_code=True
                    )
                    break
                except Exception as e:
                    errors.append(f"[{name}] {e}")

        if self.model is None:
            raise RuntimeError(
                "All model loading attempts failed:\n" + "\n".join(errors)
            )

    def get_client(self) -> Any:
        """Return the underlying HuggingFace model instance.

        Returns:
            Any: The instantiated HuggingFace model object.
        """
        return self.model

    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str] = None,
        **parameters: Dict[str, Any],
    ) -> str:
        """
        Extract text from an image using the vision-language model.

        This method encodes an image as a data URI, builds a validated
        message using schema models, prepares inputs, and calls the model
        to generate a textual response.

        Args:
            prompt (str): Instruction or question for the model
                (e.g., ``"Describe this image."``).
            file (Optional[bytes]): Image as a base64-encoded string (without prefix).
            file_ext (Optional[str], optional): File extension (e.g., ``"jpg"`` or ``"png"``).
                Defaults to ``"jpg"`` if not provided.
            **parameters (Dict[str, Any]): Extra keyword arguments passed directly
                to the model's ``generate()`` method (e.g., ``max_new_tokens``,
                ``temperature``).

        Returns:
            str: The extracted or generated text.

        Raises:
            ValueError: If ``file`` is None.
            RuntimeError: If input preparation or inference fails.
        """
        if file is None:
            raise ValueError("No image file provided for extraction.")

        ext = (file_ext or self.DEFAULT_EXT).lower()
        mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")
        img_b64 = file if isinstance(file, str) else file.decode("utf-8")
        img_data_uri = f"data:{mime_type};base64,{img_b64}"

        text_content = HFChatTextContent(type="text", text=prompt)
        image_content = HFChatImageContent(type="image", image=img_data_uri)
        chat_msg = HFChatMessage(role="user", content=[image_content, text_content])
        messages = [chat_msg.model_dump(exclude_none=True)]

        try:
            inputs = self.processor.apply_chat_template(
                messages,
                add_generation_prompt=True,
                tokenize=True,
                return_dict=True,
                return_tensors="pt",
                truncation=True,
            ).to(self.model.device)
        except Exception as e:
            raise RuntimeError(f"Failed to prepare input: {e}")

        try:
            max_new_tokens = parameters.pop("max_new_tokens", 40)
            outputs = self.model.generate(
                **inputs, max_new_tokens=max_new_tokens, **parameters
            )
            output_text = self.processor.decode(
                outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
            )
            return output_text
        except Exception as e:
            raise RuntimeError(f"Model inference failed: {e}")
__init__(model_name='ds4sd/SmolDocling-256M-preview')

Initialize a HuggingFaceVisionModel.

Parameters:

- model_name (str, optional): Model repo ID or local path (e.g., "ds4sd/SmolDocling-256M-preview"). Default: 'ds4sd/SmolDocling-256M-preview'.

Raises:

- ImportError: If the 'multimodal' extra (transformers) is not installed.
- RuntimeError: If processor or model loading fails after all attempts.

Source code in src/splitter_mr/model/models/huggingface_model.py
def __init__(self, model_name: str = "ds4sd/SmolDocling-256M-preview") -> None:
    """
    Initialize a HuggingFaceVisionModel.

    Args:
        model_name (str, optional): Model repo ID or local path
            (e.g., ``"ds4sd/SmolDocling-256M-preview"``).

    Raises:
        ImportError: If the 'multimodal' extra (transformers) is not installed.
        RuntimeError: If processor or model loading fails after all attempts.
    """

    transformers = importlib.import_module("transformers")

    AutoProcessor = transformers.AutoProcessor
    AutoImageProcessor = transformers.AutoImageProcessor
    AutoConfig = transformers.AutoConfig

    self.model_id = model_name
    self.model = None
    self.processor = None

    # Load processor
    try:
        self.processor = AutoProcessor.from_pretrained(
            self.model_id, trust_remote_code=True
        )
    except Exception:
        try:
            self.processor = AutoImageProcessor.from_pretrained(
                self.model_id, trust_remote_code=True
            )
        except Exception as e:
            raise RuntimeError("All processor loading attempts failed.") from e

    # Load model
    config = AutoConfig.from_pretrained(self.model_id)
    errors: List[str] = []

    try:
        arch_name = config.architectures[0]
        ModelClass = getattr(transformers, arch_name)
        self.model = ModelClass.from_pretrained(
            self.model_id, trust_remote_code=True
        )
    except Exception as e:
        errors.append(f"[AutoModel by architecture] {e}")

    if self.model is None:
        resolved: List[Tuple[str, Any]] = []
        for name, cls in self.FALLBACKS:
            resolved.append((name, cls or getattr(transformers, name)))
        for name, cls in resolved:
            try:
                self.model = cls.from_pretrained(
                    self.model_id, trust_remote_code=True
                )
                break
            except Exception as e:
                errors.append(f"[{name}] {e}")

    if self.model is None:
        raise RuntimeError(
            "All model loading attempts failed:\n" + "\n".join(errors)
        )
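
For gated Hub repositories you may need to authenticate before instantiation. A hedged sketch using huggingface_hub.login with the HF_ACCESS_TOKEN variable mentioned in the model table; whether the wrapper reads this token on its own is not shown here, and SmolDocling itself is openly available.

```python
import os

from huggingface_hub import login

from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

# Authenticate against the Hub only if a token is available (needed for gated repos)
token = os.getenv("HF_ACCESS_TOKEN")
if token:
    login(token=token)

model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
```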
analyze_content(prompt, file, file_ext=None, **parameters)

Extract text from an image using the vision-language model.

This method encodes an image as a data URI, builds a validated message using schema models, prepares inputs, and calls the model to generate a textual response.

Parameters:

- prompt (str): Instruction or question for the model (e.g., "Describe this image."). Required.
- file (Optional[bytes]): Image as a base64-encoded string (without prefix). Required.
- file_ext (Optional[str], optional): File extension (e.g., "jpg" or "png"). Defaults to "jpg" if not provided. Default: None.
- **parameters (Dict[str, Any]): Extra keyword arguments passed directly to the model's generate() method (e.g., max_new_tokens, temperature). Default: {}.

Returns:

- str: The extracted or generated text.

Raises:

- ValueError: If file is None.
- RuntimeError: If input preparation or inference fails.

Source code in src/splitter_mr/model/models/huggingface_model.py
def analyze_content(
    self,
    prompt: str,
    file: Optional[bytes],
    file_ext: Optional[str] = None,
    **parameters: Dict[str, Any],
) -> str:
    """
    Extract text from an image using the vision-language model.

    This method encodes an image as a data URI, builds a validated
    message using schema models, prepares inputs, and calls the model
    to generate a textual response.

    Args:
        prompt (str): Instruction or question for the model
            (e.g., ``"Describe this image."``).
        file (Optional[bytes]): Image as a base64-encoded string (without prefix).
        file_ext (Optional[str], optional): File extension (e.g., ``"jpg"`` or ``"png"``).
            Defaults to ``"jpg"`` if not provided.
        **parameters (Dict[str, Any]): Extra keyword arguments passed directly
            to the model's ``generate()`` method (e.g., ``max_new_tokens``,
            ``temperature``).

    Returns:
        str: The extracted or generated text.

    Raises:
        ValueError: If ``file`` is None.
        RuntimeError: If input preparation or inference fails.
    """
    if file is None:
        raise ValueError("No image file provided for extraction.")

    ext = (file_ext or self.DEFAULT_EXT).lower()
    mime_type = mimetypes.types_map.get(f".{ext}", "image/jpeg")
    img_b64 = file if isinstance(file, str) else file.decode("utf-8")
    img_data_uri = f"data:{mime_type};base64,{img_b64}"

    text_content = HFChatTextContent(type="text", text=prompt)
    image_content = HFChatImageContent(type="image", image=img_data_uri)
    chat_msg = HFChatMessage(role="user", content=[image_content, text_content])
    messages = [chat_msg.model_dump(exclude_none=True)]

    try:
        inputs = self.processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
            truncation=True,
        ).to(self.model.device)
    except Exception as e:
        raise RuntimeError(f"Failed to prepare input: {e}")

    try:
        max_new_tokens = parameters.pop("max_new_tokens", 40)
        outputs = self.model.generate(
            **inputs, max_new_tokens=max_new_tokens, **parameters
        )
        output_text = self.processor.decode(
            outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
        )
        return output_text
    except Exception as e:
        raise RuntimeError(f"Model inference failed: {e}")
get_client()

Return the underlying HuggingFace model instance.

Returns:

- Any: The instantiated HuggingFace model object.

Source code in src/splitter_mr/model/models/huggingface_model.py
def get_client(self) -> Any:
    """Return the underlying HuggingFace model instance.

    Returns:
        Any: The instantiated HuggingFace model object.
    """
    return self.model