Visual Models¶
Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To handle this, you can use visual language models (VLMs), which can recognize images and extract descriptions from them. For this purpose, the library provides a model module whose implementations are based on the BaseModel class, presented below.
Which model should I use?¶
The choice of model depends on your cloud provider, available API keys, and desired level of integration.
All models inherit from `BaseModel` and provide the same interface for extracting text and descriptions from images.
| Model | When to use | Requirements | Features |
|---|---|---|---|
| `OpenAIVisionModel` | You have an OpenAI API key and want to use the OpenAI cloud | OpenAI account & API key | No Azure setup; easy to get started. |
| `AzureOpenAIVisionModel` | Your organization uses Azure OpenAI Services | Azure OpenAI deployment, API key, endpoint | Azure integration, enterprise security. |
| `BaseModel` | Abstract base, not used directly | – | Template for building your own model. |
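The table's decision logic can be sketched as a small credential check. Note that `choose_vision_model` is an illustrative helper, not part of the splitter_mr API; it assumes the conventional environment variable names for each provider:

```python
import os


def choose_vision_model() -> str:
    """Pick a vision model class by available credentials.

    Illustrative helper, not part of the splitter_mr API; the environment
    variable names are the conventional ones for each provider.
    """
    if os.getenv("AZURE_OPENAI_API_KEY") and os.getenv("AZURE_OPENAI_ENDPOINT"):
        return "AzureOpenAIVisionModel"
    if os.getenv("OPENAI_API_KEY"):
        return "OpenAIVisionModel"
    raise ValueError("No OpenAI or Azure OpenAI credentials found in the environment.")
```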
Models¶
BaseModel¶
Bases: ABC
Abstract base for vision models that extract text from images.
Subclasses encapsulate local or API-backed implementations (e.g., OpenAI, Azure OpenAI, or on-device models). Implementations should handle encoding, request construction, and response parsing while exposing a uniform interface for clients of the library.
Source code in src/splitter_mr/model/base_model.py
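A custom backend only needs to implement the three abstract methods. The sketch below re-declares a stand-in `BaseModel` with the same interface so it runs without the package installed; `DummyVisionModel` and its canned response are illustrative only:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


# Stand-in that mirrors the interface of splitter_mr.model.BaseModel
# (the real class lives in src/splitter_mr/model/base_model.py).
class BaseModel(ABC):
    @abstractmethod
    def __init__(self, model_name: Any): ...

    @abstractmethod
    def get_client(self) -> Any: ...

    @abstractmethod
    def extract_text(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str],
        **parameters: Dict[str, Any],
    ) -> str: ...


class DummyVisionModel(BaseModel):
    """Toy backend that returns a canned response (illustrative only)."""

    def __init__(self, model_name: str = "dummy"):
        self.model_name = model_name

    def get_client(self) -> None:
        # A local dummy needs no SDK client or session object.
        return None

    def extract_text(self, prompt, file, file_ext="png", **parameters) -> str:
        if file is None:
            raise ValueError("file must not be None")
        return f"[{self.model_name}] response to: {prompt}"
```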
__init__(model_name) abstractmethod ¶
Initialize the model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `Any` | Identifier of the underlying model. For hosted APIs this could be a model name or deployment name; for local models, it could be a path or configuration object. | *required* |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If required configuration or credentials are missing. |
Source code in src/splitter_mr/model/base_model.py
get_client() abstractmethod ¶
Return the underlying client or handle.
Returns:

| Name | Type | Description |
|---|---|---|
| `Any` | `Any` | A client/handle that the implementation uses to perform inference (e.g., an SDK client instance, session object, or lightweight wrapper). |
Source code in src/splitter_mr/model/base_model.py
extract_text(prompt, file, file_ext, **parameters) abstractmethod ¶
Extract text from an image using the provided prompt.
Encodes the image (provided as base64 without the `data:<mime>;base64,` prefix), sends it with an instruction prompt to the underlying vision model, and returns the model's textual output.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `prompt` | `str` | Instruction or task description guiding the extraction (e.g., "Read all visible text" or "Summarize the receipt"). | *required* |
| `file` | `Optional[bytes]` | Base64-encoded image bytes without the `data:<mime>;base64,` header/prefix. Must not be `None`. | *required* |
| `file_ext` | `Optional[str]` | File extension (e.g., `png`, `jpg`). | *required* |
| `**parameters` | `Dict[str, Any]` | Additional backend-specific options forwarded to the implementation (e.g., timeouts, user tags, temperature, etc.). | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | The extracted text or the model's textual response. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file` is `None`. |
| `RuntimeError` | If the inference call fails or returns an unexpected response shape. |
Source code in src/splitter_mr/model/base_model.py
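The docstring above notes that `file` arrives without the `data:<mime>;base64,` prefix; implementations prepend it before calling the backend. A sketch of that step, assuming a hypothetical helper `to_data_uri` (not a library function):

```python
import mimetypes


def to_data_uri(file_b64: str, file_ext: str = "png") -> str:
    """Prepend the data-URI header to a raw base64 payload (illustrative)."""
    # Map the extension to a MIME type, falling back to a generic one.
    mime = mimetypes.types_map.get(f".{file_ext}", "application/octet-stream")
    return f"data:{mime};base64,{file_b64}"
```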
OpenAIVisionModel¶
Bases: BaseModel
Implementation of BaseModel leveraging OpenAI's Chat Completions API.
Uses the `client.chat.completions.create()` method to send base64-encoded images along with text prompts in a single multimodal request.
Source code in src/splitter_mr/model/models/openai_model.py
__init__(api_key=None, model_name='gpt-4.1') ¶
Initialize the OpenAIVisionModel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str` | OpenAI API key. If not provided, uses the `OPENAI_API_KEY` environment variable. | `None` |
| `model_name` | `str` | Vision-capable model name (e.g., `gpt-4.1`). | `'gpt-4.1'` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If no API key is provided or found in the environment. |
Source code in src/splitter_mr/model/models/openai_model.py
get_client() ¶
Get the underlying OpenAI client instance.
Returns:

| Name | Type | Description |
|---|---|---|
| `OpenAI` | `OpenAI` | The initialized API client. |
Source code in src/splitter_mr/model/models/openai_model.py
extract_text(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters) ¶
Extract text from an image using OpenAI's Chat Completions API.
Encodes the provided image bytes as a base64 data URI and sends it along with a textual prompt to the specified vision-capable model. The model processes the image and returns extracted text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | Base64-encoded image content without the `data:<mime>;base64,` prefix. | *required* |
| `prompt` | `str` | Instruction text guiding the extraction. Defaults to `DEFAULT_IMAGE_CAPTION_PROMPT`. | `DEFAULT_IMAGE_CAPTION_PROMPT` |
| `file_ext` | `str` | File extension (e.g., `png`, `jpg`). | `'png'` |
| `**parameters` | `Any` | Additional keyword arguments passed directly to the OpenAI client's `chat.completions.create()` call. | `{}` |

Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted text returned by the model. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file` is `None`. |
| `OpenAIError` | If the API request fails. |
Example

```python
import base64

from splitter_mr.model import OpenAIVisionModel

model = OpenAIVisionModel(api_key="sk-...")
with open("example.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.extract_text(img_b64, prompt="Describe the content of this image.")
print(text)
```
Source code in src/splitter_mr/model/models/openai_model.py
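In the Chat Completions format, the base64 payload travels as an `image_url` content part alongside the text prompt in a single user message. A sketch of how such a request body can be assembled (`build_vision_messages` is a hypothetical helper; the library's internal construction may differ):

```python
def build_vision_messages(prompt: str, img_b64: str, file_ext: str = "png") -> list:
    """Assemble a multimodal Chat Completions message list (sketch)."""
    return [
        {
            "role": "user",
            "content": [
                # Text instruction and image travel in one message.
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/{file_ext};base64,{img_b64}"},
                },
            ],
        }
    ]
```

The resulting list would be passed as the `messages` argument of `client.chat.completions.create()`.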
AzureOpenAIVisionModel¶
Bases: BaseModel
Implementation of BaseModel for Azure OpenAI Vision using the Responses API.
Utilizes Azure's preview `responses` API, which supports base64-encoded images and stateful multimodal calls.
Source code in src/splitter_mr/model/models/azure_openai_model.py
__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None) ¶
Initializes the AzureOpenAIVisionModel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str` | Azure OpenAI API key. If not provided, uses the `AZURE_OPENAI_API_KEY` env var. | `None` |
| `azure_endpoint` | `str` | Azure endpoint. If not provided, uses the `AZURE_OPENAI_ENDPOINT` env var. | `None` |
| `azure_deployment` | `str` | Azure deployment name. If not provided, uses the `AZURE_OPENAI_DEPLOYMENT` env var. | `None` |
| `api_version` | `str` | API version string. If not provided, uses the `AZURE_OPENAI_API_VERSION` env var or defaults to `'2025-04-14-preview'`. | `None` |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If no connection details are provided or environment variables are not set. |
Source code in src/splitter_mr/model/models/azure_openai_model.py
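The fallback order described above (explicit argument, then environment variable, then the documented default API version) can be sketched as a standalone function; `resolve_azure_settings` is illustrative, not part of the library:

```python
import os


def resolve_azure_settings(
    api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None
) -> dict:
    """Resolve Azure OpenAI settings: explicit arg, then env var (sketch).

    Mirrors the documented fallback order; the real logic lives in
    src/splitter_mr/model/models/azure_openai_model.py.
    """
    settings = {
        "api_key": api_key or os.getenv("AZURE_OPENAI_API_KEY"),
        "azure_endpoint": azure_endpoint or os.getenv("AZURE_OPENAI_ENDPOINT"),
        "azure_deployment": azure_deployment or os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        # api_version additionally falls back to a documented default.
        "api_version": api_version
        or os.getenv("AZURE_OPENAI_API_VERSION")
        or "2025-04-14-preview",
    }
    missing = [k for k, v in settings.items() if v is None]
    if missing:
        raise ValueError(f"Missing Azure OpenAI settings: {missing}")
    return settings
```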
get_client() ¶
Returns the AzureOpenAI client instance.
Source code in src/splitter_mr/model/models/azure_openai_model.py
extract_text(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, file_ext='png', **parameters) ¶
Extract text from an image using the Azure OpenAI Vision model.
Encodes the given image as a data URI with an appropriate MIME type based on `file_ext` and sends it along with a prompt to the Azure OpenAI Vision API. The API processes the image and returns extracted text in the response.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | Base64-encoded image content without the `data:<mime>;base64,` prefix. | *required* |
| `prompt` | `str` | Instruction text guiding the extraction. Defaults to `DEFAULT_IMAGE_CAPTION_PROMPT`. | `DEFAULT_IMAGE_CAPTION_PROMPT` |
| `file_ext` | `str` | File extension (e.g., `png`, `jpg`). | `'png'` |
| `**parameters` | `Any` | Additional keyword arguments passed directly to the Azure OpenAI client's `responses` API call. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | The extracted text returned by the vision model. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `file` is `None`. |
| `OpenAIError` | If the API request fails. |
Example

```python
import base64

from splitter_mr.model import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel(...)
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.extract_text(img_b64, prompt="Describe this image", file_ext="jpg")
print(text)
```
Source code in src/splitter_mr/model/models/azure_openai_model.py
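Unlike Chat Completions, the Responses API takes `input_text` / `input_image` content items rather than chat messages. A sketch of how such an input can be assembled (`build_responses_input` is a hypothetical helper; the library's internal construction may differ):

```python
def build_responses_input(prompt: str, img_b64: str, file_ext: str = "png") -> list:
    """Assemble a Responses API input item carrying text plus an image (sketch)."""
    return [
        {
            "role": "user",
            "content": [
                # Responses API content types differ from Chat Completions:
                # "input_text"/"input_image" instead of "text"/"image_url".
                {"type": "input_text", "text": prompt},
                {
                    "type": "input_image",
                    "image_url": f"data:image/{file_ext};base64,{img_b64}",
                },
            ],
        }
    ]
```

The resulting list would be passed as the `input` argument of the client's `responses` call, together with the deployment name as the model.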