# Visual Models
Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To handle this, you can use visual language models (VLMs), which are capable of recognizing images and extracting text and descriptions from them. For this purpose, the library provides a model module whose implementations are built on the abstract `BaseModel` class, presented below.
## Which model should I use?
The choice of model depends on your cloud provider, available API keys, and desired level of integration. All models inherit from `BaseModel` and provide the same interface for extracting text and descriptions from images.
| Model | When to use | Requirements | Features |
|---|---|---|---|
| `OpenAIVisionModel` | Use if you have an OpenAI API key and want to use the OpenAI cloud | OpenAI account & API key | No Azure setup; easy to get started. |
| `AzureOpenAIVisionModel` | Use if your organization uses Azure OpenAI Services | Azure OpenAI deployment, API key, endpoint | Integration with Azure, enterprise security. |
| `BaseModel` | Abstract base; not used directly | – | Use as a template for building your own. |
## Models
### BaseModel

Bases: `ABC`
Source code in `src/splitter_mr/model/base_model.py`
#### `extract_text(prompt, file, **parameters)` *(abstractmethod)*
Extracts text from the provided image (base64-encoded string) using the prompt.
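For orientation, here is a minimal sketch of what this abstract class looks like, inferred from the documented signature; the real implementation in `src/splitter_mr/model/base_model.py` may include additional members:

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseModel(ABC):
    """Shared interface for all vision models (illustrative sketch)."""

    @abstractmethod
    def extract_text(self, prompt: str, file: bytes, **parameters: Any) -> str:
        """Extract text from the provided image (base64-encoded string) using the prompt."""
        ...
```

Any custom model that subclasses `BaseModel` and implements `extract_text` can be used interchangeably with the built-in models.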
### OpenAIVisionModel

Bases: `BaseModel`
Implementation of `BaseModel` leveraging OpenAI's Responses API. Uses the `client.responses.create()` method to send base64-encoded images along with text prompts in a single multimodal request.
Source code in `src/splitter_mr/model/models/openai_model.py`
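To make the request shape concrete, below is a hedged sketch of the kind of multimodal call this wrapper issues through the standard `openai` SDK; the payload the class actually builds internally may differ:

```python
import base64

from openai import OpenAI

client = OpenAI(api_key="sk-...")  # hypothetical key

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# One multimodal request: a text prompt plus the image as a base64 data URL
response = client.responses.create(
    model="gpt-4.1",
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "Extract the text from this resource."},
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{image_b64}",
                },
            ],
        }
    ],
)
print(response.output_text)
```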
#### `__init__(api_key=None, model_name='gpt-4.1')`
Initializes the OpenAIVisionModel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str` | OpenAI API key. If not provided, uses the `OPENAI_API_KEY` environment variable. | `None` |
| `model_name` | `str` | Vision-capable model name (e.g., `"gpt-4.1"`). | `'gpt-4.1'` |
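Because the key falls back to the `OPENAI_API_KEY` environment variable, the model can be constructed without arguments; a short sketch:

```python
import os

from splitter_mr.model import OpenAIVisionModel

# Normally exported in your shell; set here only for illustration
os.environ["OPENAI_API_KEY"] = "sk-..."

model = OpenAIVisionModel()  # picks up OPENAI_API_KEY automatically
```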
#### `get_client()`
Returns the underlying OpenAI client instance.
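This is handy when you need the raw SDK client for calls the wrapper does not expose. A sketch, assuming the returned object is a standard `openai.OpenAI` client:

```python
client = model.get_client()

# Use the raw client directly, e.g. to list the models available to your key
for m in client.models.list():
    print(m.id)
```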
#### `extract_text(file, prompt='Extract the text from this resource in the original language. Return the result in markdown code format.', **parameters)`
Extracts text from a base64-encoded image using OpenAI's Responses API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | Base64-encoded image string. | *required* |
| `prompt` | `str` | Instructions for text extraction. | `'Extract the text from this resource in the original language. Return the result in markdown code format.'` |
| `**parameters` | `Any` | Additional parameters for `client.responses.create()`. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | The extracted text from the image. |
Example:

```python
from splitter_mr.model import OpenAIVisionModel

# Initialize with your OpenAI API key (set as env variable or pass directly)
model = OpenAIVisionModel(api_key="sk-...")

with open("example.png", "rb") as f:
    image_bytes = f.read()

markdown = model.extract_text(image_bytes)
print(markdown)
# This picture shows ...
```
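Note that `file` is documented as a base64-encoded string, while the example passes raw file bytes. If your pipeline already holds a base64 string, explicit encoding looks like this (a sketch; whether the method accepts raw bytes, a base64 string, or both depends on the implementation):

```python
import base64

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

markdown = model.extract_text(image_b64, prompt="Describe this image in markdown.")
```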
### AzureOpenAIVisionModel

Bases: `BaseModel`
Implementation of `BaseModel` for Azure OpenAI Vision using the Responses API. Utilizes Azure's preview `responses` API, which supports base64-encoded images and stateful multimodal calls.
Source code in `src/splitter_mr/model/models/azure_openai_model.py`
#### `__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None)`
Initializes the AzureOpenAIVisionModel.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_key` | `str` | Azure OpenAI API key. If not provided, uses the `AZURE_OPENAI_API_KEY` env var. | `None` |
| `azure_endpoint` | `str` | Azure endpoint. If not provided, uses the `AZURE_OPENAI_ENDPOINT` env var. | `None` |
| `azure_deployment` | `str` | Azure deployment name. If not provided, uses the `AZURE_OPENAI_DEPLOYMENT` env var. | `None` |
| `api_version` | `str` | API version string. If not provided, uses the `AZURE_OPENAI_API_VERSION` env var or defaults to `'2025-04-14-preview'`. | `None` |
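Since every argument falls back to an environment variable, a fully configured environment allows zero-argument construction; a sketch with placeholder values:

```python
import os

from splitter_mr.model import AzureOpenAIVisionModel

# Placeholders; normally exported in your shell or deployment environment
os.environ["AZURE_OPENAI_API_KEY"] = "..."
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<resource>.openai.azure.com/"
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "<deployment-name>"

model = AzureOpenAIVisionModel()  # reads the AZURE_OPENAI_* variables
```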
#### `get_client()`
Returns the AzureOpenAI client instance.
#### `extract_text(file, prompt='Extract the text from this resource in the original language. Return the result in markdown code format.', **parameters)`
Extracts text from a base64 image using Azure's Responses API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `bytes` | Base64-encoded image string. | *required* |
| `prompt` | `str` | Instruction prompt for text extraction. | `'Extract the text from this resource in the original language. Return the result in markdown code format.'` |
| `**parameters` | `Any` | Extra params passed to `client.responses.create()`. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `str` | `str` | Extracted text from the image. |
Example:

```python
from splitter_mr.model import AzureOpenAIVisionModel

# Ensure required Azure environment variables are set, or pass parameters directly
model = AzureOpenAIVisionModel(
    api_key="...",
    azure_endpoint="https://...azure.com/",
    azure_deployment="deployment-name",
)

with open("example.png", "rb") as f:
    image_bytes = f.read()

markdown = model.extract_text(image_bytes)
print(markdown)
```
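Extra keyword arguments are forwarded to `client.responses.create()`, so request options supported by the Responses API can be passed through; for example (assuming the deployed model accepts these options):

```python
markdown = model.extract_text(
    image_bytes,
    prompt="Summarize any visible text in English.",
    temperature=0,           # forwarded to client.responses.create()
    max_output_tokens=512,   # Responses API option, also forwarded
)
```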