Vision Models¶
Reading documents such as Word, PDF, or PowerPoint files can be complicated when they contain images. To solve this problem, you can use vision language models (VLMs), which are capable of recognizing images and extracting descriptions from them. This library provides a model module whose implementations are based on the `BaseVisionModel` class, presented below.
Which model should I use?¶
The choice of model depends on your cloud provider, available API keys, and desired level of integration. All models inherit from BaseVisionModel and provide the same interface for extracting text and descriptions from images.
Model | When to use | Requirements | Features
---|---|---|---
`OpenAIVisionModel` | If you have an OpenAI API key and want the OpenAI cloud | `OPENAI_API_KEY` (optional: `OPENAI_MODEL`, defaults to `"gpt-4o"`) | Simple setup; standard OpenAI chat API
`AzureOpenAIVisionModel` | For Azure OpenAI Services users | `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_DEPLOYMENT`, `AZURE_OPENAI_API_VERSION` | Integrates with Azure; enterprise controls
`GrokVisionModel` | If you have access to xAI's Grok multimodal model | `XAI_API_KEY` (optional: `XAI_MODEL`, defaults to `"grok-4"`) | Supports data URIs; optional image quality (`detail`)
`GeminiVisionModel` | If you want Google's Gemini Vision models | `GEMINI_API_KEY` + multimodal extra: `pip install 'splitter-mr[multimodal]'` | Google Gemini API; multimodal, high-quality extraction
`HuggingFaceVisionModel` | Local/open-source/offline inference | Multimodal extra: `pip install 'splitter-mr[multimodal]'` (optional: `HF_ACCESS_TOKEN` for models that require it) | Runs locally; uses HF `AutoProcessor` + chat templates
`AnthropicVisionModel` | If you have an Anthropic key and want Claude Vision | `ANTHROPIC_API_KEY` (optional: `ANTHROPIC_MODEL`, defaults to `"claude-sonnet-4-20250514"`) | Uses the OpenAI SDK with the Anthropic base URL; data-URI (base64) image input; OpenAI-compatible `chat.completions`
`BaseVisionModel` | Abstract base, not used directly | – | Template to build your own adapters
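All adapters share the same `analyze_content` interface, so switching providers is mostly a matter of swapping the constructor. A minimal sketch using `OpenAIVisionModel` (the file name is illustrative, and `OPENAI_API_KEY` is assumed to be set):

```python
import base64

from splitter_mr.model import OpenAIVisionModel

# Any other adapter (AzureOpenAIVisionModel, GrokVisionModel, ...) can be
# swapped in here, since they all implement BaseVisionModel.
model = OpenAIVisionModel()  # reads OPENAI_API_KEY from the environment

# The adapters expect base64-encoded bytes *without* the data URI prefix.
with open("invoice.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

print(model.analyze_content(img_b64, prompt="Extract all visible text."))
```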
Models¶
BaseVisionModel¶
BaseVisionModel
¶
Bases: ABC
Abstract base for vision models that extract text from images.
Subclasses encapsulate local or API-backed implementations (e.g., OpenAI, Azure OpenAI, or on-device models). Implementations should handle encoding, request construction, and response parsing while exposing a uniform interface for clients of the library.
Source code in src/splitter_mr/model/base_model.py
__init__(model_name)
abstractmethod
¶
Initialize the model.
Parameters:
Name | Type | Description | Default
---|---|---|---
`model_name` | `Any` | Identifier of the underlying model. For hosted APIs this could be a model name or deployment name; for local models, it could be a path or configuration object. | *required*

Raises:

Type | Description
---|---
`ValueError` | If required configuration or credentials are missing.
Source code in src/splitter_mr/model/base_model.py
analyze_content(prompt, file, file_ext, **parameters)
abstractmethod
¶
Extract text from an image using the provided prompt.
Encodes the image (provided as base64 without the `data:<mime>;base64,` prefix), sends it with an instruction prompt to the underlying vision model, and returns the model's textual output.
Parameters:
Name | Type | Description | Default
---|---|---|---
`prompt` | `str` | Instruction or task description guiding the extraction (e.g., "Read all visible text" or "Summarize the receipt"). | *required*
`file` | `Optional[bytes]` | Base64-encoded image bytes without the `data:<mime>;base64,` header/prefix. Must not be `None`. | *required*
`file_ext` | `Optional[str]` | File extension (e.g., `"png"`, `"jpg"`). | *required*
`**parameters` | `Dict[str, Any]` | Additional backend-specific options forwarded to the implementation (e.g., timeouts, user tags, temperature, etc.). | `{}`

Returns:

Name | Type | Description
---|---|---
`str` | `str` | The extracted text or the model's textual response.

Raises:

Type | Description
---|---
`ValueError` | If `file` is `None` or otherwise invalid.
`RuntimeError` | If the inference call fails or returns an unexpected response shape.
Source code in src/splitter_mr/model/base_model.py
get_client()
abstractmethod
¶
Return the underlying client or handle.
Returns:
Name | Type | Description
---|---|---
`Any` | `Any` | A client/handle that the implementation uses to perform inference (e.g., an SDK client instance, session object, or lightweight wrapper). May be `None` for implementations that do not use an external client.
Source code in src/splitter_mr/model/base_model.py
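Because `BaseVisionModel` is an abstract base class, you can plug in your own backend by implementing its three abstract methods. A minimal sketch of a toy adapter (it calls no real backend and only shows the required interface; the import path assumes `BaseVisionModel` is exported alongside the other adapters):

```python
from typing import Any, Optional

from splitter_mr.model import BaseVisionModel


class EchoVisionModel(BaseVisionModel):
    """Toy adapter that echoes the prompt instead of calling a vision backend."""

    def __init__(self, model_name: str = "echo"):
        self.model_name = model_name

    def get_client(self) -> Any:
        # No external SDK is used, so there is no client to expose.
        return None

    def analyze_content(
        self,
        prompt: str,
        file: Optional[bytes],
        file_ext: Optional[str] = "png",
        **parameters: Any,
    ) -> str:
        if file is None:
            raise ValueError("file must be base64-encoded image bytes, not None")
        # A real adapter would build a data URI here and call its backend.
        return f"[{self.model_name}] {prompt}"
```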
OpenAIVisionModel¶
OpenAIVisionModel
¶
Bases: BaseVisionModel
Implementation of `BaseVisionModel` leveraging OpenAI's Chat Completions API.
Uses the `client.chat.completions.create()` method to send base64-encoded images along with text prompts in a single multimodal request.
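Under the hood this corresponds to the standard OpenAI multimodal chat request; a roughly equivalent raw-SDK sketch (file name and prompt are illustrative):

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("example.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the content of this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{img_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```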
Source code in src/splitter_mr/model/models/openai_model.py
__init__(api_key=None, model_name=os.getenv('OPENAI_MODEL', 'gpt-4o'))
¶
Initialize the OpenAIVisionModel.
Parameters:
Name | Type | Description | Default
---|---|---|---
`api_key` | `str` | OpenAI API key. If not provided, uses the `OPENAI_API_KEY` environment variable. | `None`
`model_name` | `str` | Vision-capable model name (e.g., `"gpt-4o"`). | `getenv('OPENAI_MODEL', 'gpt-4o')`

Raises:

Type | Description
---|---
`ValueError` | If no API key is provided or found in the `OPENAI_API_KEY` environment variable.
Source code in src/splitter_mr/model/models/openai_model.py
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)
¶
Extract text from an image using OpenAI's Chat Completions API.
Encodes the provided image bytes as a base64 data URI and sends it along with a textual prompt to the specified vision-capable model. The model processes the image and returns extracted text.
Parameters:
Name | Type | Description | Default
---|---|---|---
`file` | `bytes` | Base64-encoded image content without the `data:<mime>;base64,` prefix. | *required*
`prompt` | `str` | Instruction text guiding the extraction. Defaults to `DEFAULT_IMAGE_CAPTION_PROMPT`. | `DEFAULT_IMAGE_CAPTION_PROMPT`
`file_ext` | `str` | File extension (e.g., `"png"`, `"jpg"`). | `'png'`
`**parameters` | `Any` | Additional keyword arguments passed directly to the OpenAI client `chat.completions.create()` call. | `{}`

Returns:

Name | Type | Description
---|---|---
`str` | `str` | Extracted text returned by the model.

Raises:

Type | Description
---|---
`ValueError` | If `file` is empty or `None`.
`OpenAIError` | If the API request fails.
Example
from splitter_mr.model import OpenAIVisionModel
import base64
model = OpenAIVisionModel(api_key="sk-...")
with open("example.png", "rb") as f:
img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.analyze_content(img_b64, prompt="Describe the content of this image.")
print(text)
Source code in src/splitter_mr/model/models/openai_model.py
get_client()
¶
Get the underlying OpenAI client instance.
Returns:
Name | Type | Description
---|---|---
`OpenAI` | `OpenAI` | The initialized API client.
Source code in src/splitter_mr/model/models/openai_model.py
AzureOpenAIVisionModel¶
AzureOpenAIVisionModel
¶
Bases: BaseVisionModel
Implementation of `BaseVisionModel` for Azure OpenAI Vision using the Responses API.
Utilizes Azure's preview `responses` API, which supports base64-encoded images and stateful multimodal calls.
Source code in src/splitter_mr/model/models/azure_openai_model.py
__init__(api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None)
¶
Initializes the AzureOpenAIVisionModel.
Parameters:
Name | Type | Description | Default
---|---|---|---
`api_key` | `str` | Azure OpenAI API key. If not provided, uses the `AZURE_OPENAI_API_KEY` env var. | `None`
`azure_endpoint` | `str` | Azure endpoint. If not provided, uses the `AZURE_OPENAI_ENDPOINT` env var. | `None`
`azure_deployment` | `str` | Azure deployment name. If not provided, uses the `AZURE_OPENAI_DEPLOYMENT` env var. | `None`
`api_version` | `str` | API version string. If not provided, uses the `AZURE_OPENAI_API_VERSION` env var or defaults to `'2025-04-14-preview'`. | `None`

Raises:

Type | Description
---|---
`ValueError` | If no connection details are provided or environment variables are not set.
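You can either pass the connection details explicitly or rely on the environment variables listed above. A minimal sketch (the key, endpoint, and deployment values are placeholders):

```python
from splitter_mr.model import AzureOpenAIVisionModel

# Option 1: read everything from the AZURE_OPENAI_* environment variables.
model = AzureOpenAIVisionModel()

# Option 2: pass the connection details explicitly (placeholder values).
model = AzureOpenAIVisionModel(
    api_key="...",
    azure_endpoint="https://my-resource.openai.azure.com/",
    azure_deployment="my-gpt-4o-deployment",
    api_version="2025-04-14-preview",
)
```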
Source code in src/splitter_mr/model/models/azure_openai_model.py
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, file_ext='png', **parameters)
¶
Extract text from an image using the Azure OpenAI Vision model.
Encodes the given image as a data URI with an appropriate MIME type based on `file_ext` and sends it along with a prompt to the Azure OpenAI Vision API. The API processes the image and returns extracted text in the response.
Parameters:
Name | Type | Description | Default
---|---|---|---
`file` | `bytes` | Base64-encoded image content without the `data:<mime>;base64,` prefix. | *required*
`prompt` | `str` | Instruction text guiding the extraction. Defaults to `DEFAULT_IMAGE_CAPTION_PROMPT`. | `DEFAULT_IMAGE_CAPTION_PROMPT`
`file_ext` | `str` | File extension (e.g., `"png"`, `"jpg"`). | `'png'`
`**parameters` | `Any` | Additional keyword arguments passed directly to the underlying Azure OpenAI client call. | `{}`

Returns:

Name | Type | Description
---|---|---
`str` | `str` | The extracted text returned by the vision model.

Raises:

Type | Description
---|---
`ValueError` | If `file` is empty or `None`.
`OpenAIError` | If the API request fails.
Example
import base64
from splitter_mr.model import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel(...)
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.analyze_content(img_b64, prompt="Describe this image", file_ext="jpg")
print(text)
Source code in src/splitter_mr/model/models/azure_openai_model.py
get_client()
¶
Returns the AzureOpenAI client instance.
Source code in src/splitter_mr/model/models/azure_openai_model.py
GrokVisionModel¶
GrokVisionModel
¶
Bases: BaseVisionModel
Implementation of `BaseVisionModel` for Grok Vision using the xAI API.
Provides methods to interact with Grok’s multimodal models that support base64-encoded images and natural language instructions. This class is designed to extract structured text descriptions or captions from images.
Source code in src/splitter_mr/model/models/grok_model.py
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 |
|
__init__(api_key=os.getenv('XAI_API_KEY'), model_name=os.getenv('XAI_MODEL', 'grok-4'))
¶
Initializes the GrokVisionModel.
Parameters:
Name | Type | Description | Default
---|---|---|---
`api_key` | `str` | Grok API key. If not provided, uses the `XAI_API_KEY` environment variable. | `getenv('XAI_API_KEY')`
`model_name` | `str` | Model identifier to use. If not provided, defaults to the `XAI_MODEL` environment variable or `"grok-4"`. | `getenv('XAI_MODEL', 'grok-4')`

Raises:

Type | Description
---|---
`ValueError` | If no API key is provided or found in the `XAI_API_KEY` environment variable.
Source code in src/splitter_mr/model/models/grok_model.py
analyze_content(file, prompt=None, *, file_ext='png', detail='auto', **parameters)
¶
Extract text from an image using the Grok Vision model.
Encodes the given image as a data URI with an appropriate MIME type based on `file_ext` and sends it along with a prompt to the Grok API. The API processes the image and returns extracted text in the response.
Parameters:
Name | Type | Description | Default
---|---|---|---
`file` | `bytes` | Base64-encoded image content without the `data:<mime>;base64,` prefix. | *required*
`prompt` | `str` | Instruction text guiding the extraction. | `None`
`file_ext` | `str` | File extension (e.g., `"png"`, `"jpg"`). | `'png'`
`detail` | `str` | Level of detail to request for the image analysis. Options typically include `"low"`, `"high"`, and `"auto"`. | `'auto'`
`**parameters` | `Any` | Additional keyword arguments passed directly to the underlying Grok client call. | `{}`

Returns:

Name | Type | Description
---|---|---
`str` | `str` | The extracted text returned by the vision model.

Raises:

Type | Description
---|---
`ValueError` | If `file` is empty or `None`.
`OpenAIError` | If the API request fails.
Example
import base64
from splitter_mr.model import GrokVisionModel

model = GrokVisionModel()
with open("image.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")
text = model.analyze_content(
    img_b64, prompt="What's in this image?", file_ext="jpg", detail="high"
)
print(text)
Source code in src/splitter_mr/model/models/grok_model.py
get_client()
¶
Returns the underlying Grok API client.
Returns:
Name | Type | Description
---|---|---
`Client` | `Client` | The initialized Grok `Client` instance.
Source code in src/splitter_mr/model/models/grok_model.py
GeminiVisionModel¶
GeminiVisionModel
¶
Bases: BaseVisionModel
Implementation of `BaseVisionModel` using Google's Gemini Image Understanding API.
Source code in src/splitter_mr/model/models/gemini_model.py
__init__(api_key=None, model_name='gemini-2.5-flash')
¶
Initialize the GeminiVisionModel.
Parameters:
Name | Type | Description | Default
---|---|---|---
`api_key` | `Optional[str]` | Gemini API key. If not provided, uses the `GEMINI_API_KEY` env var. | `None`
`model_name` | `str` | Vision-capable Gemini model name. | `'gemini-2.5-flash'`

Raises:

Type | Description
---|---
`ImportError` | If the `multimodal` extra is not installed.
`ValueError` | If no API key is provided or `GEMINI_API_KEY` is not set.
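The Gemini adapter requires the multimodal extra. A minimal setup sketch, assuming `GEMINI_API_KEY` is set in the environment and that the import path mirrors the other adapters:

```python
# Requires the multimodal extra: pip install 'splitter-mr[multimodal]'
from splitter_mr.model import GeminiVisionModel

# Reads GEMINI_API_KEY from the environment; raises ValueError if it is missing.
model = GeminiVisionModel(model_name="gemini-2.5-flash")
```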
Source code in src/splitter_mr/model/models/gemini_model.py
analyze_content(prompt, file, file_ext=None, **parameters)
¶
Extract text from an image using Gemini's image understanding API.
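Note that, unlike the OpenAI-style adapters, the Gemini signature takes the prompt first and the base64 image second. A minimal usage sketch (the file name is illustrative):

```python
import base64

from splitter_mr.model import GeminiVisionModel

model = GeminiVisionModel()
with open("diagram.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content("Describe this image.", img_b64, file_ext="png")
print(text)
```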
Source code in src/splitter_mr/model/models/gemini_model.py
get_client()
¶
Return the underlying Gemini SDK client.
Source code in src/splitter_mr/model/models/gemini_model.py
AnthropicVisionModel¶
AnthropicVisionModel
¶
Bases: BaseVisionModel
Implementation of `BaseVisionModel` using Anthropic's Claude Vision API via the OpenAI SDK.
Sends base64-encoded images and prompts to the Claude multimodal endpoint.
Source code in src/splitter_mr/model/models/anthropic_model.py
__init__(api_key=None, model_name=os.getenv('ANTHROPIC_MODEL', 'claude-sonnet-4-20250514'))
¶
Initialize the AnthropicVisionModel.
Parameters:
Name | Type | Description | Default
---|---|---|---
`api_key` | `str` | Anthropic API key. Uses the `ANTHROPIC_API_KEY` env var if not provided. | `None`
`model_name` | `str` | Vision-capable Claude model name. | `getenv('ANTHROPIC_MODEL', 'claude-sonnet-4-20250514')`

Raises:

Type | Description
---|---
`ValueError` | If no API key is provided or found in the environment.
Source code in src/splitter_mr/model/models/anthropic_model.py
analyze_content(file, prompt=DEFAULT_IMAGE_CAPTION_PROMPT, *, file_ext='png', **parameters)
¶
Extract text from an image using Anthropic's Claude Vision API.
Parameters:
Name | Type | Description | Default
---|---|---|---
`prompt` | `str` | Task or instruction (e.g., "Describe the image contents"). | `DEFAULT_IMAGE_CAPTION_PROMPT`
`file` | `bytes` | Base64-encoded image content, no prefix/header. | *required*
`file_ext` | `str` | File extension (e.g., `"png"`, `"jpg"`). | `'png'`
`**parameters` | `Dict[str, Any]` | Extra arguments to `client.chat.completions.create()`. | `{}`

Returns:

Name | Type | Description
---|---|---
`str` | `str` | Extracted text or model response.

Raises:

Type | Description
---|---
`ValueError` | If `file` is `None` or the file type is unsupported.
`RuntimeError` | For failed or invalid responses.
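Usage mirrors the other OpenAI-compatible adapters. A minimal sketch, assuming `ANTHROPIC_API_KEY` is set (the file name is illustrative):

```python
import base64

from splitter_mr.model import AnthropicVisionModel

model = AnthropicVisionModel()  # or AnthropicVisionModel(api_key="sk-ant-...")
with open("receipt.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

text = model.analyze_content(img_b64, prompt="Read all visible text.", file_ext="jpg")
print(text)
```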
Source code in src/splitter_mr/model/models/anthropic_model.py
get_client()
¶
Get the underlying Anthropic API client instance.
Returns:
Name | Type | Description
---|---|---
`OpenAI` | `OpenAI` | The initialized API client.
Source code in src/splitter_mr/model/models/anthropic_model.py
HuggingFaceVisionModel¶
Warning
`HuggingFaceVisionModel` cannot currently support every model available on Hugging Face. For example, closed models (e.g., Microsoft Florence 2 large) and models that use uncommon architectures (e.g., NanoNets) are not supported. We strongly recommend using SmolDocling, since it has been exhaustively tested.
HuggingFaceVisionModel
¶
Bases: BaseVisionModel
Vision-language model wrapper using Hugging Face Transformers.
This implementation loads a local or Hugging Face Hub model that supports image-to-text or multimodal tasks. It accepts a prompt and an image as base64 (without the data URI header) and returns the model's generated text. Pydantic schema models are used for message validation.
Example
import base64, requests
from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel
# Encode an image as base64
img_bytes = requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/p-blog/candy.JPG"
).content
img_b64 = base64.b64encode(img_bytes).decode("utf-8")
model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
result = model.analyze_content("What animal is on the candy?", file=img_b64)
print(result) # e.g., "A small green thing."
Source code in src/splitter_mr/model/models/huggingface_model.py
__init__(model_name='ds4sd/SmolDocling-256M-preview')
¶
Initialize a HuggingFaceVisionModel.
Parameters:
Name | Type | Description | Default
---|---|---|---
`model_name` | `str` | Model repo ID or local path (e.g., `"ds4sd/SmolDocling-256M-preview"`). | `'ds4sd/SmolDocling-256M-preview'`

Raises:

Type | Description
---|---
`ImportError` | If the `multimodal` extra (transformers) is not installed.
`RuntimeError` | If processor or model loading fails after all attempts.
Source code in src/splitter_mr/model/models/huggingface_model.py
analyze_content(prompt, file, file_ext=None, **parameters)
¶
Extract text from an image using the vision-language model.
This method encodes an image as a data URI, builds a validated message using schema models, prepares inputs, and calls the model to generate a textual response.
Parameters:
Name | Type | Description | Default
---|---|---|---
`prompt` | `str` | Instruction or question for the model (e.g., `"What animal is on the candy?"`). | *required*
`file` | `Optional[bytes]` | Image as a base64-encoded string (without prefix). | *required*
`file_ext` | `Optional[str]` | File extension (e.g., `"png"`, `"jpg"`). | `None`
`**parameters` | `Dict[str, Any]` | Extra keyword arguments passed directly to the model's generation call. | `{}`

Returns:

Name | Type | Description
---|---|---
`str` | `str` | The extracted or generated text.

Raises:

Type | Description
---|---
`ValueError` | If `file` is `None` or the file type is unsupported.
`RuntimeError` | If input preparation or inference fails.
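A minimal usage sketch reusing the candy image from the class-level example; `max_new_tokens` is a typical Transformers generation argument and is assumed here to be forwarded through `**parameters`:

```python
import base64

import requests

from splitter_mr.model.models.huggingface_model import HuggingFaceVisionModel

img_bytes = requests.get(
    "https://huggingface.co/datasets/huggingface/documentation-images/"
    "resolve/main/p-blog/candy.JPG"
).content
img_b64 = base64.b64encode(img_bytes).decode("utf-8")

model = HuggingFaceVisionModel("ds4sd/SmolDocling-256M-preview")
result = model.analyze_content(
    "Describe the image.",
    file=img_b64,
    file_ext="jpg",
    max_new_tokens=128,  # assumption: forwarded to the model's generation call
)
print(result)
```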
Source code in src/splitter_mr/model/models/huggingface_model.py
get_client()
¶
Return the underlying HuggingFace model instance.
Returns:
Name | Type | Description
---|---|---
`Any` | `Any` | The instantiated HuggingFace model object.
Source code in src/splitter_mr/model/models/huggingface_model.py