Embedding Models¶
Encoder models are the engines that produce embeddings: vector representations of your input. These embeddings capture mathematical relationships between semantic units (such as words, sentences, or even images).
Why does this matter? Because once you have embeddings, you can:
- Measure how relevant a word is within a text.
- Compare the similarity between two pieces of text.
- Power search, clustering, and recommendation systems.
SplitterMR takes advantage of these models to split text into chunks based on meaning, not just size: sentences with similar context end up together, regardless of length or position. This approach is called SemanticSplitter, and it is ideal when you want your chunks to make sense rather than merely follow arbitrary size limits.
Below is the list of embedding models you can use out of the box. If you want to bring your own, simply implement BaseEmbedding and plug it in.
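To make "compare the similarity between two pieces of text" concrete, here is a minimal sketch of cosine similarity between two embedding vectors. It uses only plain Python (no SplitterMR APIs), and the toy vectors are made up for illustration; real embeddings from the providers below have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
v_cat = [0.9, 0.1, 0.0]
v_dog = [0.8, 0.2, 0.0]
v_car = [0.0, 0.1, 0.9]

print(cosine_similarity(v_cat, v_dog))  # close to 1: similar meaning
print(cosine_similarity(v_cat, v_car))  # close to 0: unrelated
```

The same comparison works on the `List[float]` vectors returned by any of the embedders documented below.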
Embedders¶
BaseEmbedding¶
Bases: ABC
Abstract base for text embedding providers.
Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local models) and expose a consistent interface to convert text into numeric vectors suitable for similarity search, clustering, and retrieval-augmented generation.
Source code in src/splitter_mr/embedding/base_embedding.py
__init__(model_name) abstractmethod¶
Initialize the embedding backend.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`model_name` | `str` | Identifier of the embedding model (e.g., `"text-embedding-3-large"`). | required |
Raises:

Type | Description |
---|---|
`ValueError` | If required configuration or credentials are missing. |
Source code in src/splitter_mr/embedding/base_embedding.py
get_client() abstractmethod¶
Return the underlying client or handle.
Returns:

Name | Type | Description |
---|---|---|
Any | `Any` | A client/handle used to perform embedding calls (e.g., an SDK client instance, session object, or local runner). |
Source code in src/splitter_mr/embedding/base_embedding.py
embed_text(text, **parameters) abstractmethod¶
Compute an embedding vector for the given text.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | Input text to embed. Implementations may apply normalization or truncation according to model limits. | required |
`**parameters` | `Dict[str, Any]` | Additional backend-specific options forwarded to the implementation (e.g., user tags, request IDs). | `{}` |
Returns:

Type | Description |
---|---|
`List[float]` | A single embedding vector representing the input text. |

Raises:

Type | Description |
---|---|
`ValueError` | If `text` is empty or invalid. |
`RuntimeError` | If the embedding call fails or returns an unexpected response shape. |
Source code in src/splitter_mr/embedding/base_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts (the default implementation loops over embed_text). Implementations are encouraged to override this for true batch performance.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`texts` | `List[str]` | List of input strings to embed. | required |
`**parameters` | `Dict[str, Any]` | Backend-specific options. | `{}` |
Returns:

Type | Description |
---|---|
`List[List[float]]` | List of embedding vectors, one per input string. |

Raises:

Type | Description |
---|---|
`ValueError` | If `texts` is empty or contains invalid items. |
Source code in src/splitter_mr/embedding/base_embedding.py
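The interface above can be satisfied by any backend. The following sketch shows what a custom provider looks like. To keep the snippet self-contained it defines a local stand-in mirroring the documented ABC, and the `HashEmbedding` class with its hash-based pseudo-vectors is a toy invented for illustration; in practice you would subclass `splitter_mr.embedding.BaseEmbedding` directly and call a real model.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseEmbedding(ABC):
    """Local stand-in mirroring the documented interface."""

    @abstractmethod
    def __init__(self, model_name: str): ...

    @abstractmethod
    def get_client(self) -> Any: ...

    @abstractmethod
    def embed_text(self, text: str, **parameters: Dict[str, Any]) -> List[float]: ...

    def embed_documents(
        self, texts: List[str], **parameters: Dict[str, Any]
    ) -> List[List[float]]:
        # Default behavior per the docs: loop over embed_text;
        # override for true batch performance.
        if not texts:
            raise ValueError("texts must be non-empty")
        return [self.embed_text(t, **parameters) for t in texts]


class HashEmbedding(BaseEmbedding):
    """Toy provider: hashes the text into a fixed-size pseudo-vector."""

    def __init__(self, model_name: str = "hash-8"):
        self.model_name = model_name

    def get_client(self) -> Any:
        return None  # no external client needed for this toy backend

    def embed_text(self, text: str, **parameters: Dict[str, Any]) -> List[float]:
        if not text:
            raise ValueError("text must be non-empty")
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[:8]]


embedder = HashEmbedding()
vectors = embedder.embed_documents(["alpha", "beta"])
```

A real subclass would wire `get_client()` to an SDK client and raise `RuntimeError` on backend failures, as the contract above specifies.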
OpenAIEmbedding¶
Bases: BaseEmbedding
Encoder provider using OpenAI's embeddings API.
This class wraps OpenAI's embeddings endpoint, providing convenience methods for both single-text and batch embeddings. It also adds token counting and validation to avoid exceeding model limits.
Example:

```python
from splitter_mr.embedding import OpenAIEmbedding

embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
vector = embedder.embed_text("hello world")
print(vector)
```
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
__init__(model_name='text-embedding-3-large', api_key=None, tokenizer_name=None)¶
Initialize the OpenAI embeddings provider.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`model_name` | `str` | The OpenAI embedding model name (e.g., `"text-embedding-3-large"`). | `'text-embedding-3-large'` |
`api_key` | `Optional[str]` | API key for OpenAI. If not provided, reads from the `OPENAI_API_KEY` environment variable. | `None` |
`tokenizer_name` | `Optional[str]` | Optional explicit tokenizer name for tiktoken. | `None` |
Raises:

Type | Description |
---|---|
`ValueError` | If the API key is not provided or the tokenizer cannot be resolved. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
get_client()¶
Get the configured OpenAI client.
Returns:

Name | Type | Description |
---|---|---|
OpenAI | `OpenAI` | The OpenAI API client instance. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
embed_text(text, **parameters)¶
Compute an embedding vector for a single text string.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | The text to embed. Must be non-empty and within the model's token limit. | required |
`**parameters` | `Any` | Additional keyword arguments forwarded to the OpenAI embeddings API. | `{}` |
Returns:

Type | Description |
---|---|
`List[float]` | The computed embedding vector. |

Raises:

Type | Description |
---|---|
`ValueError` | If `text` is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts in one API call.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`texts` | `List[str]` | List of text strings to embed. All must be non-empty and within the model's token limit. | required |
`**parameters` | `Any` | Additional keyword arguments forwarded to the OpenAI embeddings API. | `{}` |
Returns:

Type | Description |
---|---|
`List[List[float]]` | A list of embedding vectors, one per input string. |

Raises:

Type | Description |
---|---|
`ValueError` | If `texts` is empty, or any item is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
AzureOpenAIEmbedding¶
Bases: BaseEmbedding
Encoder provider using Azure OpenAI Embeddings.
This class wraps Azure OpenAI's embeddings API, handling both authentication and tokenization. It supports both direct embedding calls for a single text (embed_text) and batch embedding calls (embed_documents).

Azure deployments use deployment names (e.g., my-embedding-deployment) instead of OpenAI's standard model names. Since tiktoken may not be able to map a deployment name to a tokenizer automatically, this class implements a fallback mechanism to use a known encoding (e.g., cl100k_base) when necessary.
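The fallback described above can be sketched as follows. The mapping table and `resolve_encoding` function are illustrative stand-ins, not SplitterMR's actual code; with tiktoken installed, the same idea is typically expressed as `tiktoken.encoding_for_model(name)` wrapped in a `try/except KeyError` that falls back to `tiktoken.get_encoding("cl100k_base")`.

```python
# Known model -> encoding pairs (illustrative subset).
KNOWN_ENCODINGS = {
    "text-embedding-3-large": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
}

def resolve_encoding(deployment_or_model: str, fallback: str = "cl100k_base") -> str:
    """Return the tokenizer encoding for a name, falling back to a known
    encoding for Azure deployment names that cannot be mapped automatically."""
    return KNOWN_ENCODINGS.get(deployment_or_model, fallback)

enc_known = resolve_encoding("text-embedding-3-large")   # -> "cl100k_base"
enc_azure = resolve_encoding("my-embedding-deployment")  # falls back to "cl100k_base"
```

Passing `tokenizer_name` explicitly skips the guessing entirely, which is the safest route for unusual deployment names.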
Example:

```python
from splitter_mr.embedding import AzureOpenAIEmbedding

embedder = AzureOpenAIEmbedding(
    azure_deployment="text-embedding-3-large",
    api_key="...",
    azure_endpoint="https://my-azure-endpoint.openai.azure.com/",
)
vector = embedder.embed_text("Hello world")
```
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
__init__(model_name=None, api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None, tokenizer_name=None)¶
Initialize the Azure OpenAI Embedding provider.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`model_name` | `Optional[str]` | OpenAI model name (unused for Azure, but kept for API parity). If `None`, the deployment name is used instead. | `None` |
`api_key` | `Optional[str]` | API key for Azure OpenAI. If not provided, it will be read from the environment variable `AZURE_OPENAI_API_KEY`. | `None` |
`azure_endpoint` | `Optional[str]` | The base endpoint for the Azure OpenAI service. If not provided, it will be read from `AZURE_OPENAI_ENDPOINT`. | `None` |
`azure_deployment` | `Optional[str]` | Deployment name for the embeddings model in Azure OpenAI. If not provided, it will be read from the environment. | `None` |
`api_version` | `Optional[str]` | Azure API version string. If not provided, a library default is used. | `None` |
`tokenizer_name` | `Optional[str]` | Optional explicit tokenizer name for tiktoken. | `None` |
Raises:

Type | Description |
---|---|
`ValueError` | If any required parameter is neither provided nor found in the environment variables. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
get_client()¶
Get the underlying Azure OpenAI client.
Returns:

Name | Type | Description |
---|---|---|
AzureOpenAI | `AzureOpenAI` | The configured Azure OpenAI API client. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
embed_text(text, **parameters)¶
Compute an embedding vector for a single text string.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | The text to embed. Must be non-empty and within the model's token limit. | required |
`**parameters` | `Any` | Additional parameters to forward to the Azure OpenAI embeddings API. | `{}` |
Returns:

Type | Description |
---|---|
`List[float]` | The computed embedding vector. |

Raises:

Type | Description |
---|---|
`ValueError` | If `text` is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts in a single API call.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`texts` | `List[str]` | List of text strings to embed. All items must be non-empty strings within the token limit. | required |
`**parameters` | `Any` | Additional parameters to forward to the Azure OpenAI embeddings API. | `{}` |
Returns:

Type | Description |
---|---|
`List[List[float]]` | A list of embedding vectors, one per input text. |

Raises:

Type | Description |
---|---|
`ValueError` | If `texts` is empty, or any item is empty or exceeds the token limit. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py