Embedding Models

Encoder models are the engines that produce embeddings — vectorized representations of your input (see the image below). These embeddings capture mathematical relationships between semantic units (like words, sentences, or even images).

Why does this matter? Because once you have embeddings, you can:
- Measure how relevant a word is within a text.
- Compare the similarity between two pieces of text.
- Power search, clustering, and recommendation systems.
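
As a rough illustration of the second point, similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses hand-written toy vectors rather than real model output, and the `cosine_similarity` helper is an illustrative name, not part of SplitterMR.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimension.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


# Hand-written toy vectors (assumption): real embeddings have far more dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

Values close to 1 indicate vectors pointing in nearly the same direction, while values near 0 indicate unrelated content.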

Example of an embedding representation

SplitterMR uses these models to split text into chunks by meaning rather than by size alone: sentences with similar context end up in the same chunk, regardless of their length or position. This approach is called SemanticSplitter, and it is the right choice when chunks should make sense on their own rather than follow arbitrary size limits.
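
The idea of meaning-based chunking can be sketched in a few lines. This is not SemanticSplitter's actual algorithm: it is a simplified stand-in that starts a new chunk whenever the cosine similarity between consecutive sentences drops below a threshold. The `toy_embed` function, the `TOPICS` vocabulary, and the threshold value are all assumptions made for illustration.

```python
import math
from typing import Callable, List

# Hypothetical topic vocabulary used by the toy embedder below.
TOPICS = (("cat", "dog", "pet"), ("tax", "invoice", "payment"))


def toy_embed(sentence: str) -> List[float]:
    """Stand-in embedder: one keyword count per topic, plus a small constant."""
    words = sentence.lower().replace(".", "").split()
    counts = [float(sum(w in topic for w in words)) for topic in TOPICS]
    return counts + [0.1]  # the constant avoids all-zero vectors


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def group_by_similarity(
    sentences: List[str],
    embed: Callable[[str], List[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Append a sentence to the current chunk only if it resembles the previous one."""
    chunks: List[List[str]] = []
    previous = None
    for sentence in sentences:
        vector = embed(sentence)
        if previous is not None and _cosine(previous, vector) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
        previous = vector
    return chunks


sentences = [
    "My cat sleeps all day.",
    "The dog is a loyal pet.",
    "The invoice includes tax.",
    "Payment is due Friday.",
]
chunks = group_by_similarity(sentences, toy_embed)
for chunk in chunks:
    print(" ".join(chunk))
```

With a real embedding model in place of `toy_embed`, the same loop groups sentences by semantic proximity instead of keyword overlap.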

Below is the list of embedding models you can use out-of-the-box.
And if you want to bring your own, simply implement BaseEmbedding and plug it in.
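
A minimal sketch of a custom embedder follows. It assumes `BaseEmbedding` can be imported from `splitter_mr.embedding` (mirroring the import used in the `OpenAIEmbedding` example on this page) and falls back to a local stand-in with the same interface so the snippet runs without the package installed. `HashingEmbedding` is a hypothetical toy class that hashes words into a fixed-size count vector. It involves no model and is only meant to show which methods need implementing.

```python
import hashlib
from typing import Any, List

try:
    # Assumed import path; adjust if your installed version exposes it elsewhere.
    from splitter_mr.embedding import BaseEmbedding
except ImportError:
    from abc import ABC, abstractmethod

    class BaseEmbedding(ABC):  # local stand-in mirroring the documented interface
        @abstractmethod
        def __init__(self, model_name: str) -> None: ...

        @abstractmethod
        def get_client(self) -> Any: ...

        @abstractmethod
        def embed_text(self, text: str, **parameters: Any) -> List[float]: ...

        def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
            if not texts:
                raise ValueError("`texts` must be a non-empty list of strings.")
            return [self.embed_text(t, **parameters) for t in texts]


class HashingEmbedding(BaseEmbedding):
    """Hypothetical toy embedder: hashed bag-of-words counts."""

    def __init__(self, model_name: str = "hashing-toy", dimensions: int = 8) -> None:
        self.model_name = model_name
        self.dimensions = dimensions

    def get_client(self) -> None:
        return None  # pure-local implementation, so there is no client

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        vector = [0.0] * self.dimensions
        for word in text.lower().split():
            # sha256 keeps the bucket assignment stable across runs.
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            vector[digest[0] % self.dimensions] += 1.0
        return vector


embedder = HashingEmbedding()
vectors = embedder.embed_documents(["hello world", "hello again"])
print(len(vectors), len(vectors[0]))
```

Because `embed_documents` has a default implementation that loops over `embed_text`, a custom class only has to provide `__init__`, `get_client`, and `embed_text`.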

Embedders

BaseEmbedding

Bases: ABC

Abstract base for text embedding providers.

Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local models) and expose a consistent interface to convert text into numeric vectors suitable for similarity search, clustering, and retrieval-augmented generation.

Source code in src/splitter_mr/embedding/base_embedding.py
class BaseEmbedding(ABC):
    """
    Abstract base for text embedding providers.

    Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local
    models) and expose a consistent interface to convert text into numeric
    vectors suitable for similarity search, clustering, and retrieval-augmented
    generation.
    """

    @abstractmethod
    def __init__(self, model_name: str) -> None:
        """Initialize the embedding backend.

        Args:
            model_name (str): Identifier of the embedding model (e.g.,
                ``"text-embedding-3-large"`` or a local model alias/path).

        Raises:
            ValueError: If required configuration or credentials are missing.
        """

    @abstractmethod
    def get_client(self) -> Any:
        """Return the underlying client or handle.

        Returns:
            Any: A client/handle used to perform embedding calls (e.g., an SDK
                client instance, session object, or local runner). May be ``None``
                for pure-local implementations that do not require a client.
        """

    @abstractmethod
    def embed_text(
        self,
        text: str,
        **parameters: Any,
    ) -> List[float]:
        """
        Compute an embedding vector for the given text.

        Args:
            text (str): Input text to embed. Implementations may apply
                normalization or truncation according to model limits.
            **parameters (Any): Additional backend-specific options
                forwarded to the implementation (e.g., user tags, request IDs).

        Returns:
            A single embedding vector representing ``text``.

        Raises:
            ValueError: If ``text`` is empty or exceeds backend constraints.
            RuntimeError: If the embedding call fails or returns an unexpected
                response shape.
        """

    def embed_documents(
        self,
        texts: List[str],
        **parameters: Any,
    ) -> List[List[float]]:
        """Compute embeddings for multiple texts (default loops over `embed_text`).

        Implementations are encouraged to override for true batch performance.

        Args:
            texts: List of input strings to embed.
            **parameters: Backend-specific options.

        Returns:
            List of embedding vectors, one per input string.

        Raises:
            ValueError: If `texts` is empty or any element is empty.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        return [self.embed_text(t, **parameters) for t in texts]
__init__(model_name) abstractmethod

Initialize the embedding backend.

Parameters:

- model_name (str, required): Identifier of the embedding model (e.g., "text-embedding-3-large" or a local model alias/path).

Raises:

- ValueError: If required configuration or credentials are missing.

Source code in src/splitter_mr/embedding/base_embedding.py
@abstractmethod
def __init__(self, model_name: str) -> None:
    """Initialize the embedding backend.

    Args:
        model_name (str): Identifier of the embedding model (e.g.,
            ``"text-embedding-3-large"`` or a local model alias/path).

    Raises:
        ValueError: If required configuration or credentials are missing.
    """
get_client() abstractmethod

Return the underlying client or handle.

Returns:

- Any: A client/handle used to perform embedding calls (e.g., an SDK client instance, session object, or local runner). May be None for pure-local implementations that do not require a client.

Source code in src/splitter_mr/embedding/base_embedding.py
@abstractmethod
def get_client(self) -> Any:
    """Return the underlying client or handle.

    Returns:
        Any: A client/handle used to perform embedding calls (e.g., an SDK
            client instance, session object, or local runner). May be ``None``
            for pure-local implementations that do not require a client.
    """
embed_text(text, **parameters) abstractmethod

Compute an embedding vector for the given text.

Parameters:

- text (str, required): Input text to embed. Implementations may apply normalization or truncation according to model limits.
- **parameters (Any, default {}): Additional backend-specific options forwarded to the implementation (e.g., user tags, request IDs).

Returns:

- List[float]: A single embedding vector representing text.

Raises:

- ValueError: If text is empty or exceeds backend constraints.
- RuntimeError: If the embedding call fails or returns an unexpected response shape.

Source code in src/splitter_mr/embedding/base_embedding.py
@abstractmethod
def embed_text(
    self,
    text: str,
    **parameters: Any,
) -> List[float]:
    """
    Compute an embedding vector for the given text.

    Args:
        text (str): Input text to embed. Implementations may apply
            normalization or truncation according to model limits.
        **parameters (Any): Additional backend-specific options
            forwarded to the implementation (e.g., user tags, request IDs).

    Returns:
        A single embedding vector representing ``text``.

    Raises:
        ValueError: If ``text`` is empty or exceeds backend constraints.
        RuntimeError: If the embedding call fails or returns an unexpected
            response shape.
    """
embed_documents(texts, **parameters)

Compute embeddings for multiple texts (default loops over embed_text).

Implementations are encouraged to override for true batch performance.

Parameters:

- texts (List[str], required): List of input strings to embed.
- **parameters (Any, default {}): Backend-specific options.

Returns:

- List[List[float]]: List of embedding vectors, one per input string.

Raises:

- ValueError: If texts is empty or any element is empty.
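
One way an overriding `embed_documents` might achieve batch performance is to send the inputs in fixed-size groups instead of one call per text. The sketch below shows only that batching loop. `embed_in_batches`, `fake_backend`, and the `batch_size` default are hypothetical names and values chosen for illustration, and a real backend may impose its own batch limits.

```python
from typing import Callable, List

Vector = List[float]


def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    batch_size: int = 2,
) -> List[Vector]:
    """Embed `texts` with one backend call per batch, preserving input order."""
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if batch_size < 1:
        raise ValueError("`batch_size` must be at least 1.")
    vectors: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("Backend returned an unexpected number of vectors.")
        vectors.extend(result)
    return vectors


calls: List[int] = []


def fake_backend(batch: List[str]) -> List[Vector]:
    """Hypothetical batch endpoint: returns one vector per text (its length)."""
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_backend)
print(calls)    # sizes of the batches sent to the backend
print(vectors)  # one vector per input, in the original order
```

Five inputs with a batch size of two result in three backend calls rather than five, which is the saving the override is meant to capture.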

Source code in src/splitter_mr/embedding/base_embedding.py
def embed_documents(
    self,
    texts: List[str],
    **parameters: Any,
) -> List[List[float]]:
    """Compute embeddings for multiple texts (default loops over `embed_text`).

    Implementations are encouraged to override for true batch performance.

    Args:
        texts: List of input strings to embed.
        **parameters: Backend-specific options.

    Returns:
        List of embedding vectors, one per input string.

    Raises:
        ValueError: If `texts` is empty or any element is empty.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    return [self.embed_text(t, **parameters) for t in texts]

OpenAIEmbedding

Bases: BaseEmbedding

Encoder provider using OpenAI's embeddings API.

This class wraps OpenAI's embeddings endpoint, providing convenience methods for both single-text and batch embeddings. It also adds token counting and validation to avoid exceeding model limits.

Example
from splitter_mr.embedding import OpenAIEmbedding

embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
vector = embedder.embed_text("hello world")
print(vector)
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
class OpenAIEmbedding(BaseEmbedding):
    """
    Encoder provider using OpenAI's embeddings API.

    This class wraps OpenAI's embeddings endpoint, providing convenience
    methods for both single-text and batch embeddings. It also adds token
    counting and validation to avoid exceeding model limits.

    Example:
        ```python
        from splitter_mr.embedding import OpenAIEmbedding

        embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
        vector = embedder.embed_text("hello world")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "text-embedding-3-large",
        api_key: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
    ) -> None:
        """
        Initialize the OpenAI embeddings provider.

        Args:
            model_name (str):
                The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).
            api_key (Optional[str]):
                API key for OpenAI. If not provided, reads from the
                `OPENAI_API_KEY` environment variable.
            tokenizer_name (Optional[str]):
                Optional explicit tokenizer name for `tiktoken`. If provided,
                this overrides automatic model-to-tokenizer mapping.

        Raises:
            ValueError: If the API key is not provided or the `OPENAI_API_KEY` environment variable is not set.
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name
        self._tokenizer_name = tokenizer_name

    def get_client(self) -> OpenAI:
        """
        Get the configured OpenAI client.

        Returns:
            OpenAI: The OpenAI API client instance.
        """
        return self.client

    def _get_encoder(self):
        """
        Retrieve the `tiktoken` encoder for the configured model.

        If a `tokenizer_name` is explicitly provided, it is used. Otherwise,
        attempts to use `tiktoken.encoding_for_model`. If that fails, falls
        back to the default tokenizer defined by `OPENAI_EMBEDDING_MODEL_FALLBACK`.

        Returns:
            tiktoken.Encoding: The encoding object for tokenizing text.

        Raises:
            ValueError: If neither the model-specific nor fallback encoder
                can be loaded.
        """
        if self._tokenizer_name:
            return tiktoken.get_encoding(self._tokenizer_name)
        try:
            return tiktoken.encoding_for_model(self.model_name)
        except Exception:
            return tiktoken.get_encoding(OPENAI_EMBEDDING_MODEL_FALLBACK)

    def _count_tokens(self, text: str) -> int:
        """
        Count the number of tokens in the given text.

        Args:
            text (str): The text to tokenize.

        Returns:
            int: Number of tokens.
        """
        encoder = self._get_encoder()
        return len(encoder.encode(text))

    def _validate_token_length(self, text: str) -> None:
        """
        Ensure the text does not exceed the model's token limit.

        Args:
            text (str): The text to check.

        Raises:
            ValueError: If the token count exceeds `OPENAI_EMBEDDING_MAX_TOKENS`.
        """
        if self._count_tokens(text) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"Input text exceeds maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text (str):
                The text to embed. Must be non-empty and within the model's
                token limit.
            **parameters:
                Additional keyword arguments forwarded to
                `client.embeddings.create(...)`.

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds the token limit.
        """
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        self._validate_token_length(text)

        response = self.client.embeddings.create(
            input=text,
            model=self.model_name,
            **parameters,
        )
        return response.data[0].embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts in one API call.

        Args:
            texts (List[str]):
                List of text strings to embed. All must be non-empty and within
                the model's token limit.
            **parameters:
                Additional keyword arguments forwarded to
                `client.embeddings.create(...)`.

        Returns:
            A list of embedding vectors, one per input string.

        Raises:
            ValueError:
                - If `texts` is empty.
                - If any text is empty or not a string.
                - If any text exceeds the token limit.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        encoder = self._get_encoder()
        for t in texts:
            if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
                raise ValueError(
                    f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
                )

        response = self.client.embeddings.create(
            input=texts,
            model=self.model_name,
            **parameters,
        )
        return [data.embedding for data in response.data]
__init__(model_name='text-embedding-3-large', api_key=None, tokenizer_name=None)

Initialize the OpenAI embeddings provider.

Parameters:

- model_name (str, default 'text-embedding-3-large'): The OpenAI embedding model name (e.g., "text-embedding-3-large").
- api_key (Optional[str], default None): API key for OpenAI. If not provided, reads from the OPENAI_API_KEY environment variable.
- tokenizer_name (Optional[str], default None): Optional explicit tokenizer name for tiktoken. If provided, this overrides automatic model-to-tokenizer mapping.

Raises:

- ValueError: If the API key is not provided or the OPENAI_API_KEY environment variable is not set.

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def __init__(
    self,
    model_name: str = "text-embedding-3-large",
    api_key: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
) -> None:
    """
    Initialize the OpenAI embeddings provider.

    Args:
        model_name (str):
            The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).
        api_key (Optional[str]):
            API key for OpenAI. If not provided, reads from the
            `OPENAI_API_KEY` environment variable.
        tokenizer_name (Optional[str]):
            Optional explicit tokenizer name for `tiktoken`. If provided,
            this overrides automatic model-to-tokenizer mapping.

    Raises:
        ValueError: If the API key is not provided or the `OPENAI_API_KEY` environment variable is not set.
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
    self._tokenizer_name = tokenizer_name
get_client()

Get the configured OpenAI client.

Returns:

- OpenAI: The OpenAI API client instance.

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def get_client(self) -> OpenAI:
    """
    Get the configured OpenAI client.

    Returns:
        OpenAI: The OpenAI API client instance.
    """
    return self.client
embed_text(text, **parameters)

Compute an embedding vector for a single text string.

Parameters:

- text (str, required): The text to embed. Must be non-empty and within the model's token limit.
- **parameters (Any, default {}): Additional keyword arguments forwarded to client.embeddings.create(...).

Returns:

- List[float]: The computed embedding vector.

Raises:

- ValueError: If text is empty or exceeds the token limit.
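
The validation described here can be sketched independently of the OpenAI client. In the snippet below, `whitespace_count` is a stand-in for a real tokenizer (the class itself counts tokens with a `tiktoken` encoder), and `MAX_TOKENS = 8192` is a hypothetical placeholder for the library's `OPENAI_EMBEDDING_MAX_TOKENS` constant, whose actual value is not shown on this page.

```python
from typing import Callable

MAX_TOKENS = 8192  # hypothetical placeholder for OPENAI_EMBEDDING_MAX_TOKENS


def whitespace_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real tokens."""
    return len(text.split())


def validate_before_embedding(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
) -> None:
    """Raise ValueError for empty or over-long input before any API call is made."""
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    if count_tokens(text) > max_tokens:
        raise ValueError(
            f"Input text exceeds maximum allowed length of {max_tokens} tokens."
        )


validate_before_embedding("hello world", whitespace_count)  # passes silently

try:
    validate_before_embedding("one two three four", whitespace_count, max_tokens=3)
except ValueError as error:
    print(error)
```

Checking locally first means an over-long input fails fast with a clear message instead of costing a round trip to the API.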

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text (str):
            The text to embed. Must be non-empty and within the model's
            token limit.
        **parameters:
            Additional keyword arguments forwarded to
            `client.embeddings.create(...)`.

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds the token limit.
    """
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    self._validate_token_length(text)

    response = self.client.embeddings.create(
        input=text,
        model=self.model_name,
        **parameters,
    )
    return response.data[0].embedding
embed_documents(texts, **parameters)

Compute embeddings for multiple texts in one API call.

Parameters:

- texts (List[str], required): List of text strings to embed. All must be non-empty and within the model's token limit.
- **parameters (Any, default {}): Additional keyword arguments forwarded to client.embeddings.create(...).

Returns:

- List[List[float]]: A list of embedding vectors, one per input string.

Raises:

- ValueError: If texts is empty, if any text is empty or not a string, or if any text exceeds the token limit.
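
A typical use of batch embeddings is ranking documents against a query. The sketch below works with any object exposing `embed_text` and `embed_documents`. To stay runnable offline, it uses `KeywordEmbedder`, a hypothetical stand-in, where an `OpenAIEmbedding` instance would normally go. `rank_documents` is likewise an illustrative helper, not part of the library.

```python
import math
from typing import Any, List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_documents(embedder: Any, query: str, documents: List[str]) -> List[Tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least similar."""
    query_vector = embedder.embed_text(query)
    document_vectors = embedder.embed_documents(documents)  # whole batch at once
    scored = [(_cosine(query_vector, v), d) for v, d in zip(document_vectors, documents)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


class KeywordEmbedder:
    """Hypothetical stand-in with the same two methods as the real embedders."""

    VOCAB = ("cat", "dog", "tax")

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        words = text.lower().split()
        # The small offset keeps vectors non-zero for texts without any keyword.
        return [float(words.count(w)) + 0.01 for w in self.VOCAB]

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        return [self.embed_text(t) for t in texts]


ranked = rank_documents(
    KeywordEmbedder(),
    "cat",
    ["tax season", "dog and cat", "the cat sat"],
)
for score, document in ranked:
    print(f"{score:.3f}  {document}")
```

Embedding all documents in a single `embed_documents` call keeps the number of requests constant regardless of how many documents are ranked.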
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts in one API call.

    Args:
        texts (List[str]):
            List of text strings to embed. All must be non-empty and within
            the model's token limit.
        **parameters:
            Additional keyword arguments forwarded to
            `client.embeddings.create(...)`.

    Returns:
        A list of embedding vectors, one per input string.

    Raises:
        ValueError:
            - If `texts` is empty.
            - If any text is empty or not a string.
            - If any text exceeds the token limit.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    encoder = self._get_encoder()
    for t in texts:
        if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    response = self.client.embeddings.create(
        input=texts,
        model=self.model_name,
        **parameters,
    )
    return [data.embedding for data in response.data]

AzureOpenAIEmbedding

Bases: BaseEmbedding

Encoder provider using Azure OpenAI Embeddings.

This class wraps Azure OpenAI's embeddings API, handling both authentication and tokenization. It supports both direct embedding calls for a single text (embed_text) and batch embedding calls (embed_documents).

Azure deployments use deployment names (e.g., my-embedding-deployment) instead of OpenAI's standard model names. Since tiktoken may not be able to map a deployment name to a tokenizer automatically, this class implements a fallback mechanism to use a known encoding (e.g., cl100k_base) when necessary.

Example
from splitter_mr.embedding import AzureOpenAIEmbedding

embedder = AzureOpenAIEmbedding(
    azure_deployment="text-embedding-3-large",
    api_key="...",
    azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
)
vector = embedder.embed_text("Hello world")
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
class AzureOpenAIEmbedding(BaseEmbedding):
    """
    Encoder provider using Azure OpenAI Embeddings.

    This class wraps Azure OpenAI's embeddings API, handling both authentication
    and tokenization. It supports both direct embedding calls for a single text
    (`embed_text`) and batch embedding calls (`embed_documents`).

    Azure deployments use *deployment names* (e.g., `my-embedding-deployment`)
    instead of OpenAI's standard model names. Since `tiktoken` may not be able to
    map a deployment name to a tokenizer automatically, this class implements
    a fallback mechanism to use a known encoding (e.g., `cl100k_base`) when necessary.

    Example:
        ```python
        from splitter_mr.embedding import AzureOpenAIEmbedding

        embedder = AzureOpenAIEmbedding(
            azure_deployment="text-embedding-3-large",
            api_key="...",
            azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
        )
        vector = embedder.embed_text("Hello world")
        ```
    """

    def __init__(
        self,
        model_name: Optional[str] = None,
        api_key: Optional[str] = None,
        azure_endpoint: Optional[str] = None,
        azure_deployment: Optional[str] = None,
        api_version: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
    ) -> None:
        """
        Initialize the Azure OpenAI Embedding provider.

        Args:
            model_name (Optional[str]):
                OpenAI model name, kept for API parity. It is used only as a
                fallback deployment name when neither `azure_deployment` nor
                `AZURE_OPENAI_DEPLOYMENT` is set.
            api_key (Optional[str]):
                API key for Azure OpenAI. If not provided, it will be read from
                the environment variable `AZURE_OPENAI_API_KEY`.
            azure_endpoint (Optional[str]):
                The base endpoint for the Azure OpenAI service. If not provided,
                it will be read from `AZURE_OPENAI_ENDPOINT`.
            azure_deployment (Optional[str]):
                Deployment name for the embeddings model in Azure OpenAI. If not
                provided, it is read from `AZURE_OPENAI_DEPLOYMENT`, falling
                back to `model_name`.
            api_version (Optional[str]):
                Azure API version string. If not provided, it is read from
                `AZURE_OPENAI_API_VERSION`, falling back to
                `"2025-04-14-preview"`.
            tokenizer_name (Optional[str]):
                Optional explicit tokenizer name for `tiktoken` (e.g.,
                `"cl100k_base"`). If provided, it overrides the automatic mapping.

        Raises:
            ValueError: If the API key, endpoint, or deployment name is neither
                passed as an argument nor available from the environment.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
                )

        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )

        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided. Set 'azure_deployment', "
                    "'AZURE_OPENAI_DEPLOYMENT', or pass `model_name`."
                )

        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment
        self._tokenizer_name = tokenizer_name

    def get_client(self) -> AzureOpenAI:
        """
        Get the underlying Azure OpenAI client.

        Returns:
            AzureOpenAI: The configured Azure OpenAI API client.
        """
        return self.client

    def _get_encoder(self):
        """
        Retrieve the `tiktoken` encoder for this deployment.

        This method ensures compatibility with Azure's deployment names, which
        may not be directly recognized by `tiktoken`. If the user has explicitly
        provided a tokenizer name, that is used. Otherwise, the method first
        tries to look up the encoding via `tiktoken.encoding_for_model` using the
        deployment name. If that fails, it falls back to the default encoding
        defined by `OPENAI_EMBEDDING_MODEL_FALLBACK`.

        Returns:
            tiktoken.Encoding: A tokenizer encoding object.

        Raises:
            ValueError: If `tiktoken` fails to load the fallback encoding.
        """
        if self._tokenizer_name:
            return tiktoken.get_encoding(self._tokenizer_name)
        try:
            return tiktoken.encoding_for_model(self.model_name)
        except Exception:
            return tiktoken.get_encoding(OPENAI_EMBEDDING_MODEL_FALLBACK)

    def _count_tokens(self, text: str) -> int:
        """
        Count the number of tokens in the given text.

        Uses the encoder retrieved from `_get_encoder()` to tokenize the input
        and returns the length of the resulting token list.

        Args:
            text (str): The text to tokenize.

        Returns:
            int: Number of tokens in the input text.
        """
        encoder = self._get_encoder()
        return len(encoder.encode(text))

    def _validate_token_length(self, text: str) -> None:
        """
        Ensure the input text does not exceed the model's maximum token limit.

        Args:
            text (str): The text to check.

        Raises:
            ValueError: If the token count exceeds `OPENAI_EMBEDDING_MAX_TOKENS`.
        """
        if self._count_tokens(text) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"Input text exceeds maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text (str):
                The text to embed. Must be non-empty and within the model's
                token limit.
            **parameters:
                Additional parameters to forward to the Azure OpenAI embeddings API.

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds the token limit.
        """
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        self._validate_token_length(text)
        response = self.client.embeddings.create(
            model=self.model_name,
            input=text,
            **parameters,
        )
        return response.data[0].embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts in a single API call.

        Args:
            texts (List[str]):
                List of text strings to embed. All items must be non-empty strings
                within the token limit.
            **parameters:
                Additional parameters to forward to the Azure OpenAI embeddings API.

        Returns:
            A list of embedding vectors, one per input text.

        Raises:
            ValueError:
                - If `texts` is empty.
                - If any text is empty or not a string.
                - If any text exceeds the token limit.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        encoder = self._get_encoder()
        for t in texts:
            if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
                raise ValueError(
                    f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
                )

        response = self.client.embeddings.create(
            model=self.model_name,
            input=texts,
            **parameters,
        )
        return [data.embedding for data in response.data]
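
Because `embed_documents` raises as soon as one input exceeds the token limit, a single long text rejects the whole batch. The sketch below shows one way to separate oversized inputs before the call. The helper `partition_by_token_limit` and its `count_tokens` argument are hypothetical and not part of SplitterMR; the limit is passed in explicitly rather than assumed.

```python
from typing import Callable, List, Tuple


def partition_by_token_limit(
    texts: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split texts into (within_limit, too_long).

    `count_tokens` is any callable that returns a token count for a string.
    With tiktoken it could be `lambda t: len(encoding.encode(t))`; a plain
    whitespace split is only a rough stand-in.
    """
    within: List[str] = []
    too_long: List[str] = []
    for text in texts:
        if count_tokens(text) <= max_tokens:
            within.append(text)
        else:
            too_long.append(text)
    return within, too_long
```

Texts in the second list can then be shortened or split further before embedding, instead of failing the request for every other input.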
__init__(model_name=None, api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None, tokenizer_name=None)

Initialize the Azure OpenAI Embedding provider.

Parameters:

Name Type Description Default
model_name Optional[str]

OpenAI model name, kept for API parity. It is used as the deployment name only when neither azure_deployment nor AZURE_OPENAI_DEPLOYMENT is set.

None
api_key Optional[str]

API key for Azure OpenAI. If not provided, it will be read from the environment variable AZURE_OPENAI_API_KEY.

None
azure_endpoint Optional[str]

The base endpoint for the Azure OpenAI service. If not provided, it will be read from AZURE_OPENAI_ENDPOINT.

None
azure_deployment Optional[str]

Deployment name for the embeddings model in Azure OpenAI. If not provided, it is read from AZURE_OPENAI_DEPLOYMENT, falling back to model_name.

None
api_version Optional[str]

Azure API version string. If not provided, it is read from AZURE_OPENAI_API_VERSION, defaulting to "2025-04-14-preview".

None
tokenizer_name Optional[str]

Optional explicit tokenizer name for tiktoken (e.g., "cl100k_base"). If provided, it overrides the automatic mapping.

None

Raises:

Type Description
ValueError

If the API key, endpoint, or deployment name is neither passed as an argument nor set in the environment.

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def __init__(
    self,
    model_name: Optional[str] = None,
    api_key: Optional[str] = None,
    azure_endpoint: Optional[str] = None,
    azure_deployment: Optional[str] = None,
    api_version: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
) -> None:
    """
    Initialize the Azure OpenAI Embedding provider.

    Args:
        model_name (Optional[str]):
            OpenAI model name, kept for API parity. It is used as the
            deployment name only when neither `azure_deployment` nor
            `AZURE_OPENAI_DEPLOYMENT` is set.
        api_key (Optional[str]):
            API key for Azure OpenAI. If not provided, it will be read from
            the environment variable `AZURE_OPENAI_API_KEY`.
        azure_endpoint (Optional[str]):
            The base endpoint for the Azure OpenAI service. If not provided,
            it will be read from `AZURE_OPENAI_ENDPOINT`.
        azure_deployment (Optional[str]):
            Deployment name for the embeddings model in Azure OpenAI. If not
            provided, it is read from `AZURE_OPENAI_DEPLOYMENT`, falling back
            to `model_name`.
        api_version (Optional[str]):
            Azure API version string. If not provided, it is read from
            `AZURE_OPENAI_API_VERSION`, defaulting to `"2025-04-14-preview"`.
        tokenizer_name (Optional[str]):
            Optional explicit tokenizer name for `tiktoken` (e.g.,
            `"cl100k_base"`). If provided, it overrides the automatic mapping.

    Raises:
        ValueError: If the API key, endpoint, or deployment name is neither
            passed as an argument nor set in the environment.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
            )

    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )

    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided. Set 'azure_deployment', "
                "'AZURE_OPENAI_DEPLOYMENT', or pass `model_name`."
            )

    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
    self._tokenizer_name = tokenizer_name
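
The constructor resolves each setting from the explicit argument first, then from the environment, then (for the deployment and API version only) from a fallback. The standalone sketch below restates that order so it can be checked without creating a client. `resolve_azure_settings` and `_arg_or_env` are hypothetical names used for illustration, not part of SplitterMR.

```python
import os
from typing import Dict, Optional

DEFAULT_API_VERSION = "2025-04-14-preview"  # fallback used by the constructor above


def _arg_or_env(value: Optional[str], env_var: str) -> Optional[str]:
    """Return the explicit argument if given, else the environment variable."""
    return value if value is not None else os.getenv(env_var)


def resolve_azure_settings(
    model_name: Optional[str] = None,
    api_key: Optional[str] = None,
    azure_endpoint: Optional[str] = None,
    azure_deployment: Optional[str] = None,
    api_version: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper following the constructor's resolution order.

    Note: unlike the constructor, an explicitly passed empty string is
    rejected here as well.
    """
    api_key = _arg_or_env(api_key, "AZURE_OPENAI_API_KEY")
    if not api_key:
        raise ValueError("API key not provided and AZURE_OPENAI_API_KEY is not set.")

    azure_endpoint = _arg_or_env(azure_endpoint, "AZURE_OPENAI_ENDPOINT")
    if not azure_endpoint:
        raise ValueError("Endpoint not provided and AZURE_OPENAI_ENDPOINT is not set.")

    # Deployment: argument, then env var, then model_name as a last resort.
    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
    if not azure_deployment:
        raise ValueError("No deployment name could be resolved.")

    # API version: argument, then env var, then the default.
    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", DEFAULT_API_VERSION)

    return {
        "api_key": api_key,
        "azure_endpoint": azure_endpoint,
        "azure_deployment": azure_deployment,
        "api_version": api_version,
    }
```

One consequence of this order is that `AZURE_OPENAI_DEPLOYMENT`, when set, takes precedence over `model_name`. Pass `azure_deployment` explicitly to override both.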
get_client()

Get the underlying Azure OpenAI client.

Returns:

Type Description
AzureOpenAI

The configured Azure OpenAI API client.

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def get_client(self) -> AzureOpenAI:
    """
    Get the underlying Azure OpenAI client.

    Returns:
        AzureOpenAI: The configured Azure OpenAI API client.
    """
    return self.client
embed_text(text, **parameters)

Compute an embedding vector for a single text string.

Parameters:

Name Type Description Default
text str

The text to embed. Must be non-empty and within the model's token limit.

required
**parameters Any

Additional parameters to forward to the Azure OpenAI embeddings API.

{}

Returns:

Type Description
List[float]

The computed embedding vector.

Raises:

Type Description
ValueError

If text is empty or exceeds the token limit.

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text (str):
            The text to embed. Must be non-empty and within the model's
            token limit.
        **parameters:
            Additional parameters to forward to the Azure OpenAI embeddings API.

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds the token limit.
    """
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    self._validate_token_length(text)
    response = self.client.embeddings.create(
        model=self.model_name,
        input=text,
        **parameters,
    )
    return response.data[0].embedding
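
Since `embed_text` returns a plain `List[float]`, two results can be compared without extra dependencies. The following is a minimal sketch of cosine similarity over such vectors. The function name `cosine_similarity` is an illustrative choice, not something SplitterMR is known to export.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimensionality.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


# Usage, assuming `embedder` is an already configured embedding instance:
#   score = cosine_similarity(
#       embedder.embed_text("first sentence"),
#       embedder.embed_text("second sentence"),
#   )
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 indicate unrelated directions.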
embed_documents(texts, **parameters)

Compute embeddings for multiple texts in a single API call.

Parameters:

Name Type Description Default
texts List[str]

List of text strings to embed. All items must be non-empty strings within the token limit.

required
**parameters Any

Additional parameters to forward to the Azure OpenAI embeddings API.

{}

Returns:

Type Description
List[List[float]]

A list of embedding vectors, one per input text.

Raises:

Type Description
ValueError
  • If texts is empty.
  • If any text is empty or not a string.
  • If any text exceeds the token limit.
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts in a single API call.

    Args:
        texts (List[str]):
            List of text strings to embed. All items must be non-empty strings
            within the token limit.
        **parameters:
            Additional parameters to forward to the Azure OpenAI embeddings API.

    Returns:
        List[List[float]]: A list of embedding vectors, one per input text.

    Raises:
        ValueError:
            - If `texts` is empty.
            - If any text is empty or not a string.
            - If any text exceeds the token limit.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    encoder = self._get_encoder()
    for t in texts:
        if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    response = self.client.embeddings.create(
        model=self.model_name,
        input=texts,
        **parameters,
    )
    return [data.embedding for data in response.data]
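
`embed_documents` sends every text in one request, so a large corpus may need to be split into smaller calls on the caller's side. The sketch below is one possible wrapper. `embed_in_batches` is a hypothetical helper, and the default batch size of 16 is an arbitrary assumption, not a documented service limit.

```python
from typing import Callable, List

# Any callable with the same shape as `embed_documents`.
EmbedFn = Callable[[List[str]], List[List[float]]]


def embed_in_batches(
    texts: List[str],
    embed_fn: EmbedFn,
    batch_size: int = 16,
) -> List[List[float]]:
    """Hypothetical helper: call `embed_fn` on consecutive slices of `texts`.

    Output order matches input order. An empty `texts` list returns an
    empty list without calling `embed_fn`.
    """
    if batch_size < 1:
        raise ValueError("`batch_size` must be at least 1.")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors


# Usage, assuming `embedder` is an already configured embedding instance:
#   vectors = embed_in_batches(chunks, embedder.embed_documents, batch_size=32)
```

Because `embed_documents` raises on an empty list, skipping the call for empty input avoids an error when there is nothing to embed.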