
Embedding Models

Overview

Encoder models are the engines that produce embeddings: distributed, vectorized representations of a text. Embeddings capture relationships between semantic units (commonly words, but also sentences or even multimodal content such as images).

These embeddings can be used in a variety of tasks, such as:

  • Measuring how relevant a word is within a text.
  • Comparing the similarity between two pieces of text (see the sketch below).
  • Powering search, clustering, and recommendation systems.
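For the similarity case, a common choice is cosine similarity between the two vectors. A minimal, provider-agnostic sketch in plain Python:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Any embedder below produces vectors that can be compared this way, e.g.:
# score = cosine_similarity(embedder.embed_text("cat"), embedder.embed_text("kitten"))
```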

Example of an embedding representation

SplitterMR takes advantage of these models in SemanticSplitter, where the representations are used to break text into chunks based on meaning, not just size: sentences with similar context end up together, regardless of length or position. A minimal usage sketch follows.
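A sketch of that flow (the SemanticSplitter import path and constructor parameter shown here are assumptions; see the SemanticSplitter reference for the exact signature):

```python
from splitter_mr.embedding import OpenAIEmbedding
from splitter_mr.splitter import SemanticSplitter  # import path assumed

embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
# `model` is a hypothetical parameter name for the embedding backend.
splitter = SemanticSplitter(model=embedder)
```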

Which embedder should I use?

All embedders inherit from BaseEmbedding and expose the same interface for generating embeddings. Choose based on your cloud provider, credentials, and compliance needs.

| Model | When to use | Requirements | Features |
| --- | --- | --- | --- |
| OpenAIEmbedding | You have an OpenAI API key and want to use OpenAI's hosted embeddings | `OPENAI_API_KEY` | Production-ready text embeddings; simple setup; broad ecosystem/tooling support. |
| AzureOpenAIEmbedding | Your organization uses Azure OpenAI Services | `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_DEPLOYMENT` | Enterprise controls, Azure compliance and data residency; integrates with Azure identity. |
| GeminiEmbedding | You want Google's Gemini text embeddings | `GEMINI_API_KEY` + multimodal extra: `pip install 'splitter-mr[multimodal]'` | Google Gemini API; modern, high-quality text embeddings. |
| AnthropicEmbedding | You want embeddings aligned with Anthropic guidance (via Voyage AI) | `VOYAGE_API_KEY` + multimodal extra: `pip install 'splitter-mr[multimodal]'` | Voyage AI embeddings (general, code, finance, law, multimodal); supports `input_type` for query/document asymmetry. |
| HuggingFaceEmbedding | You prefer local/open-source models (Sentence-Transformers) or need offline capability | Multimodal extra: `pip install 'splitter-mr[multimodal]'` (optional: `HF_ACCESS_TOKEN`, only for models that require it) | No API key; huge model zoo; CPU/GPU/MPS; optional L2 normalization for cosine similarity. |
| BaseEmbedding | Abstract base, not used directly | | Implement to plug in a custom or self-hosted embedder. |
BaseEmbedding Abstract base, not used directly Implement to plug in a custom or self-hosted embedder.

Note

If you want to bring your own embedding provider, you can implement one easily by subclassing BaseEmbedding.
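A minimal sketch of such a provider (the import path is assumed to mirror the other embedders, and the fixed-vector backend is a stand-in; a real implementation would call an actual model):

```python
from typing import Any, List

from splitter_mr.embedding import BaseEmbedding  # import path assumed


class MyEmbedding(BaseEmbedding):
    """Toy provider implementing the BaseEmbedding interface."""

    def __init__(self, model_name: str) -> None:
        self.model_name = model_name

    def get_client(self) -> Any:
        # Pure-local sketch: no remote client is needed.
        return None

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        # Stand-in vector; replace with a real model call.
        return [float(ord(c)) for c in text[:8]]

    # embed_documents is inherited from BaseEmbedding and loops over embed_text.
```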

Embedders

BaseEmbedding

BaseEmbedding

Bases: ABC

Abstract base for text embedding providers.

Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local models) and expose a consistent interface to convert text into numeric vectors suitable for similarity search, clustering, and retrieval-augmented generation.

Source code in src/splitter_mr/embedding/base_embedding.py
class BaseEmbedding(ABC):
    """
    Abstract base for text embedding providers.

    Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local
    models) and expose a consistent interface to convert text into numeric
    vectors suitable for similarity search, clustering, and retrieval-augmented
    generation.
    """

    @abstractmethod
    def __init__(self, model_name: str) -> Any:
        """Initialize the embedding backend.

        Args:
            model_name (str): Identifier of the embedding model (e.g.,
                ``"text-embedding-3-large"`` or a local model alias/path).

        Raises:
            ValueError: If required configuration or credentials are missing.
        """

    @abstractmethod
    def get_client(self) -> Any:
        """Return the underlying client or handle.

        Returns:
            Any: A client/handle used to perform embedding calls (e.g., an SDK
                client instance, session object, or local runner). May be ``None``
                for pure-local implementations that do not require a client.
        """

    @abstractmethod
    def embed_text(
        self,
        text: str,
        **parameters: Dict[str, Any],
    ) -> List[float]:
        """
        Compute an embedding vector for the given text.

        Args:
            text (str): Input text to embed. Implementations may apply
                normalization or truncation according to model limits.
            **parameters (Dict[str, Any]): Additional backend-specific options
                forwarded to the implementation (e.g., user tags, request IDs).

        Returns:
            A single embedding vector representing ``text``.

        Raises:
            ValueError: If ``text`` is empty or exceeds backend constraints.
            RuntimeError: If the embedding call fails or returns an unexpected
                response shape.
        """

    def embed_documents(
        self,
        texts: List[str],
        **parameters: Dict[str, Any],
    ) -> List[List[float]]:
        """Compute embeddings for multiple texts (default loops over `embed_text`).

        Implementations are encouraged to override for true batch performance.

        Args:
            texts: List of input strings to embed.
            **parameters: Backend-specific options.

        Returns:
            List of embedding vectors, one per input string.

        Raises:
            ValueError: If `texts` is empty or any element is empty.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        return [self.embed_text(t, **parameters) for t in texts]
__init__(model_name) abstractmethod

Initialize the embedding backend.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | Identifier of the embedding model (e.g., "text-embedding-3-large" or a local model alias/path). | required |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If required configuration or credentials are missing. |

Source code in src/splitter_mr/embedding/base_embedding.py
@abstractmethod
def __init__(self, model_name: str) -> Any:
    """Initialize the embedding backend.

    Args:
        model_name (str): Identifier of the embedding model (e.g.,
            ``"text-embedding-3-large"`` or a local model alias/path).

    Raises:
        ValueError: If required configuration or credentials are missing.
    """
get_client() abstractmethod

Return the underlying client or handle.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| Any | Any | A client/handle used to perform embedding calls (e.g., an SDK client instance, session object, or local runner). May be None for pure-local implementations that do not require a client. |

Source code in src/splitter_mr/embedding/base_embedding.py
@abstractmethod
def get_client(self) -> Any:
    """Return the underlying client or handle.

    Returns:
        Any: A client/handle used to perform embedding calls (e.g., an SDK
            client instance, session object, or local runner). May be ``None``
            for pure-local implementations that do not require a client.
    """
embed_text(text, **parameters) abstractmethod

Compute an embedding vector for the given text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Input text to embed. Implementations may apply normalization or truncation according to model limits. | required |
| **parameters | Dict[str, Any] | Additional backend-specific options forwarded to the implementation (e.g., user tags, request IDs). | {} |

Returns:

| Type | Description |
| --- | --- |
| List[float] | A single embedding vector representing text. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If text is empty or exceeds backend constraints. |
| RuntimeError | If the embedding call fails or returns an unexpected response shape. |

Source code in src/splitter_mr/embedding/base_embedding.py
@abstractmethod
def embed_text(
    self,
    text: str,
    **parameters: Dict[str, Any],
) -> List[float]:
    """
    Compute an embedding vector for the given text.

    Args:
        text (str): Input text to embed. Implementations may apply
            normalization or truncation according to model limits.
        **parameters (Dict[str, Any]): Additional backend-specific options
            forwarded to the implementation (e.g., user tags, request IDs).

    Returns:
        A single embedding vector representing ``text``.

    Raises:
        ValueError: If ``text`` is empty or exceeds backend constraints.
        RuntimeError: If the embedding call fails or returns an unexpected
            response shape.
    """
embed_documents(texts, **parameters)

Compute embeddings for multiple texts (default loops over embed_text).

Implementations are encouraged to override for true batch performance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | List[str] | List of input strings to embed. | required |
| **parameters | Dict[str, Any] | Backend-specific options. | {} |

Returns:

| Type | Description |
| --- | --- |
| List[List[float]] | List of embedding vectors, one per input string. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If texts is empty or any element is empty. |

Source code in src/splitter_mr/embedding/base_embedding.py
def embed_documents(
    self,
    texts: List[str],
    **parameters: Dict[str, Any],
) -> List[List[float]]:
    """Compute embeddings for multiple texts (default loops over `embed_text`).

    Implementations are encouraged to override for true batch performance.

    Args:
        texts: List of input strings to embed.
        **parameters: Backend-specific options.

    Returns:
        List of embedding vectors, one per input string.

    Raises:
        ValueError: If `texts` is empty or any element is empty.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    return [self.embed_text(t, **parameters) for t in texts]

OpenAIEmbedding


OpenAIEmbedding

Bases: BaseEmbedding

Encoder provider using OpenAI's embeddings API.

This class wraps OpenAI's embeddings endpoint, providing convenience methods for both single-text and batch embeddings. It also adds token counting and validation to avoid exceeding model limits.
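Token validation relies on tiktoken: the input is tokenized with the model's encoding and rejected if it exceeds the model limit. Conceptually (a short sketch):

```python
import tiktoken

# Map the model name to its tokenizer and count tokens before the API call.
encoder = tiktoken.encoding_for_model("text-embedding-3-large")
n_tokens = len(encoder.encode("hello world"))
print(n_tokens)
```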

Example
from splitter_mr.embedding import OpenAIEmbedding

embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
vector = embedder.embed_text("hello world")
print(vector)
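Batch requests go through embed_documents, which embeds every string in a single API call:

```python
vectors = embedder.embed_documents(["first text", "second text"])
print(len(vectors), len(vectors[0]))  # 2 vectors, one per input
```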
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
class OpenAIEmbedding(BaseEmbedding):
    """
    Encoder provider using OpenAI's embeddings API.

    This class wraps OpenAI's embeddings endpoint, providing convenience
    methods for both single-text and batch embeddings. It also adds token
    counting and validation to avoid exceeding model limits.

    Example:
        ```python
        from splitter_mr.embedding import OpenAIEmbedding

        embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
        vector = embedder.embed_text("hello world")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "text-embedding-3-large",
        api_key: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
    ) -> None:
        """
        Initialize the OpenAI embeddings provider.

        Args:
            model_name (str):
                The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).
            api_key (Optional[str]):
                API key for OpenAI. If not provided, reads from the
                `OPENAI_API_KEY` environment variable.
            tokenizer_name (Optional[str]):
                Optional explicit tokenizer name for `tiktoken`. If provided,
                this overrides automatic model-to-tokenizer mapping.

        Raises:
            ValueError: If the API key is not provided or the `OPENAI_API_KEY` environment variable is not set.
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name
        self._tokenizer_name = tokenizer_name

    def get_client(self) -> OpenAI:
        """
        Get the configured OpenAI client.

        Returns:
            OpenAI: The OpenAI API client instance.
        """
        return self.client

    def _get_encoder(self):
        """
        Retrieve the `tiktoken` encoder for the configured model.

        If a `tokenizer_name` is explicitly provided, it is used. Otherwise,
        attempts to use `tiktoken.encoding_for_model`. If that fails, falls
        back to the default tokenizer defined by `OPENAI_EMBEDDING_MODEL_FALLBACK`.

        Returns:
            tiktoken.Encoding: The encoding object for tokenizing text.

        Raises:
            ValueError: If neither the model-specific nor fallback encoder
            can be loaded.
        """
        if self._tokenizer_name:
            return tiktoken.get_encoding(self._tokenizer_name)
        try:
            return tiktoken.encoding_for_model(self.model_name)
        except Exception:
            return tiktoken.get_encoding(OPENAI_EMBEDDING_MODEL_FALLBACK)

    def _count_tokens(self, text: str) -> int:
        """
        Count the number of tokens in the given text.

        Args:
            text (str): The text to tokenize.

        Returns:
            int: Number of tokens.
        """
        encoder = self._get_encoder()
        return len(encoder.encode(text))

    def _validate_token_length(self, text: str) -> None:
        """
        Ensure the text does not exceed the model's token limit.

        Args:
            text (str): The text to check.

        Raises:
            ValueError: If the token count exceeds `OPENAI_EMBEDDING_MAX_TOKENS`.
        """
        if self._count_tokens(text) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"Input text exceeds maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text (str):
                The text to embed. Must be non-empty and within the model's
                token limit.
            **parameters:
                Additional keyword arguments forwarded to
                `client.embeddings.create(...)`.

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds the token limit.
        """
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        self._validate_token_length(text)

        response = self.client.embeddings.create(
            input=text,
            model=self.model_name,
            **parameters,
        )
        return response.data[0].embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts in one API call.

        Args:
            texts (List[str]):
                List of text strings to embed. All must be non-empty and within
                the model's token limit.
            **parameters:
                Additional keyword arguments forwarded to
                `client.embeddings.create(...)`.

        Returns:
            A list of embedding vectors, one per input string.

        Raises:
            ValueError:
                - If `texts` is empty.
                - If any text is empty or not a string.
                - If any text exceeds the token limit.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        encoder = self._get_encoder()
        for t in texts:
            if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
                raise ValueError(
                    f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
                )

        response = self.client.embeddings.create(
            input=texts,
            model=self.model_name,
            **parameters,
        )
        return [data.embedding for data in response.data]
__init__(model_name='text-embedding-3-large', api_key=None, tokenizer_name=None)

Initialize the OpenAI embeddings provider.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | The OpenAI embedding model name (e.g., "text-embedding-3-large"). | 'text-embedding-3-large' |
| api_key | Optional[str] | API key for OpenAI. If not provided, reads from the OPENAI_API_KEY environment variable. | None |
| tokenizer_name | Optional[str] | Optional explicit tokenizer name for tiktoken. If provided, this overrides automatic model-to-tokenizer mapping. | None |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the API key is not provided or the OPENAI_API_KEY environment variable is not set. |

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def __init__(
    self,
    model_name: str = "text-embedding-3-large",
    api_key: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
) -> None:
    """
    Initialize the OpenAI embeddings provider.

    Args:
        model_name (str):
            The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).
        api_key (Optional[str]):
            API key for OpenAI. If not provided, reads from the
            `OPENAI_API_KEY` environment variable.
        tokenizer_name (Optional[str]):
            Optional explicit tokenizer name for `tiktoken`. If provided,
            this overrides automatic model-to-tokenizer mapping.

    Raises:
        ValueError: If the API key is not provided or the `OPENAI_API_KEY` environment variable is not set.
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
    self._tokenizer_name = tokenizer_name
get_client()

Get the configured OpenAI client.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| OpenAI | OpenAI | The OpenAI API client instance. |

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def get_client(self) -> OpenAI:
    """
    Get the configured OpenAI client.

    Returns:
        OpenAI: The OpenAI API client instance.
    """
    return self.client
embed_text(text, **parameters)

Compute an embedding vector for a single text string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to embed. Must be non-empty and within the model's token limit. | required |
| **parameters | Any | Additional keyword arguments forwarded to client.embeddings.create(...). | {} |

Returns:

| Type | Description |
| --- | --- |
| List[float] | The computed embedding vector. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If text is empty or exceeds the token limit. |

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text (str):
            The text to embed. Must be non-empty and within the model's
            token limit.
        **parameters:
            Additional keyword arguments forwarded to
            `client.embeddings.create(...)`.

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds the token limit.
    """
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    self._validate_token_length(text)

    response = self.client.embeddings.create(
        input=text,
        model=self.model_name,
        **parameters,
    )
    return response.data[0].embedding
embed_documents(texts, **parameters)

Compute embeddings for multiple texts in one API call.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | List[str] | List of text strings to embed. All must be non-empty and within the model's token limit. | required |
| **parameters | Any | Additional keyword arguments forwarded to client.embeddings.create(...). | {} |

Returns:

| Type | Description |
| --- | --- |
| List[List[float]] | A list of embedding vectors, one per input string. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If texts is empty; if any text is empty or not a string; if any text exceeds the token limit. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts in one API call.

    Args:
        texts (List[str]):
            List of text strings to embed. All must be non-empty and within
            the model's token limit.
        **parameters:
            Additional keyword arguments forwarded to
            `client.embeddings.create(...)`.

    Returns:
        A list of embedding vectors, one per input string.

    Raises:
        ValueError:
            - If `texts` is empty.
            - If any text is empty or not a string.
            - If any text exceeds the token limit.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    encoder = self._get_encoder()
    for t in texts:
        if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    response = self.client.embeddings.create(
        input=texts,
        model=self.model_name,
        **parameters,
    )
    return [data.embedding for data in response.data]

AzureOpenAIEmbedding


AzureOpenAIEmbedding

Bases: BaseEmbedding

Encoder provider using Azure OpenAI Embeddings.

This class wraps Azure OpenAI's embeddings API, handling both authentication and tokenization. It supports both direct embedding calls for a single text (embed_text) and batch embedding calls (embed_documents).

Azure deployments use deployment names (e.g., my-embedding-deployment) instead of OpenAI's standard model names. Since tiktoken may not be able to map a deployment name to a tokenizer automatically, this class implements a fallback mechanism to use a known encoding (e.g., cl100k_base) when necessary.

Example
from splitter_mr.embedding import AzureOpenAIEmbedding

embedder = AzureOpenAIEmbedding(
    azure_deployment="text-embedding-3-large",
    api_key="...",
    azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
)
vector = embedder.embed_text("Hello world")
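Because the constructor also reads its configuration from the environment, the arguments can be omitted entirely; and if tiktoken cannot map your deployment name, the tokenizer can be pinned explicitly (a sketch, assuming the AZURE_OPENAI_* variables from the table above are set):

```python
# Assumes AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, and
# AZURE_OPENAI_DEPLOYMENT are exported in the environment.
embedder = AzureOpenAIEmbedding(tokenizer_name="cl100k_base")
vector = embedder.embed_text("Hello world")
```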
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
class AzureOpenAIEmbedding(BaseEmbedding):
    """
    Encoder provider using Azure OpenAI Embeddings.

    This class wraps Azure OpenAI's embeddings API, handling both authentication
    and tokenization. It supports both direct embedding calls for a single text
    (`embed_text`) and batch embedding calls (`embed_documents`).

    Azure deployments use *deployment names* (e.g., `my-embedding-deployment`)
    instead of OpenAI's standard model names. Since `tiktoken` may not be able to
    map a deployment name to a tokenizer automatically, this class implements
    a fallback mechanism to use a known encoding (e.g., `cl100k_base`) when necessary.

    Example:
        ```python
        from splitter_mr.embedding import AzureOpenAIEmbedding

        embedder = AzureOpenAIEmbedding(
            azure_deployment="text-embedding-3-large",
            api_key="...",
            azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
        )
        vector = embedder.embed_text("Hello world")
        ```
    """

    def __init__(
        self,
        model_name: Optional[str] = None,
        api_key: Optional[str] = None,
        azure_endpoint: Optional[str] = None,
        azure_deployment: Optional[str] = None,
        api_version: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
    ) -> None:
        """
        Initialize the Azure OpenAI Embedding provider.

        Args:
            model_name (Optional[str]):
                OpenAI model name (unused for Azure, but kept for API parity).
                If `azure_deployment` is not provided, this will be used as the
                deployment name.
            api_key (Optional[str]):
                API key for Azure OpenAI. If not provided, it will be read from
                the environment variable `AZURE_OPENAI_API_KEY`.
            azure_endpoint (Optional[str]):
                The base endpoint for the Azure OpenAI service. If not provided,
                it will be read from `AZURE_OPENAI_ENDPOINT`.
            azure_deployment (Optional[str]):
                Deployment name for the embeddings model in Azure OpenAI. If not
                provided, it will be read from `AZURE_OPENAI_DEPLOYMENT` or
                fallback to `model_name`.
            api_version (Optional[str]):
                Azure API version string. Defaults to `"2025-04-14-preview"`.
                If not provided, it will be read from `AZURE_OPENAI_API_VERSION`.
            tokenizer_name (Optional[str]):
                Optional explicit tokenizer name for `tiktoken` (e.g.,
                `"cl100k_base"`). If provided, it overrides the automatic mapping.

        Raises:
            ValueError: If any required parameter is missing or it is not found in environment variables.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
                )

        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )

        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided. Set 'azure_deployment', "
                    "'AZURE_OPENAI_DEPLOYMENT', or pass `model_name`."
                )

        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment
        self._tokenizer_name = tokenizer_name

    def get_client(self) -> AzureOpenAI:
        """
        Get the underlying Azure OpenAI client.

        Returns:
            AzureOpenAI: The configured Azure OpenAI API client.
        """
        return self.client

    def _get_encoder(self):
        """
        Retrieve the `tiktoken` encoder for this deployment.

        This method ensures compatibility with Azure's deployment names, which
        may not be directly recognized by `tiktoken`. If the user has explicitly
        provided a tokenizer name, that is used. Otherwise, the method first
        tries to look up the encoding via `tiktoken.encoding_for_model` using the
        deployment name. If that fails, it falls back to the default encoding
        defined by `OPENAI_EMBEDDING_MODEL_FALLBACK`.

        Returns:
            tiktoken.Encoding: A tokenizer encoding object.

        Raises:
            ValueError: If `tiktoken` fails to load the fallback encoding.
        """
        if self._tokenizer_name:
            return tiktoken.get_encoding(self._tokenizer_name)
        try:
            return tiktoken.encoding_for_model(self.model_name)
        except Exception:
            return tiktoken.get_encoding(OPENAI_EMBEDDING_MODEL_FALLBACK)

    def _count_tokens(self, text: str) -> int:
        """
        Count the number of tokens in the given text.

        Uses the encoder retrieved from `_get_encoder()` to tokenize the input
        and returns the length of the resulting token list.

        Args:
            text (str): The text to tokenize.

        Returns:
            int: Number of tokens in the input text.
        """
        encoder = self._get_encoder()
        return len(encoder.encode(text))

    def _validate_token_length(self, text: str) -> None:
        """
        Ensure the input text does not exceed the model's maximum token limit.

        Args:
            text (str): The text to check.

        Raises:
            ValueError: If the token count exceeds `OPENAI_EMBEDDING_MAX_TOKENS`.
        """
        if self._count_tokens(text) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"Input text exceeds maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text (str):
                The text to embed. Must be non-empty and within the model's
                token limit.
            **parameters:
                Additional parameters to forward to the Azure OpenAI embeddings API.

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds the token limit.
        """
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        self._validate_token_length(text)
        response = self.client.embeddings.create(
            model=self.model_name,
            input=text,
            **parameters,
        )
        return response.data[0].embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts in a single API call.

        Args:
            texts (List[str]):
                List of text strings to embed. All items must be non-empty strings
                within the token limit.
            **parameters:
                Additional parameters to forward to the Azure OpenAI embeddings API.

        Returns:
            A list of embedding vectors, one per input text.

        Raises:
            ValueError:
                - If `texts` is empty.
                - If any text is empty or not a string.
                - If any text exceeds the token limit.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        encoder = self._get_encoder()
        for t in texts:
            if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
                raise ValueError(
                    f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
                )

        response = self.client.embeddings.create(
            model=self.model_name,
            input=texts,
            **parameters,
        )
        return [data.embedding for data in response.data]
__init__(model_name=None, api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None, tokenizer_name=None)

Initialize the Azure OpenAI Embedding provider.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | Optional[str] | OpenAI model name (unused for Azure, but kept for API parity). If azure_deployment is not provided, this will be used as the deployment name. | None |
| api_key | Optional[str] | API key for Azure OpenAI. If not provided, it will be read from the environment variable AZURE_OPENAI_API_KEY. | None |
| azure_endpoint | Optional[str] | The base endpoint for the Azure OpenAI service. If not provided, it will be read from AZURE_OPENAI_ENDPOINT. | None |
| azure_deployment | Optional[str] | Deployment name for the embeddings model in Azure OpenAI. If not provided, it will be read from AZURE_OPENAI_DEPLOYMENT or fall back to model_name. | None |
| api_version | Optional[str] | Azure API version string. Defaults to "2025-04-14-preview". If not provided, it will be read from AZURE_OPENAI_API_VERSION. | None |
| tokenizer_name | Optional[str] | Optional explicit tokenizer name for tiktoken (e.g., "cl100k_base"). If provided, it overrides the automatic mapping. | None |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If any required parameter is missing and not found in environment variables. |

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def __init__(
    self,
    model_name: Optional[str] = None,
    api_key: Optional[str] = None,
    azure_endpoint: Optional[str] = None,
    azure_deployment: Optional[str] = None,
    api_version: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
) -> None:
    """
    Initialize the Azure OpenAI Embedding provider.

    Args:
        model_name (Optional[str]):
            OpenAI model name (unused for Azure, but kept for API parity).
            If `azure_deployment` is not provided, this will be used as the
            deployment name.
        api_key (Optional[str]):
            API key for Azure OpenAI. If not provided, it will be read from
            the environment variable `AZURE_OPENAI_API_KEY`.
        azure_endpoint (Optional[str]):
            The base endpoint for the Azure OpenAI service. If not provided,
            it will be read from `AZURE_OPENAI_ENDPOINT`.
        azure_deployment (Optional[str]):
            Deployment name for the embeddings model in Azure OpenAI. If not
            provided, it will be read from `AZURE_OPENAI_DEPLOYMENT` or
            fallback to `model_name`.
        api_version (Optional[str]):
            Azure API version string. Defaults to `"2025-04-14-preview"`.
            If not provided, it will be read from `AZURE_OPENAI_API_VERSION`.
        tokenizer_name (Optional[str]):
            Optional explicit tokenizer name for `tiktoken` (e.g.,
            `"cl100k_base"`). If provided, it overrides the automatic mapping.

    Raises:
        ValueError: If any required parameter is missing or it is not found in environment variables.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
            )

    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )

    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided. Set 'azure_deployment', "
                "'AZURE_OPENAI_DEPLOYMENT', or pass `model_name`."
            )

    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
    self._tokenizer_name = tokenizer_name
get_client()

Get the underlying Azure OpenAI client.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| AzureOpenAI | AzureOpenAI | The configured Azure OpenAI API client. |

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def get_client(self) -> AzureOpenAI:
    """
    Get the underlying Azure OpenAI client.

    Returns:
        AzureOpenAI: The configured Azure OpenAI API client.
    """
    return self.client
embed_text(text, **parameters)

Compute an embedding vector for a single text string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to embed. Must be non-empty and within the model's token limit. | required |
| **parameters | Any | Additional parameters to forward to the Azure OpenAI embeddings API. | {} |

Returns:

| Type | Description |
| --- | --- |
| List[float] | The computed embedding vector. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If text is empty or exceeds the token limit. |

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text (str):
            The text to embed. Must be non-empty and within the model's
            token limit.
        **parameters:
            Additional parameters to forward to the Azure OpenAI embeddings API.

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds the token limit.
    """
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    self._validate_token_length(text)
    response = self.client.embeddings.create(
        model=self.model_name,
        input=text,
        **parameters,
    )
    return response.data[0].embedding
embed_documents(texts, **parameters)

Compute embeddings for multiple texts in a single API call.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | List[str] | List of text strings to embed. All items must be non-empty strings within the token limit. | required |
| **parameters | Any | Additional parameters to forward to the Azure OpenAI embeddings API. | {} |

Returns:

| Type | Description |
| --- | --- |
| List[List[float]] | A list of embedding vectors, one per input text. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If texts is empty; if any text is empty or not a string; if any text exceeds the token limit. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts in a single API call.

    Args:
        texts (List[str]):
            List of text strings to embed. All items must be non-empty strings
            within the token limit.
        **parameters:
            Additional parameters to forward to the Azure OpenAI embeddings API.

    Returns:
        A list of embedding vectors, one per input text.

    Raises:
        ValueError:
            - If `texts` is empty.
            - If any text is empty or not a string.
            - If any text exceeds the token limit.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    encoder = self._get_encoder()
    for t in texts:
        if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    response = self.client.embeddings.create(
        model=self.model_name,
        input=texts,
        **parameters,
    )
    return [data.embedding for data in response.data]

GeminiEmbedding


GeminiEmbedding

Bases: BaseEmbedding

Embedding provider using Google Gemini's embedding API.

This class wraps the Gemini API for generating embeddings from text or documents. Requires the google-genai package and a valid Gemini API key. This class is available only if splitter-mr[multimodal] is installed.

Typical usage example
from splitter_mr.embedding import GeminiEmbedding
embedder = GeminiEmbedding(api_key="your-api-key")
vector = embedder.embed_text("Hello, world!")
print(vector)
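Batch embedding works through embed_documents, which returns one vector per input:

```python
vectors = embedder.embed_documents(["First document.", "Second document."])
print(len(vectors))  # 2
```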
Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
class GeminiEmbedding(BaseEmbedding):
    """
    Embedding provider using Google Gemini's embedding API.

    This class wraps the Gemini API for generating embeddings from text or documents.
    Requires the `google-genai` package and a valid Gemini API key. This class
    is available only if `splitter-mr[multimodal]` is installed.

    Typical usage example:
        ```python
        from splitter_mr.embedding import GeminiEmbedding
        embedder = GeminiEmbedding(api_key="your-api-key")
        vector = embedder.embed_text("Hello, world!")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "models/embedding-001",
        api_key: Optional[str] = None,
    ) -> None:
        """
        Initialize the Gemini embedding provider.

        Args:
            model_name (str): The Gemini model identifier to use for embedding. Defaults to "models/embedding-001".
            api_key (Optional[str]): The Gemini API key. If not provided, reads from the 'GEMINI_API_KEY' environment variable.

        Raises:
            ImportError: If the `google-genai` package is not installed.
            ValueError: If no API key is provided or found in the environment.
        """
        self.api_key = api_key or os.getenv("GEMINI_API_KEY")
        if not self.api_key:
            raise ValueError(
                "Google Gemini API key not provided and 'GEMINI_API_KEY' environment variable not set."
            )
        self.model_name = model_name
        self.client = genai.Client(api_key=self.api_key)  # use the resolved key (argument or env var)
        self.models = self.client.models

    def get_client(self) -> "genai.Client":
        """
        Return the underlying Gemini API client.

        Returns:
            genai.Client: The configured Gemini API client.
        """
        return self.client

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Generate an embedding for a single text string using Gemini.

        Args:
            text (str): The input text to embed.
            **parameters (Any): Additional parameters for the Gemini API.

        Returns:
            List[float]: The generated embedding vector.

        Raises:
            ValueError: If the input text is not a non-empty string.
            RuntimeError: If the embedding call fails or returns an invalid response.
        """
        if not isinstance(text, str) or not text.strip():
            raise ValueError("`text` must be a non-empty string.")

        try:
            result = self.models.embed_content(
                model=self.model_name, contents=text, **parameters
            )
            embedding = getattr(result, "embedding", None)
            if embedding is None:
                raise RuntimeError(
                    "Gemini embedding call succeeded but no 'embedding' field was returned."
                )
            return embedding
        except Exception as e:
            raise RuntimeError(f"Failed to get embedding from Gemini: {e}") from e

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Generate embeddings for a list of text strings using Gemini.

        Args:
            texts (List[str]): A list of input text strings.
            **parameters (Any): Additional parameters for the Gemini API.

        Returns:
            List[List[float]]: The generated embedding vectors, one per input.

        Raises:
            ValueError: If the input is not a non-empty list of non-empty strings.
            RuntimeError: If the embedding call fails or returns an invalid response.
        """
        if (
            not isinstance(texts, list)
            or not texts  # noqa: W503
            or any(not isinstance(t, str) or not t.strip() for t in texts)  # noqa: W503
        ):
            raise ValueError("`texts` must be a non-empty list of non-empty strings.")

        try:
            result = self.models.embed_content(
                model=self.model_name, contents=texts, **parameters
            )
            # The Gemini API returns a list of embeddings under .embeddings
            embeddings = getattr(result, "embeddings", None)
            if embeddings is None:
                raise RuntimeError(
                    "Gemini embedding call succeeded but no 'embeddings' field was returned."
                )
            return embeddings

        except Exception as e:
            raise RuntimeError(
                f"Failed to get document embeddings from Gemini: {e}"
            ) from e
__init__(model_name='models/embedding-001', api_key=None)

Initialize the Gemini embedding provider.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model_name | str | The Gemini model identifier to use for embedding. | 'models/embedding-001' |
| api_key | Optional[str] | The Gemini API key. If not provided, reads from the 'GEMINI_API_KEY' environment variable. | None |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If the google-genai package is not installed. |
| ValueError | If no API key is provided or found in the environment. |

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
def __init__(
    self,
    model_name: str = "models/embedding-001",
    api_key: Optional[str] = None,
) -> None:
    """
    Initialize the Gemini embedding provider.

    Args:
        model_name (str): The Gemini model identifier to use for embedding. Defaults to "models/embedding-001".
        api_key (Optional[str]): The Gemini API key. If not provided, reads from the 'GEMINI_API_KEY' environment variable.

    Raises:
        ImportError: If the `google-genai` package is not installed.
        ValueError: If no API key is provided or found in the environment.
    """
    self.api_key = api_key or os.getenv("GEMINI_API_KEY")
    if not self.api_key:
        raise ValueError(
            "Google Gemini API key not provided and 'GEMINI_API_KEY' environment variable not set."
        )
    self.model_name = model_name
    self.client = genai.Client(api_key=self.api_key)  # use the resolved key (argument or env var)
    self.models = self.client.models
get_client()

Return the underlying Gemini API client.

Returns:

| Type | Description |
| --- | --- |
| Client | The configured Gemini API client (genai.Client). |

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
def get_client(self) -> "genai.Client":
    """
    Return the underlying Gemini API client.

    Returns:
            genai.Client: The configured Gemini API client.
    """
    return self.client
embed_text(text, **parameters)

Generate an embedding for a single text string using Gemini.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The input text to embed. | required |
| **parameters | Any | Additional parameters for the Gemini API. | {} |

Returns:

| Type | Description |
| --- | --- |
| List[float] | The generated embedding vector. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the input text is not a non-empty string. |
| RuntimeError | If the embedding call fails or returns an invalid response. |

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Generate an embedding for a single text string using Gemini.

    Args:
        text (str): The input text to embed.
        **parameters (Any): Additional parameters for the Gemini API.

    Returns:
        List[float]: The generated embedding vector.

    Raises:
        ValueError: If the input text is not a non-empty string.
        RuntimeError: If the embedding call fails or returns an invalid response.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("`text` must be a non-empty string.")

    try:
        result = self.models.embed_content(
            model=self.model_name, contents=text, **parameters
        )
        embedding = getattr(result, "embedding", None)
        if embedding is None:
            raise RuntimeError(
                "Gemini embedding call succeeded but no 'embedding' field was returned."
            )
        return embedding
    except Exception as e:
        raise RuntimeError(f"Failed to get embedding from Gemini: {e}") from e
embed_documents(texts, **parameters)

Generate embeddings for a list of text strings using Gemini.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | List[str] | A list of input text strings. | required |
| **parameters | Any | Additional parameters for the Gemini API. | {} |

Returns:

| Type | Description |
| --- | --- |
| List[List[float]] | The generated embedding vectors, one per input. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the input is not a non-empty list of non-empty strings. |
| RuntimeError | If the embedding call fails or returns an invalid response. |

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Generate embeddings for a list of text strings using Gemini.

    Args:
        texts (List[str]): A list of input text strings.
        **parameters (Any): Additional parameters for the Gemini API.

    Returns:
        List[List[float]]: The generated embedding vectors, one per input.

    Raises:
        ValueError: If the input is not a non-empty list of non-empty strings.
        RuntimeError: If the embedding call fails or returns an invalid response.
    """
    if (
        not isinstance(texts, list)
        or not texts  # noqa: W503
        or any(not isinstance(t, str) or not t.strip() for t in texts)  # noqa: W503
    ):
        raise ValueError("`texts` must be a non-empty list of non-empty strings.")

    try:
        result = self.models.embed_content(
            model=self.model_name, contents=texts, **parameters
        )
        # The Gemini API returns a list of embeddings under .embeddings
        embeddings = getattr(result, "embeddings", None)
        if embeddings is None:
            raise RuntimeError(
                "Gemini embedding call succeeded but no 'embeddings' field was returned."
            )
        return embeddings

    except Exception as e:
        raise RuntimeError(
            f"Failed to get document embeddings from Gemini: {e}"
        ) from e

AnthropicEmbedding


AnthropicEmbedding

Bases: BaseEmbedding

Embedding provider aligned with Anthropic's guidance, implemented via Voyage AI.

Anthropic does not offer a native embeddings API; their docs recommend using third-party providers such as Voyage AI for high-quality, domain-specific, and multimodal embeddings. This class wraps Voyage's Python SDK to provide a consistent interface that matches BaseEmbedding.

Example
from splitter_mr.embedding import AnthropicEmbedding

embedder = AnthropicEmbedding(model_name="voyage-3.5")
vec = embedder.embed_text("hello world", input_type="document")
print(len(vec))
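Because Voyage supports input_type, retrieval pipelines can embed stored chunks as documents and the search string as a query (a short sketch):

```python
doc_vectors = embedder.embed_documents(
    ["chunk one", "chunk two"], input_type="document"
)
query_vector = embedder.embed_text("what does chunk one cover?", input_type="query")
```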
Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
class AnthropicEmbedding(BaseEmbedding):
    """
    Embedding provider aligned with Anthropic's guidance, implemented via Voyage AI.

    Anthropic does not offer a native embeddings API; their docs recommend using
    third-party providers such as **Voyage AI** for high-quality, domain-specific,
    and multimodal embeddings. This class wraps Voyage's Python SDK to provide a
    consistent interface that matches `BaseEmbedding`.

    Example:
        ```python
        from splitter_mr.embedding import AnthropicEmbedding

        embedder = AnthropicEmbedding(model_name="voyage-3.5")
        vec = embedder.embed_text("hello world", input_type="document")
        print(len(vec))
        ```
    """

    def __init__(
        self,
        model_name: str = "voyage-3.5",
        api_key: Optional[str] = None,
        default_input_type: Optional[str] = "document",
    ) -> None:
        """
        Initialize the Voyage embeddings provider.

        Args:
            model_name:
                Voyage embedding model name (e.g., "voyage-3.5", "voyage-3-large",
                "voyage-code-3", "voyage-finance-2", "voyage-law-2").
            api_key:
                Voyage API key. If not provided, reads from the `VOYAGE_API_KEY`
                environment variable.
            default_input_type:
                Default for Voyage's `input_type` parameter ("document" | "query").

        Raises:
            ImportError: If the `multimodal` extra (with `voyageai`) is not installed.
            ValueError: If no API key is provided or found in the environment.
        """

        if api_key is None:
            api_key = os.getenv("VOYAGE_API_KEY")
            if not api_key:
                raise ValueError(
                    "Voyage API key not provided and 'VOYAGE_API_KEY' environment variable is not set."
                )

        self.client = voyageai.Client(api_key=api_key)
        self.model_name = model_name
        self.default_input_type = default_input_type

    def get_client(self) -> Any:
        """Return the underlying Voyage client."""
        return self.client

    def _ensure_input_type(self, parameters: dict) -> dict:
        """Default `input_type` to self.default_input_type if not set."""
        params = dict(parameters) if parameters else {}
        if "input_type" not in params and self.default_input_type:
            params["input_type"] = self.default_input_type
        return params

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """Compute an embedding vector for a single text string."""
        if not isinstance(text, str) or not text.strip():
            raise ValueError("`text` must be a non-empty string.")

        params = self._ensure_input_type(parameters)
        result = self.client.embed([text], model=self.model_name, **params)

        if not hasattr(result, "embeddings") or not result.embeddings:
            raise RuntimeError(
                "Voyage returned an empty or malformed embeddings response."
            )

        embedding = result.embeddings[0]
        if not isinstance(embedding, list) or not embedding:
            raise RuntimeError("Voyage returned an invalid embedding vector.")

        return embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """Compute embeddings for multiple texts in one API call."""
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t.strip() for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        params = self._ensure_input_type(parameters)
        result = self.client.embed(texts, model=self.model_name, **params)

        if not hasattr(result, "embeddings") or not result.embeddings:
            raise RuntimeError(
                "Voyage returned an empty or malformed embeddings response."
            )

        if len(result.embeddings) != len(texts):
            raise RuntimeError(
                f"Voyage returned {len(result.embeddings)} embeddings for {len(texts)} inputs."
            )

        embeddings = result.embeddings

        return embeddings
__init__(model_name='voyage-3.5', api_key=None, default_input_type='document')

Initialize the Voyage embeddings provider.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `model_name` | `str` | Voyage embedding model name (e.g., `"voyage-3.5"`, `"voyage-3-large"`, `"voyage-code-3"`, `"voyage-finance-2"`, `"voyage-law-2"`). | `'voyage-3.5'` |
| `api_key` | `Optional[str]` | Voyage API key. If not provided, reads from the `VOYAGE_API_KEY` environment variable. | `None` |
| `default_input_type` | `Optional[str]` | Default for Voyage's `input_type` parameter (`"document"` or `"query"`). | `'document'` |

Raises:

| Type | Description |
|------|-------------|
| `ImportError` | If the `multimodal` extra (with `voyageai`) is not installed. |
| `ValueError` | If no API key is provided or found in the environment. |
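
Credentials can come from the environment instead of being passed explicitly; a minimal sketch (the key value is a placeholder):

import os

os.environ["VOYAGE_API_KEY"] = "<your-voyage-api-key>"  # normally exported in your shell

embedder = AnthropicEmbedding()  # defaults: model_name="voyage-3.5", default_input_type="document"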

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
def __init__(
    self,
    model_name: str = "voyage-3.5",
    api_key: Optional[str] = None,
    default_input_type: Optional[str] = "document",
) -> None:
    """
    Initialize the Voyage embeddings provider.

    Args:
        model_name:
            Voyage embedding model name (e.g., "voyage-3.5", "voyage-3-large",
            "voyage-code-3", "voyage-finance-2", "voyage-law-2").
        api_key:
            Voyage API key. If not provided, reads from the `VOYAGE_API_KEY`
            environment variable.
        default_input_type:
            Default for Voyage's `input_type` parameter ("document" | "query").

    Raises:
        ImportError: If the `multimodal` extra (with `voyageai`) is not installed.
        ValueError: If no API key is provided or found in the environment.
    """

    if api_key is None:
        api_key = os.getenv("VOYAGE_API_KEY")
        if not api_key:
            raise ValueError(
                "Voyage API key not provided and 'VOYAGE_API_KEY' environment variable is not set."
            )

    self.client = voyageai.Client(api_key=api_key)
    self.model_name = model_name
    self.default_input_type = default_input_type
get_client()

Return the underlying Voyage client.

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
def get_client(self) -> Any:
    """Return the underlying Voyage client."""
    return self.client
embed_text(text, **parameters)

Compute an embedding vector for a single text string.

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """Compute an embedding vector for a single text string."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("`text` must be a non-empty string.")

    params = self._ensure_input_type(parameters)
    result = self.client.embed([text], model=self.model_name, **params)

    if not hasattr(result, "embeddings") or not result.embeddings:
        raise RuntimeError(
            "Voyage returned an empty or malformed embeddings response."
        )

    embedding = result.embeddings[0]
    if not isinstance(embedding, list) or not embedding:
        raise RuntimeError("Voyage returned an invalid embedding vector.")

    return embedding
embed_documents(texts, **parameters)

Compute embeddings for multiple texts in one API call.

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """Compute embeddings for multiple texts in one API call."""
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t.strip() for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    params = self._ensure_input_type(parameters)
    result = self.client.embed(texts, model=self.model_name, **params)

    if not hasattr(result, "embeddings") or not result.embeddings:
        raise RuntimeError(
            "Voyage returned an empty or malformed embeddings response."
        )

    if len(result.embeddings) != len(texts):
        raise RuntimeError(
            f"Voyage returned {len(result.embeddings)} embeddings for {len(texts)} inputs."
        )

    embeddings = result.embeddings

    return embeddings
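
Because `default_input_type` is only a default, it can be overridden per call; this is how the query/document asymmetry mentioned above is expressed. A minimal sketch (texts are illustrative):

from splitter_mr.embedding import AnthropicEmbedding

embedder = AnthropicEmbedding(model_name="voyage-3.5")  # default_input_type="document"

# Corpus side: embedded with the "document" default filled in by _ensure_input_type.
doc_vecs = embedder.embed_documents(["refund policy text", "shipping policy text"])

# Query side: override input_type so Voyage applies its query-specific handling.
query_vec = embedder.embed_text("how do refunds work?", input_type="query")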

HuggingFaceEmbedding

Warning

Currently, only models compatible with the sentence-transformers library are supported.

HuggingFaceEmbedding logo

HuggingFaceEmbedding

Bases: BaseEmbedding

Encoder provider using Hugging Face sentence-transformers models.

This class wraps a local (or HF Hub) SentenceTransformer model to produce dense embeddings for text. It provides a consistent interface with BaseEmbedding and convenient options for device selection and optional input-length validation. This class is available only if splitter-mr[multimodal] is installed.

Example
from splitter_mr.embedding import HuggingFaceEmbedding

# Any sentence-transformers checkpoint works (local path or HF Hub id)
embedder = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-english-r2",
    device="cpu",            # or "cuda", "mps", etc.
    normalize=True,          # L2-normalize outputs
    enforce_max_length=True  # raise if text exceeds model max seq length
)

vector = embedder.embed_text("hello world")
print(vector)
Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
class HuggingFaceEmbedding(BaseEmbedding):
    """
    Encoder provider using Hugging Face `sentence-transformers` models.

    This class wraps a local (or HF Hub) SentenceTransformer model to produce
    dense embeddings for text. It provides a consistent interface with
    `BaseEmbedding` and convenient options for device selection and optional
    input-length validation. This class is available only if
    `splitter-mr[multimodal]` is installed.

    Example:
        ```python
        from splitter_mr.embedding import HuggingFaceEmbedding

        # Any sentence-transformers checkpoint works (local path or HF Hub id)
        embedder = HuggingFaceEmbedding(
            model_name="ibm-granite/granite-embedding-english-r2",
            device="cpu",            # or "cuda", "mps", etc.
            normalize=True,          # L2-normalize outputs
            enforce_max_length=True  # raise if text exceeds model max seq length
        )

        vector = embedder.embed_text("hello world")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
        device: Optional[str] = "cpu",
        normalize: bool = True,
        enforce_max_length: bool = False,
    ) -> None:
        """
        Initialize the sentence-transformers embeddings provider.

        Args:
            model_name:
                SentenceTransformer model id or local path. Examples:
                - `"ibm-granite/granite-embedding-english-r2"`
                - `"sentence-transformers/all-MiniLM-L6-v2"`
                - `"/path/to/local/model"`
            device:
                Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"` or a
                `torch.device`). If omitted, sentence-transformers chooses.
            normalize:
                If True, return L2-normalized embeddings (sets
                `normalize_embeddings=True` in `encode`).
            enforce_max_length:
                If True, attempt to count tokens and raise `ValueError` when
                input exceeds the model's configured max sequence length.
                (If the model/tokenizer does not expose this reliably, the
                check is skipped gracefully.)

        Raises:
            ValueError: If the model cannot be loaded.
        """

        from sentence_transformers import SentenceTransformer

        st_device = str(device) if device is not None else None
        try:
            self.model = SentenceTransformer(model_name, device=st_device)
        except Exception as e:
            raise ValueError(
                f"Failed to load SentenceTransformer '{model_name}': {e}"
            ) from e

        self.model_name = model_name
        self.normalize = normalize
        self.enforce_max_length = enforce_max_length

    def get_client(self) -> "SentenceTransformer":
        """Return the underlying `SentenceTransformer` instance."""
        return self.model

    def _max_seq_length(self) -> Optional[int]:
        """Best-effort retrieval of model's max sequence length."""
        try:
            # sentence-transformers exposes this on the model
            return int(self.model.get_max_seq_length())
        except Exception:
            try:
                # Fallback: some versions expose a `max_seq_length` attribute
                max_len = getattr(self.model, "max_seq_length", None)
                return int(max_len) if max_len is not None else None
            except Exception:
                return None

    def _count_tokens(self, text: str) -> Optional[int]:
        """
        Best-effort token counting via model.tokenize; returns None if unavailable.
        """
        try:
            features = self.model.tokenize([text])  # dict with "input_ids"
            input_ids = features["input_ids"]
            # input_ids is usually a list/array/tensor of shape [batch, seq]
            if isinstance(input_ids, list):
                first = input_ids[0]
                return len(first)
            if torch is not None and torch.is_tensor(input_ids):
                return int(input_ids.shape[1])
            if isinstance(input_ids, np.ndarray):
                return int(input_ids.shape[1])
        except Exception:
            pass
        return None

    def _validate_length_if_needed(self, text: str) -> None:
        """Raise ValueError if enforce_max_length=True and text is too long."""
        if not self.enforce_max_length:
            return
        max_len = self._max_seq_length()
        tok_count = self._count_tokens(text)
        if max_len is not None and tok_count is not None and tok_count > max_len:
            raise ValueError(
                f"Input exceeds model max sequence length ({tok_count} > {max_len} tokens)."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text:
                The text to embed. Must be non-empty. If `enforce_max_length`
                is True, a ValueError is raised when it exceeds the model limit.
            **parameters:
                Extra keyword arguments forwarded to `SentenceTransformer.encode`.
                Common options include:
                  - `batch_size` (int)
                  - `show_progress_bar` (bool)
                  - `convert_to_tensor` (bool)  # will be forced False here
                  - `device` (str)
                  - `normalize_embeddings` (bool)

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds length constraints (when enforced).
            RuntimeError: If the embedding call fails unexpectedly.
        """
        if not isinstance(text, str) or not text:
            raise ValueError("`text` must be a non-empty string.")

        self._validate_length_if_needed(text)

        # Ensure Python list output
        parameters = dict(parameters)  # shallow copy
        parameters["convert_to_tensor"] = False
        parameters.setdefault("normalize_embeddings", self.normalize)

        try:
            # `encode` accepts a single string and returns a 1D array-like
            vec = self.model.encode(text, **parameters)
        except Exception as e:
            raise RuntimeError(f"Embedding call failed: {e}") from e

        # Normalize output to List[float]
        if isinstance(vec, np.ndarray):
            return vec.astype(np.float32, copy=False).tolist()
        if torch is not None and hasattr(vec, "detach"):
            return vec.detach().cpu().float().tolist()
        if isinstance(vec, (list, tuple)):
            return [float(x) for x in vec]
        # Anything else: try to coerce
        try:
            return list(map(float, vec))  # type: ignore[arg-type]
        except Exception as e:
            raise RuntimeError(f"Unexpected embedding output type: {type(vec)}") from e

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts efficiently using `encode`.

        Args:
            texts:
                List of input strings to embed. Must be non-empty and contain
                only non-empty strings. Length enforcement is applied per item
                if `enforce_max_length=True`.
            **parameters:
                Extra keyword arguments forwarded to `SentenceTransformer.encode`.
                Common options:
                  - `batch_size` (int)
                  - `show_progress_bar` (bool)
                  - `convert_to_tensor` (bool)  # will be forced False here
                  - `device` (str)
                  - `normalize_embeddings` (bool)

        Returns:
            List[List[float]]: One embedding per input string.

        Raises:
            ValueError: If `texts` is empty or any element is empty/non-string.
            RuntimeError: If the embedding call fails unexpectedly.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any((not isinstance(t, str) or not t) for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        if self.enforce_max_length:
            for t in texts:
                self._validate_length_if_needed(t)

        parameters = dict(parameters)
        parameters["convert_to_tensor"] = False
        parameters.setdefault("normalize_embeddings", self.normalize)

        try:
            # Returns ndarray (n, d) or list-of-lists
            mat = self.model.encode(texts, **parameters)
        except Exception as e:
            raise RuntimeError(f"Batch embedding call failed: {e}") from e

        if isinstance(mat, np.ndarray):
            return mat.astype(np.float32, copy=False).tolist()
        if torch is not None and hasattr(mat, "detach"):
            return mat.detach().cpu().float().tolist()
        if (
            isinstance(mat, list)
            and mat  # noqa: W503
            and isinstance(mat[0], (list, tuple, float, int))  # noqa: W503
        ):
            # Already python lists (ST often returns this when convert_to_tensor=False)
            if mat and isinstance(mat[0], (float, int)):  # single vector in a flat list
                return [list(map(float, mat))]
            return [list(map(float, row)) for row in mat]  # type: ignore[arg-type]

        raise RuntimeError(f"Unexpected batch embedding output type: {type(mat)}")
__init__(model_name='sentence-transformers/all-MiniLM-L6-v2', device='cpu', normalize=True, enforce_max_length=False)

Initialize the sentence-transformers embeddings provider.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `model_name` | `str` | SentenceTransformer model id or local path, e.g. `"ibm-granite/granite-embedding-english-r2"`, `"sentence-transformers/all-MiniLM-L6-v2"`, or `"/path/to/local/model"`. | `'sentence-transformers/all-MiniLM-L6-v2'` |
| `device` | `Optional[str]` | Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"`, or a `torch.device`). If omitted, sentence-transformers chooses. | `'cpu'` |
| `normalize` | `bool` | If True, return L2-normalized embeddings (sets `normalize_embeddings=True` in `encode`). | `True` |
| `enforce_max_length` | `bool` | If True, attempt to count tokens and raise `ValueError` when input exceeds the model's configured max sequence length; if the model/tokenizer does not expose this reliably, the check is skipped gracefully. See the sketch after the source listing below. | `False` |

Raises:

| Type | Description |
|------|-------------|
| `ValueError` | If the model cannot be loaded. |

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
def __init__(
    self,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    device: Optional[str] = "cpu",
    normalize: bool = True,
    enforce_max_length: bool = False,
) -> None:
    """
    Initialize the sentence-transformers embeddings provider.

    Args:
        model_name:
            SentenceTransformer model id or local path. Examples:
            - `"ibm-granite/granite-embedding-english-r2"`
            - `"sentence-transformers/all-MiniLM-L6-v2"`
            - `"/path/to/local/model"`
        device:
            Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"` or a
            `torch.device`). If omitted, sentence-transformers chooses.
        normalize:
            If True, return L2-normalized embeddings (sets
            `normalize_embeddings=True` in `encode`).
        enforce_max_length:
            If True, attempt to count tokens and raise `ValueError` when
            input exceeds the model's configured max sequence length.
            (If the model/tokenizer does not expose this reliably, the
            check is skipped gracefully.)

    Raises:
        ValueError: If the model cannot be loaded.
    """

    from sentence_transformers import SentenceTransformer

    st_device = str(device) if device is not None else None
    try:
        self.model = SentenceTransformer(model_name, device=st_device)
    except Exception as e:
        raise ValueError(
            f"Failed to load SentenceTransformer '{model_name}': {e}"
        ) from e

    self.model_name = model_name
    self.normalize = normalize
    self.enforce_max_length = enforce_max_length
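
As referenced in the parameter table above, a minimal sketch of `enforce_max_length` in action (the oversized input is synthetic):

from splitter_mr.embedding import HuggingFaceEmbedding

embedder = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    enforce_max_length=True,  # validate token count before encoding
)

too_long = "word " * 10_000  # well beyond this checkpoint's max sequence length
try:
    embedder.embed_text(too_long)
except ValueError as err:
    print(f"Rejected: {err}")
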
get_client()

Return the underlying SentenceTransformer instance.

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
def get_client(self) -> "SentenceTransformer":
    """Return the underlying `SentenceTransformer` instance."""
    return self.model
embed_text(text, **parameters)

Compute an embedding vector for a single text string.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text` | `str` | The text to embed. Must be non-empty. If `enforce_max_length` is True, a `ValueError` is raised when it exceeds the model limit. | *required* |
| `**parameters` | `Any` | Extra keyword arguments forwarded to `SentenceTransformer.encode`, e.g. `batch_size` (int), `show_progress_bar` (bool), `convert_to_tensor` (bool, forced to False here), `device` (str), `normalize_embeddings` (bool). | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `List[float]` | The computed embedding vector. |

Raises:

| Type | Description |
|------|-------------|
| `ValueError` | If `text` is empty or exceeds length constraints (when enforced). |
| `RuntimeError` | If the embedding call fails unexpectedly. |

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text:
            The text to embed. Must be non-empty. If `enforce_max_length`
            is True, a ValueError is raised when it exceeds the model limit.
        **parameters:
            Extra keyword arguments forwarded to `SentenceTransformer.encode`.
            Common options include:
              - `batch_size` (int)
              - `show_progress_bar` (bool)
              - `convert_to_tensor` (bool)  # will be forced False here
              - `device` (str)
              - `normalize_embeddings` (bool)

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds length constraints (when enforced).
        RuntimeError: If the embedding call fails unexpectedly.
    """
    if not isinstance(text, str) or not text:
        raise ValueError("`text` must be a non-empty string.")

    self._validate_length_if_needed(text)

    # Ensure Python list output
    parameters = dict(parameters)  # shallow copy
    parameters["convert_to_tensor"] = False
    parameters.setdefault("normalize_embeddings", self.normalize)

    try:
        # `encode` accepts a single string and returns a 1D array-like
        vec = self.model.encode(text, **parameters)
    except Exception as e:
        raise RuntimeError(f"Embedding call failed: {e}") from e

    # Normalize output to List[float]
    if isinstance(vec, np.ndarray):
        return vec.astype(np.float32, copy=False).tolist()
    if torch is not None and hasattr(vec, "detach"):
        return vec.detach().cpu().float().tolist()
    if isinstance(vec, (list, tuple)):
        return [float(x) for x in vec]
    # Anything else: try to coerce
    try:
        return list(map(float, vec))  # type: ignore[arg-type]
    except Exception as e:
        raise RuntimeError(f"Unexpected embedding output type: {type(vec)}") from e
embed_documents(texts, **parameters)

Compute embeddings for multiple texts efficiently using encode.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `texts` | `List[str]` | List of input strings to embed. Must be non-empty and contain only non-empty strings. Length enforcement is applied per item if `enforce_max_length=True`. | *required* |
| `**parameters` | `Any` | Extra keyword arguments forwarded to `SentenceTransformer.encode`, e.g. `batch_size` (int), `show_progress_bar` (bool), `convert_to_tensor` (bool, forced to False here), `device` (str), `normalize_embeddings` (bool). | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `List[List[float]]` | One embedding per input string. |

Raises:

| Type | Description |
|------|-------------|
| `ValueError` | If `texts` is empty or any element is empty/non-string. |
| `RuntimeError` | If the embedding call fails unexpectedly. |

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts efficiently using `encode`.

    Args:
        texts:
            List of input strings to embed. Must be non-empty and contain
            only non-empty strings. Length enforcement is applied per item
            if `enforce_max_length=True`.
        **parameters:
            Extra keyword arguments forwarded to `SentenceTransformer.encode`.
            Common options:
              - `batch_size` (int)
              - `show_progress_bar` (bool)
              - `convert_to_tensor` (bool)  # will be forced False here
              - `device` (str)
              - `normalize_embeddings` (bool)

    Returns:
        List[List[float]]: One embedding per input string.

    Raises:
        ValueError: If `texts` is empty or any element is empty/non-string.
        RuntimeError: If the embedding call fails unexpectedly.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any((not isinstance(t, str) or not t) for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    if self.enforce_max_length:
        for t in texts:
            self._validate_length_if_needed(t)

    parameters = dict(parameters)
    parameters["convert_to_tensor"] = False
    parameters.setdefault("normalize_embeddings", self.normalize)

    try:
        # Returns ndarray (n, d) or list-of-lists
        mat = self.model.encode(texts, **parameters)
    except Exception as e:
        raise RuntimeError(f"Batch embedding call failed: {e}") from e

    if isinstance(mat, np.ndarray):
        return mat.astype(np.float32, copy=False).tolist()
    if torch is not None and hasattr(mat, "detach"):
        return mat.detach().cpu().float().tolist()
    if (
        isinstance(mat, list)
        and mat  # noqa: W503
        and isinstance(mat[0], (list, tuple, float, int))  # noqa: W503
    ):
        # Already python lists (ST often returns this when convert_to_tensor=False)
        if mat and isinstance(mat[0], (float, int)):  # single vector in a flat list
            return [list(map(float, mat))]
        return [list(map(float, row)) for row in mat]  # type: ignore[arg-type]

    raise RuntimeError(f"Unexpected batch embedding output type: {type(mat)}")
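
Since `normalize=True` yields L2-normalized vectors, cosine similarity reduces to a plain dot product; a minimal sketch (model and sentences are illustrative):

import numpy as np

from splitter_mr.embedding import HuggingFaceEmbedding

embedder = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    normalize=True,  # unit-length outputs
)

vecs = np.array(embedder.embed_documents(
    [
        "The cat sat on the mat.",
        "A feline rested on the rug.",
        "Quarterly revenue grew by 12%.",
    ],
    batch_size=2,  # forwarded to SentenceTransformer.encode
))

# With unit-length rows, the Gram matrix is the cosine-similarity matrix.
print(np.round(vecs @ vecs.T, 3))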