Embedding Models¶

Overview¶

Encoder models are the engines that produce embeddings: distributed, vectorized representations of a text. Embeddings capture relationships between semantic units (most commonly words, but also sentences or even multimodal content such as images).

These embeddings can be used in a variety of tasks, such as:

Measuring how relevant a word is within a text.
Comparing the similarity between two pieces of text.
Powering search, clustering, and recommendation systems.
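
As an illustration of the similarity use case, the sketch below compares vectors with cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a helper written for this example, not part of SplitterMR.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimensionality.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.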

Example of an embedding representation

SplitterMR takes advantage of these models in SemanticSplitter. These representations are used to break text into chunks based on meaning, not just size. Sentences with similar context end up together, regardless of length or position.
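
To make the idea of meaning-based chunking concrete, here is a minimal sketch of one common approach: start a new chunk whenever the similarity between consecutive sentence embeddings drops below a threshold. This is an assumption for illustration and not necessarily how SemanticSplitter is implemented; `semantic_chunks`, `toy_embed`, and the tiny vocabulary are hypothetical.

```python
import math
from typing import Callable, List


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], List[float]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Group consecutive sentences whose embeddings stay above `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if _cosine(prev, curr) >= threshold:
            chunks[-1].append(sentence)  # same topic: extend current chunk
        else:
            chunks.append([sentence])  # topic shift: open a new chunk
    return chunks


# Toy embedder: counts occurrences of a few keywords (stand-in for a real model).
VOCAB = ["cat", "dog", "tax", "invoice"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().replace(".", "").split()
    return [float(words.count(w)) for w in VOCAB]


sentences = [
    "The cat chased the dog.",
    "The dog ignored the cat.",
    "The invoice lists the tax.",
]
print(semantic_chunks(sentences, toy_embed))
# [['The cat chased the dog.', 'The dog ignored the cat.'], ['The invoice lists the tax.']]
```

In practice, `embed` would be the `embed_text` method of one of the embedders described below.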

Which embedder should I use?¶

All embedders inherit from BaseEmbedding and expose the same interface for generating embeddings. Choose based on your cloud provider, credentials, and compliance needs.

Model	When to use	Requirements	Features
OpenAIEmbedding	You have an OpenAI API key and want to use OpenAI’s hosted embeddings	`OPENAI_API_KEY`	Production-ready text embeddings; simple setup; broad ecosystem/tooling support.
AzureOpenAIEmbedding	Your organization uses Azure OpenAI Services	`AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_DEPLOYMENT`	Enterprise controls, Azure compliance & data residency; integrates with Azure identity.
GeminiEmbedding	You want Google’s Gemini text embeddings	`GEMINI_API_KEY` + Multimodal extra: `pip install 'splitter-mr[multimodal]'`	Google Gemini API; modern, high-quality text embeddings.
AnthropicEmbeddings	You want embeddings aligned with Anthropic guidance (via Voyage AI)	`VOYAGE_API_KEY` + Multimodal extra: `pip install 'splitter-mr[multimodal]'`	Voyage AI embeddings (general, code, finance, law, multimodal); supports `input_type` for query/document asymmetry.
HuggingFaceEmbedding	You prefer local, open-source models (Sentence-Transformers) or need offline capability	Multimodal extra: `pip install 'splitter-mr[multimodal]'` (optional: `HF_ACCESS_TOKEN`, only for models that require it)	No API key; huge model zoo; CPU/GPU/MPS; optional L2 normalization for cosine similarity.
BaseEmbedding	Abstract base, not used directly	–	Implement to plug in a custom or self-hosted embedder.

Note

To bring your own embedding provider, subclass BaseEmbedding and implement its abstract methods.
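
A custom provider might look like the sketch below. `HashingEmbedding` is a hypothetical toy embedder invented for this example, and the import path `splitter_mr.embedding.BaseEmbedding` is an assumption; if it cannot be imported, the sketch falls back to a stand-in with the same abstract interface so that it still runs on its own.

```python
import hashlib
from typing import Any, List

try:
    # Assumed import path; adjust if BaseEmbedding is exposed elsewhere.
    from splitter_mr.embedding import BaseEmbedding
except ImportError:
    # Stand-in mirroring the documented abstract interface.
    from abc import ABC, abstractmethod

    class BaseEmbedding(ABC):
        @abstractmethod
        def __init__(self, model_name: str) -> None: ...

        @abstractmethod
        def get_client(self) -> Any: ...

        @abstractmethod
        def embed_text(self, text: str, **parameters: Any) -> List[float]: ...

        def embed_documents(
            self, texts: List[str], **parameters: Any
        ) -> List[List[float]]:
            if not texts:
                raise ValueError("`texts` must be a non-empty list of strings.")
            return [self.embed_text(t, **parameters) for t in texts]


class HashingEmbedding(BaseEmbedding):
    """Hypothetical toy embedder: hashes tokens into a fixed-size count vector."""

    def __init__(self, model_name: str = "hashing-demo", dimensions: int = 8) -> None:
        self.model_name = model_name
        self.dimensions = dimensions

    def get_client(self) -> None:
        # Purely local implementation, so there is no client to return.
        return None

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[digest[0] % self.dimensions] += 1.0  # bucket by first hash byte
        return vector


embedder = HashingEmbedding()
vectors = embedder.embed_documents(["hello world", "hello again"])
print(len(vectors), len(vectors[0]))  # 2 8
```

Only `__init__`, `get_client`, and `embed_text` need to be implemented; `embed_documents` is inherited and loops over `embed_text`.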

Embedders¶

BaseEmbedding¶

`BaseEmbedding` ¶

Bases: ABC

Abstract base for text embedding providers.

Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local models) and expose a consistent interface to convert text into numeric vectors suitable for similarity search, clustering, and retrieval-augmented generation.

Source code in src/splitter_mr/embedding/base_embedding.py

class BaseEmbedding(ABC):
    """
    Abstract base for text embedding providers.

    Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local
    models) and expose a consistent interface to convert text into numeric
    vectors suitable for similarity search, clustering, and retrieval-augmented
    generation.
    """

    @abstractmethod
    def __init__(self, model_name: str) -> None:
        """Initialize the embedding backend.

        Args:
            model_name (str): Identifier of the embedding model (e.g.,
                ``"text-embedding-3-large"`` or a local model alias/path).

        Raises:
            ValueError: If required configuration or credentials are missing.
        """

    @abstractmethod
    def get_client(self) -> Any:
        """Return the underlying client or handle.

        Returns:
            Any: A client/handle used to perform embedding calls (e.g., an SDK
                client instance, session object, or local runner). May be ``None``
                for pure-local implementations that do not require a client.
        """

    @abstractmethod
    def embed_text(
        self,
        text: str,
        **parameters: Any,
    ) -> List[float]:
        """
        Compute an embedding vector for the given text.

        Args:
            text (str): Input text to embed. Implementations may apply
                normalization or truncation according to model limits.
            **parameters (Any): Additional backend-specific options
                forwarded to the implementation (e.g., user tags, request IDs).

        Returns:
            A single embedding vector representing ``text``.

        Raises:
            ValueError: If ``text`` is empty or exceeds backend constraints.
            RuntimeError: If the embedding call fails or returns an unexpected
                response shape.
        """

    def embed_documents(
        self,
        texts: List[str],
        **parameters: Any,
    ) -> List[List[float]]:
        """Compute embeddings for multiple texts (default loops over `embed_text`).

        Implementations are encouraged to override for true batch performance.

        Args:
            texts: List of input strings to embed.
            **parameters: Backend-specific options.

        Returns:
            List of embedding vectors, one per input string.

        Raises:
            ValueError: If `texts` is empty or any element is empty.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        return [self.embed_text(t, **parameters) for t in texts]

`__init__(model_name)` `abstractmethod` ¶

Initialize the embedding backend.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	Identifier of the embedding model (e.g., `"text-embedding-3-large"` or a local model alias/path).	required

Raises:

Type	Description
`ValueError`	If required configuration or credentials are missing.

Source code in src/splitter_mr/embedding/base_embedding.py

@abstractmethod
def __init__(self, model_name: str) -> None:
    """Initialize the embedding backend.

    Args:
        model_name (str): Identifier of the embedding model (e.g.,
            ``"text-embedding-3-large"`` or a local model alias/path).

    Raises:
        ValueError: If required configuration or credentials are missing.
    """

`get_client()` `abstractmethod` ¶

Return the underlying client or handle.

Returns:

Name	Type	Description
`Any`	`Any`	A client/handle used to perform embedding calls (e.g., an SDK client instance, session object, or local runner). May be `None` for pure-local implementations that do not require a client.

Source code in src/splitter_mr/embedding/base_embedding.py

@abstractmethod
def get_client(self) -> Any:
    """Return the underlying client or handle.

    Returns:
        Any: A client/handle used to perform embedding calls (e.g., an SDK
            client instance, session object, or local runner). May be ``None``
            for pure-local implementations that do not require a client.
    """

`embed_text(text, **parameters)` `abstractmethod` ¶

Compute an embedding vector for the given text.

Parameters:

Name	Type	Description	Default
`text`	`str`	Input text to embed. Implementations may apply normalization or truncation according to model limits.	required
`**parameters`	`Any`	Additional backend-specific options forwarded to the implementation (e.g., user tags, request IDs).	`{}`

Returns:

Type	Description
`List[float]`	A single embedding vector representing `text`.

Raises:

Type	Description
`ValueError`	If `text` is empty or exceeds backend constraints.
`RuntimeError`	If the embedding call fails or returns an unexpected response shape.

Source code in src/splitter_mr/embedding/base_embedding.py

@abstractmethod
def embed_text(
    self,
    text: str,
    **parameters: Any,
) -> List[float]:
    """
    Compute an embedding vector for the given text.

    Args:
        text (str): Input text to embed. Implementations may apply
            normalization or truncation according to model limits.
        **parameters (Any): Additional backend-specific options
            forwarded to the implementation (e.g., user tags, request IDs).

    Returns:
        A single embedding vector representing ``text``.

    Raises:
        ValueError: If ``text`` is empty or exceeds backend constraints.
        RuntimeError: If the embedding call fails or returns an unexpected
            response shape.
    """

`embed_documents(texts, **parameters)` ¶

Compute embeddings for multiple texts. The default implementation loops over `embed_text`.

Implementations are encouraged to override for true batch performance.

Parameters:

Name	Type	Description	Default
`texts`	`List[str]`	List of input strings to embed.	required
`**parameters`	`Any`	Backend-specific options.	`{}`

Returns:

Type	Description
`List[List[float]]`	List of embedding vectors, one per input string.

Raises:

Type	Description
`ValueError`	If `texts` is empty or any element is empty.

Source code in src/splitter_mr/embedding/base_embedding.py

def embed_documents(
    self,
    texts: List[str],
    **parameters: Any,
) -> List[List[float]]:
    """Compute embeddings for multiple texts (default loops over `embed_text`).

    Implementations are encouraged to override for true batch performance.

    Args:
        texts: List of input strings to embed.
        **parameters: Backend-specific options.

    Returns:
        List of embedding vectors, one per input string.

    Raises:
        ValueError: If `texts` is empty or any element is empty.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    return [self.embed_text(t, **parameters) for t in texts]
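
Since the default implementation issues one call per text, a subclass that talks to a remote backend may prefer to send inputs in groups. The following sketch shows one way to do that; `batched`, `embed_in_batches`, and `fake_backend` are hypothetical names for this example, and the batch size is arbitrary.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("`batch_size` must be at least 1.")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[List[float]]:
    """Embed `texts` by calling `embed_batch` once per batch, preserving order."""
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


calls: List[int] = []


def fake_backend(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real batch endpoint: records the batch size per call."""
    calls.append(len(batch))
    return [[float(len(t))] for t in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_backend, batch_size=2)
print(vectors, calls)  # [[1.0], [2.0], [3.0]] [2, 1]
```

An overriding `embed_documents` could delegate to a helper like this, replacing `fake_backend` with the provider's batch call.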

OpenAIEmbedding¶


`OpenAIEmbedding` ¶

Bases: BaseEmbedding

Encoder provider using OpenAI's embeddings API.

This class wraps OpenAI's embeddings endpoint, providing convenience methods for both single-text and batch embeddings. It also adds token counting and validation to avoid exceeding model limits.

Example

from splitter_mr.embedding import OpenAIEmbedding

embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
vector = embedder.embed_text("hello world")
print(vector)
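
Because inputs over the token limit are rejected with a `ValueError`, callers may want to split long texts before embedding them. The sketch below is one possible pre-processing step, not part of the library: `split_to_fit` is a hypothetical helper, and whitespace word counting stands in for a real tokenizer such as `tiktoken`, so the counts are only approximate.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Rough stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_to_fit(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> List[str]:
    """Greedily pack words into pieces of at most `max_tokens` tokens each."""
    if max_tokens < 1:
        raise ValueError("`max_tokens` must be at least 1.")
    pieces: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = current + [word]
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            pieces.append(" ".join(current))  # close the piece that is full
            current = [word]
        else:
            current = candidate
    if current:
        pieces.append(" ".join(current))
    return pieces


print(split_to_fit("one two three four five", max_tokens=2))
# ['one two', 'three four', 'five']
```

Each resulting piece can then be passed to `embed_text`, or the whole list to `embed_documents`. Note that a single word that exceeds the limit under a real tokenizer is still emitted as its own piece and would need separate handling.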

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py

class OpenAIEmbedding(BaseEmbedding):
    """
    Encoder provider using OpenAI's embeddings API.

    This class wraps OpenAI's embeddings endpoint, providing convenience
    methods for both single-text and batch embeddings. It also adds token
    counting and validation to avoid exceeding model limits.

    Example:
        ```python
        from splitter_mr.embedding import OpenAIEmbedding

        embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
        vector = embedder.embed_text("hello world")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "text-embedding-3-large",
        api_key: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
    ) -> None:
        """
        Initialize the OpenAI embeddings provider.

        Args:
            model_name (str):
                The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).
            api_key (Optional[str]):
                API key for OpenAI. If not provided, reads from the
                `OPENAI_API_KEY` environment variable.
            tokenizer_name (Optional[str]):
                Optional explicit tokenizer name for `tiktoken`. If provided,
                this overrides automatic model-to-tokenizer mapping.

        Raises:
            ValueError: If no API key is passed and the `OPENAI_API_KEY` environment variable is not set.
        """
        if api_key is None:
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
                )
        self.client = OpenAI(api_key=api_key)
        self.model_name = model_name
        self._tokenizer_name = tokenizer_name

    def get_client(self) -> OpenAI:
        """
        Get the configured OpenAI client.

        Returns:
            OpenAI: The OpenAI API client instance.
        """
        return self.client

    def _get_encoder(self):
        """
        Retrieve the `tiktoken` encoder for the configured model.

        If a `tokenizer_name` is explicitly provided, it is used. Otherwise,
        attempts to use `tiktoken.encoding_for_model`. If that fails, falls
        back to the default tokenizer defined by `OPENAI_EMBEDDING_MODEL_FALLBACK`.

        Returns:
            tiktoken.Encoding: The encoding object for tokenizing text.

        Raises:
            ValueError: If neither the model-specific nor fallback encoder
            can be loaded.
        """
        if self._tokenizer_name:
            return tiktoken.get_encoding(self._tokenizer_name)
        try:
            return tiktoken.encoding_for_model(self.model_name)
        except Exception:
            return tiktoken.get_encoding(OPENAI_EMBEDDING_MODEL_FALLBACK)

    def _count_tokens(self, text: str) -> int:
        """
        Count the number of tokens in the given text.

        Args:
            text (str): The text to tokenize.

        Returns:
            int: Number of tokens.
        """
        encoder = self._get_encoder()
        return len(encoder.encode(text))

    def _validate_token_length(self, text: str) -> None:
        """
        Ensure the text does not exceed the model's token limit.

        Args:
            text (str): The text to check.

        Raises:
            ValueError: If the token count exceeds `OPENAI_EMBEDDING_MAX_TOKENS`.
        """
        if self._count_tokens(text) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"Input text exceeds maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text (str):
                The text to embed. Must be non-empty and within the model's
                token limit.
            **parameters:
                Additional keyword arguments forwarded to
                `client.embeddings.create(...)`.

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds the token limit.
        """
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        self._validate_token_length(text)

        response = self.client.embeddings.create(
            input=text,
            model=self.model_name,
            **parameters,
        )
        return response.data[0].embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts in one API call.

        Args:
            texts (List[str]):
                List of text strings to embed. All must be non-empty and within
                the model's token limit.
            **parameters:
                Additional keyword arguments forwarded to
                `client.embeddings.create(...)`.

        Returns:
            A list of embedding vectors, one per input string.

        Raises:
            ValueError:
                - If `texts` is empty.
                - If any text is empty or not a string.
                - If any text exceeds the token limit.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        encoder = self._get_encoder()
        for t in texts:
            if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
                raise ValueError(
                    f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
                )

        response = self.client.embeddings.create(
            input=texts,
            model=self.model_name,
            **parameters,
        )
        return [data.embedding for data in response.data]

`__init__(model_name='text-embedding-3-large', api_key=None, tokenizer_name=None)` ¶

Initialize the OpenAI embeddings provider.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).	`'text-embedding-3-large'`
`api_key`	`Optional[str]`	API key for OpenAI. If not provided, reads from the `OPENAI_API_KEY` environment variable.	`None`
`tokenizer_name`	`Optional[str]`	Optional explicit tokenizer name for `tiktoken`. If provided, this overrides automatic model-to-tokenizer mapping.	`None`

Raises:

Type	Description
`ValueError`	If no API key is passed and the `OPENAI_API_KEY` environment variable is not set.

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py

def __init__(
    self,
    model_name: str = "text-embedding-3-large",
    api_key: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
) -> None:
    """
    Initialize the OpenAI embeddings provider.

    Args:
        model_name (str):
            The OpenAI embedding model name (e.g., `"text-embedding-3-large"`).
        api_key (Optional[str]):
            API key for OpenAI. If not provided, reads from the
            `OPENAI_API_KEY` environment variable.
        tokenizer_name (Optional[str]):
            Optional explicit tokenizer name for `tiktoken`. If provided,
            this overrides automatic model-to-tokenizer mapping.

    Raises:
        ValueError: If no API key is passed and the `OPENAI_API_KEY` environment variable is not set.
    """
    if api_key is None:
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OpenAI API key not provided or 'OPENAI_API_KEY' env var is not set."
            )
    self.client = OpenAI(api_key=api_key)
    self.model_name = model_name
    self._tokenizer_name = tokenizer_name

`get_client()` ¶

Get the configured OpenAI client.

Returns:

Name	Type	Description
`OpenAI`	`OpenAI`	The OpenAI API client instance.

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py

def get_client(self) -> OpenAI:
    """
    Get the configured OpenAI client.

    Returns:
        OpenAI: The OpenAI API client instance.
    """
    return self.client

`embed_text(text, **parameters)` ¶

Compute an embedding vector for a single text string.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to embed. Must be non-empty and within the model's token limit.	required
`**parameters`	`Any`	Additional keyword arguments forwarded to `client.embeddings.create(...)`.	`{}`

Returns:

Type	Description
`List[float]`	The computed embedding vector.

Raises:

Type	Description
`ValueError`	If `text` is empty or exceeds the token limit.

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py

def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text (str):
            The text to embed. Must be non-empty and within the model's
            token limit.
        **parameters:
            Additional keyword arguments forwarded to
            `client.embeddings.create(...)`.

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds the token limit.
    """
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    self._validate_token_length(text)

    response = self.client.embeddings.create(
        input=text,
        model=self.model_name,
        **parameters,
    )
    return response.data[0].embedding

`embed_documents(texts, **parameters)` ¶

Compute embeddings for multiple texts in one API call.

Parameters:

Name	Type	Description	Default
`texts`	`List[str]`	List of text strings to embed. All must be non-empty and within the model's token limit.	required
`**parameters`	`Any`	Additional keyword arguments forwarded to `client.embeddings.create(...)`.	`{}`

Returns:

Type	Description
`List[List[float]]`	A list of embedding vectors, one per input string.

Raises:

Type	Description
`ValueError`	If `texts` is empty, if any text is empty or not a string, or if any text exceeds the token limit.

Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py

def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts in one API call.

    Args:
        texts (List[str]):
            List of text strings to embed. All must be non-empty and within
            the model's token limit.
        **parameters:
            Additional keyword arguments forwarded to
            `client.embeddings.create(...)`.

    Returns:
        A list of embedding vectors, one per input string.

    Raises:
        ValueError:
            - If `texts` is empty.
            - If any text is empty or not a string.
            - If any text exceeds the token limit.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    encoder = self._get_encoder()
    for t in texts:
        if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    response = self.client.embeddings.create(
        input=texts,
        model=self.model_name,
        **parameters,
    )
    return [data.embedding for data in response.data]

AzureOpenAIEmbedding¶


`AzureOpenAIEmbedding` ¶

Bases: BaseEmbedding

Encoder provider using Azure OpenAI Embeddings.

This class wraps Azure OpenAI's embeddings API, handling both authentication and tokenization. It supports direct embedding calls for a single text (`embed_text`) and batch embedding calls (`embed_documents`).

Azure deployments use deployment names (e.g., `my-embedding-deployment`) instead of OpenAI's standard model names. Since `tiktoken` may not be able to map a deployment name to a tokenizer automatically, this class falls back to a known encoding (e.g., `cl100k_base`) when necessary.

Example

from splitter_mr.embedding import AzureOpenAIEmbedding

embedder = AzureOpenAIEmbedding(
    azure_deployment="text-embedding-3-large",
    api_key="...",
    azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
)
vector = embedder.embed_text("Hello world")
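
The tokenizer fallback described above follows a simple precedence: an explicit tokenizer name wins, then a lookup by model or deployment name, then a default encoding. The sketch below mimics that order with a plain dictionary instead of `tiktoken` so that it runs without extra dependencies. The mapping table and the `cl100k_base` fallback value are assumptions for illustration, and `resolve_encoding_name` is a hypothetical helper.

```python
from typing import Dict, Optional

# Illustrative subset of model-to-encoding mappings (assumption for this sketch).
KNOWN_MODEL_ENCODINGS: Dict[str, str] = {
    "text-embedding-3-large": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
}
FALLBACK_ENCODING = "cl100k_base"  # assumed default when the name is unknown


def resolve_encoding_name(
    deployment: str, tokenizer_name: Optional[str] = None
) -> str:
    """Pick an encoding: explicit name, then known model, then fallback."""
    if tokenizer_name:
        return tokenizer_name
    try:
        return KNOWN_MODEL_ENCODINGS[deployment]
    except KeyError:
        # Custom deployment names are not in the mapping, so use the default.
        return FALLBACK_ENCODING


print(resolve_encoding_name("my-embedding-deployment"))  # cl100k_base
print(resolve_encoding_name("my-embedding-deployment", tokenizer_name="o200k_base"))  # o200k_base
```

If a deployment wraps a model that uses a different encoding, passing `tokenizer_name` explicitly avoids relying on the fallback.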

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py

class AzureOpenAIEmbedding(BaseEmbedding):
    """
    Encoder provider using Azure OpenAI Embeddings.

    This class wraps Azure OpenAI's embeddings API, handling both authentication
    and tokenization. It supports both direct embedding calls for a single text
    (`embed_text`) and batch embedding calls (`embed_documents`).

    Azure deployments use *deployment names* (e.g., `my-embedding-deployment`)
    instead of OpenAI's standard model names. Since `tiktoken` may not be able to
    map a deployment name to a tokenizer automatically, this class implements
    a fallback mechanism to use a known encoding (e.g., `cl100k_base`) when necessary.

    Example:
        ```python
        from splitter_mr.embedding import AzureOpenAIEmbedding

        embedder = AzureOpenAIEmbedding(
            azure_deployment="text-embedding-3-large",
            api_key="...",
            azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
        )
        vector = embedder.embed_text("Hello world")
        ```
    """

    def __init__(
        self,
        model_name: Optional[str] = None,
        api_key: Optional[str] = None,
        azure_endpoint: Optional[str] = None,
        azure_deployment: Optional[str] = None,
        api_version: Optional[str] = None,
        tokenizer_name: Optional[str] = None,
    ) -> None:
        """
        Initialize the Azure OpenAI Embedding provider.

        Args:
            model_name (Optional[str]):
                Kept for API parity with the other embedders. It is used as
                the deployment name only when neither `azure_deployment` nor
                `AZURE_OPENAI_DEPLOYMENT` is set.
            api_key (Optional[str]):
                API key for Azure OpenAI. If not provided, it will be read from
                the environment variable `AZURE_OPENAI_API_KEY`.
            azure_endpoint (Optional[str]):
                The base endpoint for the Azure OpenAI service. If not provided,
                it will be read from `AZURE_OPENAI_ENDPOINT`.
            azure_deployment (Optional[str]):
                Deployment name for the embeddings model in Azure OpenAI. If not
                provided, it will be read from `AZURE_OPENAI_DEPLOYMENT`,
                falling back to `model_name`.
            api_version (Optional[str]):
                Azure API version string. If not provided, it will be read from
                `AZURE_OPENAI_API_VERSION`, falling back to `"2025-04-14-preview"`.
            tokenizer_name (Optional[str]):
                Optional explicit tokenizer name for `tiktoken` (e.g.,
                `"cl100k_base"`). If provided, it overrides the automatic mapping.

        Raises:
            ValueError: If a required setting is neither passed as an argument nor found in the environment variables.
        """
        if api_key is None:
            api_key = os.getenv("AZURE_OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
                )

        if azure_endpoint is None:
            azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
            if not azure_endpoint:
                raise ValueError(
                    "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
                )

        if azure_deployment is None:
            azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
            if not azure_deployment:
                raise ValueError(
                    "Azure deployment name not provided. Set 'azure_deployment', "
                    "'AZURE_OPENAI_DEPLOYMENT', or pass `model_name`."
                )

        if api_version is None:
            api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

        self.client = AzureOpenAI(
            api_key=api_key,
            azure_endpoint=azure_endpoint,
            azure_deployment=azure_deployment,
            api_version=api_version,
        )
        self.model_name = azure_deployment
        self._tokenizer_name = tokenizer_name

    def get_client(self) -> AzureOpenAI:
        """
        Get the underlying Azure OpenAI client.

        Returns:
            AzureOpenAI: The configured Azure OpenAI API client.
        """
        return self.client

    def _get_encoder(self):
        """
        Retrieve the `tiktoken` encoder for this deployment.

        This method ensures compatibility with Azure's deployment names, which
        may not be directly recognized by `tiktoken`. If the user has explicitly
        provided a tokenizer name, that is used. Otherwise, the method first
        tries to look up the encoding via `tiktoken.encoding_for_model` using the
        deployment name. If that fails, it falls back to the default encoding
        defined by `OPENAI_EMBEDDING_MODEL_FALLBACK`.

        Returns:
            tiktoken.Encoding: A tokenizer encoding object.

        Raises:
            ValueError: If `tiktoken` fails to load the fallback encoding.
        """
        if self._tokenizer_name:
            return tiktoken.get_encoding(self._tokenizer_name)
        try:
            return tiktoken.encoding_for_model(self.model_name)
        except Exception:
            return tiktoken.get_encoding(OPENAI_EMBEDDING_MODEL_FALLBACK)

    def _count_tokens(self, text: str) -> int:
        """
        Count the number of tokens in the given text.

        Uses the encoder retrieved from `_get_encoder()` to tokenize the input
        and returns the length of the resulting token list.

        Args:
            text (str): The text to tokenize.

        Returns:
            int: Number of tokens in the input text.
        """
        encoder = self._get_encoder()
        return len(encoder.encode(text))

    def _validate_token_length(self, text: str) -> None:
        """
        Ensure the input text does not exceed the model's maximum token limit.

        Args:
            text (str): The text to check.

        Raises:
            ValueError: If the token count exceeds `OPENAI_EMBEDDING_MAX_TOKENS`.
        """
        if self._count_tokens(text) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"Input text exceeds maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text (str):
                The text to embed. Must be non-empty and within the model's
                token limit.
            **parameters:
                Additional parameters to forward to the Azure OpenAI embeddings API.

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds the token limit.
        """
        if not text:
            raise ValueError("`text` must be a non-empty string.")
        self._validate_token_length(text)
        response = self.client.embeddings.create(
            model=self.model_name,
            input=text,
            **parameters,
        )
        return response.data[0].embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts in a single API call.

        Args:
            texts (List[str]):
                List of text strings to embed. All items must be non-empty strings
                within the token limit.
            **parameters:
                Additional parameters to forward to the Azure OpenAI embeddings API.

        Returns:
            A list of embedding vectors, one per input text.

        Raises:
            ValueError:
                - If `texts` is empty.
                - If any text is empty or not a string.
                - If any text exceeds the token limit.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        encoder = self._get_encoder()
        for t in texts:
            if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
                raise ValueError(
                    f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
                )

        response = self.client.embeddings.create(
            model=self.model_name,
            input=texts,
            **parameters,
        )
        return [data.embedding for data in response.data]

`__init__(model_name=None, api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None, tokenizer_name=None)` ¶

Initialize the Azure OpenAI Embedding provider.

Parameters:

Name	Type	Description	Default
`model_name`	`Optional[str]`	OpenAI model name, kept for API parity. It is used as the deployment name only when `azure_deployment` is not provided and `AZURE_OPENAI_DEPLOYMENT` is not set.	`None`
`api_key`	`Optional[str]`	API key for Azure OpenAI. If not provided, it will be read from the environment variable `AZURE_OPENAI_API_KEY`.	`None`
`azure_endpoint`	`Optional[str]`	The base endpoint for the Azure OpenAI service. If not provided, it will be read from `AZURE_OPENAI_ENDPOINT`.	`None`
`azure_deployment`	`Optional[str]`	Deployment name for the embeddings model in Azure OpenAI. If not provided, it is read from `AZURE_OPENAI_DEPLOYMENT`, falling back to `model_name`.	`None`
`api_version`	`Optional[str]`	Azure API version string. If not provided, it is read from `AZURE_OPENAI_API_VERSION`, falling back to `"2025-04-14-preview"`.	`None`
`tokenizer_name`	`Optional[str]`	Optional explicit tokenizer name for `tiktoken` (e.g., `"cl100k_base"`). If provided, it overrides the automatic mapping.	`None`

Raises:

Type	Description
`ValueError`	If a required setting is neither passed as an argument nor found in the environment variables.
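
The lookup order described in the parameter table (explicit argument first, then environment variable, then a fallback) can be sketched as follows. This is a minimal illustration rather than the library's code: `resolve_setting` is a hypothetical helper, the `fake_env` mapping stands in for `os.environ` so the example runs without real credentials, and the endpoint and model values are placeholders.

```python
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env: Mapping[str, str],
    env_var: str,
    fallback: Optional[str] = None,
) -> str:
    """Return the explicit value, else the env var, else the fallback (hypothetical helper)."""
    if explicit is not None:
        return explicit
    value = env.get(env_var) or fallback
    if not value:
        raise ValueError(f"Provide a value explicitly or set {env_var!r}.")
    return value


# Stand-in for os.environ; only the endpoint is "set".
fake_env = {"AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com"}

endpoint = resolve_setting(None, fake_env, "AZURE_OPENAI_ENDPOINT")
# No deployment in the environment, so the model name acts as the fallback.
deployment = resolve_setting(
    None, fake_env, "AZURE_OPENAI_DEPLOYMENT", fallback="text-embedding-3-large"
)
# No API version in the environment, so the documented default is used.
api_version = resolve_setting(
    None, fake_env, "AZURE_OPENAI_API_VERSION", fallback="2025-04-14-preview"
)
```

A setting with no explicit value, no environment variable, and no fallback (the API key, for example) raises `ValueError`, which matches the behavior listed under Raises.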

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py

def __init__(
    self,
    model_name: Optional[str] = None,
    api_key: Optional[str] = None,
    azure_endpoint: Optional[str] = None,
    azure_deployment: Optional[str] = None,
    api_version: Optional[str] = None,
    tokenizer_name: Optional[str] = None,
) -> None:
    """
    Initialize the Azure OpenAI Embedding provider.

    Args:
        model_name (Optional[str]):
            OpenAI model name (unused for Azure, but kept for API parity).
            If `azure_deployment` is not provided, this will be used as the
            deployment name.
        api_key (Optional[str]):
            API key for Azure OpenAI. If not provided, it will be read from
            the environment variable `AZURE_OPENAI_API_KEY`.
        azure_endpoint (Optional[str]):
            The base endpoint for the Azure OpenAI service. If not provided,
            it will be read from `AZURE_OPENAI_ENDPOINT`.
        azure_deployment (Optional[str]):
            Deployment name for the embeddings model in Azure OpenAI. If not
            provided, it will be read from `AZURE_OPENAI_DEPLOYMENT` or
            fallback to `model_name`.
        api_version (Optional[str]):
            Azure API version string. If not provided, it is read from
            `AZURE_OPENAI_API_VERSION`, falling back to `"2025-04-14-preview"`.
        tokenizer_name (Optional[str]):
            Optional explicit tokenizer name for `tiktoken` (e.g.,
            `"cl100k_base"`). If provided, it overrides the automatic mapping.

    Raises:
        ValueError: If any required parameter is missing or it is not found in environment variables.
    """
    if api_key is None:
        api_key = os.getenv("AZURE_OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "Azure OpenAI API key not provided or 'AZURE_OPENAI_API_KEY' env var is not set."
            )

    if azure_endpoint is None:
        azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
        if not azure_endpoint:
            raise ValueError(
                "Azure endpoint not provided or 'AZURE_OPENAI_ENDPOINT' env var is not set."
            )

    if azure_deployment is None:
        azure_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT") or model_name
        if not azure_deployment:
            raise ValueError(
                "Azure deployment name not provided. Set 'azure_deployment', "
                "'AZURE_OPENAI_DEPLOYMENT', or pass `model_name`."
            )

    if api_version is None:
        api_version = os.getenv("AZURE_OPENAI_API_VERSION", "2025-04-14-preview")

    self.client = AzureOpenAI(
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        azure_deployment=azure_deployment,
        api_version=api_version,
    )
    self.model_name = azure_deployment
    self._tokenizer_name = tokenizer_name

`get_client()` ¶

Get the underlying Azure OpenAI client.

Returns:

Name	Type	Description
`AzureOpenAI`	`AzureOpenAI`	The configured Azure OpenAI API client.

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py

def get_client(self) -> AzureOpenAI:
    """
    Get the underlying Azure OpenAI client.

    Returns:
        AzureOpenAI: The configured Azure OpenAI API client.
    """
    return self.client

`embed_text(text, **parameters)` ¶

Compute an embedding vector for a single text string.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to embed. Must be non-empty and within the model's token limit.	required
`**parameters`	`Any`	Additional parameters to forward to the Azure OpenAI embeddings API.	`{}`

Returns:

Type	Description
`List[float]`	The computed embedding vector.

Raises:

Type	Description
`ValueError`	If `text` is empty or exceeds the token limit.

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py

def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text (str):
            The text to embed. Must be non-empty and within the model's
            token limit.
        **parameters:
            Additional parameters to forward to the Azure OpenAI embeddings API.

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds the token limit.
    """
    if not text:
        raise ValueError("`text` must be a non-empty string.")
    self._validate_token_length(text)
    response = self.client.embeddings.create(
        model=self.model_name,
        input=text,
        **parameters,
    )
    return response.data[0].embedding

`embed_documents(texts, **parameters)` ¶

Compute embeddings for multiple texts in a single API call.

Parameters:

Name	Type	Description	Default
`texts`	`List[str]`	List of text strings to embed. All items must be non-empty strings within the token limit.	required
`**parameters`	`Any`	Additional parameters to forward to the Azure OpenAI embeddings API.	`{}`

Returns:

Type	Description
`List[List[float]]`	A list of embedding vectors, one per input text.

Raises:

Type	Description
`ValueError`	If `texts` is empty, if any item is empty or not a string, or if any item exceeds the token limit.
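
Because a single invalid item makes the whole call fail, it can help to check a batch before sending it. The sketch below mirrors the three conditions listed under Raises. `check_batch` and `whitespace_tokens` are hypothetical names, the whitespace-based counter is only a stand-in for a real tokenizer such as `tiktoken`, and the limit of 8192 is an assumed value for illustration, not necessarily the library's `OPENAI_EMBEDDING_MAX_TOKENS`.

```python
from typing import Callable, List


def check_batch(
    texts: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> None:
    """Raise ValueError for the conditions that embed_documents rejects."""
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    for i, t in enumerate(texts):
        if not isinstance(t, str) or not t:
            raise ValueError(f"Item {i} is not a non-empty string.")
        if count_tokens(t) > max_tokens:
            raise ValueError(f"Item {i} exceeds {max_tokens} tokens.")


def whitespace_tokens(text: str) -> int:
    # Crude stand-in: real tokenizers count subword tokens, not words.
    return len(text.split())


# A valid batch passes silently.
check_batch(["first chunk", "second chunk"], whitespace_tokens, max_tokens=8192)

# An empty string in the batch is reported with its position.
try:
    check_batch(["ok", ""], whitespace_tokens, max_tokens=8192)
except ValueError as err:
    message = str(err)
```

Reporting the index of the offending item is a design choice of this sketch; the library's own error messages do not include it.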

Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py

def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts in a single API call.

    Args:
        texts (List[str]):
            List of text strings to embed. All items must be non-empty strings
            within the token limit.
        **parameters:
            Additional parameters to forward to the Azure OpenAI embeddings API.

    Returns:
        A list of embedding vectors, one per input text.

    Raises:
        ValueError:
            - If `texts` is empty.
            - If any text is empty or not a string.
            - If any text exceeds the token limit.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    encoder = self._get_encoder()
    for t in texts:
        if len(encoder.encode(t)) > OPENAI_EMBEDDING_MAX_TOKENS:
            raise ValueError(
                f"An input exceeds the maximum allowed length of {OPENAI_EMBEDDING_MAX_TOKENS} tokens."
            )

    response = self.client.embeddings.create(
        model=self.model_name,
        input=texts,
        **parameters,
    )
    return [data.embedding for data in response.data]

GeminiEmbedding¶

`GeminiEmbedding` ¶

Bases: BaseEmbedding

Embedding provider using Google Gemini's embedding API.

This class wraps the Gemini API for generating embeddings from text or documents. Requires the google-genai package and a valid Gemini API key. This class is available only if splitter-mr[multimodal] is installed.

Typical usage example

from splitter_mr.embedding.models.gemini_embedding import GeminiEmbedding
embedder = GeminiEmbedding(api_key="your-api-key")
vector = embedder.embed_text("Hello, world!")
print(vector)

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py

class GeminiEmbedding(BaseEmbedding):
    """
    Embedding provider using Google Gemini's embedding API.

    This class wraps the Gemini API for generating embeddings from text or documents.
    Requires the `google-genai` package and a valid Gemini API key. This class
    is available only if `splitter-mr[multimodal]` is installed.

    Typical usage example:
        ```python
        from splitter_mr.embedding.models.gemini_embedding import GeminiEmbedding
        embedder = GeminiEmbedding(api_key="your-api-key")
        vector = embedder.embed_text("Hello, world!")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "models/embedding-001",
        api_key: Optional[str] = None,
    ) -> None:
        """
        Initialize the Gemini embedding provider.

        Args:
            model_name (str): The Gemini model identifier to use for embedding. Defaults to "models/embedding-001".
            api_key (Optional[str]): The Gemini API key. If not provided, reads from the 'GEMINI_API_KEY' environment variable.

        Raises:
            ImportError: If the `google-genai` package is not installed.
            ValueError: If no API key is provided or found in the environment.
        """
        self.api_key = api_key or os.getenv("GEMINI_API_KEY")
        if not self.api_key:
            raise ValueError(
                "Google Gemini API key not provided and 'GEMINI_API_KEY' environment variable not set."
            )
        self.model_name = model_name
        # Use the resolved key so the GEMINI_API_KEY fallback is honored.
        self.client = genai.Client(api_key=self.api_key)
        self.models = self.client.models

    def get_client(self) -> "genai.Client":
        """
        Return the underlying Gemini API client.

        Returns:
            The configured Gemini API client (`genai.Client`).
        """
        return self.client

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Generate an embedding for a single text string using Gemini.

        Args:
            text (str): The input text to embed.
            **parameters (Any): Additional parameters for the Gemini API.

        Returns:
            List[float]: The generated embedding vector.

        Raises:
            ValueError: If the input text is not a non-empty string.
            RuntimeError: If the embedding call fails or returns an invalid response.
        """
        if not isinstance(text, str) or not text.strip():
            raise ValueError("`text` must be a non-empty string.")

        try:
            result = self.models.embed_content(
                model=self.model_name, contents=text, **parameters
            )
            embedding = getattr(result, "embedding", None)
            if embedding is None:
                raise RuntimeError(
                    "Gemini embedding call succeeded but no 'embedding' field was returned."
                )
            return embedding
        except Exception as e:
            raise RuntimeError(f"Failed to get embedding from Gemini: {e}") from e

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Generate embeddings for a list of text strings using Gemini.

        Args:
            texts (List[str]): A list of input text strings.
            **parameters (Any): Additional parameters for the Gemini API.

        Returns:
            List[List[float]]: The generated embedding vectors, one per input.

        Raises:
            ValueError: If the input is not a non-empty list of non-empty strings.
            RuntimeError: If the embedding call fails or returns an invalid response.
        """
        if (
            not isinstance(texts, list)
            or not texts  # noqa: W503
            or any(not isinstance(t, str) or not t.strip() for t in texts)  # noqa: W503
        ):
            raise ValueError("`texts` must be a non-empty list of non-empty strings.")

        try:
            result = self.models.embed_content(
                model=self.model_name, contents=texts, **parameters
            )
            # The Gemini API returns a list of embeddings under .embeddings
            embeddings = getattr(result, "embeddings", None)
            if embeddings is None:
                raise RuntimeError(
                    "Gemini embedding call succeeded but no 'embeddings' field was returned."
                )
            return embeddings

        except Exception as e:
            raise RuntimeError(
                f"Failed to get document embeddings from Gemini: {e}"
            ) from e

`__init__(model_name='models/embedding-001', api_key=None)` ¶

Initialize the Gemini embedding provider.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	The Gemini model identifier to use for embedding. Defaults to "models/embedding-001".	`'models/embedding-001'`
`api_key`	`Optional[str]`	The Gemini API key. If not provided, reads from the 'GEMINI_API_KEY' environment variable.	`None`

Raises:

Type	Description
`ImportError`	If the `google-genai` package is not installed.
`ValueError`	If no API key is provided or found in the environment.

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py

def __init__(
    self,
    model_name: str = "models/embedding-001",
    api_key: Optional[str] = None,
) -> None:
    """
    Initialize the Gemini embedding provider.

    Args:
        model_name (str): The Gemini model identifier to use for embedding. Defaults to "models/embedding-001".
        api_key (Optional[str]): The Gemini API key. If not provided, reads from the 'GEMINI_API_KEY' environment variable.

    Raises:
        ImportError: If the `google-genai` package is not installed.
        ValueError: If no API key is provided or found in the environment.
    """
    self.api_key = api_key or os.getenv("GEMINI_API_KEY")
    if not self.api_key:
        raise ValueError(
            "Google Gemini API key not provided and 'GEMINI_API_KEY' environment variable not set."
        )
    self.model_name = model_name
    # Use the resolved key so the GEMINI_API_KEY fallback is honored.
    self.client = genai.Client(api_key=self.api_key)
    self.models = self.client.models

`get_client()` ¶

Return the underlying Gemini API client.

Returns:

Type	Description
`Client`	The configured Gemini API client (`genai.Client`).

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py

def get_client(self) -> "genai.Client":
    """
    Return the underlying Gemini API client.

    Returns:
        The configured Gemini API client (`genai.Client`).
    """
    return self.client

`embed_text(text, **parameters)` ¶

Generate an embedding for a single text string using Gemini.

Parameters:

Name	Type	Description	Default
`text`	`str`	The input text to embed.	required
`**parameters`	`Any`	Additional parameters for the Gemini API.	`{}`

Returns:

Type	Description
`List[float]`	The generated embedding vector.

Raises:

Type	Description
`ValueError`	If the input text is not a non-empty string.
`RuntimeError`	If the embedding call fails or returns an invalid response.

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py

def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Generate an embedding for a single text string using Gemini.

    Args:
        text (str): The input text to embed.
        **parameters (Any): Additional parameters for the Gemini API.

    Returns:
        List[float]: The generated embedding vector.

    Raises:
        ValueError: If the input text is not a non-empty string.
        RuntimeError: If the embedding call fails or returns an invalid response.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("`text` must be a non-empty string.")

    try:
        result = self.models.embed_content(
            model=self.model_name, contents=text, **parameters
        )
        embedding = getattr(result, "embedding", None)
        if embedding is None:
            raise RuntimeError(
                "Gemini embedding call succeeded but no 'embedding' field was returned."
            )
        return embedding
    except Exception as e:
        raise RuntimeError(f"Failed to get embedding from Gemini: {e}") from e

`embed_documents(texts, **parameters)` ¶

Generate embeddings for a list of text strings using Gemini.

Parameters:

Name	Type	Description	Default
`texts`	`List[str]`	A list of input text strings.	required
`**parameters`	`Any`	Additional parameters for the Gemini API.	`{}`

Returns:

Type	Description
`List[List[float]]`	The generated embedding vectors, one per input.

Raises:

Type	Description
`ValueError`	If the input is not a non-empty list of non-empty strings.
`RuntimeError`	If the embedding call fails or returns an invalid response.
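
The two exception types separate caller mistakes from API failures, so they can be handled differently. The following is a minimal sketch under stated assumptions: `FlakyEmbedder` is a hypothetical stand-in that imitates the error contract above without calling Gemini, and `embed_with_retry` is a hypothetical helper, not part of SplitterMR.

```python
from typing import List, Optional


class FlakyEmbedder:
    """Hypothetical stand-in: fails once with RuntimeError, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        if (
            not isinstance(texts, list)
            or not texts
            or any(not isinstance(t, str) or not t.strip() for t in texts)
        ):
            raise ValueError("`texts` must be a non-empty list of non-empty strings.")
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("simulated transient API failure")
        # Fake one-dimensional "embeddings" so the example has visible output.
        return [[float(len(t))] for t in texts]


def embed_with_retry(embedder, texts: List[str], attempts: int = 3) -> List[List[float]]:
    """Retry on RuntimeError; let ValueError propagate, since retrying bad input cannot help."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return embedder.embed_documents(texts)
        except RuntimeError as err:
            last_error = err
    raise RuntimeError(f"Embedding failed after {attempts} attempts") from last_error


embedder = FlakyEmbedder()
vectors = embed_with_retry(embedder, ["alpha", "be"])
```

A production version would normally wait between attempts; the delay is omitted here to keep the sketch short.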

Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py

def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Generate embeddings for a list of text strings using Gemini.

    Args:
        texts (List[str]): A list of input text strings.
        **parameters (Any): Additional parameters for the Gemini API.

    Returns:
        List[List[float]]: The generated embedding vectors, one per input.

    Raises:
        ValueError: If the input is not a non-empty list of non-empty strings.
        RuntimeError: If the embedding call fails or returns an invalid response.
    """
    if (
        not isinstance(texts, list)
        or not texts  # noqa: W503
        or any(not isinstance(t, str) or not t.strip() for t in texts)  # noqa: W503
    ):
        raise ValueError("`texts` must be a non-empty list of non-empty strings.")

    try:
        result = self.models.embed_content(
            model=self.model_name, contents=texts, **parameters
        )
        # The Gemini API returns a list of embeddings under .embeddings
        embeddings = getattr(result, "embeddings", None)
        if embeddings is None:
            raise RuntimeError(
                "Gemini embedding call succeeded but no 'embeddings' field was returned."
            )
        return embeddings

    except Exception as e:
        raise RuntimeError(
            f"Failed to get document embeddings from Gemini: {e}"
        ) from e

AnthropicEmbedding¶

`AnthropicEmbedding` ¶

Bases: BaseEmbedding

Embedding provider aligned with Anthropic's guidance, implemented via Voyage AI.

Anthropic does not offer a native embeddings API; their docs recommend using third-party providers such as Voyage AI for high-quality, domain-specific, and multimodal embeddings. This class wraps Voyage's Python SDK to provide a consistent interface that matches BaseEmbedding.

Example

from splitter_mr.embedding import AnthropicEmbedding

embedder = AnthropicEmbedding(model_name="voyage-3.5")
vec = embedder.embed_text("hello world", input_type="document")
print(len(vec))

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py

class AnthropicEmbedding(BaseEmbedding):
    """
    Embedding provider aligned with Anthropic's guidance, implemented via Voyage AI.

    Anthropic does not offer a native embeddings API; their docs recommend using
    third-party providers such as **Voyage AI** for high-quality, domain-specific,
    and multimodal embeddings. This class wraps Voyage's Python SDK to provide a
    consistent interface that matches `BaseEmbedding`.

    Example:
        ```python
        from splitter_mr.embedding import AnthropicEmbedding

        embedder = AnthropicEmbedding(model_name="voyage-3.5")
        vec = embedder.embed_text("hello world", input_type="document")
        print(len(vec))
        ```
    """

    def __init__(
        self,
        model_name: str = "voyage-3.5",
        api_key: Optional[str] = None,
        default_input_type: Optional[str] = "document",
    ) -> None:
        """
        Initialize the Voyage embeddings provider.

        Args:
            model_name:
                Voyage embedding model name (e.g., "voyage-3.5", "voyage-3-large",
                "voyage-code-3", "voyage-finance-2", "voyage-law-2").
            api_key:
                Voyage API key. If not provided, reads from the `VOYAGE_API_KEY`
                environment variable.
            default_input_type:
                Default for Voyage's `input_type` parameter ("document" | "query").

        Raises:
            ImportError: If the `multimodal` extra (with `voyageai`) is not installed.
            ValueError: If no API key is provided or found in the environment.
        """

        if api_key is None:
            api_key = os.getenv("VOYAGE_API_KEY")
            if not api_key:
                raise ValueError(
                    "Voyage API key not provided and 'VOYAGE_API_KEY' environment variable is not set."
                )

        self.client = voyageai.Client(api_key=api_key)
        self.model_name = model_name
        self.default_input_type = default_input_type

    def get_client(self) -> Any:
        """Return the underlying Voyage client."""
        return self.client

    def _ensure_input_type(self, parameters: dict) -> dict:
        """Default `input_type` to self.default_input_type if not set."""
        params = dict(parameters) if parameters else {}
        if "input_type" not in params and self.default_input_type:
            params["input_type"] = self.default_input_type
        return params

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """Compute an embedding vector for a single text string."""
        if not isinstance(text, str) or not text.strip():
            raise ValueError("`text` must be a non-empty string.")

        params = self._ensure_input_type(parameters)
        result = self.client.embed([text], model=self.model_name, **params)

        if not hasattr(result, "embeddings") or not result.embeddings:
            raise RuntimeError(
                "Voyage returned an empty or malformed embeddings response."
            )

        embedding = result.embeddings[0]
        if not isinstance(embedding, list) or not embedding:
            raise RuntimeError("Voyage returned an invalid embedding vector.")

        return embedding

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """Compute embeddings for multiple texts in one API call."""
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any(not isinstance(t, str) or not t.strip() for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        params = self._ensure_input_type(parameters)
        result = self.client.embed(texts, model=self.model_name, **params)

        if not hasattr(result, "embeddings") or not result.embeddings:
            raise RuntimeError(
                "Voyage returned an empty or malformed embeddings response."
            )

        if len(result.embeddings) != len(texts):
            raise RuntimeError(
                f"Voyage returned {len(result.embeddings)} embeddings for {len(texts)} inputs."
            )

        embeddings = result.embeddings

        return embeddings

`__init__(model_name='voyage-3.5', api_key=None, default_input_type='document')` ¶

Initialize the Voyage embeddings provider.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	Voyage embedding model name (e.g., "voyage-3.5", "voyage-3-large", "voyage-code-3", "voyage-finance-2", "voyage-law-2").	`'voyage-3.5'`
`api_key`	`Optional[str]`	Voyage API key. If not provided, reads from the `VOYAGE_API_KEY` environment variable.	`None`
`default_input_type`	`Optional[str]`	Default for Voyage's `input_type` parameter (`"document"` or `"query"`).	`'document'`

Raises:

Type	Description
`ImportError`	If the `multimodal` extra (with `voyageai`) is not installed.
`ValueError`	If no API key is provided or found in the environment.
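
`default_input_type` is applied only when a call does not pass its own `input_type`. The standalone function below restates that merging rule so it can be run without a Voyage API key. `merge_input_type` is a hypothetical name used for illustration, and `truncation` appears only as an example of another keyword argument being passed through unchanged.

```python
from typing import Any, Dict, Optional


def merge_input_type(
    parameters: Dict[str, Any], default_input_type: Optional[str]
) -> Dict[str, Any]:
    """Copy `parameters`, filling in `input_type` only if the caller left it out."""
    params = dict(parameters)
    if "input_type" not in params and default_input_type:
        params["input_type"] = default_input_type
    return params


# No per-call value: the default is filled in.
doc_params = merge_input_type({}, "document")
# A per-call value always wins over the default.
query_params = merge_input_type({"input_type": "query"}, "document")
# With the default disabled, nothing is added.
no_default = merge_input_type({"truncation": True}, None)
```

In practice this means indexing code can rely on the default, while search code passes `input_type="query"` on each call.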

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py

def __init__(
    self,
    model_name: str = "voyage-3.5",
    api_key: Optional[str] = None,
    default_input_type: Optional[str] = "document",
) -> None:
    """
    Initialize the Voyage embeddings provider.

    Args:
        model_name:
            Voyage embedding model name (e.g., "voyage-3.5", "voyage-3-large",
            "voyage-code-3", "voyage-finance-2", "voyage-law-2").
        api_key:
            Voyage API key. If not provided, reads from the `VOYAGE_API_KEY`
            environment variable.
        default_input_type:
            Default for Voyage's `input_type` parameter ("document" | "query").

    Raises:
        ImportError: If the `multimodal` extra (with `voyageai`) is not installed.
        ValueError: If no API key is provided or found in the environment.
    """

    if api_key is None:
        api_key = os.getenv("VOYAGE_API_KEY")
        if not api_key:
            raise ValueError(
                "Voyage API key not provided and 'VOYAGE_API_KEY' environment variable is not set."
            )

    self.client = voyageai.Client(api_key=api_key)
    self.model_name = model_name
    self.default_input_type = default_input_type

`get_client()` ¶

Return the underlying Voyage client.

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py

def get_client(self) -> Any:
    """Return the underlying Voyage client."""
    return self.client

`embed_text(text, **parameters)` ¶

Compute an embedding vector for a single text string.

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py

def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """Compute an embedding vector for a single text string."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("`text` must be a non-empty string.")

    params = self._ensure_input_type(parameters)
    result = self.client.embed([text], model=self.model_name, **params)

    if not hasattr(result, "embeddings") or not result.embeddings:
        raise RuntimeError(
            "Voyage returned an empty or malformed embeddings response."
        )

    embedding = result.embeddings[0]
    if not isinstance(embedding, list) or not embedding:
        raise RuntimeError("Voyage returned an invalid embedding vector.")

    return embedding

`embed_documents(texts, **parameters)` ¶

Compute embeddings for multiple texts in one API call.

Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py

def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """Compute embeddings for multiple texts in one API call."""
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any(not isinstance(t, str) or not t.strip() for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    params = self._ensure_input_type(parameters)
    result = self.client.embed(texts, model=self.model_name, **params)

    if not hasattr(result, "embeddings") or not result.embeddings:
        raise RuntimeError(
            "Voyage returned an empty or malformed embeddings response."
        )

    if len(result.embeddings) != len(texts):
        raise RuntimeError(
            f"Voyage returned {len(result.embeddings)} embeddings for {len(texts)} inputs."
        )

    embeddings = result.embeddings

    return embeddings

HuggingFaceEmbedding¶

Warning

Currently, only models compatible with the sentence-transformers library are supported.

`HuggingFaceEmbedding` ¶

Bases: BaseEmbedding

Encoder provider using Hugging Face sentence-transformers models.

This class wraps a local (or HF Hub) SentenceTransformer model to produce dense embeddings for text. It provides an interface consistent with BaseEmbedding, plus options for device selection and optional input-length validation. This class is available only if splitter-mr[multimodal] is installed.

Example

from splitter_mr.embedding.models.huggingface_embedding import HuggingFaceEmbedding

# Any sentence-transformers checkpoint works (local path or HF Hub id)
embedder = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-english-r2",
    device="cpu",            # or "cuda", "mps", etc.
    normalize=True,          # L2-normalize outputs
    enforce_max_length=True  # raise if text exceeds model max seq length
)

vector = embedder.embed_text("hello world")
print(vector)
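
With `normalize=True`, each returned vector has unit L2 norm, so the cosine similarity of two vectors reduces to their dot product. The pure-Python sketch below checks that identity on hand-written vectors instead of real model output; `l2_normalize`, `dot`, and `cosine` are illustrative helpers, not part of SplitterMR.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity for non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# On raw vectors the full cosine formula is needed.
raw_cosine = cosine(a, b)
# On normalized vectors a plain dot product gives the same number.
normalized_dot = dot(l2_normalize(a), l2_normalize(b))
```

Both values equal 0.96 here (up to floating-point rounding), which is why normalized embeddings can be compared with a dot product alone.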

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py

class HuggingFaceEmbedding(BaseEmbedding):
    """
    Encoder provider using Hugging Face `sentence-transformers` models.

    This class wraps a local (or HF Hub) SentenceTransformer model to produce
    dense embeddings for text. It provides an interface consistent with
    `BaseEmbedding`, plus options for device selection and optional
    input-length validation. This class is available only if
    `splitter-mr[multimodal]` is installed.

    Example:
        ```python
        from splitter_mr.embedding.models.huggingface_embedding import HuggingFaceEmbedding

        # Any sentence-transformers checkpoint works (local path or HF Hub id)
        embedder = HuggingFaceEmbedding(
            model_name="ibm-granite/granite-embedding-english-r2",
            device="cpu",            # or "cuda", "mps", etc.
            normalize=True,          # L2-normalize outputs
            enforce_max_length=True  # raise if text exceeds model max seq length
        )

        vector = embedder.embed_text("hello world")
        print(vector)
        ```
    """

    def __init__(
        self,
        model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
        device: Optional[str] = "cpu",
        normalize: bool = True,
        enforce_max_length: bool = False,
    ) -> None:
        """
        Initialize the sentence-transformers embeddings provider.

        Args:
            model_name:
                SentenceTransformer model id or local path. Examples:
                - `"ibm-granite/granite-embedding-english-r2"`
                - `"sentence-transformers/all-MiniLM-L6-v2"`
                - `"/path/to/local/model"`
            device:
                Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"` or a
                `torch.device`). Defaults to `"cpu"`; if `None`,
                sentence-transformers chooses.
            normalize:
                If True, return L2-normalized embeddings (sets
                `normalize_embeddings=True` in `encode`).
            enforce_max_length:
                If True, attempt to count tokens and raise `ValueError` when
                input exceeds the model's configured max sequence length.
                (If the model/tokenizer does not expose this reliably, the
                check is skipped gracefully.)

        Raises:
            ValueError: If the model cannot be loaded.
        """

        from sentence_transformers import SentenceTransformer

        st_device = str(device) if device is not None else None
        try:
            self.model = SentenceTransformer(model_name, device=st_device)
        except Exception as e:
            raise ValueError(
                f"Failed to load SentenceTransformer '{model_name}': {e}"
            ) from e

        self.model_name = model_name
        self.normalize = normalize
        self.enforce_max_length = enforce_max_length

    def get_client(self) -> "SentenceTransformer":
        """Return the underlying `SentenceTransformer` instance."""
        return self.model

    def _max_seq_length(self) -> Optional[int]:
        """Best-effort retrieval of model's max sequence length."""
        try:
            # sentence-transformers exposes this on the model
            return int(self.model.get_max_seq_length())
        except Exception:
            try:
                # Fallback: some versions have `max_seq_length` attribute
                return int(getattr(self.model, "max_seq_length", None))
            except Exception:
                return None

    def _count_tokens(self, text: str) -> Optional[int]:
        """
        Best-effort token counting via model.tokenize; returns None if unavailable.
        """
        try:
            features = self.model.tokenize([text])  # dict with "input_ids"
            input_ids = features["input_ids"]
            # input_ids is usually a list/array/tensor of shape [batch, seq]
            if isinstance(input_ids, list):
                first = input_ids[0]
                return len(first)
            if torch is not None and torch.is_tensor(input_ids):
                return int(input_ids.shape[1])
            if isinstance(input_ids, np.ndarray):
                return int(input_ids.shape[1])
        except Exception:
            pass
        return None

    def _validate_length_if_needed(self, text: str) -> None:
        """Raise ValueError if enforce_max_length=True and text is too long."""
        if not self.enforce_max_length:
            return
        max_len = self._max_seq_length()
        tok_count = self._count_tokens(text)
        if max_len is not None and tok_count is not None and tok_count > max_len:
            raise ValueError(
                f"Input exceeds model max sequence length ({tok_count} > {max_len} tokens)."
            )

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        """
        Compute an embedding vector for a single text string.

        Args:
            text:
                The text to embed. Must be non-empty. If `enforce_max_length`
                is True, a ValueError is raised when it exceeds the model limit.
            **parameters:
                Extra keyword arguments forwarded to `SentenceTransformer.encode`.
                Common options include:
                  - `batch_size` (int)
                  - `show_progress_bar` (bool)
                  - `convert_to_tensor` (bool)  # will be forced False here
                  - `device` (str)
                  - `normalize_embeddings` (bool)

        Returns:
            List[float]: The computed embedding vector.

        Raises:
            ValueError: If `text` is empty or exceeds length constraints (when enforced).
            RuntimeError: If the embedding call fails unexpectedly.
        """
        if not isinstance(text, str) or not text:
            raise ValueError("`text` must be a non-empty string.")

        self._validate_length_if_needed(text)

        # Ensure Python list output
        parameters = dict(parameters)  # shallow copy
        parameters["convert_to_tensor"] = False
        parameters.setdefault("normalize_embeddings", self.normalize)

        try:
            # `encode` accepts a single string and returns a 1D array-like
            vec = self.model.encode(text, **parameters)
        except Exception as e:
            raise RuntimeError(f"Embedding call failed: {e}") from e

        # Normalize output to List[float]
        if isinstance(vec, np.ndarray):
            return vec.astype(np.float32, copy=False).tolist()
        if torch is not None and hasattr(vec, "detach"):
            return vec.detach().cpu().float().tolist()
        if isinstance(vec, (list, tuple)):
            return [float(x) for x in vec]
        # Anything else: try to coerce
        try:
            return list(map(float, vec))  # type: ignore[arg-type]
        except Exception as e:
            raise RuntimeError(f"Unexpected embedding output type: {type(vec)}") from e

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        """
        Compute embeddings for multiple texts efficiently using `encode`.

        Args:
            texts:
                List of input strings to embed. Must be non-empty and contain
                only non-empty strings. Length enforcement is applied per item
                if `enforce_max_length=True`.
            **parameters:
                Extra keyword arguments forwarded to `SentenceTransformer.encode`.
                Common options:
                  - `batch_size` (int)
                  - `show_progress_bar` (bool)
                  - `convert_to_tensor` (bool)  # will be forced False here
                  - `device` (str)
                  - `normalize_embeddings` (bool)

        Returns:
            List[List[float]]: One embedding per input string.

        Raises:
            ValueError: If `texts` is empty or any element is empty/non-string.
            RuntimeError: If the embedding call fails unexpectedly.
        """
        if not texts:
            raise ValueError("`texts` must be a non-empty list of strings.")
        if any((not isinstance(t, str) or not t) for t in texts):
            raise ValueError("All items in `texts` must be non-empty strings.")

        if self.enforce_max_length:
            for t in texts:
                self._validate_length_if_needed(t)

        parameters = dict(parameters)
        parameters["convert_to_tensor"] = False
        parameters.setdefault("normalize_embeddings", self.normalize)

        try:
            # Returns ndarray (n, d) or list-of-lists
            mat = self.model.encode(texts, **parameters)
        except Exception as e:
            raise RuntimeError(f"Batch embedding call failed: {e}") from e

        if isinstance(mat, np.ndarray):
            return mat.astype(np.float32, copy=False).tolist()
        if torch is not None and hasattr(mat, "detach"):
            return mat.detach().cpu().float().tolist()
        if (
            isinstance(mat, list)
            and mat  # noqa: W503
            and isinstance(mat[0], (list, tuple, float, int))  # noqa: W503
        ):
            # Already python lists (ST often returns this when convert_to_tensor=False)
            if mat and isinstance(mat[0], (float, int)):  # single vector in a flat list
                return [list(map(float, mat))]
            return [list(map(float, row)) for row in mat]  # type: ignore[arg-type]

        raise RuntimeError(f"Unexpected batch embedding output type: {type(mat)}")

`__init__(model_name='sentence-transformers/all-MiniLM-L6-v2', device='cpu', normalize=True, enforce_max_length=False)` ¶

Initialize the sentence-transformers embeddings provider.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	SentenceTransformer model id or local path. Examples: `"ibm-granite/granite-embedding-english-r2"`, `"sentence-transformers/all-MiniLM-L6-v2"`, `"/path/to/local/model"`.	`'sentence-transformers/all-MiniLM-L6-v2'`
`device`	`Optional[str]`	Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"` or a `torch.device`). Defaults to `"cpu"`; pass `None` to let sentence-transformers choose.	`'cpu'`
`normalize`	`bool`	If True, return L2-normalized embeddings (sets `normalize_embeddings=True` in `encode`).	`True`
`enforce_max_length`	`bool`	If True, attempt to count tokens and raise `ValueError` when input exceeds the model's configured max sequence length. (If the model/tokenizer does not expose this reliably, the check is skipped gracefully.)	`False`

Raises:

Type	Description
`ValueError`	If the model cannot be loaded.
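
The `normalize` flag matters mostly for similarity search: once two vectors have unit L2 norm, their dot product equals their cosine similarity. The block below is a minimal standard-library sketch of that property, not the library's own code; `l2_normalize`, `dot`, and `cosine` are illustrative helper names that do not exist in SplitterMR.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm (illustrative helper, not part of SplitterMR)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave the zero vector unchanged
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-D "embeddings" standing in for real model output.
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization the dot product and the cosine similarity coincide.
print(dot(a_unit, b_unit), cosine(a, b))
```

With normalized vectors, a vector store that only supports inner-product search can still rank results by cosine similarity.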

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py

def __init__(
    self,
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    device: Optional[str] = "cpu",
    normalize: bool = True,
    enforce_max_length: bool = False,
) -> None:
    """
    Initialize the sentence-transformers embeddings provider.

    Args:
        model_name:
            SentenceTransformer model id or local path. Examples:
            - `"ibm-granite/granite-embedding-english-r2"`
            - `"sentence-transformers/all-MiniLM-L6-v2"`
            - `"/path/to/local/model"`
        device:
            Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"` or a
            `torch.device`). Defaults to `"cpu"`; pass `None` to let
            sentence-transformers choose.
        normalize:
            If True, return L2-normalized embeddings (sets
            `normalize_embeddings=True` in `encode`).
        enforce_max_length:
            If True, attempt to count tokens and raise `ValueError` when
            input exceeds the model's configured max sequence length.
            (If the model/tokenizer does not expose this reliably, the
            check is skipped gracefully.)

    Raises:
        ValueError: If the model cannot be loaded.
    """

    from sentence_transformers import SentenceTransformer

    st_device = str(device) if device is not None else None
    try:
        self.model = SentenceTransformer(model_name, device=st_device)
    except Exception as e:
        raise ValueError(
            f"Failed to load SentenceTransformer '{model_name}': {e}"
        ) from e

    self.model_name = model_name
    self.normalize = normalize
    self.enforce_max_length = enforce_max_length

`get_client()` ¶

Return the underlying `SentenceTransformer` instance.

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py

def get_client(self) -> "SentenceTransformer":
    """Return the underlying `SentenceTransformer` instance."""
    return self.model

`embed_text(text, **parameters)` ¶

Compute an embedding vector for a single text string.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to embed. Must be non-empty. If `enforce_max_length` is True, a ValueError is raised when it exceeds the model limit.	required
`**parameters`	`Any`	Extra keyword arguments forwarded to `SentenceTransformer.encode`. Common options include `batch_size` (int), `show_progress_bar` (bool), `convert_to_tensor` (bool; always forced to `False` here), `device` (str), and `normalize_embeddings` (bool).	`{}`

Returns:

Type	Description
`List[float]`	The computed embedding vector.

Raises:

Type	Description
`ValueError`	If `text` is empty or exceeds length constraints (when enforced).
`RuntimeError`	If the embedding call fails unexpectedly.
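
The way `**parameters` interacts with the instance settings can be easy to misread: `convert_to_tensor` is always overridden, while `normalize_embeddings` only falls back to the instance's `normalize` value when the caller did not pass it. The sketch below isolates that keyword handling in a standalone function; `merge_encode_kwargs` is a hypothetical name used for illustration, and the body mirrors the three lines in the method source rather than adding any behavior.

```python
from typing import Any, Dict


def merge_encode_kwargs(user_kwargs: Dict[str, Any], normalize: bool) -> Dict[str, Any]:
    """Reproduce the keyword handling of `embed_text` (hypothetical helper name)."""
    merged = dict(user_kwargs)            # shallow copy; the caller's dict is untouched
    merged["convert_to_tensor"] = False   # always overridden, whatever the caller passed
    merged.setdefault("normalize_embeddings", normalize)  # fills the key only if missing
    return merged


caller_kwargs = {"convert_to_tensor": True, "normalize_embeddings": False, "batch_size": 8}
merged = merge_encode_kwargs(caller_kwargs, normalize=True)
print(merged)
```

In practice this means a per-call `normalize_embeddings=False` wins over an embedder built with `normalize=True`, whereas a per-call `convert_to_tensor=True` is silently ignored.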

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py

def embed_text(self, text: str, **parameters: Any) -> List[float]:
    """
    Compute an embedding vector for a single text string.

    Args:
        text:
            The text to embed. Must be non-empty. If `enforce_max_length`
            is True, a ValueError is raised when it exceeds the model limit.
        **parameters:
            Extra keyword arguments forwarded to `SentenceTransformer.encode`.
            Common options include:
              - `batch_size` (int)
              - `show_progress_bar` (bool)
              - `convert_to_tensor` (bool)  # will be forced False here
              - `device` (str)
              - `normalize_embeddings` (bool)

    Returns:
        List[float]: The computed embedding vector.

    Raises:
        ValueError: If `text` is empty or exceeds length constraints (when enforced).
        RuntimeError: If the embedding call fails unexpectedly.
    """
    if not isinstance(text, str) or not text:
        raise ValueError("`text` must be a non-empty string.")

    self._validate_length_if_needed(text)

    # Ensure Python list output
    parameters = dict(parameters)  # shallow copy
    parameters["convert_to_tensor"] = False
    parameters.setdefault("normalize_embeddings", self.normalize)

    try:
        # `encode` accepts a single string and returns a 1D array-like
        vec = self.model.encode(text, **parameters)
    except Exception as e:
        raise RuntimeError(f"Embedding call failed: {e}") from e

    # Normalize output to List[float]
    if isinstance(vec, np.ndarray):
        return vec.astype(np.float32, copy=False).tolist()
    if torch is not None and hasattr(vec, "detach"):
        return vec.detach().cpu().float().tolist()
    if isinstance(vec, (list, tuple)):
        return [float(x) for x in vec]
    # Anything else: try to coerce
    try:
        return list(map(float, vec))  # type: ignore[arg-type]
    except Exception as e:
        raise RuntimeError(f"Unexpected embedding output type: {type(vec)}") from e

`embed_documents(texts, **parameters)` ¶

Compute embeddings for multiple texts efficiently using `encode`.

Parameters:

Name	Type	Description	Default
`texts`	`List[str]`	List of input strings to embed. Must be non-empty and contain only non-empty strings. Length enforcement is applied per item if `enforce_max_length=True`.	required
`**parameters`	`Any`	Extra keyword arguments forwarded to `SentenceTransformer.encode`. Common options include `batch_size` (int), `show_progress_bar` (bool), `convert_to_tensor` (bool; always forced to `False` here), `device` (str), and `normalize_embeddings` (bool).	`{}`

Returns:

Type	Description
`List[List[float]]`	One embedding per input string.

Raises:

Type	Description
`ValueError`	If `texts` is empty or any element is empty/non-string.
`RuntimeError`	If the embedding call fails unexpectedly.
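
The validation order described above (empty list, then non-string or empty items, then an optional per-item length check) can be sketched without loading a model. This is an illustration under stated assumptions: `validate_texts` and `whitespace_count` are hypothetical names, and the whitespace splitter is only a stand-in for the model tokenizer. As in the documented behavior, the length check is skipped when either the limit or the token count is unavailable (`None`).

```python
from typing import Callable, List, Optional


def validate_texts(
    texts: List[str],
    count_tokens: Callable[[str], Optional[int]],
    max_len: Optional[int],
) -> None:
    """Raise ValueError for invalid batches (hypothetical helper, not the library API)."""
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any((not isinstance(t, str) or not t) for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")
    for t in texts:
        n = count_tokens(t)
        # Best-effort check: skipped when the limit or the count is unknown.
        if max_len is not None and n is not None and n > max_len:
            raise ValueError(
                f"Input exceeds model max sequence length ({n} > {max_len} tokens)."
            )


def whitespace_count(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words, not real model tokens."""
    return len(text.split())


validate_texts(["short text", "another one"], whitespace_count, max_len=3)  # passes

try:
    validate_texts(["this item has too many words"], whitespace_count, max_len=3)
except ValueError as err:
    print(err)
```

Because validation runs over the whole list before any encoding, a single over-long item rejects the entire batch rather than producing partial results.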

Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py

def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
    """
    Compute embeddings for multiple texts efficiently using `encode`.

    Args:
        texts:
            List of input strings to embed. Must be non-empty and contain
            only non-empty strings. Length enforcement is applied per item
            if `enforce_max_length=True`.
        **parameters:
            Extra keyword arguments forwarded to `SentenceTransformer.encode`.
            Common options:
              - `batch_size` (int)
              - `show_progress_bar` (bool)
              - `convert_to_tensor` (bool)  # will be forced False here
              - `device` (str)
              - `normalize_embeddings` (bool)

    Returns:
        List[List[float]]: One embedding per input string.

    Raises:
        ValueError: If `texts` is empty or any element is empty/non-string.
        RuntimeError: If the embedding call fails unexpectedly.
    """
    if not texts:
        raise ValueError("`texts` must be a non-empty list of strings.")
    if any((not isinstance(t, str) or not t) for t in texts):
        raise ValueError("All items in `texts` must be non-empty strings.")

    if self.enforce_max_length:
        for t in texts:
            self._validate_length_if_needed(t)

    parameters = dict(parameters)
    parameters["convert_to_tensor"] = False
    parameters.setdefault("normalize_embeddings", self.normalize)

    try:
        # Returns ndarray (n, d) or list-of-lists
        mat = self.model.encode(texts, **parameters)
    except Exception as e:
        raise RuntimeError(f"Batch embedding call failed: {e}") from e

    if isinstance(mat, np.ndarray):
        return mat.astype(np.float32, copy=False).tolist()
    if torch is not None and hasattr(mat, "detach"):
        return mat.detach().cpu().float().tolist()
    if (
        isinstance(mat, list)
        and mat  # noqa: W503
        and isinstance(mat[0], (list, tuple, float, int))  # noqa: W503
    ):
        # Already python lists (ST often returns this when convert_to_tensor=False)
        if mat and isinstance(mat[0], (float, int)):  # single vector in a flat list
            return [list(map(float, mat))]
        return [list(map(float, row)) for row in mat]  # type: ignore[arg-type]

    raise RuntimeError(f"Unexpected batch embedding output type: {type(mat)}")
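
The last branch of the method handles results that are already Python lists, including the edge case of a single flat vector that has to be wrapped into a one-row batch. The sketch below isolates that list-handling step only; it leaves out the NumPy and tensor branches, accepts tuples as the outer container for simplicity, and uses the illustrative name `to_nested_floats`, which is not part of SplitterMR.

```python
from typing import List, Sequence


def to_nested_floats(mat: Sequence) -> List[List[float]]:
    """Coerce list-like output to List[List[float]] (simplified, hypothetical helper)."""
    if not isinstance(mat, (list, tuple)) or not mat:
        raise RuntimeError(f"Unexpected batch embedding output type: {type(mat)}")
    if isinstance(mat[0], (int, float)):
        # A single flat vector: wrap it so callers always get a batch.
        return [[float(x) for x in mat]]
    # A sequence of rows: convert every element to a plain float.
    return [[float(x) for x in row] for row in mat]


print(to_nested_floats([1, 2, 3]))
print(to_nested_floats([[1, 2], (3, 4)]))
```

Returning a nested list in every case keeps the "one embedding per input string" contract intact for callers that index the result by position.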
