Embedding Models¶
Overview¶
Encoder models are the engines that produce embeddings: distributed, vectorized representations of a text. Embeddings capture relationships between semantic units (commonly words, but also sentences, or even multimodal content such as images).
These embeddings can be used in a variety of tasks, such as:
- Measuring how relevant a word is within a text.
- Comparing the similarity between two pieces of text.
- Powering search, clustering, and recommendation systems.
SplitterMR takes advantage of these models in SemanticSplitter, where embeddings are used to break text into chunks based on meaning, not just size: sentences with similar context end up together, regardless of length or position.
Which embedder should I use?¶
All embedders inherit from BaseEmbedding and expose the same interface for generating embeddings. Choose based on your cloud provider, credentials, and compliance needs.
| Model | When to use | Requirements | Features |
|---|---|---|---|
| OpenAIEmbedding | You have an OpenAI API key and want to use OpenAI's hosted embeddings | `OPENAI_API_KEY` | Production-ready text embeddings; simple setup; broad ecosystem/tooling support |
| AzureOpenAIEmbedding | Your organization uses Azure OpenAI Services | `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_DEPLOYMENT` | Enterprise controls, Azure compliance & data residency; integrates with Azure identity |
| GeminiEmbedding | You want Google's Gemini text embeddings | `GEMINI_API_KEY` + multimodal extra: `pip install 'splitter-mr[multimodal]'` | Google Gemini API; modern, high-quality text embeddings |
| AnthropicEmbedding | You want embeddings aligned with Anthropic guidance (via Voyage AI) | `VOYAGE_API_KEY` + multimodal extra: `pip install 'splitter-mr[multimodal]'` | Voyage AI embeddings (general, code, finance, law, multimodal); supports `input_type` for query/document asymmetry |
| HuggingFaceEmbedding | You prefer local/open-source models (Sentence Transformers) or need offline capability | Multimodal extra: `pip install 'splitter-mr[multimodal]'` (optional: `HF_ACCESS_TOKEN` for gated models) | No API key; huge model zoo; CPU/GPU/MPS; optional L2 normalization for cosine similarity |
| BaseEmbedding | Abstract base, not used directly | – | Implement to plug in a custom or self-hosted embedder |
Note
If you want to bring your own embedding provider, you can implement it by subclassing BaseEmbedding.
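As a sketch of what such a subclass looks like, here is a toy embedder implementing the interface documented below. The `BaseEmbedding` class here is a self-contained stand-in mirroring the documented methods so the example runs without splitter-mr installed; in practice you would subclass `splitter_mr.embedding.BaseEmbedding` instead, and the hashing scheme is a toy, not a real embedding model.

```python
import hashlib
import math
from abc import ABC, abstractmethod
from typing import Any, List


class BaseEmbedding(ABC):
    """Stand-in for splitter_mr.embedding.BaseEmbedding (same method names)."""

    @abstractmethod
    def __init__(self, model_name: str): ...

    @abstractmethod
    def get_client(self) -> Any: ...

    @abstractmethod
    def embed_text(self, text: str, **parameters: Any) -> List[float]: ...

    def embed_documents(self, texts: List[str], **parameters: Any) -> List[List[float]]:
        # Documented default behavior: loop over embed_text per input.
        if not texts or any(not t for t in texts):
            raise ValueError("texts must be a non-empty list of non-empty strings")
        return [self.embed_text(t, **parameters) for t in texts]


class ToyHashEmbedding(BaseEmbedding):
    """Deterministic 8-dimensional 'embedding' derived from a SHA-256 digest."""

    def __init__(self, model_name: str = "toy-hash-8d"):
        self.model_name = model_name

    def get_client(self) -> Any:
        return None  # purely local, no SDK client needed

    def embed_text(self, text: str, **parameters: Any) -> List[float]:
        if not text:
            raise ValueError("text must be a non-empty string")
        digest = hashlib.sha256(text.encode()).digest()
        vec = [b / 255.0 for b in digest[:8]]
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]  # L2-normalized
```

A real subclass would replace the body of `embed_text` with a call to your backend and typically override `embed_documents` for true batching.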
Embedders¶
BaseEmbedding¶
BaseEmbedding¶
Bases: ABC
Abstract base for text embedding providers.
Implementations wrap specific backends (e.g., OpenAI, Azure OpenAI, local models) and expose a consistent interface to convert text into numeric vectors suitable for similarity search, clustering, and retrieval-augmented generation.
Source code in src/splitter_mr/embedding/base_embedding.py
__init__(model_name) (abstractmethod)¶
Initialize the embedding backend.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Identifier of the embedding model (e.g., `"text-embedding-3-large"`). | required |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If required configuration or credentials are missing. |
Source code in src/splitter_mr/embedding/base_embedding.py
get_client() (abstractmethod)¶
Return the underlying client or handle.
Returns:

| Name | Type | Description |
|---|---|---|
| `Any` | `Any` | A client/handle used to perform embedding calls (e.g., an SDK client instance, session object, or local runner). |
Source code in src/splitter_mr/embedding/base_embedding.py
embed_text(text, **parameters) (abstractmethod)¶
Compute an embedding vector for the given text.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input text to embed. Implementations may apply normalization or truncation according to model limits. | required |
| `**parameters` | `Dict[str, Any]` | Additional backend-specific options forwarded to the implementation (e.g., user tags, request IDs). | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[float]` | A single embedding vector representing `text`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `text` is empty or invalid. |
| `RuntimeError` | If the embedding call fails or returns an unexpected response shape. |
Source code in src/splitter_mr/embedding/base_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts (default loops over embed_text).
Implementations are encouraged to override for true batch performance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | List of input strings to embed. | required |
| `**parameters` | `Dict[str, Any]` | Backend-specific options. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[List[float]]` | List of embedding vectors, one per input string. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `texts` is empty or contains empty strings. |
Source code in src/splitter_mr/embedding/base_embedding.py
OpenAIEmbedding¶
OpenAIEmbedding¶
Bases: BaseEmbedding
Encoder provider using OpenAI's embeddings API.
This class wraps OpenAI's embeddings endpoint, providing convenience methods for both single-text and batch embeddings. It also adds token counting and validation to avoid exceeding model limits.
Example
from splitter_mr.embedding import OpenAIEmbedding
embedder = OpenAIEmbedding(model_name="text-embedding-3-large")
vector = embedder.embed_text("hello world")
print(vector)
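The token counting and validation mentioned above amounts to a pre-flight check before calling the API. A minimal sketch of that idea, where `count_tokens` and the limit value are illustrative assumptions, not splitter-mr internals:

```python
from typing import Callable


def validate_token_count(text: str,
                         count_tokens: Callable[[str], int],
                         max_tokens: int = 8191) -> int:
    """Raise ValueError if `text` exceeds the model's token limit.

    `count_tokens` and `max_tokens` are assumptions for this sketch;
    the actual limit depends on the embedding model.
    """
    n = count_tokens(text)
    if n > max_tokens:
        raise ValueError(f"text has {n} tokens; model limit is {max_tokens}")
    return n
```

With tiktoken installed, `count_tokens` could be `lambda s: len(enc.encode(s))` for an encoding `enc` matching the model.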
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
__init__(model_name='text-embedding-3-large', api_key=None, tokenizer_name=None)¶
Initialize the OpenAI embeddings provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | The OpenAI embedding model name (e.g., `"text-embedding-3-large"`). | `'text-embedding-3-large'` |
| `api_key` | `Optional[str]` | API key for OpenAI. If not provided, reads from the `OPENAI_API_KEY` environment variable. | `None` |
| `tokenizer_name` | `Optional[str]` | Optional explicit tokenizer name for `tiktoken`. | `None` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the API key is not provided or found in the environment. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
get_client()¶
Get the configured OpenAI client.
Returns:

| Name | Type | Description |
|---|---|---|
| `OpenAI` | `OpenAI` | The OpenAI API client instance. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
embed_text(text, **parameters)¶
Compute an embedding vector for a single text string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to embed. Must be non-empty and within the model's token limit. | required |
| `**parameters` | `Any` | Additional keyword arguments forwarded to the embeddings API call. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[float]` | The computed embedding vector. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `text` is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts in one API call.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | List of text strings to embed. All must be non-empty and within the model's token limit. | required |
| `**parameters` | `Any` | Additional keyword arguments forwarded to the embeddings API call. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[List[float]]` | A list of embedding vectors, one per input string. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If any input is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/openai_embedding.py
AzureOpenAIEmbedding¶
AzureOpenAIEmbedding¶
Bases: BaseEmbedding
Encoder provider using Azure OpenAI Embeddings.
This class wraps Azure OpenAI's embeddings API, handling both authentication and tokenization. It supports both direct embedding calls for a single text (embed_text) and batch embedding calls (embed_documents).
Azure deployments use deployment names (e.g., my-embedding-deployment) instead of OpenAI's standard model names. Since tiktoken may not be able to map a deployment name to a tokenizer automatically, this class implements a fallback mechanism to use a known encoding (e.g., cl100k_base) when necessary.
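The fallback just described can be pictured as follows. This is assumed logic, not the class's exact code; the lookup functions are injected so the sketch is self-contained, but with tiktoken they would be `tiktoken.encoding_for_model` and `tiktoken.get_encoding`, where the former raises `KeyError` for names it cannot map (such as Azure deployment names).

```python
from typing import Any, Callable


def resolve_encoding(deployment_name: str,
                     encoding_for_model: Callable[[str], Any],
                     get_encoding: Callable[[str], Any],
                     fallback: str = "cl100k_base") -> Any:
    """Map a name to a tokenizer, falling back to a known encoding."""
    try:
        return encoding_for_model(deployment_name)
    except KeyError:
        # Deployment names are user-chosen, so the automatic mapping
        # often fails; use a known encoding instead.
        return get_encoding(fallback)
```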
Example
from splitter_mr.embedding import AzureOpenAIEmbedding
embedder = AzureOpenAIEmbedding(
azure_deployment="text-embedding-3-large",
api_key="...",
azure_endpoint="https://my-azure-endpoint.openai.azure.com/"
)
vector = embedder.embed_text("Hello world")
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
__init__(model_name=None, api_key=None, azure_endpoint=None, azure_deployment=None, api_version=None, tokenizer_name=None)¶
Initialize the Azure OpenAI Embedding provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `Optional[str]` | OpenAI model name (unused for Azure, but kept for API parity). | `None` |
| `api_key` | `Optional[str]` | API key for Azure OpenAI. If not provided, it will be read from the environment variable `AZURE_OPENAI_API_KEY`. | `None` |
| `azure_endpoint` | `Optional[str]` | The base endpoint for the Azure OpenAI service. If not provided, it will be read from `AZURE_OPENAI_ENDPOINT`. | `None` |
| `azure_deployment` | `Optional[str]` | Deployment name for the embeddings model in Azure OpenAI. If not provided, it will be read from `AZURE_OPENAI_DEPLOYMENT`. | `None` |
| `api_version` | `Optional[str]` | Azure API version string; if not provided, a default version is used. | `None` |
| `tokenizer_name` | `Optional[str]` | Optional explicit tokenizer name for `tiktoken`. | `None` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If any required parameter is missing and not found in the environment variables. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
get_client()¶
Get the underlying Azure OpenAI client.
Returns:

| Name | Type | Description |
|---|---|---|
| `AzureOpenAI` | `AzureOpenAI` | The configured Azure OpenAI API client. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
embed_text(text, **parameters)¶
Compute an embedding vector for a single text string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to embed. Must be non-empty and within the model's token limit. | required |
| `**parameters` | `Any` | Additional parameters to forward to the Azure OpenAI embeddings API. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[float]` | The computed embedding vector. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `text` is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts in a single API call.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | List of text strings to embed. All items must be non-empty strings within the token limit. | required |
| `**parameters` | `Any` | Additional parameters to forward to the Azure OpenAI embeddings API. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[List[float]]` | A list of embedding vectors, one per input text. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If any input is empty or exceeds the model's token limit. |
Source code in src/splitter_mr/embedding/embeddings/azure_openai_embedding.py
GeminiEmbedding¶
GeminiEmbedding¶
Bases: BaseEmbedding
Embedding provider using Google Gemini's embedding API.
This class wraps the Gemini API for generating embeddings from text or documents. Requires the google-genai package and a valid Gemini API key. This class is available only if splitter-mr[multimodal] is installed.
Typical usage example
from splitter_mr.embedding.models.gemini_embedding import GeminiEmbedding
embedder = GeminiEmbedding(api_key="your-api-key")
vector = embedder.embed_text("Hello, world!")
print(vector)
Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
__init__(model_name='models/embedding-001', api_key=None)¶
Initialize the Gemini embedding provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | The Gemini model identifier to use for embedding. | `'models/embedding-001'` |
| `api_key` | `Optional[str]` | The Gemini API key. If not provided, reads from the `GEMINI_API_KEY` environment variable. | `None` |

Raises:

| Type | Description |
|---|---|
| `ImportError` | If the `google-genai` package is not installed. |
| `ValueError` | If no API key is provided or found in the environment. |
Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
get_client()¶
Return the underlying Gemini API client.
Returns:

| Type | Description |
|---|---|
| `Client` | The loaded Gemini API client (`google.genai`). |
Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
embed_text(text, **parameters)¶
Generate an embedding for a single text string using Gemini.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The input text to embed. | required |
| `**parameters` | `Any` | Additional parameters for the Gemini API. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[float]` | The generated embedding vector. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the input text is not a non-empty string. |
| `RuntimeError` | If the embedding call fails or returns an invalid response. |
Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
embed_documents(texts, **parameters)¶
Generate embeddings for a list of text strings using Gemini.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | A list of input text strings. | required |
| `**parameters` | `Any` | Additional parameters for the Gemini API. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[List[float]]` | The generated embedding vectors, one per input. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the input is not a non-empty list of non-empty strings. |
| `RuntimeError` | If the embedding call fails or returns an invalid response. |
Source code in src/splitter_mr/embedding/embeddings/gemini_embedding.py
AnthropicEmbedding¶
AnthropicEmbedding¶
Bases: BaseEmbedding
Embedding provider aligned with Anthropic's guidance, implemented via Voyage AI.
Anthropic does not offer a native embeddings API; their docs recommend using third-party providers such as Voyage AI for high-quality, domain-specific, and multimodal embeddings. This class wraps Voyage's Python SDK to provide a consistent interface that matches BaseEmbedding.
Example
from splitter_mr.embedding import AnthropicEmbedding
embedder = AnthropicEmbedding(model_name="voyage-3.5")
vec = embedder.embed_text("hello world", input_type="document")
print(len(vec))
Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
__init__(model_name='voyage-3.5', api_key=None, default_input_type='document')¶
Initialize the Voyage embeddings provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | Voyage embedding model name (e.g., "voyage-3.5", "voyage-3-large", "voyage-code-3", "voyage-finance-2", "voyage-law-2"). | `'voyage-3.5'` |
| `api_key` | `Optional[str]` | Voyage API key. If not provided, reads from the `VOYAGE_API_KEY` environment variable. | `None` |
| `default_input_type` | `Optional[str]` | Default for Voyage's `input_type` parameter (`"document"` or `"query"`). | `'document'` |

Raises:

| Type | Description |
|---|---|
| `ImportError` | If the `voyageai` package is not installed. |
| `ValueError` | If no API key is provided or found in the environment. |
Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
get_client()¶
Return the underlying Voyage client.
Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
embed_text(text, **parameters)¶
Compute an embedding vector for a single text string.
Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts in one API call.
Source code in src/splitter_mr/embedding/embeddings/anthropic_embedding.py
HuggingFaceEmbedding¶
Warning
Currently, only models compatible with the sentence-transformers library are available.
HuggingFaceEmbedding¶
Bases: BaseEmbedding
Encoder provider using Hugging Face sentence-transformers models.
This class wraps a local (or HF Hub) SentenceTransformer model to produce dense embeddings for text. It provides a consistent interface with BaseEmbedding and convenient options for device selection and optional input-length validation. This class is available only if splitter-mr[multimodal] is installed.
Example
from splitter_mr.embedding.models.huggingface_embedding import HuggingFaceEmbedding
# Any sentence-transformers checkpoint works (local path or HF Hub id)
embedder = HuggingFaceEmbedding(
model_name="ibm-granite/granite-embedding-english-r2",
device="cpu", # or "cuda", "mps", etc.
normalize=True, # L2-normalize outputs
enforce_max_length=True # raise if text exceeds model max seq length
)
vector = embedder.embed_text("hello world")
print(vector)
Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
__init__(model_name='sentence-transformers/all-MiniLM-L6-v2', device='cpu', normalize=True, enforce_max_length=False)¶
Initialize the sentence-transformers embeddings provider.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_name` | `str` | SentenceTransformer model id or local path (e.g., `sentence-transformers/all-MiniLM-L6-v2`). | `'sentence-transformers/all-MiniLM-L6-v2'` |
| `device` | `Optional[str]` | Optional device spec (e.g., `"cpu"`, `"cuda"`, `"mps"`). | `'cpu'` |
| `normalize` | `bool` | If True, return L2-normalized embeddings. | `True` |
| `enforce_max_length` | `bool` | If True, attempt to count tokens and raise `ValueError` when the input exceeds the model's maximum sequence length. | `False` |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the model cannot be loaded. |
Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
get_client()¶
Return the underlying SentenceTransformer instance.
Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
embed_text(text, **parameters)¶
Compute an embedding vector for a single text string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to embed. Must be non-empty. If `enforce_max_length` is True, it must also fit within the model's maximum sequence length. | required |
| `**parameters` | `Any` | Extra keyword arguments forwarded to `SentenceTransformer.encode`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `List[float]` | The computed embedding vector. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `text` is empty or exceeds the maximum sequence length. |
| `RuntimeError` | If the embedding call fails unexpectedly. |
Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
embed_documents(texts, **parameters)¶
Compute embeddings for multiple texts efficiently using encode.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `List[str]` | List of input strings to embed. Must be non-empty and contain only non-empty strings. Length enforcement is applied per item if `enforce_max_length` is True. | required |
| `**parameters` | `Any` | Extra keyword arguments forwarded to `SentenceTransformer.encode`. | `{}` |
|
Returns:

| Type | Description |
|---|---|
| `List[List[float]]` | One embedding per input string. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If `texts` is empty or contains empty strings. |
| `RuntimeError` | If the embedding call fails unexpectedly. |
Source code in src/splitter_mr/embedding/embeddings/huggingface_embedding.py
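A closing note on the normalize=True option described above: once embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which is why normalization is convenient for similarity search. A small self-contained illustration in pure Python (the vectors are made up; no model is involved):

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity for arbitrary (non-normalized) vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
```

For already-normalized vectors, `dot(u, v)` and `cosine(u, v)` agree, so a vector store can skip the per-query norm computation.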