Splitter¶
Introduction¶
The Splitter component implements the main functionality of this library. It provides classes (inherited from BaseSplitter) that split markdown text or plain strings following many different strategies.
Splitter strategies description¶
Splitting Technique | Description |
---|---|
Character Splitter | Splits text into chunks based on a specified number of characters. Supports overlapping by character count or percentage. Parameters: chunk_size (max chars per chunk), chunk_overlap (overlapping chars: int or %). Compatible with: Text. |
Word Splitter | Splits text into chunks based on a specified number of words. Supports overlapping by word count or percentage. Parameters: chunk_size (max words per chunk), chunk_overlap (overlapping words: int or %). Compatible with: Text. |
Sentence Splitter | Splits text into chunks by a specified number of sentences. Allows overlap defined by a number or percentage of words from the end of the previous chunk. Customizable sentence separators (e.g., `.`, `!`, `?`). Parameters: chunk_size (max sentences per chunk), chunk_overlap (overlapping words: int or %), sentence_separators (list of characters). Compatible with: Text. |
Paragraph Splitter | Splits text into chunks based on a specified number of paragraphs. Allows overlapping by word count or percentage, and customizable line breaks. Parameters: chunk_size (max paragraphs per chunk), chunk_overlap (overlapping words: int or %), line_break (delimiter(s) for paragraphs). Compatible with: Text. |
Recursive Splitter | Recursively splits text based on a hierarchy of separators (e.g., paragraph, sentence, word, character) until chunks reach a target size. Tries to preserve semantic units as long as possible. Parameters: chunk_size (max chars per chunk), chunk_overlap (overlapping chars), separators (list of characters to split on, e.g., `["\n\n", "\n", " ", ""]`). Compatible with: Text. |
Keyword Splitter | Splits text into chunks around matches of specified keywords, using one or more regex patterns. Supports precise boundary control: matched keywords can be included before, after, on both sides, or omitted from the split. Each keyword can have a custom name (via dict) for metadata counting. Secondary soft-wrapping by chunk_size is supported. Parameters: patterns (list of regex patterns, or dict mapping names to patterns), include_delimiters ("before", "after", "both", or "none"), flags (regex flags, e.g. re.MULTILINE), chunk_size (max chars per chunk, soft-wrapped). Compatible with: Text. |
Token Splitter | Splits text into chunks based on the number of tokens, using various tokenization models (e.g., tiktoken, spaCy, NLTK). Useful for ensuring chunks are compatible with LLM context limits. Parameters: chunk_size (max tokens per chunk), model_name (tokenizer/model, e.g., "tiktoken/cl100k_base", "spacy/en_core_web_sm", "nltk/punkt"), language (for NLTK). Compatible with: Text. |
Paged Splitter | Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap. Parameters: num_pages (pages per chunk), chunk_overlap (overlapping words). Compatible with: Word, PDF, Excel, PowerPoint. |
Row/Column Splitter | For tabular formats, splits data by a set number of rows or columns per chunk, with possible overlap. Row-based and column-based splitting are mutually exclusive. Parameters: num_rows , num_cols (rows/columns per chunk), overlap (overlapping rows or columns). Compatible with: Tabular formats (csv, tsv, parquet, flat json). |
JSON Splitter | Recursively splits JSON documents into smaller sub-structures that preserve the original JSON schema. Parameters: max_chunk_size (max chars per chunk), min_chunk_size (min chars per chunk). Compatible with: JSON. |
Semantic Splitter | Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings. Parameters: embedding_model (model for embeddings), max_tokens (max tokens per chunk). Compatible with: Text. |
HTML Tag Splitter | Splits HTML content based on a specified tag, or automatically detects the most frequent and shallowest tag if not specified. Each chunk is a complete HTML fragment for that tag. Parameters: chunk_size (max chars per chunk), tag (HTML tag to split on, optional). Compatible with: HTML. |
Header Splitter | Splits Markdown or HTML documents into chunks using header levels (e.g., `#`, `##`, or `<h1>`, `<h2>`). Uses configurable headers for chunking. Parameters: headers_to_split_on (list of headers and semantic names), chunk_size (unused, for compatibility). Compatible with: Markdown, HTML. |
Code Splitter | Splits source code files into programmatically meaningful chunks (functions, classes, methods, etc.), aware of the syntax of the specified programming language (e.g., Python, Java, Kotlin). Uses language-aware logic to avoid splitting inside code blocks. Parameters: chunk_size (max chars per chunk), language (programming language as string, e.g., "python" , "java" ). Compatible with: Source code files (Python, Java, Kotlin, C++, JavaScript, Go, etc.). |
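Every splitter follows the same usage pattern: wrap the input in a ReaderOutput (normally produced by a Reader), instantiate a splitter, and call `split`. A minimal sketch follows; the `splitter_mr.schema` import path is inferred from the source paths shown on this page and may differ in your version:

```python
from splitter_mr.schema import ReaderOutput  # path inferred from src/splitter_mr/schema/models.py
from splitter_mr.splitter import CharacterSplitter

# Normally produced by a Reader; built by hand here for illustration.
reader_output = ReaderOutput(
    text="abcdefghijklmnopqrstuvwxyz",
    document_name="doc.txt",
    document_path="/path/doc.txt",
)

splitter = CharacterSplitter(chunk_size=5, chunk_overlap=2)
output = splitter.split(reader_output)
print(output.chunks)  # e.g., ['abcde', 'defgh', 'ghijk', ...]
```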
Output format¶
Bases: BaseModel
Pydantic model defining the output structure for all splitters.
Attributes:
Name | Type | Description |
---|---|---|
chunks | List[str] | List of text chunks produced by splitting. |
chunk_id | List[str] | List of unique IDs corresponding to each chunk. |
document_name | Optional[str] | The name of the document. |
document_path | str | The path to the document. |
document_id | Optional[str] | A unique identifier for the document. |
conversion_method | Optional[str] | The method used for document conversion. |
reader_method | Optional[str] | The method used for reading the document. |
ocr_method | Optional[str] | The OCR method used, if any. |
split_method | str | The method used to split the document. |
split_params | Optional[Dict[str, Any]] | Parameters used during the splitting process. |
metadata | Optional[Dict[str, Any]] | Additional metadata associated with the splitting. |
Source code in src/splitter_mr/schema/models.py
append_metadata(metadata)¶
Append (update) the metadata dictionary with new key-value pairs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
metadata | Dict[str, Any] | The metadata to add or update. | required |
Source code in src/splitter_mr/schema/models.py
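A minimal sketch of updating metadata on an existing output, assuming SplitterOutput is importable from `splitter_mr.schema` (inferred from the source path above):

```python
from splitter_mr.schema import SplitterOutput  # import path assumed

output = SplitterOutput.from_chunks(["chunk one", "chunk two"])
output.append_metadata({"source": "example", "lang": "en"})
print(output.metadata)  # now contains the added key-value pairs
```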
from_chunks(chunks) classmethod¶
Create a SplitterOutput from a list of chunks, with all other fields set to their defaults.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | List[str] | A list of text chunks. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | An instance of SplitterOutput with the given chunks. |
Source code in src/splitter_mr/schema/models.py
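For example (same assumed import path as above):

```python
from splitter_mr.schema import SplitterOutput  # import path assumed

output = SplitterOutput.from_chunks(["alpha", "beta"])
print(output.chunks)         # ['alpha', 'beta']
print(len(output.chunk_id))  # expected 2: one generated ID per chunk
```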
validate_and_set_defaults()¶
Validates and sets defaults for the SplitterOutput instance.
Raises:
Type | Description |
---|---|
ValueError | If the instance fails validation. |
Returns:
Name | Type | Description |
---|---|---|
self | SplitterOutput | The validated and updated instance. |
Source code in src/splitter_mr/schema/models.py
Splitters¶
BaseSplitter¶
BaseSplitter¶
Bases: ABC
Abstract base class for all splitter implementations.
This class defines the common interface and utility methods for splitters that
divide text or data into smaller chunks, typically for downstream natural language
processing tasks or information retrieval. Subclasses should implement the split
method, which takes in a dictionary (typically from a document reader) and returns
a structured output with the required chunking.
Attributes:
Name | Type | Description |
---|---|---|
chunk_size | int | The maximum number of units (e.g., characters, words, etc.) per chunk. |
Methods:
Name | Description |
---|---|
split | Abstract method. Should be implemented by all subclasses to perform the actual splitting logic. |
_generate_chunk_ids | Generates a list of unique chunk IDs using UUID4, for use in the output. |
_default_metadata | Returns a default (empty) metadata dictionary, which can be extended by subclasses. |
Source code in src/splitter_mr/splitter/base_splitter.py
__init__(chunk_size=1000)¶
Initializer method for BaseSplitter classes.
Source code in src/splitter_mr/splitter/base_splitter.py
split(reader_output) abstractmethod¶
Abstract method to split input data into chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | ReaderOutput | Input data, typically from a document reader, including the text to split and any relevant metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | A SplitterOutput containing the split chunks and associated metadata. |
Source code in src/splitter_mr/splitter/base_splitter.py
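To make the contract concrete, here is a hypothetical subclass. `_generate_chunk_ids` is documented above, but its exact signature (taking the number of chunks) is an assumption, as is the `splitter_mr.schema` import path:

```python
from splitter_mr.schema import ReaderOutput, SplitterOutput  # import path assumed
from splitter_mr.splitter import BaseSplitter

class LineSplitter(BaseSplitter):
    """Illustrative splitter: one chunk per non-empty line."""

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        chunks = [line for line in reader_output.text.splitlines() if line.strip()]
        return SplitterOutput(
            chunks=chunks,
            chunk_id=self._generate_chunk_ids(len(chunks)),  # signature assumed
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            split_method="line_splitter",
            split_params={"chunk_size": self.chunk_size},
        )
```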
CharacterSplitter¶
CharacterSplitter¶
Bases: BaseSplitter
CharacterSplitter splits a given text into overlapping or non-overlapping chunks based on a specified number of characters per chunk.
This splitter is configurable with a maximum chunk size (`chunk_size`) and an overlap between consecutive chunks (`chunk_overlap`). The overlap can be specified either as an integer (number of characters) or as a float between 0 and 1 (fraction of chunk size). This is particularly useful for downstream NLP tasks where context preservation between chunks is important.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum number of characters per chunk. | 1000 |
chunk_overlap | Union[int, float] | Number or percentage of overlapping characters between chunks. | 0 |
Source code in src/splitter_mr/splitter/splitters/character_splitter.py
split(reader_output)¶
Splits the input text from the reader_output dictionary into character-based chunks. Each chunk contains at most `chunk_size` characters, and adjacent chunks can overlap by a specified number or percentage of characters, according to the `chunk_overlap` parameter set at initialization. Returns a dictionary with the same document metadata, unique chunk identifiers, and the split parameters used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If chunk_overlap is greater than or equal to chunk_size. |
Example
```python
from splitter_mr.splitter import CharacterSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text="abcdefghijklmnopqrstuvwxyz",
    document_name="doc.txt",
    document_path="/path/doc.txt",
)

splitter = CharacterSplitter(chunk_size=5, chunk_overlap=2)
output = splitter.split(reader_output)
print(output.chunks)
# ['abcde', 'defgh', 'ghijk', ..., 'yz']
```
Source code in src/splitter_mr/splitter/splitters/character_splitter.py
WordSplitter¶
WordSplitter¶
Bases: BaseSplitter
WordSplitter splits a given text into overlapping or non-overlapping chunks based on a specified number of words per chunk.
This splitter is configurable with a maximum chunk size (`chunk_size`, in words) and an overlap between consecutive chunks (`chunk_overlap`). The overlap can be specified either as an integer (number of words) or as a float between 0 and 1 (fraction of chunk size). Useful for NLP tasks where word-based boundaries are important for context preservation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum number of words per chunk. | 5 |
chunk_overlap | Union[int, float] | Number or percentage of overlapping words between chunks. | 0 |
Source code in src/splitter_mr/splitter/splitters/word_splitter.py
split(reader_output)¶
Splits the input text from the reader_output dictionary into word-based chunks. Each chunk contains at most `chunk_size` words, and adjacent chunks can overlap by a specified number or percentage of words, according to the `chunk_overlap` parameter set at initialization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If chunk_overlap is greater than or equal to chunk_size. |
Example
```python
from splitter_mr.splitter import WordSplitter

reader_output = ReaderOutput(
    text="The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs. Sphinx of black quartz, judge my vow.",
    document_name="pangrams.txt",
    document_path="/https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/pangrams.txt",
)

# Split into chunks of 5 words, overlapping by 2 words
splitter = WordSplitter(chunk_size=5, chunk_overlap=2)
output = splitter.split(reader_output)
print(output.chunks)
# ['The quick brown fox jumps',
#  'fox jumps over the lazy',
#  'over the lazy dog. Pack', ...]
```
Source code in src/splitter_mr/splitter/splitters/word_splitter.py
SentenceSplitter¶
SentenceSplitter¶
Bases: BaseSplitter
SentenceSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk contains a specified number of sentences, and overlap is defined by a number or percentage of words from the end of the previous chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum number of sentences per chunk. | 5 |
chunk_overlap | Union[int, float] | Number or percentage of overlapping words between chunks. | 0 |
separators | Union[str, List[str]] | Character(s) to split sentences. | DEFAULT_SENTENCE_SEPARATORS |
Source code in src/splitter_mr/splitter/splitters/sentence_splitter.py
split(reader_output)¶
Splits the input text from the `reader_output` dictionary into sentence-based chunks, allowing for overlap at the word level. Each chunk contains at most `chunk_size` sentences, where sentence boundaries are detected using the specified `separators` (e.g., '.', '!', '?'). Overlap between consecutive chunks is specified by `chunk_overlap`, which can be an integer (number of words) or a float (fraction of the maximum words in a sentence). This is useful for downstream NLP tasks that require context preservation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata, such as 'document_name', 'document_path', 'document_id', etc. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If chunk_overlap is greater than or equal to chunk_size. |
ValueError | If 'text' is missing in reader_output. |
Example
```python
from splitter_mr.splitter import SentenceSplitter

# Example input: 7 sentences with varied punctuation
# This object has been obtained as an output from a Reader class.
reader_output = ReaderOutput(
    text="Hello world! How are you? I am fine. Testing sentence splitting. Short. End! And another?",
    document_name="sample.txt",
    document_path="/tmp/sample.txt",
    document_id="123",
)

# Split into chunks of 3 sentences each, no overlap
splitter = SentenceSplitter(chunk_size=3, chunk_overlap=0)
result = splitter.split(reader_output)
print(result.chunks)
# ['Hello world! How are you? I am fine.',
#  'Testing sentence splitting. Short. End!',
#  'And another?', ...]
```
Source code in src/splitter_mr/splitter/splitters/sentence_splitter.py
ParagraphSplitter¶
ParagraphSplitter¶
Bases: BaseSplitter
ParagraphSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk contains a specified number of paragraphs, and overlap is defined by a number or percentage of words from the end of the previous chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum number of paragraphs per chunk. | 3 |
chunk_overlap | Union[int, float] | Number or percentage of overlapping words between chunks. | 0 |
line_break | Union[str, List[str]] | Character(s) used to split text into paragraphs. | DEFAULT_PARAGRAPH_SEPARATORS |
Source code in src/splitter_mr/splitter/splitters/paragraph_splitter.py
split(reader_output)¶
Splits text in `reader_output['text']` into paragraph-based chunks, with optional word overlap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path'). | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If 'text' is missing from reader_output. |
Example
```python
from splitter_mr.splitter import ParagraphSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text="Para 1.\n\nPara 2.\n\nPara 3.",
    document_name="test.txt",
    document_path="/tmp/test.txt",
)

splitter = ParagraphSplitter(chunk_size=2, chunk_overlap=1, line_break="\n\n")
output = splitter.split(reader_output)
print(output.chunks)
# ['Para 1.\n\nPara 2.', '2. Para 3.']
```
Source code in src/splitter_mr/splitter/splitters/paragraph_splitter.py
RecursiveCharacterSplitter¶
RecursiveCharacterSplitter¶
Bases: BaseSplitter
RecursiveCharacterSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk is created by repeatedly breaking down the text until it reaches the desired chunk size. This class implements the Langchain RecursiveCharacterTextSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Approximate chunk size, in characters. | 1000 |
chunk_overlap | Union[int, float] | Number or percentage of overlapping characters between chunks. | 0.1 |
separators | Union[str, List[str]] | Character(s) to recursively split sentences. | DEFAULT_RECURSIVE_SEPARATORS |
Notes
More info about the RecursiveCharacterTextSplitter: Langchain Docs.
Source code in src/splitter_mr/splitter/splitters/recursive_splitter.py
split(reader_output)¶
Splits the input text into character-based chunks using a recursive splitting strategy (via Langchain's `RecursiveCharacterTextSplitter`), supporting configurable separators, chunk size, and overlap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If 'text' is missing in reader_output. |
Example
```python
from splitter_mr.splitter import RecursiveCharacterSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text=(
        "This is a long document. "
        "It will be recursively split into smaller chunks using the specified separators. "
        "Each chunk will have some overlap with the next."
    ),
    document_name="sample.txt",
    document_path="/tmp/sample.txt",
)

splitter = RecursiveCharacterSplitter(chunk_size=40, chunk_overlap=5)
output = splitter.split(reader_output)
print(output.chunks)
# ['This is a long document. It will be', 'be recursively split into smaller chunks', ...]
```
Source code in src/splitter_mr/splitter/splitters/recursive_splitter.py
KeywordSplitter¶
KeywordSplitter¶
Bases: BaseSplitter
Splitter that chunks text around keyword boundaries using regular expressions.
This splitter searches the input text for one or more keyword patterns (regex) and creates chunks at each match boundary. You can control how the matched delimiter is attached to the resulting chunks (before/after/both/none) and apply a secondary, size-based re-chunking to respect `chunk_size`.
The splitter emits a `SplitterOutput` with metadata including per-keyword match counts and raw match spans.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
patterns | Union[List[str], Dict[str, str]] | A list of regex pattern strings, or a mapping of names to patterns. | required |
flags | int | Standard re flags (e.g., re.MULTILINE). | 0 |
include_delimiters | str | Where to attach the matched keyword delimiter. One of "before", "after", "both", or "none". | 'before' |
chunk_size | int | Target maximum size (in characters) for each chunk. When a produced chunk exceeds this value, it is soft-wrapped by whitespace using a greedy strategy. | 100000 |
Notes
- All regexes are compiled into one alternation with named groups when `patterns` is a dict. This simplifies per-keyword accounting.
- If the input text is empty or no matches are found, the entire text becomes a single chunk (subject to size-based re-chunking).
Source code in src/splitter_mr/splitter/splitters/keyword_splitter.py
__init__(patterns, *, flags=0, include_delimiters='before', chunk_size=100000)¶
Initialize the KeywordSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
patterns | Union[List[str], Dict[str, str]] | Keyword regex patterns. | required |
flags | int | Regex flags. | 0 |
include_delimiters | str | How to include delimiters (before, after, both, none). | 'before' |
chunk_size | int | Max chunk size in characters. | 100000 |
Source code in src/splitter_mr/splitter/splitters/keyword_splitter.py
split(reader_output)¶
Split ReaderOutput into keyword-delimited chunks and build structured output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | ReaderOutput | Input document and metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Output structure with chunked text and metadata. |
Source code in src/splitter_mr/splitter/splitters/keyword_splitter.py
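The docstring ships no example for this splitter, so the following is an illustrative sketch; the exact chunk boundaries and metadata keys are assumptions based on the semantics described above, and the `splitter_mr.schema` import path is assumed:

```python
from splitter_mr.schema import ReaderOutput  # import path assumed
from splitter_mr.splitter import KeywordSplitter

reader_output = ReaderOutput(
    text="INFO start INFO loading WARN disk low INFO done",
    document_name="app.log",
    document_path="/tmp/app.log",
)

# Named patterns (a dict) enable per-keyword match counts in the metadata.
splitter = KeywordSplitter(
    patterns={"info": r"INFO", "warn": r"WARN"},
    include_delimiters="before",
)
output = splitter.split(reader_output)
print(output.chunks)    # chunks cut at each INFO/WARN match
print(output.metadata)  # expected to include per-keyword match counts
```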
HeaderSplitter¶
HeaderSplitter¶
Bases: BaseSplitter
Split HTML or Markdown documents into chunks by header levels (H1–H6).
- If the input looks like HTML, it is first converted to Markdown using the project's HtmlToMarkdown utility, which emits ATX-style headings (`#`, `##`, ...).
- If the input is Markdown, Setext-style headings (underlines with `===`/`---`) are normalized to ATX so headers are reliably detected.
- Splitting is performed with LangChain's MarkdownHeaderTextSplitter.
- If no headers are detected after conversion/normalization, a safe fallback splitter (RecursiveCharacterTextSplitter) is used to avoid returning a single, excessively large chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Size hint for fallback splitting; not used by header splitting itself. Defaults to 1000. | 1000 |
headers_to_split_on | Optional[List[str]] | Semantic header names like ["Header 1", "Header 2"]. If None, all levels 1–6 are enabled. | None |
group_header_with_content | bool | If True (default), headers are kept with their following content (strip_headers=False). If False, headers are stripped from chunks (strip_headers=True). | True |
Example
```python
from splitter_mr.splitter import HeaderSplitter

splitter = HeaderSplitter(headers_to_split_on=["Header 1", "Header 2", "Header 3"])
output = splitter.split(reader_output)  # reader_output.text may be HTML or MD
for idx, chunk in enumerate(output.chunks):
    print(f"--- Chunk {idx+1} ---")
    print(chunk)
```
Source code in src/splitter_mr/splitter/splitters/header_splitter.py
__init__(chunk_size=1000, headers_to_split_on=None, *, group_header_with_content=True)¶
Initialize the HeaderSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Used by fallback character splitter if no headers are found. | 1000 |
headers_to_split_on | Optional[List[str]] | Semantic headers, e.g. ["Header 1", "Header 2"]. Defaults to all levels 1–6. | None |
group_header_with_content | bool | Keep headers attached to following content if True. | True |
Source code in src/splitter_mr/splitter/splitters/header_splitter.py
split(reader_output)¶
Perform header-based splitting with HTML→Markdown conversion and safe fallback.
Steps:
1. Detect filetype (HTML/MD).
2. If HTML, convert to Markdown with HtmlToMarkdown (emits ATX headings).
3. If Markdown, normalize Setext headings to ATX.
4. Split by headers via MarkdownHeaderTextSplitter.
5. If no headers are found, fall back to RecursiveCharacterTextSplitter.
Source code in src/splitter_mr/splitter/splitters/header_splitter.py
RecursiveJSONSplitter¶
RecursiveJSONSplitter¶
Bases: BaseSplitter
RecursiveJSONSplitter splits a JSON string or structure into overlapping or non-overlapping chunks, using the Langchain RecursiveJsonSplitter. This splitter is designed to recursively break down JSON data (including nested objects and arrays) into manageable pieces based on keys, arrays, or other separators, until the desired chunk size is reached.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum chunk size, measured in the number of characters per chunk. | 1000 |
min_chunk_size | int | Minimum chunk size, in characters. | 200 |
Source code in src/splitter_mr/splitter/splitters/json_splitter.py
split(reader_output)¶
Splits the input JSON text from the reader_output dictionary into recursively chunked pieces, allowing for overlap by number or percentage of characters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If the 'text' field is missing from reader_output. |
JSONDecodeError | If the 'text' field contains invalid JSON. |
Example
```python
from splitter_mr.splitter import RecursiveJSONSplitter

# This object has been obtained from `VanillaReader`
reader_output = ReaderOutput(
    text='{"company": {"name": "TechCorp", "employees": [{"name": "Alice"}, {"name": "Bob"}]}}',
    document_name="company_data.json",
    document_path="/https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/company_data.json",
    document_id="doc123",
    conversion_method="vanilla",
    ocr_method=None,
)

splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
output = splitter.split(reader_output)
print(output.chunks)
# ['{"company": {"name": "TechCorp"}}',
#  '{"employees": [{"name": "Alice"}, {"name": "Bob"}]}']
```
Source code in src/splitter_mr/splitter/splitters/json_splitter.py
HTMLTagSplitter¶
HTMLTagSplitter¶
Bases: BaseSplitter
HTMLTagSplitter splits HTML content into chunks based on a specified tag. Supports batching and optional Markdown conversion.
Behavior
- When `tag` is specified (e.g., tag="div"), finds all matching elements.
- When `tag` is None, splits by the most frequent and shallowest tag.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum chunk size in characters (only used when batch=True). | 1 |
tag | Optional[str] | HTML tag to split on. If None, auto-detects the best tag. | None |
batch | bool | If True (default), groups multiple tags into a chunk, not exceeding chunk_size. | True |
to_markdown | bool | If True, converts each chunk to Markdown using HtmlToMarkdown. | True |
Example
```python
# HTML literal reconstructed for illustration; the rendered docs lost the tags.
reader_output = ReaderOutput(text="<div>A</div><div>B</div>")

splitter = HTMLTagSplitter(tag="div", batch=False)
splitter.split(reader_output).chunks
# ['A', 'B']

splitter = HTMLTagSplitter(tag="div", batch=True, chunk_size=100)
splitter.split(reader_output).chunks
# ['AB']

splitter = HTMLTagSplitter(tag="div", batch=False, to_markdown=True)
splitter.split(reader_output).chunks
# ['A', 'B']
```
Attributes:
Name | Type | Description |
---|---|---|
chunk_size | int | Maximum chunk size. |
tag | Optional[str] | Tag to split on. |
batch | bool | Whether to group elements into chunks. |
to_markdown | bool | Whether to convert each chunk to Markdown. |
Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
__init__(chunk_size=1, tag=None, *, batch=True, to_markdown=True)¶
Initialize HTMLTagSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum chunk size, in characters (only for batching). | 1 |
tag | Optional[str] | Tag to split on. If None, auto-detects. | None |
batch | bool | If True (default), groups tags up to chunk_size. | True |
to_markdown | bool | If True (default), convert each chunk to Markdown. | True |
Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
split(reader_output)¶
Splits HTML using the specified tag and batching, with optional Markdown conversion.
Semantics for tables:
- batch=False: one chunk per requested element. If splitting by a row-level tag (e.g. 'tr'), emit a mini-table per row: the table header once plus that row in the body.
- batch=True and chunk_size in (0, 1, None): all tables in one chunk.
- batch=True and chunk_size > 1: split each table into multiple chunks by batching rows.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | ReaderOutput | ReaderOutput containing at least the text to split. | required |
Returns:
Type | Description |
---|---|
SplitterOutput | SplitterOutput |
Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
RowColumnSplitter¶
RowColumnSplitter¶
Bases: BaseSplitter
RowColumnSplitter splits tabular data (such as CSV, TSV, Markdown tables, or JSON tables) into smaller tables based on rows, columns, or by total character size while preserving row integrity.
This splitter supports several modes:
- By rows: Split the table into chunks with a fixed number of rows, with optional overlapping rows between chunks.
- By columns: Split the table into chunks by columns, with optional overlapping columns between chunks.
- By chunk size: Split the table into markdown-formatted table chunks, where each chunk contains as many complete rows as fit under the specified character limit, optionally overlapping a fixed number of rows between chunks.
This is useful for splitting large tabular files for downstream processing, LLM ingestion, or display, while preserving semantic and structural integrity of the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum number of characters per chunk (when using character-based splitting). | 1000 |
num_rows | int | Number of rows per chunk. Mutually exclusive with num_cols. | 0 |
num_cols | int | Number of columns per chunk. Mutually exclusive with num_rows. | 0 |
chunk_overlap | Union[int, float] | Number of overlapping rows or columns between chunks. If a float in (0,1), interpreted as a percentage of rows or columns. If integer, the number of overlapping rows/columns. When chunking by character size, this refers to the number of overlapping rows (not characters). | 0 |
Supported formats: CSV, TSV, TXT, Markdown table, JSON (tabular: list of dicts or dict of lists).
Source code in src/splitter_mr/splitter/splitters/row_column_splitter.py
split(reader_output)¶
Splits the input tabular data into multiple markdown table chunks according to the specified chunking strategy. Each output chunk is a complete markdown table with header, and will never cut a row in half. The overlap is always applied in terms of full rows or columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary output from a Reader, containing at least: 'text' (the tabular data as string), 'conversion_method' (format of the input: 'csv', 'tsv', 'markdown', 'json', etc.), plus optional document metadata fields. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If both num_rows and num_cols are set. |
ValueError | If chunk_overlap as float is not in [0,1). |
ValueError | If chunk_size is too small to fit the header and at least one data row. |
Example
```python
reader_output = ReaderOutput(
    text='| id | name |\n|----|------|\n| 1 | A |\n| 2 | B |\n| 3 | C |',
    conversion_method="markdown",
    document_name="table.md",
    document_path="/path/table.md",
)

splitter = RowColumnSplitter(chunk_size=80, chunk_overlap=20)
output = splitter.split(reader_output)
for chunk in output.chunks:
    print("\n" + str(chunk) + "\n")
# | id   | name   |
# |------|--------|
# | 1    | A      |
# | 2    | B      |
#
# | id   | name   |
# |------|--------|
# | 2    | B      |
# | 3    | C      |
```
Source code in src/splitter_mr/splitter/splitters/row_column_splitter.py
CodeSplitter¶
CodeSplitter¶
Bases: BaseSplitter
CodeSplitter recursively splits source code into programmatically meaningful chunks (functions, classes, methods, etc.) for the given programming language.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum chunk size, in characters. | 1000 |
language | str | Programming language (e.g., "python", "java", "kotlin", etc.) | 'python' |
Notes
- Uses Langchain's RecursiveCharacterTextSplitter and its language-aware `from_language` method.
- See Langchain docs: https://python.langchain.com/docs/how_to/code_splitter/
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
split(reader_output)¶
Splits code in `reader_output['text']` according to the syntax of the specified programming language, using function/class boundaries where possible.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | ReaderOutput | Object containing at least a 'text' field, plus optional document metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
ValueError | If language is not supported. |
Example
```python
from splitter_mr.splitter import CodeSplitter

reader_output = ReaderOutput(
    text="def foo():\n    pass\n\nclass Bar:\n    def baz(self):\n        pass",
    document_name="example.py",
    document_path="/tmp/example.py",
)

splitter = CodeSplitter(chunk_size=50, language="python")
output = splitter.split(reader_output)
print(output.chunks)
# ['def foo():\n    pass\n', 'class Bar:\n    def baz(self):\n        pass']
```
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
get_langchain_language(lang_str)¶
Map a string language name to Langchain Language enum. Raises ValueError if not found.
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
TokenSplitter¶
TokenSplitter¶
Bases: BaseSplitter
TokenSplitter splits a given text into chunks based on token counts derived from different tokenization models or libraries.
This splitter supports tokenization via `tiktoken` (OpenAI tokenizer), `spacy` (spaCy tokenizer), and `nltk` (NLTK tokenizer). It allows splitting text into chunks of a maximum number of tokens (`chunk_size`), using the specified tokenizer model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Maximum number of tokens per chunk. | 1000 |
model_name | str | Specifies the tokenizer and model in the format tokenizer/model (e.g., "tiktoken/cl100k_base", "spacy/en_core_web_sm", "nltk/punkt"). | DEFAULT_TOKENIZER |
language | str | Language code for the NLTK tokenizer. | DEFAULT_TOKEN_LANGUAGE |
Notes
More info about the splitting methods by Tokens for Langchain: Langchain Docs.
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
list_nltk_punkt_languages() staticmethod¶
Return a sorted list of available punkt models (languages) for NLTK.
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
split(reader_output)¶
Splits the input text from `reader_output` into token-based chunks using the specified tokenizer.
Depending on `model_name`, the splitter chooses the appropriate tokenizer:
- For `tiktoken`, uses `RecursiveCharacterTextSplitter` with tiktoken encoding, e.g. `tiktoken/cl100k_base`.
- For `spacy`, uses `SpacyTextSplitter` with the specified spaCy pipeline, e.g. `spacy/en_core_web_sm`.
- For `nltk`, uses `NLTKTextSplitter` with the specified language tokenizer, e.g. `nltk/punkt_tab`.
Automatically downloads spaCy and NLTK models if missing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | Dict[str, Any] | Dictionary containing at least a 'text' key (str) and optional document metadata, such as 'document_name', 'document_path', 'document_id', etc. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
RuntimeError | If a spaCy model specified in model_name is not available. |
ValueError | If an unsupported tokenizer is specified in model_name. |
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
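No example ships with this splitter, so here is an illustrative sketch; the model name follows the tokenizer/model format from the table above, and the `splitter_mr.schema` import path is assumed:

```python
from splitter_mr.schema import ReaderOutput  # import path assumed
from splitter_mr.splitter import TokenSplitter

reader_output = ReaderOutput(
    text=(
        "Tokens are the units language models actually see, so chunking "
        "by tokens keeps every chunk within a model's context window."
    ),
    document_name="tokens.txt",
    document_path="/tmp/tokens.txt",
)

splitter = TokenSplitter(chunk_size=10, model_name="tiktoken/cl100k_base")
output = splitter.split(reader_output)
print(output.chunks)  # each chunk holds at most ~10 cl100k_base tokens
```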
PagedSplitter¶
Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap.
PagedSplitter¶
Bases: BaseSplitter
Splits a multi-page document into page-based or multi-page chunks using a placeholder marker.
Supports overlap in characters between consecutive chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Number of pages per chunk. | 1 |
chunk_overlap | int | Number of overlapping characters to include from the end of the previous chunk. | 0 |
Raises:
Type | Description |
---|---|
ValueError | If chunk_size is less than 1. |
Source code in src/splitter_mr/splitter/splitters/paged_splitter.py
__init__(chunk_size=1, chunk_overlap=0)¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunk_size | int | Number of pages per chunk. | 1 |
chunk_overlap | int | Number of overlapping characters to include from the end of the previous chunk. | 0 |
Source code in src/splitter_mr/splitter/splitters/paged_splitter.py
split(reader_output)¶
Splits the input text into chunks using the page_placeholder in the ReaderOutput. Optionally adds character overlap between chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | ReaderOutput | The output from a reader containing text and metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | The result with chunks and related metadata. |
Raises:
Type | Description |
---|---|
ValueError | If the reader_output does not contain a valid page_placeholder. |
Example
```python
from splitter_mr.splitter import PagedSplitter

reader_output = ReaderOutput(
    text="<!-- page --> Page 1 <!-- page --> This is the page 2.",
    document_name="test.md",
    document_path="tmp/test.md",
    page_placeholder="<!-- page -->",
    # ...
)

splitter = PagedSplitter(chunk_size=1)
output = splitter.split(reader_output)
print(output.chunks)
# [' Page 1 ', ' This is the page 2.']
```
Source code in src/splitter_mr/splitter/splitters/paged_splitter.py
SemanticSplitter¶
Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings.
SemanticSplitter¶
Bases: BaseSplitter
Split text into semantically coherent chunks using embedding similarity.
Pipeline:
- Split text into sentences via `SentenceSplitter` (one-sentence chunks).
- Build a sliding window around each sentence (`buffer_size`).
- Embed each window with `BaseEmbedding` (batched).
- Compute cosine distances between consecutive windows (1 - cosine_sim).
- Pick breakpoints using a thresholding strategy, or aim for `number_of_chunks`.
- Join sentences between breakpoints; enforce minimum size via `chunk_size`.
Source code in src/splitter_mr/splitter/splitters/semantic_splitter.py
__init__(embedding, *, buffer_size=1, breakpoint_threshold_type='percentile', breakpoint_threshold_amount=None, number_of_chunks=None, chunk_size=1000)¶
Initialize the semantic splitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding | BaseEmbedding | Embedding backend. | required |
buffer_size | int | Neighbor window size around each sentence. | 1 |
breakpoint_threshold_type | BreakpointThresholdType | Threshold strategy: "percentile", "standard_deviation", "interquartile", or "gradient". | 'percentile' |
breakpoint_threshold_amount | Optional[float] | Threshold parameter. If None, uses sensible defaults per strategy (e.g., 95th percentile). | None |
number_of_chunks | Optional[int] | If set, pick a threshold that approximately yields this number of chunks (inverse percentile). | None |
chunk_size | int | Minimum characters required to emit a chunk. | 1000 |
Source code in src/splitter_mr/splitter/splitters/semantic_splitter.py
split(reader_output)¶
Split the document text into semantically coherent chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader_output | ReaderOutput | The document text & metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
SplitterOutput | SplitterOutput | Chunks, IDs, metadata, and splitter configuration. |
Notes
- With 1 sentence (or 2 in gradient mode), returns the text/sentences as-is.
- Chunks shorter than `chunk_size` (minimum) are skipped and merged forward; `chunk_size` behaves as the minimum chunk size in this splitter.
Source code in src/splitter_mr/splitter/splitters/semantic_splitter.py
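No docstring example ships with this splitter either, so the sketch below is illustrative; it assumes an embedding backend implementing `BaseEmbedding`, where the `OpenAIEmbedding` class and its import path are hypothetical placeholders for whatever backend your installation provides:

```python
from splitter_mr.schema import ReaderOutput  # import path assumed
from splitter_mr.splitter import SemanticSplitter
# Hypothetical embedding backend; substitute any BaseEmbedding implementation.
from splitter_mr.embedding import OpenAIEmbedding

embedding = OpenAIEmbedding(model_name="text-embedding-3-small")  # hypothetical

reader_output = ReaderOutput(
    text=(
        "Cats are independent pets. They groom themselves daily. "
        "The stock market closed higher today. Tech shares led the gains."
    ),
    document_name="mixed.txt",
    document_path="/tmp/mixed.txt",
)

splitter = SemanticSplitter(embedding, buffer_size=1, chunk_size=50)
output = splitter.split(reader_output)
print(output.chunks)  # sentences grouped into semantically coherent chunks
```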