Splitter¶
Introduction¶
The Splitter component implements the main functionality of this library. It provides a set of classes (inheriting from BaseSplitter) that split a markdown text or a plain string following many different strategies.
Splitter strategies description¶
Splitting Technique | Description |
---|---|
Character Splitter | Splits text into chunks based on a specified number of characters. Supports overlapping by character count or percentage. Main Parameters: chunk_size (max chars per chunk), chunk_overlap (overlapping chars: int or %). Compatible with: Text. |
Word Splitter | Splits text into chunks based on a specified number of words. Supports overlapping by word count or percentage. Main Parameters: chunk_size (max words per chunk), chunk_overlap (overlapping words: int or %). Compatible with: Text. |
Sentence Splitter | Splits text into chunks by a specified number of sentences. Allows overlap defined by a number or percentage of words from the end of the previous chunk. Customizable sentence separators (e.g., . , ! , ? ). Main Parameters: chunk_size (max sentences per chunk), chunk_overlap (overlapping words: int or %), sentence_separators (list of characters). Compatible with: Text. |
Paragraph Splitter | Splits text into chunks based on a specified number of paragraphs. Allows overlapping by word count or percentage, and customizable line breaks. Main Parameters: chunk_size (max paragraphs per chunk), chunk_overlap (overlapping words: int or %), line_break (delimiter(s) for paragraphs). Compatible with: Text. |
Recursive Character Splitter | Recursively splits text based on a hierarchy of separators (e.g., paragraph, sentence, word, character) until chunks reach a target size. Tries to preserve semantic units as long as possible. Main Parameters: chunk_size (max chars per chunk), chunk_overlap (overlapping chars), separators (list of characters to split on, e.g., ["\n\n", "\n", " ", ""] ). Compatible with: Text. |
Token Splitter | Splits text into chunks based on the number of tokens, using various tokenization models (e.g., tiktoken, spaCy, NLTK). Useful for ensuring chunks are compatible with LLM context limits. Main Parameters: chunk_size (max tokens per chunk), model_name (tokenizer/model, e.g., "tiktoken/cl100k_base" , "spacy/en_core_web_sm" , "nltk/punkt" ), language (for NLTK). Compatible with: Text. |
Paged Splitter | WORK IN PROGRESS. Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap. Main Parameters: num_pages (pages per chunk), chunk_overlap (overlapping words). Compatible with: Word, PDF, Excel, PowerPoint. |
Row/Column Splitter | For tabular formats, splits data by a set number of rows or columns per chunk, with possible overlap. Row-based and column-based splitting are mutually exclusive. Main Parameters: num_rows , num_cols (rows/columns per chunk), overlap (overlapping rows or columns). Compatible with: Tabular formats (csv, tsv, parquet, flat json). |
JSON Recursive Splitter | Recursively splits JSON documents into smaller sub-structures that preserve the original JSON schema. Main Parameters: max_chunk_size (max chars per chunk), min_chunk_size (min chars per chunk). Compatible with: JSON. |
Semantic Splitter | WORK IN PROGRESS. Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings. Main Parameters: embedding_model (model for embeddings), max_tokens (max tokens per chunk). Compatible with: Text. |
HTMLTagSplitter | Splits HTML content based on a specified tag, or automatically detects the most frequent and shallowest tag if not specified. Each chunk is a complete HTML fragment for that tag. Main Parameters: chunk_size (max chars per chunk), tag (HTML tag to split on, optional). Compatible with: HTML. |
HeaderSplitter | Splits Markdown or HTML documents into chunks using header levels (e.g., # , ## , or <h1> , <h2> ). Uses configurable headers for chunking. Main Parameters: headers_to_split_on (list of headers and semantic names), chunk_size (unused, for compatibility). Compatible with: Markdown, HTML. |
Code Splitter | Splits source code files into programmatically meaningful chunks (functions, classes, methods, etc.), aware of the syntax of the specified programming language (e.g., Python, Java, Kotlin). Uses language-aware logic to avoid splitting inside code blocks. Main Parameters: chunk_size (max chars per chunk), language (programming language as string, e.g., "python" , "java" ). Compatible with: Source code files (Python, Java, Kotlin, C++, JavaScript, Go, etc.). |
Warning
PagedSplitter and SemanticSplitter are not fully implemented yet. Stay tuned for updates!
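All splitters share the same two-step workflow: instantiate a splitter with its parameters, then call `split` on a `ReaderOutput` object. Below is a minimal sketch; the `ReaderOutput` import path is an assumption based on the schema module referenced in the next section.

```python
from splitter_mr.schema import ReaderOutput  # assumed import path (see schemas.py below)
from splitter_mr.splitter import CharacterSplitter

# Normally a Reader produces this object; it is built by hand here for brevity.
reader_output = ReaderOutput(
    text="Some long text to be chunked...",
    document_name="doc.txt",
    document_path="/path/doc.txt",
)

splitter = CharacterSplitter(chunk_size=100, chunk_overlap=10)
output = splitter.split(reader_output)
print(output.chunks)  # list of text chunks
```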
Output format¶
All splitters return a SplitterOutput, the dataclass defining the output structure for all splitters.
Source code in src/splitter_mr/schema/schemas.py
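The authoritative definition lives in `schemas.py`. The sketch below is illustrative only: `chunks` is used throughout the examples on this page, while the remaining field names are assumptions mirroring the `ReaderOutput` metadata fields and the `BaseSplitter` helpers documented below.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class SplitterOutput:
    """Illustrative sketch; see src/splitter_mr/schema/schemas.py for the real fields."""

    chunks: List[str]                                  # the split text pieces (`output.chunks`)
    chunk_id: List[str] = field(default_factory=list)  # assumed: UUID4 ids per chunk
    document_name: Optional[str] = None                # assumed: carried over from ReaderOutput
    document_path: Optional[str] = None                # assumed
    split_method: Optional[str] = None                 # assumed: name of the splitter used
    split_params: Dict[str, Any] = field(default_factory=dict)  # assumed: parameters used
    metadata: Dict[str, Any] = field(default_factory=dict)      # assumed: extra metadata
```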
Splitters¶
BaseSplitter¶
BaseSplitter
¶
Bases: ABC
Abstract base class for all splitter implementations.
This class defines the common interface and utility methods for splitters that
divide text or data into smaller chunks, typically for downstream natural language
processing tasks or information retrieval. Subclasses should implement the `split`
method, which takes a ReaderOutput (typically produced by a document reader) and returns
a structured output with the required chunking.
Attributes:
Name | Type | Description |
---|---|---|
`chunk_size` | `int` | The maximum number of units (e.g., characters, words, etc.) per chunk. |
Methods:
Name | Description |
---|---|
`split` | Abstract method. Should be implemented by all subclasses to perform the actual splitting logic. |
`_generate_chunk_ids` | Generates a list of unique chunk IDs using UUID4, for use in the output. |
`_default_metadata` | Returns a default (empty) metadata dictionary, which can be extended by subclasses. |
Source code in src/splitter_mr/splitter/base_splitter.py
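Because `BaseSplitter` is abstract, a new strategy subclasses it and implements `split`. A minimal sketch, assuming the documented helpers `_generate_chunk_ids` and `_default_metadata`, and a `SplitterOutput` constructor accepting `chunks`, `chunk_id`, and `metadata` (the exact signatures are assumptions):

```python
from splitter_mr.schema import ReaderOutput, SplitterOutput  # assumed import path
from splitter_mr.splitter import BaseSplitter

class LineSplitter(BaseSplitter):
    """Hypothetical splitter: emits one chunk per non-empty line."""

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        chunks = [line for line in reader_output.text.splitlines() if line.strip()]
        return SplitterOutput(
            chunks=chunks,
            chunk_id=self._generate_chunk_ids(len(chunks)),  # assumed to take the chunk count
            metadata=self._default_metadata(),
        )
```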
split(reader_output)
abstractmethod
¶
Abstract method to split input data into chunks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `ReaderOutput` | Input data, typically from a document reader, including the text to split and any relevant metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | A dataclass instance containing split chunks and associated metadata. |
Source code in src/splitter_mr/splitter/base_splitter.py
CharacterSplitter¶
CharacterSplitter
¶
Bases: BaseSplitter
CharacterSplitter splits a given text into overlapping or non-overlapping chunks based on a specified number of characters per chunk.
This splitter is configurable with a maximum chunk size (`chunk_size`) and an overlap
between consecutive chunks (`chunk_overlap`). The overlap can be specified either as
an integer (number of characters) or as a float between 0 and 1 (fraction of chunk size).
This is particularly useful for downstream NLP tasks where context preservation between
chunks is important.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum number of characters per chunk. | `1000` |
`chunk_overlap` | `Union[int, float]` | Number or percentage of overlapping characters between chunks. | `0` |
Source code in src/splitter_mr/splitter/splitters/character_splitter.py
split(reader_output)
¶
Splits the input text from the reader_output dictionary into character-based chunks.
Each chunk contains at most `chunk_size` characters, and adjacent chunks can overlap
by a specified number or percentage of characters, according to the `chunk_overlap`
parameter set at initialization. Returns a dataclass with the same document metadata,
unique chunk identifiers, and the split parameters used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If chunk_overlap is greater than or equal to chunk_size. |
Example
```python
from splitter_mr.splitter import CharacterSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text="abcdefghijklmnopqrstuvwxyz",
    document_name="doc.txt",
    document_path="/path/doc.txt",
)

splitter = CharacterSplitter(chunk_size=5, chunk_overlap=2)
output = splitter.split(reader_output)
print(output.chunks)
# ['abcde', 'defgh', 'ghijk', ..., 'yz']
```
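Since `chunk_overlap` also accepts a float in (0, 1) as a fraction of `chunk_size`, the same split can be written with a fractional overlap (0.4 * 5 characters = the 2-character overlap above):

```python
# 0.4 * chunk_size = 2 overlapping characters, equivalent to chunk_overlap=2 above
splitter = CharacterSplitter(chunk_size=5, chunk_overlap=0.4)
output = splitter.split(reader_output)
print(output.chunks)  # expected to match the integer-overlap output above
```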
Source code in src/splitter_mr/splitter/splitters/character_splitter.py
WordSplitter¶
WordSplitter
¶
Bases: BaseSplitter
WordSplitter splits a given text into overlapping or non-overlapping chunks based on a specified number of words per chunk.
This splitter is configurable with a maximum chunk size (`chunk_size`, in words)
and an overlap between consecutive chunks (`chunk_overlap`). The overlap can be
specified either as an integer (number of words) or as a float between 0 and 1
(fraction of chunk size). Useful for NLP tasks where word-based boundaries are
important for context preservation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum number of words per chunk. | `5` |
`chunk_overlap` | `Union[int, float]` | Number or percentage of overlapping words between chunks. | `0` |
Source code in src/splitter_mr/splitter/splitters/word_splitter.py
split(reader_output)
¶
Splits the input text from the reader_output dictionary into word-based chunks.
Each chunk contains at most `chunk_size` words, and adjacent chunks can overlap
by a specified number or percentage of words, according to the `chunk_overlap`
parameter set at initialization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If chunk_overlap is greater than or equal to chunk_size. |
Example
```python
from splitter_mr.splitter import WordSplitter

reader_output = ReaderOutput(
    text="The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs. Sphinx of black quartz, judge my vow.",
    document_name="pangrams.txt",
    document_path="/https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/pangrams.txt",
)

# Split into chunks of 5 words, overlapping by 2 words
splitter = WordSplitter(chunk_size=5, chunk_overlap=2)
output = splitter.split(reader_output)
print(output.chunks)
# ['The quick brown fox jumps',
#  'fox jumps over the lazy',
#  'over the lazy dog. Pack', ...]
```
Source code in src/splitter_mr/splitter/splitters/word_splitter.py
SentenceSplitter¶
SentenceSplitter
¶
Bases: BaseSplitter
SentenceSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk contains a specified number of sentences, and overlap is defined by a number or percentage of words from the end of the previous chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum number of sentences per chunk. | `5` |
`chunk_overlap` | `Union[int, float]` | Number or percentage of overlapping words between chunks. | `0` |
`separators` | `Union[str, List[str]]` | Character(s) to split sentences. | `['.', '!', '?']` |
Source code in src/splitter_mr/splitter/splitters/sentence_splitter.py
split(reader_output)
¶
Splits the input text from the `reader_output` dictionary into sentence-based chunks,
allowing for overlap at the word level.

Each chunk contains at most `chunk_size` sentences, where sentence boundaries are
detected using the specified `sentence_separators` (e.g., '.', '!', '?').
Overlap between consecutive chunks is specified by `chunk_overlap`, which can be an
integer (number of words) or a float (fraction of the maximum words in a sentence).
This is useful for downstream NLP tasks that require context preservation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata, such as 'document_name', 'document_path', 'document_id', etc. | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If chunk_overlap is greater than or equal to chunk_size. |
`ValueError` | If 'text' is missing in `reader_output`. |
Example
```python
from splitter_mr.splitter import SentenceSplitter

# Example input: 7 sentences with varied punctuation.
# This object has been obtained as an output from a Reader class.
reader_output = ReaderOutput(
    text="Hello world! How are you? I am fine. Testing sentence splitting. Short. End! And another?",
    document_name="sample.txt",
    document_path="/tmp/sample.txt",
    document_id="123",
)

# Split into chunks of 3 sentences each, no overlap
splitter = SentenceSplitter(chunk_size=3, chunk_overlap=0)
result = splitter.split(reader_output)
print(result.chunks)
# ['Hello world! How are you? I am fine.',
#  'Testing sentence splitting. Short. End!',
#  'And another?', ...]
```
Source code in src/splitter_mr/splitter/splitters/sentence_splitter.py
ParagraphSplitter¶
ParagraphSplitter
¶
Bases: BaseSplitter
ParagraphSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk contains a specified number of paragraphs, and overlap is defined by a number or percentage of words from the end of the previous chunk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum number of paragraphs per chunk. | `3` |
`chunk_overlap` | `Union[int, float]` | Number or percentage of overlapping words between chunks. | `0` |
`line_break` | `Union[str, List[str]]` | Character(s) used to split text into paragraphs. | `'\n'` |
Source code in src/splitter_mr/splitter/splitters/paragraph_splitter.py
split(reader_output)
¶
Splits text in `reader_output['text']` into paragraph-based chunks, with optional word overlap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path'). | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If 'text' is missing from `reader_output`. |
Example
```python
from splitter_mr.splitter import ParagraphSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text="Para 1.\n\nPara 2.\n\nPara 3.",
    document_name="test.txt",
    document_path="/tmp/test.txt",
)

splitter = ParagraphSplitter(chunk_size=2, chunk_overlap=1, line_break="\n\n")
output = splitter.split(reader_output)
print(output.chunks)
# ['Para 1.\n\nPara 2.', '2. Para 3.']
```
Source code in src/splitter_mr/splitter/splitters/paragraph_splitter.py
RecursiveCharacterSplitter¶
RecursiveCharacterSplitter
¶
Bases: BaseSplitter
RecursiveCharacterSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk is created by repeatedly breaking down the text until it reaches the desired chunk size. This class wraps the Langchain RecursiveCharacterTextSplitter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Approximate chunk size, in characters. | `1000` |
`chunk_overlap` | `Union[int, float]` | Number or percentage of overlapping characters between chunks. | `0.1` |
`separators` | `Union[str, List[str]]` | Character(s) to recursively split sentences. | `['\n\n', '\n', ' ', '.', ',', '\u200b', ',', '、', '.', '。', '']` |
Notes
More info about the RecursiveCharacterTextSplitter: Langchain Docs.
Source code in src/splitter_mr/splitter/splitters/recursive_splitter.py
split(reader_output)
¶
Splits the input text into character-based chunks using a recursive splitting strategy
(via Langchain's `RecursiveCharacterTextSplitter`), supporting configurable separators,
chunk size, and overlap.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If 'text' is missing in `reader_output`. |
Example
```python
from splitter_mr.splitter import RecursiveCharacterSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text=(
        "This is a long document. "
        "It will be recursively split into smaller chunks using the specified separators. "
        "Each chunk will have some overlap with the next."
    ),
    document_name="sample.txt",
    document_path="/tmp/sample.txt",
)

splitter = RecursiveCharacterSplitter(chunk_size=40, chunk_overlap=5)
output = splitter.split(reader_output)
print(output.chunks)
# ['This is a long document. It will be', 'be recursively split into smaller chunks', ...]
```
Source code in src/splitter_mr/splitter/splitters/recursive_splitter.py
HeaderSplitter¶
HeaderSplitter
¶
Bases: BaseSplitter
Splits an HTML or Markdown document into chunks based on header levels.
This splitter converts a list of semantic header names (e.g., ["Header 1", "Header 2"]) into the correct header tokens for Markdown ("#", "##", ...) or HTML ("h1", "h2", ...), and uses Langchain's splitters under the hood. You can choose whether to group headers with their following content or split on each leaf element.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Kept for compatibility. Defaults to 1000. | `1000` |
`headers_to_split_on` | `Optional[List[str]]` | List of semantic header names such as ["Header 1", "Header 2"]. If None, all levels 1–6 are enabled. | `None` |
`group_header_with_content` | `bool` | If True (default), keeps each header with its following block(s). If False, falls back to line/element splitting. | `True` |
Notes
- Only actual Markdown (`#`) or HTML (`<h1>`–`<h6>`) headings are supported.
- Output is a SplitterOutput dataclass compatible with splitter_mr.
Example
```python
from splitter_mr.splitter import HeaderSplitter

reader_output = ReaderOutput(
    text='<!DOCTYPE html><html><body><h1>Main Title</h1><h2>Section 1</h2><h2>Section 2</h2></body></html>',
    ...
)

splitter = HeaderSplitter(headers_to_split_on=["Header 1", "Header 2"])
output = splitter.split(reader_output)
print(output.chunks)
# ['<!DOCTYPE html><html><body><h1>Main Title</h1>', '<h2>Section 1</h2>', '<h2>Section 2</h2></body></html>']
```
Source code in src/splitter_mr/splitter/splitters/header_splitter.py
__init__(chunk_size=1000, headers_to_split_on=None, *, group_header_with_content=True)
¶
Initializes the HeaderSplitter with header configuration.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Unused, for API compatibility. | `1000` |
`headers_to_split_on` | `Optional[List[str]]` | List of header names (e.g., ["Header 2"]). | `None` |
`group_header_with_content` | `bool` | If True, group header with body. Default True. | `True` |
Source code in src/splitter_mr/splitter/splitters/header_splitter.py
split(reader_output)
¶
Splits the document into chunks using the configured header levels.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `ReaderOutput` | Input object with document text and metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Output dataclass with chunked text and metadata. |
Raises:
Type | Description |
---|---|
`ValueError` | If reader_output.text is empty. |
Source code in src/splitter_mr/splitter/splitters/header_splitter.py
JSONRecursiveSplitter¶
RecursiveJSONSplitter
¶
Bases: BaseSplitter
RecursiveJSONSplitter splits a JSON string or structure into overlapping or non-overlapping chunks, using the Langchain RecursiveJsonSplitter. This splitter is designed to recursively break down JSON data (including nested objects and arrays) into manageable pieces based on keys, arrays, or other separators, until the desired chunk size is reached.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum chunk size, measured in the number of characters per chunk. | `1000` |
`min_chunk_size` | `int` | Minimum chunk size, in characters. | `200` |
Source code in src/splitter_mr/splitter/splitters/json_splitter.py
split(reader_output)
¶
Splits the input JSON text from the reader_output dictionary into recursively chunked pieces, allowing for overlap by number or percentage of characters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.). | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If the 'text' field is missing from reader_output. |
`JSONDecodeError` | If the 'text' field contains invalid JSON. |
Example
```python
from splitter_mr.splitter import RecursiveJSONSplitter

# This object has been obtained from `VanillaReader`
reader_output = ReaderOutput(
    text='{"company": {"name": "TechCorp", "employees": [{"name": "Alice"}, {"name": "Bob"}]}}',
    document_name="company_data.json",
    document_path="/https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/company_data.json",
    document_id="doc123",
    conversion_method="vanilla",
    ocr_method=None,
)

splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
output = splitter.split(reader_output)
print(output.chunks)
# ['{"company": {"name": "TechCorp"}}',
#  '{"employees": [{"name": "Alice"}, {"name": "Bob"}]}']
```
Source code in src/splitter_mr/splitter/splitters/json_splitter.py
HTMLTagSplitter¶
HTMLTagSplitter
¶
Bases: BaseSplitter
HTMLTagSplitter splits HTML content based on a specified tag. If no tag is specified, the most frequent and shallowest tag is detected automatically.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum chunk size, in characters. | `10000` |
`tag` | `str` | Lowest level of the hierarchy at which to split the text. | `None` |
Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
split(reader_output)
¶
Splits HTML in `reader_output['text']` using the specified tag or, if not specified,
automatically selects the most frequent and shallowest tag.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path'). | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If 'text' is missing in `reader_output`. |
Example
```python
from splitter_mr.splitter import HTMLTagSplitter

# This object has been obtained as the output from a Reader object.
reader_output = ReaderOutput(
    text="<html><body><div>Chunk 1</div><div>Chunk 2</div></body></html>",
    document_name="example.html",
    document_path="/path/to/example.html",
)

splitter = HTMLTagSplitter(tag="div")
output = splitter.split(reader_output)
print(output.chunks)
# ['<html><body><div>Chunk 1</div></body></html>',
#  '<html><body><div>Chunk 2</div></body></html>']
```
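If `tag` is omitted, the splitter auto-detects the most frequent and shallowest tag, which for the document above should again be `div` (an expectation based on the description, not a verified run):

```python
# No tag given: the most frequent, shallowest tag (<div> here) should be detected.
auto_splitter = HTMLTagSplitter()
output = auto_splitter.split(reader_output)
print(output.chunks)  # expected to match the explicit tag="div" output above
```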
Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
RowColumnSplitter¶
RowColumnSplitter
¶
Bases: BaseSplitter
RowColumnSplitter splits tabular data (such as CSV, TSV, Markdown tables, or JSON tables) into smaller tables based on rows, columns, or by total character size while preserving row integrity.
This splitter supports several modes:
- By rows: Split the table into chunks with a fixed number of rows, with optional overlapping rows between chunks.
- By columns: Split the table into chunks by columns, with optional overlapping columns between chunks.
- By chunk size: Split the table into markdown-formatted table chunks, where each chunk contains as many complete rows as fit under the specified character limit, optionally overlapping a fixed number of rows between chunks.
This is useful for splitting large tabular files for downstream processing, LLM ingestion, or display, while preserving semantic and structural integrity of the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum number of characters per chunk (when using character-based splitting). | `1000` |
`num_rows` | `int` | Number of rows per chunk. Mutually exclusive with num_cols. | `0` |
`num_cols` | `int` | Number of columns per chunk. Mutually exclusive with num_rows. | `0` |
`chunk_overlap` | `Union[int, float]` | Number of overlapping rows or columns between chunks. If a float in (0,1), interpreted as a percentage of rows or columns. If an integer, the number of overlapping rows/columns. When chunking by character size, this refers to the number of overlapping rows (not characters). | `0` |
Supported formats: CSV, TSV, TXT, Markdown table, JSON (tabular: list of dicts or dict of lists).
Source code in src/splitter_mr/splitter/splitters/row_column_splitter.py
split(reader_output)
¶
Splits the input tabular data into multiple markdown table chunks according to the specified chunking strategy. Each output chunk is a complete markdown table with header, and will never cut a row in half. The overlap is always applied in terms of full rows or columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary output from a Reader, containing at least 'text' (the tabular data as string) and 'conversion_method' (format of the input: 'csv', 'tsv', 'markdown', 'json', etc.), plus optional document metadata fields. | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If both num_rows and num_cols are set. |
`ValueError` | If chunk_overlap as float is not in [0,1). |
`ValueError` | If chunk_size is too small to fit the header and at least one data row. |
Example
```python
from splitter_mr.splitter import RowColumnSplitter

reader_output = ReaderOutput(
    text='| id | name |\n|----|------|\n| 1 | A |\n| 2 | B |\n| 3 | C |',
    conversion_method="markdown",
    document_name="table.md",
    document_path="/path/table.md",
)

splitter = RowColumnSplitter(chunk_size=80, chunk_overlap=20)
output = splitter.split(reader_output)
for chunk in output.chunks:
    print("\n" + str(chunk) + "\n")
# | id   | name   |
# |------|--------|
# | 1    | A      |
# | 2    | B      |
#
# | id   | name   |
# |------|--------|
# | 2    | B      |
# | 3    | C      |
```
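Row-based chunking works analogously; a sketch using `num_rows` with one overlapping row (the expected chunks follow from the description above, not from a verified run):

```python
# Two rows per chunk, overlapping by one row; each chunk keeps the header row.
row_splitter = RowColumnSplitter(num_rows=2, chunk_overlap=1)
output = row_splitter.split(reader_output)
for chunk in output.chunks:
    print(chunk)
# Expected: tables with rows (1, 2) and (2, 3), each with the header repeated.
```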
Source code in src/splitter_mr/splitter/splitters/row_column_splitter.py
CodeSplitter¶
CodeSplitter
¶
Bases: BaseSplitter
CodeSplitter recursively splits source code into programmatically meaningful chunks (functions, classes, methods, etc.) for the given programming language.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum chunk size, in characters. | `1000` |
`language` | `str` | Programming language (e.g., "python", "java", "kotlin", etc.). | `'python'` |
Notes
- Uses Langchain's RecursiveCharacterTextSplitter and its language-aware `from_language` method.
- See Langchain docs: https://python.langchain.com/docs/how_to/code_splitter/
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
split(reader_output)
¶
Splits code in `reader_output['text']` according to the syntax of the specified
programming language, using function/class boundaries where possible.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `ReaderOutput` | Object containing at least a 'text' field, plus optional document metadata. | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`ValueError` | If language is not supported. |
Example
```python
from splitter_mr.splitter import CodeSplitter

reader_output = ReaderOutput(
    text="def foo():\n    pass\n\nclass Bar:\n    def baz(self):\n        pass",
    document_name="example.py",
    document_path="/tmp/example.py",
)

splitter = CodeSplitter(chunk_size=50, language="python")
output = splitter.split(reader_output)
print(output.chunks)
# ['def foo():\n    pass\n', 'class Bar:\n    def baz(self):\n        pass']
```
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
get_langchain_language(lang_str)
¶
Map a string language name to Langchain Language enum. Raises ValueError if not found.
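A quick usage sketch, assuming the helper is importable from the module path shown below:

```python
from splitter_mr.splitter.splitters.code_splitter import get_langchain_language  # assumed path

lang = get_langchain_language("python")  # returns the matching Langchain Language enum member
try:
    get_langchain_language("klingon")    # unsupported language name
except ValueError as err:
    print(err)
```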
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
TokenSplitter¶
TokenSplitter
¶
Bases: BaseSplitter
TokenSplitter splits a given text into chunks based on token counts derived from different tokenization models or libraries.
This splitter supports tokenization via `tiktoken` (OpenAI tokenizer),
`spacy` (spaCy tokenizer), and `nltk` (NLTK tokenizer). It allows splitting
text into chunks of a maximum number of tokens (`chunk_size`), using the
specified tokenizer model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`chunk_size` | `int` | Maximum number of tokens per chunk. | `1000` |
`model_name` | `str` | Specifies the tokenizer and model in the format `tokenizer/model` (e.g., `tiktoken/cl100k_base`, `spacy/en_core_web_sm`, `nltk/punkt`). | `'tiktoken/cl100k_base'` |
`language` | `str` | Language code for the NLTK tokenizer (e.g., `"english"`). | `'english'` |
Notes
More info about token-based splitting methods in Langchain: Langchain Docs.
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
list_nltk_punkt_languages()
staticmethod
¶
Return a sorted list of available punkt models (languages) for NLTK.
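Being a static method, it can be called without instantiating the splitter:

```python
from splitter_mr.splitter import TokenSplitter

# Lists the languages usable with nltk-based splitting (e.g., "english", "spanish")
print(TokenSplitter.list_nltk_punkt_languages())
```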
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
split(reader_output)
¶
Splits the input text from `reader_output` into token-based chunks using
the specified tokenizer.

Depending on `model_name`, the splitter chooses the appropriate tokenizer:

- For `tiktoken`, uses `RecursiveCharacterTextSplitter` with tiktoken encoding, e.g. `tiktoken/cl100k_base`.
- For `spacy`, uses `SpacyTextSplitter` with the specified spaCy pipeline, e.g. `spacy/en_core_web_sm`.
- For `nltk`, uses `NLTKTextSplitter` with the specified language tokenizer, e.g. `nltk/punkt_tab`.
Automatically downloads spaCy and NLTK models if missing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`reader_output` | `Dict[str, Any]` | Dictionary containing at least a 'text' key (str) and optional document metadata, such as 'document_name', 'document_path', 'document_id', etc. | required |
Returns:
Name | Type | Description |
---|---|---|
`SplitterOutput` | `SplitterOutput` | Dataclass defining the output structure for all splitters. |
Raises:
Type | Description |
---|---|
`RuntimeError` | If a spaCy model specified in `model_name` is not available and cannot be downloaded. |
`ValueError` | If an unsupported tokenizer is specified in `model_name`. |
Example
```python
from splitter_mr.splitter import TokenSplitter

reader_output = ReaderOutput(
    text="The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs.",
    document_name="pangrams.txt",
    document_path="/https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/pangrams.txt",
)

splitter = TokenSplitter(chunk_size=10, model_name="tiktoken/gpt-4o")
output = splitter.split(reader_output)
print(output.chunks)
# ['The quick brown fox jumps over the lazy dog.',
#  'Pack my box with five dozen liquor jugs.']
```
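Switching tokenizers only changes `model_name` (plus `language` for NLTK); the model names below come from the format examples given earlier on this page:

```python
# spaCy pipeline as tokenizer
spacy_splitter = TokenSplitter(chunk_size=10, model_name="spacy/en_core_web_sm")

# NLTK tokenizer with an explicit language
nltk_splitter = TokenSplitter(chunk_size=10, model_name="nltk/punkt_tab", language="english")
```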
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
PagedSplitter¶
Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap.
Coming soon!
SemanticSplitter¶
Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings.
Coming soon!