Splitter

Introduction

The Splitter component implements the core functionality of this library. It provides a set of classes (all inheriting from BaseSplitter) that split Markdown text or plain strings according to many different strategies.

Splitter strategies description

Each splitting technique is listed below with its parameters and the formats it is compatible with.

Character Splitter: Splits text into chunks based on a specified number of characters. Supports overlapping by character count or percentage.
Parameters: chunk_size (max chars per chunk), chunk_overlap (overlapping chars: int or %).
Compatible with: Text.

Word Splitter: Splits text into chunks based on a specified number of words. Supports overlapping by word count or percentage.
Parameters: chunk_size (max words per chunk), chunk_overlap (overlapping words: int or %).
Compatible with: Text.

Sentence Splitter: Splits text into chunks by a specified number of sentences. Allows overlap defined by a number or percentage of words from the end of the previous chunk. Customizable sentence separators (e.g., ., !, ?).
Parameters: chunk_size (max sentences per chunk), chunk_overlap (overlapping words: int or %), sentence_separators (list of characters).
Compatible with: Text.

Paragraph Splitter: Splits text into chunks based on a specified number of paragraphs. Allows overlapping by word count or percentage, and customizable line breaks.
Parameters: chunk_size (max paragraphs per chunk), chunk_overlap (overlapping words: int or %), line_break (delimiter(s) for paragraphs).
Compatible with: Text.

Recursive Splitter: Recursively splits text based on a hierarchy of separators (e.g., paragraph, sentence, word, character) until chunks reach a target size. Tries to preserve semantic units as long as possible.
Parameters: chunk_size (max chars per chunk), chunk_overlap (overlapping chars), separators (list of characters to split on, e.g., ["\n\n", "\n", " ", ""]).
Compatible with: Text.

Keyword Splitter: Splits text into chunks around matches of specified keywords, using one or more regex patterns. Supports precise boundary control: matched keywords can be included before, after, on both sides, or omitted from the split. Each keyword can have a custom name (via dict) for metadata counting. Secondary soft-wrapping by chunk_size is supported.
Parameters: patterns (list of regex patterns, or dict mapping names to patterns), include_delimiters ("before", "after", "both", or "none"), flags (regex flags, e.g. re.MULTILINE), chunk_size (max chars per chunk, soft-wrapped).
Compatible with: Text.

Token Splitter: Splits text into chunks based on the number of tokens, using various tokenization models (e.g., tiktoken, spaCy, NLTK). Useful for ensuring chunks are compatible with LLM context limits.
Parameters: chunk_size (max tokens per chunk), model_name (tokenizer/model, e.g., "tiktoken/cl100k_base", "spacy/en_core_web_sm", "nltk/punkt"), language (for NLTK).
Compatible with: Text.

Paged Splitter: Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap.
Parameters: num_pages (pages per chunk), chunk_overlap (overlapping words).
Compatible with: Word, PDF, Excel, PowerPoint.

Row/Column Splitter: For tabular formats, splits data by a set number of rows or columns per chunk, with possible overlap. Row-based and column-based splitting are mutually exclusive.
Parameters: num_rows, num_cols (rows/columns per chunk), overlap (overlapping rows or columns).
Compatible with: Tabular formats (csv, tsv, parquet, flat json).

JSON Splitter: Recursively splits JSON documents into smaller sub-structures that preserve the original JSON schema.
Parameters: max_chunk_size (max chars per chunk), min_chunk_size (min chars per chunk).
Compatible with: JSON.

Semantic Splitter: Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings.
Parameters: embedding_model (model for embeddings), max_tokens (max tokens per chunk).
Compatible with: Text.

HTML Tag Splitter: Splits HTML content based on a specified tag, or automatically detects the most frequent and shallowest tag if not specified. Each chunk is a complete HTML fragment for that tag.
Parameters: chunk_size (max chars per chunk), tag (HTML tag to split on, optional).
Compatible with: HTML.

Header Splitter: Splits Markdown or HTML documents into chunks using header levels (e.g., #, ##, or <h1>, <h2>). Uses configurable headers for chunking.
Parameters: headers_to_split_on (list of headers and semantic names), chunk_size (unused, for compatibility).
Compatible with: Markdown, HTML.

Code Splitter: Splits source code files into programmatically meaningful chunks (functions, classes, methods, etc.), aware of the syntax of the specified programming language (e.g., Python, Java, Kotlin). Uses language-aware logic to avoid splitting inside code blocks.
Parameters: chunk_size (max chars per chunk), language (programming language as string, e.g., "python", "java").
Compatible with: Source code files (Python, Java, Kotlin, C++, JavaScript, Go, etc.).
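
For orientation, here is a minimal usage sketch (assuming a plain-text input and the CharacterSplitter documented later on this page; in a real pipeline the ReaderOutput usually comes from one of the library's readers):

```python
from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter import CharacterSplitter

# Illustrative input; normally produced by a Reader component.
reader_output = ReaderOutput(
    text="Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    document_name="example.txt",
    document_path="/tmp/example.txt",
)

# Fixed-size character chunks with a 20% overlap between consecutive chunks.
splitter = CharacterSplitter(chunk_size=20, chunk_overlap=0.2)
output = splitter.split(reader_output)

for chunk_id, chunk in zip(output.chunk_id, output.chunks):
    print(chunk_id, repr(chunk))
```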

Output format

SplitterOutput

Bases: BaseModel

Pydantic model defining the output structure for all splitters.

Attributes:

- chunks (List[str]): List of text chunks produced by splitting.
- chunk_id (List[str]): List of unique IDs corresponding to each chunk.
- document_name (Optional[str]): The name of the document.
- document_path (str): The path to the document.
- document_id (Optional[str]): A unique identifier for the document.
- conversion_method (Optional[str]): The method used for document conversion.
- reader_method (Optional[str]): The method used for reading the document.
- ocr_method (Optional[str]): The OCR method used, if any.
- split_method (str): The method used to split the document.
- split_params (Optional[Dict[str, Any]]): Parameters used during the splitting process.
- metadata (Optional[Dict[str, Any]]): Additional metadata associated with the splitting.


Source code in src/splitter_mr/schema/models.py
class SplitterOutput(BaseModel):
    """Pydantic model defining the output structure for all splitters.

    Attributes:
        chunks: List of text chunks produced by splitting.
        chunk_id: List of unique IDs corresponding to each chunk.
        document_name: The name of the document.
        document_path: The path to the document.
        document_id: A unique identifier for the document.
        conversion_method: The method used for document conversion.
        reader_method: The method used for reading the document.
        ocr_method: The OCR method used, if any.
        split_method: The method used to split the document.
        split_params: Parameters used during the splitting process.
        metadata: Additional metadata associated with the splitting.
    """

    chunks: List[str] = Field(default_factory=list)
    chunk_id: List[str] = Field(default_factory=list)
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    split_method: str = ""
    split_params: Optional[Dict[str, Any]] = Field(default_factory=dict)
    metadata: Optional[Dict[str, Any]] = Field(default_factory=dict)

    @model_validator(mode="after")
    def validate_and_set_defaults(self):
        """Validates and sets defaults for the SplitterOutput instance.

        Raises:
            ValueError: If `chunks` is empty or if `chunk_id` length does not match `chunks` length.

        Returns:
            self (SplitterOutput): The validated and updated instance.
        """
        if not self.chunks:
            raise ValueError("Chunks list cannot be empty.")

        if self.chunk_id is not None:
            if len(self.chunk_id) != len(self.chunks):
                raise ValueError(
                    f"chunk_id length ({len(self.chunk_id)}) "
                    f"does not match chunks length ({len(self.chunks)})."
                )
        else:
            self.chunk_id = [str(uuid.uuid4()) for _ in self.chunks]

        if not self.document_id:
            self.document_id = str(uuid.uuid4())

        return self

    @classmethod
    def from_chunks(cls, chunks: List[str]) -> "SplitterOutput":
        """Create a SplitterOutput from a list of chunks, with all other
        fields set to their defaults.

        Args:
            chunks (List[str]): A list of text chunks.

        Returns:
            SplitterOutput: An instance of SplitterOutput with the given chunks.
        """
        return cls(chunks=chunks)

    def append_metadata(self, metadata: Dict[str, Any]) -> None:
        """
        Append (update) the metadata dictionary with new key-value pairs.

        Args:
            metadata (Dict[str, Any]): The metadata to add or update.
        """
        if self.metadata is None:
            self.metadata = {}
        self.metadata.update(metadata)
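
As a quick illustration (the values here are made up, not produced by a real reader), the model can be populated directly as long as chunks and chunk_id have matching lengths:

```python
import uuid

from splitter_mr.schema import SplitterOutput

chunks = ["first chunk", "second chunk"]

output = SplitterOutput(
    chunks=chunks,
    chunk_id=[str(uuid.uuid4()) for _ in chunks],  # must match len(chunks)
    document_name="example.txt",
    split_method="character_splitter",
    split_params={"chunk_size": 1000, "chunk_overlap": 0},
)

print(output.document_id)  # a UUID4 is generated automatically when not provided
```
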
append_metadata(metadata)

Append (update) the metadata dictionary with new key-value pairs.

Parameters:

- metadata (Dict[str, Any], required): The metadata to add or update.

Source code in src/splitter_mr/schema/models.py
def append_metadata(self, metadata: Dict[str, Any]) -> None:
    """
    Append (update) the metadata dictionary with new key-value pairs.

    Args:
        metadata (Dict[str, Any]): The metadata to add or update.
    """
    if self.metadata is None:
        self.metadata = {}
    self.metadata.update(metadata)
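
A short usage sketch, reusing the illustrative output object from the example above:

```python
# Merge extra information into the metadata dict; existing keys are overwritten.
output.append_metadata({"language": "en", "source": "unit-test"})
print(output.metadata)  # {'language': 'en', 'source': 'unit-test'}
```
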
from_chunks(chunks) classmethod

Create a SplitterOutput from a list of chunks, with all other fields set to their defaults.

Parameters:

- chunks (List[str], required): A list of text chunks.

Returns:

- SplitterOutput: An instance of SplitterOutput with the given chunks.

Source code in src/splitter_mr/schema/models.py
@classmethod
def from_chunks(cls, chunks: List[str]) -> "SplitterOutput":
    """Create a SplitterOutput from a list of chunks, with all other
    fields set to their defaults.

    Args:
        chunks (List[str]): A list of text chunks.

    Returns:
        SplitterOutput: An instance of SplitterOutput with the given chunks.
    """
    return cls(chunks=chunks)
validate_and_set_defaults()

Validates and sets defaults for the SplitterOutput instance.

Raises:

- ValueError: If chunks is empty or if chunk_id length does not match chunks length.

Returns:

- self (SplitterOutput): The validated and updated instance.

Source code in src/splitter_mr/schema/models.py
@model_validator(mode="after")
def validate_and_set_defaults(self):
    """Validates and sets defaults for the SplitterOutput instance.

    Raises:
        ValueError: If `chunks` is empty or if `chunk_id` length does not match `chunks` length.

    Returns:
        self (SplitterOutput): The validated and updated instance.
    """
    if not self.chunks:
        raise ValueError("Chunks list cannot be empty.")

    if self.chunk_id is not None:
        if len(self.chunk_id) != len(self.chunks):
            raise ValueError(
                f"chunk_id length ({len(self.chunk_id)}) "
                f"does not match chunks length ({len(self.chunks)})."
            )
    else:
        self.chunk_id = [str(uuid.uuid4()) for _ in self.chunks]

    if not self.document_id:
        self.document_id = str(uuid.uuid4())

    return self
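
The two failure modes of this validator can be reproduced with illustrative calls (Pydantic surfaces the ValueError raised inside the validator as a validation error):

```python
from splitter_mr.schema import SplitterOutput

# Empty chunks are rejected.
try:
    SplitterOutput(chunks=[])
except ValueError as exc:
    print(exc)  # message includes "Chunks list cannot be empty."

# A chunk_id list whose length differs from chunks is rejected as well.
try:
    SplitterOutput(chunks=["a", "b"], chunk_id=["only-one-id"])
except ValueError as exc:
    print(exc)  # message includes the mismatching lengths
```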

Splitters

BaseSplitter

BaseSplitter

Bases: ABC

Abstract base class for all splitter implementations.

This class defines the common interface and utility methods for splitters that divide text or data into smaller chunks, typically for downstream natural language processing tasks or information retrieval. Subclasses should implement the split method, which takes a ReaderOutput and returns a SplitterOutput containing the resulting chunks and associated metadata.

Attributes:

- chunk_size (int): The maximum number of units (characters, sentences, rows, etc.) that a derived splitter is allowed to place in a chunk (semantic meaning depends on the subclass).

Methods:

- split: Abstract method. Must be implemented by subclasses to perform the actual domain-specific splitting logic.
- _generate_chunk_ids: Utility to generate a list of unique UUID4-based identifiers for chunk tracking.
- _default_metadata: Returns a default (empty) metadata dictionary. Subclasses may override or extend this to attach extra information to the final SplitterOutput.

Example

Creating a simple custom splitter that breaks text every N characters:

from splitter_mr.schema import ReaderOutput, SplitterOutput
from splitter_mr.splitter.base_splitter import BaseSplitter

class FixedCharSplitter(BaseSplitter):
    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        text = reader_output.text or ""
        chunks = [
            text[i : i + self.chunk_size]
            for i in range(0, len(text), self.chunk_size)
        ]

        chunk_ids = self._generate_chunk_ids(len(chunks))

        return SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="fixed_char_splitter",
            split_params={"chunk_size": self.chunk_size},
            metadata=self._default_metadata(),
        )

ro = ReaderOutput(text="abcdefghijklmnopqrstuvwxyz")
splitter = FixedCharSplitter(chunk_size=5)
out = splitter.split(ro)

print(out.chunks)
['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']

Source code in src/splitter_mr/splitter/base_splitter.py
class BaseSplitter(ABC):
    """
    Abstract base class for all splitter implementations.

    This class defines the common interface and utility methods for splitters that
    divide text or data into smaller chunks, typically for downstream natural language
    processing tasks or information retrieval. Subclasses should implement the `split`
    method, which takes a :class:`ReaderOutput` and returns a :class:`SplitterOutput`
    containing the resulting chunks and associated metadata.

    Attributes:
        chunk_size (int):
            The maximum number of units (characters, sentences, rows, etc.)
            that a derived splitter is allowed to place in a chunk (semantic
            meaning depends on the subclass).

    Methods:
        split(reader_output):
            Abstract method. Must be implemented by subclasses to perform the
            actual domain-specific splitting logic.

        _generate_chunk_ids(num_chunks):
            Utility to generate a list of unique UUID4-based identifiers for
            chunk tracking.

        _default_metadata():
            Returns a default (empty) metadata dictionary. Subclasses may
            override or extend this to attach extra information to the final
            :class:`SplitterOutput`.

    Example:
        **Creating a simple custom splitter** that breaks text every ``N`` characters:

        ```python
        from splitter_mr.schema import ReaderOutput, SplitterOutput
        from splitter_mr.splitter.base_splitter import BaseSplitter

        class FixedCharSplitter(BaseSplitter):
            def split(self, reader_output: ReaderOutput) -> SplitterOutput:
                text = reader_output.text or ""
                chunks = [
                    text[i : i + self.chunk_size]
                    for i in range(0, len(text), self.chunk_size)
                ]

                chunk_ids = self._generate_chunk_ids(len(chunks))

                return SplitterOutput(
                    chunks=chunks,
                    chunk_id=chunk_ids,
                    document_name=reader_output.document_name,
                    document_path=reader_output.document_path,
                    document_id=reader_output.document_id,
                    conversion_method=reader_output.conversion_method,
                    reader_method=reader_output.reader_method,
                    ocr_method=reader_output.ocr_method,
                    split_method="fixed_char_splitter",
                    split_params={"chunk_size": self.chunk_size},
                    metadata=self._default_metadata(),
                )

        ro = ReaderOutput(text="abcdefghijklmnopqrstuvwxyz")
        splitter = FixedCharSplitter(chunk_size=5)
        out = splitter.split(ro)

        print(out.chunks)
        ```
        ```python
        ['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']
        ```
    """

    def __init__(self, chunk_size: int = 1000):
        """
        Initializer method for BaseSplitter classes
        """
        self.chunk_size = chunk_size

    @abstractmethod
    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Abstract method to split input data into chunks.

        Args:
            reader_output (ReaderOutput): Input data, typically from a document reader,
                including the text to split and any relevant metadata.

        Returns:
            SplitterOutput: A dictionary containing split chunks and associated metadata.
        """

    def _generate_chunk_ids(self, num_chunks: int) -> List[str]:
        """
        Generate a list of unique chunk identifiers.

        Args:
            num_chunks (int): Number of chunk IDs to generate.

        Returns:
            List[str]: List of unique string IDs (UUID4).
        """
        return [str(uuid.uuid4()) for _ in range(num_chunks)]

    def _default_metadata(self) -> dict:
        """
        Return a default metadata dictionary.

        Returns:
            dict: An empty dictionary; subclasses may override to provide additional metadata.
        """
        return {}
__init__(chunk_size=1000)

Initializer method for BaseSplitter classes

Source code in src/splitter_mr/splitter/base_splitter.py
def __init__(self, chunk_size: int = 1000):
    """
    Initializer method for BaseSplitter classes
    """
    self.chunk_size = chunk_size
split(reader_output) abstractmethod

Abstract method to split input data into chunks.

Parameters:

- reader_output (ReaderOutput, required): Input data, typically from a document reader, including the text to split and any relevant metadata.

Returns:

- SplitterOutput: A dictionary containing split chunks and associated metadata.

Source code in src/splitter_mr/splitter/base_splitter.py
@abstractmethod
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Abstract method to split input data into chunks.

    Args:
        reader_output (ReaderOutput): Input data, typically from a document reader,
            including the text to split and any relevant metadata.

    Returns:
        SplitterOutput: A dictionary containing split chunks and associated metadata.
    """

CharacterSplitter

CharacterSplitter

Bases: BaseSplitter

Splits textual input into fixed-size character chunks with optional overlap.

The CharacterSplitter is a simple and robust splitter that divides text into overlapping or non-overlapping chunks, based on the specified number of characters per chunk. It is commonly used in document-processing or NLP pipelines where preserving context between chunks is important.

The splitter can be configured to use:
  • chunk_size: maximum number of characters per chunk.
  • chunk_overlap: the number (or fraction) of overlapping characters between consecutive chunks.

Parameters:

- chunk_size (int, default 1000): Maximum number of characters per chunk. Must be >= 1.
- chunk_overlap (Union[int, float], default 0): Number or percentage of overlapping characters between chunks. If float, must be in [0.0, 1.0).

Raises:

- SplitterConfigException: If either chunk_size or chunk_overlap is invalid.
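
A brief sketch of valid and invalid configurations; the invalid ones (shown as comments) raise SplitterConfigException according to the constructor below:

```python
from splitter_mr.splitter import CharacterSplitter

# Valid configurations:
CharacterSplitter(chunk_size=100, chunk_overlap=10)    # absolute overlap, in characters
CharacterSplitter(chunk_size=100, chunk_overlap=0.25)  # fractional overlap (25% of chunk_size)

# Invalid configurations (each raises SplitterConfigException):
# CharacterSplitter(chunk_size=0)                      # chunk_size must be an integer >= 1
# CharacterSplitter(chunk_size=10, chunk_overlap=10)   # int overlap must be smaller than chunk_size
# CharacterSplitter(chunk_size=10, chunk_overlap=1.5)  # float overlap must be in [0.0, 1.0)
```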

Source code in src/splitter_mr/splitter/splitters/character_splitter.py
class CharacterSplitter(BaseSplitter):
    """
    Splits textual input into fixed-size character chunks with optional overlap.

    The ``CharacterSplitter`` is a simple and robust splitter that divides text into
    overlapping or non-overlapping chunks, based on the specified number of characters
    per chunk. It is commonly used in document-processing or NLP pipelines where
    preserving context between chunks is important.

    The splitter can be configured to use:
      - ``chunk_size``: maximum number of characters per chunk.
      - ``chunk_overlap``: the number (or fraction) of overlapping characters
        between consecutive chunks.

    Args:
        chunk_size (int, optional): Maximum number of characters per chunk. Must be >= 1.
        chunk_overlap (Union[int, float], optional): Number or percentage of overlapping
            characters between chunks. If float, must be in [0.0, 1.0).

    Raises:
        SplitterConfigException: If either ``chunk_size`` or ``chunk_overlap`` are invalid.
    """

    def __init__(self, chunk_size: int = 1000, chunk_overlap: int | float = 0):
        if not isinstance(chunk_size, int) or chunk_size < 1:
            raise SplitterConfigException("chunk_size must be an integer >= 1")

        if isinstance(chunk_overlap, int):
            if chunk_overlap < 0:
                raise SplitterConfigException("chunk_overlap (int) must be >= 0")
            if chunk_overlap >= chunk_size:
                raise SplitterConfigException(
                    "chunk_overlap (int) must be smaller than chunk_size"
                )
        elif isinstance(chunk_overlap, float):
            if not (0.0 <= chunk_overlap < 1.0):
                raise SplitterConfigException(
                    "chunk_overlap (float) must be in [0.0, 1.0)"
                )
        else:
            raise SplitterConfigException("chunk_overlap must be int or float")

        super().__init__(chunk_size)
        self.chunk_overlap = chunk_overlap

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Split the provided text into character-based chunks with optional overlap.

        The method iterates through the text and produces fixed-size chunks that can
        optionally overlap. Each chunk is accompanied by automatically generated
        unique identifiers and metadata inherited from the original document.

        Input validity is checked and warnings may be emitted for empty or invalid text.

        Args:
            reader_output (ReaderOutput): A validated input object containing at least
                a ``text`` field and optional document metadata.

        Returns:
            SplitterOutput: Structured splitter output including:
                - ``chunks``: list of text segments.
                - ``chunk_id``: unique identifier per chunk.
                - document metadata.
                - ``split_params`` reflecting the splitter configuration.

        Raises:
            ValueError: If initialization parameters are invalid.
            InvalidChunkException: If chunks cannot be properly created
                (e.g., all empty).
            SplitterOutputException: If the final SplitterOutput cannot be
                validated or built.

        Warnings:
            SplitterInputWarning: If text is empty or cannot be parsed as JSON.

        Example:
            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter import CharacterSplitter

            reader_output = ReaderOutput(
                text="Hello world! This is a test text for splitting.",
                document_name="example.txt",
                document_path="/path/example.txt"
            )
            splitter = CharacterSplitter(chunk_size=10, chunk_overlap=0.2)
            output = splitter.split(reader_output)
            print(output.chunks)
            ```
            ```python
            ['Hello worl', 'world! Thi', 'is is a te', ...]
            ```
        """
        text: str = reader_output.text
        chunk_size: int = self.chunk_size

        self._check_input(reader_output, text)

        overlap: int = self._coerce_overlap(chunk_size)

        chunks: list = []
        start: int = 0
        step: int = max(1, chunk_size - overlap)

        try:
            while start < len(text):
                end = start + chunk_size
                chunks.append(text[start:end])
                start += step

            if len(text) == 0:
                chunks = [""]

            if not isinstance(chunks, list) or len(chunks) == 0:
                raise InvalidChunkException("No chunks were produced.")
            if any(c is None for c in chunks):
                raise InvalidChunkException("A produced chunk is None.")
            if len(text) > 0 and all(c == "" for c in chunks):
                raise InvalidChunkException(
                    "All produced chunks are empty for non-empty text."
                )
        except InvalidChunkException:
            raise
        except Exception as e:
            raise InvalidChunkException(
                f"Unexpected error while building chunks: {e}"
            ) from e

        try:
            chunk_ids: list[str] = self._generate_chunk_ids(len(chunks))
            metadata: dict = self._default_metadata()
            output = SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path or "",
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="character_splitter",
                split_params={
                    "chunk_size": self.chunk_size,
                    "chunk_overlap": self.chunk_overlap,
                },
                metadata=metadata,
            )
            return output
        except Exception as e:
            raise SplitterOutputException(f"Failed to build SplitterOutput: {e}") from e

    # ---- Helpers ---- #

    def _coerce_overlap(self, chunk_size: int) -> int:
        """
        Convert the ``chunk_overlap`` parameter into an absolute
        number of characters.

        Args:
            chunk_size (int): The configured chunk size.

        Returns:
            int: The computed overlap value (in characters).
        """
        if isinstance(self.chunk_overlap, float):
            return int(chunk_size * self.chunk_overlap)
        return int(self.chunk_overlap)

    def _check_input(self, reader_output: ReaderOutput, text: str) -> None:
        """
        Validate and warn about potential input issues.

        This helper method emits warnings instead of raising exceptions
        for the following cases:
          - Empty or whitespace-only text.
          - Declared JSON input (``conversion_method='json'``) that cannot be
            parsed as JSON.

        Args:
            reader_output (ReaderOutput): Input reader output containing text
                and metadata.
            text (str): The textual content to check.

        Warnings:
            SplitterInputWarning: Emitted if text is empty or non-parseable JSON.
        """
        if text.strip() == "":
            warnings.warn(
                SplitterInputWarning(
                    "ReaderOutput.text is empty or whitespace-only. "
                    "Proceeding; this will yield a single empty chunk."
                )
            )

        if (reader_output.conversion_method or "").lower() == "json":
            try:
                json.loads(text or "")
            except Exception:
                warnings.warn(
                    SplitterInputWarning(
                        "Conversion method is 'json' but text is not valid JSON. "
                        "Proceeding as plain text."
                    )
                )
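
To make the overlap arithmetic concrete, here is a small standalone sketch that mirrors what _coerce_overlap and the chunking loop above compute for a fractional overlap:

```python
chunk_size = 10
chunk_overlap = 0.2                          # fraction of chunk_size

overlap = int(chunk_size * chunk_overlap)    # 2 characters, as in _coerce_overlap
step = max(1, chunk_size - overlap)          # 8: each chunk starts 8 characters after the previous one

text = "abcdefghijklmnopqrstuvwxyz"
chunks = [text[i : i + chunk_size] for i in range(0, len(text), step)]
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz', 'yz']
```
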
split(reader_output)

Split the provided text into character-based chunks with optional overlap.

The method iterates through the text and produces fixed-size chunks that can optionally overlap. Each chunk is accompanied by automatically generated unique identifiers and metadata inherited from the original document.

Input validity is checked and warnings may be emitted for empty or invalid text.

Parameters:

- reader_output (ReaderOutput, required): A validated input object containing at least a text field and optional document metadata.

Returns:

- SplitterOutput: Structured splitter output including chunks (list of text segments), chunk_id (unique identifier per chunk), document metadata, and split_params reflecting the splitter configuration.

Raises:

- ValueError: If initialization parameters are invalid.
- InvalidChunkException: If chunks cannot be properly created (e.g., all empty).
- SplitterOutputException: If the final SplitterOutput cannot be validated or built.

Warns:

- SplitterInputWarning: If text is empty or cannot be parsed as JSON.

Example

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter import CharacterSplitter

reader_output = ReaderOutput(
    text="Hello world! This is a test text for splitting.",
    document_name="example.txt",
    document_path="/path/example.txt"
)
splitter = CharacterSplitter(chunk_size=10, chunk_overlap=0.2)
output = splitter.split(reader_output)
print(output.chunks)
['Hello worl', 'world! Thi', 'is is a te', ...]

Source code in src/splitter_mr/splitter/splitters/character_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Split the provided text into character-based chunks with optional overlap.

    The method iterates through the text and produces fixed-size chunks that can
    optionally overlap. Each chunk is accompanied by automatically generated
    unique identifiers and metadata inherited from the original document.

    Input validity is checked and warnings may be emitted for empty or invalid text.

    Args:
        reader_output (ReaderOutput): A validated input object containing at least
            a ``text`` field and optional document metadata.

    Returns:
        SplitterOutput: Structured splitter output including:
            - ``chunks``: list of text segments.
            - ``chunk_id``: unique identifier per chunk.
            - document metadata.
            - ``split_params`` reflecting the splitter configuration.

    Raises:
        ValueError: If initialization parameters are invalid.
        InvalidChunkException: If chunks cannot be properly created
            (e.g., all empty).
        SplitterOutputException: If the final SplitterOutput cannot be
            validated or built.

    Warnings:
        SplitterInputWarning: If text is empty or cannot be parsed as JSON.

    Example:
        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter import CharacterSplitter

        reader_output = ReaderOutput(
            text="Hello world! This is a test text for splitting.",
            document_name="example.txt",
            document_path="/path/example.txt"
        )
        splitter = CharacterSplitter(chunk_size=10, chunk_overlap=0.2)
        output = splitter.split(reader_output)
        print(output.chunks)
        ```
        ```python
        ['Hello worl', 'world! Thi', 'is is a te', ...]
        ```
    """
    text: str = reader_output.text
    chunk_size: int = self.chunk_size

    self._check_input(reader_output, text)

    overlap: int = self._coerce_overlap(chunk_size)

    chunks: list = []
    start: int = 0
    step: int = max(1, chunk_size - overlap)

    try:
        while start < len(text):
            end = start + chunk_size
            chunks.append(text[start:end])
            start += step

        if len(text) == 0:
            chunks = [""]

        if not isinstance(chunks, list) or len(chunks) == 0:
            raise InvalidChunkException("No chunks were produced.")
        if any(c is None for c in chunks):
            raise InvalidChunkException("A produced chunk is None.")
        if len(text) > 0 and all(c == "" for c in chunks):
            raise InvalidChunkException(
                "All produced chunks are empty for non-empty text."
            )
    except InvalidChunkException:
        raise
    except Exception as e:
        raise InvalidChunkException(
            f"Unexpected error while building chunks: {e}"
        ) from e

    try:
        chunk_ids: list[str] = self._generate_chunk_ids(len(chunks))
        metadata: dict = self._default_metadata()
        output = SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path or "",
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="character_splitter",
            split_params={
                "chunk_size": self.chunk_size,
                "chunk_overlap": self.chunk_overlap,
            },
            metadata=metadata,
        )
        return output
    except Exception as e:
        raise SplitterOutputException(f"Failed to build SplitterOutput: {e}") from e

WordSplitter

WordSplitter

Bases: BaseSplitter

Split text into overlapping or non-overlapping word-based chunks.

This splitter is configurable with a maximum chunk size (chunk_size in words) and an overlap between consecutive chunks (chunk_overlap). The overlap can be specified either as an integer (number of words) or as a float between 0 and 1 (fraction of chunk size). It is useful for NLP tasks where word-based boundaries are important for context preservation.

Parameters:

- chunk_size (int, default 5): Maximum number of words per chunk. Must be a positive integer.
- chunk_overlap (Union[int, float], default 0): Number or percentage of overlapping words between chunks. If a float is provided, it must satisfy 0 <= chunk_overlap < 1.

Raises:

- SplitterConfigException: If chunk_size is not positive, if chunk_overlap is invalid (negative, too large, or wrong type), or if it is greater than or equal to chunk_size.
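
As a worked example of the overlap handling (mirroring _compute_overlap in the source below), a fractional chunk_overlap is converted into a whole number of words:

```python
chunk_size = 5
chunk_overlap = 0.4                         # fraction of chunk_size

overlap = int(chunk_size * chunk_overlap)   # 2 words
step = chunk_size - overlap                 # 3: each new chunk starts 3 words after the previous one
print(overlap, step)                        # 2 3
```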

Source code in src/splitter_mr/splitter/splitters/word_splitter.py
class WordSplitter(BaseSplitter):
    """Split text into overlapping or non-overlapping word-based chunks.

    This splitter is configurable with a maximum chunk size (``chunk_size`` in
    words) and an overlap between consecutive chunks (``chunk_overlap``). The
    overlap can be specified either as an integer (number of words) or as a
    float between 0 and 1 (fraction of chunk size). It is useful for NLP tasks
    where word-based boundaries are important for context preservation.

    Args:
        chunk_size: Maximum number of words per chunk. Must be a positive
            integer.
        chunk_overlap: Number or percentage of overlapping words between
            chunks. If a float is provided, it must satisfy
            ``0 <= chunk_overlap < 1``.

    Raises:
        SplitterConfigException: If ``chunk_size`` is not positive, if
            ``chunk_overlap`` is invalid (negative, too large, or wrong type),
            or if it is greater than or equal to ``chunk_size``.
    """

    def __init__(self, chunk_size: int = 5, chunk_overlap: Union[int, float] = 0):
        if chunk_size <= 0:
            raise SplitterConfigException(
                f"chunk_size must be a positive integer, got {chunk_size!r}."
            )

        if not isinstance(chunk_overlap, (int, float)):
            raise SplitterConfigException(
                "chunk_overlap must be an int or float, "
                f"got {type(chunk_overlap).__name__!r}."
            )

        # Validate float overlap range (as fraction)
        if isinstance(chunk_overlap, float) and not (0 <= chunk_overlap < 1):
            raise SplitterConfigException(
                "When chunk_overlap is a float, it must be between 0 and 1."
            )

        # Validate integer overlap is not negative
        if isinstance(chunk_overlap, int) and chunk_overlap < 0:
            raise SplitterConfigException(
                "chunk_overlap cannot be negative when provided as an integer."
            )

        super().__init__(chunk_size)
        self.chunk_overlap = chunk_overlap

    def _compute_overlap(self) -> int:
        """Compute overlap in words from ``self.chunk_overlap`` and ``chunk_size``.

        Returns:
            The overlap as a non-negative integer number of words.

        Raises:
            SplitterConfigException: If the resulting overlap is invalid or
                greater than or equal to ``chunk_size``.
        """
        chunk_size = self.chunk_size

        if isinstance(self.chunk_overlap, float):
            # At this point we already know 0 <= chunk_overlap < 1
            overlap = int(chunk_size * self.chunk_overlap)
        else:
            overlap = int(self.chunk_overlap)

        if overlap < 0:
            # Defensive; should be caught earlier, but keep for safety
            raise SplitterConfigException("chunk_overlap cannot be negative.")

        if overlap >= chunk_size:
            raise SplitterConfigException(
                "chunk_overlap must be smaller than chunk_size."
            )

        return overlap

    # ---- Main logic ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """Split the input text into word-based chunks.

        The splitter uses simple whitespace tokenization and supports either
        integer or fractional overlap between consecutive chunks.

        Args:
            reader_output: Input text and associated metadata.

        Returns:
            A ``SplitterOutput`` instance containing:

            * ``chunks``: List of word-based chunks.
            * ``chunk_id``: Corresponding unique identifiers for each chunk.
            * Document metadata and splitter configuration parameters.

        Raises:
            SplitterConfigException:
                If the configuration is invalid (for example, overlap is too
                large).
            InvalidChunkException:
                If the internal chunk ID generation does not match the number
                of produced chunks.

        Warns:
            SplitterInputWarning:
                If the input text is empty or whitespace-only.
            ChunkUnderflowWarning:
                If no chunks are produced (for example, due to empty input or
                aggressive filtering).

        Example:
            ```python
            from splitter_mr.splitter import WordSplitter

            reader_output = ReaderOutput(
                text: "My Wonderful Family\\nI live in a house near the mountains.I have two brothers and one sister, and I was born last...",
                document_name: "my_wonderful_family.txt",
                document_path: "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt",
            )

            # Split into chunks of 5 words, overlapping by 2 words
            splitter = WordSplitter(chunk_size=5, chunk_overlap=2)
            output = splitter.split(reader_output)
            print(output["chunks"])
            ```
            ```python
            ['My Wonderful Family\\nI live','I live in a house near','house near the mountains.I', ...]
            ```
        """
        # Initialize variables
        text = reader_output.text or ""
        chunk_size = self.chunk_size

        if not text.strip():
            warnings.warn(
                "WordSplitter received empty or whitespace-only text; "
                "no chunks will be produced.",
                SplitterInputWarning,
            )

        # Split text into words (using simple whitespace tokenization)
        words = text.split()
        total_words = len(words)

        # Determine overlap in words
        overlap = self._compute_overlap()
        step = chunk_size - overlap
        if step <= 0:
            raise SplitterConfigException(
                "Invalid step size computed for WordSplitter; "
                "check chunk_size and chunk_overlap configuration."
            )

        # Split into chunks
        chunks: list[str] = []
        start = 0
        while start < total_words:
            end = start + chunk_size
            chunk_words = words[start:end]
            if not chunk_words:
                break
            chunks.append(" ".join(chunk_words))
            start += step

        if not chunks:
            warnings.warn(
                "WordSplitter produced no chunks for the given input.",
                ChunkUnderflowWarning,
            )

        # Generate chunk_id and append metadata
        chunk_ids = self._generate_chunk_ids(len(chunks))
        if len(chunk_ids) != len(chunks):
            raise InvalidChunkException(
                "Chunk ID generation mismatch: number of chunk_ids does not "
                "match number of chunks."
            )

        metadata = self._default_metadata()

        # Return output (SplitterOutput may still validate and reject empty chunks)
        output = SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="word_splitter",
            split_params={
                "chunk_size": chunk_size,
                "chunk_overlap": self.chunk_overlap,
            },
            metadata=metadata,
        )
        return output
split(reader_output)

Split the input text into word-based chunks.

The splitter uses simple whitespace tokenization and supports either integer or fractional overlap between consecutive chunks.

Parameters:

- reader_output (ReaderOutput, required): Input text and associated metadata.

Returns:

- SplitterOutput: A SplitterOutput instance containing chunks (list of word-based chunks), chunk_id (corresponding unique identifiers for each chunk), and document metadata along with the splitter configuration parameters.

Raises:

- SplitterConfigException: If the configuration is invalid (for example, overlap is too large).
- InvalidChunkException: If the internal chunk ID generation does not match the number of produced chunks.

Warns:

- SplitterInputWarning: If the input text is empty or whitespace-only.
- ChunkUnderflowWarning: If no chunks are produced (for example, due to empty input or aggressive filtering).

Example

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter import WordSplitter

reader_output = ReaderOutput(
    text="My Wonderful Family\nI live in a house near the mountains.I have two brothers and one sister, and I was born last...",
    document_name="my_wonderful_family.txt",
    document_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt",
)

# Split into chunks of 5 words, overlapping by 2 words
splitter = WordSplitter(chunk_size=5, chunk_overlap=2)
output = splitter.split(reader_output)
print(output.chunks)
['My Wonderful Family\nI live','I live in a house near','house near the mountains.I', ...]

Source code in src/splitter_mr/splitter/splitters/word_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """Split the input text into word-based chunks.

    The splitter uses simple whitespace tokenization and supports either
    integer or fractional overlap between consecutive chunks.

    Args:
        reader_output: Input text and associated metadata.

    Returns:
        A ``SplitterOutput`` instance containing:

        * ``chunks``: List of word-based chunks.
        * ``chunk_id``: Corresponding unique identifiers for each chunk.
        * Document metadata and splitter configuration parameters.

    Raises:
        SplitterConfigException:
            If the configuration is invalid (for example, overlap is too
            large).
        InvalidChunkException:
            If the internal chunk ID generation does not match the number
            of produced chunks.

    Warns:
        SplitterInputWarning:
            If the input text is empty or whitespace-only.
        ChunkUnderflowWarning:
            If no chunks are produced (for example, due to empty input or
            aggressive filtering).

    Example:
        ```python
        from splitter_mr.splitter import WordSplitter

        reader_output = ReaderOutput(
            text: "My Wonderful Family\\nI live in a house near the mountains.I have two brothers and one sister, and I was born last...",
            document_name: "my_wonderful_family.txt",
            document_path: "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt",
        )

        # Split into chunks of 5 words, overlapping by 2 words
        splitter = WordSplitter(chunk_size=5, chunk_overlap=2)
        output = splitter.split(reader_output)
        print(output["chunks"])
        ```
        ```python
        ['My Wonderful Family\\nI live','I live in a house near','house near the mountains.I', ...]
        ```
    """
    # Initialize variables
    text = reader_output.text or ""
    chunk_size = self.chunk_size

    if not text.strip():
        warnings.warn(
            "WordSplitter received empty or whitespace-only text; "
            "no chunks will be produced.",
            SplitterInputWarning,
        )

    # Split text into words (using simple whitespace tokenization)
    words = text.split()
    total_words = len(words)

    # Determine overlap in words
    overlap = self._compute_overlap()
    step = chunk_size - overlap
    if step <= 0:
        raise SplitterConfigException(
            "Invalid step size computed for WordSplitter; "
            "check chunk_size and chunk_overlap configuration."
        )

    # Split into chunks
    chunks: list[str] = []
    start = 0
    while start < total_words:
        end = start + chunk_size
        chunk_words = words[start:end]
        if not chunk_words:
            break
        chunks.append(" ".join(chunk_words))
        start += step

    if not chunks:
        warnings.warn(
            "WordSplitter produced no chunks for the given input.",
            ChunkUnderflowWarning,
        )

    # Generate chunk_id and append metadata
    chunk_ids = self._generate_chunk_ids(len(chunks))
    if len(chunk_ids) != len(chunks):
        raise InvalidChunkException(
            "Chunk ID generation mismatch: number of chunk_ids does not "
            "match number of chunks."
        )

    metadata = self._default_metadata()

    # Return output (SplitterOutput may still validate and reject empty chunks)
    output = SplitterOutput(
        chunks=chunks,
        chunk_id=chunk_ids,
        document_name=reader_output.document_name,
        document_path=reader_output.document_path,
        document_id=reader_output.document_id,
        conversion_method=reader_output.conversion_method,
        reader_method=reader_output.reader_method,
        ocr_method=reader_output.ocr_method,
        split_method="word_splitter",
        split_params={
            "chunk_size": chunk_size,
            "chunk_overlap": self.chunk_overlap,
        },
        metadata=metadata,
    )
    return output

SentenceSplitter

SentenceSplitter

Bases: BaseSplitter

SentenceSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk contains a specified number of sentences, and overlap is defined by a number or percentage of words from the end of the previous chunk.

Parameters:

- chunk_size (int, default 5): Maximum number of sentences per chunk.
- chunk_overlap (Union[int, float], default 0): Number or percentage of overlapping words between chunks. If a float in [0, 1), it is treated as a fraction of the maximum sentence length (in words); otherwise it is interpreted as an absolute number of words.
- separators (Union[str, List[str]], default DEFAULT_SENTENCE_SEPARATORS): Sentence boundary separators. If a list, it is normalised into a regex pattern (legacy path). If a string, it is treated as a full regex pattern.

Raises:

- SplitterConfigException: If chunk_size is less than 1 or not an int, chunk_overlap is negative or not numeric, or separators is neither a non-empty string nor a list of non-empty strings.
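
A configuration-only sketch showing both separator styles the constructor accepts (the resulting chunk boundaries depend on the input text, so no split call is shown here):

```python
from splitter_mr.splitter import SentenceSplitter

# Default separators: chunks of 3 sentences, overlapping by 2 words.
default_splitter = SentenceSplitter(chunk_size=3, chunk_overlap=2)

# Legacy list form: the list is normalised into a regex alternation internally.
list_splitter = SentenceSplitter(chunk_size=2, separators=[".", "!", "?"])

# Recommended form: a regex pattern describing sentence boundaries.
regex_splitter = SentenceSplitter(chunk_size=2, separators=r"[.!?]\s+")
```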

Source code in src/splitter_mr/splitter/splitters/sentence_splitter.py
class SentenceSplitter(BaseSplitter):
    """
    SentenceSplitter splits a given text into overlapping or non-overlapping chunks,
    where each chunk contains a specified number of sentences, and overlap is defined
    by a number or percentage of words from the end of the previous chunk.

    Args:
        chunk_size (int): Maximum number of sentences per chunk.
        chunk_overlap (Union[int, float]): Number or percentage of overlapping words
            between chunks. If a float in [0, 1), it is treated as a fraction of the
            maximum sentence length (in words); otherwise it is interpreted as an
            absolute number of words.
        separators (Union[str, List[str]]): Sentence boundary separators. If a list,
            it is normalised into a regex pattern (legacy path). If a string, it is
            treated as a full regex pattern.

    Raises:
        SplitterConfigException:
            If ``chunk_size`` is less than 1 or not an int, ``chunk_overlap`` is
            negative or not numeric, or ``separators`` is neither a non-empty string
            nor a list of non-empty strings.
    """

    def __init__(
        self,
        chunk_size: int = 5,
        chunk_overlap: Union[int, float] = 0,
        separators: Union[str, List[str]] = DEFAULT_SENTENCE_SEPARATORS,
    ):
        # ---- Config validation ---- #
        if chunk_size < 1 or not isinstance(chunk_size, int):
            raise SplitterConfigException(
                "chunk_size must be a positive integer greater than or equal to 1"
            )

        if not isinstance(chunk_overlap, (int, float)) or chunk_overlap < 0:
            raise SplitterConfigException(
                "chunk_overlap must be a non-negative int or float"
            )

        # Normalise and validate separators
        if isinstance(separators, list):
            if not separators or any(
                not isinstance(s, str) or not s for s in separators
            ):
                raise SplitterConfigException(
                    "separators list must contain at least one non-empty string"
                )
            # Legacy path (NOT recommended): join list with alternation, ensure "..." before "."
            parts = sorted({*separators}, key=lambda s: (s != "...", s))
            sep_pattern = "|".join(re.escape(s) for s in parts)
            # Attach trailing quotes/brackets if user insisted on a list
            separators_pattern = rf'(?:{sep_pattern})(?:["”’\'\)\]\}}»]*)\s*'
        elif isinstance(separators, str):
            if not separators:
                raise SplitterConfigException(
                    "separators must be a non-empty regex/string or a list of strings"
                )
            # Recommended path: already a full regex pattern
            separators_pattern = separators
        else:
            raise SplitterConfigException(
                "separators must be a string or a list of strings"
            )

        super().__init__(chunk_size)
        self.chunk_overlap = chunk_overlap
        self.separators = separators_pattern
        self._sep_re = re.compile(f"({self.separators})")

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Splits the input text from ``reader_output`` into sentence-based chunks,
        allowing for overlap at the word level.

        Pipeline:

        1. Validate and normalise ``reader_output.text``.
        2. Split into sentences.
        3. Compute word overlap.
        4. Build chunks (with overlap).
        5. Build the final :class:`SplitterOutput`.

        Args:
            reader_output (ReaderOutput): Dataclass containing at least a ``text``
                attribute (str or None) and optional document metadata.

        Returns:
            SplitterOutput: Dataclass defining the output structure for all splitters.

        Raises:
            ReaderOutputException:
                If ``reader_output.text`` is missing or not ``str``/``None``.
            InvalidChunkException:
                If the number of generated chunk IDs does not match the number of chunks.
            SplitterOutputException:
                If constructing :class:`SplitterOutput` fails unexpectedly.

        Warnings:
            SplitterInputWarning:
                When the input text is empty or whitespace-only.
            SplitterOutputWarning:
                When no non-empty sentences are found, causing the splitter to fall
                back to a single empty chunk.
            ChunkUnderflowWarning:
                When fewer chunks than ``chunk_size`` are produced because the input
                has too few sentences.

        Example:
            ```python
            from splitter_mr.splitter import SentenceSplitter

            reader_output = ReaderOutput(
                text="My Wonderful Family\\nI live in a house near the mountains.I have two brothers and one sister, and I was born last...",
                document_name="my_wonderful_family.txt",
                document_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt",
            )

            # Split into chunks of 2 sentences, no overlapping
            splitter = SentenceSplitter(chunk_size=2, chunk_overlap=0)
            output = splitter.split(reader_output)
            print(output.chunks)
            ```
            ```python
            ['My Wonderful Family. I live in a house near the mountains.', 'I have two brothers and one sister, and I was born last...', ...]
            ```
        """
        text = self._validate_reader_output(reader_output)
        sentences = self._split_into_sentences(text)
        overlap = self._compute_overlap(sentences)
        chunks = self._build_chunks(sentences, overlap)
        return self._build_output(reader_output, chunks)

    # ---- Internal helpers ---- #

    def _validate_reader_output(self, reader_output: ReaderOutput) -> str:
        """
        Validate and normalise ReaderOutput.text.

        Raises:
            ReaderOutputException: On missing or invalid text.
        """
        if not hasattr(reader_output, "text"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'text' attribute."
            )

        text = reader_output.text
        if text is None:
            text = ""
        elif not isinstance(text, str):
            raise ReaderOutputException(
                f"ReaderOutput.text must be of type 'str' or None, got "
                f"{type(text).__name__!r}"
            )

        if not text.strip():
            warnings.warn(
                "SentenceSplitter received empty or whitespace-only text; "
                "resulting chunks will be empty.",
                SplitterInputWarning,
                stacklevel=3,
            )

        return text

    def _split_into_sentences(self, text: str) -> List[str]:
        """
        Split the input text into normalised sentences.

        Warnings:
            SplitterOutputWarning:
                When no non-empty sentences are found and the splitter falls back
                to a single empty sentence.
        """
        if not text.strip():
            # Already warned in _validate_reader_output; just normalise.
            sentences = []
        else:
            parts = self._sep_re.split(text)  # [text, sep, text, sep, ...]
            sentences: List[str] = []
            i = 0
            while i < len(parts):
                segment = (parts[i] or "").strip()
                if i + 1 < len(parts):
                    # we have a separator that belongs to this sentence
                    sep = parts[i + 1] or ""
                    sentence = (segment + sep).strip()
                    if sentence:
                        sentences.append(sentence)
                    i += 2
                else:
                    # tail without terminator
                    if segment:
                        sentences.append(segment)
                    i += 1

        # Fallback when no sentences were found
        sentences = [s for s in sentences if s.strip()]
        if not sentences:
            warnings.warn(
                "SentenceSplitter did not find any non-empty sentences; "
                "returning a single empty chunk.",
                SplitterOutputWarning,
                stacklevel=3,
            )
            sentences = [""]

        return sentences

    def _compute_overlap(self, sentences: List[str]) -> int:
        """
        Compute the number of overlapping words between chunks based on the
        configured ``chunk_overlap``.
        """
        if isinstance(self.chunk_overlap, float) and 0 <= self.chunk_overlap < 1:
            max_sent_words = max((len(s.split()) for s in sentences), default=0)
            return int(max_sent_words * self.chunk_overlap)
        return int(self.chunk_overlap)

    def _build_chunks(self, sentences: List[str], overlap: int) -> List[str]:
        """
        Build sentence-based chunks, applying word overlap from the previous chunk
        when requested.

        Warnings:
            SplitterOutputWarning: If splitter produces empty chunks.
            ChunkUnderflowWarning: If fewer chunks than ``chunk_size`` are produced
                because the input has too few sentences.
        """
        chunks: List[str] = []
        num_sentences = len(sentences)
        start = 0

        while start < num_sentences:
            end = min(start + self.chunk_size, num_sentences)
            chunk_sents = sentences[start:end]
            chunk_text = " ".join(chunk_sents)

            if overlap > 0 and chunks:
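                # Reuse the last `overlap` words of the previous chunk as a prefix of this one.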
                prev_words = chunks[-1].split()
                overlap_words = (
                    prev_words[-overlap:] if overlap <= len(prev_words) else prev_words
                )
                chunk_text = " ".join([" ".join(overlap_words), chunk_text]).strip()

            chunks.append(chunk_text)
            start += self.chunk_size

        if not chunks:
            chunks = [""]
            warnings.warn(
                "SentenceSplitter did not produce any chunks; "
                "returning a single empty chunk.",
                SplitterOutputWarning,
                stacklevel=3,
            )
            return chunks

        if len(chunks) < self.chunk_size:
            warnings.warn(
                f"SentenceSplitter produced fewer chunks ({len(chunks)}) than "
                f"the configured chunk_size ({self.chunk_size}) because the input "
                f"contains only {num_sentences} sentence(s).",
                ChunkUnderflowWarning,
                stacklevel=3,
            )

        return chunks

    def _build_output(
        self, reader_output: ReaderOutput, chunks: List[str]
    ) -> SplitterOutput:
        """
        Assemble and return the final SplitterOutput.

        Raises:
            InvalidChunkException: If #chunk_ids != #chunks.
            SplitterOutputException: If SplitterOutput construction fails.
        """
        chunk_ids = self._generate_chunk_ids(len(chunks))
        if len(chunk_ids) != len(chunks):
            raise InvalidChunkException(
                "Number of chunk IDs does not match number of chunks "
                f"(chunk_ids={len(chunk_ids)}, chunks={len(chunks)})."
            )

        metadata = self._default_metadata()

        try:
            return SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="sentence_splitter",
                split_params={
                    "chunk_size": self.chunk_size,
                    "chunk_overlap": self.chunk_overlap,
                    "separators": self.separators,
                },
                metadata=metadata,
            )
        except Exception as exc:
            raise SplitterOutputException(
                f"Failed to build SplitterOutput in SentenceSplitter: {exc}"
            ) from exc
split(reader_output)

Splits the input text from reader_output into sentence-based chunks, allowing for overlap at the word level.

Pipeline:

  1. Validate and normalise reader_output.text.
  2. Split into sentences.
  3. Compute word overlap.
  4. Build chunks (with overlap).
  5. Build the final SplitterOutput.

Parameters:

  reader_output (ReaderOutput, required): Dataclass containing at least a text attribute (str or None) and optional document metadata.

Returns:

  SplitterOutput: Dataclass defining the output structure for all splitters.

Raises:

  ReaderOutputException: If reader_output.text is missing or not str/None.
  InvalidChunkException: If the number of generated chunk IDs does not match the number of chunks.
  SplitterOutputException: If constructing SplitterOutput fails unexpectedly.

Warns:

  SplitterInputWarning: When the input text is empty or whitespace-only.
  SplitterOutputWarning: When no non-empty sentences are found, causing the splitter to fall back to a single empty chunk.
  ChunkUnderflowWarning: When fewer chunks than chunk_size are produced because the input has too few sentences.

Example

from splitter_mr.splitter import SentenceSplitter

reader_output = ReaderOutput(
    text="My Wonderful Family\nI live in a house near the mountains.I have two brothers and one sister, and I was born last...",
    document_name="my_wonderful_family.txt",
    document_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt",
)

# Split into chunks of 2 sentences, no overlapping
splitter = SentenceSplitter(chunk_size=2, chunk_overlap=0)
output = splitter.split(reader_output)
print(output.chunks)
['My Wonderful Family. I live in a house near the mountains.', 'I have two brothers and one sister, and I was born last...', ...]
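
The example above relies on the default separators. The constructor also accepts either a list of literal separators (the legacy path, which is escaped and joined into an alternation internally) or a full regex pattern string (the recommended path). A minimal sketch reusing the reader_output from above:

```python
from splitter_mr.splitter import SentenceSplitter

# Legacy path: a list of literal sentence terminators.
list_splitter = SentenceSplitter(chunk_size=2, separators=[".", "!", "?"])

# Recommended path: a full regex pattern, here "one or more terminators
# followed by optional whitespace".
regex_splitter = SentenceSplitter(chunk_size=2, separators=r"[.!?]+\s*")

output = regex_splitter.split(reader_output)
print(output.chunks)
```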

Source code in src/splitter_mr/splitter/splitters/sentence_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Splits the input text from ``reader_output`` into sentence-based chunks,
    allowing for overlap at the word level.

    Pipeline:

    1. Validate and normalise ``reader_output.text``.
    2. Split into sentences.
    3. Compute word overlap.
    4. Build chunks (with overlap).
    5. Build the final :class:`SplitterOutput`.

    Args:
        reader_output (ReaderOutput): Dataclass containing at least a ``text``
            attribute (str or None) and optional document metadata.

    Returns:
        SplitterOutput: Dataclass defining the output structure for all splitters.

    Raises:
        ReaderOutputException:
            If ``reader_output.text`` is missing or not ``str``/``None``.
        InvalidChunkException:
            If the number of generated chunk IDs does not match the number of chunks.
        SplitterOutputException:
            If constructing :class:`SplitterOutput` fails unexpectedly.

    Warnings:
        SplitterInputWarning:
            When the input text is empty or whitespace-only.
        SplitterOutputWarning:
            When no non-empty sentences are found, causing the splitter to fall
            back to a single empty chunk.
        ChunkUnderflowWarning:
            When fewer chunks than ``chunk_size`` are produced because the input
            has too few sentences.

    Example:
        ```python
        from splitter_mr.splitter import SentenceSplitter

        reader_output = ReaderOutput(
            text="My Wonderful Family\\nI live in a house near the mountains.I have two brothers and one sister, and I was born last...",
            document_name="my_wonderful_family.txt",
            document_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt",
        )

        # Split into chunks of 2 sentences, no overlapping
        splitter = SentenceSplitter(chunk_size=2, chunk_overlap=0)
        output = splitter.split(reader_output)
        print(output.chunks)
        ```
        ```python
        ['My Wonderful Family. I live in a house near the mountains.', 'I have two brothers and one sister, and I was born last...', ...]
        ```
    """
    text = self._validate_reader_output(reader_output)
    sentences = self._split_into_sentences(text)
    overlap = self._compute_overlap(sentences)
    chunks = self._build_chunks(sentences, overlap)
    return self._build_output(reader_output, chunks)

ParagraphSplitter

ParagraphSplitter

Bases: BaseSplitter

ParagraphSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk contains a specified number of paragraphs, and overlap is defined by a number or percentage of words from the end of the previous chunk.

Parameters:

  chunk_size (int, default 3): Maximum number of paragraphs per chunk.
  chunk_overlap (Union[int, float], default 0): Number or percentage of overlapping words between chunks. If a float in [0, 1), it is treated as a fraction of the maximum paragraph length (in words); otherwise it is interpreted as an absolute number of words.
  line_break (Union[str, List[str]], default DEFAULT_PARAGRAPH_SEPARATORS): Character(s) used to split text into paragraphs. A single string or a list of strings.

Raises:

  SplitterConfigException: If chunk_size is less than 1, chunk_overlap is negative or not numeric, or line_break is neither a non-empty string nor a list of non-empty strings.
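
When chunk_overlap is given as a float in [0, 1), the effective word overlap is derived from the longest paragraph rather than from chunk_size. A minimal sketch of that computation, using assumed example paragraphs (it mirrors the _compute_overlap helper shown in the source below):

```python
# Assumed example paragraphs; chunk_overlap=0.2 means "20% of the longest paragraph".
paragraphs = [
    "one two three four five six seven eight nine ten",  # 10 words (longest)
    "a short paragraph",                                  # 3 words
]
chunk_overlap = 0.2

max_para_words = max(len(p.split()) for p in paragraphs)  # 10
overlap_words = int(max_para_words * chunk_overlap)       # int(10 * 0.2) == 2
print(overlap_words)  # 2 words reused from the end of the previous chunk
```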

Source code in src/splitter_mr/splitter/splitters/paragraph_splitter.py
class ParagraphSplitter(BaseSplitter):
    """
    ParagraphSplitter splits a given text into overlapping or non-overlapping chunks,
    where each chunk contains a specified number of paragraphs, and overlap is defined
    by a number or percentage of words from the end of the previous chunk.

    Args:
        chunk_size (int): Maximum number of paragraphs per chunk.
        chunk_overlap (Union[int, float]): Number or percentage of overlapping words
            between chunks. If a float in [0, 1), it is treated as a fraction of the
            maximum paragraph length (in words); otherwise it is interpreted as an
            absolute number of words.
        line_break (Union[str, List[str]]): Character(s) used to split text into
            paragraphs. A single string or a list of strings.

    Raises:
        SplitterConfigException:
            If ``chunk_size`` is less than 1, ``chunk_overlap`` is negative or not
            numeric, or ``line_break`` is neither a non-empty string nor a list of
            non-empty strings.
    """

    def __init__(
        self,
        chunk_size: int = 3,
        chunk_overlap: Union[int, float] = 0,
        line_break: Union[str, List[str]] = DEFAULT_PARAGRAPH_SEPARATORS,
    ):
        if not isinstance(chunk_size, int) or chunk_size < 1:
            raise SplitterConfigException(
                "chunk_size must be a positive integer greater than or equal to 1"
            )

        if not isinstance(chunk_overlap, (int, float)) or chunk_overlap < 0:
            raise SplitterConfigException(
                "chunk_overlap must be a non-negative int or float"
            )

        # Normalise line_break to a list of strings and validate
        if isinstance(line_break, str):
            line_break_list = [line_break]
        elif isinstance(line_break, list):
            line_break_list = line_break
        else:
            raise SplitterConfigException(
                "line_break must be a string or a list of strings"
            )

        if not line_break_list or any(
            not isinstance(lb, str) or not lb for lb in line_break_list
        ):
            raise SplitterConfigException(
                "line_break must contain at least one non-empty string"
            )

        super().__init__(chunk_size)
        self.chunk_overlap = chunk_overlap
        self.line_break = line_break_list

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Split the text in ``reader_output.text`` into paragraph-based chunks.

        Pipeline:

        1. Validate and normalise ``reader_output.text``.
        2. Split into paragraphs.
        3. Compute word overlap.
        4. Build chunks (with overlap).
        5. Build the final :class:`SplitterOutput`.

        Args:
            reader_output (ReaderOutput): Dataclass containing at least a ``text`` field
                (str or None) and optional document metadata.

        Returns:
            SplitterOutput: Dataclass defining the output structure for all splitters.

        Raises:
            ReaderOutputException:
                If ``reader_output.text`` is missing or not ``str``/``None``.
            InvalidChunkException:
                If the number of generated chunk IDs does not match the number of chunks.
            SplitterOutputException:
                If constructing :class:`SplitterOutput` fails unexpectedly.

        Warnings:
            SplitterInputWarning:
                When the input text is empty or whitespace-only.
            SplitterOutputWarning:
                When no non-empty paragraphs are found, causing the splitter to fall
                back to a single empty chunk.
            ChunkUnderflowWarning:
                When fewer chunks than ``chunk_size`` are produced because the input
                has too few paragraphs.

        Example:
            **Basic usage** with default line breaks and no overlap:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter.splitters import ParagraphSplitter

            text = (
                "First paragraph.\\n\\n"
                "Second paragraph with more text.\\n\\n"
                "Third paragraph."
            )

            ro = ReaderOutput(
                text=text,
                document_name="example.txt",
                document_path="/tmp/example.txt",
                document_id="doc-1",
                conversion_method="text",
                reader_method="plain",
                ocr_method=None,
                metadata={},
            )

            splitter = ParagraphSplitter(chunk_size=2, chunk_overlap=0)
            output = splitter.split(ro)

            print(output.chunks)
            ```

            ```python
            ['First paragraph.\\n\\nSecond paragraph with more text.', 'Third paragraph.']
            ```

            Example with **custom line breaks** and **word overlap** between chunks:

            ```python
            text = (
                "Intro paragraph.@@"
                "Details paragraph one.@@"
                "Details paragraph two.@@"
                "Conclusion paragraph."
            )

            ro = ReaderOutput(text=text, document_name="custom_sep.txt")

            splitter = ParagraphSplitter(
                chunk_size=2,
                chunk_overlap=3, # reuse last 3 words from previous chunk
                line_break="@@", # custom paragraph separator
            )
            output = splitter.split(ro)

            for chunk in output.chunks:
                print("--- CHUNK ---")
                print(chunk)
            ```
        """
        text = self._validate_reader_output(reader_output)
        paragraphs = self._split_into_paragraphs(text)
        overlap = self._compute_overlap(paragraphs)
        chunks = self._build_chunks(paragraphs, overlap)
        return self._build_output(reader_output, chunks)

    # ---- Internal helpers ---- #

    def _validate_reader_output(self, reader_output: ReaderOutput) -> str:
        """
        Validate and normalise ReaderOutput.text.

        Raises:
            ReaderOutputException: On missing or invalid text.
        """
        if not hasattr(reader_output, "text"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'text' attribute."
            )

        text = reader_output.text
        if text is None:
            text = ""
        elif not isinstance(text, str):
            raise ReaderOutputException(
                f"ReaderOutput.text must be of type 'str' or None, got "
                f"{type(text).__name__!r}"
            )

        if not text.strip():
            warnings.warn(
                "ParagraphSplitter received empty or whitespace-only text; "
                "resulting chunks will be empty.",
                SplitterInputWarning,
                stacklevel=3,
            )

        return text

    def _split_into_paragraphs(self, text: str) -> List[str]:
        """
        Split the input text into normalised paragraphs.

        Warnings:
            SplitterOutputWarning:
                When no non-empty paragraphs are found and the splitter falls back
                to a single empty paragraph.
        """
        pattern = "|".join(map(re.escape, self.line_break))
        paragraphs = [p for p in re.split(pattern, text) if p.strip()]

        if not paragraphs:
            warnings.warn(
                "ParagraphSplitter did not find any non-empty paragraphs; "
                "returning a single empty chunk.",
                SplitterOutputWarning,
                stacklevel=3,
            )
            paragraphs = [""]

        return paragraphs

    def _compute_overlap(self, paragraphs: List[str]) -> int:
        """
        Compute the number of overlapping words between chunks based on the
        configured ``chunk_overlap``.
        """
        if isinstance(self.chunk_overlap, float) and 0 <= self.chunk_overlap < 1:
            max_para_words = max((len(p.split()) for p in paragraphs), default=0)
            return int(max_para_words * self.chunk_overlap)
        return int(self.chunk_overlap)

    def _build_chunks(self, paragraphs: List[str], overlap: int) -> List[str]:
        """
        Build paragraph-based chunks, applying word overlap from the previous chunk
        when requested.

        Warnings:
            SplitterOutputWarning: If splitter produces empty chunks.
            ChunkUnderflowWarning: If fewer chunks than ``chunk_size`` are produced
                because the input has too few paragraphs.
        """
        chunks: List[str] = []
        num_paragraphs = len(paragraphs)
        start = 0

        while start < num_paragraphs:
            end = min(start + self.chunk_size, num_paragraphs)
            chunk_paragraphs = paragraphs[start:end]
            chunk_text = self.line_break[0].join(chunk_paragraphs)

            if overlap > 0 and chunks:
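                # Reuse the last `overlap` words of the previous chunk as a prefix of this one.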
                prev_words = chunks[-1].split()
                overlap_words = (
                    prev_words[-overlap:] if overlap <= len(prev_words) else prev_words
                )
                chunk_text = (
                    self.line_break[0]
                    .join([" ".join(overlap_words), chunk_text])
                    .strip()
                )

            chunks.append(chunk_text)
            start += self.chunk_size

        if not chunks:
            chunks = [""]
            warnings.warn(
                "ParagraphSplitter did not produce any chunks; "
                "returning a single empty chunk.",
                SplitterOutputWarning,
                stacklevel=3,
            )
            return chunks

        if len(chunks) < self.chunk_size:
            warnings.warn(
                f"ParagraphSplitter produced fewer chunks ({len(chunks)}) than "
                f"the configured chunk_size ({self.chunk_size}) because the input "
                f"contains only {num_paragraphs} paragraph(s).",
                ChunkUnderflowWarning,
                stacklevel=3,
            )

        return chunks

    def _build_output(
        self, reader_output: ReaderOutput, chunks: List[str]
    ) -> SplitterOutput:
        """
        Assemble and return the final SplitterOutput.

        Raises:
            InvalidChunkException: If #chunk_ids != #chunks.
            SplitterOutputException: If SplitterOutput construction fails.
        """
        chunk_ids = self._generate_chunk_ids(len(chunks))
        if len(chunk_ids) != len(chunks):
            raise InvalidChunkException(
                "Number of chunk IDs does not match number of chunks "
                f"(chunk_ids={len(chunk_ids)}, chunks={len(chunks)})."
            )

        metadata = self._default_metadata()

        try:
            return SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="paragraph_splitter",
                split_params={
                    "chunk_size": self.chunk_size,
                    "chunk_overlap": self.chunk_overlap,
                    "line_break": self.line_break,
                },
                metadata=metadata,
            )
        except Exception as exc:
            raise SplitterOutputException(
                f"Failed to build SplitterOutput in ParagraphSplitter: {exc}"
            ) from exc
split(reader_output)

Split the text in reader_output.text into paragraph-based chunks.

Pipeline:

  1. Validate and normalise reader_output.text.
  2. Split into paragraphs.
  3. Compute word overlap.
  4. Build chunks (with overlap).
  5. Build the final SplitterOutput.

Parameters:

  reader_output (ReaderOutput, required): Dataclass containing at least a text field (str or None) and optional document metadata.

Returns:

  SplitterOutput: Dataclass defining the output structure for all splitters.

Raises:

  ReaderOutputException: If reader_output.text is missing or not str/None.
  InvalidChunkException: If the number of generated chunk IDs does not match the number of chunks.
  SplitterOutputException: If constructing SplitterOutput fails unexpectedly.

Warns:

  SplitterInputWarning: When the input text is empty or whitespace-only.
  SplitterOutputWarning: When no non-empty paragraphs are found, causing the splitter to fall back to a single empty chunk.
  ChunkUnderflowWarning: When fewer chunks than chunk_size are produced because the input has too few paragraphs.

Example

Basic usage with default line breaks and no overlap:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import ParagraphSplitter

text = (
    "First paragraph.\n\n"
    "Second paragraph with more text.\n\n"
    "Third paragraph."
)

ro = ReaderOutput(
    text=text,
    document_name="example.txt",
    document_path="/tmp/example.txt",
    document_id="doc-1",
    conversion_method="text",
    reader_method="plain",
    ocr_method=None,
    metadata={},
)

splitter = ParagraphSplitter(chunk_size=2, chunk_overlap=0)
output = splitter.split(ro)

print(output.chunks)
['First paragraph.\n\nSecond paragraph with more text.', 'Third paragraph.']

Example with custom line breaks and word overlap between chunks:

text = (
    "Intro paragraph.@@"
    "Details paragraph one.@@"
    "Details paragraph two.@@"
    "Conclusion paragraph."
)

ro = ReaderOutput(text=text, document_name="custom_sep.txt")

splitter = ParagraphSplitter(
    chunk_size=2,
    chunk_overlap=3, # reuse last 3 words from previous chunk
    line_break="@@", # custom paragraph separator
)
output = splitter.split(ro)

for chunk in output.chunks:
    print("--- CHUNK ---")
    print(chunk)
Source code in src/splitter_mr/splitter/splitters/paragraph_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Split the text in ``reader_output.text`` into paragraph-based chunks.

    Pipeline:

    1. Validate and normalise ``reader_output.text``.
    2. Split into paragraphs.
    3. Compute word overlap.
    4. Build chunks (with overlap).
    5. Build the final :class:`SplitterOutput`.

    Args:
        reader_output (ReaderOutput): Dataclass containing at least a ``text`` field
            (str or None) and optional document metadata.

    Returns:
        SplitterOutput: Dataclass defining the output structure for all splitters.

    Raises:
        ReaderOutputException:
            If ``reader_output.text`` is missing or not ``str``/``None``.
        InvalidChunkException:
            If the number of generated chunk IDs does not match the number of chunks.
        SplitterOutputException:
            If constructing :class:`SplitterOutput` fails unexpectedly.

    Warnings:
        SplitterInputWarning:
            When the input text is empty or whitespace-only.
        SplitterOutputWarning:
            When no non-empty paragraphs are found, causing the splitter to fall
            back to a single empty chunk.
        ChunkUnderflowWarning:
            When fewer chunks than ``chunk_size`` are produced because the input
            has too few paragraphs.

    Example:
        **Basic usage** with default line breaks and no overlap:

        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter.splitters import ParagraphSplitter

        text = (
            "First paragraph.\\n\\n"
            "Second paragraph with more text.\\n\\n"
            "Third paragraph."
        )

        ro = ReaderOutput(
            text=text,
            document_name="example.txt",
            document_path="/tmp/example.txt",
            document_id="doc-1",
            conversion_method="text",
            reader_method="plain",
            ocr_method=None,
            metadata={},
        )

        splitter = ParagraphSplitter(chunk_size=2, chunk_overlap=0)
        output = splitter.split(ro)

        print(output.chunks)
        ```

        ```python
        ['First paragraph.\\n\\nSecond paragraph with more text.', 'Third paragraph.']
        ```

        Example with **custom line breaks** and **word overlap** between chunks:

        ```python
        text = (
            "Intro paragraph.@@"
            "Details paragraph one.@@"
            "Details paragraph two.@@"
            "Conclusion paragraph."
        )

        ro = ReaderOutput(text=text, document_name="custom_sep.txt")

        splitter = ParagraphSplitter(
            chunk_size=2,
            chunk_overlap=3, # reuse last 3 words from previous chunk
            line_break="@@", # custom paragraph separator
        )
        output = splitter.split(ro)

        for chunk in output.chunks:
            print("--- CHUNK ---")
            print(chunk)
        ```
    """
    text = self._validate_reader_output(reader_output)
    paragraphs = self._split_into_paragraphs(text)
    overlap = self._compute_overlap(paragraphs)
    chunks = self._build_chunks(paragraphs, overlap)
    return self._build_output(reader_output, chunks)

RecursiveCharacterSplitter

RecursiveCharacterSplitter

Bases: BaseSplitter

RecursiveCharacterSplitter splits a given text into overlapping or non-overlapping chunks, where each chunk is created by repeatedly breaking down the text until it reaches the desired chunk size. This splitter is backed by LangChain's RecursiveCharacterTextSplitter.

Parameters:

  chunk_size (int, default 1000): The approximate number of characters per chunk.
  chunk_overlap (int | float, default 0.1): The number of characters shared between contiguous chunks, or a fraction of chunk_size when 0 <= value < 1.
  separators (str | List[str], default DEFAULT_RECURSIVE_SEPARATORS): The list of characters or regex patterns that defines how the text is split.

Raises:

  SplitterConfigException: If chunk_size is less than 1, chunk_overlap is negative or effectively greater than or equal to chunk_size, or separators is neither a non-empty string nor a sequence of strings with at least one non-empty entry.
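
Because a fractional chunk_overlap is resolved against chunk_size, a configuration whose effective overlap is at least as large as chunk_size is rejected at construction time. A minimal sketch with illustrative values (the import path of SplitterConfigException is an assumption; adjust it to wherever the package exposes its exceptions):

```python
from splitter_mr.splitter import RecursiveCharacterSplitter
from splitter_mr.schema import SplitterConfigException  # assumed import path

# Fractional overlap: effective overlap = int(50 * 0.2) = 10 characters.
ok = RecursiveCharacterSplitter(chunk_size=50, chunk_overlap=0.2)

# An absolute overlap >= chunk_size is invalid and fails fast.
try:
    RecursiveCharacterSplitter(chunk_size=50, chunk_overlap=60)
except SplitterConfigException as exc:
    print(exc)  # chunk_overlap (effective characters) must be smaller than chunk_size
```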

Source code in src/splitter_mr/splitter/splitters/recursive_splitter.py
class RecursiveCharacterSplitter(BaseSplitter):
    """
    RecursiveCharacterSplitter splits a given text into overlapping or non-overlapping chunks,
    where each chunk is created by repeatedly breaking down the text until it reaches the
    desired chunk size. This splitter is backed by LangChain's
    :class:`RecursiveCharacterTextSplitter`.

    Args:
        chunk_size (int): The approximate number of characters per chunk.
        chunk_overlap (int | float): The number of characters shared between
            contiguous chunks, or a fraction of chunk_size when 0 <= value < 1.
        separators (str | List[str]): The list of characters or regex patterns that
            defines how the text is split.

    Raises:
        SplitterConfigException:
            If ``chunk_size`` is less than 1, ``chunk_overlap`` is negative or
            effectively greater than or equal to ``chunk_size``, or ``separators`` is
            neither a non-empty string nor a sequence of strings with at least one
            non-empty entry.
    """

    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: Union[int, float] = 0.1,
        separators: Union[str, List[str], Tuple[str]] = DEFAULT_RECURSIVE_SEPARATORS,
    ):
        if not isinstance(chunk_size, int) or chunk_size < 1:
            raise SplitterConfigException(
                "chunk_size must be a positive integer greater than or equal to 1"
            )

        if not isinstance(chunk_overlap, (int, float)) or chunk_overlap < 0:
            raise SplitterConfigException(
                "chunk_overlap must be a non-negative int or float"
            )

        if isinstance(separators, str):
            separators_list = [separators]
        elif isinstance(separators, (list, tuple)):
            separators_list = list(separators)
        else:
            raise SplitterConfigException(
                "separators must be a string or a list of strings"
            )

        if not separators_list or any(not isinstance(s, str) for s in separators_list):
            raise SplitterConfigException("separators must contain only string values")

        is_default_separators = (
            isinstance(separators, tuple) and separators == DEFAULT_RECURSIVE_SEPARATORS
        )

        if not is_default_separators and any(s == "" for s in separators_list):
            raise SplitterConfigException(
                "separators must contain at least one non-empty string"
            )

        if isinstance(chunk_overlap, float):
            eff_overlap = int(chunk_size * chunk_overlap)
        else:
            eff_overlap = int(chunk_overlap)

        if eff_overlap >= chunk_size:
            raise SplitterConfigException(
                "chunk_overlap (effective characters) must be smaller than chunk_size"
            )

        super().__init__(chunk_size)
        self.chunk_overlap = chunk_overlap
        self.separators = separators_list

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Splits the input text into character-based chunks using a recursive splitting strategy
        (via LangChain's :class:`RecursiveCharacterTextSplitter`), supporting configurable
        separators, chunk size, and overlap.

        Args:
            reader_output (ReaderOutput): Dataclass containing at least a ``text`` field (str
                or None) and optional document metadata (e.g., ``document_name``,
                ``document_path``, etc.).

        Returns:
            SplitterOutput: Dataclass defining the output structure for all splitters.

        Raises:
            ReaderOutputException:
                If ``reader_output.text`` is missing or not ``str``/``None``.
            SplitterConfigException:
                If (effective) ``chunk_overlap`` is greater than or equal to ``chunk_size``.
            InvalidChunkException:
                If the number of generated chunk IDs does not match the number of chunks.
            SplitterOutputException:
                If constructing :class:`SplitterOutput` fails unexpectedly, or if
                LangChain's splitter raises an unexpected error.

        Warnings:
            SplitterInputWarning:
                When the input text is empty or whitespace-only.
            SplitterOutputWarning:
                When no chunks are produced and the splitter falls back to a single
                empty chunk.


        Example:
            **Basic usage** with a simple text string:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter import RecursiveCharacterSplitter

            # Sample text (short for demonstration)
            text = (
                "LangChain makes it easy to build LLM-powered applications. "
                "Recursive splitting helps maintain semantic coherence while "
                "still enforcing chunk-size limits."
            )

            reader_output = ReaderOutput(
                text=text,
                document_name="example.txt",
                document_path="/tmp/example.txt",
                document_id="abc123",
                conversion_method="text",
                metadata={}
            )

            splitter = RecursiveCharacterSplitter(
                chunk_size=50,
                chunk_overlap=0.2,  # 20% of chunk_size overlap
                separators=["\\n\\n", ".", " "]  # recursive fallback separators
            )

            output = splitter.split(reader_output)

            # Inspect results
            print(output.chunks)
            print(output.chunk_id)
            print(output.split_params)
            ```
        """
        # Validate input
        if not hasattr(reader_output, "text"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'text' attribute."
            )

        text = reader_output.text
        if text is None:
            text = ""
        elif not isinstance(text, str):
            raise ReaderOutputException(
                f"ReaderOutput.text must be of type 'str' or None, got "
                f"{type(text).__name__!r}"
            )

        if not text.strip():
            warnings.warn(
                "RecursiveCharacterSplitter received empty or whitespace-only text; "
                "resulting chunks will be empty.",
                SplitterInputWarning,
                stacklevel=2,
            )

        chunk_size = self.chunk_size

        # Determine overlap in characters (effective value used by LangChain)
        if isinstance(self.chunk_overlap, float) and 0 <= self.chunk_overlap < 1:
            overlap = int(chunk_size * self.chunk_overlap)
        else:
            overlap = int(self.chunk_overlap)

        if overlap >= chunk_size:
            # Config is invalid relative to this chunk_size
            raise SplitterConfigException(
                "chunk_overlap (effective characters) must be smaller than chunk_size"
            )

        # Generate chunks
        try:
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=overlap,
                separators=self.separators,
            )
            texts = splitter.create_documents([text])
        except Exception as exc:
            # Wrap any unexpected LangChain behaviour
            raise SplitterOutputException(
                f"RecursiveCharacterTextSplitter failed during split: {exc}"
            ) from exc

        chunks = [doc.page_content for doc in texts] if texts else []

        # -> If no chunks at all, warn and fall back
        if not chunks:
            warnings.warn(
                "RecursiveCharacterSplitter did not produce any chunks; "
                "returning a single empty chunk.",
                SplitterOutputWarning,
                stacklevel=2,
            )
            chunks = [""]

        # Generate chunk_ids and append metadata
        chunk_ids = self._generate_chunk_ids(len(chunks))
        if len(chunk_ids) != len(chunks):
            raise InvalidChunkException(
                "Number of chunk IDs does not match number of chunks "
                f"(chunk_ids={len(chunk_ids)}, chunks={len(chunks)})."
            )

        metadata = self._default_metadata()

        # Build output
        try:
            output = SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="recursive_character_splitter",
                split_params={
                    "chunk_size": chunk_size,
                    "chunk_overlap": overlap,
                    "separators": self.separators,
                },
                metadata=metadata,
            )
        except Exception as exc:
            raise SplitterOutputException(
                f"Failed to build SplitterOutput in RecursiveCharacterSplitter: {exc}"
            ) from exc

        return output
split(reader_output)

Splits the input text into character-based chunks using a recursive splitting strategy (via LangChain's RecursiveCharacterTextSplitter), supporting configurable separators, chunk size, and overlap.

Parameters:

  reader_output (ReaderOutput, required): Dataclass containing at least a text field (str or None) and optional document metadata (e.g., document_name, document_path, etc.).

Returns:

  SplitterOutput: Dataclass defining the output structure for all splitters.

Raises:

  ReaderOutputException: If reader_output.text is missing or not str/None.
  SplitterConfigException: If the (effective) chunk_overlap is greater than or equal to chunk_size.
  InvalidChunkException: If the number of generated chunk IDs does not match the number of chunks.
  SplitterOutputException: If constructing SplitterOutput fails unexpectedly, or if LangChain's splitter raises an unexpected error.

Warns:

  SplitterInputWarning: When the input text is empty or whitespace-only.
  SplitterOutputWarning: When no chunks are produced and the splitter falls back to a single empty chunk.

Example

Basic usage with a simple text string:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter import RecursiveCharacterSplitter

# Sample text (short for demonstration)
text = (
    "LangChain makes it easy to build LLM-powered applications. "
    "Recursive splitting helps maintain semantic coherence while "
    "still enforcing chunk-size limits."
)

reader_output = ReaderOutput(
    text=text,
    document_name="example.txt",
    document_path="/tmp/example.txt",
    document_id="abc123",
    conversion_method="text",
    metadata={}
)

splitter = RecursiveCharacterSplitter(
    chunk_size=50,
    chunk_overlap=0.2,  # 20% of chunk_size overlap
    separators=["\n\n", ".", " "]  # recursive fallback separators
)

output = splitter.split(reader_output)

# Inspect results
print(output.chunks)
print(output.chunk_id)
print(output.split_params)
Source code in src/splitter_mr/splitter/splitters/recursive_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Splits the input text into character-based chunks using a recursive splitting strategy
    (via LangChain's :class:`RecursiveCharacterTextSplitter`), supporting configurable
    separators, chunk size, and overlap.

    Args:
        reader_output (ReaderOutput): Dataclass containing at least a ``text`` field (str
            or None) and optional document metadata (e.g., ``document_name``,
            ``document_path``, etc.).

    Returns:
        SplitterOutput: Dataclass defining the output structure for all splitters.

    Raises:
        ReaderOutputException:
            If ``reader_output.text`` is missing or not ``str``/``None``.
        SplitterConfigException:
            If (effective) ``chunk_overlap`` is greater than or equal to ``chunk_size``.
        InvalidChunkException:
            If the number of generated chunk IDs does not match the number of chunks.
        SplitterOutputException:
            If constructing :class:`SplitterOutput` fails unexpectedly, or if
            LangChain's splitter raises an unexpected error.

    Warnings:
        SplitterInputWarning:
            When the input text is empty or whitespace-only.
        SplitterOutputWarning:
            When no chunks are produced and the splitter falls back to a single
            empty chunk.


    Example:
        **Basic usage** with a simple text string:

        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter import RecursiveCharacterSplitter

        # Sample text (short for demonstration)
        text = (
            "LangChain makes it easy to build LLM-powered applications. "
            "Recursive splitting helps maintain semantic coherence while "
            "still enforcing chunk-size limits."
        )

        reader_output = ReaderOutput(
            text=text,
            document_name="example.txt",
            document_path="/tmp/example.txt",
            document_id="abc123",
            conversion_method="text",
            metadata={}
        )

        splitter = RecursiveCharacterSplitter(
            chunk_size=50,
            chunk_overlap=0.2,  # 20% of chunk_size overlap
            separators=["\\n\\n", ".", " "]  # recursive fallback separators
        )

        output = splitter.split(reader_output)

        # Inspect results
        print(output.chunks)
        print(output.chunk_id)
        print(output.split_params)
        ```
    """
    # Validate input
    if not hasattr(reader_output, "text"):
        raise ReaderOutputException(
            "ReaderOutput object must expose a 'text' attribute."
        )

    text = reader_output.text
    if text is None:
        text = ""
    elif not isinstance(text, str):
        raise ReaderOutputException(
            f"ReaderOutput.text must be of type 'str' or None, got "
            f"{type(text).__name__!r}"
        )

    if not text.strip():
        warnings.warn(
            "RecursiveCharacterSplitter received empty or whitespace-only text; "
            "resulting chunks will be empty.",
            SplitterInputWarning,
            stacklevel=2,
        )

    chunk_size = self.chunk_size

    # Determine overlap in characters (effective value used by LangChain)
    if isinstance(self.chunk_overlap, float) and 0 <= self.chunk_overlap < 1:
        overlap = int(chunk_size * self.chunk_overlap)
    else:
        overlap = int(self.chunk_overlap)

    if overlap >= chunk_size:
        # Config is invalid relative to this chunk_size
        raise SplitterConfigException(
            "chunk_overlap (effective characters) must be smaller than chunk_size"
        )

    # Generate chunks
    try:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=overlap,
            separators=self.separators,
        )
        texts = splitter.create_documents([text])
    except Exception as exc:
        # Wrap any unexpected LangChain behaviour
        raise SplitterOutputException(
            f"RecursiveCharacterTextSplitter failed during split: {exc}"
        ) from exc

    chunks = [doc.page_content for doc in texts] if texts else []

    # -> If no chunks at all, warn and fall back
    if not chunks:
        warnings.warn(
            "RecursiveCharacterSplitter did not produce any chunks; "
            "returning a single empty chunk.",
            SplitterOutputWarning,
            stacklevel=2,
        )
        chunks = [""]

    # Generate chunk_ids and append metadata
    chunk_ids = self._generate_chunk_ids(len(chunks))
    if len(chunk_ids) != len(chunks):
        raise InvalidChunkException(
            "Number of chunk IDs does not match number of chunks "
            f"(chunk_ids={len(chunk_ids)}, chunks={len(chunks)})."
        )

    metadata = self._default_metadata()

    # Build output
    try:
        output = SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="recursive_character_splitter",
            split_params={
                "chunk_size": chunk_size,
                "chunk_overlap": overlap,
                "separators": self.separators,
            },
            metadata=metadata,
        )
    except Exception as exc:
        raise SplitterOutputException(
            f"Failed to build SplitterOutput in RecursiveCharacterSplitter: {exc}"
        ) from exc

    return output

KeywordSplitter

KeywordSplitter

Bases: BaseSplitter

Splitter that chunks text around keyword boundaries using regular expressions.

This splitter searches the input text for one or more keyword patterns (regex) and creates chunks at each match boundary. You can control how the matched delimiter is attached to the resulting chunks (before/after/both/none) and apply a secondary, size-based re-chunking to respect chunk_size.

Notes
  • All regexes are compiled into one alternation with named groups when patterns is a dict. This simplifies per-keyword accounting.
  • If the input text is empty or no matches are found, the entire text becomes a single chunk (subject to size-based re-chunking).

Parameters:

  patterns (Union[List[str], Dict[str, str]], required): A list of regex pattern strings or a mapping of name -> regex pattern. When a dict is provided, the keys are used in the metadata counts. When a list is provided, synthetic names are generated (k0, k1, ...).
  flags (int, default 0): Standard re flags combined with | (e.g., re.IGNORECASE).
  include_delimiters (str, default DEFAULT_KEYWORD_DELIMITER_POS): Where to attach the matched keyword delimiter. One of "none", "before", "after", "both".
    - before (default) appends the match to the preceding chunk.
    - after prepends the match to the following chunk.
    - both duplicates the match on both sides.
    - none omits the delimiter from both sides.
  chunk_size (int, default 100000): Target maximum size (in characters) for each chunk. When a produced chunk exceeds this value, it is soft-wrapped by whitespace using a greedy strategy.

Raises:

  SplitterConfigException: If patterns, include_delimiters or chunk_size are invalid, or if regex compilation fails.
  ReaderOutputException: If reader_output does not expose a valid text field.
  InvalidChunkException: If internal chunk accounting becomes inconsistent.
  SplitterOutputException: If building SplitterOutput fails unexpectedly.
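
No usage example is rendered for this splitter, so here is a minimal sketch based on the parameters above. The import path of KeywordSplitter is assumed to mirror the other splitters, and the log-style input text and pattern names are illustrative:

```python
import re

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter import KeywordSplitter  # assumed import path

text = (
    "WARNING: disk usage at 91%\n"
    "INFO: nightly backup finished\n"
    "ERROR: replication lag detected\n"
)

reader_output = ReaderOutput(text=text, document_name="service.log")

# Named patterns (a dict) so per-keyword counts can show up in the output metadata;
# include_delimiters="before" attaches each matched keyword to the preceding chunk.
splitter = KeywordSplitter(
    patterns={"warning": r"WARNING:", "error": r"ERROR:"},
    flags=re.MULTILINE,
    include_delimiters="before",
    chunk_size=200,
)

output = splitter.split(reader_output)
print(output.chunks)
```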

Source code in src/splitter_mr/splitter/splitters/keyword_splitter.py
class KeywordSplitter(BaseSplitter):
    """
    Splitter that chunks text around *keyword* boundaries using regular expressions.

    This splitter searches the input text for one or more *keyword patterns* (regex)
    and creates chunks at each match boundary. You can control how the matched
    delimiter is attached to the resulting chunks (before/after/both/none) and apply a
    secondary, size-based re-chunking to respect ``chunk_size``.

    Notes:
        - All regexes are compiled into **one** alternation with *named groups* when
          ``patterns`` is a dict. This simplifies per-keyword accounting.
        - If the input text is empty or no matches are found, the entire text
          becomes a single chunk (subject to size-based re-chunking).

    Args:
        patterns (Union[List[str], Dict[str, str]]): A list of regex pattern strings **or** a mapping of
            ``name -> regex pattern``. When a dict is provided, the keys are used in
            the metadata counts. When a list is provided, synthetic names are
            generated (``k0``, ``k1``, ...).
        flags (int): Standard ``re`` flags combined with ``|`` (e.g., ``re.IGNORECASE``).
        include_delimiters (str): Where to attach the matched keyword delimiter.
            One of ``"none"``, ``"before"``, ``"after"``, ``"both"``.
            - ``before`` (default) appends the match to the *preceding* chunk.
            - ``after`` prepends the match to the *following* chunk.
            - ``both`` duplicates the match on both sides.
            - ``none`` omits the delimiter from both sides.
        chunk_size (int): Target maximum size (in characters) for each chunk. When a
            produced chunk exceeds this value, it is *soft*-wrapped by whitespace
            using a greedy strategy.

    Raises:
        SplitterConfigException: If ``patterns``, ``include_delimiters`` or ``chunk_size``
            are invalid or regex compilation fails.
        ReaderOutputException: If ``reader_output`` does not expose a valid ``text`` field.
        InvalidChunkException: If internal chunk accounting becomes inconsistent.
        SplitterOutputException: If building :class:`SplitterOutput` fails unexpectedly.
    """

    def __init__(
        self,
        patterns: Union[List[str], Dict[str, str]],
        *,
        flags: int = 0,
        include_delimiters: str = DEFAULT_KEYWORD_DELIMITER_POS,
        chunk_size: int = 100000,
    ) -> None:
        # Basic config validation at construction time
        if not isinstance(chunk_size, int) or chunk_size <= 0:
            raise SplitterConfigException(
                f"chunk_size must be a positive integer, got {chunk_size!r}"
            )

        super().__init__(chunk_size=chunk_size)
        self.include_delimiters = self._validate_include_delimiters(include_delimiters)

        # Validate patterns type early for clearer errors
        if not isinstance(patterns, (list, dict)):
            raise SplitterConfigException(
                "patterns must be a list of regex strings or a dict[name -> pattern], "
                f"got {type(patterns).__name__!r}"
            )

        try:
            self.pattern_names, self.compiled = self._compile_patterns(patterns, flags)
        except re.error as exc:  # invalid regex, bad group name, etc.
            raise SplitterConfigException(
                f"Failed to compile keyword patterns: {exc}"
            ) from exc

        self.flags = flags

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Split ReaderOutput into keyword-delimited chunks and build structured output.

        The method first splits around regex keyword matches (respecting
        ``include_delimiters``), then performs a secondary size-based soft wrap to
        respect ``chunk_size``. It returns a fully populated :class:`SplitterOutput`.

        Args:
            reader_output (ReaderOutput): Input document and metadata.

        Returns:
            SplitterOutput: Output structure with chunked text and metadata.

        Raises:
            ReaderOutputException: If ``reader_output`` has an invalid structure.
            InvalidChunkException: If the number of chunks and chunk IDs diverge.
            SplitterOutputException: If constructing the output object fails.

        Example:
            **Basic usage** with a **list** of patterns:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter.splitters import KeywordSplitter

            text = "Alpha KEY Beta KEY Gamma"
            ro = ReaderOutput(
                text=text,
                document_name="demo.txt",
                document_path="/tmp/demo.txt",
            )

            splitter = KeywordSplitter(patterns=[r"KEY"])
            out = splitter.split(ro)

            print(out.chunks)
            ```

            ```python
            ['Alpha KEY', 'Beta KEY', 'Gamma']
            ```

            Using a **`dict` of named patterns** (names appear in metadata):

            ```python
            patterns = {
                "plus": r"\\+",
                "minus": r"-",
            }
            text = "A + B - C + D"
            ro = ReaderOutput(text=text)

            splitter = KeywordSplitter(patterns=patterns)
            out = splitter.split(ro)

            print(out.chunks)
            ```

            ```python
            ['A +', 'B -', 'C +', 'D']
            ```

            ```python
            print(out.metadata["keyword_matches"]["counts"])
            ```

            ```json
            {'plus': 2, 'minus': 1}
            ```

            Demonstrating ``include_delimiters`` modes:

            ```python
            text = "A#B#C"

            splitter = KeywordSplitter(patterns=[r"#"], include_delimiters="after")
            out = splitter.split(ReaderOutput(text=text))
            print(out.chunks)
            ```

            ```python
            ['A#', 'B#', 'C']
            ```

            ```python
            splitter = KeywordSplitter(patterns=[r"#"], include_delimiters="none")
            out = splitter.split(ReaderOutput(text=text))
            print(out.chunks)
            ```

            ```python
            ['A', 'B', 'C']
            ```

            Example showing **size-based soft wrapping** (`chunk_size=5`):

            ```python
            text = "abcdefghijklmnopqrstuvwxyz"
            splitter = KeywordSplitter(patterns=[r"x"], chunk_size=5)
            ```

            ```python
            out = splitter.split(ReaderOutput(text=text))
            print(out.chunks)
            ```

            ```python
            ['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']
            ```

            Example with **multiple patterns and mixed text**:

            ```python
            splitter = KeywordSplitter(
                patterns=[r"ERROR", r"WARNING"],
                include_delimiters="after",
            )

            log = "INFO Start\\nERROR Failure occurred\\nWARNING Low RAM\\nINFO End"
            out = splitter.split(ReaderOutput(text=log))

            print(out.chunks)
            ```

            ```python
            ['INFO Start\\nERROR', 'Failure occurred\\nWARNING', 'Low RAM\\nINFO End']
            ```
        """
        if not hasattr(reader_output, "text"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'text' attribute."
            )

        text = reader_output.text
        if text is None:
            text: str = ""
        elif not isinstance(text, str):
            raise ReaderOutputException(
                f"ReaderOutput.text must be of type 'str' or None, got "
                f"{type(text).__name__!r}"
            )

        # Warn on suspiciously empty input
        if not text.strip():
            warnings.warn(
                "KeywordSplitter received empty or whitespace-only text; "
                "output will contain a single empty chunk.",
                SplitterInputWarning,
                stacklevel=2,
            )

        # Ensure document_id is present so it propagates (fixes metadata test)
        if not getattr(reader_output, "document_id", None):
            reader_output.document_id = str(uuid.uuid4())

        # Primary split by keyword matches (names used for counts)
        raw_chunks, match_spans, match_names = self._split_by_keywords(text)

        # Secondary size-based re-chunking to respect chunk_size
        sized_chunks: list[str] = []
        for ch in raw_chunks:
            sized_chunks.extend(self._soft_wrap(ch, self.chunk_size))
        if not sized_chunks:
            sized_chunks: list[str] = [""]

        # Generate IDs
        chunk_ids = self._generate_chunk_ids(len(sized_chunks))

        # Extra sanity check: chunks vs IDs
        if len(chunk_ids) != len(sized_chunks):
            raise InvalidChunkException(
                "Number of chunk IDs does not match number of chunks "
                f"(chunk_ids={len(chunk_ids)}, chunks={len(sized_chunks)})."
            )

        # Build metadata (ensure counts/spans are always present)
        matches_meta: Dict[str, object] = {
            "counts": self._count_by_name(match_names),
            "spans": match_spans,
            "include_delimiters": self.include_delimiters,
            "flags": self.flags,
            "pattern_names": self.pattern_names,
            "chunk_size": self.chunk_size,
        }

        try:
            return self._build_output(
                reader_output=reader_output,
                chunks=sized_chunks,
                chunk_ids=chunk_ids,
                matches_meta=matches_meta,
            )
        except (TypeError, ValueError) as exc:
            raise SplitterOutputException(
                f"Failed to build SplitterOutput in KeywordSplitter: {exc}"
            ) from exc

    # ---- Helpers ---- #

    @staticmethod
    def _validate_include_delimiters(value: str) -> str:
        """
        Validate and normalize include_delimiters argument.

        Args:
            value (str): One of {"none", "before", "after", "both"}.

        Returns:
            str: Normalized delimiter mode.

        Raises:
            SplitterConfigException: If the mode is invalid.
        """
        if not isinstance(value, str):
            raise SplitterConfigException(
                f"include_delimiters must be a string, got {type(value).__name__!r}"
            )

        v: str = value.lower().strip()
        if v not in SUPPORTED_KEYWORD_DELIMITERS:
            raise SplitterConfigException(
                "include_delimiters must be one of "
                f"{sorted(SUPPORTED_KEYWORD_DELIMITERS)}, got {value!r}"
            )
        return v

    @staticmethod
    def _compile_patterns(
        patterns: Union[List[str], Dict[str, str]], flags: int
    ) -> Tuple[List[str], Pattern[str]]:
        """
        Compile patterns into a single alternation regex.

        If a dict is given, build a pattern with **named** groups to preserve the
        provided names. If a list is given, synthesize names (k0, k1, ...).

        Args:
            patterns (Union[List[str], Dict[str, str]]): Patterns or mapping.
            flags (int): Regex flags.

        Returns:
            Tuple[List[str], Pattern[str]]: Names and compiled regex.

        Raises:
            SplitterConfigException: If patterns have an unsupported type.
            re.error: If regex compilation fails (caught in __init__).
        """
        if isinstance(patterns, dict):
            names: list = list(patterns.keys())
            parts: list = [f"(?P<{name}>{pat})" for name, pat in patterns.items()]
        elif isinstance(patterns, list):
            names: list = [f"k{i}" for i in range(len(patterns))]
            parts: list = [f"(?P<{n}>{pat})" for n, pat in zip(names, patterns)]
        else:
            # Should be prevented by __init__, but keep as guardrail.
            raise SplitterConfigException(
                "patterns must be a list of regex strings or a dict[name -> pattern]"
            )

        combined: str = (
            "|".join(parts) if parts else r"(?!x)x"
        )  # never matches if empty
        compiled: re.Pattern = re.compile(combined, flags)
        return names, compiled

    def _split_by_keywords(
        self, text: str
    ) -> Tuple[List[str], List[Tuple[int, int]], List[str]]:
        """
        Split ``text`` around matches of ``self.compiled``.

        Respects include_delimiters in {"before", "after", "both", "none"}.

        Args:
            text (str): The text to split.

        Returns:
            Tuple[List[str], List[Tuple[int, int]], List[str]]:
                (chunks, spans, names) where `chunks` are before size re-wrapping,
                spans are (start, end) tuples, and names are group names for each match.
        """

        def _append_chunk(acc: List[str], chunk: str) -> None:
            if chunk and chunk.strip():
                acc.append(chunk)

        chunks: list[str] = []
        spans: list[tuple[int, int]] = []
        names: list[str] = []

        matches: list = list(self.compiled.finditer(text))
        last_idx: int = 0
        pending_prefix: str = ""  # used when include_delimiters is "after" or "both"

        for m in matches:
            start, end = m.span()
            match_txt: str = text[start:end]
            group_name: str = m.lastgroup or "unknown"

            spans.append((start, end))
            names.append(group_name)

            # Build the piece between last match end and this match start,
            # prefixing any pending delimiter
            before_piece: str = pending_prefix + text[last_idx:start]
            pending_prefix: str = ""

            # Attach delimiter to the left side if requested
            if self.include_delimiters in ("before", "both"):
                before_piece += match_txt

            _append_chunk(chunks, before_piece)

            # If delimiter should be on the right, carry it
            # forward to prefix next chunk
            if self.include_delimiters in ("after", "both"):
                pending_prefix = match_txt

            last_idx: int = end

        # Remainder after the last match (may contain pending_prefix)
        remainder: str = pending_prefix + text[last_idx:]
        _append_chunk(chunks, remainder)

        if not chunks:
            return [""], spans, names

        # normalize whitespace trimming for each chunk
        chunks: list[str] = [c.strip() for c in chunks if c and c.strip()]

        if not chunks:
            return [""], spans, names

        return chunks, spans, names

    @staticmethod
    def _soft_wrap(text: str, max_size: int) -> List[str]:
        """
        Greedy soft-wrap by whitespace to respect ``max_size``.

        - If ``len(text) <= max_size``: return ``[text]``.
        - Else: split on whitespace and rebuild lines greedily.
        - If a single token is longer than ``max_size``, it is hard-split.

        Args:
            text (str): Text to wrap.
            max_size (int): Maximum chunk size.

        Returns:
            List[str]: List of size-constrained chunks.
        """
        if max_size <= 0 or len(text) <= max_size:
            return [text] if text else []

        tokens = re.findall(r"\S+|\s+", text)
        out: List[str] = []
        buf = ""
        for tok in tokens:
            if len(buf) + len(tok) <= max_size:
                buf += tok
                continue
            if buf:
                out.append(buf)
                buf = ""
            # token alone is too big -> hard split
            while len(tok) > max_size:
                out.append(tok[:max_size])
                tok = tok[max_size:]
            buf = tok
        if buf:
            out.append(buf)
        return [c for c in (s.strip() for s in out) if c]

    @staticmethod
    def _count_by_name(names: Iterable[str]) -> Dict[str, int]:
        """
        Aggregate match counts by group name (k0/k1/... for list patterns,
        custom names for dict).

        Args:
            names (Iterable[str]): Group names.

        Returns:
            Dict[str, int]: Count of matches per group name.
        """
        counts: Dict[str, int] = {}
        for n in names:
            counts[n] = counts.get(n, 0) + 1
        return counts

    def _build_output(
        self,
        reader_output: ReaderOutput,
        chunks: List[str],
        chunk_ids: List[str],
        matches_meta: Dict[str, object],
    ) -> SplitterOutput:
        """
        Assemble a :class:`SplitterOutput` carrying over reader metadata.

        Args:
            reader_output (ReaderOutput): Input document and metadata.
            chunks (List[str]): Final list of chunks.
            chunk_ids (List[str]): Unique chunk IDs.
            matches_meta (Dict[str, object]): Keyword matches metadata.

        Returns:
            SplitterOutput: Populated output object.
        """
        return SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="keyword",
            split_params={
                "include_delimiters": self.include_delimiters,
                "flags": self.flags,
                "chunk_size": self.chunk_size,
                "pattern_names": self.pattern_names,
            },
            metadata={
                **(reader_output.metadata or {}),
                "keyword_matches": matches_meta,
            },
        )
split(reader_output)

Split ReaderOutput into keyword-delimited chunks and build structured output.

The method first splits around regex keyword matches (respecting include_delimiters), then performs a secondary size-based soft wrap to respect chunk_size. It returns a fully populated :class:SplitterOutput.

Parameters:

Name Type Description Default
reader_output ReaderOutput

Input document and metadata.

required

Returns:

Name Type Description
SplitterOutput SplitterOutput

Output structure with chunked text and metadata.

Raises:

Type Description
ReaderOutputException

If reader_output has an invalid structure.

InvalidChunkException

If the number of chunks and chunk IDs diverge.

SplitterOutputException

If constructing the output object fails.

Example

Basic usage with a list of patterns:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import KeywordSplitter

text = "Alpha KEY Beta KEY Gamma"
ro = ReaderOutput(
    text=text,
    document_name="demo.txt",
    document_path="/tmp/demo.txt",
)

splitter = KeywordSplitter(patterns=[r"KEY"])
out = splitter.split(ro)

print(out.chunks)
['Alpha KEY', 'Beta KEY', 'Gamma']

Using a dict of named patterns (names appear in metadata):

patterns = {
    "plus": r"\+",
    "minus": r"-",
}
text = "A + B - C + D"
ro = ReaderOutput(text=text)

splitter = KeywordSplitter(patterns=patterns)
out = splitter.split(ro)

print(out.chunks)
['A +', 'B -', 'C +', 'D']
print(out.metadata["keyword_matches"]["counts"])
{'plus': 2, 'minus': 1}

Demonstrating include_delimiters modes:

text = "A#B#C"

splitter = KeywordSplitter(patterns=[r"#"], include_delimiters="after")
out = splitter.split(ReaderOutput(text=text))
print(out.chunks)
['A#', 'B#', 'C']
splitter = KeywordSplitter(patterns=[r"#"], include_delimiters="none")
out = splitter.split(ReaderOutput(text=text))
print(out.chunks)
['A', 'B', 'C']

Example showing size-based soft wrapping (chunk_size=5):

text = "abcdefghijklmnopqrstuvwxyz"
splitter = KeywordSplitter(patterns=[r"x"], chunk_size=5)
out = splitter.split(ReaderOutput(text=text))
print(out.chunks)
['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']

Example with multiple patterns and mixed text:

splitter = KeywordSplitter(
    patterns=[r"ERROR", r"WARNING"],
    include_delimiters="after",
)

log = "INFO Start\nERROR Failure occurred\nWARNING Low RAM\nINFO End"
out = splitter.split(ReaderOutput(text=log))

print(out.chunks)
['INFO Start\nERROR', 'Failure occurred\nWARNING', 'Low RAM\nINFO End']
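
Although not shown above, the flags argument combines with any of these examples. The following is a minimal, hypothetical sketch (not part of the original examples) assuming case-insensitive matching; the printed counts follow from this particular input:

```python
import re

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import KeywordSplitter

splitter = KeywordSplitter(
    patterns={"error": r"error", "warning": r"warning"},
    flags=re.IGNORECASE,  # also matches ERROR, Error, WARNING, ...
    include_delimiters="after",
)
out = splitter.split(ReaderOutput(text="INFO Start\nERROR boom\nWarning low RAM"))

print(out.metadata["keyword_matches"]["counts"])  # {'error': 1, 'warning': 1}
```
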
Source code in src/splitter_mr/splitter/splitters/keyword_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Split ReaderOutput into keyword-delimited chunks and build structured output.

    The method first splits around regex keyword matches (respecting
    ``include_delimiters``), then performs a secondary size-based soft wrap to
    respect ``chunk_size``. It returns a fully populated :class:`SplitterOutput`.

    Args:
        reader_output (ReaderOutput): Input document and metadata.

    Returns:
        SplitterOutput: Output structure with chunked text and metadata.

    Raises:
        ReaderOutputException: If ``reader_output`` has an invalid structure.
        InvalidChunkException: If the number of chunks and chunk IDs diverge.
        SplitterOutputException: If constructing the output object fails.

    Example:
        **Basic usage** with a **list** of patterns:

        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter.splitters import KeywordSplitter

        text = "Alpha KEY Beta KEY Gamma"
        ro = ReaderOutput(
            text=text,
            document_name="demo.txt",
            document_path="/tmp/demo.txt",
        )

        splitter = KeywordSplitter(patterns=[r"KEY"])
        out = splitter.split(ro)

        print(out.chunks)
        ```

        ```python
        ['Alpha KEY', 'Beta KEY', 'Gamma']
        ```

        Using a **`dict` of named patterns** (names appear in metadata):

        ```python
        patterns = {
            "plus": r"\\+",
            "minus": r"-",
        }
        text = "A + B - C + D"
        ro = ReaderOutput(text=text)

        splitter = KeywordSplitter(patterns=patterns)
        out = splitter.split(ro)

        print(out.chunks)
        ```

        ```python
        ['A +', 'B -', 'C +', 'D']
        ```

        ```python
        print(out.metadata["keyword_matches"]["counts"])
        ```

        ```json
        {'plus': 2, 'minus': 1}
        ```

        Demonstrating ``include_delimiters`` modes:

        ```python
        text = "A#B#C"

        splitter = KeywordSplitter(patterns=[r"#"], include_delimiters="after")
        out = splitter.split(ReaderOutput(text=text))
        print(out.chunks)
        ```

        ```python
        ['A#', 'B#', 'C']
        ```

        ```python
        splitter = KeywordSplitter(patterns=[r"#"], include_delimiters="none")
        out = splitter.split(ReaderOutput(text=text))
        print(out.chunks)
        ```

        ```python
        ['A', 'B', 'C']
        ```

        Example showing **size-based soft wrapping** (`chunk_size=5`):

        ```python
        text = "abcdefghijklmnopqrstuvwxyz"
        splitter = KeywordSplitter(patterns=[r"x"], chunk_size=5)
        ```

        ```python
        out = splitter.split(ReaderOutput(text=text))
        print(out.chunks)
        ```

        ```python
        ['abcde', 'fghij', 'klmno', 'pqrst', 'uvwxy', 'z']
        ```

        Example with **multiple patterns and mixed text**:

        ```python
        splitter = KeywordSplitter(
            patterns=[r"ERROR", r"WARNING"],
            include_delimiters="after",
        )

        log = "INFO Start\\nERROR Failure occurred\\nWARNING Low RAM\\nINFO End"
        out = splitter.split(ReaderOutput(text=log))

        print(out.chunks)
        ```

        ```python
        ['INFO Start\\nERROR', 'Failure occurred\\nWARNING', 'Low RAM\\nINFO End']
        ```
    """
    if not hasattr(reader_output, "text"):
        raise ReaderOutputException(
            "ReaderOutput object must expose a 'text' attribute."
        )

    text = reader_output.text
    if text is None:
        text: str = ""
    elif not isinstance(text, str):
        raise ReaderOutputException(
            f"ReaderOutput.text must be of type 'str' or None, got "
            f"{type(text).__name__!r}"
        )

    # Warn on suspiciously empty input
    if not text.strip():
        warnings.warn(
            "KeywordSplitter received empty or whitespace-only text; "
            "output will contain a single empty chunk.",
            SplitterInputWarning,
            stacklevel=2,
        )

    # Ensure document_id is present so it propagates (fixes metadata test)
    if not getattr(reader_output, "document_id", None):
        reader_output.document_id = str(uuid.uuid4())

    # Primary split by keyword matches (names used for counts)
    raw_chunks, match_spans, match_names = self._split_by_keywords(text)

    # Secondary size-based re-chunking to respect chunk_size
    sized_chunks: list[str] = []
    for ch in raw_chunks:
        sized_chunks.extend(self._soft_wrap(ch, self.chunk_size))
    if not sized_chunks:
        sized_chunks: list[str] = [""]

    # Generate IDs
    chunk_ids = self._generate_chunk_ids(len(sized_chunks))

    # Extra sanity check: chunks vs IDs
    if len(chunk_ids) != len(sized_chunks):
        raise InvalidChunkException(
            "Number of chunk IDs does not match number of chunks "
            f"(chunk_ids={len(chunk_ids)}, chunks={len(sized_chunks)})."
        )

    # Build metadata (ensure counts/spans are always present)
    matches_meta: Dict[str, object] = {
        "counts": self._count_by_name(match_names),
        "spans": match_spans,
        "include_delimiters": self.include_delimiters,
        "flags": self.flags,
        "pattern_names": self.pattern_names,
        "chunk_size": self.chunk_size,
    }

    try:
        return self._build_output(
            reader_output=reader_output,
            chunks=sized_chunks,
            chunk_ids=chunk_ids,
            matches_meta=matches_meta,
        )
    except (TypeError, ValueError) as exc:
        raise SplitterOutputException(
            f"Failed to build SplitterOutput in KeywordSplitter: {exc}"
        ) from exc

HeaderSplitter

HeaderSplitter

Bases: BaseSplitter

Split HTML or Markdown documents into chunks by header levels (H1–H6).

  • If the input looks like HTML, it is first converted to Markdown using the project's HtmlToMarkdown utility, which emits ATX-style headings (#, ##, ...).
  • If the input is Markdown, Setext-style headings (underlines with === / ---) are normalized to ATX so headers are reliably detected (see the example after this list).
  • Splitting is performed with LangChain's MarkdownHeaderTextSplitter.
  • If no headers are detected after conversion/normalization, a safe fallback splitter (RecursiveCharacterTextSplitter) is used to avoid returning a single, excessively large chunk.
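
As a quick illustration of the Setext normalization mentioned above, the sketch below reuses the same regex rewrite that `_normalize_setext` applies in the source listing further down (code fences, which the real implementation protects, are ignored here):

```python
import re

setext = "Title\n=====\n\nIntro.\n\nSection A\n---------\nContent A."

# '=' underlines become '# ' (H1) and '-' underlines become '## ' (H2).
atx = re.sub(r"^(?P<t>[^\n]+)\n=+\s*$", r"# \g<t>", setext, flags=re.M)
atx = re.sub(r"^(?P<t>[^\n]+)\n-+\s*$", r"## \g<t>", atx, flags=re.M)

print(atx)
# # Title
#
# Intro.
#
# ## Section A
# Content A.
```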

Parameters:

Name Type Description Default
chunk_size int

Size hint for fallback splitting; not used by header splitting itself.

1000
headers_to_split_on Optional[Sequence[ALLOWED_HEADERS_LITERAL]]

Semantic header names like ("Header 1", "Header 2"). If None (default), all allowed headers are enabled (ALLOWED_HEADERS).

None
group_header_with_content bool

If True (default), headers are kept with their following content (strip_headers=False). If False, headers are removed from the chunks (strip_headers=True).

True

Raises:

Type Description
InvalidHeaderNameError

If any header is not present in ALLOWED_HEADERS.

Source code in src/splitter_mr/splitter/splitters/header_splitter.py
class HeaderSplitter(BaseSplitter):
    """Split HTML or Markdown documents into chunks by header levels (H1–H6).

    - If the input looks like HTML, it is first converted to Markdown using the
      project's HtmlToMarkdown utility, which emits ATX-style headings (`#`, `##`, ...).
    - If the input is Markdown, Setext-style headings (underlines with `===` / `---`)
      are normalized to ATX so headers are reliably detected.
    - Splitting is performed with LangChain's MarkdownHeaderTextSplitter.
    - If no headers are detected after conversion/normalization, a safe fallback
      splitter (RecursiveCharacterTextSplitter) is used to avoid returning a single,
      excessively large chunk.

    Args:
        chunk_size: Size hint for fallback splitting; not used by header splitting itself.
        headers_to_split_on: Semantic header names like ``("Header 1", "Header 2")``.
            If ``None`` (default), all allowed headers are enabled (``ALLOWED_HEADERS``).
        group_header_with_content: If ``True`` (default), headers are kept with their
            following content (``strip_headers=False``). If ``False``, headers are removed
            from the chunks (``strip_headers=True``).

    Raises:
        InvalidHeaderNameError: If any header is not present in ``ALLOWED_HEADERS``.
    """

    def __init__(
        self,
        chunk_size: int = 1000,
        headers_to_split_on: Optional[Sequence[HeaderName]] = None,
        *,
        group_header_with_content: bool = True,
    ):
        super().__init__(chunk_size)

        # Use immutable default and validate any user-supplied values.
        if headers_to_split_on is None:
            safe_headers: Tuple[HeaderName, ...] = cast(
                Tuple[HeaderName, ...], ALLOWED_HEADERS
            )
        else:
            safe_headers = self._validate_headers(headers_to_split_on)

        self.headers_to_split_on: Tuple[HeaderName, ...] = safe_headers
        self.group_header_with_content = bool(group_header_with_content)

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Perform header-based splitting with HTML→Markdown conversion and safe fallback.

        Steps:
            1. Detect filetype (HTML/MD).
            2. If HTML, convert to Markdown with HtmlToMarkdown (emits ATX headings).
            3. If Markdown, normalize Setext headings to ATX.
            4. Split by headers via MarkdownHeaderTextSplitter.
            5. If no headers found, fallback to RecursiveCharacterTextSplitter.

        Args:
            reader_output: The reader output containing text and metadata.

        Returns:
            SplitterOutput: A populated splitter output with chunk contents and metadata.

        Warnings:
            SplitterInputWarning: If the text field in ReaderOutput is missing or empty.

        Raises:
            HtmlConversionError: If the HTML to Markdown conversion fails.

        Example:
            Basic Markdown input with **default headers** (H1–H6), keeping headers with content:

            ```python
            from splitter_mr.splitter import HeaderSplitter
            from splitter_mr.schema.models import ReaderOutput

            md = (
                "# Title\\n"
                "Intro paragraph.\\n\\n"
                "## Section A\\n"
                "Content A.\\n\\n"
                "## Section B\\n"
                "Content B."
            )
            ro = ReaderOutput(text=md, document_name="example.md")

            splitter = HeaderSplitter(group_header_with_content=True)  # keep headers in chunks
            out = splitter.split(ro)
            print(out.chunks)
            ```
            ```python
            [
                "# Title\\nIntro paragraph.",
                "## Section A\\nContent A.",
                "## Section B\\nContent B."
            ]
            ```

            HTML input with a **restricted set of headers and stripping headers** from chunks:

            ```python
            html = (
                "<h1>Title</h1>"
                "<p>Intro paragraph.</p>"
                "<h2>Section A</h2>"
                "<p>Content A.</p>"
                "<h3>Sub A.1</h3>"
                "<p>Detail A.1</p>"
            )
            ro = ReaderOutput(text=html, document_name="example.html")

            # Only split on Header 1 and Header 2 (i.e., H1/H2)
            splitter = HeaderSplitter(
                headers_to_split_on=("Header 1", "Header 2"),
                group_header_with_content=False  # drop headers from chunks
            )
            out = splitter.split(ro)
            print(out.chunks)
            ```
            ```python
            [
                "Intro paragraph.",
                "Content A.\\nSub A.1\\nDetail A.1"
            ]
            ```
        """
        text: str = reader_output.text
        if text is None or not str(text).strip():
            warnings.warn(
                SplitterInputWarning(
                    "ReaderOutput.text is empty or whitespace-only. "
                    "Proceeding; this will yield a single empty chunk."
                )
            )
            chunks: list[str] = [""]
            return SplitterOutput(
                chunks=chunks,
                chunk_id=self._generate_chunk_ids(len(chunks)),
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="header_splitter",
                split_params={
                    "headers_to_split_on": list(self.headers_to_split_on),
                    "group_header_with_content": self.group_header_with_content,
                },
                metadata=self._default_metadata(),
            )

        filetype: str = self._guess_filetype(reader_output)
        tuples: list[tuple] = self._make_tuples("md")

        text: str = reader_output.text

        # HTML → Markdown using the project's converter
        if filetype == "html":
            try:
                text: str = HtmlToMarkdown().convert(text)
            except Exception as e:
                raise HtmlConversionError(
                    f"HTML→Markdown failed for {reader_output.document_name!r}"
                ) from e
        else:
            text: str = self._normalize_setext(text)

        # Detect presence of ATX headers (after conversion/normalization)
        has_headers: bool = bool(re.search(r"(?m)^\s*#{1,6}\s+\S", text))

        # Configure header splitter. group_header_with_content -> strip_headers False
        splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=tuples,
            return_each_line=False,
            strip_headers=not self.group_header_with_content,
        )

        docs: list[str] = splitter.split_text(text) if has_headers else []
        # Fallback if no headers were found
        if not docs:
            rc = RecursiveCharacterTextSplitter(
                chunk_size=max(1, int(self.chunk_size) or 1000),
                chunk_overlap=min(200, max(0, int(self.chunk_size) // 10)),
            )
            docs: list = rc.create_documents([text])

        chunks: list[str] = [doc.page_content for doc in docs]

        return SplitterOutput(
            chunks=chunks,
            chunk_id=self._generate_chunk_ids(len(chunks)),
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="header_splitter",
            split_params={
                "headers_to_split_on": list(self.headers_to_split_on),
                "group_header_with_content": self.group_header_with_content,
            },
            metadata=self._default_metadata(),
        )

    # ---- Helpers ---- #

    @staticmethod
    def _validate_headers(headers: Sequence[str]) -> Tuple[HeaderName, ...]:
        """Validate that headers are a subset of ALLOWED_HEADERS and return an immutable tuple.

        Args:
            headers: Proposed list/tuple of header names.

        Returns:
            A tuple of validated header names.

        Raises:
            InvalidHeaderNameError: If any header is not present in ``ALLOWED_HEADERS``.
        """
        invalid: list = [h for h in headers if h not in ALLOWED_HEADERS]
        if invalid:
            allowed_display: str = ", ".join(ALLOWED_HEADERS)
            bad_display: str = ", ".join(invalid)
            raise InvalidHeaderNameError(
                f"Invalid headers: [{bad_display}]. "
                f"Allowed values are: [{allowed_display}]."
            )
        # Preserve caller order but store immutably.
        return cast(Tuple[HeaderName, ...], tuple(headers))

    def _make_tuples(self, filetype: str) -> List[Tuple[str, str]]:
        """Convert semantic header names (e.g., ``"Header 2"``) into Markdown tokens.

        Args:
            filetype: Only ``"md"`` is supported (HTML is converted to MD first).

        Returns:
            Tuples of ``(header_token, semantic_name)``, e.g., ``("##", "Header 2")``.

        Raises:
            SplitterConfigException: If an unsupported filetype is provided.
        """
        tuples: list[tuple[str, str]] = []
        for header in self.headers_to_split_on:
            lvl = self._header_level(header)
            if filetype == "md":
                tuples.append(("#" * lvl, header))
            else:
                raise SplitterConfigException(f"Unsupported filetype: {filetype!r}")
        return tuples

    @staticmethod
    def _header_level(header: str) -> int:
        """Extract numeric level from a header name like ``"Header 2"``.

        Args:
            header: The header label.

        Returns:
            The numeric level extracted from the header label.

        Raises:
            InvalidHeaderNameError: If the header string is not of the expected form.
            HeaderLevelOutOfRangeError: If the header level is outside the range 1–7.
        """
        m = re.match(r"header\s*(\d+)", header.lower())
        if not m:
            raise InvalidHeaderNameError(f"Expected 'Header N', got: {header!r}")
        level = int(m.group(1))
        if not 1 <= level <= 7:
            raise HeaderLevelOutOfRangeError(
                f"Header level {level} out of range [1..7]"
            )
        return level

    @staticmethod
    def _guess_filetype(reader_output: ReaderOutput) -> str:
        """Heuristically determine whether the input is HTML or Markdown.

        The method first checks the filename extension, then uses lightweight HTML
        detection via BeautifulSoup as a fallback.

        Args:
            reader_output: The input document and metadata.

        Returns:
            ``"html"`` if the text appears to be HTML, otherwise ``"md"``.

        Warnings:
            FiletypeAmbiguityWarning: Warned if the file extension and the
            detected DOM structure disagree.
        """
        name: str = (reader_output.document_name or "").lower()
        md_ext: Optional[str] = "md" if name.endswith((".md", ".markdown")) else None
        ext_hint: Optional[str] = "html" if name.endswith((".html", ".htm")) else md_ext

        text: str = reader_output.text or ""
        soup = BeautifulSoup(text, "html.parser")
        dom_hint: str = (
            "html"
            if (
                soup.find("html")  # noqa: W503
                or soup.find(re.compile(r"^h[1-6]$"))  # noqa: W503
                or soup.find("div")  # noqa: W503
            )
            else "md"
        )

        if ext_hint and ext_hint != dom_hint:
            warnings.warn(
                FiletypeAmbiguityWarning(
                    f"Filetype heuristics disagree for {name!r}: "
                    f"extension suggests {ext_hint}, DOM suggests {dom_hint}. "
                    f"Proceeding with {dom_hint}."
                )
            )
        return dom_hint

    @staticmethod
    def _normalize_setext(md_text: str) -> str:
        """Normalize Setext-style headings to ATX so MarkdownHeaderTextSplitter
        can detect them.

        Transformations:
            - ``H1:  Title\\n====  →  # Title``
            - ``H2:  Title\\n----  →  ## Title``

        Args:
            md_text: Raw Markdown text possibly containing Setext headings.

        Returns:
            Markdown text with Setext headings rewritten as ATX headings.

        Raises:
            NormalizationError: if regular expression normalization fails.
        """
        fence: re.Pattern = re.compile(r"(^```.*?$)(.*?)(^```$)", flags=re.M | re.S)
        placeholders: list[str] = []

        def _stash(m: re.Match) -> str:
            placeholders.append(m.group(0))
            return f"__CODEFENCE_PLACEHOLDER_{len(placeholders) - 1}__"

        try:
            protected: str = re.sub(fence, _stash, md_text)
        except re.error as e:
            raise NormalizationError(f"Failed to scan code fences: {e}") from e

        # Normalize setext in the protected text only (outside fences)
        try:
            protected: str = re.sub(
                r"^(?P<t>[^\n]+)\n=+\s*$", r"# \g<t>", protected, flags=re.M
            )
            protected: str = re.sub(
                r"^(?P<t>[^\n]+)\n-+\s*$", r"## \g<t>", protected, flags=re.M
            )
        except re.error as e:
            raise NormalizationError(f"Setext→ATX normalization failed: {e}") from e

        # Restore code fences
        def _unstash(match: re.Match) -> str:
            idx: int = int(match.group(1))
            return placeholders[idx]

        try:
            normalized: str = re.sub(
                r"__CODEFENCE_PLACEHOLDER_(\d+)__", _unstash, protected
            )
        except Exception as e:
            raise NormalizationError(
                "Failed to restore code fences after normalization"
            ) from e

        if re.search(r"^[^\n]+\n[=-]{2,}\s*$", normalized, flags=re.M):
            raise NormalizationError(
                "Unnormalized Setext headings remain after normalization"
            )

        return normalized
split(reader_output)

Perform header-based splitting with HTML→Markdown conversion and safe fallback.

Steps
  1. Detect filetype (HTML/MD).
  2. If HTML, convert to Markdown with HtmlToMarkdown (emits ATX headings).
  3. If Markdown, normalize Setext headings to ATX.
  4. Split by headers via MarkdownHeaderTextSplitter.
  5. If no headers found, fallback to RecursiveCharacterTextSplitter.

Parameters:

Name Type Description Default
reader_output ReaderOutput

The reader output containing text and metadata.

required

Returns:

Name Type Description
SplitterOutput SplitterOutput

A populated splitter output with chunk contents and metadata.

Warns:

Type Description
SplitterInputWarning

If the text field in ReaderOutput is missing or empty.

Raises:

Type Description
HtmlConversionError

If the HTML to Markdown conversion fails.

Example

Basic Markdown input with default headers (H1–H6), keeping headers with content:

from splitter_mr.splitter import HeaderSplitter
from splitter_mr.schema.models import ReaderOutput

md = (
    "# Title\n"
    "Intro paragraph.\n\n"
    "## Section A\n"
    "Content A.\n\n"
    "## Section B\n"
    "Content B."
)
ro = ReaderOutput(text=md, document_name="example.md")

splitter = HeaderSplitter(group_header_with_content=True)  # keep headers in chunks
out = splitter.split(ro)
print(out.chunks)
[
    "# Title\nIntro paragraph.",
    "## Section A\nContent A.",
    "## Section B\nContent B."
]

HTML input with a restricted set of headers and stripping headers from chunks:

html = (
    "<h1>Title</h1>"
    "<p>Intro paragraph.</p>"
    "<h2>Section A</h2>"
    "<p>Content A.</p>"
    "<h3>Sub A.1</h3>"
    "<p>Detail A.1</p>"
)
ro = ReaderOutput(text=html, document_name="example.html")

# Only split on Header 1 and Header 2 (i.e., H1/H2)
splitter = HeaderSplitter(
    headers_to_split_on=("Header 1", "Header 2"),
    group_header_with_content=False  # drop headers from chunks
)
out = splitter.split(ro)
print(out.chunks)
[
    "Intro paragraph.",
    "Content A.\nSub A.1\nDetail A.1"
]
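
For completeness, a hypothetical sketch of the fallback path: the input below contains no headers, so the chunks come from the RecursiveCharacterTextSplitter fallback driven by chunk_size (exact chunk boundaries depend on that fallback and are therefore not shown):

```python
from splitter_mr.splitter import HeaderSplitter
from splitter_mr.schema.models import ReaderOutput

plain = "Just a long plain paragraph without any headings. " * 50
ro = ReaderOutput(text=plain, document_name="plain.md")

splitter = HeaderSplitter(chunk_size=500)
out = splitter.split(ro)

print(len(out.chunks) > 1)  # True: fallback splitting kicked in
print(out.split_method)     # 'header_splitter'
```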

Source code in src/splitter_mr/splitter/splitters/header_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Perform header-based splitting with HTML→Markdown conversion and safe fallback.

    Steps:
        1. Detect filetype (HTML/MD).
        2. If HTML, convert to Markdown with HtmlToMarkdown (emits ATX headings).
        3. If Markdown, normalize Setext headings to ATX.
        4. Split by headers via MarkdownHeaderTextSplitter.
        5. If no headers found, fallback to RecursiveCharacterTextSplitter.

    Args:
        reader_output: The reader output containing text and metadata.

    Returns:
        SplitterOutput: A populated splitter output with chunk contents and metadata.

    Warnings:
        SplitterInputWarning: If the text field in ReaderOutput is missing or empty.

    Raises:
        HtmlConversionError: If the HTML to Markdown conversion fails.

    Example:
        Basic Markdown input with **default headers** (H1–H6), keeping headers with content:

        ```python
        from splitter_mr.splitter import HeaderSplitter
        from splitter_mr.schema.models import ReaderOutput

        md = (
            "# Title\\n"
            "Intro paragraph.\\n\\n"
            "## Section A\\n"
            "Content A.\\n\\n"
            "## Section B\\n"
            "Content B."
        )
        ro = ReaderOutput(text=md, document_name="example.md")

        splitter = HeaderSplitter(group_header_with_content=True)  # keep headers in chunks
        out = splitter.split(ro)
        print(out.chunks)
        ```
        ```python
        [
            "# Title\\nIntro paragraph.",
            "## Section A\\nContent A.",
            "## Section B\\nContent B."
        ]
        ```

        HTML input with a **restricted set of headers and stripping headers** from chunks:

        ```python
        html = (
            "<h1>Title</h1>"
            "<p>Intro paragraph.</p>"
            "<h2>Section A</h2>"
            "<p>Content A.</p>"
            "<h3>Sub A.1</h3>"
            "<p>Detail A.1</p>"
        )
        ro = ReaderOutput(text=html, document_name="example.html")

        # Only split on Header 1 and Header 2 (i.e., H1/H2)
        splitter = HeaderSplitter(
            headers_to_split_on=("Header 1", "Header 2"),
            group_header_with_content=False  # drop headers from chunks
        )
        out = splitter.split(ro)
        print(out.chunks)
        ```
        ```python
        [
            "Intro paragraph.",
            "Content A.\\nSub A.1\\nDetail A.1"
        ]
        ```
    """
    text: str = reader_output.text
    if text is None or not str(text).strip():
        warnings.warn(
            SplitterInputWarning(
                "ReaderOutput.text is empty or whitespace-only. "
                "Proceeding; this will yield a single empty chunk."
            )
        )
        chunks: list[str] = [""]
        return SplitterOutput(
            chunks=chunks,
            chunk_id=self._generate_chunk_ids(len(chunks)),
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="header_splitter",
            split_params={
                "headers_to_split_on": list(self.headers_to_split_on),
                "group_header_with_content": self.group_header_with_content,
            },
            metadata=self._default_metadata(),
        )

    filetype: str = self._guess_filetype(reader_output)
    tuples: list[tuple] = self._make_tuples("md")

    text: str = reader_output.text

    # HTML → Markdown using the project's converter
    if filetype == "html":
        try:
            text: str = HtmlToMarkdown().convert(text)
        except Exception as e:
            raise HtmlConversionError(
                f"HTML→Markdown failed for {reader_output.document_name!r}"
            ) from e
    else:
        text: str = self._normalize_setext(text)

    # Detect presence of ATX headers (after conversion/normalization)
    has_headers: bool = bool(re.search(r"(?m)^\s*#{1,6}\s+\S", text))

    # Configure header splitter. group_header_with_content -> strip_headers False
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=tuples,
        return_each_line=False,
        strip_headers=not self.group_header_with_content,
    )

    docs: list[str] = splitter.split_text(text) if has_headers else []
    # Fallback if no headers were found
    if not docs:
        rc = RecursiveCharacterTextSplitter(
            chunk_size=max(1, int(self.chunk_size) or 1000),
            chunk_overlap=min(200, max(0, int(self.chunk_size) // 10)),
        )
        docs: list = rc.create_documents([text])

    chunks: list[str] = [doc.page_content for doc in docs]

    return SplitterOutput(
        chunks=chunks,
        chunk_id=self._generate_chunk_ids(len(chunks)),
        document_name=reader_output.document_name,
        document_path=reader_output.document_path,
        document_id=reader_output.document_id,
        conversion_method=reader_output.conversion_method,
        reader_method=reader_output.reader_method,
        ocr_method=reader_output.ocr_method,
        split_method="header_splitter",
        split_params={
            "headers_to_split_on": list(self.headers_to_split_on),
            "group_header_with_content": self.group_header_with_content,
        },
        metadata=self._default_metadata(),
    )

RecursiveJSONSplitter

RecursiveJSONSplitter

Bases: BaseSplitter

Split a JSON string or structure into overlapping or non-overlapping chunks, using the Langchain RecursiveJsonSplitter. This splitter is designed to recursively break down JSON data (including nested objects and arrays) into manageable pieces based on keys, arrays, or other separators, until the desired chunk size is reached.

Parameters:

Name Type Description Default
chunk_size int

Maximum chunk size, measured in the number of characters per chunk.

1000
min_chunk_size int

Minimum chunk size, in characters.

200

Raises:

Type Description
SplitterConfigException

if parameters are not provided with the expected type.

Notes

See Langchain Docs on RecursiveJsonSplitter.
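
One detail that is easy to miss (taken from the source listing below, not from the prose above): chunk_size is forwarded as max_chunk_size, but the min_chunk_size handed to the underlying Langchain RecursiveJsonSplitter is chunk_size - min_chunk_size. A small illustrative calculation with the default values:

```python
chunk_size = 1000      # default
min_chunk_size = 200   # default

# What the wrapped langchain RecursiveJsonSplitter is configured with:
max_chunk_size_passed = chunk_size                   # 1000
min_chunk_size_passed = chunk_size - min_chunk_size  # 800
print(max_chunk_size_passed, min_chunk_size_passed)  # 1000 800
```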

Source code in src/splitter_mr/splitter/splitters/json_splitter.py
class RecursiveJSONSplitter(BaseSplitter):
    """
    Split a JSON string or structure into overlapping or non-overlapping chunks,
    using the Langchain RecursiveJsonSplitter. This splitter is designed to recursively
    break down JSON data (including nested objects and arrays) into manageable pieces based
    on keys, arrays, or other separators, until the desired chunk size is reached.

    Args:
        chunk_size (int): Maximum chunk size, measured in the number of characters per chunk.
        min_chunk_size (int): Minimum chunk size, in characters.

    Raises:
        SplitterConfigException: if parameters are not provided with the expected type.

    Notes:
        See [Langchain Docs on RecursiveJsonSplitter](https://python.langchain.com/api_reference/text_splitters/json/langchain_text_splitters.json.RecursiveJsonSplitter.html#langchain_text_splitters.json.RecursiveJsonSplitter).
    """

    def __init__(self, chunk_size: int = 1000, min_chunk_size: int = 200):
        super().__init__(chunk_size)

        if not isinstance(chunk_size, int):
            raise SplitterConfigException(
                "Parameter `chunk_size` must be an integer number"
            )
        if not isinstance(min_chunk_size, int):
            raise SplitterConfigException(
                "Parameter `min_chunk_size` must be an integer number"
            )

        self.min_chunk_size = min_chunk_size

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Splits the input JSON text from the reader_output into recursively chunked pieces,
        each no larger than the configured chunk size.

        Args:
            reader_output (Dict[str, Any]):
                Dictionary containing at least a 'text' key (str) and optional document metadata
                (e.g., 'document_name', 'document_path', etc.).

        Returns:
            SplitterOutput: Dataclass defining the output structure for all splitters.

        Raises:
            ValueError: If the 'text' field is missing from reader_output.
            json.JSONDecodeError: If the 'text' field contains invalid JSON.

        Example:
            ```python
            from splitter_mr.splitter import RecursiveJSONSplitter

            # This dictionary has been obtained from `VanillaReader`
            reader_output = ReaderOutput(
                text: '{"company": {"name": "TechCorp", "employees": [{"name": "Alice"}, {"name": "Bob"}]}}'
                document_name: "company_data.json",
                document_path: "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/company_data.json",
                document_id: "doc123",
                conversion_method: "vanilla",
                ocr_method: None
            )
            splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
            output = splitter.split(reader_output)
            print(output["chunks"])
            ```
            ```python
            ['{"company": {"name": "TechCorp"}}', '{"employees": [{"name": "Alice"}, {"name": "Bob"}]}']
            ```

        Raises:
            ReaderOutputException: if input does not contain a valid JSON.
            InvalidChunkException: if returned chunks are not in a valid format.
            SplitterOutputException: if response has not been generated as expected
        """
        # Initialize variables
        try:
            text = json.loads(reader_output.text)
        except json.JSONDecodeError as e:
            raise ReaderOutputException(f"Input does not contain a valid JSON: {e}")

        # Split text into smaller JSON chunks
        try:
            splitter = RecursiveJsonSplitter(
                max_chunk_size=self.chunk_size,
                min_chunk_size=int(self.chunk_size - self.min_chunk_size),
            )
            chunks = splitter.split_text(json_data=text, convert_lists=True)
        except Exception as e:
            raise InvalidChunkException(
                f"There was an error trying to split the JSON text: {e}"
            )

        if chunks is None or chunks == []:
            raise InvalidChunkException("Splitter has produced void or missing chunks")

        # Generate chunk_ids and metadata
        chunk_ids = self._generate_chunk_ids(len(chunks))
        metadata = self._default_metadata()

        # Return output
        try:
            output = SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="recursive_json_splitter",
                split_params={
                    "max_chunk_size": self.chunk_size,
                    "min_chunk_size": self.min_chunk_size,
                },
                metadata=metadata,
            )
        except Exception as exc:
            raise SplitterOutputException(
                f"There was an error trying to build SplitterOutput response: {exc}"
            )
        return output
split(reader_output)

Splits the input JSON text from the reader_output into recursively chunked pieces, each no larger than the configured chunk size.

Parameters:

Name Type Description Default
reader_output Dict[str, Any]

Dictionary containing at least a 'text' key (str) and optional document metadata (e.g., 'document_name', 'document_path', etc.).

required

Returns:

Name Type Description
SplitterOutput SplitterOutput

Dataclass defining the output structure for all splitters.

Raises:

Type Description
ValueError

If the 'text' field is missing from reader_output.

JSONDecodeError

If the 'text' field contains invalid JSON.

Example

from splitter_mr.splitter import RecursiveJSONSplitter

# This dictionary has been obtained from `VanillaReader`
reader_output = ReaderOutput(
    text: '{"company": {"name": "TechCorp", "employees": [{"name": "Alice"}, {"name": "Bob"}]}}'
    document_name: "company_data.json",
    document_path: "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/company_data.json",
    document_id: "doc123",
    conversion_method: "vanilla",
    ocr_method: None
)
splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
output = splitter.split(reader_output)
print(output["chunks"])
['{"company": {"name": "TechCorp"}}', '{"employees": [{"name": "Alice"}, {"name": "Bob"}]}']

Raises:

Type Description
ReaderOutputException

if input does not contain a valid JSON.

InvalidChunkException

if returned chunks are not in a valid format.

SplitterOutputException

if response has not been generated as expected
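
Callers typically guard the call accordingly. A minimal defensive sketch (the exact import path of these exception classes is not shown on this page, so the example catches them generically):

```python
from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter import RecursiveJSONSplitter

splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
reader_output = ReaderOutput(text='{"broken": ', document_name="broken.json")  # invalid JSON on purpose

try:
    output = splitter.split(reader_output)
except Exception as exc:  # e.g. ReaderOutputException for malformed JSON input
    print(f"{type(exc).__name__}: {exc}")
else:
    print(output.chunks)
```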

Source code in src/splitter_mr/splitter/splitters/json_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Splits the input JSON text from the reader_output into recursively chunked pieces,
    each no larger than the configured chunk size.

    Args:
        reader_output (Dict[str, Any]):
            Dictionary containing at least a 'text' key (str) and optional document metadata
            (e.g., 'document_name', 'document_path', etc.).

    Returns:
        SplitterOutput: Dataclass defining the output structure for all splitters.

    Raises:
        ValueError: If the 'text' field is missing from reader_output.
        json.JSONDecodeError: If the 'text' field contains invalid JSON.

    Example:
        ```python
        from splitter_mr.splitter import RecursiveJSONSplitter

        # This dictionary has been obtained from `VanillaReader`
        reader_output = ReaderOutput(
            text: '{"company": {"name": "TechCorp", "employees": [{"name": "Alice"}, {"name": "Bob"}]}}'
            document_name: "company_data.json",
            document_path: "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/company_data.json",
            document_id: "doc123",
            conversion_method: "vanilla",
            ocr_method: None
        )
        splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
        output = splitter.split(reader_output)
        print(output["chunks"])
        ```
        ```python
        ['{"company": {"name": "TechCorp"}}', '{"employees": [{"name": "Alice"}, {"name": "Bob"}]}']
        ```

    Raises:
        ReaderOutputException: if input does not contain a valid JSON.
        InvalidChunkException: if returned chunks are not in a valid format.
        SplitterOutputException: if response has not been generated as expected
    """
    # Initialize variables
    try:
        text = json.loads(reader_output.text)
    except json.JSONDecodeError as e:
        raise ReaderOutputException(f"Input does not contain a valid JSON: {e}")

    # Split text into smaller JSON chunks
    try:
        splitter = RecursiveJsonSplitter(
            max_chunk_size=self.chunk_size,
            min_chunk_size=int(self.chunk_size - self.min_chunk_size),
        )
        chunks = splitter.split_text(json_data=text, convert_lists=True)
    except Exception as e:
        raise InvalidChunkException(
            f"There was an error trying to split the JSON text: {e}"
        )

    if chunks is None or chunks == []:
        raise InvalidChunkException("Splitter has produced void or missing chunks")

    # Generate chunk_ids and metadata
    chunk_ids = self._generate_chunk_ids(len(chunks))
    metadata = self._default_metadata()

    # Return output
    try:
        output = SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="recursive_json_splitter",
            split_params={
                "max_chunk_size": self.chunk_size,
                "min_chunk_size": self.min_chunk_size,
            },
            metadata=metadata,
        )
    except Exception as exc:
        raise SplitterOutputException(
            f"There was an error trying to build SplitterOutput response: {exc}"
        )
    return output

HTMLTagSplitter

HTMLTagSplitter

Bases: BaseSplitter

Split HTML content by tag, with optional batching and Markdown conversion.

Behavior
  • When tag is provided (e.g., div), split by all matching elements.
  • When tag is None, auto-detect the most frequent and shallowest tag.
  • Tables receive special handling to preserve header context when batching.

Parameters:

Name Type Description Default
chunk_size int

Maximum chunk size in characters for batching. If 0, 1, or None, batching groups all elements into a single chunk.

1
tag Optional[str]

HTML tag to split on (e.g., "div"). If None, the tag is auto-detected.

None
batch bool

If True, group elements up to chunk_size. If False, emit one chunk per element.

True
to_markdown bool

If True, convert each emitted chunk from HTML to Markdown.

True

Raises:

Type Description
SplitterConfigException

If chunk_size is negative or non-integer, or if tag is a non-string/empty string.
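
A minimal sketch of the auto-detection path (`tag=None`), keeping raw HTML output by turning off the Markdown conversion; the HTML snippet is illustrative only:

```python
from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import HTMLTagSplitter

html = "<body><p>One</p><p>Two</p><p>Three</p></body>"
ro = ReaderOutput(text=html, document_name="auto.html")

# tag=None: auto-detect the most frequent, shallowest tag (<p> here).
# batch=False: one chunk per matched element, wrapped in a minimal HTML document.
splitter = HTMLTagSplitter(tag=None, batch=False, to_markdown=False)
out = splitter.split(ro)

print(out.split_params["tag"])  # the tag chosen by auto-detection
for chunk in out.chunks:
    print(chunk)
```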

Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
class HTMLTagSplitter(BaseSplitter):
    """Split HTML content by tag, with optional batching and Markdown conversion.

    Behavior:
      - When `tag` is provided (e.g., `div`), split by all matching elements.
      - When `tag` is `None`, auto-detect the most frequent and shallowest tag.
      - Tables receive special handling to preserve header context when batching.

    Args:
        chunk_size: Maximum chunk size in characters for batching. If `0`, `1`,
            or `None`, batching groups all elements into a single chunk.
        tag: HTML tag to split on (e.g., `"div"`). If `None`, the tag is auto-detected.
        batch: If True, group elements up to `chunk_size`. If False, emit one chunk per element.
        to_markdown: If True, convert each emitted chunk from HTML to Markdown.

    Raises:
        SplitterConfigException: If `chunk_size` is negative or non-integer, or if
            `tag` is a non-string/empty string.
    """

    def __init__(
        self,
        chunk_size: int = 1,
        tag: Optional[str] = None,
        *,
        batch: bool = True,
        to_markdown: bool = True,
    ):
        super().__init__(chunk_size)
        if chunk_size is not None and (
            not isinstance(chunk_size, int) or chunk_size < 0
        ):
            raise SplitterConfigException(
                f"chunk_size must be a non-negative int or None, got {chunk_size!r}"
            )
        self.tag = tag
        if self.tag is not None and (
            not isinstance(self.tag, str) or not self.tag.strip()
        ):
            raise SplitterConfigException(f"Invalid tag: '{self.tag!r}'")
        self.batch = batch
        self.to_markdown = to_markdown

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """Split HTML using the configured tag and batching, then optionally convert to Markdown.

        Semantics:
          - **Tables**
              - `batch=False`: one chunk per requested element. If splitting by a row-level tag
                (e.g., `tr`), emit a mini-table per row with `<thead>` once and that row in `<tbody>`.
              - `batch=True` and `chunk_size in (0, 1, None)`: all tables grouped into one chunk.
              - `batch=True` and `chunk_size > 1`: split each table into multiple chunks by batching
                `<tr>` rows; copy `<thead>` into every chunk and skip the header row from `<tbody>`.

          - **Non-table tags**
              - `batch=False`: one chunk per element.
              - `batch=True` and `chunk_size in (0, 1, None)`: all elements grouped into one chunk.
              - `batch=True` and `chunk_size > 1`: batch by total HTML length.

        Args:
          reader_output: Reader output containing at least `text`.

        Returns:
          SplitterOutput: The split result with chunks and metadata.

        Raises:
          HtmlConversionError: If parsing the HTML or converting chunks to Markdown fails.
          InvalidHtmlTagError: If the tag lookup (`find_all`) fails due to an invalid tag.
          SplitterOutputException: If building the final `SplitterOutput` fails.

        Example:
            **Basic usage** splitting **all `<div>` elements**:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter.splitters import HTMLTagSplitter

            html = '''
            <div>First block</div>
            <div>Second block</div>
            <div>Third block</div>
            '''

            ro = ReaderOutput(
                text=html,
                document_name="sample.html",
                document_path="/tmp/sample.html",
            )

            splitter = HTMLTagSplitter(chunk_size=10, tag="div", batch=False)
            output = splitter.split(ro)

            print(output.chunks)
            ```

            ```python
            ['<div>First block</div>','<div>Second block</div>','<div>Third block</div>']
            ```

            Example with **batching** (all `<p>` elements grouped into one chunk):

            ```python
            html = "<p>A</p><p>B</p><p>C</p>"
            ro = ReaderOutput(text=html, document_name="demo.html")

            splitter = HTMLTagSplitter(chunk_size=1, tag="p", batch=True)
            out = splitter.split(ro)

            print(out.chunks[0])
            ```

            ```python
            '<p>A</p>\\n<p>B</p>\\n<p>C</p>'
            ```

            Example with **table batching** (each chunk contains a header and 2 rows):

            ```python
            html = '''
            <table>
                <thead><tr><th>H1</th><th>H2</th></tr></thead>
                <tbody>
                    <tr><td>A</td><td>1</td></tr>
                    <tr><td>B</td><td>2</td></tr>
                    <tr><td>C</td><td>3</td></tr>
                </tbody>
            </table>
            '''

            ro = ReaderOutput(text=html, document_name="table.html")

            splitter = HTMLTagSplitter(
                chunk_size=2,       # batch <tr> rows in groups of 2
                tag="tr",           # split by table rows
                batch=True,
            )
            out = splitter.split(ro)

            for i, c in enumerate(out.chunks, 1):
                print(f"--- CHUNK {i} ---")
                print(c)
            ```

            Example **enabling Markdown conversion**:

            ```python
            html = "<h1>Title</h1><p>Paragraph text</p>"
            ro = ReaderOutput(text=html)

            splitter = HTMLTagSplitter(
                chunk_size=5,
                tag=None,
                batch=False,
                to_markdown=True,
            )
            out = splitter.split(ro)

            print(out.chunks)
            ```
            ```python
            ['# Title', 'Paragraph text']
            ```

        Notes:
          If the input text is empty/whitespace-only, a warning is emitted and
          a single empty chunk is returned.
        """
        html: str = getattr(reader_output, "text", "") or ""
        if not html.strip():
            warnings.warn(
                SplitterInputWarning(
                    "ReaderOutput.text is empty or whitespace-only. "
                    "Proceeding; this will yield a single empty chunk."
                )
            )
            return self._emit_result(
                chunks=[""],
                reader_output=reader_output,
                tag=self.tag or DEFAULT_HTML_TAG,
            )

        soup = self._parse_html(html)
        tag = self.tag or self._auto_tag(soup)

        elements, effective_tag = self._select_elements(soup, tag)

        chunks = self._dispatch_chunking(elements, effective_tag)
        if not chunks:
            warnings.warn(SplitterOutputWarning("Splitter has produced empty chunks"))
            chunks = [""]

        if self.to_markdown:
            chunks = self._convert_chunks_to_markdown(chunks)

        return self._emit_result(
            chunks=chunks,
            reader_output=reader_output,
            tag=effective_tag,
        )

    # ---- Helpers ---- #

    def _parse_html(self, html: str) -> bs4.BeautifulSoup:
        """Parse HTML into a BeautifulSoup document.

        Args:
          html: Raw HTML string.

        Returns:
          BeautifulSoup: Parsed document.

        Raises:
          HtmlConversionError: If parsing fails.
        """
        try:
            return bs4.BeautifulSoup(html, HTML_PARSER)
        except Exception as e:
            raise HtmlConversionError(f"BeautifulSoup failed to parse HTML: {e}") from e

    def _select_elements(self, soup: BeautifulSoup, tag: str) -> tuple[list, str]:
        """Select elements by tag and handle table escalation for batching.

        Args:
          soup: Parsed BeautifulSoup document.
          tag: Tag to search for.

        Returns:
          tuple[list, str]: `(elements, effective_tag)`. `effective_tag` may be
          `"table"` if row-level tags are escalated to tables for batching.

        Raises:
          InvalidHtmlTagError: If the selection fails (BeautifulSoup `find_all` error).
        """
        try:
            elements = soup.find_all(tag)
            if not elements:
                warnings.warn(
                    AutoTagFallbackWarning(f"No elements found for tag {tag!r}")
                )
        except Exception as e:
            raise InvalidHtmlTagError(
                f"find_all method has failed when locating {tag!r} on document."
            ) from e

        # Escalate row-level/table-children to tables when batching
        effective_tag = tag
        if self.batch and tag in TABLE_CHILDREN and elements:
            warnings.warn(
                BatchHtmlTableWarning(
                    "Batch process has been detected. "
                    "It will be split by elements in HTML table."
                )
            )
            seen = set()
            parent_tables = []
            for el in elements:
                table = el.find_parent("table")
                if table and id(table) not in seen:
                    seen.add(id(table))
                    parent_tables.append(table)
            if parent_tables:
                elements = parent_tables
                effective_tag = "table"

        return elements, effective_tag

    def _dispatch_chunking(self, elements: list, tag: str) -> List[str]:
        """Dispatch to table or non-table chunking based on tag.

        Args:
          elements: List of matched elements (or parent tables if escalated).
          tag: Effective tag name (possibly `"table"` after escalation).

        Returns:
          List[str]: HTML chunks.
        """
        if tag == "table":
            return self._chunk_tables(elements)
        return self._chunk_non_tables(elements, tag)

    def _chunk_tables(self, tables: list) -> List[str]:
        """Chunk table elements according to batching rules.

        Args:
          tables: List of `<table>` elements.

        Returns:
          List[str]: Table chunks as HTML strings.

        Raises:
          HtmlConversionError: Indirectly, if HTML-to-Markdown conversion is later applied.
        """
        if not self.batch:
            return [self._build_doc_with_children([el]) for el in tables]

        if self.chunk_size in (0, 1, None):
            return [self._build_doc_with_children(tables)] if tables else [""]

        # chunk_size > 1: batch rows within each table
        chunks: list = []
        for table_el in tables:
            _, rows, _ = self._extract_table_header_and_rows(table_el)
            if not rows:
                chunks.append(self._build_doc_with_children([table_el]))
                continue

            buf: list = []
            for row in rows:
                test_buf = buf + [row]
                test_html = self._build_table_chunk(table_el, test_buf)
                if len(test_html) > self.chunk_size and buf:
                    chunks.append(self._build_table_chunk(table_el, buf))
                    buf = [row]
                else:
                    buf = test_buf
            if buf:
                chunks.append(self._build_table_chunk(table_el, buf))

        return chunks

    def _convert_chunks_to_markdown(self, chunks: List[str]) -> List[str]:
        """Convert a list of HTML chunks to Markdown.

        Args:
          chunks: HTML chunks to convert.

        Returns:
          List[str]: Markdown strings.

        Raises:
          HtmlConversionError: If the conversion fails for any chunk.
        """
        try:
            converter = html_to_markdown.HtmlToMarkdown()
            return [converter.convert(c) for c in chunks]
        except Exception as e:
            raise HtmlConversionError("HTML to Markdown conversion failed") from e

    def _emit_result(
        self, chunks: List[str], reader_output: ReaderOutput, tag: str
    ) -> SplitterOutput:
        """Assemble the SplitterOutput with common metadata.

        Args:
          chunks: Final list of chunks (HTML or Markdown).
          reader_output: Original reader output.
          tag: Effective tag used for splitting (may differ from configured tag).

        Returns:
          SplitterOutput: Structured output including ids and metadata.

        Raises:
          SplitterOutputException: If building the `SplitterOutput` object fails.
        """
        try:
            return SplitterOutput(
                chunks=chunks,
                chunk_id=self._generate_chunk_ids(len(chunks)),
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="html_tag_splitter",
                split_params={
                    "chunk_size": self.chunk_size,
                    "tag": tag,
                    "batch": self.batch,
                    "to_markdown": self.to_markdown,
                },
                metadata=self._default_metadata(),
            )
        except Exception as e:
            raise SplitterOutputException(f"Failed to build SplitterOutput: {e}") from e

    # ---- HTML / Table helpers ---- #

    def _build_doc_with_children(self, children: List) -> str:
        """Wrap top-level nodes into a minimal HTML document.

        Args:
          children: Nodes to append under `<body>`.

        Returns:
          str: Serialized HTML document containing the provided children.
        """
        doc = bs4.BeautifulSoup("", HTML_PARSER)
        html_tag: Tag = doc.new_tag("html")
        body_tag: Tag = doc.new_tag("body")
        html_tag.append(body_tag)
        doc.append(html_tag)
        for c in children:
            body_tag.append(copy.deepcopy(c))
        return str(doc)

    def _extract_table_header_and_rows(self, table_tag):
        """Extract table header and data rows.

        Args:
          table_tag: A `<table>` BeautifulSoup element.

        Returns:
          tuple: `(header_thead, data_rows, header_row_src)` where:
            * `header_thead`: a deep-copied `<thead>` or `None`.
            * `data_rows`: list of original `<tr>` nodes not in `<thead>`.
            * `header_row_src`: original `<tr>` used to synthesize `<thead>` (if any).
        """
        header = table_tag.find("thead")
        header_row_src: Optional[Tag] = None

        if header is not None:
            data_rows = []
            for tr in table_tag.find_all("tr"):
                if tr.find_parent("thead") is not None:
                    continue
                data_rows.append(tr)
            return copy.deepcopy(header), data_rows, None

        first_tr = table_tag.find("tr")
        header_thead: Optional[Tag] = None
        if first_tr is not None:
            tmp = bs4.BeautifulSoup("", HTML_PARSER)
            thead = tmp.new_tag("thead")
            thead.append(copy.deepcopy(first_tr))
            header_thead = thead
            header_row_src = first_tr

        data_rows: list = []
        for tr in table_tag.find_all("tr"):
            if header_row_src is not None and tr is header_row_src:
                continue
            if tr.find_parent("thead") is not None:
                continue
            data_rows.append(tr)

        return header_thead, data_rows, header_row_src

    def _build_table_chunk(self, table_tag, rows_subset: List) -> str:
        """Build a minimal document containing a single table with a subset of rows.

        Args:
          table_tag: The source `<table>` element (attributes are copied).
          rows_subset: The `<tr>` rows to include under `<tbody>`.

        Returns:
          str: Serialized HTML document with `<table>` containing the subset.
        """
        header_thead, _, _ = self._extract_table_header_and_rows(table_tag)
        doc = BeautifulSoup("", HTML_PARSER)
        html_tag: Tag = doc.new_tag("html")
        body_tag: Tag = doc.new_tag("body")
        html_tag.append(body_tag)
        doc.append(html_tag)

        new_table: Tag = doc.new_tag("table", **table_tag.attrs)
        if header_thead is not None:
            new_table.append(copy.deepcopy(header_thead))

        tbody: Tag = doc.new_tag("tbody")
        for r in rows_subset:
            tbody.append(copy.deepcopy(r))
        new_table.append(tbody)

        body_tag.append(new_table)
        return str(doc)

    def _chunk_non_tables(self, elements: list, tag: str) -> List[str]:
        """Chunk non-table elements according to batching rules.

        Args:
          elements: List of non-table elements to chunk.
          tag: Effective tag name (not `"table"`).

        Returns:
          List[str]: HTML chunks for non-table content.
        """
        if not self.batch:
            return self._non_tables_unbatched(elements, tag)

        if self.chunk_size in (0, 1, None):
            return self._single_group_or_empty(elements)

        # chunk_size > 1: batch by total HTML length
        return self._chunk_by_total_length(elements, self.chunk_size)

    def _build_doc(self, els: Sequence) -> str:
        """Build a minimal HTML document from nodes.

        Args:
          els: Sequence of nodes to be wrapped under `<body>`.

        Returns:
          str: Serialized HTML document containing the nodes.
        """
        return self._build_doc_with_children(list(els))

    def _single_group_or_empty(self, elements: Sequence) -> List[str]:
        """Return a single grouped chunk or an explicit empty chunk.

        Args:
          elements: Sequence of elements to group.

        Returns:
          List[str]: A single combined chunk, or `[""]` if there are no elements.
        """
        return [self._build_doc(elements)] if elements else [""]

    def _non_tables_unbatched(self, elements: list, tag: str) -> List[str]:
        """Unbatched emission for non-table tags (and table-children).

        Args:
          elements: List of elements to emit individually.
          tag: Effective tag name (may be a table-child, e.g., `"tr"`).

        Returns:
          List[str]: One HTML chunk per element (with special handling for table-children).
        """
        # Simple one-per-element when not dealing with table-children
        if tag not in TABLE_CHILDREN:
            return [self._build_doc_with_children([el]) for el in elements]

        # Row-like: keep header context if parent table exists
        chunks: List[str] = []
        for el in elements:
            table_el = el.find_parent("table")
            if not table_el:
                chunks.append(self._build_doc_with_children([el]))
                continue

            # Skip header-only rows or header tags
            if (el.name == "tr" and el.find_parent("thead") is not None) or el.name in {
                "thead",
                "th",
            }:
                continue

            chunks.append(self._build_table_chunk(table_el, [el]))
        return chunks

    def _chunk_by_total_length(self, elements: Sequence, max_len: int) -> List[str]:
        """Batch arbitrary elements by aggregated HTML length.

        Args:
          elements: Sequence of elements to batch.
          max_len: Maximum allowed length (in characters) per chunk.

        Returns:
          List[str]: HTML chunks whose serialized length does not exceed `max_len`,
          except when a single element alone exceeds `max_len` (in which case it is
          emitted as an oversized chunk).
        """
        chunks: list[str] = []
        buffer: list = []
        for el in elements:
            candidate = buffer + [el]
            candidate_str = self._build_doc(candidate)
            if len(candidate_str) > max_len and buffer:
                chunks.append(self._build_doc(buffer))
                buffer = [el]
            else:
                buffer = candidate
        if buffer:
            chunks.append(self._build_doc(buffer))
        return chunks

    # ---- Auto Tagging logic ---- #

    def _auto_tag(self, soup: BeautifulSoup) -> str:
        """Auto-detect the most frequent and shallowest tag within `<body>`.

        If no repeated tags are found, return the first tag found in `<body>`,
        otherwise fallback to `'div'`. Emits an `AutoTagFallbackWarning` when
        `<body>` is missing or when a fallback is used.

        Args:
          soup: Parsed BeautifulSoup document.

        Returns:
          str: Chosen tag name.
        """
        body = soup.find("body")
        if not body:
            warnings.warn(
                AutoTagFallbackWarning(
                    f"No body tag has been found in the provided input. "
                    f"Defaulting to '{DEFAULT_HTML_TAG}' tag"
                )
            )
            return DEFAULT_HTML_TAG

        # Traverse all tags in body, tracking tag: (count, min_depth)
        tag_counter = Counter()
        tag_min_depth = defaultdict(lambda: float("inf"))

        def traverse(el, depth=0):
            for child in el.children:
                if getattr(child, "name", None):
                    tag_counter[child.name] += 1
                    tag_min_depth[child.name] = min(tag_min_depth[child.name], depth)
                    traverse(child, depth + 1)

        traverse(body)

        if not tag_counter:
            for t in body.find_all(True, recursive=True):
                return t.name
            warnings.warn(
                AutoTagFallbackWarning(f"Defaulting to '{DEFAULT_HTML_TAG}' tag")
            )
            return DEFAULT_HTML_TAG

        max_count: int = max(tag_counter.values())
        candidates: list = [t for t, cnt in tag_counter.items() if cnt == max_count]
        chosen: str = min(candidates, key=lambda t: tag_min_depth[t])
        return chosen
split(reader_output)

Split HTML using the configured tag and batching, then optionally convert to Markdown.

Semantics
  • Tables

    • batch=False: one chunk per requested element. If splitting by a row-level tag (e.g., tr), emit a mini-table per row with <thead> once and that row in <tbody>.
    • batch=True and chunk_size in (0, 1, None): all tables grouped into one chunk.
    • batch=True and chunk_size > 1: split each table into multiple chunks by batching <tr> rows; copy <thead> into every chunk and skip the header row from <tbody>.
  • Non-table tags

    • batch=False: one chunk per element.
    • batch=True and chunk_size in (0, 1, None): all elements grouped into one chunk.
    • batch=True and chunk_size > 1: batch by total HTML length.

Parameters:

Name Type Description Default
reader_output ReaderOutput

Reader output containing at least text.

required

Returns:

Name Type Description
SplitterOutput SplitterOutput

The split result with chunks and metadata.

Raises:

Type Description
HtmlConversionError

If parsing the HTML or converting chunks to Markdown fails.

InvalidHtmlTagError

If the tag lookup (find_all) fails due to an invalid tag.

SplitterOutputException

If building the final SplitterOutput fails.

Example

Basic usage splitting all <div> elements:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import HTMLTagSplitter

html = '''
<div>First block</div>
<div>Second block</div>
<div>Third block</div>
'''

ro = ReaderOutput(
    text=html,
    document_name="sample.html",
    document_path="/tmp/sample.html",
)

splitter = HTMLTagSplitter(chunk_size=10, tag="div", batch=False)
output = splitter.split(ro)

print(output.chunks)
['<div>First block</div>','<div>Second block</div>','<div>Third block</div>']

Example with batching (all <p> elements grouped into one chunk):

html = "<p>A</p><p>B</p><p>C</p>"
ro = ReaderOutput(text=html, document_name="demo.html")

splitter = HTMLTagSplitter(chunk_size=1, tag="p", batch=True)
out = splitter.split(ro)

print(out.chunks[0])
'<p>A</p>\n<p>B</p>\n<p>C</p>'

Example with table batching (each chunk contains a header and 2 rows):

html = '''
<table>
    <thead><tr><th>H1</th><th>H2</th></tr></thead>
    <tbody>
        <tr><td>A</td><td>1</td></tr>
        <tr><td>B</td><td>2</td></tr>
        <tr><td>C</td><td>3</td></tr>
    </tbody>
</table>
'''

ro = ReaderOutput(text=html, document_name="table.html")

splitter = HTMLTagSplitter(
    chunk_size=2,       # batch <tr> rows in groups of 2
    tag="tr",           # split by table rows
    batch=True,
)
out = splitter.split(ro)

for i, c in enumerate(out.chunks, 1):
    print(f"--- CHUNK {i} ---")
    print(c)

Example enabling Markdown conversion:

html = "<h1>Title</h1><p>Paragraph text</p>"
ro = ReaderOutput(text=html)

splitter = HTMLTagSplitter(
    chunk_size=5,
    tag=None,
    batch=False,
    to_markdown=True,
)
out = splitter.split(ro)

print(out.chunks)
['# Title', 'Paragraph text']

Notes

If the input text is empty/whitespace-only, a warning is emitted and a single empty chunk is returned.
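
A minimal sketch of that empty-input behavior, capturing the emitted warning:

```python
import warnings

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import HTMLTagSplitter

ro = ReaderOutput(text="   ", document_name="empty.html")  # whitespace-only input

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = HTMLTagSplitter(tag="p").split(ro)

print(out.chunks)                        # a single empty chunk: ['']
print([str(w.message) for w in caught])  # includes the SplitterInputWarning message
```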

Source code in src/splitter_mr/splitter/splitters/html_tag_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """Split HTML using the configured tag and batching, then optionally convert to Markdown.

    Semantics:
      - **Tables**
          - `batch=False`: one chunk per requested element. If splitting by a row-level tag
            (e.g., `tr`), emit a mini-table per row with `<thead>` once and that row in `<tbody>`.
          - `batch=True` and `chunk_size in (0, 1, None)`: all tables grouped into one chunk.
          - `batch=True` and `chunk_size > 1`: split each table into multiple chunks by batching
            `<tr>` rows; copy `<thead>` into every chunk and skip the header row from `<tbody>`.

      - **Non-table tags**
          - `batch=False`: one chunk per element.
          - `batch=True` and `chunk_size in (0, 1, None)`: all elements grouped into one chunk.
          - `batch=True` and `chunk_size > 1`: batch by total HTML length.

    Args:
      reader_output: Reader output containing at least `text`.

    Returns:
      SplitterOutput: The split result with chunks and metadata.

    Raises:
      HtmlConversionError: If parsing the HTML or converting chunks to Markdown fails.
      InvalidHtmlTagError: If the tag lookup (`find_all`) fails due to an invalid tag.
      SplitterOutputException: If building the final `SplitterOutput` fails.

    Example:
        **Basic usage** splitting **all `<div>` elements**:

        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter.splitters import HTMLTagSplitter

        html = '''
        <div>First block</div>
        <div>Second block</div>
        <div>Third block</div>
        '''

        ro = ReaderOutput(
            text=html,
            document_name="sample.html",
            document_path="/tmp/sample.html",
        )

        splitter = HTMLTagSplitter(chunk_size=10, tag="div", batch=False)
        output = splitter.split(ro)

        print(output.chunks)
        ```

        ```python
        ['<div>First block</div>','<div>Second block</div>','<div>Third block</div>']
        ```

        Example with **batching** (all `<p>` elements grouped into one chunk):

        ```python
        html = "<p>A</p><p>B</p><p>C</p>"
        ro = ReaderOutput(text=html, document_name="demo.html")

        splitter = HTMLTagSplitter(chunk_size=1, tag="p", batch=True)
        out = splitter.split(ro)

        print(out.chunks[0])
        ```

        ```python
        '<p>A</p>\\n<p>B</p>\\n<p>C</p>'
        ```

        Example with **table batching** (each chunk contains a header and 2 rows):

        ```python
        html = '''
        <table>
            <thead><tr><th>H1</th><th>H2</th></tr></thead>
            <tbody>
                <tr><td>A</td><td>1</td></tr>
                <tr><td>B</td><td>2</td></tr>
                <tr><td>C</td><td>3</td></tr>
            </tbody>
        </table>
        '''

        ro = ReaderOutput(text=html, document_name="table.html")

        splitter = HTMLTagSplitter(
            chunk_size=2,       # batch <tr> rows in groups of 2
            tag="tr",           # split by table rows
            batch=True,
        )
        out = splitter.split(ro)

        for i, c in enumerate(out.chunks, 1):
            print(f"--- CHUNK {i} ---")
            print(c)
        ```

        Example **enabling Markdown conversion**:

        ```python
        html = "<h1>Title</h1><p>Paragraph text</p>"
        ro = ReaderOutput(text=html)

        splitter = HTMLTagSplitter(
            chunk_size=5,
            tag=None,
            batch=False,
            to_markdown=True,
        )
        out = splitter.split(ro)

        print(out.chunks)
        ```
        ```python
        ['# Title', 'Paragraph text']
        ```

    Notes:
      If the input text is empty/whitespace-only, a warning is emitted and
      a single empty chunk is returned.
    """
    html: str = getattr(reader_output, "text", "") or ""
    if not html.strip():
        warnings.warn(
            SplitterInputWarning(
                "ReaderOutput.text is empty or whitespace-only. "
                "Proceeding; this will yield a single empty chunk."
            )
        )
        return self._emit_result(
            chunks=[""],
            reader_output=reader_output,
            tag=self.tag or DEFAULT_HTML_TAG,
        )

    soup = self._parse_html(html)
    tag = self.tag or self._auto_tag(soup)

    elements, effective_tag = self._select_elements(soup, tag)

    chunks = self._dispatch_chunking(elements, effective_tag)
    if not chunks:
        warnings.warn(SplitterOutputWarning("Splitter has produced empty chunks"))
        chunks = [""]

    if self.to_markdown:
        chunks = self._convert_chunks_to_markdown(chunks)

    return self._emit_result(
        chunks=chunks,
        reader_output=reader_output,
        tag=effective_tag,
    )

RowColumnSplitter

RowColumnSplitter

Bases: BaseSplitter

Split tabular data by rows, columns, or character-based chunk size.

RowColumnSplitter splits tabular data (such as CSV, TSV, Markdown tables, or JSON tables) into smaller tables based on rows, columns, or by total character size while preserving row integrity.

This splitter supports several modes:

  • By rows: Split the table into chunks with a fixed number of rows, with optional overlapping rows between chunks.
  • By columns: Split the table into chunks by columns, with optional overlapping columns between chunks.
  • By chunk size: Split the table into markdown-formatted table chunks, where each chunk contains as many complete rows as fit under the specified character limit, optionally overlapping a fixed number of rows between chunks.

Supported formats for the input text are:

  • CSV / TSV / TXT (comma- or tab-separated values).
  • Markdown tables.
  • JSON in tabular shape (list of dicts or dict of lists).

Parameters:

Name Type Description Default
chunk_size int

Maximum number of characters per chunk when using character-based splitting. Defaults to 1000.

1000
num_rows int

Number of rows per chunk when splitting by rows. Mutually exclusive with num_cols. Defaults to 0 (disabled).

0
num_cols int

Number of columns per chunk when splitting by columns. Mutually exclusive with num_rows. Defaults to 0 (disabled).

0
chunk_overlap int | float

Overlap between chunks. Interpretation depends on the mode:

  • When splitting by rows or columns, if an int, it is the number of overlapping rows/columns. If a float in [0, 1), it is interpreted as a fraction of the rows/columns per chunk.
  • When splitting by chunk_size, it represents the number or fraction of overlapping rows (not characters).

Defaults to 0.

0

Raises:

Type Description
SplitterConfigException

If configuration is invalid, e.g.:

  • num_rows and num_cols are both non-zero.
  • chunk_overlap as float is not in [0, 1).
  • chunk_overlap as int is negative.
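
A condensed sketch of the three modes (mirroring the longer examples in the `split()` docstring below); the CSV content is illustrative only:

```python
from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import RowColumnSplitter

csv_text = "id,name\n1,A\n2,B\n3,C\n4,D\n"
ro = ReaderOutput(text=csv_text, conversion_method="csv", document_name="rows.csv")

# By rows: 2 rows per chunk, re-using the last row of each chunk as overlap.
by_rows = RowColumnSplitter(num_rows=2, chunk_overlap=1).split(ro)

# By columns (mutually exclusive with num_rows): one column per chunk.
by_cols = RowColumnSplitter(num_cols=1).split(ro)

# By character-based chunk size: whole rows packed into markdown-formatted
# table chunks of at most ~60 characters each.
by_size = RowColumnSplitter(chunk_size=60).split(ro)

print(by_rows.chunks)
print(by_cols.chunks)
print(by_size.chunks)
```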
Source code in src/splitter_mr/splitter/splitters/row_column_splitter.py
class RowColumnSplitter(BaseSplitter):
    """Split tabular data by rows, columns, or character-based chunk size.

    RowColumnSplitter splits tabular data (such as CSV, TSV, Markdown tables,
    or JSON tables) into smaller tables based on rows, columns, or by total
    character size while preserving row integrity.

    This splitter supports several modes:

    * **By rows**: Split the table into chunks with a fixed number of rows,
      with optional overlapping rows between chunks.
    * **By columns**: Split the table into chunks by columns, with optional
      overlapping columns between chunks.
    * **By chunk size**: Split the table into markdown-formatted table chunks,
      where each chunk contains as many complete rows as fit under the specified
      character limit, optionally overlapping a fixed number of rows between
      chunks.

    Supported formats for the input text are:

    * CSV / TSV / TXT (comma- or tab-separated values).
    * Markdown tables.
    * JSON in tabular shape (list of dicts or dict of lists).

    Args:
        chunk_size (int, optional):
            Maximum number of characters per chunk when using character-based
            splitting. Defaults to ``1000``.
        num_rows (int, optional):
            Number of rows per chunk when splitting by rows. Mutually
            exclusive with ``num_cols``. Defaults to ``0`` (disabled).
        num_cols (int, optional):
            Number of columns per chunk when splitting by columns. Mutually
            exclusive with ``num_rows``. Defaults to ``0`` (disabled).
        chunk_overlap (int | float, optional):
            Overlap between chunks. Interpretation depends on the mode:

            * When splitting by rows or columns, if an ``int``, it is the
              number of overlapping rows/columns. If a ``float`` in
              ``[0, 1)``, it is interpreted as a fraction of the rows/columns
              per chunk.
            * When splitting by ``chunk_size``, it represents the number or
              fraction of overlapping **rows** (not characters).

            Defaults to ``0``.

    Raises:
        SplitterConfigException:
            If configuration is invalid, e.g.:

            * ``num_rows`` and ``num_cols`` are both non-zero.
            * ``chunk_overlap`` as ``float`` is not in ``[0, 1)``.
            * ``chunk_overlap`` as ``int`` is negative.
    """

    def __init__(
        self,
        chunk_size: int = 1000,
        num_rows: int = 0,
        num_cols: int = 0,
        chunk_overlap: Union[int, float] = 0,
    ):
        super().__init__(chunk_size)
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.chunk_overlap = chunk_overlap
        self._validate_config()

    def _validate_config(self) -> None:
        """Validate splitter configuration.

        Performs basic sanity checks on the configuration and raises
        splitter-specific errors when invalid.

        Raises:
            SplitterConfigException:
                If any of the following holds:

                * ``num_rows`` and ``num_cols`` are both non-zero.
                * ``num_rows`` or ``num_cols`` is negative.
                * ``chunk_overlap`` is negative.
                * ``chunk_overlap`` is a ``float`` outside ``[0, 1)``.
        """
        if self.num_rows and self.num_cols:
            raise SplitterConfigException(
                "num_rows and num_cols are mutually exclusive."
            )

        if self.num_rows < 0 or self.num_cols < 0:
            raise SplitterConfigException(
                "num_rows and num_cols must be non-negative integers."
            )

        if not isinstance(self.chunk_overlap, (int, float)):
            raise SplitterConfigException("chunk_overlap must be an int or a float.")

        if self.chunk_overlap < 0:
            raise SplitterConfigException("chunk_overlap must be non-negative.")

        if isinstance(self.chunk_overlap, float) and not (0 <= self.chunk_overlap < 1):
            raise SplitterConfigException(
                "chunk_overlap as float must be in the range [0, 1)."
            )

    # ---- Main logic ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Split the input tabular data into chunks.

        The splitting strategy is determined by the configuration:

        - If ``num_rows > 0``: split by rows.
        - Else if ``num_cols > 0``: split by columns.
        - Else: split by character-based chunk size in markdown format,
          preserving a header row and never cutting data rows.

        Args:
            reader_output (ReaderOutput):
                Reader output containing at least ``text`` (tabular data as a
                string) and optionally:

                * ``conversion_method``: format hint (``"markdown"``, ``"csv"``,
                  ``"tsv"``, ``"txt"``, ``"json"`` or custom).
                * ``document_name``, ``document_path``, ``document_id``,
                  ``conversion_method``, ``reader_method``, ``ocr_method`` for
                  metadata propagation.

        Returns:
            SplitterOutput:
                Populated splitter output with:

                * ``chunks``: list of chunked tables.
                * ``chunk_id``: generated chunk identifiers.
                * document metadata carried over from ``reader_output``.
                * ``split_method="row_column_splitter"``.
                * ``split_params`` describing the configuration.
                * ``metadata`` containing extra information.

        Raises:
            ReaderOutputException:
                If ``reader_output.text`` is missing or not of type
                ``str``/``None``.
            InvalidChunkException:
                If the number of generated chunk IDs does not match the number
                of chunks.
            SplitterOutputException:
                If constructing :class:`SplitterOutput` fails unexpectedly.

        Warnings:
            SplitterInputWarning:
                If the input text is empty/whitespace-only or if the
                ``conversion_method`` is unknown and a fallback parser is used.
            SplitterOutputWarning:
                If non-empty text produces an empty DataFrame, which may
                indicate malformed input.

        Example:
            Splitting a **CSV table** by **rows** with **overlap**:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter.splitters import RowColumnSplitter

            csv_text = (
                "id,name,amount\\n"
                "1,A,10\\n"
                "2,B,20\\n"
                "3,C,30\\n"
                "4,D,40\\n"
            )

            ro = ReaderOutput(
                text=csv_text,
                conversion_method="csv",
                document_name="payments.csv",
                document_path="/tmp/payments.csv",
                document_id="payments-1",
            )

            splitter = RowColumnSplitter(
                num_rows=2,          # 2 rows per chunk
                chunk_overlap=1,     # reuse last 1 row in the next chunk
            )
            out = splitter.split(ro)

            print(out.chunks)
            ```
            ```python
            [
              'id,name,amount\\n1,A,10\\n2,B,20',
              'id,name,amount\\n2,B,20\\n3,C,30',
              'id,name,amount\\n3,C,30\\n4,D,40',
            ]
            ```

            ```python
            print(out.metadata["chunks"][0])
            ```
            ```python
            {'rows': [0, 1], 'type': 'row'}
            ```

            Splitting a CSV table by **columns**:

            ```python
            splitter = RowColumnSplitter(
                num_cols=2,          # 2 columns per chunk
                chunk_overlap=1,     # reuse 1 column in the next chunk
            )
            out = splitter.split(ro)

            print(out.chunks)
            ```
            ```python
            [['id', 1, 2, 3, 4], ['name', 'A', 'B', 'C', 'D']]
            ```
            ```python
            print(out.metadata["chunks"][0])
            ```

            ```python
            {'cols': ['id', 'name'], 'type': 'column'}
            ```

            Splitting by **character-based chunk size** (markdown output):

            ```python
            md_text = '''
            | id | name | amount |
            |----|------|--------|
            | 1  | A    | 10     |
            | 2  | B    | 20     |
            | 3  | C    | 30     |
            | 4  | D    | 40     |
            '''.strip()

            ro = ReaderOutput(
                text=md_text,
                conversion_method="markdown",
                document_name="table.md",
            )

            splitter = RowColumnSplitter(
                chunk_size=80,        # max ~80 chars per chunk
                chunk_overlap=0.25,   # 25% row overlap between chunks
            )
            out = splitter.split(ro)

            for i, (chunk, meta) in enumerate(
                zip(out.chunks, out.metadata["chunks"]), start=1
            ):
                print(f"--- CHUNK {i} ---")
                print(chunk)
                print("rows:", meta["rows"])   # original row indices
            ```

            Handling **unknown conversion_method** with JSON/CSV fallback:

            ```python
            json_text = '''
            [
                {"id": 1, "name": "A", "amount": 10},
                {"id": 2, "name": "B", "amount": 20}
            ]
            '''.strip()

            ro = ReaderOutput(
                text=json_text,
                conversion_method="unknown",   # triggers JSON → CSV fallback logic
            )

            splitter = RowColumnSplitter(num_rows=1)
            out = splitter.split(ro)
            print(out.chunks)
            ```
        """
        # Minimal ReaderOutput validation (type-level issues)
        if not hasattr(reader_output, "text"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'text' attribute."
            )

        text = reader_output.text
        if text is None:
            text = ""
        elif not isinstance(text, str):
            raise ReaderOutputException(
                f"ReaderOutput.text must be of type 'str' or None, got "
                f"{type(text).__name__!r}"
            )

        # Load tabular data into a DataFrame
        df = self._load_tabular(reader_output)
        orig_method = (reader_output.conversion_method or "").lower()
        col_names = df.columns.tolist()

        # If text is non-empty but we got no rows/columns, warn
        if text.strip() and df.empty:
            warnings.warn(
                "RowColumnSplitter produced an empty DataFrame from non-empty "
                "input text; this may indicate malformed or unsupported table "
                "format.",
                SplitterOutputWarning,
            )

        # Dispatch to splitting strategy
        if self.num_rows > 0:
            chunks, meta_per_chunk = self._split_by_rows(df, orig_method)
        elif self.num_cols > 0:
            chunks, meta_per_chunk = self._split_by_columns(df, orig_method, col_names)
        else:
            chunks, meta_per_chunk = self._split_by_chunk_size(df)

        # Generate chunk IDs and validate
        chunk_ids = self._generate_chunk_ids(len(chunks))
        if len(chunk_ids) != len(chunks):
            raise InvalidChunkException(
                "Number of chunk IDs does not match number of chunks "
                f"(chunk_ids={len(chunk_ids)}, chunks={len(chunks)})."
            )

        # Build SplitterOutput, wrapping any unexpected issues
        try:
            return SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="row_column_splitter",
                split_params={
                    "chunk_size": self.chunk_size,
                    "num_rows": self.num_rows,
                    "num_cols": self.num_cols,
                    "chunk_overlap": self.chunk_overlap,
                },
                metadata={"chunks": meta_per_chunk},
            )
        except Exception as exc:
            raise SplitterOutputException(
                f"Failed to build SplitterOutput in RowColumnSplitter: {exc}"
            ) from exc

    # ---- Internal helpers ---- #

    # Splitting strategies

    def _split_by_rows(
        self,
        df: pd.DataFrame,
        method: str,
    ) -> Tuple[List[str], List[Dict[str, Any]]]:
        """Split the DataFrame into chunks by rows.

        Uses ``num_rows`` and ``chunk_overlap`` to build overlapping row-based
        chunks. Each chunk contains full rows; rows are never split.

        Args:
            df (pd.DataFrame):
                Input table as a DataFrame.
            method (str):
                Original conversion method (e.g., ``"markdown"`` or ``"csv"``);
                used to decide the output string format.

        Returns:
            Tuple[List[str], List[Dict[str, Any]]]:
                A tuple ``(chunks, metadata)`` where:

                * ``chunks`` is a list of stringified table chunks.
                * ``metadata`` is a list of per-chunk metadata dicts, each
                  containing:

                  * ``"rows"``: list of DataFrame indices included.
                  * ``"type"``: ``"row"``.

        """
        chunks: List[str] = []
        meta_per_chunk: List[Dict[str, Any]] = []

        overlap = self._get_overlap(self.num_rows)
        step = self.num_rows - overlap if (self.num_rows - overlap) > 0 else 1

        for i in range(0, len(df), step):
            chunk_df = df.iloc[i : i + self.num_rows]
            if chunk_df.empty:
                continue
            chunk_str = self._to_str(chunk_df, method)
            chunks.append(chunk_str)
            meta_per_chunk.append(
                {
                    "rows": chunk_df.index.tolist(),
                    "type": "row",
                }
            )

        return chunks, meta_per_chunk

    def _split_by_columns(
        self,
        df: pd.DataFrame,
        method: str,
        col_names: List[str],
    ) -> Tuple[List[str], List[Dict[str, Any]]]:
        """Split the DataFrame into chunks by columns.

        Uses ``num_cols`` and ``chunk_overlap`` to build overlapping
        column-based chunks. Each chunk preserves all rows but only a subset
        of columns.

        Args:
            df (pd.DataFrame):
                Input table as a DataFrame.
            method (str):
                Original conversion method (e.g., ``"markdown"`` or ``"csv"``);
                used to decide the output string format.
            col_names (List[str]):
                List of column names in the order used for slicing.

        Returns:
            Tuple[List[str], List[Dict[str, Any]]]:
                A tuple ``(chunks, metadata)`` where:

                * ``chunks`` is a list of stringified table chunks.
                * ``metadata`` is a list of per-chunk metadata dicts, each
                  containing:

                  * ``"cols"``: list of column names included.
                  * ``"type"``: ``"column"``.
        """
        chunks: List[str] = []
        meta_per_chunk: List[Dict[str, Any]] = []

        overlap = self._get_overlap(self.num_cols)
        step = self.num_cols - overlap if (self.num_cols - overlap) > 0 else 1
        total_cols = len(col_names)

        for i in range(0, total_cols, step):
            sel_cols = col_names[i : i + self.num_cols]
            if not sel_cols:
                continue
            chunk_df = df[sel_cols]
            chunk_str = self._to_str(chunk_df, method, colwise=True)
            chunks.append(chunk_str)
            meta_per_chunk.append(
                {
                    "cols": sel_cols,
                    "type": "column",
                }
            )

        return chunks, meta_per_chunk

    def _split_by_chunk_size(
        self,
        df: pd.DataFrame,
    ) -> Tuple[List[str], List[Dict[str, Any]]]:
        """Split the DataFrame into markdown chunks constrained by ``chunk_size``.

        The header is always preserved, rows are never cut, and overlap is
        applied in terms of full rows (not characters). Each chunk is rendered
        as a markdown table string.

        Args:
            df (pd.DataFrame):
                Input table as a DataFrame.

        Returns:
            Tuple[List[str], List[Dict[str, Any]]]:
                A tuple ``(chunks, metadata)`` where:

                * ``chunks`` is a list of markdown-formatted tables.
                * ``metadata`` is a list of per-chunk metadata dicts, each
                  containing:

                  * ``"rows"``: list of row indices in the original table.
                  * ``"type"``: ``"char_row"``.

        Raises:
            SplitterConfigException:
                If ``chunk_size`` is too small to fit the header and at
                least one row.
        """
        chunks: List[str] = []
        meta_per_chunk: List[Dict[str, Any]] = []

        # Build header
        header_lines = self._get_markdown_header(df)
        header_length = len(header_lines)

        # Build per-row markdown representations
        row_md_list = [self._get_markdown_row(df, i) for i in range(len(df))]
        row_len_list = [len(r) + 1 for r in row_md_list]  # +1 for newline

        # Input validation
        if row_md_list:
            min_required = header_length + max(row_len_list)
            if self.chunk_size < min_required:
                raise SplitterConfigException(
                    "chunk_size is too small to fit the header and at least one row; "
                    f"minimum required is {min_required}, got {self.chunk_size}."
                )

        i = 0
        n = len(row_md_list)

        while i < n:
            curr_chunk: List[str] = []
            curr_len = header_length
            j = i

            # Accumulate rows while there is space
            while j < n and curr_len + row_len_list[j] <= self.chunk_size:
                curr_chunk.append(row_md_list[j])
                curr_len += row_len_list[j]
                j += 1

            rows_in_chunk = j - i
            chunk_str = header_lines + "\n".join(curr_chunk)
            chunks.append(chunk_str)
            meta_per_chunk.append(
                {
                    "rows": list(range(i, j)),
                    "type": "char_row",
                }
            )

            # --- compute overlap AFTER we know rows_in_chunk ---
            if isinstance(self.chunk_overlap, float):
                overlap_rows = int(rows_in_chunk * self.chunk_overlap)
            else:
                overlap_rows = int(self.chunk_overlap)

            # Avoid infinite loops when overlap >= rows_in_chunk
            overlap_rows = min(overlap_rows, max(rows_in_chunk - 1, 0))
            i = j - overlap_rows

        return chunks, meta_per_chunk

    # ---- Internal helpers ---- #

    def _get_overlap(self, base: int) -> int:
        """Compute integer overlap from ``chunk_overlap`` configuration.

        Args:
            base (int):
                Base number (rows or columns) from which to compute the
                overlap when ``chunk_overlap`` is a float.

        Returns:
            int:
                Overlap expressed as an integer count of rows or columns.
        """
        if isinstance(self.chunk_overlap, float):
            return int(base * self.chunk_overlap)
        return int(self.chunk_overlap)

    def _load_tabular(self, reader_output: ReaderOutput) -> pd.DataFrame:
        """Load and parse input tabular data into a DataFrame.

        The parsing strategy is driven by ``reader_output.conversion_method``:

        * ``"markdown"`` → parse markdown table.
        * ``"csv"`` / ``"txt"`` → parse as CSV.
        * ``"tsv"`` → parse as TSV.
        * ``"json"`` → parse as tabular JSON (list-of-dicts or dict-of-lists).
        * Any other value (including ``None``) triggers a fallback:
          try tabular JSON, then CSV, with warnings.

        Args:
            reader_output (ReaderOutput):
                Reader output containing the raw tabular text and
                ``conversion_method`` hint.

        Returns:
            pd.DataFrame:
                DataFrame representation of the table. May be empty if the
                input text is empty or contains no parsable rows.

        Raises:
            pandas.errors.ParserError:
                If a known format (e.g. ``"json"`` or markdown) is declared
                but the content is malformed and cannot be parsed.

        Warnings:
            SplitterInputWarning:
                If:

                * The input text is empty or whitespace-only.
                * The ``conversion_method`` is unknown and a fallback parser
                  is used (JSON or CSV).
        """
        text = reader_output.text or ""
        if not text.strip():
            warnings.warn(
                "RowColumnSplitter received empty or whitespace-only text; "
                "resulting chunks will be empty.",
                SplitterInputWarning,
                stacklevel=3,
            )
            return pd.DataFrame()

        method = (reader_output.conversion_method or "").lower()

        # Local helpers / factory
        def _read_csv(src: str, **kwargs: Any) -> DataFrame:
            return pd.read_csv(io.StringIO(src), **kwargs)

        def _read_markdown(src: str) -> DataFrame:
            return self._parse_markdown_table(src)

        def _read_tsv(src: str) -> DataFrame:
            return _read_csv(src, sep="\t")

        def _read_json(src: str) -> DataFrame:
            df = self._try_json_tabular(src)
            if df is None:
                # If 'json' is declared but the content is not tabular JSON,
                # let this be an error rather than silently guessing.
                raise pd.errors.ParserError("Input is not tabular JSON")
            return df

        parser_map: Dict[str, Callable[[str], DataFrame]] = {
            "markdown": _read_markdown,
            "csv": _read_csv,
            "txt": _read_csv,
            "tsv": _read_tsv,
            "json": _read_json,
        }

        parser = parser_map.get(method)
        if parser is not None:
            return parser(text)

        # Unknown / missing method:
        # Try tabular JSON first, then fall back to CSV, and warn.
        json_df = self._try_json_tabular(text)
        if json_df is not None:
            warnings.warn(
                (
                    f"Unknown conversion_method '{method}', but input parsed as "
                    "tabular JSON. Treating as JSON table."
                ),
                SplitterInputWarning,
            )
            return json_df

        warnings.warn(
            (
                f"Unknown conversion_method '{method}', falling back to CSV parser. "
                "Check that the input is comma-separated."
            ),
            SplitterInputWarning,
        )
        return _read_csv(text)

    def _try_json_tabular(self, text: str) -> Optional[pd.DataFrame]:
        """Try to interpret text as tabular JSON.

        Accepted shapes:

        * ``List[Dict[str, Any]]``: rows as dicts.
        * ``Dict[str, List[Any]]``: columns as lists.

        Args:
            text (str):
                Raw JSON string.

        Returns:
            Optional[pd.DataFrame]:
                A DataFrame if parsing succeeds and a tabular shape is
                detected, otherwise ``None``.
        """
        try:
            js = json.loads(text)
        except Exception:
            return None

        if isinstance(js, list) and js and all(isinstance(row, dict) for row in js):
            return pd.DataFrame(js)

        if isinstance(js, dict):
            return pd.DataFrame(js)

        return None

    def _parse_markdown_table(self, md: str) -> pd.DataFrame:
        """Parse a markdown table string into a DataFrame.

        Ignores non-table lines and trims markdown-specific formatting.
        Also handles the separator line (e.g. ``---``) in the header.

        Args:
            md (str):
                Markdown text that may contain a table.

        Returns:
            pd.DataFrame:
                Parsed table as a DataFrame.

        Raises:
            pandas.errors.ParserError:
                If the markdown table is malformed and cannot be parsed.
        """
        table_lines: List[str] = []
        started = False
        for line in md.splitlines():
            if re.match(r"^\s*\|.*\|\s*$", line):
                started = True
                table_lines.append(line.strip())
            elif started and not line.strip():
                break
        table_md = "\n".join(table_lines)
        table_io = io.StringIO(
            re.sub(
                r"^\s*\|",
                "",
                re.sub(r"\|\s*$", "", table_md, flags=re.MULTILINE),
                flags=re.MULTILINE,
            )
        )
        try:
            df = pd.read_csv(table_io, sep="|").rename(
                lambda x: x.strip(), axis="columns"
            )
        except pd.errors.EmptyDataError:
            # No actual table content (e.g., markdown text with no table lines)
            return pd.DataFrame()
        except pd.errors.ParserError as e:
            # Real markdown table that is malformed → surface as ParserError
            raise pd.errors.ParserError(f"Malformed markdown table: {e}") from e

        if not df.empty and all(re.match(r"^-+$", str(x).strip()) for x in df.iloc[0]):
            df = df.drop(df.index[0]).reset_index(drop=True)
        return df

    def _to_str(self, df: pd.DataFrame, method: str, colwise: bool = False) -> str:
        """Convert a chunk DataFrame to string representation.

        Args:
            df (pd.DataFrame):
                Chunk DataFrame to convert.
            method (str):
                Original conversion method (e.g., ``"markdown"``, ``"csv"``).
            colwise (bool, optional):
                If ``True``, output a list-of-lists representation for
                column-based chunks. If ``False``, output a table-like string
                (markdown or CSV). Defaults to ``False``.

        Returns:
            str:
                String representation of the chunk.
        """
        if colwise:
            return (
                "["
                + ", ".join(str([col] + df[col].tolist()) for col in df.columns)  # noqa: W503
                + "]"  # noqa: W503
            )
        if method in ("markdown", "md"):
            return df.to_markdown(index=False)
        output = io.StringIO()
        df.to_csv(output, index=False)
        return output.getvalue().strip("\n")

    @staticmethod
    def _get_markdown_header(df: pd.DataFrame) -> str:
        """Return markdown header + separator with trailing newline.

        Args:
            df (pd.DataFrame):
                DataFrame whose columns define the header.

        Returns:
            str:
                Markdown-formatted header (two lines) followed by a newline.
        """
        lines = df.head(0).to_markdown(index=False).splitlines()
        return "\n".join(lines[:2]) + "\n"

    @staticmethod
    def _get_markdown_row(df: pd.DataFrame, row_idx: int) -> str:
        """Return a single markdown-formatted row from the DataFrame.

        Args:
            df (pd.DataFrame):
                DataFrame containing the table.
            row_idx (int):
                Index of the row to extract.

        Returns:
            str:
                Markdown-formatted row string (data row only).
        """
        row = df.iloc[[row_idx]]
        md = row.to_markdown(index=False).splitlines()
        return md[-1]
split(reader_output)

Split the input tabular data into chunks.

The splitting strategy is determined by the configuration:

  • If num_rows > 0: split by rows.
  • Else if num_cols > 0: split by columns.
  • Else: split by character-based chunk size in markdown format, preserving a header row and never cutting data rows.

Parameters:

  • reader_output (ReaderOutput, required): Reader output containing at least text (tabular data as a string) and optionally:
      • conversion_method: format hint ("markdown", "csv", "tsv", "txt", "json" or custom).
      • document_name, document_path, document_id, conversion_method, reader_method, ocr_method for metadata propagation.

Returns:

  • SplitterOutput: Populated splitter output with:
      • chunks: list of chunked tables.
      • chunk_id: generated chunk identifiers.
      • document metadata carried over from reader_output.
      • split_method="row_column_splitter".
      • split_params describing the configuration.
      • metadata containing extra information.

Raises:

  • ReaderOutputException: If reader_output.text is missing or not of type str/None.
  • InvalidChunkException: If the number of generated chunk IDs does not match the number of chunks.
  • SplitterOutputException: If constructing SplitterOutput fails unexpectedly.

Warns:

  • SplitterInputWarning: If the input text is empty/whitespace-only or if the conversion_method is unknown and a fallback parser is used.
  • SplitterOutputWarning: If non-empty text produces an empty DataFrame, which may indicate malformed input.

Example

Splitting a CSV table by rows with overlap:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import RowColumnSplitter

csv_text = (
    "id,name,amount\n"
    "1,A,10\n"
    "2,B,20\n"
    "3,C,30\n"
    "4,D,40\n"
)

ro = ReaderOutput(
    text=csv_text,
    conversion_method="csv",
    document_name="payments.csv",
    document_path="/tmp/payments.csv",
    document_id="payments-1",
)

splitter = RowColumnSplitter(
    num_rows=2,          # 2 rows per chunk
    chunk_overlap=1,     # reuse last 1 row in the next chunk
)
out = splitter.split(ro)

print(out.chunks)
[
  'id,name,amount\n1,A,10\n2,B,20',
  'id,name,amount\n2,B,20\n3,C,30',
  'id,name,amount\n3,C,30\n4,D,40',
]

print(out.metadata["chunks"][0])
{'rows': [0, 1], 'type': 'row'}

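chunk_overlap may also be given as a float, in which case it is read as a fraction of num_rows (or num_cols). A small sketch of the resulting step arithmetic, mirroring the integer conversion the splitter applies internally (values are illustrative):

num_rows = 4
chunk_overlap = 0.25

overlap_rows = int(num_rows * chunk_overlap)   # 0.25 of 4 rows -> 1 row reused between chunks
step = max(num_rows - overlap_rows, 1)         # advance 3 rows per chunk
print(overlap_rows, step)                      # 1 3
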
Splitting a CSV table by columns:

splitter = RowColumnSplitter(
    num_cols=2,          # 2 columns per chunk
    chunk_overlap=1,     # reuse 1 column in the next chunk
)
out = splitter.split(ro)

print(out.chunks)
[['id', 1, 2, 3, 4], ['name', 'A', 'B', 'C', 'D']]
print(out.metadata["chunks"][0])

{'cols': ['id', 'name'], 'type': 'column'}

Splitting by character-based chunk size (markdown output):

md_text = '''
| id | name | amount |
|----|------|--------|
| 1  | A    | 10     |
| 2  | B    | 20     |
| 3  | C    | 30     |
| 4  | D    | 40     |
'''.strip()

ro = ReaderOutput(
    text=md_text,
    conversion_method="markdown",
    document_name="table.md",
)

splitter = RowColumnSplitter(
    chunk_size=80,        # max ~80 chars per chunk
    chunk_overlap=0.25,   # 25% row overlap between chunks
)
out = splitter.split(ro)

for i, (chunk, meta) in enumerate(
    zip(out.chunks, out.metadata["chunks"]), start=1
):
    print(f"--- CHUNK {i} ---")
    print(chunk)
    print("rows:", meta["rows"])   # original row indices

Handling unknown conversion_method with JSON/CSV fallback:

json_text = '''
[
    {"id": 1, "name": "A", "amount": 10},
    {"id": 2, "name": "B", "amount": 20}
]
'''.strip()

ro = ReaderOutput(
    text=json_text,
    conversion_method="unknown",   # triggers JSON → CSV fallback logic
)

splitter = RowColumnSplitter(num_rows=1)
out = splitter.split(ro)
print(out.chunks)
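
The fallback paths report problems through warnings rather than exceptions, so they are easy to miss in scripts. A minimal sketch for surfacing them around split() (illustrative only; it reuses the splitter and ro objects from the fallback example above):

import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = splitter.split(ro)

for w in caught:
    # Match by class name to avoid assuming the warnings' import path.
    if w.category.__name__ in ("SplitterInputWarning", "SplitterOutputWarning"):
        print(f"{w.category.__name__}: {w.message}")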

CodeSplitter

CodeSplitter

Bases: BaseSplitter

Recursively splits source code into language-aware, semantically meaningful chunks.

The CodeSplitter uses LangChain's RecursiveCharacterTextSplitter.from_language method to generate code chunks that align with syntactic boundaries such as functions, methods, and classes. This allows for better context preservation during code analysis, summarization, or embedding.

Attributes:

  • language (str): Programming language to split (e.g., "python" or "java").
  • chunk_size (int): Maximum number of characters per chunk.

Warns:

  • SplitterInputWarning: Emitted when the input text is empty or whitespace-only, or when conversion_method='json' but the text is invalid JSON.

Raises:

  • UnsupportedCodeLanguage: If the requested language is not supported by LangChain.
  • InvalidChunkException: If chunk generation fails or produces invalid chunks.
  • SplitterOutputException: If the final SplitterOutput cannot be built or validated.

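The accepted values for language correspond to LangChain's Language enum. A quick sketch for listing them (assumes the langchain_text_splitters package that backs this splitter can be imported directly):

from langchain_text_splitters import Language

# Lowercase enum names are what CodeSplitter's `language` argument expects.
print(sorted(lang.name.lower() for lang in Language))
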
Source code in src/splitter_mr/splitter/splitters/code_splitter.py
class CodeSplitter(BaseSplitter):
    """Recursively splits source code into language-aware, semantically meaningful chunks.

    The ``CodeSplitter`` uses LangChain's
    :func:`RecursiveCharacterTextSplitter.from_language` method to generate
    code chunks that align with syntactic boundaries such as functions,
    methods, and classes. This allows for better context preservation during
    code analysis, summarization, or embedding.

    Attributes:
        language (str): Programming language to split (e.g., ``"python"`` or ``"java"``).
        chunk_size (int): Maximum number of characters per chunk.

    Warnings:
        SplitterInputWarning: Emitted when the input text is empty or whitespace-only,
            or when ``conversion_method='json'`` but the text is invalid JSON.

    Raises:
        UnsupportedCodeLanguage: If the requested language is not supported by LangChain.
        InvalidChunkException: If chunk generation fails or produces invalid chunks.
        SplitterOutputException: If the final :class:`SplitterOutput` cannot be built
            or validated.
    """

    def __init__(self, chunk_size: int = 1000, language: str = "python"):
        if not isinstance(chunk_size, int) or chunk_size < 1:
            raise SplitterConfigException("chunk_size must be an integer >= 1")
        super().__init__(chunk_size)
        self.language = language

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """Split the provided source code into language-aware chunks.

        The method performs input validation and warning emission, determines
        the appropriate language enum, builds code chunks via LangChain,
        and returns a fully validated :class:`SplitterOutput` instance.

        Args:
            reader_output (ReaderOutput): A validated input object containing
                at least a ``text`` field and optional document metadata.

        Returns:
            SplitterOutput: Structured splitter output containing:
                * ``chunks`` — list of split code segments.
                * ``chunk_id`` — corresponding unique identifiers.
                * document metadata and splitter parameters.

        Raises:
            UnsupportedCodeLanguage: If ``self.language`` is not recognized.
            InvalidChunkException: If chunk construction fails or yields invalid chunks.
            SplitterOutputException: If the :class:`SplitterOutput` cannot be built
                or validated.

        Warnings:
            SplitterInputWarning: If text is empty, whitespace-only, or invalid JSON.

        Example:
            ```python
            from splitter_mr.splitter import CodeSplitter
            from splitter_mr.schema.models import ReaderOutput

            reader_output = ReaderOutput(
                text="def foo():\\n    pass\\n\\nclass Bar:\\n    def baz(self):\\n        pass",
                document_name="example.py",
                document_path="/tmp/example.py",
            )

            splitter = CodeSplitter(chunk_size=50, language="python")
            output = splitter.split(reader_output)
            print(output.chunks)
            ```
            ```python
            ['def foo():\\n    pass\\n', 'class Bar:\\n    def baz(self):\\n        pass']
            ```
        """
        text = reader_output.text or ""
        chunk_size = self.chunk_size

        # Check input
        self._warn_on_input(reader_output, text)

        # Resolve language
        lang_enum = get_langchain_language(self.language)

        # Build chunks
        chunks = self._build_chunks(text, lang_enum, chunk_size)

        # Produce output
        try:
            chunk_ids = self._generate_chunk_ids(len(chunks))
            metadata = self._default_metadata()
            output = SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path or "",
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="code_splitter",
                split_params={"chunk_size": chunk_size, "language": self.language},
                metadata=metadata,
            )
            return output
        except Exception as e:
            raise SplitterOutputException(f"Failed to build SplitterOutput: {e}") from e

    # ---- Internal helpers ---- #

    def _warn_on_input(self, reader_output: ReaderOutput, text: str) -> None:
        """Emit :class:`SplitterInputWarning` for suspicious or malformed inputs.

        This helper checks for two common problems:

        * Empty or whitespace-only text → emits a warning and continues.
        * Declared JSON input that cannot be parsed → emits a warning and treats
          it as plain text.

        Args:
            reader_output (ReaderOutput): Input object containing text and metadata.
            text (str): Text content to analyze and possibly warn about.

        Warnings:
            SplitterInputWarning: If text is empty or invalid JSON (when declared).
        """
        if (text or "").strip() == "":
            warnings.warn(
                SplitterInputWarning(
                    "ReaderOutput.text is empty or whitespace-only. "
                    "Proceeding; this will yield a single empty chunk."
                )
            )

        if (reader_output.conversion_method or "").lower() == "json":
            try:
                json.loads(text or "")
            except Exception:
                warnings.warn(
                    SplitterInputWarning(
                        "ReaderOutput.conversion_method is 'json' but text "
                        "is not valid JSON. Proceeding as plain text."
                    )
                )

    def _build_chunks(
        self, text: str, lang_enum: Language, chunk_size: int
    ) -> List[str]:
        """Build and validate code chunks using LangChain.

        Args:
            text (str): Source code to split.
            lang_enum (Language): LangChain language enumeration.
            chunk_size (int): Maximum characters per chunk.

        Returns:
            List[str]: List of chunked code strings.

        Raises:
            InvalidChunkException: If no chunks are produced, a chunk is ``None``,
                or all chunks are empty for non-empty text.
        """
        try:
            splitter = RecursiveCharacterTextSplitter.from_language(
                language=lang_enum, chunk_size=chunk_size, chunk_overlap=0
            )
            docs = splitter.create_documents([text or ""])
            chunks = [doc.page_content for doc in docs]

            # Guarantee at least one empty chunk if text is empty
            if len((text or "")) == 0:
                chunks = [""]

            # Sanity checks
            if not isinstance(chunks, list) or len(chunks) == 0:
                raise InvalidChunkException("No chunks were produced.")
            if any(c is None for c in chunks):
                raise InvalidChunkException("A produced chunk is None.")
            if len(text or "") > 0 and all(c == "" for c in chunks):
                raise InvalidChunkException(
                    "All produced chunks are empty for non-empty text."
                )

            return chunks

        except InvalidChunkException:
            raise
        except Exception as e:
            raise InvalidChunkException(
                f"Unexpected error while building code chunks: {e}"
            ) from e
split(reader_output)

Split the provided source code into language-aware chunks.

The method performs input validation and warning emission, determines the appropriate language enum, builds code chunks via LangChain, and returns a fully validated SplitterOutput instance.

Parameters:

  • reader_output (ReaderOutput, required): A validated input object containing at least a text field and optional document metadata.

Returns:

  • SplitterOutput: Structured splitter output containing:
      • chunks — list of split code segments.
      • chunk_id — corresponding unique identifiers.
      • document metadata and splitter parameters.

Raises:

  • UnsupportedCodeLanguage: If self.language is not recognized.
  • InvalidChunkException: If chunk construction fails or yields invalid chunks.
  • SplitterOutputException: If the SplitterOutput cannot be built or validated.

Warns:

  • SplitterInputWarning: If text is empty, whitespace-only, or invalid JSON.

Example

from splitter_mr.splitter import CodeSplitter
from splitter_mr.schema.models import ReaderOutput

reader_output = ReaderOutput(
    text="def foo():\n    pass\n\nclass Bar:\n    def baz(self):\n        pass",
    document_name="example.py",
    document_path="/tmp/example.py",
)

splitter = CodeSplitter(chunk_size=50, language="python")
output = splitter.split(reader_output)
print(output.chunks)
['def foo():\n    pass\n', 'class Bar:\n    def baz(self):\n        pass']

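The same call works for other supported languages; only the language argument changes. A hedged sketch splitting a small Java snippet (illustrative input, same imports as the example above):

from splitter_mr.splitter import CodeSplitter
from splitter_mr.schema.models import ReaderOutput

java_src = (
    "class Foo {\n"
    "    void bar() {}\n"
    "}\n"
    "\n"
    "class Baz {\n"
    "    void qux() {}\n"
    "}\n"
)

reader_output = ReaderOutput(text=java_src, document_name="Example.java")
splitter = CodeSplitter(chunk_size=40, language="java")
output = splitter.split(reader_output)
print(output.chunks)  # chunks tend to follow class/method boundaries when the size budget allows
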
get_langchain_language(lang_str)

Resolve a string name to a LangChain Language enum.

Parameters:

  • lang_str (str, required): Case-insensitive programming language name (e.g., "python", "java", "kotlin").

Returns:

  • Language: The corresponding LangChain language enumeration.

Raises:

  • SplitterConfigException: If the provided language is not supported by the LangChain Language enum.

Source code in src/splitter_mr/splitter/splitters/code_splitter.py
def get_langchain_language(lang_str: str) -> Language:
    """Resolve a string name to a LangChain ``Language`` enum.

    Args:
        lang_str (str): Case-insensitive programming language name
            (e.g., ``"python"``, ``"java"``, ``"kotlin"``).

    Returns:
        Language: The corresponding LangChain language enumeration.

    Raises:
        SplitterConfigException: If the provided language is not supported
            by the LangChain ``Language`` enum.
    """
    lookup = {lang.name.lower(): lang for lang in Language}
    key = (lang_str or "").lower()
    if key not in lookup:
        supported = ", ".join(sorted(lookup.keys()))
        raise SplitterConfigException(
            f"Unsupported language '{lang_str}'. Supported languages: {supported}"
        )
    return lookup[key]
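
Called directly, the lookup is case-insensitive and fails with a message listing the supported names. A small usage sketch (assumes the module path matches the source file shown above):

from splitter_mr.splitter.splitters.code_splitter import get_langchain_language

lang = get_langchain_language("Python")   # case-insensitive lookup
print(lang.name)                          # PYTHON

try:
    get_langchain_language("not-a-language")
except Exception as exc:                  # SplitterConfigException listing the supported names
    print(exc)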

TokenSplitter

TokenSplitter

Bases: BaseSplitter

Split text into token-based chunks using multiple tokenizer backends.

TokenSplitter splits a given text into chunks based on token counts derived from different tokenization models or libraries.

This splitter supports tokenization via tiktoken (OpenAI tokenizer), spacy (spaCy tokenizer), and nltk (NLTK tokenizer). It allows splitting text into chunks of a maximum number of tokens (chunk_size), using the specified tokenizer model.

Parameters:

  • chunk_size (int, default 1000): Maximum number of tokens per chunk.
  • model_name (str, default DEFAULT_TOKENIZER): Tokenizer and model in the format tokenizer/model. Supported tokenizers include:
      • tiktoken/cl100k_base (OpenAI tokenizer via tiktoken)
      • spacy/en_core_web_sm (spaCy English model)
      • nltk/punkt_tab (NLTK Punkt tokenizer variant)
  • language (str, default DEFAULT_TOKEN_LANGUAGE): Language code for the NLTK tokenizer (for example, "english").

Raises:

  • SplitterConfigException: If chunk_size is not a positive integer.

Notes

See the LangChain documentation for more details about splitting by tokens: https://python.langchain.com/docs/how_to/split_by_token/

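Because chunk_size is measured in tokens rather than characters, it can help to sanity-check a candidate value against your own text with the same encoding the tiktoken backend uses. A minimal sketch using tiktoken directly (assumes the tiktoken package is installed):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same encoding as "tiktoken/cl100k_base"
sample = "This is a demonstration of the TokenSplitter."
print(len(enc.encode(sample)))              # token count for this sample sentence
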
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
class TokenSplitter(BaseSplitter):
    """Split text into token-based chunks using multiple tokenizer backends.

    TokenSplitter splits a given text into chunks based on token counts
    derived from different tokenization models or libraries.

    This splitter supports tokenization via `tiktoken` (OpenAI tokenizer),
    `spacy` (spaCy tokenizer), and `nltk` (NLTK tokenizer). It allows splitting
    text into chunks of a maximum number of tokens (`chunk_size`), using the
    specified tokenizer model.

    Args:
        chunk_size: Maximum number of tokens per chunk.
        model_name: Tokenizer and model in the format ``tokenizer/model``.
            Supported tokenizers include:

            * ``tiktoken/cl100k_base`` (OpenAI tokenizer via tiktoken)
            * ``spacy/en_core_web_sm`` (spaCy English model)
            * ``nltk/punkt_tab`` (NLTK Punkt tokenizer variant)
        language: Language code for the NLTK tokenizer (for example, ``"english"``).

    Raises:
        SplitterConfigException: If ``chunk_size`` is not a positive integer.

    Notes:
        See the LangChain documentation for more details about splitting
        by tokens:
        https://python.langchain.com/docs/how_to/split_by_token/
    """

    def __init__(
        self,
        chunk_size: int = 1000,
        model_name: str = DEFAULT_TOKENIZER,
        language: str = DEFAULT_TOKEN_LANGUAGE,
    ):
        if chunk_size <= 0:
            raise SplitterConfigException(
                f"chunk_size must be a positive integer, got {chunk_size!r}."
            )

        super().__init__(chunk_size)
        self.model_name = model_name or DEFAULT_TOKENIZER
        self.language = language or DEFAULT_TOKEN_LANGUAGE

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """Split the input text into token-based chunks.

        The splitter uses the backend specified by ``model_name`` and
        delegates to a tokenizer-specific implementation:

        * tiktoken: Uses OpenAI encodings via
          ``RecursiveCharacterTextSplitter``.
        * spaCy: Uses the specified pipeline via ``SpacyTextSplitter``.
        * NLTK: Uses the Punkt sentence tokenizer via ``NLTKTextSplitter``.

        Models or language data are downloaded automatically if missing.

        Args:
            reader_output: Input text and associated metadata to be split.

        Returns:
            A ``SplitterOutput`` instance containing:

            * ``chunks``: List of token-based text chunks.
            * ``chunk_id``: Corresponding unique identifiers for each chunk.
            * Document metadata and splitter configuration parameters.

        Raises:
            SplitterConfigException:
                If ``model_name`` is malformed, the tokenizer backend is
                unsupported, or the requested model or language resources are
                unavailable.
            InvalidChunkException:
                If the underlying splitter returns an invalid chunks structure.

        Warns:
            SplitterInputWarning:
                If the input text is empty or whitespace-only.
            ChunkUnderflowWarning:
                If no chunks are produced from a non-empty input.

        Example:
            Basic usage with **tiktoken**:

            ```python
            from splitter_mr.splitter import TokenSplitter
            from splitter_mr.schema.models import ReaderOutput

            text = (
                "This is a demonstration of the TokenSplitter. "
                "It splits text into chunks based on token counts."
            )

            ro = ReaderOutput(text=text, document_name="demo.txt")
            splitter = TokenSplitter(
                chunk_size=20,
                model_name="tiktoken/cl100k_base",
            )
            output = splitter.split(ro)
            print(output.chunks)
            ```

            Using **spaCy**:

            ```python
            splitter = TokenSplitter(
                chunk_size=50,
                model_name="spacy/en_core_web_sm",
            )
            output = splitter.split(ro)
            print(output.chunks)
            ```

            Using **NLTK**:

            ```python
            splitter = TokenSplitter(
                chunk_size=40,
                model_name="nltk/punkt_tab",
                language="english",
            )
            output = splitter.split(ro)
            print(output.chunks)
            ```
        """
        text = reader_output.text or ""

        if not text.strip():
            warnings.warn(
                "TokenSplitter received empty or whitespace-only text; "
                "no chunks will be produced.",
                SplitterInputWarning,
            )

        tokenizer, model = self._parse_model()
        factory = self._get_splitter_factory(tokenizer)
        splitter = factory(model)

        chunks = splitter.split_text(text)

        if chunks is None:
            raise InvalidChunkException(
                "The underlying text splitter returned None instead of a list of chunks."
            )

        if not chunks:
            warnings.warn(
                "TokenSplitter produced no chunks for the given input.",
                ChunkUnderflowWarning,
            )

        chunk_ids = self._generate_chunk_ids(len(chunks))
        metadata = self._default_metadata()

        return SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="token_splitter",
            split_params={
                "chunk_size": self.chunk_size,
                "model_name": self.model_name,
                "language": self.language,
            },
            metadata=metadata,
        )

    # ---- Internal helpers ---- #

    @staticmethod
    def list_nltk_punkt_languages() -> List[str]:
        """Return a sorted list of available NLTK Punkt models.

        Returns:
            model_list(List[str]): A sorted list of language codes corresponding to available
            Punkt sentence tokenizer models in the local NLTK data path.
        """
        models = set()
        for base in map(Path, nltk.data.path):
            punkt_dir = base / "tokenizers" / "punkt"
            if punkt_dir.exists():
                models.update(f.stem for f in punkt_dir.glob("*.pickle"))
        model_list = sorted(models)
        return model_list

    def _parse_model(self) -> Tuple[str, str]:
        """Parse and validate the ``tokenizer/model`` string.

        Returns:
            A tuple ``(tokenizer, model)`` where ``tokenizer`` is the backend
            name (for example, ``"tiktoken"``, ``"spacy"``, or ``"nltk"``) and
            ``model`` is the corresponding model identifier.

        Raises:
            SplitterConfigException: If ``model_name`` is not in the
                ``tokenizer/model`` format.
        """
        if "/" not in self.model_name:
            raise SplitterConfigException(
                "model_name must be in the format 'tokenizer/model', "
                f"e.g. '{DEFAULT_TOKENIZER}'. Got: {self.model_name!r}"
            )
        tokenizer, model = self.model_name.split("/", 1)
        return tokenizer, model

    # ---- Tokenizer builders ---- #

    def _build_tiktoken_splitter(self, model: str) -> RecursiveCharacterTextSplitter:
        """Build a tiktoken-based text splitter.

        Args:
            model: The tiktoken encoding name (for example, ``"cl100k_base"``).

        Returns:
            A configured ``RecursiveCharacterTextSplitter`` that uses the
            specified tiktoken encoding.

        Raises:
            SplitterConfigException:
                If tiktoken encodings cannot be listed or if the requested
                encoding is not available.
        """
        try:
            available_models = tiktoken.list_encoding_names()
        except Exception as exc:  # defensive, backend failure
            raise SplitterConfigException(
                "Failed to list tiktoken encodings. "
                "Please ensure tiktoken is correctly installed."
            ) from exc

        if model not in available_models:
            raise SplitterConfigException(
                f"tiktoken encoding {model!r} is not available. "
                f"Available defaults include: {TIKTOKEN_DEFAULTS}. "
                f"Full list from tiktoken: {available_models}"
            )

        return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
            encoding_name=model,
            chunk_size=self.chunk_size,
            chunk_overlap=0,
        )

    def _build_spacy_splitter(self, model: str) -> SpacyTextSplitter:
        """Build a spaCy-based text splitter.

        Args:
            model: The spaCy pipeline name (for example, ``"en_core_web_sm"``).

        Returns:
            A configured ``SpacyTextSplitter`` that uses the specified spaCy
            model.

        Raises:
            SplitterConfigException:
                If the spaCy model cannot be downloaded or loaded.

        Warns:
            SplitterInputWarning: If ``chunk_size`` is so large that spaCy
                may require excessive memory.
        """
        if not spacy.util.is_package(model):
            try:
                spacy.cli.download(model)
            except Exception as exc:
                raise SplitterConfigException(
                    f"spaCy model {model!r} is not available for download. "
                    f"Common models include: {SPACY_DEFAULTS}"
                ) from exc

        try:
            spacy.load(model)
        except Exception as exc:
            raise SplitterConfigException(
                f"spaCy model {model!r} could not be loaded. "
                "Please verify that the installation is not corrupted."
            ) from exc

        MAX_SAFE_LENGTH = 1_000_000
        if self.chunk_size > MAX_SAFE_LENGTH:
            warnings.warn(
                "Configured chunk_size is very large; spaCy v2.x parser and NER "
                "models may require ~1GB of temporary memory per 100,000 characters.",
                SplitterInputWarning,
            )

        return SpacyTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=0,
            max_length=MAX_SAFE_LENGTH,
            pipeline=model,
        )

    def _build_nltk_splitter(self, _model: str) -> NLTKTextSplitter:
        """Build an NLTK-based text splitter.

        The ``_model`` argument is currently unused because the NLTK backend
        is controlled by ``language`` rather than by an explicit model name.
        It is kept for uniformity with other builder methods.

        Args:
            _model: Unused placeholder for tokenizer-specific model ID.

        Returns:
            A configured ``NLTKTextSplitter`` that uses the configured
            ``language`` for sentence tokenization.

        Raises:
            SplitterConfigException:
                If NLTK Punkt data cannot be found or downloaded.
        """
        punkt_relpath = Path("tokenizers") / "punkt" / f"{self.language}.pickle"
        try:
            nltk.data.find(str(punkt_relpath))
        except LookupError:
            try:
                nltk.download(DEFAULT_NLTK[0])
            except Exception as exc:
                raise SplitterConfigException(
                    "NLTK Punkt data could not be downloaded. "
                    f"Tried language {self.language!r} and default resource "
                    f"{DEFAULT_NLTK[0]!r}."
                ) from exc

        return NLTKTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=0,
            language=self.language,
        )

    def _get_splitter_factory(self, tokenizer: str) -> Callable[[str], object]:
        """Return the factory function for the given tokenizer backend.

        Args:
            tokenizer: The tokenizer backend name, such as ``"tiktoken"``,
                ``"spacy"``, or ``"nltk"``.

        Returns:
            A callable that accepts a model string and returns a configured
            text splitter instance.

        Raises:
            SplitterConfigException: If the tokenizer backend is not supported.
        """
        factories: dict[str, Callable[[str], object]] = {
            "tiktoken": self._build_tiktoken_splitter,
            "spacy": self._build_spacy_splitter,
            "nltk": self._build_nltk_splitter,
        }

        try:
            return factories[tokenizer]
        except KeyError:
            raise SplitterConfigException(
                f"Unsupported tokenizer {tokenizer!r}. "
                f"Supported tokenizers: {SUPPORTED_TOKENIZERS}"
            )
list_nltk_punkt_languages() staticmethod

Return a sorted list of available NLTK Punkt models.

Returns:

    model_list (List[str]): A sorted list of language codes corresponding to
    available Punkt sentence tokenizer models in the local NLTK data path.

Source code in src/splitter_mr/splitter/splitters/token_splitter.py
@staticmethod
def list_nltk_punkt_languages() -> List[str]:
    """Return a sorted list of available NLTK Punkt models.

    Returns:
        model_list(List[str]): A sorted list of language codes corresponding to available
        Punkt sentence tokenizer models in the local NLTK data path.
    """
    models = set()
    for base in map(Path, nltk.data.path):
        punkt_dir = base / "tokenizers" / "punkt"
        if punkt_dir.exists():
            models.update(f.stem for f in punkt_dir.glob("*.pickle"))
    model_list = sorted(models)
    return model_list
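
A minimal usage sketch (assuming some NLTK data has already been downloaded to the local NLTK data path; the returned list may be empty otherwise):

```python
from splitter_mr.splitter import TokenSplitter

# Inspect which Punkt sentence tokenizer models are installed locally.
print(TokenSplitter.list_nltk_punkt_languages())
```
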
split(reader_output)

Split the input text into token-based chunks.

The splitter uses the backend specified by model_name and delegates to a tokenizer-specific implementation:

  • tiktoken: Uses OpenAI encodings via RecursiveCharacterTextSplitter.
  • spaCy: Uses the specified pipeline via SpacyTextSplitter.
  • NLTK: Uses the Punkt sentence tokenizer via NLTKTextSplitter.

Models or language data are downloaded automatically if missing.

Parameters:

    reader_output (ReaderOutput, required): Input text and associated metadata to be split.

Returns:

    SplitterOutput: A SplitterOutput instance containing:

      • chunks: List of token-based text chunks.
      • chunk_id: Corresponding unique identifiers for each chunk.
      • Document metadata and splitter configuration parameters.

Raises:

    SplitterConfigException: If model_name is malformed, the tokenizer backend is unsupported, or the requested model or language resources are unavailable.
    InvalidChunkException: If the underlying splitter returns an invalid chunks structure.

Warns:

    SplitterInputWarning: If the input text is empty or whitespace-only.
    ChunkUnderflowWarning: If no chunks are produced from a non-empty input.

Example

Basic usage with tiktoken:

from splitter_mr.splitter import TokenSplitter
from splitter_mr.schema.models import ReaderOutput

text = (
    "This is a demonstration of the TokenSplitter. "
    "It splits text into chunks based on token counts."
)

ro = ReaderOutput(text=text, document_name="demo.txt")
splitter = TokenSplitter(
    chunk_size=20,
    model_name="tiktoken/cl100k_base",
)
output = splitter.split(ro)
print(output.chunks)

Using spaCy:

splitter = TokenSplitter(
    chunk_size=50,
    model_name="spacy/en_core_web_sm",
)
output = splitter.split(ro)
print(output.chunks)

Using NLTK:

splitter = TokenSplitter(
    chunk_size=40,
    model_name="nltk/punkt_tab",
    language="english",
)
output = splitter.split(ro)
print(output.chunks)
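
If you are unsure which tiktoken encodings are installed, a quick check (this mirrors the validation the splitter performs before building the tiktoken backend):

```python
import tiktoken

# The TokenSplitter validates the "tiktoken/<encoding>" model name against this list.
print(tiktoken.list_encoding_names())
```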
Source code in src/splitter_mr/splitter/splitters/token_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """Split the input text into token-based chunks.

    The splitter uses the backend specified by ``model_name`` and
    delegates to a tokenizer-specific implementation:

    * tiktoken: Uses OpenAI encodings via
      ``RecursiveCharacterTextSplitter``.
    * spaCy: Uses the specified pipeline via ``SpacyTextSplitter``.
    * NLTK: Uses the Punkt sentence tokenizer via ``NLTKTextSplitter``.

    Models or language data are downloaded automatically if missing.

    Args:
        reader_output: Input text and associated metadata to be split.

    Returns:
        A ``SplitterOutput`` instance containing:

        * ``chunks``: List of token-based text chunks.
        * ``chunk_id``: Corresponding unique identifiers for each chunk.
        * Document metadata and splitter configuration parameters.

    Raises:
        SplitterConfigException:
            If ``model_name`` is malformed, the tokenizer backend is
            unsupported, or the requested model or language resources are
            unavailable.
        InvalidChunkException:
            If the underlying splitter returns an invalid chunks structure.

    Warns:
        SplitterInputWarning:
            If the input text is empty or whitespace-only.
        ChunkUnderflowWarning:
            If no chunks are produced from a non-empty input.

    Example:
        Basic usage with **tiktoken**:

        ```python
        from splitter_mr.splitter import TokenSplitter
        from splitter_mr.schema.models import ReaderOutput

        text = (
            "This is a demonstration of the TokenSplitter. "
            "It splits text into chunks based on token counts."
        )

        ro = ReaderOutput(text=text, document_name="demo.txt")
        splitter = TokenSplitter(
            chunk_size=20,
            model_name="tiktoken/cl100k_base",
        )
        output = splitter.split(ro)
        print(output.chunks)
        ```

        Using **spaCy**:

        ```python
        splitter = TokenSplitter(
            chunk_size=50,
            model_name="spacy/en_core_web_sm",
        )
        output = splitter.split(ro)
        print(output.chunks)
        ```

        Using **NLTK**:

        ```python
        splitter = TokenSplitter(
            chunk_size=40,
            model_name="nltk/punkt_tab",
            language="english",
        )
        output = splitter.split(ro)
        print(output.chunks)
        ```
    """
    text = reader_output.text or ""

    if not text.strip():
        warnings.warn(
            "TokenSplitter received empty or whitespace-only text; "
            "no chunks will be produced.",
            SplitterInputWarning,
        )

    tokenizer, model = self._parse_model()
    factory = self._get_splitter_factory(tokenizer)
    splitter = factory(model)

    chunks = splitter.split_text(text)

    if chunks is None:
        raise InvalidChunkException(
            "The underlying text splitter returned None instead of a list of chunks."
        )

    if not chunks:
        warnings.warn(
            "TokenSplitter produced no chunks for the given input.",
            ChunkUnderflowWarning,
        )

    chunk_ids = self._generate_chunk_ids(len(chunks))
    metadata = self._default_metadata()

    return SplitterOutput(
        chunks=chunks,
        chunk_id=chunk_ids,
        document_name=reader_output.document_name,
        document_path=reader_output.document_path,
        document_id=reader_output.document_id,
        conversion_method=reader_output.conversion_method,
        reader_method=reader_output.reader_method,
        ocr_method=reader_output.ocr_method,
        split_method="token_splitter",
        split_params={
            "chunk_size": self.chunk_size,
            "model_name": self.model_name,
            "language": self.language,
        },
        metadata=metadata,
    )
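
The returned SplitterOutput also records how the split was performed. A small illustrative sketch, reusing the ro object from the tiktoken example above:

```python
output = TokenSplitter(chunk_size=20).split(ro)

print(output.split_method)  # "token_splitter"
print(output.split_params)  # {"chunk_size": 20, "model_name": ..., "language": ...}
```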

PagedSplitter

Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional character-based overlap between consecutive chunks.

PagedSplitter

Bases: BaseSplitter

Splits a multi-page document into page-based or multi-page chunks using a placeholder marker.

This splitter uses the page_placeholder field of :class:ReaderOutput to break the text into logical "pages" and then groups those pages into chunks. It can also introduce character-based overlap between consecutive chunks.

Parameters:

    chunk_size (int, default 1): Number of pages per chunk.
    chunk_overlap (int, default 0): Number of overlapping characters to include from the end of the previous chunk.

Raises:

    SplitterConfigException: If chunk_size is less than 1 or chunk_overlap is negative.

Warns:

    SplitterInputWarning: When the input text is empty or whitespace-only.
    SplitterOutputWarning: When no non-empty pages are found after splitting on the placeholder and the splitter falls back to a single empty chunk.
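
Because chunk_size here counts pages while chunk_overlap counts characters, a minimal construction sketch (values are illustrative):

```python
from splitter_mr.splitter.splitters import PagedSplitter

# Two pages per chunk; carry the last 10 characters of the previous chunk forward.
splitter = PagedSplitter(chunk_size=2, chunk_overlap=10)
```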

Source code in src/splitter_mr/splitter/splitters/paged_splitter.py
class PagedSplitter(BaseSplitter):
    """
    Splits a multi-page document into page-based or multi-page chunks using a placeholder marker.

    This splitter uses the ``page_placeholder`` field of :class:`ReaderOutput` to break
    the text into logical "pages" and then groups those pages into chunks. It can also
    introduce character-based overlap between consecutive chunks.

    Args:
        chunk_size (int): Number of pages per chunk.
        chunk_overlap (int): Number of overlapping characters to include from the end
            of the previous chunk.

    Raises:
        SplitterConfigException:
            If ``chunk_size`` is less than 1 or ``chunk_overlap`` is negative.

    Warnings:
        SplitterInputWarning:
            When the input text is empty or whitespace-only.
        SplitterOutputWarning:
            When no non-empty pages are found after splitting on the placeholder and
            the splitter falls back to a single empty chunk.
    """

    def __init__(self, chunk_size: int = 1, chunk_overlap: int = 0):
        if not isinstance(chunk_size, int) or chunk_size < 1:
            raise SplitterConfigException(
                "chunk_size must be a positive integer (>= 1)."
            )
        if not isinstance(chunk_overlap, int) or chunk_overlap < 0:
            raise SplitterConfigException(
                "chunk_overlap must be a non-negative integer (>= 0)."
            )

        # Note: PagedSplitter uses `chunk_size` as pages-per-chunk, not characters.
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    # ---- Main method --- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Split the input text into page-based chunks using the page placeholder.

        The splitting process is:

        1. Validate and normalise the :class:`ReaderOutput` and extract
           ``text`` / ``page_placeholder``.
        2. Split the text into pages using ``page_placeholder``.
        3. Group pages into chunks (with optional character-based overlap).
        4. Build the final :class:`SplitterOutput`.

        Args:
            reader_output (ReaderOutput): The output from a reader containing text,
                metadata, and a ``page_placeholder`` string.

        Returns:
            SplitterOutput: The result with chunks and related metadata.

        Raises:
            ReaderOutputException:
                If ``reader_output`` does not contain a valid ``text`` or
                ``page_placeholder`` field.
            InvalidChunkException:
                If the number of generated ``chunk_id`` values does not match the
                number of chunks.
            SplitterOutputException:
                If constructing :class:`SplitterOutput` fails unexpectedly.

        Warnings:
            SplitterInputWarning:
                When the input text is empty or whitespace-only.
            SplitterOutputWarning:
                When no non-empty pages are found after splitting on the placeholder
                and the splitter falls back to a single empty chunk.

        Example:
            **Basic usage** with a simple placeholder:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter.splitters import PagedSplitter

            text = "<!-- page -->Page 1<!-- page -->Page 2<!-- page -->Page 3"
            ro = ReaderOutput(
                text=text,
                page_placeholder="<!-- page -->",
                document_name="demo.txt",
                document_path="/tmp/demo.txt",
            )

            splitter = PagedSplitter(chunk_size=1, chunk_overlap=0)
            out = splitter.split(ro)

            print(out.chunks)
            ```
            ```python
            ['Page 1', 'Page 2', 'Page 3']
            ```

            Grouping **multiple pages** into a single chunk:

            ```python
            splitter = PagedSplitter(chunk_size=2)
            out = splitter.split(ro)

            print(out.chunks)
            ```
            ```python
            ['Page 1\\nPage 2', 'Page 3']
            ```

            Applying **character-based overlap** between chunks:

            ```python
            text = "<p>One</p><!-- page --><p>Two</p><!-- page --><p>Three</p>"
            ro = ReaderOutput(text=text, page_placeholder="<!-- page -->")

            # Overlap last 5 characters from each previous chunk
            splitter = PagedSplitter(chunk_size=1, chunk_overlap=5)
            out = splitter.split(ro)

            print(out.chunks)
            ```
            ```python
            ['<p>One</p>', 'ne</p><p>Two</p>', 'o</p><p>Three</p>']
            ```

            **Metadata propagation**:

            ```python
            ro = ReaderOutput(
                text="<!-- page -->A<!-- page -->B",
                page_placeholder="<!-- page -->",
                document_name="source.txt",
                document_path="/tmp/source.txt",
                document_id="abc123",
            )

            splitter = PagedSplitter(chunk_size=1)
            out = splitter.split(ro)

            print(out.document_name)
            ```
            ```python
            'source.txt'
            ```
            ```python
            print(out.split_method)
            ```
            ```python
            'paged_splitter'
            ```
            ```python
            print(out.split_params)
            ```
            ```python
            {'chunk_size': 1, 'chunk_overlap': 0}
            ```
        """
        text, page_placeholder = self._validate_reader_output(reader_output)
        pages = self._split_into_pages(text, page_placeholder)
        chunks = self._build_chunks(pages)

        try:
            return self._build_output(reader_output, chunks)
        except InvalidChunkException:
            raise
        except (TypeError, ValueError) as exc:
            raise SplitterOutputException(
                f"Failed to build SplitterOutput in PagedSplitter: {exc}"
            ) from exc

    # ---- Helpers ---- #

    def _validate_reader_output(self, reader_output: ReaderOutput) -> Tuple[str, str]:
        """
        Validate and normalise the incoming ReaderOutput.

        Ensures that ``page_placeholder`` and ``text`` are present and of the right
        type, and emits input-level warnings when appropriate.

        Raises:
            ReaderOutputException: On missing/invalid fields.
        """
        if not hasattr(reader_output, "page_placeholder"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'page_placeholder' attribute."
            )

        page_placeholder = reader_output.page_placeholder
        if not isinstance(page_placeholder, str) or not page_placeholder.strip():
            raise ReaderOutputException(
                "ReaderOutput.page_placeholder must be a non-empty string."
            )

        if not hasattr(reader_output, "text"):
            raise ReaderOutputException(
                "ReaderOutput object must expose a 'text' attribute."
            )

        text = reader_output.text

        if not text.strip():
            warnings.warn(
                "PagedSplitter received empty or whitespace-only text; "
                "resulting chunks will be empty.",
                SplitterInputWarning,
                stacklevel=3,
            )

        return text, page_placeholder

    def _split_into_pages(self, text: str, page_placeholder: str) -> List[str]:
        """
        Split the document text into normalised pages using the placeholder.

        Emits an output-level warning and returns an empty list if no pages could be
        derived.

        Warnings:
            SplitterOutputWarning: When no non-empty pages are found.
        """
        pages: List[str] = [
            page.strip() for page in text.split(page_placeholder) if page.strip()
        ]

        if not pages:
            warnings.warn(
                "PagedSplitter did not find any non-empty pages after splitting; "
                "returning a single empty chunk.",
                SplitterOutputWarning,
                stacklevel=3,
            )

        return pages

    def _build_chunks(self, pages: List[str]) -> List[str]:
        """
        Group pages into chunks, applying character-based overlap if configured.

        Guarantees that the returned list is never empty (fallback to ``['']``).
        """
        chunks: List[str] = []

        for i in range(0, len(pages), self.chunk_size):
            chunk = "\n".join(pages[i : i + self.chunk_size])

            if self.chunk_overlap > 0 and i > 0 and chunks:
                overlap_text = chunks[-1][-self.chunk_overlap :]
                chunk = overlap_text + chunk

            if chunk:
                chunks.append(chunk)

        if not chunks:
            chunks = [""]

        return chunks

    def _build_output(
        self,
        reader_output: ReaderOutput,
        chunks: List[str],
    ) -> SplitterOutput:
        """
        Assemble the final :class:`SplitterOutput` and perform consistency checks.

        Raises:
            InvalidChunkException:
                If the number of generated chunk IDs does not match the number of chunks.
        """
        chunk_ids = self._generate_chunk_ids(len(chunks))

        if len(chunk_ids) != len(chunks):
            raise InvalidChunkException(
                "Number of chunk IDs does not match number of chunks "
                f"(chunk_ids={len(chunk_ids)}, chunks={len(chunks)})."
            )

        metadata = self._default_metadata()

        try:
            return SplitterOutput(
                chunks=chunks,
                chunk_id=chunk_ids,
                document_name=reader_output.document_name,
                document_path=reader_output.document_path,
                document_id=reader_output.document_id,
                conversion_method=reader_output.conversion_method,
                reader_method=reader_output.reader_method,
                ocr_method=reader_output.ocr_method,
                split_method="paged_splitter",
                split_params={
                    "chunk_size": self.chunk_size,
                    "chunk_overlap": self.chunk_overlap,
                },
                metadata=metadata,
            )
        except Exception as exc:
            raise SplitterOutputException(f"Error trying to build response: {exc}")
split(reader_output)

Split the input text into page-based chunks using the page placeholder.

The splitting process is:

  1. Validate and normalise the ReaderOutput and extract text / page_placeholder.
  2. Split the text into pages using page_placeholder.
  3. Group pages into chunks (with optional character-based overlap).
  4. Build the final SplitterOutput.

Parameters:

    reader_output (ReaderOutput, required): The output from a reader containing text, metadata, and a page_placeholder string.

Returns:

    SplitterOutput: The result with chunks and related metadata.

Raises:

    ReaderOutputException: If reader_output does not contain a valid text or page_placeholder field.
    InvalidChunkException: If the number of generated chunk_id values does not match the number of chunks.
    SplitterOutputException: If constructing SplitterOutput fails unexpectedly.

Warns:

    SplitterInputWarning: When the input text is empty or whitespace-only.
    SplitterOutputWarning: When no non-empty pages are found after splitting on the placeholder and the splitter falls back to a single empty chunk.

Example

Basic usage with a simple placeholder:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters import PagedSplitter

text = "<!-- page -->Page 1<!-- page -->Page 2<!-- page -->Page 3"
ro = ReaderOutput(
    text=text,
    page_placeholder="<!-- page -->",
    document_name="demo.txt",
    document_path="/tmp/demo.txt",
)

splitter = PagedSplitter(chunk_size=1, chunk_overlap=0)
out = splitter.split(ro)

print(out.chunks)
['Page 1', 'Page 2', 'Page 3']

Grouping multiple pages into a single chunk:

splitter = PagedSplitter(chunk_size=2)
out = splitter.split(ro)

print(out.chunks)
['Page 1\nPage 2', 'Page 3']

Applying character-based overlap between chunks:

text = "<p>One</p><!-- page --><p>Two</p><!-- page --><p>Three</p>"
ro = ReaderOutput(text=text, page_placeholder="<!-- page -->")

# Overlap last 5 characters from each previous chunk
splitter = PagedSplitter(chunk_size=1, chunk_overlap=5)
out = splitter.split(ro)

print(out.chunks)
['<p>One</p>', 'ne</p><p>Two</p>', 'o</p><p>Three</p>']

Metadata propagation:

ro = ReaderOutput(
    text="<!-- page -->A<!-- page -->B",
    page_placeholder="<!-- page -->",
    document_name="source.txt",
    document_path="/tmp/source.txt",
    document_id="abc123",
)

splitter = PagedSplitter(chunk_size=1)
out = splitter.split(ro)

print(out.document_name)
'source.txt'
print(out.split_method)
'paged_splitter'
print(out.split_params)
{'chunk_size': 1, 'chunk_overlap': 0}

Source code in src/splitter_mr/splitter/splitters/paged_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Split the input text into page-based chunks using the page placeholder.

    The splitting process is:

    1. Validate and normalise the :class:`ReaderOutput` and extract
       ``text`` / ``page_placeholder``.
    2. Split the text into pages using ``page_placeholder``.
    3. Group pages into chunks (with optional character-based overlap).
    4. Build the final :class:`SplitterOutput`.

    Args:
        reader_output (ReaderOutput): The output from a reader containing text,
            metadata, and a ``page_placeholder`` string.

    Returns:
        SplitterOutput: The result with chunks and related metadata.

    Raises:
        ReaderOutputException:
            If ``reader_output`` does not contain a valid ``text`` or
            ``page_placeholder`` field.
        InvalidChunkException:
            If the number of generated ``chunk_id`` values does not match the
            number of chunks.
        SplitterOutputException:
            If constructing :class:`SplitterOutput` fails unexpectedly.

    Warnings:
        SplitterInputWarning:
            When the input text is empty or whitespace-only.
        SplitterOutputWarning:
            When no non-empty pages are found after splitting on the placeholder
            and the splitter falls back to a single empty chunk.

    Example:
        **Basic usage** with a simple placeholder:

        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter.splitters import PagedSplitter

        text = "<!-- page -->Page 1<!-- page -->Page 2<!-- page -->Page 3"
        ro = ReaderOutput(
            text=text,
            page_placeholder="<!-- page -->",
            document_name="demo.txt",
            document_path="/tmp/demo.txt",
        )

        splitter = PagedSplitter(chunk_size=1, chunk_overlap=0)
        out = splitter.split(ro)

        print(out.chunks)
        ```
        ```python
        ['Page 1', 'Page 2', 'Page 3']
        ```

        Grouping **multiple pages** into a single chunk:

        ```python
        splitter = PagedSplitter(chunk_size=2)
        out = splitter.split(ro)

        print(out.chunks)
        ```
        ```python
        ['Page 1\\nPage 2', 'Page 3']
        ```

        Applying **character-based overlap** between chunks:

        ```python
        text = "<p>One</p><!-- page --><p>Two</p><!-- page --><p>Three</p>"
        ro = ReaderOutput(text=text, page_placeholder="<!-- page -->")

        # Overlap last 5 characters from each previous chunk
        splitter = PagedSplitter(chunk_size=1, chunk_overlap=5)
        out = splitter.split(ro)

        print(out.chunks)
        ```
        ```python
        ['<p>One</p>', 'ne</p><p>Two</p>', 'o</p><p>Three</p>']
        ```

        **Metadata propagation**:

        ```python
        ro = ReaderOutput(
            text="<!-- page -->A<!-- page -->B",
            page_placeholder="<!-- page -->",
            document_name="source.txt",
            document_path="/tmp/source.txt",
            document_id="abc123",
        )

        splitter = PagedSplitter(chunk_size=1)
        out = splitter.split(ro)

        print(out.document_name)
        ```
        ```python
        'source.txt'
        ```
        ```python
        print(out.split_method)
        ```
        ```python
        'paged_splitter'
        ```
        ```python
        print(out.split_params)
        ```
        ```python
        {'chunk_size': 1, 'chunk_overlap': 0}
        ```
    """
    text, page_placeholder = self._validate_reader_output(reader_output)
    pages = self._split_into_pages(text, page_placeholder)
    chunks = self._build_chunks(pages)

    try:
        return self._build_output(reader_output, chunks)
    except InvalidChunkException:
        raise
    except (TypeError, ValueError) as exc:
        raise SplitterOutputException(
            f"Failed to build SplitterOutput in PagedSplitter: {exc}"
        ) from exc

SemanticSplitter

Splits text into chunks based on semantic similarity, using an embedding model and a minimum chunk size (in characters). Useful for producing semantically meaningful groupings.

SemanticSplitter

Bases: BaseSplitter

Split text into semantically coherent chunks using embedding similarity.

Pipeline:

  • Split text into sentences via SentenceSplitter (one sentence chunks).
  • Build a sliding window around each sentence (buffer_size).
  • Embed each window with BaseEmbedding (batched).
  • Compute cosine distances between consecutive windows (1 - cosine_sim).
  • Pick breakpoints using a thresholding strategy, or aim for number_of_chunks.
  • Join sentences between breakpoints; enforce minimum size via chunk_size.
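
The distance-and-breakpoint steps above can be illustrated with a minimal NumPy sketch (the window embeddings are random placeholders, not output from a real embedding backend):

```python
import numpy as np

# Placeholder window embeddings: one vector per sentence window.
emb = np.random.rand(6, 8)

# Cosine distance between consecutive windows: 1 - cosine similarity.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
distances = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)

# "percentile" strategy: cut wherever a distance exceeds the chosen percentile.
threshold = np.percentile(distances, 75.0)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # indices after which a new chunk would start
```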

Parameters:

    embedding (BaseEmbedding, required): Embedding backend implementing an embed_documents(texts: List[str]) method. Typically wraps a model from OpenAI, Azure, or a local embedding model.
    buffer_size (int, default 1): Number of neighbouring sentences to include on each side when building the contextual window for each sentence. A value of 1 means "current sentence plus one sentence to the left and one to the right" (where available).
    breakpoint_threshold_type (BreakpointThresholdType, default 'percentile'): Strategy used to decide where to place breakpoints. Supported values are:

      • "percentile" – cut where distances exceed a percentile of the distance distribution.
      • "standard_deviation" – cut where distances exceed mean + k * std.
      • "interquartile" – cut where distances exceed mean + k * IQR.
      • "gradient" – cut where the gradient of distances exceeds a percentile threshold.

    breakpoint_threshold_amount (Optional[float], default None): Strength of the threshold for the chosen strategy. Meaning depends on breakpoint_threshold_type:

      • For "percentile" / "gradient": a value in [0, 100] interpreted as a percentile, or a value in (0, 1] interpreted as a ratio and automatically scaled to [0, 100].
      • For "standard_deviation" / "interquartile": a finite multiplier k applied to the deviation term (std or IQR). If None, a default from DEFAULT_BREAKPOINTS is used.

    number_of_chunks (Optional[int], default None): Desired number of output chunks. When provided, the splitter selects the largest distances to approximate this target (subject to document length and chunk_size). Must be a positive, finite value; non-integers are allowed but will be truncated internally.
    chunk_size (int, default 1000): Minimum allowed chunk size in characters. Short segments below this size are merged forward to avoid excessively small, fragmented chunks.

Raises:

    SplitterConfigException:
      • If embedding does not provide an embed_documents method.
      • If buffer_size < 0.
      • If breakpoint_threshold_type is not supported.
      • If breakpoint_threshold_amount is invalid for the chosen strategy.
      • If number_of_chunks is non-positive or non-finite.

Warns:

    SplitterInputWarning:
      • If breakpoint_threshold_amount in (0, 1] is auto-scaled as a ratio to a percentile in [0, 100].
      • If number_of_chunks is not an integer; it will be truncated when used internally.
Source code in src/splitter_mr/splitter/splitters/semantic_splitter.py
class SemanticSplitter(BaseSplitter):
    """
    Split text into semantically coherent chunks using embedding similarity.

    **Pipeline:**

    - Split text into sentences via `SentenceSplitter` (one sentence chunks).
    - Build a sliding window around each sentence (`buffer_size`).
    - Embed each window with `BaseEmbedding` (batched).
    - Compute cosine *distances* between consecutive windows (1 - cosine_sim).
    - Pick breakpoints using a thresholding strategy, or aim for `number_of_chunks`.
    - Join sentences between breakpoints; enforce minimum size via `chunk_size`.

    Args:
        embedding: Embedding backend implementing an ``embed_documents(texts: List[str])``
            method. Typically wraps a model from OpenAI, Azure, or a local
            embedding model.
        buffer_size: Number of neighbouring sentences to include on each side
            when building the contextual window for each sentence. A value of
            ``1`` means "current sentence plus one sentence to the left and
            one to the right" (where available).
        breakpoint_threshold_type: Strategy used to decide where to place
            breakpoints. Supported values are:

            * ``"percentile"`` – cut where distances exceed a percentile
              of the distance distribution.
            * ``"standard_deviation"`` – cut where distances exceed
              ``mean + k * std``.
            * ``"interquartile"`` – cut where distances exceed
              ``mean + k * IQR``.
            * ``"gradient"`` – cut where the *gradient* of distances
              exceeds a percentile threshold.
        breakpoint_threshold_amount: Strength of the threshold for the
            chosen strategy. Meaning depends on ``breakpoint_threshold_type``:

            * For ``"percentile"`` / ``"gradient"``:
              value in ``[0, 100]`` interpreted as a percentile, or a
              value in ``(0, 1]`` interpreted as a ratio and automatically
              scaled to ``[0, 100]``.
            * For ``"standard_deviation"`` / ``"interquartile"``:
              finite multiplier ``k`` applied to the deviation term
              (std or IQR).
            If ``None``, a default from ``DEFAULT_BREAKPOINTS`` is used.
        number_of_chunks: Desired number of output chunks. When provided,
            the splitter selects the largest distances to approximate this
            target (subject to document length and `chunk_size`). Must be a
            positive, finite value; non-integers are allowed but will be
            truncated internally.
        chunk_size: Minimum allowed chunk size in characters. Short segments
            below this size are merged forward to avoid excessively small,
            fragmented chunks.

    Raises:
        SplitterConfigException:
            - If `embedding` does not provide an `embed_documents` method.
            - If `buffer_size < 0`.
            - If `breakpoint_threshold_type` is not supported.
            - If `breakpoint_threshold_amount` is invalid for the chosen strategy.
            - If `number_of_chunks` is non-positive or non-finite.

    Warnings:
        SplitterInputWarning:
            - If `breakpoint_threshold_amount` in (0, 1] is auto-scaled as
              a ratio to a percentile in [0, 100].
            - If `number_of_chunks` is not an integer; it will be truncated
              when used internally.
    """

    def __init__(
        self,
        embedding: BaseEmbedding,
        *,
        buffer_size: int = 1,
        breakpoint_threshold_type: BreakpointThresholdType = "percentile",
        breakpoint_threshold_amount: Optional[float] = None,
        number_of_chunks: Optional[int] = None,
        chunk_size: int = 1000,
    ) -> None:
        super().__init__(chunk_size=chunk_size)

        # Validate embedding backend
        if embedding is None or not hasattr(embedding, "embed_documents"):
            raise SplitterConfigException(
                "SemanticSplitter requires an embedding backend with an "
                "'embed_documents' method."
            )
        self.embedding = embedding

        # Validate buffer size
        if buffer_size < 0:
            raise SplitterConfigException("buffer_size must be >= 0.")
        self.buffer_size = int(buffer_size)

        # Validate breakpoint strategy
        valid_types = set(DEFAULT_BREAKPOINTS.keys())
        if breakpoint_threshold_type not in valid_types:
            raise SplitterConfigException(
                f"Invalid breakpoint_threshold_type={breakpoint_threshold_type!r}. "
                f"Expected one of {sorted(valid_types)}."
            )

        self.breakpoint_threshold_type = cast(
            BreakpointThresholdType, breakpoint_threshold_type
        )

        # Resolve threshold amount
        raw_amount = (
            DEFAULT_BREAKPOINTS[self.breakpoint_threshold_type]
            if breakpoint_threshold_amount is None
            else float(breakpoint_threshold_amount)
        )

        # Normalise / validate threshold amount per strategy
        if self.breakpoint_threshold_type in ("percentile", "gradient"):
            amt = float(raw_amount)
            if 0.0 < amt <= 1.0:
                # interpret as ratio -> scale to [0, 100]
                warnings.warn(
                    "SemanticSplitter: breakpoint_threshold_amount given in (0, 1]; "
                    "interpreting as a ratio and scaling to [0, 100] percent.",
                    SplitterInputWarning,
                )
                amt *= 100.0
            if not 0.0 <= amt <= 100.0:
                raise SplitterConfigException(
                    "For 'percentile' and 'gradient' strategies, "
                    "breakpoint_threshold_amount must be in [0, 100] "
                    "(or (0, 1] to be interpreted as a ratio)."
                )
            self.breakpoint_threshold_amount = amt
        else:
            # std-dev / IQR strategies: just require finite value
            if not np.isfinite(raw_amount):
                raise SplitterConfigException(
                    "breakpoint_threshold_amount must be finite."
                )
            self.breakpoint_threshold_amount = float(raw_amount)

        # Validate number_of_chunks
        if number_of_chunks is not None:
            if not np.isfinite(number_of_chunks) or number_of_chunks <= 0:
                raise SplitterConfigException(
                    "number_of_chunks must be a positive finite integer when provided."
                )
            if not float(number_of_chunks).is_integer():
                warnings.warn(
                    f"SemanticSplitter: number_of_chunks={number_of_chunks!r} is not "
                    "an integer; it will be truncated when used internally.",
                    SplitterInputWarning,
                )
        self.number_of_chunks = number_of_chunks

        self._sentence_splitter = SentenceSplitter(
            chunk_size=1, chunk_overlap=0, separators=[".", "!", "?"]
        )

    # ---- Main method ---- #

    def split(self, reader_output: ReaderOutput) -> SplitterOutput:
        """
        Split the document text into semantically coherent chunks.

        This method uses sentence embeddings to find semantic breakpoints.
        Sentences are embedded in overlapping windows (controlled by `buffer_size`),
        then cosine distances between consecutive windows are used to detect topic
        shifts. Breakpoints are determined using either a threshold strategy
        (percentile, std-dev, IQR, gradient) or by targeting a number of chunks.

        Args:
            reader_output (ReaderOutput): Input text and associated metadata.

        Returns:
            SplitterOutput: Structured splitter output containing:
                * ``chunks`` — list of semantically grouped text segments.
                * ``chunk_id`` — corresponding unique identifiers.
                * document metadata and splitter parameters.

        Raises:
            ReaderOutputException:
                If the provided text is empty, None, or otherwise invalid.
            SplitterConfigException:
                If an invalid configuration is detected at runtime
                (defensive re-checks).
            SplitterOutputException:
                - If sentence splitting fails unexpectedly.
                - If the embedding backend fails or returns invalid shapes.
                - If non-finite distances/gradients are produced.
                - If post-processing of distances fails.

        Warnings:
            SplitterInputWarning:
                - If certain configuration values are auto-normalised (see __init__).
            SplitterOutputWarning:
                - If no semantic breakpoints are detected and a single chunk is
                  returned for multi-sentence input.
                - If the requested `number_of_chunks` is larger than the maximum
                  achievable for the given document.
                - If all candidate cuts are rejected due to `chunk_size`, resulting
                  in a single merged chunk.

        Notes:
            - With a single sentence (or 2 in gradient mode), returns text as-is.
            - ``chunk_size`` acts as the *minimum* allowed chunk size; small
              segments are merged forward.
            - The `buffer_size` defines how much contextual overlap each sentence
              has for embedding (e.g., 1 = one sentence on either side).

        Example:
            **Basic usage** with a **custom embedding backend**:

            ```python
            from splitter_mr.schema import ReaderOutput
            from splitter_mr.splitter.splitters.semantic_splitter import SemanticSplitter
            from splitter_mr.embedding import BaseEmbedding

            class DummyEmbedding(BaseEmbedding):
                \"\"\"Minimal embedding backend for demonstration purposes.\"\"\"
                model_name = "dummy-semantic-model"

                def embed_documents(self, texts: list[str]) -> list[list[float]]:
                    # Return a simple fixed-length vector per text
                    dim = 8
                    return [[float(i) for i in range(dim)] for _ in texts]

            text = (
                "Cats like to sleep in the sun. "
                "They often chase laser pointers. "
                "Neural networks can classify animal images. "
                "Transformers are widely used in NLP."
            )

            ro = ReaderOutput(text=text, document_name="semantic_demo.txt")

            splitter = SemanticSplitter(
                embedding=DummyEmbedding(),
                buffer_size=1,
                breakpoint_threshold_type="percentile",
                breakpoint_threshold_amount=75.0,
                chunk_size=50,
            )

            output = splitter.split(ro)

            print(output.chunks)
            ```

            Targeting a **specific number of chunks**:

            ```python
            splitter = SemanticSplitter(
                embedding=DummyEmbedding(),
                buffer_size=1,
                number_of_chunks=3,
                chunk_size=40,
            )

            output = splitter.split(ro)
            print(output.chunks)          # ~3 semantic chunks (subject to document length)
            print(output.split_method)    # "semantic_splitter"
            print(output.split_params)    # includes threshold config and model name
            ```
        """
        text: str = reader_output.text
        if text is None or text.strip() == "":
            raise ReaderOutputException("ReaderOutput.text is empty or None.")

        sentences: list[str] = self._split_into_sentences(reader_output)

        # Edge cases where thresholds aren't meaningful
        if len(sentences) <= 1:
            chunks = sentences if sentences else [text]
        elif self.breakpoint_threshold_type == "gradient" and len(sentences) == 2:
            chunks = sentences
        else:
            distances, sentence_dicts = self._calculate_sentence_distances(sentences)

            indices_above: list[int]

            if self.number_of_chunks is not None and distances:
                # Warn if target number_of_chunks is unattainable
                max_possible = len(distances) + 1
                if self.number_of_chunks > max_possible:
                    warnings.warn(
                        "SemanticSplitter: requested number_of_chunks="
                        f"{self.number_of_chunks} is larger than the maximum "
                        f"possible ({max_possible}); using {max_possible} instead.",
                        SplitterOutputWarning,
                    )

                # Pick top (k-1) distances as breakpoints
                k = int(self.number_of_chunks)
                m = max(0, min(k - 1, len(distances)))  # number of cuts to make
                if m == 0:
                    indices_above = []  # single chunk
                else:
                    # indices of the m largest distances (breaks), sorted in ascending order
                    idxs = np.argsort(np.asarray(distances))[-m:]
                    indices_above = sorted(int(i) for i in idxs.tolist())
            else:
                threshold, ref_array = self._calculate_breakpoint_threshold(distances)
                indices_above = [
                    i for i, val in enumerate(ref_array) if val > threshold
                ]

            # Warn if no breakpoints found (but only when >1 chunk requested)
            if (
                not indices_above  # noqa: W503
                and len(sentences) > 1  # noqa: W503
                and (self.number_of_chunks is None or self.number_of_chunks > 1)  # noqa: W503
            ):
                warnings.warn(
                    "SemanticSplitter did not detect any semantic breakpoints; "
                    "returning a single chunk.",
                    SplitterOutputWarning,
                )

            chunks = []
            start_idx = 0

            for idx in indices_above:
                end = idx + 1  # inclusive slice end
                candidate = " ".join(
                    d["sentence"] for d in sentence_dicts[start_idx:end]
                ).strip()
                if len(candidate) < self.chunk_size:
                    # too small: keep accumulating (do NOT move start_idx)
                    continue
                chunks.append(candidate)
                start_idx = end

            # Tail (always emit whatever remains)
            if start_idx < len(sentence_dicts):
                tail = " ".join(
                    d["sentence"] for d in sentence_dicts[start_idx:]
                ).strip()
                if tail:
                    chunks.append(tail)

            if not chunks:
                chunks = [" ".join(sentences).strip() or (reader_output.text or "")]

        # Warn if everything got merged into a single chunk due to chunk_size
        if (
            len(chunks) == 1  # noqa: W503
            and len(" ".join(sentences)) >= self.chunk_size  # noqa: W503
            and len(sentences) > 1  # noqa: W503
        ):
            warnings.warn(
                "SemanticSplitter merged all sentences into a single chunk because "
                "no candidate segments met the minimum chunk_size.",
                SplitterOutputWarning,
            )

        # Append chunk_ids and metadata
        chunk_ids = self._generate_chunk_ids(len(chunks))
        metadata = self._default_metadata()
        model_name = getattr(self.embedding, "model_name", None)

        # Produce output
        return SplitterOutput(
            chunks=chunks,
            chunk_id=chunk_ids,
            document_name=reader_output.document_name,
            document_path=reader_output.document_path,
            document_id=reader_output.document_id,
            conversion_method=reader_output.conversion_method,
            reader_method=reader_output.reader_method,
            ocr_method=reader_output.ocr_method,
            split_method="semantic_splitter",
            split_params={
                "buffer_size": self.buffer_size,
                "breakpoint_threshold_type": self.breakpoint_threshold_type,
                "breakpoint_threshold_amount": self.breakpoint_threshold_amount,
                "number_of_chunks": self.number_of_chunks,
                "chunk_size": self.chunk_size,
                "model_name": model_name,
            },
            metadata=metadata,
        )

    # ---- Internal helpers ---- #

    def _split_into_sentences(self, reader_output: ReaderOutput) -> List[str]:
        """Split the input text into sentences using `SentenceSplitter` (no overlap).

        Args:
            reader_output (ReaderOutput): The document to split.

        Returns:
            List[str]: List of sentences preserving punctuation.

        Raises:
            SplitterOutputException: If the underlying sentence splitter fails
                in an unexpected way.
        """
        try:
            sent_out = self._sentence_splitter.split(reader_output)
        except SplitterOutputException:
            # Propagate domain-specific splitter failures as-is
            raise
        except Exception as exc:  # pragma: no cover - defensive
            raise SplitterOutputException(
                f"Sentence splitting failed in SemanticSplitter: {exc}"
            ) from exc
        return sent_out.chunks

    def _calculate_sentence_distances(
        self, single_sentences: List[str]
    ) -> Tuple[List[float], List[Dict[str, Any]]]:
        """Embed sentence windows (batch) and compute consecutive cosine distances.

        Args:
            single_sentences (List[str]): Sentences in order.

        Returns:
            Tuple[List[float], List[Dict[str, Any]]]:
                - distances between consecutive windows (len = n-1)
                - sentence dicts enriched with combined text and embeddings

        Raises:
            SplitterOutputException:
                - If the embedding backend fails during `embed_documents`.
                - If the number of returned embeddings does not match the number
                  of windows.
                - If non-finite (NaN/inf) distances are produced.
        """
        # Prepare sentence dicts and combine with buffer
        sentences = [
            {"sentence": s, "index": i} for i, s in enumerate(single_sentences)
        ]
        sentences = _combine_sentences(sentences, self.buffer_size)

        # Batch embed all combined sentences
        windows = [item["combined_sentence"] for item in sentences]
        try:
            embeddings = self.embedding.embed_documents(windows)
        except Exception as exc:  # pragma: no cover - defensive
            raise SplitterOutputException(
                f"Embedding backend failed during SemanticSplitter: {exc}"
            ) from exc

        if len(embeddings) != len(sentences):
            raise SplitterOutputException(
                "Embedding backend returned a number of vectors that does not match "
                f"the number of windows in SemanticSplitter "
                f"({len(embeddings)} embeddings for {len(sentences)} windows)."
            )

        for item, emb in zip(sentences, embeddings):
            item["combined_sentence_embedding"] = emb

        # Distances (1 - cosine similarity) between consecutive windows
        n = len(sentences)
        if n <= 1:
            return [], sentences

        distances: List[float] = []
        for i in range(n - 1):
            sim = _cosine_similaritynp(
                sentences[i]["combined_sentence_embedding"],
                sentences[i + 1]["combined_sentence_embedding"],
            )
            dist = 1.0 - sim
            distances.append(dist)
            sentences[i]["distance_to_next"] = dist

        distances_arr = np.asarray(distances, dtype=np.float64)
        if not np.all(np.isfinite(distances_arr)):
            raise SplitterOutputException(
                "Non-finite values (NaN/inf) encountered in semantic distances; "
                "embedding backend produced invalid vectors."
            )

        return distances_arr.tolist(), sentences

    def _threshold_from_clusters(self, distances: List[float]) -> float:
        """Estimate a percentile threshold to reach `number_of_chunks`.

        Maps desired chunks x∈[1, len(distances)] to percentile y∈[100, 0].

        Args:
            distances (List[float]): Consecutive distances.

        Returns:
            float: Threshold value as a percentile over `distances`.
        """
        assert self.number_of_chunks is not None
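        # Linear map from number_of_chunks to a percentile over `distances`.
        # Worked example (illustrative): with 9 distances and number_of_chunks=3,
        # y = 0 + (100 / (1 - 9)) * (3 - 9) = 75, i.e. use the 75th percentile.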
        x1, y1 = float(len(distances)), 0.0
        x2, y2 = 1.0, 100.0
        x = max(min(float(self.number_of_chunks), x1), x2)
        y = y1 + ((y2 - y1) / (x2 - x1)) * (x - x1) if x2 != x1 else y2
        y = float(np.clip(y, 0.0, 100.0))
        return float(np.percentile(distances, y)) if distances else 0.0

    def _calculate_breakpoint_threshold(
        self, distances: List[float]
    ) -> Tuple[float, List[float]]:
        """Compute the breakpoint threshold and reference array per selected strategy.

        Args:
            distances (List[float]): Consecutive distances between windows.

        Returns:
            Tuple[float, List[float]]: (threshold, reference_array)
                If strategy == "gradient", reference_array is the gradient;
                otherwise it's `distances`.

        Raises:
            SplitterOutputException:
                If non-finite values are detected in the distance or gradient arrays.
            SplitterConfigException:
                If an unexpected `breakpoint_threshold_type` is encountered.
        """
        if not distances:
            return 0.0, distances

        arr = np.asarray(distances, dtype=np.float64)
        if not np.all(np.isfinite(arr)):
            raise SplitterOutputException(
                "Non-finite values (NaN/inf) encountered in distances when "
                "computing breakpoint threshold."
            )

        if self.breakpoint_threshold_type == "percentile":
            return (
                float(np.percentile(arr, self.breakpoint_threshold_amount)),
                arr.tolist(),
            )

        if self.breakpoint_threshold_type == "standard_deviation":
            mu = float(np.mean(arr))
            sd = float(np.std(arr))
            return mu + self.breakpoint_threshold_amount * sd, arr.tolist()

        if self.breakpoint_threshold_type == "interquartile":
            q1, q3 = np.percentile(arr, [25.0, 75.0])
            iqr = float(q3 - q1)
            mu = float(np.mean(arr))
            return mu + self.breakpoint_threshold_amount * iqr, arr.tolist()

        if self.breakpoint_threshold_type == "gradient":
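            # Use the rate of change of the distances: breakpoints are placed where
            # the distance to the next window is rising most steeply (gradient above
            # the chosen percentile), rather than where the raw distance is largest.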
            grads_arr = np.gradient(arr)
            if not np.all(np.isfinite(grads_arr)):
                raise SplitterOutputException(
                    "Non-finite values (NaN/inf) encountered in gradient distances."
                )
            grads = grads_arr.tolist()
            thr = float(np.percentile(grads_arr, self.breakpoint_threshold_amount))
            return thr, grads  # use gradient array as the reference

        # Should be prevented by __init__, but keep as a defensive guard
        raise SplitterConfigException(
            f"Unexpected breakpoint_threshold_type: {self.breakpoint_threshold_type}"
        )
split(reader_output)

Split the document text into semantically coherent chunks.

This method uses sentence embeddings to find semantic breakpoints. Sentences are embedded in overlapping windows (controlled by buffer_size), then cosine distances between consecutive windows are used to detect topic shifts. Breakpoints are determined using either a threshold strategy (percentile, std-dev, IQR, gradient) or by targeting a number of chunks.
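
The core idea can be sketched in a few lines (a standalone illustration, not the library's internal code): compute 1 - cosine similarity between consecutive window embeddings, then cut wherever the distance exceeds the chosen percentile. The toy embeddings and the 95th-percentile setting below are illustrative assumptions.

import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative window embeddings (in practice these come from an embedding model).
window_embeddings = [
    [0.9, 0.1], [0.88, 0.12],   # first topic
    [0.1, 0.95], [0.12, 0.9],   # second topic
]

distances = [
    cosine_distance(window_embeddings[i], window_embeddings[i + 1])
    for i in range(len(window_embeddings) - 1)
]

threshold = np.percentile(distances, 95.0)   # "percentile" strategy
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [1]: cut after the second window, where the topic shifts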

Parameters:

    reader_output (ReaderOutput): Input text and associated metadata. Required.

Returns:

    SplitterOutput: Structured splitter output containing:
      • chunks — list of semantically grouped text segments.
      • chunk_id — corresponding unique identifiers.
      • document metadata and splitter parameters.

Raises:

    ReaderOutputException: If the provided text is empty, None, or otherwise invalid.
    SplitterConfigException: If an invalid configuration is detected at runtime (defensive re-checks).
    SplitterOutputException:
      • If sentence splitting fails unexpectedly.
      • If the embedding backend fails or returns invalid shapes.
      • If non-finite distances/gradients are produced.
      • If post-processing of distances fails.

Warns:

    SplitterInputWarning:
      • If certain configuration values are auto-normalised (see __init__).
    SplitterOutputWarning:
      • If no semantic breakpoints are detected and a single chunk is returned for multi-sentence input.
      • If the requested number_of_chunks is larger than the maximum achievable for the given document.
      • If all candidate cuts are rejected due to chunk_size, resulting in a single merged chunk.

Notes:

    • With a single sentence (or 2 in gradient mode), the text is returned as-is.
    • chunk_size acts as the minimum allowed chunk size; small segments are merged forward.
    • buffer_size defines how much contextual overlap each sentence has for embedding (e.g., 1 = one sentence on either side); see the sketch below.
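
As a rough, standalone sketch of the buffer_size windowing described in the notes above (illustrative only; the library's internal helper may differ):

def combine_with_buffer(sentences, buffer_size=1):
    # Each window is the sentence plus `buffer_size` neighbours on each side.
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows

print(combine_with_buffer(["A.", "B.", "C."], buffer_size=1))
# ['A. B.', 'A. B. C.', 'B. C.']

Each window, rather than the bare sentence, is what gets embedded; a larger buffer generally smooths the distance curve at the cost of less sharply localised breakpoints.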
Example

Basic usage with a custom embedding backend:

from splitter_mr.schema import ReaderOutput
from splitter_mr.splitter.splitters.semantic_splitter import SemanticSplitter
from splitter_mr.embedding import BaseEmbedding

class DummyEmbedding(BaseEmbedding):
    """Minimal embedding backend for demonstration purposes."""
    model_name = "dummy-semantic-model"

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Return a simple fixed-length vector per text
        dim = 8
        return [[float(i) for i in range(dim)] for _ in texts]

text = (
    "Cats like to sleep in the sun. "
    "They often chase laser pointers. "
    "Neural networks can classify animal images. "
    "Transformers are widely used in NLP."
)

ro = ReaderOutput(text=text, document_name="semantic_demo.txt")

splitter = SemanticSplitter(
    embedding=DummyEmbedding(),
    buffer_size=1,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75.0,
    chunk_size=50,
)

output = splitter.split(ro)

print(output.chunks)

Targeting a specific number of chunks:

splitter = SemanticSplitter(
    embedding=DummyEmbedding(),
    buffer_size=1,
    number_of_chunks=3,
    chunk_size=40,
)

output = splitter.split(ro)
print(output.chunks)          # ~3 semantic chunks (subject to document length)
print(output.split_method)    # "semantic_splitter"
print(output.split_params)    # includes threshold config and model name
Source code in src/splitter_mr/splitter/splitters/semantic_splitter.py
def split(self, reader_output: ReaderOutput) -> SplitterOutput:
    """
    Split the document text into semantically coherent chunks.

    This method uses sentence embeddings to find semantic breakpoints.
    Sentences are embedded in overlapping windows (controlled by `buffer_size`),
    then cosine distances between consecutive windows are used to detect topic
    shifts. Breakpoints are determined using either a threshold strategy
    (percentile, std-dev, IQR, gradient) or by targeting a number of chunks.

    Args:
        reader_output (ReaderOutput): Input text and associated metadata.

    Returns:
        SplitterOutput: Structured splitter output containing:
            * ``chunks`` — list of semantically grouped text segments.
            * ``chunk_id`` — corresponding unique identifiers.
            * document metadata and splitter parameters.

    Raises:
        ReaderOutputException:
            If the provided text is empty, None, or otherwise invalid.
        SplitterConfigException:
            If an invalid configuration is detected at runtime
            (defensive re-checks).
        SplitterOutputException:
            - If sentence splitting fails unexpectedly.
            - If the embedding backend fails or returns invalid shapes.
            - If non-finite distances/gradients are produced.
            - If post-processing of distances fails.

    Warnings:
        SplitterInputWarning:
            - If certain configuration values are auto-normalised (see __init__).
        SplitterOutputWarning:
            - If no semantic breakpoints are detected and a single chunk is
              returned for multi-sentence input.
            - If the requested `number_of_chunks` is larger than the maximum
              achievable for the given document.
            - If all candidate cuts are rejected due to `chunk_size`, resulting
              in a single merged chunk.

    Notes:
        - With a single sentence (or 2 in gradient mode), returns text as-is.
        - ``chunk_size`` acts as the *minimum* allowed chunk size; small
          segments are merged forward.
        - The `buffer_size` defines how much contextual overlap each sentence
          has for embedding (e.g., 1 = one sentence on either side).

    Example:
        **Basic usage** with a **custom embedding backend**:

        ```python
        from splitter_mr.schema import ReaderOutput
        from splitter_mr.splitter.splitters.semantic_splitter import SemanticSplitter
        from splitter_mr.embedding import BaseEmbedding

        class DummyEmbedding(BaseEmbedding):
            \"\"\"Minimal embedding backend for demonstration purposes.\"\"\"
            model_name = "dummy-semantic-model"

            def embed_documents(self, texts: list[str]) -> list[list[float]]:
                # Return a simple fixed-length vector per text
                dim = 8
                return [[float(i) for i in range(dim)] for _ in texts]

        text = (
            "Cats like to sleep in the sun. "
            "They often chase laser pointers. "
            "Neural networks can classify animal images. "
            "Transformers are widely used in NLP."
        )

        ro = ReaderOutput(text=text, document_name="semantic_demo.txt")

        splitter = SemanticSplitter(
            embedding=DummyEmbedding(),
            buffer_size=1,
            breakpoint_threshold_type="percentile",
            breakpoint_threshold_amount=75.0,
            chunk_size=50,
        )

        output = splitter.split(ro)

        print(output.chunks)
        ```

        Targeting a **specific number of chunks**:

        ```python
        splitter = SemanticSplitter(
            embedding=DummyEmbedding(),
            buffer_size=1,
            number_of_chunks=3,
            chunk_size=40,
        )

        output = splitter.split(ro)
        print(output.chunks)          # ~3 semantic chunks (subject to document length)
        print(output.split_method)    # "semantic_splitter"
        print(output.split_params)    # includes threshold config and model name
        ```
    """
    text: str = reader_output.text
    if text is None or text.strip() == "":
        raise ReaderOutputException("ReaderOutput.text is empty or None.")

    sentences: list[str] = self._split_into_sentences(reader_output)

    # Edge cases where thresholds aren't meaningful
    if len(sentences) <= 1:
        chunks = sentences if sentences else [text]
    elif self.breakpoint_threshold_type == "gradient" and len(sentences) == 2:
        chunks = sentences
    else:
        distances, sentence_dicts = self._calculate_sentence_distances(sentences)

        indices_above: list[int]

        if self.number_of_chunks is not None and distances:
            # Warn if target number_of_chunks is unattainable
            max_possible = len(distances) + 1
            if self.number_of_chunks > max_possible:
                warnings.warn(
                    "SemanticSplitter: requested number_of_chunks="
                    f"{self.number_of_chunks} is larger than the maximum "
                    f"possible ({max_possible}); using {max_possible} instead.",
                    SplitterOutputWarning,
                )

            # Pick top (k-1) distances as breakpoints
            k = int(self.number_of_chunks)
            m = max(0, min(k - 1, len(distances)))  # number of cuts to make
            if m == 0:
                indices_above = []  # single chunk
            else:
                # indices of the m largest distances (breaks), sorted in ascending order
                idxs = np.argsort(np.asarray(distances))[-m:]
                indices_above = sorted(int(i) for i in idxs.tolist())
        else:
            threshold, ref_array = self._calculate_breakpoint_threshold(distances)
            indices_above = [
                i for i, val in enumerate(ref_array) if val > threshold
            ]

        # Warn if no breakpoints found (but only when >1 chunk requested)
        if (
            not indices_above  # noqa: W503
            and len(sentences) > 1  # noqa: W503
            and (self.number_of_chunks is None or self.number_of_chunks > 1)  # noqa: W503
        ):
            warnings.warn(
                "SemanticSplitter did not detect any semantic breakpoints; "
                "returning a single chunk.",
                SplitterOutputWarning,
            )

        chunks = []
        start_idx = 0

        for idx in indices_above:
            end = idx + 1  # inclusive slice end
            candidate = " ".join(
                d["sentence"] for d in sentence_dicts[start_idx:end]
            ).strip()
            if len(candidate) < self.chunk_size:
                # too small: keep accumulating (do NOT move start_idx)
                continue
            chunks.append(candidate)
            start_idx = end

        # Tail (always emit whatever remains)
        if start_idx < len(sentence_dicts):
            tail = " ".join(
                d["sentence"] for d in sentence_dicts[start_idx:]
            ).strip()
            if tail:
                chunks.append(tail)

        if not chunks:
            chunks = [" ".join(sentences).strip() or (reader_output.text or "")]

    # Warn if everything got merged into a single chunk due to chunk_size
    if (
        len(chunks) == 1  # noqa: W503
        and len(" ".join(sentences)) >= self.chunk_size  # noqa: W503
        and len(sentences) > 1  # noqa: W503
    ):
        warnings.warn(
            "SemanticSplitter merged all sentences into a single chunk because "
            "no candidate segments met the minimum chunk_size.",
            SplitterOutputWarning,
        )

    # Append chunk_ids and metadata
    chunk_ids = self._generate_chunk_ids(len(chunks))
    metadata = self._default_metadata()
    model_name = getattr(self.embedding, "model_name", None)

    # Produce output
    return SplitterOutput(
        chunks=chunks,
        chunk_id=chunk_ids,
        document_name=reader_output.document_name,
        document_path=reader_output.document_path,
        document_id=reader_output.document_id,
        conversion_method=reader_output.conversion_method,
        reader_method=reader_output.reader_method,
        ocr_method=reader_output.ocr_method,
        split_method="semantic_splitter",
        split_params={
            "buffer_size": self.buffer_size,
            "breakpoint_threshold_type": self.breakpoint_threshold_type,
            "breakpoint_threshold_amount": self.breakpoint_threshold_amount,
            "number_of_chunks": self.number_of_chunks,
            "chunk_size": self.chunk_size,
            "model_name": model_name,
        },
        metadata=metadata,
    )