
Warnings

SplitterMR uses standard Python UserWarning subclasses to alert users about suspicious inputs, ambiguous file types, or heuristic fallbacks that do not halt execution but may affect data quality.

Warnings live in:

from splitter_mr.schema.warnings import *

Why a custom warning hierarchy?

  • Granular filtering: You can ignore specific categories (e.g., FiletypeAmbiguityWarning) while keeping others active using Python's warnings filter.
  • Process observability: Distinguishes between input data issues (SplitterInputWarning) and processing results (SplitterOutputWarning).
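As a minimal sketch of granular filtering, the snippet below silences one category while leaving another active. The classes are local stand-ins that mirror the documented hierarchy so the example runs without SplitterMR installed; in real code they would come from `splitter_mr.schema.warnings`. The `emit_all` helper is hypothetical and exists only to trigger the warnings.

```python
import warnings


# Local stand-ins mirroring the documented hierarchy (assumption: in real
# code, import these from splitter_mr.schema.warnings instead).
class BaseReaderWarning(UserWarning):
    pass


class FiletypeAmbiguityWarning(BaseReaderWarning):
    pass


class BaseSplitterWarning(UserWarning):
    pass


class SplitterInputWarning(BaseSplitterWarning):
    pass


def emit_all() -> None:
    """Hypothetical helper that emits one warning of each kind."""
    warnings.warn("extension and content disagree", FiletypeAmbiguityWarning)
    warnings.warn("input text is empty", SplitterInputWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # show everything by default
    # Filters added later take precedence, so this category is dropped.
    warnings.simplefilter("ignore", category=FiletypeAmbiguityWarning)
    emit_all()

caught_categories = [w.category.__name__ for w in caught]
print(caught_categories)
```

Only the splitter warning is recorded; the ignored reader warning never reaches the list.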

Reader warnings

Readers emit BaseReaderWarning (or subclasses) when input files are ambiguous or minor issues occur during ingestion.

Hierarchy

`BaseReaderWarning` (`UserWarning`)
└── `FiletypeAmbiguityWarning`

General

BaseReaderWarning

Bases: UserWarning

Base class for all reader warnings.

Source code in src/splitter_mr/schema/warnings.py
class BaseReaderWarning(UserWarning):
    """
    Base Warning class to all Reader exceptions
    """

    pass

I/O and Heuristics

FiletypeAmbiguityWarning

Bases: BaseReaderWarning

Warned when filetype heuristics disagree (extension vs DOM sniff).

Source code in src/splitter_mr/schema/warnings.py
class FiletypeAmbiguityWarning(BaseReaderWarning):
    """
    Warned when filetype heuristics disagree (extension vs DOM sniff).
    """

Typical cases:

  • A file has a .json extension but contains HTML content.
  • MIME type sniffing disagrees with the provided file extension.
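The first case above can be illustrated with a toy check. This is a sketch under stated assumptions, not SplitterMR's actual heuristic: `guess_filetype` is a hypothetical function, and the warning classes are local stand-ins for the ones in `splitter_mr.schema.warnings`.

```python
import warnings
from pathlib import Path


# Local stand-ins for the documented reader warnings (assumption).
class BaseReaderWarning(UserWarning):
    pass


class FiletypeAmbiguityWarning(BaseReaderWarning):
    pass


def guess_filetype(name: str, content: str) -> str:
    """Toy heuristic (hypothetical): prefer sniffed content over extension."""
    by_ext = Path(name).suffix.lstrip(".").lower()
    head = content.lstrip().lower()
    looks_html = head.startswith(("<!doctype html", "<html"))
    by_content = "html" if looks_html else by_ext
    if by_content != by_ext:
        # Warn but keep going: the mismatch is suspicious, not fatal.
        warnings.warn(
            f"{name}: extension says {by_ext!r}, content looks like {by_content!r}",
            FiletypeAmbiguityWarning,
            stacklevel=2,
        )
    return by_content


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    detected = guess_filetype("report.json", "<html><body>hi</body></html>")

print(detected, [w.category.__name__ for w in caught])
```

The call still returns a usable answer; the warning only records that the two signals disagreed.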

Splitter warnings

Splitters emit BaseSplitterWarning (or subclasses) regarding suspicious chunking inputs, outputs, or fallback behaviors.

Hierarchy

`BaseSplitterWarning` (`UserWarning`)
├── `SplitterInputWarning`
│   ├── `AutoTagFallbackWarning`
│   └── `BatchHtmlTableWarning`
└── `SplitterOutputWarning`
    ├── `ChunkUnderflowWarning`
    └── `ChunkOverflowWarning`
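Because the classes form a tree, a check written against a parent also covers its children. The sketch below rebuilds the hierarchy with local stand-in classes (an assumption, so it runs without SplitterMR installed) and uses a hypothetical `branch_of` helper to classify a category by branch.

```python
# Local stand-ins reproducing the documented splitter hierarchy (assumption:
# real code would import these from splitter_mr.schema.warnings).
class BaseSplitterWarning(UserWarning):
    pass


class SplitterInputWarning(BaseSplitterWarning):
    pass


class AutoTagFallbackWarning(SplitterInputWarning):
    pass


class BatchHtmlTableWarning(SplitterInputWarning):
    pass


class SplitterOutputWarning(BaseSplitterWarning):
    pass


class ChunkUnderflowWarning(SplitterOutputWarning):
    pass


class ChunkOverflowWarning(SplitterOutputWarning):
    pass


def branch_of(category: type) -> str:
    """Hypothetical helper: name the branch a warning category belongs to."""
    if issubclass(category, SplitterInputWarning):
        return "input"
    if issubclass(category, SplitterOutputWarning):
        return "output"
    if issubclass(category, BaseSplitterWarning):
        return "splitter"
    return "other"


for cls in (AutoTagFallbackWarning, ChunkOverflowWarning, BaseSplitterWarning):
    print(cls.__name__, "->", branch_of(cls))
```

The same `issubclass` relationship is what lets a single `warnings` filter on `SplitterInputWarning` also apply to `AutoTagFallbackWarning` and `BatchHtmlTableWarning`.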

General

BaseSplitterWarning

Bases: UserWarning

Base class for all splitter warnings.

Source code in src/splitter_mr/schema/warnings.py
class BaseSplitterWarning(UserWarning):
    """
    Base Warning class to all Reader exceptions
    """

    pass


I/O and Validation

SplitterInputWarning

Bases: BaseSplitterWarning

Warning raised when the splitter input is suspicious (e.g., empty text or text expected to be JSON but not parseable as JSON).

Source code in src/splitter_mr/schema/warnings.py
class SplitterInputWarning(BaseSplitterWarning):
    """
    Warning raised when the splitter input is suspicious (e.g., empty text or
    text expected to be JSON but not parseable as JSON).
    """

Typical cases:

  • Input text is empty.
  • Input text is expected to be JSON but parsing failed (fallback to raw text).
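In a strict pipeline it may be preferable to reject suspicious input instead of continuing. A minimal sketch, assuming local stand-in classes and a hypothetical `split_text` function (neither is SplitterMR's API), promotes `SplitterInputWarning` to an exception inside a scoped filter:

```python
import warnings


# Local stand-ins for the documented classes (assumption).
class BaseSplitterWarning(UserWarning):
    pass


class SplitterInputWarning(BaseSplitterWarning):
    pass


def split_text(text: str) -> list[str]:
    """Hypothetical splitter: warns on empty input, splits on blank lines."""
    if not text.strip():
        warnings.warn("input text is empty", SplitterInputWarning, stacklevel=2)
        return []
    return text.split("\n\n")


def strict_split(text: str) -> list[str]:
    """Run split_text with input warnings escalated to exceptions."""
    with warnings.catch_warnings():
        # "error" turns matching warnings into raised exceptions;
        # the filter is undone when the block exits.
        warnings.simplefilter("error", category=SplitterInputWarning)
        return split_text(text)


try:
    strict_split("   ")
    outcome = "ok"
except SplitterInputWarning as exc:
    outcome = f"rejected: {exc}"

print(outcome)
```

Since warning classes are also exception classes, the escalated warning can be caught with an ordinary `except` clause on the category.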

SplitterOutputWarning

Bases: BaseSplitterWarning

Warning raised when the splitter output contains suspicious elements (e.g., chunks with empty text or suspicious metadata values).

Source code in src/splitter_mr/schema/warnings.py
class SplitterOutputWarning(BaseSplitterWarning):
    """
    Warning raised when the splitter output present suspicious elements (e.g.,
    empty text or text expected to be JSON but not parseable as JSON).
    """

Typical cases:

  • The resulting chunks contain empty text fields.
  • Metadata generation produced suspicious values.
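Output warnings describe results and do not block execution, so one option is to route them into application logs. The sketch below uses the standard library's `logging.captureWarnings`, which forwards warnings to the `py.warnings` logger. The warning classes are local stand-ins and `ListHandler` is a hypothetical handler used only to make the captured messages visible.

```python
import logging
import warnings


# Local stand-ins for the documented classes (assumption).
class BaseSplitterWarning(UserWarning):
    pass


class SplitterOutputWarning(BaseSplitterWarning):
    pass


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps formatted messages in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger = logging.getLogger("py.warnings")  # target of captureWarnings
logger.addHandler(handler)

with warnings.catch_warnings():
    warnings.simplefilter("always", category=SplitterOutputWarning)
    logging.captureWarnings(True)  # redirect warnings to logging
    try:
        warnings.warn("chunk 3 has empty text", SplitterOutputWarning)
    finally:
        logging.captureWarnings(False)  # restore normal display

logger.removeHandler(handler)
print(handler.messages)
```

In an application, any regular handler (file, stream, or a log aggregator) attached to `py.warnings` would take the place of `ListHandler`.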

ChunkUnderflowWarning

Bases: SplitterOutputWarning

Warned when fewer chunks are produced than the configured chunk_size implies because the document contains too few paragraphs.

Source code in src/splitter_mr/schema/warnings.py
class ChunkUnderflowWarning(SplitterOutputWarning):
    """
    Warned when fewer chunks are produced than expected from the configured
    chunk_size due to the number of paragraphs being insufficient.
    """

Typical cases:

  • The document structure resulted in significantly fewer chunks than the chunk_size configuration suggested.

ChunkOverflowWarning

Bases: SplitterOutputWarning

Warned when chunking overflows what the configured chunk_size implies, so that the number or size of the chunks deviates from what was expected.

Source code in src/splitter_mr/schema/warnings.py
class ChunkOverflowWarning(SplitterOutputWarning):
    """
    Warned when fewer chunks are produced than expected from the configured
    chunk_size due to the number of paragraphs being insufficient.
    """

Typical cases:

  • Chunking produced unexpected volume or size deviations based on paragraph constraints.

Splitter-specific warnings

HtmlTagSplitter

AutoTagFallbackWarning

Bases: SplitterInputWarning

Warned when HtmlTagSplitter falls back to auto-tagging, e.g., when the requested tag is not found or no tag is provided.

Source code in src/splitter_mr/schema/warnings.py
class AutoTagFallbackWarning(SplitterInputWarning):
    """
    Warned when HTML Tag Splitter performs auto tagging, e.g., when
    not finding a tag or when no tag is provided.
    """

Typical cases:

  • The specific tag requested was not found, triggering an auto-tagging strategy.

BatchHtmlTableWarning

Bases: SplitterInputWarning

Warned when the target tag sits inside a table and splitting runs in batch mode. In that case, the content is split by table.

Source code in src/splitter_mr/schema/warnings.py
class BatchHtmlTableWarning(SplitterInputWarning):
    """
    Warned when a tag is presented in a table and the splitting process is being
    produced on batch. In that case, it is splitted by table.
    """

Typical cases:

  • A target tag is located inside an HTML table during batch processing (split occurs by table to preserve context).

Reference table

| Area | Warning | Parent | Description |
| --- | --- | --- | --- |
| Reader | BaseReaderWarning | UserWarning | Base reader warning |
| Reader | FiletypeAmbiguityWarning | BaseReaderWarning | Extension vs. content mismatch |
| Splitter | BaseSplitterWarning | UserWarning | Base splitter warning |
| Splitter | SplitterInputWarning | BaseSplitterWarning | Suspicious input (empty/malformed) |
| Splitter | SplitterOutputWarning | BaseSplitterWarning | Suspicious output elements |
| Splitter | ChunkUnderflowWarning | SplitterOutputWarning | Fewer chunks than expected |
| Splitter | ChunkOverflowWarning | SplitterOutputWarning | Chunk deviation/overflow |
| HTML Splitter | AutoTagFallbackWarning | SplitterInputWarning | Tag not found; auto-tagging used |
| HTML Splitter | BatchHtmlTableWarning | SplitterInputWarning | Tag inside table (batch mode) |

Note

More warnings will be introduced soon. Stay tuned for updates!