Warnings
SplitterMR uses standard Python UserWarning subclasses to alert users about suspicious inputs, ambiguous file types, or heuristic fallbacks that do not halt execution but may affect data quality.
Warnings live in:
from splitter_mr.schema.warnings import *
Why a custom warning hierarchy?
- Granular filtering: You can ignore specific categories (e.g., FiletypeAmbiguityWarning) while keeping others active using Python's warnings filter.
- Process observability: Distinguishes between input data issues (SplitterInputWarning) and processing results (SplitterOutputWarning).
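As a minimal sketch of granular filtering, the snippet below silences one category and keeps another active. It defines stand-in classes that mirror the documented hierarchy so it runs without SplitterMR installed; in real code you would import them from splitter_mr.schema.warnings. The `emit_both` helper is hypothetical and exists only to trigger the warnings.

```python
import warnings


# Stand-ins mirroring the documented hierarchy (assumption: in real code,
# import these from splitter_mr.schema.warnings instead).
class BaseReaderWarning(UserWarning):
    pass


class FiletypeAmbiguityWarning(BaseReaderWarning):
    pass


class BaseSplitterWarning(UserWarning):
    pass


class SplitterInputWarning(BaseSplitterWarning):
    pass


def emit_both() -> None:
    """Hypothetical helper that emits one warning of each kind."""
    warnings.warn("extension and content disagree", FiletypeAmbiguityWarning)
    warnings.warn("input text is empty", SplitterInputWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record everything by default
    # Later filters take precedence, so this one category is dropped.
    warnings.filterwarnings("ignore", category=FiletypeAmbiguityWarning)
    emit_both()

categories = [w.category.__name__ for w in caught]
print(categories)
```

Only the SplitterInputWarning is recorded here; the reader warning is dropped by the category filter.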
Reader warnings
Readers emit BaseReaderWarning (or subclasses) when input files are ambiguous or minor issues occur during ingestion.
Hierarchy
`BaseReaderWarning` (`UserWarning`)
└── `FiletypeAmbiguityWarning`
General
BaseReaderWarning
Bases: UserWarning
Base warning class for all reader warnings.
Source code in src/splitter_mr/schema/warnings.py
I/O and Heuristics
FiletypeAmbiguityWarning
Bases: BaseReaderWarning
Emitted when file-type heuristics disagree (e.g., file extension vs. DOM sniffing).
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- A file has a .json extension but contains HTML content.
- MIME type sniffing disagrees with the provided file extension.
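If an ambiguous file type should stop a pipeline rather than pass silently, the standard warnings filter can escalate this one category to an exception. The sketch below assumes stand-in classes and a hypothetical `read_file` function with a deliberately naive content check; it does not reproduce SplitterMR's actual heuristics.

```python
import warnings


# Stand-ins for the documented classes (assumption: import the real ones
# from splitter_mr.schema.warnings in actual code).
class BaseReaderWarning(UserWarning):
    pass


class FiletypeAmbiguityWarning(BaseReaderWarning):
    pass


def read_file(path: str, content: str) -> str:
    """Hypothetical reader: warns when a .json path holds HTML-looking content."""
    if path.endswith(".json") and content.lstrip().startswith("<"):
        warnings.warn(
            f"{path}: extension says JSON, content looks like HTML",
            FiletypeAmbiguityWarning,
            stacklevel=2,
        )
    return content


with warnings.catch_warnings():
    # Turn only this category into an exception; other warnings are unaffected.
    warnings.simplefilter("error", FiletypeAmbiguityWarning)
    try:
        read_file("report.json", "<html><body>hi</body></html>")
        outcome = "accepted"
    except FiletypeAmbiguityWarning as exc:
        outcome = f"rejected: {exc}"

print(outcome)
```

Scoping the filter with `catch_warnings()` keeps the strict behavior local to the block instead of changing it for the whole process.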
Splitter warnings
Splitters emit BaseSplitterWarning (or subclasses) regarding suspicious chunking inputs, outputs, or fallback behaviors.
Hierarchy
`BaseSplitterWarning` (`UserWarning`)
├── `SplitterInputWarning`
│ ├── `AutoTagFallbackWarning`
│ └── `BatchHtmlTableWarning`
└── `SplitterOutputWarning`
├── `ChunkUnderflowWarning`
└── `ChunkOverflowWarning`
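Because warning filters match on subclass relationships, a filter on a parent category also covers its children. The sketch below rebuilds the hierarchy above with stand-in classes (an assumption, so that it runs without SplitterMR installed) and silences the whole output branch with a single filter.

```python
import warnings


# Stand-in classes reproducing the documented tree (assumption: the real
# classes live in splitter_mr.schema.warnings).
class BaseSplitterWarning(UserWarning):
    pass


class SplitterInputWarning(BaseSplitterWarning):
    pass


class AutoTagFallbackWarning(SplitterInputWarning):
    pass


class BatchHtmlTableWarning(SplitterInputWarning):
    pass


class SplitterOutputWarning(BaseSplitterWarning):
    pass


class ChunkUnderflowWarning(SplitterOutputWarning):
    pass


class ChunkOverflowWarning(SplitterOutputWarning):
    pass


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # One filter on the parent silences every output-side subclass.
    warnings.filterwarnings("ignore", category=SplitterOutputWarning)
    warnings.warn("too few chunks", ChunkUnderflowWarning)  # silenced
    warnings.warn("tag not found", AutoTagFallbackWarning)  # recorded

recorded = [w.category.__name__ for w in caught]
print(recorded)
```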
General
BaseSplitterWarning
Bases: UserWarning
Base warning class for all splitter warnings.
Source code in src/splitter_mr/schema/warnings.py
I/O and Validation
SplitterInputWarning
Bases: BaseSplitterWarning
Warning emitted when the splitter input is suspicious (e.g., empty text, or text expected to be JSON that cannot be parsed as JSON).
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- Input text is empty.
- Input text is expected to be JSON but parsing failed (fallback to raw text).
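The two cases above could be handled along these lines. This is an illustrative sketch only: `load_text_or_json` is a hypothetical helper, not part of SplitterMR, and the warning classes are stand-ins for the documented ones.

```python
import json
import warnings


# Stand-ins for the documented classes (assumption).
class BaseSplitterWarning(UserWarning):
    pass


class SplitterInputWarning(BaseSplitterWarning):
    pass


def load_text_or_json(text: str):
    """Hypothetical helper: parse JSON, falling back to raw text with a warning."""
    if not text.strip():
        warnings.warn("input text is empty", SplitterInputWarning, stacklevel=2)
        return text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        warnings.warn(
            "input is not valid JSON; falling back to raw text",
            SplitterInputWarning,
            stacklevel=2,
        )
        return text


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parsed = load_text_or_json('{"a": 1}')  # valid JSON, no warning
    raw = load_text_or_json("not json")     # warns, returns the raw text
    empty = load_text_or_json("")           # warns about empty input

messages = [str(w.message) for w in caught]
```

Execution continues in every case; the warning only flags that the result may not be what the caller expected.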
SplitterOutputWarning
Bases: BaseSplitterWarning
Warning emitted when the splitter output contains suspicious elements (e.g., chunks with empty text or suspicious metadata values).
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- The resulting chunks contain empty text fields.
- Metadata generation produced suspicious values.
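A post-split check for the first case might look like the sketch below. The `check_chunks` function is hypothetical and the classes are stand-ins; how SplitterMR actually validates its output is not shown on this page.

```python
import warnings


# Stand-ins for the documented classes (assumption).
class BaseSplitterWarning(UserWarning):
    pass


class SplitterOutputWarning(BaseSplitterWarning):
    pass


def check_chunks(chunks: list[str]) -> list[str]:
    """Hypothetical post-check: warn (do not fail) if any chunk is blank."""
    empty = [i for i, chunk in enumerate(chunks) if not chunk.strip()]
    if empty:
        warnings.warn(
            f"{len(empty)} empty chunk(s) at positions {empty}",
            SplitterOutputWarning,
            stacklevel=2,
        )
    return chunks  # chunks are returned unchanged; the warning is advisory


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = check_chunks(["first paragraph", "   ", "third paragraph"])

flagged = [str(w.message) for w in caught]
```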
ChunkUnderflowWarning
Bases: SplitterOutputWarning
Emitted when fewer chunks are produced than the configured chunk_size would suggest, because the document contains too few paragraphs.
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- The document structure resulted in significantly fewer chunks than the chunk_size configuration suggested.
ChunkOverflowWarning
Bases: SplitterOutputWarning
Emitted when chunking overshoots what the configured chunk_size suggests, i.e., the counterpart of ChunkUnderflowWarning.
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- Chunking produced unexpected volume or size deviations based on paragraph constraints.
Splitter-specific warnings
HtmlTagSplitter
AutoTagFallbackWarning
Bases: SplitterInputWarning
Emitted when HtmlTagSplitter falls back to auto-tagging, e.g., when the requested tag is not found or no tag is provided.
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- The specific tag requested was not found, triggering an auto-tagging strategy.
BatchHtmlTableWarning
Bases: SplitterInputWarning
Emitted when the target tag sits inside a table and splitting runs in batch mode. In that case, the content is split by table.
Source code in src/splitter_mr/schema/warnings.py
Typical cases:
- A target tag is located inside an HTML table during batch processing (split occurs by table to preserve context).
Reference table
| Area | Warning | Parent | Description |
|---|---|---|---|
| Reader | BaseReaderWarning | UserWarning | Base reader warning |
| Reader | FiletypeAmbiguityWarning | BaseReaderWarning | Extension vs. content mismatch |
| Splitter | BaseSplitterWarning | UserWarning | Base splitter warning |
| Splitter | SplitterInputWarning | BaseSplitterWarning | Suspicious input (empty/malformed) |
| Splitter | SplitterOutputWarning | BaseSplitterWarning | Suspicious output elements |
| Splitter | ChunkUnderflowWarning | SplitterOutputWarning | Fewer chunks than expected |
| Splitter | ChunkOverflowWarning | SplitterOutputWarning | Chunk deviation/overflow |
| HTML Splitter | AutoTagFallbackWarning | SplitterInputWarning | Tag not found; auto-tagging used |
| HTML Splitter | BatchHtmlTableWarning | SplitterInputWarning | Tag inside table (batch mode) |
Note
More warnings will be introduced soon. Stay tuned for updates!