Reader
Introduction
The Reader component reads files in many different formats and extensions through a uniform interface. All readers share the same parent class, BaseReader.
Which Reader should I use for my project?
Each Reader component extracts document text in a different way, so the most suitable one depends on your use case.
- If you want to preserve the original structure as much as possible, without any kind of markdown parsing, use the VanillaReader class.
- If your documents contain many tables or visual components (such as images), we strongly recommend DoclingReader.
- If you want to maximize efficiency or simplify conversion to markdown, we recommend MarkItDownReader.
Note
Remember to visit the official repository and guides for the last two reader classes:
- Docling Developer guide
- MarkItDown GitHub repository.
The following table shows which file types each Reader class supports:
Reader | Unstructured files & PDFs | MS Office suite files | Tabular data | Files with hierarchical schema | Image files | Markdown conversion |
---|---|---|---|---|---|---|
VanillaReader | txt, md | xlsx | csv, tsv, parquet | json, yaml, html, xml | - | No |
MarkItDownReader | txt, md, pdf | docx, xlsx, pptx | csv, tsv | json, html, xml | jpg, png, jpeg | Yes |
DoclingReader | txt, md, pdf | docx, xlsx, pptx | - | html, xhtml | png, jpeg, tiff, bmp, webp | Yes |
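As a rough illustration of how the compatibility table can guide the choice, the sketch below maps a file extension to the readers whose row lists it. The names `SUPPORTED_EXTENSIONS` and `candidate_readers` are hypothetical helpers, not part of the library, and the extension sets are transcribed from the table, so treat them as approximate.

```python
from pathlib import Path

# Extension sets transcribed from the compatibility table (assumption: the table is complete).
SUPPORTED_EXTENSIONS = {
    "VanillaReader": {
        "txt", "md", "xlsx", "csv", "tsv", "parquet", "json", "yaml", "html", "xml",
    },
    "MarkItDownReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "csv", "tsv",
        "json", "html", "xml", "jpg", "jpeg", "png",
    },
    "DoclingReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "html", "xhtml",
        "png", "jpeg", "tiff", "bmp", "webp",
    },
}


def candidate_readers(file_path: str) -> list[str]:
    """Return the names of the readers whose table row lists this file's extension."""
    extension = Path(file_path).suffix.lower().lstrip(".")
    return [name for name, exts in SUPPORTED_EXTENSIONS.items() if extension in exts]


print(candidate_readers("report.parquet"))
print(candidate_readers("slides.pptx"))
```

When several readers are returned, the guidance above (structure preservation, tables and images, or conversion efficiency) can be used to pick among them.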
Output format
Dataclass defining the output structure for all readers.
Source code in src/splitter_mr/schema/schemas.py
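The source listing is not reproduced on this page. As a hedged sketch of what the output structure might look like, the dataclass below uses the field names that appear in the BaseReader example on this page. The class name `ReaderOutputSketch`, the field types, and the defaults are assumptions, so check `schemas.py` for the real definition.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ReaderOutputSketch:
    """Hypothetical stand-in for ReaderOutput; types and defaults are assumed."""

    text: str = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    ocr_method: Optional[str] = None
    # The container type for metadata is an assumption; a fresh object is created per instance.
    metadata: Any = field(default_factory=dict)


output = ReaderOutputSketch(
    text="example",
    document_name="example.txt",
    document_path="/tmp/example.txt",
    conversion_method="custom",
)
print(asdict(output))
```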
Readers
BaseReader
BaseReader
Bases: ABC
Abstract base class for all document readers.
This interface defines the contract for file readers that process documents and return a standardized dictionary containing the extracted text and document-level metadata.
Subclasses must implement the read method to handle specific file formats or reading strategies.
Methods:
Name | Description |
---|---|
read | Reads the input file and returns a dictionary with text and metadata. |
is_valid_file_path | Check if a path is valid. |
is_url | Check if the string provided is a URL. |
parse_json | Try to parse a JSON object when a dictionary or string is provided. |
Source code in src/splitter_mr/reader/base_reader.py
is_valid_file_path(path) staticmethod
Checks if the provided string is a valid file path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The string to check. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the string is a valid file path to an existing file, False otherwise. |
Example
BaseReader.is_valid_file_path("/tmp/myfile.txt")
True
Source code in src/splitter_mr/reader/base_reader.py
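The source listing is not reproduced on this page. As a rough illustration of the documented behavior (True only for a path to an existing file), a check like the following could be written with the standard library. The function name `is_valid_file_path_sketch` is hypothetical, and the library's actual implementation may differ.

```python
import os
import tempfile


def is_valid_file_path_sketch(path: str) -> bool:
    """Return True only when `path` is a string pointing to an existing regular file."""
    return isinstance(path, str) and os.path.isfile(path)


# A temporary file exists while the context manager is open.
with tempfile.NamedTemporaryFile(suffix=".txt") as handle:
    exists_now = is_valid_file_path_sketch(handle.name)

# A path that is not expected to exist on disk.
missing = is_valid_file_path_sketch(
    os.path.join(tempfile.gettempdir(), "no_such_file_for_sketch.txt")
)
print(exists_now, missing)
```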
is_url(string) staticmethod
Determines whether the given string is a valid HTTP or HTTPS URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
string | str | The string to check. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise. |
Example
BaseReader.is_url("https://example.com")
True
BaseReader.is_url("not_a_url")
False
Source code in src/splitter_mr/reader/base_reader.py
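The source listing is not reproduced on this page. One plausible way to implement the documented check (HTTP or HTTPS scheme only) uses `urllib.parse.urlparse` from the standard library. The function name `is_url_sketch` is hypothetical, and requiring a non-empty host is an assumption about what "valid" means here.

```python
from urllib.parse import urlparse


def is_url_sketch(string: str) -> bool:
    """Return True when `string` parses as an http(s) URL with a host part."""
    if not isinstance(string, str):
        return False
    try:
        parsed = urlparse(string)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. bad IPv6 literals).
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_url_sketch("https://example.com"))
print(is_url_sketch("not_a_url"))
```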
parse_json(obj) staticmethod
Attempts to parse the provided object as JSON.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
obj | Union[dict, str] | The object to parse. If a dict, returns it as-is. If a string, attempts to parse it as a JSON string. | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | The parsed JSON object. |
Raises:
Type | Description |
---|---|
ValueError | If a string is provided that cannot be parsed as valid JSON. |
TypeError | If the provided object is neither a dict nor a string. |
Example
BaseReader.parse_json('{"a": 1}')
{'a': 1}
BaseReader.parse_json({'b': 2})
{'b': 2}
BaseReader.parse_json('[not valid json]')
ValueError: String could not be parsed as JSON: ...
Source code in src/splitter_mr/reader/base_reader.py
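The source listing is not reproduced on this page. The documented contract (dicts pass through, strings are parsed, anything else is rejected) can be sketched with the standard `json` module. The function name `parse_json_sketch` and the exact error messages are assumptions, not the library's code.

```python
import json
from typing import Union


def parse_json_sketch(obj: Union[dict, str]) -> dict:
    """Return dicts unchanged, parse strings as JSON, and reject other types."""
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            # Note: json.loads returns whatever the text encodes, which may not be a dict.
            return json.loads(obj)
        except json.JSONDecodeError as exc:
            raise ValueError(f"String could not be parsed as JSON: {exc}") from exc
    raise TypeError(f"Expected dict or str, got {type(obj).__name__}")


print(parse_json_sketch('{"a": 1}'))
print(parse_json_sketch({"b": 2}))
```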
read(file_path, model=None, **kwargs) abstractmethod
Reads input and returns a ReaderOutput with text content and standardized metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | Path to the input file, a URL, raw string, or dictionary. | required |
model | Optional[BaseModel] | Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content. | None |
**kwargs | Any | Additional keyword arguments for implementation-specific options. | {} |
Returns:
Name | Type | Description |
---|---|---|
ReaderOutput | ReaderOutput | Dataclass defining the output structure for all readers. |
Raises:
Type | Description |
---|---|
ValueError | If the provided string is not a valid file path, URL, or parsable content. |
TypeError | If input type is unsupported. |
Example
class MyReader(BaseReader):
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(
            text="example",
            document_name="example.txt",
            document_path=file_path,
            document_id=kwargs.get("document_id"),
            conversion_method="custom",
            ocr_method=None,
            metadata={},
        )
Source code in src/splitter_mr/reader/base_reader.py
📚 Note: file examples are extracted from the data folder in the GitHub repository: link.
VanillaReader
SimpleHTMLTextExtractor
Bases: HTMLParser
Extracts HTML structures from text.
Source code in src/splitter_mr/reader/readers/vanilla_reader.py
VanillaReader
Bases: BaseReader
Read multiple file types using Python's built-in and standard libraries. Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf
For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images, with options to show or omit images, and to annotate images using a vision model.
Source code in src/splitter_mr/reader/readers/vanilla_reader.py
read(file_path=None, **kwargs)
Reads a document from various sources and returns its text content along with standardized metadata.
This method supports reading from
- Local file paths (file_path, or as a positional argument)
- URLs (file_url)
- JSON/dict objects (json_document)
- Raw text strings (text_document)
If multiple sources are provided, the following priority is used: file_path, file_url, json_document, text_document. If only file_path is provided, the method will attempt to automatically detect if the value is a path, URL, JSON, YAML, or plain text.
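The priority order described above can be sketched as follows. This is only an illustration of the documented rule, not the reader's actual code: the helper name `pick_source` is hypothetical, and raising `ValueError` when no source is given is an assumption.

```python
from typing import Any, Optional, Tuple

# Documented priority: file_path, then file_url, then json_document, then text_document.
SOURCE_PRIORITY = ("file_path", "file_url", "json_document", "text_document")


def pick_source(file_path: Optional[str] = None, **kwargs: Any) -> Tuple[str, Any]:
    """Return the (name, value) of the highest-priority source that was provided."""
    sources = {"file_path": file_path, **kwargs}
    for name in SOURCE_PRIORITY:
        value = sources.get(name)
        if value is not None:
            return name, value
    raise ValueError("No document source was provided.")


# file_url wins over text_document when both are given.
print(pick_source(text_document="hello", file_url="https://example.com/a.txt"))
```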
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | Path to the input file. | None |
**kwargs | Any | file_path (str, optional): Path to the input file (overrides positional argument). file_url (str, optional): URL to read the document from. json_document (dict or str, optional): Dictionary or JSON string containing document content. text_document (str, optional): Raw text or string content of the document. show_images (bool, optional): If True (default), images in PDFs are shown inline as base64 PNG. If False, images are omitted (or annotated if a model is provided). model (BaseModel, optional): Vision model for image annotation/captioning. prompt (str, optional): Custom prompt for image captioning. | {} |
Returns:
Name | Type | Description |
---|---|---|
ReaderOutput | ReaderOutput | Dataclass defining the output structure for all readers. |
Raises:
Type | Description |
---|---|
ValueError | If the provided source is not valid or supported, or if file/URL/JSON detection fails. |
TypeError | If provided arguments are of unsupported types. |
Notes
- PDF extraction now supports image captioning/omission indicators.
- For .parquet files, content is loaded via pandas and returned as CSV-formatted text.
Example
from splitter_mr.reader import VanillaReader
from splitter_mr.model import AzureOpenAIVisionModel
model = AzureOpenAIVisionModel()
reader = VanillaReader(model=model)
output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
print(output.text)
\n---\n## Page 1\n---\n\nMultiRAG Project – Splitter\nMultiRAG | Splitter\nLorem ipsum dolor sit amet, ...
Source code in src/splitter_mr/reader/readers/vanilla_reader.py
DoclingReader
DoclingReader
Bases: BaseReader
Read multiple file types using IBM's Docling library, and convert the documents into markdown or JSON format.
Source code in src/splitter_mr/reader/readers/docling_reader.py
read(file_path, prompt='Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format', **kwargs)
Reads and converts a document to Markdown format using the Docling library, supporting a wide range of file types including PDF, DOCX, HTML, and images.
This method leverages Docling's advanced document parsing capabilities—including layout and table detection, code and formula extraction, and integrated OCR—to produce clean, markdown-formatted output for downstream processing. The output includes standardized metadata and can be easily integrated into generative AI or information retrieval pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | Path to the input file to be read and converted. | required |
**kwargs | Any | document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated. conversion_method (Optional[str]): Name or description of the conversion method used. Default is None. ocr_method (Optional[str]): OCR method applied (if any). Default is None. metadata (Optional[List[str]]): Additional metadata as a list of strings. Default is an empty list. | {} |
Returns:
Name | Type | Description |
---|---|---|
ReaderOutput | ReaderOutput | Dataclass defining the output structure for all readers. |
Example
from splitter_mr.reader import DoclingReader
reader = DoclingReader()
result = reader.read(file_path = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
print(result.text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...
Source code in src/splitter_mr/reader/readers/docling_reader.py
MarkItDownReader
MarkItDownReader
Bases: BaseReader
Read multiple file types using Microsoft's MarkItDown library, and convert the documents using markdown format.
This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs) for LLM-based OCR when extracting text from images or scanned documents.
Currently, only the following VLMs are supported:
- OpenAIVisionModel
- AzureOpenAIVisionModel
If a compatible model is provided, MarkItDown will leverage the specified VLM for OCR, and the model's name will be recorded as the OCR method used.
Notes
- This method uses MarkItDown to convert a wide variety of file formats (e.g., PDF, DOCX, images, HTML, CSV) to Markdown.
- If document_id is not provided, a UUID will be automatically assigned.
- If metadata is not provided, an empty list will be used.
- MarkItDown should be installed with all relevant optional dependencies for full file format support.
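The defaulting rules in the notes above (a generated UUID for a missing document_id, an empty list for missing metadata) might be resolved along these lines. The helper name `resolve_read_options` is hypothetical and the sketch is not taken from the library's source; it only mirrors the documented defaults.

```python
import uuid
from typing import Any


def resolve_read_options(**kwargs: Any) -> dict:
    """Fill in the documented defaults for optional read() keyword arguments."""
    return {
        # A random UUID string is generated when no identifier is supplied.
        "document_id": kwargs.get("document_id") or str(uuid.uuid4()),
        "conversion_method": kwargs.get("conversion_method"),
        "ocr_method": kwargs.get("ocr_method"),
        # A new empty list is created on each call when metadata is missing.
        "metadata": kwargs.get("metadata") or [],
    }


options = resolve_read_options(conversion_method="markdown")
print(options["document_id"], options["metadata"])
```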
Source code in src/splitter_mr/reader/readers/markitdown_reader.py
read(file_path, **kwargs)
Reads a file and converts its contents to Markdown using MarkItDown, returning structured metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | str | Path to the input file to be read and converted. | required |
**kwargs | Any | document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated. conversion_method (Optional[str]): Name or description of the conversion method used. Default is None. ocr_method (Optional[str]): OCR method applied (if any). Default is None. metadata (Optional[List[str]]): Additional metadata as a list of strings. Default is an empty list. | {} |
Returns:
Name | Type | Description |
---|---|---|
ReaderOutput | ReaderOutput | Dataclass defining the output structure for all readers. |
Example
from splitter_mr.reader import MarkItDownReader
from splitter_mr.model import OpenAIVisionModel # Or AzureOpenAIVisionModel
openai = OpenAIVisionModel() # make sure to have necessary environment variables on `.env`.
reader = MarkItDownReader(model = openai)
result = reader.read(file_path = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
print(result.text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...
Source code in src/splitter_mr/reader/readers/markitdown_reader.py