Reader

Introduction

The Reader component reads files in many different formats and extensions and returns their content in a uniform structure. All readers share the same parent class, BaseReader.

Which Reader should I use for my project?

Each Reader component extracts document text in different ways. Therefore, choosing the most suitable Reader component depends on your use case.

  • If you want to preserve the original structure as much as possible, without any Markdown parsing, use the VanillaReader class.
  • If your documents contain many tables or visual components (such as images), we strongly recommend DoclingReader.
  • If you want to maximize efficiency or simplify conversion to Markdown, we recommend the MarkItDownReader component.

Note

Remember to visit the official repositories and guides for the last two reader classes (DoclingReader and MarkItDownReader).

The following table lists the file formats supported by each Reader class:

| Reader | Unstructured files & PDFs | MS Office suite files | Tabular data | Files with hierarchical schema | Image files | Markdown conversion |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla Reader | txt, md, pdf | xlsx, docx, pptx | csv, tsv, parquet | json, yaml, html, xml | jpg, png, webp, gif | Yes |
| MarkItDown Reader | txt, md, pdf | docx, xlsx, pptx | csv, tsv | json, html, xml | jpg, png, jpeg | Yes |
| Docling Reader | txt, md, pdf | docx, xlsx, pptx | - | html, xhtml | png, jpeg, tiff, bmp, webp | Yes |

Installing Docling & MarkItDown

By default, pip install splitter-mr installs core features only.
To use DoclingReader and/or MarkItDownReader, install the corresponding extras:

Python ≥ 3.11 is required.

MarkItDown:

pip install "splitter-mr[markitdown]"

Docling:

pip install "splitter-mr[docling]"

Both:

pip install "splitter-mr[markitdown,docling]"
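
After installing, it can be useful to confirm which optional backends are importable in the current environment. The snippet below is a minimal sketch using only the standard library. It assumes the extras provide the top-level modules `docling` and `markitdown` (the import names of the upstream packages), and `available_reader_backends` is a hypothetical helper, not part of splitter-mr.

```python
import importlib.util
import sys


def available_reader_backends() -> dict[str, bool]:
    """Report which optional reader backends can be imported.

    Hypothetical helper: the module names are assumptions based on the
    upstream package names, not something splitter-mr exposes.
    """
    backends = {"docling": "docling", "markitdown": "markitdown"}
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in backends.items()
    }


if __name__ == "__main__":
    # The installation notes above state that Python >= 3.11 is required.
    print("Python version OK:", sys.version_info >= (3, 11))
    print(available_reader_backends())
```

A `False` entry suggests that the corresponding extra still needs to be installed with one of the commands above.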

Note

For the full matrix of extras and alternative package managers, see the global How to install section in the project README: Splitter_MR — How to install

Output format

Bases: BaseModel

Pydantic model defining the output structure for all readers.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text | Optional[str] | The textual content extracted by the reader. |
| document_name | Optional[str] | The name of the document. |
| document_path | str | The path to the document. |
| document_id | Optional[str] | A unique identifier for the document. |
| conversion_method | Optional[str] | The method used for document conversion. |
| reader_method | Optional[str] | The method used for reading the document. |
| ocr_method | Optional[str] | The OCR method used, if any. |
| page_placeholder | Optional[str] | The placeholder used to identify each page, if used. |
| metadata | Optional[Dict[str, Any]] | Additional metadata associated with the document. |

Source code in src/splitter_mr/schema/models.py
class ReaderOutput(BaseModel):
    """Pydantic model defining the output structure for all readers.

    Attributes:
        text: The textual content extracted by the reader.
        document_name: The name of the document.
        document_path: The path to the document.
        document_id: A unique identifier for the document.
        conversion_method: The method used for document conversion.
        reader_method: The method used for reading the document.
        ocr_method: The OCR method used, if any.
        page_placeholder: The placeholder used to identify each page, if used.
        metadata: Additional metadata associated with the document.
    """

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    page_placeholder: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = Field(default_factory=dict)

    @field_validator("document_id", mode="before")
    def default_document_id(cls, v: str):
        """Generate a default UUID for document_id if not provided.

        Args:
            v (str): The provided document_id value.

        Returns:
            document_id (str): The provided document_id or a newly generated UUID string.
        """
        document_id = v or str(uuid.uuid4())
        return document_id

    def from_variable(
        self, variable: Union[str, Dict[str, Any]], variable_name: str
    ) -> "ReaderOutput":
        """
        Generate a new ReaderOutput object from a variable (str or dict).

        Args:
            variable (Union[str, Dict[str, Any]]): The variable to use as text.
            variable_name (str): The name for document_name.

        Returns:
            ReaderOutput: The new ReaderOutput object.
        """
        if isinstance(variable, dict):
            text = json.dumps(variable, ensure_ascii=False, indent=2)
            conversion_method = "json"
            metadata = {"details": "Generated from a json variable"}
        elif isinstance(variable, str):
            text = variable
            conversion_method = "txt"
            metadata = {"details": "Generated from a str variable"}
        else:
            raise ValueError("Variable must be either a string or a dictionary.")

        return ReaderOutput(
            text=text,
            document_name=variable_name,
            document_path="",
            conversion_method=conversion_method,
            reader_method="vanilla",
            ocr_method=None,
            page_placeholder=None,
            metadata=metadata,
        )

    def append_metadata(self, metadata: Dict[str, Any]) -> None:
        """
        Append (update) the metadata dictionary with new key-value pairs.

        Args:
            metadata (Dict[str, Any]): The metadata to add or update.
        """
        if self.metadata is None:
            self.metadata = {}
        self.metadata.update(metadata)
append_metadata(metadata)

Append (update) the metadata dictionary with new key-value pairs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| metadata | Dict[str, Any] | The metadata to add or update. | required |
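
Because append_metadata delegates to `dict.update`, new keys are added and existing keys are overwritten. The following stand-alone sketch shows that behaviour on a plain dictionary. The free function below is an illustrative stand-in that mirrors the method body, not the library method itself.

```python
from typing import Any, Dict, Optional


def append_metadata(
    current: Optional[Dict[str, Any]], new: Dict[str, Any]
) -> Dict[str, Any]:
    """Stand-in mirroring ReaderOutput.append_metadata on a plain dict."""
    if current is None:
        current = {}
    current.update(new)  # existing keys are overwritten, new keys are added
    return current


metadata = {"source": "upload", "lang": "en"}
append_metadata(metadata, {"lang": "es", "pages": 3})
print(metadata)
```

Here `lang` is replaced and `pages` is added, while `source` is left untouched.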
Source code in src/splitter_mr/schema/models.py
def append_metadata(self, metadata: Dict[str, Any]) -> None:
    """
    Append (update) the metadata dictionary with new key-value pairs.

    Args:
        metadata (Dict[str, Any]): The metadata to add or update.
    """
    if self.metadata is None:
        self.metadata = {}
    self.metadata.update(metadata)
default_document_id(v)

Generate a default UUID for document_id if not provided.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| v | str | The provided document_id value. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| document_id | str | The provided document_id or a newly generated UUID string. |
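
The fallback relies on Python's `or`, so any falsy value (None or an empty string) is replaced with a fresh UUID. Below is a stand-alone sketch of the same expression outside Pydantic. The free function is illustrative only and does not reproduce how the validator is wired into the model.

```python
import uuid
from typing import Optional


def default_document_id(v: Optional[str]) -> str:
    """Return v when it is truthy, otherwise a newly generated UUID4 string."""
    return v or str(uuid.uuid4())


kept = default_document_id("report-001")  # a provided id is returned unchanged
generated = default_document_id(None)     # a falsy input yields a new UUID string
print(kept)
print(generated)
```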

Source code in src/splitter_mr/schema/models.py
@field_validator("document_id", mode="before")
def default_document_id(cls, v: str):
    """Generate a default UUID for document_id if not provided.

    Args:
        v (str): The provided document_id value.

    Returns:
        document_id (str): The provided document_id or a newly generated UUID string.
    """
    document_id = v or str(uuid.uuid4())
    return document_id
from_variable(variable, variable_name)

Generate a new ReaderOutput object from a variable (str or dict).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variable | Union[str, Dict[str, Any]] | The variable to use as text. | required |
| variable_name | str | The name for document_name. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ReaderOutput | ReaderOutput | The new ReaderOutput object. |
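
To make the branching concrete, here is a minimal sketch that mirrors the type check in from_variable using only the standard library. `build_fields` is a hypothetical helper that returns a plain dictionary instead of a ReaderOutput, so it runs without splitter-mr installed.

```python
import json
from typing import Any, Dict, Union


def build_fields(
    variable: Union[str, Dict[str, Any]], variable_name: str
) -> Dict[str, Any]:
    """Mirror the dict/str branching of from_variable with plain fields."""
    if isinstance(variable, dict):
        # Dictionaries are serialized as indented JSON, keeping non-ASCII text.
        text = json.dumps(variable, ensure_ascii=False, indent=2)
        conversion_method = "json"
    elif isinstance(variable, str):
        text = variable
        conversion_method = "txt"
    else:
        raise ValueError("Variable must be either a string or a dictionary.")
    return {
        "text": text,
        "document_name": variable_name,
        "conversion_method": conversion_method,
    }


fields = build_fields({"título": "Informe"}, "report")
print(fields["conversion_method"])
print(fields["text"])
```

With `ensure_ascii=False`, accented characters are written as-is in the resulting text rather than as escape sequences.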

Source code in src/splitter_mr/schema/models.py
def from_variable(
    self, variable: Union[str, Dict[str, Any]], variable_name: str
) -> "ReaderOutput":
    """
    Generate a new ReaderOutput object from a variable (str or dict).

    Args:
        variable (Union[str, Dict[str, Any]]): The variable to use as text.
        variable_name (str): The name for document_name.

    Returns:
        ReaderOutput: The new ReaderOutput object.
    """
    if isinstance(variable, dict):
        text = json.dumps(variable, ensure_ascii=False, indent=2)
        conversion_method = "json"
        metadata = {"details": "Generated from a json variable"}
    elif isinstance(variable, str):
        text = variable
        conversion_method = "txt"
        metadata = {"details": "Generated from a str variable"}
    else:
        raise ValueError("Variable must be either a string or a dictionary.")

    return ReaderOutput(
        text=text,
        document_name=variable_name,
        document_path="",
        conversion_method=conversion_method,
        reader_method="vanilla",
        ocr_method=None,
        page_placeholder=None,
        metadata=metadata,
    )

Readers

To see a comparison between reading methods, refer to the following example.

BaseReader

BaseReader

Bases: ABC

Abstract base class for all document readers.

This interface defines the contract for file readers that process documents and return a standardized ReaderOutput containing the extracted text and document-level metadata. Subclasses must implement the read method to handle specific file formats or reading strategies.

Methods:

| Name | Description |
| --- | --- |
| read | Reads the input and returns a ReaderOutput with text and metadata. |
| is_valid_file_path | Check if a path is valid. |
| is_url | Check if the string provided is a URL. |
| parse_json | Try to parse a JSON object when a dictionary or string is provided. |

Source code in src/splitter_mr/reader/base_reader.py
class BaseReader(ABC):
    """
    Abstract base class for all document readers.

    This interface defines the contract for file readers that process documents and return
    a standardized ReaderOutput containing the extracted text and document-level metadata.
    Subclasses must implement the `read` method to handle specific file formats or reading
    strategies.

    Methods:
        read: Reads the input and returns a ReaderOutput with text and metadata.
        is_valid_file_path: Check if a path is valid.
        is_url: Check if the string provided is a URL.
        parse_json: Try to parse a JSON object when a dictionary or string is provided.
    """

    @staticmethod
    def is_valid_file_path(path: str) -> bool:
        """
        Checks if the provided string is a valid file path.

        Args:
            path (str): The string to check.

        Returns:
            bool: True if the string is a valid file path to an existing file, False otherwise.

        Example:
            ```python
            BaseReader.is_valid_file_path("/tmp/myfile.txt")
            ```
            ```bash
            True
            ```
        """
        return os.path.isfile(path)

    @staticmethod
    def is_url(string: str) -> bool:
        """
        Determines whether the given string is a valid HTTP or HTTPS URL.

        Args:
            string (str): The string to check.

        Returns:
            bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

        Example:
            ```python
            BaseReader.is_url("https://example.com")
            ```
            ```bash
            True
            ```
            ```python
            BaseReader.is_url("not_a_url")
            ```
            ```bash
            False
            ```
        """
        try:
            result = urlparse(string)
            return all([result.scheme in ("http", "https"), result.netloc])
        except Exception:
            return False

    @staticmethod
    def parse_json(obj: Union[dict, str]) -> dict:
        """
        Attempts to parse the provided object as JSON.

        Args:
            obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
                If a string, attempts to parse it as a JSON string.

        Returns:
            dict: The parsed JSON object.

        Raises:
            ValueError: If a string is provided that cannot be parsed as valid JSON.
            TypeError: If the provided object is neither a dict nor a string.

        Example:
            ```python
            BaseReader.parse_json('{"a": 1}')
            ```
            ```python
            {'a': 1}
            ```
            ```python
            BaseReader.parse_json({'b': 2})
            ```
            ```python
            {'b': 2}
            ```
            ```python
            BaseReader.parse_json('[not valid json]')
            ```
            ```python
            ValueError: String could not be parsed as JSON: ...
            ```
        """
        if isinstance(obj, dict):
            return obj
        if isinstance(obj, str):
            try:
                return json.loads(obj)
            except Exception as e:
                raise ValueError(f"String could not be parsed as JSON: {e}")
        raise TypeError("Provided object is not a string or dictionary")

    @abstractmethod
    def read(
        self, file_path: str, model: Optional[BaseVisionModel] = None, **kwargs: Any
    ) -> ReaderOutput:
        """
        Reads input and returns a ReaderOutput with text content and standardized metadata.

        Args:
            file_path (str): Path to the input file, a URL, raw string, or dictionary.
            model (Optional[BaseVisionModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
            **kwargs: Additional keyword arguments for implementation-specific options.

        Returns:
            ReaderOutput: Pydantic model defining the output structure for all readers.

        Raises:
            ValueError: If the provided string is not a valid file path, URL, or parsable content.
            TypeError: If input type is unsupported.

        Example:
            ```python
            class MyReader(BaseReader):
                def read(self, file_path: str, **kwargs) -> ReaderOutput:
                    return ReaderOutput(
                        text="example",
                        document_name="example.txt",
                        document_path=file_path,
                        document_id=kwargs.get("document_id"),
                        conversion_method="custom",
                        ocr_method=None,
                        metadata={}
                    )
            ```
        """
is_url(string) staticmethod

Determines whether the given string is a valid HTTP or HTTPS URL.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | The string to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise. |

Example

BaseReader.is_url("https://example.com")
True
BaseReader.is_url("not_a_url")
False
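
A few edge cases show what the check accepts. The sketch below copies the is_url logic from the source into a free function so it runs with the standard library alone. The sample inputs are illustrative.

```python
from urllib.parse import urlparse


def is_url(string: str) -> bool:
    """Same check as BaseReader.is_url: http(s) scheme plus a network location."""
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False


samples = ("https://example.com", "ftp://example.com", "https://", "example.com/page")
results = {sample: is_url(sample) for sample in samples}
print(results)
```

Only the first sample passes: the second uses a scheme other than HTTP or HTTPS, the third has no network location, and the fourth has no scheme at all.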

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_url(string: str) -> bool:
    """
    Determines whether the given string is a valid HTTP or HTTPS URL.

    Args:
        string (str): The string to check.

    Returns:
        bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

    Example:
        ```python
        BaseReader.is_url("https://example.com")
        ```
        ```bash
        True
        ```
        ```python
        BaseReader.is_url("not_a_url")
        ```
        ```bash
        False
        ```
    """
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False
is_valid_file_path(path) staticmethod

Checks if the provided string is a valid file path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The string to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the string is a valid file path to an existing file, False otherwise. |

Example

BaseReader.is_valid_file_path("/tmp/myfile.txt")
True
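
Note that the result above is True only if /tmp/myfile.txt actually exists as a regular file. Since the method wraps `os.path.isfile`, directories and missing paths return False. The following sketch demonstrates this with a temporary directory; it calls `os.path.isfile` directly rather than importing BaseReader.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "myfile.txt")
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write("hello")

    # Evaluate the checks while the temporary directory still exists.
    checks = {
        "existing file": os.path.isfile(file_path),
        "directory": os.path.isfile(tmp_dir),
        "missing file": os.path.isfile(os.path.join(tmp_dir, "absent.txt")),
    }

print(checks)
```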

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_valid_file_path(path: str) -> bool:
    """
    Checks if the provided string is a valid file path.

    Args:
        path (str): The string to check.

    Returns:
        bool: True if the string is a valid file path to an existing file, False otherwise.

    Example:
        ```python
        BaseReader.is_valid_file_path("/tmp/myfile.txt")
        ```
        ```bash
        True
        ```
    """
    return os.path.isfile(path)
parse_json(obj) staticmethod

Attempts to parse the provided object as JSON.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| obj | Union[dict, str] | The object to parse. If a dict, returns it as-is. If a string, attempts to parse it as a JSON string. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The parsed JSON object. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If a string is provided that cannot be parsed as valid JSON. |
| TypeError | If the provided object is neither a dict nor a string. |

Example

BaseReader.parse_json('{"a": 1}')
{'a': 1}
BaseReader.parse_json({'b': 2})
{'b': 2}
BaseReader.parse_json('[not valid json]')
ValueError: String could not be parsed as JSON: ...
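
The two error paths can be exercised side by side. This sketch copies the parse_json logic from the source into a free function so that it runs with the standard library only. The inputs are illustrative.

```python
import json
from typing import Union


def parse_json(obj: Union[dict, str]) -> dict:
    """Stand-alone copy of the BaseReader.parse_json logic from this section."""
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")


print(parse_json('{"a": 1}'))

raised = []
for bad_input in ("[not valid json]", 42):
    try:
        parse_json(bad_input)
    except (ValueError, TypeError) as error:
        raised.append(type(error).__name__)
print(raised)
```

Because the string branch returns whatever `json.loads` produces, a valid JSON array string such as `"[1, 2]"` yields a list even though the return annotation is dict.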

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def parse_json(obj: Union[dict, str]) -> dict:
    """
    Attempts to parse the provided object as JSON.

    Args:
        obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
            If a string, attempts to parse it as a JSON string.

    Returns:
        dict: The parsed JSON object.

    Raises:
        ValueError: If a string is provided that cannot be parsed as valid JSON.
        TypeError: If the provided object is neither a dict nor a string.

    Example:
        ```python
        BaseReader.parse_json('{"a": 1}')
        ```
        ```python
        {'a': 1}
        ```
        ```python
        BaseReader.parse_json({'b': 2})
        ```
        ```python
        {'b': 2}
        ```
        ```python
        BaseReader.parse_json('[not valid json]')
        ```
        ```python
        ValueError: String could not be parsed as JSON: ...
        ```
    """
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")
read(file_path, model=None, **kwargs) abstractmethod

Reads input and returns a ReaderOutput with text content and standardized metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | str | Path to the input file, a URL, raw string, or dictionary. | required |
| model | Optional[BaseVisionModel] | Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content. | None |
| **kwargs | Any | Additional keyword arguments for implementation-specific options. | {} |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ReaderOutput | ReaderOutput | Pydantic model defining the output structure for all readers. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the provided string is not a valid file path, URL, or parsable content. |
| TypeError | If input type is unsupported. |

Example
class MyReader(BaseReader):
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(
            text="example",
            document_name="example.txt",
            document_path=file_path,
            document_id=kwargs.get("document_id"),
            conversion_method="custom",
            ocr_method=None,
            metadata={}
        )
Source code in src/splitter_mr/reader/base_reader.py
@abstractmethod
def read(
    self, file_path: str, model: Optional[BaseVisionModel] = None, **kwargs: Any
) -> ReaderOutput:
    """
    Reads input and returns a ReaderOutput with text content and standardized metadata.

    Args:
        file_path (str): Path to the input file, a URL, raw string, or dictionary.
        model (Optional[BaseVisionModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
        **kwargs: Additional keyword arguments for implementation-specific options.

    Returns:
        ReaderOutput: Pydantic model defining the output structure for all readers.

    Raises:
        ValueError: If the provided string is not a valid file path, URL, or parsable content.
        TypeError: If input type is unsupported.

    Example:
        ```python
        class MyReader(BaseReader):
            def read(self, file_path: str, **kwargs) -> ReaderOutput:
                return ReaderOutput(
                    text="example",
                    document_name="example.txt",
                    document_path=file_path,
                    document_id=kwargs.get("document_id"),
                    conversion_method="custom",
                    ocr_method=None,
                    metadata={}
                )
        ```
    """

📚 Note: file examples are extracted from the data folder in the GitHub repository.

VanillaReader

SimpleHTMLTextExtractor

Bases: HTMLParser

Extract text from HTML by concatenating text nodes (legacy helper).

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class SimpleHTMLTextExtractor(HTMLParser):
    """Extract text from HTML by concatenating text nodes (legacy helper)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()
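
A short usage sketch for this helper follows. The class is repeated from the source above so the snippet runs on its own with the standard library. The sample HTML string is illustrative.

```python
from html.parser import HTMLParser


class SimpleHTMLTextExtractor(HTMLParser):
    """Same helper as in the source above, repeated so the sketch runs alone."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Called once per text node found between tags.
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()


extractor = SimpleHTMLTextExtractor()
extractor.feed("<p>Hello <b>world</b></p>")
extractor.close()
print(repr(extractor.get_text()))  # 'Hello  world'
```

The double space in the result comes from joining the text nodes `"Hello "` and `"world"` with an extra space, so callers that need tidy output may want to normalize whitespace afterwards.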
VanillaReader

Bases: BaseReader

Read multiple file types using Python's built-in and standard libraries. Supported: .json, .html/.htm, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

NEW: HTML handling (local files and URLs):

  • If html_to_markdown=True (keyword argument), HTML is converted to Markdown using the project's HtmlToMarkdown utility, and the conversion method is reported as "md".
  • If html_to_markdown=False (default), raw HTML is returned without transformation, and the conversion method is "html".

For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images, with options to show or omit images, and to annotate images using a vision model.
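
The read docstring in the source below documents a source priority of file_path > file_url > json_document > text_document, with `kwargs['file_path']` overriding the positional argument. The snippet here is a minimal sketch of that priority rule only. `pick_source` is a hypothetical helper, not the library's internal `_guess_source` (whose implementation is not shown in this section), and raising ValueError when nothing is supplied is an assumption. It also leaves out the auto-detection of URLs, JSON, YAML, and plain text.

```python
from typing import Any, Dict, Optional, Tuple

# Priority order as documented in the VanillaReader.read docstring.
SOURCE_PRIORITY = ("file_path", "file_url", "json_document", "text_document")


def pick_source(
    kwargs: Dict[str, Any], file_path: Optional[str] = None
) -> Tuple[str, Any]:
    """Hypothetical helper: return the highest-priority source supplied."""
    candidates = dict(kwargs)
    # Per the docstring, kwargs['file_path'] overrides the positional argument.
    if candidates.get("file_path") is None and file_path is not None:
        candidates["file_path"] = file_path
    for key in SOURCE_PRIORITY:
        if candidates.get(key) is not None:
            return key, candidates[key]
    # Assumption: the real reader's behaviour for empty input is not shown here.
    raise ValueError("No document source was provided.")


print(pick_source({"file_url": "https://example.com/a.html", "text_document": "raw"}))
print(pick_source({"text_document": "raw"}, file_path="doc.pdf"))
```

In the first call the URL wins over the raw text. In the second, the positional path wins over the raw text.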

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class VanillaReader(BaseReader):
    """
    Read multiple file types using Python's built-in and standard libraries.
    Supported: .json, .html/.htm, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

    **NEW**: HTML handling (local files and URLs):
      - If ``html_to_markdown=True`` (kw arg), HTML is converted to Markdown using the
        project's HtmlToMarkdown utility, and the conversion method is reported as ``"md"``.
      - If ``html_to_markdown=False`` (default), raw HTML is returned without transformation,
        and the conversion method is ``"html"``.

    For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images,
    with options to show or omit images, and to annotate images using a vision model.
    """

    def __init__(self, model: Optional[BaseVisionModel] = None):
        super().__init__()
        self.model = model
        self.pdf_reader = PDFPlumberReader()

    def read(
        self,
        file_path: str | Path = None,
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Read a document from various sources and return standardized output.

        This method supports:
        - Local file paths (``file_path`` or positional arg)
        - URLs (``file_url``)
        - JSON/dict objects (``json_document``)
        - Raw text strings (``text_document``)

        If multiple sources are provided, the priority is:
        ``file_path`` > ``file_url`` > ``json_document`` > ``text_document``.
        If only ``file_path`` is provided, auto-detects whether it is a path, URL,
        JSON, YAML, or plain text.

        Args:
            file_path (str | Path): Path to the input file (overridden by
                ``kwargs['file_path']`` if present).

            **kwargs: Optional arguments that adjust behavior:

                Source selection:
                    file_path (str): Path to the input file (overrides positional arg).
                    file_url (str): HTTPS/HTTP URL to read from.
                    json_document (dict | str): JSON-like document (dict or JSON string).
                    text_document (str): Raw text content.

                Identification/metadata:
                    document_id (str): Explicit document id. Defaults to a new UUID.
                    metadata (dict): Additional metadata to attach to the output.

                HTML handling:
                    html_to_markdown (bool): If True, convert HTML to Markdown before
                        returning. If False (default), return raw HTML as-is.

                PDF extraction:
                    scan_pdf_pages (bool): If True, rasterize and describe pages using a
                        vision model (VLM). If False (default), use element-wise extraction.
                    model (BaseVisionModel): Vision-capable model used for scanned PDFs and/or
                        image captioning (also used for image files).
                    prompt (str): Prompt for image captioning / page description. Defaults to
                        ``DEFAULT_IMAGE_CAPTION_PROMPT`` for element-wise PDFs and
                        ``DEFAULT_IMAGE_EXTRACTION_PROMPT`` for scanned PDFs/images.
                    resolution (int): DPI when rasterizing pages for VLM. Default: 300.
                    show_base64_images (bool): Include base64-embedded images in PDF output.
                        Default: False.
                    image_placeholder (str): Placeholder for omitted images in PDFs.
                        Default: ``"<!-- image -->"``.
                    page_placeholder (str): Placeholder inserted between PDF pages (only
                        surfaced when scanning or when the placeholder occurs in text).
                        Default: ``"<!-- page -->"``.
                    vlm_parameters (dict): Extra keyword args forwarded to
                        ``model.analyze_content(...)``.

                Excel / Parquet reading:
                    as_table (bool): For Excel (``.xlsx``/``.xls``), if True read as a table
                        using pandas and return CSV text. If False (default), convert to PDF
                        and run the PDF pipeline.
                    excel_engine (str): pandas Excel engine. Default: ``"openpyxl"``.
                    parquet_engine (str): pandas Parquet engine (e.g. ``"pyarrow"``,
                        ``"fastparquet"``). Default: pandas auto-selection.

        Returns:
            ReaderOutput: Unified result containing text, metadata, and extraction info.

        Raises:
            ValueError: If the source is invalid/unsupported, or if a VLM is required
                but not provided.
            TypeError: If provided arguments are of unsupported types.

        Notes:
            - HTML control via ``html_to_markdown`` applies to both local files and URLs.
            - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

        Example:
            ```python
            # Convert HTML to Markdown
            reader = VanillaReader()
            md_output = reader.read(file_path="page.html", html_to_markdown=True)

            # Keep raw HTML as-is
            html_output = reader.read(file_path="page.html", html_to_markdown=False)
            ```
        """

        source_type, source_val = _guess_source(kwargs, file_path)
        name, path, text, conv, ocr = self._dispatch_source(
            source_type, source_val, kwargs
        )

        page_ph: str = kwargs.get("page_placeholder", "<!-- page -->")
        page_ph_out = self._surface_page_placeholder(
            scan=bool(kwargs.get("scan_pdf_pages")),
            placeholder=page_ph,
            text=text,
        )

        return ReaderOutput(
            text=_ensure_str(text),
            document_name=name,
            document_path=path or "",
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method=conv,
            reader_method="vanilla",
            ocr_method=ocr,
            page_placeholder=page_ph_out,
            metadata=kwargs.get("metadata", {}),
        )

    def _dispatch_source(  # noqa: WPS231
        self,
        src_type: str,
        src_val: Any,
        kw: Dict[str, Any],
    ) -> Tuple[str, Optional[str], Any, str, Optional[str]]:
        """
        Route the request to a specialised handler and return
        (document_name, document_path, text/content, conversion_method, ocr_method)
        """
        handlers = {
            "file_path": self._handle_local_path,
            "file_url": self._handle_url,
            "json_document": self._handle_explicit_json,
            "text_document": self._handle_explicit_text,
        }
        if src_type not in handlers:
            raise ValueError(f"Unrecognized document source: {src_type}")
        return handlers[src_type](src_val, kw)

    # ---- individual strategies below – each ~20 lines or fewer ---------- #

    # 1) Local / drive paths
    def _handle_local_path(
        self,
        path_like: str | Path,
        kw: Dict[str, Any],
    ) -> Tuple[str, str, Any, str, Optional[str]]:
        """Load from the filesystem (or, if it ‘looks like’ one, via HTTP)."""
        path_str = os.fspath(path_like) if isinstance(path_like, Path) else path_like
        if not isinstance(path_str, str):
            raise ValueError("file_path must be a string or Path object.")

        if not self.is_valid_file_path(path_str):
            if self.is_url(path_str):
                return self._handle_url(path_str, kw)
            return self._handle_fallback(path_str, kw)

        ext = os.path.splitext(path_str)[1].lower().lstrip(".")
        doc_name = os.path.basename(path_str)
        rel_path = os.path.relpath(path_str)

        # ---- type-specific branches ---- #
        # TODO: Refactor to sort the code and make it more readable
        if ext == "pdf":
            return (
                doc_name,
                rel_path,
                *self._process_pdf(path_str, kw),
            )
        if ext == "html" or ext == "htm":
            content, conv = _read_html_file(
                path_str, html_to_markdown=bool(kw.get("html_to_markdown", False))
            )
            return doc_name, rel_path, content, conv, None
        if ext in ("json", "txt", "xml", "csv", "tsv", "md", "markdown"):
            return doc_name, rel_path, _read_text_file(path_str, ext), ext, None
        if ext == "parquet":
            parquet_engine = kw.get(
                "parquet_engine"
            )  # e.g., "pyarrow" or "fastparquet"
            return (
                doc_name,
                rel_path,
                _read_parquet(path_str, engine=parquet_engine),
                "csv",
                None,
            )
        if ext in ("yaml", "yml"):
            return doc_name, rel_path, _read_text_file(path_str, ext), "json", None
        if ext in ("xlsx", "xls"):
            # When as_table=True, pass excel_engine
            if kw.get("as_table", False):
                excel_engine = kw.get("excel_engine", "openpyxl")
                return (
                    doc_name,
                    rel_path,
                    _read_excel(path_str, engine=excel_engine),
                    ext,
                    None,
                )
            # Otherwise convert workbook to PDF and reuse the PDF extractor
            pdf_path = self._convert_office_to_pdf(path_str)
            return (
                os.path.basename(pdf_path),
                os.path.relpath(pdf_path),
                *self._process_pdf(pdf_path, kw),
            )
        if ext in ("docx", "pptx"):
            pdf_path = self._convert_office_to_pdf(path_str)
            return (
                os.path.basename(pdf_path),
                os.path.relpath(pdf_path),
                *self._process_pdf(pdf_path, kw),
            )
        if ext in SUPPORTED_VANILLA_IMAGE_EXTENSIONS:
            model = kw.get("model", self.model)
            prompt = kw.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
            vlm_parameters = kw.get("vlm_parameters", {})
            return self._handle_image_to_llm(
                model, path_str, prompt=prompt, vlm_parameters=vlm_parameters
            )
        if ext in SUPPORTED_PROGRAMMING_LANGUAGES:
            return doc_name, rel_path, _read_text_file(path_str, ext), "txt", None

        raise ValueError(f"Unsupported file extension: {ext}. Use another Reader.")

    # 2) Remote URL
    def _handle_url(
        self,
        url: str,
        kw: Dict[str, Any],
    ) -> Tuple[str, str, Any, str, Optional[str]]:  # noqa: D401
        """Fetch via HTTP(S)."""
        if not isinstance(url, str) or not self.is_url(url):
            raise ValueError("file_url must be a valid URL string.")
        content, conv = _load_via_requests(
            url, html_to_markdown=bool(kw.get("html_to_markdown", False))
        )
        name = url.split("/")[-1] or "downloaded_file"
        return name, url, content, conv, None

    # 3) Explicit JSON (dict or str)
    def _handle_explicit_json(
        self,
        json_doc: Any,
        _kw: Dict[str, Any],
    ) -> Tuple[str, None, Any, str, None]:
        """JSON passed straight in."""
        return (
            _kw.get("document_name", None),
            None,
            self.parse_json(json_doc),
            "json",
            None,
        )

    # 4) Explicit raw text
    def _handle_explicit_text(
        self,
        txt: str,
        _kw: Dict[str, Any],
    ) -> Tuple[str, None, Any, str, None]:  # noqa: D401
        """Text (maybe JSON / YAML) passed straight in."""
        for parser, conv in ((self.parse_json, "json"), (yaml.safe_load, "json")):
            try:
                parsed = parser(txt)
                if isinstance(parsed, (dict, list)):
                    return _kw.get("document_name", None), None, parsed, conv, None
            except Exception:  # pragma: no cover
                pass
        return _kw.get("document_name", None), None, txt, "txt", None

    # ----- shared utilities ------------------------------------------------ #

    def _process_pdf(
        self,
        path: str,
        kw: Dict[str, Any],
    ) -> Tuple[Any, str, Optional[str]]:
        """
        Process a PDF file and extract content.

        This method supports two modes:
        - Scanned PDF pages using a vision-capable model (image-based extraction).
        - Element-wise text and image extraction using PDFPlumber.

        Args:
            path (str): The path to the PDF file.
            kw (dict): Keyword arguments controlling extraction behavior. Recognized keys include:
                scan_pdf_pages (bool): If True, process the PDF as scanned images.
                model (BaseVisionModel, optional): Vision-capable model for scanned PDFs or image captioning.
                prompt (str, optional): Prompt for image captioning.
                show_base64_images (bool): Whether to include base64 images in the output.
                image_placeholder (str): Placeholder for omitted images.
                page_placeholder (str): Placeholder for page breaks.

        Returns:
            tuple: A tuple of:
                - content (Any): Extracted text/content from the PDF.
                - conv (str): Conversion method used (e.g., "pdf", "png").
                - ocr_method (str or None): OCR model name if applicable.

        Raises:
            ValueError: If `scan_pdf_pages` is True but no vision-capable model is provided.
        """
        if kw.get("scan_pdf_pages"):
            model = kw.get("model", self.model)
            if model is None:
                raise ValueError("scan_pdf_pages=True requires a vision-capable model.")
            joined = self._scan_pdf_pages(path, model=model, **kw)
            return joined, "png", model.model_name
        # element-wise extraction
        content = self.pdf_reader.read(
            path,
            model=kw.get("model", self.model),
            prompt=kw.get("prompt") or DEFAULT_IMAGE_CAPTION_PROMPT,
            show_base64_images=kw.get("show_base64_images", False),
            image_placeholder=kw.get("image_placeholder", "<!-- image -->"),
            page_placeholder=kw.get("page_placeholder", "<!-- page -->"),
        )
        ocr_name = (
            (kw.get("model") or self.model).model_name
            if kw.get("model") or self.model
            else None
        )
        return content, "pdf", ocr_name

    def _scan_pdf_pages(self, file_path: str, model: BaseVisionModel, **kw) -> str:
        """
        Describe each page of a PDF using a vision model.

        Args:
            file_path (str): The path to the PDF file.
            model (BaseVisionModel): Vision-capable model used for page description.
            **kw: Additional keyword arguments. Recognized keys include:
                prompt (str, optional): Prompt for describing PDF pages.
                resolution (int): DPI resolution for rasterizing pages (default: 300).
                vlm_parameters (dict): Extra parameters for the vision model.

        Returns:
            str: A string containing page descriptions separated by page placeholders.
        """
        page_ph = kw.get("page_placeholder", "<!-- page -->")
        pages = self.pdf_reader.describe_pages(
            file_path=file_path,
            model=model,
            prompt=kw.get("prompt") or DEFAULT_IMAGE_EXTRACTION_PROMPT,
            resolution=kw.get("resolution", 300),
            **kw.get("vlm_parameters", {}),
        )
        return "\n\n---\n\n".join(f"{page_ph}\n\n{md}" for md in pages)

    def _handle_fallback(self, raw: str, kw: Dict[str, Any]):
        """
        Handle unsupported or unknown sources.

        Attempts to parse the input as JSON, then as text.
        Falls back to returning the raw content as plain text.

        Args:
            raw (str): Raw string content to be processed.
            kw (dict): Additional keyword arguments, may include:
                document_name (str): Optional name of the document.

        Returns:
            tuple: A tuple of:
                - document_name (str or None)
                - document_path (None)
                - content (Any): Parsed or raw content
                - conversion_method (str)
                - ocr_method (None)
        """
        try:
            return self._handle_explicit_json(raw, kw)
        except Exception:
            try:
                return self._handle_explicit_text(raw, kw)
            except Exception:  # pragma: no cover
                return kw.get("document_name", None), None, raw, "txt", None

    def _handle_image_to_llm(
        self,
        model: BaseVisionModel,
        file_path: str,
        prompt: Optional[str] = None,
        vlm_parameters: Optional[dict] = None,
    ) -> Tuple[str, str, Any, str, str]:
        """
        Extract content from an image file using a vision model.

        Reads the image, encodes it in base64, and sends it to the given vision model
        with the provided prompt.

        Args:
            model (BaseVisionModel): Vision-capable model to process the image.
            file_path (str): Path to the image file.
            prompt (str, optional): Prompt for guiding the vision model.
            vlm_parameters (dict, optional): Additional parameters for the vision model.

        Returns:
            tuple: A tuple of:
                - document_name (str)
                - document_path (str)
                - extracted (Any): Extracted content from the image.
                - conversion_method (str): Always "image".
                - ocr_method (str): Model name.

        Raises:
            ValueError: If no vision model is provided.
        """
        if model is None:
            raise ValueError("No vision model provided for image extraction.")
        # Read image as bytes and encode as base64
        with open(file_path, "rb") as f:
            img_bytes = f.read()
        ext = os.path.splitext(file_path)[1].lstrip(".").lower()
        img_b64 = base64.b64encode(img_bytes).decode("utf-8")
        prompt = prompt or DEFAULT_IMAGE_EXTRACTION_PROMPT
        vlm_parameters = vlm_parameters or {}
        extracted = model.analyze_content(
            img_b64, prompt=prompt, file_ext=ext, **vlm_parameters
        )
        doc_name = os.path.basename(file_path)
        rel_path = os.path.relpath(file_path)
        return doc_name, rel_path, extracted, "image", model.model_name

    @staticmethod
    def _surface_page_placeholder(
        scan: bool, placeholder: str, text: Any
    ) -> Optional[str]:
        """
        Decide whether to expose the page placeholder in output.

        Never exposes placeholders containing '%'. Returns the placeholder if
        scanning mode is enabled or if the placeholder is found in the text.

        Args:
            scan (bool): Whether the document was scanned.
            placeholder (str): Page placeholder string.
            text (Any): Extracted text or content.

        Returns:
            str or None: The placeholder string if it should be exposed, else None.
        """
        if "%" in placeholder:
            return None
        txt = _ensure_str(text)
        return placeholder if (scan or placeholder in txt) else None

    def _convert_office_to_pdf(self, file_path: str) -> str:
        """
        Convert a DOCX/XLSX/PPTX file to PDF using LibreOffice.

        Args:
            file_path: Absolute path to the Office document.

        Returns:
            Path to the generated PDF in a temporary directory.

        Raises:
            RuntimeError: If LibreOffice (``soffice``) is not in *PATH* or the
            conversion fails for any reason.
        """
        if not shutil.which("soffice"):
            raise RuntimeError(
                "LibreOffice/soffice is required for Office-to-PDF conversion "
                "but was not found in PATH.  Install LibreOffice or use a "
                "different reader."
            )

        outdir = tempfile.mkdtemp(prefix="vanilla_office2pdf_")
        cmd = [
            "soffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            outdir,
            file_path,
        ]
        proc = subprocess.run(cmd, capture_output=True)
        if proc.returncode != 0:
            raise RuntimeError(
                f"LibreOffice failed converting {file_path} → PDF:\n{proc.stderr.decode()}"
            )

        pdf_name = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
        pdf_path = os.path.join(outdir, pdf_name)
        if not os.path.exists(pdf_path):
            raise RuntimeError(f"Expected PDF not found: {pdf_path}")

        return pdf_path
read(file_path=None, **kwargs)

Read a document from various sources and return standardized output.

This method supports:

  • Local file paths (file_path or positional arg)
  • URLs (file_url)
  • JSON/dict objects (json_document)
  • Raw text strings (text_document)

If multiple sources are provided, the priority is: file_path > file_url > json_document > text_document. If only file_path is provided, auto-detects whether it is a path, URL, JSON, YAML, or plain text.
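The priority rule above can be sketched as a small lookup. This is an illustrative sketch only: `pick_source` and `SOURCE_PRIORITY` are hypothetical names, and the library's own `_guess_source` helper (not shown on this page) may behave differently in edge cases.

```python
from typing import Any, Dict, Optional, Tuple

# Priority order as stated in the description above.
SOURCE_PRIORITY = ("file_path", "file_url", "json_document", "text_document")


def pick_source(kwargs: Dict[str, Any], file_path: Optional[Any] = None) -> Tuple[str, Any]:
    """Return (source_type, source_value) for the highest-priority source given.

    Hypothetical helper, written only to illustrate the documented priority.
    """
    candidates = dict(kwargs)
    # A keyword file_path takes precedence over the positional argument.
    candidates.setdefault("file_path", file_path)
    for key in SOURCE_PRIORITY:
        value = candidates.get(key)
        if value is not None:
            return key, value
    raise ValueError("No document source was provided.")


# file_url wins over text_document when both are passed.
print(pick_source({"file_url": "https://example.com/a.html", "text_document": "hi"}))
```

Keeping the priority in a single tuple makes the order easy to audit and to test.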

Parameters:

  • file_path (str | Path): Path to the input file (overridden by kwargs['file_path'] if present). Default: None.
  • **kwargs (Any): Optional arguments that adjust behavior, grouped as follows. Default: {}.

Source selection:
  • file_path (str): Path to the input file (overrides positional arg).
  • file_url (str): HTTPS/HTTP URL to read from.
  • json_document (dict | str): JSON-like document (dict or JSON string).
  • text_document (str): Raw text content.

Identification/metadata:
  • document_id (str): Explicit document id. Defaults to a new UUID.
  • metadata (dict): Additional metadata to attach to the output.

HTML handling:
  • html_to_markdown (bool): If True, convert HTML to Markdown before returning. If False (default), return raw HTML as-is.

PDF extraction:
  • scan_pdf_pages (bool): If True, rasterize and describe pages using a vision model (VLM). If False (default), use element-wise extraction.
  • model (BaseVisionModel): Vision-capable model used for scanned PDFs and/or image captioning (also used for image files).
  • prompt (str): Prompt for image captioning / page description. Defaults to DEFAULT_IMAGE_CAPTION_PROMPT for element-wise PDFs and DEFAULT_IMAGE_EXTRACTION_PROMPT for scanned PDFs/images.
  • resolution (int): DPI when rasterizing pages for VLM. Default: 300.
  • show_base64_images (bool): Include base64-embedded images in PDF output. Default: False.
  • image_placeholder (str): Placeholder for omitted images in PDFs. Default: "<!-- image -->".
  • page_placeholder (str): Placeholder inserted between PDF pages (only surfaced when scanning or when the placeholder occurs in text). Default: "<!-- page -->".
  • vlm_parameters (dict): Extra keyword args forwarded to model.analyze_content(...).

Excel / Parquet reading:
  • as_table (bool): For Excel (.xlsx/.xls), if True read as a table using pandas and return CSV text. If False (default), convert to PDF and run the PDF pipeline.
  • excel_engine (str): pandas Excel engine. Default: "openpyxl".
  • parquet_engine (str): pandas Parquet engine (e.g. "pyarrow", "fastparquet"). Default: pandas auto-selection.
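The rule for when `page_placeholder` is surfaced in the output can be sketched as a standalone function. This sketch follows the `_surface_page_placeholder` logic in the source listing; the `str(...)` coercion is an assumption standing in for the library's `_ensure_str` helper, whose implementation is not shown here.

```python
from typing import Any, Optional


def surface_page_placeholder(scan: bool, placeholder: str, text: Any) -> Optional[str]:
    """Return the placeholder if it should be reported in the output, else None."""
    # Placeholders containing '%' are never exposed.
    if "%" in placeholder:
        return None
    # Assumption: plain str() coercion in place of the library's _ensure_str.
    as_text = text if isinstance(text, str) else str(text)
    # Expose it when scanning, or when it actually occurs in the text.
    return placeholder if (scan or placeholder in as_text) else None


print(surface_page_placeholder(False, "<!-- page -->", "intro\n<!-- page -->\nbody"))
print(surface_page_placeholder(False, "<!-- page -->", "no page breaks here"))
```

The first call reports the placeholder because it occurs in the text; the second returns `None` because scanning is off and the placeholder is absent.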

Returns:

  • ReaderOutput: Unified result containing text, metadata, and extraction info.

Raises:

  • ValueError: If the source is invalid/unsupported, or if a VLM is required but not provided.
  • TypeError: If provided arguments are of unsupported types.
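One case where `ValueError` is raised is a scanned-PDF request without a vision model. The sketch below mirrors the check in `_process_pdf` from the source listing; `StubVisionModel` and `resolve_pdf_mode` are hypothetical names used only for illustration, and the model lookup is simplified to `kw.get("model") or default_model`.

```python
from typing import Any, Dict, Optional, Tuple


class StubVisionModel:
    """Hypothetical stand-in exposing only the attribute used below."""

    model_name = "stub-vlm"


def resolve_pdf_mode(
    kw: Dict[str, Any], default_model: Optional[Any] = None
) -> Tuple[str, Optional[str]]:
    """Return (conversion_method, ocr_method) as the PDF branch reports them."""
    model = kw.get("model") or default_model
    if kw.get("scan_pdf_pages"):
        # Scanning rasterizes pages, so a vision-capable model is mandatory.
        if model is None:
            raise ValueError("scan_pdf_pages=True requires a vision-capable model.")
        return "png", model.model_name
    # Element-wise extraction works with or without a model.
    return "pdf", (model.model_name if model else None)


print(resolve_pdf_mode({"scan_pdf_pages": True, "model": StubVisionModel()}))
print(resolve_pdf_mode({}))
```

Calling `resolve_pdf_mode({"scan_pdf_pages": True})` with no model raises `ValueError`, which is the situation the table above describes.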

Notes
  • HTML control via html_to_markdown applies to both local files and URLs.
  • For .parquet files, content is loaded via pandas and returned as CSV-formatted text.
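As a rough guide to what `conversion_method` reports for text-like and tabular files, the mapping in `_handle_local_path` can be condensed as follows. This is a sketch covering only those branches; `conversion_method_for` is a hypothetical helper, and PDF, Office, HTML, and image files are deliberately left out because their result depends on other options.

```python
# Extensions read as plain text; the conversion method is the extension itself.
TEXT_LIKE = {"json", "txt", "xml", "csv", "tsv", "md", "markdown"}


def conversion_method_for(ext: str) -> str:
    """Return the conversion method reported for a text-like or tabular file."""
    ext = ext.lower().lstrip(".")
    if ext in TEXT_LIKE:
        return ext
    if ext == "parquet":
        # Parquet content is loaded via pandas and returned as CSV text.
        return "csv"
    if ext in ("yaml", "yml"):
        # YAML files are reported with the "json" conversion method.
        return "json"
    raise ValueError(f"Not covered by this sketch: {ext}")


print(conversion_method_for(".parquet"))
print(conversion_method_for("yml"))
```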
Example
# Convert HTML to Markdown
reader = VanillaReader()
md_output = reader.read(file_path="page.html", html_to_markdown=True)

# Keep raw HTML as-is
html_output = reader.read(file_path="page.html", html_to_markdown=False)
Source code in src/splitter_mr/reader/readers/vanilla_reader.py
def read(
    self,
    file_path: str | Path = None,
    **kwargs: Any,
) -> ReaderOutput:
    """
    Read a document from various sources and return standardized output.

    This method supports:
    - Local file paths (``file_path`` or positional arg)
    - URLs (``file_url``)
    - JSON/dict objects (``json_document``)
    - Raw text strings (``text_document``)

    If multiple sources are provided, the priority is:
    ``file_path`` > ``file_url`` > ``json_document`` > ``text_document``.
    If only ``file_path`` is provided, auto-detects whether it is a path, URL,
    JSON, YAML, or plain text.

    Args:
        file_path (str | Path): Path to the input file (overridden by
            ``kwargs['file_path']`` if present).

        **kwargs: Optional arguments that adjust behavior:

            Source selection:
                file_path (str): Path to the input file (overrides positional arg).
                file_url (str): HTTPS/HTTP URL to read from.
                json_document (dict | str): JSON-like document (dict or JSON string).
                text_document (str): Raw text content.

            Identification/metadata:
                document_id (str): Explicit document id. Defaults to a new UUID.
                metadata (dict): Additional metadata to attach to the output.

            HTML handling:
                html_to_markdown (bool): If True, convert HTML to Markdown before
                    returning. If False (default), return raw HTML as-is.

            PDF extraction:
                scan_pdf_pages (bool): If True, rasterize and describe pages using a
                    vision model (VLM). If False (default), use element-wise extraction.
                model (BaseVisionModel): Vision-capable model used for scanned PDFs and/or
                    image captioning (also used for image files).
                prompt (str): Prompt for image captioning / page description. Defaults to
                    ``DEFAULT_IMAGE_CAPTION_PROMPT`` for element-wise PDFs and
                    ``DEFAULT_IMAGE_EXTRACTION_PROMPT`` for scanned PDFs/images.
                resolution (int): DPI when rasterizing pages for VLM. Default: 300.
                show_base64_images (bool): Include base64-embedded images in PDF output.
                    Default: False.
                image_placeholder (str): Placeholder for omitted images in PDFs.
                    Default: ``"<!-- image -->"``.
                page_placeholder (str): Placeholder inserted between PDF pages (only
                    surfaced when scanning or when the placeholder occurs in text).
                    Default: ``"<!-- page -->"``.
                vlm_parameters (dict): Extra keyword args forwarded to
                    ``model.analyze_content(...)``.

            Excel / Parquet reading:
                as_table (bool): For Excel (``.xlsx``/``.xls``), if True read as a table
                    using pandas and return CSV text. If False (default), convert to PDF
                    and run the PDF pipeline.
                excel_engine (str): pandas Excel engine. Default: ``"openpyxl"``.
                parquet_engine (str): pandas Parquet engine (e.g. ``"pyarrow"``,
                    ``"fastparquet"``). Default: pandas auto-selection.

    Returns:
        ReaderOutput: Unified result containing text, metadata, and extraction info.

    Raises:
        ValueError: If the source is invalid/unsupported, or if a VLM is required
            but not provided.
        TypeError: If provided arguments are of unsupported types.

    Notes:
        - HTML control via ``html_to_markdown`` applies to both local files and URLs.
        - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

    Example:
        ```python
        # Convert HTML to Markdown
        reader = VanillaReader()
        md_output = reader.read(file_path="page.html", html_to_markdown=True)

        # Keep raw HTML as-is
        html_output = reader.read(file_path="page.html", html_to_markdown=False)
        ```
    """

    source_type, source_val = _guess_source(kwargs, file_path)
    name, path, text, conv, ocr = self._dispatch_source(
        source_type, source_val, kwargs
    )

    page_ph: str = kwargs.get("page_placeholder", "<!-- page -->")
    page_ph_out = self._surface_page_placeholder(
        scan=bool(kwargs.get("scan_pdf_pages")),
        placeholder=page_ph,
        text=text,
    )

    return ReaderOutput(
        text=_ensure_str(text),
        document_name=name,
        document_path=path or "",
        document_id=kwargs.get("document_id", str(uuid.uuid4())),
        conversion_method=conv,
        reader_method="vanilla",
        ocr_method=ocr,
        page_placeholder=page_ph_out,
        metadata=kwargs.get("metadata", {}),
    )

VanillaReader relies on a helper class, PDFPlumberReader, to read PDF files and to call vision-language models.

DoclingReader


DoclingReader

Bases: BaseReader

High-level document reader leveraging IBM Docling for flexible document-to-Markdown conversion, with optional image captioning or VLM-based PDF processing. Supports automatic pipeline selection, seamless integration with custom vision-language models, and configurable output for both PDF and non-PDF files.
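The automatic pipeline selection mentioned above reduces to three inputs: the file extension, whether a model was supplied, and the `scan_pdf_pages` flag. The following is a condensed sketch of the decision made in `_select_pipeline`; `select_pipeline_name` is a hypothetical helper that returns only the pipeline name and omits the per-pipeline arguments.

```python
def select_pipeline_name(ext: str, has_model: bool, scan_pdf_pages: bool = False) -> str:
    """Return the pipeline name chosen for a file, following the documented rules."""
    if ext != "pdf":
        # Non-PDF files always go through the Markdown pipeline.
        return "markdown"
    if scan_pdf_pages:
        # Each page is rasterized and analyzed as an image.
        return "page_image"
    # With a model, the whole PDF goes through the VLM pipeline;
    # without one, the default Markdown pipeline is used.
    return "vlm" if has_model else "markdown"


for case in [("docx", True, False), ("pdf", False, True), ("pdf", True, False), ("pdf", False, False)]:
    print(case, "->", select_pipeline_name(*case))
```

Note that `scan_pdf_pages` is checked before the model, so it takes precedence for PDFs.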

Source code in src/splitter_mr/reader/readers/docling_reader.py
class DoclingReader(BaseReader):
    """
    High-level document reader leveraging IBM Docling for flexible document-to-Markdown conversion,
    with optional image captioning or VLM-based PDF processing. Supports automatic pipeline selection,
    seamless integration with custom vision-language models, and configurable output for both PDF
    and non-PDF files.
    """

    SUPPORTED_EXTENSIONS = SUPPORTED_DOCLING_FILE_EXTENSIONS

    _IMAGE_PATTERN = re.compile(
        r"!\[(?P<alt>[^\]]*?)\]"
        r"\((?P<uri>data:image/[a-zA-Z0-9.+-]+;base64,(?P<b64>[A-Za-z0-9+/=]+))\)"
    )

    def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
        """
        Initialize a DoclingReader instance.

        Args:
            model (Optional[BaseVisionModel], optional): An optional vision-language
                model instance used for PDF pipelines that require image captioning
                or per-page analysis. If provided, the model’s client and metadata
                (e.g., Azure deployment settings) are stored for use in downstream
                processing. Defaults to None.
        """
        self.model = model
        self.client = None
        self.model_name: Optional[str] = None
        if model:
            self.client = model.get_client()
            self.model_name = model.model_name

    def read(
        self,
        file_path: str | Path,
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Reads a document, automatically selecting the appropriate Docling pipeline for extraction.
        Supports PDFs (per-page VLM or standard extraction), as well as other file types.

        Args:
            file_path (str | Path): Path or URL to the document file.
            **kwargs: Keyword arguments to control extraction, including:
                - prompt (str): Prompt for image captioning or VLM-based PDF extraction.
                - scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
                - show_base64_images (bool): If True, embed base64 images in Markdown; if False, use
                    image placeholders.
                - page_placeholder (str): Placeholder for page breaks in output Markdown.
                - image_placeholder (str): Placeholder for image locations in output Markdown.
                - image_resolution (float): Resolution scaling factor for image extraction.
                - document_id (Optional[str]): Optional document ID for metadata.
                - metadata (Optional[dict]): Optional metadata dictionary.

        Returns:
            ReaderOutput: Extracted document in Markdown format and associated metadata.

        Warns:
            UserWarning: If the file extension is unsupported; the call then falls back
                to VanillaReader.

        Raises:
            ValueError: If PDF pipeline requirements are not satisfied (e.g., neither model nor
                show_base64_images provided).
        """

        ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
        if ext not in self.SUPPORTED_EXTENSIONS:
            msg = f"Unsupported extension '{ext}'. Using VanillaReader."
            warnings.warn(msg)
            return VanillaReader().read(file_path=file_path, **kwargs)

        # Pipeline selection and execution
        pipeline_name, pipeline_args = self._select_pipeline(file_path, ext, **kwargs)
        md = DoclingPipelineFactory.run(pipeline_name, file_path, **pipeline_args)

        page_placeholder: str = pipeline_args.get("page_placeholder", "<!-- page -->")
        page_placeholder_value = (
            page_placeholder if page_placeholder and page_placeholder in md else None
        )

        text = md

        return ReaderOutput(
            text=text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method="markdown",
            reader_method="docling",
            ocr_method=self.model_name,
            page_placeholder=page_placeholder_value,
            metadata=kwargs.get("metadata", {}),
        )

    def _select_pipeline(self, file_path: str, ext: str, **kwargs) -> tuple[str, dict]:
        """
        Decides which pipeline to use and prepares arguments for it.

        Args:
            file_path (str): Path to the input document.
            ext (str): File extension.
            **kwargs: Extraction and pipeline control options, including:
                - prompt (str)
                - scan_pdf_pages (bool)
                - show_base64_images (bool)
                - page_placeholder (str)
                - image_placeholder (str)
                - image_resolution (float)

        Returns:
            tuple[str, dict]: Name of the selected pipeline and the dictionary of arguments for that pipeline.

        Pipeline selection logic:
            - For PDFs:
                - If scan_pdf_pages is True: uses per-page VLM/image pipeline.
                - Else if model is provided: uses VLM pipeline.
                - Else: uses default Markdown pipeline.
            - For other extensions: always uses Markdown pipeline.
        """
        # Defaults
        show_base64_images: bool = kwargs.get("show_base64_images", False)
        page_placeholder: str = kwargs.get("page_placeholder", "<!-- page -->")
        image_placeholder: str = kwargs.get("image_placeholder", "<!-- image -->")
        image_resolution: float = kwargs.get("image_resolution", 1.0)
        scan_pdf_pages: bool = kwargs.get("scan_pdf_pages", False)

        # --- PDF logic ---
        if ext == "pdf":
            if scan_pdf_pages:
                # Scan pages as images and extract their content
                pipeline_args = {
                    "model": self.model,
                    "prompt": kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT),
                    "image_resolution": image_resolution,
                    "page_placeholder": page_placeholder,
                    "show_base64_images": show_base64_images,
                }
                pipeline_name = "page_image"
            else:
                if self.model:
                    if show_base64_images:
                        warnings.warn(
                            "Base64 images are not rendered when a model is provided. "
                            "Disable `show_base64_images` or omit the model in the "
                            "class constructor."
                        )
                    # Use VLM pipeline for the whole PDF
                    pipeline_args = {
                        "model": self.model,
                        "prompt": kwargs.get("prompt", DEFAULT_IMAGE_CAPTION_PROMPT),
                        "page_placeholder": page_placeholder,
                        "image_placeholder": image_placeholder,
                    }
                    pipeline_name = "vlm"
                else:
                    # No model: use markdown pipeline (default docling, base64 or placeholders)
                    pipeline_args = {
                        "show_base64_images": show_base64_images,
                        "page_placeholder": page_placeholder,
                        "image_placeholder": image_placeholder,
                        "image_resolution": image_resolution,
                        "ext": ext,
                    }
                    pipeline_name = "markdown"
        else:
            # For non-PDF: use markdown pipeline
            pipeline_args = {
                "show_base64_images": show_base64_images,
                "page_placeholder": page_placeholder,
                "image_placeholder": image_placeholder,
                "ext": ext,
            }
            pipeline_name = "markdown"

        return pipeline_name, pipeline_args
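
The branching in `_select_pipeline` can be condensed into a small standalone function. The sketch below is illustrative only: `select_pipeline_name` and its `has_model` flag are hypothetical names that mirror the decision order shown above, and they are not part of the library.

```python
def select_pipeline_name(ext: str, has_model: bool, scan_pdf_pages: bool = False) -> str:
    """Return the pipeline name that the selection logic above would choose."""
    if ext != "pdf":
        # Non-PDF files always go through the Markdown pipeline.
        return "markdown"
    if scan_pdf_pages:
        # Per-page scanning takes priority over every other PDF option.
        return "page_image"
    # Without page scanning, the presence of a model decides the pipeline.
    return "vlm" if has_model else "markdown"


print(select_pipeline_name("pdf", has_model=True, scan_pdf_pages=True))   # page_image
print(select_pipeline_name("pdf", has_model=True))                        # vlm
print(select_pipeline_name("pdf", has_model=False))                       # markdown
print(select_pipeline_name("docx", has_model=True, scan_pdf_pages=True))  # markdown
```

In the code shown above, `scan_pdf_pages=True` selects the `page_image` pipeline even when no model was passed to the constructor, so that combination is worth checking before calling `read`.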
__init__(model=None)

Initialize a DoclingReader instance.

Parameters:

Name Type Description Default
model Optional[BaseVisionModel]

An optional vision-language model instance used for PDF pipelines that require image captioning or per-page analysis. If provided, the model’s client and metadata (e.g., Azure deployment settings) are stored for use in downstream processing. Defaults to None.

None
Source code in src/splitter_mr/reader/readers/docling_reader.py
def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
    """
    Initialize a DoclingReader instance.

    Args:
        model (Optional[BaseVisionModel], optional): An optional vision-language
            model instance used for PDF pipelines that require image captioning
            or per-page analysis. If provided, the model’s client and metadata
            (e.g., Azure deployment settings) are stored for use in downstream
            processing. Defaults to None.
    """
    self.model = model
    self.client = None
    self.model_name: Optional[str] = None
    if model:
        self.client = model.get_client()
        self.model_name = model.model_name
read(file_path, **kwargs)

Reads a document, automatically selecting the appropriate Docling pipeline for extraction. Supports PDFs (per-page VLM or standard extraction), as well as other file types.

Parameters:

Name Type Description Default
file_path str | Path

Path or URL to the document file.

required
**kwargs Any

Keyword arguments to control extraction, including:
  • prompt (str): Prompt for image captioning or VLM-based PDF extraction.
  • scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
  • show_base64_images (bool): If True, embed base64 images in Markdown; if False, use image placeholders.
  • page_placeholder (str): Placeholder for page breaks in output Markdown.
  • image_placeholder (str): Placeholder for image locations in output Markdown.
  • image_resolution (float): Resolution scaling factor for image extraction.
  • document_id (Optional[str]): Optional document ID for metadata.
  • metadata (Optional[dict]): Optional metadata dictionary.

{}

Returns:

Name Type Description
ReaderOutput ReaderOutput

Extracted document in Markdown format and associated metadata.

Raises:

Type Description
Warning

If a file extension is unsupported, falls back to VanillaReader and emits a warning.

ValueError

If PDF pipeline requirements are not satisfied (e.g., neither model nor show_base64_images provided).
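
The `Warning` entry is emitted through `warnings.warn` rather than raised, so callers who want to detect the fallback need to capture it. The following is a minimal sketch of that pattern, assuming a made-up `SUPPORTED_EXTENSIONS` subset and two hypothetical helpers (`normalize_extension`, `needs_fallback`) that imitate the extension check in `read`.

```python
import os
import warnings

# Hypothetical subset of extensions; the real list lives in DoclingReader.SUPPORTED_EXTENSIONS.
SUPPORTED_EXTENSIONS = {"pdf", "docx", "html", "png"}


def normalize_extension(file_path: str) -> str:
    """Lowercase the file extension and strip the leading dot, as read() does."""
    return os.path.splitext(file_path)[1].lower().lstrip(".")


def needs_fallback(file_path: str) -> bool:
    """Warn and return True when the extension is outside the supported set."""
    ext = normalize_extension(file_path)
    if ext not in SUPPORTED_EXTENSIONS:
        warnings.warn(f"Unsupported extension '{ext}'. Using VanillaReader.")
        return True
    return False


# Record warnings instead of printing them, so the fallback can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = needs_fallback("data/table.CSV")

print(fallback)                # True
print(str(caught[0].message))  # Unsupported extension 'csv'. Using VanillaReader.
```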

Source code in src/splitter_mr/reader/readers/docling_reader.py
def read(
    self,
    file_path: str | Path,
    **kwargs: Any,
) -> ReaderOutput:
    """
    Reads a document, automatically selecting the appropriate Docling pipeline for extraction.
    Supports PDFs (per-page VLM or standard extraction), as well as other file types.

    Args:
        file_path (str | Path): Path or URL to the document file.
        **kwargs: Keyword arguments to control extraction, including:
            - prompt (str): Prompt for image captioning or VLM-based PDF extraction.
            - scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
            - show_base64_images (bool): If True, embed base64 images in Markdown; if False, use
                image placeholders.
            - page_placeholder (str): Placeholder for page breaks in output Markdown.
            - image_placeholder (str): Placeholder for image locations in output Markdown.
            - image_resolution (float): Resolution scaling factor for image extraction.
            - document_id (Optional[str]): Optional document ID for metadata.
            - metadata (Optional[dict]): Optional metadata dictionary.

    Returns:
        ReaderOutput: Extracted document in Markdown format and associated metadata.

    Raises:
        Warning: If a file extension is unsupported, falls back to VanillaReader and emits a warning.
        ValueError: If PDF pipeline requirements are not satisfied (e.g., neither model nor
            show_base64_images provided).
    """

    ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
    if ext not in self.SUPPORTED_EXTENSIONS:
        msg = f"Unsupported extension '{ext}'. Using VanillaReader."
        warnings.warn(msg)
        return VanillaReader().read(file_path=file_path, **kwargs)

    # Pipeline selection and execution
    pipeline_name, pipeline_args = self._select_pipeline(file_path, ext, **kwargs)
    md = DoclingPipelineFactory.run(pipeline_name, file_path, **pipeline_args)

    page_placeholder: str = pipeline_args.get("page_placeholder", "<!-- page -->")
    page_placeholder_value = (
        page_placeholder if page_placeholder and page_placeholder in md else None
    )

    text = md

    return ReaderOutput(
        text=text,
        document_name=os.path.basename(file_path),
        document_path=file_path,
        document_id=kwargs.get("document_id", str(uuid.uuid4())),
        conversion_method="markdown",
        reader_method="docling",
        ocr_method=self.model_name,
        page_placeholder=page_placeholder_value,
        metadata=kwargs.get("metadata", {}),
    )
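
The `page_placeholder` field of the output is only set when the placeholder actually occurs in the extracted Markdown. As a rough illustration of how a consumer might rely on that, the sketch below uses two hypothetical helpers (`resolve_page_placeholder`, `split_pages`); neither is part of the library.

```python
from typing import List, Optional


def resolve_page_placeholder(markdown: str, placeholder: str) -> Optional[str]:
    """Return the placeholder only if it is non-empty and occurs in the text."""
    return placeholder if placeholder and placeholder in markdown else None


def split_pages(markdown: str, placeholder: Optional[str]) -> List[str]:
    """Split Markdown into pages on the placeholder, dropping empty chunks."""
    if placeholder is None:
        return [markdown]
    return [chunk.strip() for chunk in markdown.split(placeholder) if chunk.strip()]


md = "<!-- page -->\nFirst page\n<!-- page -->\nSecond page"
marker = resolve_page_placeholder(md, "<!-- page -->")
print(marker)                   # <!-- page -->
print(split_pages(md, marker))  # ['First page', 'Second page']
print(resolve_page_placeholder("no markers here", "<!-- page -->"))  # None
```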

To execute pipelines, DoclingReader has a utils class, DoclingUtils.

MarkItDownReader


MarkItDownReader

Bases: BaseReader

Read multiple file types using Microsoft's MarkItDown library and convert the documents to Markdown format.

This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs) for LLM-based OCR when extracting text from images or scanned documents.

Source code in src/splitter_mr/reader/readers/markitdown_reader.py
class MarkItDownReader(BaseReader):
    """
    Read multiple file types using Microsoft's MarkItDown library and convert
    the documents to Markdown format.

    This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs)
    for LLM-based OCR when extracting text from images or scanned documents.
    """

    def __init__(self, model: BaseVisionModel = None) -> None:
        """
        Initializer method for MarkItDownReader

        Args:
            model (Optional[BaseVisionModel], optional): An optional vision-language
                model instance used for PDF pipelines that require image captioning
                or per-page analysis. If provided, the model’s client and metadata
                (e.g., Azure deployment settings) are stored for use in downstream
                processing. Defaults to None.
        """
        self.model = model
        self.model_name = model.model_name if self.model else None

    def _convert_to_pdf(self, file_path: str) -> str:
        """
        Converts DOCX, PPTX, or XLSX to PDF using LibreOffice (headless mode).

        Args:
            file_path (str): Path to the Office file (DOCX, PPTX, or XLSX).

        Returns:
            str: Path to the converted PDF.

        Raises:
            RuntimeError: If conversion fails or LibreOffice is not installed.
        """
        if not shutil.which("soffice"):
            raise RuntimeError(
                "LibreOffice (soffice) is required for Office to PDF conversion but was not found in PATH. "
                "Please install LibreOffice or set split_by_pages=False. "
                "How to install: https://www.libreoffice.org/get-help/install-howto/"
            )

        outdir = tempfile.mkdtemp()
        # Use soffice (LibreOffice) in headless mode
        cmd = [
            "soffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            outdir,
            file_path,
        ]
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"Failed to convert {file_path} to PDF: {result.stderr.decode()}"
            )
        pdf_name = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
        pdf_path = os.path.join(outdir, pdf_name)
        if not os.path.exists(pdf_path):
            raise RuntimeError(f"PDF was not created: {pdf_path}")
        return pdf_path

    def _pdf_pages_to_streams(self, pdf_path: str) -> List[io.BytesIO]:
        """
        Convert each PDF page to a PNG and wrap in a BytesIO stream.

        Args:
            pdf_path (str): Path to the PDF file.

        Returns:
            List[io.BytesIO]: List of PNG image streams for each page.
        """
        doc = fitz.open(pdf_path)
        streams = []
        for idx in range(len(doc)):
            pix = doc.load_page(idx).get_pixmap()
            buf = io.BytesIO(pix.tobytes("png"))
            buf.name = f"page_{idx + 1}.png"
            buf.seek(0)
            streams.append(buf)
        return streams

    def _split_pdf_to_temp_pdfs(self, pdf_path: str) -> List[str]:
        """
        Split a PDF file into single-page temporary PDF files.

        Args:
            pdf_path (str): Path to the PDF file to split.

        Returns:
            List[str]: List of file paths for the temporary single-page PDFs.

        Example:
            temp_files = self._split_pdf_to_temp_pdfs("document.pdf")
            # temp_files = ["/tmp/tmpa1b2c3.pdf", "/tmp/tmpd4e5f6.pdf", ...]
        """
        temp_files = []
        reader = PdfReader(pdf_path)
        for i, page in enumerate(reader.pages):
            writer = PdfWriter()
            writer.add_page(page)
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                writer.write(tmp)
                temp_files.append(tmp.name)
        return temp_files

    def _pdf_pages_to_markdown(
        self, file_path: str, md: MarkItDown, prompt: str, page_placeholder: str
    ) -> str:
        """
        Convert each scanned PDF page to markdown using the provided MarkItDown instance.

        Args:
            file_path (str): Path to PDF.
            md (MarkItDown): The MarkItDown converter instance.
            prompt (str): The LLM prompt for OCR.
            page_placeholder (str): Page break placeholder for markdown.

        Returns:
            str: Markdown of the entire PDF (one page per placeholder).
        """
        page_md = []
        for idx, page_stream in enumerate(
            self._pdf_pages_to_streams(file_path), start=1
        ):
            page_md.append(page_placeholder.replace("{page}", str(idx)))
            result = md.convert(page_stream, llm_prompt=prompt)
            page_md.append(result.text_content)
        return "\n".join(page_md)

    def _pdf_file_per_page_to_markdown(
        self, file_path: str, md: "MarkItDown", prompt: str, page_placeholder: str
    ) -> str:
        """
        Convert each page of a PDF to markdown by splitting the PDF into temporary single-page files,
        extracting text from each page using MarkItDown, and joining the results with a page placeholder.

        Args:
            file_path (str): Path to the PDF file.
            md (MarkItDown): The MarkItDown converter instance.
            prompt (str): The LLM prompt for extraction.
            page_placeholder (str): Markdown placeholder for page breaks; supports '{page}' for numbering.

        Returns:
            str: Concatenated markdown content for the entire PDF, separated by page placeholders.

        Raises:
            Any exception raised by MarkItDown or file I/O will propagate.

        Example:
            markdown = self._pdf_file_per_page_to_markdown("doc.pdf", md, prompt, "<!-- page {page} -->")
        """
        temp_files = self._split_pdf_to_temp_pdfs(pdf_path=file_path)
        page_md = []
        try:
            for idx, temp_pdf in enumerate(temp_files, start=1):
                page_md.append(page_placeholder.replace("{page}", str(idx)))
                result = md.convert(temp_pdf, llm_prompt=prompt)
                page_md.append(result.text_content)
            return "\n".join(page_md)
        finally:
            # Clean up temp files
            for temp_pdf in temp_files:
                os.remove(temp_pdf)

    def _get_markitdown(self) -> tuple:
        """
        Returns a MarkItDown instance and OCR method name depending on model presence.

        Returns:
            tuple[MarkItDown, Optional[str]]: MarkItDown instance, OCR method or None.

        Raises:
            ValueError: If provided model is not supported.
        """
        if self.model:
            self.client = self.model.get_client()
            if not isinstance(self.client, OpenAI):
                raise ValueError(
                    "Incompatible client. Only models that use the OpenAI client are supported."
                )
            return (
                MarkItDown(llm_client=self.client, llm_model=self.model.model_name),
                self.model.model_name,
            )
        else:
            return MarkItDown(), None

    def read(self, file_path: Path | str = None, **kwargs: Any) -> ReaderOutput:
        """
        Reads a file and converts its contents to Markdown using MarkItDown.

        Features:
            - Standard file-to-Markdown conversion for most formats.
            - LLM-based OCR (if a Vision model is provided) for images and scanned PDFs.
            - Optional PDF page-wise OCR with fine-grained control and custom LLM prompt.

        Args:
            file_path (Path | str): Path to the input file to be read and converted.
            **kwargs:
                - `document_id (Optional[str])`: Unique document identifier.
                    If not provided, a UUID will be generated.
                - `metadata (Dict[str, Any], optional)`: Additional metadata, given in dictionary format.
                    If not provided, no metadata is returned.
                - `prompt (Optional[str])`: Prompt for image captioning or VLM extraction.
                - `page_placeholder (str)`: Markdown placeholder string for pages (default: "<!-- page -->").
                - `split_by_pages (bool)`: If True, split the PDF by pages and process each page
                    separately. DOCX, PPTX, and XLSX inputs are first converted to PDF with
                    LibreOffice. Default is False.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Example:
            ```python
            from splitter_mr.model import OpenAIVisionModel
            from splitter_mr.reader import MarkItDownReader

            model = OpenAIVisionModel()
            reader = MarkItDownReader(model=model)
            output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
            print(output.text)
            ```
            ```python
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """
        # Initialize MarkItDown reader
        file_path: str | Path = os.fspath(file_path)
        ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
        prompt: str = kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
        page_placeholder: str = kwargs.get("page_placeholder", "<!-- page -->")
        split_by_pages: bool = kwargs.get("split_by_pages", False)
        conversion_method: str = None
        md, ocr_method = self._get_markitdown()

        PDF_CONVERTIBLE_EXT: Set[str] = {"docx", "pptx", "xlsx"}

        # Office files are converted to PDF first so they can be split by page
        if split_by_pages and ext in PDF_CONVERTIBLE_EXT:
            file_path = self._convert_to_pdf(file_path)

        # Process text
        if split_by_pages:
            markdown_text = self._pdf_file_per_page_to_markdown(
                file_path=file_path,
                md=md,
                prompt=prompt,
                page_placeholder=page_placeholder,
            )
            conversion_method = "markdown"
        elif self.model is not None:
            markdown_text = self._pdf_pages_to_markdown(
                file_path=file_path,
                md=md,
                prompt=prompt,
                page_placeholder=page_placeholder,
            )
            conversion_method = "markdown"
        else:
            markdown_text = md.convert(file_path, llm_prompt=prompt).text_content
            conversion_method = "json" if ext == "json" else "markdown"

        page_placeholder_value = (
            page_placeholder
            if page_placeholder and page_placeholder in markdown_text
            else None
        )

        # Return output
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method=conversion_method,
            reader_method="markitdown",
            ocr_method=ocr_method,
            page_placeholder=page_placeholder_value,
            metadata=kwargs.get("metadata", {}),
        )
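
The Office-to-PDF step in `_convert_to_pdf` comes down to building a headless LibreOffice command and predicting where the PDF will land. The sketch below isolates those two pieces without running LibreOffice; `build_soffice_command` and `expected_pdf_path` are hypothetical helper names introduced for illustration.

```python
import os
import shutil
from typing import List


def build_soffice_command(file_path: str, outdir: str) -> List[str]:
    """Build the headless LibreOffice command used to convert a file to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, file_path]


def expected_pdf_path(file_path: str, outdir: str) -> str:
    """Keep the input's base name and swap the extension for .pdf."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    return os.path.join(outdir, stem + ".pdf")


# shutil.which returns None when the executable is not on PATH.
print("soffice available:", shutil.which("soffice") is not None)
print(build_soffice_command("slides.pptx", "out"))
print(expected_pdf_path("reports/slides.pptx", "out"))
```

Checking `shutil.which("soffice")` up front, as the method above does, gives a clear error message instead of a failed subprocess call.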
__init__(model=None)

Initializer method for MarkItDownReader

Parameters:

Name Type Description Default
model Optional[BaseVisionModel]

An optional vision-language model instance used for PDF pipelines that require image captioning or per-page analysis. If provided, the model’s client and metadata (e.g., Azure deployment settings) are stored for use in downstream processing. Defaults to None.

None
Source code in src/splitter_mr/reader/readers/markitdown_reader.py
def __init__(self, model: BaseVisionModel = None) -> None:
    """
    Initializer method for MarkItDownReader

    Args:
        model (Optional[BaseVisionModel], optional): An optional vision-language
            model instance used for PDF pipelines that require image captioning
            or per-page analysis. If provided, the model’s client and metadata
            (e.g., Azure deployment settings) are stored for use in downstream
            processing. Defaults to None.
    """
    self.model = model
    self.model_name = model.model_name if self.model else None
read(file_path=None, **kwargs)

Reads a file and converts its contents to Markdown using MarkItDown.

Features
  • Standard file-to-Markdown conversion for most formats.
  • LLM-based OCR (if a Vision model is provided) for images and scanned PDFs.
  • Optional PDF page-wise OCR with fine-grained control and custom LLM prompt.
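
These three features correspond to three branches inside `read`, checked in a fixed order. The sketch below names each branch; `choose_strategy` and the strings it returns are hypothetical labels, not identifiers from the library.

```python
def choose_strategy(split_by_pages: bool, has_model: bool) -> str:
    """Name the branch of read() that would run for the given options."""
    if split_by_pages:
        # Checked first: one temporary single-page PDF per page.
        return "split_pdf_per_page"
    if has_model:
        # Pages are rendered to PNG images and sent to the vision model.
        return "page_images_with_model"
    # Plain MarkItDown conversion of the whole file.
    return "standard_conversion"


print(choose_strategy(split_by_pages=True, has_model=True))    # split_pdf_per_page
print(choose_strategy(split_by_pages=False, has_model=True))   # page_images_with_model
print(choose_strategy(split_by_pages=False, has_model=False))  # standard_conversion
```

In the source shown on this page, the model branch does not check the file extension, so passing a model routes every input through the page-image path unless `split_by_pages` is set.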

Parameters:

Name Type Description Default
file_path Path | str

Path to the input file to be read and converted.

None
**kwargs Any
  • document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated.
  • metadata (Dict[str, Any], optional): Additional metadata, given in dictionary format. If not provided, no metadata is returned.
  • prompt (Optional[str]): Prompt for image captioning or VLM extraction.
  • page_placeholder (str): Markdown placeholder string for pages (default: "<!-- page -->").
  • split_by_pages (bool): If True, split the PDF by pages and process each page separately. DOCX, PPTX, and XLSX inputs are first converted to PDF with LibreOffice. Default is False.
{}
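
The per-page helpers in the class source replace a `{page}` token in `page_placeholder` with the 1-based page number before joining the pages. A minimal sketch of that joining step, using a hypothetical `join_pages` function and made-up page texts:

```python
from typing import List


def join_pages(pages: List[str], page_placeholder: str = "<!-- page -->") -> str:
    """Prefix each page with its placeholder and join everything with newlines."""
    parts: List[str] = []
    for number, text in enumerate(pages, start=1):
        # '{page}' is replaced by the 1-based page number; other text is kept as-is.
        parts.append(page_placeholder.replace("{page}", str(number)))
        parts.append(text)
    return "\n".join(parts)


print(join_pages(["Intro", "Details"], "<!-- page {page} -->"))
# <!-- page 1 -->
# Intro
# <!-- page 2 -->
# Details
```

Because the token is substituted, a numbered placeholder such as `"<!-- page {page} -->"` no longer appears verbatim in the text, and the `page_placeholder` field of the returned `ReaderOutput` would then be `None` according to the check in `read`.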

Returns:

Name Type Description
ReaderOutput ReaderOutput

Dataclass defining the output structure for all readers.
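
For orientation, the fields that `read` populates can be mirrored in a small dataclass. This is a hypothetical stand-in (`ReaderOutputSketch`) built only from the keyword arguments visible in the source on this page; the real `ReaderOutput` may define different types, defaults, or validation.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ReaderOutputSketch:
    """Hypothetical stand-in listing the fields that read() fills in."""

    text: str = ""
    document_name: Optional[str] = None
    document_path: str = ""
    # A fresh UUID is generated when the caller does not pass document_id.
    document_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    page_placeholder: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


out = ReaderOutputSketch(
    text="# Title",
    document_name="lorem_ipsum.pdf",
    document_path="data/lorem_ipsum.pdf",
    conversion_method="markdown",
    reader_method="markitdown",
)
print(out.reader_method)  # markitdown
print(out.ocr_method)     # None
```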

Example

from splitter_mr.model import OpenAIVisionModel
from splitter_mr.reader import MarkItDownReader

model = OpenAIVisionModel()
reader = MarkItDownReader(model=model)
output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
print(output.text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...

Source code in src/splitter_mr/reader/readers/markitdown_reader.py
def read(self, file_path: Path | str = None, **kwargs: Any) -> ReaderOutput:
    """
    Reads a file and converts its contents to Markdown using MarkItDown.

    Features:
        - Standard file-to-Markdown conversion for most formats.
        - LLM-based OCR (if a Vision model is provided) for images and scanned PDFs.
        - Optional PDF page-wise OCR with fine-grained control and custom LLM prompt.

    Args:
        file_path (Path | str): Path to the input file to be read and converted.
        **kwargs:
            - `document_id (Optional[str])`: Unique document identifier.
                If not provided, a UUID will be generated.
            - `metadata (Dict[str, Any], optional)`: Additional metadata, given in dictionary format.
                If not provided, no metadata is returned.
            - `prompt (Optional[str])`: Prompt for image captioning or VLM extraction.
            - `page_placeholder (str)`: Markdown placeholder string for pages (default: "<!-- page -->").
            - `split_by_pages (bool)`: If True, split the PDF by pages and process each page
                separately. DOCX, PPTX, and XLSX inputs are first converted to PDF with
                LibreOffice. Default is False.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Example:
        ```python
        from splitter_mr.model import OpenAIVisionModel
        from splitter_mr.reader import MarkItDownReader

        model = OpenAIVisionModel()
        reader = MarkItDownReader(model=model)
        output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
        print(output.text)
        ```
        ```python
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
        rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
        Pellentesque ex felis, cursus ege...
        ```
    """
    # Initialize MarkItDown reader
    file_path: str | Path = os.fspath(file_path)
    ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
    prompt: str = kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
    page_placeholder: str = kwargs.get("page_placeholder", "<!-- page -->")
    split_by_pages: bool = kwargs.get("split_by_pages", False)
    conversion_method: str = None
    md, ocr_method = self._get_markitdown()

    PDF_CONVERTIBLE_EXT: Set[str] = {"docx", "pptx", "xlsx"}

    # Office files are converted to PDF first so they can be split by page
    if split_by_pages and ext in PDF_CONVERTIBLE_EXT:
        file_path = self._convert_to_pdf(file_path)

    # Process text
    if split_by_pages:
        markdown_text = self._pdf_file_per_page_to_markdown(
            file_path=file_path,
            md=md,
            prompt=prompt,
            page_placeholder=page_placeholder,
        )
        conversion_method = "markdown"
    elif self.model is not None:
        markdown_text = self._pdf_pages_to_markdown(
            file_path=file_path,
            md=md,
            prompt=prompt,
            page_placeholder=page_placeholder,
        )
        conversion_method = "markdown"
    else:
        markdown_text = md.convert(file_path, llm_prompt=prompt).text_content
        conversion_method = "json" if ext == "json" else "markdown"

    page_placeholder_value = (
        page_placeholder
        if page_placeholder and page_placeholder in markdown_text
        else None
    )

    # Return output
    return ReaderOutput(
        text=markdown_text,
        document_name=os.path.basename(file_path),
        document_path=file_path,
        document_id=kwargs.get("document_id", str(uuid.uuid4())),
        conversion_method=conversion_method,
        reader_method="markitdown",
        ocr_method=ocr_method,
        page_placeholder=page_placeholder_value,
        metadata=kwargs.get("metadata", {}),
    )
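
The page-splitting path relies on temporary files that are removed in a `finally` block. The sketch below shows the same cleanup pattern in isolation, using raw byte chunks instead of PDF pages; `process_in_temp_files` and `file_size_label` are hypothetical names, and file creation is placed inside the `try` here as a design variant so that a failure while writing also triggers cleanup.

```python
import os
import tempfile
from typing import Callable, List


def process_in_temp_files(chunks: List[bytes], convert: Callable[[str], str]) -> str:
    """Write each chunk to a temporary file, convert it, and always clean up."""
    temp_paths: List[str] = []
    try:
        for chunk in chunks:
            # delete=False keeps the file on disk after the handle is closed.
            with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
                tmp.write(chunk)
                temp_paths.append(tmp.name)
        return "\n".join(convert(path) for path in temp_paths)
    finally:
        # Runs on success and on error, so no temporary file is left behind.
        for path in temp_paths:
            os.remove(path)


def file_size_label(path: str) -> str:
    """Toy converter: report the size of the file in bytes."""
    return f"{os.path.getsize(path)} bytes"


print(process_in_temp_files([b"abc", b"hello"], file_size_label))
# 3 bytes
# 5 bytes
```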