Reader

Introduction

The Reader component reads files in many different formats and extensions and returns their content in a uniform way. All readers share the same parent class, BaseReader.

Which Reader should I use for my project?

Each Reader component extracts document text in different ways. Therefore, choosing the most suitable Reader component depends on your use case.

  • If you want to preserve the original structure as much as possible, without any kind of markdown parsing, use the VanillaReader class.
  • If your documents contain many tables or visual components (such as images), we strongly recommend DoclingReader.
  • If you want to maximize efficiency or simplify conversion to markdown, we recommend the MarkItDownReader component.

Note

Remember to visit the official repositories and guides for the last two reader classes, DoclingReader and MarkItDownReader.

The following table lists the file formats that each Reader class supports:

| Reader | Unstructured files & PDFs | MS Office suite files | Tabular data | Files with hierarchical schema | Image files | Markdown conversion |
| --- | --- | --- | --- | --- | --- | --- |
| VanillaReader | txt, md, pdf | xlsx, docx, pptx | csv, tsv, parquet | json, yaml, html, xml | jpg, png, webp, gif | Yes |
| MarkItDownReader | txt, md, pdf | docx, xlsx, pptx | csv, tsv | json, html, xml | jpg, png, jpeg | Yes |
| DoclingReader | txt, md, pdf | docx, xlsx, pptx | - | html, xhtml | png, jpeg, tiff, bmp, webp | Yes |
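
The compatibility table can be turned into a small lookup that suggests candidate readers for a file. The sketch below is only an illustration: the dictionary layout and the helper name `readers_for` are assumptions and are not part of the library, and the extension sets are transcribed by hand from the table.

```python
import os

# Extension sets transcribed from the compatibility table.
# The dictionary and the helper below are illustrative, not library API.
SUPPORTED_EXTENSIONS = {
    "VanillaReader": {
        "txt", "md", "pdf", "xlsx", "docx", "pptx", "csv", "tsv", "parquet",
        "json", "yaml", "html", "xml", "jpg", "png", "webp", "gif",
    },
    "MarkItDownReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "csv", "tsv",
        "json", "html", "xml", "jpg", "png",
        "jpeg",  # assumption: the third image entry in the table is read as "jpeg"
    },
    "DoclingReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "html", "xhtml",
        "png", "jpeg", "tiff", "bmp", "webp",
    },
}


def readers_for(filename: str) -> list[str]:
    """Return the reader names (sorted) whose table row lists the file extension."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    return sorted(name for name, exts in SUPPORTED_EXTENSIONS.items() if ext in exts)


print(readers_for("report.parquet"))
print(readers_for("scan.TIFF"))
```

A lookup like this only narrows the choice by format; the guidance above (structure preservation, tables and images, conversion speed) still decides between readers that support the same extension.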

Output format

Bases: BaseModel

Pydantic model defining the output structure for all readers.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text | Optional[str] | The textual content extracted by the reader. |
| document_name | Optional[str] | The name of the document. |
| document_path | str | The path to the document. |
| document_id | Optional[str] | A unique identifier for the document. |
| conversion_method | Optional[str] | The method used for document conversion. |
| reader_method | Optional[str] | The method used for reading the document. |
| ocr_method | Optional[str] | The OCR method used, if any. |
| page_placeholder | Optional[str] | The placeholder used to identify each page, if used. |
| metadata | Optional[Dict[str, Any]] | Additional metadata associated with the document. |

Source code in src/splitter_mr/schema/models.py
class ReaderOutput(BaseModel):
    """Pydantic model defining the output structure for all readers.

    Attributes:
        text: The textual content extracted by the reader.
        document_name: The name of the document.
        document_path: The path to the document.
        document_id: A unique identifier for the document.
        conversion_method: The method used for document conversion.
        reader_method: The method used for reading the document.
        ocr_method: The OCR method used, if any.
        page_placeholder: The placeholder used to identify each page, if used.
        metadata: Additional metadata associated with the document.
    """

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    page_placeholder: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = Field(default_factory=dict)

    @field_validator("document_id", mode="before")
    def default_document_id(cls, v: str):
        """Generate a default UUID for document_id if not provided.

        Args:
            v (str): The provided document_id value.

        Returns:
            document_id (str): The provided document_id or a newly generated UUID string.
        """
        document_id = v or str(uuid.uuid4())
        return document_id

    def from_variable(
        self, variable: Union[str, Dict[str, Any]], variable_name: str
    ) -> "ReaderOutput":
        """
        Generate a new ReaderOutput object from a variable (str or dict).

        Args:
            variable (Union[str, Dict[str, Any]]): The variable to use as text.
            variable_name (str): The name for document_name.

        Returns:
            ReaderOutput: The new ReaderOutput object.
        """
        if isinstance(variable, dict):
            text = json.dumps(variable, ensure_ascii=False, indent=2)
            conversion_method = "json"
            metadata = {"details": "Generated from a json variable"}
        elif isinstance(variable, str):
            text = variable
            conversion_method = "txt"
            metadata = {"details": "Generated from a str variable"}
        else:
            raise ValueError("Variable must be either a string or a dictionary.")

        return ReaderOutput(
            text=text,
            document_name=variable_name,
            document_path="",
            conversion_method=conversion_method,
            reader_method="vanilla",
            ocr_method=None,
            page_placeholder=None,
            metadata=metadata,
        )

    def append_metadata(self, metadata: Dict[str, Any]) -> None:
        """
        Append (update) the metadata dictionary with new key-value pairs.

        Args:
            metadata (Dict[str, Any]): The metadata to add or update.
        """
        if self.metadata is None:
            self.metadata = {}
        self.metadata.update(metadata)
default_document_id(v)

Generate a default UUID for document_id if not provided.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| v | str | The provided document_id value. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| document_id | str | The provided document_id or a newly generated UUID string. |
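
The fallback itself is a single `or` expression, so any falsy input (not only `None`) is replaced by a fresh UUID. The standalone function below mirrors that expression outside Pydantic; it is a sketch for illustration and says nothing about when Pydantic actually invokes the validator.

```python
import uuid


def default_document_id(v):
    # Same expression as the validator body: falsy values (None, "") fall back to a UUID4 string.
    return v or str(uuid.uuid4())


explicit = default_document_id("doc-001")  # a provided id is kept unchanged
generated = default_document_id(None)      # None is replaced by a new UUID string
from_empty = default_document_id("")       # an empty string is replaced as well

print(explicit)
print(uuid.UUID(generated).version)
```

Because an empty string is also replaced, callers that want a stable identifier should pass a non-empty value explicitly.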

Source code in src/splitter_mr/schema/models.py
@field_validator("document_id", mode="before")
def default_document_id(cls, v: str):
    """Generate a default UUID for document_id if not provided.

    Args:
        v (str): The provided document_id value.

    Returns:
        document_id (str): The provided document_id or a newly generated UUID string.
    """
    document_id = v or str(uuid.uuid4())
    return document_id
from_variable(variable, variable_name)

Generate a new ReaderOutput object from a variable (str or dict).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variable | Union[str, Dict[str, Any]] | The variable to use as text. | required |
| variable_name | str | The name for document_name. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ReaderOutput | ReaderOutput | The new ReaderOutput object. |
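
To see what text a dictionary or a string produces, the conversion branch can be tried in isolation. The helper `variable_to_text` below is a hypothetical stand-in that copies only the branching and the `json.dumps` arguments from the source listing; it does not build a ReaderOutput and is not part of the library.

```python
import json
from typing import Any, Dict, Tuple, Union


def variable_to_text(variable: Union[str, Dict[str, Any]]) -> Tuple[str, str]:
    """Return (text, conversion_method) using the same branching as from_variable."""
    if isinstance(variable, dict):
        # Dictionaries are serialized as indented JSON; non-ASCII characters are kept as-is.
        return json.dumps(variable, ensure_ascii=False, indent=2), "json"
    if isinstance(variable, str):
        # Strings are passed through unchanged.
        return variable, "txt"
    raise ValueError("Variable must be either a string or a dictionary.")


text, method = variable_to_text({"title": "Café", "pages": 3})
print(method)
print(text)
```

Note that in the source listing `from_variable` is an instance method, so it is called on an existing ReaderOutput object even though it returns a new one.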

Source code in src/splitter_mr/schema/models.py
def from_variable(
    self, variable: Union[str, Dict[str, Any]], variable_name: str
) -> "ReaderOutput":
    """
    Generate a new ReaderOutput object from a variable (str or dict).

    Args:
        variable (Union[str, Dict[str, Any]]): The variable to use as text.
        variable_name (str): The name for document_name.

    Returns:
        ReaderOutput: The new ReaderOutput object.
    """
    if isinstance(variable, dict):
        text = json.dumps(variable, ensure_ascii=False, indent=2)
        conversion_method = "json"
        metadata = {"details": "Generated from a json variable"}
    elif isinstance(variable, str):
        text = variable
        conversion_method = "txt"
        metadata = {"details": "Generated from a str variable"}
    else:
        raise ValueError("Variable must be either a string or a dictionary.")

    return ReaderOutput(
        text=text,
        document_name=variable_name,
        document_path="",
        conversion_method=conversion_method,
        reader_method="vanilla",
        ocr_method=None,
        page_placeholder=None,
        metadata=metadata,
    )
append_metadata(metadata)

Append (update) the metadata dictionary with new key-value pairs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| metadata | Dict[str, Any] | The metadata to add or update. | required |
Source code in src/splitter_mr/schema/models.py
def append_metadata(self, metadata: Dict[str, Any]) -> None:
    """
    Append (update) the metadata dictionary with new key-value pairs.

    Args:
        metadata (Dict[str, Any]): The metadata to add or update.
    """
    if self.metadata is None:
        self.metadata = {}
    self.metadata.update(metadata)

Readers

BaseReader

BaseReader

Bases: ABC

Abstract base class for all document readers.

This interface defines the contract for file readers that process documents and return a standardized ReaderOutput object containing the extracted text and document-level metadata. Subclasses must implement the read method to handle specific file formats or reading strategies.

Methods:

| Name | Description |
| --- | --- |
| read | Reads the input and returns a ReaderOutput with text and metadata. |
| is_valid_file_path | Check if a path points to an existing file. |
| is_url | Check if the string provided is a URL. |
| parse_json | Try to parse a JSON object when a dictionary or string is provided. |

Source code in src/splitter_mr/reader/base_reader.py
class BaseReader(ABC):
    """
    Abstract base class for all document readers.

    This interface defines the contract for file readers that process documents and return
    a standardized ReaderOutput object containing the extracted text and document-level
    metadata. Subclasses must implement the `read` method to handle specific file formats
    or reading strategies.

    Methods:
        read: Reads the input and returns a ReaderOutput with text and metadata.
        is_valid_file_path: Check if a path points to an existing file.
        is_url: Check if the string provided is a URL.
        parse_json: Try to parse a JSON object when a dictionary or string is provided.
    """

    @staticmethod
    def is_valid_file_path(path: str) -> bool:
        """
        Checks if the provided string is a valid file path.

        Args:
            path (str): The string to check.

        Returns:
            bool: True if the string is a valid file path to an existing file, False otherwise.

        Example:
            ```python
            BaseReader.is_valid_file_path("/tmp/myfile.txt")
            ```
            ```bash
            True
            ```
        """
        return os.path.isfile(path)

    @staticmethod
    def is_url(string: str) -> bool:
        """
        Determines whether the given string is a valid HTTP or HTTPS URL.

        Args:
            string (str): The string to check.

        Returns:
            bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

        Example:
            ```python
            BaseReader.is_url("https://example.com")
            ```
            ```bash
            True
            ```
            ```python
            BaseReader.is_url("not_a_url")
            ```
            ```bash
            False
            ```
        """
        try:
            result = urlparse(string)
            return all([result.scheme in ("http", "https"), result.netloc])
        except Exception:
            return False

    @staticmethod
    def parse_json(obj: Union[dict, str]) -> dict:
        """
        Attempts to parse the provided object as JSON.

        Args:
            obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
                If a string, attempts to parse it as a JSON string.

        Returns:
            dict: The parsed JSON object.

        Raises:
            ValueError: If a string is provided that cannot be parsed as valid JSON.
            TypeError: If the provided object is neither a dict nor a string.

        Example:
            ```python
            BaseReader.parse_json('{"a": 1}')
            ```
            ```python
            {'a': 1}
            ```
            ```python
            BaseReader.parse_json({'b': 2})
            ```
            ```python
            {'b': 2}
            ```
            ```python
            BaseReader.parse_json('[not valid json]')
            ```
            ```python
            ValueError: String could not be parsed as JSON: ...
            ```
        """
        if isinstance(obj, dict):
            return obj
        if isinstance(obj, str):
            try:
                return json.loads(obj)
            except Exception as e:
                raise ValueError(f"String could not be parsed as JSON: {e}")
        raise TypeError("Provided object is not a string or dictionary")

    @abstractmethod
    def read(
        self, file_path: str, model: Optional[BaseModel] = None, **kwargs: Any
    ) -> ReaderOutput:
        """
        Reads input and returns a ReaderOutput with text content and standardized metadata.

        Args:
            file_path (str): Path to the input file, a URL, raw string, or dictionary.
            model (Optional[BaseModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
            **kwargs: Additional keyword arguments for implementation-specific options.

        Returns:
            ReaderOutput: Pydantic model defining the output structure for all readers.

        Raises:
            ValueError: If the provided string is not a valid file path, URL, or parsable content.
            TypeError: If input type is unsupported.

        Example:
            ```python
            class MyReader(BaseReader):
                def read(self, file_path: str, **kwargs) -> ReaderOutput:
                    return ReaderOutput(
                        text="example",
                        document_name="example.txt",
                        document_path=file_path,
                        document_id=kwargs.get("document_id"),
                        conversion_method="custom",
                        ocr_method=None,
                        metadata={}
                    )
            ```
        """
is_valid_file_path(path) staticmethod

Checks if the provided string is a valid file path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The string to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the string is a valid file path to an existing file, False otherwise. |

Example

BaseReader.is_valid_file_path("/tmp/myfile.txt")
True
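
Since the check is a plain `os.path.isfile` call, it returns True only for an existing regular file. Directories and missing paths both give False. The sketch below reproduces the one-line body as a standalone function and exercises it against a temporary directory; the file names are made up for the example.

```python
import os
import tempfile


def is_valid_file_path(path: str) -> bool:
    # Same body as the static method: True only for an existing regular file.
    return os.path.isfile(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "myfile.txt")
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write("hello")

    results = {
        "existing file": is_valid_file_path(file_path),
        "directory": is_valid_file_path(tmp_dir),
        "missing file": is_valid_file_path(os.path.join(tmp_dir, "absent.txt")),
    }

print(results)
```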

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_valid_file_path(path: str) -> bool:
    """
    Checks if the provided string is a valid file path.

    Args:
        path (str): The string to check.

    Returns:
        bool: True if the string is a valid file path to an existing file, False otherwise.

    Example:
        ```python
        BaseReader.is_valid_file_path("/tmp/myfile.txt")
        ```
        ```bash
        True
        ```
    """
    return os.path.isfile(path)
is_url(string) staticmethod

Determines whether the given string is a valid HTTP or HTTPS URL.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | The string to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise. |

Example

BaseReader.is_url("https://example.com")
True
BaseReader.is_url("not_a_url")
False
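
The check requires both an `http`/`https` scheme and a non-empty network location, so a few inputs that look URL-like are still rejected. The standalone function below copies the method body from the source listing so the edge cases can be tried without the library; the sample strings are illustrative.

```python
from urllib.parse import urlparse


def is_url(string: str) -> bool:
    # Same logic as the static method: http/https scheme AND a non-empty netloc.
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False


samples = [
    "https://example.com/file.pdf",  # accepted: https scheme and a host
    "ftp://example.com/file.pdf",    # rejected: scheme is not http/https
    "https://",                      # rejected: no network location
    "example.com/file.pdf",          # rejected: no scheme
]
checks = {sample: is_url(sample) for sample in samples}
print(checks)
```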

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_url(string: str) -> bool:
    """
    Determines whether the given string is a valid HTTP or HTTPS URL.

    Args:
        string (str): The string to check.

    Returns:
        bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

    Example:
        ```python
        BaseReader.is_url("https://example.com")
        ```
        ```bash
        True
        ```
        ```python
        BaseReader.is_url("not_a_url")
        ```
        ```bash
        False
        ```
    """
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False
parse_json(obj) staticmethod

Attempts to parse the provided object as JSON.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| obj | Union[dict, str] | The object to parse. If a dict, returns it as-is. If a string, attempts to parse it as a JSON string. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The parsed JSON object. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If a string is provided that cannot be parsed as valid JSON. |
| TypeError | If the provided object is neither a dict nor a string. |

Example

BaseReader.parse_json('{"a": 1}')
{'a': 1}
BaseReader.parse_json({'b': 2})
{'b': 2}
BaseReader.parse_json('[not valid json]')
ValueError: String could not be parsed as JSON: ...
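
One detail worth knowing is that the string branch returns whatever `json.loads` produces, so a valid JSON array comes back as a list even though the annotation says `dict`. The standalone copy of the method body below, taken from the source listing, makes this easy to try; it is a sketch for experimentation rather than a replacement for the method.

```python
import json
from typing import Union


def parse_json(obj: Union[dict, str]) -> dict:
    # Same logic as the static method in the source listing.
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")


as_dict = parse_json('{"a": 1}')   # JSON object -> dict
as_list = parse_json("[1, 2, 3]")  # valid JSON array -> list, not dict

print(type(as_dict).__name__, type(as_list).__name__)
```

Callers that strictly need a dictionary may therefore want an extra `isinstance(result, dict)` check after parsing.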

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def parse_json(obj: Union[dict, str]) -> dict:
    """
    Attempts to parse the provided object as JSON.

    Args:
        obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
            If a string, attempts to parse it as a JSON string.

    Returns:
        dict: The parsed JSON object.

    Raises:
        ValueError: If a string is provided that cannot be parsed as valid JSON.
        TypeError: If the provided object is neither a dict nor a string.

    Example:
        ```python
        BaseReader.parse_json('{"a": 1}')
        ```
        ```python
        {'a': 1}
        ```
        ```python
        BaseReader.parse_json({'b': 2})
        ```
        ```python
        {'b': 2}
        ```
        ```python
        BaseReader.parse_json('[not valid json]')
        ```
        ```python
        ValueError: String could not be parsed as JSON: ...
        ```
    """
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")
read(file_path, model=None, **kwargs) abstractmethod

Reads input and returns a ReaderOutput with text content and standardized metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | str | Path to the input file, a URL, raw string, or dictionary. | required |
| model | Optional[BaseModel] | Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content. | None |
| **kwargs | Any | Additional keyword arguments for implementation-specific options. | {} |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ReaderOutput | ReaderOutput | Pydantic model defining the output structure for all readers. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the provided string is not a valid file path, URL, or parsable content. |
| TypeError | If input type is unsupported. |

Example
class MyReader(BaseReader):
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(
            text="example",
            document_name="example.txt",
            document_path=file_path,
            document_id=kwargs.get("document_id"),
            conversion_method="custom",
            ocr_method=None,
            metadata={}
        )
Source code in src/splitter_mr/reader/base_reader.py
@abstractmethod
def read(
    self, file_path: str, model: Optional[BaseModel] = None, **kwargs: Any
) -> ReaderOutput:
    """
    Reads input and returns a ReaderOutput with text content and standardized metadata.

    Args:
        file_path (str): Path to the input file, a URL, raw string, or dictionary.
        model (Optional[BaseModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
        **kwargs: Additional keyword arguments for implementation-specific options.

    Returns:
        ReaderOutput: Pydantic model defining the output structure for all readers.

    Raises:
        ValueError: If the provided string is not a valid file path, URL, or parsable content.
        TypeError: If input type is unsupported.

    Example:
        ```python
        class MyReader(BaseReader):
            def read(self, file_path: str, **kwargs) -> ReaderOutput:
                return ReaderOutput(
                    text="example",
                    document_name="example.txt",
                    document_path=file_path,
                    document_id=kwargs.get("document_id"),
                    conversion_method="custom",
                    ocr_method=None,
                    metadata={}
                )
        ```
    """

📚 Note: file examples are extracted from the data folder in the GitHub repository: link.

VanillaReader

VanillaReader

Bases: BaseReader

Read multiple file types using Python's built-in and standard libraries. Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images, with options to show or omit images, and to annotate images using a vision model.
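
The `read` method of this class accepts several input sources and documents a fixed priority among them: `file_path` > `file_url` > `json_document` > `text_document`, with `kwargs['file_path']` overriding the positional argument. The sketch below shows one way such a priority could be resolved. The helper `pick_source` is hypothetical; the library's own `_guess_source` function is not reproduced on this page and may behave differently (for example, it also auto-detects what a bare `file_path` string contains).

```python
from typing import Any, Dict, Optional, Tuple

# Documented priority order of the source keyword arguments.
SOURCE_PRIORITY = ("file_path", "file_url", "json_document", "text_document")


def pick_source(kwargs: Dict[str, Any], positional: Optional[Any] = None) -> Tuple[str, Any]:
    """Hypothetical helper: return (source_type, value) for the highest-priority source."""
    candidates = dict(kwargs)
    # A file_path given in kwargs wins over the positional argument.
    if candidates.get("file_path") is None:
        candidates["file_path"] = positional
    for key in SOURCE_PRIORITY:
        if candidates.get(key) is not None:
            return key, candidates[key]
    raise ValueError("No document source was provided.")


print(pick_source({"file_url": "https://example.com/a.pdf", "text_document": "hi"}))
print(pick_source({"file_url": "https://example.com/a.pdf"}, positional="doc.pdf"))
```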

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class VanillaReader(BaseReader):
    """
    Read multiple file types using Python's built-in and standard libraries.
    Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

    For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images,
    with options to show or omit images, and to annotate images using a vision model.
    """

    def __init__(self, model: Optional[BaseModel] = None):
        super().__init__()
        self.model = model
        self.pdf_reader = PDFPlumberReader()

    def read(
        self,
        file_path: str | Path | None = None,
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Read a document from various sources and return standardized output.

        This method supports:
        - Local file paths (``file_path`` or positional arg)
        - URLs (``file_url``)
        - JSON/dict objects (``json_document``)
        - Raw text strings (``text_document``)

        If multiple sources are provided, the priority is:
        ``file_path`` > ``file_url`` > ``json_document`` > ``text_document``.
        If only ``file_path`` is provided, auto-detects whether it is a path, URL,
        JSON, YAML, or plain text.

        Args:
            file_path (str | Path): Path to the input file (overridden by
                ``kwargs['file_path']`` if present).
            **kwargs: Optional arguments that adjust behavior:

                Source selection:
                    file_path (str): Path to the input file (overrides positional arg).
                    file_url (str): HTTPS/HTTP URL to read from.
                    json_document (dict | str): JSON-like document (dict or JSON string).
                    text_document (str): Raw text content.

                Identification/metadata:
                    document_id (str): Explicit document id. Defaults to a new UUID.
                    metadata (dict): Additional metadata to attach to the output.

                PDF extraction:
                    scan_pdf_pages (bool): If True, rasterize and describe pages using a
                        vision model (VLM). If False (default), use element-wise extraction.
                    model (BaseModel): Vision-capable model used for scanned PDFs and/or
                        image captioning (also used for image files).
                    prompt (str): Prompt for image captioning / page description. Defaults to
                        ``DEFAULT_IMAGE_CAPTION_PROMPT`` for element-wise PDFs and
                        ``DEFAULT_IMAGE_EXTRACTION_PROMPT`` for scanned PDFs/images.
                    resolution (int): DPI when rasterizing pages for VLM. Default: 300.
                    show_base64_images (bool): Include base64-embedded images in PDF output.
                        Default: False.
                    image_placeholder (str): Placeholder for omitted images in PDFs.
                        Default: ``"<!-- image -->"``.
                    page_placeholder (str): Placeholder inserted between PDF pages (only
                        surfaced when scanning or when the placeholder occurs in text).
                        Default: ``"<!-- page -->"``.
                    vlm_parameters (dict): Extra keyword args forwarded to
                        ``model.extract_text(...)``.

                Excel / Parquet reading:
                    as_table (bool): For Excel (``.xlsx``/``.xls``), if True read as a table
                        using pandas and return CSV text. If False (default), convert to PDF
                        and run the PDF pipeline.
                    excel_engine (str): pandas Excel engine. Default: ``"openpyxl"``.
                    parquet_engine (str): pandas Parquet engine (e.g. ``"pyarrow"``,
                        ``"fastparquet"``). Default: pandas auto-selection.

        Returns:
            ReaderOutput: Unified result containing text, metadata, and extraction info.

        Raises:
            ValueError: If the source is invalid/unsupported, or if a VLM is required
                but not provided.
            TypeError: If provided arguments are of unsupported types.

        Notes:
            - PDF extraction now supports image captioning/omission indicators.
            - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

        Example:
            ```python
            from splitter_mr.readers import VanillaReader
            from splitter_mr.models import AzureOpenAIVisionModel

            model = AzureOpenAIVisionModel()
            reader = VanillaReader(model=model)
            output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
            print(output.text)
            ```
            ```bash
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """

        source_type, source_val = _guess_source(kwargs, file_path)
        name, path, text, conv, ocr = self._dispatch_source(
            source_type, source_val, kwargs
        )

        page_ph: str = kwargs.get("page_placeholder", "<!-- page -->")
        page_ph_out = self._surface_page_placeholder(
            scan=bool(kwargs.get("scan_pdf_pages")),
            placeholder=page_ph,
            text=text,
        )

        return ReaderOutput(
            text=_ensure_str(text),
            document_name=name,
            document_path=path or "",
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method=conv,
            reader_method="vanilla",
            ocr_method=ocr,
            page_placeholder=page_ph_out,
            metadata=kwargs.get("metadata", {}),
        )

    def _dispatch_source(  # noqa: WPS231
        self,
        src_type: str,
        src_val: Any,
        kw: Dict[str, Any],
    ) -> Tuple[str, Optional[str], Any, str, Optional[str]]:
        """
        Route the request to a specialised handler and return
        (document_name, document_path, text/content, conversion_method, ocr_method)
        """
        handlers = {
            "file_path": self._handle_local_path,
            "file_url": self._handle_url,
            "json_document": self._handle_explicit_json,
            "text_document": self._handle_explicit_text,
        }
        if src_type not in handlers:
            raise ValueError(f"Unrecognized document source: {src_type}")
        return handlers[src_type](src_val, kw)

    # ---- individual strategies below – each ~20 lines or fewer ---------- #

    # 1) Local / drive paths
    def _handle_local_path(
        self,
        path_like: str | Path,
        kw: Dict[str, Any],
    ) -> Tuple[str, str, Any, str, Optional[str]]:
        """Load from the filesystem (or, if it ‘looks like’ one, via HTTP)."""
        path_str = os.fspath(path_like) if isinstance(path_like, Path) else path_like
        if not isinstance(path_str, str):
            raise ValueError("file_path must be a string or Path object.")

        if not self.is_valid_file_path(path_str):
            if self.is_url(path_str):
                return self._handle_url(path_str, kw)
            return self._handle_fallback(path_str, kw)

        ext = os.path.splitext(path_str)[1].lower().lstrip(".")
        doc_name = os.path.basename(path_str)
        rel_path = os.path.relpath(path_str)

        # ---- type‑specific branches ---- #
        # TODO: Refactor to sort the code and make it more readable
        if ext == "pdf":
            return (
                doc_name,
                rel_path,
                *self._process_pdf(path_str, kw),
            )
        if ext in ("json", "html", "txt", "xml", "csv", "tsv", "md", "markdown"):
            return doc_name, rel_path, _read_text_file(path_str, ext), ext, None
        if ext == "parquet":
            parquet_engine = kw.get(
                "parquet_engine"
            )  # e.g., "pyarrow" or "fastparquet"
            return (
                doc_name,
                rel_path,
                _read_parquet(path_str, engine=parquet_engine),
                "csv",
                None,
            )
        if ext in ("yaml", "yml"):
            return doc_name, rel_path, _read_text_file(path_str, ext), "json", None
        if ext in ("xlsx", "xls"):
            # When as_table=True, pass excel_engine
            if kw.get("as_table", False):
                excel_engine = kw.get("excel_engine", "openpyxl")
                return (
                    doc_name,
                    rel_path,
                    _read_excel(path_str, engine=excel_engine),
                    ext,
                    None,
                )
            # Otherwise convert workbook to PDF and reuse the PDF extractor
            pdf_path = self._convert_office_to_pdf(path_str)
            return (
                os.path.basename(pdf_path),
                os.path.relpath(pdf_path),
                *self._process_pdf(pdf_path, kw),
            )
        if ext in ("docx", "pptx"):
            pdf_path = self._convert_office_to_pdf(path_str)
            return (
                os.path.basename(pdf_path),
                os.path.relpath(pdf_path),
                *self._process_pdf(pdf_path, kw),
            )
        if ext in SUPPORTED_VANILLA_IMAGE_EXTENSIONS:
            model = kw.get("model", self.model)
            prompt = kw.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
            vlm_parameters = kw.get("vlm_parameters", {})
            return self._handle_image_to_llm(
                model, path_str, prompt=prompt, vlm_parameters=vlm_parameters
            )
        if ext in SUPPORTED_PROGRAMMING_LANGUAGES:
            return doc_name, rel_path, _read_text_file(path_str, ext), "txt", None

        raise ValueError(f"Unsupported file extension: {ext}. Use another Reader.")

    # 2) Remote URL
    def _handle_url(
        self,
        url: str,
        kw: Dict[str, Any],
    ) -> Tuple[str, str, Any, str, Optional[str]]:  # noqa: D401
        """Fetch via HTTP(S)."""
        if not isinstance(url, str) or not self.is_url(url):
            raise ValueError("file_url must be a valid URL string.")
        content, conv = _load_via_requests(url)
        name = url.split("/")[-1] or "downloaded_file"
        return name, url, content, conv, None

    # 3) Explicit JSON (dict or str)
    def _handle_explicit_json(
        self,
        json_doc: Any,
        _kw: Dict[str, Any],
    ) -> Tuple[Optional[str], None, Any, str, None]:
        """JSON passed straight in."""
        return (
            _kw.get("document_name", None),
            None,
            self.parse_json(json_doc),
            "json",
            None,
        )

    # 4) Explicit raw text
    def _handle_explicit_text(
        self,
        txt: str,
        _kw: Dict[str, Any],
    ) -> Tuple[Optional[str], None, Any, str, None]:  # noqa: D401
        """Text (maybe JSON / YAML) passed straight in."""
        for parser, conv in ((self.parse_json, "json"), (yaml.safe_load, "json")):
            try:
                parsed = parser(txt)
                if isinstance(parsed, (dict, list)):
                    return _kw.get("document_name", None), None, parsed, conv, None
            except Exception:  # pragma: no cover
                pass
        return _kw.get("document_name", None), None, txt, "txt", None

    # ----- shared utilities ------------------------------------------------ #

    def _process_pdf(
        self,
        path: str,
        kw: Dict[str, Any],
    ) -> Tuple[Any, str, Optional[str]]:
        """
        Process a PDF file and extract content.

        This method supports two modes:
        - Scanned PDF pages using a vision-capable model (image-based extraction).
        - Element-wise text and image extraction using PDFPlumber.

        Args:
            path (str): The path to the PDF file.
            kw (dict): Keyword arguments controlling extraction behavior. Recognized keys include:
                scan_pdf_pages (bool): If True, process the PDF as scanned images.
                model (BaseModel, optional): Vision-capable model for scanned PDFs or image captioning.
                prompt (str, optional): Prompt for image captioning.
                show_base64_images (bool): Whether to include base64 images in the output.
                image_placeholder (str): Placeholder for omitted images.
                page_placeholder (str): Placeholder for page breaks.

        Returns:
            tuple: A tuple of:
                - content (Any): Extracted text/content from the PDF.
                - conv (str): Conversion method used (e.g., "pdf", "png").
                - ocr_method (str or None): OCR model name if applicable.

        Raises:
            ValueError: If `scan_pdf_pages` is True but no vision-capable model is provided.
        """
        if kw.get("scan_pdf_pages"):
            model = kw.get("model", self.model)
            if model is None:
                raise ValueError("scan_pdf_pages=True requires a vision‑capable model.")
            joined = self._scan_pdf_pages(path, model=model, **kw)
            return joined, "png", model.model_name
        # element‑wise extraction
        content = self.pdf_reader.read(
            path,
            model=kw.get("model", self.model),
            prompt=kw.get("prompt") or DEFAULT_IMAGE_CAPTION_PROMPT,
            show_base64_images=kw.get("show_base64_images", False),
            image_placeholder=kw.get("image_placeholder", "<!-- image -->"),
            page_placeholder=kw.get("page_placeholder", "<!-- page -->"),
        )
        ocr_name = (
            (kw.get("model") or self.model).model_name
            if kw.get("model") or self.model
            else None
        )
        return content, "pdf", ocr_name

    def _scan_pdf_pages(self, file_path: str, model: BaseModel, **kw) -> str:
        """
        Describe each page of a PDF using a vision model.

        Args:
            file_path (str): The path to the PDF file.
            model (BaseModel): Vision-capable model used for page description.
            **kw: Additional keyword arguments. Recognized keys include:
                prompt (str, optional): Prompt for describing PDF pages.
                resolution (int): DPI resolution for rasterizing pages (default: 300).
                vlm_parameters (dict): Extra parameters for the vision model.

        Returns:
            str: A string containing page descriptions separated by page placeholders.
        """
        page_ph = kw.get("page_placeholder", "<!-- page -->")
        pages = self.pdf_reader.describe_pages(
            file_path=file_path,
            model=model,
            prompt=kw.get("prompt") or DEFAULT_IMAGE_EXTRACTION_PROMPT,
            resolution=kw.get("resolution", 300),
            **kw.get("vlm_parameters", {}),
        )
        return "\n\n---\n\n".join(f"{page_ph}\n\n{md}" for md in pages)

    def _handle_fallback(self, raw: str, kw: Dict[str, Any]):
        """
        Handle unsupported or unknown sources.

        Attempts to parse the input as JSON, then as text.
        Falls back to returning the raw content as plain text.

        Args:
            raw (str): Raw string content to be processed.
            kw (dict): Additional keyword arguments, may include:
                document_name (str): Optional name of the document.

        Returns:
            tuple: A tuple of:
                - document_name (str or None)
                - document_path (None)
                - content (Any): Parsed or raw content
                - conversion_method (str)
                - ocr_method (None)
        """
        try:
            return self._handle_explicit_json(raw, kw)
        except Exception:
            try:
                return self._handle_explicit_text(raw, kw)
            except Exception:  # pragma: no cover
                return kw.get("document_name", None), None, raw, "txt", None

    def _handle_image_to_llm(
        self,
        model: BaseModel,
        file_path: str,
        prompt: Optional[str] = None,
        vlm_parameters: Optional[dict] = None,
    ) -> Tuple[str, str, Any, str, str]:
        """
        Extract content from an image file using a vision model.

        Reads the image, encodes it in base64, and sends it to the given vision model
        with the provided prompt.

        Args:
            model (BaseModel): Vision-capable model to process the image.
            file_path (str): Path to the image file.
            prompt (str, optional): Prompt for guiding the vision model.
            vlm_parameters (dict, optional): Additional parameters for the vision model.

        Returns:
            tuple: A tuple of:
                - document_name (str)
                - document_path (str)
                - extracted (Any): Extracted content from the image.
                - conversion_method (str): Always "image".
                - ocr_method (str): Model name.

        Raises:
            ValueError: If no vision model is provided.
        """
        if model is None:
            raise ValueError("No vision model provided for image extraction.")
        # Read image as bytes and encode as base64
        with open(file_path, "rb") as f:
            img_bytes = f.read()
        ext = os.path.splitext(file_path)[1].lstrip(".").lower()
        img_b64 = base64.b64encode(img_bytes).decode("utf-8")
        prompt = prompt or DEFAULT_IMAGE_EXTRACTION_PROMPT
        vlm_parameters = vlm_parameters or {}
        extracted = model.extract_text(
            img_b64, prompt=prompt, file_ext=ext, **vlm_parameters
        )
        doc_name = os.path.basename(file_path)
        rel_path = os.path.relpath(file_path)
        return doc_name, rel_path, extracted, "image", model.model_name

    @staticmethod
    def _surface_page_placeholder(
        scan: bool, placeholder: str, text: Any
    ) -> Optional[str]:
        """
        Decide whether to expose the page placeholder in output.

        Never exposes placeholders containing '%'. Returns the placeholder if
        scanning mode is enabled or if the placeholder is found in the text.

        Args:
            scan (bool): Whether the document was scanned.
            placeholder (str): Page placeholder string.
            text (Any): Extracted text or content.

        Returns:
            str or None: The placeholder string if it should be exposed, else None.
        """
        if "%" in placeholder:
            return None
        txt = _ensure_str(text)
        return placeholder if (scan or placeholder in txt) else None

    def _convert_office_to_pdf(self, file_path: str) -> str:
        """
        Convert a DOCX/XLSX/PPTX file to PDF using LibreOffice.

        Args:
            file_path: Absolute path to the Office document.

        Returns:
            Path to the generated PDF in a temporary directory.

        Raises:
            RuntimeError: If LibreOffice (``soffice``) is not in *PATH* or the
            conversion fails for any reason.
        """
        if not shutil.which("soffice"):
            raise RuntimeError(
                "LibreOffice/soffice is required for Office‑to‑PDF conversion "
                "but was not found in PATH.  Install LibreOffice or use a "
                "different reader."
            )

        outdir = tempfile.mkdtemp(prefix="vanilla_office2pdf_")
        cmd = [
            "soffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            outdir,
            file_path,
        ]
        proc = subprocess.run(cmd, capture_output=True)
        if proc.returncode != 0:
            raise RuntimeError(
                f"LibreOffice failed converting {file_path} → PDF:\n{proc.stderr.decode()}"
            )

        pdf_name = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
        pdf_path = os.path.join(outdir, pdf_name)
        if not os.path.exists(pdf_path):
            raise RuntimeError(f"Expected PDF not found: {pdf_path}")

        return pdf_path
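
The Office-to-PDF step in `_convert_office_to_pdf` comes down to building a headless `soffice` command and working out where the resulting PDF should appear. The sketch below isolates that logic without running LibreOffice. The helper name `build_soffice_command` and the sample paths are hypothetical and are not part of the library.

```python
import os
from typing import List, Tuple


def build_soffice_command(file_path: str, outdir: str) -> Tuple[List[str], str]:
    """Return the LibreOffice command and the path where the PDF is expected.

    Hypothetical helper: it mirrors the arguments used in the listing above
    but does not run the conversion itself.
    """
    cmd = [
        "soffice",
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        outdir,
        file_path,
    ]
    # The listing expects the output to keep the base name with a ".pdf" suffix.
    pdf_name = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
    return cmd, os.path.join(outdir, pdf_name)


cmd, expected_pdf = build_soffice_command("reports/q3.docx", "out")
print(cmd[-1])       # the input document is the last argument
print(expected_pdf)  # where the converted PDF should be found
```

Keeping the command construction separate from `subprocess.run` makes the path logic easy to check on machines where LibreOffice is not installed.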
read(file_path=None, **kwargs)

Read a document from various sources and return standardized output.

This method supports:

  • Local file paths (file_path or positional arg)
  • URLs (file_url)
  • JSON/dict objects (json_document)
  • Raw text strings (text_document)

If multiple sources are provided, the priority is: file_path > file_url > json_document > text_document. If only file_path is provided, auto-detects whether it is a path, URL, JSON, YAML, or plain text.
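
The priority rule can be pictured with a small stand-alone function. This is only a sketch of the documented order: the library's own `_guess_source` helper is not shown on this page, so the function name `pick_source` and its exact behavior are assumptions.

```python
from typing import Any, Dict, Optional, Tuple

# Priority order stated above: the first source that carries a value wins.
SOURCE_PRIORITY = ("file_path", "file_url", "json_document", "text_document")


def pick_source(kwargs: Dict[str, Any], positional: Optional[Any] = None) -> Tuple[str, Any]:
    """Return (source_type, value) for the highest-priority source provided."""
    candidates = dict(kwargs)
    # The positional argument counts as file_path unless kwargs override it.
    if candidates.get("file_path") is None and positional is not None:
        candidates["file_path"] = positional
    for key in SOURCE_PRIORITY:
        if candidates.get(key) is not None:
            return key, candidates[key]
    raise ValueError("No document source was provided.")


print(pick_source({"file_url": "https://example.com/a.pdf", "text_document": "hi"}))
print(pick_source({"file_url": "https://example.com/a.pdf"}, positional="notes.txt"))
```

In the first call `file_url` wins over `text_document`; in the second the positional path wins over `file_url`.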

Parameters:

file_path (str | Path): Path to the input file (overridden by kwargs['file_path'] if present).
**kwargs: Optional arguments that adjust behavior:

Source selection:
file_path (str): Path to the input file (overrides positional arg).
file_url (str): HTTPS/HTTP URL to read from.
json_document (dict | str): JSON-like document (dict or JSON string).
text_document (str): Raw text content.

Identification/metadata:
document_id (str): Explicit document id. Defaults to a new UUID.
metadata (dict): Additional metadata to attach to the output.

PDF extraction:
scan_pdf_pages (bool): If True, rasterize and describe pages using a
    vision model (VLM). If False (default), use element-wise extraction.
model (BaseModel): Vision-capable model used for scanned PDFs and/or
    image captioning (also used for image files).
prompt (str): Prompt for image captioning / page description. Defaults to
    ``DEFAULT_IMAGE_CAPTION_PROMPT`` for element-wise PDFs and
    ``DEFAULT_IMAGE_EXTRACTION_PROMPT`` for scanned PDFs/images.
resolution (int): DPI when rasterizing pages for VLM. Default: 300.
show_base64_images (bool): Include base64-embedded images in PDF output.
    Default: False.
image_placeholder (str): Placeholder for omitted images in PDFs.
    Default: ``"<!-- image -->"``.
page_placeholder (str): Placeholder inserted between PDF pages (only
    surfaced when scanning or when the placeholder occurs in text).
    Default: ``"<!-- page -->"``.
vlm_parameters (dict): Extra keyword args forwarded to
    ``model.extract_text(...)``.

Excel / Parquet reading:
as_table (bool): For Excel (``.xlsx``/``.xls``), if True read as a table
    using pandas and return CSV text. If False (default), convert to PDF
    and run the PDF pipeline.
excel_engine (str): pandas Excel engine. Default: ``"openpyxl"``.
parquet_engine (str): pandas Parquet engine (e.g. ``"pyarrow"``,
    ``"fastparquet"``). Default: pandas auto-selection.

Returns:

Name Type Description
ReaderOutput ReaderOutput

Unified result containing text, metadata, and extraction info.

Raises:

Type Description
ValueError

If the source is invalid/unsupported, or if a VLM is required but not provided.

TypeError

If provided arguments are of unsupported types.

Notes

  • PDF extraction supports image captioning and placeholders for omitted images.
  • For .parquet files, content is loaded via pandas and returned as CSV-formatted text.

Example

from splitter_mr.readers import VanillaReader
from splitter_mr.models import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel()
reader = VanillaReader(model=model)
output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
print(output.text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...
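
When `scan_pdf_pages=True`, the `_scan_pdf_pages` method in the class source prefixes each page description with the page placeholder and joins the pages with a `---` rule. The sketch below reproduces that join and adds a hypothetical inverse, `split_pages`, to show how downstream code might recover individual pages. `split_pages` is not part of the library.

```python
from typing import List

PAGE_PLACEHOLDER = "<!-- page -->"  # default value documented above


def join_pages(pages: List[str], placeholder: str = PAGE_PLACEHOLDER) -> str:
    """Prefix each page with the placeholder and join, as in _scan_pdf_pages."""
    return "\n\n---\n\n".join(f"{placeholder}\n\n{md}" for md in pages)


def split_pages(text: str, placeholder: str = PAGE_PLACEHOLDER) -> List[str]:
    """Hypothetical inverse: recover the page bodies from the joined text."""
    pages = []
    # The chunk before the first placeholder is empty, so it is skipped.
    for chunk in text.split(placeholder)[1:]:
        body = chunk.strip()
        # Trim the "---" rule that separates consecutive pages.
        # Caveat: a page whose own text ends with "---" would lose it too.
        if body.endswith("---"):
            body = body[:-3].rstrip()
        pages.append(body)
    return pages


joined = join_pages(["Page one", "Page two"])
print(split_pages(joined))
```

Choosing a placeholder that cannot occur in the document text keeps the split unambiguous.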

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
def read(
    self,
    file_path: str | Path | None = None,
    **kwargs: Any,
) -> ReaderOutput:
    """
    Read a document from various sources and return standardized output.

    This method supports:
    - Local file paths (``file_path`` or positional arg)
    - URLs (``file_url``)
    - JSON/dict objects (``json_document``)
    - Raw text strings (``text_document``)

    If multiple sources are provided, the priority is:
    ``file_path`` > ``file_url`` > ``json_document`` > ``text_document``.
    If only ``file_path`` is provided, auto-detects whether it is a path, URL,
    JSON, YAML, or plain text.

    Args:
        file_path (str | Path): Path to the input file (overridden by
            ``kwargs['file_path']`` if present).
        **kwargs: Optional arguments that adjust behavior:

            Source selection:
            file_path (str): Path to the input file (overrides positional arg).
            file_url (str): HTTPS/HTTP URL to read from.
            json_document (dict | str): JSON-like document (dict or JSON string).
            text_document (str): Raw text content.

            Identification/metadata:
            document_id (str): Explicit document id. Defaults to a new UUID.
            metadata (dict): Additional metadata to attach to the output.

            PDF extraction:
            scan_pdf_pages (bool): If True, rasterize and describe pages using a
                vision model (VLM). If False (default), use element-wise extraction.
            model (BaseModel): Vision-capable model used for scanned PDFs and/or
                image captioning (also used for image files).
            prompt (str): Prompt for image captioning / page description. Defaults to
                ``DEFAULT_IMAGE_CAPTION_PROMPT`` for element-wise PDFs and
                ``DEFAULT_IMAGE_EXTRACTION_PROMPT`` for scanned PDFs/images.
            resolution (int): DPI when rasterizing pages for VLM. Default: 300.
            show_base64_images (bool): Include base64-embedded images in PDF output.
                Default: False.
            image_placeholder (str): Placeholder for omitted images in PDFs.
                Default: ``"<!-- image -->"``.
            page_placeholder (str): Placeholder inserted between PDF pages (only
                surfaced when scanning or when the placeholder occurs in text).
                Default: ``"<!-- page -->"``.
            vlm_parameters (dict): Extra keyword args forwarded to
                ``model.extract_text(...)``.

            Excel / Parquet reading:
            as_table (bool): For Excel (``.xlsx``/``.xls``), if True read as a table
                using pandas and return CSV text. If False (default), convert to PDF
                and run the PDF pipeline.
            excel_engine (str): pandas Excel engine. Default: ``"openpyxl"``.
            parquet_engine (str): pandas Parquet engine (e.g. ``"pyarrow"``,
                ``"fastparquet"``). Default: pandas auto-selection.

    Returns:
        ReaderOutput: Unified result containing text, metadata, and extraction info.

    Raises:
        ValueError: If the source is invalid/unsupported, or if a VLM is required
            but not provided.
        TypeError: If provided arguments are of unsupported types.

    Notes:
        - PDF extraction supports image captioning and placeholders for omitted images.
        - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

    Example:
        ```python
        from splitter_mr.readers import VanillaReader
        from splitter_mr.models import AzureOpenAIVisionModel

        model = AzureOpenAIVisionModel()
        reader = VanillaReader(model=model)
        output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
        print(output.text)
        ```
        ```text
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
        rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
        Pellentesque ex felis, cursus ege...
        ```
    """

    source_type, source_val = _guess_source(kwargs, file_path)
    name, path, text, conv, ocr = self._dispatch_source(
        source_type, source_val, kwargs
    )

    page_ph: str = kwargs.get("page_placeholder", "<!-- page -->")
    page_ph_out = self._surface_page_placeholder(
        scan=bool(kwargs.get("scan_pdf_pages")),
        placeholder=page_ph,
        text=text,
    )

    return ReaderOutput(
        text=_ensure_str(text),
        document_name=name,
        document_path=path or "",
        document_id=kwargs.get("document_id", str(uuid.uuid4())),
        conversion_method=conv,
        reader_method="vanilla",
        ocr_method=ocr,
        page_placeholder=page_ph_out,
        metadata=kwargs.get("metadata", {}),
    )
SimpleHTMLTextExtractor

Bases: HTMLParser

Extract the plain-text content of an HTML document.

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class SimpleHTMLTextExtractor(HTMLParser):
    """Extract the plain-text content of an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()
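
A short usage sketch may help show what the extractor returns. The class is repeated from the listing above so that the example runs on its own; the sample HTML and the whitespace-collapsing step are illustrative additions, not library behavior.

```python
from html.parser import HTMLParser


class SimpleHTMLTextExtractor(HTMLParser):
    """Same class as listed above, repeated so the example is self-contained."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text between tags.
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()


extractor = SimpleHTMLTextExtractor()
extractor.feed("<p>Hello <b>world</b></p>")
extractor.close()

# Parts are joined with a space, so runs of whitespace can appear.
# Collapsing them is an optional extra step, not something the class does.
text = " ".join(extractor.get_text().split())
print(text)
```

Tags are dropped entirely, so structural information such as headings or links is not preserved in the output.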

VanillaReader relies on a helper class, PDFPlumberReader, to read PDF files and to pass their images or pages to vision-language models.

DoclingReader

DoclingReader

Bases: BaseReader

High-level document reader leveraging IBM Docling for flexible document-to-Markdown conversion, with optional image captioning or VLM-based PDF processing. Supports automatic pipeline selection, seamless integration with custom vision-language models, and configurable output for both PDF and non-PDF files.
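
The class source defines `_IMAGE_PATTERN`, a regular expression that matches Markdown images embedded as base64 data URIs. As a rough illustration of what that pattern captures, the sketch below swaps such images for a placeholder. The pattern is copied from the source listing; the helper `replace_embedded_images` and the sample Markdown are hypothetical, and the way the reader applies the pattern internally is not shown on this page.

```python
import re

# Pattern copied from DoclingReader._IMAGE_PATTERN in the source listing.
IMAGE_PATTERN = re.compile(
    r"!\[(?P<alt>[^\]]*?)\]"
    r"\((?P<uri>data:image/[a-zA-Z0-9.+-]+;base64,(?P<b64>[A-Za-z0-9+/=]+))\)"
)


def replace_embedded_images(markdown: str, placeholder: str = "<!-- image -->") -> str:
    """Hypothetical helper: swap base64-embedded images for a placeholder."""
    return IMAGE_PATTERN.sub(placeholder, markdown)


md = "Intro\n\n![chart](data:image/png;base64,iVBORw0KGgo=)\n\nOutro"
match = IMAGE_PATTERN.search(md)
print(match.group("alt"))           # alt text of the embedded image
print(replace_embedded_images(md))  # Markdown with the image replaced
```

Images that point to ordinary URLs or file paths are left untouched, because the pattern only accepts `data:image/...;base64,` URIs.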

Source code in src/splitter_mr/reader/readers/docling_reader.py
class DoclingReader(BaseReader):
    """
    High-level document reader leveraging IBM Docling for flexible document-to-Markdown conversion,
    with optional image captioning or VLM-based PDF processing. Supports automatic pipeline selection,
    seamless integration with custom vision-language models, and configurable output for both PDF
    and non-PDF files.
    """

    SUPPORTED_EXTENSIONS = SUPPORTED_DOCLING_FILE_EXTENSIONS

    _IMAGE_PATTERN = re.compile(
        r"!\[(?P<alt>[^\]]*?)\]"
        r"\((?P<uri>data:image/[a-zA-Z0-9.+-]+;base64,(?P<b64>[A-Za-z0-9+/=]+))\)"
    )

    def __init__(self, model: Optional[BaseModel] = None) -> None:
        self.model = model
        self.client = None
        self.model_name: Optional[str] = None
        if model:
            self.client = model.get_client()
            self.model_name = model.model_name
            for attr in ("_azure_deployment", "_azure_endpoint", "_api_version"):
                setattr(self, attr, getattr(self.client, attr, None))

    def read(
        self,
        file_path: str | Path,
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Reads a document, automatically selecting the appropriate Docling pipeline for extraction.
        Supports PDFs (per-page VLM or standard extraction), as well as other file types.

        Args:
            file_path (str | Path): Path or URL to the document file.
            **kwargs: Keyword arguments to control extraction, including:
                - prompt (str): Prompt for image captioning or VLM-based PDF extraction.
                - scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
                - show_base64_images (bool): If True, embed base64 images in Markdown; if False, use
                    image placeholders.
                - page_placeholder (str): Placeholder for page breaks in output Markdown.
                - image_placeholder (str): Placeholder for image locations in output Markdown.
                - image_resolution (float): Resolution scaling factor for image extraction.
                - document_id (Optional[str]): Optional document ID for metadata.
                - metadata (Optional[dict]): Optional metadata dictionary.

        Returns:
            ReaderOutput: Extracted document in Markdown format and associated metadata.

        Raises:
            Warning: Emitted (not raised) when the file extension is unsupported; the call
                falls back to VanillaReader.
            ValueError: If PDF pipeline requirements are not satisfied (e.g., neither model nor
                show_base64_images provided).
        """

        ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
        if ext not in self.SUPPORTED_EXTENSIONS:
            warnings.warn(f"Unsupported extension '{ext}'. Using VanillaReader.")
            return VanillaReader().read(file_path=file_path, **kwargs)

        # Pipeline selection and execution
        pipeline_name, pipeline_args = self._select_pipeline(file_path, ext, **kwargs)
        md = DoclingPipelineFactory.run(pipeline_name, file_path, **pipeline_args)

        page_placeholder: str = pipeline_args.get("page_placeholder", "<!-- page -->")
        page_placeholder_value = (
            page_placeholder if page_placeholder and page_placeholder in md else None
        )

        text = md

        return ReaderOutput(
            text=text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method="markdown",
            reader_method="docling",
            ocr_method=self.model_name,
            page_placeholder=page_placeholder_value,
            metadata=kwargs.get("metadata", {}),
        )

    def _select_pipeline(self, file_path: str, ext: str, **kwargs) -> tuple[str, dict]:
        """
        Decides which pipeline to use and prepares arguments for it.

        Args:
            file_path (str): Path to the input document.
            ext (str): File extension.
            **kwargs: Extraction and pipeline control options, including:
                - prompt (str)
                - scan_pdf_pages (bool)
                - show_base64_images (bool)
                - page_placeholder (str)
                - image_placeholder (str)
                - image_resolution (float)

        Returns:
            tuple[str, dict]: Name of the selected pipeline and the dictionary of arguments for that pipeline.

        Pipeline selection logic:
            - For PDFs:
                - If scan_pdf_pages is True: uses per-page VLM/image pipeline.
                - Else if model is provided: uses VLM pipeline.
                - Else: uses default Markdown pipeline.
            - For other extensions: always uses Markdown pipeline.
        """
        # Defaults
        show_base64_images: bool = kwargs.get("show_base64_images", False)
        page_placeholder: str = kwargs.get("page_placeholder", "<!-- page -->")
        image_placeholder: str = kwargs.get("image_placeholder", "<!-- image -->")
        image_resolution: float = kwargs.get("image_resolution", 1.0)
        scan_pdf_pages: bool = kwargs.get("scan_pdf_pages", False)

        # --- PDF logic ---
        if ext == "pdf":
            if scan_pdf_pages:
                # Scan pages as images and extract their content
                pipeline_args = {
                    "model": self.model,
                    "prompt": kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT),
                    "image_resolution": image_resolution,
                    "page_placeholder": page_placeholder,
                    "show_base64_images": show_base64_images,
                }
                pipeline_name = "page_image"
            else:
                if self.model:
                    if show_base64_images:
                        warnings.warn(
                            "When using a model, base64 images are not rendered. So, deactivate the `show_base64_images` option or don't provide the model in the class constructor."
                        )
                    # Use VLM pipeline for the whole PDF
                    pipeline_args = {
                        "model": self.model,
                        "prompt": kwargs.get("prompt", DEFAULT_IMAGE_CAPTION_PROMPT),
                        "page_placeholder": page_placeholder,
                        "image_placeholder": image_placeholder,
                    }
                    pipeline_name = "vlm"
                else:
                    # No model: use markdown pipeline (default docling, base64 or placeholders)
                    pipeline_args = {
                        "show_base64_images": show_base64_images,
                        "page_placeholder": page_placeholder,
                        "image_placeholder": image_placeholder,
                        "image_resolution": image_resolution,
                        "ext": ext,
                    }
                    pipeline_name = "markdown"
        else:
            # For non-PDF: use markdown pipeline
            pipeline_args = {
                "show_base64_images": show_base64_images,
                "page_placeholder": page_placeholder,
                "image_placeholder": image_placeholder,
                "ext": ext,
            }
            pipeline_name = "markdown"

        return pipeline_name, pipeline_args
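
The decision tree in `_select_pipeline` can be condensed into a small pure function. This is a simplified sketch that returns only the pipeline name and ignores the argument dictionaries; the function name `select_pipeline_name` and the `has_model` flag are illustrative stand-ins for the method and for `self.model`.

```python
def select_pipeline_name(ext: str, scan_pdf_pages: bool = False, has_model: bool = False) -> str:
    """Return the pipeline name for a file extension and the given options."""
    if ext != "pdf":
        # Non-PDF files always go through the Markdown pipeline.
        return "markdown"
    if scan_pdf_pages:
        # Each page is rasterized and described by the vision model.
        return "page_image"
    # Without page scanning, a model switches the PDF to the VLM pipeline.
    return "vlm" if has_model else "markdown"


cases = [("pdf", True, True), ("pdf", False, True), ("pdf", False, False), ("docx", False, True)]
for args in cases:
    print(args, "->", select_pipeline_name(*args))
```

Note that `scan_pdf_pages` takes precedence over the presence of a model, and that a model has no effect on non-PDF files in this selection step.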
read(file_path, **kwargs)

Reads a document, automatically selecting the appropriate Docling pipeline for extraction. Supports PDFs (per-page VLM or standard extraction), as well as other file types.

Parameters:

Name Type Description Default
file_path str | Path

Path or URL to the document file.

required
**kwargs Any

Keyword arguments to control extraction, including:

  • prompt (str): Prompt for image captioning or VLM-based PDF extraction.
  • scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
  • show_base64_images (bool): If True, embed base64 images in Markdown; if False, use image placeholders.
  • page_placeholder (str): Placeholder for page breaks in output Markdown.
  • image_placeholder (str): Placeholder for image locations in output Markdown.
  • image_resolution (float): Resolution scaling factor for image extraction.
  • document_id (Optional[str]): Optional document ID for metadata.
  • metadata (Optional[dict]): Optional metadata dictionary.

{}

Returns:

Name Type Description
ReaderOutput ReaderOutput

Extracted document in Markdown format and associated metadata.

Raises:

Type Description
Warning

If a file extension is unsupported, falls back to VanillaReader and emits a warning.

ValueError

If PDF pipeline requirements are not satisfied (e.g., neither model nor show_base64_images provided).
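
The fallback described in the Raises section depends on how the extension is normalised before the lookup. The following is a minimal sketch of that step, not the library's code: the extension set is a hypothetical subset (the real `SUPPORTED_EXTENSIONS` lives on the reader class), and `pick_reader` is an illustrative helper that returns a label instead of calling a reader.

```python
from __future__ import annotations

import os
import warnings
from pathlib import Path

# Hypothetical subset, for illustration only.
SUPPORTED_EXTENSIONS = {"pdf", "docx", "html", "md", "png"}


def normalize_extension(file_path: str | Path) -> str:
    """Return the lowercase extension without its leading dot ('' if there is none)."""
    return os.path.splitext(file_path)[1].lower().lstrip(".")


def pick_reader(file_path: str | Path) -> str:
    """Return 'docling' for supported files; otherwise warn and return 'vanilla'."""
    ext = normalize_extension(file_path)
    if ext not in SUPPORTED_EXTENSIONS:
        warnings.warn(f"Unsupported extension '{ext}'. Using VanillaReader.")
        return "vanilla"
    return "docling"


print(pick_reader("reports/Q1.PDF"))  # docling
```

Because the extension is lowercased first, `Q1.PDF` and `q1.pdf` are treated the same way.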

Source code in src/splitter_mr/reader/readers/docling_reader.py
def read(
    self,
    file_path: str | Path,
    **kwargs: Any,
) -> ReaderOutput:
    """
    Reads a document, automatically selecting the appropriate Docling pipeline for extraction.
    Supports PDFs (per-page VLM or standard extraction), as well as other file types.

    Args:
        file_path (str | Path): Path or URL to the document file.
        **kwargs: Keyword arguments to control extraction, including:
            - prompt (str): Prompt for image captioning or VLM-based PDF extraction.
            - scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
            - show_base64_images (bool): If True, embed base64 images in Markdown; if False, use
                image placeholders.
            - page_placeholder (str): Placeholder for page breaks in output Markdown.
            - image_placeholder (str): Placeholder for image locations in output Markdown.
            - image_resolution (float): Resolution scaling factor for image extraction.
            - document_id (Optional[str]): Optional document ID for metadata.
            - metadata (Optional[dict]): Optional metadata dictionary.

    Returns:
        ReaderOutput: Extracted document in Markdown format and associated metadata.

    Raises:
        Warning: If a file extension is unsupported, falls back to VanillaReader and emits a warning.
        ValueError: If PDF pipeline requirements are not satisfied (e.g., neither model nor
            show_base64_images provided).
    """

    ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
    if ext not in self.SUPPORTED_EXTENSIONS:
        warnings.warn(f"Unsupported extension '{ext}'. Using VanillaReader.")
        return VanillaReader().read(file_path=file_path, **kwargs)

    # Pipeline selection and execution
    pipeline_name, pipeline_args = self._select_pipeline(file_path, ext, **kwargs)
    md = DoclingPipelineFactory.run(pipeline_name, file_path, **pipeline_args)

    page_placeholder: str = pipeline_args.get("page_placeholder", "<!-- page -->")
    page_placeholder_value = (
        page_placeholder if page_placeholder and page_placeholder in md else None
    )

    text = md

    return ReaderOutput(
        text=text,
        document_name=os.path.basename(file_path),
        document_path=file_path,
        document_id=kwargs.get("document_id", str(uuid.uuid4())),
        conversion_method="markdown",
        reader_method="docling",
        ocr_method=self.model_name,
        page_placeholder=page_placeholder_value,
        metadata=kwargs.get("metadata", {}),
    )

To execute pipelines, DoclingReader has a utils class, DoclingUtils.
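
In the `read` source above, the selected pipeline is executed through `DoclingPipelineFactory.run(pipeline_name, file_path, **pipeline_args)`. One common way to implement this kind of name-based dispatch is a registry of callables. The sketch below only illustrates that pattern: the registry, the decorator and the stand-in pipeline body are all hypothetical and do not reproduce the library's implementation.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping pipeline names to callables.
_PIPELINES: Dict[str, Callable[..., str]] = {}


def register(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Decorator that stores a pipeline function under the given name."""

    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        _PIPELINES[name] = func
        return func

    return decorator


@register("markdown")
def markdown_pipeline(
    file_path: str, page_placeholder: str = "<!-- page -->", **_: Any
) -> str:
    # Stand-in body: a real pipeline would convert the document here.
    return f"{page_placeholder}\nconverted {file_path}"


def run(pipeline_name: str, file_path: str, **pipeline_args: Any) -> str:
    """Look up a pipeline by name and call it with the given arguments."""
    try:
        pipeline = _PIPELINES[pipeline_name]
    except KeyError:
        raise ValueError(f"Unknown pipeline: {pipeline_name!r}") from None
    return pipeline(file_path, **pipeline_args)
```

Extra keyword arguments such as `ext` are absorbed by `**_` in this sketch, so every pipeline can receive the same argument dictionary.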

MarkItDownReader

MarkItDownReader

Bases: BaseReader

Reads multiple file types with Microsoft's MarkItDown library and converts them to Markdown.

This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs) for LLM-based OCR when extracting text from images or scanned documents.
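
The choice between these modes follows the branch order of this class's `read` method. As a rough summary, assuming only the conditions visible in that method, the decision can be sketched as a pure function. The strategy labels are hypothetical names for the three branches and are not part of the library.

```python
PDF_CONVERTIBLE_EXT = {"docx", "pptx", "xlsx"}


def choose_strategy(ext: str, split_by_pages: bool, has_model: bool) -> tuple[bool, str]:
    """Return (needs_pdf_conversion, strategy) in the same order read() checks them."""
    # Office files are converted to PDF only when page splitting is requested.
    needs_conversion = split_by_pages and ext in PDF_CONVERTIBLE_EXT
    if split_by_pages:
        strategy = "per_page_pdf"        # split into single-page PDFs
    elif has_model:
        strategy = "per_page_image_vlm"  # render pages to PNG and send them to the VLM
    else:
        strategy = "standard"            # plain MarkItDown conversion
    return needs_conversion, strategy


print(choose_strategy("docx", True, False))  # (True, 'per_page_pdf')
print(choose_strategy("pdf", False, True))   # (False, 'per_page_image_vlm')
print(choose_strategy("txt", False, False))  # (False, 'standard')
```

Note that `split_by_pages` takes precedence over the model check, so the image-based VLM branch is reached only when page splitting is off.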

Source code in src/splitter_mr/reader/readers/markitdown_reader.py
class MarkItDownReader(BaseReader):
    """
    Reads multiple file types with Microsoft's MarkItDown library and converts
    them to Markdown.

    This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs)
    for LLM-based OCR when extracting text from images or scanned documents.
    """

    def __init__(
        self, model: Optional[Union[AzureOpenAIVisionModel, OpenAIVisionModel]] = None
    ):
        self.model = model
        self.model_name = model.model_name if self.model else None

    def _convert_to_pdf(self, file_path: str) -> str:
        """
        Converts DOCX, PPTX, or XLSX to PDF using LibreOffice (headless mode).

        Args:
            file_path (str): Path to the Office file.

        Returns:
            str: Path to the converted PDF.

        Raises:
            RuntimeError: If conversion fails or LibreOffice is not installed.
        """
        if not shutil.which("soffice"):
            raise RuntimeError(
                "LibreOffice (soffice) is required for Office to PDF conversion but was not found in PATH. "
                "Please install LibreOffice or set split_by_pages=False. "
                "How to install: https://www.libreoffice.org/get-help/install-howto/"
            )

        outdir = tempfile.mkdtemp()
        # Use soffice (LibreOffice) in headless mode
        cmd = [
            "soffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            outdir,
            file_path,
        ]
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"Failed to convert {file_path} to PDF: {result.stderr.decode()}"
            )
        pdf_name = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
        pdf_path = os.path.join(outdir, pdf_name)
        if not os.path.exists(pdf_path):
            raise RuntimeError(f"PDF was not created: {pdf_path}")
        return pdf_path

    def _pdf_pages_to_streams(self, pdf_path: str) -> List[io.BytesIO]:
        """
        Convert each PDF page to a PNG and wrap in a BytesIO stream.

        Args:
            pdf_path (str): Path to the PDF file.

        Returns:
            List[io.BytesIO]: List of PNG image streams for each page.
        """
        doc = fitz.open(pdf_path)
        streams = []
        for idx in range(len(doc)):
            pix = doc.load_page(idx).get_pixmap()
            buf = io.BytesIO(pix.tobytes("png"))
            buf.name = f"page_{idx + 1}.png"
            buf.seek(0)
            streams.append(buf)
        return streams

    def _split_pdf_to_temp_pdfs(self, pdf_path: str) -> List[str]:
        """
        Split a PDF file into single-page temporary PDF files.

        Args:
            pdf_path (str): Path to the PDF file to split.

        Returns:
            List[str]: List of file paths for the temporary single-page PDFs.

        Example:
            temp_files = self._split_pdf_to_temp_pdfs("document.pdf")
            # temp_files = ["/tmp/tmpa1b2c3.pdf", "/tmp/tmpd4e5f6.pdf", ...]
        """
        temp_files = []
        reader = PdfReader(pdf_path)
        for page in reader.pages:
            writer = PdfWriter()
            writer.add_page(page)
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                writer.write(tmp)
                temp_files.append(tmp.name)
        return temp_files

    def _pdf_pages_to_markdown(
        self, file_path: str, md: MarkItDown, prompt: str, page_placeholder: str
    ) -> str:
        """
        Convert each scanned PDF page to markdown using the provided MarkItDown instance.

        Args:
            file_path (str): Path to PDF.
            md (MarkItDown): The MarkItDown converter instance.
            prompt (str): The LLM prompt for OCR.
            page_placeholder (str): Page break placeholder for markdown.

        Returns:
            str: Markdown of the entire PDF (one page per placeholder).
        """
        page_md = []
        for idx, page_stream in enumerate(
            self._pdf_pages_to_streams(file_path), start=1
        ):
            page_md.append(page_placeholder.replace("{page}", str(idx)))
            result = md.convert(page_stream, llm_prompt=prompt)
            page_md.append(result.text_content)
        return "\n".join(page_md)

    def _pdf_file_per_page_to_markdown(
        self, file_path: str, md: MarkItDown, prompt: str, page_placeholder: str
    ) -> str:
        """
        Convert each page of a PDF to markdown by splitting the PDF into temporary single-page files,
        extracting text from each page using MarkItDown, and joining the results with a page placeholder.

        Args:
            file_path (str): Path to the PDF file.
            md (MarkItDown): The MarkItDown converter instance.
            prompt (str): The LLM prompt for extraction.
            page_placeholder (str): Markdown placeholder for page breaks; supports '{page}' for numbering.

        Returns:
            str: Concatenated markdown content for the entire PDF, separated by page placeholders.

        Raises:
            Any exception raised by MarkItDown or file I/O will propagate.

        Example:
            markdown = self._pdf_file_per_page_to_markdown("doc.pdf", md, prompt, "<!-- page {page} -->")
        """
        temp_files = self._split_pdf_to_temp_pdfs(pdf_path=file_path)
        page_md = []
        try:
            for idx, temp_pdf in enumerate(temp_files, start=1):
                page_md.append(page_placeholder.replace("{page}", str(idx)))
                result = md.convert(temp_pdf, llm_prompt=prompt)
                page_md.append(result.text_content)
            return "\n".join(page_md)
        finally:
            # Clean up temp files
            for temp_pdf in temp_files:
                os.remove(temp_pdf)

    def _get_markitdown(self) -> tuple:
        """
        Returns a MarkItDown instance and OCR method name depending on model presence.

        Returns:
            tuple[MarkItDown, Optional[str]]: MarkItDown instance, OCR method or None.

        Raises:
            ValueError: If provided model is not supported.
        """
        if self.model:
            if not isinstance(self.model, (OpenAIVisionModel, AzureOpenAIVisionModel)):
                raise ValueError(
                    "Incompatible client. Only AzureOpenAIVisionModel or OpenAIVisionModel are supported."
                )
            client = self.model.get_client()
            return (
                MarkItDown(llm_client=client, llm_model=self.model.model_name),
                self.model.model_name,
            )
        else:
            return MarkItDown(), None

    def read(self, file_path: Path | str = None, **kwargs: Any) -> ReaderOutput:
        """
        Reads a file and converts its contents to Markdown using MarkItDown.

        Features:
            - Standard file-to-Markdown conversion for most formats.
            - LLM-based OCR (if a Vision model is provided) for images and scanned PDFs.
            - Optional PDF page-wise OCR with fine-grained control and custom LLM prompt.

        Args:
            file_path (Path | str): Path to the input file to be read and converted.
            **kwargs:
                - `document_id (Optional[str])`: Unique document identifier.
                    If not provided, a UUID will be generated.
                - `metadata (Dict[str, Any], optional)`: Additional metadata, given in dictionary format.
                    If not provided, an empty dictionary is used.
                - `prompt (Optional[str])`: Prompt for image captioning or VLM extraction.
                - `page_placeholder (str)`: Markdown placeholder string for pages (default: "<!-- page -->").
                - `split_by_pages (bool)`: If True, split the PDF by pages and process each page
                    separately; DOCX, PPTX and XLSX inputs are first converted to PDF. Default is False.

        Returns:
            ReaderOutput: Pydantic model defining the output structure for all readers.

        Example:
            ```python
            from splitter_mr.model import AzureOpenAIVisionModel
            from splitter_mr.reader import MarkItDownReader

            model = AzureOpenAIVisionModel()
            reader = MarkItDownReader(model=model)
            output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
            print(output.text)
            ```
            ```text
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """

        # Initialize MarkItDown reader
        file_path: str | Path = os.fspath(file_path)
        ext: str = os.path.splitext(file_path)[1].lower().lstrip(".")
        prompt: str = kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
        page_placeholder: str = kwargs.get("page_placeholder", "<!-- page -->")
        split_by_pages: bool = kwargs.get("split_by_pages", False)
        conversion_method: str = None
        md, ocr_method = self._get_markitdown()

        PDF_CONVERTIBLE_EXT: Set[str] = {"docx", "pptx", "xlsx"}

        if split_by_pages and ext != "pdf":
            if ext in PDF_CONVERTIBLE_EXT:
                file_path = self._convert_to_pdf(file_path)

        # Process text
        if split_by_pages:
            markdown_text = self._pdf_file_per_page_to_markdown(
                file_path=file_path,
                md=md,
                prompt=prompt,
                page_placeholder=page_placeholder,
            )
            conversion_method = "markdown"
        elif self.model is not None:
            markdown_text = self._pdf_pages_to_markdown(
                file_path=file_path,
                md=md,
                prompt=prompt,
                page_placeholder=page_placeholder,
            )
            conversion_method = "markdown"
        else:
            markdown_text = md.convert(file_path, llm_prompt=prompt).text_content
            conversion_method = "json" if ext == "json" else "markdown"

        page_placeholder_value = (
            page_placeholder
            if page_placeholder and page_placeholder in markdown_text
            else None
        )

        # Return output
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method=conversion_method,
            reader_method="markitdown",
            ocr_method=ocr_method,
            page_placeholder=page_placeholder_value,
            metadata=kwargs.get("metadata", {}),
        )
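
The Office-to-PDF step in `_convert_to_pdf` comes down to building a LibreOffice command line and predicting where the PDF will land. The helper names below are hypothetical and the sketch deliberately does not run LibreOffice; it only mirrors the argument list and output path used in the listing above.

```python
import os
import shutil


def build_soffice_command(file_path: str, outdir: str) -> list[str]:
    """Build the headless LibreOffice command that converts a file to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf", "--outdir", outdir, file_path]


def expected_pdf_path(file_path: str, outdir: str) -> str:
    """Return where LibreOffice is expected to write the converted PDF."""
    pdf_name = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
    return os.path.join(outdir, pdf_name)


def libreoffice_available() -> bool:
    """Check whether the soffice executable can be found on PATH."""
    return shutil.which("soffice") is not None


cmd = build_soffice_command("slides/deck.pptx", "out")
print(cmd[-1])                                         # slides/deck.pptx
print(expected_pdf_path("slides/deck.pptx", "out"))    # out/deck.pdf on POSIX systems
```

Keeping the command construction separate from `subprocess.run` makes the conversion step easy to check without LibreOffice installed.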
read(file_path=None, **kwargs)

Reads a file and converts its contents to Markdown using MarkItDown.

Features
  • Standard file-to-Markdown conversion for most formats.
  • LLM-based OCR (if a Vision model is provided) for images and scanned PDFs.
  • Optional PDF page-wise OCR with fine-grained control and custom LLM prompt.
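
Page-wise processing in this reader writes each page to a temporary file, converts it, and deletes the files in a `finally` block. The sketch below shows that cleanup pattern with the standard library only. It assumes plain-text "pages" instead of PDFs, and `convert` is a hypothetical stand-in for the MarkItDown conversion call.

```python
import os
import tempfile
from typing import Callable, List


def write_pages_to_temp_files(pages: List[str]) -> List[str]:
    """Write each page to its own temporary file and return the file paths."""
    paths = []
    for page in pages:
        with tempfile.NamedTemporaryFile(
            "w", delete=False, suffix=".txt", encoding="utf-8"
        ) as tmp:
            tmp.write(page)
            paths.append(tmp.name)
    return paths


def process_per_page(pages: List[str], convert: Callable[[str], str]) -> List[str]:
    """Convert every page file, removing the temporary files even if convert fails."""
    temp_files = write_pages_to_temp_files(pages)
    try:
        return [convert(path) for path in temp_files]
    finally:
        for path in temp_files:
            os.remove(path)
```

Because the files are created with `delete=False`, they survive the `with` block and must be removed explicitly, which is why the cleanup sits in `finally`.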

Parameters:

Name Type Description Default
file_path Path | str

Path to the input file to be read and converted.

None
**kwargs Any
  • document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated.
  • metadata (Dict[str, Any], optional): Additional metadata, given in dictionary format. If not provided, an empty dictionary is used.
  • prompt (Optional[str]): Prompt for image captioning or VLM extraction.
  • page_placeholder (str): Markdown placeholder string for pages (default: "<!-- page -->").
  • split_by_pages (bool): If True, split the PDF by pages and process each page separately; DOCX, PPTX and XLSX inputs are first converted to PDF. Default is False.
{}
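
The `page_placeholder` argument is used in two ways: any `{page}` token is replaced by the page number when pages are joined, and the returned `page_placeholder` field is set only if the placeholder string appears verbatim in the final Markdown. The sketch below replays both steps with hypothetical helper names, assuming the logic matches the expressions in the class source.

```python
from typing import List, Optional


def join_pages(pages: List[str], page_placeholder: str) -> str:
    """Prefix every page with the placeholder, filling in '{page}' if present."""
    parts = []
    for idx, page in enumerate(pages, start=1):
        parts.append(page_placeholder.replace("{page}", str(idx)))
        parts.append(page)
    return "\n".join(parts)


def detect_placeholder(markdown_text: str, page_placeholder: str) -> Optional[str]:
    """Return the placeholder only if it occurs verbatim in the text."""
    if page_placeholder and page_placeholder in markdown_text:
        return page_placeholder
    return None


plain = join_pages(["First", "Second"], "<!-- page -->")
print(detect_placeholder(plain, "<!-- page -->"))  # <!-- page -->

numbered = join_pages(["First", "Second"], "<!-- page {page} -->")
print(detect_placeholder(numbered, "<!-- page {page} -->"))  # None
```

With a numbered template the text contains `<!-- page 1 -->`, not the template itself, so under this logic the detected placeholder is `None`.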

Returns:

Name Type Description
ReaderOutput ReaderOutput

Pydantic model defining the output structure for all readers.

Example

from splitter_mr.model import AzureOpenAIVisionModel
from splitter_mr.reader import MarkItDownReader

model = AzureOpenAIVisionModel()
reader = MarkItDownReader(model=model)
output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.pdf")
print(output.text)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...
