Reader

Introduction

The Reader component reads files of many different formats and extensions and returns their content in a uniform structure. All readers share the same parent class, BaseReader.

Which Reader should I use for my project?

Each Reader component extracts document text in different ways. Therefore, choosing the most suitable Reader component depends on your use case.

  • To preserve the original structure as much as possible, without any Markdown parsing, use the VanillaReader class.
  • For documents that contain many tables or visual components (such as images), we strongly recommend DoclingReader.
  • To maximize efficiency or simplify conversion to Markdown, we recommend the MarkItDownReader component.

Note

Remember to consult the official repositories and guides for the last two reader classes (DoclingReader and MarkItDownReader).

The following table lists the file formats that each Reader class supports:

| Reader | Unstructured files & PDFs | MS Office suite files | Tabular data | Files with hierarchical schema | Image files | Markdown conversion |
|---|---|---|---|---|---|---|
| Vanilla Reader | txt, md, pdf | xlsx, docx, pptx | csv, tsv, parquet | json, yaml, html, xml | jpg, png, webp, gif | Yes |
| MarkItDown Reader | txt, md, pdf | docx, xlsx, pptx | csv, tsv | json, html, xml | jpg, png, jpeg | Yes |
| Docling Reader | txt, md, pdf | docx, xlsx, pptx | - | html, xhtml | png, jpeg, tiff, bmp, webp | Yes |
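
The compatibility table can also be expressed as a small lookup, which is convenient for checking which readers list a given extension. This is a minimal sketch: the `SUPPORTED` dictionary and the `readers_for` helper are hypothetical names, not part of Splitter_MR, and the sets are illustrative rather than exhaustive.

```python
# Hypothetical lookup based on the compatibility table; not part of Splitter_MR.
SUPPORTED = {
    "VanillaReader": {
        "txt", "md", "pdf", "xlsx", "docx", "pptx", "csv", "tsv", "parquet",
        "json", "yaml", "html", "xml", "jpg", "png", "webp", "gif",
    },
    "MarkItDownReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "csv", "tsv",
        "json", "html", "xml", "jpg", "png",
    },
    "DoclingReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "html", "xhtml",
        "png", "jpeg", "tiff", "bmp", "webp",
    },
}


def readers_for(extension: str) -> list[str]:
    """Return the reader names whose table row lists the given extension."""
    ext = extension.lower().lstrip(".")
    return sorted(name for name, exts in SUPPORTED.items() if ext in exts)


print(readers_for(".parquet"))  # ['VanillaReader']
print(readers_for("xhtml"))     # ['DoclingReader']
```

A lookup like this only reflects what the table lists; the readers themselves remain the source of truth for what can actually be parsed.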

Installing Docling & MarkItDown

By default, pip install splitter-mr installs core features only.
To use DoclingReader and/or MarkItDownReader, install the corresponding extras:

Python ≥ 3.11 is required.

MarkItDown:

pip install "splitter-mr[markitdown]"

Docling:

pip install "splitter-mr[docling]"

Both:

pip install "splitter-mr[markitdown,docling]"

Note

For the full matrix of extras and alternative package managers, see the global How to install section in the project README: Splitter_MR — How to install

Output format

Bases: BaseModel

Pydantic model defining the output structure for all readers.

Attributes:

Name Type Description
text Optional[str]

The textual content extracted by the reader.

document_name Optional[str]

The name of the document.

document_path str

The path to the document.

document_id Optional[str]

A unique identifier for the document.

conversion_method Optional[str]

The method used for document conversion.

reader_method Optional[str]

The method used for reading the document.

ocr_method Optional[str]

The OCR method used, if any.

page_placeholder Optional[str]

The placeholder used to identify each page, if any.

metadata Optional[Dict[str, Any]]

Additional metadata associated with the document.

Source code in src/splitter_mr/schema/models.py
class ReaderOutput(BaseModel):
    """Pydantic model defining the output structure for all readers.

    Attributes:
        text: The textual content extracted by the reader.
        document_name: The name of the document.
        document_path: The path to the document.
        document_id: A unique identifier for the document.
        conversion_method: The method used for document conversion.
        reader_method: The method used for reading the document.
        ocr_method: The OCR method used, if any.
        page_placeholder: The placeholder used to identify each page, if any.
        metadata: Additional metadata associated with the document.
    """

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    page_placeholder: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = Field(default_factory=dict)

    @field_validator("document_id", mode="before")
    def default_document_id(cls, v: str):
        """Generate a default UUID for document_id if not provided.

        Args:
            v (str): The provided document_id value.

        Returns:
            document_id (str): The provided document_id or a newly generated UUID string.
        """
        document_id = v or str(uuid.uuid4())
        return document_id

    def from_variable(
        self, variable: Union[str, Dict[str, Any]], variable_name: str
    ) -> "ReaderOutput":
        """
        Generate a new ReaderOutput object from a variable (str or dict).

        Args:
            variable (Union[str, Dict[str, Any]]): The variable to use as text.
            variable_name (str): The name for document_name.

        Returns:
            ReaderOutput: The new ReaderOutput object.
        """
        if isinstance(variable, dict):
            text = json.dumps(variable, ensure_ascii=False, indent=2)
            conversion_method = "json"
            metadata = {"details": "Generated from a json variable"}
        elif isinstance(variable, str):
            text = variable
            conversion_method = "txt"
            metadata = {"details": "Generated from a str variable"}
        else:
            raise ValueError("Variable must be either a string or a dictionary.")

        return ReaderOutput(
            text=text,
            document_name=variable_name,
            document_path="",
            conversion_method=conversion_method,
            reader_method="vanilla",
            ocr_method=None,
            page_placeholder=None,
            metadata=metadata,
        )

    def append_metadata(self, metadata: Dict[str, Any]) -> None:
        """
        Append (update) the metadata dictionary with new key-value pairs.

        Args:
            metadata (Dict[str, Any]): The metadata to add or update.
        """
        if self.metadata is None:
            self.metadata = {}
        self.metadata.update(metadata)
append_metadata(metadata)

Append (update) the metadata dictionary with new key-value pairs.

Parameters:

Name Type Description Default
metadata Dict[str, Any]

The metadata to add or update.

required
Source code in src/splitter_mr/schema/models.py
def append_metadata(self, metadata: Dict[str, Any]) -> None:
    """
    Append (update) the metadata dictionary with new key-value pairs.

    Args:
        metadata (Dict[str, Any]): The metadata to add or update.
    """
    if self.metadata is None:
        self.metadata = {}
    self.metadata.update(metadata)
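
Because `append_metadata` delegates to `dict.update`, new keys are added and existing keys are overwritten. The sketch below reproduces that behaviour on a plain dictionary; the function here is a standalone stand-in written for illustration, not the method itself.

```python
from typing import Any, Dict, Optional


def append_metadata(current: Optional[Dict[str, Any]], new: Dict[str, Any]) -> Dict[str, Any]:
    """Standalone stand-in for ReaderOutput.append_metadata (illustration only)."""
    if current is None:  # mirror the None guard in the method
        current = {}
    current.update(new)  # adds new keys, overwrites existing ones
    return current


meta = append_metadata({"author": "a", "lang": "en"}, {"lang": "es", "pages": 3})
print(meta)  # {'author': 'a', 'lang': 'es', 'pages': 3}
```
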
default_document_id(v)

Generate a default UUID for document_id if not provided.

Parameters:

Name Type Description Default
v str

The provided document_id value.

required

Returns:

Name Type Description
document_id str

The provided document_id or a newly generated UUID string.

Source code in src/splitter_mr/schema/models.py
@field_validator("document_id", mode="before")
def default_document_id(cls, v: str):
    """Generate a default UUID for document_id if not provided.

    Args:
        v (str): The provided document_id value.

    Returns:
        document_id (str): The provided document_id or a newly generated UUID string.
    """
    document_id = v or str(uuid.uuid4())
    return document_id
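
The validator relies on the expression `v or str(uuid.uuid4())`, so any falsy value (`None` or an empty string) is replaced by a fresh UUID, while a non-empty identifier is kept unchanged. A minimal standalone sketch of that expression, with `resolve_document_id` as a hypothetical name:

```python
import uuid
from typing import Optional


def resolve_document_id(value: Optional[str]) -> str:
    """Keep a non-empty id; otherwise generate a random UUID4 string."""
    return value or str(uuid.uuid4())


print(resolve_document_id("doc-001"))  # doc-001
generated = resolve_document_id("")    # empty string is falsy, so a UUID is generated
print(len(generated))                  # 36
```
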
from_variable(variable, variable_name)

Generate a new ReaderOutput object from a variable (str or dict).

Parameters:

Name Type Description Default
variable Union[str, Dict[str, Any]]

The variable to use as text.

required
variable_name str

The name for document_name.

required

Returns:

Name Type Description
ReaderOutput ReaderOutput

The new ReaderOutput object.

Source code in src/splitter_mr/schema/models.py
def from_variable(
    self, variable: Union[str, Dict[str, Any]], variable_name: str
) -> "ReaderOutput":
    """
    Generate a new ReaderOutput object from a variable (str or dict).

    Args:
        variable (Union[str, Dict[str, Any]]): The variable to use as text.
        variable_name (str): The name for document_name.

    Returns:
        ReaderOutput: The new ReaderOutput object.
    """
    if isinstance(variable, dict):
        text = json.dumps(variable, ensure_ascii=False, indent=2)
        conversion_method = "json"
        metadata = {"details": "Generated from a json variable"}
    elif isinstance(variable, str):
        text = variable
        conversion_method = "txt"
        metadata = {"details": "Generated from a str variable"}
    else:
        raise ValueError("Variable must be either a string or a dictionary.")

    return ReaderOutput(
        text=text,
        document_name=variable_name,
        document_path="",
        conversion_method=conversion_method,
        reader_method="vanilla",
        ocr_method=None,
        page_placeholder=None,
        metadata=metadata,
    )
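
The branching in `from_variable` can be tried in isolation. The following sketch mirrors only the type dispatch (dictionaries are serialized with `json.dumps`, strings pass through unchanged) and returns a plain tuple instead of a `ReaderOutput`; `variable_to_text` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Tuple, Union


def variable_to_text(variable: Union[str, Dict[str, Any]]) -> Tuple[str, str]:
    """Return (text, conversion_method) using the same rules as from_variable."""
    if isinstance(variable, dict):
        return json.dumps(variable, ensure_ascii=False, indent=2), "json"
    if isinstance(variable, str):
        return variable, "txt"
    raise ValueError("Variable must be either a string or a dictionary.")


text, method = variable_to_text({"name": "café"})
print(method)  # json
print(text)    # non-ASCII characters are kept because ensure_ascii=False
```
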

Readers

To see a comparison between reading methods, refer to the following example.

BaseReader

BaseReader

Bases: ABC

Abstract base class for all document readers.

This interface defines the contract for file readers that process documents and return a standardized ReaderOutput containing the extracted text and document-level metadata. Subclasses must implement the read method to handle specific file formats or reading strategies.

Methods:

Name Description
read

Reads the input file and returns a ReaderOutput with text and metadata.

is_valid_file_path

Check if a path is valid.

is_url

Check if the string provided is a URL.

parse_json

Try to parse a JSON object when a dictionary or string is provided.

Source code in src/splitter_mr/reader/base_reader.py
class BaseReader(ABC):
    """
    Abstract base class for all document readers.

    This interface defines the contract for file readers that process documents and return
    a standardized ReaderOutput containing the extracted text and document-level metadata.
    Subclasses must implement the `read` method to handle specific file formats or reading
    strategies.

    Methods:
        read: Reads the input file and returns a ReaderOutput with text and metadata.
        is_valid_file_path: Check if a path is valid.
        is_url: Check if the string provided is a URL.
        parse_json: Try to parse a JSON object when a dictionary or string is provided.
    """

    @staticmethod
    def is_valid_file_path(path: str) -> bool:
        """
        Checks if the provided string is a valid file path.

        Args:
            path (str): The string to check.

        Returns:
            bool: True if the string is a valid file path to an existing file, False otherwise.

        Example:
            ```python
            BaseReader.is_valid_file_path("/tmp/myfile.txt")
            ```
            ```bash
            True
            ```
        """
        return os.path.isfile(path)

    @staticmethod
    def is_url(string: str) -> bool:
        """
        Determines whether the given string is a valid HTTP or HTTPS URL.

        Args:
            string (str): The string to check.

        Returns:
            bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

        Example:
            ```python
            BaseReader.is_url("https://example.com")
            ```
            ```bash
            True
            ```
            ```python
            BaseReader.is_url("not_a_url")
            ```
            ```bash
            False
            ```
        """
        try:
            result = urlparse(string)
            return all([result.scheme in ("http", "https"), result.netloc])
        except Exception:
            return False

    @staticmethod
    def parse_json(obj: Union[dict, str]) -> dict:
        """
        Attempts to parse the provided object as JSON.

        Args:
            obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
                If a string, attempts to parse it as a JSON string.

        Returns:
            dict: The parsed JSON object.

        Raises:
            ValueError: If a string is provided that cannot be parsed as valid JSON.
            TypeError: If the provided object is neither a dict nor a string.

        Example:
            ```python
            BaseReader.parse_json('{"a": 1}')
            ```
            ```python
            {'a': 1}
            ```
            ```python
            BaseReader.parse_json({'b': 2})
            ```
            ```python
            {'b': 2}
            ```
            ```python
            BaseReader.parse_json('[not valid json]')
            ```
            ```python
            ValueError: String could not be parsed as JSON: ...
            ```
        """
        if isinstance(obj, dict):
            return obj
        if isinstance(obj, str):
            try:
                return json.loads(obj)
            except Exception as e:
                raise ValueError(f"String could not be parsed as JSON: {e}")
        raise TypeError("Provided object is not a string or dictionary")

    @abstractmethod
    def read(
        self, file_path: str, model: Optional[BaseVisionModel] = None, **kwargs: Any
    ) -> ReaderOutput:
        """
        Reads input and returns a ReaderOutput with text content and standardized metadata.

        Args:
            file_path (str): Path to the input file, a URL, raw string, or dictionary.
            model (Optional[BaseVisionModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
            **kwargs: Additional keyword arguments for implementation-specific options.

        Returns:
            ReaderOutput: Pydantic model defining the output structure for all readers.

        Raises:
            ValueError: If the provided string is not a valid file path, URL, or parsable content.
            TypeError: If input type is unsupported.

        Example:
            ```python
            class MyReader(BaseReader):
                def read(self, file_path: str, **kwargs) -> ReaderOutput:
                    return ReaderOutput(
                        text="example",
                        document_name="example.txt",
                        document_path=file_path,
                        document_id=kwargs.get("document_id"),
                        conversion_method="custom",
                        ocr_method=None,
                        metadata={}
                    )
            ```
        """
is_url(string) staticmethod

Determines whether the given string is a valid HTTP or HTTPS URL.

Parameters:

Name Type Description Default
string str

The string to check.

required

Returns:

Name Type Description
bool bool

True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

Example

BaseReader.is_url("https://example.com")
True
BaseReader.is_url("not_a_url")
False

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_url(string: str) -> bool:
    """
    Determines whether the given string is a valid HTTP or HTTPS URL.

    Args:
        string (str): The string to check.

    Returns:
        bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

    Example:
        ```python
        BaseReader.is_url("https://example.com")
        ```
        ```bash
        True
        ```
        ```python
        BaseReader.is_url("not_a_url")
        ```
        ```bash
        False
        ```
    """
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False
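
The check accepts a string only when `urlparse` finds an `http` or `https` scheme and a non-empty network location. The sketch below copies the same logic into a standalone function so a few edge cases can be tried without the library; of the four candidates, only the first is accepted.

```python
from urllib.parse import urlparse


def is_url(string: str) -> bool:
    """Same check as BaseReader.is_url, copied here so the example runs standalone."""
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False


for candidate in ("https://example.com", "ftp://example.com", "https://", "not_a_url"):
    print(candidate, is_url(candidate))
```
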
is_valid_file_path(path) staticmethod

Checks if the provided string is a valid file path.

Parameters:

Name Type Description Default
path str

The string to check.

required

Returns:

Name Type Description
bool bool

True if the string is a valid file path to an existing file, False otherwise.

Example

BaseReader.is_valid_file_path("/tmp/myfile.txt")
True

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_valid_file_path(path: str) -> bool:
    """
    Checks if the provided string is a valid file path.

    Args:
        path (str): The string to check.

    Returns:
        bool: True if the string is a valid file path to an existing file, False otherwise.

    Example:
        ```python
        BaseReader.is_valid_file_path("/tmp/myfile.txt")
        ```
        ```bash
        True
        ```
    """
    return os.path.isfile(path)
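
Since the method is a thin wrapper around `os.path.isfile`, it returns True only for an existing regular file; directories and missing paths both yield False. A short standard-library sketch using a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "myfile.txt")
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write("hello")

    # Evaluate the checks while the temporary directory still exists.
    results = {
        "existing file": os.path.isfile(file_path),
        "directory": os.path.isfile(tmp_dir),
        "missing file": os.path.isfile(os.path.join(tmp_dir, "missing.txt")),
    }

print(results)
```
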
parse_json(obj) staticmethod

Attempts to parse the provided object as JSON.

Parameters:

Name Type Description Default
obj Union[dict, str]

The object to parse. If a dict, returns it as-is. If a string, attempts to parse it as a JSON string.

required

Returns:

Name Type Description
dict dict

The parsed JSON object.

Raises:

Type Description
ValueError

If a string is provided that cannot be parsed as valid JSON.

TypeError

If the provided object is neither a dict nor a string.

Example

BaseReader.parse_json('{"a": 1}')
{'a': 1}
BaseReader.parse_json({'b': 2})
{'b': 2}
BaseReader.parse_json('[not valid json]')
ValueError: String could not be parsed as JSON: ...

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def parse_json(obj: Union[dict, str]) -> dict:
    """
    Attempts to parse the provided object as JSON.

    Args:
        obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
            If a string, attempts to parse it as a JSON string.

    Returns:
        dict: The parsed JSON object.

    Raises:
        ValueError: If a string is provided that cannot be parsed as valid JSON.
        TypeError: If the provided object is neither a dict nor a string.

    Example:
        ```python
        BaseReader.parse_json('{"a": 1}')
        ```
        ```python
        {'a': 1}
        ```
        ```python
        BaseReader.parse_json({'b': 2})
        ```
        ```python
        {'b': 2}
        ```
        ```python
        BaseReader.parse_json('[not valid json]')
        ```
        ```python
        ValueError: String could not be parsed as JSON: ...
        ```
    """
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")
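
The two failure modes are easy to confuse: an unparsable string raises ValueError, while an input that is neither a string nor a dictionary raises TypeError. The sketch below copies the method body into a standalone function so both paths can be exercised without the library.

```python
import json
from typing import Union


def parse_json(obj: Union[dict, str]) -> dict:
    """Same logic as BaseReader.parse_json, copied so the example runs standalone."""
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")


print(parse_json('{"a": 1}'))  # {'a': 1}
print(parse_json({"b": 2}))    # {'b': 2}

for bad_input in ("[not valid json]", 42):
    try:
        parse_json(bad_input)
    except (ValueError, TypeError) as exc:
        print(type(exc).__name__)
```

Note that any valid JSON string is accepted, so a string such as `'[1, 2]'` yields a list despite the `dict` return annotation.
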
read(file_path, model=None, **kwargs) abstractmethod

Reads input and returns a ReaderOutput with text content and standardized metadata.

Parameters:

Name Type Description Default
file_path str

Path to the input file, a URL, raw string, or dictionary.

required
model Optional[BaseVisionModel]

Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.

None
**kwargs Any

Additional keyword arguments for implementation-specific options.

{}

Returns:

Name Type Description
ReaderOutput ReaderOutput

Pydantic model defining the output structure for all readers.

Raises:

Type Description
ValueError

If the provided string is not a valid file path, URL, or parsable content.

TypeError

If input type is unsupported.

Example
class MyReader(BaseReader):
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(
            text="example",
            document_name="example.txt",
            document_path=file_path,
            document_id=kwargs.get("document_id"),
            conversion_method="custom",
            ocr_method=None,
            metadata={}
        )
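
The abstract-method contract can also be illustrated without the library. In this sketch, `Reader` and `Output` are hypothetical stand-ins for `BaseReader` and `ReaderOutput` built only on the standard library; it shows that a subclass must override `read` before it can be instantiated.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Output:
    """Hypothetical stand-in for ReaderOutput with a few of its fields."""

    text: str = ""
    document_path: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


class Reader(ABC):
    """Hypothetical stand-in for BaseReader."""

    @abstractmethod
    def read(self, file_path: str, **kwargs: Any) -> Output:
        ...


class UpperCaseReader(Reader):
    """Toy reader that treats file_path as raw text and upper-cases it."""

    def read(self, file_path: str, **kwargs: Any) -> Output:
        return Output(text=file_path.upper(), metadata=kwargs.get("metadata", {}))


class IncompleteReader(Reader):
    """Does not override read, so it cannot be instantiated."""


print(UpperCaseReader().read("hello").text)  # HELLO
try:
    IncompleteReader()
except TypeError as exc:
    print("TypeError:", exc)
```
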
Source code in src/splitter_mr/reader/base_reader.py
@abstractmethod
def read(
    self, file_path: str, model: Optional[BaseVisionModel] = None, **kwargs: Any
) -> ReaderOutput:
    """
    Reads input and returns a ReaderOutput with text content and standardized metadata.

    Args:
        file_path (str): Path to the input file, a URL, raw string, or dictionary.
        model (Optional[BaseVisionModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
        **kwargs: Additional keyword arguments for implementation-specific options.

    Returns:
        ReaderOutput: Pydantic model defining the output structure for all readers.

    Raises:
        ValueError: If the provided string is not a valid file path, URL, or parsable content.
        TypeError: If input type is unsupported.

    Example:
        ```python
        class MyReader(BaseReader):
            def read(self, file_path: str, **kwargs) -> ReaderOutput:
                return ReaderOutput(
                    text="example",
                    document_name="example.txt",
                    document_path=file_path,
                    document_id=kwargs.get("document_id"),
                    conversion_method="custom",
                    ocr_method=None,
                    metadata={}
                )
        ```
    """

📚 Note: file examples are taken from the data folder in the GitHub repository: link.

VanillaReader

SimpleHTMLTextExtractor

Bases: HTMLParser

Legacy helper to extract raw text from HTML by concatenating data nodes.

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class SimpleHTMLTextExtractor(HTMLParser):
    """Legacy helper to extract raw text from HTML by concatenating data nodes."""

    def __init__(self):
        super().__init__()
        self.text_parts: list = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()
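
A short usage sketch may help here. The class is copied verbatim so that the example runs on its own with only the standard library; the sample HTML string is made up for illustration.

```python
from html.parser import HTMLParser


class SimpleHTMLTextExtractor(HTMLParser):
    """Copy of the helper above so the example runs standalone."""

    def __init__(self):
        super().__init__()
        self.text_parts: list = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()


extractor = SimpleHTMLTextExtractor()
extractor.feed("<h1>Title</h1><p>Hello <b>world</b></p>")
extractor.close()
print(extractor.text_parts)          # one entry per data node
print(extractor.get_text().split())  # ['Title', 'Hello', 'world']
```

Because the data nodes are joined with a single space, whitespace already present in a node (such as the trailing space after "Hello") is kept, so the raw result of `get_text()` can contain doubled spaces.
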
VanillaReader

Bases: BaseReader

Read multiple file types using Python's built-in and standard libraries.

Supported formats include: .json, .html/.htm, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf, and various image formats.
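
`VanillaReader.read` accepts several input sources, and its docstring gives their priority as `kwargs['file_path']` > `file_path` (argument) > `kwargs['file_url']` > `kwargs['json_document']` > `kwargs['text_document']`. The library's `_guess_source` helper is not shown on this page, so the following is only a sketch of how such a priority could be resolved; `guess_source` and its fallback behaviour are assumptions.

```python
from typing import Any, Dict, Optional, Tuple


def guess_source(kwargs: Dict[str, Any], file_path: Optional[str] = None) -> Tuple[str, Any]:
    """Hypothetical resolver following the documented source priority."""
    if kwargs.get("file_path") is not None:  # 1) keyword file_path wins
        return "file_path", kwargs["file_path"]
    if file_path is not None:  # 2) positional file_path
        return "file_path", file_path
    for key in ("file_url", "json_document", "text_document"):  # 3-5) in order
        if kwargs.get(key) is not None:
            return key, kwargs[key]
    # Assumption: with no source at all, report file_path=None to the caller.
    return "file_path", None


print(guess_source({"file_url": "https://example.com/a.pdf"}, "local.pdf"))
print(guess_source({"text_document": "hi", "json_document": {"a": 1}}))
```

The returned pair has the same shape that `_dispatch_source` expects: a source-type key and the associated value.
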

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class VanillaReader(BaseReader):
    """
    Read multiple file types using Python's built-in and standard libraries.

    Supported formats include: .json, .html/.htm, .txt, .xml, .yaml/.yml,
    .csv, .tsv, .parquet, .pdf, and various image formats.
    """

    def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
        """
        Initialize the VanillaReader.

        Args:
            model (Optional[BaseVisionModel]): A vision-capable model used for
                image captioning, scanned PDF processing, or image file analysis.
                Defaults to None.
        """
        super().__init__()
        self.model = model
        self.pdf_reader = PDFPlumberReader()

    # ---- Public method ---- #

    def read(
        self,
        file_path: str | Path = None,
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Read a document from a file path, URL, or raw content.

        This method supports local files, URLs, JSON objects, or raw text strings.
        Priority of sources: ``kwargs['file_path']`` > ``file_path`` (arg) >
        ``kwargs['file_url']`` > ``kwargs['json_document']`` > ``kwargs['text_document']``.

        Args:
            file_path (str | Path, optional): Path to the input file (local path) or a URL.
            **kwargs: Configuration options. Common keys include:
                file_path (str | Path): Same as the positional arg; takes precedence if provided.
                file_url (str): HTTPS/HTTP URL to read from.
                json_document (dict | str): JSON-like document or JSON string.
                text_document (str): Raw text content (auto-detects JSON/YAML if possible).
                document_name (str): Name to use when input is json_document/text_document or fallback.
                document_id (str): Explicit ID for the output document.
                metadata (dict): Metadata to attach to the output.
                html_to_markdown (bool): If True, convert HTML to Markdown.
                scan_pdf_pages (bool): If True, rasterize PDF pages for VLM processing.
                resolution (int): DPI for scan_pdf_pages rasterization (default 300).
                model (BaseVisionModel): Override model for specific calls.
                prompt (str): Prompt for image/page description.
                vlm_parameters (dict): Extra kwargs forwarded to the vision model.
                image_placeholder (str): Placeholder inserted for images when extracting PDFs.
                show_base64_images (bool): If True, include base64 images in extracted PDF output.
                page_placeholder (str): Placeholder used for PDF page separation / surfacing.
                as_table (bool): If True, read Excel via pandas and return CSV text (first sheet).
                excel_engine (str): pandas.read_excel engine (default "openpyxl").
                parquet_engine (str | None): pandas.read_parquet engine override.

        Returns:
            ReaderOutput: A standardized result containing the extracted text,
            metadata, and processing details.

        Raises:
            ReaderConfigException: If arguments are invalid (e.g., malformed URL,
                unsupported extension, missing required model).
            VanillaReaderException: If an error occurs during file I/O, parsing,
                PDF extraction, or external tool execution (e.g., LibreOffice).
            HtmlConversionError: If HTML-to-Markdown conversion fails.
        """
        try:
            source_type, source_val = _guess_source(kwargs, file_path)

            name, path, text, conv, ocr = self._dispatch_source(
                source_type, source_val, kwargs
            )

            page_ph: str = kwargs.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER)
            page_ph_out: str | None = self._surface_page_placeholder(
                scan=bool(kwargs.get("scan_pdf_pages")),
                placeholder=page_ph,
                text=text,
            )

            return ReaderOutput(
                text=_ensure_str(text),
                document_name=name,
                document_path=path or "",
                document_id=kwargs.get("document_id", str(uuid.uuid4())),
                conversion_method=conv,
                reader_method="vanilla",
                ocr_method=ocr,
                page_placeholder=page_ph_out,
                metadata=kwargs.get("metadata", {}),
            )

        except (ReaderConfigException, VanillaReaderException, HtmlConversionError):
            raise
        except Exception as e:
            raise VanillaReaderException(
                f"Unexpected error processing document: {e}"
            ) from e

    # ---- Internal helpers ---- #

    def _dispatch_source(  # noqa: WPS231
        self,
        src_type: str,
        src_val: Any,
        kw: Dict[str, Any],
    ) -> Tuple[str, Optional[str], Any, str, Optional[str]]:
        """Route the request to a specialised handler based on source type."""
        handlers: dict[str, callable] = {
            "file_path": self._handle_local_path,
            "file_url": self._handle_url,
            "json_document": self._handle_explicit_json,
            "text_document": self._handle_explicit_text,
        }
        if src_type not in handlers:
            raise ReaderConfigException(f"Unrecognized document source: {src_type}")

        return handlers[src_type](src_val, kw)

    # ---- individual strategies below – each ~20 lines or fewer ---------- #

    # 1) Local / drive paths
    def _handle_local_path(
        self,
        path_like: str | Path,
        kw: Dict[str, Any],
    ) -> Tuple[str, str, Any, str, Optional[str]]:
        """Handle content loading from a local filesystem path."""
        if path_like is None:
            raise ReaderConfigException("file_path cannot be None.")

        path_str: str | Path = (
            os.fspath(path_like) if isinstance(path_like, Path) else path_like
        )

        if not isinstance(path_str, str):
            raise ReaderConfigException("file_path must be a string or Path object.")

        if self.is_url(path_str):
            return self._handle_url(path_str, kw)

        if not self.is_valid_file_path(path_str):
            return self._handle_fallback(path_str, kw)

        ext: str = os.path.splitext(path_str)[1].lower().lstrip(".")
        doc_name: str = os.path.basename(path_str)

        try:
            rel_path = os.path.relpath(path_str)
        except ValueError:
            rel_path = path_str

        try:
            if ext == "pdf":
                return (
                    doc_name,
                    rel_path,
                    *self._process_pdf(path_str, kw),
                )
            if ext in ("html", "htm"):
                content, conv = _read_html_file(
                    path_str, html_to_markdown=bool(kw.get("html_to_markdown", False))
                )
                return doc_name, rel_path, content, conv, None
            if ext in VANILLA_TXT_FILES_EXTENSIONS:
                return doc_name, rel_path, _read_text_file(path_str, ext), ext, None
            if ext == "parquet":
                return (
                    doc_name,
                    rel_path,
                    _read_parquet(path_str, engine=kw.get("parquet_engine")),
                    "csv",
                    None,
                )
            if ext in ("yaml", "yml"):
                return doc_name, rel_path, _read_text_file(path_str, ext), "json", None
            if ext in ("xlsx", "xls"):
                if kw.get("as_table", False):
                    excel_engine = kw.get("excel_engine", "openpyxl")
                    return (
                        doc_name,
                        rel_path,
                        _read_excel(path_str, engine=excel_engine),
                        ext,
                        None,
                    )
                pdf_path = self._convert_office_to_pdf(path_str)
                return (
                    os.path.basename(pdf_path),
                    os.path.relpath(pdf_path),
                    *self._process_pdf(pdf_path, kw),
                )
            if ext in ("docx", "pptx"):
                pdf_path = self._convert_office_to_pdf(path_str)
                return (
                    os.path.basename(pdf_path),
                    os.path.relpath(pdf_path),
                    *self._process_pdf(pdf_path, kw),
                )
            if ext in SUPPORTED_VANILLA_IMAGE_EXTENSIONS:
                model = kw.get("model", self.model)
                prompt = kw.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
                vlm_parameters = kw.get("vlm_parameters", {})
                return self._handle_image_to_llm(
                    model, path_str, prompt=prompt, vlm_parameters=vlm_parameters
                )
            if ext in SUPPORTED_PROGRAMMING_LANGUAGES:
                return doc_name, rel_path, _read_text_file(path_str, ext), "txt", None

            raise ReaderConfigException(
                f"Unsupported file extension: .{ext}. Please check documentation for supported formats."
            )

        except (VanillaReaderException, ReaderConfigException):
            raise
        except Exception as e:
            raise VanillaReaderException(f"Error reading file '{path_str}': {e}") from e

    # 2) Remote URL
    def _handle_url(
        self,
        url: str,
        kw: Dict[str, Any],
    ) -> Tuple[str, str, Any, str, Optional[str]]:  # noqa: D401
        """Fetch content via HTTP(S) and handle content-type detection."""
        if not isinstance(url, str):
            raise ReaderConfigException("file_url must be a string.")

        if not url.startswith(("http://", "https://")):
            raise ReaderConfigException("file_url must start with http:// or https://")

        content, conv = _load_via_requests(
            url, html_to_markdown=bool(kw.get("html_to_markdown", False))
        )
        name: str = url.split("/")[-1] or "downloaded_file"
        return name, url, content, conv, None

    # 3) Explicit JSON (dict or str)
    def _handle_explicit_json(
        self,
        json_doc: Any,
        _kw: Dict[str, Any],
    ) -> Tuple[Optional[str], None, Any, str, None]:
        """Process a JSON object or string passed directly."""
        try:
            return (
                _kw.get("document_name", None),
                None,
                self.parse_json(json_doc),
                "json",
                None,
            )
        except Exception as e:
            raise VanillaReaderException(
                f"Failed to parse provided JSON document: {e}"
            ) from e

    # 4) Explicit raw text
    def _handle_explicit_text(
        self,
        txt: str,
        _kw: Dict[str, Any],
    ) -> Tuple[Optional[str], None, Any, str, None]:  # noqa: D401
        """Process raw text, attempting to auto-detect structured formats (JSON/YAML)."""
        for parser, conv in ((self.parse_json, "json"), (yaml.safe_load, "json")):
            try:
                parsed = parser(txt)
                if isinstance(parsed, (dict, list)):
                    return _kw.get("document_name", None), None, parsed, conv, None
            except Exception:
                continue

        return _kw.get("document_name", None), None, txt, "txt", None

    # ----- shared utilities ------------------------------------------------ #

    def _process_pdf(
        self,
        path: str,
        kw: Dict[str, Any],
    ) -> Tuple[Any, str, Optional[str]]:
        """Extract content from a PDF, supporting both text extraction and visual scanning."""
        if kw.get("scan_pdf_pages"):
            model = kw.get("model", self.model)
            if model is None:
                raise ReaderConfigException(
                    "scan_pdf_pages=True requires a vision-capable model (kwarg 'model' or init 'model')."
                )
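            # Drop keys that are already bound explicitly; forwarding them again
            # would raise "got multiple values for keyword argument".
            extra = {k: v for k, v in kw.items() if k not in ("model", "file_path")}
            joined = self._scan_pdf_pages(path, model=model, **extra)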
            return joined, "png", model.model_name

        try:
            content = self.pdf_reader.read(
                path,
                model=kw.get("model", self.model),
                prompt=kw.get("prompt") or DEFAULT_IMAGE_CAPTION_PROMPT,
                show_base64_images=kw.get("show_base64_images", False),
                image_placeholder=kw.get(
                    "image_placeholder", DEFAULT_IMAGE_PLACEHOLDER
                ),
                page_placeholder=kw.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER),
            )
        except Exception as e:
            raise VanillaReaderException(
                f"PDF extraction failed for {path}: {e}"
            ) from e

        ocr_name: str | None = (
            (kw.get("model") or self.model).model_name
            if kw.get("model") or self.model
            else None
        )
        return content, "pdf", ocr_name

    def _scan_pdf_pages(self, file_path: str, model: BaseVisionModel, **kw) -> str:
        """Rasterize PDF pages and describe them using a vision model."""
        page_ph = kw.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER)
        try:
            pages: list[str] = self.pdf_reader.describe_pages(
                file_path=file_path,
                model=model,
                prompt=kw.get("prompt") or DEFAULT_IMAGE_EXTRACTION_PROMPT,
                resolution=kw.get("resolution", 300),
                **kw.get("vlm_parameters", {}),
            )
            return "\n\n---\n\n".join(f"{page_ph}\n\n{md}" for md in pages)
        except Exception as e:
            raise VanillaReaderException(
                f"Failed to scan PDF pages for {file_path}: {e}"
            ) from e

    def _handle_fallback(self, raw: str, kw: Dict[str, Any]):
        """Attempt to handle unrecognized sources by trying explicit JSON or text handlers."""
        try:
            return self._handle_explicit_json(raw, kw)
        except Exception:
            try:
                return self._handle_explicit_text(raw, kw)
            except Exception:
                return kw.get("document_name", None), None, raw, "txt", None

    def _handle_image_to_llm(
        self,
        model: BaseVisionModel,
        file_path: str,
        prompt: Optional[str] = None,
        vlm_parameters: Optional[dict] = None,
    ) -> Tuple[str, str, Any, str, str]:
        """Extract information from an image file using a Vision Language Model."""
        if model is None:
            raise ReaderConfigException(
                "No vision model provided for image extraction. Pass 'model' to init or read()."
            )

        try:
            with open(file_path, "rb") as f:
                img_bytes = f.read()
        except OSError as e:
            raise VanillaReaderException(
                f"Could not read image file {file_path}: {e}"
            ) from e

        ext: str = os.path.splitext(file_path)[1].lstrip(".").lower()
        img_b64: str = base64.b64encode(img_bytes).decode("utf-8")
        prompt: str = prompt or DEFAULT_IMAGE_EXTRACTION_PROMPT
        vlm_parameters: dict[str, Any] = vlm_parameters or {}

        try:
            extracted: str = model.analyze_content(
                img_b64, prompt=prompt, file_ext=ext, **vlm_parameters
            )
        except Exception as e:
            raise VanillaReaderException(f"Vision model analysis failed: {e}") from e

        doc_name: str = os.path.basename(file_path)
        rel_path: str = os.path.relpath(file_path)
        return doc_name, rel_path, extracted, "image", model.model_name

    @staticmethod
    def _surface_page_placeholder(
        scan: bool, placeholder: str, text: Any
    ) -> Optional[str]:
        """Determine if the page placeholder should be exposed in the output text."""
        if "%" in placeholder:
            return None
        txt: str = _ensure_str(text)
        return placeholder if (scan or placeholder in txt) else None

    def _convert_office_to_pdf(self, file_path: str) -> str:
        """Convert a Microsoft Office document to PDF using a headless LibreOffice process."""
        if not shutil.which("soffice"):
            raise VanillaReaderException(
                "LibreOffice/soffice is required for Office-to-PDF conversion "
                "but was not found in PATH."
            )

        try:
            outdir: str = tempfile.mkdtemp(prefix="vanilla_office2pdf_")
        except OSError as e:
            raise VanillaReaderException(
                f"Failed to create temp directory for PDF conversion: {e}"
            ) from e

        cmd = [
            "soffice",
            "--headless",
            "--convert-to",
            "pdf",
            "--outdir",
            outdir,
            file_path,
        ]

        try:
            proc: CompletedProcess[bytes] = subprocess.run(
                cmd, capture_output=True, check=False
            )
        except Exception as e:
            raise VanillaReaderException(
                f"Subprocess failed when executing LibreOffice: {e}"
            ) from e

        if proc.returncode != 0:
            err_msg = proc.stderr.decode() if proc.stderr else "Unknown error"
            raise VanillaReaderException(
                f"LibreOffice failed converting {file_path} -> PDF. Exit code {proc.returncode}.\nError: {err_msg}"
            )

        pdf_name: str = os.path.splitext(os.path.basename(file_path))[0] + ".pdf"
        pdf_path: str = os.path.join(outdir, pdf_name)

        if not os.path.exists(pdf_path):
            raise VanillaReaderException(
                f"LibreOffice finished, but expected PDF was not found at: {pdf_path}"
            )

        return pdf_path
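The Office-to-PDF step in `_convert_office_to_pdf` above can be explored in isolation. The following is a minimal sketch that only builds the LibreOffice command and computes where the PDF is expected to appear; it does not run the conversion. The helper names `build_soffice_command`, `expected_pdf_path`, and `soffice_available` are hypothetical and not part of the library.

```python
import os
import shutil


def build_soffice_command(file_path: str, outdir: str) -> list[str]:
    """Return the headless LibreOffice command that converts a file to PDF."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        outdir,
        file_path,
    ]


def expected_pdf_path(file_path: str, outdir: str) -> str:
    """Compute the path <outdir>/<stem>.pdf that the conversion should produce."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    return os.path.join(outdir, stem + ".pdf")


def soffice_available() -> bool:
    """Report whether the soffice binary can be found on PATH."""
    return shutil.which("soffice") is not None


cmd = build_soffice_command("reports/q1.docx", "/tmp/out")
pdf = expected_pdf_path("reports/q1.docx", "/tmp/out")
```

Checking `soffice_available()` before reading `docx`, `pptx`, or `xlsx` files lets a caller fail early instead of waiting for the reader to raise.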
__init__(model=None)

Initialize the VanillaReader.

Parameters:

Name Type Description Default
model Optional[BaseVisionModel]

A vision-capable model used for image captioning, scanned PDF processing, or image file analysis. Defaults to None.

None
Source code in src/splitter_mr/reader/readers/vanilla_reader.py
def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
    """
    Initialize the VanillaReader.

    Args:
        model (Optional[BaseVisionModel]): A vision-capable model used for
            image captioning, scanned PDF processing, or image file analysis.
            Defaults to None.
    """
    super().__init__()
    self.model = model
    self.pdf_reader = PDFPlumberReader()
read(file_path=None, **kwargs)

Read a document from a file path, URL, or raw content.

This method supports local files, URLs, JSON objects, or raw text strings. Priority of sources: kwargs['file_path'] > file_path (arg) > kwargs['file_url'] > kwargs['json_document'] > kwargs['text_document'].
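The priority order can be sketched as a small resolver. This is an illustration only: `resolve_source` is a hypothetical stand-in for the internal `_guess_source` helper, whose implementation is not shown here, and raising `ValueError` when no source is given is an assumption.

```python
from typing import Any, Optional, Tuple


def resolve_source(
    kwargs: dict[str, Any], file_path: Optional[str] = None
) -> Tuple[str, Any]:
    """Pick the first available source, following the documented priority."""
    # 1) kwargs['file_path'] wins over the positional argument.
    if kwargs.get("file_path") is not None:
        return "file_path", kwargs["file_path"]
    # 2) The positional file_path argument.
    if file_path is not None:
        return "file_path", file_path
    # 3) Remaining keyword sources, in order.
    for key in ("file_url", "json_document", "text_document"):
        if kwargs.get(key) is not None:
            return key, kwargs[key]
    # Assumption: the real helper may behave differently when nothing is given.
    raise ValueError("No document source was provided.")
```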

Parameters:

Name Type Description Default
file_path str | Path

Path to the input file (local path) or a URL.

None
**kwargs Any

Configuration options. Common keys include:

- file_path (str | Path): Same as the positional arg; takes precedence if provided.
- file_url (str): HTTPS/HTTP URL to read from.
- json_document (dict | str): JSON-like document or JSON string.
- text_document (str): Raw text content (auto-detects JSON/YAML if possible).
- document_name (str): Name to use when input is json_document/text_document or fallback.
- document_id (str): Explicit ID for the output document.
- metadata (dict): Metadata to attach to the output.
- html_to_markdown (bool): If True, convert HTML to Markdown.
- scan_pdf_pages (bool): If True, rasterize PDF pages for VLM processing.
- resolution (int): DPI for scan_pdf_pages rasterization (default 300).
- model (BaseVisionModel): Override model for specific calls.
- prompt (str): Prompt for image/page description.
- vlm_parameters (dict): Extra kwargs forwarded to the vision model.
- image_placeholder (str): Placeholder inserted for images when extracting PDFs.
- show_base64_images (bool): If True, include base64 images in extracted PDF output.
- page_placeholder (str): Placeholder used for PDF page separation / surfacing.
- as_table (bool): If True, read Excel via pandas and return CSV text (first sheet).
- excel_engine (str): pandas.read_excel engine (default "openpyxl").
- parquet_engine (str | None): pandas.read_parquet engine override.

{}
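The auto-detection applied to `text_document` can be approximated as follows. This sketch covers the JSON attempt only, using the standard library; the reader also tries YAML, which is left out here to avoid a third-party dependency. `detect_text_format` is a hypothetical name.

```python
import json
from typing import Any, Tuple


def detect_text_format(txt: str) -> Tuple[Any, str]:
    """Return (content, conversion_method) for a raw text input."""
    try:
        parsed = json.loads(txt)
    except (TypeError, ValueError):
        # Not valid JSON: keep the input as plain text.
        return txt, "txt"
    # Only containers count as structured documents; bare scalars stay text.
    if isinstance(parsed, (dict, list)):
        return parsed, "json"
    return txt, "txt"
```

Treating only `dict` and `list` results as structured keeps inputs such as `"42"` or `"true"` from being reported as JSON documents.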

Returns:

Name Type Description
ReaderOutput ReaderOutput

A standardized result containing the extracted text, metadata, and processing details.
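One of the processing details carried by `ReaderOutput` is `page_placeholder`. The sketch below mirrors the rule in the `_surface_page_placeholder` helper from the source listing: the placeholder is reported only when pages were scanned or when it occurs in the extracted text, and never when it contains `%`. The placeholder string used in the example is hypothetical, and the reading that `%` marks a format template rather than a literal marker is an assumption.

```python
from typing import Optional


def surface_page_placeholder(scan: bool, placeholder: str, text: str) -> Optional[str]:
    """Decide whether the page placeholder should be exposed in the output."""
    if "%" in placeholder:
        # Assumed to be a template (e.g. "page %d"), not a literal marker.
        return None
    return placeholder if (scan or placeholder in text) else None


marker = "<!-- page -->"  # hypothetical placeholder value
with_marker = surface_page_placeholder(False, marker, "intro\n<!-- page -->\nbody")
without_marker = surface_page_placeholder(False, marker, "no marker here")
```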

Raises:

Type Description
ReaderConfigException

If arguments are invalid (e.g., malformed URL, unsupported extension, missing required model).

VanillaReaderException

If an error occurs during file I/O, parsing, PDF extraction, or external tool execution (e.g., LibreOffice).

HtmlConversionError

If HTML-to-Markdown conversion fails.
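The error contract above follows a common pattern: known exceptions pass through unchanged, and anything unexpected is wrapped while keeping the original error attached. A minimal sketch, using hypothetical stand-in classes rather than the library's own exception types:

```python
class ReaderConfigError(Exception):
    """Hypothetical stand-in for ReaderConfigException."""


class VanillaReaderError(Exception):
    """Hypothetical stand-in for VanillaReaderException."""


def read_safely(loader):
    """Run loader(), passing known errors through and wrapping the rest."""
    try:
        return loader()
    except (ReaderConfigError, VanillaReaderError):
        # Already meaningful to the caller: re-raise unchanged.
        raise
    except Exception as e:
        # Anything else is wrapped; `from e` keeps the original as __cause__.
        raise VanillaReaderError(f"Unexpected error processing document: {e}") from e


def broken_loader():
    raise KeyError("missing")


try:
    read_safely(broken_loader)
except VanillaReaderError as err:
    original = err.__cause__  # the KeyError raised inside the loader
```

Because the original exception stays reachable through `__cause__`, callers can catch one reader-level exception type and still inspect the root cause when debugging.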

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
def read(
    self,
    file_path: str | Path | None = None,
    **kwargs: Any,
) -> ReaderOutput:
    """
    Read a document from a file path, URL, or raw content.

    This method supports local files, URLs, JSON objects, or raw text strings.
    Priority of sources: ``kwargs['file_path']`` > ``file_path`` (arg) >
    ``kwargs['file_url']`` > ``kwargs['json_document']`` > ``kwargs['text_document']``.

    Args:
        file_path (str | Path, optional): Path to the input file (local path) or a URL.
        **kwargs: Configuration options. Common keys include:
            file_path (str | Path): Same as the positional arg; takes precedence if provided.
            file_url (str): HTTPS/HTTP URL to read from.
            json_document (dict | str): JSON-like document or JSON string.
            text_document (str): Raw text content (auto-detects JSON/YAML if possible).
            document_name (str): Name to use when input is json_document/text_document or fallback.
            document_id (str): Explicit ID for the output document.
            metadata (dict): Metadata to attach to the output.
            html_to_markdown (bool): If True, convert HTML to Markdown.
            scan_pdf_pages (bool): If True, rasterize PDF pages for VLM processing.
            resolution (int): DPI for scan_pdf_pages rasterization (default 300).
            model (BaseVisionModel): Override model for specific calls.
            prompt (str): Prompt for image/page description.
            vlm_parameters (dict): Extra kwargs forwarded to the vision model.
            image_placeholder (str): Placeholder inserted for images when extracting PDFs.
            show_base64_images (bool): If True, include base64 images in extracted PDF output.
            page_placeholder (str): Placeholder used for PDF page separation / surfacing.
            as_table (bool): If True, read Excel via pandas and return CSV text (first sheet).
            excel_engine (str): pandas.read_excel engine (default "openpyxl").
            parquet_engine (str | None): pandas.read_parquet engine override.

    Returns:
        ReaderOutput: A standardized result containing the extracted text,
        metadata, and processing details.

    Raises:
        ReaderConfigException: If arguments are invalid (e.g., malformed URL,
            unsupported extension, missing required model).
        VanillaReaderException: If an error occurs during file I/O, parsing,
            PDF extraction, or external tool execution (e.g., LibreOffice).
        HtmlConversionError: If HTML-to-Markdown conversion fails.
    """
    try:
        source_type, source_val = _guess_source(kwargs, file_path)

        name, path, text, conv, ocr = self._dispatch_source(
            source_type, source_val, kwargs
        )

        page_ph: str = kwargs.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER)
        page_ph_out: str | None = self._surface_page_placeholder(
            scan=bool(kwargs.get("scan_pdf_pages")),
            placeholder=page_ph,
            text=text,
        )

        return ReaderOutput(
            text=_ensure_str(text),
            document_name=name,
            document_path=path or "",
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method=conv,
            reader_method="vanilla",
            ocr_method=ocr,
            page_placeholder=page_ph_out,
            metadata=kwargs.get("metadata", {}),
        )

    except (ReaderConfigException, VanillaReaderException, HtmlConversionError):
        raise
    except Exception as e:
        raise VanillaReaderException(
            f"Unexpected error processing document: {e}"
        ) from e

VanillaReader delegates PDF extraction and Vision Language Model calls to a helper class, PDFPlumberReader.

DoclingReader


DoclingReader

Bases: BaseReader

High-level document reader leveraging IBM Docling for flexible document-to-Markdown conversion, with optional image captioning or VLM-based PDF processing. Supports automatic pipeline selection, seamless integration with custom vision-language models, and configurable output for both PDF and non-PDF files.

Parameters:

Name Type Description Default
model Optional[BaseVisionModel]

An optional vision-language model instance used for PDF pipelines that require image captioning or per-page analysis. If provided, the model’s client and metadata (e.g., Azure deployment settings) are stored for use in downstream processing. Defaults to None.

None
Source code in src/splitter_mr/reader/readers/docling_reader.py
class DoclingReader(BaseReader):
    """
    High-level document reader leveraging IBM Docling for flexible document-to-Markdown conversion,
    with optional image captioning or VLM-based PDF processing. Supports automatic pipeline selection,
    seamless integration with custom vision-language models, and configurable output for both PDF
    and non-PDF files.

    Args:
        model (Optional[BaseVisionModel], optional): An optional vision-language
            model instance used for PDF pipelines that require image captioning
            or per-page analysis. If provided, the model’s client and metadata
            (e.g., Azure deployment settings) are stored for use in downstream
            processing. Defaults to None.
    """

    def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
        self.model = model
        self.client = None
        self.model_name: Optional[str] = None
        if model:
            self.client = model.get_client()
            self.model_name = model.model_name

    def read(
        self,
        file_path: str | Path,
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Reads a document, automatically selecting the appropriate Docling pipeline for extraction.
        Supports PDFs (per-page VLM or standard extraction), as well as other file types.

        Args:
            file_path (str | Path): Path or URL to the document file.
            **kwargs: Keyword arguments to control extraction, including:
                - prompt (str): Prompt for image captioning or VLM-based PDF extraction.
                - scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
                - show_base64_images (bool): If True, embed base64 images in Markdown; if False, use
                    image placeholders.
                - page_placeholder (str): Placeholder for page breaks in output Markdown.
                - image_placeholder (str): Placeholder for image locations in output Markdown.
                - image_resolution (float): Resolution scaling factor for image extraction.
                - document_id (Optional[str]): Optional document ID for metadata.
                - metadata (Optional[dict]): Optional metadata dictionary.

        Returns:
            ReaderOutput: Extracted document in Markdown format and associated metadata.

        Warns:
            BaseReaderWarning: If the file extension is not supported by Docling,
                this method falls back to ``VanillaReader``.

        Raises:
            DoclingReaderException: If a specific Docling exception is raised during
                pipeline execution (e.g., ConversionError, OperationNotAllowed).
        """
        ext: str = os.path.splitext(str(file_path))[1].lower().lstrip(".")
        if ext not in SUPPORTED_DOCLING_FILE_EXTENSIONS:
            msg = f"Unsupported extension '{ext}'. Using VanillaReader."
            warnings.warn(msg, BaseReaderWarning)
            return VanillaReader().read(file_path=file_path, **kwargs)

        # Pipeline selection and execution
        pipeline_name, pipeline_args = self._select_pipeline(ext, **kwargs)

        try:
            md = DoclingPipelineFactory.run(
                pipeline_name, str(file_path), **pipeline_args
            )
        except ReaderConfigException:
            raise
        except DoclingBaseError as exc:
            raise DoclingReaderException(
                f"Docling pipeline '{pipeline_name}' failed for '{file_path}': {exc}"
            ) from exc
        except Exception as exc:
            raise DoclingReaderException(
                f"Unexpected error in Docling pipeline '{pipeline_name}' for '{file_path}': {exc}"
            ) from exc

        page_placeholder: str = pipeline_args.get(
            "page_placeholder", DEFAULT_PAGE_PLACEHOLDER
        )
        page_placeholder_value = (
            page_placeholder if page_placeholder and page_placeholder in md else None
        )

        text = md

        return ReaderOutput(
            text=text,
            document_name=os.path.basename(str(file_path)),
            document_path=str(file_path),
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method="markdown",
            reader_method="docling",
            ocr_method=self.model_name,
            page_placeholder=page_placeholder_value,
            metadata=kwargs.get("metadata", {}),
        )

    def _select_pipeline(self, ext: str, **kwargs) -> tuple[str, dict]:
        """
        Decides which pipeline to use and prepares arguments for it.

        Args:
            ext (str): File extension.
            **kwargs: Extraction and pipeline control options, including:
                - prompt (str)
                - scan_pdf_pages (bool)
                - show_base64_images (bool)
                - page_placeholder (str)
                - image_placeholder (str)
                - image_resolution (float)

        Returns:
            tuple[str, dict]: Name of the selected pipeline and the dictionary of arguments for that pipeline.

        Pipeline selection logic:
            - For PDFs:
                - If scan_pdf_pages is True: uses per-page VLM/image pipeline.
                - Else if model is provided: uses VLM pipeline.
                - Else: uses default Markdown pipeline.
            - For other extensions: always uses Markdown pipeline.
        """

        # ---- Initialization ---- #

        show_base64_images: bool = kwargs.get("show_base64_images", False)
        page_placeholder: str = kwargs.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER)
        image_placeholder: str = kwargs.get(
            "image_placeholder", DEFAULT_IMAGE_PLACEHOLDER
        )
        image_resolution: float = kwargs.get("image_resolution", 1.0)
        scan_pdf_pages: bool = kwargs.get("scan_pdf_pages", False)

        # ---- PDF logic ---- #

        if ext == "pdf":
            if scan_pdf_pages:
                # Scan pages as images and extract their content
                pipeline_args = {
                    "model": self.model,
                    "prompt": kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT),
                    "image_resolution": image_resolution,
                    "page_placeholder": page_placeholder,
                    "show_base64_images": show_base64_images,
                }
                pipeline_name = "page_image"
            else:
                if self.model:
                    if show_base64_images:
                        warnings.warn(
                            "When using a model, base64 images are not rendered. "
                            "Deactivate `show_base64_images` or do not provide a model "
                            "to DoclingReader.",
                            BaseReaderWarning,
                        )
                    # Read the whole PDF using a VLM
                    pipeline_args = {
                        "model": self.model,
                        "prompt": kwargs.get("prompt", DEFAULT_IMAGE_CAPTION_PROMPT),
                        "page_placeholder": page_placeholder,
                        "image_placeholder": image_placeholder,
                    }
                    pipeline_name = "vlm"
                else:
                    # No model: use markdown pipeline (default docling, base64 or placeholders)
                    pipeline_args = {
                        "show_base64_images": show_base64_images,
                        "page_placeholder": page_placeholder,
                        "image_placeholder": image_placeholder,
                        "image_resolution": image_resolution,
                        "ext": ext,
                    }
                    pipeline_name = "markdown"

        # ---- Main logic ---- #

        else:
            pipeline_args = {
                "show_base64_images": show_base64_images,
                "page_placeholder": page_placeholder,
                "image_placeholder": image_placeholder,
                "ext": ext,
            }
            pipeline_name = "markdown"

        return pipeline_name, pipeline_args
read(file_path, **kwargs)

Reads a document, automatically selecting the appropriate Docling pipeline for extraction. Supports PDFs (per-page VLM or standard extraction), as well as other file types.
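The selection rule can be summarised in a few lines. This sketch reproduces the decision documented in `_select_pipeline` (pipeline names `page_image`, `vlm`, and `markdown` come from the source listing); the function name and its boolean `has_model` parameter are simplifications introduced for illustration.

```python
def select_pipeline(ext: str, has_model: bool = False, scan_pdf_pages: bool = False) -> str:
    """Return the name of the pipeline used for a given file and configuration."""
    if ext != "pdf":
        # Non-PDF files always go through the Markdown pipeline.
        return "markdown"
    if scan_pdf_pages:
        # Each page is rasterized and described by the vision model.
        return "page_image"
    # Without page scanning, a model switches the PDF to the VLM pipeline.
    return "vlm" if has_model else "markdown"
```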

Parameters:

Name Type Description Default
file_path str | Path

Path or URL to the document file.

required
**kwargs Any

Keyword arguments to control extraction, including:

- prompt (str): Prompt for image captioning or VLM-based PDF extraction.
- scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
- show_base64_images (bool): If True, embed base64 images in Markdown; if False, use image placeholders.
- page_placeholder (str): Placeholder for page breaks in output Markdown.
- image_placeholder (str): Placeholder for image locations in output Markdown.
- image_resolution (float): Resolution scaling factor for image extraction.
- document_id (Optional[str]): Optional document ID for metadata.
- metadata (Optional[dict]): Optional metadata dictionary.

{}

Returns:

Name Type Description
ReaderOutput ReaderOutput

Extracted document in Markdown format and associated metadata.

Warns:

Type Description
BaseReaderWarning

If the file extension is not supported by Docling, this method falls back to VanillaReader.
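The fallback can be pictured as an extension check that emits a warning before switching readers. In this sketch the `SUPPORTED` set is an illustrative subset (the real list lives in `SUPPORTED_DOCLING_FILE_EXTENSIONS`), and `ReaderFallbackWarning` and `choose_reader` are hypothetical names standing in for `BaseReaderWarning` and the logic inside `read`.

```python
import os
import warnings

# Illustrative subset only, not the library's actual list.
SUPPORTED = {"pdf", "docx", "pptx", "xlsx", "html", "md"}


class ReaderFallbackWarning(UserWarning):
    """Hypothetical stand-in for BaseReaderWarning."""


def choose_reader(file_path: str) -> str:
    """Return which reader would handle the file, warning on fallback."""
    # Same normalisation as the reader: lower-case, no leading dot.
    ext = os.path.splitext(str(file_path))[1].lower().lstrip(".")
    if ext not in SUPPORTED:
        warnings.warn(
            f"Unsupported extension '{ext}'. Using VanillaReader.",
            ReaderFallbackWarning,
        )
        return "vanilla"
    return "docling"
```

Callers that want to treat the fallback as an error can promote the warning with `warnings.simplefilter("error", ReaderFallbackWarning)`.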

Raises:

Type Description
DoclingReaderException

If a specific Docling exception is raised during pipeline execution (e.g., ConversionError, OperationNotAllowed).

Source code in src/splitter_mr/reader/readers/docling_reader.py
def read(
    self,
    file_path: str | Path,
    **kwargs: Any,
) -> ReaderOutput:
    """
    Reads a document, automatically selecting the appropriate Docling pipeline for extraction.
    Supports PDFs (per-page VLM or standard extraction), as well as other file types.

    Args:
        file_path (str | Path): Path or URL to the document file.
        **kwargs: Keyword arguments to control extraction, including:
            - prompt (str): Prompt for image captioning or VLM-based PDF extraction.
            - scan_pdf_pages (bool): If True (and model provided), analyze each PDF page via VLM.
            - show_base64_images (bool): If True, embed base64 images in Markdown; if False, use
                image placeholders.
            - page_placeholder (str): Placeholder for page breaks in output Markdown.
            - image_placeholder (str): Placeholder for image locations in output Markdown.
            - image_resolution (float): Resolution scaling factor for image extraction.
            - document_id (Optional[str]): Optional document ID for metadata.
            - metadata (Optional[dict]): Optional metadata dictionary.

    Returns:
        ReaderOutput: Extracted document in Markdown format and associated metadata.

    Warns:
        BaseReaderWarning: If the file extension is not supported by Docling,
            this method falls back to ``VanillaReader``.

    Raises:
        DoclingReaderException: If a specific Docling exception is raised during
            pipeline execution (e.g., ConversionError, OperationNotAllowed).
    """
    ext: str = os.path.splitext(str(file_path))[1].lower().lstrip(".")
    if ext not in SUPPORTED_DOCLING_FILE_EXTENSIONS:
        msg = f"Unsupported extension '{ext}'. Using VanillaReader."
        warnings.warn(msg, BaseReaderWarning)
        return VanillaReader().read(file_path=file_path, **kwargs)

    # Pipeline selection and execution
    pipeline_name, pipeline_args = self._select_pipeline(ext, **kwargs)

    try:
        md = DoclingPipelineFactory.run(
            pipeline_name, str(file_path), **pipeline_args
        )
    except ReaderConfigException:
        raise
    except DoclingBaseError as exc:
        raise DoclingReaderException(
            f"Docling pipeline '{pipeline_name}' failed for '{file_path}': {exc}"
        ) from exc
    except Exception as exc:
        raise DoclingReaderException(
            f"Unexpected error in Docling pipeline '{pipeline_name}' for '{file_path}': {exc}"
        ) from exc

    page_placeholder: str = pipeline_args.get(
        "page_placeholder", DEFAULT_PAGE_PLACEHOLDER
    )
    page_placeholder_value = (
        page_placeholder if page_placeholder and page_placeholder in md else None
    )

    text = md

    return ReaderOutput(
        text=text,
        document_name=os.path.basename(str(file_path)),
        document_path=str(file_path),
        document_id=kwargs.get("document_id", str(uuid.uuid4())),
        conversion_method="markdown",
        reader_method="docling",
        ocr_method=self.model_name,
        page_placeholder=page_placeholder_value,
        metadata=kwargs.get("metadata", {}),
    )

DoclingReader executes its pipelines through a utility class, DoclingUtils.

MarkItDownReader

MarkItDownReader

Bases: BaseReader

Reads multiple file types using Microsoft's MarkItDown library and converts them to Markdown.

This reader serves as a bridge between standard document formats (PDF, DOCX, PPTX, XLSX) and Markdown. It supports two modes of operation:

1. Standard Conversion: Uses native parsers for high-fidelity text extraction.
2. VLM-Enhanced Conversion: Integrates with Vision Language Models (VLMs) via the BaseVisionModel interface to perform LLM-based OCR on images or scanned documents.

Attributes:

Name Type Description
model BaseVisionModel The vision model instance used for OCR/Captioning tasks.
model_name str The identifier of the model (e.g., 'gpt-4o'), used for metadata.
client OpenAI The OpenAI-compatible client extracted from the vision model.

Raises:

Type Description
ReaderConfigException If the provided model uses an incompatible client (non-OpenAI).

Source code in src/splitter_mr/reader/readers/markitdown_reader.py
class MarkItDownReader(BaseReader):
    """Reads multiple file types using Microsoft's MarkItDown library and converts them to Markdown.

    This reader serves as a bridge between standard document formats (PDF, DOCX, PPTX, XLSX)
    and Markdown. It supports two modes of operation:
    1.  **Standard Conversion:** Uses native parsers for high-fidelity text extraction.
    2.  **VLM-Enhanced Conversion:** Integrates with Vision Language Models (VLMs) via the
        `BaseVisionModel` interface to perform LLM-based OCR on images or scanned documents.

    Attributes:
        model (BaseVisionModel): The vision model instance used for OCR/Captioning tasks.
        model_name (str): The identifier of the model (e.g., 'gpt-4o'), used for metadata.
        client (OpenAI): The OpenAI-compatible client extracted from the vision model.

    Raises:
        ReaderConfigException: If the provided model uses an incompatible client (non-OpenAI).
    """

    def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
        """Initializes the MarkItDownReader.

        Args:
            model (Optional[BaseVisionModel], optional): An optional vision-language model
                wrapper. If provided, its underlying client is injected into the MarkItDown
                instance to enable image description and optical character recognition.

        Raises:
            ReaderConfigException: If the `model` provided does not expose an `OpenAI`
                compatible client, or if initialization fails unexpectedly.
        """
        try:
            self.model: Optional[BaseVisionModel] = model
            self.model_name: Optional[str] = model.model_name if self.model else None

            # Pre-validate client compatibility if model is provided
            if self.model:
                client = self.model.get_client()
                if not isinstance(client, OpenAI):
                    raise ReaderConfigException(
                        f"Incompatible client type: {type(client)}. "
                        "MarkItDownReader currently only supports models using the OpenAI client."
                    )
        except Exception as e:
            if isinstance(e, ReaderConfigException):
                raise
            raise ReaderConfigException(
                f"Failed to initialize MarkItDownReader: {str(e)}"
            ) from e

    def _convert_to_pdf(self, file_path: str | Path) -> str:
        """Converts Office documents (DOCX, PPTX, XLSX) to PDF using headless LibreOffice.

        This method acts as a pre-processing step when `split_by_pages=True` is requested for
        office formats. It delegates conversion to the system's installed `soffice` binary.

        Args:
            file_path (str | Path): The path to the source Office file.

        Returns:
            str: The absolute path to the newly created PDF file located in a temporary directory.

        Raises:
            MarkItDownReaderException:
                - If the `soffice` binary is not found in the system PATH.
                - If the subprocess returns a non-zero exit code.
                - If the expected output PDF file was not created.
        """
        if not shutil.which("soffice"):
            raise MarkItDownReaderException(
                "LibreOffice (soffice) is required for Office to PDF conversion but was not found in PATH. "
                "Please install LibreOffice or set split_by_pages=False. "
                "How to install: https://www.libreoffice.org/get-help/install-howto/"
            )

        try:
            outdir: str = tempfile.mkdtemp()
            # Use soffice (LibreOffice) in headless mode
            cmd: list[str] = [
                "soffice",
                "--headless",
                "--convert-to",
                "pdf",
                "--outdir",
                outdir,
                str(file_path),
            ]

            result: CompletedProcess[bytes] = subprocess.run(
                cmd, capture_output=True, check=False
            )

            if result.returncode != 0:
                raise MarkItDownReaderException(
                    f"LibreOffice conversion failed for {file_path}.\n"
                    f"Stderr: {result.stderr.decode() if result.stderr else 'Unknown error'}"
                )

            filename = os.path.basename(file_path)
            pdf_name = os.path.splitext(filename)[0] + ".pdf"
            pdf_path = os.path.join(outdir, pdf_name)

            if not os.path.exists(pdf_path):
                raise MarkItDownReaderException(
                    f"PDF was not created at expected path: {pdf_path}"
                )

            return pdf_path

        except subprocess.SubprocessError as e:
            raise MarkItDownReaderException(
                f"Subprocess error during PDF conversion: {str(e)}"
            ) from e
        except OSError as e:
            raise MarkItDownReaderException(
                f"I/O error during PDF conversion: {str(e)}"
            ) from e

    def _pdf_pages_to_streams(self, pdf_path: str | Path) -> list[io.BytesIO]:
        """Rasterizes PDF pages into in-memory PNG streams using PyMuPDF (fitz).

        This method is preferred when processing speed is prioritized over memory usage.
        It avoids writing intermediate image files to disk.

        Args:
            pdf_path (str | Path): The path to the PDF file.

        Returns:
            list[io.BytesIO]: A list of byte streams, where each stream contains a PNG
                representation of a single PDF page.

        Raises:
            MarkItDownReaderException: If PyMuPDF encounters a corrupted file or fails to render.
        """
        try:
            doc = fitz.open(pdf_path)
            streams: list[io.BytesIO] = []
            for idx in range(len(doc)):
                pix = doc.load_page(idx).get_pixmap()
                buf = io.BytesIO(pix.tobytes("png"))
                buf.name = f"page_{idx + 1}.png"
                buf.seek(0)
                streams.append(buf)
            return streams
        except Exception as e:
            raise MarkItDownReaderException(
                f"Failed to convert PDF pages to image streams via PyMuPDF: {str(e)}"
            ) from e

    def _split_pdf_to_temp_pdfs(self, pdf_path: str | Path) -> list[str]:
        """Splits a multi-page PDF into multiple single-page PDF files on disk.

        This approach is safer than in-memory streams for extremely large documents or
        when the downstream converter requires a physical file path rather than a byte stream.

        Args:
            pdf_path (str | Path): The path to the source PDF.

        Returns:
            list[str]: A list of absolute file paths to the temporary single-page PDFs.

        Note:
            The caller is responsible for cleaning up these temporary files.
            (Handled automatically in `_pdf_file_per_page_to_markdown`).

        Raises:
            MarkItDownReaderException: If `pypdf` fails to read or split the document.
        """
        temp_files: list[str] = []
        try:
            reader = PdfReader(pdf_path)
            for page in reader.pages:
                writer = PdfWriter()
                writer.add_page(page)
                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
                    writer.write(tmp)
                    temp_files.append(tmp.name)
            return temp_files
        except Exception as e:
            # Clean up any files created before the failure
            for f in temp_files:
                if os.path.exists(f):
                    os.remove(f)
            raise MarkItDownReaderException(f"Failed to split PDF: {str(e)}") from e

    def _pdf_pages_to_markdown(
        self, file_path: str, md: "MarkItDown", prompt: str, page_placeholder: str
    ) -> str:
        """Processes a PDF by converting each page to an image stream and then to Markdown.

        This method is typically used when OCR is needed on a per-page basis without
        creating intermediate PDF files.

        Args:
            file_path (str): Path to the source PDF.
            md (MarkItDown): The configured MarkItDown instance.
            prompt (str): The LLM prompt used for image extraction/OCR.
            page_placeholder (str): The string inserted before each page; any `{page}`
                token in it is replaced with the 1-based page number.

        Returns:
            str: The concatenated Markdown content of all pages.

        Raises:
            MarkItDownReaderException: If conversion fails for a specific page.
        """
        # Exceptions here are caught by the calling methods or bubble up as MarkItDownReaderException
        # from the _pdf_pages_to_streams call.
        page_md: list[str] = []
        streams = self._pdf_pages_to_streams(file_path)

        try:
            for idx, page_stream in enumerate(streams, start=1):
                page_md.append(page_placeholder.replace("{page}", str(idx)))
                try:
                    result = md.convert(page_stream, llm_prompt=prompt)
                    page_md.append(result.text_content)
                except Exception as e:
                    raise MarkItDownReaderException(
                        f"MarkItDown conversion failed on page {idx} of PDF image stream: {str(e)}"
                    ) from e
            return "\n".join(page_md)
        finally:
            # Close streams
            for s in streams:
                s.close()

    def _pdf_file_per_page_to_markdown(
        self, file_path: str, md: "MarkItDown", prompt: str, page_placeholder: str
    ) -> str:
        """Processes a PDF by splitting it into temp files and converting each individually.

        Each page is converted from its own temporary file, so a failure can be traced
        to a specific page. Conversion currently fails fast on the first page that raises.

        Args:
            file_path (str): Path to the source PDF.
            md (MarkItDown): The configured MarkItDown instance.
            prompt (str): The LLM prompt used for extraction.
            page_placeholder (str): The string used to separate pages.

        Returns:
            str: The concatenated Markdown content.

        Raises:
            MarkItDownReaderException: If splitting or conversion fails.
        """
        temp_files: list[str] = self._split_pdf_to_temp_pdfs(pdf_path=file_path)
        page_md: list[str] = []

        try:
            for idx, temp_pdf in enumerate(temp_files, start=1):
                page_md.append(page_placeholder.replace("{page}", str(idx)))
                try:
                    result = md.convert(temp_pdf, llm_prompt=prompt)
                    page_md.append(result.text_content)
                except Exception as e:
                    raise MarkItDownReaderException(
                        f"MarkItDown conversion failed on page {idx} (temp file {temp_pdf}): {str(e)}"
                    ) from e
            return "\n".join(page_md)
        finally:
            # Clean up temp files
            for temp_pdf in temp_files:
                try:
                    if os.path.exists(temp_pdf):
                        os.remove(temp_pdf)
                except OSError:
                    pass  # Best effort cleanup

    def _get_markitdown(self) -> tuple["MarkItDown", Optional[str]]:
        """Configures and returns the MarkItDown instance based on available models.

        Returns:
            tuple[MarkItDown, Optional[str]]:
                - A configured `MarkItDown` instance.
                - The name of the OCR model used (if any), or None.

        Raises:
            ReaderConfigException: If the OpenAI client cannot be retrieved from the model.
        """
        if self.model:
            try:
                self.client = self.model.get_client()
                # Double check client in case it changed or wasn't checked in init
                if not isinstance(self.client, OpenAI):
                    raise ValueError("Client must be an instance of OpenAI.")

                return (
                    MarkItDown(llm_client=self.client, llm_model=self.model.model_name),
                    self.model.model_name,
                )
            except Exception as e:
                raise ReaderConfigException(
                    f"Failed to configure MarkItDown with model: {str(e)}"
                ) from e
        else:
            return MarkItDown(), None

    def read(self, file_path: Path | str = None, **kwargs: Any) -> ReaderOutput:
        """Orchestrates the file reading and conversion process.

        This method handles file existence checks, format detection, optional
        Office-to-PDF conversion, and the final Markdown extraction.

        Args:
            file_path (Path | str): The absolute or relative path to the input file.
            **kwargs: Additional configuration parameters:
                - document_id (str, optional): A unique ID for the document. Defaults to UUID.
                - metadata (dict, optional): Metadata to attach to the output.
                - prompt (str, optional): Custom prompt for VLM-based extraction.
                - page_placeholder (str, optional): String to delimit pages
                  (default: DEFAULT_PAGE_PLACEHOLDER).
                - split_by_pages (bool, optional): If True, splits PDFs/Office docs by page
                  before processing. This is useful for granular OCR control but requires
                  LibreOffice for Office files. Defaults to False.

        Returns:
            ReaderOutput: A standardized object containing the extracted text, metadata,
            and processing details.

        Raises:
            ReaderConfigException: If `file_path` is missing.
            MarkItDownReaderException: For file not found, conversion failures, or internal library errors.
            ReaderOutputException: If the final output object cannot be constructed.
        """
        if not file_path:
            raise ReaderConfigException("file_path must be provided.")

        file_path_obj = Path(file_path)
        if not file_path_obj.exists():
            raise MarkItDownReaderException(f"File not found: {file_path}")

        file_path_str: str = os.fspath(file_path_obj)
        ext: str = file_path_obj.suffix.lower().lstrip(".")

        prompt: str = kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
        page_placeholder: str = kwargs.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER)
        split_by_pages: bool = kwargs.get("split_by_pages", False)

        # Determine conversion strategy
        try:
            md, ocr_method = self._get_markitdown()
        except Exception as e:
            raise MarkItDownReaderException(
                f"Failed to initialize MarkItDown instance: {str(e)}"
            ) from e

        PDF_CONVERTIBLE_EXT: set[str] = {"docx", "pptx", "xlsx"}

        # Handle Office -> PDF conversion
        if split_by_pages and ext in PDF_CONVERTIBLE_EXT:
            try:
                file_path_str = self._convert_to_pdf(file_path_str)
            except Exception as e:
                raise MarkItDownReaderException(
                    f"Pre-conversion of {ext} to PDF failed: {str(e)}"
                ) from e

        # Process text
        try:
            if split_by_pages:
                markdown_text: str = self._pdf_file_per_page_to_markdown(
                    file_path=file_path_str,
                    md=md,
                    prompt=prompt,
                    page_placeholder=page_placeholder,
                )
            else:
                result = md.convert(file_path_str, llm_prompt=prompt)
                markdown_text: str = result.text_content
        except MarkItDownReaderException:
            raise  # Re-raise already wrapped exceptions
        except Exception as e:
            raise MarkItDownReaderException(
                f"MarkItDown processing failed for file {file_path_str}: {str(e)}"
            ) from e

        conversion_method = "json" if ext == "json" else "markdown"

        page_placeholder_value: str = (
            page_placeholder
            if (page_placeholder and page_placeholder in markdown_text)
            else None
        )

        # Return output
        try:
            return ReaderOutput(
                text=markdown_text,
                document_name=os.path.basename(file_path_str),
                document_path=file_path_str,
                document_id=kwargs.get("document_id", str(uuid.uuid4())),
                conversion_method=conversion_method,
                reader_method="markitdown",
                ocr_method=ocr_method,
                page_placeholder=page_placeholder_value,
                metadata=kwargs.get("metadata", {}),
            )
        except Exception as e:
            raise ReaderOutputException(
                f"Failed to construct ReaderOutput: {str(e)}"
            ) from e
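The LibreOffice step in `_convert_to_pdf` boils down to building one command line and predicting where the PDF will land. A small sketch of those two pieces follows, assuming the same `soffice` flags as the source above. The helper names `build_soffice_command` and `expected_pdf_path` are hypothetical, and nothing is executed here.

```python
import os
from pathlib import Path
from typing import Union


def build_soffice_command(file_path: Union[str, Path], outdir: str) -> list:
    """Build the headless LibreOffice command used for Office-to-PDF conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        outdir,
        str(file_path),
    ]


def expected_pdf_path(file_path: Union[str, Path], outdir: str) -> str:
    """Return the path where the converted PDF is expected: same stem, '.pdf' suffix."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    return os.path.join(outdir, stem + ".pdf")
```

Checking the expected output path after the subprocess returns matters because a zero exit code alone does not prove that a PDF was written. The source performs the same check before returning the path.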
__init__(model=None)

Initializes the MarkItDownReader.

Parameters:

Name Type Description Default
model Optional[BaseVisionModel] An optional vision-language model wrapper. If provided, its underlying client is injected into the MarkItDown instance to enable image description and optical character recognition. None

Raises:

Type Description
ReaderConfigException If the model provided does not expose an OpenAI compatible client, or if initialization fails unexpectedly.
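The client pre-validation described above can be illustrated without the real dependencies. In this sketch every class is a hypothetical stand-in: `FakeOpenAI` plays the role of the `OpenAI` client, `ConfigError` the role of `ReaderConfigException`, and `FakeVisionModel` the role of a `BaseVisionModel` implementation.

```python
class FakeOpenAI:
    """Hypothetical stand-in for the OpenAI client class."""


class OtherClient:
    """Hypothetical client of an unsupported type."""


class ConfigError(Exception):
    """Hypothetical stand-in for ReaderConfigException."""


class FakeVisionModel:
    """Hypothetical vision model exposing get_client() and model_name."""

    def __init__(self, client, model_name="demo-model"):
        self._client = client
        self.model_name = model_name

    def get_client(self):
        return self._client


def validate_model(model):
    """Return the model name, or None when no model is given; reject other clients."""
    if model is None:
        return None  # Standard conversion: no vision model involved.
    client = model.get_client()
    if not isinstance(client, FakeOpenAI):
        raise ConfigError(f"Incompatible client type: {type(client)}.")
    return model.model_name
```

Validating at construction time means a misconfigured model is reported before any file is read, instead of midway through a conversion.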

Source code in src/splitter_mr/reader/readers/markitdown_reader.py
def __init__(self, model: Optional[BaseVisionModel] = None) -> None:
    """Initializes the MarkItDownReader.

    Args:
        model (Optional[BaseVisionModel], optional): An optional vision-language model
            wrapper. If provided, its underlying client is injected into the MarkItDown
            instance to enable image description and optical character recognition.

    Raises:
        ReaderConfigException: If the `model` provided does not expose an `OpenAI`
            compatible client, or if initialization fails unexpectedly.
    """
    try:
        self.model: Optional[BaseVisionModel] = model
        self.model_name: Optional[str] = model.model_name if self.model else None

        # Pre-validate client compatibility if model is provided
        if self.model:
            client = self.model.get_client()
            if not isinstance(client, OpenAI):
                raise ReaderConfigException(
                    f"Incompatible client type: {type(client)}. "
                    "MarkItDownReader currently only supports models using the OpenAI client."
                )
    except Exception as e:
        if isinstance(e, ReaderConfigException):
            raise
        raise ReaderConfigException(
            f"Failed to initialize MarkItDownReader: {str(e)}"
        ) from e
read(file_path=None, **kwargs)

Orchestrates the file reading and conversion process.

This method handles file existence checks, format detection, optional Office-to-PDF conversion, and the final Markdown extraction.

Parameters:

Name Type Description Default
file_path Path | str The absolute or relative path to the input file. None
**kwargs Any Additional configuration parameters, listed below. {}

- document_id (str, optional): A unique ID for the document. Defaults to a generated UUID.
- metadata (dict, optional): Metadata to attach to the output.
- prompt (str, optional): Custom prompt for VLM-based extraction.
- page_placeholder (str, optional): String to delimit pages (default: DEFAULT_PAGE_PLACEHOLDER).
- split_by_pages (bool, optional): If True, splits PDFs/Office docs by page before processing. This is useful for granular OCR control but requires LibreOffice for Office files. Defaults to False.

Returns:

Name Type Description
ReaderOutput ReaderOutput A standardized object containing the extracted text, metadata, and processing details.

Raises:

Type Description
ReaderConfigException If file_path is missing.
MarkItDownReaderException For file not found, conversion failures, or internal library errors.
ReaderOutputException If the final output object cannot be constructed.
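When `split_by_pages` is enabled, the per-page results are stitched together with the page placeholder, and the placeholder is reported in the output only if it literally appears in the final text. The sketch below mirrors that logic under stated assumptions: `assemble_pages` and `reported_placeholder` are hypothetical helper names, and `[[PAGE]]` is an illustrative placeholder, not the library default.

```python
from typing import Optional


def assemble_pages(pages: list, page_placeholder: str) -> str:
    """Join per-page Markdown, emitting the placeholder before each page."""
    parts = []
    for idx, content in enumerate(pages, start=1):
        # A "{page}" token in the placeholder is replaced with the 1-based page number.
        parts.append(page_placeholder.replace("{page}", str(idx)))
        parts.append(content)
    return "\n".join(parts)


def reported_placeholder(text: str, page_placeholder: str) -> Optional[str]:
    """Return the placeholder only if it occurs verbatim in the text, else None."""
    return page_placeholder if page_placeholder and page_placeholder in text else None
```

One consequence worth noting: a placeholder that contains `{page}` is rewritten for every page, so the raw template no longer occurs in the text and this check yields `None`.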

Source code in src/splitter_mr/reader/readers/markitdown_reader.py
def read(self, file_path: Path | str = None, **kwargs: Any) -> ReaderOutput:
    """Orchestrates the file reading and conversion process.

    This method handles file existence checks, format detection, optional
    Office-to-PDF conversion, and the final Markdown extraction.

    Args:
        file_path (Path | str): The absolute or relative path to the input file.
        **kwargs: Additional configuration parameters:
            - document_id (str, optional): A unique ID for the document. Defaults to UUID.
            - metadata (dict, optional): Metadata to attach to the output.
            - prompt (str, optional): Custom prompt for VLM-based extraction.
            - page_placeholder (str, optional): String to delimit pages
              (default: DEFAULT_PAGE_PLACEHOLDER).
            - split_by_pages (bool, optional): If True, splits PDFs/Office docs by page
              before processing. This is useful for granular OCR control but requires
              LibreOffice for Office files. Defaults to False.

    Returns:
        ReaderOutput: A standardized object containing the extracted text, metadata,
        and processing details.

    Raises:
        ReaderConfigException: If `file_path` is missing.
        MarkItDownReaderException: For file not found, conversion failures, or internal library errors.
        ReaderOutputException: If the final output object cannot be constructed.
    """
    if not file_path:
        raise ReaderConfigException("file_path must be provided.")

    file_path_obj = Path(file_path)
    if not file_path_obj.exists():
        raise MarkItDownReaderException(f"File not found: {file_path}")

    file_path_str: str = os.fspath(file_path_obj)
    ext: str = file_path_obj.suffix.lower().lstrip(".")

    prompt: str = kwargs.get("prompt", DEFAULT_IMAGE_EXTRACTION_PROMPT)
    page_placeholder: str = kwargs.get("page_placeholder", DEFAULT_PAGE_PLACEHOLDER)
    split_by_pages: bool = kwargs.get("split_by_pages", False)

    # Determine conversion strategy
    try:
        md, ocr_method = self._get_markitdown()
    except Exception as e:
        raise MarkItDownReaderException(
            f"Failed to initialize MarkItDown instance: {str(e)}"
        ) from e

    PDF_CONVERTIBLE_EXT: set[str] = {"docx", "pptx", "xlsx"}

    # Handle Office -> PDF conversion
    if split_by_pages and ext in PDF_CONVERTIBLE_EXT:
        try:
            file_path_str = self._convert_to_pdf(file_path_str)
        except Exception as e:
            raise MarkItDownReaderException(
                f"Pre-conversion of {ext} to PDF failed: {str(e)}"
            ) from e

    # Process text
    try:
        if split_by_pages:
            markdown_text: str = self._pdf_file_per_page_to_markdown(
                file_path=file_path_str,
                md=md,
                prompt=prompt,
                page_placeholder=page_placeholder,
            )
        else:
            result = md.convert(file_path_str, llm_prompt=prompt)
            markdown_text: str = result.text_content
    except MarkItDownReaderException:
        raise  # Re-raise already wrapped exceptions
    except Exception as e:
        raise MarkItDownReaderException(
            f"MarkItDown processing failed for file {file_path_str}: {str(e)}"
        ) from e

    conversion_method = "json" if ext == "json" else "markdown"

    page_placeholder_value: str = (
        page_placeholder
        if (page_placeholder and page_placeholder in markdown_text)
        else None
    )

    # Return output
    try:
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path_str),
            document_path=file_path_str,
            document_id=kwargs.get("document_id", str(uuid.uuid4())),
            conversion_method=conversion_method,
            reader_method="markitdown",
            ocr_method=ocr_method,
            page_placeholder=page_placeholder_value,
            metadata=kwargs.get("metadata", {}),
        )
    except Exception as e:
        raise ReaderOutputException(
            f"Failed to construct ReaderOutput: {str(e)}"
        ) from e
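The routing decisions that `read` makes before any conversion happens depend only on the file extension and the `split_by_pages` flag. They can be summarized in a small, dependency-free sketch. `plan_read` is a hypothetical helper written for illustration; it reuses the `PDF_CONVERTIBLE_EXT` values from the source above.

```python
from pathlib import Path

# Same set as in read(): Office formats that are converted to PDF before page splitting.
PDF_CONVERTIBLE_EXT = {"docx", "pptx", "xlsx"}


def plan_read(file_path, split_by_pages=False):
    """Describe which steps read() would take for a given path and flag."""
    ext = Path(file_path).suffix.lower().lstrip(".")
    return {
        "ext": ext,
        # LibreOffice is needed only for Office files when page splitting is requested.
        "convert_to_pdf_first": split_by_pages and ext in PDF_CONVERTIBLE_EXT,
        "per_page": split_by_pages,
        # JSON input is labelled "json"; everything else is reported as "markdown".
        "conversion_method": "json" if ext == "json" else "markdown",
    }
```

For example, `plan_read("slides.pptx")` needs no PDF pre-conversion, while `plan_read("slides.pptx", split_by_pages=True)` does, which is why LibreOffice is only a requirement in the second case.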