Reader

Introduction

The Reader component reads files of many different formats and extensions through a single, uniform interface. All readers share the same parent class, BaseReader.

Which Reader should I use for my project?

Each Reader component extracts document text in a different way, so the most suitable choice depends on your use case.

  • If you want to preserve the original structure as much as possible, without any kind of markdown parsing, use the VanillaReader class.
  • If your documents contain many tables or visual components (such as images), we strongly recommend DoclingReader.
  • If you want to maximize efficiency or simplify conversion to markdown, we recommend the MarkItDownReader component.

Note

Remember to visit the official repositories and guides for the last two reader classes (DoclingReader and MarkItDownReader).

In addition, the file formats supported by each Reader class are listed in the following table:

| Reader | Unstructured files & PDFs | MS Office suite files | Tabular data | Files with hierarchical schema | Image files | Markdown conversion |
| --- | --- | --- | --- | --- | --- | --- |
| VanillaReader | txt, md | xlsx | csv, tsv, parquet | json, yaml, html, xml | - | No |
| MarkItDownReader | txt, md, pdf | docx, xlsx, pptx | csv, tsv | json, html, xml | jpg, png, jpeg | Yes |
| DoclingReader | txt, md, pdf | docx, xlsx, pptx | - | html, xhtml | png, jpeg, tiff, bmp, webp | Yes |
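
The compatibility table can also be read as a lookup from file extension to candidate readers. The following is only a minimal sketch of that idea and is not part of the library: `SUPPORTED_EXTENSIONS` and `readers_for` are hypothetical names, and the extension sets are transcribed by hand from the table, so they may be incomplete.

```python
from pathlib import Path
from typing import List

# Extension sets transcribed from the compatibility table (assumption: may be incomplete).
SUPPORTED_EXTENSIONS = {
    "VanillaReader": {
        "txt", "md", "xlsx", "csv", "tsv", "parquet", "json", "yaml", "html", "xml",
    },
    "MarkItDownReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "csv", "tsv",
        "json", "html", "xml", "jpg", "png",
    },
    "DoclingReader": {
        "txt", "md", "pdf", "docx", "xlsx", "pptx", "html", "xhtml",
        "png", "jpeg", "tiff", "bmp", "webp",
    },
}


def readers_for(file_name: str) -> List[str]:
    """Return the reader names whose table row lists the file's extension."""
    ext = Path(file_name).suffix.lower().lstrip(".")
    return [name for name, exts in SUPPORTED_EXTENSIONS.items() if ext in exts]


print(readers_for("report.parquet"))
print(readers_for("slides.pptx"))
```

A lookup like this only narrows the candidates; the choice between them still depends on the criteria listed earlier (structure preservation, tables and images, markdown conversion).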

Output format

Dataclass defining the output structure for all readers.

Source code in src/splitter_mr/schema/schemas.py
@dataclass
class ReaderOutput:
    """
    Dataclass defining the output structure for all readers.
    """

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = field(default_factory=dict)

    def __post_init__(self):
        if not self.document_id:
            self.document_id = str(uuid.uuid4())

    def to_dict(self):
        return asdict(self)
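
To see how this output structure behaves, here is a minimal sketch that uses a local copy of the fields listed above, so it runs without installing the package. The copy is a stand-in for illustration, not the library's own class.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ReaderOutput:
    """Local stand-in mirroring the fields documented above."""

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = field(default_factory=dict)

    def __post_init__(self):
        # An empty or missing ID is replaced by a random UUID4 string.
        if not self.document_id:
            self.document_id = str(uuid.uuid4())

    def to_dict(self):
        return asdict(self)


first = ReaderOutput(text="hello", document_name="a.txt")
second = ReaderOutput(text="hello", document_name="a.txt")

print(first.document_id != second.document_id)  # each instance gets its own ID
print(sorted(first.to_dict()))                  # field names as plain dict keys
```

Because `metadata` uses `default_factory=dict`, every instance gets its own dictionary instead of sharing one mutable default.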

Readers

BaseReader

Bases: ABC

Abstract base class for all document readers.

This interface defines the contract for file readers that process documents and return a standardized ReaderOutput object containing the extracted text and document-level metadata. Subclasses must implement the read method to handle specific file formats or reading strategies.

Methods:

| Name | Description |
| --- | --- |
| read | Reads the input file and returns a ReaderOutput with text and metadata. |
| is_valid_file_path | Check if a path is valid. |
| is_url | Check if the string provided is a URL. |
| parse_json | Try to parse a JSON object when a dictionary or string is provided. |

Source code in src/splitter_mr/reader/base_reader.py
class BaseReader(ABC):
    """
    Abstract base class for all document readers.

    This interface defines the contract for file readers that process documents and return
    a standardized ReaderOutput containing the extracted text and document-level metadata.
    Subclasses must implement the `read` method to handle specific file formats or reading
    strategies.

    Methods:
        read: Reads the input file and returns a ReaderOutput with text and metadata.
        is_valid_file_path: Check if a path is valid.
        is_url: Check if the string provided is a URL.
        parse_json: Try to parse a JSON object when a dictionary or string is provided.
    """

    @staticmethod
    def is_valid_file_path(path: str) -> bool:
        """
        Checks if the provided string is a valid file path.

        Args:
            path (str): The string to check.

        Returns:
            bool: True if the string is a valid file path to an existing file, False otherwise.

        Example:
            ```python
            BaseReader.is_valid_file_path("/tmp/myfile.txt")
            ```
            ```bash
            True
            ```
        """
        return os.path.isfile(path)

    @staticmethod
    def is_url(string: str) -> bool:
        """
        Determines whether the given string is a valid HTTP or HTTPS URL.

        Args:
            string (str): The string to check.

        Returns:
            bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

        Example:
            ```python
            BaseReader.is_url("https://example.com")
            ```
            ```bash
            True
            ```
            ```python
            BaseReader.is_url("not_a_url")
            ```
            ```bash
            False
            ```
        """
        try:
            result = urlparse(string)
            return all([result.scheme in ("http", "https"), result.netloc])
        except Exception:
            return False

    @staticmethod
    def parse_json(obj: Union[dict, str]) -> dict:
        """
        Attempts to parse the provided object as JSON.

        Args:
            obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
                If a string, attempts to parse it as a JSON string.

        Returns:
            dict: The parsed JSON object.

        Raises:
            ValueError: If a string is provided that cannot be parsed as valid JSON.
            TypeError: If the provided object is neither a dict nor a string.

        Example:
            ```python
            BaseReader.parse_json('{"a": 1}')
            ```
            ```python
            {'a': 1}
            ```
            ```python
            BaseReader.parse_json({'b': 2})
            ```
            ```python
            {'b': 2}
            ```
            ```python
            BaseReader.parse_json('[not valid json]')
            ```
            ```python
            ValueError: String could not be parsed as JSON: ...
            ```
        """
        if isinstance(obj, dict):
            return obj
        if isinstance(obj, str):
            try:
                return json.loads(obj)
            except Exception as e:
                raise ValueError(f"String could not be parsed as JSON: {e}")
        raise TypeError("Provided object is not a string or dictionary")

    @abstractmethod
    def read(
        self, file_path: str, model: Optional[BaseModel] = None, **kwargs: Any
    ) -> ReaderOutput:
        """
        Reads input and returns a ReaderOutput with text content and standardized metadata.

        Args:
            file_path (str): Path to the input file, a URL, raw string, or dictionary.
            model (Optional[BaseModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
            **kwargs: Additional keyword arguments for implementation-specific options.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Raises:
            ValueError: If the provided string is not a valid file path, URL, or parsable content.
            TypeError: If input type is unsupported.

        Example:
            ```python
            class MyReader(BaseReader):
                def read(self, file_path: str, **kwargs) -> ReaderOutput:
                    return ReaderOutput(
                        text="example",
                        document_name="example.txt",
                        document_path=file_path,
                        document_id=kwargs.get("document_id"),
                        conversion_method="custom",
                        ocr_method=None,
                        metadata={}
                    )
            ```
        """
        pass
is_valid_file_path(path) staticmethod

Checks if the provided string is a valid file path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | The string to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the string is a valid file path to an existing file, False otherwise. |

Example

BaseReader.is_valid_file_path("/tmp/myfile.txt")
True
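
Since the check relies on `os.path.isfile`, a directory or a missing path is reported as invalid. The sketch below illustrates this with a temporary directory; the module-level function is a standalone copy of the static method, written only for this example.

```python
import os
import tempfile


def is_valid_file_path(path: str) -> bool:
    # Same check as the static method documented above.
    return os.path.isfile(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "myfile.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("hello")

    results = {
        "existing file": is_valid_file_path(file_path),
        "directory": is_valid_file_path(tmp_dir),
        "missing file": is_valid_file_path(os.path.join(tmp_dir, "nope.txt")),
    }

print(results)
```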

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_valid_file_path(path: str) -> bool:
    """
    Checks if the provided string is a valid file path.

    Args:
        path (str): The string to check.

    Returns:
        bool: True if the string is a valid file path to an existing file, False otherwise.

    Example:
        ```python
        BaseReader.is_valid_file_path("/tmp/myfile.txt")
        ```
        ```bash
        True
        ```
    """
    return os.path.isfile(path)
is_url(string) staticmethod

Determines whether the given string is a valid HTTP or HTTPS URL.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| string | str | The string to check. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| bool | bool | True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise. |

Example

BaseReader.is_url("https://example.com")
True
BaseReader.is_url("not_a_url")
False
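
The check requires both an `http`/`https` scheme and a network location, so other schemes and bare host names are rejected. A minimal sketch with a standalone copy of the method (the sample inputs are illustrative only):

```python
from urllib.parse import urlparse


def is_url(string: str) -> bool:
    # Standalone copy of the static method documented above.
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False


samples = [
    "https://example.com/data/test_1.pdf",  # scheme and host present
    "ftp://example.com/file.txt",           # scheme is not http/https
    "http://",                              # scheme but no host
    "example.com",                          # no scheme
    "/tmp/myfile.txt",                      # local path
]
checks = {s: is_url(s) for s in samples}
print(checks)
```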

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def is_url(string: str) -> bool:
    """
    Determines whether the given string is a valid HTTP or HTTPS URL.

    Args:
        string (str): The string to check.

    Returns:
        bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

    Example:
        ```python
        BaseReader.is_url("https://example.com")
        ```
        ```bash
        True
        ```
        ```python
        BaseReader.is_url("not_a_url")
        ```
        ```bash
        False
        ```
    """
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False
parse_json(obj) staticmethod

Attempts to parse the provided object as JSON.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| obj | Union[dict, str] | The object to parse. If a dict, returns it as-is. If a string, attempts to parse it as a JSON string. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| dict | dict | The parsed JSON object. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If a string is provided that cannot be parsed as valid JSON. |
| TypeError | If the provided object is neither a dict nor a string. |

Example

BaseReader.parse_json('{"a": 1}')
{'a': 1}
BaseReader.parse_json({'b': 2})
{'b': 2}
BaseReader.parse_json('[not valid json]')
ValueError: String could not be parsed as JSON: ...
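
The three outcomes (dict passed through, string parsed, error raised) can be exercised with a standalone copy of the method. This is a sketch for illustration; `describe` is a hypothetical helper that turns exceptions into their class names so every case can be printed.

```python
import json
from typing import Union


def parse_json(obj: Union[dict, str]) -> dict:
    # Standalone copy of the static method documented above.
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")


def describe(value):
    """Hypothetical helper: return the parsed value or the exception name."""
    try:
        return parse_json(value)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__


print(describe('{"a": 1}'))           # JSON string is parsed
print(describe({"b": 2}))             # dict is returned as-is
print(describe("[not valid json]"))   # invalid JSON string
print(describe(42))                   # neither dict nor str
print(describe("[1, 2]"))             # valid JSON that is not an object
```

Note that a string holding a JSON array is parsed successfully and comes back as a list, even though the annotated return type is `dict`.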

Source code in src/splitter_mr/reader/base_reader.py
@staticmethod
def parse_json(obj: Union[dict, str]) -> dict:
    """
    Attempts to parse the provided object as JSON.

    Args:
        obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
            If a string, attempts to parse it as a JSON string.

    Returns:
        dict: The parsed JSON object.

    Raises:
        ValueError: If a string is provided that cannot be parsed as valid JSON.
        TypeError: If the provided object is neither a dict nor a string.

    Example:
        ```python
        BaseReader.parse_json('{"a": 1}')
        ```
        ```python
        {'a': 1}
        ```
        ```python
        BaseReader.parse_json({'b': 2})
        ```
        ```python
        {'b': 2}
        ```
        ```python
        BaseReader.parse_json('[not valid json]')
        ```
        ```python
        ValueError: String could not be parsed as JSON: ...
        ```
    """
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")
read(file_path, model=None, **kwargs) abstractmethod

Reads input and returns a ReaderOutput with text content and standardized metadata.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | str | Path to the input file, a URL, raw string, or dictionary. | required |
| model | Optional[BaseModel] | Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content. | None |
| **kwargs | Any | Additional keyword arguments for implementation-specific options. | {} |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ReaderOutput | ReaderOutput | Dataclass defining the output structure for all readers. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the provided string is not a valid file path, URL, or parsable content. |
| TypeError | If input type is unsupported. |

Example
class MyReader(BaseReader):
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(
            text="example",
            document_name="example.txt",
            document_path=file_path,
            document_id=kwargs.get("document_id"),
            conversion_method="custom",
            ocr_method=None,
            metadata={}
        )
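
The abstract-method contract can be tried out without the package. The following sketch uses reduced stand-ins for `BaseReader` and `ReaderOutput` (only a few fields, no helper methods), and `UpperCaseReader` is a hypothetical subclass invented for this example.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ReaderOutput:
    """Reduced stand-in for the real output dataclass."""

    text: str = ""
    document_name: Optional[str] = None
    document_path: str = ""
    reader_method: Optional[str] = None


class BaseReader(ABC):
    """Stand-in base class exposing only the abstract read method."""

    @abstractmethod
    def read(self, file_path: str, **kwargs: Any) -> ReaderOutput:
        ...


class UpperCaseReader(BaseReader):
    """Hypothetical reader that upper-cases the content of a text file."""

    def read(self, file_path: str, **kwargs: Any) -> ReaderOutput:
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read().upper()
        return ReaderOutput(
            text=text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            reader_method="upper",
        )


# The abstract base class cannot be instantiated directly.
try:
    BaseReader()
    base_is_instantiable = True
except TypeError:
    base_is_instantiable = False

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello")
    output = UpperCaseReader().read(path)

print(base_is_instantiable, output.text, output.document_name)
```

Any subclass that omits `read` fails at instantiation time in the same way as the base class, which is what enforces the contract.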
Source code in src/splitter_mr/reader/base_reader.py
@abstractmethod
def read(
    self, file_path: str, model: Optional[BaseModel] = None, **kwargs: Any
) -> ReaderOutput:
    """
    Reads input and returns a ReaderOutput with text content and standardized metadata.

    Args:
        file_path (str): Path to the input file, a URL, raw string, or dictionary.
        model (Optional[BaseModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
        **kwargs: Additional keyword arguments for implementation-specific options.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Raises:
        ValueError: If the provided string is not a valid file path, URL, or parsable content.
        TypeError: If input type is unsupported.

    Example:
        ```python
        class MyReader(BaseReader):
            def read(self, file_path: str, **kwargs) -> ReaderOutput:
                return ReaderOutput(
                    text="example",
                    document_name="example.txt",
                    document_path=file_path,
                    document_id=kwargs.get("document_id"),
                    conversion_method="custom",
                    ocr_method=None,
                    metadata={}
                )
        ```
    """
    pass

📚 Note: file examples are taken from the data folder in the GitHub repository.

VanillaReader

SimpleHTMLTextExtractor

Bases: HTMLParser

Extracts the text content of an HTML document, discarding the tags.

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class SimpleHTMLTextExtractor(HTMLParser):
    """Extract the text content of an HTML document, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()
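
A short usage sketch of this extractor, with the class repeated so the snippet runs on its own. The HTML input is made up for illustration.

```python
from html.parser import HTMLParser


class SimpleHTMLTextExtractor(HTMLParser):
    """Copy of the class above: collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Called by HTMLParser for each run of text between tags.
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()


parser = SimpleHTMLTextExtractor()
parser.feed("<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>")
parser.close()  # flush any buffered trailing text

text = parser.get_text()
print(text.split())
```

Because the fragments are joined with a single space, whitespace in the result may differ from the source (for example, a space appears between "world" and "!"), so splitting on whitespace is a convenient way to compare results.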
VanillaReader

Bases: BaseReader

Read multiple file types using Python's built-in and standard libraries. Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images, with options to show or omit images, and to annotate images using a vision model.

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
class VanillaReader(BaseReader):
    """
    Read multiple file types using Python's built-in and standard libraries.
    Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

    For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images,
    with options to show or omit images, and to annotate images using a vision model.
    """

    def __init__(self, model: Optional[BaseModel] = None):
        super().__init__()
        self.model = model

    def read(self, file_path: Any = None, **kwargs: Any) -> ReaderOutput:
        """
        Reads a document from various sources and returns its text content along with standardized metadata.

        This method supports reading from:
            - Local file paths (file_path, or as a positional argument)
            - URLs (file_url)
            - JSON/dict objects (json_document)
            - Raw text strings (text_document)
        If multiple sources are provided, the following priority is used: file_path, file_url,
        json_document, text_document.
        If only file_path is provided, the method will attempt to automatically detect if the value is
        a path, URL, JSON, YAML, or plain text.

        Args:
            file_path (str, optional): Path to the input file.
            **kwargs:
                file_path (str, optional): Path to the input file (overrides positional argument).
                file_url (str, optional): URL to read the document from.
                json_document (dict or str, optional): Dictionary or JSON string containing document content.
                text_document (str, optional): Raw text or string content of the document.
                show_images (bool, optional): If True, images in PDFs are shown inline as base64 PNG.
                    If False (the default used by this method), images are omitted (or annotated if a model is provided).
                model (BaseModel, optional): Vision model for image annotation/captioning.
                prompt (str, optional): Custom prompt for image captioning.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Raises:
            ValueError: If the provided source is not valid or supported, or if file/URL/JSON detection fails.
            TypeError: If provided arguments are of unsupported types.

        Notes:
            - PDF extraction now supports image captioning/omission indicators.
            - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

        Example:
            ```python
            from splitter_mr.readers import VanillaReader
            from splitter_mr.models import AzureOpenAIVisionModel

            model = AzureOpenAIVisionModel()
            reader = VanillaReader(model=model)
            output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
            print(output.text)
            ```
            ```bash
            \\n---\\n## Page 1\\n---\\n\\nMultiRAG Project – Splitter\\nMultiRAG | Splitter\\nLorem ipsum dolor sit amet, ...
            ```
        """

        SOURCE_PRIORITY = [
            "file_path",
            "file_url",
            "json_document",
            "text_document",
        ]

        # Pick the highest-priority source provided
        document_source = None
        source_type = None
        for key in SOURCE_PRIORITY:
            if key in kwargs and kwargs[key] is not None:
                document_source = kwargs[key]
                source_type = key
                break

        if document_source is None:
            document_source = file_path
            source_type = "file_path"

        document_name = kwargs.get("document_name")
        document_path = None
        conversion_method = None

        # --- 1. File path or default
        if source_type == "file_path":
            if not isinstance(document_source, str):
                raise ValueError("file_path must be a string.")

            if self.is_valid_file_path(document_source):
                ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
                document_name = os.path.basename(document_source)
                document_path = os.path.relpath(document_source)

                if ext == "pdf":
                    pdf_reader = PDFPlumberReader()
                    model = kwargs.get("model", self.model)
                    if model is not None:
                        text = pdf_reader.read(
                            document_source,
                            model=model,
                            prompt=kwargs.get("prompt"),
                            show_images=kwargs.get("show_images", False),
                        )
                        # use the **actual** model that was passed in
                        ocr_method = model.model_name
                    else:
                        text = pdf_reader.read(
                            document_source,
                            show_images=kwargs.get("show_images", False),
                        )
                        conversion_method = "pdf"
                elif ext in (
                    "json",
                    "html",
                    "txt",
                    "xml",
                    "csv",
                    "tsv",
                    "md",
                    "markdown",
                ):
                    with open(document_source, "r", encoding="utf-8") as f:
                        text = f.read()
                    conversion_method = ext
                elif ext == "parquet":
                    df = pd.read_parquet(document_source)
                    text = df.to_csv(index=False)
                    conversion_method = "csv"
                elif ext in ("yaml", "yml"):
                    with open(document_source, "r", encoding="utf-8") as f:
                        yaml_text = f.read()
                    text = yaml.safe_load(yaml_text)
                    conversion_method = "json"
                elif ext in ("xlsx", "xls"):
                    text = str(
                        pd.read_excel(document_source, engine="openpyxl").to_csv()
                    )
                    conversion_method = ext
                elif ext in LANGUAGES:
                    with open(document_source, "r", encoding="utf-8") as f:
                        text = f.read()
                    conversion_method = "txt"
                else:
                    raise ValueError(
                        f"Unsupported file extension: {ext}. Use another Reader component."
                    )

            # (2) URL
            elif self.is_url(document_source):
                ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
                response = requests.get(document_source)
                response.raise_for_status()
                document_name = document_source.split("/")[-1] or "downloaded_file"
                document_path = document_source
                conversion_method = ext
                content_type = response.headers.get("Content-Type", "")

                if "application/json" in content_type or document_name.endswith(
                    ".json"
                ):
                    text = response.json()
                elif "text/html" in content_type or document_name.endswith(".html"):
                    parser = SimpleHTMLTextExtractor()
                    parser.feed(response.text)
                    text = parser.get_text()
                elif "text/yaml" in content_type or document_name.endswith(
                    (".yaml", ".yml")
                ):
                    text = yaml.safe_load(response.text)
                    conversion_method = "json"
                elif "text/csv" in content_type or document_name.endswith(".csv"):
                    text = response.text
                else:
                    text = response.text

            # (3) JSON/dict string
            else:
                try:
                    text = self.parse_json(document_source)
                    conversion_method = "json"
                except Exception:
                    try:
                        text = yaml.safe_load(document_source)
                        conversion_method = "json"
                    except Exception:
                        text = document_source
                        conversion_method = "txt"

        # --- 2. Explicit URL
        elif source_type == "file_url":
            ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
            if not isinstance(document_source, str) or not self.is_url(document_source):
                raise ValueError("file_url must be a valid URL string.")
            response = requests.get(document_source)
            response.raise_for_status()
            document_name = document_source.split("/")[-1] or "downloaded_file"
            document_path = document_source
            conversion_method = ext
            content_type = response.headers.get("Content-Type", "")

            if "application/json" in content_type or document_name.endswith(".json"):
                text = response.json()
            elif "text/html" in content_type or document_name.endswith(".html"):
                parser = SimpleHTMLTextExtractor()
                parser.feed(response.text)
                text = parser.get_text()
            elif "text/yaml" in content_type or document_name.endswith(
                (".yaml", ".yml")
            ):
                text = yaml.safe_load(response.text)
                conversion_method = "json"
            elif "text/csv" in content_type or document_name.endswith(".csv"):
                text = response.text
            else:
                text = response.text

        # --- 3. Explicit JSON
        elif source_type == "json_document":
            document_name = kwargs.get("document_name", None)
            document_path = None
            text = self.parse_json(document_source)
            conversion_method = "json"

        # --- 4. Explicit text
        elif source_type == "text_document":
            document_name = kwargs.get("document_name", None)
            document_path = None
            try:
                parsed = self.parse_json(document_source)
                # Only treat as JSON if result is dict or list, not a string!
                if isinstance(parsed, (dict, list)):
                    text = parsed
                    conversion_method = "json"
                else:
                    raise ValueError  # Force fallback
            except Exception:
                try:
                    parsed = yaml.safe_load(document_source)
                    # Only treat as YAML if it returns a dict or list
                    if isinstance(parsed, (dict, list)):
                        text = parsed
                        conversion_method = "json"
                    else:
                        raise ValueError
                except Exception:
                    text = document_source
                    conversion_method = "txt"

        else:
            raise ValueError(f"Unrecognized document source: {source_type}")

        metadata = kwargs.get("metadata", {})
        document_id = kwargs.get("document_id") or str(uuid.uuid4())
        ocr_method = kwargs.get("ocr_method")

        return ReaderOutput(
            text=text,
            document_name=document_name,
            document_path=document_path,
            document_id=document_id,
            conversion_method=conversion_method,
            reader_method="vanilla",
            ocr_method=ocr_method,
            metadata=metadata,
        )
read(file_path=None, **kwargs)

Reads a document from various sources and returns its text content along with standardized metadata.

This method supports reading from:
  • Local file paths (file_path, or as a positional argument)
  • URLs (file_url)
  • JSON/dict objects (json_document)
  • Raw text strings (text_document)

If multiple sources are provided, the following priority is used: file_path, file_url, json_document, text_document. If only file_path is provided, the method will attempt to automatically detect if the value is a path, URL, JSON, YAML, or plain text.
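
The source-selection step can be sketched in isolation. The function below follows the loop in the source listing; `pick_source` is a hypothetical name used only here, and the sample values are placeholders.

```python
from typing import Any, Optional, Tuple

SOURCE_PRIORITY = ["file_path", "file_url", "json_document", "text_document"]


def pick_source(file_path: Any = None, **kwargs: Any) -> Tuple[Optional[str], Any]:
    """Return (source_type, document_source) using the documented priority."""
    for key in SOURCE_PRIORITY:
        if kwargs.get(key) is not None:
            return key, kwargs[key]
    # No keyword source found: fall back to the file_path parameter.
    return "file_path", file_path


print(pick_source("notes.txt"))
print(pick_source(file_url="https://example.com/a.json", text_document="hi"))
print(pick_source(json_document={"a": 1}))
```

One consequence worth knowing: because `file_path` is a named parameter, it never appears inside `kwargs`, so in this sketch (as in the listing it follows) a `file_url`, `json_document`, or `text_document` keyword is picked before a `file_path` value passed alongside it.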

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| file_path | str | Path to the input file. | None |
| **kwargs | Any | Additional options, listed below. | {} |

Keyword arguments accepted through **kwargs:
  • file_path (str, optional): Path to the input file (overrides positional argument).
  • file_url (str, optional): URL to read the document from.
  • json_document (dict or str, optional): Dictionary or JSON string containing document content.
  • text_document (str, optional): Raw text or string content of the document.
  • show_images (bool, optional): If True, images in PDFs are shown inline as base64 PNG. If False (the default used in the source listing), images are omitted (or annotated if a model is provided).
  • model (BaseModel, optional): Vision model for image annotation/captioning.
  • prompt (str, optional): Custom prompt for image captioning.

Returns:

| Name | Type | Description |
| --- | --- | --- |
| ReaderOutput | ReaderOutput | Dataclass defining the output structure for all readers. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the provided source is not valid or supported, or if file/URL/JSON detection fails. |
| TypeError | If provided arguments are of unsupported types. |

Notes
  • PDF extraction now supports image captioning/omission indicators.
  • For .parquet files, content is loaded via pandas and returned as CSV-formatted text.
Example

from splitter_mr.readers import VanillaReader
from splitter_mr.models import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel()
reader = VanillaReader(model=model)
output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
print(output.text)
\n---\n## Page 1\n---\n\nMultiRAG Project – Splitter\nMultiRAG | Splitter\nLorem ipsum dolor sit amet, ...

Source code in src/splitter_mr/reader/readers/vanilla_reader.py
def read(self, file_path: Any = None, **kwargs: Any) -> ReaderOutput:
    """
    Reads a document from various sources and returns its text content along with standardized metadata.

    This method supports reading from:
        - Local file paths (file_path, or as a positional argument)
        - URLs (file_url)
        - JSON/dict objects (json_document)
        - Raw text strings (text_document)
    If multiple sources are provided, the following priority is used: file_path, file_url,
    json_document, text_document.
    If only file_path is provided, the method will attempt to automatically detect if the value is
    a path, URL, JSON, YAML, or plain text.

    Args:
        file_path (str, optional): Path to the input file.
        **kwargs:
            file_url (str, optional): URL to read the document from.
            json_document (dict or str, optional): Dictionary or JSON string containing document content.
            text_document (str, optional): Raw text or string content of the document.
            show_images (bool, optional): If True, images in PDFs are shown inline as base64 PNG.
                If False (default), images are omitted (or annotated if a model is provided).
            model (BaseModel, optional): Vision model for image annotation/captioning.
            prompt (str, optional): Custom prompt for image captioning.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Raises:
        ValueError: If the provided source is not valid or supported, or if file/URL/JSON detection fails.
        TypeError: If provided arguments are of unsupported types.

    Notes:
        - PDF extraction now supports image captioning/omission indicators.
        - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

    Example:
        ```python
        from splitter_mr.reader import VanillaReader
        from splitter_mr.model import AzureOpenAIVisionModel

        model = AzureOpenAIVisionModel()
        reader = VanillaReader(model=model)
        output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
        print(output.text)
        ```
        ```bash
        \\n---\\n## Page 1\\n---\\n\\nMultiRAG Project – Splitter\\nMultiRAG | Splitter\\nLorem ipsum dolor sit amet, ...
        ```
    """

    SOURCE_PRIORITY = [
        "file_path",
        "file_url",
        "json_document",
        "text_document",
    ]

    # Pick the highest-priority source provided. `file_path` is a named
    # parameter, so it never appears in kwargs and is checked first.
    document_source = file_path
    source_type = "file_path"
    if document_source is None:
        for key in SOURCE_PRIORITY[1:]:
            if kwargs.get(key) is not None:
                document_source = kwargs[key]
                source_type = key
                break

    document_name = kwargs.get("document_name")
    document_path = None
    conversion_method = None

    # --- 1. File path or default
    if source_type == "file_path":
        if not isinstance(document_source, str):
            raise ValueError("file_path must be a string.")

        if self.is_valid_file_path(document_source):
            ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
            document_name = os.path.basename(document_source)
            document_path = os.path.relpath(document_source)

            if ext == "pdf":
                pdf_reader = PDFPlumberReader()
                model = kwargs.get("model", self.model)
                if model is not None:
                    text = pdf_reader.read(
                        document_source,
                        model=model,
                        prompt=kwargs.get("prompt"),
                        show_images=kwargs.get("show_images", False),
                    )
                    # Record the model that was actually used, unless the
                    # caller supplied an explicit ocr_method
                    kwargs.setdefault("ocr_method", model.model_name)
                    conversion_method = "pdf"
                else:
                    text = pdf_reader.read(
                        document_source,
                        show_images=kwargs.get("show_images", False),
                    )
                    conversion_method = "pdf"
            elif ext in (
                "json",
                "html",
                "txt",
                "xml",
                "csv",
                "tsv",
                "md",
                "markdown",
            ):
                with open(document_source, "r", encoding="utf-8") as f:
                    text = f.read()
                conversion_method = ext
            elif ext == "parquet":
                df = pd.read_parquet(document_source)
                text = df.to_csv(index=False)
                conversion_method = "csv"
            elif ext in ("yaml", "yml"):
                with open(document_source, "r", encoding="utf-8") as f:
                    yaml_text = f.read()
                text = yaml.safe_load(yaml_text)
                conversion_method = "json"
            elif ext in ("xlsx", "xls"):
                text = str(
                    pd.read_excel(document_source, engine="openpyxl").to_csv()
                )
                conversion_method = ext
            elif ext in LANGUAGES:
                with open(document_source, "r", encoding="utf-8") as f:
                    text = f.read()
                conversion_method = "txt"
            else:
                raise ValueError(
                    f"Unsupported file extension: {ext}. Use another Reader component."
                )

        # (2) URL
        elif self.is_url(document_source):
            ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
            response = requests.get(document_source)
            response.raise_for_status()
            document_name = document_source.split("/")[-1] or "downloaded_file"
            document_path = document_source
            conversion_method = ext
            content_type = response.headers.get("Content-Type", "")

            if "application/json" in content_type or document_name.endswith(
                ".json"
            ):
                text = response.json()
            elif "text/html" in content_type or document_name.endswith(".html"):
                parser = SimpleHTMLTextExtractor()
                parser.feed(response.text)
                text = parser.get_text()
            elif "text/yaml" in content_type or document_name.endswith(
                (".yaml", ".yml")
            ):
                text = yaml.safe_load(response.text)
                conversion_method = "json"
            elif "text/csv" in content_type or document_name.endswith(".csv"):
                text = response.text
            else:
                text = response.text

        # (3) JSON/dict string
        else:
            try:
                text = self.parse_json(document_source)
                conversion_method = "json"
            except Exception:
                try:
                    text = yaml.safe_load(document_source)
                    conversion_method = "json"
                except Exception:
                    text = document_source
                    conversion_method = "txt"

    # --- 2. Explicit URL
    elif source_type == "file_url":
        # Validate the type before any string operation is applied to the source
        if not isinstance(document_source, str) or not self.is_url(document_source):
            raise ValueError("file_url must be a valid URL string.")
        ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
        response = requests.get(document_source)
        response.raise_for_status()
        document_name = document_source.split("/")[-1] or "downloaded_file"
        document_path = document_source
        conversion_method = ext
        content_type = response.headers.get("Content-Type", "")

        if "application/json" in content_type or document_name.endswith(".json"):
            text = response.json()
        elif "text/html" in content_type or document_name.endswith(".html"):
            parser = SimpleHTMLTextExtractor()
            parser.feed(response.text)
            text = parser.get_text()
        elif "text/yaml" in content_type or document_name.endswith(
            (".yaml", ".yml")
        ):
            text = yaml.safe_load(response.text)
            conversion_method = "json"
        elif "text/csv" in content_type or document_name.endswith(".csv"):
            text = response.text
        else:
            text = response.text

    # --- 3. Explicit JSON
    elif source_type == "json_document":
        document_name = kwargs.get("document_name", None)
        document_path = None
        text = self.parse_json(document_source)
        conversion_method = "json"

    # --- 4. Explicit text
    elif source_type == "text_document":
        document_name = kwargs.get("document_name", None)
        document_path = None
        try:
            parsed = self.parse_json(document_source)
            # Only treat as JSON if result is dict or list, not a string!
            if isinstance(parsed, (dict, list)):
                text = parsed
                conversion_method = "json"
            else:
                raise ValueError  # Force fallback
        except Exception:
            try:
                parsed = yaml.safe_load(document_source)
                # Only treat as YAML if it returns a dict or list
                if isinstance(parsed, (dict, list)):
                    text = parsed
                    conversion_method = "json"
                else:
                    raise ValueError
            except Exception:
                text = document_source
                conversion_method = "txt"

    else:
        raise ValueError(f"Unrecognized document source: {source_type}")

    metadata = kwargs.get("metadata", {})
    document_id = kwargs.get("document_id") or str(uuid.uuid4())
    ocr_method = kwargs.get("ocr_method")

    return ReaderOutput(
        text=text,
        document_name=document_name,
        document_path=document_path,
        document_id=document_id,
        conversion_method=conversion_method,
        reader_method="vanilla",
        ocr_method=ocr_method,
        metadata=metadata,
    )
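
When only `file_path` is given, the code above tries a local path first, then a URL, then structured text, and finally falls back to plain text. The sketch below mirrors that order using only the standard library. `classify_source` is a hypothetical helper: the real checks live in `is_valid_file_path`, `is_url` and `parse_json`, whose implementations are not shown on this page, and the YAML step is left out because it needs PyYAML.

```python
import json
import os
from urllib.parse import urlparse


def classify_source(value: str) -> str:
    """Classify a string as 'path', 'url', 'json' or 'text'.

    Hypothetical helper for illustration only; it is not part of splitter_mr.
    """
    # 1. An existing local file wins.
    if os.path.isfile(value):
        return "path"

    # 2. Then anything that looks like an HTTP(S) URL.
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"

    # 3. Then JSON, accepted only when it decodes to a dict or a list.
    try:
        decoded = json.loads(value)
    except ValueError:
        return "text"
    # Bare scalars such as "42" are valid JSON but are kept as plain text.
    return "json" if isinstance(decoded, (dict, list)) else "text"
```

Restricting JSON to dicts and lists follows the explicit `text_document` branch above, which applies the same test before treating parsed content as structured.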

DoclingReader

DoclingReader

Bases: BaseReader

Read multiple file types using IBM's Docling library, and convert the documents into markdown or JSON format.

Source code in src/splitter_mr/reader/readers/docling_reader.py
class DoclingReader(BaseReader):
    """
    Read multiple file types using IBM's Docling library, and convert the documents
    into markdown or JSON format.
    """

    SUPPORTED_EXTENSIONS = (
        "pdf",
        "docx",
        "html",
        "md",
        "markdown",
        "htm",
        "pptx",
        "xlsx",
        "odt",
        "rtf",
        "jpg",
        "jpeg",
        "png",
        "bmp",
        "gif",
        "tiff",
    )

    def __init__(self, model: Optional[BaseModel] = None):
        self.model = model
        self.model_name = None
        if self.model is not None:
            self.client = self.model.get_client()
            for attr in ["_azure_deployment", "_azure_endpoint", "_api_version"]:
                setattr(self, attr, getattr(self.client, attr, None))
            self.api_key = self.client.api_key
            self.model_name = self.model.model_name

    def _get_vlm_url_and_headers(self, client: Any) -> Tuple[str, Dict[str, str]]:
        """
        Returns VLM API URL and headers based on model type.
        """
        if isinstance(client, AzureOpenAI):
            url = f"{self._azure_endpoint}/openai/deployments/{self._azure_deployment}/chat/completions?api-version={self._api_version}"
            headers = {"Authorization": f"Bearer {client.api_key}"}
        elif isinstance(client, OpenAI):
            url = "https://api.openai.com/v1/chat/completions"
            headers = {"Authorization": f"Bearer {client.api_key}"}
        else:
            raise ValueError(f"Unknown client type: {type(client)}")
        return url, headers

    def _make_docling_reader(self, prompt: str, timeout: int = 60) -> DocumentConverter:
        """
        Returns a configured DocumentConverter with VLM pipeline options for OpenAI or Azure.
        """
        url, headers = self._get_vlm_url_and_headers(self.client)
        vlm_options = ApiVlmOptions(
            url=url,
            params={"model": self.model_name},
            headers=headers,
            prompt=prompt,
            timeout=timeout,
            response_format=ResponseFormat.MARKDOWN,
        )
        pipeline_options = VlmPipelineOptions(
            enable_remote_services=True,
            vlm_options=vlm_options,
        )
        reader = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_cls=VlmPipeline,
                    pipeline_options=pipeline_options,
                )
            }
        )
        return reader

    def read(
        self,
        file_path: str,
        prompt: str = "Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format",
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Reads and converts a document to Markdown format using the
        [Docling](https://github.com/docling-project/docling) library, supporting a wide range
        of file types including PDF, DOCX, HTML, and images.

        This method leverages Docling's advanced document parsing capabilities—including layout
        and table detection, code and formula extraction, and integrated OCR—to produce clean,
        markdown-formatted output for downstream processing. The output includes standardized
        metadata and can be easily integrated into generative AI or information retrieval pipelines.

        Args:
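            file_path (str): Path to the input file to be read and converted.
            prompt (str): Instruction sent to the vision model when the reader was
                created with a model; unused otherwise.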
            **kwargs:
                document_id (Optional[str]): Unique document identifier.
                    If not provided, a UUID will be generated.
                conversion_method (Optional[str]): Name or description of the
                    conversion method used. Default is None.
                ocr_method (Optional[str]): OCR method applied (if any).
                    Default is None.
                metadata (Optional[List[str]]): Additional metadata as a list of strings.
                    Default is an empty list.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Example:
            ```python
            from splitter_mr.reader import DoclingReader

            reader = DoclingReader()
            result = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
            print(result.text)
            ```
            ```bash
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """
        # Check if the extension is valid
        ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
        if ext not in self.SUPPORTED_EXTENSIONS:
            print(
                f"Warning: File extension not compatible: {ext}. Fallback to VanillaReader."
            )
            return VanillaReader().read(file_path=file_path, **kwargs)

        if self.model is not None:
            reader = self._make_docling_reader(prompt)
        else:
            reader = DocumentConverter()

        # Read and convert to markdown
        text = reader.convert(file_path)
        markdown_text = text.document.export_to_markdown()

        # Return output
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id") or str(uuid.uuid4()),
            conversion_method="markdown",
            reader_method="docling",
            ocr_method=self.model_name,
            metadata=kwargs.get("metadata"),
        )
read(file_path, prompt='Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format', **kwargs)

Reads and converts a document to Markdown format using the Docling library, supporting a wide range of file types including PDF, DOCX, HTML, and images.

This method leverages Docling's advanced document parsing capabilities—including layout and table detection, code and formula extraction, and integrated OCR—to produce clean, markdown-formatted output for downstream processing. The output includes standardized metadata and can be easily integrated into generative AI or information retrieval pipelines.

Parameters:

Name Type Description Default
file_path str

Path to the input file to be read and converted.

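required
prompt str

Instruction sent to the vision model when the reader was created with a model; unused otherwise.

'Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format'
**kwargs Any

  • document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated.
  • conversion_method (Optional[str]): Name or description of the conversion method used. Default is None.
  • ocr_method (Optional[str]): OCR method applied (if any). Default is None.
  • metadata (Optional[List[str]]): Additional metadata as a list of strings. Default is an empty list.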

{}

Returns:

Name Type Description
ReaderOutput ReaderOutput

Dataclass defining the output structure for all readers.

Example

from splitter_mr.reader import DoclingReader

reader = DoclingReader()
result = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
print(result.text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...
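
Before converting anything, `read` compares the file extension with `SUPPORTED_EXTENSIONS` and hands unsupported files to `VanillaReader`. The sketch below isolates that decision. `choose_reader` is a hypothetical helper, and the tuple is only a subset of the extensions listed in the class source.

```python
import os

# Subset of DoclingReader.SUPPORTED_EXTENSIONS, shortened for the example.
SUPPORTED_EXTENSIONS = ("pdf", "docx", "html", "md", "pptx", "xlsx", "png")


def choose_reader(file_path: str) -> str:
    """Return 'docling' for supported extensions, otherwise 'vanilla'.

    Hypothetical helper for illustration only; it is not part of splitter_mr.
    """
    # Same normalisation as in read(): take the suffix, lowercase it, drop the dot.
    ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
    return "docling" if ext in SUPPORTED_EXTENSIONS else "vanilla"
```

Because the extension is lowercased first, `report.PDF` and `report.pdf` are treated the same way, while a file without any extension always takes the fallback route.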


MarkItDownReader

MarkItDownReader

Bases: BaseReader

Read multiple file types using Microsoft's MarkItDown library, and convert the documents using markdown format.

This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs) for LLM-based OCR when extracting text from images or scanned documents.

Currently, only the following VLMs are supported:
  • OpenAIVisionModel
  • AzureOpenAIVisionModel

If a compatible model is provided, MarkItDown will leverage the specified VLM for OCR, and the model's name will be recorded as the OCR method used.
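
The model check and the recorded OCR method can be sketched with stand-in classes. Everything in this block is hypothetical: the three classes only imitate the `model_name` attribute of the real vision models, and `resolve_ocr_method` is not a function of the library.

```python
from typing import Optional


class OpenAIVisionModel:
    """Stand-in for the real class; only model_name is imitated."""
    model_name = "stand-in-openai-model"


class AzureOpenAIVisionModel:
    """Stand-in for the real class; only model_name is imitated."""
    model_name = "stand-in-azure-model"


class UnsupportedModel:
    """Stand-in for any other client type."""
    model_name = "other"


def resolve_ocr_method(model: Optional[object]) -> Optional[str]:
    """Return the name recorded as OCR method, or None when no model is given."""
    if model is None:
        # Plain MarkItDown conversion: no OCR method is recorded.
        return None
    if not isinstance(model, (OpenAIVisionModel, AzureOpenAIVisionModel)):
        raise ValueError(
            "Incompatible client. Only AzureOpenAIVisionModel and "
            "OpenAIVisionModel are supported."
        )
    return model.model_name
```

Rejecting unknown model types in the constructor means a configuration mistake surfaces when the reader is created rather than during the first conversion.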

Notes
  • This method uses MarkItDown to convert a wide variety of file formats (e.g., PDF, DOCX, images, HTML, CSV) to Markdown.
  • If document_id is not provided, a UUID will be automatically assigned.
  • If metadata is not provided, an empty list will be used.
  • MarkItDown should be installed with all relevant optional dependencies for full file format support.
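
The `document_id` default mentioned in the notes comes down to one expression. The sketch below wraps it in a hypothetical helper, `resolve_document_id`, to make the behaviour easy to try out.

```python
import uuid
from typing import Any


def resolve_document_id(**kwargs: Any) -> str:
    """Return the caller's document_id, or a fresh UUID4 string when it is missing.

    Hypothetical helper for illustration only; it is not part of splitter_mr.
    """
    # `or` also replaces falsy values such as None or an empty string.
    return kwargs.get("document_id") or str(uuid.uuid4())
```

Note that an empty string is replaced as well, since `or` treats it the same as a missing value.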
Source code in src/splitter_mr/reader/readers/markitdown_reader.py
class MarkItDownReader(BaseReader):
    """
    Read multiple file types using Microsoft's MarkItDown library, and convert
    the documents using markdown format.

    This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs)
    for LLM-based OCR when extracting text from images or scanned documents.

    Currently, only the following VLMs are supported:
        - OpenAIVisionModel
        - AzureOpenAIVisionModel

    If a compatible model is provided, MarkItDown will leverage the specified VLM for OCR, and the
    model's name will be recorded as the OCR method used.

    Notes:
        - This method uses [MarkItDown](https://github.com/microsoft/markitdown) to convert
            a wide variety of file formats (e.g., PDF, DOCX, images, HTML, CSV) to Markdown.
        - If `document_id` is not provided, a UUID will be automatically assigned.
        - If `metadata` is not provided, an empty list will be used.
        - MarkItDown should be installed with all relevant optional dependencies for full
            file format support.
    """

    def __init__(
        self, model: Optional[Union[AzureOpenAIVisionModel, OpenAIVisionModel]] = None
    ):
        self.model = model
        self.model_name = None

        if model is not None:
            if not isinstance(model, (OpenAIVisionModel, AzureOpenAIVisionModel)):
                raise ValueError(
                    "Incompatible client. Only AzureOpenAIVisionModel and OpenAIVisionModel are supported."
                )
            client = model.get_client()
            self.model_name = self.model.model_name
            self.md = MarkItDown(llm_client=client, llm_model=self.model_name)
        else:
            self.md = MarkItDown()

    def read(self, file_path: str, **kwargs: Any) -> ReaderOutput:
        """
        Reads a file and converts its contents to Markdown using MarkItDown, returning
        structured metadata.

        Args:
            file_path (str): Path to the input file to be read and converted.
            **kwargs:
                document_id (Optional[str]): Unique document identifier.
                    If not provided, a UUID will be generated.
                conversion_method (Optional[str]): Name or description of the
                    conversion method used. Default is None.
                ocr_method (Optional[str]): OCR method applied (if any).
                    Default is None.
                metadata (Optional[List[str]]): Additional metadata as a list of strings.
                    Default is an empty list.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Example:
            ```python
            from splitter_mr.reader import MarkItDownReader
            from splitter_mr.model import OpenAIVisionModel # Or AzureOpenAIVisionModel

            openai = OpenAIVisionModel()  # make sure the required environment variables are set in `.env`

            reader = MarkItDownReader(model=openai)
            result = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
            print(result.text)
            ```
            ```bash
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """
        # Read using MarkItDown
        markdown_text = self.md.convert(file_path).text_content
        ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
        conversion_method = "json" if ext == "json" else "markdown"

        # Return output
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id") or str(uuid.uuid4()),
            conversion_method=conversion_method,
            reader_method="markitdown",
            ocr_method=self.model_name,
            metadata=kwargs.get("metadata"),
        )
read(file_path, **kwargs)

Reads a file and converts its contents to Markdown using MarkItDown, returning structured metadata.

Parameters:

Name Type Description Default
file_path str

Path to the input file to be read and converted.

required
**kwargs Any

  • document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated.
  • conversion_method (Optional[str]): Name or description of the conversion method used. Default is None.
  • ocr_method (Optional[str]): OCR method applied (if any). Default is None.
  • metadata (Optional[List[str]]): Additional metadata as a list of strings. Default is an empty list.

{}

Returns:

Name Type Description
ReaderOutput ReaderOutput

Dataclass defining the output structure for all readers.

Example

from splitter_mr.reader import MarkItDownReader
from splitter_mr.model import OpenAIVisionModel # Or AzureOpenAIVisionModel

openai = OpenAIVisionModel()  # make sure the required environment variables are set in `.env`

reader = MarkItDownReader(model=openai)
result = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
print(result.text)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...
