Reader¶

Introduction¶

The Reader component reads files of many different formats and extensions through a uniform interface. All readers share the same parent class, BaseReader.

Which Reader should I use for my project?¶

Each Reader component extracts document text in different ways. Therefore, choosing the most suitable Reader component depends on your use case.

If you want to preserve the original structure as much as possible, without any kind of markdown parsing, you can use the VanillaReader class.
If your documents contain many tables or visual components (such as images), we strongly recommend DoclingReader.
If you want to maximize efficiency or simplify conversion to Markdown, we recommend the MarkItDownReader component.

Note

Remember to visit the official repository and guides for the last two reader classes:

Docling Developer guide
MarkItDown GitHub repository.

The following table lists the file formats supported by each Reader class:

Reader	Unstructured files & PDFs	MS Office suite files	Tabular data	Files with hierarchical schema	Image files	Markdown conversion
`VanillaReader`	`txt`, `md`, `pdf`	`xlsx`	`csv`, `tsv`, `parquet`	`json`, `yaml`, `html`, `xml`	-	No
`MarkItDownReader`	`txt`, `md`, `pdf`	`docx`, `xlsx`, `pptx`	`csv`, `tsv`	`json`, `html`, `xml`	`jpg`, `png`, `jpeg`	Yes
`DoclingReader`	`txt`, `md`, `pdf`	`docx`, `xlsx`, `pptx`	-	`html`, `xhtml`	`png`, `jpeg`, `tiff`, `bmp`, `webp`	Yes
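
As a rough illustration of how the table can guide the choice of reader, the sketch below maps a file extension to the readers that list it. The `SUPPORTED` dictionary and the `readers_for` helper are hypothetical, cover only a few of the extensions in the table, and are not part of the library.

```python
import os
from typing import List

# Partial, hypothetical transcription of the compatibility table:
# reader name -> a few of the extensions listed in its row.
SUPPORTED = {
    "VanillaReader": {"txt", "md", "csv", "tsv", "parquet", "json", "yaml", "xml"},
    "MarkItDownReader": {"txt", "md", "docx", "pptx", "csv", "tsv", "json", "xml"},
    "DoclingReader": {"txt", "md", "docx", "pptx", "xhtml", "tiff", "bmp", "webp"},
}


def readers_for(path: str) -> List[str]:
    """Return the names of the readers whose row lists the file's extension."""
    ext = os.path.splitext(path)[-1].lower().lstrip(".")
    return [name for name, exts in SUPPORTED.items() if ext in exts]


print(readers_for("report.docx"))     # ['MarkItDownReader', 'DoclingReader']
print(readers_for("events.parquet"))  # ['VanillaReader']
print(readers_for("scan.tiff"))       # ['DoclingReader']
```

When several readers qualify, the guidance above (structure preservation, tables and images, or Markdown conversion) decides between them.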

Output format¶

Dataclass defining the output structure for all readers.

Source code in src/splitter_mr/schema/schemas.py

@dataclass
class ReaderOutput:
    """
    Dataclass defining the output structure for all readers.
    """

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = field(default_factory=dict)

    def __post_init__(self):
        if not self.document_id:
            self.document_id = str(uuid.uuid4())

    def to_dict(self):
        return asdict(self)
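
To see how `__post_init__` and `to_dict` behave, the sketch below re-declares the dataclass locally, so that it runs without installing the package, and creates two outputs. The sample field values are made up for illustration.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ReaderOutput:
    """Local copy of the dataclass above, for illustration only."""

    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = field(default_factory=dict)

    def __post_init__(self):
        # Generate an ID only when the caller did not supply one
        if not self.document_id:
            self.document_id = str(uuid.uuid4())

    def to_dict(self):
        return asdict(self)


auto = ReaderOutput(text="hello", document_name="a.txt")
fixed = ReaderOutput(text="hello", document_id="doc-1")

print(len(auto.document_id))        # 36: a UUID4 string was generated
print(fixed.document_id)            # doc-1: an explicit ID is kept
print(auto.to_dict()["metadata"])   # {}: each instance gets its own empty dict
```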

Readers¶

BaseReader¶

`BaseReader` ¶

Bases: ABC

Abstract base class for all document readers.

This interface defines the contract for file readers that process documents and return a standardized ReaderOutput containing the extracted text and document-level metadata. Subclasses must implement the read method to handle specific file formats or reading strategies.

Methods:

Name	Description
`read`	Reads the input file and returns a ReaderOutput with text and metadata.
`is_valid_file_path`	Check if a path is valid.
`is_url`	Check if the string provided is a URL.
`parse_json`	Try to parse a JSON object when a dictionary or string is provided.

Source code in src/splitter_mr/reader/base_reader.py

class BaseReader(ABC):
    """
    Abstract base class for all document readers.

    This interface defines the contract for file readers that process documents and return
    a standardized ReaderOutput containing the extracted text and document-level metadata.
    Subclasses must implement the `read` method to handle specific file formats or reading
    strategies.

    Methods:
        read: Reads the input file and returns a ReaderOutput with text and metadata.
        is_valid_file_path: Check if a path is valid.
        is_url: Check if the string provided is a URL.
        parse_json: Try to parse a JSON object when a dictionary or string is provided.
    """

    @staticmethod
    def is_valid_file_path(path: str) -> bool:
        """
        Checks if the provided string is a valid file path.

        Args:
            path (str): The string to check.

        Returns:
            bool: True if the string is a valid file path to an existing file, False otherwise.

        Example:
            ```python
            BaseReader.is_valid_file_path("/tmp/myfile.txt")
            ```
            ```bash
            True
            ```
        """
        return os.path.isfile(path)

    @staticmethod
    def is_url(string: str) -> bool:
        """
        Determines whether the given string is a valid HTTP or HTTPS URL.

        Args:
            string (str): The string to check.

        Returns:
            bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

        Example:
            ```python
            BaseReader.is_url("https://example.com")
            ```
            ```bash
            True
            ```
            ```python
            BaseReader.is_url("not_a_url")
            ```
            ```bash
            False
            ```
        """
        try:
            result = urlparse(string)
            return all([result.scheme in ("http", "https"), result.netloc])
        except Exception:
            return False

    @staticmethod
    def parse_json(obj: Union[dict, str]) -> dict:
        """
        Attempts to parse the provided object as JSON.

        Args:
            obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
                If a string, attempts to parse it as a JSON string.

        Returns:
            dict: The parsed JSON object.

        Raises:
            ValueError: If a string is provided that cannot be parsed as valid JSON.
            TypeError: If the provided object is neither a dict nor a string.

        Example:
            ```python
            BaseReader.parse_json('{"a": 1}')
            ```
            ```python
            {'a': 1}
            ```
            ```python
            BaseReader.parse_json({'b': 2})
            ```
            ```python
            {'b': 2}
            ```
            ```python
            BaseReader.parse_json('[not valid json]')
            ```
            ```python
            ValueError: String could not be parsed as JSON: ...
            ```
        """
        if isinstance(obj, dict):
            return obj
        if isinstance(obj, str):
            try:
                return json.loads(obj)
            except Exception as e:
                raise ValueError(f"String could not be parsed as JSON: {e}")
        raise TypeError("Provided object is not a string or dictionary")

    @abstractmethod
    def read(
        self, file_path: str, model: Optional[BaseModel] = None, **kwargs: Any
    ) -> ReaderOutput:
        """
        Reads input and returns a ReaderOutput with text content and standardized metadata.

        Args:
            file_path (str): Path to the input file, a URL, raw string, or dictionary.
            model (Optional[BaseModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
            **kwargs: Additional keyword arguments for implementation-specific options.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Raises:
            ValueError: If the provided string is not a valid file path, URL, or parsable content.
            TypeError: If input type is unsupported.

        Example:
            ```python
            class MyReader(BaseReader):
                def read(self, file_path: str, **kwargs) -> ReaderOutput:
                    return ReaderOutput(
                        text="example",
                        document_name="example.txt",
                        document_path=file_path,
                        document_id=kwargs.get("document_id"),
                        conversion_method="custom",
                        ocr_method=None,
                        metadata={}
                    )
            ```
        """
        pass

`is_valid_file_path(path)` `staticmethod` ¶

Checks if the provided string is a valid file path.

Parameters:

Name	Type	Description	Default
`path`	`str`	The string to check.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if the string is a valid file path to an existing file, False otherwise.

Example

BaseReader.is_valid_file_path("/tmp/myfile.txt")

True
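
Because the check delegates to `os.path.isfile`, a directory or a missing path yields False. The sketch below reproduces the one-line check as a standalone function (an assumption made so the snippet runs without the package) and tries it against temporary paths.

```python
import os
import tempfile


def is_valid_file_path(path: str) -> bool:
    # Same check as BaseReader.is_valid_file_path
    return os.path.isfile(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "note.txt")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("hello")

    existing = is_valid_file_path(file_path)   # a regular file
    directory = is_valid_file_path(tmp_dir)    # a directory, not a file
    missing = is_valid_file_path(os.path.join(tmp_dir, "absent.txt"))

print(existing, directory, missing)  # True False False
```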

Source code in src/splitter_mr/reader/base_reader.py

@staticmethod
def is_valid_file_path(path: str) -> bool:
    """
    Checks if the provided string is a valid file path.

    Args:
        path (str): The string to check.

    Returns:
        bool: True if the string is a valid file path to an existing file, False otherwise.

    Example:
        ```python
        BaseReader.is_valid_file_path("/tmp/myfile.txt")
        ```
        ```bash
        True
        ```
    """
    return os.path.isfile(path)

`is_url(string)` `staticmethod` ¶

Determines whether the given string is a valid HTTP or HTTPS URL.

Parameters:

Name	Type	Description	Default
`string`	`str`	The string to check.	required

Returns:

Name	Type	Description
`bool`	`bool`	True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

Example

BaseReader.is_url("https://example.com")

True

BaseReader.is_url("not_a_url")

False
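
A few edge cases may help clarify what counts as a valid URL here. The sketch below is a standalone copy of the method body, so it runs without the package; the sample URLs are illustrative.

```python
from urllib.parse import urlparse


def is_url(string: str) -> bool:
    # Standalone copy of the BaseReader.is_url logic
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False


print(is_url("https://example.com/data/file.pdf"))  # True
print(is_url("ftp://example.com/file.pdf"))         # False: scheme is not http(s)
print(is_url("https:///file.pdf"))                  # False: empty network location
print(is_url("example.com/file.pdf"))               # False: no scheme
```

The check is purely syntactic: it does not contact the host or verify that the resource exists.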

Source code in src/splitter_mr/reader/base_reader.py

@staticmethod
def is_url(string: str) -> bool:
    """
    Determines whether the given string is a valid HTTP or HTTPS URL.

    Args:
        string (str): The string to check.

    Returns:
        bool: True if the string is a valid URL with HTTP or HTTPS scheme, False otherwise.

    Example:
        ```python
        BaseReader.is_url("https://example.com")
        ```
        ```bash
        True
        ```
        ```python
        BaseReader.is_url("not_a_url")
        ```
        ```bash
        False
        ```
    """
    try:
        result = urlparse(string)
        return all([result.scheme in ("http", "https"), result.netloc])
    except Exception:
        return False

`parse_json(obj)` `staticmethod` ¶

Attempts to parse the provided object as JSON.

Parameters:

Name	Type	Description	Default
`obj`	`Union[dict, str]`	The object to parse. If a dict, returns it as-is. If a string, attempts to parse it as a JSON string.	required

Returns:

Name	Type	Description
`dict`	`dict`	The parsed JSON object.

Raises:

Type	Description
`ValueError`	If a string is provided that cannot be parsed as valid JSON.
`TypeError`	If the provided object is neither a dict nor a string.

Example

BaseReader.parse_json('{"a": 1}')

{'a': 1}

BaseReader.parse_json({'b': 2})

{'b': 2}

BaseReader.parse_json('[not valid json]')

ValueError: String could not be parsed as JSON: ...
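
The sketch below is a standalone copy of the method body, runnable without the package, that exercises each branch. One detail worth noting: since the string branch returns whatever `json.loads` produces, a JSON array string comes back as a list even though the return annotation says `dict`.

```python
import json
from typing import Union


def parse_json(obj: Union[dict, str]) -> dict:
    # Standalone copy of the BaseReader.parse_json logic
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")


print(parse_json('{"a": 1}'))   # {'a': 1}
print(parse_json("[1, 2, 3]"))  # [1, 2, 3]: a list, despite the dict annotation

errors = []
for bad in ("[not valid json]", 42):
    try:
        parse_json(bad)
    except (ValueError, TypeError) as exc:
        errors.append(type(exc).__name__)

print(errors)  # ['ValueError', 'TypeError']
```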

Source code in src/splitter_mr/reader/base_reader.py

@staticmethod
def parse_json(obj: Union[dict, str]) -> dict:
    """
    Attempts to parse the provided object as JSON.

    Args:
        obj (Union[dict, str]): The object to parse. If a dict, returns it as-is.
            If a string, attempts to parse it as a JSON string.

    Returns:
        dict: The parsed JSON object.

    Raises:
        ValueError: If a string is provided that cannot be parsed as valid JSON.
        TypeError: If the provided object is neither a dict nor a string.

    Example:
        ```python
        BaseReader.parse_json('{"a": 1}')
        ```
        ```python
        {'a': 1}
        ```
        ```python
        BaseReader.parse_json({'b': 2})
        ```
        ```python
        {'b': 2}
        ```
        ```python
        BaseReader.parse_json('[not valid json]')
        ```
        ```python
        ValueError: String could not be parsed as JSON: ...
        ```
    """
    if isinstance(obj, dict):
        return obj
    if isinstance(obj, str):
        try:
            return json.loads(obj)
        except Exception as e:
            raise ValueError(f"String could not be parsed as JSON: {e}")
    raise TypeError("Provided object is not a string or dictionary")

`read(file_path, model=None, **kwargs)` `abstractmethod` ¶

Reads input and returns a ReaderOutput with text content and standardized metadata.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the input file, a URL, raw string, or dictionary.	required
`model`	`Optional[BaseModel]`	Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.	`None`
`**kwargs`	`Any`	Additional keyword arguments for implementation-specific options.	`{}`

Returns:

Name	Type	Description
`ReaderOutput`	`ReaderOutput`	Dataclass defining the output structure for all readers.

Raises:

Type	Description
`ValueError`	If the provided string is not a valid file path, URL, or parsable content.
`TypeError`	If input type is unsupported.

Example

class MyReader(BaseReader):
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(
            text="example",
            document_name="example.txt",
            document_path=file_path,
            document_id=kwargs.get("document_id"),
            conversion_method="custom",
            ocr_method=None,
            metadata={}
        )
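
To make the abstract-method contract concrete, the sketch below uses minimal stand-ins for `BaseReader` and `ReaderOutput` (assumptions, so that the snippet runs without the package) together with a hypothetical `UpperCaseReader`. It shows that the base class cannot be instantiated until `read` is implemented.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ReaderOutput:
    """Minimal stand-in with only two of the real fields."""

    text: str = ""
    document_path: str = ""


class BaseReader(ABC):
    """Minimal stand-in exposing only the abstract `read` method."""

    @abstractmethod
    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        ...


class UpperCaseReader(BaseReader):
    """Hypothetical reader that returns the input string upper-cased."""

    def read(self, file_path: str, **kwargs) -> ReaderOutput:
        return ReaderOutput(text=file_path.upper(), document_path=file_path)


try:
    BaseReader()  # abstract classes cannot be instantiated
    instantiated = True
except TypeError:
    instantiated = False

output = UpperCaseReader().read("notes.txt")
print(instantiated)  # False
print(output.text)   # NOTES.TXT
```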

Source code in src/splitter_mr/reader/base_reader.py

@abstractmethod
def read(
    self, file_path: str, model: Optional[BaseModel] = None, **kwargs: Any
) -> ReaderOutput:
    """
    Reads input and returns a ReaderOutput with text content and standardized metadata.

    Args:
        file_path (str): Path to the input file, a URL, raw string, or dictionary.
        model (Optional[BaseModel]): Optional model instance to assist or customize the reading or extraction process. Used for cases where VLMs or specialized parsers are required for processing the file content.
        **kwargs: Additional keyword arguments for implementation-specific options.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Raises:
        ValueError: If the provided string is not a valid file path, URL, or parsable content.
        TypeError: If input type is unsupported.

    Example:
        ```python
        class MyReader(BaseReader):
            def read(self, file_path: str, **kwargs) -> ReaderOutput:
                return ReaderOutput(
                    text="example",
                    document_name="example.txt",
                    document_path=file_path,
                    document_id=kwargs.get("document_id"),
                    conversion_method="custom",
                    ocr_method=None,
                    metadata={}
                )
        ```
    """
    pass

📚 Note: file examples are extracted from the data folder in the GitHub repository: link.

VanillaReader¶

`SimpleHTMLTextExtractor` ¶

Bases: HTMLParser

Extracts the text content from an HTML document, discarding the tags.

Source code in src/splitter_mr/reader/readers/vanilla_reader.py

class SimpleHTMLTextExtractor(HTMLParser):
    """Extracts the text content from an HTML document, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()
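
A short usage sketch, with the class copied locally so that it runs without the package, shows what the extractor returns. The HTML snippet is made up. Because the collected text fragments are joined with a single space, inline tags can introduce extra spaces in the result.

```python
from html.parser import HTMLParser


class SimpleHTMLTextExtractor(HTMLParser):
    """Local copy of the class above, for illustration only."""

    def __init__(self):
        super().__init__()
        self.text_parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.text_parts.append(data)

    def get_text(self):
        return " ".join(self.text_parts).strip()


parser = SimpleHTMLTextExtractor()
parser.feed("<h1>Title</h1><p>Hello <b>world</b>!</p>")

print(parser.text_parts)        # ['Title', 'Hello ', 'world', '!']
print(repr(parser.get_text()))  # 'Title Hello  world !'
```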

`VanillaReader` ¶

Bases: BaseReader

Read multiple file types using Python's built-in and standard libraries. Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images, with options to show or omit images, and to annotate images using a vision model.

Source code in src/splitter_mr/reader/readers/vanilla_reader.py

class VanillaReader(BaseReader):
    """
    Read multiple file types using Python's built-in and standard libraries.
    Supported: .json, .html, .txt, .xml, .yaml/.yml, .csv, .tsv, .parquet, .pdf

    For PDFs, this reader uses PDFPlumberReader to extract text, tables, and images,
    with options to show or omit images, and to annotate images using a vision model.
    """

    def __init__(self, model: Optional[BaseModel] = None):
        super().__init__()
        self.model = model

    def read(self, file_path: Any = None, **kwargs: Any) -> ReaderOutput:
        """
        Reads a document from various sources and returns its text content along with standardized metadata.

        This method supports reading from:
            - Local file paths (file_path, or as a positional argument)
            - URLs (file_url)
            - JSON/dict objects (json_document)
            - Raw text strings (text_document)
        If multiple sources are provided, the following priority is used: file_path, file_url,
        json_document, text_document.
        If only file_path is provided, the method will attempt to automatically detect if the value is
        a path, URL, JSON, YAML, or plain text.

        Args:
            file_path (str, optional): Path to the input file.
            **kwargs:
                file_path (str, optional): Path to the input file (overrides positional argument).
                file_url (str, optional): URL to read the document from.
                json_document (dict or str, optional): Dictionary or JSON string containing document content.
                text_document (str, optional): Raw text or string content of the document.
                show_images (bool, optional): If True, images in PDFs are shown inline as base64 PNG.
                    If False (default), images are omitted (or annotated if a model is provided).
                model (BaseModel, optional): Vision model for image annotation/captioning.
                prompt (str, optional): Custom prompt for image captioning.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Raises:
            ValueError: If the provided source is not valid or supported, or if file/URL/JSON detection fails.
            TypeError: If provided arguments are of unsupported types.

        Notes:
            - PDF extraction now supports image captioning/omission indicators.
            - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

        Example:
            ```python
            from splitter_mr.readers import VanillaReader
            from splitter_mr.models import AzureOpenAIVisionModel

            model = AzureOpenAIVisionModel()
            reader = VanillaReader(model=model)
            output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
            print(output.text)
            ```
            ```bash
            \\n---\\n## Page 1\\n---\\n\\nMultiRAG Project – Splitter\\nMultiRAG | Splitter\\nLorem ipsum dolor sit amet, ...
            ```
        """

        SOURCE_PRIORITY = [
            "file_path",
            "file_url",
            "json_document",
            "text_document",
        ]

        # Pick the highest-priority source provided
        document_source = None
        source_type = None
        for key in SOURCE_PRIORITY:
            if key in kwargs and kwargs[key] is not None:
                document_source = kwargs[key]
                source_type = key
                break

        if document_source is None:
            document_source = file_path
            source_type = "file_path"

        document_name = kwargs.get("document_name")
        document_path = None
        conversion_method = None

        # --- 1. File path or default
        if source_type == "file_path":
            if not isinstance(document_source, str):
                raise ValueError("file_path must be a string.")

            if self.is_valid_file_path(document_source):
                ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
                document_name = os.path.basename(document_source)
                document_path = os.path.relpath(document_source)

                if ext == "pdf":
                    pdf_reader = PDFPlumberReader()
                    model = kwargs.get("model", self.model)
                    if model is not None:
                        text = pdf_reader.read(
                            document_source,
                            model=model,
                            prompt=kwargs.get("prompt"),
                            show_images=kwargs.get("show_images", False),
                        )
                        # use the **actual** model that was passed in
                        ocr_method = model.model_name
                    else:
                        text = pdf_reader.read(
                            document_source,
                            show_images=kwargs.get("show_images", False),
                        )
                        conversion_method = "pdf"
                elif ext in (
                    "json",
                    "html",
                    "txt",
                    "xml",
                    "csv",
                    "tsv",
                    "md",
                    "markdown",
                ):
                    with open(document_source, "r", encoding="utf-8") as f:
                        text = f.read()
                    conversion_method = ext
                elif ext == "parquet":
                    df = pd.read_parquet(document_source)
                    text = df.to_csv(index=False)
                    conversion_method = "csv"
                elif ext in ("yaml", "yml"):
                    with open(document_source, "r", encoding="utf-8") as f:
                        yaml_text = f.read()
                    text = yaml.safe_load(yaml_text)
                    conversion_method = "json"
                elif ext in ("xlsx", "xls"):
                    text = str(
                        pd.read_excel(document_source, engine="openpyxl").to_csv()
                    )
                    conversion_method = ext
                elif ext in LANGUAGES:
                    with open(document_source, "r", encoding="utf-8") as f:
                        text = f.read()
                    conversion_method = "txt"
                else:
                    raise ValueError(
                        f"Unsupported file extension: {ext}. Use another Reader component."
                    )

            # (2) URL
            elif self.is_url(document_source):
                ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
                response = requests.get(document_source)
                response.raise_for_status()
                document_name = document_source.split("/")[-1] or "downloaded_file"
                document_path = document_source
                conversion_method = ext
                content_type = response.headers.get("Content-Type", "")

                if "application/json" in content_type or document_name.endswith(
                    ".json"
                ):
                    text = response.json()
                elif "text/html" in content_type or document_name.endswith(".html"):
                    parser = SimpleHTMLTextExtractor()
                    parser.feed(response.text)
                    text = parser.get_text()
                elif "text/yaml" in content_type or document_name.endswith(
                    (".yaml", ".yml")
                ):
                    text = yaml.safe_load(response.text)
                    conversion_method = "json"
                elif "text/csv" in content_type or document_name.endswith(".csv"):
                    text = response.text
                else:
                    text = response.text

            # (3) JSON/dict string
            else:
                try:
                    text = self.parse_json(document_source)
                    conversion_method = "json"
                except Exception:
                    try:
                        text = yaml.safe_load(document_source)
                        conversion_method = "json"
                    except Exception:
                        text = document_source
                        conversion_method = "txt"

        # --- 2. Explicit URL
        elif source_type == "file_url":
            ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
            if not isinstance(document_source, str) or not self.is_url(document_source):
                raise ValueError("file_url must be a valid URL string.")
            response = requests.get(document_source)
            response.raise_for_status()
            document_name = document_source.split("/")[-1] or "downloaded_file"
            document_path = document_source
            conversion_method = ext
            content_type = response.headers.get("Content-Type", "")

            if "application/json" in content_type or document_name.endswith(".json"):
                text = response.json()
            elif "text/html" in content_type or document_name.endswith(".html"):
                parser = SimpleHTMLTextExtractor()
                parser.feed(response.text)
                text = parser.get_text()
            elif "text/yaml" in content_type or document_name.endswith(
                (".yaml", ".yml")
            ):
                text = yaml.safe_load(response.text)
                conversion_method = "json"
            elif "text/csv" in content_type or document_name.endswith(".csv"):
                text = response.text
            else:
                text = response.text

        # --- 3. Explicit JSON
        elif source_type == "json_document":
            document_name = kwargs.get("document_name", None)
            document_path = None
            text = self.parse_json(document_source)
            conversion_method = "json"

        # --- 4. Explicit text
        elif source_type == "text_document":
            document_name = kwargs.get("document_name", None)
            document_path = None
            try:
                parsed = self.parse_json(document_source)
                # Only treat as JSON if result is dict or list, not a string!
                if isinstance(parsed, (dict, list)):
                    text = parsed
                    conversion_method = "json"
                else:
                    raise ValueError  # Force fallback
            except Exception:
                try:
                    parsed = yaml.safe_load(document_source)
                    # Only treat as YAML if it returns a dict or list
                    if isinstance(parsed, (dict, list)):
                        text = parsed
                        conversion_method = "json"
                    else:
                        raise ValueError
                except Exception:
                    text = document_source
                    conversion_method = "txt"

        else:
            raise ValueError(f"Unrecognized document source: {source_type}")

        metadata = kwargs.get("metadata", {})
        document_id = kwargs.get("document_id") or str(uuid.uuid4())
        ocr_method = kwargs.get("ocr_method")

        return ReaderOutput(
            text=text,
            document_name=document_name,
            document_path=document_path,
            document_id=document_id,
            conversion_method=conversion_method,
            reader_method="vanilla",
            ocr_method=ocr_method,
            metadata=metadata,
        )
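
The extension detection that `read` applies to local paths and URLs can be tried in isolation. The sketch below wraps the same expression in a hypothetical `extension_of` helper; the sample inputs are illustrative. For URLs, the result includes any query string, which may be worth keeping in mind when relying on `conversion_method`.

```python
import os


def extension_of(source: str) -> str:
    # Same expression that `read` applies to file paths and URLs
    return os.path.splitext(source)[-1].lower().lstrip(".")


print(extension_of("data/Report.PDF"))                     # pdf
print(extension_of("archive.tar.gz"))                      # gz
print(repr(extension_of("README")))                        # ''
print(extension_of("https://example.com/doc.json?raw=1"))  # json?raw=1
```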

`read(file_path=None, **kwargs)` ¶

Reads a document from various sources and returns its text content along with standardized metadata.

This method supports reading from

Local file paths (file_path, or as a positional argument)
URLs (file_url)
JSON/dict objects (json_document)
Raw text strings (text_document)

If multiple sources are provided, the following priority is used: file_path, file_url, json_document, text_document. If only file_path is provided, the method will attempt to automatically detect if the value is a path, URL, JSON, YAML, or plain text.
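
The source selection can be sketched as a standalone function that mirrors the loop at the top of `read`. The `pick_source` name is hypothetical and the inputs are illustrative. In this copy, a `file_path` keyword binds to the named parameter rather than to `kwargs`, so it is only used when no other source keyword is present.

```python
from typing import Any, Tuple

SOURCE_PRIORITY = ["file_path", "file_url", "json_document", "text_document"]


def pick_source(file_path: Any = None, **kwargs: Any) -> Tuple[Any, str]:
    """Return (document_source, source_type), mirroring the loop in `read`."""
    for key in SOURCE_PRIORITY:
        if key in kwargs and kwargs[key] is not None:
            return kwargs[key], key
    # No source keyword found in kwargs: fall back to the named parameter
    return file_path, "file_path"


print(pick_source("notes.txt"))
# ('notes.txt', 'file_path')
print(pick_source(file_url="https://example.com/a.json", text_document="hi"))
# ('https://example.com/a.json', 'file_url')
print(pick_source(file_path="notes.txt", text_document="hi"))
# ('hi', 'text_document')
```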

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the input file.	`None`
`**kwargs`	`Any`	file_path (str, optional): Path to the input file (overrides positional argument). file_url (str, optional): URL to read the document from. json_document (dict or str, optional): Dictionary or JSON string containing document content. text_document (str, optional): Raw text or string content of the document. show_images (bool, optional): If True, images in PDFs are shown inline as base64 PNG. If False (default), images are omitted (or annotated if a model is provided). model (BaseModel, optional): Vision model for image annotation/captioning. prompt (str, optional): Custom prompt for image captioning.	`{}`

Returns:

Name	Type	Description
`ReaderOutput`	`ReaderOutput`	Dataclass defining the output structure for all readers.

Raises:

Type	Description
`ValueError`	If the provided source is not valid or supported, or if file/URL/JSON detection fails.
`TypeError`	If provided arguments are of unsupported types.

Notes

PDF extraction now supports image captioning/omission indicators.
For .parquet files, content is loaded via pandas and returned as CSV-formatted text.

Example

from splitter_mr.readers import VanillaReader
from splitter_mr.models import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel()
reader = VanillaReader(model=model)
output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
print(output.text)

\n---\n## Page 1\n---\n\nMultiRAG Project – Splitter\nMultiRAG | Splitter\nLorem ipsum dolor sit amet, ...

Source code in src/splitter_mr/reader/readers/vanilla_reader.py

def read(self, file_path: Any = None, **kwargs: Any) -> ReaderOutput:
    """
    Reads a document from various sources and returns its text content along with standardized metadata.

    This method supports reading from:
        - Local file paths (file_path, or as a positional argument)
        - URLs (file_url)
        - JSON/dict objects (json_document)
        - Raw text strings (text_document)
    If multiple sources are provided, the following priority is used: file_path, file_url,
    json_document, text_document.
    If only file_path is provided, the method will attempt to automatically detect if the value is
    a path, URL, JSON, YAML, or plain text.

    Args:
        file_path (str, optional): Path to the input file.
        **kwargs:
            file_path (str, optional): Path to the input file (overrides positional argument).
            file_url (str, optional): URL to read the document from.
            json_document (dict or str, optional): Dictionary or JSON string containing document content.
            text_document (str, optional): Raw text or string content of the document.
            show_images (bool, optional): If True, images in PDFs are shown inline as base64 PNG.
                If False (default), images are omitted (or annotated if a model is provided).
            model (BaseModel, optional): Vision model for image annotation/captioning.
            prompt (str, optional): Custom prompt for image captioning.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Raises:
        ValueError: If the provided source is not valid or supported, or if file/URL/JSON detection fails.
        TypeError: If provided arguments are of unsupported types.

    Notes:
        - PDF extraction now supports image captioning/omission indicators.
        - For `.parquet` files, content is loaded via pandas and returned as CSV-formatted text.

    Example:
        ```python
        from splitter_mr.readers import VanillaReader
        from splitter_mr.models import AzureOpenAIVisionModel

        model = AzureOpenAIVisionModel()
        reader = VanillaReader(model=model)
        output = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf", show_images=False)
        print(output.text)
        ```
        ```bash
        \\n---\\n## Page 1\\n---\\n\\nMultiRAG Project – Splitter\\nMultiRAG | Splitter\\nLorem ipsum dolor sit amet, ...
        ```
    """

    SOURCE_PRIORITY = [
        "file_path",
        "file_url",
        "json_document",
        "text_document",
    ]

    # Pick the highest-priority source provided
    document_source = None
    source_type = None
    for key in SOURCE_PRIORITY:
        if key in kwargs and kwargs[key] is not None:
            document_source = kwargs[key]
            source_type = key
            break

    if document_source is None:
        document_source = file_path
        source_type = "file_path"

    document_name = kwargs.get("document_name")
    document_path = None
    conversion_method = None

    # --- 1. File path or default
    if source_type == "file_path":
        if not isinstance(document_source, str):
            raise ValueError("file_path must be a string.")

        if self.is_valid_file_path(document_source):
            ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
            document_name = os.path.basename(document_source)
            document_path = os.path.relpath(document_source)

            if ext == "pdf":
                pdf_reader = PDFPlumberReader()
                model = kwargs.get("model", self.model)
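                if model is not None:
                    text = pdf_reader.read(
                        document_source,
                        model=model,
                        prompt=kwargs.get("prompt"),
                        show_images=kwargs.get("show_images", False),
                    )
                    # Record the model that was actually used. Storing it in kwargs
                    # keeps it from being lost when ocr_method is read from kwargs
                    # just before the ReaderOutput is built.
                    kwargs.setdefault("ocr_method", model.model_name)
                else:
                    text = pdf_reader.read(
                        document_source,
                        show_images=kwargs.get("show_images", False),
                    )
                # Applies to both branches: the source is a PDF either way.
                conversion_method = "pdf"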
            elif ext in (
                "json",
                "html",
                "txt",
                "xml",
                "csv",
                "tsv",
                "md",
                "markdown",
            ):
                with open(document_source, "r", encoding="utf-8") as f:
                    text = f.read()
                conversion_method = ext
            elif ext == "parquet":
                df = pd.read_parquet(document_source)
                text = df.to_csv(index=False)
                conversion_method = "csv"
            elif ext in ("yaml", "yml"):
                with open(document_source, "r", encoding="utf-8") as f:
                    yaml_text = f.read()
                text = yaml.safe_load(yaml_text)
                conversion_method = "json"
            elif ext in ("xlsx", "xls"):
                text = str(
                    pd.read_excel(document_source, engine="openpyxl").to_csv()
                )
                conversion_method = ext
            elif ext in LANGUAGES:
                with open(document_source, "r", encoding="utf-8") as f:
                    text = f.read()
                conversion_method = "txt"
            else:
                raise ValueError(
                    f"Unsupported file extension: {ext}. Use another Reader component."
                )

        # (2) URL
        elif self.is_url(document_source):
            ext = os.path.splitext(document_source)[-1].lower().lstrip(".")
            response = requests.get(document_source)
            response.raise_for_status()
            document_name = document_source.split("/")[-1] or "downloaded_file"
            document_path = document_source
            conversion_method = ext
            content_type = response.headers.get("Content-Type", "")

            if "application/json" in content_type or document_name.endswith(
                ".json"
            ):
                text = response.json()
            elif "text/html" in content_type or document_name.endswith(".html"):
                parser = SimpleHTMLTextExtractor()
                parser.feed(response.text)
                text = parser.get_text()
            elif "text/yaml" in content_type or document_name.endswith(
                (".yaml", ".yml")
            ):
                text = yaml.safe_load(response.text)
                conversion_method = "json"
            elif "text/csv" in content_type or document_name.endswith(".csv"):
                text = response.text
            else:
                text = response.text

        # (3) JSON/dict string
        else:
            try:
                text = self.parse_json(document_source)
                conversion_method = "json"
            except Exception:
                try:
                    text = yaml.safe_load(document_source)
                    conversion_method = "json"
                except Exception:
                    text = document_source
                    conversion_method = "txt"

    # --- 2. Explicit URL
    elif source_type == "file_url":
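        if not isinstance(document_source, str) or not self.is_url(document_source):
            raise ValueError("file_url must be a valid URL string.")
        # Derive the extension only after the type check, so a non-string value
        # raises the ValueError above instead of failing inside os.path.splitext.
        ext = os.path.splitext(document_source)[-1].lower().lstrip(".")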
        response = requests.get(document_source)
        response.raise_for_status()
        document_name = document_source.split("/")[-1] or "downloaded_file"
        document_path = document_source
        conversion_method = ext
        content_type = response.headers.get("Content-Type", "")

        if "application/json" in content_type or document_name.endswith(".json"):
            text = response.json()
        elif "text/html" in content_type or document_name.endswith(".html"):
            parser = SimpleHTMLTextExtractor()
            parser.feed(response.text)
            text = parser.get_text()
        elif "text/yaml" in content_type or document_name.endswith(
            (".yaml", ".yml")
        ):
            text = yaml.safe_load(response.text)
            conversion_method = "json"
        elif "text/csv" in content_type or document_name.endswith(".csv"):
            text = response.text
        else:
            text = response.text

    # --- 3. Explicit JSON
    elif source_type == "json_document":
        document_name = kwargs.get("document_name", None)
        document_path = None
        text = self.parse_json(document_source)
        conversion_method = "json"

    # --- 4. Explicit text
    elif source_type == "text_document":
        document_name = kwargs.get("document_name", None)
        document_path = None
        try:
            parsed = self.parse_json(document_source)
            # Only treat as JSON if result is dict or list, not a string!
            if isinstance(parsed, (dict, list)):
                text = parsed
                conversion_method = "json"
            else:
                raise ValueError  # Force fallback
        except Exception:
            try:
                parsed = yaml.safe_load(document_source)
                # Only treat as YAML if it returns a dict or list
                if isinstance(parsed, (dict, list)):
                    text = parsed
                    conversion_method = "json"
                else:
                    raise ValueError
            except Exception:
                text = document_source
                conversion_method = "txt"

    else:
        raise ValueError(f"Unrecognized document source: {source_type}")

    metadata = kwargs.get("metadata", {})
    document_id = kwargs.get("document_id") or str(uuid.uuid4())
    ocr_method = kwargs.get("ocr_method")

    return ReaderOutput(
        text=text,
        document_name=document_name,
        document_path=document_path,
        document_id=document_id,
        conversion_method=conversion_method,
        reader_method="vanilla",
        ocr_method=ocr_method,
        metadata=metadata,
    )
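
The source selection at the top of `read` can be tried in isolation. The following is a minimal sketch, not the library's code: `pick_source` is a hypothetical helper that mirrors the `SOURCE_PRIORITY` loop in the listing above, assuming the same keyword names.

```python
from typing import Any, Optional, Tuple

# Same order as SOURCE_PRIORITY in the listing above.
SOURCE_PRIORITY = ["file_path", "file_url", "json_document", "text_document"]


def pick_source(file_path: Optional[str] = None, **kwargs: Any) -> Tuple[Any, str]:
    """Hypothetical helper: return (document_source, source_type)."""
    for key in SOURCE_PRIORITY:
        if kwargs.get(key) is not None:
            return kwargs[key], key
    # Nothing usable was passed by keyword: fall back to the positional argument.
    return file_path, "file_path"


# file_url outranks text_document when both are given.
source, kind = pick_source(file_url="https://example.com/a.json", text_document="hello")
print(kind)  # file_url

# With only a positional value, the source type defaults to "file_path".
print(pick_source("notes.txt"))  # ('notes.txt', 'file_path')
```

Note that a source explicitly set to `None` is skipped, so `pick_source(file_url=None, text_document="hello")` selects `text_document`.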

DoclingReader¶

`DoclingReader` ¶

Bases: BaseReader

Read multiple file types using IBM's Docling library, and convert the documents into markdown or JSON format.

Source code in src/splitter_mr/reader/readers/docling_reader.py

class DoclingReader(BaseReader):
    """
    Read multiple file types using IBM's Docling library, and convert the documents
    into markdown or JSON format.
    """

    SUPPORTED_EXTENSIONS = (
        "pdf",
        "docx",
        "html",
        "md",
        "markdown",
        "htm",
        "pptx",
        "xlsx",
        "odt",
        "rtf",
        "jpg",
        "jpeg",
        "png",
        "bmp",
        "gif",
        "tiff",
    )

    def __init__(self, model: Optional[BaseModel] = None):
        self.model = model
        self.model_name = None
        if self.model is not None:
            self.client = self.model.get_client()
            for attr in ["_azure_deployment", "_azure_endpoint", "_api_version"]:
                setattr(self, attr, getattr(self.client, attr, None))
            self.api_key = self.client.api_key
            self.model_name = self.model.model_name

    def _get_vlm_url_and_headers(self, client: Any) -> Tuple[str, Dict[str, str]]:
        """
        Returns VLM API URL and headers based on model type.
        """
        if isinstance(client, AzureOpenAI):
            url = f"{self._azure_endpoint}/openai/deployments/{self._azure_deployment}/chat/completions?api-version={self._api_version}"
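            # Azure OpenAI expects API keys in the "api-key" header; "Authorization: Bearer"
            # is reserved for Microsoft Entra ID tokens.
            headers = {"api-key": client.api_key}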
        elif isinstance(client, OpenAI):
            url = "https://api.openai.com/v1/chat/completions"
            headers = {"Authorization": f"Bearer {client.api_key}"}
        else:
            raise ValueError(f"Unknown client type: {type(client)}")
        return url, headers

    def _make_docling_reader(self, prompt: str, timeout: int = 60) -> DocumentConverter:
        """
        Returns a configured DocumentConverter with VLM pipeline options for OpenAI or Azure.
        """
        url, headers = self._get_vlm_url_and_headers(self.client)
        vlm_options = ApiVlmOptions(
            url=url,
            params={"model": self.model_name},
            headers=headers,
            prompt=prompt,
            timeout=timeout,
            response_format=ResponseFormat.MARKDOWN,
        )
        pipeline_options = VlmPipelineOptions(
            enable_remote_services=True,
            vlm_options=vlm_options,
        )
        reader = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_cls=VlmPipeline,
                    pipeline_options=pipeline_options,
                )
            }
        )
        return reader

    def read(
        self,
        file_path: str,
        prompt: str = "Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format",
        **kwargs: Any,
    ) -> ReaderOutput:
        """
        Reads and converts a document to Markdown format using the
        [Docling](https://github.com/docling-project/docling) library, supporting a wide range
        of file types including PDF, DOCX, HTML, and images.

        This method leverages Docling's advanced document parsing capabilities—including layout
        and table detection, code and formula extraction, and integrated OCR—to produce clean,
        markdown-formatted output for downstream processing. The output includes standardized
        metadata and can be easily integrated into generative AI or information retrieval pipelines.

        Args:
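            file_path (str): Path to the input file to be read and converted.
            prompt (str): Prompt sent to the vision model when the reader was created
                with a model. Ignored otherwise.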
            **kwargs:
                document_id (Optional[str]): Unique document identifier.
                    If not provided, a UUID will be generated.
                conversion_method (Optional[str]): Name or description of the
                    conversion method used. Default is None.
                ocr_method (Optional[str]): OCR method applied (if any).
                    Default is None.
                metadata (Optional[List[str]]): Additional metadata as a list of strings.
                    Default is an empty list.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Example:
            ```python
            from splitter_mr.reader import DoclingReader

            reader = DoclingReader()
            result = reader.read(file_path = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
            print(result.text)
            ```
            ```bash
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """
        # Check if the extension is valid
        ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
        if ext not in self.SUPPORTED_EXTENSIONS:
            print(
                f"Warning: File extension not compatible: {ext}. Fallback to VanillaReader."
            )
            return VanillaReader().read(file_path=file_path, **kwargs)

        if self.model is not None:
            reader = self._make_docling_reader(prompt)
        else:
            reader = DocumentConverter()

        # Read and convert to markdown
        text = reader.convert(file_path)
        markdown_text = text.document.export_to_markdown()

        # Return output
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id") or str(uuid.uuid4()),
            conversion_method="markdown",
            reader_method="docling",
            ocr_method=self.model_name,
            metadata=kwargs.get("metadata"),
        )
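
To make the URL construction in `_get_vlm_url_and_headers` easier to follow, here is a minimal standalone sketch. `azure_chat_url` is a hypothetical helper, and the endpoint, deployment, and API version values are placeholders rather than values taken from the library. Unlike the method above, the sketch strips a trailing slash from the endpoint so the resulting path does not contain `//`.

```python
def azure_chat_url(endpoint: str, deployment: str, api_version: str) -> str:
    """Hypothetical helper: build an Azure OpenAI chat-completions URL."""
    base = endpoint.rstrip("/")  # tolerate "https://host/" as well as "https://host"
    return (
        f"{base}/openai/deployments/{deployment}"
        f"/chat/completions?api-version={api_version}"
    )


# Placeholder values for illustration only.
url = azure_chat_url(
    "https://my-resource.openai.azure.com/", "my-deployment", "2024-02-01"
)
print(url)
# https://my-resource.openai.azure.com/openai/deployments/my-deployment/chat/completions?api-version=2024-02-01
```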

`read(file_path, prompt='Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format', **kwargs)` ¶

Reads and converts a document to Markdown format using the Docling library, supporting a wide range of file types including PDF, DOCX, HTML, and images.

This method leverages Docling's advanced document parsing capabilities—including layout and table detection, code and formula extraction, and integrated OCR—to produce clean, markdown-formatted output for downstream processing. The output includes standardized metadata and can be easily integrated into generative AI or information retrieval pipelines.

Parameters:

Name	Type	Description	Default
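`file_path`	`str`	Path to the input file to be read and converted.	required
`prompt`	`str`	Prompt sent to the vision model when the reader was created with a model. Ignored otherwise.	`'Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format'`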
`**kwargs`	`Any`	document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated. conversion_method (Optional[str]): Name or description of the conversion method used. Default is None. ocr_method (Optional[str]): OCR method applied (if any). Default is None. metadata (Optional[List[str]]): Additional metadata as a list of strings. Default is an empty list.	`{}`

Returns:

Name	Type	Description
`ReaderOutput`	`ReaderOutput`	Dataclass defining the output structure for all readers.
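
Since every reader returns the same dataclass, downstream code can handle the result uniformly. The sketch below uses a stand-in `ReaderOutput` that copies only the field names visible in this page's listings (the real class lives in `splitter_mr` and may differ), and the field values are made up for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class ReaderOutput:  # stand-in for illustration, not the library class
    text: Optional[str] = ""
    document_name: Optional[str] = None
    document_path: str = ""
    document_id: Optional[str] = None
    conversion_method: Optional[str] = None
    reader_method: Optional[str] = None
    ocr_method: Optional[str] = None
    metadata: Any = None


# Values shaped like what DoclingReader.read builds, with made-up content.
result = ReaderOutput(
    text="# Title\n\nBody",
    document_name="test_1.pdf",
    document_path="data/test_1.pdf",
    document_id="doc-001",
    conversion_method="markdown",
    reader_method="docling",
)

record = asdict(result)  # plain dict, e.g. for logging or JSON serialization
print(record["reader_method"], record["conversion_method"])  # docling markdown
print(record["ocr_method"])  # None, because no vision model was involved
```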

Example

from splitter_mr.reader import DoclingReader

reader = DoclingReader()
result = reader.read(file_path = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
print(result.text)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...

Source code in src/splitter_mr/reader/readers/docling_reader.py

def read(
    self,
    file_path: str,
    prompt: str = "Analyze the following resource in the original language. Be concise but comprehensive, according to the image context. Return the content in markdown format",
    **kwargs: Any,
) -> ReaderOutput:
    """
    Reads and converts a document to Markdown format using the
    [Docling](https://github.com/docling-project/docling) library, supporting a wide range
    of file types including PDF, DOCX, HTML, and images.

    This method leverages Docling's advanced document parsing capabilities—including layout
    and table detection, code and formula extraction, and integrated OCR—to produce clean,
    markdown-formatted output for downstream processing. The output includes standardized
    metadata and can be easily integrated into generative AI or information retrieval pipelines.

    Args:
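        file_path (str): Path to the input file to be read and converted.
        prompt (str): Prompt sent to the vision model when the reader was created
            with a model. Ignored otherwise.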
        **kwargs:
            document_id (Optional[str]): Unique document identifier.
                If not provided, a UUID will be generated.
            conversion_method (Optional[str]): Name or description of the
                conversion method used. Default is None.
            ocr_method (Optional[str]): OCR method applied (if any).
                Default is None.
            metadata (Optional[List[str]]): Additional metadata as a list of strings.
                Default is an empty list.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Example:
        ```python
        from splitter_mr.reader import DoclingReader

        reader = DoclingReader()
        result = reader.read(file_path = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
        print(result.text)
        ```
        ```bash
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
        rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
        Pellentesque ex felis, cursus ege...
        ```
    """
    # Check if the extension is valid
    ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
    if ext not in self.SUPPORTED_EXTENSIONS:
        print(
            f"Warning: File extension not compatible: {ext}. Fallback to VanillaReader."
        )
        return VanillaReader().read(file_path=file_path, **kwargs)

    if self.model is not None:
        reader = self._make_docling_reader(prompt)
    else:
        reader = DocumentConverter()

    # Read and convert to markdown
    text = reader.convert(file_path)
    markdown_text = text.document.export_to_markdown()

    # Return output
    return ReaderOutput(
        text=markdown_text,
        document_name=os.path.basename(file_path),
        document_path=file_path,
        document_id=kwargs.get("document_id") or str(uuid.uuid4()),
        conversion_method="markdown",
        reader_method="docling",
        ocr_method=self.model_name,
        metadata=kwargs.get("metadata"),
    )
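
The extension check at the start of `read` decides whether Docling handles the file or the call falls back to `VanillaReader`. A minimal sketch of that decision, using only the standard library: `choose_reader` is a hypothetical name, and the tuple is a shortened copy of `SUPPORTED_EXTENSIONS`.

```python
import os

# Shortened copy of DoclingReader.SUPPORTED_EXTENSIONS, for illustration.
SUPPORTED_EXTENSIONS = ("pdf", "docx", "html", "md", "pptx", "xlsx", "png")


def choose_reader(file_path: str) -> str:
    """Hypothetical helper: name the reader that would process file_path."""
    # Same normalization as in read(): last suffix, lower-cased, no leading dot.
    ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
    return "docling" if ext in SUPPORTED_EXTENSIONS else "vanilla"


print(choose_reader("reports/Q1.PDF"))      # docling
print(choose_reader("data/table.parquet"))  # vanilla
print(choose_reader("README"))              # vanilla (no extension at all)
```

One consequence of this normalization: a URL that carries a query string, such as `https://example.com/a.pdf?raw=1`, yields the extension `pdf?raw=1` and therefore takes the fallback path.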

MarkItDownReader¶

`MarkItDownReader` ¶

Bases: BaseReader

Read multiple file types using Microsoft's MarkItDown library, and convert the documents to Markdown format.

This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs) for LLM-based OCR when extracting text from images or scanned documents.

Currently, only the following VLMs are supported:

OpenAIVisionModel
AzureOpenAIVisionModel

If a compatible model is provided, MarkItDown will leverage the specified VLM for OCR, and the model's name will be recorded as the OCR method used.
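
What happens with an incompatible model can be sketched without installing MarkItDown. In the example below, the two model classes are empty stand-ins that only share their names with the library classes, and `ocr_method_for` is a hypothetical helper that mirrors the type check performed in the constructor.

```python
from typing import Any, Optional


class OpenAIVisionModel:  # stand-in, not the library class
    model_name = "openai-vision-stand-in"


class AzureOpenAIVisionModel:  # stand-in, not the library class
    model_name = "azure-vision-stand-in"


SUPPORTED = (OpenAIVisionModel, AzureOpenAIVisionModel)


def ocr_method_for(model: Any = None) -> Optional[str]:
    """Hypothetical helper: validate the model and return the recorded OCR method."""
    if model is None:
        return None  # plain MarkItDown conversion, no VLM involved
    if not isinstance(model, SUPPORTED):
        raise ValueError(
            "Incompatible client. Only AzureOpenAIVisionModel and "
            "OpenAIVisionModel are supported."
        )
    return model.model_name


print(ocr_method_for(None))                 # None
print(ocr_method_for(OpenAIVisionModel()))  # openai-vision-stand-in

try:
    ocr_method_for("gpt-4o")  # a bare string is not a model object
except ValueError as err:
    print(err)
```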

Notes

This method uses MarkItDown to convert a wide variety of file formats (e.g., PDF, DOCX, images, HTML, CSV) to Markdown.
If document_id is not provided, a UUID will be automatically assigned.
If metadata is not provided, an empty list will be used.
MarkItDown should be installed with all relevant optional dependencies for full file format support.
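
The `document_id` rule in the notes comes down to one expression, `kwargs.get("document_id") or str(uuid.uuid4())`. A minimal sketch with a hypothetical `resolve_document_id` helper shows one consequence worth knowing: because `or` tests truthiness, an empty string is also replaced by a fresh UUID.

```python
import uuid
from typing import Any


def resolve_document_id(**kwargs: Any) -> str:
    """Hypothetical helper mirroring the expression used by the readers."""
    return kwargs.get("document_id") or str(uuid.uuid4())


print(resolve_document_id(document_id="invoice-42"))  # invoice-42

generated = resolve_document_id()               # no id given: random UUID4 string
replaced = resolve_document_id(document_id="")  # empty id is treated as missing
print(len(generated), len(replaced))            # 36 36
```

If stable identifiers matter (for example, to deduplicate documents across runs), pass `document_id` explicitly rather than relying on the generated value.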

Source code in src/splitter_mr/reader/readers/markitdown_reader.py

class MarkItDownReader(BaseReader):
    """
    Read multiple file types using Microsoft's MarkItDown library, and convert
    the documents to Markdown format.

    This reader supports both standard MarkItDown conversion and the use of Vision Language Models (VLMs)
    for LLM-based OCR when extracting text from images or scanned documents.

    Currently, only the following VLMs are supported:
        - OpenAIVisionModel
        - AzureOpenAIVisionModel

    If a compatible model is provided, MarkItDown will leverage the specified VLM for OCR, and the
    model's name will be recorded as the OCR method used.

    Notes:
        - This method uses [MarkItDown](https://github.com/microsoft/markitdown) to convert
            a wide variety of file formats (e.g., PDF, DOCX, images, HTML, CSV) to Markdown.
        - If `document_id` is not provided, a UUID will be automatically assigned.
        - If `metadata` is not provided, an empty list will be used.
        - MarkItDown should be installed with all relevant optional dependencies for full
            file format support.
    """

    def __init__(
        self, model: Optional[Union[AzureOpenAIVisionModel, OpenAIVisionModel]] = None
    ):
        self.model = model
        self.model_name = None

        if model is not None:
            if not isinstance(model, (OpenAIVisionModel, AzureOpenAIVisionModel)):
                raise ValueError(
                    "Incompatible client. Only AzureOpenAIVisionModel and OpenAIVisionModel are supported."
                )
            client = model.get_client()
            self.model_name = self.model.model_name
            self.md = MarkItDown(llm_client=client, llm_model=self.model_name)
        else:
            self.md = MarkItDown()

    def read(self, file_path: str, **kwargs: Any) -> ReaderOutput:
        """
        Reads a file and converts its contents to Markdown using MarkItDown, returning
        structured metadata.

        Args:
            file_path (str): Path to the input file to be read and converted.
            **kwargs:
                document_id (Optional[str]): Unique document identifier.
                    If not provided, a UUID will be generated.
                conversion_method (Optional[str]): Name or description of the
                    conversion method used. Default is None.
                ocr_method (Optional[str]): OCR method applied (if any).
                    Default is None.
                metadata (Optional[List[str]]): Additional metadata as a list of strings.
                    Default is an empty list.

        Returns:
            ReaderOutput: Dataclass defining the output structure for all readers.

        Example:
            ```python
            from splitter_mr.reader import MarkItDownReader
            from splitter_mr.model import OpenAIVisionModel # Or AzureOpenAIVisionModel

            openai = OpenAIVisionModel()  # make sure the required environment variables are set in `.env`

            reader = MarkItDownReader(model=openai)
            result = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
            print(result.text)
            ```
            ```bash
            Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
            rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
            Pellentesque ex felis, cursus ege...
            ```
        """
        # Read using MarkItDown
        markdown_text = self.md.convert(file_path).text_content
        ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
        conversion_method = "json" if ext == "json" else "markdown"

        # Return output
        return ReaderOutput(
            text=markdown_text,
            document_name=os.path.basename(file_path),
            document_path=file_path,
            document_id=kwargs.get("document_id") or str(uuid.uuid4()),
            conversion_method=conversion_method,
            reader_method="markitdown",
            ocr_method=self.model_name,
            metadata=kwargs.get("metadata"),
        )

`read(file_path, **kwargs)` ¶

Reads a file and converts its contents to Markdown using MarkItDown, returning structured metadata.

Parameters:

Name	Type	Description	Default
`file_path`	`str`	Path to the input file to be read and converted.	required
`**kwargs`	`Any`	document_id (Optional[str]): Unique document identifier. If not provided, a UUID will be generated. conversion_method (Optional[str]): Name or description of the conversion method used. Default is None. ocr_method (Optional[str]): OCR method applied (if any). Default is None. metadata (Optional[List[str]]): Additional metadata as a list of strings. Default is an empty list.	`{}`

Returns:

Name	Type	Description
`ReaderOutput`	`ReaderOutput`	Dataclass defining the output structure for all readers.

Example

from splitter_mr.reader import MarkItDownReader
from splitter_mr.model import OpenAIVisionModel # Or AzureOpenAIVisionModel

openai = OpenAIVisionModel()  # make sure the required environment variables are set in `.env`

reader = MarkItDownReader(model = openai)
result = reader.read(file_path = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
print(result.text)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
Pellentesque ex felis, cursus ege...

Source code in src/splitter_mr/reader/readers/markitdown_reader.py

def read(self, file_path: str, **kwargs: Any) -> ReaderOutput:
    """
    Reads a file and converts its contents to Markdown using MarkItDown, returning
    structured metadata.

    Args:
        file_path (str): Path to the input file to be read and converted.
        **kwargs:
            document_id (Optional[str]): Unique document identifier.
                If not provided, a UUID will be generated.
            conversion_method (Optional[str]): Name or description of the
                conversion method used. Default is None.
            ocr_method (Optional[str]): OCR method applied (if any).
                Default is None.
            metadata (Optional[List[str]]): Additional metadata as a list of strings.
                Default is an empty list.

    Returns:
        ReaderOutput: Dataclass defining the output structure for all readers.

    Example:
        ```python
        from splitter_mr.reader import MarkItDownReader
        from splitter_mr.model import OpenAIVisionModel # Or AzureOpenAIVisionModel

        openai = OpenAIVisionModel()  # make sure the required environment variables are set in `.env`

        reader = MarkItDownReader(model=openai)
        result = reader.read(file_path="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/test_1.pdf")
        print(result.text)
        ```
        ```bash
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec eget purus non est porta
        rutrum. Suspendisse euismod lectus laoreet sem pellentesque egestas et et sem.
        Pellentesque ex felis, cursus ege...
        ```
    """
    # Read using MarkItDown
    markdown_text = self.md.convert(file_path).text_content
    ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
    conversion_method = "json" if ext == "json" else "markdown"

    # Return output
    return ReaderOutput(
        text=markdown_text,
        document_name=os.path.basename(file_path),
        document_path=file_path,
        document_id=kwargs.get("document_id") or str(uuid.uuid4()),
        conversion_method=conversion_method,
        reader_method="markitdown",
        ocr_method=self.model_name,
        metadata=kwargs.get("metadata"),
    )
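
The bookkeeping around the conversion is small enough to sketch without MarkItDown itself. The hypothetical `describe_source` helper below reproduces how `read` derives `document_name` and `conversion_method` from the path. It does not perform any conversion.

```python
import os
from typing import Dict


def describe_source(file_path: str) -> Dict[str, str]:
    """Hypothetical helper: derive output fields from a path, as read() does."""
    ext = os.path.splitext(file_path)[-1].lower().lstrip(".")
    return {
        "document_name": os.path.basename(file_path),
        "document_path": file_path,
        # Only .json inputs are labeled "json"; everything else is "markdown".
        "conversion_method": "json" if ext == "json" else "markdown",
        "reader_method": "markitdown",
    }


print(describe_source("data/config.JSON")["conversion_method"])  # json
print(describe_source("data/slides.pptx")["conversion_method"])  # markdown
print(describe_source("data/slides.pptx")["document_name"])      # slides.pptx
```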
