Example: Reading a PDF using several Reading methods¶
Converting a PDF into a readable format is not an easy task. PDF introduces compression, which often results in a complete loss of formatting. As a result, many tools have been developed to convert PDF to text, each of which works differently.
In this example, we will show how to read a PDF file using three readers: VanillaReader
, MarkItDownReader
, and DoclingReader
, and we will observe the differences between each.
Note
A complete description of each of these classes is defined in the Developer guide.
1. Read PDF files using VanillaReader
¶
VanillaReader
uses open-source libraries to read many file formats, aiming to preserve the text as a string. However, converting a PDF directly to text results in a complete loss of readability. So, to read PDFs, VanillaReader
uses PDFPlumber as the core library. PDFPlumber is a Python library that extracts text, tables, and metadata from PDF files while preserving their layout as much as possible. It is widely used for converting PDF content into readable and structured formats for further processing. Let's see how it works and what results it produces:
First, we instantiate our VanillaReader
object:
from splitter_mr.reader import VanillaReader
reader = VanillaReader()
To read the file, you simply call to the read
method:
file = "data/sample_pdf.pdf"
reader_output = reader.read(file)
The result will be a ReaderOutput
object with the following structure:
print(reader_output.model_dump_json(indent=4))
{
"text": "<!-- page -->\n\nA sample PDF\nConverting PDF files to other formats, such as Markdown, is a surprisingly\ncomplex task due to the nature of the PDF format itself. PDF (Portable\nDocument Format) was designed primarily for preserving the visual layout of\ndocuments, making them look the same across different devices and\nplatforms. However, this design goal introduces several challenges when trying to\nextract and convert the underlying content into a more flexible, structured for
...
structure recognition) to produce\nusable, readable, and faithful Markdown output. As a result, perfect conversion\nis rarely possible, and manual review and cleanup are often required.\n\n<!-- image -->\n",
"document_name": "sample_pdf.pdf",
"document_path": "data/sample_pdf.pdf",
"document_id": "df3944d6-eb56-4d25-ae87-27db76138063",
"conversion_method": "pdf",
"reader_method": "vanilla",
"ocr_method": null,
"page_placeholder": "<!-- page -->",
"metadata": {}
}
So, we can print the text using this command:
print(reader_output.text)
<!-- page -->
A sample PDF
Converting PDF files to other formats, such as Markdown, is a surprisingly
complex task due to the nature of the PDF format itself. PDF (Portable
Document Format) was designed primarily for preserving the visual layout of
documents, making them look the same across different devices and
platforms. However, this design goal introduces several challenges when trying to
extract and convert the underlying content into a more flexible, structured format
like Markdown.
<!-
...
xample.com |
Conclusion
While it may seem simple on the surface, converting PDFs to formats like
Markdown involves a series of technical and interpretive challenges. Effective
conversion tools must blend text extraction, document analysis, and sometimes
machine learning techniques (such as OCR or structure recognition) to produce
usable, readable, and faithful Markdown output. As a result, perfect conversion
is rarely possible, and manual review and cleanup are often required.
<!-- image -->
As we can see from the original file, all the text has been preserved. Bold, italics, etc. are not highlighted, nor are text colors, headers, and font type. Despite that, the format is mostly plain text rather than markdown. In addition, we can observe that images are signaled by a <!-- image -->
placeholder, which can be useful to identify where a image has been placed. In the same way, pages are marked with another placeholder: <!-- page -->
. The order of the document is preserved.
Now, let's see how well the other readers handle markdown conversion:
2. Read PDF files using MarkItDownReader
¶
The process is analogous to VanillaReader
. So, we instantiate the MarkItDownReader
class and we call to the read method:
from splitter_mr.reader import MarkItDownReader
reader = MarkItDownReader()
reader_output = reader.read(file)
print(reader_output.text)
A sample PDF
Converting PDF files to other formats, such as Markdown, is a surprisingly
complex task due to the nature of the PDF format itself. PDF (Portable
Document Format) was designed primarily for preserving the visual layout of
documents, making them look the same across different devices and
platforms. However, this design goal introduces several challenges when trying to
extract and convert the underlying content into a more flexible, structured format
like Markdown.
Ilustración 1. Sp
...
ct Lead carol@example.com
Conclusion
While it may seem simple on the surface, converting PDFs to formats like
Markdown involves a series of technical and interpretive challenges. Effective
conversion tools must blend text extraction, document analysis, and sometimes
machine learning techniques (such as OCR or structure recognition) to produce
usable, readable, and faithful Markdown output. As a result, perfect conversion
is rarely possible, and manual review and cleanup are often required.
Again, all the text has been preserved. However, we can observe some inconsistencies in line spacing: sometimes there is a single line of separation, while in other cases there are two. Similarly to VanillaReader
, text formatting has not been preserved: no headers, no italics, no bold... It is simply plain text.
3. Read PDF files using DoclingReader
¶
docling
is an open-source Python library designed to analyze and extract structured information from documents, including PDFs. It focuses on preserving the original layout, structure, and semantic elements of documents, making it useful for handling complex formats beyond plain text extraction.
Let's see how it works for this use case:
from splitter_mr.reader import DoclingReader
FILE_PATH = "data/sample_pdf.pdf"
reader = DoclingReader()
reader_output = reader.read(file_path=FILE_PATH)
print(reader_output.text)
## A sample PDF
Converting PDF files to other formats, such as Markdown, is a surprisingly complex task due to the nature of the PDF format itself . PDF (Portable Document Format) was designed primarily for preserving the visual layout of documents, making them look the same across different devices and platforms. However, this design goal introduces several challenges when trying to extract and convert the underlying content into a more flexible, structured format like Markdown.
Ilustración 1
...
ple.com |
## Conclusion
While it may seem simple on the surface, converting PDFs to formats like Markdown involves a series of technical and interpretive challenges. Effective conversion tools must blend text extraction, document analysis, and sometimes machine learning techniques (such as OCR or structure recognition) to produce usable, readable, and faithful Markdown output. As a result, perfect conversion is rarely possible, and manual review and cleanup are often required.
<!-- image -->
We can see that the layout is generally better. All the text has been preserved, but markdown format is more present. We can see that headers, tables and lists are markdown formatted, despite bold or italics are not showing. In addition, formulas (<!-- formula-not-decoded -->
) and images (<!-- Image -->
) are detected too, despite no description or rendering is provided. Sometimes the line spacing is inconsistent as it was in MarkItDown. However, in general terms, it could be said that it is the method that best formats Markdown.
So, does this mean you should always use this method to parse PDFs? Not exactly. Let's analyze an additional metric: computation time.
4. Measuring compute time¶
To measure the compute time for every method, we can encapsulate every reading logic into a function and define a decorator which computes a function execution time. Then, we can compare compute times in relative terms. Then, we can compare compute times in relative terms by executing the following code:
import time
from splitter_mr.reader import DoclingReader, MarkItDownReader, VanillaReader
def timeit(func):
def wrapper(*args, **kwargs):
start = time.time()
result = func(*args, **kwargs)
elapsed = time.time() - start
print(f"Time taken by '{func.__name__}': {elapsed:.4f} seconds\n")
return result
return wrapper
@timeit
def get_reader_output(file, reader=VanillaReader()):
output = reader.read(file)
print()
return output.text
FILE_PATH = "data/sample_pdf.pdf"
print("*" * 20 + " Vanilla Reader " + "*" * 20)
vanilla_output = get_reader_output(file, reader=VanillaReader())
print("*" * 20 + " MarkItDown Reader " + "*" * 20)
markitdown_output = get_reader_output(file, reader=MarkItDownReader())
print("*" * 20 + " Docling Reader " + "*" * 20)
markitdown_output = get_reader_output(file, reader=DoclingReader())
******************** Vanilla Reader ********************
Time taken by 'get_reader_output': 0.1170 seconds
******************** MarkItDown Reader ********************
Time taken by 'get_reader_output': 0.0624 seconds
******************** Docling Reader ********************
Time taken by 'get_reader_output': 4.4744 seconds
As we can observe, although DoclingReader offers a really good conversion, it's a resource-intensive method, and therefore takes the longest to return the result. On the other hand, MarkItDownReader, although it preserves the markdown format the least, is the fastest of all. VanillaReader
offers a balance between computation time and format preservation.
5. Comparison between methods¶
As we've seen, each method has its advantages and disadvantages. Therefore, choosing a reading method depends on the specific needs of the user.
- If you prioritize conversion quality regardless of execution time,
DoclingReader
will be the best option. - If you want a fast conversion that preserves only the text,
MarkItDownReader
may be your best option. - If you want a fast conversion but need to detect images and other graphic elements,
VanillaReader
is suitable.
Finally, here we present a comparative table of each method, with the strengths and weaknesses of each one:
Feature | VanillaReader |
MarkItDownReader |
DoclingReader |
---|---|---|---|
Header preservation | low | mid | high |
Text formatting (bold, italic, etc.) | no | no | partial |
Text color & highlighting | no | no | no |
Markdown tables | yes | no (txt format) | yes |
Markdown lists | partial | no | yes |
Image placeholders | yes | no | yes |
Formulas placeholders | no | no | yes |
Pagination | yes | yes (split_by_pages = True ) |
yes |
Execution time | low | the lowest | the highest |
With this information, we know which method to use. However, there is an element that we have not yet analyzed: the description and annotation of images. Currently, all three methods can describe and annotate images using VLMs. To see how to do this, jump to the next tutorial.
Thanks for Reading!