Example: Splitting Files by pages using PagedSplitter¶
For some documents, one of the best splitting strategies is to divide them by pages. To do so, you can use the PagedSplitter.
For this example, we will read the file using `VanillaReader`. The file can be found on the GitHub repository, and it consists of a scientific paper (Attention Is All You Need) with 15 pages. Let's see how to split it.
Step 1. Read the file¶
You can read the file using `VanillaReader` or `DoclingReader`. If you use `MarkItDownReader`, you should pass the parameter `split_by_pages=True`, since MarkItDown by default does not provide any placeholder to split by pages.
Python examples for all Readers:
# Option 1: VanillaReader
from splitter_mr.reader import VanillaReader

FILE_PATH = "data/attention.pdf"
reader = VanillaReader()
reader_output = reader.read(file_path=FILE_PATH)

# Option 2: DoclingReader
from splitter_mr.reader import DoclingReader

FILE_PATH = "data/attention.pdf"
reader = DoclingReader()
reader_output = reader.read(file_path=FILE_PATH)

# Option 3: MarkItDownReader (requires split_by_pages=True to insert page placeholders)
from splitter_mr.reader import MarkItDownReader

FILE_PATH = "data/attention.pdf"
reader = MarkItDownReader()
reader_output = reader.read(file_path=FILE_PATH, split_by_pages=True)
The output will be the following:
import warnings
warnings.filterwarnings("ignore", message=".*pin_memory.*MPS.*")
from splitter_mr.reader import DoclingReader
FILE_PATH = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/attention.pdf"
reader = DoclingReader()
reader_output = reader.read(file_path=FILE_PATH)
print(reader_output.model_dump_json(indent=4))
{
"text": "Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.\n\n## Attention Is All You Need\n\nAshish Vaswani ∗ Google Brain avaswani@google.com\n\nNoam Shazeer ∗ Google Brain noam@google.com\n\nNiki Parmar ∗ Google Research nikip@google.com\n\nJakob Uszkoreit ∗ Google Research usz@google.com\n\nLlion Jones ∗ Google Research llion@google.com\n\nAidan N. Gomez ∗ † Unive
...
heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.\n\n<!-- image -->",
"document_name": "attention.pdf",
"document_path": "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/attention.pdf",
"document_id": "04aa7dd7-e397-464d-aa29-4f444be12514",
"conversion_method": "markdown",
"reader_method": "docling",
"ocr_method": null,
"page_placeholder": "<!-- page -->",
"metadata": {}
}
As you can see, the `ReaderOutput` object has an attribute named `page_placeholder`, which allows you to identify every page in the extracted text.
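To double-check this before splitting, you can print the placeholder and, as a rough sketch, split the raw text on it manually (the attribute names below are the ones shown in the ReaderOutput above):
# Inspect the placeholder that marks page boundaries
print(reader_output.page_placeholder)  # "<!-- page -->" for DoclingReader

# Rough sketch: count pages by splitting the raw text on the placeholder
pages = reader_output.text.split(reader_output.page_placeholder)
print(f"Detected pages: {len([p for p in pages if p.strip()])}")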
Step 2. Split by pages¶
So, we can simply instantiate the `PagedSplitter` object and use the `split` method to get the chunks page by page:
from splitter_mr.splitter import PagedSplitter
splitter = PagedSplitter()
splitter_output = splitter.split(reader_output=reader_output)
for idx, chunk in enumerate(splitter_output.chunks):
    print("\n" + "*" * 80 + f" Chunk {idx} " + "*" * 80 + "\n")
    print(chunk)
******************************************************************************** Chunk 0 ********************************************************************************
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.
## Attention Is All You Need
Ashish Vaswani ∗ Google Brain avaswani@google.com
Noam Shazeer ∗ Google Brain noam@google.com
Niki Parmar ∗ Google Resear
...
ord.
<!-- image -->
******************************************************************************** Chunk 14 ********************************************************************************
Input-Input Layer5
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
<!-- image -->
Indeed, we have obtained a list of chunks with the extracted content, one per page.
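As a quick sanity check (assuming, as above, that the splitter output exposes its chunks through the chunks attribute), the number of chunks should match the number of pages in the paper:
# The paper has 15 pages, so we expect one chunk per page
print(len(splitter_output.chunks))  # expected: 15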
Experimenting with custom parameters¶
If we want to split by groups of several pages (e.g., 3), we can specify that value on the `PagedSplitter` object. In addition, we can define an overlap of characters between consecutive chunks:
splitter = PagedSplitter(chunk_size=3, chunk_overlap=100)
splitter_output = splitter.split(reader_output=reader_output)
for idx, chunk in enumerate(splitter_output.chunks):
    print("\n" + "*" * 80 + f" Chunk {idx} " + "*" * 80 + "\n")
    print(chunk)
******************************************************************************** Chunk 0 ********************************************************************************
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.
## Attention Is All You Need
Ashish Vaswani ∗ Google Brain avaswani@google.com
Noam Shazeer ∗ Google Brain noam@google.com
Niki Parmar ∗ Google Resear
...
lution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word 'its' for attention heads 5 and 6. Note that the attentions are very sharp for this word.
<!-- image -->
Input-Input Layer5
Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks.
<!-- image -->
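Since the paper has 15 pages and each chunk now groups 3 of them, we expect ceil(15 / 3) = 5 chunks, with a 100-character overlap carried over between consecutive chunks. A quick check, under the same assumptions as before:
# 15 pages grouped 3 at a time should yield 5 chunks
print(len(splitter_output.chunks))  # expected: 5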
And that's it! Experiment to find which values work best for your use case. A full reference to this class is available in the API Reference.
Thank you for reading! :)
Complete script¶
from splitter_mr.reader import DoclingReader  # , VanillaReader, MarkItDownReader
from splitter_mr.splitter import PagedSplitter

FILE_PATH = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/attention.pdf"

# Step 1. Read the file
reader = DoclingReader()
reader_output = reader.read(file_path=FILE_PATH)
print(reader_output.model_dump_json(indent=4))

# Step 2. Split by pages (one chunk per page)
splitter = PagedSplitter()
splitter_output = splitter.split(reader_output=reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("\n" + "*" * 80 + f" Chunk {idx} " + "*" * 80 + "\n")
    print(chunk)

# Split again, grouping 3 pages per chunk with a 100-character overlap
splitter = PagedSplitter(chunk_size=3, chunk_overlap=100)
splitter_output = splitter.split(reader_output=reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("\n" + "*" * 80 + f" Chunk {idx} " + "*" * 80 + "\n")
    print(chunk)
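As an optional next step (not part of the original example), you could persist each chunk to its own file, for instance to feed a downstream indexing pipeline. The folder and file names below are just an illustrative choice:
import os

# Hypothetical follow-up: write every chunk to its own Markdown file
os.makedirs("chunks", exist_ok=True)
for idx, chunk in enumerate(splitter_output.chunks):
    with open(os.path.join("chunks", f"chunk_{idx:02d}.md"), "w", encoding="utf-8") as f:
        f.write(chunk)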