Example: Split Pinocchio Chapters Using KeywordSplitter
¶
The KeywordSplitter
is a highly versatile text-splitting utility that divides documents based on custom regular-expression (regex) patterns. Unlike simpler splitters that break text by fixed lengths (characters, words, or sentences), the KeywordSplitter allows you to identify semantic boundaries—such as headings, section markers, or unique keywords—and chunk your text accordingly.
This splitter is particularly useful when working with structured or semi-structured text where meaningful sections are defined by repeated markers. Examples include:
- Splitting classic books or plays by chapter headings (e.g., "CHAPTER I", "ACT II").
- Parsing logs by timestamps or error codes.
- Extracting sections from Markdown or plain text documents where a specific keyword (e.g., "TODO" or "NOTE") separates ideas.
- Segmenting transcripts or interview notes at speaker identifiers.
This notebook demonstrates splitting Pinocchio by chapter markers such as “CHAPTER I”, “CHAPTER II”, etc., using the KeywordSplitter
.
Step 1: Read the text using a Reader component¶
We will use the VanillaReader
to fetch the text from Project Gutenberg:
from splitter_mr.reader import VanillaReader
FILE_PATH = "https://www.gutenberg.org/cache/epub/16865/pg16865.txt"
reader = VanillaReader()
reader_output = reader.read(file_path=FILE_PATH)
print(reader_output.model_dump_json(indent=4))
print(reader_output.text[:1000]) # preview first 1000 characters
{
"text": "The Project Gutenberg eBook of Pinocchio: The Tale of a Puppet\r\n \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country
...
you are located
before using this eBook.
Title: Pinocchio: The Tale of a Puppet
Author: Carlo Collodi
Illustrator: Alice Carsey
Release date: October 13, 2005 [eBook #16865]
Most recently updated: December 12, 2020
Language: English
Credits: Produced by Mark C. Orton, Melissa Er-Raqabi and the Online
Distributed Proofreading Team at https://www.pgdp.net.
*** START OF THE PROJECT GUTENBERG EBOOK PINOCCHIO: THE TALE OF A PUPPET ***
Produced by Mark C. Orton, Me
Step 2: Split the text using KeywordSplitter
¶
We define the regex pattern for Roman numeral chapters and instantiate the splitter:
import re
from splitter_mr.splitter import KeywordSplitter
chapter_pattern = r"CHAPTER\s+[IVXLCDM]+"
splitter = KeywordSplitter(
patterns={"chapter": chapter_pattern},
include_delimiters="after",
flags=re.IGNORECASE,
)
splitter_output = splitter.split(reader_output)
print(splitter_output.model_dump_json(indent=4))
{
"chunks": [
"The Project Gutenberg eBook of Pinocchio: The Tale of a Puppet\r\n \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of
...
],
[
203258,
203271
],
[
214484,
214496
],
[
222566,
222579
]
],
"include_delimiters": "after",
"flags": 2,
"pattern_names": [
"chapter"
],
"chunk_size": 100000
}
}
}
As you can see, all the chunks start with CHAPTER
+ a roman number, so the Splitter has divided the text by chapters correctly.
Step 3: Visualize the chunks¶
To visualize the chunks, you can execute the following code:
for idx, chunk in enumerate(splitter_output.chunks):
print("=" * 40 + f" Chunk {idx + 1} " + "=" * 40)
print(chunk[:800] + "\n")
======================================== Chunk 1 ========================================
The Project Gutenberg eBook of Pinocchio: The Tale of a Puppet
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United
...
e
fever.
Was he trembling from cold or from fear. Perhaps a little from both the
one and the other. But Pinocchio, thinking it was from fear, said, to
comfort him:
"Courage, papa! In a few minutes we shall be safely on shore."
"But where is this blessed shore?" asked the little old man, becoming
still more frightened, and screwing up his eyes as tailors do when they
wish to thread a needle. "I have been looking in every direction and I
see nothing but the sky and the sea."
"But I see the s
Step 4: Analyze metadata¶
meta = splitter_output.metadata["keyword_matches"]
print("Chapter matches:", meta["counts"])
print("Spans:", meta["spans"][:5])
print("Splitter parameters:", splitter_output.split_params)
Chapter matches: {'chapter': 36}
Spans: [(7063, 7072), (10320, 10330), (14085, 14096), (20136, 20146), (23487, 23496)]
Splitter parameters: {'include_delimiters': 'after', 'flags': re.IGNORECASE, 'chunk_size': 100000, 'pattern_names': ['chapter']}
You can also verify the configuration:
print(splitter_output.split_params)
{'include_delimiters': 'after', 'flags': re.IGNORECASE, 'chunk_size': 100000, 'pattern_names': ['chapter']}
Note
You can experiment with:
* include_delimiters
= "after"
, "both"
, or "none"
to control delimiter placement.
* Adjust chunk_size
for soft wrapping inside chapters.
* Use multiple patterns (dict or list) to split by different keywords simultaneously.
And that’s it! This notebook demonstrates how to segment Pinocchio into chapters interactively with KeywordSplitter
.
Complete script¶
import re
from splitter_mr.reader import VanillaReader
from splitter_mr.splitter import KeywordSplitter
FILE_PATH = "https://www.gutenberg.org/cache/epub/16865/pg16865.txt"
# Step 1: Read Pinocchio
reader = VanillaReader()
reader_output = reader.read(file_path=FILE_PATH)
print(reader_output.model_dump_json(indent=4)) # inspect ReaderOutput
# Step 2: Split by chapter markers
chapter_pattern = r"CHAPTER\s+[IVXLCDM]+"
splitter = KeywordSplitter(
patterns={"chapter": chapter_pattern},
include_delimiters="before",
flags=re.IGNORECASE,
chunk_size=3000,
)
splitter_output = splitter.split(reader_output)
print(splitter_output.model_dump_json(indent=4)) # inspect SplitterOutput
# Step 3: Visualize chunks
for idx, chunk in enumerate(splitter_output.chunks):
print("=" * 40 + f" Chapter {idx + 1} " + "=" * 40)
print(chunk[:800] + "\n")
# Step 4: Explore metadata
meta = splitter_output.metadata["keyword_matches"]
print("Counts:", meta["counts"])
print("First spans:", meta["spans"][:5])
print("Splitter parameters:", splitter_output.split_params)
{
"text": "The Project Gutenberg eBook of Pinocchio: The Tale of a Puppet\r\n \r\nThis ebook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this ebook or online\r\nat www.gutenberg.org. If you are not located in the United States,\r\nyou will have to check the laws of the country
...
resses. Donations are accepted in a number of other
ways including checks, online payments and credit card donations. To
donate, please visit: www.gutenberg.org/donate.
Section 5. General Information About Project Gutenberg™ electronic works
Professor
Counts: {'chapter': 36}
First spans: [(7063, 7072), (10320, 10330), (14085, 14096), (20136, 20146), (23487, 23496)]
Splitter parameters: {'include_delimiters': 'before', 'flags': re.IGNORECASE, 'chunk_size': 3000, 'pattern_names': ['chapter']}