
Example: Split a Document by Tokens with TokenSplitter (SpaCy, NLTK, tiktoken)

In this example, we will use several popular NLP libraries to split a text document into token-based chunks. A token is the minimal lexical unit into which text is divided. Tokenization can be performed in many ways: by words, by characters, by lemmas, and so on. One of the most common approaches is sub-word tokenization.

Observe the following example:

Tokenization illustration

Every Large Language Model (LLM) uses a tokenizer to process text into comprehensible lexical units. Hence, splitting by tokens is a suitable option to produce fixed-length chunks that fit within the LLM context window. In this tutorial, we show how to split text using three tokenizers: SpaCy, NLTK, and tiktoken (OpenAI tokenization). Let's see!
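
For instance, you can see sub-word tokenization in action by encoding a sentence with tiktoken and inspecting the resulting pieces. This is a minimal sketch (it assumes tiktoken is installed and uses the cl100k_base encoding, the same one used later in this tutorial):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("My grandmother cooks the best food!")
print(tokens)                                  # a list of integer token IDs
print([encoding.decode([t]) for t in tokens])  # the sub-word pieces behind each ID
print(f"Number of tokens: {len(tokens)}")      # token count, useful for sizing chunks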


Step 1: Read the Text Using a Reader

We will start by reading a text file using the MarkItDownReader. Remember that you can use any other compatible Reader. Simply instantiate a Reader object and call its read method, passing the file to be read, which can be a URL, a variable, or a path:

from splitter_mr.reader import MarkItDownReader

file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt"
reader = MarkItDownReader()
reader_output = reader.read(file)

The output is a ReaderOutput object:

print(reader_output)
ReaderOutput(
    text='My Wonderful Family\nI live in a house near the mountains. ...',
    document_name='my_wonderful_family.txt',
    document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt',
    document_id='9a72ac14-0fad-41ab-992f-3aaf2fa97afd',
    conversion_method='markdown',
    reader_method='markitdown',
    ocr_method=None,
    metadata={}
)

To see only the document text, you can access the text attribute of this object:

print(reader_output.text)
My Wonderful Family
I live in a house near the mountains. I have two brothers and one sister, and I was born last. My father teaches mathematics, and my mother is a nurse at a big hospital. My brothers are very smart and work hard in school. My sister is a nervous girl, but she is very kind. My grandmother also lives with us. She came from Italy when I was two years old. She has grown old, but she is still very strong. She cooks the best food!

My family is very important to me. We do lots of things together. My brothers and I like to go on long walks in the mountains. My sister likes to cook with my grandmother. On the weekends we all play board games together. We laugh and always have a good time. I love my family very much.

Step 2: Split the Document by Tokens

As mentioned, the TokenSplitter lets you pick the tokenization backend: SpaCy, NLTK, or tiktoken. Use whichever one suits your needs. For every tokenizer, you must pass:

  • A chunk_size: the maximum chunk size, in characters, for the tokenization process. The splitter tries never to cut a sentence across two chunks.
  • A model_name: the tokenizer model to use. It must always follow the structure {tokenizer}/{model_name}, e.g., tiktoken/cl100k_base.

Note

For spaCy and tiktoken, the corresponding models must be installed in your environment.

To see a complete list of available tokenizers, refer to Available models.
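
If these are not installed yet, a minimal setup sketch might look like the following (this assumes the packages themselves are already installed, e.g., via pip install spacy nltk tiktoken; the model names are the ones used later in this tutorial, so adjust them to your language and task):

import nltk
import spacy.cli

spacy.cli.download("en_core_web_sm")  # spaCy English pipeline used in Section 2.1
nltk.download("punkt")                # NLTK's Punkt sentence tokenizer data used in Section 2.2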

2.1. Split by Tokens Using SpaCy

To split using a spaCy tokenizer model, first instantiate the TokenSplitter class and set its parameters. Then call the split method, passing the ReaderOutput object from the previous step:

from splitter_mr.splitter import TokenSplitter

spacy_splitter = TokenSplitter(
    chunk_size=100, 
    model_name="spacy/en_core_web_sm" # Use the SpaCy model with "spacy/{model_name}" format
)
spacy_output = spacy_splitter.split(reader_output)

print(spacy_output)  # See the SplitterOutput object
SplitterOutput(
    chunks=[
        'My Wonderful Family\nI live in a house near the mountains.', 'I have two brothers and one sister, and I was born last.', 'My father teaches mathematics, and my mother is a nurse at a big hospital.', 'My brothers are very smart and work hard in school.', 'My sister is a nervous girl, but she is very kind.\n\nMy grandmother also lives with us.', 'She came from Italy when I was two years old.\n\nShe has grown old, but she is still very strong.', 'She cooks the best food!\n\n\n\nMy family is very important to me.\n\nWe do lots of things together.', 'My brothers and I like to go on long walks in the mountains.', 'My sister likes to cook with my grandmother.\n\nOn the weekends we all play board games together.', 'We laugh and always have a good time.\n\nI love my family very much.'
        ], 
    chunk_id=[
        '8225f436-b039-4b54-9472-093dee2068d8', '1c347f11-421f-4549-9074-0dbe18072eb8', '1582e42a-aac2-46ba-bfe0-c87a25b452f4', '82d76292-103e-4a94-9ea4-8bbe4e321a6c', '36d5d71d-3a2c-42c7-a722-7b65dcf3ffc0', 'cb4c57d9-1174-49f4-a7d3-6e52a96bb8ed', '9f67d776-f2df-4cc7-864e-1f5b43b36658', 'ef130c6f-7b69-430a-9af5-4c1c1b72ac99', '744aef78-806b-4439-880e-7921111659d2', 'd3355854-099d-4a05-ae63-94ee0954fd92'
    ], 
    document_name='my_wonderful_family.txt', 
    document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt', 
    document_id='90cf6e00-b4ca-439e-9b3d-3bd8713934b4',
    conversion_method='markdown', 
    reader_method='markitdown', 
    ocr_method=None, 
    split_method='token_splitter', 
    split_params={
        'chunk_size': 100, 'model_name': 'spacy/en_core_web_sm', 'language': 'english'
        }, 
    metadata={}
    )

To see the resulting chunks, you can use the following code:

# Visualize each chunk
for idx, chunk in enumerate(spacy_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")
======================================== Chunk 1 ========================================
My Wonderful Family
I live in a house near the mountains.

======================================== Chunk 2 ========================================
I have two brothers and one sister, and I was born last.
...

2.2. Split by Tokens Using NLTK

Similarly, you can use an NLTK tokenizer. This library always uses punkt as the tokenizer, but you can customize the language through the language parameter.

nltk_splitter = TokenSplitter(
    chunk_size=100,
    model_name="nltk/punkt",   # Use the NLTK model as "nltk/{model_name}"
    language="english"         # Defaults to English
)
nltk_output = nltk_splitter.split(reader_output)

for idx, chunk in enumerate(nltk_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")
======================================== Chunk 1 ========================================
My Wonderful Family
I live in a house near the mountains.

======================================== Chunk 2 ========================================
I have two brothers and one sister, and I was born last.
...

As you can see, the results are essentially the same as with SpaCy.
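
If you want to check this programmatically, you can compare both chunk lists directly (a quick sanity check using the outputs obtained above; the chunks may still differ slightly in whitespace):

print(len(spacy_output.chunks), len(nltk_output.chunks))  # number of chunks per backend
print(spacy_output.chunks == nltk_output.chunks)          # True only if both lists are identical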

2.3. Split by Tokens Using tiktoken (OpenAI)

tiktoken is one of the most widely used tokenizers. In this case, the tokenizer splits by the number of tokens and breaks chunks where \n\n is detected. Hence, the results are the following:

tiktoken_splitter = TokenSplitter(
    chunk_size=100,
    model_name="tiktoken/cl100k_base", # Use the tiktoken model as "tiktoken/{model_name}"
    language="english"
)
tiktoken_output = tiktoken_splitter.split(reader_output)

for idx, chunk in enumerate(tiktoken_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")
======================================== Chunk 1 ========================================
My Wonderful Family

======================================== Chunk 2 ========================================
I live in a house near the mountains. I have two brothers and one sister, and I was born last. My father teaches mathematics, and my mother is a nurse at a big hospital. My brothers are very smart and work hard in school. My sister is a nervous girl, but she is very kind. My grandmother also lives with us. She came from Italy when I was two years old. She has grown old, but she is still very strong. She cooks the best food!
...
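
Since this splitter works on token counts, you can inspect how many cl100k_base tokens each chunk contains. This is a small sketch reusing the tiktoken encoding directly (not part of the TokenSplitter API):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
for idx, chunk in enumerate(tiktoken_output.chunks):
    print(f"Chunk {idx + 1}: {len(encoding.encode(chunk))} tokens")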

Extra: Split by Tokens in Other Languages (e.g., Spanish)

In the previous examples, we showed how to split text by tokens, but those models were adapted to English. If your texts are in other languages, you can use other tokenizers. Here are two examples with SpaCy and NLTK (tiktoken is multilingual by default):

from splitter_mr.reader import DoclingReader

sp_file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/mi_nueva_casa.txt"
sp_reader = DoclingReader()
sp_reader_output = sp_reader.read(sp_file)
print(sp_reader_output.text)
Mi nueva casa
Yo vivo en Granada, una ciudad pequeña que tiene monumentos muy importantes como la Alhambra. Aquí la comida es deliciosa y son famosos el gazpacho, el rebujito y el salmorejo.

Mi nueva casa está en una calle ancha que tiene muchos árboles. El piso de arriba de mi casa tiene tres dormitorios y un despacho para trabajar. El piso de abajo tiene una cocina muy grande, un comedor con una mesa y seis sillas, un salón con dos sofás verdes, una televisión y cortinas. Además, tiene una pequeña terraza con piscina donde puedo tomar el sol en verano.

Me gusta mucho mi casa porque puedo invitar a mis amigos a cenar o a ver el fútbol en mi televisión. Además, cerca de mi casa hay muchas tiendas para hacer la compra, como panadería, carnicería y pescadería.

Split Spanish by Tokens Using SpaCy

spacy_sp_splitter = TokenSplitter(
    chunk_size=100,
    model_name="spacy/es_core_news_sm"  # Use a Spanish SpaCy model
)
spacy_sp_output = spacy_sp_splitter.split(sp_reader_output)

for idx, chunk in enumerate(spacy_sp_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")
Created a chunk of size 107, which is longer than the specified 100
Created a chunk of size 142, which is longer than the specified 100
======================================== Chunk 1 ========================================
Mi nueva casa
Yo vivo en Granada, una ciudad pequeña que tiene monumentos muy importantes como la Alhambra.

======================================== Chunk 2 ========================================
Aquí la comida es deliciosa y son famosos el gazpacho, el rebujito y el salmorejo.
...

Split Spanish by Tokens Using NLTK

nltk_sp_splitter = TokenSplitter(
    chunk_size=100,
    model_name="nltk/punkt",
    language="spanish"
)
nltk_sp_output = nltk_sp_splitter.split(sp_reader_output)

for idx, chunk in enumerate(nltk_sp_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")
Created a chunk of size 107, which is longer than the specified 100
Created a chunk of size 142, which is longer than the specified 100
======================================== Chunk 1 ========================================
Mi nueva casa
Yo vivo en Granada, una ciudad pequeña que tiene monumentos muy importantes como la Alhambra.

======================================== Chunk 2 ========================================
Aquí la comida es deliciosa y son famosos el gazpacho, el rebujito y el salmorejo.

And that’s it! You can now tokenize and chunk text with precision, using the NLP backend and language that best fits your project.

Note

For best results, make sure to install any SpaCy/NLTK/tiktoken models needed for your language and task.

Complete Script

from splitter_mr.reader import DoclingReader, MarkItDownReader
from splitter_mr.splitter import TokenSplitter

# 1. Read the file using any Reader (e.g., MarkItDownReader)

file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/my_wonderful_family.txt"

reader = MarkItDownReader()
reader_output = reader.read(file)
print(reader_output.text)

# 2. Split by Tokens

## 2.1. Using SpaCy

print("*"*40 + " spaCy " + "*"*40)

spacy_splitter = TokenSplitter(
    chunk_size=100,
    model_name="spacy/en_core_web_sm"  # Select a valid model with the spacy/{model_name} nomenclature
)
# Note that the model must be installed on the machine where this code runs.

spacy_output = spacy_splitter.split(reader_output) # Split the text
print(spacy_output)  # Print the SplitterOutput object

# Visualize each chunk
for idx, chunk in enumerate(spacy_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

## 2.2. Using NLTK

print("*"*40 + " NLTK " + "*"*40)

nltk_splitter = TokenSplitter(
    chunk_size=100,
    model_name="nltk/punkt", # introduce the model as nltk/{model_name}
    language="english" # defaults to this language
)

nltk_output = nltk_splitter.split(reader_output)

# Visualize each chunk
for idx, chunk in enumerate(nltk_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

## 2.3. Using tiktoken

print("*"*40 + " Tiktoken " + "*"*40)

tiktoken_splitter = TokenSplitter(
    chunk_size=100,
    model_name="tiktoken/cl100k_base", # introduce the model as tiktoken/{model_name}
    language="english"
)

tiktoken_output = tiktoken_splitter.split(reader_output)

# Visualize each chunk
for idx, chunk in enumerate(tiktoken_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

## 2.4. Split by tokens in other languages (e.g., Spanish)

sp_file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/mi_nueva_casa.txt"

sp_reader = DoclingReader()
sp_reader_output = sp_reader.read(sp_file)
print(sp_reader_output.text) # Visualize the text content

### 2.4.1. Using SpaCy

print("*"*40 + " Spacy in Spanish " + "*"*40)

spacy_sp_splitter = TokenSplitter(
    chunk_size=100,
    model_name="spacy/es_core_news_sm",  # Pick a Spanish spaCy model
)
spacy_sp_output = spacy_sp_splitter.split(sp_reader_output)

for idx, chunk in enumerate(spacy_sp_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

### 2.4.2 Using NLTK

print("*"*40 + " NLTK in Spanish " + "*"*40)

nltk_sp_splitter = TokenSplitter(
    chunk_size=100,
    model_name="nltk/punkt",
    language="spanish"  # select `spanish` as the language for the tokenizer
)
nltk_sp_output = nltk_sp_splitter.split(sp_reader_output)

for idx, chunk in enumerate(nltk_sp_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

Available models

There are several tokenizer models that you can use to split your text. The following table provides a summary of the models you can currently use, along with some implementation examples:

Library      | Model identifier/template                                                                                | How to implement     | Reference Guide
NLTK (Punkt) | <language>                                                                                               | See NLTK Example     | NLTK Tokenizers
Tiktoken     | <encoder>                                                                                                | See Tiktoken Example | tiktoken
spaCy        | {CC}_core_web_sm, {CC}_core_web_md, {CC}_core_web_lg, {CCe}_core_web_trf, xx_ent_wiki_sm, xx_sent_ud_sm | See spaCy Example    | spaCy Models
About spaCy model templates

spaCy Model Suffixes:

  • sm (small): Fastest, small in size, less accurate; good for prototyping and lightweight use cases.
  • md (medium): Medium size and accuracy; balances speed and performance.
  • lg (large): Largest and most accurate pipeline with the most vectors; slower and uses more memory.
  • trf (transformer): Uses transformer-based architectures (e.g., BERT, RoBERTa); highest accuracy, slowest, and requires more resources.

spaCy Model Prefixes:

  • CC codes: ca (Catalan), zh (Chinese), hr (Croatian), da (Danish), nl (Dutch), en (English), fi (Finnish), fr (French), de (German), el (Greek), it (Italian), ja (Japanese), ko (Korean), lt (Lithuanian), mk (Macedonian), nb (Norwegian Bokmål), pl (Polish), pt (Portuguese), ro (Romanian), ru (Russian), sl (Slovenian), es (Spanish), sv (Swedish), uk (Ukrainian). Note that most non-English pipelines are named {CC}_core_news_* rather than {CC}_core_web_* (e.g., es_core_news_sm, used in the Spanish example above).
  • CCe codes (for trf): ca, zh, en, fr, de, ja, sl, es, uk

NLTK Example

language = "english"
TokenSplitter(
    model_name="nltk/punkt",
    language=language
)

Tiktoken Example

encoder = "cl100k_base"
TokenSplitter(
    model_name=f"tiktoken/{encoder}"
)

spaCy Example

CC = "en"
ext = "sm"
encoder = f"{CC}_core_web_{ext}"
TokenSplitter(
    model_name=f"spacy/{encoder}"
)