Example: Splitting Tabular Data with RowColumnSplitter

Tabular files such as CSVs, TSVs, or Markdown tables are ubiquitous in business and data workflows, but they can become too large for direct LLM ingestion, annotation, or analysis. SplitterMR's RowColumnSplitter provides flexible chunking for tabular data: it splits tables by rows, columns, or character size while preserving the structural integrity of each chunk.

Step 1: Read the Tabular File

Let's use the VanillaReader to load a CSV file:

from splitter_mr.reader import VanillaReader

file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.csv"
reader = VanillaReader()
reader_output = reader.read(file)

# Print metadata and content
print(reader_output)
print(reader_output.text)

Sample output:

ReaderOutput(
    text='id,name,amount,Remark\n1,"Johnson, Smith, and Jones Co.",345.33,Pays on time\n2,"Sam ""Mad Dog"" Smith",993.44,\n3,Barney & Company,0,"Great to work with and always pays with cash."\n4,Johnson\'s Automotive,2344,',
    document_name='invoices.csv',
    document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.csv',
    document_id='4eb99715-b11c-4a34-b0be-724ae6e34e4c',
    conversion_method='csv',
    reader_method='vanilla',
    ocr_method=None,
    metadata={}
)

The file content is available through the text attribute:

id,name,amount,Remark
1,"Johnson, Smith, and Jones Co.",345.33,Pays on time
2,"Sam ""Mad Dog"" Smith",993.44,
3,Barney & Company,0,"Great to work with
and always pays with cash."
4,Johnson's Automotive,2344,

Rendered as a markdown table, the data looks like this:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 1 | Johnson, Smith, and Jones Co. | 345.33 | Pays on time |
| 2 | Sam "Mad Dog" Smith | 993.44 | nan |
| 3 | Barney & Company | 0 | Great to work with and always pays with cash. |
| 4 | Johnson's Automotive | 2344 | nan |
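
To make the CSV-to-table step concrete, here is a minimal sketch that uses only Python's standard csv module. The helper name csv_to_markdown is hypothetical and is not part of SplitterMR; the library's own conversion may differ in layout and in how it fills empty cells. The sample text is the one from the reader output above.

```python
import csv
import io

# Sample CSV text copied from the reader output above.
csv_text = (
    'id,name,amount,Remark\n'
    '1,"Johnson, Smith, and Jones Co.",345.33,Pays on time\n'
    '2,"Sam ""Mad Dog"" Smith",993.44,\n'
    '3,Barney & Company,0,"Great to work with and always pays with cash."\n'
    "4,Johnson's Automotive,2344,"
)


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a pipe-delimited markdown table (hypothetical helper)."""
    # csv.reader handles quoted commas and doubled quotes inside fields.
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Assumption: empty cells are shown as "nan" to mirror the table above.
        cells = [cell if cell else "nan" for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


markdown_table = csv_to_markdown(csv_text)
print(markdown_table)
```

Parsing with csv.reader rather than splitting on commas is what keeps "Johnson, Smith, and Jones Co." in a single cell.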

Step 2: Split the Table

2.1. Split by Character Size (row-wise, preserving full rows)

Split into chunks such that each chunk's markdown table representation stays under a character limit:

from splitter_mr.splitter import RowColumnSplitter

splitter = RowColumnSplitter(chunk_size=200)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

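Chunk 1:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 1 | Johnson, Smith, and Jones Co. | 345.33 | Pays on time |
| 2 | Sam "Mad Dog" Smith | 993.44 | nan |

Chunk 2:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 3 | Barney & Company | 0 | Great to work with and always pays with cash. |

Chunk 3:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 4 | Johnson's Automotive | 2344 | nan |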

Each output chunk is a valid markdown table that contains the header and as many full rows as fit within the character limit.

Note

No chunk will ever split a row or a column in half.
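
The idea of packing whole rows under a character limit can be sketched as a greedy loop. This is an illustration under stated assumptions, not RowColumnSplitter's actual implementation: split_rows_by_size is a hypothetical helper, the rows are toy data, and the choice to give an oversized row its own chunk is an assumption.

```python
def split_rows_by_size(header: str, rows: list[str], chunk_size: int) -> list[str]:
    """Greedily pack whole rows into chunks of at most chunk_size characters.

    Hypothetical helper: every chunk repeats the header, and a row that is
    too long on its own still gets a chunk to itself instead of being cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    for row in rows:
        candidate = "\n".join([header, *current, row])
        if current and len(candidate) > chunk_size:
            # Adding this row would exceed the limit: close the current chunk.
            chunks.append("\n".join([header, *current]))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append("\n".join([header, *current]))
    return chunks


# Toy rows for illustration only.
header = "| id | name |"
rows = ["| 1 | Ann |", "| 2 | Bob |", "| 3 | Cy |"]
size_chunks = split_rows_by_size(header, rows, chunk_size=40)

for chunk in size_chunks:
    print(chunk + "\n")
```

Because the header is counted in every candidate chunk, a small chunk_size leaves less room for data rows than the raw limit suggests.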

2.2. Split by a Fixed Number of Rows

Set num_rows to split the table into smaller tables, each with a fixed number of rows:

splitter = RowColumnSplitter(num_rows=2)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

The output will be:

Chunk 1:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 1 | Johnson, Smith, and Jones Co. | 345.33 | Pays on time |
| 2 | Sam "Mad Dog" Smith | 993.44 | nan |

Chunk 2:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 3 | Barney & Company | 0 | Great to work with and always pays with cash. |
| 4 | Johnson's Automotive | 2344 | nan |
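
Conceptually, splitting by a fixed number of rows is plain slicing with the header repeated in each piece. The sketch below assumes the table has already been parsed into a header and a list of rows; split_by_rows is a hypothetical name, not a SplitterMR function.

```python
def split_by_rows(header: list[str], rows: list[list[str]], num_rows: int) -> list[list[list[str]]]:
    """Return sub-tables of at most num_rows data rows, each starting with the header."""
    if num_rows <= 0:
        raise ValueError("num_rows must be a positive integer")
    return [
        [header, *rows[start:start + num_rows]]
        for start in range(0, len(rows), num_rows)
    ]


# Rows taken from the invoices example above (empty Remark cells kept as "").
header = ["id", "name", "amount", "Remark"]
rows = [
    ["1", "Johnson, Smith, and Jones Co.", "345.33", "Pays on time"],
    ["2", 'Sam "Mad Dog" Smith', "993.44", ""],
    ["3", "Barney & Company", "0", "Great to work with and always pays with cash."],
    ["4", "Johnson's Automotive", "2344", ""],
]
row_chunks = split_by_rows(header, rows, num_rows=2)

for table in row_chunks:
    print([row[0] for row in table])
```

When the row count is not a multiple of num_rows, the last sub-table in this sketch is simply shorter.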

2.3. Split by a Fixed Number of Columns

Set num_cols to split the table into column groups, each containing a fixed set of columns (e.g., for wide tables):

splitter = RowColumnSplitter(num_cols=2)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

Each chunk is printed as a list of columns, where every inner list starts with the column name followed by its values:

======================================== Chunk 1 ========================================
[['id', 1, 2, 3, 4], ['name', 'Johnson, Smith, and Jones Co.', 'Sam "Mad Dog" Smith', 'Barney & Company', "Johnson's Automotive"]]

======================================== Chunk 2 ========================================
[['amount', 345.33, 993.44, 0.0, 2344.0], ['Remark', 'Pays on time', nan, 'Great to work with and always pays with cash.', nan]]
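
The column-wise layout shown above can be reproduced by transposing the table and slicing the resulting columns. This is a minimal sketch with a hypothetical helper, split_by_columns; unlike the library output, it keeps every value as a string and does no type inference.

```python
def split_by_columns(header: list[str], rows: list[list[str]], num_cols: int) -> list[list[list[str]]]:
    """Group columns into chunks of at most num_cols columns each."""
    if num_cols <= 0:
        raise ValueError("num_cols must be a positive integer")
    # Transpose: each column becomes [name, value_1, value_2, ...].
    columns = [[name, *(row[i] for row in rows)] for i, name in enumerate(header)]
    return [columns[start:start + num_cols] for start in range(0, len(columns), num_cols)]


# First two rows of the invoices example above.
header = ["id", "name", "amount", "Remark"]
rows = [
    ["1", "Johnson, Smith, and Jones Co.", "345.33", "Pays on time"],
    ["2", 'Sam "Mad Dog" Smith', "993.44", ""],
]
col_chunks = split_by_columns(header, rows, num_cols=2)

for chunk in col_chunks:
    print(chunk)
```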

2.4. Add Overlapping Rows/Columns

Use chunk_overlap (an int, or a float between 0 and 1) to specify how many rows or columns are repeated between consecutive chunks to preserve context:

splitter = RowColumnSplitter(chunk_size=150, chunk_overlap=0.2)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

Consecutive chunks now share an overlapping row:

Chunk 1:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 1 | Johnson, Smith, and Jones Co. | 345.33 | Pays on time |
| 2 | Sam "Mad Dog" Smith | 993.44 | nan |
| 3 | Barney & Company | 0 | Great to work with and always pays with cash. |

Chunk 2:

| id | name | amount | Remark |
| --- | --- | --- | --- |
| 3 | Barney & Company | 0 | Great to work with and always pays with cash. |
| 4 | Johnson's Automotive | 2344 | nan |

Note

The chunk_overlap parameter applies to both row-wise and column-wise splitting.
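
Overlap can be pictured as a sliding window whose step is smaller than its size. The sketch below is an assumption-laden illustration, not the library's algorithm: split_with_overlap is a hypothetical helper that counts overlap in whole items (rows or columns), whereas RowColumnSplitter may derive the overlap from chunk_size in characters.

```python
def split_with_overlap(items: list, size: int, overlap: int) -> list[list]:
    """Return windows of `size` items where neighbours share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(items), step):
        windows.append(items[start:start + size])
        if start + size >= len(items):
            # The last window already reaches the end of the data.
            break
    return windows


# Row ids from the invoices example, windows of 3 rows sharing 1 row.
row_windows = split_with_overlap([1, 2, 3, 4], size=3, overlap=1)
print(row_windows)

# The same helper applied to column names.
col_windows = split_with_overlap(["id", "name", "amount", "Remark"], size=2, overlap=1)
print(col_windows)
```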

And that's it! This example used a CSV file, but other file formats can be processed too. The compatible file extensions are csv, tsv, md, txt, and tabular json. Parquet files, which are processed as JSON, are supported as well.

Warning

Setting both num_rows and num_cols will raise an error. If chunk_overlap is a float, it is interpreted as a percentage (e.g., 0.2 means 20% overlap).
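
The two rules in this warning can be sketched as small checks. Everything here is hypothetical: check_options and resolve_overlap are illustrative names, the ValueError type is an assumption (the library may raise a different exception), and treating a float overlap as a fraction of chunk_size is also an assumption.

```python
def check_options(num_rows: int = 0, num_cols: int = 0) -> None:
    """Reject configurations that ask for both row and column splitting."""
    if num_rows and num_cols:
        # Assumption: the real exception type may differ.
        raise ValueError("num_rows and num_cols are mutually exclusive")


def resolve_overlap(chunk_overlap, chunk_size: int) -> int:
    """Turn a fractional overlap into an absolute amount; pass ints through."""
    if isinstance(chunk_overlap, float) and 0 <= chunk_overlap < 1:
        # Assumption: the fraction is taken relative to chunk_size.
        return int(chunk_size * chunk_overlap)
    return int(chunk_overlap)


check_options(num_rows=2)  # a single mode is accepted

try:
    check_options(num_rows=2, num_cols=2)
except ValueError as exc:
    print(f"Rejected: {exc}")

half_overlap = resolve_overlap(0.5, chunk_size=200)
fixed_overlap = resolve_overlap(3, chunk_size=200)
print(half_overlap, fixed_overlap)
```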

3. Use cases

RowColumnSplitter is useful when you need to:

  • Split large tabular datasets into LLM-friendly or context-aware chunks.
  • Preserve row/column integrity in CSV, TSV, or markdown data.
  • Chunk data with overlap for annotation, document search, or analysis.

Complete script

from splitter_mr.reader import VanillaReader
from splitter_mr.splitter import RowColumnSplitter

file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.csv"

reader = VanillaReader()
reader_output = reader.read(file)

# Visualize the ReaderOutput object
print(reader_output)

# Access the text content
print(reader_output.text)

print("*"*20 + " Split by rows based on chunk size " + "*"*20)

splitter = RowColumnSplitter(chunk_size=200)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

print("*"*20 + " Split by a specific number of rows " + "*"*20)

splitter = RowColumnSplitter(num_rows=2)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

print("*"*20 + " Split by a specific number of columns " + "*"*20)

splitter = RowColumnSplitter(num_cols=2)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")

print("*"*20 + " Split with overlap " + "*"*20)

splitter = RowColumnSplitter(chunk_size=300, chunk_overlap=0.4)
splitter_output = splitter.split(reader_output)

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {idx + 1} " + "="*40 + "\n" + chunk + "\n")