Split JSON recursively

Example: Splitting JSON Files with `RecursiveJSONSplitter`¶

When working with structured data such as invoices, user records, or any other JSON document, it's often necessary to split the data into manageable chunks for downstream processing, storage, or LLM ingestion. SplitterMR provides the RecursiveJSONSplitter, an splitter which divides a JSON structure into key-based chunks, preserving the hierarchy and content integrity. Let's see how it works!

JSON object

Step 1: Read the JSON Document¶

First, use the VanillaReader to load the JSON file. You can use other Reader methods as your choice. Note that you can read from an URL, Path or variable.

from splitter_mr.reader import VanillaReader

file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.json"  # Path to your JSON file
reader = VanillaReader()
reader_output = reader.read(file)

print(reader_output)        # Show metadata and summary

ReaderOutput(
    text='[\n   {\n      "id": 1,\n      "name": "Johnson, Smith, and Jones Co.",\n      "amount": 345.33,\n      "Remark": "Pays on time"\n   },\n   {\n      "id": 2,\n      "name": "Sam \\"Mad Dog\\" Smith",\n      "amount": 993.44,\n      "Remark": ""\n   },\n   {\n      "id": 3,\n      "name": "Barney & Company",\n      "amount": 0,\n      "Remark": "Great to work with\\nand always pays with cash."\n   },\n   {\n      "id": 4,\n      "name": "Johnson\'s Automotive",\n      "amount": 2344,\n      "Remark": ""\n   }\n]', 
    document_name='invoices.json', 
    document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.json', 
    document_id='103c305c-60c8-485a-9f43-73db3216e2a4', 
    conversion_method='json', 
    reader_method='vanilla', 
    ocr_method=None, 
    metadata={}
    )

Accessing to the text attribute you find the following JSON object:

print(reader_output.text)

[
   {
      "id": 1,
      "name": "Johnson, Smith, and Jones Co.",
      "amount": 345.33,
      "Remark": "Pays on time"
   },
   {
      "id": 2,
      "name": "Sam \"Mad Dog\" Smith",
      "amount": 993.44,
      "Remark": ""
   },
   {
      "id": 3,
      "name": "Barney & Company",
      "amount": 0,
      "Remark": "Great to work with\nand always pays with cash."
   },
   {
      "id": 4,
      "name": "Johnson's Automotive",
      "amount": 2344,
      "Remark": ""
   }
]

This is a JSON dataset with some sample invoices.

id	name	amount	Remark
1	Johnson, Smith, and Jones Co.	345.33	Pays on time
2	Sam "Mad Dog" Smith	993.44
3	Barney & Company	0	Great to work with and always pays with cash.
4	Johnson's Automotive	2344

Step 2: Split the JSON Document¶

To split the text, instantiate the RecursiveJSONSplitter and split the loaded JSON content:

from splitter_mr.splitter import RecursiveJSONSplitter

splitter = RecursiveJSONSplitter(chunk_size=100, min_chunk_size=20)
splitter_output = splitter.split(reader_output)

print(splitter_output)  # Show the SplitterOutput object

You will get an object like this:

SplitterOutput(
    chunks=[{'0': {'id': 1, 'name': 'Johnson, Smith, and Jones Co.', 'amount': 345.33, 'Remark': 'Pays on time'}}, {'1': {'id': 2, 'name': 'Sam "Mad Dog" Smith', 'amount': 993.44, 'Remark': ''}}, {'2': {'id': 3, 'name': 'Barney & Company', 'amount': 0, 'Remark': 'Great to work with\nand always pays with cash.'}}, {'3': {'id': 4, 'name': "Johnson's Automotive", 'amount': 2344, 'Remark': ''}}], 
    chunk_id=['58556605-6e2b-4994-b764-56555b1eb68f', '74826121-4ef7-42a6-8288-7d950feaf0c5', 'e73b6464-2ec2-49a8-9387-a31ab9295b49', '4368e0f4-c1f2-440a-8aec-6f9d28eb07cd'], 
    document_name='invoices.json', 
    document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.json', 
    document_id='103c305c-60c8-485a-9f43-73db3216e2a4', 
    conversion_method='json', 
    reader_method='vanilla', 
    ocr_method=None, 
    split_method='recursive_json_splitter', 
    split_params={
        'max_chunk_size': 100, 'min_chunk_size': 20
        },
    metadata={}
    )

To inspect every chunk individually, print them as follows:

for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {int(idx + 1)} " + "="*40 + "\n" + str(chunk[str(idx)]) + "\n")

======================================== Chunk 1 ========================================
{'id': 1, 'name': 'Johnson, Smith, and Jones Co.', 'amount': 345.33, 'Remark': 'Pays on time'}

======================================== Chunk 2 ========================================
{'id': 2, 'name': 'Sam "Mad Dog" Smith', 'amount': 993.44, 'Remark': ''}

======================================== Chunk 3 ========================================
{'id': 3, 'name': 'Barney & Company', 'amount': 0, 'Remark': 'Great to work with\nand always pays with cash.'}

======================================== Chunk 4 ========================================
{'id': 4, 'name': "Johnson's Automotive", 'amount': 2344, 'Remark': ''}

In markdown format table:

id	name	amount	Remark
1	Johnson, Smith, and Jones Co.	345.33	Pays on time

id	name	amount	Remark
2	Sam "Mad Dog" Smith	993.44

id	name	amount	Remark
3	Barney & Company	0	Great to work with and always pays with cash.

id	name	amount	Remark
4	Johnson's Automotive	2344

And that's it! As you can see, a chunk for every row in the JSON table has been generated.

Note

All the objects obtained by the SplitterOutput .chunks attribute are dict. In case that you need to transform it into str elements, you need to process them with the instruction str(chunk[idx]).

Complete Example Script¶

from splitter_mr.reader import VanillaReader
from splitter_mr.splitter import RecursiveJSONSplitter

file = "https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/invoices.json"

reader = VanillaReader() # Load a Reader
reader_output = reader.read(file) # Read the file

print(reader_output) # ReaderOutput object
print(reader_output.text) # Visualize the file

splitter = RecursiveJSONSplitter(chunk_size = 100, min_chunk_size=20) # Instantiate the Splitter class
splitter_output = splitter.split(reader_output) # Split the text

print(splitter_output) # SplitterOutput object

# Visualize every chunk
for idx, chunk in enumerate(splitter_output.chunks):
    print("="*40 + f" Chunk {int(idx + 1)} " + "="*40 + "\n" + str(chunk[str(idx)]) + "\n")

References¶

RecursiveJSONSplitter API (Langchain)