Skip to content

v0.3.x

v0.3.0

Features

  • Implement AzureOpenAI and OpenAI Visual Models to analyse graphical resources in PDF files.
  • Add support to read PDF files to VanillaReader using PDFPlumber.

Documentation

  • Update examples.
  • Add new examples to documentation.
  • Add Reading methods with PDF documentation.
  • Add information about implementing VLMs in your reading pipeline.
  • Change file names on data folder to be more descriptive.
  • Update README.md and CHANGELOG.

Fixes

  • Update tests.
  • Update docstrings.
  • Update TokenSplitter to raise Exceptions if no valid models are provided.
  • Update TokenSplitter to take as default argument a valid tiktoken model.
  • Change HTMLTagSplitter to take the headers if a table is provided.
  • Change HeaderSplitter to preserve headers in chunks.

v0.2.x

v0.2.2

Features

  • Implement TokenSplitter: split text into chunks by number of tokens, using selectable tokenizers (tiktoken, spacy, nltk).
  • MarkItDownReader now supports more file extensions: PDFs, audio, etc.

Fixes

  • HTMLTagSplitter does not correctly chunking the document as desired.
  • Change docstring documentation.

Documentation

  • Updated Splitter strategies documentation to include TokenSplitter.
  • Expanded example scripts and test scripts for end-to-end manual and automated verification of all Splitter strategies.
  • New examples in documentation server for HTMLTagSplitter.

Developer Features

  • Remove Pipfile and Pipfile.lock.
  • Update to poethepoet as task runner tool

Fixes

v0.2.1

Features

Implement CodeSplitter.

Fixes

  • Change docstring for BaseSplitter to update with current parameters.
  • Some minor warnings in documentations when deploying Mkdocs server.

v0.2.0

[!IMPORTANT] Breaking change!

  • All Readers now return ReaderOutput dataclass objects.
  • All Splitters now return SplitterOutput dataclass objects.

You must access fields using dot notation (e.g., result.text, result.chunks), not dictionary keys.

Features

New splitting strategy: RowColumnSplitter for flexible splitting of tabular data.

New reader_method attribute in output dataclasses.

Migration

Update all code/tests to use attribute access for results from Readers and Splitters.

Use .to_dict() on output if a dictionary is required.

Update any custom splitter/reader implementations to use the new output dataclasses.

v0.1.x

v0.1.3

Features

  • Add a new splitting strategy: RowColumnSplitter.

Fixes

  • Change Readers to properly handle JSON files.

Documentation

  • Update documentation.

v0.1.2

Features

  • Now VanillaReader can read from multiple sources: URL, text, file_path and dictionaries.

Fixes

  • By default, the document_id was None instead of and uuid in VanillaReader.
  • Some name changes for splitter_method attribute in SplitterOutput method.

Developer features

  • Extend CI/CD lifecycle. Now it uses Dockerfile to check tests and deploy docs.
  • Automate versioning for the Python project.
  • The project has been published to PyPI.org. New versions will be deployed using CI/CD script.
  • requirements.txt has been added in the root of the project.
  • A new stage in pre-commit has been introduced for generating the requirements.txt file.
  • Makefile extended: new commands to serve mkdocs. Now make clean remove more temporary files.

Documentation

  • Update documentation in examples for the Documentation server.
  • Documentation server now can be served using Make.

v0.1.1

Some bug fixes in HeaderSplitter and RecursiveCharacterSplitter, and documentation updates.

Bug fixes

  • chunk_overlap (between 0 and 1) was not working in the split method from RecursiveCharacterSplitter.
  • Some markdown code was not properly formatted in README.md.
  • Reformat examples from docstring documentation in every Reader and Splitter classes.
  • HeaderSplitter was not properly handling the headers in some markdown and HTML files.

Documentation

  • Some examples have been provided in the documentation (docs/, and in the documentation server).
  • New examples in docstrings.

v0.1.0

First version of the project

Functional features

  • Add first readers, VanillaReader: reader which reads the files and format them into a string.
  • DoclingReader: reader which uses the docling package to read the files.
  • MarkItDownReader: reader which uses the markitdown package to read the files.
  • Add first splitters: CharacterSplitter, RecursiveSplitter, WordSplitter, SentenceSplitter, ParagraphSplitter, HTMLTagSplitter, JSONSplitter, HeaderSplitter:
  • The package can be installed using pip.
  • README.md has been updated.
  • Tests cases for main functionalities are available.
  • Some data has been added for testing purposes.
  • A documentation server is deployed with up-to-date information.

Developer features

  • Update pyproject.toml project information.
  • Add pre-commit configurations (flake8, check commit messages, run test coverage, and update documentation).
  • Add first Makefile commands (focused on developers):
  • make help: Provide a list with all the Make commands.
  • make clean: Clean temporal files and cache
  • make shell: Run a uv shell.
  • make install: Install uv CLI and pre-commit.
  • make precommit: Install pre-commit hooks.
  • make format: Run pyupgrade, isort, black, and flake8 for code style.