v0.3.x¶
v0.3.0¶
Features¶
- Implement `AzureOpenAI` and `OpenAI` Visual Models to analyse graphical resources in PDF files.
- Add support to read PDF files to `VanillaReader` using `PDFPlumber`.
Documentation¶
- Update examples.
- Add new examples to documentation.
- Add documentation on reading methods for PDF files.
- Add information about implementing VLMs in your reading pipeline.
- Change file names in the data folder to be more descriptive.
- Update `README.md` and `CHANGELOG`.
Fixes¶
- Update tests.
- Update docstrings.
- Update `TokenSplitter` to raise exceptions if no valid models are provided.
- Update `TokenSplitter` to take a valid tiktoken model as its default argument.
- Change `HTMLTagSplitter` to take the headers if a table is provided.
- Change `HeaderSplitter` to preserve headers in chunks.
v0.2.x¶
v0.2.2¶
Features¶
- Implement `TokenSplitter`: split text into chunks by number of tokens, using selectable tokenizers (`tiktoken`, `spacy`, `nltk`).
- `MarkItDownReader` now supports more file extensions: PDFs, audio, etc.
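As a rough illustration of the token-based chunking described for `TokenSplitter`, the sketch below groups tokens into fixed-size chunks. It does not use the package's API: `split_by_tokens` and its whitespace tokenizer are hypothetical stand-ins, and a real tokenizer (`tiktoken`, `spacy`, `nltk`) would take the place of `str.split`.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    chunk_size: int,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Group tokens into chunks of at most `chunk_size` tokens each.

    Hypothetical helper for illustration only; not part of the package.
    The default tokenizer simply splits on whitespace.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    tokens = tokenize(text)
    # Walk the token list in strides of `chunk_size` and rejoin each slice.
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


chunks = split_by_tokens("one two three four five", chunk_size=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Passing the tokenizer as a callable keeps the chunking logic independent of any particular tokenization backend.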
Fixes¶
- `HTMLTagSplitter` was not chunking the document as desired.
- Change docstring documentation.
Documentation¶
- Updated Splitter strategies documentation to include `TokenSplitter`.
- Expanded example scripts and test scripts for end-to-end manual and automated verification of all Splitter strategies.
- New examples in documentation server for `HTMLTagSplitter`.
Developer Features¶
- Remove `Pipfile` and `Pipfile.lock`.
- Update to `poethepoet` as task runner tool.
v0.2.1¶
Features¶
- Implement `CodeSplitter`.
Fixes¶
- Update the `BaseSplitter` docstring to match the current parameters.
- Fix some minor warnings in the documentation when deploying the MkDocs server.
v0.2.0¶
[!IMPORTANT] Breaking change!
- All Readers now return `ReaderOutput` dataclass objects.
- All Splitters now return `SplitterOutput` dataclass objects.
You must access fields using dot notation (e.g., `result.text`, `result.chunks`), not dictionary keys.
Features¶
- New splitting strategy: `RowColumnSplitter`, for flexible splitting of tabular data.
- New `reader_method` attribute in output dataclasses.
Migration¶
- Update all code/tests to use attribute access for results from Readers and Splitters.
- Use `.to_dict()` on output if a dictionary is required.
- Update any custom splitter/reader implementations to use the new output dataclasses.
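The migration steps above might look roughly like this in practice. `ExampleOutput` is a hypothetical stand-in written for this sketch, not the real `ReaderOutput` or `SplitterOutput` class; only the `text` and `chunks` field names and the `.to_dict()` method come from the notes above, and the real classes may carry different fields.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ExampleOutput:
    """Hypothetical stand-in for an output dataclass (not the real class)."""

    text: str = ""
    chunks: List[str] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        # Convert the dataclass instance into a plain dictionary.
        return asdict(self)


result = ExampleOutput(text="Hello world", chunks=["Hello", "world"])

# Before v0.2.0 results were dictionaries: result["chunks"][0]
# From v0.2.0 onwards, use attribute access instead:
first_chunk = result.chunks[0]

# Convert explicitly where a dictionary is still required.
as_dict = result.to_dict()
print(first_chunk, as_dict["text"])
```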
v0.1.x¶
v0.1.3¶
Features¶
- Add a new splitting strategy: `RowColumnSplitter`.
Fixes¶
- Change Readers to properly handle JSON files.
Documentation¶
- Update documentation.
v0.1.2¶
Features¶
- `VanillaReader` can now read from multiple sources: URL, text, `file_path` and dictionaries.
Fixes¶
- By default, the `document_id` was `None` instead of a `uuid` in `VanillaReader`.
- Some name changes for the `splitter_method` attribute in `SplitterOutput`.
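One common way to give a field such as `document_id` a fresh UUID by default is a dataclass `default_factory`. The sketch below is an assumption about how this could be done, not the package's actual code, and `ExampleReaderOutput` is a hypothetical class.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ExampleReaderOutput:
    """Hypothetical stand-in; the real output class may differ."""

    text: str = ""
    # default_factory is called for every new instance, so each object
    # receives its own identifier instead of a shared default or None.
    document_id: str = field(default_factory=lambda: str(uuid.uuid4()))


first = ExampleReaderOutput(text="a")
second = ExampleReaderOutput(text="b")
print(first.document_id != second.document_id)
```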
Developer features¶
- Extend CI/CD lifecycle. Now it uses Dockerfile to check tests and deploy docs.
- Automate versioning for the Python project.
- The project has been published to PyPI.org. New versions will be deployed using a CI/CD script.
- `requirements.txt` has been added in the root of the project.
- A new stage in `pre-commit` has been introduced for generating the `requirements.txt` file.
- `Makefile` extended: new commands to serve `mkdocs`. Now `make clean` removes more temporary files.
Documentation¶
- Update documentation in examples for the Documentation server.
- Documentation server can now be served using Make.
v0.1.1¶
Some bug fixes in HeaderSplitter and RecursiveCharacterSplitter, and documentation updates.
Bug fixes¶
- `chunk_overlap` (between 0 and 1) was not working in the `split` method from `RecursiveCharacterSplitter`.
- Some markdown code was not properly formatted in `README.md`.
- Reformat examples from docstring documentation in every Reader and Splitter class.
- `HeaderSplitter` was not properly handling the headers in some `markdown` and `HTML` files.
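To make the fractional `chunk_overlap` case concrete, here is a minimal sketch of one plausible interpretation: a value in the range 0 to 1 is treated as a fraction of the chunk size. This is an assumption for illustration, using plain fixed-width character windows rather than the recursive strategy of `RecursiveCharacterSplitter`; `resolve_overlap` and `split_with_overlap` are hypothetical helpers.

```python
from typing import List, Union


def resolve_overlap(chunk_size: int, chunk_overlap: Union[int, float]) -> int:
    """Return the overlap as a number of characters.

    Assumption: a value in [0, 1) is a fraction of `chunk_size`;
    any other non-negative value is an absolute character count.
    """
    if chunk_overlap < 0:
        raise ValueError("chunk_overlap must not be negative")
    if chunk_overlap < 1:
        return int(chunk_size * chunk_overlap)
    return int(chunk_overlap)


def split_with_overlap(
    text: str, chunk_size: int, chunk_overlap: Union[int, float]
) -> List[str]:
    """Split `text` into fixed-width windows that share `chunk_overlap`."""
    overlap = resolve_overlap(chunk_size, chunk_overlap)
    if overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early so the last window is not just a repeat of the overlap.
    stop = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0.5))
# ['abcd', 'cdef', 'efgh', 'ghij']
```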
Documentation¶
- Some examples have been provided in the documentation (`docs/`, and in the documentation server).
- New examples in docstrings.
v0.1.0¶
First version of the project
Functional features¶
- Add first readers:
  - `VanillaReader`: reader which reads the files and formats them into a string.
  - `DoclingReader`: reader which uses the docling package to read the files.
  - `MarkItDownReader`: reader which uses the markitdown package to read the files.
- Add first splitters: `CharacterSplitter`, `RecursiveSplitter`, `WordSplitter`, `SentenceSplitter`, `ParagraphSplitter`, `HTMLTagSplitter`, `JSONSplitter`, `HeaderSplitter`.
- The package can be installed using pip.
- `README.md` has been updated.
- Test cases for main functionalities are available.
- Some data has been added for testing purposes.
- A documentation server is deployed with up-to-date information.
Developer features¶
- Update `pyproject.toml` project information.
- Add pre-commit configurations (`flake8`, check commit messages, run test coverage, and update documentation).
- Add first Makefile commands (focused on developers):
  - `make help`: Provide a list with all the Make commands.
  - `make clean`: Clean temporary files and cache.
  - `make shell`: Run a `uv` shell.
  - `make install`: Install uv CLI and pre-commit.
  - `make precommit`: Install pre-commit hooks.
  - `make format`: Run pyupgrade, isort, black, and flake8 for code style.