We recently added a data diff tool to the Oxen.ai toolchain. Feel free to check out the docs here:
https://docs.oxen.ai/concepts/diffs
If you aren’t familiar with Oxen.ai, we are building dataset version control optimized for structured datasets (CSV, JSONL, Parquet files, etc.) as well as multimodal datasets of images, audio, video, text, etc.
We wanted to add a tool to the toolchain that could quickly narrow down schema changes as well as find added/removed/modified rows in data frames. The hope is that this will be a more powerful tool than a git diff when it comes to iterating on datasets. With data, you are no longer scanning line by line to verify changes; you are dealing with data distributions and data frames. You can pipe these diffs into any data analysis tool (pandas, polars, your custom Jupyter notebook) and quickly get to the crux of what changed.
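To make the added/removed/modified idea concrete, here is a minimal sketch in plain Python. It only illustrates the concept and is not how Oxen.ai implements diffs; the function name `diff_rows`, the `id` key column, and the sample rows are all made up for the example. It assumes each table is a list of dict rows (roughly what `csv.DictReader` yields) with a unique key column.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(old: list[Row], new: list[Row], key: str) -> dict[str, Any]:
    """Compare two tables (lists of dict rows) that share a unique key column.

    Hypothetical helper for illustration only, not part of any Oxen.ai API.
    """
    # Schema diff: which columns appeared or disappeared.
    old_cols = set().union(*(row.keys() for row in old))
    new_cols = set().union(*(row.keys() for row in new))
    shared_cols = old_cols & new_cols

    # Index rows by key so lookups do not depend on row order.
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    added = [row for k, row in new_by_key.items() if k not in old_by_key]
    removed = [row for k, row in old_by_key.items() if k not in new_by_key]

    # A row counts as modified if any column present in both schemas differs.
    modified = []
    for k, old_row in old_by_key.items():
        new_row = new_by_key.get(k)
        if new_row is None:
            continue
        if any(old_row.get(c) != new_row.get(c) for c in shared_cols):
            modified.append({"key": k, "old": old_row, "new": new_row})

    return {
        "added_columns": sorted(new_cols - old_cols),
        "removed_columns": sorted(old_cols - new_cols),
        "added_rows": added,
        "removed_rows": removed,
        "modified_rows": modified,
    }


# Made-up sample data: two versions of a small labels table.
old_version = [
    {"id": 1, "label": "cat"},
    {"id": 2, "label": "dog"},
    {"id": 3, "label": "bird"},
]
new_version = [
    {"id": 1, "label": "cat", "split": "train"},
    {"id": 2, "label": "wolf", "split": "train"},
    {"id": 4, "label": "fish", "split": "test"},
]

result = diff_rows(old_version, new_version, key="id")
```

In this sketch, a newly added column alone does not mark a row as modified; only changes in columns that exist in both versions do. That is one possible design choice, and a real tool may reasonably treat it differently.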
Let us know if this type of tool would be helpful in your workflow! The GitHub repo can be found here: