> particularly useful for append-only datasets of immutable records such as system logs or sensor readings which are often among the largest (and fastest-growing) datasets our customers use
Diffs seem to consist of additional files in separate folders:
> behind the scenes we effectively store each diff in a separate folder in the backing file system (e.g.,datasetA/diff1, datasetA/diff2, …) so that the whole dataset is simply represented by datasetA/*.
Without going into technical detail, the author suggests that deletes are handled logically rather than physically, since datasetA/* may not reflect the actual whole dataset. I infer that they might be logging changes under the hood in a Git-like fashion.
> It’s a bit more complicated than this because users can selectively delete files from those diffs
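To make my guess concrete, here is a minimal sketch of what "logical, not physical" deletes could look like. Everything in it is my assumption and not Foundry's actual design: the `Diff` class, the tombstone set, and the file names are hypothetical. It only illustrates why a naive datasetA/* glob could differ from the dataset a user actually sees.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Diff:
    """One transaction stored as its own folder (hypothetical model)."""
    name: str
    added: dict[str, bytes] = field(default_factory=dict)  # file name -> contents
    deleted: set[str] = field(default_factory=set)  # tombstones, e.g. "diff1/part-1"


def physical_view(dataset: str, diffs: list[Diff]) -> set[str]:
    """Every file a naive dataset/* glob would match on disk."""
    return {f"{dataset}/{d.name}/{f}" for d in diffs for f in d.added}


def logical_view(dataset: str, diffs: list[Diff]) -> set[str]:
    """Replay diffs in order, honoring tombstones instead of removing files."""
    live: set[str] = set()
    for d in diffs:
        live -= {f"{dataset}/{p}" for p in d.deleted}
        live |= {f"{dataset}/{d.name}/{f}" for f in d.added}
    return live


diffs = [
    Diff("diff1", added={"part-0": b"a", "part-1": b"b"}),
    # diff2 appends one file and logically deletes a file from diff1
    Diff("diff2", added={"part-0": b"c"}, deleted={"diff1/part-1"}),
]
print(sorted(physical_view("datasetA", diffs)))
print(sorted(logical_view("datasetA", diffs)))
```

In this toy model the deleted file is still on disk under diff1, so an export of the raw folders would resurrect it unless the tombstones were exported and applied too.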
However, it seems that the raw versioning data they manage is not directly available to clients or users:
> a simple request that we frequently get from our customers: “can we export our datasets from Palantir Foundry to our existing data lake or S3 bucket?“ While this is of course possible, it is important to understand that such exported datasets lack precisely those versioning and sandboxing features that make Foundry a great tool for collaborative data engineering.
This could be a mechanism for vendor lock-in, tied to the very important ACID guarantees of their implementation.
I came across their post while doing research on existing solutions for dataset versioning. Some extra background here: https://news.ycombinator.com/item?id=35930895
On the homepage I read "An in-memory, distributed, and open-source document graph database". Do you know whether the whole database, including all documents, needs to be in memory, and what happens when a dataset exceeds the available memory? Or are documents perhaps loaded into memory one at a time?
Do you have to create diffs manually with terminusdb (CLI example below from https://terminusdb.com/products/terminusdb/), or can they be detected automatically from, e.g., SQL database tables or files in a folder, similarly to Git monitoring a working directory and committing changes based on its contents?
# Add more philosophers to new branch
echo '{ "name": "Plato" }' | terminusdb doc insert admin/philosophers/local/branch/changes
echo '{ "name": "Aristotle" }' | terminusdb doc insert admin/philosophers/local/branch/changes
# Look at the difference between branches
terminusdb diff admin/philosophers --before-commit main --after-commit changes | jq
# Apply the differences to main
terminusdb apply admin/philosophers --before-commit main --after-commit changes
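To clarify what I mean by automatic detection from files in a folder, here is a rough sketch in plain Python. It does not use any TerminusDB API; `snapshot`, `detect_changes`, and the file names are hypothetical, and a real tool would presumably turn the detected changes into document inserts or commits.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(folder: Path) -> dict:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def detect_changes(before: dict, after: dict) -> dict:
    """Compare two snapshots and classify paths as added, removed, or modified."""
    common = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in common if before[k] != after[k]),
    }


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / "socrates.json").write_text('{ "name": "Socrates" }')
    before = snapshot(work)  # state at the last "commit"

    # Edit the working directory without telling the tool anything
    (work / "plato.json").write_text('{ "name": "Plato" }')
    (work / "socrates.json").write_text('{ "name": "Socrates", "city": "Athens" }')
    changes = detect_changes(before, snapshot(work))

print(changes)
```

The question is whether something like this exists out of the box, or whether every change has to be piped through `terminusdb doc insert` by hand.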