This is a reference from The Good Place and it is an amazing name!
Also works for XML, HTML, YAML and CSV.
Since this is an opaque serialization of an instruction set, why not try to encode more bits per number (JSON numbers can represent integers of up to 53 bits losslessly) and move the "symbol table" (string data) to the end?
This way you could also compress redundant symbols into single strings.
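A minimal sketch of the idea (a hypothetical encoding, not Mendoza's actual format): pull the strings out of the instruction stream, replace each with an index, and append the deduplicated symbol table at the end, so redundant symbols collapse into a single entry.

```python
def encode_with_symbol_table(ops):
    """Replace string operands with indices into a deduplicated
    symbol table appended after the instruction stream."""
    symbols = []   # symbol table, in first-seen order
    index = {}     # string -> table index
    encoded_ops = []
    for op, arg in ops:
        if isinstance(arg, str):
            if arg not in index:
                index[arg] = len(symbols)
                symbols.append(arg)
            arg = index[arg]  # redundant strings share one entry
        encoded_ops.append([op, arg])
    return encoded_ops + [symbols]

ops = [("set", "title"), ("copy", 3), ("set", "title")]
encoded = encode_with_symbol_table(ops)
# "title" appears once in the table even though it is used twice
```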
The format itself is not, strictly speaking, coupled to JSON: https://github.com/sanity-io/mendoza/blob/master/docs/format.... If you can encode int8 and string more efficiently, then you can save some bytes with a custom binary encoding. However, you always need to be able to represent JSON as well: if a part of the JSON file isn't present in the old version, then that part is encoded as plain JSON.
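To illustrate that last point with a toy patch format (hypothetical, not Mendoza's real opcodes): parts present in the old version are reused by reference, and parts that aren't are shipped as literal JSON.

```python
def apply_patch(old, patch):
    """Toy patcher: 'copy' reuses a value from the old document,
    'literal' embeds plain JSON for parts the old version lacks."""
    result = {}
    for key, (op, arg) in patch.items():
        if op == "copy":
            result[key] = old[arg]  # reference into the old version
        else:  # "literal"
            result[key] = arg       # new data shipped verbatim
    return result

old = {"name": "Ada", "tags": ["a", "b"]}
patch = {"name": ("copy", "name"),
         "tags": ("literal", ["a", "b", "c"])}
new = apply_patch(old, patch)
# → {"name": "Ada", "tags": ["a", "b", "c"]}
```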
> and moving the "symbol table" (string data) to the end? … you could also compress redundant symbols into single strings.
Sounds interesting, but in my experience these types of tricks usually don't pay off compared to just sending the output through gzip afterwards.
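A quick way to check that intuition (illustrative numbers only): gzip's LZ77 window already deduplicates repeated strings, so a hand-rolled symbol table often buys little once the stream is compressed anyway.

```python
import gzip
import json

# A patch stream with heavily repeated string operands
ops = [["set", "description"], ["set", "description"]] * 500
raw = json.dumps(ops).encode()

compressed = gzip.compress(raw)
# gzip finds the repeats itself; for redundant input like this
# the compressed stream is a small fraction of the raw size
print(len(raw), len(compressed))
```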
I would love to get an understanding of how the HN crowd thinks diffing of large datasets (let's say >1 GB in size) should work.
Are you more interested in a "patch"-quality diff of the data, which is more machine-tailored? Or is a change report/summary/highlights more interesting in that case?
Currently I'm leaning more towards the understanding/human consumption perspective which offers some interesting tradeoffs.
Thinking about diffs: plain-text diffs are typically compressed for transport anyway, so you end up with something that's human-readable at the points of generation and application (where the storage/processing cost of legibility is basically insignificant), while being highly compressed during transport (where legibility is irrelevant and no processing beyond copying bits is necessary).
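For example, a plain-text unified diff stays readable at both ends while being cheap on the wire (a sketch using Python's stdlib):

```python
import difflib
import gzip

old = ["alpha\n", "beta\n", "gamma\n"] * 100
new = old.copy()
new[1] = "BETA\n"  # one changed line

# Human-readable at generation/application time...
diff = "".join(difflib.unified_diff(old, new, "old.txt", "new.txt"))

# ...and compressed during transport, where legibility doesn't matter
wire = gzip.compress(diff.encode())
restored = gzip.decompress(wire).decode()
assert restored == diff
```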
As a side effect, we've also been able to use the Mendoza format for tracking changes in documents. The JavaScript implementation supports replaying the patches while maintaining information about where each part of the document was introduced. We use this in our Review Changes feature so that you're able to see exactly who introduced a change (in a JSON document!): https://www.sanity.io/blog/review-changes
At this point all of this is probably completely moot.
You may find this relevant: https://github.com/ottypes/json1
Also, is there enough here to build a Turing machine? I guess not, but it does seem pretty close.
Is it really minimal, or is it an attempt at minimal?