This is a reference from The Good Place and it is an amazing name!
Also works for XML, HTML, YAML and CSV.
Since this is an opaque serialization of an instruction set, why not try to encode more bits per number (JSON numbers can represent integers of up to 53 bits losslessly) and move the "symbol table" (string data) to the end?
This way you could also compress redundant symbols into single strings.
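A minimal sketch of the idea (a hypothetical encoding, not Mendoza's actual format): pull the strings out of the instruction stream, replace each with an index, and append the deduplicated symbol table at the end, so redundant symbols collapse into a single entry.

```python
def encode_with_symbol_table(ops):
    """Replace string operands with indices into a deduplicated
    symbol table appended after the instruction stream."""
    symbols = []   # symbol table, in first-seen order
    index = {}     # string -> table index
    encoded_ops = []
    for op, arg in ops:
        if isinstance(arg, str):
            if arg not in index:
                index[arg] = len(symbols)
                symbols.append(arg)
            arg = index[arg]  # redundant strings share one entry
        encoded_ops.append([op, arg])
    return encoded_ops + [symbols]

ops = [("set", "title"), ("copy", 3), ("set", "title")]
encoded = encode_with_symbol_table(ops)
# "title" appears once in the table even though it is used twice
```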
The format itself is not, strictly speaking, coupled to JSON: https://github.com/sanity-io/mendoza/blob/master/docs/format.... If you can encode int8 and string more efficiently, then you can save some bytes with a custom binary encoding. However, you always need to be able to represent JSON as well: if a part of the JSON file isn't present in the old version, then that part is encoded as plain JSON.
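To illustrate that last point with a toy patch format (hypothetical, not Mendoza's real opcodes): parts present in the old version are reused by reference, and parts that aren't are shipped as literal JSON.

```python
def apply_patch(old, patch):
    """Toy patcher: 'copy' reuses a value from the old document,
    'literal' embeds plain JSON for parts the old version lacks."""
    result = {}
    for key, (op, arg) in patch.items():
        if op == "copy":
            result[key] = old[arg]  # reference into the old version
        else:  # "literal"
            result[key] = arg       # new data shipped verbatim
    return result

old = {"name": "Ada", "tags": ["a", "b"]}
patch = {"name": ("copy", "name"),
         "tags": ("literal", ["a", "b", "c"])}
new = apply_patch(old, patch)
# → {"name": "Ada", "tags": ["a", "b", "c"]}
```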
> and moving the "symbol table" (string data) to the end? … you could also compress redundant symbols into single strings.
Sounds interesting, but in my experience these types of tricks usually don't pay off compared to just sending the output through gzip afterwards.
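A quick way to check that intuition (illustrative numbers only): gzip's LZ77 window already deduplicates repeated strings, so a hand-rolled symbol table often buys little once the stream is compressed anyway.

```python
import gzip
import json

# A patch stream with heavily repeated string operands
ops = [["set", "description"], ["set", "description"]] * 500
raw = json.dumps(ops).encode()

compressed = gzip.compress(raw)
# gzip finds the repeats itself; for redundant input like this
# the compressed stream is a small fraction of the raw size
print(len(raw), len(compressed))
```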
I would love to get an understanding of how the HN crowd thinks diffing of large datasets (let's say >1 GB in size) should work.
Are you more interested in a "patch"-quality diff of the data, which is more machine-tailored? Or is a change report/summary/highlights more interesting in that case?
Currently I'm leaning more towards the understanding/human consumption perspective which offers some interesting tradeoffs.
Thinking about diffs: plain-text diffs are typically compressed for transport anyway, so you end up with something that's human-readable at the points of generation and application (where the storage/processing cost of legibility is basically insignificant), while being highly compressed during transport (where legibility is irrelevant and no processing beyond copying bits is necessary).
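For example, a plain-text unified diff stays readable at both ends while being cheap on the wire (a sketch using Python's stdlib):

```python
import difflib
import gzip

old = ["alpha\n", "beta\n", "gamma\n"] * 100
new = old.copy()
new[1] = "BETA\n"  # one changed line

# Human-readable at generation/application time...
diff = "".join(difflib.unified_diff(old, new, "old.txt", "new.txt"))

# ...and compressed during transport, where legibility doesn't matter
wire = gzip.compress(diff.encode())
restored = gzip.decompress(wire).decode()
assert restored == diff
```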
As a side effect, we've also been able to use the Mendoza format for tracking changes in documents. The JavaScript implementation supports replaying the patches while maintaining information about where each part of the document was introduced. We use this in our Review Changes feature so that you're able to see exactly who introduced a change (in a JSON document!): https://www.sanity.io/blog/review-changes
At this point all of this is probably completely moot.
You may find this relevant: https://github.com/ottypes/json1
Also, is there enough here to build a Turing machine? I guess not, but it does seem pretty close.
Is it really minimal, or is it an attempt at minimal?