Graphtage: A semantic diff utility for JSON, HTML, YAML, CSV, etc (opens in new tab)

(github.com)

405 pointsautoditype5y ago51 comments

51 comments

Cheers for making this a local tool. So many versions of this sort of tool are online-only, which is a non-starter for proprietary data.

jkaptur5y ago

My diff tool, diff.so, is a web app, but it doesn’t send the text to the server (unless you publish). I wish there were a technical mechanism that could guarantee and enforce that, and certify it to users.

tmpz225y ago

You could distribute it as a single index.html file, with unobfuscated/unminimized vanilla javascript. The user could then execute it offline no problem.

2 more replies

mariushn5y ago

It doesn't recognize JSON input. Could you please do this, auto-format it and show the diff similarly to Graphtage?

saagarjha5y ago

Interesting use of U+031F COMBINING PLUS SIGN BELOW to indicate additions, I don't think I've seen that before.

hinkley5y ago

It's the kind of UI experimentation I like to see people try, but I'm not sure they nailed this one.

Probably onto something though. Try different diacritical symbols and see what sticks. Given how '"' looks, maybe combining above or below needs to vary by character. Above probably looks awful for '.' and ','.

Really I think the strikethrough might suffice. The only way to know for sure is to take away the color highlighting, so my brain doesn't use it as a crutch, and see if people can still read the diff.

jacobolus5y ago

Using diacritics for this is extremely visually noisy and not very legible.

There has to be a better method.

1 more reply

ynx05y ago

I think it's really cool as well. Although, it does start to look a little wonky with quotes and probably some other characters and makes skimming a bit harder. Maybe monospaced fonts will start to handle this better if it gets popular?

eurasiantiger5y ago

Ruby typography support in terminals maybe?

jrockway5y ago

This is neat. I decided to diff two pods in a replicated Kubernetes service. It seemed that it was going to take forever to run, so I just wrote a short Go program to do the same thing (load two JSON files into a map[string]interface{}, cmp.Diff them) while it was going:

https://gist.github.com/jrockway/73982949b3d2ce9b443528042c4...

My program runs in less than 10 milliseconds (/usr/bin/time reports 0.00 seconds), and graphtage takes 5 minutes and 17 seconds. I'm 317,000x faster! (Not including the time to write the program; if you do that, then it's about even assuming graphtage took 0 seconds to write.)

Graphtage prints the entire file in JQ colors, with diffs inside fields colored red and green, which I love:

    ...
    "hostIP": "10.136.13̟9̟2̶1̶.139̟1̶",
    "podIP": "10.244.1.18̟6̟7̶",
    ...

(My terminal can't display the dots under the numbers and the strikethrough, but it looks great on HN! I really love it.)

My program produces relatively boring line-by-line diffs:

    -               "hostIP":    string("10.136.121.131"),
    +               "hostIP":    string("10.136.139.139"),
                    "phase":     string("Running"),
    -               "podIP":     string("10.244.1.17"),
    +               "podIP":     string("10.244.1.186"),

Honestly, I get what I want out of mine, and wait 317,000x less time, so... I probably won't be using this on a daily basis. But I will be stealing those dots and Unicode strikethroughs.

rmetzler5y ago

It's neat that you built what you needed in a few lines of code. I must say, I don't quite like the output of Graphtage, I like your's a little bit more, but without a context it's not easy to see how podIP is nested.

Usecases might be a little bit different, but please allow me to share my solution.

The problem with diffing JSON and yaml is, that these formats aren't line based and hashes don't need to be ordered. But there is gron to turn json into a greppable line-based format [1]. Then you can sort. The sorted output is possible to diff now and then you can color the diff output with delta or a similar tool [2].

    diff -u <(kubectl get pod pod1 -o json | gron | sort) <(kubectl get pod pod2 -o json | gron | sort) | delta --light --word-diff-regex="\W+"

This output provides a lot of context for me to see and understand the differences.

[1] gron https://github.com/tomnomnom/gron

[2] delta https://github.com/dandavison/delta

jrockway5y ago

The diff I did is aware of the structure of the object. It's not just sorted lines.

2 more replies

ascar5y ago

Well you are kinda comparing apples and oranges here.

According to their readme they don't just match on keys, but even try to detect changed keys for the same content, even when the two files have a different inner order of elements.

Your diff is probably equivalent to a pretty print and then running regular diff on it, i.e. not even sorting the file.

Having said that and assuming your file wasn't extraordinary large, a 5min runtime makes this tool kinda unusable.

bboreham5y ago

Your approach works less well if someone re-orders one of the arrays, like `containers` or `env`.

IshKebab5y ago

Yes but does your code actually do tree diffing or does it just do line based diffing?

Proper tree diffing is a really hard (I would say unsolved) problem. The "standard" algorithm is O(N^4)!

acidbaseextract5y ago

It's cool, but does seem quite slow. I'm diffing two 45kB CSVs on a fast computer and after 10 minutes I'm still at:

  Diffing:   0% ... 0/93195 [00:00<?, ?it/s]
  Tightening Fringe Diagonal 48 of 792: ...

dan-robertson5y ago

I think it’s ended up with a quadratic algorithm for diffing sequences and a quadratic log algorithm for diffing dictionaries.

To understand why the sequences problem is quadratic, consider a sequence A of length m being doffed with a sequence B of length n. We want to express our diff in the minimum number of operations where an operation is removing, adding, or editing an element in the sequence. Construct a graph as follows: the nodes will be the points on an mxn lattice corresponding to points in the two sequences. An edge going right means “delete this item from sequence A,” and costs (eg 1). An edge going down means “add this item from sequence B” and has a similar cost. An edge going diagonally down and right means to edit the item in A into the item in B and it’s cost depends on how different they are. The problem is to find the shortest path from the top left to the bottom right.

If you could compute the entire graph for free and then applied something like Dijkstra’s algorithm you would be worst-case quadratic (if all the diagonal costs were 2 or more, you would need to touch every node).

There are a few ways you could try to improve this:

1. Look for easy opportunities to optimise. Eg you could have a patience style strategy of cutting off any common prefix or suffix. This won’t help in the worst case.

2. Limit to a fixed width diagonal. This might mean worse diffs but means the graph search problem becomes more linear. I suspect something is going on with the diagonal based on the description

3. Somehow develop some good heuristics and use a better search algorithm like A*. This might not help in the worst case

4. Something else.

meetups3235y ago

Quadratic doesn't need to equal "bad", especially in this case. Two 45 kB items is 2 billion entries. Allocating 2 billion bytes is easy enough. Iterating over 2 billion bytes is also not terrible. The GP says the process is estimated to take 150 hours, or half a million seconds, or 1.62e15 cycles... so around 1 million cycles per cell.

1 more reply

acidbaseextract5y ago

30 minutes in and I still don't have a diff of the two CSVs... Has this program worked for anyone here?

    Diffing:   0%|                                | 153/93195 [30:02<151:31:17,  5.86s/it]
    Tightening Fringe Diagonal 82 of 792:  48%|...| 7453/15514 [00:30<00:35, 226.08it/s]

dj_mc_merlin5y ago

Have you tried -l or -k?

acidbaseextract5y ago

I don't currently have access to the machine and files that I was doing it on, but I'll try those in the morning!

IshKebab5y ago

Tree diffing algorithms have very bad complexity (e.g. O(N^4)!) so this will probably only work on really small examples.

rmccue5y ago

I have a use case for diffing trees, so would love to know of any optimal algorithms you may know of; I'm operating generally with less than 100 nodes, so it's not a huge concern, but I'm finding that discovering _any_ algorithms for this has been tough.

1 more reply

trentdotexe5y ago

There's also a blog post for Graphtage. https://blog.trailofbits.com/2020/08/28/graphtage/

kemitchell5y ago

> Graphtage matches ordered sequences like lists using an “online”[note], “constructive”[note] implementation of the Levenshtein distance metric[note], similar to the Wagner–Fischer algorithm[note[. The algorithm starts with an unbounded mapping and iteratively improves it until the bounds converge, at which point the optimal edit sequence is discovered. This is implemented in the graphtage.levenshtein module.

https://trailofbits.github.io/graphtage/latest/howitworks.ht...

So not a tree algo, but an adaptation of a list-diff algo? Or is this just a note on how the tree-diff compares sequences?

jonnycomputer5y ago

This could really use more input/output examples in the readme to get a feel for what it would be like to use.

indeyets5y ago

Is there a corresponding patch utility? Can I generate compact diff file for changed dataset, distribute it and have people apply the difference on their side?

spankalee5y ago

Anyone know of similar libraries in JavaScript?

It's difficult to search for HTML diff libraries these days because all the hits are vdom like things, instead of diffing HTML text for development / testing.

stereosteve5y ago

Can relate. After some searching I found:

https://github.com/Teamwork/visual-dom-diff

Which is quite good and fast. Encodes HTML tags as Unicode chars, calls diff-match-path and uses diff to build a final visual output.

BadOakOx5y ago

It seems, nobody is satisfied with the recent JSON diff utilities and everybody has a different take on it.

A few weeks back I started my semantic JSON compare too: https://paldys.github.io/semantic-json/

Your data stays in the browser. The list compare is pretty naive at the moment, and it doesn't allow key changes as Graphtage promises.

carapace5y ago

(The name Trail of Bits reminds me of the Trail of Tears. Maybe it's just an unfortunate co-incidence?)

ducktective5y ago

Any chance of integration with vimdiff? Any ways to have "semantic" diffing in vimdiff?

davedrouin5y ago

Thanks for making this - great for displaying diffs in dicts for unit test assertion failures.

beams5y ago

Cool I will definitely try this. Thank you

deknos5y ago

and now please, build a tool which converts all these thinks, like imagemagick for machinereadable documents.

hermitcrab5y ago

If you mean convert between data formats (e.g. between, CSV,YAML,JSON, XML) there are programs already that can do that. For example our Easy Data Transform. However there are wrinkles because some of these formats are trees and some are tables. Flattening a tree into a table isn't too hard. But unflattening a table back into the same tree as the original is trickier. Has anyone got any good references on that?

deknos5y ago

do you have a list for that? i am aware, that there are dialects and {table,trees,graphs}<->{table,trees,graphs} may be a problem for some languages, but having it at least would be a progress.

1 more reply

j / k navigate · click thread line to collapse

51 comments

pimlottc5y ago

Cheers for making this a local tool. So many versions of this sort of tool are online-only, which is a non-starter for proprietary data.

jkaptur5y ago

tmpz225y ago

You could distribute it as a single index.html file, with unobfuscated/unminimized vanilla javascript. The user could then execute it offline no problem.

2 more replies

mariushn5y ago

It doesn't recognize JSON input. Could you please do this, auto-format it and show the diff similarly to Graphtage?

saagarjha5y ago

Interesting use of U+031F COMBINING PLUS SIGN BELOW to indicate additions, I don't think I've seen that before.

hinkley5y ago

It's the kind of UI experimentation I like to see people try, but I'm not sure they nailed this one.

jacobolus5y ago

Using diacritics for this is extremely visually noisy and not very legible.

There has to be a better method.

1 more reply

ynx05y ago

eurasiantiger5y ago

Ruby typography support in terminals maybe?

jrockway5y ago

https://gist.github.com/jrockway/73982949b3d2ce9b443528042c4...

Graphtage prints the entire file in JQ colors, with diffs inside fields colored red and green, which I love:

    ...
    "hostIP": "10.136.13̟9̟2̶1̶.139̟1̶",
    "podIP": "10.244.1.18̟6̟7̶",
    ...

(My terminal can't display the dots under the numbers and the strikethrough, but it looks great on HN! I really love it.)

My program produces relatively boring line-by-line diffs:

    -               "hostIP":    string("10.136.121.131"),
    +               "hostIP":    string("10.136.139.139"),
                    "phase":     string("Running"),
    -               "podIP":     string("10.244.1.17"),
    +               "podIP":     string("10.244.1.186"),

Honestly, I get what I want out of mine, and wait 317,000x less time, so... I probably won't be using this on a daily basis. But I will be stealing those dots and Unicode strikethroughs.

rmetzler5y ago

Usecases might be a little bit different, but please allow me to share my solution.

    diff -u <(kubectl get pod pod1 -o json | gron | sort) <(kubectl get pod pod2 -o json | gron | sort) | delta --light --word-diff-regex="\W+"

This output provides a lot of context for me to see and understand the differences.

[1] gron https://github.com/tomnomnom/gron

[2] delta https://github.com/dandavison/delta

jrockway5y ago

The diff I did is aware of the structure of the object. It's not just sorted lines.

2 more replies

ascar5y ago

Well you are kinda comparing apples and oranges here.

According to their readme they don't just match on keys, but even try to detect changed keys for the same content, even when the two files have a different inner order of elements.

Your diff is probably equivalent to a pretty print and then running regular diff on it, i.e. not even sorting the file.

Having said that and assuming your file wasn't extraordinary large, a 5min runtime makes this tool kinda unusable.

bboreham5y ago

Your approach works less well if someone re-orders one of the arrays, like `containers` or `env`.

IshKebab5y ago

Yes but does your code actually do tree diffing or does it just do line based diffing?

Proper tree diffing is a really hard (I would say unsolved) problem. The "standard" algorithm is O(N^4)!

acidbaseextract5y ago

It's cool, but does seem quite slow. I'm diffing two 45kB CSVs on a fast computer and after 10 minutes I'm still at:

  Diffing:   0% ... 0/93195 [00:00<?, ?it/s]
  Tightening Fringe Diagonal 48 of 792: ...

dan-robertson5y ago

I think it’s ended up with a quadratic algorithm for diffing sequences and a quadratic log algorithm for diffing dictionaries.

There are a few ways you could try to improve this:

1. Look for easy opportunities to optimise. Eg you could have a patience style strategy of cutting off any common prefix or suffix. This won’t help in the worst case.

2. Limit to a fixed width diagonal. This might mean worse diffs but means the graph search problem becomes more linear. I suspect something is going on with the diagonal based on the description

3. Somehow develop some good heuristics and use a better search algorithm like A*. This might not help in the worst case

4. Something else.

meetups3235y ago

1 more reply

acidbaseextract5y ago

30 minutes in and I still don't have a diff of the two CSVs... Has this program worked for anyone here?

    Diffing:   0%|                                | 153/93195 [30:02<151:31:17,  5.86s/it]
    Tightening Fringe Diagonal 82 of 792:  48%|...| 7453/15514 [00:30<00:35, 226.08it/s]

dj_mc_merlin5y ago

Have you tried -l or -k?

acidbaseextract5y ago

I don't currently have access to the machine and files that I was doing it on, but I'll try those in the morning!

IshKebab5y ago

Tree diffing algorithms have very bad complexity (e.g. O(N^4)!) so this will probably only work on really small examples.

rmccue5y ago

1 more reply

trentdotexe5y ago

There's also a blog post for Graphtage. https://blog.trailofbits.com/2020/08/28/graphtage/

kemitchell5y ago

https://trailofbits.github.io/graphtage/latest/howitworks.ht...

So not a tree algo, but an adaptation of a list-diff algo? Or is this just a note on how the tree-diff compares sequences?

jonnycomputer5y ago

This could really use more input/output examples in the readme to get a feel for what it would be like to use.

indeyets5y ago

Is there a corresponding patch utility? Can I generate compact diff file for changed dataset, distribute it and have people apply the difference on their side?

spankalee5y ago

Anyone know of similar libraries in JavaScript?

It's difficult to search for HTML diff libraries these days because all the hits are vdom like things, instead of diffing HTML text for development / testing.

stereosteve5y ago

Can relate. After some searching I found:

https://github.com/Teamwork/visual-dom-diff

Which is quite good and fast. Encodes HTML tags as Unicode chars, calls diff-match-path and uses diff to build a final visual output.

BadOakOx5y ago

It seems, nobody is satisfied with the recent JSON diff utilities and everybody has a different take on it.

A few weeks back I started my semantic JSON compare too: https://paldys.github.io/semantic-json/

Your data stays in the browser. The list compare is pretty naive at the moment, and it doesn't allow key changes as Graphtage promises.

carapace5y ago

(The name Trail of Bits reminds me of the Trail of Tears. Maybe it's just an unfortunate co-incidence?)

ducktective5y ago

Any chance of integration with vimdiff? Any ways to have "semantic" diffing in vimdiff?

davedrouin5y ago

Thanks for making this - great for displaying diffs in dicts for unit test assertion failures.

beams5y ago

Cool I will definitely try this. Thank you

deknos5y ago

and now please, build a tool which converts all these thinks, like imagemagick for machinereadable documents.

hermitcrab5y ago

deknos5y ago

do you have a list for that? i am aware, that there are dialects and {table,trees,graphs}<->{table,trees,graphs} may be a problem for some languages, but having it at least would be a progress.

1 more reply

j / k navigate · click thread line to collapse