Ydiff: Structural Comparison of Programs (opens in new tab)

(yinwang0.wordpress.com)

176 pointsroquin13y ago33 comments

33 comments

This is pretty damn cool, but I can think of a few things that I'm curious how you're going to address:

1) Your C++ example, we have "namespace 'v8'" -- which has been removed and reinserted. That made me scratch my head a bit. Has it been refactored enough that this is kind of a rewrite? If so, there might be a third or fourth color here for "hey, this part didn't /change/ per se, but everything under it did"

2) Your Python example: right at the top we have a new class, PairIterator, with "class" rightly green. However I'm fighting against years of reading textual diffs that suggest the whole block is really the new thing. Could the block highlight the class as-is (where it changed) and subtly color the rest of the tree (what else changed with it)?

3) I get your whole in-the-future-we-store-ASTs argument, but that's certainly not the case today. Today we store text, and I don't see that changing. Could there also be a diff, perhaps even another mode that, counter to just dealing with structure, deals with formatting? ie, find the ranges that contribute nothing to the AST and then diff those textually.

I like the ideas and the paradigm shifting -- and the real answer is likely somewhere in between, because, at the end of the day, programmers are still editing things in text, even if they are manipulating greater structures.

As a diff tool, I'd introduce this to my workflow in a heartbeat if I wasn't working so hard to interpret what the diff means.

shurcooL13y ago

> 3) I get your whole in-the-future-we-store-ASTs argument, but that's certainly not the case today. Today we store text, and I don't see that changing. Could there also be a diff, perhaps even another mode that, counter to just dealing with structure, deals with formatting? ie, find the ranges that contribute nothing to the AST and then diff those textually.

Agree. I think it's likely code will be stored as text, but parsed into ASTs easily. Consider Go standard library has a full language parser built right in, so to get an AST from a .go text source file is about 3 lines of code.

Because you can get the code back from AST easily (while maintaining spacing), it really doesn't matter what you save. You can use tools to edit the text form or AST form without any duplication.

pshc13y ago

I've actually been trying to make 3) happen with a long-standing project of mine, because serializing AST to text discards information. In a (proper) AST, human-readable names are never used as pointer values.

When you bind something (reference a variable), you just point at that variable's node, rather than mentioning it by string. This removes a huge class of artificial problems brought on by plaintext source code (symbol name clashes, namespaces, overeager imports, var name typos, shadowing, and other programmer-compiler miscommunications), but introduces some editor UI concerns (e.g. indicating shadowing).

There are a whole host of other advantages of saving as ASTs directly. One of them is granular, semantic diffs like in the OP. I'm convinced it's the future... there are a lot of UI problems to solve in a solid, practical editor for it though.

2 more replies

RBerenguel13y ago

Neat, love new approaches to old problems. At first sight I thought it would be an awesome implementation of an idea I had (and probably tons of other people before and after :) a few years ago to find cheating among C programming assignments: convert programs to PostScript "drawings": ifs, fors and certain other functions gave rise to different "movements" of the cursor, eventually drawing paths for different expressions. Since it didn't take into account variable names, malloc-calloc-free order or any other extraneous thing, it could pinpoint cheating with a relatively good accuracy (and indeed, we caught 3 cheaters among ~90 submissions, and one of them would have been pretty hard to spot without the tool.) If anyone's interested, it was created with lex and basically parsed the expressions I was interested. I think I wrote about it once in my blog, but I'm not sure if I ever published the code (it was hacky as hell!)

jacques_chester13y ago

I've seen cheating-detection programs that use tree diff scores to compare code from different students.

The theory goes that students who share solutions will probably change the variable names, do some reformatting etc, which would fool a text diff. But the actual structure of the AST will be the same or similar.

So if you find close matches, you inspect them more closely.

Some quick Googling reveals that plagiarism detection using tree comparisons is a common idea.

RBerenguel13y ago

Yup, when I thought "this looks neat!" I didn't think it was very novel, the idea is very clear once you are actually looking at cheated code (IIRC my brother in law told me a few years ago - later an I wrote my code, which was written before meeting him - he had used a similar method years earlier in another university). But I didn't bother much with looking for existing solutions: this was just soo fun! I love reinventing wheels for fun/learning :)

andrewflnr13y ago

Am I correct in guessing, though, that it would only work for relatively sophisticated assignments that can be done in lots of different ways? Or is there always enough variety in different people's structure to tell when things are being copied?

RBerenguel13y ago

It was more like a quick way to check without having to remember 90 assignments. Visual inspection was far, far better (almost like a game of match 2). I taught numerical analysis, and usually there was more than one way to structure the code, in fact most assignments were clearly very different. You'd be surprised how many ways our students found to write the same Gauss inversion algorithm, for example (my personal experience is that I write always the exact same code with very minor variations, but of course I alwas refer to Golub-Van Loan for the pseudocode...) Other assignments were harder to check, there are not that many ways to write a Runge-Kutta integrator with variable stepsizes (it's basically a set of for statements in a precise order, except for the variable step integrator part), but the "fingerprints" pinpointed some odd stuff that could be checked manually.

Basically it was a fun project that got me a little into Lex, the kind of odd stuff I do in Saturday afternoons

tikhonj13y ago

I was working on a very similar project myself recently [1], heavily inspired by ydiff. Unfortunately, I didn't implement the tree diff algorithm entirely correctly, and it's been on hiatus for over a year :(.

The one interesting addition my project had was merging. The neat trick was that we reused the same tree diff algorithm to find conflicts :P. With a bit of work, we would have some very neat features, including the ability to resolve certain conflicts which physically overlap.

[1]: http://jelv.is/cow

ay13y ago

I thought of a variant of this approach for distributed code storage/reuse. Store each function as AST with normalized variable names. Then, a cryptographic hash of this would uniquely identify algorithm. So, caching becomes quite easy and very scalable. Callers refer to specific version by hash also. Thus your entire program can be identified by a single hash. Nice to distribute and cache at web scale. What do you think ?

pshc13y ago

It's what we need. This became clear shortly after learning git--it's the right data structure for the problem.

The hurdle is that hashes aren't user-friendly in a text-based code editor. We need an editor that lets us view and work at the right level of abstraction.

goldfeld13y ago

Lovely to see Racket being used. Given the pure meta-ness of Racket hopefully ydiff will not only evolve into a full-fledged version control system but a completely configurable one at that, akin to the power of Emacs and it's Lisp. Having something like a .yinrc where I could quickly throw around and test out custom functionality would be awesome. I haven't gotten much into git's plumbing yet, but I don't see a culture of hacking it like Emacs and Vim, and I'm not sure why that couldn't be. There's definitely a need, since you have things like git-flow. With structural versioning I think the possibilities of customization and automation would be greatly expanded, and I'd love to be able to use an .*rc file like I use my .vimrc daily, as a scratchpad for small scripts or for fleshing out rough ideas into later externalized plugins.

reubenmorais13y ago

That's why I love Mercurial. The API is great, people build great things with it, and it has a policy of "default to sane, simple and working; let users tweak stuff easily".

LukeShu13y ago

In my experience, git has that culture of hacking, but it isn't as visible. Things get added to the "contrib" folder pretty quickly; but few ever get the polish to graduate out. On my box, contrib is installed at `/usr/share/git/`.

Just create an executable named git-WHATEVER and put it somewhere in $PATH (or $GIT_EXEC_PATH).

taliesinb13y ago

We use a similar thing internally at Wolfram to diff Wolfram Language expressions, though I believe it doesn't have the subexpression matcher. But the diffs are much clearer when you have full m-expressions instead of s-expressions.

The coder who wrote it (@wmacura) now works at Tumblr.

n7213y ago

Relatedly, see Rich Hickey's http://blog.datomic.com/2012/10/codeq.html

frozenport13y ago

Wonderful, although I would rather have the AST integrated with GIT and not as some other kind of version control.

jared31413y ago

As long as you can convert the AST into a git Tree Object[0] you should be good to go.

[0] http://git-scm.com/book/en/Git-Internals-Git-Objects#Tree-Ob...

rmc13y ago

Really neat. Shall have to play with it.

Related to this, I wish there was a way to navigate a programme based on a tree, not just line and character navigation.

gngeal13y ago

Have you tried Paredit?

aethertap13y ago

This is great. I've wanted to build a similar tool for a couple of years, the distinction being that I want a structural grep utility. I'd pass in a pattern in the form of a code snippet (with pattern operators), and have the thing search for matching code structures. I suspect that this tool does 95% of what would be needed for that, very nice.

JadeNB13y ago

It's not what you want, but, as perhaps a first step, there's an SGML sgrep (http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html). Perhaps one might combine it with programming based on Oleg's SXML (http://okmij.org/ftp/Scheme/xml.html) and similar.

EDIT (fixing formatting, and …): I knew something else was nagging at me. The idea of a structural grep makes me think of Rob Pike's structural regular expressions (http://doc.cat-v.org/bell_labs/structural_regexps). The Sam editor (http://sam.cat-v.org) is based on this, and I thought I remembered a more modern version called something like 'e'; but, of course, such a name is impossible to Google.

aethertap13y ago

Awesome, thanks for the links.

jongraehl13y ago

Cool, but not perfect:

at the end of void ArmDebugger::Stop(Instruction* instr) (rhs of http://www.cs.indiana.edu/~yw21/demos/simulator-mips-simulat... )

we have an alignment of ' 2 * Instruction::kInstrSize' to some random code at the end

Scaevolus13y ago

Semantic Merge [1] is a product based around semantic diffs for C#/VB.NET. Their demo is quite impressive, and they're working on Java and C++ next.

[1]: http://www.semanticmerge.com/

mcmire13y ago

Worth noting this was written up a year ago and there hasn't been a whole lot of progress since. Would love to see this idea progressed further, this seems really neat.

thesebas13y ago

interesting thing in similar topic http://programmers.stackexchange.com/questions/119095/why-do...

pagekicker13y ago

can this be pointed at bash scripts?

blacksqr13y ago

Name too close to "yiff" to take seriously.

j / k navigate · click thread line to collapse

33 comments

barakm13y ago

This is pretty damn cool, but I can think of a few things that I'm curious how you're going to address:

As a diff tool, I'd introduce this to my workflow in a heartbeat if I wasn't working so hard to interpret what the diff means.

shurcooL13y ago

Because you can get the code back from AST easily (while maintaining spacing), it really doesn't matter what you save. You can use tools to edit the text form or AST form without any duplication.

pshc13y ago

2 more replies

RBerenguel13y ago

jacques_chester13y ago

I've seen cheating-detection programs that use tree diff scores to compare code from different students.

So if you find close matches, you inspect them more closely.

Some quick Googling reveals that plagiarism detection using tree comparisons is a common idea.

RBerenguel13y ago

andrewflnr13y ago

RBerenguel13y ago

Basically it was a fun project that got me a little into Lex, the kind of odd stuff I do in Saturday afternoons

tikhonj13y ago

[1]: http://jelv.is/cow

ay13y ago

pshc13y ago

It's what we need. This became clear shortly after learning git--it's the right data structure for the problem.

The hurdle is that hashes aren't user-friendly in a text-based code editor. We need an editor that lets us view and work at the right level of abstraction.

goldfeld13y ago

reubenmorais13y ago

That's why I love Mercurial. The API is great, people build great things with it, and it has a policy of "default to sane, simple and working; let users tweak stuff easily".

LukeShu13y ago

Just create an executable named git-WHATEVER and put it somewhere in $PATH (or $GIT_EXEC_PATH).

taliesinb13y ago

The coder who wrote it (@wmacura) now works at Tumblr.

n7213y ago

Relatedly, see Rich Hickey's http://blog.datomic.com/2012/10/codeq.html

frozenport13y ago

Wonderful, although I would rather have the AST integrated with GIT and not as some other kind of version control.

jared31413y ago

As long as you can convert the AST into a git Tree Object[0] you should be good to go.

[0] http://git-scm.com/book/en/Git-Internals-Git-Objects#Tree-Ob...

rmc13y ago

Really neat. Shall have to play with it.

Related to this, I wish there was a way to navigate a programme based on a tree, not just line and character navigation.

gngeal13y ago

Have you tried Paredit?

aethertap13y ago

JadeNB13y ago

aethertap13y ago

Awesome, thanks for the links.

jongraehl13y ago

Cool, but not perfect:

at the end of void ArmDebugger::Stop(Instruction* instr) (rhs of http://www.cs.indiana.edu/~yw21/demos/simulator-mips-simulat... )

we have an alignment of ' 2 * Instruction::kInstrSize' to some random code at the end

Scaevolus13y ago

Semantic Merge [1] is a product based around semantic diffs for C#/VB.NET. Their demo is quite impressive, and they're working on Java and C++ next.

[1]: http://www.semanticmerge.com/

mcmire13y ago

Worth noting this was written up a year ago and there hasn't been a whole lot of progress since. Would love to see this idea progressed further, this seems really neat.

thesebas13y ago

interesting thing in similar topic http://programmers.stackexchange.com/questions/119095/why-do...

pagekicker13y ago

can this be pointed at bash scripts?

blacksqr13y ago

Name too close to "yiff" to take seriously.

j / k navigate · click thread line to collapse