I would argue that Prolog Definite Clause Grammars, which grew out of work on grammars in Prolog in the 1970s, have all three of these properties. Furthermore, since the context is maintained functionally, by threading additional values through the productions, no "undo actions" are required; Prolog's built-in backtracking is all that's needed.
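To make the "no undo actions" point concrete, here is a minimal sketch of the same idea in Haskell rather than Prolog (all names here are illustrative, not from any library): the context, here a list of declared names, is threaded functionally through every production, and the list-of-successes result type gives backtracking for free. A failed branch needs no cleanup, because each branch carries its own version of the context.

```haskell
type Ctx = [String]  -- illustrative parsing context: names seen so far

-- A parser threads the context alongside the remaining input; returning
-- a list of results is what makes backtracking "free".
newtype P a = P { runP :: Ctx -> String -> [(a, Ctx, String)] }

instance Functor P where
  fmap f (P p) = P $ \c s -> [ (f a, c', s') | (a, c', s') <- p c s ]

instance Applicative P where
  pure a = P $ \c s -> [(a, c, s)]
  P pf <*> P pa = P $ \c s ->
    [ (f a, c2, s2) | (f, c1, s1) <- pf c s, (a, c2, s2) <- pa c1 s1 ]

instance Monad P where
  P p >>= k = P $ \c s ->
    [ r | (a, c1, s1) <- p c s, r <- runP (k a) c1 s1 ]

-- Try both alternatives; no undo action is needed when one fails.
orElse :: P a -> P a -> P a
orElse (P p) (P q) = P $ \c s -> p c s ++ q c s

-- Consume one character satisfying a predicate.
sat :: (Char -> Bool) -> P Char
sat f = P $ \c s -> case s of
  (x:xs) | f x -> [(x, c, xs)]
  _            -> []

many1 :: P a -> P [a]
many1 p = do { x <- p; xs <- orElse (many1 p) (pure []); pure (x : xs) }

-- A production that both consumes input and extends the context,
-- the way a DCG threads an extra argument pair through a rule.
ident :: P String
ident = do
  name <- many1 (sat (`elem` ['a' .. 'z']))
  P $ \c s -> [(name, name : c, s)]
```

Running `runP ident [] "foo bar"` yields the maximal match first, with the context extended to `["foo"]`; the shorter matches are still there to fall back on, each with its own context version.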
Of course, the problem with DCGs is performance: they're exponential in the worst case. But I think they deserve mention in a dissertation like this anyway. Also, any backtracking parser risks exponential worst-case performance; it will be interesting to see how Colm avoids this fate (I've only read the first few pages so far).
Consequently I want to rephrase the first sentence of your third paragraph: the problem with DCGs is performance, because they're exponential in the naive case. That might not sound too bad today, but back in 1985 the hype was that Prolog let you program declaratively, just declaring what a parse looks like. Reasonable performance even in the naive case was precisely Prolog's popular selling point.
Note that threading the context through the parse tree while maintaining fully generalized parsing requires keeping all versions of the parsing context in memory. Consider making a C++ parser that way, i.e. every time you modify the structures you build in memory, you first make a copy of them.
Obviously this makes no sense for C++, but Clojure and OCaml get away with it thanks to persistent data structures, and in Haskell it's the standard way of implementing almost every stateful computation.
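For anyone who hasn't seen how those languages get away with it, here is a small illustration using Haskell's standard `Data.Map` (from the `containers` library; the symbol-table contents are hypothetical): "modifying" a map produces a new version in O(log n) time while every old version stays valid, so keeping all versions of a parsing context means cheap structural sharing, not wholesale copying.

```haskell
import qualified Data.Map as Map  -- containers, ships with GHC

type Ctx = Map.Map String String  -- an illustrative symbol table

ctxV0 :: Ctx
ctxV0 = Map.fromList [("x", "int"), ("y", "char")]

-- "Modifying" the context yields a new version in O(log n); ctxV0 is
-- untouched, and the two versions share most structure internally.
ctxV1 :: Ctx
ctxV1 = Map.insert "z" "float" ctxV0
```

A backtracking parser can therefore hold both versions at once and simply drop `ctxV1` if that branch of the parse fails; no undo, and no full copy was ever made.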
I think it's generally a hard sell if you try to convince people that they need to write their algorithms in your special language. Parsing tools deliver value because grammars are easier to write than the imperative code that implements those grammars. That value offsets the cost of having to learn a new special-purpose language. But imperative programming languages are already pretty good at tree traversal and transformation, so there's little benefit to using a special-purpose language for this.
I think that the next big thing in parsing will be a runtime that easily integrates into other languages, so that the framework handles only the parsing while all of the tree traversal and transformation is performed in whatever language the programmer was already using. This requires much less buy-in to a special-purpose language.
You are right: people want to use general-purpose languages for the more complex algorithms. I agree a means of embedding is necessary, and I have kept this in mind, though I have not yet achieved it. I would very much like to be able to parse, transform, and then have the option to import the data into another environment and carry on there.
Then I would hazard that it is not yet a language, since without documentation it has no "grammar". At best it is a patois.
I am really interested in DSNP. I am fairly well versed in GNU/Linux and can do some programming, mostly Java and Python; I work as a web front-end developer. Can I be of some help? Do you need testers, peers, documenters?