Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Consider the example of a missing parenthesis. I can specify a grammar with the following:
EXPR =
...
| '(' EXPR ')'
The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong. It is possible to modify the grammar to look for this case specifically:
EXPR =
...
| '(' EXPR ')'
| '(' EXPR // report this explicitly
But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated. So, many production compilers end up with custom parsers to report unexpected input in a sane way. It is not that the algorithms are lacking; there is simply no mechanism for a generator to know what a mismatched parenthesis should be reported as.
Building upon that (and obvious bias alert!), we came up with a couple of new algorithms. The current paper draft is https://arxiv.org/abs/1804.07133 (a future version of that paper will probably chop out MF which, in retrospect, adds too much complexity over CPCT+ for the relatively small gains). The Rust library in which it's implemented (https://crates.io/crates/lrpar) is documented at https://softdevteam.github.io/grmtools/master/book/errorreco... which is probably a friendlier starting point. Summary: one can do a surprisingly decent job of error recovery for very little work.
Hand coding a parser is usually worth it for a moderately popular production language. Parser generators really shine for smaller language efforts that can’t afford the overhead and don’t really care about the niceties of decent error reporting.
> Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Yes! I totally agree. Error reporting was one of the things I meant when I said that there is something "lacking with the algorithms we have today".
> The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong.
I'm not sure if this is the best example. Ruby is using Bison for parsing and it's certainly able to detect a missing parenthesis:
ruby -e 'p (1 + 2'
-e:1: syntax error, unexpected end-of-input, expecting ')'
Most parser algorithms have a concept of "currently expected characters/tokens" and can easily report this on an error. I also think that this will be more precise in Glush than in e.g. LR-based algorithms because Glush follows the grammar strictly top-down. It knows exactly which characters are allowed at any given moment.
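To make that idea concrete, here's a minimal sketch (not Glush, not Bison; the grammar and names are made up) of a recursive-descent parser that tracks the tokens it would accept at the current position, so a failure can be reported as "unexpected X, expecting Y" in the same style as the Ruby example above:

```python
# Hypothetical toy grammar: EXPR = 'x' | '(' EXPR ')'
# The expect() helper knows the full set of acceptable tokens at each point,
# so the error message can name exactly what was expected.

def parse_parens(tokens):
    pos = 0

    def expect(*alternatives):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in alternatives:
            pos += 1
            return tokens[pos - 1]
        found = tokens[pos] if pos < len(tokens) else "end-of-input"
        raise SyntaxError(
            f"unexpected {found}, expecting {' or '.join(alternatives)}")

    def expr():
        tok = expect("(", "x")
        if tok == "(":
            expr()
            expect(")")  # a missing ')' is reported precisely here

    expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]}, expecting end-of-input")

try:
    parse_parens(["(", "(", "x", ")"])  # one ')' missing
except SyntaxError as e:
    print(e)  # unexpected end-of-input, expecting )
```

The point is only that the "expected set" falls out of the parsing machinery for free; no extra error rules in the grammar are needed for this class of message.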
I have however not made an effort to make nice error messages so I don't know how well automatic error reporting would work and I don't feel comfortable making any concrete claims or promises.
> It is possible to modify the grammar to look for this case specifically:
Not sure if I see much value in manually adding such a case. This seems like a case that would be easily handled automatically (again: see Ruby above).
> But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated.
Well, if you want the equivalent error reporting in a hand-written parser you also need to add an equivalent amount of code. There are probably ways you can annotate a grammar to improve error reporting and I think it's a far more interesting solution to investigate. I haven't experimented with this yet, but I'd love to try to tackle it if I can find the time.
What's bollixed up your grammar file are the users who absolutely insist on not writing syntactically perfect programs every time.
Your 2nd example is still cleaner than writing a hand parser that does the same thing, surely? A slightly mucky DSL has got to be cleaner, clearer, shorter and so more maintainable than a handwritten equivalent, no?
What would you want the grammar language to look like that handled errors 'properly'?
About the Glushkov construction for regular expressions, here's some Python code that may perhaps help to explain it, since the Wikipedia page is hard to follow (at least for me) and the Mastodon thread linked in the post admitted to some confusion about the "Play on Regular Expressions" paper (which I haven't read): https://github.com/darius/sketchbook/blob/master/regex/nfa_p...
Sorry it's not pedagogically great either; I was rederiving the method for myself rather than implementing from textbooks. What I called 'follows' in the comment should correspond to 'pair set' in the post; the other fields have the same names.
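In case it helps alongside the linked file, here's a compact, self-contained sketch of the Glushkov sets themselves (this is my own illustration, not the linked code; the AST names Lit/Seq/Alt/Star and the 'follows' name are mine). Each literal gets a unique position; the construction computes, per subtree, whether it's nullable plus its first, last, and follow (pair) sets:

```python
from dataclasses import dataclass

@dataclass
class Lit:          # single literal, identified by a unique position
    pos: int
    char: str

@dataclass
class Seq:          # concatenation
    left: object
    right: object

@dataclass
class Alt:          # alternation
    left: object
    right: object

@dataclass
class Star:         # Kleene star
    inner: object

def analyze(node):
    """Return (nullable, first, last, follows) for the subtree."""
    if isinstance(node, Lit):
        return False, {node.pos}, {node.pos}, set()
    if isinstance(node, Alt):
        n1, f1, l1, p1 = analyze(node.left)
        n2, f2, l2, p2 = analyze(node.right)
        return n1 or n2, f1 | f2, l1 | l2, p1 | p2
    if isinstance(node, Seq):
        n1, f1, l1, p1 = analyze(node.left)
        n2, f2, l2, p2 = analyze(node.right)
        pairs = p1 | p2 | {(a, b) for a in l1 for b in f2}
        return (n1 and n2,
                f1 | f2 if n1 else f1,
                l1 | l2 if n2 else l2,
                pairs)
    if isinstance(node, Star):
        n, f, l, p = analyze(node.inner)
        return True, f, l, p | {(a, b) for a in l for b in f}

# (a|b)* a  with positions 1, 2, 3
ast = Seq(Star(Alt(Lit(1, "a"), Lit(2, "b"))), Lit(3, "a"))
nullable, first, last, follows = analyze(ast)
print(nullable, sorted(first), sorted(last), sorted(follows))
# False [1, 2, 3] [3] [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
```

The resulting NFA has one state per position, starts into the first set, accepts on the last set (plus the start state when nullable), and has a transition a→b exactly when (a, b) is in follows.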
Incidentally I think the post a bit overstates the problems with PEGs (you shouldn't need left recursion anywhere a hand-written recursive-descent parser would also work; PEGs were meant to streamline and formalize recursive descent, and this failure to model the practice can be fixed without big changes) -- but CFGs are great too and I'm looking forward to studying Glush. I wish I'd thought of exploring the generalization to context-free grammars.
I do have a few questions:
1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?
2. How is Glush's performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)
3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
> 1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?
Not quite yet. There is a lot more work needed, and writing this article was one step in the right direction. I was hoping I was able to get even closer to
> 2. How is Glush's performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)
This is yet to be determined. A few comments:
- I'm not overly concerned about performance because, as mentioned in the article, I don't plan to use this in a place where performance is critical.
- I'm not too worried about the performance since I'm not using very complicated data structures (it's all hash maps), and most intermediate data structures are small in size and are probably best implemented linearly (e.g. as plain arrays) for practical grammars with few ambiguities.
- It's based on NFA-style "advance one state at a time", which actually is not good for performance. This means that parsing a keyword such as "function" passes through 8 different states instead of just looking ahead and jumping straight to the end. I think taking advantage of such lookahead will probably help a lot.
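To illustrate the last point (names and code are purely illustrative, not from Glush): stepping through a keyword one NFA state per character versus checking the whole keyword with one lookahead comparison recognize exactly the same input, but the second skips the per-character state bookkeeping:

```python
KEYWORD = "function"

def match_stepwise(text, pos):
    # State i means "matched i characters so far" -- 8 states for "function".
    state = 0
    while state < len(KEYWORD):
        if pos + state >= len(text) or text[pos + state] != KEYWORD[state]:
            return None
        state += 1
    return pos + len(KEYWORD)

def match_lookahead(text, pos):
    # One sliced comparison jumps straight to the end on success.
    end = pos + len(KEYWORD)
    return end if text[pos:end] == KEYWORD else None

print(match_stepwise("function f()", 0))   # 8
print(match_lookahead("function f()", 0))  # 8
```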
> 3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
A single output from the parser is conceptually a flat list of _marks_ (think reverse RPN, i.e. prefix order):
// this input:
1 + 2 * 3 + 4
// with this precedence:
(1 + (2 * 3)) + 4
// could generate this flat list of marks:
add, add, 1, mul, 2, 3, 4
The total output from the parser is a "tree of marks". It's a nice language to work in but definitely not one I'd want to expose as a user interface.
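For illustration, here's one way such a flat mark list could be read back into a tree. The arity table is an assumption on my part (binary `add`/`mul`, everything else a leaf); it is not Glush's actual API:

```python
# Hypothetical arities: in prefix order, each operator mark is followed
# by that many subtrees; any other mark is a leaf.
ARITY = {"add": 2, "mul": 2}

def marks_to_tree(marks):
    it = iter(marks)
    def build():
        mark = next(it)
        return (mark, [build() for _ in range(ARITY.get(mark, 0))])
    return build()

tree = marks_to_tree(["add", "add", 1, "mul", 2, 3, 4])
print(tree)
# ('add', [('add', [(1, []), ('mul', [(2, []), (3, [])])]), (4, [])])
```

This recovers the (1 + (2 * 3)) + 4 shape from the flat list above.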
https://github.com/ldn-softdev/jtc/blob/master/User%20Guide....
TL;DR: Glush grammars consist of "rules" that are regexes + recursion of rule calls. Compiles to NFAs.
I'm not sure I see the point in declaring precedence levels inside rules -- maybe it's just a preference thing, but I like having operator precedences in a separate section of the grammar. Yacc does this. Megaparsec for Haskell does this.
This reminds me of two things:
1. How syntax highlighting is implemented in Vim and GEdit (which is to say GtkSourceView). See for example how "contexts" can be nested here: https://github.com/rubencaro/gedit_hacks/blob/master/.local/...
2. Kleenex: https://kleenexlang.org - https://github.com/diku-kmc/kleenexlang - a silly example: https://github.com/athas/EggsML/blob/master/concieggs/compil...
I am tempted to classify this as "regex extended with recursion" rather than "an alternative to CFGs", since I cannot decipher from the article the exact expressive power.
No. It cannot, since NFAs cannot parse languages that are not regular. Rather, the algorithm executes like a pushdown automaton. I don't think it's correct to call this compilation.
> I cannot decipher from the article the exact expressive power.
To me it seems to suggest being able to parse (exactly) all context-free languages.
Here's something that surprised me though:
> a parser toolkit that lets you create efficient parsers in multiple languages (currently JavaScript, Go, and Ruby)
> [some other approach] it looked very interesting, but it’s not completely smooth. First of all, all of the implementations have been in highly expressive, garbage collected languages (Haskell, Scala, Racket).
This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
> This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
Aah. I should probably have clarified that. At Sanity.io today we only need JavaScript and Go, and those are the most imminent languages to support (and what I can actually spend time on during work). However, as an overall parser algorithm I consider it crucial to be able to easily target C as this would unlock a whole new world of possibilities.
Does that make it more clear?
> but it's still a strange point to spend almost an entire paragraph on
Yeah, this article isn't exactly short on details. The purpose of this post was a mixture between (1) arguing for the value of declarative-based grammars, (2) showing all of the fascinating algorithms that have been discovered, and (3) showing how Glush works.
I've had a great time playing with Parsing with Derivatives, so even if I didn't end up directly using it, I feel like sharing cool ideas with new people.
Also, yes, thank you for mentioning Parsing with Derivatives and all the other stuff, I really liked the overview you gave of the field.
There are legitimate reasons to need more power than LALR(1) provides, but careless thinking is more common.