Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Consider the example of a missing parenthesis. I can specify a grammar with the following:
EXPR =
...
| '(' EXPR ')'
The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong. It is possible to modify the grammar to look for this case specifically:
EXPR =
...
| '(' EXPR ')'
| '(' EXPR // report this explicitly
But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated. So, many production compilers end up with custom parsers to report unexpected input in a sane way. It is not that the algorithms are lacking; there is simply no mechanism for a generator to know what a mismatched parenthesis should be reported as.
Building upon that (and obvious bias alert!), we came up with a couple of new algorithms. The current paper draft is https://arxiv.org/abs/1804.07133 (a future version of that paper will probably chop out MF which, in retrospect, adds too much complexity over CPCT+ for the relatively small gains). The Rust library in which it's implemented (https://crates.io/crates/lrpar) is documented at https://softdevteam.github.io/grmtools/master/book/errorreco... which is probably a friendlier starting point. Summary: one can do a surprisingly decent job of error recovery for very little work.
Hand coding a parser is usually worth it for a moderately popular production language. Parser generators really shine for smaller language efforts that can’t afford the overhead and don’t really care about the niceties of decent error reporting.
> Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.
Yes! I totally agree. Error reporting was one of the things I meant when I said that there is something "lacking with the algorithms we have today".
> The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong.
I'm not sure if this is the best example. Ruby is using Bison for parsing and it's certainly able to detect a missing parenthesis:
ruby -e 'p (1 + 2'
-e:1: syntax error, unexpected end-of-input, expecting ')'
Most parser algorithms have a concept of "currently expected characters/tokens" and can easily report this on an error. I also think that this will be more precise in Glush than in e.g. LR-based algorithms because Glush follows the grammar strictly top-down. It knows exactly which characters are allowed at any given moment.
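To make that idea concrete, here's a minimal sketch (not Glush, not Bison; the grammar and names are made up) of a recursive-descent parser that tracks the tokens it would accept at the current position, so a failure can be reported as "unexpected X, expecting Y" in the same style as the Ruby example above:

```python
# Hypothetical toy grammar: EXPR = 'x' | '(' EXPR ')'
# The expect() helper knows the full set of acceptable tokens at each point,
# so the error message can name exactly what was expected.

def parse_parens(tokens):
    pos = 0

    def expect(*alternatives):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in alternatives:
            pos += 1
            return tokens[pos - 1]
        found = tokens[pos] if pos < len(tokens) else "end-of-input"
        raise SyntaxError(
            f"unexpected {found}, expecting {' or '.join(alternatives)}")

    def expr():
        tok = expect("(", "x")
        if tok == "(":
            expr()
            expect(")")  # a missing ')' is reported precisely here

    expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]}, expecting end-of-input")

try:
    parse_parens(["(", "(", "x", ")"])  # one ')' missing
except SyntaxError as e:
    print(e)  # unexpected end-of-input, expecting )
```

The point is only that the "expected set" falls out of the parsing machinery for free; no extra error rules in the grammar are needed for this class of message.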
I have however not made an effort to make nice error messages so I don't know how well automatic error reporting would work and I don't feel comfortable making any concrete claims or promises.
> It is possible to modify the grammar to look for this case specifically:
Not sure if I see much value in manually adding such a case. This seems like a case that would be easily handled automatically (again: see Ruby above).
> But now I've just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated.
Well, if you want the equivalent error reporting in a hand-written parser you also need to add an equivalent amount of code. There are probably ways you can annotate a grammar to improve error reporting and I think it's a far more interesting solution to investigate. I haven't experimented with this yet, but I'd love to try to tackle it if I can find the time.
What's bollixed up your grammar file are the users who absolutely insist on not writing syntactically perfect programs every time.
Your 2nd example is still cleaner than writing a hand parser that does the same thing, surely? A slightly mucky DSL has got to be cleaner, clearer, shorter and so more maintainable than a handwritten equivalent, no?
What would you want the grammar language to look like that handled errors 'properly'?
About the Glushkov construction for regular expressions, here's some Python code that may perhaps help to explain it, since the Wikipedia page is hard to follow (at least for me) and the Mastodon thread linked in the post admitted to some confusion about the "Play on Regular Expressions" paper (which I haven't read): https://github.com/darius/sketchbook/blob/master/regex/nfa_p...
Sorry it's not pedagogically great either; I was rederiving the method for myself rather than implementing from textbooks. What I called 'follows' in the comment should correspond to 'pair set' in the post; the other fields have the same names.
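In case it helps alongside the linked file, here's a compact, self-contained sketch of the Glushkov sets themselves (this is my own illustration, not the linked code; the AST names Lit/Seq/Alt/Star and the 'follows' name are mine). Each literal gets a unique position; the construction computes, per subtree, whether it's nullable plus its first, last, and follow (pair) sets:

```python
from dataclasses import dataclass

@dataclass
class Lit:          # single literal, identified by a unique position
    pos: int
    char: str

@dataclass
class Seq:          # concatenation
    left: object
    right: object

@dataclass
class Alt:          # alternation
    left: object
    right: object

@dataclass
class Star:         # Kleene star
    inner: object

def analyze(node):
    """Return (nullable, first, last, follows) for the subtree."""
    if isinstance(node, Lit):
        return False, {node.pos}, {node.pos}, set()
    if isinstance(node, Alt):
        n1, f1, l1, p1 = analyze(node.left)
        n2, f2, l2, p2 = analyze(node.right)
        return n1 or n2, f1 | f2, l1 | l2, p1 | p2
    if isinstance(node, Seq):
        n1, f1, l1, p1 = analyze(node.left)
        n2, f2, l2, p2 = analyze(node.right)
        pairs = p1 | p2 | {(a, b) for a in l1 for b in f2}
        return (n1 and n2,
                f1 | f2 if n1 else f1,
                l1 | l2 if n2 else l2,
                pairs)
    if isinstance(node, Star):
        n, f, l, p = analyze(node.inner)
        return True, f, l, p | {(a, b) for a in l for b in f}

# (a|b)* a  with positions 1, 2, 3
ast = Seq(Star(Alt(Lit(1, "a"), Lit(2, "b"))), Lit(3, "a"))
nullable, first, last, follows = analyze(ast)
print(nullable, sorted(first), sorted(last), sorted(follows))
# False [1, 2, 3] [3] [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
```

The resulting NFA has one state per position, starts into the first set, accepts on the last set (plus the start state when nullable), and has a transition a→b exactly when (a, b) is in follows.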
Incidentally I think the post a bit overstates the problems with PEGs (you shouldn't need left recursion anywhere a hand-written recursive-descent parser would also work; PEGs were meant to streamline and formalize recursive descent, and this failure to model the practice can be fixed without big changes) -- but CFGs are great too and I'm looking forward to studying Glush. I wish I'd thought of exploring the generalization to context-free grammars.
I do have a few questions:
1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?
2. How is Glush's performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)
3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
> 1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?
Not quite yet. There is a lot more work needed, and writing this article was one step in the right direction. I was hoping I was able to get even closer to
> 2. How is Glush's performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)
This is yet to be determined. A few comments:
- I'm not overly concerned about performance because, as mentioned in the article, I don't plan to use this in a place where performance is critical.
- I'm not too worried about the performance since I'm not using very complicated data structures (it's all hash maps), and most intermediate data structures are small in size and are probably best implemented linearly (e.g. as plain arrays) for practical grammars with few ambiguities.
- It's based on NFA-style "advance one state at a time", which actually is not good for performance. This means that parsing a keyword such as "function" passes through 8 different states instead of just looking ahead and jumping straight to the end. I think taking advantage of such lookahead will probably help a lot.
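To illustrate the last point (names and code are purely illustrative, not from Glush): stepping through a keyword one NFA state per character versus checking the whole keyword with one lookahead comparison recognize exactly the same input, but the second skips the per-character state bookkeeping:

```python
KEYWORD = "function"

def match_stepwise(text, pos):
    # State i means "matched i characters so far" -- 8 states for "function".
    state = 0
    while state < len(KEYWORD):
        if pos + state >= len(text) or text[pos + state] != KEYWORD[state]:
            return None
        state += 1
    return pos + len(KEYWORD)

def match_lookahead(text, pos):
    # One sliced comparison jumps straight to the end on success.
    end = pos + len(KEYWORD)
    return end if text[pos:end] == KEYWORD else None

print(match_stepwise("function f()", 0))   # 8
print(match_lookahead("function f()", 0))  # 8
```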
> 3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
A single output from the parser is conceptually a flat list of _marks_ (think reverse RPN, i.e. prefix order):
// this input:
1 + 2 * 3 + 4
// with this precedence:
(1 + (2 * 3)) + 4
// could generate this flat list of marks:
add, add, 1, mul, 2, 3, 4
The total output from the parser is a "tree of marks". It's a nice language to work in but definitely not one I'd want to expose as a user interface.
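For illustration, here's one way such a flat mark list could be read back into a tree. The arity table is an assumption on my part (binary `add`/`mul`, everything else a leaf); it is not Glush's actual API:

```python
# Hypothetical arities: in prefix order, each operator mark is followed
# by that many subtrees; any other mark is a leaf.
ARITY = {"add": 2, "mul": 2}

def marks_to_tree(marks):
    it = iter(marks)
    def build():
        mark = next(it)
        return (mark, [build() for _ in range(ARITY.get(mark, 0))])
    return build()

tree = marks_to_tree(["add", "add", 1, "mul", 2, 3, 4])
print(tree)
# ('add', [('add', [(1, []), ('mul', [(2, []), (3, [])])]), (4, [])])
```

This recovers the (1 + (2 * 3)) + 4 shape from the flat list above.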
https://github.com/ldn-softdev/jtc/blob/master/User%20Guide....
TL;DR: Glush grammars consist of "rules" that are regexes + recursion of rule calls. Compiles to NFAs.
I'm not sure I see the point in declaring precedence levels inside rules -- maybe it's just a preference thing, but I like having operator precedences in a separate section of the grammar. Yacc does this. Megaparsec for Haskell does this.
This reminds me of two things:
1. How syntax highlighting is implemented in Vim and GEdit (which is to say GtkSourceView). See for example how "contexts" can be nested here: https://github.com/rubencaro/gedit_hacks/blob/master/.local/...
2. Kleenex: https://kleenexlang.org - https://github.com/diku-kmc/kleenexlang - a silly example: https://github.com/athas/EggsML/blob/master/concieggs/compil...
I am tempted to classify this as "regex extended with recursion" rather than "an alternative to CFGs", since I cannot decipher from the article the exact expressive power.
No. It cannot, since NFAs cannot parse languages that are not regular. Rather, the algorithm executes like a pushdown automaton. I don't think it's correct to call this compilation.
> I cannot decipher from the article the exact expressive power.
To me it seems to suggest being able to parse (exactly) all context-free languages.
Here's something that surprised me though:
> a parser toolkit that lets you create efficient parsers in multiple languages (currently JavaScript, Go, and Ruby)
> [some other approach] it looked very interesting, but it’s not completely smooth. First of all, all of the implementations have been in highly expressive, garbage collected languages (Haskell, Scala, Racket).
This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
> This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
Aah. I should probably have clarified that. At Sanity.io today we only need JavaScript and Go, and those are the most imminent languages to support (and what I can actually spend time on during work). However, as an overall parser algorithm I consider it crucial to be able to easily target C as this would unlock a whole new world of possibilities.
Does that make it more clear?
> but it's still a strange point to spend almost an entire paragraph on
Yeah, this article isn't exactly short on details. The purpose of this post was a mixture between (1) arguing for the value of declarative-based grammars, (2) showing all of the fascinating algorithms that have been discovered, and (3) showing how Glush works.
I've had a great time playing with Parsing with Derivatives, so even if I didn't end up directly using it, I feel like sharing cool ideas with new people.
Also, yes, thank you for mentioning Parsing with Derivatives and all the other stuff, I really liked the overview you gave of the field.
There are legitimate reasons to need more power than LALR(1) provides, but careless thinking is more common.