The big choice is hand-rolled recursive descent vs. LALR, the latter probably backed by a generator like Bison or Lemon, with re2c for the lexer.
Passing the LALR(1) check, i.e. having Bison actually accept the grammar without complaining about ambiguities, is either very annoying or requires thinking clearly about your language, depending on your perspective.
I claim that a lot of the misfires in language implementations come from not doing that work, and instead using a hand-rolled approximation to the parser you had in mind, because that's nicer/easier than the formal grammar.
The parser generators emit useless error messages, yes. So if you want nice user feedback, that'll be handrolled in some fashion. Sure.
Sometimes people write a grammar and use a hand rolled parser, hoping they match. Maybe with tests.
The right answer, used by no one as far as I can tell, is to parse with the LALR-generated parser, then, if that rejects your string because the program was ill-formed, call the hand-rolled one for guesswork/diagnostics. Never feed the parse tree from the hand-rolled parser into the rest of the compiler; that way lies all the bugs.
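A minimal sketch of that arrangement; `lalr_parse`, `handrolled_parse`, and `report_diagnostics` are hypothetical stand-ins for a generated parser, a diagnostic parser, and an error reporter:

```python
class ParseError(Exception):
    pass

def parse(source):
    try:
        # The generated parser is the single source of truth: its tree
        # is the only one the rest of the compiler ever sees.
        return lalr_parse(source)
    except ParseError:
        # The hand-rolled parser runs only on rejected input, purely
        # for guesswork/diagnostics; its tree is thrown away.
        report_diagnostics(handrolled_parse(source))
        raise
```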
To phrase it another way: your linter and your parser don't need to be the same tool, even if it's convenient in some senses to mash them together.
This feels like a recipe for disaster. If the hand-rolled parser won't match a formal grammar, why would it match the generated parser?
The poor programmer will be debugging the wrong thing.
It reminds me of my short stint writing C++ where I'd read undefined memory in release mode, but when I ran it under debug mode it just worked.
I assume it’s far too late at this point, but that almost always means that you’re invoking UB. Your next step should be enabling UBSan.
The hand-rolled parser might match, but it also might not, what with software being difficult and testing being boring and so forth.
In OCaml, a language highly suited for developing languages in, that de facto standard is the Menhir LR parser generator. It's a modern Yacc with many convenient features, including combinator-like library functions. I honestly enjoy the work of mastering Menhir, poring over the manual, which is all one page: https://gallium.inria.fr/~fpottier/menhir/manual.html
These days I just handroll recursive descent parsers with a mutable stream record, `raise_notrace` and maybe some combinators inspired by FParsec for choices, repetition and error messages. I know it's not as rigorous, but at least it's regular code without unexpected limitations.
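A rough Python analogue of that setup (the original being OCaml with a mutable stream record and `raise_notrace`); the names here are illustrative, not anyone's actual library:

```python
class Stream:
    """Mutable cursor over a token list."""
    def __init__(self, toks):
        self.toks, self.pos = toks, 0
    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

class Fail(Exception):
    pass

def expect(s, tok):
    if s.peek() != tok:
        raise Fail(f"expected {tok!r} at {s.pos}")
    s.pos += 1
    return tok

def choice(*parsers):
    """Try alternatives in order, resetting the stream on failure."""
    def run(s):
        start = s.pos
        for p in parsers:
            try:
                return p(s)
            except Fail:
                s.pos = start
        raise Fail(f"no alternative matched at {start}")
    return run

def many(p):
    """Zero-or-more repetition (p must consume input on success)."""
    def run(s):
        out = []
        while True:
            start = s.pos
            try:
                out.append(p(s))
            except Fail:
                s.pos = start
                return out
    return run

# Example: many(choice(lambda s: expect(s, "a"), lambda s: expect(s, "b")))
# applied to Stream(["a", "b", "a", "c"]) yields ["a", "b", "a"].
```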
What makes OCaml suited for that?
LR however is more powerful, though this mostly matters if you don't have access to automatic grammar rewriting for your LL. More significantly, however, there's probably more good tooling for LR (or perhaps: you can assume that if tooling exists, it is good at what it is designed for); one problem with LL being so "simple" is that there's a lot of bad tooling out there.
The important things are 1. that you meaningfully eliminate ambiguities (which is easy to enforce for LR and doable for LL if your tooling is good), and 2. that you keep linear time complexity. Any parser other than LL/LR should be rejected because it fails at least one of these, and often both.
Within the LL and LR families there are actually quite a few members. SLR(1) is strong enough to be interesting but too weak for anything I would call a "language". LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1) (though note that modern tooling can do an optimistic fallback, avoiding the massive blowups of ancient LR tools). SLL(1) I'm not personally familiar with. X(k), where X is one of {SLL, LL, SLR, LALR, LR} and k > 1, is not very useful; k=1 suffices. LL(*) however should be avoided due to backtracking, but in some cases consider whether you can parse token trees first (this is currently poorly represented in the literature; you want to be doing some form of this for error recovery anyway - automated error recovery is a useless lie) and/or defer the partial ambiguity until the AST is built (often better for error messages anyway, independent of using token trees).
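For the unfamiliar, a sketch of what "parsing token trees first" can look like (my illustration, not from any particular paper): group tokens by matching brackets in one linear pass, before any grammar-level parsing, so bracket errors are caught up front.

```python
def token_trees(tokens):
    OPEN = {"(": ")", "[": "]", "{": "}"}
    stack = [([], None)]            # (children, expected closing bracket)
    for t in tokens:
        if t in OPEN:
            stack.append(([], OPEN[t]))
        elif t in OPEN.values():
            children, expected = stack.pop()
            if t != expected:
                raise SyntaxError(f"unbalanced {t!r}")
            stack[-1][0].append(children)   # the whole group becomes one node
        else:
            stack[-1][0].append(t)
    if len(stack) != 1:
        raise SyntaxError("unclosed bracket")
    return stack[0][0]

# token_trees(["f", "(", "a", ",", "b", ")"]) -> ["f", ["a", ",", "b"]]
```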
The idea that you're going to hand-roll a parser generator and then use that to generate a parser and the result is going to be less buggy than just hand-rolling a recursive descent parser, screams "I've never written code outside of an academic context".
An LR(1) parser can have many more states in its DFA than LALR(1). That was important back in the 1970s when I was fighting for every byte of RAM, but now it's a total non-issue. I don't know why you would bother with LALR(1) now if you had an LR(1) parser generator.
It's important to note that ambiguities are something which exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed (unless that language is not well-specified, but then all bets are off and it is meaningless to speak of parsing), because they can be eliminated by side-conditions.
For example, one famous ambiguity is the dangling 'else' problem in C. But this isn't an actual ambiguity in the C language: the language has a side-condition which says that 'else' matches to the closest unmatched 'if'. This is completely unambiguous and so a recursive descent parser for C simply doesn't encounter this problem. It is only because parser generators, at least in their most academic form, lack a way to specify this side-condition that their proponents have to come up with a whole theory of "ambiguities". (Shockingly, Wikipedia gets this exactly right in the article on dangling else which I just thought to look up: "The dangling else is a problem in programming of parser generators".)
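A minimal sketch (my illustration, not real C parsing) of how recursive descent encodes that side-condition directly: the optional 'else' is consumed greedily, so it always binds to the nearest unmatched 'if'.

```python
def parse_stmt(toks):
    if toks and toks[0] == "if":
        toks.pop(0)
        cond = toks.pop(0)          # stand-in for a real condition parser
        then = parse_stmt(toks)
        other = None
        if toks and toks[0] == "else":   # the nearest 'if' claims the 'else'
            toks.pop(0)
            other = parse_stmt(toks)
        return ("if", cond, then, other)
    return toks.pop(0)              # stand-in for any other statement

# "if a if b s1 else s2" parses as ("if", "a", ("if", "b", "s1", "s2"), None):
# the 'else' attaches to the inner 'if', with no ambiguity to resolve.
```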
Likewise goes the problem of left-recursion. Opponents of recursive descent always present left-recursion as a gotcha which requires some special handling. Meanwhile actual programmers writing actual recursive descent parsers don't have any idea what these academics are talking about because the language that they're parsing (as it exists in their mind) doesn't feature left-recursion, but instead iteration. Left-recursion is only introduced in service of restricted formal grammars in which recursion is the only available primitive and iteration either doesn't exist or is syntactic sugar for recursion. For the recursive descent user, iteration is a perfectly acceptable primitive. The reason for the discrepancy goes back to side-conditions: iteration requires a side-condition stating how to build the parse tree; parser generators call this "resolving the ambiguity" because they can't express this in their restricted grammar, not because the language was ambiguous.
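A sketch of iteration replacing left recursion (my illustration, not the commenter's code): a left-associative chain like a - b - c is parsed with a loop, and the loop body is exactly the side-condition stating how to build the tree.

```python
def parse_sum(toks):
    left = toks.pop(0)                      # stand-in for a term parser
    while toks and toks[0] in ("+", "-"):   # iteration, not recursion
        op = toks.pop(0)
        right = toks.pop(0)
        left = (op, left, right)            # the side-condition: fold leftward
    return left

# parse_sum(["a", "-", "b", "-", "c"]) -> ("-", ("-", "a", "b"), "c")
```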
In between "easiest to get started with" and "what production-grade systems use", there is "easy to actually finish a medium-sized project with." I think LR parsers still defend that middle ground pretty well.
That was part of my question I think. I wouldn't have been able to tell you that the dominant paradigm being argued against was LR parsers, because I've never come across even one that I'm aware of (I've heard of them, but that's about it). Perhaps it's academia where they're popular?
It seems to be mainly academics and others interested in parsing theory, and those who like complexity for the sake of complexity.
I get really annoyed when people still complain about YACC while ignoring the four decades of practical improvement that Bison has given us if you bother to configure it.
It is also written in a badass style and argues that this is superior to parser generators.
For those to whom they are new: I found them a little tricky to implement directly from Pratt's paper or even Crockford's JavaScript that popularized them.
So, through trial and error I figured out how to actually implement them in ordinary languages (i.e. not in Lisp).
If it helps, examples in C and Go are here:
https://github.com/glycerine/PrattParserInC
https://github.com/glycerine/zygomys/blob/master/zygo/pratt....
I find them easier to work with than the cryptic LALR(1) bison/yacc tools, but then I never really felt like I mastered yacc to begin with.
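For a feel of why they're approachable, here is a minimal Python sketch of the core Pratt loop (illustrative only; the linked repos are the real examples): each binary operator gets a binding power, and the left operand is extended by a loop rather than by recursion.

```python
BP = {"+": 10, "-": 10, "*": 20, "/": 20}   # made-up binding powers

def parse_expr(tokens, min_bp=0):
    left = tokens.pop(0)            # assume an atom (number/identifier)
    while tokens and tokens[0] in BP and BP[tokens[0]] >= min_bp:
        op = tokens.pop(0)
        # +1 makes equal-precedence operators left-associative
        right = parse_expr(tokens, BP[op] + 1)
        left = (op, left, right)
    return left

# parse_expr(["1", "+", "2", "*", "3"]) -> ("+", "1", ("*", "2", "3"))
```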
In practice most recursive descent parsers use if-else liberally. Thus, they effectively work like PEGs where the first match wins (but without the limited backtracking of PEGs). They are deterministic in the sense that the implementation always returns a predictable result. But they are still ambiguous in the sense that this behavior might not have been planned by the language designer, and the ambiguity may not have been resolved how the programmer expected.
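A tiny hypothetical illustration of that first-match-wins behavior: testing "<" before "<=" silently mis-lexes "<=", and nothing warns you about it.

```python
def lex_op_buggy(s):
    if s.startswith("<"):       # matches first, so "<=" can never win
        return "<", s[1:]
    if s.startswith("<="):      # unreachable
        return "<=", s[2:]

def lex_op_fixed(s):
    if s.startswith("<="):      # the longest match must be tried first
        return "<=", s[2:]
    if s.startswith("<"):
        return "<", s[1:]
```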
By that, do you mean parser combinators?
https://pypi.org/project/pybison/ , or its predecessors such as https://pypi.org/project/ply/ ?
But yes, the decidedly non-traditional https://github.com/pyparsing/pyparsing/ is certainly more popular.
Recursive descent parsers can simply be implemented with recursive functions. Implementing semantic checks becomes easy with additional parameters.
What a waste of time. I failed miserably.
However, I also realized that the only semantic information needed was to keep track of typedefs. That made recursive descent practical and effective.
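A minimal sketch of that arrangement (illustrative, not the commenter's code; it assumes well-formed four-token statements for brevity): the classic C ambiguity `T * x;` is a declaration iff T is a typedef name, so the parser threads a set of known typedef names through as a parameter.

```python
def parse_stmt(toks, typedefs):
    """Parse one statement; `typedefs` is the semantic info as a parameter."""
    if toks[0] == "typedef":
        # e.g. typedef int T ;   -- registers T for later statements
        typedefs.add(toks[2])
        return ("typedef", toks[1], toks[2]), toks[4:]
    if toks[1] == "*" and toks[0] in typedefs:
        # T * x ;   -- a declaration, because T names a type
        return ("decl", toks[0], toks[2]), toks[4:]
    if toks[1] == "*":
        # a * b ;   -- otherwise it's a multiplication expression
        return ("mul", toks[0], toks[2]), toks[4:]
    raise SyntaxError(f"unrecognised statement at {toks[0]!r}")

# After "typedef int T ;", the tokens "T * x ;" parse as a declaration;
# without it, "a * b ;" parses as multiplication.
```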
You then construct the parser by combining unambiguous parsers from the bottom up. The result ends up unambiguous by construction.
This high level algorithm is much easier to implement without a global lexer. Global lexing can be a source of inadvertent ambiguity. Strings make this obvious. If instead, you lex in a context specific way, it is usually easy to efficiently eliminate ambiguities.
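A minimal sketch of context-specific lexing, under an assumed toy language: on seeing a quote the lexer switches to string-mode rules, so string contents can never be confused with operators or keywords. (Error handling for unterminated strings is elided.)

```python
def lex(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c == '"':                      # switch lexing context
            j = i + 1
            while src[j] != '"':          # string-mode rules apply here
                j += 1
            tokens.append(("STR", src[i + 1:j]))
            i = j + 1
        elif c.isspace():
            i += 1
        else:                             # default-mode rules
            j = i
            while j < len(src) and not src[j].isspace() and src[j] != '"':
                j += 1
            tokens.append(("TOK", src[i:j]))
            i = j
    return tokens

# lex('say "a b" x') -> [("TOK", "say"), ("STR", "a b"), ("TOK", "x")]
```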
I wish I could have saved the source. It would be fun to see it.
It seems that, from the outside looking in, ~all significant PL projects end up using a hand-written recursive descent parser, eventually.
I am interested in that area, and reading up and learning about it.
For new languages this should be avoided - just design a sane grammar in the first place.
1. Replace any expression that's within parentheses by its parse tree by using recursion
2. Find the lowest precedence operator, breaking ties however you'd like. Call this lowest precedence operator OP.
3. View the whole unparsed expression as `x OP y`
4. Generate a parse tree for x and for y. Call them P(x) and P(y).
5. Return ["OP", P(x), P(y)].
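A direct Python sketch of steps 1-5 (my illustration, with a made-up precedence table and binary operators only); ties in step 2 are broken by taking the rightmost operator, which makes equal-precedence operators left-associative:

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    # Step 1: collapse parenthesised subexpressions into single nodes.
    flat, i = [], 0
    while i < len(tokens):
        if tokens[i] == "(":
            depth, j = 1, i + 1
            while depth:
                depth += {"(": 1, ")": -1}.get(tokens[j], 0)
                j += 1
            flat.append(parse(tokens[i + 1:j - 1]))  # recurse on the inside
            i = j
        else:
            flat.append(tokens[i])
            i += 1
    # Steps 2-3: find the lowest-precedence operator OP.
    best = None
    for k, tok in enumerate(flat):
        if isinstance(tok, str) and tok in PREC:
            if best is None or PREC[tok] <= PREC[flat[best]]:
                best = k
    if best is None:
        return flat[0]   # a lone operand, or an already-built subtree
    # Steps 4-5: recurse on x and y, return ["OP", P(x), P(y)].
    return [flat[best], parse(flat[:best]), parse(flat[best + 1:])]

# parse("( 1 + 2 ) * 3".split()) -> ['*', ['+', '1', '2'], '3']
```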
It's easy to speed up step 2 by keeping a table of all the operators in an expression, sorted by their precedence levels. For this table to work properly, the positions of all the tokens must never change.

A PEG is always unambiguous because it picks the first option - but whether that was the intended parse is not necessarily straightforward. In practice these problems don't usually show up, so they're fine to work with.
The advantage LR gives you is that it produces a parser where there are no ambiguities and every successful parse is the one intended. An LR grammar is a proof, as well as a means of producing a parser. A decent LR parser generator is like a simple proof assistant - it will find problems with your language before you do, so you can fix your syntax before putting it into production.
In "real-world" parsing tasks as you put it, the problems of LR parser generators is that they're not the best suited to parsing languages that have ambiguities, like C, C++ and many others. Some of the complaints about LR are about the workarounds that need to be done to parse these languages, where it's obviously the wrong tool for the job because those languages aren't described by proper LR grammars.
But if you're designing a new language from scratch, surely it's better to not repeat those mistakes? If you carefully design your language to be parsed by an LR grammar then other developers who come to parse your language won't encounter those issues. They won't need lexical tie-ins and other nonsense that complicates the process.