The big choice is hand-rolled recursive descent vs. LALR, the latter probably backed by a generator like Bison or Lemon, with re2c for the lexer.
Passing the LALR(1) check, i.e. having Bison actually accept the grammar without complaining about ambiguities, is either very annoying or requires thinking clearly about your language, depending on your perspective.
I claim that a lot of the misfires in language implementations come from not doing that work, and instead using a hand-rolled approximation to the parser you had in mind, because that's nicer/easier than the formal grammar.
The parser generators emit useless error messages, yes. So if you want nice user feedback, that'll be handrolled in some fashion. Sure.
Sometimes people write a grammar and use a hand rolled parser, hoping they match. Maybe with tests.
The right answer, used by no one as far as I can tell, is to parse with the LALR-generated parser, then, if that rejects your string because the program was ill-formed, call the hand-rolled one for guesswork/diagnostics. Never feed the parse tree from the hand-rolled parser into the rest of the compiler; that way lies all the bugs.
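A minimal sketch of that arrangement; `lalr_parse`, `handrolled_parse`, and `report_diagnostics` are hypothetical stand-ins for a generated parser, a diagnostic parser, and an error reporter:

```python
class ParseError(Exception):
    pass

def parse(source):
    try:
        # The generated parser is the single source of truth: its tree
        # is the only one the rest of the compiler ever sees.
        return lalr_parse(source)
    except ParseError:
        # The hand-rolled parser runs only on rejected input, purely
        # for guesswork/diagnostics; its tree is thrown away.
        report_diagnostics(handrolled_parse(source))
        raise
```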
To phrase it another way: your linter and your parser don't need to be the same tool, even if it's convenient in some senses to mash them together.
This feels like a recipe for disaster. If the hand-rolled parser won't match a formal grammar, why would it match the generated parser?
The poor programmer will be debugging the wrong thing.
It reminds me of my short stint writing C++ where I'd read undefined memory in release mode, but when I ran it under debug mode it just worked.
I assume it’s far too late at this point, but that almost always means that you’re invoking UB. Your next step should be enabling UBSan.
The hand-rolled parser might match, but it also might not, what with software being difficult and testing being boring and so forth.
In OCaml, a language highly suited for developing languages in, that de facto standard is the Menhir LR parser generator. It's a modern Yacc with many convenient features, including combinator-like library functions. I honestly enjoy the work of mastering Menhir, poring over the manual, which is all one page: https://gallium.inria.fr/~fpottier/menhir/manual.html
These days I just handroll recursive descent parsers with a mutable stream record, `raise_notrace` and maybe some combinators inspired by FParsec for choices, repetition and error messages. I know it's not as rigorous, but at least it's regular code without unexpected limitations.
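A rough Python analogue of that setup (the original being OCaml with a mutable stream record and `raise_notrace`); the names here are illustrative, not anyone's actual library:

```python
class Stream:
    """Mutable cursor over a token list."""
    def __init__(self, toks):
        self.toks, self.pos = toks, 0
    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

class Fail(Exception):
    pass

def expect(s, tok):
    if s.peek() != tok:
        raise Fail(f"expected {tok!r} at {s.pos}")
    s.pos += 1
    return tok

def choice(*parsers):
    """Try alternatives in order, resetting the stream on failure."""
    def run(s):
        start = s.pos
        for p in parsers:
            try:
                return p(s)
            except Fail:
                s.pos = start
        raise Fail(f"no alternative matched at {start}")
    return run

def many(p):
    """Zero-or-more repetition (p must consume input on success)."""
    def run(s):
        out = []
        while True:
            start = s.pos
            try:
                out.append(p(s))
            except Fail:
                s.pos = start
                return out
    return run

# Example: many(choice(lambda s: expect(s, "a"), lambda s: expect(s, "b")))
# applied to Stream(["a", "b", "a", "c"]) yields ["a", "b", "a"].
```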
What makes OCaml suited for that?
LR however is more powerful, though this mostly matters if you don't have access to automatic grammar rewriting for your LL. More significantly, however, there's probably more good tooling for LR (or perhaps: you can assume that if tooling exists, it is good at what it is designed for); one problem with LL being so "simple" is that there's a lot of bad tooling out there.
The important things are 1. that you meaningfully eliminate ambiguities (which is easy to enforce for LR and doable for LL if your tooling is good), and 2. that you keep linear time complexity. Any parser other than LL/LR should be rejected because it fails at least one of these, and often both.
Within the LL and LR families there are actually quite a few members. SLR(1) is strong enough to be interesting but too weak for anything I would call a "language". LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1) (though note that modern tooling can do an optimistic fallback, avoiding the massive blowups of ancient LR tools). SLL(1) I'm not personally familiar with. X(k), where X is one of {SLL, LL, SLR, LALR, LR} and k > 1, is not very useful; k=1 suffices. LL(*) however should be avoided due to backtracking, but in some cases consider whether you can parse token trees first (this is currently poorly represented in the literature; you want to be doing some form of this for error recovery anyway - automated error recovery is a useless lie) and/or defer the partial ambiguity until the AST is built (often better for error messages anyway, independent of using token trees).
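For the unfamiliar, a sketch of what "parsing token trees first" can look like (my illustration, not from any particular paper): group tokens by matching brackets in one linear pass, before any grammar-level parsing, so bracket errors are caught up front.

```python
def token_trees(tokens):
    OPEN = {"(": ")", "[": "]", "{": "}"}
    stack = [([], None)]            # (children, expected closing bracket)
    for t in tokens:
        if t in OPEN:
            stack.append(([], OPEN[t]))
        elif t in OPEN.values():
            children, expected = stack.pop()
            if t != expected:
                raise SyntaxError(f"unbalanced {t!r}")
            stack[-1][0].append(children)   # the whole group becomes one node
        else:
            stack[-1][0].append(t)
    if len(stack) != 1:
        raise SyntaxError("unclosed bracket")
    return stack[0][0]

# token_trees(["f", "(", "a", ",", "b", ")"]) -> ["f", ["a", ",", "b"]]
```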
The idea that you're going to hand-roll a parser generator and then use that to generate a parser and the result is going to be less buggy than just hand-rolling a recursive descent parser, screams "I've never written code outside of an academic context".
An LR(1) parser can have many more states in its DFA than LALR(1). That was important back in the 1970s when I was fighting for every byte of RAM, but now it's a total non-issue. I don't know why you would bother with LALR(1) now if you had an LR(1) parser generator.
It's important to note that ambiguities are something which exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed (unless that language is not well-specified, but then all bets are off and it is meaningless to speak of parsing), because they can be eliminated by side-conditions.
For example, one famous ambiguity is the dangling 'else' problem in C. But this isn't an actual ambiguity in the C language: the language has a side-condition which says that 'else' matches to the closest unmatched 'if'. This is completely unambiguous and so a recursive descent parser for C simply doesn't encounter this problem. It is only because parser generators, at least in their most academic form, lack a way to specify this side-condition that their proponents have to come up with a whole theory of "ambiguities". (Shockingly, Wikipedia gets this exactly right in the article on dangling else which I just thought to look up: "The dangling else is a problem in programming of parser generators".)
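A minimal sketch (my illustration, not real C parsing) of how recursive descent encodes that side-condition directly: the optional 'else' is consumed greedily, so it always binds to the nearest unmatched 'if'.

```python
def parse_stmt(toks):
    if toks and toks[0] == "if":
        toks.pop(0)
        cond = toks.pop(0)          # stand-in for a real condition parser
        then = parse_stmt(toks)
        other = None
        if toks and toks[0] == "else":   # the nearest 'if' claims the 'else'
            toks.pop(0)
            other = parse_stmt(toks)
        return ("if", cond, then, other)
    return toks.pop(0)              # stand-in for any other statement

# "if a if b s1 else s2" parses as ("if", "a", ("if", "b", "s1", "s2"), None):
# the 'else' attaches to the inner 'if', with no ambiguity to resolve.
```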
Likewise goes the problem of left-recursion. Opponents of recursive descent always present left-recursion as a gotcha which requires some special handling. Meanwhile actual programmers writing actual recursive descent parsers don't have any idea what these academics are talking about because the language that they're parsing (as it exists in their mind) doesn't feature left-recursion, but instead iteration. Left-recursion is only introduced in service of restricted formal grammars in which recursion is the only available primitive and iteration either doesn't exist or is syntactic sugar for recursion. For the recursive descent user, iteration is a perfectly acceptable primitive. The reason for the discrepancy goes back to side-conditions: iteration requires a side-condition stating how to build the parse tree; parser generators call this "resolving the ambiguity" because they can't express this in their restricted grammar, not because the language was ambiguous.
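A sketch of iteration replacing left recursion (my illustration, not the commenter's code): a left-associative chain like a - b - c is parsed with a loop, and the loop body is exactly the side-condition stating how to build the tree.

```python
def parse_sum(toks):
    left = toks.pop(0)                      # stand-in for a term parser
    while toks and toks[0] in ("+", "-"):   # iteration, not recursion
        op = toks.pop(0)
        right = toks.pop(0)
        left = (op, left, right)            # the side-condition: fold leftward
    return left

# parse_sum(["a", "-", "b", "-", "c"]) -> ("-", ("-", "a", "b"), "c")
```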
In between "easiest to get started with" and "what production-grade systems use", there is "easy to actually finish a medium-sized project with." I think LR parsers still defend that middle ground pretty well.
That was part of my question I think. I wouldn't have been able to tell you that the dominant paradigm being argued against was LR parsers, because I've never come across even one that I'm aware of (I've heard of them, but that's about it). Perhaps it's academia where they're popular?
It seems to be mainly academics and others interested in parsing theory, and those who like complexity for the sake of complexity.
I get really annoyed when people still complain about YACC while ignoring the four decades of practical improvement that Bison has given us if you bother to configure it.
It is also written in a badass style and argues that this is superior to parser generators.
For those to whom they are new: I found them a little tricky to implement directly from Pratt's paper or even Crockford's JavaScript that popularized them.
So, through trial and error I figured out how to actually implement them in ordinary languages (i.e. not in Lisp).
If it helps, examples in C and Go are here:
https://github.com/glycerine/PrattParserInC
https://github.com/glycerine/zygomys/blob/master/zygo/pratt....
I find them easier to work with than the cryptic LALR(1) bison/yacc tools, but then I never really felt like I mastered yacc to begin with.
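For a feel of why they're approachable, here is a minimal Python sketch of the core Pratt loop (illustrative only; the linked repos are the real examples): each binary operator gets a binding power, and the left operand is extended by a loop rather than by recursion.

```python
BP = {"+": 10, "-": 10, "*": 20, "/": 20}   # made-up binding powers

def parse_expr(tokens, min_bp=0):
    left = tokens.pop(0)            # assume an atom (number/identifier)
    while tokens and tokens[0] in BP and BP[tokens[0]] >= min_bp:
        op = tokens.pop(0)
        # +1 makes equal-precedence operators left-associative
        right = parse_expr(tokens, BP[op] + 1)
        left = (op, left, right)
    return left

# parse_expr(["1", "+", "2", "*", "3"]) -> ("+", "1", ("*", "2", "3"))
```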
In practice most recursive descent parsers use if-else liberally. Thus, they effectively work like PEGs where the first match wins (but without the limited backtracking of PEGs). They are deterministic in the sense that the implementation always returns a predictable result. But they are still ambiguous in the sense that this behavior might not have been planned by the language designer, and the ambiguity may not have been resolved how the programmer expected.
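A tiny hypothetical illustration of that first-match-wins behavior: testing "<" before "<=" silently mis-lexes "<=", and nothing warns you about it.

```python
def lex_op_buggy(s):
    if s.startswith("<"):       # matches first, so "<=" can never win
        return "<", s[1:]
    if s.startswith("<="):      # unreachable
        return "<=", s[2:]

def lex_op_fixed(s):
    if s.startswith("<="):      # the longest match must be tried first
        return "<=", s[2:]
    if s.startswith("<"):
        return "<", s[1:]
```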
By that, do you mean parser combinators?
https://pypi.org/project/pybison/ , or its predecessors such as https://pypi.org/project/ply/ ?
But yes, the decidedly non-traditional https://github.com/pyparsing/pyparsing/ is certainly more popular.
Recursive descent parsers can simply be implemented with recursive functions. Implementing semantic checks becomes easy with additional parameters.
What a waste of time. I failed miserably.
However, I also realized that the only semantic information needed was to keep track of typedefs. That made recursive descent practical and effective.
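A minimal sketch of that arrangement (illustrative, not the commenter's code; it assumes well-formed four-token statements for brevity): the classic C ambiguity `T * x;` is a declaration iff T is a typedef name, so the parser threads a set of known typedef names through as a parameter.

```python
def parse_stmt(toks, typedefs):
    """Parse one statement; `typedefs` is the semantic info as a parameter."""
    if toks[0] == "typedef":
        # e.g. typedef int T ;   -- registers T for later statements
        typedefs.add(toks[2])
        return ("typedef", toks[1], toks[2]), toks[4:]
    if toks[1] == "*" and toks[0] in typedefs:
        # T * x ;   -- a declaration, because T names a type
        return ("decl", toks[0], toks[2]), toks[4:]
    if toks[1] == "*":
        # a * b ;   -- otherwise it's a multiplication expression
        return ("mul", toks[0], toks[2]), toks[4:]
    raise SyntaxError(f"unrecognised statement at {toks[0]!r}")

# After "typedef int T ;", the tokens "T * x ;" parse as a declaration;
# without it, "a * b ;" parses as multiplication.
```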
You then construct the parser by combining unambiguous parsers from the bottom up. The result ends up unambiguous by construction.
This high level algorithm is much easier to implement without a global lexer. Global lexing can be a source of inadvertent ambiguity. Strings make this obvious. If instead, you lex in a context specific way, it is usually easy to efficiently eliminate ambiguities.
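A minimal sketch of context-specific lexing, under an assumed toy language: on seeing a quote the lexer switches to string-mode rules, so string contents can never be confused with operators or keywords. (Error handling for unterminated strings is elided.)

```python
def lex(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c == '"':                      # switch lexing context
            j = i + 1
            while src[j] != '"':          # string-mode rules apply here
                j += 1
            tokens.append(("STR", src[i + 1:j]))
            i = j + 1
        elif c.isspace():
            i += 1
        else:                             # default-mode rules
            j = i
            while j < len(src) and not src[j].isspace() and src[j] != '"':
                j += 1
            tokens.append(("TOK", src[i:j]))
            i = j
    return tokens

# lex('say "a b" x') -> [("TOK", "say"), ("STR", "a b"), ("TOK", "x")]
```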
I wish I could have saved the source. It would be fun to see it.
It seems that, from the outside looking in, ~all significant PL projects end up using a hand-written recursive descent parser, eventually.
I am interested in that area, and reading up and learning about it.
For new languages this should be avoided - just design a sane grammar in the first place.
1. Replace any expression that's within parentheses by its parse tree by using recursion
2. Find the lowest precedence operator, breaking ties however you'd like. Call this lowest precedence operator OP.
3. View the whole unparsed expression as `x OP y`
4. Generate a parse tree for x and for y. Call them P(x) and P(y).
5. Return ["OP", P(x), P(y)].
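A direct Python sketch of steps 1-5 (my illustration, with a made-up precedence table and binary operators only); ties in step 2 are broken by taking the rightmost operator, which makes equal-precedence operators left-associative:

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    # Step 1: collapse parenthesised subexpressions into single nodes.
    flat, i = [], 0
    while i < len(tokens):
        if tokens[i] == "(":
            depth, j = 1, i + 1
            while depth:
                depth += {"(": 1, ")": -1}.get(tokens[j], 0)
                j += 1
            flat.append(parse(tokens[i + 1:j - 1]))  # recurse on the inside
            i = j
        else:
            flat.append(tokens[i])
            i += 1
    # Steps 2-3: find the lowest-precedence operator OP.
    best = None
    for k, tok in enumerate(flat):
        if isinstance(tok, str) and tok in PREC:
            if best is None or PREC[tok] <= PREC[flat[best]]:
                best = k
    if best is None:
        return flat[0]   # a lone operand, or an already-built subtree
    # Steps 4-5: recurse on x and y, return ["OP", P(x), P(y)].
    return [flat[best], parse(flat[:best]), parse(flat[best + 1:])]

# parse("( 1 + 2 ) * 3".split()) -> ['*', ['+', '1', '2'], '3']
```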
It's easy to speed up step 2 by keeping a table of all the operators in an expression, sorted by their precedence levels. For this table to work properly, the positions of all the tokens must never change.

A PEG is always unambiguous because it picks the first option - but whether that was the intended parse is not necessarily straightforward. In practice these problems don't usually show up, so they're fine to work with.
The advantage LR gives you is that it produces a parser where there are no ambiguities and every successful parse is the one intended. An LR grammar is a proof, as well as a means of producing a parser. A decent LR parser generator is like a simple proof assistant - it will find problems with your language before you do, so you can fix your syntax before putting it into production.
In "real-world" parsing tasks as you put it, the problems of LR parser generators is that they're not the best suited to parsing languages that have ambiguities, like C, C++ and many others. Some of the complaints about LR are about the workarounds that need to be done to parse these languages, where it's obviously the wrong tool for the job because those languages aren't described by proper LR grammars.
But if you're designing a new language from scratch, surely it's better to not repeat those mistakes? If you carefully design your language to be parsed by an LR grammar then other developers who come to parse your language won't encounter those issues. They won't need lexical tie-ins and other nonsense that complicates the process.