Guy Steele said something about a use of Yacc for language design (rather than implementation) that stuck with me:
Be sure that your language will parse. It seems stupid to sit down and start designing constructs and not worry how they will fit together. You can get a language that's difficult if not impossible to parse, not only for a computer, but for a person. I use YACC constantly as a check of all my language designs, but I very seldom use YACC in the implementation. I use it as a tester, to be sure that it's LR(1) ... because if a language is LR(1) it's more likely that a person can deal with it.
From the Dynamic Languages Wizards series (in 2001), in the panel on language design (1:09:05) [1]
I've not yet employed Yacc in this fashion, but it did give me a tool for thinking about object models. A while ago, when I was puzzling over how some classes in an entity-relationship diagram should be related, I considered it from the point of view of how I would design a grammar for serializing an instance of the model into text. That essentially made my decision for me in a principled way, though I never reached the point of writing up a grammar for the whole model; I only considered the implications for the local bit that was troubling me.
[1] https://youtu.be/agw-wlHGi0E?si=n-ann0TYjvZ45ie5&t=4145
edit: added a few clarifying notes
LR(1) isn't easy to deal with: LR(1) shift-reduce parsing allows an arbitrary number of tokens to be stuffed into the parser stack before making a single reduction.
If you want easy to parse, make sure that a left-to-right scan can make a parsing decision by looking at a small number of tokens (often one). A permanent parsing decision: parse it this way or that way, no backtracking to try something else.
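A minimal sketch of that discipline in Python (the token and statement names here are made up for illustration): peek at one token, commit permanently to a branch, and never backtrack.

```python
# Single-token-lookahead dispatch: the decision is permanent once made.
def parse_statement(tokens, pos):
    tok = tokens[pos]                       # exactly one token of lookahead
    if tok == "if":
        return ("if-statement", pos + 1)    # committed: parse as conditional
    elif tok == "while":
        return ("while-statement", pos + 1) # committed: parse as loop
    else:
        return ("expression", pos + 1)      # default branch, still no backtracking

print(parse_statement(["if", "x"], 0))      # ('if-statement', 1)
```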
Warning 1: parsing Unicode streams well is awkward with flex -- it's from an age when ASCII ruled, and handling multiple input encodings may get weird. If the input is only UTF-8, it may work, because that's essentially bytes. I find a hand-written scanner more convenient anyway (the grammar is seldom too complex for that). But regexps based on General_Category or ID_Start etc.? Difficult...
Warning 2: for various reasons (usually flexibility, conflict resolution, error reporting, and/or error recovery), many projects move from bison to something else, even a handwritten recursive descent parser. It's longer, but not that difficult.
- You don't have UTF-8 everywhere in a language. You might have it only in comments and string literals, in which case you can just be 8-bit clean and don't bother.
- If you have UTF-8 in identifiers, you can recognize it via a few easy patterns.
- The tokenizer doesn't have to do full validation on the UTF-8; it can defer that to some routines. I.e. it has to recognize a potentially valid UTF-8 sequence, capture it in yytext[], and then a proper UTF-8 recognizer can be invoked on yytext which will do things like reject overlong forms. You can further sanitize it to reject characters that you don't want in an identifier, like various kinds of Unicode spaces or whatever your rules are.
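The deferred-validation idea above can be sketched in Python (the identifier pattern is illustrative, not any particular language's rules): the scanner loosely captures any run of bytes that might be UTF-8, and a strict decoder pass then rejects malformed sequences such as overlong forms.

```python
import re

# Stage 1 (scanner): loosely capture a candidate identifier -- ASCII letters
# plus any high-bit bytes -- without checking UTF-8 well-formedness.
CANDIDATE = re.compile(rb'[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*')

# Stage 2 (deferred validation): the real decoder enforces the full rules,
# rejecting overlong forms, lone surrogates, and truncated sequences.
def validate_identifier(chunk: bytes):
    try:
        return chunk.decode('utf-8')
    except UnicodeDecodeError:
        return None

m = CANDIDATE.match('naïve'.encode('utf-8'))
print(validate_identifier(m.group()))      # 'naïve' -- well-formed
print(validate_identifier(b'a\xc0\xaf'))   # None -- overlong encoding rejected
```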
Multiple encodings though? Hmph. You can write a custom YY_INPUT macro in Lex/Flex to convert any input encoding, normalizing it into UTF-8. Or read through a stream-filtering abstraction which reads various encodings on its input end, pumping out UTF-8.
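That filtering abstraction can be approximated in a few lines of Python (a sketch, not flex's YY_INPUT itself): decode chunks from whatever encoding arrives and re-emit UTF-8 bytes, so the scanner downstream only ever sees one encoding.

```python
import codecs

def utf8_filter(raw_chunks, encoding):
    # An incremental decoder copes with multibyte sequences split across chunks.
    dec = codecs.getincrementaldecoder(encoding)()
    for chunk in raw_chunks:
        yield dec.decode(chunk).encode('utf-8')
    yield dec.decode(b'', final=True).encode('utf-8')

# Latin-1 input comes out as UTF-8:
out = b''.join(utf8_filter(['café'.encode('latin-1')], 'latin-1'))
print(out)   # b'caf\xc3\xa9'
```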
In TXR Lisp, when you do (read "(1 2 3)"), the string is made of wide characters: code points. But the parser takes UTF-8. So a UTF-8 byte stream is created over the wide string:
2> (make-string-byte-input-stream "(1 2 3)")
#<byte-input-stream b7b713b0>
3> (get-byte *2)
40
4> (get-byte *2)
49
5> (get-byte *2)
32
The effect is that Flex and Bison are parsing a UTF-32 string. Based on this principle, we could do other encodings.

Which is sad, because parsing is simpler than the theory-heavy books make it out to be, and understanding optimization is probably more practically important for programmers.
Almost everyone should just hand code a recursive descent parser and then move on with their lives. I've been a part of a few book clubs at work that tried to dig into compiler books, and the same thing kept happening: people get bogged down by the parser theory and give up. One of them even eventually left to work on cryptography for a company doing government contracts. These people could understand complex topics, but apparently parsing theory is a bridge too far for nearly everyone.
it's a whole other ballgame when writing interactive parsers (as one would for IJ or, I think, tree-sitter too), where the document spends the majority of its life in an erroneous state
now, I can appreciate diving into such an 80/20 topic may be too much for a compiler course, but as for really rolling out a compiler or editor plugin, it's the difference between "this is crap" and "wow, that's genuinely helpful"
The benefit of recursive descent parsers is that they're easy to write, modify, and understand. You don't need any new paradigms, just write code like you typically do. If something goes wrong, your standard debugging skills will serve you well.
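To make that concrete, here is a minimal recursive descent evaluator in Python for + and * with the usual precedence (a toy sketch: one method per grammar rule, standard code you can step through in a debugger).

```python
# Grammar: expr := term ('+' term)* ; term := factor ('*' factor)*
#          factor := NUMBER | '(' expr ')'
import re

def tokenize(s):
    return re.findall(r'\d+|[+*()]', s)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0
    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None
    def next(self):
        tok = self.toks[self.pos]; self.pos += 1; return tok
    def expr(self):                      # one method per grammar rule
        v = self.term()
        while self.peek() == '+':
            self.next(); v += self.term()
        return v
    def term(self):
        v = self.factor()
        while self.peek() == '*':
            self.next(); v *= self.factor()
        return v
    def factor(self):
        if self.peek() == '(':
            self.next(); v = self.expr(); self.next()  # consume ')'
            return v
        return int(self.next())

print(Parser(tokenize("2+3*4")).expr())   # 14
```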
There's also a lot of other relatively easy parsing technologies out there. For example, you can also consider monadic parsing, parser combinators, PEG libraries.
I spent a year trying to figure out which parser technique worked best for me, and I'm glad I didn't just stick with my starting point of lex/yacc. So again, if this guide allows parsing to just work for you, then great, stick with it. But if you find yourself encountering a lot of problems, then it might be worth looking around, because other options exist and work just fine.
It hasn't been historically popular because it's expensive. Particularly, when you do several such searches in a nested fashion, the combinations can explode.
That might not matter on today's hardware.
The fix against LR features creeping in is not to write a grammar. Just write the code. If you have some new idea, code it in the recursive descent parser. There is a risk that by doing that, you break something which used to parse (because you hijacked the prefix by which it was recognized). You have to have a test case for that; don't add anything to the parser without tests.
LL(1) or LL(k) parsing relies on identifying the construct by looking at a few prefix tokens of the input. So when you extend the language, you have to think about where in that prefix space you stick the new features.
If you make the language a Lisp, you have a set blueprint for how to avoid LR(1). ANSI Common Lisp defines the actual algorithm by which code is parsed, because it is a user-reprogrammable part of the run-time. The reader follows a read table, which dispatches functions based on what character, or in some cases what pair of characters, is seen. There are subtleties: some characters are identified as token constituents while some are terminating (if they are seen in the middle of a token, that token is delimited, and that character is not part of the token). For instance, the ' character, in the standard read table, is tied to a function which will read the following object X and produce the object (quote X). The character is a terminating character, so that when we have abcd', an abcd symbol token will be recognized and come to an end because of the '. Then the function for ' will be dispatched to read the quote.
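A toy version of that dispatch scheme in Python (a loose sketch, nothing like a conforming Common Lisp reader): a table maps macro characters to reader functions, and ' both terminates the current token and, when dispatched, wraps the next form in (quote X).

```python
# Toy reader with a read table: each macro character maps to a function.
def read_form(s, i=0):
    while i < len(s) and s[i].isspace():
        i += 1
    ch = s[i]
    if ch in MACROS:                      # dispatch on the macro character
        return MACROS[ch](s, i + 1)
    # token constituents accumulate until a terminating char or whitespace
    j = i
    while j < len(s) and not s[j].isspace() and s[j] not in MACROS:
        j += 1
    return s[i:j], j

def read_quote(s, i):
    form, j = read_form(s, i)             # read the following object X...
    return ('quote', form), j             # ...and produce (quote X)

MACROS = {"'": read_quote}

print(read_form("'abc"))    # (('quote', 'abc'), 4)
print(read_form("abcd'"))   # ('abcd', 4) -- the ' delimits the token
```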
https://blog.reverberate.org/2013/07/ll-and-lr-parsing-demys...
I maintained the same attitude for years.
I've changed my mind now. Any feature I want already exists in some programming language. The only distinguishing feature I can offer when designing a new programming language is "very readable", so syntax matters more than I used to think.
Whether you get traction or not depends a lot on syntax - if your syntax is too much of an outlier compared to mainstream languages, your features don't matter.