Guy Steele said something about a use of Yacc for language design (rather than implementation) that stuck with me:
Be sure that your language will parse. It seems stupid to sit down and start designing constructs and not worry how they will fit together. You can get a language that's difficult if not impossible to parse, not only for a computer, but for a person. I use YACC constantly as a check of all my language designs, but I very seldom use YACC in the implementation. I use it as a tester, to be sure that it's LR(1) ... because if a language is LR(1) it's more likely that a person can deal with it.
From the Dynamic Languages Wizards series (in 2001), in the panel on language design (1:09:05) [1]
I've not yet employed Yacc in this fashion, but it did give me a tool for thinking about object models. A while ago, when I was puzzling over how some classes in an entity-relationship diagram should be related, I considered it from the point of view of how I would design a grammar for serializing an instance of the model into text. That essentially made my decision for me in a principled way, though I never reached the point of writing up a grammar for the whole model; I only considered the implications for the local bit that was troubling me.
[1] https://youtu.be/agw-wlHGi0E?si=n-ann0TYjvZ45ie5&t=4145
edit: added a few clarifying notes
LR(1) isn't easy to deal with: LR(1) shift-reduce parsing allows an arbitrary number of tokens to be stuffed into the parser stack before making a single reduction.
If you want easy to parse, make sure that a left-to-right scan can make a parsing decision by looking at a small number of tokens (often one). A permanent parsing decision: parse it this way or that way, no backtracking to try something else.
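A minimal sketch of that discipline in Python (the token and statement names here are made up for illustration): peek at one token, commit permanently to a branch, and never backtrack.

```python
# Single-token-lookahead dispatch: the decision is permanent once made.
def parse_statement(tokens, pos):
    tok = tokens[pos]                       # exactly one token of lookahead
    if tok == "if":
        return ("if-statement", pos + 1)    # committed: parse as conditional
    elif tok == "while":
        return ("while-statement", pos + 1) # committed: parse as loop
    else:
        return ("expression", pos + 1)      # default branch, still no backtracking

print(parse_statement(["if", "x"], 0))      # ('if-statement', 1)
```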
Warning 1: parsing Unicode streams well is awkward with flex -- it's from an age when ASCII ruled, and handling multiple input encodings may get weird. If the input is only UTF-8, it may work, because that's essentially bytes. I find a hand-written scanner more convenient anyway (the grammar is seldom too complex for that). But regexps based on General_Category or ID_Start etc.? Difficult...
Warning 2: for various reasons (usually flexibility, conflict resolution, error reporting, and/or error recovery), many projects move from bison to something else, even a handwritten recursive descent parser. It's longer, but not that difficult.
- You don't have UTF-8 everywhere in a language. You might have it only in comments and string literals, in which case you can just be 8-bit clean and don't bother.
- If you have UTF-8 in identifiers, you can recognize it via a few easy patterns.
- The tokenizer doesn't have to do full validation on the UTF-8; it can defer that to some routines. I.e. it has to recognize a potentially valid UTF-8 sequence, capture it in yytext[], and then a proper UTF-8 recognizer can be invoked on yytext which will do things like reject overlong forms. You can further sanitize it to reject characters that you don't want in an identifier, like various kinds of Unicode spaces or whatever your rules are.
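The deferred-validation idea above can be sketched in Python (the identifier pattern is illustrative, not any particular language's rules): the scanner loosely captures any run of bytes that might be UTF-8, and a strict decoder pass then rejects malformed sequences such as overlong forms.

```python
import re

# Stage 1 (scanner): loosely capture a candidate identifier -- ASCII letters
# plus any high-bit bytes -- without checking UTF-8 well-formedness.
CANDIDATE = re.compile(rb'[A-Za-z_\x80-\xff][A-Za-z0-9_\x80-\xff]*')

# Stage 2 (deferred validation): the real decoder enforces the full rules,
# rejecting overlong forms, lone surrogates, and truncated sequences.
def validate_identifier(chunk: bytes):
    try:
        return chunk.decode('utf-8')
    except UnicodeDecodeError:
        return None

m = CANDIDATE.match('naïve'.encode('utf-8'))
print(validate_identifier(m.group()))      # 'naïve' -- well-formed
print(validate_identifier(b'a\xc0\xaf'))   # None -- overlong encoding rejected
```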
Multiple encodings though? Hmph. You can write a custom YY_INPUT macro in Lex/Flex to convert any input encoding, normalizing it into UTF-8. Or read through a stream-filtering abstraction which reads various encodings on its input end, pumping out UTF-8.
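That filtering abstraction can be approximated in a few lines of Python (a sketch, not flex's YY_INPUT itself): decode chunks from whatever encoding arrives and re-emit UTF-8 bytes, so the scanner downstream only ever sees one encoding.

```python
import codecs

def utf8_filter(raw_chunks, encoding):
    # An incremental decoder copes with multibyte sequences split across chunks.
    dec = codecs.getincrementaldecoder(encoding)()
    for chunk in raw_chunks:
        yield dec.decode(chunk).encode('utf-8')
    yield dec.decode(b'', final=True).encode('utf-8')

# Latin-1 input comes out as UTF-8:
out = b''.join(utf8_filter(['café'.encode('latin-1')], 'latin-1'))
print(out)   # b'caf\xc3\xa9'
```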
In TXR Lisp, when you do (read "(1 2 3)"), the string is made of wide characters: code points. But the parser takes UTF-8. So a UTF-8 byte stream is created over the wide string:
2> (make-string-byte-input-stream "(1 2 3)")
#<byte-input-stream b7b713b0>
3> (get-byte *2)
40
4> (get-byte *2)
49
5> (get-byte *2)
32
The effect is that Flex and Bison are parsing a UTF-32 string. Based on this principle, we could do other encodings.

Which is sad, because parsing is simpler than the theory-heavy books make it out to be, and understanding optimization is probably more practically important for programmers.
Almost everyone should just hand code a recursive descent parser and then move on with their lives. I've been a part of a few book clubs at work that tried to dig into compiler books, and the same thing kept happening: people get bogged down by the parser theory and give up. One of them even eventually left to work on cryptography for a company doing government contracts. These people could understand complex topics, but apparently parsing theory is a bridge too far for nearly everyone.
it's a whole other ballgame when writing interactive parsers (as one would for IJ or, I think, tree-sitter too), where the document spends the majority of its life in an erroneous state
now, I can appreciate diving into such an 80/20 topic may be too much for a compiler course, but as for really rolling out a compiler or editor plugin, it's the difference between "this is crap" and "wow, that's genuinely helpful"
The benefit of recursive descent parsers is that they're easy to write, modify, and understand. You don't need any new paradigms, just write code like you typically do. If something goes wrong, your standard debugging skills will serve you well.
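To make that concrete, here is a minimal recursive descent evaluator in Python for + and * with the usual precedence (a toy sketch: one method per grammar rule, standard code you can step through in a debugger).

```python
# Grammar: expr := term ('+' term)* ; term := factor ('*' factor)*
#          factor := NUMBER | '(' expr ')'
import re

def tokenize(s):
    return re.findall(r'\d+|[+*()]', s)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0
    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None
    def next(self):
        tok = self.toks[self.pos]; self.pos += 1; return tok
    def expr(self):                      # one method per grammar rule
        v = self.term()
        while self.peek() == '+':
            self.next(); v += self.term()
        return v
    def term(self):
        v = self.factor()
        while self.peek() == '*':
            self.next(); v *= self.factor()
        return v
    def factor(self):
        if self.peek() == '(':
            self.next(); v = self.expr(); self.next()  # consume ')'
            return v
        return int(self.next())

print(Parser(tokenize("2+3*4")).expr())   # 14
```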
There's also a lot of other relatively easy parsing technologies out there. For example, you can also consider monadic parsing, parser combinators, PEG libraries.
I spent a year trying to figure out which parser technique worked best for me, and I'm glad I didn't just stick with my starting point of lex/yacc. So again, if this guide allows parsing to just work for you, then great, stick with it. But if you find yourself encountering a lot of problems, then it might be worth looking around, because other options exist and work just fine.
It hasn't been historically popular because it's expensive. Particularly, when you do several such searches in a nested fashion, the combinations can explode.
That might not matter on today's hardware.
The fix against LR features creeping in is not to write a grammar. Just write the code. If you have some new idea, code it in the recursive descent parser. There is a risk that by doing that, you break something which used to parse (because you hijacked the prefix by which it was recognized). You have to have a test case for that; don't add anything to the parser without tests.
LL(1) or LL(k) parsing relies on identifying the construct by looking at a few prefix tokens of the input. So when you extend the language, you have to think about where in that prefix space you stick the new features.
If you make the language a Lisp, you have a set blueprint for how to avoid LR(1). ANSI Common Lisp defines the actual algorithm by which code is parsed, because it is a user-reprogrammable part of the run-time. The reader follows a read table, which dispatches functions based on what character, or in some cases what pair of characters, is seen. There are subtleties: some characters are identified as token constituents while some are terminating (if they are seen in the middle of a token, that token is delimited, and that character is not part of the token). For instance, the ' character, in the standard read table, is tied to a function which will read the following object X and produce the object (quote X). The character is a terminating character, so that when we have abcd', an abcd symbol token will be recognized and come to an end because of the '. Then the function for ' will be dispatched to read the quote.
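A toy version of that dispatch scheme in Python (a loose sketch, nothing like a conforming Common Lisp reader): a table maps macro characters to reader functions, and ' both terminates the current token and, when dispatched, wraps the next form in (quote X).

```python
# Toy reader with a read table: each macro character maps to a function.
def read_form(s, i=0):
    while i < len(s) and s[i].isspace():
        i += 1
    ch = s[i]
    if ch in MACROS:                      # dispatch on the macro character
        return MACROS[ch](s, i + 1)
    # token constituents accumulate until a terminating char or whitespace
    j = i
    while j < len(s) and not s[j].isspace() and s[j] not in MACROS:
        j += 1
    return s[i:j], j

def read_quote(s, i):
    form, j = read_form(s, i)             # read the following object X...
    return ('quote', form), j             # ...and produce (quote X)

MACROS = {"'": read_quote}

print(read_form("'abc"))    # (('quote', 'abc'), 4)
print(read_form("abcd'"))   # ('abcd', 4) -- the ' delimits the token
```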
https://blog.reverberate.org/2013/07/ll-and-lr-parsing-demys...
I maintained the same attitude for years.
I've changed my mind now. Any feature I want already exists in some programming language. The only distinguishing feature I can offer when designing a new programming language is "very readable", so syntax matters more than I used to think.
Whether you get traction or not depends a lot on syntax - if your syntax is too much of an outlier compared to mainstream languages, your features don't matter.