Show HN: Carburetta – C/C++ Fused Scanner and Parser Generator (opens in new tab)

(carburetta.com)

3 pointsquincunx5y ago6 comments

6 comments

The distinction between a scanner and a parser is somewhat arbitrary. One could use one and the same formalism for it. The scanner usually deals with things that are considered 'atomic' elements in the language, while grammar is used for 'compound' elements consisting of one or more other elements. If there are seen as one and the same, than it naturally flows that the scanner is called from the parser, and not how it is traditionally done, that the scanner acts as a first pass. This seems a logical approach, but in practices, when scanning is context sensitive, requires the implementation of all kinds of hacks. Also, the treatment of keywords (where it is possible that they are case insensitive) it is better to have a grammar for parsing a keyword 'identifier' and a check whether the result matches the keyword. For pure performance this would not be the best solution, but I understand that Carburetta is not design for that. I have been developing a parser that makes no distinction between scanning and parsing in C, which I called RawParser: https://github.com/FransFaase/RawParser . It also offers more powerful grammar constructs and gives examples on how to implement memory management in a uniform way.

quincunxOP5y ago

Hey, thanks for taking the time. I appreciate the insights.

Carburetta is designed for developer productivity and hence "least surprise"; and so combines the classical distinction of scanner ("lex") and parser ("yacc") but fuses those capabilities together.

This simplifies development as the generated driver code can do more housekeeping and the code in Carburetta becomes more expressive. It is not the goal to improve on the algorithms / state of the art academics of parsers. Instead, it is the goal to improve productivity while using predictable and well known approaches / not reinventing the wheel, however noble that can be. Algorithms here are NFA -> DFA -> scantable for the scanner, and LALR for the parser; these are old, well established algorithms.

That said, have a look at xlalr.c if you'd like to go exotic and disambiguate grammars (multi symbol reductions determined at runtime), but a conscious choice was made to stick with straight LALR for least surprise. Ambiguous grammars, I find, become difficult to think about. xlalr.c is therefore not used for the main parser generation, but the classic lalr.c is favored. You should be able to understand everything that's going on using only base textbook materials (e.g. dragon book.) The xlalr.c algorithm will be phased out eventually, I believe it is currently only used internally (Carburetta, due to the bootstrap problem, is a bit of a chimera, and does not always use itself.)

That said, there is direct support for some of the ambiguities you speak of; e.g. for instance, the infamous "typedef name" ambiguity on identifiers in C can be implemented with <prefix>stack_accepts() and $chg_term(terminal). Have a look in the manual (HTML version is online) for those.

RawParser is cool, but seems to have different goals in mind.

HelloNurse5y ago

Come on, the seventies have ended...

> Note that the above implies there is currently no support for features like:

> UTF-8 (or other unicode) input, input characters are all deemed to be in the range 0 to 255.

quincunxOP5y ago

Thanks for studying this, your emphasis is noted - aside from this particular bullet from the "notable limitations," are there any other short-comings that you feel are deal breakers for you?

HelloNurse5y ago

I don't like the YACC style of code fragments with placeholders, but the system seems well designed and it's likely to be good enough in practice.

But seriously, not being able to parse text is more than enough of a limitation for a text parsing tool. The token specification keeps close to standard regular expression, and matching Unicode text with regular expression (https://unicode.org/reports/tr18/) is a rather well researched problem with good implementations.

1 more reply

j / k navigate · click thread line to collapse

6 comments

fjfaase5y ago

quincunxOP5y ago

Hey, thanks for taking the time. I appreciate the insights.

Carburetta is designed for developer productivity and hence "least surprise"; and so combines the classical distinction of scanner ("lex") and parser ("yacc") but fuses those capabilities together.

RawParser is cool, but seems to have different goals in mind.

HelloNurse5y ago

Come on, the seventies have ended...

> Note that the above implies there is currently no support for features like:

> UTF-8 (or other unicode) input, input characters are all deemed to be in the range 0 to 255.

quincunxOP5y ago

Thanks for studying this, your emphasis is noted - aside from this particular bullet from the "notable limitations," are there any other short-comings that you feel are deal breakers for you?

HelloNurse5y ago

I don't like the YACC style of code fragments with placeholders, but the system seems well designed and it's likely to be good enough in practice.

1 more reply

j / k navigate · click thread line to collapse