Is libgccjit not “a nice library to give access to its internals”?
The MIDI protocol is pretty good for what it is designed for, and you can make it work for actual real networking, but the connections will be clunky, unergonomic, and will be missing useful features that you really want in a networking protocol.
I'd be very interested if the author could provide a post with a more in-depth view of the passes, as suggested!
Yes, please!
It seems that the terminology has evolved, as we now speak more broadly of frontends and backends.
So, I'm wondering: are Bison and Flex (or equivalent tools) still in use by modern compilers? Or are the lexers and parsers built directly into GCC, LLVM, ...?
There was some research on parsing C++ with GLR but I don't think it ever made it into production compilers.
Other, more sane languages with unambiguous grammars may still choose to hand-write their parsers for all the reasons mentioned in the sibling comments. However, I would note that, even when using a parsing library, almost every compiler in existence will use its own AST, and not reuse the parse tree generated by the parser library. That's something you would only ever do in a compiler class.
Also I wouldn't say that frontend/backend is an evolution of previous terminology; it's just that parsing is not considered an "interesting" problem by most of the community, so the focus has moved elsewhere (to everything from AST design through optimization and code generation).
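To make the "own AST" point above concrete: the parse tree a library hands you records every token and grouping, while the tree the rest of the compiler wants is usually a much smaller structure you define yourself. A minimal Rust sketch (names purely illustrative, not from any real compiler):

```rust
// A hand-designed AST for a tiny expression language. A parser library's
// concrete parse tree (every token, parenthesis, and whitespace node) would
// be lowered into this; the rest of the compiler never sees the parse tree.

#[derive(Debug)]
enum Expr {
    Number(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Number(n) => *n,
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Mul(a, b) => eval(a) * eval(b),
    }
}

fn main() {
    // (1 + 2) * 3; note the AST keeps no trace of the parentheses.
    let e = Expr::Mul(
        Box::new(Expr::Add(
            Box::new(Expr::Number(1)),
            Box::new(Expr::Number(2)),
        )),
        Box::new(Expr::Number(3)),
    );
    println!("{}", eval(&e));
}
```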
Personally I love the (Rust) combo of logos for lexing, chumsky for parsing, and ariadne for error reporting. Chumsky has options for error recovery and good performance, and ariadne is gorgeous (there is another Rust alternative, miette; both are good).
The only thing chumsky is lacking is incremental parsing. There is a chumsky-inspired library for incremental parsing called incpa, though.
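For anyone curious what that looks like in practice, here is a minimal chumsky sketch (written against the 0.9-era API, which differs from 1.0; in a real setup logos would supply the token stream and each error would be handed to ariadne or miette for rendering):

```rust
use chumsky::prelude::*;

// Toy parser for comma-separated integers, e.g. "1, 2, 3".
// Sketch only: chumsky 0.9-era API, parsing over chars; with logos you
// would parse over its tokens instead.
fn parser() -> impl Parser<char, Vec<i64>, Error = Simple<char>> {
    let int = text::int(10)
        .map(|s: String| s.parse::<i64>().unwrap())
        .padded();
    int.separated_by(just(',').padded()).then_ignore(end())
}

fn main() {
    match parser().parse("1, 2, 3") {
        Ok(nums) => println!("{nums:?}"),
        Err(errors) => {
            // In a real compiler each error would be fed to ariadne (or
            // miette) to render a labelled, source-highlighted diagnostic.
            for e in errors {
                eprintln!("parse error at {:?}: {}", e.span(), e);
            }
        }
    }
}
```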
In typical modern compilers "frontend" is basically everything involving analyzing the source language and producing a compiler-internal IR, so lexing, parsing, semantic analysis and type checking, etc. And "backend" means everything involving producing machine code from the IR, so optimization and instruction selection.
In the context of Rust, rustc is the frontend (and it is already a very big and complicated Rust program, much more complicated than just a Rust lexer/parser would be), and then LLVM (typically bundled with rustc though some distros package them separately) is the backend (and is another very big and complicated C++ program).
AFAIK the reason is solely error messages: the customization available with handwritten parsers is just way better for the user.
The hard part about compiling Rust is not really parsing; it's the type system, including borrow checking, generics, trait solving (which is Turing-complete in itself), name resolution, and drop checking, and of course all of these features interact in fun and often surprising ways. Also macros. Also all the "magic" types in the StdLib that require special compiler support.
This is why e.g. `rustc` has several different intermediate representations. You no longer have "the" AST: you have token trees, HIR, THIR, and MIR, and then that's lowered to LLVM or Cranelift or libgccjit. Important parts of the type system are handled at each of these stages.
In particular, it makes parsing look like the huge, difficult part of the problem. This is my main problem with the Dragon Book.
In practice everyone uses hacky informal recursive-descent parsers because they're the only way to get good error messages.
Most roll their own for three reasons: performance, context, and error handling. Bison/Menhir et al. make it easy to write a grammar and get started, but in exchange you get less flexibility overall. It becomes difficult to handle context-sensitive parts, do error recovery, and give the user meaningful errors that describe exactly what's wrong. Usually if there's a small syntax error we want to try to tell the user how to fix it instead of just producing "Syntax error", and that requires being able to fix the input and keep parsing.
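To make that concrete, here is a toy hand-rolled recursive-descent parser (illustrative only, not from any real compiler). The point is how natural it is to emit a specific, contextual message and keep parsing instead of stopping at the first "Syntax error":

```rust
// Toy hand-written recursive-descent parser for expressions like "1 + (2 * 3)".
// Illustrative only; the interesting part is the error reporting and recovery.

#[derive(Debug, Clone, PartialEq)]
enum Token { Num(i64), Plus, Star, LParen, RParen }

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
    errors: Vec<String>,
}

impl Parser {
    fn peek(&self) -> Option<&Token> {
        self.tokens.get(self.pos)
    }

    fn bump(&mut self) -> Option<Token> {
        let t = self.tokens.get(self.pos).cloned();
        self.pos += 1;
        t
    }

    // expr := term ('+' term)*
    fn expr(&mut self) -> i64 {
        let mut v = self.term();
        while self.peek() == Some(&Token::Plus) {
            self.bump();
            v += self.term();
        }
        v
    }

    // term := atom ('*' atom)*
    fn term(&mut self) -> i64 {
        let mut v = self.atom();
        while self.peek() == Some(&Token::Star) {
            self.bump();
            v *= self.atom();
        }
        v
    }

    fn atom(&mut self) -> i64 {
        match self.bump() {
            Some(Token::Num(n)) => n,
            Some(Token::LParen) => {
                let v = self.expr();
                if self.peek() == Some(&Token::RParen) {
                    self.bump();
                } else {
                    // Context-aware message plus recovery: report exactly what
                    // is missing, pretend the ')' was there, and keep parsing.
                    self.errors.push(format!(
                        "missing ')' to close the '(' (at token {})", self.pos
                    ));
                }
                v
            }
            other => {
                let found = match other {
                    Some(t) => format!("{t:?}"),
                    None => "end of input".to_string(),
                };
                self.errors.push(format!("expected a number or '(', found {found}"));
                0 // recover with a dummy value so parsing can continue
            }
        }
    }
}

fn main() {
    // "(1 + 2 * 3" with a missing closing paren.
    let tokens = vec![
        Token::LParen, Token::Num(1), Token::Plus,
        Token::Num(2), Token::Star, Token::Num(3),
    ];
    let mut p = Parser { tokens, pos: 0, errors: Vec::new() };
    let value = p.expr();
    for e in &p.errors {
        eprintln!("error: {e}");
    }
    println!("value = {value}");
}
```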
Menhir has a new mode where the parser is driven externally; this allows your code to drive the entire thing, which requires a lot more machinery than fire-and-forget but also affords you more flexibility.
The rest of the f*cking owl is the interesting part.
GCC's approach is deliberate; plus, even if they wanted to change it, who would take on the effort of making the existing C, C++, Objective-C, Objective-C++, Fortran, Modula-2, Algol 68, Ada, D, and Go frontends adopt the new architecture?
Even clang, with all the LLVM modularization, is going to take a couple of years to move from plain LLVM IR to an MLIR dialect for C-based languages: https://github.com/llvm/clangir
The idea is that you should link the front and back ends, to prevent out-of-process GPL runarounds. But because of that, the mingling of the front and back ends ended up winning out over attempts to stay modular.
I am not familiar enough with gcc to know how it impacts out-of-tree free projects or internal development.
The decision was taken a long time ago; it may be worth revisiting.
That said, if Rust is going to continue entrenching itself in widely used open source software, it should at least be able to be compiled by the mainline GPL compiler used by the open source community. Permissive licenses are useful and appreciated in some contexts, but the GPL'd character of the Linux stack's core is worth fighting to hold onto.
It's not Rust in open source I have a problem with; it is Rust being added to existing software that I use, where I don't want it. A piece of open source software written in Rust is, from my perspective, equivalent to proprietary software. I'll use it, but I will always prefer software I can control/edit/hack on for the key portions of my stack.
The language itself I find wonderful, and I suspect that it will get significantly better. Being GPL-hostile, centralized without proper namespacing, and having a Microsoft dependency through Github registration is aggravating. When it all goes bad, all the people silencing everyone complaining about it will play dumb.
If there's anything I would want rewritten in something like Rust, it would be an OS kernel.
Never attribute to malice that which can be adequately explained by apathy. We have, unfortunately, reached a point where most people writing new software default to permissive and don't sufficiently care about copyleft. I wish we hadn't, but we have. This is not unique to Rust.
Ironically, we're better off when existing projects migrate to Rust, because they'll keep their licenses, while rewrites do what most new software does, and default to permissive.
Personally, I'm happy every time I see a new crate using the GPL.
> GPL-hostile
Rust is not GPL-hostile. LLVM was the available tool that spawned a renaissance of new languages; GCC wasn't. The compiler uses a permissive license; I personally wish it were GPL, but it isn't. But there's nothing at all wrong with writing GPLed software in Rust, and people do.
> having a Microsoft dependency through Github registration is aggravating
This one bugs a lot of us, and it is being worked on.
Not sure if it is particularly hostile. There are several GPL crates like Slint.
> Microsoft dependency through Github registration is aggravating
This one is concerning.
Many forget that Microsoft went from "FOSS is bad" to now having their fingers in many key FOSS projects.
They are naturally not the only ones; a developer has got to eat, and big tech gladly pays the bills when it fits their purposes.