This is just wrong, though! LLVM provides stack slots for a good reason: stack-allocated variables can escape. You need a place in memory to stick them. If mem2reg can prove that a stack-allocated variable doesn't escape, it gets promoted into an SSA value.
The difference is that LLVM forces you to use them even when you know your locals do not escape (e.g. your source language is Pascal); QBE doesn't.
So, LLVM makes you use stack slots for two independent problems: 1. compiling languages like C where locals can escape, and 2. avoiding constructing SSA form in the frontend. In QBE, you use stack slots (alloc4, alloc8) to solve 1, but to solve 2 you can simply emit non-SSA form and QBE will fix things up for you.
That's exactly what LLVM does, though, except the "non-SSA form" involves loads/stores to alloca'd values. The difference is that in LLVM the language is always in SSA form; it just has some reads/writes to memory that can be pruned, while QBE alternates between being an SSA language and not being one.
LLVM also doesn't "make" you use stack slots. If you wanted to, you could emit fully pruned programs as LLVM IR. Using allocas for variables is a choice that clang makes.
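To make that concrete, here's a minimal LLVM IR sketch (function names are mine, typed-pointer syntax): the first function is the alloca/load/store form a frontend like clang emits, the second is what mem2reg promotes it to, and what a frontend could emit directly:

    ; Frontend-style form: the local lives in a stack slot.
    define i32 @square(i32 %x) {
    entry:
      %x.addr = alloca i32              ; stack slot for the local
      store i32 %x, i32* %x.addr
      %a = load i32, i32* %x.addr
      %b = load i32, i32* %x.addr
      %mul = mul i32 %a, %b
      ret i32 %mul
    }

    ; After mem2reg (or emitted directly): the local is an SSA value.
    define i32 @square_ssa(i32 %x) {
    entry:
      %mul = mul i32 %x, %x
      ret i32 %mul
    }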
> LLVM IL is more cluttered with type annotations and casts.
With LLVM moving to typeless pointers[1], this will be less true in the future than it is now.
[1] https://groups.google.com/forum/#!topic/llvm-dev/vphBegBWyyE
Nothing is stopping you from writing LLVM IR that doesn't use types, though! That's entirely a frontend decision. If you wanted to, in your frontend, you could never emit record or array types and do the bookkeeping on cells in memory entirely on your own using pointer arithmetic and ptrtoint/inttoptr. Doing so is entirely at the expense of the speed of the generated code, though!
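As a hedged illustration (the function name and field layout here are hypothetical), such a frontend might model a two-field record as a raw byte cell and compute field addresses itself:

    ; Hypothetical sketch: a record of two i32 fields modeled as raw bytes.
    ; The frontend does its own layout; field 1 sits at byte offset 4.
    define i32 @get_second_field(i8* %rec) {
    entry:
      %addr  = getelementptr i8, i8* %rec, i64 4   ; manual pointer arithmetic
      %field = bitcast i8* %addr to i32*
      %val   = load i32, i32* %field
      ret i32 %val
    }

The cost is real: the optimizer loses the aliasing and layout information that typed IR would have given it.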
> QBE aims to be a pure C embeddable backend that provides 70% of the performance of advanced compilers in 10% of the code.
This philosophy is very similar to that of vis [1], a vim-like editor that aims to have 80% of vim's functionality in 1% of the code. I hope that more such projects see the light of day; I'm very interested in simpler programs and simpler ecosystems, even if that comes at the cost of features.
In the second version we build with symmetry and regularity.
We need both versions, and in the second we drop some corner cases to make things crystalline. It is important to recognize the things we drop and why; sometimes they are important, but the purity of the construction takes precedence. And sometimes the corner cases are the thing; without them the chandelier would cast no light.
Anyway, I don't know where I am going with this. Small, playful, understandable systems are the most informative. We need playful, useful models that can be used for experimentation in a momentum-free way. And it is important not to suffocate the future with exhaustive engineering. Recognize the is.
It's always a compromise, subject to the design goals of the project or just personal preferences. There's a lot to read on the Internet about the specific compromises chosen for a technology, but contrasting choices between projects or versions of a project are difficult to find.
Also, I think large systems are just as informative as smaller ones but require more investment.
http://c9x.me/compile/doc/llvm.html
Especially the optimizations that get you the first 70%. I've been asking optimization people, as I see them, to do a survey of various methods to find the smallest collection that provides the biggest benefit. This would give people writing new compilers and the formal verification community a head start.
Now, back to correctness. Your project aims to be small, simple, and rigorous. That's the perfect time to use methods for higher-assurance software, or to design yours to work with them easily. So, I suggest trying Design-by-Contract w/ interface checks, some static analysis tools, or even something like Softbound+CETS that makes C safer automatically. I also usually recommend writing compilers in something like Ocaml, as it prevents lots of bugs and is easier to do.
Also, if you do Ocaml or safe imperative, you can always do two implementations side-by-side: one in the safe language and one in portable C. You run the tests, analyses, etc. available for each one. Problems in one might be problems in the other. Regardless, you get the benefits of the safer language by default, with a C implementation if you absolutely need it. Many of us have used this approach, with Sandia Labs even doing it for hardware, pairing ML with the hardware language of a Java processor that came out great.
So, I challenge you to push the envelope by using some of the available techniques and tools to boost the assurance of your code. Preferably getting it off C, too, given all the errors it (esp. pointers) introduces into compilers.
To be honest, C is not well suited to places where there is adversarial input, such as servers, but when it comes to logical correctness of code I do not see amazing benefits from languages like Ocaml.
I'll try to address this, though:
"but when it comes to logical correctness of code I do not see amazing benefits from languages like ocaml."
First, why Ocaml is good for compilers. Most of this is still true, and it's why the Rust team used it:
http://flint.cs.yale.edu/cs421/case-for-ml.html
What I can't overstate, already in that link, is how much easier it is to verify the properties of ML languages. They were originally designed for writing a theorem prover, IIRC. SML had a formal semantics for easier modeling. It allowed a functional programming style that avoids state-management issues while (esp. Ocaml) still letting you get your hands dirty where needed. Modifications for tracking information flow, dependent types, concurrency... all sorts of things were pretty straightforward. It had a certifying compiler early. Relevant to logical correctness, it's easier to map functional specifications to an ML than to a mess like C. This was proven when people doused Leroy et al with praise over the first verified compiler for C, because it really was that hard. Likewise, matching seL4's functional specs to its C kernel code took man-years of effort.
So, languages like C are really hard to get logically correct. Languages like SML/Ocaml make it pretty easy to get done right if you understand the problem; it's rarely the language causing your issues. A hybrid like a modified C or Modula that has ML-like advantages without C's disadvantages brings you closer to that ideal while preserving performance & control.
Notes: He sprinkles references to the method in various papers; Design Principles is one, I think. He says work like the COBOL compiler and RC 4000 was coded in assembler. Yet he and the compiler group also wrote it by hand in ALGOL to better analyze, structure, and extend these systems. So: an ALGOL form for understanding, and hand-written assembler for optimal execution.
Gypsy Verification Environment for high-assurance security - Overview and Example https://www.cs.rice.edu/~dwallach/courses/comp527_s99/gypsy-... ftp://ftp.cs.utexas.edu/pub/AI-Lab/tech-reports/UT-AI-TR-78-11.pdf
Notes: Gypsy was one of the more developed methods for high-assurance security back in the day. Used in the A1-class SCOMP system and more. The usage in A1 was to do two versions of the system in parallel: one in Gypsy for use with provers, info-flow analysis, and so on; one in the actual code that would run on the system. Problems in one could often tell you something about the other.
4GL combining BASIC, LISP, and C http://lost-in-triple-HD-failure :(
Notes: This was mine cuz I hated C and C++ but liked the industrial BASICs I started with. Made a dumb, but effective, translator from one BASIC to C and/or C++ (memory fuzzy there). It let me iterate rapidly & safely, then have portable C + safety checks automatically included when done. Started adding custom commands 4GL-style to eliminate boilerplate. Upon finding LISP, but not really learning it, I ported the tools so I basically coded a 4GL BASIC in LISP w/ per-function compilation, easier parsing, macros for easy 4GL stuff, and extraction to C or C++. Always two forms of the code in place: one for development/analysis, one for production. Crazy RAD pace w/ usually safe, efficient results. I really miss my tool, as I'm too brain-damaged to recreate it now. Nim has many of its attributes, though, so there's hope. :)
Adam Chlipala's Certified Programming book/work and Bedrock http://adam.chlipala.net/
Notes: This method is quite loaded. You can use a theorem prover and dependently-typed language to express your program in a way that catches all types of problems. It usually extracts to ML that you might hand-optimize. Alternatively, reminiscent of Hansen, you can do low-level coding in Coq via Bedrock, which is basically a safer assembly language. Microsoft's Coq ASM and another group's TALC typed ASM do similar stuff.
seL4 verification method http://concrete-semantics.org/
Notes: Someone told me the above book teaches the skills they used. Google AutoCorres for the C-to-HOL extraction part. Anyway, their method is to develop the separation kernel in Haskell and C simultaneously. The Haskell forms the abstract specification that easily feeds into HOL for analysis of basic function and security. The C is obviously the running code. They go further than Gypsy by using AutoCorres to extract HOL from the C, then proving the Haskell and C HOL specs equivalent. They might look for C-level problems too, but I don't recall.
Bootsafe - Develop Forth firmware in Java for safety http://slideshowes.com/doc/437272/efficient-code-certificati...
Notes: One of the military's funded projects to improve security, Open Firmware specifically. Much firmware, including OF, is in the flexible and fast but unsafe language Forth. The Java language is safer, easy to analyze, easy for the experienced to compile, and has no business in firmware w/ its big-ass runtime. :) BootSafe lets security engineers code and analyze firmware in Java, then extract it in a verified way to executable Forth. The online presentations are all partial and crappy, with this being the most detailed I found. It's a finished product from atcorp.com, though. Notice one could do this manually, too, w/ side-by-side development in C, C++, Java, or SPARK. Further evidence: many embedded MCU developers code in C or C++ with mockups of MCU assembler or interfaces, and claim much better results in maintenance & defect reduction.
Another old one of mine - safe or fast? choose two (no website link)
Notes: The idea was that many development or analysis tools should be implemented in as safe and sound a fashion as possible. However, that usually meant sloooow. Development, to maximize flow, needs to give the programmer immediate results if possible. So, for each tool or app, build two versions of it with logical equivalence: safe and fast. The fast one is used for day-to-day development. The safe one is used overnight, or if problems show up in the fast one that resist debugging (e.g. might be in the compiler/lib). Example combos included MLton with FLINT or CakeML compilers, a Racket dialect with PreScheme/VLISP, and CompCert C and/or Softbound+CETS with GCC/LLVM. Batches run overnight to create safe versions of any changes or commits. Batches also re-run the tests from the fast versions to ensure they match the safe versions.
Another of my proposals - C, Java, Ada/SPARK, and ML/Haskell (no website link)
Notes: There are many static, dynamic, and formal analysis tools for C that catch specific types of stuff. Same for Java, including basically all the great analyzers for concurrency errors. The best pushbutton analysis is in SPARK for Ada, which also has design-by-contract for interface checks. There are MLs with dependent types, easy covert-channel analysis, easier formal verification, and so on, with Haskell having similar tools. One of my concepts... for where it matters most... was implementing an algorithm or system in at least two of these simultaneously, with a max of four. In a lowest-common-denominator and consistent way, of course. Then apply at least two of the best tools for each language to the system. Any flaws found in one are checked for in the others. The result, from eyeballing to static analysis to auto-testing, will probably be the most thorough assurance you can get without being a formal verification expert. You can do that too, though. Not sure how practical it is; I think it would be easy with a 4-5 person team of specialists w/ good communication, and would probably yield interesting results.
Can you expand on your point about "LLVM does NOT provide full C compatibility for you"? I have specifically been hacking on calling conventions/ABI stuff recently in LLVM and this "well known" problem is news to me.
That code is not so trivial to do yourself actually.
Furthermore, the output does match the System V ABI: https://ghostbin.com/paste/4a5ms
Notice how the struct is placed on the stack and the pointer to it is placed in %rdi for call/return. If you reduce the number of i32's in the struct to 3, the struct's fields are passed via registers, since the type then occupies fewer than four eightbytes.
My older clang does indeed produce wonky LLVM IR that seems to try to do this classification in the frontend, so ABI compatibility may have been a problem in the past, but I'm not convinced it's a problem in current versions of LLVM.
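For anyone who can't load the paste, a hedged sketch (type and function names are mine, not the actual paste) of the shapes this IR takes under the SysV rules: a small aggregate coerced into integer eightbytes, versus a large one passed and returned through hidden stack copies:

    ; Hypothetical declarations illustrating the classification.
    %small = type { i32, i32, i32 }   ; 12 bytes, fits in eightbytes
    %big   = type { [8 x i32] }       ; 32 bytes, classified MEMORY

    ; Small aggregate: fields coerced into register-sized integers.
    declare void @takes_small(i64, i32)

    ; Large aggregate: passed as a hidden copy on the caller's stack.
    declare void @takes_big(%big* byval align 8)

    ; Large aggregate returned via hidden pointer (what lands in %rdi).
    declare void @returns_big(%big* sret)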
But the major difference is the IL: I use a human-readable and easily-printable text IL. This means that you don't need a graph-viewing tool to read the IL (it's just text) and that you can modify the IL between two passes super easily. This simple IL is a blessing when debugging a compiler.
I think QBE also has better support for the x64 ABI.
Finally, it is much less advanced (fewer optimizations, less tested) than libfirm and supports only x64 as a target.
This is a sketchy comparison.
Which to me reads as QBE being entirely differently scoped, and therefore not an alternative.
Assuming the upstream developers fail to deliver, it's up to us to fix
or work around compiler problems as we encounter them; sometimes it's as
easy as finding out which patch has been committed upstream but not
backported to the version we use; and sometimes it's a genuine issue
which may or may not have been reported in the latest compiler version,
and we are on our own. When this happens, we can only rely upon our
developer skills and intimacy with the compiler.
A few of our developers have, over the years, become unafraid of gcc,
and able to investigate issues, backport fixes, and fix or work around
bugs: I'll only mention niklas@, espie@, etoh@ and otto@, and hope the
few others will forgive me for not listing their names. This has not
been an easy road, to say the least. Now, another few of our developers
are working on building a similar knowledge of llvm. I wish them a lot
of luck, and I will try to join them in the near future.
In the meantime I am not sure they feel confident enough to support
switching the most popular OpenBSD platforms from gcc to llvm.
In a few months or years from now, things will be different...
...but there is something I wish would happen first.
An LTS release of an open source compiler.
Because all compilers nowadays are full of subtle bugs, so many of
them that you can't avoid them as soon as you compile any nontrivial
piece of code, and because we can't afford to go back to assembly, we
need a compiler we can trust.
GCC and LLVM both have Fortune 500 companies backing them, paying
smart developers to work full-time on these projects.
Yet none of them dares to provide a long-term support version. Bugs in
version N are fixed in version N+1, but new bugs are introduced. And
no one cares about trying to settle things down and produce a compiler
one can trust (because version N+1 runs 3.14% faster in the loonystones
benchmark, which doesn't match any real-life use case). Who cares?
Tomorrow's compiler will generate code which will complete an infinite
loop in less than 5 seconds; stay tuned for more accomplishments!
The free software world needs an LTS compiler. The last de facto LTS
compiler we had was gcc 2.7.2.1, and it is too old to compile
modern C and C++ code.
Should a free software LTS compiler appear (be it a gcc fork, or an llvm
fork, or something else), then OpenBSD would consider using it, very
seriously. And we probably wouldn't be the only free software project
doing so.
-- http://marc.info/?l=openbsd-misc&m=137530560232232

(Previous discussion of this post on HN: https://news.ycombinator.com/item?id=9322259)
I like that this project is in-line with these same goals.
How much harder would it be to call C++?
To link against code using things like the STL, your compiler would basically need to understand how to compile C++.
Same problem as macros in C, but macros are usually less pervasive in C code, meaning that often you get a good enough binding by ignoring them or filling in wrapper functions by hand.
(Compiling it with Cython would let you provide a C API and library with only libpython as a dep.)
No, a compiler (hell, any software) written in C is not correct, period. If you want a C compiler that has a slight chance of being correct, you use CompCert.
The small size is an awesome property for education (both for the writer and the reader). For this purpose, it's absolutely great, so kudos for that. :)
And also, we are seeing more and more certified C programs: see the DeepSpec NSF expedition grant, the Verified Software Toolchain, and the CertiKOS project for examples. I work with these guys.
Btw, how's progress coming on those projects? Specifically, are any of the tools (a) useful for non-experts with a little training via tutorials, etc., and (b) available for download in open-source or binary form yet? Thanks ahead of time.