Fixing the document is worthwhile, and certainly a reminder that WG21's equivalent effort needs to make the list before it can even begin that process on its even longer document. But practical C programmers don't read the document, and since this UB was a "ghost" they weren't tripped up by it. Removing items from the list this way does not translate to the meaningful safety improvement you might imagine.
There's not a whole lot of movement there towards actually fixing the problem. Maybe it will come later?
I would strongly suspect that C compiler implementers very much do read the document, though. Which, as far as I can see, means "ghosts" could easily become actual UB (and worse, sneaky UB that you wouldn't expect.)
But the original article also complains about the amount of trivial UB.
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p14...
i.e., the UB already existed, but it was not explicit; it had to be inferred from the whole text, and the boundaries were fuzzy. Remember that anything not explicitly defined by the standard is implicitly undefined.
Also remember, just because you can legally construct a pointer it doesn't mean it is safe to dereference.
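A small sketch of that distinction (the one-past-the-end pointer is the classic case):

    #include <stdio.h>

    int main(void)
    {
        int a[4] = {1, 2, 3, 4};
        int *end = a + 4;    /* one past the end: legal to construct and compare against */

        for (int *p = a; p != end; p++)
            printf("%d\n", *p);

        /* printf("%d\n", *end); */   /* constructing 'end' was fine; dereferencing it is UB */
        return 0;
    }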
Including a header that is not in the program, and not in ISO C, is undefined behavior. So is calling a function that is not in ISO C and not in the program. (If the function is not anywhere, the program won't link. But if it is somewhere, then ISO C has nothing to say about its behavior.)
Correct, portable POSIX C programs have undefined behavior in ISO C; only if we interpret them via IEEE 1003 are they defined by that document.
If you invent a new platform with a C compiler, you can have it such that #include <windows.h> reformats all the attached storage devices. ISO C allows this because it doesn't specify what happens if #include <windows.h> successfully resolves to a file and includes its contents. Those contents could be anything, including some compile-time instruction to do harm.
Even if a compiler's documentation doesn't grant that a certain instance of undefined behavior is a documented extension, the existence of a de facto extension can be inferred empirically through numerous experiments: compiling test code and reverse engineering the object code.
Moreover, the source code for a compiler may be available; the behavior of something can be inferred from studying the code. The code could change in the next version. But so could the documentation; documentation can take away a documented extension the same way as a compiler code change can take away a de facto extension.
Speaking of object code: if you follow a programming paradigm of verifying the object code, then undefined behavior becomes moot, to an extent. You don't trust the compiler anyway. If the machine code has the behavior which implements the requirements that your project expects of the source code, then the necessary thing has been somehow obtained.
True, most compilers have sane defaults in many cases for things that are technically undefined (like taking sizeof(void) or doing pointer arithmetic on a void pointer rather than a char pointer). But not all of these cases can be saved by sane defaults.
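For instance, a sketch of those two cases, leaning on the GNU extensions that GCC and Clang ship by default (not ISO C):

    #include <stdio.h>

    int main(void)
    {
        char buf[8];
        void *p = buf;

        /* Neither line below has a meaning in ISO C, but GCC and Clang document
         * them as extensions: sizeof(void) is 1, and arithmetic on void * behaves
         * as if it were char *. */
        printf("%zu\n", sizeof(void));            /* prints 1 under the GNU extension */
        void *q = p + 4;                          /* GNU extension; ISO C wants a char * cast */
        printf("%td\n", (char *)q - (char *)p);   /* prints 4 */
        return 0;
    }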
Undefined behavior means the compiler can replace the code with whatever. So if you e.g. compile optimizing for size, the compiler may rip out the offending code, since replacing it with nothing yields the greatest size reduction.
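The canonical shape of that, as a sketch (whether a particular compiler actually does it depends on version, flags and optimization level):

    int value_at(int *p)
    {
        int v = *p;       /* if p were NULL this dereference would be UB...        */
        if (p == NULL)    /* ...so the compiler may assume p != NULL at this point */
            return -1;    /* and is free to delete this whole branch               */
        return v;
    }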
See also John Regehr's collection of UB-Canaries: https://github.com/regehr/ub-canaries
Snippets of software exhibiting undefined behavior, e.g. executing both the true and the false branch of an if-statement, or neither. UB should not be taken lightly, IMO...
Or replacing all your mp3s with a Rick Roll. Technically legal.
(Some old version of GHC had a hilarious bug where it would delete any source code with a compiler error in it. Something like this would technically be legal for most compiler errors a C compiler could spot.)
The code change might come in something as innocent as a bug fix to the compiler.
I for one am glad that compilers can assume that things that can't happen according to the language do in fact not happen and don't bloat my programs with code to handle them.
What is this supposed to mean? I can't think of any interpretation that makes sense.
I think ISO C defines the executable program to be something like the compiled translation units linked together. But header files do not have to have any particular correspondence to translation units. For example, a header might declare functions whose definitions are spread across multiple translation units, or define things that don't need any definitions in particular translation units (e.g. enum or struct definitions). It could even play macro tricks which means it declares or defines different things each time you include it.
Maybe you mean it's undefined behaviour to include a header file that declares functions that are not defined in any translation unit. I'm not sure even that is true, so long as you don't use those functions. It's definitely not true in C++, where it's only a problem (not sure if it's undefined exactly) if you odr-use a function that has been declared but not defined anywhere. (Examples of odr-use are calling the function or taking its address, but not, for example, using sizeof on an expression that includes it.)
Start with a concrete example: a header that is not in our program, nor described in ISO C. How about:
#include <winkle.h>
Defined behavior or not? How can an implementation respond to this #include while remaining conforming? What are the limits on that response?

> But header files do not have to have any particular correspondence to translation units.
A header inclusion is just a mechanism that brings preprocessor tokens into a translation unit. So, what does the standard tell us about the tokens coming from #include <winkle.h> into whatever translation unit we put it into?
Say we have a single file program and we made that the first line. Without that include, it's a standard-conforming Hello World.
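Concretely, something like this (winkle.h being the hypothetical header from above):

    #include <winkle.h>   /* not part of the program, not described by ISO C */
    #include <stdio.h>

    int main(void)
    {
        printf("Hello, world\n");   /* without the first include, strictly conforming */
        return 0;
    }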
i) "Fil-C is a fanatically compatible memory-safe implementation of C and C++. Lots of software compiles and runs with Fil-C with zero or minimal changes. All memory safety errors are caught as Fil-C panics." "Fil-C only works on Linux/X86_64."
ii) "scpptool is a command line tool to help enforce a memory and data race safe subset of C++. It's designed to work with the SaferCPlusPlus library. It analyzes the specified C++ file(s) and reports places in the code that it cannot verify to be safe. By design, the tool and the library should be able to fully ensure "lifetime", bounds and data race safety." "This tool also has some ability to convert C source files to the memory safe subset of C++ it enforces"
The resulting language doesn't make sense for commercial purposes but there's no reason it couldn't be popular with hobbyists.
Run your test suite and some other workloads under Fil-C for a while, fix any problems it reports, and if it doesn't report any problems after a while, compile the whole thing with GCC afterwards for your release version.
They at least fixed this in C++26. No longer UB, but "erroneous behavior". Still some random garbage value (so an uninitialized pointer will likely still lead to disastrous results), but the compiler isn't allowed to fuck up your code; it has to generate code as if the variable had some value.
In effect, if you don't opt out, your value will always be initialized, just not to a useful value you chose. You can think of this as similar to Rust's std::mem::uninitialized() (which is currently defanged, deprecated, and unsafe).
There were earlier attempts to make this value zero, or rather, as many 0x00 bytes as needed, because on most platforms that's markedly cheaper to do, but unfortunately some C++ would actually have worse bugs if the "forgot to initialize" case was reliably zero instead.
Access to an uninitialized object defined in automatic storage, whose address is not taken, is UB.
Access to any uninitialized object whose bit pattern is a non-value, likewise.
Otherwise, it's good: the value implied by the bit pattern is obtained and computation goes on its merry way.
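A sketch of that distinction (C11 wording; illustrative, not exhaustive):

    int main(void)
    {
        int a;                     /* automatic, address never taken             */
        /* int x = a; */           /* UB outright under the first rule above     */

        int b;
        int *pb = &b;              /* address taken: a read of b now just yields */
        (void)pb;                  /* an indeterminate value, and is only UB if  */
                                   /* the bit pattern is a non-value (trap)      */
                                   /* representation for the type                */
        return 0;
    }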
We have Zig, Hare, Odin, and V too.
Because it never achieved mainstream success?
And Zig for example is very much not memory safe. Which a cursory search for "segfault" in the Bun repo quickly tells you.
https://github.com/oven-sh/bun/issues?q=is%3Aissue%20state%3...
And with this attitude it never will. With Rust's hype, it would.
Ada would be a rather nice choice, but most hackers love their curly brackets.
(The B language was implemented for the PDP-7 before the PDP-11, which are rather different machines. It’s sometimes suggested that the increment and decrement operators in C, which were inherited from B, are due to the instruction set architecture of the PDP-11, but this could not have been the case. Per Dennis Ritchie:¹
> Thompson went a step further by inventing the ++ and -- operators, which increment or decrement; their prefix or postfix position determines whether the alteration occurs before or after noting the value of the operand. They were not in the earliest versions of B, but appeared along the way. People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed. The PDP-7, however, did have a few “auto-increment” memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson; the generalization to make them both prefix and postfix was his own.
Another person puts it this way:²
> It's a myth to suggest C’s design is based on the PDP-11. People often quote, for example, the increment and decrement operators because they have an analogue in the PDP-11 instruction set. This is, however, a coincidence. Those operators were invented before the language [i.e. B] was ported to the PDP-11.
In any case, the PDP-11 usually gets all the love, but I want to make sure the other PDPs get some too!)
Not only are you faced with creating your own wrappers if no one else has done it already.
The tooling, for IDEs and graphical debuggers, assumes either C or C++, so it won't be there for Rust.
Ideally the day will come where those ecosystems might also embrace Rust, but that is still decades away maybe.
IMHO you can today deal with UB just fine in C if you want to by following best practices, and the reasons given when those are not followed would also rule out use of most other safer languages.
C is portable in the least interesting way, namely that compilers exist for all architectures. But that's where it stops.
> IMHO you can today deal with UB just fine in C if you want to by following best practices
In other words, short compilation time has been traded off against wetware brainwashing... well, adjustment time, which makes the supposed advantage much less desirable. It is still an advantage, I reckon, though.
C is a different kind of animal that encourages terseness and economy of expression. When you know what you are doing with C pointers, the compiler just doesn't get in the way.
> When you know what you are doing with C pointers, the compiler just doesn't get in the way.
Alas, it doesn't get in the way of you shooting your own foot off, either.
Rust allows unsafe and other shenanigans, if you want that.
Tell me you use -fno-strict-aliasing without telling me.
Fwiw, I agree with you and we're in good[citation needed] company: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...
C and C++ force you to code in the C and C++ ways. It may be that that's what you want, but they certainly don't let me code how I want to code!
Beyond that, recent C++ versions have much more expressive metaprogramming capability. The ability to do extensive codegen and code verification within C++ at compile-time reduces lines of code and increases safety in a significant way.
Rust's tooling is hands down better than C/C++'s, which makes for a more streamlined and efficient development experience.
The smallest binary rustc has produced is like ~145 bytes.
Well, anything where your people have more experience in the other language, or where the libraries are a lot better.
- it is an automatic variable whose address has not been taken; or
- the uninitialized object's bits are such that it takes on a non-value representation.
And I especially don’t buy that UB is there for register allocation.
First of all, that argument only explains UB of OOB memory accesses at best.
Second, you could define the meaning of OOB by just saying “pointers are integers” and then further state that nonescaping locals don’t get addresses. Many ways you could specify that, if you cared badly enough. My favorite way to do it involves saying that pointers to locals are lazy thunks that create addresses on demand.
Same thing with e.g. strict aliasing or the various UB that exists in the standard library. For instance, it's UB to pass a null pointer to strlen. Of course, you can make that perfectly defined by adding an `if` to strlen that just returns 0. But then you're adding a branch to every strlen, and C is simply not willing to do that for performance reasons, so they say "this is UB" instead.
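A sketch of that hypothetical defined-for-NULL variant (strlen_or_zero is made up; the real strlen has no such branch):

    #include <stddef.h>

    /* The standard instead says passing NULL to strlen is UB, which lets
     * implementations omit this check entirely. */
    size_t strlen_or_zero(const char *s)
    {
        if (s == NULL)
            return 0;          /* the extra branch the standard chose not to mandate */
        size_t n = 0;
        while (s[n] != '\0')
            n++;
        return n;
    }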
Pretty much every instance of UB in standard C or C++ exists because making it defined would either hamper the optimizer or make standard library functions slower. They don't just make things UB for fun.
For example, the reason why mandating 2's complement took so long is that some machine that ran C and used 1's complement still existed.
> The reason is that if you compile with flags that make it defined, you lose a few percentage points of performance (primarily from preventing loop unrolling and auto-vectorization).
I certainly don’t lose any perf on any workload of mine if I set -fwrapv
If your claim is that implementers use optimization as the excuse for wanting UB, then I can agree with that.
I don’t agree that it’s a valid argument though. The performance wins from UB are unconvincing, except maybe on BS benchmarks that C compilers overtune for marketing reasons.
It explains a lot of the loop-unrolling and integer-overflow cases as well.
inlining, interprocedural optimizations.
For example, something as trivial as an accessor member function would be hard to optimize.
This means losing a lot of optimisations, so in fact when you say you "don't buy" this argument you only mean that you don't care about optimisation. Which is fine, but this does mean the "improved" C isn't very useful in a lot of applications, might as well choose Java.
You won’t lose “a lot” of optimizations and you certainly won’t lose enough for it to make a noticeable difference in any workload that isn’t SPEC
The spec even says:
> behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
No motivation is given that I could find, so the actual difference between undefined and implementation defined behaviour seems to be based on whether the behaviour needs to be documented.
Also the C spec has always been a pragmatic afterthought, created and maintained to establish at least a minimal common feature set expected of C compilers.
The really interesting stuff still only exists outside the spec in vendor language extensions.
The two instances where UB allows for optimisation are as follows (a sketch of both follows the list):
1. The 'signed overflow' UB allows for faster array indexing. By ignoring potential overflow, the compiler can generate code that doesn't account for accidental wraparound (which would otherwise require masking the array index and recomputing the address on each loop iteration). I believe the better solution here would be to introduce a specific type for iterating over arrays that will never overflow; size_t would do fine. Signed overflow could then be made at least implementation-defined, if not outright fully defined, after a suitable period during which compilers warn if you use a too-small type for array indexing.
2. The 'aliasing' UB does away with the need to read/write values to/from memory each time they're used, and is extremely important to performance optimisation.
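A rough sketch of both cases; this is about what the compiler is allowed to assume, not a claim about what any particular compiler emits:

    /* 1. Signed-overflow UB: the compiler may assume i never wraps, so the loop
     *    runs exactly n+1 times and can be unrolled or vectorised.  With wrapping
     *    defined (e.g. -fwrapv), the case n == INT_MAX, where the loop never
     *    terminates, must also be preserved. */
    void scale(float *a, int n)
    {
        for (int i = 0; i <= n; i++)
            a[i] *= 2.0f;
    }

    /* 2. Strict-aliasing UB: *f and *i have incompatible types, so the compiler
     *    may assume the store through i does not modify *f and keep *f in a
     *    register instead of reloading it from memory. */
    float read_after_write(float *f, int *i)
    {
        float before = *f;
        *i = 42;
        return before + *f;
    }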
But the rest? Most of it does precisely nothing for performance. At 'best', the compiler uses detected UB to silently eliminate code branches, but that's something to be feared, not celebrated. It isn't an optimisation if it removes vital program logic, because the compiler could 'demonstrate' that it could not possibly take the removed branch, on account of it containing UB.
The claim in the linked article ("what every C programmer should know") that use of uninitialized variables allows for additional optimisation is incorrect. What it does instead is this: if the compiler sees you declare a variable and then read from it before writing to it, it has detected UB, and since the rule is that "the compiler is allowed to assume UB does not occur", it uses that as 'evidence' that the code branch can never be taken and can be eliminated. It does not make things go faster; it makes them go _wrong_.
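A sketch of that pattern (again, whether a given compiler actually folds it this way depends on version and flags):

    int choose(int cond)
    {
        int x;            /* never initialised on the cond == 0 path            */
        if (cond)
            x = 1;
        return x;         /* reading x here is UB when cond == 0, so the        */
                          /* compiler may assume cond != 0 and fold the whole   */
                          /* function down to "return 1"                        */
    }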
Undefined behaviour, ultimately, exists for many reasons: because the standards committee forgot a case, because the underlying platforms differ too wildly, because you cannot predict in advance what the result of a bug may be, to grandfather in broken old compilers, etc. It does not, in any way, shape, or form, exist _in order to_ enable optimisation. It _allows_ it in some cases, but that is not, and never was, the goal.
Moreover, the phrasing of "the compiler is allowed to assume that UB does not occur" was originally only meant to indicate that the compiler was allowed to emit code as if all was well, without introducing additional tests (for example, to see if overflow occurred or if a pointer was valid) - clearly that would be very expensive or downright infeasible. Unfortunately, over time this has enabled a toxic attitude to grow that turns minor bugs into major disasters, all in the name of 'performance'.
The two bullet points towards the end of the article are both true: the compiler SHOULD NOT behave like an adversary, and the compiler DOES NEED license to optimize. The mistake is thinking that UB is a necessary component of such license. If that were true, a language with more UB would automatically be faster than one with less. In reality, C++ and Rust are roughly identical in performance.
The dustbin of programming languages is jam packed with elegant, technically terrific, languages that never went anywhere.
Chill the fuck out.