Undefined behavior in C is a reading error (2021) (opens in new tab)

(yodaiken.com)

25 pointsnequo1y ago67 comments

67 comments

TFA is misunderstanding. As he cites from the C standard, “Undefined behavior gives the implementor license not to catch certain program errors that are difficult to diagnose.” Since it’s difficult (and even, in the general case of runtime conditions, impossible at compile time) to diagnose, the implementor (compiler writer) has two choices: (a) assume that the undefined behavior doesn’t occur, and implement optimizations under that assumption, or (b) nevertheless implement a defined behavior for it, which in many cases amounts to a pessimization. Given that competition between compilers is driven by benchmarks, guess which option compiler writers are choosing.

The discussions in comp.lang.c (a Usenet newsgroup, not a mailing list) were educating C programmers that they can’t rely on (b) in portable C, and moreover, can’t make any assumptions about undefined behavior in portable C, because the C specification (the standard) explicitly refrains from imposing any requirements whatsoever on the C implementation in that case.

The additional thing to understand is that compiler writers are not malevolently detecting undefined behavior and then inserting optimizations in that case, but instead that applying optimizations is a process of logical deduction within the compiler, and that it is the lack of assumptions related to undefined behavior being put into the compiler implementation, that is leading to surprising consequences if undefined behavior actually occurs. This is also the reason why undefined behavior can affect code executing prior to the occurrence of the undefined condition, because logical deduction as performed by the compiler is not restricted to the forward direction of control flow (and also because compilers reorder code as a consequence of their analysis).

nothrabannosir1y ago

> This is also the reason why undefined behavior can affect code executing prior to the occurrence of the undefined condition, because logical deduction as performed by the compiler is not restricted to the forward direction of control flow (and also because compilers reorder code as a consequence of their analysis).

According to Martin Uecker, of the C standard comittee, that is not true:

> In C, undefined behavior can not time travel. This was never supported by the wording and we clarified this in C23.

https://news.ycombinator.com/item?id=40790203

layer81y ago

It is really hard to prevent this in an optimizing compiler. I don’t think it’s realistic. For example, loop invariants can be affected by undefined behavior in the loop body, and that in turn can affect the code that is generated for a loop condition at the start of the loop, whose execution precedes the loop body. This is a general consequence in static code analysis. Even more so with whole-program optimization.

AlotOfReading1y ago

It's also completely necessary to have any sort of reasonable language semantics. The goal is to have programmers be able to write code that does what they intend. With the C23 addition, time travelling UB doesn't exist, so programmers can write code that does what they intend up to the point of invoking UB. Good enough.

Let's say that's too difficult for compiler writers, so we bring back time travelling UB. That implies UB on a future execution path means the entire execution path has no semantics. We now have to ensure there is no UB on any future execution path to meet our goal. There are basically 4 options:

1. Rely on programmers to never write UB. This has not worked out historically.

2. Compilers must detect and/or prevent all UB statically. This is obviously impossible.

3. Runtimes must exhaustively detect and/or prevent all UB. This is both infeasible and expensive.

4. Give up on semantics for essentially all nontrivial programs. This is the situation today, but if we're going to make this the official position why should we even have a standard?

ordu1y ago

Maybe I don't understand something, but for me it seems pretty easy. What is needed to be done:

1. Make a list of all UB

2. Define the sensible compiler behavior in each case (for example, let MAX_INT+1 to calculate into MIN_INT on x86_64, just because `add` on x86_64 does that)

3. Treat this as a part of a standard, when compiling the code.

This approach allows to have different compiler behavior on different architectures, which are better suited for the architecture. Maybe on some architectures `add` on signed numbers will generate a CPU exception on overflow, so define this as a way to behave and go with it.

2 more replies

kazinator1y ago

I'm not seeing how this is a change. C99 also said "for which".

C99: "behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements."

Martin Uecker said that something was fixed in the C23 draft, and when asked about it, pointed to the "for which" as pertaining to just that construct and not the entire program.

I'm afraid that this fellow showed himself unreliable in that thread, in matters of interpreting the C standard. In any case, a random forum remark by a committee member, is not the same thing as a committee response to a request for clarification. It has to be backed by citations and precise reasoning, like anyone else's remark.

Suppose we have a sequence of statements S1; S2; S3 where S3 contains the expression i + 1, i being of type int, and nothing in these statements alters the value of i (it is live on entry into S1 and there is no other entry). It is valid for S1 to be translated according to the supposition that i is less than INT_MAX. Because if that is not the case, then S3 invokes undefined behavior, and S3 is unconditionally reachable via S1.

The whole idea that we can have an __notreached expression which does nothing but invoke UB is predicated on time travel (time travel at program analysis time: being able to reason about the program in any direction). Since __notreached invokes UB, the implementation may assume that the control flow does not reach it and behave accordingly. Any statement which serves as a gateway to __notreached itself invokes undefined behavior, and is therefore assumed unreachable and may be deleted. This reasoning propagates backwards, and so the optimizer can simply delete a whole swath of statements.

Backwards reasoning has been essential in the implementation of compilers for decades. Basic algorithms like liveness analysis involve scanning basic blocks of instructions in reverse order! How you know that a variable is dead at a given point (so its register can be reused for another value) is due to having scanned backwards: the next-use information is a peek into the future (what will be the future when that instruction is running).

And, about the quesiton whether undefined behavior can make the whole program undefined, the answer is maybe. If there is no way to execute the program such that an undefined behavior is avoided, then the whole program is undefined. If the situation can be deduced while it is being translated, then the translator can stop with a diagnostic message.

E.g.:

  #include <stdio.h>

  int main() // (void) is now deprecated in draft ISO C
  {
     printf("hello, world\n");
     return 0/0;
  }

This program does not have a visible behavior of printing the hello message. The 0/0 division is undefined, and amounts to a declaration that the printf statement is unreachable. The implementation is free to delete that statement, or to issue a diagnostic and not translate the program.

Uecker is right in that there are limits on this. If a program issues some output (visible effect) and then performs input, from which it obtains a value, and that value is then embroiled in a calculation that causes undefined behavior, that previous visible effect stands. The whole program is not undefined. It's something like: the program's execution becomes undefined at the point where it becomes inevitable that UB shall occur. That could be where the value is prepared that will inevitably cause the erroneous calculation. So, as far back as that point of no return, the implementation could insert a termination, with or without a diagnostic.

Undefined behavior is not a visible behavior; it doesn't have to be ordered with regard to visible behaviors.

SAI_Peregrinus1y ago

> implementor (compiler writer) has two choices: (a) assume that the undefined behavior doesn’t occur, and implement optimizations under that assumption, or (b) nevertheless implement a defined behavior for it, which in many cases amounts to a pessimization.

No, they have two choices: (a) assume that the undefined behavior doesn't occur, and implement any output code generation whatsoever under that assumption, or (b) define a behavior for it, and implement output code generation based on that assumption which in many cases amounts to a pessimization.

Optimization isn't relevant. Assuming it can't happen and then continuing to generate code as though it can't happen is all that matters. You can't make any assumptions, including that disabling optimization will change the output code.

torstenvl1y ago

> the implementor (compiler writer) has two choices: (a) assume that the undefined behavior doesn’t occur, and implement optimizations under that assumption, or (b) nevertheless implement a defined behavior for it, which in many cases amounts to a pessimization.

No. The implementor has three choices: (1) Ignore the situation altogether; (2) behave according to documentation (with or without a warning); or (3) issue an error and stop compilation.

Consider

    for (int i=0; i>=0; i++);

(1) Doesn't attempt to detect UB, it just ignores UB and generates the straightforward translation

          mov 0, %l0
    loop: cmp %l0, 0
          bge loop
            add %l0, 1, %l0 ! Delay slot
          ! Continue with rest of program

(2) May detect that would result in integer overflow and do something it documents (like a trap instruction, or run the whole loop, or elide the whole loop).

(3) Detects that would result in integer overflow and stops compilation with an error message.

An expressio unius interpretation—or simply following the well-worn principle of construing ambiguity against the drafter—would not permit crazy things with UB that many current compilers do.

cokernel_hacker1y ago

What behavior should the following have:

  int f(int x) {
    switch (x) {
    case 0:
      return 31;
    case 1:
      return 28;
    case 2:
      return 30;
    }
  }

This code on its own has no undefined behavior.

In another translation unit, someone calls `f(3)`. What would you have compilers do in that case?

That path through the program has undefined behavior. However, the two translation units are separate and as such normal tooling will not be able to detect any sort of UB without some kind of whole program static analysis or heavy instrumentation which would harm performance.

zzo38computer1y ago

What I would have it to do is: Return a number that is in the range of the "int" type, but there is no guarantee what number it will be, and it will not necessarily be consistent when called more than once, when the program is executed more than once (unless the operating system has features to enforce consistent behaviour), when the program is compiled for and running on a different computer, etc. I would also have the undefined value to be frozen, like the "freeze" command in LLVM. Normally, the effect would be according to the target instruction set, because it would be compiled in the best way for that target instruction set. Depending on the compiler options, it might also display a warning that not all cases are handled, although this warning would be disabled by default. (However, some instruction sets might allow it to be handled differently; e.g. if you have an instruction set with tagged pointers that can be stored in ordinary registers and memory, then there is the possibility that trying to use the return value causes an error condition.)

torstenvl1y ago

I would do what the standard tells me to do, which is to ignore the undefined behavior if I don't detect it.

On most platforms, that would probably result in the return value of 3 (it would still be in AX, EAX, r0, x0, o0/i0, whatever, when execution hits the ret instruction or whatever that ISA/ABI uses to mark the end of the function). But it would be undefined. But that's fine.

[EDIT: I misremembered the x86 calling convention, so my references to AX and EAX are wrong above. Mea culpa.]

What isn't fine is ignoring the end of the function, not emitting a ret instruction, and letting execution fall through to the next label/function, which is what I suspect GCC does.

2 more replies

qiqitori1y ago

This won't compile with reasonable compiler flags. (-Wall and a reasonable set of -Werror settings).

Now, assume that you didn't compile this with those flags; what actually happens is entirely obvious but platform-dependent. Assume amd64 (and many other architectures) where the return value is in the "accumulator register", assume that int is 32 bits. The return value will be whatever was in eax. The called function doesn't set eax (or maybe does in order to implement some unrelated surrounding code). The caller takes eax without knowledge of where it came from.

loeg1y ago

In a new language that isn't C, that function shouldn't compile at all (missing return).

In a C compiler, inserting a trap (x86 ud2, for example) might be reasonable.

Ericson23141y ago

I really don't like reading these dimwitted screeds. We did not get here because layering over a standard or a document --- this is not the US supreme court or similar. We got here because

- There are legit issues trying to define everything without loosing portability. This affects C and anything like it.

- Compiler writes do want to write optimizations regardless of whether this is C or anything else --- witness that GCC / LLVM will use the same optimizations regardless of the input language / compiler frontend

- Almost nobody in this space, neither the cranky programmers against, or the normy compiler writers for, has a good grasp of modern logic and proof theory, which is needed to make this stuff precise.

Ericson23141y ago

*lawyering over a standard or document

kragen1y ago

> I really don't like reading these dimwitted screeds.

this 'dimwitted screed' is by the primary author of rtlinux, which was to my knowledge the first instance of running linux under a hypervisor, and the leader of the small team that ported linux to the powerpc in the 90s. he has also written a highly cited paper on priority inheritance. if you disagree with him, it is probably for some reason other than his dimwittedness

i can't specifically testify to his knowledge of modern proof theory, but his dissertation was on 'a modal arithmetic for reasoning about multilevel systems of finite state machines', and his recent preprints include 'standard automata theory and process algebra' https://arxiv.org/abs/2205.03515 (no citations), 'understanding paxos and other distributed consensus algorithms' https://arxiv.org/abs/2202.06348 (one citation), and 'the meaning of concurrent programs' https://arxiv.org/abs/0810.1316 (draft, no citations), so i wouldn't bet too much against it

i'm interested to hear what you've written on modern logic and proof theory to understand your perspective better

nickelpro1y ago

If you want slow code that behaves exactly the way you expect, turn off optimizations. Congrats, you have the loose assembly wrapper language you always wanted C to be.

For the rest of us, we're going to keep getting every last drop out of performance that we can wring out of the compiler. I do not want my compiler to produce an "obvious" or "reasonable" interpretation of my code, I want it to produce the fastest possible "as if" behavior for what I described within the bounds of the standard.

If I went outside the bounds of the standard, that's my problem, not the compiler's.

krackers1y ago

Just like -ffast-math, that should be an opt-in flag. I'd bet most people want (and even expect) the compiler to do the sane thing, especially if it only costs them 2% performance. The quest for mythical "performance" over correctness is precisely why we've landed into this situation in the first place.

nickelpro1y ago

The compiler does sane things. I also fundamentally disagree with the OP that the situations that invoke UB are surprising.

This largely derives from C developers believing they understand the language by thinking of it as a loose wrapper around assembly, instead of an abstract machine with specific rules and requirements.

Within the bounds of the abstract machine described by the standard, things like signed integer overflow, pointer aliasing, etc, don't make any sense. They are intuitively undefined. If C developers actually read the standard instead of pretending C describes the 8086 they learned in undergrad, they wouldn't be worried about the compiler doing "sane" things with their unsound code because they wouldn't gravitate towards that code in the first place. No one thinks that dereferencing an integer makes sense, no one accidentally writes that code, because it intuitively doesn't work even in misguided internal models.

This doesn't solve problems like buffer overflows of course, which are much more about the logical structure of the program than its language rules. For that style of logical error there's no hope for C in the general case, although static analyzers help.

krackers1y ago

Isn't this tautologically saying the compiler does sane things if you define "sane" as what the compiler allows? As Linus said standards are written on toilet paper, in the real world you have a need to do signed overflow and type punning. In the real-world pretty much every system of note uses 2s complement, and if there was a concern about maintaining compatibility with archaic systems it could be made "implementation defined" behavior instead.

To get something approximating "sane behavior" you have to set a dozen flags to disable various types of questionable optimizations.

1 more reply

RevEng1y ago

The problem is in how compiler writers have interpreted this as license to do some bizarre things. Does it contain UB? Skip that code entirely without saying a word. Regardless of whether or not the code is malformed, this kind of response is hostile to the user.

Filligree1y ago

For the rest of us, actually we will be using different languages that don't pretend UB is what we want. I'd be using C a lot more if I could possibly trust myself to write correct C.

brudgers1y ago

A consensus standard happens by multiple stakeholders sitting down and agreeing on what everyone will do the same way. And agreeing one what they won't all do the same way. The things they agree to doing differently don't become part of the standard.

With compilers, different companies usually do things differently. That was the case with C87. The things they talked about but could not or would not agree to do the same way are listed as undefined behaviors. The things everyone agreed to do the same way are the standard.

The consensus process reflects stakeholder interests. Stakeholders can afford to rewrite some parts of their compilers to comply with the standards and cannot afford to rewrite other parts to comply with the standards because their customers rely on the existing implementation and/or because of core design decisions.

kragen1y ago

the main stakeholders are c programmers and the users of their programs, not c compiler vendors. the stake held by c compiler vendors is quite small by comparison. however, the c standards committee consists entirely of c compiler vendors, as you implicitly acknowledge by referring to 'their compilers' and 'their customers'. this largely happens through the same process through which drug regulations are written by drug companies and energy regulations are written by oil companies: the c compiler vendors have much deeper knowledge of the subject matter; the standard is put into practice by what the c compiler vendors choose to do; and, although the c compiler vendors' interests in the c standard are vastly less significant than those of c programmers and users of c programs, they are also vastly more concentrated

consequently, the consensus process systematically and reproducibly fails to reflect stakeholder interests

bitwize1y ago

Specifically, undefined behavior is when the compiler vendors couldn't agree whether a particular bit of code should legitimately compile to something or be considered erroneous. Ex.: null pointer access. Clearly an error in user-space programs running on a sophisticated operating system, but in kernel or embedded code sometimes you do want to read or write to memory location 0. So the standards committee just shrugged and said "it's undefined". Could be an error, could not be. It depends on your compiler, OS, and environment. Check your local docs for details.

LudwigNagasena1y ago

Behaviour that merely differs based on implementation is either unspecified behaviour or implementation-defined behaviour. That and undefined behaviour are different things in the C++ standard.

bitwize1y ago

I think for implementation-defined behavior the code has to do something sensible, but the standard doesn't specify what; the distinction for undefined behavior is that it could be erroneous (meaning it triggers an exception or just goes completely bonkers) but it could also do something sensible and expected, again depending on environment.

desiderantes1y ago

C++ is a separate language with a separate standard.

layer81y ago

It’s also cases where some compilation targets could, for example, raise an interrupt on signed overflow, and that kind of behavior would be completely out of the scope of the C standard, because it would be highly hardware-specific.

kragen1y ago

i think the issue is more that in embedded code you can't depend on the hardware to detect an access to any given memory location (the standard does permit the bit pattern of a null pointer to be different from all zeroes, which is what you are supposed to do if you want memory location 0 to be referenceable with a pointer)

bitwize1y ago

As I recall the standard also mandates that ((void *)0) is a null pointer, even if it gets converted behind the scenes to some other bit pattern that represents null for that architecture. So it's all a wash.

1 more reply

AlotOfReading1y ago

No compiler I'm aware of implements non-zero null pointers on systems where address 0 is valid (e.g. armv7), so it ends up being kind of a moot point.

1 more reply

dgfitz1y ago

> Stakeholders can afford to rewrite some parts of their compilers to comply with the standards and cannot afford to rewrite other parts to comply with the standards because their customers rely on the existing implementation and/or because of core design decisions.

I was nodding along until here. Wouldn’t one, given the option, always choose, if possible, a compiler that doesn’t differ from the standard? And if that isn’t an option, wouldn’t it be up to said stakeholders to own the inconsistency?

Tough problem to solve for sure.

1 more reply

saghm1y ago

The crux of this argument seems to be that the author interprets the "range of permissible behavior" they cite as specifications on undefined behavior as not allowing the sort of optimizations that potentially render anything else in the program moot. A large part of this argument depends arguing that the earlier section defining the term undefined behavior has an "obvious" interpretation that's been ignored in favor of a differing one. I don't think their interpretation of the definition of undefined behavior is necessarily the strongest argument against the case they're making though; to me, the second section they quote is if anything even more nebulous.

To be overly pedantic (which seems to be the point of this exercise), the section cites a "range" of permissible behavior, not an exhaustive list; it doesn't sound to me like it requires that only those three behaviors are allowed. The potential first behavior it includes is "ignoring the situation completely with unpredictable results", followed by "behaving during translation or program execution in a documented manner characteristic of the environment". I'd argue that the behavior this article complains about is somewhere between "willfully ignoring the situation completely with unpredictable results" to "recognizing the situation with unpredictable results", and it's hard for me read this as being obviously outside the range of permissible behavior. Otherwise, it essentially would mean that it's still totally allowed by the standard to have the exact behavior that the author complains about, but only if it's due to the compiler author being ignorant rather than willful. I think it would be a lot weirder if the intent of the standard was that deviant behavior due to bugs is somehow totally okay but purposely writing the same buggy code is a violation.

torstenvl1y ago

Expressio unius est exclusio alterius.

nlewycky1y ago

This construction is called a "false range" in English.

https://www.chicagomanualofstyle.org/qanda/data/faq/topics/C... https://www.cjr.org/language_corner/out_of_range.php

The wording change from Permissible to Possible and making it non-normative was an attempt to clarify that the list of behaviors that follows is a false range and not an exhaustive list.

It's a submarine change because in the eyes of the committee, this is not a change, merely a clarification of what it already said, to guard against ongoing misinterpretation.

mst1y ago

The practical reality appears to be that compilers use the loose interpretation of UB and that every compiler that works hard to optimise things as much as possible takes advantage of that as much as it can.

I am very much sympathetic to the people who really wish that wasn't the case, and I appreciate the logic of arguments like this one that in theory it shouldn't be the case, but in practice, it is the case, and has been for some years now.

So it goes.

ajross1y ago

I think the problem is sort of a permutation of this argument: way way too much attention is being paid to warning about the dangers and inadequacies of the standard's UB corners, and basically none to a good faith effort to clean up the problem.

I mean, it wouldn't be that hard in a technical sense to bless a C dialect that did things like guarantee 8-bit bytes, signed char, NULL with a value of numerical zero, etc... The overwhelming majority of these areas are just spots where hardware historically varied (plus a few things that were simple mistakes), and modern hardware doesn't have that kind of diversity.

Instead, we're writing, running and trying to understand tools like UBSan, which is IMHO a much, much harder problem.

tialaramex1y ago

The exercise you suggest is futile. You've assumed that all these C programmers are writing software with a clear meaning and we just need to properly translate it so that the meaning is delivered.

There were C programmers like that, most of them now write Rust. They write what they meant, in Rust it just does what they wrote, they're happy.

But a large number - by now a majority of the die-hard C programmers - don't want that. They want to write nonsense and have it magically work. They don't need a new C dialect or a better compiler, or anything like that, they need fairy tale magic.

ajross1y ago

> There were C programmers like that, most of them now write Rust.

Please don't. There's a space in the world for language flames. But the real world is filled with people trying to evolve existing codebases using tools like ubsan, and that's what I'm talking about.

tialaramex1y ago

I don't see how this constitutes a language flame. I spent decades writing C. Twenty years ago I'd have agreed that it was worth trying to "fix" C, today I just write Rust instead.

nlewycky1y ago

I'm a huge proponent of UBSan and ASan. Genuine curiosity, what don't you like about them?

FWIW, there once was a real good-faith effort to clean up the problems, Friendly C by Prof Regehr, https://blog.regehr.org/archives/1180 and https://blog.regehr.org/archives/1287 .

It turns out it's really hard. Let's take an easy-to-understand example, signed integer overflow. C has unsigned types with guaranteed 2's complement rules, and signed types with UB on overflow, which leaves the compiler free to rewrite the expression using the field axioms, if it wants to. "a = b * c / c;" may emit the multiply and divide, or it can eliminate the pair and replace the expression with "a = b;".

Why do we connect interpreting the top bit as a sign bit with whether field axiom based rewriting should be allowed? It would make sense to have a language which splits those two choices apart, but if you do that, either the result isn't backwards compatible with C anyways or it is but doesn't add any safety to old C code even as it permits you to write new safe C code.

Sometimes the best way to rewrite an expression is not what you'd consider "simplified form" from school because of the availability of CPU instructions that don't match simple operations, and also because of register pressure limiting the number of temporaries. There's real world code out there that has UB in simple integer expressions and relies on it being run in the correct environment, either x86-64 CPU or ARM CPU. If you define one specific interpretation for the same expression, you are guaranteed to break somebody's real world "working" code.

I claim without evidence that trying to fix up C's underlying issues is all decisions like this. That leads to UBSan as the next best idea, or at least, something we can do right now. If nothing else it has pedagogical value in teaching what the existing rules are.

ajross1y ago

I'm not sure I understand your example. Existing modern hardware is AFAICT pretty conventionally compatible with multiply/divide operations[1] and will factor equivalently for any common/performance-sensitive situation. A better area to argue about are shifts, where some architectures are missing compatible sign extension and overflow modes; the language would have to pick one, possibly to the detriment of some arch or another.

But... so what? That's fine. Applications sensitive to performance on that level are already worrying about per-platform tuning and always have been. Much better to start from a baseline that works reliably and then tune than to have to write "working" code you then must fight about and eventually roll back due to a ubsan warning.

[1] It's true that when you get to things like multi-word math that there are edge cases that make some conventions easier to optimize on some architectures (e.g. x86's widening multiply, etc...).

SAI_Peregrinus1y ago

The issue is that would not be backwards-compatible with all existing code. People might have to actually fix their programs to work reliably on the hardware they're using. That's almost always considered unacceptable. Also there are still lots of projects using C89, where `gets()` still exists. It got removed in 2011, but if you compile with -std=c89 or -std=c99 it still works, 36 years after the Morris worm should have taught everyone better!

The C standard developers did guarantee 8-bit bytes for C23, so maybe in 50 years that'll be the default C version.

ajross1y ago

> The issue is that would not be backwards-compatible with all existing code

Well, by definition it would, since the only behavior affected is undefined. But sure, in practice. Any change to the language is a danger, and ISO C is very conservative.

I'm just saying that the mental bandwidth of dealing with the old UB mess is now much higher than what it would take to try to fix it. All those cycles of "run ubsan on this old codebase and have endless arguments about interpretation and false positives[1]" could be trivially replaced with "port old codebase to C-unundefined standard", which would put us in a much better position.

[1] Trust me, I've been in these bikesheds and it's not pretty.

Animats1y ago

Well, where have we had trouble in C in the past? Usually, with de-referencing null pointers. The classic is

   char* p = 0;
   char c = *p;
   if (p) {
      ...
   }

Some compilers will observe that de-referencing p implies that P is non-null. Therefore, the test for (p) is unnecessary and can optimized out. The if-clause is then executed unconditionally, leading to trouble.

The program is wrong. On some hardware, you can't de-reference address 0 and the program will abort at "*p". But many machines (i.e. x86) let you de-reference 0 without a trap. This one has caught the Linux kernel devs at least once.

From a compiler point of view, inferring that some pointers are valid is useful as an optimization. C lacks a notation for non-null pointers. In theory, C++ references should never be null, but there are some people who think they're cool and force a null into a reference.

Rust, of course, has

    Option<&Foo>

with unambiguous semantics. This is often implemented with a zero pointer indicating None, but the user doesn't see that.

So, what else? Use after free? In C++, the compiler knows that "delete" should make the memory go away. But that doesn't kill the variable in that scope. It's still possible to reference a gone object. This is common in some old C code, where something is accessed after "free". This is Common Security Weakness #414.[1]

Not a problem in Rust, or any GC language.

Over-optimization in benchmarks can be amusing.

   for (i=0; i<100000000; i++) {}

will be removed by many compilers today. If the loop body is identical every time, it might only be done once. This is usually not a cause of bad program behavior. The program isn't wrong, just pointless.

What else is a legit problem?

[1] https://cwe.mitre.org/data/definitions/416.html

mianos1y ago

I can't see what is 'undefined' here. I would expect the program to read the first byte of memory and test if it is 0 or not. If I was writing this in assembly for an MCU, I would write exactly the same code in the target instructions.

There may be many environments where this would be invalid, but why would the compiler optimise this out based on, say, the operating system, if it is valid code?

tialaramex1y ago

> This is often implemented with a zero pointer indicating None, but the user doesn't see that.

The Guaranteed Niche Optimisation is, as its name suggests, guaranteed by the Rust language. That is, Option<&T> is guaranteed to be the same size as &T. The choice for the niche to be the all-zero bit representation is in some sense arbitrary but I believe it is a written promise too.

jcranmer1y ago

So a while back, I did some spelunking into the history of C99 to actually try to put the one-word-change-theory to bed, but I've never gotten around to writing anything that's public on the internet yet. I guess it's time for me to rectify it.

Tracking down the history of the changes at that time is a bit difficult, because there's clearly multiple drafts that didn't make it into the WG14 document log (this is the days when the document log was literal physical copies being mailed to people), and the drafts in question are also of a status that makes them not publicly available. Nevertheless, by reading N827 (the editors report for one of the drafts), we do find this quote about the changes made:

> Definitions are only allowed to contain actual definitions in the normative text; anything else must be a note or an example. Things that were obviously requirements have been moved elsewhere (generally to Conformance, see above), the examples that used to be at the end of the clause have been distributed to the appropriate definitions, anything else has been made into a note. (Some of the notes appear to be requirements, but I haven't figured out a good place to put them yet.)

In other words, the change seems to be have made purely editorially. The original wording was not intended to be read as imposing requirements, and the change therefore made it a note instead of moving it to conformance. This is probably why "permissible" became "possible": the former is awkward word choice for non-normative text.

Second, the committee had, before this change, discussed the distinctions between implementation-defined, unspecified, and undefined behavior in a way that makes it clear that the anything-goes interpretation is intentional. Specifically, in N732, one committee introduces unspecified behavior as consisting of four properties: 1) multiple possible behaviors; 2) the choice need not be consistent; 3) the choice need not be documented; and 4) the choice needs to not have long-range impacts. Drop the third option, and you get implementation-defined behavior; drop the fourth option, and you get undefined behavior[1]. This results in a change to the definition of unspecified and implementation-defined behavior, while undefined behavior retains the same. Notice how, given a chance to very explicitly repudiate the notion that undefined behavior has spooky-action-at-a-distance, the committee declined to, and it declined to before the supposed critical change in the standard.

Finally, the C committee even by C99 was explicitly endorsing optimizations permitted only by undefined behavior. In N802, a C rationale draft for C99 (that again predates the supposed critical change, which was part of the new draft in N828), there is this quote:

> The bitwise logical operators can be arbitrarily regrouped [converting `(a op b) op c` to `a op (b op c)`], since any regrouping gives the same result as if the expression had not been regrouped. This is also true of integer addition and multiplication in implementations with twos-complement arithmetic and silent wraparound on overflow. Indeed, in any implementation, regroupings which do not introduce overflows behave as if no regrouping had occurred. (Results may also differ in such an implementation if the expression as written results in overflows: in such a case the behavior is undefined, so any regrouping couldn’t be any worse.)

This is the C committee, in 1998, endorsing an optimization relying on the undefined nature of signed integer overflow. If the C committee is doing that way back then, then there is really no grounds one can stand on to claim that it was somehow an unintended interpretation of the standard.

[1] What happens if you want to drop both the third and fourth option is the point of the paper, with the consensus seeming to be "you don't want to do both at the same time."

krackers1y ago

kazinator1y ago

This drivel is posted in a private blog precisely in order to evade expert arguments.

mianos1y ago

It's quite ironic that a blog named 'keeping simple' is summarised by GPT as:

"In summary, the essay employs a hyperbolic tone to argue that the prevailing interpretation of undefined behavior has severely compromised the utility and stability of C. While it raises valid points about the implications of undefined behavior, the dramatic language and sweeping claims might make the situation appear more catastrophic than is universally agreed upon."

I know it's bad form to quote GPT, but I could not say this better.

As someone who writes C and C++ every day of the week I feel I just wasted 30 minutes of my life reading it and the arguments.

j / k navigate · click thread line to collapse

67 comments

layer81y ago

nothrabannosir1y ago

According to Martin Uecker, of the C standard comittee, that is not true:

> In C, undefined behavior can not time travel. This was never supported by the wording and we clarified this in C23.

https://news.ycombinator.com/item?id=40790203

layer81y ago

AlotOfReading1y ago

1. Rely on programmers to never write UB. This has not worked out historically.

2. Compilers must detect and/or prevent all UB statically. This is obviously impossible.

3. Runtimes must exhaustively detect and/or prevent all UB. This is both infeasible and expensive.

4. Give up on semantics for essentially all nontrivial programs. This is the situation today, but if we're going to make this the official position why should we even have a standard?

ordu1y ago

Maybe I don't understand something, but for me it seems pretty easy. What is needed to be done:

1. Make a list of all UB

2. Define the sensible compiler behavior in each case (for example, let MAX_INT+1 to calculate into MIN_INT on x86_64, just because `add` on x86_64 does that)

3. Treat this as a part of a standard, when compiling the code.

2 more replies

kazinator1y ago

I'm not seeing how this is a change. C99 also said "for which".

C99: "behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements."

Martin Uecker said that something was fixed in the C23 draft, and when asked about it, pointed to the "for which" as pertaining to just that construct and not the entire program.

E.g.:

  #include <stdio.h>

  int main() // (void) is now deprecated in draft ISO C
  {
     printf("hello, world\n");
     return 0/0;
  }

Undefined behavior is not a visible behavior; it doesn't have to be ordered with regard to visible behaviors.

SAI_Peregrinus1y ago

torstenvl1y ago

No. The implementor has three choices: (1) Ignore the situation altogether; (2) behave according to documentation (with or without a warning); or (3) issue an error and stop compilation.

Consider

    for (int i=0; i>=0; i++);

(1) Doesn't attempt to detect UB, it just ignores UB and generates the straightforward translation

          mov 0, %l0
    loop: cmp %l0, 0
          bge loop
            add %l0, 1, %l0 ! Delay slot
          ! Continue with rest of program

(2) May detect that would result in integer overflow and do something it documents (like a trap instruction, or run the whole loop, or elide the whole loop).

(3) Detects that would result in integer overflow and stops compilation with an error message.

An expressio unius interpretation—or simply following the well-worn principle of construing ambiguity against the drafter—would not permit crazy things with UB that many current compilers do.

cokernel_hacker1y ago

What behavior should the following have:

  int f(int x) {
    switch (x) {
    case 0:
      return 31;
    case 1:
      return 28;
    case 2:
      return 30;
    }
  }

This code on its own has no undefined behavior.

In another translation unit, someone calls `f(3)`. What would you have compilers do in that case?

zzo38computer1y ago

torstenvl1y ago

I would do what the standard tells me to do, which is to ignore the undefined behavior if I don't detect it.

[EDIT: I misremembered the x86 calling convention, so my references to AX and EAX are wrong above. Mea culpa.]

What isn't fine is ignoring the end of the function, not emitting a ret instruction, and letting execution fall through to the next label/function, which is what I suspect GCC does.

2 more replies

qiqitori1y ago

This won't compile with reasonable compiler flags. (-Wall and a reasonable set of -Werror settings).

loeg1y ago

In a new language that isn't C, that function shouldn't compile at all (missing return).

In a C compiler, inserting a trap (x86 ud2, for example) might be reasonable.

Ericson23141y ago

I really don't like reading these dimwitted screeds. We did not get here because layering over a standard or a document --- this is not the US supreme court or similar. We got here because

- There are legit issues trying to define everything without loosing portability. This affects C and anything like it.

Ericson23141y ago

*lawyering over a standard or document

kragen1y ago

> I really don't like reading these dimwitted screeds.

i'm interested to hear what you've written on modern logic and proof theory to understand your perspective better

nickelpro1y ago

If you want slow code that behaves exactly the way you expect, turn off optimizations. Congrats, you have the loose assembly wrapper language you always wanted C to be.

If I went outside the bounds of the standard, that's my problem, not the compiler's.

krackers1y ago

nickelpro1y ago

The compiler does sane things. I also fundamentally disagree with the OP that the situations that invoke UB are surprising.

krackers1y ago

To get something approximating "sane behavior" you have to set a dozen flags to disable various types of questionable optimizations.

1 more reply

RevEng1y ago

Filligree1y ago

For the rest of us, actually we will be using different languages that don't pretend UB is what we want. I'd be using C a lot more if I could possibly trust myself to write correct C.

brudgers1y ago

kragen1y ago

consequently, the consensus process systematically and reproducibly fails to reflect stakeholder interests

bitwize1y ago

LudwigNagasena1y ago

Behaviour that merely differs based on implementation is either unspecified behaviour or implementation-defined behaviour. That and undefined behaviour are different things in the C++ standard.

bitwize1y ago

desiderantes1y ago

C++ is a separate language with a separate standard.

layer81y ago

kragen1y ago

bitwize1y ago

1 more reply

AlotOfReading1y ago

No compiler I'm aware of implements non-zero null pointers on systems where address 0 is valid (e.g. armv7), so it ends up being kind of a moot point.

1 more reply

dgfitz1y ago

Tough problem to solve for sure.

1 more reply

saghm1y ago

torstenvl1y ago

Expressio unius est exclusio alterius.

nlewycky1y ago

This construction is called a "false range" in English.

https://www.chicagomanualofstyle.org/qanda/data/faq/topics/C... https://www.cjr.org/language_corner/out_of_range.php

The wording change from Permissible to Possible and making it non-normative was an attempt to clarify that the list of behaviors that follows is a false range and not an exhaustive list.

It's a submarine change because in the eyes of the committee, this is not a change, merely a clarification of what it already said, to guard against ongoing misinterpretation.

mst1y ago

So it goes.

ajross1y ago

Instead, we're writing, running and trying to understand tools like UBSan, which is IMHO a much, much harder problem.

tialaramex1y ago

The exercise you suggest is futile. You've assumed that all these C programmers are writing software with a clear meaning and we just need to properly translate it so that the meaning is delivered.

There were C programmers like that, most of them now write Rust. They write what they meant, in Rust it just does what they wrote, they're happy.

ajross1y ago

> There were C programmers like that, most of them now write Rust.

Please don't. There's a space in the world for language flames. But the real world is filled with people trying to evolve existing codebases using tools like ubsan, and that's what I'm talking about.

tialaramex1y ago

I don't see how this constitutes a language flame. I spent decades writing C. Twenty years ago I'd have agreed that it was worth trying to "fix" C, today I just write Rust instead.

nlewycky1y ago

I'm a huge proponent of UBSan and ASan. Genuine curiosity, what don't you like about them?

FWIW, there once was a real good-faith effort to clean up the problems, Friendly C by Prof Regehr, https://blog.regehr.org/archives/1180 and https://blog.regehr.org/archives/1287 .

ajross1y ago

[1] It's true that when you get to things like multi-word math that there are edge cases that make some conventions easier to optimize on some architectures (e.g. x86's widening multiply, etc...).

SAI_Peregrinus1y ago

The C standard developers did guarantee 8-bit bytes for C23, so maybe in 50 years that'll be the default C version.

ajross1y ago

> The issue is that would not be backwards-compatible with all existing code

Well, by definition it would, since the only behavior affected is undefined. But sure, in practice. Any change to the language is a danger, and ISO C is very conservative.

[1] Trust me, I've been in these bikesheds and it's not pretty.

Animats1y ago

Well, where have we had trouble in C in the past? Usually, with de-referencing null pointers. The classic is

   char* p = 0;
   char c = *p;
   if (p) {
      ...
   }

Rust, of course, has

    Option<&Foo>

with unambiguous semantics. This is often implemented with a zero pointer indicating None, but the user doesn't see that.

Not a problem in Rust, or any GC language.

Over-optimization in benchmarks can be amusing.

   for (i=0; i<100000000; i++) {}

What else is a legit problem?

[1] https://cwe.mitre.org/data/definitions/416.html

mianos1y ago

There may be many environments where this would be invalid, but why would the compiler optimise this out based on, say, the operating system, if it is valid code?

tialaramex1y ago

> This is often implemented with a zero pointer indicating None, but the user doesn't see that.

jcranmer1y ago

[1] What happens if you want to drop both the third and fourth option is the point of the paper, with the consensus seeming to be "you don't want to do both at the same time."

krackers1y ago

kazinator1y ago

This drivel is posted in a private blog precisely in order to evade expert arguments.

mianos1y ago

It's quite ironic that a blog named 'keeping simple' is summarised by GPT as:

I know it's bad form to quote GPT, but I could not say this better.

As someone who writes C and C++ every day of the week I feel I just wasted 30 minutes of my life reading it and the arguments.

j / k navigate · click thread line to collapse