I'd clarify this with "They understand that all values are just bytes".
> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.
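For reference, the usual way to get that "just load 4 bytes" behaviour without relying on a misaligned pointer cast is to go through memcpy. This is only a minimal sketch; the name load_u32 is made up for illustration, and whether it compiles down to a single load depends on the compiler and target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 32 bits from an address of any alignment.
   memcpy has defined behaviour regardless of alignment, and mainstream
   optimising compilers commonly lower a fixed-size copy like this to a
   plain load on targets that tolerate unaligned access. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Calling load_u32(buf + 1) on a byte buffer is fine here, whereas *(uint32_t *)(buf + 1) is the version the standard leaves undefined.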
It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".
A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB. It's not that there isn't any progress (signed two's complement is coming, after all), it's that there is (I believe) much pushback from compiler authors (who dominate the standards committee) who don't want to make UB into IB.
There is no such thing as getting rid of "all UB."
What behavior is the implementation supposed to prescribe for a write to an unpredictable garbage address you read from the network? It could overwrite your code. It could overwrite any value anywhere. It could overlap with anything. Prescribing defined behavior for absolutely everything would require defining a precise, unoptimizable 1-to-1 mapping to assembly code and disallowing any multithreading.
I'd agree to a point. I still think it's unreasonable for compiler writers to get all lawyery about precise terminology. After all "implementation defined" could still be subject to the same lawyeriness (we implemented it, ergo we define it).
To me this is an issue of culture. We need to push back against the view that UB means anything can happen, therefore the compiler can do anything.
That said, I think there are many cases where compilers could make a better effort to link UB they're optimizing against to UB that appears in the code as originally authored and emit a diagnostic or even error out. But at least we've got ubsan and friends so it seems like things are within reason if not optimal.
I am skeptical that NULL-pointer checks being removed contribute anything more than a rounding error in performance gains in any non-trivial program.
A standard way to eliminate those is to invoke undefined behavior if some condition is not met:
  if (a == NULL) {
      __builtin_unreachable();
  }
Which then allows elimination of the null check in later code, possibly after inlining some function.
Well, I think there is a tension here. C is the language for microcontrollers and the language for high performance.
In ye olden days both groups' interests were aligned, because speed in C was about working with the machine. Now that UB has been hijacked for speed, that's no longer true: on the microcontroller I'm working on, I know an int will overflow and I rely on that, but it is UB and so may be optimised out. So I then have to think about what the compiler may do.
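To make the overflow case concrete, here is a rough sketch of two ways to ask for wrapping explicitly instead of relying on signed overflow. The function names are invented for this example. The cast back to int is implementation-defined rather than undefined (GCC documents it as reduction modulo 2^N), so this assumes a compiler that makes that choice. GCC and Clang also accept -fwrapv, which defines signed overflow as wrapping for the whole translation unit.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: add with two's complement wraparound.
   Unsigned arithmetic is defined to wrap modulo 2^N; converting the
   result back to int is implementation-defined, not undefined. */
static int wrapping_add(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}

/* Hypothetical helper: report whether a + b would overflow, without
   ever performing the overflowing addition. */
static bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}
```

The point of the second function is that a check written as a + b < a is itself UB when it matters, so the compiler is allowed to delete it.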
I wouldn't say C is the wrong language. I would say there are wrong compilers though.
Being able to assume certain things don't happen is powerful when you're writing optimisations; not doing that would have a real performance cost.
A few of those are significant performance gains, the majority are not.
Emitting the instruction for a NULL pointer dereference is effectively no more costly than not emitting that instruction.
It's the code removal that's killing me.
Compilers optimise in multiple passes, and removing things earlier can expose optimisation opportunities later that can affect other parts of the code too.
It's undefined so it doesn't have to be zeroed therefore increasing efficiency.
But it's also UB, so if you do know that memory contains something, you can't take advantage of that. Having it be UB is fine. The problem is compilers assuming UB can't happen and optimising the code away.
"Going past the end of the array results in addressing arbitrary values" I can live with. "Going past the end of an array results in anything happening" is a hard sell.
Once you are addressing arbitrary values you are firmly in the realm of "anything happening" in practice, but you've now given up optimization opportunities. As has been repeatedly demonstrated over the years, once memory safety breaks it is practically impossible to make any guarantees about program behavior.
Your compiler emitting a load operation and it failing isn't "anything". The failure being handled by code that the compiler authors can't predict doesn't make it "anything".
And if you lose optimization opportunities because of this, it's because your optimization is broken. By the way, if you lose optimization opportunities because of this, that means the two versions of the code are meaningfully different and you knew it all along.
Documenting that the instructions to access will always be eliminated makes it easier to predict what will happen.
For the former, I kinda get it. It may need to be there for cases like with segmented address space where p+10 could actually be a value less than p, for the eventually generated assembly. Maybe it should be fine to create such a pointer, but have it be "indeterminate value" or whatever, if you try to compare that pointer to anything? I don't know enough about compiler internals to say one way or the other.
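One way to sidestep the p+10 question entirely is to compare sizes rather than pointers, so the out-of-range pointer never gets created. A small sketch, with range_fits being a made-up name:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `count` elements starting at index
   `offset` lie inside an array of `len` elements. Only size_t values
   are compared, so no pointer past the end of the array is formed and
   the subtraction cannot wrap because offset <= len is checked first. */
static bool range_fits(size_t len, size_t offset, size_t count)
{
    return offset <= len && count <= len - offset;
}
```

With this shape, a check such as range_fits(16, 6, 10) replaces something like p + 10 <= end, which is only meaningful if p + 10 was valid to compute in the first place.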
Dereferencing, though, can only be UB. There may not be a "value" behind that address. There may be a motor that's been I/O mapped, or a self-destruct button.
Right now, if a dereference results in UB, the compiler may omit it entirely.
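For the I/O-mapped case, the tool C already gives you is volatile: accesses through a volatile-qualified lvalue count as observable side effects, so the compiler is not supposed to drop or merge them. A minimal sketch, with reg_write and reg_read as invented names; it says nothing about whether a given address is valid on your platform, which stays outside the standard.

```c
#include <stdint.h>

/* Hypothetical helper: store to a device register. Because the lvalue
   is volatile, the store must actually be emitted. */
static void reg_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

/* Hypothetical helper: load from a device register. Each call performs
   a real read; the compiler may not reuse an earlier value. */
static uint32_t reg_read(volatile uint32_t *reg)
{
    return *reg;
}
```

On real hardware the pointer would come from the platform's register map; for trying it out, the address of an ordinary uint32_t variable works as a stand-in.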