    return ({goto L; 0;}) && ({L: 5;});
It probably has a bug, will be hard to debug, and isn't more performant than writing it in a clearer way. And unfortunately, while the examples here are probably all contrived, there are plenty of real-life cases where code as bad as this gets into production systems. So why are we still writing code like this?
The answer is backward compatibility. Not just of compilers, but of tools and skill sets: people are unwilling to support multiple versions of C and want their code to run forever.
Objective-C and C++ add functionality to C, but they don't remove the C functionality that allows these kinds of problems.
This points to a need for a new language that avoids these issues. I think Rust is the answer, but I would like to see more languages try to fill that gap; competition is healthy.
Programs written in C and C++ may have issues because the languages assume the programmer knows what they are doing. This assumption leads to some great solutions to hard problems because the programmer is essentially free to do what they want.
Of course, this assumption, as with most others, doesn't always hold true. This doesn't mean there is a problem with the language. The problem is with the programmer.
If you're going to write something like "return ({goto L; 0;}) && ({L: 5;});", no language is going to save you.
C and C++ are still used today, in part, because modern languages try to restrict the programmer. Rather than assume the programmer knows what they are doing, they assume the programmer is stupid and needs help to cross the road. The restrictions modern languages put in place by assuming stupidity prohibit certain solutions, and as such C and C++ will remain the go-to systems languages.
We do not need new languages. What we need is programmers who won't abuse the languages we already have.
There’s an assumption here that it would be impossible to design a language which would make these solutions available without being as error-prone. Existing languages may be less capable than C, but that’s only because equally capable languages with less risk haven’t been created (Rust may be a solution; I'm not sure yet).
What exactly do you think can’t be done in a language that is less error-prone?
> We do not need new languages. What we need is programmers who won't abuse the languages we already have.
You’re part of the problem. It takes incredible hubris to say something like this, to think that it’s even possible for a human to do this.
Every nontrivial networking program written in C has security holes caused by memory management issues. If you’re going to claim that these errors are caused by bad programmers, then every C programmer is a bad programmer, because every C programmer has written bugs like this. If you’re claiming that bugs caused by C’s error-prone semantics are programmers abusing the language, then using C is equivalent to abusing C. The very best C programmers write bugs in C that they wouldn’t write in a language like Rust.
A system which depends on humans being perfect is bound to fail. There’s simply no way you can reasonably debate this fact.
Every other engineering field has redundancy: multiple layers of error checking that catch mistakes before they ship.
Until you see this as a problem then you’re a danger to any mission-critical product you work on. Not understanding that using C is a risk displays a shocking level of naiveté for a professional in this field. I’m not saying C is never a good choice. I write a lot of C myself, but I do so with the awareness that my code is not being checked adequately and that I have to take extreme measures to ensure that my code is well-validated.
They already existed back when C was UNIX only, but then UNIX became widespread...
That meant it was relatively simple to 'see' the assembly language 'behind' a given C function or stretch of code; it didn't take much to get inside the head of a C compiler, so you could be reasonably sure that a simple piece of C would result in a similarly simple piece of assembly out the other end.
That, of course, was well and good when it was reasonably simple to predict actual performance from glancing at assembly code, which assumes opcode performance (as opposed to, say, cache performance) dominates how fast the code runs.
Now... how many of those things still hold true on desktop and server class hardware?
So as it stands, C is still your best bet when you are looking for that optimal translation. Intel has recently made some effort to augment it in ways that fully utilize new CPUs' various parallel pipelines and specific functionality; ISPC is one example.
The data still has to be arranged optimally for the hardware in order for SIMD code to have any benefit (and at this point, writing SIMD code is straightforward). You also still need to be experienced with the capabilities of the hardware to have any chance of writing good ISPC code (although this is true of C, as well as of any shading language).
That said, using it to target SSE and AVX with the same code is attractive.
And yet the best practice of the time was not to use C for time-critical applications. If your hypothesis were true, why would, say, all those NES programmers write all that assembly?
It was decidedly not nearly optimal a few years later on microprocessors like the 8080, Z80, and 6502, which were highly register starved, 8-bit rather than 16-bit, with non-orthogonal instructions and registers, etc.
As for "not to use C for time-critical applications", both then and now people sometimes write critical inner loops in assembly; it's just less common now because compilers are much more sophisticated.
But C was indeed used for "time-critical applications", aside perhaps from inner loops, back in the 70s, certainly on PDP-11s, and sometimes on less ideal microprocessors.
> why would, say, all those NES programmers write all that assembly?
Several reasons. First and foremost, consoles like the NES were highly RAM starved even by the standards of the day. The PDP-11/70 had 64k of instruction space and a separate 64k of data per process, with a total amount of system RAM of up to something like a megabyte.
The NES had 2k of RAM onboard -- although cartridges could extend that -- and the register-starved 6502.
Another big reason is that, in every era, games are always pushing the limits of the hardware, and developers were typically quite willing to code in assembly if they believed it would give them a 20% edge in speed or decrease in space.
But there was also a mythos (one that hasn't completely disappeared) that assembly would yield vastly more than a 10%-20% speed increase over the high level languages of the day, including C, so most developers never even considered anything but assembler.
It also was not uncommon at the time for many of those game programmers to know only assembler, and perhaps Basic, but no other language.
The availability of C compilers for various platforms was not so universal then as it is now, especially on non-Unix systems, and the non-Unix C compilers, when available, were not necessarily at the same level of quality as the Unix C compilers.
Last but not least, C had not yet taken the world by storm, and a lot of those developers and companies had never even heard of C, and the ones that had heard of it were pretty dubious, more often than not.
To elaborate, the second expression has an underflow at 1 - sizeof(int) on an unsigned integer (sizeof yields an unsigned size_t, so the usual arithmetic conversions make the subtraction unsigned), which is perfectly well defined:
"if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type."
The right shift is fine on a signed or unsigned integer. For the unsigned case (which is this one due to operator precedence), the behavior is well defined. For signed, implementation defined.
EDIT: The right shift is in fact UB assuming sizeof(int) <= 4.
I know of three different ways in which platforms implement shifts by greater than the word size.
I think it's just pointing out the difference between '&' and '&&'.
They do invoke implementation defined behaviour, but not undefined.
    return x == (1 && x);

Into this:

    movl $0, %eax
    andl $1, %eax
    cmpl %eax, %eax
    # Result in %eax
That will (obviously) return 1 every time, because we compare %eax against %eax. The reason for this is that the value of 'x' changes half-way through the computation (because the computation is done in %eax, which is where 'x' is assumed to be). This is valid because 'x' is uninitialized, so it doesn't have a defined value.

Unless size_t is wider than 32 bits, it has undefined behavior. That's why it returns 0; it could as well be 42, or the program could terminate with or without a diagnostic message, etc.
If something is simple for the compiler-writer, then simple things do yield simple results.
If something is simple for the programmer, simple things often yield quite complex results.
For example, in a language that's simple for the compiler-writer, (1/10) times 10 is only very rarely 1. 0 is a common answer, as is some fraction which is almost, but not completely, unlike 1.
In a language which is simple for the programmer, Heaven, Earth, and minor deities will be moved to make (1/10) times 10 come out to the obvious, simple answer.
And you do realize that one of the simplest languages for compiler writers, Lisp, doesn't have to move heaven and earth to make that calculation work out how you want: it has exact rational arithmetic, so dividing 1 by 10 yields the exact fraction 1/10.
The problems are with the corner cases like the ones he mentions at the beginning of his post.
    Last-Modified: Fri, 29 Oct 2010 16:59:15 GMT
And the changelog for CIL suggests that it may be significantly older:

> When I (George) started to write CIL I thought it was going to take two weeks. Exactly a year has passed since then and I am still fixing bugs in it

Ok, we'll put 2010 on it even though it is likely quite a bit older. An upper bound is better than nothing.
The CIL paper was published in 2002. Actually the whole project looks interesting—arguably more so than the currently posted page. It should have its own HN thread sometime. https://news.ycombinator.com/item?id=836735 was a while ago!
The website for it hasn't been updated, though.