How to Think About Variables in C

(denniskubes.com)

38 points · denniskubes · 13y ago · 59 comments

voidlogic 13y ago

Extremely uninteresting. It is as if a page of "CS 1XX: Intro to C" fell out of its binding and landed on Hacker News.

This might have been mildly interesting if it had included the assembly for a few different architectures (x86, MIPS, ARM, PowerPC, etc.), showing how the C code is translated for each. It could have been very interesting with an additional discussion of memory barriers and atomic operations in C and their relation to assignments and pointers.

holyjaw 13y ago

Amendment: 'Extremely uninteresting' -TO YOU-.

As someone who has had difficulty picking up real programming languages, and has only found some marginal success due to Obj-C's ARC feature, I can tell you this puts everything I've read into much better perspective.

Try not to be so negative, man, I think it's clear you weren't even the intended target anyways.

minimax 13y ago

HN has a pretty broad audience and a pretty big chunk of it doesn't know $language. These types of beginner posts for $language pop up from time to time. It's nothing to worry about.

voidlogic 13y ago

$language in this case is C, the lingua franca of computing.

It is almost always the first language ported to any system, almost every computer science program covers at least the basics, it has been in 1st or 2nd place on the TIOBE index for over a decade, it's the 5th most popular language on GitHub by commits, and it is over 40 years old.

But I'm willing to accept that there might be people on Hacker News who don't know C; that's why I gave the author suggestions for expanding the content and making it interesting to a wider audience. That was the point of my post.

mturmon 13y ago

Posts on elementary topics (should be) noteworthy only if mastery is exhibited. Hence, griping.

ultimoo 13y ago

I agree. I liked the opening line though: "C is memory with syntactic sugar." It is a good introductory article for someone who has never used C -- CS-1xx Intro as you said.

greenyoda 13y ago

"Syntactic sugar" generally means a syntax that's just a nicer-looking version of something that can be equivalently expressed in a more fundamental syntax. But C is more than that: it provides a way of abstracting away the details of the machine so that you don't have to explicitly deal with the fact that your machine has 64-bit pointers and 2's complement integer arithmetic and IEEE floating point and an instruction set that handles shift operations in a particular way.

So a better formulation might be: "C provides an abstraction layer on top of a computer's memory model and instruction set that will allow your code to be portable between different machine architectures, but only if you play strictly by the rules."

By the way, the classic K&R book explains the fundamentals of C pretty well. If you really want to understand C, I'd recommend reading it cover to cover (it's pretty short).

denniskubes OP 13y ago

I was trying to describe a simple mental model that has been helpful to me. While I agree assembly details would have been interesting, putting them in would have lost more than half the audience.

nemetroid 13y ago

> putting that in would have lost more than half the audience.

I surely hope not.

blt 13y ago

The least they could have done is explain how structs work.

haberman 13y ago

There are some subtle problems with the model as explained in this article. If you use this as your mental model, you will probably run afoul of undefined behavior without realizing it.

If you read the C standard, you'll notice it doesn't talk much about "memory" (the word only appears 13 times in C99); it mostly talks about "objects" (mentioned 735 times in C99). These objects aren't OO-objects -- obviously C doesn't have OOP built in -- but rather all the basic types like int, float, struct, etc are objects. When you declare a variable like "int x", you are creating an object.

C's aliasing rules dictate that you can only access an object through an lvalue of that object's actual type (or a compatible type, or a character type). This is why it is dangerous to think of the assignment operator as a simple memory-copying operation. If assignment were a simple memcpy, you could do something like this:

  int x = 5;
  // BAD: undefined behavior, violates aliasing.
  short y = *(short*)&x;

If a variable were just a memory address and assignment were just a memory copy, this would be a valid operation. But the right way to think of it is that a variable is a storage object whose address can be taken, and a dereference is an operation that reads a storage object.

A pointer isn't a generic memory-reading facility; it must actually point to a valid storage object of the pointer's type (or to NULL).

If you do want to read and write arbitrary objects in memory, you can always use memcpy():

  int x = 5;
  short y;
  // This is fine, and smart C compilers optimize away the
  // function call.
  memcpy(&y, &x, sizeof(y));

sillysaurus 13y ago

> If a variable were just a memory address and assignment were just a memory copy, this would be a valid operation.

It's a valid operation regardless of whether a standards body says it's not.

  uint32 x = 5;
  uint16 y = *(uint16*)&x;

The effect is to set y to the first two bytes of memory from x. Values assigned to x are serialized into memory in either big-endian or little-endian order. Those are the only two cases you have to account for. The Quake 3 engine has a macro for the above operation which produces the same value of y on all platforms. This is useful for serializing x to disk, then loading it later (and possibly on a different architecture).

One source of confusion is that int and short are essentially, for all intents and purposes, undefined -- they are of course defined by the standards, but their implementation is allowed to vary so much that no programmer can make any assumptions about their size (in bytes) at runtime.

int8, int16, int32, int64 are all explicit and force the compiler (and the hardware) to obey the wishes of the programmer. This is, I think, the right approach. People make much ado about the fact that "a byte isn't necessarily 8 bits" and "the only assumption you can make about a short is that it's smaller than an int, and larger than a char", etc, which is probably unnecessary mental effort.

"Bytes are 8 bits. Here are four bytes. Here's the value that the four bytes store. Copy two of the four bytes to this other spot (adjusting for endianness appropriately via a macro)."

You typically don't want a memcpy in situations like this due to endianness.
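For readers following the endianness point, here is a minimal sketch of byte-order-independent serialization using only shifts and masks. It is not the Quake 3 macro mentioned above; the helper names `store_le32`, `load_le32`, and `low16` are hypothetical, and the sketch assumes the exact-width `<stdint.h>` types are available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write v as four bytes in little-endian order.
 * Shifts operate on the value, not on memory, so the bytes produced are
 * the same on big-endian and little-endian hosts. */
static void store_le32(uint8_t *out, uint32_t v)
{
    assert(out != NULL);
    out[0] = (uint8_t)(v & 0xFFu);
    out[1] = (uint8_t)((v >> 8) & 0xFFu);
    out[2] = (uint8_t)((v >> 16) & 0xFFu);
    out[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Hypothetical helper: rebuild the value from four little-endian bytes. */
static uint32_t load_le32(const uint8_t *in)
{
    assert(in != NULL);
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}

/* The low 16 bits of x, independent of how x is laid out in memory. */
static uint16_t low16(uint32_t x)
{
    return (uint16_t)(x & 0xFFFFu);
}
```

Calling `store_le32(buf, x)` before writing `buf` to disk and `load_le32(buf)` after reading it back gives the same value on any host, with no pointer casts involved.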

The reason it's useful to explicitly "break the rules" like this is because it's important to know what assumptions you in fact can rely on, regardless of what standards bodies have to say about it. Because at that point you can do incredible things such as http://www.codercorner.com/RadixSortRevisited.htm

   inline float fabs(float x){
        // C++: reinterpret the float's bits as an integer and clear the sign bit
        unsigned int bits = (unsigned int&)x & 0x7fffffff;
        return (float&)bits;
   }

The reason this is incredible and awesome (rather than horrible and dangerous) is because it enabled game developers to achieve a more impressive product for end users, because they were able to do more with the CPU resources that were available at the time.

It's of course not so relevant nowadays, since it's reasonable to assume that most gamers have at least a core 2 duo. But it's one of those things that isn't relevant until suddenly it is -- you're in some situation that requires sorting millions of floats, and your dataset simply demands more performance than your compiler typically gives you. Then suddenly you find you can do amazing things like this, and surprise people with how effectively you can use a modern CPU.

(Although, the modern antidote to "I need to sort millions of floats quickly" is to use SSE, not to sort floats as integers. Yet that's even more evidence that it's better to understand the capabilities of the hardware.)
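The sign-clearing trick above can also be written without any pointer or reference cast. This is a minimal sketch in C, assuming `float` is a 32-bit IEEE-754 single-precision value; the name `abs_via_bits` is hypothetical and chosen to avoid clashing with the standard `fabs`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption made explicit: the sketch only makes sense for a 32-bit float. */
static_assert(sizeof(float) == sizeof(uint32_t),
              "sketch assumes a 32-bit float");

/* Hypothetical helper: absolute value by clearing the IEEE-754 sign bit. */
static float abs_via_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* copy the object representation out */
    bits &= 0x7FFFFFFFu;             /* clear bit 31, the sign bit */
    memcpy(&x, &bits, sizeof x);     /* copy the modified bits back */
    return x;
}
```

Mainstream optimizing compilers typically turn the two fixed-size `memcpy` calls into plain register moves, so the well-defined version usually costs nothing extra.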

haberman 13y ago

> It's a valid operation regardless of whether a standards body says it's not.

Whoa there, cowboy. You may not feel personally beholden to standards bodies, but compiler vendors are following their lead. The major compilers are getting more and more aggressive about optimizing away undefined behavior every year.

> The effect is to set y to the first two bytes of memory from x.

No, it's really not. It's undefined behavior and the compiler is free to do absolutely whatever it wants.

> One source of confusion is that int and short are essentially, for all intents and purposes, undefined -- they are of course defined by the standards, but their implementation is allowed to vary so much that no programmer can make any assumptions about their size (in bytes) at runtime.

I agree with this, and have made this argument before: http://blog.reverberate.org/2013/03/cc-gripe-1-integer-types...

But this is an entirely separate issue.

brigade 13y ago

> The reason it's useful to explicitly "break the rules" like this is because it's important to know what assumptions you can in fact rely on, regardless of what standards bodies have to say about it.

Given that compilers do break when programmers violate aliasing rules, you should recheck what assumptions you think you can rely on. Non-strict aliasing is not one of them. Unless you want to slow everything down with compiler-specific flags like -fno-strict-aliasing.

    uint8_t foo[4]; *(uint32_t*)foo = 0;

Besides even without strict aliasing, the above is not at all guaranteed to work since not all architectures support unaligned loads. (and if you think "well but no one uses them, just like no one uses 1's complement architectures anymore", keep in mind that this includes ARM)

(also use stdint types already)
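A minimal sketch of how the store above can be done without relying on alignment or on aliasing, using `memcpy` and the `<stdint.h>` types. The helper names `store_u32` and `load_u32` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value into a byte buffer at any
 * alignment. memcpy copies bytes, so it has no alignment requirement and
 * does not violate the aliasing rules. The bytes land in host byte order. */
static void store_u32(uint8_t *dst, uint32_t value)
{
    assert(dst != NULL);
    memcpy(dst, &value, sizeof value);
}

/* Hypothetical helper: read the value back from a possibly unaligned buffer. */
static uint32_t load_u32(const uint8_t *src)
{
    uint32_t value;
    assert(src != NULL);
    memcpy(&value, src, sizeof value);
    return value;
}
```

With these, `store_u32(foo, 0)` replaces `*(uint32_t*)foo = 0`, and it also works when the destination is `foo + 1` or any other odd offset.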

sillysaurus 13y ago

  uint8_t foo[4]; *(uint32_t*)foo = 0;

> Besides even without strict aliasing, the above is not at all guaranteed to work since not all architectures support unaligned loads.

So, the interesting thing about this example is that it does work. It's in fact very, very difficult to find a platform where that example won't work (i.e. crashes the program). For example, any C library involving image manipulation is likely going to have code similar to what you've described, and those libraries work on almost every platform.

Standards are a good and useful thing. All I'm saying is that it's important to know which rules you can safely violate.

1500100900 13y ago

>int8, int16, int32, int64 are all explicit and force the compiler (and the hardware) to obey the wishes of the programmer.

At least in C99, the compiler doesn't need to support exact-width integer types.

>People make much ado about the fact that "a byte isn't necessarily 8 bits"

Well, POSIX.1-2004 requires that CHAR_BIT == 8.
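Both points can be checked at compile time rather than assumed. A minimal sketch, assuming a C11 compiler for `static_assert`; the names `sample_t`, `SAMPLE_IS_EXACT`, and `sample_bits` are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Fail the build early instead of silently assuming 8-bit bytes. */
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

/* int32_t is optional; <stdint.h> defines INT32_MAX only when the type
 * exists. int_least32_t is required, so it serves as the fallback. */
#ifdef INT32_MAX
typedef int32_t sample_t;
#define SAMPLE_IS_EXACT 1
#else
typedef int_least32_t sample_t;
#define SAMPLE_IS_EXACT 0
#endif

/* Storage size of sample_t in bits: exactly 32 when the exact-width type
 * exists, at least 32 otherwise. */
static unsigned sample_bits(void)
{
    return (unsigned)(sizeof(sample_t) * CHAR_BIT);
}
```

On a POSIX system both checks pass trivially; the point of writing them down is that a port to an unusual target fails at build time instead of misbehaving at run time.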

derleth 13y ago

> It's a valid operation regardless of whether a standards body says it's not.

All the world's a VAX, sure. Don't mind the next generation of hardware coming down the pike and the next wave of compiler optimizations.

http://catb.org/jargon/html/V/vaxocentrism.html

_kst_ 13y ago

"A data type is a number of bytes to the compiler."

The size of a type is just one of its many attributes. Even if, for example, "long", "float", and "void*" happen to have the same size, they're still very distinct types.

"Integer data types are defined in the limits.h file. Float data types are defined via macros in the floats.h file."

Integer and floating-point types are defined by the compiler, guided by the hardware and the ABI for the platform. <limits.h> and <float.h> document the characteristics of the predefined numeric types.

"A pointer doesn’t hold a memory address, it holds a number that represents a memory address."

Sure, and a floating-point object is ultimately just a collection of bits -- but that's hardly the best way to think about either of them. Integers and pointers (addresses) are logically very distinct things, even if they happen to have similar representations. For example, the addresses of two distinct variables have no defined relationship to each other (other than being unequal); just evaluating (&x < &y) has undefined behavior.

C lets you get away with a lot of type-unsafe stuff, particularly if you resort to pointer casts, but it's fundamentally much more strongly typed than the author seems to think it is.
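A minimal sketch of two of the points above, assuming a C11 compiler for `_Generic` and `static_assert`: the compiler tells `long`, `float`, and `void *` apart even where their sizes match, and `<limits.h>` and `<float.h>` only report characteristics of types the compiler already defines. The names `KIND_OF`, `enum kind`, and `uint_value_bits` are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

enum kind { KIND_LONG, KIND_FLOAT, KIND_VOID_PTR, KIND_OTHER };

/* _Generic selects on the type of the expression, never on its size. */
#define KIND_OF(x) _Generic((x),   \
        long:    KIND_LONG,        \
        float:   KIND_FLOAT,       \
        void *:  KIND_VOID_PTR,    \
        default: KIND_OTHER)

/* FLT_DIG documents a property of float; the standard guarantees >= 6. */
static_assert(FLT_DIG >= 6, "float must carry at least 6 decimal digits");

/* Count the value bits of unsigned int from UINT_MAX, a documented limit. */
static int uint_value_bits(void)
{
    int bits = 0;
    for (unsigned int m = UINT_MAX; m != 0; m >>= 1)
        bits++;
    return bits;
}
```

For example, `KIND_OF(0L)` is `KIND_LONG` and `KIND_OF(0.0f)` is `KIND_FLOAT` on every platform, including those where `sizeof(long) == sizeof(float)`.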

revelation 13y ago

See also: strict aliasing

dllthomas 13y ago

1 int x = 10;

2 &x = 20; // this doesn't work

3 * (&x) = 20; // this does work

Why does line 2 &x not work but line 3 does? Because &x returns a pointer, a number representing a memory address. This is an important distinction. A pointer doesn’t hold a memory address, it holds a number that represents a memory address.

=======

No, that is not why. Note that the following does work:

  int * x = 0;

and the following works, though typically yields a warning:

  int * x = 20;

Line 2 fails because & doesn't give back an l-value.
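A minimal sketch of the lvalue distinction, with the failing line left as a comment so the block compiles. The function name `assign_through_address` is hypothetical.

```c
#include <assert.h>

/* Hypothetical demo: returns the final value of x. */
static int assign_through_address(void)
{
    int x = 10;

    /* &x = 20;   does not compile: &x is a value, not an lvalue,
     *            so there is nothing to assign to. */

    *(&x) = 20;   /* *(&x) designates the object x, so it is an lvalue */

    int *p = &x;  /* p is a variable, so p itself can be assigned to */
    assert(p == &x);
    *p += 1;      /* modifies x through the pointer */

    return x;
}
```

The call `assign_through_address()` returns 21: the value 20 stored through `*(&x)`, plus the increment made through `p`.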

asveikau 13y ago

> Every variable is a starting memory address to the compiler.

Definitely not true. More like, "it will have an address, if you take the address with the & operator". Otherwise, the compiler is quite free to store locals in registers.

denniskubes OP 13y ago

> Yes I am being simplistic and yes certain data types have certain syntactic sugar but I have found this to be a good mental model

As stated in the post.

mturmon 13y ago

I think you're going to keep getting comments on these ill-considered asides, but here is another problem:

"In most assembly languages, data types don’t exist. You operate on bytes and offsets."

This is just not true.

Most assembly languages (I learned on PDP-11 assembler, which I remember best, but what I say is true of 68000 and x86 too) have a notion of a byte, but also integers of various word lengths, and floating point numbers.

In fact, some registers are in effect designated as "pointers" for various kinds of conventional indirect addressing (the instruction pointer, the register holding the stack pointer, and others).

In this sense, C is even closer to assembly than you indicate, because the data types are so analogous.

asveikau 13y ago

This reminds me of another comment I had: I personally find the phrase "syntactic sugar" irritating. As used, I don't feel like it adds anything to the blog post. IMO you could write nothing there and it'd make the exact same point.

What exactly is the "syntactic sugar" that hides the idea that names can have addresses? Structs? Some specific kind of expression? Array index syntax? The names themselves?

halayli 13y ago

Simplicity here doesn't help. Variables aren't about how they are stored and where but more about what gets applied to them and how.

snorkel 13y ago

Integers are the simple case, but you really haven't grasped the C memory model until you're comfortable handling text strings at any length, calling functions by pointers, working with structure pointers, and knowing when you need a pointer to a pointer. Part of it is understanding variable scope, local vs global vs stack frame memory. It's not rocket science, just takes practice, and the courage to segfault your way through it.

denniskubes OP 13y ago

What other mental models do people use to think about variables and memory? I would like to hear about them.

bcoates 13y ago

My mental model for C is symbol-referent diagrams like the first picture on http://www.exforsys.com/tutorials/c-language/c-pointers.html

If you keep track of which boxes are and are not runtime memory cells, that should be enough to work out any particular C pointer problem except the pointer-array almost-equivalence mess.

denniskubes OP 13y ago

That is nice. I have seen different pointer diagrams but none that linked it to a memory list as that does. I like.

ericbb 13y ago

My understanding of types took a big step forward when I read some of Robert Harper's stuff. In particular, the blog post, Dynamic Languages are Static Languages, and his book, Practical Foundations for Programming Languages. (The book is a tome and I've only read parts of it but it's very good).

When it comes to understanding memory in C, another important aspect is understanding how linkers and loaders work. Also, it's good to know something about calling conventions.

georgemcbay 13y ago

Go: Basically the same as C, but with better specification for type sizes, more rigid rules about automatic type conversion, no pointer arithmetic (you can do it using the unsafe package but it is highly discouraged by both the language design and idiomatic usage) and a compiler which can do type inference.

Also, when you get to manually allocated heap data (which this article doesn't cover) you don't have to worry about deallocations... usually.

wting 13y ago

In Haskell:

Variables? What state? Everything is puuuuuuuuure.

In Python:

Everything is an object (numbers, true/false values, strings, etc), some are mutable and some are not. Variables are temporary labels on objects (think of them as hard links).

In Rust/C++:

There are various types of boxes / smart pointers (shared, unique, heap, etc), and unsafe / raw pointers should be avoided when possible.

In C:

Not every variable has a data type, e.g. void or function pointers.

_kst_ 13y ago

"Not every variable has a data type, e.g. void or function pointers."

A void pointer has type "void*"; a function pointer also has some appropriate type.

Not every object has a type (e.g., a chunk of memory allocated by `malloc()`), but if "variable" means "object created by a declaration", then yes, every variable has a type.
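A minimal sketch of these three cases: a function pointer with a definite type, a `void *` variable, and `malloc()` storage that has no declared type until something is stored in it. The names `binary_op`, `add`, `apply`, and `roundtrip_through_void` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* A function pointer has a real type: pointer to function (int, int) -> int. */
typedef int (*binary_op)(int, int);

static int add(int a, int b) { return a + b; }

static int apply(binary_op op, int a, int b)
{
    assert(op != NULL);
    return op(a, b);
}

static int roundtrip_through_void(int value)
{
    /* The variable raw has type void *. The storage it points to has no
     * declared type yet. */
    void *raw = malloc(sizeof(int));
    if (raw == NULL)
        return value;      /* allocation failed; nothing to demonstrate */

    int *typed = raw;      /* void * converts implicitly to int * in C */
    *typed = value;        /* the store gives the storage effective type int */
    int result = *typed;

    free(raw);
    return result;
}
```

Here `apply(add, 2, 3)` goes through a typed function pointer, and `roundtrip_through_void(42)` hands back 42 after passing the value through untyped heap storage.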

BruceIV 13y ago

As the lab TA for a first year course in Java, I don't know how many times I repeated "A variable is like a box: it has a label (the variable name), and it stores something." - it's a simplistic analogy, but not far wrong (at least for Java), and it helps the new programmers get the idea.

1500100900 13y ago

Scopes of identifiers, linkages of identifiers, name-spaces of identifiers, storage durations of objects, types, and representations of types.

jimmaswell 13y ago

In higher-level languages I don't consciously think about how they're represented in memory.

16s 13y ago

It sounds simple, but you'd be surprised how many programmers don't grok the fact that types/data have sizes (especially numeric types). For many tasks, this doesn't matter, but when it does matter, you need people who understand.

As an example, an IPv4 address is 32 bits. Don't convert it to a string and put it in a varchar(64) in your database when you are optimizing for space (I actually saw this once). And yes, the DB had an inet type, but no one knew how to use it, what it was or why it mattered.
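To make the size difference concrete, here is a minimal sketch that packs a dotted-quad IPv4 address into a single 32-bit integer and formats it back. The helper names `ipv4_pack` and `ipv4_format` are hypothetical; most-significant-octet-first ordering is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: pack a.b.c.d into 32 bits, first octet in the
 * most significant byte. */
static uint32_t ipv4_pack(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16)
         | ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Hypothetical helper: write the dotted-quad text form into out.
 * Returns the number of characters snprintf would write, excluding the
 * terminating NUL. A 16-byte buffer always suffices
 * ("255.255.255.255" plus NUL). */
static int ipv4_format(uint32_t addr, char *out, size_t out_size)
{
    assert(out != NULL);
    return snprintf(out, out_size, "%u.%u.%u.%u",
                    (unsigned)((addr >> 24) & 0xFFu),
                    (unsigned)((addr >> 16) & 0xFFu),
                    (unsigned)((addr >> 8) & 0xFFu),
                    (unsigned)(addr & 0xFFu));
}
```

The packed form is 4 bytes per address; the text form needs up to 15 characters, and a varchar(64) column reserves room for far more than either.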

__david__ 13y ago

My favorite bit of pointer code is one I had to write in the bootstrap code of an embedded processor:

    int r = ((int (*)())startAddress)(); // Wheeee!

derleth 13y ago

> C is memory with syntactic sugar and as such it is helpful to think of things in C as starting from memory.

http://en.wikipedia.org/wiki/Lie-to-children

> A lie-to-children, sometimes referred to as a Wittgenstein's ladder (see below), is an expression that describes the simplification of technical or difficult-to-understand material for consumption by children. The word "children" should not be taken literally, but as encompassing anyone in the process of learning about a given topic, regardless of age. [snip] Because life and its aspects can be extremely difficult to understand without experience, to present a full level of complexity to a student or child all at once can be overwhelming. Hence elementary explanations tend to be simple, concise, or simply "wrong" — but in a way that attempts to make the lesson more understandable.

OK, the very first sentence of this piece falls flat on its face when you begin to think about how a computer actually handles getting data into and out of the parts of the CPU that actually do the work of modifying data according to the opcodes in flight.

In specific, C is meant to be a pleasant syntax to sling data around a large, flat address space, where the assumption is that every part of the address space can be treated like any other, with no special consideration given to some locations being faster than others. (The 'register' keyword mucked with this a bit, but approximately nobody uses it anymore in new code. Just as well, because good compilers ignore it anyway; more below.)

This is horribly, hilariously wrong when you learn about cache hierarchy, and becomes even more wrong when you throw an OS implementing virtual memory and a disk cache into the picture. C doesn't have any way to refer to cache; you can't tell the compiler 'store this in cache' because that would break the abstraction C enforces.

So we loop back around: C enforces the abstraction for a good reason; namely, compilers are better than humans at scheduling memory use in practically every case, and in the few cases they aren't, you're doing something hardware-specific enough you'll need to drop into assembly anyway. This is also the reason the 'register' keyword is a no-op and has been for decades. Compilers can schedule registers better than humans because compilers know more about all of the optimizations in play, and when they can't, you'll have to drop into assembly anyway.

TL;DR: This is a basic introductory post. Nitpicking it for things that compilers take care of for you anyway is pointless.

denniskubes OP 13y ago

Thank you.
