Indices Point Between Elements (opens in new tab)

(blog.nelhage.com)

133 pointsnelhage10y ago62 comments

62 comments

The visuals are definitely valuable in explaining this.

It used to be popular, and still is in some circles, to debate whether programming languages ought start array indexing at 0 or 1.

When talking about this with other programmers, I've discovered that a lot of the issues/confusion could be avoided by consistent use of terminology: Offsets/offsetting always being zero-based and indexes/indexing always being one-based.

Using rulers and birthdays also helps to explain differences. You're in the first year before your first birthday, being zero (whole) years old.

To make matters potentially more confusing, culturally, I remember something about the ground floor in the UK buildings not being "Floor 1" like it is in the United States.

http://www.hacker-dictionary.com/terms/fencepost-error

duaneb10y ago

I would argue an index, in the sense that it might be an integer, is indistinguishable from an offset. I would refer to one-based indexing as positional—1st, 2nd, 3rd, ..., ith.

Of course, it's the offset from the first element, so it's kind of a circular definition.

kmill10y ago

Generally, it's useful to distinguish offset from name sometimes. For instance, gravitational potential energy has numbers associated with different amounts, but once you choose an origin you can compute energy differences.

Another example is screen space versus screen displacement. This is the difference between affine space and a vector space. Whether the upper corner is (0,0) or (222,22) shouldn't matter as long as you are doing everything relative to some point.

In C, I argue we always use offsets. Each element of an array A is at a particular memory location, the name of the array being the first memory location, and then A[i] means take the thing at A+i. Notice that the difference p-q between two memory locations p and q is exactly the offset you put into an index expression: q[p-q] == *p.

That said, it is convenient to confuse the offset with the memory location since 1. the memory location is likely not known when writing the program 2. if it were known, it would be almost impossible to use.

Now, an anecdote: I was helping implement a QR factoring algorithm from a textbook which uses 1-based indexing in a language which uses 0-based offsets. We tried changing the bounds of the nested loops to account for the difference, but it was basically impossible to avoid off-by-one errors. So, we left the loop bounds as the textbook had them and instead indexed like A[i-1], since this i-1 is the offset from A[0], the array element labeled 1.

pavlov10y ago

I remember something about the ground floor in the UK buildings not being "Floor 1" like it is in the United States.

Actually, that's perfectly explained with your offset vs. index terminology. In some countries, the floor number is an index within the array of floors. In others, it's an offset from the ground.

fhars10y ago

Except in the UK, where there often is a Mezzanine floor somewhere above the ground floor (usually, but not always, between the ground floor and the first floor).

Is there an Esolang that numbers its arrays with 0,M,1,2,3...?

1 more reply

thwarted10y ago

Oh, I know it's perfectly explained by that, but using an example of building floors has cultural idioms. Ruler measurements and birthdays seem to be more universal.

1 more reply

cjhveal10y ago

The birthday analogy is a good one, I'll have to keep that in mind. Your actual "birthday" being the points between year elements, your "age" as the offset from birth.

As to the point about floor numbering, the situation can be a little confusing in Canada. Some buildings label in the US style, labeling stories [1st, 2nd, 3rd] while others label in the UK style as [Ground, 1st, 2nd]. We also often mix them and you'll see [Ground, 2nd, 3rd], with ground sometimes replaced by main, or lobby.

wlesieutre10y ago

[Lobby/Ground, 2nd, 3rd] is common in the US too

SerpentJoe10y ago

I prefer that model for floor numbering, personally. It means if you're on floor N that you're N stories above the ground.

ibejoeb10y ago

Not that it matters, but I feel the opposite. Barring the presence of subterranean levels, if you're on the ground floor, you are standing on the first literal "floor" of the structure.

TazeTSchnitzel10y ago

Well, depends which country you live in.

In British English, the first floor is the first above the ground.

In American English, the first floor is the ground.

derefr10y ago

> You're in the first year before your first birthday, being zero (whole) years old.

I would argue that your "first birthday" is, in fact, the day you are born—your birth day. The thing that happens for the first time a year later, is the first anniversary of your birth day.

wlesieutre10y ago

And I'd say that the word "birthday" is defined to mean "anniversary of the day you were born"

LoSboccacc10y ago

_It used to be popular, and still is in some circles, to debate whether programming languages ought start array indexing at 0 or 1_

this is an exemplary case of citation needed if I ever saw one. maybe it's a valid debate for programming languages that doesn't allow people to do pointer arithmetic, which already restrict the field a lot, but even then that's sound as part of the 4GL bullshit that never really took off, and for good reasons

beagle310y ago

I don't know what kind of citation would satisfy you. These debates still come up in e.g. the Lua (1-based) mailing list, and used to be everywhere.

Visual Basic had the "OPTION BASE" statement to select.[0] (Many other versions of basic did too)

APL also has the ⎕IO Index Origin setting [1]

If you want to see a lively debate, there's c2[2], and there's also Dijkstra[3]

[0] https://msdn.microsoft.com/en-us/library/aa266179%28v=vs.60%...

[1] https://en.wikipedia.org/wiki/APL_syntax_and_symbols

[2] http://c2.com/cgi/wiki?ZeroAndOneBasedIndexes

[3] http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

1 more reply

ajanuary10y ago

> sound as part of the 4GL bullshit

Actually it's primarily early languages plus Lua.

[0] https://en.m.wikipedia.org/wiki/Comparison_of_programming_la...

1 more reply

Someone10y ago

See also https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/E... ("Why numbering should start at zero")

hammock10y ago

And the HN discussion: https://news.ycombinator.com/item?id=9761355

surrealize10y ago

People have taken different approaches to this in bioinformatics for numbering intervals of dna bases in chromosomes. I think this approach is catching on, though. In that realm, it's important to speak unambiguously about insertions and deletions, and the "interbase" mental model makes it a lot clearer.

edit: Probably the most popular genome browser, based at UC Santa Cruz, uses this zero-based, half-open numbering internally. But, at some point in the past, biologists developed an expectation that numbering would be 1-based. So the Santa Cruz genome browser actually adds 1 to the start position of everything for display purposes.

rcthompson10y ago

One interesting side-effect of using 1-based closed indexing (i.e. numbering the positions, not between the positions) is that a zero-width range (which is something that actually comes up in genomics) starts at position N and ends at N-1.

carapace10y ago

Once at a party I accidentally started an argument about which way toilet paper should hang from the roll (front or back) by mentioning how silly it was that people would argue about such a trivial matter.

leni53610y ago

Also at which and do you start pealing a banana?

sago10y ago

I've found this a helpful way of explaining ranges in Python to students, it also makes Python's negative indices understandable for the same reason.

But it is contextual. When it comes to languages like C, where arrays are more directly mapped to pointers and memory layout, I've found it better to talk about pointers, and allow people to derive the behavior that way.

Either way, I'd be careful of trying to claim that 'this is what it is', rather than 'here is a way to remember it'.

nelhageOP10y ago

I get into this a bit later on, but I think the exact same model applies to pointers: You're much better off in most cases thinking of pointers as pointing at the zero-width points between elements, than at elements themselves.

rzzzt10y ago

I'm also having trouble meshing this way of thinking about indexes with the idea of a fixed bit-width, discretely addressable RAM, which would suggest that there is nothing "between" two storage elements.

I find it very useful, however, for imagining what the returned insertion point index of a binary search would mean, when the item you are looking for can not be found.

igravious10y ago

Isn't it more obvious that the indices are the addresses of the locations if the boxes are arranged vertically? Besides the boxes should contain their values not their address, an address is attached to a box, and to one box only. Better off indeed thinking of pointer existing in completely different boxes and their values as pointing at the zero-width sides of other boxes :)

Dysprosium10y ago

The Python documentation use this representation since a long time.

https://docs.python.org/2/tutorial/introduction.html#tut-str...

orclev10y ago

He undermines his own argument right out of the gate. When he shows the part about which number should be used for the last element in the range there's still two valid choices, you could choose to stop at 3 with the understanding that the element to the right of that is the last element to be selected, or to stop at 4 with the understanding that in this context your indexing is "special" and you're actually selecting the element to the left of the index.

morpher10y ago

If you treat the indices as between elements and include all elements that are between the start and end elements, there is only one reasonable choice. It would be more "special" to include elements that are outside of the specified range (i.e., the element to the right of the final index).

crusso10y ago

Yeah, but I think that orclev was saying that you have a rule at the beginning that the element to the right of the index is important and then in ranges, the element to the right of the index is know longer important.

One way or the other, someone needs to know the same number of rules in order to understand how the indices work.

perlgeek10y ago

When you talking about where zero-width regexes match, this mental model certainly helps, and I find it consistent otherwise too.

nelhageOP10y ago

zero-width matches (and empty lines) were a huge source of stupid edge-case bugs in livegrep[1], and being rigorous about maintaining this mental model definitely helped a lot.

[1] livegrep.com

DougBTX10y ago

Further notes about half open intervals and why starting at zero is a good idea: http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

bch10y ago

There is a reference to Emacs in the OP, but not directly to what immediately sprung up in my mind[0]. In Emacs, it has a cursor (the block) and a "point" which is described as an imaginary spot at the left edge of the cursor - the point is what's important when one is indexing data (ie: setting a mark and selecting a bunch of text). The model fits for lots of things... Unfortunately for me, tmux highlighting does not follow this model.

[0] https://www.gnu.org/software/emacs/manual/html_node/emacs/Po...

edit: clarify Emacs reference in OP

MrManatee10y ago

I have sometimes wondered whether it would be useful to have two different types for indices, depending on if we are indexing the elements themselves or the "gaps" between them. Let's say I'm thinking about some language like Haskell, where types are already a big deal.

That way the compiler would be able to tell if I accidentally mixed the two. Every conversion would have to explicit: for example, there might be two functions, "before" and "after", that take a gap index, and return an element index.

I think I might actually enjoy programming this way, but perhaps others would find it needlessly bureaucratic.

curiousDog10y ago

Isn't this the classic fence post vs fence section analogy? I like this statement from Djikstra: "an element's ordinal ... equals the number of elements preceding it in the sequence"[1] [1]http://c2.com/cgi/wiki?WhyNumberingShouldStartAtZero

Another nice link: http://betterexplained.com/articles/learning-how-to-count-av...

marijn10y ago

And this is why JavaScript's `lastIndexOf` method gets it wrong, wrong, wrong. (`"foo".lastIndexOf("o", 2)` yields 2, whereas I, and probably a lot of other people, would expect it to be 1, with the search starting at the space between element 1 and 2)

Camillo10y ago

This is the mental model I use.

spartan3710y ago

Relevant https://github.com/vr6/blog/blob/master/Zero-One.md

colanderman10y ago

This is the same model used by the Cairo graphics library and the HTML5 Canvas (albeit in two dimensions). All the same benefits apply when you're addressing pixels as elements of an array.

ape410y ago

We need a different name for this. We're too used to "index". If the name is "Foo" then we can have fooSubString(i,j);

pshc10y ago

This echoes the way unums work in "The End of Error". (Recommended read by the way.) Good way of thinking about ranges.

amelius10y ago

Ugh. Please no.

This only adds to the confusion, as now there are 2 ways in which indices can be interpreted.

Retra10y ago

Specifically, this:

>"Indexing between elements, instead of indexing elements, helps avoid a large class of off-by-one errors."

It only replaces them with indexing-method errors. Instead of remembering if my ranges are open or closed, I have to remember if they are using between-element indices or on-element indices. It's still going to cause the same kinds of problems.

rcthompson10y ago

You always have to remember what kind of indexing you are using in order to avoid making errors. But using this mental model will help people avoid making off-by-one errors because they don't even understand the indexing in the first place.

LoSboccacc10y ago

also this works only in this special case of array being one sized.

an index is an offset to a pointer in memory, shifted by the size of the structure it points to. there is no other way around, no magic tricks about index being between elements.

of course people never exposed from c miss out all of this, and then are left to made up bullshit about how stuff actually works

array index are offset to a memory location, plain and simple; if that requires more explaining than this, then there is the need to get back to tech the basic underpinning on how computer memory and addressing works

matchu10y ago

Memory addresses are exactly analogous to array indices, and suffer from exactly the same semantics issues. After all, low-level memory is just an array of bytes.

We do have a convention that dereferencing a memory address returns the 8 bits to the right of that address. We've even optimized our hardware for that convention. But that's just a convention of the dereference operation; it's not fundamental to the addresses themselves.

I agree that a C pointer isn't analogous to an array index; that's because a pointer is a range, determined by a pair of memory addresses. One, stored at runtime, refers to the location before the first byte of the range. The other, implicitly derived from the runtime value and the size information in the pointer type, refers to the location after the last byte of the range. When we think of memory addresses as the article's indexes, and pointers as the article's ranges, everything falls into place.

(Incidentally, please be careful calling out people for not understanding computers. C isn't actually the lowest level of computing, and pointers aren't as primitive as your post implies. When you call someone out, you need to be 100% clear and 100% right.)

2 more replies

maaku10y ago

Great teaching tool, thank you.

anon544637210y ago

Some people spend too much time overthinking simple concepts...

kmill10y ago

Some people overestimate the simplicity of common concepts...

j / k navigate · click thread line to collapse

62 comments

thwarted10y ago

The visuals are definitely valuable in explaining this.

It used to be popular, and still is in some circles, to debate whether programming languages ought start array indexing at 0 or 1.

Using rulers and birthdays also helps to explain differences. You're in the first year before your first birthday, being zero (whole) years old.

To make matters potentially more confusing, culturally, I remember something about the ground floor in the UK buildings not being "Floor 1" like it is in the United States.

http://www.hacker-dictionary.com/terms/fencepost-error

duaneb10y ago

I would argue an index, in the sense that it might be an integer, is indistinguishable from an offset. I would refer to one-based indexing as positional—1st, 2nd, 3rd, ..., ith.

Of course, it's the offset from the first element, so it's kind of a circular definition.

kmill10y ago

pavlov10y ago

I remember something about the ground floor in the UK buildings not being "Floor 1" like it is in the United States.

Actually, that's perfectly explained with your offset vs. index terminology. In some countries, the floor number is an index within the array of floors. In others, it's an offset from the ground.

fhars10y ago

Except in the UK, where there often is a Mezzanine floor somewhere above the ground floor (usually, but not always, between the ground floor and the first floor).

Is there an Esolang that numbers its arrays with 0,M,1,2,3...?

1 more reply

thwarted10y ago

Oh, I know it's perfectly explained by that, but using an example of building floors has cultural idioms. Ruler measurements and birthdays seem to be more universal.

1 more reply

cjhveal10y ago

The birthday analogy is a good one, I'll have to keep that in mind. Your actual "birthday" being the points between year elements, your "age" as the offset from birth.

wlesieutre10y ago

[Lobby/Ground, 2nd, 3rd] is common in the US too

SerpentJoe10y ago

I prefer that model for floor numbering, personally. It means if you're on floor N that you're N stories above the ground.

ibejoeb10y ago

Not that it matters, but I feel the opposite. Barring the presence of subterranean levels, if you're on the ground floor, you are standing on the first literal "floor" of the structure.

TazeTSchnitzel10y ago

Well, depends which country you live in.

In British English, the first floor is the first above the ground.

In American English, the first floor is the ground.

derefr10y ago

> You're in the first year before your first birthday, being zero (whole) years old.

I would argue that your "first birthday" is, in fact, the day you are born—your birth day. The thing that happens for the first time a year later, is the first anniversary of your birth day.

wlesieutre10y ago

And I'd say that the word "birthday" is defined to mean "anniversary of the day you were born"

LoSboccacc10y ago

_It used to be popular, and still is in some circles, to debate whether programming languages ought start array indexing at 0 or 1_

beagle310y ago

I don't know what kind of citation would satisfy you. These debates still come up in e.g. the Lua (1-based) mailing list, and used to be everywhere.

Visual Basic had the "OPTION BASE" statement to select.[0] (Many other versions of basic did too)

APL also has the ⎕IO Index Origin setting [1]

If you want to see a lively debate, there's c2[2], and there's also Dijkstra[3]

[0] https://msdn.microsoft.com/en-us/library/aa266179%28v=vs.60%...

[1] https://en.wikipedia.org/wiki/APL_syntax_and_symbols

[2] http://c2.com/cgi/wiki?ZeroAndOneBasedIndexes

[3] http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

1 more reply

ajanuary10y ago

> sound as part of the 4GL bullshit

Actually it's primarily early languages plus Lua.

[0] https://en.m.wikipedia.org/wiki/Comparison_of_programming_la...

1 more reply

Someone10y ago

See also https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/E... ("Why numbering should start at zero")

hammock10y ago

And the HN discussion: https://news.ycombinator.com/item?id=9761355

surrealize10y ago

rcthompson10y ago

carapace10y ago

leni53610y ago

Also at which and do you start pealing a banana?

sago10y ago

I've found this a helpful way of explaining ranges in Python to students, it also makes Python's negative indices understandable for the same reason.

Either way, I'd be careful of trying to claim that 'this is what it is', rather than 'here is a way to remember it'.

nelhageOP10y ago

rzzzt10y ago

I find it very useful, however, for imagining what the returned insertion point index of a binary search would mean, when the item you are looking for can not be found.

igravious10y ago

Dysprosium10y ago

The Python documentation use this representation since a long time.

https://docs.python.org/2/tutorial/introduction.html#tut-str...

orclev10y ago

morpher10y ago

crusso10y ago

One way or the other, someone needs to know the same number of rules in order to understand how the indices work.

perlgeek10y ago

When you talking about where zero-width regexes match, this mental model certainly helps, and I find it consistent otherwise too.

nelhageOP10y ago

zero-width matches (and empty lines) were a huge source of stupid edge-case bugs in livegrep[1], and being rigorous about maintaining this mental model definitely helped a lot.

[1] livegrep.com

DougBTX10y ago

Further notes about half open intervals and why starting at zero is a good idea: http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD831.PDF

bch10y ago

[0] https://www.gnu.org/software/emacs/manual/html_node/emacs/Po...

edit: clarify Emacs reference in OP

MrManatee10y ago

I think I might actually enjoy programming this way, but perhaps others would find it needlessly bureaucratic.

curiousDog10y ago

Another nice link: http://betterexplained.com/articles/learning-how-to-count-av...

marijn10y ago

Camillo10y ago

This is the mental model I use.

spartan3710y ago

Relevant https://github.com/vr6/blog/blob/master/Zero-One.md

colanderman10y ago

This is the same model used by the Cairo graphics library and the HTML5 Canvas (albeit in two dimensions). All the same benefits apply when you're addressing pixels as elements of an array.

ape410y ago

We need a different name for this. We're too used to "index". If the name is "Foo" then we can have fooSubString(i,j);

pshc10y ago

This echoes the way unums work in "The End of Error". (Recommended read by the way.) Good way of thinking about ranges.

amelius10y ago

Ugh. Please no.

This only adds to the confusion, as now there are 2 ways in which indices can be interpreted.

Retra10y ago

Specifically, this:

>"Indexing between elements, instead of indexing elements, helps avoid a large class of off-by-one errors."

rcthompson10y ago

LoSboccacc10y ago

also this works only in this special case of array being one sized.

an index is an offset to a pointer in memory, shifted by the size of the structure it points to. there is no other way around, no magic tricks about index being between elements.

of course people never exposed from c miss out all of this, and then are left to made up bullshit about how stuff actually works

matchu10y ago

Memory addresses are exactly analogous to array indices, and suffer from exactly the same semantics issues. After all, low-level memory is just an array of bytes.

2 more replies

maaku10y ago

Great teaching tool, thank you.

anon544637210y ago

Some people spend too much time overthinking simple concepts...

kmill10y ago

Some people overestimate the simplicity of common concepts...

j / k navigate · click thread line to collapse