undefined | Better HN

0 pointsmorelisp3y ago0 comments

You could write the same approach in C as `(size_t)i+(size_t)j` without UB. The real reason it doesn't work in C is because a memory region can be large enough to still overflow in that case.

0 comments

wizeman3y ago

That's not exactly the same approach, because you're doing unsigned addition while the Go code is doing signed addition.

And technically speaking, I think C doesn't guarantee that 'size_t' is at least as large as a 'signed int' (even though this is true on all platforms that I know of), so your approach would fail if that weren't the case. Although, you could use 'ssize_t' instead of 'int', or 'unsigned int' instead of 'size_t' to fix that.

> The real reason it doesn't work in C is because a memory region can be large enough to still overflow in that case.

The Go code we are discussing has nothing to do with memory regions, it's a generic binary search function, so it can be used for e.g. bisecting git commits. It doesn't require the calling function to use arrays.

Although yes, if the calling code were trying to do a binary search on an array, conceptually it could fail, but in that case you could argue the bug would be in the calling function, because it would be trying to pass the array length into a binary search function which only accepts an `int` or `ssize_t` function parameter, which could result in the array length being truncated. But strictly speaking, this would not be an arithmetic overflow issue.

That said, I would just fix the code so that it works for the full 'size_t' range, since the most common use case of a binary search function is indeed to do searches on arrays. In that case, the Go approach wouldn't work indeed.

a13692099933y ago

> I think C doesn't guarantee that 'size_t' is at least as large as a 'signed int'

That doesn't matter, because size_t is large enough to hold any array index (that's kind of[0] the defining property of size_t), so any array index in a signed int can be safely converted to size_t. The real problem is that (using 16-bit size_t for illustrative purposes) if you have, say, x = (size_t)40000 and y = (size_t)50000 into a 60000-element array, x+y = (size_t)90000 = (size_t)24464, which means (x+y)/2 = 12232, which is the completely wrong array element.

0: Technically, size_t is large enough to hold any object size, but array elements can't be smaller than char (sizeof can't be less than 1), so a array can't have more elements than it's sizeof.

wizeman3y ago

> > I think C doesn't guarantee that 'size_t' is at least as large as a 'signed int'

> That doesn't matter, because size_t is large enough to hold any array index (that's kind of[0] the defining property of size_t), so any array index in a signed int can be safely converted to size_t.

Well, the Go code we're discussing has nothing to do with arrays or array indices, so `size_t` doesn't help here.

Go look at the code :) It's a generic function for doing binary search, which accepts an `int` as a function argument, specifying the search size.

The code is then doing:

  h := int(uint(i+j) >> 1) // avoid overflow when computing h

Replacing the Go expression `uint(i+j)` with `(size_t)i+(size_t)j` in C like morelisp proposed would not work correctly if `size_t` is smaller than `int`.

That's the point I was making.

vlovich1233y ago

Pretty sure that’s not the case for 64 bit systems since you can “only” allocate about 48 bits of address space (maybe slightly more on newer systems).

For 32 bit systems using 64bit instead of size_t would similarly solve the problem.

wizeman3y ago

Well, that's not something the C standard (or POSIX, etc) guarantees, is it?

Conceptually, a 64-bit kernel today could allow your program to allocate (almost) the entire 64-bit address space, assuming it does memory overcommit (like Linux) and/or uses some kind of memory compression (like Linux supports as well).

There might be some MMU limitations on today's mainstream systems, but this doesn't mean that all 64-bit systems have those limitations or that those limitations will remain there in the future.

So your code would break as soon as a new system comes along without those limitations.

Also, this would be even more true if the code and stack would be stored in different address spaces, as theoretically that would even allow you to allocate the entire address space, I think.

vlovich1233y ago

The system you describe simply doesn’t exist, standards or no. A 64-bit kernel can’t hand out 64-bits worth of addresses because no CPU built today supports it.

A 48-bit index to an array can represent >240TBytes of RAM minimum - if your records are > 1 byte, you have significantly higher storage requirements. The largest system I could find that’s ever been built was a prototype that has ~160TiB of RAM [1]. Also remember. To make the algorithm incorrect, the sum of two numbers has to exceed 64bits - that means you’d need >63-bits of byte-addressable space. That just simply isn’t happening.

Now of course you might be searching through offline storage. 2^63 bits is ~9 exabytes of an array where each element is 1 byte. Note that now we’re talking scales of about about the aggregate total storage capacity of a public hyperscaled cloud. Your binary search simply won’t even finish.

So sure. You’re technically right except you’d never find the bug on any system that your algorithm would ever run on for the foreseeable future, so does it even matter?

As an aside, at the point where you’re talking about 48-bits worth of addressable bytes you’re searching, you’re choosing a different algorithm because a single lookup is going to take on the order of hours to complete. 63-bits is going to take ~27 years iff you can sustain 20gib/s for comparing the keys (sure binary search is logarithmic but then you’re not going to be hitting 20gib/s). Remember - data doesn’t come presorted either so simply getting all that data into a linearly sorted data structure is similarly impractical.

2 more replies

morelispOP3y ago

> For 32 bit systems using 64bit instead of size_t would similarly solve the problem.

~~C does not guarantee a 64 bit type exists.~~ This is not really correct these days.

wizeman3y ago

> C does not guarantee a 64 bit type exists.

Isn't (signed/unsigned) 'long long int' mandatory since C99?

It says: 'There are five standard signed integer types, designated as signed char, short int, int, long int, and long long int'.

My cursory search for 'long long' in the standard didn't find anything about it being optional...

1 more reply

j / k navigate · click thread line to collapse

0 comments

wizeman3y ago

That's not exactly the same approach, because you're doing unsigned addition while the Go code is doing signed addition.

> The real reason it doesn't work in C is because a memory region can be large enough to still overflow in that case.

a13692099933y ago

> I think C doesn't guarantee that 'size_t' is at least as large as a 'signed int'

0: Technically, size_t is large enough to hold any object size, but array elements can't be smaller than char (sizeof can't be less than 1), so a array can't have more elements than it's sizeof.

wizeman3y ago

> > I think C doesn't guarantee that 'size_t' is at least as large as a 'signed int'

Well, the Go code we're discussing has nothing to do with arrays or array indices, so `size_t` doesn't help here.

Go look at the code :) It's a generic function for doing binary search, which accepts an `int` as a function argument, specifying the search size.

The code is then doing:

  h := int(uint(i+j) >> 1) // avoid overflow when computing h

Replacing the Go expression `uint(i+j)` with `(size_t)i+(size_t)j` in C like morelisp proposed would not work correctly if `size_t` is smaller than `int`.

That's the point I was making.

vlovich1233y ago

Pretty sure that’s not the case for 64 bit systems since you can “only” allocate about 48 bits of address space (maybe slightly more on newer systems).

For 32 bit systems using 64bit instead of size_t would similarly solve the problem.

wizeman3y ago

Well, that's not something the C standard (or POSIX, etc) guarantees, is it?

There might be some MMU limitations on today's mainstream systems, but this doesn't mean that all 64-bit systems have those limitations or that those limitations will remain there in the future.

So your code would break as soon as a new system comes along without those limitations.

Also, this would be even more true if the code and stack would be stored in different address spaces, as theoretically that would even allow you to allocate the entire address space, I think.

vlovich1233y ago

The system you describe simply doesn’t exist, standards or no. A 64-bit kernel can’t hand out 64-bits worth of addresses because no CPU built today supports it.

So sure. You’re technically right except you’d never find the bug on any system that your algorithm would ever run on for the foreseeable future, so does it even matter?

2 more replies

morelispOP3y ago

> For 32 bit systems using 64bit instead of size_t would similarly solve the problem.

~~C does not guarantee a 64 bit type exists.~~ This is not really correct these days.

wizeman3y ago

> C does not guarantee a 64 bit type exists.

Isn't (signed/unsigned) 'long long int' mandatory since C99?

It says: 'There are five standard signed integer types, designated as signed char, short int, int, long int, and long long int'.

My cursory search for 'long long' in the standard didn't find anything about it being optional...

1 more reply

j / k navigate · click thread line to collapse