Note that C does have strong conventions, such as that strings are terminated by a zero byte. Nothing in the language demands that; it’s just a convention! C could adopt better conventions.
I stated the same on HN earlier, but someone pointed out that literal strings are ASCIIZ.
If only. In C, it’s a (95+5)-character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:
“The basic literal character set consists of all characters of the basic character set, plus the following control characters”
That page also explicitly says:
The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.
Code unit  Character      Glyph
U+0024     Dollar Sign    $
U+0040     Commercial At  @
U+0060     Grave Accent   `
If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.

Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).
Also (https://en.cppreference.com/w/cpp/language/charset):
“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”
If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.
Also, chances are this changed in subtle ways between C and C++ versions.
11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'
ptr ^
C could be upgraded to do this in future versions, without too much backwards incompatibility.
"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"
This means, for example, that strlen() must always scan for the location of the first null character; there's no stored length it could consult instead.
How would this work?
void *x = malloc(8);
...
// Python: int.from_bytes(b'Hello!\0\0', 'big'); this constant matches memory
// order only on a big-endian host. On a little-endian machine you'd need
// int.from_bytes(b'Hello!\0\0', 'little') == 36762444129608 instead.
uint64_t i = 5216694956355289088;
memcpy(x, &i, 8);
char *s = x;
puts(s);
Assuming I did it correctly, this should print "Hello!". Where does the length get added to the start of the string?
But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.
String literals are nul-terminated, e.g.: "foo"[3] == '\0'