Note that C does have strong conventions, such as that strings are terminated by a zero byte. Nothing in the language demands that; it’s just a convention! C could adopt better conventions.
I stated the same on HN earlier, but someone pointed out that literal strings are ASCIIZ.
If only. In C, it’s a (95+5)-character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:
“The basic literal character set consists of all characters of the basic character set, plus the following control characters”
That page also explicitly says:
The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.
Code unit  Character      Glyph
U+0024     Dollar Sign    $
U+0040     Commercial At  @
U+0060     Grave Accent   `
If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.

Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but it seems to put even fewer restrictions on them (in my cursory reading).
Also (https://en.cppreference.com/w/cpp/language/charset):
“Mapping from source file (other than a UTF-8 source file) (since C++23) characters to the basic character set (until C++23) translation character set (since C++23) during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”
If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.
Also, chances are this changed in subtle ways between C and C++ versions.
11 'h' 'e' 'l' 'l' 'o' ' ' 'w' 'o' 'r' 'l' 'd' '\0'
ptr ^
C could be upgraded to do this in future versions, without too much backwards incompatibility.
"A string is a contiguous sequence of characters terminated by and including the first null character. .. The length of a string is the number of bytes preceding the null character"
This means, for example, that strlen() must always scan for the location of the first null character; there's no stored length it could consult instead.
How would this work?
void *x = malloc(8);
...
// Python: int.from_bytes(b'Hello!\0\0', 'big'); this constant matches memory
// order only on a big-endian host. On a little-endian machine you'd need
// int.from_bytes(b'Hello!\0\0', 'little') == 36762444129608 instead.
uint64_t i = 5216694956355289088;
memcpy(x, &i, 8);
char *s = x;
puts(s);
Assuming I did it correctly, this should print "Hello!". Where does the length get added to the start of the string?
But I'd hope that doing that would always be optional. There are numerous situations where that would seriously get in the way.
String literals are nul-terminated, e.g.: "foo"[3] == '\0'