
0 points | Someone | 2y ago | 0 comments

> literal strings are ASCIIZ.

If only. In C, it’s a (95+5)-item character set that happens to be a subset of ASCII. See https://en.cppreference.com/w/c/language/charset:

“The basic literal character set consists of all characters of the basic character set, plus the following control characters”

That page also explicitly says:

“The following characters are not in basic execution character set, but they are required to be encoded as a single byte in an ordinary character constant or ordinary string literal.

  Code unit  Character      Glyph
  U+0024     Dollar Sign    $
  U+0040     Commercial At  @
  U+0060     Grave Accent   `”

If I read that correctly, if you write a ‘$’ in a string literal before C23, there’s no guarantee that it gives you a byte with value 0x24.
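
If your code quietly relies on those values, one option is to have the compiler check them. This is only a minimal sketch: it assumes C11 or later (for static_assert via <assert.h>), and the helper name dollar_at_grave_are_ascii is made up for illustration.

```c
#include <assert.h>   /* static_assert macro in C11/C17; a keyword since C23 */

/* Compile-time checks: the build fails on an implementation whose
   execution character set does not use the ASCII values here. */
static_assert('$' == 0x24, "'$' is not 0x24 in the execution character set");
static_assert('@' == 0x40, "'@' is not 0x40 in the execution character set");
static_assert('`' == 0x60, "'`' is not 0x60 in the execution character set");

/* Hypothetical run-time variant: returns 1 if the first three bytes
   of s are 0x24, 0x40, 0x60 (the ASCII codes for $, @ and grave accent).
   The caller must pass a string with at least three characters. */
int dollar_at_grave_are_ascii(const char *s) {
    const unsigned char *u = (const unsigned char *)s;
    return u[0] == 0x24 && u[1] == 0x40 && u[2] == 0x60;
}
```

On an ASCII-based implementation the static checks pass silently. On something like an EBCDIC target they should stop the build instead of letting the mismatch through.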
Of course, C++ is different. Like C, it makes a distinction between the encoding of source files (nowadays called the “basic character set”) and the encoding that the compiler converts literals to (nowadays called the “basic literal character set”), but, in my cursory reading, it seems to put even fewer restrictions on them.
Also (https://en.cppreference.com/w/cpp/language/charset):
“Mapping from source file (other than a UTF-8 source file) [since C++23] characters to the basic character set [until C++23] / translation character set [since C++23] during translation phase 1 is implementation-defined, so an implementation is required to document how the basic source characters are represented in source files.”

If I understand that correctly, you can’t portably write a euro sign in C++ source files before C++23.
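
The usual way around the source-encoding question is to avoid typing the character at all and spell it as a universal character name instead. A hedged sketch, written in C (C11 or later, since it uses a u8 literal and static_assert); the names euro_utf8 and euro_is_utf8 are hypothetical, and C++ has the same \u escapes but different literal types:

```c
#include <assert.h>   /* static_assert macro in C11/C17; a keyword since C23 */
#include <string.h>   /* memcmp */

/* U+20AC EURO SIGN written as a universal character name, so the
   source file only contains basic source characters. The u8 prefix
   requests UTF-8 code units whatever the execution character set is. */
static const unsigned char euro_utf8[] = u8"\u20AC";

/* Three UTF-8 bytes plus the terminating zero. */
static_assert(sizeof euro_utf8 == 4, "unexpected size for the euro literal");

/* Hypothetical helper: returns 1 if the literal holds E2 82 AC 00. */
int euro_is_utf8(void) {
    static const unsigned char expected[] = { 0xE2, 0x82, 0xAC, 0x00 };
    return memcmp(euro_utf8, expected, sizeof expected) == 0;
}
```

Without the u8 prefix, what "\u20AC" turns into depends on the execution character set again, which is the same problem one level down.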

Also, chances are this changed in subtle ways between C and C++ versions.
