Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
I actually like UTF-8 because it will break very quickly and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.
All the other options will also break, but later on:
- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.
- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin tone emoji, and once again, you're back at square one.
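The two-code-point é is easy to demonstrate with nothing but Python's stdlib `unicodedata` module; a quick sketch:

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e + U+0301 COMBINING ACUTE ACCENT

print(len(composed))    # 1
print(len(decomposed))  # 2
print(composed == decomposed)  # False, even though both render as é

# NFC normalization folds the pair back into the precomposed form
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So even a fixed 4-byte encoding doesn't give you "one index per visible character" without normalizing first, and normalization still can't compose sequences that have no precomposed form.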
You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept very quickly that "str[3]" is hopeless. It also helps a lot if your language has separate types for "byte" and "Unicode code point", so you can't accidentally treat a single byte as a character.
Depending on the task at hand, iterating by UTF-8 byte or by code point can make sense, too. And the definition of these is frozen regardless of Unicode version, which makes these safer candidates for "fundamental" operations. There is no right unit of iteration for all tasks.
If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.
None of these are fundamental.
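One short string already gives a different answer for each unit; a stdlib-only Python sketch (counting grapheme clusters would additionally need a segmentation library):

```python
# One string, several "lengths": decomposed é, €, 😀
s = "e\u0301\u20ac\U0001F600"

print(len(s.encode("utf-8")))  # 10 bytes (1 + 2 for e+accent, 3 for €, 4 for 😀)
print(len(s))                  # 4 code points
# Grapheme clusters: 3 -- but the stdlib has no way to count those.
```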
There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them look over your text processing, or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.
Size in memory/bytes you can get trivially for any string (and this doesn't change with whether you choose bytes, graphemes, code points or whatever to iterate).
Screen space is irrelevant/orthogonal to encoding -- it's determined at the font level, and the font rendering engine that gives you the metrics will accept whatever encoding you have.
I also wish Python would expose more of the Unicode database than it does; I've had to turn to third-party modules that basically build their own separate database for some Unicode stuff Python doesn't provide (like access to the Script property).
Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot).
But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time, and I'd rather have it behave as a sequence of code points than have it behave as a sequence of bytes in a variable-width encoding.
As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII.
I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.
[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...
That is my use case for Python as well.
> sometimes I have to write validation that cares about length,
That's where a truncation function that understands grapheme clusters would come in so handy. Tell it that you want to truncate to n bytes maximum and let it chop a bit more so as not to split a grapheme cluster.
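A minimal Python sketch of the byte-budget half of that idea -- it guarantees it won't split a code point, though being grapheme-cluster-safe would need a segmentation library on top; `truncate_utf8` is a hypothetical helper, not from any library:

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Truncate s to at most max_bytes of UTF-8 without splitting a code point.

    NOT grapheme-cluster-safe: it can still split é-as-two-code-points,
    flags, skin tone sequences, etc.
    """
    b = s.encode("utf-8")
    if len(b) <= max_bytes:
        return s
    # Decoding with errors="ignore" silently drops a partial trailing
    # sequence, so the cut always lands on a code point boundary.
    return b[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("héllo", 3))  # 'hé' (h is 1 byte, é is 2)
print(truncate_utf8("héllo", 2))  # 'h'  (cutting é in half drops it entirely)
```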
Fortunately my database does not have fixed-width strings, so I rarely bump into this one.
> or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time
I write my code to avoid this. Yes, I still have to use an index because that's what Python supports, but it would be trivial to convert it to another language that supports string iterators.
Combining accents are a great example: if you're an American, especially in the 80s, it's easy to assume that you only need a couple of accents like you used in Spanish and French classes and that's really simple for converting old data to a new encoding. Later, it becomes obvious that far more are needed but by then there's a ton of code and data in the wild so you end up needing the concept of normalization for compatibility.
(That's the same lapse which led to things like UCS-2 assuming 2^16 characters even though that's not enough for a full representation of Chinese alone.)
I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s. I remember impassioned rants about how nobody needed anything more than ASCII from programmers who didn't want to have to deal with iconv, thought encoding was too much hassle, claimed it was too slow, etc. as if that excused not being able to handle valid requests. About a decade ago I worked at a major university where the account management system crashed on apostrophes or accents (in a heavily Italian town!) and it was just excused as the natural order of things so the team could work on more interesting problems.
Take the country flag emoji. They're actually two separate code points. The 26 code points used are just special country code letters A to Z. The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points.
Another example is the new skin tone emoji. The new codes are just the colour and are placed after the existing emoji codes. Existing software just shows the normal coloured emoji, but you may see a square box or question mark symbol next to it.
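Both mechanisms are easy to poke at in Python; the `flag()` helper below is hypothetical, just mapping ASCII letters onto the regional indicator block U+1F1E6..U+1F1FF:

```python
def flag(cc: str) -> str:
    """Build a flag emoji from a two-letter country code."""
    # Regional indicator A is U+1F1E6; offset each letter from 'A'.
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in cc.upper())

us = flag("US")
print(us)       # 🇺🇸
print(len(us))  # 2 -- two code points, one visible flag

# Skin tone: a modifier code point (U+1F3FB..U+1F3FF) after the base emoji
thumbs_up_medium = "\U0001F44D\U0001F3FD"  # 👍 + medium skin tone
print(len(thumbs_up_medium))  # 2
```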
Still not answering the question though.
For one, when the unicode standard was originally designed it didn't have emoji in it.
Second, if it was the arbitrary addition of thousands of BS symbols like emoji that necessitated such a design, we could rather do without emoji in Unicode at all (or Klingon or whatever).
So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...
Using less memory (like utf-8 allows) I guess is a valid concern.
> You can't really index Unicode characters like ASCII strings
But then why do strings-are-UTF8 languages like Go or D make it so easy? Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, but doing something like proper string truncation is brutally hard?
UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead it just lets programmers pretend that we still are in the land of C strings.
I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):
- are not directly indexable
- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`
- are called out in the docs as being a vector of unsigned 8-bit integers internally
- support a len() method that is called out as returning the length of that vector
- can be sliced if you reaaaally need to get around the inability to index directly, but attempting to slice in the middle of a character causes a panic
They should have called that one bytelen() then.
And how do you get a proper offset for slicing? Do you then have to interpret the UTF-8 bytes yourself, or can you somehow get it via the chars() iterator or something similar?
To clarify: strings in Go are not necessarily UTF-8. String literals will be, because the source code is defined to be UTF-8, but strings values in Go can contain any sequence of bytes: https://blog.golang.org/strings
Note that this prints 2, because the character contains two bytes in UTF-8, even though the two bytes correspond to one codepoint: https://play.golang.org/p/BqGzW1O2WX
Go also has the concept of a rune, which is separate from a byte and a string, and makes this easier when you're working with raw string encodings.