Unfortunately, as long as you believe that you can index into a Unicode string, your code is going to break. The only question is how soon.
I actually like UTF-8 because it will break very quickly and force the programmer to do the right thing. The first time you hit é or € or an emoji, you'll have a multibyte character, and you'll need to deal with it.
All the other options will also break, but later on:
- If you use UTF-16, then é and € will work, but emoji will still result in surrogate pairs.
- If you use a 4-byte representation, then you'll be able to treat most emoji as single characters. But then somebody will build é from two separate code points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or skin tone emoji, and once again, you're back at square one.
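The two-code-point é is easy to demonstrate with nothing but Python's stdlib `unicodedata` module; a quick sketch:

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e + U+0301 COMBINING ACUTE ACCENT

print(len(composed))    # 1
print(len(decomposed))  # 2
print(composed == decomposed)  # False, even though both render as é

# NFC normalization folds the pair back into the precomposed form
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

So even a fixed 4-byte encoding doesn't give you "one index per visible character" without normalizing first, and normalization still can't compose sequences that have no precomposed form.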
You can't really index Unicode characters like ASCII strings. Written language is just too weird for that. But if you use UTF-8 (with a good API), then you'll be forced to accept very quickly that "str[3]" is hopeless. It also helps a lot if your language has separate types for "byte" and "Unicode code point", so you can't accidentally treat a single byte as a character.
Depending on the task at hand, iterating by UTF-8 byte or by code point can make sense, too. And the definition of these is frozen regardless of Unicode version, which makes these safer candidates for "fundamental" operations. There is no right unit of iteration for all tasks.
If I want to know how much memory to allocate, bytes are it. If I want to know how much screen space to allocate, font rendering metrics are it. If I want to do word-breaking, grapheme clusters are it.
None of these are fundamental.
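One short string already gives a different answer for each unit; a stdlib-only Python sketch (counting grapheme clusters would additionally need a segmentation library):

```python
# One string, several "lengths": decomposed é, €, 😀
s = "e\u0301\u20ac\U0001F600"

print(len(s.encode("utf-8")))  # 10 bytes (1 + 2 for e+accent, 3 for €, 4 for 😀)
print(len(s))                  # 4 code points
# Grapheme clusters: 3 -- but the stdlib has no way to count those.
```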
There are languages whose orthographies don't fit the Unicode grapheme cluster specification, but they're complex enough that I doubt there's any way to deal with them properly other than having someone proficient in them look over your text processing, or pawning it off to a library. At least with grapheme clusters your code won't choke on something as simple as unnormalized Latin text.
Size in memory/bytes you can get trivially for any string (and this doesn't change with whether you choose bytes, graphemes, code points or whatever to iterate).
Screen space is irrelevant/orthogonal to encoding -- it's determined at the font level, and the font rendering engine that gives you the metrics will accept whatever encoding you have.
I also wish Python would expose more of the Unicode database than it does; I've had to turn to third-party modules that basically build their own separate database for some Unicode stuff Python doesn't provide (like access to the Script property).
Depends on what you want to index into it for. I'll admit that once upon a time I opposed adding a "truncate at N characters" template helper to Django since there was a real risk it would cut in the middle of a grapheme cluster, and I don't particularly care for the compromise that ended up getting it added (it normalizes the string-to-truncate to a composed form first to try to minimize the chance of slicing at a bad spot).
But when you get right down to it, what I do for a living is write web applications, and sometimes I have to write validation that cares about length, or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time, and I'd rather have it behave as a sequence of code points than have it behave as a sequence of bytes in a variable-width encoding.
As to whether UTF-8 forces people to deal with Unicode up-front, I very strongly disagree; UTF-8 literally has as a design goal that it puts off your need to think about anything that isn't ASCII.
I had the need to write grapheme-level word wrap in Rust. Here it is. It assumes all graphemes have the same visible width. This is used mostly for debug output, not for general text rendering.
[1] https://github.com/John-Nagle/rust-rssclient/blob/master/src...
That is my use case for Python as well.
> sometimes I have to write validation that cares about length,
That's where a truncation function that understands grapheme clusters would come in so handy. Tell it that you want to truncate to n bytes maximum and let it chop a bit more so as not to split a grapheme cluster.
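A minimal Python sketch of the byte-budget half of that idea -- it guarantees it won't split a code point, though being grapheme-cluster-safe would need a segmentation library on top; `truncate_utf8` is a hypothetical helper, not from any library:

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Truncate s to at most max_bytes of UTF-8 without splitting a code point.

    NOT grapheme-cluster-safe: it can still split é-as-two-code-points,
    flags, skin tone sequences, etc.
    """
    b = s.encode("utf-8")
    if len(b) <= max_bytes:
        return s
    # Decoding with errors="ignore" silently drops a partial trailing
    # sequence, so the cut always lands on a code point boundary.
    return b[:max_bytes].decode("utf-8", errors="ignore")

print(truncate_utf8("héllo", 3))  # 'hé' (h is 1 byte, é is 2)
print(truncate_utf8("héllo", 2))  # 'h'  (cutting é in half drops it entirely)
```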
Fortunately my database does not have fixed-width strings, so I rarely bump into this one.
> or about finding specific things in specific positions, and so indexing into a string is something I have to do from time to time
I write my code to avoid this. Yes, I still have to use an index because that's what Python supports, but it would be trivial to convert it to another language that supports string iterators.
Combining accents are a great example: if you're an American, especially in the 80s, it's easy to assume that you only need a couple of accents like you used in Spanish and French classes and that's really simple for converting old data to a new encoding. Later, it becomes obvious that far more are needed but by then there's a ton of code and data in the wild so you end up needing the concept of normalization for compatibility.
(That's the same lapse which led to things like UCS-2 assuming 2^16 characters even though that's not enough for a full representation of Chinese alone.)
I think it's also worth remembering the combination of arrogance and laziness which was not uncommon in the field, especially in the 90s. I remember impassioned rants about how nobody needed anything more than ASCII from programmers who didn't want to have to deal with iconv, thought encoding was too much hassle, claimed it was too slow, etc. as if that excused not being able to handle valid requests. About a decade ago I worked at a major university where the account management system crashed on apostrophes or accents (in a heavily Italian town!) and it was just excused as the natural order of things so the team could work on more interesting problems.
Take the country flag emoji. They're actually two separate code points. The 26 code points used are just special country code letters A to Z. The pair of letters is the country code and shows up as a flag. So just 26 codes to make all the flags in the world. Plus new ones can be added easily without having to add more code points.
Another example is the new skin tone emoji. The new codes are just the colour and are placed after the existing emoji codes. Existing software just shows the normal coloured emoji, but you may see a square box or question mark symbol next to it.
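Both mechanisms are easy to poke at in Python; the `flag()` helper below is hypothetical, just mapping ASCII letters onto the regional indicator block U+1F1E6..U+1F1FF:

```python
def flag(cc: str) -> str:
    """Build a flag emoji from a two-letter country code."""
    # Regional indicator A is U+1F1E6; offset each letter from 'A'.
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in cc.upper())

us = flag("US")
print(us)       # 🇺🇸
print(len(us))  # 2 -- two code points, one visible flag

# Skin tone: a modifier code point (U+1F3FB..U+1F3FF) after the base emoji
thumbs_up_medium = "\U0001F44D\U0001F3FD"  # 👍 + medium skin tone
print(len(thumbs_up_medium))  # 2
```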
Still not answering the question though.
For one, when the unicode standard was originally designed it didn't have emoji in it.
Second, if it was the arbitrary addition of thousands of BS symbols like emoji that necessitated such a design, we could rather do without emoji in Unicode at all (or Klingon or whatever).
So, the question is rather: why not a design that doesn't need "normalization" and runes, code points, and all that...
Using less memory (like utf-8 allows) I guess is a valid concern.
> You can't really index Unicode characters like ASCII strings
But then why do strings-are-UTF8 languages like Go or D make it so easy? Why optimize your syntax with `len(txt)` and `txt[0]` when, as you point out, you can't do that? Why make it trivial to split code points or composed character sequences, but doing something like proper string truncation is brutally hard?
UTF-8's fail-fast property has not enabled more Unicode-savviness. Instead it just lets programmers pretend that we still are in the land of C strings.
I like Rust's approach. It's a strings-are-UTF8 language but strings (both str and String):
- are not directly indexable
- force you to be explicit when iterating: you iterate over either `s.chars()` or `s.bytes()`
- are called out in the docs as being a vector of unsigned 8-bit integers internally
- support a len() method that is called out as returning the length of that vector
- can be sliced if you reaaaally need to get around the inability to index directly, but attempting to slice in the middle of a character causes a panic
They should have called that one bytelen() then.
And how do you get a proper offset for slicing? Do you then have to interpret the UTF-8 bytes yourself, or can you somehow get it via the chars() iterator or something similar?
To clarify: strings in Go are not necessarily UTF-8. String literals will be, because the source code is defined to be UTF-8, but strings values in Go can contain any sequence of bytes: https://blog.golang.org/strings
Note that this prints 2, because the character contains two bytes in UTF-8, even though the two bytes correspond to one codepoint: https://play.golang.org/p/BqGzW1O2WX
Go also has the concept of a rune, which is separate from a byte and a string, and makes this easier when you're working with raw string encodings.