Better HN

jeberle 8mo ago

UTF-16 arguably is Unicode 2.0+. It's how the code point address space is defined. A code point is encoded as either 1 or 2 16-bit code units. Easy. Compare with UTF-8, where a code point may take 1, 2, 3, or 4 8-bit code units.

UTF-16 is annoying, but it's far from the biggest design failure in Unicode.
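The code-unit counts in the comment above can be checked with a short Python sketch. The helper name `code_units` is made up for illustration; only the standard `str.encode` method is used.

```python
def code_units(ch):
    """Return (UTF-16 code units, UTF-8 code units) for a single code point."""
    utf16_units = len(ch.encode("utf-16-le")) // 2  # each unit is 16 bits
    utf8_units = len(ch.encode("utf-8"))            # each unit is 8 bits
    return utf16_units, utf8_units


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
# U+0041 (1, 1)
# U+00E9 (1, 2)
# U+20AC (1, 3)
# U+1F600 (2, 4)
```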

account42 8mo ago

We can argue about "biggest" all day long but UTF-16 is a huge design failure because it made a huge chunk of the lower Unicode space unusable, thereby making better encodings like UTF-8 that could easily represent those code points less efficient. This layer-violating hack should have made it clear that UTF-16 was a bad idea from the start.

Then there is also the issue that technically there is no such thing as plain UTF-16; you need to distinguish UTF-16LE and UTF-16BE. Even though approximately no one uses the latter, we still can't ignore it and have to prefix documents and strings with byte order marks (another wasted pair of code points for the sake of an encoding issue), which means you can't even trivially concatenate them anymore.

Meanwhile UTF-8 is backwards compatible with ASCII, byte order independent, has tons of useful properties and didn't require any Unicode code point assignments to achieve that.

The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly when it became clear that two bytes wasn't going to be enough. It's a dirty hack to cover up a mistake that should have never existed.
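To make the byte order mark and ASCII-compatibility points above concrete, here is a minimal Python sketch. It relies on the behavior of Python's built-in `"utf-16"` codec, which prepends a BOM when encoding; the variable names are arbitrary.

```python
# Python's generic "utf-16" codec writes a BOM in the platform's byte order.
part_a = "foo".encode("utf-16")
part_b = "bar".encode("utf-16")

# Naive concatenation: the first BOM is consumed as a byte-order signal,
# but the second one survives as a U+FEFF character inside the text.
joined = (part_a + part_b).decode("utf-16")
print(repr(joined))  # 'foo\ufeffbar'

# The same code point has two different UTF-16 byte sequences.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))  # b'A\x00' b'\x00A'

# UTF-8 has a single byte order, needs no BOM, and matches ASCII for ASCII text.
utf8_joined = ("foo".encode("utf-8") + "bar".encode("utf-8")).decode("utf-8")
print(utf8_joined)  # foobar
print("foo".encode("utf-8") == "foo".encode("ascii"))  # True
```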

anonymars 8mo ago

> The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly

That's a strange way to characterize having years of backwards compatibility to deal with.

https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...

account42 8mo ago

There are many OS interfaces that were deprecated after five years or even longer. It's been several multiples of those five years since then, and we'll likely have to deal with UTF-16 for much longer still. Having to provide backwards compatibility for UTF-16 interfaces doesn't mean they had to keep these as the defaults or provide new UTF-16 interfaces. In particular, Win32 already has 8-bit char interfaces that Microsoft could have easily added UTF-8 support to right then and re-blessed as the default. The decision not to do that was not a technical one but a political one.

adgjlsfhk1 8mo ago

UTF-16 is the worst of all worlds. Either use UTF-32, where code points are fixed-size, or if you care about space efficiency use UTF-8.

mort96 8mo ago

UTF-32 is arguably even more the worst of all worlds. You don't get fixed-size units in any meaningful way. Yes, you have fixed-size code points, but those aren't the "units" you care about; you still have variable-size grapheme clusters, so you still can't do things like reversing a string or splitting a string at an arbitrary index or anything else like that. Yet it consumes twice the space of UTF-16 for almost everything, and four times the space of UTF-8 for many things.

UTF-32 is the worst of all worlds. UTF-16 has the teeny tiny advantage that pure Chinese text takes a bit less space in UTF-16 than UTF-8 (typically irrelevant because that advantage is outweighed by the fact that the markup surrounding the text takes more space). UTF-8 is the best option for pretty much everything.

As a consequence: never use UTF-32, only use UTF-16 where necessary for backwards compatibility, and always use UTF-8 where possible.
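The size and grapheme-cluster claims above can be sanity-checked with a small Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not taken from the thread.

```python
def encoded_sizes(text):
    """Byte length of `text` in each encoding (BOM-free variants)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


markup = "<p class='note'>hello</p>"    # ASCII-only markup
chinese = "统一码是一种字符编码标准"      # 12 BMP CJK characters

print(encoded_sizes(markup))   # {'utf-8': 25, 'utf-16-le': 50, 'utf-32-le': 100}
print(encoded_sizes(chinese))  # {'utf-8': 36, 'utf-16-le': 24, 'utf-32-le': 48}

# Fixed-size code points do not give fixed-size user-perceived characters.
s = "e\u0301a"            # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(s))             # 3 code points, but only 2 grapheme clusters
print(repr(s[::-1]))      # 'ảe' with the accent now attached to 'a': 'a\u0301e'
```

Reversing by code point moves the combining accent onto the wrong base letter, which is the failure mode described above even when every code point occupies exactly 32 bits.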

kbolino 8mo ago

In order to implement grapheme cluster segmentation, you have to start with a sequence of Unicode scalars. In practice, that means a sequence of 32-bit integers, which is UTF-32 in all but name. It's not a good interchange format, but it is a necessary intermediate/internal format.

There's also the problem that grapheme cluster boundaries change over time. Unicode has become a true mess.
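As a rough illustration of "UTF-32 in all but name": in Python, iterating a `str` yields code points, and the resulting integers are the same values you get by unpacking the UTF-32 bytes. The sample text is an arbitrary assumption.

```python
import struct

text = "e\u0301\U0001F600"  # 'e', COMBINING ACUTE ACCENT, GRINNING FACE

# Sequence of Unicode scalar values, the input a segmentation algorithm works on.
scalars = [ord(ch) for ch in text]

# The same values read back out of little-endian UTF-32 bytes.
utf32 = text.encode("utf-32-le")
unpacked = list(struct.unpack(f"<{len(utf32) // 4}I", utf32))

print([hex(v) for v in scalars])  # ['0x65', '0x301', '0x1f600']
print(scalars == unpacked)        # True
```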

jcranmer 8mo ago

> It's how the code point address space is defined.

Not really. Unicode is still fundamentally based on the codepoints, which go from 0 to 2^16 + 2^20 - 1, and all of the algorithms and Unicode properties operate on these codepoints. It's just that Unicode has left open a gap of codepoints so that the upper 2^20 codepoints can be encoded in UTF-16 without risk of confusion with other UCS-2 text.

jeberle (OP) 8mo ago

You forgot `- 2^11` for the surrogate pairs. Gee, why isn't Unicode 2^21 code points? To understand the Unicode code point space you must understand UTF-16. The code space is defined by how UTF-16 works. That was my initial point.
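The arithmetic being argued over, and the way UTF-16 reaches the upper 2^20 code points, can be written out as a short Python sketch. The constant and function names are illustrative; the surrogate ranges U+D800..U+DBFF and U+DC00..U+DFFF are the standard ones.

```python
BMP = 2**16            # code points reachable with a single 16-bit unit
SUPPLEMENTARY = 2**20  # 10 bits from the high surrogate + 10 from the low
SURROGATES = 2**11     # U+D800..U+DFFF, reserved for UTF-16 pairs

code_points = BMP + SUPPLEMENTARY         # U+0000..U+10FFFF
scalar_values = code_points - SURROGATES  # code points that are not surrogates
print(code_points, scalar_values)         # 1114112 1112064


def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

The 2^20 term comes from two 10-bit halves, which is why the code space tops out at U+10FFFF rather than at 2^21 - 1.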

jcranmer 8mo ago

If you're going to count the surrogate pairs as not-a-Unicode-codepoint, you should also count the other noncharacters: the last two codepoints on each of the 17 planes and the range U+FDD0-U+FDEF.

The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.
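For reference, the noncharacters listed above can be counted with a few lines of Python. The function name `is_noncharacter` is a hypothetical helper, and the check is written directly from the two ranges named in the comment.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Walk the whole code space U+0000..U+10FFFF.
count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)  # 66 = 32 in U+FDD0..U+FDEF + 2 per plane * 17 planes
```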

welferkj 8mo ago

UTF-8 is superior simply because you can trivially choose to parse it as ASCII and ignore all the weird foreign bytes.
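A minimal Python sketch of that property, assuming a made-up `key=value` record format: every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so ASCII delimiters can be found at the byte level without decoding anything.

```python
line = "name=José,city=東京,note=ok".encode("utf-8")

# Splitting on ASCII bytes is safe: ',' and '=' can never occur inside
# the encoding of a non-ASCII character.
fields = dict(part.split(b"=", 1) for part in line.split(b","))
print(fields[b"city"].decode("utf-8"))  # 東京

# "Ignoring the foreign bytes" is just a filter on the high bit.
ascii_only = bytes(b for b in line if b < 0x80)
print(ascii_only)  # b'name=Jos,city=,note=ok'
```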
