UTF-16 is annoying, but it's far from the biggest design failure in Unicode.
Then there is also the issue that technically there is no such thing as UTF-16, instead you need to distinguish UTF-16LE and UTF-16BE. Even though approximately no one uses the latter we still can't ignore it and have to prepend documents and strings with byte order markers (another wasted pair of code points for the sake of an encoding issue) which mean you can't even trivially concatenate them anymore.
Meanwhile UTF-8 is backwards compatible with ASCII, byte order independent, has tons of useful properties and didn't require any Unicode code point assignments to achieve that.
The only reason we have UTF-16 is because early adopters of Unicode bet on UCS-2 and were too cheap to correct their mistake properly when it became clear that two bytes wasn't going to be enough. It's a dirty hack to cover up a mistake that should have never existed.
That's a strange way to characterize years of backwards compatibility to deal with
https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...
UTF-32 is the worst of all worlds. UTF-16 has the teeny tiny advantage that pure Chinese text takes a bit less space in UTF-16 than UTF-8 (typically irrelevant because that advantage is outweighed by the fact that the markup surrounding the text takes more space). UTF-8 is the best option for pretty much everything.
As a consequence, never use UTF-32, only use UTF-16 where necessary due to backwards compatibility, always use UTF-8 where possible.
There's also the problem that grapheme cluster boundaries change over time. Unicode has become a true mess.
Not really. Unicode is still fundamentally based off of the codepoints, which go from 0 to 2^16 + 2^20, and all of the algorithms of Unicode properties operate on these codepoints. It's just that Unicode has left open a gap of codepoints so that the upper 2^20 codepoints can be encoded in UTF-16 without risk of confusion of other UCS-2 text.
The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.