undefined | Better HN

0 pointsAvernar8y ago0 comments

> I think UTF-8 is generally the wrong way to expose Unicode to a programmer, since it lets them think they can keep that cherished "one byte == one character" assumption right up until something breaks at 2AM on a weekend.

The solution to that is simple, don't let the programmer access individual bytes in a Unicode string.

Get rid of indexing into them and replace it with iterators. Make string handling functions work on code points at the very least but better yet on grapheme clusters. There's a little more to it than that but it's a good start.

Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I belive it's the ability to index into strings is what's to blame and not the encoding used.

0 comments

ninkendo8y ago

Agreed, assuming O(1) lookup of anything inside a string only leads to bad encoding bugs. UTF-8 everywhere, no exceptions.

You can never assume any user-visible character will align evenly with any byte boundary, even if you're using UTF-32. Composed characters throw that assumption out the window, as well as dozens of other unicode quirks I can't recall now.

j / k navigate · click thread line to collapse

0 pointsAvernar8y ago0 comments

The solution to that is simple, don't let the programmer access individual bytes in a Unicode string.

Yes, people are still stuck in the ASCII mindset and can't seem to get away from thinking in bytes. But I belive it's the ability to index into strings is what's to blame and not the encoding used.

0 comments

ninkendo8y ago

Agreed, assuming O(1) lookup of anything inside a string only leads to bad encoding bugs. UTF-8 everywhere, no exceptions.

j / k navigate · click thread line to collapse