degamad 7mo ago

It's not just the variable byte length that causes an issue; in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points rather than being characters themselves. That's a huge part of the problem.
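To make the "code points that modify other code points" part concrete, here is a minimal Python sketch. The choice of "é" is just an illustration (it isn't from the thread), and it only uses the standard-library `unicodedata` module.

```python
import unicodedata

# "é" written as a base letter followed by a combining acute accent (U+0301).
decomposed = "e\u0301"
# The same user-perceived character as one precomposed code point (U+00E9).
precomposed = "\u00e9"

# len() counts code points, not what a reader would call characters.
print(len(decomposed), len(precomposed))

# The two strings render the same but do not compare equal.
print(decomposed == precomposed)

# A non-zero combining class marks U+0301 as a modifier of the preceding code point.
print(unicodedata.combining("\u0301"))

# NFC normalization composes the pair into the single precomposed code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

None of this depends on how the code points are encoded as bytes, which is why it is a separate problem from variable byte length.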

amake 7mo ago

That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.

degamad (OP) 7mo ago

Yep, that's the point I was making - that choosing fixed 4-byte code points doesn't significantly reduce the complexity of capturing everything that Unicode does.
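A rough sketch of that point in Python, assuming a base letter plus a combining accent as the sample text. The `code_units` helper is hypothetical, written only for this illustration.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Count fixed-size code units in the encoded form of text."""
    return len(text.encode(encoding)) // unit_size

# One user-perceived character built from two code points: "e" + U+0301.
sample = "e\u0301"

# UTF-8 is variable width: 1 byte for "e" plus 2 bytes for U+0301.
utf8_bytes = len(sample.encode("utf-8"))

# UTF-32 is fixed width (4 bytes per code point), yet it still needs two units.
utf32_units = code_units(sample, "utf-32-le", 4)

print(utf8_bytes, utf32_units)
```

Fixed-width units make indexing by code point trivial, but the accent is still a separate unit from its base letter, so slicing or reversing by code unit can still split what a reader sees as one character.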

eru 7mo ago

Thanks for explaining!

bawolff 7mo ago

That goes all the way back to the beginning.

Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.

degamad (OP) 7mo ago

Agreed, we just conveniently forget about those when speaking about how complex Unicode is.
