
exceptione 2mo ago

What do you mean by non-English text? I don't think "Ä" is more efficient in UTF-16 than in UTF-8. Or do you mean UTF-16 wins for non-Latin scripts with variable width? I always had the impression that UTF-8 wins on the vast majority of symbols, and that for very complex variable-width character sets it depends on whether the characters are narrow enough for UTF-16 to accommodate them. On a tangent, I wonder if emoji would fit that bill too.


Tuna-Fish 2mo ago

Japanese, Chinese, Korean and Indic scripts are mostly 2 bytes per character in UTF-16 and mostly 3 bytes per character in UTF-8.
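
For anyone who wants to check these numbers, here is a minimal Python sketch that also covers the "Ä" and emoji cases from the question. The helper name `encoded_sizes` and the sample strings are made up for illustration; sizes are measured without a byte order mark.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM."""
    # "utf-16-le" is used instead of "utf-16" so that Python does not
    # prepend a 2-byte byte order mark to the result.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Illustrative samples only; real text will mix scripts and ASCII.
samples = {
    "Latin letter with umlaut": "\u00c4",  # Ä
    "Japanese": "日本語",
    "Chinese": "中文",
    "Korean": "한국어",
    "Hindi (Devanagari)": "हिन्दी",
    "Emoji": "\U0001F600",  # outside the Basic Multilingual Plane
}

for label, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: {len(text)} code points, "
          f"UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

Under these assumptions, "Ä" takes 2 bytes in both encodings and the emoji takes 4 bytes in both (a surrogate pair in UTF-16), while the CJK and Devanagari samples take 3 bytes per code point in UTF-8 and 2 in UTF-16.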

divingdragon 2mo ago

Really, as an East Asian language user the rest of the comments here make me want to scream.

exceptione (OP) 2mo ago

I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

gfody 2mo ago

> I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

there are over a million code points in unicode, thousands of them for latin and other language-agnostic symbols, emoji, etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications, hence it being the internal format of strings in C#, sql server and such.

the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias. sure, utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, but that's not my database and probably not most people's
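
The markup point can be tried out with a small Python sketch. The sentence, the HTML wrapper and the helper name `utf8_and_utf16_bytes` are all hypothetical, chosen only to show how the balance shifts once ASCII markup surrounds non-Latin text; the real ratio depends entirely on the data.

```python
def utf8_and_utf16_bytes(text: str) -> tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# A plain Japanese sentence, as it might be stored in a database column.
body = "吾輩は猫である。名前はまだ無い。"

# The same sentence wrapped in ASCII-only markup, as it might appear in a page.
page = f'<div class="post"><p lang="ja">{body}</p></div>'

for label, text in (("plain text", body), ("with markup", page)):
    utf8, utf16 = utf8_and_utf16_bytes(text)
    smaller = "UTF-8" if utf8 < utf16 else "UTF-16"
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes, "
          f"smaller: {smaller}")
```

In this sample the bare sentence is smaller in UTF-16, but the wrapped version is smaller in UTF-8, because every ASCII character of markup costs 1 byte in UTF-8 and 2 in UTF-16.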


gfody 2mo ago

hn often makes me want to scream
