Why we can't process Emoji anymore (opens in new tab)

(gist.github.com)

258 pointstpinto13y ago152 comments

152 comments

This is why UTF-8 is great. If it works for any Unicode character it will work for them all. Surrogate pairs are rare enough that they are poorly tested. With UTF-8, if there are issues with multi-byte characters, they are obvious enough to get fixed.

UTF-16 is not a very good encoding. It only exists for legacy reasons. It has the same major drawback as UTF-8 (variable-length encoding) but none of the benefits (ASCII compatibility, size efficient).

notJim13y ago

This comment is somewhat misleading. The issue at hand is orthogonal to any of the benefits of UTF-8 over UTF-16 (which are real, UTF-8 is great, you should use it.)

4-byte characters in UTF-8 are just as rare as surrogate pairs are just as rare in UTF-16, because they both are used to represent non-BMP characters. As a result, there is software that handles 3-byte characters (i.e., a huge percentage of what you'll ever see), but doesn't handle 4-byte characters.

MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....

snogglethorpe13y ago

The problem is that handling 1 unit is very different from 2+ units, in terms of coding patterns, whereas 3 is not so different from 4+. In the latter case there's already probably a loop to handle multiple unit characters, which will in many cases work without change for longer sequences (and if not, probably the code probably requires very little change to do so).

So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.

pixelcort13y ago

The problem with UTF-8 is that lots of tools have 3 byte limits, and characters like Emoji take up 4 bytes in UTF-8.

potatolicious13y ago

Honest question: but isn't that just a broken implementation (and a very obvious brokenness at that)? It seems to me there's a big difference between someone not coding to the standard, and the standard making your taks impossible.

2 more replies

pjscott13y ago

How many tools have 3-byte limits on UTF-8? The only one I can think of right now is MySQL. (The workaround is to specify the utf8mb4 character set. This is MySQL's cryptic internal name for "actually doing UTF-8 correctly.")

3 more replies

thristian13y ago

I don't think it's specifically a 3-byte limit, I think it's just that lots of tools decode UTF-8 into UCS-2 internally instead of UTF-16.

1 more reply

eps13y ago

Which tools?

Honest question, as the three byte limit seems rather arbitrary and no more logical than, say, a four byte one.

2 more replies

Tsagadai13y ago

File bugs with those tools. These sort of issues should have been sorted years ago and any program that can't do 3+ byte character encodings should be named and shamed.

derleth13y ago

And the problem with UTF-16 is that a lot of applications can't handle surrogate pairs, except a lot of Emoji are above the BMP, aren't they? So why is this a bigger deal for UTF-8 than UTF-16?

1 more reply

est13y ago

UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Every character is juest 2 bytes, instead of 1, 2, 3 or even 4 bytes.

In python:

    len(u'汉字') == 2
    len( '汉字') == 4 # or maybe 6, it varies based on console encoding and CPython options
    len(u'汉字'.encode('utf8')) == 6

csense13y ago

Issues like this are why I hate internationalization.

If it was simple as making everything Unicode and it Just Working, it would be possible. But the number of difficulties and problems I've seen have made me decide -- and tell everyone I know -- to avoid dealing with internationalization if you value your sanity.

Issues discussed here:

* Different incompatible variable-length encodings

* Broken implementations

* Character count != length

Issues discussed elsewhere:

* It's the major showstopper keeping people away from Python 3

* Right-to-left vs. left-to-right [1]

* BOM at the beginning of the stream

Conceptual issues -- questions I honestly don't know the answer to when it comes to internationalization. I don't even know where to look to find answers to these:

* If I split() a string, does each piece get its own BOM?

* If I copy-paste text characters from A into B, what encoding does B save as? If B isn't a text editor, what happens?

* If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?

* When it encounters right-to-left, does the renderer have to scan the entire string to figure out how far to the right it goes? Wouldn't this mean someone could create a malicious length-n string that took O(n^2) time to process?

* What happens if I try to print() a very long line -- more than a screenful -- with a right-to-left escape in a terminal?

* If I have a custom stream object, and I write characters to it, how does it "know" when to write the BOM?

* Do operators like [] operate on characters, bytes, 16-bit words, or something else?

* Does getting the length of a string really require a custom for loop with a complicated bit-twiddling expression?

* Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C API's that expect zero-terminated strings?

* If I split() a string to extract words, how do the substrings know the BOM, right-to-left, and other states that apply to themselves? What if those strings are concatenated with other strings that have different values for those states?

* What exactly does "generating locales" do on Debian/Ubuntu and why aren't those files shipped with all the other binary parts of the distribution? All I know about locale generation is that it's some magic incantation you need to speak to keep apt-get from screaming bloody murder every time you run it on a newly debootstrapped chroot.

* Is there a MIME type for each historical, current, and future encoding? How do Web things know which encoding a document's in?

* How do other tools know what encoding a document uses? Is this something the user has to manually tell the tool -- should I be saying nano thing.txt --encoding=utf8? If the information about the format isn't stored anywhere, do you just guess until you get something that seems not to cause problems?

* If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?

* Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?

* What automated tools exist to remove BOM's or change accented characters into regular ones, if other automated tools don't accept Unicode? Once upon a time, I could not get javac to recognize source files I'd downloaded which had the author's name, which included an 'o' with two dots over it, in comments. That was the only non-ASCII character in the files, and I ended up removing them; syncing local patches with upstream would have been a nightmare. Do people in different countries run incompatible versions of programming languages that won't accept source files that are byte-for-byte identical? It sounds ridiculous, but this experience suggests it may be the case.

[1] http://xkcd.com/1137/

4 more replies

notJim13y ago

I don't really think you can argue that UCS2 is better than UTF-8, because UCS2 is simply broken. It's not a reliable way to encode unicode characters. It's like arguing that my Ferrari which is currently on fire is a better car than your Lamborghini (which is not.) I mean, they do each have their merits, but a flaming car is not useful to anyone.

I think this is probably semantics, but I just wanted to point that out in case anyone is confused, which would be understandable because this shit is whack.

dietrichepp13y ago

Why do you want to count Unicode characters? Why do you care if it is fast to do so? Why would you ever need to use character-based string indexing?

UTF-16 solves problems that don't exist.

(Honestly, I would love it if someone could explain what the purpose of counting characters is, because I don't know why you'd ever do that, except when you're posting to Twitter.)

5 more replies

speleding13y ago

Counting code points in UTF8 can actually be faster than UCS2 in case you do not know the length in advance such as null-terminated strings. The UTF8 encoding is cleverly defined such that to find the number of code points you just need to check the number of bytes that have the top bit cleared. Since UTF8 strings are generally shorter than UCS2 it can be faster in some cases. Either way this is not a serious enough concern to use one encoding over the other.

threedaymonk13y ago

"Character". You keep saying that word …

    >>> len(u'épicé')
    5
    >>> len(u'épicé')
    7

ender713y ago

Apropos: http://mathiasbynens.be/notes/javascript-encoding

TL;DR:

- Javascript engines are free to internally represent strings as either UCS-2 or UTF-16. Engines that choose to go USC-2 tend to replace all glyphs outside of the BMP with the replacement char (U+FFFD). Firefox, IE, Opera, and Safari all do this (with some inconsistencies).

- However, from the point of view of the actual JS code that gets executed, strings are always UCS-2 (sort of). In UTF-16, code points outside the BMP are encoded as surrogate pairs (4 bytes). But -- if you have a Javascript string that contains such a character, it will be treated as two consecutive 2-byte characters.

  var x = '𝌆';
  x.length; // 2
  x[0];     // \uD834
  x[1];     // \uDF06

Note that if you insert said string into the DOM, it will still render correctly (you'll see a single character instead of two ?s).

notJim13y ago

I'm relatively comfortable with this stuff, but I am confused by your response.

First you say that engines will "internally" replace non-BMP glyphs with the replacement character, but then you give an example that seems to work fine (and I think would work fine as long as you don't cut that character in half, or try to inspect its character code without doing the proper incantations[1].)

So, I guess what I'm asking is, at what point does the string become "internal", such that the engine will replace the character with the replacement character?

[1]: As given in the article you linked to.

Kaworu13y ago

I dare not try and reexplain the discussion in this bug report as my understanding feels insufficient, but the entire discussion at http://code.google.com/p/v8/issues/detail?id=761#c14 (note, I've linked to the 14th commment in the discussion, but there's more interesting stuff above) talks about it. At the core is a distinction between v8's internal representation of strings and it's API vs. what a browser engine which embeds v8 might do.

othermaciej13y ago

Safari uses UTF-16, not UCS-2. I believe this is true of other browsers as well. Otherwise this would render the replacement char, but it doesn't, it renders correctly:

javascript:var x = '𝌆';document.write(x);

gsnedders13y ago

Well, a JS string is just a series of UTF-16 code-units (per ES5, there is no impl choice here), so there isn't really any encoding pre-se (and isn't necessarily a UTF-16 string, per the spec's definition thereof, as lone surrogates are valid). The fact that that works is more a testament to the the DOM being UTF-16 than JS.

(On the other hand, I'm sure you knew that. But probably there are people reading your comment who didn't. :))

2 more replies

mryall13y ago

I hate the implication in this comment and in the linked article that the spec is somehow immutable. The ECMAScript spec here is fundamentally flawed with regard to character encoding and needs to be fixed.

UCS-2 is not a valid Unicode encoding any more, because there are several sets of characters encoded outside the BMP. The spec should be updated to require UTF-16 support in all implementations.

If a modern programming language like JavaScript doesn't provide a way to represent characters outside the BMP in its character data type, that needs to be fixed too. Indexing and counting characters in a JavaScript string need to reflect the human and Unicode notion of characters, not the arbitrary 2-byte blocks that UCS-2 happens to use.

The language authors should be ashamed of this situation - having a modern language without proper Unicode support is simply awful.

praptak13y ago

Sometimes you need to know about encodings, even if you're just a consumer. Putting just one non 7-bit character in your SMS message will silently change its encoding from 7-bit (160 chars) to 8-bit (140 chars) or even 16 bit (70 chars) which might make the phone split it into many chunks. The resulting chunks are billed as separate messages.

fwr13y ago

On iOS, using any non-basic Latin character in SMS makes it switch to 16 bit, even when there is no reason for that to happen. It's a thing that most foreign language speakers must live with.

By doing this full of excuses write-up, this guy wasted a substantial amount of time that he could have spent better researching the issue. Your consumer doesn't care that Emoji is this much or that much bits, it doesn't matter for him that you're running your infrastructure on poorly chosen software - there is absolutely no excuse for not supporting this in a native iOS app, especially now that Emoji is so widely used and deeply integrated in iOS.

How is that a problem they are focusing on, anyway, when their landing page features awful, out of date mockups of the app? (not even actual screenshots - notice the positions of menu bar items) They are also featuring Emoji in every screenshot - ending support might be a fresh development, but I still find that ironic.

jrabone13y ago

Absolutely right. The customer does not care that you made a shortsighted decision to pick a language for a TEXT based system that cannot correctly support none-BMP Unicode. There are no excuses, surrogates have been out there for years (Windows was using UTF16-LE from NT 3.51 / Unicode 2) as have 4 byte UTF8 encodings.

JavaScript is a joke in this respect, and is keeping horrors like Shift-JIS alive long after they should have been retired.

speednoise13y ago

This was an internal email: https://medium.com/tech-talk/1aff50f34fc

1 more reply

jrabone13y ago

The GSM 03.38 charset specified for SMS is not straight 7-bit ASCII. See eg. http://www.dreamfabric.com/sms/default_alphabet.html

pjscott13y ago

The quick summary, for people who don't like ignoring all those = signs, is that V8 uses UCS-2 internally to represent strings, and therefore can't handle Unicode characters which lie outside the Basic Multilingual Plane -- including Emoji.

bbotond13y ago

Honestly that's a shame.

masklinn13y ago

It was fixed back in March though.

driverdan13y ago

If you search for V8 UCS-2 you'll find a lot of discussion on this issue dating back at least a few years. There are ways to work around V8's lack of support for surrogate pairs. See this V8 issue for ideas: https://code.google.com/p/v8/issues/detail?id=761

My question is why does V8 (or anything else) still use UCS-2?

gsnedders13y ago

The ES5 spec defines a string as being a series of UTF-16 code-units, which inherently means surrogates show through.

APIs like that tend to be low priority because they aren't used by browsers (which pass everything through as UTF-16 code-units, typically treating them as possibly-valid UTF-16 strings).

masklinn13y ago

> My question is why does V8 (or anything else) still use UCS-2?

Because the ES spec defines a string as a sequence of UTF-16 code units (aka UCS-2-with-visible-surrogates), because as many others (e.g. Java) the language's strings were created during/inherited from Unicode 1.0 which fit in 16 bits (UTF-16 is a retrofitting of Unicode 1.0 fixed-width to accomodate the full range of later unicode version by adding surrogate pairs)

est13y ago

because counting 2 bytes is much faster for computers than counting vary 1, 2, 3 or even 4 bytes.

speleding13y ago

This is not a real issue because counting code points in an UTF8 string is easy too: the encoding is cleverly defined such that you just need to check the number of bytes that have the top bit cleared. Since UTF8 strings are generally shorter it can even be faster than counting UTF-16 if you don't know the length in advance.

gkoberger13y ago

Took me a bit to realize that this is talking about the Voxer iOS app (http://voxer.com/), not Github (https://github.com/blog/816-emoji).

whit53713y ago

Yeah, I was worried there for a sec. :^)

hkmurakami13y ago

>Wow, you read though all of that? You rock. I'm humbled that you gave me so much of your attention.

That was actually really fun to read, even as a now non-technical guy. I can't put a finger on it, but there was something about his style that gave off a really friendly vibe even through all the technical jargon. That's a definite skill!

jgeorge13y ago

DeSalvo's source comments have always been an entertaining read. :)

beaumartinez13y ago

This is dated January 2012. By the looks of things, this was fixed in March 2012[1]

[1] https://code.google.com/p/v8/issues/detail?id=761#c33

Cogito13y ago

I wonder if this has been rolled into Node yet.

[edit] Node currently uses V8 version 3.11.10.25, which was released after this fix was made, but not sure if the fix was merged to trunk

[edit2] actually, looks like it has, though I can't identify the merge commit

ricardobeat13y ago

Please, if you're going to post text to a Gist at least use the .md extension:

https://gist.github.com/4151124

ctrlaltesc13y ago

Which enables an even more readable layout with gist.io http://gist.io/4151124

pbiggar13y ago

A couple of reasons why it makes sense for V8 and other vendors to use UCS2:

- The spec says UCS2 or UTF16. Those are the only options.

- UCS2 allows random access to characters, UTF-16 does not.

- Remember how the JS engines were fighting for speed on arbitrary benchmarks, and nobody cared about anything else for 5 years? UCS2 helps string benchmarks be fast!

- Changing from UCS2 to UTF-16 might "break the web", something browser vendors hate (and so do web developers)

- Java was UCS2. Then Java 5 changed to UTF-16. Why didn't JS change to UTF-16? Because a Java VM only has to run one program at once! In JS, you can't specify a version, an encoding, and one engine has to run everything on the web. No migration path to other encodings!

cmccabe13y ago

UCS2 allows random access to characters, UTF-16 does not.

I'm not sure if that's really true. On IBM's site, they define 3 levels of UCS-2, only one of which excludes "combining characters" (really code points).

http://pic.dhe.ibm.com/infocenter/aix/v6r1/index.jsp?topic=%...

If you have combining characters, then you can't simply take the number of bytes and divide by 2 to get the number of letters. If you don't have combining characters, then you have something which isn't terribly useful except for European languages (I think?)

Maybe someone more familiar with the implementation can describe which path they actually went down for this... given what I've heard so far, I'm not optimistic.

pbiggar13y ago

OK, I cracked into the V8 source to take a look at what actually happens. It looks like the implementation does use random access for two-byte strings. However, it also uses multiple multiple string implementations (ASCII, 2 byte strings, "consString" (I presume some kind of Rope), "Sliced Strings" (sounds like a rope again, but might be shared storage of the string contents for immutable strings)), so they could likely use other implementations with whatever properties they choose.

See https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2... and https://github.com/v8/v8/blob/3ff861bbbb62a6c0078e042d8077b2....

1 more reply

languagehacker13y ago

We seem to be seeing this more and more with Node-based applications. It's a symptom of the platform being too immature. This is why you shouldn't adopt these sorts of stacks unless there's some feature they provide that none of the more mature stacks support yet. And even then, you should probably ask yourself if you really need that feature.

fusiongyro13y ago

According to Cogito, this was fixed in March:

http://news.ycombinator.com/item?id=4834731

I want to agree with you simply because I don't like Node, but it's hardly fair to damn something over a bug that was fixed 9 months ago.

freedrull13y ago

Why on earth would the people who wrote V8 use UCS-2? What about alternative JS runtimes?

marshray13y ago

Because Unicode was sold to the world's software developers as a fixed-width encoding claiming 16 bits would be all we'd ever need.

thristian13y ago

Yes it was, in 1991 when NT and Java and Cocoa were new or under development. In 1996, Unicode 2.0 came out with surrogate pairs and astral planes, and Unicode was no longer a 16 bit encoding.

I'm pretty sure work on V8 started after 1996.

More likely is the idea that the authors of V8 felt that UCS-2 was an acceptable speed/correctness trade-off.

dmethvin13y ago

Yes, and several C/C++ conventions and types seemed to make that a safe choice, for example wchar_t. Let's face it, collectively we really screwed this one up. It's the biggest mistake since Microsoft chose the backslash as a path separator in DOS 2.0.

1 more reply

eps13y ago

They control their clients, so they could've just re-encoded emojies with custom 16bit escaping scheme, make the backend transparently relay it over in escaped form and decode it back to 17bits at the other end.

Or am I missing something obviuos here?

kstenerud13y ago

Small nitpick, but Objective-C does not require a particular string encoding internally. In Mac OS and iOS, NSString uses one of the cfinfo flags to specify whether the internal representation is UTF-16 or ASCII (as a space-saving mechanism).

dgreensp13y ago

The specific problems the author describes don't seem to be present today; perhaps they were fixed. That's not to say this conversions aren't a source of issues, just that I don't see any show-stopper problems currently in Node, V8, or JavaScript.

In JavaScript, a string is a series of UTF-16 code units, so the smiley face is written '\ud83d\ude04'. This string has length 2, not 1, and behaves like a length-2 string as far as regexes, etc., which is too bad. But even though you don't get the character-counting APIs you might want, the JavaScript engine knows this is a surrogate pair and represents a single code point (character). (It just doesn't do much with this knowledge.)

You can assign '\ud83d\ude04' to document.body.innerHTML in modern Chrome, Firefox, or Safari. In Safari you get a nice Emoji; in stock Chrome and Firefox, you don't, but the empty space is selectable and even copy-and-pastable as a smiley! So the character is actually there, it just doesn't render as a smiley.

The bug that may have been present in V8 or Node is: what happens if you take this length-2 string and write it to a UTF8 buffer, does it get translated correctly? Today, it does.

What if you put the smiley directly into a string literal in JS source code, not \u-escaped? Does that work? Yes, in Chrome, Firefox, and Safari.

jruderman13y ago

The invisible smiley was a font system problem, fixed in Firefox 19 Aurora (assuming you're on Mac).

https://bugzilla.mozilla.org/show_bug.cgi?id=715798

dale-cooper13y ago

The UCS-2 heritage is kind of annoying. In java for example, chars (the primitive type, which the Character class just wraps) are 16 bits. So one instance of a Character may not be a full "character" but rather a part of a surrogate pair. This creates a small gotcha where the length of a string might not be the same as the amount of characters it has. And that you just cant split/splice a Character array naively (because you might split it at a surrogate pair).

masklinn13y ago

Which, at the end of the day, doesn't really matter since a code point is not a "character" in the sense of "the smallest unit of writing" (as interpreted by an end-user): many "characters" may (depending on the normalization form) or will (jamo) span multiple codepoints. Splitting on a character array is always broken, regardless of surrogate pairs.

dale-cooper13y ago

Yes. What i'm saying is that it would feel less error prone if the character object was actually a codepoint. It's a leaky abstraction, you shouldn't need to handle something that is tied to the internal representation of strings in the jvm. Can one "character" span multiple codepoints? Do you have an example of this?

1 more reply

eloisant13y ago

Maybe nickpicking but I don't think Softbank came up with the Emoji. Emoji existed way before Softbank bought the Japanese Vodaphone, and even before Vodaphone bought J-Phone.

So emoji were probably invented by J-Phone, while Softbank was mostly taking care of Yahoo Japan.

adrianpike13y ago

Here's the thread in the v8 bug tracker about this issue: http://code.google.com/p/v8/issues/detail?id=761

Is there a reason that the workaround in comment 8 won't address some of these issues?

dgl13y ago

I don't think it's needed anymore.

If you read closely you'll see the original linked message is from January and there's an update on that issue from March when a fix was made in V8.

clebio13y ago

Somewhat meta, but this would be one where showing subdomain on HN submissions would be nice. The title is vague enough that I assumed it was something to do with _Github_ not processing Emoji (which would be sort of a strange state of affairs...).

ladon8613y ago

Not that strange, Github implements much of the Emoji set using different shortcuts, see the reference here:

http://www.emoji-cheat-sheet.com/

Before I read the article I guessed that maybe the icon set had some licensing issues for Github. Luckily, not so! (:smiley:)

clebio13y ago

That was basically my point. It would be strange if they _stopped_ processing it.

pla3rhat3r13y ago

I love this article. So often it has been difficult to explain to people why one set of characters can work while others will not. This lays out some great historical info that will be helpful going forward.

cjensen13y ago

UCS-16 is only used by programs which jumped the gun and implemented Unicode before it was all done. (It was 16 bits for awhile with Asian languages sharing code points so that the font in use determined whether the text was displayed as Chinese vs Japanese vs. etc). What Century was V8 written in that they thought UCS-16 was an acceptable thing to implement?

Good rule of thumb for implementers: get over it and use 32 bits internally. Always use UTF-8 when encoding into a byte stream. Add UTF-16 encoding if you must interface with archaic libraries.

masklinn13y ago

> UCS-16 is only used by programs which jumped the gun and implemented Unicode before it was all done.

There's no such thing as "all done", Unicode 1.0 was 16 bit, Unicode 6 was released recently.

evincarofautumn13y ago

Failures in Unicode support seem usually to result from the standard’s persistently shortsighted design—well intentioned and carefully considered though it undoubtedly is. It’s a “good enough” solution to a very difficult problem, but I wonder if we won’t see Unicode supplanted in the next decade.

All that aside: emoji should not be in Unicode. Fullstop.

FredericJ13y ago

How about this npm module : https://npmjs.org/package/emoji ?

xn13y ago

Here's the message decoded from quoted-printable: https://gist.github.com/4151707#file_emoji_sad_decoded.txt

mranney13y ago

Note that this message is almost a year old now. The issue has been addressed by the node and V8 teams.

shocks13y ago

Very informative, great read. Thanks!

alexbosworth13y ago

Fixed a good while ago for node.js

masklinn13y ago

Wow, the first half of the text is basically full of crap and claims which don't even remotely match reality, and now I'm reaching the technical section which can only get even more wrong.

masklinn13y ago

To whoever the downvoter was: no, seriously. For instance in the first few paragraphs:

* emoji were invented by NTT DoCoMo, not Softbank

* even if that had been right Softbank's copyrighting of their emoji representations has no bearing on NTT and KDDI/au using completely different implementations (and I do mean completely, KDDI/au essentially use <img> tags)

* lack of cooperation is endemic to japanese markets (especially telecoms) and has nothing to do with "ganging up"

* if NTT and au/KDDI wanted to gang up on Softbank you'd think they'd share the same emoji

* you didn't have to run "adware apps" to unlock the emoji keyboard (there were numerous ways to do so from dedicated — and usually quickly nuked – apps to apps "easter eggs" to jailbreak to phone backup edit/restore)

That's barely the first third.

sneak13y ago

TLDR: node sucks

akie13y ago

TLDR: The V8 engine can't (supposedly) encode Unicode codepoints that are over 16-bits in length, because it uses the UCS-2 encoding.

throwaway54-76213y ago

TLDR: v8 "sucks" (and doesn't support Unicode code points outside of the lowest ~64k characters).

Edit: v8 in general is pretty cool, but not supporting Unicode outside UCS-2 is pretty bad.

2 more replies

tptacek13y ago

It's v8's fault, and v8 does not suck.

prodigal_erik13y ago

Unicode 2.0 added surrogate pairs in 1996. Unfortunately, the first versions of both Java and JavaScript predated this and got strings horribly wrong, and now any conforming implementation of either is required to suck. The Right Thing would be for almost everyone to work with only combining character sequences, except for a rare few who need to know how to dissect one into its codepoints and reassemble them correctly (just as people don't normally need to extract high or low bits from an ASCII character).

1 more reply

sneak13y ago

Damn you for being right. :)

rymith13y ago

Not even a little bit accurate.

csense13y ago

A two-character sequence for a smiley face that should be compatible with everything in existence:

:)

Problem solved. Why is this front page material (#6 as of this writing)?

j / k navigate · click thread line to collapse

152 comments

oofabz13y ago

notJim13y ago

This comment is somewhat misleading. The issue at hand is orthogonal to any of the benefits of UTF-8 over UTF-16 (which are real, UTF-8 is great, you should use it.)

MySQL is a high-profile example of software which, until recently, had this problem: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8m....

snogglethorpe13y ago

So whereas it's rather common for programs to mis-handle multiple-unit UTF-16 characters, it seems much less likely that programs will mis-handle 4+ unit UTF-8 characters.

pixelcort13y ago

The problem with UTF-8 is that lots of tools have 3 byte limits, and characters like Emoji take up 4 bytes in UTF-8.

potatolicious13y ago

2 more replies

pjscott13y ago

3 more replies

thristian13y ago

I don't think it's specifically a 3-byte limit, I think it's just that lots of tools decode UTF-8 into UCS-2 internally instead of UTF-16.

1 more reply

eps13y ago

Which tools?

Honest question, as the three byte limit seems rather arbitrary and no more logical than, say, a four byte one.

2 more replies

Tsagadai13y ago

File bugs with those tools. These sort of issues should have been sorted years ago and any program that can't do 3+ byte character encodings should be named and shamed.

derleth13y ago

And the problem with UTF-16 is that a lot of applications can't handle surrogate pairs, except a lot of Emoji are above the BMP, aren't they? So why is this a bigger deal for UTF-8 than UTF-16?

1 more reply

est13y ago

UCS2 is better than UTF8 internally because it counts unicode characters faster than UTF8. Every character is juest 2 bytes, instead of 1, 2, 3 or even 4 bytes.

In python:

    len(u'汉字') == 2
    len( '汉字') == 4 # or maybe 6, it varies based on console encoding and CPython options
    len(u'汉字'.encode('utf8')) == 6