At first there was an empty space between the double quotes. This made me click and read the article because it was surprising that the length of a space would be 7.
Then the actual emoji appeared and the title finally made sense.
Now I see escaped \u{…} characters spelled out and it’s just ridiculous.
Can’t wait to come back tomorrow to see what it will be then.
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme clusters being the sensible default thing that people probably expect, on average.
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?
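To make that concrete, here is a rough Python sketch (the emoji is written with escapes so it survives plaintext): the same five-code-point string has three different "sizes" depending on the question you ask.

```python
# The facepalm emoji with skin tone: 5 code points joined with a ZWJ.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s))                           # 5: code points
print(len(s.encode("utf-8")))           # 17: bytes in UTF-8
print(len(s.encode("utf-16-le")) // 2)  # 7: UTF-16 code units (what JS .length counts)
```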
You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:
String.len() == number of bytes
String.bytes().count() == number of bytes
String.chars().count() == number of unicode scalar values
String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
String.lines().count() == number of lines
Really my only complaint is I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.

I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.
Just never ever use Extended ASCII (8-bits with codepages).
Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
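A quick Python sketch of that difference, using the combining vs precomposed forms of "é":

```python
import unicodedata

a = "e\u0301"   # 'e' + combining acute accent: two code points
b = "\u00e9"    # precomposed 'é': one code point

print(a == b)                                # False: bitwise comparison
print(unicodedata.normalize("NFC", a) == b)  # True: logical equivalence
```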
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
To predict the pixel width of a given text, right?
One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.
I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.
* I'm talking about the DOM route, not <canvas> obviously. VS Code is powered by Monaco, which is DOM-based, not canvas-based. You can "Developer: Toggle Developer Tools" to see the DOM structure under the hood.
** I should further qualify my statement as browsers are fundamentally incapable of this if you use native text node rendering. I have built a perfectly monospace mixed CJK and Latin interface myself by wrapping each full width character in a separate span. Not exactly a performance-oriented solution. Also IIRC Safari doesn’t handle lengths in fractional pixels very well.
Seemed awkward but I eventually realized I rarely cared about the number of characters. Even when dealing with substrings, I really only cared about a means to describe “stuff” before/after, not literal indices.
Counting Unicode characters is actually a disservice.
The metrics you care about are likely the number of letters from a human perspective [1] or the number of bytes of storage (depends), possibly both.
[1]: https://tomsmeding.com/unicode#U+65%20U+308

[2]: https://tomsmeding.com/unicode#U+EB
In an environment that supports advanced Unicode features, what exactly do you do with the string length?
When I'm comparing human-readable strings I want the length. In all other cases I want sizeof(string) and it's... quite a variable thing.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever) for anything beyond ASCII (which you really should handle, since East Asian users alone number in the billions).
Same goes for "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
And before that, the only thing that relative rarity did for you was that bugs in code working on UTF-8 bytes got fixed, while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
Might be a little long for a title :)
"\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7
… for JavaScript.

You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is a man with skin tone facepalming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte; is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
The unit is perfectly meaningful.
It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)
Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
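You can glimpse that flexible storage from CPython itself (an implementation detail of CPython's PEP 393 representation, not a language guarantee), e.g. via sys.getsizeof:

```python
import sys

# CPython stores a string with 1, 2, or 4 bytes per code point,
# depending on the widest code point it contains.
narrow = "a" * 100          # fits in 1 byte per code point
wide = "\U0001F926" * 100   # needs 4 bytes per code point

print(sys.getsizeof(narrow) < sys.getsizeof(wide))  # True on CPython
```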
The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.
I don't understand what you mean by "USV count".
> but what is a character?
It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.
> …but "5" or "7"? Where do those even come from?
From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.
> Again: "character in the implementation" is a meaningless concept.
"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
I just relied on this fact yesterday, so it's kind of a funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was what Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was to iterate through all the codepoints in the file, tally it all up by Unicode block, and print the results. Something this design was perfectly suited for.
Now of course:
- it coming in handy once for my specific random workload doesn't mean it's good design
- my specific workload may not be rational (am a dingus sometimes)
- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome
- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.
But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting scalars only is what would be weird in my view, you're "randomly" doing skips over the data potentially.
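A rough Python sketch of that kind of per-codepoint tally; since the stdlib doesn't expose Unicode block names, this uses the general category from unicodedata as a stand-in (tallying by block would work the same way, given a table of block ranges):

```python
from collections import Counter
import unicodedata

def tally(text):
    # Count code points by Unicode general category, e.g.
    # Ll = lowercase letter, Nd = decimal digit, Cf = format character.
    return Counter(unicodedata.category(ch) for ch in text)

print(tally("abc1 \u200d"))  # the ZWJ shows up as a Cf format character
```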
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.
You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."
You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.
But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.
> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
It’s not wrong that " ".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)
String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)
String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)
TXR Lisp:
1> (len " ")
5
2> (coded-length " ")
17
(Trust me when I say that the emoji was there when I edited the comment.)

The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
" ".codePoints().count()
==> 5
" ".chars().count()
==> 7
" ".getBytes(UTF_8).length
==> 17
(HN doesn't render the emoji in comments, it seems)

• https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)
• https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)
• https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)
I’m guessing this got posted by someone who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...
Which, to humor the parent, is also true of raw bytes strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
If I write,
def foo(s: str) -> …:
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take def foo(s: UnicodeWithBullshit) -> …:

But most programmers think in arrays of grapheme clusters, whether they know it or not.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
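A small sketch of that surrogateescape round-trip:

```python
raw = b"ok\xffok"  # 0xff can never appear in valid UTF-8

# Decoding maps the invalid byte to the unpaired surrogate U+DCFF...
s = raw.decode("utf-8", errors="surrogateescape")
print("\udcff" in s)  # True

# ...and encoding with the same handler restores the original bytes.
print(s.encode("utf-8", errors="surrogateescape") == raw)  # True
```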
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
$ raku
Welcome to Rakudo™ v2025.06.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2025.06.
[0] > " ".chars
1
[1] > " ".codes
5
[2] > " ".encode('UTF-8').bytes
17
[3] > " ".NFD.map(*.chr.uniname)
(FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)

> [...(new Intl.Segmenter()).segment(THAT_FACEPALM_EMOJI)].length
1
[^1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...
If you want to see a more interesting case than emoji, check out Thai language. In Thai, vowels could appear before, after, above, below, or on many sides of the associated consonants.
- Number of UTF-8 code units (17 in this case)
- Number of UTF-16 code units (7 in this case)
- Number of UTF-32 code units or Unicode scalar values (5 in this case)
- Number of extended grapheme clusters (1 in this case)
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
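The encoding dependence is easy to demonstrate in Python; the same five-character string has a different byte count in each encoding:

```python
s = "h\u00e9llo"  # "héllo": 5 code points, one of them non-ASCII

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(s.encode(enc)))
# utf-8     6
# utf-16-le 10
# utf-32-le 20
```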
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
Only if you are using a new enough version of Unicode. If you were using an older version, it is more than 1. As new Unicode updates come out, the number of grapheme clusters a string has can change.
I don't understand. It depends on the encoding isn't it?
But that isn't the same across all languages, or even across all implementations of the same language.
https://stackoverflow.com/questions/2241348/what-are-unicode...
Still have more reading to do and a lot to learn but this was super informative, so thank you internet stranger.
> So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
Thank you!
1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.
2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.
3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
Anything more is and should be beyond the purview of the simple built-in `len`:
4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a “character”.” Cf. https://www.unicode.org/reports/tr29
Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?
Anyway, the great module https://pypi.org/project/regex/ supports “Matching a single grapheme \X”. So:
len(regex.findall(r"\X", "\U0001F926\U0001F3FC\u200D\u2642\uFE0F")) == 1
5. The space a sequence of code points will occupy on the screen: certainly useful but at least dependent on the typeface that will be used for rendering and hence certainly beyond the purview of a simple function.

Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.
E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.
So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
bool utf_append_plaintext(utf* result, const char* text) {
#define msk(byte, mask, value) ((byte & mask) == value)
#define cnt(byte) msk(byte, 0xc0, 0x80)
#define shf(byte, mask, amount) ((byte & mask) << amount)
  utf_clear(result);
  if (text == NULL)
    return false;
  size_t siz = strlen(text);
  uint8_t* nxt = (uint8_t*)text;
  uint8_t* end = nxt + siz;
  if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
    nxt += 3;
  while (nxt < end) {
    bool aok = false;
    uint32_t cod = 0;
    uint8_t fir = nxt[0];
    if (msk(fir, 0x80, 0)) {
      cod = fir;
      nxt += 1;
      aok = true;
    } else if ((nxt + 1) < end) {
      uint8_t sec = nxt[1];
      if (msk(fir, 0xe0, 0xc0)) {
        if (cnt(sec)) {
          cod |= shf(fir, 0x1f, 6);
          cod |= shf(sec, 0x3f, 0);
          nxt += 2;
          aok = true;
        }
      } else if ((nxt + 2) < end) {
        uint8_t thi = nxt[2];
        if (msk(fir, 0xf0, 0xe0)) {
          if (cnt(sec) && cnt(thi)) {
            cod |= shf(fir, 0x0f, 12);
            cod |= shf(sec, 0x3f, 6);
            cod |= shf(thi, 0x3f, 0);
            nxt += 3;
            aok = true;
          }
        } else if ((nxt + 3) < end) {
          uint8_t fou = nxt[3];
          if (msk(fir, 0xf8, 0xf0)) {
            if (cnt(sec) && cnt(thi) && cnt(fou)) {
              cod |= shf(fir, 0x07, 18);
              cod |= shf(sec, 0x3f, 12);
              cod |= shf(thi, 0x3f, 6);
              cod |= shf(fou, 0x3f, 0);
              nxt += 4;
              aok = true;
            }
          }
        }
      }
    }
    if (aok)
      utf_push(result, cod);
    else
      return false;
  }
  return true;
#undef cnt
#undef msk
#undef shf
}
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.

UTF-8 is so complicated because it wants to be backwards compatible with ASCII. In exchange, compared to UTF-16 or UCS-4, it:
- requires less memory for most strings, particular ones that are largely limited to ASCII like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
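That self-synchronization property is simple to sketch (Python here for brevity): continuation bytes always match the bit pattern 10xxxxxx, so from any offset you can scan backwards to a code point boundary:

```python
def prev_boundary(data: bytes, i: int) -> int:
    # Step backwards over continuation bytes (10xxxxxx) until we hit
    # the start of the code point containing offset i.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "n\u00e9".encode("utf-8")  # b'n\xc3\xa9'
print(prev_boundary(data, 2))     # offset 2 is a continuation byte -> 1
```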
Especially when you start getting into non-Latin-based languages.