At first there was an empty space between the double quotes. This made me click and read the article because it was surprising that the length of a space would be 7.
Then the actual emoji appeared and the title finally made sense.
Now I see escaped \u{…} characters spelled out and it’s just ridiculous.
Can’t wait to come back tomorrow to see what it will be then.
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme clusters being the sensible default thing that people probably expect, on average.
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?
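To make that concrete, here is a rough Python sketch (the emoji is written with escapes so it survives plaintext): the same five-code-point string has three different "sizes" depending on the question you ask.

```python
# The facepalm emoji with skin tone: 5 code points joined with a ZWJ.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(len(s))                           # 5: code points
print(len(s.encode("utf-8")))           # 17: bytes in UTF-8
print(len(s.encode("utf-16-le")) // 2)  # 7: UTF-16 code units (what JS .length counts)
```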
You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:
String.len() == number of bytes
String.bytes().count() == number of bytes
String.chars().count() == number of unicode scalar values
String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
String.lines().count() == number of lines
Really my only complaint is I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.

I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.
Just never ever use Extended ASCII (8-bits with codepages).
Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
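A quick Python sketch of that difference, using the combining vs precomposed forms of "é":

```python
import unicodedata

a = "e\u0301"   # 'e' + combining acute accent: two code points
b = "\u00e9"    # precomposed 'é': one code point

print(a == b)                                # False: bitwise comparison
print(unicodedata.normalize("NFC", a) == b)  # True: logical equivalence
```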
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
To predict the pixel width of a given text, right?
One thing I ran into is that despite certain fonts being monospace, characters from different Unicode blocks would have unexpected lengths. Like I'd have expected half-width CJK letters to render to the same pixel dimensions as Latin letters do, but they don't. It's ever so slightly off. Same with full-width CJK letters vs two Latin letters.
I'm not sure if this is due to some font fallback. I'd have expected e.g. VS Code to be able to render Japanese and English monospace in an aligned way without any fallbacks. Maybe once I have energy again to waste on this I'll look into it deeper.
* I'm talking about the DOM route, not <canvas> obviously. VS Code is powered by Monaco, which is DOM-based, not canvas-based. You can "Developer: Toggle Developer Tools" to see the DOM structure under the hood.
** I should further qualify my statement as browsers are fundamentally incapable of this if you use native text node rendering. I have built a perfectly monospace mixed CJK and Latin interface myself by wrapping each full width character in a separate span. Not exactly a performance-oriented solution. Also IIRC Safari doesn’t handle lengths in fractional pixels very well.
Seemed awkward but I eventually realized I rarely cared about the number of characters. Even when dealing with substrings, I really only cared about a means to describe “stuff” before/after, not literal indices.
Counting Unicode characters is actually a disservice.
The metrics you care about are likely the number of letters from a human perspective [1] or the number of bytes of storage (depends), possibly both.
[1]: https://tomsmeding.com/unicode#U+65%20U+308

[2]: https://tomsmeding.com/unicode#U+EB
In an environment that supports advanced Unicode features, what exactly do you do with the string length?
When I'm comparing human-readable strings I want the length. In all other cases I want sizeof(string) and it's... quite a variable thing.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever) for anything beyond ASCII (which you really should handle, since East Asian users alone number in the billions).
Same goes for "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
And before that, the only thing that relative rarity did for you was that bugs in code working on UTF-8 bytes got fixed, while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
Might be a little long for a title :)
"\u{1F926}\u{1F3FC}\u200D\u2642\uFE0F".length == 7
… for JavaScript.

You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is a man with skin tone facepalming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte; is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
The unit is perfectly meaningful.
It's "characters". (Pedantically, "code points" — https://www.unicode.org/glossary/#code_point — because values that haven't been assigned to characters may be stored. This is good for interop, because it allows you to receive data from a platform that implements a newer version of the Unicode standard, and decide what to do with the parts that your local terminal, font rendering engine, etc. don't recognize.)
Since UTF-32 allows storing every code point in a single code unit, you can also describe it that way, despite the fact that Python doesn't use a full 4 bytes per code point when it doesn't have to.
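You can glimpse that flexible storage from CPython itself (an implementation detail of CPython's PEP 393 representation, not a language guarantee), e.g. via sys.getsizeof:

```python
import sys

# CPython stores a string with 1, 2, or 4 bytes per code point,
# depending on the widest code point it contains.
narrow = "a" * 100          # fits in 1 byte per code point
wide = "\U0001F926" * 100   # needs 4 bytes per code point

print(sys.getsizeof(narrow) < sys.getsizeof(wide))  # True on CPython
```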
The only real problem is that "character" doesn't mean what you think it does, and hasn't since 1991.
I don't understand what you mean by "USV count".
> but what is a character?
It's what the Unicode standard says a character is. https://www.unicode.org/glossary/#character , definition 3. Python didn't come up with the concept; Unicode did.
> …but "5" or "7"? Where do those even come from?
From the way that the Unicode standard dictates that this text shall be represented. This is not Python's fault.
> Again: "character in the implementation" is a meaningless concept.
"Character" is completely meaningful, as demonstrated by the fact the Unicode Consortium defines it, and by the fact that huge amounts of software has been written based on that definition, and referring to it in documentation.
I just relied on this fact yesterday, so it's kind of a funny timing. I wrote a little script that looks out for shenanigans in source files. One thing I wanted to explore was what Unicode blocks a given file references characters from. This is meaningless on the byte level, and meaningless on the grapheme cluster level. It is only meaningful on the codepoint level. So all I needed to do was to iterate through all the codepoints in the file, tally it all up by Unicode block, and print the results. Something this design was perfectly suited for.
Now of course:
- it coming in handy once for my specific random workload doesn't mean it's good design
- my specific workload may not be rational (am a dingus sometimes)
- at some point I did consider iterating by grapheme clusters, which the language didn't seem to love a whole lot, so more flexibility would likely indeed be welcome
- I am well and fully aware that iterating through data a few bytes at a time is abjectly terrible and possibly a sin. Too bad I don't really do coding in any proper native language, and I have basically no experience in SIMD, so tough shit.
But yeah, I really don't see why people find this so crazy. The whole article is in good part about how relying on grapheme cluster semantics makes you Unicode version dependent and that being a bit hairy, so it's probably not a good idea to default to it. At which point, codepoints it is. Counting scalars only is what would be weird in my view, you're "randomly" doing skips over the data potentially.
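A rough Python sketch of that kind of per-codepoint tally; since the stdlib doesn't expose Unicode block names, this uses the general category from unicodedata as a stand-in (tallying by block would work the same way, given a table of block ranges):

```python
from collections import Counter
import unicodedata

def tally(text):
    # Count code points by Unicode general category, e.g.
    # Ll = lowercase letter, Nd = decimal digit, Cf = format character.
    return Counter(unicodedata.category(ch) for ch in text)

print(tally("abc1 \u200d"))  # the ZWJ shows up as a Cf format character
```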
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.
You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."
You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.
But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.
> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
It’s not wrong that " ".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)
String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)
String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)
TXR Lisp:
1> (len " ")
5
2> (coded-length " ")
17
(Trust me when I say that the emoji was there when I edited the comment.)

The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
" ".codePoints().count()
==> 5
" ".chars().count()
==> 7
" ".getBytes(UTF_8).length
==> 17
(HN doesn't render the emoji in comments, it seems)

• https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)
• https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)
• https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)
I’m guessing this got posted by someone who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...
Which, to humor the parent, is also true of raw bytes strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
If I write,
def foo(s: str) -> …:
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take def foo(s: UnicodeWithBullshit) -> …:

But most programmers think in arrays of grapheme clusters, whether they know it or not.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
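A small sketch of that surrogateescape round-trip:

```python
raw = b"ok\xffok"  # 0xff can never appear in valid UTF-8

# Decoding maps the invalid byte to the unpaired surrogate U+DCFF...
s = raw.decode("utf-8", errors="surrogateescape")
print("\udcff" in s)  # True

# ...and encoding with the same handler restores the original bytes.
print(s.encode("utf-8", errors="surrogateescape") == raw)  # True
```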
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
$ raku
Welcome to Rakudo™ v2025.06.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2025.06.
[0] > " ".chars
1
[1] > " ".codes
5
[2] > " ".encode('UTF-8').bytes
17
[3] > " ".NFD.map(*.chr.uniname)
(FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)

> [...(new Intl.Segmenter()).segment(THAT_FACEPALM_EMOJI)].length
1
[^1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...
If you want to see a more interesting case than emoji, check out Thai language. In Thai, vowels could appear before, after, above, below, or on many sides of the associated consonants.
- Number of UTF-8 code units (17 in this case)
- Number of UTF-16 code units (7 in this case)
- Number of UTF-32 code units or Unicode scalar values (5 in this case)
- Number of extended grapheme clusters (1 in this case)
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
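The encoding dependence is easy to demonstrate in Python; the same five-character string has a different byte count in each encoding:

```python
s = "h\u00e9llo"  # "héllo": 5 code points, one of them non-ASCII

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(s.encode(enc)))
# utf-8     6
# utf-16-le 10
# utf-32-le 20
```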
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
Only if you are using a new enough version of Unicode. If you were using an older version, it is more than 1. As new Unicode updates come out, the number of grapheme clusters a string has can change.
I don't understand. It depends on the encoding isn't it?
But that isn't the same across all languages, or even across all implementations of the same language.
https://stackoverflow.com/questions/2241348/what-are-unicode...
Still have more reading to do and a lot to learn but this was super informative, so thank you internet stranger.
> So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.
Thank you!
1. Python3 plainly distinguishes between a string and a sequence of bytes. The function `len`, as a built-in, gives the most straightforward count: for any set or sequence of items, it counts the number of these items.
2. For a sequence of bytes, it counts the number of bytes. Taking this face-palming half-pale male hodgepodge and encoding it according to UTF-8, we get 17 bytes. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F".encode(encoding = "utf-8")) == 17`.
3. After bytes, the most basic entities are Unicode code points. A Python3 string is a sequence of Unicode code points. So for a Python3 string, `len` should give the number of Unicode code points. Thus `len("\U0001F926\U0001F3FC\u200D\u2642\uFE0F") == 5`.
Anything more is and should be beyond the purview of the simple built-in `len`:
4. Grapheme clusters are complicated and nearly as arbitrary as code points, hence there are “legacy grapheme clusters” – the grapheme clusters of older Unicode versions, because they changed – and “tailored grapheme clusters”, which may be needed “for specific locales and other customizations”, and of course the default “extended grapheme clusters”, which are only “a best-effort approximation” to “what a typical user might think of as a “character”.” Cf. https://www.unicode.org/reports/tr29
Of course, there are very few use cases for knowing the number of code points, but are there really many more for the number (NB: the number) of grapheme clusters?
Anyway, the great module https://pypi.org/project/regex/ supports “Matching a single grapheme \X”. So:
len(regex.findall(r"\X", "\U0001F926\U0001F3FC\u200D\u2642\uFE0F")) == 1
5. The space a sequence of code points will occupy on the screen: certainly useful but at least dependent on the typeface that will be used for rendering and hence certainly beyond the purview of a simple function.

Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.
E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.
So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
bool utf_append_plaintext(utf* result, const char* text) {
#define msk(byte, mask, value) ((byte & mask) == value)
#define cnt(byte) msk(byte, 0xc0, 0x80)
#define shf(byte, mask, amount) ((byte & mask) << amount)
  utf_clear(result);
  if (text == NULL)
    return false;
  size_t siz = strlen(text);
  uint8_t* nxt = (uint8_t*)text;
  uint8_t* end = nxt + siz;
  if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
    nxt += 3;
  while (nxt < end) {
    bool aok = false;
    uint32_t cod = 0;
    uint8_t fir = nxt[0];
    if (msk(fir, 0x80, 0)) {
      cod = fir;
      nxt += 1;
      aok = true;
    } else if ((nxt + 1) < end) {
      uint8_t sec = nxt[1];
      if (msk(fir, 0xe0, 0xc0)) {
        if (cnt(sec)) {
          cod |= shf(fir, 0x1f, 6);
          cod |= shf(sec, 0x3f, 0);
          nxt += 2;
          aok = true;
        }
      } else if ((nxt + 2) < end) {
        uint8_t thi = nxt[2];
        if (msk(fir, 0xf0, 0xe0)) {
          if (cnt(sec) && cnt(thi)) {
            cod |= shf(fir, 0x0f, 12);
            cod |= shf(sec, 0x3f, 6);
            cod |= shf(thi, 0x3f, 0);
            nxt += 3;
            aok = true;
          }
        } else if ((nxt + 3) < end) {
          uint8_t fou = nxt[3];
          if (msk(fir, 0xf8, 0xf0)) {
            if (cnt(sec) && cnt(thi) && cnt(fou)) {
              cod |= shf(fir, 0x07, 18);
              cod |= shf(sec, 0x3f, 12);
              cod |= shf(thi, 0x3f, 6);
              cod |= shf(fou, 0x3f, 0);
              nxt += 4;
              aok = true;
            }
          }
        }
      }
    }
    if (aok)
      utf_push(result, cod);
    else
      return false;
  }
  return true;
#undef cnt
#undef msk
#undef shf
}
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.

UTF-8 is so complicated because it wants to be backwards compatible with ASCII. In exchange, compared to UTF-16 or UCS-4, it:
- requires less memory for most strings, particular ones that are largely limited to ASCII like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
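That self-synchronization property is simple to sketch (Python here for brevity): continuation bytes always match the bit pattern 10xxxxxx, so from any offset you can scan backwards to a code point boundary:

```python
def prev_boundary(data: bytes, i: int) -> int:
    # Step backwards over continuation bytes (10xxxxxx) until we hit
    # the start of the code point containing offset i.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "n\u00e9".encode("utf-8")  # b'n\xc3\xa9'
print(prev_boundary(data, 2))     # offset 2 is a continuation byte -> 1
```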
Especially when you start getting into non-Latin-based languages.