undefined | Better HN

0 pointsdietrichepp13y ago0 comments

Why do you want to count Unicode characters? Why do you care if it is fast to do so? Why would you ever need to use character-based string indexing?

UTF-16 solves problems that don't exist.

(Honestly, I would love it if someone could explain what the purpose of counting characters is, because I don't know why you'd ever do that, except when you're posting to Twitter.)

0 comments

pjscott13y ago

Be careful not to confuse UTF-16 and UCS-2. UTF-16 is a variable-width encoding, so using it doesn't actually make counting anything easier. UCS-2 is fixed-width, and evil. With UCS-2 you can easily count the number of code points, as long as they fall in the BMP (!), but this is not the same as counting graphemes.

SeoxyS13y ago

It's not a common use case, but I've had to do it. Luckily, it's fairly easy:

    int count_multibytes_encountered(char *text, unsigned long len) {
        int count = 0;
        for (int i = 0; i < len; i++) {
            if ((*(text+i) & 0b10000000) == 0b10000000 &&       // check if it's a multi-char byte
                (~*(text+i) & 0b01000000) == 0b01000000) {      // and check that it's not a leading byte
                count++;
            }
        }
        return count;
    }

Daniel_Newby13y ago

> Why do you want to count Unicode characters?

Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

> Why do you care if it is fast to do so?

Efficient full text search that can ignore decorative combining characters.

meatmanek13y ago

> Text editing and rendering

Unless you're working entirely in fixed point characters (and you probably aren't, given that even fixed-width systems like terminal emulators use double-wide glyphs sometimes), you need to know the value of each character to know its width. That involves the same linear scan over the string that is required to calculate the number of glyphs in a variable-width encoding.

andolanra13y ago

If you implement naïve Aho-Corasick text search over one-byte characters, it works without modification on UTF-8 text. It does not ignore combining characters, but UCS-2 also features combining characters (c.f. other comments in this same thread), so no matter what encoding you use, you must first normalize the Unicode text and the search string before you compare for equivalence (or compatibility, which is a looser notion than equality for Unicode code point sequences.)

masklinn13y ago

> Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

Except these parts of the system have to work on unicode glyphs (rendered characters) which will span multiple codepoints anyway, so counting codepoints remains pointless. The only use it has is knowing how much codepoints you have. Yay.

pjscott13y ago

How does fast character counting help with full-text search?

Daniel_Newby13y ago

The best search algorithms can skip ahead upon a mismatch. A variable-length encoding requires branch instructions in the inner loop, leading to pipeline flushes and potentially dramatic slow down.

1 more reply

est13y ago

> Why do you want to count Unicode characters?

Because, there are other countries which use more than English language?

I fucking hate you ascii-centric ignorant morons sometimes, you know, for example

- display welcome message character by character fro left to right

- Extract the first character because it's always the surname

- catch two non-ascii keyword and find its index in a string

In the first example, should I just put byte by byte, which displays as garbage, and suddenly three bytes become a recognizable character?

SnowLprd13y ago

Could you not have communicated your examples without the hostility?

dietricheppOP13y ago

I think you misunderstand. I wasn't asking why Unicode characters should be counted instead of bytes or ASCII characters, I was asking why you would even want to count characters at all.

> I fucking hate you ascii-centric ignorant morons

Nice.

> You ignorant, arrogant fuck.

This is why I quit posting under an alias, so I wouldn't be tempted to say such things.

> display welcome message character by character fro left to right

UTF-16/UCS-4/UCS-2 doesn't solve anything here. Counting characters doesn't help. For example, imagine if you try to print Korean character-by-character. You might get some garbage like this:

    ᄋ
    아
    안
    안ᄂ
    안녀
    안녕
    안녕ᄒ
    안녕하
    안녕하ᄉ
    안녕하세
    안녕하세ᄋ
    안녕하세요

Fixed width encodings do not solve this problem, and UTF-8 does not make this problem more difficult. I am honestly curious why you would need to count characters -- at all -- except for posting to Twitter.

Splitting on characters is garbage. (This example was done in Python 3, so everything is properly encoded, and there is no need to use the 'u' prefix. The 'u' prefix is a nop in Python 3. It is only there for Python 2.x compatibility.)

    >>> x
    '안녕하세요'
    >>> x[2:4]
    'ᆫᄂ'

I tried in the Google Chrome console, too:

    > '안녕하세요'.substr(2,2)
    "하세"
    > '안녕하세요'.substr(2,2)
    "ᆫᄂ"

I'm not even leaving the BMP and it's broken! You seem to be blaming encoding issues but I don't have any issues with encoding. It doesn't matter if Chrome uses UCS-2 or Python uses UCS-4 or UCS-2, what's happening here is entirely expected, and it has everything to do with Jamo and nothing to do with encodings.

    >>> a = '안녕하세요'
    >>> b = '안녕하세요'
    # They only look the same
    >>> len(a)
    5
    >>> len(b)
    12
    >>> def p(x):
    ...     return ' '.join(
                'U+{:04X}'.format(ord(c)) for c in x)
    
    >>> print(' '.join('U+{:04X}'.format(ord(c))
              for c in b))
    >>> print(p(a))
    U+C548 U+B155 U+D558 U+C138 U+C694
    >>> print(p(b))
    U+110B U+1161 U+11AB U+1102 U+1167 U+11BC U+1112 U+1161 U+1109 U+1166 U+110B U+116D

See? Expected, broken behavior you get when splitting on character boundaries.

If you think you can split on character boundaries, you are living in an ASCII world. Unicode does not work that way. Don't think that normalization will solve anything either. (Okay, normalization solves some problems. But it is not a panacea. Some languages have grapheme clusters that cannot be precomposed.)

Fixed-width may be faster for splitting on character boundaries, but splitting on character boundaries only works in the ASCII world.

est13y ago

> Counting characters doesn't help.

Why? If you can count characters (code points) then it's natural that you can split or substring by characters.

Try this in javascript:

    '안녕하세요'.substr(2,2)

Internally Fixed length encoding is much faster than variable-length encoding.

> Unicode does not work that way.

It DOES.

> Splitting on characters is garbage.

You messed up Unicode in Python in so many levels. Those characters you seen in Python console is, actually not Unicode. These are just bytes in sys stdout that happens be to correctly decoded and properly displayed. You should always use the u'' for any kind of characters. '안녕하세요' is WRONG and may lead to unspecified behaviors, it depends on your source code file encoding, intepreter encoding and sys default encoding, if you display them in console it depends on the console encoding, if it's GUI or HTML widget it depends on the GUI widget or content-type encoding.

> I'm not even leaving the BMP and it's broken!

Your unicode-fu is broken. Looks like your example provided identical Korean strings, which might be ICU module in Chrome auto normalized for you.

> You can't split decomposed Korean on character boundaries.

In a broken unicode implementation, like Chrome browser v8 js engine.

> I happen to be using Python 3. It is internally using UCS-4.

For the love of BDFL read this

http://www.python.org/dev/peps/pep-0414/

http://docs.python.org/3/whatsnew/3.3.html

2 more replies

est13y ago

> See? Expected, broken behavior you get when splitting on character boundaries.

Yeah, like your Jamo trick is complex for a native CJK speaker.

Thought Jamo is hard? Check out Ideographic Description Sequence. We have like millions of 偏旁部首笔画 that you can freestyle combine with.

And the fun is the relative length of glypes, 土 and 士 is different, only because one line is longer that the other. How would you distinguish that?

But you know what your problem is?

It's like arguing with you that you think ส็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็็ is only one character.

IMPOSSIBU?!!!???

And because U+202e exists on the lolternet so we deprive your ability to count 99% normal CJK characters???!??!111!

Combination characters is normalized to single character in most cases, and should be countable and indexable separately.

If you type combination characters EXPLICITLY, they will be counted with each combination, naturally, what's wrong with that?

Or else why don't we abandon Unicode, every country deal with their own weird glype composition shit?

1 more reply

j / k navigate · click thread line to collapse

0 comments

pjscott13y ago

SeoxyS13y ago

It's not a common use case, but I've had to do it. Luckily, it's fairly easy:

    int count_multibytes_encountered(char *text, unsigned long len) {
        int count = 0;
        for (int i = 0; i < len; i++) {
            if ((*(text+i) & 0b10000000) == 0b10000000 &&       // check if it's a multi-char byte
                (~*(text+i) & 0b01000000) == 0b01000000) {      // and check that it's not a leading byte
                count++;
            }
        }
        return count;
    }

Daniel_Newby13y ago

> Why do you want to count Unicode characters?

Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

> Why do you care if it is fast to do so?

Efficient full text search that can ignore decorative combining characters.

meatmanek13y ago

> Text editing and rendering

andolanra13y ago

masklinn13y ago

> Text editing and rendering. Some parts of the system cannot simply treat Unicode text as an opaque binary hunk of information.

pjscott13y ago

How does fast character counting help with full-text search?

Daniel_Newby13y ago

The best search algorithms can skip ahead upon a mismatch. A variable-length encoding requires branch instructions in the inner loop, leading to pipeline flushes and potentially dramatic slow down.

1 more reply

est13y ago

> Why do you want to count Unicode characters?

Because, there are other countries which use more than English language?

I fucking hate you ascii-centric ignorant morons sometimes, you know, for example

- display welcome message character by character fro left to right

- Extract the first character because it's always the surname

- catch two non-ascii keyword and find its index in a string

In the first example, should I just put byte by byte, which displays as garbage, and suddenly three bytes become a recognizable character?

SnowLprd13y ago

Could you not have communicated your examples without the hostility?

dietricheppOP13y ago

I think you misunderstand. I wasn't asking why Unicode characters should be counted instead of bytes or ASCII characters, I was asking why you would even want to count characters at all.

> I fucking hate you ascii-centric ignorant morons

Nice.

> You ignorant, arrogant fuck.

This is why I quit posting under an alias, so I wouldn't be tempted to say such things.

> display welcome message character by character fro left to right

UTF-16/UCS-4/UCS-2 doesn't solve anything here. Counting characters doesn't help. For example, imagine if you try to print Korean character-by-character. You might get some garbage like this:

    ᄋ
    아
    안
    안ᄂ
    안녀
    안녕
    안녕ᄒ
    안녕하
    안녕하ᄉ
    안녕하세
    안녕하세ᄋ
    안녕하세요

    >>> x
    '안녕하세요'
    >>> x[2:4]
    'ᆫᄂ'

I tried in the Google Chrome console, too:

    > '안녕하세요'.substr(2,2)
    "하세"
    > '안녕하세요'.substr(2,2)
    "ᆫᄂ"

    >>> a = '안녕하세요'
    >>> b = '안녕하세요'
    # They only look the same
    >>> len(a)
    5
    >>> len(b)
    12
    >>> def p(x):
    ...     return ' '.join(
                'U+{:04X}'.format(ord(c)) for c in x)
    
    >>> print(' '.join('U+{:04X}'.format(ord(c))
              for c in b))
    >>> print(p(a))
    U+C548 U+B155 U+D558 U+C138 U+C694
    >>> print(p(b))
    U+110B U+1161 U+11AB U+1102 U+1167 U+11BC U+1112 U+1161 U+1109 U+1166 U+110B U+116D

See? Expected, broken behavior you get when splitting on character boundaries.

Fixed-width may be faster for splitting on character boundaries, but splitting on character boundaries only works in the ASCII world.

est13y ago

> Counting characters doesn't help.

Why? If you can count characters (code points) then it's natural that you can split or substring by characters.

Try this in javascript:

    '안녕하세요'.substr(2,2)

Internally Fixed length encoding is much faster than variable-length encoding.

> Unicode does not work that way.

It DOES.

> Splitting on characters is garbage.

> I'm not even leaving the BMP and it's broken!

Your unicode-fu is broken. Looks like your example provided identical Korean strings, which might be ICU module in Chrome auto normalized for you.

> You can't split decomposed Korean on character boundaries.

In a broken unicode implementation, like Chrome browser v8 js engine.

> I happen to be using Python 3. It is internally using UCS-4.

For the love of BDFL read this

http://www.python.org/dev/peps/pep-0414/

http://docs.python.org/3/whatsnew/3.3.html

2 more replies

est13y ago

> See? Expected, broken behavior you get when splitting on character boundaries.

Yeah, like your Jamo trick is complex for a native CJK speaker.

Thought Jamo is hard? Check out Ideographic Description Sequence. We have like millions of 偏旁部首笔画 that you can freestyle combine with.

And the fun is the relative length of glypes, 土 and 士 is different, only because one line is longer that the other. How would you distinguish that?

But you know what your problem is?

IMPOSSIBU?!!!???

And because U+202e exists on the lolternet so we deprive your ability to count 99% normal CJK characters???!??!111!

Combination characters is normalized to single character in most cases, and should be countable and indexable separately.

If you type combination characters EXPLICITLY, they will be counted with each combination, naturally, what's wrong with that?

Or else why don't we abandon Unicode, every country deal with their own weird glype composition shit?

1 more reply

j / k navigate · click thread line to collapse