undefined | Better HN

0 pointsjosephg9mo ago0 comments

It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.

Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.

Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.

0 comments

account429mo ago

> Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.

You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.

This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.

josephgOP9mo ago

I hear your point, but invalid codepoint sequences are way less of a problem than strings with invalid UTF8. Text rendering engines deal with weird Unicode just fine. They have to since Unicode changes over time. Invalid UTF8 on the other hand is completely unrepresentable in most languages. I mean, unless you use raw byte arrays and convert to strings at the edge, but that’s a terrible design.

> This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.

Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.

tomsmeding9mo ago

For those following along: https://tomsmeding.com/unicode#U+65%20U+308%20U+20%20U+65%20...

j / k navigate · click thread line to collapse