It's not about encodings at all, actually. It's about the API that is presented to the programmer.
And the way you take all of this into account is by refusing to accept any defaults. So, for example, a string type should not have a "length" operation at all. It should have "length in code points", "length in graphemes" etc. operations. And maybe, if you do decide to expose UTF-8 (which I think is a bad idea), "length in bytes". But every time someone asks for the length, they should be forced to specify which one they want (and hence think about why they actually need it).
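A minimal sketch of what such an API could look like, in Python (the function names are hypothetical, and the grapheme counter is a rough approximation rather than a full UAX #29 implementation):

```python
import unicodedata

def length_in_code_points(s: str) -> int:
    # Python strings are sequences of code points, so len() already counts them.
    return len(s)

def length_in_bytes(s: str) -> int:
    # Only meaningful for a specific encoding; here, UTF-8.
    return len(s.encode("utf-8"))

def length_in_graphemes(s: str) -> int:
    # Rough approximation: count code points that are not combining marks.
    # A real implementation would follow the grapheme cluster rules of UAX #29.
    return sum(1 for c in s if unicodedata.combining(c) == 0)

s = "e\u0301"  # "é" written as 'e' plus a combining acute accent
print(length_in_code_points(s))  # 2
print(length_in_bytes(s))        # 3
print(length_in_graphemes(s))    # 1
```

Three different "lengths" for what the user sees as one character, which is exactly why a single default is a trap.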
Similarly, strings shouldn't support plain integer indexing at all. They should support operations like "nth code point", "nth grapheme" etc. Again, forcing the programmer to decide every time, and to think about the implications of that decision.
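Sketching the indexing side in the same hypothetical style (again, the grapheme segmentation below is a simplification of the real UAX #29 rules, just enough to show the difference):

```python
import unicodedata

def nth_code_point(s: str, n: int) -> str:
    # Python's built-in indexing is already by code point.
    return s[n]

def nth_grapheme(s: str, n: int) -> str:
    # Rough approximation of grapheme clusters: a base code point
    # together with the combining marks that follow it.
    clusters: list[str] = []
    for c in s:
        if clusters and unicodedata.combining(c) != 0:
            clusters[-1] += c  # attach combining mark to the previous base
        else:
            clusters.append(c)
    return clusters[n]

s = "e\u0301x"  # "éx" with a combining accent
print(repr(nth_code_point(s, 0)))  # 'e' alone, accent stripped off
print(repr(nth_grapheme(s, 0)))    # 'e\u0301', the "é" the user perceives
```

The two calls give different answers for the same index, which is the decision the programmer should be making consciously rather than getting one of them by default.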
It wouldn't solve all problems, of course. But I bet it would reduce them significantly, because wrong assumptions about what a string's "length" or "nth character" means are the most common source of those problems.