
x5n1 9y ago

Huh? Strings have encodings. Rust strings are bytes encoded in UTF-8.

https://doc.rust-lang.org/book/strings.html

aidenn0 9y ago

In Rust a string is a sequence of Unicode scalar values. I personally find it unfortunate that they dictate the storage of it at the API level, but that is a necessary evil for presenting a consistent ABI with foreign code.

I did not know that strings in Ruby have encodings. Is there a reason for that? I personally don't like mixing characters and opaque byte sequences as they are very different.

burntsushi 9y ago

> In Rust a string is a sequence of Unicode scalar values.

The representation of a Rust String in memory is guaranteed valid UTF-8. To me, a "sequence of Unicode scalar values" is an abstract description, because it could be implemented via UTF-8, UTF-16 or UTF-32.
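To make the distinction concrete, here is a minimal sketch using only the standard library. The helper name `encoded_lengths` is made up for illustration. It shows that the same sequence of scalar values has a different size depending on the encoding, and that `String`/`str` always stores the UTF-8 form.

```rust
/// Hypothetical helper: size of `s` in UTF-8 bytes, UTF-16 code units,
/// and Unicode scalar values (which is also its UTF-32 code unit count).
fn encoded_lengths(s: &str) -> (usize, usize, usize) {
    let utf8_bytes = s.len(); // str::len counts bytes, not characters
    let utf16_units = s.encode_utf16().count(); // re-encoded lazily
    let scalar_values = s.chars().count(); // one per Unicode scalar value
    (utf8_bytes, utf16_units, scalar_values)
}

fn main() {
    let s = "aé世🦀";
    let (utf8, utf16, scalars) = encoded_lengths(s);
    // 'a' = 1 byte, 'é' = 2, '世' = 3, '🦀' = 4 (and a surrogate pair in UTF-16)
    println!("utf-8 bytes: {utf8}, utf-16 units: {utf16}, scalars: {scalars}");

    // The in-memory bytes are the UTF-8 encoding itself.
    println!("{:02X?}", "é".as_bytes());
}
```

Because the storage is fixed, `len()` is O(1) and counts bytes, while counting scalar values requires a pass over the data.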

> I personally find it unfortunate that they dictate the storage of it at the API level

It is extraordinarily convenient and provides a very transparent way to analyze the performance of string operations.

For transcoding, there is the in-progress `encoding` crate: https://github.com/lifthrasiir/rust-encoding

I note that Go does things very similarly (`string` is conventionally UTF-8) and it works famously for them. They have a much more mature set of encoding libraries, but they work the same as the equivalent libraries would work in Rust: transcode to and from UTF-8 at the boundaries. See: https://godoc.org/golang.org/x/text
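As a rough illustration of "transcode at the boundaries", the sketch below converts UTF-16LE bytes to a `String` on the way in and back to UTF-16LE on the way out. It uses only the standard library, not the `encoding` crate, and the function names `from_utf16le` and `to_utf16le` are hypothetical.

```rust
use std::string::FromUtf16Error;

/// Inbound boundary: decode little-endian UTF-16 bytes into a UTF-8 `String`.
/// Assumption for brevity: a trailing odd byte is silently ignored.
fn from_utf16le(bytes: &[u8]) -> Result<String, FromUtf16Error> {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    // Fails on unpaired surrogates, so invalid input never becomes a String.
    String::from_utf16(&units)
}

/// Outbound boundary: encode a UTF-8 string as little-endian UTF-16 bytes.
fn to_utf16le(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len() * 2);
    for unit in s.encode_utf16() {
        out.extend_from_slice(&unit.to_le_bytes());
    }
    out
}

fn main() {
    let wire = [0x68, 0x00, 0x69, 0x00]; // "hi" in UTF-16LE
    let text = from_utf16le(&wire).expect("valid UTF-16");
    // Everything between the boundaries works on plain UTF-8.
    let shouted = text.to_uppercase();
    println!("{shouted} -> {:02X?}", to_utf16le(&shouted));
}
```

Real transcoding libraries cover many more encodings and error-handling policies; the point here is only where the conversions sit.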

MichaelGG 9y ago

Ruby's Japanese heritage is probably why it handles encodings like that - I think there were multiple encodings it had to deal with at once or something. Also, Unicode doesn't completely handle all kanji, in that there are some that have an old style not available in Unicode. But maybe that's not relevant.

aidenn0 9y ago

Unicode now handles all the Kanji in JIS. I wouldn't be surprised if Ruby predated that. It almost certainly predates good library support for all the Kanji in JIS.


steveklabnik 9y ago

Ruby encoding stuff changed a lot over its history; it was one of the big changes from 1.8 to 1.9.

twelvechairs 9y ago

It's a better way of doing things - you can handle things in their native format rather than having to arbitrarily convert to UTF-8 (which is an 'encoding' itself).

[edit] I remember a talk where Matz was asked this specific question and tried to explain it clearly, but seemed confused as to how the questioner could have such a poor grasp of Unicode (the difference between monolingual Americans and Japanese, I guess).

kibwen 9y ago

String is just a wrapper around Vec<u8> that guarantees its contents are valid UTF-8, with some extra convenience functions for working with UTF-8. There's nothing stopping anyone from just using Vec<u8> to handle non-UTF-8 data in its native format, nor stopping anyone from writing convenience types like String for other encodings.
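A minimal sketch of that idea, assuming ISO-8859-1 (Latin-1) as the "other encoding". The `Latin1String` type and its `to_utf8` method are invented here for illustration and are not part of the standard library.

```rust
/// Hypothetical convenience type: bytes interpreted as ISO-8859-1 (Latin-1).
struct Latin1String(Vec<u8>);

impl Latin1String {
    /// Every Latin-1 byte maps directly to the scalar value U+0000..=U+00FF,
    /// so converting to a UTF-8 `String` cannot fail.
    fn to_utf8(&self) -> String {
        self.0.iter().map(|&b| b as char).collect()
    }
}

fn main() {
    // "café" in Latin-1: the final 0xE9 is not valid UTF-8 on its own.
    let raw: Vec<u8> = vec![0x63, 0x61, 0x66, 0xE9];

    // String refuses the bytes, because it only ever holds valid UTF-8.
    assert!(String::from_utf8(raw.clone()).is_err());

    // A plain Vec<u8> (or a wrapper around one) keeps them in native form.
    let native = Latin1String(raw);
    println!("{}", native.to_utf8());

    // Going the other way, a String gives up its underlying Vec<u8> for free.
    let bytes = String::from("café").into_bytes();
    println!("{:02X?}", bytes);
}
```

The validation in `String::from_utf8` is what separates `String` from a bare `Vec<u8>`; the storage underneath is the same.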


lobster_johnson 9y ago

The reason is that Ruby supports non-Unicode encodings that are not subsets of Unicode. That is not possible if your string type is always Unicode.
