
0 points · berdario · 11y ago

> Actually, the "ruby 1.9" solution is having Strings tagged with encoding at all -- prior to ruby 1.9 they were not, they were just bytes.

Whoops, you're right... I confused the versions. What I had in mind was "source code as UTF-8 by default", which was introduced in Ruby 2.0, not in Ruby 1.9.

> If ruby ever decides to make things even more strict, I don't think it'll actually be as disruptive as the 1.8 to 1.9 transition.

Admittedly, I almost never touched ruby1.8, so I've no idea how hard the transition from it actually was.

I'm under the impression that before ruby1.9, Ruby was simply encoding-oblivious, and for any encoding-sensitive piece of code, people simply relied on things like libuconv. Am I mistaken?

If that's the case, the change from 1.8 to 1.9 was painful for sure, but it was more the case of actually caring about encoding for the very first time in a codebase.

This is quite recent (and it deals with Jruby, which is different underneath): http://blog.rayapps.com/2013/03/11/7-things-that-can-go-wron...

but by reading this blog post, I'm under the impression that most of the breakage that you'd get with the move to Ruby1.9 wouldn't be in exceptions, but in strings corruption.

Migrating to a fail-fast approach (like Python3), imho makes things more difficult ecosystem-wise, because you'll get plenty of exceptions even just when importing the library when first trying to use/update it.

With the Ruby1.9 upgrade, you could've used a library even if it was not 100% compatible and correctly working with Ruby1.9, I'd assume. This could let people gradually migrate and port their code, while reporting issues about corruption and fixing them as they appear.

Instead, if you're the author of a big Python2 library that relies on the encoding, maybe you won't prioritize the porting work, because you realize how much work it is, and that unless you've correctly migrated 100% of the codebase, your users won't benefit from it (and so you have less of an incentive to start porting a couple of classes/modules/packages).

That'd be compounded by the fact that, in Python2 like in Ruby, you actually already have your libraries and your codebase working in an internationalized environment... things might get more robust, but in the meantime everything will break, and the benefit isn't immediately available nor obvious.

The last straw is then obviously the community and memes: I don't believe that Python developers are more conservative (the ones that use virtualenv at least, which is most of them in the web development industry I'd assume... things might be different in scientific computing, ops, desktop GUIs, pentesting, etc.), or that they intrinsically prefer more stable things. Not more than Ruby developers at least.

But for sure, memes like "Python2 and Python3 are two different languages" can demoralize and stifle initiatives to port libraries. And some mistakes were made without any doubt (mistakes that embittered part of the community), but they were recognized only in hindsight: I'm talking about not keeping the u'' literal (which has been reintroduced in Python3.3) and proposing 2to3 as a tool to be used at build/installation time, instead of only as a helper during migration to a single Python2/3 codebase.

> If I understand right, you're saying that it ought to be guaranteed to raise if you try to concatenate strings with different encoding.

Let's say that while I'd prefer if Ruby behaved like this, I'm not advocating at all for such a change, due to all the problems I just mentioned, and the fact that I wouldn't want any such responsibility :)


jrochkind1 · 11y ago

> I'm under the impression that before ruby1.9, Ruby was simply encoding-oblivious, and for any encoding-sensitive piece of code, people simply relied on things like libuconv.

True.

> but by reading this blog post, I'm under the impression that most of the breakage that you'd get with the move to Ruby1.9 wouldn't be in exceptions, but in strings corruption.

Eh... I don't know. In my experience, the encoding-related problems arising in the 1.9 move indeed generally arose as exceptions raised -- but because of ruby's attempt to let you get away with mixed encodings when they are both ascii compatible, you could _sometimes_ get those exceptions only on _certain input_, which could definitely make it terrible.
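A minimal sketch of that "only on certain input" behaviour, assuming Ruby 2.0 or later. The helper name is made up for illustration, and the `\u` escapes are only there so the literals are UTF-8 regardless of the file's source encoding:

```ruby
# Hypothetical helper: joins a Latin-1 tagged string with a UTF-8 one and
# reports either the encoding of the result or the exception class raised.
def join_latin1_with_utf8(text, utf8_text)
  latin1 = text.encode("ISO-8859-1") # transcoded copy, tagged ISO-8859-1
  (latin1 + utf8_text).encoding.name
rescue Encoding::CompatibilityError => e
  e.class.name
end

utf8 = "na\u00EFve" # "naive" with a diaeresis on the i, UTF-8

# ASCII-only left side: Ruby lets the mixed encodings through.
puts join_latin1_with_utf8("Jose", utf8)

# Non-ASCII on both sides: the same code path now raises.
puts join_latin1_with_utf8("Jos\u00E9", utf8)
```

The code is identical in both calls; only the data decides whether the exception shows up.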

I am trying to think of any cases where you'd get corrupt bytes... the only ones I can think of are where you tried to deal with the transition without really understanding what was going on, by blindly calling `force_encoding` on Strings, when you were forcing them to a different encoding than they really were. You'd have to explicitly take (wrong) action to get corrupted bytes, you wouldn't get them on an upgrade otherwise -- you'd get raises, or you'd get working okay (if you stuck to ascii-compat bytes only).
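For illustration, a small sketch of how a wrong `force_encoding` call turns into corrupted bytes, assuming Ruby 2.0 or later. The helper name is hypothetical:

```ruby
# Hypothetical helper showing the "blind force_encoding" mistake.
def mistag_as_latin1(utf8_text)
  # force_encoding only changes the tag; the bytes stay exactly the same.
  mistagged = utf8_text.dup.force_encoding("ISO-8859-1")
  # Transcoding from the wrong tag bakes the mistake into the bytes.
  mistagged.encode("UTF-8")
end

original  = "Jos\u00E9" # "Jose" with an acute accent, UTF-8
corrupted = mistag_as_latin1(original)

p original.bytesize          # => 5
p corrupted.bytesize         # => 7
p corrupted.valid_encoding?  # => true, so nothing will ever raise on it
```

No exception is raised anywhere here. The result is valid UTF-8 that simply spells the wrong characters, which is why this kind of damage tends to go unnoticed.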

Of course, one of your dependencies might be doing the wrong thing too, and infect your code with strings it returned to you -- it wouldn't have to be _you_ that did the wrong thing.

YAML serialization/de-serialization is sort of a special case, made worse by the fact that there was a transition between YAML engines in the stdlib too at that point, and that _neither_ really dealt with encodings properly, and they both did it differently! (Really, the whole yaml ecosystem, which is popular in rubyland, wasn't designed thinking properly about encodings).

Encoding of course can be tricky and confusing no matter what -- if you actually don't know what encoding your input is in, you can get corrupted bytes and/or exceptions. That's kind of an inherent part of dealing with encoding though. Once ruby 1.9 arrived, you couldn't get away with not understanding encoding anymore. I think there wasn't quite enough education and proper tools when ruby 1.9 came out (and still), perhaps the Japanese/English language barrier (and context difference! Japanese coders have different sorts of common issues with encoding) was part of that. String#scrub (replace invalid bytes with replacement chars) wasn't in the stdlib until very recently, and it was hard for me to get anyone to understand this was a problem when I needed it!
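A short sketch of what `String#scrub` does, assuming Ruby 2.1 or later (where it is built in). The helper name and the sample input are invented for the example:

```ruby
# Hypothetical helper: make text from an untrusted source safe to process.
def sanitize(text)
  return text if text.valid_encoding?
  text.scrub # with no argument, invalid bytes in UTF-8 become U+FFFD
end

# Simulated input: tagged UTF-8, but the byte 0xFF never occurs in valid UTF-8.
dirty = "abc\xFFdef".b.force_encoding("UTF-8")

p dirty.valid_encoding?            # => false
p sanitize(dirty).valid_encoding?  # => true
```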

> With the Ruby1.9 upgrade, you could've used a library even if it was not 100% compatible and correctly working with Ruby1.9, I'd assume. This could let people gradually migrate and port their code, while reporting issues about corruption and fixing them as they appear.

Yes, that was sometimes (but not always) true. I'm not sure how much the encoding-related stuff contributed to that. On the other hand, in general, they were trying to mostly keep ruby 1.9 backwards compatible with ruby 1.8 (perhaps unlike Python 2/3). And in fact, the main reason this wouldn't be true, and code written for 1.8 wouldn't work on 1.9 -- was encoding.

So actually, the fact that they, in some cases (where all strings involved were strictly ascii) allowed you to ignore encoding problems -- might have actually been part of the success. Even though in other ways it actually makes encoding a lot harder to deal with -- I think I'd agree with you that I'd prefer fail-fast, in the end, and not the current thing it does where, only in cases where all strings involved are pure ascii, it lets you get away with it.

But in the end, since the 1.8->1.9 transition was so successful, I guess we've got to say whatever they did was the right (or at least "a right") move.

I think switching to eliminate the "if all strings have exclusively ascii-compat chars" exception would actually be less disruptive at this point. But I could be wrong. And people were so burned by how difficult the 1.8->1.9 upgrade could be sometimes (largely because of encoding), there might be reluctance to touch it again any time close to soon.

It was _not_ an easy upgrade, although it may have been easier than python2->3, and it was possible to write libraries that would work in both (sometimes with special conditionals checking for ruby version -- especially around encoding!). I think the fact that Rails supported 1.9 very quickly (and then _stopped_ supporting 1.8 after that) is also huge, since Rails has a sort of unique place in ruby that even django doesn't have to python. I also think you are right that the ruby community is less change-averse than the python community (for better _and_ worse -- the ruby 1.9 and rails 3 transition was the beginning, for me, of starting to kind of hate how much work I had to do in rubyland just to keep everything working with supported versions of language and dependency).

There's actually way more we can say about this, but this is a huge book already, haha. One difference in encoding between python and ruby I think is, in ruby 1.9+, if a string is tagged with an encoding but contains bytes invalid for that encoding (that do not represent a legal sequence of chars), you'll get an exception if you try to concat it to anything else -- even a string of the same encoding. I don't _think_ that happens in python? Ruby also doesn't have a canonical internal encoding, strings can be in _any_ encoding it recognizes, tagged with that encoding and containing internal bytes in memory that are actually the bytes for that encoding (any one ruby knows about). I am not aware of any other language that made that choice -- I think it came about because of experience in the Japanese context, although at this point, I think anyone would be insane not to keep all of your in-memory strings in UTF-8 and transcode them on I/O, and kind of wish the language actually encouraged/required that. But, hey, I program in a U.S. context. And like I said, my experience dealing with encoding in ruby has been better than my experience in any other language I've worked in (and I do have to deal with encoding a lot in the software I write) -- I definitely like it better than Java, which did decide all strings had to have a canonical internal encoding (if only it wasn't the pre-unicode-consolidation "UCS-2"!! perhaps that experience, of choosing the in-retrospect wrong canonical internal encoding influenced ruby's choice)
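To make the "no canonical internal encoding" point concrete, here is a rough sketch, assuming Ruby 2.0 or later. The helper name is made up, and the sample text is just three kanji written with `\u` escapes:

```ruby
# Hypothetical helper: report how a string is tagged and how many bytes it holds.
def tag_and_size(str)
  [str.encoding.name, str.bytesize]
end

utf8 = "\u65E5\u672C\u8A9E"      # the word for "Japanese language", UTF-8 literal
sjis = utf8.encode("Shift_JIS")  # same characters, different bytes in memory

p tag_and_size(utf8)             # => ["UTF-8", 9]
p tag_and_size(sjis)             # => ["Shift_JIS", 6]
p utf8 == sjis                   # => false
p sjis.encode("UTF-8") == utf8   # => true
```

Each string keeps the bytes of its own encoding, so the two compare as different until one of them is explicitly transcoded.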

wycats · 11y ago

I agree with the vast majority of what you've said.

One thing worth noting is that there was a TREMENDOUS effort that I headed up in the Rails 3 era to very aggressively attempt to reduce the number of encoding-related problems in Rails, and to make sure that common mistakes produced clear error messages.

I wrote two somewhat lengthy blog posts at the time[1][2] for a contemporary historical perspective just as the difficulty with encodings started to heat up.

One of the goals of the Rails 3 effort was to ensure that strings that made their way into Rails came in as UTF-8. That involved being very careful with templates (I wrote a bit of a novel in the docs that remains to this day[3]), figuring out how to ensure that browser forms submitted their data in UTF-8 (even in IE6[4]), and working with Brian Lopez on mysql2 to ensure that all strings coming in from MySQL were properly tagged with encodings.

I also did a lot of person-to-person evangelism to try to get C bindings to respect the `default_internal` encoding setting, which Rails sets to UTF-8.
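As a rough pure-Ruby stand-in for what respecting `default_internal` at the boundary could look like (a real binding would do this in C). The function name and the Latin-1 sample bytes are assumptions made for the example:

```ruby
# Hypothetical stand-in for the Ruby-facing edge of a C binding.
# raw_bytes:       bytes exactly as the external system delivered them
# source_encoding: the encoding the external system says they are in
def string_from_binding(raw_bytes, source_encoding)
  tagged = raw_bytes.dup.force_encoding(source_encoding) # tag, bytes untouched
  internal = Encoding.default_internal                   # nil unless configured
  internal ? tagged.encode(internal) : tagged            # transcode if asked to
end

Encoding.default_internal = Encoding::UTF_8 # the setting the comment says Rails uses

value = string_from_binding("Jos\xE9".b, "ISO-8859-1")
p value.encoding         # => #<Encoding:UTF-8>
p value.valid_encoding?  # => true
```

The design choice is that application code never sees an untagged or differently tagged string: everything crossing the boundary is already in the one encoding the app works in.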

The net effect of all of that work is that while people experienced a certain amount of encoding-related issues in Rails 3, it was dramatically smaller than the kinds of errors we were seeing when experimental Ruby 1.9 support was first added to Rails 2.3.

---

P.S. I completely agree that the ASCII-7 exception was critical to keeping things rolling in the early days, but I personally would have liked an opt-in setting that would raise an exception when concatenating BINARY that happened to contain ASCII-7-only bytes with an ASCII-compatible string. In practice, this exception allowed a number of obscure C bindings to continue to produce BINARY strings well into the encoding era, and they were responsible for a large percentage (in my experience) of weird production-only bugs.

Specifically, you would have development and test environments that only tested with ASCII characters (people's names, for example). Then, in production, the occasional user would type in something like "José", producing a hard-to-reproduce encoding compatibility exception. This kind of problem is essentially eliminated with libraries that are encoding-aware at the C boundary that respect `default_internal`.
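A sketch of that production-only failure, assuming Ruby 2.0 or later. Both method names are invented, and `String#b` stands in for a C binding that hands back untagged bytes:

```ruby
# Hypothetical stand-in for a legacy C binding: returns the same bytes,
# but tagged BINARY (ASCII-8BIT) instead of their real encoding.
def name_from_legacy_binding(name)
  name.b
end

# The template itself contains a non-ASCII character, as real pages often do.
def greeting_for(name)
  "Ol\u00E1, " + name_from_legacy_binding(name)
rescue Encoding::CompatibilityError => e
  "failed: #{e.class}"
end

puts greeting_for("Jose")       # ASCII-only name: what dev and test data exercise
puts greeting_for("Jos\u00E9")  # accented name: what a production user types
```

The first call passes because the BINARY string happens to hold only 7-bit bytes. The second hits the compatibility check and raises.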

[1]: http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer...

[2]: http://yehudakatz.com/2010/05/17/encodings-unabridged/

[3]: https://github.com/rails/rails/blob/master/actionview/lib/ac...

[4]: http://stackoverflow.com/a/3348524

berdario (OP) · 11y ago

> There's actually way more we can say about this, but this is a huge book already, haha.

Yeah, so much for "I'll try to keep this short" :P

> One difference in encoding between python and ruby I think is, in ruby 1.9+, if a string is tagged with an encoding but contains bytes invalid for that encoding (that do not represent a legal sequence of chars), you'll get an exception if you try to concat it to anything else -- even a string of the same encoding. I don't _think_ that happens in python?

True, also strings in python are immutable, so unless there's some weird way to access the underlying char* with the CPython C API, I don't think that you can have an invalid sequence of bytes inside a unicode string.

(obviously you can have codepoint U+FFFD, if you set errors='replace' when decoding)

> Ruby also doesn't have a canonical internal encoding, strings can be in _any_ encoding it recognizes, tagged with that encoding and containing internal bytes in memory that are actually the bytes for that encoding (any one ruby knows about). I am not aware of any other language that made that choice -- I think it came about because of experience in the Japanese context, although at this point, I think anyone would be insane not to keep all of your in-memory strings in UTF-8 and transcode them on I/O, and kind of wish the language actually encouraged/required that. But, hey, I program in a U.S. context.

Yeah, some time ago I looked into the differences of Python/Ruby encoding, and I wrote down these notes that I just uploaded:

https://gist.github.com/berdario/9b6bd24cafe3817e4773

There are indeed some characters/ideograms that cannot be converted to unicode codepoints, but even if we try to obtain them, we westerners are none the wiser, since we cannot print them to our terminals in a utf-8 locale

About the edit you just added:

> I definitely like it better than Java, which did decide all strings had to have a canonical internal encoding (if only it wasn't the pre-unicode-consolidation "UCS-2"!! perhaps that experience, of choosing the in-retrospect wrong canonical internal encoding influenced ruby's choice)

Yes, but I think that this issue is made more complex by Java's efforts to keep bytecode compatibility.

In a language like Python/Ruby, the bytecode is only an internal implementation detail, upon which you shouldn't rely (you should rely only on the semantics of the source code). If you also keep the actual encoding of your unicode strings an internal implementation detail, this issue can be avoided (without switching to linear-time algorithms for string handling):

Just migrate to UTF-32 (or to a dynamic fixed width encoding like in Python3.3) as the in-memory representation, when parsing strings from the source code, and everything would've continued to work.

I think that it had more to do with the Han unification, rather than with the fear of picking the "wrong encoding"

gsnedders · 11y ago

> (obviously you can have codepoint U+FFFD, if you set errors='replace' when decoding)

Which is totally fine because U+FFFD REPLACEMENT CHARACTER is a totally valid character.
