Making Sense of Python Unicode

(lobstertech.com)

26 points · leecho0 · 16y ago · 8 comments

"But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary."

What? No! UTF-8 takes, at most, 4 bytes per code point.
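The four-byte ceiling is easy to check. A minimal sketch, assuming Python 3 (where `str` is already Unicode); the sample characters are arbitrary picks, not taken from the article:

```python
# Encode one character from each UTF-8 length class and count the bytes.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # ASCII, Latin-1 range, CJK, emoji
lengths = [len(ch.encode("utf-8")) for ch in samples]

# U+10FFFF is the highest code point Unicode defines, so its encoding
# is the longest that valid UTF-8 can produce.
max_len = len(chr(0x10FFFF).encode("utf-8"))

print(lengths, max_len)
```

The five- and six-byte forms existed only in the original UTF-8 design; RFC 3629 restricts the encoding to the Unicode range, which tops out at four bytes.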

"But UTF-8 isn't very efficient at storing Asian symbols, taking a whole three bytes. The eastern masses revolted at the prospect of having to buy bigger hard drives and made their own encodings."

Many Asian users object to UTF-8/Unicode because of the Han Unification, and because many characters supported in other character sets are not present in Unicode. Size of the binary encoding is not the main objection -- most East Asian characters take 2 bytes in UTF-16, and only those outside the Basic Multilingual Plane take 4.
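The per-character sizes under discussion can be measured directly. A rough sketch, assuming Python 3; the two sample characters (one inside the Basic Multilingual Plane, one from CJK Extension B) are illustrative choices only:

```python
# Compare encoded sizes of a common CJK character and a rarer one
# that lives outside the Basic Multilingual Plane.
bmp_char = "\u4e2d"         # U+4E2D, inside the BMP
astral_char = "\U00020000"  # U+20000, CJK Extension B, outside the BMP

def sizes(ch):
    # "utf-16-le" is used so that no byte order mark is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

bmp_sizes = sizes(bmp_char)        # (UTF-8 bytes, UTF-16 bytes)
astral_sizes = sizes(astral_char)
print(bmp_sizes, astral_sizes)
```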

"American programmers: In your day to day grind, it's superfluous to put a 'u' in front of every single string."

American programmers who aren't morons: Use 'u' or the first time somebody tries to run an accent through your code, it'll come out looking like line noise.
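The "line noise" effect is what happens when UTF-8 bytes get interpreted as a single-byte encoding. A small sketch, assuming Python 3 and Latin-1 as the wrongly assumed encoding (the sample string is made up):

```python
# Text with accents, encoded correctly as UTF-8...
original = "na\u00efve caf\u00e9"
wire = original.encode("utf-8")

# ...then decoded with the wrong codec: each accented letter turns into
# two unrelated characters.
garbled = wire.decode("latin-1")

# Decoding with the right codec restores the text.
roundtrip = wire.decode("utf-8")

print(garbled)
print(roundtrip == original)
```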

sp332 16y ago

>>But UTF-8 has a dark side, a single character can take up anywhere between one to six bytes to represent in binary.

>What? No! UTF-8 takes, at most, 4 bytes per code point.

I thought each half of a UTF-16 surrogate pair used 3 bytes in UTF-8, but it turns out that this is an incompatible modification of UTF-8 called CESU-8. http://en.wikipedia.org/wiki/CESU-8
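To see where the six-byte figure comes from, here is a sketch (Python 3 assumed) that encodes one supplementary character both as standard UTF-8 and in the CESU-8 style, where each UTF-16 surrogate is encoded separately. Python has no built-in CESU-8 codec, so this uses the `surrogatepass` error handler as a stand-in; the sample character is arbitrary:

```python
ch = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# Standard UTF-8: one four-byte sequence.
utf8 = ch.encode("utf-8")

# Split the character into its two UTF-16 code units (a surrogate pair).
units = ch.encode("utf-16-be")
high = int.from_bytes(units[0:2], "big")
low = int.from_bytes(units[2:4], "big")

# CESU-8 style: encode each surrogate on its own, three bytes apiece.
# "surrogatepass" lets the UTF-8 codec emit lone surrogates.
cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))

print(len(utf8), len(cesu8))
```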

nas 16y ago

In 2.6 you can use: "from __future__ import unicode_literals". Use b'...' to get a str() instance instead of a unicode() object after that.
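For reference, the split that the import opts into on 2.6 is the default in Python 3: unprefixed literals are text and `b'...'` literals are bytes. A short sketch, written for Python 3 rather than 2.6 (the `u` prefix is accepted again from 3.3 onward); the sample values are made up:

```python
text = u"caf\u00e9"    # explicit text literal
plain = "caf\u00e9"    # unprefixed literal: also text in Python 3
raw = b"caf\xc3\xa9"   # bytes literal: the UTF-8 encoding of the same word

same = text == plain
decoded = raw.decode("utf-8")

# Text counts code points, bytes counts bytes.
print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))
```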

mshafrir 16y ago

Some gotchas: http://stackoverflow.com/questions/809796/any-gotchas-using-...

qw 16y ago

  Lobstertech wrote:
  > American programmers: In your day to day grind,
  > it's superfluous to put a 'u' in front of every single
  > string.

  Good idea, who cares about internationalization? You can
  always just pay someone in India to go over all of your
  code the day you notice the rest of the world.

  Regards,
  European developer

  (... who doesn't want more competition)

s-phi-nl 16y ago

A good tutorial on Python Unicode is http://diveintopython3.org/strings.html. It's also my favorite explanation of Unicode in general.

leecho0 (OP) 16y ago

bonus tip:

don't forget to add this on the first or second line of your source file:

  # -*- coding: utf-8 -*-

also, if you're using vim, make sure your encoding as well as your fileencoding are correct (they're different):

  set encoding=utf-8
  set fileencoding=utf-8

baq 16y ago

> set encoding=utf-8

this will cause problems on Windows.
