Unicode strings in python, a gentle intro
08 Dec 2015 #ubuntu #python
Summary
In this post I will try to explain how to handle unicode strings in python 2 and 3.
I had long underestimated the way I handled strings in my projects, but I felt the gravity of handling strings properly when I was working on vocabulary, a side project of mine.
There was this one feature in it where the module had to return the pronunciation for a given word. Well I wrote the logic to parse the content and all the stuff. I had it all figured out, but then I was facing this issue.
Let’s start shall we?
ASCII strings
So let’s start with ASCII strings. Have a look at hi.txt
Let’s see what it holds
tasdik at Acer in ~/unicode
$ cat hi.txt
hi
Nice and easy. It contains two characters, h and i
Size ?
tasdik at Acer in ~/unicode
$ du -a -b hi.txt
2 hi.txt
This means that the file is 2 bytes long. Now what do these 2 bytes hold? Let’s do a hexdump
tasdik at Acer in ~/unicode
$ hexdump hi.txt
0000000 6968
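0000002
If you look at the ASCII table for the hex representations, you will see that the letter h is represented by 68 and i is represented by 69. (By default hexdump prints 2-byte words in the machine's byte order, which is why the two bytes show up swapped as 6968 on this little-endian machine.)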
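The same lookup can be done from Python itself instead of reading it off an ASCII table. This is a minimal sketch, assuming Python 3; the names text and codes are only for illustration.

```python
# Print each character of "hi" next to its ASCII code in hexadecimal.
text = "hi"
codes = [format(ord(ch), "02x") for ch in text]

for ch, code in zip(text, codes):
    print(ch, code)
# h 68
# i 69
```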
Let’s see how python2 handles this. Firing up the interpreter
>>> with open('hi.txt') as f:
...     content = f.read()
...
>>> content
'hi'
>>> type(content)
<type 'str'>
>>>
Now I should probably reiterate the fact that
Every character in an ASCII string is a single byte
And that the ASCII table translates each byte value to a unique character. The file contains an ASCII string of exactly two characters, so a size of 2 bytes makes sense. Let’s dig a little further.
>>> len(content)
2
>>> content[0]
'h'
>>> content[1]
'i'
>>>
So this confirms that content[0] contains h and content[1] contains i
Enter Unicode
So how many characters can ASCII represent? ASCII is a 7-bit code, so it defines only 128 (2^7) characters, and even if you used all 8 bits of a byte you would get at most 256 (2^8). Just a heads up here: Chinese has a lot more than 256 characters. So how would you handle Chinese as well as the characters on your keyboard?
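To see that limit in action, here is a small sketch, assuming Python 3. The helper fits_in_ascii is a made-up name for this post, not something from the standard library.

```python
def fits_in_ascii(text):
    """Return True if every character of text is an ASCII character."""
    try:
        # The ascii codec only knows code points 0..127.
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(fits_in_ascii("hi"))     # True
print(fits_in_ascii("hi猫"))   # False
```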
Have a look at chinese.txt
tasdik at Acer in ~/unicode
$ cat chinese.txt
hi猫
So it contains three characters, namely h, i and 猫. Size?
tasdik at Acer in ~/unicode
$ du -a -b chinese.txt
5 chinese.txt
5 bytes. Let’s see what each byte contains
tasdik at Acer in ~/unicode
$ hexdump chinese.txt
0000000 68 69 e7 8c ab
0000005
The relevant thing to note here is the five hexadecimal numbers 68, 69, e7, 8c and ab
So five numbers, 5 bytes. Good so far? Now how do we interpret these numbers? We will have a look at the Unicode UTF-8 table.
In this table, 68 is the character h, 69 is the character i, and the three-byte sequence e7, 8c, ab is the character 猫. To recap, h is one byte, i is one byte, but 猫 is three bytes.
A point to note here is that UTF-8 is a superset of ASCII, so that’s the reason h and i are represented by the same bytes in both.
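The byte counts above can be checked without a table. A quick sketch, assuming Python 3 (the variable names are mine):

```python
text = "hi猫"

# How many bytes does each character take once encoded as UTF-8?
byte_counts = {ch: len(ch.encode("utf-8")) for ch in text}
print(byte_counts)        # {'h': 1, 'i': 1, '猫': 3}

# The whole string as hex, matching the five bytes in the hexdump.
encoded = text.encode("utf-8")
print(encoded.hex())      # 6869e78cab

# For pure ASCII text, the ASCII and UTF-8 encodings are identical.
print("hi".encode("ascii") == "hi".encode("utf-8"))   # True
```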
Handling unicode strings in python2
>>> with open('chinese.txt') as f:
...     content = f.read()
...
>>> content
'hi\xe7\x8c\xab'
>>> len(content)
5
>>>
What was all that? h and i are represented just fine, but when it comes to the Chinese character, it shows me hexadecimal numbers. And why does it return 5 as the string length, when we know perfectly well that there are just 3 characters in that file?
It turns out that the python2 str doesn’t store a string of characters but a sequence of bytes. Digging further.
>>> content[0]
'h'
>>> content[1]
'i'
>>> content[2]
'\xe7'
>>> content[3]
'\x8c'
>>> content[4]
'\xab'
The hi is returned perfectly fine as those are ASCII characters, each stored in a single byte. The Chinese character, however, is encoded in UTF-8 as three bytes. Since the str object in python2 just stores a sequence of bytes, it has no way of knowing that these 3 bytes should be grouped to represent one Chinese character. So we see them as hexadecimal escapes.
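This is more than a cosmetic problem. Slicing a byte sequence can cut a character in half, and the leftover bytes are no longer valid UTF-8. A rough sketch, assuming Python 3, where the bytes type plays the role of python2's str:

```python
raw = b"hi\xe7\x8c\xab"

# Slicing works on bytes, so this keeps only the first byte of the
# three-byte sequence for 猫.
truncated = raw[:3]

try:
    truncated.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The lone \xe7 is an incomplete UTF-8 sequence.
    decoded_ok = False

print(truncated)    # b'hi\xe7'
print(decoded_ok)   # False
```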
So how should we deal with this?
decode() to the rescue
>>> utf_content = content.decode('utf-8')
>>> utf_content
u'hi\u732b'
>>> type(utf_content)
<type 'unicode'>
>>> len(utf_content)
3
>>> utf_content[0]
u'h'
>>> utf_content[1]
u'i'
>>> utf_content[2]
u'\u732b'
>>>
So decode() tells python to interpret the bytes in content as UTF-8 and convert them into a unicode object. I know, the name is confusing as hell. But let’s leave that for another day.
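The codec name passed to decode() matters, because the same bytes mean different things under different encodings. A small sketch, assuming Python 3; latin-1 is picked here only as an example of a wrong guess:

```python
raw = b"hi\xe7\x8c\xab"

# Correct guess: the three trailing bytes collapse into one character.
as_utf8 = raw.decode("utf-8")

# Wrong guess: latin-1 maps every byte to its own character, so the
# result has five characters and the 猫 is lost.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))   # 3 5
print(hex(ord(as_utf8[2])))           # 0x732b, the code point shown as \u732b
```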
If we call the print statement now, let’s see what we get
>>> print utf_content
hi猫
>>>
So there you go.
Word of caution
Weird things happen in python2 if you think that str is a string. To be safe, convert the str object to a unicode object immediately by doing a decode('utf-8'). Then work with your unicode object and not the str, or else you will have some real pain handling the issues. Like I had in vocabulary
In python2, the unicode type represents real strings whereas a str object is a sequence of bytes.
So when you are done processing your unicode object and want to write it to a file or a database, first convert it back to a sequence of bytes (str object) using the encode() method.
>>> str_content = utf_content.encode('utf-8')
>>> type(str_content)
<type 'str'>
>>> str_content
'hi\xe7\x8c\xab'
>>> content == str_content
True
>>>
Now you will be able to write this content to a file or database, as directly doing so with a unicode object would have given you some weird errors.
Okay, okay. I will show that to you
>>> with open('myfile.txt', 'w') as f:
...     f.write(utf_content)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u732b' in position 2: ordinal not in range(128)
>>>
Now doing the same with the str object
>>> with open('myfile.txt', 'w') as f:
...     f.write(str_content)
...
>>>
Handling unicode strings in python3
Python3 makes handling of unicode strings easy.
One of the significant changes is that str now stores unicode strings and not a sequence of bytes
Let’s see how it handles the chinese.txt file
tasdik at Acer in ~/unicode
$ python3
Python 3.4.2 (default, Jun 19 2015, 11:34:49)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('chinese.txt') as f:
...     content = f.read()
...
>>> type(content)
<class 'str'>
>>> len(content)
3
>>> content[0]
'h'
>>> content[1]
'i'
>>> content[2]
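'猫'
So everything works out of the box (going with the Batteries included philosophy of python). Note that open() without an encoding argument decodes the file using the locale's preferred encoding, which evidently is UTF-8 on this machine.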
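If you would rather not depend on how the machine is configured, you can name the encoding when opening the file. A minimal sketch, assuming Python 3; it builds its own throwaway copy of chinese.txt in a temporary directory so that it runs anywhere:

```python
import os
import tempfile

# Create a throwaway file holding the same five bytes as chinese.txt.
path = os.path.join(tempfile.mkdtemp(), "chinese.txt")
with open(path, "wb") as f:
    f.write(b"hi\xe7\x8c\xab")

# Name the encoding explicitly instead of relying on the default.
with open(path, encoding="utf-8") as f:
    content = f.read()

print(len(content))   # 3
```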
Now what if I wanted to interpret its contents as bytes?
You can do so by passing the mode 'rb' when opening the file
>>> with open('chinese.txt', 'rb') as f:
...     content = f.read()
...
>>> type(content)
<class 'bytes'>
>>> content
b'hi\xe7\x8c\xab'
So now you have got the default behaviour of python2.
Decoding it from utf-8 into a str
>>> content.decode('utf-8')
'hi猫'
So to sum it up
In python3, str represents a unicode string while the bytes type represents a sequence of bytes
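Putting the whole post together: decode bytes as soon as they come in, work on text, and encode again only when the data leaves your program. Here is a rough sketch of that pattern, assuming Python 3. The function shout and its upper-casing step are made up for illustration and stand in for whatever processing you actually need.

```python
def shout(raw_bytes, encoding="utf-8"):
    """Decode incoming bytes, process them as text, encode the result."""
    text = raw_bytes.decode(encoding)   # bytes -> str at the input edge
    text = text.upper()                 # all real work happens on str
    return text.encode(encoding)        # str -> bytes at the output edge


result = shout(b"hi\xe7\x8c\xab")
print(result)                   # b'HI\xe7\x8c\xab'
print(result.decode("utf-8"))   # HI猫
```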
For further reading, I would really, really suggest you have a look at the content written by these guys on this topic