Tuesday, September 15, 2009

Chardet Python Library: Determining Character Encoding of Text File or Text Stream

Recently I got into a disagreement with a fellow programmer. He had tried to write a few simple programs in the different programming languages he knows: C, C++, Perl, Python, PHP, and Java (I think it was for self-actualization, obviously not for money or wealth). It turned out that what he wrote was an XML parser in each of those languages.

The code didn't produce the expected output, except of course in Python. He ascertained that this is because most of those languages make use of the libxml2 library (from the GNOME project). He came to the conclusion that the Java library might also be using libxml2, which I believe is most unlikely; Java wouldn't normally pick GNOME stuff for its libraries. It really hit my Java developer pride, and I needed to prove that it is not the library that is incorrect, but rather that the input file is not correctly encoded.

Although in the Western world most text files play along well most of the time, characters beyond the ordinary ASCII set can easily end up displayed incorrectly. And most of the time that happens in the countries that actually use those special characters!

Since I started working in Singapore, I have quickly learned that the ability to display these special characters is one important thing a web application should have, especially when it caters to the Asian market! You will find that clients require you to do i18n: display text in Traditional Chinese characters, in Simplified Chinese characters, in Tamil, in Thai, to name a few...

It brought me to this question:
If I have an arbitrary text file, how could I know which encoding the text uses?

I googled around and found this: the Universal Encoding Detector (chardet) library. The chardet site shows how to do it for web sites:

[sourcecode lang="python"]<br />&gt;&gt;&gt; import urllib<br />&gt;&gt;&gt; urlread = lambda url: urllib.urlopen(url).read()<br />&gt;&gt;&gt; import chardet<br />&gt;&gt;&gt; chardet.detect(urlread("http://google.cn/"))<br />{'encoding': 'GB2312', 'confidence': 0.99}<br /><br />&gt;&gt;&gt; chardet.detect(urlread("http://yahoo.co.jp/"))<br />{'encoding': 'EUC-JP', 'confidence': 0.99}<br />[/sourcecode]


Now here comes the next question for me: if it is not a web site but a file instead, how do I detect the encoding?

So here is the solution:

[sourcecode lang="python"]<br />&gt;&gt;&gt; import chardet<br />&gt;&gt;&gt; fileread = lambda filename: open(filename, "r").read()<br />&gt;&gt;&gt; chardet.detect(fileread("italian-english.xml"))<br />{'confidence': 0.9899999999999999, 'encoding': 'utf-8'}<br /><br />&gt;&gt;&gt; chardet.detect(fileread("utf8_demo.txt"))<br />{'confidence': 0.9899999999999999, 'encoding': 'utf-8'}<br /><br />[/sourcecode]


So this way, I verified that the text file is most likely UTF-8 encoded text.
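
Once the encoding is guessed, the obvious next step is to actually decode the bytes with it. Here is a rough sketch of how I would do it, assuming the detection is trustworthy and chardet did not return None for the encoding:

[sourcecode lang="python"]
>>> import chardet
>>> raw = open("utf8_demo.txt", "rb").read()
>>> guess = chardet.detect(raw)           # e.g. {'encoding': 'utf-8', 'confidence': 0.98...}
>>> text = raw.decode(guess['encoding'])  # now a unicode object, safe to process further
[/sourcecode]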

By the way, the encoding detection engine was derived from (or ported from) Mozilla's auto-detection code, which has been widely used. So I believe it is more or less mature and powerful enough for most usages.
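
One more note: the snippets above read the whole file into memory before detecting, which is fine for small files. If I read the chardet documentation correctly, it also ships an incremental detector that can be fed chunk by chunk and stops as soon as it is confident, roughly like this:

[sourcecode lang="python"]
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
f = open("italian-english.xml", "rb")
for chunk in iter(lambda: f.read(4096), ""):  # read 4 KB at a time
    detector.feed(chunk)
    if detector.done:                         # stop once it is confident enough
        break
f.close()
detector.close()
print detector.result   # a dict like the ones above: {'encoding': ..., 'confidence': ...}
[/sourcecode]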

Another thing to note is that the code above uses lambda notation, a feature of the Python language borrowed from functional programming.
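
For those not familiar with the notation, the fileread lambda above is simply shorthand for a small named function; the two forms below do exactly the same thing:

[sourcecode lang="python"]
>>> fileread = lambda filename: open(filename, "r").read()
>>>
>>> # the same thing written as an ordinary function definition
>>> def fileread(filename):
...     return open(filename, "r").read()
...
[/sourcecode]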
