Some notes on unicode and UTF-8 and its various representations

Posted on Thu 13 November 2014 in Notes

When working with NLP on a wide variety of text, one is bound to encounter encoding trouble of various kinds. Even when everything revolves around Unicode, things are not as straightforward as one could hope.

Here's an example of some trouble with a series of files, each named with the hexadecimal representation of the UTF-8 bytes that encode a single Unicode character.

File names:

E6B7B1.png
E6B48B.png
E6B88B.png
...

In my case they look something like this:

Stroke order diagram for 母, converted to PNG from the KanjiVG project

The problem is to convert these names back into Unicode characters, so that a human can read what each file actually represents. This is easy to do once you know how, but surprisingly hard to find out.

The file names are the hex of the UTF-8 bytes, not Unicode code points. The guide "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" has a nice explanation of what is going on here, the important part being this (emphasis mine):

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

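So what we are seeing are 3 bytes, written in hexadecimal, that together encode one Unicode code point (which in turn refers to an actual character). Note that UTF-8 as standardised today (RFC 3629) uses at most 4 bytes per code point, not 6.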
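To make the difference between a code point and its UTF-8 bytes concrete, here is a minimal sketch. It assumes Python 3 (the rest of this post uses Python 2), and the variable names are only illustrative. It shows the two hex strings side by side and then rebuilds the code point from the three bytes by hand.

```python
# Contrast the code point of a character with its UTF-8 bytes,
# then rebuild the code point from the three bytes by hand.
# Assumes Python 3; variable names are illustrative only.
char = '\u6e0b'  # the character behind the file name E6B88B.png

code_point_hex = format(ord(char), 'X')        # the code point as hex
utf8_hex = char.encode('utf-8').hex().upper()  # the UTF-8 bytes as hex

# A 3-byte UTF-8 sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx.
# The x bits, read left to right, form the code point.
b1, b2, b3 = bytes.fromhex(utf8_hex)
rebuilt = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(code_point_hex)        # 6E0B
print(utf8_hex)              # E6B88B
print(format(rebuilt, 'X'))  # 6E0B
```

The file name carries the second value (E6B88B), which is why reading it as a code point directly does not work.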

In Python 2 we can use the str.decode() method to handle these two conversions. First we convert the ASCII hex representation into the raw bytes it spells out, and then we decode those bytes as UTF-8 to get the Unicode code point they refer to.

>>> 'E6B88B'.decode('hex').decode('utf-8')
u'\u6e0b'
>>> print(u'\u6e0b')
渋
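The session above relies on Python 2, where str has a decode() method and a 'hex' codec. As a hedged sketch of the same two steps in Python 3 (an assumption about the reader's setup, not part of the original script), bytes.fromhex() can stand in for the first step:

```python
# Python 3 sketch of the same two conversions.
# str has no .decode() in Python 3, so we go through bytes explicitly.
name = 'E6B88B'
raw = bytes.fromhex(name)   # hex text -> raw bytes b'\xe6\xb8\x8b'
char = raw.decode('utf-8')  # UTF-8 bytes -> one-character str

# ascii() escapes non-ASCII characters, so this prints the same on any
# terminal; print(char) would show 渋 on a UTF-8 capable one.
print(ascii(char))  # '\u6e0b'

# Going the other way recovers the original file name stem.
roundtrip = char.encode('utf-8').hex().upper()
print(roundtrip)    # E6B88B
```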

My full script (Python 2) is as follows:

# -*- coding: utf-8 -*-
"""
@author: Mads Olsgaard, 2014

Released under BSD3 License.

This script copies .png files whose names are the hexadecimal values of UTF-8 bytes to new files named with the decoded string. Assumes names of the form A1E2B3.png

Thanks to user plaes @ http://stackoverflow.com/a/13358677
"""

import glob, os, shutil

basepath = '../path_to/hex2utf/' # folder where we want to take our files from
targetpath = basepath+'out/' # folder where we want to store the renamed copies (must already exist)

pattern = '??????.png'     # pattern of the file name. In this case we are only looking for 6 character long file names of the png type.
                        # These are not regex. See https://docs.python.org/3/library/fnmatch.html

filelist = glob.glob(basepath+pattern) #load all files in basepath that conform to pattern

# Extract the file name from each path in filelist and strip the '.png' extension.
# The .decode('hex').decode('utf-8') part is where the magic happens. First we convert the ASCII hex string into the bytes it represents
# 'E6B88B' -> '\xe6\xb8\x8b'
# and then we decode those UTF-8 bytes into the unicode character they represent
# '\xe6\xb8\x8b' -> u'\u6e0b', which in unicode-aware applications will show up as '渋'

filenames = [os.path.basename(n)[:-4].decode('hex').decode('utf-8') for n in filelist]

for n,t in zip(filelist, filenames):
    shutil.copy2(n, targetpath+t+'.png')
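For readers on Python 3, where 'E6B88B'.decode('hex') no longer exists, the script could be sketched roughly as below. This is not the original script: the function names decode_hex_stem and copy_with_decoded_names are hypothetical, and unlike the original it creates the target folder and skips names that are not valid UTF-8 hex.

```python
import shutil
from pathlib import Path


def decode_hex_stem(stem):
    """Return the text that a UTF-8 hex stem such as 'E6B88B' stands for.

    Returns None if the stem is not valid hex or not valid UTF-8.
    (Hypothetical helper, not part of the original script.)
    """
    try:
        return bytes.fromhex(stem).decode('utf-8')
    except ValueError:
        # bytes.fromhex raises ValueError on bad hex, and
        # UnicodeDecodeError is itself a subclass of ValueError.
        return None


def copy_with_decoded_names(source_dir, target_dir, pattern='??????.png'):
    """Copy files matching pattern into target_dir under their decoded names.

    Returns the list of paths that were written.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source_dir.glob(pattern)):
        decoded = decode_hex_stem(path.stem)
        if decoded is None:
            continue  # skip names that are not UTF-8 hex
        target = target_dir / (decoded + path.suffix)
        shutil.copy2(path, target)  # copy2 also preserves file metadata
        copied.append(target)
    return copied
```

A call such as copy_with_decoded_names('../path_to/hex2utf/', '../path_to/hex2utf/out/') would then mirror what the Python 2 script does, assuming the file system accepts non-ASCII file names.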

Some other links worth looking at

Python's Unicode HOWTO