Some notes on unicode and UTF-8 and its various representations
Posted on Thu 13 November 2014 in Notes
When working with NLP on a wide variety of text, one is bound to run into encoding trouble. Even when everything revolves around Unicode, things are not as straightforward as one could hope.
Here's an example of some trouble with a series of files, each named with the hexadecimal representation of the UTF-8 bytes of a single character.
File names:
- E6B7B1.png
- E6B48B.png
- E6B88B.png
- ...
In my case they look something like this:
The problem is to convert these back into Unicode text, so that a human can read what the names actually represent. This is easy to do once you know how, but finding out how is the hard part.
The file names are UTF-8 hex, not unicode codepoints. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) guide has a nice explanation of what is going on here, the important part being this (emphasis mine):
Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
So what we are seeing is three bytes, written out in hexadecimal, that together encode a single Unicode codepoint (which in turn refers to an actual character).
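To make the difference between the UTF-8 bytes and the codepoint concrete, here is a small Python 3 sketch that unpacks a three-byte UTF-8 sequence by hand. The helper name `decode_three_byte_utf8` is made up for this illustration. It only handles the three-byte case seen in these file names and skips the extra validity checks a real decoder performs.

```python
def decode_three_byte_utf8(hex_name):
    """Return the codepoint encoded by a 3-byte UTF-8 sequence given as hex."""
    # Unpacking a bytes object yields integers; this fails unless there are exactly 3 bytes.
    b0, b1, b2 = bytes.fromhex(hex_name)
    # A 3-byte sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx.
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("not a 3-byte UTF-8 sequence: " + hex_name)
    # Strip the marker bits and glue the 4 + 6 + 6 payload bits together.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

codepoint = decode_three_byte_utf8('E6B88B')
print(hex(codepoint))          # 0x6e0b
print(ascii(chr(codepoint)))   # '\u6e0b'
```

The file name spells out the bytes E6 B8 8B, while the codepoint they encode is U+6E0B. On a Unicode-aware terminal, `print(chr(codepoint))` shows the character itself.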
In Python 2 we can use the str.decode() method to handle these two conversions (the 'hex' codec is not available on strings in Python 3). First we convert the ASCII hex representation into the raw bytes it spells out, and then we decode those bytes as UTF-8 to get the Unicode codepoint they refer to.
>>> 'E6B88B'.decode('hex').decode('utf-8')
u'\u6e0b'
>>> print(u'\u6e0b')
渋
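The `decode('hex')` call above relies on a Python 2 codec. As a hedged sketch of the same two steps in Python 3, `bytes.fromhex()` handles the hex-to-bytes step and `bytes.decode()` handles the UTF-8 step. The function names `hex_to_text` and `text_to_hex` are just labels chosen for this example.

```python
def hex_to_text(hex_name):
    """Turn a UTF-8 hex string such as 'E6B88B' into the text it encodes."""
    raw = bytes.fromhex(hex_name)   # 'E6B88B' -> b'\xe6\xb8\x8b'
    return raw.decode('utf-8')      # b'\xe6\xb8\x8b' -> '\u6e0b'

def text_to_hex(text):
    """Inverse direction: text -> upper-case UTF-8 hex, as used in the file names."""
    return text.encode('utf-8').hex().upper()

char = hex_to_text('E6B88B')
print(ascii(char))         # '\u6e0b'
print(text_to_hex(char))   # E6B88B
```

The round trip through `text_to_hex` is a quick sanity check that the decoding step did not lose anything.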
My full Python script is as follows:
# -*- coding: utf-8 -*-
"""
@author: Mads Olsgaard, 2014
Released under BSD3 License.
This script copies .png files whose names are hexadecimal values to new files named with the UTF-8 string those values encode. Assumes names of the form A1E2B3.png
Thanks to user plaes @ http://stackoverflow.com/a/13358677
"""
import binascii, glob, os, shutil

basepath = '../path_to/hex2utf/'  # folder where we want to take our files from
targetpath = basepath + 'out/'  # folder where we want to store the renamed files
pattern = '??????.png'  # pattern of the file name. In this case we are only looking for 6 character long file names of the png type.
# These are not regex. See https://docs.python.org/3/library/fnmatch.html

# Make sure the target folder exists, otherwise shutil.copy2 will fail
if not os.path.isdir(targetpath):
    os.makedirs(targetpath)

filelist = glob.glob(basepath + pattern)  # load all files in basepath that conform to pattern

# Extract the file name from each file path in filelist and strip the '.png' extension.
# The unhexlify(...).decode('utf-8') part is where the magic happens. First we convert the ASCII string into the bytes it spells out
# 'E6B88B' -> '\xe6\xb8\x8b'
# (binascii.unhexlify does the same job as .decode('hex'), but also works in Python 3)
# and then we decode those UTF-8 bytes into the unicode character they represent
# '\xe6\xb8\x8b' -> u'\u6e0b', which in unicode aware applications will show up as '渋'
filenames = [binascii.unhexlify(os.path.basename(n)[:-4]).decode('utf-8') for n in filelist]

for n, t in zip(filelist, filenames):
    shutil.copy2(n, targetpath + t + '.png')
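For Python 3, the same idea could look roughly like the sketch below. It is not the original script: the function name `copy_with_decoded_names`, the use of `pathlib`, and the choice to silently skip names that are not valid UTF-8 hex are all assumptions made for this illustration.

```python
import shutil
from pathlib import Path

def copy_with_decoded_names(source_dir, target_dir, pattern='??????.png'):
    """Copy files whose stem is UTF-8 hex into target_dir under the decoded name.

    Returns the list of newly written paths. Files whose stem is not valid
    hex, or not valid UTF-8 once unhexed, are skipped.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source_dir.glob(pattern)):
        try:
            decoded = bytes.fromhex(path.stem).decode('utf-8')
        except ValueError:
            # bytes.fromhex raises ValueError on bad hex; UnicodeDecodeError
            # is a subclass of ValueError, so both failures land here.
            continue
        # Skip decoded names that would not be usable as a plain file name.
        if '/' in decoded or '\\' in decoded or '\0' in decoded:
            continue
        target = target_dir / (decoded + path.suffix)
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

Calling `copy_with_decoded_names('../path_to/hex2utf/', '../path_to/hex2utf/out/')` would mirror the paths used in the script above. Returning the list of written paths makes it easy to check what was actually copied.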
Some other links worth looking at