Fix "ValueError: unknown locale: UTF-8" under Mac OS X

Posted on Mon 04 April 2016 in Notes

This is a problem I've been having since switching from OS X's default bash to oh-my-zsh.

When importing things like matplotlib in Python I get the following error:

ValueError: unknown locale: UTF-8

The problem is that no locale has been set, and the UTF-8 that Python picks up is not a valid locale name: it is only an encoding.

In bash or zsh run

$ locale

If the output looks like this:

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

then you are in trouble. If you are using the US locale, you want it to look something like this:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

If you browse around the web for a solution you will be told to add

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

to all sorts of places: ~/.bash, ~/.profile, /etc/.profile, and the list goes on. If you are running oh-my-zsh, you need to add the two lines above to ~/.zshrc instead, then restart your terminal and Python.
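
To check the result from Python after restarting, something like the following may help. It is only a sketch: the helper names and the regular expression are my own assumptions, and the pattern is deliberately loose (it accepts "C", "POSIX", and names shaped like en_US.UTF-8, and rejects a bare encoding such as UTF-8).

```python
import os
import re

# Loose, hypothetical pattern: accepts "C", "POSIX", or names shaped like
# language_TERRITORY with an optional ".encoding" suffix (e.g. "en_US.UTF-8").
LOCALE_NAME = re.compile(r"^(C|POSIX|[a-z]{2,3}_[A-Z]{2}(\.[A-Za-z0-9-]+)?)$")


def looks_like_locale(value):
    """Return True if value is shaped like a locale name, not a bare encoding."""
    return bool(value) and LOCALE_NAME.match(value) is not None


def check_locale_vars(environ=None):
    """Map each locale variable to whether its value looks like a locale name."""
    if environ is None:
        environ = os.environ
    names = ("LANG", "LC_ALL", "LC_CTYPE")
    return {name: looks_like_locale(environ.get(name, "")) for name in names}


if __name__ == "__main__":
    for name, ok in check_locale_vars().items():
        print(name, "ok" if ok else "missing or not a locale name")
```

If LANG and LC_ALL are reported as ok in a fresh terminal, the two export lines were picked up by the shell.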


Installing Tensorflow in Python 3.5 with Anaconda

Posted on Fri 29 January 2016 in Notes

Since the release of Tensorflow 0.6, support for Python 3.3+ has finally been added.

If you are trying to install Tensorflow into an Anaconda installation with conda, you might be tempted to use this command, which is floating around the web:

# Old tensorflow version
$ conda install -c https://conda.anaconda.org/jjhelmus tensorflow

However, this is a packaged version of Tensorflow 0.5, which won't run on Python 3.3+.

Instead you can install via pip into your Anaconda installation. Activate the environment you want to install into, or just install into root, and then use the following commands:

$ sudo easy_install --upgrade six
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.0-py3-none-any.whl

In the official documentation [1] they link to ...tensorflow-0.5.0-py2-none-any.whl, which is an older version of Tensorflow for Python 2.x. At the time of writing the newest version is 0.7, so I manually updated the version number and Python tag in the URL to fetch and install the version I want.

On OS X El Capitan with Anaconda 2.5, the above does throw an exception, but it happens late enough in the install script that tensorflow is still installed once the script finishes. I suppose this is part of working with pre-1.0 software.
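
Since the installer ends with an exception, it is worth confirming that the package really landed in the active environment. A minimal sketch, assuming only the standard library (the helper name is_importable is my own):

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if Python can locate the top-level module, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is running and whether it can see tensorflow.
    print("Python", ".".join(map(str, sys.version_info[:3])))
    print("tensorflow importable:", is_importable("tensorflow"))
```

Run it with the Python of the environment you installed into. This only shows that the module can be located, not that it imports or works correctly.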

Before installing tensorflow, consider updating your Anaconda installation by running

$ conda update anaconda

[1] https://www.tensorflow.org/versions/0.6.0/get_started/os_setup.html#pip_install


Getting several Python kernels into Jupyter/IPython notebook using Anaconda

Posted on Fri 29 January 2016 in Notes

Notebook showing several available python kernels

To get Jupyter notebooks to use several different Python kernels with Anaconda, do the following from the command line:

$ conda create -n py27 anaconda python=2.7 
$ source activate py27 
$ conda install notebook ipykernel
$ ipython kernel install

In the first line, -n py27 sets the name of the new environment to py27, a handy name for a Python 2.7 environment. anaconda denotes the packages we want installed: the anaconda package contains all the scientific packages you'd expect (numpy, scipy, matplotlib, sklearn, etc.). python=2.7 tells conda to create the environment with Python 2.7.

The second line switches you to the new virtual environment.

The third line installs the notebook and ipykernel packages into the environment, and the fourth registers the environment's Python as a kernel for the notebook.
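
To confirm that a kernel picked in the notebook really runs the environment you expect, a cell like the following can be run under each kernel. This is just a sketch using the standard library. The helper name describe_interpreter is hypothetical.

```python
import sys


def describe_interpreter():
    """Report which Python executable and version the current kernel uses."""
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
    }


if __name__ == "__main__":
    print(describe_interpreter())
```

Under the py27 kernel, the executable path should point into the py27 environment and the version should read 2.7.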


Notes on Kyoto Corpus installation and format

Posted on Thu 21 January 2016 in Notes

Installation

  • Unpack KyotoCorpus4.0.tar.gz (available here)
  • If you do not have a CD drive, copy mai95.txt from the Mainichi Shinbun 1995 CD-ROM to the KyotoCorpus4.0 directory you just unpacked (USB stick from a friend who has an old PC ...)
  • If you have a CD drive, the install script should find the file automatically on the disc.
  • Run ./auto_conv -d . to run the install script and have it look for mai95.txt in the same directory.
  • When installing with CD, just run ./auto_conv
  • On Windows you can install Kyoto Corpus via Cygwin

The install script relies heavily on Perl's encode function, which is deprecated. Expect lots of warnings! The script will probably not run on Perl 6, and it needs a Perl version newer than 5.8 (5.18 on OS X 10.11 works fine!).


Convert encoding for multiple files recursively

Posted on Thu 14 January 2016 in Notes

If you have a large corpus of text files in, say, euc-jp encoding, they can be quite difficult to work with, since most command-line tools on modern systems expect utf-8 files.

iconv converts files from one known encoding to another. One problem on OS X is that the -o option doesn't work, so you have to use the redirect operator > instead. Moreover, you can't redirect onto the file you are reading to overwrite it in place, so converting every file in a large, complex directory structure recursively becomes problematic.

I've found the following to work very well:

find . -type f -exec sh -c "iconv -f eucjp -t UTF-8 {} > {}.utf8"  \; -exec mv "{}".utf8 "{}" \;
  • find finds all files and directories recursively
  • . denotes starting directory. In this case, the current directory and thus everything below as well.
  • -type f limits the search to files only (so no directories will be returned)
  • -exec executes a command for each search result
  • sh -c starts a shell and executes the string following -c
  • iconv -f eucjp -t UTF-8 converts encoding -f(rom) euc-jp -t(o) utf-8
  • {} denotes the search result (filename)
  • > the redirect operator. We run this line via the shell to get this to work, since it doesn't work if run directly via the -exec command (what a mess!)
  • {}.utf8 saves the output to a file with “utf8” as the extension
  • "  \; closes the shell command and closes the -exec command.
  • -exec do another command with the search result
  • mv "{}".utf8 "{}" move the new file to the old filename, thus overwriting the original file
  • \; close the second -exec command.
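
The same convert-then-replace idea can also be sketched in Python, which sidesteps shell quoting for filenames with spaces. This is a rough equivalent under my own assumptions, not a drop-in replacement for the find command: the function name convert_tree is hypothetical, it uses Python's euc_jp codec rather than iconv, and it skips files that do not decode instead of stopping.

```python
import os
import tempfile


def convert_tree(root, src="euc_jp", dst="utf-8"):
    """Re-encode every file under root from src to dst, in place.

    Returns the list of paths that could not be decoded as src; those
    files are left untouched.
    """
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                raw = handle.read()
            try:
                text = raw.decode(src)
            except UnicodeDecodeError:
                skipped.append(path)
                continue
            # Write to a temporary file in the same directory, then swap it in,
            # mirroring the "{}.utf8" + mv step of the find command.
            # Note: the replacement file gets the temp file's permissions.
            fd, tmp_path = tempfile.mkstemp(dir=dirpath)
            with os.fdopen(fd, "wb") as handle:
                handle.write(text.encode(dst))
            os.replace(tmp_path, path)
    return skipped


if __name__ == "__main__":
    for path in convert_tree("."):
        print("skipped (not valid euc-jp):", path)
```

As with the find command, run it on a copy of the corpus first, since the original files are overwritten.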