Getting several Python kernels into Jupyter/IPython notebook using Anaconda

Posted on Fri 29 January 2016 in Notes

Notebook showing several available python kernels

To get Jupyter notebooks to use several different Python kernels using Anaconda, do the following from the command line:

$ conda create -n py27 anaconda python=2.7 
$ source activate py27 
$ conda install notebook ipykernel
$ ipython kernel install

In the first line, -n py27 sets the name of the new environment to py27, a handy name for a Python 2.7 environment. anaconda denotes the packages we want installed: the anaconda package contains all the scientific packages you'd expect (numpy, scipy, matplotlib, sklearn, etc.). python=2.7 tells conda to create the environment with Python version 2.7.

The second line switches you to the new virtual environment.

The third line installs the notebook and ipykernel packages into the environment, and the fourth line registers the environment's kernel with IPython notebook.
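
To check which kernels ended up registered, you can run jupyter kernelspec list, or inspect the kernel directories yourself. Each registered kernel is a directory containing a kernel.json file. The following is a minimal sketch, not part of the original recipe: the function name list_kernelspecs is made up for illustration, and the directory used at the bottom is only an assumed location that varies by platform and install type.

```python
import json
from pathlib import Path


def list_kernelspecs(kernels_dir):
    """Return {kernel_name: display_name} for each kernel.json found one level below kernels_dir."""
    specs = {}
    for spec_file in sorted(Path(kernels_dir).glob("*/kernel.json")):
        with spec_file.open(encoding="utf-8") as handle:
            spec = json.load(handle)
        # Fall back to the directory name if the spec has no display name
        kernel_name = spec_file.parent.name
        specs[kernel_name] = spec.get("display_name", kernel_name)
    return specs


if __name__ == "__main__":
    # Assumption: a per-user kernel directory on Linux; adjust for your system
    example_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    if example_dir.is_dir():
        for name, display_name in list_kernelspecs(example_dir).items():
            print(name, "->", display_name)
```

If the py27 kernel was registered, it should show up here next to your default kernel.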


Notes on Kyoto Corpus installation and format

Posted on Thu 21 January 2016 in Notes

Installation

  • Unpack KyotoCorpus4.0.tar.gz (available here)
  • If you do not have a CD drive, copy mai95.txt from the Mainichi Shinbun 1995 CD-ROM to the KyotoCorpus4.0 directory you just unpacked (USB stick from a friend who has an old PC ...)
  • If you have a CD-drive, the install script should find the file automatically from your drive.
  • Run ./auto_conv -d . to run the install script and have it look for mai95.txt in the same directory.
  • When installing with CD, just run ./auto_conv
  • On Windows you can install Kyoto Corpus via Cygwin

The install script relies heavily on Perl's encode function, which is deprecated, so expect lots of warnings! The script will probably not run on Perl 6, and it only runs on versions of Perl newer than 5.8 (5.18 on OSX 10.11 works fine!).


Convert encoding for multiple files recursively

Posted on Thu 14 January 2016 in Notes

If you have a large corpus of text files in, say, euc-jp encoding, they can be quite difficult to work with, since most command-line tools on modern systems expect utf-8 files.

iconv can be used to convert files from one known encoding to another. One problem on OSX is that the -o option doesn't work, so you have to use the redirect operator > instead. Moreover, you can't redirect onto the file you are reading in order to overwrite it, so if you have a large, complex directory structure that you need to traverse recursively to change the encoding of each file, it becomes problematic.

I've found the following to work very well:

find . -type f -exec sh -c "iconv -f eucjp -t UTF-8 {} > {}.utf8"  \; -exec mv "{}".utf8 "{}" \;
  • find finds all files and directories recursively
  • . denotes starting directory. In this case, the current directory and thus everything below as well.
  • -type f limits the search to files only (so no directories will be returned)
  • -exec executes a command for each search result
  • sh -c starts a shell and executes the string following -c
  • iconv -f eucjp -t UTF-8 converts encoding -f(rom) euc-jp to utf-8
  • {} denotes the search result (filename)
  • > the redirect operator. We run this line via the shell to get this to work, since it doesn't work if run directly via the -exec command (what a mess!)
  • {}.utf8 save to a file with “utf8” as the extension
  • "  \; close the shell command string and close the first -exec command.
  • -exec do another command with the search result
  • mv "{}".utf8 "{}" move the new file to the old filename, thus overwriting the original file
  • \; close the second -exec command.
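
The same convert-then-replace idea can be sketched in Python, which avoids the shell quoting altogether. This is an illustrative alternative and not the command above: the function name convert_tree and its default encodings are assumptions chosen for this example. Files that cannot be decoded from the source encoding are skipped rather than overwritten.

```python
from pathlib import Path


def convert_tree(root, source_encoding="euc_jp", target_encoding="utf-8"):
    """Re-encode every file below root in place and return the converted paths."""
    converted = []
    # sorted() materialises the file list first, so temp files created below are not revisited
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_bytes().decode(source_encoding)
        except UnicodeDecodeError:
            # Not valid in the source encoding: leave the file untouched
            continue
        # Write to a sibling file first, then move it over the original
        temp_path = path.with_name(path.name + ".utf8")
        temp_path.write_bytes(text.encode(target_encoding))
        temp_path.replace(path)
        converted.append(path)
    return converted
```

Calling convert_tree(".") would mirror the find command. As with that command, try it on a copy of your corpus first, since the original files are overwritten.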

Running Jupyter notebook with connected qt console and styles

Posted on Tue 12 January 2016 in Notes

When developing a project in Jupyter/IPython notebook, it is often nice to run some test code in a console, especially if you want to check the content of a variable that might have thousands of items, something that might choke your notebook.

Open your notebook as you'd normally do via a terminal:

$ ipython notebook

In a new terminal window, open a connected qtconsole:

$ ipython qtconsole --existing

Since the default color scheme isn't very nice, I usually choose a style, such as monokai:

$ ipython qtconsole --existing --style=monokai

By doing this, you can share code between the notebook and the console, as well as draw graphs in both. This is nice for checking the result of a line of code or running a script that prints thousands of lines.

qtconsole connected to notebook

For a list of styles, run the following code in Python:

In [1]: from pygments.styles import STYLE_MAP
In [2]: print (STYLE_MAP.keys())
Out [2]: dict_keys(['rrt', 'perldoc', 'monokai', 'friendly', 'borland', 'native', 'xcode', 'colorful', 'fruity', 'manni', 'paraiso-light', 'vs', 'emacs', 'bw', 'default', 'murphy', 'igor', 'paraiso-dark', 'trac', 'tango', 'pastie', 'vim', 'autumn'])

Installing MeCab on OSX

Posted on Thu 08 October 2015 in Notes

In the MeCab documentation there are only instructions on how to install on Windows and Linux, but not on OSX, despite the program working quite nicely on Mac.

Kousei Ikeda has easy-to-follow instructions on his Ikekou blog, reprinted here in English.

MeCab

We install from the Terminal, using the following commands:

$ cd ~/Downloads
$ curl -O https://mecab.googlecode.com/files/mecab-0.996.tar.gz
$ tar zxfv mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure
$ make
$ make check
$ sudo make install

This downloads the source from Google Code, unpacks it, compiles it, checks the build and installs MeCab, which the dictionary build below needs.

IPADIC

Next we need to install a dictionary file / language model for MeCab. The recommended model is IPADIC.

$ cd ~/Downloads
$ curl -O https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
$ tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf8
$ make
$ sudo make install

This downloads the source code, unpacks it, and compiles and installs the dictionary using UTF-8 character encoding.

Note that the above commands download from googlecode.com, which is deprecated. The project is no longer accessible from Google Code and is currently hosted on GitHub (here), so it is uncertain how long the zipped source will remain available there.

Check the download section of the documentation for links to the currently newest version of both IPADIC and MeCab.

After this is done, you can use pip to install MeCab bindings for Python like this:

$ pip install mecab-python3
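
As a quick check that everything works together, here is a minimal sketch of calling the bindings and splitting the result into tokens. It assumes the default output format you get with IPADIC (one token per line, the surface form and a comma-separated feature list separated by a tab, with EOS closing the sentence). The helper names parse_mecab_output and tokenize are made up for this example.

```python
def parse_mecab_output(output):
    """Split MeCab's default text output into (surface, features) tuples."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS":
            # End-of-sentence marker: stop reading
            break
        if "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        tokens.append((surface, features.split(",")))
    return tokens


def tokenize(sentence):
    """Run MeCab on a sentence; assumes MeCab, a dictionary and mecab-python3 are installed."""
    import MeCab  # imported here so the parser above works without the bindings

    tagger = MeCab.Tagger()
    return parse_mecab_output(tagger.parse(sentence))
```

Calling tokenize on a short Japanese sentence should return one tuple per token, with the part of speech as the first feature when IPADIC is used.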