Running a Jupyter notebook with a connected Qt console and styles

Posted on Tue 12 January 2016 in Notes

When developing a project in a Jupyter/IPython notebook, it is often nice to run some test code in a console, especially if you want to check the contents of a variable that might have thousands of items – something that might choke your notebook.

Open your notebook as you'd normally do via a terminal:

$ ipython notebook

In a new terminal window, open a connected qtconsole:

$ ipython qtconsole --existing

Since the default color scheme isn't very nice, I usually choose a style, such as monokai:

$ ipython qtconsole --existing --style=monokai

By doing this, you can share code between the notebook and the console, as well as draw graphs in both. This is nice for checking the result of a line of code or running a script that prints thousands of lines.

qtconsole connected to notebook

For a list of styles, run the following code in Python:

In [1]: from pygments.styles import STYLE_MAP
In [2]: print(STYLE_MAP.keys())
dict_keys(['rrt', 'perldoc', 'monokai', 'friendly', 'borland', 'native', 'xcode', 'colorful', 'fruity', 'manni', 'paraiso-light', 'vs', 'emacs', 'bw', 'default', 'murphy', 'igor', 'paraiso-dark', 'trac', 'tango', 'pastie', 'vim', 'autumn'])

Installing MeCab on OSX

Posted on Thu 08 October 2015 in Notes

The MeCab documentation only has instructions for installing on Windows and Linux, not on OSX, despite the program working quite nicely on Mac.

Kousei Ikeda has easy-to-follow instructions on his Ikekou blog, reprinted here in English.

MeCab

We install from the Terminal, using the following commands:

$ cd ~/Downloads
$ curl -O https://mecab.googlecode.com/files/mecab-0.996.tar.gz
$ tar zxfv mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure
$ make
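$ make check
$ sudo make install

This downloads the source from Google Code, unpacks it, compiles it, checks the build and installs it. MeCab has to be installed before the next step, because the dictionary build relies on the installed MeCab tools.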

IPADIC

Next we need to install a dictionary file / language model for MeCab. The recommended model is IPADIC.

$ cd ~/Downloads
$ curl -O https://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
$ tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf8
$ make
$ sudo make install

This downloads the source code, unpacks it, compiles the dictionary using UTF-8 character encoding and installs it.

Note that the commands above download from googlecode.com, which is deprecated. The project page is no longer accessible on Google Code and the project is currently hosted on GitHub, so it is uncertain how long the zipped source will remain available at the old address.

Check the download section of the documentation for links to the currently newest version of both IPADIC and MeCab.

After this is done, you can use pip to install the MeCab bindings for Python like this:

$ pip install mecab-python3
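
To check that the bindings and the dictionary work together, something like the following sketch can be used. It assumes the MeCab module provided by mecab-python3, its Tagger class and the -Owakati output option (space-separated tokens). The helper name tokenize and the sample sentence are made up for illustration, and depending on the mecab-python3 version the tagger may need to be pointed at the installed dictionary explicitly.

```python
def tokenize(text):
    """Split Japanese text into tokens using MeCab's wakati output mode."""
    # Imported inside the function so that a missing or broken installation
    # only raises an error when the function is actually called.
    import MeCab

    tagger = MeCab.Tagger("-Owakati")  # assumption: default dictionary is found
    return tagger.parse(text).split()


# Example call (uncomment once MeCab and a dictionary are installed):
# print(tokenize("すもももももももものうち"))
```

If the call raises an error instead of returning a list of tokens, the dictionary is most likely not where the bindings expect it.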

Evolutionary algorithms

Posted on Thu 28 May 2015 in Notes

Terms:

  • Evolutionary Algorithm (EA)
  • Population: A collection of strings of symbols. In the case of fooling an image classifier, the population is the list of images used throughout the test.
  • Chromosomes: Each string (or image) in the population is called a chromosome.
  • Initial population: This is our starting point. All chromosomes in the initial population are randomly (or heuristically) generated.
  • Fitness: Each chromosome has a fitness value attached to it, determined by our fitness function. For [the Traveling Salesman problem](https://en.wikipedia.org/wiki/Travelling_salesman_problem "Traveling Salesman problem on Wikipedia") it is the total length travelled for a particular route (or chromosome). When fooling image classifiers, the fitness is the confidence with which the classifier assigns a chromosome to a particular class.
  • Evaluation: Evaluating a chromosome is simply determining its fitness value.
  • Selection: The EA chooses the chromosomes with the highest fitness following some rule (the top 100, or everything above a threshold).
  • Parent: A parent is simply a chromosome that has been selected. From parents we generate offspring.
  • Offspring: The EA generates new chromosomes from the selection. This can be done in several ways:
    • Crossover / recombination: generating new chromosomes by recombining two or more parents.
    • Mutation: generating offspring by modifying a single parent, typically in some random way.
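
The terms above fit together in a loop. Here is a minimal sketch of that loop, assuming a toy problem in which chromosomes are bit strings and fitness is simply the number of 1s. The function names (fitness, crossover, mutate, evolve) and all parameter values are made up for illustration; this is one simple variant, not a reference implementation.

```python
import random


def fitness(chromosome):
    """Toy fitness function (an assumption): count the 1s in the bit string."""
    return sum(chromosome)


def crossover(parent_a, parent_b, rng):
    """One-point crossover: recombine two parents into one offspring."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


def mutate(parent, rng, rate=0.05):
    """Mutation: flip each bit of a single parent with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in parent]


def evolve(length=20, pop_size=30, n_parents=10, generations=40, seed=0):
    rng = random.Random(seed)
    # Initial population: randomly generated chromosomes.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        # Evaluation and selection: keep the n_parents fittest chromosomes.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        history.append(fitness(parents[0]))
        # Offspring: crossover of two distinct parents, followed by mutation.
        offspring = []
        while len(offspring) < pop_size - n_parents:
            parent_a, parent_b = rng.sample(parents, 2)
            offspring.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Parents survive unchanged, so the best fitness never decreases.
        population = parents + offspring
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print(fitness(best), history)
```

Keeping the parents in the next generation is a design choice in this sketch; other variants replace the whole population with offspring.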

http://web.stcloudstate.edu/bajulstrom/aboutEC.html


Extracting tables from a regular HTML document using regex

Posted on Sun 10 May 2015 in Notes

If you know the structure of your HTML is regular, it may sometimes be easier to do a quick and dirty regex extraction job than to fire up Beautiful Soup.

The main limitation of using regex is that we cannot properly parse nested tags. If we are searching for opening and closing `<table>` tags, and somewhere we have:

<table>
  <table>
    ....

  </table>

</table>

The result will be a disappointing:

<table>
  <table>
    ....

  </table>

So in order for this to work, we have to be confident that the tags we are trying to extract are never nested.

In Python we can build the pattern and extract the tables like this:

import re

# Non-greedy match from an opening <table ...> up to the nearest closing tag
pattern = re.compile(r'<table.*?</table>', re.DOTALL)

with open("document.html", 'r') as infile:
    html = infile.read()
    tables = pattern.findall(html)

The re.DOTALL flag (https://docs.python.org/3.4/library/re.html#re.DOTALL) ensures that . matches newlines, which we need since the tables span multiple lines.
If we used only .* instead of .*? to match the content within the table tags, the match would be greedy: it would run from the first opening tag to the last closing tag, swallowing everything between the first and the last table in the document.
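
To make the difference concrete, here is a small self-contained sketch. The HTML strings are made up for illustration and the variable names are arbitrary.

```python
import re

# Made-up sample document with two sibling (not nested) tables.
html = """<p>intro</p>
<table>
  <tr><td>a</td></tr>
</table>
<p>between</p>
<table class="x">
  <tr><td>b</td></tr>
</table>
<p>outro</p>"""

lazy = re.compile(r'<table.*?</table>', re.DOTALL)   # non-greedy
greedy = re.compile(r'<table.*</table>', re.DOTALL)  # greedy

lazy_tables = lazy.findall(html)      # one match per table
greedy_tables = greedy.findall(html)  # a single match spanning both tables

# Nested tables break the non-greedy pattern too:
# the match stops at the first closing tag it finds.
nested = "<table>\n  <table>\n    ....\n  </table>\n</table>"
nested_tables = lazy.findall(nested)

if __name__ == "__main__":
    print(len(lazy_tables), len(greedy_tables))
    print(nested_tables[0])
```

The greedy result includes the paragraph between the two tables, which is exactly what the non-greedy pattern avoids.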


Connect an IPython REPL to an existing notebook

Posted on Wed 06 May 2015 in Notes

This is a really smart trick if you are developing using IPython Notebooks. What you do is open a REPL that is connected to the notebook you are working on (think of it as an IPython instance that shares its functions and variables with your notebook).

This is great as a scratch pad, for quickly checking variables, and for long print statements that would otherwise bog down the browser.

Simply run the following command from the terminal after opening the notebook server:

$ ipython console --existing

In recent versions of Anaconda, I get a very long error message that ends with:

ImportError: No module named 'jupyter_console'

To fix this, simply install jupyter_console using pip from the terminal, like so:

$ pip install jupyter_console

I have no idea why this isn't installed by default with Anaconda.
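
To check up front whether the module is available, a small sketch using the standard library's importlib can help. The helper name has_module is made up for illustration.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("jupyter_console"):
        print("jupyter_console is installed")
    else:
        print("jupyter_console is missing; run: pip install jupyter_console")
```

Run it with the same Python interpreter that Anaconda uses, otherwise the answer may refer to a different environment.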