Bell curves, normal distributions and Gaussians

Posted on Mon 27 October 2014 in Notes

While it is much better to refer to such a curve as a ‘normal distribution’ than as a ‘bell curve’, if you really want to fit into the Statistical NLP or pattern recognition communities, you should instead learn to refer to these functions as Gaussians, and to remark things like, ‘Maybe we could model that using 3 Gaussians’ at appropriate moments.

– Foundations of Statistical Natural Language Processing


Showing Japanese characters in Matplotlib on Ubuntu

Posted on Mon 27 October 2014 in Notes

TL;DR: Install Japanese language support and insert the following in your Python script:

import matplotlib
matplotlib.rc('font', family='TakaoPGothic')

If you are working with any kind of NLP in Python that involves Japanese, it is paramount to be able to view summary statistics in the form of graphs that in one way or another include Japanese characters.

Below is a graph showing Zipf's Law for the distribution of characters used in [Tetsuko Kuroyanagi](http://en.wikipedia.org/wiki/Tetsuko_Kuroyanagi)'s ‘Totto Channel’, the sequel to her famous “Totto-Chan: The Little Girl at the Window”.


Character Distribution of 100 most used characters - but which ones?


On most systems, Matplotlib will not be able to display Japanese characters out of the box. This is a big problem: the graph above is completely useless for even the most basic investigation.

I've tested on OSX, Windows 8 and Ubuntu, and only OSX works out of the box, despite my Windows installation being Japanese!

Most advice online will tell you to change the font used by Matplotlib, but if you are on Ubuntu it might not be obvious which font to use! Moreover, there are many ways to change the font.

I've found that the simplest way to change fonts is to use matplotlib.rc:

import matplotlib
matplotlib.rc('font', family='Monospace')

In family you can either insert the name of a font family (as in the example above) or the name of a specific font, which is what you want to do in this case. But which one? I wrote the following script to check which fonts will work.

# -*- coding: utf-8 -*-
"""
Matplotlib font checker
Prints a figure displaying a variety of system fonts and their ability to produce Japanese text

@author: Mads Olsgaard, 2014

Released under BSD License.
"""

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import font_manager

fonts = ['Droid Sans', 'Vera', 'TakaoGothic', 'TakaoPGothic', 'Liberation Sans', 'ubuntu', 'FreeSans', 'Droid Sans Japanese', 'DejaVu Sans']
#fonts = ['Arial', 'Times New Roman', 'Helvetica'] #uncomment this line on Windows and see if it helps!
english = 'The quick ...'
japanese = '日本語'
x = 0.1
y = 1

# Build the headline row
plt.text(x+0.5, y, 'english')
plt.text(x+0.7, y, 'japanese')
plt.text(x, y, 'Font name')
plt.text(0, y-0.05, '-'*100)
y -= 0.1

for f in fonts:
    matplotlib.rc('font', family='DejaVu Sans')  # render the font's name itself in a font we know exists
    plt.text(x,y, f+':')
    matplotlib.rc('font', family=f)
    plt.text(x+0.5,y, english)
    plt.text(x+0.7, y, japanese)
    y -= 0.1
    print(f, font_manager.findfont(f))  # Sanity check. Prints the location of the font. If the font is not found, an error message is printed and the location of the fallback font is shown

plt.show()

On Ubuntu the output should be the following:

Droid Sans /usr/share/fonts/truetype/droid/DroidSans.ttf
Vera /home/supermads/anaconda3/lib/python3.4/site-packages/matplotlib/mpl-data/fonts/ttf/Vera.ttf
TakaoGothic /usr/share/fonts/truetype/takao-gothic/TakaoGothic.ttf
TakaoPGothic /usr/share/fonts/truetype/takao-gothic/TakaoPGothic.ttf
Liberation Sans /usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf
ubuntu /usr/share/fonts/truetype/ubuntu-font-family/Ubuntu-R.ttf
FreeSans /usr/share/fonts/truetype/freefont/FreeSans.ttf
Droid Sans Japanese /usr/share/fonts/truetype/droid/DroidSansJapanese.ttf
DejaVu Sans /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf

As you can see, I'm running Anaconda Python 3, and if Anaconda can't find a font, it falls back to its own folder and loads the Vera font.

Font check

Surprisingly, Droid does support Japanese; it just keeps the Japanese character range in a separate font file, which renders it useless for this purpose. However, the Takao font family does work for our purpose.

Takao fonts should be installed by default if you set your location somewhere in Japan during installation of Ubuntu, or if you have installed support for the Japanese language in System Settings → Language Support (just hit the super key and search for 'language'). I recommend this, since it will also install the Japanese input method, Anthy.

You can also use apt-get, like this from the command line (not tested):

sudo apt-get install fonts-takao-mincho fonts-takao-gothic fonts-takao-pgothic
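
One caveat I would add: if Matplotlib keeps falling back to Vera even after the fonts are installed, it may be reading a stale font cache. A minimal sketch of clearing it from Python (deleting the cache directory is my workaround, not an official procedure; the cache is rebuilt on the next import):

import shutil

import matplotlib

print(matplotlib.get_cachedir())  # typically ~/.cache/matplotlib on Linux
shutil.rmtree(matplotlib.get_cachedir())  # force a rebuild on the next import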

And now we can finally see which characters Kuroyanagi used the most for her sequel:

Character Distribution of 100 most used characters

And apparently, that's the Japanese comma, also called 読点 (tōten).
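
For the curious, here is a minimal sketch of how such a frequency plot can be produced. This is my reconstruction, not the original script, and it assumes the text sits in a UTF-8 file called totto.txt (a hypothetical filename):

# -*- coding: utf-8 -*-
# Sketch: plot the 100 most used characters in a Japanese text.
from collections import Counter

import matplotlib
matplotlib.rc('font', family='TakaoPGothic')  # so the tick labels can render kana/kanji
import matplotlib.pyplot as plt

with open('totto.txt', encoding='utf-8') as f:
    counts = Counter(ch for ch in f.read() if not ch.isspace())

chars, freqs = zip(*counts.most_common(100))  # 100 most frequent characters
plt.bar(range(len(freqs)), freqs)
plt.xticks(range(len(chars)), chars, fontsize=7)
plt.title('Character distribution of 100 most used characters')
plt.show()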


Frequentist vs. Bayesian

Posted on Sun 13 April 2014 in Notes

Was reading Roger Levy's Probabilistic Models in the Study of Language (draft) when I got a proper introduction to the words 'frequentist' and 'Bayesian' for the first time. It had never dawned on me that there are these two fundamentally different ways of viewing probability, and it has been on my mind ever since.

Here's the relevant quote:

You and your friend meet at the park for a game of tennis. In order to determine who will serve first, you jointly decide to flip a coin. Your friend produces a quarter and tells you that it is a fair coin. What exactly does your friend mean by this?

A translation of your friend’s statement into the language of probability theory would be that the tossing of the coin is an experiment—a repeatable procedure whose outcome may be uncertain—in which the probability of the coin landing with heads face up is equal to the probability of it landing with tails face up, at 1/2.

In mathematical notation we would express this translation as P(Heads) = P(Tails) = 1/2.

This mathematical translation is a partial answer to the question of what probabilities are.

The translation is not, however, a complete answer to the question of what your friend means, until we give a semantics to statements of probability theory that allows them to be interpreted as pertaining to facts about the world. This is the philosophical problem posed by probability theory.

Two major classes of answer have been given to this philosophical problem, corresponding to two major schools of thought in the application of probability theory to real problems in the world.

One school of thought, the frequentist school, considers the probability of an event to denote its limiting, or asymptotic, frequency over an arbitrarily large number of repeated trials. For a frequentist, to say that P(Heads) = 1/2 means that if you were to toss the coin many, many times, the proportion of Heads outcomes would be guaranteed to eventually approach 50%.

The second, Bayesian school of thought considers the probability of an event E to be a principled measure of the strength of one’s belief that E will result. For a Bayesian, to say that P(Heads) for a fair coin is 0.5 (and thus equal to P(Tails)) is to say that you believe that Heads and Tails are equally likely outcomes if you flip the coin. A popular and slightly more precise variant of Bayesian philosophy frames the interpretation of probabilities in terms of rational betting behavior, defining the probability π that someone ascribes to an event as the maximum amount of money they would be willing to pay for a bet that pays one unit of money. For a fair coin, a rational bettor would be willing to pay no more than fifty cents for a bet that pays $1 if the coin comes out heads.

... Fortunately, for the cases in which it makes sense to talk about both reasonable belief and asymptotic frequency, it’s been proven that the two schools of thought lead to the same rules of probability.
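
To make the frequentist reading concrete, here is a quick toy simulation (my own, not from Levy): the running proportion of heads drifts towards 0.5 as the number of tosses grows.

import random

random.seed(0)
for n in [10, 100, 1000, 10000, 100000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # the proportion of heads approaches P(Heads) = 1/2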

I'm having a hard time understanding the Bayesian argument here. The only reason you would want to bet $0.49 to win $1 in a 50/50 bet is if you are able to repeat the bet a large number of times. Otherwise you stand to lose $0.49 - and what if that was actually a lot of money to you? In this sense, the idea of betting "no more than fifty cents" is the frequentist idea that, when repeating the bet many times, your winnings will converge to zero or higher.

However, Levy goes on to refer to a paper by Cox (1946) which makes a great counterpoint to the frequentist view of the world.

... there are probabilities in the sense of reasonable expectation [Bayesian] for which no ensemble exists

Here an ensemble is all the draws from a random distribution - or, in the former case, a lot of coin tosses. Cox continues:

Thus when the probability is calculated that more than one planetary system exists in the universe, it is barely tenable even as an artifice that this refers to the number of universes, all resembling in some way the universe, which by definition is all-inclusive.

And this is where I feel a paradox starting to creep in. Of course we can assign probabilities to situations that cannot be repeated. But on the other hand: if I throw a coin and it turns out heads, wasn't that throw, in retrospect, certainly a head?

Does it make sense to talk about the probability of a single, actual outcome, since that outcome is certainly what it was?

I suppose the answer is: if the world is deterministic, then the frequentist theory of probability doesn't make sense. A coin toss is not a 50/50 random outcome; it is an event about which we have insufficient knowledge to properly predict it. And so a Bayesian would say: "Given my insufficient knowledge, what should I bet on, and how much?"


Linear separator, Perceptron, SVM

Posted on Thu 13 February 2014 in Notes

A linear separator is an algorithm that separates two datasets via a straight line.

  • A Perceptron is a linear separator that separates at the first line it finds.
  • An SVM (Support Vector Machine) separates at the "best" line, i.e. the one with the largest distance to the nearest points (the black line in the usual illustration); see the sketch below.
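
A minimal sketch of the difference, using scikit-learn on toy data (my own illustration, not from the original source):

# Sketch: Perceptron vs. linear SVM on linearly separable toy data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

perceptron = Perceptron().fit(X, y)  # settles on the first separating line it finds
svm = LinearSVC().fit(X, y)          # picks the max-margin line instead

# Both find *a* separating line; the SVM's line keeps the largest
# distance to the nearest points on each side.
print('Perceptron:', perceptron.coef_[0], perceptron.intercept_)
print('SVM:       ', svm.coef_[0], svm.intercept_)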



Control flow

Posted on Thu 19 December 2013 in Notes

Control flow tools are something I've always been really bad at beyond basic loops and if-else statements - or actually something I never really knew what was. But today, as I was prototyping a script, I wanted to write up some program structure and got hit hard in the head with the importance of understanding some of the "exotic" control flow tools.

if inpt[0] == ' ':
    #initialize splitting of current word
if inpt == 'redo':
    #initialize code for resplitting entire tweet
if inpt.upper() in tags:
    saveline(w+sep+inpt.upper(), f)
if inpt == 'quit':
    quit()

Will give you a very hard to understand error

File "", line 3
    if inpt == 'redo':
     ^
IndentationError: expected an indented block

But my indentations are okay!

Pass

pass is a control flow tool that stands in for nothing. Python expects something after an if-statement, so in the example above it assumes the following if-statement is nested - but since it hasn't been indented properly, Python throws an error.

pass tells Python that this too, shall pass.

def thistooshallpass():
    pass

won't throw any errors, and lets you place functions and other structures in your code, waiting to be filled with actual instructions once future-you gets off your lazy bum.
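
Applied to the snippet from before, pass (standing in until real code arrives) fills each empty branch, and the IndentationError disappears:

if inpt[0] == ' ':
    pass  # TODO: initialize splitting of current word
if inpt == 'redo':
    pass  # TODO: initialize code for resplitting entire tweet
if inpt.upper() in tags:
    saveline(w + sep + inpt.upper(), f)
if inpt == 'quit':
    quit()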

Break & Continue

These are pretty cool too. break stops a for- or while-loop prematurely, while continue skips ahead to the next iteration of a for- or while-loop.

In [7]: a = 0

In [8]: while True:
   ...:     break
   ...:     a += 1
   ...: print a
   ...: 
0

a never grows, because the break statement broke out of the while-loop (and saved us from an infinite loop too!)

In [9]: for i in range(10):
   ...:         if i < 5:
   ...:                 continue
   ...:         break
   ...: print i
   ...: 
5

Here the break statement is not reached while i < 5; but once i hits 5, the continue statement is no longer executed, so the interpreter reaches the break statement and terminates the loop, giving us a nice 5.