On science and engineering

Posted on Fri 27 February 2015 in Notes

Science: If you know what you are doing, you are doing it wrong.

Engineering: If you don't know what you are doing, you are doing it wrong.

Dr. Richard W. Hamming, 1995, "Learning to Learn"


Measuring success of Voice Activity Detection algorithms: HR0 and HR1

Posted on Fri 06 February 2015 in Notes

When measuring the effectiveness of a Voice Activity Detection (VAD) algorithm, looking at 0-1 accuracy is rarely enough. We typically also look at the Nonspeech Hit Rate (HR0) and the Speech Hit Rate (HR1).

  1. HR0 is computed as the ratio of the number of correctly detected nonspeech frames to the number of real nonspeech frames.
  2. HR1 is computed as the ratio of the number of correctly detected speech frames to the number of real speech frames.

Park et al. 2014 [1]

Another way to put it is _the percentage of nonspeech and speech frames that are correctly predicted_. In Python, this can be calculated in the following way:

import numpy as np
import our_vad_library as VAD   # placeholder name for whatever VAD implementation is in use

X = VAD.load_data()      # feature frames
y = VAD.load_targets()   # ground-truth labels: 0 = nonspeech, 1 = speech

y_hat = VAD.predict(X)   # predicted labels

# Find nonspeech and speech hit rates:
index0 = np.where(y == 0)
index1 = np.where(y == 1)

hr0 = (y_hat[index0] == y[index0]).mean()
hr1 = (y_hat[index1] == y[index1]).mean()

First we create two indices into y using numpy's where() function. index0 is a vector of all the positions of y that represent a nonspeech frame in our data. Say y = [0,0,0,1,1,0], then index0 = [0,1,2,5], since y[0] = y[1] = y[2] = y[5] = 0.

This means that

print y[index0] 
# -> [0,0,0,0]

Which in and of itself is not interesting. However, we can use the same index to pull out all the predictions in ŷ and compare them to the ground truth in y:

y_hat[index0] == y[index0]
# -> array([ True,  True, False, ...], dtype=bool)

This gives us a new array of the same dimensions with boolean True or False values. Each True represents a correct prediction and each False an incorrect one. A neat Python trick is that boolean values are treated as 0 and 1, so we can take the mean of this boolean array with .mean() to get the fraction of correct predictions.
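As a quick illustration of that trick (a toy example, separate from the VAD code above), taking the mean of a small boolean array gives the fraction of True values:

import numpy as np

correct = np.array([True, True, False])  # two correct predictions out of three
print(correct.mean())                    # -> ~0.667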


[1] Park, Jinsoo, Wooil Kim, David K. Han, and Hanseok Ko. “Voice Activity Detection in Noisy Environments Based on Double-Combined Fourier Transform and Line Fitting.” The Scientific World Journal 2014 (August 6, 2014): e146040. doi:10.1155/2014/146040.


Quote

Posted on Mon 26 January 2015 in Notes

If [the curse of dimensionality] problem didn't exist, we would use the nearest neighbour averaging as the sole basis for doing estimation.

Trevor Hastie, from the "Dimensionality and structured models" lecture in his Statistical Learning course at Stanford.


Some notes on unicode and UTF-8 and its various representations

Posted on Thu 13 November 2014 in Notes

When working with NLP on a wide variety of text, one is bound to run into encoding trouble. Even when everything revolves around Unicode, things are not as straightforward as one could hope.

Here's an example of some trouble with a series of files whose names are the hexadecimal representation of the UTF-8 encoding of a single Unicode character.

File names:

  • E6B7B1.png
  • E6B48B.png
  • E6B88B.png
  • ...

In my case they look something like this:

[Image: stroke order diagram for 母, converted to PNG from the KanjiVG project]

The problem is to convert these names back into Unicode, so a human can read which characters they actually represent. This is easy to do, but surprisingly hard to find out how.

The file names are UTF-8 hex, not unicode codepoints. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) guide has a nice explanation of what is going on here, the important part being this (emphasis mine):

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

So what we are seeing is three bytes, written in hexadecimal, that encode some Unicode codepoint (which in turn refers to an actual character).

In Python 2 we can use the .decode() method on strings to handle these two conversions. First we decode the ASCII hex representation into the raw bytes it stands for, and then we decode those bytes as UTF-8 into the Unicode codepoint they refer to.

>>> 'E6B88B'.decode('hex').decode('utf-8')
u'\u6e0b'
>>> print(u'\u6e0b')
渋
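In Python 3, strings no longer have a .decode() method, so the same two steps would go through bytes.fromhex() instead; a minimal sketch:

# Python 3 equivalent of the two-step conversion above
raw = bytes.fromhex('E6B88B')   # -> b'\xe6\xb8\x8b'
char = raw.decode('utf-8')      # -> '\u6e0b', i.e. 渋
print(char)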

My full Python script is as follows:

# -*- coding: utf-8 -*-
"""
@author: Mads Olsgaard, 2014

Released under BSD3 License.

This script renames .png files whose names are hexadecimal UTF-8 byte values into the string they encode. Assumes names of the form A1E2B3.png

Thanks to user plaes @ http://stackoverflow.com/a/13358677
"""

import glob, os, shutil

basepath = '../path_to/hex2utf/' #folder where we want to take our files from
targetpath = basepath+'out/' # path to where we want to store the renamed files

pattern = '??????.png'     # pattern of the file name. In this case we are only looking for 6 character long file names of the png type.
                        # These are not regex. See https://docs.python.org/3/library/fnmatch.html

filelist = glob.glob(basepath+pattern) #load all files in basepath that conform to pattern

# Extract the filename from each file path in filelist, truncate the '.png' section
# The .decode('hex').decode('utf-8') part is where the magic happens. First we convert the ASCII string into the hex values it represents
# 'E6B88B' -> '\xe6\xb8\x8b'
# and then we convert the hex-code into the UTF8 unicode character that it represents
# '\xe6\xb8\x8b' -> u'\u6e0b', which in unicode aware applications will show up as '渋'

filenames = [os.path.basename(n)[:-4].decode('hex').decode('utf-8') for n in filelist]

for n,t in zip(filelist, filenames):
    shutil.copy2(n, targetpath+t+'.png')
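If the same renaming were done under Python 3, where str.decode('hex') no longer exists, a rough equivalent of the loop might look like this (same assumed folder layout and file name pattern):

import glob, os, shutil

basepath = '../path_to/hex2utf/'   # folder where we take our files from
targetpath = basepath + 'out/'     # folder to store the renamed files
pattern = '??????.png'             # six-character hex names, .png extension

for path in glob.glob(basepath + pattern):
    hexname = os.path.basename(path)[:-4]           # e.g. 'E6B88B'
    char = bytes.fromhex(hexname).decode('utf-8')   # e.g. '渋'
    shutil.copy2(path, targetpath + char + '.png')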



Basics of Kana-kanji conversion

Posted on Tue 04 November 2014 in Notes

A kana-kanji conversion system's goal is to convert a string of phonetic characters – kana – into a string of mixed logographic kanji and phonetic kana. That is, we want to go from ごはんをたべます → ご飯を食べます (to eat a meal).

Let $W$ be a sequence of words $\{w_1, w_2, \ldots, w_n\}$ in a sentence (in the above example $W$ would be ご飯を食べます). $P(W)$ is the probability that this sentence occurs in the language – in practice, how many times this exact sentence shows up in our training data. Since very few sentences show up in written language more than once, the chance that we will have to deal with a sentence that does not exist in our training set is quite high. Therefore we tend to model $W$ as a collection of uni-, bi-, and trigrams and their associated frequencies. $P(W)$ is also known as our language model.
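As a rough illustration (the corpus and tokenisation below are made-up stand-ins, not actual training data), a minimal bigram language model can score a tokenised sentence by the relative frequencies observed in training data:

from collections import Counter

# Toy "training data": already tokenised sentences
corpus = [
    ['ご飯', 'を', '食べ', 'ます'],
    ['学校', 'に', '行き', 'ます'],
    ['ご飯', 'を', '作り', 'ます'],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_sentence(words):
    """Approximate P(W) as a product of bigram probabilities P(w_i | w_{i-1})."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        if unigrams[a] == 0:
            return 0.0                               # unseen word: no estimate
        p *= bigrams[(a, b)] / float(unigrams[a])
    return p

print(p_sentence(['ご飯', 'を', '食べ', 'ます']))   # seen bigrams -> non-zero probability
print(p_sentence(['ご飯', 'を', '行き', 'ます']))   # unseen bigram -> 0.0

A real system would of course smooth these counts and fall back to lower-order n-grams when a bigram has never been seen.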

Let $A$ be a string of phonetic characters – or kana – so that $A = \{a_1, a_2, \ldots, a_m\}$.

Now we want to find the most probable string $W$ associated with the input string $A$.

A classic example is きしゃのきしゃがきしゃのきしゃできしゃした, which we typically want to be 貴社の記者が貴社の汽車で帰社した (Your reporter was sent home in your train). Given that all the nouns in the phrase are pronounced きしゃ (kisha), this leads to a large variety of more or less meaningful, but still possible, sentences. The reporter could be sent home in the reporter's own train, or the train could be sent home on the reporter. The reporter or train could even be involved in mounted archery (騎射, also read きしゃ) if one is imaginative enough. However entertaining that may be, it is the job of our kana-kanji converter to convert to the most likely sentence, not the most entertaining one.

Using Bayes' theorem we can ask: what is the probability of a sentence W, given an observed kana string A?

$$P(W|A) = \frac{P(A|W)P(W)}{P(A)} $$

[1]

And we want to find the $W$ that maximizes this probability, so that $W^* = \arg\max_W P(W|A) = \arg\max_W \frac{P(A|W)P(W)}{P(A)}$

We tend to assume that users correctly type the kana input they intend to type. That is, there are no typos and no spelling mistakes (note the general absence of Japanese spellcheckers in the marketplace). So we can assume that $P(A) = 1$.

We can make a few more assumptions about the Japanese language. For example, if we have a sentence $W$, then we can be very certain that there is only one kana string $A$ that fits. ご飯を食べます can only be read ごはんをたべます in basically any circumstance we would come across. So $P(A|W) = 1$ for most sentences, and we can further reduce [1] to $$P(W|A) = P(W)$$ or, to put it differently, the most likely correct sentence depends entirely on the language model ((Suzuki, Hisami, and Jianfeng Gao. Microsoft Research IME Corpus. Microsoft Research Technical Report, 2005. http://131.107.65.14/pubs/70243/tr-2005-168.pdf.)).
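Under these two simplifications ($P(A) = 1$ and $P(A|W) = 1$), conversion reduces to picking the candidate sentence with the highest language-model probability. A toy sketch, where both the candidate list and the probabilities are invented for illustration:

# Hypothetical candidate conversions of きしゃのきしゃがきしゃのきしゃできしゃした
# with made-up language-model probabilities P(W)
candidates = {
    '貴社の記者が貴社の汽車で帰社した': 3.2e-9,
    '記者の汽車が貴社の記者で帰社した': 1.1e-12,
    '騎射の貴社が汽車の記者で帰社した': 4.0e-14,
}

def convert(cands):
    """W* = argmax_W P(W): pick the candidate with the highest P(W)."""
    return max(cands, key=cands.get)

print(convert(candidates))   # -> 貴社の記者が貴社の汽車で帰社した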

Caveats

However, things are not always as simple as described above.

Input string A and P(A)

We assume the input to be a string of phonetic character representations that we must convert into an actual sentence, much like in a text-to-speech system. However, to complicate matters, both input and output strings will often contain alphanumeric characters as well as punctuation. Some of these – such as numbers – may affect how the following kana must be converted; other characters – mostly alphabet characters – will not, and it might be best to ignore them during conversion, especially if we suspect the user of typing very creative emoticons that we haven't met in the training data. However, we may want to substitute half-width alphabet characters with full-width ones, and we may or may not want to convert alphabet characters into kana.
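As an aside, the half-width to full-width substitution mentioned above is mechanical: the full-width forms of the printable ASCII range sit at a fixed offset (0xFEE0) from their half-width counterparts. A small Python 3 sketch:

def to_fullwidth(text):
    """Map half-width ASCII (U+0021..U+007E) to full-width forms (U+FF01..U+FF5E)."""
    out = []
    for ch in text:
        if 0x21 <= ord(ch) <= 0x7E:
            out.append(chr(ord(ch) + 0xFEE0))
        elif ch == ' ':
            out.append('\u3000')   # ideographic space
        else:
            out.append(ch)
    return ''.join(out)

print(to_fullwidth('VAD 2015'))   # -> ＶＡＤ　２０１５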

Moreover, we assume the user types correctly. There appears to be very little research into what kinds of spelling mistakes Japanese writers tend to make when typing Japanese on a keyboard or smartphone. There is research on the errors people make when writing by hand or when typing English, but it appears to be silently assumed by academics that Japanese authors type correctly. It does seem unlikely, though, that Japanese writers never forget to add a dakuten (゛) or to make a や or a ゆ small. A literature search only found one paper dealing with proofing written Japanese ((Takeda, Koichi, Emiko Suzuki, Tetsuro Nishino, and Tetsunosuke Fujisaki. “CRITAC—An Experimental System for Japanese Text Proofreading.” IBM Journal of Research and Development 32, no. 2 (1988): 201–16.)), while more research appears to have gone into how Japanese writers misspell English text ((Ishikawa, Masahiko. Apparatus for Correcting Misspelling and Incorrect Usage of Word. Google Patents, 1998. http://www.google.com/patents/US5812863. Mitton, Roger, and Takeshi Okada. “The Adaptation of an English Spellchecker for Japanese Writers,” 2007. http://eprints.bbk.ac.uk/592.)).

**So in the real world, $P(A) \ne 1$**

Unambiguity of the reading of a known sentence

Earlier I described how we can assume $P(A|W) = 1$. This is an important assumption for a lot of practical reasons, mainly because training data is expensive to annotate; with this assumption we can annotate training data with kana readings automatically, without human intervention. This is how Baidu ((Wu, Xianchao, Rixin Xiao, and Xiaoxin Chen. “Using the Web to Train a Mobile Device Oriented Japanese Input Method Editor,” 2013. http://www.aclweb.org/anthology/I/I13/I13-1172.pdf.)) is able to build a 2.5 terabyte training corpus from the web. This would be impossible to annotate by hand.

But the readings of a Japanese sentence are ambiguous at times. 今日 can, in basically every instance it occurs in Japanese, be read either きょう or こんにち, with the latter being too polite for most occasions, but still a possibility. If we expand the earlier example sentence to 私はご飯を食べます (I eat a meal), it is impossible to deduce whether the author meant the very polite わたくし or the normal わたし. Given that most automated systems will convert 私 into わたし, we get a self-reinforcing bias towards the shorter form.

So in the real world, $P(A|W) \ne 1$