Basics of Kana-kanji conversion
Posted on Tue 04 November 2014 in Notes
The goal of a kana-kanji conversion system is to convert a string of phonetic characters (kana) into a string of mixed logographic kanji and phonetic kana. That is, we want to go from ごはんをたべます → ご飯を食べます (to eat a meal).
Let $W$ be a sequence of words $\{w_1, w_2, \ldots, w_n\}$ in a sentence (in the above example $W$ would be ご飯を食べます). $P(W)$ is the probability that this sentence exists in the language. In practice, that means: how many times does this exact sentence show up in our training data? Since very few sentences show up in written language more than once, the chance that we will deal with a sentence that does not exist in our training set is quite high. Therefore we tend to model $W$ as a collection of uni-, bi-, and trigrams and estimate $P(W)$ from their associated frequencies. $P(W)$ is also known as our language model.
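To make the n-gram idea concrete, here is a minimal sketch of a bigram language model. The tiny pre-segmented corpus, the sentence boundary markers and the add-alpha smoothing are all assumptions made for illustration; they are not taken from any particular system.

```python
from collections import Counter
import math

# Hypothetical, pre-segmented toy corpus (a real model needs far more data).
corpus = [
    ["ご飯", "を", "食べ", "ます"],
    ["パン", "を", "食べ", "ます"],
    ["ご飯", "を", "作り", "ます"],
]

BOS, EOS = "<s>", "</s>"  # assumed sentence boundary markers

context_counts = Counter()  # how often each word appears as a left context
bigram_counts = Counter()   # how often each (previous, word) pair appears
for sentence in corpus:
    tokens = [BOS] + sentence + [EOS]
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

# Everything that can be predicted: all words plus the end marker.
vocab = {w for s in corpus for w in s} | {EOS}


def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + alpha) / (
        context_counts[prev] + alpha * len(vocab)
    )


def sentence_log_prob(sentence):
    """log P(W), approximated as a sum of bigram log-probabilities."""
    tokens = [BOS] + sentence + [EOS]
    return sum(math.log(bigram_prob(p, w)) for p, w in zip(tokens, tokens[1:]))
```

With this sketch, a sentence that never occurs in the corpus, such as `["パン", "を", "作り", "ます"]`, still gets a non-zero probability, because every bigram in it can be scored on its own.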
Let $A$ be a string of phonetic characters (kana), so that $A = \{a_1, a_2, \ldots, a_m\}$.
Now we want to find the most probable string $W$ associated with the input string $A$.
A classic example is きしゃのきしゃがきしゃのきしゃできしゃした, which we typically want to be 貴社の記者が貴社の汽車で帰社した (Your company's reporter returned to the office on your company's train). Given that all the nouns in the phrase are pronounced きしゃ (kisha), this leads to a large variety of more or less meaningful, but still possible, sentences. The reporter could return on the reporter's own train, or the train could return on the reporter. The reporter or train could even be involved in mounted archery (騎射, also read きしゃ) if one is imaginative enough. However entertaining that may be, it is the job of our kana-kanji converter to convert to the most likely sentence, not the most entertaining.
[Image: Picture of mounted archery "きしゃ"]
Using Bayes' theorem, we can express the probability of a sentence $W$ given an observed kana string $A$ in terms of the probability of that kana string given the sentence:
$$P(W|A) = \frac{P(A|W)P(W)}{P(A)} $$
[1]
And we want to find the $W$ that maximizes this probability, so that $W^* = \arg\max_{W} P(W|A) = \arg\max_{W} \frac{P(A|W)P(W)}{P(A)}$
We tend to assume that users correctly type the kana input that they intend to type. That is, there are no typos and no spelling mistakes (note the general absence of Japanese spellcheckers in the marketplace). So we can assume that $P(A) = 1$.
We can make a few more assumptions about the Japanese language. For example, if we have a sentence $W$, then we can be very certain that there is only one kana string $A$ that fits. ご飯を食べます can only be read ごはんをたべます in basically any circumstance that we would come across. So $P(A|W)=1$ for most sentences, and we can further reduce [1] to $$P(W|A) = P(W)$$ or, to put it a different way, the most likely correct sentence is completely dependent on the language model ((Suzuki, Hisami, and Jianfeng Gao. Microsoft Research IME Corpus. Microsoft Research Technical Report, 2005. http://131.107.65.14/pubs/70243/tr-2005-168.pdf.
)).
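Under these assumptions, conversion reduces to picking the candidate sentence with the highest language model score. The following is a minimal sketch of that idea. The candidate dictionary, the bigram log-probabilities and the floor value are invented for illustration, and the input is assumed to be already split into kana chunks; a real converter would also have to search over segmentations, typically with a lattice and dynamic programming rather than brute-force enumeration.

```python
import itertools
import math

# Hypothetical dictionary: kana reading -> candidate surface forms.
candidates = {
    "きしゃ": ["記者", "汽車", "帰社", "貴社", "騎射"],
    "が": ["が"],
    "で": ["で"],
    "した": ["した"],
}

# Hypothetical bigram log-probabilities standing in for the language model P(W).
bigram_logp = {
    ("記者", "が"): math.log(0.3),
    ("が", "汽車"): math.log(0.1),
    ("汽車", "で"): math.log(0.3),
    ("で", "帰社"): math.log(0.2),
    ("帰社", "した"): math.log(0.5),
}
FLOOR = math.log(1e-4)  # assumed score for bigrams missing from the table


def score(words):
    """Approximate log P(W) as the sum of bigram log-probabilities."""
    return sum(bigram_logp.get(pair, FLOOR) for pair in zip(words, words[1:]))


def convert(kana_chunks):
    """Return the candidate word sequence with the highest language model score."""
    options = [candidates[chunk] for chunk in kana_chunks]
    return max(itertools.product(*options), key=score)


best = convert(["きしゃ", "が", "きしゃ", "で", "きしゃ", "した"])
print("".join(best))
```

Because every candidate produced here has the input as its reading, $P(A|W)$ is the same for all of them, so only the language model score decides the winner.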
Caveats
However, things are not always as simple as described above.
Input string A and P(A)
We assume the input to be a string of phonetic character representations that we must convert into an actual sentence, much as a speech-to-text system would. However, to complicate matters, both input and output strings will often contain alphanumeric characters as well as punctuation. Some of these, such as numbers, may affect how the following kana must be converted. Other characters, mostly alphabetic ones, will not, and it might be best to ignore them during conversion, especially if we suspect the user of typing very creative emoticons that we haven't met in the training data. However, we may want to replace half-width alphabetic characters with full-width ones, and we may or may not want to convert alphabetic characters into kana.
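The width substitution mentioned above can be sketched in a few lines. This is only an illustration: the function names are made up, and whether a converter should normalize in either direction is a design decision the text leaves open. The half-width to full-width direction relies on the fixed offset between printable ASCII and the Unicode full-width forms; the opposite direction uses NFKC normalization from Python's standard `unicodedata` module, which also rewrites other compatibility characters (for example half-width katakana), so it is broader than a pure width conversion.

```python
import unicodedata


def to_fullwidth(text):
    """Map printable ASCII (U+0021..U+007E) to full-width forms (U+FF01..U+FF5E)."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x21 <= code <= 0x7E:
            out.append(chr(code + 0xFEE0))  # fixed offset between the two ranges
        elif ch == " ":
            out.append("\u3000")  # ideographic (full-width) space
        else:
            out.append(ch)  # kana, kanji, etc. are left untouched
    return "".join(out)


def to_halfwidth(text):
    """Fold full-width alphanumerics back to ASCII via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)
```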
Moreover, we assume that the user types correctly. There appears to be very little research into what kind of spelling mistakes Japanese writers tend to make when typing Japanese on a keyboard or smartphone. There is research on the errors people make when writing by hand or when typing English, but academics appear to silently assume that Japanese authors type correctly. It does seem unlikely, though, that Japanese writers never forget to add a ゛ (dakuten) or to make a や or a ゆ small. A literature search found only one paper dealing with proofing written Japanese ((Takeda, Koichi, Emiko Suzuki, Tetsuro Nishino, and Tetsunosuke Fujisaki. “CRITAC—An Experimental System for Japanese Text Proofreading.” IBM Journal of Research and Development 32, no. 2 (1988): 201–16.
)), while more research appears to have gone into how Japanese writers misspell English text ((Ishikawa, Masahiko. Apparatus for Correcting Misspelling and Incorrect Usage of Word. Google Patents, 1998. http://www.google.com/patents/US5812863.
Mitton, Roger, and Takeshi Okada. “The Adaptation of an English Spellchecker for Japanese Writers,” 2007. http://eprints.bbk.ac.uk/592.
)).
**So in the real world, $P(A) \ne 1$**
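If we wanted to account for the two kinds of slips speculated about above (a forgotten dakuten, or a や/ゆ that should have been small), one option would be to generate nearby kana strings and let the converter consider them as well. The sketch below is purely hypothetical: the function name and the set of small kana are assumptions, and nothing in the cited literature says that these are the errors users actually make. It relies on NFC normalization from the standard `unicodedata` module to compose a kana with the combining dakuten (U+3099) where a voiced character exists.

```python
import unicodedata

# Assumed set of kana that are easily typed large instead of small.
SMALL = {"や": "ゃ", "ゆ": "ゅ", "よ": "ょ", "つ": "っ"}
DAKUTEN = "\u3099"  # combining voiced sound mark


def single_edit_variants(kana):
    """Return kana strings that differ from the input by one added dakuten
    or one kana made small."""
    variants = set()
    for i, ch in enumerate(kana):
        # NFC composes e.g. か + U+3099 into が; if no voiced form exists,
        # the result stays two code points long and is skipped.
        voiced = unicodedata.normalize("NFC", ch + DAKUTEN)
        if len(voiced) == 1:
            variants.add(kana[:i] + voiced + kana[i + 1:])
        if ch in SMALL:
            variants.add(kana[:i] + SMALL[ch] + kana[i + 1:])
    return variants
```

Each variant could then be converted and scored like the original input, with some penalty reflecting how likely the typo is; estimating that penalty is exactly the research gap described above.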
Unambiguity of the reading of a known sentence
Earlier I described how we can assume $P(A|W) = 1$. This is an important assumption for a lot of practical reasons, mainly because training data is expensive to annotate by hand: with this assumption we can annotate training data with kana readings automatically, without human intervention. This is how Baidu ((Wu, Xianchao, Rixin Xiao, and Xiaoxin Chen. “Using the Web to Train a Mobile Device Oriented Japanese Input Method Editor,” 2013. http://www.aclweb.org/anthology/I/I13/I13-1172.pdf.
)) is able to build a 2.5 terabyte training corpus from the web. This would be impossible to annotate by hand.
But the reading of a Japanese sentence is ambiguous at times. 今日 can, in basically every instance where it occurs in Japanese, be read either きょう or こんにち, with the latter being too polite for most occasions, but still a possibility. If we expand the earlier example sentence to 私はご飯を食べます (I eat a meal), it is impossible to deduce whether the author meant to use the very polite わたくし or the normal わたし. Given that most automated systems will convert 私 into わたし, we get a self-reinforcing bias toward the shorter form.
**So in the real world, $P(A|W) \ne 1$**
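The ambiguity can be illustrated with a toy reading model. Everything in the sketch below is an assumption: the probabilities are invented, the words are assumed to be pre-segmented, and each word's reading is treated as independent of its neighbours, which real text does not guarantee.

```python
import math

# Hypothetical per-word reading distributions; the numbers are made up.
readings = {
    "私": {"わたし": 0.9, "わたくし": 0.1},
    "今日": {"きょう": 0.95, "こんにち": 0.05},
    "は": {"は": 1.0},
    "ご飯": {"ごはん": 1.0},
    "を": {"を": 1.0},
    "食べます": {"たべます": 1.0},
}


def reading_prob(words, kana_chunks):
    """P(A|W), assuming each word's reading is independent of the others."""
    return math.prod(readings[w].get(a, 0.0) for w, a in zip(words, kana_chunks))


def annotate(words):
    """Automatic annotation: always pick the single most likely reading."""
    return [max(readings[w], key=readings[w].get) for w in words]


sentence = ["私", "は", "ご飯", "を", "食べます"]
print(reading_prob(sentence, ["わたし", "は", "ごはん", "を", "たべます"]))
print(reading_prob(sentence, ["わたくし", "は", "ごはん", "を", "たべます"]))
print(annotate(sentence))
```

Note that `annotate` never outputs わたくし, so a corpus annotated this way contains no evidence for that reading at all, which is the self-reinforcing effect described above.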