'Phonix.py – phonetic name search in Python'

Posted on Tue 24 March 2015 in Notes

  • ICL, good morning!

  • Good morning, my name is George Parson. I’m looking for one of your employees, Philip Carlquist from Engsboda.

  • Let me see, Filip Karlkvist you said ... no, I’m sorry I can’t find anybody here with that name. I’ll just look up the location, ... Ängsboda, no, we don’t even seem to have an office there. Are you sure he works for ICL sir?

  • Quite sure. I met him only last week.

  • Well, I’m sorry, he is not in the computer, he must have left the company since.

  • But ...

Looking up names can be difficult, because there are many possible ways to spell otherwise similar sounding names and places and we can't expect the user to know of - or even think to try - all possible spellings.

Phonetic search algorithms dates back to the early 1900 when Russel & Odell invented the Soundex algorithm in order to simplify and improve an index wherein names are to be entered and grouped phonetically rather than in accordance with the alphabetical construction of the names (Patent 1918).

Soundex grew to fame when it was later used for an genealogical analysis of the US Census data from 1890-1920. Today, despite its limitation it is found everywhere from SQL to The Art of Computer Programming.

The algorithm splits all letters in a name into 1 of 8 groups, each represented by a number 0-7. All zeros and consecutive numbers are pruned. The first letter of the original name is reinstated, and if result is less than 4 characters long, it is extended by zeros.

Not content with the obvious short comings of this approach, T. N. Gadd sat down and painstakingly constructed no less than 160 rules for how to change the spelling of names to be more phonetic. He also optimized the grouping and published the new resulting algorithm, Phonix, in 1988's 3rd number of'Information System', under the catching title: "Fisching Fore Werds", which if run through the Phonix algorithm would produce the same result as "Fishing for words".

Since the Phonix algorithm is at the heart of several more advanced phonetic search algorithms, and there exists no live open implementation of it, I decided to reimplement it in Python and release it publicly under BSD.