Evolutionary algorithms

Posted on Thu 28 May 2015 in Notes

Terms:

  • Evolutionary Algorithm (EA)
  • Population: Collection of string of symbols. In case of fooling an image classifier, the population is the list of images used throughout the test.
  • Chromosomes: Each string (or image) in the population is called a chromosome.
  • Initial population: This is our starting point. All chromosomes in the initial population are randomly (or heuristically) generated.
  • Fitness: Each chromosome has a fitness value attached to it. Fitness is determined by our fitness function. For [## Terms:

  • Evolutionary Algorithm (EA)

  • Population: Collection of string of symbols. In case of fooling an image classifier, the population is the list of images used throughout the test.
  • Chromosomes: Each string (or image) in the population is called a chromosome.
  • Initial population: This is our starting point. All chromosomes in the initial population are randomly (or heuristically) generated.
  • Fitness: Each chromosome has a fitness value attached to it. Fitness is determined by our fitness function. For](https://en.wikipedia.org/wiki/Travelling_salesman_problem "Traveling Salesman problem on Wikipedia") it is the total length travelled for a particular route (or chromosome). When fooling image classifiers, the fitness is the confidence the classifier deems a chromosome is part of a particular class.
  • Evaluation: Evaluating a chromosome is simply determining its fitness value.
  • Selection: The EA chooses the chromosomes with highest fitness following some rule (top 100, or everything above a threshold)
  • Parent: A parent is simply a chromosome that has been selected. From parents we generate offsprings.
  • Offspring: The EA will generate new chromosome from the selections. This can be done in several ways:
    • Crossover / recombination: generating new chromosomes by recombining two or more parents
    • Mutation: generating offsprings by modifying a single parent, typically in some random way.

http://web.stcloudstate.edu/bajulstrom/aboutEC.html


Extracting tables from a regular HTML document using regex

Posted on Sun 10 May 2015 in Notes

If you know the structure of your html is regular, it may sometimes be easier to do a quick and dirty regex extraction job, than firing up beautiful soup.

The main limitation of using regex is that we cannot properly parse nested tags. If we are searching for opening and closing `If you know the structure of your html is regular, it may sometimes be easier to do a quick and dirty regex extraction job, than firing up beautiful soup.

The main limitation of using regex is that we cannot properly parse nested tags. If we are searching for opening and closing` tags, and somewhere we have:

<table>
  <table>
    ....

  </table>

</table>

The result will be a disappointing:

<table>
  <table>
    ....

  </table>

So in order for this to work, we have to be confident that there are no nested tags that we are trying to extract.

In python we can build the pattern and extract the tables like this:

pattern = re.compile(r'&lt;table.*?\/table>', re.DOTALL)

with open("document.html", 'r') as infile:
    html = infile.read()
    tables = pattern.findall(html)

The <a href="https://docs.python.org/3.4/library/re.html#re.DOTALL">re.DOTALL</a> ensures that . matches \newlines, which we need since the tables span multiple lines.
if we only used . instead of .? to match the content within table tags, the closing tag would be included in the .* part of the pattern and the pattern would match the entire document.


Connect iPython REPL to an existing notebook

Posted on Wed 06 May 2015 in Notes

This is a really smart trick if you are developing using iPython Notebooks. What you do is open a REPL that is connected to the notebook you are working on (think of it as an ipython instance that shares the functions and variables with your notebook).

This is great as a scratch pad, easy checking of variables and for doing long print statements that would otherwise bog down a browser.

Simply run the following command from the terminal after opening the notebook server:

$ ipython console --existing

In recent versions of Anaconda, I get a very long error-message that ends with:

ImportError: No module named 'jupyter_console'

To fix this, simply install jupyter_console using pip from the terminal, like so:

$ pip install jupyter_console

I have no idea why this isn't installed by default with Anaconda.


Quote on Color vs. Colour

Posted on Thu 26 March 2015 in Notes

COLOUR and COLOR are phonetically similar and have the same meaning. COLOUR, COLLAR and CARLA are phonetically similar but differ in meaning. Yet meaning is determined by context. We might COLLAR in a picture, using bright KILLERS; or buy a CARLA for our dog.

Context assists with meaningful interpretation of what has been said or read, and we happily substitute ‘preferred forms' for words or 'concepts' we consider to be in error. Utter the words out of context however, and many variant spellings may be offered in return. This is particularly true of names; for example CAREL, CARRELL, CARROL, CURRALL, CURRELL, KAREEL, KAREL, KAROL and KARIEL.

The word 'phonetic' refers to spoken sounds and not to the spelling of words. Different spelling sequences often represent similar sounds; for example, the 'PH' in PHONE is similar to the 'F' in FOAM. What are defined as vowels and consonants in the written alphabet are not consistent with spoken language, and are unlikely to be since language is continually changing (the written form taking longer to change than the spoken). The longer documents and records are accumulated, the harder it becomes to justify changing those alphabets.

T.N. Gadd. Fisching Fore Werds, 1988

I was very impressed with this concise explanation on phonetics and homonyms made by Gadd when he first introduced his Phonix improvement to the Soundex algorithm for phonetic retrieval of names.

Check out my implementation of Phonix.

Or:

Or


'Phonix.py – phonetic name search in Python'

Posted on Tue 24 March 2015 in Notes

  • ICL, good morning!

  • Good morning, my name is George Parson. I’m looking for one of your employees, Philip Carlquist from Engsboda.

  • Let me see, Filip Karlkvist you said ... no, I’m sorry I can’t find anybody here with that name. I’ll just look up the location, ... Ängsboda, no, we don’t even seem to have an office there. Are you sure he works for ICL sir?

  • Quite sure. I met him only last week.

  • Well, I’m sorry, he is not in the computer, he must have left the company since.

  • But ...

Looking up names can be difficult, because there are many possible ways to spell otherwise similar sounding names and places and we can't expect the user to know of - or even think to try - all possible spellings.

Phonetic search algorithms dates back to the early 1900 when Russel & Odell invented the Soundex algorithm in order to simplify and improve an index wherein names are to be entered and grouped phonetically rather than in accordance with the alphabetical construction of the names (Patent 1918).

Soundex grew to fame when it was later used for an genealogical analysis of the US Census data from 1890-1920. Today, despite its limitation it is found everywhere from SQL to The Art of Computer Programming.

The algorithm splits all letters in a name into 1 of 8 groups, each represented by a number 0-7. All zeros and consecutive numbers are pruned. The first letter of the original name is reinstated, and if result is less than 4 characters long, it is extended by zeros.

Not content with the obvious short comings of this approach, T. N. Gadd sat down and painstakingly constructed no less than 160 rules for how to change the spelling of names to be more phonetic. He also optimized the grouping and published the new resulting algorithm, Phonix, in 1988's 3rd number of'Information System', under the catching title: "Fisching Fore Werds", which if run through the Phonix algorithm would produce the same result as "Fishing for words".

Since the Phonix algorithm is at the heart of several more advanced phonetic search algorithms, and there exists no live open implementation of it, I decided to reimplement it in Python and release it publicly under BSD.

Or