Monday, March 2, 2009

Testing out the classifiers

I created and saved 26 different classifiers, one for each lowercase letter. I started out with 5000 Haar-like features and brought the set down to 200, so the feature vectors will now be of length 200. I chose 200 because I found that beyond 200 features there was no improvement in performance. Take a look at the following graph:



As you can see, performance saturates after about 180 iterations. This threshold varies across the different letters, but from what I've seen, 200 is always enough.
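As a sketch of how I check for this saturation, here is one plausible way to do it, assuming an AdaBoost-style setup where each iteration adds one weighted weak classifier (the function names and the tolerance are mine, not from my actual code):

```python
def accuracy_curve(weak_outputs, alphas, labels):
    """weak_outputs[t][i] is round t's +/-1 vote on sample i,
    alphas[t] is that round's weight, labels[i] is the true +/-1 label.
    Returns the ensemble's accuracy after each round t = 1..T."""
    n = len(labels)
    scores = [0.0] * n  # running weighted vote per sample
    curve = []
    for votes, alpha in zip(weak_outputs, alphas):
        for i in range(n):
            scores[i] += alpha * votes[i]
        correct = sum(1 for i in range(n) if scores[i] * labels[i] > 0)
        curve.append(correct / n)
    return curve

def saturation_point(curve, tol=1e-3):
    """Smallest number of rounds after which accuracy never
    improves by more than tol."""
    best_after = [0.0] * len(curve)
    running = 0.0
    for t in range(len(curve) - 1, -1, -1):  # max accuracy from t onward
        running = max(running, curve[t])
        best_after[t] = running
    for t, acc in enumerate(curve):
        if best_after[t] - acc <= tol:
            return t + 1
    return len(curve)
```

Running `saturation_point` on each letter's validation curve is what suggests that 200 rounds is a safe cutoff for all 26 letters.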

I spent the majority of my time this weekend dealing with my old code for extracting letters from test data. To remind you, I am given something like:



... and I need to extract all of these letters. I wrote code for this last year but seem to have lost it :) So I wrote new code to extract the letters, which was time-consuming. I am not entirely happy with the result of the code either, but I'll deal with it for now.

Given 4 images of one letter that were automatically extracted with the code I wrote, I ran a number of the 26 classifiers on them to find the confidences. Confidence ranges from 0 to 1, where 1 means the classifier is 100% confident that the letter is a true instance, and 0 means it is 100% confident that the letter is a false instance.
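One plausible way to get a [0,1] confidence out of a boosted classifier is to normalize the weighted-vote margin and rescale it; this is a minimal sketch of that mapping, not necessarily what my classifiers do internally:

```python
def confidence(weak_votes, alphas):
    """Map the weighted +/-1 votes of the weak classifiers to a
    confidence in [0, 1]. weak_votes[t] is round t's vote on the
    image, alphas[t] is that round's weight."""
    margin = sum(a * v for a, v in zip(alphas, weak_votes))
    total = sum(alphas)
    # margin/total lies in [-1, 1]; shift and scale it to [0, 1].
    return 0.5 * (margin / total + 1.0)
```

A unanimous positive vote maps to 1, a unanimous negative vote to 0, and a split decision lands near 0.5, which matches the middling numbers in the table below.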

Here are 4 examples of the letter 'n':



Classifier   Image 1   Image 2   Image 3   Image 4
n            0.4799    0.4897    0.5225    0.5120
a            0.3878    0.4333    0.3969    0.3969
b            0.4337    0.4499    0.4552    0.4967
c            0.3715    0.3448    0.3443    0.3232


So, the 'n' classifier does the best on all four images, which is a relief. However, these numbers can be pretty close, so hopefully the roster information will take care of the close cases.

I wrote another Perl script to generate the transition probabilities, because I seem to have lost my old one too. Given that the current letter is 'x', I find the probability that the next letter is 'y' by dividing the number of occurrences of the string 'xy' by the number of occurrences of the letter 'x'.
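The same computation is easy to sketch in Python (the Perl script does the equivalent; the word list here stands in for the roster names):

```python
from collections import Counter

def transition_probs(words):
    """Estimate P(next letter = y | current letter = x) as
    count('xy') / count('x') over a list of words."""
    unigrams = Counter()
    bigrams = Counter()
    for w in words:
        w = ''.join(ch for ch in w.lower() if ch.isalpha())
        unigrams.update(w)                 # occurrences of each letter
        bigrams.update(zip(w, w[1:]))      # occurrences of each adjacent pair
    return {(x, y): n / unigrams[x] for (x, y), n in bigrams.items()}
```

For example, from the single word "anna", the script would give P('n' | 'a') = 1/2, since 'an' appears once and 'a' appears twice.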
