Monday, February 25, 2008

Experimenting with nearest neighbor

Nearest neighbor shockingly doesn't work that well. I think part of the problem is that the training data isn't all that great and there isn't enough of it.

Here's the setup:

Training:
I have 90 training images for each letter in the alphabet (there are 26 of those). This makes for 26 x 90 = 2340 total training images, for those of you who can't do maf. All of the training images are the same size: 8-bit images of 120 by 123 pixels.

Testing:
So far I only have 2 names that I'm testing against but that size will grow shortly. I made the test characters also be 120 by 123 pixels (by adding extra white space evenly around the edges).
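For the curious, padding a character image out to the common size could look roughly like the sketch below. It assumes the image is a list of rows of 8-bit pixel values, that white is 255, and that "120 by 123" means 120 wide by 123 tall. The function name `pad_to_size` and all of those choices are assumptions for illustration, not the actual code behind this post.

```python
def pad_to_size(image, target_height=123, target_width=120, white=255):
    """Center `image` (a list of pixel rows) on a white canvas of the target size."""
    height = len(image)
    width = len(image[0])
    if height > target_height or width > target_width:
        raise ValueError("image is larger than the target size")

    # Split the extra space as evenly as possible; any odd pixel goes
    # to the bottom/right edge.
    top = (target_height - height) // 2
    bottom = target_height - height - top
    left = (target_width - width) // 2
    right = target_width - width - left

    padded = [[white] * target_width for _ in range(top)]
    for row in image:
        padded.append([white] * left + list(row) + [white] * right)
    padded.extend([white] * target_width for _ in range(bottom))
    return padded


# A made-up 2x2 all-black "character" padded out to full size.
small = [[0, 0], [0, 0]]
padded = pad_to_size(small)
```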

Here is the testing process:
I load all of my training data into a huge matrix of size 2340 x 14760, where each row is a strung-out training image of a character (120 x 123 = 14760 pixels). I then read in a test character image, find the Euclidean distance between that test image and each of the training images, and sort the results by distance.
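A minimal sketch of that ranking step, in plain Python with tiny made-up 4-pixel "images" standing in for the 14760-pixel rows. The function name `rank_by_distance` and the sample data are assumptions for illustration, not the actual code or data behind this post.

```python
import math


def rank_by_distance(test_image, training_images, training_labels):
    """Return (distance, label) pairs sorted from closest to farthest."""
    scored = [
        (math.dist(test_image, row), label)  # Euclidean distance (Python 3.8+)
        for row, label in zip(training_images, training_labels)
    ]
    scored.sort(key=lambda pair: pair[0])
    return scored


# Each row is one flattened training image; labels line up with rows.
training_images = [[0, 0, 255, 255], [255, 255, 0, 0], [0, 10, 250, 255]]
training_labels = ["J", "e", "J"]

ranked = rank_by_distance([0, 5, 255, 250], training_images, training_labels)
```

With the real data, a matrix library would be a much better fit than nested lists, but the idea is the same: one distance per training row, then a sort.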

Currently I am looking at the top 50 closest matches and having those vote on a character. I have been getting some good and some bad results.
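The voting step can be sketched like this, assuming the labels are already sorted from closest to farthest match. The function name `vote` and the sample labels are made up for illustration. One detail worth noting: in this sketch a tie goes to whichever tied label shows up first in the ranking, which is just one possible choice.

```python
from collections import Counter


def vote(ranked_labels, k=50):
    """Let the k closest matches vote; return the winning label and the tallies."""
    counts = Counter(ranked_labels[:k])
    # most_common lists tied labels in the order they were first seen,
    # so a tie is broken in favor of the closer match.
    winner, _ = counts.most_common(1)[0]
    return winner, counts


# Made-up ranking: the closest matches are wrong, but the mode is still 'o'.
labels = ["n", "c", "g", "o", "o", "o", "p"]
winner, counts = vote(labels, k=7)
```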

The first letter I tried, 'J' from "Jean Poole" had 'J' as its top match!
At that point life was pretty good. For one thing, all of the top 10 matches were j's. So that case works pretty well.

The next letter I tried was 'e':

The top match for that was 'p'... not so good. In fact, only 3 of the 50 votes were for 'e'. Here are the training e's... so you're trying to tell me only 3 of these look like that 'e' up there?

Here is another example, 'o':

Luckily, the mode of the top 50 matches is 'o', so 'o' wins, but there are still some weird results. The training image closest to the 'o' is the following 'n':

I don't really get why that is. Here is the second closest match:

This is a little more reasonable, even though it is a 'c'. You can see the resemblance.

Lastly, here is the third match:

This happens to be a 'g' that was cut off at the bottom during the traumatizing training process. This is also fairly understandable. Finally, the 4th closest match is an 'o'. (Of course, after that there are plenty more random results.) This is what the first matching 'o' looks like:
