Tuesday, March 11, 2008

Incorporating another set of NN

Currently, when I run my algorithm, I come up with the most likely set of states that made up the character images. In other words, I come up with a predicted name. I now run nearest neighbor with that predicted name against the roster names and take the nearest roster name as my final prediction. This way my predicted name is at least a valid possibility.

And now... I get 100% accuracy. That's right. Yeah, yeah, the roster only has 7 names, but it is a start!!1!!

The way I calculate the distance between two strings is to turn each one into an integer array holding the alphabet index of each character: 'a' maps to 1, 'b' to 2, 'c' to 3, etc. Then I use the Euclidean distance between the two arrays. In the future I can apparently use edit distance, but Euclidean is fine for me for now.
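Here is a rough sketch of that distance in Python, just to make the idea concrete. The function names `encode` and `euclidean` are made up for this example, and it assumes lowercase ASCII letters and two names of the same length; it is not my actual code.

```python
import math

def encode(name):
    # 'a' -> 1, 'b' -> 2, ..., 'z' -> 26
    # (assumption: the name contains only ASCII letters)
    return [ord(ch) - ord("a") + 1 for ch in name.lower()]

def euclidean(u, v):
    # Assumption: both vectors already have the same length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(encode("cab"))                            # [3, 1, 2]
print(euclidean(encode("cab"), encode("cat")))  # 18.0
```

Note that this distance depends on how far apart the letters are in the alphabet, so mistaking 'b' for 't' costs much more than mistaking 'b' for 'c'. Edit distance would not have that quirk.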

OK, there is one more hack in there. The names are not all the same length, and in order for nearest neighbor to work, each feature vector needs to be the same size. The way I do this is to make every feature vector the size of the longest name on the roster. For names shorter than the longest one, the extra slots in the feature vector are set to zero. I also require that the nearest neighbor returned be the same length as the query name (before I add the zeros). I think that this step helps a lot, especially since the roster is so small. The next step is to get a test set that is much larger.
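The padding and the same-length rule could be sketched roughly like this in Python. The names `encode_padded` and `nearest_roster_name`, and the example roster, are hypothetical and only for illustration; the sketch assumes lowercase ASCII names and uses `math.dist` (Python 3.8+) for the Euclidean distance.

```python
import math

def encode_padded(name, size):
    # 'a' -> 1, ..., 'z' -> 26, then zero-pad up to `size` slots.
    codes = [ord(ch) - ord("a") + 1 for ch in name.lower()]
    return codes + [0] * (size - len(codes))

def nearest_roster_name(predicted, roster):
    # Every feature vector is as long as the longest roster name.
    size = max(len(name) for name in roster)
    # Only roster names with the same unpadded length are allowed to match.
    candidates = [name for name in roster if len(name) == len(predicted)]
    if not candidates:
        return None  # assumption: give up if no roster name has that length
    query = encode_padded(predicted, size)
    return min(candidates,
               key=lambda name: math.dist(query, encode_padded(name, size)))

# Hypothetical roster, not the real one.
roster = ["alice", "bob", "carol", "dave", "erin", "frank", "grace"]
print(nearest_roster_name("carel", roster))  # carol
```

One thing this makes visible: once the match is restricted to names of the same length, the padded zeros line up on both sides and add nothing to the distance. The padding only matters if the length rule is relaxed.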
