Monday, March 9, 2009
Jittering Data
After my results came out really bad, I decided that my training data was the problem, both in quality and in quantity. It is difficult to get more training data fast, so Serge recommended using what I have to make more. I used Piotr Dollar's jitterImage function to take a single image and apply transformations to it.
Exaggeration:
Piotr's function applies both rotations and translations to images. Serge also recommended playing around with stroke thickness by raising an image to a power greater than 1, and to a power between 0 and 1.
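To make this concrete, here is a rough Matlab sketch of the kind of jittering I mean. This is a toy version, not Piotr's actual jitterImage code, and the file name, angles, shifts, and exponents are all made-up values:

% Toy jittering sketch (not Piotr's actual jitterImage): take one grayscale
% character image (white background = 1, ink near 0) and make many variants
% by rotating, translating, and changing stroke darkness/thickness.
I = im2double(imread('a_example.png'));   % hypothetical input image

angles = -10:5:10;     % small rotations, in degrees
shifts = -2:2:2;       % small translations, in pixels
powers = [0.5 1 2];    % >1 darkens/fattens strokes, <1 lightens/thins them

ink = 1 - I;           % invert so the border that rotation adds is background (0)
jittered = {};
for phi = angles
  R = imrotate(ink, phi, 'bilinear', 'crop');   % rotation pads with 0 = background
  for dx = shifts
    for dy = shifts
      T = circshift(R, [dy dx]);                % shift by whole pixels
      for p = powers
        jittered{end+1} = (1 - T) .^ p;         % back to white background, tweak thickness
      end
    end
  end
end
fprintf('Made %d jittered examples from one original image.\n', numel(jittered));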
I did all this and started out with around 50,000 examples of each character instead of 135. I quickly found out that I run out of memory with that many examples. I eventually had to bring it down to 625 per character, and training took forever.
The first thing I noticed was that the training error was much higher than when I only had 135 examples. I think this is because there is a lot more variation in the data now, and it's hard to cover all of the cases with 200 features. When I only had 135 examples, the error eventually went down to 0. This is the error with the 625 examples:
Also, here are the ROC curves plotted on top of each other. When I only had 135 examples, the ROC curve was a right angle.
Here is half of the 625 training examples for 'a':
To cut to the chase, my algorithm still performs poorly, and probably even worse now. D:
Wednesday, March 4, 2009
Well, I ran it already
Finally, I got everything together and ran my algorithm. It did horribly. We're talking 0% accuracy horrible. I am using the same hidden Markov model (HMM) as last year, and I think that the model is just too unforgiving. Once it mispredicts the first letter, we're basically screwed.
I tried abandoning the HMM and, for each letter in the test name, finding the prediction for that letter independently of the rest of the letters in the name. I found the prediction by taking the max over the confidences of the 26 possibilities. Then, at the end, I find the nearest neighbor of the predicted name among all of the names in the roster. This also did horribly.
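Roughly, that fallback looks like this in Matlab. I'm assuming a 26 x L matrix conf of classifier confidences (one column per character box, one row per letter) and a cell array roster of lowercase names; the editDistance helper is an ordinary Levenshtein distance that I'm using here as a stand-in for whatever "nearest" ends up meaning:

% conf(i, j) = confidence that box j contains the i-th letter of the alphabet.
[mx, idx] = max(conf, [], 1);          % argmax over the 26 letters, per box
predicted = char('a' - 1 + idx);       % e.g. [14 1 20 ...] -> 'nat...'

% Nearest neighbor in the roster, by edit distance.
best = 1; bestDist = inf;
for k = 1:numel(roster)
  d = editDistance(predicted, roster{k});
  if d < bestDist, bestDist = d; best = k; end
end
fprintf('Predicted "%s" -> matched roster name "%s"\n', predicted, roster{best});

% --- editDistance.m: plain dynamic-programming Levenshtein distance ---
function d = editDistance(a, b)
D = zeros(numel(a) + 1, numel(b) + 1);
D(:, 1) = (0:numel(a))';
D(1, :) = 0:numel(b);
for i = 1:numel(a)
  for j = 1:numel(b)
    cost = double(a(i) ~= b(j));
    D(i+1, j+1) = min([D(i, j+1) + 1, D(i+1, j) + 1, D(i, j) + cost]);
  end
end
d = D(end, end);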
I took a look at what was going on under the covers. I found that for each letter, the confidence of that letter being the right letter is pretty high compared to the rest of the confidences, but some other letter always beats it by a little bit. And my algorithm doesn't care about 2nd place. Therefore, I need another way!
Monday, March 2, 2009
Huh. How 'bout dat
Well. My real roadblock at this point was extracting letters from the test data character boxes, because I lost my code from last year. But losing the code wasn't the really bad part - the code wasn't very good in the first place. It had all sorts of Hough transforms in it.
Instead, I decided to take a blank sheet and do normalized cross-correlation with it against a filled-in box. This way, I could find the centers of the filled-in boxes. I basically figured out how big the boxes are, and can now extract the letters mo' betta. The red circle is where the normalized cross-correlation signal was the strongest.
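In Matlab this is basically one call to normxcorr2. Here's a rough sketch; the file names are made up, and I'm assuming grayscale scans where the template is a crop of the blank sheet around a character box:

% Find where the blank character-box template lines up with a filled-in
% sheet using normalized cross-correlation.
sheet    = im2double(imread('filled_sheet.png'));   % scanned quiz (assumed grayscale)
template = im2double(imread('blank_box.png'));      % empty box cropped from the blank sheet

c = normxcorr2(template, sheet);                    % correlation surface
[peak, ind] = max(c(:));
[ypeak, xpeak] = ind2sub(size(c), ind);

% normxcorr2 pads, so back out the template size to get the match's
% top-left corner in sheet coordinates.
yTop = ypeak - size(template, 1) + 1;
xTop = xpeak - size(template, 2) + 1;
fprintf('Strongest match (score %.2f) with top-left corner at (%d, %d)\n', peak, xTop, yTop);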
Zample:
To extract the letters, I found that the character box lines were about 3 pixels thick, so I took this into account.
The result:
Yee-haw. I took the output of this and fed it into my "get rid of white space" function, and resized the letters to 24 by 24 pixels. I also checked whether the sum of the pixels in the character image is above a certain threshold, and if so, assumed there was no letter there. I display these as gray cells with an x through them. This resulted in the following:
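By the way, the cleanup step for a single extracted cell is roughly the following. This is only a sketch: the file name and the 0.98 and 0.99 thresholds are made-up stand-ins for whatever I actually tuned:

% Post-process one extracted cell: trim the white border, decide whether
% the cell is empty, and resize to 24 x 24 for the classifiers.
cellImg = im2double(imread('cell.png'));   % hypothetical extracted cell, white = 1

inkRows = find(any(cellImg < 0.98, 2));    % rows that contain some ink
inkCols = find(any(cellImg < 0.98, 1));    % columns that contain some ink

if isempty(inkRows) || mean(cellImg(:)) > 0.99
  % Almost all white -> assume there is no letter (show a gray cell with an x).
  letter = [];
else
  cropped = cellImg(inkRows(1):inkRows(end), inkCols(1):inkCols(end));
  letter  = imresize(cropped, [24 24]);
end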
Here is another zample.
There is an issue though. SOME people (I'm not going to name any names) cannot write inside character boxes. Here is an example:
Because of that last 'a', my algorithm thinks there's a legitimate character there. See:
However, as far as I can tell, if someone can't keep their characters inside the boxes, they don't deserve to get their quiz graded! They can play games... but so can we! They want to play games, so be it! Let the games begin!
Testing out the classifiers
I created and saved off 26 different classifiers - 1 for each lowercase letter. I started out with 5,000 haar-like features and brought that down to 200, so the feature vectors will be of length 200. I chose 200 because I found that beyond 200 there was no improvement in performance. Take a look at the following graph:
As you can see, performance is saturated after 180 iterations. This threshold varies for the different letters, but 200 is always enough from what I've seen.
I spent the majority of my time this weekend dealing with my old code for extracting letters from test data. To remind you, I am given something like:
... and I need to extract all of these letters. I wrote code for this last year but seem to have lost it :) So, I wrote new code to extract the letters, which was time-consuming. I am not entirely happy with the result of that code either, but I'll deal with it for now.
Given 4 images of 1 type of letter that were automatically extracted with the code I wrote, I ran a number of the 26 different classifiers on them to find the confidences. Confidence ranges from 0 to 1, where 1 means the classifier is 100% confident that the letter is a true instance, and 0 means it is 100% confident that the letter is a false instance.
Here are 4 examples of the letter 'n':
Classifier    Image 1    Image 2    Image 3    Image 4
n             0.4799     0.4897     0.5225     0.5120
a             0.3878     0.4333     0.3969     0.3969
b             0.4337     0.4499     0.4552     0.4967
c             0.3715     0.3448     0.3443     0.3232
So, the 'n' classifier does the best, which is a relief. However, these numbers can be pretty close, so hopefully the roster information will take care of this.
I wrote another Perl script to generate the transition probabilities, because I seem to have lost my old one also. Given that the current letter is 'x', I find the probability that the next letter is a 'y' by dividing the number of occurrences of the string 'xy' by the number of occurrences of the letter 'x'.
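The actual script is in Perl, but the counting is simple enough to sketch in Matlab. Here I'm assuming the counts come from the roster itself, stored as a cell array roster of lowercase a-z names; note this version only counts letters that have a successor, which is a slight simplification:

% Estimate letter transition probabilities from the roster names.
counts  = zeros(26, 26);   % counts(i, j) = # times letter i is followed by letter j
singles = zeros(26, 1);    % singles(i)   = # times letter i appears with a successor

for k = 1:numel(roster)
  name = roster{k};
  for t = 1:numel(name) - 1
    i = name(t)   - 'a' + 1;
    j = name(t+1) - 'a' + 1;
    counts(i, j) = counts(i, j) + 1;
    singles(i)   = singles(i) + 1;
  end
end

% P(next = 'y' | current = 'x') = count('xy') / count('x')
trans = counts ./ repmat(max(singles, 1), 1, 26);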
Monday, February 23, 2009
Where the classifier is at now
Sunday, February 22, 2009
Improving the ROC curve
It turns out that the ROC curve posted in the previous post was incorrect. I accidentally tested on my training data. Oooops! Once I fixed this bug, my ROC curve became very ugly. In fact, this is what it looked like:
Uh oh! This scared me. I started tweaking the algorithm by changing the number of haar features from 1,000 to 5,000, and changed the number of weak classifiers from 25 to 100. This barely made any difference. I then thought that maybe my 24 by 24 pixel character images simply weren't sufficient so I generated new training data and made the images slightly bigger - 40 by 40 pixels. This also didn't make a difference.
What did make a difference was centering each of the images and getting rid of the extra white space. This changed the ROC curve to this:
Uh oh! This scared me. I started tweaking the algorithm by changing the number of haar features from 1,000 to 5,000, and changed the number of weak classifiers from 25 to 100. This barely made any difference. I then thought that maybe my 24 by 24 pixel character images simply weren't sufficient so I generated new training data and made the images slightly bigger - 40 by 40 pixels. This also didn't make a difference.
What did make a difference was centering each of the images and getting rid of the extra white space. This changed the ROC curve to this:
Wednesday, February 18, 2009
Boosting in Matlab
Boris gave me his Matlab boosting code. His code is an implementation of AdaBoost using haar-like features. The weak classifier is just a stump (it looks at 1 feature and thresholds it).
The idea behind boosting is that many weak classifiers together create 1 strong classifier. The weak classifiers used in this implementation are extremely trivial and their individual accuracy is not that much better than 0.5. However, as we add more weak classifiers to the overall classifier, the overall accuracy improves, as shown in the following graph:
What is a haar feature?
In this implementation, a random number of rectangles is created, each with a different weight. Note that all of the training images must be the same size. We choose these haar features without having seen the images yet. The feature is computed from the weighted sums of the pixels inside the different rectangles, so each haar feature produces a single value per image.
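To make that concrete, here is a tiny sketch of one such feature being applied to an image. The rectangles and weights below are made up; the point is just that a feature is a weighted combination of rectangle sums that collapses an image to one number:

% One haar-like feature = a few rectangles [row col height width], each
% with a weight.  Applying it to an image gives a single scalar.
img = rand(24, 24);                       % stand-in for a 24 x 24 training image

rects   = [ 3 3 8 18;                     % a band across the upper half
           11 3 8 18];                    % a band right below it
weights = [+1 -1];                        % "light above dark" pattern

featureVal = 0;
for r = 1:size(rects, 1)
  y = rects(r, 1); x = rects(r, 2); h = rects(r, 3); w = rects(r, 4);
  featureVal = featureVal + weights(r) * sum(sum(img(y:y+h-1, x:x+w-1)));
end
% (Real implementations precompute an integral image so each rectangle
%  sum costs 4 lookups instead of a loop over pixels.)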
Example of haar features:
What you'll need for an 'X' classifier:
1. A pool of images of 'X' (positive examples)
2. A pool of images that do not contain 'X' in them (negative examples)
Set aside part of each of the above pools for testing; the rest you'll use for training.
How it will go down (a small Matlab sketch of these steps follows the list):
1. Choose the number of haar features to create, and call this nh.
2. Apply all nh haar features to all of the training images, both positive and negative. As a result, for each image, we will have a feature vector of length nh.
3. Choose the number of weak classifiers desired, and call this nwc. Note that nwc <= nh.
4. Choose nwc of the nh haar features. Ideally, these nwc haar features are the best (strongest) ones out of the bunch. Associate a threshold with each of the nwc haar features.
5. For each of the test images, find the nwc features. We now have a feature vector of length nwc for each of the test images.
6. Given the threshold for each of the nwc features, come up with a confidence for each test image.
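Here is a very stripped-down sketch of steps 3-6 as plain AdaBoost with decision stumps (steps 1-2 are just the haar-feature sketch above applied to every image). This is my own toy version, not Boris's code: it only tries one threshold per feature, and the final mapping of the boosted score to a [0, 1] confidence is just one common choice.

% Assumed inputs:
%   X  : nImages x nh matrix of haar feature values for the training images
%   y  : nImages x 1 vector of labels, +1 for 'X' and -1 for not-'X'
%   Xt : feature matrix for the test images
nwc = 200;                         % number of weak classifiers (stumps)
[n, nh] = size(X);
w = ones(n, 1) / n;                % example weights
stumps = zeros(nwc, 4);            % per round: [feature, threshold, polarity, alpha]

for t = 1:nwc
  bestErr = inf; best = [1, 0, 1];
  for f = 1:nh                     % try a stump on every feature
    thr = median(X(:, f));         % (real code would search over thresholds too)
    for pol = [1 -1]
      pred = pol * sign(X(:, f) - thr);  pred(pred == 0) = pol;
      err = sum(w(pred ~= y));     % weighted training error of this stump
      if err < bestErr, bestErr = err; best = [f, thr, pol]; end
    end
  end
  alpha = 0.5 * log((1 - bestErr) / max(bestErr, eps));
  stumps(t, :) = [best, alpha];
  pred = best(3) * sign(X(:, best(1)) - best(2));  pred(pred == 0) = best(3);
  w = w .* exp(-alpha * (y .* pred));    % mistakes get heavier for the next round
  w = w / sum(w);
end

% Steps 5-6: confidence for each test image = weighted vote of the stumps.
H = zeros(size(Xt, 1), 1);
for t = 1:nwc
  p = stumps(t, 3) * sign(Xt(:, stumps(t, 1)) - stumps(t, 2));  p(p == 0) = stumps(t, 3);
  H = H + stumps(t, 4) * p;
end
confidence = 1 ./ (1 + exp(-2 * H));     % squash the boosted score into [0, 1]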
I created an 'a' classifier for my project. I resized all of my training images to 24 by 24 pixels. I used a portion of the images of 'a' as the positive training examples, and used the remainder of them for testing. I used a combination of images of letters 'b' through 'z' as the negative training examples, and a separate portion of those for testing. Using a threshold of 0.5, the number of false positives was 194/1350, which comes out to a rate of 14%. The true positive rate is 100%. Here is the ROC curve:
How this will apply to my project: Given an image of a letter, I will apply all 26 classifiers to it. I will then have a score for each letter, and can use this instead of the nearest neighbor mechanism I was using before.
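For reference, numbers like the 14% above and the ROC curve itself come from sweeping a threshold over the classifier's confidences. A sketch, assuming confPos and confNeg hold the confidences on the positive and negative test images:

% Sweep the decision threshold and record true/false positive rates.
thresholds = 0:0.01:1;
tpr = zeros(size(thresholds));
fpr = zeros(size(thresholds));
for k = 1:numel(thresholds)
  tpr(k) = mean(confPos >= thresholds(k));   % fraction of real 'a's accepted
  fpr(k) = mean(confNeg >= thresholds(k));   % fraction of non-'a's accepted
end
plot(fpr, tpr);
xlabel('false positive rate');
ylabel('true positive rate');

% Single operating point, e.g. at threshold 0.5:
fprintf('At 0.5: FPR = %.3f, TPR = %.3f\n', mean(confNeg >= 0.5), mean(confPos >= 0.5));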
Monday, February 9, 2009
Boosting with OpenCV
My next step is to apply OpenCV's boosting framework to my problem. I have OpenCV installed, and after a bunch of linking issues, I can at least get things to compile. I have spent this weekend and today figuring out how to use it. I have a much better handle on it now and will hopefully get it working soon.
I have been referencing this tutorial, which is ok.
I plan to train 26 different classifiers, one for each letter.
Random fact: OpenCV either lets you give it one training image, where it takes random samples of the image, or it lets you give it a bunch of images and it does nothing with those - although the tutorial mentioned above provides a way to combine the 2 approaches.
Monday, February 2, 2009
Trying PCA
I decided to use PCA on my training data. The way I did this was with Matlab's princomp function: I run princomp on my n x p matrix (n row vectors, where each row vector has p values) and get back a matrix D of size p x p. I decided to use the first 100 components, so I multiplied my n x p data matrix by D(:,1:100). Sometimes the results are better, sometimes they are worse. I will do more experimentation and report back.
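For completeness, the projection step looks roughly like this. Note that princomp centers the data when it computes the components, so the more careful version subtracts the training mean before projecting; the last two lines are that variant, not necessarily what I actually ran:

% data: n x p training matrix, one flattened character image per row.
[D, score] = princomp(data);        % D is p x p; columns are the principal components

k = 100;                            % keep the first 100 components
reduced = data * D(:, 1:k);         % n x 100 projection, as described above

% Centered variant: subtract the training mean first, and reuse the same
% mean and D when projecting test vectors later.
mu = mean(data, 1);
reducedCentered = (data - repmat(mu, size(data, 1), 1)) * D(:, 1:k);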
Monday, January 26, 2009
Fixing the cropping issue
I've spent some time now fixing up my cropping function. The reason I've been spending my time on this is that I think it's really important to have quality training data. If my training data is crap, then nothing will work. I want to be able to cleanly cut out letters. So, I changed my cropping algorithm to the following:
cut the original image into 4 pieces, as follows:
Here is the left piece:
Right piece:
Top piece:
Bottom piece:
I run the Hough transform on all four pieces. In each, I know where a line would be, if one exists. I have some way of thresholding whether or not there is a line there (the maximum number of votes has to be higher than a certain percentage of the length of the image). If so, I cut the appropriate piece out.
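Per piece, the check looks roughly like this with Matlab's hough functions. A sketch only: piece stands for one of the four strips (grayscale, white = 1), and the 0.5 vote fraction is a made-up stand-in for the percentage I actually use:

% Check one of the four strips for a box line using the Hough transform.
bw = piece < 0.5;                           % ink pixels
[H, theta, rho] = hough(bw);
peak = houghpeaks(H, 1);                    % strongest candidate line

if ~isempty(peak)
  votes = H(peak(1), peak(2));
  if votes > 0.5 * max(size(piece))         % enough votes relative to the strip length
    ang = theta(peak(2));                   % ~0 deg = vertical line, +/-90 = horizontal
    fprintf('Found a line (votes = %d, angle = %g deg) - cut here.\n', votes, ang);
  end
end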
The performance isn't as great as expected. I think that I need to fix up my Hough transform function because sometimes, even when there appears to be a dark line, it won't find it (the number of votes is low).
Here are the overall results anyway:
The letters straight out of the training sheet look like this, with no postprocessing:
The next step for me is to fix my Hough transform function, get the training data looking good, retry nearest neighbor, and then move on to OpenCV (boosting).
Wednesday, January 14, 2009
Cropping
I get better accuracy now when I try to classify my test data. However, I noticed that after running my "cropping" tool, the test data can get pretty deformed. For example, take a look at a before-and-after sequence:
Before:
After:
I decreased the dimensionality of the test data by a lot. The new test data characters are 32 by 32 pixels; before this, they were 120 by 123 pixels. This reduction in dimensionality really brought down the running time of classification. I think that I need to tweak my cropping tool though.
Cropping tool = bad. Exhibit A:
When I change it so that only 2% of the pixels can be all white (instead of 5%), the cut-out changes to:
Now, the same test set looks like this:
Monday, January 12, 2009
Fixing up training data
Saturday, January 3, 2009
I'm baaaack
Didn't get enough last year? Miss me? Well, I'm back for more. I will be continuing the same project. Here is a refresher of how the project works:
The first thing that I will be doing this week is getting back to the point I was at at the end of winter 2008. I will then research machine learning methods to apply to the problem, to replace the nearest neighbor approach.