I learnt a lot from the photo quality prediction problem on Kaggle. Feature engineering played a big role this time, and most of my effort went into it.
The problem is to predict whether a photo is of good quality based on its metadata rather than the image itself (http://www.kaggle.com/c/PhotoQualityPrediction). The metadata contains three kinds of information: 1) location: the latitude and longitude of the photo; 2) shape: the width and height of the photo, along with its size (MB); 3) text: the name, description and caption of the album. Results are evaluated by binomial deviance.
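To make the metric concrete, here is a minimal sketch of binomial deviance as the mean negative log-likelihood of the labels. The natural logarithm and the clipping constant `eps` are my assumptions; the exact variant Kaggle used (log base, any capping of predictions) is not spelled out in this post, so the numbers from this function may not match the leaderboard scale.

```python
import math

def binomial_deviance(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels (assumes natural log)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip the prediction so that log(0) can never occur (eps is an assumption).
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

Lower is better: confident correct predictions push the value towards 0, while confident wrong predictions are punished heavily.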
It’s obvious that these three kinds of features are quite different, and it turned out that the location and text information are the more important ones. For the first several days I focused on the text information and tried traditional text classification methods. I used tf-idf and word counts for feature extraction and applied Naive Bayes, Random Forest, Logistic Regression and even SVM, but I could hardly get the binomial deviance under 0.22, even when combining the text with location or shape features. (The leader is at 0.18434.)
I think the problem is that after tokenizing the text information (name, description, caption), most of the text fields contain only a few words, especially names and descriptions. So maybe each field is better regarded as a handful of words rather than as text.
So in the end, after trying different sets of features, I used some very different ones, and the result was amazing. My features fall into the same three categories as before, and I didn’t try to combine features across categories.
First, let’s define y=1 to correspond to a photo of good quality. Then, for each word that appears in the text (name, description, caption), I calculated the average score of this word by dividing the total # of good photos that contain this word by the total # of appearances of this word. So now each word has a score, and the text features contain:
1) Score of name, description, caption and the score of the total text: I calculated the average score over the words in the name as the score for the name, and the same goes for description and caption. Also, I computed the average score for the whole text (combining name, description and caption).
2) Standard deviation of the scores in name, description, caption and the whole text. It seems that the average score of the text is not enough; adding the standard deviation to the features improved the result to some extent (about 0.002).
3) The length of name, description and caption.
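The word scores and the three text features above can be sketched roughly as follows. This is an illustration, not my original code: the function names, whitespace tokenization, counting each word once per photo, and the `default` score of 0.5 for words never seen in training are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def word_scores(texts, labels):
    """Score of a word = (# good photos containing it) / (# photos containing it)."""
    good = defaultdict(int)
    total = defaultdict(int)
    for text, y in zip(texts, labels):
        # Assumption: a word counts once per photo, even if repeated in the text.
        for word in set(text.split()):
            total[word] += 1
            good[word] += y
    return {word: good[word] / total[word] for word in total}

def text_features(text, scores, default=0.5):
    """Mean score, standard deviation of scores, and length for one text field."""
    words = text.split()
    # Assumption: unseen words get a neutral default score.
    values = [scores.get(word, default) for word in words]
    if not values:
        return {"mean": default, "std": 0.0, "length": 0}
    return {"mean": mean(values), "std": pstdev(values), "length": len(words)}
```

The same `text_features` call would be applied to name, description, caption and the concatenation of all three. One thing to keep in mind with scores like these is that a training photo's own label is part of the score of its words, so computing them out-of-fold is probably the safer choice.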
For each (latitude, longitude) pair, I computed the average score by dividing the # of good photos at this location by the total # of appearances of the location. I also computed the average score of latitude and longitude separately. So the location features include the score for the (latitude, longitude) pair, the score for latitude and the score for longitude. As there might be some places where photos are more welcome, constructing features in this way lets the classifier capture this kind of information better. Actually, classifiers tend to learn “linear” features better, and a score like this relates to the label far more directly than a raw coordinate does.
Almost the same as before, the shape features include the score of the (width, height) pair, the score of width, the score of height and the score of size.
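The location and shape scores are the same computation applied to different keys, so one helper covers both. Again this is only a sketch: `group_scores`, `location_features` and the `default` of 0.5 for values not seen in training are names and choices I made up for illustration.

```python
from collections import defaultdict

def group_scores(keys, labels):
    """Score of a key = (# good photos with this key) / (# photos with this key)."""
    good = defaultdict(int)
    total = defaultdict(int)
    for key, y in zip(keys, labels):
        total[key] += 1
        good[key] += y
    return {key: good[key] / total[key] for key in total}

def location_features(lat, lon, pair_scores, lat_scores, lon_scores, default=0.5):
    """Three features: score of the pair, of the latitude, and of the longitude."""
    # Assumption: keys never seen in training fall back to a neutral default.
    return [
        pair_scores.get((lat, lon), default),
        lat_scores.get(lat, default),
        lon_scores.get(lon, default),
    ]

# Tiny made-up example: three photos, two of them good.
lats, lons, labels = [10, 10, 20], [5, 6, 5], [1, 0, 1]
pair_scores = group_scores(list(zip(lats, lons)), labels)
lat_scores = group_scores(lats, labels)
lon_scores = group_scores(lons, labels)
```

For the shape features, the keys would be the (width, height) pair, the width, the height and the size instead.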
I didn’t spend too much time on selecting classifiers. Logistic regression gets a good result and is very fast to train since the data set is small. Usually it gives me an idea of how good the features are.
Eventually, I used a random forest with 2000 trees and max_features=2. The training time was only around 20 min. The best result I got is 0.19131, and the leader is at 0.18434.
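For reference, the two models could be compared on a hold-out split along these lines. This sketch assumes scikit-learn (where `max_features` is a `RandomForestClassifier` parameter); the helper names, the 25% hold-out split, `max_iter=1000` and the fixed seeds are my own choices for the example, and `log_loss` uses the natural log, which may not match the competition metric exactly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def make_models(n_trees=2000):
    """Quick baseline plus the random forest settings mentioned above."""
    return {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(
            n_estimators=n_trees, max_features=2, random_state=0
        ),
    }

def holdout_scores(X, y, n_trees=2000, seed=0):
    """Fit each model on 75% of the rows and report log loss on the other 25%."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    results = {}
    for name, model in make_models(n_trees).items():
        model.fit(X_tr, y_tr)
        p_good = model.predict_proba(X_va)[:, 1]  # probability of y=1
        results[name] = log_loss(y_va, p_good, labels=[0, 1])
    return results
```

Here `X` would be the matrix of score features described above (it needs at least 2 columns for `max_features=2`) and `y` the good/bad labels. Passing a smaller `n_trees` is handy for a quick check before the full run.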
I think the solution can be divided into 2 phases: feature engineering and machine learning. Feature engineering involves constructing features, manually generating synthetic features from raw features, etc. Usually if the raw data hasn’t been processed much and there are many different types of features, this phase may influence the final result significantly. However, I feel there can hardly be any rules for generating good features; it’s mostly a matter of experience.
In the second phase, machine learning, we apply different algorithms to the features. Different classifiers may perform very differently depending on the type of features, the # of features, the # of training examples, etc. Bias/variance analysis is a good way to adjust parameters or to choose between classifiers. And ensemble methods are very effective in practice, like boosting or random forest (which is a bagging method rather than boosting).