Monday, January 22, 2007


The Netflix Challenge and Good Theory

William Fisher reminds me that "there is nothing so practical as a good theory"*. But what is good theory for the Netflix challenge, which is essentially predicting 3 million new ratings from 100 million previous ones? It turns out there are several challenges.

1. Predicting the 1.5 million "quiz" ratings ... for which a summary statistic of success so far is supplied. These decide the "leaders".
2. Predicting the 1.5 million "test" ratings ... for which there is no feed-back until the competition is over. These decide the "winner".
3. Constructing a Netflix-acceptable method for achieving 2 in order to be eligible to win.

It would seem that 3. comes first. But no! There is no need to consider whether a prediction method is acceptable until one is in a winning position. When one has a winning solution, then one can commence the effort to develop an acceptable method of obtaining that solution (or better).

So, since we won't know how 2. went until after the challenge is over, the task becomes success on 1. That success is shown on the Netflix Leaderboard: http://www.netflixprize.com/leaderboard . As of this writing, I'm about 30th of the 1,352 teams who have submitted answers (multiple submissions are allowed, as often as one per day, but only a team's best result is shown, and only for the top 40 teams). Most of the 14,994 registered teams have not submitted an answer.

So what theory is good? Initially, I tried complicated methods, but my results were only so-so. A couple of elementary insights improved my answer sets - but they still left me far behind the leaders. Then Simon Funk, an early leader, published his approach at http://sifter.org/~simon/journal/20061211.html . It is a simple raw-observation SVD (singular value decomposition). Conclusion: my approach was too elaborate.

So it's back to the algebraic drawing board. It's not a matter of predicting the most likely set of ratings, but rather of predicting a particular set of ratings. I've tried many different ideas ... but nothing brilliant so far. Two methods appeared to be winners .... but they fell by the wayside. Happily, there are more ideas to try ....

What have I learned? (a) How to manipulate a dataset of 100 million ratings. (b) The virtue of 2GB of RAM. (c) Slick methods of computing in C++. (d) Additional features to add to Facets.

Will brains beat brawn in the Netflix Challenge? Will the winner be the team with the keenest insights or the one with the greatest computer power? Keen insights, i.e., good theory, seem to be the way to go ....

* Lewin, K. (1951). Field theory in social science: Selected theoretical papers (D. Cartwright, Ed.). New York: Harper & Row, p. 169.
