Tuesday, October 03, 2006
The Netflix Prize
This $1 million prize, announced at http://www.netflixprize.com/ looks to be a challenging demonstration application for Rasch measurement - thanks for telling me about it, Martin Caust. It has to do with the analysis of customer ratings of movies, and the prediction of what movies they would like to obtain. Right now I'm downloading the initial 665MB data set. My computer tells me it will take 13 hours over my connection - but other folks have reported 20 minutes to download over theirs.
Netflix are likely to receive submissions that are local optimizations of the type that W.E. Deming decried. My suspicion is that someone will discover an opportunistic algorithm that beats the existing Netflix algorithm, but which does not generalize. A Rasch solution would generalize better, even though apparently doing "worse" on the initial data set.
Any other Rasch folk going to give this a try?
Progress report on the Netflix analysis: there are 17,770 movies (items), 480,189 customers and 100,480,507 observations on a 1-5 rating scale. If this were set up for Winsteps (capacity 40,000 items, 10 million persons), the data set size would be about 20,000 x 500,000 = 10GB and the required workspace about 50GB - too big for my current hardware. So I'm running the data through Facets, which is very efficient with sparse data matrices - the Netflix data is 99% missing. No unusual problem with the data file or data input. But there is a computational overflow halfway through the first estimation iteration - time to enhance the Facets software!
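The size and sparsity figures above follow from simple arithmetic on the published counts. A minimal sketch (the one-byte-per-cell assumption for the rectangular layout is mine, not from the Winsteps documentation):

```python
# Arithmetic behind the Netflix data-set estimates.
movies = 17_770          # items
customers = 480_189      # persons
observations = 100_480_507

# If the ratings were stored as a dense persons-by-items matrix:
cells = movies * customers
density = observations / cells
print(f"dense matrix cells: {cells:,}")
print(f"filled: {density:.2%}, missing: {1 - density:.2%}")  # roughly 99% missing

# Rounding the dimensions up to a 20,000 x 500,000 rectangular layout,
# at an assumed one byte per cell, gives the ~10GB figure quoted above.
approx_bytes = 20_000 * 500_000
print(f"approx rectangular data set: {approx_bytes / 1e9:.0f} GB")
```

With only about 1% of the cells occupied, a format that stores just the observed ratings (as Facets does) is two orders of magnitude smaller than the dense layout.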