Friday, June 27, 2008

How imdb determines the top 250

Users rate movies from 1 to 10 after they have seen the movie. Registering an account is a time-consuming process, which decreases the likelihood that people will register for the petty purpose of giving certain movies a high or low score. Voting, on the other hand, is easy.

The wizards at imdb collect this data, most likely through some sort of SQL server. They then limit the votes that count toward the top 250 to "regular voters." The staff intentionally does not release how they define regular voters, but I would expect that it's some combination of movies voted on and the timespan of activity on an account.

Next, they run the data through a "Bayesian" filtering process. Note that they didn't need to call their data analysis a "true Bayesian estimate", but they did so anyway because doing Bayesian statistics these days is the equivalent of snorting cocaine in the 1980s.

They write the equation as,

Weighted Rating = (V / (V + M)) x R + (M / (V + M)) x C

V = Total number of votes (from regular voters) for the movie
M = Minimum number of votes required to be listed in the top 250 (currently equal to 1300)
C = The average (mean) score of all movies on imdb (currently equal to 6.7)
R = The average (mean) score of the movie, as determined by regular voters
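
For anyone who would rather read code than algebra, here is a minimal Python sketch of the equation above. The function name and argument names are my own invention, not anything imdb publishes, and the defaults are just the values of M and C quoted above.

```python
def weighted_rating(v, r, m=1300, c=6.7):
    """Weighted rating, following the equation quoted above.

    v: number of votes for the movie (from regular voters)
    r: the movie's own mean score
    m: minimum votes required for the top 250 (1300 as quoted above)
    c: mean score across all movies (6.7 as quoted above)
    """
    if v < 0 or m < 0 or v + m == 0:
        raise ValueError("vote counts must be non-negative and not both zero")
    # Weighted average of the movie's own mean and the site-wide mean.
    return (v / (v + m)) * r + (m / (v + m)) * c


# Hypothetical movie: exactly 1300 votes with a raw average of 9.0.
# The two weights are both one half, so the result sits midway between 9.0 and 6.7.
print(round(weighted_rating(v=1300, r=9.0), 2))
```

Note that the two weights always add up to 1, so the result can never land outside the range between R and C.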

Technically, this has the same form as a Bayesian posterior mean (a weighted average of a prior guess, C, and the observed data, R), but you seriously don't need to know that in order to understand the equation. Essentially, the equation is set up so that movies with low vote totals will have their scores weighted more towards the mean of 6.7.
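
To see that pull toward the mean in action, here is a rough Python sketch that prints how much weight a movie's own average gets at different vote totals. The raw average of 9.0 and the vote counts are made up for illustration; M and C are the values quoted above.

```python
M = 1300  # minimum votes for the top 250, as quoted above
C = 6.7   # mean score across all movies, as quoted above


def own_weight(v, m=M):
    """Fraction of the weighted rating that comes from the movie's own average."""
    return v / (v + m)


HYPOTHETICAL_R = 9.0  # made-up raw average, for illustration only

for votes in (100, 1300, 13000, 200000):
    w = own_weight(votes)
    # The remaining weight (1 - w) goes to the site-wide mean C.
    rating = w * HYPOTHETICAL_R + (1 - w) * C
    print(f"{votes:>7} votes: weight on own average {w:.2f}, weighted rating {rating:.2f}")
```

At exactly M votes the movie's own average and the site-wide mean count equally; with a couple hundred thousand votes the site-wide mean barely matters.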

This makes sense because movies should be voted on by a large number of people before we take their ratings seriously. The top two movies have over 200,000 votes each. Ultimately, it is imdb's ability to harness such a large sample size that makes its rating system better than any other.