Saturday, August 13, 2011

Searching For The Imdb Of Books, Part II

As watching imdb's top 250 most highly rated movies has proven to be such a smashing success, I have long yearned to find (and fleetingly, to develop) a similarly authoritative list for fiction books. The keys for a good list are: 1) a large sample size, 2) shrinkage estimation of ratings to the average, 3) a continuous scale (the more levels, the better, but yes we'll often have to settle for five stars), 4) defenses against gaming, and 5) a wide index of titles. To the best of my knowledge no site fulfills all of these requirements. These are the current contenders:

Amazon Reviews: Upside: They have a huge incentive to index all available books and are proficient at combining ratings across different editions of the same text. They also have a useful "was this review helpful to you?" tool which could eventually be employed to rate the raters and thus weight the overall ratings. Downsides: Their insistence on showing the average rating in half-star increments (typically 5, 4.5, 4, or 3.5) means that it takes manual calculation to distinguish between the two radically different scores of 4.24 and 3.76. I also often don't trust the resistance of their ratings to gaming. But most damningly, there is simply no attempt to create a good list of the most highly rated fiction books. Filtering by "highest average rating" in "literature and fiction", their #7 best fiction book of all time is currently Jim Gorant's The Lost Dogs: Michael Vick's Dogs and Their Tale of Rescue and Redemption, which is probably a fine book, but I think the author would be insulted to hear that it was considered fiction, and I think more than three-quarters of the English profs across the country would be insulted to hear it called literature.
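To make the half-star complaint concrete, here is a minimal Python sketch. It assumes the displayed value is simply the average rounded to the nearest half star; the function name is made up for illustration and this is not Amazon's actual code.

```python
def to_half_stars(avg):
    """Round an average rating to the nearest half star (assumed display rule)."""
    return round(avg * 2) / 2

# The two scores from the paragraph above, nearly half a star apart.
for avg in (4.24, 3.76):
    print(avg, "->", to_half_stars(avg))  # both display as 4.0
```

Under that assumption, a 0.48-star gap disappears entirely from the display.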

One-Time Votes: By this I refer to the ad hoc competitions run by various websites which ask users to vote on their favorite books. There are many of these strewn across the web; for example, check out NPR's top 100 science fiction and fantasy books, or Modern Library's top 100 novels. Upside: These tend to get large sample sizes (NPR had >60,000 votes), which makes them more accurate and harder to game. Downside: The process is not iterative and requires manual input to update, so they won't last or scale. More troublingly, many (such as NPR's) only allow the option to select one's favorite books, without voting others down, which unfairly favors books with high variance as opposed to just high average quality.
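The variance point can be illustrated with a toy example in Python. The ratings below are invented, and the sketch assumes a reader only casts a "favorite" vote for a book they would have given five stars.

```python
from statistics import mean

# Hypothetical ratings from 100 readers each (invented data).
polarizing = [5] * 40 + [1] * 60   # loved by some, hated by most
solid = [4] * 90 + [5] * 10        # liked by nearly everyone

def favorite_votes(ratings):
    """Count readers who would vote for the book as a favorite (assumed: 5 stars only)."""
    return sum(1 for r in ratings if r == 5)

for name, ratings in (("polarizing", polarizing), ("solid", solid)):
    print(name, "mean:", mean(ratings), "favorite votes:", favorite_votes(ratings))
```

The polarizing book wins the favorites-only vote 40 to 10 even though its average rating is far lower (2.6 against 4.1).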

Google Books: The site aggregates ratings from elsewhere on the web, including major vendors and online "bookshelves." Upside: Transparent code, takes ratings from diverse sources, and has a clean layout. Downside: Like Amazon, it also displays ratings in half-star increments (et tu, google?). But their biggest problem is that different editions of books are stored in different locations and the ratings are not aggregated across editions. See, for instance, the first four results of a search for "pride and prejudice" (here, here, here, and here). Now, even if they did manage to output one total score per novel, it still doesn't seem very google-like to actually curate such a list themselves. But in that case, it wouldn't be hard for someone else to scrape the ratings and convert them into a ranked list.
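Combining ratings across editions could look something like the Python sketch below. The edition records are invented, and matching editions by lowercased title is a simplifying assumption; a real scraper would need a sturdier way to decide that two editions are the same work.

```python
from collections import defaultdict

# Hypothetical per-edition data: (title, average rating, number of ratings).
editions = [
    ("Pride and Prejudice", 4.5, 200),
    ("pride and prejudice", 4.0, 100),
    ("Pride and Prejudice ", 3.0, 100),
]

def combine_editions(records):
    """Pool editions by normalized title, weighting each average by its rating count."""
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of stars, count]
    for title, avg, n in records:
        key = title.strip().lower()
        totals[key][0] += avg * n
        totals[key][1] += n
    return {key: (stars / n, n) for key, (stars, n) in totals.items()}

print(combine_editions(editions))
```

Weighting by the number of ratings matters here: a plain average of the three edition scores would overstate the influence of the two smaller editions.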

Library Thing: Upside: They have scale, with over 10 million ratings, and they already have some pretty cool statistics (check out the most "connected" people--Napoleon is #1). They also do have a top 25 books by rating. Downside: They need to split the rankings for non-fiction and fiction. At this point I've given up on searching for a canonical non-fiction ranked list, as those ratings are so context-dependent and world-view driven. And they need to do a better job of categorizing in general. For example, the movie for LoTR: Two Towers, while an awesome movie and in imdb's top 250, should not be among the highest rated 25 books. More importantly, the editors of the site have not implemented a rating system that punishes books with fewer ratings. Instead, books simply need a minimum total of 20 ratings to make the list. This is bothersome, but easily improved, as the editors could simply study and implement the imdb method.
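As I understand it, the formula imdb has described for its top 250 is a weighted rating, WR = (v / (v + m)) * R + (m / (v + m)) * C, where R is the title's own average, v its number of votes, m a tuning constant, and C the site-wide average. A minimal Python sketch follows; the values of m and C and the two example books are assumptions chosen purely for illustration.

```python
def weighted_rating(R, v, m, C):
    """Shrink a book's average R (from v ratings) toward the overall mean C.

    m controls how many ratings a book needs before its own average dominates.
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 3.8   # assumed site-wide average rating
m = 100   # assumed shrinkage constant

obscure = weighted_rating(R=4.9, v=20, m=m, C=C)     # barely clears a 20-rating cutoff
popular = weighted_rating(R=4.5, v=2000, m=m, C=C)   # widely rated

print(round(obscure, 2), round(popular, 2))
```

With a hard cutoff of 20 ratings the obscure book would outrank the popular one on raw averages; after shrinkage the order flips, which is the punishment for thin samples that the current list lacks.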

Good Reads: Upside: As far as I can tell, this is the largest "bookshelf" site with the most user ratings. Huge potential. Downside: They've made no attempt to publish a list of the highest rated books across the site! All I can ask is, what is holding you back, GoodReads editors? Qualms about alienating authors whose works won't make the list? Fears of being labelled imperialistic? These are both hogwash. Our time is scarce and in order to be informed consumers we need to know what the best books are. If you are worried about the arbitrariness of the minimum votes cut-off, then publish multiple lists with different scaling parameters. You will thank me later when the list gets out-of-control traffic. Indeed, a group of passionate GoodReads users recently called for such a list. To this valiant effort I can only say, Viva la Résistance!
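Publishing several lists with different scaling parameters might look like the Python sketch below. The three books and their numbers are invented, and the ranking uses a shrinkage score toward the overall mean as a stand-in for whatever method a site would actually choose.

```python
# Hypothetical catalogue: (title, average rating, number of ratings).
books = [
    ("A", 4.9, 25),
    ("B", 4.5, 5000),
    ("C", 4.7, 400),
]

def ranked(catalogue, m):
    """Return titles ordered by average shrunk toward the overall mean with weight m."""
    total_votes = sum(n for _, _, n in catalogue)
    overall = sum(avg * n for _, avg, n in catalogue) / total_votes

    def score(book):
        _, avg, n = book
        return (n * avg + m * overall) / (n + m)

    return [title for title, _, _ in sorted(catalogue, key=score, reverse=True)]

# One list per scaling parameter, from lenient to strict.
for m in (10, 1000):
    print("m =", m, ranked(books, m))
```

A small m rewards the thinly rated book A, while a large m puts the better-sampled book C on top. Showing both lists side by side lets readers pick how much they trust small samples.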