Saturday, June 28, 2008

Iron Man at the box office

On May 13th I predicted that Iron Man would gross $300 million at the box office. It just crossed $300 million, and with a total of $4 million last week it's slowing down. So it looks like I was pretty close.

Friday, June 27, 2008

How imdb determines the top 250

Users rate movies from 1-10 after they have seen the movie. Registering an account is a time-consuming process, which decreases the likelihood that people will register for the petty purpose of giving certain movies a high or low score. Voting, on the other hand, is easy:


The wizards at imdb collect this data, most likely through some sort of SQL server. They then limit the votes that count toward the top 250 to "regular voters." The staff intentionally does not release how they define regular voters, but I would expect that it's some combination of movies voted on and the timespan of activity on an account.

Next, they run the data through a "Bayesian" filtering process. Note that they didn't need to call their data analysis a "true Bayesian estimate", but they did so anyway because doing Bayesian statistics these days is the equivalent of snorting cocaine in the 1980s.

They write the equation as,

Weighted Rating = (V / (V + M)) x R + (M / (V + M)) x C

V = Total number of votes (from regular voters) for the movie
M = Minimum number of votes required to be listed in the top 250 (currently equal to 1300)
C = The average (mean) score of all movies on imdb (currently equal to 6.7)
R = The average (mean) score of the movie, as determined by regular voters

Technically, this is the same form as Bayes' rule, but you seriously don't need to know that in order to understand the equation. Essentially, the equation is set up so that movies with low vote totals will have their scores weighted more towards the mean of 6.7.
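To see the weighting in action, here is a minimal Python sketch of the equation. The function name and the two example movies are made up for illustration; only M = 1300 and C = 6.7 come from the definitions listed with the equation.

```python
def weighted_rating(votes, mean_score, min_votes=1300, site_mean=6.7):
    """Weighted Rating = (V / (V + M)) x R + (M / (V + M)) x C."""
    total = votes + min_votes
    return (votes / total) * mean_score + (min_votes / total) * site_mean


# Two hypothetical movies that both average 9.0 among regular voters.
obscure = weighted_rating(votes=1300, mean_score=9.0)
popular = weighted_rating(votes=200000, mean_score=9.0)

print(round(obscure, 2))  # 7.85: halfway between 9.0 and the site mean of 6.7
print(round(popular, 2))  # 8.99: barely pulled down at all
```

With exactly M votes, a movie's score lands halfway between its own average and the site-wide mean. With 200,000 votes, the pull toward 6.7 almost disappears.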

This makes sense because movies should be voted on by a large number of people before we take their ratings seriously. The top two movies have over 200,000 votes each. Ultimately, it is their ability to harness such a large sample size that makes their rating system better than any other.

Thursday, June 26, 2008

More on correlation versus causation

From Chris Anderson's explosive article in Wired:

Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

I would add that yes, this data is obviously a plus for telling us what we can see, but in order to make predictions we will still need models. Read the whole thing; it's fairly short.

Tuesday, June 24, 2008

The kiss of death

Greg Mankiw endorses Vampire Weekend, calling them "the best new rock band [he] has heard in a long time." If you don't know Greg Mankiw, he's the square who probably wrote your Microeconomics textbook, if you were dorky enough to even take that class.

And since I'm such a big square as to be reading his blog, I will do you all a huge favor and never endorse your band. In fact, if you'd like I'll criticize your band, which will probably improve your standing via the Conformity Theory. Let me know.

Tuesday Statisticz: Tuesday Statisticz

I have a friend who doesn't like writing about writing, because it's too cyclical. But what about statistics about statistics? I've done my fair share of statistics posts on Tuesdays, and now I think it's high time I turned the microscope upside down. Er... inside out. Actually, never mind.

Unfortunately, my only accurate metric for popularity is comments, because my blog is not yet so popular that everybody feels the need to Digg or tag my posts to delicious. Comments aren't a good metric, because readers usually comment when they disagree or have something to say, which isn't directly linked to the quality of the post. But again, it's my only available metric, so bear with me.

So, what makes the posts more popular? I have two ideas: one is the number of words, and the other is the number of links. All of the data was collected from... this blog. And here's the money:


The blue dots show the relationship between words and comments, and the red dots show the relationship between links and comments. Both of the r squared values (0.081 and 0.013) fall too short to make me think that either factor has a relationship with the number of comments.
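For anyone who wants to run the same check on their own blog, here is a rough Python sketch of how an r squared like the ones above can be computed. The word and comment counts in it are invented placeholders, not the data behind the graph, and the function name is just for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)


# Placeholder numbers for illustration only (not the real blog data).
words = [120, 250, 310, 90, 400, 180]
comments = [2, 0, 3, 1, 1, 4]

print(r_squared(words, comments))
```

A value near 0 means the straight-line fit explains almost none of the variation in comments; a value near 1 means it explains nearly all of it.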

By the time I finish writing this paragraph this post will be 243 words long, and I have one outgoing link. Based on my model, I predict 1.5 comments.

Any thoughts?

Sunday, June 22, 2008

Uncle Sam wants you!

If you have a normal last name, do the world a favor and become a mathematician. I'm sick of Yrijustfigjilovian equations and Qusdfitnbnsidox complexities.

By "normal," I mean short, easy to pronounce, and American. Enough said.

Thursday, June 19, 2008

Save the cheerleader

Follow-up to: The Carbon Cycle

On Monday I wrote about how changes in atmospheric carbon levels (or any greenhouse gas) are extremely unlikely to lead to the end of the world. Some of my astute friends pointed out that while climate change might not cause any damage to our planet, it will make it a less pleasant place to live for us humans.

They're right.

However, the future of the world and the future of humanity are logically distinct futures. This may seem obvious, but I don't think that it is a point made often enough. The main reason that people are concerned with global warming is not to save the cute polar bears, or so that we can all take long walks in the park. It's about saving human lives.

The earth doesn't mind if it gets a little bit warmer for a few thousand years. But we do.

Forget the world. Save the cheerleader.

Tuesday, June 17, 2008

Tuesday Statisticz: Still the best

I felt bad for Kobe today as he left the Garden, but then I realized that he's still the best. And what's more, he's been the best for a while now. After Paul Pierce won the Finals MVP, I bet that some talk show hosts are going to be talking tomorrow about why Pierce may be better than Kobe. But they would be wrong.

The two most important statistics for a guard are assist-to-turnover ratio and points per game. Kobe has Mr. Pierce beat in spades in these categories, and he's been consistently better for quite some time now. Here are their assist-to-turnover ratios over the span of their careers (I've truncated Kobe's first two years to compensate for the fact that he was good enough to go straight to the pros):


And here's points per game:


Aside from one (oddball) year in each category, Kobe comes out victorious. I got the data from their Yahoo player profiles (here and here). And since Kobe is on the Lakers, the Lakers win.

Unfortunately, that means that the Celtics lose. Sorry guys, you gave it your all, but in the end, Kobe came out on top. Better luck next year.

Monday, June 16, 2008

The Carbon Cycle

Some critics of anthropogenic climate change argue that since the earth has warmed before, it isn't such a big deal if it does so again today.

I think they're right.

By all accounts, the universe will keep expanding, the moon will still circle us, and the earth is going to keep spinning, with or without global warming. The earth itself doesn't give a damn about global warming. Carbon levels have risen before, they've fallen before, and the world has not ended.

Scientists published data in Nature last month for the lowest 200 meters of Dome C (an ice core in the Antarctic). Here's one of their figures:



The black dots on the bottom right (650,000-800,000 years ago) represent the most recent data from this ice core, and they've placed temperature data from a different ice core above. It doesn't take a geophysicist to see that there's a correlation between the two data sets.

But there's one other inference that most people aren't drawing from this type of climate data. It's that temperatures have risen before, and they've always eventually fallen. Based on that graph, it usually takes about 25,000 years for the carbon levels to drop from their peaks to average levels of CO2.

Which means that no matter how high CO2 levels and temperatures reach, they will eventually come down. The earth is going to be fine.

Global warming will not be the end of the world.

Reference

Luthi D, Le Floch M, Bereiter B, Blunier T, Barnola J-M, Siegenthaler U, Raynaud D, Jouzel J, Fischer H, Kawamura K, Stocker T F. 2008. High-resolution carbon dioxide concentration record 650,000-800,000 years before present. Nature 453: 379-382. doi:10.1038/nature06949.

Sunday, June 15, 2008

Does correlation equal causation?

"Critical Flop", the consistently funny satirist, poses an important question in the comments to the most recent Tuesday Statisticz:

"Not to sound like a stuffy statistics professor, but even if there was a correlation, that's just a correlation, right? It doesn't mean one is the cause of the other."

I have lots and lots of thoughts on this. The general idea from Science (with a capital S) is that you can never prove anything to be true, you can merely show that the opposite of it is not true. This is why scientists conduct experiments, and why Jared Diamond sought natural experiments in his research for Guns, Germs, and Steel.

Unfortunately, brushing aside correlations has gotten people into trouble from time to time. R. A. Fisher, the famous statistician, wrote in a letter to Nature in 1957 that:

"The curious associations with lung cancer found in relation to smoking habits do not, in the minds of some of us, lend themselves easily to the simple conclusion that the products of combustion reaching the surface of the bronchus induce, though after a long interval, the development of a cancer."

Essentially, he was saying that we couldn't assume that cigarettes cause cancer, because although there was a "curious association", correlation does not equal causation. In case you haven't been forwarded the e-mail yet, he was wrong.

At the same time, we can't look at r² values and make a tidal wave out of surf wake. Even if the r² had been higher, we shouldn't conclude that more Tarantino and Lynch will cause people to invent more patentable stuff. There's likely to be a third variable at play, like the economic mobility in the country.

What you want to do in order to prove causation is to start ruling out all these other possible third variables. So check to see if the type of government influences patent applications. Check to see if something else also has a correlation. Once you start ruling other variables out, your explanation will look better and better.
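One rough way to check a suspected third variable is a partial correlation, which asks how strongly two things are related once the third has been accounted for. The Python sketch below uses simulated numbers, not the UNdata records, and the variable names are hypothetical. It builds two measurements that are both driven by a hidden third variable and shows the correlation collapsing once that variable is controlled for.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def partial_r(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / (((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5)


# Simulated data: a hidden third variable drives both measurements.
rng = random.Random(0)
third = [rng.gauss(0, 2) for _ in range(2000)]
movies = [z + rng.gauss(0, 1) for z in third]
patents = [z + rng.gauss(0, 1) for z in third]

raw = pearson_r(movies, patents)
controlled = partial_r(movies, patents, third)
print(raw, controlled)
```

In this made-up setup the raw correlation is strong while the controlled one sits near zero, which is the pattern you would expect when the third variable is doing all the work. If the controlled correlation stays high, that variable is ruled out and your explanation looks a little better.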

Bottom Line: Sometimes. Correlation is a good start, and if you build up more and more data, you can make a good case that there is a causal relationship.

Tuesday, June 10, 2008

Tuesday Statisticz: Do movies make you more creative?

Well, do they?

From UNdata, I found records from 1999 of the number of total cinema seats and patent applications (a rough measure of creativity?) per country. They don't need to be adjusted for population since they both depend on it, and when you graph the two data sets with corresponding countries, it seems that...

... much to the chagrin of cinephiles worldwide, the answer is, not really.

Saturday, June 7, 2008

Debunking myths with statistics

Myth #1: Eating local is the key to curbing global warming. Ummm, not really. Transportation costs only account for 11% of the carbon footprint when it comes to food, and the difference between local and distant only accounts for 4%. (Hat tip: MR)

Reversal: Eating locally may still help prevent allergies, so it has switched from altruistic to mostly selfish.

Myth #2: The SATs are a poor measure for predicting college success. Actually, they may not be all that bad. Researchers at the University of Minnesota found that the correlation of math and verbal scores with GPA, correcting for the difficulty of the class, averages .55, with a sample size of 100,000+. This is a strong correlation. Additionally, they found that these scores were not solely an artifact of socioeconomic status.

Reversal: The question of whether or not these tests are predictive of college success is separate from how much colleges should weigh them as an admissions criterion. Hate the game, not the player.

Myth #3: Legalizing prostitution would turn any country into a modern-day Gomorrah. Not in New Zealand! Five years later, the act decriminalizing it has not led to an increase in sex workers and may have even had a positive effect on their health and safety.

Reversal: New Zealand is weird. At one point in their history there are believed to have been hobbits.

Friday, June 6, 2008

Ranking the Harry Potters

7) Harry Potter and the Half-Blood Prince -- too sad.

6) Harry Potter and the Chamber of Secrets

5) Harry Potter and the Order of the Phoenix

4) Harry Potter and the Goblet of Fire

3) Harry Potter and the Sorcerer's Stone -- started it all.

2) Harry Potter and the Prisoner of Azkaban -- the one that blew up the series.

1) Harry Potter and the Deathly Hallows -- not the funniest, but cemented the legacy.

Tuesday, June 3, 2008

Tuesday Statisticz: Urban carbon dioxide emissions, part II

Theorizer (that's a real word?!) Teddy brings up an objection in the comments to part 1:

"Highly urbanized countries are the most developed countries, and the standard of living is higher. Higher standard of living means more emissions from transportation and manufacturing."

Wikipedia defines standard of living by the human development index, so I'll use that too. (The data is from 2005, while the other two measures are from 2007, but there shouldn't have been too much change in HDI since then.)

There isn't much of a relationship between HDI and carbon dioxide emissions (although it is positive, r squared = 0.047), and there isn't much of a relationship between standard of living and percentage of urban population in the first place:



Granted, there aren't many countries that even report carbon dioxide emissions, and those countries tend to have higher standards of living. Maybe a better idea would be to do an analysis by US state, which would probably have more data.

By the way, the country with the highest carbon dioxide emissions per capita? Luxembourg, whose slogan is, "come visit our country, but make sure you also have an afternoon activity." Ouch.

Monday, June 2, 2008

Freud the comedian

"The diminution of the olfactory stimuli seems itself to be a consequence of man's raising himself from the ground, of his assumption of an upright gait; this made his genitals, which were previously concealed, visible and in need of protection, and so provoked feelings of shame in him."

From Civilization and its Discontents. How would Freud explain the rise of nudist colonies?

Sunday, June 1, 2008

Randomizing TV ads

Back when I used to watch tons (tons) of TV, one of my main skills was changing the channel at the right time so that I never watched any ads. I was like Arturo Toscanini orchestrating Sportscenter and re-runs of Comedy Central. However, almost all of my success relied on the fact that most stations' ad periods were the same length.

If TV stations want us to watch ads (it's how they make money), then they should randomize this length. So maybe one ad period would be 4 minutes and another would be just 30 seconds. They wouldn't have to change the overall mean, but merely up the variance. This way, if you really don't want to miss any of the show, you'll have no choice but to stay tuned to the channel.
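Here is a quick Python sketch of what "same mean, more variance" could look like. Everything in it is an assumption made for illustration: the 135-second fixed break, the choice to draw lengths uniformly between 30 seconds and 4 minutes, and the function names.

```python
import random
import statistics

MEAN_BREAK = 135  # seconds; a hypothetical fixed ad break length


def fixed_breaks(n):
    """Every ad break has the same, predictable length."""
    return [MEAN_BREAK] * n


def randomized_breaks(n, shortest=30, longest=240, seed=0):
    """Break lengths drawn uniformly between 30 seconds and 4 minutes."""
    rng = random.Random(seed)
    return [rng.uniform(shortest, longest) for _ in range(n)]


fixed = fixed_breaks(10000)
randomized = randomized_breaks(10000)

# Mean and spread (population standard deviation) of each schedule.
print(statistics.mean(fixed), statistics.pstdev(fixed))
print(statistics.mean(randomized), statistics.pstdev(randomized))
```

Both schedules show about the same amount of advertising per break on average, but only the fixed one lets a channel surfer time the return trip.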

Yes, this would annoy people, but aren't people annoyed by ads anyways?