Wednesday, June 13, 2012

When More Data Trumps Logic

A difficulty with the “more data is better” point of view is that it’s not clear how to determine what the tradeoffs are in practice: is the slope of the curve very shallow (more data helps more than better algorithms), or very steep (better algorithms help more than more data). To put it another way, it’s not obvious whether to focus on acquiring more data, or on improving your algorithms. Perhaps the correct moral to draw is that this is a key tradeoff to think about when deciding how to allocate effort. At least in the case of the AskMSR system, taking the more data idea seriously enabled the team to very quickly build a system that was competitive with other systems which had taken much longer to develop.
That's Michael Nielsen, in an interesting post describing how machine-learning question-answering systems work. I completely agree that identifying trade-offs is one of the most useful ways to decide how to proceed on a problem, which is why I think the general study of trade-offs, across fields, is underrated.