The New Computer Scientist

04 April 2009

Wired has been pushing the idea of "the Petabyte age" for a little while now. The basic idea is this: now that computers are able to meaningfully process enormous data sets (a petabyte is a thousand terabytes, or about 1.6 million compact discs), we are facing a revolution in the way in which we can approach science, eschewing the scientific method in favor of the statistical method. In theory, we can just give a computer an extraordinary amount of information (say, the entire human genome), press a button labeled "make sense of it," and wait.

This sounds a lot more outlandish than it apparently is: as Wired covered Friday, it is actually happening. Researchers at Cornell handed their computer nothing more than basic arithmetic and some raw data on how a pendulum moves, and the computer spat out Newtonian dynamics. Meanwhile, at Columbia, a robotic lab devised and conducted an experiment, start to finish: identifying and isolating certain genes in baker's yeast.

What's interesting, though, is that these computers don't do science the way we practice it -- their approach is radically different. A human studies the problem and tries to understand it, so that s/he can formulate a hypothesis that is probably correct, test it, and (hopefully) get it right the first time. The computer's first hypothesis is almost certain to be wildly inaccurate. So it comes up with another. And another. And so on, until it has covered pretty much all the possibilities.

The computer at Cornell was particularly interesting in this respect: it uses what they call a "genetic" algorithm for finding the right solution. Its first round of calculations for the pendulum was totally wrong across the board, but some of the calculations were less wrong than others, so it pursued those sorts of calculations in its next round, and so on, until it had finally come up with equations that modeled the pendulum exactly.
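To make the "keep whatever is less wrong" loop concrete, here is a minimal sketch in Python. It is not the Cornell system, just a toy of my own under loud assumptions: the "observations" are synthetic pendulum periods generated from the textbook small-angle formula, the program is told in advance that the answer looks like T = a * L^b, and it uses only selection and mutation (no crossover), so it is a stripped-down cousin of a real genetic algorithm. Every name in it (LENGTHS, PERIODS, error, evolve) is made up for illustration.

```python
import math
import random

# Synthetic "observations" (an assumption, not real lab data):
# small-angle pendulum periods T = 2*pi*sqrt(L/g) for a few lengths L.
LENGTHS = [0.25, 0.5, 1.0, 1.5, 2.0]  # metres
PERIODS = [2 * math.pi * math.sqrt(L / 9.81) for L in LENGTHS]  # seconds


def error(candidate):
    """Sum of squared errors of the guess T = a * L**b against the data."""
    a, b = candidate
    return sum((a * L ** b - T) ** 2 for L, T in zip(LENGTHS, PERIODS))


def evolve(generations=200, pop_size=50, keep=10, seed=0):
    """Evolve (a, b) pairs: keep the least-wrong guesses, mutate them, repeat."""
    rng = random.Random(seed)
    # Round one: random guesses, almost all of them badly wrong.
    population = [(rng.uniform(0, 5), rng.uniform(-2, 2)) for _ in range(pop_size)]
    history = []  # best error seen at the start of each generation
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        survivors = population[:keep]  # the "less wrong" candidates live on
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(survivors)
            # A child is a slightly perturbed copy of a surviving parent.
            children.append((a + rng.gauss(0, 0.1), b + rng.gauss(0, 0.1)))
        population = survivors + children
    population.sort(key=error)
    history.append(error(population[0]))
    return population[0], history


if __name__ == "__main__":
    (a, b), history = evolve()
    print(f"first-round best error: {history[0]:.4f}")
    print(f"final best error:       {history[-1]:.6f}")
    print(f"evolved model: T = {a:.3f} * L^{b:.3f}")
```

Because the survivors are carried over unchanged, the best error can never get worse from one round to the next. The real system had a much harder job: it had to discover the form of the equations, not just tune two numbers.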

This sort of science-by-natural-selection is incredibly powerful -- it's capable of teasing patterns out of immensely complex data, but it has pitfalls. For one thing, it's very likely to run into correlation/causation issues. For example, a computer would probably come up with this:

[Image: Lemons on the highway]

without having any idea that the hypothesis it implies isn't just wrong, it's laughable.
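The trap is easy to reproduce. Here is a small Python sketch, using two series I made up on the spot (nothing below is real data): one quantity drifts upward over ten years, the other drifts downward, and the two have nothing to do with each other beyond both changing over time. The helper name pearson and the variable names are my own.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


rng = random.Random(1)
years = range(10)

# Two invented, causally unrelated series that each happen to trend over time.
series_a = [100 + 10 * t + rng.gauss(0, 3) for t in years]  # rises year on year
series_b = [50 - 2 * t + rng.gauss(0, 1) for t in years]    # falls year on year

r = pearson(series_a, series_b)

if __name__ == "__main__":
    print(f"correlation between the two unrelated series: {r:.2f}")
```

The coefficient comes out strongly negative, which a pattern-hunting program would happily report as "more of A means less of B." Nothing in the numbers says the two are unrelated; that judgment has to come from someone who knows what the series actually measure.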

For now, this is another tool for science -- a very powerful tool -- and we ought to be glad we can take advantage of it. Who knows what problems it may solve?