This weekend's NYT carried an alarming feature article on the gross wastefulness of the data centers that host the world's racks of server hardware. James Glanz's feature, The Cloud Factory, painted a picture of grotesque waste and depraved indifference to the monetary and environmental costs of the "cloud," and suggested that the industry's "dirty secret" was that better ways of doing things existed and the industry was indifferent to them.
In a long rebuttal, Diego Doval, a computer scientist who previously served as CTO of Ning, Inc., takes apart the claims made in the Times piece, showing that they were unsubstantiated, out-of-date, unscientific, misleading, and pretty much wrong from top to bottom.
First off, an “average,” as any statistician will tell you, is a fairly meaningless number if you don’t include other measures of the population’s distribution (starting with the standard deviation). Not to mention that this kind of “explosive” claim should be backed up with a description of how the study was conducted. The only thing mentioned about the methodology is that they “sampled about 20,000 servers in about 70 large data centers spanning the commercial gamut: drug companies, military contractors, banks, media companies and government agencies.” Here’s the thing: Google alone has more than a million servers. Facebook, too, probably. Amazon, as well. They all do wildly different things with their servers, so extrapolating from “drug companies, military contractors, banks, media companies, and government agencies” to Google, or Facebook, or Amazon is just not possible on the basis of just 20,000 servers in 70 data centers.
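The point about averages being meaningless without a measure of spread is easy to demonstrate. Here is a small sketch with two hypothetical server fleets (the utilization numbers are invented for illustration): both have exactly the same mean utilization, but one is a uniform fleet of lightly loaded machines while the other is mostly idle servers plus a handful running near capacity.

```python
import statistics

# Fleet A: 100 servers, every one at a steady 10% utilization.
fleet_a = [10.0] * 100

# Fleet B: 90 servers idling at 1%, 10 servers running at 91%.
fleet_b = [1.0] * 90 + [91.0] * 10

mean_a = statistics.mean(fleet_a)    # 10.0
mean_b = statistics.mean(fleet_b)    # also 10.0
stdev_a = statistics.stdev(fleet_a)  # 0.0 -- perfectly uniform
stdev_b = statistics.stdev(fleet_b)  # ~27 -- wildly dispersed

# Reporting only "average utilization: 10%" makes these two very
# different fleets look identical.
print(mean_a, mean_b, stdev_a, round(stdev_b, 1))
```

An "average utilization" headline number would describe these two fleets identically, even though they would call for completely different remedies.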
That’s right: not possible. It would have been impossible (and people who know me know that I don’t use that word lightly) for McKinsey & Co. to do even a remotely accurate analysis of data center usage across the industry, or to produce any kind of meaningful “average”. Why? Not only because gathering and analyzing this data would have required many of the top minds in data center scaling (and they are not working at McKinsey); not only because Google, Facebook, Amazon, and Apple would not have given McKinsey this information; but also because the information, even if it had been given to McKinsey, would have come in wildly different scales and contexts, and that last point is an important one.
Even if you get past all of these seemingly insurmountable problems through an act of sheer magic, you end up with another problem altogether: server power is not just about “performing computations”. If you want to simplify a bit, there are at least four main axes you could consider for scaling: computation proper (e.g. adding 2+2), storage (e.g. saving “4” to disk, or reading it back from disk), networking (e.g. sending the “4” from one computer to the next), and memory usage (e.g. storing the “4” in RAM). This is an over-simplification because today you could, for example, split “storage” into “flash-based” and “magnetic” storage, since they are so different in their characteristics and power consumption, just as we separate RAM from persistent storage, but we’ll leave it at four. Anyway, these four parameters lead to different load profiles for different systems.
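The four-axes argument can be made concrete with a sketch. The workload names and utilization figures below are entirely made up for illustration; the point is that a single "CPU utilization" number hides which axis a machine is actually saturating.

```python
from dataclasses import dataclass

@dataclass
class LoadProfile:
    cpu: float      # fraction of compute capacity in use
    storage: float  # fraction of disk I/O capacity in use
    network: float  # fraction of network capacity in use
    memory: float   # fraction of RAM in use

# Hypothetical profiles for three very different kinds of servers.
profiles = {
    "batch_analytics": LoadProfile(cpu=0.90, storage=0.40, network=0.10, memory=0.70),
    "file_server":     LoadProfile(cpu=0.05, storage=0.85, network=0.80, memory=0.30),
    "cache_node":      LoadProfile(cpu=0.10, storage=0.02, network=0.60, memory=0.95),
}

for name, p in profiles.items():
    # A CPU-only metric would call the file server and the cache node
    # "mostly idle", even though each is near capacity on another axis.
    axes = [("cpu", p.cpu), ("storage", p.storage),
            ("network", p.network), ("memory", p.memory)]
    bottleneck, load = max(axes, key=lambda kv: kv[1])
    print(f"{name}: cpu={p.cpu:.0%}, bottleneck={bottleneck} at {load:.0%}")
```

Judged by CPU alone, two of these three machines look like candidates for the "wasteful" column, which is exactly the trap in extrapolating a single utilization average across unlike systems.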