Yahoo issued an announcement earlier this week in which they claimed to have indexed over 20 billion items. Over on Searchblog, Boing Boing "band manager" John Battelle posts:
[This] ruffled more than a few feathers across the web, and nowhere more distinctly than at Google. I spent an hour or so on the phone with a group of Google folks, and they shared a lot of information about how they measure index size, how they deal with issues of duplicate URLs and documents, and why they are baffled by Yahoo's claim. I am still reporting this story, so a longer post is forthcoming, but an update at the end of the day is worth penning.
First of all, I agreed to review some of the Google information on background, agreeing not to disclose it save with permission. (I agreed to this only if I could tell you all that I did in fact agree to it). I am still digesting what Google had to say, and the information they sent me, but it did leave a distinct set of questions percolating in my mind, questions that I plan to speak to Yahoo about (Yahoo has agreed to talk as well, we just haven't had time yet).
In any case, the lead really is this: I asked Google to go on the record with their concerns about Yahoo's index and whether they believed the news was in fact accurate, and Google agreed. The quote, which I can only attribute at this point to a "Google spokesperson," is as follows:
"Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn't support the 19.2 (billion page) claim and we're confused by that."
Details here. A response from Yahoo, and analysis on how their numbers were calculated, is said to be forthcoming in another post. But as JBat says, the larger point seems to be:
This calls for a benchmark/standard for measurement that might make all of this moot.
In related news, and also on JBat's blog, the widely discussed Google/Meetro buyout rumor is thought to be false: Link.