Last week, Yahoo announced that its search service had indexed over 20 billion items. On BB band manager John Battelle's Searchblog, a Google spokesperson said the company's scientists "were not seeing the increase claimed in the Yahoo! index." This sparked much debate, and raised the question of how an independent party might compare the reach of competing search providers.
Researchers at the National Center for Supercomputing Applications (NCSA) took up that question, and have now posted their study results: "A Comparison of the Size of the Yahoo! and Google Indices." Snip from conclusion:
Based on the data created from our sample searches, this study concludes that a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine. In fact, in the 10,012 test cases we ran, only in 3% of the cases (307) did Yahoo! return more results. In 96.6% of the cases (9,676) Google returned more results. In less than 1% of the cases (29) both search engines returned the same number of results.
It is the opinion of this study that Yahoo!'s claim to have a web index of over twice as many documents as Google's index is suspicious. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, we find it puzzling that Yahoo!'s search engine consistently returned less results than Google.
Link to study. I'm sure we haven't heard the last of this — and won't, until industrywide standards on how to measure these factors evolve. Hey, wait — maybe that odometer correction device Mark blogged about would come in handy here… (via /.)
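For anyone who wants to sanity-check the tallies in the snip above, here is a minimal Python sketch. It uses only the four counts quoted from the study's conclusion; the variable names and the `share` helper are made up for illustration and are not from the study. It does not try to reproduce the 166.9% average, which would need the per-query result counts.

```python
# Counts as quoted from the study's conclusion.
total_cases = 10_012
yahoo_more = 307    # cases where Yahoo! returned more results
google_more = 9_676 # cases where Google returned more results
ties = 29           # cases where both returned the same number

# The three outcomes should account for every test case.
assert yahoo_more + google_more + ties == total_cases


def share(count, total=total_cases):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(f"Yahoo! returned more: {share(yahoo_more)}%")
print(f"Google returned more: {share(google_more)}%")
print(f"Same number:          {share(ties)}%")
```

The three counts do add up to 10,012, and the shares come out to about 3.1%, 96.6%, and 0.3%, which lines up with the study's rounded "3%", "96.6%", and "less than 1%".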
Previously:
Battelle: more on Yahoo, Google, index, size
Battelle on Yahoo search claims, Google reply
Reader comment: Mike Winter says,
As you probably guessed, there are definitely some major questions about the NCSA study comparing Google and Yahoo! indices. I think everyone should re-read the methodology, focusing on the assumptions made. They make it clear (if you read between the lines) that the test may be as much about the matching and filtering algorithms used by the two competitors as the number of pages. One of the goals of good algorithms (IMO) is to reduce the number of hits to a bare minimum, not give you a hit on anything that might be close.
So which is it? More pages indexed or poorer search & filtering algorithms? Or both? There has to be a better way.