From John Battelle’s site comes the news that Google has decided to stop listing the number of documents it indexes on its front page. The company now claims its index is three times larger than that of its nearest competitor. Let’s look at the numbers.
A few weeks ago, Yahoo! claimed that its index held over 20 billion items, broken down as follows:
just over 19.2 billion web documents, 1.6 billion images, and over 50 million audio and video files
If we assume that Google believes its nearest competitor is Yahoo!, this would put the Google index at roughly 60 billion items, a fairly large number, which is probably on the high side. So we need to do more analysis in order to get closer to the truth.
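As a quick check of that arithmetic, here is a minimal sketch that adds up the Yahoo! figures quoted above and triples them. Treating each "over" or "just over" figure as its stated minimum is my own simplifying assumption, and the variable names are purely illustrative.

```python
# Yahoo!'s claimed index, in items. Each figure is the stated minimum
# from the quote above (assumption: "over X" is treated as exactly X).
yahoo_web_documents = 19_200_000_000
yahoo_images = 1_600_000_000
yahoo_audio_video = 50_000_000

yahoo_total = yahoo_web_documents + yahoo_images + yahoo_audio_video

# Google's claim: three times larger than its nearest competitor.
google_estimate = 3 * yahoo_total

print(f"Yahoo! total: {yahoo_total / 1e9:.2f} billion items")
print(f"Google at 3x: {google_estimate / 1e9:.2f} billion items")
```

Under those assumptions the total comes out a little above 60 billion, which is consistent with the rough figure used here.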
As part of Google’s seventh birthday celebration, Google staffers posted an entry on the official Google blog, claiming that their index is now 1,000 times the size of their original index. If that’s the case, figuring out what the original index size was should give us a good number. Fortunately, I have a copy of John Battelle’s excellent book about the company, The Search, which is a must-read for anyone interested in the search space. No other book has gotten as deeply into the history of internet search, and few have analyzed potential futures for Google more keenly. In the book, Battelle relays an email from Larry Page to Terry Winograd dated July 15, 1996. To give some context, one has to realize that Google started in March of 1996 so, in July of that year, Google was all of four months old. The email concerns some of the growth issues the search engine was having and reads (emphasis is mine):
I am almost out of disk space.
I have downloaded about… 24 million unique URLs
and about 100 million links… I think I will need 8 gigs more to store everything… Current retail prices are about $1000/4 gigs… I have only about 15% of the pages but it seems promising
If we take that number as a starting point, that would mean that the original index was around 24 million pages. From there, it is easy to multiply by the 1,000 factor they talk about in their blog and get a number of items in the Google index.
That number would be 24 billion, a little more than what Yahoo! has in its index.
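The multiplication described above is easy to verify. The sketch below assumes the 24 million URLs in the email stand in for the "original index" and that the blog's 1,000x figure is exact; both are my assumptions, as is the Yahoo! total, which simply adds up the minimum figures quoted earlier.

```python
# "about... 24 million unique URLs" from the July 1996 email
ORIGINAL_INDEX_URLS = 24_000_000
# "1,000 times the size of their original index" from the Google blog
GROWTH_FACTOR = 1_000
# Sum of the Yahoo! figures quoted earlier (assumption: stated minimums)
YAHOO_CLAIMED_ITEMS = 20_850_000_000

google_estimate = ORIGINAL_INDEX_URLS * GROWTH_FACTOR
ratio = google_estimate / YAHOO_CLAIMED_ITEMS

print(f"Google estimate: {google_estimate / 1e9:.0f} billion items")
print(f"Google / Yahoo!: {ratio:.2f}")
```

On these numbers Google would be ahead of Yahoo! by a modest margin, nowhere near a factor of three.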
In November 2004, MSN was estimated to have about 5 billion pages. Ken Moss, the General Manager of MSN Search, claimed that they have since added a lot to their index. While he’s not forthcoming with any detailed information in his post, we can still assume that the MSN Search index is now larger than 5 billion.
This is interesting in itself in that it may actually help us triangulate to the right size for the Google index. If we try different growth curves against the MSN search, we could look at the following:
If we take Google’s assessment that it is three times larger than its nearest competitor and assume that Google is considering MSN search to be its nearest competitor, those growth curves translate as follows:
When one looks at those results, a pattern emerges. Let’s first remember the rough claim of 24 billion based on the Google vs. Google analysis above. On the 50% MSN growth curve, Google is at 22.5 billion items indexed. On the 75% MSN growth curve, Google is at 26.5 billion items indexed. It could then be that Google considers MSN Search, and not Yahoo!, to be its nearest competitor, as the 24 billion mark falls right in between.
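The triangulation can be sketched in a few lines. This assumes a base of exactly 5 billion items for MSN (the November 2004 estimate) and uses only the two growth rates named in the text; the function and constant names are hypothetical. Note that with a flat 5 billion base, the 75% curve works out to 26.25 billion, slightly below the 26.5 billion quoted above.

```python
MSN_BASE_ITEMS = 5_000_000_000          # November 2004 estimate (assumed exact)
GOOGLE_MULTIPLE = 3                     # "three times larger than its nearest competitor"
GOOGLE_SELF_ESTIMATE = 24_000_000_000   # 24 million original URLs x 1,000


def google_size_from_msn(growth_rate: float) -> float:
    """Implied Google index size if MSN grew by growth_rate since November 2004."""
    msn_now = MSN_BASE_ITEMS * (1 + growth_rate)
    return GOOGLE_MULTIPLE * msn_now


# Only the 50% and 75% curves are named in the text.
estimates = {rate: google_size_from_msn(rate) for rate in (0.50, 0.75)}
for rate, size in estimates.items():
    print(f"MSN +{rate:.0%}: Google at about {size / 1e9:.2f} billion items")

# MSN growth that would make the 3x claim land exactly on 24 billion.
implied_growth = GOOGLE_SELF_ESTIMATE / GOOGLE_MULTIPLE / MSN_BASE_ITEMS - 1
print(f"Implied MSN growth: {implied_growth:.0%}")
```

Under these assumptions, the 24 billion figure corresponds to MSN having grown by roughly 60% since November 2004, which sits between the two curves.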
While the index size is largely a game of public relations, it appears that the Google index is sitting somewhere between 22.5 and 26.5 billion items indexed and, more likely than not, at the 24 billion mark. This gives it a slight edge over the Yahoo! index and suggests that the company considers Microsoft its nearest competitor. Of course, this is my own speculation, so your mileage may vary.
© Tristan Louis 1994-present Some rights reserved.