TNL.net

Google Accelerates Search

Google intro­duced a new tool called Web Accel­er­a­tor. While much will be made of the fears about the pri­vacy impli­ca­tions of that move, I per­son­ally believe that this move is one that is deeply rooted in the search mis­sion of the com­pany and will be seen as a gam­bit of the same size as the one taken by its founders when they first looked at Yahoo! in the mid-90s and fig­ured they could deliver a bet­ter search product.

How Search Engines Work

Before I get into details as to why I think this web accel­er­a­tor is a major search move by Google, I first need to edu­cate some of my read­ers as to the basics of search and some of the issues relat­ing to cre­at­ing a good search prod­uct. If you already know about search index­ing, you can skip to the next section.

Search engines basically act not only as giant card catalogs, similar to the ones you can find in a library, but also as giant libraries in and of themselves. When you type a word in a search box, what happens next includes a number of different steps that allow the engine to look through a giant index, which is basically an image of all the pages the search engine knows about.
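To make the card catalog comparison concrete, here is a minimal sketch of an inverted index, a common way to map words to the pages that contain them. It is an illustration only and not a description of how Google stores its index; the page URLs and the function names `build_index` and `search` are made up for the example.

```python
from collections import defaultdict

def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical three-page web used only for illustration.
PAGES = {
    "a.example/": "google launches web accelerator",
    "b.example/": "yahoo search news",
    "c.example/": "web search basics",
}
INDEX = build_index(PAGES)
```

Calling `search(INDEX, "web search")` looks up each word in the catalog and intersects the results, so the pages themselves never have to be scanned at query time.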

The way those indexes are created is through programs that are known as spiders (sometimes also referred to as web robots or crawlers). Those programs are independent pieces of software that basically surf the web at very high speed, making copies of everything they encounter and comparing what they find to what other spiders have found. That giant set of pages copied by spiders is called an index (it is also sometimes referred to as a collection). The spiders run around the clock, and their sole job is to get more pages and to ensure that the pages they’ve gotten in the past still exist and have not changed (if a page has changed, the spider will “re-index” it, i.e. delete the previous version from the index and put the new version in its place).
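The spider loop described above can be sketched in a few lines. This is a toy under stated assumptions: the "web" is a hypothetical in-memory dictionary rather than real HTTP, change detection uses a content hash (one possible approach, not necessarily what any real search engine does), and the names `FAKE_WEB`, `fingerprint` and `crawl` are invented for the example.

```python
import hashlib
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example/": ("home page", ["a.example/about", "b.example/"]),
    "a.example/about": ("about us", []),
    "b.example/": ("another site", ["a.example/"]),
}

def fingerprint(text):
    """Hash of the page content, used to detect changes between visits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def crawl(seed, web, index):
    """Walk the links breadth-first from seed.

    Returns the URLs that were added to the index or re-indexed.
    """
    touched = []
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        if url not in web:
            # The page no longer exists: drop it from the index.
            index.pop(url, None)
            continue
        text, links = web[url]
        digest = fingerprint(text)
        if url not in index or index[url]["digest"] != digest:
            # New or changed page: replace the old copy with the new one.
            index[url] = {"digest": digest, "text": text}
            touched.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return touched
```

A first call to `crawl("a.example/", FAKE_WEB, index)` with an empty `index` copies every reachable page; a second call on an unchanged web touches nothing, yet it still has to visit every page to find that out.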

Size Mat­ters

The idea is a surprisingly simple one and was first introduced in the early days of the web. At the time, creating an index of all the pages on the web was relatively easy, largely because there were not that many pages and not that many people were creating them. (I actually enjoy surprising newbies by telling them that I once saw the whole web, every single page on it. What I omit until later in the story is that I did this in 1993, at a time when you could count the number of web servers without hitting 100 and when you could actually see the whole web in only a few hours.)

The amaz­ing thing is that, although the num­ber of web sites (and hence the num­ber of web pages) has exploded, the basic tech­nol­ogy to build a search index has not evolved that much. The con­cepts are basi­cally the same today as they were in 1994–1995 but the web is now much, much larger.

How large, you won­der? Well, a good indi­ca­tor would be to take a look at the bot­tom of the Google home page for a num­ber. As of this writ­ing, that num­ber stands at 8,058,044,651. That’s over 8 bil­lion pages, a very large num­ber and one that folks at Google are appro­pri­ately proud of.

There’s only one little issue with that number. It’s on the low side. In fact, it’s estimated that it represents on the order of one percent of the actual number of pages on the web. In 2001, that number was estimated at over 500 billion pages in what is called the Deep Web, a part of the web that has not been indexed by search engines yet. With the growth of weblogs, which generate tons of content on a daily basis, and the connection of more systems like books, satellite maps, etc. to the web, you can only imagine that the number has grown.

Let’s pause for a moment and assume that only as many pages were created between 2001 and now as were created in the previous four years, at the height of the dotcom boom. This means that there would be over a trillion web pages on the Internet. Now that gets to be a much more interesting number.
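As a back-of-the-envelope check on those figures, the short calculation below compares the index size quoted from the Google home page with the 2001 estimate and with the doubled, trillion-page assumption made above. The constant names and the `coverage` helper are made up for the example.

```python
GOOGLE_INDEX = 8_058_044_651        # figure quoted from the Google home page
DEEP_WEB_2001 = 500_000_000_000     # 2001 estimate cited in the text
ASSUMED_TOTAL = 2 * DEEP_WEB_2001   # assumption: the web has doubled since 2001

def coverage(indexed, total):
    """Share of the web that is indexed, as a percentage."""
    return 100.0 * indexed / total

share_2001 = coverage(GOOGLE_INDEX, DEEP_WEB_2001)   # about 1.6 percent
share_now = coverage(GOOGLE_INDEX, ASSUMED_TOTAL)    # about 0.8 percent
```

Measured against the 2001 estimate, the index covers a little over one and a half percent of the web; measured against the trillion-page assumption, it drops below one percent.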

You Call THIS Fresh?!?

So we know that Google has a prob­lem in find­ing a lot of the pages that already exist on the Inter­net. But that’s noth­ing com­pared to the other prob­lem Google has.

Imagine an index with 1 million pages. If you assume that a spider can index those one million pages in a day, the content on those pages is refreshed daily, meaning that the index has a new version of the pages only once a day. Now try to do the same with 8 billion pages and it becomes a pretty complicated problem. Google has solved some of that problem by basically deciding that some sites have a higher worth than others. As a result, sites which are known to refresh their content on a regular basis get more attention from Google than sites that do not.
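The thought experiment above can be turned into numbers. The sketch below assumes the same hypothetical rate of one million pages per spider per day, and it models "more attention" as a shorter revisit interval per site. That scheduling rule is an assumption made for illustration, not Google's actual policy, and the names `full_pass_days` and `schedule` are invented.

```python
import heapq

def full_pass_days(pages, pages_per_day):
    """Days a single spider needs to visit every page once."""
    return pages / pages_per_day

def schedule(intervals, horizon_days):
    """Plan visits for each site over a number of days.

    intervals maps a site name to its revisit interval in days.
    Returns the visits as (day, site) pairs in chronological order.
    """
    heap = [(0, site) for site in intervals]
    heapq.heapify(heap)
    visits = []
    while heap:
        day, site = heapq.heappop(heap)
        if day >= horizon_days:
            continue  # past the planning horizon: do not reschedule
        visits.append((day, site))
        heapq.heappush(heap, (day + intervals[site], site))
    return visits

# One spider at the assumed rate needs roughly 8,058 days (about 22 years).
days_needed = full_pass_days(8_058_044_651, 1_000_000)

# Hypothetical sites: a news site revisited daily, a brochure site every 3 days.
plan = schedule({"news": 1, "brochure": 3}, horizon_days=4)
```

Over the four-day horizon the news site is visited four times and the brochure site twice, which is the kind of uneven attention the paragraph describes.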

With the explo­sion of weblogs, how­ever, a new breed of sites has cre­ated a prob­lem for Google. For starters, there are a lot of them, and most of them refresh their con­tent reg­u­larly, in some cases more than once a day. This makes the job of pro­duc­ing rel­e­vant indexes almost impos­si­ble for Google, turn­ing their search engine into some­thing more akin to a library, the kind of place that you use when you are look­ing for a ref­er­ence, than an up to date source.

Not only that but, if Google is to also index the deep web, keep­ing track of all the changes across all the web becomes impos­si­ble… Impos­si­ble, that is, if you are using crawlers.

So we now know that the crawlers are no longer the right option when it comes to keep­ing fresh infor­ma­tion within a proper search engine index. Look­ing at this, Google needs to do some­thing rad­i­cal. On the one hand, they can try to build a sys­tem that will get the most up to date infor­ma­tion through noti­fi­ca­tion from the sites that are updat­ing con­tent. This is where ser­vices like Tech­no­rati and Feed­ster come in, get­ting updates from RSS feeds and thus build­ing indexes with more recent infor­ma­tion than Google’s.
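The notification approach can be sketched as follows. This is not the actual Technorati or Feedster interface; `PingDrivenIndex`, its `ping` method and the example URLs are hypothetical, and the only point is that an update costs one fetch instead of a full crawl.

```python
class PingDrivenIndex:
    """Index that only refetches pages whose site announced a change."""

    def __init__(self, fetch):
        self.fetch = fetch      # hypothetical callable: url -> current page text
        self.pages = {}         # url -> last indexed text
        self.fetch_count = 0    # how many pages had to be downloaded

    def ping(self, url):
        """Called by a site, for instance after publishing a post."""
        self.pages[url] = self.fetch(url)
        self.fetch_count += 1


# Hypothetical blog with ten posts, only one of which is later edited.
site = {f"blog.example/post{i}": f"post {i}" for i in range(10)}
index = PingDrivenIndex(fetch=site.__getitem__)
for url in site:                      # initial announcements, one per post
    index.ping(url)
site["blog.example/post3"] = "post 3, edited"
index.ping("blog.example/post3")      # the edit costs one fetch, not ten
```

A crawler would have to revisit all ten posts to notice the single edit; here the site does the noticing and the index only pays for what changed.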

On the other hand, they could look at increas­ing the num­ber of crawlers they are using. We know that Google has a lot of machines but try­ing to scale to the point where they can mon­i­tor a tril­lion pages via crawl would require a lot more power than that.

Enter Web Accelerator!

Spread­ing the Load

In the late 90s, distributed computing took hold as a concept. Projects like SETI@home and Folding@Home have shown the way in terms of harnessing the power of millions of computers to solve processor-intensive kinds of problems. Google started looking at this with the rollout of a feature in their toolbar called Google Compute.

Now let’s move forward. What if you could get information as to what pages are new and what pages have changed by just observing where people are surfing? This is the space that the accelerator occupies. Sitting neatly between your web browser and the Google architecture is a mini proxy that keeps checking if it can find a way to give you pages at a faster rate from the Google index than it does from the actual existing site. Along the way, Google finds out what pages are missing from its index (and gets a chance to add them) and what pages in its index are not up to date.
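Here is a minimal sketch of that idea, assuming a proxy that sees every page a user requests and compares it with what the index holds. It illustrates the hypothesis in this post, not the actual design of Google Web Accelerator; the class `ObservingProxy` and its attributes are invented, and for simplicity the sketch always fetches from the origin site rather than serving a faster cached copy.

```python
class ObservingProxy:
    """Proxy that learns about missing and stale pages from user traffic."""

    def __init__(self, index, origin):
        self.index = index      # url -> text the search engine currently holds
        self.origin = origin    # hypothetical callable: url -> live page text
        self.missing = []       # URLs the index had never seen
        self.stale = []         # URLs whose indexed copy was out of date

    def request(self, url):
        """Serve a page to the user and update the index as a side effect."""
        live = self.origin(url)
        indexed = self.index.get(url)
        if indexed is None:
            self.missing.append(url)
        elif indexed != live:
            self.stale.append(url)
        self.index[url] = live  # the index is now fresh for this page
        return live
```

Every page a user visits through the proxy refreshes one index entry, so the work of finding new and changed pages is spread across the people who are browsing anyway.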

Imag­ine a mil­lion peo­ple down­load­ing the Google Web Accel­er­a­tor and all of a sud­den, you have an infra­struc­ture that finds out about a lot of pages very quickly.

Microsoft and Yahoo! are already in competition with Google in the search space. In order to maintain its leadership, Google needs to provide an index that is not only larger than its competitors’ but also more up to date. With this accelerator, it can do that, and only one of its competitors can ever hope to match the feature: Microsoft.

The webmaster FAQ points out that the accelerator covers neither secure pages (nicely bypassing security issues) nor large media files. I suspect that we will see that change in the future, with the addition of images coming first.

1 Comment

  1. 1 Rubber Bucket — June 5, 2007 at 1:42 am

    , with a sug­ges­tion that it might now be much higher. Still, this is still in the same ball­park. As for the total num­ber of pages (search engines delib­er­ately try to omit junk), here’s one sug­ges­tion from 2 years ago: 1 tril­lion. That stacks up to 100,000km. While not ter­ri­bly sci­en­tific, I’m going to end with that fig­ure. At least it’s a quar­ter of the way to the moon.

  2. 2 The Ghost of PT Barnum — September 24, 2006 at 1:12 am

    This is a great program for those users who don’t know anything about how computers and/or the internet work, and/or just want to experience the “feel good illusion” of (not really) increased speed by having an entirely useless acceleration program on their hard drive.

    Still, you’ve got to admire Google’s chutzpah here; I’m guessing that it has to be the most hilarious bit of shell-game spyware ever invented by any company in the entire history of computer or Internet use and development. Very clever really, when you consider that the trade-off is that users “think” they’re getting “increased” internet speed in exchange for revealing the exact name of every single webpage that they ever visit, from the moment they install Google Web Accelerator until (hopefully) the moment they wise up and remove it.

    After Google Web Accelerator is installed, it does absolutely nothing to improve browsing. Google Web Accelerator also collects copies of all web pages (including prefetched pages that you did not even visit) in the Google Web Accelerator cache on your computer. All it does is collect and store a gazillion MB of temp files every time you use it for a session of surfing; and Google gets to know the exact name of every single webpage that you ever visit for products, news, banking, whatever! This is very valuable information to have; not only does Google know everything you click on, but you get zero in exchange for this info.

    Finally, Google admits on their own sup­port page that any and all pass­words, e-mail addresses etc. you enter in a web form (e. g. when pur­chas­ing an item online) will be fun­neled via their sys­tems. If you enter per­son­ally iden­ti­fi­able infor­ma­tion (such as an email address) onto a form on an unen­crypted web page, the sites will send this infor­ma­tion through Google.

    Had he lived long enough to see this, P.T. Barnum, the person who coined the phrase “A Sucker is Born Every Minute”, would most certainly consider those who download, install and leave this program on their computers to be suckers indeed!
