TNL.net

Technorati Yahoo and Google Too

20th
3

In the last entry on the subject, we took a look at how Technorati and Google compared. We discovered that Technorati was getting roughly a fourth of the links Google could locate, which brought up some interesting questions. Could we rely on the Google numbers? Were they so much larger than those of any other search engine that we were building an unfair comparison? And, as some alert readers pointed out in email, was Google under-reporting the number of links to a site? To answer some of those questions, I decided to build more comparisons by taking a look at some of Google's competitors. Today, I'll go into how Yahoo! fared (hint: I was surprised by the results).

Gath­er­ing the data

As I had done for the previous effort, I gathered data against the same list of sites around the same date. This provided some consistency in the data set and allowed for a better comparison. Compare one or two sites and you may get some false positives. Compare 100 sites and things start getting a little more interesting. The Yahoo! data ended up looking like this (for people who are new to the series, I am doing the same graphs for a number of search engines):

Tech­no­rati Top 100 Yahoo Links Tech­no­rati Links Technorati/Yahoo Links
Boing Boing 1880000 22532 1.19851%  
InstaPun­dit 2160000 15190 0.70324%  
Daily Kos 1690000 15833 0.93686%  
Giz­modo 1970000 12278 0.62325%  
Fark 1420000 10216 0.71944%  
EnGad­get 2820000 15051 0.53372%  
Dav­e­net­ics 66400 7571 11.40211%  
Escha­ton 1400000 8713 0.62236%  
Dooce 653000 6797 1.04089%  
Andrew Sul­li­van 1260000 7680 0.60952%  
The Best Page In The Universe 62000 6333 10.21452%  
Talk­ing Points Memo: by Joshua Micah Marshall 563000 7592 1.34849%  
lgf: anti-idiotarian 49300 8275 16.78499%  
kottke.org 1200000 7278 0.60650%  
WIL WHEATON DOT NET 564000 6314 1.11950%  
Metafil­ter 1160000 7591 0.65440%  
Doc Searls 1150000 5690 0.49478%  
(In)formacao e (In)utilidade 110000 6040 5.49091%  
Won­kette 1370000 5877 0.42898%  
Script­ing News 1470000 5728 0.38966%  
Power Line 344000 7477 2.17355%  
Bal­masque 40500 4544 11.21975%  
Corante 265000 7686 2.90038%  
A list Apart 620000 5536 0.89290%  
Some­thing Awful 372000 4512 1.21290%  
Mega­tokyo 361000 4154 1.15069%  
Michelle Malkin 537000 6091 1.13426%  
Arts and Let­ters Daily 866000 3983 0.45993%  
Gawker 1060000 4453 0.42009%  
After­all it was the best I ever had 34900 3591 10.28940%  
The Volokh Conspiracy 1190000 5873 0.49353%  
Sco­belizer 937000 5524 0.58954%  
Jef­frey Zeldman 528000 4134 0.78295%  
This Mod­ern World 813000 3913 0.48130%  
The Web Stan­dards Project 59800 3810 6.37124%  
Joel on Software 966000 4514 0.46729%  
Media Mat­ters for America 536000 6809 1.27034%  
Tele­vi­sion with­out pity 356000 3859 1.08399%  
Kuro5hin 866000 4208 0.48591%  
Lileks 39700 3824 9.63224%  
Hugh Hewitt 929000 4573 0.49225%  
Joel Veitch 135000 3774 2.79556%  
Truthout 371000 6528 1.75957%  
Bagh­dad Burning 552000 3519 0.63750%  
Buzz machine 1010000 4145 0.41040%  
fleugel 201000 3670 1.82587%  
Informed Com­ment 787000 3905 0.49619%  
Doppler: redefin­ing podcasting 607000 3040 0.50082%  
geek and proud 9110 3166 34.75302%  
load­mem­ory (Asian site) 1550 3324 214.45161%  
Pho­to­junkie 51200 2860 5.58594%  
Ross Rader 48200 2976 6.17427%  
The Truth Laid Bear 717000 4127 0.57559%  
Joi Ito 1050000 5165 0.49190%  
Scrap­ple­Face 807000 3480 0.43123%  
Lex­Text 31200 2671 8.56090%  
Google Blog 297000 3688 1.24175%  
Xbox 237000 4221 1.78101%  
My life in a Bush of Ghosts 903 2519 278.95903%  
Astron­omy pic­ture of the day 113000 3498 3.09558%  
Crooked Tim­ber 67500 3617 5.35852%  
Vodka Pun­dit 169000 3085 1.82544%  
Captain’s quar­ter 730000 3671 0.50288%  
A small victory 460000 3223 0.70065%  
Gato Fedorento 126000 2574 2.04286%  
Mez­zoblue 278000 2952 1.06187%  
Post­Se­cret 202000 2707 1.34010%  
Samizdata.net 18000 2872 15.95556%  
Lawrence Lessig 959000 2949 0.30751%  
Coun­ter­punch 295000 3278 1.11119%  
Democ­rac­tic Underground 417000 3913 0.93837%  
Right Wing News 794000 2967 0.37368%  
StopDe­sign 255000 3037 1.19098%  
iBib­lio 197000 3105 1.57614%  
Samizdata.net (mis­take?) 697000 2743 0.39354%  
Abrupto 44700 2935 6.56600%  
gene7299 (Asian MSNSpaces site) 764 3215 420.81152%  
Where is Raed? 232000 2409 1.03836%  
B3TA: We love the web 839000 2614 0.31156%  
Talk­left 221000 2901 1.31267%  
Wiz­bang 634000 3358 0.52965%  
m1net (MSN spaces site) 579 3548 612.78066%  
Hoder 20900 5422 25.94258%  
CTRL+Alt+Del 171000 2315 1.35380%  
Brad DeLong 882000 2715 0.30782%  
Blogs for Bush 824000 3560 0.43204%  
Neil Gaiman 319000 2194 0.68777%  
Gothamist 491000 2729 0.55580%  
Thought Mechan­ics 190000 2197 1.15632%  
IMAO 407000 2905 0.71376%  
Dan Gill­mor (old weblog) 298000 2600 0.87248%  
HINAGATA 21100 2186 10.36019%  
Dean’s World 784000 2985 0.38074%  
Defamer 725000 2372 0.32717%  
USS Clue­less 264000 2570 0.97348%  
Dive into Mark 235000 2540 1.08085%  
Pandagon 743000 2822 0.37981%  
Blogging.la 67700 3061 4.52142%  
Why are you wor­ship­ping the ground I blog on? 85000 2238 2.63294%  
Dar­ing Fireball 221000 2573 1.16425%  

The first thing of interest when putting together that set of numbers was how much larger the number of links found in the Yahoo! index was, compared to the number of links found in either Technorati or Google. The second was how consistently Asian sites fared poorly in the Yahoo! index compared to the Technorati one. It seems that Technorati is getting a better handle on the Asian blogosphere than Yahoo! is, a surprising result considering how much time and effort the latter has put into its Asian operations.
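For readers who want to check the last column, each percentage is simply the Technorati link count divided by the Yahoo! link count. A minimal Python sketch, assuming that definition and using three rows copied from the table above (the function name `ratio_percent` is mine, not part of the original analysis):

```python
# Rows copied from the table above: (site, Yahoo! links, Technorati links).
rows = [
    ("Boing Boing", 1880000, 22532),
    ("InstaPundit", 2160000, 15190),
    ("loadmemory", 1550, 3324),
]


def ratio_percent(technorati_links, yahoo_links):
    """Technorati links expressed as a percentage of Yahoo! links."""
    return 100.0 * technorati_links / yahoo_links


for site, yahoo_links, technorati_links in rows:
    print(f"{site}: {ratio_percent(technorati_links, yahoo_links):.5f}%")
```

A value above 100%, as with loadmemory, means Technorati reported more links than Yahoo! did for that site.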

In order to get some real visual com­par­i­son, I decided to draw a sim­i­lar dia­gram of the link per­cent­ages dis­trib­uted across all 100 sites. It looked like this:

[Figure: link distribution]

The interesting story, looking at this, is that there appeared to be much greater variance from site to site in the Google index than in the Yahoo! one. In the Yahoo! index, the vast majority of sites fall in the below one percent range, and what became even more interesting was that the variance was really not that high: the median and the average turned out to differ by less than 0.1 percentage points:

Technorati Top 100   Yahoo Links   Technorati Links   Technorati/Yahoo Links
Total                56150006      479580             0.85410%
Median               389500        3679.5             0.94467%
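Note that the percentages in this summary are computed from the summary rows themselves: the Total line divides one total by the other, and the Median line divides one median by the other. That is not the same as taking the median of the 100 per-site percentages. A hedged sketch of the three calculations, using a five-row sample copied from the first table (the `summarize` helper is hypothetical and only illustrates the distinction):

```python
from statistics import median

# Sample rows copied from the first table: (Yahoo! links, Technorati links).
sample = [
    (1880000, 22532),  # Boing Boing
    (2160000, 15190),  # InstaPundit
    (1690000, 15833),  # Daily Kos
    (1550, 3324),      # loadmemory
    (764, 3215),       # gene7299
]


def summarize(pairs):
    """Return three different 'typical ratio' figures, in percent."""
    yahoo = [y for y, _ in pairs]
    technorati = [t for _, t in pairs]
    return {
        "ratio_of_totals": 100.0 * sum(technorati) / sum(yahoo),
        "ratio_of_medians": 100.0 * median(technorati) / median(yahoo),
        "median_of_ratios": 100.0 * median(t / y for y, t in pairs),
    }


for label, value in summarize(sample).items():
    print(f"{label}: {value:.5f}%")

# The two summary rows above, recomputed from their own columns.
print(f"Total row:  {100.0 * 479580 / 56150006:.5f}%")
print(f"Median row: {100.0 * 3679.5 / 389500:.5f}%")
```

On a skewed sample like this one the three figures can differ widely, which is worth keeping in mind when reading the comparison of the median and the average.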

While the numbers were vastly different in terms of size (it appeared Yahoo! had a lot more links), I figured the patterns would be roughly the same in terms of coverage: I expected the top sites to get better coverage in a large search engine like Yahoo! than smaller sites. Imagine my surprise, then, when I started to do some group analysis:

Technorati Top 100   Yahoo Links    Technorati Links   Technorati/Yahoo Links
AVERAGE TOP 10       1531940        12186.1            0.79547%
AVERAGE TOP 25       986368         8733.36            0.88541%
AVERAGE TOP 50       768245.2       6534.36            0.85056%
AVERAGE BOTTOM 50    354754.92      3057.24            0.86179%
AVERAGE BOTTOM 25    362220.8846    2834.884615        0.78264%
AVERAGE BOTTOM 10    350072.7273    2622.909091        0.74925%
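Each row of the group table is an average over a slice of the ranked list, and its percentage is the ratio of the two averages. A minimal sketch, assuming the rows are kept in Technorati rank order and using the first ten rows of the first table (the helper names are illustrative, not taken from the original spreadsheet):

```python
# First ten rows of the first table, in Technorati rank order:
# (Yahoo! links, Technorati links).
top_ten = [
    (1880000, 22532), (2160000, 15190), (1690000, 15833), (1970000, 12278),
    (1420000, 10216), (2820000, 15051), (66400, 7571), (1400000, 8713),
    (653000, 6797), (1260000, 7680),
]


def group_summary(pairs):
    """Average both columns and express one average as a percent of the other."""
    yahoo_avg = sum(y for y, _ in pairs) / len(pairs)
    technorati_avg = sum(t for _, t in pairs) / len(pairs)
    return yahoo_avg, technorati_avg, 100.0 * technorati_avg / yahoo_avg


def top(pairs, n):
    """First n rows of a rank-ordered list."""
    return pairs[:n]


def bottom(pairs, n):
    """Last n rows of a rank-ordered list."""
    return pairs[-n:]


yahoo_avg, technorati_avg, percent = group_summary(top(top_ten, 10))
print(f"{yahoo_avg:.1f} {technorati_avg:.1f} {percent:.5f}%")
```

With the full 100 rows loaded, `bottom(rows, 25)` and `bottom(rows, 10)` would give the strict slices. The published bottom-25 and bottom-10 averages have decimal expansions that look like divisions by 26 and 11 rather than 25 and 10, so a strict slice may not reproduce those two rows exactly.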

Those num­bers seemed to be all over the map, a fact that became much clearer once I graphed it:

[Figure: group averages, Technorati vs. Yahoo!]

None of the nice downward curve I had with the Google set. Here was a much more disparate set, providing little to support a theory of bias from a search engine. In fact, it did more to suggest that such a theory was wrong.

Was my data set wrong? I rechecked it and it was not. So what was happening here? As dreams of long tail and power law distributions fell away, I started to wonder how Yahoo! and Google compared. So, of course, I decided to run the numbers again…

Yahoo! vs. Google

This time I decided to compare Google and Yahoo! directly. First, I figured I would get some reference data on the subject. I was surprised not to find any actual side-by-side comparison on a large set of sites. Anecdotal evidence existed, but nothing compared to the data set I had amassed, so I figured I would trust my own (note: if you have a better one, please leave a comment saying where it is located). The set ended up looking like this:

Name Posi­tion 5/19/05 Google Yahoo Google/Yahoo Links
Boing Boing 1 45200 1880000 2.40%
InstaPun­dit 2 75000 2160000 3.47%
Daily Kos 3 59800 1690000 3.54%
Giz­modo 4 39300 1970000 1.99%
Fark 5 43600 1420000 3.07%
EnGad­get 6 46800 2820000 1.66%
Dav­e­net­ics 7 1780 66400 2.68%
Escha­ton 8 62400 1400000 4.46%
Dooce 9 23600 653000 3.61%
Andrew Sul­li­van 10 41100 1260000 3.26%
The Best Page In The Universe 11 656 62000 1.06%
Talk­ing Points Memo: by Joshua Micah Marshall 12 74600 563000 13.25%
lgf: anti-idiotarian 13 14700 49300 29.82%
kottke.org 14 32000 1200000 2.67%
WIL WHEATON DOT NET 15 16900 564000 3.00%
Metafil­ter 16 34500 1160000 2.97%
Doc Searls 17 33600 1150000 2.92%
(In)formacao e (In)utilidade 18 1780 110000 1.62%
Won­kette 19 28800 1370000 2.10%
Script­ing News 20 39400 1470000 2.68%
Power Line 21 7510 344000 2.18%
Bal­masque 22 24 40500 0.06%
Corante 23 6770 265000 2.55%
A list Apart 24 21100 620000 3.40%
Some­thing Awful 25 9020 372000 2.42%
Mega­tokyo 26 7310 361000 2.02%
Michelle Malkin 27 17300 537000 3.22%
Arts and Let­ters Daily 28 23900 866000 2.76%
Gawker 29 23500 1060000 2.22%
After­all it was the best I ever had 30 95 34900 0.27%
The Volokh Conspiracy 31 42000 1190000 3.53%
Sco­belizer 32 21800 937000 2.33%
Jef­frey Zeldman 33 22500 528000 4.26%
This Mod­ern World 34 32100 813000 3.95%
The Web Stan­dards Project 35 1850 59800 3.09%
Joel on Software 36 22400 966000 2.32%
Media Mat­ters for America 37 24800 536000 4.63%
Tele­vi­sion with­out pity 38 13300 356000 3.74%
Kuro5hin 39 17300 866000 2.00%
Lileks 40   39700 0.00%
Hugh Hewitt 41 26700 929000 2.87%
Joel Veitch 42 2830 135000 2.10%
Truthout 43 8780 371000 2.37%
Bagh­dad Burning 44 22700 552000 4.11%
Buzz machine 45 30600 1010000 3.03%
fleugel 46 1890 201000 0.94%
Informed Com­ment 47 27900 787000 3.55%
Doppler: redefin­ing podcasting 48 4420 607000 0.73%
geek and proud 49 355 9110 3.90%
load­mem­ory (Asian site) 50 83 1550 5.35%
Pho­to­junkie 51 1540 51200 3.01%
Ross Rader 52 1070 48200 2.22%
The Truth Laid Bear 53 23900 717000 3.33%
Joi Ito 54 23400 1050000 2.23%
Scrap­ple­Face 55 31100 807000 3.85%
Lex­Text 56 1970 31200 6.31%
Google Blog 57 46 297000 0.02%
Xbox 58 6600 237000 2.78%
My life in a Bush of Ghosts 59 6 903 0.66%
Astron­omy pic­ture of the day 60 5020 113000 4.44%
Crooked Tim­ber 61 3560 67500 5.27%
Vodka Pun­dit 62 4520 169000 2.67%
Captain’s quar­ter 63 27100 730000 3.71%
A small victory 64 16700 460000 3.63%
Gato Fedorento 65 1630 126000 1.29%
Mez­zoblue 66 12000 278000 4.32%
Post­Se­cret 67 5790 202000 2.87%
Samizdata.net 68 1050 18000 5.83%
Lawrence Lessig 69 30600 959000 3.19%
Coun­ter­punch 70 11700 295000 3.97%
Democ­rac­tic Underground 71 14900 417000 3.57%
Right Wing News 72 27900 794000 3.51%
StopDe­sign 73 10200 255000 4.00%
iBib­lio 74 9730 197000 4.94%
Samizdata.net (mis­take?) 75 25500 697000 3.66%
Abrupto 76 550 44700 1.23%
gene7299 (Asian MSNSpaces site) 77 58 764 7.59%
Where is Raed? 78 10100 232000 4.35%
B3TA: We love the web 79 12000 839000 1.43%
Talk­left 80 7170 221000 3.24%
Wiz­bang 81 21000 634000 3.31%
m1net (MSN spaces site) 82 104 579 17.96%
Hoder 83 1480 20900 7.08%
CTRL+Alt+Del 84 2310 171000 1.35%
Brad DeLong 85 30100 882000 3.41%
Blogs for Bush 86 16200 824000 1.97%
Neil Gaiman 87 13700 319000 4.29%
Gothamist 88 15200 491000 3.10%
Thought Mechan­ics 89 4400 190000 2.32%
IMAO 90 23800 407000 5.85%
Dan Gill­mor (old weblog) 91 10800 298000 3.62%
HINAGATA 92 10100 21100 47.87%
Dean’s World 93 30600 784000 3.90%
Defamer 94 9310 725000 1.28%
USS Clue­less 95 8470 264000 3.21%
Dive into Mark 96 14600 235000 6.21%
Pandagon 97 27300 743000 3.67%
Blogging.la 98 3200 67700 4.73%
Why are you wor­ship­ping the ground I blog on? 99 1430 85000 1.68%
Dar­ing Fireball 100 12000 221000 5.43%
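The Lileks row has no Google figure, yet its ratio is shown as 0.00%, which reads as if Google had found no links at all. A small sketch of one way to keep "missing" distinct from "zero" when computing this column, under the assumption that the blank means the count was not collected (the function names are hypothetical):

```python
from typing import Optional

# Rows copied from the table above: (site, Google links, Yahoo! links).
# None marks the Google figure that is blank in the table.
rows = [
    ("Boing Boing", 45200, 1880000),
    ("Lileks", None, 39700),
    ("HINAGATA", 10100, 21100),
]


def google_yahoo_percent(google: Optional[int], yahoo: int) -> Optional[float]:
    """Google links as a percentage of Yahoo! links, or None if unknown."""
    if google is None or yahoo == 0:
        return None
    return 100.0 * google / yahoo


def format_percent(value: Optional[float]) -> str:
    """Render a percentage, showing missing values as 'n/a' instead of 0."""
    return "n/a" if value is None else f"{value:.2f}%"


for site, google, yahoo in rows:
    print(f"{site}: {format_percent(google_yahoo_percent(google, yahoo))}")
```

Leaving the missing row out of averages, rather than counting it as zero, also avoids dragging the group figures down.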

Nothing particularly impressive there. It seemed that Google, on average, ended up with only about 3% of the links Yahoo! had in its index. However, the story got more interesting when comparing the average and the median, as the two diverged by almost half a percentage point:

Technorati Top 100   Google     Yahoo      Google/Yahoo Links
Total                1739867    56150006   3.10%
Median               13700      389500     3.52%

But wait, for the weird­ness is only get­ting started. Next up was look­ing at the dis­tri­b­u­tions (as I’ve done for Tech­no­rati vs. each of the engines):

Technorati Top 100   Google         Yahoo          Google/Yahoo Links
AVERAGE TOP 10       43858          1531940        2.86%
AVERAGE TOP 25       30397.6        986368         3.08%
AVERAGE TOP 50       23599.04082    768245.2       3.07%
AVERAGE BOTTOM 50    11443.07843    354754.92      3.23%
AVERAGE BOTTOM 25    11980.07692    362220.8846    3.31%
AVERAGE BOTTOM 10    13782.72727    350072.7273    3.94%

I looked at the numbers and they did not seem right, so I ran them again and ended up with the same results. Ran them a third time and still couldn't make sense of it. So I graphed it:

[Figure: Google vs. Yahoo round 2]

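… and to my surprise, it appeared that the further down the list one went, the greater the differential. The Google/Yahoo! ratio for the bottom of the top 100 is about one full percentage point higher than for the top (3.94% for the bottom 10 against 2.86% for the top 10), which means Google's link counts sit relatively closer to Yahoo!'s for the lower-ranked sites.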
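The gap between the top-10 and bottom-10 rows can be read two ways: as a difference in percentage points or as a relative change. A small sketch using the Google/Yahoo percentages copied from the group table above (the variable names are illustrative):

```python
# Google/Yahoo percentages copied from the group table above.
group_percent = {
    "top 10": 2.86,
    "top 25": 3.08,
    "top 50": 3.07,
    "bottom 50": 3.23,
    "bottom 25": 3.31,
    "bottom 10": 3.94,
}

# Absolute gap, in percentage points, between the two ends of the list.
point_gap = group_percent["bottom 10"] - group_percent["top 10"]

# The same gap expressed relative to the top-10 figure.
relative_change = 100.0 * point_gap / group_percent["top 10"]

print(f"gap: {point_gap:.2f} percentage points")
print(f"relative change: {relative_change:.1f}%")
```

The two readings describe the same data, but the second makes the shift from top to bottom look considerably larger, so it helps to state which one is meant.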

Con­clu­sions

From here, we can draw a few conclusions:

Up next, we'll take a look at how MSN fits into all of this. So stay tuned!

3 Comments

  1. Musical Perceptions - May 22, 2007 at 5:00 pm

    he would be down at #90 (oops, I just men­tioned one). So I trust Tech­no­rati more for this use. Here is an old study com­par­ing Google with Tech­no­rati, though some of his inter­pre­ta­tions of the sta­tis­tics are incor­rect, par­tic­u­larly when com­par­ing Google and Yahoo. Tech­no­rati doesn’t record nearly as many links, the ques­tion is which links are impor­tant, and how old the links are. Update: Chris Foley has gen­er­ated his own list, using sub­scrip­tion rates on Bloglines

  2. Of Interest - May 10, 2007 at 2:49 am

    blog­ger com­pen­sa­tions while at the same time ana­lyz­ing the out­put of A-list blog­gers and how they were linked. The links got me to then con­sider how an engine like Tech­no­rati fared against the big three: Google, Yahoo, and MSN. The results revealed some inter­est­ing data and were (and are still) dis­cussed for months. Because of all this activ­ity, I also started notic­ing the value of archival con­tent, around the same time as Chris Ander­son started think­ing about the

  3. New Music reBlog - May 22, 2007 at 2:03 am

    he would be down at #90 (oops, I just men­tioned one). So I trust Tech­no­rati more for this use. Here is an old study com­par­ing Google with Tech­no­rati, though some of his inter­pre­ta­tions of the sta­tis­tics are incor­rect, par­tic­u­larly when com­par­ing Google and Yahoo. Tech­no­rati doesn’t record nearly as many links, the ques­tion is which links are impor­tant, and how old the links are. Orig­i­nally from Musi­cal Per­cep­tions, ReBlogged by new­mu­si­cre­blog­gers on May 21, 2007 at 04:21 PM
