Technorati Yahoo and Google Too

In the last entry on the subject, we took a look at how Technorati and Google compared, and discovered that Technorati was finding roughly a fourth of the links Google could locate. That raised some interesting questions: could we rely on the Google numbers? Were they so much larger than any other search engine's that we were building an unfair comparison? And, as some alert readers pointed out in email, was Google under-reporting the number of links to a site? To answer some of those questions, I decided to build more comparisons, starting with some of Google’s competitors. Today, I’ll go into how Yahoo! fared (hint: I was surprised by the results).

Gathering the data

As I had done for the previous effort, I gathered data against the same list of sites around the same date. This gave me a consistent data set that allowed for better comparison. Compare one or two sites and you may get some false positives. Compare 100 sites and things start getting a little more interesting. The Yahoo! data ended up looking like this (for people who are new to the series, I am building the same tables and graphs for a number of search engines):

Technorati Top 100 | Yahoo Links | Technorati Links | Technorati/Yahoo Links
Boing Boing | 1880000 | 22532 | 1.19851%
InstaPundit | 2160000 | 15190 | 0.70324%
Daily Kos | 1690000 | 15833 | 0.93686%
Gizmodo | 1970000 | 12278 | 0.62325%
Fark | 1420000 | 10216 | 0.71944%
EnGadget | 2820000 | 15051 | 0.53372%
Davenetics | 66400 | 7571 | 11.40211%
Eschaton | 1400000 | 8713 | 0.62236%
Dooce | 653000 | 6797 | 1.04089%
Andrew Sullivan | 1260000 | 7680 | 0.60952%
The Best Page In The Universe | 62000 | 6333 | 10.21452%
Talking Points Memo: by Joshua Micah Marshall | 563000 | 7592 | 1.34849%
lgf: anti-idiotarian | 49300 | 8275 | 16.78499%
kottke.org | 1200000 | 7278 | 0.60650%
WIL WHEATON DOT NET | 564000 | 6314 | 1.11950%
Metafilter | 1160000 | 7591 | 0.65440%
Doc Searls | 1150000 | 5690 | 0.49478%
(In)formacao e (In)utilidade | 110000 | 6040 | 5.49091%
Wonkette | 1370000 | 5877 | 0.42898%
Scripting News | 1470000 | 5728 | 0.38966%
Power Line | 344000 | 7477 | 2.17355%
Balmasque | 40500 | 4544 | 11.21975%
Corante | 265000 | 7686 | 2.90038%
A list Apart | 620000 | 5536 | 0.89290%
Something Awful | 372000 | 4512 | 1.21290%
Megatokyo | 361000 | 4154 | 1.15069%
Michelle Malkin | 537000 | 6091 | 1.13426%
Arts and Letters Daily | 866000 | 3983 | 0.45993%
Gawker | 1060000 | 4453 | 0.42009%
Afterall it was the best I ever had | 34900 | 3591 | 10.28940%
The Volokh Conspiracy | 1190000 | 5873 | 0.49353%
Scobelizer | 937000 | 5524 | 0.58954%
Jeffrey Zeldman | 528000 | 4134 | 0.78295%
This Modern World | 813000 | 3913 | 0.48130%
The Web Standards Project | 59800 | 3810 | 6.37124%
Joel on Software | 966000 | 4514 | 0.46729%
Media Matters for America | 536000 | 6809 | 1.27034%
Television without pity | 356000 | 3859 | 1.08399%
Kuro5hin | 866000 | 4208 | 0.48591%
Lileks | 39700 | 3824 | 9.63224%
Hugh Hewitt | 929000 | 4573 | 0.49225%
Joel Veitch | 135000 | 3774 | 2.79556%
Truthout | 371000 | 6528 | 1.75957%
Baghdad Burning | 552000 | 3519 | 0.63750%
Buzz machine | 1010000 | 4145 | 0.41040%
fleugel | 201000 | 3670 | 1.82587%
Informed Comment | 787000 | 3905 | 0.49619%
Doppler: redefining podcasting | 607000 | 3040 | 0.50082%
geek and proud | 9110 | 3166 | 34.75302%
loadmemory (Asian site) | 1550 | 3324 | 214.45161%
Photojunkie | 51200 | 2860 | 5.58594%
Ross Rader | 48200 | 2976 | 6.17427%
The Truth Laid Bear | 717000 | 4127 | 0.57559%
Joi Ito | 1050000 | 5165 | 0.49190%
ScrappleFace | 807000 | 3480 | 0.43123%
LexText | 31200 | 2671 | 8.56090%
Google Blog | 297000 | 3688 | 1.24175%
Xbox | 237000 | 4221 | 1.78101%
My life in a Bush of Ghosts | 903 | 2519 | 278.95903%
Astronomy picture of the day | 113000 | 3498 | 3.09558%
Crooked Timber | 67500 | 3617 | 5.35852%
Vodka Pundit | 169000 | 3085 | 1.82544%
Captain’s quarter | 730000 | 3671 | 0.50288%
A small victory | 460000 | 3223 | 0.70065%
Gato Fedorento | 126000 | 2574 | 2.04286%
Mezzoblue | 278000 | 2952 | 1.06187%
PostSecret | 202000 | 2707 | 1.34010%
Samizdata.net | 18000 | 2872 | 15.95556%
Lawrence Lessig | 959000 | 2949 | 0.30751%
Counterpunch | 295000 | 3278 | 1.11119%
Democractic Underground | 417000 | 3913 | 0.93837%
Right Wing News | 794000 | 2967 | 0.37368%
StopDesign | 255000 | 3037 | 1.19098%
iBiblio | 197000 | 3105 | 1.57614%
Samizdata.net (mistake?) | 697000 | 2743 | 0.39354%
Abrupto | 44700 | 2935 | 6.56600%
gene7299 (Asian MSNSpaces site) | 764 | 3215 | 420.81152%
Where is Raed? | 232000 | 2409 | 1.03836%
B3TA: We love the web | 839000 | 2614 | 0.31156%
Talkleft | 221000 | 2901 | 1.31267%
Wizbang | 634000 | 3358 | 0.52965%
m1net (MSN spaces site) | 579 | 3548 | 612.78066%
Hoder | 20900 | 5422 | 25.94258%
CTRL+Alt+Del | 171000 | 2315 | 1.35380%
Brad DeLong | 882000 | 2715 | 0.30782%
Blogs for Bush | 824000 | 3560 | 0.43204%
Neil Gaiman | 319000 | 2194 | 0.68777%
Gothamist | 491000 | 2729 | 0.55580%
Thought Mechanics | 190000 | 2197 | 1.15632%
IMAO | 407000 | 2905 | 0.71376%
Dan Gillmor (old weblog) | 298000 | 2600 | 0.87248%
HINAGATA | 21100 | 2186 | 10.36019%
Dean’s World | 784000 | 2985 | 0.38074%
Defamer | 725000 | 2372 | 0.32717%
USS Clueless | 264000 | 2570 | 0.97348%
Dive into Mark | 235000 | 2540 | 1.08085%
Pandagon | 743000 | 2822 | 0.37981%
Blogging.la | 67700 | 3061 | 4.52142%
Why are you worshipping the ground I blog on? | 85000 | 2238 | 2.63294%
Daring Fireball | 221000 | 2573 | 1.16425%

The first thing of interest when putting together that set of numbers was how much larger the number of links found in the Yahoo! index was, compared to the number of links found in either Technorati or Google. The second item I found interesting was how consistently Asian sites fared poorly in the Yahoo! index compared to the Technorati one: loadmemory, gene7299 and m1net are among the few sites for which Technorati reports more links than Yahoo! does (ratios of roughly 214%, 421% and 613%). It seems that Technorati is getting a better handle on the Asian blogosphere than Yahoo! is, a surprising result considering how much time and effort the latter has put into its Asian operations.
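
For readers who want to reproduce the last column, the percentage is simply Technorati links divided by Yahoo! links. The short Python sketch below uses four rows copied from the table above. The function name `coverage_pct` and the 100% cut-off used to flag outliers are illustrative assumptions, not part of the original spreadsheet.

```python
# Rows copied from the table above: (site, Yahoo! links, Technorati links).
rows = [
    ("Boing Boing", 1880000, 22532),
    ("loadmemory (Asian site)", 1550, 3324),
    ("gene7299 (Asian MSNSpaces site)", 764, 3215),
    ("m1net (MSN spaces site)", 579, 3548),
]


def coverage_pct(technorati_links, yahoo_links):
    """Technorati links expressed as a percentage of Yahoo! links."""
    return 100.0 * technorati_links / yahoo_links


# Sites where Technorati reports more links than Yahoo! does.
# The 100% cut-off is an assumption chosen for illustration only.
outliers = [
    name for name, yahoo, technorati in rows
    if coverage_pct(technorati, yahoo) > 100.0
]

for name, yahoo, technorati in rows:
    print(f"{name}: {coverage_pct(technorati, yahoo):.5f}%")
print("More links in Technorati than in Yahoo!:", outliers)
```

Run against these four rows, the three Asian sites are the ones flagged, while Boing Boing stays a little above one percent.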

In order to get some real visual comparison, I decided to draw a similar diagram of the link percentages distributed across all 100 sites. It looked like this:

[Chart: Technorati vs. Yahoo – Source: TNL.net]

The interesting story here is that there appeared to be much greater variance from site to site in the Google index than there was in the Yahoo! one. In the Yahoo! system, roughly half of the sites fall below the one percent mark, but what became even more interesting was that the variance was really not that high: the ratio computed from the totals and the ratio computed from the medians turned out to differ by less than 0.1 percentage points:

Technorati Top 100 | Yahoo Links | Technorati Links | Technorati/Yahoo Links
Total | 56150006 | 479580 | 0.85410%
Median | 389500 | 3679.5 | 0.94467%
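
As a quick check of that gap, the two percentages can be recomputed from the aggregates in the table above. This is only a sketch: the helper `pct` and the variable names are made up for the example. Note that the Median row's percentage matches the ratio of the two column medians, which is not necessarily the same thing as the median of the per-site percentages.

```python
def pct(numerator, denominator):
    """Return numerator as a percentage of denominator."""
    return 100.0 * numerator / denominator


# Aggregates copied from the Total and Median rows above.
total_yahoo, total_technorati = 56150006, 479580
median_yahoo, median_technorati = 389500, 3679.5

total_ratio = pct(total_technorati, total_yahoo)
median_ratio = pct(median_technorati, median_yahoo)

# Difference between the two ratios, in percentage points.
gap = median_ratio - total_ratio

print(f"ratio of totals:  {total_ratio:.5f}%")
print(f"ratio of medians: {median_ratio:.5f}%")
print(f"gap: {gap:.5f} percentage points")
```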

While the numbers were vastly different in terms of size (it appeared Yahoo! had a lot more links), I figured the patterns would be roughly the same in terms of coverage: I expected the top sites to get better coverage than smaller sites in a large search engine like Yahoo!. Imagine my surprise, then, when I started to do some group analysis:

Technorati Top 100 | Yahoo Links | Technorati Links | Technorati/Yahoo Links
AVERAGE TOP 10 | 1531940 | 12186.1 | 0.79547%
AVERAGE TOP 25 | 986368 | 8733.36 | 0.88541%
AVERAGE TOP 50 | 768245.2 | 6534.36 | 0.85056%
AVERAGE BOTTOM 50 | 354754.92 | 3057.24 | 0.86179%
AVERAGE BOTTOM 25 | 362220.8846 | 2834.884615 | 0.78264%
AVERAGE BOTTOM 10 | 350072.7273 | 2622.909091 | 0.74925%

Those numbers seemed to be all over the map, a fact that became much clearer once I graphed them:

[Chart: Technorati vs. Yahoo – Source: TNL.net]

There was none of the nice downward curve I had with the Google set. Here was a much more disparate set, providing little support for a theory of bias from a search engine. In fact, it worked more to potentially prove such a theory wrong.
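
The group figures can be rebuilt from the per-site rows. As a minimal sketch, the Python below recomputes the "AVERAGE TOP 10" line from the first ten rows of the Technorati vs. Yahoo! table. The variable names are illustrative, and the other groups would need their own row ranges, which are not spelled out in this post.

```python
# First ten rows of the Technorati vs. Yahoo! table: (Yahoo! links, Technorati links).
top_10 = [
    (1880000, 22532),  # Boing Boing
    (2160000, 15190),  # InstaPundit
    (1690000, 15833),  # Daily Kos
    (1970000, 12278),  # Gizmodo
    (1420000, 10216),  # Fark
    (2820000, 15051),  # EnGadget
    (66400, 7571),     # Davenetics
    (1400000, 8713),   # Eschaton
    (653000, 6797),    # Dooce
    (1260000, 7680),   # Andrew Sullivan
]

yahoo_links = [yahoo for yahoo, _ in top_10]
technorati_links = [technorati for _, technorati in top_10]

# Plain arithmetic means of each column.
avg_yahoo = sum(yahoo_links) / len(yahoo_links)
avg_technorati = sum(technorati_links) / len(technorati_links)

# Ratio of the two averages, as a percentage.
group_ratio = 100.0 * avg_technorati / avg_yahoo

print(f"average Yahoo! links:     {avg_yahoo}")
print(f"average Technorati links: {avg_technorati}")
print(f"Technorati/Yahoo:         {group_ratio:.5f}%")
```

The group percentage here is the ratio of the two column averages, so one site with a very large Yahoo! count weighs more heavily than it would in an average of the per-site percentages.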

Was my data set wrong? I rechecked it and it was not. So what was happening here? As dreams of long tail and power law distributions fell apart, I started to wonder how Yahoo! and Google compared. So, of course, I decided to run the numbers again…

Yahoo! vs. Google

This time I decided to compare Google and Yahoo!. First, I figured I would get some reference data on the subject. I was surprised not to find any actual side-by-side comparison on a large set of sites. Anecdotal evidence existed, but nothing comparable to the data set I had amassed, so I figured I would trust my own data set (note: if you have a better one, please leave a comment as to where it is located). The set ended up looking like this:

Name | Position 5/19/05 | Google | Yahoo | Google/Yahoo Links
Boing Boing | 1 | 45200 | 1880000 | 2.40%
InstaPundit | 2 | 75000 | 2160000 | 3.47%
Daily Kos | 3 | 59800 | 1690000 | 3.54%
Gizmodo | 4 | 39300 | 1970000 | 1.99%
Fark | 5 | 43600 | 1420000 | 3.07%
EnGadget | 6 | 46800 | 2820000 | 1.66%
Davenetics | 7 | 1780 | 66400 | 2.68%
Eschaton | 8 | 62400 | 1400000 | 4.46%
Dooce | 9 | 23600 | 653000 | 3.61%
Andrew Sullivan | 10 | 41100 | 1260000 | 3.26%
The Best Page In The Universe | 11 | 656 | 62000 | 1.06%
Talking Points Memo: by Joshua Micah Marshall | 12 | 74600 | 563000 | 13.25%
lgf: anti-idiotarian | 13 | 14700 | 49300 | 29.82%
kottke.org | 14 | 32000 | 1200000 | 2.67%
WIL WHEATON DOT NET | 15 | 16900 | 564000 | 3.00%
Metafilter | 16 | 34500 | 1160000 | 2.97%
Doc Searls | 17 | 33600 | 1150000 | 2.92%
(In)formacao e (In)utilidade | 18 | 1780 | 110000 | 1.62%
Wonkette | 19 | 28800 | 1370000 | 2.10%
Scripting News | 20 | 39400 | 1470000 | 2.68%
Power Line | 21 | 7510 | 344000 | 2.18%
Balmasque | 22 | 24 | 40500 | 0.06%
Corante | 23 | 6770 | 265000 | 2.55%
A list Apart | 24 | 21100 | 620000 | 3.40%
Something Awful | 25 | 9020 | 372000 | 2.42%
Megatokyo | 26 | 7310 | 361000 | 2.02%
Michelle Malkin | 27 | 17300 | 537000 | 3.22%
Arts and Letters Daily | 28 | 23900 | 866000 | 2.76%
Gawker | 29 | 23500 | 1060000 | 2.22%
Afterall it was the best I ever had | 30 | 95 | 34900 | 0.27%
The Volokh Conspiracy | 31 | 42000 | 1190000 | 3.53%
Scobelizer | 32 | 21800 | 937000 | 2.33%
Jeffrey Zeldman | 33 | 22500 | 528000 | 4.26%
This Modern World | 34 | 32100 | 813000 | 3.95%
The Web Standards Project | 35 | 1850 | 59800 | 3.09%
Joel on Software | 36 | 22400 | 966000 | 2.32%
Media Matters for America | 37 | 24800 | 536000 | 4.63%
Television without pity | 38 | 13300 | 356000 | 3.74%
Kuro5hin | 39 | 17300 | 866000 | 2.00%
Lileks | 40 | (blank) | 39700 | 0.00%
Hugh Hewitt | 41 | 26700 | 929000 | 2.87%
Joel Veitch | 42 | 2830 | 135000 | 2.10%
Truthout | 43 | 8780 | 371000 | 2.37%
Baghdad Burning | 44 | 22700 | 552000 | 4.11%
Buzz machine | 45 | 30600 | 1010000 | 3.03%
fleugel | 46 | 1890 | 201000 | 0.94%
Informed Comment | 47 | 27900 | 787000 | 3.55%
Doppler: redefining podcasting | 48 | 4420 | 607000 | 0.73%
geek and proud | 49 | 355 | 9110 | 3.90%
loadmemory (Asian site) | 50 | 83 | 1550 | 5.35%
Photojunkie | 51 | 1540 | 51200 | 3.01%
Ross Rader | 52 | 1070 | 48200 | 2.22%
The Truth Laid Bear | 53 | 23900 | 717000 | 3.33%
Joi Ito | 54 | 23400 | 1050000 | 2.23%
ScrappleFace | 55 | 31100 | 807000 | 3.85%
LexText | 56 | 1970 | 31200 | 6.31%
Google Blog | 57 | 46 | 297000 | 0.02%
Xbox | 58 | 6600 | 237000 | 2.78%
My life in a Bush of Ghosts | 59 | 6 | 903 | 0.66%
Astronomy picture of the day | 60 | 5020 | 113000 | 4.44%
Crooked Timber | 61 | 3560 | 67500 | 5.27%
Vodka Pundit | 62 | 4520 | 169000 | 2.67%
Captain’s quarter | 63 | 27100 | 730000 | 3.71%
A small victory | 64 | 16700 | 460000 | 3.63%
Gato Fedorento | 65 | 1630 | 126000 | 1.29%
Mezzoblue | 66 | 12000 | 278000 | 4.32%
PostSecret | 67 | 5790 | 202000 | 2.87%
Samizdata.net | 68 | 1050 | 18000 | 5.83%
Lawrence Lessig | 69 | 30600 | 959000 | 3.19%
Counterpunch | 70 | 11700 | 295000 | 3.97%
Democractic Underground | 71 | 14900 | 417000 | 3.57%
Right Wing News | 72 | 27900 | 794000 | 3.51%
StopDesign | 73 | 10200 | 255000 | 4.00%
iBiblio | 74 | 9730 | 197000 | 4.94%
Samizdata.net (mistake?) | 75 | 25500 | 697000 | 3.66%
Abrupto | 76 | 550 | 44700 | 1.23%
gene7299 (Asian MSNSpaces site) | 77 | 58 | 764 | 7.59%
Where is Raed? | 78 | 10100 | 232000 | 4.35%
B3TA: We love the web | 79 | 12000 | 839000 | 1.43%
Talkleft | 80 | 7170 | 221000 | 3.24%
Wizbang | 81 | 21000 | 634000 | 3.31%
m1net (MSN spaces site) | 82 | 104 | 579 | 17.96%
Hoder | 83 | 1480 | 20900 | 7.08%
CTRL+Alt+Del | 84 | 2310 | 171000 | 1.35%
Brad DeLong | 85 | 30100 | 882000 | 3.41%
Blogs for Bush | 86 | 16200 | 824000 | 1.97%
Neil Gaiman | 87 | 13700 | 319000 | 4.29%
Gothamist | 88 | 15200 | 491000 | 3.10%
Thought Mechanics | 89 | 4400 | 190000 | 2.32%
IMAO | 90 | 23800 | 407000 | 5.85%
Dan Gillmor (old weblog) | 91 | 10800 | 298000 | 3.62%
HINAGATA | 92 | 10100 | 21100 | 47.87%
Dean’s World | 93 | 30600 | 784000 | 3.90%
Defamer | 94 | 9310 | 725000 | 1.28%
USS Clueless | 95 | 8470 | 264000 | 3.21%
Dive into Mark | 96 | 14600 | 235000 | 6.21%
Pandagon | 97 | 27300 | 743000 | 3.67%
Blogging.la | 98 | 3200 | 67700 | 4.73%
Why are you worshipping the ground I blog on? | 99 | 1430 | 85000 | 1.68%
Daring Fireball | 100 | 12000 | 221000 | 5.43%

Nothing particularly impressive there. It seemed that Google, on average, ended up with only about 3% of the links Yahoo! had in its index. However, the story got more interesting when looking at the divergence between the average and the median, as there was a gap of almost half a percentage point (3.10% versus 3.52%) between the two:

Technorati Top 100 | Google | Yahoo | Google/Yahoo Links
Total | 1739867 | 56150006 | 3.10%
Median | 13700 | 389500 | 3.52%

But wait, for the weirdness is only getting started. Next up was looking at the distributions (as I’ve done for Technorati vs. each of the engines):

Technorati Top 100 | Google | Yahoo | Google/Yahoo Links
AVERAGE TOP 10 | 43858 | 1531940 | 2.86%
AVERAGE TOP 25 | 30397.6 | 986368 | 3.08%
AVERAGE TOP 50 | 23599.04082 | 768245.2 | 3.07%
AVERAGE BOTTOM 50 | 11443.07843 | 354754.92 | 3.23%
AVERAGE BOTTOM 25 | 11980.07692 | 362220.8846 | 3.31%
AVERAGE BOTTOM 10 | 13782.72727 | 350072.7273 | 3.94%

I looked at the numbers and they did not seem right, so I ran them again and ended up with the same results. I ran them a third time and still couldn’t make sense of them. So I graphed them:

[Chart: Google vs. Yahoo round 2]

… and to my surprise, it appeared that the further down the list one went, the greater the differential. In fact, the Google/Yahoo! ratio climbs from 2.86% for the top 10 to 3.94% for the bottom 10, a gap of about one full percentage point between the two ends of the top 100.
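
To put a number on that differential, the group ratios can be recomputed from the averages in the table above. This is a rough sketch only: the group labels and values are copied from that table, while the names `groups`, `ratios` and `spread` are invented for the example.

```python
# (group label, average Google links, average Yahoo! links), copied from the table above.
groups = [
    ("TOP 10", 43858, 1531940),
    ("TOP 25", 30397.6, 986368),
    ("TOP 50", 23599.04082, 768245.2),
    ("BOTTOM 50", 11443.07843, 354754.92),
    ("BOTTOM 25", 11980.07692, 362220.8846),
    ("BOTTOM 10", 13782.72727, 350072.7273),
]

# Google links as a percentage of Yahoo! links, per group.
ratios = {label: 100.0 * google / yahoo for label, google, yahoo in groups}

# Difference between the two ends of the list, in percentage points.
spread = ratios["BOTTOM 10"] - ratios["TOP 10"]

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}%")
print(f"bottom 10 minus top 10: {spread:.2f} percentage points")
```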

Conclusions

From here, we can draw a few conclusions:

  • Yahoo! generally does a better job at indexing the blogosphere than Google does. We know they have been working hard to improve their index, and here’s proof that they are getting results.
  • Even if Google is the one with the motto about not doing evil, Yahoo! seems to be the one interested in giving equal opportunity to the little guy: smaller blogs seem to have a better chance of being recognized by Yahoo! than they do of being recognized by Google.
  • While the front page of Google advertises that they are currently indexing over 8 billion pages, it is very difficult to find ways to support that claim via the link feature they are offering: this can be seen as confirmation that Google does not tell you about all the links it has in its index.
  • Sure, volume counts, but in the case of search indexes it may count against sites: if a site is less likely to appear in Google than it is to appear in Yahoo!, and the Google index is much larger than the Yahoo! one, then, if Yahoo! and Google had the same amount of traffic, a single blog could find itself receiving more traffic from Yahoo! than it does from Google. This would be due to the fact that each individual page in Yahoo! has more weight than it does in Google.
  • The top 100 blogs have over 56 million links in the Yahoo! index. That’s a lot of links, and it clearly shows that links are the currency of the blogging world. It would be interesting to get data that would help analyze how much interlinking exists across those sites.

Up next, we’ll take a look at how MSN fits into all of this. So stay tuned!
