tnl.net

Secrets of the A-list bloggers: Technorati vs. Google

Looking at data about the A-list, first on its own and later as part of a wider scope, made me wonder about the initial data set I was using. What does it mean to be on the Technorati 100? Is Technorati presenting an accurate representation of the world? And how does it compare against the wider world? So I decided to start gathering data from other search engines. In this entry, I will go into a more detailed analysis of that information and attempt to answer some of the questions I raised above.

Source Check

In my initial review, I noticed that Technorati was ranking sites based on sources. However, the major search engines do not really expose incoming and outgoing source data. So, for this particular investigation, I decided to set source data aside and focus on link data, which I gathered from the three largest search engines: Google, Yahoo! and MSN (that last one was included at the last minute just because I knew that Robert Scoble would complain about the study being biased if I didn’t include MSN).

Picking three search engines was also interesting because it provided a sort of reference check. If one of the engines did not line up with the other two, we could point to a potential flaw in that engine instead of trying to understand why the data was wrong.
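The reference-check idea can be sketched in a few lines of Python. This is only an illustration: the function name, the 50% tolerance, and the link counts in the example are made up for the sketch and are not figures from this study.

```python
from statistics import median


def odd_one_out(counts, tolerance=0.5):
    """Return the engine whose count strays from the median of all counts
    by more than `tolerance` (as a fraction of the median).

    Returns None when no engine, or more than one engine, strays that far.
    """
    mid = median(counts.values())
    if mid == 0:
        return None
    outliers = [
        engine
        for engine, value in counts.items()
        if abs(value - mid) / mid > tolerance
    ]
    return outliers[0] if len(outliers) == 1 else None


# Hypothetical link counts for one site (NOT data from this study).
example = {"Google": 1000, "Yahoo!": 1100, "MSN": 200}
print(odd_one_out(example))
```

With three engines, the median is simply the middle count, so a single engine that disagrees sharply with the other two is the one that gets flagged.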

Having picked that data set, I started gathering the data. Let me say that it’s a lot of information and, should I try to do this again in the future, writing software to gather it will probably be less time-consuming than trying to get it by hand.

But enough about the process, let’s get into the numbers.

Technorati vs. Google

So the first dataset I created was a comparative index of Technorati and Google. The set was created by grabbing the number of links to a site in Google and getting the equivalent value for Technorati. The result looked like this:

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
Boing Boing | 45200 | 22532 | 49.8496%
InstaPundit | 75000 | 15190 | 20.2533%
Daily Kos | 59800 | 15833 | 26.4766%
Gizmodo | 39300 | 12278 | 31.2417%
Fark | 43600 | 10216 | 23.4312%
EnGadget | 46800 | 15051 | 32.1603%
Davenetics | 1780 | 7571 | 425.3371%
Eschaton | 62400 | 8713 | 13.9631%
Dooce | 23600 | 6797 | 28.8008%
Andrew Sullivan | 41100 | 7680 | 18.6861%
The Best Page In The Universe | 656 | 6333 | 965.3963%
Talking Points Memo: by Joshua Micah Marshall | 74600 | 7592 | 10.1769%
lgf: anti-idiotarian | 14700 | 8275 | 56.2925%
kottke.org | 32000 | 7278 | 22.7438%
WIL WHEATON DOT NET | 16900 | 6314 | 37.3609%
Metafilter | 34500 | 7591 | 22.0029%
Doc Searls | 33600 | 5690 | 16.9345%
(In)formacao e (In)utilidade | 1780 | 6040 | 339.3258%
Wonkette | 28800 | 5877 | 20.4063%
Scripting News | 39400 | 5728 | 14.5381%
Power Line | 7510 | 7477 | 99.5606%
Balmasque | 24 | 4544 | 18933.3333%
Corante | 6770 | 7686 | 113.5303%
A list Apart | 21100 | 5536 | 26.2370%
Something Awful | 9020 | 4512 | 50.0222%
Megatokyo | 7310 | 4154 | 56.8263%
Michelle Malkin | 17300 | 6091 | 35.2081%
Arts and Letters Daily | 23900 | 3983 | 16.6653%
Gawker | 23500 | 4453 | 18.9489%
Afterall it was the best I ever had | 95 | 3591 | 3780.0000%
The Volokh Conspiracy | 42000 | 5873 | 13.9833%
Scobleizer | 21800 | 5524 | 25.3394%
Jeffrey Zeldman | 22500 | 4134 | 18.3733%
This Modern World | 32100 | 3913 | 12.1900%
The Web Standards Project | 1850 | 3810 | 205.9459%
Joel on Software | 22400 | 4514 | 20.1518%
Media Matters for America | 24800 | 6809 | 27.4556%
Television without pity | 13300 | 3859 | 29.0150%
Kuro5hin | 17300 | 4208 | 24.3237%
Lileks | 0 | 3824 | N/A
Hugh Hewitt | 26700 | 4573 | 17.1273%
Joel Veitch | 2830 | 3774 | 133.3569%
Truthout | 8780 | 6528 | 74.3508%
Baghdad Burning | 22700 | 3519 | 15.5022%
Buzz machine | 30600 | 4145 | 13.5458%
fleugel | 1890 | 3670 | 194.1799%
Informed Comment | 27900 | 3905 | 13.9964%
Doppler: redefining podcasting | 4420 | 3040 | 68.7783%
geek and proud | 355 | 3166 | 891.8310%
loadmemory (Asian site) | 83 | 3324 | 4004.8193%
Photojunkie | 1540 | 2860 | 185.7143%
Ross Rader | 1070 | 2976 | 278.1308%
The Truth Laid Bear | 23900 | 4127 | 17.2678%
Joi Ito | 23400 | 5165 | 22.0726%
ScrappleFace | 31100 | 3480 | 11.1897%
LexText | 1970 | 2671 | 135.5838%
Google Blog | 46 | 3688 | 8017.3913%
Xbox | 6600 | 4221 | 63.9545%
My life in a Bush of Ghosts | 6 | 2519 | 41983.3333%
Astronomy picture of the day | 5020 | 3498 | 69.6813%
Crooked Timber | 3560 | 3617 | 101.6011%
Vodka Pundit | 4520 | 3085 | 68.2522%
Captain’s quarter | 27100 | 3671 | 13.5461%
A small victory | 16700 | 3223 | 19.2994%
Gato Fedorento | 1630 | 2574 | 157.9141%
Mezzoblue | 12000 | 2952 | 24.6000%
PostSecret | 5790 | 2707 | 46.7530%
Samizdata.net | 1050 | 2872 | 273.5238%
Lawrence Lessig | 30600 | 2949 | 9.6373%
Counterpunch | 11700 | 3278 | 28.0171%
Democratic Underground | 14900 | 3913 | 26.2617%
Right Wing News | 27900 | 2967 | 10.6344%
StopDesign | 10200 | 3037 | 29.7745%
iBiblio | 9730 | 3105 | 31.9116%
Samizdata.net (mistake?) | 25500 | 2743 | 10.7569%
Abrupto | 550 | 2935 | 533.6364%
gene7299 (Asian MSNSpaces site) | 58 | 3215 | 5543.1034%
Where is Raed? | 10100 | 2409 | 23.8515%
B3TA: We love the web | 12000 | 2614 | 21.7833%
Talkleft | 7170 | 2901 | 40.4603%
Wizbang | 21000 | 3358 | 15.9905%
m1net (MSN spaces site) | 104 | 3548 | 3411.5385%
Hoder | 1480 | 5422 | 366.3514%
CTRL+Alt+Del | 2310 | 2315 | 100.2165%
Brad DeLong | 30100 | 2715 | 9.0199%
Blogs for Bush | 16200 | 3560 | 21.9753%
Neil Gaiman | 13700 | 2194 | 16.0146%
Gothamist | 15200 | 2729 | 17.9539%
Thought Mechanics | 4400 | 2197 | 49.9318%
IMAO | 23800 | 2905 | 12.2059%
Dan Gillmor (old weblog) | 10800 | 2600 | 24.0741%
HINAGATA | 10100 | 2186 | 21.6436%
Dean’s World | 30600 | 2985 | 9.7549%
Defamer | 9310 | 2372 | 25.4780%
USS Clueless | 8470 | 2570 | 30.3424%
Dive into Mark | 14600 | 2540 | 17.3973%
Pandagon | 27300 | 2822 | 10.3370%
Blogging.la | 3200 | 3061 | 95.6563%
Why are you worshipping the ground I blog on? | 1430 | 2238 | 156.5035%
Daring Fireball | 12000 | 2573 | 21.4417%

The last column is just a quick calculation showing Technorati’s link count as a percentage of Google’s. From there, we’re already noticing some interesting trends. While most of the data shows Google as having a larger set of links in its index than Technorati, there are 25 cases in the table where the Technorati count is larger than the Google one (26 if you count Lileks, for which Google returned no links at all). A quarter of the dataset is too large a share to write off as noise. How Technorati ends up with more data than Google is something that someone might want to investigate. Beyond that, it appears that Technorati gets about 30% of the links that Google gets to a particular site, as illustrated in the chart below:

[Chart: Technorati vs. Google]
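For readers who want to reproduce the percentage column, here is a minimal sketch in Python. The function and variable names are my own choices for illustration; the four rows are copied from the table above, and returning `None` for a site with zero Google links mirrors the N/A shown for Lileks.

```python
def technorati_share(google_links, technorati_links):
    """Technorati links as a percentage of Google links.

    Returns None when Google reports no links, since the ratio is undefined.
    """
    if google_links == 0:
        return None
    return technorati_links / google_links * 100


# A few rows copied from the table above: (site, Google links, Technorati links).
rows = [
    ("Boing Boing", 45200, 22532),
    ("Davenetics", 1780, 7571),
    ("Lileks", 0, 3824),
    ("Power Line", 7510, 7477),
]

for site, google, technorati in rows:
    share = technorati_share(google, technorati)
    label = "N/A" if share is None else f"{share:.4f}%"
    print(f"{site}: {label}")

# Sites where Technorati reports more links than Google does.
larger_in_technorati = [site for site, g, t in rows if t > g]
print(larger_in_technorati)
```

Running the same comparison over all 100 rows is how the number of sites with more Technorati links than Google links can be counted.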

The next interesting finding is that while the link count from Technorati is generally lower than it is in Google, it is consistently so. A quick analysis of the data set shows that the ratio of total Technorati links to total Google links is not that far from the ratio of the median Technorati link count to the median Google link count. Confused by that last sentence? Don’t worry (I was too after I wrote it) and let me show you, by pulling out another data chart:

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
TOTAL | 1739867 | 479580 | 27.5642%
MEDIAN | 13500 | 3679.5 | 27.2556%

Doesn’t it all become clearer? Taken as a whole, for the top 100 bloggers, Technorati holds 27.56% of the links that Google holds. Part of the reason behind this may be that Technorati only represents the blog subset of the web while Google represents linkage for the web as a whole. From here, assuming the links Technorati sees are also in Google’s count, we could gather that for every link a blog provides, other sources on the web provide between 2 and 3 links (72.44 / 27.56 ≈ 2.6). Since blogs still represent a small portion of the web, however, the importance of links in the blog world may be outpacing the importance of links in the non-blog world. Part of the reason behind this could be that links are one of the main currencies of the web space and many blogs offer little content but are heavy on linking: even when an average blog entry is under 300 words, it often contains at least one link. This could mean that Technorati and other blog search engines are right to consider links as a strong measurement, but it may also show that blogs, as a medium, are not providing that much content beyond linking.
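The arithmetic behind the summary table can be checked with a short Python sketch. The totals and medians are the ones published above; the variable names are mine, and the last part, which uses five rows from the first table, is an addition of my own to show why I would not simply average the per-site percentages.

```python
from statistics import mean

# Totals and medians as published in the summary table above.
TOTAL_GOOGLE, TOTAL_TECHNORATI = 1739867, 479580
MEDIAN_GOOGLE, MEDIAN_TECHNORATI = 13500, 3679.5

share_of_totals = TOTAL_TECHNORATI / TOTAL_GOOGLE * 100
share_of_medians = MEDIAN_TECHNORATI / MEDIAN_GOOGLE * 100

# Assumption: every link Technorati counts is also in Google's count,
# so the rest of Google's total comes from non-blog sources.
other_links_per_blog_link = (TOTAL_GOOGLE - TOTAL_TECHNORATI) / TOTAL_TECHNORATI

print(f"share of totals:  {share_of_totals:.4f}%")
print(f"share of medians: {share_of_medians:.4f}%")
print(f"non-blog links per blog link: {other_links_per_blog_link:.2f}")

# Five rows from the first table: (Google links, Technorati links).
sample = [(45200, 22532), (75000, 15190), (59800, 15833), (1780, 7571), (24, 4544)]

# Averaging per-site percentages lets a site with 24 Google links dominate;
# dividing the summed counts does not.
mean_of_site_shares = mean(t / g * 100 for g, t in sample)
sample_share_of_totals = sum(t for _, t in sample) / sum(g for g, _ in sample) * 100

print(f"mean of per-site shares (sample): {mean_of_site_shares:.1f}%")
print(f"share of summed counts (sample):  {sample_share_of_totals:.1f}%")
```

The sample shows how a handful of sites with almost no Google links would swamp a plain average of the percentage column, which is why the totals and medians are the more useful summary here.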

However, it gets even more interesting if you dig in. Looking at the data, these values are actually misleading. The match is not truly egalitarian across the list. Doing a quick review of the distribution, we start seeing some interesting trends.

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
AVERAGE TOP 10 | 43858 | 12186.1 | 27.7854%
AVERAGE TOP 25 | 30397.6 | 8733.36 | 28.7304%
AVERAGE TOP 50 | 23127.06 | 6534.36 | 28.2542%
AVERAGE BOTTOM 50 | 11443.07843 | 3057.24 | 26.7169%
AVERAGE BOTTOM 25 | 11980.07692 | 2834.884615 | 23.6633%
AVERAGE BOTTOM 10 | 13782.72727 | 2622.909091 | 19.0304%

Let’s graph the Technorati links as a percentage of Google links to see a little more of what I’m inferring:

[Chart: Technorati links as a percentage of Google links, averaged by tier]

Looking at this, it seems that our friends at Technorati have a bias. On average, Technorati’s count for blogs in the top 10 is 27.8% of Google’s, against 19.0% for blogs in the bottom 10, a gap of more than 8 percentage points. Considering that Google already admits to some level of bias in its system (part of the foundation for PageRank is that sites with higher PageRank get indexed more often), it is a bit worrisome, especially if the trend holds across the whole of Technorati’s universe. If Google favors indexing more popular sites more often, a clear opportunity for world-live-web search engines like Technorati would be in the long tail of less-often-indexed sites, but Technorati seems to ignore that opportunity and concentrate on the top sites. What that will translate into is a direct reproduction of the power laws when it comes to indexing of blogs.
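The gap between tiers can be recomputed from the averages table with a small Python sketch. The figures are copied from that table as published; the dictionary layout and variable names are just my choice for the illustration.

```python
# Average link counts per tier, copied from the averages table above:
# tier -> (average Google links, average Technorati links).
tiers = {
    "top 10": (43858, 12186.1),
    "top 25": (30397.6, 8733.36),
    "top 50": (23127.06, 6534.36),
    "bottom 50": (11443.07843, 3057.24),
    "bottom 25": (11980.07692, 2834.884615),
    "bottom 10": (13782.72727, 2622.909091),
}

# Technorati average as a percentage of the Google average, per tier.
shares = {tier: t / g * 100 for tier, (g, t) in tiers.items()}

# Difference between the two extremes, in percentage points and as a ratio.
gap_points = shares["top 10"] - shares["bottom 10"]
relative = shares["top 10"] / shares["bottom 10"]

for tier, share in shares.items():
    print(f"{tier}: {share:.4f}%")
print(f"top 10 vs. bottom 10: {gap_points:.2f} points, {relative:.2f}x")
```

Expressing the difference both in percentage points and as a ratio makes it clear which kind of "8%" is being talked about.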

But is that true of Google vs. Technorati only? Or do the same rules apply for other search engines? We’ll look at that in the next entry.

Originally published on June 13, 2005 in Business.