Secrets of the A-list bloggers: Technorati vs. Google

Looking at data about the A-list, first on its own and later as part of a wider scope, made me wonder about the initial data set I was using. What does it mean to be on the Technorati 100? Is Technorati presenting an accurate representation of the world? And how does it compare against the wider world? So I decided to start gathering data from other search engines. In this entry, I will go into a more detailed analysis of that information and attempt to answer some of the questions I raised above.

Source Check

In my initial review, I noticed that Technorati was ranking sites based on sources. However, incoming and outgoing source information is not really available from the major search engines. So, for this particular investigation, I decided to set source data aside and focus on link data, which I gathered from the three largest search engines: Google, Yahoo! and MSN (that last one was included at the last minute just because I knew that Robert Scoble would complain about the study being biased if I didn’t include MSN).

Picking three search engines was also interesting because it provided a reference check of sorts. If one of the engines did not line up with the other two, we could point to a potential flaw in that engine instead of trying to understand why the data was wrong.

Having picked the data set, I started gathering the data. Let me say that it’s a lot of information and, should I try to do this again in the future, writing software to gather the information will probably be less time-consuming than trying to get it by hand.

But enough about the process, let’s get into the numbers.

Technorati vs. Google

So the first dataset I created was a comparative index of Technorati and Google. The set was created by grabbing the number of links to a site in Google and getting the equivalent value for Technorati. The result looked like this:

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
Boing Boing | 45200 | 22532 | 49.8496%
InstaPundit | 75000 | 15190 | 20.2533%
Daily Kos | 59800 | 15833 | 26.4766%
Gizmodo | 39300 | 12278 | 31.2417%
Fark | 43600 | 10216 | 23.4312%
EnGadget | 46800 | 15051 | 32.1603%
Davenetics | 1780 | 7571 | 425.3371%
Eschaton | 62400 | 8713 | 13.9631%
Dooce | 23600 | 6797 | 28.8008%
Andrew Sullivan | 41100 | 7680 | 18.6861%
The Best Page In The Universe | 656 | 6333 | 965.3963%
Talking Points Memo: by Joshua Micah Marshall | 74600 | 7592 | 10.1769%
lgf: anti-idiotarian | 14700 | 8275 | 56.2925%
kottke.org | 32000 | 7278 | 22.7438%
WIL WHEATON DOT NET | 16900 | 6314 | 37.3609%
Metafilter | 34500 | 7591 | 22.0029%
Doc Searls | 33600 | 5690 | 16.9345%
(In)formacao e (In)utilidade | 1780 | 6040 | 339.3258%
Wonkette | 28800 | 5877 | 20.4063%
Scripting News | 39400 | 5728 | 14.5381%
Power Line | 7510 | 7477 | 99.5606%
Balmasque | 24 | 4544 | 18933.3333%
Corante | 6770 | 7686 | 113.5303%
A list Apart | 21100 | 5536 | 26.2370%
Something Awful | 9020 | 4512 | 50.0222%
Megatokyo | 7310 | 4154 | 56.8263%
Michelle Malkin | 17300 | 6091 | 35.2081%
Arts and Letters Daily | 23900 | 3983 | 16.6653%
Gawker | 23500 | 4453 | 18.9489%
Afterall it was the best I ever had | 95 | 3591 | 3780.0000%
The Volokh Conspiracy | 42000 | 5873 | 13.9833%
Scobelizer | 21800 | 5524 | 25.3394%
Jeffrey Zeldman | 22500 | 4134 | 18.3733%
This Modern World | 32100 | 3913 | 12.1900%
The Web Standards Project | 1850 | 3810 | 205.9459%
Joel on Software | 22400 | 4514 | 20.1518%
Media Matters for America | 24800 | 6809 | 27.4556%
Television without pity | 13300 | 3859 | 29.0150%
Kuro5hin | 17300 | 4208 | 24.3237%
Lileks | 0 | 3824 | N/A
Hugh Hewitt | 26700 | 4573 | 17.1273%
Joel Veitch | 2830 | 3774 | 133.3569%
Truthout | 8780 | 6528 | 74.3508%
Baghdad Burning | 22700 | 3519 | 15.5022%
Buzz machine | 30600 | 4145 | 13.5458%
fleugel | 1890 | 3670 | 194.1799%
Informed Comment | 27900 | 3905 | 13.9964%
Doppler: redefining podcasting | 4420 | 3040 | 68.7783%
geek and proud | 355 | 3166 | 891.8310%
loadmemory (Asian site) | 83 | 3324 | 4004.8193%
Photojunkie | 1540 | 2860 | 185.7143%
Ross Rader | 1070 | 2976 | 278.1308%
The Truth Laid Bear | 23900 | 4127 | 17.2678%
Joi Ito | 23400 | 5165 | 22.0726%
ScrappleFace | 31100 | 3480 | 11.1897%
LexText | 1970 | 2671 | 135.5838%
Google Blog | 46 | 3688 | 8017.3913%
Xbox | 6600 | 4221 | 63.9545%
My life in a Bush of Ghosts | 6 | 2519 | 41983.3333%
Astronomy picture of the day | 5020 | 3498 | 69.6813%
Crooked Timber | 3560 | 3617 | 101.6011%
Vodka Pundit | 4520 | 3085 | 68.2522%
Captain’s quarter | 27100 | 3671 | 13.5461%
A small victory | 16700 | 3223 | 19.2994%
Gato Fedorento | 1630 | 2574 | 157.9141%
Mezzoblue | 12000 | 2952 | 24.6000%
PostSecret | 5790 | 2707 | 46.7530%
Samizdata.net | 1050 | 2872 | 273.5238%
Lawrence Lessig | 30600 | 2949 | 9.6373%
Counterpunch | 11700 | 3278 | 28.0171%
Democratic Underground | 14900 | 3913 | 26.2617%
Right Wing News | 27900 | 2967 | 10.6344%
StopDesign | 10200 | 3037 | 29.7745%
iBiblio | 9730 | 3105 | 31.9116%
Samizdata.net (mistake?) | 25500 | 2743 | 10.7569%
Abrupto | 550 | 2935 | 533.6364%
gene7299 (Asian MSNSpaces site) | 58 | 3215 | 5543.1034%
Where is Raed? | 10100 | 2409 | 23.8515%
B3TA: We love the web | 12000 | 2614 | 21.7833%
Talkleft | 7170 | 2901 | 40.4603%
Wizbang | 21000 | 3358 | 15.9905%
m1net (MSN spaces site) | 104 | 3548 | 3411.5385%
Hoder | 1480 | 5422 | 366.3514%
CTRL+Alt+Del | 2310 | 2315 | 100.2165%
Brad DeLong | 30100 | 2715 | 9.0199%
Blogs for Bush | 16200 | 3560 | 21.9753%
Neil Gaiman | 13700 | 2194 | 16.0146%
Gothamist | 15200 | 2729 | 17.9539%
Thought Mechanics | 4400 | 2197 | 49.9318%
IMAO | 23800 | 2905 | 12.2059%
Dan Gillmor (old weblog) | 10800 | 2600 | 24.0741%
HINAGATA | 10100 | 2186 | 21.6436%
Dean’s World | 30600 | 2985 | 9.7549%
Defamer | 9310 | 2372 | 25.4780%
USS Clueless | 8470 | 2570 | 30.3424%
Dive into Mark | 14600 | 2540 | 17.3973%
Pandagon | 27300 | 2822 | 10.3370%
Blogging.la | 3200 | 3061 | 95.6563%
Why are you worshipping the ground I blog on? | 1430 | 2238 | 156.5035%
Daring Fireball | 12000 | 2573 | 21.4417%

The last column is just a quick calculation showing what percentage of a site’s Google links was available in Technorati. From there, we’re already noticing some interesting trends. While most of the data shows Google as having a larger set of links in its index than Technorati, there are 25 rows in the table above where the Technorati count is larger than the Google one (26 if you count Lileks, for which Google returned no links at all). A quarter of the dataset is too large a share to write off as noise. How Technorati ends up getting more data than Google is something that someone might want to investigate. Beyond that, it appears that Technorati gets about 30% of the links that Google gets to a particular site, as illustrated in the chart below:
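The ratio column and the count of sites where Technorati reports more links than Google can be reproduced with a few lines of Python. This is a minimal sketch using a handful of rows copied from the table above; the function and variable names are illustrative assumptions, and treating a zero Google count as N/A (and leaving it out of the count) simply mirrors how the Lileks row is presented.

```python
# A few rows copied from the table: (site, Google links, Technorati links).
rows = [
    ("Boing Boing", 45200, 22532),
    ("InstaPundit", 75000, 15190),
    ("Davenetics", 1780, 7571),
    ("Power Line", 7510, 7477),
    ("Corante", 6770, 7686),
    ("Lileks", 0, 3824),
]


def technorati_share(google_links, technorati_links):
    """Technorati links as a percentage of Google links.

    Returns None when Google reports zero links, since the ratio
    is undefined there (the table shows N/A for that case).
    """
    if google_links == 0:
        return None
    return 100.0 * technorati_links / google_links


shares = {site: technorati_share(g, t) for site, g, t in rows}

# Sites where Technorati reports more links than Google.
# Assumption: rows with zero Google links are left out of this count.
larger_in_technorati = [site for site, g, t in rows if g > 0 and t > g]

for site, share in shares.items():
    print(site, "N/A" if share is None else f"{share:.4f}%")
print("Technorati > Google:", larger_in_technorati)
```

Running the same loop over all 100 rows is a quick way to double-check any count quoted from the table.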

technorati vs. google

The next interesting finding is that while the link count from Technorati is generally lower than Google’s, it is consistently lower. A quick analysis of the data set shows that the overall share of Technorati links compared to Google links (total over total) is not far from the same share computed from the medians. Confused by that last sentence? Don’t worry (I was too after I wrote it) and let me show you, by pulling out another data chart:

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
TOTAL | 1739867 | 479580 | 27.5642%
MEDIAN | 13500 | 3679.5 | 27.2556%

Doesn’t it all become clearer? On average, for the top 100 bloggers, Technorati holds 27.56% of the links that Google holds. Part of the reason behind this may be that Technorati only represents the blog subset of the web while Google represents linkage for the web as a whole. From here, we could gather that for every link a blog provides, other sources on the web provide roughly 3 links (closer to 2.6 if every link Technorati counts is also counted by Google). Since blogs still represent a small portion of the web, however, the importance of links in the blog world may be outpacing the importance of links in the non-blog world. Part of the reason behind this could be that links are one of the big currencies in the web space and many blogs offer little content but are heavy on the linking. Even when an average blog entry is under 300 words, it often contains at least one link. This could mean that Technorati and other blog search engines are right to consider links a strong measurement, but it may also show that blogs, as a medium, are not providing that much content beyond linking.
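The arithmetic behind those figures can be sketched in Python from the TOTAL and MEDIAN rows. The variable names are illustrative, and the last step rests on an assumption the data cannot confirm: that every link Technorati counts is also counted by Google, so that the remainder can be attributed to non-blog sources.

```python
# Values copied from the TOTAL and MEDIAN rows above.
total_google = 1739867
total_technorati = 479580
median_google = 13500
median_technorati = 3679.5

# Share of Google's links that Technorati holds, computed two ways.
share_of_totals = 100.0 * total_technorati / total_google
share_of_medians = 100.0 * median_technorati / median_google

# Assumption: Technorati's links are a subset of Google's, so whatever
# is left over comes from non-blog sources.
non_blog_links = total_google - total_technorati
non_blog_per_blog_link = non_blog_links / total_technorati

print(f"share of totals:  {share_of_totals:.4f}%")
print(f"share of medians: {share_of_medians:.4f}%")
print(f"non-blog links per blog link: {non_blog_per_blog_link:.2f}")
```

The two shares land within half a percentage point of each other, which is the consistency the paragraph describes.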

However, it gets even more interesting if you dig in. Looking at the data, these overall values are actually misleading: the match is not truly egalitarian across the list. A quick review of the distribution by rank shows some interesting trends.

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
AVERAGE TOP 10 | 43858 | 12186.1 | 27.7854%
AVERAGE TOP 25 | 30397.6 | 8733.36 | 28.7304%
AVERAGE TOP 50 | 23127.06 | 6534.36 | 28.2542%
AVERAGE BOTTOM 50 | 11443.07843 | 3057.24 | 26.7169%
AVERAGE BOTTOM 25 | 11980.07692 | 2834.884615 | 23.6633%
AVERAGE BOTTOM 10 | 13782.72727 | 2622.909091 | 19.0304%

Let’s graph the Technorati links as a percentage of Google links to see a little more of what I’m inferring:

technorati vs. google: averages

Looking at this, it seems that our friends at Technorati have a bias. On average, Technorati’s share of Google’s links is more than 8 percentage points higher for blogs in the top 10 than for blogs in the bottom 10 (27.79% versus 19.03%). Considering that Google already admits to some level of bias in its system (part of the foundation for PageRank is that sites with higher PageRanks get indexed more often), this is a bit worrisome, especially if the trend holds across the whole of Technorati’s universe. If Google favors indexing more popular sites more often, a clear opportunity for world-live-web search engines like Technorati would be in the long tail of less-often-indexed sites, but Technorati seems to ignore that opportunity and concentrate on the top sites. What that translates into is a direct reproduction of power laws when it comes to the indexing of blogs.
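To make the size of that gap concrete, here is a small Python sketch that recomputes the Technorati/Google share for each tier from the averages table and compares the top 10 with the bottom 10. The dictionary layout and names are illustrative assumptions; the numbers are copied from the table as published.

```python
# (average Google links, average Technorati links) per tier, from the table.
tiers = {
    "top 10": (43858, 12186.1),
    "top 25": (30397.6, 8733.36),
    "top 50": (23127.06, 6534.36),
    "bottom 50": (11443.07843, 3057.24),
    "bottom 25": (11980.07692, 2834.884615),
    "bottom 10": (13782.72727, 2622.909091),
}

# Technorati links as a percentage of Google links for each tier.
share = {name: 100.0 * t / g for name, (g, t) in tiers.items()}

# Gap between the two extremes, in percentage points and in relative terms.
gap_points = share["top 10"] - share["bottom 10"]
relative_gap = share["top 10"] / share["bottom 10"] - 1.0

for name, value in share.items():
    print(f"{name:>9}: {value:.4f}%")
print(f"top 10 vs bottom 10: {gap_points:.2f} points, {relative_gap:.0%} higher")
```

Expressed in percentage points the difference is under 9, but relative to the bottom 10 the top 10 share is close to half again as large, so the two ways of stating the gap should not be confused.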

But is that true of Google vs Technorati only? Or do the same rules apply for other search engines? We’ll look at that in the next entry.
