Secrets of the A-list bloggers: Technorati vs. Google

Looking at data about the A-list, first on its own and later as part of a wider scope, made me wonder about the initial data set I was using. What does it mean to be on the Technorati 100? Is Technorati presenting an accurate representation of the world? And how does it compare against the wider world? So I decided to start gathering data from other search engines. In this entry, I will go into a more detailed analysis of that information and will attempt to answer some of the questions I raised above.

Source Check

In my initial review, I noticed that Technorati was ranking sites based on sources. However, the major search engines do not really expose incoming and outgoing source data. So, for this particular investigation, I decided to set source data aside and focus on link data, which I gathered from the three largest search engines: Google, Yahoo! and MSN (that last one was included at the last minute just because I knew that Robert Scoble would complain about the study being biased if I didn’t include MSN).

Picking three search engines was also interesting because it provided some sort of reference check. If one of the engines did not line up with the other two, we could point to a potential flaw in that engine instead of trying to understand why the data was wrong.

Having picked that data set, I started gathering the data. Let me say that it’s a lot of information and, should I try to do this again in the future, writing software to gather the information will probably be less time-consuming than trying to get it by hand.

But enough about the process, let’s get into the numbers.

Technorati vs. Google

So the first dataset I created was a comparative index of Technorati and Google. The set was created by grabbing the number of links to a site in Google and getting the equivalent value for Technorati. The result looked like this:

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
Boing Boing | 45200 | 22532 | 49.8496%
InstaPundit | 75000 | 15190 | 20.2533%
Daily Kos | 59800 | 15833 | 26.4766%
Gizmodo | 39300 | 12278 | 31.2417%
Fark | 43600 | 10216 | 23.4312%
EnGadget | 46800 | 15051 | 32.1603%
Davenetics | 1780 | 7571 | 425.3371%
Eschaton | 62400 | 8713 | 13.9631%
Dooce | 23600 | 6797 | 28.8008%
Andrew Sullivan | 41100 | 7680 | 18.6861%
The Best Page In The Universe | 656 | 6333 | 965.3963%
Talking Points Memo: by Joshua Micah Marshall | 74600 | 7592 | 10.1769%
lgf: anti-idiotarian | 14700 | 8275 | 56.2925%
kottke.org | 32000 | 7278 | 22.7438%
WIL WHEATON DOT NET | 16900 | 6314 | 37.3609%
Metafilter | 34500 | 7591 | 22.0029%
Doc Searls | 33600 | 5690 | 16.9345%
(In)formacao e (In)utilidade | 1780 | 6040 | 339.3258%
Wonkette | 28800 | 5877 | 20.4063%
Scripting News | 39400 | 5728 | 14.5381%
Power Line | 7510 | 7477 | 99.5606%
Balmasque | 24 | 4544 | 18933.3333%
Corante | 6770 | 7686 | 113.5303%
A list Apart | 21100 | 5536 | 26.2370%
Something Awful | 9020 | 4512 | 50.0222%
Megatokyo | 7310 | 4154 | 56.8263%
Michelle Malkin | 17300 | 6091 | 35.2081%
Arts and Letters Daily | 23900 | 3983 | 16.6653%
Gawker | 23500 | 4453 | 18.9489%
Afterall it was the best I ever had | 95 | 3591 | 3780.0000%
The Volokh Conspiracy | 42000 | 5873 | 13.9833%
Scobleizer | 21800 | 5524 | 25.3394%
Jeffrey Zeldman | 22500 | 4134 | 18.3733%
This Modern World | 32100 | 3913 | 12.1900%
The Web Standards Project | 1850 | 3810 | 205.9459%
Joel on Software | 22400 | 4514 | 20.1518%
Media Matters for America | 24800 | 6809 | 27.4556%
Television without pity | 13300 | 3859 | 29.0150%
Kuro5hin | 17300 | 4208 | 24.3237%
Lileks | 0 | 3824 | N/A
Hugh Hewitt | 26700 | 4573 | 17.1273%
Joel Veitch | 2830 | 3774 | 133.3569%
Truthout | 8780 | 6528 | 74.3508%
Baghdad Burning | 22700 | 3519 | 15.5022%
Buzz machine | 30600 | 4145 | 13.5458%
fleugel | 1890 | 3670 | 194.1799%
Informed Comment | 27900 | 3905 | 13.9964%
Doppler: redefining podcasting | 4420 | 3040 | 68.7783%
geek and proud | 355 | 3166 | 891.8310%
loadmemory (Asian site) | 83 | 3324 | 4004.8193%
Photojunkie | 1540 | 2860 | 185.7143%
Ross Rader | 1070 | 2976 | 278.1308%
The Truth Laid Bear | 23900 | 4127 | 17.2678%
Joi Ito | 23400 | 5165 | 22.0726%
ScrappleFace | 31100 | 3480 | 11.1897%
LexText | 1970 | 2671 | 135.5838%
Google Blog | 46 | 3688 | 8017.3913%
Xbox | 6600 | 4221 | 63.9545%
My life in a Bush of Ghosts | 6 | 2519 | 41983.3333%
Astronomy picture of the day | 5020 | 3498 | 69.6813%
Crooked Timber | 3560 | 3617 | 101.6011%
Vodka Pundit | 4520 | 3085 | 68.2522%
Captain’s quarter | 27100 | 3671 | 13.5461%
A small victory | 16700 | 3223 | 19.2994%
Gato Fedorento | 1630 | 2574 | 157.9141%
Mezzoblue | 12000 | 2952 | 24.6000%
PostSecret | 5790 | 2707 | 46.7530%
Samizdata.net | 1050 | 2872 | 273.5238%
Lawrence Lessig | 30600 | 2949 | 9.6373%
Counterpunch | 11700 | 3278 | 28.0171%
Democratic Underground | 14900 | 3913 | 26.2617%
Right Wing News | 27900 | 2967 | 10.6344%
StopDesign | 10200 | 3037 | 29.7745%
iBiblio | 9730 | 3105 | 31.9116%
Samizdata.net (mistake?) | 25500 | 2743 | 10.7569%
Abrupto | 550 | 2935 | 533.6364%
gene7299 (Asian MSNSpaces site) | 58 | 3215 | 5543.1034%
Where is Raed? | 10100 | 2409 | 23.8515%
B3TA: We love the web | 12000 | 2614 | 21.7833%
Talkleft | 7170 | 2901 | 40.4603%
Wizbang | 21000 | 3358 | 15.9905%
m1net (MSN spaces site) | 104 | 3548 | 3411.5385%
Hoder | 1480 | 5422 | 366.3514%
CTRL+Alt+Del | 2310 | 2315 | 100.2165%
Brad DeLong | 30100 | 2715 | 9.0199%
Blogs for Bush | 16200 | 3560 | 21.9753%
Neil Gaiman | 13700 | 2194 | 16.0146%
Gothamist | 15200 | 2729 | 17.9539%
Thought Mechanics | 4400 | 2197 | 49.9318%
IMAO | 23800 | 2905 | 12.2059%
Dan Gillmor (old weblog) | 10800 | 2600 | 24.0741%
HINAGATA | 10100 | 2186 | 21.6436%
Dean’s World | 30600 | 2985 | 9.7549%
Defamer | 9310 | 2372 | 25.4780%
USS Clueless | 8470 | 2570 | 30.3424%
Dive into Mark | 14600 | 2540 | 17.3973%
Pandagon | 27300 | 2822 | 10.3370%
Blogging.la | 3200 | 3061 | 95.6563%
Why are you worshipping the ground I blog on? | 1430 | 2238 | 156.5035%
Daring Fireball | 12000 | 2573 | 21.4417%

The last column is just a quick calculation showing what percentage of Google’s links was available in Technorati. From there, we’re already noticing some interesting trends. While most of the data shows Google as having a larger set of links in its index than Technorati, there are 26 cases where the Technorati count is larger than the Google one (25 with a ratio above 100%, plus Lileks, for which Google reported no links at all). That is about a quarter of the dataset, too large a share to dismiss as noise. How Technorati ends up getting more data than Google is something that someone might want to investigate. Beyond that, it appears that Technorati gets about 30% of the links that Google gets to a particular site, as illustrated in the chart below:

[Chart: Technorati vs. Google]
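For anyone who wants to redo the ratio column, here is a minimal sketch in Python. It uses only a handful of rows copied from the table above. The helper name `ratio_pct` and the choice to return `None` when Google reports zero links (the Lileks row) are illustrative assumptions, not the way the original numbers were produced.

```python
# A few rows copied from the table: (blog, google_links, technorati_links).
rows = [
    ("Boing Boing", 45200, 22532),
    ("InstaPundit", 75000, 15190),
    ("Davenetics", 1780, 7571),
    ("Power Line", 7510, 7477),
    ("Lileks", 0, 3824),
]


def ratio_pct(google, technorati):
    """Technorati links as a percentage of Google links.

    Returns None when Google reports zero links (hypothetical convention
    standing in for the table's "N/A").
    """
    if google == 0:
        return None
    return 100.0 * technorati / google


for name, google, technorati in rows:
    pct = ratio_pct(google, technorati)
    label = "N/A" if pct is None else f"{pct:.4f}%"
    flag = " (Technorati > Google)" if technorati > google else ""
    print(f"{name}: {label}{flag}")

# Blogs where Technorati reports more links than Google does.
technorati_larger = [name for name, g, t in rows if t > g]
print(len(technorati_larger), "of", len(rows), "sample rows")
```

Running the same comparison over all 100 rows is how you would count the cases where Technorati comes out ahead.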

The next interesting finding is that while the link count from Technorati is generally lower than Google’s, it is consistently so. A quick analysis of the data set shows that the overall percentage of Technorati links compared to Google links is not that far from the median. Confused by that last sentence? Don’t worry (I was too after I wrote it) and let me show you by pulling out another data table:

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
TOTAL | 1739867 | 479580 | 27.5642%
MEDIAN | 13500 | 3679.5 | 27.2556%

Doesn’t it all become clearer? Taken together, for the top 100 bloggers, Technorati holds 27.56% of the links that Google holds. Part of the reason behind this may be that Technorati only represents the blog subset of the web while Google represents linkage for the web as a whole. From here, we could gather that for every link a blog provides, other sources on the web provide roughly 3 links. Since blogs still represent a small portion of the web, however, the importance of links in the blog world may be outpacing the importance of links in the non-blog world. Part of the reason behind this could be that links are one of the big currencies in the web space and many blogs offer little content but are heavy on the linking. The average blog entry may be under 300 words, yet it often contains at least one link. This could mean that Technorati and other blog search engines are right to consider links a strong measurement, but it may also show that blogs, as a medium, are not providing that much content beyond linking.
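The difference between a ratio of totals, an average of per-blog ratios, and a median matters here, so a small sketch may help. The five sample rows come from the table above, and the totals are the published TOTAL row. The variable names are illustrative, and the last calculation assumes that every link Technorati counts is also counted by Google, which the data does not guarantee.

```python
from statistics import mean, median

# Sample rows from the table: (google_links, technorati_links).
sample = [
    (45200, 22532),  # Boing Boing
    (75000, 15190),  # InstaPundit
    (1780, 7571),    # Davenetics, an outlier with more Technorati links
    (62400, 8713),   # Eschaton
    (23600, 6797),   # Dooce
]

# Three ways to summarize the same rows.
ratio_of_totals = sum(t for _, t in sample) / sum(g for g, _ in sample)
mean_of_ratios = mean(t / g for g, t in sample)
median_of_ratios = median(t / g for g, t in sample)

print(f"ratio of totals:  {ratio_of_totals:.2%}")
print(f"mean of ratios:   {mean_of_ratios:.2%}")
print(f"median of ratios: {median_of_ratios:.2%}")

# Published totals for all 100 blogs (TOTAL row).
TOTAL_GOOGLE = 1739867
TOTAL_TECHNORATI = 479580
blog_share = TOTAL_TECHNORATI / TOTAL_GOOGLE

# Assumption: Technorati's links are a subset of Google's.
non_blog_links_per_blog_link = (1 - blog_share) / blog_share
print(f"blog share of Google links: {blog_share:.2%}")
print(f"non-blog links per blog link: {non_blog_links_per_blog_link:.2f}")
```

A single outlier such as Davenetics pulls the mean of ratios far above the other two figures, which is why the TOTAL and MEDIAN rows are the more useful summaries for this data.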

However, it gets even more interesting if you dig in. Looking at the data, these values are actually misleading. What is happening is not truly an egalitarian match. Doing a quick review of the distribution, we start seeing some interesting trends.

Technorati Top 100 | Google Links | Technorati Links | Technorati/Google
AVERAGE TOP 10 | 43858 | 12186.1 | 27.7854%
AVERAGE TOP 25 | 30397.6 | 8733.36 | 28.7304%
AVERAGE TOP 50 | 23127.06 | 6534.36 | 28.2542%
AVERAGE BOTTOM 50 | 11443.07843 | 3057.24 | 26.7169%
AVERAGE BOTTOM 25 | 11980.07692 | 2834.884615 | 23.6633%
AVERAGE BOTTOM 10 | 13782.72727 | 2622.909091 | 19.0304%

Let’s graph the Technorati links as percentage of Google to see a little more of what I’m inferring:

[Chart: Technorati vs. Google, averages]

Looking at this, it seems that our friends at Technorati have a bias. On average, Technorati’s link count for a top 10 blog is 27.8% of Google’s, against 19.0% for a bottom 10 blog, a gap of more than 8 percentage points. Considering that Google already admits to some level of bias in its system (part of the foundation for PageRank is that sites with higher PageRanks get indexed more often), this is a bit worrisome, especially if the trend holds across the whole of Technorati’s universe. If Google favors indexing more popular sites more often, a clear opportunity for world-live-web search engines like Technorati would be in the long tail of less-often-indexed sites, but Technorati seems to ignore that opportunity and concentrate on the top sites. What that will translate into is a direct reproduction of the power laws when it comes to indexing of blogs.
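The gap between the top and the bottom of the list can be checked with a short sketch. The group averages below are copied from the averages table; the names `group_averages` and `share_pct` are illustrative.

```python
# Group averages as published: (google_avg, technorati_avg).
group_averages = {
    "top 10": (43858, 12186.1),
    "top 25": (30397.6, 8733.36),
    "top 50": (23127.06, 6534.36),
    "bottom 50": (11443.07843, 3057.24),
    "bottom 25": (11980.07692, 2834.884615),
    "bottom 10": (13782.72727, 2622.909091),
}


def share_pct(google_avg, technorati_avg):
    """Technorati's average link count as a percentage of Google's."""
    return 100.0 * technorati_avg / google_avg


shares = {name: share_pct(g, t) for name, (g, t) in group_averages.items()}
for name, pct in shares.items():
    print(f"{name:>9}: {pct:.2f}%")

# Difference between the two extremes, in percentage points.
gap_points = shares["top 10"] - shares["bottom 10"]
print(f"top 10 minus bottom 10: {gap_points:.2f} percentage points")
```

The gap is expressed in percentage points rather than as a relative percentage, since the two figures being compared are already percentages.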

But is that true of Google vs Technorati only? Or do the same rules apply for other search engines? We’ll look at that in the next entry.
