The state of HTML validation

There’s been a lot of talk about HTML5 recently and, in some geek circles, there have been snickers when companies have done a poor job of implementing it. But what is the true state of HTML5? To find out, I decided to check whether the top sites on the internet had implemented it and how successful they were in doing so.

Methodology

One of the first things to do in this effort was to get a decent list of sites. Unfortunately, it has become increasingly difficult to get a sense of which sites are the most popular by number of visits. I eventually settled on Alexa’s Top Sites list because it featured most of the sites people think of as large sites and includes a few non-US sites.

I then used the W3C Validator against each of the top 25 sites. This allowed me to get 3 different pieces of information:

  • Doctype: This is what the site declares as its HTML code version. In other words, how the site identifies what version of HTML it supports.
  • Encoding: This is the character encoding the site declares for its pages, which gives us a better understanding as to whether they are targeting a particular language or trying to offer a global site.
  • Validation: This is how the site validated when tested for errors relating to the HTML version it purported to be offering. It gives us an idea as to how compliant with the standards the site truly is.
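
The validator reports the doctype and encoding for you, but both can also be read from the top of a page's markup. Here is a minimal sketch of that idea, not a description of how the W3C Validator works internally. The function names `detect_doctype` and `detect_charset` and the label mapping are my own assumptions for illustration, and the sketch ignores the HTTP `Content-Type` header, which can also declare the encoding.

```python
import re

# Assumed mapping from doctype public identifiers to the labels used in this post.
PUBLIC_ID_LABELS = {
    "-//W3C//DTD HTML 4.0//EN": "HTML 4.0 Strict",
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01 Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML 4.01 Transitional",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
}

DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+html\b([^>]*)>", re.IGNORECASE)
PUBLIC_RE = re.compile(r'PUBLIC\s+"([^"]+)"', re.IGNORECASE)
CHARSET_RE = re.compile(r"<meta[^>]*charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def detect_doctype(markup):
    """Return a label for the doctype declared in the markup."""
    match = DOCTYPE_RE.search(markup)
    if match is None:
        return "none declared"
    rest = match.group(1).strip()
    if not rest:
        # A bare <!DOCTYPE html> is the HTML 5 doctype.
        return "HTML 5"
    public = PUBLIC_RE.search(rest)
    if public is None:
        return "unknown"
    return PUBLIC_ID_LABELS.get(public.group(1), "unknown")


def detect_charset(markup):
    """Return the charset named in a <meta> tag, lowercased, or None."""
    match = CHARSET_RE.search(markup)
    return match.group(1).lower() if match else None


# Made-up sample markup, not copied from any of the sites tested.
sample = '<!DOCTYPE html><html><head><meta charset="utf-8"></head></html>'
print(detect_doctype(sample), detect_charset(sample))
```

A regular expression is enough for a quick survey like this one, but it is not an HTML parser and will miss unusual declarations.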

Surprisingly, a number of popular Web 2.0 sites were not in Alexa’s Top 25 so I created a separate list for them.

Top 25

Looking at the top 25, here are the results:

| Name | Doctype | Encoding | Validation |
|---|---|---|---|
| Google | HTML 5 | iso-8859-1 | 37 errors, 3 warnings |
| Facebook | HTML 5 | utf-8 | 34 errors |
| YouTube | HTML 5 | utf-8 | 120 errors, 2 warnings |
| Yahoo! | HTML 5 | utf-8 | 144 errors, 8 warnings |
| Blogger | HTML 4.0 Strict | utf-8 | 34 errors, 45 warnings |
| Baidu | HTML 5 | gb2312 | 6 errors, 6 warnings |
| Wikipedia | HTML 5 | utf-8 | 5 errors, 1 warning |
| Windows Live | HTML 4.01 Transitional | utf-8 | 33 errors, 17 warnings |
| Twitter | HTML 5 | utf-8 | 5 errors, 1 warning |
| QQ.com | XHTML 1.0 Transitional | gb2312 | validator crashed |
| MSN | XHTML 1.0 Strict | utf-8 | Completely valid |
| Yahoo Japan | HTML 4.01 Transitional | utf-8 | 26 errors, 24 warnings |
| LinkedIn | HTML 5 | utf-8 | 12 errors, 1 warning |
| Google India | HTML 5 | iso-8859-1 | 40 errors, 2 warnings |
| Amazon | HTML 4.01 Transitional | iso-8859-1 | 516 errors, 125 warnings |
| Sina.com.cn | XHTML 1.0 Transitional | gb2312 | validator crashed |
| Taobao.com | HTML 5 | gb2312 | validator crashed |
| WordPress | XHTML 1.0 Transitional | utf-8 | 4 errors |
| Google HK | HTML 5 | Big5 | 40 errors, 1 warning |
| Google Germany | HTML 5 | iso-8859-1 | 37 errors, 3 warnings |
| Ebay | HTML 4.01 Transitional | utf-8 | 386 errors, 19 warnings |
| Yandex | HTML 4.01 Transitional | utf-8 | 52 errors, 12 warnings |
| Google UK | HTML 5 | iso-8859-1 | 37 errors, 3 warnings |
| Google Japan | HTML 5 | shift_jis | 39 errors, 1 warning |
| Bing | XHTML 1.0 Transitional | utf-8 | 16 errors |

Looking at the data, the first interesting thing is how many sites have made the switch to HTML 5: 14 of the top 25. This means that in the last year, 56 percent of the largest sites on the internet have updated their code base to target a new standard. Six sites are still on the older HTML 4 standard and five are sticking to the somewhat more recent XHTML standard.
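
These shares are simple tallies over the Doctype column. As a quick sketch of the arithmetic, with the column values copied from the top 25 results in order and the helper name `family` chosen purely for illustration:

```python
from collections import Counter

# Doctype column of the top 25 results, in the order the sites are listed.
doctypes = [
    "HTML 5", "HTML 5", "HTML 5", "HTML 5", "HTML 4.0 Strict",
    "HTML 5", "HTML 5", "HTML 4.01 Transitional", "HTML 5",
    "XHTML 1.0 Transitional", "XHTML 1.0 Strict", "HTML 4.01 Transitional",
    "HTML 5", "HTML 5", "HTML 4.01 Transitional", "XHTML 1.0 Transitional",
    "HTML 5", "XHTML 1.0 Transitional", "HTML 5", "HTML 5",
    "HTML 4.01 Transitional", "HTML 4.01 Transitional", "HTML 5", "HTML 5",
    "XHTML 1.0 Transitional",
]


def family(doctype):
    """Group a doctype label into HTML 5, XHTML, or HTML 4."""
    if doctype == "HTML 5":
        return "HTML 5"
    if doctype.startswith("XHTML"):
        return "XHTML"
    return "HTML 4"


counts = Counter(family(d) for d in doctypes)
shares = {name: round(100 * n / len(doctypes)) for name, n in counts.items()}

print(counts)  # 14 HTML 5, 6 HTML 4, 5 XHTML
print(shares)  # 56, 24 and 20 percent respectively
```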

However, it is also interesting to note that none of the sites which have made the transition comply with proper HTML standards. In fact, of the top 25 sites in the Alexa list, only MSN was found to provide completely valid code. Maybe the MSN team could share its approach with Microsoft’s other properties, since Windows Live and Bing both failed validation. Amazon was the worst offender, with 516 errors in its code, which suggests that disregard for standards compliance does not seem to have an impact on economic performance. However, Ebay and Yahoo came close behind with 386 and 144 errors respectively, so Amazon may simply be an exception.
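
To rank sites by error count, the Validation column has to be turned into numbers first. The following is a rough sketch that covers only a handful of rows from the table. The function name `parse_validation` is hypothetical, and treating "Completely valid" as zero errors and a crash as "no result" are my own assumptions.

```python
import re

# A few Validation entries copied from the top 25 results.
results = {
    "Amazon": "516 errors, 125 warnings",
    "Ebay": "386 errors, 19 warnings",
    "Yahoo!": "144 errors, 8 warnings",
    "WordPress": "4 errors",
    "MSN": "Completely valid",
    "QQ.com": "validator crashed",
}


def parse_validation(text):
    """Return (errors, warnings), or None when the validator crashed."""
    if text == "validator crashed":
        return None
    errors = re.search(r"(\d+) errors?", text)
    warnings = re.search(r"(\d+) warnings?", text)
    return (
        int(errors.group(1)) if errors else 0,
        int(warnings.group(1)) if warnings else 0,
    )


# Keep only the sites that produced a result, then sort by error count.
parsed = {name: parse_validation(text) for name, text in results.items()}
scored = {name: value for name, value in parsed.items() if value is not None}
worst_first = sorted(scored, key=lambda name: scored[name][0], reverse=True)

print(worst_first)
```

Sites where the validator crashed are left out of the ranking rather than counted as zero, since no error count exists for them.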

Another interesting phenomenon is that most of the large sites have adopted UTF-8, the encoding that supports the most languages, as their default encoding. Once again, over half (56%) of the sites have switched, with Amazon and Google being among the rare exceptions. An interesting aside here is that the W3C Validator may have issues when it comes to validating Chinese sites: it crashed on QQ.com, Sina.com.cn and Taobao.com, all of which use gb2312.

Web 2.0 Companies

Looking at Web 2.0 companies, the data was surprising:

| Name | Doctype | Encoding | Validation |
|---|---|---|---|
| Facebook | HTML 5 | utf-8 | 34 errors |
| YouTube | HTML 5 | utf-8 | 120 errors, 2 warnings |
| Blogger | HTML 4.0 Strict | utf-8 | 34 errors, 45 warnings |
| Twitter | HTML 5 | utf-8 | 5 errors, 1 warning |
| LinkedIn | HTML 5 | utf-8 | 12 errors, 1 warning |
| WordPress | XHTML 1.0 Transitional | utf-8 | 4 errors |
| Flickr | HTML 5 | utf-8 | 15 errors, 3 warnings |
| Tumblr | XHTML 1.0 Transitional | utf-8 | 19 errors |
| Foursquare | XHTML 1.0 Strict | utf-8 | 40 errors |
| Groupon | XHTML 1.0 Transitional | utf-8 | 6 errors |
| Zynga | XHTML 1.0 Transitional | utf-8 | 4 errors, 6 warnings |

I added data for popular companies outside the top 25, and a few interesting trends pop up. The first surprise is that fewer of these sites have made the transition to HTML 5: only 5 out of 11 (or 45 percent). There still seems to be a strong preference for XHTML, which 5 of the 11 sites declare.

Also of note is that all of these sites appear to have plans for globalization, encoding their pages in UTF-8, which can support both western and non-western alphabets.

However, none of the sites validates successfully against its declared standard. It looks like there is still much room for improvement in the world of HTML validation.
