The state of HTML validation

There’s been a lot of talk about HTML5 recently and, in some geek circles, there have been snickers when companies have done a poor job of implementing it. But what is the true state of HTML5? To find out, I decided to check whether the top sites on the internet had implemented it and how successful they were in doing so.

Methodology

One of the first things to do in this effort was to get a decent list of sites. Unfortunately, it has become increasingly difficult to get a sense of which sites are the most popular by number of visits. I eventually settled on Alexa’s Top Sites list because it features most of the sites people think of when considering large sites and includes a few non-US sites.

I then used the W3C Validator against each of the top 25 sites. This allowed me to get 3 different pieces of information:

  • Doctype: This is what the site declares as its HTML code version. In other words, how the site identifies what version of HTML it supports.
  • Encoding: This is the character encoding the site declares for its pages. It gives us a better understanding as to whether they are targeting a particular language or trying to offer a global site.
  • Validation: This is how the site validated when tested for errors relating to the HTML version it purported to be offering. It gives us an idea as to how compliant with the standards the site truly is.
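To make the first two items concrete, here is a minimal sketch of how a doctype and a declared encoding could be read out of a page's markup. It is not how the W3C Validator works internally; the helper names `detect_doctype` and `detect_charset` are hypothetical, and the sketch assumes the encoding is declared in a meta tag (a real check would also look at the HTTP Content-Type header).

```python
import re


def detect_doctype(html: str) -> str:
    """Return a short label for the doctype declared in the markup."""
    match = re.search(r"<!DOCTYPE\s+([^>]*)>", html, re.IGNORECASE)
    if match is None:
        return "none declared"
    declaration = match.group(1).strip()
    # The HTML 5 doctype is simply <!DOCTYPE html>
    if declaration.lower() == "html":
        return "HTML 5"
    # Older doctypes carry a public identifier such as
    # "-//W3C//DTD XHTML 1.0 Strict//EN"
    public_id = re.search(r"-//W3C//DTD\s+(.+?)//", declaration)
    return public_id.group(1) if public_id else "unknown"


def detect_charset(html: str) -> str:
    """Return the charset named in a meta tag, lowercased."""
    match = re.search(r"""charset\s*=\s*["']?([\w-]+)""", html, re.IGNORECASE)
    return match.group(1).lower() if match else "not declared"


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
)
print(detect_doctype(page), "/", detect_charset(page))
```

The third item, validation, is the part that really needs the validator: it checks the whole document against the rules of whichever standard the doctype names.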

Surprisingly, a number of popular Web 2.0 sites were not in Alexa’s Top 25 so I created a separate list for them.

Top 25

Looking at the top 25, here are the results:

Name | Doctype | Encoding | Validation
Google | HTML 5 | iso-8859-1 | 37 errors, 3 warnings
Facebook | HTML 5 | utf-8 | 34 errors
YouTube | HTML 5 | utf-8 | 120 errors, 2 warnings
Yahoo! | HTML 5 | utf-8 | 144 errors, 8 warnings
Blogger | HTML 4.0 Strict | utf-8 | 34 errors, 45 warnings
Baidu | HTML 5 | gb2312 | 6 errors, 6 warnings
Wikipedia | HTML 5 | utf-8 | 5 errors, 1 warning
Windows Live | HTML 4.01 Transitional | utf-8 | 33 errors, 17 warnings
Twitter | HTML 5 | utf-8 | 5 errors, 1 warning
QQ.com | XHTML 1.0 Transitional | gb2312 | validator crashed
MSN | XHTML 1.0 Strict | utf-8 | Completely valid
Yahoo Japan | HTML 4.01 Transitional | utf-8 | 26 errors, 24 warnings
LinkedIn | HTML 5 | utf-8 | 12 errors, 1 warning
Google India | HTML 5 | iso-8859-1 | 40 errors, 2 warnings
Amazon | HTML 4.01 Transitional | iso-8859-1 | 516 errors, 125 warnings
Sina.com.cn | XHTML 1.0 Transitional | gb2312 | validator crashed
Taobao.com | HTML 5 | gb2312 | validator crashed
WordPress | XHTML 1.0 Transitional | utf-8 | 4 errors
Google HK | HTML 5 | Big5 | 40 errors, 1 warning
Google Germany | HTML 5 | iso-8859-1 | 37 errors, 3 warnings
Ebay | HTML 4.01 Transitional | utf-8 | 386 errors, 19 warnings
Yandex | HTML 4.01 Transitional | utf-8 | 52 errors, 12 warnings
Google UK | HTML 5 | iso-8859-1 | 37 errors, 3 warnings
Google Japan | HTML 5 | shift_jis | 39 errors, 1 warning
Bing | XHTML 1.0 Transitional | utf-8 | 16 errors

Looking at the data, the first interesting thing is how many sites have made the switch to HTML 5: 14 of the top 25. This means that, in the last year, 56 percent of the largest sites on the internet have changed their pages to declare the new standard. Six sites are still on the older HTML 4 standard and five are sticking with the somewhat more recent XHTML standard.
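For readers who want to check the arithmetic, the split can be recomputed from the doctype column of the top 25 table. This is only a sketch: the grouping into three families and the `family` helper are my own shorthand for illustration, not something the validator reports.

```python
from collections import Counter

# Doctype column of the top 25 table, keyed by site name
doctypes = {
    "Google": "HTML 5", "Facebook": "HTML 5", "YouTube": "HTML 5",
    "Yahoo!": "HTML 5", "Blogger": "HTML 4.0 Strict", "Baidu": "HTML 5",
    "Wikipedia": "HTML 5", "Windows Live": "HTML 4.01 Transitional",
    "Twitter": "HTML 5", "QQ.com": "XHTML 1.0 Transitional",
    "MSN": "XHTML 1.0 Strict", "Yahoo Japan": "HTML 4.01 Transitional",
    "LinkedIn": "HTML 5", "Google India": "HTML 5",
    "Amazon": "HTML 4.01 Transitional",
    "Sina.com.cn": "XHTML 1.0 Transitional", "Taobao.com": "HTML 5",
    "WordPress": "XHTML 1.0 Transitional", "Google HK": "HTML 5",
    "Google Germany": "HTML 5", "Ebay": "HTML 4.01 Transitional",
    "Yandex": "HTML 4.01 Transitional", "Google UK": "HTML 5",
    "Google Japan": "HTML 5", "Bing": "XHTML 1.0 Transitional",
}


def family(doctype: str) -> str:
    """Collapse a full doctype label into one of three families."""
    if doctype.startswith("XHTML"):
        return "XHTML"
    if doctype == "HTML 5":
        return "HTML 5"
    return "HTML 4"


counts = Counter(family(d) for d in doctypes.values())
share = {name: 100 * n / len(doctypes) for name, n in counts.items()}

for name, n in counts.most_common():
    print(f"{name}: {n} sites ({share[name]:.0f}%)")
```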

However, it is also interesting to note that none of the sites which have made the transition comply with proper HTML standards. In fact, of the top 25 sites in the Alexa list, only MSN was found to provide completely valid code. Maybe Microsoft could point those people towards their other properties. Amazon was the worst offender, with 516 errors in their code, showing that disregard for standards compliance does not seem to have an impact on economic performance. However, Ebay (386 errors) and Yahoo (144 errors) came close behind, maybe highlighting Amazon as an exception.

Another interesting phenomenon is that most of the large sites have adopted UTF-8, the encoding that supports the most languages, as their default encoding. Once again, over half (56%) of the sites have switched, with Amazon and Google being among the notable exceptions. An interesting aside here is that the W3C validator may have issues when it comes to validating Chinese sites: it crashed on three of the four sites declaring gb2312 (QQ.com, Sina.com.cn and Taobao.com).
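Why the encoding column says something about a site's reach can be shown with a small sketch. The `can_encode` helper is hypothetical; it simply asks Python's built-in codecs whether a piece of text can be represented in a given encoding.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


samples = {
    "French": "café",
    "Chinese": "你好",
    "Hebrew": "שלום",
}

# Encodings taken from the table above
for encoding in ("utf-8", "iso-8859-1", "gb2312"):
    supported = [lang for lang, text in samples.items()
                 if can_encode(text, encoding)]
    print(encoding, "->", ", ".join(supported))
```

UTF-8 can represent all three samples, while iso-8859-1 cannot hold the Chinese or Hebrew text and gb2312 cannot hold the Hebrew text. A site that declares one of the regional encodings is, in effect, committing to a narrower set of scripts.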

Web 2.0 Companies

Looking at Web 2.0 companies, the data was surprising:

Name | Doctype | Encoding | Validation
Facebook | HTML 5 | utf-8 | 34 errors
YouTube | HTML 5 | utf-8 | 120 errors, 2 warnings
Blogger | HTML 4.0 Strict | utf-8 | 34 errors, 45 warnings
Twitter | HTML 5 | utf-8 | 5 errors, 1 warning
LinkedIn | HTML 5 | utf-8 | 12 errors, 1 warning
WordPress | XHTML 1.0 Transitional | utf-8 | 4 errors
Flickr | HTML 5 | utf-8 | 15 errors, 3 warnings
Tumblr | XHTML 1.0 Transitional | utf-8 | 19 errors
Foursquare | XHTML 1.0 Strict | utf-8 | 40 errors
Groupon | XHTML 1.0 Transitional | utf-8 | 6 errors
Zynga | XHTML 1.0 Transitional | utf-8 | 4 errors, 6 warnings

I captured the data for companies beyond those in the top 25, and a few interesting trends pop up. The first surprise is that fewer of these sites have made the transition to HTML 5: only 5 sites out of 11 (or 45 percent) have completed it. There still seems to be a strong preference for XHTML as the way to mark up pages.

Also of note is that all of these sites have plans for globalization, encoding their pages in UTF-8, which can support both Western and non-Western alphabets.

However, none of the sites successfully validates against its declared standard. It looks like there is still much room for improvement in the world of HTML validation.

About the Author

Tristan Louis

Writing and working on the internet since 1993, I've launched six companies, of which two went public and three were sold. This is my personal site and all opinions here are mine.

  • Rosemarie Pritchard

    This is probably just cheeky, but: this entry has 46 errors :P

    • http://www.tnl.net Tristan Louis

      It’s actually 15, after I realized that 2 WordPress plugins were creating a large number of errors. Still have to figure out how to deal with the remaining 15 errors.

  • Anon

    There is so much more to HTML than validation. It’s ok to break a few rules now and again if you know why you’re breaking them. Validate, explain to yourself and have a nice cup of tea and stop worrying.

    much love

    • http://www.tnl.net Tristan Louis

      “Corn Syrup tastes wonderfully sweet so don’t worry about the fact that it causes obesity” seems to fall under the same line of argument. Yes, HTML is awesome but it doesn’t mean it couldn’t get better. Can’t we all work together to upgrade the web to something that is as good as (or better than) it is today? Wouldn’t it be awesome if most web sites were as advanced as they are today AND also validated, allowing for both forward and backward compatibility?

      • Anon 2

        Your analogy introduced an effect, when your original argument had none. That is, you say that corn syrup causes obesity, but you never established that the validation errors on major websites are causing any kind of real-world problem.

        The best example, probably, is the Google homepage. The designers/developers/maintainers of that page are intimately aware of exactly which validation errors they are triggering, and are keeping those errors because they allow for a major reduction in bandwidth without losing any users.

        We aren’t all top 25 web developers, though, which is why validation can still be a constructive goal for us, but also why this post struggles to be relevant to readers.

        • http://www.tnl.net Tristan Louis

          I guess you helped me refine the post. As you point out, it’s still a constructive goal, which is what we should all strive for. The way I look at it is that if even 1 developer is looking to make sites more compliant as a result of this post, it’s been a success.