The Worst Thing That Ever Happened to the Internet

I mentioned in the last episode that Internet Explorer was the second-worst thing that ever happened to the Internet. Today I’ll talk about the absolute worst. It’s really a long technical rant that doesn’t matter, but it feels good to let it out. What follows is an underinformed ramble about the scourge that did the most harm to the developing computer network that went on to transform our lives — damage that we still live with today. Without this one corrupting influence, we would have had Internet applications that didn’t suck a decade ago, if not longer. In fact, it was because of this electronic plague that Microsoft was able to cause so much harm with Internet Explorer.

The culprit? The ball and chain that modern technology has dragged along despite its obvious flaws? Hypertext Markup Language, or HTML.

First, let’s start with the name. HTML is not a language. Not even close. It is a document format. That its inventors did not recognize the difference tells you that the wrong guys were doing it.

Second, it’s not a very good document format. At its heart, the inventors wanted a format that did three things: connect related documents, embed external resources (like images), and contain standard formatting information that would be interpreted by viewing software consistently. They were not the only ones developing systems like this; Josten’s Learning invented a similar system when they built the first multimedia encyclopedia for Compton’s New Media. Where Berners-Lee and friends had URLs, Josten’s engineers created BRUs, but beyond the initials the function was the same.

I don’t want to be too harsh on Berners-Lee, Cailliau, and the others who grew HTML, but I wish they’d been a little more far-sighted. I say ‘grew’ rather than ‘invented’ because it’s clear that they never sat back and asked themselves “What is a tag? What roles do they perform?” Even now, XHTML, the supposedly more rigorous (if still misnamed) descendant of HTML, has fundamental inconsistencies.

For a simple example, take the <br /> tag. It exists because in HTML any run of whitespace (tabs, spaces, and returns) is mushed together and presented on the screen as a single space. Thus

<p>this markup</p>

and

<p>this
 
        markup</p>

come out the same on the screen. That’s fine if you know what’s going on. But what if you want to put in a line break or a space? Well, for a space you add a special character code &nbsp; and for a break you add a tag <br />. Why is one a character and one a tag? Because on the day HTML’s inventors decided they needed line breaks, a tag seemed like a good way to go, even though semantically it had nothing to do with the roles of other tags. It could just as easily have been &br; or something like that. That’s how HTML grew up. And thus the World Wide Web was born.
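To make the whitespace rule and the character-versus-tag split concrete, here is a rough Python sketch. It is not how any browser actually works; collapse_whitespace and TagLister are names I made up for illustration, and the only real library pieces are Python's standard re, html, and html.parser modules.

```python
import html
import re
from html.parser import HTMLParser


def collapse_whitespace(text):
    """Approximate HTML's default rule: any run of spaces, tabs, and
    returns is displayed as a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip()


class TagLister(HTMLParser):
    """Record which pieces of markup the parser reports as tags and
    which it reports as plain character data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are reported here.
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


one_line = "this markup"
spread_out = "this\n \n        markup"

# Both spellings of the paragraph collapse to the same text.
same = collapse_whitespace(one_line) == collapse_whitespace(spread_out)

# &nbsp; decodes to a single character (U+00A0, the non-breaking space)...
nbsp = html.unescape("&nbsp;")

# ...while <br /> arrives as a tag, reported the same way <p> is.
lister = TagLister()
lister.feed("<p>one&nbsp;two<br />three</p>")
lister.close()
```

After this runs, same is true, nbsp is a one-character string, and lister.tags holds both p and br, with the non-breaking space sitting in the text data instead. Two things that fill the same role travel through two different mechanisms.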

Another fundamental flaw is that the content (what to display) is all mixed up with the presentation (how to display it). What if you want to show the same document in different formats? Nope. While some tags were geared toward identifying the type of content that they enclosed (like the <p> tag), others were direct formatting instructions (like the <i> tag). This inconsistency in the role of tags in a document is a reflection of the organic (and sloppy) way that HTML was grown.
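For what separating the two might look like, here is a toy Python sketch. Everything in it is hypothetical (the role names, the two style tables, the render function); it only illustrates the idea of labeling content by what it is and keeping the how-to-display rules somewhere else.

```python
import html

# Content only: each piece is labeled by what it IS, not how it looks.
document = [
    ("title", "The Duck"),
    ("paragraph", "A duck crossed the screen."),
    ("emphasis", "Nobody asked for it."),
]

# Presentation lives elsewhere, one table per output format.
PLAIN_TEXT = {
    "title": lambda s: s.upper() + "\n",
    "paragraph": lambda s: s + "\n",
    "emphasis": lambda s: "*" + s + "*\n",
}

WEB_PAGE = {
    "title": lambda s: "<h1>" + html.escape(s) + "</h1>\n",
    "paragraph": lambda s: "<p>" + html.escape(s) + "</p>\n",
    "emphasis": lambda s: "<p><em>" + html.escape(s) + "</em></p>\n",
}


def render(doc, style):
    """Apply one presentation table to the unchanged content."""
    return "".join(style[role](text) for role, text in doc)


as_text = render(document, PLAIN_TEXT)
as_html = render(document, WEB_PAGE)
```

The same document list feeds both outputs. Changing how emphasis looks means editing one table entry, not hunting through the content.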

I really can’t blame the inventors of HTML for what came next. Everyone started using it. Everyone. The flaws and inadequacies of the format quickly became apparent. Different document viewers (browsers) rendered things differently. Formatting options were extremely limited. The systems were vulnerable to abuse by unscrupulous people. Right then, there was a chance for people to say, “hold on a second! Let’s take the idea of HTML and apply the lessons we’ve already learned in other branches of computing, and make something that doesn’t suck.”

Rather than scrap HTML, browser makers and others set out to fix it. That was the Big Mistake. After twenty years of tweaking and bickering and incompatible extensions introduced by browser manufacturers and squabbles and lawsuits, HTML has been upgraded from awful to poor. Along the way, companies like Adobe and Macromedia thought to get their technology adopted as a replacement for HTML (the Web in pdf? Interesting…) but those efforts were doomed from the start because they did not provide free, simple tools to create the content.

HTML’s greatest shining virtue (and it’s an awesome one) is that it’s accessible to anyone who can type. Anyone. No special tools required.

So, now we have style sheets to help separate content and presentation, XHTML to fix some of the semantic craziness of HTML, and browsers are finally starting to agree on what all the formatting instructions actually mean. We could have had that fifteen years ago if people had just let go of HTML, but here we are now, with an almost-functional system. There are still plenty of flaws, however. Things that seem so normal now that we don’t even think about how dumb they are.

Take this blog, for instance. It’s a pretty well-built Web application, based on reasonably up-to-date practices. Yet were you to click the comment link at the bottom of this episode, you would go to a new page. On that new page the browser would reload the same header and the same sidebar it just erased. What a waste! Why does it do it? Because that’s how HTML (and HTTP, the underlying part that communicates with servers) works. There have been abortive attempts to fix that over the years, but they have all been flawed. Now, at long last, techniques have been developed to overcome that problem, but they are not quite ready for prime time yet. For one thing, they are very complicated, and for another they rely on browsers working just right. Why was it so hard to implement? Because at its core the Web was not made that way.
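To put a rough shape on that waste, here is a back-of-the-envelope Python sketch. The page pieces and their sizes are invented for illustration (this is not this blog's actual markup), and it ignores caching and compression entirely; it only compares what a plain link click fetches with the one part that actually differs.

```python
# Made-up stand-ins for the parts of the page that never change.
HEADER = "<div id='header'>" + "banner and navigation " * 40 + "</div>"
SIDEBAR = "<div id='sidebar'>" + "links and widgets " * 40 + "</div>"


def comments_fragment(episode):
    """Only the part of the page that actually differs."""
    return "<div id='comments'>Comments for " + episode + "</div>"


def full_page(episode):
    """What a plain link click fetches: everything, every time."""
    return HEADER + SIDEBAR + comments_fragment(episode)


episode = "worst-thing"
whole = len(full_page(episode).encode("utf-8"))
part = len(comments_fragment(episode).encode("utf-8"))

# Bytes sent again even though the reader already had them on screen.
wasted = whole - part
```

With these invented sizes nearly all of the transfer is header and sidebar the browser had just thrown away. Fetching only the fragment is the whole idea behind the newer techniques; the hard part is everything around it.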

Even in the days when almost everyone was on dialup (except the people inventing HTML), no one stopped to say, “hey, let’s make a way to only update the content that changes.” That problem has now been ‘solved’ by adding a new layer of complexity on Web sites. By adding this layer (on top of CSS and so forth), we get sensible Web applications at last, but we take away the one super-cool thing about HTML. It is no longer a simple format that can be harnessed by anyone with a text editor. We have lost the attribute that was the only reason to keep HTML around in the first place.

So now we have a system that is both inaccessibly arcane and flawed. Yay!

25 thoughts on “The Worst Thing That Ever Happened to the Internet”

    • You’re really going to hate it when William Shatner rises up to perform a snippet of his classic ‘Lucy in the Sky with Diamonds’ then.

      I intend to allow users to stop the duck, now that I have it working on Windows. Eventually. It’s the sort of distraction that I hate when it’s selling something, but I thought this place needed a little whim-wham (a word I discovered while looking up the spelling of ‘whimsy’).

  1. Oh, what progress. In a mere 20 years, HTML has arrived at a state that took the English language 1500 years (more or less) to achieve: illogical, cobbled-together, inconsistent, hard to learn, and at times just plain silly.

    • The trade-off is that English is also the most versatile, flexible, powerful, precise, and subtly-nuanced (at least in its written form) language on Earth. I don’t know that HTML has achieved those milestones.

  2. Yes, exactly what they both said. Damn duck. Damn irregular English language. HTML, millstone though it is, has been very helpful to me, old Stone Age Beast that I am.

    • Those who dislike the duck would be well-served to vote in the current poll. I’ve tossed in a negative vote for john, but I’ll have to wait until tomorrow to register Aser’s vote.

  3. After reading the first couple of lines, I was guessing you were going to go after HTTP and stateless sessions. You kind of got there in the second to last paragraph.

    I’d say more, but I’ve got to sit on the MOH count and snake the office.

    • As I was writing the screed I realized I was lumping HTTP in with HTML, and I wondered if people with more technical knowledge had similar complaints about TCP/IP.

      A saying I heard several years ago: the Internet works in practice but not in theory.

    • As far as MOH goes, this will be the last one determined by SiteMeter. They’ve been having troubles with their record-keeping lately. I have a new counter installed (after trying several), and it seems pretty good but it doesn’t weed out as many visits by spammers and even misses a bot now and then.

      On the flip side, I have much more control of this one and I can even automate logging the new MOH’s and calculating the next big number.

      A Spambot for Millennial Office Holder? I sure hope not. It’s up to the blogcomm to prevent it!

  4. My minor opinion (and you know far more about this stuff than I) is that there may be a connection between the development of HTML and the fondness in the science/math/academic world for TeX and LaTeX. In those, the term “markup language” is referring to what an editor or typesetter uses to make the viewed text look as desired. So the use of the word “language” has precedent. It’s pure speculation, but maybe Berners-Lee was coming from that viewpoint; of wanting to have hidden markups that would present the viewed material as he wanted. Secondly, the big attraction of this was getting exactly what he wanted. Using LaTeX can be a hairpulling experience but in the end it is exactly what you want and not what Microsoft tells you to want. This may have been what was circulating in the minds of the early “growers” – to use your term.

    I don’t think there can be too much emphasis put on your very own words:
    … but those efforts were doomed from the start because they did not provide free, simple tools to create the content.
    HTML’s greatest shining virtue (and it’s an awesome one) is that it’s accessible to anyone who can type. Anyone. No special tools required.

    Finally, I have to say (in addition to this being a fun entry, full of interesting discussion) that bloggcomm is spelled with two g’s for aesthetic reasons (to match the ending m’s).

    Finally, finally, whenever mister boo crosses the screen, comment typing is interrupted.

    • While there was definitely an attitude of borrowing what worked from other systems (I believe that’s how the image tag came to be the way it is), I certainly HOPE they weren’t trying for absolute vendor-independent precision, since they failed at it so completely.

      TeX was carefully designed to fill a real void (typesetting complex documents) and by its second iteration included a real language. LaTeX was created to make TeX more useful to normal folks. That’s all I know about TeX, and man, I really hope I never have to learn more. I’ve had several requests to have Jer’s Novel Writer output TeX documents, but JersNW is not now, and never will be, a prepress program.

      I think one user wrote a thing to convert the XML output to TeX, anyway.

      If Berners-Lee et al. had used TeX as a model, they would have spent a lot more time thinking about the role of each tag, making them consistent, and making a more flexible system. If they had started with “what is a document?” and “what defines its parts?” and built a markup format from the ground up (what we now call DOM – the browser manufacturers almost agree on that now), things would have worked better a long time ago. There was nothing stopping them; this same approach was being used successfully elsewhere. Once the document model was in place, syntax to format specific parts and to request subsets of the document from the server would have been simpler, and the sun would have shone more brightly, the birds sung more prettily, and war as we know it would have ceased.

      I agree with you completely about the critical importance of being able to use a text editor and no special tools to create marked-up documents. Had HTML been done right, it would still be possible to create really slick sites without complex code libraries and development tools, along with the always-necessary text editor and a good aesthetic sense.

      • I’m going to tilt at this windmill again. And it is an appropriate image because I am not the knight errant to do this argument. But your whole entry is (I’m sure purposefully) provocative and iconoclastic. HTML is all wrong? The deuce you say! I am provoked. I mean, I came to the internet right at the pre-cambrian explosion of the www. I had one class make me find a bulletin board, used Usenet once, used Gopher once, and that was it. It’s all been webpages and browsers since, and you are saying, “cars are silly, if Henry Ford wasn’t the maroon he was, he woulda concentrated on hovercraft.” The deuce you say! I like cars.

        First of all, isn’t Berners-Lee a high energy Physicist? (I couuuuuld look up on wikipedia). And wasn’t he one of the few scientists named as most influential people of 20th century in Time magazine or something? In a sea of celebs and politicians. Scientists never git nothin, and now you want to take this away? Your dad is a high energy physicist. I think a couple of kittens expired when you wrote your words, and a russian grandmother tore out her own eyes. You, SIR! are a father-hatin, kitten-killin, russian-grammy-blindin provocateur! And iconoclast. Probably a sociopath as well. Since ad-hominem attacks are the pinnacle of rhetoric and debate I could rest my case here, while you gasp for breath like a mewling, scabies-ridden puppy. But I will press on.

        First of all (you don’t merit any second, third or more of alls you iconoclast!), I should clarify that my only reason to invoke TeX…
        (Gaw! %##@&!
        Both the duck and mister boo are completely shutting down my erudition. Are you doing this on purpose because you are a mewling puppy afraid of real debate?!? I’m running firefox 3.0.10 on linux. What seems to be happening is they are taking my cursor out of the editor window. Thus I have to re-click at end of last typing to insert myself back into comment window. Oh and I see no special formatting of TeX.)
        …is to say that they wanted to have process and results do exactly what they want and not to invoke it as the model for formatting web documents. Well, maybe a little.

        Okay, wiki says Berners-Lee is a computer scientist, not a physicist. I was thrown because he worked at CERN. So you’re not a father-hater, but those other things, …MAN!

        Stepping thru your points, in order, the first thing we come to is that HTML is not a language. And you imply they are stupid because they didn’t know the difference. Knight of the British Empire – stupid! Well, what do you mean by a language? A programming language? Are you saying that you want data put into a program, the crank turned, and results output? This strikes me as semantics. Nobody really takes piss home in their pocket when they go to take a piss, but we all know what is meant. HTML is a set of specific, technical, jargonations that allow us to speak of the pompatus of web. It is a language.

        Your next point is the inconsistency of having both tags and ampersand escaped commands. This is both a trivial and major point. OToneH it is hardly the straw to break the camel’s back, but on the other hand – if it is worth doing, it is worth doing right, and it is esthetically not very good. It strikes me as the different approaches that science and engineering take. The stereotypical scientist is organic and wild-haired and pushes the boundaries forward. The stereotypical engineer is buttondown, precise and makes it all elegant. Since wiki says Berners-Lee is an engineer, it is a surprising lapse.

        Your next point is having the content and the instructions on how to display the content all mixed together. Now this is an interesting point, and worth lots of friendly discussion over a fine brew. A smoky porter, perhaps. Or a pumpkin stout. Now tex does exactly this kind of mixing, and it harkens back to my point of “I’m gonna display something, and I don’t want Bill Gates telling me how he thinks the results should look.” But your point is stimulating, and I can’t really address it without knowing more. What would you propose? If you wanted to serve people something to read and/or hear how would you separate content from instructions? A fascinating concept, I sincerely look forward to your thoughts.

        Your next point I also can’t address so much, because I would like/need more detail. You say that when the web and attendant HTML exploded, the flaws became apparent immediately. How so? Were there technoGeek conferences on the flaws? Even at the early stage?
        Some flaws you mention: Different browsers rendered pages differently. Well, why did they? Why aren’t they (instead of HTML) the asshats?; formatting options extremely limited. How so? What did you miss? What did you wish to format that you couldn’t?; Unscrupulous people…what vulnerabilities were there? Finally what are the lessons from other branches of computing that would have helped these flaws? I dunno the answers, that’s why I’m asking.

        Your next argument is that the HTTP is reloading all of the static stuff over and over again. Such as the headers, and sidebar. Really? I didn’t realize that. I thought that was what cache was all about. Nonetheless this is a really good point. In a world where AT&T and others want to start chargin for bandwidth, cutting the fat and inefficiency is really important. And I’m with you – if I had been an early HTTP pioneer, I would have certainly programmed the protocol not to waste effort on the static stuff. What were they thinking. I can agree on this point. A serious lapse. If – as you suggest – it currently takes complex coding to sit on top of the CSS that sits on top of the HTML, then it is a true tragedy.

        First of all, I’d also add, that somewhere you complain about the organicness of the development. Your point is well taken, but there can also be a beauty in organicism and I wouldn’t dismiss it lightly. It is entirely possible that once upon a time an unscrupulous hacker pulled his hair out because he couldn’t do evil thing X since the organic web was doing a certain convoluted, inefficient method Y.

        Finally, first of all, this comment was sooo painful. The duck/boo really was agony. If the above was hyperbolic ranting, there is no hyperbole here: It took me an extra fifteen minutes to rant because of the interruption. Oddly, Shatty Baby didn’t cause problems.

        • Excellent ad hominism! I wish I could reply in kind, but I don’t have enough chemicals in my bloodstream yet. So in return for all your hard work you get a pedantic list of points. So it goes.

          First, the duck/boo interruptions: I’d like to know if anyone else experiences them. It’s odd that shat would not have the same effect; he uses the same code. I’ll look and see what I can find.

          HTML is less of a language than pdf or Microsoft .doc – those document formats can contain logic. TeX contains a full-blown language. If you wish to put languageness on a spectrum, HTML would still fall way over on the ‘not’ side.

          The inconsistency between spaces and line breaks isn’t about having tags and escaped commands, it’s that two special formatting characters are treated completely differently. They fill the same role; they should work the same. (Neither should be tags, which perform a different role.)

          ‘Separate data and presentation’ has been a mantra in computer science at least since the invention of the GUI. TeX is not great about that, but you could go nuts and accomplish it (LaTeX makes that a lot easier). TeX, however, is so much about creating one specific sub-pixel perfect layout of a document that exists only to be laid out, that the versatility is less important.

          I could have sworn Berners-Lee was a Physicist as well. He was certainly influential, but that influence wasn’t all good, and really not all that amazing. As I said before, there were other people solving the same problems; he just happened to be in the right place – next to a big, widespread network with lots of peers who were pining for something like this. (Actually, I think his real influence came later as one of the people maintaining the standards.) Influential, yes. Visionary, no.

          The flaws in HTML were painfully obvious to anyone who wanted to do more than slap text on the screen and add a few pictures (that the text did not flow around). I took a look at HTML back then and thanked my lucky stars that I was not going to have to use such a clunker. Oh, how naïve I was.

          HTML 2.0 made the format somewhat more useful, and things have progressed. One-third of the tags in the original HTML have since been removed, though sadly not line break or horizontal rule. Somewhere early on tables and forms were added, making the format suck a little less. And that’s what it’s been for two decades, a gradual easing of the suckage, at the expense of the one shining attribute HTML had to offer: universal accessibility.

          As far as already knowing better, look no farther than SGML, which was a generalized, conceptual description of document markup that HTML sorta-kinda followed, except when it didn’t, which was a lot of the time. The intervening years have been a gradual drift towards SGML. We’re still not there. Berners-Lee, I think, was somewhat in denial about the gulf between HTML and SGML. Near the start would have been a great time for the geeks making Mosaic to step forward and offer guidance.

          To be fair to Berners-Lee and his cohorts, they made something that was adequate for a specific, very limited purpose – sharing static text information among peers. Aesthetics were not considered to be very important, nor flexibility. It’s when the world jumped on HTML and started wanting to use it for everything else that the flaws became relevant.

          Since then DOM has been added, providing a way to describe all the pieces of a document (what they should have started with), which made CSS possible, and now with JavaScript ubiquitous and consistent it is finally possible to request content for and update only one part of the page, though you also need custom stuff on the server to handle those requests. The Google search over on the sidebar uses AJAX, for instance.

          The cache is a complex mechanism that works differently in different browsers, which was invented as yet another stretch of duct tape to fix HTML’s flaws. Granted, caching would have been useful anyway, but even more efficient would be to not even ask for the same content again. And while images go into the cache, all the text on the sidebar has to be reloaded (since it’s in the code for the main page), and re-rendered, and the scripts re-executed. Cache was only a partial solution to something that the HTML guys should have addressed long ago.

          Your hacker example made me think of this specific story that illustrates your point. The ‘classic’ mac operating system was amazingly secure, in part because way, way, down inside the original engineers decided on a weird way to invoke operating system functions. It was a pain in the patoot to work with, but for reasons I’m not totally clear on, made cracking the mac nigh-on impossible. Now, of course, Mac uses an operating system which at its core has stuff to make it easier for one computer to control another. Usually it works the other way, however. Look at Microsoft Windows a few years ago. They kept adding stuff that was useful for large businesses, but also turned out to be useful to hackers.

          That notwithstanding, it’s more the underlying stuff beneath HTML (HTTP and so forth), built in an environment of happy collaboration, that has attracted the malicious no-goodniks with their hacking and their denial-of-service attacks and whatnot, so that point can probably be set aside.

          I suppose I should tack a conclusion on to all of this. HTML changed the Internet, put the power to publish into everyone’s hands, and was a poorly-designed piece of crap that could have been done much better, to the point where fixing all its problems has undermined its original strength.

    • Ooo! And here’s something interesting: TeX is actually spelled to show off its typesetting mojo. Funny making a product with a name you need the product to read.

      The graphic format for the logo there is svg which (I think) stands for Scalable Vector Graphics. It looks pretty promising, because search engines can get to the text in the image, and fonts are not an issue. I could also write a script that moved parts of the image around.

      I’m curious what success people have viewing the image. I think Internet Explorer users are out of luck, but I’d be happy to be wrong. I think there are plugins available.

        • I did a little reading, and apparently Firefox requires you to use a different syntax to include svg images than you do for any other image. Something about security makes the object tag safer than the img tag. I won’t pretend to understand. One could also take the contents of the image file and include the xml directly into the page and it would work in Firefox. (I might try that later, just for giggles.)

          Opera and Safari both treat svg like any other image.

          Microsoft proposed a different format for the same purposes, and has never joined the svg parade.

          • My computer just downloaded and installed Firefox 3.5.3 this afternoon, although it gave me a message that for ultimate performance and security, I should also get the latest version of Flashplayer.

            Haven’t seen Mr. Boo yet, and the duck and Shatner seem to be showing up less often, although that may be a matter of perception.

          • I saw the notice of the flash player, and it looked like Firefox was trying to say there was a security hole in the old player without actually coming out and saying it. It looks like updating would be a good idea.

          • Haven’t updated flashplayer yet … still haven’t seen Mr. Boo. Neither the duck nor Shatner give my computer any problem, although, speaking of ghosts, the duck leaves a copy of himself behind when he passes over certain graphics.
