tommorris.org

Discussing software, the web, politics, sexuality and the unending supply of human stupidity.


html


Proposal: 'change password' discoverability metadata

The recent leak of LinkedIn’s password database shows that passwords remain a fragile part of our security ecosystem. Users are bad at coming up with passwords. They use the same password across multiple services. Enterprise password change policies have been part of the problem: users simply take their existing passwords and stick an incrementing number on the end, or engage in other substitutions (changing the letter o for the number 0, for example). Plus, regular password changes don’t really help, because a compromised password needs to be fixed immediately rather than three months later at the next expiration cycle. CESG recently issued guidance arguing against password expiration policies using logic that is obvious to every competent computer professional but not quite so obvious to big enterprise IT managers.

Many users, fed up with seeing yet another IT security breach, have switched over to using password managers like KeePass, 1Password, Dashlane and LastPass. This is something CESG have encouraged in their recent password guidance. Password managers are good, especially if combined with two-factor authentication.

Users who are starting to use a password manager face an initial hurdle: switching over from having the same password for everything to using the password manager’s generated passwords. They may have a backlog of tens or hundreds of passwords that need changing. The process of changing passwords on most websites is abysmally unfriendly. It is one of those things that gets tucked away on a settings page. But then that settings page grows and grows. Is it ‘Settings’, or ‘My Profile’ or ‘My Account’ or ‘Security’ or ‘Extra Options’? Actually finding where on the website you have to go in order to change your password is the part which takes longest.

Making it easier for a user to change their password improves security by allowing them to switch from a crap (“123456”), reused, dictionary word (“princess”) or personally identifiable password (the same as their username, or easily derived from it: “fred” for the username “fred.jones”) to a strong password that is stored only in their password manager like “E9^057#6rb2?1Yn”.

We could make it easier by clearly pointing the way to the password change form so that software can assist the user to do so. The important part here is assist, not automate. The idea of software being able to automate the process of changing passwords has some potential selling points, but the likelihood of it being adopted is low. Instead, I’m simply going to suggest we have software assist the user to get to the right place.

In the form of a user story, it’d be like this: as a user of a password management application, I’d like to speed up the process of changing passwords on websites where they have been detected to be weak, reused or old. When I’m looking at a password I wish to change, I could click “change password” in the password management application and it’d take me to the password change form on the website without me having to search around for it.

There’s a few ways we could do this. There are some details that would have to be ironed out, but this is a rough first stab at how to solve the problem.

My preferred option is to put a link in the HTML. On the website, there is a link, either visible (using an a element) or invisible (a link element in the head). It would be marked with a rel attribute with a value like password-change. Software would simply parse the HTML and look for an element containing rel="password-change" and then use the href attribute. The user may have to go through the process of logging in to actually use the password change form, but it’d stop the process of searching.

One issue here is that there are a large number of web apps that rely on JavaScript to render up the page and there is the potential for rogue third-party JavaScript to modify the DOM. A simple way to ameliorate this is to search for the value in the HTML itself and ignore any JavaScript. Another possible solution is to require that the password change form be located on the same domain as the website, or decide whether to trust the URL relative to the base domain based on an existing origin policy like CORS.
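To make the "search the HTML itself and ignore any JavaScript" idea concrete, here is a minimal sketch in Python using only the standard library. The rel value password-change is the one floated above, not a registered link type, and the class name, function name and sample URLs are made up for illustration. Because the parser never executes scripts, a link injected by JavaScript is simply never seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PasswordChangeLinkFinder(HTMLParser):
    """Collects href values from <a> and <link> elements whose rel
    attribute includes the proposed (hypothetical) token "password-change"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "password-change" in rel_tokens and href:
            self.hrefs.append(href)


def find_password_change_url(page_url, html_text):
    """Return the first password-change link in the raw HTML, resolved
    against page_url, or None if the page does not advertise one."""
    finder = PasswordChangeLinkFinder()
    finder.feed(html_text)
    if not finder.hrefs:
        return None
    return urljoin(page_url, finder.hrefs[0])


SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>Example</title>
    <script>
      document.head.innerHTML += '<link rel="password-change" href="https://evil.example/steal">';
    </script>
    <link rel="password-change" href="/account/security/password">
  </head>
  <body><a href="/about">About</a></body>
</html>"""

found_url = find_password_change_url("https://example.org/", SAMPLE_HTML)
print(found_url)
```

The rogue link inside the script element is treated as script text rather than markup, so only the link in the static head is picked up. Deciding whether the resolved URL should be trusted is a separate step.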

Putting JSON in a specified location

Alternatively, have people put some JSON metadata in a file and store it in a known location, similar to robots.txt or the various things spooked away in the .well-known hidey-hole. This is okay, but it suffers from all the usual flaws of invisible metadata, and is also a violation of the “don’t repeat yourself” principle: the links are already on the web in the HTML. Replicating that in JSON when it already exists in HTML increases the likelihood that the JSON version will fall out of sync with the published reality.
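For comparison, a rough sketch of what consuming the JSON variant might look like. Both the file location (/.well-known/password-change.json) and the key name (password_change_url) are invented here purely for illustration; nothing above pins down either.

```python
import json
from urllib.parse import urljoin

# Hypothetical location and key name; the proposal does not fix either.
WELL_KNOWN_PATH = "/.well-known/password-change.json"
URL_KEY = "password_change_url"


def metadata_url(site_url):
    """Build the URL a password manager would fetch for a given site."""
    return urljoin(site_url, WELL_KNOWN_PATH)


def password_change_url_from_json(site_url, json_text):
    """Pull the password change URL out of the metadata document,
    returning None if the document is malformed or lacks the key."""
    try:
        metadata = json.loads(json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(metadata, dict):
        return None
    target = metadata.get(URL_KEY)
    if not isinstance(target, str):
        return None
    return urljoin(site_url, target)


sample_document = '{"password_change_url": "/settings/security/password"}'
well_known = metadata_url("https://example.org/some/page")
json_target = password_change_url_from_json("https://example.org/", sample_document)
print(well_known)
print(json_target)
```

Note that the path in the JSON duplicates whatever the site already links to in its HTML, which is exactly the out-of-sync risk described above.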

A third option follows the same principle as the JSON one, but uses HTTP(S) headers. It has the same issue of invisible metadata, and the same issue with same-origin policies.
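One way the header variant could work is to reuse the existing Link header (RFC 8288) with the same rel value. That choice is my assumption for the sake of a sketch; no particular header is named above. The parser below is deliberately simple and would trip over commas inside quoted parameter values.

```python
import re
from urllib.parse import urljoin

# Matches one <target> plus its parameters, up to the next comma.
LINK_VALUE = re.compile(r'<([^>]*)>([^,]*)')
# Matches a rel parameter, quoted or unquoted.
REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s";,]+))', re.IGNORECASE)


def password_change_url_from_link_header(page_url, header_value):
    """Return the target of the first Link entry whose rel includes the
    proposed (hypothetical) "password-change" token, or None."""
    for match in LINK_VALUE.finditer(header_value):
        target, params = match.group(1), match.group(2)
        rel = REL_PARAM.search(params)
        if rel is None:
            continue
        tokens = (rel.group(1) or rel.group(2) or "").lower().split()
        if "password-change" in tokens:
            return urljoin(page_url, target)
    return None


sample_header = '</help>; rel="help", </account/password>; rel="password-change"'
header_target = password_change_url_from_link_header("https://example.org/", sample_header)
print(header_target)
```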

Security considerations

As noted above, there are some security issues that would have to be handled:

  1. Should a consuming agent (i.e. the password management application) allow third-party (or even same-origin) JavaScript to modify the DOM that contains the link?
  2. Should a consuming agent ignore password change form endpoint targets that are on a different domain?
  3. Should a consuming agent follow a password change link to a non-HTTPS endpoint?

My rather conservative answers to these three questions are all no, but other people might differ.
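Expressed as code, the conservative answers to questions 2 and 3 might look something like the sketch below. The function name is made up, and "same domain" is interpreted here as an exact hostname match, which is stricter than some people would want. Question 1 is handled earlier in the pipeline by only ever reading the raw HTML rather than a scripted DOM.

```python
from urllib.parse import urlsplit


def is_acceptable_target(page_url, target_url):
    """Apply the conservative policy: the password change URL must be
    HTTPS and must live on exactly the same host as the page that
    advertised it."""
    page = urlsplit(page_url)
    target = urlsplit(target_url)
    if target.scheme != "https":
        return False
    if target.hostname is None or target.hostname != page.hostname:
        return False
    return True


print(is_acceptable_target("https://example.org/", "https://example.org/account/password"))
print(is_acceptable_target("https://example.org/", "http://example.org/account/password"))
print(is_acceptable_target("https://example.org/", "https://accounts.example.net/password"))
```

A looser policy that allows sibling subdomains would need a notion of registrable domain, which means consulting something like the Public Suffix List rather than comparing strings.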

Warning on scope

As I said above, this is a very narrowly specified idea: the ecology of web application security is pretty fragile, and the likelihood of radical change is low, so I’m not proposing a radical overhaul. Just a very minor fix that could make it easier for (motivated, security-conscious) users to take steps to transition to better, stronger passwords.


A fictional conversation about progressive enhancement

“I am disappointed by modern web development. Too many bloated frameworks, too much JavaScript, single page web apps, hash bang URLs—it’s all a bit over engineered. We have lost the old techniques of progressive enhancement and in return we have ghastly nonsense like infinite scroll which looks nifty but does not really improve the user experience. It all seems a bit like we have reinvented the era of Flash intros but we think it is so much better because we have made all this pointless bullshit in JavaScript rather than Flash.”

“I take your point, granddad. Perhaps this technology is excessive for mere web sites but we are building web apps now.”

“At some point someone will give me a clear explanation of the difference, riiight?”

“Well a web app is something you can’t really experience without a whole lot of scripting. Like, you can’t progressively enhance it.”

“So a web app is defined as a system that requires the JavaScript excesses for it to work. And the argument for the JavaScript excesses is that we need it to build web apps. That sounds a teeny bit circular to me.”

“Bah. Logic. I don’t need logic. Just because you can’t fit it into your theological categories doesn’t mean there isn’t a distinction. Like, I can point to clear examples of web apps. Gmail! Google Docs! They don’t make sense if you don’t understand them as apps. They don’t fit that old fashioned web pages with little blobs of progressive enhancement model that you grumpy old Luddites keep banging on about. If I want to build Google Docs, I need to do it in the new way.”

“You make a good point. You do kind of need a modern browser with bells and whistles to be able to edit a spreadsheet in Google Docs. The user experience of using that in Lynx is going to suck, so perhaps you don’t really need that.”

“See, this brave new world of apps is not so scary! Shall I help you with your Gulpfile now?”

“Let’s not be too hasty. I mean the argument is that Google Docs is completely useless without all the modern front end stuff all working.”

“Sure.”

“And there is literally nothing you could display to someone viewing a Google Docs spreadsheet or word processing document if, say, their browser had scripting turned off?”

“Absolutely. This is why you need to approach it with an app mindset rather than a document mindset.”

“What is the user editing in Google Docs?”

“Well, rich text files and spreadsheets.”

“Which are types of what?”

“Documents.”

“Can you repeat that word for me?”

“Oh fuck. Documents. You got me.”

“So what could you do if the user loads the page in a browser that doesn’t have the capabilities to edit the document?”

“Well, we could display the document, I guess.”

“And what technology do you need to render rich text and tables in browsers?”

“You know the answer already. HTML and CSS.”

“And if your browser can edit the document—”

“—then it loads the relevant code to edit it. It is still progressive enhancement! I get it.”

“And you can even use your silly Node.js reimplementations of GNU Make if it makes you happy.”

Russian translation


Hey, @SlackHQ, Markdown without inline HTML is rather useless

The Markdown syntax document that John Gruber wrote says this:

Markdown is not a replacement for HTML, or even close to it. Its syntax is very small, corresponding only to a very small subset of HTML tags. The idea is not to create a syntax that makes it easier to insert HTML tags. In my opinion, HTML tags are already easy to insert. The idea for Markdown is to make it easy to read, write, and edit prose.

Which some implementers of Markdown (yeah, I mean you, Slack) then decided meant “let’s make it so you can’t type HTML in a document because, urgh, we support Markdown not HTML” even though Markdown is HTML plus a bunch of shortcuts to make it easier to write common stuff. Markdown without inline HTML means you can’t write things like tables or definition lists. Sigh.

I’m currently rewriting a document because I can’t type HTML in Slack’s variant of Markdown. This kind of bullshit is why it is easier for me to just render it properly using a non-stupid Markdown implementation and then save it as a PDF and send that to people. It’s after 7pm and I’m still in the office reformatting fucking Markdown—this is not “[making] working life simpler, more pleasant and more productive”, as Slack promises. Quite the opposite in fact.

If you let me write Markdown, let me write HTML in that Markdown. That’s how Gruber designed it. If you don’t let me write HTML in Markdown, I can’t use it to actually write anything detailed.


@t linked me to @veganstraightedge’s take on JSON-LD.

I disagree that JSON-LD is “unneeded”.

In an ideal world, we’d switch to putting data inside HTML. But lots of people seem to think JSON is useful. (There’s a reason microformats2 parsing is specified with reference to transforming it into a JSON document.) JSON APIs are an existing practice.

The reason JSON-LD is useful stems from the simple fact that (a) lots of people publish JSON, and (b) it’d be quite useful if we could actually layer a bit of semantics on top of that. Combined with something like Hydra, it means we might not have to write a special snowflake RubyGem or Python library or JAR or whatever for each different web service out there.

While JSON APIs continue to exist, I’d rather have self-describing, semantically-rich RESTful JSON APIs than the crap I see created by programmers who keep on reinventing the damn wheel badly.


My blog: now with HTML video

I’ve done it! No more Flash video on my blog if you don’t want it.

If you are using Google Chrome, Safari (on a Mac, or on an iOS device) or the bleeding edge in Firefox or Opera, you can now watch videos on my blog without Flash using plain-old HTML video. Or “OMG HTML 5” if you prefer.

I’ve updated the JavaScript to automatically switch out the Flash players from YouTube and Vimeo for iframe video embeds from both sites. I did this by modifying this code by Matthew Buchanan. I’ve basically added a function to support YouTube.

In each browser that you are using, you need to opt-in on YouTube and Vimeo’s sites.

To opt-in for YouTube, go to youtube.com/html5 and click “Join HTML5 trial” at the bottom of the page.

To opt-in for Vimeo, go to any Vimeo video and choose the “Switch to HTML5 Player” option underneath the video.

Now come back to this site and you should see the Flash players replaced with native HTML players.

If you want to see this in action, try going to the video or 8-bit tags.

Enjoy.

I may turn the script into a thing any Tumblr user can drop on their page quite easily soon (basically an external JavaScript), but until that happens, you can see how it works by viewing source. This is the web after all.


HOWTO: Build an HTML 5 website

Everyone is going on about how they are making “HTML 5 sites” and going on and on about how HTML 5 is giving them a hard-on or something equally exciting.

So, I’ll show you how you join this amazing club.

Open up your text editor and find some HTML file or template.

Look for the bit right at the top. It is called a DOCTYPE. It’ll look something like this:1

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">

Now, delete all that and replace it with:

<!DOCTYPE html>

Save the file and push it out onto the web.

Congratulations, you are now using HTML 5.

Give yourself a big pat on the back. Listen to some cutting edge spacey techno or something. ‘Cos you are living in the future, man. Your techno-halo is so bright, I need to put on two pairs of shades indoors.

You are now officially signed up in the fight against SilverFlash and a minion in Steve Jobs’ campaign for the open web or something. (Because embrace-and-extend is so much nicer when it comes from Apple and Google than when it comes from Microsoft and Adobe.)

You can also go to your boss and justify a huge champagne and coke-fuelled party with hookers and everything because you are now fully buzzword compliant. You can get venture capitalists and TechCrunch and other people who wouldn’t know a DTD from an STD2 to give your huge, manly testicles a thorough tonguebath – sadly, only rhetorically – because you are smart and hip enough to be using HTML 5. Pow! Bam! Shazam! You are like a cross between Nathan Barley and Rambo!

Or, you know, you could actually learn what HTML 5 is. Let me give you a clue: it is quite a lot like HTML 4. That’s part of the philosophy of the damn thing: it is continuous with what you are already doing rather than a radical shift! It is that cliché: evolution not revolution. It’s like the difference between OS X Leopard and Snow Leopard.

Once you realise this important truth, you can drop the buzzwords, and just quietly educate yourself on some of the quite nifty new things you get to do on the web, get your rather excitable colleagues to calm down before they faint in pre-orgasmic excitement, and maybe try and nudge the community at large into realising that HTML 5 is a few new bits and bobs they are adding to HTML, not some hybrid of Jesus and Vannevar Bush riding down on a pterodactyl/unicorn hybrid giving out ultratight Fleshlights to anyone who slings angle-brackets so they can prepare for the giant fight between HTML 5, evil browser plugins and mobile app stores.3

You can adopt HTML 5 really quite slowly: if your site sucks now, making it “HTML 5” won’t make it not suck. Even better, don’t start with HTML 5. Start with CSS 3: the nature of CSS is that it is much easier to fiddle with a stylesheet, add a few things like media queries and so on.

Be patient and don’t rush into this. Include only technologies that improve your site and the experience of using it. Not because some fucking bullshit web design blog you found on Reddit is jabbering on about how it is the most awesomest thing ever invented since someone discovered you could have sex while eating sliced bread or some other crap like that. It’s not. It’s an evolutionary step from existing HTML on the web that gives you a few shiny new things that might make life easier.

Now calm down. I’ve just washed my clothes and I don’t want you jizzing all over them when you discover the joys of the section element.

  1. Yours will be much more boring. It won’t have cool shit like RDFa in it because you suck.

  2. To be fair, DTDs and STDs share a scary resemblance in lots of ways. You can prevent the transmission of DTDs by adopting RELAX NG for all your XML schema validation needs.

  3. Again, the whole native vs. web thing is fucking stupid. The only reason it is happening is because people seem to think that everything needs to be an app. You know, if the thing is more like a web page, you put it on the web. If it is more like a desktop application, you put it in an app. Content? Web. Functionality? App. This also resolves all the stupid nonsense about app store approvals. Why have we reached a situation where people are putting content in an app? You know, people are downloading blobs of Objective-C compiled object code that contain satirical political cartoons. Then they are complaining when Apple ban the ‘app’. What the fuck is that all about? Put that shit on the web. Apple can do what they want to apps, but why let them tell you what you put in your content. Let them approve functionality, not content.

    There was a time many moons ago when you had to download a Windows application – actually, you had to have a Windows application sent to you on a CD-ROM – in order to order groceries from Tesco. This is the app world we live in today, and it is totally idiotic. Apps are things like Vim or Firefox or Unreal Tournament 2004 or iTunes or The Gimp or Final Cut Pro. If you wouldn’t download a Windows or Mac app to read Wired Magazine, why are you downloading a damn iOS app?

    What is so stupid about this is that while Apple and Android and whatnot train everyone up into using app stores, what’s the reaction of plenty of people in the open source community: don’t worry, the web will do it. (Or worse: we’ll make an open web app store!) But it’s bullshit. The web is a pretty damn retarded application platform. I mean, it is okay in a pinch, but I’m not betting on a decent Ajax version of Vim, Half-Life 2 or Adobe Illustrator any day soon. And why would I want to use Google Docs when I’ve got thirty years of hard work by Donald Knuth and Leslie Lamport sitting there ready to churn out absolutely awesome pixel-perfect print documents from my damn command line. Plain text, Vim and Git (or Emacs and Mercurial or some other combination thereof) will beat the socks off whatever cloud vapour out there.

    You do actually sometimes need native code on actual hardware, not seventeen layers of JavaScript indirection bouncing back and forth between a server that doesn’t respond half the time and a browser that’s filled with security holes and memory leaks. Why do I want this when I have a command line here that does the job quicker and easier and works when I’m in a fucking train tunnel? And don’t even think about saying “Google Gears”.


When parsing HTML using regex is okay

There's been a lot of fuss over on Stack Overflow, and consequently on Metafilter and on Jeff Atwood's twitter, about people parsing HTML with regular expressions, along with the advice to never do that and tales of how Cthulhu will eat your soul.

In general, never parsing HTML with regular expressions is good advice.

But sometimes it isn't. I'll give an example of a case where regexes were the right tool. You may find that it's applicable to you.

A while back, I had over 2GB of HTML to parse - 77,000 files. Every file had exactly the same structure. I only wanted to extract two pieces of data from each file - the contents of the h1 element and the contents of a div with the class of 'author' or something similar.

I wrote some Ruby code to parse each page using Nokogiri or Hpricot or whatever was then the preferred HTML parsing library. But this was slow. It was taking about 4 or 5 seconds to parse each file. In general, that's pretty fast, but when you've got 77,000 to do, that's not so good. That means four days.

I rewrote the code in Java so that it would open each file with a BufferedReader, then readLine on each line of the file, using the String startsWith method to see if it's the right line, then use regexes to extract the stuff we are interested in. I compiled and ran this code: it went from four days to about ten minutes. Which is fine because I made a goof-up in the code that I only discovered after running it - if I had only discovered that goof-up four days later, I would have been a lot more angry than if I'd discovered it after ten minutes.
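The original Java is long gone, but the same bag-of-lines approach can be sketched in Python. This is a reconstruction of the technique, not the code I ran: it assumes the h1 and the author div each sit on a single line and that leading whitespace can be stripped before the prefix check, and the sample page is invented.

```python
import io
import re

H1_PREFIX = "<h1"
AUTHOR_PREFIX = '<div class="author"'
H1_PATTERN = re.compile(r"<h1[^>]*>(.*?)</h1>")
AUTHOR_PATTERN = re.compile(r'<div class="author"[^>]*>(.*?)</div>')


def extract_title_and_author(lines):
    """Scan lines one at a time, using a cheap prefix check to find the
    two interesting lines and a regex only on those lines."""
    title = author = None
    for raw_line in lines:
        line = raw_line.strip()
        if title is None and line.startswith(H1_PREFIX):
            match = H1_PATTERN.search(line)
            if match:
                title = match.group(1)
        elif author is None and line.startswith(AUTHOR_PREFIX):
            match = AUTHOR_PATTERN.search(line)
            if match:
                author = match.group(1)
        if title is not None and author is not None:
            break  # nothing else in the file matters, so stop reading
    return title, author


# Stand-in for one of the 77,000 files; a real run would pass open(path).
sample_page = io.StringIO(
    "<html>\n"
    "<body>\n"
    "  <h1>An Example Title</h1>\n"
    '  <div class="author">A. N. Author</div>\n'
    "  <p>Body text that is never inspected closely.</p>\n"
    "</body>\n"
    "</html>\n"
)
title, author = extract_title_and_author(sample_page)
print(title, author)
```

No DOM is built and no objects are allocated beyond the current line, which is where the speed comes from. It is also why this only works when every file really does share the same structure.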

I've told this story to people, and there seem to be two possible reactions. There is the "OMG Ruby is so slow, I knew that not learning it and sticking with Java was sensible" reaction, and there's the sensible reaction - I could have re-written it in Ruby and gotten the same performance benefits by using IO rather than the XML/HTML parsing library - it just happens that I know the Java IO library better than I know the Ruby IO library. Part of what was probably taking the time in Ruby was the fact that I was constructing a large number of objects extremely quickly, but Ruby's GC is notoriously painful in a non-generational way compared to the JVM's generational GC.

The key thing is whether or not you are working with files that are all structured in a broadly similar way. If you've got 77,000 files that are all very similar and you know exactly what you want from them, sometimes for performance, parsing it as a bag of lines and strings is much more sensible than parsing it into a DOM. These very limited circumstances really provide the exception that proves the rule. If you don't have a very good reason to be parsing XML or HTML using regexes rather than an XML or HTML parsing library, you shouldn't be doing so. (The same is true with RDF: use the right level of abstraction - unless you are logged into the swig IRC room all day every day and know the RDF specs like the back of your hand, you should be using an RDF library not an XML library to parse RDF documents.)


Why you shouldn't create another markup language

A while back, Gareth Rushgrove quoted me in DSLs for HTML and CSS - The Future, or Just Plain Wrong? where I said: "I'm not sure why everyone insists on clumsily reinventing HTML every few weeks (eg. wiki syntaxes, of which there are hundreds)."

Gareth posted this in the context of Haml and Sass, Rubyist abstractions for HTML and CSS. People think that when I bitch about these things, I'm saying that they are bad. They aren't intrinsically bad. Problem is, taken as a set, they are a pain in the behind. I already know a language that allows me to express inline differentiations in a document. It's called HyperText Markup Language, or HTML for short. I don't need to abstract HTML, because HTML isn't that complicated. a for links, em for emphasis, strong for strong, q for quotes, kbd and var for inputtable strings and variables, img for images and so on.

But, the thinking goes, that's too complicated for Ordinary People (something of a myth really). So we reinvent it. The first clumsy reinvention of HTML I can remember is BBCode, which really just takes the angle brackets and replaces them with square brackets. If you need to remember the markup mapping, that's not actually much of an improvement. Why is [url=http://example.org]Example[/url] any better than the same in HTML? It's not. It's slightly more convenient for programmers though.

Then more recently, we've started getting things like Textile and Markdown. Of the two, I think Markdown is preferable, although I've recently been trying out Textile and it's just about okay.

But then there are wiki syntaxes. If you pick the main wiki engines, they all use different syntaxes. I've got the MediaWiki syntax burned into my brain. But if you try any other wiki engine, they all have different syntaxes, and it gets very annoying, very quickly. All this means is that I have to remember ten different ways of making something italic or making a link. What's the damn point? I know how to make a link. I use an a tag. At the very most, I can see space for a separate wiki syntax - only one though. My choice would be the MediaWiki syntax. I've actually gotten to the point of not contributing to wikis that use anything other than MediaWiki syntax, and I'd like it if other people were to follow the same rule. We should turn the MediaWiki syntax into a de-facto standard for all wikis, just for pragmatic reasons. The vast majority of people likely to click 'edit' on a wiki article will probably start on Wikipedia or another MediaWiki server. And, well, the syntax has been pretty well tested - en.wikipedia.org has over eight million registered users and a shit ton of unregistered users. If you have a wiki that uses anything other than MediaWiki syntax, you better have a damn good reason. I haven't yet seen a damn good reason.

There is maybe a case for Textile and Markdown. But no more of them. There's too many already.

And if you are thinking of allowing people to post rich text, your starting point ought to be HTML. Yes, people will have to learn HTML. So provide a link to a funky little pop-up with all the basics you need. And, yes, you'll need to run the input through an HTML input filter like Tidy (be sure that you use one that is aware of XSS and other potential security hazards, and preferably one which has a comprehensive test suite). They exist in most languages you are likely to be building applications in, so what's the problem?

Just don't make me learn yet another "lightweight markup language" or I'll get a crane, wait until you are sitting on the toilet and drop on top of you a collection of all the books ever written about HTML, including my old copy of Teach Yourself HTML 4.