Tom Morris



2009.11.16

  No. 1017 

When parsing HTML using regex is okay
2009-11-16T11:03:52Z

There's been a lot of fuss over on Stack Overflow, and consequently on Metafilter and on Jeff Atwood's twitter, about people parsing HTML with regular expressions, along with the advice to never do that and tales of how Cthulhu will eat your soul.

In general, never parsing HTML with regular expressions is good advice.

But sometimes it isn't. I'll give an example of a case where regexes were the right tool. You may find that it applies to you.

A while back, I had over 2 GB of HTML to parse - 77,000 files. Every file had exactly the same structure. I only wanted to extract two pieces of data from each file - the contents of the h1 element and the contents of a div with the class of 'author' or something similar.

I wrote some Ruby code to parse each page using Nokogiri or Hpricot or whatever was then the preferred HTML parsing library. But this was slow. It was taking about 4 or 5 seconds to parse each file. In general, that's pretty fast, but when you've got 77,000 to do, that's not so good. At around 4.5 seconds a file, that works out to roughly four days.

I rewrote the code in Java so that it would open each file with a BufferedReader, call readLine for each line of the file, use the String startsWith method to see if it was the right line, then use regexes to extract the stuff I was interested in. I compiled and ran this code: it went from four days to about ten minutes. Which is just as well, because I had made a goof-up in the code that I only discovered after running it - if I had only discovered that goof-up four days later, I would have been a lot more angry than I was discovering it after ten minutes.
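For the curious, here is a minimal sketch of that line-by-line approach. It is not the original code: the class name LineScraper, the exact markup and the assumption that each element sits on a single line starting with its opening tag are all made up for illustration, and the trim() call is an extra convenience to tolerate indentation.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: assumes every file puts the h1 and the author div
// each on one line that starts with the opening tag.
final class LineScraper {
    // Cheap prefix checks, so the regexes only run on candidate lines.
    private static final String H1_PREFIX = "<h1";
    private static final String AUTHOR_PREFIX = "<div class=\"author\"";

    // Compiled once and reused for every line of every file.
    private static final Pattern H1 =
            Pattern.compile("<h1[^>]*>(.*?)</h1>");
    private static final Pattern AUTHOR =
            Pattern.compile("<div class=\"author\"[^>]*>(.*?)</div>");

    /** Returns {title, author}; an entry is null if its line was not found. */
    static String[] extract(BufferedReader reader) throws IOException {
        String title = null;
        String author = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (title == null && trimmed.startsWith(H1_PREFIX)) {
                title = firstGroup(H1, trimmed);
            } else if (author == null && trimmed.startsWith(AUTHOR_PREFIX)) {
                author = firstGroup(AUTHOR, trimmed);
            }
            if (title != null && author != null) {
                break; // both fields found, no need to read the rest
            }
        }
        return new String[] {title, author};
    }

    // Returns the first capture group of the pattern, or null if no match.
    private static String firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1).trim() : null;
    }

    // Prints one tab-separated line per file path given on the command line.
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                String[] fields = extract(reader);
                System.out.println(path + "\t" + fields[0] + "\t" + fields[1]);
            }
        }
    }
}
```

The startsWith check is what keeps this cheap: most lines are rejected with a plain string comparison and never reach the regex engine. It only works because the files are known to be uniform; an h1 split across two lines would simply come back as null.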

I've told this story to people, and there seem to be two possible reactions. There is the "OMG Ruby is so slow, I knew that not learning it and sticking with Java was sensible" reaction, and there's the sensible reaction - I could have re-written it in Ruby and gotten the same performance benefits by using IO rather than the XML/HTML parsing library - it just happens that I know the Java IO library better than I know the Ruby IO library. Part of what was probably taking the time in Ruby was the fact that I was constructing a large number of objects extremely quickly, and Ruby's GC is notoriously painful in a non-generational way compared to the JVM's generational GC.

The key thing is whether or not you are working with files that are all structured in a broadly similar way. If you've got 77,000 files that are all very similar and you know exactly what you want from them, sometimes for performance, parsing them as a bag of lines and strings is much more sensible than parsing them into a DOM. These very limited circumstances really provide the exception that proves the rule. If you don't have a very good reason to be parsing XML or HTML using regexes rather than an XML or HTML parsing library, you shouldn't be doing so. (The same is true with RDF: use the right level of abstraction - unless you are logged into the swig IRC room all day every day and know the RDF specs like the back of your hand, you should be using an RDF library not an XML library to parse RDF documents.)

Faithful swell numbers and power by counting pension collectors and stamp buyers as religious
2009-11-16T17:44:41Z

From the Torygraph: The villagers of Kinoulton in Nottinghamshire have breathed new life into their church by introducing into it a cafe and post office.

Great. This means that if you want to post a letter, collect your pension or benefits, buy some stamps or renew your road tax, you are now counted as a church-goer, swelling the influence of the church. Imagine the outrage if you had to go into a mosque or synagogue to do these things. Spending time in other people's religious buildings erodes my epistemic credibility!

It's all part of the dizzyingly anti-secular crapness of British society. See also "Faith groups to be key policy advisers": Mr Denham argued that Christians and Muslims can contribute significant insights on key issues, such as the economy, parenting and tackling climate change.

Denham always struck me as being a bit soft in the 'ed, even before he took the inherently soft-headed role of "communities minister". I mean, fans of funk music, Java programmers, philosophy graduate students, Twitter users, people with beards and residents of Sussex can probably also contribute significant insights on the economy, parenting and climate change, but we don't give them a special committee in government (much to my disappointment, as I'd be happy to serve on any of them). Christians and Muslims (and every person of every faith or none) already have a role in government and decision making: they can vote, they can lobby, they can assemble freely with their fellow citizens, they can organise themselves into pressure groups and political parties for the purpose of lobbying.



