To illustrate some of the things I spoke about at BarCamp, I have been putting together a really simple parser that turns (X)HTML pages containing microformats into RDF.
It takes a page, runs it through Tidy, runs all the GRDDL stylesheets across it, loads the result into RAP and outputs it as (somewhat messy) RDF/XML.
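As a rough sketch of the first two steps of that pipeline (and not the actual 37-line parser), something like the following would tidy a page and then run a list of stylesheets over it. The function names and the stylesheet paths are my own placeholders, it assumes PHP's tidy, dom and xsl extensions are installed, and the RAP loading step is left out entirely.

```php
<?php
// Step 1: repair arbitrary HTML into well-formed XHTML (needs the tidy extension).
function tidyToXhtml($html)
{
    $config = array(
        'output-xhtml'     => true, // emit XHTML rather than HTML
        'numeric-entities' => true, // avoid named entities an XML parser may reject
    );
    return tidy_repair_string($html, $config, 'utf8');
}

// Step 2: run each stylesheet over the page (needs the dom and xsl extensions).
// Returns one output string per stylesheet, in the order the paths were given.
function applyStylesheets($xhtml, array $stylesheetPaths)
{
    $page = new DOMDocument();
    $page->loadXML($xhtml);

    $outputs = array();
    foreach ($stylesheetPaths as $path) {
        $xsl = new DOMDocument();
        $xsl->load($path);

        $processor = new XSLTProcessor();
        $processor->importStylesheet($xsl);
        $outputs[] = $processor->transformToXml($page);
    }
    return $outputs;
}
```

With a hypothetical hcard.xsl on disk, usage would be along the lines of applyStylesheets(tidyToXhtml($html), array('hcard.xsl')).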
Currently, the parser reads hCard, hCalendar, XFN, DC-Extract (not a microformat with a capital 'M', but still parseable) and rel-license. I also have support planned for hReview, GeoURLs and hDOAP.
What's neat about the parser? Once the data is in an RDF document, you can easily run SPARQL queries over the pages. One of the planned functions that I hope to add is an RSS adder. This would let you request that every URL that is likely to be a web page be checked for an attached RSS feed, which we might then be able to add.
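The RSS adder does not exist yet, so this is only a guess at how the check might work: look for feed autodiscovery links (link rel="alternate" with an RSS type) in the head of each page. The function name is hypothetical, and it only needs PHP's dom extension.

```php
<?php
// Hypothetical helper: list the RSS feed URLs a page advertises through
// <link rel="alternate" type="application/rss+xml" href="..."> elements.
function findFeedLinks($html)
{
    if (trim($html) === '') {
        return array(); // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world markup is sloppy, so silence the parser warnings.
    @$doc->loadHTML($html);

    $feeds = array();
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel can hold several space-separated values.
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $type = strtolower(trim($link->getAttribute('type')));
        if (in_array('alternate', $rels) && $type === 'application/rss+xml') {
            $feeds[] = $link->getAttribute('href');
        }
    }
    return $feeds;
}
```

Relative href values are returned as found, so a real version would still need to resolve them against the page URL.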
What is nice about RDF is that it becomes almost like a universal format, and it is trivially easy to get data out of it. 
I'm planning to add some new functions to it in order to extract more data from different places - Flickr parsing, better Twitter parsing. I now have a PHP5-based workflow that I can easily edit. For instance, Twitter supports XFN, but there is better data available by writing a domain-specific parser for it. It can be as simple as writing:
if (strstr($url, "twitter.com")) { /* $stylesheets[] = the twitter parser */ }
Of course, we can also use weak string matching to speed up the process:
if (strstr($data, "vcard")) { /* add the hCard parser */ }
This is simply to reduce resource usage - parsing stylesheets isn’t the quickest of processes. 
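Put together, the two kinds of conditional might look something like this. It is a minimal sketch: the function name and the .xsl file names are made up for illustration, and the vevent check for hCalendar is my own addition following the same pattern as the vcard one.

```php
<?php
// Decide which stylesheets are worth running, using cheap substring checks.
// $url is the page address, $data is the (tidied) page source.
function selectStylesheets($url, $data)
{
    $stylesheets = array();

    // Site-specific: a dedicated parser for Twitter pages.
    if (strpos($url, 'twitter.com') !== false) {
        $stylesheets[] = 'twitter.xsl'; // hypothetical file name
    }

    // Content sniffing: only run a microformat parser if its root
    // class name appears somewhere in the page.
    if (strpos($data, 'vcard') !== false) {
        $stylesheets[] = 'hcard.xsl'; // hypothetical file name
    }
    if (strpos($data, 'vevent') !== false) {
        $stylesheets[] = 'hcalendar.xsl'; // hypothetical file name
    }

    return $stylesheets;
}
```

The matching is deliberately weak: a page that merely mentions the word "vcard" gets the hCard stylesheet run over it for nothing, which costs time but does no harm.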
The parser itself is 37 lines of PHP5. It’ll grow as I add domain-specific and site-specific conditionals. 
You can access the microformat-to-RDF parser at: 
The mf/rdf means that it may be possible to start offering other parsing possibilities - mf/xml, mf/rss etc. 
If there’s a problem, either post a comment or come and chat - I’ll be in #swig and #microformats most of this evening. 
This brings me on to another little service I’ve started offering over REST: Tidy.
Tidy is a fantastically powerful C application that takes badly marked up (X)HTML documents and tries to make them slightly more sensible - with validation, XML well-formedness and the like being the end result.
If you use the microformat to RDF parser, I run the HTML I get through Tidy anyway, so there’s no need to bother doing it for that. 
I am using Tidy because I really like using XSLT, and XSLT doesn’t run on HTML that isn’t well-formed XML - of course, Beautiful Soup can be used where XSLT can’t.
The REST interface for Tidy is: 
By default, my Tidy interface returns XHTML (even if the page you give it is not XHTML). This is an utterly greedy, mostly practical, somewhat philosophical decision - I need XML. Non-XML standards don’t deserve existence unless absolutely necessary.

But if you don’t want automatic XHTML conversion to take place, use: 
tools.opiumfield.com/tidy/h/URL 
tools.opiumfield.com/tidy/html/URL 
Yes, both ‘h’ and ‘html’ work fine. 
Try not to bombard my server, and try to cache results wherever practical. And if you are in a position to provide similar services, then please do so - drop me an e-mail and I’ll point some of my external traffic to your script.
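If you are calling the Tidy service from PHP, a simple file cache along these lines would keep repeat requests off my server. This is only a sketch: the function names, the cache file naming and the one-hour lifetime in the usage note are arbitrary choices, and the "http://" scheme plus passing the page URL through unencoded are assumptions about how the service expects to be called.

```php
<?php
// Build the non-converting Tidy URL using the tidy/html/URL pattern above.
// Assumption: plain http, and the page URL is appended as-is.
function tidyHtmlUrl($pageUrl)
{
    return 'http://tools.opiumfield.com/tidy/html/' . $pageUrl;
}

// Fetch $serviceUrl at most once every $ttl seconds, keeping the response
// body in $cacheDir. $fetch is any callable that takes a URL and returns
// the body as a string, or false on failure.
function cachedFetch($serviceUrl, $cacheDir, $ttl, $fetch)
{
    $file = rtrim($cacheDir, '/') . '/tidy-' . md5($serviceUrl) . '.cache';

    clearstatcache(); // make sure is_file/filemtime see the current state
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file); // still fresh: no network request
    }

    $body = call_user_func($fetch, $serviceUrl);
    if ($body !== false) {
        file_put_contents($file, $body); // only cache successful responses
    }
    return $body;
}
```

A call would then look like cachedFetch(tidyHtmlUrl('http://example.org/'), sys_get_temp_dir(), 3600, 'file_get_contents'), which relies on allow_url_fopen being enabled.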
Tags: rdf, semweb, microformats, grddl, barcamplondon2, tidy, xhtml, html, xml 
