Oh boy. RDF and Python together. Add an unhealthy dose of chocolate and hyper-paranoid military-strength public key encryption and I’m in ecstasy.


Seriously though. One of the dullest complaints I get is the “RDF is sooooo hard! My poor little head will never cope!” Usually, though, as with all such complaints, it’s voiced in the third person. Expressing other people’s ignorance is a lot easier than expressing your own. That’s why, for instance, we always say “How will people ever cope without religion?” but never put in the key data point that the person saying it gets on quite well without religion, thank you very much. (In addition, the person bemoaning the complexity of RDF is doing so having understood far more cryptic things like getting pixel-perfect CSS in IE5.)


The thing is that RDF isn’t necessarily very complex at all. Done right, RDF can be tremendously simple, and it can also be a very simple way of doing what is otherwise complex. If you can grok the basics of what a directed graph is, you’ve got most of the way there. There are bits which are slightly more irritating - reification and blank nodes, for instance.
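
To make the directed-graph idea concrete, here is a minimal sketch in plain Python (no RDFLib needed). Each statement is an edge running from a subject to an object, labelled with a predicate. The people and the `knows` helper are made up for illustration; only the foaf:knows URI is real.

```python
# A tiny directed graph held as a set of (subject, predicate, object) triples.
# The people are illustrative; a real graph would come from parsed RDF.
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

triples = {
    ("http://twitter.com/alice", FOAF_KNOWS, "http://twitter.com/bob"),
    ("http://twitter.com/alice", FOAF_KNOWS, "http://twitter.com/carol"),
    ("http://twitter.com/bob", FOAF_KNOWS, "http://twitter.com/carol"),
}


def knows(graph, subject):
    """Follow every foaf:knows edge leading out of `subject`."""
    return sorted(o for s, p, o in graph if s == subject and p == FOAF_KNOWS)


print(knows(triples, "http://twitter.com/alice"))
print(knows(triples, "http://twitter.com/carol"))
```

Note that the edges have a direction: in this toy data bob knows carol, but nothing says carol knows bob, so the second lookup comes back empty.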


Let’s take an example: taking a list of people on a social networking site and finding out who their friends in common are. You could do this by collecting together a list of all the people from the social networking site’s API, decoding the site-specific XML or JSON format they use and then iterating over the lot and joining them all together. Dull. You write lots of code just to perform a simple query. You have to assign them all places in some internal hierarchy-of-doom. Boring.


In the case of Twitter, I’ve done a lot of it for you. I’ve written an XSLT transformation to take Twitter’s API data and make it available as RDF/XML. You send one request to tools.opiumfield.com/twitter/[$username]/rdf and you get back an RDF file with all the stuff you need. You then load that into an internal representation and query it.


Let me walk you through some Python code that demonstrates it. It uses RDFLib. If you are on OS X, you should install the latest version of Python (RDFLib requires 2.4, but if you are on 10.4, you will only have 2.3.5 - run “sudo fink install python-2.5” and then run “easy_install -U rdflib” to add RDFLib).


Once you’ve got the upgrade and the library, we can step through the code line-by-line and see what it does:


import rdflib, sys


This simply imports the sys module and the rdflib module.


ts = rdflib.ConjunctiveGraph()


This creates a new object called ‘ts’ which is a ConjunctiveGraph. You know all that ‘social graph’ stuff that people have been waffling about? This is one of them. A graph model - a ‘network’. Which is ideal, really, for a network of friends.


querystring = ""


We are just instantiating this as a string as we are going to be appending to it in a loop in a second. God bless dynamic typing, right?


for i in sys.argv[1:]:


Here, we iterate over each item in the list of command-line arguments, skipping the first one, which is the name of the script itself.


    ts.parse("http://tools.opiumfield.com/twitter/" + i + "/rdf")


Here, we load in the RDF data for each user.


    querystring = querystring + "<http://twitter.com/" + i + "> <http://xmlns.com/foaf/0.1/knows> ?person . "


Here we construct part of what will become the WHERE clause of the query. Each pass through the loop appends one triple pattern to the query string: the user’s URI, then the foaf:knows property and finally the variable ‘person’. When we run the query, it looks for triples in the graph which match these patterns and returns the values bound to the variable. Because every pattern shares the same ?person variable, only people who match all of the patterns (that is, people every listed user knows) are returned. Each ‘clause’ is ended with a full stop and a space.
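
As a quick check of what that loop builds, this sketch assembles the same text for the three usernames used later in this post, without touching the network. The `build_patterns` function and the `KNOWS` constant are my own names for illustration, not part of the script.

```python
KNOWS = "<http://xmlns.com/foaf/0.1/knows>"


def build_patterns(usernames):
    """Return one foaf:knows triple pattern per username, all sharing ?person."""
    querystring = ""
    for name in usernames:
        querystring += "<http://twitter.com/" + name + "> " + KNOWS + " ?person . "
    return querystring


patterns = build_patterns(["tommorris", "adactio", "t"])
print("SELECT DISTINCT ?person WHERE { " + patterns + " }")
```

Printing the finished query like this is a cheap way to debug it before handing it to the graph.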


res = ts.query("SELECT DISTINCT ?person WHERE { " + querystring + " }")


This is where we run the query. It wraps the querystring variable in a SELECT query which asks for each DISTINCT ?person that satisfies the WHERE clause we built up, one pattern per name. (Without DISTINCT, the same person could be listed more than once if the query matched them in more than one way - whereas here it only returns each distinct entry.)
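
One way to picture what the query computes: each pattern yields the set of people one user knows, and the shared ?person variable intersects those sets. The sketch below mimics that in plain Python. It is only an illustration of the semantics, not how RDFLib evaluates SPARQL, and the data and the `common_friends` name are invented.

```python
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Hypothetical data standing in for the parsed Twitter RDF.
triples = [
    ("http://twitter.com/a", FOAF_KNOWS, "http://twitter.com/x"),
    ("http://twitter.com/a", FOAF_KNOWS, "http://twitter.com/y"),
    ("http://twitter.com/b", FOAF_KNOWS, "http://twitter.com/y"),
    ("http://twitter.com/b", FOAF_KNOWS, "http://twitter.com/z"),
]


def common_friends(graph, subjects):
    """Values of ?person that match one foaf:knows pattern per subject."""
    candidates = None
    for subject in subjects:
        known = {o for s, p, o in graph if s == subject and p == FOAF_KNOWS}
        # The shared variable means each extra pattern narrows the candidates.
        candidates = known if candidates is None else candidates & known
    # Rows are 1-tuples because the query selects a single variable.
    return [(person,) for person in sorted(candidates or set())]


for row in common_friends(triples, ["http://twitter.com/a", "http://twitter.com/b"]):
    print(row[0])
```

The point of doing it in SPARQL instead is that the query text stays this short however the data is shaped, and the library does the joining for you.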


The res variable then holds all the results, which we can loop over like a list. What is there left to do? Print ‘em out, of course. Since we are just doing a demo, we’ll print them to the shell.


for i in res:
    print(str(i[0]))


The reason it’s i[0] is that each result is a row holding one value per selected variable, and ?person is the only variable we selected. If you run it interactively, you’ll see.


Let’s see this script in action:


darwin:~/bin tom$ python friendscmp.py tommorris adactio t


We invoke the script with a list of arguments - in this case, Jeremy Keith, Tantek Çelik and myself. The script goes off, gathers the RDF representation of their friends list and then queries it for people who we all know (that is, who we have all followed on Twitter).


http://twitter.com/BenWard
http://twitter.com/cackhanded
http://twitter.com/cubicgarden
http://twitter.com/arielwaldman
http://twitter.com/drewm
http://twitter.com/codepo8
http://twitter.com/briansuda


Consider this a kind of Hello World of RDF querying. Where to go from here? Well, you can beef up your SPARQL-fu so you can make more elaborate queries. I’d suggest you start with Leigh Dodds’s tutorial on XML.com, and then maybe punish yourself with the specification document if that’s your kind of thing.


What else is there to query? Well, today I’ve been working on mapping last.fm data. For instance, here are my friends on last.fm in RDF/XML. You could play about with mashing up data between services. How about DBpedia? Just as you can query Twitter friends lists, you can do the same for - oh - the whole of Wikipedia. If you are playing with DBpedia, be sure to do it interactively in the Python shell so you can discover things like the language tagging built in to RDF and used heavily in the DBpedia dataset. Yep. I18n is built in: any plain string literal can carry a language tag. And Unicode. Unicode rocks. And if you are a Pythonista, you can grok Unicode quite a lot more easily than everybody else since your language of choice has native Unicode support.
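
The per-literal language support mentioned above means a literal is really a pair: the text plus an optional language tag. As a rough sketch of why that is handy, here plain tuples stand in for RDF literals (RDFLib has its own literal type, which this does not use). The labels and the `pick_label` helper are invented for illustration.

```python
# Each pair is (text, language tag), standing in for a language-tagged literal.
labels = [
    ("Munich", "en"),
    ("München", "de"),
    ("Monaco di Baviera", "it"),
]


def pick_label(literals, preferred, fallback="en"):
    """Return the text tagged with `preferred`, else the fallback language."""
    by_lang = {lang: text for text, lang in literals}
    return by_lang.get(preferred, by_lang.get(fallback))


print(pick_label(labels, "de"))
print(pick_label(labels, "fr"))  # no French label here, so this falls back to English
```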


This is all well and good for data which we’ve published explicitly as RDF. But what about Microformats? Microformats embed data into the HTML of web pages. Well, if you’ve got well-formed XHTML, you can run it through Triplr and get data out. RDFLib-compatible GRDDL is something I may work on soon.


As for what I want? I’d really like someone to port RDFLib to Ruby. Come on, we’ve all got a deep, burning Rails envy. There are lots of Rails developers we can infect with this sordid, evil RDF stuff.


You can download the source code for the script used here: friendscmp.py (consider it GPLed) 