Tom Morris

Using FOAF and Cwm for simple list compilation

Document is currently unfinished.

I was talking with someone the other day who wanted to compile all their friends together from different sources and have a page on their site that lists their whole social network. With microformats, such a page is genuinely useful - you can point software at it in order to replicate your social network. Dopplr, for instance, is a social network that uses the XHTML Friends Network (XFN) schema to import contacts, saving you from having to add all your friends again.

But what if you don't have all your friends' data available? Surely there must be an easy way to compile it all together and use it? This is where Friend of a Friend (FOAF) comes in. FOAF is an RDF-based specification that lets you describe complex social relationships, especially if combined with the Relationships Ontology. In the following tutorial, I will show you how you can pull in a number of your own FOAF files to produce a master list, and then transform that master list into a static HTML page using XSLT. The static HTML page will use both eRDF and various microformats.

The source data will come from a number of sources - LiveJournal, Twitter and Flickr. Extra services are easy to add.

The approach is to hold all the data in an RDF file written in N3 and to use a few scripts to do the conversion. The advantage is that because RDF is 'intermingleable', you can take data from a number of places, combine it and turn out HTML. Simple scripts that run periodically do the hard work for us. Because it's RDF, the combined data is also available for other tools to reuse.

Requirements

Downloads

An RDF primer

How does this work? Well, our RDF file is written in Notation3 (or N3). N3 is a really nice little format that is to RDF/XML (how most people see RDF) as JSON is to XML. It's a non-XML serialization, and is very easy for humans to write.

RDF works by putting together triples. Triples are literally statements in three parts. The first part is called the "subject", the second part is called the "predicate" and the third part is the "object". The subject and predicate are basically URLs, and the object is either a URL or a literal, like a piece of text or a number. For literals, you can also use XML Schema datatypes to restrict them - for instance, you may want to use the XSD datatype "dateTime" to mark a value as a timestamp. You can also put language tags on strings to mark that a string is in a particular language - for instance, "London" could be given the "en" language code, while "Londres" is the French version so could be given "fr".

Here's an example:

:google dc:title "Google"; dc:lang "en".

This simply states the resource #google (most people define the ':' namespace to be the local document) has a Dublin Core title "Google", and the Dublin Core language is "en" (English). Although our N3 will end up more complex than this, N3 is not hard to get used to.

Using Cwm, you can convert to and from N3. If you have N3 that you want in RDF/XML, do this:

cwm foo.n3 --rdf > foo.rdf

For RDF-to-N3:

cwm --rdf bar.rdf --n3 > bar.n3

There are a variety of ways in which you can actually 'encode' your RDF within either format - sort of syntactical shortcuts. To use our Google example:

:google dc:title "Google". :google dc:lang "en".

means the same as

:google dc:title "Google"; dc:lang "en".

Why is that? Well, the semi-colon in the second example finishes one triple but says that the next triple will have the same subject. This is kind of similar to how in CSS, the following are the same:

h1 { font-weight: bold; } h1 { font-family: "Arial"; }

h1 { font-weight: bold; font-family: "Arial"; }

The semi-colon in N3 says 'this has the same subject', just as not closing the curly braces in CSS does. In fact, there seems to be a close analogy between CSS and N3, only N3 is for data while CSS is for style.
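If it helps to see the shorthand as data, here is a minimal Python sketch (not Cwm's own parser, just an illustration) that treats triples as plain tuples. The `expand` helper is a hypothetical name of mine; it turns the semi-colon form back into one full triple per predicate-object pair.

```python
def expand(subject, predicate_objects):
    """Expand 'subject p1 o1; p2 o2.' into one full triple per pair."""
    return [(subject, predicate, obj) for predicate, obj in predicate_objects]


# The long form: every triple repeats its subject.
long_form = [
    (":google", "dc:title", "Google"),
    (":google", "dc:lang", "en"),
]

# The semi-colon form: the subject is written once and shared.
short_form = expand(":google", [("dc:title", "Google"), ("dc:lang", "en")])
```

Both lists hold the same two triples, which is all the semi-colon is saying.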

As mentioned above, RDF has another neat function, which is that you can encode languages directly in to strings. This is how you do it with Notation 3:

:google dc:title "Google"@en; dc:lang "en".

This says that the resource 'google' has the dc:title "Google", and that the string "Google" is in language 'en'. Why dc:lang too? The dc:lang refers to the language of the resource, but the '@en' refers to the language of that string. Here's another example:

:usa dc:title "United States of America"@en, "États-Unis"@fr, "Estados Unidos de América"@es, "Stati Uniti d'America"@it.

Here, we are not saying anything else about the #usa resource; we are simply giving it several titles, one per language. The comma works like the semi-colon, except that the next triple shares both the subject and the predicate.
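Once you have language-tagged titles, picking the right one is a simple lookup. The following is a rough Python sketch, assuming the literals have already been read into memory; the `Literal` type and the `title_for` function are my own illustrative names, not part of Cwm or any RDF library.

```python
from typing import NamedTuple, Optional


class Literal(NamedTuple):
    """A string literal with an optional language tag, like "Google"@en."""
    text: str
    lang: Optional[str] = None


TITLES = [
    Literal("United States of America", "en"),
    Literal("États-Unis", "fr"),
    Literal("Stati Uniti d'America", "it"),
]


def title_for(literals, lang, fallback="en"):
    """Return the text tagged with `lang`, or the fallback language if absent."""
    by_lang = {literal.lang: literal.text for literal in literals}
    return by_lang.get(lang, by_lang.get(fallback))
```

Calling `title_for(TITLES, "fr")` picks the French title, while a language that is not present falls back to English.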

Beyond this, we shall go no further in to RDF. You can map a lot of data in the RDF format. If you want to know more, I'd suggest you view the GetSemantic wiki page.

Alternatives

There are other ways of doing what I'm going to do. Cwm rules are a series of logical rules in a Notation3 document, that you process using Cwm. They in effect work like a 'pipeline' - you put RDF data through it, some processing is done, and you get the result.

You do not necessarily have to use Cwm Rules to process RDF data. In fact, sometimes it's a bad idea, since sometimes it's not simple to map your brain processes into Cwm Rules. In that case, you can use a variety of other tools to process the RDF. I'd suggest rdflib for Python. See Semantic Web Tools on the GetSemantic wiki for more details on tools you can use to code with.

Getting underway

Cwm has a number of ways of getting data out of it. The first is the 'log:outputString' method, which is a nifty, built-in way of producing reports from RDF data. I've experimented with this, and have found it useful but probably not quite powerful enough for outputting large quantities of XML. Instead, we shall use another method whereby we get RDF/XML out of Cwm and use XSLT to transform it. Cwm becomes part of a 'pipeline' for producing webpages - it is sort of equivalent to the 'Model' of a Model-View-Controller system. You use it to describe the sources of data. The Controller element becomes the XSLT, and the View becomes (X)HTML and CSS (I'd love to do XSL-FO with the intention of building printed objects, but that'll have to wait for another tutorial).

The first thing we will need is some data. We can add multiple data sources, but we shall start with just two - Twitter and LiveJournal.

I have a LiveJournal account at tommorris.livejournal.com, but I don't use it much. LiveJournal users have a FOAF profile with their interests and friends in. Mine is here ([lj-url]/data/foaf). I've accustomed myself to reading RDF/XML, but we can take a look at it in Notation3 by typing in:

cwm http://tommorris.livejournal.com/data/foaf --n3

The RDF that LiveJournal returns is a bit of a mess. We need to clean it up. The problem with the RDF it returns is that it's not grounded. A grounded RDF graph means that the subject has a URL. In English, it's the difference between the following:

Joking aside, there are no identifiers in the RDF file, so we need to create them. Since we are only using it internally, we can use the person's LiveJournal blog as an identifier. It's not perfect, but it'll do. We need a rules file for this. Below is the one I wrote for myself. We'll go through it section-by-section explaining what it does in a second.

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix : <#>.
{
  ?n foaf:knows ?y.
  ?y a foaf:Person.
  ?y foaf:weblog ?w.
  ?y foaf:member_name ?name.
  ?y foaf:nick ?nick.
  ?y foaf:tagLine ?tagline.
  ?y foaf:image ?image.
} => {
  :me foaf:knows ?w.
  ?w a foaf:Person.
  ?w foaf:member_name ?name.
  ?w foaf:nick ?nick.
  ?w foaf:homepage ?w.
  ?w foaf:tagLine ?tagline.
  ?w foaf:image ?image.
} .

This looks like a total mess, right? Well, let's decode it a bit. The 'words' that begin with a question mark are variables. The => means 'implies'. Each full-stop ends a piece. Before the => are the 'inputs' and after are the 'outputs'. One way to think about this is as a set of 'rules'. What are we doing?

  1. Before we start, we declare two 'namespace' prefixes - foaf and blank. The FOAF one goes to the FOAF namespace, while the blank one just goes to <#>
  2. Firstly, we declare ?y as whoever ?n knows. In the LiveJournal FOAF, there are only FOAF relationships between the primary subject (me) and other people.
  3. Then we check that ?y is a person. 'a' is a shorthand for rdf:type, which is how you state that something is a particular class.
  4. Then we look at the property foaf:weblog (which we know they will have, since they are on LiveJournal) and assign it to ?w.
  5. We then go on to assign member_name, nick, tagline and image to variables with similar names.
  6. Then comes the => (implication). We are moving into 'output'.
  7. Next we say that ":me" (which will become #me in the final document because of the prefix) knows ?w (the URL of that person's weblog).
  8. Then we redeclare that the person is a foaf:Person.
  9. We then re-add all their attributes by name (there is an automated way of doing this if you don't need to change any), except for foaf:weblog, which we are changing to foaf:homepage.
    The reason we are doing that is because we will be merging the outputted RDF file with other files that use foaf:homepage. If we wanted to not pollute the Semantic Web, we might be more interested in using a 'local' property.
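If the rule syntax is hard to picture, the same logic can be sketched in plain Python. This is only an emulation under simplifying assumptions, not how Cwm works internally: triples are tuples of strings, blank nodes are written as "_:b1", and each person has at most one value per property. The `ground_friends` function and `COPIED` list are my own names.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
COPIED = ["member_name", "nick", "tagLine", "image"]  # properties kept as-is


def ground_friends(triples, me="#me"):
    """Re-key every known person by their weblog URL.

    `triples` is a list of (subject, predicate, object) string tuples.
    """
    props = {}  # subject -> {predicate: object}; assumes one value each
    for s, p, o in triples:
        props.setdefault(s, {})[p] = o
    known = {o for s, p, o in triples if p == FOAF + "knows"}

    required = [FOAF + "weblog"] + [FOAF + name for name in COPIED]
    out = set()
    for y in known:
        d = props.get(y, {})
        # Like the rule, fire only when every pattern before => matches.
        if d.get(RDF_TYPE) != FOAF + "Person" or any(r not in d for r in required):
            continue
        w = d[FOAF + "weblog"]
        out.add((me, FOAF + "knows", w))
        out.add((w, RDF_TYPE, FOAF + "Person"))
        out.add((w, FOAF + "homepage", w))  # weblog becomes homepage
        for name in COPIED:
            out.add((w, FOAF + name, d[FOAF + name]))
    return out
```

Note that, as with the rule, a friend who is missing any of the listed properties is silently dropped from the output.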

Let's run this through Cwm. To do this, we use the following command:

cwm http://tommorris.livejournal.com/data/foaf --filter=lj-cleanup.rules.n3

For lj-cleanup.rules.n3, use wherever you saved the above rules file (it can be on the web - Cwm takes file names and URLs as equivalent). What you should see stream in front of you is a cleaned-up RDF file as N3. Pop " > lj-output.n3" on the end of that command and it'll save it in to lj-output.n3.

Combining it with Twitter

Twitter does not provide a FOAF file, but I do. I have set up a web service which works like this: <http://tools.opiumfield.com/twitter/username/rdf>. It returns both FOAF and SIOC data. SIOC stands for Semantically-Interlinked Online Communities, and is used to describe the relationships between users, posts and communities. We do not need to concern ourselves with SIOC here, but it's darn cool nonetheless.

Since I've written the Twitter RDFizer, I know it works, and we don't need to 'ground' our URIs. We can just pull in the data we want. We should be able to get away with two rules to combine the Twitter data with the LiveJournal data. Here they are:

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix : <#>.
@prefix lj: <lj-output.n3#>.
{
  ?profile foaf:primaryTopic ?user.
  ?user foaf:knows ?y.
  ?y ?p ?o.
} => {
  :me foaf:knows ?y.
  :me a foaf:Person.
  ?y ?p ?o.
} .
{ lj:me ?p ?o. } => { :me ?p ?o. } .

What does this code do, then? Let's look at it in sequence again.

  1. Declare our namespaces. The first two are the same as before. The third prefix is referring to our previous file, the output from LiveJournal.
  2. What we now do is find out who the primary topic of the file is. It contains data about a lot of users, but this should give us the URL of the user who we ask for (i.e. http://twitter.com/tommorris). Next we find out who they know and set them as ?y. Then we get all the triples about ?y and set them to ?p (predicate) and ?o (object).
  3. Next we declare that #me (our new 'me') knows ?y. Then we specify that #me is a Person. Then we redeclare all our triples (?y ?p ?o).
    In the previous example, we redeclared them all manually, but now we just do it for all of them.
  4. Finally, we read in the #me from the LiveJournal file, and just redeclare it all as #me in the new file.
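The two combining rules can be emulated in Python in the same spirit. Again, this is a sketch under assumptions rather than a description of Cwm's internals: triples are string tuples, the two sources are kept in separate lists for clarity (Cwm reads them into one graph), and `combine` and its parameter names are hypothetical.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def combine(twitter, lj, lj_me="lj-output.n3#me", me="#me"):
    """Merge Twitter friends and LiveJournal friends under one '#me'.

    `twitter` and `lj` are lists of (subject, predicate, object) tuples.
    """
    out = set()
    # Rule 1: find the primary topic, who they know, and everything about them.
    topics = {o for s, p, o in twitter if p == FOAF + "primaryTopic"}
    for user, p, y in twitter:
        if user in topics and p == FOAF + "knows":
            about_y = [t for t in twitter if t[0] == y]
            if about_y:  # the rule needs at least one triple about ?y
                out.add((me, FOAF + "knows", y))
                out.add((me, RDF_TYPE, FOAF + "Person"))
                out.update(about_y)
    # Rule 2: everything said about the LiveJournal 'me' is said about '#me'.
    for s, p, o in lj:
        if s == lj_me:
            out.add((me, p, o))
    return out
```

Because the result is a set of triples, a friend who turns up in both sources with identical statements is only stored once.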

We save our rules file as 'twitter-combine.rules.n3'. Now we can finally combine the two files together pretty seamlessly in Cwm:

cwm --n3 lj-output.n3 --rdf http://tools.opiumfield.com/twitter/tommorris/rdf --filter=twitter-combine.rules.n3 --rdf --bySubject > combined.rdf

We have some changes here - we declare types for our input files (because our LiveJournal data is N3 and our Twitter data is RDF/XML). I've also added "--rdf" at the end, which tells cwm to give us back RDF/XML. Also at the end is "--bySubject", which tells cwm to sort our RDF/XML by subject. This basically tells the XML creator not to use any fancy tricks, but to give us a consistent XML for all the triples. This will be important when we come to parse the data using XML tools, as we are about to.

A quick JSON aside

Before we get into XML and XSLT, I should point out that those who simply want to use this data in JavaScript, or in a programming language that supports JSON, can use an online service called Triplr to get JSON back from this new combined data.

To do this, you need to upload the combined.rdf file to the web. Once it is online, add <http://triplr.org/json/> before the URL where it is, and you will get your RDF data back as a series of triples in JSON. You can then programmatically parse this. You can also do this with a number of different language toolkits which can return a native array for similar parsing.

And on to XSLT

The next part of the process is to transform the XML file into HTML. To do this, we use a programming language called XSLT. XSLT stands for Extensible Stylesheet Language Transformations, and is designed to help you easily turn an XML file into another XML file format. For use with HTML, it is quite flexible. You can use it to produce either XHTML or that HTML stuff. XHTML output is done using the 'xml' output method, while HTML is done with 'html'. Let's get started then.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<!-- code goes here -->
</xsl:stylesheet>

This is an empty XSLT 1.0 file with the namespaces we need declared, in this case RDF and FOAF. If your RDF file uses any other namespaces, declare them here too. You need to make sure that they are exactly the same URIs as you are using in your RDF file.

If you have access to OxygenXML or a similar XML editor, it makes life a lot easier since you can debug your XSL.

The difficult bit of using XSLT is getting your URLs to connect. The XPath you need to jump from one template to another is quite simple:

ancestor::rdf:RDF/*[@rdf:about=current()/@rdf:resource]/dc:title[@xml:lang='en']

Let's work our way through this - the first step, "ancestor::rdf:RDF", gets us back to the root node, from which we can work our way down to a different subject node. We use a '*' for the next step, because RDF/XML can use the element name to specify the rdf:type. So, a foaf:Person may be declared using a foaf:Person element, or using a generic rdf:Description element with an rdf:type sub-element. Since we know the subject we are looking for, we can then require that the rdf:about attribute of this element has the same value as the rdf:resource attribute of the node we are looking it up from. The last step specifies the property we are after. Variations on this will allow you to do lookups by URL.
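The same 'follow rdf:resource to the node with a matching rdf:about' lookup can be tried outside XSLT. Below is a small Python sketch using only the standard library's ElementTree. The sample document, the example.org URL and the `nicks_of_friends` function are made up for illustration, and it looks up foaf:nick rather than dc:title.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
FOAF = "{http://xmlns.com/foaf/0.1/}"

# Made-up sample in the flat, one-node-per-subject shape described above.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="#me">
    <foaf:knows rdf:resource="http://example.org/alice"/>
  </rdf:Description>
  <foaf:Person rdf:about="http://example.org/alice">
    <foaf:nick>alice</foaf:nick>
  </foaf:Person>
</rdf:RDF>"""


def nicks_of_friends(rdf_xml, subject):
    """Return the foaf:nick of everyone that `subject` foaf:knows."""
    root = ET.fromstring(rdf_xml)
    # Index each top-level subject node by rdf:about, whatever its element
    # name is (this plays the role of the '*' step in the XPath).
    by_about = {el.get(RDF + "about"): el for el in root if el.get(RDF + "about")}
    nicks = []
    for knows in by_about[subject].findall(FOAF + "knows"):
        target = by_about.get(knows.get(RDF + "resource"))
        if target is not None:
            nick = target.findtext(FOAF + "nick")
            if nick:
                nicks.append(nick)
    return nicks
```

Building the index once is cheaper than re-scanning the document for every lookup, which is worth bearing in mind if your friends list is long.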

Explaining the process is complex, so I have provided a whole bunch of files to show you the process:

Some tips for HTMLizing

For XSLT, I'd strongly recommend Michael Kay's XSLT reference (Amazon, Amazon UK) - or O'Reilly's XSLT 1.0 pocket reference (Amazon, Amazon UK). Online, the zvon.org reference really is a brilliant reference for both XSLT 1.0 and XPath.

Automating the process

I'm pretty sure that you don't want to go through this process every time you add a friend. If you are using a machine that supports crontab, you could automate the process. Crontab lets you specify a command that is run at a specific time or date. For uploading your FOAF and HTML files, a daily update schedule would probably be appropriate. Just bundle up the rules processing step and the XSLT processing step into a shell script and then specify an upload step using curl. If you are using OS X, you may find that certain GUI applications let you do automated uploads - Interarchy, Radio UserLand and the OPML Editor come to mind. In those cases, just have your script copy the file into the relevant location and then a 'folder watching' script should pick it up and process it.

If you are on OS X, I'd suggest looking up CronniX to easily add commands to your crontab.

The commands you need to line up are: cwm, xsltproc and curl --upload-file.
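As a rough idea of how those three commands line up, here is a Python sketch that builds the command lines and, if you call `run_steps`, executes them in order. The file names friends.xsl and friends.html, the upload URL and the function names are all assumptions of mine; the cwm arguments are the ones used earlier in this tutorial.

```python
import subprocess


def build_steps(lj_n3, twitter_url, rules, stylesheet, upload_url):
    """Return (argv, stdout_path) pairs: cwm, then xsltproc, then curl."""
    return [
        # Combine the sources; cwm writes to stdout, so we capture it.
        (["cwm", "--n3", lj_n3, "--rdf", twitter_url,
          f"--filter={rules}", "--rdf", "--bySubject"], "combined.rdf"),
        # Transform the RDF/XML into HTML with the stylesheet.
        (["xsltproc", "-o", "friends.html", stylesheet, "combined.rdf"], None),
        # Upload the finished page (the destination URL is hypothetical).
        (["curl", "--upload-file", "friends.html", upload_url], None),
    ]


def run_steps(steps):
    """Run each step in turn, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)
```

A crontab entry would then simply call a script that runs `run_steps(build_steps(...))` with your own file names.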

Going further

All of the stuff done above is actually quite practical - hopefully, it'll serve as a neat reaction to the idea that the Semantic Web and specifically RDF is some kind of pie in the sky, impractical dream. The technologies can be seen as relatively complex, but they are certainly not impossible to figure out if you are a reasonably intelligent person.

What is interesting about the Semantic Web is that you can innovate on top of it - it's designed for people to build upon - as good systems should be (hint: Apple, 'web apps' is not a suitable surrogate for a developer kit for the iPhone). So, feel free to experiment. Here are some ideas that you might want to start with.

GRDDL allows you to describe a link between 'POSH' (plain old semantic HTML) data patterns and RDF. It is designed in such a way that web authors can declare something as having a GRDDL profile. The website Triplr is a public GRDDL processor. If you are publishing data on a website, you can expose it to Triplr by using a GRDDL profile. You can find out more about this on the GetSemantic wiki.

'Trust' is a component of the 'Semantic Web layer cake' that Tim Berners-Lee published a while back. Current implementations of 'trust' are based on the use of public-key encryption (GPG, for instance) to sign RDF documents. Again, this is outside of the scope of this tutorial, but Edd Dumbill's tutorial is a good introduction.

Related to trust is an idea I've been mulling over recently (although I cannot claim exclusivity over it - lots of people have been thinking it): the use of OpenID along with FOAF. I could imagine an OpenID provider making available a machine-readable list of the web sites and services that a person is using with their OpenID. A FOAF system could hook up to a person's OpenID provider and use the list of services provided to extract friends lists or other data from. Alternatively, attaching FOAF and OpenID (in either direction) sounds like a cool idea. Because OpenID relies on the use of a URL for identity management (your login ID becomes a URL), it fits in quite naturally with the Semantic Web view of the world, where everything has a URI.

Fortunately, Dan Brickley is thinking along the same lines - and it is looking like FOAF may support OpenID quite soon.


Last updated: Sunday, March 16, 2008 7:37:24 PM