tommorris.org

Discussing software, the web, politics, sexuality and the unending supply of human stupidity.


Neo4J, RDF and Kevin Bacon

Today, I managed to wangle my way into Off the Rails, a train hack day. I was helping friends with data mangling: OpenStreetMap, DBpedia, RDF and Neo4J.

It’s funny actually. Way back when, if I said to people that there is some data that fits quite well into graph models, they’d look at me like I was some kind of dangerous loony. Graphs? Why? Don’t MySQL and JSON do everything I need?

Actually, no.

If you are trying to model a system where there are trains that travel on tracks between stations, that maps quite nicely to graphs, nodes and edges. If only there were databases and data models for that stuff, right?

Oh, yeah, there are. There’s Neo4J and there’s our old friend RDF, and the various triple store databases. I finally had a chance to play with Neo4J today. It’s pretty cool. And it shows us one of the primary issues with the RDF toolchain: it usually fails to implement the one thing any reasonable person wants from a graph store.

Kevin Bacon. Finding the shortest path from one node to another with some kind of predicate filter. If you ask people what the one thing they want to do with a graph is, they’ll say: shortest path.
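To make "shortest path with a predicate filter" concrete, here is a minimal sketch in plain Java: a breadth-first search over labelled edges that only follows the edges a filter allows. This is not Neo4J's API or anything from the hack day code; the class name, the Edge type, the station names and the RAIL/BUS labels are all made up for illustration, and it assumes unweighted, directed edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: shortest path by breadth-first search, not Neo4J's API.
final class ShortestPathSketch {

    /** A directed, labelled edge, e.g. ("A", "RAIL", "B"). */
    static final class Edge {
        final String from;
        final String label;
        final String to;

        Edge(String from, String label, String to) {
            this.from = from;
            this.label = label;
            this.to = to;
        }
    }

    /**
     * Returns the nodes on a shortest path from start to goal, following only
     * edges accepted by the filter, or an empty list if goal is unreachable.
     */
    static List<String> shortestPath(List<Edge> edges, String start, String goal,
                                     Predicate<Edge> allowed) {
        // Build an adjacency list from the edges that pass the filter.
        Map<String, List<String>> adjacency = new HashMap<>();
        for (Edge e : edges) {
            if (allowed.test(e)) {
                adjacency.computeIfAbsent(e.from, k -> new ArrayList<>()).add(e.to);
            }
        }

        // cameFrom doubles as the visited set; the start node has no predecessor.
        Map<String, String> cameFrom = new HashMap<>();
        cameFrom.put(start, null);
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);

        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk the predecessor links back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String n = goal; n != null; n = cameFrom.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : adjacency.getOrDefault(current, Collections.emptyList())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Made-up network: three rail hops from A to D, plus a direct bus.
        List<Edge> edges = Arrays.asList(
                new Edge("A", "RAIL", "B"),
                new Edge("B", "RAIL", "C"),
                new Edge("C", "RAIL", "D"),
                new Edge("A", "BUS", "D"));

        System.out.println(shortestPath(edges, "A", "D", e -> true));
        System.out.println(shortestPath(edges, "A", "D", e -> e.label.equals("RAIL")));
    }
}
```

With every edge allowed the search takes the direct bus; restricted to RAIL edges it has to go the long way round. That is roughly fifty lines of hand-rolled code for something a graph store ought to give you for free.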

This is what Neo4J makes easy. I can download Neo4J in a Java (or JRuby, Scala, whatever) project, instantiate a database in the form of an embedded database, kinda like SQLite in Rails, parse a load of nodes and relations into it, then in two damn lines of Java find the shortest path between nodes.

Why can’t I do this in SPARQL? Because it isn’t in SPARQL 1.0. It isn’t in SPARQL 1.1. There are property paths, but they don’t do shortest path. I can use Jena’s OntTools. There’s some vendor-specific stuff in some graph databases. 4store doesn’t do it as far as I’m aware.

Yes, finding the shortest path is computationally expensive. I don’t care. Yes, it might make a particular implementation topple over. I don’t care. Yes, it doesn’t scale well. I don’t care.

Don’t get me wrong: there’s stuff that’s very good about RDF. I haven’t suddenly had an anti-RDF revelation. The data model is great. If you are designing graph models for use in Neo4J, you should probably follow the practices found out in RDF land. You’ll reinvent them anyway if you don’t (and you’ll reinvent them badly). Additive data models are awesome: they make merging data models simple. So does having globally unique identifiers that are also dereferenceable (you know, URIs). SPARQL is pretty damn powerful.
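A rough illustration of why additive models with global identifiers merge so easily: if each statement is a (subject, predicate, object) triple and every name is a URI, merging two datasets is just set union. The sketch below is plain Java rather than any RDF library, the example.org URIs are invented, and it deliberately ignores blank nodes, which need extra care in a real RDF merge.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: triples as three-element lists, merge as set union.
final class TripleMergeSketch {

    /** Builds one (subject, predicate, object) triple. */
    static List<String> triple(String subject, String predicate, String object) {
        return Arrays.asList(subject, predicate, object);
    }

    /** Merges two sets of triples; identical statements collapse into one. */
    static Set<List<String>> merge(Set<List<String>> a, Set<List<String>> b) {
        Set<List<String>> merged = new LinkedHashSet<>(a);
        merged.addAll(b);
        return merged;
    }

    public static void main(String[] args) {
        // Two made-up datasets that happen to share one statement.
        Set<List<String>> stations = new LinkedHashSet<>(Arrays.asList(
                triple("http://example.org/station/A", "http://example.org/name", "Station A"),
                triple("http://example.org/station/A", "http://example.org/connectsTo",
                        "http://example.org/station/B")));
        Set<List<String>> timetable = new LinkedHashSet<>(Arrays.asList(
                triple("http://example.org/station/A", "http://example.org/connectsTo",
                        "http://example.org/station/B"),
                triple("http://example.org/station/B", "http://example.org/name", "Station B")));

        // Four statements go in, three distinct ones come out.
        System.out.println(merge(stations, timetable).size());
    }
}
```

Because both datasets name station A with the same URI, nothing needs to be reconciled by hand: the shared statement simply shows up once.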

(And, it should be noted, with things like Tinkerpop and Sail you can use Neo4J for storing and querying RDF data.)

But there’s not much point in having a graph model if you don’t actually traverse the damn graph at some point. Why has it taken Neo4J to actually make Kevin Bacon problems easily solvable by people who haven’t got the foggiest what model theory or reification is? The situation is quite simply absurd.

Update: The Neo4J Twitter account has requested code samples. I was sorta helping others, so my code is just hacky stuff:

  • This Scala code just shows basic import and traversal. The orderedListToRelation method is not something I’m proud of: it could be made much more functional by using recursion. But I was feeling more like a lazy Saturday afternoon than a functional programming wizard.
  • This SPARQL query is what I used to get data out of dbpedia.org.
  • The resulting data as RDF in Notation 3 format, after running it through this Ruby script to gather further data from the graph. There’s some further post-processing needed on that data before importing into Neo4J… but I was tired.