On RDF as a storage medium

26 05 2007

I think this comment that Danny Ayers left on my In defence of the RDBMS post deserves to be discussed in a post of its own:

The point about relational databases as an integration technology is well made, but I’m curious to know why you consider RDF worse as a storage medium. It has definite advantages over OO/RDBs when it comes to integration on the web (thanks to the use of URIs as keys, and the open world model).

For persistence I can’t see any way it’s worse than OO/RDBs (in fact quite a few RDF stores use RDBs for persistence under the hood). What’s more, RDF has well-defined serializations (such as RDF/XML) which means that not only is the data portable between stores, it can also be dumped in a *standard* form. (For persistence it’s perfectly reasonable to divide the data up into manageable chunks and distribute them across RDF/XML files).

If “data lives much longer than applications”, then isn’t it better to take advantage of a clear standard, rather than the quasi-standards found in SQL implementations, or for that matter the more proprietary models found in OO DBs..?

Well… yes and no.

Let me first point out the fact that I wrote about “using RDF as a persistent storage medium just because it’s more flexible than a RDBMS”. That’s what I was objecting to (and before you ask: no, it’s not a hypothetical scenario) and not the usage of RDF per se. As Gavin King wrote in the post I was responding to: “Database refactoring is possible and practical.” and you shouldn’t be using some new, unproven technology just because refactoring and maintaining SQL databases is hard.

Second, RDF data might be portable when it is serialized as XML or N3, but once it is persistently stored, it is usually in a proprietary format that can only be accessed with a proprietary API. If I have, say, a Jena model stored in an RDBMS, all I have is essentially a single table with three columns (subject, predicate and object) where all values have been mangled so much that the number 42 becomes Lv:0:42:http://www.w3.org/2001/XMLSchema#nonNegativeInteger4 and so on.
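To make the single-table point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `triples`, the `ex:` identifiers and the sample values are all made up for illustration. This is not Jena's actual layout or value encoding, just the general subject/predicate/object shape described above.

```python
import sqlite3

# Hypothetical, simplified triple table. Real RDF stores (Jena included)
# use their own layouts and value encodings; this only shows the shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", "ex:age", "42"),
        ("ex:alice", "ex:salary", "60000"),
        ("ex:bob", "ex:age", "35"),
        ("ex:bob", "ex:salary", "48000"),
    ],
)

# Reassembling one "row" per person takes one self-join per property.
rows = conn.execute(
    """
    SELECT a.subject, a.object AS age, s.object AS salary
    FROM triples AS a
    JOIN triples AS s ON s.subject = a.subject
    WHERE a.predicate = 'ex:age' AND s.predicate = 'ex:salary'
    ORDER BY a.subject
    """
).fetchall()
print(rows)
```

Nothing in the table itself says that a "person" has an age and a salary. That knowledge lives in the predicates and in whatever API sits on top of the store.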

Contrast that to an SQL database schema where, if the designer didn’t purposefully obfuscate it, it’s usually possible to reverse engineer it, sometimes just by looking at the names of tables and columns, and at foreign keys to infer relationships. There are also mature tools to move data between different databases.
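As a rough illustration of that kind of reverse engineering, the sketch below builds a tiny schema in an in-memory SQLite database and then reads the table names, column names and foreign keys back out of the catalog. The `person` and `department` tables are hypothetical, and other databases expose the same information through their own catalogs rather than through SQLite's PRAGMA statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, standing in for a database we did not design.
conn.executescript(
    """
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT,
        salary REAL,
        department_id INTEGER REFERENCES department(id)
    );
    """
)

# Table names come straight from SQLite's catalog.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# table_info rows: (cid, name, type, notnull, dflt_value, pk)
columns = [r[1] for r in conn.execute("PRAGMA table_info(person)")]

# foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
fks = [(r[3], r[2], r[4])
       for r in conn.execute("PRAGMA foreign_key_list(person)")]

print(tables, columns, fks)
```

The names and the single foreign key are already enough to guess that each person belongs to a department.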

You could argue that I am comparing things that are at different levels, that I should be looking at N3 serialization format as an equivalent of SQL, and that complaining about the non-portability of Jena models is equivalent to complaining about not being able to move MySQL data files to Oracle. If you did that… well, I’d concede you have a point 😉

But the fact remains that, as long as I have my data served by a reasonably well-known RDBMS and I am using a reasonably well-designed schema, I’ll be able to find a (oftentimes cheap or free) tool that allows me to make sense of that data, analyze it, transform it, plot it, report it, you name it.

Without even thinking much about it, I can fire up mysql, psql or sqlplus from the command line and type:

select avg(salary) as a from person
  group by age having avg(salary) > 50000
  order by a desc;

I’m not really up to speed with SPARQL, but I don’t think it’s able to do that just yet. And even if it could, there is the question of how efficient it would be, whereas RDBMSs have been optimized for 30 years in order to be blazingly fast at doing joins, sorts, groupings, projections, and the like. You know, the kind of things business people tend to ask from a data store.
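For anyone who wants to try the query above without a database server at hand, here is a small sketch that runs it against an in-memory SQLite database through Python's sqlite3 module. The `person` table layout and the four sample rows are assumptions made up for the example. Only the query text comes from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal layout for the person table; sample data is invented.
conn.execute("CREATE TABLE person (name TEXT, age INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("Ann", 30, 52000.0),
        ("Bob", 30, 58000.0),
        ("Cid", 40, 70000.0),
        ("Dee", 25, 30000.0),
    ],
)

# The query from the post: average salary per age group, keeping only
# groups whose average exceeds 50000, highest average first.
query = """
    select avg(salary) as a from person
      group by age having avg(salary) > 50000
      order by a desc
"""
result = [row[0] for row in conn.execute(query)]
print(result)
```

With this sample data the 25-year-old group is filtered out by the having clause, and the remaining two averages come back in descending order.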

So, to sum it up, RDF does really shine “when it comes to integration on the web”, especially when we are doing integration between really heterogeneous systems, without much in the way of predefined agreements between them. But I wouldn’t right now, given the maturity of tools, design a system that had an RDF storage system at its core, unless I had some compelling, specific reason for doing so.

3 responses

27 05 2007
Dan Creswell

You said:

““Database refactoring is possible and practical.” and you shouldn’t be using some new, unproven technology just because refactoring and maintaining SQL databases is hard.”

Urgh – that’s a terrible argument to be trying to make. If something is hard it’s time to investigate/try other options which can be either revolution or evolution. Both are about as valid as each other and which you try will be determined by your project risk profile.

5 10 2009
Dan Brickley

The reason I generally advise against keeping data primarily in RDF databases, is that RDF’s greatest strengths (flexibility, extensibility) can become a burden when you are managing rather than aggregating data. To a certain extent using named graphs to partition the data can help there, but I’ve seen a few too many projects through everything into (the Apache thing was well-named) a ‘triple soup’, only to then suffer from the irreversibility of the blend. If the data can be managed against a restrictive, closed-world schema, that’s great. RDFS doesn’t provide such a thing out of the box, which can make it hard to know what exactly is in your triplestore. There are a fair number of experiments around that use combinations of SPARQL, OWL or XML-level mechanisms to provide data validation over RDF (schemarama etc) but none of these have yet become well established and widespread practice. So in the meantime, I see RDF as the place where datasets meet, rather than where they’re managed…

5 10 2009
Dan Brickley

s/through/throw/
