Sparkle Cubes

So you have your QTL analysis, GDP data, or Bathing Water information represented as a Data Cube. You can load your data into a triple store, make some pretty graphs in Cubeviz, and you or anyone else can get a pretty good idea of what it looks like by reading the native N3-formatted encoding. Neato. But you're wondering how this is really any better than the other formats you're already familiar with; sure, it's easier to load and share than the data in a relational database, but there are plenty of tools around to help with that already. Perhaps the more relevant comparison is to flat file formats such as CSV, since that is still the de facto way of sharing bioinformatics data. Why bother learning a new format that is not yet widely used? The most important reason, the "Semantic" part of The Semantic Web, will be the subject of another post, but today I'd like to write a little about another important technology you can already use to take control of your Cube-formatted data and really make it shine (sorry): SPARQL.


SPARQL, an example of everyone's favorite internet neologism, the recursive acronym, stands for SPARQL Protocol and RDF Query Language. As its name suggests, its main function is querying RDF stores. Its general shape should look somewhat familiar to SQL users, but it is designed to build queries around the "Subject Predicate Object" format native to RDF. Instead of simply listing the elements of a SPARQL query, let's go through an example (from Wikipedia):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}

In brief, this query will return the names and emails of everyone in the database, which we assume contains records specified according to a particular vocabulary. If you’d like to know more details, read on, otherwise skip to the next section to see how we’ll apply this to our Data Cube data.

The first thing you'll see is a PREFIX definition, which allows you to specify which vocabulary a resource is defined under. Using the prefix is just a shortcut to save space; you could replace every instance of "foaf:" with "http://xmlns.com/foaf/0.1/" and have an equivalent query. The foaf (friend of a friend) vocabulary is one of the oldest Semantic Web vocabularies. It is used to describe data about people and social networks, such as their names, emails, and connections with each other. If you'd like to know more, all you have to do is browse to that URL, where you can find a detailed, human-readable specification for the vocabulary. This is one nice convention in the Semantic Web community: when you browse to a URI for a vocabulary, you will frequently be redirected to a human-readable version of it. This makes it easy to learn about and use new vocabularies, and to share ones you develop with others.
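For instance, with the prefix expanded inline, the same query reads as follows; it is functionally identical, just harder on the eyes (note that "a" is itself a built-in shorthand for the rdf:type predicate):

SELECT ?name ?email
WHERE {
  ?person a <http://xmlns.com/foaf/0.1/Person>.
  ?person <http://xmlns.com/foaf/0.1/name> ?name.
  ?person <http://xmlns.com/foaf/0.1/mbox> ?email.
}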

Next comes the SELECT line. This is one area that will look particularly familiar to SQL users, although more complex queries may not. In this case, all we're saying is that we want to grab the parts of the data specified by "?name" and "?email" in the next part of the query. In SPARQL, tokens beginning with "?" are variables, so they could be named anything, but as with other languages it's good practice to name them based on what they represent.

Last is the WHERE block, which usually makes up the bulk of the query. Here you can see three conditions specified in Subject Predicate Object form. If you’ve been reading along in previous posts, you may be able to understand their meanings, but even if not it’s fairly comprehensible. We’re looking for an object which is a foaf:Person, which has a foaf:name and a foaf:mbox. Although there are shortcuts which can make queries less verbose than this, the WHERE block is essentially just a list of RDF statements which you want to be true for all the data you are selecting.
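To give you a taste of those shortcuts, the same WHERE block can be written with semicolons so the subject only appears once; together with the same PREFIX line as before, this is an exactly equivalent query:

SELECT ?name ?email
WHERE {
  ?person a foaf:Person;
          foaf:name ?name;
          foaf:mbox ?email.
}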

Once the WHERE block returns the objects it specifies, the SELECT block picks out the portions the user has asked for, in this case the name and email, and returns them.

SPARQL and Data Cube

So now you know the basic structure of a SPARQL query, but how is it useful for the data we created in previous posts? In a multitude of ways, as it turns out. We’ll be using the following prefixes in the example queries. Note that if you were to run these queries yourself, you would need to include the prefixes at the beginning of every query, but in the interest of brevity I’ll be omitting them for the rest of the post.

PREFIX :     <http://www.rqtl.org/ns/#> 
PREFIX qb:   <http://purl.org/linked-data/cube#> 
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX prop: <http://www.rqtl.org/dc/properties/> 
PREFIX cs:   <http://www.rqtl.org/dc/cs/>

Just using relatively familiar syntax, we can do things like select all of the data on one chromosome:

SELECT ?entry
WHERE {
  ?entry prop:chr 10.0
}

Which returns a list of observation URIs

entry
http://www.rqtl.org/ns/#obsD10M298
http://www.rqtl.org/ns/#obsD10M294
http://www.rqtl.org/ns/#obsD10M42_
http://www.rqtl.org/ns/#obsD10M10
http://www.rqtl.org/ns/#obsD10M233
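As an aside, if you wanted to paste that query into an endpoint as-is, you would prepend the prefixes it uses, per the note above; in this case only one is needed:

PREFIX prop: <http://www.rqtl.org/dc/properties/>

SELECT ?entry
WHERE {
  ?entry prop:chr 10.0
}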

Those observation URIs could be used as the subjects of further queries, but while the naming scheme I chose gives you a reasonable idea of what the resources represent, you have no guarantee that observation URIs will be human-readable. One way around this would be to query the rdfs:label predicate of each observation, but if you already know which identifying properties you're interested in, you could run a query such as the following to select them:

SELECT DISTINCT ?chr ?pos ?lod
WHERE {
  ?entry prop:chr 10.0;
         prop:chr ?chr;
         prop:pos ?pos;
         prop:lod ?lod.
}

Which yields

chr pos lod
10 40.70983 0.536721
10 0 0.08268
10 24.74745 0.759688
10 61.05621 0.254428
10 48.73004 0.584946

You may have noticed the semicolons and the slightly different shape of the last query. SPARQL includes a few helpers to make your queries less verbose; here, statements separated by a semicolon rather than a period share the same subject, so it only needs to be written once.
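For comparison, here is the same query with each statement written out in full; it is exactly equivalent, just more repetitive:

SELECT DISTINCT ?chr ?pos ?lod
WHERE {
  ?entry prop:chr 10.0 .
  ?entry prop:chr ?chr .
  ?entry prop:pos ?pos .
  ?entry prop:lod ?lod .
}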

While you could simply use it as a means of accessing your RDF back end, slicing out the data you need and working on it in R or some other dedicated tool, SPARQL alone can handle many basic analysis tasks. As an example, here's a query that uses a few keywords we haven't seen before to select entries with a high LOD score, sort them in descending order, and give them human-readable names:

SELECT DISTINCT ?name ?lod
WHERE {
  ?entry prop:lod ?lod.
  ?entry prop:refRow ?row.
  ?row rdfs:label ?name.
  FILTER(?lod > 4)
}
ORDER BY DESC(?lod)

Yielding

name lod
D5M357 6.373633
D5M83 6.044958
D5M91 5.839359
D13M147 5.819851
D5M205 5.728438
D5M257 5.592882
D5M307 5.352222
D5M338 4.805622
D13M106 4.62314
D13M290 4.511684
D13M99 4.408392

Some of the predicates involved may be a little opaque, but most of the keywords (capitalized as a matter of convention) are pretty descriptive of their function. There’s a lot more depth to SPARQL than is on display here, but nonetheless we are performing the sorts of queries an actual researcher would, without having to learn anything too complex or engage in any unpleasant contortions to only grab the data we want. The latest SPARQL standard (v 1.1) includes support for many more specific graph search patterns as well as a facility for updating your data, but everything you’ve seen in this post should work just fine with any SPARQL endpoint available. 
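To give a flavor of what 1.1 adds, here is a sketch of an aggregate query that counts the observations and finds the top LOD score on each chromosome; it reuses the same prop: terms as above, but you would need an endpoint with SPARQL 1.1 support to run it:

SELECT ?chr (COUNT(?entry) AS ?count) (MAX(?lod) AS ?maxLod)
WHERE {
  ?entry prop:chr ?chr;
         prop:lod ?lod.
}
GROUP BY ?chr
ORDER BY ?chr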

This cat's name is sparql. She is, alas, neither a query language nor a nascent web standard (credit: danja, http://www.flickr.com/photos/danja/236712101/)

If your eyes glazed over through the examples and you're only paying attention now because of the unexpected cat picture, the key point to remember is that we can use these same techniques for any sort of data set in the Data Cube format, be it genetics, finance, or public health. We could select a subset of the information to import into our local data store and visualize using tools like Cubeviz, or we can use query patterns to pick out just the information that interests us. Future blog posts will talk about some of the more complicated operations you can perform, and how the language makes it easy to bring together information from multiple sources, but I hope this sample gives you an idea of the usefulness of SPARQL, and why you'd want your data mapped to the Data Cube format.

This post, focusing on the fundamental mechanics of querying Data Cube encoded information, barely touches on the "Semantic" aspect of The Semantic Web; while we do have some meaningful information about what's a dimension, a measure, and so on, a lot of what makes RDF-related technologies powerful is missing. I will soon be adding context-specific semantics compliant with the Qtab format, so that any other software which understands the format can automatically integrate information from Data Cube resources. Once this process is finished, I will begin creating tools to map general Ruby objects into this format, and to help end users decide which types of semantic information they want to include.
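Picking up the "select a subset and import it" idea from above: a CONSTRUCT query returns triples rather than a table of results, so a sketch along these lines would pull out everything the store knows about the chromosome 10 observations, ready to be loaded into a local triple store (again assuming the prop: vocabulary used throughout this post):

CONSTRUCT { ?entry ?p ?o }
WHERE {
  ?entry prop:chr 10.0;
         ?p ?o.
}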

If you want to try the queries out for yourself, or see how slight modifications might work, you can find a SPARQL endpoint for the data set I've been using here. Unfortunately, results will be returned as XML, which is not very easy (for humans) to read, so if you're interested in trying out your new knowledge in a friendlier setting, you may want to try DBpedia, a project to convert information from Wikipedia to RDF, which has a SPARQL endpoint.
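If you do head over to DBpedia, a query along these lines makes a reasonable first experiment; the resource URI is my guess based on DBpedia's usual Wikipedia-title naming, but rdfs:label is used throughout the data set:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label
WHERE {
  <http://dbpedia.org/resource/SPARQL> rdfs:label ?label.
  FILTER(LANG(?label) = "en")
}

Swap in other resource URIs and predicates to explore from there.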
