Sparkle Cubes

So you have your QTL analysis, GDP data, or Bathing Water information represented as a Data Cube. You can load your data into a triple store, make some pretty graphs in Cubeviz, and you or anyone else can get a pretty good idea of what it looks like by reading the native N3-formatted encoding. Neato. But you're wondering how this is really any better than the other formats you're already familiar with; sure it's easier to load and share than the data in a relational database, but there are plenty of tools around to help with that already. Perhaps the more relevant comparison is to flat file formats such as CSV, since that's still the de facto way of sharing bioinformatics data. Why bother learning a new format that is not yet widely used? The most important reason, the "Semantic" part of The Semantic Web, will be the subject of another post, but today I'd like to write a little about another important technology, which you can already use to take control of your Cube formatted data and really make it shine (sorry): SPARQL.


SPARQL, an example of everyone's favorite internet neologism, the recursive acronym, stands for SPARQL Protocol and RDF Query Language. As its name suggests, its main function is querying RDF stores. Its general shape should look somewhat familiar to SQL users, but it is designed to build queries around the "Subject Predicate Object" format native to RDF. Instead of simply listing the elements of a SPARQL query, let's go through an example (from Wikipedia):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}

In brief, this query will return the names and emails of everyone in the database, which we assume contains records specified according to a particular vocabulary. If you’d like to know more details, read on, otherwise skip to the next section to see how we’ll apply this to our Data Cube data.

The first thing you'll see is a PREFIX definition, which allows you to specify which vocabulary a resource is defined under. Using the prefix is just a shortcut to save space; you could replace every instance of "foaf:" with the full namespace URI "http://xmlns.com/foaf/0.1/" (wrapped in angle brackets) and have an equivalent query. The foaf (Friend of a Friend) vocabulary is one of the oldest Semantic Web vocabularies. It is used to define data about people and social networks, such as their names, emails, and connections with each other. If you'd like to know more, all you have to do is browse to the URL, and you'll find a detailed, human readable specification for the vocabulary. This is one nice convention in the Semantic Web community; when you browse to a URI for a vocabulary, you will frequently be redirected to a human readable version of it. This makes it easy to learn about and use new vocabularies, and to share ones you develop with others.

Next comes the SELECT line. This is one part that will look particularly familiar to SQL users, although more complex queries may not. In this case, all we're saying is that we want to grab the parts of the data specified by "?name" and "?email" in the next part of the query. In SPARQL, tokens beginning with "?" are variables, so they could be named anything, but as with other languages it's good practice to name them based on what they represent.

Last is the WHERE block, which usually makes up the bulk of the query. Here you can see three conditions specified in Subject Predicate Object form. If you’ve been reading along in previous posts, you may be able to understand their meanings, but even if not it’s fairly comprehensible. We’re looking for an object which is a foaf:Person, which has a foaf:name and a foaf:mbox. Although there are shortcuts which can make queries less verbose than this, the WHERE block is essentially just a list of RDF statements which you want to be true for all the data you are selecting.

Once the WHERE block returns the objects it specifies, the SELECT block picks out the portions the user has asked for, in this case the name and email, and returns them.

SPARQL and Data Cube

So now you know the basic structure of a SPARQL query, but how is it useful for the data we created in previous posts? In a multitude of ways, as it turns out. We’ll be using the following prefixes in the example queries. Note that if you were to run these queries yourself, you would need to include the prefixes at the beginning of every query, but in the interest of brevity I’ll be omitting them for the rest of the post.

PREFIX :     <http://www.rqtl.org/ns/#> 
PREFIX qb:   <http://purl.org/linked-data/cube#> 
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX prop: <http://www.rqtl.org/dc/properties/> 
PREFIX cs:   <http://www.rqtl.org/dc/cs/>

Just using relatively familiar syntax, we can do things like select all of the data on one chromosome:

SELECT ?entry
WHERE {
  ?entry prop:chr 10.0
}

Which returns a list of observation URIs

entry
http://www.rqtl.org/ns/#obsD10M298
http://www.rqtl.org/ns/#obsD10M294
http://www.rqtl.org/ns/#obsD10M42_
http://www.rqtl.org/ns/#obsD10M10
http://www.rqtl.org/ns/#obsD10M233

These could be used as the subjects of further queries, but while the observation naming scheme I chose gives you a reasonable idea of what the resources represent, you don't have any guarantee that observation URIs will be human readable. One way around this would be to query the rdfs:label predicate of each observation, but if you already know which identifying properties you're interested in, you could run a query such as the following to select them directly:

SELECT DISTINCT ?chr ?pos ?lod
WHERE {
  ?entry prop:chr 10.0;
         prop:chr ?chr;
         prop:pos ?pos;
         prop:lod ?lod.
}

Which yields

chr pos lod
10 40.70983 0.536721
10 0 0.08268
10 24.74745 0.759688
10 61.05621 0.254428
10 48.73004 0.584946

You may have noticed the semicolons and the slightly different shape of the last query. SPARQL includes a few helpers to make your queries less verbose; in this case, separating statements with a semicolon instead of a period tells the parser that they all share the same subject.

While you could simply use SPARQL as a means of accessing your RDF back-end, slicing out the data you need and working on it in R or some other dedicated tool, you can also use SPARQL alone for many basic analysis tasks. As an example, here's a query that uses a few keywords we haven't seen before to select entries with a high LOD score, sort them in descending order, and give them human readable names:

SELECT DISTINCT ?name ?lod
WHERE {
  ?entry prop:lod ?lod.
  ?entry prop:refRow ?row.
  ?row rdfs:label ?name.
  FILTER(?lod > 4)
}ORDER BY DESC(?lod)

Yielding

name lod
D5M357 6.373633
D5M83 6.044958
D5M91 5.839359
D13M147 5.819851
D5M205 5.728438
D5M257 5.592882
D5M307 5.352222
D5M338 4.805622
D13M106 4.62314
D13M290 4.511684
D13M99 4.408392

Some of the predicates involved may be a little opaque, but most of the keywords (capitalized as a matter of convention) are pretty descriptive of their function. There's a lot more depth to SPARQL than is on display here, but nonetheless we are performing the sorts of queries an actual researcher would, without having to learn anything too complex or engage in any unpleasant contortions to grab only the data we want. The latest SPARQL standard (v1.1) includes support for many more specific graph search patterns, as well as a facility for updating your data, but everything you've seen in this post should work just fine with any SPARQL endpoint available.
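
For instance, SPARQL 1.1 adds aggregates, so you can summarize data directly in a query. Here is a sketch of finding the highest LOD score on each chromosome, assuming the same prop: predicates used above (and with the PREFIX declarations again omitted):

SELECT ?chr (MAX(?lod) AS ?maxLod)
WHERE {
  ?entry prop:chr ?chr ;
         prop:lod ?lod .
}
GROUP BY ?chr
ORDER BY DESC(?maxLod)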

This cat’s name is sparql. She is, alas, neither a query language nor a nascent web standard (credit: danja)

If your eyes glazed over through the example and you're only paying attention now because of the unexpected cat picture, the key point to remember is that we can use these same techniques for any sort of data set in the Data Cube format, be it genetics, finance, or public health. We can select a subset of the information to import into our local data store and visualize with tools like Cubeviz, or use query patterns to pick out just the information that interests us. Future blog posts will talk about some of the more complicated operations you can perform, and how the language makes it easy to bring together information from multiple sources, but I hope this sample gives you an idea of the usefulness of SPARQL, and why you'd want your data mapped to the Data Cube format. This post, focusing more on the fundamental mechanics of querying Data Cube encoded information, barely touches on the "Semantic" aspect of The Semantic Web; while we do have some meaningful information about what's a dimension, a measure, and so on, a lot of what makes RDF-related technologies powerful is missing. I will soon be adding context-specific semantics compliant with the Qtab format, so any other software which understands that format can automatically integrate information from Data Cube resources. Once this process is finished, I will begin creating tools to map general Ruby objects into this format, and to help end users decide which types of semantic information they want to include.

If you want to try the queries out for yourself, or see how slight modifications might work, you can find a SPARQL endpoint for the data set I've been using here. Unfortunately, results will be returned as XML, which is not very easy (for humans) to read, so if you're interested in trying out your new knowledge in a friendlier setting, you may want to try DBpedia, a project that converts information from Wikipedia to RDF and provides its own SPARQL endpoint.

Frame to Cube

I'd like to talk a little more about the mapping between R Data Frames and the RDF format. This isn't a very natural union; information in a Data Frame has a simple and well-defined structure, while RDF is a sequence of statements which need not be in any particular order. This is by design of course; it is what allows the flexibility and generality of RDF. The Data Cube vocabulary, however, forms a bridge between these two very different ways of representing information. Data Cube was developed by a group of statistics and data science experts commissioned by the UK government to develop a vocabulary for representing multi-dimensional data. It can be used with tabular data, but in keeping with the flexible nature of semantic web technologies, it can also represent nearly any kind of data which can be broken down by dimension, and includes facilities for attaching semantic context to data sets and extending itself to accommodate additional complexities of units, measure types, and other common features of data.

The OLAP cube, a common data structure in corporate settings, uses the same basic model as the Data Cube vocabulary, but does not include any semantic information.

A Dataframe, by contrast, is a fairly simple structure: a list of vectors of equal length. Although some incidental complexity can hide under this description, in our sample use case, R/qtl, they can be thought of simply as tables with labeled rows. While the Data Cube vocabulary is capable of handling much more complicated structures, it is also well suited to representing simple objects such as Dataframes.

In order to explore the relationship between these two data structures, we will look at a small data set representing the results of an r/qtl analysis session. These particular data are excerpted from running a marker regression on the Listeria dataset included in r/qtl.

 

        chr   pos    lod
D10M44    1   0.0  0.457
D15M68   15  23.9  3.066

The actual Dataframe has about 130 rows, but this short excerpt will suffice to show how the mapping works. A future post may touch briefly on the meaning of these entries in the context of QTL analysis. For now, just think of it as you would any other table, although note that the rows are labeled.

We will have to make some assumptions for our mapping, since R doesn't include any information about the meaning (semantics) of its data. We assume that each column of the Dataframe is a property we're interested in measuring, and that each row of the table specifies a category or dimension to place our measurements in. If you're familiar with the structure of our example Dataframe, an R/qtl scanone result, you may notice that this mapping doesn't entirely match what's represented in the data; the "chr" and "pos" columns probably better specify dimensions, and since the row names contain no extra information, they should probably just be labels for the data points. Unfortunately there's no way to fully automate this process; there simply isn't enough information in the Dataframe to unambiguously determine our mapping. I am working on some tools to allow end users to specify this information, which could be used to build up a library of mappings for different R classes, but for now it suffices to develop a technically valid Data Cube representation of the information.

Let’s break down the mapping by individual data cube elements, and see what they correspond to in R.

Prefixes

Prefixes are standard in a number of RDF languages, including Turtle, which I will use for these examples. Prefixes simply specify that certain tokens, when followed by a ":" (colon), should be replaced by the appropriate URI. Although any Turtle file could be written without them, they are essential to making your RDF comprehensible to humans as well as machines. The prefixes used for this data set are shown below.
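
In Turtle they amount to declarations like the following, presumably the same namespaces used in the SPARQL examples above:

@prefix :     <http://www.rqtl.org/ns/#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix prop: <http://www.rqtl.org/dc/properties/> .
@prefix cs:   <http://www.rqtl.org/dc/cs/> .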

While some of these prefixes are part of standard RDF vocabularies, anything under rqtl.org should be considered a placeholder for the time being, as no vocabulary definitions actually exist at the given address.

Data Structure Definition

One of the highest level elements of a Data Cube resource is the Data Structure Definition. It provides a reusable definition that Datasets can operate under, specifying dimensions, measures, attributes, and extra information such as ordering and importance.

In our basic implementation, this is little more than a list of component specifications for the dimension properties. As the program develops, a lot of the additional semantic detail will wind up here.

Note that the Data Structure Definition resource has been named after the variable used to generate it, which I called "mr" (marker regression), so the resource's name is "dsd-mr".
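
A minimal sketch of what this might look like in Turtle, assuming one component specification for the row dimension plus one per column (the converter's exact output may differ slightly):

:dsd-mr a qb:DataStructureDefinition ;
    qb:component cs:refRow, cs:chr, cs:pos, cs:lod .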

Data Set

Although it can contain some additional information about how to interpret the data it applies to, a DataSet’s main job is to attach a series of observations to a Data Structure Definition.

This is currently implemented as a mostly static string, with the dataset labeled (by default) based on the variable used to generate it.
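
In Turtle, a sketch of such a dataset might be little more than this (the dataset URI here is a guess based on the naming convention just described):

:dataset-mr a qb:DataSet ;
    rdfs:label "mr" ;
    qb:structure :dsd-mr .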

Component Specifications

In the Data Structure Definition you only need to provide a list of component specifications, making DSDs easier to read and create. The Component Specification is the bridge between this list and the component resources, marking components as measures, dimensions, or attributes and providing a reference to the actual component object. It also contains some component metadata, such as whether or not the component is required and whether it has any extra attached RDF resources.

The project currently creates Component Specifications based on the column names of the R Data Frame used to generate them, plus one for the row dimension.
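
A sketch of two such specifications, one for a measure and one for the row dimension (the resource names are illustrative):

cs:lod a qb:ComponentSpecification ;
    rdfs:label "Component Specification for lod" ;
    qb:measure prop:lod .

cs:refRow a qb:ComponentSpecification ;
    rdfs:label "Component Specification for refRow" ;
    qb:dimension prop:refRow .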

Dimension Properties

Data Cube Properties are separated into four types: Dimension Properties, Measure Properties, Attribute Properties, and Coded Properties. Of these, only the first two are in use in my project, so they will be the only ones I'll cover in this post.

Dimension Properties specify the way in which data are measured. They provide a means of categorizing observations, and are generally used by visualization programs to draw axes for data. Dimension Properties can also specify a Measure Type, which allows you to categorize what the dimension is measuring, for example weight or time.

The converter defaults to making row the only dimension, which will work with most data sets but may not be the best mapping possible. A near term goal of this project will be providing a facility to specify this at run time.

Each row must also be declared as a member of this dimension.
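
Roughly, and with the class and resource names here being guesses rather than the converter's exact output, that looks like:

prop:refRow a rdf:Property, qb:DimensionProperty ;
    rdfs:label "refRow" .

# each dataframe row gets its own resource, which observations point to
:rowD10M44 a :row ;
    rdfs:label "D10M44" .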

Measure Properties

Measure Properties are used to represent the values of interest in your data, whether it's GDP levels, cancer rates, or LOD scores from a QTL analysis.

In the default mapping, each column of the R DataFrame becomes a measure property. In the example object, this is all of the “chr”, “pos”, and “lod” values.
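
For the example dataframe, the generated measure properties would look something like this in Turtle:

prop:chr a rdf:Property, qb:MeasureProperty ;
    rdfs:label "chr" .

prop:pos a rdf:Property, qb:MeasureProperty ;
    rdfs:label "pos" .

prop:lod a rdf:Property, qb:MeasureProperty ;
    rdfs:label "lod" .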

Observations

Observations are the heart of the Data Cube vocabulary. They represent one particular data point, and specify values for each dimension property and measure property. Because the definition of these properties has already been laid out as its own resource, all an observation needs to list is the value at that point, in whatever format it is set up to use.

In the case of an R Dataframe, each row forms an observation, and each column specifies one component of it. By default all are measure values, although in future versions some may specify dimension or attribute values.

In the example object, each row specifies the measure values for chr, pos, and lod, generating a Data Cube observation that looks like:
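
For the first row of the example table, a sketch of such an observation in Turtle (the observation URI follows the naming seen in the SPARQL results above; the dataset URI is a guess):

:obsD10M44 a qb:Observation ;
    qb:dataSet :dataset-mr ;
    prop:refRow :rowD10M44 ;
    prop:chr 1.0 ;
    prop:pos 0.0 ;
    prop:lod 0.457 .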

Other

The Data Cube vocabulary contains a number of other important features not covered in this overview. For example, the vocabulary is compatible with, and to some extent integrates, SKOS, the Simple Knowledge Organization System, a popular ontology for organizing taxonomies, thesauri, and other types of controlled vocabulary. Each Data Cube Property can have an attached “concept”, which aids in reuse and comprehension by specifying a known statistical concept represented by the property.
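
For instance, a property could be tied to a SKOS concept along these lines (the skos: prefix is <http://www.w3.org/2004/02/skos/core#>, and the concept resource here is purely illustrative):

prop:lod qb:concept :lodScoreConcept .

:lodScoreConcept a skos:Concept ;
    skos:prefLabel "LOD score" .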

I hope you found this example illustrative and you have a better idea of how to get your data into the Data Cube format. If combined into one text file, the results specify a valid Data Cube which can be used with tools such as Cubeviz, or queried as part of a research or analysis task. If you’d like to see what this looks like for a larger object, here is the Data Cube representation of a marker regression on the whole Listeria data set. In the next post, I’ll talk about how to do this using SPARQL, RDF’s native query language.

Motivations: Why we need to improve the Semantic Web

David Karger: How the Semantic Web Can Help End Users

MIT AI researcher David Karger gave the keynote at this year’s European Semantic Web Conference, and has posted his slides as well as a summary of his talk on the MIT Haystack blog. He’s an expert on the topic, and does a much better job than I could of explaining the value of flexible and extensible data representation. I hope to distill some of the writing of Karger and others and post it here over the summer, but for now, if you’re not sure what advantage these technologies have over traditional databases and ad-hoc formats, or you think there’s no more useful work to be done on them, have a look at the presentation.

Data Frames and Data Cubes

One of the most commonly used abstract structures in R is the dataframe. Technically, a dataframe is a list containing vectors of equal lengths, but it is generally used to represent tables of data, where each row represents an observation and each column a variable or attribute of the observation. Dataframes are commonly used to store structured results of some computation in R, such as the chromosome, position, and LOD score from a QTL analysis. A dataframe is less flexible than a general R class, but more complex than a simple list or primitive value, so I decided to use it as my test case for the R-to-RDF process.

Tools

Although there's nothing approaching the range of Semantic Web libraries available for popular languages like Java, the Ruby community has developed a fair number of libraries for interacting with RDF formatted data, which will allow me to focus more development time on building useful tools than on the background infrastructure to support them. Chief among these is the Ruby RDF project, which hosts a number of interrelated libraries built on a basic core for interacting with Semantic data. Ruby RDF features a Ruby object representation of triples and graphs and methods for interacting with them, as well as a number of plugins for reading and writing data using different storage systems and RDF serialization standards. It takes the approach of focusing on a limited set of functions, implementing them efficiently and robustly, and providing an extensible architecture to allow new functionality to be added. Although it started with built-in support for only one of the simplest RDF formats, ntriples, functional and well-maintained extensions or plugins are now available for most popular triple stores, serialization formats, and querying methods.

Although anyone can invent their own vocabulary and write perfectly valid RDF using it, the focus of the Semantic Web is on interoperability via re-use of existing standards. Converting tabular data to RDF has historically been somewhat difficult, as the two formats are not naturally very compatible. Fortunately recent work in unifying a number of existing data description vocabularies has yielded a reasonably general specification for multidimensional data: The Data Cube Vocabulary. It is a rich and descriptive system, as you might guess looking at this chart of its components:

Data Cube Vocabulary

However, it can still describe simple datasets without requiring every component above, so it is flexible as well as powerful. At its core, the Data Cube vocabulary is essentially just a long list of observations, and the dimensions, units, measurements and attributes that define them. It has been used in a number of real world applications, particularly in the publication of large data sets by government agencies, such as the financial data published by the UK government as part of their transparency initiative, and a wide variety of others.

Once data is in RDF form, you need somewhere to store it. Although theoretically you could use a standard relational database (and many do offer direct support for triple formatted data), or even just store everything as a bunch of serialized text files, in practice RDF data is generally stored using a type of database called a triplestore. These are optimized for storing data in Subject Predicate Object and related forms, so they are generally more efficient and easier to integrate with other triple based data sets or querying protocols. Although there are differences between types of triple stores, ranging from subtle to extreme, Ruby RDF abstracts many of these away with a Repository object that acts as an interface for loading data into a store and performing queries on it. To begin with, we are using 4store for our triplestore, as it's free, supports the latest versions of the RDF and SPARQL standards, is a native triple store (as opposed to an extension for another type of DB), and can run as a cluster for operating on large amounts of data. A plugin is already available to use 4store as a repository in Ruby RDF. It isn't fully finished, so implementing the remaining functions may become a part of my work this summer if we continue to use it, but it works well enough to load a set of triples into the store, which is all I need at this stage.

Finally, I'll be using the Rserve tool and the accompanying rserve-client Ruby gem to manage the interaction between R and Ruby. While there are a few options for doing this, Rserve seemed to offer the best compromise between performance and deployability, making up for its slight disadvantage in speed compared to some libraries (e.g. RSRuby) by being written in pure Ruby, and thus being more easily deployable in different environments.

Method

The overall process for converting from an R dataframe to a rudimentary Data Cube object is actually fairly straightforward. When using the standalone program, you just provide the name of the R variable to dump, and possibly the location of your R workspace and the port your 4store instance is running on, if these differ from the defaults (loading the Ruby class in your code offers more flexibility). The script first creates a hash from the data, which is somewhat counter-intuitively represented as a monkey-patched Array object with "name" attributes instead of just a regular hash. This is a reasonably simple operation for a basic dataframe, which, as mentioned before, is structurally equivalent to a table with named rows, but the process may become more complex when I move on to supporting a wider set of R objects.

At first I attempted to create an ad-hoc placeholder vocabulary and create triples for each entry and relation in the dataframe using Ruby RDF's included statement builder, thinking I would test the 4store interface and implement a more complete vocabulary later. I quickly realized working with plain triples is too cumbersome an approach to truly be viable, and it was producing unnecessarily large graphs for the amount of information I was attempting to store. So I decided to move straight to implementing the Data Cube vocabulary, using an RDF description language known as N3, or more specifically a subset of it known as Turtle. N3 is actually powerful enough to express statements which do not translate to valid RDF, while Turtle is limited to just the subset of N3 that corresponds to proper RDF. In addition to being more powerful than the basic language of triples, Turtle is also one of the most human readable formats for working with RDF, which can otherwise wind up looking like a big morass of unformatted HTML. The downside is that Ruby RDF doesn't yet have full support for writing blocks of Turtle statements, so currently much of the code relies on string building.

I decided on a few simple mappings for parts of the dataframe to get things started. They may not be perfectly accurate or the proper representation of the data, but I wanted to make sure I was at least generating a valid Data Cube before adding the fine details. The rows of the dataframe become the values of a single Data Cube dimension, each column becomes a measure, and the dataset as well as its supporting component objects are named using the name of the variable being converted. The exact way in which these elements are linked can seem a little arcane at first, but as I mentioned, it really just comes down to splitting the data into observations and attaching dimensions or attributes to describe them. If you'd like an example of how it all winds up looking, see this simple file describing just one datapoint: https://gist.github.com/wstrinz/cd20661a8d1fda123e2b

Once it's in this format, 4store and most other triple stores can load the data easily, and if the store provides a built-in SPARQL endpoint, it can already be queried using the standard patterns. If new data using the same definition is added, it can be automatically integrated, and if someone else wants to extend or provide more details for any element, they can do so without redefining their schema, and anyone looking for a definition of how the data is structured can find it easily.

How To

I will go through a quick example, from loading a data set in R all the way to viewing it in a Data Cube visualizer, as a demonstration of what I've implemented so far. Although the native Ruby code allows for more flexibility, I will be using the standalone Java executable for this example. This assumes you've installed R/qtl as well as Rserve; if not, both are available from the R package management system. You'll also need 4store installed and running its HTTP SPARQL server in unsafe mode, for which you can find instructions here. In this example, I'll be running mine on the default port 8080.

First let's start R in our home directory and do some simple analysis with a sample dataset provided by R/qtl. Once we've loaded the results into a variable, we'll close the R session and save our workspace.

$ R
> library(qtl)
> data(listeria)
> mregress = scanone(listeria, method="mr")
> quit()
Save workspace image? [y/n/c]: y

Next, we want to run our tool to dump the objects into 4store. To do this, you simply run the executable jar with the variable to dump as the first argument, and optionally the directory the R workspace was saved in (if it’s not the same directory you’re running the jar from).

$ java -jar QTL2RDF.jar mregress ~

You can find a copy of the jar here, but since this project is under active development that will likely go out of date soon, so you’ll want to check for the latest build on the project’s Github page.

That's all you really need to do! Your data are now in your local 4store, ready to be extracted or queried as you see fit. However, the jar also offers the option of returning its output as a Turtle-formatted string, so if you want to use it with some other program, you can pass a fourth argument (for now it can be anything) when you run the jar, then pipe the output to a file.

java -jar QTL2RDF.jar mregress ~ 8080 something > mregress.ttl

This file should be importable into any triple store, and usable with other tools that recognize the Data Cube format. One such tool is Cubeviz, which allows you to browse and present statistical information encoded as Data Cube RDF. Cubeviz uses a different triple store called Virtuoso, so it can be a bit time consuming to set up, but once you've done so, importing your data is easy: you can upload it as a file, provide a link to it, or simply paste it in (if your dataset is small enough).

Pasting the data set

You can then decide how you'd like to visualize your data. In this case, we're probably interested in the LOD score, so I set the measurement to "lod" and selected a few loci in the dimension configuration.

The resulting graph in Cubeviz

You could select each locus to view the whole data set, but Cubeviz is still in development, and consequently has some issues, such as failing to scale down the graph for a wide range of values.

So far, this is just the bare minimum required to properly define the object in the Data Cube vocabulary. Some of the mappings may not be quite correct, and there's a lot that could be done to better define the semantics of the data being converted. As an example, the 'chr' and 'pos' measures may be better described as attributes rather than measures in their own right. There are no labels or units on the graph, although Cubeviz will create them if your data is formatted correctly. Some of these features will be included as I improve the conversion program, but there will still be some important semantic information that cannot be easily extracted from the R dataframe. Developing ways for users to specify how this should be handled in a friendly and intuitive way is an important facet of this summer's project, but even at its minimum functionality the tool I've developed can already convert and store a popular data structure in a useful and flexible format.

Next Steps

I plan to finish mapping simple R types over the next few days, then write some functions to break general R objects into their component parts and serialize those. By the official start of coding, I should have the groundwork laid to start applying these methods to Ruby objects and developing more complete ontology handling.

Again, please check out the Github repository (https://github.com/wstrinz/R2RDF) if you’d like to see some more details. The code isn’t particularly efficient or well documented yet, but I’d appreciate any feedback anyone has. If you have questions, please leave a comment here or get in touch with me some other way.

Goals: Reproducible Science

This project has a number of goals, including improving support for large datasets in the bioinformatics community, furthering the development of semantic web technologies, and supporting data sharing and reproducible science. Today I'm going to go into a little more detail about the first of these and talk about the work I've done so far on it.

To get up to speed on all of the interrelated Semantic Web standards and technologies, I have been working on a tool for converting objects from the R statistical computing language into RDF triples, the native format of the Semantic Web. Although this will be a valuable tool on its own, it is also being developed to support the next version of R/qtl, a library developed by Karl Broman and Hao Wu, as well as a host of other contributors, which offers functions for doing Quantitative Trait Loci mapping using R. The next incarnation of R/qtl will focus on support for highly parallelized computation, a key component of which will be storing results in a database that can be queried and manipulated remotely, as opposed to keeping huge data sets in memory on one computer.

An overview of the plans for R/qtl

Data Sharing and Reproducibility

The other advantage of storing statistical data in triple based format is that it can be easily, even automatically, published for others to download, inspect, and interact with. In RDF, every property of a resource is defined by its relation to other resources or objects, and each relation comes with an attached definition that either a machine or a human can access to get further details about it. This allows for a huge amount of flexibility in data types and storage schema, as well as the application of algorithmic reasoning techniques to simplify a data set or find out more about its implications.

Publishing scientific data in a machine readable format also makes it dramatically easier for scientists hoping to replicate the results or build upon them. Most publications will, at best, include supporting data as a table or tables of statistical aggregates, and even when lower level raw data are available they are usually stored in a flat format such as CSV, which includes little to no semantic content, such as the units or attributes of objects, or their meaning in the context of the rest of the data. While some fields have begun using relational database technology more extensively, the fact that the most popular data storage formats for many researchers are essentially text files speaks to the rigidity and extra complications of using dedicated database systems.

RDF has a somewhat mind-bending structure of its own to understand, and it's certainly no silver bullet for the problem of generalized data storage, but its flexibility allows it to overcome much of the ossification and user-unfriendliness of trying to use relational databases for storing and publishing scientific results. A primary goal of my work this summer will be to extend existing Ruby tools and build new ones to help make Semantic data storage accessible and simple for researchers, supporting the crucial work of examining, validating, and extending published results.

Protocols

The Semantic Web is built on three core technologies: the RDF data format, the SPARQL query language, and the OWL Web Ontology Language. There is a large body of documentation about all three of these, which can be found around the web or in the W3C's standards documents. I may write some posts going into more detail, but for now, a brief overview:

RDF essentially comes down to describing data using the 'Subject Predicate Object' format, creating statements known as 'Triples'. An example would be "John(subject) knows(predicate) Mary(object)". With a few exceptions, each component of a triple is given a URI, which looks a lot like a regular URL and can be used to uniquely identify a resource (such as 'John' or 'Mary') or a relation (such as 'knows'), as well as providing a link where more information about the resource or relation may be found. You can think of subjects and objects as nodes, and predicates as lines connecting them. Although this doesn't mean much for one statement such as "John knows Mary", a collection of similar statements defines a big directed graph of interconnected nodes, where every connection is labeled with its meaning.
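
In Turtle syntax the "John knows Mary" example might be written like this, using the real foaf vocabulary but made-up resource URIs:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/people/> .

ex:john a foaf:Person ;
        foaf:name "John" ;
        foaf:knows ex:mary .

ex:mary a foaf:Person ;
        foaf:name "Mary" .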

SPARQL is the official query language of the Semantic Web. It can be used to select elements of an RDF graph based on their subjects, predicates, or objects. Its syntax resembles, superficially at least, the SQL language familiar to many database users, but in practice it functions quite differently. However, many of the advanced operations of SQL, such as pivoting, are still supported.

OWL is a language for describing ontologies, which are used to formally represent the concepts your data represent in an RDF store, allowing simpler machine interpretation. The technology is crucial to the interconnectedness of the Semantic Web, as it is the means by which the relations between disparate resources can be automatically discovered, allowing relatively easy integration of new or existing data.
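
As a small, hypothetical example in Turtle, OWL lets you state that a class or property in one vocabulary means the same thing as one in another, which a reasoner can then use to merge data expressed in either (all of the ex: and other: names below are made up):

@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix ex:    <http://example.org/qtl#> .
@prefix other: <http://example.org/other-vocab#> .

ex:Marker a owl:Class ;
    owl:equivalentClass other:GeneticMarker .

ex:lodScore a owl:DatatypeProperty ;
    owl:equivalentProperty other:logOdds .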

There is, of course, much more to each of these technologies, and to the technicalities, use cases, and extensions of each of them, than would fit in one section of a blog post, but hopefully this gives a broad overview of how the project will work.

A Starting Place

To begin with, I have developed a Ruby based tool to automatically convert data frame objects in R to RDF, using the Data Cube vocabulary, which was developed as a generalizable way of representing multidimensional data. It can be run easily on any Ruby capable machine, but since the script and all of its dependencies are pure Ruby libraries, I was also able to deploy it as an executable jar with warbler, so anyone with Java installed can use it without having to download any dependencies. I'll go into more detail about how this works, why it's useful, and where this part of the project is headed, but I've decided to break that into a separate blog post.

Hello

Hello everyone, thanks for taking the time to look at my blog! This will serve as a record of what I'm doing for the Google Summer of Code 2013, where I will be working with members of the SciRuby organization on writing programs to make data storage and sharing easier using Semantic Web tools. Some posts will be about the technical details of the project as it progresses, others about broader topics such as Ontologies and Triple Stores, and sometimes there will be a mix of both. Please leave a comment if you have any questions or feedback, whether or not you're involved in GSoC or SciRuby.

If you’d like to get in touch some other way, you can reach me by:
Email: wstrinz@gmail.com
Twitter: @wstrinz
GitHub: github.com/wstrinz