One of the most commonly used abstract structures in R is the dataframe. Technically, a dataframe is a list containing vectors of equal lengths, but it is generally used to represent tables of data, where each row represents an observation, and each column a variable attribute for the observation. Dataframes are commonly used to store structured results for some computation in R, such as the chromosome, position, and LOD score for a QTL analysis operation. A dataframe is less flexible than a general R class, but more complex than a simple list or primitive value, so I decided to use it as my test case for the R to RDF process.
Although there’s nothing approaching the range of Semantic Web libraries available for popular languages like Java, the Ruby community has developed a fair number of libraries for interacting with RDF formatted data, which will allow me to focus more development time on building useful tools than on the background infrastructure to support them. Chief among these is the Ruby RDF project, which hosts a number of interrelated libraries built on a basic core for interacting with Semantic data. Ruby RDF features a Ruby object representation of triples and graphs and methods for interacting with them, as well as a number of plugins for reading and writing data using different storage systems and RDF serialization standards. It takes the approach of focusing on a limited set of functions, implementing them efficiently and robustly, and providing an extensible architecture to allow new functionality to be added. Although it started with built-in support for only one of the simplest RDF formats, N-Triples, functional and well-maintained extensions or plugins are now available for most popular triple stores, serialization formats, and querying methods.
Although anyone can invent their own vocabulary and write perfectly valid RDF using it, the focus of the Semantic Web is on interoperability via re-use of existing standards. Converting tabular data to RDF has historically been somewhat difficult, as the two formats are not naturally very compatible. Fortunately, recent work in unifying a number of existing data description vocabularies has yielded a reasonably general specification for multidimensional data: the Data Cube Vocabulary. It is a rich and descriptive system, as you might guess looking at this chart of its components:
However, it can still describe simple datasets without requiring every component above, so it is flexible as well as powerful. At its core, the Data Cube vocabulary is essentially just a long list of observations, and the dimensions, units, measurements and attributes that define them. It has been used in a number of real world applications, particularly in the publication of large data sets by government agencies, such as the financial data published by the UK government as part of their transparency initiative, and a wide variety of others.
Once data is in RDF form, you need somewhere to store it. Although theoretically you could use a standard relational database (and many do offer direct support for triple formatted data), or even just store everything as a bunch of serialized text files, in practice RDF data is generally stored using a type of database called a triplestore. These are optimized for storing data in Subject Predicate Object and related forms, so they are generally more efficient and easier to integrate with other triple based data sets or querying protocols. Although there are differences between types of triple stores, ranging from subtle to extreme, Ruby RDF abstracts many of these away with a Repository object that acts as an interface for loading data into a store and performing queries on it. To begin with, we are using 4store for our triplestore, as it’s free, supports the latest version of the RDF and SPARQL standards, is a native triple store (as opposed to an extension for another type of DB), and can run as a cluster for operating on large amounts of data. A plugin is already available to use 4store as a repository in Ruby RDF. It isn’t fully finished, so implementing the remaining functions may become a part of my work this summer if we continue to use it, but it works well enough to load a set of triples into the store, which is all I need at this stage.
Finally, I’ll be using the Rserve tool and accompanying rserve-client Ruby gem to manage the interaction between R and Ruby. While there are a few options for doing this, Rserve seemed to offer the best compromise between performance and deployability. The client is slightly slower than some libraries (e.g. RSruby), but it makes up for this by being written in pure Ruby, which makes it easier to deploy in different environments.
The overall process for converting from an R dataframe to a rudimentary Data Cube object is actually fairly straightforward. When using the standalone program, you just provide the name of the R variable to dump, and possibly the location of your R workspace and the port your 4store instance is running on if these differ from the defaults (loading the Ruby class in your code offers more flexibility). The script first creates a hash from the data, which is somewhat counter-intuitively represented as a monkey patched Array object with “name” attributes instead of just a regular hash. This is a reasonably simple operation for a basic dataframe, which, as mentioned before, is structurally equivalent to a table with named rows, but the process may become more complex when I move on to supporting a wider set of R objects.
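To make the hash-building step concrete, here is a minimal sketch in plain Ruby. It does not use rserve-client or the project's own code; the column values, the row names, and the `columns`/`rows` variable names are all made up for illustration. It only shows the general idea of pivoting named, equal-length columns into one record per row.

```ruby
# Hypothetical stand-in for a dataframe pulled from R: a hash of named,
# equal-length columns. The values are invented for illustration.
columns = {
  "chr" => ["1", "1", "2"],
  "pos" => [0.0, 10.5, 3.2],
  "lod" => [0.43, 1.27, 0.88]
}

# Hypothetical row names; an R dataframe carries these alongside its columns.
row_names = ["marker_a", "marker_b", "marker_c"]

# Check the defining property of a dataframe: every column has the same length.
lengths = columns.values.map(&:length).uniq
raise ArgumentError, "columns differ in length" unless lengths == [row_names.length]

# Pivot the column-oriented data into one hash per row, keyed by row name.
rows = row_names.each_with_index.map do |name, i|
  [name, columns.transform_values { |values| values[i] }]
end.to_h

rows.each { |name, values| puts "#{name}: #{values}" }
```

Keeping the data column-oriented until the last moment mirrors how R stores it, and the row-oriented hash is a convenient shape for emitting one observation per row later on.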
At first I attempted to create an ad-hoc placeholder vocabulary and create triples for each entry and relation in the dataframe using Ruby RDF’s included statement builder, thinking I would test the 4store interface and implement a more complete vocabulary later. I quickly realized that working with plain triples is too cumbersome an approach to truly be viable, and it was producing unnecessarily large graphs for the amount of information I was attempting to store. I decided to move straight to implementing the Data Cube vocabulary, using an RDF description language known as N3, or more specifically a subset of it known as Turtle. N3 is actually powerful enough to express statements which do not translate to valid RDF, while Turtle is limited to just the subset of N3 that corresponds to proper RDF. In addition to being more powerful than the basic language of triples, Turtle is also one of the most human readable formats for working with RDF, which can otherwise wind up looking like a big morass of unformatted HTML. The downside is that Ruby RDF doesn’t yet have full support for writing blocks of Turtle statements, so currently much of the code relies on string building.
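As an illustration of what string building for Turtle can look like, here is a small sketch using only core Ruby. It is not the project's code: the `ex:` namespace, the helper names (`turtle_literal`, `turtle_block`, `turtle_document`), and the sample values are all assumptions chosen for the example.

```ruby
# Build a small Turtle document by plain string concatenation.
# The ex: namespace and every name under it are placeholders.
PREFIXES = {
  "ex"  => "http://example.org/ns#",
  "xsd" => "http://www.w3.org/2001/XMLSchema#"
}

# Render a Ruby value as a Turtle literal. Strings are quoted, with
# embedded quotes and backslashes escaped; floats get an explicit datatype.
def turtle_literal(value)
  case value
  when Integer then value.to_s
  when Float   then %("#{value}"^^xsd:double)
  else
    escaped = value.to_s.gsub(/["\\]/) { |char| "\\#{char}" }
    %("#{escaped}")
  end
end

# Emit one subject with several predicate/object pairs, using Turtle's
# ";" shorthand so the subject is written only once.
def turtle_block(subject, pairs)
  lines = pairs.map { |predicate, object| "  #{predicate} #{object}" }
  "#{subject}\n#{lines.join(" ;\n")} .\n"
end

# Prepend the @prefix declarations to a list of blocks.
def turtle_document(prefixes, blocks)
  header = prefixes.map { |name, iri| "@prefix #{name}: <#{iri}> ." }.join("\n")
  "#{header}\n\n#{blocks.join("\n")}"
end

entry = turtle_block("ex:entry1", [
  ["ex:chr", turtle_literal("1")],
  ["ex:pos", turtle_literal(10.5)],
  ["ex:lod", turtle_literal(1.27)]
])

puts turtle_document(PREFIXES, [entry])
```

The `;` shorthand is a large part of why Turtle output stays compact compared with writing every triple out in full.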
I decided on a few simple mappings for parts of the dataframe to get things started. They may not be perfectly accurate or the proper representation of the data, but I wanted to make sure I was at least generating a valid Data Cube before adding the fine details. Each row of the dataframe becomes a Data Cube dimension, each column a measure, and the dataset as well as its supporting component objects are named using the name of the variable being converted. The exact way in which these elements are linked can seem a little arcane at first, but as I mentioned, it really just comes down to splitting the data into observations and attaching dimensions or attributes to describe them. If you’d like an example of how it all winds up looking, see this simple file describing just one datapoint: https://gist.github.com/wstrinz/cd20661a8d1fda123e2b
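For readers who want a feel for the shape of such a file without opening the gist, the following is a rough sketch of a one-observation Data Cube generated from Ruby. It is one possible reading of the mapping, not the tool's actual output: the `qb:` terms come from the Data Cube vocabulary, but the `ex:` base IRI, the `ex:row` dimension, the naming scheme, and the `data_cube_turtle` helper are all assumptions, and the result has not been checked against the vocabulary's integrity constraints.

```ruby
# qb: is the Data Cube namespace; EX is a hypothetical base IRI.
QB = "http://purl.org/linked-data/cube#"
EX = "http://example.org/r2rdf/"

# Render a measure value: numbers bare, everything else as a quoted string.
# (No escaping is done here; this is only a sketch.)
def cube_value(value)
  value.is_a?(Numeric) ? value.to_s : %("#{value}")
end

# Build Turtle for a structure definition, a dataset, and one observation.
# Assumes a single dimension (ex:row) holding the row name, and one
# measure per dataframe column.
def data_cube_turtle(var, row_name, measures)
  components = ["[ qb:dimension ex:row ]"] +
               measures.keys.map { |name| "[ qb:measure ex:#{name} ]" }
  measure_defs = measures.keys.map { |name| "ex:#{name} a qb:MeasureProperty ." }
  values = measures.map { |name, value| "  ex:#{name} #{cube_value(value)} ;" }

  <<~TURTLE
    @prefix qb: <#{QB}> .
    @prefix ex: <#{EX}> .

    ex:dsd-#{var} a qb:DataStructureDefinition ;
      qb:component #{components.join(",\n    ")} .

    ex:row a qb:DimensionProperty .
    #{measure_defs.join("\n")}

    ex:dataset-#{var} a qb:DataSet ;
      qb:structure ex:dsd-#{var} .

    ex:obs-#{var}-1 a qb:Observation ;
      qb:dataSet ex:dataset-#{var} ;
    #{values.join("\n")}
      ex:row "#{row_name}" .
  TURTLE
end

# Invented sample row; the marker name and values are not real results.
puts data_cube_turtle("mregress", "marker_a",
                      { "chr" => "1", "pos" => 10.5, "lod" => 1.27 })
```

The structure definition is written once per dataset, so adding more rows only means adding more `qb:Observation` blocks that point back at the same `qb:DataSet`.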
Once it’s in this format, 4store and most other triple stores can load the data easily, and if the store provides a built in SPARQL endpoint, it can already be queried using the standard patterns. If new data using the same definition is added, it can be automatically integrated, and if someone else wants to extend or provide more details for any element, they can do so without redefining their schema, and anyone looking for a definition of how the data is structured can find it easily.
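As a sketch of what querying with the standard patterns might look like, the snippet below builds a SPARQL query for Data Cube observations and the corresponding GET URL, using only Ruby's standard library. The endpoint address is an assumption (a local server on port 8080 with a `/sparql/` path), and the helper names are hypothetical; the request itself is not sent, so the code runs without a store.

```ruby
require "uri"

# Assumed endpoint: a local SPARQL server on port 8080. The /sparql/
# path is an assumption about how the server is exposed.
ENDPOINT = "http://localhost:8080/sparql/"

# Select every property/value pair attached to a Data Cube observation.
def observations_query(limit: 10)
  <<~SPARQL
    PREFIX qb: <http://purl.org/linked-data/cube#>
    SELECT ?obs ?property ?value
    WHERE {
      ?obs a qb:Observation ;
           ?property ?value .
    }
    LIMIT #{Integer(limit)}
  SPARQL
end

# The SPARQL protocol accepts the query text in a "query" parameter,
# so the request is just a URL with the query form-encoded.
def query_uri(endpoint, sparql)
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(query: sparql)
  uri
end

uri = query_uri(ENDPOINT, observations_query(limit: 5))
puts uri
```

Fetching that URL with any HTTP client, while the store is running, should return the matching bindings in whatever result format the endpoint defaults to.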
I will go through a quick example from loading a data set in R all the way to viewing it in a Data Cube visualizer as a demonstration of what I’ve implemented so far. Although the native Ruby code allows for more flexibility, I will be using the standalone Java executable for this example. This assumes you’ve installed R/qtl, as well as Rserve. If not, both are available from the R package management system. You’ll also need 4store installed and running its HTTP SPARQL server in unsafe mode, for which you can find instructions here. In this example, I’ll be running mine on the default port 8080.
First let’s start R in our home directory and do some simple analysis with a sample dataset provided by R/qtl. Once we’ve loaded the results into a variable, we’ll close the R session and save our workspace.
$ R
> library(qtl)
> data(listeria)
> mregress = scanone(listeria, method="mr")
> quit()
Save workspace image? [y/n/c]: y
Next, we want to run our tool to dump the objects into 4store. To do this, you simply run the executable jar with the variable to dump as the first argument, and optionally the directory the R workspace was saved in (if it’s not the same directory you’re running the jar from).
$ java -jar QTL2RDF.jar mregress ~
You can find a copy of the jar here, but since this project is under active development that will likely go out of date soon, so you’ll want to check for the latest build on the project’s Github page.
That’s all you really need to do! Your data are now in your local 4store, ready to be extracted or queried as you see fit. However, the jar also offers the option of returning its output as a Turtle formatted string, so if you want to use it with some other program you can pass a fourth argument (for now it can be anything) when you run the jar, then pipe the output to a file.
$ java -jar QTL2RDF.jar mregress ~ 8080 something > mregress.ttl
This file should be importable into any triple store, and usable with other tools that recognize the Data Cube format. One such tool is Cubeviz, which allows you to browse and present statistical information encoded as Data Cube RDF. Cubeviz uses a different triple store called Virtuoso, so it can be a bit time consuming to set up, but once you’ve done so, importing is as easy as pointing it at your data, which you can either upload as a file, provide a link to, or simply paste in (if your dataset is small enough).
You can then decide how you’d like to visualize your data. In this case, we’re probably interested in the LOD, so I set measurement to “lod” and selected a few loci in the dimension configuration.
You could select each locus to view the whole data set, but Cubeviz is still in development, and consequently has some issues, such as failing to scale down the graph for a wide range of values.
So far, this is just the bare minimum required to properly define the object in the Data Cube vocabulary. Some of the mappings may not be quite correct, and there’s a lot that could be done to better define the semantics of the data that’s being converted. As an example, the ‘chr’ and ‘pos’ measures may be better described as attributes rather than measures in their own right. There are no labels or units on the graph, although Cubeviz will create them if your data is formatted correctly. Some of these features will be included as I improve the conversion program, but there will still be some important semantic information that cannot be easily extracted from the R dataframe. Developing ways for users to specify how this should be handled in a friendly and intuitive way is an important facet of this summer’s project, but even at its minimum functionality the tool I’ve developed can already convert and store a popular data structure in a useful and flexible format.
I plan to finish mapping simple R types over the next few days, then write some functions to break general R objects into their component parts and serialize those. By the official start of coding day, I should have the groundwork laid to start applying these methods to Ruby objects and developing more complete ontology handling.
Again, please check out the Github repository (https://github.com/wstrinz/R2RDF) if you’d like to see some more details. The code isn’t particularly efficient or well documented yet, but I’d appreciate any feedback anyone has. If you have questions, please leave a comment here or get in touch with me some other way.