Sharp Scissors, Safety Scissors: What to do With Your PubliSci Dataset

If you’ve been following along with the last two blog posts, you should have a pretty good idea of how to turn most flat or tabular file formats into an RDF dataset using PubliSci’s Reader tools. You now have an unambiguous, self annotated dataset that is both easy for humans to read and can be queried in sexy, sexy SPARQL once loaded into a triple store. So what do you do with it?

In storing, serializing, or “writing down” data, we hope (beyond overcoming a poor memory) to be able to to share what we’ve learned with others who might have questions, criticism, or things they’d like to derive for the information within. Often these ‘others’ are other people, but more and more frequently they are machines and algorithms, especially in fields such as biology which are struggling to growing heaps of data they generate. SPARQL, RDF, and other Semantic Web components are designed to making describing knowledge and posing questions to it accessible to both these types of actors, through its flexible data model, ontological structures, and a host of inter-related software and standards.

Along with a web-friendly scripting language such as Ruby, you can easily build domain specific applications using the Semantic Web’s tools. To provide an example, I’ve created a demonstration server, which you can find at, based on a breast cancer dataset stored collected by Washington University’s Genome Institute, and stored in the TCGA database.

There are two ways to use the demo server; one public, the other private. The public side offers a way to load maf files into the database, a simple html interface with some parts of the data highlighted and linked for you to browse through, and a page for querying the repository using SPARQL.

The private side, protected by a password for now, offers a much more flexible way to interact with the dataset, essentially by letting you write Ruby scripts to run a set of templated queries, create your own, and perform operations such as sorting or statistical tests on the output. However, as James Edward Gray says, Ruby trusts us with the sharp scissors, so if you were to host such an interface on your own machine, you’d want to make sure you don’t give the password to anyone you don’t trust with the sharp scissors, unless you’re running it in a virtual machine or have taken other precautions.

I’ll go over both of these interfaces in turn, starting with the public side.

The Safety Scissors

There’s still a lot you can find out about the dataset from the public side. It’s not much to look at, but you can browse through linked data for the patients and genes represented in the maf file. Because of the semantic web practice of using dereferencable URIs, a lot of the raw data is directly linked to more information about it. Most of the information being presented comes from direct SPARQL queries to the maf dataset, constructed and executed using the ruby-rdf library.

With some further development a very flexible tool for slicing and analyzing one or multiple TCGA datasets could be developed on this backend. As of now most responses are returned as streaming text, which prevents queries and remote service calls from causing timeouts, but makes building a pretty interface more difficult. This could be resolved by splitting it into javascript output and a better looking web interface (such as the one for the PROV demo I created). On top of that, the inclusion of gene sizes is just a small example of the vast amount of information available from external databases; this is, after all, the state of affairs that has lead bioinformaticians to adopt the semantic web.

However, the remaining time in GSOC doesn’t afford me the scope to build up many of these services in a way that makes full use of the information available and the flexible method of accessing it. To address this, I’ve created a more direct interface to the underlying classes and queries which can be accessed using Ruby scripts. It’s protected by a password on the demo site, so if you want to try any of these examples yourself you should grab a clone of the github repository.

Sharp Scissors

The Scissors Cat, by hibbary

In its base form, the scripting interface is not really safe to share with anyone you don’t already trust. Its not quite as insecure as sharing a computer, since it only returns simple strings, but theoretically a motivated person could completely hijack and rewrite the server from this interface; such is the price for the power of Ruby. However, with some sandboxing and a non-instance_eval based implementation the situation could be improved, or this could form the basis of a proper DSL such as Cucumber’s gherkin, which has a well defined grammar using treetop, allowing for a much safer evaluation of arbitrary inputs.

The select Method

The script interface sets you up in an environment with access to the 4store instance holding the maf data, and gives you a few helper methods to access it. Primary among these is the ‘select’ method, which can be used to retrieve specific information from the MAF file by patient ID, and retrieve a few other relevant pieces of information about the dataset, such as the number of patients represented in it.

For example, here’s the script you’d use to wrap a simple query, retrieving the genes with mutations for a given patient.

An example script

An example script

You can further refine results by specifying additional restrictions. Here, the first query first selects all sample with a mutation on NUP107 at first, and the second restricts its results to those starting at position 69135678.

You can also select multiple columns in one go, returning a hash with a key for each selection

Using these methods of accessing the underlying data, you can write more complex scripts to perform analysis, for example here we look for samples with mutations in the gene CASR which more mutations more than one base pair in length

Inline SPARQL Queries

While it may be a blessing for rubyists just getting into the semantic web, if you’re also familiar with SPARQL you probably know that most of the sorting and comparison you might want to do can be performed with it alone. The public side of the maf server does expose a query endpoint, but if you want to tie a series of queries together in a script, or run the output through an external library, you can also easily run inline queries using the scripting interface

This can be used to derive information about how to best access the dataset, which adheres to the general structure of the data cube vocabulary. For example, to see all of the columns you can select data from, you could run a script like

And of course you can mix the two methods, pulling the results of a sparql query into a select call, or vice versa, such as in this next example, where we create a list of all the genes which patients with a mutation in SHANK1 also have.

SPARQL Templates, RDF.rb Queries

A couple of other small features to mention; first, I’ve included the ad-hoc templating system I’ve been using in the gem. It’s similar to the handlebars templating system, which is marked by using double braces ( ‘ {{ ‘ and ‘ }} ‘ ), although here we’re working with SPARQL rather than HTML. This has a few different applications, in that you can reuse query templates in a script, and write a query early on that you will fill values into later.

Second, when you make a ‘select query’ call, the results are converted into plain ruby objects for simpler interaction. Under the hood however these are retrieved using the RDF::Query class, which returns RDF::Solutions that can be interacted with in a more semantic-web aware manner. To get this kind of object as a result, either use “select_raw query” instead, or instantiate a query object and call its #run method, as demonstrated in the gist below where we retrieve all the Nonsense Mutations then process them afterward to sort by patient id or gene type

Saving and Sharing

Finally, the way I’ve set up the server and the nature of instance eval allowed me to include the saving of a ‘workspace’ between evaluations, and sharing of results or methods across sessions and users. To save a variable or result, simple prefix it with an “@” sign, declaring it as an instance variable.

Then you can come back later and run another script

That reuses the instance variable “@result” stored in your instance of the script evaluator. You can do this for procs or lambdas to reuse functions, and pretty much anything else you can think of. Similarly, prefixing the variable with “@@” will mark it as a class variable, enabling anyone accessing the script interface to use it.

Do Not Try This At Home

Again I want to stress that this is by no means a thorough approach to providing public access to an RDF dataset. It is so ridiculously permissive that I’m sure there are people online who would be in physical ill just thinking about the insecurity of my approach. Hopefully if they’re reading this they’d feel inclined to offer some advice for how to do it better, but in lieu of that, I believe that working in a small group on a closed server with this interface could aid collaboration and the prototyping of queries and algorithms. It also helps to show just how flexible the underlying data model we’re operation on can be, and how the impedance between programs and query accessible databases is in many cases less severe with SPARQL than with SQL.

The one huge component of the semantic web this does leave out is interaction between services. The ability to unambiguously make statements with RDF triples creates a natural route for integrating and consuming external services, which I will talk about in more detail in a followup post.



Alongside my work on Karl Broman’s visualizations, I have continued to add features and tests for my Data Cube generator. As a result of some conversations with other members of the development community, we decided to change the way missing values are handled, which entailed a fair amount of refactoring, so I needed to make sure a good set of tests was in place. I’ve built out the Rspec tests to cover some of these situations, but we were also asked to create some tests using Cucumber, a higher level Behavior Driven Development focused testing tool. I’ve always just used Rspec, since it’s favored by some of the people in Madison who helped get me into Ruby. Many people are not fond of the more magic (seeming) way in which cucumber works, but I thought it’d at least be good to learn the basics even if I didn’t use it much beyond that.

Sadly this is not from a bar catering primarily to Rubyists (source)

Turns out, I really like Cucumber. I may even grow to love it one day, although I don’t want to be too hasty with such pronouncements since our relationship is barely a week old. With Rspec, I like the simplicity of laying everything out in one file and organizing instance variables by context, but I still find myself frustrated by little things like trying to test slight variations on an object or procedure. I know Rspec allows for a lot of modularity if you know how to use it, but personally I just find them a little too clunky most of the time. Cucumber, by contrast, is all about reusability, of methods and objects, and the interface it presents just seems to click a lot more with me.

Cucumber is built around describing behaviors, or “features”, of your application in a human readable manner, and tying this description to code that actually tests the features. You create a plaintext description, and provide a list of steps to take to achieve it. The magic comes in the step definitions, which are bound using regular expressions to allow the reuse of individual steps in different scenarios.

Here’s a simple scenario:

Most of this is just descriptive information. The main thing to notice is the “Scenario”, which contains “Given” and “Then” keywords. These, in addition to the “When” keyword (which isn’t used in this example) are the primary building blocks of a Cucumber feature. They are backed up by step definitions in a separate file, written in Ruby.

Cucumber will pull in these step definitions and use them for the “Given” and “Then” calls in the main feature. So far, so good, but this really isn’t worth the added complexity over Rspec yet. If I had a more complicated feature such as this one:

I could add “Given” definitions to cover the new Scenarios, but that would be a waste of the tools Cucumber offers (and the powers of Ruby). Instead, I can put parentheses around an element of the regular expression in the step definition to make it an argument for the step. Then with just a little Ruby cleverness I can reuse one step definition for all of these scenarios:

Nothing too complicated here, but it always makes me smile when I can do something simply in Ruby that I’d only barely feel comfortable hacking together with reflection in Java even after an entire undergrad education in the language.

I’ve written some steps to test the presence of various Data Cube elements in the output of a generator (check the repository if you’re interested), and I’m also working on moving my Integrity Constraints tests over from Rspec. This will make it easy to test my code on your code, or random csv files, so will help with adding features and investigating edge cases tremendously.

Data Frames and Data Cubes

One of the most commonly used abstract structures in R is the dataframe. Technically, a dataframe is a list containing vectors of equal lengths, but it is generally used to represent tables of data, where each row represents an observation, and each column a variable attribute for the observation. Dataframes are commonly used to to store structured results for some computation in R, such as the chromosome, position, and LOD score for a QTL analysis operation. It is less flexible than a general R class, but more complex than a simple list or primitive value, so I decided to use it as my test case for the R to RDF process.


Although there’s nothing approaching the range of Semantic Web libraries available for popular language like Java, the Ruby community has developed a fair number of libraries for interacting with RDF formatted data, which will allow me to focus more development time on building useful tools than on the background infrastructure to support them. Chief among these is Ruby RDF Project, which hosts a number of interrelated libraries built on a basic core for interacting with Semantic data. Ruby RDF features a Ruby object representation of triples and graphs and methods for interacting with them, as well as a number of plugins for reading and writing data using different storage systems and RDF serialization standards. It takes the approach of focusing on a limited set of functions, implementing them efficiently and robustly, and providing an extensible architecture to allow new functionality to be added. Although it started with built-in support for only one of the simplest RDF formats, ntriples, functional and well-maintained extensions or plugins are now available for most popular triple stores, serialization formats, and querying methods.

Although anyone can invent their own vocabulary and write perfectly valid RDF using it, the focus of the Semantic Web is on interoperability via re-use of existing standards. Converting tabular data to RDF has historically been somewhat difficult, as the two formats are not naturally very compatible. Fortunately recent work in unifying a number of existing data description vocabularies has yielded a reasonably general specification for multidimensional data: The Data Cube Vocabulary. It is a rich and descriptive system, as you might guess looking at this chart of its components:

Data Cube Vocabulary

However, it can still describe simple datasets without requiring every component above, so it is flexible as well as powerful. At its core, the Data Cube vocabulary is essentially just a long list of observations, and the dimensions, units, measurements and attributes that define them. It has been used in a number of real world applications, particularly in the publication of large data sets by government agencies, such as the financial data published by the UK government as part of their transparency initiative, and a wide variety of others.

Once data is in RDF form, you need a somewhere to store it. Although theoretically you could use a standard relational database (and many do offer direct support for triple formatted data), or even just store everything as a bunch of serialized text files, in practice RDF data is generally stored using a type of database called a triplestore. These are optimized for storing data in Subject Predicate Object and related forms, so are generally more efficient and easy to integrate with other triple based data sets or querying protocols. Although there are differences between types of triple stores, ranging for subtle to extreme, Ruby RDF abstracts many of these away with a Repository object that acts as an interface for loading data into a store and performing queries on it. To begin with, we are using 4store for our triplestore, as it’s free, supports the latest version of the RDF and SPARQL standard, is a native triple store (as opposed to an extension for another type of DB), and can run as a cluster for operating on large amounts of data. A plugin is already available to use 4store as a repository in Ruby RDF. It isn’t fully finished, so implementing the remaining functions may become a part of my work this summer if we continue to use it, but it works well enough to load a set of triples into the store, which is all I need at this stage.

Finally, I’ll be using the Rserve tool and accompanying rserve-client Ruby gem to manage the interaction between R and Ruby. While there are a few options for doing this, Rserve seemed to offer the best compromise between performance and deployability, making up for its slight disadvantage in speed compared to some libraries (eg RSruby) by being written in pure Ruby, and thus more easily deployable in different environments.


The overall process for converting from an R dataframe to a rudimentary Data Cube object is actually fairly straightforward. When using the standalone program, you just provide the name of the R variable to dump, and possibly the location of your R workspace and the port your 4store instance is running on if these differ from the defaults (loading the Ruby class in your code offers more flexibility). The script first creates a hash from the data, which is somewhat counter-intuitively represented as a monkey patched Array object with “name” attributes instead of just a regular hash. This is a reasonably simple operation for a basic dataframe, which, as mentioned before, is structurally equivalent to a table with named rows, but the process may become more complex when I move on to supporting a wider set of R objects.

At first I attempted to create an ad-hoc placeholder vocabulary and create triples for each entry and relation in the dataframe using Ruby RDF’s included statement builder, think I would test the 4store interface and implement a more complete vocabulary later. I quickly realized working with plain triples is too cumbersome an approach to truly be viable, and it was producing unnecessarily large graphs for the amount of information I was attempting to store. I decided to move straight to implementing the Data Cube vocabulary, using an RDF description language known as N3, or more specifically a subset of it known as Turtle. N3 is actually powerful enough to express statements which do not translate to valid RDF, while Turtle is limited to just the subset of N3 that corresponds to proper RDF. In addition to being more powerful than the basic language of triples, Turtle is also one of the most human readable formats for working with RDF, which can otherwise wind up looking like a big morass of unformatted HTML. The downside is that Ruby RDF doesn’t  have full yet support for writing blocks of Turtle statements, so currently much of the code relies on string building.

I decided on a few simple mappings for parts of the dataframe to get things started. They may not be perfectly accurate or the proper representation of the data, but I wanted to make sure I was at least generating a valid Data Cube before adding the fine details. Each row of the dataframe becomes a Data Cube dimension, each column a measure, and the  dataset as well as its supporting components objects are named using the name of the variable being converted. The exact way in which these elements are linked can seem a little arcane at first, but as I mentioned, it really just comes down to splitting the data into observations and attaching dimensions or attributes to describe it. If you’d like an example if how it all winds up looking, see this simple file describing just one datapoint:

Once its in this format, 4store and most other triple stores can load the data easily, and if the store provides a built in SPARQL endpoint, it can already be queried using the standard patterns. If new data using the same definition is added, it can be automatically integrated, and if someone else wants to extend or provide more details for any element, they can do so without redefining their schema, and anyone looking for a definition of how the data is structure can find it easily.

How To

I will go through a quick example from loading a data set in R all the way to viewing it in a Data Cube visualizer as a demonstration of what I’ve implemented so far. Although the native Ruby code allows for more flexibility, I will be using the standalone java executable for this example. This assumes you’ve installed R/qtl, as well as Rserve. If not, both are available from the R package management system. You’ll also need 4store installed and running its http SPARQL server in unsafe mode, for which you can find instructions here. In this example, I’ll be running mine on the default port 8080.

First lets start R in our home directory and do some simple analysis with a sample dataset provided by R/qtl. Once we’ve loaded the results into a variable, we’ll close the R session and save our workspace.

$ R
> library(qtl)
> data(listeria)
> mregress = scanone(listeria, method="mr")
> quit()
Save workspace image? [y/n/c]: y

Next, we want to run our tool to dump the objects into 4store. To do this, you simply run the executable jar with the variable to dump as the first argument, and optionally the directory the R workspace was saved in (if it’s not the same directory you’re running the jar from).

$ java -jar QTL2RDF.jar mregress ~

You can find a copy of the jar here, but since this project is under active development that will likely go out of date soon, so you’ll want to check for the latest build on the project’s Github page.

Thats all you really need to do! Your data are now in your local 4store, ready to be extracted or queried as you see fit. However, the jar will also offers the option of returning its output as a turtle formatted string, so if you wanted to use it with some other program you can pass a fourth argument (for now it can be anything) when you run the jar, then pipe the output to a file.

java -jar QTL2RDF.jar mregress ~ 8080 something > mregress.ttl

This file should be importable into any triple store, and usable with other tools that recognize the Data Cube format. One such tool is Cubeviz, which allows you to browse and present statistical information encoded as Data Cube RDF. Cubeviz uses a different triple store called Virtuoso, so it can be a bit time consuming to set up, but once you’ve done so importing your data is as easy as pointing it at your data, which you can either upload as a file, provide a link to, or simply paste in (if you’re dataset is small enough).

Pasting the data set

Pasting the data set

You can then decide how you’d like to visualize your data. In this case, we’re probably interested in the LOD, so I set measurement to “lod” and selected a few loci in the dimension configuration.


You could select each locus to view the whole data set, but Cubeviz is still in development,  and so consequently has some issues such as failing to scale down the graph for a wide range of values.

So far, this is just the bare minimum required to properly define the object in the Data Cube vocabulary. Some of the mappings may not be quite correct, and there’s a lot that could be done to better define the semantics of the data that’s being converted. As an example, the ‘chr’ and ‘pos’ measures may be better described as attributes rather than measures in their own right. There are no labels or units on the graph, although Cubeviz will create them if your data is formatted correctly. Some of these features will be included as I improve the conversion program, but there will still be some important semantic information that be easily extracted from the R dataframe. Developing ways for users to specify how this should be handled in an friendly and intuitive way is an important facet of this summer’s project, but even at its minimum functionality the tool I’ve developed can already convert and store a popular data structure in a useful and flexible format.

Next Steps

I plan to finish mapping simple R types over the next few days, then write some functions to break general r objects into their component parts and serialize those. By the official start of coding day, I should have the groundwork laid to start applying these methods to Ruby objects and developing more complete ontology handling.

Again, please check out the Github repository ( if you’d like to see some more details. The code isn’t particularly efficient or well documented yet, but I’d appreciate any feedback anyone has. If you have questions, please leave a comment here or get in touch with me some other way.

Goals: Reproducible Science

This project has a number of goals, including improving support for large datasets in the bioinformatics community, furthering the development of semantic web technologies, and supporting data sharing and reproducible science. Today I’m going to go into a little more detail about the former and talk about the work I’ve done so far on it.

To get up to speed on all of the interrelated Semantic Web standards and technologies, I have been working on a tool for converting objects from the R statistical computing language into RDF triples, the native format of the Semantic Web. Although this will be a valuable tool on its own, it is also being developed to support the next version of R/qtl, a library developed by Karl Broman and Hao Wu, as well as a host of other contributors, which offers functions for doing Quantitative Trait Loci mapping using R. The next incarnation of R/qtl will focus on support for highly parallelized computation, a key component of which will be storing results in a database that can be queried and manipulated remotely, as opposed to keeping huge data sets in memory on one computer.

rqtlplanAn overview of the plans for R/qtl

Data Sharing and Reproducibility

The other advantage of storing statistical data in triple based format is that it can be easily, even automatically, published for others to download, inspect, and interact with. In RDF, every property of a resource is defined by its relation to other resources or objects, and each relation comes with an attached definition that either a machine or a human can access to get further details about it. This allows for a huge amount of flexibility in data types and storage schema, as well as the application of algorithmic reasoning techniques to simplify a data set or find out more about its implications.

Publishing scientific data in a machine readable format also makes it dramatically more for scientists hoping to replicate the results or build upon them. Most publications will, at best, include supporting data as a table or tables of statistical aggregates, and even when lower level raw data are available it is usually stored in a flat format such as csv, which includes little to no semantic content such as the units or attributes of objects, or their meaning in the context of the rest of the data. While some fields have begun using relational database technology more extensively, the fact that the most popular data storage formats for many researchers are essentially text files speaks to the rigidity and extra complications of using  dedicated database systems.

RDF has a somewhat mind bending structure of its own to understand, and it’s certainly no silver bullet for the problem of generalized data storage, but its flexibility allows it to overcome much of the ossification and user-unfriendliness of trying to use relation databases for storing and publishing scientific results. A primary goal of my work this summer will be to extend existing Ruby tools and build new ones to help make Semantic data storage accessible and simple for researchers, supporting the crucial work of examining, validating, and extending published results.


The Semantic Web is built on three core technologies; the RDF protocol, the SPARQL query language, and the OWL web ontology language. There is a large body of documentation about all three of these tools, which can be found around the web or in the W3C’s standards documentation. I may write some posts going into more detail, but for now, a brief overview:

RDF essentially comes down to describing data using the ‘Subject Predicate Object’ format, creating statements known as ‘Triples’. An example would be “John(subject) knows(predicate) Mary(object)”. With a few exceptions, each component of this triple is given a URI, which looks like a lot like a regular URL and can be used to uniquely identify a resource (such as ‘John’ or ‘Mary’), or a relation (such as ‘friends with’), as well as providing a link where more information about the object or relation may be found. You can think of subjects and objects as nodes, and predicates as a lines connecting them. Although this doesn’t mean much for one statement such as “John knows Mary”, a collection of similar statements define a big directed graph of interconnected nodes, where every connection is labeled with its meaning.

SPARQL is the official query language of the Semantic Web. It can be used to select elements of an RDF graph based on their subject, predicate, or object, elements. Its syntax resembles, superficially at least, the SQL language familiar to many database users, but in practice it functions quite differently. However, many of the advanced operations of SQL, such as pivoting, are still supported.

OWL is a language for describing ontologies, which are used to formally represent the concepts your data represent in an RDF store, allowing simpler machine interpretation. The technology is crucial to the interconnectedness of the Semantic Web, as it is the means by which the relations between disparate resources can be automatically discovered, allowing relatively easy integration of new or existing data.

There is, of course, much more to each of these technologies, and the technicalities, use cases, and extensions of each of them, then would fit in one section of the blog post, but hopefully this gives a broad overview of how the project will work.

A Starting Place

To begin with, I have developed a Ruby based tool to automatically convert data frame objects in R to RDF, using the Data Cube vocabulary, which was developed as a generalizable way of representing  multidimensional data. It can be run easily on any Ruby capable machine, but since the script and all of its dependencies are pure Ruby libraries, I was also able to deploy it as an executable jar with warbler, so anyone with java installed can use it without having to download any dependencies. I’ll go into more detail about how this works, why its useful, and where this part of the project is headed, but I’ve decided to break it into a separate blog post.