Sharp Scissors, Safety Scissors: What to do With Your PubliSci Dataset

If you’ve been following along with the last two blog posts, you should have a pretty good idea of how to turn most flat or tabular file formats into an RDF dataset using PubliSci’s Reader tools. You now have an unambiguous, self-annotated dataset that is both easy for humans to read and can be queried in sexy, sexy SPARQL once loaded into a triple store. So what do you do with it?

In storing, serializing, or “writing down” data, we hope (beyond overcoming a poor memory) to be able to share what we’ve learned with others who might have questions, criticism, or things they’d like to derive from the information within. Often these ‘others’ are other people, but more and more frequently they are machines and algorithms, especially in fields such as biology which are struggling to keep up with the growing heaps of data they generate. SPARQL, RDF, and other Semantic Web components are designed to make describing knowledge and posing questions to it accessible to both of these types of actors, through a flexible data model, ontological structures, and a host of inter-related software and standards.

Along with a web-friendly scripting language such as Ruby, you can easily build domain-specific applications using the Semantic Web’s tools. To provide an example, I’ve created a demonstration server, which you can find at, based on a breast cancer dataset collected by Washington University’s Genome Institute and stored in the TCGA database.

There are two ways to use the demo server: one public, the other private. The public side offers a way to load maf files into the database, a simple HTML interface with some parts of the data highlighted and linked for you to browse through, and a page for querying the repository using SPARQL.

The private side, protected by a password for now, offers a much more flexible way to interact with the dataset, essentially by letting you write Ruby scripts to run a set of templated queries, create your own, and perform operations such as sorting or statistical tests on the output. However, as James Edward Gray says, Ruby trusts us with the sharp scissors, so if you were to host such an interface on your own machine, you’d want to make sure you don’t give the password to anyone you don’t trust with the sharp scissors, unless you’re running it in a virtual machine or have taken other precautions.

I’ll go over both of these interfaces in turn, starting with the public side.

The Safety Scissors

There’s still a lot you can find out about the dataset from the public side. It’s not much to look at, but you can browse through linked data for the patients and genes represented in the maf file. Because of the semantic web practice of using dereferenceable URIs, a lot of the raw data is directly linked to more information about it. Most of the information being presented comes from direct SPARQL queries to the maf dataset, constructed and executed using the ruby-rdf library.

With some further development, a very flexible tool for slicing and analyzing one or multiple TCGA datasets could be built on this backend. As of now most responses are returned as streaming text, which prevents queries and remote service calls from causing timeouts, but makes building a pretty interface more difficult. This could be resolved by splitting it into JavaScript output and a better looking web interface (such as the one for the PROV demo I created). On top of that, the inclusion of gene sizes is just a small example of the vast amount of information available from external databases; this is, after all, the state of affairs that has led bioinformaticians to adopt the semantic web.

However, the remaining time in GSOC doesn’t afford me the scope to build up many of these services in a way that makes full use of the information available and the flexible method of accessing it. To address this, I’ve created a more direct interface to the underlying classes and queries, which can be accessed using Ruby scripts. It’s protected by a password on the demo site, so if you want to try any of these examples yourself you should grab a clone of the GitHub repository.

Sharp Scissors

The Scissors Cat, by hibbary

In its base form, the scripting interface is not really safe to share with anyone you don’t already trust. It’s not quite as insecure as sharing a computer, since it only returns simple strings, but theoretically a motivated person could completely hijack and rewrite the server from this interface; such is the price for the power of Ruby. However, with some sandboxing and an implementation that isn’t based on instance_eval, the situation could be improved. Alternatively, this could form the basis of a proper DSL such as Cucumber’s Gherkin, which has a well-defined grammar using Treetop, allowing for a much safer evaluation of arbitrary inputs.
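
As a rough sketch of the "well-defined grammar" direction, the snippet below parses a tiny command language by hand instead of evaluating Ruby. It is not how the demo server works and it does not use Treetop; the command syntax, the parse_command helper, and the whitelist of column names are all assumptions made for illustration.

```ruby
# Hypothetical restricted command language (not part of the gem):
#   select <column> [where <key>=<value> ...]
# Input is never evaluated as Ruby, so it can only ever produce a plain Hash.
ALLOWED_COLUMNS = %w[Hugo_Symbol Start_Position Variant_Classification].freeze

def parse_command(line)
  verb, column, *rest = line.strip.split(/\s+/)
  raise ArgumentError, "unknown verb: #{verb.inspect}" unless verb == "select"
  raise ArgumentError, "unknown column: #{column.inspect}" unless ALLOWED_COLUMNS.include?(column)

  restrictions = {}
  unless rest.empty?
    raise ArgumentError, "expected 'where'" unless rest.shift == "where"
    rest.each do |pair|
      key, value = pair.split("=", 2)
      # Only whitelisted keys and plain word characters are accepted as values.
      valid = ALLOWED_COLUMNS.include?(key) && value.to_s.match?(/\A\w+\z/)
      raise ArgumentError, "bad restriction: #{pair.inspect}" unless valid
      restrictions[key] = value
    end
  end

  { column: column, restrictions: restrictions }
end
```

Anything outside the whitelist is rejected with an ArgumentError before it gets near the triple store, which is the property instance_eval can't offer.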

The select Method

The script interface sets you up in an environment with access to the 4store instance holding the maf data, and gives you a few helper methods to access it. Primary among these is the ‘select’ method, which can be used to retrieve specific information from the MAF file by patient ID, and retrieve a few other relevant pieces of information about the dataset, such as the number of patients represented in it.

For example, here’s the script you’d use to wrap a simple query, retrieving the genes with mutations for a given patient.

An example script


You can further refine results by specifying additional restrictions. Here, the first query selects all samples with a mutation in NUP107, and the second restricts its results to those starting at position 69135678.

You can also select multiple columns in one go, returning a hash with a key for each selection.

Using these methods of accessing the underlying data, you can write more complex scripts to perform analysis. For example, here we look for samples with mutations in the gene CASR which are more than one base pair in length.

Inline SPARQL Queries

While the select method may be a blessing for rubyists just getting into the semantic web, if you’re also familiar with SPARQL you probably know that most of the sorting and comparison you might want to do can be performed with it alone. The public side of the maf server does expose a query endpoint, but if you want to tie a series of queries together in a script, or run the output through an external library, you can also easily run inline queries using the scripting interface.

This can be used to derive information about how to best access the dataset, which adheres to the general structure of the data cube vocabulary. For example, to see all of the columns you can select data from, you could run a script like
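
One way the query inside such a script might look is sketched below. It assumes the dataset declares its columns through a qb:DataStructureDefinition, as the Data Cube vocabulary prescribes; the constant name is made up, and the string would be passed to whatever inline query helper or endpoint you are using.

```ruby
# Assumption: columns are attached to component specifications as
# dimensions, measures, or attributes, per the RDF Data Cube vocabulary.
COMPONENTS_QUERY = <<~SPARQL
  PREFIX qb: <http://purl.org/linked-data/cube#>

  SELECT DISTINCT ?component WHERE {
    ?structure a qb:DataStructureDefinition ;
               qb:component ?spec .
    { ?spec qb:dimension ?component }
    UNION
    { ?spec qb:measure ?component }
    UNION
    { ?spec qb:attribute ?component }
  }
  ORDER BY ?component
SPARQL
```

The UNION spells out the three component kinds explicitly, so the query doesn't depend on the store inferring that they are all sub-properties of qb:componentProperty.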

And of course you can mix the two methods, pulling the results of a SPARQL query into a select call, or vice versa, such as in this next example, where we create a list of all the genes in which patients with a mutation in SHANK1 also have mutations.

SPARQL Templates, RDF.rb Queries

A couple of other small features are worth mentioning. First, I’ve included the ad-hoc templating system I’ve been using in the gem. It’s similar to the Handlebars templating system, which is marked by its use of double braces ( ‘ {{ ‘ and ‘ }} ‘ ), although here we’re working with SPARQL rather than HTML. This has a few different applications: you can reuse query templates in a script, and you can write a query early on that you will fill values into later.
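
To make the idea concrete, here is a minimal sketch of double-brace substitution in plain Ruby. It is not the gem’s implementation; the fill_template helper, the placeholder names, and the example.org property URIs are all invented for illustration.

```ruby
# Hypothetical helper: replace each {{name}} placeholder with the matching
# value from the hash, failing loudly if a placeholder has no value.
def fill_template(template, values)
  template.gsub(/\{\{\s*(\w+)\s*\}\}/) do
    key = Regexp.last_match(1)
    values.fetch(key.to_sym) { raise KeyError, "no value for placeholder #{key}" }
  end
end

# A query written once, with the specifics left open until later.
GENES_FOR_PATIENT = <<~SPARQL
  SELECT DISTINCT ?gene WHERE {
    ?observation <{{patient_property}}> "{{patient_id}}" ;
                 <{{gene_property}}> ?gene .
  }
SPARQL

query = fill_template(
  GENES_FOR_PATIENT,
  patient_property: "http://example.org/patient_id", # placeholder URI
  gene_property: "http://example.org/gene",          # placeholder URI
  patient_id: "PATIENT-01"                           # placeholder ID
)
```

Values are inserted verbatim, so a helper like this is only appropriate for input you trust or have already escaped.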

Second, when you make a ‘select query’ call, the results are converted into plain Ruby objects for simpler interaction. Under the hood, however, these are retrieved using the RDF::Query class, which returns RDF::Solutions that can be interacted with in a more semantic-web aware manner. To get this kind of object as a result, either use “select_raw query” instead, or instantiate a query object and call its #run method, as demonstrated in the gist below, where we retrieve all the Nonsense Mutations and then process them afterward to sort by patient ID or gene type.

Saving and Sharing

Finally, the way I’ve set up the server and the nature of instance_eval allowed me to include the saving of a ‘workspace’ between evaluations, and sharing of results or methods across sessions and users. To save a variable or result, simply prefix it with an “@” sign, declaring it as an instance variable.

Then you can come back later and run another script that reuses the instance variable “@result” stored in your instance of the script evaluator. You can do this for procs or lambdas to reuse functions, and pretty much anything else you can think of. Similarly, prefixing the variable with “@@” will mark it as a class variable, enabling anyone accessing the script interface to use it.
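
The mechanism behind this can be sketched in a few lines. The ScriptWorkspace class below is a simplified stand-in, not the server’s actual evaluator: it assumes each user keeps one long-lived object and that every submitted script is instance_eval’d against it.

```ruby
# Simplified stand-in for the script evaluator (an assumption, not the
# server's real class). Instance variables set by one script live on the
# object, so later scripts run against the same object can see them.
class ScriptWorkspace
  def run(script)
    instance_eval(script)
  end
end

workspace = ScriptWorkspace.new
workspace.run('@result = [3, 1, 2]')               # first submission
workspace.run('@sorter = ->(list) { list.sort }')  # lambdas persist too
sorted = workspace.run('@sorter.call(@result)')    # a later submission
```

A different ScriptWorkspace instance starts with none of these variables, which is what keeps one user’s workspace separate from another’s.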

Do Not Try This At Home

Again, I want to stress that this is by no means a thorough approach to providing public access to an RDF dataset. It is so ridiculously permissive that I’m sure there are people online who would be physically ill just thinking about the insecurity of my approach. Hopefully if they’re reading this they’d feel inclined to offer some advice for how to do it better, but in lieu of that, I believe that working in a small group on a closed server with this interface could aid collaboration and the prototyping of queries and algorithms. It also helps to show just how flexible the underlying data model we’re operating on can be, and how the impedance mismatch between programs and query-accessible databases is in many cases less severe with SPARQL than with SQL.

The one huge component of the semantic web this does leave out is interaction between services. The ability to unambiguously make statements with RDF triples creates a natural route for integrating and consuming external services, which I will talk about in more detail in a follow-up post.

PubliSci as a service

The provenance DSL and DataSet generation I’ve been working on have most of their basic functionality in place, but I also planned on creating a web-based API for accessing the gem and building services on top of it. I’ve created a demo site as a prototype for this feature using Ruby on Rails, and I’m happy enough with it that I’d like to make the address public so people can poke around and give me feedback. Although eventually I’ll be separating some of this functionality into a lighter weight server, Rails has helped immensely in developing it, both because it naturally encourages good RESTful design, and because the Ruby community has created many useful gems and tools for rapid prototyping of websites using the framework. You can find the demo site up at, or you can take a look at the source on GitHub.

REST in a nutshell. Think putting the nouns in the URL and the verbs in the request type (source)
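
A small sketch of that idea in plain Ruby: the resource (the noun) lives in the path, and the HTTP method (the verb) picks the action. The paths mirror the Entities resource described in this post, but the table and the route helper are illustrative assumptions; the real routes are whatever the Rails app defines.

```ruby
# Illustrative routing table, not the demo app's actual routes.
ROUTES = {
  ["GET",    "/entities"]     => :index,   # list all entities
  ["POST",   "/entities"]     => :create,  # add a new entity
  ["GET",    "/entities/:id"] => :show,    # read one entity
  ["PUT",    "/entities/:id"] => :update,  # change one entity
  ["DELETE", "/entities/:id"] => :destroy  # remove one entity
}.freeze

# Map a request's verb and path to an action name, or nil if none matches.
def route(verb, path)
  pattern = path.sub(%r{\A(/\w+)/[^/]+\z}, '\1/:id')
  ROUTES[[verb.upcase, pattern]]
end

route("GET", "/entities/42")    # => :show
route("DELETE", "/entities/42") # => :destroy
```

The same URL means different things depending on the verb, which is why the table is keyed on both.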

The server acts as an interface to most of the basic functions of the gem: the DSL, the dataset RDFization classes, and the Triplestore convenience methods. Furthermore, this functionality is accessible either through an HTML interface (with a pleasant Bootstrap theme!), or programmatically as a (mostly) RESTful web service, using JavaScript and JSON.

I’m planning to write a tutorial on how to create a publication with it, but for this post I’ll just give a broad overview of how you can use the service. The example data on the site now is based on the PROV primer, with a couple of other elements added to test different features, so it may seem a bit contrived, but it should give you some idea how you could use the site’s various features.

The root page of the site will show you the DSL script that was used to initialize the site, with syntax highlighting thanks to the lovely Coderay gem.


You can also edit the DSL script, which will regenerate the underlying data and set up a new repository object for you. As a warning up front, the DSL is currently based on instance_eval, which introduces a big security risk if not handled properly. I’m working on automatically sandboxing the evaluation in a future version, but for now if you’re worried about security you can easily change a line of the initializer to disable remote users from updating the DSL.

Along the top, you’ll see links for Entities, Activities, and Agents, which are elements of the PROV ontology, as well as Datasets, which represent any Data Cube formatted data stored in the repository. Each of these elements acts as a RESTful resource, which can be created/read/updated/deleted in much the same way as with a standard ActiveRecord model. Let’s take a look at the Entities page to see how this works.


On the Entities page, you can see a table where each row represents an entity. PROV-relevant properties and relationships are also displayed and hyperlinked, allowing you to browse through the information using the familiar web idiom of linked pages. All of this is done using SPARQL queries behind the scenes, but the user doesn’t (immediately) need to know about this to use the service.

Clicking the “subject” field for each Entity will take you to a page with more details, as well as a link to the corresponding DataSet if it exists.


From that page, you can export the DataSet using the writers in the bio-publisci gem. At the moment, the demo site can export CSV or Weka ARFF data, but I’ve been working recently on streamlining the writer classes, and I’ll be adding writers for R and some of the common SciRuby tools before the end of the summer.

You can also edit Entities, Agents, and Activities, or create new ones in case you want to correct a mistake or add information to the graph. This is a tiny bit wonky in some spots, both because of the impedance between RDF and the standard tabular backend Rails applications generally use, and because I’m by no means a Rails expert, but you can edit most of the fields and the creation of new resources generally works fine.


I think being able to browse the provenance graph of a dataset or piece of research using an intuitive browser-based interface forms a useful bridge between the constraints and simplicity of a standard CRUD-ish website and the powerful but daunting complexity of RDF and SPARQL, but if you find parts of this model too constraining you can also query the repository directly using SPARQL:


Two things to note about this. First, this provides a single interface to run queries on any of the repositories supported by the ruby-rdf project, even some without a built-in SPARQL endpoint. Second, since I’ve set up the Rails server with a permissive CORS policy, you can run queries or access resources across domains, allowing you to easily integrate it into an AJAX application or just about anywhere else. For an example, have a look at this JSFiddle that creates a bar chart in D3 from one of the datasets on the demo site.
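
Outside the browser, CORS doesn’t come into play, but the same kind of request can be assembled from a script. The sketch below only builds the request URL using Ruby’s standard library; the endpoint path and the "query" parameter name are assumptions, so check the demo site’s query page for the real ones.

```ruby
require "uri"

# Build a GET URL for a SPARQL query. Both the endpoint and the parameter
# name ("query") are hypothetical stand-ins for the demo site's real ones.
def sparql_request_uri(endpoint, query)
  uri = URI.parse(endpoint)
  uri.query = URI.encode_www_form(query: query) # percent-encodes the query text
  uri
end

uri = sparql_request_uri(
  "http://localhost:3000/repository/sparql",    # hypothetical endpoint
  "SELECT * WHERE { ?s ?p ?o } LIMIT 10"
)
# Net::HTTP.get(uri) (after require "net/http") would then send the request.
```

Encoding the query with encode_www_form matters here, since SPARQL is full of braces, question marks, and spaces that can’t appear raw in a URL.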

A few other features that may come in handy have been implemented, which I’ll wait to detail in a later post. One that may be useful to some people is the ability to dump the entire contents of the repository in Turtle RDF form. If you wanted to make a complete copy of the site’s repository, or save changes you’d made in a serialized format for later, it’s as easy as calling the repository/dump route. Additionally, the dataset DSL will automatically download and handle remote files specified by their URLs using the ‘object’ keyword, which makes loading external datasets extremely simple.

There’s a fair bit more to do to make this a fully featured web service; some elements of the PROV vocabulary are not fully represented, I’d really like to separate out fundamental parts into a more lightweight and deployable Sinatra server, and raw (non-RDF) datasets need better handling. Additionally, while you can easily switch between an in-memory repository for experimentation and more dedicated software such as 4store for real work, it’d be nice to make the two work together, so you could have an in-memory workspace, then save your changes to the more permanent store when you were ready. Aside from the performance gain due to not having to wait for queries on a large repository, this helps with deleting resources or clearing your workspace, as the methods of deleting data from triple stores are somewhat inconsistent and underdeveloped across different software.
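
The workspace idea could look something like the following plain-Ruby sketch. Nothing here exists in the gem: StagedStore is a hypothetical class, and the two Sets stand in for what would really be an in-memory repository and a 4store-backed one.

```ruby
require "set"

# Hypothetical staging layer: changes accumulate in a scratch workspace
# and only reach the permanent store on an explicit commit.
class StagedStore
  attr_reader :permanent, :workspace

  def initialize
    @permanent = Set.new # stands in for the 4store-backed repository
    @workspace = Set.new # stands in for the in-memory repository
  end

  def add(subject, predicate, object)
    @workspace << [subject, predicate, object]
  end

  # Clearing the workspace never touches the permanent store, which
  # sidesteps the store's own deletion quirks.
  def discard
    @workspace.clear
  end

  # Copy staged triples into the permanent store, then empty the workspace.
  def commit
    @permanent.merge(@workspace)
    @workspace.clear
  end

  # Everything visible to queries: saved triples plus staged ones.
  def triples
    @permanent | @workspace
  end
end
```

With this shape, throwing away an experiment is a cheap in-memory operation, and only deliberate commits ever hit the slower permanent store.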

Some of the cooler but less important features will have to wait until after GSOC, but if this is a tool you might use and there’s a particular feature you think would be important to have included in the basic version by the end of the summer please get in touch!