PubliSci as a service

The provenance DSL and DataSet generation I’ve been working on have most of their basic functionality in place, but I also planned on creating a web-based API for accessing the gem and building services on top of it. I’ve created a demo site as a prototype for this feature using Ruby on Rails, and I’m happy enough with it that I’d like to make the address public so people can poke around and give me feedback. Although eventually I’ll be separating some of this functionality into a lighter-weight server, Rails has helped immensely in developing it, both because it naturally encourages good RESTful design and because the Ruby community has created many useful gems and tools for rapid prototyping of websites using the framework. The demo site is up now, and you can also take a look at the source on GitHub.

REST in a nutshell. Think putting the nouns in the URL and the verbs in the request type (source)

The server acts as an interface to most of the basic functions of the gem: the DSL, the dataset RDFization classes, and the Triplestore convenience methods. Furthermore, this functionality is accessible either through an HTML interface (with a pleasant Bootstrap theme!) or programmatically as a (mostly) RESTful web service, using JavaScript and JSON.

I’m planning to write a tutorial on how to create a publication with it, but for this post I’ll just give a broad overview of how you can use the service. The example data on the site now is based on the PROV primer, with a couple of other elements added to test different features, so it may seem a bit contrived, but it should give you some idea how you could use the site’s various features.

The root page of the site will show you the DSL script that was used to initialize the site, with syntax highlighting thanks to the lovely CodeRay gem.


You can also edit the DSL script, which will regenerate the underlying data and set up a new repository object for you. As a warning up front, the DSL is currently based on instance_eval, which introduces a big security risk if not handled properly. I’m working on automatically sandboxing the evaluation in a future version, but for now, if you’re worried about security, you can easily change a line of the initializer to disable remote users from updating the DSL.
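To make the instance_eval warning concrete, here is a minimal sketch of the problem. The TinyDSL class and its entity keyword are made up for illustration and are not the gem’s actual code; the point is only that a script evaluated this way can run any Ruby, not just DSL keywords.

```ruby
# Hypothetical mini-DSL for illustration only; not PubliSci's actual code.
class TinyDSL
  attr_reader :entities

  def initialize
    @entities = []
  end

  # DSL keyword: register an entity by name.
  def entity(name)
    @entities << name
  end

  # Evaluate a script string with `self` set to this object,
  # returning whatever the last expression in the script returns.
  def run(script)
    instance_eval(script)
  end
end

# Intended use: the script only calls DSL keywords.
dsl = TinyDSL.new
dsl.run("entity :dataset\nentity :chart")
puts dsl.entities.inspect # prints [:dataset, :chart]

# The risk: the same call happily runs arbitrary Ruby. This "script" only
# counts environment variables, but it could just as easily read files
# or shell out.
leaked = TinyDSL.new.run("ENV.keys.length")
puts "script could see #{leaked} environment variables"
```

This is why a script submitted by a remote user should be treated as untrusted code until the evaluation is sandboxed.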

Along the top, you’ll see links for Entities, Activities, and Agents, which are elements of the PROV ontology, as well as Datasets, which represent any Data Cube formatted data stored in the repository. Each of these elements acts as a RESTful resource, which can be created/read/updated/deleted in much the same way as with a standard ActiveRecord model. Let’s take a look at the Entities page to see how this works.
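As a rough sketch of what “RESTful resource” means here, the function below maps an HTTP verb and path onto the conventional Rails controller action names. It assumes the standard Rails resource conventions and leaves out the HTML-form routes (new and edit); the ids are made up, and the demo site’s actual routes may differ in places.

```ruby
# Map an HTTP verb and path onto the conventional Rails action name.
# The noun lives in the path, the verb in the request type.
def rest_action(verb, path)
  resource, id = path.split("/").reject(&:empty?)
  member = !id.nil? # true when the path names one specific resource

  action =
    case [verb.upcase, member]
    when ["GET", false]                 then :index   # list all
    when ["POST", false]                then :create  # add a new one
    when ["GET", true]                  then :show    # read one
    when ["PUT", true], ["PATCH", true] then :update  # change one
    when ["DELETE", true]               then :destroy # remove one
    end

  { resource: resource, id: id, action: action }
end

p rest_action("GET", "/entities")      # action is :index
p rest_action("GET", "/entities/7")    # action is :show
p rest_action("DELETE", "/entities/7") # action is :destroy
```

The same table applies to Activities, Agents, and Datasets, since each is exposed as its own resource.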


On the Entities page, you can see a table where each row represents an entity. PROV-relevant properties and relationships are also displayed and hyperlinked, allowing you to browse through the information using the familiar web idiom of linked pages. All of this is done using SPARQL queries behind the scenes, but the user doesn’t (immediately) need to know about this to use the service.

Clicking the “subject” field for each Entity will take you to a page with more details, as well as a link to the corresponding DataSet if it exists.


From that page, you can export the DataSet using the writers in the bio-publisci gem. At the moment, the demo site can export CSV or Weka ARFF data, but I’ve been working recently on streamlining the writer classes, and I’ll be adding writers for R and some of the common SciRuby tools before the end of the summer.

You can also edit Entities, Agents, and Activities, or create new ones in case you want to correct a mistake or add information to the graph. This is a tiny bit wonky in some spots, both because of the impedance mismatch between RDF and the standard tabular backend Rails applications generally use, and because I’m by no means a Rails expert, but you can edit most of the fields and the creation of new resources generally works fine.


I think being able to browse the provenance graph of a dataset or piece of research using an intuitive browser-based interface forms a useful bridge between the constraints and simplicity of a standard CRUD-ish website and the powerful but daunting complexity of RDF and SPARQL, but if you find parts of this model too constraining you can also query the repository directly using SPARQL:


Two things to note about this: first, this provides a single interface to run queries on any of the repositories supported by the ruby-rdf project, even some without a built-in SPARQL endpoint. Second, since I’ve set up the Rails server with a permissive CORS policy, you can run queries or access resources across domains, allowing you to easily integrate it into an AJAX application or just about anywhere else. For an example, have a look at this jsfiddle that creates a bar chart in d3 from one of the datasets on the demo site.
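For a sense of what a programmatic query might look like, here is a small sketch that builds the request URL for a SPARQL query listing every PROV entity. The base URL, the route, and the “query” parameter name are all assumptions (the parameter name follows the usual SPARQL endpoint convention), so substitute whatever the site you are talking to actually uses.

```ruby
require "uri"

# Assumed values: replace with the real address and route of the server.
BASE_URL = "http://localhost:3000"
QUERY_PATH = "/repository/sparql" # hypothetical route name

# List every resource typed as a prov:Entity.
SPARQL = <<~QUERY
  PREFIX prov: <http://www.w3.org/ns/prov#>
  SELECT ?entity WHERE { ?entity a prov:Entity }
QUERY

# Build a GET URL carrying the query as a form-encoded parameter.
def query_uri(base, path, sparql)
  uri = URI.join(base, path)
  uri.query = URI.encode_www_form(query: sparql)
  uri
end

uri = query_uri(BASE_URL, QUERY_PATH, SPARQL)
puts uri

# To actually send the request:
#   require "net/http"
#   puts Net::HTTP.get(uri)
```

Because of the permissive CORS policy, the same URL could be fetched from JavaScript running on another domain.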

A few other features that may come in handy have been implemented, though I’ll wait to detail most of them in a later post. One that may be useful to some people is the ability to dump the entire contents of the repository in Turtle RDF form. If you want to make a complete copy of the site’s repository, or save changes you’ve made in a serialized format for later, it’s as easy as calling the repository/dump route. The dataset DSL will also automatically download and handle remote files specified by their URLs using the ‘object’ keyword, which makes loading external datasets extremely simple.

There’s a fair bit more to do to make this a fully featured web service: some elements of the PROV vocabulary are not fully represented, I’d really like to separate out fundamental parts into a more lightweight and deployable Sinatra server, and raw (non-RDF) datasets need better handling. Additionally, while you can easily switch between an in-memory repository for experimentation and more dedicated software such as 4store for real work, it’d be nice to make the two work together, so you could have an in-memory workspace, then save your changes to the more permanent store when you were ready. Aside from the performance gain due to not having to wait for queries on a large repository, this helps with deleting resources or clearing your workspace, as the methods of deleting data from triple stores are somewhat inconsistent and underdeveloped across different software.

Some of the cooler but less important features will have to wait until after GSoC, but if this is a tool you might use and there’s a particular feature you think would be important to have included in the basic version by the end of the summer, please get in touch!

Goals: Reproducible Science

This project has a number of goals, including improving support for large datasets in the bioinformatics community, furthering the development of semantic web technologies, and supporting data sharing and reproducible science. Today I’m going to go into a little more detail about the first of these and talk about the work I’ve done so far on it.

To get up to speed on all of the interrelated Semantic Web standards and technologies, I have been working on a tool for converting objects from the R statistical computing language into RDF triples, the native format of the Semantic Web. Although this will be a valuable tool on its own, it is also being developed to support the next version of R/qtl, a library developed by Karl Broman and Hao Wu, as well as a host of other contributors, which offers functions for doing Quantitative Trait Loci mapping using R. The next incarnation of R/qtl will focus on support for highly parallelized computation, a key component of which will be storing results in a database that can be queried and manipulated remotely, as opposed to keeping huge data sets in memory on one computer.

An overview of the plans for R/qtl

Data Sharing and Reproducibility

The other advantage of storing statistical data in triple based format is that it can be easily, even automatically, published for others to download, inspect, and interact with. In RDF, every property of a resource is defined by its relation to other resources or objects, and each relation comes with an attached definition that either a machine or a human can access to get further details about it. This allows for a huge amount of flexibility in data types and storage schema, as well as the application of algorithmic reasoning techniques to simplify a data set or find out more about its implications.

Publishing scientific data in a machine readable format also makes it dramatically more useful for scientists hoping to replicate the results or build upon them. Most publications will, at best, include supporting data as a table or tables of statistical aggregates, and even when lower level raw data are available they are usually stored in a flat format such as CSV, which includes little to no semantic content such as the units or attributes of objects, or their meaning in the context of the rest of the data. While some fields have begun using relational database technology more extensively, the fact that the most popular data storage formats for many researchers are essentially text files speaks to the rigidity and extra complications of using dedicated database systems.

RDF has a somewhat mind-bending structure of its own to understand, and it’s certainly no silver bullet for the problem of generalized data storage, but its flexibility allows it to overcome much of the ossification and user-unfriendliness of trying to use relational databases for storing and publishing scientific results. A primary goal of my work this summer will be to extend existing Ruby tools and build new ones to help make Semantic data storage accessible and simple for researchers, supporting the crucial work of examining, validating, and extending published results.


The Semantic Web is built on three core technologies: the RDF data model, the SPARQL query language, and the OWL web ontology language. There is a large body of documentation about all three of these tools, which can be found around the web or in the W3C’s standards documentation. I may write some posts going into more detail, but for now, a brief overview:

RDF essentially comes down to describing data using the ‘Subject Predicate Object’ format, creating statements known as ‘Triples’. An example would be “John(subject) knows(predicate) Mary(object)”. With a few exceptions, each component of this triple is given a URI, which looks a lot like a regular URL and can be used to uniquely identify a resource (such as ‘John’ or ‘Mary’), or a relation (such as ‘knows’), as well as providing a link where more information about the object or relation may be found. You can think of subjects and objects as nodes, and predicates as lines connecting them. Although this doesn’t mean much for one statement such as “John knows Mary”, a collection of similar statements defines a big directed graph of interconnected nodes, where every connection is labeled with its meaning.
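Here is a tiny sketch of the “John knows Mary” idea in plain Ruby, without any RDF library. The example.org URIs for the people are made up; the predicate is the real ‘knows’ property from the FOAF vocabulary. Each triple is printed as one N-Triples line, and the set of nodes falls out of the subjects and objects.

```ruby
# A triple is just three parts: subject, predicate, object.
Triple = Struct.new(:subject, :predicate, :object) do
  # Serialize as one N-Triples line: <s> <p> <o> .
  def to_ntriples
    "<#{subject}> <#{predicate}> <#{object}> ."
  end
end

FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Made-up resource URIs for illustration.
john  = "http://example.org/John"
mary  = "http://example.org/Mary"
alice = "http://example.org/Alice"

graph = [
  Triple.new(john, FOAF_KNOWS, mary),
  Triple.new(mary, FOAF_KNOWS, alice),
]

graph.each { |triple| puts triple.to_ntriples }

# Subjects and objects are the nodes; each triple is a labeled, directed edge.
nodes = graph.flat_map { |triple| [triple.subject, triple.object] }.uniq
puts "#{nodes.length} nodes, #{graph.length} edges" # prints "3 nodes, 2 edges"
```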

SPARQL is the official query language of the Semantic Web. It can be used to select elements of an RDF graph based on their subject, predicate, or object. Its syntax resembles, superficially at least, the SQL language familiar to many database users, but in practice it functions quite differently. However, many of the more advanced operations of SQL, such as aggregation and subqueries, are still supported.
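To give a feel for how selecting by subject, predicate, or object works, here is a toy matcher in plain Ruby where nil stands in for a SPARQL variable. It handles only a single triple pattern, whereas a real SPARQL engine joins many patterns on their shared variables, so treat it as an analogy and nothing more. The data and names are illustrative.

```ruby
# Triples as plain [subject, predicate, object] arrays.
TRIPLES = [
  ["John", "knows", "Mary"],
  ["Mary", "knows", "Alice"],
  ["John", "age", 32],
].freeze

# Return the triples matching a pattern; a nil slot matches anything,
# playing the role of a variable such as ?o in SPARQL.
def match(triples, subject: nil, predicate: nil, object: nil)
  pattern = [subject, predicate, object]
  triples.select do |triple|
    pattern.zip(triple).all? { |want, have| want.nil? || want == have }
  end
end

# Roughly: SELECT ?o WHERE { :John :knows ?o }
known_by_john = match(TRIPLES, subject: "John", predicate: "knows").map(&:last)
p known_by_john # prints ["Mary"]

# Roughly: SELECT ?s ?o WHERE { ?s :knows ?o }
p match(TRIPLES, predicate: "knows").length # prints 2
```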

OWL is a language for describing ontologies, which are used to formally represent the concepts your data represent in an RDF store, allowing simpler machine interpretation. The technology is crucial to the interconnectedness of the Semantic Web, as it is the means by which the relations between disparate resources can be automatically discovered, allowing relatively easy integration of new or existing data.

There is, of course, much more to each of these technologies, and to the technicalities, use cases, and extensions of each of them, than would fit in one section of a blog post, but hopefully this gives a broad overview of how the project will work.

A Starting Place

To begin with, I have developed a Ruby-based tool to automatically convert data frame objects in R to RDF, using the Data Cube vocabulary, which was developed as a generalizable way of representing multidimensional data. It can be run easily on any Ruby-capable machine, but since the script and all of its dependencies are pure Ruby libraries, I was also able to deploy it as an executable jar with Warbler, so anyone with Java installed can use it without having to download any dependencies. I’ll go into more detail about how this works, why it’s useful, and where this part of the project is headed, but I’ve decided to break it into a separate blog post.
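In the meantime, here is a very rough sketch of the general idea, not the tool’s actual output. It takes a data frame flattened into row hashes and emits one qb:Observation per row in Turtle. The column names, the ex: namespace, and the dataset name are all made up, only numeric values are handled, and a complete Data Cube would also need a data structure definition declaring which columns are dimensions and which are measures.

```ruby
# Hypothetical input: a data frame flattened to an array of row hashes.
ROWS = [
  { "chr" => 1, "pos" => 10.5, "lod" => 2.3 },
  { "chr" => 1, "pos" => 20.0, "lod" => 3.1 },
].freeze

PREFIXES = <<~TTL
  @prefix qb: <http://purl.org/linked-data/cube#> .
  @prefix ex: <http://example.org/> .
TTL

# Emit one qb:Observation per row, with one property per column.
# Values are written as bare Turtle numeric literals, so this sketch
# assumes every column is numeric.
def to_data_cube(rows, dataset_name)
  observations = rows.each_with_index.map do |row, index|
    properties = row.map { |column, value| "ex:#{column} #{value}" }
    lines = [
      "ex:obs#{index + 1} a qb:Observation",
      "qb:dataSet ex:#{dataset_name}",
      *properties,
    ]
    lines.join(" ;\n    ") + " ."
  end

  ([PREFIXES, "ex:#{dataset_name} a qb:DataSet ."] + observations).join("\n")
end

turtle = to_data_cube(ROWS, "scan")
puts turtle
```

Each row becomes a self-describing resource linked back to its dataset, which is what makes the result queryable with SPARQL once it is loaded into a repository.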