PubliSci as a service

The provenance DSL and DataSet generation I’ve been working on have most of their basic functionality in place, but I also planned on creating a web-based API for accessing and utilizing the gem and building services on top of it. I’ve created a demo site as a prototype for this feature using Ruby on Rails, and I’m happy enough with it that I’d like to make the address public so people can poke around and give me feedback. Although eventually I’ll be separating some of this functionality into a lighter-weight server, Rails has helped immensely in developing it, both because it naturally encourages good RESTful design and because the Ruby community has created many useful gems and tools for rapid prototyping of websites using the framework. The demo site is up at http://50.116.40.22:3000, or you can take a look at the source on Github.

REST in a nutshell: think of putting the nouns in the URL and the verbs in the request type (source)

The server acts as an interface to most of the basic functions of the gem: the DSL, the dataset RDFization classes, and the triplestore convenience methods. This functionality is accessible either through an HTML interface (with a pleasant Bootstrap theme!) or programmatically as a (mostly) RESTful web service using JavaScript and JSON.
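If you’d rather hit the API from code than from a browser, the standard Rails JSON routes should work with nothing more than the Ruby standard library. The sketch below is an assumption based on the usual Rails conventions (an /entities.json index route and a "subject" field); check the site’s routes for the real interface.

require 'net/http'
require 'json'
require 'uri'

# Fetch the list of entities from the demo site (route name is an assumption)
entities = JSON.parse(Net::HTTP.get(URI('http://50.116.40.22:3000/entities.json')))
entities.each { |entity| puts entity['subject'] } # field name is an assumption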

I’m planning to write a tutorial on how to create a publication with it, but for this post I’ll just give a broad overview of how you can use the service. The example data on the site now is based on the PROV primer, with a couple of other elements added to test different features, so it may seem a bit contrived, but it should give you some idea of how you could use the site’s various features.

The root page of the site will show you the DSL script that was used to initialize the site, with syntax highlighting thanks to the lovely Coderay gem.

[Screenshot: the DSL script shown with syntax highlighting]

You can also edit the DSL script, which will regenerate the underlying data and set up a new repository object for you. As a warning up front, the DSL is currently based on instance_eval, which introduces a big security risk if not handled properly. I’m working on automatically sandboxing the evaluation in a future version, but for now, if you’re worried about security, you can easily change a line of the initializer to disable remote users from updating the DSL.
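To see why unsandboxed instance_eval is dangerous, consider a toy DSL context like the one below; this is an illustration of the general risk, not code from the gem. Any string handed to instance_eval runs with the full privileges of the server process.

# A hypothetical, minimal DSL context -- not PubliSci's actual DSL
class DSLContext
  def entity(name)
    puts "defining entity #{name}"
  end
end

good_script = 'entity "dataset1"'
evil_script = 'File.delete(*Dir["*"])' # arbitrary Ruby, e.g. deleting files

DSLContext.new.instance_eval(good_script)  # does what you expect
# DSLContext.new.instance_eval(evil_script) # would run the attacker's code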

Along the top, you’ll see links for Entities, Activities, and Agents, which are elements of the PROV ontology, as well as Datasets, which represent any Data Cube formatted data stored in the repository. Each of these elements acts as a RESTful resource, which can be created, read, updated, and deleted in much the same way as a standard ActiveRecord model. Let’s take a look at the Entities page to see how this works.

[Screenshot: the Entities index page]

On the Entities page, you can see a table where each row represents an entity. PROV-relevant properties and relationships are also displayed and hyperlinked, allowing you to browse through the information using the familiar web idiom of linked pages. All of this is done using SPARQL queries behind the scenes, but the user doesn’t (immediately) need to know about this to use the service.
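For the curious, the queries involved are nothing exotic. Something along the lines of the hedged sketch below (using the PROV namespace and the sparql gem) would pull back every prov:Entity and its properties; the gem’s actual queries may differ.

require 'rdf'
require 'sparql'

entities_query = <<-SPARQL
  PREFIX prov: <http://www.w3.org/ns/prov#>
  SELECT ?entity ?property ?value
  WHERE { ?entity a prov:Entity ; ?property ?value . }
SPARQL

# Run the query against any RDF.rb repository (an empty one here, for illustration)
repository = RDF::Repository.new
SPARQL.execute(entities_query, repository).each { |solution| puts solution.inspect }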

Clicking the “subject” field for each Entity will take you to a page with more details, as well as a link to the corresponding DataSet if it exists.

[Screenshot: the show page for a single Entity]

From that page, you can export the DataSet using the writers in the bio-publisci gem. At the moment, the demo site can export CSV or Weka ARFF data, but I’ve been working recently on streamlining the writer classes, and I’ll be adding writers for R and some of the common SciRuby tools before the end of the summer.

You can also edit Entities, Agents, and Activities, or create new ones in case you want to correct a mistake or add information to the graph. This is a tiny bit wonky in some spots, both because of the impedance mismatch between RDF and the standard tabular backend Rails applications generally use, and because I’m by no means a Rails expert, but you can edit most of the fields, and the creation of new resources generally works fine.

[Screenshot: the edit form for an Entity]

I think being able to browse the provenance graph of a dataset or piece of research using an intuitive browser-based interface forms a useful bridge between the constraints and simplicity of a standard CRUD-ish website and the powerful but daunting complexity of RDF and SPARQL. If you find parts of this model too constraining, you can also query the repository directly using SPARQL:

[Screenshot: the SPARQL query page]

Two things to note about this: first, it provides a single interface for running queries on any of the repositories supported by the ruby-rdf project, even some without a built-in SPARQL endpoint. Second, since I’ve set up the Rails server with a permissive CORS policy, you can run queries or access resources across domains, allowing you to easily integrate it into an AJAX application or just about anywhere else. For an example, have a look at this jsfiddle that creates a bar chart in d3 from one of the datasets on the demo site.
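As a sketch of what scripted access might look like from Ruby rather than JavaScript: the route and parameter name below are assumptions, so check the Rails app’s routes before relying on them.

require 'net/http'
require 'uri'

sparql = <<-SPARQL
  SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10
SPARQL

# Hypothetical query route accepting a 'query' parameter
response = Net::HTTP.post_form(URI('http://50.116.40.22:3000/query'), 'query' => sparql)
puts response.body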

A few other features have been implemented that I’ll wait to detail in a later post. The dataset DSL will automatically download and handle remote files specified by their URLs using the ‘object’ keyword, which makes loading external datasets extremely simple. One feature that may be useful to some people right away is the ability to dump the entire contents of the repository in Turtle RDF form; if you wanted to make a complete copy of the site’s repository, or save changes you’d made in a serialized format for later, it’s as easy as calling the repository/dump route.
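A minimal sketch of grabbing such a dump from Ruby, assuming the route simply returns the serialized Turtle in the response body:

require 'net/http'
require 'uri'

# Save a snapshot of the site's repository for later reuse
turtle = Net::HTTP.get(URI('http://50.116.40.22:3000/repository/dump'))
File.write('repository_backup.ttl', turtle)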

There’s a fair bit more to do to make this a fully featured web service: some elements of the PROV vocabulary are not fully represented, I’d really like to separate out the fundamental parts into a more lightweight and deployable Sinatra server, and raw (non-RDF) datasets need better handling. Additionally, while you can easily switch between an in-memory repository for experimentation and more dedicated software such as 4store for real work, it’d be nice to make the two work together, so you could have an in-memory workspace, then save your changes to the more permanent store when you were ready. Aside from the performance gain of not having to wait for queries on a large repository, this helps with deleting resources or clearing your workspace, as the methods of deleting data from triple stores are somewhat inconsistent and underdeveloped across different software.

Some of the cooler but less important features will have to wait until after GSoC, but if this is a tool you might use and there’s a particular feature you think would be important to have in the basic version by the end of the summer, please get in touch!


Data Frames and Data Cubes

One of the most commonly used abstract structures in R is the dataframe. Technically, a dataframe is a list containing vectors of equal lengths, but it is generally used to represent tables of data, where each row represents an observation and each column a variable or attribute of the observation. Dataframes are commonly used to store the structured results of some computation in R, such as the chromosome, position, and LOD score for a QTL analysis. It is less flexible than a general R class, but more complex than a simple list or primitive value, so I decided to use it as my test case for the R to RDF process.

Tools

Although there’s nothing approaching the range of Semantic Web libraries available for popular languages like Java, the Ruby community has developed a fair number of libraries for interacting with RDF formatted data, which will allow me to focus more development time on building useful tools than on the background infrastructure to support them. Chief among these is the Ruby RDF project, which hosts a number of interrelated libraries built on a basic core for interacting with Semantic data. Ruby RDF features a Ruby object representation of triples and graphs, methods for interacting with them, and a number of plugins for reading and writing data using different storage systems and RDF serialization standards. It takes the approach of focusing on a limited set of functions, implementing them efficiently and robustly, and providing an extensible architecture to allow new functionality to be added. Although it started with built-in support for only one of the simplest RDF formats, ntriples, functional and well-maintained extensions or plugins are now available for most popular triple stores, serialization formats, and querying methods.
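To give a flavor of the RDF.rb style this project builds on, here is a minimal sketch of constructing a couple of statements and serializing them; the URIs are illustrative, not anything the gem actually emits.

require 'rdf'

graph   = RDF::Graph.new
subject = RDF::URI('http://example.org/dataset/mregress') # illustrative URI

graph << RDF::Statement.new(subject, RDF.type, RDF::URI('http://purl.org/linked-data/cube#DataSet'))
graph << RDF::Statement.new(subject, RDF::RDFS.label, 'mregress')

puts graph.dump(:ntriples) # serialize using the built-in ntriples writer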

Although anyone can invent their own vocabulary and write perfectly valid RDF using it, the focus of the Semantic Web is on interoperability via re-use of existing standards. Converting tabular data to RDF has historically been somewhat difficult, as the two formats are not naturally very compatible. Fortunately, recent work in unifying a number of existing data description vocabularies has yielded a reasonably general specification for multidimensional data: the Data Cube Vocabulary. It is a rich and descriptive system, as you might guess from this chart of its components:

[Diagram: components of the Data Cube Vocabulary]

However, it can still describe simple datasets without requiring every component above, so it is flexible as well as powerful. At its core, the Data Cube vocabulary is essentially just a long list of observations and the dimensions, units, measurements, and attributes that define them. It has been used in a number of real-world applications, particularly in the publication of large data sets by government agencies, such as the financial data published by the UK government as part of its transparency initiative, among a wide variety of others.

Once data is in RDF form, you need somewhere to store it. Although you could theoretically use a standard relational database (and many do offer direct support for triple formatted data), or even just store everything as a bunch of serialized text files, in practice RDF data is generally stored using a type of database called a triplestore. These are optimized for storing data in Subject-Predicate-Object and related forms, so they are generally more efficient and easier to integrate with other triple based data sets or querying protocols. Although there are differences between types of triple stores, ranging from subtle to extreme, Ruby RDF abstracts many of these away with a Repository object that acts as an interface for loading data into a store and performing queries on it. To begin with, we are using 4store for our triplestore, as it’s free, supports the latest versions of the RDF and SPARQL standards, is a native triple store (as opposed to an extension for another type of DB), and can run as a cluster for operating on large amounts of data. A plugin is already available to use 4store as a repository in Ruby RDF. It isn’t fully finished, so implementing the remaining functions may become a part of my work this summer if we continue to use it, but it works well enough to load a set of triples into the store, which is all I need at this stage.
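As a hedged sketch of what the Repository abstraction buys you: the same query code works whether the repository is an in-memory one loaded from a file or a 4store-backed one provided by the plugin (the file name here is just a placeholder).

require 'rdf'
require 'rdf/turtle' # parser plugin needed for .ttl files

repository = RDF::Repository.load('mregress.ttl') # placeholder file name

# Enumerate every rdf:type statement in the store
repository.query([nil, RDF.type, nil]) do |statement|
  puts "#{statement.subject} is a #{statement.object}"
end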

Finally, I’ll be using the Rserve tool and the accompanying rserve-client Ruby gem to manage the interaction between R and Ruby. While there are a few options for doing this, Rserve seemed to offer the best compromise between performance and deployability, making up for its slight disadvantage in speed compared to some libraries (e.g. RSRuby) by being written in pure Ruby, and thus more easily deployable in different environments.

Method

The overall process for converting from an R dataframe to a rudimentary Data Cube object is actually fairly straightforward. When using the standalone program, you just provide the name of the R variable to dump, and possibly the location of your R workspace and the port your 4store instance is running on if these differ from the defaults (loading the Ruby class in your code offers more flexibility). The script first creates a hash from the data, which is somewhat counter-intuitively represented as a monkey-patched Array object with "name" attributes instead of just a regular hash. This is a reasonably simple operation for a basic dataframe, which, as mentioned before, is structurally equivalent to a table with named rows, but the process may become more complex when I move on to supporting a wider set of R objects.
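A hedged sketch of that first step using rserve-client; the names accessor on the converted object follows the behaviour described above and should be treated as an assumption rather than documented API.

require 'rserve'

connection = Rserve::Connection.new       # talks to a running Rserve instance
raw = connection.eval('mregress').to_ruby # array-like object holding the column vectors

columns = Hash[raw.names.zip(raw)]        # e.g. { "chr" => [...], "pos" => [...], "lod" => [...] }
connection.close
puts columns.keys.inspect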

At first I attempted to create an ad-hoc placeholder vocabulary and create triples for each entry and relation in the dataframe using Ruby RDF’s included statement builder, thinking I would test the 4store interface and implement a more complete vocabulary later. I quickly realized that working with plain triples is too cumbersome an approach to be truly viable, and it was producing unnecessarily large graphs for the amount of information I was attempting to store. I decided to move straight to implementing the Data Cube vocabulary, using an RDF description language known as N3, or more specifically a subset of it known as Turtle. N3 is actually powerful enough to express statements which do not translate to valid RDF, while Turtle is limited to just the subset of N3 that corresponds to proper RDF. In addition to being more powerful than the basic language of triples, Turtle is also one of the most human readable formats for working with RDF, which can otherwise wind up looking like a big morass of unformatted XML. The downside is that Ruby RDF doesn’t yet have full support for writing blocks of Turtle statements, so currently much of the code relies on string building.
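A rough sketch of what that string building looks like for a single observation; the prefixes, URIs, and values are illustrative placeholders, not the gem’s actual output (see the gist linked below for a real example).

PREFIXES = <<-TTL
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix :   <http://example.org/mregress/> .
TTL

# Build one qb:Observation as a Turtle fragment
def observation(id, chr, pos, lod)
  <<-TTL
:obs#{id} a qb:Observation ;
    qb:dataSet :dataset-mregress ;
    :chr "#{chr}" ;
    :pos #{pos} ;
    :lod #{lod} .
  TTL
end

puts PREFIXES + observation(1, 'placeholder_chr', 0.0, 0.0) # placeholder values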

I decided on a few simple mappings for parts of the dataframe to get things started. They may not be perfectly accurate or the proper representation of the data, but I wanted to make sure I was at least generating a valid Data Cube before adding the fine details. Each row of the dataframe becomes a Data Cube dimension, each column a measure, and the dataset as well as its supporting component objects are named using the name of the variable being converted. The exact way in which these elements are linked can seem a little arcane at first, but as I mentioned, it really just comes down to splitting the data into observations and attaching dimensions or attributes to describe them. If you’d like an example of how it all winds up looking, see this simple file describing just one datapoint: https://gist.github.com/wstrinz/cd20661a8d1fda123e2b

Once it’s in this format, 4store and most other triple stores can load the data easily, and if the store provides a built-in SPARQL endpoint, it can already be queried using the standard patterns. If new data using the same definition is added, it can be automatically integrated; if someone else wants to extend or provide more details for any element, they can do so without redefining their schema; and anyone looking for a definition of how the data is structured can find it easily.
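For instance, once the triples are loaded you can hit 4store’s SPARQL endpoint directly. Here is a sketch using the sparql-client gem, assuming the endpoint lives under /sparql/ on the port the HTTP server was started with:

require 'sparql/client'

client = SPARQL::Client.new('http://localhost:8080/sparql/') # endpoint path is an assumption

results = client.query(<<-SPARQL)
  PREFIX qb: <http://purl.org/linked-data/cube#>
  SELECT ?obs WHERE { ?obs a qb:Observation } LIMIT 5
SPARQL

results.each { |solution| puts solution[:obs] }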

How To

I will go through a quick example, from loading a data set in R all the way to viewing it in a Data Cube visualizer, as a demonstration of what I’ve implemented so far. Although the native Ruby code allows for more flexibility, I will be using the standalone Java executable for this example. This assumes you’ve installed R/qtl as well as Rserve; if not, both are available from the R package management system. You’ll also need 4store installed and running its HTTP SPARQL server in unsafe mode, for which you can find instructions here. In this example, I’ll be running mine on the default port 8080.

First, let’s start R in our home directory and do some simple analysis with a sample dataset provided by R/qtl. Once we’ve loaded the results into a variable, we’ll close the R session and save our workspace.

$ R
> library(qtl)
> data(listeria)
> mregress = scanone(listeria, method="mr")
> quit()
Save workspace image? [y/n/c]: y

Next, we want to run our tool to dump the objects into 4store. To do this, you simply run the executable jar with the variable to dump as the first argument, and optionally the directory the R workspace was saved in (if it’s not the same directory you’re running the jar from).

$ java -jar QTL2RDF.jar mregress ~

You can find a copy of the jar here, but since this project is under active development that will likely go out of date soon, so you’ll want to check for the latest build on the project’s Github page.

That’s all you really need to do! Your data are now in your local 4store, ready to be extracted or queried as you see fit. However, the jar also offers the option of returning its output as a Turtle formatted string, so if you want to use it with some other program, you can pass a fourth argument (for now it can be anything) when you run the jar, then pipe the output to a file.

$ java -jar QTL2RDF.jar mregress ~ 8080 something > mregress.ttl

This file should be importable into any triple store, and usable with other tools that recognize the Data Cube format. One such tool is Cubeviz, which allows you to browse and present statistical information encoded as Data Cube RDF. Cubeviz uses a different triple store called Virtuoso, so it can be a bit time consuming to set up, but once you’ve done so, importing is as easy as pointing it at your data, which you can either upload as a file, provide a link to, or simply paste in (if your dataset is small enough).

[Screenshot: pasting the data set into Cubeviz]

You can then decide how you’d like to visualize your data. In this case, we’re probably interested in the LOD, so I set measurement to “lod” and selected a few loci in the dimension configuration.

[Screenshot: Cubeviz graph of LOD scores for the selected loci]

You could select each locus to view the whole data set, but Cubeviz is still in development, and consequently has some issues, such as failing to scale down the graph for a wide range of values.

So far, this is just the bare minimum required to properly define the object in the Data Cube vocabulary. Some of the mappings may not be quite correct, and there’s a lot that could be done to better define the semantics of the data that’s being converted. For example, the ‘chr’ and ‘pos’ measures might be better described as attributes rather than measures in their own right. There are no labels or units on the graph, although Cubeviz will create them if your data is formatted correctly. Some of these features will be included as I improve the conversion program, but there will still be some important semantic information that cannot be easily extracted from the R dataframe. Developing ways for users to specify how this should be handled in a friendly and intuitive way is an important facet of this summer’s project, but even at its minimum functionality, the tool I’ve developed can already convert and store a popular data structure in a useful and flexible format.

Next Steps

I plan to finish mapping simple R types over the next few days, then write some functions to break general R objects into their component parts and serialize those. By the official start of coding, I should have the groundwork laid to start applying these methods to Ruby objects and developing more complete ontology handling.

Again, please check out the Github repository (https://github.com/wstrinz/R2RDF) if you’d like to see some more details. The code isn’t particularly efficient or well documented yet, but I’d appreciate any feedback anyone has. If you have questions, please leave a comment here or get in touch with me some other way.