Visualizations and Validations

Earlier this week I met with Karl Broman, a biostatistician at UW-Madison who created the r/qtl library I’ve been working with over the last few weeks, to discuss another project of his that could benefit from some Semantic Web backing. Karl has created a number of interesting visualizations of bioinformatics data. His graphs use the D3 JavaScript framework to display high-dimensional data in an interactive and intuitive way. That focus on multi-dimensional data, along with the fact that most of his datasets exist as R objects, fits naturally with the Data Cube generators I’ve already created.

A screenshot of the cis/trans plot. Make sure you check out the real, interactive version.

To begin with, I will be converting Karl’s cis/trans eQTL plot to pull its data dynamically from a triple store. Currently, two scripts process an r/qtl cross and a few supporting dataframes into a set of static JSON files, which are then loaded into the graph. With the underlying data in a triple store, however, the values the visualization needs can be fetched on demand, following the structure of the original R objects. As of now I have successfully converted each of the necessary datasets to RDF, and am working on generating queries that Karl’s D3 code can use to access it through a 4store SPARQL endpoint (which supports JSON output).
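As a rough sketch of the kind of query involved, here is how a client might pull observation rows out of the endpoint using the sparql-client gem. The endpoint URL and the shape of the SELECT are placeholders; the real queries will follow the dimension and measure URIs defined by each dataset’s structure definition.

    require 'sparql/client'

    # Hypothetical endpoint URL -- 4store serves SPARQL over HTTP and can
    # return results as application/sparql-results+json for D3 to consume.
    client = SPARQL::Client.new("http://localhost:8080/sparql/")

    # Generic query: list each observation with its component properties.
    # A real query would name the dataset's specific dimension and measure
    # URIs rather than matching any ?component.
    query = <<-SPARQL
      PREFIX qb: <http://purl.org/linked-data/cube#>
      SELECT ?obs ?component ?value WHERE {
        ?obs a qb:Observation ;
             ?component ?value .
      } LIMIT 10
    SPARQL

    client.query(query).each do |solution|
      puts [solution[:obs], solution[:component], solution[:value]].join("\t")
    end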

The objects involved are quite large, and the Data Cube vocabulary (really RDF in general) is fairly verbose in its representation of information, so I am working on loading what I have into the right databases and reducing redundancy in the output. However, if you’d like some idea of how the data are being represented and accessed, I’ve set up a demo on Dydra with a subset of the data and some example queries.
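To give a feel for that verbosity, here is a minimal sketch of a single observation built with RDF.rb. The dataset namespace and the dimension and measure names (chromosome, position, LOD score) are made up for illustration; the real URIs come from the structure of the R objects.

    require 'rdf'
    require 'rdf/turtle'

    QB = RDF::Vocabulary.new("http://purl.org/linked-data/cube#")
    # Hypothetical namespace for the generated dataset.
    EX = RDF::Vocabulary.new("http://example.org/dataset/")

    graph = RDF::Graph.new
    obs   = EX["obs1"]

    # A single observation: every dimension and measure becomes its own
    # triple, which is why even small dataframes expand considerably.
    graph << [obs, RDF.type,      QB.Observation]
    graph << [obs, QB.dataSet,    EX["dataset"]]
    graph << [obs, EX.chromosome, RDF::Literal.new("1")]
    graph << [obs, EX.position,   RDF::Literal.new(10.5)]
    graph << [obs, EX.lodScore,   RDF::Literal.new(3.2)]

    puts graph.dump(:ttl, prefixes: { qb: QB.to_uri, ex: EX.to_uri })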

Testing and Validation

In addition to working with Karl, I’ve taken time to refactor my code toward creating Data Cube RDF for more general structures. Originally the main module worked directly off of an Rserve object, but I’ve redone everything to use plain Ruby objects, which the generator classes are responsible for creating. To support this refactoring, and the creation of new generators for data types such as CSV files, I’ve begun using RSpec to build a spec for the project.

I’ve added tests against reference output and for syntactic correctness, but these are respectively too brittle and too permissive to ensure that novel datasets will generate valid output. To that end, I have implemented a number of the official Data Cube Integrity Constraints as part of the spec. The ICs are a set of SPARQL queries that can be run against your RDF output to check that various high-level features are present; they go beyond simple syntax validity to ensure you have properly encoded your data. I’ve had to make a few modifications, since the ICs are slightly out of date and some of the SPARQL 1.1 facilities they use aren’t fully supported by the RDF.rb SPARQL gem. Beyond their place in the test suite, the ICs could also be useful in the main code, giving end users a way to ensure their data is compatible with Data Cube tools like CubeViz.
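As a rough illustration, here is a simplified rendering of IC-1 (“every qb:Observation has exactly one qb:dataSet”) wired into an RSpec example. The official constraint is an ASK query that returns true when violated; it is recast here as a SELECT so that failures name the offending observations, and written with OPTIONAL/!bound rather than SPARQL 1.1’s NOT EXISTS. The output file name is hypothetical.

    require 'rdf'
    require 'rdf/turtle'
    require 'sparql'

    # Simplified take on Integrity Constraint IC-1: find observations
    # with no qb:dataSet, or with more than one.
    IC1 = <<-SPARQL
      PREFIX qb: <http://purl.org/linked-data/cube#>
      SELECT ?obs WHERE {
        { ?obs a qb:Observation .
          OPTIONAL { ?obs qb:dataSet ?ds }
          FILTER (!bound(?ds)) }
        UNION
        { ?obs a qb:Observation ; qb:dataSet ?ds1, ?ds2 .
          FILTER (?ds1 != ?ds2) }
      }
    SPARQL

    describe 'generated Data Cube' do
      # Hypothetical output file; in the real spec the graph would come
      # straight from a generator.
      let(:graph) { RDF::Graph.load('output.ttl') }

      it 'satisfies IC-1: every observation has exactly one data set' do
        expect(SPARQL.execute(IC1, graph)).to be_empty
      end
    end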
