Sharp Scissors, Safety Scissors: What to do With Your PubliSci Dataset

If you’ve been following along with the last two blog posts, you should have a pretty good idea of how to turn most flat or tabular file formats into an RDF dataset using PubliSci’s Reader tools. You now have an unambiguous, self annotated dataset that is both easy for humans to read and can be queried in sexy, sexy SPARQL once loaded into a triple store. So what do you do with it?

In storing, serializing, or “writing down” data, we hope (beyond overcoming a poor memory) to be able to share what we’ve learned with others who might have questions, criticisms, or things they’d like to derive from the information within. Often these ‘others’ are other people, but more and more frequently they are machines and algorithms, especially in fields such as biology, which is struggling to keep up with the growing heaps of data it generates. SPARQL, RDF, and other Semantic Web components are designed to make describing knowledge, and posing questions to it, accessible to both types of actors, through a flexible data model, ontological structures, and a host of inter-related software and standards.

Along with a web-friendly scripting language such as Ruby, you can easily build domain specific applications using the Semantic Web’s tools. To provide an example, I’ve created a demonstration server, which you can find at mafdemo.strinz.me, based on a breast cancer dataset collected by Washington University’s Genome Institute and stored in the TCGA database.

There are two ways to use the demo server; one public, the other private. The public side offers a way to load MAF files into the database, a simple HTML interface with some parts of the data highlighted and linked for you to browse through, and a page for querying the repository using SPARQL.

The private side, protected by a password for now, offers a much more flexible way to interact with the dataset, essentially by letting you write Ruby scripts to run a set of templated queries, create your own, and perform operations such as sorting or statistical tests on the output. However, as James Edward Gray says, Ruby trusts us with the sharp scissors, so if you were to host such an interface on your own machine, you’d want to make sure you don’t give the password to anyone you don’t trust with the sharp scissors, unless you’re running it in a virtual machine or have taken other precautions.

I’ll go over both of these interfaces in turn, starting with the public side.

The Safety Scissors

There’s still a lot you can find out about the dataset from the public side. It’s not much to look at, but you can browse through linked data for the patients and genes represented in the MAF file. Because of the Semantic Web practice of using dereferenceable URIs, a lot of the raw data is directly linked to more information about it. Most of the information being presented comes from direct SPARQL queries to the MAF dataset, constructed and executed using the ruby-rdf library.

With some further development, a very flexible tool for slicing and analyzing one or multiple TCGA datasets could be built on this backend. As of now most responses are returned as streaming text, which prevents queries and remote service calls from causing timeouts, but makes building a pretty interface more difficult. This could be resolved by splitting it into JavaScript output and a better looking web interface (such as the one for the PROV demo I created). On top of that, the inclusion of gene sizes is just a small example of the vast amount of information available from external databases; this is, after all, the state of affairs that has led bioinformaticians to adopt the Semantic Web.

However, the remaining time in GSOC doesn’t afford me the scope to build up many of these services in a way that makes full use of the information available and the flexible methods of accessing it. To address this, I’ve created a more direct interface to the underlying classes and queries, which can be accessed using Ruby scripts. It’s protected by a password on the demo site, so if you want to try any of these examples yourself you should grab a clone of the GitHub repository.

Sharp Scissors

The Scissors Cat, by hibbary

In its base form, the scripting interface is not really safe to share with anyone you don’t already trust. It’s not quite as insecure as sharing a computer, since it only returns simple strings, but theoretically a motivated person could completely hijack and rewrite the server from this interface; such is the price for the power of Ruby. However, with some sandboxing and a non-instance_eval based implementation the situation could be improved, or this could form the basis of a proper DSL such as Cucumber’s Gherkin, which has a well defined grammar built with Treetop, allowing for much safer evaluation of arbitrary inputs.

The select Method

The script interface sets you up in an environment with access to the 4store instance holding the MAF data, and gives you a few helper methods to access it. Primary among these is the ‘select’ method, which can be used to retrieve specific information from the MAF file by patient ID, as well as a few other relevant pieces of information about the dataset, such as the number of patients represented in it.

For example, here’s the script you’d use to wrap a simple query, retrieving the genes with mutations for a given patient.

An example script

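The example script itself was an embedded gist; as a stand-in, here is a self-contained toy that mimics the shape of such a script. The `select` helper on the real server queries the triple store, and the patient IDs and rows here are made up for illustration.

```ruby
# Illustrative sketch only: a tiny in-memory stand-in for the server's
# select helper, which runs SPARQL against 4store in the real interface.
# Column names mirror MAF fields; rows and patient IDs are invented.
MAF_ROWS = [
  { hugo_symbol: "NUP107", patient_id: "TCGA-A1-A0SB", start_position: 69135678 },
  { hugo_symbol: "CASR",   patient_id: "TCGA-A1-A0SB", start_position: 122184668 },
  { hugo_symbol: "NUP107", patient_id: "TCGA-A2-A04P", start_position: 69140000 },
]

# Return the values of one column, optionally restricted by other columns.
def select(column, restrictions = {})
  MAF_ROWS.select { |row| restrictions.all? { |k, v| row[k] == v } }
          .map { |row| row[column] }
end

# Genes with mutations for a given patient
genes = select(:hugo_symbol, patient_id: "TCGA-A1-A0SB")
puts genes.inspect  # => ["NUP107", "CASR"]
```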

You can further refine results by specifying additional restrictions. Here, the first query selects all samples with a mutation in NUP107, and the second restricts its results to those starting at position 69135678.

You can also select multiple columns in one go, returning a hash with a key for each selection.

https://gist.github.com/wstrinz/28da6c67ab6e44d26340

Using these methods of accessing the underlying data, you can write more complex scripts to perform analysis. For example, here we look for samples with mutations in the gene CASR that are more than one base pair in length.

Inline SPARQL Queries

While it may be a blessing for Rubyists just getting into the Semantic Web, if you’re also familiar with SPARQL you probably know that most of the sorting and comparison you might want to do can be performed with it alone. The public side of the MAF server does expose a query endpoint, but if you want to tie a series of queries together in a script, or run the output through an external library, you can also easily run inline queries using the scripting interface.

This can be used to derive information about how to best access the dataset, which adheres to the general structure of the data cube vocabulary. For example, to see all of the columns you can select data from, you could run a script like
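The script that was embedded here didn’t survive, but a query over the Data Cube structure might look something like the sketch below. The `qb:` prefix is the Data Cube vocabulary; the `query` helper mentioned in the comment is the scripting interface’s, and the exact property names of the dataset are assumptions.

```ruby
# A sketch of an inline query listing the components (dimensions and
# measures) you can select data from, following the Data Cube vocabulary.
component_query = <<-SPARQL
  PREFIX qb: <http://purl.org/linked-data/cube#>
  SELECT DISTINCT ?component WHERE {
    ?dsd a qb:DataStructureDefinition ;
         qb:component ?c .
    ?c qb:dimension|qb:measure ?component .
  }
SPARQL

# In the scripting environment you would then run something like:
#   query(component_query).map { |solution| solution[:component] }
puts component_query
```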

And of course you can mix the two methods, pulling the results of a SPARQL query into a select call, or vice versa, as in this next example, where we create a list of all the genes in which patients with a mutation in SHANK1 also have mutations.

SPARQL Templates, RDF.rb Queries

A couple of other small features deserve a mention. First, I’ve included in the gem the ad-hoc templating system I’ve been using. It’s similar to the Handlebars templating system, whose placeholders are marked by double braces ( ‘ {{ ‘ and ‘ }} ‘ ), although here we’re working with SPARQL rather than HTML. This has a few different applications: you can reuse query templates in a script, and you can write a query early on that you will fill values into later.
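The idea can be sketched in a few lines; `fill_template` here is a toy stand-in, not the gem’s implementation, and the predicate URIs in the example template are made up.

```ruby
# Minimal sketch of double-brace templating: each {{name}} placeholder
# is replaced by the corresponding value from a hash.
def fill_template(template, values)
  template.gsub(/\{\{(\w+)\}\}/) { values.fetch(Regexp.last_match(1).to_sym) }
end

# A reusable query template; the gene is filled in later.
GENE_TEMPLATE = <<-SPARQL
  SELECT ?patient WHERE {
    ?obs <http://example.org/maf/hugo_symbol> "{{gene}}" ;
         <http://example.org/maf/patient_id> ?patient .
  }
SPARQL

puts fill_template(GENE_TEMPLATE, gene: "NUP107")
```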

Second, when you make a ‘select query’ call, the results are converted into plain Ruby objects for simpler interaction. Under the hood, however, these are retrieved using the RDF::Query class, which returns RDF::Solutions that can be interacted with in a more semantic-web aware manner. To get this kind of object as a result, either use “select_raw query” instead, or instantiate a query object and call its #run method, as demonstrated in the gist below, where we retrieve all the Nonsense Mutations and then process them afterward to sort by patient ID or gene type.

Saving and Sharing

Finally, the way I’ve set up the server and the nature of instance_eval allowed me to include the saving of a ‘workspace’ between evaluations, and the sharing of results or methods across sessions and users. To save a variable or result, simply prefix it with an “@” sign, declaring it as an instance variable.

Then you can come back later and run another script

That reuses the instance variable “@result” stored in your instance of the script evaluator. You can do this for procs or lambdas to reuse functions, and pretty much anything else you can think of. Similarly, prefixing the variable with “@@” will mark it as a class variable, enabling anyone accessing the script interface to use it.
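The mechanism behind this is easy to demonstrate. The sketch below is a bare-bones version of the idea, not the server’s actual evaluator class: because every script is evaluated against the same object, instance variables set by one script are still there for the next.

```ruby
# A minimal sketch of an instance_eval based workspace: each script is
# evaluated against the same evaluator object, so instance variables
# set by one script survive into the next.
class ScriptEvaluator
  def run(script)
    instance_eval(script)
  end
end

evaluator = ScriptEvaluator.new

# First "session": store a result in an instance variable
evaluator.run('@result = [3, 1, 2]')

# Later "session": the variable is still there to reuse
sorted = evaluator.run('@result.sort')
puts sorted.inspect  # => [1, 2, 3]
```

This is also exactly why the interface is dangerous: the script string can contain any Ruby at all.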

Do Not Try This At Home

Again I want to stress that this is by no means a thorough approach to providing public access to an RDF dataset. It is so ridiculously permissive that I’m sure there are people online who would be physically ill just thinking about the insecurity of my approach. Hopefully if they’re reading this they’ll feel inclined to offer some advice for how to do it better, but in lieu of that, I believe that working in a small group on a closed server with this interface could aid collaboration and the prototyping of queries and algorithms. It also helps to show just how flexible the underlying data model we’re operating on can be, and how the impedance between programs and query-accessible databases is in many cases less severe with SPARQL than with SQL.

The one huge component of the semantic web this does leave out is interaction between services. The ability to unambiguously make statements with RDF triples creates a natural route for integrating and consuming external services, which I will talk about in more detail in a followup post.


Objects for all

The code I’ve written for generating Data Cube RDF is mostly in the form of methods stored in modules, which need to be included in a class before they are called. Although these will work just fine for converting your data, those who are not already familiar with the model may wish for a more “friendly” interface using familiar Ruby objects and idioms. Though accessing your datasets with a native query language gives you far more flexibility than a simple Ruby interface ever could, you may not want to bother with the upfront cost of learning a whole new language (even one as lovely as SPARQL) just to get started using the tool.

This is not just a problem for me this summer; it occurs quite often when object oriented languages are used to interact with some sort of database. Naturally, a lot of work, both theoretical and practical, has been done to make the connection between the language and the database as seamless as possible. In Ruby, and especially Rails, a technique called Object Relational Mapping is frequently used to create interfaces that allow Ruby objects to stand in for database concepts such as tables and columns.

The ORM translates between program objects and the database (source)

ORM and the design patterns derived from it are common features of any web developer’s life today. Active Record, a core Rails gem implementing the design pattern of the same name, has been an important (if somewhat maligned) part of the framework’s success and appeal to new developers. Using Active Record and other similar libraries, you can leverage the power of SQL and dedicated database software without needing to learn a suite of new languages.

ORM for the Semantic Web?

There has been work done applying this pattern to the Semantic Web in the form of the ActiveRDF gem, although it hasn’t been updated in quite some time. Maybe it will be picked up again one day, but the reality is that impedance mismatch, where the fundamental differences between object oriented and relational database concepts create ambiguity and information loss, poses a serious problem for any attempt to build such a tool. Still, one of the root causes of this problem, that the schemas of relational databases are far more constrained than those of OO objects, is somewhat mitigated for the Semantic Web, since RDF representations can often be less constraining than typical OO structures. So there is hope that a useful general mapping tool will emerge in time, but it’s going to take some doing.

Hierarchies and abstraction are a problem for relational DBs, but they’re core concepts in the OWL Semantic Web language (source)

Despite the challenges, having an object oriented interface to your database structures makes them a lot more accessible to other programmers, and helps reduce the upfront cost of picking up a new format and new software, so it has always been part of the plan to implement such an interface for my project this summer. Fortunately the Data Cube vocabulary offers a much more constrained environment to work in than RDF in general, so creating an object interface is actually quite feasible.

Data Mapping the Data Cube

To begin with, I created a simple wrapper class for the generator module, using instance variables for various elements of the vocabulary, and a second class to represent individual observations.

Observations are added as simple hashes, as shown in this cucumber feature:

With the basic structure completed, I hooked up the generator module, which can be accessed by calling the “to_n3” command on your Data Cube object.

Having an object based interface also makes running validations at different points easier. Currently you can specify the ‘validate_each?’ option when you create your DataCube object, which if set to true will ensure that the fields in your observations match up with the measures and dimensions you’ve defined. If you don’t set this option, your data will be validated when you call the to_n3 method.
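To make the shape of this interface concrete, here is an illustrative miniature of it. The real class lives in the PubliSci gem and differs in detail; the class name, option name, and output format below are simplifications, not the gem’s API.

```ruby
# A toy version of the object interface: dimensions and measures are
# declared up front, observations are plain hashes, and validation runs
# either per-observation or at serialization time.
class MiniDataCube
  def initialize(dimensions:, measures:, validate_each: false)
    @dimensions = dimensions
    @measures = measures
    @validate_each = validate_each
    @observations = []
  end

  def add_observation(obs)
    validate(obs) if @validate_each
    @observations << obs
  end

  # Serialize observations as (simplified) turtle; validates here if
  # per-observation validation was not requested.
  def to_n3
    @observations.each { |obs| validate(obs) } unless @validate_each
    @observations.each_with_index.map { |obs, i|
      lines = ["<#obs#{i}> a qb:Observation ;"]
      lines += obs.map { |field, value| "  <##{field}> \"#{value}\" ;" }
      lines.join("\n") + " ."
    }.join("\n")
  end

  private

  # Every observation must supply exactly the declared fields.
  def validate(obs)
    expected = (@dimensions + @measures).sort
    actual = obs.keys.map(&:to_s).sort
    raise "fields #{actual} don't match #{expected}" unless actual == expected
  end
end

cube = MiniDataCube.new(dimensions: ["producer"], measures: ["chunkiness"],
                        validate_each: true)
cube.add_observation("producer" => "hormel", "chunkiness" => 1)
puts cube.to_n3
```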

You can also include metadata about your dataset such as author, subject, and publisher, which are now added during the to_n3 method.

The other side of an interface such as this is being able to construct an object from an RDF graph. Although my work in this area hasn’t progressed as far, it is possible to automatically create an instance of the object using the ORM::Datacube.load method, either on a turtle file or an endpoint URL:

All of this is supported by a number of new modules that assist with querying, parsing, and analyzing datasets, which were designed to be useful with or without the object model. There’s still a lot to be done on this part of the project; you should be able to delete dimensions as well as add them, and more validations and inference methods need to be implemented, but most of all there needs to be a more direct mapping of methods and objects to SPARQL queries. Not only would this conform better to the idea of an ORM pattern, but it would also allow large datasets to be handled much more easily, as loading every observation at once can take a long time and may run up against memory limits for some datasets.

Coded properties for proper semantics

While it would be nice if all the data we are interested in working with could be accessed using SPARQL, the reality is that most data is stored in some kind of tabular format, and may lack important structural information, or even data points, either of which could be a serious stumbling block for trying to represent it as a set of triples. In the case of missing data, the difficulty is that RDF contains no concept of null values. This isn’t simply an oversight; the semantic nature of the format requires it. The specific reasons are related to the formal logic and machine inference aspects of the Semantic Web, which haven’t been covered yet here; this post on the W3C’s semantic web mailing list provides a good explanation. As a brief summary, consider what would happen if you had a predicate for “is married to”, and you programmed your (unduly provincial) reasoner to conclude that any resource that is the object of an “is married to” statement for a subject of type “Man” is of type “Woman” (and vice versa). In this situation, if you had an unmarried man and an unmarried woman in your dataset, and chose to represent this state by listing both as “is married to” a null object, say “rdf:null”, your reasoner would conclude that rdf:null was both a man and a woman. Assuming your un-cosmopolitan reasoner specifies “Man” and “Woman” as disjoint classes, you have created a contradiction, and invalidated your ontology!

Paradoxes: not machine friendly

Since missing values are actually quite common in the context of QTL analysis, where missing information is often imputed or estimated from an incomplete dataset, I have been discussing the best way to proceed with my mentors. We have decided to use the “NA” string literal by default, and leave it up to the software accessing our data to decide how to handle the missing data in a given domain. This is specified in the code which converts raw values to resources or literals:
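The embedded snippet is gone, so here is a sketch of the rule it implements. The actual code lives in the gem’s generator module and is more involved; the function name and quoting here are illustrative.

```ruby
# Sketch of the conversion rule: missing values become the literal "NA"
# rather than any attempt at an RDF null; numbers pass through as plain
# literals, everything else is quoted as a string literal.
def value_to_literal(value)
  return '"NA"' if value.nil? || value == "NA"
  case value
  when Numeric then value.to_s
  else %Q("#{value}")
  end
end

puts value_to_literal(nil)      # prints "NA"
puts value_to_literal(4.2)      # prints 4.2
puts value_to_literal("chr13")  # prints "chr13"
```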

The 1.1 version of the RDF standard also includes NaN as a valid numeric literal, so I am experimenting with this as a way of dealing with missing numeric values. Missing structure is a somewhat larger problem. In some situations, such as the d3 visualizations Karl Broman has created, it is sufficient to simply store all data as literal values under a single namespace. For example, consider a (made up) database of economic indicators by country: you could say a given data point has a country value of “America”, a GDP of $49,000 per capita, an infant mortality rate of 4 per thousand, and so on, storing everything as a literal value. This is fine if you’re just working with your own data, but what if you want to be able to find more information about this “America” concept? The great thing about Semantic Web tools is that you can easily query another SPARQL endpoint for triples with an object value of “America”. But this “America” is simply a raw string; you could just as well be receiving information about the band “America”, the entire continent of North, South, or Central America, or something else entirely. Furthermore, RDF does not allow literals as subjects in triples, so you wouldn’t be able to make any statements about “America”. This is particularly problematic for the Data Cube format for a number of reasons, not the least of which is the requirement that all dimensions must have an rdfs:range concept that specifies the set of possible values. The solution is to make “America” a resource inside a namespace. For example, if we were converting these data for the IMF, we could replace “America” (the string literal) with a URI in the IMF’s namespace. We can now write statements about the resource, and ensure that there is no ambiguity between different Americas.
This doesn’t quite get us all the way to fully linked data, since it’s not clear yet how to specify that the “America” in the imf.org namespace is the same as the one in, say, the unfao.org namespace (for that you will need to employ OWL, a more complex Semantic Web technology outside the scope of this post), but it at least allows us to create a valid representation of our data. In the context of a Data Cube dataset, this can be automated through the use of coded properties, supported by the skos vocabulary, an ontology developed for categorization and classification. Using skos, I define a “concept scheme” for each coded dimension, and a set of “hasTopConcept” relations for each scheme.
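The concept-scheme generation can be sketched roughly as follows. This is an approximation of what the generator emits, not its actual code; the base namespace and naming convention are assumptions for illustration.

```ruby
# Illustrative sketch of coded properties: each value of a coded
# dimension becomes a skos:Concept resource under a namespace, gathered
# into a skos:ConceptScheme via skos:hasTopConcept relations.
BASE = "http://example.org/dataset/code"

def concept_scheme(dimension, values)
  triples = ["<#{BASE}/#{dimension}> a skos:ConceptScheme ."]
  values.each do |value|
    uri = "<#{BASE}/#{dimension}##{value.gsub(/\s+/, '_')}>"
    triples << "<#{BASE}/#{dimension}> skos:hasTopConcept #{uri} ."
    triples << "#{uri} a skos:Concept ; skos:prefLabel \"#{value}\" ."
  end
  triples.join("\n")
end

puts concept_scheme("country", ["America", "Canada"])
```

Because “America” is now a resource with its own URI, you can attach further statements to it, which is exactly what the string literal version forbade.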

Each concept gets its own resource, for which some rudimentary information is generated automatically

Currently the generator only enumerates the concepts and creates the scheme, but these concepts provide a place to link to other datasets and define richer semantics. As an example, if the codes were generated for countries, you could tell the generator to link to the dbpedia entries for each country. Additionally, I plan to create a “No Data” code for each concept set, now that we’ve had a discussion about the way to handle such values.

To see how this all comes together, I’ll go through an example using one of the most popular data representation schemes around; the trusty old CSV.

CSV to Data Cube

Most readers are probably already familiar with this format, but even if you aren’t, it’s about the simplest conceivable way of representing tabular data: each column is separated by commas, and each row by a new line. The first row typically holds the labels for the columns. Although very little in the way of meaning is embedded in this representation, it can still be translated to a valid Data Cube representation, and more detailed semantics can be added through a combination of automated processing and user input.

The current CSV generator is fairly basic; you can provide it with an array of dimensions, coded dimensions, and measures using the options hash, point it at a file, and it will create Data Cube formatted output. There is only one extra option at the moment, which allows you to specify a column (by number) to use to generate labels for your output. See below for an example of how to use it.

Here is a simple file I’ve been using for my rspec tests (with spaces added for readability).

producer, pricerange, chunkiness, deliciousness
hormel,     low,        1,           1
newskies,  medium,      6,           9
whys,      nonexistant, 9001,        6

This can be fed into the generator using a call such as the following:
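The call itself was an embedded gist; as a stand-in, here is a self-contained approximation of what the generator does with that file (readability spaces removed). The `csv_to_observations` helper below is hypothetical, not the gem’s API: columns named in the options are treated as dimensions, everything else as measures.

```ruby
require 'csv'

# The sample file from above, without the alignment whitespace.
DATA_CSV = <<-CSV
producer,pricerange,chunkiness,deliciousness
hormel,low,1,1
newskies,medium,6,9
whys,nonexistant,9001,6
CSV

# Toy converter: one qb:Observation per row, each column a field tagged
# with its role (dimension or measure) in a turtle comment.
def csv_to_observations(csv_string, dimensions:)
  rows = CSV.parse(csv_string, headers: true)
  rows.each_with_index.map do |row, i|
    fields = row.to_h.map { |k, v|
      role = dimensions.include?(k) ? "dimension" : "measure"
      "  <##{k}> \"#{v}\" ; # #{role}"
    }
    (["<#obs#{i}> a qb:Observation ;"] + fields).join("\n") + "\n."
  end
end

puts csv_to_observations(DATA_CSV, dimensions: ["producer"]).first
```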

And that’s all there is to it! Any columns not specified as dimensions will be treated as measures, and if you provide no configuration information at all it will default to using the first column as the dimension. At the moment, all dimension properties for CSV files are assumed to be coded lists (as discussed above), but this will change as I add support for other dimension types and sdmx concepts. If you’d like to see what the output looks like, you can find an example gist on GitHub.

As the earlier portion of this post explains, the generator will create a concept scheme based on the range of possible values for each dimension, and define the rdfs:range of the dimension as this concept scheme. Aside from producing a valid Data Cube representation (which requires all dimensions to have a range), this also creates a platform for adding more information about coding schemes and individual concepts, either manually or, in the near future, with some help from the generation algorithm.

I’ll cover just how you might go about linking up more information to your dataset in a future post, but if you’d like a preview, have a look at the excellent work Sarven Capadisli has done in converting and providing endpoints for data from the UN and other international organizations.

An overview of Sarven’s data linkages (source)

The country codes for these datasets are all linked to further data on their concepts using the skos vocabulary, for example, see this page with data about country code CA (Canada). This practice of linking together different datasets is a critical part of the Semantic Web, and will be an important direction for future work this summer.

Motivations: Why we need to improve the Semantic Web

David Karger: How the Semantic Web Can Help End Users

MIT AI researcher David Karger gave the keynote at this year’s European Semantic Web Conference, and has posted his slides as well as a summary of his talk on the MIT Haystack blog. He’s an expert on the topic, and does a much better job than I could of explaining the value of flexible and extensible data representation. I hope to distill some of the writing of Karger and others and post it here over the summer, but for now, if you’re not sure what advantage these technologies have over traditional databases and ad-hoc formats, or you think there’s no more useful work to be done on them, have a look at the presentation.

Data Frames and Data Cubes

One of the most commonly used abstract structures in R is the dataframe. Technically, a dataframe is a list containing vectors of equal length, but it is generally used to represent tables of data, where each row represents an observation, and each column a variable or attribute of the observation. Dataframes are commonly used to store the structured results of some computation in R, such as the chromosome, position, and LOD score for a QTL analysis operation. The dataframe is less flexible than a general R class, but more complex than a simple list or primitive value, so I decided to use it as my test case for the R to RDF process.

Tools

Although there’s nothing approaching the range of Semantic Web libraries available for popular languages like Java, the Ruby community has developed a fair number of libraries for interacting with RDF formatted data, which will allow me to focus more development time on building useful tools than on the background infrastructure to support them. Chief among these is the Ruby RDF project, which hosts a number of interrelated libraries built on a basic core for interacting with semantic data. Ruby RDF features a Ruby object representation of triples and graphs and methods for interacting with them, as well as a number of plugins for reading and writing data using different storage systems and RDF serialization standards. It takes the approach of focusing on a limited set of functions, implementing them efficiently and robustly, and providing an extensible architecture to allow new functionality to be added. Although it started with built-in support for only one of the simplest RDF formats, ntriples, functional and well-maintained extensions or plugins are now available for most popular triple stores, serialization formats, and querying methods.

Although anyone can invent their own vocabulary and write perfectly valid RDF using it, the focus of the Semantic Web is on interoperability via re-use of existing standards. Converting tabular data to RDF has historically been somewhat difficult, as the two formats are not naturally very compatible. Fortunately recent work in unifying a number of existing data description vocabularies has yielded a reasonably general specification for multidimensional data: The Data Cube Vocabulary. It is a rich and descriptive system, as you might guess looking at this chart of its components:

Data Cube Vocabulary

However, it can still describe simple datasets without requiring every component above, so it is flexible as well as powerful. At its core, the Data Cube vocabulary is essentially just a long list of observations, and the dimensions, units, measurements and attributes that define them. It has been used in a number of real world applications, particularly in the publication of large data sets by government agencies, such as the financial data published by the UK government as part of their transparency initiative, and a wide variety of others.

Once data is in RDF form, you need somewhere to store it. Although theoretically you could use a standard relational database (and many do offer direct support for triple formatted data), or even just store everything as a bunch of serialized text files, in practice RDF data is generally stored using a type of database called a triplestore. These are optimized for storing data in Subject Predicate Object and related forms, so they are generally more efficient and easier to integrate with other triple based data sets or querying protocols. Although there are differences between types of triple stores, ranging from subtle to extreme, Ruby RDF abstracts many of these away with a Repository object that acts as an interface for loading data into a store and performing queries on it. To begin with, we are using 4store for our triplestore, as it’s free, supports the latest versions of the RDF and SPARQL standards, is a native triple store (as opposed to an extension for another type of DB), and can run as a cluster for operating on large amounts of data. A plugin is already available to use 4store as a repository in Ruby RDF. It isn’t fully finished, so implementing the remaining functions may become part of my work this summer if we continue to use it, but it works well enough to load a set of triples into the store, which is all I need at this stage.

Finally, I’ll be using the Rserve tool and the accompanying rserve-client Ruby gem to manage the interaction between R and Ruby. While there are a few options for doing this, Rserve seemed to offer the best compromise between performance and deployability, making up for its slight disadvantage in speed compared to some libraries (e.g. RSRuby) by being written in pure Ruby, and thus more easily deployable in different environments.

Method

The overall process for converting from an R dataframe to a rudimentary Data Cube object is actually fairly straightforward. When using the standalone program, you just provide the name of the R variable to dump, and possibly the location of your R workspace and the port your 4store instance is running on if these differ from the defaults (loading the Ruby class in your code offers more flexibility). The script first creates a hash from the data, which is somewhat counter-intuitively represented as a monkey patched Array object with “name” attributes instead of just a regular hash. This is a reasonably simple operation for a basic dataframe, which, as mentioned before, is structurally equivalent to a table with named rows, but the process may become more complex when I move on to supporting a wider set of R objects.
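The first step can be sketched as follows. The shape of the named-array object and the sample values are stand-ins for what rserve-client actually returns, which wraps things in its own classes; only the zip-to-hash idea carries over.

```ruby
# Rserve hands back a dataframe as an array of column vectors with names
# attached; this Struct mimics that shape, and the conversion turns it
# into a plain column-name => vector hash.
NamedList = Struct.new(:names, :columns)

# A stand-in for a 3-row scanone-style result (values invented)
dataframe = NamedList.new(
  ["chr", "pos", "lod"],
  [["1", "1", "2"], [81.4, 93.2, 3.0], [0.47, 0.39, 1.23]]
)

def dataframe_to_hash(df)
  df.names.zip(df.columns).to_h
end

data = dataframe_to_hash(dataframe)
puts data["lod"].inspect  # => [0.47, 0.39, 1.23]
```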

At first I attempted to create an ad-hoc placeholder vocabulary and create triples for each entry and relation in the dataframe using Ruby RDF’s included statement builder, thinking I would test the 4store interface and implement a more complete vocabulary later. I quickly realized working with plain triples is too cumbersome an approach to truly be viable, and it was producing unnecessarily large graphs for the amount of information I was attempting to store. I decided to move straight to implementing the Data Cube vocabulary, using an RDF description language known as N3, or more specifically a subset of it known as Turtle. N3 is actually powerful enough to express statements which do not translate to valid RDF, while Turtle is limited to just the subset of N3 that corresponds to proper RDF. In addition to being more powerful than the basic language of triples, Turtle is also one of the most human readable formats for working with RDF, which can otherwise wind up looking like a big morass of unformatted HTML. The downside is that Ruby RDF doesn’t yet have full support for writing blocks of Turtle statements, so currently much of the code relies on string building.

I decided on a few simple mappings for parts of the dataframe to get things started. They may not be perfectly accurate or the proper representation of the data, but I wanted to make sure I was at least generating a valid Data Cube before adding the fine details. Each row of the dataframe becomes a Data Cube dimension, each column a measure, and the dataset as well as its supporting component objects are named using the name of the variable being converted. The exact way in which these elements are linked can seem a little arcane at first, but as I mentioned, it really just comes down to splitting the data into observations and attaching dimensions or attributes to describe it. If you’d like an example of how it all winds up looking, see this simple file describing just one datapoint: https://gist.github.com/wstrinz/cd20661a8d1fda123e2b

Once it’s in this format, 4store and most other triple stores can load the data easily, and if the store provides a built in SPARQL endpoint, it can already be queried using the standard patterns. If new data using the same definition is added, it can be automatically integrated, and if someone else wants to extend or provide more details for any element, they can do so without redefining their schema, and anyone looking for a definition of how the data is structured can find it easily.

How To

I will go through a quick example, from loading a data set in R all the way to viewing it in a Data Cube visualizer, as a demonstration of what I’ve implemented so far. Although the native Ruby code allows for more flexibility, I will be using the standalone Java executable for this example. This assumes you’ve installed R/qtl as well as Rserve; if not, both are available from the R package management system. You’ll also need 4store installed and running its HTTP SPARQL server in unsafe mode, for which you can find instructions here. In this example, I’ll be running mine on the default port 8080.

First, let’s start R in our home directory and do some simple analysis with a sample dataset provided by R/qtl. Once we’ve loaded the results into a variable, we’ll close the R session and save our workspace.

$ R
> library(qtl)
> data(listeria)
> mregress = scanone(listeria, method="mr")
> quit()
Save workspace image? [y/n/c]: y

Next, we want to run our tool to dump the objects into 4store. To do this, you simply run the executable jar with the variable to dump as the first argument, and optionally the directory the R workspace was saved in (if it’s not the same directory you’re running the jar from).

$ java -jar QTL2RDF.jar mregress ~

You can find a copy of the jar here, but since this project is under active development that will likely go out of date soon, so you’ll want to check for the latest build on the project’s Github page.

That’s all you really need to do! Your data are now in your local 4store, ready to be extracted or queried as you see fit. However, the jar also offers the option of returning its output as a Turtle formatted string, so if you want to use it with some other program you can pass a fourth argument (for now it can be anything) when you run the jar, then pipe the output to a file.

$ java -jar QTL2RDF.jar mregress ~ 8080 something > mregress.ttl

This file should be importable into any triple store, and usable with other tools that recognize the Data Cube format. One such tool is Cubeviz, which allows you to browse and present statistical information encoded as Data Cube RDF. Cubeviz uses a different triple store called Virtuoso, so it can be a bit time consuming to set up, but once you’ve done so importing your data is as easy as pointing it at your data, which you can either upload as a file, provide a link to, or simply paste in (if your dataset is small enough).

Pasting the data set

You can then decide how you’d like to visualize your data. In this case, we’re probably interested in the LOD, so I set measurement to “lod” and selected a few loci in the dimension configuration.

The resulting graph in Cubeviz

You could select each locus to view the whole data set, but Cubeviz is still in development, and consequently has some issues, such as failing to scale down the graph for a wide range of values.

So far, this is just the bare minimum required to properly define the object in the Data Cube vocabulary. Some of the mappings may not be quite correct, and there’s a lot that could be done to better define the semantics of the data that’s being converted. As an example, the ‘chr’ and ‘pos’ measures may be better described as attributes rather than measures in their own right. There are no labels or units on the graph, although Cubeviz will create them if your data is formatted correctly. Some of these features will be included as I improve the conversion program, but there will still be some important semantic information that cannot be easily extracted from the R dataframe. Developing ways for users to specify how this should be handled in a friendly and intuitive way is an important facet of this summer’s project, but even at its minimum functionality the tool I’ve developed can already convert and store a popular data structure in a useful and flexible format.

Next Steps

I plan to finish mapping simple R types over the next few days, then write some functions to break general R objects into their component parts and serialize those. By the official start of coding day, I should have the groundwork laid to start applying these methods to Ruby objects and developing more complete ontology handling.

Again, please check out the Github repository (https://github.com/wstrinz/R2RDF) if you’d like to see some more details. The code isn’t particularly efficient or well documented yet, but I’d appreciate any feedback anyone has. If you have questions, please leave a comment here or get in touch with me some other way.

Goals: Reproducible Science

This project has a number of goals, including improving support for large datasets in the bioinformatics community, furthering the development of semantic web technologies, and supporting data sharing and reproducible science. Today I’m going to go into a little more detail about the first of these and talk about the work I’ve done so far on it.

To get up to speed on all of the interrelated Semantic Web standards and technologies, I have been working on a tool for converting objects from the R statistical computing language into RDF triples, the native format of the Semantic Web. Although this will be a valuable tool on its own, it is also being developed to support the next version of R/qtl, a library developed by Karl Broman and Hao Wu, as well as a host of other contributors, which offers functions for doing Quantitative Trait Loci mapping using R. The next incarnation of R/qtl will focus on support for highly parallelized computation, a key component of which will be storing results in a database that can be queried and manipulated remotely, as opposed to keeping huge data sets in memory on one computer.

An overview of the plans for R/qtl

Data Sharing and Reproducibility

The other advantage of storing statistical data in a triple-based format is that it can be easily, even automatically, published for others to download, inspect, and interact with. In RDF, every property of a resource is defined by its relation to other resources or objects, and each relation comes with an attached definition that either a machine or a human can access to get further details about it. This allows for a huge amount of flexibility in data types and storage schemas, as well as the application of algorithmic reasoning techniques to simplify a data set or find out more about its implications.

Publishing scientific data in a machine readable format also makes it dramatically easier for scientists hoping to replicate the results or build upon them. Most publications will, at best, include supporting data as a table or tables of statistical aggregates, and even when lower level raw data are available, they are usually stored in a flat format such as CSV, which includes little to no semantic content, such as the units or attributes of objects, or their meaning in the context of the rest of the data. While some fields have begun using relational database technology more extensively, the fact that the most popular data storage formats for many researchers are essentially text files speaks to the rigidity and extra complications of using dedicated database systems.

RDF has a somewhat mind bending structure of its own to understand, and it’s certainly no silver bullet for the problem of generalized data storage, but its flexibility allows it to overcome much of the ossification and user-unfriendliness of trying to use relational databases for storing and publishing scientific results. A primary goal of my work this summer will be to extend existing Ruby tools and build new ones to help make Semantic data storage accessible and simple for researchers, supporting the crucial work of examining, validating, and extending published results.

Protocols

The Semantic Web is built on three core technologies: the RDF data model, the SPARQL query language, and the OWL web ontology language. There is a large body of documentation about all three of these tools, which can be found around the web or in the W3C’s standards documentation. I may write some posts going into more detail, but for now, a brief overview:

RDF essentially comes down to describing data using the ‘Subject Predicate Object’ format, creating statements known as ‘Triples’. An example would be “John(subject) knows(predicate) Mary(object)”. With a few exceptions, each component of this triple is given a URI, which looks a lot like a regular URL and can be used to uniquely identify a resource (such as ‘John’ or ‘Mary’), or a relation (such as ‘friends with’), as well as providing a link where more information about the object or relation may be found. You can think of subjects and objects as nodes, and predicates as lines connecting them. Although this doesn’t mean much for one statement such as “John knows Mary”, a collection of similar statements defines a big directed graph of interconnected nodes, where every connection is labeled with its meaning.
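
In Turtle notation, the “John knows Mary” statement could be written like this, borrowing the `knows` predicate from the real FOAF vocabulary (the `ex:` namespace is a placeholder for wherever John and Mary are actually defined):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/people#> .

# subject       predicate   object
ex:John         foaf:knows  ex:Mary .
```

Dereferencing `http://xmlns.com/foaf/0.1/knows` leads to FOAF’s own documentation of what “knows” means, which is exactly the self-describing property mentioned above.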

SPARQL is the official query language of the Semantic Web. It can be used to select elements of an RDF graph based on their subject, predicate, or object elements. Its syntax resembles, superficially at least, the SQL language familiar to many database users, but in practice it functions quite differently. However, many of the advanced operations of SQL, such as pivoting, are still supported.
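
For instance, a hypothetical query to find everyone a resource named `ex:John` knows (again using FOAF’s `knows` and a placeholder `ex:` namespace) would match a graph pattern rather than rows in a table:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex:   <http://example.org/people#>

SELECT ?who
WHERE {
  ex:John foaf:knows ?who .
}
```

Any position in the triple pattern can be a variable, so swapping `ex:John` for `?s` would instead return every “knows” relationship in the store.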

OWL is a language for describing ontologies, which are used to formally represent the concepts behind the data in an RDF store, allowing simpler machine interpretation. The technology is crucial to the interconnectedness of the Semantic Web, as it is the means by which the relations between disparate resources can be automatically discovered, allowing relatively easy integration of new or existing data.
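
A tiny, illustrative OWL fragment in Turtle (placeholder `ex:` namespace) shows the flavor of this: classes and properties are themselves described with triples, and a reasoner can use those descriptions to infer facts that were never stated directly:

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/people#> .

ex:Person    a owl:Class .
ex:Scientist a owl:Class ;
    rdfs:subClassOf ex:Person .

# From "ex:Mary a ex:Scientist", a reasoner can conclude "ex:Mary a ex:Person"
ex:Mary a ex:Scientist .
```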

There is, of course, much more to each of these technologies, and the technicalities, use cases, and extensions of each of them, than would fit in one section of a blog post, but hopefully this gives a broad overview of how the project will work.

A Starting Place

To begin with, I have developed a Ruby-based tool to automatically convert data frame objects in R to RDF, using the Data Cube vocabulary, which was developed as a generalizable way of representing multidimensional data. It can be run easily on any Ruby capable machine, but since the script and all of its dependencies are pure Ruby libraries, I was also able to deploy it as an executable jar with Warbler, so anyone with Java installed can use it without having to download any dependencies. I’ll go into more detail about how this works, why it’s useful, and where this part of the project is headed, but I’ve decided to break that into a separate blog post.