Data for your data

One of the key applications of RDF is representing and disseminating data about other datasets, also known as metadata. This can be all sorts of things, from the publisher or subject of a document to the file format of a video, but in bioinformatics, and science in general, you’re often most interested in how you can make use of the dataset in your domain. This might include getting information on the particular species or region your dataset refers to, or more complex questions such as where and under what terms you can access it, what process was used to create or derive it, and ultimately whether or not you can “trust” it. Although RDF and the Semantic Web don’t automatically answer these questions, they provide a powerful and widely used platform on which to do so.

This is the appropriate metadata reference, not Inception

To begin with, I am using two ontologies to represent metadata. One is concerned with general metadata, such as author and subject, and the other is more focused on the process used to create the data. For now the interface is a little clunky, since it’s just the basic generation functions. Later on they’ll be wrapped in classes that provide a more friendly interface, and probably decomposed into smaller functions, similar to the Data Cube part of the gem.

Dublin Core

The Dublin Core vocabulary is a flexible and widely used standard for representing basic metadata. DC is fairly venerable by the standards of the Semantic Web; it traces its roots back to a metadata workshop in Dublin, Ohio, in 1995. Since then it has been developed and maintained by an organization known as the Dublin Core Metadata Initiative. It is probably the most ubiquitous vocabulary outside of the core set of RDF ontologies, and has been ratified as an ANSI and ISO standard.

At the moment, my gem supports some of the most basic elements of DC, such as author and publication date. The method for this takes a hash and writes the DC terms for any of the elements that are specified, attempting to generate or infer any missing components.
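
The shape of the call is roughly this (the method name, option keys, and choice of prefix are a sketch rather than the gem’s final interface):

# Sketch of a Dublin Core generation function: take a hash of fields and write
# the corresponding DC terms, inferring a publication date if none is supplied.
def basic_metadata(fields = {})
  date = fields[:date] || Time.now.strftime("%Y-%m-%d")
  ttl  = "<#{fields[:uri]}> dc:creator \"#{fields[:creator]}\" ;\n"
  ttl << "  dc:title \"#{fields[:title]}\" ;\n" if fields[:title]
  ttl << "  dc:date \"#{date}\"^^xsd:date .\n"
  ttl
end

basic_metadata(uri: "http://example.org/dataset/mousecross", creator: "A. Researcher")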

Using this method will add some basic information to any dataset created with the gem, as shown in the gem’s cucumber tests.
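
A sketch of the kind of scenario these cover (the step wording and property names are illustrative, not copied from the actual test):

Feature: Dataset metadata
  Scenario: Default publication metadata
    Given a dataset generated with creator "A. Researcher"
    Then the output should contain a dc:creator of "A. Researcher"
    And the output should contain a dc:date for today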

Publisher and subject information are also supported, although there’s still some work to be done bridging the gap between informal subjects and those defined under various ontologies, which is really more what the ‘subject’ term was designed for.

Prov

The PROV ontology is a more specialized standard than Dublin Core, designed to represent provenance metadata, which includes the sources of and processes used to create a dataset, which people, software, or organizations were involved in creating it, and which data elements used or were derived from others. PROV was developed by a W3C working group given the goal of creating a unified standard for publishing provenance information, where before a patchwork of standards existed, each missing some important component of provenance representation.

Essentially, PROV is about the interplay of Agents, Activities, and Entities, with Agents engaging in Activities to generate Entities or derive them from other Entities. All of these elements can be either digital (software agents and algorithmic activities), physical (lab technicians and in-person data collection), or some combination of the two. There are additional specializations of these classes, as well as a suite of terms to describe their relationships with one another.
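
To make that concrete, here’s a tiny illustration of the pattern built with the rdf gem directly (the URIs are made up, and this isn’t my generator code):

require 'rdf'

prov  = RDF::Vocabulary.new("http://www.w3.org/ns/prov#")
ex    = RDF::Vocabulary.new("http://example.org/")
graph = RDF::Graph.new

graph << [ex.dataset,  RDF.type, prov.Entity]
graph << [ex.analysis, RDF.type, prov.Activity]
graph << [ex.analyst,  RDF.type, prov.Agent]

graph << [ex.dataset,  prov.wasGeneratedBy,    ex.analysis]   # the Activity generated the Entity
graph << [ex.analysis, prov.wasAssociatedWith, ex.analyst]    # the Agent carried out the Activity
graph << [ex.dataset,  prov.wasAttributedTo,   ex.analyst]
graph << [ex.dataset,  prov.wasDerivedFrom,    ex.raw_data]   # derivation from an upstream Entity

puts graph.dump(:ntriples)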

These can get a little complicated, so I’ve been tracking my understanding of it with a diagram of the relationship between elements. This is still a work in progress, so if anything looks off to you I’d be happy to hear about it!

Basic provenance

This is just the basic provenance for one entity, so it’s pretty comprehensible, but the whole point of the vocabulary is to link different entities and datasets with one another, which can get a little more complicated.

A longer provenance chain (full version)

My mentors and I have agreed on the importance of being able to generate metadata for non-RDF resources, so the diagram reflects the notion that the triplified dataset may or may not be present, along with any entities or activities in the provenance chain of the main dataset. Using this system, quite a bit of useful information can be generated from a fairly small set of inputs
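
For instance, the input might be little more than a hash like this (the keys and the generation function named here are hypothetical placeholders for the real interface):

prov_fields = {
  subject:      "http://example.org/dataset/qtl_results",
  derived_from: "http://example.org/dataset/raw_phenotypes",
  activity:     "http://example.org/activity/scanone_run",
  agent:        "http://example.org/people/analyst"
}

turtle = provenance_metadata(prov_fields)   # hypothetical generation function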

This scheme better reflects the current capabilities of my code, but it’s still not a complete use of the ontology. The connections between entities, activities, and agents need not be linear, and more than one entity could be the object of a “used” or “wasDerivedFrom” relationship. This is something I’ll be working toward during the rest of the summer, but for now it provides a reasonable way to represent the provenance of many workflows.


Objects for all

The code I’ve written for generating Data Cube RDF is mostly in the form of methods stored in modules, which need to be included in a class before they are called. Although these will work just fine for converting your data, those who are not already familiar with the model may wish for a more “friendly” interface using familiar Ruby objects and idioms. Though you can access your datasets with a native query language, which gives you far more flexibility than a simple Ruby interface ever could, you may not want to bother with the upfront cost of learning a whole new language (even one as lovely as SPARQL) just to get started using the tool.

This is not just a problem for me this summer; it occurs quite often when object oriented languages are used to interact with some sort of database. Naturally, a lot of work, both theoretical and practical, has been done to make the connection between the language and the database as seamless as possible. In Ruby, and especially Rails, a technique called Object Relational Mapping is frequently used to create interfaces that allow Ruby objects to stand in for database concepts such as tables and columns.

The ORM translates between program objects and the database (source)

ORM and design patterns derived from it are common features of any web developer’s life today. Active Record, a core Rails gem implementing the design pattern of the same name, has been an important (if somewhat maligned) part of the framework’s success and appeal to new developers. Using Active Record and other similar libraries, you can leverage the power of SQL and dedicated database software without needing to learn a suite of new languages.

ORM for the Semantic Web?

There has been work done applying this pattern to the Semantic Web in the form of the ActiveRDF gem, although it hasn’t been updated in quite some time. Maybe it will be picked up again one day, but the reality is that impedance mismatch, where the fundamental differences between object oriented and relational database concepts create ambiguity and information loss, poses a serious problem for any attempt to build such a tool. Still, one of the root causes of this problem, that the schemas of relational databases are far more constrained than those of OO objects, is somewhat mitigated for the Semantic Web, since RDF representations can often be less constraining than typical OO structures. So there is hope that a useful general mapping tool will emerge in time, but it’s going to take some doing.

Hierarchies and abstraction are a problem for relational DBs, but they’re core concepts in the OWL Semantic Web language (source)

Despite the challenges, having an object oriented interface to your database structures makes them a lot more accessible to other programmers, and helps reduce the upfront cost of picking up a new format and new software, so it has always been a part of the plan to implement such an interface for my project this summer. Fortunately the Data Cube vocabulary offers a much more constrained environment to work in than RDF in general, so creating an object interface is actually quite feasible.

Data Mapping the Data Cube

To begin with, I created a simple wrapper class for the generator module, using instance variables for various elements of the vocabulary, and a second class to represent individual observations.
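
In outline, the two classes look something like this (the attribute and method names are a sketch of the current shape rather than a stable API):

module ORM
  class Datacube
    attr_accessor :dimensions, :measures, :observations

    def initialize(options = {})
      @dimensions   = options[:dimensions] || []
      @measures     = options[:measures]   || []
      @observations = []
    end

    # observations come in as plain hashes of field => value
    def add_observation(data)
      @observations << Observation.new(data)
    end
  end

  class Observation
    attr_reader :data

    def initialize(data)
      @data = data
    end
  end
end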

Observations are added as simple hashes, as shown in the gem’s cucumber features.
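
Roughly like so (again, the exact step wording is illustrative):

Feature: Adding observations
  Scenario: Observations as hashes
    Given a Data Cube object with dimension "producer" and measure "chunkiness"
    When I add an observation with producer "hormel" and chunkiness "1"
    Then the cube should contain 1 observation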

With the basic structure completed, I hooked up the generator module, which can be accessed by calling the “to_n3” command on your Data Cube object.

Having an object based interface also makes it easier to run validations at different points. Currently you can specify the validate_each? option when you create your DataCube object, which if set to true will ensure that the fields in your observations match up with the measures and dimensions you’ve defined. If you don’t set this option, your data will be validated when you call the to_n3 method.
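
In use, that looks something like this (the constructor options follow the sketch above and may differ from the real interface):

cube = ORM::Datacube.new(
  :dimensions     => ["producer"],
  :measures       => ["chunkiness"],
  :validate_each? => true              # check each observation against the defined fields as it is added
)
cube.add_observation("producer" => "hormel", "chunkiness" => 1)
turtle = cube.to_n3                    # without the option, validation happens here instead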

You can also include metadata about your dataset such as author, subject, and publisher, which are now added during the to_n3 method.

The other side of an interface such as this is being able to construct an object from an RDF graph. Although my work in this area hasn’t progressed as far, it is possible to automatically create an instance of the object using the ORM::Datacube.load method, either on a turtle file or an endpoint URL:
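
For example (the file name and endpoint address are placeholders):

cube = ORM::Datacube.load("dataset.ttl")                    # from a turtle file
cube = ORM::Datacube.load("http://localhost:8080/sparql/")  # or from a SPARQL endpoint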

All of this is supported by a number of new modules that assist with querying, parsing, and analyzing datasets, and which were designed to be useful with or without the object model. There’s still a lot to be done on this part of the project; you should be able to delete dimensions as well as add them, and more validations and inference methods need to be implemented, but most of all there needs to be a more direct mapping of methods and objects to SPARQL queries. Not only would this conform better to the idea of an ORM pattern, but it would also allow large datasets to be handled much more easily, as loading every observation at once can take a long time and may run up against memory limits for some datasets.

Cubecumber

Alongside my work on Karl Broman’s visualizations, I have continued to add features and tests for my Data Cube generator. As a result of some conversations with other members of the development community, we decided to change the way missing values are handled, which entailed a fair amount of refactoring, so I needed to make sure a good set of tests was in place. I’ve built out the Rspec tests to cover some of these situations, but we were also asked to create some tests using Cucumber, a higher-level testing tool focused on Behavior Driven Development. I’ve always just used Rspec, since it’s favored by some of the people in Madison who helped get me into Ruby. Many people are not fond of the more magic-seeming way in which Cucumber works, but I thought it’d at least be good to learn the basics even if I didn’t use it much beyond that.

Sadly this is not from a bar catering primarily to Rubyists (source)

Turns out, I really like Cucumber. I may even grow to love it one day, although I don’t want to be too hasty with such pronouncements since our relationship is barely a week old. With Rspec, I like the simplicity of laying everything out in one file and organizing instance variables by context, but I still find myself frustrated by little things like trying to test slight variations on an object or procedure. I know Rspec allows for a lot of modularity if you know how to use it, but personally I just find it a little too clunky most of the time. Cucumber, by contrast, is all about reusability of methods and objects, and the interface it presents just seems to click a lot more with me.

Cucumber is built around describing behaviors, or “features”, of your application in a human readable manner, and tying this description to code that actually tests the features. You create a plaintext description, and provide a list of steps to take to achieve it. The magic comes in the step definitions, which are bound using regular expressions to allow the reuse of individual steps in different scenarios.

Here’s a simple scenario:
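
The step wording below is illustrative rather than copied verbatim from my features, but the shape is the same:

Feature: Data Cube generation
  In order to publish my results on the Semantic Web
  As a user of the gem
  I want csv files to be converted into valid Data Cube RDF

  Scenario: The output declares a dataset
    Given a cube generated from a csv file
    Then the output should contain a qb:DataSet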

Most of this is just descriptive information. The main thing to notice is the “Scenario”, which contains “Given” and “Then” keywords. These, in addition to the “When” keyword (which isn’t used in this example), are the primary building blocks of a Cucumber feature. They are backed up by step definitions in a separate file, written in Ruby.
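
Simplified versions of the two steps above might look like this (the csv_to_cube helper and the file path are stand-ins for whatever actually builds the cube):

Given(/^a cube generated from a csv file$/) do
  # build the cube and keep the turtle output around for later steps
  @turtle = csv_to_cube("spec/csv/bacon.csv")   # hypothetical helper
end

Then(/^the output should contain a qb:DataSet$/) do
  raise "no qb:DataSet found in output" unless @turtle.include?("qb:DataSet")
end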

Cucumber will pull in these step definitions and use them for the “Given” and “Then” calls in the main feature. So far, so good, but this really isn’t worth the added complexity over Rspec yet. If I had a more complicated feature such as this one:
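
Again the details are illustrative, but suppose each scenario builds a cube from a different kind of input:

Feature: Data Cube generation from different sources
  In order to publish my results on the Semantic Web
  As a user of the gem
  I want my data to be converted into valid Data Cube RDF

  Scenario: Conversion from a csv file
    Given a cube generated from a csv file
    Then the output should contain a qb:DataSet

  Scenario: Conversion from a plain ruby hash
    Given a cube generated from a ruby hash
    Then the output should contain a qb:DataSet

  Scenario: Conversion from an r/qtl cross
    Given a cube generated from an r/qtl cross
    Then the output should contain a qb:DataSet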

I could add “Given” definitions to cover the new Scenarios, but that would be a waste of the tools Cucumber offers (and the powers of Ruby). Instead, I can put parentheses around an element of the regular expression in the step definition to make it an argument for the step. Then with just a little Ruby cleverness I can reuse one step definition for all of these scenarios:
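
Something like this (the cube_from_* helpers it dispatches to are hypothetical):

# The captured group becomes a block argument, and a little string munging plus
# send() turns it into a method call, so one definition covers all of the Givens above.
Given(/^a cube generated from an? (.+)$/) do |source|
  helper = "cube_from_#{source.gsub(/[^a-z0-9]+/, '_')}"
  @turtle = send(helper)   # e.g. cube_from_csv_file, cube_from_ruby_hash, cube_from_r_qtl_cross
end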

Nothing too complicated here, but it always makes me smile when I can do something simply in Ruby that I’d only barely feel comfortable hacking together with reflection in Java even after an entire undergrad education in the language.

I’ve written some steps to test the presence of various Data Cube elements in the output of a generator (check the repository if you’re interested), and I’m also working on moving my Integrity Constraints tests over from Rspec. This will make it easy to test my code on your code, or random CSV files, so it will help tremendously with adding features and investigating edge cases.

Coded properties for proper semantics

It would be nice if all the data we are interested in working with could be accessed using SPARQL, but the reality is that most data is stored in some kind of tabular format, and may lack important structural information, or even data points, either of which could be a serious stumbling block when trying to represent it as a set of triples. In the case of missing data, the difficulty is that RDF contains no concept of null values. This isn’t simply an oversight; the semantic nature of the format requires it. The specific reasons for this are related to the formal logic and machine inference aspects of the Semantic Web, which haven’t been covered yet here. This post on the W3C’s semantic web mailing list provides a good explanation.

As a brief summary, consider what would happen if you had a predicate for “is married to”, and you programmed your (unduly provincial) reasoner to conclude that any resource that is the object of an “is married to” statement for a subject of type “Man” is of type “Woman” (and vice versa). In this situation, if you had an unmarried man and an unmarried woman in your dataset, and chose to represent this state by listing both as “is married to” a null object, say “rdf:null”, your reasoner would conclude that rdf:null was both a man and a woman. Assuming your un-cosmopolitan reasoner specifies “Man” and “Woman” as disjoint classes, you have created a contradiction and invalidated your ontology!

Paradoxes: not machine friendly

Since missing values are actually quite common in the context of QTL analysis, where missing information is often imputed or estimated from an incomplete dataset, I have been discussing the best way to proceed with my mentors. We have decided to use the “NA” string literal by default, and leave it up to the software accessing our data to decide how to handle the missing data in a given domain. This is specified in the code which converts raw values to resources or literals.
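
A simplified sketch of that conversion logic (the method name and details are illustrative, not the gem’s exact code):

# Turn a raw value into a turtle-ready literal, falling back to "NA" for missing data.
def to_literal(value)
  return '"NA"' if value.nil? || value.to_s.empty?
  value.is_a?(Numeric) ? value.to_s : "\"#{value}\""
end

to_literal(nil)    # => "\"NA\""
to_literal(9001)   # => "9001"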

The 1.1 version of the RDF standard also includes NaN as a valid numeric literal, so I am experimenting with this as a way of dealing with missing numeric values.

Missing structure is a somewhat larger problem. In some situations, such as the d3 visualizations Karl Broman has created, it is sufficient to simply store all data as literal values under a single namespace. For example, consider a (made up) database of economic indicators by country: you could say a given data point has a country value of “America”, a GDP of $49,000 per capita, an infant mortality rate of 4 per thousand, and so on, storing everything as a literal value. This is fine if you’re just working with your own data, but what if you want to be able to find more information about this “America” concept? The great thing about Semantic Web tools is that you can easily query another SPARQL endpoint for triples with an object value of “America”. But this “America” is simply a raw string; you could just as well be receiving information about the band “America”, the entire continent of North, South, or Central America, or something else entirely. Furthermore, RDF does not allow literals as subjects in triples, so you wouldn’t be able to make any statements about “America”. This is particularly problematic for the Data Cube format for a number of reasons, not the least of which is the requirement that all dimensions must have an rdfs:range concept that specifies the set of possible values.

The solution to this problem is to make “America” a resource inside a namespace. For example, if we were converting these data for the IMF, we could replace “America” (the string literal) with a URI in the imf.org namespace. We can now write statements about the resource, and ensure that there is no ambiguity between different Americas. This doesn’t quite get us all the way toward fully linked data, since it’s not clear yet how to specify that the “America” in the imf.org namespace is the same as the one in, say, the unfao.org namespace (for that you will need to employ OWL, a more complex Semantic Web technology outside the scope of this post), but it at least allows us to create a valid representation of our data. In the context of a Data Cube dataset, this can be automated through the use of coded properties, supported by the skos vocabulary, an ontology developed for categorization and classification. Using skos, I define a “concept scheme” for each coded dimension, and a set of “hasTopConcept” relations for each scheme.

Each concept gets its own resource, for which some rudimentary information is generated automatically.
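
As an illustration of the kind of structure involved, here is what a coded "producer" dimension might get, built with the rdf gem and made-up URIs rather than my generator’s code:

require 'rdf'

skos  = RDF::Vocabulary.new("http://www.w3.org/2004/02/skos/core#")
base  = "http://example.org/ns/"
graph = RDF::Graph.new

scheme = RDF::URI.new(base + "code/producer")
graph << [scheme, RDF.type, skos.ConceptScheme]

%w(hormel newskies whys).each do |code|
  concept = RDF::URI.new(base + "code/producer/" + code)
  graph << [concept, RDF.type, skos.Concept]
  graph << [concept, skos.prefLabel, code]        # rudimentary label taken from the raw value
  graph << [concept, skos.inScheme, scheme]
  graph << [scheme,  skos.hasTopConcept, concept]
end

puts graph.dump(:ntriples)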

Currently the generator only enumerates the concepts and creates the scheme, but these concepts provide a place to link to other datasets and define their semantics. As an example, if the codes were generated for countries, you could tell the generator to link to the DBpedia entries for each country. Additionally, I plan to create a “No Data” code for each concept set, now that we’ve had a discussion about the way to handle such values.

To see how this all comes together, I’ll go through an example using one of the most popular data representation schemes around: the trusty old CSV.

CSV to Data Cube

Most readers are probably already familiar with this format, but even if you aren’t, it’s about the simplest conceivable way of representing tabular data: values within a row are separated by commas, and rows are separated by new lines. The first row typically holds the labels for the columns. Although very little in the way of meaning is embedded in this representation, it can still be translated to a valid Data Cube representation, and more detailed semantics can be added through a combination of automated processing and user input.

The current CSV generator is fairly basic; you can provide it with an array of dimensions, coded dimensions, and measures using the options hash, point it at a file, and it will create Data Cube formatted output. There is only one extra option at the moment, which allows you to specify a column (by number) to use to generate labels for your output. See below for an example of how to use it.

Here is a simple file I’ve been using for my rspec tests (with spaces added for readability).

producer, pricerange, chunkiness, deliciousness
hormel,     low,        1,           1
newskies,  medium,      6,           9
whys,      nonexistant, 9001,        6

This can be fed into the generator using a call such as the following:
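
Along the lines of the following (the class name and option keys here are placeholders for the gem’s actual interface):

generator = CSVGenerator.new
turtle = generator.generate(
  "bacon.csv",                                 # the file shown above
  :dimensions   => ["producer", "pricerange"], # remaining columns become measures
  :label_column => 0                           # generate labels from the first column
)
puts turtle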

And that’s all there is to it! Any columns not specified as dimensions will be treated as measures, and if you provide no configuration information at all it will default to using the first column as the dimension. At the moment, all dimension properties for CSV files are assumed to be coded lists (as discussed above), but this will change as I add support for other dimension types and SDMX concepts. If you’d like to see what the output looks like, you can find an example gist on GitHub.

As the earlier portion of this post explains, the generator will create a concept scheme based on the range of possible values for the dimension/s, and define the rdfs:range of the dimension/s as this concept scheme. Aside from producing a valid Data Cube representation (which requires all dimensions to have a range), this also creates a platform to add more information about coding schemes and individual concepts, either manually or, in the near future, with some help from the generation algorithm.

I’ll cover just how you might go about linking up more information to your dataset in a future post, but if you’d like a preview, have a look at the excellent work Sarven Capadisli has done in converting and providing endpoints for data from the UN and other international organizations.

An overview of Sarven’s data linkages (source)

The country codes for these datasets are all linked to further data on their concepts using the skos vocabulary, for example, see this page with data about country code CA (Canada). This practice of linking together different datasets is a critical part of the Semantic Web, and will be an important direction for future work this summer.

Visualizations and Validations

Earlier this week I met with Karl Broman, a biostatistician at UW Madison who created the r/qtl library I’ve been working with in the last few weeks, about another project of his that could benefit from some Semantic Web backing. Karl has created a number of interesting visualizations of bioinformatics data. His graphs make use of the d3 JavaScript framework to display high-dimensional data in an interactive and intuitive way. The focus on dimensional data, as well as the fact that most of his datasets exist as R objects, fits naturally with the Data Cube generators I’ve created already.

A screenshot of the cis/trans plot. Make sure you check out the real, interactive version.

To begin with, I will be converting Karl’s cis/trans eQTL plot to pull its data from a triple store dynamically. Currently there are two scripts that process an r/qtl cross and a few supporting dataframes to create a set of static JSON files, which are then loaded into the graph. Using a triple store to hold the underlying data, however, the values required by the visualization can be accessed dynamically based on the structure of the original R objects. As of now I have successfully converted each of the necessary datasets to RDF, and am working on generating queries that Karl’s d3 code can use to access it through a 4store SPARQL endpoint (which supports JSON output).

The objects involved are quite large, and the Data Cube vocabulary (really RDF in general) is fairly verbose in its representation of information, so I am working on loading what I have into the right databases and reducing redundancy in the output. However, if you’d like some idea of how the data are being represented and accessed, I’ve set up a demo on Dydra with a subset of the data and some example queries.

Testing and Validation

In addition to working with Karl, I’ve taken time to refactor my code toward creating Data Cube RDF for more general structures. Originally the main module worked off of an Rserve object, but I’ve redone everything to use plain Ruby objects, which the generator classes are responsible for creating. To support this refactoring, and the creation of new generators for data types such as CSV files, I’ve begun using Rspec to build the spec for my project. I’ve added tests against reference output and syntactical correctness, but these are respectively too brittle and too permissive to ensure novel datasets will generate valid output. To address this, I have implemented a number of the official Data Cube Integrity Constraints as part of the spec. The ICs are a set of SPARQL queries that can be run on your RDF output to ensure various high level features are present, and go beyond simple syntax validity in ensuring you have properly encoded your data. I’ve had to make a few modifications, since the ICs are slightly out of date, and some of the SPARQL 1.1 facilities they make use of aren’t fully supported by the RDF.rb SPARQL gem. Aside from their place in the set of tests, the ICs could also be useful as part of the main code, providing a way for the end user to ensure that their data is compatible with Data Cube tools like Cubeviz.
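
As a rough illustration of how one of these checks can be run (the query is a simplified stand-in for the official IC, written to avoid the unsupported SPARQL 1.1 features; the file name is a placeholder):

require 'rdf'
require 'rdf/turtle'
require 'sparql'

graph = RDF::Graph.load("cube_output.ttl")

# Roughly the first integrity constraint: every qb:Observation should point to a qb:dataSet.
query = <<-SPARQL
  PREFIX qb: <http://purl.org/linked-data/cube#>
  SELECT ?obs WHERE {
    ?obs a qb:Observation .
    OPTIONAL { ?obs qb:dataSet ?ds }
    FILTER(!bound(?ds))
  }
SPARQL

violations = SPARQL.execute(query, graph)
puts violations.empty? ? "IC passed" : "#{violations.count} observations lack a qb:dataSet"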