Objects for all

The code I’ve written for generating Data Cube RDF is mostly in the form of methods stored in modules, which need to be included in a class before they are called. Although these will work just fine for converting your data, those who are not already familiar with the model may wish for a more “friendly” interface built from familiar Ruby objects and idioms. And though you can access your datasets with a native query language that gives you far more flexibility than a simple Ruby interface ever could, you may not want to pay the upfront cost of learning a whole new language (even one as lovely as SPARQL) just to get started with the tool.

This is not a problem unique to my project this summer; it occurs quite often when object oriented languages are used to interact with some sort of database. Naturally, a lot of work, both theoretical and practical, has been done to make the connection between the language and the database as seamless as possible. In Ruby, and especially Rails, a technique called Object Relational Mapping is frequently used to create interfaces that allow Ruby objects to stand in for database concepts such as tables and columns.

The ORM translates between program objects and the database (source)

ORM and design patterns derived from it are common features of any web developer’s life today. Active Record, a core Rails gem implementing the design pattern of the same name, has been an important (if somewhat maligned) part of the framework’s success and appeal to new developers. Using Active Record and similar libraries, you can leverage the power of SQL and dedicated database software without needing to learn a suite of new languages.

ORM for the Semantic Web?

There has been work done applying this pattern to the Semantic Web in the form of the ActiveRDF gem, although it hasn’t been updated in quite some time. Maybe it will be picked up again one day, but the reality is that impedance mismatch, where the fundamental differences between object oriented and relational database concepts create ambiguity and information loss, poses a serious problem for any attempt to build such a tool. Still, one of the root causes of this problem, that the schemas of relational databases are far more constrained than those of OO objects, is somewhat mitigated for the Semantic Web, since RDF representations can often be less constraining than typical OO structures. So there is hope that a useful general mapping tool will emerge in time, but it’s going to take some doing.

Hierarchies and abstraction are a problem for relational DBs, but they’re core concepts in the OWL Semantic Web language (source)

Despite the challenges, having an object oriented interface to your database structures makes them a lot more accessible to other programmers, and helps reduce the upfront cost of picking up a new format and new software, so it has always been a part of the plan to implement such an interface for my project this summer. Fortunately the Data Cube vocabulary offers a much more constrained environment to work in than RDF in general, so creating an object interface is actually quite feasible.

Data Mapping the Data Cube

To begin with, I created a simple wrapper class for the generator module, using instance variables for various elements of the vocabulary, and a second class to represent individual observations.

Observations are added as simple hashes, as demonstrated by one of the project’s cucumber features (more on Cucumber below).

With the basic structure completed, I hooked up the generator module, which can be accessed by calling the “to_n3” command on your Data Cube object.

Having an object based interface also makes it easier to run validations at different points. Currently you can specify the “validate_each?” option when you create your DataCube object, which, if set to true, will ensure that the fields in your observations match up with the measures and dimensions you’ve defined. If you don’t set this option, your data will be validated when you call the to_n3 method.

You can also include metadata about your dataset such as author, subject, and publisher, which are now added during the to_n3 method.
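
Putting these pieces together, the intended workflow looks roughly like the following sketch. The class, method, and option names are approximations of the real interface (check the repository for the exact names), but they follow the description above:

# A sketch of the object interface described above; names are illustrative
cube = ORM::Datacube.new(
  name:          "cheese_survey",
  dimensions:    ["producer"],
  measures:      ["chunkiness", "deliciousness"],
  validate_each: true               # check each observation as it is added
)

# Observations are plain hashes keyed by dimension and measure names
cube.add_observation("producer" => "hormel", "chunkiness" => 1, "deliciousness" => 1)

# Optional dataset metadata, written out when to_n3 is called
cube.add_metadata(author: "example author", subject: "cheese survey")

# Generate the Data Cube RDF; validation happens here if validate_each was not set
puts cube.to_n3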

The other side of an interface such as this is being able to construct an object from an RDF graph. Although my work in this area hasn’t progressed as far, it is possible to automatically create an instance of the object using the ORM::Datacube.load method, either on a turtle file or an endpoint URL:
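
(A sketch; the exact call signature may well differ from what is shown here.)

# Build a DataCube object from existing RDF
cube = ORM::Datacube.load "cheese_survey.ttl"              # from a Turtle file

cube = ORM::Datacube.load "http://localhost:8080/sparql/"  # or from a SPARQL endpoint URL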

All of this is supported by a number of new modules that assist with querying, parsing, and analyzing datasets, and which were designed to be useful with or without the object model. There’s still a lot to be done on this part of the project: you should be able to delete dimensions as well as add them, and more validations and inference methods need to be implemented, but most of all there needs to be a more direct mapping of methods and objects to SPARQL queries. Not only would this conform better to the idea of an ORM pattern, it would also allow large datasets to be handled much more easily, as loading every observation at once can take a long time and may run up against memory limits for some datasets.


Cubecumber

Alongside my work on Karl Broman’s visualizations, I have continued to add features and tests for my Data Cube generator. As a result of some conversations with other members of the development community, we decided to change the way missing values are handled, which entailed a fair amount of refactoring, so I needed to make sure a good set of tests was in place. I’ve built out the Rspec tests to cover some of these situations, but we were also asked to create some tests using Cucumber, a higher-level, Behavior Driven Development focused testing tool. I’ve always just used Rspec, since it’s favored by some of the people in Madison who helped get me into Ruby. Many people are not fond of the more magic-seeming way in which Cucumber works, but I thought it’d at least be good to learn the basics even if I didn’t use it much beyond that.

Sadly this is not from a bar catering primarily to Rubyists (source)

Turns out, I really like Cucumber. I may even grow to love it one day, although I don’t want to be too hasty with such pronouncements since our relationship is barely a week old. With Rspec, I like the simplicity of laying everything out in one file and organizing instance variables by context, but I still find myself frustrated by little things like trying to test slight variations on an object or procedure. I know Rspec allows for a lot of modularity if you know how to use it, but personally I just find it a little too clunky most of the time. Cucumber, by contrast, is all about reusability of methods and objects, and the interface it presents just seems to click a lot more with me.

Cucumber is built around describing behaviors, or “features”, of your application in a human readable manner, and tying this description to code that actually tests the features. You create a plaintext description, and provide a list of steps to take to achieve it. The magic comes in the step definitions, which are bound using regular expressions to allow the reuse of individual steps in different scenarios.

Here’s a simple scenario:

Most of this is just descriptive information. The main thing to notice is the “Scenario”, which contains “Given” and “Then” keywords. These, in addition to the “When” keyword (which isn’t used in this example), are the primary building blocks of a Cucumber feature. They are backed up by step definitions in a separate file, written in Ruby.

Cucumber will pull in these step definitions and use them for the “Given” and “Then” calls in the main feature. So far, so good, but this really isn’t worth the added complexity over Rspec yet. If I had a more complicated feature such as this one:

I could add “Given” definitions to cover the new Scenarios, but that would be a waste of the tools Cucumber offers (and the powers of Ruby). Instead, I can put parentheses around an element of the regular expression in the step definition to make it an argument for the step. Then with just a little Ruby cleverness I can reuse one step definition for all of these scenarios:
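
(The snippet below is a hand-written illustration rather than the repository’s actual steps; the step text and instance variables are invented, but the capture group mechanism is the point.)

# features/step_definitions/generator_steps.rb (illustrative names)
#
# Matches steps such as:
#   Given I treat the "producer" column as a dimension
#   Given I treat the "pricerange" column as a dimension

Given(/^I treat the "(.*?)" column as a dimension$/) do |column|
  # The parenthesized part of the regexp arrives as a block argument,
  # so one definition serves every scenario that differs only by column name
  @dimensions ||= []
  @dimensions << column
end

Then(/^the output should mention "(.*?)"$/) do |term|
  expect(@turtle).to include(term)  # assumes an earlier step stored generator output in @turtle
end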

Nothing too complicated here, but it always makes me smile when I can do something simply in Ruby that I’d only barely feel comfortable hacking together with reflection in Java even after an entire undergrad education in the language.

I’ve written some steps to test for the presence of various Data Cube elements in the output of a generator (check the repository if you’re interested), and I’m also working on moving my Integrity Constraints tests over from Rspec. This will make it easy to test my code on your data, or on random csv files, which will help tremendously with adding features and investigating edge cases.

Coded properties for proper semantics

While it would be nice if all the data we are interested in working with could be accessed using SPARQL, the reality is that most data is stored in some kind of tabular format, and may lack important structural information, or even data points, either of which can be a serious stumbling block when trying to represent it as a set of triples. In the case of missing data, the difficulty is that RDF has no concept of null values. This isn’t simply an oversight; the Semantic nature of the format requires it. The specific reasons for this are related to the formal logic and machine inference aspects of the Semantic Web, which haven’t been covered yet here. This post on the W3C’s semantic web mailing list provides a good explanation.

As a brief summary, consider what would happen if you had a predicate for “is married to”, and you programmed your (unduly provincial) reasoner to conclude that any resource that is the object of an “is married to” statement for a subject of type “Man” is of type “Woman” (and vice versa). If you had an unmarried man and an unmarried woman in your dataset, and chose to represent this state by listing both as “is married to” a null object, say “rdf:null”, your reasoner would conclude that rdf:null was both a man and a woman. Assuming your un-cosmopolitan reasoner specifies “Man” and “Woman” as disjoint classes, you have created a contradiction and invalidated your ontology!

Paradoxes: not machine friendly

Since missing values are actually quite common in the context of qtl analysis, where missing information is often imputed or estimated from an incomplete dataset, I have been discussing the best way to proceed with my mentors. We have decided to use the “NA” string literal by default, and leave it up to the software accessing our data to decide how to handle the missing data in a given domain. This is specified in the code which converts raw values to resources or literals:
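
(In outline, and not the project’s literal code; the method name and namespace below are placeholders.)

# Convert a raw cell value into a string usable in an N3/Turtle statement.
# Missing values arrive as nil or "NA" and are emitted as the literal "NA",
# leaving it to downstream software to decide what "missing" means in its domain.
def raw_to_term(value, base = "http://www.rqtl.org/ns/#")
  case value
  when nil, "NA"
    '"NA"'                                         # missing data: plain "NA" literal
  when Numeric
    value.to_s                                     # numbers become numeric literals
  else
    "<#{base}#{value.to_s.gsub(/\s+/, '_')}>"      # other strings become resources
  end
end

raw_to_term(0.457)  # => "0.457"
raw_to_term("NA")   # => "\"NA\""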

The 1.1 version of the RDF standard also includes NaN as a valid numeric literal, so I am experimenting with this as a way of dealing with missing numeric values. Missing structure is a somewhat larger problem. In some situations, such as the d3 visualizations Karl Broman has created, it is sufficient to simply store all data as literal values under a single namespace. For example, consider a (made up) database of economic indicators by country: you could say a given data point has a country value of “America”, a GDP of $49,000 per capita, an infant mortality rate of 4 per thousand, and so on, storing everything as a literal value. This is fine if you’re just working with your own data, but what if you want to be able to find more information about this “America” concept? The great thing about Semantic Web tools is that you can easily query another SPARQL endpoint for triples with an object value of “America”. But this “America” is simply a raw string; you could just as well be receiving information about the band “America”, the entire continent of North, South, or Central America, or something else entirely. Furthermore, RDF does not allow literals as subjects in triples, so you wouldn’t be able to make any statements about “America”. This is particularly problematic for the Data Cube format for a number of reasons, not the least of which is the requirement that all dimensions must have an rdfs:range concept that specifies the set of possible values.

The solution to this problem is to make “America” a resource inside a namespace. For example, if we were converting these data for the IMF, we could replace “America” (the string literal) with a URI in an imf.org namespace. We can now write statements about the resource, and ensure that there is no ambiguity between different Americas. This doesn’t quite get us all the way to fully linked data, since it’s not clear yet how to specify that the “America” in the imf.org namespace is the same as the one in, say, the unfao.org namespace (for that you will need to employ OWL, a more complex Semantic Web technology outside the scope of this post), but it at least allows us to create a valid representation of our data.

In the context of a Data Cube dataset, this can be automated through the use of coded properties, supported by the skos vocabulary, an ontology developed for categorization and classification. Using skos, I define a “concept scheme” for each coded dimension, and a set of “hasTopConcept” relations for each scheme.

Each concept gets its own resource, for which some rudimentary information is generated automatically

Currently the generator only enumerates the concepts and creates the scheme, but these concept resources provide a place to link to other datasets and define their semantics. As an example, if the codes were generated for countries, you could tell the generator to link to the dbpedia entries for each country. Additionally, I plan to create a “No Data” code for each concept set, now that we’ve had a discussion about the way to handle such values.
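
To give a concrete feel for this, here is roughly the shape of the Turtle generated for a coded “country” dimension (hand-written for this post, so the prefixes and URIs are placeholders rather than exact output):

# Approximate shape of a generated coding scheme, held here as a Turtle string
country_scheme = <<~TURTLE
  cs:country a skos:ConceptScheme ;
    rdfs:label "Concept scheme for the country dimension" ;
    skos:hasTopConcept :america , :canada .

  :america a skos:Concept ;
    rdfs:label "America" ;
    skos:inScheme cs:country .
TURTLE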

To see how this all comes together, I’ll go through an example using one of the most popular data representation schemes around: the trusty old CSV.

CSV to Data Cube

Most readers are probably already familiar with this format, but even if you aren’t, it’s about the simplest conceivable way of representing tabular data: columns are separated by commas, and rows by new lines. The first row typically holds the labels for the columns. Although very little in the way of meaning is embedded in this representation, it can still be translated to a valid Data Cube representation, and more detailed semantics can be added through a combination of automated processing and user input.

The current CSV generator is fairly basic; you can provide it with an array of dimensions, coded dimensions, and measures using the options hash, point it at a file, and it will create Data Cube formatted output. There is only one extra option at the moment, which allows you to specify a column (by number) to use to generate labels for your output. See below for an example of how to use it.

Here is a simple file I’ve been using for my rspec tests (with spaces added for readability).

producer, pricerange, chunkiness, deliciousness
hormel,     low,        1,           1
newskies,  medium,      6,           9
whys,      nonexistant, 9001,        6

This can be fed into the generator using a call such as the following:
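
(A sketch of such a call; the class name and option keys are approximations of the interface described above.)

# Point the generator at the file, name the dimension columns, and ask for N3.
# Columns not listed as dimensions are treated as measures, per the defaults below.
generator = CSVToDataCube.new(
  "cheese.csv",
  dimensions:   ["producer", "pricerange"],
  label_column: 0                   # optional: use the first column to label observations
)
puts generator.to_n3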

And that’s all there is to it! Any columns not specified as dimensions will be treated as measures, and if you provide no configuration information at all it will default to using the first column as the dimension. At the moment, all dimension properties for CSV files are assumed to be coded lists (as discussed above), but this will change as I add support for other dimension types and sdmx concepts. If you’d like to see what the output looks like, you can find an example gist on Github.

As the earlier portion of this post explains, the generator will create a concept scheme based on the range of possible values for the dimension/s, and define the rdfs:range of the dimension/s as this concept scheme. Aside from producing a valid Data Cube representation (which requires all dimensions to have a range), this also creates a platform to add more information about coding schemes and individual concepts, either manually or, in the near future, with some help from the generation algorithm.

I’ll cover just how you might go about linking up more information to your dataset in a future post, but if you’d like a preview, have a look at the excellent work Sarven Capadisli has done in converting and providing endpoints for data from the UN and other international organizations.

An overview of Sarven’s data linkages (source)

The country codes for these datasets are all linked to further data on their concepts using the skos vocabulary, for example, see this page with data about country code CA (Canada). This practice of linking together different datasets is a critical part of the Semantic Web, and will be an important direction for future work this summer.

Visualizations and Validations

Earlier this week I met with Karl Broman, a biostatistician at UW Madison who created the r/qtl library I’ve been working with in the last few weeks, about another project of his that could benefit from some Semantic Web backing. Karl has created a number of interesting visualizations of bioinformatics data. His graphs make use of the d3 javascript framework to display high-dimensional data in an interactive and intuitive way. The focus on dimensional data, as well as the fact that most of his datasets exist as R objects, fits naturally with the Data Cube generators I’ve created already.

A screenshot of the cis/trans plot. Make sure you check out the real, interactive version.

To begin with, I will be converting Karl’s cis/trans eQTL plot to pull its data from a triple store dynamically. Currently there are two scripts that process an r/qtl cross and a few supporting dataframes to create a set of static JSON files, which are then loaded into the graph. Using a triple store to hold the underlying data, however, the values required by the visualization can be accessed dynamically based on the structure of the original R objects. As of now I have successfully converted each of the necessary datasets to RDF, and am working on generating queries that Karl’s d3 code can use to access it through a 4store SPARQL endpoint (which supports JSON output).

The objects involved are quite large, and the Data Cube vocabulary (really RDF in general) is fairly verbose in its representation of information, so I am working on loading what I have into the right databases and reducing redundancy in the output. However, if you’d like some idea of how the data are being represented and accessed, I’ve set up a demo on Dydra with a subset of the data and some example queries.

Testing and Validation

In addition to working with Karl, I’ve taken time to refactor my code toward creating Data Cube RDF for more general structures. Originally the main module worked off of an Rserve object, but I’ve redone everything to use plain Ruby objects, which the generator classes are responsible for creating. To support this refactoring, and the creation of new generators for data types such as CSV files, I’ve begun using Rspec to build the spec for my project. I’ve added tests against reference output and syntactical correctness, but these are respectively too brittle and too permissive to ensure novel data sets will generate valid output. To this end, I have implemented a number of the official Data Cube Integrity Constraints as part of the spec. The ICs are a set of SPARQL queries that can be run on your RDF output to ensure various high level features are present, and go beyond simple syntax validity in ensuring you have properly encoded your data. I’ve had to make a few modifications, since the ICs are slightly out of date, and some of the SPARQL 1.1 facilities they make use of aren’t fully supported by the RDF.rb SPARQL gem. Aside from their place in the test suite, the ICs could also be useful as part of the main code, providing a way for the end user to ensure that their data is compatible with Data Cube tools like Cubeviz.
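
To give an idea of what running one of these checks looks like with the RDF.rb tooling, here is a small sketch. The constraint shown is IC-1 from the spec, lightly abridged; as noted above, some of the constraints need rewriting before the SPARQL gem will accept them.

require 'rdf'
require 'rdf/turtle'
require 'sparql'

graph = RDF::Graph.load("scanone_output.ttl")  # any Data Cube Turtle file

# IC-1: every qb:Observation has exactly one qb:DataSet.
# The spec's ASK queries return true when a constraint is violated.
ic1 = <<~SPARQL
  PREFIX qb: <http://purl.org/linked-data/cube#>
  ASK {
    { ?obs a qb:Observation . FILTER NOT EXISTS { ?obs qb:dataSet ?ds } }
    UNION
    { ?obs a qb:Observation ; qb:dataSet ?ds1, ?ds2 . FILTER (?ds1 != ?ds2) }
  }
SPARQL

violated = SPARQL.execute(ic1, graph)
puts "IC-1 (unique DataSet) violated? #{violated}"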

Sparkle Cubes

So you have your QTL analysis, GDP data, or Bathing Water information represented as a Data Cube. You can load your data into a triple store, make some pretty graphs in Cubeviz, and you or anyone else can get a pretty good idea of what it looks like by reading the native n3 formatted encoding. Neato. But you’re wondering how this is really any better than the other formats you’re already familiar with; sure it’s easier to load and share than the data in a relational database, but there are plenty of tools to help with that around already. Perhaps the more relevant comparison is to flat file formats such as CSV, since it’s still the de-facto way of sharing bioinformatics data. Why bother learning a new format that is not yet widely used? The most important reason, the “Semantic” part of The Semantic Web, will be the subject of another post, but today I’d like to write a little about another important technology, which you can already use to take control of your Cube formatted data and really make it shine (sorry): SPARQL.


SPARQL, an example of everyone’s favorite internet neologism, the recursive acronym, stands for SPARQL Protocol and RDF Query Language. As its name suggests, its main function is querying RDF stores. Its general shape should look somewhat familiar to SQL users, but it is designed to create queries based on the “Subject Predicate Object” format native to RDF. Instead of simply listing the elements of a SPARQL query, let’s go through an example (from Wikipedia):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}

In brief, this query will return the names and emails of everyone in the database, which we assume contains records specified according to a particular vocabulary. If you’d like to know more details, read on, otherwise skip to the next section to see how we’ll apply this to our Data Cube data.

The first thing you’ll see is a PREFIX definition, which allows you to specify which vocabulary a resource is defined under. Using the prefix is just a shortcut to save space; you could replace every instance of “foaf:” with “http://xmlns.com/foaf/0.1/” and have an equivalent query. The foaf (friend of a friend) vocabulary is one of the oldest Semantic Web vocabularies. It is used to define data about people and social networks, such as their names, emails, and connections with each other. If you’d like to know more, all you have to do is browse to the URL, and you can find a detailed, human readable specification for the vocabulary. This is one nice convention in the Semantic Web community; when you browse to a URI for a vocabulary, you will frequently be redirected to a human readable version of it. This makes it easy to learn about and use new vocabularies, and to share ones you develop with others.

Next comes the SELECT line. This is one area that will look particularly familiar to SQL users, although more complex queries may not be. In this case, all we’re saying is we want to grab the parts of the data specified by “?name” and “?email” in the next part of the query. In SPARQL, tokens beginning with “?” are considered variables, so they could be named anything, but as with other languages its good practice to name them based on what they represent.

Last is the WHERE block, which usually makes up the bulk of the query. Here you can see three conditions specified in Subject Predicate Object form. If you’ve been reading along in previous posts, you may be able to understand their meanings, but even if not it’s fairly comprehensible. We’re looking for an object which is a foaf:Person, which has a foaf:name and a foaf:mbox. Although there are shortcuts which can make queries less verbose than this, the WHERE block is essentially just a list of RDF statements which you want to be true for all the data you are selecting.

Once the WHERE block returns the objects it specifies, the SELECT block picks out the portions the user has asked for, in this case the name and email, and returns them.

SPARQL and Data Cube

So now you know the basic structure of a SPARQL query, but how is it useful for the data we created in previous posts? In a multitude of ways, as it turns out. We’ll be using the following prefixes in the example queries. Note that if you were to run these queries yourself, you would need to include the prefixes at the beginning of every query, but in the interest of brevity I’ll be omitting them for the rest of the post.

PREFIX :     <http://www.rqtl.org/ns/#> 
PREFIX qb:   <http://purl.org/linked-data/cube#> 
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX prop: <http://www.rqtl.org/dc/properties/> 
PREFIX cs:   <http://www.rqtl.org/dc/cs/>

Just using relatively familiar syntax, we can do things like select all of the data on one chromosome:

SELECT ?entry
WHERE {
  ?entry prop:chr 10.0
}

Which returns a list of observation URIs

entry
http://www.rqtl.org/ns/#obsD10M298
http://www.rqtl.org/ns/#obsD10M294
http://www.rqtl.org/ns/#obsD10M42_
http://www.rqtl.org/ns/#obsD10M10
http://www.rqtl.org/ns/#obsD10M233

These could be used as the subjects of further queries, but while the observation naming scheme I chose gives you a reasonable idea of what the resources represent, you don’t have any guarantee that observation URIs will be human readable. One way around this would be to query the rdfs:label predicate of each observation, but if you already know which identifying properties you’re interested in, you could run a query such as the following to select them:

SELECT DISTINCT ?chr ?pos ?lod
WHERE {
  ?entry prop:chr 10.0;
         prop:chr ?chr;
         prop:pos ?pos;
         prop:lod ?lod.
}

Which yields

chr pos lod
10 40.70983 0.536721
10 0 0.08268
10 24.74745 0.759688
10 61.05621 0.254428
10 48.73004 0.584946

You may have noticed the semicolons and slightly different shape of the last query. SPARQL includes a few helpers to make your queries less verbose, in this case telling the parser that each statement separated by a semicolon, as opposed to a period, has the same subject.

While you could simply use SPARQL as a means of accessing your RDF back-end, slicing out the data you need and working on it in R or some other dedicated tool, you can also use it alone for many basic analysis tasks. As an example, here’s a query that uses a few keywords we haven’t seen before to select entries with a high LOD score, sort them in descending order, and give them human readable names:

SELECT DISTINCT ?name ?lod
WHERE {
  ?entry prop:lod ?lod.
  ?entry prop:refRow ?row.
  ?row rdfs:label ?name.
  FILTER(?lod > 4)
} ORDER BY DESC(?lod)

Yielding

name lod
D5M357 6.373633
D5M83 6.044958
D5M91 5.839359
D13M147 5.819851
D5M205 5.728438
D5M257 5.592882
D5M307 5.352222
D5M338 4.805622
D13M106 4.62314
D13M290 4.511684
D13M99 4.408392

Some of the predicates involved may be a little opaque, but most of the keywords (capitalized as a matter of convention) are pretty descriptive of their function. There’s a lot more depth to SPARQL than is on display here, but nonetheless we are performing the sorts of queries an actual researcher would, without having to learn anything too complex or engage in any unpleasant contortions to only grab the data we want. The latest SPARQL standard (v 1.1) includes support for many more specific graph search patterns as well as a facility for updating your data, but everything you’ve seen in this post should work just fine with any SPARQL endpoint available. 

This cat’s name is sparql. She is, alas, neither a query language nor a nascent web standard (credit: danja, http://www.flickr.com/photos/danja/236712101/)

If your eyes glazed over through the example and you’re only paying attention now because of the unexpected cat picture, the key point to remember is that we can use these same techniques for any sort of data set in the Data Cube format, be it genetics, finance, or public health. We could select a subset of the information that we can import into our local data store and visualize using tools like Cubeviz, or we can use query patterns to pick out just the information that interests us. Future blog posts will talk about some of the more complicated operations you can perform, and how the language makes it easy to bring together information from multiple sources, but I hope this sample gives you an idea of the usefulness of SPARQL, and why you’d want your data mapped to the Data Cube format. This post, focusing more on the fundamental mechanics of querying Data Cube encoded information, barely touches on the “Semantic” aspect of The Semantic Web; while we do have some meaningful information about what’s a dimension, a measure, and so on, a lot of what makes RDF related technologies powerful is missing. I will soon be adding context specific semantics compliant with the Qtab format, so any other software which understands the format can automatically integrate information from Data Cube resources. Once this process is finished, I will begin creating tools to map general Ruby objects into this format, and help end users decide which types of semantic information they want to include.

If you want to try the queries out for yourself, or see how slight modifications might work, you can find a SPARQL endpoint for the data set I’ve been using here. Unfortunately results will be returned as xml, which is not very easy (for humans) to read, so if you’re interested in trying out your new knowledge in a more friendly setting, you may want to try Dbpedia, a project to convert information from Wikipedia to RDF, which has a SPARQL endpoint.

Frame to Cube

I’d like to talk a little more about the mapping between R Data Frames and the RDF format. This isn’t a very natural union; information in a Data Frame has a simple and well-defined structure, while RDF is a sequence of statements which need not be in any particular order. This is by design of course; it is what allows the flexibility and generality of RDF. The Data Cube vocabulary, however, forms a bridge between these two very different ways of representing information. Data Cube was developed by a group of statistics and data science experts commissioned by the UK government to develop a vocabulary for representing multi-dimensional data. It can be used with tabular data, but in keeping with the flexible nature of semantic web technologies, it can also represent nearly any kind of data which can be broken down by dimension, and includes facilities for attaching semantic context to data sets and extending itself to accommodate additional complexities of units, measure types, and other common features of data.

The OLAP cube, a common data structure in corporate settings, uses the same basic model as the Data Cube vocabulary, but does not include any semantic information.

A Dataframe, by contrast, is a fairly simple structure: a set of lists of equal length. Although some incidental complexity can hide under this description, in our sample use case, r/qtl, they can be thought of simply as tables with labeled rows. While the Data Cube vocabulary is capable of handling much more complicated structures, it is also well suited to representing simple objects such as Dataframes.

In order to explore the relationship between these two data structures, we will look at a small data set representing the results of an r/qtl analysis session. These particular data are excerpted from running a marker regression on the Listeria dataset included in r/qtl.

 

         chr    pos     lod
D10M44     1    0.0   0.457
D15M68    15   23.9   3.066

The actual Dataframe has about 130 rows, but this short excerpt will suffice to show how the mapping works. A future post may touch briefly on the meaning of these entries in the context of QTL analysis. For now, just think of it as you would any other table, although note that the rows are labeled.

We will have to make some assumptions for our mapping, since R doesn’t include any information about the meaning (semantics) of its data. We assume that each column of the Dataframe is a property we’re interested in measuring, and that each row of the table specifies a category or dimension to place our measurements in. If you’re familiar with the structure of our example Dataframe, an r/qtl scanone result, you may notice that this mapping isn’t entirely faithful to what the data represent; the “chr” and “pos” columns probably better specify dimensions, and since the row names contain no extra information, they should probably just be labels for the data points. Unfortunately there’s no way to fully automate this process; there simply isn’t enough information in the Dataframe to unambiguously determine our mapping. I am working on some tools to allow end users to specify this information, which could be used to build up a library of mappings for different R classes, but for now it suffices to develop a technically valid Data Cube representation of the information.

Let’s break down the mapping by individual data cube elements, and see what they correspond to in R.

Prefixes

Prefixes are standard in a number of RDF languages, including Turtle, which I will use for these examples. Prefixes simply specify that certain tokens, when followed by a “:” (colon), should be replaced by the appropriate URI. Although any Turtle file could be written without them, they are essential to making your RDF comprehensible to humans as well as machines. The prefixes used for this data set are the same ones listed in the SPARQL section above: qb, rdf, and rdfs, plus the various rqtl.org namespaces.

While some of these prefixes are part of standard RDF vocabularies, anything under rqtl.org should be considered a placeholder for the time being, as no vocabulary definitions actually exist at the given address.

Data Structure Definition

One of the highest level elements of a Data Cube resource is the Data Structure Definition. It provides a reusable definition that Datasets can operate under, specifying dimensions, measures, attributes, and extra information such as ordering and importance.

In our basic implementation, this is little more than a list of component specifications for the dimension properties. As the program develops, a lot of the additional semantic detail will wind up here.

Note that the Data Structure Definition resource has been named after the variable used to generate it, which I called “mr” (marker regression), so the resource’s name is “dsd-mr”.
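
Written out as Turtle, the generated definition is shaped roughly like this (an approximation rather than verbatim output; the component resource names in particular are guesses):

dsd = <<~TURTLE
  :dsd-mr a qb:DataStructureDefinition ;
    qb:component :component-refRow , :component-chr , :component-pos , :component-lod .
TURTLE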

Data Set

Although it can contain some additional information about how to interpret the data it applies to, a DataSet’s main job is to attach a series of observations to a Data Structure Definition.

This is currently implemented as a mostly static string, with the dataset labeled (by default) based on the variable used to generate it.
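
Roughly, in Turtle form (again an approximation):

dataset = <<~TURTLE
  :dataset-mr a qb:DataSet ;
    rdfs:label "mr" ;
    qb:structure :dsd-mr .
TURTLE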

Component Specifications

In the Data Structure Definition you only need to provide a list of component specifications, making DSDs easier to read and create. The Component Specification is the bridge between this list and the component resources, marking components as measures, dimensions, or attributes and providing a reference to the actual component object. It also contains some component metadata, such as whether or not the component is required, and whether it has any extra attached RDF resources.

The project currently creates Component Specifications based on the row names for the R Data Frame used to generate them.
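
A single component specification for the lod column might look something like this (sketched by hand; the real output may name and group things differently):

component = <<~TURTLE
  :component-lod a qb:ComponentSpecification ;
    rdfs:label "Component specification for lod" ;
    qb:measure prop:lod .
TURTLE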

Dimension Properties

Data Cube Properties are separated into four types: Dimension Properties, Measure Properties, Attribute Properties, and Coded Properties. Of these, only the first two are in use in my project, so they will be the only ones I’ll cover in this post.

Dimension Properties specify the way in which data are measured. They provide a means of categorizing observations, and are generally used by visualization programs to draw axes for data. Dimension Properties can also specify a Measure Type, which allows you to categorize what the dimension is measuring, for example weight or time.

The converter defaults to making row the only dimension, which will work with most data sets but may not be the best mapping possible. A near term goal of this project will be providing a facility to specify this at run time.

Each row must also be declared as such
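
Sketched in Turtle, the dimension property and one of the declared rows look roughly like this (the exact class and resource names are guesses based on the prefixes used earlier):

dimension = <<~TURTLE
  prop:refRow a rdf:Property , qb:DimensionProperty ;
    rdfs:label "refRow" .

  # each row of the Dataframe gets its own resource, declared as a refRow
  :refRowD10M44 a :refRow ;
    rdfs:label "D10M44" .
TURTLE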

Measure Properties

Measure Properties are used to represent the values of interest in your data, whether it’s GDP levels, cancer rates, or LOD scores from a QTL analysis.

In the default mapping, each column of the R DataFrame becomes a measure property. In the example object, this is all of the “chr”, “pos”, and “lod” values.
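
In Turtle, the three measures come out roughly as follows (an approximation of the generated output):

measures = <<~TURTLE
  prop:chr a rdf:Property , qb:MeasureProperty ;
    rdfs:label "chr" .

  prop:pos a rdf:Property , qb:MeasureProperty ;
    rdfs:label "pos" .

  prop:lod a rdf:Property , qb:MeasureProperty ;
    rdfs:label "lod" .
TURTLE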

Observations

Observations are the heart of the Data Cube vocabulary. They represent one particular data point, and specify values for each dimension property and measure property. Because the definition of these properties has already been laid out as its own resource, all an observation needs to list is the value at that point, in whatever format it is set up to use.

In the case of an R Dataframe, each row forms an observation, and each column specifies one component of it. By default all are measure values, although in future versions some may specify dimension or attribute values.

In the example object, each row specifies the measure values for chr, pos, and lod, generating a Data Cube observation that looks like:
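
(A hand-written approximation, using the first row of the excerpt above and the prefixes from earlier; the real output may differ in naming and datatypes.)

observation = <<~TURTLE
  :obsD10M44 a qb:Observation ;
    qb:dataSet :dataset-mr ;
    prop:refRow :refRowD10M44 ;
    prop:chr 1.0 ;
    prop:pos 0.0 ;
    prop:lod 0.457 .
TURTLE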

Other

The Data Cube vocabulary contains a number of other important features not covered in this overview. For example, the vocabulary is compatible with, and to some extent integrates, SKOS, the Simple Knowledge Organization System, a popular ontology for organizing taxonomies, thesauri, and other types of controlled vocabulary. Each Data Cube Property can have an attached “concept”, which aids in reuse and comprehension by specifying a known statistical concept represented by the property.

I hope you found this example illustrative and you have a better idea of how to get your data into the Data Cube format. If combined into one text file, the results specify a valid Data Cube which can be used with tools such as Cubeviz, or queried as part of a research or analysis task. If you’d like to see what this looks like for a larger object, here is the Data Cube representation of a marker regression on the whole Listeria data set. In the next post, I’ll talk about how to do this using SPARQL, RDF’s native query language.

Motivations: Why we need to improve the Semantic Web

David Karger: How the Semantic Web Can Help End Users

MIT AI researcher David Karger gave the keynote at this year’s European Semantic Web Conference, and has posted his slides as well as a summary of his talk on the MIT Haystack blog. He’s an expert on the topic, and does a much better job than I could of explaining the value of flexible and extensible data representation. I hope to distill some of the writing of Karger and others and post it here over the summer, but for now, if you’re not sure what advantage these technologies have over traditional databases and ad-hoc formats, or you think there’s no more useful work to be done on them, have a look at the presentation.