I’d like to talk a little more about the mapping between R Data Frames and the RDF format. This isn’t a very natural union; information in a Data Frame has a simple and well-defined structure, while RDF is a sequence of statements which need not be in any particular order. This is by design of course; it is what allows the flexibility and generality of RDF. The Data Cube vocabulary, however, forms a bridge between these two very different ways of representing information. Data Cube was developed by a group of statistics and data science experts commissioned by the UK government to develop a vocabulary for representing multi-dimensional data. It can be used with tabular data, but in keeping with the flexible nature of semantic web technologies, it can also
represent nearly any kind of data which can be broken down by dimension, and includes facilities for attaching semantic context to data sets and extending itself to accommodate additional complexities of units, measure types, and other common features of data.
A Dataframe, by contrast, is a fairly simple structure; a set of lists of equal length. Although some incidental complexity can hide under this description, in our sample use case, r/qtl, they can be thought of simply as tables with labeled rows. While the Data Cube vocabulary is capable of handling much more complicated structures, it is also well suited to representing simple objects such as Dataframes.
In order to explore the relationship between these two data structures, we will look at a small data set representing the results of an r/qtl analysis session. These particular data are excerpted from running a marker regression on the Listeria dataset included in r/qtl.
The actual Dataframe has about 130 rows, but this short excerpt will suffice to show how the mapping works. A future post may touch briefly on the meaning of these entries in the context of QTL analysis. For now, just think of it as you would any other table, although note that the rows are labeled.
We will have to make some assumptions for our mapping, since R doesn’t include any information about the meaning (semantics) of its data. We assume that each column of the Dataframe is a property we’re interested in measuring, and that each row of the table specifies a category or dimension to place our measurements in. If you’re familiar with the structure of our example Dataframe, an r/qtl scanone result, you may notice that this mapping isn’t entirely parsimonious with what’s represented in the data; the “chr” and “pos” columns probably better specify dimensions, and since the row names contain no extra information, they should probably just be labels for the data points. Unfortunately there’s no way to fully automate this process; there simply isn’t enough information in the Dataframe to unambiguously determine our mapping. I am working on some tools to allow end users to specify this information, which could be used to build up a library of mappings for different R classes, but for now it suffices to develop a technically valid Data Cube representation of the information.
Let’s break down the mapping by individual data cube elements, and see what they correspond to in R.
Prefixes are standard in a number of RDF languages, including Turtle, which I will use for these examples. Prefixes simply specify that certain tokens, when followed by a “:” (colon) should be replaced by the appropriate URI. Although any Turtle file could be written without them, they are essential to making your RDF comprehensible to humans as well as machines. The prefixes used for this data set will be
While some of these prefixes are part of standard RDF vocabularies, anything under rqtl.org should be considered a placeholder for the time being, as no vocabulary definitions actually exist at the given address.
Data Structure Definition
One of the highest level elements of a Data Cube resource is the Data Structure Definition. It provides a reusable definition that Datasets can operate under, specifying dimensions, measures, attributes, and extra information such as ordering and importance.
In our basic implementation, this is little more than a list of component specifications for the dimension properties. As the program develops, a lot of the additional semantic detail will wind up here.
Note that the Data Structure Definition resource has been named after the variable used to generate, which I called “mr” (marker regression), so the resource’s name is “dsd-mr”.
Although it can contain some additional information about how to interpret the data it applies to, a DataSet’s main job is to attach a series of observations to a Data Structure Definition.
This is currently implemented as a mostly static string, with the dataset labeled (by default) based on the variable used to generate it.
In the Data Structure Definition you only need to provide a list of components specifications, making DSDs easier to read and create. The Component Specification is the bridge between this list and the component resources, marking components as measures, dimensions, or attributes and providing a reference to the actual component object. It also contains some component metadata such as whether or not it is required, and if it has any extra attached RDF resources.
The project currently creates Component Specifications based on the row names for the R Data Frame used to generate them.
Data Cube Properties are separated into four types; Dimension Properties, Measure Properties, Attribute Properties, and Coded Properties. Of these, only the former two are in use in my project, so these will be the only ones I’ll cover in this post.
Dimension Properties specify the way in which data are measured. They provides a means of categorize observations, and are generally used by visualization programs to draw axes for data. Dimension Properties can also specify a Measure Type, which allows you to categorize what the dimension is measuring, for example weight, or time.
The converter defaults to making row the only dimension, which will work with most data sets but may not be the best mapping possible. A near term goal of this project will be providing a facility to specify this at run time.
Each row must also be declared as such
Measure Properties are used to represent the values of interest in your data, whether its GDP levels, cancer rates, or LOD scores from a QTL analysis.
In the default mapping, each column of the R DataFrame becomes a measure property. In the example object, this is all of the “chr”, “pos”, and “lod” values.
Observations are the heart of the Data Cube vocabulary. They represent one particular data point, and specify values for each dimension property measure property. Because the definition of these properties has already been laid out as its own resource, all an observation needs to list is the value at that point in whatever format it is set up to use.
In the case of an R Dataframe, each row forms an observation, and each column specifies one component of it. By default all are measure values, although in future versions some may specify dimension or attribute values.
In the example object, each row specifies the measure values for chr, pos, and lod, generating a Data Cube observation that looks like:
The Data Cube vocabulary contains a number of other important features not covered in this overview. For example, the vocabulary is compatible with, and to some extent integrates, SKOS, the Simple Knowledge Organization System, a popular ontology for organizing taxonomies, thesauri, and other types of controlled vocabulary. Each Data Cube Property can have an attached “concept”, which aids in reuse and comprehension by specifying a known statistical concept represented by the property.
I hope you found this example illustrative and you have a better idea of how to get your data into the Data Cube format. If combined into one text file, the results specify a valid Data Cube which can be used with tools such as Cubeviz, or queried as part of a research or analysis task. If you’d like to see what this looks like for a larger object, here is the Data Cube representation of a marker regression on the whole Listeria data set. In the next post, I’ll talk about how to do this using SPARQL, RDF’s native query language.