Sharp Scissors, Safety Scissors: What to Do with Your PubliSci Dataset

If you’ve been following along with the last two blog posts, you should have a pretty good idea of how to turn most flat or tabular file formats into an RDF dataset using PubliSci’s Reader tools. You now have an unambiguous, self-annotated dataset that is easy for humans to read and, once loaded into a triple store, can be queried in sexy, sexy SPARQL. So what do you do with it?

In storing, serializing, or “writing down” data, we hope (beyond overcoming a poor memory) to be able to share what we’ve learned with others who might have questions, criticisms, or things they’d like to derive from the information within. Often these ‘others’ are other people, but more and more frequently they are machines and algorithms, especially in fields such as biology that are struggling to keep up with the growing heaps of data they generate. SPARQL, RDF, and the other components of the Semantic Web are designed to make describing knowledge, and posing questions to it, accessible to both types of actors, through a flexible data model, ontological structures, and a host of inter-related software and standards.

Along with a web-friendly scripting language such as Ruby, the Semantic Web’s tools make it easy to build domain-specific applications. To provide an example, I’ve created a demonstration server, which you can find at mafdemo.strinz.me, based on a breast cancer dataset collected by Washington University’s Genome Institute and stored in the TCGA database.

There are two ways to use the demo server: one public, the other private. The public side offers a way to load MAF files into the database, a simple HTML interface with some parts of the data highlighted and linked for you to browse through, and a page for querying the repository using SPARQL.

The private side, protected by a password for now, offers a much more flexible way to interact with the dataset, essentially by letting you write Ruby scripts to run a set of templated queries, create your own, and perform operations such as sorting or statistical tests on the output. However, as James Edward Gray says, Ruby trusts us with the sharp scissors, so if you were to host such an interface on your own machine, you’d want to make sure you don’t give the password to anyone you don’t trust with the sharp scissors, unless you’re running it in a virtual machine or have taken other precautions.

I’ll go over both of these interfaces in turn, starting with the public side.

The Safety Scissors

There’s still a lot you can find out about the dataset from the public side. It’s not much to look at, but you can browse through linked data for the patients and genes represented in the MAF file. Because of the Semantic Web practice of using dereferenceable URIs, much of the raw data links directly to more information about it. Most of the information presented comes from direct SPARQL queries to the MAF dataset, constructed and executed using the ruby-rdf library.

With some further development, this backend could support a very flexible tool for slicing and analyzing one or more TCGA datasets. As of now, most responses are returned as streaming text, which prevents queries and remote service calls from causing timeouts but makes building a pretty interface more difficult. This could be resolved by splitting the output into JavaScript and a better-looking web interface (such as the one for the PROV demo I created). On top of that, the inclusion of gene sizes is just a small example of the vast amount of information available from external databases; this is, after all, the state of affairs that has led bioinformaticians to adopt the Semantic Web.

However, the remaining time in GSoC doesn’t give me the scope to build up many of these services in a way that makes full use of the information available and the flexible method of accessing it. To address this, I’ve created a more direct interface to the underlying classes and queries, which can be accessed using Ruby scripts. It’s protected by a password on the demo site, so if you want to try any of these examples yourself you should grab a clone of the GitHub repository.

Sharp Scissors

The Scissors Cat, by hibbary

In its base form, the scripting interface is not really safe to share with anyone you don’t already trust. It’s not quite as insecure as sharing a computer, since it only returns simple strings, but in theory a motivated person could completely hijack and rewrite the server from this interface; such is the price for the power of Ruby. However, with some sandboxing and an implementation not based on instance_eval the situation could be improved, or this could form the basis of a proper DSL such as Cucumber’s Gherkin, which has a well-defined grammar written in Treetop, allowing for much safer evaluation of arbitrary inputs.

The select Method

The script interface sets you up in an environment with access to the 4store instance holding the MAF data, and gives you a few helper methods for working with it. Primary among these is the ‘select’ method, which can be used to retrieve specific information from the MAF file by patient ID, as well as a few other relevant pieces of information about the dataset, such as the number of patients represented in it.

For example, here’s the script you’d use to wrap a simple query, retrieving the genes with mutations for a given patient.

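A rough sketch of what such a script looks like is below; the exact select syntax in the gem may differ slightly, and the patient barcode is just a placeholder.

# Hypothetical script-interface call: ask for the hugo_symbol (gene name) column,
# restricted to observations from a single patient barcode.
genes = select(:hugo_symbol).where(patient_id: "TCGA-A1-A0SB")

# The result comes back as a plain Ruby array of gene names
genes.uniq.sort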

You can further refine results by specifying additional restrictions. Here, the first query selects all samples with a mutation in NUP107, and the second restricts its results to those starting at position 69135678.
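In sketch form, again with a hypothetical select syntax, the two queries might look like this:

# All samples with a mutation in NUP107
select(:patient_id).where(hugo_symbol: "NUP107")

# The same selection, restricted to mutations starting at position 69135678
select(:patient_id).where(hugo_symbol: "NUP107", start_position: 69135678)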

You can also select multiple columns in one go, returning a hash with a key for each selection

https://gist.github.com/wstrinz/28da6c67ab6e44d26340

Using these methods of accessing the underlying data, you can write more complex scripts to perform analysis. For example, here we look for samples with mutations in the gene CASR that are more than one base pair in length.
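A sketch of that kind of script, assuming the hash-returning form of select described above (the key names are illustrative):

# Pull the start and end coordinates for every CASR mutation, then keep only
# those spanning more than one base pair.
rows = select(:patient_id, :start_position, :end_position).where(hugo_symbol: "CASR")

rows[:patient_id].zip(rows[:start_position], rows[:end_position])
    .select { |_patient, start, stop| stop.to_i - start.to_i > 0 }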

Inline SPARQL Queries

While the select method may be a blessing for Rubyists just getting into the Semantic Web, if you’re also familiar with SPARQL you probably know that most of the sorting and comparison you might want to do can be performed with it alone. The public side of the MAF server does expose a query endpoint, but if you want to tie a series of queries together in a script, or run the output through an external library, you can also easily run inline queries using the scripting interface.

This can be used to derive information about how best to access the dataset, which adheres to the general structure of the Data Cube vocabulary. For example, to see all of the columns you can select data from, you could run a script like the following.
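Something along these lines; the query helper’s name and exact return format are assumptions here, while the Data Cube terms themselves are standard.

# List every property used on an observation -- in Data Cube terms, the
# dimensions and measures you can select data from.
columns = query(<<-SPARQL)
  PREFIX qb: <http://purl.org/linked-data/cube#>
  SELECT DISTINCT ?property WHERE {
    ?obs a qb:Observation ;
         ?property ?value .
  }
SPARQL

columns.map { |solution| solution[:property].to_s }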

And of course you can mix the two methods, pulling the results of a SPARQL query into a select call or vice versa, as in this next example, where we create a list of all the other genes in which patients with a mutation in SHANK1 also have mutations.

SPARQL Templates, RDF.rb Queries

A couple of other small features deserve a mention. First, I’ve included in the gem the ad-hoc templating system I’ve been using. It’s similar to the Handlebars templating system, with placeholders marked by double braces (‘{{‘ and ‘}}’), although here we’re working with SPARQL rather than HTML. This has a few different applications: you can reuse query templates in a script, and you can write a query early on and fill values into it later.
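The idea is easy to sketch with plain string substitution; the maf: namespace and property names below stand in for whatever the dataset actually uses, and the gem wraps this pattern in its own small helper.

# A reusable query template with {{gene}} left to be filled in later
template = <<-SPARQL
  PREFIX maf: <http://example.org/maf/>   # placeholder namespace
  SELECT ?patient WHERE {
    ?observation maf:hugo_symbol "{{gene}}" ;
                 maf:patient_id  ?patient .
  }
SPARQL

# Fill the placeholder in once you know which gene you care about
query_text = template.gsub("{{gene}}", "CASR")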

Second, when you make a ‘select query’ call, the results are converted into plain Ruby objects for simpler interaction. Under the hood, however, these are retrieved using the RDF::Query class, which returns RDF::Solutions that can be interacted with in a more Semantic Web-aware manner. To get this kind of object as a result, either use “select_raw query” instead, or instantiate a query object and call its #run method, as demonstrated below, where we retrieve all the Nonsense Mutations and then process them afterward to sort by patient ID or gene type.
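Working with ruby-rdf’s query objects directly looks roughly like this; the predicate URIs are placeholders for whatever the MAF dataset actually uses, and the repository object is assumed to be provided by the scripting environment.

require 'rdf'

variant_class = RDF::URI("http://example.org/maf/variant_classification")
patient_id    = RDF::URI("http://example.org/maf/patient_id")

# Build a basic graph pattern: observations classified as nonsense mutations,
# along with the patient each one belongs to.
query = RDF::Query.new do
  pattern [:observation, variant_class, RDF::Literal("Nonsense_Mutation")]
  pattern [:observation, patient_id,    :patient]
end

# `repository` is assumed to be the 4store-backed repository the environment exposes
solutions = query.execute(repository)

# Solutions are enumerable, so plain Ruby sorting works fine afterward
solutions.sort_by { |s| s[:patient].to_s }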

Saving and Sharing

Finally, the way I’ve set up the server and the nature of instance_eval allowed me to include saving a ‘workspace’ between evaluations, and sharing results or methods across sessions and users. To save a variable or result, simply prefix it with an “@” sign, declaring it as an instance variable.
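For example (the select call here is hypothetical, as above):

# Anything assigned to an instance variable sticks around between evaluations
@result = select(:hugo_symbol).where(variant_classification: "Nonsense_Mutation")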

Then you can come back later and run another script.
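For instance, a quick one-liner like this (again just a sketch):

# Pick up the saved result from the previous evaluation and count the genes in it
@result.uniq.length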

This second script reuses the instance variable “@result” stored in your instance of the script evaluator. You can do the same with procs or lambdas to reuse functions, and with pretty much anything else you can think of. Similarly, prefixing a variable with “@@” will mark it as a class variable, enabling anyone accessing the script interface to use it.

Do Not Try This At Home

Again, I want to stress that this is by no means a rigorous approach to providing public access to an RDF dataset. It is so ridiculously permissive that I’m sure there are people online who would be physically ill just thinking about the insecurity of my approach. Hopefully, if they’re reading this, they’ll feel inclined to offer some advice on how to do it better; in the meantime, I believe that working in a small group on a closed server with this interface could aid collaboration and the prototyping of queries and algorithms. It also helps to show just how flexible the underlying data model we’re operating on can be, and how the impedance mismatch between programs and query-accessible databases is in many cases less severe with SPARQL than with SQL.

The one huge component of the Semantic Web this does leave out is interaction between services. The ability to make unambiguous statements with RDF triples creates a natural route for integrating and consuming external services, which I will talk about in more detail in a follow-up post.


Coded properties for proper semantics

It would be nice if all the data we are interested in working with could be accessed using SPARQL, but the reality is that most data is stored in some kind of tabular format, and may lack important structural information, or even data points, either of which can be a serious stumbling block when trying to represent it as a set of triples. In the case of missing data, the difficulty is that RDF has no concept of null values. This isn’t simply an oversight; the semantic nature of the format requires it. The specific reasons are related to the formal logic and machine inference aspects of the Semantic Web, which haven’t been covered here yet; this post on the W3C’s Semantic Web mailing list provides a good explanation.

As a brief summary, consider what would happen if you had a predicate for “is married to”, and you programmed your (unduly provincial) reasoner to conclude that any resource that is the object of an “is married to” statement with a subject of type “Man” is of type “Woman” (and vice versa). If you had an unmarried man and an unmarried woman in your dataset, and chose to represent this state by listing both as “is married to” a null object, say “rdf:null”, your reasoner would conclude that rdf:null is both a man and a woman. Assuming it specifies “Man” and “Woman” as disjoint classes, your un-cosmopolitan reasoner has now arrived at a contradiction, and invalidated your ontology!

Paradoxes: not machine friendly

Since missing values are actually quite common in the context of QTL analysis, where missing information is often imputed or estimated from an incomplete dataset, I have been discussing the best way to proceed with my mentors. We have decided to use the “NA” string literal by default, and leave it up to the software accessing our data to decide how to handle missing data in a given domain. This is specified in the code that converts raw values to resources or literals:
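The gem’s actual implementation lives in the repository; a simplified sketch of the idea looks something like this (the method name and details are illustrative, while RDF::Literal and RDF::URI come from ruby-rdf):

require 'rdf'

# Convert a raw value from the data frame into an RDF term. Missing values
# stay as the string literal "NA"; everything else becomes a typed literal
# or a resource in the dataset's namespace.
def to_rdf_term(value, namespace)
  if value.nil? || value.to_s == "NA"
    RDF::Literal("NA")
  elsif value.is_a?(Numeric)
    RDF::Literal(value)
  else
    RDF::URI(namespace + value.to_s.gsub(/\s+/, "_"))
  end
end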

The 1.1 version of the RDF standard also includes NaN as a valid numeric literal, so I am experimenting with this as a way of dealing with missing numeric values.

Missing structure is a somewhat larger problem. In some situations, such as the D3 visualizations Karl Broman has created, it is sufficient to simply store all data as literal values under a single namespace. For example, consider a (made up) database of economic indicators by country: you could say a given data point has a country value of “America”, a GDP of $49,000 per capita, an infant mortality rate of 4 per thousand, and so on, storing everything as a literal value. This is fine if you’re just working with your own data, but what if you want to find more information about this “America” concept? The great thing about Semantic Web tools is that you can easily query another SPARQL endpoint for triples with an object value of “America”. But this “America” is simply a raw string; you could just as well be receiving information about the band “America”, the entire continent of North, South, or Central America, or something else entirely. Furthermore, RDF does not allow literals as subjects in triples, so you wouldn’t be able to make any statements about “America”. This is particularly problematic for the Data Cube format for a number of reasons, not the least of which is the requirement that all dimensions have an rdfs:range that specifies the set of possible values.

The solution to this problem is to make “America” a resource inside a namespace. For example, if we were converting these data for the IMF, we could replace “America” (the string literal) with a URI in an imf.org namespace. We can now write statements about the resource, and ensure that there is no ambiguity between different Americas. This doesn’t quite get us all the way to fully linked data, since it’s not yet clear how to specify that the “America” in the imf.org namespace is the same as the one in, say, the unfao.org namespace (for that you will need to employ OWL, a more complex Semantic Web technology outside the scope of this post), but it at least allows us to create a valid representation of our data. In the context of a Data Cube dataset, this can be automated through the use of coded properties, supported by the SKOS vocabulary, an ontology developed for categorization and classification. Using SKOS, I define a “concept scheme” for each coded dimension, and a set of “hasTopConcept” relations for each scheme.
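In Turtle, a generated concept scheme for the country dimension of that hypothetical IMF dataset would look roughly like this (the namespace is a stand-in):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/imf-demo/> .

ex:country-codes a skos:ConceptScheme ;
    skos:prefLabel "Code list for the country dimension"@en ;
    skos:hasTopConcept ex:America, ex:Canada, ex:Japan .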

Each concept gets its own resource, for which some rudimentary information is generated automatically:
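Continuing the sketch above (illustrative rather than the generator’s literal output):

ex:America a skos:Concept ;
    skos:prefLabel "America"@en ;
    skos:inScheme ex:country-codes ;
    skos:topConceptOf ex:country-codes .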

Currently the generator only enumerates the concepts and creates the scheme, but these concept resources provide a place to link to other datasets and define the concepts’ semantics. As an example, if the codes were generated for countries, you could tell the generator to link to the DBpedia entries for each country. Additionally, I plan to create a “No Data” code for each concept set, now that we’ve had a discussion about the way to handle such values.

To see how this all comes together, I’ll go through an example using one of the most popular data representation schemes around: the trusty old CSV.

CSV to Data Cube

Most readers are probably already familiar with this format, but even if you aren’t, it’s about the simplest conceivable way of representing tabular data: columns are separated by commas, and rows by new lines. The first row typically holds the labels for the columns. Although very little in the way of meaning is embedded in this representation, it can still be translated to a valid Data Cube representation, and more detailed semantics can be added through a combination of automated processing and user input.

The current CSV generator is fairly basic; you can provide it with an array of dimensions, coded dimensions, and measures using the options hash, point it at a file, and it will create Data Cube formatted output. There is only one extra option at the moment, which allows you to specify a column (by number) to use to generate labels for your output. See below for an example of how to use it.

Here is a simple file I’ve been using for my RSpec tests (with spaces added for readability).

producer, pricerange, chunkiness, deliciousness
hormel,     low,        1,           1
newskies,  medium,      6,           9
whys,      nonexistant, 9001,        6

This can be fed into the generator using a call such as the following:
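Something along these lines; the class, method, and option names, as well as the file name, are placeholders for the gem’s actual API.

require 'publisci'

# Treat 'producer' and 'pricerange' as coded dimensions, use the first column
# for labels, and leave the remaining columns to be picked up as measures.
reader = PubliSci::Readers::CSV.new
turtle = reader.generate_n3("peanut_butter.csv",
                            dimensions:   ["producer", "pricerange"],
                            label_column: 0)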

And that’s all there is to it! Any columns not specified as dimensions will be treated as measures, and if you provide no configuration information at all it will default to using the first column as the dimension. At the moment, all dimension properties for CSV files are assumed to be coded lists (as discussed above), but this will change as I add support for other dimension types and SDMX concepts. If you’d like to see what the output looks like, you can find an example gist on GitHub.

As the earlier portion of this post explains, the generator will create a concept scheme based on the range of possible values for each dimension, and define the rdfs:range of that dimension as the concept scheme. Aside from producing a valid Data Cube representation (which requires all dimensions to have a range), this also creates a platform for adding more information about coding schemes and individual concepts, either manually or, in the near future, with some help from the generation algorithm.

I’ll cover just how you might go about linking up more information to your dataset in a future post, but if you’d like a preview, have a look at the excellent work Sarven Capadisli has done in converting and providing endpoints for data from the UN and other international organizations.

An overview of Sarven’s data linkages (source)

The country codes for these datasets are all linked to further data on their concepts using the SKOS vocabulary; for example, see this page with data about country code CA (Canada). This practice of linking together different datasets is a critical part of the Semantic Web, and will be an important direction for future work this summer.

Frame to Cube

I’d like to talk a little more about the mapping between R data frames and the RDF format. This isn’t a very natural union; information in a data frame has a simple and well-defined structure, while RDF is a sequence of statements which need not be in any particular order. This is by design, of course; it is what allows the flexibility and generality of RDF. The Data Cube vocabulary, however, forms a bridge between these two very different ways of representing information. Data Cube was developed by a group of statistics and data science experts commissioned by the UK government to develop a vocabulary for representing multi-dimensional data. It can be used with tabular data, but in keeping with the flexible nature of Semantic Web technologies, it can also represent nearly any kind of data that can be broken down by dimension, and it includes facilities for attaching semantic context to data sets and extending itself to accommodate additional complexities such as units, measure types, and other common features of data.

The OLAP cube, a common data structure in corporate settings, uses the same basic model as the Data Cube vocabulary, but does not include any semantic information.

A data frame, by contrast, is a fairly simple structure: a set of lists of equal length. Although some incidental complexity can hide under this description, in our sample use case, r/qtl, data frames can be thought of simply as tables with labeled rows. While the Data Cube vocabulary is capable of handling much more complicated structures, it is also well suited to representing simple objects such as data frames.

In order to explore the relationship between these two data structures, we will look at a small data set representing the results of an r/qtl analysis session. These particular data are excerpted from running a marker regression on the Listeria dataset included in r/qtl.

 

         chr     pos     lod
D10M44     1     0.0   0.457
D15M68    15    23.9   3.066

The actual data frame has about 130 rows, but this short excerpt will suffice to show how the mapping works. A future post may touch briefly on the meaning of these entries in the context of QTL analysis. For now, just think of it as you would any other table, though note that the rows are labeled.

We will have to make some assumptions for our mapping, since R doesn’t include any information about the meaning (semantics) of its data. We assume that each column of the data frame is a property we’re interested in measuring, and that each row of the table specifies a category or dimension to place our measurements in. If you’re familiar with the structure of our example data frame, an r/qtl scanone result, you may notice that this mapping isn’t entirely faithful to what’s represented in the data; the “chr” and “pos” columns would probably be better specified as dimensions, and since the row names contain no extra information, they should probably just be labels for the data points. Unfortunately there’s no way to fully automate this process; there simply isn’t enough information in the data frame to unambiguously determine the mapping. I am working on some tools to allow end users to specify this information, which could be used to build up a library of mappings for different R classes, but for now it suffices to develop a technically valid Data Cube representation of the information.

Let’s break down the mapping by individual data cube elements, and see what they correspond to in R.

Prefixes

Prefixes are standard in a number of RDF serializations, including Turtle, which I will use for these examples. A prefix simply specifies that a certain token, when followed by a “:” (colon), should be replaced by the appropriate URI. Although any Turtle file could be written without them, prefixes are essential to making your RDF comprehensible to humans as well as machines. The prefixes used for this data set will be as follows.
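Approximately the following; the rqtl.org URIs here are reconstructions rather than the generator’s literal output, while the rest are the standard vocabularies.

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix qb:   <http://purl.org/linked-data/cube#> .
@prefix ns:   <http://www.rqtl.org/ns/dataset#> .
@prefix prop: <http://www.rqtl.org/ns/dc/properties#> .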

While some of these prefixes are part of standard RDF vocabularies, anything under rqtl.org should be considered a placeholder for the time being, as no vocabulary definitions actually exist at the given address.

Data Structure Definition

One of the highest-level elements of a Data Cube resource is the Data Structure Definition (DSD). It provides a reusable definition that datasets can operate under, specifying dimensions, measures, attributes, and extra information such as ordering and importance.

In our basic implementation, this is little more than a list of component specifications for the dimension properties. As the program develops, a lot of the additional semantic detail will wind up here.

Note that the Data Structure Definition resource has been named after the variable used to generate it, which I called “mr” (for marker regression), so the resource’s name is “dsd-mr”.
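Sketched in Turtle with the prefixes above (resource names reconstructed, not copied from the generator’s output):

ns:dsd-mr a qb:DataStructureDefinition ;
    qb:component ns:cs-refRow, ns:cs-chr, ns:cs-pos, ns:cs-lod .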

Data Set

Although it can contain some additional information about how to interpret the data it applies to, a DataSet’s main job is to attach a series of observations to a Data Structure Definition.

This is currently implemented as a mostly static string, with the dataset labeled (by default) based on the variable used to generate it.
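Roughly, again with reconstructed names:

ns:dataset-mr a qb:DataSet ;
    rdfs:label "mr" ;
    qb:structure ns:dsd-mr .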

Component Specifications

In the Data Structure Definition you only need to provide a list of component specifications, making DSDs easier to read and create. The Component Specification is the bridge between this list and the component resources, marking components as measures, dimensions, or attributes and providing a reference to the actual component resource. It also contains some component metadata, such as whether or not it is required and whether it has any extra attached RDF resources.

The project currently creates Component Specifications based on the column names of the R data frame used to generate them.
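For the “chr” column and the row dimension, the generated specifications look something like this (names reconstructed):

ns:cs-chr a qb:ComponentSpecification ;
    qb:measure prop:chr .

ns:cs-refRow a qb:ComponentSpecification ;
    qb:dimension prop:refRow .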

Dimension Properties

Data Cube properties are separated into four types: Dimension Properties, Measure Properties, Attribute Properties, and Coded Properties. Of these, only the first two are in use in my project, so they will be the only ones I cover in this post.

Dimension Properties specify the way in which data are measured. They provide a means of categorizing observations, and are generally used by visualization programs to draw axes for data. Dimension Properties can also specify a Measure Type, which allows you to categorize what the dimension is measuring, for example weight or time.

The converter defaults to making the row the only dimension, which will work with most data sets but may not be the best mapping possible. A near-term goal of this project is to provide a facility to specify this at run time.
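The row dimension itself is declared along these lines (the property and range class names are reconstructions):

prop:refRow a rdf:Property, qb:DimensionProperty ;
    rdfs:label "row" ;
    rdfs:range ns:MarkerRow .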

Each row must also be declared as such:
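For the two rows in our excerpt, that might look like:

ns:row-D10M44 a ns:MarkerRow ;
    rdfs:label "D10M44" .

ns:row-D15M68 a ns:MarkerRow ;
    rdfs:label "D15M68" .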

Measure Properties

Measure Properties are used to represent the values of interest in your data, whether it’s GDP levels, cancer rates, or LOD scores from a QTL analysis.

In the default mapping, each column of the R data frame becomes a measure property. In the example object, this means the “chr”, “pos”, and “lod” columns.
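In Turtle, the three measure properties come out looking something like this (names reconstructed as above):

prop:chr a rdf:Property, qb:MeasureProperty ;
    rdfs:label "chr" .

prop:pos a rdf:Property, qb:MeasureProperty ;
    rdfs:label "pos" .

prop:lod a rdf:Property, qb:MeasureProperty ;
    rdfs:label "lod" .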

Observations

Observations are the heart of the Data Cube vocabulary. They represent one particular data point, and specify values for each dimension property and measure property. Because the definition of each of these properties has already been laid out as its own resource, all an observation needs to list is the value at that point, in whatever format it is set up to use.

In the case of an R data frame, each row forms an observation, and each column supplies one component of it. By default all columns are measure values, although in future versions some may specify dimension or attribute values.

In the example object, each row specifies the measure values for chr, pos, and lod, generating a Data Cube observation that looks like:
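In sketch form, using the first row of the excerpt (resource names reconstructed; the literal values come straight from the table):

ns:obs-D10M44 a qb:Observation ;
    qb:dataSet ns:dataset-mr ;
    prop:refRow ns:row-D10M44 ;
    prop:chr "1"^^xsd:integer ;
    prop:pos "0.0"^^xsd:double ;
    prop:lod "0.457"^^xsd:double .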

Other

The Data Cube vocabulary contains a number of other important features not covered in this overview. For example, the vocabulary is compatible with, and to some extent integrates, SKOS, the Simple Knowledge Organization System, a popular ontology for organizing taxonomies, thesauri, and other types of controlled vocabulary. Each Data Cube Property can have an attached “concept”, which aids in reuse and comprehension by specifying a known statistical concept represented by the property.

I hope you found this example illustrative and that you now have a better idea of how to get your data into the Data Cube format. If combined into one text file, the results specify a valid Data Cube which can be used with tools such as CubeViz, or queried as part of a research or analysis task. If you’d like to see what this looks like for a larger object, here is the Data Cube representation of a marker regression on the whole Listeria data set. In the next post, I’ll talk about how to query this data using SPARQL, RDF’s native query language.