Rise of the Machines: Using SADI to Augment Your Data

As discussed in my last post, converting a flat dataset to an RDF graph gives you access to a variety of analysis and exposure tools, or can form the basis of new software that relies on your data. It is a flexible format, in that new information can be added in a variety of ways without worrying about table joins or schema changes, and yet it has remarkable descriptive power: it is possible, in principle, to know that two statements in separate data stores represent equivalent information, and individual data elements can often simply be loaded in a web browser to find out more about them.

These properties certainly have a lot to offer scientists and other human consumers of data, but in many ways they are a particularly good feature for algorithmic data users, which can process a stream of information very quickly but are not as adept at resolving ambiguity and reliably finding more information on a concept without guidance on where to look.

SADI: Semantic Automated Discovery and Integration

One particular project that illustrates this is SADI, the brainchild of bioinformatician and semantic web expert Dr Mark Wilkinson. Mark, who also happens to be one of my GSOC mentors, has built a framework to support automated discovery of and access to distributed datasets and services, which is a practical example of a service built using the concepts I’ve been working to learn and make use of this summer.

SADI is not so much a tool or piece of software as a set of standards for service interoperability, grounded in and supported by the existing standards of the internet and the semantic web. This means that, although most of the existing SADI services are focused on bioinformatics data, the system is flexible and general purpose enough to apply to essentially any service or web interface.

SADI comprises six key conventions, described on the How SADI works page:

  1. SADI Services consume and provide data via simple HTTP POST and GET.
  2. SADI Services consume and produce data in RDF format. This allows SADI Services to exploit existing OWL reasoners and SPARQL query engines to enhance interoperability between Services and the interpretation of the data being passed between them.
  3. Service interfaces (i.e., Inputs and Outputs) are defined in terms of OWL-DL classes; the property restrictions on these OWL classes define what specific data elements are required by the Service and what data will be provided by the Service, respectively.
  4. Input RDF data – data that is compliant with the Input OWL Class – is “decorated” or “annotated” by the service provider to include new properties. These properties will (of course) be a function of the lookup/analytical operations performed by the Web Service.
  5. Importantly, discovery of SADI Services can include searches for the properties the user wants to add to their data. This contrasts with other Semantic Web Service standards which attempt only to define the computational process by which input data is analysed, rather than the properties that process generates between the input and output data. This is KEY to the semantic behaviours of SADI.
  6. SADI Web Services are stateless and atomic.

Essentially, a SADI service uses the OWL ontology language to describe the information it expects as input, and what it will return. It uses common internet conventions for its communication protocol, based on recognized W3C standards.

These conventions allow the same web tools and agents that access the world wide web to interact with SADI services in a similar manner. Because of this, and because semantic web standards are used to represent information in the system, a database of which services provide what sorts of information, and for which inputs, has been built that can take an entry straight from a triplified dataset and discover more information about it, all without any user intervention.

The basic process of using a SADI service involves retrieving information about the service by issuing a GET request to it, then POSTing RDF instances of the input OWL class it expects. The service responds with the same individuals, annotated with the new information it provides.

As an example, consider the request headers and a couple of input and output objects for the example service, which sends back a greeting for each “named individual” it receives (how nice!), and the response that comes back.
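
To make the shape of that exchange concrete, here is a minimal sketch using the rest-client gem; the service URL, input property, and greeting predicate are illustrative stand-ins, since the real definitions live in the service description itself.

require 'rest-client'

# The example greeting service; the exact URL may differ from your deployment.
service = 'http://sadiframework.org/examples/hello'

# A GET request returns the service description, an RDF document defining the
# input and output OWL classes.
puts RestClient.get(service, accept: 'text/turtle')

# Illustrative input: a "named individual". The property the service actually
# expects is declared in its input class, so treat foaf:name as an assumption.
input = <<~TTL
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .

  <http://example.org/people/guy> foaf:name "Guy" .
TTL

# The response re-uses the same subject URI, decorated with a greeting
# property drawn from the service's output class.
puts RestClient.post(service, input, content_type: 'text/turtle', accept: 'text/turtle')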

All of this behavior is defined by the OWL classes in the service description, so any user or algorithm accessing the service can learn how to interact with it simply by reading that description, which is both human and machine friendly.

Asynchronous Responses

Beyond the clarity and simplicity of the RDF and the reuse of familiar web standards, much of SADI’s strength lies in its ability to robustly process large requests and inputs. Services can be set up as synchronous, where a response isn’t returned until the service has finished processing its input, or asynchronous, where the server instead returns an address that a client can check, or “poll”, to see whether the operation has finished.

The response to a poll request also includes a header specifying how long the client should wait before trying again, making the whole process of retrieving an asynchronous result simple and transparent to coordinate. As we’ll see later on, this makes batch processing much simpler, allowing large volumes of information to be exchanged without having to worry about dealing with timeouts and network issues, or trying to efficiently coordinate many different remote requests.

The SADI framework has other benefits and features that we haven’t specifically used yet, such as the security afforded by its enforcement of an object model (as opposed to raw SPARQL queries), and the ability to distribute queries over multiple resources. In addition to sadiframework.org, further details can be found in “The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern, API and Reference Implementation”, a paper by Mark, Benjamin Vandervalk and Luke McCarthy published in the Journal of Biomedical Semantics and available in full online.

SADI in action

To give a concrete example of how to use SADI, I’ll go over the script I wrote which uses it to assist in our analysis of the MAF dataset we’ve been working with. When trying to make inferences based on the frequency with which mutations appear in a gene, it is necessary to adjust for the size of that gene. The location of a gene can actually be a bit fuzzy, since what exactly counts as a gene is itself less clear-cut than you might expect, but databases exist that contain the generally accepted start and end positions, from which a gene’s length can be found.

The old way

To get started, I searched for databases that contained gene location information and allowed it to be accessed programmatically. Of these, the Ensembl genome database was the easiest for me to use, as it has a new RESTful endpoint, and I’m usually happiest working with REST services.

The first step in constructing a query was to find the canonical name for a HUGO symbol from the dataset. Unfortunately, in addition to the occasional error or nonsense entry, the gene information in the MAF dataset often used synonyms for the “official” gene name, which are recognized by the HGNC but not immediately convertible to their equivalent Ensembl ID. To deal with this I used the HGNC dataset provided by Bio2RDF to look up first the official symbol, and then the symbol’s ID in the Ensembl database.

In the end, I came up with a couple of methods to retrieve the information.
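
In rough outline, the Ensembl half of that lookup worked along the lines of the sketch below (the HGNC synonym resolution is sketched later in this post); the endpoint path and JSON field names are written from memory, so verify them against the Ensembl REST documentation before relying on this.

require 'rest-client'
require 'json'

# Ask Ensembl's REST API for a gene's coordinates by (official) HGNC symbol.
def ensembl_lookup(symbol, species = 'homo_sapiens')
  url = "https://rest.ensembl.org/lookup/symbol/#{species}/#{symbol}"
  JSON.parse(RestClient.get(url, params: { 'content-type' => 'application/json' }))
end

# Gene length as (end - start) from the coordinates in the lookup result.
def gene_length(symbol)
  info = ensembl_lookup(symbol)
  info['end'] - info['start']
end

puts gene_length('RBFOX1')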

This approach was both error prone and slow, since each lookup involved multiple remote queries and had to be completed one at a time. Even worse, the results weren’t stored anywhere, so a new request had to be made each time the information was required.

SADI to the Rescue

Mark, however, was kind enough to set up a SADI service to handle the process. This is great not just because it runs a lot more smoothly and gave me the opportunity to work with SADI a bit, but because it also makes saving and integrating the responses almost trivial.

To begin with, I created a class with a simple method to run a synchronous SADI request and return the results as an RDF graph:
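
A minimal sketch of that class, assuming the rest-client and rdf-turtle gems; the class and method names here are mine rather than the original script’s.

require 'rest-client'
require 'rdf'
require 'rdf/turtle'

class SADIClient
  # Run a synchronous SADI request: POST Turtle input to the service and parse
  # the Turtle it sends back into an in-memory RDF graph.
  def fetch(service_url, input_turtle)
    response = RestClient.post(service_url, input_turtle,
                               content_type: 'text/turtle',
                               accept:       'text/turtle')

    graph = RDF::Graph.new
    RDF::Turtle::Reader.new(response.body) do |reader|
      reader.each_statement { |statement| graph << statement }
    end
    graph
  end
end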

It takes a service URL and RDF input, then uses the rest-client gem to handle the request. SADI supports both Turtle and RDF/XML input, but I’m partial to Turtle, so the script uses it. An example for the gene “ACF”, along with the output it produces, might look like the sketch below.
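
In outline, and with the service URL and input class as placeholders (the real ones come from the service description Mark set up), the call looks something like this; no length value is shown, since that comes back from the service.

# Placeholder service URL and input class.
service = 'http://example.org/sadi/getGeneLength'

input = <<~TTL
  <http://identifiers.org/hgnc.symbol/ACF> a <http://example.org/sadi/GeneSymbol> .
TTL

graph = SADIClient.new.fetch(service, input)
puts graph.dump(:ttl)

# The output decorates the same subject, roughly in the SIO style discussed
# below, for example:
#   <http://identifiers.org/hgnc.symbol/ACF> has_attribute [ a GeneLength ;
#                                                            has_value <length from the service> ] .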

Some of the supporting ontologies’ predicates have been replaced by more readable forms, but in general this is fully valid RDF on both ends, so loading it to or from a triple store is no trouble at all.

Caution: Semantic Hazard

It’s important to note that some work still needs to be done on the data model for the triplified MAF dataset before it will play nicely with other scientific datasets such as those exposed by SADI. Mark was willing to set up a service which could accommodate the dataset I’d constructed (SADI is quite flexible, after all), but this shouldn’t be taken as a representative example of how a service should look or be used. Although my gem has gotten to the point where it avoids gross incompatibilities like stepping on others’ namespaces or failing to reuse common vocabularies, there are some subtler semantic issues that prevent simple integration with SADI’s more interesting functions.

First of all, SADI makes frequent use of the SIO ontology, which provides a rich and unified system of describing data using RDF at the cost of certain restrictions on how that data is represented. You can see the general outline of how SIO works in the output above; attributes of objects are attached with the “has_attribute” predicate, and literal values for attributes using “has_value”. I spent some time trying to use this pattern in the MAF parser, but we decided to just use the simple representation I described in earlier blog posts given the amount of time we had left. I believe full use of SIO would be both possible and worthwhile though, since it allows for much greater interoperability without sacrificing flexibility, so this will be something I continue working on past the end of GSOC.

Second, there are also some “philosophical” issues getting in the way of full SADI integration. I’m currently using URIs from identifiers.org to provide dereferenceable identifiers for the HUGO symbols in the MAF file. This is a great application of linked data principles, since it automatically attaches both more information about a particular gene, and about the service and scheme used to represent it. However, a statement like “http://identifiers.org/hgnc.symbol/RBFOX1 has_gene_length 1694246” doesn’t really make sense; the identifiers.org URL for RBFOX1 doesn’t have a gene length, because it’s just an identifier! As I understand it, the right structure would be more like “X is_a gene, X has_identifier identifiers.org/X, X has_gene_length Y”, although I could still be wrong about this; getting the semantics right is one of the trickiest parts of working with these systems.

If you’re like me, you love the idea of a database technology where the ontological characteristics of the entities stored in it are as important as the raw data itself. But if this seems like splitting hairs, you’re missing the scope of the vision the Semantic Web is working towards, and which SADI demonstrates: once we make a statement about something, that statement should be unambiguously defined and verifiable by someone, or some algorithm, with knowledge of its particular domain, and by making other statements using its component parts we can build a vast web of interlinked knowledge, perhaps one day supplanting the web of linked documents we all use today.

RDF’s flexibility supports this vision, but it also gets in the way in that it allows you to make nonsensical statements such as the ones above. Formally defined ontologies like SIO provide the more precise structure that allows you to make a statement with reasonable confidence that it will be both meaningful and easily reusable by others. In my own time after this summer I’m looking forward to working on and writing more about this topic, as I think it really gets at the potential for using semantic technologies in science, programming, and machine intelligence research.

Speeding things up with Async

A single request runs fairly quickly and retrieves the information we need, but in this simple form it’s not quite sufficient for larger volumes of information. The nature of RDF makes batch input very easy to set up; you just add more objects to the Turtle input. However, the BRCA dataset has 1,760 distinct genes, so even trying to load a small subset of them through the synchronous service takes long enough to cause the request to time out.

This is precisely what the asynchronous mode is meant for, so after getting the basic synchronous query up and running I moved on to that. Asynchronous queries have some added complexity, so the class got quite a bit longer, but it’s still a fairly simple process for all the work that’s going on behind the scenes.
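
A sketch of the asynchronous version follows; the way poll URLs are pulled out of the initial response (rdfs:isDefinedBy) and the name of the wait header reflect my reading of the SADI conventions, so treat both as assumptions.

require 'rest-client'
require 'rdf'
require 'rdf/turtle'

class SADIClient
  POLL_INTERVAL = 10 # fallback delay in seconds when the service gives no hint

  # POST the input, collect the polling URLs from the immediate response, then
  # poll each one until its result is ready; the joined Turtle is returned.
  def fetch_async(service_url, input_turtle)
    response = RestClient.post(service_url, input_turtle,
                               content_type: 'text/turtle',
                               accept:       'text/turtle')

    extract_poll_urls(response.body).map { |url| poll(url) }.join("\n")
  end

  private

  # Assumption: each output node in the interim response points at its polling
  # URL via rdfs:isDefinedBy.
  def extract_poll_urls(turtle)
    urls = []
    RDF::Turtle::Reader.new(turtle) do |reader|
      reader.each_statement do |statement|
        urls << statement.object.to_s if statement.predicate == RDF::RDFS.isDefinedBy
      end
    end
    urls.uniq
  end

  # Keep checking a poll URL, sleeping for however long the service asks us to
  # wait (the header name is an assumption), until the finished RDF comes back.
  def poll(url)
    loop do
      # The block form keeps rest-client from raising on non-200 statuses so we
      # can inspect the response ourselves.
      response = RestClient.get(url, accept: 'text/turtle') { |resp, _request, _result| resp }
      return response.body if response.code == 200

      delay = response.headers[:retry_after].to_i
      sleep(delay > 0 ? delay : POLL_INTERVAL)
    end
  end
end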

When the fetch_async method POSTs data to the service, it receives a url to poll for each of the inputs. The method creates a list of poll urls, then handles the process of going through and waiting until a response is available. This means there is no need to keep a connection open the whole time, and the client can just follow the instructions from the service on how long to wait before checking back in. If the responses are split over multiple polling URLs, it waits until each has finished processing, then returns the output, again in turtle form.

Straight to the Database

At first, after getting this working, I set about parsing just the gene lengths out of the output so I could return them as Ruby objects. This habit comes from my previous experience with various APIs, where the general process involves parsing your data into a special format, making the request, and then grabbing the information you want from the response. SADI eliminates the last of these, and with the right input structures the first as well; the response is already in an RDF format, so you can simply load it straight into a triple store, automatically augmenting the information you already have and providing an offline database of gene lengths for later lookup.

I’ve written a script to do just this, currently configured to work with 4store specifically. It retrieves the HUGO genes currently in the database, sets up the SADI input with them, and can load the output directly into the triple store. The requests are split into batches of 250, which means the set can be processed much faster than doing the genes one at a time, and it becomes a one-time process instead of something that gets repeated every time the length of a particular gene is needed.
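
The core loop of that script looks roughly like the sketch below; the endpoint URLs, graph name, Hugo-symbol predicate, and input class are placeholders that would need to match your own 4store setup and data model.

require 'rest-client'
require 'sparql/client'

QUERY_ENDPOINT = 'http://localhost:8080/sparql/'
DATA_ENDPOINT  = 'http://localhost:8080/data/'
LENGTH_SERVICE = 'http://example.org/sadi/getGeneLength'
HUGO_PREDICATE = 'http://example.org/publisci/properties/Hugo_Symbol'

# Pull the HUGO gene URIs already in the store.
client = SPARQL::Client.new(QUERY_ENDPOINT)
genes  = client.query("SELECT DISTINCT ?gene WHERE { ?obs <#{HUGO_PREDICATE}> ?gene }")
               .map { |solution| solution[:gene].to_s }

# Build SADI input in batches of 250, fetch the lengths asynchronously, and
# load the output straight back into the store (4store accepts Turtle POSTed
# to its /data/ endpoint as a form submission).
genes.each_slice(250) do |batch|
  input  = batch.map { |uri| "<#{uri}> a <http://example.org/sadi/GeneSymbol> ." }.join("\n")
  output = SADIClient.new.fetch_async(LENGTH_SERVICE, input)

  RestClient.post(DATA_ENDPOINT, 'data'      => output,
                                 'graph'     => 'http://example.org/gene_lengths',
                                 'mime-type' => 'application/x-turtle')
end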

When you step back and think about it, this ability to make a request using some entries from your database and load the response straight back in without parsing or conversion is pretty remarkable, and it doesn’t even begin to address SADI’s support for discovering entirely new information. While this post should serve as a small example of what it can do, there is a huge list of available services on the SADI site. And if you’re looking for a simple Ruby client for accessing a service, try out the code in the gists above, or clone the Sinatra-based web interface I built.


Parsing with PubliSci Part 2: Being a good Semantic Citizen

Once you have created a basic converter using PubliSci’s Base reader class, it’s important that you work to improve the links between your dataset and others, and use terms and descriptions that others will understand.

The data_cube.rb module will generate URIs where required by the vocabulary or the syntax of RDF, and there are a number of configuration options to control this process, but in general a new namespace will be created for every dataset. This prevents semantic issues and namespace collisions in the output; if two file formats have a “Score” property, you could otherwise wind up with two datasets that have conflicting definitions of the term. However, it severely limits reuse and interoperability, which is very much against the spirit of RDF and the Semantic Web.

Fortunately, the generation code is smart enough to try to recognize when you already have a valid URI for a part of a triple, in which case it will use the raw input instead of generating a URI from it. This means you can force the generation code to use identifiers of your choosing, just by modifying your input data, and without needing to add any extra configuration options.

Universal, Resolvable Identifiers

Based on advice from Mark Wilkinson, one of my mentors, I’ve tried to use URIs from the identifiers.org system. The site provides persistent identifiers for many important bioinformatics concepts and databases, as well as access URLs and other helpful information.

Among the many benefits of using the site, a crucial one is the fact that all of its identifiers resolve to a page on their host service. For example, the URL http://identifiers.org/hgnc.symbol/RBFOX1 serves to uniquely identify the gene RBFOX1 in the MAF reader’s output, but pasting the link into your web browser will also take you directly to the HGNC page for RBFOX1. There’s a lot of other useful metadata provided by identifiers.org, all of which is also available as Turtle RDF, so I’d encourage you to have a look at it yourself.

I found identifiers for Hugo Symbol, Entrez ID, and dbSNP ID, but there may be others I’ve missed. The better linked and identified your data, the easier it will be to query and reuse. Once I’d found the right base URIs, adding them to the reader code was fairly simple; just a modification of the process_line method:
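
The change amounts to prefixing the relevant columns before they become observations. A sketch, with the column indices and the Entrez and dbSNP collection names as assumptions to double-check against identifiers.org (only hgnc.symbol appears verbatim in this post):

HUGO_BASE   = 'http://identifiers.org/hgnc.symbol/'
ENTREZ_BASE = 'http://identifiers.org/ncbigene/' # assumption: confirm the collection name
DBSNP_BASE  = 'http://identifiers.org/dbsnp/'    # assumption: confirm the collection name

def process_line(line)
  entries = line.chomp.split("\t")

  # Column positions follow the standard MAF layout; adjust if yours differs.
  entries[0]  = HUGO_BASE + official_symbol(entries[0])                # Hugo_Symbol
  entries[1]  = ENTREZ_BASE + entries[1] unless entries[1].to_s.empty? # Entrez_Gene_Id
  entries[13] = DBSNP_BASE + entries[13] if entries[13] =~ /\Ars\d+/   # dbSNP_RS

  # then hand the entries off to the observation generation code as before
  entries
end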

The one small exception to this is the possibility of HGNC synonyms, where the symbol used in the original MAF file is an accepted but not canonical way of identifying the gene. If these are not replaced with their ‘official’ equivalent, the resulting URIs will not resolve correctly!

SPARQL To The Rescue

For now, we can solve this by looking up the correct symbol using Bio2RDF, which has created a network of linked data in the life sciences that can be queried using SPARQL. You may have noticed the updated process_line method calls an official_symbol method. This will query one of the Bio2RDF endpoints and return the approved HGNC identifier for a given input.
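
A sketch of that lookup against Bio2RDF’s HGNC SPARQL endpoint; the endpoint URL and the approved_symbol and synonym predicates are written from memory, so confirm them against bio2rdf.org before use.

require 'rest-client'
require 'json'

BIO2RDF_HGNC = 'http://hgnc.bio2rdf.org/sparql' # assumption: confirm the endpoint URL

# Return the approved HGNC symbol for an input that may be a synonym, falling
# back to the input itself if nothing is found.
def official_symbol(symbol)
  query = <<~QUERY
    PREFIX hgncv: <http://bio2rdf.org/hgnc_vocabulary:>
    SELECT DISTINCT ?approved WHERE {
      { ?gene hgncv:approved_symbol ?approved ; hgncv:approved_symbol "#{symbol}" }
      UNION
      { ?gene hgncv:approved_symbol ?approved ; hgncv:synonym "#{symbol}" }
    } LIMIT 1
  QUERY

  response = RestClient.get(BIO2RDF_HGNC,
                            params: { query: query, format: 'application/sparql-results+json' })
  bindings = JSON.parse(response)['results']['bindings']
  bindings.empty? ? symbol : bindings.first['approved']['value']
end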

With a large input file, this remote query method could become too time consuming, so in the future it may be worthwhile to use an offline database of some sort to do the conversion. Of course, you could always download the entire dataset and load it into your own rdf store. This is one of the great advantages of RDF; since most storage software supports the same set of official serialization formats, the contents of one database can be easily dumped straight into another. And at 836,060 triples the hgnc dataset is well within the limits of most triple stores.

You can (and often should) also override the URI for a component property, if an equivalent concept is in use elsewhere. To demonstrate, I’ve changed the Hugo_symbol property to use the base identifiers.org/hgnc.symbol URI, which is as simple as changing the first entry in the COLUMN_NAMES array. I’m not sure if using this particular URI is the correct approach yet, so something different may be used in the gem’s version of the maf reader.
Here’s what the whole class looks like with these changes

Enumeration with Coded Properties

As discussed in a previous post, Data Cube’s coded properties are a good way to “bootstrap” semantics for certain types of data. Below I’ve just changed the Variant_Classification column to use coded properties, but since many of the columns in a MAF file have a specific set of valid values, representing other properties this way is a fairly simple process.

The only modifications needed here are adding two extra lines in the structure method to generate the coded properties’ structure, specifying which columns should be represented with codes (at the top of the generate_n3 method), and adding the list of possible codes using the tcga_codes method.
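
In sketch form, with the helper names mirroring the ones mentioned above and the list of classifications trimmed to a handful of the values the MAF specification allows:

# Which columns should be represented with codes (checked at the top of
# generate_n3), and their allowed values; only a few of the valid
# Variant_Classification codes are listed here.
def coded_columns
  %w{Variant_Classification}
end

def tcga_codes
  {
    'Variant_Classification' => %w{Missense_Mutation Nonsense_Mutation Silent
                                   Frame_Shift_Ins Frame_Shift_Del Splice_Site}
  }
end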

If you’re an expert at finding and using Semantic Web ontologies, the gem will hopefully make prototyping or creating an RDFization algorithm faster and easier, but you may also be familiar with a more domain specific format than Data Cube that is a better fit for your data. However, most scientists and other people who want to publish large quantities of data are not usually familiar with these options. Just getting started with RDF requires a dedicated effort to understand its syntax and data model, which can seem very different from the types of structures most programmers are used to. And this leaves aside the issue of making proper use of existing concepts, and ensuring your data are accessible to other people or algorithms.

Even for me, having worked on a Semantic Web project all summer and with ready access to the direct advice of experts, the sheer amount of tools and vocabularies available is daunting, and I still feel as though I’ve just scratched the surface on what is possible with these technologies.

Parsing with PubliSci Part 1: How to get your data into the Semantic Web

One of the core functions of the PubliSci gem is to convert data from non-semantic formats to RDF so that it can be loaded into a triple store and accessed via SPARQL queries. The gem provides a growing number of Reader classes to ‘triplify’ formats using the Data Cube vocabulary, such as CSV, Weka ARFF, and some data types from the R statistics language, as well as a DSL to access these readers and load their output into various external stores. However, there are many, many common formats that aren’t yet supported, so the gem also provides a “Base” reader class which can be extended to create a parser for the file format of your choice.

To wrap up the summer and show an application of my gem, I’ve started working with my mentors to convert data from the Mutation Annotation Format, used by The Cancer Genome Atlas, to RDF and access it with a SPARQL-backed DSL. The RDF converter and most of the underlying queries have been implemented in their basic form, so I thought a writeup of how they were created would illustrate the general process of building a PubliSci::Readers class using the tools provided by my gem.


This post got a bit long, so I’ve decided to break it up into two separate posts, which I’ll put up at the same time, followed by a third on how to actually use the data you’ve generated and integrate it with different services. For this post, I’m just going to focus on getting a working parser class together which generates valid RDF.

The .maf Format

MAF is a fairly simple format, with 16 tab-delimited columns and the possibility of comments prefixed with a pound symbol. Each line of the file represents a mutation in a particular gene of a tumor sample, along with other relevant information such as the type of mutation, the gene’s identity in various databases, and validation information. The files can get a bit long, but using the CSV reader in Ruby’s standard library and the helpful methods provided by the PubliSci::Readers::Base class, it is pretty easy to efficiently convert a MAF file to valid, useful RDF.

Getting Set Up

First of all, if you’re following along at home you’ll need to install the bio-publisci gem, and add require “bio-publisci” to the first line of your file. In another post, I’ll talk about how you can add the class you’ve created to the PubliSci DSL’s DataSet.for method, making it possible to dump your output into any repository supported by ruby-rdf.

I’ll go into more detail about the process and methods below, but here’s the final MAF class we’ll end up with

First Steps

It’s always nice to get a little code in place to organize my thoughts. To get started, I’ll just create a simple outline of what we want our reader to do.
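
Something like the skeleton below; the column list follows the standard MAF specification and the method names match the ones discussed over the rest of the post, but the class that actually ships in the gem may differ in detail.

require 'bio-publisci'
require 'csv'

module PubliSci
  module Readers
    class MAF < PubliSci::Readers::Base
      # The 16 tab-separated columns of a standard MAF file.
      COLUMN_NAMES = %w{Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
                        Start_Position End_Position Strand Variant_Classification
                        Variant_Type Reference_Allele Tumor_Seq_Allele1
                        Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status
                        Tumor_Sample_Barcode}

      def generate_n3(input_file, options = {})
        # generate the Data Cube structure, then turn each line into observations
      end

      def structure(dataset_name)
        # Turtle for the dataset and its structure definition
      end

      def process_line(line)
        # one observation per (non-comment, non-header) line
      end
    end
  end
end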

Eventually I intend to make this reader accessible from the PubliSci Dataset DSL, so I put the generation code in the generate_n3 method, which the gem will expect to be available when it decides to use this reader to convert a file. I’ve implemented registration of external classes in the DSL, but I haven’t finalized the way it works yet, so I won’t post an example here. If you’re interested, there’s a spec in the gem’s Github repository which demonstrates its use.

The next step is choosing which of the columns to make measures and which dimensions. This is largely up to your interpretation of the data, although there are a few constraints imposed by the Data Cube vocabulary which I’ll go into more detail about below.

No Coding Until You’ve Finished Your Tests!

Although I often stray from the path, it’s usually best to start with tests, then write the code to make them pass. I tend to “forget” this every time I start a project, but it really saves a lot of time and headaches to have a decent spec to work from. For now, I’ll just use one simple test to make sure some valid Turtle triples are being generated.
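
A first pass at that spec might look like the sketch below; the fixture path and the in_memory option are assumptions standing in for however your generate_n3 method exposes its output.

require 'bio-publisci'
require 'rdf'
require 'rdf/turtle'

describe PubliSci::Readers::MAF do
  it "generates parseable Turtle" do
    turtle = PubliSci::Readers::MAF.new.generate_n3('spec/fixtures/example.maf', in_memory: true)

    graph = RDF::Graph.new
    expect {
      RDF::Turtle::Reader.new(turtle) { |reader| reader.each_statement { |st| graph << st } }
    }.not_to raise_error

    expect(graph.count).to be > 0
  end
end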

Making it Work

First I came up with a few expressions to make sure each of the columns is assigned to a measure or dimension, and generate a dataset name based on the input file name by default (you could add this code to the generate_n3 method)
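
In sketch form, inside generate_n3 and using its parameters; which columns end up as measures rather than dimensions is a modelling choice, so the split below is only an example.

# Defaults if the caller hasn't specified anything: pick the measure columns
# (here just the positions, as an example), treat the rest as dimensions, and
# derive a dataset name from the file name.
@measures     ||= %w{Start_Position End_Position}
@dimensions   ||= COLUMN_NAMES - @measures
@dataset_name   = options[:dataset_name] || File.basename(input_file, '.maf')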

Next, create a method to generate the structural information for our Data Cube RDF. This should take the form of a simple Turtle string, and can be generated using the methods provided by the data_cube.rb module, which is included in the PubliSci::Readers::Base class. For more information about the semantics of the Data Cube format, check out the official specification, or earlier posts on this blog.
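
A sketch of what the structure method needs to produce, written out by hand here so it is self-contained; in the actual reader the data_cube.rb helpers generate the equivalent Turtle, so treat the namespace and layout as illustrative.

def structure(dataset_name)
  base = "http://example.org/publisci/#{dataset_name}" # placeholder namespace

  header = <<~TTL
    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <#{base}/dataset> a qb:DataSet ;
      qb:structure <#{base}/dsd> .

    <#{base}/dsd> a qb:DataStructureDefinition .
  TTL

  components = (@dimensions + @measures).map do |column|
    kind = @measures.include?(column) ? 'qb:measure' : 'qb:dimension'
    <<~TTL
      <#{base}/dsd> qb:component [ #{kind} <#{base}/properties/#{column}> ] .
      <#{base}/properties/#{column}> rdfs:label "#{column}" .
    TTL
  end

  header + components.join
end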

Then I’ll write a method to parse the individual lines of the file, which should process each entry and pass it to data_cube.rb’s observations method, skipping over comments and the header line. The observations method requires data to be formatted as a hash from measure/dimension to an array of values, which can be accomplished by zipping the column names and line entries together, coercing it into a hash, and wrapping each value of the hash in an array.
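
In code, that zipping step looks roughly like this; the hash it builds is what data_cube.rb's observations method expects, and the exact call into that helper is left to the gem since its signature isn't reproduced here.

def process_line(line)
  return nil if line.start_with?('#') || line.start_with?('Hugo_Symbol') # comments and header

  entries = line.chomp.split("\t")

  # Zip the column names with this line's entries, coerce the pairs into a
  # hash, and wrap each value in an array.
  Hash[COLUMN_NAMES.zip(entries)].each_with_object({}) do |(column, value), hash|
    hash[column] = [value]
  end
end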

Finally, we’ll put it all together and call these two methods from the main generate_n3 method. For small files and testing purposes, we’ll add the option to store the resulting strings in memory and print them out, but with most MAF files you may run out of memory trying to do this, so by default we’ll send the output straight to a file.
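
Putting the pieces together, generate_n3 might be organized like this; the in_memory option, output file naming, and the simple_observation stand-in (which takes the place of data_cube.rb's observations helper) are all assumptions of this sketch.

def generate_n3(input_file, options = {})
  @dataset_name = options[:dataset_name] || File.basename(input_file, '.maf')
  @measures   ||= %w{Start_Position End_Position}
  @dimensions ||= COLUMN_NAMES - @measures

  if options[:in_memory]
    # Small files and tests: build the whole Turtle string and print it.
    output = structure(@dataset_name)
    File.foreach(input_file) do |line|
      data = process_line(line)
      output << simple_observation(data) if data
    end
    puts output
    output
  else
    # Default: stream straight to a file, since a full MAF file may be too
    # large to hold comfortably in memory.
    outfile = options[:output_file] || "#{@dataset_name}.ttl"
    File.open(outfile, 'w') do |file|
      file << structure(@dataset_name)
      File.foreach(input_file) do |line|
        data = process_line(line)
        file << simple_observation(data) if data
      end
    end
    outfile
  end
end

# Simplified stand-in for data_cube.rb's observations helper.
def simple_observation(data)
  base = "http://example.org/publisci/#{@dataset_name}"
  @observation_count = (@observation_count || 0) + 1
  properties = data.map { |column, values| "  <#{base}/properties/#{column}> \"#{values.first}\" ;" }

  "<#{base}/observations/obs#{@observation_count}> a <http://purl.org/linked-data/cube#Observation> ;\n" +
    "  <http://purl.org/linked-data/cube#dataSet> <#{base}/dataset> ;\n" +
    properties.join("\n") + "\n.\n"
end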

Now would also be a fine time to write out a better and nicer looking spec which examines the output more closely.

One Last Thing

The code above will generate valid turtle RDF that can be loaded into any triple store and used in SPARQL backed applications, but there’s certainly room for improvement. First of all, it’d be useful to be able to filter our queries by individual patient (a component of the Tumor_Sample_Barcode property).

SPARQL is quite powerful so you could certainly do this using it alone, with regular expressions for example, but it’d be nice for the patient component of the barcode to be represented explicitly in the data. To add this to the RDFization code, you can just add a sample_id and patient_id value to the column list, and an extra step to the process_line method to parse out this information.
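
A sketch of that extra step; TCGA barcodes name the patient in their first three dash-separated fields and the sample in the first four, though the convention is worth double-checking against the TCGA documentation.

# Pull patient and sample identifiers out of a TCGA barcode such as
# "TCGA-A1-A0SB-01A-11D-A142-09".
def barcode_ids(barcode)
  fields = barcode.split('-')
  { patient_id: fields[0..2].join('-'),  # e.g. TCGA-A1-A0SB
    sample_id:  fields[0..3].join('-') } # e.g. TCGA-A1-A0SB-01A
end

# In process_line, the parsed values are appended so they become extra columns:
#   ids = barcode_ids(entries[15]) # Tumor_Sample_Barcode
#   entries << ids[:sample_id] << ids[:patient_id]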

Here’s what the reader class will look like after the change (this is the same as the first gist in this post)

Iterate and Improve

There’s a lot more to generating a good RDF version of a dataset than simply getting the syntax right and being able to run queries. A number of important principles and practices must be followed to ensure your data is useful to the world in general, rather than just in some narrow application. That’s what the Semantic Web is all about after all! To see how you can continue to improve the generation code detailed here, see the next post in this series.

Bio-PubliSci

Having reached the halfway point for GSOC last week, we’ve been asked to summarize what our gems will deliver by the end of the summer, and what our plans are for them after that.

On that pretext, I’d also like to announce that my gem has been officially released in alpha form, and named bio-publisci. Its goal is to provide a framework for publishing scientific results and data to the Semantic Web, which provides a unified data representation format, query language, integration standards, and a focus on using machine understanding to deal with the vast quantities of data being published today. For the version 1.0 release of the gem in September, you can expect to see


A Domain Specific Language for Scientific Results

  • A clean, simple interface for publishing results and datasets to the semantic web
    Describe your data and results in a descriptive language implemented in Ruby, and the gem will generate RDF formatted output from it.

    Using its simple syntax, you can RDFize your raw data, include basic authorship and publishing metadata, and add information about your data’s provenance. All of the methods declare objects which have their own independent serialization functions, so since the DSL is implemented in Ruby you are free to mix and match your output set, include the DSL in your own programs or access the underlying methods, and make use of the full range of Ruby syntactic sugar, clever tricks, and metaprogramming in your scripts if you so desire.

    Every component is designed to be optional, so if you just need dataset or provenance generation then you can still use the gem and the DSL.

  • Serialize output as human readable Turtle RDF, or store it in a dedicated triple store
    RDF data can be encoded in a number of different formats, which are designed for various purposes such as compatibility with existing standards, simplicity, or terseness and human readability. Readability is the goal of Turtle, the Terse RDF Triple Language, which is the primary serialization format supported by my gem. Turtle files are relatively human readable as plaintext, since URIs can be abbreviated using prefixes and grouping, and literal types are often implied and so not necessary to include.
  • Use built in helpers and symbols, or custom predicates and resources
    In the example gist above, all of the resources involved are generated under the single base URI http://example.org. In ‘The Wild’ of open-world semantic data, this may make it difficult to integrate existing data or unnecessarily constrain how you’d like to represent your data. Fortunately, anywhere you see a symbol, which starts with a “:” (besides the initial label for the object), you can replace it with a string representing a URI, which will be used instead of the automatically generated URI when the object is accessed or serialized. You can also add custom predicates (properties) using the “has” method, and either the built in vocabulary helper, an RDF::Vocabulary object, or a raw URI.
  • Pure Ruby, including dependencies
    The gem and all of its requirements are pure Ruby libraries, so it is compatible with all current interpreters, and also deployable to any system where Java is available (even if Ruby isn’t) using Warbler.

Describe Data using well known standards

  • Basic metadata using the Dublin Core Terms
    See Data for your data

  • Provenance using the PROV ontology
    See Data for your data
  • Dimensional and tabular data using the Data Cube vocabulary
    See Sparkle Cubes

  • Readers and writers to and from a variety of common formats
    Receive input from R, as a CSV file, or in Weka’s ARFF format, and go in the other direction from RDF to domain files. Over the rest of the summer, I will also be adding support for relevant SciRuby libraries and GSOC projects, such as NMatrix to Data Cube conversion, plotting with Plotrb, and Statsample integration.

Integration with Ruby RDF

  • Zero configuration in-memory repository
    The world of triple storage software has yet to see its SQLite equivalent: a tool that is drop-dead simple to set up and a perfect fit for its domain. There are commercial offerings such as OpenLink Virtuoso, which may be feature rich and easy to set up, but are not worth the expense for simple projects, and open source projects such as Sesame or 4store, which are free but often either difficult to set up or missing crucial features such as a built in SPARQL endpoint. This makes it very difficult to get started working with the Semantic Web, since you may have to spend hours setting up software and learning new standards just to execute a simple query. The rdf gem does not provide this be-all end-all storage solution, but it does help alleviate the startup cost of using triple based storage by providing an in-memory repository object, RDF::Repository, which can be queried using basic graph patterns or the SPARQL language. While it will choke on moderately sized datasets of a few thousand triples, it handles small datasets well and supports the use of RDF in Ruby programs (see the sketch just after this list). To make things even better, the interface it defines has been implemented for many dedicated triple stores, so once you need something more powerful you can change over with almost no reconfiguration. The DSL I’ve written includes a “to_repository” method, which can be added at the end of a script to send the output directly to the repository, making it radically easier to go straight from a DSL script to a working, persistent RDF dataset with no configuration whatsoever.
  • Minimal configuration storage using triple stores and NoSQL databases
    Including:
    – Sesame
    – 4store
    – AllegroGraph
    – Virtuoso
    – MongoDB
    – DataObjects
    Ruby RDF defines an interface for using triple stores and other graph-capable persistence software as an RDF::Repository object. Usually all that these require for configuration (once the actual repository is installed and set up) is a URI to locate the database, and you’re able to use a dedicated persistence tool to store your data.
  • SPARQL queries using the sparql and sparql-client gems
    All RDF::Repository objects can be queried using the SPARQL language, the official query language for the Semantic Web. This can be done either in raw form, with the sparql gem or the helpers in bio-publisci, or using the relational algebra provided by the sparql-client gem.
  • An HTTP interface and API written using Sinatra
    Using these libraries and tools, I’ve created a simple HTTP interface that allows you to test DSL scripts, view the Turtle output, and execute SPARQL queries. Because of the excellent tools in the Ruby RDF project, and the generation and description capabilities of the DSL, it is possible to implement this sort of functionality in a lightweight server using Sinatra, which is deployable to any Rack compliant host. I will soon post a link here to the demo page, which isn’t much to look at now but does have a working implementation of all the aforementioned capabilities. I’m sharing it with my mentors, but since the DSL is ultimately just raw Ruby I need to add some more security to the server before I make it public. After I’ve done this and tightened up the API, you’ll be able to use the site to experiment with publication scripts and SPARQL queries, or as a web service for converting and publishing your data. Sinatra is simple and lightweight enough that an end user could host their own publication server, which has a number of interesting potential applications aside from making development easier. Additionally, the Ruby RDF project includes some interesting pieces which will now be easier to integrate, such as the object mapping gem spira, the goal of which is to offer an RDF based replacement for the Model layer of Rails and similar frameworks, implemented using ActiveModel’s interface.
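
As a small illustration of the zero-configuration repository mentioned in the first bullet above, here is a self-contained sketch using the rdf and sparql gems; the data is made up.

require 'rdf'
require 'sparql'

repo = RDF::Repository.new

# Toss a couple of statements in; no server, no configuration.
ex = RDF::Vocabulary.new('http://example.org/')
repo << RDF::Statement.new(ex.dataset1, RDF.type, RDF::URI.new('http://purl.org/linked-data/cube#DataSet'))
repo << RDF::Statement.new(ex.dataset1, RDF::RDFS.label, RDF::Literal.new('An example dataset'))

# Query it with SPARQL, entirely in memory.
solutions = SPARQL.execute(<<~QUERY, repo)
  PREFIX qb:   <http://purl.org/linked-data/cube#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?dataset ?label WHERE {
    ?dataset a qb:DataSet ;
             rdfs:label ?label .
  }
QUERY

solutions.each { |solution| puts "#{solution[:dataset]} #{solution[:label]}" }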

If you’d like to see some more heavily annotated examples of the DSL that explain how to use the keywords and blocks, have a look at one of these gists. Most of the methods reflect the underlying ontology’s predicates, but since naming is one of the most important parts of the DSL I’m trying to provide shorter aliases that better fit the Ruby idiom, so I’d love to hear any advice anyone has on my choice of labels.

The Future

I’m very excited about the possibilities for this kind of tool and plan to continue improving it after the end of the summer. RDF offers a well designed and widely accepted format which is great for publishing scientific results in a searchable and unambiguous manner, and I think it is one of our best hopes for dealing with the unfathomable amount of data being generated in biology, physics, and many other fields today. Unfortunately its basic concepts and data model take some time to wrap your head around, and tabular data software has a good 50 year head start on triple stores, so there remain many barriers to its adoption. I believe that by using the cleanness and expressivity of Ruby these barriers can be lowered, and in some cases eliminated. By the end of the summer, I’ll have written a gem with a friendly and flexible interface for converting data, adding much of the metadata relevant to scientific publication, and either interacting with it from within Ruby, serializing it, or publishing it to a dedicated store. But there is a lot more I’d like to do after the summer, once version 1.0 has been released, such as

    • Assertions
      One of the key components of a scientific paper is the basic, underlying statement it is trying to make. This may be a statistical correlation that’s been observed, a simple statement of fact such as a gene sequence, or someone’s opinion of the effects of the peer review process on scientific discourse. These assertions are the result of a provenance chain, and potentially a set of supporting evidence or data, both of which are represented in my gem, but assertions are not explicitly a part of it. There are a number of interesting models for representing assertions in RDF, such as Nanopub, which I’d personally like to try out, but in the interest of having a solid data and metadata DSL by the end of the summer, I don’t want to commit to adding this until the fall.
    • More import methods
      RDF is by nature very friendly to the integration of different datatypes. Although the provenance and metadata generation modules are designed to apply equally well to publishing non-RDF data, or data generated using a different technique, it would be good to have a standard place in the DSL to attach other programs or specify flat files. This would allow easy integration with cool existing projects such as Biointerchange.
    • Rails stack integration
      One thing that would really help with the adoption of the Semantic Web is further integration with popular frameworks. In the right hands, these tools could inspire entirely novel ways of using the Model layer of an MVC application. Aside from proselytization, this also allows familiar patterns such as validations and callbacks, not to mention a more comfortable object oriented interface, for interaction with RDF data. In the fall, when I can justify more time experimenting with these kinds of things, I’d like to work on building rich RDF backed applications using Sinatra and Rails.
    • Novel interaction methods
      It’s remarkable how many people stare blankly at me when I talk about the Semantic Web, then instantly understand when I show them a couple of drawings. I’d like to explore new ways of interacting with RDF graphs based on visual metaphors and other more “human-oriented” interfaces. Having a web service that can handle all the data formatting behind the scenes would be an important part of this, as the tools available for in-browser visualization and interaction are becoming ever more powerful and widely used.

    • Reasoning based property assignment
      The way the DSL assigns connections between elements is currently more or less hardcoded, according to my understanding of the vocabularies involved. They are, in fact, described using the machine understandable OWL Web Ontology Language. I’d really like to try building the validations and property assignment for a new DSL component directly from an OWL ontology, since I think Ruby’s metaprogramming features are well suited to this and it could make the DSL extensible to the point of being a framework in and of itself.
    • Data Linking
      As of now there’s no facility for linking concepts in the RDFization to resources such as DBpedia and Bio2RDF, which would make for a much more informative publication, and is standard practice in the Semantic Web world. Although it’d be pretty easy to add these links by hand to the Turtle output, I’d like to build that kind of functionality into the gem, to find and suggest linkages and annotate existing datasets with them automatically.

Data for your data

One of the key applications of RDF is representing and disseminating data about other datasets, also known as metadata. This can be all sorts of things, from the publisher or subject of a document to the file format of a video, but in bioinformatics, and science in general, you’re often most interested in how you can make use of the dataset in your domain. This might include getting information on the particular species or region your dataset refers to, or more complex questions such as where and under what terms you can access it, what process was used to create or derive it, and ultimately whether or not you can “trust” it. Although RDF and the Semantic Web don’t automatically answer these questions, they provide a powerful and widely used platform on which to do so.

This is the appropriate metadata reference, not Inception

To begin with, I am using two ontologies to represent metadata. One is concerned with general metadata, such as author and subject, and the other is more focused on the process used to create the data. For now the interface is a little clunky, since it’s just the basic generation functions. Later on they’ll be wrapped in classes that provide a friendlier interface, and probably decomposed into smaller functions, similar to the Data Cube part of the gem.

Dublin Core

The Dublin Core vocabulary is a flexible and widely used standard for representing basic metadata. DC is fairly venerable by the standards of the Semantic Web; it traces its roots back to a metadata workshop in Dublin, Ohio in 1995. Since then it has been developed and maintained by an organization known as the Dublin Core Metadata Initiative. It is probably the most ubiquitous vocabulary outside of the core set of RDF ontologies, and has been ratified as an ANSI and ISO standard.

At the moment, my gem supports some of the most basic elements of DC, such as author and publication date. The method for this takes a hash and writes the DC terms for any of the elements that are specified, attempting to generate or infer any missing components
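
In sketch form, a generation function along those lines might look like the following; the method name and the set of supported keys are assumptions, and only standard Dublin Core terms appear in the output.

# Take a hash of fields, fill in what we can, and emit Turtle using DC terms.
def basic_metadata(dataset_uri, fields = {})
  fields[:date] ||= Time.now.strftime('%Y-%m-%d') # infer a missing issued date

  term_map = {
    creator:   'http://purl.org/dc/terms/creator',
    title:     'http://purl.org/dc/terms/title',
    publisher: 'http://purl.org/dc/terms/publisher',
    subject:   'http://purl.org/dc/terms/subject',
    date:      'http://purl.org/dc/terms/issued'
  }

  fields.map { |key, value|
    "<#{dataset_uri}> <#{term_map[key]}> \"#{value}\" ." if term_map[key]
  }.compact.join("\n")
end

puts basic_metadata('http://example.org/dataset/maf',
                    creator: 'Example Author', title: 'Triplified MAF dataset')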

Using this method will add some basic information to any dataset created with the gem, as shown in this cucumber test:

Publisher and subject information are also supported, although there’s still some work to be done bridging the gap between informal subjects and those defined under various ontologies, which is really more what the ‘subject’ term was designed for.

PROV

The PROV ontology is a more specialized standard than Dublin Core, designed to represent provenance metadata: the sources of and processes used to create a dataset, the people, software, or organizations involved in creating it, and which data elements used or were derived from others. PROV was developed by a W3C working group given the goal of creating a unified standard for publishing provenance information, where before a patchwork of standards existed, each missing some important component of provenance representation.

Essentially, PROV is about the interplay of Agents, Activities, and Entities, with Agents engaging in Activities to generate Entities or derive them from other Entities. All of these elements can be either digital (software agents and algorithmic activities), physical (lab technicians and in person data collection), or some combination of the two. There are additional specializations of these classes, as well as a suite of terms to describe their relationships with one another.
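
A small Turtle sketch of that interplay, using standard PROV-O terms and made-up resources, wrapped in a Ruby heredoc so it can sit alongside the other examples:

provenance_example = <<~TTL
  @prefix prov: <http://www.w3.org/ns/prov#> .

  # An agent (here a piece of analysis software) ...
  <http://example.org/agents/maf_parser> a prov:SoftwareAgent .

  # ... carries out an activity ...
  <http://example.org/activities/triplification>
      a prov:Activity ;
      prov:wasAssociatedWith <http://example.org/agents/maf_parser> ;
      prov:used              <http://example.org/entities/original_maf_file> .

  # ... which generates a new entity derived from the input entity.
  <http://example.org/entities/triplified_dataset>
      a prov:Entity ;
      prov:wasGeneratedBy  <http://example.org/activities/triplification> ;
      prov:wasDerivedFrom  <http://example.org/entities/original_maf_file> ;
      prov:wasAttributedTo <http://example.org/agents/maf_parser> .

  <http://example.org/entities/original_maf_file> a prov:Entity .
TTL

puts provenance_example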

These can get a little complicated, so I’ve been tracking my understanding of it with a diagram of the relationship between elements. This is still a work in progress, so if anything looks off to you I’d be happy to hear about it!

Basic provenance

This is just the basic provenance for one entity, so it’s pretty comprehensible, but the whole point of the vocabulary is to link different entities and datasets with one another, which can get a little more complicated.

A longer provenance chain (full version)

My mentors and I have agreed on the importance of being able to generate metadata for non-RDF resources, so the diagram reflects the notion that the triplified dataset may or may not be present, along with any entities or activities in the provenance chain of the main dataset. Using this system, quite a bit of useful information can be generated from a fairly small set of inputs

This better reflects the current capabilities of my code, but it’s still not a complete use of the ontology. The connections between entities, activities, and agents need not be linear, and more than one entity could be the object of a “used” or “wasDerivedFrom” relationship. This is something I’ll be working toward for during the rest of the summer, but for now this scheme provides a reasonable way to represent the provenance of many workflows.

Visualizations and Validations

Earlier this week I met with Karl Broman, a biostatistician at UW Madison who created the r/qtl library I’ve been working with in the last few weeks, about another project of his that could benefit from some Semantic Web backing. Karl has created a number of interesting visualizations of bioinformatics data. His graphs make use of the d3 javascript framework to display high-dimensional data in an interactive and intuitive way. The focus on dimensional data, as well as the fact that most of his datasets exist as R objects, fits naturally with the Data Cube generators I’ve created already.

A screenshot of the cis/trans plot. Make sure you check out the real, interactive version.

To begin with, I will be converting Karl’s cis/trans eQTL plot to pull its data from a triple store dynamically. Currently there are two scripts that process an r/qtl cross and a few supporting dataframes to create a set of static JSON files, which are then loaded into the graph. Using a triple store to hold the underlying data, however, the values required by the visualization can be accessed dynamically based on the structure of the original R objects. As of now I have successfully converted each of the necessary datasets to RDF, and am working on generating queries that Karl’s d3 code can use to access it through a 4store SPARQL endpoint (which supports JSON output).

The objects involved are quite large, and the Data Cube vocabulary (really RDF in general) is fairly verbose in its representation of information, so I am working on loading what I have into the right databases and reducing redundancy in the output. However, if you’d like some idea of how the data are being represented and accessed, I’ve set up a demo on Dydra with a subset of the data and some example queries.

Testing and Validation

In addition to working with Karl, I’ve taken time to refactor my code toward creating Data Cube RDF for more general structures. Originally the main module worked off of an Rserve object, but I’ve redone everything to use plain Ruby objects, which the generator classes are responsible for creating. To support this refactoring, and the creation of new generators for data types such as CSV files, I’ve begun using RSpec to build the spec for my project. I’ve added tests against reference output and syntactical correctness, but these are respectively too brittle and too permissive to ensure novel data sets will generate valid output.

To this end, I have implemented a number of the official Data Cube Integrity Constraints as part of the spec. The ICs are a set of SPARQL queries that can be run on your RDF output to ensure various high level features are present, and they go beyond simple syntax validity in ensuring you have properly encoded your data. I’ve had to make a few modifications, since the ICs are slightly out of date, and some of the SPARQL 1.1 facilities they make use of aren’t fully supported by the RDF.rb SPARQL gem. Aside from their place in the test suite, the ICs could also be useful as part of the main code, providing a way for the end user to ensure that their data is compatible with Data Cube tools like Cubeviz.
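
As an illustration of how the ICs slot into the spec, here is a sketch that runs a paraphrase of the "every observation has exactly one data set" constraint against an in-memory repository; the query is adapted from memory of the specification rather than copied from it, and may need the same adjustments described above.

require 'rdf'
require 'sparql'

# Paraphrase of the integrity constraint that every qb:Observation belongs to
# exactly one qb:DataSet; a true result from the ASK query signals a violation.
IC_OBSERVATION_DATASET = <<~QUERY
  PREFIX qb: <http://purl.org/linked-data/cube#>
  ASK {
    {
      ?obs a qb:Observation .
      FILTER NOT EXISTS { ?obs qb:dataSet ?dataset }
    } UNION {
      ?obs a qb:Observation ;
           qb:dataSet ?dataset1, ?dataset2 .
      FILTER (?dataset1 != ?dataset2)
    }
  }
QUERY

def violates_observation_dataset_ic?(repository)
  SPARQL.execute(IC_OBSERVATION_DATASET, repository)
end

# In an RSpec example:
#   expect(violates_observation_dataset_ic?(repository)).to be_falsey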

Sparkle Cubes

So you have your QTL analysis, GDP data, or Bathing Water information represented as a Data Cube. You can load your data into a triple store, make some pretty graphs in Cubeviz, and you or anyone else can get a pretty good idea of what it looks like by reading the native n3 formatted encoding. Neato. But you’re wondering how this is really any better than the other formats you’re already familiar with; sure it’s easier to load and share than the data in a relational database, but there are plenty of tools to help with that around already. Perhaps the more relevant comparison is to flat file formats such as CSV, since it’s still the de-facto way of sharing bioinformatics data. Why bother learning a new format that is not yet widely used? The most important reason, the “Semantic” part of The Semantic Web, will be the subject of another post, but today I’d like to write a little about another important technology, which you can already use to take control of your Cube formatted data and really make it shine (sorry): SPARQL.


SPARQL, an example of everyone’s favorite internet neologism, the recursive acronym, stands for SPARQL Protocol and RDF Query Language. As its name suggests, its main function is in querying RDF stores. Its general shape should look somewhat familiar to SQL users, but it is designed to create queries based on the “Subject Predicate Object” format native to RDF. Instead of simply listing the elements of a SPARQL query, let’s go through an example (from Wikipedia):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}

In brief, this query will return the names and emails of everyone in the database, which we assume contains records specified according to a particular vocabulary. If you’d like to know more details, read on, otherwise skip to the next section to see how we’ll apply this to our Data Cube data.

The first thing you’ll see is a PREFIX definition, which allows you to specify which vocabulary a resource is defined under. Using the prefix is just a shortcut to save space; you could replace every instance of “foaf:” with “http://xmlns.com/foaf/0.1/” and have an equivalent query. The foaf (friend of a friend) vocabulary is one of the oldest Semantic Web vocabularies. It is used to define data about people and social networks, such as their names, emails, and connections with each other. If you’d like to know more, all you have to do is browse to the URL, and you can find a detailed, human readable specification for the vocabulary. This is one nice convention in the Semantic Web community; when you browse to a URI for a vocabulary, you will frequently be redirected to a human readable version of it. This makes it easy to learn about and use new vocabularies, and to share ones you develop with others.

Next comes the SELECT line. This is one area that will look particularly familiar to SQL users, although more complex queries may not be. In this case, all we’re saying is we want to grab the parts of the data specified by “?name” and “?email” in the next part of the query. In SPARQL, tokens beginning with “?” are considered variables, so they could be named anything, but as with other languages its good practice to name them based on what they represent.

Last is the WHERE block, which usually makes up the bulk of the query. Here you can see three conditions specified in Subject Predicate Object form. If you’ve been reading along in previous posts, you may be able to understand their meanings, but even if not it’s fairly comprehensible. We’re looking for an object which is a foaf:Person, which has a foaf:name and a foaf:mbox. Although there are shortcuts which can make queries less verbose than this, the WHERE block is essentially just a list of RDF statements which you want to be true for all the data you are selecting.

Once the WHERE block returns the objects it specifies, the SELECT block picks out the portions the user has asked for, in this case the name and email, and returns them.

SPARQL and Data Cube

So now you know the basic structure of a SPARQL query, but how is it useful for the data we created in previous posts? In a multitude of ways, as it turns out. We’ll be using the following prefixes in the example queries. Note that if you were to run these queries yourself, you would need to include the prefixes at the beginning of every query, but in the interest of brevity I’ll be omitting them for the rest of the post.

PREFIX :     <http://www.rqtl.org/ns/#> 
PREFIX qb:   <http://purl.org/linked-data/cube#> 
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX prop: <http://www.rqtl.org/dc/properties/> 
PREFIX cs:   <http://www.rqtl.org/dc/cs/>

Just using relatively familiar syntax, we can do things like select all of the data on one chromosome:

SELECT ?entry
WHERE {
  ?entry prop:chr 10.0
}

Which returns a list of observation URIs

entry
http://www.rqtl.org/ns/#obsD10M298
http://www.rqtl.org/ns/#obsD10M294
http://www.rqtl.org/ns/#obsD10M42_
http://www.rqtl.org/ns/#obsD10M10
http://www.rqtl.org/ns/#obsD10M233

These could be used as the subjects of further queries, but while the observation naming scheme I chose gives you a reasonable idea of what the resources represent, you don’t have any guarantee that observation URIs will be human readable. One way would be to query the rdfs:label predicate of the observation, but if you already know which identifying properties you’re interested in, you could run a query such as the following to select them:

SELECT DISTINCT ?chr ?pos ?lod
WHERE {
  ?entry prop:chr 10.0;
         prop:chr ?chr;
         prop:pos ?pos;
         prop:lod ?lod.
}

Which yields

chr pos lod
10 40.70983 0.536721
10 0 0.08268
10 24.74745 0.759688
10 61.05621 0.254428
10 48.73004 0.584946

You may have noticed the semicolons and slightly different shape of the last query. SPARQL includes a few helpers to make your queries less verbose, in this case telling the parser that each statement separated by a semicolon, as opposed to a period, has the same subject.

While you could simply use SPARQL as a means of accessing your RDF back-end, slicing out the data you need and working on it in R or some other dedicated tool, you can also use it alone for many basic analysis tasks. As an example, here’s a query that uses a few keywords we haven’t seen before to select entries with a high LOD score, sort them in descending order, and give them human readable names:

SELECT DISTINCT ?name ?lod
WHERE {
  ?entry prop:lod ?lod.
  ?entry prop:refRow ?row.
  ?row rdfs:label ?name.
  FILTER(?lod > 4)
}ORDER BY DESC(?lod)

Yielding

name lod
D5M357 6.373633
D5M83 6.044958
D5M91 5.839359
D13M147 5.819851
D5M205 5.728438
D5M257 5.592882
D5M307 5.352222
D5M338 4.805622
D13M106 4.62314
D13M290 4.511684
D13M99 4.408392

Some of the predicates involved may be a little opaque, but most of the keywords (capitalized as a matter of convention) are pretty descriptive of their function. There’s a lot more depth to SPARQL than is on display here, but nonetheless we are performing the sorts of queries an actual researcher would, without having to learn anything too complex or engage in any unpleasant contortions to only grab the data we want. The latest SPARQL standard (v 1.1) includes support for many more specific graph search patterns as well as a facility for updating your data, but everything you’ve seen in this post should work just fine with any SPARQL endpoint available. 

This cat’s name is sparql. She is, alas, neither a query language nor a nascent web standard (credit: danja, http://www.flickr.com/photos/danja/236712101/)

If your eyes glazed over through the example and you’re only paying attention now because of the unexpected cat picture, the key point to remember is that we can use these same techniques for any sort of data set in the Data Cube format, be it genetics, finance, or public health. We could select a subset of the information that we can import into our local data store and visualize using tools like Cubeviz, or we can use query patterns to pick out just the information that interests us. Future blog posts will talk about some of the more complicated operations you can perform, and how the language makes it easy to bring together information from multiple sources, but I hope this sample gives you an idea of the usefulness of SPARQL, and why you’d want your data mapped to the Data Cube format.

This post, focusing more on the fundamental mechanics of querying Data Cube encoded information, barely touches on the “Semantic” aspect of The Semantic Web; while we do have some meaningful information about what’s a dimension, a measure, and so on, a lot of what makes RDF related technologies powerful is missing. I will soon be adding context specific semantics compliant with the Qtab format, so any other software which understands the format can automatically integrate information from Data Cube resources. Once this process is finished, I will begin creating tools to map general Ruby objects into this format, and help end users decide which types of semantic information they want to include.

If you want to try the queries out for yourself, or see how slight modifications might work, you can find a SPARQL endpoint for the data set I’ve been using here. Unfortunately results will be returned as XML, which is not very easy (for humans) to read, so if you’re interested in trying out your new knowledge in a friendlier setting, you may want to try DBpedia, a project to convert information from Wikipedia to RDF, which has a SPARQL endpoint.