Rise of the Machines: Using SADI to Augment Your Data

As discussed in my last post, converting a flat dataset to an RDF graph gives you access to a variety of analysis and exposure tools, and can form the base of new software that relies on your data format. It is a flexible format, in that new information can be added freely without concern for table joins or schema changes, and yet it has incredible descriptive power, because it is possible in principle to know that two statements in separate data stores are equivalent information, and because individual data elements can often simply be loaded in a web browser to find out more about them.

These properties certainly have a lot to offer scientists and other human consumers of data, but in many ways they are particularly valuable to algorithmic data users, which can process a stream of information very quickly but are not as adept at resolving ambiguity, or at reliably finding more information on a concept without guidance on where to look.

SADI: Semantic Automated Discovery and Integration

One particular project that illustrates this is SADI, the brainchild of bioinformatician and semantic web expert Dr Mark Wilkinson. Mark, who also happens to be one of my GSOC mentors, has built a framework to support automated discovery of, and access to, distributed datasets and services. It is a practical example of a service built on the concepts I’ve been working to learn and make use of this summer.

SADI is not so much a tool or piece of software as a set of standards for service interoperability, grounded in and supported by the existing standards of the internet and the semantic web. This means that, although most of the existing SADI services are focused on bioinformatics data, the system is flexible and general purpose enough to apply to essentially any service or web interface.

SADI comprises six key conventions, listed on the How SADI works page:

  1. SADI Services consume and provide data via simple HTTP POST and GET.
  2. SADI Services consume and produce data in RDF format. This allows SADI Services to exploit existing OWL reasoners and SPARQL query engines to enhance interoperability between Services and the interpretation of the data being passed between them.
  3. Service interfaces (i.e., Inputs and Outputs) are defined in terms of OWL-DL classes; the property restrictions on these OWL classes define what specific data elements are required by the Service and what data will be provided by the Service, respectively.
  4. Input RDF data – data that is compliant with the Input OWL Class – is “decorated” or “annotated” by the service provider to include new properties. These properties will (of course) be a function of the lookup/analytical operations performed by the Web Service.
  5. Importantly, discovery of SADI Services can include searches for the properties the user wants to add to their data. This contrasts with other Semantic Web Service standards which attempt only to define the computational process by which input data is analysed, rather than the properties that process generates between the input and output data. This is KEY to the semantic behaviours of SADI.
  6. SADI Web Services are stateless and atomic.

Essentially, a SADI service uses the OWL ontology language to describe the information it expects as input, and what it will return. It uses common internet conventions for its communication protocol, based on recognized W3C standards.

These conventions allow the same tools and agents that access the world wide web to interact with SADI in a similar manner. Because of this, and because the system represents its information using semantic web standards, a registry has been built of which services provide what sorts of information, and for which inputs. It can take an entry straight from a triplified dataset and discover more information about it, all without any user intervention.

The basic process of using a SADI service involves retrieving a description of the service by issuing a GET request to it, then POSTing RDF instances of the input OWL class it expects. The service responds with those same instances, annotated with the new information it provides.

As an example, here are the request headers and a couple of input and output objects for the example service, which sends back a greeting for each “named individual” it receives (how nice!).
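
What follows is a rough sketch of that exchange rather than a verbatim capture; the class and property URIs are illustrative stand-ins, so check the live service description for the real ones:

    POST /examples/hello HTTP/1.1
    Host: sadiframework.org
    Content-Type: text/turtle
    Accept: text/turtle

    # Two "named individuals" as input (prefixes and class URI assumed)
    @prefix foaf:  <http://xmlns.com/foaf/0.1/> .
    @prefix hello: <http://sadiframework.org/examples/hello.owl#> .

    <http://example.org/people/guybrush>
        a hello:NamedIndividual ;
        foaf:name "Guybrush Threepwood" .

    <http://example.org/people/elaine>
        a hello:NamedIndividual ;
        foaf:name "Elaine Marley" .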

and the response:
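
(Another sketch; the output class and greeting property are assumed, not copied from the live service.)

    # The same subjects come back, decorated with a greeting
    @prefix hello: <http://sadiframework.org/examples/hello.owl#> .

    <http://example.org/people/guybrush>
        a hello:GreetedIndividual ;
        hello:greeting "Hello, Guybrush Threepwood!" .

    <http://example.org/people/elaine>
        a hello:GreetedIndividual ;
        hello:greeting "Hello, Elaine Marley!" .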

All of this behavior is defined by the OWL classes in the service description, so any user or algorithm accessing the service can learn how to interact with it simply by reading that description, which is both human and machine friendly.

Asynchronous Responses

Beyond the clarity and simplicity of the RDF and the reuse of familiar web standards, much of SADI’s strength lies in its ability to robustly process large requests and inputs. Services can be set up as synchronous, where a response isn’t returned until the service has finished processing its input, or asynchronous, where the server instead returns an address that a client can check, or “poll”, to see if the operation has finished.

The response to a poll request also includes a header specifying how long the client should wait before trying again, making the whole process of retrieving an asynchronous result simple and transparent to coordinate. As we’ll see later on, this makes batch processing much simpler, allowing large volumes of information to be exchanged without having to worry about dealing with timeouts and network issues, or trying to efficiently coordinate many different remote requests.

The SADI framework has other benefits and features that we haven’t specifically used yet, such as the security afforded by its enforcement of an object model (as opposed to raw SPARQL queries), and the ability to distribute queries over multiple resources. In addition to sadiframework.org, further details can be found in The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern, API and Reference Implementation, a paper published in the Journal of Biomedical Semantics by Mark, Benjamin Vandervalk and Luke McCarthy, and available in full at the link.

SADI in action

To give a concrete example of how to use SADI, I’ll go over the script I wrote that uses it to assist in our analysis of the MAF dataset we’ve been working with. When trying to make inferences based on the frequency with which mutations appear in a gene, it is necessary to adjust for the size of that gene. The boundaries of a gene can actually be a bit fuzzy, since what exactly counts as a gene is less clear-cut than you might expect, but databases exist that contain the generally accepted start and end positions of the gene, from which its length can be found.

The old way

To get started, I searched for databases that contained gene location information and allowed it to be accessed programmatically. Of these, the Ensembl genome database was the easiest for me to use, as it has a new RESTful endpoint, and I’m usually happiest working with REST services.

The first step in constructing a query to it was to find the canonical name for a HUGO symbol from the dataset. Unfortunately, in addition to the occasional error or nonsense entry, the gene information in the MAF dataset often used synonyms for the “official” gene name, which are recognized by the HGNC but not immediately convertible to their equivalent Ensembl ID. To deal with this I used the hgnc dataset provided by bio2rdf to look up first the official symbol, and then the symbol’s ID in the Ensembl database.

In the end, I came up with a couple of methods to retrieve the information.
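
The Ensembl half of that looked roughly like the sketch below: once the approved symbol is known (the bio2rdf lookup that resolves it is sketched later in this post), a single REST call returns the gene’s coordinates. The endpoint path and JSON field names are my best recollection of Ensembl’s lookup interface, so treat them as assumptions:

    require 'rest-client'
    require 'json'

    # Look up a gene's start and end positions via Ensembl's REST API and
    # return its length. Assumes the symbol has already been converted to
    # its approved HGNC form.
    def gene_length(approved_symbol)
      response = RestClient.get(
        "http://rest.ensembl.org/lookup/symbol/homo_sapiens/#{approved_symbol}",
        params: { 'content-type' => 'application/json' }
      )
      info = JSON.parse(response.body)
      (info['end'] - info['start']).abs
    end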

This approach was both error-prone and slow, since each lookup involved multiple remote queries and had to be completed one at a time. Even worse, the results weren’t stored anywhere, so a new request had to be made each time the information was required.

SADI to the Rescue

Mark, however, was kind enough to set up a SADI service to handle the process. This is great not only because it runs a lot more smoothly and gave me the opportunity to work with SADI, but also because it makes saving and integrating the responses almost trivial.

To begin with, I created a class with a simple method to run a synchronous SADI request and return the results as an RDF graph:
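
(A minimal sketch, assuming nothing beyond rest-client and plain turtle in and out.)

    require 'rest-client'

    class SadiClient
      # POST turtle input to a SADI service and return the turtle it sends back.
      def fetch(service_url, input_ttl)
        RestClient.post(service_url, input_ttl,
                        content_type: 'text/turtle',
                        accept:       'text/turtle').body
      end
    end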

It takes a service URL and some RDF input, then uses the rest-client gem to handle the request. SADI supports both turtle and RDF/XML input, but I’m partial to turtle, so the script uses it. An example input, for the gene “ACF”, would be:
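
(Sketched here with a placeholder input class URI; the real one comes from the service description.)

    @prefix input: <http://example.org/sadi/genelength#> .

    <http://identifiers.org/hgnc.symbol/ACF>
        a input:GeneSymbol .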

producing the output:
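
(Again a sketch, written with the shortened, readable predicate names noted just below, and a made-up length value rather than the real one.)

    <http://identifiers.org/hgnc.symbol/ACF>
        has_attribute [
            a GeneLength ;
            has_value 12345    # placeholder, not ACF's actual length
        ] .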

Some of the supporting ontologies’ predicates have been replaced by more readable forms, but in general this is fully valid RDF on both ends, so loading it to or from a triple store is no trouble at all.

Caution: Semantic Hazard

It’s important to note that some work still needs to be done on the data model for the triplified MAF dataset before it will play nice with other scientific datasets such as those exposed by SADI. Mark was willing to set up a service that could accommodate the dataset I’d constructed (SADI is quite flexible, after all), but this shouldn’t be taken as a representative example of how a service should look or be used. Although my gem has gotten to the point where it avoids gross incompatibilities such as stepping on others’ namespaces and failing to reuse common vocabularies, there are some subtler semantic issues that prevent simple integration with SADI’s more interesting functions.

First of all, SADI makes frequent use of the SIO ontology, which provides a rich and unified system for describing data in RDF, at the cost of certain restrictions on how that data is represented. You can see the general outline of how SIO works in the output above: attributes of objects are attached with the “has_attribute” predicate, and literal values for attributes with “has_value”. I spent some time trying to use this pattern in the MAF parser, but given the amount of time we had left we decided to stick with the simple representation I described in earlier blog posts. I believe full use of SIO would be both possible and worthwhile, though, since it allows for much greater interoperability without sacrificing flexibility, so this is something I will continue working on past the end of GSOC.

Second, there are also some “philosophical” issues getting in the way of full SADI integration. I’m currently using URIs from identifiers.org to provide dereferenceable identifiers for the HUGO symbols in the MAF file. This is a great application of linked data principles, since it automatically attaches both more information about a particular gene and information about the service and scheme used to represent it. However, a statement like “http://identifiers.org/hgnc.symbol/RBFOX1 has_gene_length 1694246” doesn’t really make sense; the identifiers.org URL for RBFOX1 doesn’t have a gene length, because it’s just an identifier! As I understand it, the right structure would be more like “X is_a gene, X has_identifier identifiers.org/X, X has_gene_length Y”, although I could still be wrong about this; getting the semantics right is one of the trickiest parts of working with these systems.

If you’re like me, you love the idea of a database technology where the ontological characteristics of the entities stored in it are as important as the raw data itself. If this seems like splitting hairs, though, you’re missing the scope of the vision the Semantic Web is working towards, and which SADI demonstrates: once we make a statement about something, that statement should be unambiguously defined and verifiable by someone, or some algorithm, with knowledge of its particular domain, and by making other statements using its component parts we can build a vast web of interlinked knowledge, perhaps one day supplanting the web of linked documents we all use today.

RDF’s flexibility supports this vision, but it also gets in the way in that it allows you to make nonsensical statements such as the ones above. Formally defined ontologies like SIO provide the more precise structure that allows you to make a statement with reasonable confidence that it will be both meaningful and easily reusable by others. In my own time after this summer I’m looking forward to working on and writing more about this topic, as I think it really gets at the potential for using semantic technologies in science, programming, and machine intelligence research.

Speeding things up with Async

A single request runs fairly quickly and retrieves the information we need, but in this simple form it’s not quite sufficient for larger volumes of information. The nature of RDF makes batch input very easy to set up; you just add more objects to the turtle input. However, the BRCA dataset has 1,760 distinct genes, so trying to load even a small subset of them through the synchronous service takes long enough for the request to time out.

This is precisely what the asynchronous mode is meant for, so after getting the basic synchronous query up and running I moved on to that. Asynchronous queries have some added complexity, so the class got quite a bit longer, but it’s still a fairly simple process for all the work that’s going on behind the scenes.

When the fetch_async method POSTs data to the service, it receives a URL to poll for each of the inputs. The method creates a list of these poll URLs, then handles the process of going through them and waiting until a response is available. This means there is no need to keep a connection open the whole time, and the client can simply follow the service’s instructions on how long to wait before checking back in. If the responses are split over multiple polling URLs, it waits until each has finished processing, then returns the output, again in turtle form.
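
A trimmed-down sketch of that polling logic, covering only the happy path, with a naive regex standing in for proper RDF parsing of the poll URLs and header handling that should be checked against the real services:

    require 'rest-client'

    class AsyncSadiClient
      # POST turtle input, collect the polling URL for each input node,
      # then poll until every one has finished and return the combined output.
      def fetch_async(service_url, input_ttl)
        response = RestClient.post(service_url, input_ttl,
                                   content_type: 'text/turtle',
                                   accept:       'text/turtle')
        extract_poll_urls(response.body).map { |url| poll(url) }.join("\n")
      end

      private

      # Keep checking a poll URL, honoring the suggested retry delay,
      # until the service stops redirecting us back to it.
      def poll(url)
        loop do
          response = RestClient.get(url, accept: 'text/turtle') { |resp, _req, _res| resp }
          return response.body unless (300..399).cover?(response.code)
          sleep((response.headers[:retry_after] || 10).to_i)
        end
      end

      # Naive extraction of the poll URLs; a real client would parse the RDF
      # instead of scanning for rdfs:isDefinedBy statements with a regex.
      def extract_poll_urls(turtle)
        turtle.scan(/isDefinedBy>?\s+<([^>]+)>/).flatten.uniq
      end
    end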

Straight to the Database

After getting this working, I immediately set about parsing just the gene lengths out of the output so I could return them as Ruby objects. This habit comes from my previous experience with various APIs, where the general process involves converting your data into a special format, making the request, and then grabbing the information you want from the response. SADI eliminates the last of these steps, and with the right input structures the first as well; the response is already in an RDF format, so you can simply load it straight into a triple store, automatically augmenting the information you already have and providing an offline database of gene lengths for later lookup.

I’ve written a script to do just this, which is currently configured to work with 4store specifically. It retrieves the HUGO genes currently in the database, sets up the SADI input with them, and can load the output directly into the triple store. The requests are split into batches of 250, which means the whole set can be processed much faster than doing the genes one at a time, and it becomes a one-time process instead of something that gets repeated every time the length of a particular gene is needed.
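
A condensed sketch of the idea: the service URL and input class below are placeholders, and rather than guessing at 4store’s HTTP interface it appends the output to a file that can be bulk-loaded into the store; the real script talks to the store directly.

    require 'rest-client'

    SERVICE_URL = 'http://example.org/sadi/getGeneLength'          # placeholder service
    INPUT_CLASS = 'http://example.org/sadi/genelength#GeneSymbol'  # placeholder class

    # Type each HGNC symbol URI as an instance of the service's input class.
    def sadi_input(gene_uris)
      gene_uris.map { |uri| "<#{uri}> a <#{INPUT_CLASS}> ." }.join("\n")
    end

    # Send the genes to the service in batches of 250 and collect the turtle
    # responses, ready to load straight into the triple store.
    def fetch_gene_lengths(gene_uris)
      gene_uris.each_slice(250).map do |batch|
        RestClient.post(SERVICE_URL, sadi_input(batch),
                        content_type: 'text/turtle',
                        accept:       'text/turtle').body
      end
    end

    # In the real script these URIs come from a SPARQL SELECT against the store.
    gene_uris = %w[
      http://identifiers.org/hgnc.symbol/BRCA2
      http://identifiers.org/hgnc.symbol/TP53
    ]

    File.open('gene_lengths.ttl', 'a') do |f|
      fetch_gene_lengths(gene_uris).each { |ttl| f.puts ttl }
    end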

When you step back and think about it, this ability to make a request using some entries from your database and load the response straight back in without parsing or conversion is pretty remarkable, and it doesn’t even begin to address SADI’s support for discovering entirely new information. While this post should serve as a small example of what it can do, there is a huge list of available services on the SADI site. And if you’re looking for a simple ruby client for accessing a service, try out the code in the gists above, or clone the sinatra-based web interface I built.

Sharp Scissors, Safety Scissors: What to do With Your PubliSci Dataset

If you’ve been following along with the last two blog posts, you should have a pretty good idea of how to turn most flat or tabular file formats into an RDF dataset using PubliSci’s Reader tools. You now have an unambiguous, self annotated dataset that is both easy for humans to read and can be queried in sexy, sexy SPARQL once loaded into a triple store. So what do you do with it?

In storing, serializing, or “writing down” data, we hope (beyond overcoming a poor memory) to be able to share what we’ve learned with others who might have questions, criticism, or things they’d like to derive from the information within. Often these ‘others’ are other people, but more and more frequently they are machines and algorithms, especially in fields such as biology that are struggling to keep up with the growing heaps of data they generate. SPARQL, RDF, and the other Semantic Web components are designed to make describing knowledge, and posing questions to it, accessible to both kinds of consumer, through a flexible data model, ontological structures, and a host of inter-related software and standards.

Along with a web-friendly scripting language such as Ruby, you can easily build domain specific applications using the Semantic Web’s tools. To provide an example, I’ve created a demonstration server, which you can find at mafdemo.strinz.me, based on a breast cancer dataset collected by Washington University’s Genome Institute and stored in the TCGA database.

There are two ways to use the demo server; one public, the other private. The public side offers a way to load maf files into the database, a simple html interface with some parts of the data highlighted and linked for you to browse through, and a page for querying the repository using SPARQL.

The private side, protected by a password for now, offers a much more flexible way to interact with the dataset, essentially by letting you write Ruby scripts to run a set of templated queries, create your own, and perform operations such as sorting or statistical tests on the output. However, as James Edward Gray says, Ruby trusts us with the sharp scissors, so if you were to host such an interface on your own machine, you’d want to make sure you don’t give the password to anyone you don’t trust with the sharp scissors, unless you’re running it in a virtual machine or have taken other precautions.

I’ll go over both of these interfaces in turn, starting with the public side.

The Safety Scissors

There’s still a lot you can find out about the dataset from the public side. It’s not much to look at, but you can browse through linked data for the patients and genes represented in the maf file. Because of the semantic web practice of using dereferenceable URIs, a lot of the raw data is directly linked to more information about it. Most of the information being presented comes from direct SPARQL queries to the maf dataset, constructed and executed using the ruby-rdf library.

With some further work, a very flexible tool for slicing and analyzing one or more TCGA datasets could be built on this backend. As of now most responses are returned as streaming text, which prevents queries and remote service calls from causing timeouts, but makes building a pretty interface more difficult. This could be resolved by splitting it into javascript output and a better looking web interface (such as the one for the PROV demo I created). On top of that, the inclusion of gene sizes is just a small example of the vast amount of information available from external databases; this is, after all, the state of affairs that has led bioinformaticians to adopt the semantic web.

However, the remaining time in GSOC doesn’t afford me the scope to build up many of these services in a way that makes full use of the information available and the flexible method of accessing it. To address this, I’ve created a more direct interface to the underlying classes and queries which can be accessed using Ruby scripts. It’s protected by a password on the demo site, so if you want to try any of these examples yourself you should grab a clone of the github repository.

Sharp Scissors

The Scissors Cat, by hibbary

In its base form, the scripting interface is not really safe to share with anyone you don’t already trust. It’s not quite as insecure as sharing a computer, since it only returns simple strings, but theoretically a motivated person could completely hijack and rewrite the server from this interface; such is the price for the power of Ruby. However, with some sandboxing and a non-instance_eval based implementation the situation could be improved, or this could form the basis of a proper DSL such as Cucumber’s gherkin, which has a well-defined grammar built with treetop, allowing for much safer evaluation of arbitrary inputs.

The select Method

The script interface sets you up in an environment with access to the 4store instance holding the maf data, and gives you a few helper methods to access it. Primary among these is the ‘select’ method, which can be used to retrieve specific information from the MAF file by patient ID, and retrieve a few other relevant pieces of information about the dataset, such as the number of patients represented in it.

For example, here’s the script you’d use to wrap a simple query, retrieving the genes with mutations for a given patient.
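
The script below is an approximation of what the interface accepts; the exact arguments select takes, and the patient ID, are illustrative:

    # Hugo symbols of every mutated gene recorded for one patient
    # (the last expression in a script is what gets returned).
    genes = select "Hugo_symbol", patient_id: "TCGA-A1-A0SB"
    genes.uniq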


You can further refine results by specifying additional restrictions. Here, the first query selects all samples with a mutation in NUP107, and the second restricts the results to mutations starting at position 69135678:
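
(Approximate, with column names following the MAF headers and the same illustrative select signature as above.)

    # 1) every sample with a mutation in NUP107
    select "patient_id", Hugo_symbol: "NUP107"

    # 2) the same, restricted to mutations starting at position 69135678
    select "patient_id", Hugo_symbol: "NUP107", Start_position: 69135678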

You can also select multiple columns in one go, returning a hash with a key for each selection:

https://gist.github.com/wstrinz/28da6c67ab6e44d26340

Using these methods of accessing the underlying data, you can write more complex scripts to perform analysis. For example, here we look for samples with mutations in the gene CASR that are more than one base pair in length:
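
(Same assumptions about the select helper as above.)

    # Samples whose CASR mutations span more than a single base pair.
    ids    = select "patient_id",     Hugo_symbol: "CASR"
    starts = select "Start_position", Hugo_symbol: "CASR"
    ends   = select "End_position",   Hugo_symbol: "CASR"

    ids.zip(starts, ends)
       .select { |_id, s, e| e.to_i - s.to_i > 0 }  # longer than 1 bp
       .map(&:first)
       .uniq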

Inline SPARQL Queries

While the select method may be a blessing for rubyists just getting into the semantic web, if you’re also familiar with SPARQL you probably know that most of the sorting and comparison you might want to do can be performed with it alone. The public side of the maf server does expose a query endpoint, but if you want to tie a series of queries together in a script, or run the output through an external library, you can also easily run inline queries using the scripting interface.

This can be used to derive information about how best to access the dataset, which adheres to the general structure of the data cube vocabulary. For example, to see all of the columns you can select data from, you could run a script like this:
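
(Assuming the inline-query helper is simply called query; the real name may differ.)

    # List the dimension and measure properties (i.e. the columns) declared
    # in the dataset's Data Cube structure definition.
    query <<-SPARQL
      PREFIX qb: <http://purl.org/linked-data/cube#>
      SELECT DISTINCT ?column WHERE {
        ?dsd a qb:DataStructureDefinition ;
             qb:component ?c .
        { ?c qb:dimension ?column } UNION { ?c qb:measure ?column }
      }
    SPARQL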

And of course you can mix the two methods, pulling the results of a SPARQL query into a select call, or vice versa, as in this next example, where we create a list of all the other genes mutated in patients who have a mutation in SHANK1:
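
(Sketched with a placeholder namespace for the dataset’s generated vocabulary; the real prefix, and the query helper’s name, will differ.)

    # Patients with a SHANK1 mutation, found via an inline query...
    patients = query <<-SPARQL
      PREFIX maf: <http://example.org/maf/vocab#>
      SELECT DISTINCT ?patient WHERE {
        ?obs maf:Hugo_symbol ?gene ;
             maf:patient_id  ?patient .
        FILTER regex(str(?gene), "SHANK1")
      }
    SPARQL

    # ...then fed back into select calls to list their other mutated genes.
    patients.flat_map { |id| select "Hugo_symbol", patient_id: id }.uniq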

SPARQL Templates, RDF.rb Queries

A couple of other small features deserve a mention. First, I’ve included the ad-hoc templating system I’ve been using in the gem. It’s similar to the handlebars templating system, marked by double braces ( ‘ {{ ‘ and ‘ }} ‘ ), although here we’re working with SPARQL rather than HTML. This has a few different applications: you can reuse query templates in a script, or write a query early on and fill values into it later.
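
A small illustration of the idea; the gem’s templating code does the substitution itself, so the gsub here is only a stand-in, and the prefix is again a placeholder:

    # A reusable query template with {{ }} placeholders.
    mutations_for = <<-SPARQL
      PREFIX maf: <http://example.org/maf/vocab#>
      SELECT ?gene WHERE {
        ?obs maf:patient_id  "{{patient}}" ;
             maf:Hugo_symbol ?gene .
      }
    SPARQL

    # Fill in a value later and run it.
    query mutations_for.gsub('{{patient}}', 'TCGA-A1-A0SB')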

Second, when you make a ‘select query’ call, the results are converted into plain ruby objects for simpler interaction. Under the hood, however, these are retrieved using the RDF::Query class, which returns RDF::Solutions that can be interacted with in a more semantic-web-aware manner. To get this kind of object as a result, either use “select_raw query” instead, or instantiate a query object and call its #run method, as demonstrated below, where we retrieve all the Nonsense Mutations and then process them to sort by patient ID or gene type:
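
(A rough sketch using RDF.rb’s query classes directly; the vocabulary URI is a placeholder, and repository stands for whatever RDF::Queryable object the script environment exposes.)

    require 'rdf'

    maf = RDF::Vocabulary.new('http://example.org/maf/vocab#')

    # Pattern matching every Nonsense_Mutation observation.
    nonsense = RDF::Query.new do
      pattern [:obs, maf.Variant_Classification, 'Nonsense_Mutation']
      pattern [:obs, maf.patient_id,             :patient]
      pattern [:obs, maf.Hugo_symbol,            :gene]
    end

    solutions  = repository.query(nonsense)             # => RDF::Query::Solutions
    by_patient = solutions.sort_by { |s| s[:patient].to_s }
    by_gene    = solutions.group_by { |s| s[:gene].to_s }
    by_patient.first(10)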

Saving and Sharing

Finally, the way I’ve set up the server and the nature of instance_eval allowed me to include the saving of a ‘workspace’ between evaluations, and the sharing of results or methods across sessions and users. To save a variable or result, simply prefix it with an “@” sign, declaring it as an instance variable.
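
For example, reusing the illustrative select call from above:

    # Stash a result so it survives between script evaluations.
    @result = select "Hugo_symbol", patient_id: "TCGA-A1-A0SB"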

Then you can come back later and run another script:
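
(Continuing the sketch above; any later script run against the same evaluator works the same way.)

    # Pick up where the previous evaluation left off.
    @result.uniq.length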

That reuses the instance variable “@result” stored in your instance of the script evaluator. You can do the same with procs or lambdas to reuse functions, and with pretty much anything else you can think of. Similarly, prefixing a variable with “@@” will mark it as a class variable, enabling anyone accessing the script interface to use it.

Do Not Try This At Home

Again, I want to stress that this is by no means a thorough approach to providing public access to an RDF dataset. It is so ridiculously permissive that I’m sure there are people online who would feel physically ill just thinking about the insecurity of my approach. Hopefully, if they’re reading this, they’ll feel inclined to offer some advice on how to do it better; but in lieu of that, I believe that working in a small group on a closed server with this interface could aid collaboration and the prototyping of queries and algorithms. It also helps to show just how flexible the underlying data model we’re operating on can be, and how the impedance mismatch between programs and query-accessible databases is in many cases less severe with SPARQL than with SQL.

The one huge component of the semantic web this does leave out is interaction between services. The ability to unambiguously make statements with RDF triples creates a natural route for integrating and consuming external services, which I will talk about in more detail in a followup post.

Parsing with PubliSci Part 2: Being a good Semantic Citizen

Once you have created a basic converter using PubliSci’s Base reader class, it’s important that you work to improve the links between your dataset and others, and use terms and descriptions that others will understand.

The data_cube.rb module will generate terms and namespaces where required by the vocabulary or the syntax of RDF, and there are a number of configuration options to control this process, but in general a new namespace will be created for every dataset. This prevents semantic issues and namespace collisions in the output; if two file formats have a “Score” property, you could otherwise wind up with two datasets that have conflicting definitions of the term. However, it severely limits reuse and interoperability, which is very much against the spirit of RDF and the Semantic Web.

Fortunately, the generation code is smart enough to try to recognize when you already have a valid URI for a part of a triple, in which case it will use the raw input instead of generating a URI from it. This means you can force the generation code to use identifiers of your choosing, just by modifying your input data, and without needing to add any extra configuration options.

Universal, Resolvable Identifiers

Based on advice from Mark Wilkinson, one of my mentors, I’ve tried to use URIs from the identifiers.org system. The site provides persistent identifiers for many important bioinformatics concepts and databases, as well as access URLs and other helpful information.

Among the many benefits of using the site, a crucial one is the fact that all of its identifiers resolve to a page on their host service. For example, the URL http://identifiers.org/hgnc.symbol/RBFOX1 serves to uniquely identify the gene RBFOX1 in the maf reader’s output, but pasting the link into your web browser will also take you directly to the HGNC page for RBFOX1. There’s a lot of other useful metadata provided by identifiers.org, all of which is also available as turtle RDF, so I’d encourage you to have a look at it yourself.

I found identifiers for Hugo Symbol, Entrez ID, and dbSNP ID, but there may be others I’ve missed. The better linked and identified your data, the easier it will be to query and reuse. Once I’d found the right base URIs, adding them to the reader code was fairly simple; just a modification of the process_line method:
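
(A simplified paraphrase of the change; the MAF column positions are assumed from the standard header order, and the reader’s other responsibilities are stripped away.)

    # Prepend identifiers.org base URIs to the relevant columns so the
    # data_cube.rb generation code sees ready-made URIs instead of raw strings.
    def process_line(line)
      entries = line.chomp.split("\t")

      entries[0]  = "http://identifiers.org/hgnc.symbol/#{official_symbol(entries[0])}"
      entries[1]  = "http://identifiers.org/ncbigene/#{entries[1]}" unless entries[1] == "0"
      entries[13] = "http://identifiers.org/dbsnp/#{entries[13]}" if entries[13] =~ /^rs/

      entries
    end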

The one small exception to this is the possibility of HGNC synonyms, where the symbol used in the original MAF file is an accepted but not canonical way of identifying the gene. If these are not replaced with their ‘official’ equivalent, the resulting URIs will not resolve correctly!

SPARQL To The Rescue

For now, we can solve this by looking up the correct symbol using bio2rdf, which has created a network of linked data in the life sciences that can be queried using SPARQL. You may have noticed that the updated process_line method calls an official_symbol method. This queries one of the bio2rdf endpoints and returns the approved HGNC identifier for a given input:
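
(The endpoint URL and bio2rdf HGNC predicate names below are written from memory; verify them against the current bio2rdf release before relying on this sketch.)

    require 'rest-client'
    require 'json'

    # Resolve a possibly-synonymous symbol to the approved HGNC symbol by
    # querying a bio2rdf SPARQL endpoint.
    def official_symbol(hugo_symbol)
      sparql = <<-EOQ
        SELECT DISTINCT ?approved WHERE {
          ?gene <http://bio2rdf.org/hgnc_vocabulary:approved_symbol> ?approved .
          { ?gene <http://bio2rdf.org/hgnc_vocabulary:approved_symbol> "#{hugo_symbol}" }
          UNION
          { ?gene <http://bio2rdf.org/hgnc_vocabulary:synonym> "#{hugo_symbol}" }
        }
      EOQ

      response = RestClient.get 'http://hgnc.bio2rdf.org/sparql',
                                params: { query:  sparql,
                                          format: 'application/sparql-results+json' }
      JSON.parse(response.body)['results']['bindings'].first['approved']['value']
    end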

With a large input file, this remote query method could become too time-consuming, so in the future it may be worthwhile to use an offline database of some sort to do the conversion. Of course, you could always download the entire dataset and load it into your own rdf store. This is one of the great advantages of RDF: since most storage software supports the same set of official serialization formats, the contents of one database can easily be dumped straight into another. And at 836,060 triples the hgnc dataset is well within the limits of most triple stores.

You can (and often should) also override the URI for a component property, if an equivalent concept is in use elsewhere. To demonstrate, I’ve changed the Hugo_symbol property to use the base identifiers.org/hgnc.symbol URI, which is as simple as changing the first entry in the COLUMN_NAMES array. I’m not sure if using this particular URI is the correct approach yet, so something different may be used in the gem’s version of the maf reader.
Here’s what the whole class looks like with these changes.

Enumeration with Coded Properties

As discussed in a previous post, Data Cube’s coded properties are a good way to “bootstrap” semantics for certain types of data. Below I’ve just changed the Variant_Classification column to use coded properties, but since many of the columns in a MAF file have a specific set of valid values, representing other properties this way is a fairly simple process.

The only modifications needed here are adding two extra lines in the structure method to generate the coded properties’ structure, specifying which columns should be represented with codes (at the top of the generate_n3 method), and adding the list of possible codes using the tcga_codes method:
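
(Paraphrased rather than exact; the coded-column list and the value list below follow the usual MAF variant classifications, and are not necessarily exhaustive.)

    # Columns to be represented as Data Cube coded properties
    # (declared near the top of generate_n3).
    CODED_COLUMNS = ["Variant_Classification"]

    # The allowed values for each coded column, passed to the structure
    # generation code so it can emit the code lists.
    def tcga_codes
      {
        "Variant_Classification" => %w[
          Missense_Mutation Nonsense_Mutation Nonstop_Mutation Silent
          Frame_Shift_Ins Frame_Shift_Del In_Frame_Ins In_Frame_Del
          Splice_Site Translation_Start_Site RNA Targeted_Region
        ]
      }
    end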

If you’re an expert at finding and using Semantic Web ontologies, the gem will hopefully make prototyping or creating an RDFization algorithm faster and easier, and you may even be familiar with a more domain-specific vocabulary than Data Cube that is a better fit for your data. However, most scientists and other people who want to publish large quantities of data are not familiar with these options. Just getting started with RDF requires a dedicated effort to understand its syntax and data model, which can seem very different from the types of structures most programmers are used to. And this leaves aside the issue of making proper use of existing concepts and ensuring your data are accessible to other people and algorithms.

Even for me, having worked on a Semantic Web project all summer and with ready access to the direct advice of experts, the sheer number of tools and vocabularies available is daunting, and I still feel as though I’ve only scratched the surface of what is possible with these technologies.