Once you have created a basic converter using PubliSci’s Base reader class, it’s important that you work to improve the links between your dataset and others, and use terms and descriptions that others will understand.
The data_cube.rb module will generate URIs where they are required by the vocabulary or the syntax of RDF, and there are a number of configuration options to control this process, but in general a new namespace will be created for every dataset. This prevents semantic issues and namespace collisions in the output; if two file formats have a “Score” property, you could otherwise wind up with two datasets that have conflicting definitions of the term. However, it severely limits reuse and interoperability, which is very much against the spirit of RDF and the Semantic Web.
Fortunately, the generation code is smart enough to try to recognize when you already have a valid URI for a part of a triple, in which case it will use the raw input instead of generating a URI from it. This means you can force the generation code to use identifiers of your choosing, just by modifying your input data, and without needing to add any extra configuration options.
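To make the idea concrete, here is a minimal sketch of that kind of check. It is not the gem's actual implementation; the helper names absolute_uri? and to_resource and the example.org base namespace are assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical base namespace for generated identifiers.
BASE = 'http://example.org/ns/'

# Returns true when the value already parses as an absolute http(s) URI.
def absolute_uri?(value)
  uri = URI.parse(value.to_s)
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

# Pass existing URIs through untouched; mint a new one for anything else.
def to_resource(value, base = BASE)
  return "<#{value}>" if absolute_uri?(value)

  # Replace characters that are awkward in a URI path with underscores.
  "<#{base}#{value.to_s.gsub(/[^\w-]/, '_')}>"
end
```

With a helper like this, swapping a raw symbol for a full identifiers.org URL in the input is enough to change what ends up in the output triples.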
Universal, Resolvable Identifiers
Based on advice from Mark Wilkinson, one of my mentors, I’ve tried to use URIs from the identifiers.org system. The site provides persistent identifiers for many important bioinformatics concepts and databases, as well as access URLs and other helpful information.
Among the many benefits of using the site, a crucial one is the fact that all of its identifiers resolve to a page on their host service. For example, the URL http://identifiers.org/hgnc.symbol/RBFOX1 serves to uniquely identify the gene RBFOX1 in the MAF reader’s output, but pasting the link into your web browser will also take you directly to the HGNC page for RBFOX1. There’s a lot of other useful metadata provided by identifiers.org, all of which is also available as Turtle RDF, so I’d encourage you to have a look at it yourself.
I found identifiers for Hugo Symbol, Entrez ID, and dbSNP ID, but there may be others I’ve missed. The better linked and identified your data, the easier it will be to query and reuse. Once I’d found the right base URIs, adding them to the reader code was fairly simple; it took only a modification of the process_line method.
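As a rough sketch of what such a modification might look like, the snippet below rewrites the three identifier columns of a single row. The link_identifiers helper is a hypothetical stand-in for the reader's process_line, the column names follow the usual MAF headers, and the ncbigene and dbsnp namespaces are my assumption for Entrez and dbSNP, so check them against the identifiers.org registry.

```ruby
# Assumed identifiers.org base URI and accepted value pattern per MAF column.
ID_RULES = {
  'Hugo_Symbol'    => ['http://identifiers.org/hgnc.symbol/', /\A[\w.-]+\z/],
  'Entrez_Gene_Id' => ['http://identifiers.org/ncbigene/', /\A[1-9]\d*\z/],
  'dbSNP_RS'       => ['http://identifiers.org/dbsnp/', /\Ars\d+\z/]
}.freeze

# Hypothetical stand-in for process_line: takes one row as a Hash of
# column name => raw value and returns a copy with linked identifiers.
def link_identifiers(row)
  row.each_with_object({}) do |(column, value), linked|
    base, pattern = ID_RULES[column]
    # Columns without a rule, and cells that do not look like a real
    # identifier (empty, "0", "novel"), are copied through unchanged.
    linked[column] = base && value.to_s.match?(pattern) ? "#{base}#{value}" : value
  end
end
```

Because the output values are already full URIs, the generation code can use them directly instead of minting new identifiers in a dataset-specific namespace.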
The one small exception to this is the possibility of HGNC synonyms, where the symbol used in the original MAF file is an accepted but not canonical way of identifying the gene. If these are not replaced with their ‘official’ equivalent, the resulting URIs will not resolve correctly!
SPARQL To The Rescue
For now, we can solve this by looking up the correct symbol using bio2rdf, which has created a network of linked data in the life sciences that can be queried using SPARQL. You may have noticed that the updated process_line method calls an official_symbol method. This queries one of the bio2rdf endpoints and returns the approved HGNC identifier for a given input.
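The lookup could be sketched roughly as follows. This is not the gem's official_symbol method: the helper names are hypothetical, and the vocabulary namespace and the synonym and approved-symbol predicates are assumptions that should be checked against the endpoint's actual schema. The HTTP request itself is left out; the second helper only reads a response in the standard SPARQL JSON results format.

```ruby
require 'json'

# Assumed vocabulary namespace; verify against the endpoint before use.
HGNC_VOCAB = 'http://bio2rdf.org/hgnc_vocabulary:'

# Build a query that maps a synonym to its approved symbol.
def official_symbol_query(symbol)
  # Accept only characters expected in gene symbols, so the input cannot
  # break out of the quoted literal in the query.
  raise ArgumentError, "unexpected symbol: #{symbol.inspect}" unless symbol.match?(/\A[\w.@-]+\z/)

  <<~SPARQL
    SELECT DISTINCT ?approved WHERE {
      ?gene <#{HGNC_VOCAB}synonym> "#{symbol}" .
      ?gene <#{HGNC_VOCAB}approved-symbol> ?approved .
    } LIMIT 1
  SPARQL
end

# Pull the first ?approved binding out of a SPARQL JSON result,
# falling back to the original symbol when nothing matched.
def approved_from(json, fallback)
  bindings = JSON.parse(json).dig('results', 'bindings') || []
  bindings.dig(0, 'approved', 'value') || fallback
end
```

Falling back to the input symbol keeps the conversion running when a symbol is already canonical or simply unknown to the endpoint.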
With a large input file, this remote query method could become too time-consuming, so in the future it may be worthwhile to use an offline database of some sort to do the conversion. Of course, you could always download the entire dataset and load it into your own RDF store. This is one of the great advantages of RDF; since most storage software supports the same set of official serialization formats, the contents of one database can be easily dumped straight into another. And at 836,060 triples, the HGNC dataset is well within the limits of most triple stores.
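Short of a full offline database, a simple in-memory cache already removes most of the repeated remote calls, since the same gene symbols tend to recur many times in one file. The SymbolCache class below is a hypothetical sketch, not part of the gem; the block passed to it stands in for whatever lookup you use.

```ruby
# Wraps any symbol lookup (remote or local) so that each distinct symbol
# is resolved only once per run.
class SymbolCache
  attr_reader :misses

  def initialize(&resolver)
    @resolver = resolver
    @cache = {}
    @misses = 0
  end

  # Return the cached answer, calling the resolver only on a first sighting.
  def fetch(symbol)
    @cache.fetch(symbol) do
      @misses += 1
      @cache[symbol] = @resolver.call(symbol)
    end
  end
end
```

The misses counter is only there to make it easy to see how many real lookups a run needed.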
You can (and often should) also override the URI for a component property if an equivalent concept is in use elsewhere. To demonstrate, I’ve changed the Hugo_symbol property to use the base identifiers.org/hgnc.symbol URI, which is as simple as changing the first entry in the COLUMN_NAMES array. I’m not yet sure whether this particular URI is the correct choice, so something different may be used in the gem’s version of the MAF reader.
Here’s what the whole class looks like with these changes
Enumeration with Coded Properties
As discussed in a previous post, Data Cube’s coded properties are a good way to “bootstrap” semantics for certain types of data. Below I’ve just changed the Variant_Classification column to use coded properties, but since many of the columns in a MAF file have a specific set of valid values, representing other properties this way is a fairly simple process.
The only modifications needed here are adding two extra lines in the structure method to generate the coded properties’ structure, specifying which columns should be represented with codes (at the top of the generate_n3 method), and adding the list of possible codes using the tcga_codes method.
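To show what a code list amounts to in the output, here is a small sketch that writes one as Turtle using SKOS, the vocabulary Data Cube relies on for coded values. The code_list_turtle helper and the example.org namespace are assumptions for illustration, and the gem's own output may be shaped differently.

```ruby
# Hypothetical namespace for the generated code list.
CODE_NS = 'http://example.org/code/'

# Emit a SKOS concept scheme with one concept per allowed value.
def code_list_turtle(property, codes)
  scheme = "<#{CODE_NS}#{property}>"
  lines = [
    '@prefix skos: <http://www.w3.org/2004/02/skos/core#> .',
    "#{scheme} a skos:ConceptScheme ."
  ]
  codes.each do |code|
    concept = "<#{CODE_NS}#{property}/#{code}>"
    lines << "#{scheme} skos:hasTopConcept #{concept} ."
    lines << "#{concept} a skos:Concept ; skos:prefLabel \"#{code}\" ; skos:inScheme #{scheme} ."
  end
  lines.join("\n")
end

# A few sample Variant_Classification values, not the full list.
puts code_list_turtle('Variant_Classification', %w[Missense_Mutation Nonsense_Mutation Silent])
```

Each observation can then point at one of these concepts instead of carrying a bare string, which is what gives the column its bootstrapped semantics.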
If you’re an expert at finding and using Semantic Web ontologies, the gem will hopefully make prototyping or creating an RDFization algorithm faster and easier, but you may also be familiar with a more domain-specific format than Data Cube that is a better fit for your data. However, most scientists and other people who want to publish large quantities of data are not familiar with these options. Just getting started with RDF requires a dedicated effort to understand its syntax and data model, which can seem very different from the types of structures most programmers are used to. And this leaves aside the issue of making proper use of existing concepts and ensuring your data are accessible to other people or algorithms.
Even for me, having worked on a Semantic Web project all summer and with ready access to the direct advice of experts, the sheer number of tools and vocabularies available is daunting, and I still feel as though I’ve just scratched the surface of what is possible with these technologies.