Parsing with PubliSci Part 1: How to get your data into the Semantic Web

One of the core functions of the PubliSci gem is to convert data from non-semantic formats to RDF so that they can be loaded into a triple store and accessed via SPARQL queries. The gem provides a growing number of Reader classes to ‘triplify’ formats using the Data Cube vocabulary, such as CSV, Wekka arff, and some data types from the R statistics language, as well as a DSL to access these readers and load their output into various external stores. However, there are many, many common formats that aren’t yet supported, so the gem also provides a “Base” reader class which can be extended to create a parser for the file format of your choice.

To wrap up the summer and show an application of my gem, I’ve started to work with my mentors to convert data from the Mutation Annotation Format, used by The Cancer Genome Atlas, to RDF and access it with a SPARQL backed DSL. The RDF converter and most of the underlying queries have been implemented in their basic form, so I thought I could use a writeup of the process of creating them as a way of illustrating the general process of creating PubliSci::Readers class using the tools provided by my gem.

A much cooler logo than my RDF/SciRuby mashup above

This post got a bit long, so I’ve decided to break it up into two separate posts, I’ll put up at the same time, followed by a third on how to actually use the data you’ve generated, and integrate it with different services. For this post, I’m just going to focus on getting a working parser class together which generates valid RDF

The .maf Format

Maf is a fairly simple format, with 16 tab delimited columns and the possibility of comments prefixed with a pound symbol. Each line of the file represents a mutation in a particular gene of a tumor sample, as well as other relevant information such as the type of mutation, the gene’s identity in various databases, and validation information. The files can get a bit long, but using the CSV reader in Ruby’s standard library and the helpful methods provided by the PubliSci::Readers::Base class it is pretty easy to efficiently convert a maf file to valid, useful RDF.

Getting Setup

First of all, if you’re following along at home you’ll need to install the bio-publisci gem, and add require “bio-publisci” to the first line of the your file. In another post, I’ll talk about how you can add the class you’ve created the PubliSci DSL’s DataSet.for method, making it possible to dump your output into any repository supported by ruby-rdf.

I’ll go into more detail about the process and methods below, but here’s the final MAF class we’ll end up with

First Steps

It’s always nice to get a little code in place to organize my thoughts. To get started, I’ll just create a simple outline of what we want our reader to do.

Eventually I intend to make this reader accessible from the PubliSci Dataset DSL, so I put the generation code in the generate_n3 method, which the gem will expect to be available when it decides to use this reader to convert a file. I’ve implemented registration of external classes in the DSL, but I haven’t finalized the way it works yet, so I won’t post an example here. If you’re interested, there’s a spec in the gem’s Github repository which demonstrates its use.

The next step is choosing which of the columns to make measures and which dimensions. This is largely up to your interpretation of the data, although there are a few constraints imposed by the Data Cube vocabulary which I’ll go into more detail about below.

No Coding Until You’ve Finished Your Tests!

Although I often stray from the path, it’s usually best to start with tests, then write the code to make them pass. I tend to “forget” this every time I start a project, but it really saves a lot of time and headaches to have a decent spec to work from. For now, I’ll just use one simple test to make sure some valid turtle triples are being generated

Making it Work

First I came up with a few expressions to make sure each of the columns is assigned to a measure or dimension, and generate a dataset name based on the input file name by default (you could add this code to the generate_n3 method)

Next, create a method to generate the structural information for our Data Cube rdf. This should take the form of a simple turtle string, and can be generated using the methods provided by the data_cube.rb module, which is included in the PubliSci::Readers::Base class. For more information about the semantics of the Data Cube format, check out the official specification, or earlier posts on this blog.

Then I’ll write a method to parse the individual lines of the file, which should process each entry and pass it to data_cube.rb’s observations method, skipping over comments and the header line. The observations method requires data to be formatted as a hash from measure/dimension to an array of values, which can be accomplished by zipping the column names and line entries together, coercing it into a hash, and wrapping each value of the hash in an array.

Finally, we’ll put it all together and call these two methods from the main generate_n3 method. For small files and testing purposes, we’ll add the option to store the resulting strings in memory and print them out, but with most maf files you may run out of memory trying to do this, so by default we’ll send the output straight to a file.

Now would also be a fine time to write out a better and nicer looking spec which examines the output more closely.

One Last Thing

The code above will generate valid turtle RDF that can be loaded into any triple store and used in SPARQL backed applications, but there’s certainly room for improvement. First of all, it’d be useful to be able to filter our queries by individual patient (a component of the Tumor_Sample_Barcode property).

SPARQL is quite powerful so you could certainly do this using it alone, with regular expressions for example, but it’d be nice for the patient component of the barcode to be represented explicitly in the data. To add this to the RDFization code, you can just add a sample_id and patient_id value to the column list, and an extra step to the process_line method to parse out this information.

Here’s what the reader class will look like after the change (this is the same as the first gist in this post)

Iterate and Improve

There’s a lot more to generating a good RDF version of a dataset than simply getting the syntax right and being able to run queries. A number of important principles and practices must be followed to ensure your data is useful to the world in general, rather than just in some narrow application. That’s what the Semantic Web is all about after all! To see how you can continue to improve the generation code detailed here, see the next post in this series.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s