Parsing with PubliSci Part 1: How to get your data into the Semantic Web

One of the core functions of the PubliSci gem is to convert data from non-semantic formats to RDF so that it can be loaded into a triple store and accessed via SPARQL queries. The gem provides a growing number of Reader classes to ‘triplify’ formats using the Data Cube vocabulary, such as CSV, Weka ARFF, and some data types from the R statistics language, as well as a DSL to access these readers and load their output into various external stores. However, there are many, many common formats that aren’t yet supported, so the gem also provides a “Base” reader class which can be extended to create a parser for the file format of your choice.

To wrap up the summer and show an application of my gem, I’ve started working with my mentors to convert data from the Mutation Annotation Format, used by The Cancer Genome Atlas, to RDF and access it with a SPARQL-backed DSL. The RDF converter and most of the underlying queries have been implemented in their basic form, so I thought a writeup of how they were built would be a good way to illustrate the general process of creating a PubliSci::Readers class using the tools provided by my gem.

[Image: A much cooler logo than my RDF/SciRuby mashup]

This post got a bit long, so I’ve decided to break it up into two separate posts, which I’ll put up at the same time, followed by a third on how to actually use the data you’ve generated and integrate it with different services. For this post, I’m just going to focus on getting a working parser class together which generates valid RDF.

The .maf Format

MAF is a fairly simple format, with 16 tab-delimited columns and the possibility of comments prefixed with a pound sign. Each line of the file represents a mutation in a particular gene of a tumor sample, along with other relevant information such as the type of mutation, the gene’s identity in various databases, and validation information. The files can get a bit long, but using the CSV reader in Ruby’s standard library and the helpful methods provided by the PubliSci::Readers::Base class, it is pretty easy to efficiently convert a MAF file to valid, useful RDF.

Getting Set Up

First of all, if you’re following along at home you’ll need to install the bio-publisci gem and add require “bio-publisci” to the first line of your file. In another post, I’ll talk about how you can add the class you’ve created to the PubliSci DSL’s DataSet.for method, making it possible to dump your output into any repository supported by ruby-rdf.
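Concretely, that just means something like this at the top of your reader file:

```ruby
# Install once from RubyGems:
#   gem install bio-publisci

require 'bio-publisci'
```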

I’ll go into more detail about the process and methods below, but here’s the final MAF class we’ll end up with.

First Steps

It’s always nice to get a little code in place to organize my thoughts. To get started, I’ll just create a simple outline of what we want our reader to do.

Eventually I intend to make this reader accessible from the PubliSci Dataset DSL, so I put the generation code in the generate_n3 method, which the gem will expect to be available when it decides to use this reader to convert a file. I’ve implemented registration of external classes in the DSL, but I haven’t finalized the way it works yet, so I won’t post an example here. If you’re interested, there’s a spec in the gem’s Github repository which demonstrates its use.
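Here’s roughly what that outline looks like. The generate_n3 argument names are my own placeholders rather than anything the gem mandates, so adjust them to taste:

```ruby
require 'bio-publisci'

module PubliSci
  module Readers
    # Converts Mutation Annotation Format (.maf) files to Data Cube RDF.
    # The Base class mixes in the data_cube.rb helper methods used later.
    class MAF < PubliSci::Readers::Base
      # The DSL expects a generate_n3 method when it picks a reader for a file.
      def generate_n3(input_file, options = {})
        # 1. decide which columns are measures and which are dimensions
        # 2. emit the Data Cube structure (dataset, properties, components)
        # 3. process the file line by line, emitting one observation per row
      end
    end
  end
end
```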

The next step is choosing which of the columns to make measures and which dimensions. This is largely up to your interpretation of the data, although there are a few constraints imposed by the Data Cube vocabulary which I’ll go into more detail about below.

No Coding Until You’ve Finished Your Tests!

Although I often stray from the path, it’s usually best to start with tests, then write the code to make them pass. I tend to “forget” this every time I start a project, but it really saves a lot of time and headaches to have a decent spec to work from. For now, I’ll just use one simple test to make sure some valid Turtle triples are being generated.
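A minimal version of that spec might look like the following; the fixture path and the :in_memory option name are placeholders for whatever your implementation ends up using:

```ruby
require 'bio-publisci'
require 'rdf/turtle'

describe PubliSci::Readers::MAF do
  it 'generates valid Turtle triples' do
    # :in_memory keeps the output as a string instead of writing a file
    # (see generate_n3 below); the option name is my own placeholder.
    turtle = PubliSci::Readers::MAF.new.generate_n3('spec/fixtures/example.maf', in_memory: true)

    # Parse the output; if it isn't valid Turtle this will raise.
    statements = []
    RDF::Turtle::Reader.new(turtle) do |reader|
      reader.each_statement { |statement| statements << statement }
    end

    expect(statements).not_to be_empty
  end
end
```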

Making it Work

First I came up with a few expressions to make sure each of the columns is assigned to a measure or dimension, and to generate a dataset name based on the input file name by default (you could add this code to the generate_n3 method).
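Something along these lines works; only a handful of the 16 columns are shown here, and the measure/dimension split is just one reasonable choice, not the “right” one:

```ruby
# A few of the MAF column headers, in file order (abbreviated here).
COLUMN_NAMES = %w(Hugo_Symbol Entrez_Gene_Id Center Chromosome
                  Start_Position End_Position Variant_Classification
                  Tumor_Sample_Barcode)

# Which columns become dimensions and which measures is largely up to you.
DIMENSIONS = %w(Hugo_Symbol Chromosome Variant_Classification Tumor_Sample_Barcode)
MEASURES   = COLUMN_NAMES - DIMENSIONS

def defaults(input_file, options = {})
  # Derive a dataset name from the file name: "brca_clean.maf" => "brca_clean"
  options[:dataset_name] ||= File.basename(input_file, '.*')
  options[:measures]     ||= MEASURES
  options[:dimensions]   ||= DIMENSIONS
  options
end
```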

Next, create a method to generate the structural information for our Data Cube RDF. This should take the form of a simple Turtle string, and can be generated using the methods provided by the data_cube.rb module, which is included in the PubliSci::Readers::Base class. For more information about the semantics of the Data Cube format, check out the official specification, or earlier posts on this blog.
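Here’s a sketch of that method (I’ll call it structure). The helper names below (prefixes, data_structure_definition, dataset, measure_properties, dimension_properties) are my best guess at the data_cube.rb interface; check the module itself for the exact method names and signatures:

```ruby
# Build the Data Cube scaffolding as a Turtle string: namespace prefixes,
# the qb:DataSet, its DataStructureDefinition, and a property for every
# measure and dimension. All of the helpers come from data_cube.rb.
def structure(options)
  name = options[:dataset_name]
  str  = prefixes(name, options)
  str << data_structure_definition(options[:measures] + options[:dimensions], name, options)
  str << dataset(name, options)
  str << measure_properties(options[:measures], name, options)
  str << dimension_properties(options[:dimensions], name, options)
  str
end
```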

Then I’ll write a method to parse the individual lines of the file, which should process each entry and pass it to data_cube.rb’s observations method, skipping over comments and the header line. The observations method requires data to be formatted as a hash from measure/dimension to an array of values, which can be accomplished by zipping the column names and line entries together, coercing it into a hash, and wrapping each value of the hash in an array.
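Here’s roughly what that looks like. The observations call is sketched from the description above, so double-check its real argument list in data_cube.rb:

```ruby
# Turn one row of the file into a Data Cube observation, skipping comment
# lines ("#...") and the header row (whose first field is Hugo_Symbol).
def process_line(entries, options)
  return if entries.empty?
  first = entries.first.to_s
  return if first.start_with?('#') || first == 'Hugo_Symbol'

  # Zip the column names against this row's values, coerce the pairs into
  # a hash, then wrap each value in an array, since observations expects
  # a hash from measure/dimension name to an array of values.
  data = Hash[COLUMN_NAMES.zip(entries)]
  data = data.each_with_object({}) { |(key, value), hash| hash[key] = [value] }

  observations(options[:measures], options[:dimensions], data, options[:dataset_name], options)
end
```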

Finally, we’ll put it all together and call these two methods from the main generate_n3 method. For small files and testing purposes, we’ll add the option to store the resulting strings in memory and print them out, but with most MAF files you may run out of memory trying to do this, so by default we’ll send the output straight to a file.
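Putting it together, generate_n3 ends up looking something like this (again, the :in_memory option name is my own):

```ruby
require 'csv'
require 'stringio'

def generate_n3(input_file, options = {})
  options = defaults(input_file, options)
  output  = options[:in_memory] ? StringIO.new : File.open("#{options[:dataset_name]}.ttl", 'w')

  output << structure(options)

  # MAF files can be large, so stream them row by row with the standard
  # library's CSV class rather than slurping the whole file into memory.
  CSV.foreach(input_file, col_sep: "\t") do |row|
    triples = process_line(row, options)
    output << triples if triples
  end

  if options[:in_memory]
    puts output.string
    output.string
  else
    output.close
    "#{options[:dataset_name]}.ttl"
  end
end
```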

Now would also be a fine time to write a more thorough spec which examines the output more closely.

One Last Thing

The code above will generate valid Turtle RDF that can be loaded into any triple store and used in SPARQL-backed applications, but there’s certainly room for improvement. First of all, it’d be useful to be able to filter our queries by individual patient (a component of the Tumor_Sample_Barcode property).

SPARQL is quite powerful, so you could certainly do this using it alone, with regular expressions for example, but it’d be nice for the patient component of the barcode to be represented explicitly in the data. To add this to the RDFization code, you can just add a sample_id and patient_id value to the column list, and an extra step to the process_line method to parse out this information.
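A TCGA barcode such as TCGA-02-0001-01B-02D-0182-06 encodes the patient in its first three dash-separated fields and the sample in the first four, so the extra step can be as simple as the sketch below (remember to add patient_id and sample_id to the dimension list as well):

```ruby
# Derive explicit patient and sample identifiers from the barcode column.
# For TCGA-02-0001-01B-02D-0182-06 this yields:
#   patient_id => TCGA-02-0001
#   sample_id  => TCGA-02-0001-01B
def add_ids(data)
  fields = data['Tumor_Sample_Barcode'].first.split('-')
  data['patient_id'] = [fields[0..2].join('-')]
  data['sample_id']  = [fields[0..3].join('-')]
  data
end
```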

Here’s what the reader class will look like after the change (this is the same as the first gist in this post).

Iterate and Improve

There’s a lot more to generating a good RDF version of a dataset than simply getting the syntax right and being able to run queries. A number of important principles and practices must be followed to ensure your data is useful to the world in general, rather than just in some narrow application. That’s what the Semantic Web is all about after all! To see how you can continue to improve the generation code detailed here, see the next post in this series.

PubliSci as a service

The provenance DSL and DataSet generation I’ve been working on have most of their basic functionality in place, but I have also been planning a web-based API for accessing the gem and building services on top of it. I’ve created a demo site as a prototype for this feature using Ruby on Rails, and I’m happy enough with it that I’d like to make the address public so people can poke around and give me feedback. Although eventually I’ll be separating some of this functionality into a lighter weight server, Rails has helped immensely in developing it, both because it naturally encourages good RESTful design and because the Ruby community has created many useful gems and tools for rapid prototyping of websites using the framework. The demo site is up at http://50.116.40.22:3000, and you can take a look at the source on Github.

[Image: REST in a nutshell. Think putting the nouns in the URL and the verbs in the request type (source)]

The server acts as an interface to most of the basic functions of the gem: the DSL, the dataset RDFization classes, and the Triplestore convenience methods. Furthermore, this functionality is accessible either through an HTML interface (with a pleasant Bootstrap theme!), or programmatically as a (mostly) RESTful web service, using JavaScript and JSON.

I’m planning to write a tutorial on how to create a publication with it, but for this post I’ll just give a broad overview of how you can use the service. The example data on the site now is based on the PROV primer, with a couple of other elements added to test different features, so it may seem a bit contrived, but it should give you some idea how you could use the site’s various features.

The root page of the site will show you the DSL script that was used to initialize the site, with syntax highlighting thanks to the lovely Coderay gem.

[Screenshot: dsl_show]

You can also edit the DSL script, which will regenerate the underlying data and set up a new repository object for you. As a warning up front, the DSL is currently based on instance_eval, which introduces a big security risk if not handled properly. I’m working on automatically sandboxing the evaluation in a future version, but for now, if you’re worried about security, you can easily change a line of the initializer to prevent remote users from updating the DSL.

Along the top, you’ll see links for Entities, Activities, and Agents, which are elements of the PROV ontology, as well as Datasets, which represent any Data Cube formatted data stored in the repository. Each of these elements acts as a RESTful resource, which can be created/read/updated/deleted in much the same way as with a standard ActiveRecord model. Let’s take a look at the Entities page to see how this works.

[Screenshot: entities]

On the Entities page, you can see a table where each row represents an entity. PROV-relevant properties and relationships are also displayed and hyperlinked, allowing you to browse through the information using the familiar web idiom of linked pages. All of this is done using SPARQL queries behind the scenes, but the user doesn’t (immediately) need to know about this to use the service.

Clicking the “subject” field for each Entity will take you to a page with more details, as well as a link to the corresponding DataSet if it exists.

[Screenshot: entity_show]

From that page, you can export the DataSet using the writers in the bio-publisci gem. At the moment, the demo site can export CSV or Weka ARFF data, but I’ve been working recently on streamlining the writer classes, and I’ll be adding writers for R and some of the common SciRuby tools before the end of the summer.

You can also edit Entities, Agents, and Activities, or create new ones, in case you want to correct a mistake or add information to the graph. This is a tiny bit wonky in some spots, both because of the impedance mismatch between RDF and the tabular backends Rails applications generally use, and because I’m by no means a Rails expert, but you can edit most of the fields, and the creation of new resources generally works fine.

[Screenshot: edit_entity]

I think being able to browse the provenance graph of a dataset or piece of research using an intuitive browser-based interface forms a useful bridge between the constraints and simplicity of a standard CRUD-ish website and the powerful but daunting complexity of RDF and SPARQL, but if you find parts of this model too constraining you can also query the repository directly using SPARQL:

[Screenshot: query]

Two things to note about this: first, it provides a single interface to run queries on any of the repositories supported by the ruby-rdf project, even some without a built-in SPARQL endpoint. Second, since I’ve set up the Rails server with a permissive CORS policy, you can run queries or access resources across domains, allowing you to easily integrate it into an AJAX application or just about anywhere else. For an example, have a look at this jsfiddle that creates a bar chart in D3 from one of the datasets on the demo site.

A few other features that may come in handy have been implemented, which I’ll wait to detail in a later post. One that may be useful to some people is that the dataset DSL will automatically download and handle remote files specified by their URLs using the ‘object’ keyword, which makes loading external datasets extremely simple. Another is the ability to dump the entire contents of the repository in Turtle RDF form: if you want to make a complete copy of the site’s repository, or save changes you’ve made in a serialized format for later, it’s as easy as calling the repository/dump route.
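From Ruby, grabbing a backup of the whole repository is only a couple of lines; the path used here is the repository/dump route mentioned above, and the output file name is just a placeholder:

```ruby
require 'net/http'

# Fetch a Turtle dump of the demo site's repository and save it locally.
dump = Net::HTTP.get(URI('http://50.116.40.22:3000/repository/dump'))
File.write('publisci_backup.ttl', dump)
```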

There’s a fair bit more to do to make this a fully featured web service; some elements of the PROV vocabulary are not fully represented, I’d really like to separate out the fundamental parts into a more lightweight and deployable Sinatra server, and raw (non-RDF) datasets need better handling. Additionally, while you can easily switch between an in-memory repository for experimentation and more dedicated software such as 4store for real work, it’d be nice to make the two work together, so you could have an in-memory workspace, then save your changes to the more permanent store when you were ready. Aside from the performance gain from not having to wait for queries on a large repository, this helps with deleting resources or clearing your workspace, as the methods of deleting data from triple stores are somewhat inconsistent and underdeveloped across different software.

Some of the cooler but less important features will have to wait until after GSOC, but if this is a tool you might use and there’s a particular feature you think would be important to have included in the basic version by the end of the summer please get in touch!

Bio-PubliSci

Having reached the halfway point for GSOC last week, we’ve been asked to summarize what our gems will deliver by the end of the summer, and what our plans are for them after that.

On that note, I’d also like to announce that my gem has been officially released in alpha form and named bio-publisci. Its goal is to provide a framework for publishing scientific results and data to the Semantic Web, which offers a unified data representation format, query language, integration standards, and a focus on using machine understanding to deal with the vast quantities of data being published today. For the version 1.0 release of the gem in September, you can expect to see:

Edit: sorry about the formatting issues; WordPress seems to have no interest in making this post look how I want it to.

A Domain Specific Language for Scientific Results

  • A clean, simple interface for publishing results and datasets to the Semantic Web
    Describe your data and results in a descriptive language implemented in Ruby, and the gem will generate RDF-formatted output from it. Using simple syntax (a hedged sketch appears after this list), you can RDFize your raw data, include basic authorship and publishing metadata, and add information about your data’s provenance. All of the methods declare objects which have their own independent serialization functions, and since the DSL is implemented in Ruby you are free to mix and match your output set, include the DSL in your own programs or access the underlying methods, and make use of the full range of Ruby syntactic sugar, clever tricks, and metaprogramming in your scripts if you so desire.

    Every component is designed to be optional, so if you just need dataset or provenance generation, you can still use the gem and the DSL.

  • Serialize output as human-readable Turtle RDF, or store it in a dedicated triple store
    RDF data can be encoded in a number of different formats, designed for various purposes such as compatibility with existing standards, simplicity, or terseness and human readability. Readability is the goal of Turtle, the Terse RDF Triple Language, which is the primary serialization format supported by my gem. Turtle files are relatively human readable as plaintext, since URIs can be abbreviated using prefixes and grouping, and literal types are often implied and so not necessary to include.
  • Use built-in helpers and symbols, or custom predicates and resources
    In the example gist above, all of the resources involved are generated under the single base URI http://example.org. Out in ‘the wild’ of open-world semantic data, this may make it difficult to integrate existing data, or unnecessarily constrain how you’d like to represent your data. Fortunately, anywhere you see a symbol, which starts with a “:” (besides the initial label for the object), you can replace it with a string representing a URI, which will be used instead of the automatically generated URI when the object is accessed or serialized. You can also add custom predicates (properties) using the “has” method, and either the built-in vocabulary helper, an RDF::Vocabulary object, or a raw URI.
  • Pure Ruby, including dependencies
    The gem and all of its requirements are pure Ruby libraries, so it is compatible with all current interpreters, and is also deployable to any system where Java is available (even if Ruby isn’t) using Warbler.
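As a taste of what the list above describes, here’s a hypothetical DSL script. The keywords shown (data, object, dimension, provenance, agent, entity, to_turtle, and so on) are illustrative rather than the gem’s exact vocabulary; the annotated gists linked later in this post show the real syntax:

```ruby
require 'bio-publisci'

include PubliSci::DSL   # assumed module name

# Describe a dataset; remote files given by URL are fetched automatically.
data do
  object 'http://example.org/raw/mutations.csv'
  dimension :Tumor_Sample_Barcode
  measure   :Start_Position
end

# Basic authorship and provenance metadata.
provenance do
  agent :alice do
    name 'Alice'
  end
  entity :maf_dataset do
    generated_by :alice
  end
end

# Serialize to Turtle, or push straight into an in-memory repository.
puts to_turtle
to_repository RDF::Repository.new
```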

Describe Data using well-known standards

  • Basic metadata using the Dublin Core Terms
    See Data for your data

  • Provenance using the PROV ontology
    See Data for your data
  • Dimensional and tabular data using the Data Cube vocabulary
    See Sparkle Cubes

  • Readers and writers to and from a variety of common formats
    Receive input from R, as a CSV file, or in Weka’s ARFF format, and go in the other direction from RDF back to domain formats. Over the rest of the summer, I will also be adding support for relevant SciRuby libraries and GSOC projects, such as NMatrix to Data Cube conversion, plotting with Plotrb, and Statsample integration.

Integration with Ruby RDF

  • Zero-configuration in-memory repository
    The world of triple storage software has yet to see its SQLite equivalent: a tool that is drop-dead simple to set up and a perfect fit for its domain. There are commercial offerings such as OpenLink Virtuoso, which may be feature-rich and easy to set up but are not worth the expense for simple projects, and open source projects such as Sesame or 4store, which are free but often either difficult to set up or missing crucial features such as a built-in SPARQL endpoint. This makes it very difficult to get started working with the Semantic Web, since you may have to spend hours setting up software and learning new standards just to execute a simple query.

    The rdf gem does not provide this be-all end-all storage solution, but it does help alleviate the startup cost of using triple-based storage by providing an in-memory repository object, the RDF::Repository, which can be queried using basic graph patterns or the SPARQL language. While it will choke on moderately sized datasets of a few thousand triples, it handles small datasets well and supports lightweight use of RDF in Ruby programs. To make things even better, the interface it defines has been implemented for many dedicated triple stores, so once you need something more powerful you can change over with almost no reconfiguration.

    The DSL I’ve written includes a “to_repository” method, which can be added at the end of a script to send the output directly to a repository, making it radically easier to go straight from a DSL script to a working, persistent RDF dataset with no configuration whatsoever (a short example appears after this list).
  • Minimal configuration storage using triple stores and NoSQL databases
    Including:
    – Sesame
    – 4store
    – AllegroGraph
    – Virtuoso
    – MongoDB
    – DataObjects
    Ruby RDF defines an interface for using triple stores and other graph-capable persistence software as an RDF::Repository object. Usually all they require for configuration (once the actual repository is installed and set up) is a URI to locate the database, and then you’re able to use a dedicated persistence tool to store your data.
  • SPARQL queries using the sparql and sparql-client gems
    All RDF::Repository objects can be queried using the SPARQL language, the official query language for the Semantic Web. This can be done either in raw form, with the sparql gem or the helpers in bio-publisci, or using the relational algebra provided by the sparql-client gem.
  • An HTTP interface and API written using Sinatra
    Using these libraries and tools, I’ve created a simple HTTP interface that allows you to test DSL scripts, view the Turtle output, and execute SPARQL queries. Because of the excellent tools in the Ruby RDF project, and the generation and description capabilities of the DSL, it is possible to implement this sort of functionality in a lightweight server using Sinatra, which is deployable to any Rack-compliant host.

    I will soon post a link here to the demo page, which isn’t much to look at now but does have a working implementation of all the aforementioned capabilities. I’m sharing it with my mentors, but since the DSL is ultimately just raw Ruby I need to add some more security to the server before I make it public. After I’ve done this and tightened up the API, you’ll be able to use the site to experiment with publication scripts and SPARQL queries, or as a web service for converting and publishing your data. Sinatra is simple and lightweight enough that an end user could host their own publication server, which has a number of interesting potential applications aside from making development easier.

    Additionally, the Ruby RDF project includes some interesting components which will now be easier to integrate, such as the object-mapping gem spira, the goal of which is to offer an RDF-based replacement for the Model layer of Rails and similar frameworks, implemented using ActiveModel’s interface.
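Here’s the kind of zero-configuration workflow the first bullet above describes, using the public ruby-rdf and sparql gem APIs (the mutations.ttl file name is just a placeholder for whatever Turtle output you’ve generated):

```ruby
require 'rdf'
require 'rdf/turtle'
require 'sparql'

# Load Turtle output from a DSL script or reader into an in-memory repository...
repository = RDF::Repository.load('mutations.ttl')

# ...and query it with SPARQL, no server or configuration required.
query = <<-SPARQL
  PREFIX qb: <http://purl.org/linked-data/cube#>
  SELECT (COUNT(?obs) AS ?observations) WHERE { ?obs a qb:Observation }
SPARQL

SPARQL.execute(query, repository).each do |solution|
  puts "observations: #{solution[:observations]}"
end
```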

If you’d like to see some more heavily annotated examples of the DSL that explain how to use the keywords and blocks, have a look at one of these gists. Most of the methods reflect the underlying ontology’s predicates, but since naming is one of the most important parts of the DSL I’m trying to provide shorter aliases that better fit the Ruby idiom, so I’d love to hear any advice anyone has on my choice of labels.

The Future

I’m very excited about the possibilities for this kind of tool and plan to continue improving it after the end of the summer. RDF offers a well designed and widely accepted format which is great for publishing scientific results in a searchable and unambiguous manner, and I think it is one of our best hopes for dealing with the unfathomable amount of data being generated in biology, physics, and many other fields today. Unfortunately its basic concepts and data model take some time to wrap your head around, and tabular data software has a good 50 year head start on triple stores, so there remain many barriers to its adoption. I believe that by using the cleanness and expressivity of Ruby these barriers can be lowered, and in some cases eliminated. By the end of the summer, I’ll have written a gem with a friendly and flexible interface for converting data, adding much of the metadata relevant to scientific publication, and either interacting with it from within Ruby, serializing it, or publishing it to a dedicated store. But there is a lot more I’d like to do after the summer, once version 1.0 has been released, such as:

    • Assertions
      One of the key components of a scientific paper is the basic, underlying statement it is trying to make. This may be a statistical correlation that’s been observed, a simple statement of fact such as a gene sequence, or someone’s opinion of the effects of the peer review process on scientific discourse. These assertions are the result of a provenance chain, and potentially a set of supporting evidence or data, both of which are represented in my gem, but assertions are not explicitly a part of it. There are a number of interesting models for representing assertions in RDF, such as Nanopub, which I’d personally like to try out, but in the interest of having a solid data and metadata DSL by the end of the summer, I don’t want to commit to adding this until the fall.
    • More import methods
      RDF is by nature very friendly to the integration of different datatypes. Although the provenance and metadata generation modules are designed to apply equally well to publishing non-RDF data, or data generated using a different technique, it would be good to have a standard place in the DSL to attach other programs or specify flat files. This would allow easy integration with cool existing projects such as Biointerchange.
    • Rails stack integration
      One thing that would really help with the adoption of the Semantic Web is further integration with popular frameworks. In the right hands, these tools could inspire entirely novel ways of using the Model layer of an MVC application. Aside from proselytization, this also allows familiar patterns such as validations and callbacks, not to mention a more comfortable object-oriented interface, for interaction with RDF data. In the fall, when I can justify more time experimenting with these kinds of things, I’d like to work on building rich RDF-backed applications using Sinatra and Rails.
    • Novel interaction methods
      It’s remarkable the number of people I’ve talked to who stare blankly at me when I talk about the Semantic Web, then instantly understand when I show them a couple of drawings. I’d like to explore new ways of interacting with RDF graphs based on visual metaphors and other more “human-oriented” interfaces. Having a web service that can handle all the data formatting behind the scenes would be an important part of this, as the tools available for in-browser visualization and interaction are becoming ever more powerful and widely used.

    • Reasoning based property assignment
      The way the DSL assigns connections between elements is currently more or less hardcoded, according to my understanding of the vocabularies involved. Those vocabularies are, in fact, described using the machine-understandable OWL Web Ontology Language. I’d really like to try building the validations and property assignment for a new DSL component directly from an OWL ontology, since I think Ruby’s metaprogramming features are well suited to this, and it could make the DSL extensible to the point of being a framework in and of itself.
    • Data Linking
      As of now there’s no facility for linking concepts in the RDFization to resources such as DBpedia and Bio2RDF, which would make for a much more informative publication and is standard practice in the Semantic Web world. Although it’d be pretty easy to add these links by hand to the Turtle output, I’d like to build that kind of functionality into the gem, to automatically find and suggest linkages and annotate existing datasets with them.