Data for your data

One of the key applications of RDF is representing and disseminating data about other datasets, also known as metadata. This can be all sorts of things, from the publisher or subject of a document to the file format of a video, but in bioinformatics, and science in general, you’re often most interested in how you can make use of the dataset in your domain. This might include getting information on the particular species or region your dataset refers to, or more complex questions such as where and under what terms you can access it, what process was used to create or derive it, and ultimately whether or not you can “trust” it. Although RDF and the Semantic Web don’t automatically answer these questions, they provide a powerful and widely used platform on which to do so.

This is the appropriate metadata reference, not Inception


To begin with, I am using two ontologies to represent metadata. One is concerned with general metadata, such as author and subject, and the other is more focused on the process used to create the data. For now the interface is a little clunky, since it’s just the basic generation functions. Later on they’ll be wrapped in classes that provide a friendlier interface and probably decomposed into smaller functions, similar to the Data Cube part of the gem.

Dublin Core

The Dublin Core vocabulary is a flexible and widely used standard for representing basic metadata. DC is fairly venerable by the standards of the Semantic Web; it traces its roots back to a metadata workshop in Dublin, Ohio in 1995. Since then it has been developed and maintained by an organization known as the Dublin Core Metadata Initiative. It is probably the most ubiquitous vocabulary outside of the core set of RDF ontologies, and has been ratified as an ANSI and ISO standard.

At the moment, my gem supports some of the most basic elements of DC, such as author and publication date. The method for this takes a hash and writes the DC terms for any of the elements that are specified, attempting to generate or infer any missing components.
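As a rough illustration of the hash-in, Dublin-Core-out idea (not the gem's actual method), here is a minimal sketch. The names `dublin_core_turtle` and `turtle_literal`, the hash keys, and the example URI are all hypothetical. It also assumes the legacy `dc:` elements namespace and that a missing date can be filled in with today's date.

```ruby
require 'date'

# Prefixes for the Turtle output. Using the Dublin Core "elements" namespace
# is an assumption made for this sketch.
DC_PREFIXES = <<~TURTLE
  @prefix dc:  <http://purl.org/dc/elements/1.1/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
TURTLE

# Escape backslashes and double quotes so the value is a valid Turtle string.
def turtle_literal(value)
  escaped = value.to_s.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }
  %("#{escaped}")
end

# Hypothetical helper: writes a DC statement for each field that is present
# and infers a missing :date as today's date.
def dublin_core_turtle(dataset_uri, fields = {})
  date = fields.fetch(:date) { Date.today.iso8601 }

  statements = []
  statements << "dc:creator #{turtle_literal(fields[:creator])}" if fields[:creator]
  statements << "dc:description #{turtle_literal(fields[:description])}" if fields[:description]
  statements << "dc:date #{turtle_literal(date)}^^xsd:date"

  "#{DC_PREFIXES}\n<#{dataset_uri}>\n    #{statements.join(" ;\n    ")} .\n"
end

puts dublin_core_turtle("http://example.org/dataset/1",
                        creator: "Jane Doe", date: "2013-07-01")
```

Fields that are not supplied (here, the description) are simply skipped, while the date is always emitted because it can be inferred.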

Using this method will add some basic information to any dataset created with the gem, as shown in this cucumber test:

Publisher and subject information are also supported, although there’s still some work to be done bridging the gap between informal subjects and those defined under various ontologies, which is really more what the ‘subject’ term was designed for.


PROV

The PROV ontology is a more specialized standard than Dublin Core, designed to represent provenance metadata. This includes the sources of and processes used to create a dataset, which people, software, or organizations were involved in creating it, and which data elements were used by or derived from others. PROV was developed by a W3C working group given the goal of creating a unified standard for publishing provenance information, where before a patchwork of standards existed, each missing some important component of provenance representation.

Essentially, PROV is about the interplay of Agents, Activities, and Entities, with Agents engaging in Activities to generate Entities or derive them from other Entities. All of these elements can be either digital (software agents and algorithmic activities), physical (lab technicians and in-person data collection), or some combination of the two. There are additional specializations of these classes, as well as a suite of terms to describe their relationships with one another.
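To make the Agent/Activity/Entity triangle concrete, here is a small sketch that writes the core PROV-O statements for a single dataset. The helper name `basic_provenance` and the example URIs are hypothetical and not part of the gem. The `prov:` classes and properties are the standard PROV-O terms.

```ruby
# Hypothetical helper: returns Turtle describing one Entity, the Activity
# that generated it, and the Agent responsible for both.
def basic_provenance(entity:, activity:, agent:)
  <<~TURTLE
    @prefix prov: <http://www.w3.org/ns/prov#> .

    <#{entity}> a prov:Entity ;
        prov:wasGeneratedBy <#{activity}> ;
        prov:wasAttributedTo <#{agent}> .

    <#{activity}> a prov:Activity ;
        prov:wasAssociatedWith <#{agent}> .

    <#{agent}> a prov:Agent .
  TURTLE
end

puts basic_provenance(
  entity:   "http://example.org/dataset/1",
  activity: "http://example.org/activity/triplify",
  agent:    "http://example.org/agent/jane"
)
```

The same three roles apply whether the agent is a person or a piece of software; only the URIs (and any more specific subclasses you choose) change.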

These can get a little complicated, so I’ve been tracking my understanding of it with a diagram of the relationship between elements. This is still a work in progress, so if anything looks off to you I’d be happy to hear about it!

Basic provenance


This is just the basic provenance for one entity, so it’s pretty comprehensible, but the whole point of the vocabulary is to link different entities and datasets with one another, which can get a little more complicated.

A longer provenance chain (full version)


My mentors and I have agreed on the importance of being able to generate metadata for non-RDF resources, so the diagram reflects the notion that the triplified dataset may or may not be present, along with any entities or activities in the provenance chain of the main dataset. Using this system, quite a bit of useful information can be generated from a fairly small set of inputs.

This better reflects the current capabilities of my code, but it’s still not a complete use of the ontology. The connections between entities, activities, and agents need not be linear, and more than one entity could be the object of a “used” or “wasDerivedFrom” relationship. This is something I’ll be working toward during the rest of the summer, but for now this scheme provides a reasonable way to represent the provenance of many workflows.
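As an illustration of the non-linear case, where one activity uses several source entities, the following sketch builds the corresponding triples. It is only a sketch of the idea and not the gem's current behavior. The helper name `derivation_triples` and the example URIs are hypothetical.

```ruby
# Hypothetical helper: builds [subject, predicate, object] triples for an
# entity produced by one activity from any number of source entities.
def derivation_triples(entity, sources, activity)
  triples = [[entity, "prov:wasGeneratedBy", activity]]
  sources.each do |source|
    triples << [activity, "prov:used", source]          # the activity consumed this source
    triples << [entity, "prov:wasDerivedFrom", source]  # the result derives from it
  end
  triples
end

merged = derivation_triples(
  "http://example.org/dataset/merged",
  ["http://example.org/dataset/raw_a", "http://example.org/dataset/raw_b"],
  "http://example.org/activity/merge"
)

# Print Turtle-style statements (the prov: prefix declaration is omitted here).
merged.each { |s, p, o| puts "<#{s}> #{p} <#{o}> ." }
```

Each extra source adds one “used” and one “wasDerivedFrom” statement, so the chain branches instead of staying a single line.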

