While it would be nice if all the data we are interested in working with could be accessed using SPARQL, but the reality is that most data is stored in some kind of tabular format, and may lack important structural information, or even data points, either of which could be a serious stumbling block for trying to represent it as a set of triples. In the case of missing data, the difficulty is that RDF contains no concept null values. This isn’t simply an oversight; the Semantic nature of the format requires it. The specific reasons for this are related to the formal logic and machine inference aspects of the Semantic Web, which haven’t been covered yet here. This post on the W3C’s semantic web mailing list provides a good explanation. As a brief summary, consider what would happen if you had a predicate for “is married to”, and you programmed your (unduly provincial) reasoner to conclude that any resource that is the object of an “is married to” statement for a subject of type “Man” is of type “Woman” (and vice versa). In this situation, if you had an unmarried man and an unmarried woman in your dataset, and chose to represent this state by listing both as “is married to” a null object, say “rdf:null”, you would run into a contradiction; your reasoner would conclude that rdf:null was both a man and a woman. Assuming your un-cosmopolitan reasoner specifies “Man” and “Woman” as disjoint classes, you have created a contradiction, and invalidated your ontology!
Since missing values are actually quite common in the context of qtl analysis, where missing information is often imputed or estimated from an incomplete dataset, I have been discussing the best way to proceed with my mentors. We have decided to use the “NA” string literal by default, and leave it up to the software accessing our data to decide how to handle the missing data in a given domain. This is specified in the code which converts raw values to resources or literals:
The 1.1 version of the RDF standard also includes NaN as a valid numeric literal, so I am experimenting with this as a way of dealing with missing numeric values. Missing structure is a somewhat larger problem; in some situations, such as the d3 visualizations Karl Broman has created, it is sufficient to simply store all data as literal values under a single namespace. For example, consider this (made up) database of economic indicators by country; You could say a given data point has a country value of “America”, a GDP of $49,000 per capita, an infant mortality rate of 4 per thousand, and so on, storing everything as a literal value. This is fine if you’re just working with your own data, but what if you want to be able to find more information about this “America” concept? The great thing about Semantic Web tools is that you can easily query another SPARQL endpoint for triples with an object value of “America”. But this “America” is simply a raw string; you could just as well be receiving information about the band “America”, the entire continent of North, South, or Central America, or something else entirely. Furthermore, RDF does not allow literals as subjects in triples, so you wouldn’t be able to make any statements about “America”. This is particularly problematic for the Data Cube format for a number of reasons, not the least of which is the requirement that all dimensions must have an rdfs:range concept that specifies the set of possible values. The solution to this problem is in making “America” a resource, inside of a namespace. For example, if we were converting these data for the IMF, we could replace “America” (the string literal) with (a URI). We can now write statements about the resource, and ensure that there is no ambiguity between different Americas. This doesn’t quite get us all the way toward fully linked data, since it’s not clear yet how to specify that the “America” in the imf.org namespace is the same as the one in, say, the unfao.org namespace (for that you will need to employ OWL, a more complex Semantic Web technology outside the scope of this post), but it at least allows us to create a valid representation of our data. In the context of a Data Cube dataset, this can be automated through the use of coded properties, supported by the skos vocabulary, an ontology developed for categorization and classification. Using skos, I define a “concept scheme” for each coded dimension, and a set of “hasTopConcept” relations for each scheme.
Each concept gets its own resource, for which some rudimentary information is generated automatically
Currently the generator only enumerates the concepts and creates the scheme, but these concepts provide a place to link concepts to other datasets and define their semantics. As an example, if the codes were generated for countries, you could tell the generator to link to the dbpedia entries for each country. Additionally, I plan to create a “No Data” code for each concept set, now that we’ve had a discussion about the way to handle such values.
To see how this all comes together, I’ll go through an example using one of the most popular data representation schemes around; the trusty old CSV.
CSV to Data Cube
Most are probably already familiar with this format, but even if you aren’t it’s about the simplest conceivable way of representing tabular data; each column is separated by commas, and each row by a new line. The first such row typically represents the labels for the columns. Although very little in the way of meaning is embedded in this representation, it can still be translated to a valid data cube representation, and more detailed semantics can be added through a combination of automated processing and user input.
The current CSV generator is fairly basic; you can provide it with an array of dimensions, coded dimensions, and measures using the options hash, point it at a file, and it will create Data Cube formatted output. There is only one extra option at the moment, which allows you to specify a column (by number) to use to generate labels for your output. See below for an example of how to use it.
Here is a simple file I’ve been using for my rspec tests (with spaces added for readability).
producer, pricerange, chunkiness, deliciousness hormel, low, 1, 1 newskies, medium, 6, 9 whys, nonexistant, 9001, 6
This can be fed into the generator using a call such as the following:
And that’s all there is to it! Any columns not specified as dimensions will be treated as measures, and if you provide no configuration information at all it will default to using the first column as the dimension. At the moment, all dimension properties for CSV files are assumed to be coded lists (as discussed above), but this will change as I add support for other dimension types and sdmx concepts. If you’d like to see what the output looks like, you can find an example gist on Github.
As the earlier portion of this post explains, the generator will create a concept scheme based on the range of possible values for the dimension/s, and define the rdfs:range of the dimension/s as this concept scheme. Aside from producing a valid Data Cube representation (which requires all dimensions to have a range), this also creates a platform to add more information about coding schemes and individual concepts, either manually or, in the near future, with some help from the generation algorithm.
I’ll cover just how you might go about linking up more information to your dataset in a future post, but if you’d like a preview, have a look at the excellent work Sarven Capadisli has done in converting and providing endpoints for data from the UN and other international organizations.
The country codes for these datasets are all linked to further data on their concepts using the skos vocabulary, for example, see this page with data about country code CA (Canada). This practice of linking together different datasets is a critical part of the Semantic Web, and will be an important direction for future work this summer.