In previous posts, I described:
- the model of the “world” on which we’re basing the Archives Hub RDF data: the types of “thing” being described, and (some of) the relationships between them (1, 2, 3); and
- the patterns for URIs to be assigned to the individual “things”.
In this post and the next one, I’ll outline the RDF vocabularies we’re using to describe those “things”. This post covers some of the considerations in choosing the vocabularies and some of the “patterns” we’ve used in deploying them; the next lists the properties and classes you can expect to find in the LOCAH data.
Using existing RDF vocabularies
As far as possible, we’ve tried to make use of existing, deployed RDF vocabularies. These include:
Those distinctions between which vocabulary “describes” what are somewhat rough, particularly taking into account that the “directionality” of properties in RDF is somewhat arbitrary: a triple using the dcterms:creator property to link a created work to an agent is as much “about” the agent as it is “about” the thing created.
However, where we’ve seen a need to express a notion that is not well addressed by an existing vocabulary, we have defined the additional classes and properties required and provided URIs for them as a small “local” LOCAH RDF vocabulary. At this point in time, I consider most of these terms something of a “work in progress”, and likely to be revised (or even dropped completely) before the end of the project. But I suspect some will remain – which, given the bounded timescale of the project, leaves questions about the longer term management of such vocabularies.
Discovering Appropriate Vocabularies
Most of my knowledge of existing RDF vocabularies has come from lurking on good old-fashioned mailing lists, particularly the W3C Semantic Web Interest Group list and the Linked Open Data list. I don’t read every posting by any means, and the signal-to-noise ratio can be variable, but for me they remain an excellent source of information with a knowledgeable and active contributing community (and the archives are a great repository).
In similar territory, Semantic Stackoverflow provides a “question-and-answer”-style service, though it tends to have a fairly technical focus.
Another useful approach is to look at actual linked data datasets, particularly those which are in a similar “domain” to the one you’re working in and cover similar resource types, and check out what vocabularies they are using (and how they are using them). In the library/bibliographic domain in particular, there has been a fairly steady stream of linked data datasets appearing over the last couple of years, so there’s quite a bit to go on, though rather less so for the archives case. For a few pointers, see e.g. this review post by Ed Summers (itself already nearly a year old).
There are some services which aim to provide disclosure/discovery services based on aggregations of information about vocabularies and their constituent terms, sometimes called “metadata registries” or “metadata schema registries”. I’ve had mixed experiences of using these services: in some cases the content is not current; in others the coverage is intentionally tailored to the requirements of a particular community, so the challenge becomes one of finding a registry whose coverage matches the task at hand. One service (with quite general coverage) which I have occasionally found useful is Schemapedia, a project by Ian Davis of Talis; it provides “vocabulary”-level descriptions, rather than descriptions of individual “terms” but it includes some examples of actual terms: see, e.g. the entry for the Biographical Vocabulary.
There are a number of services which provide search functions across aggregations of data gathered from the linked data Web/Semantic Web. Sindice crawls and aggregates a huge range of RDF data and provides a “Google”-like search across that aggregation. (I’ve also found navigating such an aggregation helpful in thinking about various aspects of linked data: the sig.ma browser highlights the consequences of merging data from multiple sources, and related issues of provenance, attribution and trust, for example).
Finally, at the risk of stating the obvious, plain old Web search engines can still be a useful entry point.
Having said all this, I admit that the discovery of RDF vocabularies is still something of a challenge, and I continue to come across useful things I’d missed. And having found something potentially useful often raises further questions: Is the vocabulary stable or still being developed? Is it described following “modern” good practice for RDF vocabularies? Is it being managed/curated? By an individual/institution/community? Does it have the support of a community of users? Particularly if the intention is for a dataset to have some longevity, these may be significant considerations.
Patterns for using RDF Vocabularies
While discovering RDF vocabularies capable of expressing the information you want to represent is a first step, it often raises issues of exactly how those vocabularies might best be deployed, or of choosing between several possible alternative solutions.
Leigh Dodds and Ian Davis of Talis have authored a booklet Linked Data Patterns which tries to address some of these challenges, by gathering together some common “patterns” of use, based on existing practice by linked data implementers – though perhaps inevitably at this stage, some aspects of that practice are something of a “moving target” as new challenges are identified and practice evolves to address them. (See, for example, a recent debate on the Linked Open Data mailing list covering the question of expectations for what the object of an rdfs:seeAlso triple might/should dereference to.)
I continue to find the reflections of linked data practitioners an excellent source, particularly those working in domains close to those I’m interested in. I regularly find myself referring to the series of posts by Jeni Tennison on creating linked data. In this context, the fifth post on “Finishing Touches” is particularly relevant, and in large part prompts my next couple of points.
One of the principles I’ve tried to adhere to, following the guidance by Jeni, is that each resource we expose should have a human-readable label, provided using the rdfs:label property, and as far as possible that label should function as a useful “stand-alone” name for the thing.
In some cases this is a straightforward matter of using some text content node in the EAD XML document as an RDF literal. In other cases, a single element in the EAD document is mapped to a number of distinct resources in our model. In these cases, the transformation process typically prefixes or suffixes the source text to generate labels for the various different things. Perhaps unsurprisingly, this sometimes leads to some slightly “artificial” or “stilted” results, so it’s something we may need to refine.
Also, and perhaps more problematically, as I’ve noted in a previous post, the practice of archival description has traditionally relied heavily on a “multi-level description” approach which results in the presentation of resource descriptions “in the context of” the descriptions of other related resources. So it is common to find individual items within a collection labelled simply as something like “Letter”, on the basis that the reader of the finding aid will glean further information from the fact that the description of the item is presented within a context provided by a list of other “sibling” items, all “children” of a “parent” aggregation of some form. Currently our mapping generates the rdfs:label of an item using only the label (EAD unittitle element) of that item in the EAD document, with the result that we may indeed end up with many individual resources labelled “Letter” (though of course the description will also include other properties derived from other EAD data and links to “parent” resources). An alternative might be to try to generate a label by “qualifying” the item unittitle, say, by prefixing it with the label of a “parent” resource – though I suspect in practice this would generate some somewhat unwieldy results.
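To make the “qualifying” alternative a little more concrete, here is a minimal sketch in Python. It is not part of the LOCAH transform: the function name, the separator and the sample titles are all assumptions made up for illustration.

```python
def qualified_label(item_title, parent_labels, separator=": "):
    """Prefix an item's unittitle with the labels of its ancestors.

    parent_labels is ordered from the outermost "parent" to the nearest one.
    Hypothetical helper for illustration only.
    """
    # Skip empty ancestor labels, then append the item's own title last.
    parts = [label for label in parent_labels if label] + [item_title]
    return separator.join(parts)


# Invented sample data: a collection, a series within it, and an item.
label = qualified_label("Letter", ["Papers of Jane Example", "Correspondence"])
print(label)
```

Even with only two levels of “parent”, the generated label is already fairly long, which gives a sense of how unwieldy the results might become for deeply nested descriptions.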
Where the source data makes it seem reasonable to express it, I’ve also indicated the use of a “preferred label”, using the skos:prefLabel property. I’m conscious here of the need to be careful: the SKOS specification includes a number of “integrity conditions”, rules which data using the SKOS vocabulary should follow. Amongst them is the requirement that
A resource has no more than one value of skos:prefLabel per language tag.
The important thing to remember is that this is intended to apply in an “open world” context, not simply as a condition scoped to a particular “document”. The EAD to RDF transform process is performed on a document-by-document basis. Within the Hub dataset, it is quite common that for a single resource, labels for that resource are generated from the content of multiple EAD documents. While in theory naming within the set of EAD documents should be consistent, in practice, the use of variants of names is widespread in our data – the names of archival repositories are one example. Generating an skos:prefLabel triple for each variant would result in a conflict with the integrity condition once the data was merged in the triple store.
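The merge problem can be sketched with a small check over plain Python tuples. This is only an illustration under stated assumptions: triples are modelled as (subject, predicate, object) tuples, label objects as (text, language tag) pairs, and the `ex:` URI and repository names are invented rather than taken from the Hub data.

```python
from collections import defaultdict

SKOS_PREF_LABEL = "skos:prefLabel"


def find_pref_label_conflicts(triples):
    """Return {(subject, lang): [labels]} wherever a subject has more than
    one distinct skos:prefLabel value for the same language tag."""
    seen = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == SKOS_PREF_LABEL:
            text, lang = obj  # label objects are (text, language tag) pairs
            seen[(subject, lang)].add(text)
    return {key: sorted(labels) for key, labels in seen.items() if len(labels) > 1}


# Hypothetical output of transforming two EAD documents separately.
doc_a = [("ex:repository/r1", SKOS_PREF_LABEL, ("University of Example Library", "en"))]
doc_b = [("ex:repository/r1", SKOS_PREF_LABEL, ("Example University Library", "en"))]

# Each document is fine on its own; the conflict only appears after merging.
print(find_pref_label_conflicts(doc_a))
print(find_pref_label_conflicts(doc_a + doc_b))
```

The point of the sketch is that a per-document check would pass for both inputs, so the condition has to be tested against the merged data.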
Bearing in mind that the “open world” extends beyond the boundaries of our own dataset, the same considerations apply in the case where we are exposing URIs for resources for which other parties already expose descriptions, including an skos:prefLabel triple, and we can’t guarantee that the names in our data correspond to those provided by that source.
Another issue to consider is that referred to by Leigh and Ian in their “Materialize Inferences” pattern, and by Jeni Tennison in her discussion of “Derivable Data”. One of the strengths of using the RDF model is that it is supported by a formal semantics, a framework for reasoning with data, i.e. given some set of data, it is often possible to apply some formalised set of rules to infer or derive additional triples. However, it should not be assumed that all consumers of the data will have access to the tools which support such reasoning, so it may be more appropriate for a data provider like LOCAH to explicitly include at least some of those “derivable” triples in the data we provide.
For a simple example of what I mean, the Friend of a Friend (FOAF) vocabulary provides a property called foaf:name (“A name for some thing.”). As part of their description of that property, the FOAF vocabulary owners provide the triple:
foaf:name rdfs:subPropertyOf rdfs:label .
The RDFS property rdfs:subPropertyOf is one of those properties which is associated with a set of rules. What those rules say is that, for any two properties linked by an rdfs:subPropertyOf relation, two resources related by the first property are also related by the second. So each time I find a triple using foaf:name as a predicate, I can infer (deduce, derive) a second triple using the rdfs:label predicate, e.g. if I find
<http://example.org/id/person/p123> foaf:name "Ernest Henry Shackleton" .
then I can conclude
<http://example.org/id/person/p123> rdfs:label "Ernest Henry Shackleton" .
However, to reach that conclusion, my application needs (a) knowledge of the general rdfs:subPropertyOf inference rule, and (b) knowledge that foaf:name is a subproperty of rdfs:label – and (c) the processing capability to apply that rule!
By providing – “materializing” – both those triples in our source data, we relieve the consuming application of that responsibility – though that benefit comes at the cost of increasing the size of the descriptions we provide.
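As a rough sketch of what “materializing” involves, the following Python applies the rdfs:subPropertyOf rule to triples held as plain tuples. It is an illustration rather than the LOCAH transform itself: the function name is invented, and the lookup table assumes each property has at most one superproperty, which is a simplification.

```python
# Subproperty knowledge the consumer would otherwise need: the FOAF triple
# quoted above, stored as {subproperty: superproperty}.
SUBPROPERTY_OF = {"foaf:name": "rdfs:label"}


def materialize_subproperties(triples, subproperty_of=SUBPROPERTY_OF):
    """Return the input triples plus every triple derivable by the
    rdfs:subPropertyOf rule, following chains of subproperties."""
    result = set(triples)
    frontier = list(result)
    while frontier:
        subject, predicate, obj = frontier.pop()
        parent = subproperty_of.get(predicate)
        if parent is not None:
            inferred = (subject, parent, obj)
            if inferred not in result:
                result.add(inferred)
                frontier.append(inferred)  # the parent may itself have a superproperty
    return result


source = [("<http://example.org/id/person/p123>", "foaf:name", "Ernest Henry Shackleton")]
for triple in sorted(materialize_subproperties(source)):
    print(triple)
```

The output contains both the foaf:name triple and the derived rdfs:label triple, so a consumer with no reasoning support can still find the label.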
This tactic can be particularly useful, I think, for properties which are subproperties of “generic” vocabularies like the RDF Schema vocabulary or the Dublin Core vocabularies. Sometimes generic linked data tools have some “built-in knowledge” of, and/or specific behaviour associated with, some of these vocabularies (e.g. to obtain literal names/labels/titles for display to human readers). It may be perfectly reasonable to use a triple with some more specialised subproperty in our data to indicate some specific relationship, but where appropriate it is also helpful to “materialize” the triple using the more generic property as well, so that an application looking for RDF Schema or DC properties can easily access that data.
Extending that slightly, Jeni suggests a “rule of thumb” that “if the result of the reasoning involves a resource from another vocabulary, then we should include it”.
The subproperty case is just one example: the inference of resource type based on rdfs:range and rdfs:domain is another case in point. In the LOCAH data, we’ve tried to provide fairly “generous” type data (e.g. including “super-classes”) where possible – again, on the grounds that such information is a commonly used “hook” in user queries (“Select resources of type T where [some other criteria]”).
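The type case can be sketched in the same style. The schema fragment below is copied by hand from the FOAF vocabulary (foaf:knows has foaf:Person as both domain and range, and foaf:Person is a subclass of foaf:Agent); the function name and the `ex:` resources are assumptions for illustration, and the sketch assumes every object is a resource rather than a literal.

```python
# Small hand-copied schema fragment (FOAF), used here only as an example.
DOMAIN = {"foaf:knows": "foaf:Person"}
RANGE = {"foaf:knows": "foaf:Person"}
SUBCLASS_OF = {"foaf:Person": "foaf:Agent"}


def materialize_types(triples):
    """Add rdf:type triples implied by rdfs:domain and rdfs:range, then add
    the "super-classes" implied by rdfs:subClassOf."""
    source = list(triples)
    result = set(source)
    for subject, predicate, obj in source:
        if predicate in DOMAIN:
            result.add((subject, "rdf:type", DOMAIN[predicate]))
        if predicate in RANGE:
            result.add((obj, "rdf:type", RANGE[predicate]))
    # Walk up the class hierarchy for every type triple found so far.
    pending = [t for t in result if t[1] == "rdf:type"]
    while pending:
        subject, _, cls = pending.pop()
        parent = SUBCLASS_OF.get(cls)
        if parent is not None and (subject, "rdf:type", parent) not in result:
            inferred = (subject, "rdf:type", parent)
            result.add(inferred)
            pending.append(inferred)
    return result


data = [("ex:p1", "foaf:knows", "ex:p2")]
for triple in sorted(materialize_types(data)):
    print(triple)
```

With the types written out like this, a query of the form “select resources of type foaf:Agent” matches both people without the query engine needing any knowledge of the FOAF schema.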
The “cost” of this approach is that the dataset and the individual “bounded descriptions” served are larger – so there is a “trade-off” here which we may want to monitor and reconsider once we see how the data is being used.
As I mentioned earlier, we extended our very initial draft model to include a notion of “event”. Currently, the application of this approach in our data is quite limited: it is applied to the “creation”/“origination” of the archival resources, and to the birth, death and “periods of activity” (floruit) of individuals. What we do is similar to the approach sketched by Ben O’Steen in his processing of the British Library’s British National Bibliography data – though with a little more complexity as we make use of event ontologies which model time periods as resources, rather than as literals.
This is probably best illustrated by means of an example. Given a person with birth date of 1901 and death date of 1985, we generate an RDF graph like the following:
[Figure: RDF Graph of Life Events Data]
The time interval nodes at the right-hand side are reference.data.gov.uk URIs for years, like http://reference.data.gov.uk/id/year/1901.
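The general shape of such a graph can be sketched as follows. Only the year URI pattern comes from the example above; the `ex:` property and class names and the pattern used for event URIs are placeholders I have made up for the sketch, not the terms used in the LOCAH data.

```python
YEAR_URI_BASE = "http://reference.data.gov.uk/id/year/"


def year_uri(year):
    """Build a reference.data.gov.uk URI for a calendar year."""
    return f"{YEAR_URI_BASE}{year}"


def life_event_triples(person_uri, birth_year=None, death_year=None):
    """Model birth and death as event resources linked to year resources,
    rather than attaching date literals directly to the person."""
    triples = []
    for kind, year in (("birth", birth_year), ("death", death_year)):
        if year is None:
            continue
        event_uri = f"{person_uri}/{kind}"  # hypothetical event URI pattern
        triples.append((person_uri, "ex:event", event_uri))  # placeholder property
        triples.append((event_uri, "rdf:type", f"ex:{kind.capitalize()}"))  # placeholder class
        triples.append((event_uri, "ex:atTime", year_uri(year)))  # placeholder property
    return triples


person = "http://example.org/id/person/p123"
for triple in life_event_triples(person, birth_year=1901, death_year=1985):
    print(triple)
```

Because the years are resources rather than literals, a consumer can follow the link from an event to the year and on to whatever else is said about that time period.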
What I haven’t illustrated on that diagram is that I’ve also included some data using the CIDOC CRM ontology – actually using the Erlangen CRM vocabulary. I’m feeling my way a bit with this, so it is somewhat partial/experimental at the moment, but I hope to refine/extend it in the future.
The point I wanted to highlight is that we’ve made use of multiple “overlapping” vocabularies here – again on the grounds that it may be useful to provide that flexibility to consumers of the data querying using a specific vocabulary. As above, this is a “trade-off” which we may want to monitor and reconsider in the future.
I’ve tried to cover here some of the issues around our choices of RDF vocabularies and how we’ve deployed them. The next post will summarise the actual terms used.