Querying the Linked Archives Hub data using SPARQL

We’ve just announced the availability of our first draft linked data dataset from the Archives Hub. When new linked data datasets appear, I sometimes hear comments and questions along the lines of:

  • How do I know what the data looks like?
  • Show me some example SPARQL queries that I can use as starting points for my own exploration of the data

We’ve tried to go some way towards addressing the first of those points in previous posts. In the first, I outlined the data model we’re using, to give a general picture of the types of things described and the relationships between them. In the second, I provided a more detailed list of the RDF terms used to describe things. (That second post in particular will, I hope, be useful in thinking about how to construct queries.)

In addition, there are some useful posts around on techniques for “probing” a SPARQL endpoint, i.e. issuing some general queries to get a picture of the nature of the graph(s) in the dataset behind an endpoint. See, for example:

In this post, I’ll focus mainly on responding to the second point, by providing a few sample SPARQL queries. Inevitably, these can only give a flavour of what is possible, but I hope they provide a starting point for people to build on.

This isn’t intended to be a tutorial on SPARQL; there are various such tutorials available, but one I found particularly thorough and helpful is:

The SPARQL endpoint for the Linked Archives Hub dataset is:

http://data.archiveshub.ac.uk/sparql

The data is hosted in an instance of the Talis Platform, which supports a few useful extensions to the SPARQL standard, some of which are used in the examples below.
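If you would rather script against the endpoint than use a query form, here is a minimal Python sketch of building a request URL. It assumes the endpoint follows the standard SPARQL protocol, in which the query text is passed in a "query" parameter of a GET request. The helper name build_query_url is my own invention for illustration, and the sketch only constructs the URL; it does not send anything.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://data.archiveshub.ac.uk/sparql"

def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query text in the 'query' parameter.

    Assumption: the endpoint accepts SPARQL-protocol GET requests.
    """
    return endpoint + "?" + urlencode({"query": query})

# A deliberately generic probe query, not specific to this dataset
sample_query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
url = build_query_url(sample_query)

# Round-trip check: decoding the URL gives back the original query text
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
print(decoded == sample_query)
```

The URL-encoding step matters because SPARQL queries are full of characters (spaces, braces, question marks, angle brackets) that are not safe in a URL as-is.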

Listing “top-level” archival “collections”

Following the principles of “multi-level” description of archives, archivists apply a conceptualisation of archival materials as constituting hierarchically organised “collections”, where one “unit of description” may contain others, which in turn may contain others. It is often the case that an archival finding aid provides descriptions of materials only at the “collection-level”, or perhaps at some “sub-collection” level, without describing items individually at all.

In the LOCAH archival data, this approach is reflected in the use of a class ArchivalResource, where an instance of that class may have other instances as parts or members (or, inversely, one instance may be a part, or member, of another instance). This relationship is expressed using the properties dcterms:hasPart/dcterms:isPartOf and ore:aggregates/ore:isAggregatedBy.

The following query provides the URIs and labels (titles) of all archival resources mentioned in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
}

This list includes archival resources at any “level”, from collections down to individual items.

We want to narrow down that selection so that it includes only “top-level” archival resources, i.e. archival resources which are not “part of” another archival resource. This can be done by extending our pattern to allow for the optional presence of a triple with predicate dcterms:isPartOf, and then filtering to select only those cases where the object in that optional pattern is “not bound”, i.e. no such triple is present in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
   OPTIONAL { ?ar dcterms:isPartOf ?parent } .
   FILTER (!bound(?parent))
}

Run this query against the current LOCAH endpoint.
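The OPTIONAL / !bound combination can be easier to follow on a toy example. The Python sketch below mimics the logic over a handful of made-up (subject, predicate, object) tuples. The "ex:" resource names are hypothetical, and this illustrates the pattern only, not how a SPARQL engine actually evaluates it.

```python
# Hypothetical triples: one collection with two parts, plus a stand-alone collection
triples = [
    ("ex:coll1", "rdf:type", "locah:ArchivalResource"),
    ("ex:item1", "rdf:type", "locah:ArchivalResource"),
    ("ex:item2", "rdf:type", "locah:ArchivalResource"),
    ("ex:coll2", "rdf:type", "locah:ArchivalResource"),
    ("ex:item1", "dcterms:isPartOf", "ex:coll1"),
    ("ex:item2", "dcterms:isPartOf", "ex:coll1"),
]

def top_level(triples):
    """Return archival resources that have no dcterms:isPartOf triple."""
    # ?ar a locah:ArchivalResource
    resources = [s for s, p, o in triples
                 if p == "rdf:type" and o == "locah:ArchivalResource"]
    results = []
    for ar in resources:
        # OPTIONAL { ?ar dcterms:isPartOf ?parent }:
        # parent stays None ("unbound") when no such triple exists
        parent = next((o for s, p, o in triples
                       if s == ar and p == "dcterms:isPartOf"), None)
        # FILTER (!bound(?parent)): keep only resources without a parent
        if parent is None:
            results.append(ar)
    return results

print(top_level(triples))
```

Here the two items drop out because each has a parent, leaving only the two collections.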

Finding the location of the Repository holding an Archival Resource

For each archival resource, access to that resource is provided by a Repository (an agent, i.e. an entity with the ability to do things). This relationship is expressed using the property locah:accessProvidedBy; the query below follows it in the opposite direction, from repository to resource, using locah:providesAccessTo. The Repository-as-Agent manages a place where the resource is held, a relationship expressed using the locah:administers property. That place is associated with a postcode, both as a literal and (perhaps more usefully) in the form of a link to a “postcode unit” in the dataset provided by the Ordnance Survey. By “following” that link, more information about the location (e.g. latitude and longitude, relationships with other places) can be obtained from the data provided by the OS.

Given the URI of an archival resource (in this example http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner), the following query returns the URI of the repository (agent), the postcode as literal, and the URI of the postcode unit:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX ospc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

SELECT ?repo ?pc ?pcunit
WHERE {
   ?repo locah:providesAccessTo 
                <http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner> ;
           locah:administers ?place .
   ?place gn:postalCode ?pc ;
          ospc:postcode ?pcunit
}

Run this query against the current LOCAH endpoint.
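If you request results as JSON, they can be unpacked with a few lines of code. The sketch below assumes the endpoint can return the W3C SPARQL Query Results JSON format. The sample response is hand-written for illustration, with placeholder values (the example.org URIs and the postcode are not real data), and the function name rows is my own.

```python
import json

# Hand-written sample in SPARQL Query Results JSON format; all values are placeholders
sample_response = """
{
  "head": {"vars": ["repo", "pc", "pcunit"]},
  "results": {"bindings": [
    {"repo":   {"type": "uri", "value": "http://example.org/id/repository/x"},
     "pc":     {"type": "literal", "value": "AB1 2CD"},
     "pcunit": {"type": "uri", "value": "http://example.org/id/postcodeunit/AB12CD"}}
  ]}
}
"""

def rows(response_text):
    """Flatten each binding into a plain {variable: value} dictionary."""
    data = json.loads(response_text)
    variables = data["head"]["vars"]
    # A variable may be absent from a binding (e.g. an unmatched OPTIONAL), so check first
    return [{v: b[v]["value"] for v in variables if v in b}
            for b in data["results"]["bindings"]]

for row in rows(sample_response):
    print(row["repo"], row["pc"], row["pcunit"])
```

The ?pcunit value is the one to “follow” if you want the further location data held by the Ordnance Survey.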

Listing the Archival Resources associated with a Person

In the EAD finding aids, the description of an archival resource may include, as “index terms”, the names of one or more persons associated with the resource. The person may be the creator of the resource, they may be the topic of it, or there may be some other association which the archivist considers significant for people searching the catalogue.

The following query provides a list of person names, the “authority file” form of the name, the identifiers of the archival resources with which they are associated, and the URI of a page on the existing Hub Web site describing the resource. I’ve limited it to a particular repository, as without that constraint it potentially generates quite a large result set (and it helps me conceal the fact that some of the person name data is still a little bit rough and ready!):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>

SELECT DISTINCT ?name ?famname ?givenname ?authname ?unitid ?hubpage
WHERE {
?arcres locah:accessProvidedBy <http://data.archiveshub.ac.uk/id/repository/gb15> ;
        locah:associatedWith ?concept ;
        dcterms:identifier ?unitid ;
        rdfs:seeAlso ?hubpage .
?concept foaf:focus ?person ;
             rdfs:label ?authname .
?person a foaf:Person ;
        foaf:name ?name .
OPTIONAL {?person foaf:familyName ?famname;
                  foaf:givenName ?givenname }
}
ORDER BY ?famname ?givenname ?name  

Run this query against the current LOCAH endpoint.

Listing Concepts by number of associated Archival Resources

The following query lists the concepts from a specified concept scheme (here the UNESCO thesaurus, which is assigned the URI http://data.archiveshub.ac.uk/id/conceptscheme/unesco), and orders them according to the number of archival resources with which they are associated (this makes use of the count and GROUP BY Talis Platform SPARQL extensions):

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ( count(?concept) AS ?count ) 
WHERE {
   ?x locah:associatedWith ?concept .
   ?concept skos:inScheme  <http://data.archiveshub.ac.uk/id/conceptscheme/unesco> .
 }
GROUP BY ?concept
ORDER BY DESC(?count)

Run this query against the current LOCAH endpoint.
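For anyone less familiar with grouping and counting, the same idea can be sketched in Python over a few made-up pairs. The "ex:" names are hypothetical stand-ins for locah:associatedWith triples, and this is only an analogy for what the query asks the endpoint to do.

```python
from collections import Counter

# Hypothetical (archival resource, concept) pairs
associations = [
    ("ex:ar1", "ex:concept/education"),
    ("ex:ar2", "ex:concept/education"),
    ("ex:ar2", "ex:concept/history"),
    ("ex:ar3", "ex:concept/education"),
]

# GROUP BY ?concept with count(?concept): one tally per distinct concept
counts = Counter(concept for _, concept in associations)

# ORDER BY DESC(?count): most frequently used concepts first
ranked = counts.most_common()

for concept, count in ranked:
    print(concept, count)
```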

Listing Persons associated with Archival Resources, where Persons are born during a specified period

In an earlier post, I described the modelling of the births and deaths of individual persons as “events”.

Based on this approach, birth or death events occurring within a specified period can be selected. So, for example, the following query returns a list of persons born during the 1940s, with the archival resources with which they are associated:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?birthdate ?person ?name ?famname ?givenname ?ar
WHERE { 
?event a bio:Birth ;
   bio:date ?birthdate ;
   bio:principal ?person .
   FILTER regex(str(?birthdate), '^194') .
?person foaf:name ?name .
   OPTIONAL { ?person foaf:familyName ?famname ; foaf:givenName ?givenname } .
?concept foaf:focus ?person .
?ar locah:associatedWith ?concept .
}
ORDER BY ?birthdate ?name

Run this query against the current LOCAH endpoint.
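The FILTER in that query works on the string form of the date, keeping values that begin with “194”. A quick Python sketch of the same test follows, on made-up date strings. It assumes the dates start with a four-digit year, which is also what the query relies on.

```python
import re

# Made-up date strings; assumption: each begins with a four-digit year
birthdates = ["1939-12-31", "1940-01-01", "1949", "1950-06-15", "1194-03-02"]

# Same pattern as in the query; the ^ anchors the match to the start of the string
pattern = re.compile(r"^194")

born_1940s = [d for d in birthdates if pattern.search(d)]
print(born_1940s)
```

Without the leading ^, the pattern would also match “1194-03-02”, since the regex test looks for the pattern anywhere in the string.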

(I use this to illustrate the “event” approach, but in this case, birth and death dates are also provided as literal values of properties associated with the person, so there are other (easier!) ways of getting that information.)

To close, I’ll just emphasise again that these are only a few simple examples, intended to give an idea of the structure/“shape” of the data, and a flavour of what sort of queries are possible. If you come up with any examples of your own you’d like to share, we’d be glad to hear about them in comments below. (Come to think of it, it’s probably not very easy to maintain formatting/whitespace etc. in comments, so it might be easier to host any such examples elsewhere and just post links here.)

P.S. If there are any “tweaks” that you think we could make that would make things easier for those consuming/querying the data, it would be good to hear about them. I can’t promise we’ll be able to implement them, but we are still at the stage where things can be changed and we do want the data to be as usable and useful as possible.