Querying the Linked Archives Hub data using SPARQL

We’ve just announced the availability of our first draft linked data dataset of data from the Archives Hub. When newly available linked data datasets appear, I sometimes hear comments/questions along the lines of:

  • How do I know what the data looks like?
  • Show me some example SPARQL queries that I can use as starting points for my own exploration of the data

We’ve tried to go some way to addressing the first of those points in previous posts, in which I outlined the data model we’re using, to give a general picture of the types of things described and the relationships between them, and then provided a more detailed list of the RDF terms used to describe things. (That second post in particular will, I hope, be useful in thinking about how to construct queries).

In addition, there are some useful posts around on techniques for “probing” a SPARQL endpoint, i.e. issuing some general queries to get a picture of the nature of the graph(s) in the dataset behind an endpoint. See, for example:

In this post, I’ll focus mainly on responding to the second point, by providing a few sample SPARQL queries. Inevitably, these can only give a flavour of what is possible, but I hope they provide a starting point for people to build on.

This isn’t intended to be a tutorial on SPARQL; there are various such tutorials available, but one I found particularly thorough and helpful is:

The SPARQL endpoint for the Linked Archives Hub dataset is:

http://data.archiveshub.ac.uk/sparql.

The data is hosted in an instance of the Talis Platform, which supports a few useful extensions to the SPARQL standard, some of which are used in the examples below.

Listing “top-level” archival “collections”

Following the principles of “multi-level” description of archives, archivists apply a conceptualisation of archival materials as constituting hierarchically organised “collections”, where one “unit of description” may contain others, which in turn may contain others. It is often the case that an archival finding aid provides descriptions of materials only at the “collection-level”, or perhaps at some “sub-collection” level, without describing items individually at all.

In the LOCAH archival data, this approach is reflected in the use of a class ArchivalResource, where an instance of that class may have other instances as parts or members (or, inversely, one instance may be a part, or member, of another instance). This relationship is expressed using the properties dcterms:hasPart/dcterms:isPartOf and ore:aggregates/ore:isAggregatedBy.

The following query provides the URIs and labels (titles) of all archival resources mentioned in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
}

This list includes archival resources at any “level”, from collections down to individual items.

We want to narrow down that selection so that it includes only “top-level” archival resources i.e. archival resources which are not “part of” another archival resource. This can be done by extending our pattern to allow for the optional presence of a triple with predicate dcterms:isPartOf, and filtering to select only those cases where the object in that optional pattern is “not bound” i.e. no such triple is present in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
   OPTIONAL { ?ar dcterms:isPartOf ?parent } .
   FILTER (!bound(?parent))
}

Run this query against the current LOCAH endpoint.

Finding the location of the Repository holding an Archival Resource

For each archival resource, access to that resource is provided by a Repository (an agent, an entity with the ability to do things). This relationship is expressed using the property locah:accessProvidedBy. The Repository-as-Agent manages a place where the resource is held, a relationship expressed using the locah:administers property, and that place is associated with a postcode, both as a literal, and (perhaps more usefully) in the form of a link to a “postcode unit” in the dataset provided by the Ordnance Survey; by “following” that link, more information about the location can be obtained (e.g. latitude and longitude, relationships with other places) from the data provided by the OS.

Given the URI of an archival resource (in this example http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner), the following query returns the URI of the repository (agent), the postcode as literal, and the URI of the postcode unit:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX ospc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

SELECT ?repo ?pc ?pcunit
WHERE {
   ?repo locah:providesAccessTo 
                <http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner> ;
           locah:administers ?place .
   ?place gn:postalCode ?pc ;
          ospc:postcode ?pcunit
}

Run this query against the current LOCAH endpoint.

Listing the Archival Resources associated with a Person

In the EAD finding aids, the description of an archival resource may provide an association with the name of one or more persons associated with the resource as “index terms”. The person may be the creator of the resource, they may be the topic of it, or there may be some other association which is considered by the archivist to be significant for people searching the catalogue.

The following query provides a list of person names, the “authority file” form of the name, the identifiers of the archival resources with which they are associated, and the URI of a page on the existing Hub Web site describing the resource. I’ve limited it to a particular repository as without that constraint it potentially generates a quite large result set (and it helps me conceal the fact that some of the person name data is still a little bit rough and ready!)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>

SELECT DISTINCT ?name ?famname ?givenname ?authname ?unitid ?hubpage
WHERE {
?arcres locah:accessProvidedBy <http://data.archiveshub.ac.uk/id/repository/gb15> ;
        locah:associatedWith ?concept ;
        dcterms:identifier ?unitid ;
        rdfs:seeAlso ?hubpage .
?concept foaf:focus ?person ;
             rdfs:label ?authname .
?person a foaf:Person;
        foaf:name ?name;
OPTIONAL {?person foaf:familyName ?famname;
                  foaf:givenName ?givenname }
}
ORDER BY ?famname ?givenname ?name  

Run this query against the current LOCAH endpoint.

Listing Concepts by number of associated Archival Resources

The following query lists the concepts from a specified concept scheme (here the UNESCO thesaurus, which is assigned the URI http://data.archiveshub.ac.uk/id/conceptscheme/unesco, and orders them according to the number of archival resources with which they are associated (This makes use of the count and GROUP BY Talis Platform SPARQL extensions):

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ( count(?concept) AS ?count ) 
WHERE {
   ?x locah:associatedWith ?concept .
   ?concept skos:inScheme  <http://data.archiveshub.ac.uk/id/conceptscheme/unesco> .
 }
GROUP BY ?concept
ORDER BY DESC(?count)

Run this query against the current LOCAH endpoint.

Listing Persons associated with Archival Resources, where Persons are born during a specified period

In an earlier post, I described the modelling of the births and deaths of individual persons as “events”.

Based on this approach, birth or death events occurring within a specified period can be selected. So, for example, the following query returns a list of persons born during the 1940s, with the archival resources with which they are associated:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?birthdate ?person ?name ?famname ?givenname ?ar
WHERE { 
?event a bio:Birth ;
   bio:date ?birthdate ;
   bio:principal ?person .
   FILTER regex(str(?birthdate), '^194') .
?person foaf:name ?name .
   OPTIONAL { ?person foaf:familyName ?famname ; foaf:givenName ?givenname } .
?concept foaf:focus ?person .
?ar locah:associatedWith ?concept .
}
ORDER BY ?birthdate ?name

Run this query against the current LOCAH endpoint.

(I use this to illustrate the “event” approach, but in this case, birth and death dates are also provided as literal values of properties associated with the person, so there are other (easier!) ways of getting that information.)

To close, I’ll just emphasise again that these are only a few simple examples, intended to give an idea of the structure/”shape” of the data, and a flavour of what sort of queries are possible. If you come up with any examples of your own you’d like to share, we’d be glad to hear about them in comments below. (Come to think of it, it’s probably not very easy to maintain formatting/whitespace etc in comments, so it might be easier to host any such examples elsewhere and just post links here).

P.S. If there are any “tweaks” that you think we could make that would make things easier for those consuming/querying the data, it would be good to hear about them. I can’t promise we’ll be able to implement them, but we are still at the stage where things can be changed and we do want the data to be as usable and useful as possible.

Archives Hub Linked Data Release

We’re very pleased to announce the release of http://data.archiveshub.ac.uk, the first Linked Data set produced by the LOCAH project. The team has been working hard since the beginning of the project on modelling the complex archival data and transforming it into RDF Linked Data. This is now available in a variety of forms via the data.archiveshub.ac.uk home page. A number of previous blog posts outline the modelling and transformation process, the RDF terms used in the data, and the challenges and opportunities arising along the way. A forthcoming post will provide some example queries for accessing data from the SPARQL query endpoint. The data and content is licensed under a Creative Commons CC0 1.0 licence.

We’re working on a visualisation prototype that provides an example of how we link the Hub Data with other Linked Data sources on the Web using our enhanced dataset to provide a useful graphical resource for researchers.

One important point to note is that this initial release is a selected subset, representative of the Hub collection descriptions as a proof of concept, and does not contain the full Archives Hub dataset at present, although we are very keen to explore this in the future.

We still have some work to do, this being the initial release of the Hub data. Some revisions for a later release will address a few issues including reconciling our internal person and subject names, and will also contain some further enhancements to the data to include links to Library of Congress subject headings and further links to DBPedia based on subject terms. We also hope to include links for place names using Geonames and Ordnance Survey.

We encourage feedback on the data, the model and any other aspect of data.archiveshub.ac.uk, so please leave comments or contact us directly.

We are also working hard on our other main LOCAH release, the Copac Linked Data. Our first version of the model for this is now finished, and we have the data in our test triple store. We hope to release this in about a month’s time.

I’d personally like to thank the LOCAH team for all their hard work on this exciting and challenging project. I’d also like to thank our technology partner, Talis for kindly providing our Linked Data store.

Describing the “things”: the RDF terms used (part 2)

In the previous post, I described some of the considerations in choosing RDF vocabularies to use for the LOCAH archival metadata. In the tables below, I’ve tried to summarise the properties used to “describe” an instance of each of the classes in our model, i.e. for a particular thing URI, in our dataset, one might expect to find triples with that URI as subject and these property URIs as predicates, and when our data is served as linked data, and a thing URI is dereferenced the “bounded description” provided will include those triples (and others) – though some may be optional, so not necessarily present for all instances (and some may not be present at all until we add some more data…!)

This is really more of a “reference document” than a blog post, but I provide it in part as documentation of the data creation/transformation process, and in part as a guide for potential users of the actual data. Having said that, the data is liable (even likely) to change so consumers should always refer to the actual data for an up-to-date picture of the terms used. I’ve tried to highlight (dark grey background) below terms which I consider to be particularly “at risk” and liable to be removed/replaced, mostly terms from the “locah” vocabulary.

Most of this data is generated from the transformation of the EAD XML documents; a small proportion is added separately. Again, I’ve tried to indicate that in the tables below (light grey background).

Finding Aid

Type rdf:type Class URI locah:FindingAid
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Conforms to dcterms:conformsTo Standard URI standard:isadg
Publisher dcterms:publisher Repository (Agent) URI
Encoded As locah:encodedAs EAD URI
Topic foaf:topic Archival Resource URI
Subject dcterms:subject Archival Resource URI
Has Part dcterms:hasPart Biographical History URI

The following should probably be an owl:sameAs relationship (or we should just cite the Hub URI?)

See also rdfs:seeAlso Hub Page URI e.g.
http://archiveshub.ac.uk/data/
gb15sirernesthenryshackleton

EAD

Type rdf:type Class URI locah:EAD
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Date Created dcterms:created plain literal
Conforms to dcterms:conformsTo Standard URI dbpedia:
Encoded_Archival_Description

standard:ead2002
Encoding Of locah:encodingOf Finding Aid URI

The Hub does not currently provide a URI for the EAD document, but it is planned to do so, at which point we should add an owl:sameAs relationship (or just cite the Hub URI?)

Repository (Agent)

Type rdf:type Class URI locah:Repository
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Identifier dcterms:identifier plain literal
Country Code locah:
countryCode
plain literal
Maintenance Agency Code locah:
maintenance
AgencyCode
plain literal
Is Publisher Of locah:isPublisherOf Finding Aid URI
Provides Access To locah:
providesAccessTo
Archival Resource URI
Administers locah:administers Place URI
See also rdfs:seeAlso Archon Page URI e.g.
http://www.nationalarchives.gov.uk/
archon/searches/
locresult_details.asp?LR=15

Repository (Place)

Type rdf:type Class URI wg84_pos:
SpatialThing
Label rdfs:label plain literal
Title dcterms:title plain literal
Is Administered By locah:
isAdministeredBy
Repository URI
See also rdfs:seeAlso Archon Page URI e.g.
http://www.nationalarchives.gov.uk/
archon/searches/
locresult_details.asp?LR=15

The following data is not generated from the EAD documents, but added in from a separate source:

Postal Code gn:postalCode plain literal
Located In gn:locatedIn Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Within ossr:within Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Postcode postcode:postcode Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER

Archival Resource

Type rdf:type Class URI locah:ArchivalResource
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Level locah:level Level URI
Page foaf:page Finding Aid URI
Access Provided By locah:
accessProvidedBy
Repository (Agent) URI
Identifier dcterms:identifier plain literal
Date dcterms:date plain literal
Date Created or Accumulated locah:
dateCreated
AccumulatedString
plain literal

The following properties were introduced to distinguish different date cases (date range v single date).

Date Created or Accumulated locah:
dateCreated
Accumulated
typed literal
Date Created or Accumulated (Start) locah:
dateCreated
AccumulatedStart
typed literal
Date Created or Accumulated (End) locah:
dateCreated
AccumulatedEnd
typed literal
Produced In event:produced_in Creation Event URI
Extent (String) locah:extent plain literal
Extent dcterms:extent Extent URI
Language dcterms:language Language URI e.g.
http://lexvo.org/id/iso639-3/eng
Is Represented By locah:
isRepresentedBy
Document URI or Aggregation URI
Origination locah:origination Agent URI
Has Biographical History locah:
hasBiographicalHistory
Bioghist URI
Associated With locah:
asssociatedWith
Concept URI
Has Part dcterms:hasPart Archival Resource URI
Aggregates ore:aggregates Archival Resource URI
Is Part Of dcterms:isPartOf Archival Resource URI
Is Aggregated By ore:isAggregatedBy Archival Resource URI
Members locah:members (RDF Collection)
See also rdfs:seeAlso Hub Page URI e.g.
http://archiveshub.ac.uk/data/
gb15sirernesthenryshackleton-
gb15sirernesthenryshackleton-
imperialtrans-antarcticexpedition

For all of the following, the object is simply a copy of the XML element content from the EAD document as an XML Literal. This is a rather “dumb” and probably not terribly useful “translation” from the EAD; in a future iteration of the transform, we hope to extract further useful triples from this part of the EAD data, and we will probably remove some of these triples.

Custodial History locah:
custodialHistory
XML literal
Acquisitions locah:acquisitions XML literal
Scope and Content locah:
scopecontent
XML literal
Appraisal locah:appraisal XML literal
Accruals locah:accruals XML literal
Access Restrictions locah:
accessRestrictions
XML literal
Use Restrictions locah:
useRestrictions
XML literal
Physical or Technical Requirements locah:
physicalTechnical
Requirements
XML literal
Other Finding Aids locah:
otherFindingAids
XML literal
Location Of Originals locah:
locationOfOriginals
XML literal
Alternate Forms Available locah:
alternateForms
Available
XML literal
Related Material locah:
relatedMaterial
XML literal
Bibliography locah:bibliography XML literal
Note locah:note XML literal
Processing locah:processing XML literal

Level

Type rdf:type Class URI locah:Level
skos:Concept
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Comment rdfs:comment plain literal
Note skos:note plain literal
Definition skos:definition plain literal
Description dcterms:description plain literal

Language

Type rdf:type Class URI lvont:Language

Creation (Event)

Type rdf:type Class URI locah:Creation
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Product event:product Archival Resource URI
Involved lode:involved Archival Resource URI
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI

Time Interval

Type rdf:type Class URI time:Interval
time:TemporalEntity
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Timeline timeline:timeline Timeline URI timeline:universaltimeline
Start timeline:start typed literal
Interval Starts time:intervalStarts Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
End timeline:end typed literal
Interval Ends time:intervalEnds Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
Contains crm:
P86i_contains
Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
At timeline:at typed literal
Interval During time:intervalDuring Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
Falls Within crm:
P86_falls_within
Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874

Extent

Type rdf:type Class URI locah:Extent
Label rdfs:label plain literal

In the EAD XML doc, extent is expressed simply as a literal. Where possible we’ve tried to parse out a “unit of measurement” and a quantity, reflected in RDF as a triple where the predicate reflects the unit and the object the quantity, as a typed literal, to try to make comparisons easier. I need to catch up with what current “best practice” is for representing quantities/units of measurement so this may well change. Also, currently, “units” include things like “file”, “paper” and “envelope”, which may not be terribly useful.

Archival Box locah:archbox typed literal (xsd:decimal)
Metre (Linear) locah:metre typed literal (xsd:decimal)
Cubic Metre locah:cubicmetre typed literal (xsd:decimal)
Folder locah:folder typed literal (xsd:decimal)
Envelope locah:envelope typed literal (xsd:decimal)
Volume locah:volume typed literal (xsd:decimal)
File locah:file typed literal (xsd:decimal)
Item locah:archbox typed literal (xsd:decimal)
Page locah:page typed literal (xsd:decimal)
Paper locah:paper typed literal (xsd:decimal)

Origination (Agent)

Type rdf:type Class URI foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Page foaf:page Biographical History URI
Is Origination Of locah:
isOriginationOf
Archival Resource URI

For links to other agents (external or internal):

Same As owl:sameAs Agent URI
Is Like umbel:isLike Agent URI

Biographical History

Type rdf:type Class URI locah:
BiographicalHistory
bibo:DocumentPart
bibo:Document
foaf:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Body locah:body XML literal, plain literal
Topic foaf:topic Agent URI
Subject dcterms:subject Agent URI
Is Biographical History For locah:
isBiographicalHistoryFor
Archival Resource URI
Is Part Of dcterms:isPartOf Finding Aid URI

Concept Scheme

Type rdf:type Class URI skos:ConceptScheme
Label rdfs:label plain literal

Concept (ControlAccess – Subject)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Concept (ControlAccess – Persname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Person URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the person who is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Surname locah:surname plain literal
Forename locah:forename plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

Person (ControlAccess – Persname)

Type rdf:type Class URI foaf:Person
foaf:Agent
dcterms:Agent
crm:
E21_Person
Label rdfs:label plain literal
Name foaf:name plain literal
Family Name foaf:familyName plain literal
Given Name foaf:givenName plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

For links to other persons (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Person URI
Is Like umbel:isLike Person URI

Concept (ControlAccess – Famname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the family that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Family (ControlAccess – Famname)

Type rdf:type Class URI locah:Family
foaf:Group
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other families (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Family URI
Is Like umbel:isLike Family URI

Concept (ControlAccess – Corpname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the organisation that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Organisation (ControlAccess – Corpname)

Type rdf:type Class URI foaf:Organization
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other organisations (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Organisation URI
Is Like umbel:isLike Organisation URI

Concept (ControlAccess – Geogname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Place URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the place that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Place (ControlAccess – Geogname)

Type rdf:type Class URI wg84_pos:
SpatialThing
Label rdfs:label plain literal
Name locah:name plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other places (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Place URI
Is Like umbel:isLike Place URI

Concept (ControlAccess – GenreForm)

Type rdf:type Class URI locah:GenreForm
skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Concept (ControlAccess – Function)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Book/Document (ControlAccess – Title)

Type rdf:type Class URI foaf:Document
bibo:Document
Label rdfs:label plain literal
Title dcterms:title plain literal

Birth (Event)

Type rdf:type Class URI bio:Birth
bio:IndividualEvent
bio:Event
lode:Event
event:Event
crm:E67_Birth
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:
P4_has_time-span
Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Brought Into Life crm:
P98_brought_into_life
Person URI

Death (Event)

Type rdf:type Class URI bio:Death
bio:IndividualEvent
bio:Event
lode:Event
event:Event
crm:E69_Death
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:
P4_has_time-span
Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:
P100_was_death_of
Person URI

Floruit (Event)

Type rdf:type Class URI locah:Floruit
bio:IndividualEvent
bio:Event
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:
P100_was_death_of
Person URI

Object

Type rdf:type Class URI foaf:Document
bibo:Document
Is Aggregated By ore:isAggregatedBy Object Group URI

Object Group

Type rdf:type Class URI ore:Aggregation
dcmitype:Collection
bibo:Collection
Aggregates ore:aggregates Object URI

Describing the “things”: the RDF terms used (part 1)

In previous posts, I described:

  • the model of the “world” on which we’re basing the Archives Hub RDF data: the types of “thing” being described, and (some of) the relationships between them (1, 2, 3); and
  • the patterns for URIs to be assigned to the individual “things”

In this post and the next one, I’ll outline the RDF vocabularies we’re using to describe those “things”. This post covers some of the considerations in choosing the vocabularies and some of the “patterns” we’ve used in deploying them; the next lists the properties and classes you can expect to find in the LOCAH data.

Using existing RDF vocabularies

As far as possible, we’ve tried to make use of existing, deployed RDF vocabularies. These include:

Those distinctions between which vocabulary “describes” what are somewhat rough, particularly taking into account that the “directionality” of properties in RDF is somewhat arbitrary: a triple using the dcterms:creator property to link a created work to an agent is as much “about” the agent as it is “about” the thing created.

However, where we’ve seen a need to express a notion that is not well addressed by an existing vocabulary, we have defined the additional classes and properties required and provided URIs for them as a small “local” LOCAH RDF vocabulary. At this point in time, I consider most of these terms something of a “work in progress”, and likely to be revised (or even dropped completely) before the end of the project. But I suspect some will remain – which, given the bounded timescale of the project, leaves questions about the longer term management of such vocabularies.

Discovering Appropriate Vocabularies

Most of my knowledge of existing RDF vocabularies has come from lurking on good old-fashioned mailing lists, particularly the W3C Semantic Web Interest Group list and the Linked Open Data list. I don’t read every posting by any means, and the signal-to-noise ratio can be variable, but for me they remain an excellent source of information with a knowledgeable and active contributing community (and the archives are a great repository.)

In similar territory, Semantic Stackoverflow provides a “question-and-answer”-style service, though it tends to have a fairly technical focus.

Another useful source is to look at actual linked data datasets, particularly those which are in a similar “domain” to the one you’re working in and cover similar resource types, and check out what vocabularies they are using (and how they are using them). In the library/bibliographic domain in particular, there has been a fairly steady stream of linked data datasets appearing over the last couple of years, so there’s quite a bit to go on, rather less so for the archives case. For a few pointers, see e.g. this review post by Ed Summers (itself already nearly a year old).

There are some services which aim to provide disclosure/discovery services based on aggregations of information about vocabularies and their constituent terms, sometimes called “metadata registries” or “metadata schema registries”. I’ve had mixed experiences of using these services: in some cases the content is not current; in others the coverage is intentionally tailored to the requirements of a particular community, so the challenge becomes one of finding a registry whose coverage matches the task at hand. One service (with quite general coverage) which I have occasionally found useful is Schemapedia, a project by Ian Davis of Talis; it provides “vocabulary”-level descriptions, rather than descriptions of individual “terms” but it includes some examples of actual terms: see, e.g. the entry for the Biographical Vocabulary.

There are a number of services which provide search functions across aggregations of data gathered from the linked data Web/Semantic Web. Sindice crawls and aggregates a huge range of RDF data and provides a “Google”-like search across that aggregation. (I’ve also found navigating such an aggregation helpful in thinking about various aspects of linked data: the sig.ma browser highlights the consequences of merging data from multiple sources, and related issues of provenance, attribution and trust, for example).

Finally, at the risk of stating the obvious, plain old Web search engines can still be a useful entry point.

Having said all this, I admit that the discovery of RDF vocabularies is still something of a challenge, and I continue to come across useful things I’d missed. And having found something potentially useful often raises further questions: Is the vocabulary stable or still being developed? Is it described following “modern” good practice for RDF vocabularies? Is it being managed/curated? By an individual/institution/community? Does it have the support of a community of users? Particularly if the intention is for a dataset to have some longevity, these may be significant considerations.

Patterns for using RDF Vocabularies

While discovering RDF vocabularies capable of expressing the information you want to represent is a first step, it often raises issues of exactly how those vocabularies might best be deployed, or of choosing between several possible alternative solutions.

Leigh Dodds and Ian Davis of Talis have authored a booklet Linked Data Patterns which tries to address some of these challenges, by gathering together some common “patterns” of use, based on existing practice by linked data implementers – though perhaps inevitably at this stage, some aspects of that practice are something of a “moving target” as new challenges are identified and practice evolves to address them. (See, for example, a recent debate on the Linked Open Data mailing list covering the question of expectations for what the object of an rdfs:seeAlso triple might/should dereference to.)

I continue to find the reflections of linked data practitioners an excellent source, particularly those working in domains close to those I’m interested in. I regularly find myself referring to the series of posts by Jeni Tennison on creating linked data. In this context, the fifth post on “Finishing Touches” is particularly relevant, and in large part prompts my next couple of points.

Labelling

One of the principles I’ve tried to adhere to, following the guidance by Jeni is that each resource we expose should have a human-readable label, provided using the rdfs:label property, and as far as possible that label should function as a useful “stand-alone” name for the thing.

In some cases this is a straightforward matter of using some text content node in the EAD XML document as an RDF literal. In other cases, a single element in the EAD document is mapped to a number of distinct resources in our model. In these cases, the transformation process typically prefixes or suffixes the source text to generate labels for the various different things. Perhaps unsurprisingly, this sometimes leads to some slightly “artificial” or “stilted” results, so it’s something we may need to refine.

Also, and perhaps more problematically, as I’ve noted in a previous post, the practice of archival description has traditionally relied heavily on a “multi-level description” approach which results in the presentation of resource descriptions “in the context of” the descriptions of other related resources. So it is common to find individual items within a collection labelled simply as something like “Letter”, on the basis that the reader of the finding aid will glean further information from the fact that the description of the item is presented within a context provided by a list of other “sibling” items, all “children” of a “parent” aggregation of some form. Currently our mapping generates the rdfs:label of an item using only the label (EAD unititle element) of that item in the EAD document, with the result that we may indeed end up with many individual resources labelled “Letter” (though of course the description will also include other properties derived from other EAD data and links to “parent” resources). An alternative might be to try to generate a label by “qualifying” the item unittitle, say, by prefixing it with the label of a “parent” resource – though I suspect in practice this would generate some somewhat unwieldy results.

Where the source data makes it seem reasonable to express it, I’ve also indicated the use of a “preferred label”, using the skos:prefLabel property. I’m conscious here of the need to be careful: the SKOS specification includes a number of “integrity conditions”, rules which data using the SKOS vocabulary should follow. Amongst them is the requirement that

A resource has no more than one value of skos:prefLabel per language tag.

The important thing to remember is that this is intended to apply in an “open world” context, not simply as a condition scoped to a particular “document”. The EAD to RDF transform process is performed on a document-by-document basis. Within the Hub dataset, it is quite common that for a single resource, labels for that resource are generated from the content of multiple EAD documents. While in theory naming within the set of EAD documents should be consistent, in practice, the use of variants of names is widespread in our data – the names of archival repositories is one example. Generating an skos:prefLabel triple for each variant would result in a conflict with the integrity condition once the data was merged in the triple store.

Bearing in mind that the “open world” extends beyond the boundaries of our own dataset, the same considerations apply in the case where we are exposing URIs for resources for which other parties already expose descriptions, including an skos:prefLabel triple, and we can’t guarantee that the names in our data correspond to those provided by that source.

Inferencing

Another issue to consider is that referred to by Leigh and Ian in their “Materialize Inferences” pattern, and by Jeni Tennison in her discussion of “Derivable Data”. One of the strengths of using the RDF model is that it is supported by a formal semantics, a framework for reasoning with data, i.e. given some set of data, it is often possible to apply some formalised set of rules to infer or derive additional triples. However, it should not be assumed that all consumers of the data will have access to the tools which support such reasoning, so it may be more appropriate for a data provider like LOCAH to explicitly include at least some of those “derivable” triples in the data we provide.

For a simple example of what I mean, the Friend of a Friend (FOAF) vocabulary provides a property called foaf:name (“A name for some thing.”). As part of their description of that property, the FOAF vocabulary owners provide the triple:

foaf:name rdfs:subPropertyOf rdfs:label .

The RDFS property rdfs:subPropertyOf is one of those properties which is associated with a set of rules. What those rules say is that, for any two properties linked by an rdfs:subPropertyOf relation, two resources related by the first property are also related by the second. So each time I find a triple using foaf:name as a predicate, I can infer (deduce, derive) a second triple using the rdfs:label predicate, e.g. if I find

<http://example.org/id/person/p123> foaf:name “Ernest Henry Shackleton” .

then I can conclude

<http://example.org/id/person/p123> rdfs:label “Ernest Henry Shackleton” .

However, to reach that conclusion, my application needs (a) knowledge of the general rdfs:subPropertyOf inference rule, and (b) knowledge that foaf:name is a subproperty of rdfs:label – and (c) the processing capability to apply that rule!

By providing – “materializing” – both those triples in our source data, we relieve the consuming application of that responsibility – though that benefit comes at the cost of increasing the size of the descriptions we provide.

This tactic can be particularly useful, I think, for properties which are subproperties of “generic” vocabularies like the RDF Schema vocabulary or the Dublin Core vocabularies. Sometimes generic linked data tools have some “built-in knowledge” of, and/or specific behaviour associated with, some of these vocabularies (e.g. to obtain literal names/labels/titles for display to human readers). It may be perfectly reasonable to use a triple with some more specialised subproperty in our data to indicate some specific relationship, but where appropriate it is also helpful to “materialize” the triple using the more generic property as well, so that an application looking for RDF Schema or DC properties can easily access that data.

Extending that slightly, Jeni suggests a “rule of thumb” that “if the result of the reasoning involves a resource from another vocabulary, then we should include it”.

The subproperty case is just one example: the inference of resource type based on rdfs:range and rdfs:domain is another case in point. In the LOCAH data, we’ve tried to provide fairly “generous” type data (e.g. including “super-classes”) where possible – again, on the grounds that such information is a commonly used “hook” in user queries (“Select resources of type T where [some other criteria]”).

The “cost” of this approach is that the dataset and the individual “bounded descriptions” served are larger – so there is a “trade-off” here which we may want to monitor and reconsider once we see how the data is being used.

Events

As I mentioned earlier, we extended our very initial draft model to include a notion of “event”. Currently, the application of this approach in our data is quite limited: it is applied to the “creation”/”origination” of the archival resources, and to the birth, death and “periods of activity” (floruit) of individuals. What we do is similar to the approach sketched by Ben O’Steen in his processing of the British Library’s British National Bibliography data – though with a little more complexity as we make use of event ontologies which model time periods as resources, rather than as literals.

This is probably best illustrated by means of an example. Given a person with birth date of 1901 and death date of 1985, we generate an RDF graph like the following:

RDF Graph of Life Events Data

RDF Graph of Life Events Data

(The image links through to a larger version)

The time interval nodes at the right-hand side are reference.data.gov.uk URIs for years, like http://reference.data.gov.uk/id/year/1901

What I haven’t illustrated on that diagram is that I’ve also included some data using the CIDOC CRM ontology – actually using the Erlangen CRM vocabulary. I’m feeling my way a bit with this, so it is somewhat partial/experimental at the moment, but I hope to refine/extend it in the future.

The point I wanted to highlight is that we’ve made use of multiple “overlapping” vocabularies here – again on the grounds that it may be useful to provide that flexibility to consumers of the data querying using a specific vocabulary. As above, this is a “trade-off” which we may want to monitor and reconsider in the future.

Summary

I’ve tried to cover here some of the issues around our choices of RDF vocabularies and how we’ve deployed them. The next post will summarise the actual terms used.

LOD-LAM: International Linked Open Data in Libraries, Archives, and Museums Summit

LOD LAMI’m really pleased to announce that I was asked to join the organising committee for the International Linked Open Data in Libraries, Archives, and Museums Summit that will take place this June 2-3, 2011 in San Francisco, California, USA. There’s still time to apply until February 28th, and funding is available to help cover travel costs.

The International Linked Open Data in Libraries, Archives, and Museums Summit (“LOD-LAM”) will convene leaders in their respective areas of expertise from the humanities and sciences to catalyze practical, actionable approaches to publishing Linked Open Data, specifically:

  • Identify the tools and techniques for publishing and working with Linked Open Data.
  • Draft precedents and policy for licensing and copyright considerations regarding the publishing of library, archive, and museum metadata.
  • Publish definitions and promote use cases that will give LAM staff the tools they need to advocate for Linked Open Data in their institutions.

For more information see http://lod-lam.net/summit/about/.

The principal organiser/facilitator is Jon Voss (@LookBackMaps), Founder of LookBackMaps, along with Kris Carpenter Negulescu, Director of Web Group, Internet Archive, who is project managing.

I’m very chuffed to be part of the illustrious Organising Committee:

Lisa Goddard (@lisagoddard), Acting Associate University Librarian for Information Technology, Memorial University Libraries.
Martin Kalfatovic (@UDCMRK), Assistant Director, Digital Services Division at Smithsonian Institution Libraries and the Deputy Project Director of the Biodiversity Heritage Library.
Mark Matienzo (@anarchivist), Digital Archivist in Manuscripts and Archives at the Yale University Library.
Mia Ridge (@mia_out), Lead Web Developer & Technical Architect, Science Museum/NMSI (UK)
Tim Sherratt (@wragge), National Museum of Australia & University of Canberra
MacKenzie Smith, Research Director, MIT Libraries.
Adrian Stevenson (@adrianstevenson), UKOLN; Project Manager, LOCAH Linked Data Project.
John Wilbanks (@wilbanks), VP of Science, Director of Science Commons, Creative Commons.

It’ll be a great event I’m sure, so get your application in ASAP.

Two changes to the model and some definitions

Over the last few weeks we’ve been testing our initial cut at an EAD-to-RDF transform against a range of data that extends beyond EAD documents prepared using the Hub data entry template to documents created using other tools – and varying somewhat in terms of the markup conventions used.

In the course of that, I’ve been pondering some of the choices we made in the model I described here and here, and we decided to make a couple of changes (one very minor and the second still relatively so, I think):

  • Archival Resource: We’ve changed the name of the class we were calling “Unit of Description” to “Archival Resource”. I think “Unit of Description” was problematic for two reasons. First, it was ambiguous, because it could be interpreted either as the unit (of archival material) being described (which is what was intended) or as a unit/part of the archival description (which is not what was intended). Second, I adopted it from the ISAD(G) standard, where the context is one in which the archival resources are considered to be the primary things being described. I’m less sure the label works in the “linked data” context where we’re providing statements, and sets of statements (descriptions), “about” not just the archival materials, but many other things. In this context, everything that is described (people, concepts, places, etc) might be seen, in some sense, a “unit of description”, and so using that label for one subset of them seems inappropriate. That left us with finding a suitable alternative, a generic term that covers archival material in general, at any level of description (fonds, collection, item etc), and “archival resource” seemed like a reasonable fit.
  • Origination as Concept: When I first sketched out the model, I raised some questions, including (as “question 3” in that post) whether it was useful/necessary to model the origination of the archival resource as a pair of concept and agent, following the pattern used for the <controlaccess> terms. Having experimented with that approach, we’ve decided it introduces unnecessary complexity and we’ve fallen back on treating <origination> as a simple relation between archival resource and agent. The use of concept and agent is retained for the <controlaccess> case, where names are typically drawn from an “authority file”, as it allows us to maintain the distinction between a conceptualisation of the agent (as reflected by the authority record/entry) and the agent itself (a distinction which is also made in the model underpinning datasets such as VIAF, which we will be making links to).

The revised model is summarised in the following diagram (an amended version of Figure 3 from the earlier post):

Amended data model for EAD

Amended data model for EAD

i.e. an Archival Resource and a Biographical History are now related directly to an Agent.

Below is a draft list of human-readable definitions for the classes in the model. Some are simply references to classes provided by existing vocabularies like Dublin Core, FOAF, event vocabularies:

Finding Aid
A document describing an archival resource.
Subclass of: bibo:Document, foaf:Document
EAD
A document conforming to the Encoded Archival Description standard.
Subclass of: bibo:Document, foaf:Document
Biographical History
A narrative or chronology that places the archival materials in context by providing information about their creator(s). A finding aid may contain several such narratives or chronologies pertaining to different archival materials and their creators.
Subclass of: bibo:DocumentPart, (bibo:Document), foaf:Document
Repository
An institution or agency responsible for providing access to archival materials.
Subclass of: foaf:Organization, (foaf:Agent), dcterms:Agent
Place
= wgs84_pos:SpatialThing
Postcode Unit
= ospc:PostcodeUnit
Archival Resource
Recorded information in any form or medium, created or received and maintained, by an organization or person(s) in the transaction of business or the conduct of affairs, and maintained for its long-term research value. An archival resource may be an individual item, such as a letter or photograph, or (more commonly) some aggregation of such items managed and described as a unit.
Level
An indicator of the part of an archival collection constituted by an archival resource, whether it is the whole collection or a sub-section of it.
Subclass of: skos:Concept
Language
= lvont:Language
Extent
The size of an archival resource.
Subclass of: dcterms:SizeOrDuration
Temporal Entity
= time:TemporalEntity
Creation
An event that resulted in the creation or accumulation of an archival resource.
Subclass of: event:Event, lode:Event
Concept
= skos:Concept
Concept Scheme
= skos:ConceptScheme
Agent
= foaf:Agent, dcterms:Agent
Person
= foaf:Person, (foaf:Agent), dcterms:Agent
Family
A group of people affiliated by consanguinity, affinity, or co-residence.
Subclass of: foaf:Group, (foaf:Agent), dcterms:Agent
Organisation
= foaf:Organization, (foaf:Agent), dcterms:Agent
Genre or Form
A category of archival material, defined either by style or technique of intellectual content, order of information or object function, or physical characteristics.
Subclass of: skos:Concept
Function
A sphere of activity or process.
Subclass of: skos:Concept
Birth
= bio:Birth, (bio:IndividualEvent), (bio:Event),
(event:Event), (lode:Event)
Death
= bio:Death, (bio:IndividualEvent), (bio:Event),
(event:Event), (lode:Event)
Object
= foaf:Document, bibo:Document
Book
= bibo:Book, (bibo:Document), (foaf:Document)

Locah Lightening at Dev8d

This is just a quick post to say that I’ll be giving a “lightening talk” on the Locah project at 2.45pm this Wednesday 16th February at the Dev8d developer event in London. If you’ve got any questions or would like to know more about the project, then please come along to the session. I should be at Dev8d for the full two days, so grab me anytime if you can’t make the session.

I’ll also be participating in a panel session on Linked Data as well, but I’m not sure when this is scheduled for yet.

Abstract for the talk:

“The Locah project is making records from the Archives Hub service and Copac service available as Linked Data. The Archives Hub is an aggregation of archival metadata from repositories across the UK; Copac provides access to the merged library catalogues of libraries throughout the UK, including all national libraries. In each case the aim is to provide Linked Data according to the principles set out by Tim Berners-Lee, so that we make our data interconnected with other data and contribute to the growth of the Semantic Web. The talk will touch on data modelling, the selection of vocabularies and the design of URI patterns. It will look at the practical realities of how we are turning the Archives Hub EAD data and Copac MODS data into RDF XML, and then loading it into triple stores. The talk will conclude with a look at some of the main opportunities and barriers to the creation and use of Linked Data. There will be a panel session on linked data where delegates can ask further questions.”

I’ve also added my tune to the Dev8d playlist, the sublime ‘French Disko‘ by Stereolab.

Postscript 22nd February 2011:

I’ve now uploaded my slides from this talk to slideshare and embedded them below. The talk was primarily aimed at developers with the assumption that they knew a bit about RDF and Linked Data, so it doesn’t discuss these except in passing. I was mainly trying to give some specifics on the technicalities involved, and what platforms and tools we’re using, so people can follow the same path if they wanted. Please comment below with any questions.

It was another great #dev8d this year, and especially useful for me in terms of learning more about Linked Data related technologies. Top job to organiser Mahendra and the rest of the UKOLN team involved.

[slideshare id=7000641&doc=dev8d2011-110221081440-phpapp02]

Assessing Linked Data

For me, the journey from an understanding of modelling data, and creating our own models for the Hub and Copac, to being able to understand the processes and decisions involved in creating XML RDF has been challenging. It has raised one question that often applies when dealing with something quite technical: how much should a manager (in my case an archivist managing an online archive service) be expected to understand the ‘technical’ aspects of something?  This is a question I have spoken about and written about before; mainly in terms of what archivists (or other information professionals) in the digital age need to know in order to understand the implications of choices around things like data structure and software systems.  In the case of Linked Data, I am still not sure how much I need to know about the detail of Linked Data, the RDF model, the use of RDF XML, the benefits of other output fomats, the application of stylesheets, etc. I have been thinking about how hard it is to create Linked Data – I have had a few enquiries from colleagues who are interested in doing the same sort of thing already and I want to be able to offer useful advice.

One thing that occurs to me is that it is reasonable to acknowledge that Linked Data does involves programming skills, and therefore it is not so dissimilar from structuring and outputting your data through a traditional relational database, for example, where you would expect that specialist skills are needed. But in either case there is the the same need for a manager to understand what the system offers and be able to offer the best service to researchers. I think what is important from the point of view of the manager is to be involved with the decision-making process and understand the implications of Linked Data; you need to know what you are saying about your own data. I am not sure that this requires a thorough knowledge of RDF and certainly I only have a rudimentary knowledge of stylesheets (and no knowledge of programming).

Service managers or administrators are not expected to understand systems from an in-depth technical point of view. But in fact, I think that one of the advantages of RDF model is that it is easier to get a sense of what is going on in terms of data processing than typically occurs with a database management system. After about six months of learning about Linked Data and RDF (I estimate that this translates into about one month of fairly intense learning), I can look at the stylesheet that we have for the Archives Hub, to transform the EAD data into RDF XML, and I can look at an RDF document representing the entities that we are describing, and I have a reasonably good overall sense of what it all means, which helps me with my main role: to understand the outputs and the potential benefits of Linked Data. For Locah, we’ve used XSLT, but there is no requirement for this, and maybe one of the challenges of outputting Linked Data is that there are a number of options in terms of translating your RDF model into an output.

There is no doubt that choices made now about how we model the data will have implications for what users can do with it, and some choices may limit future potential more than others. For example, which ‘things’ do we choose to represent? Should we have a conceptualisation of a person as an entity represented in the description, and to link this to a conceptualisation of the person? Which information should we provide as URIs and which as literals? I’m only gradually coming to understand the implications of these decisions, as we start to explore the potential of the data. Of course, this is always true, whatever data, structures and systems we are working with. This brings me to another point that I think is probably particularly relevant for Locah: we are doing this work at a time when we are very much early adopters. Whilst the classic Linked Data diagram may give the impression that the world has embraced Linked Data, the reality is that it is still very much at a hand-crafted level: we have not had tools available to us to aid us in this work, and in the case of EAD, there has been very little activity up till now. It is therefore difficult to judge how feasible it might be to output RDF in the future, as it is likely that more tools will be developed, and there will be greater awareness and skills built up around the whole Semantic Web. However, I wonder if we are currently still at that difficult point where we need to build the momentum of the Linked Data movement, but it is still very unfamiliar and poorly understood by many data providers?

Many Linked Data evangelists claim that Linked Data is ‘easy’. I’m not sure that it is necessarily easy, and I don’t think that it’s very helpful to say that it is easy. Easy compared to what? Easy for whom? It’s easy if you know how, if you have the requisite skills and experience, but we need to persuade people who don’t yet know how that it is worth doing, and provide a realistic assessment of the skills that are required. I suppose the question of how easy it is does rest in large part on the data you are working with as well. Archival finding aids are quite challenging. As Mark Matienzo, archivist at Yale University, states in his presentation on Linked Data and Archival Description: “Archival description is inherently multi-level and relational” and “EAD is both too flexible and too unforgiving” to be Linked Data friendly…and database-friendly for that matter. Also, ISAD(G) recommends the non-repetition of information and archival description generally contains implicit information. I suppose Linked Data might help provide the opportunity and impetus to move towards a more Web-friendly way of describing archives, if it does become more widely used.

At present, I can’t help thinking that if archive repositories and libraries would like to output their data as Linked Data, many of them will struggle, and I would have thought it might be similar for other types of data providers. I do think that expertise is required, and time needs to be invested in understanding some key aspects of Linked Data. On the other hand, this is the case whenever you are looking at creating effective means to output structured (but often inconsistent) data. However, I think that it makes good sense for the Archives Hub and Copac to do this work, as it is on behalf of our contributors, so it effectively will allow these repositories and libraries to output Linked Data.  In other words, it may be that for Linked Data to really take hold, it will benefit from this kind of aggregated set-up, where skills and resources can be pooled. At present, I’m inclined to think that it is worth the investment of time and resources by our Locah team because it is benefitting a large number of data providers. I think it will be important for us to convey to our contributors, and indeed to other archivists and librarians, what we are doing and why, what the implications are and what the benefits may be. I have already had contact with two people, one representing another aggregation of content, interested in benefitting from our work. This is really important, because it potentially makes the investment more worthwhile.

We are in a fortunate position with the Locah project because we are part of a JISC-funded innovations project, with a team of people with a variety of skills, and we have support from Talis, who have significant experience of Linked Data.  If we can work on behalf of our community, then I feel that the time invested may be worthwhile. For the second half of our year-long project we will want to explore the benefits more thoroughly – we will be looking at the crucial issues of creating links to other data, which is really Linked Data’s key selling point, and we will be developing a prototype to show some potential benefits for researchers.

(With thanks to Pete and Ade for their contributions to this blog post).

Identifying the “things”: URI Patterns for the Hub Linked Data

In my previous couple of posts, I outlined the model of the “world” on which we’re basing the RDF data we’re generating from the Archives Hub‘s EAD XML documents.

At the heart of the Linked Data approach is the principle that all the “things” we want to “say anything about” should be named using a URI, and that those URIs should use the http URI scheme, so that they can be easily “looked up” or “dereferenced” using Web technologies in order to obtain some information provided by the URI owner about the thing. So, having specified the types or classes of thing we want to refer to and describe, the next step is to decide on the structure of the http URIs that we’ll use to name the “instances” of those classes – the individual “things” – archival resources, repositories, concepts, persons, places, and so on. In this post, I’ll try to describe the patterns we’re using, and outline how we construct individual URIs using those patterns from the EAD input data. As I hope will become clearer, the nature of the input data conditions the form of the patterns we’ve chosen. This has turned into a rather long post (again!) but I hope the detail is useful – I think it’s important for us to try to document our processes and some of the issues we’ve grappled with as well as to present the conclusions.

In some (most) cases, these will be newly created URIs, under a domain that we (well, MIMAS and the Archives Hub service) own. For these URIs, the project is responsible for choosing the URIs and putting in place the mechanisms to ensure that their dereferencing results in the provision of some “useful information”. In other cases, we will simply be citing existing URIs, defined by other agencies who (hopefully!) provide for their dereferencing.

The UK Cabinet Office has recently published some general guidelines on URI patterns for government Linked Data, Designing URI Sets for the UK Public Sector, and within the JISC programme strand under which LOCAH is funded, projects are encouraged to follow the recommendations of those guidelines. Following these guidelines, the general URI pattern recommended to identify “things” is:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a class (resource type), like “person”
  • reference is a name for an individual instance of that class or type

Our RDF data is being generated, at least in the first instance, by processing EAD XML documents, so we want to construct our URIs for our “things” from content within those XML documents. And we want to do so in a way that, as far as possible, ensures that each of those URIs is an unambiguous name/referrer, i.e. it identifies a single “thing”, and we don’t end up with a single URI being used for what are in fact two different things. On the other hand, we can live with the case where we end up with multiple URIs, all of which identify a single thing, because information can be added at a later stage to indicate that they are synonyms.

The other point to note is that the initial transformation step is being performed on a “document-by-document basis”, i.e. taking a single EAD document as input and outputting RDF/XML. So for any given resource, the information we generate – including the URI of the resource – is based only on the content of that document (and any generally applicable information we can embed in the transform itself). There may be other data “about” that “thing” in another EAD document but we don’t have access to it at the time of transformation.

Also, it’s desirable that we construct our URIs in such a way that if we need to re-run the transform, we generate the same URIs from the same input data (unless we explicitly decide to change the patterns for some reason).

Finally, although the patterns below often make use of human-readable strings from the EAD document content, I haven’t treated human-readability as a major consideration. Having said that, I’ve tended to make use of (slightly normalised forms of) human-readable strings where possible, rather than, say, creating opaque “hashes”.

As with other aspects of the work, at this stage, this is a first cut at tackling the issue, and we may revise our approaches based on the experience of applying them over the dataset. Having gone through and constructed patterns for the various resource types, looking back over them now, I think I can see a small number of distinct methods that we’ve used:

  1. Identifiers: For some of these “things”, the EAD documents contain some sort of formally assigned identification code or number, which unambiguously – at least within the scope of the Hub collection – identifies that instance within the set of resources of that type (i.e. it serves as a “reference” in the terms of the Designing URI Sets… document). This is the case, for example, with the languages of the materials, using the did/langmaterial/language/@langcode attribute value. A variant of this is the case where such an identifier can be constructed from a combination of multiple pieces of content. Repositories, for example, can be identified by the pair of country code (ead/eadheader/eadid/@countrycode) and maintenance agency code (ead/eadheader/eadid/@mainagencycode). For these cases a combination of the name of the resource type and that identification code provides the basis for the “reference” part of the URI.
  2. “Authority-Controlled” Names: For many of the “things”, however, the EAD documents do not contain such a code; rather, they refer to things only by name. In some cases, the form of the name is drawn from an “authority file” – indicated in the EAD document – and the name includes sufficient information (e.g. birth/death dates, titles etc for a person) to make the resulting string an unambiguous referrer within the set of names from that source. For these cases, a combination of a name for the authority file and the name provides the basis for the “reference”. However, this does depend on the creator of the EAD document having accurately transcribed the “authoritative” form of the name, at least sufficiently to maintain unambiguity of reference.
  3. “Rule-Based” Names: In other cases, the “thing” is named, not using a name from a controlled list, but rather a name constructed according to some codified set of rules, where the rules used are indicated in the EAD document. The intent behind such rules is to try to ensure consistency of form and unambiguity of reference. The National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names (one of the rule sets recommended to Hub data creators) states “A personal name is constructed by combining mandatory and optional components of the name so that the person concerned can be identified with certainty and distinguished from others bearing similar names. An individual should have only one authorised form of name and each name should apply to only one individual.”Typically, as for the “authority file” case, this is achieved through the inclusion of dates, titles etc for persons. For these cases, a combination of a name for the rules and the name itself should provide the basis for the “reference”. However, in practice, the picture with the Hub data is somewhat more complex. First, in some cases where it is claimed that rules are followed, the content itself indicates that this is not the case. For example, the NCA Rules mandate that a personal name should include “the year in which a person was born or died, the span of years of his/her lifetime or the approximate period covered by his/her activities”, even if those dates are estimated. But there are cases in the data marked up as following the NCA Rules which do not meet this requirement – e.g. personal names providing only surname and forename with no dates – , which I suspect may result in ambiguous references. Second, even where the rule is followed and the mandatory components are present, the distributed nature of Hub data creation means that I suspect there is still some possibility that a single personal name may be used in two different sources to refer to what in fact are two different people (Consider e.g. the case of two data providers using the name “Smith, John, fl 1920-1950”).
  4. “Locally-Scoped” Names: In other cases, the form of the name is neither authority-controlled nor rule-based, but nevertheless there is some expectation that the form of the name used is sufficient to make it an unambiguous referrer within some context. This is the case, for example, with the content of the did/origination element. The difficulty, however, is in establishing reliably what that context is. What is that “local scope”? We’ve tentatively taken the approach that such names have been constructed in such a way as at least to be unambiguous within the collection of submissions to the Hub by a single repository. So by combining the repository identifier and the name, hopefully, we can arrive at a “reference” which avoids ambiguity. Again, it may turn out that this assumption is unreliable, and results in ambiguous references, so we may need to revisit this approach.
  5. “Identifier Inheritance”: (I’m sure there must be a formal term for this but I’m not sure what it is!) In these cases the EAD document does not provide an unambiguous name for the “thing” itself; however the “thing” has a simple relationship with some other “thing” for which identification fits into one of the other categories. Where the relationship is one-to-one, a URI can be constructed by adopting the pattern for that other “thing” and substituting the name of the resource type. An example of this is the case of the “biographical history” associated with a “unit of description”. The unit of description has an identifier (based on a pattern described below) and since – in data constructed using the Hub template – each unit has at most one biographical history, replacing the “unit” resource type name with a “bioghist” resource type name gives us a suitable URI path, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URI for the biographical history would contain “/bioghist/gb15abc”.A variant of this is the case where the relationship is many-to-one, rather than one-to-one. Here the approach needs to be extended to include e.g. a sequence number to distinguish the multiple “things”. This is the approach taken for the Unit of Description, where a “child” (“part”) unit of description uses the URI of the “parent” (“whole”) unit suffixed with a sequence number, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URIs for the “child” units would contain “/unit/gb15abc-1”, “/unit/gb15abc-2” and so on. In theory, this should not be necessary as the unitid for a unit should be unique within an EAD document, but in practice we’ve found that this is not the case in the actual data. (In this case, the identifier would be “reproducable” only if any new units are inserted at the end of a sequence rather than in the middle).
  6. So, with the caveat above that this is all somewhat tentative at this stage, I summarise below the approaches taken to generating URIs for instances of each of the classes in the Hub model. Note that sometimes, an instance of the same class is generated in different “contexts” within the EAD document, and in these cases different rules for URI construction may be applied in those different contexts, depending on the information available within the EAD document.

    We haven’t yet finalised the domain name we’ll be using, so for the purposes of the following, {root} represents the domain and the first part of the path. Italicised text is used for the URI patterns (or parts of them); bold text is used for XPath(-ish!) representations of the source of data within the EAD XML document.

    Finding Aid

    Pattern(s)

    {root}/id/findingaid/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example:

    {root}/id/findingaid/gb15sirernesthenryshackleton

    EAD document

    Pattern(s)

    {root}/id/EAD/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example(s)

    {root}/id/ead/gb15sirernesthenryshackleton

    Repository (Agent)

    Pattern(s)

    {root}/id/repository/{repositoryid}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/repository/gb15

    Repository (Place)

    Pattern(s)

    {root}/id/place/{repositoryid}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/place/gb15

    Unit of Description

    Pattern(s)

    {root}/id/unit/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Note: In principle, it should be possible to use c/unitid content rather than position in tree, but in practice, there are cases where unitid content is not unique within the EAD document.

    Example(s)

    {root}/id/unit/gb15sirernesthenryshackleton

    {root}/id/unit/gb15sirernesthenryshackleton-1

    Level

    Pattern(s)

    {root}/id/level/{level-name}

    level-name
    archdesc/@level or archdesc/@otherlevel or c{n}/@level or c{n}/@otherlevel

    Example(s)

    {root}/id/level/fonds

    Language

    Pattern(s)

    http://lexvo.org/id/iso639-3/{langcode}

    Note: use existing lexvo.org URIs for languages.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://lexvo.org/id/iso639-3/eng

    Creation (Event)

    Pattern(s)

    {root}/id/creation/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creation/gb15sirernesthenryshackleton

    Creation (Time)

    Pattern(s)

    {root}/id/creationtime/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creationtime/gb15sirernesthenryshackleton

    Extent

    Pattern(s)

    {root}/id/extent/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/extent/gb15sirernesthenryshackleton

    Biographical History

    Pattern(s)

    {root}/id/bioghist/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/bioghist/gb15sirernesthenryshackleton

    Concept (Origination)

    Pattern(s)

    {root}/id/concept/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/concept/agent/gb15/sirernesthenryshackleton

    Agent (Origination)

    Pattern(s)

    {root}/id/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/agent/gb15/sirernesthenryshackleton

    Concept (ControlAccess – Subject)

    Pattern(s)

    {root}/id/concept/{source}/{subject-name}

    {root}/id/concept/{repositoryid}/{subject-name}

    source
    controlaccess/subject/@source
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    subject-name
    normalised form of controlaccess/subject

    Example(s)

    {root}/id/concept/lcsh/antiquities

    Concept (ControlAccess – Persname)

    Pattern(s)

    {root}/id/concept/person/{source}/{person-name}

    {root}/id/concept/person/{rules}/{person-name}

    {root}/id/concept/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/concept/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/concept/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/concept/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Person (ControlAccess – Persname)

    Pattern(s)

    {root}/id/person/{source}/{person-name}

    {root}/id/person/{rules}/{person-name}

    {root}/id/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Concept (ControlAccess – Famname)

    Pattern(s)

    {root}/id/concept/family/{source}/{family-name}

    {root}/id/concept/family/{rules}/{family-name}

    {root}/id/concept/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/concept/family/nra/dundasviscountsmelvilledunira

    {root}/id/concept/family/ncarules/boucicault

    Family (ControlAccess – Famname)

    Pattern(s)

    {root}/id/family/{source}/{family-name}

    {root}/id/family/{rules}/{family-name}

    {root}/id/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/family/nra/dundasviscountsmelvilledunira

    {root}/id/family/ncarules/boucicault

    Concept (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/concept/organisation/{source}/{org-name}

    {root}/id/concept/organisation/{rules}/{org-name}

    {root}/id/concept/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/concept/organisation/nra/britishbroadcastingcorporation

    {root}/id/concept/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/concept/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Organisation (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/organisation/{source}/{org-name}

    {root}/id/organisation/{rules}/{org-name}

    {root}/id/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/organisation/nra/britishbroadcastingcorporation

    {root}/id/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Concept (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/concept/place/{source}/{place-name}

    {root}/id/concept/place/{rules}/{place-name}

    {root}/id/concept/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/concept/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/concept/place/ncarules/canada

    {root}/id/concept/place/gb982/meirionethshire%28wales%29

    Place (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/place/{source}/{place-name}

    {root}/id/place/{rules}/{place-name}

    {root}/id/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/place/ncarules/canada

    {root}/id/place/gb982/meirionethshire%28wales%29

    Concept (ControlAccess – GenreForm)

    Pattern(s)

    {root}/id/concept/{source}/{genreform-name}

    {root}/id/concept/{rules}/{genreform-name}

    {root}/id/concept/{repositoryid}/{genreform-name}

    source
    controlaccess/genreform/@source
    rules
    controlaccess/genreform/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    genreform-name
    normalised form of controlaccess/genreform

    Example(s)

    {root}/id/concept/aat/buildingplans

    Concept (ControlAccess – Function)

    Pattern(s)

    {root}/id/concept/{source}/{function-name}

    {root}/id/concept/{rules}/{function-name}

    {root}/id/concept/{repositoryid}/{function-name}

    source
    controlaccess/function/@source
    rules
    controlaccess/function/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    function-name
    normalised form of controlaccess/function

    Example(s)

    {root}/id/concept/agift/miningregulations

    Book

    Pattern(s)

    {root}/id/document/{title}

    source
    controlaccess/title/@source
    rules
    controlaccess/title/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    title
    normalised form of controlaccess/title

    Example(s)

    {root}/id/document/aacr2/thecastlediaries1974-761980

    Birth (Event)

    Pattern(s)

    {root}/id/birth/{source}/{person-name}

    {root}/id/birth/{rules}/{person-name}

    {root}/id/birth/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/birth/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/birth/ncarules/allenjim1926-1999playwright

    {root}/id/birth/gb1832/berlinisaiah1909-1997sirknighthistorian

    Object

    Pattern(s)

    {object-uri}

    object-uri
    dao/@href or daogrp/daoloc/@href

    Example(s)

    http://library.kent.ac.uk/library/special/html/specoll/jack.gif

    Object Group

    Pattern(s)

    {root}/id/group/{unitid}-{groupno}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree
    groupno
    position within daogrp sequence for archdesc or c{n}

    Example(s)

    {root}/id/group/gb0254ms274-1

    Time Interval (Year, Month, Day)

    i.e. specific intervals of time.

    Pattern(s)

    http://reference.data.gov.uk/id/year/{yyyy}

    http://reference.data.gov.uk/id/month/{yyyy}-{mm}

    http://reference.data.gov.uk/id/day/{yyyy}-{mm}-{dd}

    Note: use existing reference.data.gov.uk URIs for intervals.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://reference.data.gov.uk/id/year/1921

    http://reference.data.gov.uk/id/month/1921-06

    http://reference.data.gov.uk/id/day/1921-06-03

Some more “things”: some extensions to the Hub model

Having had a little more time to experiment with the Archives Hub EAD data, and to think about what sort of operations on the RDF data we might wish to perform or enable others to perform, I’ve introduced a few small extensions to the model I described a couple a few weeks ago.

Extents

At our last project meeting, we talked about some of the possibilities for visualisations of the data. One of the ideas (suggested by Jane) is to explore representing relative sizes of collections, perhaps on a map, so that, for example, a researcher could provide a geographic location and a subject area and get a visual representation of the relative sizes of collections within that area.

The EAD XML format provides an element called <extent> for “information about the quantity of the materials being described or an expression of the physical space they occupy”. Although the EAD Tag Library provides guidelines to try to encourage some uniformity of the content, the data in the Hub EAD documents is quite variable. Examples of the content in the samples I’ve looked at include:

  • 6.5 linear metres
  • 2.04 metres
  • 0.48m
  • 190 archive boxes
  • 13 boxes
  • One sheet of paper
  • 13 lever arch files, 48 sound tape reels, 490 audio cassette tapes (1 filing cabinet)

In the initial model, this was just treated in RDF as a single triple with subject the URI of the unit of description (an archival collection or some part of it) and this string a literal object. I’m suggesting changing this to treat the “extent” as a resource with its own URI, rather than simply as a literal. Doing that enables us – for at least some of these cases – to make explicit that it is a value measured in some “unit” (linear metres, archival boxes), to “normalise” the way those units are represented (so e.g. “linear metres”, “metres” and “m” can be mapped to a single form in the RDF data), and possibly to make comparisons, albeit approximate ones, between extents measured in different units (for example, “archival boxes” and “linear metres”).

So we end up with patterns in the RDF graph like:

unit:123 dcterms:extent extent:123 .

extent:123 ex:metres “2.04”^^xsd:decimal .

Having said that, I recognise that the nature of the input data is such that such techniques are usefully applicable only to a subset of the data; I’m not sure there’s a great deal we can do with “composite” strings like the last one in the list above, other than present them to a human reader.

Events and Times

One of the other ideas for presenting data we’ve chewed around is that of some sort of “timeline” view. It’s something I’ve been quite keen to explore – though I’m conscious that the much of the most useful information is, in the EAD documents, in the form only of prose in the “biographical/administrative histories” provided for the originators of the archives.

As a first tentative step in this direction, I’ve introduced a notion of “event” into the model, where, in the first instance:

  • the Creation of a unit of description is modelled as an event taking place during a period of time
  • (where birth/death dates are provided in the input) the Birth and Death of a person are modelled as events taking place during a period of time

It’s possible to generate this just from simple processing of the input data. It may be possible to go further and generate a richer range of “events” through the use of some flavour of intelligent text analysis/”entity extraction” tools on the biographical/administrative history text, but that’s something for us to consider in the future.

Postcodes

Finally – and as I noted in the previous post this is something which goes beyond the content of the EAD documents themselves – prompted mainly by the recent announcement by John Goodwin that the Ordnance Survey had extended their linked data dataset to include “post code units”, I’ve added in a notion of “Postcode Unit” so that we can make links to resources from that dataset (and also to the UK Postcodes dataset).

So the revised model looks something like the figure below:

Diagram showing data model for EAD data

Figure 1

So, I’m hoping that – bug fixes aside – I can stop tinkering with this for a while 🙂 and that we can work with this version of the model, and test out what is possible and where any “pain points” are, and then think about where further changes might be useful.