Transforming EAD XML into RDF/XML using XSLT

This is a (brief!) second post revisiting my “process” diagram from an early post. Here I’ll focus on the “transform” process on the left of the diagram:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

The “transform” process is currently performed using XSLT to read an EAD XML document and output RDF/XML, and the current version of the stylesheet is now available:

(The data currently available via http://data.archiveshub.ac.uk/ was actually generated using the previous version http://data.archiveshub.ac.uk/xslt/20110502/ead2rdf.xsl. The 20110630 version includes a few tweaks and bug fixes which will be reflected when we reload the data, hopefully within the next week.)

As I’ve noted previously, we initially focused our efforts on processing the set of EAD documents held by the Archives Hub, and on the particular set of markup conventions recommended by the Hub for data contributors – what I sometimes referred to as the Archives Hub EAD “profile” – though in practice, the actual dataset we’ve worked with encompasses a good degree of variation. But it remains the case that the transform is really designed to handle the set of EAD XML documents within that particular dataset rather than EAD in general. (I admit that it also remains somewhat “untidy” – the date handling is particularly messy! And parts of it were developed in a rather ad hoc fashion as I amended things as I encountered new variations in new batches of data. I should try to spend some time cleaning it up before the end of the project.)

Over the last few months, I’ve also been working on another JISC-funded project, SALDA, with Karen Watson and Chris Keene of the University of Sussex Library, focusing on making available their catalogue data for the Mass Observation Archive as Linked Data.

I wrote a post over on the SALDA blog on how I’d gone about applying and adapting the transform we developed in LOCAH for use with the SALDA data. That work has prompted me to think a bit more about the different facets of the data and how they are reflected in aspects of the transform process:

  • aspects which are generic/common to all EAD documents
  • aspects which are common to some quite large subset of EAD documents (like the Archives Hub dataset, with its (more or less) common set of conventions)
  • aspects which are “generic” in some way, but require some sort of “local” parameterisation – here, I’m thinking of the sort of “name/keyword lookup” techniques I describe in the SALDA post: the technique is broadly usable but the “lookup tables” used would vary from one dataset to another
  • aspects which reflect very specific, “local” characteristics of the data – e.g., some of the SALDA processing is based on testing for text patterns/structures which are very particular to the Mass Observation catalogue data

What I’d like to do (but haven’t done yet) is to reorganise the transform to try to make it a little more “modular” and to separate the “general”/”generic” from the “local”/”specific”, so that it might be easier for other users to “plug in” components more suitable for their own data.

Serving Linked Data

Back near the start of the project, I published a post outlining the processes involved in generating the Archives Hub RDF dataset and serving up “Linked Data” descriptions from that dataset; it’s perhaps best summarised in the following diagram from that post:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

In this post, I’ll say a little bit more about what is involved in the “Expose” operation up in the top right of the diagram.

Cool URIs for the Semantic Web

In an earlier post, I discussed the URI patterns we are using for the URIs of “things” described in our data (archival resources, concepts, people, places, and so on). One of the core requirements for exposing our RDF data as Linked Data is that, given one of these URIs, a user/consumer of that URI can use the HTTP protocol to “look up” that URI and obtain a description of the thing identified by that URI. So as providers of the data, our challenge is to enable our HTTP server to respond to such requests and provide such descriptions.

The W3C Note Cool URIs for the Semantic Web lists a number of possible “recipes” for achieving this while also paying attention to the principle of avoiding URI ambiguity i.e. of avoiding using a single URI to refer to more than one resource – and in particularly to maintaining a distinction between the URI of a “thing” and the URIs of documents describing that thing.

Document URI Patterns

Within the JISCExpo programme which funds LOCAH, projects generating Linked Data were encouraged to make use of the guidelines provided by the UK Cabinet Office in Designing URI Sets for the UK Public Sector.

Thse guidelines refer to the URIs used to identify “things” (somewhat tautologically, it seems to me!) as “Identifier URIs”, where they have the general pattern:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a resource type, like “person”;
  • reference is a name for an individual instance of that type or class

(The guidelines also allow for the option of using URIs with fragment identifiers (“Hash URIs”) as “Identifier URIs”.)

The document also recommends patterns for the URIs of the documents which provide information about these “things”, “Document URIs”:

http://{domain}/doc/{concept}/{reference}

These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple “more specific” documents in a single concrete format may be available as a separate resource in its own right. So a third set of URIs, “Representation URIs,” name documents in a specific format, using the suggested pattern:

http://{domain}/doc/{concept}/{reference}/{doc.file-extension}

i.e. for each “thing URI”/”Identifier URI” in our data, like:

http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist, which identifies a person, the artist Beverley Skinner;

there is a corresponding “Document URI” which identifies a (“generic”) document describing the thing:

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist

and a set of “Representation URIs” each identifying a (“specific”) document in a particular format:

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.html, which identifies an HTML document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.rdf, which identifies an RDF/XML document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.turtle, which identifies a Turtle document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.json, which identifies a JSON document (more specifically one using Talis’ RDF/JSON conventions for serializing RDF)

(We’ve deviated slightly from the recommended pattern here in that we just add “.{extension}” to the “reference” string, rather than adding “/doc.{extension}”, but we’ve retained the basic approach of distinguishing generic document and documents in specific formats, which I think is the significant aspect of the recommendations.)

This set of URI patterns corresponds to those used in the “recipe” described in section 4.2 of the W3C Cool URIs note, “303 URIs forwarding to One Generic Document”.

The Talis Platform

It is perhaps worth emphasising here that in the LOCAH case a “description” of any one of the things in our model may contain data which originated in multiple EAD documents e.g. a description of a concept may contain links to multiple archival resources with which it is associated, or a description of a repository may contain links to multiple finding aids they have published, and so on. A description may also contain data which originated from a source other than the EAD documents: for example, we add some postcode data provided by the National Archives, and most of the links to external resources, such as people described by VIAF records, are generated by post-transformation processes.

This aggregated RDF data – the output of the EAD-to-RDF transformation process and this additional data – is stored in an instance of the Talis Platform store. Simplifying things slightly, the Platform store is a “database” specialised for the storage and retieval of RDF data. It is hosted by Talis, and made avalable as what in cloud computing terms is referred to as “Software as a Service” (SaaS). (Actually, a Platform store allows the storage of content other than RDF data too – see the discussion of the ContentBox and MetaBox features in the Talis documentation – but we are, currently at least, making use only of the MetaBox facilities).

Access to the store is provided through a Web API. Using the MetaBox API, data can be added/uploaded to the MetaBox using HTTP POST, updates can be applied through what Talis call “Changesets” (essentially “remove that set of triples” and “add this set of triples”) again using HTTP POST, and “bounded descriptions” of individual resources can be retrieved using HTTP GET. There are also “admin” functions like “give me a dump of the contents” and “clear the database”. In addition, the Platform provides a simple full-text search over literals (which returns result sets in RSS), a configurable faceted search, an “augment” function and a SPARQL endpoint.

A number of client software libraries for working with the Platform are available, developed either by Talis staff or by developers who have worked with the Platform.

Delivering Linked Data from the Platform

I’m going to focus here on retrieving data from the MetaBox, and more specifically retrieving the “bounded descriptions” of individual resources which which provide the basis for the “Linked Data” documents.

This process involves a small Web application which responds to HTTP GET requests for these URIs:

  • For an “Identifier URI”, the server responds with a 303 status code and a Location header redirecting the client to the “Document URI”
  • For a “Document URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code, a document in a format selected according to the preferences specified by the client (i.e. following the principles of HTTP content negotiation), and a Content-Location header providing a “Representation URI” for a document in that format.
  • For a “Representation URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code and a document in the format associated with that URI.

The first step above is handled using a simple Apache rewrite rule. For the latter two steps, we’ve made use of the Paget PHP library created by Ian Davis of Talis for working with the Platform (Paget itself makes use of another library, Moriarty, also created by Ian). I’m sure there are many other ways of achieving this; I chose Paget in part because my software development abilities are fairly limited, but having had a quick look at the documentation and one of Ian’s blog posts, I felt there was enough there to enable me to take an example and apply my basic and rather rusty PHP skills to tweak it to make it work – at least as a short-term path to getting something functional we could “put out there”, and then polish in the future if necessary.

The main challenge was that the default Paget behaviour seemed to be to use the approach described in section 4.3 of the Cool URIs document, “303 URIs forwarding to Different Documents”, where the server performs content negotiation on the request for the “Identifier URI” and redirects directly to a “Representation URI”, i.e. a GET for an “Identifier URI” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist resulted in redirects to “Representation URIs” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.html or http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.rdf

If possible we wanted to use the alternative “recipe” described in the previous section, and after some tweaking we managed to get something that did the job. We also made some minor changes to provide a small amount of additional “document metadata”, e.g. the publisher of and license for the document. (I do recognise that the presentation of the HTML pages is currently pretty basic, and there is room for improvement!)

Finally, it’s maybe worth noting here that the Platform store itself doesn’t contain any information about the documents i.e. neither the Document URI nor the Representation URIs appear in RDF triples loaded to the store. So, in principle at least, we could add additional formats using additional Representation URIs simply by extending the PHP to handle the URIs and generate documents in those formats, without needing to extend the data in the store.

I’d started to write more here about extending what we’ve done to provide other ways of accessing the data, but having written quite a lot here already, I think that is probably best saved for a future post.

Querying the Linked Archives Hub data using SPARQL

We’ve just announced the availability of our first draft linked data dataset of data from the Archives Hub. When newly available linked data datasets appear, I sometimes hear comments/questions along the lines of:

  • How do I know what the data looks like?
  • Show me some example SPARQL queries that I can use as starting points for my own exploration of the data

We’ve tried to go some way to addressing the first of those points in previous posts, in which I outlined the data model we’re using, to give a general picture of the types of things described and the relationships between them, and then provided a more detailed list of the RDF terms used to describe things. (That second post in particular will, I hope, be useful in thinking about how to construct queries).

In addition, there are some useful posts around on techniques for “probing” a SPARQL endpoint, i.e. issuing some general queries to get a picture of the nature of the graph(s) in the dataset behind an endpoint. See, for example:

In this post, I’ll focus mainly on responding to the second point, by providing a few sample SPARQL queries. Inevitably, these can only give a flavour of what is possible, but I hope they provide a starting point for people to build on.

This isn’t intended to be a tutorial on SPARQL; there are various such tutorials available, but one I found particularly thorough and helpful is:

The SPARQL endpoint for the Linked Archives Hub dataset is:

http://data.archiveshub.ac.uk/sparql.

The data is hosted in an instance of the Talis Platform, which supports a few useful extensions to the SPARQL standard, some of which are used in the examples below.

Listing “top-level” archival “collections”

Following the principles of “multi-level” description of archives, archivists apply a conceptualisation of archival materials as constituting hierarchically organised “collections”, where one “unit of description” may contain others, which in turn may contain others. It is often the case that an archival finding aid provides descriptions of materials only at the “collection-level”, or perhaps at some “sub-collection” level, without describing items individually at all.

In the LOCAH archival data, this approach is reflected in the use of a class ArchivalResource, where an instance of that class may have other instances as parts or members (or, inversely, one instance may be a part, or member, of another instance). This relationship is expressed using the properties dcterms:hasPart/dcterms:isPartOf and ore:aggregates/ore:isAggregatedBy.

The following query provides the URIs and labels (titles) of all archival resources mentioned in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
}

This list includes archival resources at any “level”, from collections down to individual items.

We want to narrow down that selection so that it includes only “top-level” archival resources i.e. archival resources which are not “part of” another archival resource. This can be done by extending our pattern to allow for the optional presence of a triple with predicate dcterms:isPartOf, and filtering to select only those cases where the object in that optional pattern is “not bound” i.e. no such triple is present in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
   OPTIONAL { ?ar dcterms:isPartOf ?parent } .
   FILTER (!bound(?parent))
}

Run this query against the current LOCAH endpoint.

Finding the location of the Repository holding an Archival Resource

For each archival resource, access to that resource is provided by a Repository (an agent, an entity with the ability to do things). This relationship is expressed using the property locah:accessProvidedBy. The Repository-as-Agent manages a place where the resource is held, a relationship expressed using the locah:administers property, and that place is associated with a postcode, both as a literal, and (perhaps more usefully) in the form of a link to a “postcode unit” in the dataset provided by the Ordnance Survey; by “following” that link, more information about the location can be obtained (e.g. latitude and longitude, relationships with other places) from the data provided by the OS.

Given the URI of an archival resource (in this example http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner), the following query returns the URI of the repository (agent), the postcode as literal, and the URI of the postcode unit:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX ospc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

SELECT ?repo ?pc ?pcunit
WHERE {
   ?repo locah:providesAccessTo 
                <http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner> ;
           locah:administers ?place .
   ?place gn:postalCode ?pc ;
          ospc:postcode ?pcunit
}

Run this query against the current LOCAH endpoint.

Listing the Archival Resources associated with a Person

In the EAD finding aids, the description of an archival resource may provide an association with the name of one or more persons associated with the resource as “index terms”. The person may be the creator of the resource, they may be the topic of it, or there may be some other association which is considered by the archivist to be significant for people searching the catalogue.

The following query provides a list of person names, the “authority file” form of the name, the identifiers of the archival resources with which they are associated, and the URI of a page on the existing Hub Web site describing the resource. I’ve limited it to a particular repository as without that constraint it potentially generates a quite large result set (and it helps me conceal the fact that some of the person name data is still a little bit rough and ready!)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>

SELECT DISTINCT ?name ?famname ?givenname ?authname ?unitid ?hubpage
WHERE {
?arcres locah:accessProvidedBy <http://data.archiveshub.ac.uk/id/repository/gb15> ;
        locah:associatedWith ?concept ;
        dcterms:identifier ?unitid ;
        rdfs:seeAlso ?hubpage .
?concept foaf:focus ?person ;
             rdfs:label ?authname .
?person a foaf:Person;
        foaf:name ?name;
OPTIONAL {?person foaf:familyName ?famname;
                  foaf:givenName ?givenname }
}
ORDER BY ?famname ?givenname ?name  

Run this query against the current LOCAH endpoint.

Listing Concepts by number of associated Archival Resources

The following query lists the concepts from a specified concept scheme (here the UNESCO thesaurus, which is assigned the URI http://data.archiveshub.ac.uk/id/conceptscheme/unesco, and orders them according to the number of archival resources with which they are associated (This makes use of the count and GROUP BY Talis Platform SPARQL extensions):

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ( count(?concept) AS ?count ) 
WHERE {
   ?x locah:associatedWith ?concept .
   ?concept skos:inScheme  <http://data.archiveshub.ac.uk/id/conceptscheme/unesco> .
 }
GROUP BY ?concept
ORDER BY DESC(?count)

Run this query against the current LOCAH endpoint.

Listing Persons associated with Archival Resources, where Persons are born during a specified period

In an earlier post, I described the modelling of the births and deaths of individual persons as “events”.

Based on this approach, birth or death events occurring within a specified period can be selected. So, for example, the following query returns a list of persons born during the 1940s, with the archival resources with which they are associated:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?birthdate ?person ?name ?famname ?givenname ?ar
WHERE { 
?event a bio:Birth ;
   bio:date ?birthdate ;
   bio:principal ?person .
   FILTER regex(str(?birthdate), '^194') .
?person foaf:name ?name .
   OPTIONAL { ?person foaf:familyName ?famname ; foaf:givenName ?givenname } .
?concept foaf:focus ?person .
?ar locah:associatedWith ?concept .
}
ORDER BY ?birthdate ?name

Run this query against the current LOCAH endpoint.

(I use this to illustrate the “event” approach, but in this case, birth and death dates are also provided as literal values of properties associated with the person, so there are other (easier!) ways of getting that information.)

To close, I’ll just emphasise again that these are only a few simple examples, intended to give an idea of the structure/”shape” of the data, and a flavour of what sort of queries are possible. If you come up with any examples of your own you’d like to share, we’d be glad to hear about them in comments below. (Come to think of it, it’s probably not very easy to maintain formatting/whitespace etc in comments, so it might be easier to host any such examples elsewhere and just post links here).

P.S. If there are any “tweaks” that you think we could make that would make things easier for those consuming/querying the data, it would be good to hear about them. I can’t promise we’ll be able to implement them, but we are still at the stage where things can be changed and we do want the data to be as usable and useful as possible.

Describing the “things”: the RDF terms used (part 2)

In the previous post, I described some of the considerations in choosing RDF vocabularies to use for the LOCAH archival metadata. In the tables below, I’ve tried to summarise the properties used to “describe” an instance of each of the classes in our model, i.e. for a particular thing URI, in our dataset, one might expect to find triples with that URI as subject and these property URIs as predicates, and when our data is served as linked data, and a thing URI is dereferenced the “bounded description” provided will include those triples (and others) – though some may be optional, so not necessarily present for all instances (and some may not be present at all until we add some more data…!)

This is really more of a “reference document” than a blog post, but I provide it in part as documentation of the data creation/transformation process, and in part as a guide for potential users of the actual data. Having said that, the data is liable (even likely) to change so consumers should always refer to the actual data for an up-to-date picture of the terms used. I’ve tried to highlight (dark grey background) below terms which I consider to be particularly “at risk” and liable to be removed/replaced, mostly terms from the “locah” vocabulary.

Most of this data is generated from the transformation of the EAD XML documents; a small proportion is added separately. Again, I’ve tried to indicate that in the tables below (light grey background).

Finding Aid

Type rdf:type Class URI locah:FindingAid
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Conforms to dcterms:conformsTo Standard URI standard:isadg
Publisher dcterms:publisher Repository (Agent) URI
Encoded As locah:encodedAs EAD URI
Topic foaf:topic Archival Resource URI
Subject dcterms:subject Archival Resource URI
Has Part dcterms:hasPart Biographical History URI

The following should probably be an owl:sameAs relationship (or we should just cite the Hub URI?)

See also rdfs:seeAlso Hub Page URI e.g.
http://archiveshub.ac.uk/data/
gb15sirernesthenryshackleton

EAD

Type rdf:type Class URI locah:EAD
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Date Created dcterms:created plain literal
Conforms to dcterms:conformsTo Standard URI dbpedia:
Encoded_Archival_Description

standard:ead2002
Encoding Of locah:encodingOf Finding Aid URI

The Hub does not currently provide a URI for the EAD document, but it is planned to do so, at which point we should add an owl:sameAs relationship (or just cite the Hub URI?)

Repository (Agent)

Type rdf:type Class URI locah:Repository
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Identifier dcterms:identifier plain literal
Country Code locah:
countryCode
plain literal
Maintenance Agency Code locah:
maintenance
AgencyCode
plain literal
Is Publisher Of locah:isPublisherOf Finding Aid URI
Provides Access To locah:
providesAccessTo
Archival Resource URI
Administers locah:administers Place URI
See also rdfs:seeAlso Archon Page URI e.g.
http://www.nationalarchives.gov.uk/
archon/searches/
locresult_details.asp?LR=15

Repository (Place)

Type rdf:type Class URI wg84_pos:
SpatialThing
Label rdfs:label plain literal
Title dcterms:title plain literal
Is Administered By locah:
isAdministeredBy
Repository URI
See also rdfs:seeAlso Archon Page URI e.g.
http://www.nationalarchives.gov.uk/
archon/searches/
locresult_details.asp?LR=15

The following data is not generated from the EAD documents, but added in from a separate source:

Postal Code gn:postalCode plain literal
Located In gn:locatedIn Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Within ossr:within Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Postcode postcode:postcode Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER

Archival Resource

Type rdf:type Class URI locah:ArchivalResource
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Level locah:level Level URI
Page foaf:page Finding Aid URI
Access Provided By locah:
accessProvidedBy
Repository (Agent) URI
Identifier dcterms:identifier plain literal
Date dcterms:date plain literal
Date Created or Accumulated locah:
dateCreated
AccumulatedString
plain literal

The following properties were introduced to distinguish different date cases (date range v single date).

Date Created or Accumulated locah:
dateCreated
Accumulated
typed literal
Date Created or Accumulated (Start) locah:
dateCreated
AccumulatedStart
typed literal
Date Created or Accumulated (End) locah:
dateCreated
AccumulatedEnd
typed literal
Produced In event:produced_in Creation Event URI
Extent (String) locah:extent plain literal
Extent dcterms:extent Extent URI
Language dcterms:language Language URI e.g.
http://lexvo.org/id/iso639-3/eng
Is Represented By locah:
isRepresentedBy
Document URI or Aggregation URI
Origination locah:origination Agent URI
Has Biographical History locah:
hasBiographicalHistory
Bioghist URI
Associated With locah:
asssociatedWith
Concept URI
Has Part dcterms:hasPart Archival Resource URI
Aggregates ore:aggregates Archival Resource URI
Is Part Of dcterms:isPartOf Archival Resource URI
Is Aggregated By ore:isAggregatedBy Archival Resource URI
Members locah:members (RDF Collection)
See also rdfs:seeAlso Hub Page URI e.g.
http://archiveshub.ac.uk/data/
gb15sirernesthenryshackleton-
gb15sirernesthenryshackleton-
imperialtrans-antarcticexpedition

For all of the following, the object is simply a copy of the XML element content from the EAD document as an XML Literal. This is a rather “dumb” and probably not terribly useful “translation” from the EAD; in a future iteration of the transform, we hope to extract further useful triples from this part of the EAD data, and we will probably remove some of these triples.

Custodial History locah:
custodialHistory
XML literal
Acquisitions locah:acquisitions XML literal
Scope and Content locah:
scopecontent
XML literal
Appraisal locah:appraisal XML literal
Accruals locah:accruals XML literal
Access Restrictions locah:
accessRestrictions
XML literal
Use Restrictions locah:
useRestrictions
XML literal
Physical or Technical Requirements locah:
physicalTechnical
Requirements
XML literal
Other Finding Aids locah:
otherFindingAids
XML literal
Location Of Originals locah:
locationOfOriginals
XML literal
Alternate Forms Available locah:
alternateForms
Available
XML literal
Related Material locah:
relatedMaterial
XML literal
Bibliography locah:bibliography XML literal
Note locah:note XML literal
Processing locah:processing XML literal

Level

Type rdf:type Class URI locah:Level
skos:Concept
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Comment rdfs:comment plain literal
Note skos:note plain literal
Definition skos:definition plain literal
Description dcterms:description plain literal

Language

Type rdf:type Class URI lvont:Language

Creation (Event)

Type rdf:type Class URI locah:Creation
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Product event:product Archival Resource URI
Involved lode:involved Archival Resource URI
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI

Time Interval

Type rdf:type Class URI time:Interval
time:TemporalEntity
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Timeline timeline:timeline Timeline URI timeline:universaltimeline
Start timeline:start typed literal
Interval Starts time:intervalStarts Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
End timeline:end typed literal
Interval Ends time:intervalEnds Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
Contains crm:
P86i_contains
Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
At timeline:at typed literal
Interval During time:intervalDuring Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
Falls Within crm:
P86_falls_within
Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874

Extent

Type rdf:type Class URI locah:Extent
Label rdfs:label plain literal

In the EAD XML doc, extent is expressed simply as a literal. Where possible we’ve tried to parse out a “unit of measurement” and a quantity, reflected in RDF as a triple where the predicate reflects the unit and the object the quantity, as a typed literal, to try to make comparisons easier. I need to catch up with what current “best practice” is for representing quantities/units of measurement so this may well change. Also, currently, “units” include things like “file”, “paper” and “envelope”, which may not be terribly useful.

Archival Box locah:archbox typed literal (xsd:decimal)
Metre (Linear) locah:metre typed literal (xsd:decimal)
Cubic Metre locah:cubicmetre typed literal (xsd:decimal)
Folder locah:folder typed literal (xsd:decimal)
Envelope locah:envelope typed literal (xsd:decimal)
Volume locah:volume typed literal (xsd:decimal)
File locah:file typed literal (xsd:decimal)
Item locah:archbox typed literal (xsd:decimal)
Page locah:page typed literal (xsd:decimal)
Paper locah:paper typed literal (xsd:decimal)

Origination (Agent)

Type rdf:type Class URI foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Page foaf:page Biographical History URI
Is Origination Of locah:
isOriginationOf
Archival Resource URI

For links to other agents (external or internal):

Same As owl:sameAs Agent URI
Is Like umbel:isLike Agent URI

Biographical History

Type rdf:type Class URI locah:
BiographicalHistory
bibo:DocumentPart
bibo:Document
foaf:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Body locah:body XML literal, plain literal
Topic foaf:topic Agent URI
Subject dcterms:subject Agent URI
Is Biographical History For locah:
isBiographicalHistoryFor
Archival Resource URI
Is Part Of dcterms:isPartOf Finding Aid URI

Concept Scheme

Type rdf:type Class URI skos:ConceptScheme
Label rdfs:label plain literal

Concept (ControlAccess – Subject)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Concept (ControlAccess – Persname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Person URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the person who is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Surname locah:surname plain literal
Forename locah:forename plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

Person (ControlAccess – Persname)

Type rdf:type Class URI foaf:Person
foaf:Agent
dcterms:Agent
crm:
E21_Person
Label rdfs:label plain literal
Name foaf:name plain literal
Family Name foaf:familyName plain literal
Given Name foaf:givenName plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

For links to other persons (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Person URI
Is Like umbel:isLike Person URI

Concept (ControlAccess – Famname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the family that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Family (ControlAccess – Famname)

Type rdf:type Class URI locah:Family
foaf:Group
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other families (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Family URI
Is Like umbel:isLike Family URI

Concept (ControlAccess – Corpname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the organisation that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Organisation (ControlAccess – Corpname)

Type rdf:type Class URI foaf:Organization
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other organisations (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Organisation URI
Is Like umbel:isLike Organisation URI

Concept (ControlAccess – Geogname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Place URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the place that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Place (ControlAccess – Geogname)

Type rdf:type Class URI wg84_pos:
SpatialThing
Label rdfs:label plain literal
Name locah:name plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other places (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Place URI
Is Like umbel:isLike Place URI

Concept (ControlAccess – GenreForm)

Type rdf:type Class URI locah:GenreForm
skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Concept (ControlAccess – Function)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Book/Document (ControlAccess – Title)

Type rdf:type Class URI foaf:Document
bibo:Document
Label rdfs:label plain literal
Title dcterms:title plain literal

Birth (Event)

Type rdf:type Class URI bio:Birth
bio:IndividualEvent
bio:Event
lode:Event
event:Event
crm:E67_Birth
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:
P4_has_time-span
Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Brought Into Life crm:
P98_brought_into_life
Person URI

Death (Event)

Type rdf:type Class URI bio:Death
bio:IndividualEvent
bio:Event
lode:Event
event:Event
crm:E69_Death
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:
P4_has_time-span
Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:
P100_was_death_of
Person URI

Floruit (Event)

Type rdf:type Class URI locah:Floruit
bio:IndividualEvent
bio:Event
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:
P100_was_death_of
Person URI

Object

Type rdf:type Class URI foaf:Document
bibo:Document
Is Aggregated By ore:isAggregatedBy Object Group URI

Object Group

Type rdf:type Class URI ore:Aggregation
dcmitype:Collection
bibo:Collection
Aggregates ore:aggregates Object URI

Describing the “things”: the RDF terms used (part 1)

In previous posts, I described:

  • the model of the “world” on which we’re basing the Archives Hub RDF data: the types of “thing” being described, and (some of) the relationships between them (1, 2, 3); and
  • the patterns for URIs to be assigned to the individual “things”

In this post and the next one, I’ll outline the RDF vocabularies we’re using to describe those “things”. This post covers some of the considerations in choosing the vocabularies and some of the “patterns” we’ve used in deploying them; the next lists the properties and classes you can expect to find in the LOCAH data.

Using existing RDF vocabularies

As far as possible, we’ve tried to make use of existing, deployed RDF vocabularies. These include:

Those distinctions between which vocabulary “describes” what are somewhat rough, particularly taking into account that the “directionality” of properties in RDF is somewhat arbitrary: a triple using the dcterms:creator property to link a created work to an agent is as much “about” the agent as it is “about” the thing created.

However, where we’ve seen a need to express a notion that is not well addressed by an existing vocabulary, we have defined the additional classes and properties required and provided URIs for them as a small “local” LOCAH RDF vocabulary. At this point in time, I consider most of these terms something of a “work in progress”, and likely to be revised (or even dropped completely) before the end of the project. But I suspect some will remain – which, given the bounded timescale of the project, leaves questions about the longer term management of such vocabularies.

Discovering Appropriate Vocabularies

Most of my knowledge of existing RDF vocabularies has come from lurking on good old-fashioned mailing lists, particularly the W3C Semantic Web Interest Group list and the Linked Open Data list. I don’t read every posting by any means, and the signal-to-noise ratio can be variable, but for me they remain an excellent source of information with a knowledgeable and active contributing community (and the archives are a great repository.)

In similar territory, Semantic Stackoverflow provides a “question-and-answer”-style service, though it tends to have a fairly technical focus.

Another useful source is to look at actual linked data datasets, particularly those which are in a similar “domain” to the one you’re working in and cover similar resource types, and check out what vocabularies they are using (and how they are using them). In the library/bibliographic domain in particular, there has been a fairly steady stream of linked data datasets appearing over the last couple of years, so there’s quite a bit to go on, rather less so for the archives case. For a few pointers, see e.g. this review post by Ed Summers (itself already nearly a year old).

There are some services which aim to provide disclosure/discovery services based on aggregations of information about vocabularies and their constituent terms, sometimes called “metadata registries” or “metadata schema registries”. I’ve had mixed experiences of using these services: in some cases the content is not current; in others the coverage is intentionally tailored to the requirements of a particular community, so the challenge becomes one of finding a registry whose coverage matches the task at hand. One service (with quite general coverage) which I have occasionally found useful is Schemapedia, a project by Ian Davis of Talis; it provides “vocabulary”-level descriptions, rather than descriptions of individual “terms” but it includes some examples of actual terms: see, e.g. the entry for the Biographical Vocabulary.

There are a number of services which provide search functions across aggregations of data gathered from the linked data Web/Semantic Web. Sindice crawls and aggregates a huge range of RDF data and provides a “Google”-like search across that aggregation. (I’ve also found navigating such an aggregation helpful in thinking about various aspects of linked data: the sig.ma browser highlights the consequences of merging data from multiple sources, and related issues of provenance, attribution and trust, for example).

Finally, at the risk of stating the obvious, plain old Web search engines can still be a useful entry point.

Having said all this, I admit that the discovery of RDF vocabularies is still something of a challenge, and I continue to come across useful things I’d missed. And having found something potentially useful often raises further questions: Is the vocabulary stable or still being developed? Is it described following “modern” good practice for RDF vocabularies? Is it being managed/curated? By an individual/institution/community? Does it have the support of a community of users? Particularly if the intention is for a dataset to have some longevity, these may be significant considerations.

Patterns for using RDF Vocabularies

While discovering RDF vocabularies capable of expressing the information you want to represent is a first step, it often raises issues of exactly how those vocabularies might best be deployed, or of choosing between several possible alternative solutions.

Leigh Dodds and Ian Davis of Talis have authored a booklet Linked Data Patterns which tries to address some of these challenges, by gathering together some common “patterns” of use, based on existing practice by linked data implementers – though perhaps inevitably at this stage, some aspects of that practice are something of a “moving target” as new challenges are identified and practice evolves to address them. (See, for example, a recent debate on the Linked Open Data mailing list covering the question of expectations for what the object of an rdfs:seeAlso triple might/should dereference to.)

I continue to find the reflections of linked data practitioners an excellent source, particularly those working in domains close to those I’m interested in. I regularly find myself referring to the series of posts by Jeni Tennison on creating linked data. In this context, the fifth post on “Finishing Touches” is particularly relevant, and in large part prompts my next couple of points.

Labelling

One of the principles I’ve tried to adhere to, following the guidance by Jeni is that each resource we expose should have a human-readable label, provided using the rdfs:label property, and as far as possible that label should function as a useful “stand-alone” name for the thing.

In some cases this is a straightforward matter of using some text content node in the EAD XML document as an RDF literal. In other cases, a single element in the EAD document is mapped to a number of distinct resources in our model. In these cases, the transformation process typically prefixes or suffixes the source text to generate labels for the various different things. Perhaps unsurprisingly, this sometimes leads to some slightly “artificial” or “stilted” results, so it’s something we may need to refine.

Also, and perhaps more problematically, as I’ve noted in a previous post, the practice of archival description has traditionally relied heavily on a “multi-level description” approach which results in the presentation of resource descriptions “in the context of” the descriptions of other related resources. So it is common to find individual items within a collection labelled simply as something like “Letter”, on the basis that the reader of the finding aid will glean further information from the fact that the description of the item is presented within a context provided by a list of other “sibling” items, all “children” of a “parent” aggregation of some form. Currently our mapping generates the rdfs:label of an item using only the label (EAD unititle element) of that item in the EAD document, with the result that we may indeed end up with many individual resources labelled “Letter” (though of course the description will also include other properties derived from other EAD data and links to “parent” resources). An alternative might be to try to generate a label by “qualifying” the item unittitle, say, by prefixing it with the label of a “parent” resource – though I suspect in practice this would generate some somewhat unwieldy results.

Where the source data makes it seem reasonable to express it, I’ve also indicated the use of a “preferred label”, using the skos:prefLabel property. I’m conscious here of the need to be careful: the SKOS specification includes a number of “integrity conditions”, rules which data using the SKOS vocabulary should follow. Amongst them is the requirement that

A resource has no more than one value of skos:prefLabel per language tag.

The important thing to remember is that this is intended to apply in an “open world” context, not simply as a condition scoped to a particular “document”. The EAD to RDF transform process is performed on a document-by-document basis. Within the Hub dataset, it is quite common that for a single resource, labels for that resource are generated from the content of multiple EAD documents. While in theory naming within the set of EAD documents should be consistent, in practice, the use of variants of names is widespread in our data – the names of archival repositories is one example. Generating an skos:prefLabel triple for each variant would result in a conflict with the integrity condition once the data was merged in the triple store.

Bearing in mind that the “open world” extends beyond the boundaries of our own dataset, the same considerations apply in the case where we are exposing URIs for resources for which other parties already expose descriptions, including an skos:prefLabel triple, and we can’t guarantee that the names in our data correspond to those provided by that source.

Inferencing

Another issue to consider is that referred to by Leigh and Ian in their “Materialize Inferences” pattern, and by Jeni Tennison in her discussion of “Derivable Data”. One of the strengths of using the RDF model is that it is supported by a formal semantics, a framework for reasoning with data, i.e. given some set of data, it is often possible to apply some formalised set of rules to infer or derive additional triples. However, it should not be assumed that all consumers of the data will have access to the tools which support such reasoning, so it may be more appropriate for a data provider like LOCAH to explicitly include at least some of those “derivable” triples in the data we provide.

For a simple example of what I mean, the Friend of a Friend (FOAF) vocabulary provides a property called foaf:name (“A name for some thing.”). As part of their description of that property, the FOAF vocabulary owners provide the triple:

foaf:name rdfs:subPropertyOf rdfs:label .

The RDFS property rdfs:subPropertyOf is one of those properties which is associated with a set of rules. What those rules say is that, for any two properties linked by an rdfs:subPropertyOf relation, two resources related by the first property are also related by the second. So each time I find a triple using foaf:name as a predicate, I can infer (deduce, derive) a second triple using the rdfs:label predicate, e.g. if I find

<http://example.org/id/person/p123> foaf:name “Ernest Henry Shackleton” .

then I can conclude

<http://example.org/id/person/p123> rdfs:label “Ernest Henry Shackleton” .

However, to reach that conclusion, my application needs (a) knowledge of the general rdfs:subPropertyOf inference rule, and (b) knowledge that foaf:name is a subproperty of rdfs:label – and (c) the processing capability to apply that rule!

By providing – “materializing” – both those triples in our source data, we relieve the consuming application of that responsibility – though that benefit comes at the cost of increasing the size of the descriptions we provide.

This tactic can be particularly useful, I think, for properties which are subproperties of “generic” vocabularies like the RDF Schema vocabulary or the Dublin Core vocabularies. Sometimes generic linked data tools have some “built-in knowledge” of, and/or specific behaviour associated with, some of these vocabularies (e.g. to obtain literal names/labels/titles for display to human readers). It may be perfectly reasonable to use a triple with some more specialised subproperty in our data to indicate some specific relationship, but where appropriate it is also helpful to “materialize” the triple using the more generic property as well, so that an application looking for RDF Schema or DC properties can easily access that data.

Extending that slightly, Jeni suggests a “rule of thumb” that “if the result of the reasoning involves a resource from another vocabulary, then we should include it”.

The subproperty case is just one example: the inference of resource type based on rdfs:range and rdfs:domain is another case in point. In the LOCAH data, we’ve tried to provide fairly “generous” type data (e.g. including “super-classes”) where possible – again, on the grounds that such information is a commonly used “hook” in user queries (“Select resources of type T where [some other criteria]”).

The “cost” of this approach is that the dataset and the individual “bounded descriptions” served are larger – so there is a “trade-off” here which we may want to monitor and reconsider once we see how the data is being used.

Events

As I mentioned earlier, we extended our very initial draft model to include a notion of “event”. Currently, the application of this approach in our data is quite limited: it is applied to the “creation”/”origination” of the archival resources, and to the birth, death and “periods of activity” (floruit) of individuals. What we do is similar to the approach sketched by Ben O’Steen in his processing of the British Library’s British National Bibliography data – though with a little more complexity as we make use of event ontologies which model time periods as resources, rather than as literals.

This is probably best illustrated by means of an example. Given a person with birth date of 1901 and death date of 1985, we generate an RDF graph like the following:

RDF Graph of Life Events Data

RDF Graph of Life Events Data

(The image links through to a larger version)

The time interval nodes at the right-hand side are reference.data.gov.uk URIs for years, like http://reference.data.gov.uk/id/year/1901

What I haven’t illustrated on that diagram is that I’ve also included some data using the CIDOC CRM ontology – actually using the Erlangen CRM vocabulary. I’m feeling my way a bit with this, so it is somewhat partial/experimental at the moment, but I hope to refine/extend it in the future.

The point I wanted to highlight is that we’ve made use of multiple “overlapping” vocabularies here – again on the grounds that it may be useful to provide that flexibility to consumers of the data querying using a specific vocabulary. As above, this is a “trade-off” which we may want to monitor and reconsider in the future.

Summary

I’ve tried to cover here some of the issues around our choices of RDF vocabularies and how we’ve deployed them. The next post will summarise the actual terms used.

Two changes to the model and some definitions

Over the last few weeks we’ve been testing our initial cut at an EAD-to-RDF transform against a range of data that extends beyond EAD documents prepared using the Hub data entry template to documents created using other tools – and varying somewhat in terms of the markup conventions used.

In the course of that, I’ve been pondering some of the choices we made in the model I described here and here, and we decided to make a couple of changes (one very minor and the second still relatively so, I think):

  • Archival Resource: We’ve changed the name of the class we were calling “Unit of Description” to “Archival Resource”. I think “Unit of Description” was problematic for two reasons. First, it was ambiguous, because it could be interpreted either as the unit (of archival material) being described (which is what was intended) or as a unit/part of the archival description (which is not what was intended). Second, I adopted it from the ISAD(G) standard, where the context is one in which the archival resources are considered to be the primary things being described. I’m less sure the label works in the “linked data” context where we’re providing statements, and sets of statements (descriptions), “about” not just the archival materials, but many other things. In this context, everything that is described (people, concepts, places, etc) might be seen, in some sense, a “unit of description”, and so using that label for one subset of them seems inappropriate. That left us with finding a suitable alternative, a generic term that covers archival material in general, at any level of description (fonds, collection, item etc), and “archival resource” seemed like a reasonable fit.
  • Origination as Concept: When I first sketched out the model, I raised some questions, including (as “question 3” in that post) whether it was useful/necessary to model the origination of the archival resource as a pair of concept and agent, following the pattern used for the <controlaccess> terms. Having experimented with that approach, we’ve decided it introduces unnecessary complexity and we’ve fallen back on treating <origination> as a simple relation between archival resource and agent. The use of concept and agent is retained for the <controlaccess> case, where names are typically drawn from an “authority file”, as it allows us to maintain the distinction between a conceptualisation of the agent (as reflected by the authority record/entry) and the agent itself (a distinction which is also made in the model underpinning datasets such as VIAF, which we will be making links to).

The revised model is summarised in the following diagram (an amended version of Figure 3 from the earlier post):

Amended data model for EAD

Amended data model for EAD

i.e. an Archival Resource and a Biographical History are now related directly to an Agent.

Below is a draft list of human-readable definitions for the classes in the model. Some are simply references to classes provided by existing vocabularies like Dublin Core, FOAF, event vocabularies:

Finding Aid
A document describing an archival resource.
Subclass of: bibo:Document, foaf:Document
EAD
A document conforming to the Encoded Archival Description standard.
Subclass of: bibo:Document, foaf:Document
Biographical History
A narrative or chronology that places the archival materials in context by providing information about their creator(s). A finding aid may contain several such narratives or chronologies pertaining to different archival materials and their creators.
Subclass of: bibo:DocumentPart, (bibo:Document), foaf:Document
Repository
An institution or agency responsible for providing access to archival materials.
Subclass of: foaf:Organization, (foaf:Agent), dcterms:Agent
Place
= wgs84_pos:SpatialThing
Postcode Unit
= ospc:PostcodeUnit
Archival Resource
Recorded information in any form or medium, created or received and maintained, by an organization or person(s) in the transaction of business or the conduct of affairs, and maintained for its long-term research value. An archival resource may be an individual item, such as a letter or photograph, or (more commonly) some aggregation of such items managed and described as a unit.
Level
An indicator of the part of an archival collection constituted by an archival resource, whether it is the whole collection or a sub-section of it.
Subclass of: skos:Concept
Language
= lvont:Language
Extent
The size of an archival resource.
Subclass of: dcterms:SizeOrDuration
Temporal Entity
= time:TemporalEntity
Creation
An event that resulted in the creation or accumulation of an archival resource.
Subclass of: event:Event, lode:Event
Concept
= skos:Concept
Concept Scheme
= skos:ConceptScheme
Agent
= foaf:Agent, dcterms:Agent
Person
= foaf:Person, (foaf:Agent), dcterms:Agent
Family
A group of people affiliated by consanguinity, affinity, or co-residence.
Subclass of: foaf:Group, (foaf:Agent), dcterms:Agent
Organisation
= foaf:Organization, (foaf:Agent), dcterms:Agent
Genre or Form
A category of archival material, defined either by style or technique of intellectual content, order of information or object function, or physical characteristics.
Subclass of: skos:Concept
Function
A sphere of activity or process.
Subclass of: skos:Concept
Birth
= bio:Birth, (bio:IndividualEvent), (bio:Event),
(event:Event), (lode:Event)
Death
= bio:Death, (bio:IndividualEvent), (bio:Event),
(event:Event), (lode:Event)
Object
= foaf:Document, bibo:Document
Book
= bibo:Book, (bibo:Document), (foaf:Document)

Identifying the “things”: URI Patterns for the Hub Linked Data

In my previous couple of posts, I outlined the model of the “world” on which we’re basing the RDF data we’re generating from the Archives Hub‘s EAD XML documents.

At the heart of the Linked Data approach is the principle that all the “things” we want to “say anything about” should be named using a URI, and that those URIs should use the http URI scheme, so that they can be easily “looked up” or “dereferenced” using Web technologies in order to obtain some information provided by the URI owner about the thing. So, having specified the types or classes of thing we want to refer to and describe, the next step is to decide on the structure of the http URIs that we’ll use to name the “instances” of those classes – the individual “things” – archival resources, repositories, concepts, persons, places, and so on. In this post, I’ll try to describe the patterns we’re using, and outline how we construct individual URIs using those patterns from the EAD input data. As I hope will become clearer, the nature of the input data conditions the form of the patterns we’ve chosen. This has turned into a rather long post (again!) but I hope the detail is useful – I think it’s important for us to try to document our processes and some of the issues we’ve grappled with as well as to present the conclusions.

In some (most) cases, these will be newly created URIs, under a domain that we (well, MIMAS and the Archives Hub service) own. For these URIs, the project is responsible for choosing the URIs and putting in place the mechanisms to ensure that their dereferencing results in the provision of some “useful information”. In other cases, we will simply be citing existing URIs, defined by other agencies who (hopefully!) provide for their dereferencing.

The UK Cabinet Office has recently published some general guidelines on URI patterns for government Linked Data, Designing URI Sets for the UK Public Sector, and within the JISC programme strand under which LOCAH is funded, projects are encouraged to follow the recommendations of those guidelines. Following these guidelines, the general URI pattern recommended to identify “things” is:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a class (resource type), like “person”
  • reference is a name for an individual instance of that class or type

Our RDF data is being generated, at least in the first instance, by processing EAD XML documents, so we want to construct our URIs for our “things” from content within those XML documents. And we want to do so in a way that, as far as possible, ensures that each of those URIs is an unambiguous name/referrer, i.e. it identifies a single “thing”, and we don’t end up with a single URI being used for what are in fact two different things. On the other hand, we can live with the case where we end up with multiple URIs, all of which identify a single thing, because information can be added at a later stage to indicate that they are synonyms.

The other point to note is that the initial transformation step is being performed on a “document-by-document basis”, i.e. taking a single EAD document as input and outputting RDF/XML. So for any given resource, the information we generate – including the URI of the resource – is based only on the content of that document (and any generally applicable information we can embed in the transform itself). There may be other data “about” that “thing” in another EAD document but we don’t have access to it at the time of transformation.

Also, it’s desirable that we construct our URIs in such a way that if we need to re-run the transform, we generate the same URIs from the same input data (unless we explicitly decide to change the patterns for some reason).

Finally, although the patterns below often make use of human-readable strings from the EAD document content, I haven’t treated human-readability as a major consideration. Having said that, I’ve tended to make use of (slightly normalised forms of) human-readable strings where possible, rather than, say, creating opaque “hashes”.

As with other aspects of the work, at this stage, this is a first cut at tackling the issue, and we may revise our approaches based on the experience of applying them over the dataset. Having gone through and constructed patterns for the various resource types, looking back over them now, I think I can see a small number of distinct methods that we’ve used:

  1. Identifiers: For some of these “things”, the EAD documents contain some sort of formally assigned identification code or number, which unambiguously – at least within the scope of the Hub collection – identifies that instance within the set of resources of that type (i.e. it serves as a “reference” in the terms of the Designing URI Sets… document). This is the case, for example, with the languages of the materials, using the did/langmaterial/language/@langcode attribute value. A variant of this is the case where such an identifier can be constructed from a combination of multiple pieces of content. Repositories, for example, can be identified by the pair of country code (ead/eadheader/eadid/@countrycode) and maintenance agency code (ead/eadheader/eadid/@mainagencycode). For these cases a combination of the name of the resource type and that identification code provides the basis for the “reference” part of the URI.
  2. “Authority-Controlled” Names: For many of the “things”, however, the EAD documents do not contain such a code; rather, they refer to things only by name. In some cases, the form of the name is drawn from an “authority file” – indicated in the EAD document – and the name includes sufficient information (e.g. birth/death dates, titles etc for a person) to make the resulting string an unambiguous referrer within the set of names from that source. For these cases, a combination of a name for the authority file and the name provides the basis for the “reference”. However, this does depend on the creator of the EAD document having accurately transcribed the “authoritative” form of the name, at least sufficiently to maintain unambiguity of reference.
  3. “Rule-Based” Names: In other cases, the “thing” is named, not using a name from a controlled list, but rather a name constructed according to some codified set of rules, where the rules used are indicated in the EAD document. The intent behind such rules is to try to ensure consistency of form and unambiguity of reference. The National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names (one of the rule sets recommended to Hub data creators) states “A personal name is constructed by combining mandatory and optional components of the name so that the person concerned can be identified with certainty and distinguished from others bearing similar names. An individual should have only one authorised form of name and each name should apply to only one individual.”Typically, as for the “authority file” case, this is achieved through the inclusion of dates, titles etc for persons. For these cases, a combination of a name for the rules and the name itself should provide the basis for the “reference”. However, in practice, the picture with the Hub data is somewhat more complex. First, in some cases where it is claimed that rules are followed, the content itself indicates that this is not the case. For example, the NCA Rules mandate that a personal name should include “the year in which a person was born or died, the span of years of his/her lifetime or the approximate period covered by his/her activities”, even if those dates are estimated. But there are cases in the data marked up as following the NCA Rules which do not meet this requirement – e.g. personal names providing only surname and forename with no dates – , which I suspect may result in ambiguous references. Second, even where the rule is followed and the mandatory components are present, the distributed nature of Hub data creation means that I suspect there is still some possibility that a single personal name may be used in two different sources to refer to what in fact are two different people (Consider e.g. the case of two data providers using the name “Smith, John, fl 1920-1950”).
  4. “Locally-Scoped” Names: In other cases, the form of the name is neither authority-controlled nor rule-based, but nevertheless there is some expectation that the form of the name used is sufficient to make it an unambiguous referrer within some context. This is the case, for example, with the content of the did/origination element. The difficulty, however, is in establishing reliably what that context is. What is that “local scope”? We’ve tentatively taken the approach that such names have been constructed in such a way as at least to be unambiguous within the collection of submissions to the Hub by a single repository. So by combining the repository identifier and the name, hopefully, we can arrive at a “reference” which avoids ambiguity. Again, it may turn out that this assumption is unreliable, and results in ambiguous references, so we may need to revisit this approach.
  5. “Identifier Inheritance”: (I’m sure there must be a formal term for this but I’m not sure what it is!) In these cases the EAD document does not provide an unambiguous name for the “thing” itself; however the “thing” has a simple relationship with some other “thing” for which identification fits into one of the other categories. Where the relationship is one-to-one, a URI can be constructed by adopting the pattern for that other “thing” and substituting the name of the resource type. An example of this is the case of the “biographical history” associated with a “unit of description”. The unit of description has an identifier (based on a pattern described below) and since – in data constructed using the Hub template – each unit has at most one biographical history, replacing the “unit” resource type name with a “bioghist” resource type name gives us a suitable URI path, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URI for the biographical history would contain “/bioghist/gb15abc”.A variant of this is the case where the relationship is many-to-one, rather than one-to-one. Here the approach needs to be extended to include e.g. a sequence number to distinguish the multiple “things”. This is the approach taken for the Unit of Description, where a “child” (“part”) unit of description uses the URI of the “parent” (“whole”) unit suffixed with a sequence number, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URIs for the “child” units would contain “/unit/gb15abc-1”, “/unit/gb15abc-2” and so on. In theory, this should not be necessary as the unitid for a unit should be unique within an EAD document, but in practice we’ve found that this is not the case in the actual data. (In this case, the identifier would be “reproducable” only if any new units are inserted at the end of a sequence rather than in the middle).
  6. So, with the caveat above that this is all somewhat tentative at this stage, I summarise below the approaches taken to generating URIs for instances of each of the classes in the Hub model. Note that sometimes, an instance of the same class is generated in different “contexts” within the EAD document, and in these cases different rules for URI construction may be applied in those different contexts, depending on the information available within the EAD document.

    We haven’t yet finalised the domain name we’ll be using, so for the purposes of the following, {root} represents the domain and the first part of the path. Italicised text is used for the URI patterns (or parts of them); bold text is used for XPath(-ish!) representations of the source of data within the EAD XML document.

    Finding Aid

    Pattern(s)

    {root}/id/findingaid/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example:

    {root}/id/findingaid/gb15sirernesthenryshackleton

    EAD document

    Pattern(s)

    {root}/id/EAD/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example(s)

    {root}/id/ead/gb15sirernesthenryshackleton

    Repository (Agent)

    Pattern(s)

    {root}/id/repository/{repositoryid}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/repository/gb15

    Repository (Place)

    Pattern(s)

    {root}/id/place/{repositoryid}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/place/gb15

    Unit of Description

    Pattern(s)

    {root}/id/unit/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Note: In principle, it should be possible to use c/unitid content rather than position in tree, but in practice, there are cases where unitid content is not unique within the EAD document.

    Example(s)

    {root}/id/unit/gb15sirernesthenryshackleton

    {root}/id/unit/gb15sirernesthenryshackleton-1

    Level

    Pattern(s)

    {root}/id/level/{level-name}

    level-name
    archdesc/@level or archdesc/@otherlevel or c{n}/@level or c{n}/@otherlevel

    Example(s)

    {root}/id/level/fonds

    Language

    Pattern(s)

    http://lexvo.org/id/iso639-3/{langcode}

    Note: use existing lexvo.org URIs for languages.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://lexvo.org/id/iso639-3/eng

    Creation (Event)

    Pattern(s)

    {root}/id/creation/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creation/gb15sirernesthenryshackleton

    Creation (Time)

    Pattern(s)

    {root}/id/creationtime/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creationtime/gb15sirernesthenryshackleton

    Extent

    Pattern(s)

    {root}/id/extent/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/extent/gb15sirernesthenryshackleton

    Biographical History

    Pattern(s)

    {root}/id/bioghist/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/bioghist/gb15sirernesthenryshackleton

    Concept (Origination)

    Pattern(s)

    {root}/id/concept/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/concept/agent/gb15/sirernesthenryshackleton

    Agent (Origination)

    Pattern(s)

    {root}/id/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/agent/gb15/sirernesthenryshackleton

    Concept (ControlAccess – Subject)

    Pattern(s)

    {root}/id/concept/{source}/{subject-name}

    {root}/id/concept/{repositoryid}/{subject-name}

    source
    controlaccess/subject/@source
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    subject-name
    normalised form of controlaccess/subject

    Example(s)

    {root}/id/concept/lcsh/antiquities

    Concept (ControlAccess – Persname)

    Pattern(s)

    {root}/id/concept/person/{source}/{person-name}

    {root}/id/concept/person/{rules}/{person-name}

    {root}/id/concept/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/concept/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/concept/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/concept/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Person (ControlAccess – Persname)

    Pattern(s)

    {root}/id/person/{source}/{person-name}

    {root}/id/person/{rules}/{person-name}

    {root}/id/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Concept (ControlAccess – Famname)

    Pattern(s)

    {root}/id/concept/family/{source}/{family-name}

    {root}/id/concept/family/{rules}/{family-name}

    {root}/id/concept/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/concept/family/nra/dundasviscountsmelvilledunira

    {root}/id/concept/family/ncarules/boucicault

    Family (ControlAccess – Famname)

    Pattern(s)

    {root}/id/family/{source}/{family-name}

    {root}/id/family/{rules}/{family-name}

    {root}/id/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/family/nra/dundasviscountsmelvilledunira

    {root}/id/family/ncarules/boucicault

    Concept (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/concept/organisation/{source}/{org-name}

    {root}/id/concept/organisation/{rules}/{org-name}

    {root}/id/concept/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/concept/organisation/nra/britishbroadcastingcorporation

    {root}/id/concept/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/concept/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Organisation (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/organisation/{source}/{org-name}

    {root}/id/organisation/{rules}/{org-name}

    {root}/id/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/organisation/nra/britishbroadcastingcorporation

    {root}/id/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Concept (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/concept/place/{source}/{place-name}

    {root}/id/concept/place/{rules}/{place-name}

    {root}/id/concept/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/concept/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/concept/place/ncarules/canada

    {root}/id/concept/place/gb982/meirionethshire%28wales%29

    Place (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/place/{source}/{place-name}

    {root}/id/place/{rules}/{place-name}

    {root}/id/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/place/ncarules/canada

    {root}/id/place/gb982/meirionethshire%28wales%29

    Concept (ControlAccess – GenreForm)

    Pattern(s)

    {root}/id/concept/{source}/{genreform-name}

    {root}/id/concept/{rules}/{genreform-name}

    {root}/id/concept/{repositoryid}/{genreform-name}

    source
    controlaccess/genreform/@source
    rules
    controlaccess/genreform/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    genreform-name
    normalised form of controlaccess/genreform

    Example(s)

    {root}/id/concept/aat/buildingplans

    Concept (ControlAccess – Function)

    Pattern(s)

    {root}/id/concept/{source}/{function-name}

    {root}/id/concept/{rules}/{function-name}

    {root}/id/concept/{repositoryid}/{function-name}

    source
    controlaccess/function/@source
    rules
    controlaccess/function/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    function-name
    normalised form of controlaccess/function

    Example(s)

    {root}/id/concept/agift/miningregulations

    Book

    Pattern(s)

    {root}/id/document/{title}

    source
    controlaccess/title/@source
    rules
    controlaccess/title/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    title
    normalised form of controlaccess/title

    Example(s)

    {root}/id/document/aacr2/thecastlediaries1974-761980

    Birth (Event)

    Pattern(s)

    {root}/id/birth/{source}/{person-name}

    {root}/id/birth/{rules}/{person-name}

    {root}/id/birth/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/birth/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/birth/ncarules/allenjim1926-1999playwright

    {root}/id/birth/gb1832/berlinisaiah1909-1997sirknighthistorian

    Object

    Pattern(s)

    {object-uri}

    object-uri
    dao/@href or daogrp/daoloc/@href

    Example(s)

    http://library.kent.ac.uk/library/special/html/specoll/jack.gif

    Object Group

    Pattern(s)

    {root}/id/group/{unitid}-{groupno}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree
    groupno
    position within daogrp sequence for archdesc or c{n}

    Example(s)

    {root}/id/group/gb0254ms274-1

    Time Interval (Year, Month, Day)

    i.e. specific intervals of time.

    Pattern(s)

    http://reference.data.gov.uk/id/year/{yyyy}

    http://reference.data.gov.uk/id/month/{yyyy}-{mm}

    http://reference.data.gov.uk/id/day/{yyyy}-{mm}-{dd}

    Note: use existing reference.data.gov.uk URIs for intervals.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://reference.data.gov.uk/id/year/1921

    http://reference.data.gov.uk/id/month/1921-06

    http://reference.data.gov.uk/id/day/1921-06-03

The “things” in EAD: a first cut at a model

As mentioned by Jane in a couple of previous posts, she, Bethan and I met up in Manchester in August to share our thoughts about how to model the Archives Hub EAD data in a form that can be represented in RDF.

RDF in a nutshell

For the purposes of this discussion, the main point to bear in mind is that the “grammatical principle” underpinning RDF is one of making simple three-part statements, each of which makes an assertion of a relationship (of some particular type) between two things. So for example, in RDF I can “say” things like:

Document 123 has-title “Arthur and George”

or

Document 123 is-authored-by Person P
Person P has-name “Julian Barnes”

When considering how to represent EAD data in RDF, then, the first step is to try to take a step back from the “nitty-gritty” of the EAD XML markup, and think about the three part statements we might construct to represent the “information content” of that document. We need to think in terms, not of XML documents and elements and attributes and nesting/containment, but rather of what an EAD document is “saying” about “things in the world” (perhaps more accurately, in the “world” as conceptualised by the creator of the archival finding aid, shaped by archival description practices in general) and what sort of questions we want to answer about those “things”. What are the “things” – and here I use the term in a general sense to include concepts and abstractions as well as material objects – that an EAD document provides information about? What are the relationships between these things? What else does an EAD document say about those things?

Note: The discussion here does not cover the “document”/”description” side of the “Linked Data” picture i.e. for each “thing”, we’ll be providing a “description” of that “thing” in the form of a “document”. Metadata describing that “document” will be important in providing information about provenance and currency, for example, but that is not discussed here.

EAD as used by the Archives Hub

The EAD XML format was designed to cope with the “encoding” of a wide range of archival finding aids, including those constructed according to the (slightly different) cataloguing practices and traditions of different communities.

Further, many features of the EAD format are optional: one can construct a valid EAD document using only a fairly minimal level of markup, or one can use more detailed markup to represent more information.

This flexibility can be something of a “double-edged sword”: on the one hand, it enables data creators to accommodate a wide range of data, and it provides choice in the level of detail of markup (and human resources in creating that markup!) to be applied; on the other hand, it can make working with EAD data quite complex for a consumer, particularly when processing data from a range of sources which perhaps use a range of different conventions and features of the language.

In part to address this sort of issue (as well as to make things simpler for data providers by insulating them from the detail of EAD markup), the Archives Hub provides a forms-based EAD editor, based primarily on the information categories enumerated by the ISAD(G) archival description standard, which generates EAD documents following a consistent set of markup conventions. (I sometimes think of this as a “profile” of EAD, a narrower set of constraints than that imposed by the EAD DTD/schema itself, but I’m not sure that sort of terminology is in widespread use in this context.)

So, we made the “pragmatic” decision to work, in the first instance at least, on the basis of this particular set of EAD markup conventions, rather than trying to address the full EAD format, which means we can limit the number of variants we need to deal with. Having said that, even for the case of data created using the Hub editor, an element of variation is present, because although the data entry form generates a common high-level structure, data creators can apply different markup within those high-level structural components. In this first cut at a model, we have focused on analysing those common structural elements, with the intention of extending and refining our approach at a later stage.

In the course of this (or in thinking about it afterwards) we’ve come up with a few questions, which I’ll try to highlight in the course of the discussion below. Any feed back on these points (or indeed on any other aspect of the post!) would be very welcome.

The “world” as seen by EAD

Jane and I had both done some doodling before our meeting, and we started out by walking through our ideas, highlighting both those aspects which seemed pretty clear and uncontroversial, and aspects where we were uncertain or several alternatives seemed possible (and reasonable). Although we were using slightly different terminology, I think we had come up with quite similar notions, and after a bit of discussion, we arrived at a first cut at a “core” model which I’m representing graphically in Figure 1 below. This isn’t intended as a formal UML or E-R diagram, but each box represents a type of “thing” (a class) and each arrow represents a type of relationship between individual things (“instances” of those classes):

Diagram showing draft data model for EAD data (1)

Figure 1

So the “core” types of things identified in this first stage were:

  • Unit of Description: these are the “units” of archival material, a document or set of documents, the actual stuff held in the repository and described by the finding aid. It’s a “generic” class to reflect the archival description principle of “multi-level description”. An archival finding aid typically has a “hierarchical” structure, in which one “unit of description” is (described as logically forming) “part of” another “unit of description”. A finding aid may provide a only a “collection-level” description of a collection which contains many thousands of individual records, without describing those records individually at all; or it may include descriptions of various component groupings and sub-groupings of records; or it may indeed go as far as describing individual records within such groupings. For each Unit of Description, information relevant to that particular unit is provided. EAD and ISAD(G)) allow for the provision of more or less the same set of information whatever the “level” of unit described, though in practice some elements are more commonly used for “aggregate/group” units.
  • Archival Finding Aid: these are the documents created by archival cataloguers to describe the archival materials. Often a single finding aid describes (or has as its topic/subject) several units of description, but it may be the case that a finding aid describes only a single unit – where only a description of the collection as a whole is provided.
  • Repository (Agent): the organisations who curate and provide access to the archival material, and who create and maintain the archival finding aids. (EAD allows for the possibility that two different agencies perform these two roles; the Hub EAD Editor works on the basis that a single agent is responsible for both).
  • Origination (Agent): the entity (individual, organisation or family) “responsible for the creation, accumulation, or assembly of the described materials before their incorporation into an archival repository” (from the description of the EAD <origination> element). Jane analysed the rather complex nature of the ISAD(G) Creator/EAD origination relationship, which encompases notions of both “item creator” and “collector”, in <a href="http://archiveshub.ac.uk/blog/?p=2401"an earlier post on the Archives Hub blog.
  • “Things” which are referenced in the form of names used as “access points” or “index terms” using the EAD <controlaccess> element. The Hub EAD Editor supports the provision of the following as <controlaccess> terms, and recommends the use of a number of thesauri or “authority files” from which they should be drawn: Names of “Subjects” (topics); Personal Names; Family Names; Corporate Names; Place Names; Book Titles; Names of Genres or Forms; Names of Functions. So the corresponding “things named” are: Concepts, Persons, Families, Organisations, Places, Books, Genres or Forms, and Functions. As Jane notes in her recent post the relationship between the Unit of Description and the entity named in the <controlaccess> element is not necessarily a relationship of “about”-ness, but a rather less specific one, which for the moment we’ve labelled as simply “associated with” (though a better label might be preferable!).

(I’ve shown the Origination and Repository as distinct classes in the diagram, rather than as a single Agent class, because, as I hope will become clearer below, it ends up that they participate in a slightly different set of relationships).

We went on to extend and refine this core model to accommodate more of the information from the EAD document.

First, we refined the way the “access points” are represented. I’d discussed this aspect of the model with Leigh Dodds of Talis and he suggested that we consider modelling the physical entities here as concepts, in turn related to physical entities, i.e. that we represent the “conceptualisation” of a person, family, organisation or place captured in a thesaurus entry or authority file record, as distinct from the actual physical entity. So, to take an example which I think Bethan used during our conversation, we can distinguish between a conceptualisation of William Blake as a poet and one of William Blake as an artist, each in turn related to William Blake the person.

Although I don’t plan to discuss the specifics of RDF vocabulary in this post, it’s worth noting that the FOAF RDF vocabulary has recently been extended with the addition of a property, foaf:focus, to represent the relationship between the conceptualisation and the thing conceptualised (person, place etc), to support exactly this convention.

For some of the <controlaccess> named entities – like the topics, genres/forms and functions – there is no “other thing conceptualised” and it is sufficient to model them simply as concepts (or as instances of a subclass); and for the book case, we’ll just treat it as a “book” (and for the moment, at least, sidestep any FRBR-ish issues).

In both cases, the notion that the concept is a member of a specific thesarus/authority file can be captured by introducing the notion (from SKOS) of a “Concept Scheme”.

Question 1: One question raised by this approach is whether, for the cases where there is a distinct entity involved, in transforming an EAD document into RDF, we should:

  1. Coin URIs for, and generate “descriptions” of, both the concept and the person/family/organisation/place conceptualised (with a triple with a foaf:focus predicate relating the two? Or:
  2. Coin a URI for, and generate a “description” of only the concept, and leave the relationship with the person/family/organisation/place conceptualised “out of scope” at the transform stage (though that relationship might be obtained at a later stage by linking the concept to external data)?

My inclination is to do the former, on the grounds that this enables us to capture more of the information present in the EAD document i.e. to capture the information that where a <persname> element is used, this is the name of a conceptualisation of a person, where a <corpname> element is used, this is the name of a conceptualisation of an organisation, and so on.

Question 2: Is it necessary/useful to also model the name itself as a distinct resource? I think we can manage without that, but we may revisit that point in the future.

Second, having made this choice for the <controlaccess> entities, we decided to apply it also to the case of the “origination” agent discussed above, with the “origination” relationship becoming one between a Unit of Description and a conceptualisation of an agent, rather than between a Unit of Description and the agent itself. I admit I’m still not completely sure this is necessary/useful/”the right thing to do”. The use of the <origination> element in the Hub EAD profile is described in the guidelines here. It allows for names to be presented in “the commonly used form of name”, rather than the form specified by an authority record (and indeed a survey of the data reveals a good deal of variation), so it’s a bit more difficult to argue that this corresponds directly to the name of an entry (concept) listed in an “authority file”.

Question 3: Is it necessary/useful to introduce a “conceptualisation” of the agent who “originated” the Unit of Description? For now, we’re working on the basis that it is, but we may revisit that choice.

This extended model is represented graphically in Figure 2:

Diagram showing draft data model for EAD data (2)

Figure 2

A final stage of refinement gave us a few further extensions.

First the EAD Document is introduced as a particular “encoding of” the Finding Aid.

Second, I’ve suggested that we model the Biographical or Administrative History associated with each Unit of Description as a resource in its own right, distinct from the Finding Aid as a whole. I’m not sure this is strictly necessary, and again it’s something that we may revist in the future. But it enables us to provide information about the Biographical History as a distinct resource. One of the reasons this may be useful is that we’ve discussed (albeit somewhat vaguely at this point!) analysing/mining the text of the Biographical History as a source of further information, and having a URI for the Biographical History enables us to be explicit about the source of that data. We can also make the Biographical History the subject of triples to indicate that it is related not just to the Unit of Description but also to the entity who “originated” that unit (or, given the discussion above, to the conceptualisation of that entity). Also, we could associate it with different literal expressions (e.g. the original EAD fragment as XML Literal, but also an XHTML or plain text derivative). It also, of course, makes the Biographical History into a resource that others can refer to in their own assertions in their own data.</p

Third, we introduced the “level” of the Unit of Description as a distinct resource, a concept. This means that each “level” within the (relatively small) set used within the Hub data can each be assigned a distinct URI, and described in their own right, and – again – referenced by others.

Fourth, similarly, the “language” of the Unit of Description is treated as a distinct resource. (The plan here is that we’ll try to simply reference resources within an existing Linked Data dataset, such as lexvo.orga>.)

Fifth, the EAD <dao> and <daogrp< elements are mapped into a relationship between the Unit of Description and an external digital object (or group of objects). I’ve labelled the relationship here as “is represented by” as that is the description provided by the EAD documentation, but I think Jane and Bethan felt that in practice in the Hub data, the relationship might sometimes be rather less specific than that.

For the moment, the other EAD elements corresponding to ISAD(G) elements (i.e. to textboxes in the Hub data entry form) will be treated as properties with XML Literal values (though we could follow the <bioghist> approach and generate individual URI-identified resources if that proves to be useful).

Sixth – and here we stepped slightly beyond the scope of the EAD document itself (so I’ve greyed it out in the diagram below) – we’ve added a notion of the location of the Repository and a relationship between the Repository-as-Agent and that Place. Although details of repository location aren’t included in the Hub EAD documents, Jane and Bethan said they do have that data available, and it should be fairly easy to integrate it.

So we’ve ended up with the model illustrated in Figure 3.

Diagram showing draft data model for EAD data (3)

Figure 3

Question 4: Are we missing any obvious “things” that we need to treat as resources?

Note: In this post, I haven’t gone as far here as to enumerate all the properties that will be used to describe instances of each of those classes, but I’ll provide that in a future post.

Multi-level description, context, “completeness” and “inheritance”

The one remaining question – and perhaps one of the thorniest to address fully – is that arising from one of the fundamental characteristics of the nature of archival description. As noted above, archival description is typically based on a “hierarchical”, “multi-level” approach, in which, within a single finding aid, information is provided about an aggregation of records, and then about component parts of that aggregation, and so on, perhaps down to the level of providing descriptions of individual records, but often stopping short of that.

The ISAD(G) standard presents principles of moving from the general to the specific, and providing information relevant to the particular unit of description (ISAD(G) 2.2):

Provide only such information as is appropriate to the level being described. For example, do not provide detailed file content information if the unit of description is a fonds; do not provide an administrative history for an entire department if the creator of a unit of description is a division or a branch.

And of “non repetition” (ISAD(G) 2.4):

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

In some cases, it may indeed be the case that if some descriptive attribute is not explicitly provided for the unit of description, then the information provided for its “parent” unit in the hierarchy is applicable; however, this is often not the case. The elements of the ISAD(G) Identity Statement Area (or the EAD <did> child elements), for example, are specific to the unit of description and do not apply to its “child” units; and for many other descriptive elements, a simple rule of “direct inheritance” may not be appropriate. For the <controlaccess> elements, for example, a “blunt” inference rule that the named entities “associated with” a unit of description are also “associated with” every “child” unit (and so on “down the tree”) may result in associations that are simply not useful to the consumer of the data.

In a post on the Archives Hub blog, Jane emphasised the value of the “Linked Data” approach in making things mentioned in our data into “first-class citizens”. One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context”, and that the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description (perhaps even that they are in some sense “incomplete” without that “contextual” data). In contrast, the “Linked Data” approach typically involves exposing “bounded descriptions” of individual resources. Now, certainly, yes, those “bounded descriptions” include assertions of relationships with other resources (including the sort of part-whole/member-of/component-of relationships present here), and those links can be followed by consumers to obtain further information on the other resources – however, there is no requirement or expectation that consumers will do so. So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”. Rather than seeing that as an insurmountable hurdle, however, I think it provides an area that the project can usefully explore and evaluate.

(If I remember correctly) we made the decision that, for now at least, the only piece of information for which we would implement an explicit “inheritance” from a “higher-level” Unit of Description to a “lower” one (and generate additional RDF triples in the data) would be that of the repository which provides access to the material (i.e. the EAD <repository> element).

Conclusion

As I said above, the model I’ve outlined here is intended as very much a first cut, not the “last word”, and something we’ll most likely revisit and refine further in the future, particularly as we see in practice what it enables us (and others) to do with the data generated, and where we might require some further tweaks to enable us to do more. For now, we feel it provides a basis for our initial work on transforming EAD data into RDF.

The next steps are:

  1. to decide on URI patterns for the URIs we will be generating (i.e. URIs for instances of the classes in the diagram above)
  2. to select terms from existing RDF vocabularies and to define any additional RDF terms required to create “descriptions” of these things based on information from the EAD document
  3. to create a transformation that implements the model (in the first instance, an XSLT transform)

I’ve already done some work on all of these, and I’ll write about them in a separate post here – which hopefully will be rather shorter than this one and will take me rather less time to write!

Some thoughts on architecture and workflows

This is an attempt to sketch out some of my/our initial thoughts on the approaches the project is considering to exposing data as Linked Data. I should emphasise that these are very much initial thoughts, and things may change as we progress.

The project is dealing with two main data sources, and at the moment two different approaches are being considered to those sources.

The first data source is the collection of archival finding aids describing the holdings of the archives of educational and research institutions in the UK, aggregated by the JISC Archives Hub service. This data takes the form of XML documents in the Encoded Archival Description (EAD) format, created by archivists in the various institutions, and submitted to the Hub.

Currently, the aggregated data is indexed using the Cheshire 3 application, and exposed as HTML pages on the archiveshub.ac.uk site for search and browse. (SRU and Z39.50 targets and an OAI-PMH repository are also available.)

To expose (probably a subset of) the Hub EAD finding aids as Linked Data, the workflow is expected to look something like that represented in Figure 1 below:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data (1)
  1. Transform: EAD XML documents are transformed to an RDF format. We’ll write about our current thinking on this more in a subsequent post, as working out how best to represent the EAD data in RDF as the target for the transform is in itself a significant chunk of work (and an area I’m particularly interested in). This is likely to be something of an “iterative” process: we’ll start with a fairly basic transform that captures some subset of the content of the input documents, and perhaps refine things later to generate more data (and correct errors we’ll no doubt make in the first cut!)
  2. Enhance: RDF data from the previous step is “enhanced” and augmented. This step might include processes to (1) generally “clean up” the data (e.g. normalise some literals, identify internal co-references etc); (ii) add links to resources in other datasets; (iii) (maybe) pull in some useful data from other datasets, either data held by the Hub but not included in the EAD docs or data from other sources. Again this will probably be a process which we extend and refine over time.
  3. Upload: Load the RDF data from the previous step to an instance of the Talis Platform triple store, which Talis are kindly making available to the project.
  4. Expose: Expose a set of linked “bounded descriptions” from the triple store over HTTP, as documents in both human-readable and RDF formats, following the principles of the W3C TAG httpRange-14 resolution/Cool URIs for the Semantic Web. The use of the Platform also provides us with a SPARQL endpoint for the data – which we can make available to others to use – and which also means we can consider layering other Web interfaces over that endpoint. For example, I’d be interested in trying out the Linked Data API, which I talked about over on eFoundations a while ago.

It may be that that the second and third steps are reversed and we upload the data to the triple store and perform the “enhance” step on the data there, i.e. something closer to Figure 2:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data (2)

Or indeed that a “hybrid” of the two is appropriate, and some “enhance” processes take place before upload and others take place afterwards.

We’ll also need to integrate some provision for “version control” and “provenance”/”attribution” (e.g. to track which data comes directly from the EAD sources, and which is added from elsewhere) into this process.

So for the Hub data, the plan is that the data is “exported” from the existing EAD dataset, and that the Platform triplestore provides the “back-end” for the app that serves up the “Linked Data” document views and provides a SPARQL endpoint.

The second data source of interest is the collection of bibliographic metadata aggregated into the Copac catalogue from the member libraries of Research Libraries UK and from other specialist libraries. This data is also held as XML in the MODS XML format. (Bethan Ruddock has a couple of posts on the Copac Development blog which describe the processes by which data is transferred from the contributor libraries to the Copac catalogue).

As for the case of the Archives Hub data, the first stage will be to design an appropriate RDF representation and an algorithm for transforming the MODS data to RDF (or to select – and maybe adapt, if necessary – an existing one).

In contrast to the case of the Hub I outlined above, the plan is to serve the RDF data from the existing Copac database, rather than upload it to a triplestore. This will probably require the development of a small additional application (or maybe just the configuration of an HTTP server) to service the new URIs coined for resources, to support content negotiation and redirect to URIs of appropriate pages.

One of the questions raised by this approach is how to handle the process I described above as “enhance”, and in particular how to accommodate the addition of new data – at a minimum, links to existing resources described in other Linked Data datasets – assuming that we aren’t going to be able to update the source MODS XML documents. For some cases, it may be trivial to incorporate this in the MODS-to-RDF transform (e.g., to generate links to languages described by lexvo.org). Another approach might be to generate simple “seeAlso” links to an additional set of documents (which could be simple static documents or could be served from an RDF store). Hmm. As you can probably tell, I’ve thought about this rather less than I’ve thought about the Hub case! Anyway, the suggested approach is sketched in Figure 3:

Diagram showing process of transforming MODS to RDF and exposing as Linked Data

Another constraint of this approach would be that although we can serve the set of linked documents, it doesn’t provide a SPARQL endpoint.

One of the expectations for the project is that it “explores and reports on the opportunities and barriers in making content structured and exposed”, and an assessment of the pros and cons of the different approaches to hosting the data should contribute to that report.