In my previous couple of posts, I outlined the model of the “world” on which we’re basing the RDF data we’re generating from the Archives Hub’s EAD XML documents.
At the heart of the Linked Data approach is the principle that all the “things” we want to “say anything about” should be named using a URI, and that those URIs should use the http URI scheme, so that they can be easily “looked up” or “dereferenced” using Web technologies in order to obtain some information provided by the URI owner about the thing. So, having specified the types or classes of thing we want to refer to and describe, the next step is to decide on the structure of the http URIs that we’ll use to name the “instances” of those classes – the individual “things” – archival resources, repositories, concepts, persons, places, and so on. In this post, I’ll try to describe the patterns we’re using, and outline how we construct individual URIs using those patterns from the EAD input data. As I hope will become clearer, the nature of the input data conditions the form of the patterns we’ve chosen. This has turned into a rather long post (again!) but I hope the detail is useful – I think it’s important for us to try to document our processes and some of the issues we’ve grappled with as well as to present the conclusions.
In some (most) cases, these will be newly created URIs, under a domain that we (well, MIMAS and the Archives Hub service) own. For these URIs, the project is responsible for choosing the URIs and putting in place the mechanisms to ensure that their dereferencing results in the provision of some “useful information”. In other cases, we will simply be citing existing URIs, defined by other agencies who (hopefully!) provide for their dereferencing.
The UK Cabinet Office has recently published some general guidelines on URI patterns for government Linked Data, Designing URI Sets for the UK Public Sector, and within the JISC programme strand under which LOCAH is funded, projects are encouraged to follow the recommendations of those guidelines. Following these guidelines, the general URI pattern recommended to identify “things” is:
http://{domain}/id/{concept}/{reference}
where:
- concept is a name for a class (resource type), like “person”
- reference is a name for an individual instance of that class or type
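As a rough illustration of how that template is filled in, here is a minimal Python sketch. The function name `thing_uri` and the domain `example.org` are made up for the example; they are not part of the guidelines or of our transform. The reference is allowed to span several path segments, since some of the patterns later in this post do that.

```python
def thing_uri(domain: str, concept: str, *reference_parts: str) -> str:
    """Fill in the http://{domain}/id/{concept}/{reference} template.

    Hypothetical helper: the reference may be made of several path
    segments, which are joined with "/".
    """
    if not reference_parts:
        raise ValueError("at least one reference segment is required")
    reference = "/".join(reference_parts)
    return f"http://{domain}/id/{concept}/{reference}"


# example.org is a placeholder domain, not the one the project will use.
print(thing_uri("example.org", "person", "nra", "somename"))
```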
Our RDF data is being generated, at least in the first instance, by processing EAD XML documents, so we want to construct our URIs for our “things” from content within those XML documents. And we want to do so in a way that, as far as possible, ensures that each of those URIs is an unambiguous name/referrer, i.e. it identifies a single “thing”, and we don’t end up with a single URI being used for what are in fact two different things. On the other hand, we can live with the case where we end up with multiple URIs, all of which identify a single thing, because information can be added at a later stage to indicate that they are synonyms.
The other point to note is that the initial transformation step is being performed on a “document-by-document basis”, i.e. taking a single EAD document as input and outputting RDF/XML. So for any given resource, the information we generate – including the URI of the resource – is based only on the content of that document (and any generally applicable information we can embed in the transform itself). There may be other data “about” that “thing” in another EAD document but we don’t have access to it at the time of transformation.
Also, it’s desirable that we construct our URIs in such a way that if we need to re-run the transform, we generate the same URIs from the same input data (unless we explicitly decide to change the patterns for some reason).
Finally, although the patterns below often make use of human-readable strings from the EAD document content, I haven’t treated human-readability as a major consideration. Having said that, I’ve tended to make use of (slightly normalised forms of) human-readable strings where possible, rather than, say, creating opaque “hashes”.
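To give a feel for what a “slightly normalised form” might look like, here is a minimal Python sketch. It is an assumption on my part about the steps involved (lower-casing, dropping whitespace, then percent-encoding whatever is left that isn't safe in a URI path segment), inferred from the example URIs later in this post rather than copied from the actual transform. The name `normalise` is hypothetical. Because it is a pure function of its input, re-running it on the same string always gives the same result.

```python
import re
from urllib.parse import quote


def normalise(text: str) -> str:
    """Turn a human-readable string into a URI path segment.

    Assumed steps (not taken from the real transform):
    1. lower-case the string,
    2. remove all whitespace,
    3. percent-encode any remaining character that is not an
       unreserved URI character.
    """
    lowered = text.lower()
    compact = re.sub(r"\s+", "", lowered)
    return quote(compact, safe="")


print(normalise("Daily Mail (London, England)"))
```

The same input always produces the same segment, which is the property we need if the transform is to be re-run without changing the URIs.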
As with other aspects of the work, at this stage, this is a first cut at tackling the issue, and we may revise our approaches based on the experience of applying them over the dataset. Having gone through and constructed patterns for the various resource types, looking back over them now, I think I can see a small number of distinct methods that we’ve used:
- Identifiers: For some of these “things”, the EAD documents contain some sort of formally assigned identification code or number, which unambiguously – at least within the scope of the Hub collection – identifies that instance within the set of resources of that type (i.e. it serves as a “reference” in the terms of the Designing URI Sets… document). This is the case, for example, with the languages of the materials, using the did/langmaterial/language/@langcode attribute value. A variant of this is the case where such an identifier can be constructed from a combination of multiple pieces of content. Repositories, for example, can be identified by the pair of country code (ead/eadheader/eadid/@countrycode) and maintenance agency code (ead/eadheader/eadid/@mainagencycode). For these cases a combination of the name of the resource type and that identification code provides the basis for the “reference” part of the URI.
- “Authority-Controlled” Names: For many of the “things”, however, the EAD documents do not contain such a code; rather, they refer to things only by name. In some cases, the form of the name is drawn from an “authority file” – indicated in the EAD document – and the name includes sufficient information (e.g. birth/death dates, titles etc for a person) to make the resulting string an unambiguous referrer within the set of names from that source. For these cases, a combination of a name for the authority file and the name provides the basis for the “reference”. However, this does depend on the creator of the EAD document having accurately transcribed the “authoritative” form of the name, at least sufficiently to maintain unambiguity of reference.
- “Rule-Based” Names: In other cases, the “thing” is named, not using a name from a controlled list, but rather a name constructed according to some codified set of rules, where the rules used are indicated in the EAD document. The intent behind such rules is to try to ensure consistency of form and unambiguity of reference. The National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names (one of the rule sets recommended to Hub data creators) states “A personal name is constructed by combining mandatory and optional components of the name so that the person concerned can be identified with certainty and distinguished from others bearing similar names. An individual should have only one authorised form of name and each name should apply to only one individual.” Typically, as for the “authority file” case, this is achieved through the inclusion of dates, titles etc for persons. For these cases, a combination of a name for the rules and the name itself should provide the basis for the “reference”. However, in practice, the picture with the Hub data is somewhat more complex. First, in some cases where it is claimed that rules are followed, the content itself indicates that this is not the case. For example, the NCA Rules mandate that a personal name should include “the year in which a person was born or died, the span of years of his/her lifetime or the approximate period covered by his/her activities”, even if those dates are estimated. But there are cases in the data marked up as following the NCA Rules which do not meet this requirement (e.g. personal names providing only surname and forename with no dates), which I suspect may result in ambiguous references.
Second, even where the rule is followed and the mandatory components are present, the distributed nature of Hub data creation means that I suspect there is still some possibility that a single personal name may be used in two different sources to refer to what in fact are two different people (Consider e.g. the case of two data providers using the name “Smith, John, fl 1920-1950”).
- “Locally-Scoped” Names: In other cases, the form of the name is neither authority-controlled nor rule-based, but nevertheless there is some expectation that the form of the name used is sufficient to make it an unambiguous referrer within some context. This is the case, for example, with the content of the did/origination element. The difficulty, however, is in establishing reliably what that context is. What is that “local scope”? We’ve tentatively taken the approach that such names have been constructed in such a way as at least to be unambiguous within the collection of submissions to the Hub by a single repository. So by combining the repository identifier and the name, hopefully, we can arrive at a “reference” which avoids ambiguity. Again, it may turn out that this assumption is unreliable, and results in ambiguous references, so we may need to revisit this approach.
- “Identifier Inheritance”: (I’m sure there must be a formal term for this but I’m not sure what it is!) In these cases the EAD document does not provide an unambiguous name for the “thing” itself; however the “thing” has a simple relationship with some other “thing” for which identification fits into one of the other categories. Where the relationship is one-to-one, a URI can be constructed by adopting the pattern for that other “thing” and substituting the name of the resource type. An example of this is the case of the “biographical history” associated with a “unit of description”. The unit of description has an identifier (based on a pattern described below) and since – in data constructed using the Hub template – each unit has at most one biographical history, replacing the “unit” resource type name with a “bioghist” resource type name gives us a suitable URI path, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URI for the biographical history would contain “/bioghist/gb15abc”. A variant of this is the case where the relationship is many-to-one, rather than one-to-one. Here the approach needs to be extended to include e.g. a sequence number to distinguish the multiple “things”. This is the approach taken for the Unit of Description, where a “child” (“part”) unit of description uses the URI of the “parent” (“whole”) unit suffixed with a sequence number, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URIs for the “child” units would contain “/unit/gb15abc-1”, “/unit/gb15abc-2” and so on. In theory, this should not be necessary as the unitid for a unit should be unique within an EAD document, but in practice we’ve found that this is not the case in the actual data. (In this case, the identifier would be “reproducible” only if any new units are inserted at the end of a sequence rather than in the middle).
The placeholders used in the URI patterns are taken from the EAD document as follows (each is listed once here, although several recur across patterns):
- eadid: normalised form of ead/eadheader/eadid
- repositoryid: normalised form of the concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
- unitid: normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree
- level-name: archdesc/@level or archdesc/@otherlevel or c{n}/@level or c{n}/@otherlevel
- langcode: did/langmaterial/language/@langcode
- source: the @source attribute of the controlaccess element concerned, i.e. controlaccess/subject/@source, controlaccess/persname/@source, controlaccess/famname/@source, controlaccess/corpname/@source, controlaccess/geogname/@source, controlaccess/genreform/@source, controlaccess/function/@source or controlaccess/title/@source
- rules: the @rules attribute of the controlaccess element concerned, i.e. controlaccess/persname/@rules, controlaccess/famname/@rules, controlaccess/corpname/@rules, controlaccess/geogname/@rules, controlaccess/genreform/@rules, controlaccess/function/@rules or controlaccess/title/@rules
- subject-name: normalised form of controlaccess/subject
- person-name: normalised form of controlaccess/persname
- family-name: normalised form of controlaccess/famname
- org-name: normalised form of controlaccess/corpname
- place-name: normalised form of controlaccess/geogname
- genreform-name: normalised form of controlaccess/genreform
- function-name: normalised form of controlaccess/function
- title: normalised form of controlaccess/title
- object-uri: dao/@href or daogrp/daoloc/@href
- groupno: position within daogrp sequence for archdesc or c{n}
So, with the caveat above that this is all somewhat tentative at this stage, I summarise below the approaches taken to generating URIs for instances of each of the classes in the Hub model. Note that sometimes, an instance of the same class is generated in different “contexts” within the EAD document, and in these cases different rules for URI construction may be applied in those different contexts, depending on the information available within the EAD document.
We haven’t yet finalised the domain name we’ll be using, so for the purposes of the following, {root} represents the domain and the first part of the path. Italicised text is used for the URI patterns (or parts of them); bold text is used for XPath(-ish!) representations of the source of data within the EAD XML document.
Finding Aid
Pattern(s)
{root}/id/findingaid/{eadid}
Example(s)
{root}/id/findingaid/gb15sirernesthenryshackleton
EAD document
Pattern(s)
{root}/id/ead/{eadid}
Example(s)
{root}/id/ead/gb15sirernesthenryshackleton
Repository (Agent)
Pattern(s)
{root}/id/repository/{repositoryid}
Example(s)
{root}/id/repository/gb15
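For illustration, here is a minimal Python sketch of deriving the repository identifier from the two eadid attributes. The EAD fragment is hand-written for the example and the function name `repository_id` is hypothetical. The sketch assumes an EAD document without XML namespaces, and assumes that “normalised” means lower-cased with whitespace removed.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only, not real Hub data.
SAMPLE_EAD = """
<ead>
  <eadheader>
    <eadid countrycode="GB" mainagencycode="15">GB 15 Sir Ernest Henry Shackleton</eadid>
  </eadheader>
</ead>
"""


def repository_id(ead_root: ET.Element) -> str:
    """Concatenate country code and maintenance agency code, normalised."""
    eadid = ead_root.find("eadheader/eadid")
    if eadid is None:
        raise ValueError("no ead/eadheader/eadid element found")
    combined = eadid.get("countrycode", "") + eadid.get("mainagencycode", "")
    if not combined.strip():
        raise ValueError("eadid carries neither countrycode nor mainagencycode")
    # Assumed normalisation: lower-case and drop any whitespace.
    return re.sub(r"\s+", "", combined.lower())


root = ET.fromstring(SAMPLE_EAD)
print("{root}/id/repository/" + repository_id(root))
```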
Repository (Place)
Pattern(s)
{root}/id/place/{repositoryid}
Example(s)
{root}/id/place/gb15
Unit of Description
Pattern(s)
{root}/id/unit/{unitid}
Note: In principle, it should be possible to use c/unitid content rather than position in tree, but in practice, there are cases where unitid content is not unique within the EAD document.
Example(s)
{root}/id/unit/gb15sirernesthenryshackleton
{root}/id/unit/gb15sirernesthenryshackleton-1
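The position-based suffixing can be sketched roughly as follows. This is a minimal Python illustration, not the actual transform. The function name `unit_ids` is hypothetical, the sample fragment is hand-written, and the handling of deeper levels (appending a further “-n” for each level of nesting) is my assumption, since the examples above only show one level.

```python
import re
import xml.etree.ElementTree as ET

# Matches the EAD component elements: c, c01, c02, ... c12.
COMPONENT_TAG = re.compile(r"^c(\d\d)?$")


def unit_ids(parent: ET.Element, parent_id: str):
    """Yield (element, id) for each component below parent.

    Each child takes the parent's id suffixed with its 1-based position
    among its sibling components; nested components repeat the step.
    """
    children = [el for el in parent if COMPONENT_TAG.match(el.tag)]
    for position, child in enumerate(children, start=1):
        child_id = f"{parent_id}-{position}"
        yield child, child_id
        yield from unit_ids(child, child_id)


# Hand-written fragment: two top-level components, the first with two children.
dsc = ET.fromstring("<dsc><c01><did/><c02/><c02/></c01><c01/></dsc>")
for _, unit_id in unit_ids(dsc, "gb15abc"):
    print("{root}/id/unit/" + unit_id)
```

Because the suffix depends only on position, inserting a new component in the middle of a sequence would shift the identifiers of all the siblings that follow it.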
Level
Pattern(s)
{root}/id/level/{level-name}
Example(s)
{root}/id/level/fonds
Language
Pattern(s)
http://lexvo.org/id/iso639-3/{langcode}
Note: use existing lexvo.org URIs for languages.
Example(s)
http://lexvo.org/id/iso639-3/eng
Creation (Event)
Pattern(s)
{root}/id/creation/{unitid}
Example(s)
{root}/id/creation/gb15sirernesthenryshackleton
Creation (Time)
Pattern(s)
{root}/id/creationtime/{unitid}
Example(s)
{root}/id/creationtime/gb15sirernesthenryshackleton
Extent
Pattern(s)
{root}/id/extent/{unitid}
Example(s)
{root}/id/extent/gb15sirernesthenryshackleton
Biographical History
Pattern(s)
{root}/id/bioghist/{unitid}
Example(s)
{root}/id/bioghist/gb15sirernesthenryshackleton
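Creation, Creation (Time), Extent and Biographical History all reuse the unit's identifier and simply swap the resource type name. A minimal Python sketch of that substitution follows; the function name `inherit_uri` and the `example.org` root are hypothetical.

```python
def inherit_uri(unit_uri: str, resource_type: str) -> str:
    """Derive a URI for a one-to-one related thing from a unit URI.

    Replaces the "unit" resource type name in the path with the given
    resource type name, keeping the identifier unchanged.
    """
    prefix, marker, unit_id = unit_uri.rpartition("/id/unit/")
    if not marker:
        raise ValueError(f"not a unit URI: {unit_uri!r}")
    return f"{prefix}/id/{resource_type}/{unit_id}"


unit = "http://example.org/id/unit/gb15abc"  # placeholder root
for resource_type in ("creation", "creationtime", "extent", "bioghist"):
    print(inherit_uri(unit, resource_type))
```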
Concept (Origination)
Pattern(s)
{root}/id/concept/agent/{repositoryid}/{origination-name}
Example(s)
{root}/id/concept/agent/gb15/sirernesthenryshackleton
Agent (Origination)
Pattern(s)
{root}/id/agent/{repositoryid}/{origination-name}
Example(s)
{root}/id/agent/gb15/sirernesthenryshackleton
Concept (ControlAccess – Subject)
Pattern(s)
{root}/id/concept/{source}/{subject-name}
{root}/id/concept/{repositoryid}/{subject-name}
Example(s)
{root}/id/concept/lcsh/antiquities
Concept (ControlAccess – Persname)
Pattern(s)
{root}/id/concept/person/{source}/{person-name}
{root}/id/concept/person/{rules}/{person-name}
{root}/id/concept/person/{repositoryid}/{person-name}
Example(s)
{root}/id/concept/person/nra/shackletonernesthenry1874-1922sirknightexplorer
{root}/id/concept/person/ncarules/holdenwendyfl1990cartoonist
{root}/id/concept/person/gb1832/berlinisaiah1909-1997sirknighthistorian
Person (ControlAccess – Persname)
Pattern(s)
{root}/id/person/{source}/{person-name}
{root}/id/person/{rules}/{person-name}
{root}/id/person/{repositoryid}/{person-name}
Example(s)
{root}/id/person/nra/shackletonernesthenry1874-1922sirknightexplorer
{root}/id/person/ncarules/holdenwendyfl1990cartoonist
{root}/id/person/gb1832/berlinisaiah1909-1997sirknighthistorian
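The choice between the three patterns can be sketched as below. This is a minimal Python illustration under several assumptions: that @source is preferred over @rules when both are present, that the repository identifier is the fallback when neither is given, and that the name is normalised by lower-casing, dropping whitespace and percent-encoding. The function names, the `example.org` root and the sample persname markup (including the emph sub-elements) are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote


def normalise(text: str) -> str:
    """Assumed normalisation: lower-case, drop whitespace, percent-encode."""
    return quote(re.sub(r"\s+", "", text.lower()), safe="")


def name_scope(element: ET.Element, repository_id: str) -> str:
    """Pick the scope segment: authority file, then rules, then repository."""
    for attribute in ("source", "rules"):  # assumed order of precedence
        value = element.get(attribute)
        if value and value.strip():
            return normalise(value)
    return repository_id


def person_uri(root: str, element: ET.Element, repository_id: str) -> str:
    """Build {root}/id/person/{scope}/{person-name} for a persname element."""
    name = "".join(element.itertext())
    return f"{root}/id/person/{name_scope(element, repository_id)}/{normalise(name)}"


# Hand-written markup for illustration; real Hub documents may differ.
persname = ET.fromstring(
    '<persname rules="ncarules">'
    '<emph altrender="surname">Holden</emph> '
    '<emph altrender="forename">Wendy</emph> '
    '<emph altrender="dates">fl 1990</emph> '
    '<emph altrender="epithet">cartoonist</emph>'
    "</persname>"
)
print(person_uri("http://example.org", persname, "gb1832"))
```

The same scope selection would apply to the family, organisation, place and birth patterns, with only the resource type name changing.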
Concept (ControlAccess – Famname)
Pattern(s)
{root}/id/concept/family/{source}/{family-name}
{root}/id/concept/family/{rules}/{family-name}
{root}/id/concept/family/{repositoryid}/{family-name}
Example(s)
{root}/id/concept/family/nra/dundasviscountsmelvilledunira
{root}/id/concept/family/ncarules/boucicault
Family (ControlAccess – Famname)
Pattern(s)
{root}/id/family/{source}/{family-name}
{root}/id/family/{rules}/{family-name}
{root}/id/family/{repositoryid}/{family-name}
Example(s)
{root}/id/family/nra/dundasviscountsmelvilledunira
{root}/id/family/ncarules/boucicault
Concept (ControlAccess – Corpname)
Pattern(s)
{root}/id/concept/organisation/{source}/{org-name}
{root}/id/concept/organisation/{rules}/{org-name}
{root}/id/concept/organisation/{repositoryid}/{org-name}
Example(s)
{root}/id/concept/organisation/nra/britishbroadcastingcorporation
{root}/id/concept/organisation/aacr2/dailymail%28london%2Cengland%29
{root}/id/concept/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth
Organisation (ControlAccess – Corpname)
Pattern(s)
{root}/id/organisation/{source}/{org-name}
{root}/id/organisation/{rules}/{org-name}
{root}/id/organisation/{repositoryid}/{org-name}
Example(s)
{root}/id/organisation/nra/britishbroadcastingcorporation
{root}/id/organisation/aacr2/dailymail%28london%2Cengland%29
{root}/id/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth
Concept (ControlAccess – Geogname)
Pattern(s)
{root}/id/concept/place/{source}/{place-name}
{root}/id/concept/place/{rules}/{place-name}
{root}/id/concept/place/{repositoryid}/{place-name}
Example(s)
{root}/id/concept/place/lcsh/mcmurdosound%28antarctica%29
{root}/id/concept/place/ncarules/canada
{root}/id/concept/place/gb982/meirionethshire%28wales%29
Place (ControlAccess – Geogname)
Pattern(s)
{root}/id/place/{source}/{place-name}
{root}/id/place/{rules}/{place-name}
{root}/id/place/{repositoryid}/{place-name}
Example(s)
{root}/id/place/lcsh/mcmurdosound%28antarctica%29
{root}/id/place/ncarules/canada
{root}/id/place/gb982/meirionethshire%28wales%29
Concept (ControlAccess – GenreForm)
Pattern(s)
{root}/id/concept/{source}/{genreform-name}
{root}/id/concept/{rules}/{genreform-name}
{root}/id/concept/{repositoryid}/{genreform-name}
Example(s)
{root}/id/concept/aat/buildingplans
Concept (ControlAccess – Function)
Pattern(s)
{root}/id/concept/{source}/{function-name}
{root}/id/concept/{rules}/{function-name}
{root}/id/concept/{repositoryid}/{function-name}
Example(s)
{root}/id/concept/agift/miningregulations
Book
Pattern(s)
{root}/id/document/{source}/{title}
{root}/id/document/{rules}/{title}
{root}/id/document/{repositoryid}/{title}
Example(s)
{root}/id/document/aacr2/thecastlediaries1974-761980
Birth (Event)
Pattern(s)
{root}/id/birth/{source}/{person-name}
{root}/id/birth/{rules}/{person-name}
{root}/id/birth/{repositoryid}/{person-name}
Example(s)
{root}/id/birth/nra/shackletonernesthenry1874-1922sirknightexplorer
{root}/id/birth/ncarules/allenjim1926-1999playwright
{root}/id/birth/gb1832/berlinisaiah1909-1997sirknighthistorian
Object
Pattern(s)
{object-uri}
Example(s)
http://library.kent.ac.uk/library/special/html/specoll/jack.gif
Object Group
Pattern(s)
{root}/id/group/{unitid}-{groupno}
Example(s)
{root}/id/group/gb0254ms274-1
Time Interval (Year, Month, Day)
i.e. specific intervals of time.
Pattern(s)
http://reference.data.gov.uk/id/year/{yyyy}
http://reference.data.gov.uk/id/month/{yyyy}-{mm}
http://reference.data.gov.uk/id/day/{yyyy}-{mm}-{dd}
Note: use existing reference.data.gov.uk URIs for intervals.
Example(s)
http://reference.data.gov.uk/id/year/1921
http://reference.data.gov.uk/id/month/1921-06
http://reference.data.gov.uk/id/day/1921-06-03
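Filling in these three patterns from a date is simple string formatting. A minimal Python sketch, with hypothetical function names, using the standard library's `datetime.date` for the zero-padding:

```python
from datetime import date

BASE = "http://reference.data.gov.uk/id"


def year_uri(d: date) -> str:
    """URI for the calendar year containing d."""
    return f"{BASE}/year/{d.year:04d}"


def month_uri(d: date) -> str:
    """URI for the calendar month containing d."""
    return f"{BASE}/month/{d.year:04d}-{d.month:02d}"


def day_uri(d: date) -> str:
    """URI for the calendar day d."""
    return f"{BASE}/day/{d.year:04d}-{d.month:02d}-{d.day:02d}"


d = date(1921, 6, 3)
print(year_uri(d))
print(month_uri(d))
print(day_uri(d))
```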