Co-referencing

I spent the last couple of days in Manchester at the “end of programme” meeting for the JISCexpo programme under which LOCAH is funded. It was a pretty busy couple of days with representatives of all the projects talking about their projects and their experiences and some of the issues arising.

Yesterday I found myself as “scribe” for a discussion on the “co-referencing” question, i.e. how to deal with the fact that different data providers assign and use different URIs for “the same thing”. And these are my rather hasty notes of that discussion.

  • the creation/use of co-references is inevitable; people will always end up creating URIs for things for which URIs already exist;
  • one approach to this problem has been the use of the owl:sameAs property. However, using this property makes a very “strong” assertion of equivalence with consequences in terms of inferencing
  • the actual use of properties sometimes introduces a dimension of “social/community semantics” that may be at odds with the “semantics” provided by the creator/owner of a term
  • the notion of “sameness” is often qualified by a degree of confidence, a “similarity score”, rather than being a statement of certainty
  • the notion of “sameness”/similarity is often context-sensitive: rather than saying “X and Y are names for the same thing in all contexts”, we probably want to say something closer to “for the purposes of this application, or in this context, it’s sufficient to work on the basis that X and Y are names for the same thing”
  • is there a contrast between approaches based on “top-down” “authority” and those based more on context-dependent “grouping”?
  • how do we “correct” assertions which turn out to be “wrong”?
  • we decide whether to make use of such assertions made by other parties, and those decisions are based on an understanding of their source: who made them, on what basis etc.
  • such assessment may include a consideration of how many sources made/support an assertion
  • it is easy for assertions of similarity to become “detached” from such information about provenance/attribution (if it is provided at all!)
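Several of the points above (a similarity score rather than certainty, context-sensitivity, provenance, and the number of supporting sources) can be pictured as fields on an assertion plus a consumer-side acceptance policy. The following is only a minimal sketch of that idea; the class name, field names and thresholds are all hypothetical and were not proposed in the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoReference:
    """One party's claim that two URIs name the same thing."""
    uri_a: str
    uri_b: str
    confidence: float  # similarity score in [0, 1], not a certainty
    context: str       # the application or purpose the claim is meant for
    source: str        # who made the claim (provenance)


def accepted_pairs(claims, context, min_confidence, min_sources=1):
    """Return the URI pairs a consumer accepts for `context` under its own policy."""
    supporters = {}
    for claim in claims:
        if claim.context == context and claim.confidence >= min_confidence:
            # The pair is unordered: (a, b) and (b, a) support the same claim.
            pair = frozenset((claim.uri_a, claim.uri_b))
            supporters.setdefault(pair, set()).add(claim.source)
    return {pair for pair, sources in supporters.items() if len(sources) >= min_sources}
```

Because each claim keeps its source and context, a consumer can tighten or relax its policy later, or drop everything from a source whose claims turn out to be wrong, without touching the claims themselves.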

Some references:

Serving Linked Data

Back near the start of the project, I published a post outlining the processes involved in generating the Archives Hub RDF dataset and serving up “Linked Data” descriptions from that dataset; it’s perhaps best summarised in the following diagram from that post:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

In this post, I’ll say a little bit more about what is involved in the “Expose” operation up in the top right of the diagram.

Cool URIs for the Semantic Web

In an earlier post, I discussed the URI patterns we are using for the URIs of “things” described in our data (archival resources, concepts, people, places, and so on). One of the core requirements for exposing our RDF data as Linked Data is that, given one of these URIs, a user/consumer of that URI can use the HTTP protocol to “look up” that URI and obtain a description of the thing identified by that URI. So as providers of the data, our challenge is to enable our HTTP server to respond to such requests and provide such descriptions.

The W3C Note Cool URIs for the Semantic Web lists a number of possible “recipes” for achieving this while also paying attention to the principle of avoiding URI ambiguity, i.e. of avoiding using a single URI to refer to more than one resource – and in particular to maintaining a distinction between the URI of a “thing” and the URIs of documents describing that thing.

Document URI Patterns

Within the JISCExpo programme which funds LOCAH, projects generating Linked Data were encouraged to make use of the guidelines provided by the UK Cabinet Office in Designing URI Sets for the UK Public Sector.

These guidelines refer to the URIs used to identify “things” (somewhat tautologically, it seems to me!) as “Identifier URIs”, where they have the general pattern:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a resource type, like “person”;
  • reference is a name for an individual instance of that type or class

(The guidelines also allow for the option of using URIs with fragment identifiers (“Hash URIs”) as “Identifier URIs”.)

The document also recommends patterns for the URIs of the documents which provide information about these “things”, “Document URIs”:

http://{domain}/doc/{concept}/{reference}

These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple “more specific” documents in a single concrete format may be available as a separate resource in its own right. So a third set of URIs, “Representation URIs,” name documents in a specific format, using the suggested pattern:

http://{domain}/doc/{concept}/{reference}/{doc.file-extension}

i.e. for each “thing URI”/“Identifier URI” in our data, like:

http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist, which identifies a person, the artist Beverley Skinner;

there is a corresponding “Document URI” which identifies a (“generic”) document describing the thing:

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist

and a set of “Representation URIs” each identifying a (“specific”) document in a particular format:

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.html, which identifies an HTML document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.rdf, which identifies an RDF/XML document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.turtle, which identifies a Turtle document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.json, which identifies a JSON document (more specifically one using Talis’ RDF/JSON conventions for serializing RDF)

(We’ve deviated slightly from the recommended pattern here in that we just add “.{extension}” to the “reference” string, rather than adding “/doc.{extension}”, but we’ve retained the basic approach of distinguishing generic document and documents in specific formats, which I think is the significant aspect of the recommendations.)

This set of URI patterns corresponds to those used in the “recipe” described in section 4.2 of the W3C Cool URIs note, “303 URIs forwarding to One Generic Document”.
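The mapping between the three kinds of URI is mechanical, so it can be written down in a few lines. This is a minimal sketch in Python rather than the code running on the server; the function names are made up for illustration, and the list of extensions is simply taken from the examples above.

```python
from urllib.parse import urlsplit, urlunsplit

# Extensions taken from the Representation URI examples above.
FORMATS = ("html", "rdf", "turtle", "json")


def document_uri(identifier_uri: str) -> str:
    """Map an Identifier URI (/id/...) to its generic Document URI (/doc/...)."""
    parts = urlsplit(identifier_uri)
    if not parts.path.startswith("/id/"):
        raise ValueError(f"not an Identifier URI: {identifier_uri}")
    doc_path = "/doc/" + parts.path[len("/id/"):]
    return urlunsplit((parts.scheme, parts.netloc, doc_path, "", ""))


def representation_uris(identifier_uri: str) -> dict[str, str]:
    """Return one Representation URI per format, using the '.{extension}' suffix."""
    doc = document_uri(identifier_uri)
    return {ext: f"{doc}.{ext}" for ext in FORMATS}
```

Calling `representation_uris` with the Beverley Skinner Identifier URI yields the four Representation URIs listed above, keyed by extension.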

The Talis Platform

It is perhaps worth emphasising here that in the LOCAH case a “description” of any one of the things in our model may contain data which originated in multiple EAD documents e.g. a description of a concept may contain links to multiple archival resources with which it is associated, or a description of a repository may contain links to multiple finding aids they have published, and so on. A description may also contain data which originated from a source other than the EAD documents: for example, we add some postcode data provided by the National Archives, and most of the links to external resources, such as people described by VIAF records, are generated by post-transformation processes.

This aggregated RDF data – the output of the EAD-to-RDF transformation process and this additional data – is stored in an instance of the Talis Platform store. Simplifying things slightly, the Platform store is a “database” specialised for the storage and retrieval of RDF data. It is hosted by Talis, and made available as what in cloud computing terms is referred to as “Software as a Service” (SaaS). (Actually, a Platform store allows the storage of content other than RDF data too – see the discussion of the ContentBox and MetaBox features in the Talis documentation – but we are, currently at least, making use only of the MetaBox facilities).

Access to the store is provided through a Web API. Using the MetaBox API, data can be added/uploaded to the MetaBox using HTTP POST, updates can be applied through what Talis call “Changesets” (essentially “remove that set of triples” and “add this set of triples”) again using HTTP POST, and “bounded descriptions” of individual resources can be retrieved using HTTP GET. There are also “admin” functions like “give me a dump of the contents” and “clear the database”. In addition, the Platform provides a simple full-text search over literals (which returns result sets in RSS), a configurable faceted search, an “augment” function and a SPARQL endpoint.
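The “remove that set of triples, add this set of triples” idea behind an update can be illustrated with plain set arithmetic. This sketch is not the Talis Changeset vocabulary or the MetaBox API; the function names and the tuple representation of triples are assumptions made purely for illustration.

```python
Triple = tuple[str, str, str]


def diff_triples(current: set[Triple], desired: set[Triple]) -> tuple[set[Triple], set[Triple]]:
    """Return (to_remove, to_add) so that applying both turns `current` into `desired`."""
    return current - desired, desired - current


def apply_change(store: set[Triple], to_remove: set[Triple], to_add: set[Triple]) -> set[Triple]:
    """Apply an update to an in-memory set of triples and return the new set."""
    return (store - to_remove) | to_add
```

Only the triples that actually differ appear in the update, which is the attraction of this style over deleting and reloading a whole description.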

A number of client software libraries for working with the Platform are available, developed either by Talis staff or by developers who have worked with the Platform.

Delivering Linked Data from the Platform

I’m going to focus here on retrieving data from the MetaBox, and more specifically retrieving the “bounded descriptions” of individual resources which provide the basis for the “Linked Data” documents.

This process involves a small Web application which responds to HTTP GET requests for these URIs:

  • For an “Identifier URI”, the server responds with a 303 status code and a Location header redirecting the client to the “Document URI”
  • For a “Document URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code, a document in a format selected according to the preferences specified by the client (i.e. following the principles of HTTP content negotiation), and a Content-Location header providing a “Representation URI” for a document in that format.
  • For a “Representation URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code and a document in the format associated with that URI.

The first step above is handled using a simple Apache rewrite rule. For the latter two steps, we’ve made use of the Paget PHP library created by Ian Davis of Talis for working with the Platform (Paget itself makes use of another library, Moriarty, also created by Ian). I’m sure there are many other ways of achieving this; I chose Paget in part because my software development abilities are fairly limited, but having had a quick look at the documentation and one of Ian’s blog posts, I felt there was enough there to enable me to take an example and apply my basic and rather rusty PHP skills to tweak it to make it work – at least as a short-term path to getting something functional we could “put out there”, and then polish in the future if necessary.
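The three cases in the list above can be sketched as a single decision function. This is an illustration in Python, not the Apache rule or the Paget/PHP code actually in use: the function names are hypothetical, the media types are assumptions (the formats are named above, but not the Content-Type values served), and the content negotiation is deliberately naive.

```python
# Extension -> media type. These media types are assumptions for illustration.
MEDIA_TYPES = {
    "html": "text/html",
    "rdf": "application/rdf+xml",
    "turtle": "text/turtle",
    "json": "application/json",
}


def choose_extension(accept: str) -> str:
    """Pick the first media type in the Accept header that we can serve.

    Naive on purpose: q-values are ignored and header order decides.
    Falls back to HTML when nothing matches.
    """
    for item in accept.split(","):
        media_type = item.split(";")[0].strip()
        for ext, supported in MEDIA_TYPES.items():
            if media_type == supported:
                return ext
    return "html"


def plan_response(path: str, accept: str = "text/html") -> dict:
    """Return the status and headers for a GET on one of the three URI kinds."""
    if path.startswith("/id/"):
        # Identifier URI: 303 redirect to the generic Document URI.
        return {"status": 303, "headers": {"Location": "/doc/" + path[len("/id/"):]}}
    if not path.startswith("/doc/"):
        return {"status": 404, "headers": {}}
    stem, dot, ext = path.rpartition(".")
    if dot and ext in MEDIA_TYPES:
        # Representation URI: the extension fixes the format.
        return {
            "status": 200,
            "headers": {"Content-Type": MEDIA_TYPES[ext]},
            "describe": "/id/" + stem[len("/doc/"):],
        }
    # Document URI: content negotiation picks the format.
    ext = choose_extension(accept)
    return {
        "status": 200,
        "headers": {"Content-Type": MEDIA_TYPES[ext], "Content-Location": f"{path}.{ext}"},
        "describe": "/id/" + path[len("/doc/"):],
    }
```

The `describe` entry is the path of the Identifier URI whose description would be fetched from the store; a real implementation would build the response body from that description and would honour q-values in the Accept header.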

The main challenge was that the default Paget behaviour seemed to be to use the approach described in section 4.3 of the Cool URIs document, “303 URIs forwarding to Different Documents”, where the server performs content negotiation on the request for the “Identifier URI” and redirects directly to a “Representation URI”, i.e. a GET for an “Identifier URI” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist resulted in redirects to “Representation URIs” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.html or http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.rdf

If possible we wanted to use the alternative “recipe” described in the previous section, and after some tweaking we managed to get something that did the job. We also made some minor changes to provide a small amount of additional “document metadata”, e.g. the publisher of and license for the document. (I do recognise that the presentation of the HTML pages is currently pretty basic, and there is room for improvement!)

Finally, it’s maybe worth noting here that the Platform store itself doesn’t contain any information about the documents i.e. neither the Document URI nor the Representation URIs appear in RDF triples loaded to the store. So, in principle at least, we could add additional formats using additional Representation URIs simply by extending the PHP to handle the URIs and generate documents in those formats, without needing to extend the data in the store.

I’d started to write more here about extending what we’ve done to provide other ways of accessing the data, but having written quite a lot here already, I think that is probably best saved for a future post.

Lifting the Lid on Linked Data at ELAG 2011

Jane and I have just given our ‘Lifting the Lid on Linked Data’ presentation at the ELAG (European Library Automation Group) Conference 2011 in Prague today. It seemed to go pretty well. There were a few comments about the licensing situation for the Copac data on the #elag2011 Twitter stream, which is something we’re still working on.

Querying the Linked Archives Hub data using SPARQL

We’ve just announced the availability of our first draft linked data dataset of data from the Archives Hub. When newly available linked data datasets appear, I sometimes hear comments/questions along the lines of:

  • How do I know what the data looks like?
  • Show me some example SPARQL queries that I can use as starting points for my own exploration of the data

We’ve tried to go some way to addressing the first of those points in previous posts, in which I outlined the data model we’re using, to give a general picture of the types of things described and the relationships between them, and then provided a more detailed list of the RDF terms used to describe things. (That second post in particular will, I hope, be useful in thinking about how to construct queries).

In addition, there are some useful posts around on techniques for “probing” a SPARQL endpoint, i.e. issuing some general queries to get a picture of the nature of the graph(s) in the dataset behind an endpoint. See, for example:

In this post, I’ll focus mainly on responding to the second point, by providing a few sample SPARQL queries. Inevitably, these can only give a flavour of what is possible, but I hope they provide a starting point for people to build on.

This isn’t intended to be a tutorial on SPARQL; there are various such tutorials available, but one I found particularly thorough and helpful is:

The SPARQL endpoint for the Linked Archives Hub dataset is:

http://data.archiveshub.ac.uk/sparql.

The data is hosted in an instance of the Talis Platform, which supports a few useful extensions to the SPARQL standard, some of which are used in the examples below.
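For anyone who would rather call the endpoint from a script than from a form, a query can be sent as an HTTP GET with the query text in a `query` parameter, as the SPARQL Protocol specifies. The sketch below only builds the request, using the Python standard library; the function name is made up, and asking for JSON results is an assumption about what the endpoint offers.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://data.archiveshub.ac.uk/sparql"


def build_sparql_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request for a query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Asking for JSON results is an assumption about what the endpoint offers.
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the returned object to `urllib.request.urlopen` would send the query; whether and how the endpoint responds depends on the service being available.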

Listing “top-level” archival “collections”

Following the principles of “multi-level” description of archives, archivists apply a conceptualisation of archival materials as constituting hierarchically organised “collections”, where one “unit of description” may contain others, which in turn may contain others. It is often the case that an archival finding aid provides descriptions of materials only at the “collection-level”, or perhaps at some “sub-collection” level, without describing items individually at all.

In the LOCAH archival data, this approach is reflected in the use of a class ArchivalResource, where an instance of that class may have other instances as parts or members (or, inversely, one instance may be a part, or member, of another instance). This relationship is expressed using the properties dcterms:hasPart/dcterms:isPartOf and ore:aggregates/ore:isAggregatedBy.

The following query provides the URIs and labels (titles) of all archival resources mentioned in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
}

This list includes archival resources at any “level”, from collections down to individual items.

We want to narrow down that selection so that it includes only “top-level” archival resources i.e. archival resources which are not “part of” another archival resource. This can be done by extending our pattern to allow for the optional presence of a triple with predicate dcterms:isPartOf, and filtering to select only those cases where the object in that optional pattern is “not bound” i.e. no such triple is present in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
   OPTIONAL { ?ar dcterms:isPartOf ?parent } .
   FILTER (!bound(?parent))
}

Run this query against the current LOCAH endpoint.
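The OPTIONAL plus “not bound” idiom amounts to “keep the resources for which no isPartOf triple exists”. The same selection over a handful of in-memory triples looks like this; the triples and the function name are hypothetical, and the check on rdf:type is left out for brevity.

```python
IS_PART_OF = "dcterms:isPartOf"
LABEL = "rdfs:label"

# Hypothetical triples, written as (subject, predicate, object).
triples = [
    ("ex:collection", LABEL, "A collection"),
    ("ex:series", LABEL, "A series"),
    ("ex:series", IS_PART_OF, "ex:collection"),
    ("ex:item", LABEL, "An item"),
    ("ex:item", IS_PART_OF, "ex:series"),
]


def top_level(triples):
    """Return (resource, label) pairs for resources with no isPartOf triple."""
    has_parent = {s for s, p, _ in triples if p == IS_PART_OF}
    return [(s, o) for s, p, o in triples if p == LABEL and s not in has_parent]
```

Here only `ex:collection` survives, because the series and the item each appear as the subject of an isPartOf triple.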

Finding the location of the Repository holding an Archival Resource

For each archival resource, access to that resource is provided by a Repository (an agent, an entity with the ability to do things). This relationship is expressed using the property locah:accessProvidedBy. The Repository-as-Agent manages a place where the resource is held, a relationship expressed using the locah:administers property, and that place is associated with a postcode, both as a literal, and (perhaps more usefully) in the form of a link to a “postcode unit” in the dataset provided by the Ordnance Survey; by “following” that link, more information about the location can be obtained (e.g. latitude and longitude, relationships with other places) from the data provided by the OS.

Given the URI of an archival resource (in this example http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner), the following query returns the URI of the repository (agent), the postcode as literal, and the URI of the postcode unit:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX ospc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

SELECT ?repo ?pc ?pcunit
WHERE {
   ?repo locah:providesAccessTo 
                <http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner> ;
           locah:administers ?place .
   ?place gn:postalCode ?pc ;
          ospc:postcode ?pcunit
}

Run this query against the current LOCAH endpoint.

Listing the Archival Resources associated with a Person

In the EAD finding aids, the description of an archival resource may provide an association with the name of one or more persons associated with the resource as “index terms”. The person may be the creator of the resource, they may be the topic of it, or there may be some other association which is considered by the archivist to be significant for people searching the catalogue.

The following query provides a list of person names, the “authority file” form of the name, the identifiers of the archival resources with which they are associated, and the URI of a page on the existing Hub Web site describing the resource. I’ve limited it to a particular repository, as without that constraint it potentially generates quite a large result set (and it helps me conceal the fact that some of the person name data is still a little bit rough and ready!)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>

SELECT DISTINCT ?name ?famname ?givenname ?authname ?unitid ?hubpage
WHERE {
?arcres locah:accessProvidedBy <http://data.archiveshub.ac.uk/id/repository/gb15> ;
        locah:associatedWith ?concept ;
        dcterms:identifier ?unitid ;
        rdfs:seeAlso ?hubpage .
?concept foaf:focus ?person ;
             rdfs:label ?authname .
?person a foaf:Person;
        foaf:name ?name .
OPTIONAL {?person foaf:familyName ?famname;
                  foaf:givenName ?givenname }
}
ORDER BY ?famname ?givenname ?name  

Run this query against the current LOCAH endpoint.

Listing Concepts by number of associated Archival Resources

The following query lists the concepts from a specified concept scheme (here the UNESCO thesaurus, which is assigned the URI http://data.archiveshub.ac.uk/id/conceptscheme/unesco) and orders them according to the number of archival resources with which they are associated. (This makes use of the count and GROUP BY Talis Platform SPARQL extensions.)

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ( count(?concept) AS ?count ) 
WHERE {
   ?x locah:associatedWith ?concept .
   ?concept skos:inScheme  <http://data.archiveshub.ac.uk/id/conceptscheme/unesco> .
 }
GROUP BY ?concept
ORDER BY DESC(?count)

Run this query against the current LOCAH endpoint.
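For readers less familiar with grouping and counting, the query above does the equivalent of tallying how many times each concept appears as the object of an association and then sorting by that tally. A small Python sketch of the same operation follows, using hypothetical data and a made-up function name.

```python
from collections import Counter

# Hypothetical (resource, concept) pairs standing in for locah:associatedWith triples.
associations = [
    ("ex:ar1", "ex:concept/a"),
    ("ex:ar2", "ex:concept/a"),
    ("ex:ar3", "ex:concept/b"),
    ("ex:ar4", "ex:concept/a"),
]


def concepts_by_usage(associations):
    """Count resources per concept and sort by descending count."""
    counts = Counter(concept for _, concept in associations)
    return counts.most_common()
```

With the sample data, `ex:concept/a` comes first with a count of 3, followed by `ex:concept/b` with a count of 1.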

Listing Persons associated with Archival Resources, where Persons are born during a specified period

In an earlier post, I described the modelling of the births and deaths of individual persons as “events”.

Based on this approach, birth or death events occurring within a specified period can be selected. So, for example, the following query returns a list of persons born during the 1940s, with the archival resources with which they are associated:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?birthdate ?person ?name ?famname ?givenname ?ar
WHERE { 
?event a bio:Birth ;
   bio:date ?birthdate ;
   bio:principal ?person .
   FILTER regex(str(?birthdate), '^194') .
?person foaf:name ?name .
   OPTIONAL { ?person foaf:familyName ?famname ; foaf:givenName ?givenname } .
?concept foaf:focus ?person .
?ar locah:associatedWith ?concept .
}
ORDER BY ?birthdate ?name

Run this query against the current LOCAH endpoint.
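The FILTER in this query treats the birth date as a string and keeps values beginning with “194”. The same test in Python is shown below; the sample dates and the function name are hypothetical, and the actual lexical form of the dates in the dataset may differ.

```python
import re

# Hypothetical (person, birth date) pairs; dates are plain strings, as with str(?birthdate).
births = [
    ("ex:person1", "1938"),
    ("ex:person2", "1942-05-17"),
    ("ex:person3", "1949"),
    ("ex:person4", "1950-01-01"),
]


def born_in_decade(births, decade_prefix):
    """Keep entries whose date string starts with the prefix, e.g. '194' for the 1940s."""
    pattern = re.compile("^" + re.escape(decade_prefix))
    return sorted((date, person) for person, date in births if pattern.search(date))
```

A prefix match like this is simple, but it depends on every date starting with a four-digit year; comparing typed date values would be more robust where the data allows it.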

(I use this to illustrate the “event” approach, but in this case, birth and death dates are also provided as literal values of properties associated with the person, so there are other (easier!) ways of getting that information.)

To close, I’ll just emphasise again that these are only a few simple examples, intended to give an idea of the structure/”shape” of the data, and a flavour of what sort of queries are possible. If you come up with any examples of your own you’d like to share, we’d be glad to hear about them in comments below. (Come to think of it, it’s probably not very easy to maintain formatting/whitespace etc in comments, so it might be easier to host any such examples elsewhere and just post links here).

P.S. If there are any “tweaks” that you think we could make that would make things easier for those consuming/querying the data, it would be good to hear about them. I can’t promise we’ll be able to implement them, but we are still at the stage where things can be changed and we do want the data to be as usable and useful as possible.

Describing the “things”: the RDF terms used (part 2)

In the previous post, I described some of the considerations in choosing RDF vocabularies to use for the LOCAH archival metadata. In the tables below, I’ve tried to summarise the properties used to “describe” an instance of each of the classes in our model. That is, for a particular thing URI in our dataset, one might expect to find triples with that URI as subject and these property URIs as predicates; and when our data is served as linked data and a thing URI is dereferenced, the “bounded description” provided will include those triples (and others). Some properties are optional, so they are not necessarily present for all instances (and some may not be present at all until we add some more data…!)

This is really more of a “reference document” than a blog post, but I provide it in part as documentation of the data creation/transformation process, and in part as a guide for potential users of the actual data. Having said that, the data is liable (even likely) to change so consumers should always refer to the actual data for an up-to-date picture of the terms used. I’ve tried to highlight (dark grey background) below terms which I consider to be particularly “at risk” and liable to be removed/replaced, mostly terms from the “locah” vocabulary.

Most of this data is generated from the transformation of the EAD XML documents; a small proportion is added separately. Again, I’ve tried to indicate that in the tables below (light grey background).

Finding Aid

Type rdf:type Class URI locah:FindingAid
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Conforms to dcterms:conformsTo Standard URI standard:isadg
Publisher dcterms:publisher Repository (Agent) URI
Encoded As locah:encodedAs EAD URI
Topic foaf:topic Archival Resource URI
Subject dcterms:subject Archival Resource URI
Has Part dcterms:hasPart Biographical History URI

The following should probably be an owl:sameAs relationship (or we should just cite the Hub URI?)

See also rdfs:seeAlso Hub Page URI e.g. http://archiveshub.ac.uk/data/gb15sirernesthenryshackleton

EAD

Type rdf:type Class URI locah:EAD
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Date Created dcterms:created plain literal
Conforms to dcterms:conformsTo Standard URI dbpedia:Encoded_Archival_Description
standard:ead2002
Encoding Of locah:encodingOf Finding Aid URI

The Hub does not currently provide a URI for the EAD document, but it is planned to do so, at which point we should add an owl:sameAs relationship (or just cite the Hub URI?)

Repository (Agent)

Type rdf:type Class URI locah:Repository
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Identifier dcterms:identifier plain literal
Country Code locah:countryCode plain literal
Maintenance Agency Code locah:maintenanceAgencyCode plain literal
Is Publisher Of locah:isPublisherOf Finding Aid URI
Provides Access To locah:providesAccessTo Archival Resource URI
Administers locah:administers Place URI
See also rdfs:seeAlso Archon Page URI e.g. http://www.nationalarchives.gov.uk/archon/searches/locresult_details.asp?LR=15

Repository (Place)

Type rdf:type Class URI wgs84_pos:SpatialThing
Label rdfs:label plain literal
Title dcterms:title plain literal
Is Administered By locah:isAdministeredBy Repository URI
See also rdfs:seeAlso Archon Page URI e.g. http://www.nationalarchives.gov.uk/archon/searches/locresult_details.asp?LR=15

The following data is not generated from the EAD documents, but added in from a separate source:

Postal Code gn:postalCode plain literal
Located In gn:locatedIn Postcode Unit URI e.g. http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Within ossr:within Postcode Unit URI e.g. http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Postcode postcode:postcode Postcode Unit URI e.g. http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER

Archival Resource

Type rdf:type Class URI locah:ArchivalResource
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Level locah:level Level URI
Page foaf:page Finding Aid URI
Access Provided By locah:accessProvidedBy Repository (Agent) URI
Identifier dcterms:identifier plain literal
Date dcterms:date plain literal
Date Created or Accumulated locah:dateCreatedAccumulatedString plain literal

The following properties were introduced to distinguish different date cases (date range v single date).

Date Created or Accumulated locah:dateCreatedAccumulated typed literal
Date Created or Accumulated (Start) locah:dateCreatedAccumulatedStart typed literal
Date Created or Accumulated (End) locah:dateCreatedAccumulatedEnd typed literal
Produced In event:produced_in Creation Event URI
Extent (String) locah:extent plain literal
Extent dcterms:extent Extent URI
Language dcterms:language Language URI e.g. http://lexvo.org/id/iso639-3/eng
Is Represented By locah:isRepresentedBy Document URI or Aggregation URI
Origination locah:origination Agent URI
Has Biographical History locah:hasBiographicalHistory Bioghist URI
Associated With locah:associatedWith Concept URI
Has Part dcterms:hasPart Archival Resource URI
Aggregates ore:aggregates Archival Resource URI
Is Part Of dcterms:isPartOf Archival Resource URI
Is Aggregated By ore:isAggregatedBy Archival Resource URI
Members locah:members (RDF Collection)
See also rdfs:seeAlso Hub Page URI e.g. http://archiveshub.ac.uk/data/gb15sirernesthenryshackleton-gb15sirernesthenryshackleton-imperialtrans-antarcticexpedition

For all of the following, the object is simply a copy of the XML element content from the EAD document as an XML Literal. This is a rather “dumb” and probably not terribly useful “translation” from the EAD; in a future iteration of the transform, we hope to extract further useful triples from this part of the EAD data, and we will probably remove some of these triples.

Custodial History locah:custodialHistory XML literal
Acquisitions locah:acquisitions XML literal
Scope and Content locah:scopecontent XML literal
Appraisal locah:appraisal XML literal
Accruals locah:accruals XML literal
Access Restrictions locah:accessRestrictions XML literal
Use Restrictions locah:useRestrictions XML literal
Physical or Technical Requirements locah:physicalTechnicalRequirements XML literal
Other Finding Aids locah:otherFindingAids XML literal
Location Of Originals locah:locationOfOriginals XML literal
Alternate Forms Available locah:alternateFormsAvailable XML literal
Related Material locah:relatedMaterial XML literal
Bibliography locah:bibliography XML literal
Note locah:note XML literal
Processing locah:processing XML literal

Level

Type rdf:type Class URI locah:Level
skos:Concept
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Comment rdfs:comment plain literal
Note skos:note plain literal
Definition skos:definition plain literal
Description dcterms:description plain literal

Language

Type rdf:type Class URI lvont:Language

Creation (Event)

Type rdf:type Class URI locah:Creation
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Product event:product Archival Resource URI
Involved lode:involved Archival Resource URI
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI

Time Interval

Type rdf:type Class URI time:Interval
time:TemporalEntity
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Timeline timeline:timeline Timeline URI timeline:universaltimeline
Start timeline:start typed literal
Interval Starts time:intervalStarts Time Interval URI e.g. http://reference.data.gov.uk/id/year/1874
End timeline:end typed literal
Interval Ends time:intervalEnds Time Interval URI e.g. http://reference.data.gov.uk/id/year/1874
Contains crm:P86i_contains Time Interval URI e.g. http://reference.data.gov.uk/id/year/1874
At timeline:at typed literal
Interval During time:intervalDuring Time Interval URI e.g. http://reference.data.gov.uk/id/year/1874
Falls Within crm:P86_falls_within Time Interval URI e.g. http://reference.data.gov.uk/id/year/1874

Extent

Type rdf:type Class URI locah:Extent
Label rdfs:label plain literal

In the EAD XML doc, extent is expressed simply as a literal. Where possible we’ve tried to parse out a “unit of measurement” and a quantity, reflected in RDF as a triple where the predicate reflects the unit and the object the quantity, as a typed literal, to try to make comparisons easier. I need to catch up with what current “best practice” is for representing quantities/units of measurement so this may well change. Also, currently, “units” include things like “file”, “paper” and “envelope”, which may not be terribly useful.

Archival Box locah:archbox typed literal (xsd:decimal)
Metre (Linear) locah:metre typed literal (xsd:decimal)
Cubic Metre locah:cubicmetre typed literal (xsd:decimal)
Folder locah:folder typed literal (xsd:decimal)
Envelope locah:envelope typed literal (xsd:decimal)
Volume locah:volume typed literal (xsd:decimal)
File locah:file typed literal (xsd:decimal)
Item locah:item typed literal (xsd:decimal)
Page locah:page typed literal (xsd:decimal)
Paper locah:paper typed literal (xsd:decimal)
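As a rough illustration of the parsing described above, the following Python sketch splits a simple extent string into a unit property and a decimal quantity. It is not the actual LOCAH transform: the keyword map, the regular expression and the example subject URI are all assumptions made for the sake of the example, and it only handles strings of the form "number unit".

```python
import re
from decimal import Decimal

# Hypothetical mapping from words found in EAD extent strings to the
# property names listed in the table; the real keyword list is not given.
UNIT_PROPERTIES = {
    "box": "locah:archbox",
    "boxes": "locah:archbox",
    "metre": "locah:metre",
    "metres": "locah:metre",
    "folder": "locah:folder",
    "folders": "locah:folder",
    "volume": "locah:volume",
    "volumes": "locah:volume",
    "file": "locah:file",
    "files": "locah:file",
}

# Assumed shape of a parseable extent: a number, whitespace, one word.
EXTENT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+([A-Za-z]+)\s*$")


def parse_extent(text):
    """Return (property, quantity) for a simple extent string, or None."""
    match = EXTENT_PATTERN.match(text)
    if match is None:
        return None
    quantity, unit = match.groups()
    prop = UNIT_PROPERTIES.get(unit.lower())
    if prop is None:
        return None
    return prop, Decimal(quantity)


def extent_triple(subject, text):
    """Format a parsed extent as a Turtle-style triple, or None if unparsed."""
    parsed = parse_extent(text)
    if parsed is None:
        return None
    prop, quantity = parsed
    return f'{subject} {prop} "{quantity}"^^xsd:decimal .'
```

Anything the pattern does not recognise falls through as None, in which case only the literal rdfs:label would be available for that extent.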

Origination (Agent)

Type rdf:type Class URI foaf:Agent, dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Page foaf:page Biographical History URI
Is Origination Of locah:isOriginationOf Archival Resource URI

For links to other agents (external or internal):

Same As owl:sameAs Agent URI
Is Like umbel:isLike Agent URI

Biographical History

Type rdf:type Class URI locah:BiographicalHistory, bibo:DocumentPart, bibo:Document, foaf:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Body locah:body XML literal, plain literal
Topic foaf:topic Agent URI
Subject dcterms:subject Agent URI
Is Biographical History For locah:isBiographicalHistoryFor Archival Resource URI
Is Part Of dcterms:isPartOf Finding Aid URI

Concept Scheme

Type rdf:type Class URI skos:ConceptScheme
Label rdfs:label plain literal

Concept (ControlAccess – Subject)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Concept (ControlAccess – Persname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Person URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the person who is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Surname locah:surname plain literal
Forename locah:forename plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

Person (ControlAccess – Persname)

Type rdf:type Class URI foaf:Person, foaf:Agent, dcterms:Agent, crm:E21_Person
Label rdfs:label plain literal
Name foaf:name plain literal
Family Name foaf:familyName plain literal
Given Name foaf:givenName plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

For links to other persons (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Person URI
Is Like umbel:isLike Person URI

Concept (ControlAccess – Famname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the family that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Family (ControlAccess – Famname)

Type rdf:type Class URI locah:Family, foaf:Group, foaf:Agent, dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other families (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Family URI
Is Like umbel:isLike Family URI

Concept (ControlAccess – Corpname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Organisation URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the organisation that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Organisation (ControlAccess – Corpname)

Type rdf:type Class URI foaf:Organization, foaf:Agent, dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other organisations (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Organisation URI
Is Like umbel:isLike Organisation URI

Concept (ControlAccess – Geogname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Place URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the place that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Place (ControlAccess – Geogname)

Type rdf:type Class URI wgs84_pos:SpatialThing
Label rdfs:label plain literal
Name locah:name plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other places (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Place URI
Is Like umbel:isLike Place URI

Concept (ControlAccess – GenreForm)

Type rdf:type Class URI locah:GenreForm, skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Concept (ControlAccess – Function)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Book/Document (ControlAccess – Title)

Type rdf:type Class URI foaf:Document, bibo:Document
Label rdfs:label plain literal
Title dcterms:title plain literal

Birth (Event)

Type rdf:type Class URI bio:Birth, bio:IndividualEvent, bio:Event, lode:Event, event:Event, crm:E67_Birth
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:P4_has_time-span Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Brought Into Life crm:P98_brought_into_life Person URI

Death (Event)

Type rdf:type Class URI bio:Death, bio:IndividualEvent, bio:Event, lode:Event, event:Event, crm:E69_Death
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:P4_has_time-span Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:P100_was_death_of Person URI

Floruit (Event)

Type rdf:type Class URI locah:Floruit, bio:IndividualEvent, bio:Event, lode:Event, event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:P100_was_death_of Person URI

Object

Type rdf:type Class URI foaf:Document, bibo:Document
Is Aggregated By ore:isAggregatedBy Object Group URI

Object Group

Type rdf:type Class URI ore:Aggregation, dcmitype:Collection, bibo:Collection
Aggregates ore:aggregates Object URI

Describing the “things”: the RDF terms used (part 1)

In previous posts, I described:

  • the model of the “world” on which we’re basing the Archives Hub RDF data: the types of “thing” being described, and (some of) the relationships between them (1, 2, 3); and
  • the patterns for URIs to be assigned to the individual “things”

In this post and the next one, I’ll outline the RDF vocabularies we’re using to describe those “things”. This post covers some of the considerations in choosing the vocabularies and some of the “patterns” we’ve used in deploying them; the next lists the properties and classes you can expect to find in the LOCAH data.

Using existing RDF vocabularies

As far as possible, we’ve tried to make use of existing, deployed RDF vocabularies. These include:

Those distinctions between which vocabulary “describes” what are somewhat rough, particularly taking into account that the “directionality” of properties in RDF is somewhat arbitrary: a triple using the dcterms:creator property to link a created work to an agent is as much “about” the agent as it is “about” the thing created.

However, where we’ve seen a need to express a notion that is not well addressed by an existing vocabulary, we have defined the additional classes and properties required and provided URIs for them as a small “local” LOCAH RDF vocabulary. At this point in time, I consider most of these terms something of a “work in progress”, and likely to be revised (or even dropped completely) before the end of the project. But I suspect some will remain – which, given the bounded timescale of the project, leaves questions about the longer term management of such vocabularies.

Discovering Appropriate Vocabularies

Most of my knowledge of existing RDF vocabularies has come from lurking on good old-fashioned mailing lists, particularly the W3C Semantic Web Interest Group list and the Linked Open Data list. I don’t read every posting by any means, and the signal-to-noise ratio can be variable, but for me they remain an excellent source of information with a knowledgeable and active contributing community (and the archives are a great repository.)

In similar territory, Semantic Stackoverflow provides a “question-and-answer”-style service, though it tends to have a fairly technical focus.

Another useful source is to look at actual linked data datasets, particularly those which are in a similar “domain” to the one you’re working in and cover similar resource types, and check out what vocabularies they are using (and how they are using them). In the library/bibliographic domain in particular, there has been a fairly steady stream of linked data datasets appearing over the last couple of years, so there’s quite a bit to go on, rather less so for the archives case. For a few pointers, see e.g. this review post by Ed Summers (itself already nearly a year old).

There are some services which aim to provide disclosure/discovery services based on aggregations of information about vocabularies and their constituent terms, sometimes called “metadata registries” or “metadata schema registries”. I’ve had mixed experiences of using these services: in some cases the content is not current; in others the coverage is intentionally tailored to the requirements of a particular community, so the challenge becomes one of finding a registry whose coverage matches the task at hand. One service (with quite general coverage) which I have occasionally found useful is Schemapedia, a project by Ian Davis of Talis; it provides “vocabulary”-level descriptions, rather than descriptions of individual “terms” but it includes some examples of actual terms: see, e.g. the entry for the Biographical Vocabulary.

There are a number of services which provide search functions across aggregations of data gathered from the linked data Web/Semantic Web. Sindice crawls and aggregates a huge range of RDF data and provides a “Google”-like search across that aggregation. (I’ve also found navigating such an aggregation helpful in thinking about various aspects of linked data: the sig.ma browser highlights the consequences of merging data from multiple sources, and related issues of provenance, attribution and trust, for example).

Finally, at the risk of stating the obvious, plain old Web search engines can still be a useful entry point.

Having said all this, I admit that the discovery of RDF vocabularies is still something of a challenge, and I continue to come across useful things I’d missed. And having found something potentially useful often raises further questions: Is the vocabulary stable or still being developed? Is it described following “modern” good practice for RDF vocabularies? Is it being managed/curated? By an individual/institution/community? Does it have the support of a community of users? Particularly if the intention is for a dataset to have some longevity, these may be significant considerations.

Patterns for using RDF Vocabularies

While discovering RDF vocabularies capable of expressing the information you want to represent is a first step, it often raises issues of exactly how those vocabularies might best be deployed, or of choosing between several possible alternative solutions.

Leigh Dodds and Ian Davis of Talis have authored a booklet Linked Data Patterns which tries to address some of these challenges, by gathering together some common “patterns” of use, based on existing practice by linked data implementers – though perhaps inevitably at this stage, some aspects of that practice are something of a “moving target” as new challenges are identified and practice evolves to address them. (See, for example, a recent debate on the Linked Open Data mailing list covering the question of expectations for what the object of an rdfs:seeAlso triple might/should dereference to.)

I continue to find the reflections of linked data practitioners an excellent source, particularly those working in domains close to those I’m interested in. I regularly find myself referring to the series of posts by Jeni Tennison on creating linked data. In this context, the fifth post on “Finishing Touches” is particularly relevant, and in large part prompts my next couple of points.

Labelling

One of the principles I’ve tried to adhere to, following the guidance by Jeni, is that each resource we expose should have a human-readable label, provided using the rdfs:label property, and as far as possible that label should function as a useful “stand-alone” name for the thing.

In some cases this is a straightforward matter of using some text content node in the EAD XML document as an RDF literal. In other cases, a single element in the EAD document is mapped to a number of distinct resources in our model. In these cases, the transformation process typically prefixes or suffixes the source text to generate labels for the various different things. Perhaps unsurprisingly, this sometimes leads to some slightly “artificial” or “stilted” results, so it’s something we may need to refine.

Also, and perhaps more problematically, as I’ve noted in a previous post, the practice of archival description has traditionally relied heavily on a “multi-level description” approach which results in the presentation of resource descriptions “in the context of” the descriptions of other related resources. So it is common to find individual items within a collection labelled simply as something like “Letter”, on the basis that the reader of the finding aid will glean further information from the fact that the description of the item is presented within a context provided by a list of other “sibling” items, all “children” of a “parent” aggregation of some form. Currently our mapping generates the rdfs:label of an item using only the label (EAD unittitle element) of that item in the EAD document, with the result that we may indeed end up with many individual resources labelled “Letter” (though of course the description will also include other properties derived from other EAD data and links to “parent” resources). An alternative might be to try to generate a label by “qualifying” the item unittitle, say, by prefixing it with the label of a “parent” resource – though I suspect in practice this would generate somewhat unwieldy results.

Where the source data makes it seem reasonable to express it, I’ve also indicated the use of a “preferred label”, using the skos:prefLabel property. I’m conscious here of the need to be careful: the SKOS specification includes a number of “integrity conditions”, rules which data using the SKOS vocabulary should follow. Amongst them is the requirement that

A resource has no more than one value of skos:prefLabel per language tag.

The important thing to remember is that this is intended to apply in an “open world” context, not simply as a condition scoped to a particular “document”. The EAD to RDF transform process is performed on a document-by-document basis. Within the Hub dataset, it is quite common for the labels for a single resource to be generated from the content of multiple EAD documents. While in theory naming within the set of EAD documents should be consistent, in practice, the use of variants of names is widespread in our data – the names of archival repositories are one example. Generating an skos:prefLabel triple for each variant would result in a conflict with the integrity condition once the data was merged in the triple store.

Bearing in mind that the “open world” extends beyond the boundaries of our own dataset, the same considerations apply in the case where we are exposing URIs for resources for which other parties already expose descriptions, including an skos:prefLabel triple, and we can’t guarantee that the names in our data correspond to those provided by that source.
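To make the integrity condition concrete, here is a minimal Python sketch of the kind of check one could run over merged data to spot resources that have ended up with more than one skos:prefLabel for the same language tag. The tuple representation of triples is a simplification I've assumed for illustration; it is not how the Hub triple store exposes its data.

```python
from collections import defaultdict


def pref_label_conflicts(triples):
    """Find (resource, language) pairs with more than one distinct skos:prefLabel.

    `triples` is an iterable of (subject, predicate, (text, language_tag))
    tuples, a simplified stand-in for the merged contents of a triple store.
    An empty string is used as the tag for literals without a language.
    """
    labels = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "skos:prefLabel":
            text, language = obj
            labels[(subject, language)].add(text)
    # Only keys that collected two or more different label texts are conflicts.
    return {key: sorted(values) for key, values in labels.items() if len(values) > 1}
```

Two EAD documents that each yield a different name variant for the same repository URI would show up here as a single conflicting key, which is the situation that argues for emitting rdfs:label rather than skos:prefLabel in those cases.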

Inferencing

Another issue to consider is that referred to by Leigh and Ian in their “Materialize Inferences” pattern, and by Jeni Tennison in her discussion of “Derivable Data”. One of the strengths of using the RDF model is that it is supported by a formal semantics, a framework for reasoning with data, i.e. given some set of data, it is often possible to apply some formalised set of rules to infer or derive additional triples. However, it should not be assumed that all consumers of the data will have access to the tools which support such reasoning, so it may be more appropriate for a data provider like LOCAH to explicitly include at least some of those “derivable” triples in the data we provide.

For a simple example of what I mean, the Friend of a Friend (FOAF) vocabulary provides a property called foaf:name (“A name for some thing.”). As part of their description of that property, the FOAF vocabulary owners provide the triple:

foaf:name rdfs:subPropertyOf rdfs:label .

The RDFS property rdfs:subPropertyOf is one of those properties which is associated with a set of rules. What those rules say is that, for any two properties linked by an rdfs:subPropertyOf relation, two resources related by the first property are also related by the second. So each time I find a triple using foaf:name as a predicate, I can infer (deduce, derive) a second triple using the rdfs:label predicate, e.g. if I find

<http://example.org/id/person/p123> foaf:name "Ernest Henry Shackleton" .

then I can conclude

<http://example.org/id/person/p123> rdfs:label "Ernest Henry Shackleton" .

However, to reach that conclusion, my application needs (a) knowledge of the general rdfs:subPropertyOf inference rule, and (b) knowledge that foaf:name is a subproperty of rdfs:label – and (c) the processing capability to apply that rule!

By providing – “materializing” – both those triples in our source data, we relieve the consuming application of that responsibility – though that benefit comes at the cost of increasing the size of the descriptions we provide.
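The rule itself is simple enough to sketch in a few lines of Python. This is only an illustration of what “materializing” means, not the code used in the LOCAH transform; it represents triples as plain tuples and assumes, for simplicity, that each property has at most one direct super-property.

```python
def materialize_subproperties(triples, subproperty_of):
    """Return the input triples plus those entailed by rdfs:subPropertyOf.

    `subproperty_of` maps a property to its direct super-property, e.g.
    {"foaf:name": "rdfs:label"}. Chains of super-properties are followed.
    """
    result = set(triples)
    for subject, predicate, obj in triples:
        seen = {predicate}  # guards against cycles in the mapping
        parent = subproperty_of.get(predicate)
        while parent is not None and parent not in seen:
            result.add((subject, parent, obj))
            seen.add(parent)
            parent = subproperty_of.get(parent)
    return result


source = [
    ("<http://example.org/id/person/p123>", "foaf:name", "Ernest Henry Shackleton"),
]
materialized = materialize_subproperties(source, {"foaf:name": "rdfs:label"})
```

Here `materialized` holds both the foaf:name triple and the derived rdfs:label triple, so a consumer that only knows about rdfs:label can still find the name.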

This tactic can be particularly useful, I think, for properties which are subproperties of “generic” vocabularies like the RDF Schema vocabulary or the Dublin Core vocabularies. Sometimes generic linked data tools have some “built-in knowledge” of, and/or specific behaviour associated with, some of these vocabularies (e.g. to obtain literal names/labels/titles for display to human readers). It may be perfectly reasonable to use a triple with some more specialised subproperty in our data to indicate some specific relationship, but where appropriate it is also helpful to “materialize” the triple using the more generic property as well, so that an application looking for RDF Schema or DC properties can easily access that data.

Extending that slightly, Jeni suggests a “rule of thumb” that “if the result of the reasoning involves a resource from another vocabulary, then we should include it”.

The subproperty case is just one example: the inference of resource type based on rdfs:range and rdfs:domain is another case in point. In the LOCAH data, we’ve tried to provide fairly “generous” type data (e.g. including “super-classes”) where possible – again, on the grounds that such information is a commonly used “hook” in user queries (“Select resources of type T where [some other criteria]”).
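A similar sketch shows why “generous” type data helps with that kind of query. The class hierarchy below uses only the two FOAF subclass relations (foaf:Person and foaf:Organization are both subclasses of foaf:Agent); the resource URIs and the dictionary-based data structures are assumptions for illustration only.

```python
def materialize_types(direct_types, subclass_of):
    """Expand each resource's rdf:type values with all of their super-classes.

    `direct_types` maps a resource URI to a set of class names;
    `subclass_of` maps a class name to a set of direct super-classes.
    """
    expanded = {}
    for resource, classes in direct_types.items():
        all_classes = set()
        pending = list(classes)
        while pending:
            cls = pending.pop()
            if cls in all_classes:
                continue  # already visited; also protects against cycles
            all_classes.add(cls)
            pending.extend(subclass_of.get(cls, ()))
        expanded[resource] = all_classes
    return expanded


def resources_of_type(expanded_types, wanted):
    """Answer 'select resources of type T' against the expanded type data."""
    return sorted(r for r, classes in expanded_types.items() if wanted in classes)


SUBCLASS_OF = {
    "foaf:Person": {"foaf:Agent"},
    "foaf:Organization": {"foaf:Agent"},
}
DIRECT_TYPES = {
    "http://example.org/id/person/p1": {"foaf:Person"},
    "http://example.org/id/organisation/o1": {"foaf:Organization"},
}
EXPANDED = materialize_types(DIRECT_TYPES, SUBCLASS_OF)
```

With the super-classes included, a query for foaf:Agent finds both resources without the consumer needing to know anything about the FOAF class hierarchy.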

The “cost” of this approach is that the dataset and the individual “bounded descriptions” served are larger – so there is a “trade-off” here which we may want to monitor and reconsider once we see how the data is being used.

Events

As I mentioned earlier, we extended our initial draft model to include a notion of “event”. Currently, the application of this approach in our data is quite limited: it is applied to the “creation”/“origination” of the archival resources, and to the birth, death and “periods of activity” (floruit) of individuals. What we do is similar to the approach sketched by Ben O’Steen in his processing of the British Library’s British National Bibliography data – though with a little more complexity as we make use of event ontologies which model time periods as resources, rather than as literals.

This is probably best illustrated by means of an example. Given a person with birth date of 1901 and death date of 1985, we generate an RDF graph like the following:

Figure: RDF Graph of Life Events Data

The time interval nodes at the right-hand side are reference.data.gov.uk URIs for years, like http://reference.data.gov.uk/id/year/1901
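For readers who prefer code to diagrams, the following Python sketch generates a simplified version of that kind of graph as a list of triples. The event URI pattern (`event_base`) is hypothetical, since the pattern actually used for event URIs isn't shown here, and I've assumed the event is linked directly to the year URI; the real data may include additional interval nodes and properties.

```python
YEAR_BASE = "http://reference.data.gov.uk/id/year/"


def life_event_triples(person_uri, event_base, birth_year=None, death_year=None):
    """Build (subject, predicate, object) triples for birth and death events.

    `event_base` is a hypothetical URI prefix for the event resources.
    Years are modelled as resources (year URIs), not as literals.
    """
    triples = []
    events = (("birth", "bio:Birth", birth_year), ("death", "bio:Death", death_year))
    for kind, cls, year in events:
        if year is None:
            continue  # no event is generated for a missing date
        event_uri = f"{event_base}{kind}"
        year_uri = f"{YEAR_BASE}{year}"
        triples.extend([
            (event_uri, "rdf:type", cls),
            (event_uri, "bio:principal", person_uri),
            (event_uri, "event:time", year_uri),
            (event_uri, "lode:atTime", year_uri),
        ])
    return triples


example = life_event_triples(
    "http://example.org/id/person/p123",
    "http://example.org/id/person/p123/",
    birth_year=1901,
    death_year=1985,
)
```

The same year URI appears as the object of both event:time and lode:atTime, which is the “overlapping vocabularies” trade-off in miniature.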

What I haven’t illustrated on that diagram is that I’ve also included some data using the CIDOC CRM ontology – actually using the Erlangen CRM vocabulary. I’m feeling my way a bit with this, so it is somewhat partial/experimental at the moment, but I hope to refine/extend it in the future.

The point I wanted to highlight is that we’ve made use of multiple “overlapping” vocabularies here – again on the grounds that it may be useful to provide that flexibility to consumers of the data querying using a specific vocabulary. As above, this is a “trade-off” which we may want to monitor and reconsider in the future.

Summary

I’ve tried to cover here some of the issues around our choices of RDF vocabularies and how we’ve deployed them. The next post will summarise the actual terms used.

Identifying the “things”: URI Patterns for the Hub Linked Data

In my previous couple of posts, I outlined the model of the “world” on which we’re basing the RDF data we’re generating from the Archives Hub’s EAD XML documents.

At the heart of the Linked Data approach is the principle that all the “things” we want to “say anything about” should be named using a URI, and that those URIs should use the http URI scheme, so that they can be easily “looked up” or “dereferenced” using Web technologies in order to obtain some information provided by the URI owner about the thing. So, having specified the types or classes of thing we want to refer to and describe, the next step is to decide on the structure of the http URIs that we’ll use to name the “instances” of those classes – the individual “things” – archival resources, repositories, concepts, persons, places, and so on. In this post, I’ll try to describe the patterns we’re using, and outline how we construct individual URIs using those patterns from the EAD input data. As I hope will become clearer, the nature of the input data conditions the form of the patterns we’ve chosen. This has turned into a rather long post (again!) but I hope the detail is useful – I think it’s important for us to try to document our processes and some of the issues we’ve grappled with as well as to present the conclusions.

In some (most) cases, these will be newly created URIs, under a domain that we (well, MIMAS and the Archives Hub service) own. For these URIs, the project is responsible for choosing the URIs and putting in place the mechanisms to ensure that their dereferencing results in the provision of some “useful information”. In other cases, we will simply be citing existing URIs, defined by other agencies who (hopefully!) provide for their dereferencing.

The UK Cabinet Office has recently published some general guidelines on URI patterns for government Linked Data, Designing URI Sets for the UK Public Sector, and within the JISC programme strand under which LOCAH is funded, projects are encouraged to follow the recommendations of those guidelines. Following these guidelines, the general URI pattern recommended to identify “things” is:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a class (resource type), like “person”
  • reference is a name for an individual instance of that class or type
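Filling in that template is straightforward; a minimal Python sketch is given below. The domain and the example values are placeholders, and percent-encoding the reference is my own assumption about how awkward characters might be handled, not something the guidelines prescribe.

```python
from urllib.parse import quote


def thing_uri(domain, concept, reference):
    """Build an identifier URI of the form http://{domain}/id/{concept}/{reference}."""
    # safe="" means every reserved character in the reference, including "/",
    # is percent-encoded so it cannot introduce extra path segments.
    return f"http://{domain}/id/{concept}/{quote(reference, safe='')}"
```

Because the function depends only on its inputs, re-running it over the same data yields the same URIs.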

Our RDF data is being generated, at least in the first instance, by processing EAD XML documents, so we want to construct our URIs for our “things” from content within those XML documents. And we want to do so in a way that, as far as possible, ensures that each of those URIs is an unambiguous name/referrer, i.e. it identifies a single “thing”, and we don’t end up with a single URI being used for what are in fact two different things. On the other hand, we can live with the case where we end up with multiple URIs, all of which identify a single thing, because information can be added at a later stage to indicate that they are synonyms.

The other point to note is that the initial transformation step is being performed on a “document-by-document basis”, i.e. taking a single EAD document as input and outputting RDF/XML. So for any given resource, the information we generate – including the URI of the resource – is based only on the content of that document (and any generally applicable information we can embed in the transform itself). There may be other data “about” that “thing” in another EAD document but we don’t have access to it at the time of transformation.

Also, it’s desirable that we construct our URIs in such a way that if we need to re-run the transform, we generate the same URIs from the same input data (unless we explicitly decide to change the patterns for some reason).

Finally, although the patterns below often make use of human-readable strings from the EAD document content, I haven’t treated human-readability as a major consideration. Having said that, I’ve tended to make use of (slightly normalised forms of) human-readable strings where possible, rather than, say, creating opaque “hashes”.

As with other aspects of the work, at this stage, this is a first cut at tackling the issue, and we may revise our approaches based on the experience of applying them over the dataset. Having gone through and constructed patterns for the various resource types, looking back over them now, I think I can see a small number of distinct methods that we’ve used:

  1. Identifiers: For some of these “things”, the EAD documents contain some sort of formally assigned identification code or number, which unambiguously – at least within the scope of the Hub collection – identifies that instance within the set of resources of that type (i.e. it serves as a “reference” in the terms of the Designing URI Sets… document). This is the case, for example, with the languages of the materials, using the did/langmaterial/language/@langcode attribute value. A variant of this is the case where such an identifier can be constructed from a combination of multiple pieces of content. Repositories, for example, can be identified by the pair of country code (ead/eadheader/eadid/@countrycode) and maintenance agency code (ead/eadheader/eadid/@mainagencycode). For these cases a combination of the name of the resource type and that identification code provides the basis for the “reference” part of the URI.
  2. “Authority-Controlled” Names: For many of the “things”, however, the EAD documents do not contain such a code; rather, they refer to things only by name. In some cases, the form of the name is drawn from an “authority file” – indicated in the EAD document – and the name includes sufficient information (e.g. birth/death dates, titles etc for a person) to make the resulting string an unambiguous referrer within the set of names from that source. For these cases, a combination of a name for the authority file and the name provides the basis for the “reference”. However, this does depend on the creator of the EAD document having accurately transcribed the “authoritative” form of the name, at least sufficiently to maintain unambiguity of reference.
  3. “Rule-Based” Names: In other cases, the “thing” is named, not using a name from a controlled list, but rather a name constructed according to some codified set of rules, where the rules used are indicated in the EAD document. The intent behind such rules is to try to ensure consistency of form and unambiguity of reference. The National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names (one of the rule sets recommended to Hub data creators) states “A personal name is constructed by combining mandatory and optional components of the name so that the person concerned can be identified with certainty and distinguished from others bearing similar names. An individual should have only one authorised form of name and each name should apply to only one individual.” Typically, as for the “authority file” case, this is achieved through the inclusion of dates, titles etc for persons. For these cases, a combination of a name for the rules and the name itself should provide the basis for the “reference”. However, in practice, the picture with the Hub data is somewhat more complex. First, in some cases where it is claimed that rules are followed, the content itself indicates that this is not the case. For example, the NCA Rules mandate that a personal name should include “the year in which a person was born or died, the span of years of his/her lifetime or the approximate period covered by his/her activities”, even if those dates are estimated. But there are cases in the data marked up as following the NCA Rules which do not meet this requirement (e.g. personal names providing only surname and forename with no dates), which I suspect may result in ambiguous references. Second, even where the rule is followed and the mandatory components are present, the distributed nature of Hub data creation means that I suspect there is still some possibility that a single personal name may be used in two different sources to refer to what in fact are two different people (consider e.g. the case of two data providers using the name “Smith, John, fl 1920-1950”).
  4. “Locally-Scoped” Names: In other cases, the form of the name is neither authority-controlled nor rule-based, but nevertheless there is some expectation that the form of the name used is sufficient to make it an unambiguous referrer within some context. This is the case, for example, with the content of the did/origination element. The difficulty, however, is in establishing reliably what that context is. What is that “local scope”? We’ve tentatively taken the approach that such names have been constructed in such a way as at least to be unambiguous within the collection of submissions to the Hub by a single repository. So by combining the repository identifier and the name, hopefully, we can arrive at a “reference” which avoids ambiguity. Again, it may turn out that this assumption is unreliable, and results in ambiguous references, so we may need to revisit this approach.
  5. “Identifier Inheritance”: (I’m sure there must be a formal term for this but I’m not sure what it is!) In these cases the EAD document does not provide an unambiguous name for the “thing” itself; however the “thing” has a simple relationship with some other “thing” for which identification fits into one of the other categories. Where the relationship is one-to-one, a URI can be constructed by adopting the pattern for that other “thing” and substituting the name of the resource type. An example of this is the case of the “biographical history” associated with a “unit of description”. The unit of description has an identifier (based on a pattern described below) and since – in data constructed using the Hub template – each unit has at most one biographical history, replacing the “unit” resource type name with a “bioghist” resource type name gives us a suitable URI path, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URI for the biographical history would contain “/bioghist/gb15abc”. A variant of this is the case where the relationship is many-to-one, rather than one-to-one. Here the approach needs to be extended to include e.g. a sequence number to distinguish the multiple “things”. This is the approach taken for the Unit of Description, where a “child” (“part”) unit of description uses the URI of the “parent” (“whole”) unit suffixed with a sequence number, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URIs for the “child” units would contain “/unit/gb15abc-1”, “/unit/gb15abc-2” and so on. In theory, this should not be necessary as the unitid for a unit should be unique within an EAD document, but in practice we’ve found that this is not the case in the actual data. (In this case, the identifier would be “reproducible” only if any new units are inserted at the end of a sequence rather than in the middle.)
  6. So, with the caveat above that this is all somewhat tentative at this stage, I summarise below the approaches taken to generating URIs for instances of each of the classes in the Hub model. Note that sometimes, an instance of the same class is generated in different “contexts” within the EAD document, and in these cases different rules for URI construction may be applied in those different contexts, depending on the information available within the EAD document.

    We haven’t yet finalised the domain name we’ll be using, so for the purposes of the following, {root} represents the domain and the first part of the path. Italicised text is used for the URI patterns (or parts of them); bold text is used for XPath(-ish!) representations of the source of data within the EAD XML document.

    Finding Aid

    Pattern(s)

    {root}/id/findingaid/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example(s)

    {root}/id/findingaid/gb15sirernesthenryshackleton

    EAD document

    Pattern(s)

    {root}/id/ead/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example(s)

    {root}/id/ead/gb15sirernesthenryshackleton

    Repository (Agent)

    Pattern(s)

    {root}/id/repository/{repositoryid}

    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/repository/gb15

    Repository (Place)

    Pattern(s)

    {root}/id/place/{repositoryid}

    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/place/gb15

    Unit of Description

    Pattern(s)

    {root}/id/unit/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Note: In principle, it should be possible to use c/unitid content rather than position in tree, but in practice, there are cases where unitid content is not unique within the EAD document.

    Example(s)

    {root}/id/unit/gb15sirernesthenryshackleton

    {root}/id/unit/gb15sirernesthenryshackleton-1

    Level

    Pattern(s)

    {root}/id/level/{level-name}

    level-name
    archdesc/@level or archdesc/@otherlevel or c{n}/@level or c{n}/@otherlevel

    Example(s)

    {root}/id/level/fonds

    Language

    Pattern(s)

    http://lexvo.org/id/iso639-3/{langcode}

    Note: use existing lexvo.org URIs for languages.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://lexvo.org/id/iso639-3/eng

    Creation (Event)

    Pattern(s)

    {root}/id/creation/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creation/gb15sirernesthenryshackleton

    Creation (Time)

    Pattern(s)

    {root}/id/creationtime/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creationtime/gb15sirernesthenryshackleton

    Extent

    Pattern(s)

    {root}/id/extent/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/extent/gb15sirernesthenryshackleton

    Biographical History

    Pattern(s)

    {root}/id/bioghist/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/bioghist/gb15sirernesthenryshackleton

    Concept (Origination)

    Pattern(s)

    {root}/id/concept/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    origination-name
    normalised form of did/origination

    Example(s)

    {root}/id/concept/agent/gb15/sirernesthenryshackleton

    Agent (Origination)

    Pattern(s)

    {root}/id/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    origination-name
    normalised form of did/origination

    Example(s)

    {root}/id/agent/gb15/sirernesthenryshackleton

    Concept (ControlAccess – Subject)

    Pattern(s)

    {root}/id/concept/{source}/{subject-name}

    {root}/id/concept/{repositoryid}/{subject-name}

    source
    controlaccess/subject/@source
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    subject-name
    normalised form of controlaccess/subject

    Example(s)

    {root}/id/concept/lcsh/antiquities

    Concept (ControlAccess – Persname)

    Pattern(s)

    {root}/id/concept/person/{source}/{person-name}

    {root}/id/concept/person/{rules}/{person-name}

    {root}/id/concept/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/concept/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/concept/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/concept/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Person (ControlAccess – Persname)

    Pattern(s)

    {root}/id/person/{source}/{person-name}

    {root}/id/person/{rules}/{person-name}

    {root}/id/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Concept (ControlAccess – Famname)

    Pattern(s)

    {root}/id/concept/family/{source}/{family-name}

    {root}/id/concept/family/{rules}/{family-name}

    {root}/id/concept/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/concept/family/nra/dundasviscountsmelvilledunira

    {root}/id/concept/family/ncarules/boucicault

    Family (ControlAccess – Famname)

    Pattern(s)

    {root}/id/family/{source}/{family-name}

    {root}/id/family/{rules}/{family-name}

    {root}/id/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/family/nra/dundasviscountsmelvilledunira

    {root}/id/family/ncarules/boucicault

    Concept (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/concept/organisation/{source}/{org-name}

    {root}/id/concept/organisation/{rules}/{org-name}

    {root}/id/concept/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/concept/organisation/nra/britishbroadcastingcorporation

    {root}/id/concept/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/concept/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Organisation (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/organisation/{source}/{org-name}

    {root}/id/organisation/{rules}/{org-name}

    {root}/id/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/organisation/nra/britishbroadcastingcorporation

    {root}/id/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Concept (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/concept/place/{source}/{place-name}

    {root}/id/concept/place/{rules}/{place-name}

    {root}/id/concept/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/concept/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/concept/place/ncarules/canada

    {root}/id/concept/place/gb982/meirionethshire%28wales%29

    Place (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/place/{source}/{place-name}

    {root}/id/place/{rules}/{place-name}

    {root}/id/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/place/ncarules/canada

    {root}/id/place/gb982/meirionethshire%28wales%29

    Concept (ControlAccess – GenreForm)

    Pattern(s)

    {root}/id/concept/{source}/{genreform-name}

    {root}/id/concept/{rules}/{genreform-name}

    {root}/id/concept/{repositoryid}/{genreform-name}

    source
    controlaccess/genreform/@source
    rules
    controlaccess/genreform/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    genreform-name
    normalised form of controlaccess/genreform

    Example(s)

    {root}/id/concept/aat/buildingplans

    Concept (ControlAccess – Function)

    Pattern(s)

    {root}/id/concept/{source}/{function-name}

    {root}/id/concept/{rules}/{function-name}

    {root}/id/concept/{repositoryid}/{function-name}

    source
    controlaccess/function/@source
    rules
    controlaccess/function/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    function-name
    normalised form of controlaccess/function

    Example(s)

    {root}/id/concept/agift/miningregulations

    Book

    Pattern(s)

    {root}/id/document/{source}/{title}

    {root}/id/document/{rules}/{title}

    {root}/id/document/{repositoryid}/{title}

    source
    controlaccess/title/@source
    rules
    controlaccess/title/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    title
    normalised form of controlaccess/title

    Example(s)

    {root}/id/document/aacr2/thecastlediaries1974-761980

    Birth (Event)

    Pattern(s)

    {root}/id/birth/{source}/{person-name}

    {root}/id/birth/{rules}/{person-name}

    {root}/id/birth/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatenation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/birth/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/birth/ncarules/allenjim1926-1999playwright

    {root}/id/birth/gb1832/berlinisaiah1909-1997sirknighthistorian

    Object

    Pattern(s)

    {object-uri}

    object-uri
    dao/@href or daogrp/daoloc/@href

    Example(s)

    http://library.kent.ac.uk/library/special/html/specoll/jack.gif

    Object Group

    Pattern(s)

    {root}/id/group/{unitid}-{groupno}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree
    groupno
    position within daogrp sequence for archdesc or c{n}

    Example(s)

    {root}/id/group/gb0254ms274-1

    Time Interval (Year, Month, Day)

    i.e. specific intervals of time.

    Pattern(s)

    http://reference.data.gov.uk/id/year/{yyyy}

    http://reference.data.gov.uk/id/month/{yyyy}-{mm}

    http://reference.data.gov.uk/id/day/{yyyy}-{mm}-{dd}

    Note: use existing reference.data.gov.uk URIs for intervals.

    Example(s)

    http://reference.data.gov.uk/id/year/1921

    http://reference.data.gov.uk/id/month/1921-06

    http://reference.data.gov.uk/id/day/1921-06-03
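
To make the patterns above a little more concrete, here is a minimal Python sketch of how such URIs might be assembled. It is not the project's actual code. The normalisation rule (lower-case, drop whitespace, percent-encode the rest) is only an assumption inferred from the examples, as is the order of preference of source, then rules, then repository identifier. The function names are hypothetical.

```python
from urllib.parse import quote


def normalise(text):
    """Lower-case, drop all whitespace, then percent-encode what remains.

    Assumption: this rule is guessed from the example URIs, not taken
    from a specification.
    """
    collapsed = "".join(text.lower().split())
    return quote(collapsed, safe="-")


def controlaccess_uri(root, kind, name, source=None, rules=None, repositoryid=None):
    """Qualify a name by the first available of source, rules or repository.

    Assumption: the order of preference mirrors the authority-controlled,
    rule-based and locally-scoped categories described earlier.
    """
    scope = source or rules or repositoryid
    if scope is None:
        raise ValueError("need a source, rules or repository identifier")
    return f"{root}/id/{kind}/{normalise(scope)}/{normalise(name)}"


def inherited_uri(root, resource_type, unit_id):
    """One-to-one "identifier inheritance": reuse the unit id under another type."""
    return f"{root}/id/{resource_type}/{unit_id}"


def child_unit_ids(parent_unit_id, child_count):
    """Many-to-one case: suffix the parent id with a 1-based sequence number."""
    return [f"{parent_unit_id}-{n}" for n in range(1, child_count + 1)]
```

For example, calling controlaccess_uri("{root}", "organisation", "Daily Mail (London, England)", rules="aacr2") reproduces the aacr2 organisation example given above.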

Some more “things”: some extensions to the Hub model

Having had a little more time to experiment with the Archives Hub EAD data, and to think about what sort of operations on the RDF data we might wish to perform or enable others to perform, I’ve introduced a few small extensions to the model I described a few weeks ago.

Extents

At our last project meeting, we talked about some of the possibilities for visualisations of the data. One of the ideas (suggested by Jane) is to explore representing relative sizes of collections, perhaps on a map, so that, for example, a researcher could provide a geographic location and a subject area and get a visual representation of the relative sizes of collections within that area.

The EAD XML format provides an element called <extent> for “information about the quantity of the materials being described or an expression of the physical space they occupy”. Although the EAD Tag Library provides guidelines to try to encourage some uniformity of the content, the data in the Hub EAD documents is quite variable. Examples of the content in the samples I’ve looked at include:

  • 6.5 linear metres
  • 2.04 metres
  • 0.48m
  • 190 archive boxes
  • 13 boxes
  • One sheet of paper
  • 13 lever arch files, 48 sound tape reels, 490 audio cassette tapes (1 filing cabinet)

In the initial model, this was just treated in RDF as a single triple with subject the URI of the unit of description (an archival collection or some part of it) and this string a literal object. I’m suggesting changing this to treat the “extent” as a resource with its own URI, rather than simply as a literal. Doing that enables us – for at least some of these cases – to make explicit that it is a value measured in some “unit” (linear metres, archival boxes), to “normalise” the way those units are represented (so e.g. “linear metres”, “metres” and “m” can be mapped to a single form in the RDF data), and possibly to make comparisons, albeit approximate ones, between extents measured in different units (for example, “archival boxes” and “linear metres”).

So we end up with patterns in the RDF graph like:

unit:123 dcterms:extent extent:123 .

extent:123 ex:metres "2.04"^^xsd:decimal .

Having said that, I recognise that the nature of the input data is such that such techniques are usefully applicable only to a subset of the data; I’m not sure there’s a great deal we can do with “composite” strings like the last one in the list above, other than present them to a human reader.
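
As a rough illustration of the kind of processing this implies, the Python sketch below pulls a number and a normalised unit out of the simple extent strings and gives up on anything else. The alias table is an assumption, in particular treating “archive boxes” and “boxes” as the same unit, and parse_extent is a hypothetical name rather than anything in the project code.

```python
import re
from decimal import Decimal

# Assumed mapping of unit spellings to a single normalised form.
UNIT_ALIASES = {
    "linear metres": "metres",
    "metres": "metres",
    "m": "metres",
    "archive boxes": "boxes",
    "boxes": "boxes",
}

# A single number followed by a unit made only of letters and spaces.
_SIMPLE_EXTENT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z ]+?)\s*$", re.IGNORECASE)


def parse_extent(text):
    """Return (value, unit) for a simple extent string, or None otherwise."""
    match = _SIMPLE_EXTENT.match(text)
    if match is None:
        return None  # composite or purely textual extent
    value, raw_unit = match.groups()
    unit = UNIT_ALIASES.get(" ".join(raw_unit.lower().split()))
    if unit is None:
        return None  # a unit we do not recognise
    return Decimal(value), unit
```

Strings such as “One sheet of paper” or the composite filing-cabinet example return None, and would simply be kept as literals for a human reader.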

Events and Times

One of the other ideas for presenting data we’ve chewed around is that of some sort of “timeline” view. It’s something I’ve been quite keen to explore – though I’m conscious that much of the most useful information is, in the EAD documents, in the form only of prose in the “biographical/administrative histories” provided for the originators of the archives.

As a first tentative step in this direction, I’ve introduced a notion of “event” into the model, where, in the first instance:

  • the Creation of a unit of description is modelled as an event taking place during a period of time
  • (where birth/death dates are provided in the input) the Birth and Death of a person are modelled as events taking place during a period of time

It’s possible to generate this just from simple processing of the input data. It may be possible to go further and generate a richer range of “events” through the use of some flavour of intelligent text analysis/“entity extraction” tools on the biographical/administrative history text, but that’s something for us to consider in the future.
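
The “simple processing” for persons could look something like the Python sketch below, which looks for a span of years in a personal name and pairs each event with a year URI from reference.data.gov.uk. The input string format and the life_events function are assumptions for illustration. Names with only a “fl” (flourished) date yield no birth or death event.

```python
import re

YEAR_URI = "http://reference.data.gov.uk/id/year/{}"

# A lifespan written as two four-digit years joined by a hyphen.
_LIFESPAN = re.compile(r"\b(\d{4})-(\d{4})\b")


def life_events(person_name):
    """Return (event type, year URI) pairs found in a personal name string."""
    match = _LIFESPAN.search(person_name)
    if match is None:
        return []  # no span of years, so no events can be inferred
    born, died = match.groups()
    return [
        ("Birth", YEAR_URI.format(born)),
        ("Death", YEAR_URI.format(died)),
    ]
```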

Postcodes

Finally – and as I noted in the previous post this is something which goes beyond the content of the EAD documents themselves – prompted mainly by the recent announcement by John Goodwin that the Ordnance Survey had extended their linked data dataset to include “post code units”, I’ve added in a notion of “Postcode Unit” so that we can make links to resources from that dataset (and also to the UK Postcodes dataset).

So the revised model looks something like the figure below:

Diagram showing data model for EAD data

Figure 1

So, I’m hoping that – bug fixes aside – I can stop tinkering with this for a while 🙂 and that we can work with this version of the model, and test out what is possible and where any “pain points” are, and then think about where further changes might be useful.

Modelling Copac data

With the Archives Hub data well under way, it was time to start looking at the Copac data.  The first decision to be made was which version of Copac data to use – consolidated or unconsolidated.  As part of the process of adding records to Copac they are de-duplicated, allowing different institutions’ records for the same item to be presented as one record, instead of several.  For more info on Copac de-duplication, see this blog post.

So our first question was: deal with the individual records from each library, or with the consolidated records created for Copac?  This made us think about the nature of what we were describing. The unconsolidated records (generally!) relate to the actual, physical ‘thing’ – what in FRBR would be the ‘item’.

The consolidated records are closer to (but by no means a perfect example of) the FRBR manifestation.  That is to say, they are describing different physical instances of the same theoretical work; in linked data terms, they are ‘same as’.  They aren’t perfect manifestation level records, as there may be other records on Copac for the same manifestation which haven’t been consolidated due to cataloguing differences.  At Copac, we err on the side of caution, and would rather have this happen, than have records which aren’t the same consolidated into the same record.

So we could do our mapping and our transformations at unconsolidated level, and then use ‘same as’ to link together the descriptions that would later be consolidated in Copac.  But as we’re accepting Copac’s judgement that they are describing the same set of items, why not save ourselves that trouble, and work from the consolidated description?  We can then hang the individual bibliographic records off this central unit of description.

This means that all of the information provided by the different libraries is related to the same unit of description.  The bibliographic records that go together to make up a consolidated Copac record may not contain all of the same information, but they won’t contain any contradictory information.  Thus two records which are the same in all details except date of publication (say 1983 in one, 1984 in the other) will not consolidate, but records which are the same in all details except that one contains a subject where the other does not, will consolidate.

In fact, subjects are one of the things (along with notes) that don’t affect consolidation at all.  We will combine all of the subjects that come in individual descriptions, so that a consolidated record might end up with the subjects:

Management

Management — theory

Management (theoretical)

Business & management

We will leave these in the linked data description for the same reason they are left in the Copac description – while such similar terms may seem superfluous, they actually increase discoverability, by providing multiple access points.  They will link into the central ‘unit of description’, rather than the individual bibliographic records.
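
To illustrate the behaviour described above (and only that; this is not Copac’s actual de-duplication algorithm), here is a small Python sketch. Records are assumed to be plain dictionaries, and the field names and both function names are hypothetical.

```python
# Fields that, per the description above, do not affect consolidation.
IGNORED_FIELDS = {"subjects", "notes"}


def can_consolidate(a, b):
    """True if two records agree on every field they both supply.

    A field missing from one record is not treated as a contradiction.
    """
    shared = (set(a) & set(b)) - IGNORED_FIELDS
    return all(a[field] == b[field] for field in shared)


def merged_subjects(records):
    """Combine subjects from all records, keeping first-seen order."""
    combined = []
    for record in records:
        for subject in record.get("subjects", []):
            if subject not in combined:
                combined.append(subject)
    return combined
```

Two records differing only in publication date (1983 against 1984) would fail can_consolidate, while two differing only in their subjects would pass and have their subjects pooled.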

Once we’d decided on this central unit of description (name TBD, but likely to be ‘Copac record’ or something boringly similar), other aspects of the description started to fall into place.  Some of these were straightforward – publication date, for instance, is fairly obviously a literal – while others took more thought and discussion.

Among the more complicated issues was that of creator.  We are working with MODS data, which has come from MARC data, and MARC allows you to have only one ‘creator’.  This creator sits in the 100s as the main access point, and all other contributors (including co-authors!) are relegated to the 700s, where they become what we have decided to call ‘other person associated with this unit of description’.  Not very snappy, but hopefully fairly accurate.  In theory, the role that this person has in the creation of the item should be reflected in a MARC indicator, but in practice this is not often included in descriptions.  Where it is indicated that a person (or a corporate body) is an editor, contributor, translator, illustrator etc, we can build these into the modelling; where not, they will have to be satisfied with the vague title of ‘associated person’.

This will work for most situations, but it does still leave room for error.  Where a person is named in the 700s with no indicator of role, it is possible that they are a person who was associated with one particular item, rather than the manifestation – a former owner or bookseller, for example.  While we do want to present this information, which works as another access point, and may be of interest to users, we have the problem that this information should really be associated with the item, not our quasi-manifestation.  This information only concerns one specific physical item, as described in one of the individual bibliographic records. Should it really have a link to our central unit of description?  If not, where do we link it to?  Our entries for individual bib records describe only the records themselves, not a physical real-world item.  It’s an interesting point, and one we’ll be discussing more as the project goes on.

We’re continuing to work with Copac data, and will discuss other issues here as they arise.

The “things” in EAD: a first cut at a model

As mentioned by Jane in a couple of previous posts, she, Bethan and I met up in Manchester in August to share our thoughts about how to model the Archives Hub EAD data in a form that can be represented in RDF.

RDF in a nutshell

For the purposes of this discussion, the main point to bear in mind is that the “grammatical principle” underpinning RDF is one of making simple three-part statements, each of which makes an assertion of a relationship (of some particular type) between two things. So for example, in RDF I can “say” things like:

Document 123 has-title “Arthur and George”

or

Document 123 is-authored-by Person P
Person P has-name “Julian Barnes”
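
For readers who think more easily in code, the same statements can be sketched in Python as plain (subject, predicate, object) tuples. This is only an illustration of the three-part structure, not real RDF tooling, and the identifiers doc:123 and person:P are made up for the example.

```python
# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("doc:123", "has-title", "Arthur and George"),
    ("doc:123", "is-authored-by", "person:P"),
    ("person:P", "has-name", "Julian Barnes"),
]


def objects(statements, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


def author_names(statements, document):
    """Follow is-authored-by, then has-name, to list a document's authors."""
    names = []
    for author in objects(statements, document, "is-authored-by"):
        names.extend(objects(statements, author, "has-name"))
    return names
```

Answering “who wrote document 123?” then means following two statements in turn, from the document to the person and from the person to the name.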

When considering how to represent EAD data in RDF, then, the first step is to try to take a step back from the “nitty-gritty” of the EAD XML markup, and think about the three part statements we might construct to represent the “information content” of that document. We need to think in terms, not of XML documents and elements and attributes and nesting/containment, but rather of what an EAD document is “saying” about “things in the world” (perhaps more accurately, in the “world” as conceptualised by the creator of the archival finding aid, shaped by archival description practices in general) and what sort of questions we want to answer about those “things”. What are the “things” – and here I use the term in a general sense to include concepts and abstractions as well as material objects – that an EAD document provides information about? What are the relationships between these things? What else does an EAD document say about those things?

Note: The discussion here does not cover the “document”/“description” side of the “Linked Data” picture i.e. for each “thing”, we’ll be providing a “description” of that “thing” in the form of a “document”. Metadata describing that “document” will be important in providing information about provenance and currency, for example, but that is not discussed here.

EAD as used by the Archives Hub

The EAD XML format was designed to cope with the “encoding” of a wide range of archival finding aids, including those constructed according to the (slightly different) cataloguing practices and traditions of different communities.

Further, many features of the EAD format are optional: one can construct a valid EAD document using only a fairly minimal level of markup, or one can use more detailed markup to represent more information.

This flexibility can be something of a “double-edged sword”: on the one hand, it enables data creators to accommodate a wide range of data, and it provides choice in the level of detail of markup (and human resources in creating that markup!) to be applied; on the other hand, it can make working with EAD data quite complex for a consumer, particularly when processing data from a range of sources which perhaps use a range of different conventions and features of the language.

In part to address this sort of issue (as well as to make things simpler for data providers by insulating them from the detail of EAD markup), the Archives Hub provides a forms-based EAD editor, based primarily on the information categories enumerated by the ISAD(G) archival description standard, which generates EAD documents following a consistent set of markup conventions. (I sometimes think of this as a “profile” of EAD, a narrower set of constraints than that imposed by the EAD DTD/schema itself, but I’m not sure that sort of terminology is in widespread use in this context.)

So, we made the “pragmatic” decision to work, in the first instance at least, on the basis of this particular set of EAD markup conventions, rather than trying to address the full EAD format, which means we can limit the number of variants we need to deal with. Having said that, even for the case of data created using the Hub editor, an element of variation is present, because although the data entry form generates a common high-level structure, data creators can apply different markup within those high-level structural components. In this first cut at a model, we have focused on analysing those common structural elements, with the intention of extending and refining our approach at a later stage.

In the course of this (or in thinking about it afterwards) we’ve come up with a few questions, which I’ll try to highlight in the course of the discussion below. Any feedback on these points (or indeed on any other aspect of the post!) would be very welcome.

The “world” as seen by EAD

Jane and I had both done some doodling before our meeting, and we started out by walking through our ideas, highlighting both those aspects which seemed pretty clear and uncontroversial, and aspects where we were uncertain or several alternatives seemed possible (and reasonable). Although we were using slightly different terminology, I think we had come up with quite similar notions, and after a bit of discussion, we arrived at a first cut at a “core” model which I’m representing graphically in Figure 1 below. This isn’t intended as a formal UML or E-R diagram, but each box represents a type of “thing” (a class) and each arrow represents a type of relationship between individual things (“instances” of those classes):

Diagram showing draft data model for EAD data (1)

Figure 1

So the “core” types of things identified in this first stage were:

  • Unit of Description: these are the “units” of archival material, a document or set of documents, the actual stuff held in the repository and described by the finding aid. It’s a “generic” class to reflect the archival description principle of “multi-level description”. An archival finding aid typically has a “hierarchical” structure, in which one “unit of description” is (described as logically forming) “part of” another “unit of description”. A finding aid may provide only a “collection-level” description of a collection which contains many thousands of individual records, without describing those records individually at all; or it may include descriptions of various component groupings and sub-groupings of records; or it may indeed go as far as describing individual records within such groupings. For each Unit of Description, information relevant to that particular unit is provided. EAD and ISAD(G) allow for the provision of more or less the same set of information whatever the “level” of unit described, though in practice some elements are more commonly used for “aggregate/group” units.
  • Archival Finding Aid: these are the documents created by archival cataloguers to describe the archival materials. Often a single finding aid describes (or has as its topic/subject) several units of description, but it may be the case that a finding aid describes only a single unit – where only a description of the collection as a whole is provided.
  • Repository (Agent): the organisations who curate and provide access to the archival material, and who create and maintain the archival finding aids. (EAD allows for the possibility that two different agencies perform these two roles; the Hub EAD Editor works on the basis that a single agent is responsible for both).
  • Origination (Agent): the entity (individual, organisation or family) “responsible for the creation, accumulation, or assembly of the described materials before their incorporation into an archival repository” (from the description of the EAD <origination> element). Jane analysed the rather complex nature of the ISAD(G) Creator/EAD origination relationship, which encompasses notions of both “item creator” and “collector”, in an earlier post on the Archives Hub blog (http://archiveshub.ac.uk/blog/?p=2401).
  • “Things” which are referenced in the form of names used as “access points” or “index terms” using the EAD <controlaccess> element. The Hub EAD Editor supports the provision of the following as <controlaccess> terms, and recommends the use of a number of thesauri or “authority files” from which they should be drawn: Names of “Subjects” (topics); Personal Names; Family Names; Corporate Names; Place Names; Book Titles; Names of Genres or Forms; Names of Functions. So the corresponding “things named” are: Concepts, Persons, Families, Organisations, Places, Books, Genres or Forms, and Functions. As Jane notes in her recent post, the relationship between the Unit of Description and the entity named in the <controlaccess> element is not necessarily a relationship of “about”-ness, but a rather less specific one, which for the moment we’ve labelled as simply “associated with” (though a better label might be preferable!).

(I’ve shown the Origination and Repository as distinct classes in the diagram, rather than as a single Agent class, because, as I hope will become clearer below, it ends up that they participate in a slightly different set of relationships).

We went on to extend and refine this core model to accommodate more of the information from the EAD document.

First, we refined the way the “access points” are represented. I’d discussed this aspect of the model with Leigh Dodds of Talis and he suggested that we consider modelling the physical entities here as concepts, in turn related to physical entities, i.e. that we represent the “conceptualisation” of a person, family, organisation or place captured in a thesaurus entry or authority file record, as distinct from the actual physical entity. So, to take an example which I think Bethan used during our conversation, we can distinguish between a conceptualisation of William Blake as a poet and one of William Blake as an artist, each in turn related to William Blake the person.

Although I don’t plan to discuss the specifics of RDF vocabulary in this post, it’s worth noting that the FOAF RDF vocabulary has recently been extended with the addition of a property, foaf:focus, to represent the relationship between the conceptualisation and the thing conceptualised (person, place etc), to support exactly this convention.

For some of the <controlaccess> named entities – like the topics, genres/forms and functions – there is no “other thing conceptualised” and it is sufficient to model them simply as concepts (or as instances of a subclass); and for the book case, we’ll just treat it as a “book” (and for the moment, at least, sidestep any FRBR-ish issues).

In both cases, the notion that the concept is a member of a specific thesaurus/authority file can be captured by introducing the notion (from SKOS) of a “Concept Scheme”.

Question 1: One question raised by this approach is whether, for the cases where there is a distinct entity involved, in transforming an EAD document into RDF, we should:

  1. Coin URIs for, and generate “descriptions” of, both the concept and the person/family/organisation/place conceptualised (with a triple with a foaf:focus predicate relating the two)? Or:
  2. Coin a URI for, and generate a “description” of only the concept, and leave the relationship with the person/family/organisation/place conceptualised “out of scope” at the transform stage (though that relationship might be obtained at a later stage by linking the concept to external data)?

My inclination is to do the former, on the grounds that this enables us to capture more of the information present in the EAD document i.e. to capture the information that where a <persname> element is used, this is the name of a conceptualisation of a person, where a <corpname> element is used, this is the name of a conceptualisation of an organisation, and so on.
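As a rough illustration of what the first option might produce, here is a minimal Python sketch that coins a URI for the concept and another for the thing conceptualised, and relates the two with foaf:focus. The base URI, the URI patterns, the `slug` and `triples_for_access_point` helpers and the choice of skos:prefLabel for the name are all assumptions made for the example; the project has not yet settled on URI patterns or vocabulary terms. Only the mapping from EAD element names to kinds of entity comes from the EAD element set itself.

```python
import re

# Hypothetical base URI: real URI patterns are still to be decided.
BASE = "http://example.org/id/"

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
FOAF_FOCUS = "http://xmlns.com/foaf/0.1/focus"

# EAD elements whose names refer to a distinct "thing conceptualised".
# The path segments on the right are illustrative only.
ENTITY_KINDS = {
    "persname": "person",
    "famname": "family",
    "corpname": "organisation",
    "geogname": "place",
}


def slug(name):
    """Lower-case a name and collapse runs of other characters into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def triples_for_access_point(element, name):
    """Return (subject, predicate, object) tuples for one access point.

    Two URIs are coined: one for the concept (the authority-file style
    entry) and one for the entity it conceptualises.
    """
    kind = ENTITY_KINDS[element]
    concept = f"{BASE}concept/{kind}/{slug(name)}"
    entity = f"{BASE}{kind}/{slug(name)}"
    return [
        (concept, RDF_TYPE, SKOS_CONCEPT),
        (concept, SKOS_PREF_LABEL, name),
        (concept, FOAF_FOCUS, entity),
    ]


for triple in triples_for_access_point("persname", "Blake, William, 1757-1827"):
    print(triple)
```

Under the second option, the function would simply omit the foaf:focus triple and the entity URI, leaving that link to be supplied later from external data.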

Question 2: Is it necessary/useful to also model the name itself as a distinct resource? I think we can manage without that, but we may revisit that point in the future.

Second, having made this choice for the <controlaccess> entities, we decided to apply it also to the case of the “origination” agent discussed above, with the “origination” relationship becoming one between a Unit of Description and a conceptualisation of an agent, rather than between a Unit of Description and the agent itself. I admit I’m still not completely sure this is necessary/useful/”the right thing to do”. The use of the <origination> element in the Hub EAD profile is described in the guidelines here. It allows for names to be presented in “the commonly used form of name”, rather than the form specified by an authority record (and indeed a survey of the data reveals a good deal of variation), so it’s a bit more difficult to argue that this corresponds directly to the name of an entry (concept) listed in an “authority file”.

Question 3: Is it necessary/useful to introduce a “conceptualisation” of the agent who “originated” the Unit of Description? For now, we’re working on the basis that it is, but we may revisit that choice.

This extended model is represented graphically in Figure 2:

Diagram showing draft data model for EAD data (2)

Figure 2

A final stage of refinement gave us a few further extensions.

First the EAD Document is introduced as a particular “encoding of” the Finding Aid.

Second, I’ve suggested that we model the Biographical or Administrative History associated with each Unit of Description as a resource in its own right, distinct from the Finding Aid as a whole. I’m not sure this is strictly necessary, and again it’s something that we may revisit in the future. But it enables us to provide information about the Biographical History as a distinct resource. One of the reasons this may be useful is that we’ve discussed (albeit somewhat vaguely at this point!) analysing/mining the text of the Biographical History as a source of further information, and having a URI for the Biographical History enables us to be explicit about the source of that data. We can also make the Biographical History the subject of triples to indicate that it is related not just to the Unit of Description but also to the entity who “originated” that unit (or, given the discussion above, to the conceptualisation of that entity). Also, we could associate it with different literal expressions (e.g. the original EAD fragment as XML Literal, but also an XHTML or plain text derivative). It also, of course, makes the Biographical History into a resource that others can refer to in their own assertions in their own data.

Third, we introduced the “level” of the Unit of Description as a distinct resource, a concept. This means that each “level” within the (relatively small) set used within the Hub data can be assigned a distinct URI, and described in its own right, and – again – referenced by others.

Fourth, similarly, the “language” of the Unit of Description is treated as a distinct resource. (The plan here is that we’ll try to simply reference resources within an existing Linked Data dataset, such as lexvo.org.)

Fifth, the EAD <dao> and <daogrp> elements are mapped into a relationship between the Unit of Description and an external digital object (or group of objects). I’ve labelled the relationship here as “is represented by” as that is the description provided by the EAD documentation, but I think Jane and Bethan felt that in practice in the Hub data, the relationship might sometimes be rather less specific than that.

For the moment, the other EAD elements corresponding to ISAD(G) elements (i.e. to textboxes in the Hub data entry form) will be treated as properties with XML Literal values (though we could follow the <bioghist> approach and generate individual URI-identified resources if that proves to be useful).

Sixth – and here we stepped slightly beyond the scope of the EAD document itself (so I’ve greyed it out in the diagram below) – we’ve added a notion of the location of the Repository and a relationship between the Repository-as-Agent and that Place. Although details of repository location aren’t included in the Hub EAD documents, Jane and Bethan said they do have that data available, and it should be fairly easy to integrate it.

So we’ve ended up with the model illustrated in Figure 3.

Diagram showing draft data model for EAD data (3)

Figure 3

Question 4: Are we missing any obvious “things” that we need to treat as resources?

Note: In this post, I haven’t gone as far as to enumerate all the properties that will be used to describe instances of each of those classes, but I’ll provide that in a future post.

Multi-level description, context, “completeness” and “inheritance”

The one remaining question – and perhaps one of the thorniest to address fully – is that arising from one of the fundamental characteristics of the nature of archival description. As noted above, archival description is typically based on a “hierarchical”, “multi-level” approach, in which, within a single finding aid, information is provided about an aggregation of records, and then about component parts of that aggregation, and so on, perhaps down to the level of providing descriptions of individual records, but often stopping short of that.

The ISAD(G) standard presents principles of moving from the general to the specific, and providing information relevant to the particular unit of description (ISAD(G) 2.2):

Provide only such information as is appropriate to the level being described. For example, do not provide detailed file content information if the unit of description is a fonds; do not provide an administrative history for an entire department if the creator of a unit of description is a division or a branch.

And of “non repetition” (ISAD(G) 2.4):

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

In some cases, it may indeed be the case that if some descriptive attribute is not explicitly provided for the unit of description, then the information provided for its “parent” unit in the hierarchy is applicable; however, this is often not the case. The elements of the ISAD(G) Identity Statement Area (or the EAD <did> child elements), for example, are specific to the unit of description and do not apply to its “child” units; and for many other descriptive elements, a simple rule of “direct inheritance” may not be appropriate. For the <controlaccess> elements, for example, a “blunt” inference rule that the named entities “associated with” a unit of description are also “associated with” every “child” unit (and so on “down the tree”) may result in associations that are simply not useful to the consumer of the data.

In a post on the Archives Hub blog, Jane emphasised the value of the “Linked Data” approach in making things mentioned in our data into “first-class citizens”. One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context”, and that the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description (perhaps even that they are in some sense “incomplete” without that “contextual” data). In contrast, the “Linked Data” approach typically involves exposing “bounded descriptions” of individual resources. Now, certainly, yes, those “bounded descriptions” include assertions of relationships with other resources (including the sort of part-whole/member-of/component-of relationships present here), and those links can be followed by consumers to obtain further information on the other resources – however, there is no requirement or expectation that consumers will do so. So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”. Rather than seeing that as an insurmountable hurdle, however, I think it provides an area that the project can usefully explore and evaluate.

(If I remember correctly) we made the decision that, for now at least, the only piece of information for which we would implement an explicit “inheritance” from a “higher-level” Unit of Description to a “lower” one (and generate additional RDF triples in the data) would be that of the repository which provides access to the material (i.e. the EAD <repository> element).
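To illustrate that one explicit “inheritance” rule, here is a minimal Python sketch that copies the repository from a Unit of Description down to any “child” units which don’t state one themselves. The `Unit` class and the `inherit_repository` function are hypothetical names invented for this example, and the sketch deliberately leaves every other property alone, reflecting the point above that a blanket inheritance rule would not be appropriate.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unit:
    """A Unit of Description, reduced to what this example needs."""
    identifier: str
    repository: Optional[str] = None  # None means "not stated at this level"
    children: List["Unit"] = field(default_factory=list)


def inherit_repository(unit, inherited=None):
    """Fill in a missing repository from the nearest ancestor that has one.

    A repository stated explicitly on a unit is never overwritten.
    """
    if unit.repository is None:
        unit.repository = inherited
    for child in unit.children:
        inherit_repository(child, unit.repository)


# Invented three-level hierarchy: only the top level names its repository.
item = Unit("item-1")
series = Unit("series-1", children=[item])
fonds = Unit("fonds-1", repository="Repository A", children=[series])

inherit_repository(fonds)
print(series.repository, item.repository)
```

In a transform, each filled-in value would become an additional triple relating the “lower-level” unit to the repository.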

Conclusion

As I said above, the model I’ve outlined here is intended as very much a first cut, not the “last word”, and something we’ll most likely revisit and refine further in the future, particularly as we see in practice what it enables us (and others) to do with the data generated, and where we might require some further tweaks to enable us to do more. For now, we feel it provides a basis for our initial work on transforming EAD data into RDF.

The next steps are:

  1. to decide on URI patterns for the URIs we will be generating (i.e. URIs for instances of the classes in the diagram above)
  2. to select terms from existing RDF vocabularies and to define any additional RDF terms required to create “descriptions” of these things based on information from the EAD document
  3. to create a transformation that implements the model (in the first instance, an XSLT transform)
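
To give a flavour of the third step, here is a small Python sketch (not the XSLT transform itself) that reads an EAD fragment and lists the <controlaccess> names that the model treats as “associated with” a Unit of Description. The sample XML is invented for illustration, and the `access_points` function and the set of element names it looks for are assumptions made for this example rather than the Hub’s actual profile.

```python
import xml.etree.ElementTree as ET

# Invented EAD fragment, for illustration only.
SAMPLE = """
<c level="file">
  <did><unittitle>Letters</unittitle></did>
  <controlaccess>
    <persname>Blake, William</persname>
    <corpname>Example Society</corpname>
    <subject>Poetry</subject>
  </controlaccess>
</c>
"""

# Element names treated as access points in this sketch.
ACCESS_POINT_ELEMENTS = {
    "subject", "persname", "famname", "corpname",
    "geogname", "title", "genreform", "function",
}


def access_points(xml_text):
    """Return (element name, name text) pairs found under <controlaccess>."""
    root = ET.fromstring(xml_text)
    found = []
    for group in root.iter("controlaccess"):
        for child in group:
            if child.tag in ACCESS_POINT_ELEMENTS:
                found.append((child.tag, (child.text or "").strip()))
    return found


for element, name in access_points(SAMPLE):
    print(element, "->", name)
```

Each pair would then be turned into a concept description and an “associated with” relationship from the Unit of Description.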

I’ve already done some work on all of these, and I’ll write about them in a separate post here – which hopefully will be rather shorter than this one and will take me rather less time to write!