Transforming EAD XML into RDF/XML using XSLT

This is a (brief!) second post revisiting my “process” diagram from an early post. Here I’ll focus on the “transform” process on the left of the diagram:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

The “transform” process is currently performed using XSLT to read an EAD XML document and output RDF/XML, and the current version of the stylesheet is now available:

(The data currently available via http://data.archiveshub.ac.uk/ was actually generated using the previous version http://data.archiveshub.ac.uk/xslt/20110502/ead2rdf.xsl. The 20110630 version includes a few tweaks and bug fixes which will be reflected when we reload the data, hopefully within the next week.)

As I’ve noted previously, we initially focused our efforts on processing the set of EAD documents held by the Archives Hub, and on the particular set of markup conventions recommended by the Hub for data contributors – what I sometimes referred to as the Archives Hub EAD “profile” – though in practice, the actual dataset we’ve worked with encompasses a good degree of variation. But it remains the case that the transform is really designed to handle the set of EAD XML documents within that particular dataset rather than EAD in general. (I admit that it also remains somewhat “untidy” – the date handling is particularly messy! And parts of it were developed in a rather ad hoc fashion as I amended things as I encountered new variations in new batches of data. I should try to spend some time cleaning it up before the end of the project.)

Over the last few months, I’ve also been working on another JISC-funded project, SALDA, with Karen Watson and Chris Keene of the University of Sussex Library, focusing on making available their catalogue data for the Mass Observation Archive as Linked Data.

I wrote a post over on the SALDA blog on how I’d gone about applying and adapting the transform we developed in LOCAH for use with the SALDA data. That work has prompted me to think a bit more about the different facets of the data and how they are reflected in aspects of the transform process:

  • aspects which are generic/common to all EAD documents
  • aspects which are common to some quite large subset of EAD documents (like the Archives Hub dataset, with its (more or less) common set of conventions)
  • aspects which are “generic” in some way, but require some sort of “local” parameterisation – here, I’m thinking of the sort of “name/keyword lookup” techniques I describe in the SALDA post: the technique is broadly usable but the “lookup tables” used would vary from one dataset to another
  • aspects which reflect very specific, “local” characteristics of the data – e.g., some of the SALDA processing is based on testing for text patterns/structures which are very particular to the Mass Observation catalogue data

What I’d like to do (but haven’t done yet) is to reorganise the transform to try to make it a little more “modular” and to separate the “general”/”generic” from the “local”/”specific”, so that it might be easier for other users to “plug in” components more suitable for their own data.

Lifting the Lid on Linked Data at ELAG 2011

Myself and Jane have just given our ‘Lifting the Lid on Linked Data‘ presentation at the ELAG European Library Automation Group Conference 2011 in Prague today. It seemed to go pretty well. There were a few comments about the licensing situation for the Copac data on the #elag2011 twitter stream, which is something we’re still working on.

[slideshare id=8082967&doc=elag2011-locah-110524105057-phpapp02]

Two changes to the model and some definitions

Over the last few weeks we’ve been testing our initial cut at an EAD-to-RDF transform against a range of data that extends beyond EAD documents prepared using the Hub data entry template to documents created using other tools – and varying somewhat in terms of the markup conventions used.

In the course of that, I’ve been pondering some of the choices we made in the model I described here and here, and we decided to make a couple of changes (one very minor and the second still relatively so, I think):

  • Archival Resource: We’ve changed the name of the class we were calling “Unit of Description” to “Archival Resource”. I think “Unit of Description” was problematic for two reasons. First, it was ambiguous, because it could be interpreted either as the unit (of archival material) being described (which is what was intended) or as a unit/part of the archival description (which is not what was intended). Second, I adopted it from the ISAD(G) standard, where the context is one in which the archival resources are considered to be the primary things being described. I’m less sure the label works in the “linked data” context where we’re providing statements, and sets of statements (descriptions), “about” not just the archival materials, but many other things. In this context, everything that is described (people, concepts, places, etc) might be seen, in some sense, a “unit of description”, and so using that label for one subset of them seems inappropriate. That left us with finding a suitable alternative, a generic term that covers archival material in general, at any level of description (fonds, collection, item etc), and “archival resource” seemed like a reasonable fit.
  • Origination as Concept: When I first sketched out the model, I raised some questions, including (as “question 3” in that post) whether it was useful/necessary to model the origination of the archival resource as a pair of concept and agent, following the pattern used for the <controlaccess> terms. Having experimented with that approach, we’ve decided it introduces unnecessary complexity and we’ve fallen back on treating <origination> as a simple relation between archival resource and agent. The use of concept and agent is retained for the <controlaccess> case, where names are typically drawn from an “authority file”, as it allows us to maintain the distinction between a conceptualisation of the agent (as reflected by the authority record/entry) and the agent itself (a distinction which is also made in the model underpinning datasets such as VIAF, which we will be making links to).

The revised model is summarised in the following diagram (an amended version of Figure 3 from the earlier post):

Amended data model for EAD

Amended data model for EAD

i.e. an Archival Resource and a Biographical History are now related directly to an Agent.

Below is a draft list of human-readable definitions for the classes in the model. Some are simply references to classes provided by existing vocabularies like Dublin Core, FOAF, event vocabularies:

Finding Aid
A document describing an archival resource.
Subclass of: bibo:Document, foaf:Document
EAD
A document conforming to the Encoded Archival Description standard.
Subclass of: bibo:Document, foaf:Document
Biographical History
A narrative or chronology that places the archival materials in context by providing information about their creator(s). A finding aid may contain several such narratives or chronologies pertaining to different archival materials and their creators.
Subclass of: bibo:DocumentPart, (bibo:Document), foaf:Document
Repository
An institution or agency responsible for providing access to archival materials.
Subclass of: foaf:Organization, (foaf:Agent), dcterms:Agent
Place
= wgs84_pos:SpatialThing
Postcode Unit
= ospc:PostcodeUnit
Archival Resource
Recorded information in any form or medium, created or received and maintained, by an organization or person(s) in the transaction of business or the conduct of affairs, and maintained for its long-term research value. An archival resource may be an individual item, such as a letter or photograph, or (more commonly) some aggregation of such items managed and described as a unit.
Level
An indicator of the part of an archival collection constituted by an archival resource, whether it is the whole collection or a sub-section of it.
Subclass of: skos:Concept
Language
= lvont:Language
Extent
The size of an archival resource.
Subclass of: dcterms:SizeOrDuration
Temporal Entity
= time:TemporalEntity
Creation
An event that resulted in the creation or accumulation of an archival resource.
Subclass of: event:Event, lode:Event
Concept
= skos:Concept
Concept Scheme
= skos:ConceptScheme
Agent
= foaf:Agent, dcterms:Agent
Person
= foaf:Person, (foaf:Agent), dcterms:Agent
Family
A group of people affiliated by consanguinity, affinity, or co-residence.
Subclass of: foaf:Group, (foaf:Agent), dcterms:Agent
Organisation
= foaf:Organization, (foaf:Agent), dcterms:Agent
Genre or Form
A category of archival material, defined either by style or technique of intellectual content, order of information or object function, or physical characteristics.
Subclass of: skos:Concept
Function
A sphere of activity or process.
Subclass of: skos:Concept
Birth
= bio:Birth, (bio:IndividualEvent), (bio:Event),
(event:Event), (lode:Event)
Death
= bio:Death, (bio:IndividualEvent), (bio:Event),
(event:Event), (lode:Event)
Object
= foaf:Document, bibo:Document
Book
= bibo:Book, (bibo:Document), (foaf:Document)

Assessing Linked Data

For me, the journey from an understanding of modelling data, and creating our own models for the Hub and Copac, to being able to understand the processes and decisions involved in creating XML RDF has been challenging. It has raised one question that often applies when dealing with something quite technical: how much should a manager (in my case an archivist managing an online archive service) be expected to understand the ‘technical’ aspects of something?  This is a question I have spoken about and written about before; mainly in terms of what archivists (or other information professionals) in the digital age need to know in order to understand the implications of choices around things like data structure and software systems.  In the case of Linked Data, I am still not sure how much I need to know about the detail of Linked Data, the RDF model, the use of RDF XML, the benefits of other output fomats, the application of stylesheets, etc. I have been thinking about how hard it is to create Linked Data – I have had a few enquiries from colleagues who are interested in doing the same sort of thing already and I want to be able to offer useful advice.

One thing that occurs to me is that it is reasonable to acknowledge that Linked Data does involves programming skills, and therefore it is not so dissimilar from structuring and outputting your data through a traditional relational database, for example, where you would expect that specialist skills are needed. But in either case there is the the same need for a manager to understand what the system offers and be able to offer the best service to researchers. I think what is important from the point of view of the manager is to be involved with the decision-making process and understand the implications of Linked Data; you need to know what you are saying about your own data. I am not sure that this requires a thorough knowledge of RDF and certainly I only have a rudimentary knowledge of stylesheets (and no knowledge of programming).

Service managers or administrators are not expected to understand systems from an in-depth technical point of view. But in fact, I think that one of the advantages of RDF model is that it is easier to get a sense of what is going on in terms of data processing than typically occurs with a database management system. After about six months of learning about Linked Data and RDF (I estimate that this translates into about one month of fairly intense learning), I can look at the stylesheet that we have for the Archives Hub, to transform the EAD data into RDF XML, and I can look at an RDF document representing the entities that we are describing, and I have a reasonably good overall sense of what it all means, which helps me with my main role: to understand the outputs and the potential benefits of Linked Data. For Locah, we’ve used XSLT, but there is no requirement for this, and maybe one of the challenges of outputting Linked Data is that there are a number of options in terms of translating your RDF model into an output.

There is no doubt that choices made now about how we model the data will have implications for what users can do with it, and some choices may limit future potential more than others. For example, which ‘things’ do we choose to represent? Should we have a conceptualisation of a person as an entity represented in the description, and to link this to a conceptualisation of the person? Which information should we provide as URIs and which as literals? I’m only gradually coming to understand the implications of these decisions, as we start to explore the potential of the data. Of course, this is always true, whatever data, structures and systems we are working with. This brings me to another point that I think is probably particularly relevant for Locah: we are doing this work at a time when we are very much early adopters. Whilst the classic Linked Data diagram may give the impression that the world has embraced Linked Data, the reality is that it is still very much at a hand-crafted level: we have not had tools available to us to aid us in this work, and in the case of EAD, there has been very little activity up till now. It is therefore difficult to judge how feasible it might be to output RDF in the future, as it is likely that more tools will be developed, and there will be greater awareness and skills built up around the whole Semantic Web. However, I wonder if we are currently still at that difficult point where we need to build the momentum of the Linked Data movement, but it is still very unfamiliar and poorly understood by many data providers?

Many Linked Data evangelists claim that Linked Data is ‘easy’. I’m not sure that it is necessarily easy, and I don’t think that it’s very helpful to say that it is easy. Easy compared to what? Easy for whom? It’s easy if you know how, if you have the requisite skills and experience, but we need to persuade people who don’t yet know how that it is worth doing, and provide a realistic assessment of the skills that are required. I suppose the question of how easy it is does rest in large part on the data you are working with as well. Archival finding aids are quite challenging. As Mark Matienzo, archivist at Yale University, states in his presentation on Linked Data and Archival Description: “Archival description is inherently multi-level and relational” and “EAD is both too flexible and too unforgiving” to be Linked Data friendly…and database-friendly for that matter. Also, ISAD(G) recommends the non-repetition of information and archival description generally contains implicit information. I suppose Linked Data might help provide the opportunity and impetus to move towards a more Web-friendly way of describing archives, if it does become more widely used.

At present, I can’t help thinking that if archive repositories and libraries would like to output their data as Linked Data, many of them will struggle, and I would have thought it might be similar for other types of data providers. I do think that expertise is required, and time needs to be invested in understanding some key aspects of Linked Data. On the other hand, this is the case whenever you are looking at creating effective means to output structured (but often inconsistent) data. However, I think that it makes good sense for the Archives Hub and Copac to do this work, as it is on behalf of our contributors, so it effectively will allow these repositories and libraries to output Linked Data.  In other words, it may be that for Linked Data to really take hold, it will benefit from this kind of aggregated set-up, where skills and resources can be pooled. At present, I’m inclined to think that it is worth the investment of time and resources by our Locah team because it is benefitting a large number of data providers. I think it will be important for us to convey to our contributors, and indeed to other archivists and librarians, what we are doing and why, what the implications are and what the benefits may be. I have already had contact with two people, one representing another aggregation of content, interested in benefitting from our work. This is really important, because it potentially makes the investment more worthwhile.

We are in a fortunate position with the Locah project because we are part of a JISC-funded innovations project, with a team of people with a variety of skills, and we have support from Talis, who have significant experience of Linked Data.  If we can work on behalf of our community, then I feel that the time invested may be worthwhile. For the second half of our year-long project we will want to explore the benefits more thoroughly – we will be looking at the crucial issues of creating links to other data, which is really Linked Data’s key selling point, and we will be developing a prototype to show some potential benefits for researchers.

(With thanks to Pete and Ade for their contributions to this blog post).

Identifying the “things”: URI Patterns for the Hub Linked Data

In my previous couple of posts, I outlined the model of the “world” on which we’re basing the RDF data we’re generating from the Archives Hub‘s EAD XML documents.

At the heart of the Linked Data approach is the principle that all the “things” we want to “say anything about” should be named using a URI, and that those URIs should use the http URI scheme, so that they can be easily “looked up” or “dereferenced” using Web technologies in order to obtain some information provided by the URI owner about the thing. So, having specified the types or classes of thing we want to refer to and describe, the next step is to decide on the structure of the http URIs that we’ll use to name the “instances” of those classes – the individual “things” – archival resources, repositories, concepts, persons, places, and so on. In this post, I’ll try to describe the patterns we’re using, and outline how we construct individual URIs using those patterns from the EAD input data. As I hope will become clearer, the nature of the input data conditions the form of the patterns we’ve chosen. This has turned into a rather long post (again!) but I hope the detail is useful – I think it’s important for us to try to document our processes and some of the issues we’ve grappled with as well as to present the conclusions.

In some (most) cases, these will be newly created URIs, under a domain that we (well, MIMAS and the Archives Hub service) own. For these URIs, the project is responsible for choosing the URIs and putting in place the mechanisms to ensure that their dereferencing results in the provision of some “useful information”. In other cases, we will simply be citing existing URIs, defined by other agencies who (hopefully!) provide for their dereferencing.

The UK Cabinet Office has recently published some general guidelines on URI patterns for government Linked Data, Designing URI Sets for the UK Public Sector, and within the JISC programme strand under which LOCAH is funded, projects are encouraged to follow the recommendations of those guidelines. Following these guidelines, the general URI pattern recommended to identify “things” is:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a class (resource type), like “person”
  • reference is a name for an individual instance of that class or type

Our RDF data is being generated, at least in the first instance, by processing EAD XML documents, so we want to construct our URIs for our “things” from content within those XML documents. And we want to do so in a way that, as far as possible, ensures that each of those URIs is an unambiguous name/referrer, i.e. it identifies a single “thing”, and we don’t end up with a single URI being used for what are in fact two different things. On the other hand, we can live with the case where we end up with multiple URIs, all of which identify a single thing, because information can be added at a later stage to indicate that they are synonyms.

The other point to note is that the initial transformation step is being performed on a “document-by-document basis”, i.e. taking a single EAD document as input and outputting RDF/XML. So for any given resource, the information we generate – including the URI of the resource – is based only on the content of that document (and any generally applicable information we can embed in the transform itself). There may be other data “about” that “thing” in another EAD document but we don’t have access to it at the time of transformation.

Also, it’s desirable that we construct our URIs in such a way that if we need to re-run the transform, we generate the same URIs from the same input data (unless we explicitly decide to change the patterns for some reason).

Finally, although the patterns below often make use of human-readable strings from the EAD document content, I haven’t treated human-readability as a major consideration. Having said that, I’ve tended to make use of (slightly normalised forms of) human-readable strings where possible, rather than, say, creating opaque “hashes”.

As with other aspects of the work, at this stage, this is a first cut at tackling the issue, and we may revise our approaches based on the experience of applying them over the dataset. Having gone through and constructed patterns for the various resource types, looking back over them now, I think I can see a small number of distinct methods that we’ve used:

  1. Identifiers: For some of these “things”, the EAD documents contain some sort of formally assigned identification code or number, which unambiguously – at least within the scope of the Hub collection – identifies that instance within the set of resources of that type (i.e. it serves as a “reference” in the terms of the Designing URI Sets… document). This is the case, for example, with the languages of the materials, using the did/langmaterial/language/@langcode attribute value. A variant of this is the case where such an identifier can be constructed from a combination of multiple pieces of content. Repositories, for example, can be identified by the pair of country code (ead/eadheader/eadid/@countrycode) and maintenance agency code (ead/eadheader/eadid/@mainagencycode). For these cases a combination of the name of the resource type and that identification code provides the basis for the “reference” part of the URI.
  2. “Authority-Controlled” Names: For many of the “things”, however, the EAD documents do not contain such a code; rather, they refer to things only by name. In some cases, the form of the name is drawn from an “authority file” – indicated in the EAD document – and the name includes sufficient information (e.g. birth/death dates, titles etc for a person) to make the resulting string an unambiguous referrer within the set of names from that source. For these cases, a combination of a name for the authority file and the name provides the basis for the “reference”. However, this does depend on the creator of the EAD document having accurately transcribed the “authoritative” form of the name, at least sufficiently to maintain unambiguity of reference.
  3. “Rule-Based” Names: In other cases, the “thing” is named, not using a name from a controlled list, but rather a name constructed according to some codified set of rules, where the rules used are indicated in the EAD document. The intent behind such rules is to try to ensure consistency of form and unambiguity of reference. The National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names (one of the rule sets recommended to Hub data creators) states “A personal name is constructed by combining mandatory and optional components of the name so that the person concerned can be identified with certainty and distinguished from others bearing similar names. An individual should have only one authorised form of name and each name should apply to only one individual.”Typically, as for the “authority file” case, this is achieved through the inclusion of dates, titles etc for persons. For these cases, a combination of a name for the rules and the name itself should provide the basis for the “reference”. However, in practice, the picture with the Hub data is somewhat more complex. First, in some cases where it is claimed that rules are followed, the content itself indicates that this is not the case. For example, the NCA Rules mandate that a personal name should include “the year in which a person was born or died, the span of years of his/her lifetime or the approximate period covered by his/her activities”, even if those dates are estimated. But there are cases in the data marked up as following the NCA Rules which do not meet this requirement – e.g. personal names providing only surname and forename with no dates – , which I suspect may result in ambiguous references. Second, even where the rule is followed and the mandatory components are present, the distributed nature of Hub data creation means that I suspect there is still some possibility that a single personal name may be used in two different sources to refer to what in fact are two different people (Consider e.g. the case of two data providers using the name “Smith, John, fl 1920-1950”).
  4. “Locally-Scoped” Names: In other cases, the form of the name is neither authority-controlled nor rule-based, but nevertheless there is some expectation that the form of the name used is sufficient to make it an unambiguous referrer within some context. This is the case, for example, with the content of the did/origination element. The difficulty, however, is in establishing reliably what that context is. What is that “local scope”? We’ve tentatively taken the approach that such names have been constructed in such a way as at least to be unambiguous within the collection of submissions to the Hub by a single repository. So by combining the repository identifier and the name, hopefully, we can arrive at a “reference” which avoids ambiguity. Again, it may turn out that this assumption is unreliable, and results in ambiguous references, so we may need to revisit this approach.
  5. “Identifier Inheritance”: (I’m sure there must be a formal term for this but I’m not sure what it is!) In these cases the EAD document does not provide an unambiguous name for the “thing” itself; however the “thing” has a simple relationship with some other “thing” for which identification fits into one of the other categories. Where the relationship is one-to-one, a URI can be constructed by adopting the pattern for that other “thing” and substituting the name of the resource type. An example of this is the case of the “biographical history” associated with a “unit of description”. The unit of description has an identifier (based on a pattern described below) and since – in data constructed using the Hub template – each unit has at most one biographical history, replacing the “unit” resource type name with a “bioghist” resource type name gives us a suitable URI path, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URI for the biographical history would contain “/bioghist/gb15abc”.A variant of this is the case where the relationship is many-to-one, rather than one-to-one. Here the approach needs to be extended to include e.g. a sequence number to distinguish the multiple “things”. This is the approach taken for the Unit of Description, where a “child” (“part”) unit of description uses the URI of the “parent” (“whole”) unit suffixed with a sequence number, e.g. for a unit of description for which the URI path contains “/unit/gb15abc”, the URIs for the “child” units would contain “/unit/gb15abc-1”, “/unit/gb15abc-2” and so on. In theory, this should not be necessary as the unitid for a unit should be unique within an EAD document, but in practice we’ve found that this is not the case in the actual data. (In this case, the identifier would be “reproducable” only if any new units are inserted at the end of a sequence rather than in the middle).
  6. So, with the caveat above that this is all somewhat tentative at this stage, I summarise below the approaches taken to generating URIs for instances of each of the classes in the Hub model. Note that sometimes, an instance of the same class is generated in different “contexts” within the EAD document, and in these cases different rules for URI construction may be applied in those different contexts, depending on the information available within the EAD document.

    We haven’t yet finalised the domain name we’ll be using, so for the purposes of the following, {root} represents the domain and the first part of the path. Italicised text is used for the URI patterns (or parts of them); bold text is used for XPath(-ish!) representations of the source of data within the EAD XML document.

    Finding Aid

    Pattern(s)

    {root}/id/findingaid/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example:

    {root}/id/findingaid/gb15sirernesthenryshackleton

    EAD document

    Pattern(s)

    {root}/id/EAD/{eadid}

    eadid
    normalised form of ead/eadheader/eadid

    Example(s)

    {root}/id/ead/gb15sirernesthenryshackleton

    Repository (Agent)

    Pattern(s)

    {root}/id/repository/{repositoryid}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/repository/gb15

    Repository (Place)

    Pattern(s)

    {root}/id/place/{repositoryid}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/place/gb15

    Unit of Description

    Pattern(s)

    {root}/id/unit/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Note: In principle, it should be possible to use c/unitid content rather than position in tree, but in practice, there are cases where unitid content is not unique within the EAD document.

    Example(s)

    {root}/id/unit/gb15sirernesthenryshackleton

    {root}/id/unit/gb15sirernesthenryshackleton-1

    Level

    Pattern(s)

    {root}/id/level/{level-name}

    level-name
    archdesc/@level or archdesc/@otherlevel or c{n}/@level or c{n}/@otherlevel

    Example(s)

    {root}/id/level/fonds

    Language

    Pattern(s)

    http://lexvo.org/id/iso639-3/{langcode}

    Note: use existing lexvo.org URIs for languages.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://lexvo.org/id/iso639-3/eng

    Creation (Event)

    Pattern(s)

    {root}/id/creation/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creation/gb15sirernesthenryshackleton

    Creation (Time)

    Pattern(s)

    {root}/id/creationtime/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/creationtime/gb15sirernesthenryshackleton

    Extent

    Pattern(s)

    {root}/id/extent/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/extent/gb15sirernesthenryshackleton

    Biographical History

    Pattern(s)

    {root}/id/bioghist/{unitid}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree

    Example(s)

    {root}/id/bioghist/gb15sirernesthenryshackleton

    Concept (Origination)

    Pattern(s)

    {root}/id/concept/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/concept/agent/gb15/sirernesthenryshackleton

    Agent (Origination)

    Pattern(s)

    {root}/id/agent/{repositoryid}/{origination-name}

    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode

    Example(s)

    {root}/id/agent/gb15/sirernesthenryshackleton

    Concept (ControlAccess – Subject)

    Pattern(s)

    {root}/id/concept/{source}/{subject-name}

    {root}/id/concept/{repositoryid}/{subject-name}

    source
    controlaccess/subject/@source
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    subject-name
    normalised form of controlaccess/subject

    Example(s)

    {root}/id/concept/lcsh/antiquities

    Concept (ControlAccess – Persname)

    Pattern(s)

    {root}/id/concept/person/{source}/{person-name}

    {root}/id/concept/person/{rules}/{person-name}

    {root}/id/concept/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/concept/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/concept/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/concept/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Person (ControlAccess – Persname)

    Pattern(s)

    {root}/id/person/{source}/{person-name}

    {root}/id/person/{rules}/{person-name}

    {root}/id/person/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/person/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/person/ncarules/holdenwendyfl1990cartoonist

    {root}/id/person/gb1832/berlinisaiah1909-1997sirknighthistorian

    Concept (ControlAccess – Famname)

    Pattern(s)

    {root}/id/concept/family/{source}/{family-name}

    {root}/id/concept/family/{rules}/{family-name}

    {root}/id/concept/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/concept/family/nra/dundasviscountsmelvilledunira

    {root}/id/concept/family/ncarules/boucicault

    Family (ControlAccess – Famname)

    Pattern(s)

    {root}/id/family/{source}/{family-name}

    {root}/id/family/{rules}/{family-name}

    {root}/id/family/{repositoryid}/{family-name}

    source
    controlaccess/famname/@source
    rules
    controlaccess/famname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    family-name
    normalised form of controlaccess/famname/

    Example(s)

    {root}/id/family/nra/dundasviscountsmelvilledunira

    {root}/id/family/ncarules/boucicault

    Concept (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/concept/organisation/{source}/{org-name}

    {root}/id/concept/organisation/{rules}/{org-name}

    {root}/id/concept/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/concept/organisation/nra/britishbroadcastingcorporation

    {root}/id/concept/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/concept/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Organisation (ControlAccess – Corpname)

    Pattern(s)

    {root}/id/organisation/{source}/{org-name}

    {root}/id/organisation/{rules}/{org-name}

    {root}/id/organisation/{repositoryid}/{org-name}

    source
    controlaccess/corpname/@source
    rules
    controlaccess/corpname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    org-name
    normalised form of controlaccess/corpname/

    Example(s)

    {root}/id/organisation/nra/britishbroadcastingcorporation

    {root}/id/organisation/aacr2/dailymail%28london%2Cengland%29

    {root}/id/organisation/gb1578/vizards%2Csolicitors%2Cmonmouth

    Concept (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/concept/place/{source}/{place-name}

    {root}/id/concept/place/{rules}/{place-name}

    {root}/id/concept/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/concept/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/concept/place/ncarules/canada

    {root}/id/concept/place/gb982/meirionethshire%28wales%29

    Place (ControlAccess – Geogname)

    Pattern(s)

    {root}/id/place/{source}/{place-name}

    {root}/id/place/{rules}/{place-name}

    {root}/id/place/{repositoryid}/{place-name}

    source
    controlaccess/geogname/@source
    rules
    controlaccess/geogname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    place-name
    normalised form of controlaccess/geogname/

    Example(s)

    {root}/id/place/lcsh/mcmurdosound%28antarctica%29

    {root}/id/place/ncarules/canada

    {root}/id/place/gb982/meirionethshire%28wales%29

    Concept (ControlAccess – GenreForm)

    Pattern(s)

    {root}/id/concept/{source}/{genreform-name}

    {root}/id/concept/{rules}/{genreform-name}

    {root}/id/concept/{repositoryid}/{genreform-name}

    source
    controlaccess/genreform/@source
    rules
    controlaccess/genreform/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    genreform-name
    normalised form of controlaccess/genreform

    Example(s)

    {root}/id/concept/aat/buildingplans

    Concept (ControlAccess – Function)

    Pattern(s)

    {root}/id/concept/{source}/{function-name}

    {root}/id/concept/{rules}/{function-name}

    {root}/id/concept/{repositoryid}/{function-name}

    source
    controlaccess/function/@source
    rules
    controlaccess/function/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    function-name
    normalised form of controlaccess/function

    Example(s)

    {root}/id/concept/agift/miningregulations

    Book

    Pattern(s)

    {root}/id/document/{title}

    source
    controlaccess/title/@source
    rules
    controlaccess/title/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    title
    normalised form of controlaccess/title

    Example(s)

    {root}/id/document/aacr2/thecastlediaries1974-761980

    Birth (Event)

    Pattern(s)

    {root}/id/birth/{source}/{person-name}

    {root}/id/birth/{rules}/{person-name}

    {root}/id/birth/{repositoryid}/{person-name}

    source
    controlaccess/persname/@source
    rules
    controlaccess/persname/@rules
    repositoryid
    normalised form of concatentation of ead/eadheader/eadid/@countrycode and ead/eadheader/eadid/@mainagencycode
    person-name
    normalised form of controlaccess/persname/

    Example(s)

    {root}/id/birth/nra/shackletonernesthenry1874-1922sirknightexplorer

    {root}/id/birth/ncarules/allenjim1926-1999playwright

    {root}/id/birth/gb1832/berlinisaiah1909-1997sirknighthistorian

    Object

    Pattern(s)

    {object-uri}

    object-uri
    dao/@href or daogrp/daoloc/@href

    Example(s)

    http://library.kent.ac.uk/library/special/html/specoll/jack.gif

    Object Group

    Pattern(s)

    {root}/id/group/{unitid}-{groupno}

    unitid
    normalised form of archdesc/did/unitid and position within archdesc/dsc/c tree
    groupno
    position within daogrp sequence for archdesc or c{n}

    Example(s)

    {root}/id/group/gb0254ms274-1

    Time Interval (Year, Month, Day)

    i.e. specific intervals of time.

    Pattern(s)

    http://reference.data.gov.uk/id/year/{yyyy}

    http://reference.data.gov.uk/id/month/{yyyy}-{mm}

    http://reference.data.gov.uk/id/day/{yyyy}-{mm}-{dd}

    Note: use existing reference.data.gov.uk URIs for intervals.

    langcode
    did/langmaterial/language/@langcode

    Example(s)

    http://reference.data.gov.uk/id/year/1921

    http://reference.data.gov.uk/id/month/1921-06

    http://reference.data.gov.uk/id/day/1921-06-03

Some more “things”: some extensions to the Hub model

Having had a little more time to experiment with the Archives Hub EAD data, and to think about what sort of operations on the RDF data we might wish to perform or enable others to perform, I’ve introduced a few small extensions to the model I described a couple a few weeks ago.

Extents

At our last project meeting, we talked about some of the possibilities for visualisations of the data. One of the ideas (suggested by Jane) is to explore representing relative sizes of collections, perhaps on a map, so that, for example, a researcher could provide a geographic location and a subject area and get a visual representation of the relative sizes of collections within that area.

The EAD XML format provides an element called <extent> for “information about the quantity of the materials being described or an expression of the physical space they occupy”. Although the EAD Tag Library provides guidelines to try to encourage some uniformity of the content, the data in the Hub EAD documents is quite variable. Examples of the content in the samples I’ve looked at include:

  • 6.5 linear metres
  • 2.04 metres
  • 0.48m
  • 190 archive boxes
  • 13 boxes
  • One sheet of paper
  • 13 lever arch files, 48 sound tape reels, 490 audio cassette tapes (1 filing cabinet)

In the initial model, this was just treated in RDF as a single triple with subject the URI of the unit of description (an archival collection or some part of it) and this string a literal object. I’m suggesting changing this to treat the “extent” as a resource with its own URI, rather than simply as a literal. Doing that enables us – for at least some of these cases – to make explicit that it is a value measured in some “unit” (linear metres, archival boxes), to “normalise” the way those units are represented (so e.g. “linear metres”, “metres” and “m” can be mapped to a single form in the RDF data), and possibly to make comparisons, albeit approximate ones, between extents measured in different units (for example, “archival boxes” and “linear metres”).

So we end up with patterns in the RDF graph like:

unit:123 dcterms:extent extent:123 .

extent:123 ex:metres “2.04”^^xsd:decimal .

Having said that, I recognise that the nature of the input data is such that such techniques are usefully applicable only to a subset of the data; I’m not sure there’s a great deal we can do with “composite” strings like the last one in the list above, other than present them to a human reader.

Events and Times

One of the other ideas for presenting data we’ve chewed around is that of some sort of “timeline” view. It’s something I’ve been quite keen to explore – though I’m conscious that the much of the most useful information is, in the EAD documents, in the form only of prose in the “biographical/administrative histories” provided for the originators of the archives.

As a first tentative step in this direction, I’ve introduced a notion of “event” into the model, where, in the first instance:

  • the Creation of a unit of description is modelled as an event taking place during a period of time
  • (where birth/death dates are provided in the input) the Birth and Death of a person are modelled as events taking place during a period of time

It’s possible to generate this just from simple processing of the input data. It may be possible to go further and generate a richer range of “events” through the use of some flavour of intelligent text analysis/”entity extraction” tools on the biographical/administrative history text, but that’s something for us to consider in the future.

Postcodes

Finally – and as I noted in the previous post this is something which goes beyond the content of the EAD documents themselves – prompted mainly by the recent announcement by John Goodwin that the Ordnance Survey had extended their linked data dataset to include “post code units”, I’ve added in a notion of “Postcode Unit” so that we can make links to resources from that dataset (and also to the UK Postcodes dataset).

So the revised model looks something like the figure below:

Diagram showing data model for EAD data

Figure 1

So, I’m hoping that – bug fixes aside – I can stop tinkering with this for a while 🙂 and that we can work with this version of the model, and test out what is possible and where any “pain points” are, and then think about where further changes might be useful.

The “things” in EAD: a first cut at a model

As mentioned by Jane in a couple of previous posts, she, Bethan and I met up in Manchester in August to share our thoughts about how to model the Archives Hub EAD data in a form that can be represented in RDF.

RDF in a nutshell

For the purposes of this discussion, the main point to bear in mind is that the “grammatical principle” underpinning RDF is one of making simple three-part statements, each of which makes an assertion of a relationship (of some particular type) between two things. So for example, in RDF I can “say” things like:

Document 123 has-title “Arthur and George”

or

Document 123 is-authored-by Person P
Person P has-name “Julian Barnes”

When considering how to represent EAD data in RDF, then, the first step is to try to take a step back from the “nitty-gritty” of the EAD XML markup, and think about the three part statements we might construct to represent the “information content” of that document. We need to think in terms, not of XML documents and elements and attributes and nesting/containment, but rather of what an EAD document is “saying” about “things in the world” (perhaps more accurately, in the “world” as conceptualised by the creator of the archival finding aid, shaped by archival description practices in general) and what sort of questions we want to answer about those “things”. What are the “things” – and here I use the term in a general sense to include concepts and abstractions as well as material objects – that an EAD document provides information about? What are the relationships between these things? What else does an EAD document say about those things?

Note: The discussion here does not cover the “document”/”description” side of the “Linked Data” picture i.e. for each “thing”, we’ll be providing a “description” of that “thing” in the form of a “document”. Metadata describing that “document” will be important in providing information about provenance and currency, for example, but that is not discussed here.

EAD as used by the Archives Hub

The EAD XML format was designed to cope with the “encoding” of a wide range of archival finding aids, including those constructed according to the (slightly different) cataloguing practices and traditions of different communities.

Further, many features of the EAD format are optional: one can construct a valid EAD document using only a fairly minimal level of markup, or one can use more detailed markup to represent more information.

This flexibility can be something of a “double-edged sword”: on the one hand, it enables data creators to accommodate a wide range of data, and it provides choice in the level of detail of markup (and human resources in creating that markup!) to be applied; on the other hand, it can make working with EAD data quite complex for a consumer, particularly when processing data from a range of sources which perhaps use a range of different conventions and features of the language.

In part to address this sort of issue (as well as to make things simpler for data providers by insulating them from the detail of EAD markup), the Archives Hub provides a forms-based EAD editor, based primarily on the information categories enumerated by the ISAD(G) archival description standard, which generates EAD documents following a consistent set of markup conventions. (I sometimes think of this as a “profile” of EAD, a narrower set of constraints than that imposed by the EAD DTD/schema itself, but I’m not sure that sort of terminology is in widespread use in this context.)

So, we made the “pragmatic” decision to work, in the first instance at least, on the basis of this particular set of EAD markup conventions, rather than trying to address the full EAD format, which means we can limit the number of variants we need to deal with. Having said that, even for the case of data created using the Hub editor, an element of variation is present, because although the data entry form generates a common high-level structure, data creators can apply different markup within those high-level structural components. In this first cut at a model, we have focused on analysing those common structural elements, with the intention of extending and refining our approach at a later stage.

In the course of this (or in thinking about it afterwards) we’ve come up with a few questions, which I’ll try to highlight in the course of the discussion below. Any feed back on these points (or indeed on any other aspect of the post!) would be very welcome.

The “world” as seen by EAD

Jane and I had both done some doodling before our meeting, and we started out by walking through our ideas, highlighting both those aspects which seemed pretty clear and uncontroversial, and aspects where we were uncertain or several alternatives seemed possible (and reasonable). Although we were using slightly different terminology, I think we had come up with quite similar notions, and after a bit of discussion, we arrived at a first cut at a “core” model which I’m representing graphically in Figure 1 below. This isn’t intended as a formal UML or E-R diagram, but each box represents a type of “thing” (a class) and each arrow represents a type of relationship between individual things (“instances” of those classes):

Diagram showing draft data model for EAD data (1)

Figure 1

So the “core” types of things identified in this first stage were:

  • Unit of Description: these are the “units” of archival material, a document or set of documents, the actual stuff held in the repository and described by the finding aid. It’s a “generic” class to reflect the archival description principle of “multi-level description”. An archival finding aid typically has a “hierarchical” structure, in which one “unit of description” is (described as logically forming) “part of” another “unit of description”. A finding aid may provide a only a “collection-level” description of a collection which contains many thousands of individual records, without describing those records individually at all; or it may include descriptions of various component groupings and sub-groupings of records; or it may indeed go as far as describing individual records within such groupings. For each Unit of Description, information relevant to that particular unit is provided. EAD and ISAD(G)) allow for the provision of more or less the same set of information whatever the “level” of unit described, though in practice some elements are more commonly used for “aggregate/group” units.
  • Archival Finding Aid: these are the documents created by archival cataloguers to describe the archival materials. Often a single finding aid describes (or has as its topic/subject) several units of description, but it may be the case that a finding aid describes only a single unit – where only a description of the collection as a whole is provided.
  • Repository (Agent): the organisations who curate and provide access to the archival material, and who create and maintain the archival finding aids. (EAD allows for the possibility that two different agencies perform these two roles; the Hub EAD Editor works on the basis that a single agent is responsible for both).
  • Origination (Agent): the entity (individual, organisation or family) “responsible for the creation, accumulation, or assembly of the described materials before their incorporation into an archival repository” (from the description of the EAD <origination> element). Jane analysed the rather complex nature of the ISAD(G) Creator/EAD origination relationship, which encompases notions of both “item creator” and “collector”, in <a href="http://archiveshub.ac.uk/blog/?p=2401"an earlier post on the Archives Hub blog.
  • “Things” which are referenced in the form of names used as “access points” or “index terms” using the EAD <controlaccess> element. The Hub EAD Editor supports the provision of the following as <controlaccess> terms, and recommends the use of a number of thesauri or “authority files” from which they should be drawn: Names of “Subjects” (topics); Personal Names; Family Names; Corporate Names; Place Names; Book Titles; Names of Genres or Forms; Names of Functions. So the corresponding “things named” are: Concepts, Persons, Families, Organisations, Places, Books, Genres or Forms, and Functions. As Jane notes in her recent post the relationship between the Unit of Description and the entity named in the <controlaccess> element is not necessarily a relationship of “about”-ness, but a rather less specific one, which for the moment we’ve labelled as simply “associated with” (though a better label might be preferable!).

(I’ve shown the Origination and Repository as distinct classes in the diagram, rather than as a single Agent class, because, as I hope will become clearer below, it ends up that they participate in a slightly different set of relationships).

We went on to extend and refine this core model to accommodate more of the information from the EAD document.

First, we refined the way the “access points” are represented. I’d discussed this aspect of the model with Leigh Dodds of Talis and he suggested that we consider modelling the physical entities here as concepts, in turn related to physical entities, i.e. that we represent the “conceptualisation” of a person, family, organisation or place captured in a thesaurus entry or authority file record, as distinct from the actual physical entity. So, to take an example which I think Bethan used during our conversation, we can distinguish between a conceptualisation of William Blake as a poet and one of William Blake as an artist, each in turn related to William Blake the person.

Although I don’t plan to discuss the specifics of RDF vocabulary in this post, it’s worth noting that the FOAF RDF vocabulary has recently been extended with the addition of a property, foaf:focus, to represent the relationship between the conceptualisation and the thing conceptualised (person, place etc), to support exactly this convention.

For some of the <controlaccess> named entities – like the topics, genres/forms and functions – there is no “other thing conceptualised” and it is sufficient to model them simply as concepts (or as instances of a subclass); and for the book case, we’ll just treat it as a “book” (and for the moment, at least, sidestep any FRBR-ish issues).

In both cases, the notion that the concept is a member of a specific thesarus/authority file can be captured by introducing the notion (from SKOS) of a “Concept Scheme”.

Question 1: One question raised by this approach is whether, for the cases where there is a distinct entity involved, in transforming an EAD document into RDF, we should:

  1. Coin URIs for, and generate “descriptions” of, both the concept and the person/family/organisation/place conceptualised (with a triple with a foaf:focus predicate relating the two? Or:
  2. Coin a URI for, and generate a “description” of only the concept, and leave the relationship with the person/family/organisation/place conceptualised “out of scope” at the transform stage (though that relationship might be obtained at a later stage by linking the concept to external data)?

My inclination is to do the former, on the grounds that this enables us to capture more of the information present in the EAD document i.e. to capture the information that where a <persname> element is used, this is the name of a conceptualisation of a person, where a <corpname> element is used, this is the name of a conceptualisation of an organisation, and so on.

Question 2: Is it necessary/useful to also model the name itself as a distinct resource? I think we can manage without that, but we may revisit that point in the future.

Second, having made this choice for the <controlaccess> entities, we decided to apply it also to the case of the “origination” agent discussed above, with the “origination” relationship becoming one between a Unit of Description and a conceptualisation of an agent, rather than between a Unit of Description and the agent itself. I admit I’m still not completely sure this is necessary/useful/”the right thing to do”. The use of the <origination> element in the Hub EAD profile is described in the guidelines here. It allows for names to be presented in “the commonly used form of name”, rather than the form specified by an authority record (and indeed a survey of the data reveals a good deal of variation), so it’s a bit more difficult to argue that this corresponds directly to the name of an entry (concept) listed in an “authority file”.

Question 3: Is it necessary/useful to introduce a “conceptualisation” of the agent who “originated” the Unit of Description? For now, we’re working on the basis that it is, but we may revisit that choice.

This extended model is represented graphically in Figure 2:

Diagram showing draft data model for EAD data (2)

Figure 2

A final stage of refinement gave us a few further extensions.

First the EAD Document is introduced as a particular “encoding of” the Finding Aid.

Second, I’ve suggested that we model the Biographical or Administrative History associated with each Unit of Description as a resource in its own right, distinct from the Finding Aid as a whole. I’m not sure this is strictly necessary, and again it’s something that we may revist in the future. But it enables us to provide information about the Biographical History as a distinct resource. One of the reasons this may be useful is that we’ve discussed (albeit somewhat vaguely at this point!) analysing/mining the text of the Biographical History as a source of further information, and having a URI for the Biographical History enables us to be explicit about the source of that data. We can also make the Biographical History the subject of triples to indicate that it is related not just to the Unit of Description but also to the entity who “originated” that unit (or, given the discussion above, to the conceptualisation of that entity). Also, we could associate it with different literal expressions (e.g. the original EAD fragment as XML Literal, but also an XHTML or plain text derivative). It also, of course, makes the Biographical History into a resource that others can refer to in their own assertions in their own data.</p

Third, we introduced the “level” of the Unit of Description as a distinct resource, a concept. This means that each “level” within the (relatively small) set used within the Hub data can each be assigned a distinct URI, and described in their own right, and – again – referenced by others.

Fourth, similarly, the “language” of the Unit of Description is treated as a distinct resource. (The plan here is that we’ll try to simply reference resources within an existing Linked Data dataset, such as lexvo.orga>.)

Fifth, the EAD <dao> and <daogrp< elements are mapped into a relationship between the Unit of Description and an external digital object (or group of objects). I’ve labelled the relationship here as “is represented by” as that is the description provided by the EAD documentation, but I think Jane and Bethan felt that in practice in the Hub data, the relationship might sometimes be rather less specific than that.

For the moment, the other EAD elements corresponding to ISAD(G) elements (i.e. to textboxes in the Hub data entry form) will be treated as properties with XML Literal values (though we could follow the <bioghist> approach and generate individual URI-identified resources if that proves to be useful).

Sixth – and here we stepped slightly beyond the scope of the EAD document itself (so I’ve greyed it out in the diagram below) – we’ve added a notion of the location of the Repository and a relationship between the Repository-as-Agent and that Place. Although details of repository location aren’t included in the Hub EAD documents, Jane and Bethan said they do have that data available, and it should be fairly easy to integrate it.

So we’ve ended up with the model illustrated in Figure 3.

Diagram showing draft data model for EAD data (3)

Figure 3

Question 4: Are we missing any obvious “things” that we need to treat as resources?

Note: In this post, I haven’t gone as far here as to enumerate all the properties that will be used to describe instances of each of those classes, but I’ll provide that in a future post.

Multi-level description, context, “completeness” and “inheritance”

The one remaining question – and perhaps one of the thorniest to address fully – is that arising from one of the fundamental characteristics of the nature of archival description. As noted above, archival description is typically based on a “hierarchical”, “multi-level” approach, in which, within a single finding aid, information is provided about an aggregation of records, and then about component parts of that aggregation, and so on, perhaps down to the level of providing descriptions of individual records, but often stopping short of that.

The ISAD(G) standard presents principles of moving from the general to the specific, and providing information relevant to the particular unit of description (ISAD(G) 2.2):

Provide only such information as is appropriate to the level being described. For example, do not provide detailed file content information if the unit of description is a fonds; do not provide an administrative history for an entire department if the creator of a unit of description is a division or a branch.

And of “non repetition” (ISAD(G) 2.4):

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

In some cases, it may indeed be the case that if some descriptive attribute is not explicitly provided for the unit of description, then the information provided for its “parent” unit in the hierarchy is applicable; however, this is often not the case. The elements of the ISAD(G) Identity Statement Area (or the EAD <did> child elements), for example, are specific to the unit of description and do not apply to its “child” units; and for many other descriptive elements, a simple rule of “direct inheritance” may not be appropriate. For the <controlaccess> elements, for example, a “blunt” inference rule that the named entities “associated with” a unit of description are also “associated with” every “child” unit (and so on “down the tree”) may result in associations that are simply not useful to the consumer of the data.

In a post on the Archives Hub blog, Jane emphasised the value of the “Linked Data” approach in making things mentioned in our data into “first-class citizens”. One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context”, and that the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description (perhaps even that they are in some sense “incomplete” without that “contextual” data). In contrast, the “Linked Data” approach typically involves exposing “bounded descriptions” of individual resources. Now, certainly, yes, those “bounded descriptions” include assertions of relationships with other resources (including the sort of part-whole/member-of/component-of relationships present here), and those links can be followed by consumers to obtain further information on the other resources – however, there is no requirement or expectation that consumers will do so. So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”. Rather than seeing that as an insurmountable hurdle, however, I think it provides an area that the project can usefully explore and evaluate.

(If I remember correctly) we made the decision that, for now at least, the only piece of information for which we would implement an explicit “inheritance” from a “higher-level” Unit of Description to a “lower” one (and generate additional RDF triples in the data) would be that of the repository which provides access to the material (i.e. the EAD <repository> element).

Conclusion

As I said above, the model I’ve outlined here is intended as very much a first cut, not the “last word”, and something we’ll most likely revisit and refine further in the future, particularly as we see in practice what it enables us (and others) to do with the data generated, and where we might require some further tweaks to enable us to do more. For now, we feel it provides a basis for our initial work on transforming EAD data into RDF.

The next steps are:

  1. to decide on URI patterns for the URIs we will be generating (i.e. URIs for instances of the classes in the diagram above)
  2. to select terms from existing RDF vocabularies and to define any additional RDF terms required to create “descriptions” of these things based on information from the EAD document
  3. to create a transformation that implements the model (in the first instance, an XSLT transform)

I’ve already done some work on all of these, and I’ll write about them in a separate post here – which hopefully will be rather shorter than this one and will take me rather less time to write!