As mentioned by Jane in a couple of previous posts, she, Bethan and I met up in Manchester in August to share our thoughts about how to model the Archives Hub EAD data in a form that can be represented in RDF.
RDF in a nutshell
For the purposes of this discussion, the main point to bear in mind is that the “grammatical principle” underpinning RDF is one of making simple three-part statements, each of which makes an assertion of a relationship (of some particular type) between two things. So for example, in RDF I can “say” things like:
Document 123 has-title “Arthur and George”
Document 123 is-authored-by Person P
Person P has-name “Julian Barnes”
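As a rough illustration only, the statements above can be written down as (subject, predicate, object) tuples. This is a minimal sketch in plain Python rather than a real RDF library; the identifiers `doc:123` and `person:P` and the helper `objects` are placeholders made up for the example.

```python
# Each statement is a (subject, predicate, object) tuple.
# "doc:123" and "person:P" are placeholder identifiers, not real URIs.
triples = [
    ("doc:123", "has-title", "Arthur and George"),
    ("doc:123", "is-authored-by", "person:P"),
    ("person:P", "has-name", "Julian Barnes"),
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow the authorship link, then look up the author's name.
author = objects(triples, "doc:123", "is-authored-by")[0]
author_names = objects(triples, author, "has-name")
```

Because the object of one statement can be the subject of another, a set of simple statements like this forms a graph that can be traversed from one "thing" to the next.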
When considering how to represent EAD data in RDF, then, the first step is to try to take a step back from the “nitty-gritty” of the EAD XML markup, and think about the three part statements we might construct to represent the “information content” of that document. We need to think in terms, not of XML documents and elements and attributes and nesting/containment, but rather of what an EAD document is “saying” about “things in the world” (perhaps more accurately, in the “world” as conceptualised by the creator of the archival finding aid, shaped by archival description practices in general) and what sort of questions we want to answer about those “things”. What are the “things” – and here I use the term in a general sense to include concepts and abstractions as well as material objects – that an EAD document provides information about? What are the relationships between these things? What else does an EAD document say about those things?
Note: The discussion here does not cover the “document”/“description” side of the “Linked Data” picture, i.e. for each “thing”, we’ll be providing a “description” of that “thing” in the form of a “document”. Metadata describing that “document” will be important in providing information about provenance and currency, for example, but that is not discussed here.
EAD as used by the Archives Hub
The EAD XML format was designed to cope with the “encoding” of a wide range of archival finding aids, including those constructed according to the (slightly different) cataloguing practices and traditions of different communities.
Further, many features of the EAD format are optional: one can construct a valid EAD document using only a fairly minimal level of markup, or one can use more detailed markup to represent more information.
This flexibility can be something of a “double-edged sword”: on the one hand, it enables data creators to accommodate a wide range of data, and it provides choice in the level of detail of markup (and human resources in creating that markup!) to be applied; on the other hand, it can make working with EAD data quite complex for a consumer, particularly when processing data from a range of sources which perhaps use a range of different conventions and features of the language.
In part to address this sort of issue (as well as to make things simpler for data providers by insulating them from the detail of EAD markup), the Archives Hub provides a forms-based EAD editor, based primarily on the information categories enumerated by the ISAD(G) archival description standard, which generates EAD documents following a consistent set of markup conventions. (I sometimes think of this as a “profile” of EAD, a narrower set of constraints than that imposed by the EAD DTD/schema itself, but I’m not sure that sort of terminology is in widespread use in this context.)
So, we made the “pragmatic” decision to work, in the first instance at least, on the basis of this particular set of EAD markup conventions, rather than trying to address the full EAD format, which means we can limit the number of variants we need to deal with. Having said that, even for the case of data created using the Hub editor, an element of variation is present, because although the data entry form generates a common high-level structure, data creators can apply different markup within those high-level structural components. In this first cut at a model, we have focused on analysing those common structural elements, with the intention of extending and refining our approach at a later stage.
In the course of this (or in thinking about it afterwards) we’ve come up with a few questions, which I’ll try to highlight in the course of the discussion below. Any feedback on these points (or indeed on any other aspect of the post!) would be very welcome.
The “world” as seen by EAD
Jane and I had both done some doodling before our meeting, and we started out by walking through our ideas, highlighting both those aspects which seemed pretty clear and uncontroversial, and aspects where we were uncertain or several alternatives seemed possible (and reasonable). Although we were using slightly different terminology, I think we had come up with quite similar notions, and after a bit of discussion, we arrived at a first cut at a “core” model which I’m representing graphically in Figure 1 below. This isn’t intended as a formal UML or E-R diagram, but each box represents a type of “thing” (a class) and each arrow represents a type of relationship between individual things (“instances” of those classes):
So the “core” types of things identified in this first stage were:
- Unit of Description: these are the “units” of archival material, a document or set of documents, the actual stuff held in the repository and described by the finding aid. It’s a “generic” class to reflect the archival description principle of “multi-level description”. An archival finding aid typically has a “hierarchical” structure, in which one “unit of description” is (described as logically forming) “part of” another “unit of description”. A finding aid may provide only a “collection-level” description of a collection which contains many thousands of individual records, without describing those records individually at all; or it may include descriptions of various component groupings and sub-groupings of records; or it may indeed go as far as describing individual records within such groupings. For each Unit of Description, information relevant to that particular unit is provided. EAD and ISAD(G) allow for the provision of more or less the same set of information whatever the “level” of unit described, though in practice some elements are more commonly used for “aggregate/group” units.
- Archival Finding Aid: these are the documents created by archival cataloguers to describe the archival materials. Often a single finding aid describes (or has as its topic/subject) several units of description, but it may be the case that a finding aid describes only a single unit – where only a description of the collection as a whole is provided.
- Repository (Agent): the organisations who curate and provide access to the archival material, and who create and maintain the archival finding aids. (EAD allows for the possibility that two different agencies perform these two roles; the Hub EAD Editor works on the basis that a single agent is responsible for both).
- Origination (Agent): the entity (individual, organisation or family) “responsible for the creation, accumulation, or assembly of the described materials before their incorporation into an archival repository” (from the description of the EAD <origination> element). Jane analysed the rather complex nature of the ISAD(G) Creator/EAD origination relationship, which encompasses notions of both “item creator” and “collector”, in an earlier post on the Archives Hub blog (http://archiveshub.ac.uk/blog/?p=2401).
- “Things” which are referenced in the form of names used as “access points” or “index terms” using the EAD <controlaccess> element. The Hub EAD Editor supports the provision of the following as <controlaccess> terms, and recommends the use of a number of thesauri or “authority files” from which they should be drawn: Names of “Subjects” (topics); Personal Names; Family Names; Corporate Names; Place Names; Book Titles; Names of Genres or Forms; Names of Functions. So the corresponding “things named” are: Concepts, Persons, Families, Organisations, Places, Books, Genres or Forms, and Functions. As Jane notes in her recent post, the relationship between the Unit of Description and the entity named in the <controlaccess> element is not necessarily a relationship of “about”-ness, but a rather less specific one, which for the moment we’ve labelled as simply “associated with” (though a better label might be preferable!).
(I’ve shown the Origination and Repository as distinct classes in the diagram, rather than as a single Agent class, because, as I hope will become clearer below, it ends up that they participate in a slightly different set of relationships).
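To make the shape of this core model a little more concrete, here is a minimal sketch of the classes and relationships as plain Python dataclasses. It is only an illustration: apart from “part of” and “associated with”, the relationship names (`repository`, `origination`, `maintained_by`, `describes`) are my own guesses at labels, and all of the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """A Repository or an Origination agent (kept generic in this sketch)."""
    name: str


@dataclass
class UnitOfDescription:
    title: str
    part_of: Optional["UnitOfDescription"] = None  # multi-level hierarchy
    repository: Optional[Agent] = None             # who provides access
    origination: Optional[Agent] = None            # who created/accumulated it
    associated_with: List[str] = field(default_factory=list)  # access points


@dataclass
class FindingAid:
    maintained_by: Agent
    describes: List[UnitOfDescription] = field(default_factory=list)


def context_chain(unit):
    """Return the titles of the units above this one, nearest parent first."""
    titles = []
    current = unit.part_of
    while current is not None:
        titles.append(current.title)
        current = current.part_of
    return titles


# Invented sample data: a collection, a series within it, and one item.
repo = Agent("Example Repository")
collection = UnitOfDescription(
    "Example Papers", repository=repo, origination=Agent("Example Family")
)
series = UnitOfDescription("Correspondence", part_of=collection)
item = UnitOfDescription("Letter", part_of=series, associated_with=["Example Place"])
finding_aid = FindingAid(maintained_by=repo, describes=[collection, series, item])

chain = context_chain(item)
```

Note that a single `UnitOfDescription` class serves for every level of the hierarchy, which mirrors the “generic” nature of the class described above.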
We went on to extend and refine this core model to accommodate more of the information from the EAD document.
First, we refined the way the “access points” are represented. I’d discussed this aspect of the model with Leigh Dodds of Talis and he suggested that we consider modelling the physical entities here as concepts, in turn related to physical entities, i.e. that we represent the “conceptualisation” of a person, family, organisation or place captured in a thesaurus entry or authority file record, as distinct from the actual physical entity. So, to take an example which I think Bethan used during our conversation, we can distinguish between a conceptualisation of William Blake as a poet and one of William Blake as an artist, each in turn related to William Blake the person.
Although I don’t plan to discuss the specifics of RDF vocabulary in this post, it’s worth noting that the FOAF RDF vocabulary has recently been extended with the addition of a property, foaf:focus, to represent the relationship between the conceptualisation and the thing conceptualised (person, place etc), to support exactly this convention.
For some of the <controlaccess> named entities – like the topics, genres/forms and functions – there is no “other thing conceptualised” and it is sufficient to model them simply as concepts (or as instances of a subclass); and for the book case, we’ll just treat it as a “book” (and for the moment, at least, sidestep any FRBR-ish issues).
In both cases, the notion that the concept is a member of a specific thesaurus/authority file can be captured by introducing the notion (from SKOS) of a “Concept Scheme”.
Question 1: One question raised by this approach is whether, for the cases where there is a distinct entity involved, in transforming an EAD document into RDF, we should:
- Coin URIs for, and generate “descriptions” of, both the concept and the person/family/organisation/place conceptualised (with a triple with a foaf:focus predicate relating the two)? Or:
- Coin a URI for, and generate a “description” of only the concept, and leave the relationship with the person/family/organisation/place conceptualised “out of scope” at the transform stage (though that relationship might be obtained at a later stage by linking the concept to external data)?
My inclination is to do the former, on the grounds that this enables us to capture more of the information present in the EAD document i.e. to capture the information that where a <persname> element is used, this is the name of a conceptualisation of a person, where a <corpname> element is used, this is the name of a conceptualisation of an organisation, and so on.
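To show roughly what the first option might produce, here is a minimal sketch that reads <persname> and <corpname> elements from a small EAD-like fragment and emits triples for both the concept and the thing conceptualised, linked by foaf:focus. Several things here are assumptions for illustration only: the `http://example.org/id/` base and the URI patterns are hypothetical (we haven’t settled on URI patterns yet), the sample fragment is invented and has no XML namespace, and the function `controlaccess_triples` is not part of any actual transform.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/id/"  # hypothetical URI base, for illustration only

# The EAD element name tells us what kind of thing is being conceptualised.
FOCUS_CLASS = {
    "persname": ("person", FOAF + "Person"),
    "corpname": ("organisation", FOAF + "Organization"),
}


def controlaccess_triples(ead_fragment):
    """Emit triples for a concept and its focus for each supported name element."""
    triples = []
    root = ET.fromstring(ead_fragment)
    for el in root.iter():
        if el.tag not in FOCUS_CLASS:
            continue
        kind, focus_class = FOCUS_CLASS[el.tag]
        # Collapse whitespace in the element's text content.
        name = " ".join("".join(el.itertext()).split())
        slug = quote(name.lower().replace(" ", "-"))
        concept = f"{BASE}concept/{kind}/{slug}"  # the conceptualisation
        thing = f"{BASE}{kind}/{slug}"            # the thing conceptualised
        triples += [
            (concept, RDF_TYPE, SKOS + "Concept"),
            (concept, SKOS + "prefLabel", name),
            (concept, FOAF + "focus", thing),
            (thing, RDF_TYPE, focus_class),
        ]
    return triples


# Invented sample fragment (no namespace, as in DTD-based EAD documents).
sample = """<controlaccess>
  <persname>Blake, William</persname>
  <corpname>Example Organisation</corpname>
</controlaccess>"""

triples = controlaccess_triples(sample)
focus_links = [(s, o) for s, p, o in triples if p == FOAF + "focus"]
```

Under the second option, the sketch would simply stop after the first two triples for each name, leaving the foaf:focus link to be added later from external data.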
Question 2: Is it necessary/useful to also model the name itself as a distinct resource? I think we can manage without that, but we may revisit that point in the future.
Second, having made this choice for the <controlaccess> entities, we decided to apply it also to the case of the “origination” agent discussed above, with the “origination” relationship becoming one between a Unit of Description and a conceptualisation of an agent, rather than between a Unit of Description and the agent itself. I admit I’m still not completely sure this is necessary/useful/”the right thing to do”. The use of the <origination> element in the Hub EAD profile is described in the guidelines here. It allows for names to be presented in “the commonly used form of name”, rather than the form specified by an authority record (and indeed a survey of the data reveals a good deal of variation), so it’s a bit more difficult to argue that this corresponds directly to the name of an entry (concept) listed in an “authority file”.
Question 3: Is it necessary/useful to introduce a “conceptualisation” of the agent who “originated” the Unit of Description? For now, we’re working on the basis that it is, but we may revisit that choice.
This extended model is represented graphically in Figure 2:
A final stage of refinement gave us a few further extensions.
First, the EAD Document is introduced as a particular “encoding of” the Finding Aid.
Second, I’ve suggested that we model the Biographical or Administrative History associated with each Unit of Description as a resource in its own right, distinct from the Finding Aid as a whole. I’m not sure this is strictly necessary, and again it’s something that we may revisit in the future. But it enables us to provide information about the Biographical History as a distinct resource. One of the reasons this may be useful is that we’ve discussed (albeit somewhat vaguely at this point!) analysing/mining the text of the Biographical History as a source of further information, and having a URI for the Biographical History enables us to be explicit about the source of that data. We can also make the Biographical History the subject of triples to indicate that it is related not just to the Unit of Description but also to the entity who “originated” that unit (or, given the discussion above, to the conceptualisation of that entity). Also, we could associate it with different literal expressions (e.g. the original EAD fragment as XML Literal, but also an XHTML or plain text derivative). It also, of course, makes the Biographical History into a resource that others can refer to in their own assertions in their own data.
Third, we introduced the “level” of the Unit of Description as a distinct resource, a concept. This means that each “level” within the (relatively small) set used within the Hub data can each be assigned a distinct URI, and described in their own right, and – again – referenced by others.
Fourth, similarly, the “language” of the Unit of Description is treated as a distinct resource. (The plan here is that we’ll try to simply reference resources within an existing Linked Data dataset, such as lexvo.org.)
Fifth, the EAD <dao> and <daogrp> elements are mapped into a relationship between the Unit of Description and an external digital object (or group of objects). I’ve labelled the relationship here as “is represented by” as that is the description provided by the EAD documentation, but I think Jane and Bethan felt that in practice in the Hub data, the relationship might sometimes be rather less specific than that.
For the moment, the other EAD elements corresponding to ISAD(G) elements (i.e. to textboxes in the Hub data entry form) will be treated as properties with XML Literal values (though we could follow the <bioghist> approach and generate individual URI-identified resources if that proves to be useful).
Sixth – and here we stepped slightly beyond the scope of the EAD document itself (so I’ve greyed it out in the diagram below) – we’ve added a notion of the location of the Repository and a relationship between the Repository-as-Agent and that Place. Although details of repository location aren’t included in the Hub EAD documents, Jane and Bethan said they do have that data available, and it should be fairly easy to integrate it.
So we’ve ended up with the model illustrated in Figure 3.
Question 4: Are we missing any obvious “things” that we need to treat as resources?
Note: In this post, I haven’t gone as far as enumerating all the properties that will be used to describe instances of each of those classes, but I’ll provide that in a future post.
Multi-level description, context, “completeness” and “inheritance”
The one remaining question – and perhaps one of the thorniest to address fully – is that arising from one of the fundamental characteristics of the nature of archival description. As noted above, archival description is typically based on a “hierarchical”, “multi-level” approach, in which, within a single finding aid, information is provided about an aggregation of records, and then about component parts of that aggregation, and so on, perhaps down to the level of providing descriptions of individual records, but often stopping short of that.
The ISAD(G) standard presents principles of moving from the general to the specific, and providing information relevant to the particular unit of description (ISAD(G) 2.2):
Provide only such information as is appropriate to the level being described. For example, do not provide detailed file content information if the unit of description is a fonds; do not provide an administrative history for an entire department if the creator of a unit of description is a division or a branch.
And of “non repetition” (ISAD(G) 2.4):
At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.
In some cases, it may indeed be the case that if some descriptive attribute is not explicitly provided for the unit of description, then the information provided for its “parent” unit in the hierarchy is applicable; however, this is often not the case. The elements of the ISAD(G) Identity Statement Area (or the EAD <did> child elements), for example, are specific to the unit of description and do not apply to its “child” units; and for many other descriptive elements, a simple rule of “direct inheritance” may not be appropriate. For the <controlaccess> elements, for example, a “blunt” inference rule that the named entities “associated with” a unit of description are also “associated with” every “child” unit (and so on “down the tree”) may result in associations that are simply not useful to the consumer of the data.
In a post on the Archives Hub blog, Jane emphasised the value of the “Linked Data” approach in making things mentioned in our data into “first-class citizens”. One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context”, and that the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description (perhaps even that they are in some sense “incomplete” without that “contextual” data). In contrast, the “Linked Data” approach typically involves exposing “bounded descriptions” of individual resources. Now, certainly, yes, those “bounded descriptions” include assertions of relationships with other resources (including the sort of part-whole/member-of/component-of relationships present here), and those links can be followed by consumers to obtain further information on the other resources – however, there is no requirement or expectation that consumers will do so. So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”. Rather than seeing that as an insurmountable hurdle, however, I think it provides an area that the project can usefully explore and evaluate.
(If I remember correctly) we made the decision that, for now at least, the only piece of information for which we would implement an explicit “inheritance” from a “higher-level” Unit of Description to a “lower” one (and generate additional RDF triples in the data) would be that of the repository which provides access to the material (i.e. the EAD <repository> element).
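As a sketch of what that one explicit “inheritance” might look like, the function below gives each unit the repository of its nearest ancestor that has one, and leaves every other property alone. The unit identifiers, the `repo:example` value and the function name `inherit_repository` are all invented for illustration, and the sketch assumes the “part of” relationships form a tree with no cycles.

```python
def inherit_repository(parent_of, repository_of):
    """Map every unit to the repository of its nearest ancestor that has one.

    parent_of:     unit id -> parent unit id (None for a top-level unit)
    repository_of: unit id -> repository id, only where explicitly stated
    """
    resolved = {}
    for unit in parent_of:
        current = unit
        # Walk up the hierarchy until a unit with an explicit repository is found.
        while current is not None and current not in repository_of:
            current = parent_of.get(current)
        if current is not None:
            resolved[unit] = repository_of[current]
    return resolved


# Invented example: only the top-level unit names its repository.
parent_of = {
    "collection": None,
    "series-1": "collection",
    "item-1-1": "series-1",
}
repository_of = {"collection": "repo:example"}

resolved = inherit_repository(parent_of, repository_of)
```

Each entry in `resolved` that is not already in `repository_of` corresponds to an additional triple that would be generated for a “lower-level” unit.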
As I said above, the model I’ve outlined here is intended as very much a first cut, not the “last word”, and something we’ll most likely revisit and refine further in the future, particularly as we see in practice what it enables us (and others) to do with the data generated, and where we might require some further tweaks to enable us to do more. For now, we feel it provides a basis for our initial work on transforming EAD data into RDF.
The next steps are:
- to decide on URI patterns for the URIs we will be generating (i.e. URIs for instances of the classes in the diagram above)
- to select terms from existing RDF vocabularies and to define any additional RDF terms required to create “descriptions” of these things based on information from the EAD document
- to create a transformation that implements the model (in the first instance, an XSLT transform)
I’ve already done some work on all of these, and I’ll write about them in a separate post here – which hopefully will be rather shorter than this one and will take me rather less time to write!