The “things” in EAD: a first cut at a model

As mentioned by Jane in a couple of previous posts, she, Bethan and I met up in Manchester in August to share our thoughts about how to model the Archives Hub EAD data in a form that can be represented in RDF.

RDF in a nutshell

For the purposes of this discussion, the main point to bear in mind is that the “grammatical principle” underpinning RDF is one of making simple three-part statements, each of which makes an assertion of a relationship (of some particular type) between two things. So for example, in RDF I can “say” things like:

Document 123 has-title “Arthur and George”

or

Document 123 is-authored-by Person P
Person P has-name “Julian Barnes”
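As a rough illustration, the statements above can be written down as plain (subject, predicate, object) tuples. This is a minimal sketch in Python rather than real RDF: the identifiers `doc:123` and `person:P` and the predicate names are placeholders taken from the informal examples, not terms from any actual vocabulary.

```python
# Each statement is a (subject, predicate, object) triple.
# Identifiers and predicate names are illustrative placeholders only.
triples = [
    ("doc:123", "has-title", "Arthur and George"),
    ("doc:123", "is-authored-by", "person:P"),
    ("person:P", "has-name", "Julian Barnes"),
]


def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(document):
    """Follow is-authored-by, then has-name, to answer 'who wrote this?'."""
    return [
        name
        for author in objects(document, "is-authored-by")
        for name in objects(author, "has-name")
    ]


print(author_names("doc:123"))
```

The point of the sketch is that answering a question means following relationships from one "thing" to another, rather than navigating the nesting of a document.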

When considering how to represent EAD data in RDF, then, the first step is to try to take a step back from the “nitty-gritty” of the EAD XML markup, and think about the three-part statements we might construct to represent the “information content” of that document. We need to think in terms, not of XML documents and elements and attributes and nesting/containment, but rather of what an EAD document is “saying” about “things in the world” (perhaps more accurately, in the “world” as conceptualised by the creator of the archival finding aid, shaped by archival description practices in general) and what sort of questions we want to answer about those “things”. What are the “things” – and here I use the term in a general sense to include concepts and abstractions as well as material objects – that an EAD document provides information about? What are the relationships between these things? What else does an EAD document say about those things?

Note: The discussion here does not cover the “document”/”description” side of the “Linked Data” picture i.e. for each “thing”, we’ll be providing a “description” of that “thing” in the form of a “document”. Metadata describing that “document” will be important in providing information about provenance and currency, for example, but that is not discussed here.

EAD as used by the Archives Hub

The EAD XML format was designed to cope with the “encoding” of a wide range of archival finding aids, including those constructed according to the (slightly different) cataloguing practices and traditions of different communities.

Further, many features of the EAD format are optional: one can construct a valid EAD document using only a fairly minimal level of markup, or one can use more detailed markup to represent more information.

This flexibility can be something of a “double-edged sword”: on the one hand, it enables data creators to accommodate a wide range of data, and it provides choice in the level of detail of markup (and human resources in creating that markup!) to be applied; on the other hand, it can make working with EAD data quite complex for a consumer, particularly when processing data from a range of sources which perhaps use a range of different conventions and features of the language.

In part to address this sort of issue (as well as to make things simpler for data providers by insulating them from the detail of EAD markup), the Archives Hub provides a forms-based EAD editor, based primarily on the information categories enumerated by the ISAD(G) archival description standard, which generates EAD documents following a consistent set of markup conventions. (I sometimes think of this as a “profile” of EAD, a narrower set of constraints than that imposed by the EAD DTD/schema itself, but I’m not sure that sort of terminology is in widespread use in this context.)

So, we made the “pragmatic” decision to work, in the first instance at least, on the basis of this particular set of EAD markup conventions, rather than trying to address the full EAD format, which means we can limit the number of variants we need to deal with. Having said that, even for the case of data created using the Hub editor, an element of variation is present, because although the data entry form generates a common high-level structure, data creators can apply different markup within those high-level structural components. In this first cut at a model, we have focused on analysing those common structural elements, with the intention of extending and refining our approach at a later stage.

In the course of this (or in thinking about it afterwards) we’ve come up with a few questions, which I’ll try to highlight in the course of the discussion below. Any feedback on these points (or indeed on any other aspect of the post!) would be very welcome.

The “world” as seen by EAD

Jane and I had both done some doodling before our meeting, and we started out by walking through our ideas, highlighting both those aspects which seemed pretty clear and uncontroversial, and aspects where we were uncertain or several alternatives seemed possible (and reasonable). Although we were using slightly different terminology, I think we had come up with quite similar notions, and after a bit of discussion, we arrived at a first cut at a “core” model which I’m representing graphically in Figure 1 below. This isn’t intended as a formal UML or E-R diagram, but each box represents a type of “thing” (a class) and each arrow represents a type of relationship between individual things (“instances” of those classes):

Diagram showing draft data model for EAD data (1)

Figure 1

So the “core” types of things identified in this first stage were:

  • Unit of Description: these are the “units” of archival material, a document or set of documents, the actual stuff held in the repository and described by the finding aid. It’s a “generic” class to reflect the archival description principle of “multi-level description”. An archival finding aid typically has a “hierarchical” structure, in which one “unit of description” is (described as logically forming) “part of” another “unit of description”. A finding aid may provide only a “collection-level” description of a collection which contains many thousands of individual records, without describing those records individually at all; or it may include descriptions of various component groupings and sub-groupings of records; or it may indeed go as far as describing individual records within such groupings. For each Unit of Description, information relevant to that particular unit is provided. EAD and ISAD(G) allow for the provision of more or less the same set of information whatever the “level” of unit described, though in practice some elements are more commonly used for “aggregate/group” units.
  • Archival Finding Aid: these are the documents created by archival cataloguers to describe the archival materials. Often a single finding aid describes (or has as its topic/subject) several units of description, but it may be the case that a finding aid describes only a single unit – where only a description of the collection as a whole is provided.
  • Repository (Agent): the organisations who curate and provide access to the archival material, and who create and maintain the archival finding aids. (EAD allows for the possibility that two different agencies perform these two roles; the Hub EAD Editor works on the basis that a single agent is responsible for both).
  • Origination (Agent): the entity (individual, organisation or family) “responsible for the creation, accumulation, or assembly of the described materials before their incorporation into an archival repository” (from the description of the EAD <origination> element). Jane analysed the rather complex nature of the ISAD(G) Creator/EAD origination relationship, which encompasses notions of both “item creator” and “collector”, in an earlier post on the Archives Hub blog (http://archiveshub.ac.uk/blog/?p=2401).
  • “Things” which are referenced in the form of names used as “access points” or “index terms” using the EAD <controlaccess> element. The Hub EAD Editor supports the provision of the following as <controlaccess> terms, and recommends the use of a number of thesauri or “authority files” from which they should be drawn: Names of “Subjects” (topics); Personal Names; Family Names; Corporate Names; Place Names; Book Titles; Names of Genres or Forms; Names of Functions. So the corresponding “things named” are: Concepts, Persons, Families, Organisations, Places, Books, Genres or Forms, and Functions. As Jane notes in her recent post, the relationship between the Unit of Description and the entity named in the <controlaccess> element is not necessarily a relationship of “about”-ness, but a rather less specific one, which for the moment we’ve labelled as simply “associated with” (though a better label might be preferable!).

(I’ve shown the Origination and Repository as distinct classes in the diagram, rather than as a single Agent class, because, as I hope will become clearer below, it ends up that they participate in a slightly different set of relationships).
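To make the shape of this core model a little more concrete, here is a minimal sketch of the classes and relationships as Python dataclasses. It is only one possible reading of the list above: the attribute names (`part_of`, `origination`, `repository`, `associated_with`, `describes`, `maintained_by`) are my own labels for illustration, and the sample collection, creator and repository are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    """A Repository or Origination agent (person, family or organisation)."""
    name: str


@dataclass
class UnitOfDescription:
    """A unit of archival material at any level of the hierarchy."""
    title: str
    level: str
    part_of: Optional["UnitOfDescription"] = None
    origination: Optional[Agent] = None
    repository: Optional[Agent] = None  # the agent providing access
    associated_with: List[str] = field(default_factory=list)  # access points


@dataclass
class FindingAid:
    """The document describing one or more units of description."""
    maintained_by: Agent
    describes: List[UnitOfDescription] = field(default_factory=list)


def context(unit):
    """Return titles from the given unit up to the top-level unit."""
    chain = []
    current = unit
    while current is not None:
        chain.append(current.title)
        current = current.part_of
    return chain


# Invented sample data, for illustration only.
repo = Agent("Example Repository")
creator = Agent("Example Family")
collection = UnitOfDescription(
    "Example Papers", "fonds",
    origination=creator, repository=repo, associated_with=["Poetry"],
)
series = UnitOfDescription("Correspondence", "series", part_of=collection)
item = UnitOfDescription("Letter", "item", part_of=series)
finding_aid = FindingAid(maintained_by=repo, describes=[collection, series, item])

print(context(item))
```

Note that the same `UnitOfDescription` class serves for the collection, the series and the item, which mirrors the “generic” nature of the class described above.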

We went on to extend and refine this core model to accommodate more of the information from the EAD document.

First, we refined the way the “access points” are represented. I’d discussed this aspect of the model with Leigh Dodds of Talis and he suggested that we consider modelling the physical entities here as concepts, in turn related to physical entities, i.e. that we represent the “conceptualisation” of a person, family, organisation or place captured in a thesaurus entry or authority file record, as distinct from the actual physical entity. So, to take an example which I think Bethan used during our conversation, we can distinguish between a conceptualisation of William Blake as a poet and one of William Blake as an artist, each in turn related to William Blake the person.

Although I don’t plan to discuss the specifics of RDF vocabulary in this post, it’s worth noting that the FOAF RDF vocabulary has recently been extended with the addition of a property, foaf:focus, to represent the relationship between the conceptualisation and the thing conceptualised (person, place etc), to support exactly this convention.

For some of the <controlaccess> named entities – like the topics, genres/forms and functions – there is no “other thing conceptualised” and it is sufficient to model them simply as concepts (or as instances of a subclass); and for the book case, we’ll just treat it as a “book” (and for the moment, at least, sidestep any FRBR-ish issues).

In both cases, the notion that the concept is a member of a specific thesaurus/authority file can be captured by introducing the notion (from SKOS) of a “Concept Scheme”.
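The William Blake example can be sketched as a handful of triples. The sketch below prints them in an N-Triples-like form. The `foaf:focus` property and the SKOS Concept Scheme notion are the ones mentioned above; the use of `skos:Concept`, `skos:prefLabel`, `skos:inScheme` and `foaf:Person` is an assumption on my part about how the pattern might be expressed, and every `http://example.org/` URI, scheme identifier and label is a placeholder, not a pattern we have settled on.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
FOAF = "http://xmlns.com/foaf/0.1/"
BASE = "http://example.org/id/"  # hypothetical base URI


def uri(value):
    """Wrap a URI in angle brackets."""
    return f"<{value}>"


def literal(text):
    """Quote a plain literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def concept_triples(concept_id, label, scheme_id, focus_id):
    """Describe one conceptualisation and link it to the thing conceptualised."""
    concept = uri(BASE + "concept/" + concept_id)
    return [
        (concept, uri(RDF_TYPE), uri(SKOS + "Concept")),
        (concept, uri(SKOS + "prefLabel"), literal(label)),
        (concept, uri(SKOS + "inScheme"), uri(BASE + "scheme/" + scheme_id)),
        (concept, uri(FOAF + "focus"), uri(BASE + "person/" + focus_id)),
    ]


# Two conceptualisations (placeholder labels and schemes), one person.
triples = (
    concept_triples("blake-poet", "Blake, William (poet)",
                    "scheme-a", "william-blake")
    + concept_triples("blake-artist", "Blake, William (artist)",
                      "scheme-b", "william-blake")
)
triples.append(
    (uri(BASE + "person/william-blake"), uri(RDF_TYPE), uri(FOAF + "Person"))
)

for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Both concepts end up pointing at the same person resource, which is exactly the distinction between the conceptualisations and the thing conceptualised.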

Question 1: One question raised by this approach is whether, for the cases where there is a distinct entity involved, in transforming an EAD document into RDF, we should:

  1. Coin URIs for, and generate “descriptions” of, both the concept and the person/family/organisation/place conceptualised (with a triple with a foaf:focus predicate relating the two)? Or:
  2. Coin a URI for, and generate a “description” of only the concept, and leave the relationship with the person/family/organisation/place conceptualised “out of scope” at the transform stage (though that relationship might be obtained at a later stage by linking the concept to external data)?

My inclination is to do the former, on the grounds that this enables us to capture more of the information present in the EAD document i.e. to capture the information that where a <persname> element is used, this is the name of a conceptualisation of a person, where a <corpname> element is used, this is the name of a conceptualisation of an organisation, and so on.
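A minimal sketch of what the first option implies at the transform stage: the name of the EAD element tells us what kind of thing is being conceptualised. The sketch uses Python's standard `xml.etree.ElementTree` on an invented, namespace-free `<controlaccess>` fragment. The mapping table and the "Example Society" entry are assumptions for illustration, and the real transform is planned as XSLT rather than Python.

```python
import xml.etree.ElementTree as ET

# Invented fragment; real Hub data may carry attributes and nested markup.
SAMPLE = """
<controlaccess>
  <persname>Blake, William</persname>
  <corpname>Example Society</corpname>
  <subject>Poetry</subject>
</controlaccess>
"""

# Elements whose concept has a distinct "thing conceptualised".
# Anything not listed here is modelled simply as a concept.
FOCUS_TYPE = {
    "persname": "Person",
    "famname": "Family",
    "corpname": "Organisation",
    "geogname": "Place",
}


def access_points(xml_text):
    """List each access point with the type of entity it conceptualises."""
    root = ET.fromstring(xml_text)
    results = []
    for element in root:
        name = "".join(element.itertext()).strip()
        results.append({
            "element": element.tag,
            "name": name,
            "focus_type": FOCUS_TYPE.get(element.tag),  # None: concept only
        })
    return results


for point in access_points(SAMPLE):
    if point["focus_type"]:
        print(f"{point['name']}: concept plus {point['focus_type']}")
    else:
        print(f"{point['name']}: concept only")
```

Under the second option the `focus_type` information would simply be dropped at this stage, which is the loss of information I'd like to avoid.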

Question 2: Is it necessary/useful to also model the name itself as a distinct resource? I think we can manage without that, but we may revisit that point in the future.

Second, having made this choice for the <controlaccess> entities, we decided to apply it also to the case of the “origination” agent discussed above, with the “origination” relationship becoming one between a Unit of Description and a conceptualisation of an agent, rather than between a Unit of Description and the agent itself. I admit I’m still not completely sure this is necessary/useful/”the right thing to do”. The use of the <origination> element in the Hub EAD profile is described in the guidelines here. It allows for names to be presented in “the commonly used form of name”, rather than the form specified by an authority record (and indeed a survey of the data reveals a good deal of variation), so it’s a bit more difficult to argue that this corresponds directly to the name of an entry (concept) listed in an “authority file”.

Question 3: Is it necessary/useful to introduce a “conceptualisation” of the agent who “originated” the Unit of Description? For now, we’re working on the basis that it is, but we may revisit that choice.

This extended model is represented graphically in Figure 2:

Diagram showing draft data model for EAD data (2)

Figure 2

A final stage of refinement gave us a few further extensions.

First the EAD Document is introduced as a particular “encoding of” the Finding Aid.

Second, I’ve suggested that we model the Biographical or Administrative History associated with each Unit of Description as a resource in its own right, distinct from the Finding Aid as a whole. I’m not sure this is strictly necessary, and again it’s something that we may revisit in the future. But it enables us to provide information about the Biographical History as a distinct resource. One of the reasons this may be useful is that we’ve discussed (albeit somewhat vaguely at this point!) analysing/mining the text of the Biographical History as a source of further information, and having a URI for the Biographical History enables us to be explicit about the source of that data. We can also make the Biographical History the subject of triples to indicate that it is related not just to the Unit of Description but also to the entity who “originated” that unit (or, given the discussion above, to the conceptualisation of that entity). Also, we could associate it with different literal expressions (e.g. the original EAD fragment as XML Literal, but also an XHTML or plain text derivative). It also, of course, makes the Biographical History into a resource that others can refer to in their own assertions in their own data.

Third, we introduced the “level” of the Unit of Description as a distinct resource, a concept. This means that each “level” within the (relatively small) set used within the Hub data can each be assigned a distinct URI, and described in their own right, and – again – referenced by others.

Fourth, similarly, the “language” of the Unit of Description is treated as a distinct resource. (The plan here is that we’ll try to simply reference resources within an existing Linked Data dataset, such as lexvo.org.)

Fifth, the EAD <dao> and <daogrp> elements are mapped into a relationship between the Unit of Description and an external digital object (or group of objects). I’ve labelled the relationship here as “is represented by” as that is the description provided by the EAD documentation, but I think Jane and Bethan felt that in practice in the Hub data, the relationship might sometimes be rather less specific than that.

For the moment, the other EAD elements corresponding to ISAD(G) elements (i.e. to textboxes in the Hub data entry form) will be treated as properties with XML Literal values (though we could follow the <bioghist> approach and generate individual URI-identified resources if that proves to be useful).

Sixth – and here we stepped slightly beyond the scope of the EAD document itself (so I’ve greyed it out in the diagram below) – we’ve added a notion of the location of the Repository and a relationship between the Repository-as-Agent and that Place. Although details of repository location aren’t included in the Hub EAD documents, Jane and Bethan said they do have that data available, and it should be fairly easy to integrate it.

So we’ve ended up with the model illustrated in Figure 3.

Diagram showing draft data model for EAD data (3)

Figure 3

Question 4: Are we missing any obvious “things” that we need to treat as resources?

Note: In this post, I haven’t gone as far as to enumerate all the properties that will be used to describe instances of each of those classes, but I’ll provide that in a future post.

Multi-level description, context, “completeness” and “inheritance”

The one remaining question – and perhaps one of the thorniest to address fully – is that arising from one of the fundamental characteristics of the nature of archival description. As noted above, archival description is typically based on a “hierarchical”, “multi-level” approach, in which, within a single finding aid, information is provided about an aggregation of records, and then about component parts of that aggregation, and so on, perhaps down to the level of providing descriptions of individual records, but often stopping short of that.

The ISAD(G) standard presents principles of moving from the general to the specific, and providing information relevant to the particular unit of description (ISAD(G) 2.2):

Provide only such information as is appropriate to the level being described. For example, do not provide detailed file content information if the unit of description is a fonds; do not provide an administrative history for an entire department if the creator of a unit of description is a division or a branch.

And of “non repetition” (ISAD(G) 2.4):

At the highest appropriate level, give information that is common to the component parts. Do not repeat information at a lower level of description that has already been given at a higher level.

In some cases, it may indeed be the case that if some descriptive attribute is not explicitly provided for the unit of description, then the information provided for its “parent” unit in the hierarchy is applicable; however, this is often not the case. The elements of the ISAD(G) Identity Statement Area (or the EAD <did> child elements), for example, are specific to the unit of description and do not apply to its “child” units; and for many other descriptive elements, a simple rule of “direct inheritance” may not be appropriate. For the <controlaccess> elements, for example, a “blunt” inference rule that the named entities “associated with” a unit of description are also “associated with” every “child” unit (and so on “down the tree”) may result in associations that are simply not useful to the consumer of the data.

In a post on the Archives Hub blog, Jane emphasised the value of the “Linked Data” approach in making things mentioned in our data into “first-class citizens”. One consequence of the multi-level approach in archival description practice is a strong sense of the importance of “context”, and that the descriptions of the “lower level” units should be read and interpreted in the context of the higher levels of description (perhaps even that they are in some sense “incomplete” without that “contextual” data). In contrast, the “Linked Data” approach typically involves exposing “bounded descriptions” of individual resources. Now, certainly, yes, those “bounded descriptions” include assertions of relationships with other resources (including the sort of part-whole/member-of/component-of relationships present here), and those links can be followed by consumers to obtain further information on the other resources – however, there is no requirement or expectation that consumers will do so. So, there is arguably a (perhaps unavoidable) element of tension between the strongly “contextual” emphasis of EAD and ISAD(G) and the “bounded descriptions” of “Linked Data”. Rather than seeing that as an insurmountable hurdle, however, I think it provides an area that the project can usefully explore and evaluate.

(If I remember correctly) we made the decision that, for now at least, the only piece of information for which we would implement an explicit “inheritance” from a “higher-level” Unit of Description to a “lower” one (and generate additional RDF triples in the data) would be that of the repository which provides access to the material (i.e. the EAD <repository> element).
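That single inheritance rule can be sketched as a walk down the hierarchy, carrying the repository from the nearest ancestor that states one. This is a minimal Python illustration under my own assumptions: the `Unit` class, the identifiers and the `accessProvidedBy` predicate name are all hypothetical, and the real transform may well do this differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Unit:
    """A unit of description with an optional explicit repository."""
    identifier: str
    repository: Optional[str] = None
    children: List["Unit"] = field(default_factory=list)


def repository_triples(unit, inherited=None):
    """Emit a repository triple for every unit, inheriting from above.

    A unit's own repository (if stated) takes precedence; otherwise the
    value from the nearest ancestor is used. Nothing else is inherited.
    """
    repository = unit.repository or inherited
    triples = []
    if repository is not None:
        triples.append((unit.identifier, "accessProvidedBy", repository))
    for child in unit.children:
        triples.extend(repository_triples(child, repository))
    return triples


# Invented hierarchy: only the top level names the repository.
item = Unit("unit/1/1/1")
series = Unit("unit/1/1", children=[item])
collection = Unit("unit/1", repository="repository/example", children=[series])

for triple in repository_triples(collection):
    print(triple)
```

Each “lower-level” unit then carries its own explicit statement about the repository, so a consumer who retrieves only the bounded description of an item still learns where the material is held.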

Conclusion

As I said above, the model I’ve outlined here is intended as very much a first cut, not the “last word”, and something we’ll most likely revisit and refine further in the future, particularly as we see in practice what it enables us (and others) to do with the data generated, and where we might require some further tweaks to enable us to do more. For now, we feel it provides a basis for our initial work on transforming EAD data into RDF.

The next steps are:

  1. to decide on URI patterns for the URIs we will be generating (i.e. URIs for instances of the classes in the diagram above)
  2. to select terms from existing RDF vocabularies and to define any additional RDF terms required to create “descriptions” of these things based on information from the EAD document
  3. to create a transformation that implements the model (in the first instance, an XSLT transform)

I’ve already done some work on all of these, and I’ll write about them in a separate post here – which hopefully will be rather shorter than this one and will take me rather less time to write!

Creating Linked Data: more reflections from the coal face

This post is to highlight some of the barriers and challenges to the creation of Linked Data.  This is a personal reflection, trying to be honest about the challenges as I have found them and the learning experience, which is inevitably a personal thing depending upon your own background, experience and ways of thinking and working. However, I think it also reflects some of the general challenges as we have come across them.

Vocabulary

It comes as no surprise that I have found the terminology somewhat confusing, and it has sometimes led me astray. Only this week Bethan and I were getting tangled up in a conversation about ‘things’  within the data model. We spent a while talking about how having a ‘Hub conceptualisation’ and a ‘thing-in-its-own-right conceptualisation’ of an entity would allow for more clarity. With ‘thing’, ‘concept’, ‘label’, ‘property’, ‘value’, ‘predicate’, ‘information resources’, ‘non-information resources’ etc. – there is quite a bit of room for misinterpretation in communication. I have looked at definitions, but these can actually sometimes hinder rather than help. I think that an attempt at a definitive glossary for Linked Data would help enormously.

Landscape

For me, it has taken a while to really get into the Linked Data way of thinking. I have actually kept a kind of diary of my thoughts over the last 2-3 months, and when I look back now at my earliest attempts at understanding how to model the data, they certainly show a pretty steep learning curve. I started, for example, by being unsure about whether we were wanting to provide information on the ‘creator’ of the archive or the archive itself and what sort of relationships between ‘things’ to include. I don’t think this is surprising, as the power of RDF is that it can be used to model anything – it doesn’t help you by giving you a limited scope or particular rules to start with (which is, of course, generally a good thing).

Archival descriptions

I listened to a number of audio tutorials, read a number of reports, blogs, etc., and learnt a great deal from these, but I still found the lack of examples within my own particular domain to be a barrier. Talis provide a very excellent tutorial that you can sit and listen to, but the real-world example is for a whiskey distillery. It somehow seems a long way away from an archival description! So, I would definitely say this lack of information for my domain was a barrier. But, of course, for others who want to output their finding aids as Linked Data in the future, we should start to see models developing that they can use, with examples and information to help (Locah, we hope, being one source of help).

Expertise and experience

The Locah team has a variety of expertise and experience, but it is undoubtedly true to say that I would be struggling a great deal more than I have done if we had not had the input of Pete Johnston from Eduserv, who has been very much involved in the EAD modelling. Whilst it is important (and pleasant) to give credit where it’s due, the real point here is actually that I think a certain level of expertise is important, to model data and output RDF. I have experience as an archivist and understand EAD and metadata; Pete also has experience of working with archival descriptions, and also substantial experience of metadata standards and issues around the Semantic Web and technical interoperability. We also have Bethan Ruddock working with us, who now has 18 months’ experience of working with EAD descriptions, and is a trained librarian. That is just the core team looking at the archival data modelling. In addition, the expertise of UKOLN will come into play with other aspects of the project.

I find it hard to see how this sort of work could currently be done by a team with substantially less experience in these sorts of areas. However, it is important to state that we will also be working with Talis, who have a great deal of expertise in Linked Data. They are providing access to their own Triple Store and other benefits that we can take advantage of. Others thinking of outputting Linked Data could look to involve companies like Talis more heavily, thus taking advantage of their expertise and requiring less in-house expertise.

The benefits of data modelling

One of the areas that I spent most time trying to find good tutorials about was data modelling. I may have missed some things that would have been very useful, but as it is I found that there simply wasn’t enough helpful information about how to create a data model. This would have saved me quite a bit of time because I think the data model is so central to what we are doing and provides such an effective way to visualise the entities and relationships between them. I think this was partly a case of examples being too simplistic, and partly a lack of data models that used catalogue data – not necessarily archival finding aids, but at least something similar.

The data

I think that we are going to find challenges around the actual content. There are numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name, or where the access points do not have rules or a source associated with them. I’ve just found some descriptions where the content for the ‘extent’ should actually be in the ‘scope’. Some descriptions have rather unsatisfactory references, some do not include the language field, a few do not even include the creator field. For some fields we will just be outputting literal values, but for others consistency would help a great deal with the creation of RDF, particularly when thinking about the vocabulary (or predicate) that we use to define the relationship between a subject and object. This is the challenge of creating Linked Data for descriptions that have been created by 200 different institutions over several decades and by 100s of different people. We’ll have to see how it goes!

The issue of access points

Within EAD there are access points, or index terms, associated with the description. These are most commonly subject, name and place. We’ve found that establishing the nature of the relationship between the unit of description and the access point is not easy. It looks like the relationship is going to be something very unspecific, such as ‘associatedWith’. I’m not sure yet whether this has any implications…

Conclusions

For me, after a few weeks away from thinking about Locah and Linked Data, getting back into the whole mindset actually takes about an hour and a nice cup of tea. In other words, the mindset I require to think about Linked Data currently feels separate from my normal working mindset. I think this is because LD requires something different. This in itself makes it quite challenging. It doesn’t fall naturally into what we do in the Hub and how we think about metadata.

However, the very big plus with this different kind of thinking is that really by definition it puts what the user is interested in at the forefront of your thinking. Well, maybe I should qualify that: I believe it puts what the user is interested in at the forefront. This is because we understand that users of archives are usually primarily interested in individuals, families, organisations, subjects and places. What they want is information on Sir Ernest Shackleton, Barbara Castle, Victorian theatre, town planning, a local business, a scientific organisation, the history of Manchester, the industry of Sheffield, or anything else. They don’t tend to know that they want to access a particular archive. Or if they do, it is often due to an assumption that there is ‘an archive’ on the person or organisation that they are researching. Even if there is an archive, there may be a misplaced assumption that this archive is pretty much all the stuff about that entity. Furthermore, there are going to be many many researchers out there who will not be aware of archives and how to access them. Linked Data provides a way to link archives into…well, into just about anything else.