Transforming EAD XML into RDF/XML using XSLT

This is a (brief!) second post revisiting my “process” diagram from an earlier post. Here I’ll focus on the “transform” process on the left of the diagram:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

The “transform” process is currently performed using XSLT to read an EAD XML document and output RDF/XML, and the current version of the stylesheet is now available:

(The data currently available via http://data.archiveshub.ac.uk/ was actually generated using the previous version http://data.archiveshub.ac.uk/xslt/20110502/ead2rdf.xsl. The 20110630 version includes a few tweaks and bug fixes which will be reflected when we reload the data, hopefully within the next week.)
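To give a rough feel for the kind of mapping involved, here is a minimal sketch in Python rather than XSLT, using only the standard library. It is not the stylesheet itself and does not reproduce its actual mappings: the base URI, the sample record, the choice of Dublin Core Terms properties and the function name `ead_to_rdfxml` are all assumptions made for illustration, and the sample uses un-namespaced (DTD-style) EAD.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT_NS = "http://purl.org/dc/terms/"

# Hypothetical base URI, for illustration only.
BASE_URI = "http://example.org/id/archivalresource/"

# Invented sample record using un-namespaced (DTD-style) EAD elements.
SAMPLE_EAD = """<ead>
  <archdesc level="fonds">
    <did>
      <unitid>gb-0000-abc</unitid>
      <unittitle>Papers of an Example Person</unittitle>
    </did>
  </archdesc>
</ead>"""


def ead_to_rdfxml(ead_xml: str) -> str:
    """Map the top-level <did> of an EAD document to a small RDF/XML description."""
    # Ask ElementTree to serialise with readable prefixes.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCT_NS)

    did = ET.fromstring(ead_xml).find("./archdesc/did")
    if did is None:
        raise ValueError("no archdesc/did element found")
    unitid = (did.findtext("unitid") or "").strip()
    title = (did.findtext("unittitle") or "").strip()

    # One rdf:Description whose URI is derived from the unit identifier.
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": BASE_URI + unitid},
    )
    ET.SubElement(desc, f"{{{DCT_NS}}}identifier").text = unitid
    ET.SubElement(desc, f"{{{DCT_NS}}}title").text = title
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    print(ead_to_rdfxml(SAMPLE_EAD))
```

The real transform covers far more of the EAD structure (nested components, names, subjects, dates), but the basic pattern of reading elements and emitting RDF properties about a minted URI is the same.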

As I’ve noted previously, we initially focused our efforts on processing the set of EAD documents held by the Archives Hub, and on the particular set of markup conventions recommended by the Hub for data contributors – what I have sometimes referred to as the Archives Hub EAD “profile” – though in practice the dataset we’ve actually worked with encompasses a good degree of variation. It remains the case that the transform is really designed to handle the EAD XML documents within that particular dataset rather than EAD in general. (I admit that it also remains somewhat “untidy”: the date handling is particularly messy, and parts of it were developed in a rather ad hoc fashion, amended as I encountered new variations in new batches of data. I should try to spend some time cleaning it up before the end of the project.)

Over the last few months, I’ve also been working on another JISC-funded project, SALDA, with Karen Watson and Chris Keene of the University of Sussex Library, focusing on making their catalogue data for the Mass Observation Archive available as Linked Data.

I wrote a post over on the SALDA blog on how I’d gone about applying and adapting the transform we developed in LOCAH for use with the SALDA data. That work has prompted me to think a bit more about the different facets of the data and how they are reflected in aspects of the transform process:

  • aspects which are generic/common to all EAD documents
  • aspects which are common to some quite large subset of EAD documents (such as the Archives Hub dataset, with its more-or-less common set of conventions)
  • aspects which are “generic” in some way, but require some sort of “local” parameterisation – here, I’m thinking of the sort of “name/keyword lookup” techniques I describe in the SALDA post: the technique is broadly usable but the “lookup tables” used would vary from one dataset to another
  • aspects which reflect very specific, “local” characteristics of the data – e.g., some of the SALDA processing is based on testing for text patterns/structures which are very particular to the Mass Observation catalogue data
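
The third category above, a generic technique with “local” parameterisation, can be sketched roughly as follows. This is a Python illustration, not the XSLT used in either project; the table contents, URIs and function names (`normalise`, `make_lookup`) are hypothetical. The point is only that the matching logic is shared while the table is supplied per dataset.

```python
from typing import Callable, Dict, Optional


def normalise(name: str) -> str:
    """Collapse whitespace and case so trivially different strings match."""
    return " ".join(name.split()).casefold()


def make_lookup(table: Dict[str, str]) -> Callable[[str], Optional[str]]:
    """Build a lookup function from a dataset-specific table of name -> URI."""
    index = {normalise(key): uri for key, uri in table.items()}

    def lookup(name: str) -> Optional[str]:
        # Returns None when the name is not in this dataset's table.
        return index.get(normalise(name))

    return lookup


# Hypothetical lookup tables: same technique, different local data.
HUB_NAMES = {
    "Mass Observation": "http://example.org/id/agent/mass-observation",
}
SALDA_KEYWORDS = {
    "Rationing": "http://example.org/id/concept/rationing",
}

if __name__ == "__main__":
    hub_lookup = make_lookup(HUB_NAMES)
    salda_lookup = make_lookup(SALDA_KEYWORDS)
    print(hub_lookup("  mass   observation "))
    print(salda_lookup("RATIONING"))
    print(salda_lookup("Mass Observation"))
```

Each dataset gets its own lookup function built from its own table, so a name known to one dataset is simply not found by another.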

What I’d like to do (but haven’t done yet) is to reorganise the transform to try to make it a little more “modular” and to separate the “general”/“generic” from the “local”/“specific”, so that it might be easier for other users to “plug in” components more suitable for their own data.
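
One possible shape for that kind of modularity, sketched in Python purely as an illustration (the reorganisation described above has not been done, and this is not a description of the stylesheet): generic rules and local rules are written as separate functions, and a transform is assembled from whichever rules suit a given dataset. The rule names, the prefixed property strings and the “bracketed year” title pattern are all invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Rule = Callable[[ET.Element, str], Iterable[Triple]]


def title_rule(did: ET.Element, subject: str) -> Iterable[Triple]:
    """Generic rule: any <unittitle> becomes a title statement."""
    title = did.findtext("unittitle")
    if title:
        yield (subject, "dcterms:title", title.strip())


def bracketed_year_rule(did: ET.Element, subject: str) -> Iterable[Triple]:
    """Hypothetical local rule: a title ending '(YYYY)' also yields a date."""
    title = did.findtext("unittitle") or ""
    match = re.search(r"\((\d{4})\)\s*$", title)
    if match:
        yield (subject, "dcterms:date", match.group(1))


def build_transform(rules: List[Rule]) -> Callable[[str, str], List[Triple]]:
    """Assemble a transform from a list of pluggable rules."""

    def transform(ead_xml: str, subject: str) -> List[Triple]:
        did = ET.fromstring(ead_xml).find("./archdesc/did")
        if did is None:
            return []
        triples: List[Triple] = []
        for rule in rules:
            triples.extend(rule(did, subject))
        return triples

    return transform


# Invented sample record using un-namespaced (DTD-style) EAD elements.
SAMPLE = (
    "<ead><archdesc><did>"
    "<unittitle>Survey replies (1939)</unittitle>"
    "</did></archdesc></ead>"
)

if __name__ == "__main__":
    generic = build_transform([title_rule])
    local = build_transform([title_rule, bracketed_year_rule])
    print(generic(SAMPLE, "ex:1"))
    print(local(SAMPLE, "ex:1"))
```

In XSLT itself, the natural counterpart would probably be a core stylesheet of generic templates, with dataset-specific stylesheets using xsl:import to pull it in and override or extend particular templates.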