Making sense of modelling EAD

Last week Pete Johnston, Bethan Ruddock and I got together and shut ourselves in a room for 5 hours with a whiteboard, a flipchart and our thinking caps on. Pete has already posted some thoughts about architecture and workflows following this meeting. I thought I would share some more informal thoughts of my own – from the perspective of an archivist and someone gradually getting to grips with Linked Data and RDF modelling.

Now that I understand a bit more about RDF, I can see where some of my misunderstandings were leading me astray. Firstly, it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data. This might seem obvious to those conversant in Linked Data, but I’ve been dealing with records as the unit of information for the last 20 years. With Linked Data you have to get away from this and think about the actual concepts within the data. The record (the EAD description in this case) exists as an entity along with everything else, but it can be misleading to take it as the starting point for data modelling.

I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point, and also because I kept going back to thinking of something like <http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton> as the starting point (the record itself). I then moved away from this and started thinking about the archival creator as a central concept. I knew that in RDF this person or organisation would be a subject. I also knew that this subject would need a URI and that we might want to tell people about stuff related to this subject, but I struggled with how we would provide URIs for subjects like this, and also how we would link the creator as subject to things like the index term subjects.

After a quick chat with Pete Johnston I started to understand the real role of URIs within Linked Data. We are probably going to create URIs ourselves for things (concepts) within the Hub. So, we might create a URI for every archival creator, and a URI for every repository, etc. We agreed that we needed to model the data within our world before looking too much at linking to data outside of it. Whilst I had listened to and read a good deal of literature on Linked Data, I somehow hadn’t quite got the idea that you might create URIs yourself for your own concepts, each backed by a document in its own right, so that you can use these URIs within your statements and include whatever information you think will be useful within those documents.

For example, we were using Sir Ernest Henry Shackleton as a sample record (the famous Antarctic Explorer). He would have a URI – something like archiveshub.ac.uk/id/person/sirernesthenryshackleton. By providing him with a URI we can then create triples (statements) that include this URI. For example:

archiveshub.ac.uk/id/person/sirernesthenryshackleton ‘created’ http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton.

We can then decide what information we will put in this document that identifies Sir Ernest, so that when researchers look up the URI, they get useful information. We can include links to external locations and we can look at using the ‘sameAs’ relationship to link to other representations of the same person.
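To make this a little more concrete, here is a rough sketch of what such statements might look like in code, using Python and the rdflib library. The URIs, the ‘created’ property and the VIAF identifier here are illustrative placeholders rather than our final choices:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDFS

    # Illustrative URIs only; the final URI patterns and vocabulary are still to be decided.
    person = URIRef("http://archiveshub.ac.uk/id/person/sirernesthenryshackleton")
    finding_aid = URIRef("http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton")

    # A placeholder namespace for Hub-specific terms such as 'created'.
    HUB = Namespace("http://archiveshub.ac.uk/def/")

    g = Graph()
    g.add((person, RDFS.label, Literal("Sir Ernest Henry Shackleton")))
    g.add((person, HUB.created, finding_aid))
    # A 'sameAs' link to another representation of the same person (hypothetical VIAF URI).
    g.add((person, OWL.sameAs, URIRef("http://viaf.org/viaf/XXXXXXX")))

    print(g.serialize(format="turtle"))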

Some URIs are fairly straightforward. We will create URIs for archival levels, and then these can in theory be used by others who want to identify levels within the data. For something like language, we will probably use URIs that are already available.

It is useful within data modelling to distinguish the real from the conceptual. So, going back to Sir Ernest, he is a flesh-and-blood person, and he can also be represented as a concept. If we are thinking about subjects used as index terms within the data, you might have ‘Exploration’ as a subject. We want Sir Ernest, the man described within our description, to be associated with this subject, so we can do this by making him into a concept, and giving that concept a URI. We can then link that to a literal value – his name. In our meeting we discussed one of the advantages of conceptual agents as being that we can distinguish between the person or organisation in its entirety and the person or organisation within this particular context. Archives often only represent a small part of someone’s life or an organisation’s activities, so it is helpful to talk about ‘Sir Ernest Shackleton’ as the explorer and leader of the British Antarctic Expedition of 1907-1909.
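Again as a purely illustrative sketch (the class and property choices below are assumptions, not our agreed model), the subject ‘Exploration’ could be a SKOS concept with its name as a literal label, and the conceptual Sir Ernest could carry his name as a literal and be linked to that subject:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS, FOAF, SKOS

    # Hypothetical Hub URIs for the person-as-concept and the index-term subject.
    person = URIRef("http://archiveshub.ac.uk/id/person/sirernesthenryshackleton")
    exploration = URIRef("http://archiveshub.ac.uk/id/concept/exploration")

    g = Graph()
    # The index term is modelled as a SKOS concept, with its name as a literal label.
    g.add((exploration, SKOS.prefLabel, Literal("Exploration", lang="en")))
    # The conceptual Sir Ernest gets a literal value for his name...
    g.add((person, FOAF.name, Literal("Sir Ernest Henry Shackleton")))
    # ...and an association with the subject (dcterms:subject is a placeholder property here).
    g.add((person, DCTERMS.subject, exploration))

    print(g.serialize(format="turtle"))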

So, we are now starting to move towards a model where we have URIs for a number of key concepts within the Hub. Our intention is to limit the number of concepts that we create URIs for, at least at this stage. We will also simplify some areas of the EAD modelling that we can then open up for investigation later on. For example, it would be good to look at version control and how we might filter changes to Hub descriptions through to the RDF XML, but we think that initially it is a good idea to create Linked Data from our basic model so that we can get feedback and also benefit from the learning process.

The main text-heavy field that we are planning to create URIs for at this stage is the Biographical and Administrative History. We haven’t yet explored this thoroughly, but with URIs for archival creators and URIs for administrative and biographical histories, one’s thoughts start to turn to name authorities and EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons and Families – a means to mark up information about archival creators in XML). We are not looking at creating EAC descriptions, but it would be good to keep in line with this in whatever ways we can in order to facilitate the subsequent creation of EAC records, or incorporation of our data into EAC records.

We will soon be able to share our current data model, so keep an eye on our blog. We welcome any feedback that the community might have.

Some thoughts on architecture and workflows

This is an attempt to sketch out some of my/our initial thoughts on the approaches the project is considering to exposing data as Linked Data. I should emphasise that these are very much initial thoughts, and things may change as we progress.

The project is dealing with two main data sources, and at the moment two different approaches are being considered for those sources.

The first data source is the collection of archival finding aids describing the holdings of the archives of educational and research institutions in the UK, aggregated by the JISC Archives Hub service. This data takes the form of XML documents in the Encoded Archival Description (EAD) format, created by archivists in the various institutions, and submitted to the Hub.

Currently, the aggregated data is indexed using the Cheshire 3 application, and exposed as HTML pages on the archiveshub.ac.uk site for search and browse. (SRU and Z39.50 targets and an OAI-PMH repository are also available.)

To expose (probably a subset of) the Hub EAD finding aids as Linked Data, the workflow is expected to look something like that represented in Figure 1 below:

[Figure 1: Diagram showing the process of transforming EAD to RDF and exposing it as Linked Data]
  1. Transform: EAD XML documents are transformed to an RDF format. We’ll write more about our current thinking on this in a subsequent post, as working out how best to represent the EAD data in RDF as the target for the transform is in itself a significant chunk of work (and an area I’m particularly interested in). This is likely to be something of an “iterative” process: we’ll start with a fairly basic transform that captures some subset of the content of the input documents, and perhaps refine things later to generate more data (and correct errors we’ll no doubt make in the first cut!). A rough illustration of this step is sketched after this list.
  2. Enhance: RDF data from the previous step is “enhanced” and augmented. This step might include processes to (i) generally “clean up” the data (e.g. normalise some literals, identify internal co-references etc.); (ii) add links to resources in other datasets; (iii) (maybe) pull in some useful data from other datasets, either data held by the Hub but not included in the EAD docs or data from other sources. Again this will probably be a process which we extend and refine over time.
  3. Upload: Load the RDF data from the previous step to an instance of the Talis Platform triple store, which Talis are kindly making available to the project.
  4. Expose: Expose a set of linked “bounded descriptions” from the triple store over HTTP, as documents in both human-readable and RDF formats, following the principles of the W3C TAG httpRange-14 resolution and the Cool URIs for the Semantic Web note (a sketch of the content negotiation this involves follows below). The use of the Platform also provides us with a SPARQL endpoint for the data – which we can make available to others to use – and which also means we can consider layering other Web interfaces over that endpoint. For example, I’d be interested in trying out the Linked Data API, which I talked about over on eFoundations a while ago.
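By way of illustration of the Transform step, the transform could be driven by an XSLT stylesheet applied to each EAD document, for example along the lines below (the file names and the stylesheet itself are hypothetical placeholders; we haven’t settled on a toolchain):

    from lxml import etree

    # Hypothetical file names; the real workflow would iterate over the Hub's EAD documents.
    ead_doc = etree.parse("gb15sirernesthenryshackleton.xml")
    stylesheet = etree.parse("ead2rdf.xsl")  # an EAD-to-RDF/XML stylesheet, still to be written

    transform = etree.XSLT(stylesheet)
    rdf_xml = transform(ead_doc)

    # Write out the resulting RDF/XML, ready for the "enhance" and "upload" steps.
    with open("gb15sirernesthenryshackleton.rdf", "wb") as out:
        out.write(etree.tostring(rdf_xml, pretty_print=True, xml_declaration=True, encoding="UTF-8"))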
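And as a sketch of what the Expose step involves at the HTTP level: a URI identifying a “thing” (a person, say) answers with a 303 redirect to a document about that thing, in HTML or RDF depending on the Accept header, as per the httpRange-14 resolution. The little Flask application and URI patterns below are illustrative assumptions only, not a description of the software we will actually use:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.route("/id/person/<name>")
    def person_uri(name):
        # This URI identifies the person (a non-information resource), so we answer
        # with a 303 See Other pointing at a document about that person.
        best = request.accept_mimetypes.best_match(["application/rdf+xml", "text/html"])
        if best == "application/rdf+xml":
            return redirect(f"/doc/person/{name}.rdf", code=303)
        return redirect(f"/doc/person/{name}", code=303)

    # Handlers for /doc/person/... (serving the HTML and RDF "bounded descriptions") would go here.

    if __name__ == "__main__":
        app.run()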

It may be that the second and third steps are reversed and we upload the data to the triple store and perform the “enhance” step on the data there, i.e. something closer to Figure 2:

[Figure 2: Diagram showing the process of transforming EAD to RDF and exposing it as Linked Data]

Or indeed that a “hybrid” of the two is appropriate, and some “enhance” processes take place before upload and others take place afterwards.

We’ll also need to integrate some provision for “version control” and “provenance”/”attribution” (e.g. to track which data comes directly from the EAD sources, and which is added from elsewhere) into this process.

So for the Hub data, the plan is that the data is “exported” from the existing EAD dataset, and that the Platform triplestore provides the “back-end” for the app that serves up the “Linked Data” document views and provides a SPARQL endpoint.

The second data source of interest is the collection of bibliographic metadata aggregated into the Copac catalogue from the member libraries of Research Libraries UK and from other specialist libraries. This data is also held as XML, in the MODS format. (Bethan Ruddock has a couple of posts on the Copac Development blog which describe the processes by which data is transferred from the contributor libraries to the Copac catalogue.)

As for the case of the Archives Hub data, the first stage will be to design an appropriate RDF representation and an algorithm for transforming the MODS data to RDF (or to select – and maybe adapt, if necessary – an existing one).

In contrast to the case of the Hub I outlined above, the plan is to serve the RDF data from the existing Copac database, rather than upload it to a triplestore. This will probably require the development of a small additional application (or maybe just the configuration of an HTTP server) to service the new URIs coined for resources, to support content negotiation and redirect to URIs of appropriate pages.

One of the questions raised by this approach is how to handle the process I described above as “enhance”, and in particular how to accommodate the addition of new data – at a minimum, links to existing resources described in other Linked Data datasets – assuming that we aren’t going to be able to update the source MODS XML documents. For some cases, it may be trivial to incorporate this in the MODS-to-RDF transform (e.g., to generate links to languages described by lexvo.org). Another approach might be to generate simple “seeAlso” links to an additional set of documents (which could be simple static documents or could be served from an RDF store). Hmm. As you can probably tell, I’ve thought about this rather less than I’ve thought about the Hub case! Anyway, the suggested approach is sketched in Figure 3:

[Figure 3: Diagram showing the process of transforming MODS to RDF and exposing it as Linked Data]
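As a sketch of the two “enhance” options mentioned above (with hypothetical Copac URIs, since none have been coined yet), the snippet below shows a language link to lexvo.org that could be generated directly in the MODS-to-RDF transform, and an rdfs:seeAlso link pointing at an additional document that could hold further enrichment:

    from rdflib import Graph, URIRef
    from rdflib.namespace import DCTERMS, RDFS

    # Hypothetical URI for a bibliographic record exposed from Copac.
    record = URIRef("http://copac.ac.uk/id/record/12345")

    g = Graph()
    # lexvo.org provides stable URIs for languages keyed on ISO 639-3 codes, so a MODS
    # language code such as 'eng' can be turned into a link at transform time.
    g.add((record, DCTERMS.language, URIRef("http://lexvo.org/id/iso639-3/eng")))
    # Alternatively (or additionally), point to a separately maintained document of enrichments.
    g.add((record, RDFS.seeAlso, URIRef("http://copac.ac.uk/doc/record/12345/links")))

    g.serialize(destination="copac_record_12345.ttl", format="turtle")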

Another constraint of this approach would be that although we can serve the set of linked documents, it doesn’t provide a SPARQL endpoint.

One of the expectations for the project is that it “explores and reports on the opportunities and barriers in making content structured and exposed”, and an assessment of the pros and cons of the different approaches to hosting the data should contribute to that report.

LOCAH Project – Projected Timeline, Workplan & Overall Project Methodology

Project Plan

WP1: Project Management

  • Project management to support the project, the relationships with project partners, and with the funders.

WP2:  Data Modelling

  • Model Archives Hub EAD data and Copac data to RDF

WP3:  Technical Development – Linked Data Interface

  • Transform the modelled EAD and MODS data to RDF XML
  • Enrich Hub and Copac data with data/links from sources such as DBpedia, the BBC, LOC, VIAF, MusicBrainz and Freebase
  • Provide both RDF and HTML documents for Archives Hub and Copac resources with stable, well-designed URIs
  • Provide a SPARQL endpoint for the Hub Linked Data resources (an illustrative query is sketched after this list)
  • Look at the feasibility of providing a RESTful API interface to the Hub and Copac Linked Data resources
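To give a flavour of what the SPARQL endpoint deliverable would make possible, a client could query the Hub data along the following lines. The endpoint URL and the vocabulary terms in the query are placeholders, since the data model is still being worked out:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint URL and vocabulary, shown only to indicate the kind of
    # access a SPARQL endpoint provides over the Hub Linked Data.
    sparql = SPARQLWrapper("http://data.archiveshub.ac.uk/sparql")
    sparql.setQuery("""
        PREFIX dcterms: <http://purl.org/dc/terms/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?collection ?title WHERE {
          ?collection dcterms:subject ?subject ;
                      dcterms:title ?title .
          ?subject rdfs:label ?label .
          FILTER regex(?label, "exploration", "i")
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    for row in results["results"]["bindings"]:
        print(row["collection"]["value"], row["title"]["value"])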

WP4: Prototype Development

  • Test and refine requirements for proposed prototypes
  • Design user interfaces for the prototype
  • Technical development and testing of the user interfaces

WP5: ‘Opportunities and Barriers’ Reporting

  • Design and implement procedures for logging ongoing project issues
  • Analyse and synthesise logged issues in relation to known Linked Data issues
  • Report on opportunities and barriers via the project blog, outlining methods and recommendations on how to overcome, mediate or mitigate the issues identified wherever possible

WP6: Advocacy and Dissemination

  • Report on ongoing project progress and findings at JISC programme events
  • Demonstrate project outputs and report to communities on the findings of the opportunities and barriers reporting at relevant conferences and workshops

Timetable

WP / Month: 1 2 3 4 5 6 7 8 9 10 11 12
WP1 X X X X X X X X X X X X
WP2 X X X X
WP3 X X X X X
WP4 X X X X X X X
WP5 X X X X X X X X X X
WP6 X X X X X X X X X

Project Management and Staffing

Adrian Stevenson will project manage LOCAH to ensure that the workplan is carried out to the timetable, and that effective dissemination and evaluation mechanisms are implemented according to the JISC Project Management guidelines. Consortium agreements in line with JISC guidelines will be established for the project partners. UKOLN will lead on all the workpackages. Staff who will work on LOCAH are already in post.

Support for Standards, Accessibility and Other Best Practices

LOCAH will adhere to the guidance and good practice provided by JISC in the Standards Catalogue and JISC Information Environment. The primary technology methodologies, standards and specifications adopted for this project will be:

  • XML, XSLT, RDF XML, RDFa, FOAF, SKOS, SPARQL, n3, JSON, RSS/ATOM
  • Metadata standards: EAD, MODS, Dublin Core
  • Berners-Lee, T. (2006). ‘Linked Data – Design Issues’
  • Berners-Lee, T. (1998). ‘W3C Style: Cool URIs don’t change’
  • Cabinet Office. ‘Designing URI Sets for the UK Public Sector’
  • Dodds, L. and Davis, I. ‘Linked Data Patterns’
  • W3C Web Accessibility Initiative (WAI)

LOCAH Project – Project Team Relationships and End User Engagement

Project Team

Adrian Stevenson

Adrian Stevenson is a project manager and researcher at UKOLN. He has managed the highly successful SWORD project since May 2008 and also manages the JISC Information Environment Technical Review project. He has extensive experience of the implementation of interoperability standards, and has a long-standing interest in Linked Data. Adrian will manage LOCAH, and will be involved in the data modelling work, testing and the opportunities and barriers reporting.

Jane Stevenson

Jane Stevenson is the Archives Hub Coordinator at Mimas. In this role, she manages the day-to-day running of the Archives Hub service. She is a registered archivist with substantial experience of cataloguing, implementation of data standards, dissemination and online service provision. She has expertise in the use of Encoded Archival Description for archives, and will be involved in the data modelling work, mapping EAD to RDF, testing, as well as the opportunities and barriers reporting.

Pete Johnston

Pete Johnston is a Technical Researcher at Eduserv. His work has been primarily in the areas of metadata/resource description, with a particular interest in the use of Semantic Web technologies and the Linked Data approach. He participates in a number of standards development activities, and is an active contributor to the work of the Dublin Core Metadata Initiative. He was also a co-editor of the Open Archives Initiative Object Reuse and Exchange (OAI ORE) specifications.

Pete joined Eduserv in May 2006 from UKOLN, University of Bath, where he advised the UK education and cultural heritage communities on strategies for the effective exchange and reuse of information. Pete will be involved in the data modelling work, mapping EAD and MODS to RDF, software testing and the opportunities and barriers report.

Bethan Ruddock

Bethan Ruddock is involved in content development activity for both the Archives Hub and Copac. She is currently working on a year-long project to help expand the coverage of the Archives Hub through the refinement of our automated data import routines. Bethan also undertakes a range of outreach and promotional activities, collaborating with Lisa on a number of publications. Bethan will be involved in the modelling work of transforming MODS to RDF.

Julian Cheal

Julian Cheal is a software developer at UKOLN. He is currently working on the analysis and visualisation of UK open access repository metadata from the RepUK project. He has experience of writing software to process metadata at UKOLN, and has previous development experience at Aberystwyth University. Julian will be mainly involved in developing the prototype and visualisations.

Ashley Sanders

Ashley Sanders is the Senior Developer for Copac, and has been working with the service since its inception. He is currently leading the technical work involved in the Copac Re-Engineering project, which involves a complete overhaul of the service. Ashley will be involved in the development work of transforming MODS to RDF.

Shirley Cousins is a Coordinator for the Copac service. Shirley will be involved in the work of transforming MODS to RDF.

An additional Mimas developer will provide the development work for transforming the Archives Hub EAD data to RDF. This person will be allocated from existing Mimas staff in post.

Talis are our technology partner on the project, kindly providing us with access to store our data in the Talis Store. Leigh Dodds is our main contact at the company. Talis is a privately owned UK company that is amongst the first organisations to be applying leading edge Semantic Web technologies to the creation of real-world solutions. Talis has significant expertise in semantic web and Linked Data technologies, and the Talis Platform has been used by a variety of organisations including the BBC and UK Government as part of data.gov.uk.

OCLC are also partnering us, mainly to help out with VIAF. Our contacts at OCLC are John MacColl, Ralph LeVan and Thom Hickey. OCLC is a worldwide library cooperative, owned, governed and sustained by members since 1967. Its public purpose is to work with its members to improve access to the information held in libraries around the globe, and find ways to reduce costs for libraries through collaboration. Its Research Division works with the community to identify problems and opportunities, prototype and test solutions, and share findings through publications, presentations and professional interactions.

Engagement with the Community

Stakeholders

Several key stakeholder groups have been identified: end users, particularly historical researchers, students & educators; data providers, including RLUK and the libraries & archives that contribute data to the services; the developer community; the library community; the archival sector and more broadly, the cultural heritage sector.

End users

Copac and the Archives Hub services are heavily used by historical researchers and educators. Copac is one of JISC’s most heavily used services, averaging around one million sessions per month. Around 48% of HE research usage can be attributed to historical research. Both services can directly engage relevant end users, and have done so successfully in the past to conduct market research or solicit feedback on service developments. In addition, channels such as Twitter can be used to reach end users, particularly the digital humanities community.

Data providers; Library Community; Archival Community; Cultural Heritage Sector

Through the Copac and Archives Hub Steering Committees we have the means to consult with a wide range of representatives from the library and archival sectors. The project partners have well-established links with stakeholders such as RLUK, SCONUL, and the UK Archives Discovery Network, which represents all the key UK archives networks including The National Archives and the Scottish Archives Networks. The Archives Hub delivers training and support to the UK archives community, and can effectively engage its contributors through workshops, fora, and social media. OCLC’s community engagement channels will also provide a valuable means of sharing project outputs for feedback internationally. The key project partners are also engaged in the Resource Discovery Taskforce Vision implementation planning, as well as the JISC/SCONUL Shared Services Proposal. Outputs from this project will be shared in both these contexts. In addition, we will proactively share information with bodies such as the MLA, Collections Trust and Culture24.

Developer Community

As a JISC innovation support centre, UKOLN is uniquely placed to engage the developer community through initiatives such as the DevCSI programme, which is aimed at helping developers in HE to realise their full potential by creating the conditions for them to be able to learn, to network effectively, to share ideas and to collaborate.

Dissemination

The primary channel for disseminating the project outputs will be the UKOLN-hosted blog. End users will be primarily engaged for survey feedback via the Copac and Archives Hub services. Social media will be used to reach subject groups with active online communities (e.g. Digital Humanities). Information aimed at the library and archival community, including data providers, will be disseminated through reports to service Steering Group meetings, UKAD meetings, the Resource Discovery Taskforce Vision group, the JISC/SCONUL Shared Services Proposal Group, as well as professional listservs. Conference presentations and demonstrations will be proposed for events such as ILI, Online Information, and JISC conferences. An article will be written for Ariadne. The developer community will be engaged primarily through the project blog, Twitter, developer events & the Linked Data competition.

LOCAH Project – Intellectual Property Rights (IPR)

The project will be managed according to JISC guidelines for intellectual property. Any custom-built prototype outputs will be made available under an open-source licence, free of charge, to the UK HE and FE community. There may be some rights restrictions relating to the Copac and Hub data content due to data licensing issues. These will be explored and addressed as part of the project.

LOCAH Project – Risk Analysis, Evaluation and Impact

Risk Analysis

Risks, with probability, severity and score (probability x severity), and the action or mitigation for each:

  • Difficulties recruiting or retaining staff (Probability 2, Severity 4, Score 8). Mitigation: key members of staff are already in post at UKOLN, Mimas and Eduserv.
  • Project is over-ambitious (Probability 2, Severity 2, Score 4). Mitigation: the project plan will ensure that deliverables are delivered in a timely fashion and that the project does not divert from agreed goals.
  • Failure to meet deadlines within the project timescale (Probability 2, Severity 4, Score 8). Mitigation: a clear project plan with all relevant tasks outlined, with continuous review and rescheduling of work as necessary.
  • Failure to disseminate best practices effectively (Probability 2, Severity 2, Score 4). Mitigation: UKOLN has very effective dissemination channels, and the involvement of partners who can gain clear benefits from this work will allow them to be involved in dissemination activities.
  • Project partners fail to work effectively (Probability 1, Severity 3, Score 3). Mitigation: UKOLN has good links with all the partners, many through previous joint projects and recent consultancy work. A consortium agreement will address potential concerns.

Evaluation

LOCAH will be evaluated by a number of means including qualitative and quantitative methods, and will look at both the tangible and intangible outputs of the project. We will regularly check progress against the project plan and requirements, and we will engage with users through the blog, social media, questionnaires and events. The project manager will lead the evaluation, liaising with relevant parties and drawing on contacts within the JISC community and wider HE community.

Impact

Several members of the project team are closely involved with current Linked Data activities, and are fully aware of the current ‘state of the art’ against which the impact of the project will be evaluated. The immediate impact of the project will be to provide two new enriched and quality assured data sets to the UK HE and global data graph. It will also provide a prototype that highlights the potential of Linked Data for enhancing learning, teaching and research. The long-term impact will be to help Linked Data gain traction and achieve a critical mass in the UK HE community, as well as providing invaluable experience and insight on a range of issues. Mimas intends to sustain the Linked Data sets, and will ensure that the resources have stable URIs for two years beyond the life of the project. The project may be able to transition to using the Talis Connected Commons scheme if the licensing situation can be clarified. This would then provide long-term sustainability for the data publishing.



LOCAH Project – Wider Benefits to Sector & Achievements for Host Institution

Meeting a need

High quality research and teaching relies partly on access to a broad range of resources. Archive and library materials inform and enhance knowledge and are central to the JISC strategy. JISC invests in bibliographic and archival metadata services to enable discovery of, and access to, those materials, and we know the research, teaching and learning communities value those services.

As articulated in the Resource Discovery Taskforce Vision, that value could be increased if the data can be made to “work harder”, to be used in different ways and repurposed in different contexts.

Providing bibliographic and archive data as Linked Data creates links with other data sources, and allows the development of new channels into the data. Researchers are more likely to discover sources that may materially affect their research outcomes, and the ‘hidden’ collections of archives and special collections are more likely to be exposed and used.

Archive data is by its nature incomplete and often sources are hidden and little known. User studies and log analyses indicate that Archives Hub users frequently search laterally through the descriptions; this gives them a way to make serendipitous discoveries. Linked Data is a way of vastly expanding the benefits of lateral search, helping users discover contextually related materials. Creating links between archival collections and other sources is crucial – archives relating to the same people, organisations, places and subjects are often widely dispersed. By bringing these together intellectually, new discoveries can be made about the life and work of an individual or the circumstances surrounding important historical events. New connections, new relationships, new ideas about our history and society. Put this together with other data sources, such as special collections, multimedia repositories and geographic information systems, and the opportunities for discovery are significantly increased.

Similarly, by making Copac bibliographic data available as Linked Data we can increase the opportunities for developers to provide contextual links to primary and secondary source material held within the UK’s research libraries and an increasing number of specialist libraries, including the British Museum, the National Trust, and the Royal Society. The provision of library and special collections content as Linked Data will allow developers to build interfaces to link contextually related historical sources that may have been curated and described using differing methodologies. The differences in these methodologies and the emerging standards for description and access have resulted in distinct challenges in providing meaningful cross-searching and interlinking of this related content – a Linked Data approach offers potential to overcome that significant hurdle.

Researchers and teachers will have the ability to repurpose data for their own specific use. Linked Data provides flexibility for people to create their own pathways through Archives Hub and Copac data alongside other data sources. Developers will be able to provide applications and visualisations tailored to the needs of researchers, learning environments, institutional and project goals.

Innovation

Archives are described hierarchically, and this presents challenges for the output of Linked Data. In addition, descriptions are a combination of structured data and semi-structured data. As part of this project, we will explore the challenges in working with semi-structured data, which can potentially provide a very rich source of information. The biographical histories for creators of archives may provide unique information that has been based on the archival source. Extracting event-based data from this can really open up the potential of the archival description to be so much more than the representation of an archive collection. It becomes a much more multi-faceted resource, providing data about people, organisations, places and events.

The library community is beginning to explore the potential of Linked Data. The Swedish and Hungarian National Libraries have exposed their catalogues as Linked Data, the Library of Congress has exposed subject authority data (LCSH), and OCLC is now involved in making the Virtual International Authority File (VIAF) available in this way.

By treating the entities (people, places, concepts etc) referred to in bibliographic data as resources in their own right, links can be made to other data referring to those same resources. Those other sources can be used to enrich the presentation of bibliographic data, and the bibliographic data can be used in conjunction with other data sources to create new applications.

Copac is the largest union catalogue of bibliographic data in the UK, and one of the largest in the world, and its exposure as Linked Data can provide a rich data source, of particular value to the research, learning and teaching communities.

In answering the call, we will be able to report on the challenges of the project, and how we have approached them. This will be of benefit to all institutions with bibliographic and archival data looking to maximise its potential. We are very well placed within the research and teaching communities to share our experiences and findings.