Lifting the Lid on Linked Data at ELAG 2011

Jane and I have just given our ‘Lifting the Lid on Linked Data’ presentation at the ELAG (European Library Automation Group) Conference 2011 in Prague. It seemed to go pretty well. There were a few comments on the #elag2011 Twitter stream about the licensing situation for the Copac data, which is something we’re still working on.

[slideshare id=8082967&doc=elag2011-locah-110524105057-phpapp02]

Archives Hub Linked Data Release

We’re very pleased to announce the release of http://data.archiveshub.ac.uk, the first Linked Data set produced by the LOCAH project. The team has been working hard since the beginning of the project on modelling the complex archival data and transforming it into RDF Linked Data. This is now available in a variety of forms via the data.archiveshub.ac.uk home page. A number of previous blog posts outline the modelling and transformation process, the RDF terms used in the data, and the challenges and opportunities arising along the way. A forthcoming post will provide some example queries for accessing data from the SPARQL query endpoint. The data and content are licensed under a Creative Commons CC0 1.0 licence.

We’re working on a visualisation prototype that shows how we link the Hub data with other Linked Data sources on the Web, using our enhanced dataset to provide a useful graphical resource for researchers.

One important point to note is that this initial release is a selected subset, representative of the Hub collection descriptions, offered as a proof of concept. It does not contain the full Archives Hub dataset at present, although we are very keen to explore this in the future.

We still have some work to do, this being the initial release of the Hub data. A later release will address a few issues, including reconciling our internal person and subject names. It will also enhance the data further with links to Library of Congress subject headings and more links to DBpedia based on subject terms. We also hope to include links for place names using Geonames and Ordnance Survey.

We encourage feedback on the data, the model and any other aspect of data.archiveshub.ac.uk, so please leave comments or contact us directly.

We are also working hard on our other main LOCAH release, the Copac Linked Data. Our first version of the model for this is now finished, and we have the data in our test triple store. We hope to release this in about a month’s time.

I’d personally like to thank the LOCAH team for all their hard work on this exciting and challenging project. I’d also like to thank our technology partner, Talis, for kindly providing our Linked Data store.

Creating Linked Data: more reflections from the coal face

This post is to highlight some of the barriers and challenges to the creation of Linked Data.  This is a personal reflection, trying to be honest about the challenges as I have found them and the learning experience, which is inevitably a personal thing depending upon your own background, experience and ways of thinking and working. However, I think it also reflects some of the general challenges as we have come across them.

Vocabulary

It comes as no surprise that I have found the terminology somewhat confusing, and it has sometimes led me astray. Only this week Bethan and I were getting tangled up in a conversation about ‘things’  within the data model. We spent a while talking about how having a ‘Hub conceptualisation’ and a ‘thing-in-its-own-right conceptualisation’ of an entity would allow for more clarity. With ‘thing’, ‘concept’, ‘label’, ‘property’, ‘value’, ‘predicate’, ‘information resources’, ‘non-information resources’ etc. – there is quite a bit of room for misinterpretation in communication. I have looked at definitions, but these can actually sometimes hinder rather than help. I think that an attempt at a definitive glossary for Linked Data would help enormously.

Landscape

For me, it has taken a while to really get into the Linked Data way of thinking. I have actually kept a kind of diary of my thoughts over the last 2-3 months, and when I look back now at my earliest attempts at understanding how to model the data, they certainly show a pretty steep learning curve. I started, for example, by being unsure about whether we were wanting to provide information on the ‘creator’ of the archive or the archive itself and what sort of relationships between ‘things’ to include. I don’t think this is surprising, as the power of RDF is that it can be used to model anything – it doesn’t help you by giving you a limited scope or particular rules to start with (which is, of course, generally a good thing).

Archival descriptions

I listened to a number of audio tutorials, read a number of reports, blogs, etc., and learnt a great deal from these, but I still found the lack of examples within my own particular domain to be a barrier. Talis provide an excellent tutorial that you can sit and listen to, but the real-world example is for a whiskey distillery. It somehow seems a long way away from an archival description! So, I would definitely say this lack of information for my domain was a barrier. But, of course, for others who want to output their finding aids as Linked Data in the future, we should start to see models developing that they can use, with examples and information to help (Locah, we hope, being one source of help).

Expertise and experience

The Locah team has a variety of expertise and experience, but it is undoubtedly true to say that I would be struggling a great deal more than I have done if we had not had the input of Pete Johnston from Eduserv, who has been very much involved in the EAD modelling. Whilst it is important (and pleasant) to give credit where it’s due, the real point here is that I think a certain level of expertise is important in order to model data and output RDF. I have experience as an archivist and understand EAD and metadata; Pete also has experience of working with archival descriptions, as well as substantial experience of metadata standards and issues around the Semantic Web and technical interoperability. We also have Bethan Ruddock working with us, who now has 18 months’ experience of working with EAD descriptions and is a trained librarian. That is just the core team looking at the archival data modelling. In addition, the expertise of UKOLN will come into play with other aspects of the project.

I find it hard to see how this sort of work could currently be done by a team with substantially less experience in these sorts of areas. However, it is important to state that we will also be working with Talis, who have a great deal of expertise in Linked Data. They are providing access to their own Triple Store and other benefits that we can take advantage of. Others thinking of outputting Linked Data could look to involve companies like Talis more heavily, thus taking advantage of their expertise and requiring less in-house expertise.

The benefits of data modelling

One of the areas that I spent most time trying to find good tutorials about was data modelling. I may have missed some things that would have been very useful, but as it is I found that there simply wasn’t enough helpful information about how to create a data model. Better guidance would have saved me quite a bit of time, because I think the data model is so central to what we are doing and provides such an effective way to visualise the entities and the relationships between them. I think this was partly a case of examples being too simplistic, and partly a lack of data models that used catalogue data – not necessarily archival finding aids, but at least something similar.

The data

I think that we are going to find challenges around the actual content. There are numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name, or where the access points do not have rules or a source associated with them. I’ve just found some descriptions where the content for the ‘extent’ should actually be in the ‘scope’. Some descriptions have rather unsatisfactory references, some do not include the language field, and a few do not even include the creator field. For some fields we will just be outputting literal values, but for others consistency would help a great deal with the creation of RDF, particularly when thinking about the vocabulary (or predicate) that we use to define the relationship between a subject and object. This is the challenge of creating Linked Data for descriptions that have been created by 200 different institutions over several decades and by hundreds of different people. We’ll have to see how it goes!
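To give a flavour of the sort of checks this implies, here is a minimal Python sketch. It is not the project's actual code: the field names (`reference`, `creator`, `language`), the sample records and the `find_issues` helper are all assumptions made up for illustration, and real EAD descriptions would first need parsing into something like these dictionaries.

```python
import re

# Hypothetical field names; real EAD elements would need mapping to these.
REQUIRED_FIELDS = ("reference", "creator", "language")


def find_issues(record):
    """Return a list of human-readable problems found in one description."""
    issues = []
    for field in REQUIRED_FIELDS:
        # Treat a missing key, None, or a blank string as "not supplied".
        if not (record.get(field) or "").strip():
            issues.append(f"missing {field}")
    creator = record.get("creator") or ""
    # 'Joe Bloggs and others' cannot be mapped cleanly to a single entity.
    if re.search(r"\band others\b", creator, flags=re.IGNORECASE):
        issues.append("creator is not a single name")
    return issues


# Invented sample records, for illustration only.
records = [
    {"reference": "GB 1", "creator": "Joe Bloggs and others", "language": "English"},
    {"reference": "GB 2", "creator": "", "language": ""},
]

for record in records:
    print(record["reference"], find_issues(record))
```

A report like this would not fix anything by itself, but it could show which descriptions are safe to turn into more than literal values.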

The issue of access points

Within EAD there are access points, or index terms, associated with the description. These are most commonly subject, name and place. We’ve found that establishing the nature of the relationship between the unit of description and the access point is not easy. It looks like the relationship is going to be something very unspecific, such as ‘associatedWith’. I’m not sure yet whether this has any implications…
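As a rough illustration of what such an unspecific relationship looks like in practice, the sketch below builds one triple per access point, all sharing the same ‘associatedWith’ predicate. The `http://example.org/` namespace, the identifiers and the helper functions are hypothetical and are not the URIs or vocabulary used in the Hub data; the output is written in an N-Triples style purely so that the pattern is easy to see.

```python
# Hypothetical namespace and identifiers, for illustration only.
BASE = "http://example.org/"


def access_point_triples(unit_id, access_points):
    """Return one (subject, predicate, object) triple per access point.

    Every access point gets the same generic 'associatedWith' predicate,
    whatever its type (subject, name or place).
    """
    subject = f"{BASE}unit/{unit_id}"
    predicate = f"{BASE}vocab/associatedWith"
    return [
        (subject, predicate, f"{BASE}{kind}/{slug}")
        for kind, slug in access_points
    ]


def to_ntriples(triples):
    """Serialise triples of URIs, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = access_point_triples(
    "unit1",
    [
        ("subject", "town-planning"),
        ("place", "manchester"),
        ("person", "barbara-castle"),
    ],
)
print(to_ntriples(triples))
```

The predicate is identical in every statement, so the triples alone cannot say whether the unit of description is about Manchester, was created there, or merely mentions it.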

Conclusions

For me, after a few weeks away from thinking about Locah and Linked Data, getting back into the whole mindset actually takes about an hour and a nice cup of tea. In other words, the mindset I require to think about Linked Data currently feels separate from my normal working mindset. I think this is because LD requires something different. This in itself makes it quite challenging. It doesn’t fall naturally into what we do in the Hub and how we think about metadata.

However, the very big plus with this different kind of thinking is that, really by definition, it puts what the user is interested in at the forefront of your thinking. Well, maybe I should qualify that: I believe it puts what the user is interested in at the forefront. This is because we understand that users of archives are usually primarily interested in individuals, families, organisations, subjects and places. What they want is information on Sir Ernest Shackleton, Barbara Castle, Victorian theatre, town planning, a local business, a scientific organisation, the history of Manchester, the industry of Sheffield, or anything else. They don’t tend to know that they want to access a particular archive. Or if they do, it is often due to an assumption that there is ‘an archive’ on the person or organisation that they are researching. Even if there is an archive, there may be a misplaced assumption that this archive is pretty much all the stuff about that entity. Furthermore, there are going to be a great many researchers out there who will not be aware of archives and how to access them. Linked Data provides a way to link archives into…well, into just about anything else.