Creating Linked Data: more reflections from the coal face

This post is to highlight some of the barriers and challenges to the creation of Linked Data.  This is a personal reflection, trying to be honest about the challenges as I have found them and the learning experience, which is inevitably a personal thing depending upon your own background, experience and ways of thinking and working. However, I think it also reflects some of the general challenges as we have come across them.


It comes as no surprise that I have found the terminology somewhat confusing, and it has sometimes led me astray. Only this week Bethan and I were getting tangled up in a conversation about ‘things’  within the data model. We spent a while talking about how having a ‘Hub conceptualisation’ and a ‘thing-in-its-own-right conceptualisation’ of an entity would allow for more clarity. With ‘thing’, ‘concept’, ‘label’, ‘property’, ‘value’, ‘predicate’, ‘information resources’, ‘non-information resources’ etc. – there is quite a bit of room for misinterpretation in communication. I have looked at definitions, but these can actually sometimes hinder rather than help. I think that an attempt at a definitive glossary for Linked Data would help enormously.


For me, it has taken a while to really get into the Linked Data way of thinking. I have actually kept a kind of diary of my thoughts over the last 2-3 months, and when I look back now at my earliest attempts at understanding how to model the data, they certainly show a pretty steep learning curve. I started, for example, by being unsure about whether we were wanting to provide information on the ‘creator’ of the archive or the archive itself and what sort of relationships between ‘things’ to include. I don’t think this is surprising, as the power of RDF is that it can be used to model anything – it doesn’t help you by giving you a limited scope or particular rules to start with (which is, of course, generally a good thing).

Archival descriptions

I listened to a number of audio tutorials, read a number of reports, blogs, etc., and learnt a great deal from these, but I still found the lack of examples within my own particular domain to be a barrier. Talis provide a very excellent tutorial that you can sit and listen to, but the real-world example is for a whiskey distillery. It somehow seems a long way away from an archival description! So, I would definitely say this lack of information for my domain was a barrier. But, of course, for others who want to output their finding aids as Linked Data in the future, we should start to see models developing that they can use, with examples and information to help (Locah, we hope, being one source of help).

Expertise and experience

The Locah team has a variety of expertise and experience, but it is undoubtedly true to say that I would be struggling a great deal more than I have done if we had not had the input of Pete Johnston from Eduserv, who has been very much involved in the EAD modelling. Whilst it is important (and pleasant) to give credit where it’s due,  the real point here is actually that I think a certain level of expertise is important, to model data and output RDF. I have experience as an archivist and understand EAD and metadata, Pete also has experience of working with archival descriptions, and also substantial experience of metadata standards and issues around the Semantic Web and technical interoperability. We also have Bethan Ruddock working with us, who now has 18 months experience of working with EAD descriptions, and is a trained librarian. That is just the core team looking at the archival data modelling.  In addition, the expertise of UKOLN will come into play with other aspects of the project.

I find it hard to see how this sort of work could currently be done by a team with substantially less experience in these sorts of areas. However, it is important to state that we will also be working with Talis, who have a great deal of expertise in Linked Data. They are providing access to their own Triple Store and other benefits that we can take advantage of. Others thinking of outputting Linked Data could look to involve companies like Talis more heavily, thus taking advantage of their expertise and requiring less in-house expertise.

The benefits of data modelling

One of the areas that I spent most time trying to find good tuturials about was data modelling. I may have missed some things that would have been very useful, but as it is I found that there simply wasn’t enough helpful information about how to create a data model. This would have saved me quite a bit of time because I think the data model is so central to what we are doing and provides such an effective way to visualise the entities and relationships between them. I think this was partly a case of examples being too simplistic, and partly a lack of data models that used catalogue data – not necessarily archival finding aids, but at least something similar.

The data

I think that we are going to find challenges around the actual content. There are numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name, or where the access points do not have rules or a source associated with them. I’ve just found some descriptions where the content for the ‘extent’ should acutally really be in the ‘scope’. Some descriptions have rather unsatisfactory references, some do not include the language field, a few do not even include the creator field. For some fields we will just be outputting literal values, but for others consistency would help a great deal with the creation of RDF, particularly when thinking about the vocabulary (or predicate) that we use to define the relationship between a subject and object.  This is the challenge of creating Linked Data for descriptions that have been created by 200 different institutions over several decades and by 100s of different people. We’ll have to see how it goes!

The issue of access points

Within EAD there are access points, or index terms, associated with the description. These are most commonly subject, name and place. We’ve found that establishing the nature of the relationship between the unit of description and the access point is not easy. It looks like the relationship is going to be something very unspecific, such as ‘associatedWith’. I’m not sure yet whether this has any implications…


For me, after a few weeks away from thinking about Locah and Linked Data, getting back into the whole mindset actually takes about an hour and a nice cup of tea. In other words, the mindset I require to think about Linked Data currently feels separate from my normal working mindset. I think this is because LD requires something different. This in itself makes it quite challenging. It doesn’t fall naturally into what we do in the Hub and how we think about metadata.

However, the very big plus with this different kind of thinking is that really by definition it puts what the user is interested in at the forefront of your thinking. Well, maybe I should qualify that: I believe it puts what the user is interested in at the forefront. This is because we understand that users of archives are usually primarily interested in individuals, families, organisations, subjects and places. What they want is information on Sir Ernest Shackleton, Barbara Castle, Victorian theatre, town planning, a local business, a scientific organisation, the history of Manchester the industry of Sheffield,  or anything else. They don’t tend to know that they want to access a particular archive. Or if they do, it is often due to an assumption that there is ‘an archive’ on the person or organisation that they are researching. Even if there is an archive, there may may be a misplaced assumption that this archive is pretty much all the stuff about that entity. Furthermore, there are going to be many many researchers out there who will not be aware of archives and how to access them.  Linked Data provides a way to link archives into…well, into just about anything else.