Last week Pete Johnston, Bethan Ruddock and I got together and shut ourselves in a room for 5 hours with a whiteboard, a flipchart and our thinking caps on. Pete has already posted some thoughts about architecture and workflows following this meeting. I thought I would share some more informal thoughts of my own – from the perspective of an archivist and someone gradually getting to grips with Linked Data and RDF modelling.
Now that I understand a bit more about RDF, I can see where some of my misunderstandings were leading me astray. Firstly, it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data. This might seem obvious to those conversant in Linked Data, but I’ve been dealing with records as the unit of information for the last 20 years. With Linked Data you have to get away from this and think about the actual concepts within the data. The record (the EAD description in this case) exists as an entity along with everything else, but it can be misleading to take it as the starting point for data modelling.
I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point, and also because I kept going back to thinking of something like <http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton> as the starting point (the record itself). I then moved away from this and started thinking about the archival creator as a central concept. I knew that in RDF this person or organisation would be a subject. I also knew that this subject would need a URI and that we might want to tell people about stuff related to this subject, but I struggled with how we would provide URIs for subjects like this, and also how we would link the creator as subject to things like the index term subjects.
After a quick chat with Pete Johnston I started to understand the real role of URIs within Linked Data. We are probably going to create URIs ourselves for things (concepts) within the Hub. So, we might create a URI for every archival creator, a URI for every repository, and so on. We agreed that we needed to model the data within our world before looking too much at linking to data outside of it. Whilst I had heard and read a good deal about Linked Data, I somehow hadn’t quite got the idea that you might create URIs yourself for your own concepts, and that these would be documents in their own right. You can then link to these URIs within your statements, and you can include whatever information you think will be useful within these documents.
For example, we were using the record for Sir Ernest Henry Shackleton (the famous Antarctic explorer) as a sample. He would have a URI – something like http://archiveshub.ac.uk/id/person/sirernesthenryshackleton. By providing him with a URI we can then create triples (statements) that include this URI, such as:
http://archiveshub.ac.uk/id/person/sirernesthenryshackleton ‘created’ http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton.
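To make the shape of that statement a little more concrete, here is a minimal Python sketch that holds it as a subject, predicate and object and writes it out as a line of N-Triples. It is only an illustration: the http:// scheme on the person URI and the predicate URI (http://example.org/terms/created) are assumptions, since the model does not yet say which property would stand for ‘created’.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate and object, all given as URIs."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        # N-Triples wraps each URI in angle brackets and ends the line with " ."
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


# Person URI in the "something like" form discussed above (scheme assumed).
creator = "http://archiveshub.ac.uk/id/person/sirernesthenryshackleton"
# The existing Hub record page for the description.
record = "http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton"
# Hypothetical placeholder predicate; a real model would pick a published property.
CREATED = "http://example.org/terms/created"

statement = Triple(creator, CREATED, record)
print(statement.to_ntriples())
```

The point of the sketch is simply that the person, the relationship and the record are each named by a URI, and the statement is nothing more than those three names in order.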
We can then decide what information we will put in this document that identifies Sir Ernest, so that when researchers look up the URI, they get useful information. We can include links to external locations and we can look at using the ‘sameAs’ relationship to link to other representations of the same person.
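As a rough sketch of what such a document might contain, the Python below builds two statements about Sir Ernest: a human-readable label and a ‘sameAs’ link. The rdfs:label and owl:sameAs property URIs are the standard ones, but the choice of rdfs:label is an assumption, the person URI is only the "something like" form, and the external URI is a made-up placeholder rather than a real identifier for him elsewhere.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def uri(value: str) -> str:
    """Format a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Format a plain string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Assumed form of the Hub's URI for the person.
shackleton = "http://archiveshub.ac.uk/id/person/sirernesthenryshackleton"
# Placeholder only: stands in for another dataset's URI for the same person.
external = "http://example.org/people/ernest-shackleton"

description = [
    (uri(shackleton), uri(RDFS_LABEL), literal("Sir Ernest Henry Shackleton")),
    (uri(shackleton), uri(OWL_SAME_AS), uri(external)),
]

# One N-Triples line per statement.
lines = [" ".join(terms) + " ." for terms in description]
print("\n".join(lines))
```

A researcher looking up the URI would then get the name straight away, plus a pointer to wherever else the same person is described.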
Some URIs are fairly straightforward. We will create URIs for archival levels, and then these can in theory be used by others who want to identify levels within the data. For something like language, we will probably use URIs that are already available.
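One way to picture how URIs like these might be generated is the small Python sketch below. It is a guess at a pattern, not the Hub's actual scheme: the base URI, the "person" and "level" path segments and the rule of stripping a name down to lowercase letters and digits are all assumptions, chosen only because they reproduce the sample Shackleton URI.

```python
import re

# Assumed base for identifier URIs; only "something like" this has been suggested.
BASE = "http://archiveshub.ac.uk/id"


def mint_uri(kind: str, name: str) -> str:
    """Build a URI from an entity type and a display name."""
    # Lowercase the name and drop everything except ASCII letters and digits.
    slug = re.sub(r"[^a-z0-9]", "", name.lower())
    if not slug:
        raise ValueError(f"cannot build a URI from {name!r}")
    return f"{BASE}/{kind}/{slug}"


print(mint_uri("person", "Sir Ernest Henry Shackleton"))
# "level" is a hypothetical path segment for archival levels.
print(mint_uri("level", "Series"))
```

A rule this simple would give two different people with the same name the same URI, so a real scheme would need something extra (dates, or a repository code, say) to keep them apart.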
It is useful within data modelling to distinguish the real from the conceptual. So, going back to Sir Ernest, he is a flesh and blood person, and he can also be represented as a concept. If we are thinking about subjects used as index terms within the data, we might have ‘Exploration’ as a subject. We want Sir Ernest, the man described within our record, to be associated with this subject, so we can do this by making him into a concept, and giving that concept a URI. We can then link that to a literal value – his name. In our meeting we discussed one of the advantages of conceptual agents: we can distinguish between the person or organisation in its entirety and the person or organisation within this particular context. Archives often only represent a small part of someone’s life or an organisation’s activities, so it is helpful to talk about ‘Sir Ernest Shackleton’ as the explorer and leader of the British Antarctic Expedition of 1907-1909.
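The split between the person and the concept can be sketched in a few lines of Python. This is one possible reading rather than the agreed model: it assumes SKOS for the concept and its name, and FOAF for the person and for the link from concept to person (foaf:focus), none of which have been settled here, and the concept URI is a hypothetical pattern.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
FOAF = "http://xmlns.com/foaf/0.1/"

# The flesh and blood person (assumed URI form).
person = "http://archiveshub.ac.uk/id/person/sirernesthenryshackleton"
# The person as a concept within this description (hypothetical URI pattern).
concept = "http://archiveshub.ac.uk/id/concept/person/sirernesthenryshackleton"

# A tiny in-memory graph: a set of (subject, predicate, object) tuples.
triples = {
    (person, RDF_TYPE, FOAF + "Person"),
    (concept, RDF_TYPE, SKOS + "Concept"),
    # The literal value attached to the concept: his name.
    (concept, SKOS + "prefLabel", "Sir Ernest Henry Shackleton"),
    # The concept points at the real-world person it stands for.
    (concept, FOAF + "focus", person),
}


def objects(subject: str, predicate: str) -> list:
    """Return every object recorded for a given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


print(objects(concept, SKOS + "prefLabel"))
print(objects(concept, FOAF + "focus"))
```

Keeping two URIs means statements about the man in this particular context can hang off the concept, while anything true of him in his entirety can be said about the person.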
So, we are now starting to move towards a model where we have URIs for a number of key concepts within the Hub. Our intention is to limit the number of concepts that we create URIs for, at least at this stage. We will also simplify some areas of the EAD modelling that we can then open up for investigation later on. For example, it would be good to look at version control and how we might filter changes to Hub descriptions through to the RDF/XML, but we think that initially it is a good idea to create Linked Data from our basic model so that we can get feedback and also benefit from the learning process.
The main text-heavy field that we are planning to create URIs for at this stage is the Biographical and Administrative History. We haven’t yet explored this thoroughly, but with URIs for archival creators and URIs for administrative and biographical histories, one’s thoughts start to turn to name authorities and EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons and Families – a means to mark up information about archival creators in XML). We are not looking at creating EAC descriptions, but it would be good to keep in line with this in whatever ways we can in order to facilitate the subsequent creation of EAC records, or incorporation of our data into EAC records.
We will soon be able to share our current data model, so keep an eye on our blog. We welcome any feedback that the community might have.