Explaining Linked Data to Your Pro Vice Chancellor

At the JISCEXPO Programme meeting today I led a session on ‘Explaining linked data to your Pro Vice Chancellor’, and this post is a summary of that session. The attendees were: myself (Adrian Stevenson), Rob Hawton, Alex Dutton, and Zeth, with later contributions from Chris Gutteridge.

It seemed clear to us that this is really about focussing on institutional administrative data, as it’s probably harder to sell the idea of providing research data in linked data form to the Pro VC. Linked data probably doesn’t allow you to do things that you couldn’t do by other means, but it is easier than other approaches in the long run, once you’ve got your linked data available. Linked Data can be of value without having to be open:

“Southampton’s data is used internally. You could draw a ring around the data and say ‘that’s closed’, and it would still have the same value.”

== Benefits ==

Quantifying the value of linked data efficiencies can be tricky, but providing open data allows quicker development of tools, as the data the tools hook into already exists and is standardised.

== Strategies ==

Don’t mention the term ‘linked data’ to the Pro VC, or get into discussing the technology. It’s about the outcomes and the solutions, not the technologies. Getting ‘Champions’ who have the ear of the Pro VC will help.  Some enticing prototype example mash-up demonstrators that help sell the idea are also important. Also, pointing out that other universities are deploying and using linked open data to their advantage may help. Your University will want to be part of the club.

Making it easy for others to supply data that can be utilised as part of linked data efforts is important. This can be via Google spreadsheets, or e-mailing spreadsheets for example. You need to offload the difficult jobs to the people who are motivated and know what they’re doing.

It will also help to sell the idea to other potential consumers, such as libraries and other data providers. For libraries, you could possibly sell it on the idea of ‘increasing the prominence of holdings’, which helps bring attention and re-use.

It’s worth emphasising that linked data simplifies the Freedom of Information (FOI) process. We can say “yes, we’ve already published that FOI data”. You have a responsibility to publish this data if asked via FOI anyway. This is an example of a sheer curation approach.

Linked data may provide decreased bureaucracy. There’s no need to ask other parts of the University for their data, wasting their time, if it’s already published centrally. Examples here are estates, HR, library, student statistics.

== Targets ==

Some possible targets are: saving money, bringing in new business, funding, students.

The potential for increased business intelligence is a great sell, and Linked Data can provide the means to do this. Again, you need to sell a solution to a problem, not a technology. The University ‘implementation’ managers need to be involved and brought on board as well as the Pro VC.

It can be a problem that some institutions adopt a ‘best of breed’ policy with technology. Linked data doesn’t fit too well with this. However, it’s worth noting that Linked Data doesn’t need to change the user experience.

A lot of the arguments being made here don’t just apply to linked data. Much is about issues such as opening access to data generally. It was noted that there have been many efforts from JISC to solve the institutional data silo problem.

If we were setting a new University up from scratch, going for Linked Data from the start would be a realistic option, but it’s always hard to change currently embedded practice. Universities having Chief Technology Officers would help here, or perhaps a PVC for Technology?

Putting the Case for Linked Data

This is a summary of a break-out group discussion at the JISC Expo Programme meeting, July 2011, looking at ‘Skills required for Linked Data’.

We started off by thinking about the first steps when deciding to create Linked Data. We took a step back from the skills themselves and thought more about the basic understanding that is needed, and the importance of putting the case for Linked Data (or otherwise).

Do you have suitable data?

Firstly, do you have data that is suitable to output as Linked Data? This comes down to the question: what is suitable data? It would be useful to provide more advice in this area.

Is Linked Data worth doing?

Why do you want Linked Data? Maybe you are producing data that others will find interesting and link into? If you give your data identifiers, others can link into it. But is Linked Data the right approach? Is what you really want open data, rather than Linked Data? Or just APIs into the data? Sometimes a simpler solution may give you the benefits that you are after.

Maybe for some organisations approaches other than Linked Data are appropriate, or are a way to start off – maybe just something as simple as outputting CSV. You need to think about what is appropriate for you. By putting your data out in a more low-barrier way, you may be able to find out more about who might use your data. However, there is an argument that it is very early days for Linked Data, and low levels of use right now may not reflect the potential for the data and how it is used in the future.

Are you the authority on the data? Is someone else the authority? Do you want to link into their stuff? These are the sorts of questions you need to be thinking about.

The group agreed that use cases would be useful here. They could act as a hook to bring people in. Maybe there should be somewhere to go to look up use cases – people can get a better idea of how they (their users) would benefit from creating Linked Data, they can see what others have done and compare their situation.

We talked around the issues involved in making a case for Linked Data. It would be useful if there was more information for people on how it can bring things to the fore. For example, we talked about a set of photographs – a single photograph might be seen in a new context, with new connections made that can help to explain it and what it signifies.

What next?

There does appear to be a move of Linked Data from a ‘clique’ into the mainstream – this should make it easier to understand and engage with. There are more tutorials, more support, more understanding. New tools will be developed that will make the process easier.

You need to think about different skills – data modelling and data transformation are very different things. We agreed that development is not always top down. Developers can be very self-motivated, in an environment where a continual learning process is often required. It may be that organisations will start to ask for skills around Linked Data when hiring people.

We felt that there is still a need for more support and more tutorials. We should move towards a critical mass, where questions raised are being answered and developers have more of a sense that there is help out there and they will get those answers. It can really help talking to other developers, so providing opportunities for this is important. The JISC Expo projects were tasked with providing documentation – explaining what they have done clearly to help others. We felt that these projects have helped to progress the Linked Data agenda, and that requiring processes and results to be written up is an important way of encouraging people to acquire these skills.

Realistically, for many people, expertise needs to be brought in. Most organisations do not have in-house resources to call upon. Often this is going to be cheaper than up-skilling – a steep learning curve can take weeks or months to negotiate, whereas someone expert in this domain could do the work in just a few days. We talked about a role for (JISC) data centres in contributing to this kind of thing. However, we did acknowledge the important contribution that conferences, workshops and other events play in getting people familiar with Linked Data from a range of perspectives (as users of the data as well as providers). It can be useful to have tutorials that address your particular domain – data that you are familiar with. Maybe we need a combination of approaches – it depends where you are starting from and what you want to know. But for many people, the need to understand why Linked Data is useful and worth doing is an essential starting point.

We saw the value in having someone involved who is outward facing – otherwise there is a danger of a gap between the requirements of people using your data and what you are doing. There is a danger of going off in the wrong direction.

We concluded that for many, Linked Data is still a big hill to climb. People do still need hand-ups. We also agreed that Linked Data will get good press if there are products that people can understand – they need to see the benefits.

As maybe there is still an element of self-doubt about Linked Data, it is essential not just to output the data but to raise its profile, to advocate what you have done and why. Enthusiasm can start small but it can quickly spread out.

Finally, we agreed that people don’t always know when products are built around Linked Data, so you may not realise how it is benefitting you. We need to explain what we have done as well as providing the attractive interface/product, and we need to relate it to what people are familiar with.


Final Product Post: Archives Hub EAD to RDF XSLT Stylesheet

Archives Hub EAD to RDF XSLT Stylesheet

Please note: Although this is the ‘final’ formal post of the LOCAH JISC project, it will not be the last post. Our project is due to complete at the end of July, and we still have plenty to do, so there’ll be more blog posts to come.

Users this product is for: Archives Hub contributors, EAD-aware archivists, software developers, technical librarians, JISC Discovery Programme (SALDA Project), BBC Digital Space.

Description of prototype/product:

We consider the Archives Hub EAD to RDF XSLT stylesheet to be a key product of the Locah project. The stylesheet encapsulates the Locah-developed Linked Data model and provides a simple, standards-based means to transform archival data to Linked Data RDF/XML. The stylesheet can straightforwardly be re-used and re-purposed by anyone wishing to transform archival data in EAD form to Linked Data ready RDF/XML.

The stylesheet is available directly from http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl

The stylesheet is the primary source from which we were able to develop data.archiveshub.ac.uk, our main access point to the Archives Hub Linked Data. Data.archiveshub.ac.uk provides access to both human and machine-readable views of our Linked Data, as well as access to our SPARQL endpoint for querying the Hub data and a bulk download of the entire Locah Archives Hub Linked Dataset.
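
As a flavour of what the SPARQL endpoint makes possible, here is a minimal sketch of a query. The property used (dcterms:title) and the search term are purely illustrative assumptions rather than a statement of the vocabulary actually used in the Hub data, so please check our model documentation for the terms we have adopted:

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?resource ?title
WHERE {
  ?resource dcterms:title ?title .
  FILTER regex(?title, "Shackleton", "i")
}
LIMIT 10

A query like this simply asks the endpoint for up to ten resources whose title mentions ‘Shackleton’.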

The stylesheet also provided the means necessary to supply data for our first ‘Timemap’ visualisation prototype. This visualisation currently allows researchers to access the Hub data by a small range of pre-selected subjects: travel and exploration, science and politics. Having selected a subject, the researcher can then drag a time slider to view the spread of a range of archive sources through time. If a researcher then selects an archive she/he is interested in on the timeline, a pin appears on the map below showing the location of the archive, and a call-out box appears providing some simple information such as the title, size and dates of the archive. We hope to include data from other Linked Data sources, such as Wikipedia, in these information boxes.

This visualisation of the Archives Hub data and links to other data sets provides an intuitive view to the user that would be very difficult to provide by means other than exploiting the potential of Linked Data.

Please note these visualisations are currently still work in progress:

Screenshots:

Data.archiveshub.ac.uk home page:

Screenshot of data.archiveshub.ac.uk homepage

Prototype visualisation for subject ‘science’ (work in progress):

Screenshot of the Locah visualisation for subject ‘science’

Working prototype/product:

http://data.archiveshub.ac.uk/ead2rdf/

There are a large number of resources available on the Web for using XSLT stylesheets, as well as our own ‘XSLT’ tagged blog posts.
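
For anyone who has not run an XSLT transform before, a typical command-line invocation looks something like the sketch below. The file names are placeholders, and which processor you need depends on the XSLT version the stylesheet declares – xsltproc handles XSLT 1.0, while an XSLT 2.0 stylesheet needs a processor such as Saxon-HE:

# XSLT 1.0 processor
xsltproc ead2rdf.xsl mycollection-ead.xml > mycollection.rdf

# XSLT 2.0 processor (Saxon-HE)
java -jar saxon9he.jar -s:mycollection-ead.xml -xsl:ead2rdf.xsl -o:mycollection.rdf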

Instructional documentation:

Our instructional documentation can be found in a series of posts, all tagged with ‘instructionaldocs‘. We provide instructional posts on the following main topics:

Project tag: locah

Full project name: Linked Open Copac Archives Hub

Short description: A JISC-funded project working to make data from Copac and the Archives Hub available as Linked Data.

Longer description: The Archives Hub and Copac national services provide a wealth of rich interdisciplinary information that we will expose as Linked Data. We will be working with partners who are leaders in their fields: OCLC, Talis and Eduserv. We will be investigating the creation of links between the Hub, Copac and other data sources including DBPedia, data.gov.uk and the BBC, as well as links with OCLC for name authorities and with the Library of Congress for subject headings. This project will put archival and bibliographic data at the heart of the Linked Data Web, making new links between diverse content sources, enabling the free and flexible exploration of data and enabling researchers to make new connections between subjects, people, organisations and places to reveal more about our history and society.

Key deliverables: Output of structured Linked Data for the Archives Hub and Copac services. A prototype visualisation for browsing archives by subject, time and location. Opportunities and barriers reporting via the project blog.

Lead Institution: UKOLN, University of Bath

Person responsible for documentation: Adrian Stevenson

Project Team: Adrian Stevenson, Project Manager (UKOLN); Jane Stevenson, Archives Hub Manager (Mimas); Pete Johnston, Technical Researcher (Eduserv); Bethan Ruddock, Project Officer (Mimas); Yogesh Patel, Software Developer (Mimas); Julian Cheal, Software Developer (UKOLN). Read more about the LOCAH Project team.

Project partners and roles: Talis are our technology partner on the project, providing us with access to store our data in the Talis Store. Leigh Dodds and Tim Hodson are our main contacts at the company. OCLC also partnered, mainly to help with VIAF. Our contacts at OCLC are John MacColl, Ralph LeVan and Thom Hickey. Ed Summers is also helping us out as a voluntary consultant.

The address of the LOCAH Project blog is http://archiveshub.ac.uk/locah/ . The main atom feed is http://archiveshub.ac.uk/locah/feed/atom

All reusable program code produced by the Locah project will be available as free software under the Apache License 2. You will be able to get the code from our project sourceforge repository.

The LOCAH dataset content is licensed under a Creative Commons CC0 1.0 licence.

The contents of this blog are available under a Creative Commons Attribution-ShareAlike 3.0 Unported license.

LOCAH Datasets
LOCAH Blog Content
Locah Code

Project start date: 1st Aug 2010
Project end date: 31st July 2011
Project budget: £100,000


LOCAH was funded by JISC as part of the #jiscexpo programme. See our JISC PIMS project management record.

Archives Hub Linked Data Release

We’re very pleased to announce the release of http://data.archiveshub.ac.uk, the first Linked Data set produced by the LOCAH project. The team has been working hard since the beginning of the project on modelling the complex archival data and transforming it into RDF Linked Data. This is now available in a variety of forms via the data.archiveshub.ac.uk home page. A number of previous blog posts outline the modelling and transformation process, the RDF terms used in the data, and the challenges and opportunities arising along the way. A forthcoming post will provide some example queries for accessing data from the SPARQL query endpoint. The data and content are licensed under a Creative Commons CC0 1.0 licence.

We’re working on a visualisation prototype that provides an example of how we link the Hub Data with other Linked Data sources on the Web using our enhanced dataset to provide a useful graphical resource for researchers.

One important point to note is that this initial release is a selected subset, representative of the Hub collection descriptions as a proof of concept, and does not contain the full Archives Hub dataset at present, although we are very keen to explore this in the future.

We still have some work to do, this being the initial release of the Hub data. Some revisions for a later release will address a few issues including reconciling our internal person and subject names, and will also contain some further enhancements to the data to include links to Library of Congress subject headings and further links to DBPedia based on subject terms. We also hope to include links for place names using Geonames and Ordnance Survey.

We encourage feedback on the data, the model and any other aspect of data.archiveshub.ac.uk, so please leave comments or contact us directly.

We are also working hard on our other main LOCAH release, the Copac Linked Data. Our first version of the model for this is now finished, and we have the data in our test triple store. We hope to release this in about a month’s time.

I’d personally like to thank the LOCAH team for all their hard work on this exciting and challenging project. I’d also like to thank our technology partner, Talis, for kindly providing our Linked Data store.

LOD-LAM: International Linked Open Data in Libraries, Archives, and Museums Summit

I’m really pleased to announce that I was asked to join the organising committee for the International Linked Open Data in Libraries, Archives, and Museums Summit that will take place this June 2-3, 2011 in San Francisco, California, USA. There’s still time to apply until February 28th, and funding is available to help cover travel costs.

The International Linked Open Data in Libraries, Archives, and Museums Summit (“LOD-LAM”) will convene leaders in their respective areas of expertise from the humanities and sciences to catalyze practical, actionable approaches to publishing Linked Open Data, specifically:

  • Identify the tools and techniques for publishing and working with Linked Open Data.
  • Draft precedents and policy for licensing and copyright considerations regarding the publishing of library, archive, and museum metadata.
  • Publish definitions and promote use cases that will give LAM staff the tools they need to advocate for Linked Open Data in their institutions.

For more information see http://lod-lam.net/summit/about/.

The principal organiser/facilitator is Jon Voss (@LookBackMaps), Founder of LookBackMaps, along with Kris Carpenter Negulescu, Director of Web Group, Internet Archive, who is project managing.

I’m very chuffed to be part of the illustrious Organising Committee:

Lisa Goddard (@lisagoddard), Acting Associate University Librarian for Information Technology, Memorial University Libraries.
Martin Kalfatovic (@UDCMRK), Assistant Director, Digital Services Division at Smithsonian Institution Libraries and the Deputy Project Director of the Biodiversity Heritage Library.
Mark Matienzo (@anarchivist), Digital Archivist in Manuscripts and Archives at the Yale University Library.
Mia Ridge (@mia_out), Lead Web Developer & Technical Architect, Science Museum/NMSI (UK)
Tim Sherratt (@wragge), National Museum of Australia & University of Canberra
MacKenzie Smith, Research Director, MIT Libraries.
Adrian Stevenson (@adrianstevenson), UKOLN; Project Manager, LOCAH Linked Data Project.
John Wilbanks (@wilbanks), VP of Science, Director of Science Commons, Creative Commons.

It’ll be a great event I’m sure, so get your application in ASAP.

Assessing Linked Data

For me, the journey from an understanding of modelling data, and creating our own models for the Hub and Copac, to being able to understand the processes and decisions involved in creating XML RDF has been challenging. It has raised one question that often applies when dealing with something quite technical: how much should a manager (in my case an archivist managing an online archive service) be expected to understand the ‘technical’ aspects of something? This is a question I have spoken about and written about before; mainly in terms of what archivists (or other information professionals) in the digital age need to know in order to understand the implications of choices around things like data structure and software systems. In the case of Linked Data, I am still not sure how much I need to know about the detail of Linked Data, the RDF model, the use of RDF XML, the benefits of other output formats, the application of stylesheets, etc. I have been thinking about how hard it is to create Linked Data – I have had a few enquiries from colleagues who are interested in doing the same sort of thing already and I want to be able to offer useful advice.

One thing that occurs to me is that it is reasonable to acknowledge that Linked Data does involve programming skills, and therefore it is not so dissimilar from structuring and outputting your data through a traditional relational database, for example, where you would expect that specialist skills are needed. But in either case there is the same need for a manager to understand what the system offers and be able to offer the best service to researchers. I think what is important from the point of view of the manager is to be involved with the decision-making process and understand the implications of Linked Data; you need to know what you are saying about your own data. I am not sure that this requires a thorough knowledge of RDF and certainly I only have a rudimentary knowledge of stylesheets (and no knowledge of programming).

Service managers or administrators are not expected to understand systems from an in-depth technical point of view. But in fact, I think that one of the advantages of the RDF model is that it is easier to get a sense of what is going on in terms of data processing than typically occurs with a database management system. After about six months of learning about Linked Data and RDF (I estimate that this translates into about one month of fairly intense learning), I can look at the stylesheet that we have for the Archives Hub, to transform the EAD data into RDF XML, and I can look at an RDF document representing the entities that we are describing, and I have a reasonably good overall sense of what it all means, which helps me with my main role: to understand the outputs and the potential benefits of Linked Data. For Locah, we’ve used XSLT, but there is no requirement for this, and maybe one of the challenges of outputting Linked Data is that there are a number of options in terms of translating your RDF model into an output.

There is no doubt that choices made now about how we model the data will have implications for what users can do with it, and some choices may limit future potential more than others. For example, which ‘things’ do we choose to represent? Should we have a conceptualisation of a person as an entity represented in the description, and to link this to a conceptualisation of the person? Which information should we provide as URIs and which as literals? I’m only gradually coming to understand the implications of these decisions, as we start to explore the potential of the data. Of course, this is always true, whatever data, structures and systems we are working with. This brings me to another point that I think is probably particularly relevant for Locah: we are doing this work at a time when we are very much early adopters. Whilst the classic Linked Data diagram may give the impression that the world has embraced Linked Data, the reality is that it is still very much at a hand-crafted level: we have not had tools available to us to aid us in this work, and in the case of EAD, there has been very little activity up till now. It is therefore difficult to judge how feasible it might be to output RDF in the future, as it is likely that more tools will be developed, and there will be greater awareness and skills built up around the whole Semantic Web. However, I wonder if we are currently still at that difficult point where we need to build the momentum of the Linked Data movement, but it is still very unfamiliar and poorly understood by many data providers?

Many Linked Data evangelists claim that Linked Data is ‘easy’. I’m not sure that it is necessarily easy, and I don’t think that it’s very helpful to say that it is easy. Easy compared to what? Easy for whom? It’s easy if you know how, if you have the requisite skills and experience, but we need to persuade people who don’t yet know how that it is worth doing, and provide a realistic assessment of the skills that are required. I suppose the question of how easy it is does rest in large part on the data you are working with as well. Archival finding aids are quite challenging. As Mark Matienzo, archivist at Yale University, states in his presentation on Linked Data and Archival Description: “Archival description is inherently multi-level and relational” and “EAD is both too flexible and too unforgiving” to be Linked Data friendly…and database-friendly for that matter. Also, ISAD(G) recommends the non-repetition of information and archival description generally contains implicit information. I suppose Linked Data might help provide the opportunity and impetus to move towards a more Web-friendly way of describing archives, if it does become more widely used.

At present, I can’t help thinking that if archive repositories and libraries would like to output their data as Linked Data, many of them will struggle, and I would have thought it might be similar for other types of data providers. I do think that expertise is required, and time needs to be invested in understanding some key aspects of Linked Data. On the other hand, this is the case whenever you are looking at creating effective means to output structured (but often inconsistent) data. However, I think that it makes good sense for the Archives Hub and Copac to do this work, as it is on behalf of our contributors, so it effectively will allow these repositories and libraries to output Linked Data.  In other words, it may be that for Linked Data to really take hold, it will benefit from this kind of aggregated set-up, where skills and resources can be pooled. At present, I’m inclined to think that it is worth the investment of time and resources by our Locah team because it is benefitting a large number of data providers. I think it will be important for us to convey to our contributors, and indeed to other archivists and librarians, what we are doing and why, what the implications are and what the benefits may be. I have already had contact with two people, one representing another aggregation of content, interested in benefitting from our work. This is really important, because it potentially makes the investment more worthwhile.

We are in a fortunate position with the Locah project because we are part of a JISC-funded innovations project, with a team of people with a variety of skills, and we have support from Talis, who have significant experience of Linked Data.  If we can work on behalf of our community, then I feel that the time invested may be worthwhile. For the second half of our year-long project we will want to explore the benefits more thoroughly – we will be looking at the crucial issues of creating links to other data, which is really Linked Data’s key selling point, and we will be developing a prototype to show some potential benefits for researchers.

(With thanks to Pete and Ade for their contributions to this blog post).

Modelling Copac data

With the Archives Hub data well under way, it was time to start looking at the Copac data.  The first decision to be made was which version of Copac data to use – consolidated or unconsolidated.  As part of the process of adding records to Copac they are de-duplicated, allowing different institutions’ records for the same item to be presented as one record, instead of several.  For more info on Copac de-duplication, see this blog post.

So our first question was: deal with the individual records from each library, or with the consolidated records created for Copac?  This made us think about the nature of what we were describing. The unconsolidated records (generally!) relate to the actual, physical ‘thing’ – what in FRBR would be the ‘item’.

The consolidated records are closer to (but by no means a perfect example of) the FRBR manifestation.  That is to say, they are describing different physical instances of the same theoretical work; in linked data terms, they are ‘same as’.  They aren’t perfect manifestation level records, as there may be other records on Copac for the same manifestation which haven’t been consolidated due to cataloguing differences.  At Copac, we err on the side of caution, and would rather have this happen, than have records which aren’t the same consolidated into the same record.

So we could do our mapping and our transformations at unconsolidated level, and then use ‘same as’ to link together the descriptions that would later be consolidated in Copac.  But as we’re accepting Copac’s judgement that they are describing the same set of items, why not save ourselves that trouble, and work from the consolidated description?  We can then hang the individual bibliographic records off this central unit of description.

This means that all of the information provided by the different libraries is related to the same unit of description.  The bibliographic records that go together to make up a consolidated Copac record may not contain all of the same information, but they won’t contain any contradictory information.  Thus two records which are the same in all details except date of publication (say 1983 in one, 1984 in the other) will not consolidate, but records which are the same in all details except that one contains a subject where the other does not, will consolidate.

In fact, subjects are one of the things (along with notes) that don’t affect consolidation at all.  We will combine all of the subjects that come in individual descriptions, so that a consolidated record might end up with the subjects:

Management

Management — theory

Management (theoretical)

Business & management

We will leave these in the linked data description for the same reason they are left in the Copac description – while such similar terms may seem superfluous, they actually increase discoverability, by providing multiple access points.  They will link into the central ‘unit of description’, rather than the individual bibliographic records.
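
As a rough Turtle sketch of the shape being described – with example.org URIs and an ex: vocabulary standing in for terms we have not finalised, so none of this is the actual Copac model – the consolidated unit of description might look something like this:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/vocab/> .

# the central unit of description (the consolidated Copac record)
<http://example.org/id/copacrecord/123>
    dcterms:subject <http://example.org/id/concept/management> ,
                    <http://example.org/id/concept/management-theory> ,
                    <http://example.org/id/concept/business-and-management> ;
    # the individual bibliographic records hang off the central unit
    ex:derivedFromRecord <http://example.org/id/bibrecord/library-a-456> ,
                         <http://example.org/id/bibrecord/library-b-789> .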

Once we’d decided on this central unit of description (name TBD, but likely to be ‘Copac record’ or something boringly similar), other aspects of the description started to fall into place.  Some of these were straightforward – publication date, for instance, is fairly obviously a literal – while others took more thought and discussion.

Among the more complicated issues was that of creator.  We are working with MODS data, which has come from MARC data, and MARC allows you to have only one ‘creator’.  This creator sits in the 100s as the main access point, and all other contributors (including co-authors!) are relegated to the 700s, where they become what we have decided to call ‘other person associated with this unit of description’.  Not very snappy, but hopefully fairly accurate.  In theory, the role that this person has in the creation of the item should be reflected in a MARC indicator, but in practice this is not often included in descriptions.  Where it is indicated that a person (or a corporate body) is an editor, contributor, translator, illustrator etc, we can build these into the modelling; where not, they will have to be satisfied with the vague title of ‘associated person’.
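
Sketched in the same illustrative terms (the people named are invented, and none of these URIs or properties are the final Copac vocabulary), the distinction might come out roughly as:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/vocab/> .

<http://example.org/id/copacrecord/123>
    # the person from the MARC 100 field: the single ‘creator’
    dcterms:creator <http://example.org/id/person/jane-smith> ;
    # a person from a 700 field with a usable role indicator
    ex:editor <http://example.org/id/person/ann-jones> ;
    # a person from a 700 field with no role indicator:
    # ‘other person associated with this unit of description’
    ex:associatedPerson <http://example.org/id/person/john-doe> .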

This will work for most situations, but it does still leave room for error.  Where a person is named in the 700s with no indicator of role, it is possible that they are a person who was associated with one particular item, rather than the manifestation – a former owner or bookseller, for example.  While we do want to present this information, which works as another access point, and may be of interest to users, we have the problem that this information should really be associated with the item, not our quasi-manifestation.  This information only concerns one specific physical item, as described in one of the individual bibliographic records. Should it really have a link to our central unit of description?  If not, where do we link it to?  Our entries for individual bib records describe only the records themselves, not a physical real-world item.  It’s an interesting point, and one we’ll be discussing more as the project goes on.

We’re continuing to work with Copac data, and will discuss other issues here as they arise.

Creating Linked Data: more reflections from the coal face

This post is to highlight some of the barriers and challenges to the creation of Linked Data.  This is a personal reflection, trying to be honest about the challenges as I have found them and the learning experience, which is inevitably a personal thing depending upon your own background, experience and ways of thinking and working. However, I think it also reflects some of the general challenges as we have come across them.

Vocabulary

It comes as no surprise that I have found the terminology somewhat confusing, and it has sometimes led me astray. Only this week Bethan and I were getting tangled up in a conversation about ‘things’  within the data model. We spent a while talking about how having a ‘Hub conceptualisation’ and a ‘thing-in-its-own-right conceptualisation’ of an entity would allow for more clarity. With ‘thing’, ‘concept’, ‘label’, ‘property’, ‘value’, ‘predicate’, ‘information resources’, ‘non-information resources’ etc. – there is quite a bit of room for misinterpretation in communication. I have looked at definitions, but these can actually sometimes hinder rather than help. I think that an attempt at a definitive glossary for Linked Data would help enormously.

Landscape

For me, it has taken a while to really get into the Linked Data way of thinking. I have actually kept a kind of diary of my thoughts over the last 2-3 months, and when I look back now at my earliest attempts at understanding how to model the data, they certainly show a pretty steep learning curve. I started, for example, by being unsure about whether we were wanting to provide information on the ‘creator’ of the archive or the archive itself and what sort of relationships between ‘things’ to include. I don’t think this is surprising, as the power of RDF is that it can be used to model anything – it doesn’t help you by giving you a limited scope or particular rules to start with (which is, of course, generally a good thing).

Archival descriptions

I listened to a number of audio tutorials, read a number of reports, blogs, etc., and learnt a great deal from these, but I still found the lack of examples within my own particular domain to be a barrier. Talis provide an excellent tutorial that you can sit and listen to, but the real-world example is for a whiskey distillery. It somehow seems a long way away from an archival description! So, I would definitely say this lack of information for my domain was a barrier. But, of course, for others who want to output their finding aids as Linked Data in the future, we should start to see models developing that they can use, with examples and information to help (Locah, we hope, being one source of help).

Expertise and experience

The Locah team has a variety of expertise and experience, but it is undoubtedly true to say that I would be struggling a great deal more than I have done if we had not had the input of Pete Johnston from Eduserv, who has been very much involved in the EAD modelling. Whilst it is important (and pleasant) to give credit where it’s due,  the real point here is actually that I think a certain level of expertise is important, to model data and output RDF. I have experience as an archivist and understand EAD and metadata, Pete also has experience of working with archival descriptions, and also substantial experience of metadata standards and issues around the Semantic Web and technical interoperability. We also have Bethan Ruddock working with us, who now has 18 months experience of working with EAD descriptions, and is a trained librarian. That is just the core team looking at the archival data modelling.  In addition, the expertise of UKOLN will come into play with other aspects of the project.

I find it hard to see how this sort of work could currently be done by a team with substantially less experience in these sorts of areas. However, it is important to state that we will also be working with Talis, who have a great deal of expertise in Linked Data. They are providing access to their own Triple Store and other benefits that we can take advantage of. Others thinking of outputting Linked Data could look to involve companies like Talis more heavily, thus taking advantage of their expertise and requiring less in-house expertise.

The benefits of data modelling

One of the areas that I spent most time trying to find good tutorials about was data modelling. I may have missed some things that would have been very useful, but as it is I found that there simply wasn’t enough helpful information about how to create a data model. Better guidance here would have saved me quite a bit of time, because I think the data model is so central to what we are doing and provides such an effective way to visualise the entities and relationships between them. I think this was partly a case of examples being too simplistic, and partly a lack of data models that used catalogue data – not necessarily archival finding aids, but at least something similar.

The data

I think that we are going to find challenges around the actual content. There are numerous examples of inconsistencies, such as where the ‘creator’ is ‘Joe Bloggs and others’ rather than just a name, or where the access points do not have rules or a source associated with them. I’ve just found some descriptions where the content for the ‘extent’ should actually be in the ‘scope’. Some descriptions have rather unsatisfactory references, some do not include the language field, a few do not even include the creator field. For some fields we will just be outputting literal values, but for others consistency would help a great deal with the creation of RDF, particularly when thinking about the vocabulary (or predicate) that we use to define the relationship between a subject and object.  This is the challenge of creating Linked Data for descriptions that have been created by 200 different institutions over several decades and by 100s of different people. We’ll have to see how it goes!

The issue of access points

Within EAD there are access points, or index terms, associated with the description. These are most commonly subject, name and place. We’ve found that establishing the nature of the relationship between the unit of description and the access point is not easy. It looks like the relationship is going to be something very unspecific, such as ‘associatedWith’. I’m not sure yet whether this has any implications…

Conclusions

For me, after a few weeks away from thinking about Locah and Linked Data, getting back into the whole mindset actually takes about an hour and a nice cup of tea. In other words, the mindset I require to think about Linked Data currently feels separate from my normal working mindset. I think this is because LD requires something different. This in itself makes it quite challenging. It doesn’t fall naturally into what we do in the Hub and how we think about metadata.

However, the very big plus with this different kind of thinking is that really by definition it puts what the user is interested in at the forefront of your thinking. Well, maybe I should qualify that: I believe it puts what the user is interested in at the forefront. This is because we understand that users of archives are usually primarily interested in individuals, families, organisations, subjects and places. What they want is information on Sir Ernest Shackleton, Barbara Castle, Victorian theatre, town planning, a local business, a scientific organisation, the history of Manchester, the industry of Sheffield, or anything else. They don’t tend to know that they want to access a particular archive. Or if they do, it is often due to an assumption that there is ‘an archive’ on the person or organisation that they are researching. Even if there is an archive, there may be a misplaced assumption that this archive is pretty much all the stuff about that entity. Furthermore, there are going to be many, many researchers out there who will not be aware of archives and how to access them.  Linked Data provides a way to link archives into…well, into just about anything else.

Making sense of modelling EAD

Last week Pete Johnston, Bethan Ruddock and I got together and shut ourselves in a room for 5 hours with a whiteboard, flipchart and with our thinking caps on. Pete has already posted some thoughts about architecture and workflows following this meeting. I thought I would share some more informal thoughts of my own – from the perspective of an archivist and someone gradually getting to grips with Linked Data and RDF modelling.

Now that I understand a bit more about RDF, I can see where some of my misunderstandings were leading me astray. Firstly, it took me quite a while to get away from the idea of modelling the EAD record, rather than the actual data. This might seem obvious to those conversant in Linked Data, but I’ve been dealing with records as the unit of information for the last 20 years. With Linked Data you have to get away from this and think about the actual concepts within the data. The record (the EAD description in this case) exists as an entity along with everything else, but it can be misleading to take it as the starting point for data modelling.

I found actually getting a ‘starting point’ a bit difficult. I think this is because everything can be a starting point, and also because I kept going back to thinking of something like <http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton> as the starting point (the record itself). I then moved away from this and started thinking about the archival creator as a central concept. I knew that in RDF this person or organisation would be a subject. I also knew that this subject would need a URI and that we might want to tell people about stuff related to this subject, but I struggled with how we would provide URIs for subjects like this, and also how we would link the creator as subject to things like the index term subjects.

After a quick chat with Pete Johnston I started to understand the real role of URIs within Linked Data. We are probably going to create URIs ourselves for things (concepts) within the Hub. So, we might create a URI for every archival creator, and a URI for every repository, etc. We agreed that we needed to model the data within our world before looking too much at linking to data outside of it.  Whilst I had listened to and read a good deal of literature on Linked Data, I somehow hadn’t quite got the idea that you might create URIs yourself for your own concepts, and that these would be documents in their own right; you can then link to these URIs within your statements and include whatever information you think will be useful within those documents.

For example, we were using Sir Ernest Henry Shackleton as a sample record (the famous Antarctic Explorer). He would have a URI – something like archiveshub.ac.uk/id/person/sirernesthenryshackleton. By providing him with a URI we can then create triples (statements) that include this URI. For example:

archiveshub.ac.uk/id/person/sirernesthenryshackleton ‘created’ http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton.

We can then decide what information we will put in this document that identifies Sir Ernest, so that when researchers look up the URI, they get useful information. We can include links to external locations and we can look at using the ‘sameAs’ relationship to link to other representations of the same person.
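
Written out in Turtle, the sort of statements described above might look roughly like the following. The ex: predicate namespace is a placeholder, and the DBpedia link is just one example of a ‘sameAs’ target – neither is a statement of the terms we have finally settled on:

@prefix ex:  <http://example.org/vocab/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# ‘Sir Ernest created this archive’ as a triple
<http://archiveshub.ac.uk/id/person/sirernesthenryshackleton>
    ex:created <http://archiveshub.ac.uk/search/record.html?id=gb15sirernesthenryshackleton> ;
    # one possible ‘sameAs’ link to another representation of the same person
    owl:sameAs <http://dbpedia.org/resource/Ernest_Shackleton> .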

Some URIs are fairly straightforward. We will create URIs for archival levels, and then these can in theory be used by others who want to identify levels within the data. For something like language, we will probably use URIs that are already available.

It is useful within data modelling to distinguish the real from the conceptual. So, going back to Sir Ernest, he is a flesh and blood person, and he can also be represented as a concept. If we are thinking about subjects used as index terms within the data, you might have ‘Exploration’ as a subject. We want Sir Ernest, the man described within our description, to be associated with this subject, so we can do this by making him into a concept, and giving that concept a URI. We can then link that to a literal value – his name. In our meeting we discussed one of the advantages of conceptual agents as being that we can distinguish between the person or organisation in its entirety and the person or organisation within this particular context. Archives often only represent a small part of someone’s life or an organisation’s activities, so it is helpful to talk about ‘Sir Ernest Shackleton’ as the explorer and leader of the British National Antarctic Expedition of 1907-1909.
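
One established way of expressing this separation between the concept and the flesh-and-blood person is sketched below, using SKOS together with the FOAF ‘focus’ property. This is an illustration of the general pattern rather than the model we have adopted, and the name string is invented for the example:

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# the conceptualisation of Sir Ernest as he appears within the Hub data
<http://archiveshub.ac.uk/id/concept/person/sirernesthenryshackleton>
    a skos:Concept ;
    rdfs:label "Shackleton, Sir Ernest Henry, 1874-1922, explorer" ;
    # ...pointing to the real-world person it is a conceptualisation of
    foaf:focus <http://archiveshub.ac.uk/id/person/sirernesthenryshackleton> .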

So, we are now starting to move towards a model where we have URIs for a number of key concepts within the Hub. Our intention is to limit the number of concepts that we create URIs for, at least at this stage. We will also simplify some areas with the EAD modelling that we can then open up for investigation later on. For example, it would be good to look at version control and how we might filter changes to Hub descriptions through to the RDF XML, but we think that initially it is a good idea to create Linked Data from our basic model so that we can get feedback and also benefit from the learning process.

The main text heavy field that we are planning to create URIs for at this stage is the Biographical and Administrative History. We haven’t yet explored this thoroughly, but with URIs for archival creators and URIs for administrative and biographical histories, one’s thoughts start to turn to name authorities and EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons and Families – a means to markup information about archival creators in XML). We are not looking at creating EAC descriptions, but it would be good to keep in line with this in whatever ways we can in order to facilitate the subsequent creation of EAC records, or incorporation of our data into EAC records.

We will soon be able to share our current data model, so keep an eye on our blog. We welcome any feedback that the community might have.