Lifting the Lid on Linked Data at ELAG 2011

Jane and I have just given our ‘Lifting the Lid on Linked Data’ presentation at the ELAG (European Library Automation Group) Conference 2011 in Prague today. It seemed to go pretty well. There were a few comments on the #elag2011 Twitter stream about the licensing situation for the Copac data, which is something we’re still working on.

[slideshare id=8082967&doc=elag2011-locah-110524105057-phpapp02]

LOD-LAM: International Linked Open Data in Libraries, Archives, and Museums Summit

I’m really pleased to announce that I was asked to join the organising committee for the International Linked Open Data in Libraries, Archives, and Museums Summit that will take place June 2-3, 2011 in San Francisco, California, USA. There’s still time to apply until February 28th, and funding is available to help cover travel costs.

The International Linked Open Data in Libraries, Archives, and Museums Summit (“LOD-LAM”) will convene leaders in their respective areas of expertise from the humanities and sciences to catalyze practical, actionable approaches to publishing Linked Open Data, specifically:

  • Identify the tools and techniques for publishing and working with Linked Open Data.
  • Draft precedents and policy for licensing and copyright considerations regarding the publishing of library, archive, and museum metadata.
  • Publish definitions and promote use cases that will give LAM staff the tools they need to advocate for Linked Open Data in their institutions.

For more information see http://lod-lam.net/summit/about/.

The principal organiser/facilitator is Jon Voss (@LookBackMaps), Founder of LookBackMaps, along with Kris Carpenter Negulescu, Director of Web Group, Internet Archive, who is project managing.

I’m very chuffed to be part of the illustrious Organising Committee:

Lisa Goddard (@lisagoddard), Acting Associate University Librarian for Information Technology, Memorial University Libraries.
Martin Kalfatovic (@UDCMRK), Assistant Director, Digital Services Division at Smithsonian Institution Libraries and the Deputy Project Director of the Biodiversity Heritage Library.
Mark Matienzo (@anarchivist), Digital Archivist in Manuscripts and Archives at the Yale University Library.
Mia Ridge (@mia_out), Lead Web Developer & Technical Architect, Science Museum/NMSI (UK).
Tim Sherratt (@wragge), National Museum of Australia & University of Canberra.
MacKenzie Smith, Research Director, MIT Libraries.
Adrian Stevenson (@adrianstevenson), UKOLN; Project Manager, LOCAH Linked Data Project.
John Wilbanks (@wilbanks), VP of Science, Director of Science Commons, Creative Commons.

It’ll be a great event I’m sure, so get your application in ASAP.

Modelling Copac data

With the Archives Hub data well under way, it was time to start looking at the Copac data.  The first decision to be made was which version of Copac data to use – consolidated or unconsolidated.  As part of the process of adding records to Copac they are de-duplicated, allowing different institutions’ records for the same item to be presented as one record, instead of several.  For more info on Copac de-duplication, see this blog post.

So our first question was: deal with the individual records from each library, or with the consolidated records created for Copac?  This made us think about the nature of what we were describing. The unconsolidated records (generally!) relate to the actual, physical ‘thing’ – what in FRBR would be the ‘item’.

The consolidated records are closer to (but by no means a perfect example of) the FRBR manifestation.  That is to say, they are describing different physical instances of the same theoretical work; in linked data terms, they are ‘same as’.  They aren’t perfect manifestation level records, as there may be other records on Copac for the same manifestation which haven’t been consolidated due to cataloguing differences.  At Copac, we err on the side of caution, and would rather have this happen, than have records which aren’t the same consolidated into the same record.

So we could do our mapping and our transformations at unconsolidated level, and then use ‘same as’ to link together the descriptions that would later be consolidated in Copac.  But as we’re accepting Copac’s judgement that they are describing the same set of items, why not save ourselves that trouble, and work from the consolidated description?  We can then hang the individual bibliographic records off this central unit of description.

This means that all of the information provided by the different libraries is related to the same unit of description.  The bibliographic records that go together to make up a consolidated Copac record may not contain all of the same information, but they won’t contain any contradictory information.  Thus two records which are the same in all details except date of publication (say 1983 in one, 1984 in the other) will not consolidate, but records which are the same in all details except that one contains a subject where the other does not, will consolidate.

In fact, subjects are one of the things (along with notes) that don’t affect consolidation at all.  We will combine all of the subjects that come in individual descriptions, so that a consolidated record might end up with the subjects:

  • Management
  • Management — theory
  • Management (theoretical)
  • Business & management

We will leave these in the linked data description for the same reason they are left in the Copac description – while such similar terms may seem superfluous, they actually increase discoverability, by providing multiple access points.  They will link into the central ‘unit of description’, rather than the individual bibliographic records.

Once we’d decided on this central unit of description (name TBD, but likely to be ‘Copac record’ or something boringly similar), other aspects of the description started to fall into place.  Some of these were straightforward – publication date, for instance, is fairly obviously a literal – while others took more thought and discussion.

Among the more complicated issues was that of creator.  We are working with MODS data, which has come from MARC data, and MARC allows you to have only one ‘creator’.  This creator sits in the 100 field as the main access point, and all other contributors (including co-authors!) are relegated to the 700 fields, where they become what we have decided to call ‘other person associated with this unit of description’.  Not very snappy, but hopefully fairly accurate.  In theory, the role that this person played in the creation of the item should be recorded in a MARC relator subfield, but in practice this is often not included in descriptions.  Where it is indicated that a person (or a corporate body) is an editor, contributor, translator, illustrator etc., we can build these roles into the modelling; where not, they will have to be satisfied with the vague title of ‘associated person’.

This will work for most situations, but it does still leave room for error.  Where a person is named in the 700s with no indication of role, it is possible that they are a person who was associated with one particular item, rather than the manifestation – a former owner or bookseller, for example.  While we do want to present this information, which works as another access point, and may be of interest to users, we have the problem that this information should really be associated with the item, not our quasi-manifestation.  This information only concerns one specific physical item, as described in one of the individual bibliographic records. Should it really have a link to our central unit of description?  If not, where do we link it to?  Our entries for individual bib records describe only the records themselves, not a physical real-world item.  It’s an interesting point, and one we’ll be discussing more as the project goes on.

We’re continuing to work with Copac data, and will discuss other issues here as they arise.

LOCAH Project – Projected Timeline, Workplan & Overall Project Methodology

Project Plan

WP1:  Project Management.

  • Project management to support the project, the relationships with project partners, and with the funders.

WP2:  Data Modelling

  • Model Archives Hub EAD data and Copac data to RDF

WP3:  Technical Development – Linked Data Interface

  • Transform the modelled data to RDF/XML
  • Enrich Hub and Copac data with data/links from sources such as DBpedia, the BBC, the Library of Congress, VIAF, MusicBrainz and Freebase
  • Provide both RDF and HTML documents for Archives Hub and Copac resources with stable, well-designed URIs
  • Provide a SPARQL endpoint for the Hub Linked Data resources
  • Look at the feasibility of providing a RESTful API interface to the Hub and Copac Linked Data resources

WP4: Prototype Development

  • Test and refine requirements for proposed prototypes
  • Design user interfaces for prototype
  • Technical development and testing of the user interfaces

WP5: ‘Opportunities and Barriers’ Reporting

  • Design and implement procedures for logging ongoing project issues
  • Analyse and synthesise logged issues against known Linked Data issues
  • Report on opportunities and barriers via the project blog, outlining methods and recommendations for overcoming or mitigating the issues identified wherever possible

WP6: Advocacy and Dissemination

  • Report on ongoing project progress and findings at JISC programme events
  • Demonstrate project outputs and report to communities on the findings of the opportunities and barriers reporting at relevant conferences and workshops

Timetable

WP / Month: 1 2 3 4 5 6 7 8 9 10 11 12
WP1 X X X X X X X X X X X X
WP2 X X X X
WP3 X X X X X
WP4 X X X X X X X
WP5 X X X X X X X X X X
WP6 X X X X X X X X X

Project Management and Staffing

Adrian Stevenson will project manage LOCAH to ensure that the workplan is carried out to the timetable, and that effective dissemination and evaluation mechanisms are implemented according to the JISC Project Management guidelines. Consortium agreements in line with JISC guidelines will be established for the project partners. UKOLN will lead on all the workpackages. Staff who will work on LOCAH are already in post.

Support for Standards, Accessibility and Other Best Practices

LOCAH will adhere to the guidance and good practice provided by JISC in the Standards Catalogue and JISC Information Environment. The primary technology methodologies, standards and specifications adopted for this project will be:

  • XML, XSLT, RDF/XML, RDFa, FOAF, SKOS, SPARQL, N3, JSON, RSS/Atom
  • Metadata standards: EAD, MODS, Dublin Core
  • Berners-Lee,T. (2006). ‘Linked Data – Design Issues’
  • Berners-Lee,T. (1998). ‘W3C Style: Cool URIs don’t change’
  • Cabinet Office, ‘Designing URI Sets for the UK Public Sector’
  • Dodds, L., Davis, I., ‘Linked Data Patterns’
  • W3C Web Accessibility Initiative (WAI)

LOCAH Project – Wider Benefits to Sector & Achievements for Host Institution

Meeting a need

High quality research and teaching relies partly on access to a broad range of resources. Archive and library materials inform and enhance knowledge and are central to the JISC strategy. JISC invests in bibliographic and archival metadata services to enable discovery of, and access to, those materials, and we know the research, teaching and learning communities value those services.

As articulated in the Resource Discovery Taskforce Vision, that value could be increased if the data can be made to “work harder”, to be used in different ways and repurposed in different contexts.

Providing bibliographic and archive data as Linked Data creates links with other data sources, and allows the development of new channels into the data. Researchers are more likely to discover sources that may materially affect their research outcomes, and the ‘hidden’ collections of archives and special collections are more likely to be exposed and used.

Archive data is by its nature incomplete and often sources are hidden and little known. User studies and log analyses indicate that Archives Hub users frequently search laterally through the descriptions; this gives them a way to make serendipitous discoveries. Linked data is a way of vastly expanding the benefits of lateral search, helping users discover contextually related materials. Creating links between archival collections and other sources is crucial – archives relating to the same people, organisations, places and subjects are often widely dispersed. By bringing these together intellectually, new discoveries can be made about the life and work of an individual or the circumstances surrounding important historical events. New connections, new relationships, new ideas about our history and society. Put this together with other data sources, such as special collections, multimedia repositories and geographic information systems, and the opportunities for discovery are significantly increased.

Similarly, by making Copac bibliographic data available as Linked Data we can increase the opportunities for developers to provide contextual links to primary and secondary source material held within the UK’s research libraries and an increasing number of specialist libraries, including the British Museum, the National Trust, and the Royal Society. The provision of library and special collections content as Linked Data will allow developers to build interfaces to link contextually related historical sources that may have been curated and described using differing methodologies. The differences in these methodologies and the emerging standards for description and access have resulted in distinct challenges in providing meaningful cross-searching and interlinking of this related content – a Linked Data approach offers potential to overcome that significant hurdle.

Researchers and teachers will have the ability to repurpose data for their own specific use. Linked Data provides flexibility for people to create their own pathways through Archives Hub and Copac data alongside other data sources. Developers will be able to provide applications and visualisations tailored to the needs of researchers, learning environments, institutional and project goals.

Innovation

Archives are described hierarchically, and this presents challenges for the output of Linked Data. In addition, descriptions are a combination of structured data and semi-structured data. As part of this project, we will explore the challenges in working with semi-structured data, which can potentially provide a very rich source of information. The biographical histories for creators of archives may provide unique information that has been based on the archival source. Extracting event-based data from this can really open up the potential of the archival description to be so much more than the representation of an archive collection. It becomes a much more multi-faceted resource, providing data about people, organisations, places and events.

The library community is beginning to explore the potential of Linked Data. The Swedish and Hungarian National Libraries have exposed their catalogues as Linked Data, the Library of Congress has exposed subject authority data (LCSH), and OCLC is now involved in making the Virtual International Authority File (VIAF) available in this way.

By treating the entities (people, places, concepts etc) referred to in bibliographic data as resources in their own right, links can be made to other data referring to those same resources. Those other sources can be used to enrich the presentation of bibliographic data, and the bibliographic data can be used in conjunction with other data sources to create new applications.

Copac is the largest union catalogue of bibliographic data in the UK, and one of the largest in the world, and its exposure as Linked Data can provide a rich data source, of particular value to the research, learning and teaching communities.

In answering the call, we will be able to report on the challenges of the project, and how we have approached them. This will be of benefit to all institutions with bibliographic and archival data looking to maximise its potential. We are very well placed within the research and teaching communities to share our experiences and findings.

RDFa – from theory to practice

Adrian Stevenson will be talking about the LOCAH project at IWMW 2010 in Sheffield in a session that looks at implementing RDFa.  The session will:

  1. provide an introduction to what’s happening now in Linked Data and RDFa
  2. demonstrate recent work exposing repository metadata as RDFa
  3. explain how integration of RDFa within a content management system such as Drupal can enrich semantic content – and in some cases help significantly boost search engine ranking.

Information about the session can be found at http://iwmw.ukoln.ac.uk/iwmw2010/sessions/bunting-dewey-stevenson/

[slideshare id=4734746&doc=rdfatheorytopractice-100712063426-phpapp01]