Explaining Linked Data to Your Pro Vice Chancellor

At the JISCEXPO Programme meeting today I led a session on ‘Explaining linked data to your Pro Vice Chancellor’, and this post is a summary of that session. The attendees were: myself (Adrian Stevenson), Rob Hawton, Alex Dutton, and Zeth, with later contributions from Chris Gutteridge.

It seemed clear to us that this is really about focussing on institutional administrative data, as it’s probably harder to sell the idea of providing research data in linked data form to the Pro VC. Linked data probably doesn’t allow you to do things that couldn’t do by other means, but it is easier than other approaches in the long run, once you’ve got your linked data available. Linked Data can be of value without having to be open:

“Southampton’s data is used internally. You could draw a ring around the data and say ‘that’s closed’, and it would still have the same value.”

== Benefits ==

Quantifying the value of linked data efficiencies can be tricky, but providing open data allows quicker development of tools, as the data the tools hook into already exist and are standardised.

== Strategies ==

Don’t mention the term ‘linked data’ to the Pro VC, or get into discussing the technology. It’s about the outcomes and the solutions, not the technologies. Getting ‘Champions’ who have the ear of the Pro VC will help.  Some enticing prototype example mash-up demonstrators that help sell the idea are also important. Also, pointing out that other universities are deploying and using linked open data to their advantage may help. Your University will want to be part of the club.

Making it easy for others to supply data that can be utilised as part of linked data efforts is important. This can be via Google spreadsheets, or e-mailing spreadsheets for example. You need to offload the difficult jobs to the people who are motivated and know what they’re doing.

It will also help to sell the idea to other potential consumers, such as the libraries, and other data providers. Possibly sell on the idea of the “increasing prominence of holdings” for libraries. This helps bring attention and re-use.

It’s worth emphasising that linked data simplifies the Freedom of Infomataion (FOI) process.  We can say “yes, we’ve already published that FOI data”. You have a responsibility to publish this data if asked via FOI anyway. This is an example of a Sheer curation approach.

Linked data may provide decreased bureaucracy. There’s no need to ask other parts of the University for their data, wasting their time, if it’s already published centrally. Examples here are estates, HR, library, student statistics.

== Targets ==

Some possible targets are: saving money, bringing in new business, funding, students.

The potential for increased business intelligence is a great sell, and Linked Data can provide the means to do this. Again, you need to sell a solution to a problem, not a technology. The University ‘implementation’ managers need to be involved and brought on board as well as the as the Pro VC.

It can be a problem that some institutions adopt a ‘best of breed’ policy with technology. Linked data doesn’t fit too well with this. However, it’s worth noting that Linked Data doesn’t need to change the user experience.

A lot of the arguments being made here don’t just apply to linked data. Much is about issues such as opening access to data generally. It was noted that there have been many efforts from JISC to solve the institutional data silo problem.

If we were setting a new University up from scratch, going for Linked Data from the start would be a realistic option, but it’s always hard to change currently embedded practice. Universities having Chief Technology Officers would help here, or perhaps a PVC for Technology?

Final Product Post: Archives Hub EAD to RDF XSLT Stylesheet

Archives Hub EAD to RDF XSLT Stylesheet

Please note: Although this is the ‘final’ formal post of the LOCAH JISC project, it will not be the last post. Our project is due to complete at the end of July, and we still have plenty to do, so there’ll more blog posts to come.

User this product is for: Archives Hub contributors, EAD aware archivists, software developers, technical librarians, JISC Discovery Programme (SALDA Project), BBC Digital Space.

Description of prototype/product:

We consider the Archives Hub EAD to RDF XSLT stylesheet to be a key product of the Locah project. The stylesheet encapsulates both the Locah developed Linked Data model and provides a simple standards-based means to transform archival data to Linked Data RDF/XML. The stylesheet can straightforwardly be re-used and re-purposed by anyone wishing to transform archival data in EAD form to Linked Data ready RDF/XML.

The stylesheet is available directly from http://data.archiveshub.ac.uk/xslt/ead2rdf.xsl

The stylesheet is the primary source from which we were able to develop data.archiveshub.ac.uk, our main access point to the Archives Hub Linked Data. Data.archiveshub.ac.uk provides access to both human and machine-readable views of our Linked Data, as well as access to our SPARQL endpoint for querying the Hub data and a bulk download of the entire Locah Archives Hub Linked Dataset.

The stylesheet also provided the means necessary to supply data for our first ‘Timemap’ visualisation prototype. This visualisation currently allows researchers to access the Hub data by a small range of pre-selected subjects: travel and exploration, science and politics. Having selected a subject, the researcher can then drag a time slider to view the spread of a range of archive sources through time. If a researcher then selects an archive she/he is interested in on the timeline, a pin appears on the map below showing the location of the archive, and an call out box appears providing some simple information such as the title, size and dates of the archive. We hope to include data from other Linked Data sources, such as Wikipedia in these information boxes.

This visualisation of the Archives Hub data and links to other data sets provides an intuitive view to the user that would be very difficult to provide by means other than exploiting the potential of Linked Data.

Please note these visualisations are currently still work in progress:

Screenshots:

Data.archiveshub.ac.uk home page:

Screenshot of data.archiveshub.ac.uk homepage

Screenshot of data.archiveshub.ac.uk homepage

Prototype visualisation for subject ‘science’ (work in progress):

Screenshot of Locah Visualisation for subject 'science'

Locah Visualisation for subject ‘science’

Working prototype/product:

http://data.archiveshub.ac.uk/ead2rdf/

There are a large number of resources available on the Web for using XSLT stylesheets, as well as our own ‘XSLT’ tagged blog posts.

Instructional documentation:

Our instructional documentation can be found in a series of posts, all tagged with ‘instructionaldocs‘. We provide instructional posts on the following main topics:

Project tag: locah

Full project name: Linked Open Copac Archives Hub

Short description: A JISC-funded project working to make data from Copac and the Archives Hub available as Linked Data.

Longer description: The Archives Hub and Copac national services provide a wealth of rich inter- disciplinary information that we will expose as Linked Data. We will be working with partners who are leaders in their fields: OCLC, Talis and Eduserv. We will be investigating the creation of links between the Hub, Copac and other data sources including DBPedia, data.gov.uk and the BBC, as well as links with OCLC for name authorities and with the Library of Congress for subject headings.This project will put archival and bibliographic data at the heart of the Linked Data Web, making new links between diverse content sources, enabling the free and flexible exploration of data and enabling researchers to make new connections between subjects, people, organisations and places to reveal more about our history and society.

Key deliverables: Output of structured Linked Data for the Archives Hub and Copac services. A prototype visualisation for browsing archives by subject, time and location. Opportunities and barriers reporting via the project blog.

Lead Institution: UKOLN, University of Bath

Person responsible for documentation: Adrian Stevenson

Project Team: Adrian Stevenson, Project Manager (UKOLN); Jane Stevenson, Archives Hub Manager (Mimas); Pete Johnston, Technical Researcher (Eduserv); Bethan Ruddock, Project Officer (Mimas); Yogesh Patel, Software Developer (Mimas); Julian Cheal, Software Developer (UKOLN). Read more about the LOCAH Project team.

Project partners and roles: Talis are our technology partner on the project, providing us with access to store our data in the Talis Store. Leigh Dodds and Tim Hodson are our main contacts at the company. OCLC also partnered, mainly to help with VIAF. Our contacts at OCLC are John MacColl, Ralph LeVan and Thom Hickey. Ed Summers is also helping us out as a voluntary consultant.

The address of the LOCAH Project blog is http://archiveshub.ac.uk/locah/ . The main atom feed is http://archiveshub.ac.uk/locah/feed/atom

All reusable program code produced by the Locah project will be available as free software under the Apache License 2. You will be able to get the code from our project sourceforge repository.

The LOCAH dataset content is licensed under a Creative Commons CC0 1.0 licence.

The contents of this blog are available under a Creative Commons Attribution-ShareAlike 3.0 Unported license.

LOCAH Datasets
LOCAH Blog Content
Locah Code

Project start date: 1st Aug 2010
Project end date: 31st July 2011
Project budget: £100,000


LOCAH was funded by JISC as part of the #jiscexpo programme. See our JISC PIMS project management record.

Transforming EAD XML into RDF/XML using XSLT

This is a (brief!) second post revisiting my “process” diagram from an early post. Here I’ll focus on the “transform” process on the left of the diagram:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

The “transform” process is currently performed using XSLT to read an EAD XML document and output RDF/XML, and the current version of the stylesheet is now available:

(The data currently available via http://data.archiveshub.ac.uk/ was actually generated using the previous version http://data.archiveshub.ac.uk/xslt/20110502/ead2rdf.xsl. The 20110630 version includes a few tweaks and bug fixes which will be reflected when we reload the data, hopefully within the next week.)

As I’ve noted previously, we initially focused our efforts on processing the set of EAD documents held by the Archives Hub, and on the particular set of markup conventions recommended by the Hub for data contributors – what I sometimes referred to as the Archives Hub EAD “profile” – though in practice, the actual dataset we’ve worked with encompasses a good degree of variation. But it remains the case that the transform is really designed to handle the set of EAD XML documents within that particular dataset rather than EAD in general. (I admit that it also remains somewhat “untidy” – the date handling is particularly messy! And parts of it were developed in a rather ad hoc fashion as I amended things as I encountered new variations in new batches of data. I should try to spend some time cleaning it up before the end of the project.)

Over the last few months, I’ve also been working on another JISC-funded project, SALDA, with Karen Watson and Chris Keene of the University of Sussex Library, focusing on making available their catalogue data for the Mass Observation Archive as Linked Data.

I wrote a post over on the SALDA blog on how I’d gone about applying and adapting the transform we developed in LOCAH for use with the SALDA data. That work has prompted me to think a bit more about the different facets of the data and how they are reflected in aspects of the transform process:

  • aspects which are generic/common to all EAD documents
  • aspects which are common to some quite large subset of EAD documents (like the Archives Hub dataset, with its (more or less) common set of conventions)
  • aspects which are “generic” in some way, but require some sort of “local” parameterisation – here, I’m thinking of the sort of “name/keyword lookup” techniques I describe in the SALDA post: the technique is broadly usable but the “lookup tables” used would vary from one dataset to another
  • aspects which reflect very specific, “local” characteristics of the data – e.g., some of the SALDA processing is based on testing for text patterns/structures which are very particular to the Mass Observation catalogue data

What I’d like to do (but haven’t done yet) is to reorganise the transform to try to make it a little more “modular” and to separate the “general”/”generic” from the “local”/”specific”, so that it might be easier for other users to “plug in” components more suitable for their own data.

Serving Linked Data

Back near the start of the project, I published a post outlining the processes involved in generating the Archives Hub RDF dataset and serving up “Linked Data” descriptions from that dataset; it’s perhaps best summarised in the following diagram from that post:

Diagram showing process of transforming EAD to RDF and exposing as Linked Data

In this post, I’ll say a little bit more about what is involved in the “Expose” operation up in the top right of the diagram.

Cool URIs for the Semantic Web

In an earlier post, I discussed the URI patterns we are using for the URIs of “things” described in our data (archival resources, concepts, people, places, and so on). One of the core requirements for exposing our RDF data as Linked Data is that, given one of these URIs, a user/consumer of that URI can use the HTTP protocol to “look up” that URI and obtain a description of the thing identified by that URI. So as providers of the data, our challenge is to enable our HTTP server to respond to such requests and provide such descriptions.

The W3C Note Cool URIs for the Semantic Web lists a number of possible “recipes” for achieving this while also paying attention to the principle of avoiding URI ambiguity i.e. of avoiding using a single URI to refer to more than one resource – and in particularly to maintaining a distinction between the URI of a “thing” and the URIs of documents describing that thing.

Document URI Patterns

Within the JISCExpo programme which funds LOCAH, projects generating Linked Data were encouraged to make use of the guidelines provided by the UK Cabinet Office in Designing URI Sets for the UK Public Sector.

Thse guidelines refer to the URIs used to identify “things” (somewhat tautologically, it seems to me!) as “Identifier URIs”, where they have the general pattern:

http://{domain}/id/{concept}/{reference}

where:

  • concept is a name for a resource type, like “person”;
  • reference is a name for an individual instance of that type or class

(The guidelines also allow for the option of using URIs with fragment identifiers (“Hash URIs”) as “Identifier URIs”.)

The document also recommends patterns for the URIs of the documents which provide information about these “things”, “Document URIs”:

http://{domain}/doc/{concept}/{reference}

These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple “more specific” documents in a single concrete format may be available as a separate resource in its own right. So a third set of URIs, “Representation URIs,” name documents in a specific format, using the suggested pattern:

http://{domain}/doc/{concept}/{reference}/{doc.file-extension}

i.e. for each “thing URI”/”Identifier URI” in our data, like:

http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist, which identifies a person, the artist Beverley Skinner;

there is a corresponding “Document URI” which identifies a (“generic”) document describing the thing:

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist

and a set of “Representation URIs” each identifying a (“specific”) document in a particular format:

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.html, which identifies an HTML document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.rdf, which identifies an RDF/XML document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.turtle, which identifies a Turtle document;

http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.json, which identifies a JSON document (more specifically one using Talis’ RDF/JSON conventions for serializing RDF)

(We’ve deviated slightly from the recommended pattern here in that we just add “.{extension}” to the “reference” string, rather than adding “/doc.{extension}”, but we’ve retained the basic approach of distinguishing generic document and documents in specific formats, which I think is the significant aspect of the recommendations.)

This set of URI patterns corresponds to those used in the “recipe” described in section 4.2 of the W3C Cool URIs note, “303 URIs forwarding to One Generic Document”.

The Talis Platform

It is perhaps worth emphasising here that in the LOCAH case a “description” of any one of the things in our model may contain data which originated in multiple EAD documents e.g. a description of a concept may contain links to multiple archival resources with which it is associated, or a description of a repository may contain links to multiple finding aids they have published, and so on. A description may also contain data which originated from a source other than the EAD documents: for example, we add some postcode data provided by the National Archives, and most of the links to external resources, such as people described by VIAF records, are generated by post-transformation processes.

This aggregated RDF data – the output of the EAD-to-RDF transformation process and this additional data – is stored in an instance of the Talis Platform store. Simplifying things slightly, the Platform store is a “database” specialised for the storage and retieval of RDF data. It is hosted by Talis, and made avalable as what in cloud computing terms is referred to as “Software as a Service” (SaaS). (Actually, a Platform store allows the storage of content other than RDF data too – see the discussion of the ContentBox and MetaBox features in the Talis documentation – but we are, currently at least, making use only of the MetaBox facilities).

Access to the store is provided through a Web API. Using the MetaBox API, data can be added/uploaded to the MetaBox using HTTP POST, updates can be applied through what Talis call “Changesets” (essentially “remove that set of triples” and “add this set of triples”) again using HTTP POST, and “bounded descriptions” of individual resources can be retrieved using HTTP GET. There are also “admin” functions like “give me a dump of the contents” and “clear the database”. In addition, the Platform provides a simple full-text search over literals (which returns result sets in RSS), a configurable faceted search, an “augment” function and a SPARQL endpoint.

A number of client software libraries for working with the Platform are available, developed either by Talis staff or by developers who have worked with the Platform.

Delivering Linked Data from the Platform

I’m going to focus here on retrieving data from the MetaBox, and more specifically retrieving the “bounded descriptions” of individual resources which which provide the basis for the “Linked Data” documents.

This process involves a small Web application which responds to HTTP GET requests for these URIs:

  • For an “Identifier URI”, the server responds with a 303 status code and a Location header redirecting the client to the “Document URI”
  • For a “Document URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code, a document in a format selected according to the preferences specified by the client (i.e. following the principles of HTTP content negotiation), and a Content-Location header providing a “Representation URI” for a document in that format.
  • For a “Representation URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code and a document in the format associated with that URI.

The first step above is handled using a simple Apache rewrite rule. For the latter two steps, we’ve made use of the Paget PHP library created by Ian Davis of Talis for working with the Platform (Paget itself makes use of another library, Moriarty, also created by Ian). I’m sure there are many other ways of achieving this; I chose Paget in part because my software development abilities are fairly limited, but having had a quick look at the documentation and one of Ian’s blog posts, I felt there was enough there to enable me to take an example and apply my basic and rather rusty PHP skills to tweak it to make it work – at least as a short-term path to getting something functional we could “put out there”, and then polish in the future if necessary.

The main challenge was that the default Paget behaviour seemed to be to use the approach described in section 4.3 of the Cool URIs document, “303 URIs forwarding to Different Documents”, where the server performs content negotiation on the request for the “Identifier URI” and redirects directly to a “Representation URI”, i.e. a GET for an “Identifier URI” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist resulted in redirects to “Representation URIs” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.html or http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.rdf

If possible we wanted to use the alternative “recipe” described in the previous section, and after some tweaking we managed to get something that did the job. We also made some minor changes to provide a small amount of additional “document metadata”, e.g. the publisher of and license for the document. (I do recognise that the presentation of the HTML pages is currently pretty basic, and there is room for improvement!)

Finally, it’s maybe worth noting here that the Platform store itself doesn’t contain any information about the documents i.e. neither the Document URI nor the Representation URIs appear in RDF triples loaded to the store. So, in principle at least, we could add additional formats using additional Representation URIs simply by extending the PHP to handle the URIs and generate documents in those formats, without needing to extend the data in the store.

I’d started to write more here about extending what we’ve done to provide other ways of accessing the data, but having written quite a lot here already, I think that is probably best saved for a future post.

Lifting the Lid on Linked Data at ELAG 2011

Myself and Jane have just given our ‘Lifting the Lid on Linked Data‘ presentation at the ELAG European Library Automation Group Conference 2011 in Prague today. It seemed to go pretty well. There were a few comments about the licensing situation for the Copac data on the #elag2011 twitter stream, which is something we’re still working on.

[slideshare id=8082967&doc=elag2011-locah-110524105057-phpapp02]

Querying the Linked Archives Hub data using SPARQL

We’ve just announced the availability of our first draft linked data dataset of data from the Archives Hub. When newly available linked data datasets appear, I sometimes hear comments/questions along the lines of:

  • How do I know what the data looks like?
  • Show me some example SPARQL queries that I can use as starting points for my own exploration of the data

We’ve tried to go some way to addressing the first of those points in previous posts, in which I outlined the data model we’re using, to give a general picture of the types of things described and the relationships between them, and then provided a more detailed list of the RDF terms used to describe things. (That second post in particular will, I hope, be useful in thinking about how to construct queries).

In addition, there are some useful posts around on techniques for “probing” a SPARQL endpoint, i.e. issuing some general queries to get a picture of the nature of the graph(s) in the dataset behind an endpoint. See, for example:

In this post, I’ll focus mainly on responding to the second point, by providing a few sample SPARQL queries. Inevitably, these can only give a flavour of what is possible, but I hope they provide a starting point for people to build on.

This isn’t intended to be a tutorial on SPARQL; there are various such tutorials available, but one I found particularly thorough and helpful is:

The SPARQL endpoint for the Linked Archives Hub dataset is:

http://data.archiveshub.ac.uk/sparql.

The data is hosted in an instance of the Talis Platform, which supports a few useful extensions to the SPARQL standard, some of which are used in the examples below.

Listing “top-level” archival “collections”

Following the principles of “multi-level” description of archives, archivists apply a conceptualisation of archival materials as constituting hierarchically organised “collections”, where one “unit of description” may contain others, which in turn may contain others. It is often the case that an archival finding aid provides descriptions of materials only at the “collection-level”, or perhaps at some “sub-collection” level, without describing items individually at all.

In the LOCAH archival data, this approach is reflected in the use of a class ArchivalResource, where an instance of that class may have other instances as parts or members (or, inversely, one instance may be a part, or member, of another instance). This relationship is expressed using the properties dcterms:hasPart/dcterms:isPartOf and ore:aggregates/ore:isAggregatedBy.

The following query provides the URIs and labels (titles) of all archival resources mentioned in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
}

This list includes archival resources at any “level”, from collections down to individual items.

We want to narrow down that selection so that it includes only “top-level” archival resources i.e. archival resources which are not “part of” another archival resource. This can be done by extending our pattern to allow for the optional presence of a triple with predicate dcterms:isPartOf, and filtering to select only those cases where the object in that optional pattern is “not bound” i.e. no such triple is present in the dataset:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?ar ?arlabel
WHERE { 
?ar a locah:ArchivalResource ;
   rdfs:label ?arlabel .
   OPTIONAL { ?ar dcterms:isPartOf ?parent } .
   FILTER (!bound(?parent))
}

Run this query against the current LOCAH endpoint.

Finding the location of the Repository holding an Archival Resource

For each archival resource, access to that resource is provided by a Repository (an agent, an entity with the ability to do things). This relationship is expressed using the property locah:accessProvidedBy. The Repository-as-Agent manages a place where the resource is held, a relationship expressed using the locah:administers property, and that place is associated with a postcode, both as a literal, and (perhaps more usefully) in the form of a link to a “postcode unit” in the dataset provided by the Ordnance Survey; by “following” that link, more information about the location can be obtained (e.g. latitude and longitude, relationships with other places) from the data provided by the OS.

Given the URI of an archival resource (in this example http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner), the following query returns the URI of the repository (agent), the postcode as literal, and the URI of the postcode unit:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX ospc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

SELECT ?repo ?pc ?pcunit
WHERE {
   ?repo locah:providesAccessTo 
                <http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner> ;
           locah:administers ?place .
   ?place gn:postalCode ?pc ;
          ospc:postcode ?pcunit
}

Run this query against the current LOCAH endpoint.

Listing the Archival Resources associated with a Person

In the EAD finding aids, the description of an archival resource may provide an association with the name of one or more persons associated with the resource as “index terms”. The person may be the creator of the resource, they may be the topic of it, or there may be some other association which is considered by the archivist to be significant for people searching the catalogue.

The following query provides a list of person names, the “authority file” form of the name, the identifiers of the archival resources with which they are associated, and the URI of a page on the existing Hub Web site describing the resource. I’ve limited it to a particular repository as without that constraint it potentially generates a quite large result set (and it helps me conceal the fact that some of the person name data is still a little bit rough and ready!)

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX locah: <http://data.archiveshub.ac.uk/def/>

SELECT DISTINCT ?name ?famname ?givenname ?authname ?unitid ?hubpage
WHERE {
?arcres locah:accessProvidedBy <http://data.archiveshub.ac.uk/id/repository/gb15> ;
        locah:associatedWith ?concept ;
        dcterms:identifier ?unitid ;
        rdfs:seeAlso ?hubpage .
?concept foaf:focus ?person ;
             rdfs:label ?authname .
?person a foaf:Person;
        foaf:name ?name;
OPTIONAL {?person foaf:familyName ?famname;
                  foaf:givenName ?givenname }
}
ORDER BY ?famname ?givenname ?name  

Run this query against the current LOCAH endpoint.

Listing Concepts by number of associated Archival Resources

The following query lists the concepts from a specified concept scheme (here the UNESCO thesaurus, which is assigned the URI http://data.archiveshub.ac.uk/id/conceptscheme/unesco, and orders them according to the number of archival resources with which they are associated (This makes use of the count and GROUP BY Talis Platform SPARQL extensions):

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?concept ( count(?concept) AS ?count ) 
WHERE {
   ?x locah:associatedWith ?concept .
   ?concept skos:inScheme  <http://data.archiveshub.ac.uk/id/conceptscheme/unesco> .
 }
GROUP BY ?concept
ORDER BY DESC(?count)

Run this query against the current LOCAH endpoint.

Listing Persons associated with Archival Resources, where Persons are born during a specified period

In an earlier post, I described the modelling of the births and deaths of individual persons as “events”.

Based on this approach, birth or death events occurring within a specified period can be selected. So, for example, the following query returns a list of persons born during the 1940s, with the archival resources with which they are associated:

PREFIX locah: <http://data.archiveshub.ac.uk/def/>
PREFIX bio: <http://purl.org/vocab/bio/0.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?birthdate ?person ?name ?famname ?givenname ?ar
WHERE { 
?event a bio:Birth ;
   bio:date ?birthdate ;
   bio:principal ?person .
   FILTER regex(str(?birthdate), '^194') .
?person foaf:name ?name .
   OPTIONAL { ?person foaf:familyName ?famname ; foaf:givenName ?givenname } .
?concept foaf:focus ?person .
?ar locah:associatedWith ?concept .
}
ORDER BY ?birthdate ?name

Run this query against the current LOCAH endpoint.

(I use this to illustrate the “event” approach, but in this case, birth and death dates are also provided as literal values of properties associated with the person, so there are other (easier!) ways of getting that information.)

To close, I’ll just emphasise again that these are only a few simple examples, intended to give an idea of the structure/”shape” of the data, and a flavour of what sort of queries are possible. If you come up with any examples of your own you’d like to share, we’d be glad to hear about them in comments below. (Come to think of it, it’s probably not very easy to maintain formatting/whitespace etc in comments, so it might be easier to host any such examples elsewhere and just post links here).

P.S. If there are any “tweaks” that you think we could make that would make things easier for those consuming/querying the data, it would be good to hear about them. I can’t promise we’ll be able to implement them, but we are still at the stage where things can be changed and we do want the data to be as usable and useful as possible.

Archives Hub Linked Data Release

We’re very pleased to announce the release of http://data.archiveshub.ac.uk, the first Linked Data set produced by the LOCAH project. The team has been working hard since the beginning of the project on modelling the complex archival data and transforming it into RDF Linked Data. This is now available in a variety of forms via the data.archiveshub.ac.uk home page. A number of previous blog posts outline the modelling and transformation process, the RDF terms used in the data, and the challenges and opportunities arising along the way. A forthcoming post will provide some example queries for accessing data from the SPARQL query endpoint. The data and content is licensed under a Creative Commons CC0 1.0 licence.

We’re working on a visualisation prototype that provides an example of how we link the Hub Data with other Linked Data sources on the Web using our enhanced dataset to provide a useful graphical resource for researchers.

One important point to note is that this initial release is a selected subset, representative of the Hub collection descriptions as a proof of concept, and does not contain the full Archives Hub dataset at present, although we are very keen to explore this in the future.

We still have some work to do, this being the initial release of the Hub data. Some revisions for a later release will address a few issues including reconciling our internal person and subject names, and will also contain some further enhancements to the data to include links to Library of Congress subject headings and further links to DBPedia based on subject terms. We also hope to include links for place names using Geonames and Ordnance Survey.

We encourage feedback on the data, the model and any other aspect of data.archiveshub.ac.uk, so please leave comments or contact us directly.

We are also working hard on our other main LOCAH release, the Copac Linked Data. Our first version of the model for this is now finished, and we have the data in our test triple store. We hope to release this in about a month’s time.

I’d personally like to thank the LOCAH team for all their hard work on this exciting and challenging project. I’d also like to thank our technology partner, Talis for kindly providing our Linked Data store.

Describing the “things”: the RDF terms used (part 2)

In the previous post, I described some of the considerations in choosing RDF vocabularies to use for the LOCAH archival metadata. In the tables below, I’ve tried to summarise the properties used to “describe” an instance of each of the classes in our model, i.e. for a particular thing URI, in our dataset, one might expect to find triples with that URI as subject and these property URIs as predicates, and when our data is served as linked data, and a thing URI is dereferenced the “bounded description” provided will include those triples (and others) – though some may be optional, so not necessarily present for all instances (and some may not be present at all until we add some more data…!)

This is really more of a “reference document” than a blog post, but I provide it in part as documentation of the data creation/transformation process, and in part as a guide for potential users of the actual data. Having said that, the data is liable (even likely) to change so consumers should always refer to the actual data for an up-to-date picture of the terms used. I’ve tried to highlight (dark grey background) below terms which I consider to be particularly “at risk” and liable to be removed/replaced, mostly terms from the “locah” vocabulary.

Most of this data is generated from the transformation of the EAD XML documents; a small proportion is added separately. Again, I’ve tried to indicate that in the tables below (light grey background).

Finding Aid

Type rdf:type Class URI locah:FindingAid
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Conforms to dcterms:conformsTo Standard URI standard:isadg
Publisher dcterms:publisher Repository (Agent) URI
Encoded As locah:encodedAs EAD URI
Topic foaf:topic Archival Resource URI
Subject dcterms:subject Archival Resource URI
Has Part dcterms:hasPart Biographical History URI

The following should probably be an owl:sameAs relationship (or we should just cite the Hub URI?)

See also rdfs:seeAlso Hub Page URI e.g.
http://archiveshub.ac.uk/data/
gb15sirernesthenryshackleton

EAD

Type rdf:type Class URI locah:EAD
foaf:Document
bibo:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Identifier dcterms:identifier plain literal
Description dcterms:description plain literal
Date Created dcterms:created plain literal
Conforms to dcterms:conformsTo Standard URI dbpedia:
Encoded_Archival_Description

standard:ead2002
Encoding Of locah:encodingOf Finding Aid URI

The Hub does not currently provide a URI for the EAD document, but it is planned to do so, at which point we should add an owl:sameAs relationship (or just cite the Hub URI?)

Repository (Agent)

Type rdf:type Class URI locah:Repository
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Identifier dcterms:identifier plain literal
Country Code locah:
countryCode
plain literal
Maintenance Agency Code locah:
maintenance
AgencyCode
plain literal
Is Publisher Of locah:isPublisherOf Finding Aid URI
Provides Access To locah:
providesAccessTo
Archival Resource URI
Administers locah:administers Place URI
See also rdfs:seeAlso Archon Page URI e.g.
http://www.nationalarchives.gov.uk/
archon/searches/
locresult_details.asp?LR=15

Repository (Place)

Type rdf:type Class URI wg84_pos:
SpatialThing
Label rdfs:label plain literal
Title dcterms:title plain literal
Is Administered By locah:
isAdministeredBy
Repository URI
See also rdfs:seeAlso Archon Page URI e.g.
http://www.nationalarchives.gov.uk/
archon/searches/
locresult_details.asp?LR=15

The following data is not generated from the EAD documents, but added in from a separate source:

Postal Code gn:postalCode plain literal
Located In gn:locatedIn Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Within ossr:within Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER
Postcode postcode:postcode Postcode Unit URI e.g.
http://data.ordnancesurvey.co.uk/id/postcodeunit/CB21ER

Archival Resource

Type rdf:type Class URI locah:ArchivalResource
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Level locah:level Level URI
Page foaf:page Finding Aid URI
Access Provided By locah:
accessProvidedBy
Repository (Agent) URI
Identifier dcterms:identifier plain literal
Date dcterms:date plain literal
Date Created or Accumulated locah:
dateCreated
AccumulatedString
plain literal

The following properties were introduced to distinguish different date cases (date range v single date).

Date Created or Accumulated locah:
dateCreated
Accumulated
typed literal
Date Created or Accumulated (Start) locah:
dateCreated
AccumulatedStart
typed literal
Date Created or Accumulated (End) locah:
dateCreated
AccumulatedEnd
typed literal
Produced In event:produced_in Creation Event URI
Extent (String) locah:extent plain literal
Extent dcterms:extent Extent URI
Language dcterms:language Language URI e.g.
http://lexvo.org/id/iso639-3/eng
Is Represented By locah:
isRepresentedBy
Document URI or Aggregation URI
Origination locah:origination Agent URI
Has Biographical History locah:
hasBiographicalHistory
Bioghist URI
Associated With locah:
asssociatedWith
Concept URI
Has Part dcterms:hasPart Archival Resource URI
Aggregates ore:aggregates Archival Resource URI
Is Part Of dcterms:isPartOf Archival Resource URI
Is Aggregated By ore:isAggregatedBy Archival Resource URI
Members locah:members (RDF Collection)
See also rdfs:seeAlso Hub Page URI e.g.
http://archiveshub.ac.uk/data/
gb15sirernesthenryshackleton-
gb15sirernesthenryshackleton-
imperialtrans-antarcticexpedition

For all of the following, the object is simply a copy of the XML element content from the EAD document as an XML Literal. This is a rather “dumb” and probably not terribly useful “translation” from the EAD; in a future iteration of the transform, we hope to extract further useful triples from this part of the EAD data, and we will probably remove some of these triples.

Custodial History locah:
custodialHistory
XML literal
Acquisitions locah:acquisitions XML literal
Scope and Content locah:
scopecontent
XML literal
Appraisal locah:appraisal XML literal
Accruals locah:accruals XML literal
Access Restrictions locah:
accessRestrictions
XML literal
Use Restrictions locah:
useRestrictions
XML literal
Physical or Technical Requirements locah:
physicalTechnical
Requirements
XML literal
Other Finding Aids locah:
otherFindingAids
XML literal
Location Of Originals locah:
locationOfOriginals
XML literal
Alternate Forms Available locah:
alternateForms
Available
XML literal
Related Material locah:
relatedMaterial
XML literal
Bibliography locah:bibliography XML literal
Note locah:note XML literal
Processing locah:processing XML literal

Level

Type rdf:type Class URI locah:Level
skos:Concept
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Comment rdfs:comment plain literal
Note skos:note plain literal
Definition skos:definition plain literal
Description dcterms:description plain literal

Language

Type rdf:type Class URI lvont:Language

Creation (Event)

Type rdf:type Class URI locah:Creation
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Product event:product Archival Resource URI
Involved lode:involved Archival Resource URI
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI

Time Interval

Type rdf:type Class URI time:Interval
time:TemporalEntity
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Timeline timeline:timeline Timeline URI timeline:universaltimeline
Start timeline:start typed literal
Interval Starts time:intervalStarts Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
End timeline:end typed literal
Interval Ends time:intervalEnds Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
Contains crm:
P86i_contains
Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
At timeline:at typed literal
Interval During time:intervalDuring Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874
Falls Within crm:
P86_falls_within
Time Interval URI e.g.
http://reference.data.gov.uk/id/
year/1874

Extent

Type rdf:type Class URI locah:Extent
Label rdfs:label plain literal

In the EAD XML doc, extent is expressed simply as a literal. Where possible we’ve tried to parse out a “unit of measurement” and a quantity, reflected in RDF as a triple where the predicate reflects the unit and the object the quantity, as a typed literal, to try to make comparisons easier. I need to catch up with what current “best practice” is for representing quantities/units of measurement so this may well change. Also, currently, “units” include things like “file”, “paper” and “envelope”, which may not be terribly useful.

Archival Box locah:archbox typed literal (xsd:decimal)
Metre (Linear) locah:metre typed literal (xsd:decimal)
Cubic Metre locah:cubicmetre typed literal (xsd:decimal)
Folder locah:folder typed literal (xsd:decimal)
Envelope locah:envelope typed literal (xsd:decimal)
Volume locah:volume typed literal (xsd:decimal)
File locah:file typed literal (xsd:decimal)
Item locah:archbox typed literal (xsd:decimal)
Page locah:page typed literal (xsd:decimal)
Paper locah:paper typed literal (xsd:decimal)

Origination (Agent)

Type rdf:type Class URI foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Page foaf:page Biographical History URI
Is Origination Of locah:
isOriginationOf
Archival Resource URI

For links to other agents (external or internal):

Same As owl:sameAs Agent URI
Is Like umbel:isLike Agent URI

Biographical History

Type rdf:type Class URI locah:
BiographicalHistory
bibo:DocumentPart
bibo:Document
foaf:Document
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Title dcterms:title plain literal
Body locah:body XML literal, plain literal
Topic foaf:topic Agent URI
Subject dcterms:subject Agent URI
Is Biographical History For locah:
isBiographicalHistoryFor
Archival Resource URI
Is Part Of dcterms:isPartOf Finding Aid URI

Concept Scheme

Type rdf:type Class URI skos:ConceptScheme
Label rdfs:label plain literal

Concept (ControlAccess – Subject)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Concept (ControlAccess – Persname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Person URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the person who is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Surname locah:surname plain literal
Forename locah:forename plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

Person (ControlAccess – Persname)

Type rdf:type Class URI foaf:Person
foaf:Agent
dcterms:Agent
crm:
E21_Person
Label rdfs:label plain literal
Name foaf:name plain literal
Family Name foaf:familyName plain literal
Given Name foaf:givenName plain literal
Dates locah:dates plain literal
Title locah:title plain literal
Epithet locah:epithet plain literal
Other locah:other plain literal

For links to other persons (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Person URI
Is Like umbel:isLike Person URI

Concept (ControlAccess – Famname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the family that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Family (ControlAccess – Famname)

Type rdf:type Class URI locah:Family
foaf:Group
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other families (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Family URI
Is Like umbel:isLike Family URI

Concept (ControlAccess – Corpname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Family URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the organisation that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Organisation (ControlAccess – Corpname)

Type rdf:type Class URI foaf:Organization
foaf:Agent
dcterms:Agent
Label rdfs:label plain literal
Name foaf:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other organisations (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Organisation URI
Is Like umbel:isLike Organisation URI

Concept (ControlAccess – Geogname)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
Focus foaf:focus Place URI
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

The following properties represent the structure that is captured for controlaccess elements using the Hub EAD profile. Currently, there are properties associated with both the concept and the place that is the foaf:focus of the concept. I’m not sure this is necessary/useful, and we may remove some of these triples.

Name locah:name plain literal
Dates locah:dates plain literal
Location locah:location plain literal
Other locah:other plain literal

Place (ControlAccess – Geogname)

Type rdf:type Class URI wg84_pos:
SpatialThing
Label rdfs:label plain literal
Name locah:name plain literal
Location locah:location plain literal
Other locah:other plain literal

For links to other places (internal or external). Not generated from the EAD documents, but added in via separate process.

Same As owl:sameAs Place URI
Is Like umbel:isLike Place URI

Concept (ControlAccess – GenreForm)

Type rdf:type Class URI locah:GenreForm
skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Concept (ControlAccess – Function)

Type rdf:type Class URI skos:Concept
Label rdfs:label plain literal
In Scheme skos:inScheme Concept Scheme URI

For links to other concepts (internal or external). Not generated from the EAD documents, but added in via separate process.

Exact Match skos:exactMatch Concept URI
Close Match skos:closeMatch Concept URI

Book/Document (ControlAccess – Title)

Type rdf:type Class URI foaf:Document
bibo:Document
Label rdfs:label plain literal
Title dcterms:title plain literal

Birth (Event)

Type rdf:type Class URI bio:Birth
bio:IndividualEvent
bio:Event
lode:Event
event:Event
crm:E67_Birth
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:
P4_has_time-span
Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Brought Into Life crm:
P98_brought_into_life
Person URI

Death (Event)

Type rdf:type Class URI bio:Death
bio:IndividualEvent
bio:Event
lode:Event
event:Event
crm:E69_Death
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Has Time-Span crm:
P4_has_time-span
Time Interval URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:
P100_was_death_of
Person URI

Floruit (Event)

Type rdf:type Class URI locah:Floruit
bio:IndividualEvent
bio:Event
lode:Event
event:Event
Label rdfs:label plain literal
Preferred Label skos:prefLabel plain literal
Date bio:date typed literal
Date dcterms:date typed literal
Time event:time Temporal Entity URI
At Time lode:atTime Temporal Entity URI
Agent bio:agent Person URI
Principal bio:principal Person URI
Agent event:agent Person URI
Involved Agent lode:involvedAgent Person URI
Was Death Of crm:
P100_was_death_of
Person URI

Object

Type rdf:type Class URI foaf:Document
bibo:Document
Is Aggregated By ore:isAggregatedBy Object Group URI

Object Group

Type rdf:type Class URI ore:Aggregation
dcmitype:Collection
bibo:Collection
Aggregates ore:aggregates Object URI