Back near the start of the project, I published a post outlining the processes involved in generating the Archives Hub RDF dataset and serving up “Linked Data” descriptions from that dataset; it’s perhaps best summarised in the following diagram from that post:

In this post, I’ll say a little bit more about what is involved in the “Expose” operation up in the top right of the diagram.
Cool URIs for the Semantic Web
In an earlier post, I discussed the URI patterns we are using for the URIs of “things” described in our data (archival resources, concepts, people, places, and so on). One of the core requirements for exposing our RDF data as Linked Data is that, given one of these URIs, a user/consumer of that URI can use the HTTP protocol to “look up” that URI and obtain a description of the thing identified by that URI. So as providers of the data, our challenge is to enable our HTTP server to respond to such requests and provide such descriptions.
The W3C Note Cool URIs for the Semantic Web lists a number of possible “recipes” for achieving this while also paying attention to the principle of avoiding URI ambiguity, i.e. of avoiding using a single URI to refer to more than one resource – and in particular to maintaining a distinction between the URI of a “thing” and the URIs of documents describing that thing.
Document URI Patterns
Within the JISCExpo programme which funds LOCAH, projects generating Linked Data were encouraged to make use of the guidelines provided by the UK Cabinet Office in Designing URI Sets for the UK Public Sector.
These guidelines refer to the URIs used to identify “things” (somewhat tautologically, it seems to me!) as “Identifier URIs”, where they have the general pattern:
http://{domain}/id/{concept}/{reference}
where:
- concept is a name for a resource type, like “person”;
- reference is a name for an individual instance of that type or class
(The guidelines also allow for the option of using URIs with fragment identifiers (“Hash URIs”) as “Identifier URIs”.)
The document also recommends patterns for the URIs of the documents which provide information about these “things”, “Document URIs”:
http://{domain}/doc/{concept}/{reference}
These documents are, I think, what Berners-Lee calls Generic Resources. For each such document, multiple representations may be available, each in different formats, and each of those multiple “more specific” documents in a single concrete format may be available as a separate resource in its own right. So a third set of URIs, “Representation URIs,” name documents in a specific format, using the suggested pattern:
http://{domain}/doc/{concept}/{reference}/{doc.file-extension}
i.e. for each “thing URI”/“Identifier URI” in our data, like:
http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist, which identifies a person, the artist Beverley Skinner;
there is a corresponding “Document URI” which identifies a (“generic”) document describing the thing:
http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist
and a set of “Representation URIs” each identifying a (“specific”) document in a particular format:
http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.html, which identifies an HTML document;
http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.rdf, which identifies an RDF/XML document;
http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.turtle, which identifies a Turtle document;
http://data.archiveshub.ac.uk/doc/person/ncarules/skinnerbeverley1938-1999artist.json, which identifies a JSON document (more specifically one using Talis’ RDF/JSON conventions for serializing RDF)
(We’ve deviated slightly from the recommended pattern here in that we just add “.{extension}” to the “reference” string, rather than adding “/doc.{extension}”, but we’ve retained the basic approach of distinguishing generic document and documents in specific formats, which I think is the significant aspect of the recommendations.)
This set of URI patterns corresponds to those used in the “recipe” described in section 4.2 of the W3C Cool URIs note, “303 URIs forwarding to One Generic Document”.
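To make the relationship between the three kinds of URI concrete, here is a minimal Python sketch that derives the Document URI and the Representation URIs from an Identifier URI, following the “.{extension}” variant we use. The function names are hypothetical and exist only for this illustration; the extensions are the four listed above.

```python
from urllib.parse import urlsplit, urlunsplit

# Extensions taken from the Representation URIs listed above.
EXTENSIONS = ("html", "rdf", "turtle", "json")


def document_uri(identifier_uri):
    """Swap the leading /id/ path segment for /doc/ (hypothetical helper)."""
    parts = urlsplit(identifier_uri)
    if not parts.path.startswith("/id/"):
        raise ValueError("not an Identifier URI: " + identifier_uri)
    doc_path = "/doc/" + parts.path[len("/id/"):]
    return urlunsplit((parts.scheme, parts.netloc, doc_path, "", ""))


def representation_uris(identifier_uri):
    """Map each extension to the URI of the document in that format."""
    doc = document_uri(identifier_uri)
    return {ext: doc + "." + ext for ext in EXTENSIONS}


thing = "http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist"
print(document_uri(thing))
for ext, uri in representation_uris(thing).items():
    print(ext, uri)
```

Because the mapping is purely a string transformation on the path, the reverse step (from a Document or Representation URI back to the Identifier URI) is equally mechanical.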
The Talis Platform
It is perhaps worth emphasising here that in the LOCAH case a “description” of any one of the things in our model may contain data which originated in multiple EAD documents e.g. a description of a concept may contain links to multiple archival resources with which it is associated, or a description of a repository may contain links to multiple finding aids it has published, and so on. A description may also contain data which originated from a source other than the EAD documents: for example, we add some postcode data provided by the National Archives, and most of the links to external resources, such as people described by VIAF records, are generated by post-transformation processes.
This aggregated RDF data – the output of the EAD-to-RDF transformation process and this additional data – is stored in an instance of the Talis Platform store. Simplifying things slightly, the Platform store is a “database” specialised for the storage and retrieval of RDF data. It is hosted by Talis, and made available as what in cloud computing terms is referred to as “Software as a Service” (SaaS). (Actually, a Platform store allows the storage of content other than RDF data too – see the discussion of the ContentBox and MetaBox features in the Talis documentation – but we are, currently at least, making use only of the MetaBox facilities).
Access to the store is provided through a Web API. Using the MetaBox API, data can be added/uploaded to the MetaBox using HTTP POST, updates can be applied through what Talis call “Changesets” (essentially “remove that set of triples” and “add this set of triples”) again using HTTP POST, and “bounded descriptions” of individual resources can be retrieved using HTTP GET. There are also “admin” functions like “give me a dump of the contents” and “clear the database”. In addition, the Platform provides a simple full-text search over literals (which returns result sets in RSS), a configurable faceted search, an “augment” function and a SPARQL endpoint.
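The “remove that set of triples, add this set of triples” idea behind a Changeset can be illustrated with plain Python sets. This is only a conceptual sketch: it does not use the Talis API or the actual Changeset document format, the function name is hypothetical, and the example triple (a FOAF name for the person URI) is made up for illustration rather than taken from our dataset.

```python
def apply_changeset(store, removals, additions):
    """Return a new set of triples: store minus removals, plus additions."""
    return (store - removals) | additions


person = "http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist"
name = "http://xmlns.com/foaf/0.1/name"

# Illustrative data only: each triple is a (subject, predicate, object) tuple.
store = {(person, name, "Skinner, Beverley")}

updated = apply_changeset(
    store,
    removals={(person, name, "Skinner, Beverley")},
    additions={(person, name, "Beverley Skinner")},
)
print(updated)
```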
A number of client software libraries for working with the Platform are available, developed either by Talis staff or by developers who have worked with the Platform.
Delivering Linked Data from the Platform
I’m going to focus here on retrieving data from the MetaBox, and more specifically retrieving the “bounded descriptions” of individual resources which provide the basis for the “Linked Data” documents.
This process involves a small Web application which responds to HTTP GET requests for these URIs:
- For an “Identifier URI”, the server responds with a 303 status code and a Location header redirecting the client to the “Document URI”
- For a “Document URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code, a document in a format selected according to the preferences specified by the client (i.e. following the principles of HTTP content negotiation), and a Content-Location header providing a “Representation URI” for a document in that format.
- For a “Representation URI”, the server derives the corresponding “Identifier URI”, queries the Platform store to obtain a description of the thing identified by that URI, and responds with a 200 status code and a document in the format associated with that URI.
The first step above is handled using a simple Apache rewrite rule. For the latter two steps, we’ve made use of the Paget PHP library created by Ian Davis of Talis for working with the Platform (Paget itself makes use of another library, Moriarty, also created by Ian). I’m sure there are many other ways of achieving this; I chose Paget in part because my software development abilities are fairly limited, but having had a quick look at the documentation and one of Ian’s blog posts, I felt there was enough there to enable me to take an example and apply my basic and rather rusty PHP skills to tweak it to make it work – at least as a short-term path to getting something functional we could “put out there”, and then polish in the future if necessary.
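The three kinds of response listed above can be sketched in a few lines of Python. To be clear, this is not the Apache rule or the Paget-based PHP we actually use; it is a simplified illustration with hypothetical function names. The media types in the table are my assumptions, the Accept parsing ignores specificity rules, falling back to HTML when nothing matches is a simplification, and the query against the Platform store is left out (the function just returns the Identifier URI that would be described).

```python
BASE = "http://data.archiveshub.ac.uk"

# Extension -> media type. The extensions come from the post; the media
# types are assumptions chosen for this illustration.
FORMATS = {
    "html": "text/html",
    "rdf": "application/rdf+xml",
    "turtle": "text/turtle",
    "json": "application/json",
}


def negotiate(accept):
    """Pick the extension whose media type has the highest q-value."""
    best_ext, best_q = None, 0.0
    for item in accept.split(","):
        media, *params = [part.strip() for part in item.split(";")]
        q = 1.0
        for param in params:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        for ext, mime in FORMATS.items():
            wildcards = ("*/*", mime.split("/")[0] + "/*")
            if (media == mime or media in wildcards) and q > best_q:
                best_ext, best_q = ext, q
    # Simplification: fall back to HTML rather than answering 406.
    return best_ext or "html"


def handle(path, accept="*/*"):
    """Return (status, headers, identifier_uri) for a GET on the given path."""
    if path.startswith("/id/"):
        # Identifier URI: redirect to the generic Document URI.
        return 303, {"Location": BASE + "/doc/" + path[len("/id/"):]}, None
    if path.startswith("/doc/"):
        stem, dot, ext = path.rpartition(".")
        if dot and ext in FORMATS:
            # Representation URI: the format is fixed by the extension.
            thing = BASE + "/id/" + stem[len("/doc/"):]
            return 200, {"Content-Type": FORMATS[ext]}, thing
        # Document URI: choose a format from the client's Accept header.
        ext = negotiate(accept)
        thing = BASE + "/id/" + path[len("/doc/"):]
        headers = {
            "Content-Type": FORMATS[ext],
            "Content-Location": BASE + path + "." + ext,
        }
        return 200, headers, thing
    return 404, {}, None


print(handle("/id/person/ncarules/skinnerbeverley1938-1999artist"))
print(handle("/doc/person/ncarules/skinnerbeverley1938-1999artist",
             "text/turtle, text/html;q=0.5"))
```

Note that only the request for the Document URI involves content negotiation; the Identifier URI always redirects to the same place, which is what distinguishes this recipe from the one in section 4.3 of the Cool URIs note.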
The main challenge was that the default Paget behaviour seemed to be to use the approach described in section 4.3 of the Cool URIs document, “303 URIs forwarding to Different Documents”, where the server performs content negotiation on the request for the “Identifier URI” and redirects directly to a “Representation URI”, i.e. a GET for an “Identifier URI” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist resulted in redirects to “Representation URIs” like http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.html or http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist.rdf
If possible we wanted to use the alternative “recipe” described in the previous section, and after some tweaking we managed to get something that did the job. We also made some minor changes to provide a small amount of additional “document metadata”, e.g. the publisher of and license for the document. (I do recognise that the presentation of the HTML pages is currently pretty basic, and there is room for improvement!)
Finally, it’s maybe worth noting here that the Platform store itself doesn’t contain any information about the documents i.e. neither the Document URI nor the Representation URIs appear in RDF triples loaded to the store. So, in principle at least, we could add additional formats using additional Representation URIs simply by extending the PHP to handle the URIs and generate documents in those formats, without needing to extend the data in the store.
I’d started to write more here about extending what we’ve done to provide other ways of accessing the data, but having written quite a lot here already, I think that is probably best saved for a future post.