Seth Denbo, “Linking the Digital Past: British History and the Impact of the Semantic Web”

Linking the Digital Past: British History and the Impact of the Semantic Web
Seth Denbo, University of Maryland

[NOTE: This is a draft of a conference paper, so please keep in mind that the text is informal and the links and notations are still in development]

In an article entitled "New Tools for Men of Letters", published in the Yale Review in 1935, the historian of Europe Robert Binkley wrote,

The reader … found his world changed by the invention of printing. Books became more accessible. The first effect … was to spread more widely the source books by which all intellectual activities were fed … it became possible for the moderately wealthy man to possess what previously only princes or great religious establishments could afford—a fairly complete collection of the materials he desired.[1]

Binkley went on to argue that the specialization in all fields of knowledge that occurred in the nineteenth century destroyed this "happy position", and that the prohibitive cost of these vast materials made it necessary for the scholar to cede ownership of the sources back to institutions. In the first decades of the twentieth century, Binkley claimed, new technologies were planting the seeds for another revolution in accessibility and affordability akin to that which had occurred with the birth of moveable type. Mimeograph and microfilm – which had both been invented in the late nineteenth century but were becoming much more widely used in the 1930s – were greatly reducing the cost of printing and distributing primary sources.[2] In Binkley's utopian vision, the availability of inexpensive publishing and printing technologies would allow scholars who would otherwise be excluded from the production of knowledge to participate. And the stakes were high – cultural centralization led to Fascism. What America required in the late 1930s to avoid the trap in which Germany found itself was a program where the objectives were

"not only a bathroom in every home and a car in every garage but a scholar in every schoolhouse and a man of letters in every town. Towards this end technology offers new devices and points the way."

Today we can substitute digitization for the microfilm and 'near-print' technologies of the 1930s. These are of course not the same thing, but the ability to deliver large volumes of information at a fraction of the cost is an important similarity between the technologies, and the potential value of this material for historians is enormous. Primary source materials once accessible in a few major research libraries are now available on one's desk, any time of the day or night, a mere on/off switch away. Collections of online journals that span most of the history of the discipline itself are available to scholars whose universities subscribe to JSTOR, Project Muse or any number of publishers' platforms. In addition to being easy to obtain, these materials often offer searching facilities earlier generations of historians could only dream of. In a 2009 article in Digital Humanities Quarterly, the classicist Bruce Robertson put it like this:

It might be said that today's Web has the makings of an historian's fantasy: it provides worthy encyclopedia entries on a vast array of topics; its textual editions are rapidly improving in quality and amassing in quantity; and it offers historical source material ranging from argumentation in on-line editions of the best journals to the first-hand accounts appearing in blog entries.[3]

British historians are particularly well served by the Web. The combined resources of Early English Books Online and Eighteenth Century Collections Online (affectionately known as EEBO and ECCO respectively) provide full-text access to the majority of works published in English from the advent of printing to the end of the eighteenth century, and work is currently underway to expand these digital works into the nineteenth century. The whole of the Burney Collection, a major source for eighteenth-century newspapers, is now available online. The constellation of projects around the Old Bailey Online, including London Lives, Connected Histories and Locating London's Past, all combine an amazing array of primary sources with innovative means of interrogating and analysing those materials, providing the means for connecting materials into historical narratives that were previously difficult if not impossible to reconstruct. British History Online provides access to a vast array of public, parochial and private records.

These are just examples of the largest and most well-known of the digital resources available to British historians. Many smaller digitization projects have provided collections that supplement these larger corpora of texts, many British museums and archives maintain online collections, and many individuals maintain specialist digital collections on an area of interest that may be valuable to others within the profession.

Alongside the benefits inherent in the digitization of historical materials, this wealth of materials brings some major challenges and disadvantages. Two of these in particular are the subjects of my talk – discovery and context – together with some of the technological solutions currently under development that change the way resources are exposed via the Web and begin to solve some of these problems.

Discovery, as the historian Steven Lubar has argued in a recent blog entry on the question of how scholarly research has changed in the past decade, "is perhaps the stage of scholarship that's seen the largest change".[4] The wealth of materials that are now keyword searchable means that the volume of material that can be considered for a given research project is much greater than was previously possible. The ease with which this technological advance has happened makes searching second nature, and discovery would seem at first glance to be easy. Google has transformed finding things on the Web through sophisticated algorithms that make keyword searching effective. However, searching and discovery are related but different problems. While keyword searching of corpora has transformed certain types of historical research, there are still major issues to be solved when it comes to discovery of primary source materials.

One issue with discovery is overabundance – and every user of Google knows how many items a search can potentially return. But with this problem of overabundance comes something more difficult to solve – the issue of knowing if what you're finding is worthwhile and relevant, or if there are sources lurking somewhere on the web (or off) that are potentially valuable but which cannot be or are not found through searching. While an experienced researcher should in many cases, through following leads and keeping abreast of the literature, find her way, there are real issues around what it means to discover sources and how this process works in the digital realm.

The other issue is that of context. Keyword searching provides many, many items and resources. Context can be as simple as being able to filter out irrelevant items from a search in order to improve the quality of responses, or it can be much more complex in that it is about knowledge of the source itself rather than merely a few words. It is a potential pitfall of doing research in the age of digital abundance that the source itself is not sufficiently understood by the scholar who has found a reference that relates to his question without a thorough reading.

So what does all of this have to do with Linked Data? Before explaining what Linked Data does to solve this problem, I'm going to give a brief overview of what the Semantic Web and Linked Data are. I will then come back to the question of why this is relevant for historians and how it has the potential to greatly enhance web-based historical research.

The Web was originally conceived as a collection of "web pages" – discrete "documents or information resources … usually produced using HTML format, and providing navigation to other web pages via hypertext links".[5] The Web was set up to exchange documents – anyone could create a website that anyone else would be able to access – but beyond the information about formatting provided by HTML there was very little structure to those pages. With the vast variety of types of materials now exposed via the Web, and what has been called the "data deluge", this basic structure leads to some of the problems that I have already discussed.[6]

To deal with these problems, and to find better ways of exposing the masses of data available on the Web that had previously been locked up in these individual sites, the World Wide Web Consortium began to support what they call the Semantic Web. The idea was that just as hyperlinks are used to connect webpages, the Semantic Web would have structures to make it possible to connect disparate data from many different sources, thus unlocking the potential of the materials available. This is an attempt to re-conceive the Web as a web of data rather than one of pages. The ultimate goal of the Semantic Web is to make the data available on the web, much of which is locked up in specific applications, shareable and discoverable.

The problem with the theoretical concept of the Semantic Web arose in finding a way to realize the ideal of unlocking data that was not easily shared because it was hidden within sites or not easily transferred between applications. Linked Data refers to a specific technological solution for implementing this "web of data" that is based on two basic principles: first, that the data is presented according to an agreed model, and second, that it links to a vocabulary that provides an authoritative reference for the data. More succinctly, Linked Data is "structured data that links to published vocabularies to define terms".[7] The structured data side is a standard called the Resource Description Framework, which I will come back to later in the paper. Published vocabularies are authoritative data that are provided as a means to have a standard source for referring to entities. These are usually published by accepted organizations that have a history of creating classifications and ontologies for the consumption of others – such as the Library of Congress – but also by newer, often web-based institutions and projects such as Wikipedia.

One example of the kind of thing I'm talking about here is "Library of Congress Names". In the library's words:

The Library of Congress Name Authority File (NAF) file provides authoritative data for names of persons, organizations, events, places, and titles. Its purpose is the identification of these entities and, through the use of such controlled vocabulary, to provide uniform access to bibliographic resources. Names descriptions also provide access to a controlled form of name through references from unused forms

This is the LoC NAF entry for John Tillotson, Archbishop of Canterbury.[8] The page provides authoritative information about his name and dates. Importantly, it also provides information about variants in the ways he is referred to. While a human reader understands that "Archbishop Tillotson" refers to the same person as "John Tillotson", a computer requires this information to be provided in some fashion. It is possible for anyone publishing information on John Tillotson on the Web to link to this record in the Library of Congress Name Authority File, thus allowing automated processes for finding information on the Web to be certain that they are retrieving relevant information.

This becomes more important if the person's name is very common or in the case of a place name that refers to more than one location, making it difficult to search for using basic keyword functionality. Linking a name that is easily confused with others, especially in the case where the person is not a major historical figure, could be crucial for finding historical actors across different sources of data.

Another example of where confusion can arise is with titles. If I come across the Earl of Orford on a site, or this aristocratic title arises in a text that I've digitized and created a resource from, then as a reader I know from the context whether this refers to the First Lord of the Admiralty Edward Russell, or to one of the later, more famous but unrelated Earls of Orford, including England's first Prime Minister Robert Walpole or his literary offspring Horace. A search engine or other machine-driven way of finding information on the web cannot make such a judgement based upon context.[9] There could potentially be useful knowledge or a perspective on the individual who is being searched for in a site about something else entirely, but if the site is not specifically about one of these figures, then it's unlikely to come up in a search. A closer look at the entry for Horace Walpole in the Library of Congress NAF demonstrates how the use of Linked Data can provide a way of referencing any of a number of different pseudonyms, alternative spellings, etc.

By publishing this name in a structured format and embedding the link to this Library of Congress Name Authority File, this problem is alleviated because it makes it possible to discover the material, and to provide the necessary context to determine which Earl of Orford is being discussed. This information is then connected to other sites not through actual hypertext links but rather through using the same authority files.
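The mechanism can be sketched in a few lines of Python. This is a minimal illustration rather than a description of how any of the sites mentioned here actually work: the authority identifiers (such as `authority:tillotson-john`), the site names and the function name are all hypothetical placeholders, not real Library of Congress URIs.

```python
from collections import defaultdict

# Hypothetical mentions harvested from four unrelated sites. Each site has
# tagged the name it uses with an authority identifier. The identifiers are
# placeholders for illustration, not real Library of Congress URIs.
mentions = [
    {"site": "sermons.example", "text": "Archbishop Tillotson",
     "authority": "authority:tillotson-john"},
    {"site": "letters.example", "text": "John Tillotson",
     "authority": "authority:tillotson-john"},
    {"site": "navy.example", "text": "Earl of Orford",
     "authority": "authority:russell-edward"},
    {"site": "gothic.example", "text": "Earl of Orford",
     "authority": "authority:walpole-horace"},
]


def group_by_authority(records):
    """Group mentions by shared authority identifier, not by name string."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["authority"]].append((record["site"], record["text"]))
    return dict(grouped)


linked = group_by_authority(mentions)

# Different forms of the same person's name end up together...
print(linked["authority:tillotson-john"])
# ...while identical strings that refer to different people stay apart.
print(linked["authority:russell-edward"])
print(linked["authority:walpole-horace"])
```

The two sites that mention Tillotson never link to each other; they are connected only because both point at the same identifier, which is the sense in which the authority file does the linking.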

My examples here are all notable historical actors or people who produced published works, because it is easier to demonstrate in this context, but it is possible to see the value of using linked data for less well-known historical figures, especially in historical contexts where there is a lot of data published across a range of different sources. An authoritative way to refer to an individual who arises in, for example, the Old Bailey Online, a newspaper published in the Burney Collection, and the Convict Transportation Registers Database would be immensely valuable in enriching the historical record of this period.

Data linkage of this kind on the Web is primarily achieved through a standard known as the Resource Description Framework or RDF. This family of specifications is a set of guidelines for describing entities, or things of interest within a site. The idea of RDF is to have a system that is so simple and straightforward that it can represent any entity or fact about that entity, and yet be structured enough that computers can do useful things with it. There are a number of different technical ways to express RDF, but the basic idea is that things are described through what are known as triples which take the form of Subject – Predicate – Object sentences[10]. For example:

Subject        | Predicate  | Object
Shakespeare    | wrote      | King Lear
Shakespeare    | wrote      | Macbeth
Anne Hathaway  | married    | Shakespeare
Stratford      | is in      | England
Macbeth        | is set in  | Scotland
England        | is part of | The UK
Scotland       | is part of | The UK[11]

These descriptive phrases can be expressed in a number of different formats – XML, N3 – that are machine readable and so can be used to transfer these discrete pieces of information that may exist within larger bodies of text that are published on the web, and would previously have been inaccessible except through the native interface.
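To make the idea of a machine-readable format concrete, here is a minimal sketch in Python that writes the Shakespeare statements out one per line, in the style of N-Triples, a simple line-based RDF syntax that is not one of the formats named above. The `http://example.org/` base, the underscore naming and the helper functions are assumptions made for illustration; real Linked Data would use identifiers drawn from published vocabularies.

```python
# Base URI for the example identifiers; example.org is reserved for
# illustration and none of these URIs resolve to real records.
BASE = "http://example.org/"

triples = [
    ("Shakespeare", "wrote", "King Lear"),
    ("Shakespeare", "wrote", "Macbeth"),
    ("Anne Hathaway", "married", "Shakespeare"),
    ("Stratford", "is in", "England"),
    ("Macbeth", "is set in", "Scotland"),
    ("England", "is part of", "The UK"),
    ("Scotland", "is part of", "The UK"),
]


def to_uri(label):
    """Turn a human-readable label into a hypothetical URI in angle brackets."""
    return "<" + BASE + label.replace(" ", "_") + ">"


def to_ntriples(rows):
    """Serialize (subject, predicate, object) rows, one statement per line."""
    return "\n".join(
        f"{to_uri(s)} {to_uri(p)} {to_uri(o)} ." for s, p, o in rows
    )


print(to_ntriples(triples))
```

Each output line is a complete statement in its own right, which is what allows a single fact to be lifted out of one site and combined with facts published elsewhere.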

So what is all of this about? How does what ultimately amounts to a fairly technical solution to the problem of having data locked up within applications, websites and databases make any difference to historical research? The answer to that question isn't necessarily obvious. In many cases, it will never be apparent to the historian doing research using web-based sources that all of this stuff is actually doing its job.

One way to answer this question is to give an example based upon how Wikipedia works. In spite of obvious flaws (which I would argue are merely different from, but not more fatal than, those of any encyclopedic compendium of knowledge), Wikipedia has supplanted many print sources as a place a lot of people (including academics) go to for basic information about a topic.

There are, however, limits to the kinds of questions one can address to the source because of the unstructured way in which the data is presented. For example, if you are looking for information on Robert Walpole's political career, or the War of Jenkins's Ear, you can go to the respective entries and read them, but there is no easy way to find out what information about either of those things might be buried in the entry for the other one, or in others that are less easily discovered. Equally, while you can search Wikipedia for Sir John Soane, you can't ask it to find all of the architects who were active in London during the eighteenth century. The limited possibilities for querying the data are a barrier to doing anything more than the most basic searches.

Two projects are in the process of changing this. dbPedia is an initiative based in Germany to create Linked Data from existing Wikipedia entries. dbPedia is "a crowd-sourced community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to make sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data."[12] This project provides a kind of nucleus for the web of data by providing a huge amount of information across a wide range of domains in structured form. Because the dbPedia knowledge base is provided as Linked Data on the web, other data providers have used it as a place to link to (as with the Library of Congress data that I was using in the earlier examples).
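The difference that structure makes to querying can be sketched with a toy example. The Python below is not how dbPedia is queried in practice, and the predicate names (`occupation`, `active in`, `active during`) and the handful of statements are illustrative assumptions rather than data taken from dbPedia. It only shows that once facts are held as Subject – Predicate – Object statements, the question about London architects becomes a matter of combining simple patterns.

```python
# Toy statements in Subject - Predicate - Object form. The predicate names
# and the selection of people are illustrative assumptions, not dbPedia data.
facts = [
    ("John Soane", "occupation", "architect"),
    ("John Soane", "active in", "London"),
    ("John Soane", "active during", "18th century"),
    ("Robert Adam", "occupation", "architect"),
    ("Robert Adam", "active in", "London"),
    ("Robert Adam", "active during", "18th century"),
    ("Robert Walpole", "occupation", "politician"),
    ("Robert Walpole", "active in", "London"),
    ("Robert Walpole", "active during", "18th century"),
]


def subjects_matching(statements, predicate, obj):
    """Return the set of subjects that have the given predicate and object."""
    return {s for s, p, o in statements if p == predicate and o == obj}


# Intersect three simple patterns to answer a question that keyword search
# cannot: which architects were active in London in the eighteenth century?
architects = (
    subjects_matching(facts, "occupation", "architect")
    & subjects_matching(facts, "active in", "London")
    & subjects_matching(facts, "active during", "18th century")
)

print(sorted(architects))
```

A keyword search for "architect London eighteenth century" would return pages containing those words; the structured version returns the people themselves, and leaves out Walpole even though two of the three patterns match him.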

This web of data is now growing at an enormous rate and there are a number of hubs of this kind, some of which are domain specific and others of which are more general (as with dbPedia

