Textmining British Studies: An Overview of Recent Developments
Tim Hitchcock, University of Hertfordshire
[NOTE: This is a draft of a conference paper, so please keep in mind that the text is informal and the links and notations are still in development]
More printed material relating to British studies is available online in a digital format than is true of any other humanities discipline. Between Google Books, ECCO, EEBO, the Burney Collection, the British Library's 19th-century newspapers, the Parliamentary Papers, the Old Bailey Online, Connected Histories, Nines and 18thConnect, we are possessed of an almost 'infinite archive' of printed primary sources. We also have a growing number of manuscript materials available through sites like London Lives and the Transcribe Bentham project; and that is before we even begin to discuss the censuses and the wild, untamed products of the wider web, shepherded into existence by ill-regarded genealogists and simple enthusiasts. My guesstimate is that some 60% of every word published in English between the late fifteenth century and 1923 – between moveable type and Mickey Mouse – is available online in a keyword-searchable form.
Sometimes, I think this is a very bad thing.
It is one of the great ironies of the moment that the most revolutionary technical change in the history of the human ordering of information since writing – the creation of the infinite archive, with all its disruptive possibilities – has, for British Studies in particular, made the field increasingly reactionary and conservative.
If you look at what has been digitised in the last twenty years, with few exceptions it has been the Western canon. The most common sort of historical web resource coming out of the academy is a site dedicated to posting the musings of some elite, dead, white, Western male – some scientist or man of letters; or, more unusually, some equally elite, dead white woman of letters. And, for technical reasons as much as anything, 1950s scholarship has formed the overwhelming basis for most of the projects funded by millions of pounds of academic grants. It is simply easier to digitise a scholarly edition published in 1952, its contents determined by the oldest of canons, than it is to deal with the ephemera, cheap print and manuscript that make up the source base of most more innovative and non-canonical histories. The works of Darwin, Newton, Smith etc. have all been made freely available and searchable at the click of a mouse, as have Jane Austen's every note and draft – giving all their ideas a new hyper-availability, a new authority, a new importance, and a new canonicity. Even sites that were designed to mitigate this tendency, like the Old Bailey Online, have frequently simply reinforced an essentially conservative history of the criminal justice system – letting the oldest of methodologies (hunt-and-peck research) work ever so slightly faster, and creating its own problems of canonicity in the process. But, at the same time, the existence of this new, second, e-edition of the Western canon also makes possible a dramatically new and innovative series of approaches based in the large-scale analysis of inherited text.
For many of us, these developments have pushed us to the point where, although the text is readily available on every desktop, we cannot even begin to read all the material one would want to consult in a classic immersive fashion; and we frequently practise what has been described as a form of 'drive-by scholarship', in which keyword searching makes serious reading redundant. The practical limits on our research, in terms of travel grants and access, have disappeared, while the time and energy we have to read has remained unchanged. Instead, in relation to text, we are necessarily moving towards what Franco Moretti has dubbed 'distant reading', and towards the development of new methodologies for 'text mining' – the statistical analysis of large bodies of text/data. Stephen Ramsay's recent book, Reading Machines, illustrates four or five examples of what he describes as a new form of 'algorithmic literary criticism', but this is simply a taster for a wider series of practical methodologies. And the first thing I want to do today is simply to go through some of these methodologies and projects – just to begin to illustrate the wider research landscape that is rapidly evolving.
In relation to printed, primarily nineteenth-century text, the game-changing development was the Google Ngram Viewer, which allows you rapidly to chart the changing use of words and phrases as a percentage of the total published per year. This initiative came directly out of quantitative biology, and a collaboration between Google and two scholars from Harvard: Erez Lieberman Aiden and Jean-Baptiste Michel. The underlying aspiration of the academic field they have created – 'culturomics' – is nothing short of the quantitative analysis of culture, and by extension human history, using millions of books. In their words, culturomics is: "the application of high-throughput data collection and analysis to the study of human culture."
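The viewer's core measure – a term's share of all words printed in a given year – is simple to reproduce in miniature. The Python sketch below is an illustrative toy, not the Google implementation: the two-'year' corpus, the tokenisation and the chosen term are all invented.

```python
from collections import Counter

def relative_frequency(corpus_by_year, term):
    """Chart a term's use per year as a share of all words published that
    year, in the manner of an n-gram viewer."""
    freqs = {}
    for year, tokens in corpus_by_year.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        freqs[year] = counts[term] / total if total else 0.0
    return freqs

# A tiny invented corpus, already tokenised and lower-cased.
corpus = {
    1800: "the liberty of the press".split(),
    1810: "liberty liberty and property".split(),
}
print(relative_frequency(corpus, "liberty"))  # {1800: 0.2, 1810: 0.5}
```

A real viewer does the same arithmetic over millions of volumes, which is why plotting the result decade by decade becomes meaningful.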
In humanist and historical terms, culturomics reflects a return to an essentially positivist, 'scientific' historical approach, one that seeks to determine the character of a 'knowable' past from the texts it produced, and to use this data to infer the character of the laws governing human society. Culturomics is staggering in its underlying aspirations. But I suspect that it will make generations of historians – schooled in both a close and critical reading of primary sources, and in the need for a deep and empathetic immersion in a particular time and place – very uncomfortable.
Now, I love the Ngram Viewer, and spend my Sundays charting the changing use of dirty words decade by decade.
But it also forms the basis for a newly statistical approach to language, and that is important.
But, for most of us, I expect the exemplar for how this new set of approaches and tools might be used does not come from culturomics, nor does it necessarily imply the positivist assumptions that underpin that intellectual approach. Rather, it is found in the work of people who are using the existence of a few billion words more subtly. In the work of historians such as Ben Schmidt, Tim Sherratt and Jo Guldi can be found examples of scholarship that uses some of the same methodologies in a way that feels much more historically valid, and which essentially seeks to map meaning and linguistic behaviour across inherited text without making the assumption, central to culturomics, that we actually know how 'text' evidences a more or less unknowable past.
Some of my favourite work includes that undertaken by Ben Schmidt.
This histogram simply illustrates how the dialogue in a historical drama, Downton Abbey, compares to published UK English in the period it purports to represent. And it demonstrates that the authors of the dialogue committed some serious linguistic errors. All sorts of expressions, from 'black market' to 'staff luncheon', just weren't used by early twentieth-century British people. But more importantly, what Schmidt's work does is show us how to use a million books to contextualise a piece of literature or drama. George Eliot's historical novels, for instance, will have a different relationship to the period they represent than will those written in the twentieth century, or written for a 21st-century audience. Measures of this sort throw into sharp relief patterns of important historical/linguistic change. It never occurred to me that no one in the 1910s used the expression to 'feel loved'; but knowing this makes it possible to think differently about the history of affect.
And while the original Ngram Viewer has a number of problems, it is rapidly improving, and there is an evolving set of online tools that lets you do serious work with it. If you look at something like the SEASR iteration of the same tool, with its measures for correlates and alternative spellings, you end up with a facility that actually allows serious and credible linguistic searches to be undertaken. This graph is simply the print history of 18th- and 19th-century Anglophone politics.
Or look at Tim Sherratt's visualisations of the use of the terms 'Great War' and 'First World War' in 20th-century Australian newspapers. While entirely commonsensical, the detailed results for the 1940s in particular mark out the evidence for a month-by-month reaction to events, allowing both more directed immersive reading (drilling down to the finest detail) and a secure characterisation of large-scale collections.
Or, to bring this back to a more personal perspective, we can look at work Bill Turkel and I have done on the Old Bailey Online, in collaboration with Stefan Sinclair, Dan Cohen and many others – simply charting the distribution of the 125 million words reported in 197,000 trials, to analyse both the nature of the Old Bailey Proceedings as a publication and their relationship to words recorded as having been spoken in court; in this instance to illustrate, among a few other things, that serious crimes like killing were more fully reported than others in the 18th-century Proceedings, and that this pattern changed in the 19th century.
All of these projects are based on a simple methodology – counting words. Largely unlike culturomics, the abstract relationship between those words and their value as historical evidence is not the issue – these are simply different ways of looking at what we all look at already.
But, of course, once you start counting words – and stepping on the toes of corpus linguistics – all sorts of new approaches become possible, which in essence seek to go beyond counting terms, to what I think of as mapping meaning.
For a start, you can simply ask: 'where can I find similar texts?', and use measures such as 'Normalised Compression Distance' (NCD) and 'Term Frequency/Inverse Document Frequency' (TF/IDF) to identify statistically similar collections of words.
A measure like Normalised Compression Distance uses the same compression methodology that creates a Zip file to identify where phrases and words have been repeated, and uses that repetition to create a measure of similarity between any two texts. How close is this trial to that literary account?
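A minimal version of the measure can be written with any off-the-shelf compressor. The Python sketch below uses the standard zlib library; the three sample strings are invented for illustration, and real applications would compare much longer texts, where compression overheads matter less.

```python
import zlib

def ncd(x: str, y: str) -> float:
    """Normalised Compression Distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C() is the compressed length. Repeated phrases make the
    concatenation compress well, pushing the score towards 0;
    unrelated texts score nearer 1."""
    cx = len(zlib.compress(x.encode()))
    cy = len(zlib.compress(y.encode()))
    cxy = len(zlib.compress((x + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

trial = "the prisoner stole a silver watch from the pocket of the prosecutor"
account = "the prisoner took a silver watch from the prosecutor's pocket"
unrelated = "high-throughput data collection and quantitative analysis of culture"

# The trial should sit closer to the similar account than to the unrelated text.
print(ncd(trial, account) < ncd(trial, unrelated))
```

The appeal for historians is that nothing need be known about the language in advance: the compressor finds the shared phrases for you.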
Alternatively, a related methodology, which all of you encounter every time you hit a 'more like this' button while shopping online, is term frequency/inverse document frequency (TF/IDF). To quote Wikipedia, TF/IDF:
is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
So, to take an Old Bailey example, we can go from the trials which include the phrase 'about the streets' – some 400-odd, which tend to involve street sellers and beggars – to the 1,085 trials that share a similar vocabulary, and which are likely to include street sellers even though the phrase 'about the streets' is not itself present. From being restricted to analysing the categories of crime types and conviction rates, it becomes possible to analyse categories of texts.
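The weighting itself is easy to hand-roll on a small scale. In this Python sketch the three tiny 'trials' are invented; the point is only how the measure behaves: 'the' occurs in every document and so scores zero, while distinctive words like 'matches' stand out.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term in each document:
    TF  = term count / document length;
    IDF = log(number of documents / documents containing the term).
    Ubiquitous words are damped towards zero; distinctive words score high."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: (count / len(doc)) * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

docs = [
    "selling matches about the streets".split(),
    "begging about the streets".split(),
    "stole a gelding from the stable".split(),
]
weights = tf_idf(docs)
print(weights[0]["the"])  # 0.0 -- 'the' appears in every document
print(weights[0]["matches"] > weights[0]["streets"])  # True
```

A 'more like this' button then just compares these weight vectors between documents, typically by cosine similarity.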
Of course, this still doesn't get you to meaning. The origins of these methodologies in linguistics (and this is largely true of all text-mining methodologies) mean that they are directed at understanding and analysing language rather than meaning.
To get around this – to get to some form of 'meaning' – the most common current methodology is topic modelling, which has been around since 2003 and can be implemented using an off-the-shelf package such as MALLET.
Topic modelling starts from the other end from TF/IDF. It assumes a sort of random distribution of all subjects throughout a larger corpus, going on to identify the co-occurrence of meaningful words in each text division (usually a paragraph), but without defining those clumps in any way. It assumes that if two words regularly co-occur in a single paragraph, then other paragraphs with these same two words are likely to be about the same subject. This is sometimes referred to as a 'bag of words' approach. This in turn forms the basis for the identification of thematic meaning in a body of text by the user – i.e. just looking at the prominent words in a single paragraph to decide that it is about patriotism or housework, or indeed patriotic housework. The collection of words that becomes a paragraph about 'washing up', for instance, can then become a search tool for all paragraphs about washing up or housework. All of which ends up, again, in a series of rather cool – if opaque – visualisations.
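The co-occurrence intuition can be sketched without the probabilistic machinery of a real topic model (MALLET's implementation of LDA does something far more sophisticated). This toy Python index, built on invented paragraphs and a made-up stopword list, simply records which paragraphs share a given pair of content words:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_index(paragraphs, stopwords=frozenset({"the", "a", "and", "of"})):
    """Map each unordered pair of content words to the set of paragraphs
    containing both -- the raw 'bag of words' signal a topic model builds on."""
    index = defaultdict(set)
    for i, para in enumerate(paragraphs):
        words = sorted({w for w in para.lower().split() if w not in stopwords})
        for pair in combinations(words, 2):
            index[pair].add(i)
    return index

paras = [
    "washing the dishes and scrubbing the floor",
    "scrubbing pots and washing linen",
    "the price of wheat and the riots of 1795",
]
index = cooccurrence_index(paras)
# Paragraphs 0 and 1 share 'scrubbing' and 'washing' -- likely the same subject.
print(sorted(index[("scrubbing", "washing")]))  # [0, 1]
```

Where this toy retrieves paragraphs by exact shared pairs, a real topic model infers latent topics from the same co-occurrence evidence and assigns each paragraph a mixture of them.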
Most of the people using this methodology see it as a tool for combining what Franco Moretti describes as 'distant reading' with close reading of the more traditional sort; and up until now it has yet to be fully implemented with any major British dataset.
All of which is just to say that we have a few simple hand tools for working with massive text objects. We can count their individual elements. We can measure the textual difference between any two text objects. And finally, we can characterise their topical relatedness – even if we still have to eyeball the text to get to 'meaning'.
Of course, once you actually have any measure of textual content, you then want to tie it to something. If you know the price of wheat, you want to know the number of riots in dearth years. And in relation to text mining, after that first characterisation of text as data, the next place many historians have ended up is with place.
Projects like Electronic Enlightenment, for instance, have taken the bits and pieces of the Republic of Letters, the metadata for each letter sent between philosophes, and mapped them through time and space – arguably creating a different kind of text mining.
And this sort of methodology has in its own turn led to, or perhaps just evolved in parallel with, further attempts to implement the same kind of mapping of text with a true GIS, or Geographical Information System, component. In the last few years, as a result of Google Earth and OpenStreetMap, this methodology has suddenly become available for online applications in a new way, encouraging people working with large bodies of text to explore a spatial dimension.
In some cases this is about mapping a single new dataset – as in the visualisation of the shipping routes generated from digitised eighteenth-century ships' logs.
But it can also be about mapping text in its raw form.
My favourite work in this area is being undertaken by Ian Gregory at Lancaster, who is simply taking literature and giving it a geographical dimension. Starting with his project Mapping the Lakes: A Literary GIS, he has begun to add measures of space, place, height and experience to the literary output of the Lake poets, creating something new.
And in a major European project called 'Spatial Humanities: Texts, Geographic Information Systems and Places', Ian Gregory is building from that starting point towards a wider geography of text, and towards the tools to understand that text better. In other words, he is combining corpus linguistics with GIS to create something entirely new.
We are also trying to do something similar, though not nearly as ambitious, with eighteenth-century London material – to at least map text, if not quite to go as far as Gregory is planning. In our small project, Locating London's Past, Bob Shoemaker, Matthew Davies and I, with a large team of other people, recently mapped some 40 million words of trials and 2 million lines of structured databases onto a warped and rectified version of Rocque's 1746 map of London.
I can't recommend the result for the quality of the underlying data, but it does allow you to do things like map the distribution of words such as 'horse', 'mare' or 'gelding' in the Old Bailey.
Or, my favourite example (though I have no idea what it means), to map all the instances of the words for the industrial colours of 'blue, red and yellow' against the natural hues of 'brown and green' to explore an urban environment and to suggest different ways of thinking about a wider cityscape.
All of which just takes us to a point where something new is actually possible – where, for example, the close reading of a single paragraph, or word, or poem can be contextualised not just in the assumed and shared knowledge of a shared education and canon (the unstated circumstance of the outrageous privilege that underpins our discussions), but within a clearly and statistically validated intellectual context. We can still practise thick description and detailed textual analysis, while also pursuing Franco Moretti's 'distant reading'. What these approaches change is not necessarily the conclusion, but simply the authority of the argument.
And the means of doing this are there, and remarkably easy to use. Voyant Tools, for instance, was created by Stefan Sinclair and Geoffrey Rockwell at McGill and Alberta. This is simply a visualisation of where the word 'marriage' appears in Hardwicke's 1753 Marriage Act:
Or, if you use something like Zotero to keep your notes and manage citations, there is a plug-in called Paper Machines, created by Jo Guldi and the metaLAB at Harvard, to visualise the works you are reading in relation to topic, geography, phrases and a shedload more.
But to take a simple example of what these approaches make possible: you are probably familiar with Dror Wahrman's article, 'Percy's Prologue', published in Past & Present in 1998, and positing a profound shift in the construction of gender in the late 1770s. It is a fantastic and important article that takes just 49 lines of text – just 396 words – and builds a compelling story of changing attitudes to sex and gender on the basis of a close reading of those few words. But if you use an online facility like Stefan Sinclair's Voyant Tools, the character of that text is revealed in a new way.
The predominance of 'men' and 'man' may make us question which gender role is changing here; but more indicative is the dominant presence of the word 'woman'. Once plurals are taken into account, a version of 'woman' appears five times.
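Counts of this sort are trivial to reproduce. The Python sketch below is a hand-rolled stand-in for Voyant's frequency panel, run over an invented sample sentence rather than Percy's actual prologue; you pass in whichever inflections you want folded together.

```python
import re

def count_variants(text, variants):
    """Count how often any of the given word forms occurs in the text,
    folding singular, plural and possessive together."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tokens.count(v) for v in variants)

sample = ("The woman spoke, and other women listened; "
          "a woman's place was debated by men and man alike.")
print(count_variants(sample, {"woman", "women", "woman's"}))  # 3
print(count_variants(sample, {"man", "men", "man's"}))        # 2
```

Listing the variants explicitly, rather than relying on automatic stemming, keeps the historian in control of what counts as 'the same word'.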
And if you then look at how that single word changed over the 18th century, it is not that Wahrman's close reading is overturned; indeed, the changing frequency of the use of 'woman' in the 1770s confirms his chronology and in many ways supports his conclusion. But his gender revolution is contextualised in a new way. The 1770s saw one important shift, while the ngram suggests that it is part of a larger and more complex story – one that is partly social-historical and partly linguistic (or some amalgam of both).
In other words, these tools and methodologies – whether simply counting words, mapping meanings, or adding geography – make our close readings stronger and more powerful; they give old methodologies new life, and both suggest and convince in a way that close reading and anecdotal argument on their own occasionally fail to do.
But, to conclude: these tools and methodologies are important. They are not simply about shock-and-awe graphics and visualisations – in itself, that is not all that interesting.
Instead we need to remember that the technology is not actually the point, but simply a point along the way.
When I stare into the faces of non-white immigrants to Australia through Tim Sherratt's website, I don't see the technology; I see a humanist and political project at its most individual and compelling:
 Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331 (December 2010), accessed December 17, 2010, doi: 10.1126/science.1199644.