Is There a Text in This Book?

And, if so, where is it?

I started using Google Books heavily in 2008 in the last sprint of dissertation writing. How useful it was to be able to chase up a peskily errant quotation or to type “University town: University press” with conviction!

As I sat in the reading room scrolling through these facsimile pages, I thanked Google heartily for saving me from the descent into the library stacks. The last time I’d had to do this my knees had chafed from crouching on the floor.

The following year, I began using the database for new research: reading books that aren’t in the Stanford library; searching for phrases and words within well-known texts, and searching the whole corpus for more specific ones (“he reread his own work,” for example, in my project on revision); and downloading the full texts of books published before 1923 to collate across editions. Last fall, after using the Google Books database even more extensively in my Critical Methods class, I gave a short paper, “Google Books and the Plain Texts of Modernism,” at the Modernist Studies Association conference in Victoria.

Although I’d spent more time than most with Google Books by this point, quite a few of the assumptions I had made about how the database and search engine work were wrong. Some of these things might be obvious now, or to others, but I think not all are obvious to everyone. In particular, they might not be obvious to those of us who could benefit most from the data: academic humanists working in time periods between Gutenberg and the early twentieth century.

In the following series of posts, I’m planning first to talk about the project’s bibliophilia (why Google Books and not Google Texts?); then about kinds of error in the database; and finally about the kinds of humanities research that this enormously rich database might support. I’ve been lucky to receive some help over the last year from people within the team at Google, but I’ve also been struck by a prevailing note of defensiveness in their responses, as if any discussion of errors or flaws by humanities academics were unwelcome. My perspective is that understanding what the database and search function don’t do well allows us to use them more effectively.

I will address quantitative research and the Ngram project in a later post. For structural reasons, counting word occurrences or summing up the number of items in a set is problematic in the regular database. This is the practical difficulty. Nor has anyone proved that the frequency with which a word is used in a particular year or time period equates easily to a cultural fixation on it. If the Victorians were more preoccupied than we are with empire or religion or divorce, does that mean they used those words more frequently than we do? Sometimes writers self-censor, or circle around a central topic, or address it in language that we don’t think to use in a search query. The Waste Land is often spoken about and taught as a poem about the First World War, but it doesn’t actually contain the word “war.” The Victorians had a rich language of sexual slang which most of us wouldn’t think to type into the search engine. To conclude the series of posts, I’ll propose a few non-quantitative directions for research using the database.
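The gap between counting a word and finding a topic can be seen even at toy scale. The sketch below is only an illustration, not a description of how Google Books counts anything: it takes the opening lines of The Waste Land as a sample, and the function name `word_counts` is made up for the example. A whole-word count finds no “war,” while a naive substring test reports a hit because of “warm.”

```python
import re
from collections import Counter

# Opening lines of The Waste Land (1922), used here only as a tiny sample text.
SAMPLE = """April is the cruellest month, breeding
Lilacs out of the dead land, mixing
Memory and desire, stirring
Dull roots with spring rain.
Winter kept us warm, covering
Earth in forgetful snow, feeding
A little life with dried tubers."""


def word_counts(text: str) -> Counter:
    """Count lower-cased whole words (letters, with internal apostrophes)."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))


counts = word_counts(SAMPLE)
print(counts["war"])            # whole-word count of "war" in the sample
print("war" in SAMPLE.lower())  # naive substring test, which matches inside "warm"
```

Neither number says anything about whether the poem is about war; the count depends on how a “word” is defined before interpretation even begins.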

To begin with, a thought experiment:

Imagine a student, preparing in a hurry for class. She’s supposed to read King Lear and she doesn’t have a copy. The bookstore is out again. So she types “King Lear” into Google Books and begins preparing…

The first returned object is a 1723 edition printed in London by J. Darby. At least, this is true today: the returned results change over time and are not always reproducible. The second, dating from 1770 and edited by Charles Jennens, is the first variorum edition of the play. We often think of the eighteenth century as a period of strong, creative editing (or doing as one pleases, viz. Bentley’s Milton). But Jennens argues even-handedly in his Preface in favour of careful collation: “No editors that I know of has the right to impose upon every body his own favourite reading, or to give his own conjectural interpolation, without producing the readings of the several editions.” When we turn to the first page, we find an apparatus in full play. In this edition, Gloucester begins, “It did always seem so to us, but now in the division of the kingdom it appears not which of the dukes he values most…” and the speech carries two footnotes, b and c. This is how they look on the page:

[Images removed: the footnotes as printed on the page.]


The OCR (optical character recognition) software works on this page image to produce a text. The result can be seen by clicking on the “Plain text” button in the top right-hand corner, and it reads as follows:

* The seene ii not describtJin the qu't or fo's. *• The three last fo's am\tfi.

cThe qu'j read imjjifomi.

*So the qu'i; all the rest, qualities.

The first two objects that Google returns are editions of interest to the scholar and, as physical copies, they would be of some value. But they are not what our student was looking for. She doesn’t want to study the history of eighteenth-century editing or book publishing, or to trace the editorial fortunes of the play, or even to read in an unfamiliar typeface on weathered and marginalia-covered pages. She probably wants to be able to search in the plain text for important phrases, and this unfamiliar typeface, replete with the medial-s, makes searching hard. The fourth object returned is Stephen Orgel’s Pelican—the one that I, like many other teachers, set in my class. This would suit her needs better, and would put her on the same page as others reading in a hard copy—but, because of copyright, it isn’t fully accessible.
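The medial s is hard on search because OCR software tends to read the long s (ſ) as an f, so a query for “seem so” misses a plain text that says “feem fo.” One workaround, sketched below under stated assumptions, is to build a search pattern that accepts either letter. The function name `long_s_pattern` and the sample OCR string are hypothetical, invented for the illustration; this is not a feature of Google Books, and it handles only the f-for-s substitution, not the messier errors shown in the footnotes above.

```python
import re


def long_s_pattern(query: str) -> re.Pattern:
    """Build a regex that also matches OCR text where a long s was read as 'f'.

    Every 's' except a word-final one may appear as 's' or 'f', since the
    long s was not normally printed at the end of a word.
    """
    parts = []
    for word in query.split():
        chars = []
        for i, ch in enumerate(word):
            is_last = i == len(word) - 1
            if ch.lower() == "s" and not is_last:
                chars.append("[sf]")
            else:
                chars.append(re.escape(ch))
        parts.append("".join(chars))
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


# Hypothetical OCR output in which each long s has been read as "f".
ocr_text = "It did always feem fo to us, but now in the divifion of the kingdom"

pattern = long_s_pattern("seem so")
print(bool(pattern.search(ocr_text)))  # tolerant search
print("seem so" in ocr_text)           # exact search on the same text
```

The trade-off is false positives: a pattern built this way for “sat” would also match “fat,” so the tolerant search widens the net rather than correcting the text.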

So what’s the best she can do? A clearly printed late nineteenth- or early twentieth-century edition, with minimal editorial apparatus, prefaces, or commentary, would probably be best—and would produce a more accurate plain text for searching in. But these are tens or hundreds of results down and she doesn’t have time to get to them.

I give this example to make the point that what we want from books is not what we want from texts. Students who take a Shakespeare class in college want to talk about Lear’s madness, Gloucester’s blindness, or the play’s awful discharge of nothingness, not to wade through eighteenth-century editions like an antiquarian bookseller. Choosing an edition of King Lear from Google’s database isn’t as painful on the joints as rifling through the library shelves, but it requires just as much editorial judgment, and perhaps more. The average student or reader doesn’t want the thousands of different King Lears that Google Books provides. One will do just fine. Even among professional scholars, the number of people interested in the eighteenth-century editorial history of King Lear is quite small; “the angels of hermeneutics,” as Jerome McGann once called them, have often been satisfied with the Penguin.

One of the questions that I’ll come to in another post is whether, with input from book historians and textual critics, Google could improve the algorithm for sorting through its treasure trove of objects. More fundamentally, I want to ask: what relationship between texts and books does the project suppose? Where are we to suppose the ethereal, derivative plain texts are? How are they produced? And, given that the process of generating them from books introduces error, could we imagine a textual criticism devoted to improving them?

