Hannah Sullivan

I started using Google Books heavily in the last sprint of dissertation writing. How useful to be able to discover an errant page number without getting up! How reassuring to type “Oxford: Oxford University Press” with confirmed conviction.

The following year, I began using the database for new research, which included some of the following activities: reading rare 19c or early 20c books that aren’t in the Stanford library; searching for phrases and words within well-known texts, and searching the whole corpus for more specific ones (“he reread his own work,” for example, in my project on revision); and downloading the full texts of books published before 1923 to collate or compare across editions. Last Fall, after using the Google Books database even more extensively in my Critical Methods class, I gave a short paper, “Google Books and the Plain Texts of Modernism,” at the MSA conference in Victoria.

Although I’d spent a more-than-average amount of time with Google Books by this point, quite a few of my assumptions about how the database and search engine work were wrong. Some of these things may be obvious now, or obvious to others, but I don’t think all of them are obvious to everyone. In the following series of posts, I plan first to talk about different kinds of error in the database; then about the project’s bibliophilia (why Google Books, and not Google Texts?); and finally about the kinds of humanities research that this enormously rich database might support.

I will address quantitative research, and particularly the Ngram project, in a later post. For structural reasons, counting word occurrences or summing the number of items in a set is very problematic in the regular database. Nor has anyone proved that the frequency with which a word is used in a particular year or period equates easily to a cultural preoccupation with it. If the Victorians were more preoccupied than we are with empire or religion or divorce, does that mean they used those words more frequently than we do? Sometimes writers self-censor, or circle around a central topic, or address it in language that we don’t think to use in a search query. The Waste Land is often spoken about and taught as a poem about the first world war, but it doesn’t actually contain the word “war.” The Victorians had a rich language of sexual slang which most of us wouldn’t think to type into the search engine. To conclude the series of posts, I’ll propose a few non-quantitative directions for research using the database.

            For now, to go back to last Fall:

At this point, I was interested mostly in the way that the Google Books database upends the traditional relationship between the text and the book. Books were historically understood as fragile containers, the temporal casing in which an immaterial and timeless text presented itself. They had less auratic value, and perhaps less use value, than an authorial manuscript. Shakespeare scholars would give a great deal to have not only the Folio and Quarto of King Lear, and the thousands of editions descending from this divergent pair, but Shakespeare’s own (unblotted) final autograph. That document would somehow be the play King Lear in a more substantial and primary way than any printed text.

Textual criticism was the art – or science – designed to deal with the fact that transmission entails corruption. Medieval scribes made unconscious errors and perverse alterations as they copied, and typesetting too produces errors – a line of type is set twice, a printer’s clerk swaps out a word he doesn’t recognize, or an author deals carelessly with proofs. The work of the editor was to get back to a hypothesized “lost original.” As Jerome McGann puts it, this needn’t have actually ever existed: “critics use this heuristically, as a focussing device for studying the extant documents.”

            In the enormous library scanned by Google, this ontology (text precedes book) is reversed. The scanned pages come first, and the texts which we can search on are derived automatically from them by OCR (“optical character recognition”). This has a few peculiar consequences:

1) What you see isn’t what you get. Because the OCR process is imperfect, the texts that we search on are not identical to the printed pages that we have.

2) In one important sense, the OCR process isn’t like a medieval scribe, or a printer’s clerk, or an editorial assistant: it has no semantic intelligence. It doesn’t make corrections to the books, but can only introduce error. Consequently, what you get is always worse than what you see. The plain texts are more “corrupt” than the page scans.

3) These errors are much more significant in the case of books from earlier time periods or in unusual fonts, books that have been heavily used, and books containing extensive marginalia. Books in foreign languages are especially problematic, and so too are books in more than one language. Below is a fairly simple and classic example. How do you imagine the Greek name “Persephone” is represented in the searchable text? The book, titled The Scripture Chronology Demonstrated by Astronomical Calculations, was published in London in 1730.

Image removed.

The answer, due partly to problems with the long “S,” is “telephone.” Here is the plain text version.
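The long-“s” problem can be sketched as a character-confusion process. Here is a minimal toy illustration in Python – nothing like Google’s actual OCR pipeline, and the confusion table and function names are my own invention. It models only the one confusion the example above turns on: the long “s” (ſ), which in 18th-century typefaces is routinely misread as “f.”

```python
# A toy sketch of character confusion in OCR. This is NOT Google's
# pipeline; it only illustrates the principle that OCR can substitute
# visually similar glyphs. The single confusion pair modeled here
# (the long "s" read as "f") is a well-known failure mode for
# 18th-century typefaces; any further pairs would be guesses.
CONFUSIONS = {"\u017f": "f"}  # U+017F LATIN SMALL LETTER LONG S

def noisy_ocr(text: str) -> str:
    """Apply each confusion unconditionally: a worst-case 'reading'."""
    return "".join(CONFUSIONS.get(ch, ch) for ch in text)

print(noisy_ocr("Per\u017fephone"))  # prints "Perfephone"
```

From a corrupted string like “Perfephone,” a later cleanup pass that snaps unrecognized words to the nearest dictionary entry could plausibly land on “telephone” – which is one reason a purely mechanical process, with no semantic intelligence, can only carry a text further from its printed page.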