
Stanford Students Use Digital Tools to Analyze Classic Texts


A network map of correlated physical adjectives that increase in usage over the 19th century.

What happens when students with little or no computer science background are tasked with working with a collection of digitized 19th-century texts in a course taught by a humanities computing scholar? The combination of these three elements made for an enlightening research endeavor for both students and professor in an ongoing Stanford course entitled “Literary Studies and the Digital Library” – a class so engaging that almost all of the students chose to continue their projects for a second and third quarter.

The course began in the fall of 2009 with a simple yet tempting course description: “In this class, there will be 1,200 books assigned, but students won’t read any of them.” Fourteen students from disciplines such as English, history, biology, computer science and even anthropology signed up for the course, taught by Matthew Jockers, an English Department lecturer and the department’s Academic Technology Specialist. The material they would study came primarily from a Chadwyck-Healey collection of digitized 19th-century English and American novels, which includes well-known works such as Pride and Prejudice, Great Expectations, Moby Dick, and The Last of the Mohicans. The comprehensive digitized literature collection is a model for the kinds of resources literary scholars will likely turn to with increasing frequency in the near future; indeed, this April, Jockers and his team submitted a grant application that would give them access to the entire Google Books collection.

Digital Humanities Scholar Brings Research Into the Classroom

Although the Chadwyck-Healey source texts were written in English, the students would need to become familiar with a few new languages to make use of the digitized text. As part of the course, students became versed in the programming languages PHP and Python, as well as MySQL for managing the database and R, which they used for the statistical analysis component.

Linguistics scholars typically use computational and statistical techniques to analyze large amounts of textual data, but Jockers explains that in his own research, his methods of approaching text derive from a relatively new area of scholarship called corpus stylistics. Corpus stylistics is an approach that uses theories relating to linguistic concepts such as phonetics and syntax to analyze literary texts.

Jockers has adapted techniques from the field of linguistics for literary analysis and has been examining literary works with computer-based tools. He looks for recurring themes and trends in word usage and then examines how these aspects of the novel’s content and style change over time. With this in mind, he began teaching an Introduction to Humanities Computing course a few years ago. Students who took that course learned to extract information from individual texts. Last fall, the English department asked him to teach a more in-depth course meant for both graduate and undergraduate students so that they could learn to apply the methods of text analysis and text mining on a broader level. With the students, Jockers wanted to see what could be done not with one text, but with twelve hundred. Increasing computer capabilities combined with newly digitized book collections made Jockers’ vision possible.

Fourteen students showed up for the first day of the seminar, none with any background in computer text analysis. When asked why he decided to take a digital humanities course, history PhD student Cameron Blevins explains, “It was frankly too interesting of a class to pass up. There are very few other classes that are going to present this kind of opportunity.”

Technology and Literature Unite

During the fall quarter, students were given a general introduction to digital literary research techniques, and they began to learn how to write code designed to process the text line by line and word by word. Technology was a tool, however, and not the academic focus. At the heart of the course was the novel, possibly the most popular genre of literature in the nineteenth century. “The nineteenth century is the period in which the novel matures, so we are charting the rise of the novel and its evolution as a form,” says Jockers. “As the novel becomes a more generally recognized form, to what extent does its style change? We are looking at the full text of hundreds of books at once, analyzing how the style changes as the novel evolves.”

Throughout the quarter, the students refined their approach, decided what questions to tackle, divided into groups, and developed abstracts introducing the specifics of their studies. Cameron describes the teamwork as “kind of an organic process, mostly among ourselves, talking about what we wanted to do.” Jockers wanted the students to figure out their research questions on their own.

“The process was completely driven by students in the class. It was impressive to see how fourteen people arrive at an interesting overall proposal that really ties everything together quite well,” Cameron adds. “It was really important that we did have complete freedom with the problems and proposals. A lot of us have a personal stake in the project that might not have been there if there was a predetermined agenda.”

The students decided to analyze the novel through three levels of inquiry. The most detailed level focused on the words themselves. Which parts of speech appeared over and over? What articles tended to be used? The middle level examined formal structures of paragraphs, chapters, and sentences. The broadest level looked at overall themes, or “semantic fields,” the topical patterns between novels.

Research Yields Valuable Historical Insights

The quarter flew by, and by the end, it was clear that more time was needed for such an in-depth project. After prompting from the students, Jockers agreed to offer an ad hoc seminar to be held in the department’s newly born “literature lab.” In the winter of 2010, students were able to pick up where they left off. That is when the most exciting thing about the course occurred. “The really unexpected result was that of the fourteen students who took the course, thirteen of them wanted to keep working on the project,” says Jockers. And work they did. By the end of the term, they had not only written and developed a complex set of computational tools, but had explored a wide range of questions about the novel: whether the rise of serialization had an impact on the structure of the novel; whether there really is such a thing as a distinguishable literary period; and whether novelistic subgenres can be detected algorithmically.

The answers to these questions did not begin to come until this spring, when, after the formal seminar was over, the students continued to meet in the lab and work through the data. Among other discoveries, the team has found that American usage of proper nouns nearly triples in frequency over the course of the century, while British usage remains relatively stable. “This trend is significant,” says Blevins, “and may speak to the increasing desire and need of a young, expanding nation to assign new names to its places and people.” And the team’s analysis of semantic fields in the British corpus has revealed a sharp drop in “abstract values” from the 1830s onwards, while at the same time there is a sharply rising inverse trend in a semantic field they call “concrete description/action words.” Together, these data suggest less a shift in moral content over the century and more a fundamental shift in narration: from “telling” to “showing.”
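A trend of this kind can be measured, in principle, by tracking how often the words in a semantic field appear per decade relative to all words published in that decade. The Python sketch below shows one simple way to do that. It is not the team’s method: the function name `field_rate_by_decade`, the three-word field, and the two one-sentence “books” are invented purely for illustration and carry no real data.

```python
from collections import defaultdict

def field_rate_by_decade(books, field_words):
    """Return {decade: share of words belonging to the semantic field}.

    `books` is an iterable of (publication_year, text) pairs.
    """
    totals = defaultdict(int)  # all words seen in each decade
    hits = defaultdict(int)    # words from the field seen in each decade
    for year, text in books:
        decade = (year // 10) * 10
        words = text.lower().split()
        totals[decade] += len(words)
        hits[decade] += sum(1 for w in words if w in field_words)
    return {d: hits[d] / totals[d] for d in sorted(totals) if totals[d]}

# Invented toy inputs, not drawn from the corpus described in the article.
abstract_values = {"virtue", "honour", "modest"}
books = [
    (1815, "virtue and honour guide the modest heart"),
    (1885, "she opened the heavy door and walked out"),
]

rates = field_rate_by_decade(books, abstract_values)
for decade, rate in rates.items():
    print(decade, round(rate, 3))
```

Normalizing by the total word count in each decade matters because the number of novels published, and therefore the raw counts, can differ greatly from one decade to the next.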

Project Highlights Merits of Interdisciplinary Collaboration

In the end, twelve students continued the course. What is more, based on the success of the class and the positive response to it, Jockers has been asked to teach it again in the fall of 2010.

So what made the students consistently excited about the project? Surprisingly, while the advanced technology initially lured them in, the greatest reward for most of the students was the collaborative aspect.

English PhD student Ryan Heuser says, “I think that it’s really ground-breaking on a few levels. Number one, this methodology and this level of cooperation are rarely seen in the humanities. It’s also revolutionary in the sense that we’re just a bunch of grad students and undergraduates, and in two quarters, we have built an entire corpus of novels and three separate ways of studying them. It demonstrates the vitality of the students.”

When asked how the project affected the way he thought about research in the humanities, he adds, “What’s most striking is the fact that fourteen students are cooperating on one project, whereas in the humanities you often shut yourself up in your office to produce something by yourself.”

Cameron agrees that such extensive collaboration is rare but welcome, saying, “As for me, I haven’t done this kind of collaborative humanities work. [This project] really drove home the importance of collaboration. I was sitting on a couch next to a bio undergrad and an English PhD, talking about programming in PHP. That crystallized what it’s really about – how fascinating interdisciplinary studies are. I gained an appreciation for how different scholars approach questions.”

Cameron believes that classes like the one he just took make Stanford unique. “I am a huge proponent of this class and what it does. This class embodies why Stanford is considered one of the frontrunners of the digital humanities community. It’s a completely unique opportunity for someone like myself to take a class outside of my department with so many different people, working with undergraduates learning technical skills in a collaborative project. I strongly believe that few other institutions can provide this kind of opportunity.”

He leaves us with one sentence that sums up his experience very simply. “I can’t say enough about how much I absolutely loved it.”