Essay
The Places and Uses of Data

The very first event of the "Data that Divides Us" series was “The Place of Data.” Our guest speaker Roopika Risam described the place of data as an imperative for scholars like herself to look at the objects they study. Risam is a scholar of Film and Media Studies and Comparative Literature. I am also a scholar of Comparative Literature, and so the place of data became a question that Risam addressed to me directly, as her junior peer in the same discipline: What is the place of data in our work?

Risam’s question is deceptively simple.

What is the place of data in our work? I have a friend at Stanford, Klemen Kotar, who works in Computer Vision and can very easily point to the data in his own published papers. For example, in a paper he co-authored for NeurIPS in 2023, the section titled “Data” appears in the Appendix with the subheadings “Dataset description,” “Link and license,” “Maintenance,” “Author statement,” and “Format,” along with a hyperlink to a dataset sheet that lives on GitHub.

Klemen explained to me in a phone conversation that the paper I had been reading is a dataset paper, which means that its key focus is the data the authors collected and the way they evaluated their model on tasks surrounding that data. He also explained that Artificial Intelligence (AI) research is based on the “holy trinity of Data, Algorithms and Graphics Processing Units (GPUs),” and that almost every paper in AI research has some section regarding data because data is the fuel that powers the work. Papers that focus primarily on new methods still include a section on the research data, but it is typically shorter and much vaguer in content. It is common for AI papers that present new models to heavily deemphasize the data. Data is crucial to the success of model training (the more data the better), but researchers typically do not advertise how they obtained their data. Fair use applies to academic research, but industry research, in which data collection occurs in not-so-legal ways, sets this standard for research presentation.

What is the place of data in our work? Unlike Klemen, I am hard-pressed to point to the data, or the literal place of data, in my own work. I cannot recall ever having written or read a paper in Comparative Literature that had the subtitle “Data,” let alone all the justificatory apparatus I just described from Klemen’s paper. Naturally, the answer to the question “What is the place of data in my work?” is going to be much more long-winded.

Earlier in this seminar, Eric Harvey described usability as one primary criterion for something to count as data. Mark Algee-Hewitt followed up and said that calling something data implies that we will do something with it.

I think that, unlike historians such as Laura Stokes, who then described data as their disciplinary bread-and-butter, literary scholars are hesitant to call things data, and doubly hesitant to integrate the things they do consider data into their work as corroborating evidence for their claims. We are conservative when it comes to data.

At the same time, Harvey and Algee-Hewitt’s interventions made me want to recast Risam’s question as, “What is the use of data in my work?”

In retrospect, I think this question was underlying the graduate reading group offering that Matt Warner and I co-hosted for the Mellon Sawyer Seminar in Winter quarter with the theme “The Data that Divides Us.” Our goal was to set a low barrier to entry for PhD students with a series of case studies to illustrate the scope of what is possible with computational tools in the Humanities. We wanted to address the disconnect between what humanists think computational tools can do for them, and what computational tools can actually do for humanistic research.

When I started working in the digital humanities (DH), I found that it took a considerable amount of trial and error to learn to ask research questions that needed computational tools to answer. I reached for examples, as many examples as possible, when my own process failed. Matt and I wanted our graduate reading group to illustrate both the affordances and limitations of using data to build humanistic arguments. We wanted a wide range of data sources and practical methods. That is why we assigned the work of two DH scholars local to Stanford—Quinn Dombrowski and Anne Ladyem McDivitt—and then invited them to join us to discuss their work, the data they use, how they manage that data, the challenges the data presents, and the methods they use to overcome those challenges.

Ultimately, I think that the prospect of working with data—but also the prospect of using computational tools to collect and curate data—forces scholars to not only “look at the objects [we] study,” to quote Risam, but to think in a sustained way about methods. Building a corpus for computational text analysis forces us to think explicitly about our own claims and to make plans for how we will substantiate them.

Literary Studies trained me to look for patterns. As an analog scholar, I build a corpus and limit it partially by choice, but also partially by force of necessity given limited resources. I discern patterns, observe and collect evidence by close reading, draw through-lines between both expected and unexpected textual elements, make arguments for commensurability, rinse, and repeat. On my best days, I go slow, meander, and relish the delicate, iterative process of letting the text itself guide me towards my claims. Once I arrive at those claims, I return to the text to collect evidence to substantiate what arose from the text itself. I foster “intimacy” with my corpus, and gain insight every time I return to it anew.[1]

Computational tools allow me to discover patterns and then elucidate those patterns in new ways. They allow me to scale up, and to argue for patterns with further reach. They allow me to foster intimacy with my corpora in new ways, to become unfamiliar with them just at the point where I thought I could learn no more. When I use computational tools to generate data, I become denatured to my own intuitions as a researcher.

Let me answer the two questions I posed as I conclude. What is the place of data in my work? What is the use of data in my work?

I use data in my work as evidence alongside the interpretation of printed text that I generate with close reading. In the past, that has meant exploring my corpus with named entity recognition (NER) and topic modeling, followed by context extraction: I split the corpus into sentences, select the sentences that contain the desired named entities and the documents most representative of the desired topic, and grab the three sentences before and after each match to super-power my close readings.
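
A minimal Python sketch of the context-extraction step may make it concrete. It rests on stated assumptions and is not the pipeline used in the research described here: the function names `split_sentences` and `extract_contexts` are hypothetical, a naive regular expression stands in for a proper sentence splitter, and a plain string match against a list of names stands in for real NER output.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_contexts(text, entities, window=3):
    """Return one passage per sentence that mentions any entity.

    Each passage holds the matching sentence plus up to `window`
    sentences before and after it. `entities` is a list of names
    assumed to come from an earlier NER step.
    """
    sentences = split_sentences(text)
    contexts = []
    for i, sentence in enumerate(sentences):
        # Plain substring match stands in for checking NER labels.
        if any(entity in sentence for entity in entities):
            start = max(0, i - window)
            end = min(len(sentences), i + window + 1)
            contexts.append(" ".join(sentences[start:end]))
    return contexts
```

Calling `extract_contexts(corpus_text, ["Vallières"])` would return each mention with its surrounding sentences, ready to be read closely. The `window` parameter defaults to the three sentences mentioned above and can be widened or narrowed as a reading requires.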

The promise of data is the opportunity to examine my methods, to do something new and vary my methods, to scale both my claims and the type and amount of evidence I present to substantiate my claims, and lastly, to iterate my argument. I rarely make my argument once with data as evidence alone, like Klemen does, for example. Typically, I set up my argument with a series of claims, provide evidence with data, return to my argument to complicate it, and then provide evidence with interpretation. The reality is that close reading is not going anywhere in literary studies. It is our bread-and-butter, the way we establish evidence for our arguments and legitimacy among our peers. Computational tools have not, and likely will not, change that, and neither will data, broadly construed.

Exactly a year ago, I co-presented with my Undergraduate Research Intern, Clare Chua, the findings of our Center for Spatial and Textual Analysis (CESTA)-funded collaboration that took place over a period of five months. That research was about the reception of Pierre Vallières’ White N-Words of America: The Precocious Autobiography of a Quebec “terrorist” over five and a half decades. We compiled 750 mentions of Vallières’ book in the Canadian French-language press from 1968 to 2023, and we used computational tools, including principal component analysis, word shift graphs, and topic modeling, to identify which language around the book changed, and which language stayed the same. I will conclude this talk by quoting the conclusion of the talk I delivered with Clare.

“What does this all mean? That I, as a literary scholar working alone, can tell a story. My collaborator Clare, majoring in Symbolic Systems with a concentration in Natural Language, can tell the same story using data, only different. The data-story is rich, fresh, and surprising.”


 

Notes

[1] I am inspired here by Jonathan Kramnick in his interview with John Yargo of the New Books Network, who describes literary criticism in the most everyday sense, under perfect conditions, being the mode by which professional readers gain “insight from an intimacy with […] its object of study.”

 

Colloquy

The Data that Divides Us: Methods and Frameworks for Data Across the Humanities

What is data in the humanities? What relationships do humanists have with data? What is the place of data in humanistic inquiry? These questions are pressing in our era of rapid technological transformation, one which is increasingly predicated on creating and consuming data at ever larger scales. With the rapidly growing power of data over various aspects of our lives, it has been said that "data is the new oil." And as data science increasingly moves into interdisciplinary spaces, humanists’ perspectives are essential.


Flagship humanistic journals in a variety of disciplines (History and Theory, Critical Inquiry, American Historical Review, and New Literary History) have recently published special issues on data, reflecting on data as a new structural condition and using humanities methods to illuminate the constructed nature of data. But for far longer the "digital humanities" (DH) has been the space where, most explicitly and intentionally, humanists have worked with data, as Miriam Posner wrote in 2015 in “Humanities Data: A Necessary Contradiction.” While the term DH is now commonly accepted, even as it refers to many kinds of work in many different fields, we are still at pains to define what exactly the “digital” is, and how one kind of digital work might be in conversation with another. Yet data might be the key. The stakes of defining the digital might not need to center the taxonomic or the programmatic, although as humanists and educators, we do care about those things. Rather, the stakes of the digital are frequently found in the way in which it invites us to confront our relationship to data, and, it turns out, humanists have many, deeply varied relationships to data.

Our relationships to data are fraught at all stages: capturing, collecting, or making data; “cleaning” or “munging” data; preserving, recording, archiving or storing data; analyzing, understanding, or interpreting data; using, manipulating, abusing, contesting, or resisting data--our practices, and our names for those practices, are rooted in commitments, both political and epistemic, that can be challenging to unpack. What does humanistic data look like? What should it look like? And what can we learn about data and humanities when we deliberately ask these questions across disciplines, institutions, and time periods--when a historian confronts the data practices of a literary critic, or a classicist looks at the data originally collected for scientists?

At the Center for Spatial and Textual Analysis (CESTA) at Stanford we began in 2020 a conversation about data and the humanities in the Workshop "Critical Data Practices" (funded by Stanford Humanities Center). In 2023-2024, thanks to a Mellon Foundation grant, we continued and expanded on that conversation to include outside invited speakers and to support a postdoctoral researcher and two graduate dissertation fellows with the Mellon Sawyer Seminar Series “The Data That Divides Us.” Hosted by CESTA, this year-long seminar asked participants to interrogate how historical assumptions about data continue to shape modern divisions and, paradoxically, might offer new avenues for bridging them. (See the full schedule here.) Taking a deliberately historical and transdisciplinary approach, the seminar as a whole explored the underlying assumptions in the collection, conceptualization, and application of data as these have developed in the last three centuries. What latent bias might historical data carry undetected into our present moment? How has this data shaped contemporary manifestations of historical divisions even as it has created new social, cultural, and political fissures? And how might data help us to redress or speak across the very divisions that it has engendered? These are the kind of questions best tackled in conversations across disciplines and expertise, and we have been fortunate to draw on a community of librarians, archivists, graduate students, faculty, and data activists in this work.

In this Colloquy we share various outcomes of our "The Data that Divides Us" conversation. We include video recordings of visitors’ presentations and written responses to these talks by other seminar participants. We also feature a piece written for the concluding symposium by Chloé Brault, one of the Seminar’s Dissertation Fellows and a PhD Candidate in Comparative Literature, in which she synthesizes the major themes and conversations of the year. And we include a post-seminar interview, led by Nichole Nomura (Seminar’s Postdoctoral Researcher, and now lecturer in the English Department at Stanford) and Matt Warner (Seminar’s Dissertation Fellow, and now lecturer in the English Department at Stanford) with the Mellon Sawyer Seminar’s PIs: Giovanna Ceserani (Classics), Mark Algee-Hewitt (English), Laura Stokes (History), and Grant Parker (Classics and African and African American Studies). The interview reflects on the lessons of the year, and answers the hardest question of all: is data singular or plural?

These reflections underscore the notion that data, in the humanities, is more than a tool. It is a site of inquiry, a cultural artifact, and often a point of tension. Through collective examination, we find that our relationships to data invite us not only to question what we know but also to explore how we know it, taking us to a space of humanistic inquiry where data both divides and connects us, drawing disparate practices and perspectives into critical conversation.
