Understanding Genre in a Collection of a Million Volumes

Pages in Genre as a Percentage of the Whole Collection

Project Director: Ted Underwood, Professor of English

Grant Program: Digital Humanities Start-Up Grants

Years: 2013-2015

"Digital libraries contain millions of volumes, and represent a huge opportunity for literary history. Only a small fraction of the fiction and poetry preserved in these libraries has been studied by scholars, and we don't know that it's a representative sample. In truth, we haven't even known how much unread fiction and poetry was out there, because volumes published before 1960 are not necessarily categorized by genre. (In the nineteenth century, less than half of the novels in digital libraries are tagged as 'fiction.') So our first goal in this NEH-funded project was simply to find the fiction, poetry, and drama in a collection of a million English-language volumes, from 1700 to 1950. 'Finding' a genre meant identifying not only volumes but specific pages, so we could separate poetry from a prose introduction (or from indexes and advertisements at the back of the book). We solved this problem by training a computer to distinguish literary genres from nonfiction and paratext. So far we have covered 854,000 volumes (up to 1922); this fall we will complete the project, using the resources of HathiTrust Research Center to study works still covered by copyright."

Ted Underwood

The research team for this project included:

Boris Capitanu, a Senior Research Programmer at the NCSA who worked with HTRC to design their feature-extraction workflow, and developed web tools for this project as well as a page-level tagger we used to create training data. 

Michael L. Black, who worked on the project first as a graduate research assistant, and then as Associate Director of I-CHASS. He designed the original prototype of the tagger used to create page-level training data, the original version of a Python script we used to extract tabular data from MARC records, and significant parts of other workflows.

Shawn Ballard worked on the project as a graduate research assistant, supervising the creation of training data.

Jonathan Cheng, Nicole Moore, Clara Mount, and Lea Potter worked on the project as undergraduate research assistants, playing a vital role especially in the collection of training data.


Selected Publications Produced Using this Grant:



For more information about Professor Underwood's work in digital humanities, visit his website: http://tedunderwood.com/.
