Early Modern Italian Corpora


Since 2015 I have been incrementally developing full-text collections (corpora) of early modern Italian books for computational text analysis. I use two sets of texts for research and teaching: existing diplomatic editions curated by others (EarlyModernItalian) and diplomatic editions of books known to have been in Galileo’s library (GsLibrary), curated by my team. Links to descriptive metadata are supplied in relevant publications and other materials.



  • 437 texts
  • 19.2 million words


  • 81 texts
  • 4.9 million words

Scholarly Context for Corpus Development

Since much of the U.S. scholarly work on computational literary analysis has been driven by studies of modern, Anglophone texts, I needed a collection of texts in my area and language of specialization: 14th- to 18th-century Italian. Such a collection offers an opportunity to test the portability of 21st-century quantitative methods. In terms of scale and scope, my preliminary point of comparison was the corpus Matthew Jockers used for Macroanalysis (Univ. of Illinois Press, 2013). In Enumerations (Univ. of Chicago Press, 2018), Andrew Piper explores subcorpora that range from 75,000 poems, to 150 novels, to 28,000 documents of fiction and nonfiction, to 65,000 characters in 7,500 novels. Both scholars were working with teams of graduate students at institutions with robust support for Digital Humanities research.

Since Bowdoin College is an undergraduate-only institution without a non-English language requirement, I have approached corpus development from two directions: using what is already available and carefully curating a new data set with the help of Italian majors and an Italophile outside contractor, MAI Services.

Creating a corpus engages directly with concerns raised by Leah Marcus in Unediting the Renaissance (Routledge, 1996). Marcus focused on the English Renaissance but brought to light the outsized role of 17th-century and later editors in determining the authoritative version of a text, often obscuring how contemporary readers would have encountered and contextualized it. Taken together with concerns about the ways in which archives determine what is preserved and which authors survive, this means that creating a corpus both relies on prior power structures that create access to primary materials and represents an opportunity to intervene.
