GaLiLeO Corpora

The Code and Documents Behind the Prototype

The overall corpus can be divided into specialized document groupings based on document type and metadata tag categories. The example on the left, “InstrumentsAndMaterials,” is from our proof of concept in September 2018: it contains files that readers have identified as discussing the objects used as instruments or tools in the creation of knowledge.

  • Letters: all correspondence in the National Edition of Galileo’s works, including over 4,300 letters from the 20 volumes edited by Favaro in the 19th century (EdNaz). This group can also include prefatory letters from the library (NoFull).
  • Library: all full text documents known to be in the library. (82 at the time of the prototype demonstration.) In development.
  • All: the combined letters and library books. In development.
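As a rough illustration of this kind of subsetting, the sketch below filters a small list of document records by type and by metadata tag. The record fields (`doc_type`, `tags`), the sample records, and the function `subset_corpus` are assumptions made for illustration, not the project's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    doc_type: str  # hypothetical values: "letter" or "library"
    tags: set = field(default_factory=set)


def subset_corpus(docs, doc_type=None, tag=None):
    """Return the documents matching an optional type and an optional tag."""
    selected = []
    for doc in docs:
        if doc_type is not None and doc.doc_type != doc_type:
            continue
        if tag is not None and tag not in doc.tags:
            continue
        selected.append(doc)
    return selected


# Invented sample records, for illustration only.
corpus = [
    Document("letter_0001", "letter", {"InstrumentsAndMaterials"}),
    Document("letter_0002", "letter"),
    Document("book_0001", "library", {"InstrumentsAndMaterials"}),
]

letters = subset_corpus(corpus, doc_type="letter")
instruments = subset_corpus(corpus, tag="InstrumentsAndMaterials")
```

Calling `subset_corpus(corpus)` with no filters returns every record, which corresponds to the combined "All" grouping.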

During the prototype demonstration we used the “InstrumentsAndMaterials” subcorpus. It combined 543 letters written by Galileo, identified by specialist tagging, with 3,200 letters written to or about Galileo, retrieved by searching for the terms used in the tags.

The prototype obscures the work required to bring two different corpora into conversation with each other:

  • recognizing that authors of books and prefatory letters could be recipients or senders of correspondence
  • making clear that sometimes authors or senders are simply unknown

One of the chief challenges was accounting for orthographical variance between the critically edited texts and the diplomatically edited ones. Early modern typesetters exchanged v for u quite frequently. Human readers can still recognize the word, but matching characters in machine-readable text required replacing known sequences in which a v should be a u. A blanket regular expression would have changed too many instances of v that should remain v, so a catalog of specific substitutions was developed iteratively. It catches and corrects most v/u substitutions, but likely still needs refinement.
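A minimal sketch of a catalog-based substitution is shown below. It is not the project's code: the catalog entries, the name `normalize_v_u`, and the choice to match whole words are all assumptions for illustration. The regular expression here only finds word boundaries; the decision to change a v rests entirely on the catalog lookup.

```python
import re

# Hypothetical catalog entries; the project's actual catalog is not reproduced here.
V_TO_U_CATALOG = {
    "vn": "un",
    "vna": "una",
    "vno": "uno",
    "vso": "uso",
    "vltimo": "ultimo",
}

# Runs of letters (including accented ones), used only to split text into words.
WORD = re.compile(r"[^\W\d_]+")


def normalize_v_u(text, catalog=V_TO_U_CATALOG):
    """Replace cataloged v-forms with their u-forms, preserving capitalization."""

    def fix(match):
        word = match.group(0)
        replacement = catalog.get(word.lower())
        if replacement is None:
            return word  # not in the catalog: leave the word untouched
        if word.isupper():
            return replacement.upper()
        if word[0].isupper():
            return replacement.capitalize()
        return replacement

    return WORD.sub(fix, text)


sample = "Vn vso nuovo del vetro, vltimo e vero."
print(normalize_v_u(sample))  # prints "Un uso nuovo del vetro, ultimo e vero."
```

Words such as "vetro" and "vero" pass through unchanged because they are absent from the catalog, which is the behavior a blanket v-to-u rule would not give.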

All of the analysis occurs before the user arrives at the interface. The code is not elegant, but it captures quantitative information about each document, the words in the document, and the author. The analysis happens in two phases: first for each document individually (shown at right), and then in relation to the rest of the corpus. The results are stored as RDS objects to be loaded into memory when the user chooses a subcorpus for exploration.
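The two-phase idea can be sketched as follows. This is a Python analogue under stated assumptions, not the project's code: the statistics chosen (word counts, document frequency, share of corpus), the function names, and the sample texts are invented for illustration, and `pickle` stands in for the RDS serialization mentioned above.

```python
import pickle
from collections import Counter


def analyze_document(doc_id, author, text):
    """Phase 1: statistics computed from one document alone."""
    words = text.lower().split()
    return {
        "doc_id": doc_id,
        "author": author,
        "n_words": len(words),
        "word_counts": Counter(words),
    }


def relate_to_corpus(doc_stats):
    """Phase 2: statistics that depend on the rest of the corpus."""
    total_words = sum(d["n_words"] for d in doc_stats)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter()
    for d in doc_stats:
        doc_freq.update(d["word_counts"].keys())
    for d in doc_stats:
        d["share_of_corpus"] = d["n_words"] / total_words
    return {"documents": doc_stats, "doc_freq": doc_freq, "total_words": total_words}


# Invented sample texts, for illustration only.
raw = [
    ("doc1", "Author A", "il cannocchiale e il vetro"),
    ("doc2", "Author B", "il vetro di Venezia"),
]
stats = [analyze_document(*r) for r in raw]
corpus_stats = relate_to_corpus(stats)

# Serialize once, ahead of time; deserialize when a subcorpus is chosen.
blob = pickle.dumps(corpus_stats)
loaded = pickle.loads(blob)
```

Precomputing and serializing the results keeps the interface responsive, since choosing a subcorpus only requires loading a stored object rather than rerunning the analysis.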

A remaining feature to develop is the presence of words in titles and metadata. Currently the analysis focuses only on the body of documents.

There are 8 languages represented in the overall corpus. Currently the analysis prioritizes Italian.

The corpus includes over 625 authors, 307 of whom contributed only one item. The top contributors are:

  • Galileo Galilei (441)
  • Benedetto Castelli (251)
  • Fulgenzio Micanzio (153)
  • Federico Cesi (150)
  • Buonaventura Cavalieri (134)
  • Suor Maria Celeste Galilei (124)
  • Giovanfrancesco Sagredo (102)

Behind the scenes I wrote the code that allows users to select a subcorpus and then load prepared analytic data. I also developed a few high-level tools to understand the range of documents: viewing the list of files, outputting the range of dates covered, exploring the files from a given year, and other quantitative features. An example follows.

Author Representation in the Corpus

A table of the number of times each author appears in the subcorpus.
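A table like this, along with the date-range and by-year tools mentioned above, could be produced along the following lines. The record layout (file name, author, date), the sample records, and the function names are assumptions for illustration only and do not reflect the project's actual files or code.

```python
from collections import Counter
from datetime import date

# Invented records (file name, author, date), for illustration only.
records = [
    ("file_a.txt", "Author A", date(1610, 3, 13)),
    ("file_b.txt", "Author B", date(1610, 8, 1)),
    ("file_c.txt", "Author A", date(1633, 6, 22)),
]


def author_table(records):
    """Authors paired with their item counts, most frequent first."""
    return Counter(author for _, author, _ in records).most_common()


def date_range(records):
    """Earliest and latest dates covered by the records."""
    dates = [d for _, _, d in records]
    return min(dates), max(dates)


def files_from_year(records, year):
    """File names of the records dated in the given year."""
    return [name for name, _, d in records if d.year == year]


for author, count in author_table(records):
    print(f"{author} ({count})")
```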

