The goal of the analytical tools for the document texts was to provide a visual or numerical point of entry for asking questions about terminology, texts, and authors. While not as polished as the suite of widgets in Voyant Tools, all twelve of these tools worked during the demonstration at Harvard in 2018, even though participants were unfamiliar with Jupyter notebooks.
- I omit here a discussion of the Tag Exploration (developed by Hannah Marcus and Morgan MacLeod), which was part of a secondary Jupyter notebook written in Python.
- Contextualizing tools for publication or sending dates, document lengths, word usage, and the vocabulary density of authors
- Pattern-Based Inquiry
Please explore the static version of the GaLiLeO prototype to see the output and some of the underlying code for these different tools.
Document Analysis & Contextualization
The function “choose_doc” is the primary contextualizer in the suite of tools. It retrieves key numeric and lexical information, presents it as text, and saves images of graphs where appropriate.
Highlights the year of publication, or the year in which a letter was sent, in blue. The height of each bar indicates the number of documents in the corpus that were published or sent in that year of the data set.
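The logic behind this graph can be sketched as follows. This is an illustrative reconstruction, not the project's code: the function name, the metadata structure, and the color choices are assumptions. In the notebook, the returned values would feed a matplotlib bar chart.

```python
# Hypothetical sketch of the year-contextualization step: count documents
# per year and single out the chosen document's year in blue.
from collections import Counter

def year_histogram(metadata, chosen_doc):
    """metadata maps doc id -> year published/sent.
    Returns (years, bar heights, bar colors) for a bar chart."""
    counts = Counter(metadata.values())
    years = sorted(counts)
    heights = [counts[y] for y in years]
    colors = ["blue" if y == metadata[chosen_doc] else "grey" for y in years]
    return years, heights, colors

# Toy metadata: document id -> year
metadata = {"let_001": 1610, "let_002": 1610, "let_003": 1616, "let_004": 1632}
years, heights, colors = year_histogram(metadata, "let_003")
# These would then be rendered with plt.bar(years, heights, color=colors)
```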
There is a similar function to contextualize the length of a document in comparison to the lengths of other documents in the subcorpus.
Type-Token Ratio (TTR) Contextualization
Compares the lexical complexity of a document (in blue), with or without punctuation, to works by Galileo and works by other authors.
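A type-token ratio divides the number of distinct word forms (types) by the total word count (tokens). A minimal sketch of the computation, with the punctuation toggle the description mentions, might look like this (the actual implementation in “choose_doc” may differ):

```python
# Type-token ratio: distinct words (types) / total words (tokens).
# strip_punctuation controls whether punctuation marks attached to
# words are removed before counting.
import string

def type_token_ratio(text, strip_punctuation=True):
    tokens = text.lower().split()
    if strip_punctuation:
        tokens = [t.strip(string.punctuation) for t in tokens]
        tokens = [t for t in tokens if t]
    return len(set(tokens)) / len(tokens)

# "eppur si muove" repeated: 3 types over 6 tokens -> 0.5
ttr = type_token_ratio("Eppur si muove, eppur si muove.")
```

Higher ratios indicate a more varied vocabulary, though the measure is sensitive to document length, which is why the tool contextualizes it against other documents rather than reporting it in isolation.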
“choose_doc” also allows a user to see unique vocabulary and surprising omissions of words that are otherwise popular in the corpus.
Reports on the similarity of rates of use of the 100 most frequent words in the corpus. Outputs an analysis using both Euclidean distance and cosine similarity (since document lengths can vary considerably). The graph shows the 10 most similar and the 10 least similar documents. Aside from OCR errors that could easily skew this result at present, we also speculate that we are seeing the effects of different epistolary best practices in the period.
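The two measures can be sketched on word-frequency vectors as below. This is an illustrative implementation under my own naming, not the project's code; it shows why cosine similarity is the safer measure when lengths vary, since it ignores the magnitude of the vectors.

```python
# Comparing documents by their rates of use of the most frequent words.
import math

def euclidean(a, b):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle-based similarity: 1.0 means identical proportions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Relative frequencies of (say) the top three corpus words in two documents.
doc_a = [0.02, 0.01, 0.005]
doc_b = [0.04, 0.02, 0.01]  # same proportions, e.g. a longer document
# Euclidean distance separates these; cosine similarity treats them as alike.
```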
Reports word co-occurrence at a corpus level.
The model is created in advance. Information about whether or not a term is a top word in a topic is preserved.
e.g., an author who uses “foro” is likely also to use “spechio”, “christallo”, or “accqua” (in the top image)
The remaining code at the document level allows a user to explore the unique or omitted words more closely (function: “TopWordsNotInText”). There is also an option to read the plain-text file (“read_doc”).
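A check in the spirit of “TopWordsNotInText” can be sketched as follows; the function body and signature here are assumptions, intended only to show the idea of surfacing surprising omissions:

```python
# Which of the corpus's most frequent words never appear in this document?
from collections import Counter

def top_words_not_in_text(corpus_tokens, doc_tokens, n=100):
    """Return the top-n corpus words that are absent from the document."""
    top = [w for w, _ in Counter(corpus_tokens).most_common(n)]
    present = set(doc_tokens)
    return [w for w in top if w not in present]
```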
The second half of the notebook is dedicated to learning more about specific words of interest.
To assist with usability, the function “word_info” outputs features of word types based on other functions.
Chronological Contextualization by Author
Since this project takes Galileo as its example, the visual output (building on the textual output above) separates rates of use by author over time, in order to make patterns of similarity, innovation, and archaism easier to see.
Keyword in Context
The function “get_KWIC” is modified from Matthew Jockers’ code in Text Analysis with R for Students of Literature (Springer, 2014). Changes include highlighting whether the author is Galileo (GG) or someone else (NotGG) and emphasizing the year of the occurrence.
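The core of a keyword-in-context display can be sketched as below. This is a minimal illustration, not the project's adaptation of Jockers' code: the GG/NotGG labeling and the year emphasis described above are omitted, and the names are my own.

```python
# Minimal keyword-in-context: for each hit, show a window of words on
# either side of the keyword, with the keyword itself bracketed.
def kwic(tokens, keyword, window=3):
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{keyword}] {right}")
    return hits

lines = kwic("il telescopio mostra le stelle".split(), "telescopio", window=1)
```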
Analysis of the Context
Underlying function name: see_KWIC
Visualizes terms that co-occur with the keyword over time.
e.g., authors who use “vetri” only begin to use “telescopio” frequently in the same context after 1623, and even then not strongly until 1636. In the early years they are concerned with “due”, “quarte”, and “buoni”.
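The tallying behind such a visualization can be sketched as a per-year count of the words surrounding each keyword hit. This is an assumed structure for illustration, not the implementation of “see_KWIC”:

```python
# Count the words that co-occur with a keyword, grouped by year.
from collections import Counter, defaultdict

def context_by_year(docs, keyword, window=3):
    """docs: iterable of (year, tokens). Returns {year: Counter of
    words appearing within `window` tokens of the keyword}."""
    out = defaultdict(Counter)
    for year, tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == keyword:
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                out[year].update(context)
    return out
```

Plotting these counters over time would show, for each year, which terms cluster around the keyword.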
All of the output can be saved. Users can modify stop words for some of the later analyses.
The final tools allow for comparative analysis: one allows direct comparison of authors; the other structures queries by relative proportions.
The function “compare_author_vocabulary” allows a user to see how the lexical richness of one author compares to the rest of the corpus. This is very much a proof of concept and needs to be refined to provide more detail.
This final tool is perhaps the most exciting. What are the words that Galileo uses with high frequency that the other authors in the corpus do not use at all? The function “find_types_by_range” provides that output. Again, this is a proof of concept that needs more customization. For now, “high” is considered to be the 75th percentile or higher in frequency of usage; “low” is the 25th percentile.
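A query of this shape can be sketched as follows. The function name mirrors the description, but the body, the crude percentile computation, and the parameters are my own assumptions:

```python
# Words an author uses at or above the 75th percentile of their own word
# frequencies, but which other authors in the corpus never use at all.
from collections import Counter

def find_types_by_range(author_tokens, others_tokens, high_pct=0.75):
    freqs = Counter(author_tokens)
    counts = sorted(freqs.values())
    cutoff = counts[int(high_pct * (len(counts) - 1))]  # crude percentile
    others = set(others_tokens)
    return sorted(w for w, c in freqs.items() if c >= cutoff and w not in others)
```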
Return to the GaLiLeO landing page.
Explore the static version of the GaLiLeO Jupyter notebook.