Methods:
We performed stylometric analysis of our selected poems using Python in a repl.it repository. The texts we analyzed fell into 3 categories: outside works (works by potential authors not included in The Passionate Pilgrim), confirmed works (works by the potential authors included in The Passionate Pilgrim), and unknown works (the poems from The Passionate Pilgrim whose authorship we tried to identify).
We created stylistic fingerprints for the works by applying 6 statistical measures: average word length, type-token ratio, hapax legomena ratio, average line length, average sentence length, and average sentence complexity. Through testing, we found that these measures fell into 3 categories with regard to efficacy:
1. Most useful: word length, sentence complexity, line length
2. Somewhat useful: type-token ratio, hapax legomena ratio
3. Not at all useful: sentence length
We thus eliminated sentence length from our metric, replacing it with line complexity, which fell into the “somewhat useful” category. Eventually, we eliminated every attribute that was not “most useful,” only using word length, line length, and sentence complexity.
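The word-level measures can be sketched in a few lines of Python. This is an illustration rather than the project's original repl.it code: the function names, the tokenizing rule, and the choice to divide the hapax count by the total number of words are all assumptions. Line length and sentence complexity are left out because their exact definitions are not given here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (assumed rule: letters,
    with internal apostrophes allowed, e.g. "lov'd")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def average_word_length(words):
    """Mean number of characters per word."""
    return sum(len(w) for w in words) / len(words)


def type_token_ratio(words):
    """Distinct words (types) divided by total words (tokens)."""
    return len(set(words)) / len(words)


def hapax_legomena_ratio(words):
    """Words that occur exactly once, divided by total words (assumed
    denominator; some studies divide by the number of types instead)."""
    counts = Counter(words)
    return sum(1 for c in counts.values() if c == 1) / len(words)
```

For the five-word sample "the cat and the dog", these functions give an average word length of 3.0, a type-token ratio of 0.8, and a hapax legomena ratio of 0.6.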
An example stylistic fingerprint appears below:
Venus and Adonis (Shakespeare):
Average Word Length: 4.51074090311267
Average Line Length: 0.22961420429636126
Average Sentence Complexity: 12.12587412587412
Once we identified the particular stylistic fingerprint of a work, we applied two different methods of comparison to establish authorship.
The first was a simple % error function, where we attributed authorship based on which author’s stylistic fingerprint had the least % error for the most categories. For example, if Shakespeare’s fingerprint was most similar to the above Venus and Adonis fingerprint for word length and sentence complexity, but Richard Barnfield had the closest line length, we would attribute the poem to Shakespeare.
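A minimal sketch of this first comparison follows. It assumes each fingerprint is a dictionary keyed by measure name, that % error is taken relative to the author's value, and that ties go to whichever author reached the top count first. None of these details are specified above, and the function names are hypothetical.

```python
from collections import Counter


def percent_error(observed, reference):
    """Absolute difference as a percentage of the reference value."""
    return abs(observed - reference) / abs(reference) * 100


def attribute_by_votes(poem, authors):
    """Return the author whose fingerprint is closest on the most measures.

    poem:    {measure: value} for the work being attributed
    authors: {author name: {measure: value}}
    """
    votes = Counter()
    for measure, value in poem.items():
        # The author with the smallest % error on this measure gets one vote.
        closest = min(
            authors,
            key=lambda name: percent_error(value, authors[name][measure]),
        )
        votes[closest] += 1
    # Assumed tie-break: most_common keeps first-seen order among equal counts.
    return votes.most_common(1)[0][0]
```

With three measures and two candidate authors, an author who is closest on two of the three measures wins the attribution, even if the other author is much closer on the third.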
The second method involved taking the geometric mean of the % error across the attributes, using a simplified mean function.
This second method produced the most accurate results in testing, so we adopted it.
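As a rough sketch of the geometric-mean comparison, the code below assumes the plain geometric mean (the nth root of the product of the n per-measure % errors) and attributes the poem to the author with the lowest mean. The exact "simplified" function we used is not reproduced here, and the function names and fingerprint layout are hypothetical.

```python
import math


def percent_error(observed, reference):
    """Absolute difference as a percentage of the reference value."""
    return abs(observed - reference) / abs(reference) * 100


def geometric_mean_error(poem, author):
    """Geometric mean of the % errors over every measure in the poem."""
    errors = [percent_error(poem[m], author[m]) for m in poem]
    # Note: a single exact match (0% error) drives the whole mean to zero.
    return math.prod(errors) ** (1 / len(errors))


def attribute_by_geometric_mean(poem, authors):
    """Return the author whose fingerprint has the lowest mean % error.

    poem:    {measure: value} for the work being attributed
    authors: {author name: {measure: value}}
    """
    return min(authors, key=lambda name: geometric_mean_error(poem, authors[name]))
```

Unlike the per-measure vote, this comparison takes the size of each error into account, so one very poor match on a single measure can outweigh two narrow wins.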
Findings:
We tested the accuracy of our fingerprint comparisons by attempting to attribute authorship to the poems in The Passionate Pilgrim that already had confirmed authors. We grouped the poems by the same author together in order to create a larger sample size. This is what we found in those tests:
work : predicted author
Passionate Pilgrim I, II, III, V, XVI (William Shakespeare): Shakespeare
Passionate Pilgrim VIII (Richard Barnfield): Barnfield
Passionate Pilgrim XI (Bartholomew Griffin): Griffin
Passionate Pilgrim XIX (Christopher Marlowe, Sir Walter Raleigh)*: Marlowe
*For simplicity’s sake, we removed the final stanza of this poem (attributed to Raleigh), so the file we analyzed contained only Marlowe’s work.
Total Test Accuracy: 4 of 4 confirmed groups attributed correctly.
With our new metrics, we were able to properly classify each of the confirmed bodies of work in The Passionate Pilgrim.
With increased confidence from the test accuracy, we then turned to the unattributed poems. Based on our analysis, this is what we found:
work : predicted author
The Passionate Pilgrim IV: William Shakespeare
The Passionate Pilgrim VI: William Shakespeare
The Passionate Pilgrim VII: Christopher Marlowe
The Passionate Pilgrim IX: William Shakespeare
The Passionate Pilgrim X: Christopher Marlowe
The Passionate Pilgrim XII: Bartholomew Griffin
The Passionate Pilgrim XIII: Christopher Marlowe
The Passionate Pilgrim XIV: Bartholomew Griffin
The Passionate Pilgrim XV: Christopher Marlowe
The Passionate Pilgrim XVII: Richard Barnfield
The Passionate Pilgrim XVIII: Richard Barnfield
And perhaps most interestingly, when we analyzed all of the unknown poems together, the algorithm produced Shakespeare as the file’s author.
Our program attributed 2 poems each to Richard Barnfield and Bartholomew Griffin and 4 each to Christopher Marlowe and William Shakespeare.
Finally, taking into consideration the previously-confirmed works, the distribution looks like this:
Perhaps the publisher W. Jaggard was not so full of it. He advertised a collection of poems by “W. Shakespeare,” and based on our analysis and previous scholarship, he published a collection where *most* of the poems were written by Shakespeare, and the overall character of the text classified it as most similar to the Bard’s works.
Challenges:
The primary issue that we faced in performing this analysis was the small volume of text we were analyzing. Computational analysis in the humanities is a brilliant tool that can handle even the largest corpus; what confounds it is too little data. This was our issue.
The poems we classified each had between 12 and 62 lines, which is not a particularly large amount of data from which to make predictions. Our confidence in the accuracy of our results, though bolstered by the accuracy of our tests on the known Passionate Pilgrim poems, which are not much longer (between 14 and 78 lines), remains somewhat low.
A study similar to ours, published in the journal Computers and the Humanities in 1984, makes these points as well. The chief issue with attributing authorship to a poem via stylistic analysis, said the article’s author, M.W.A. Smith, is the poem’s brevity. Smith investigated Shakespeare’s A Lover’s Complaint, whose authorship has also been debated. A Lover’s Complaint, Smith noted, has 2600 words contained in 329 lines. Even the longest of our poems, at 62 lines, was roughly five times shorter than the poem that Smith complained was too short. One main reason this causes issues, according to Smith, is that “the most common features of a text occur only a few times per thousand words.”
If that holds true for our texts, it follows that the stylistic fingerprints we created for the authors, made as they were from corpora of 1120 to 10,000 lines, are fairly accurate, but the fingerprints of individual poems with which we compared them were likely not full pictures of the authors who created them.
Sources:
Whodunnit: The Passionate Pilgrim