Do the Flaws in the Perseus Word Study Tool Matter?

In a recent post I tried to categorize the problems of the Perseus Word Study Tool, as tested on a section of Vergil. More surprising to me than the overall rate of error (about one in three words was misidentified in some way) was the fact that many of the errors were not subject to correction by means of Perseus’ “voting” system; and that even when voting was in operation, it often did not correct the error. Sometimes the correct choice was not an available option; other times, unanimous correct votes were ignored, and unanimous incorrect votes were accepted. At Aen. 5.17, to add another example to those mentioned in the earlier post, the vocative magnanime was incorrectly called an adverb on the basis of six incorrect user votes.

The inadequacy of the Latin Word Study Tool (LWST) will not have been news to anyone who has used it. The question is whether the level of error is pedagogically significant. Is the LWST good enough for the purposes of a typical Latin student? In other words, should the average Latinist care? It is not good enough, and the level of error and the specific types of errors in this flagship classical DH project are pedagogically significant and worthy of attention, I believe, for several reasons.

1. Words that give students the most trouble–relative pronouns, demonstratives, quam, ut, modo, Q-words in general–are exactly those least likely to be handled well by the LWST. The earlier post has some examples from my small sample, but I’ll add here that for the quam in Aen. 5.30 (magis . . . quam), the LWST offered no fewer than seven possible quams to choose from (numbered quam 1-7), none of which has the definition that is correct in the context (“than”).

2. The LWST is of course helpless when it comes to unusual or idiomatic expressions, of which there is a good example in my sample at 5.6, where notum must be translated “the knowledge that.”

3. The tool naturally can analyze only what is there. It cannot tell when something is left out or assumed.

4. A major structural problem is represented by bad short definitions of the type (to choose again from examples offered by my sample) iubet = “imposed,” iam = “are you going so soon,” frustra = “in deception, in error,” or more subtly, the fact that the common meaning of tendere, “direct one’s course,” does not appear in the short definition for that word. This is important because, even though one can click on and read the full Lewis & Short dictionary definition, intermediate students are very unlikely to click through and sift through long entries in search of the correct definition.

5. Moreover, the LWST obscures the relationships between words, and grasping those relationships is key to learning to read Latin. This is why seemingly minor accidence mistakes are meaningful. Misled on a part of speech, or the gender of an adjective or the case of a noun, the student will likely not see the syntactical connection between words, and thus the tool reinforces the urge to produce the dreaded “word salad” translations.

6. More broadly, with its cryptic statistical data and jumbled pseudo-information, the LWST reinforces the impression that many students have: that Latin isn’t really supposed to make sense anyway, that it’s all some kind of fiendish crossword puzzle.

Gregory Crane, in an important article and apologia for Perseus, has said that the goal of the Perseus Project is to provide “machine-actionable knowledge”:

Reference materials, in particular, are structured to support automatic systems (e.g., the morphological analyzer learns Greek and Latin morphology from a machine actionable grammar) and to be decomposed into small chunks and then recombined to provide dynamic commentaries. If you retrieve a book in a language that you cannot read or on a topic that you cannot understand, the system can find translations where these already exist, machine translation and translation support systems, reference works, and general background information suited to the general background and immediate purposes of the reader. In knowledge bases, the boundaries between books begin to dissolve.

But clearly machines are spectacularly bad at understanding Latin at the moment. Crane thinks in terms of many decades, and is waiting for massive improvements in artificial intelligence, or teams of graduate students to encode correct grammatical analysis in texts. But such a prospect seems increasingly far off, and given the size of the Perseus Digital Library (10.5 million words at the moment), it seems unlikely that the millions of errors can be corrected any time soon, if ever. Indeed, would it be worth the huge investment of time and money? In the meantime, we need to create a collaborative tool for generating reasonably correct and reliable vocabulary lists for Latin (and Greek) authors that will be helpful for students and teachers around the world. Why we should do this, and what kind of tool I have in mind, will be the subjects of future posts.

–Chris Francese


Types of Error in the Perseus Latin Word Study Tool

The Perseus Latin Word Study Tool (LWST) is intended to provide dictionary definitions and grammatical analysis of all words in the Latin texts available in the Perseus Digital Library, currently 10.5 million words.

A check of the definitions and grammatical analysis of an arbitrarily chosen chunk of Vergil’s Aeneid (5.1-34, 223 words) found that the tool was incorrect in 79 instances, or 35.4% of the time (and correct 64.6% of the time). The most common type of error (21 instances, 26.6% of all errors, 9.4% of all words) was a mistake of accidence; for example, duri (5.5) was taken as genitive singular instead of nominative plural. In 17 cases (21.5% of errors, 7.6% of all words) words were assigned to the wrong lemma, as when quoque (“and whither”) was derived from quoque (“also, too”), or venti (“winds,” 5.20) was assigned to the verb venio, “come,” as if it were the perfect participle. This particular mistake occurred three times in this passage, and the correct lemma was not listed as a possible option. In 14 instances (17.7% of errors, 6.3% of all words) the dictionary definitions provided were wildly wrong. This was true of some very common words: iam was glossed as “are you going so soon,” nec as “and not yet,” ab as “all the way from.” Elissae (5.3) was glossed as “Hannibal.” In every case this type of error appeared to come from pulling, seemingly at random, a word or phrase from the dictionary of Lewis & Short on which the LWST is based. In 11 instances (13.9% of errors, 4.9% of all words), the relevant definition in the context at hand was not provided (though it could be found by clicking to and reading through the full Lewis & Short dictionary entry). For example, cerno was glossed as “separate, part, sift,” but not “perceive,” and infelicis (5.3) was glossed as “unfruitful, not fertile barren,” rather than “unfortunate.” More seriously, all relative pronouns were glossed as interrogatives (“who? which? what? what kind of a?”), and described simply as “pron.” The word “relative” did not appear on the page.

In 8 instances (10% of errors, 3.6% of all words) a word was assigned to the incorrect part of speech, as when medium (5.1) was called a noun rather than an adjective, or locutus (5.14) was assigned to the rare 4th decl. noun “a speaking” rather than to loquor. In 4 cases (5% of errors, 1.8% of all words), there was no definition available. And in all cases deponent verbs were incorrectly labeled passive (4 instances in this particular section, or 5% of errors, 1.8% of all words).
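For readers who want to check the arithmetic, the counts above can be tallied in a few lines of Python. This is only a sketch: the counts and the 223-word total come from the post, while the short category labels and the function name `summarize` are my own shorthand, not anything Perseus uses.

```python
# Error counts by category, as reported above for Aeneid 5.1-34.
# The category labels are shorthand invented for this sketch.
TOTAL_WORDS = 223

error_counts = {
    "accidence": 21,
    "wrong lemma": 17,
    "wildly wrong definition": 14,
    "relevant definition missing": 11,
    "wrong part of speech": 8,
    "no definition": 4,
    "deponent labeled passive": 4,
}


def summarize(counts, total_words):
    """Return (total_errors, rows); each row is
    (category, count, percent of all errors, percent of all words)."""
    total_errors = sum(counts.values())
    rows = [
        (name, n, 100 * n / total_errors, 100 * n / total_words)
        for name, n in counts.items()
    ]
    return total_errors, rows


total_errors, rows = summarize(error_counts, TOTAL_WORDS)
print(f"{total_errors} errors in {TOTAL_WORDS} words "
      f"({100 * total_errors / TOTAL_WORDS:.1f}% of words)")
for name, n, pct_errors, pct_words in rows:
    print(f"{name:30s} {n:3d}  {pct_errors:5.1f}% of errors  "
          f"{pct_words:4.1f}% of words")
```

The seven categories sum to the 79 errors reported, so every error is counted exactly once.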

Now, the makers of Perseus are perfectly aware of the flaws in the LWST, and attempt to use the power of social media to help remedy the situation. Subjoined to the analysis of every ambiguous word, after an explanation of the methodology used, one finds a plea to help by voting.

The possible parses for this word have been evaluated by an experimental system that attempts to determine which parse is correct in this context. The system is composed of a number of “evaluators”–each of which uses different criteria to score the possibilities–whose votes are weighted to determine the best answer. The percentages in the table above show each evaluator’s score for each form, which are then combined to determine each form’s overall score.
This selection used the following evaluators:
• User-voting evaluator: Scores parses based on the number of votes each one has received from users. Weighted more heavily as more users vote for a given word in a text.
• Prior-form frequency evaluator: Evaluates forms based on the preceding word in the text; finds the most likely parse among this word’s possible morphological features and the preceding word’s possible features based on the frequency of each possible pair.
• Word-frequency evaluator: Scores parses based on how often the dictionary word appears in the Perseus corpus. Only used when a given form could be from more than one possible word.
• Tagger evaluator: Evaluator based on pre-computed automatic morphological tagging
• Form frequency evaluator: Scores parses based on how often their morphological features (first-person, indicative, plural, and so on) occur among all the words in the Perseus corpus.
User votes are weighted more heavily than the other methods, which are all treated equally.
Don’t agree with the results? Cast your vote for the correct form by clicking on the [vote] link to the right of the form above!
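
To make the description above concrete, here is a minimal sketch of how weighted evaluator scores might be combined. It is not the Perseus implementation, which I have not seen: the function names, the weights, and every score except the nine user votes for rates (5.8) are hypothetical values chosen for illustration.

```python
def user_vote_scores(votes):
    """Turn raw vote counts into scores that sum to 1 (empty if no votes)."""
    total = sum(votes.values())
    if total == 0:
        return {}
    return {parse: n / total for parse, n in votes.items()}


def combine_scores(evaluator_scores, weights):
    """Weighted average of per-evaluator scores for each candidate parse.

    evaluator_scores: {evaluator name: {parse: score between 0 and 1}}
    weights:          {evaluator name: non-negative weight}
    """
    total_weight = sum(weights[name] for name in evaluator_scores)
    combined = {}
    for name, scores in evaluator_scores.items():
        for parse, score in scores.items():
            share = weights[name] * score / total_weight
            combined[parse] = combined.get(parse, 0.0) + share
    return combined


# Hypothetical scores for "rates" (5.8). Only the nine votes are from the post.
scores = {
    "user_votes": user_vote_scores({"nom pl": 9, "acc pl": 0}),
    "prior_form": {"nom pl": 0.3, "acc pl": 0.7},
    "tagger": {"nom pl": 0.0, "acc pl": 1.0},
    "form_frequency": {"nom pl": 0.4, "acc pl": 0.6},
}
# Assumed weights: user votes count three times as much as the others,
# which are all treated equally.
weights = {"user_votes": 3.0, "prior_form": 1.0,
           "tagger": 1.0, "form_frequency": 1.0}

combined = combine_scores(scores, weights)
best = max(combined, key=combined.get)
print(best, combined)
```

With these made-up numbers the user votes carry the day. How heavily the real system weights them is not stated, so the sketch cannot reproduce any particular result on the site.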

But here too, some problems arose in my sample. First of all, only a handful of doubtful words had any votes. Second, many of the error types identified above do not admit of voting. And third, those that did have votes did not always benefit from having them. Here is the entry on the word rates in ut pelagus tenuere rates (5.8), showing a preference for the (incorrect) accusative, despite nine user votes for the (correct) nominative.


On the word pater in Quidve, pater Neptune, paras? (5.14), ten incorrect user votes for the nominative win out over the (obviously correct) vocative.

More common, however, is the lack of any user votes at all, as in this very confusing jumble of information on the word hoc (5.18). Note that the correct lemmatization (> hic) has a nonsensical definition; that the morphological analysis states it can only be a pronoun (“pron.”) whereas here, as often, it is a demonstrative adjective; and finally that the LWST incorrectly concludes that the form derives from the lemma huc.

Another odd and thankfully rare genre of error occurs in the case of deinde (5.14), which is correctly analyzed, but put beside a fictional alternative, the present imperative of a verb *deindo.

I would like to know if the same level of error and types of errors occur when LWST is unleashed on a prose text. Perhaps there the idea of a “prior-form frequency evaluator” would make more sense.

It is not my intent to denigrate the huge achievements of Perseus in our field. It is certainly better to have the LWST than not to have it. My purpose here is just to investigate the nature and extent of its errors. If this sample is at all representative, something along the lines of 3.5 million errors exist in the current database. I would also like to ask, is it realistic to think that qualified people can be found to correct the mistakes of the LWST? What is the incentive for professional Latinists to do so?
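
As a rough check on that extrapolation, the sketch below scales the sample error rate up to the whole corpus and attaches a 95% Wilson score interval. The interval is my own addition, not something the figures above rely on, and it assumes the 223 words behave like a random sample, which a single continuous passage of Vergil certainly is not. Treat it as a picture of sampling noise only.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


SAMPLE_ERRORS = 79          # errors found in Aen. 5.1-34
SAMPLE_WORDS = 223          # words in that passage
CORPUS_WORDS = 10_500_000   # size of the Latin corpus as stated above

point_estimate = SAMPLE_ERRORS / SAMPLE_WORDS * CORPUS_WORDS
low_rate, high_rate = wilson_interval(SAMPLE_ERRORS, SAMPLE_WORDS)

print(f"point estimate: {point_estimate / 1e6:.2f} million errors")
print(f"95% interval:   {low_rate * CORPUS_WORDS / 1e6:.2f} to "
      f"{high_rate * CORPUS_WORDS / 1e6:.2f} million errors")
```

A sample this small leaves a wide band around any corpus-wide figure, which is one more reason to test the tool on other authors and on prose.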

I also have a proposal for a different kind of tool, which I will save for another post, since this one is already too long. Your thoughts?

–Chris Francese