Do the Flaws in the Perseus Word Study Tool Matter?

Posted on July 10, 2012 by Chris Francese

In a recent post I tried to categorize the problems of the Perseus Word Study Tool, as tested on a section of Vergil. More surprising to me than the overall rate of error (about one in three words was misidentified in some way) was the fact that many of the errors were not subject to correction by means of Perseus’ “voting” system, and that even when voting was in operation, it often did not correct the error. Sometimes the correct choice was not an available option; other times, unanimous correct votes were ignored, and unanimous incorrect votes were accepted. At Aen. 5.17, to add another example to those mentioned in the earlier post, the vocative magnanime was incorrectly called an adverb on the basis of six incorrect user votes.

The inadequacy of the Latin Word Study Tool (LWST) will not have been news to anyone who has used it. The question is whether the level of error is pedagogically significant. Is the LWST good enough for the purposes of a typical Latin student? In other words, should the average Latinist care? It is not good enough, and the level of error and the specific types of errors in this flagship classical DH project are pedagogically significant and worthy of attention, I believe, for several reasons.

1. Words that give students the most trouble–relative pronouns, demonstratives, quam, ut, modo, Q-words in general–are exactly those least likely to be handled well by the LWST. The earlier post has some examples from my small sample, but I’ll add here that for the quam in Aen. 5.30 (magis . . . quam), the LWST offered no fewer than seven possible quams to choose from (numbered quam 1-7), none of which has the correct definition in the context (“than”).

2. The LWST is of course helpless when it comes to unusual or idiomatic expressions, of which there is a good example in my sample at 5.6, where notum must be translated “the knowledge that.”

3. The tool naturally can analyze only what is there. It cannot tell when something is left out or assumed.

4. A major structural problem is represented by bad short definitions of the type (to choose again from examples offered by my sample) iubet = “imposed,” iam = “are you going so soon,” frustra = “in deception, in error,” or more subtly, the fact that the common meaning of tendere, “direct one’s course,” does not appear in the short def. for that word. This is important because, even though one can click on and read the full Lewis & Short dictionary definition, intermediate students are very unlikely to click through and sift through long entries in search of the correct definition.

5. Moreover, the LWST obscures the relationships between words, and seeing those relationships is key to learning to read Latin. This is why seemingly minor accidence mistakes are meaningful. Misled on a part of speech, or the gender of an adjective or the case of a noun, the student will likely not see the syntactical connection between words, and thus the tool reinforces the urge to produce the dreaded “word salad” translations.

6. More broadly, with its cryptic statistical data and jumbled pseudo-information, the LWST reinforces the impression that many students have: that Latin isn’t really supposed to make sense anyway, that it’s all some kind of fiendish crossword puzzle.

Gregory Crane, in an important article and apologia for Perseus, has said that the goal of the Perseus Project is to provide “machine-actionable knowledge.”

Reference materials, in particular, are structured to support automatic systems (e.g., the morphological analyzer learns Greek and Latin morphology from a machine actionable grammar) and to be decomposed into small chunks and then recombined to provide dynamic commentaries. If you retrieve a book in a language that you cannot read or on a topic that you cannot understand, the system can find translations where these already exist, machine translation and translation support systems, reference works, and general background information suited to the general background and immediate purposes of the reader. In knowledge bases, the boundaries between books begin to dissolve.

But clearly machines are spectacularly bad at understanding Latin at the moment. Crane thinks in terms of many decades, and is waiting for massive improvements in artificial intelligence, or teams of graduate students to encode correct grammatical analysis in texts. But such a prospect seems increasingly far off, and given the size of the Perseus Digital Library (10.5 million words at the moment), it seems unlikely that the millions of errors can be corrected any time soon, if ever. Indeed, would it be worth the huge investment of time and money? In the meantime, we need to create a collaborative tool for generating reasonably correct and reliable vocabulary lists for Latin (and Greek) authors that will be helpful for students and teachers around the world. Why we should do this, and what kind of tool I have in mind, will be the subjects of future posts.

–Chris Francese

12 thoughts on “Do the Flaws in the Perseus Word Study Tool Matter?”

Laura Gibbs on July 10, 2012 at 4:01 pm said:

What a great post, Chris – this is something that I think every Latin teacher needs to ponder. I stopped using Perseus quite some time ago (about seven or eight years ago they were having lots of server problems and outages, so I just got out of the habit of using it at all) – so I don’t have any personal experience with the voting system. Based on exactly the kinds of concerns that you have expressed in your other post, I’m not sure such a tagging and voting system is the best way to use people’s collective energy. If people do want to read socially and collaboratively, which I think is a great goal, having a computerized analysis define the way we interact together is too limiting in my opinion. The literary texts at Perseus are so rich and complex (that is the nature of literary texts after all!) that simply tagging forms is inherently unsatisfying, just as for any reader the reading experience would not be very satisfying if all you did was to parse, no matter how accurately you might be doing that. Over the past couple of months I’ve been experimenting with a new form of commenting on texts that I found personally very satisfying as a reader and on which I got very positive feedback from others as well – I was calling them “Latin Without Latin” essays. The putative reason for this title was that they are written for people who have no Latin at all, but the purpose they really serve is simply to unmask my reading process, word by word, as a native speaker of English who is trying to appreciate Latin on its own terms. I wrote about 50 of these little essays – Latin Without Latin essays – before taking a pause to focus on the book. I’m very much looking forward to starting up the essays again since they are the best way I’ve found in several decades of sharing my thoughts and ideas about texts as I read.
Part of that DOES involve disambiguating ambiguous forms (a big task that every non-native Latin reader, which is to say all Latin readers nowadays, must face), but it involves much more than that; disambiguating the forms is only the first step in terms of getting at meaning. Moreover, the process of disambiguating forms and apprehending the meaning are like two sides of a piece of paper; they are separate but cannot be separated if you see what I mean.

Also, I have to say that your Core Vocabulary List has really made me feel so much more optimistic and confident about glossing texts for vocabulary without providing additional commentary. I love the way that with my current project, the core vocabulary is continually reinforced by the LACK of glosses – you’ll see what I mean when I send you a copy of the book (just a few weeks to go!). I used to hate glossing every word of a poem (overkill, counterproductive), but I really like the fact that I can now focus my attention on glossing exactly those words that intermediate readers are less likely to know, giving them the correct dictionary headword (that’s really the only crucial information) so that they can confidently look the word up if they want to learn more. There is nothing more frustrating than reading a text, finding a difficult word, and not even being confident what part of speech the word is, or what dictionary headword to look up! Of course, students will have to figure out what to do on their own about cum and quam and hic and such, because I am not glossing those; that is their responsibility as Latin students, and I hope their teachers will be sharing with them good strategies for coping with those most common ambiguities on their own. But if they get to a form like “culpas,” it’s definitely a big help to have me gloss that with the noun culpa (or with the verb culpo) to help get them going in the right direction, and so too with forms like parce (adverb or verb?), vivis (they surely know the verb vivo, but probably not the adjectival vivus), etc. etc. etc.

How do we best do this collectively and collaboratively? I am not sure – but I suspect it will be an enterprise far more human-driven than machine-driven. I’m a hyper-electronic person myself, but the way I use the electronic world for reading Latin is primarily in terms of searching and access, not automated text analysis. So I love the fact that I can get all kinds of Latin grammars and dictionaries online instantly and that I can search Google Books for passages (although the OCR for early printed books is a nightmare…). What would I like that I don’t have? A dedicated forum for people reading Latin texts together – and the more populated the forum, the better. Various people have tried to do that in various ways (TextKit, for example) – but nothing has really worked; nothing has been full of enough people! Creating an electronic space full of people is something that CAN be done – just look at Google+ for example. To my mind, that’s what we need – ways to let us read Latin together, asynchronously, at a distance, but together, rather than reading with a machine.
latin-poetry-podcast on July 10, 2012 at 6:35 pm said:

Beautifully put, Laura, and wise. “How do we best do this collectively and collaboratively? I am not sure – but I suspect it will be an enterprise far more human-driven than machine-driven.” This is a very important thought for all digital humanists to keep in mind. Thank you as always for your work, time and ideas.
Gregory Crane on July 11, 2012 at 6:28 pm said:

This is incredibly useful — we posted a link to this on the Perseus site on July 11 — http://www.perseus.tufts.edu/. We can’t wait to see more discussion on this.

****

We would like to call attention to a thoughtful blog post on the Latin Word Study Tool in Perseus and why it is important to improve it. The post names a number of key challenges and, more importantly, argues forcefully that answering those challenges in the morphological analysis is of pedagogical value.

The big challenge is that machines are not perfect and won’t be, even if they grow cleverer (as indeed they can) in their ability to morphologically analyze Latin. The best solution is probably to aggregate effort from many people. The big question is whether the problem is important enough and people care enough to build a system to manage corrections from many named sources.

You can see work in this direction in the March 20 posting below. There we show work adapting the distributed editing environment developed for the papyrologists to the needs of Classicists working with literary texts as well. Editing morphological data is similar in principle, though a bit different in practice. While we do not have a simple morphology system, users could use the Greek and Latin Treebank editing environment and only edit the morphology. That might be a useful starting point. The real question is whether the morphology is worth the attention by itself when you could also annotate the syntax – but there are very qualified people (such as Helma Dik) who believe just that.

Our question to the community is this: How many contributors would there be? In particular, we would love for classes to ‘adopt a passage,’ correcting all the morphology and basic issues as a contribution to knowledge. Of course, all the data that we produce is released under a Creative Commons License.
Patrick on July 15, 2012 at 9:46 pm said:

Great discussion…

Having just taken (and passed) both my Latin and Greek PhD translation exams in the past year, I would say that the word study tools in both languages were critical parts of my success. This is not a defense of the weaknesses, just that I haven’t found anything comparable for handling the volume of dictionary work required for these types of exams – and this is the key – with such efficiency. Like all tools, the Latin and Greek WSTs require a bit of care in learning what they do well and what they don’t, and when you are looking up, without exaggeration, hundreds of words a day, you naturally become a better judge of this.

So, for example, CF, I agree with your first two points, but I learned to read past this sort of thing early and concentrate my Perseus time in other areas. If I’m stuck on an idiomatic expression or an unusual “ut”, I’ll turn to other dictionaries, grammars, commentaries, etc. instead. (Do my students make this effort? A good question and one with an answer that would probably be disappointing.) The overall benefit from the WSTs in speed and efficiency greatly outweighed the frustrations caused by the errors, morphological ambiguities, scoring confusion, etc. There is clearly room for improvement and I think your call for a new collaborative tool is a great idea, but I don’t foresee giving up the utility of the WSTs as part of my daily workflow any time soon.

Lastly, GC, I really like this idea of adopting a passage and think that a number of advanced undergraduate classes and even grad classes could produce better work and certainly more generally useful work through this sort of project than through hastily written end-of-the-semester papers.
Gabriele Alfinito on July 17, 2012 at 12:47 pm said:

It’s time to offer good tools for finding the exact meaning of words, such as Forcellini’s lexicon http://archive.org/details/totiuslatinitati01forc and http://archive.org/details/totiuslatinitat00forcgoog, which could be digitized for online use. The errors are to be expected, because the texts to be translated are by authors who use these terms in full command of the language, a proficiency that students generally do not have … Excuse my English!
Pomeline on August 20, 2012 at 6:01 pm said:

I’d be glad to approach my professor to see if this upcoming semester our class can attempt taking on a passage. It might have to wait until next semester as I believe our syllabus has been set for the fall.
Mark on September 8, 2012 at 4:15 pm said:

I have no experience with the Latin Word Study Tool, but I have some thoughts on the Greek Word Study Tool, some of which will apply to the Latin one as well.

First of all, the presentation of the forms found could be clearer. Just as an arbitrary example, I asked for help on ἧττον, which means “less.” Perseus gets all of the hard stuff right. It correctly determines that the citation form (i.e., the form it is listed under in the LSJ lexicon) is ἥσσων. (The Attic form would be ἥττων, but LSJ lists the form with double sigma; even relative beginners in Greek will know about this alternation.)

There are five possible forms ἧττον could be, but in a sense only two of them are distinct. Greek has five cases and three genders, but many adjectives, including ἥσσων, have the same masculine and feminine forms, and, with no exceptions whatsoever, no distinction is made in the neuter between nominative, accusative, and vocative, the so-called “direct” cases. ἧττον is singular, and it is either a vocative (in any of the three genders) or a neuter direct case. Perseus obscures this by giving all five forms separately. It also tries to guess what the correct form is. In the example I tried, it is 94.2% confident that ἧττον is a neuter nominative. In actuality, ἧττον is a neuter singular functioning adverbially, which is a very common function for neuter singular comparatives. It doesn’t really make sense to ask whether it is nominative or accusative. Now, I doubt anyone would think that ἧττον is a vocative in this passage, but beginners might not realize it is adverbial and be confused about what it is agreeing with. So the most useful information from Perseus might be to not only present the two options (vocative or neuter direct) but also to note that if it is neuter direct it might be adverbial. The main point, though, is that it should give two forms, not five, and it shouldn’t try to guess which form is correct since it can’t do that reliably, and unreliable guesses are not helpful to anyone. I also agree that the statistical pseudo-information should be removed.

For verbs, Perseus often gives a huge number of possible analyses, most of which are from other dialects than that in which the text is written. These forms should be suppressed, or at least severely deemphasized.

It would be convenient to be able to click on a word and see both the morphological analysis and the LSJ entry (perhaps with the Middle Liddell as an alternative for beginners). Currently, this requires an extra click. It is also somewhat slow. Slowness used to be an enormous problem with Perseus, and it has gotten much better, but it would be tremendously useful to beginners in Greek to have instant word lookup. Ideally, in fact, morphological analysis and lexicon would not open in a new tab but would either appear in another part of the same web page or perhaps appear in a floating bubble when one hovered over a word.

I disagree that full markup (as could be achieved with adopt-a-passage) is a desirable goal. That seems to me to encourage the “decoding” approach to Greek, which will never lead to fluent reading. Much better is the traditional approach of notes on grammatical difficulties. In my opinion, the best way for intermediate Greek students to read texts is with a facing translation and judicious notes. The only advantage of an online system like Perseus is that checking morphology and looking up words in the dictionary can be sped up.
liz fairhead on October 23, 2012 at 8:13 pm said:

I am showing my AS and A2 level students both the short and the full Lewis and Short Latin tool. They would struggle to translate using the word study tool alone but it is extremely useful as an aid to understanding along with a grammar and a crib. It does have shortcomings but nevertheless has much to offer, especially the full Lewis and Short which gives students a context and much fuller appreciation of words and is usable in a classroom in a way that the book is not. Much easier than the page turning I did 30 years ago at university.
James Rinkevich on November 20, 2012 at 3:47 pm said:

Have you looked at what Whitaker’s Words does? It finds all the possible forms and lists the forms that match and then finds all the dictionary forms and provides a gloss for each. For Perseus the only additional thing needed would be links from the root dictionary form word to the dictionaries (I believe he used many of the same that Perseus will link to for his glosses).
Katie on January 28, 2013 at 9:33 pm said:

How do you feel about experienced students of Latin (advanced/graduate level) using the tool for quick word lookup? For example, if I’m reading the word “pacto” (De Rerum Natura, 2.549), I know it can be a noun, adjective, or perfect passive participle. If it’s the perfect passive participle, my memory is not good enough to remember that it comes from “paciscor,” nor what the translation of that word is (I then either click on the L&S translation or find a physical L&S to tell me what’s up). However, this anecdote in itself might be a point against the parsing tool: I seem to have gotten lazy with my verbs.

On the whole, I’m generally conflicted about using parsing tools, because I do feel like it’s a sign of laziness – however, I catch things that I otherwise wouldn’t just using a dictionary (for example, if a word looks genitive but is, in reality, ablative). I feel like many older scholars expect their students to use pre-internet/computer tools, just because that’s what they were taught with – even though there are faster, easier ways to do the exact same thing. Again, this might just be complete and total laziness on my part.
Rusty on September 21, 2013 at 5:37 pm said:

That is a wonderful post, thank you, Chris. I have studied Latin for years independently and had always assumed that there must be some secret code for using those Perseus tables properly, a code to which only those at university had privileged access.