Vergil, XML-TEI Markup, linked annotation data, and the DCC

Organizing our received humanities materials as if they were simply informational depositories, computer markup as currently imagined handicaps or even baffles altogether our moves to engage with the well-known dynamic functions of  textual works. ((Jerome McGann, A New Republics of Letters: Memory and Scholarship in the Age of Digital Reproduction (Cambridge, MA: Harvard University Press, 2014), pp. 107-108.))

This sentence from Jerome McGann’s challenging new book on digitization and the humanities (part of a chapter on TEI called “Marking Texts in Many Directions,” originally published in 2004) rang out to me with a clarity derived from its relevance to my own present struggles and projects. The question for a small project like ours, “to mark up or not to mark up in XML-TEI?” is an acutely anxious one. The labor involved is considerable, the delays and headaches numerous. The payoff is theoretical and long term. But the risk of not doing so is oblivion. Texts stuck in html will eventually be marginalized, forgotten, unused, like 1989’s websites. TEI promises pain now, but with a chance of the closest thing the digital world has to immortality. It holds out the possibility of re-use, of a dynamic and interactive future.

McGann points out that print always has a twin function as both archive and simulation of natural language. Markup decisively subordinates the simulation function of text in favor of ever better archiving. This may be why XML-TEI has such a profound appeal to many classicists, and why it makes others, who value the performative, aesthetic aspects of language more than the archiving of it, so uneasy.

McGann expresses some hope for future interfaces that work “topologically” rather than linearly to integrate these functions, but that’s way in the future. What we have right now is an enormous capacity to enhance the simulation capacity of print via audio and other media. But if we (I mean DCC) spend time on that aspect of web design, it takes time away from the “store your nuts for winter” activities of TEI tagging.

Virgil. Opera (Works). Strasburg: Johann Grüninger, 1502. Sir George Grey Special Collections, Auckland City Libraries

Virgil. Opera (Works). Strasburg: Johann Grüninger, 1502. Sir George Grey Special Collections, Auckland City Libraries

These issues are in the forefront of my mind as I am in the thick of preparations for a new multimedia edition of the Aeneid that will, when complete, be unlike anything else available in print or on the web, I believe. It will have not just notes, but a wealth of annotated images (famous art works and engravings from historical illustrated Aeneid editions), audio recordings, video scansion tutorials, recordings of Renaissance music on texts from the Aeneid, a full Vergilian lexicon based on that of Henry Frieze, comprehensive linking to a newly digitized version of Allen & Greenough’s Latin Grammar, complete running vocabulary lists for the whole poem, and other enhancements.

Embarking on this I will have the help of many talented people:

  • Lucy McInerney (Dickinson ’15) and Tyler Denton (University of Kentucky MA ’14) and Nick Stender (Dickinson ’15) will help this summer to gather grammatical and explanatory notes on the Latin from various existing school editions in the public domain, which I will then edit.
  • Derek Frymark (Dickinson ’12) is working on the Vergilian dictionary database, digitizing and editing Frieze-Dennison. This will be combined with data generously provided by LASLA to produce the running vocabulary lists.
  • Meagan Ayer (PhD University of Buffalo ’13) is putting the finishing touches on the html of Allen & Greenough. This was also worked on by Kaylin Bednarz (Dickinson ’15).
  • Melinda Schlitt, Prof. of Art and Art History at Dickinson, will work on essays on the artistic tradition inspired by the Aeneid in fall of 2014, assisted by Gillian Pinkham (Dickinson ’14).
  • Ryan Burke, our heroic Drupal developer, is creating the interface that will allow for attractive viewing of images along with their metadata and annotations, a new interface for Allen & Greenough, and many other things.
  • Blake Wilson, Prof. of Music at Dickinson, and director of the Collegium Musicum, will be recording choral music based on texts from the Aeneid.

And I expect to have other collaborators down the road as well, faculty, students, and teachers (let me know if you want to get involved!). My own role at the moment is an organizational one, figuring out which of these many tasks is most important and how to implement them, picking illustrations, trying to get rights, and figuring out what kind of metadata we need. I’ll make the audio recordings and scansion tutorials, and no doubt write  a lot of image annotations as we go, and do tons of editing. The plan is to have the AP selections substantially complete by the end of the summer, with Prof. Schlitt’s art historical material and the music to follow in early 2015. My ambition is to cover the entire Aeneid in the coming years.

Faced with this wealth of possibilities for creative simulation, for the sensual enhancement of the Aeneid, I have essentially abandoned, for now at least, the attempt to archive our annotations via TEI. I went through some stages. Denial: this TEI stuff doesn’t apply to us at all; it’s for large database projects at big universities with massive funding. Grief: it’s true, lack of TEI dooms the long-term future of our data; we’re in a pathetic silo; it’s all going to be lost. Hope: maybe we can actually do it; if we just get the right set of minimal tags. Resignation: we just can’t do everything, and we have to do what will make the best use of our resources and make the biggest impact now and in the near future.

One of the things that helped me make this decision was a conversation via email with Bret Mulligan and Sam Huskey. Bret is an editor at the DCC, author of the superb DCC edition of Nepos’ Life of Hannical, and my closest confidant on matters of strategy. Sam is the Information Architect at the APA, and a leader in the developing Digital Latin Library Projects which, if funded, will create the infrastructure for digitized peer reviewed critical texts and commentaries for the whole history of Latin.

When queried about plans to mark up annotation content, Sam acknowledged that developing the syntax for this was a key first step in creating this new archive. He plans at this point to use not TEI but RDF Triples, the lnked data scheme that has worked so well for Pleiades and Pelagios. RDF triples basically allow you to say for anything on the web, x = y. You can connect any chunk of text with any relevant annotation, in the way that Pleiades and Pelagios can automatically connect any ancient place with any tagged photo of that place on flickr, or any tagged reference to it in DCC or other text database. I can see how, for the long term development of infrastructure, RDF triples would be the way to go, in so far as it would create the potential for a linked data approach to annotation (including apparatus).

The fact that the vocabulary for doing that is not ready yet makes my decision about what to do with the Aeneid. Greg Crane too, and the Perseus/OGL team at Tufts and the University of Leipzig, are working on a new infrastructure for connecting ancient texts to annotation content, and Prof. Crane has been very generous with his time in advising me about DCC. He seemed to be a little frustrated that the system for reliably encoding and sharing annotations is not there yet, and eager to help us just get on with the business of creating new freely available annotation content in the meantime, and that’s what we’re doing. Our small project is not in a position to get involved in the building of the infrastructure. We’ll just have to work on complying when and if an accepted schema appears.

For those who are in a position to develop this infrastructure, here with my two sesterces. Perhaps the goal is someday to have something like Pleiades for texts, with something like Pelagios for linking annotation content. You could have a designated chunk of text displayed, then off to the bottom right somewhere there could be a list of types and sources of annotation content. “15 annotations to this section in DCC,” “25 annotations to this section in Perseus,” “3 place names that appear in Pleiades,” “55 variant readings in DLL apparatus bank,” “5 linked translations available via Alpheios,” etc., and the user could click and see that content as desired.

It seems to me that the only way to wrangle all this content is to deal in chunks of text, paragraphs, line ranges, not individual words or lemmata. We’re getting ready to chunk the Aeneid, and I think I’m going to use Perseus’ existing chunks. Standard chunkings would serve much the same purpose as numeration in the early printed editions, Stephanus numbers for Plato and so forth. Annotations can obviously flag individual words and lemmata, but it seems like for linked data approaches you simply can’t key things to small units that won’t be unique and might in fact change if a manuscript reading is revised. I am aware of the CTS-URN architecture, and consider it to be a key advance in the history of classical studies. But I am speaking here just about linking annotation content to chunks of classical texts.

What Prof. Crane would like is more machine operability, so you can re-use annotations and automate the process. That way, I don’t have to write the same annotation over and over. If, say,  iam tum cum in Catullus 1 means the same thing as iam tum cum in other texts, you should be able to re-use the note. Likewise for places and personal names, you shouldn’t have to explain afresh every time which one of the several Alexandrias or Diogeneses you are dealing with.

I personally think that, while the process of annotation can be simplified, especially by linking out to standard grammars rather than re-explaining grammatical points every time, and by creating truly accurate running vocabulary lists, the dream of machine operable annotation is not a realistic one. You can use reference works to make the process more efficient. But a human will always have to do that, and more importantly the human scholar figure will always need to be in the forefront for classical annotation. The audience prefers it, and the qualified specialists are out there.

This leads me to my last point for this overlong post, that getting the qualified humans in the game of digital annotation is for me the key factor. I am so thrilled the APA is taking the lead with DLL. APA has access to the network of scholars in a way that the rest of us do not, and I look forward to seeing the APA leverage that into some truly revolutionary quality resources in the coming years. Sorry, it’s the SCS now!

Greek Core Vocabulary: A Sight Reading Approach

Crytian Cruz, via Flickr (

(This is a slightly revised version of a talk given by Chris Francese on January 4, 2013 at the American Philological Association Meeting, at the panel “New Adventures in Greek Pedagogy,” organized by Willie Major.)

Not long ago, in the process of making some websites of reading texts with commentary on classical authors, I became interested in high-frequency vocabulary for ancient Greek. The idea was straightforward: define a core list of high frequency words that would not be glossed in running vocabulary lists to accompany texts designed for fluid reading. I was fortunate to be given a set of frequency data from the TLG by Maria Pantelia, with the sample restricted to authors up to AD 200, in order to avoid distortions introduced church fathers and Byzantine texts. So I thought I had it made. But I soon found myself in a quicksand, slowly drowning in a morass infested with hidden, nasty predators, until Willie Major threw me a rope, first via his published work on this subject, and then with his collaboration in creating what is now a finished core list of around 500 words, available free online. I want to thank Willie for his generosity, his collegiality, his dedication, and for including me on this panel. I also received very generous help, data infusions, and advice on our core list from Helma Dik at the University of Chicago, for which I am most grateful.

What our websites offer that is new, I believe, is the combination of a statistically-based yet lovingly hand-crafted core vocabulary, along with handmade glosses for non-core words. The idea is to facilitate smooth reading for non-specialist readers at any level, in the tradition of the Bryn Mawr Commentaries, but with media—sound recordings, images, etc. Bells and whistles aside, however, how do you get students to actually absorb and master the core list? Rachel Clark has published an interesting paper on this problem at the introductory level of ancient Greek that I commend to you. There is also of course a large literature on vocabulary acquisition in modern languages, which I am going to ignore completely. This paper is more in the way of an interim report from the field about what my colleague Meghan Reedy and I have been doing at Dickinson to integrate core vocabulary with a regime based on sight reading and comprehension, as opposed to the traditional prepared translation method. Consider this a provisional attempt to think through a pedagogy to go with the websites. I should also mention that we make no great claim to originality, and have taken inspiration from some late nineteenth century teachers who used sight reading, in particular Edwin Post.

In the course of some mandated assessment activities it became clear that the traditional prepared translation method was not yielding students who could pick their way through a new chunk of Greek with sufficient vocabulary help, which is our ultimate goal. With this learning goal in mind we tried to back-design a system that would yield the desired result, and have developed a new routine based around the twin ideas of core vocabulary and sight reading. Students are held responsible for the core list, and they read and are tested at sight, with the stipulation that non-core words will be glossed. I have no statistics to prove that our current regime is superior to the old way, but I do know it has changed substantially the dynamics of our intermediate classes, I believe for the better.
Students’ class preparation consists of a mix of vocabulary memorization for passages to be read at sight in class the next day, and comprehension/grammar worksheets on other passages (ones not normally dealt with in class). Class itself consists mainly of sight translation, and review and discussion of previously read passages, with grammar review as needed. Testing consists of sight passages with comprehension and grammar questions (like the worksheets), and vocabulary quizzes. Written assignments focus on textual analysis as well as literal and polished literary translation.

The concept (not always executed with 100% effectiveness, I hasten to add) is that for homework students focus on relatively straightforward tasks they can successfully complete (the vocabulary preparation and the worksheets). This preserves class time for the much more difficult and higher-order task of translation, where they need to be able to collaborate with each other, and where we’re there to help them—point out word groups and head off various types of frustration. It’s a version, in other words, of the flipped classroom approach, a model of instruction associated with math and science, where students watch recorded lectures for homework and complete their assignments, labs, and tests in class. More complex, higher-order tasks are completed in class, more routine, more passive ones, outside.

There are many possible variations of this idea, but the central selling point for me is that it changes the set of implicit bargains and imperatives that underlie ancient language instruction, at least as we were practicing it. Consider first vocabulary: in the old regime we said essentially: “know for the short-term every word in each text we read. I will ask you anything.” In the new regime we say, “know for the long-term the most important words. The rest will be glossed.” When it comes to reading, we used to say or imply, “understand for the test every nuance of the texts we covered in class. I will ask you any detail.” In the new system we say, “learn the skills to read any new text you come across. I will ask for the main points only, and give you clues.” What about morphology? The stated message was, “You should know all your declensions and conjugations.” The unspoken corollary was “But if you can translate the prepared passage without all that you will still pass.” With the new method, the daily lived reality is, “If you don’t know what endings mean you will be completely in the dark as to how these words are related.” When it comes to grammar and syntax, the old routine assumed they should know all the major constructions as abstract principles, but with the tacit understanding that this is not really likely to be possible at the intermediate level. The new method says, “practice recognizing and identifying the most common grammatical patterns that actually occur in the readings. Unusual things will be glossed.” More broadly, the underlying incentives of our usual testing routines was always, “Learn and English translation of assigned texts and you’ll be in pretty good shape.” This has now changed to: “know core vocabulary and common grammar cold and you’ll be in pretty good shape.”

Now, every system has its pros and cons. The cons here might be a) that students don’t spend quite as much time reading the dictionary as before, so their vocabulary knowledge is not as broad or deep as it should be; b) that the level of attention to specific texts is not as high as in the traditional method; and c) that not as much material can be covered when class work done at sight. The first of these (not enough dictionary time) is a real problem in my view that makes this method not really suitable at the upper levels. At the intermediate level the kind of close reading that we classicists value so much can be accomplished through repeated exposure in class to texts initially encountered at sight, and through written assignments and analytical papers. The problem of coverage is alleviated somewhat by the fact that students encounter as much or more in the original language than before, thanks to the comprehension worksheets, which cover a whole separate set of material.

On the pro side, the students seem to like it. Certainly their relationship to grammar is transformed. They suddenly become rather curious about grammatical structures that will help them figure out what the heck is going on. With the comprehension worksheets the assumption is that the text makes some kind of sense, rather than what used to be the default assumption, that it’s Greek, so it’s not really supposed to make that much sense anyway. While the students are still mastering the core vocabulary, one can divide the vocabulary of a passage into core and non-core items, holding the students responsible only for core items. Students obviously like this kind of triage, since it helps them focus their effort in a way they acknowledge and accept as rational. The key advantage to a statistically based core list in my view is really a rhetorical one. In helps generate buy-in. The problem is that we don’t read enough to really master the core contextually in the third semester. Coordinating the core with what happens to occur in the passages we happen to read is the chief difficulty of this method. I would argue, however, that even if you can’t teach them the whole core contextually, the effort to do so crucially changes the student’s attitude to vocabulary acquisition, from “how can I possibly ever learn this vast quantity of ridiculous words?” to “Ok, some of these are more important than others, and I have a realistic numerical goal to achieve.” The core is a possible dream, something that cannot always be said of the learning goals implicit in the traditional prepared translation method at the intermediate level.

The question of how technology can make all this work better is an interesting one. Prof. Major recently published an important article in CO that addresses this issue. In my view we need a vocabulary app that focuses on the DCC core, and I want to try to develop that. We need a video Greek grammar along the lines of Khan Academy that will allow students to absorb complex grammatical concepts by repeated viewings at home, with many, many examples, annotated with chalk and talk by a competent instructor. And we need more texts that are equipped with handmade vocabulary lists that exclude core items, both to facilitate reading and to preserve the incentive to master the core. And this is where our project hopes to make a contribution. Thank you very much, and I look forward to the discussion period.

–Chris Francese


Greek Core Vocabulary Acquisition: A Sight Reading Approach

American Philological Association, Seattle, WA

Friday January 4, 2013

Panel: New Adventures in Greek Pedagogy

Christopher Francese, Professor of Classical Studies, Dickinson College


Dickinson College Commentaries:

Latin and Greek texts for reading, with explanatory notes, vocabulary, and graphic, video, and audio elements. Greek texts forthcoming: Callimachus, Aetia (ed. Susan Stephens); Lucian, True History (ed. Stephen Nimis and Evan Hayes).

DCC Core Ancient Greek Vocabulary

About 500 of the most common words in ancient Greek, the lemmas that generate approximately 65% of the word forms in a typical Greek text. Created in the summer of 2012 by Christopher Francese and collaborators, using two sets of data:  1. A subset of the comprehensive Thesaurus Linguae Graecae database, including all texts in the database up to AD 200, a total of 20.003 million words (of which the period AD 100–200 accounts for 10.235 million). 2. The corpus of Greek authors at Perseus Chicago, which at the time our list was developed was approximately 5 million words.

Rachel Clark, “The 80% Rule: Greek Vocabulary in Popular Textbooks,” Teaching Classical Languages 1.1 (2009), 67–108.

Wilfred E. Major, “Teaching and Testing Classical Greek in a Digital World,” Classical Outlook 89.2 (2012), 36–39.

Wilfred E. Major, “It’s Not the Size, It’s the Frequency: The Value of Using a Core Vocabulary in Beginning and Intermediate Greek”  CPL Online 4.1 (2008), 1–24.



Read Iliad 1.266-291, then answer the following in English, giving the exact Greek that is the basis of your answer:


  1. (lines 266-273)  Who did Nestor fight against, and why did he go?





  1. (lines 274-279 ) Why should Achilles defer to Agamemnon, in Nestor’s view?




  1. (lines 280-284) What is the meaning and difference between κάρτερος and φέρτερος as Nestor explains it?




  1. (lines 285-291) What four things does Achilles want, according to Agamemnon?



Find five prepositional phrases, write them out and translate, noting the line number, and the case that each preposition takes.







Find five verbs in the imperative mood, write them out and translate, noting the line number and tense of each.






Latin Core Spreadsheet

Peter Sipes, benevolus amicus noster apud Google+, has kindly made available a Google spreadsheet of the DCC Latin Core Vocabulary. Check it out, and download it. He uses it for those occasions when he is working without an internet connection. I wonder what he is doing with the list? Perhaps a guest blog post is in order. Peter?

The core vocabularies have been on my back burner while I have been finishing up a book project of the dead tree variety while on leave from Dickinson for the fall ’12 semester. But I hope to return very soon to consideration of the semantic groupings in particular. My Dickinson colleague Meghan Reedy pointed out some flaws in the groupings on the Latin side, and we need to get that sorted before she and I move forward on our grand project: a poster that will visually represent the core according to its associated LASLA data, expressing visually each lemma’s frequency, semantic group, and relative commonness in poetry and prose.

In the meantime, if you will be at the meetings of the (soon-to-be-renamed) American Philological Association in Seattle, please stop by the Greek pedagogy session and hear my fifteen minute talk about a way to use the DCC Greek core vocabulary in an intermediate sequence based around sight reading and comprehension, as opposed to the traditional prepared translation method.

Here is the whole line-up:

Friday January 4, 8:30 AM – 11:00 AM Washington State Convention Center Room 604

Wilfred E. Major, Louisiana State University, Organizer
The papers on this panel each offer guidance and new directions for teaching beginning and intermediate Greek. First is a report on the 2012 College Greek Exam. Following are a new way to teach Greek accents, and a new way to sequence declensions, tenses and conjugations in beginning classes. Then we get a look at a reader in development that makes authentic ancient texts accessible to beginning students, and finally a way to make sight reading the standard method of reading in intermediate Greek classes.

Albert Watanabe, Louisiana State University
The 2012 College Greek Exam (15 mins.)

Wilfred E. Major, Louisiana State University
A Better Way to Teach Greek Accents (15 mins.)

Byron Stayskal, Western Washington University
Sequence and Structure in Beginning Greek (15 mins.)

Georgia L. Irby, The College of William and Mary
A Little Greek Reader: Teaching Grammar and Syntax with Authentic Greek (15 mins.)

Christopher Francese, Dickinson College
Greek Core Vocabulary Acquisition: A Sight Reading Approach (15 mins.)