Johan Winge’s New Latin Macronizer


image credit: Vincent Ramos via Wikimedia Commons

A new Latin macronizer has come on the scene, and it is superb. It should become an essential tool for Latin teachers and editors of Latin texts. The author is Johan Winge, who just completed his undergraduate studies in the Language Technology Programme at Uppsala University, supervised by Joakim Nivre. The macronizer is the result of his thesis work for the degree. I had the opportunity to give it a good test run recently, as I read the Ilias Latina along with about twenty Latin teachers at the Dickinson Summer Latin Workshop. I took the PHI text (Vollmer’s Teubner from 1913) of this 1070-line condensation of the Iliad into Latin hexameters, put it in a Word document, and ran it through Winge’s macronizer. We read the text together and spotted the cases where corrections were needed.

The claim on the site that “The expected accuracy on an average classical text is estimated to be about 98% to 99%” seems like no exaggeration. What makes Winge’s macronizer more effective than other tools such as Kevin Ryan’s Macron Helper or Felipe Vogel’s māccer is that it does not work on the basis of a database of previously macronized forms. Rather, it uses a part-of-speech tagger (RFTagger) trained on the Latin Dependency Treebank, with macrons provided by a customized version of the Morpheus morphological analyzer.
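To make the contrast concrete, here is a rough Python sketch of the general idea, and emphatically not Winge’s actual code. The little `ANALYSES` table stands in for the output of a morphological analyzer, the tag strings stand in for a tagger’s decisions, and the function name `macronize_token` is my own invention. The example words come from the phrase rēbus in artīs discussed further down.

```python
# Hypothetical stand-in for a morphological analyzer: each unmarked form maps
# to the macronized candidates it could represent, keyed by a made-up tag.
ANALYSES = {
    "artis": {"noun.gen.sg": "artis", "adj.abl.pl": "artīs"},
    "rebus": {"noun.abl.pl": "rēbus"},
    "in": {"prep": "in"},
}

def macronize_token(form, tag):
    """Pick the macronized candidate that matches the tag chosen in context."""
    candidates = ANALYSES.get(form.lower(), {})
    # Fall back to the unmarked form when no analysis matches the tag.
    return candidates.get(tag, form)

# Tags as a tagger might assign them in context (assumed, not real output).
tagged = [("rebus", "noun.abl.pl"), ("in", "prep"), ("artis", "adj.abl.pl")]
result = " ".join(macronize_token(form, tag) for form, tag in tagged)
print(result)
```

The point of the design is that the choice between artis and artīs is made by the tag assigned in context, not by which form happens to be more common in a list of previously macronized words.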

You’ll have to read Johan’s thesis, Automatic Annotation of Latin Vowel Length, to get all the technical details. I’ll just say that it performed splendidly on the Ilias Latina. Here is a typical stretch, lines 344-374, with the errors highlighted:

dumque inter sēsē procerēs certāmen habērent,
concilium omnipotēns habuit rēgnātor Olympī 345
foederaque intentō turbāvit Pandarus arcū,
tē, Menelāe, petēns; latērīque volātile tēlum
incīdit et tunicam ferrō squāmīsque rigentem
dissecat: excēdit pugna gemebundus Atrīdēs
castraque tūta petit; quem doctus ab arte paternā 350
Paeōniīs cūrat iuvenis Podalīrius herbīs
itque iterum in caedēs horrendaque proelia victor.
armāvit fortēs Agamemnonis īra Pelasgōs
et dolor in pugnam cūnctōs commūnīs agēbat.
bellum ingēns oritur multumque utrimque cruōris 355
funditur et tōtīs sternuntur corpora campīs;
inque vicem Trōumque cadunt Danaumque catervae.
nec requiēs datur ūlla virīs; sonat undique Mavors
tēlōrumque volant cūnctīs ē partibus imbrēs.
occīdit Antilochī rigidō dēmersus in umbrās 360
ēnse Thalysiadēs optātaque lūmina linquit.
inde manū fortī Grāiōrum terga prementem
occupat Anthemiōne satum Telamōnius Aiāx
et praedūrātō trānsfīxit pectora tēlō:
purpureō vomit ille animam cum sanguine mixtam, 365
ōra rigat moriēns. tum magnīs Antiphus hastam
vīribus adversum cōnātūs corpore tōtō
torquet in Aeacidēn: tēlumque errāvit ab hoste
inque hostem cecidit, trānsfīxit et inguine Leucōn:
concīdit īnfēlīx prōstrātus vulnere fortī 370
et carpit viridēs moribundus dentibus herbās.
†impiger †Atrīdēs cāsū concussūs amīcī
Democoonta petit tēlōque adversā trabālī
tempora trānsadigit …

You will note that of the 11 “mistakes” on this page, only one (Mavors) is a genuine error. All the others are simply ambiguous forms, issues that need to be decided by a human. Virtually all of the cases that did not fall into the category of “ambiguous forms that need to be decided by a human” were Greek proper names, in which this text abounds. For some reason the form Achillis consistently came out with a long mark on the final vowel. Paris came out with a final macron twice, but without it three times. There were quantity issues with Nereus and his daughters. The strange form mēō emerged at line 851. But virtually all the time, with all ordinary Latin words, the macronizer performed brilliantly. The greatest delight was seeing it correctly macronize the phrase rēbus in artīs (line 968), where the final word almost always has a short “i”, but not here. That will have been the result of the Treebank data, I am guessing.

Mr. Winge, I salute you!

Postscript 7/21/15: Johan writes that his source code is now available. Also, the picture I posted originally is not of him but of his friend Francesco Veneziano. Apologies to both Johan and Francesco for that one!

A New Latin Macronizer

Felipe Vogel has released a new Latin macronizer, Maccer, and I thought I would take it for a spin and share the results. It works based on a database of previously macronized Latin texts (some provided by DCC), and is still in development.

For my test I figured I would use an unusual text I have been working on lately, Historiarum Indicarum Libri XVI, about the Portuguese exploration of the Far East in the 16th century. It was published by the Jesuit humanist Pietro Maffei in 1588, and the Latin is excellent and full of interest. Book 6 is a fascinating ethnography of China, informed by reports from Jesuit missionaries who visited and lived in China over a number of years. The last print edition was 1751: Joannis Petri Maffeii Bergomatis E Societate Jesu Historiarum Indicarum Libri XVI (Vienna: Bernardi, 1751), and thanks to a tip from Terence Tunberg (who introduced me to this text) I tracked it down on the site of the Dresden Library. Since there is no fully digitized text, my students and I transcribed Book 6 this past fall. Here is an excerpt, with no macrons.

E Sinarum provinciis maxime occidua est Cantonia. Eo priusquam pervenias, multae occurrunt insulae; quas praefecti regii praesidiis et classibus tenent: neque ipsorum iniussu progredi advenas Cantonem est fas. Fernandus Andradius, ut exponere coeperam, cum ad Tamum insulam pervenisset, post diuturnam moram, transitu aegre tandem impetrato, cum duobus expeditis et egregie ornatis navigiis, cetera classe ad Tamum relicta, Cantonis portum invehitur, ac magistratuum permissu Thomam legatum exponit, cui aedes et lautia de more attributa. Ibi Fernandus, mira lenitate ac iustitia contrahendo cum incolis, haud ita difficili negotio aditum ad ea commercia nostris aperuit.

With Vogel’s macronizer this becomes

Ē ✖Sinarum prōvinciīs maximē ✖occidua ✪est ✖Cantonia. Eō priusquam perveniās, multae occurrunt īnsulae; quās ✖praefecti ✖regii praesidiīs et classibus tenent: neque ipsōrum ❡iniussū prōgredī ✖advenas ✖Cantonem ✪est fās. ✖Fernandus ✖Andradius, ut expōnere ✖coeperam, cum ad ✖Tamum īnsulam pervēnisset, post diūturnam moram, trānsitū aegrē tandem ✖impetrato, cum duōbus expedītīs et ēgregiē ✖ornatis nāvigiīs, cētera classe ad ✖Tamum ✪relictā, ✖Cantonis portum invehitur, ac magistrātuum ❡permissū ✖Thomam lēgātum expōnit, cui aedēs et ✖lautia dē mōre ❡attribūta. Ibi ✖Fernandus, ✒mīrã ✖lenitate ac iūstitia ✖contrahendo cum incolīs, haud ita ✖difficili negōtiō aditum ad ✒eã commercia nostrīs aperuit.

The symbols mean this:

✖ unknown word, i.e. not yet in Vogel’s database.
ambiguous: uncertain vowels marked with a tilde (~).
✪ guessed based on frequency.
❡ prefix or enclitic detected attached to a known word.
invalid characters detected.
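A database-driven macronizer of this kind can be pictured roughly as follows. This is only a minimal Python sketch under my own assumptions, not Vogel’s code: the `DATABASE` contents, the counts, the flag names, and the function `lookup` are all invented for illustration.

```python
# Hypothetical miniature database: unmarked form -> how often each
# macronized form was seen in previously macronized texts (made-up counts).
DATABASE = {
    "provinciis": {"prōvinciīs": 3},
    "maxime": {"maximē": 5},
    "est": {"est": 40, "ēst": 1},  # "is" versus "eats"
}

def lookup(word):
    """Return (macronized form, flag) for a single word."""
    seen = DATABASE.get(word.lower())
    if seen is None:
        return word, "unknown"  # not in the database: leave as is
    best = max(seen, key=seen.get)  # most frequent macronized form
    if word[0].isupper():
        best = best[0].upper() + best[1:]  # keep the original capitalization
    flag = "guessed" if len(seen) > 1 else "known"
    return best, flag

for w in ["provinciis", "est", "Cantonia"]:
    print(w, lookup(w))
```

A word that was never seen is passed through untouched and flagged, which is why proper names dominate the unknown group in a text like this one.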

I made sixteen corrections in 92 words.

21 words were flagged as unknown (Sinārum, occidua, Cantonia, praefectī, regiī, advenās, Cantonem, Fernandus, Andradius, coeperam, Tamum, impetrātō, ornātīs, Tamum, Cantonis, Thomam, lautia, Fernandus, lēnitāte, contrahendō, difficilī); 10 of those were proper names. I made 9 corrections in that group, leaving alone most of the proper names for now.

3 words were guessed based on frequency, all correctly (est, est, relictā).

3 words were marked as “prefix detected,” all correctly macronized (iniussū, permissū, attribūta).

2 were marked as having invalid characters (mīrā, ea), had tildes over the vowel, and had to be corrected by hand.

Only two words were incorrect but not flagged as in any way problematic (cēterā, iūstitiā). In both cases it was an ambiguous first-declension -a. The other vowels in those words were correct.
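Counts like these are easy to check mechanically. Here is a small Python sketch that tallies the flag symbols in a stretch of marked-up output; the sample is the first sentence of the macronized excerpt above, while the labels in `FLAGS` and the function name `tally_flags` are my own shorthand.

```python
from collections import Counter

# Flag symbols as they appear in the marked-up output; labels are shorthand.
FLAGS = {"✖": "unknown", "✪": "guessed", "❡": "prefix", "✒": "tilde-marked"}

def tally_flags(text):
    """Count flagged words by the symbol that prefixes them."""
    counts = Counter()
    for word in text.split():
        if word[0] in FLAGS:
            counts[FLAGS[word[0]]] += 1
    return counts

sample = "Ē ✖Sinarum prōvinciīs maximē ✖occidua ✪est ✖Cantonia."
counts = tally_flags(sample)
print(counts["unknown"], counts["guessed"])  # 3 1
```

Run over the whole excerpt, the same function should reproduce the group sizes reported above.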

The hand-corrected result is as follows:

Ē Sinārum prōvinciīs maximē occidua est Cantonia. Eō priusquam perveniās, multae occurrunt īnsulae; quās praefectī regiī praesidiīs et classibus tenent: neque ipsōrum iniussū prōgredī advenās Cantonem est fās. Fernandus Andradius, ut expōnere coeperam, cum ad Tamum īnsulam pervēnisset, post diūturnam moram, trānsitū aegrē tandem impetrātō, cum duōbus expedītīs et ēgregiē ornātīs nāvigiīs, cēterā classe ad Tamum relictā, Cantonis portum invehitur, ac magistrātuum permissū Thomam lēgātum expōnit, cui aedēs et lautia dē mōre attribūta. Ibi Fernandus, mīrā lēnitāte ac iūstitiā contrahendō cum incolīs, haud ita difficilī negōtiō aditum ad ea commercia nostrīs aperuit.

I would call these very good results, and it should be possible to do even better given a larger database. In theory we could do even better than that by marrying a parser and a dictionary like LaNe that has quantities accurately marked. If all goes well I hope to embark on such a project this fall with the help of a senior Computer Science student at Dickinson. The other thing I would like to see is an editing environment that would make inserting macrons as easy as clicking on the vowel. This would really help in the inevitable process of hand correction.
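The core of such a click-to-toggle editor would be quite small. Here is a minimal Python sketch of just the toggling step, assuming the text is stored in composed (NFC) form so that each vowel is a single character; the function name `toggle_macron` is hypothetical, and the click handling itself is left out.

```python
import unicodedata

MACRON = "\u0304"  # Unicode combining macron

def toggle_macron(text, index):
    """Return text with the macron on the vowel at `index` added or removed."""
    # Decompose the character into base letter plus any combining marks.
    parts = unicodedata.normalize("NFD", text[index])
    if parts[0].lower() not in "aeiouy":
        return text  # not a vowel: leave the text untouched
    if MACRON in parts:
        parts = parts.replace(MACRON, "")
    else:
        parts = parts[0] + MACRON + parts[1:]
    # Recompose so the result is again a single character where possible.
    return text[:index] + unicodedata.normalize("NFC", parts) + text[index + 1:]

fixed = toggle_macron("cetera", 5)
print(fixed)
```

Calling the function a second time on the same position removes the macron again, so a single click handler could serve for both adding and removing long marks.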

Thank you, Felipe, for this amazing tool!