A New Latin Macronizer

Felipe Vogel has released a new Latin macronizer, Maccer, and I thought I would take it for a spin and share the results. It works based on a database of previously macronized Latin texts (some provided by DCC), and is still in development.

For my test I figured I would use an unusual text I have been working on lately, Historiarum Indicarum Libri XVI, about the Portuguese exploration of the Far East in the 16th century. It was published by the Jesuit humanist Pietro Maffei in 1588, and the Latin is excellent and full of interest. Book 6 is a fascinating ethnography of China, informed by reports from Jesuit missionaries who visited and lived in China over a number of years. The last print edition appeared in 1751: Joannis Petri Maffeii Bergomatis E Societate Jesu Historiarum Indicarum Libri XVI (Vienna: Bernardi, 1751), and thanks to a tip from Terence Tunberg (who introduced me to this text) I tracked it down on the site of the Dresden Library. Since there is no fully digitized text, my students and I transcribed Book 6 this past fall. Here is an excerpt, with no macrons.

E Sinarum provinciis maxime occidua est Cantonia. Eo priusquam pervenias, multae occurrunt insulae; quas praefecti regii praesidiis et classibus tenent: neque ipsorum iniussu progredi advenas Cantonem est fas. Fernandus Andradius, ut exponere coeperam, cum ad Tamum insulam pervenisset, post diuturnam moram, transitu aegre tandem impetrato, cum duobus expeditis et egregie ornatis navigiis, cetera classe ad Tamum relicta, Cantonis portum invehitur, ac magistratuum permissu Thomam legatum exponit, cui aedes et lautia de more attributa. Ibi Fernandus, mira lenitate ac iustitia contrahendo cum incolis, haud ita difficili negotio aditum ad ea commercia nostris aperuit.

With Vogel’s macronizer this becomes

Ē ✖Sinarum prōvinciīs maximē ✖occidua ✪est ✖Cantonia. Eō priusquam perveniās, multae occurrunt īnsulae; quās ✖praefecti ✖regii praesidiīs et classibus tenent: neque ipsōrum ❡iniussū prōgredī ✖advenas ✖Cantonem ✪est fās. ✖Fernandus ✖Andradius, ut expōnere ✖coeperam, cum ad ✖Tamum īnsulam pervēnisset, post diūturnam moram, trānsitū aegrē tandem ✖impetrato, cum duōbus expedītīs et ēgregiē ✖ornatis nāvigiīs, cētera classe ad ✖Tamum ✪relictā, ✖Cantonis portum invehitur, ac magistrātuum ❡permissū ✖Thomam lēgātum expōnit, cui aedēs et ✖lautia dē mōre ❡attribūta. Ibi ✖Fernandus, ✒mīrã ✖lenitate ac iūstitia ✖contrahendo cum incolīs, haud ita ✖difficili negōtiō aditum ad ✒eã commercia nostrīs aperuit.

The symbols mean this:

✖ unknown word, i.e. not yet in Vogel’s database.
ambiguous: uncertain vowels marked with a tilde (~).
✪ guessed based on frequency.
❡ prefix or enclitic detected attached to a known word.
invalid characters detected.
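
To make the "unknown" and "guessed based on frequency" flags concrete, here is a minimal sketch of a lookup-table macronizer. It is not Maccer's actual code: the function names (`strip_macrons`, `build_table`, `macronize`), the tiny sample corpus, and the rule "flag a word as guessed whenever the table holds more than one macronized form" are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import Counter, defaultdict

UNKNOWN, GUESSED = "✖", "✪"   # flag symbols borrowed from the output above
WORD = r"[^\W\d_]+"            # a run of Unicode letters

def strip_macrons(word: str) -> str:
    """Lowercase the word and remove macrons (ā -> a)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if ch != "\u0304")

def build_table(macronized_corpus: str) -> dict[str, Counter]:
    """Map each plain form to counts of the macronized forms seen for it."""
    table = defaultdict(Counter)
    corpus = unicodedata.normalize("NFC", macronized_corpus)
    for word in re.findall(WORD, corpus):
        table[strip_macrons(word)][word.lower()] += 1
    return table

def macronize(text: str, table: dict[str, Counter]) -> str:
    """Replace each word with its most frequent macronized form, with flags."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        forms = table.get(strip_macrons(word))
        if not forms:
            return UNKNOWN + word              # never seen in the corpus
        best, _count = forms.most_common(1)[0]
        if word[0].isupper():
            best = best[0].upper() + best[1:]  # keep the original capital
        flag = GUESSED if len(forms) > 1 else ""
        return flag + best
    return re.sub(WORD, replace, unicodedata.normalize("NFC", text))

# A made-up three-sentence "database" of macronized Latin.
table = build_table("cēterā classe relictā est. cētera dēsunt. cēterā classe.")
print(macronize("Cetera classe ad Tamum relicta", table))
```

A table like this can only pick the commonest form of an ambiguous word such as cetera, regardless of syntax, which is why a parser looks like the natural next step.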

I made sixteen corrections in 92 words.

21 words were flagged as unknown (Sinārum, occidua, Cantonia, praefectī, rēgiī, advenās, Cantonem, Fernandus, Andradius, coeperam, Tamum, impetrātō, ornātīs, Tamum, Cantonis, Thomam, lautia, Fernandus, lēnitāte, contrahendō, difficilī), 10 of them proper names. I made 9 corrections in that group, leaving alone most of the proper names for now.

3 words were guessed based on frequency, all correctly (est, est, relictā).

3 words were marked as “prefix detected,” all correctly macronized (iniussū, permissū, attribūta).

2 words were marked as having invalid characters (mīrā, ea). Both came out with a tilde over the final vowel and had to be corrected by hand.

Only two words were incorrect but not flagged as in any way problematic (cēterā, iūstitiā). In both cases it was an ambiguous first-declension -a. The other vowels in those words were correct.

The hand-corrected result is as follows:

Ē Sinārum prōvinciīs maximē occidua est Cantonia. Eō priusquam perveniās, multae occurrunt īnsulae; quās praefectī rēgiī praesidiīs et classibus tenent: neque ipsōrum iniussū prōgredī advenās Cantonem est fās. Fernandus Andradius, ut expōnere coeperam, cum ad Tamum īnsulam pervēnisset, post diūturnam moram, trānsitū aegrē tandem impetrātō, cum duōbus expedītīs et ēgregiē ornātīs nāvigiīs, cēterā classe ad Tamum relictā, Cantonis portum invehitur, ac magistrātuum permissū Thomam lēgātum expōnit, cui aedēs et lautia dē mōre attribūta. Ibi Fernandus, mīrā lēnitāte ac iūstitiā contrahendō cum incolīs, haud ita difficilī negōtiō aditum ad ea commercia nostrīs aperuit.

I would call these very good results, and it should be possible to do even better given a larger database. In theory we could do better still by marrying a parser to a dictionary like LaNe that has quantities accurately marked. If all goes well I hope to embark on such a project this fall with the help of a Dickinson Computer Science senior. The other thing I would like to see is an editing environment that would make inserting macrons as easy as clicking on the vowel. This would really help in the inevitable process of hand correction.

Thank you, Felipe, for this amazing tool!

Exporting and Sharing Digital Scholarly Editions

Desmond Schmidt’s recent article in the Journal of TEI about how to create truly portable and interoperable digital scholarly editions came at an opportune time for me. DCC is entering into a relationship with Open Book Publishers in Cambridge to exchange our (Creative Commons licensed) content. They will publish some of our commentaries as books and eBooks, and we will publish some of their book commentaries as multimedia, web-based editions. But how to actually make the transfer?

We are starting by delivering Bret Mulligan’s commentary on Nepos’ Life of Hannibal. OBP needs it in a format they can set in InDesign and publish as EPUB. How can we actually share the openly licensed scholarly content of DCC so that it can be re-purposed and re-published in different formats? Not easily, it turns out. Our commentaries are just HTML pages in Drupal, not XML-based, TEI-tagged documents, and thus, in the view of one early critic of the project, “not truly digital.” XML-TEI is intended as a universal standard for editing and tagging documents of all kinds, and not adopting it for our project was at the time a decision based on cost. Anyway, after various investigations on the OBP side, it turned out that the best way for us to deliver our commentaries to OBP is via . . . wait for it . . . Microsoft Word, with all the labor and possibilities for error that that involves.

Wouldn’t things be better if our texts were marked up in XML-TEI? No, according to Schmidt. He argues, in effect, that TEI is actually hindering the sharing of digital scholarly editions. The problem is the subjectivity of TEI tagging and the diversity of the tags themselves, which in Schmidt’s view makes true interoperability of scholarly editions in TEI a pipe dream. The solution he proposes, as I understand it, is to get all the tags and metadata out completely and into separate files, preserving the text as plain text (in multiple versions if we are dealing with revisions or variants). He is evidently developing an editing environment which ends up creating zipped files that completely separate the text itself, annotation data that points back to the text, and metadata. A few choice quotes:

Syd Bauman (2011), one of the original editors of TEI P5, has since observed that interoperability of TEI-encoded texts today—that is, the exchange of unmodified TEI files between different programs—is “impossible.” (9)

One obvious remedy to this problem is to remove the main source of non-interoperability, namely the embedded markup itself, from the text. By removing it, the part which contains all the significant interpretation can later be added or substituted at will. (21)

What remains when the markup is removed is a residue of plain text that is highly interoperable, which can be exchanged with other researchers, just as the files on Gutenberg.org are downloaded by the tens of thousands every day (Leibert 2008). However, if one suggests this to someone who regularly uses TEI-XML, the immediate objection is made that this will solve nothing, because even plain ASCII texts are still an interpretation of what the transcriber sees on the page (e.g. Sperberg-McQueen 1991, 35). This point, although valid to a degree, misses an important distinction. (22)

And it goes on in this interesting vein. I would love to hear from people who are wiser and more experienced than I am about Schmidt’s critique of embedded TEI annotation and his proposed solution. In the meantime, I need to go format some stuff in Microsoft Word.
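
As a rough illustration of the separation Schmidt proposes (plain text in one place, annotations that point back into it in another), here is a minimal sketch. It is not Schmidt's software and not how DCC stores anything: the function name `to_standoff`, the shape of the annotation records, and the TEI-like sample snippet are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

def to_standoff(xml_string: str) -> tuple[str, list[dict]]:
    """Split inline markup into plain text plus annotations with offsets."""
    root = ET.fromstring(xml_string)
    parts: list[str] = []         # pieces of plain text, in document order
    annotations: list[dict] = []  # one record per element

    def walk(elem: ET.Element, offset: int) -> int:
        start = offset
        if elem.text:
            parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            offset = walk(child, offset)
            if child.tail:        # text that follows the child's closing tag
                parts.append(child.tail)
                offset += len(child.tail)
        annotations.append({
            "tag": elem.tag,
            "attrs": dict(elem.attrib),
            "start": start,       # character offsets into the plain text
            "end": offset,
        })
        return offset

    walk(root, 0)
    return "".join(parts), annotations

# Hypothetical TEI-like markup, invented for this sketch.
SAMPLE = (
    '<p><persName ref="#hannibal">Hannibal</persName>, Hamilcaris filius, '
    "<placeName>Karthaginiensis</placeName>.</p>"
)

text, notes = to_standoff(SAMPLE)
print(text)
for note in notes:
    print(note["tag"], repr(text[note["start"]:note["end"]]))
```

The plain text could be handed to anyone as is, while the list of annotations could be saved separately (as JSON, say) and swapped for a different set of interpretations without touching the text.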