Fully Parsed Apuleius Progress

Significant headway yesterday on the ongoing project to digitize the Index Apuleianus. This print work, created by William Oldfather et al., was published in 1934 by the American Philological Association. Bret Mulligan and I received a small grant for digitization from the copyright holder, the Society for Classical Studies. The value of concordances was once widely acknowledged, but since the rise of computing the genre has fallen into neglect and disrepute. Bret and I are engaged in a data reclamation mission. A proper index or concordance is essentially a fully parsed text that has been chopped up and organized by dictionary headword. The parsing was done by very competent scholars who put a lot of time and effort into correctly analyzing each and every word of the text. What we are doing is unscrambling these indices to yield the fully parsed texts, so they can be used in The Bridge, a tool that allows users to create accurate vocabulary lists for Latin and Ancient Greek texts.

Just unscramble the index and get a parsed text: excellent in theory, but what about the practice? Each of these print indices and concordances has quirks that make processing the data a matter for careful thought. The SCS grant allowed us to have the Index Apuleianus professionally digitized by NewGen Knowledge Works. This yielded data that was somewhat messy, and what I did yesterday was carefully examine what we have and figure out what we need and what we don’t. For example:

[Image: sample of digitized concordance with deletions marked]

&lobrk; represents the double brackets that are ubiquitous in Oldfather’s Index. They indicate words that are found not in the published texts of Apuleius used to compile the Index (the Teubner editions of 1908-1913) but only in the Additamentum ad Apparatum Criticum, which the team laboriously compiled to add the variant readings identified and emendations proposed since the publication of the source editions. The Index was heavily oriented toward advancing the textual criticism of Apuleius. Oldfather’s team reported every single notable textual variant or proposed emendation known up to that time, even when the variant readings were clearly mistakes in the principal manuscript, F (Florence, Bibl. Med. Laurenziana 68.2, 11th century). The superscript * indicates that a reading is correct, but urges the reader to consult the Additamentum.

On inspection it became clear that all matter in double brackets needed to go. The same was true for material in single square brackets. These contain not likely readings but emendations proposed for lacunae by older critics, most of them not even mentioned by the latest critical texts, such as the newish OCT of the Metamorphoses by Zimmerman. Likewise, for our purposes, things like the dagger symbol, which indicates an unsolved textual problem, are not needed. Issues of that kind will be dealt with in post-production, and would just gum up the works here. Words in parentheses, however, are accepted in the text; the parentheses are a signal that the word is mentioned in some serious way in the apparatus criticus.

My Dickinson colleague in the Computer Science Department, Michael Skalak, is writing up a script to remove what for Oldfather and his team was crucial information, but for us constitutes noise.

This is a basic summary of the deletions:

<SUP>*</SUP> this exact string
[…] all text within brackets, and the brackets themselves. Watch out for a missing close bracket (see below)
&lobrk; … &robrk; all text within the double bracket symbols, and the symbols themselves
<il>[omiem? M 7, 7, 8.]</il>  any lemma that consists entirely of bracketed material
<il>no digits</il> all lemmas that have no numerals
&dagger; dagger symbol
(<I>u. et</I> Sicinius) “see also” cross-references. The <I>u. et</I> tag (“see also”), everything that accompanies it within the parentheses, and the parentheses themselves should be deleted. All instances of the <I>u. et</I> tag that are not within parentheses seem to be in lemmas with no numerals, and these will already be deleted.
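The deletion rules above can be sketched in Python with a handful of regular expressions. This is only an illustration, not Skalak’s actual script: it assumes that every entry is wrapped in a single <il>…</il> pair, that brackets are never nested, and that the "no numerals" test is applied after the other deletions (which also disposes of entries made up entirely of bracketed material). The names clean_entry and clean_index are invented for the example. Entries with a stray, unmatched bracket are set aside for a human to look at rather than guessed at.

```python
import re

# One pattern per deletion rule in the summary above.
SUP_STAR = re.compile(r"<SUP>\*</SUP>")                    # superscript asterisk
DOUBLE_BRACKETS = re.compile(r"&lobrk;.*?&robrk;", re.S)   # double-bracketed matter
SQUARE_BRACKETS = re.compile(r"\[[^\[\]]*\]")              # single-bracketed matter
SEE_ALSO = re.compile(r"\(\s*<I>u\. et</I>[^()]*\)")       # "see also" parentheticals
DAGGER = re.compile(r"&dagger;")                           # dagger symbol
ENTRY = re.compile(r"<il>(.*?)</il>", re.S)                # assumed entry wrapper


def clean_entry(body):
    """Apply every deletion rule to the text of one entry."""
    for pattern in (SUP_STAR, DOUBLE_BRACKETS, SQUARE_BRACKETS, SEE_ALSO, DAGGER):
        body = pattern.sub("", body)
    # Collapse the gaps left behind by the deletions.
    return re.sub(r"\s+", " ", body).strip()


def clean_index(raw):
    """Return (kept, suspect): cleaned entries, and raw entries needing review."""
    kept, suspect = [], []
    for match in ENTRY.finditer(raw):
        body = clean_entry(match.group(1))
        # A leftover bracket means an opening or closing mark was missing.
        if any(mark in body for mark in ("[", "]", "&lobrk;", "&robrk;")):
            suspect.append(match.group(1))
            continue
        # Entries with no numerals carry no citations, so they are dropped.
        if not re.search(r"\d", body):
            continue
        kept.append(body)
    return kept, suspect
```

Applying the rules entry by entry, rather than to the whole file at once, keeps a missing close bracket from swallowing the entries that follow it.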

Since the chapters in many of Apuleius’ works are quite large, Oldfather and his team felt the need for some kind of location data that wouldn’t leave you reading an entire page of Latin just to find a single word. They decided to use the line numbers in the Teubners. Problem is, those line numbers are not at all standard, and have indeed changed in subsequent revisions of the Teubner texts themselves. We too would love to have more accurate location data, but are for the moment stuck with these obsolete line numbers.

After cleaning, Skalak’s script will structure the data as follows:

A: lemma (first word after <il> tag, sometimes preceded by “(”)

B: Work title, abbreviated

C: Citation form (chapter and line, or book, chapter, and line in the Teubners), with underscore between numbers for easier processing by the Bridge

D: Word form as it appears in the text

E: Any syntactical tagging added by Oldfather et al.

For example:

omniformis As 35_16 omniformis  
omniformis As 19_24 omniformem (<I>m.</I>)
omniformis As 34_26 omniformes (<I>f.</I>)
omniformis As 36_15 omniformes  
omniformis As 3_17 omniformium (<I>f.</I>)
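To make the five-column layout concrete, here is a minimal Python sketch that reads one such row back into its fields and converts a comma-separated citation of the kind found in the raw entries (e.g. "7, 7, 8.") into the underscore form. It assumes that the lemma, work abbreviation, citation, and word form each contain no internal spaces, which may well not hold for every entry in the Index. IndexRow, to_bridge_citation, and parse_row are names made up for this illustration.

```python
import re
from typing import NamedTuple


class IndexRow(NamedTuple):
    lemma: str      # A: dictionary headword
    work: str       # B: abbreviated work title
    citation: str   # C: numbers joined by underscores
    form: str       # D: word form as it appears in the text
    tag: str        # E: syntactical tagging, or "" if there is none


def to_bridge_citation(citation):
    """Join the numbers of a printed citation with underscores."""
    return "_".join(re.findall(r"\d+", citation))


def parse_row(line):
    """Split one structured row into its five fields."""
    # Assumption: the first four fields contain no internal spaces.
    parts = line.split(maxsplit=4)
    if len(parts) < 4:
        raise ValueError(f"expected at least four fields: {line!r}")
    lemma, work, citation, form = parts[:4]
    tag = parts[4].strip() if len(parts) == 5 else ""
    return IndexRow(lemma, work, citation, form, tag)
```

For instance, parse_row("omniformis As 19_24 omniformem (<I>m.</I>)") yields a row whose tag field is "(<I>m.</I>)", while a row without tagging gets an empty tag.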

After post-processing and checking, this will allow us to upload the text to the Bridge, and create a stand-alone database for this information on DCC. It will not of course be perfect, primarily because textual criticism of Apuleius has moved on since the early 20th century, but it will allow students and teachers to create custom vocabulary lists for all the works of Apuleius, and substantially increase the readability of his texts. (This is already the case for the Aeneid, Caesar’s Gallic War, and other texts, thanks to the Bridge.) Another benefit for scholars will be ready access to the frequency and morphology information contained in the Index, which is currently hard to access.