How far will core vocabulary get you?

One of the claims that scholars make about vocabulary acquisition in Latin and Greek is that a relatively small number of high-frequency lemmas (dictionary headwords) accounts for a high percentage of the word forms in a typical text. John Muccigrosso and Wilfred Major, for example, estimate that the number of lemmas that will generate 80% of a typical text is 1500 in Latin and about 1100 in Greek (Muccigrosso, 2004, p. 416; Major, 2008, p. 7). Of course it stands to reason that this figure will differ between texts and within texts, since some authors use relatively simple vocabulary (Nepos, Lysias) while others do not (Juvenal, Aeschylus), and some passages within an author have more unusual words than others. I and others have long wanted a way to calculate the “core percentage” of a given piece of text, that is, the percentage of word forms in a section of a text that derive from high-frequency lemmas. This would be both interesting from the point of view of literary criticism and helpful pedagogically. Some data on this is now emerging in the case of Latin, thanks to the work of LASLA, of Bret Mulligan and his Bridge application, and the Excel skills of Derek Frymark (Dickinson ’12).
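For readers who like to see the arithmetic, here is a minimal sketch of the core percentage calculation in Python. It assumes a text that has already been lemmatized into (lemma, is_proper_name) pairs, one per word form, and a set of core lemmas. The function name, the data layout, and the toy data are all hypothetical, invented for illustration. They do not represent the LASLA data format, the Bridge, or the spreadsheet actually used.

```python
def core_percentage(tokens, core_lemmas):
    """Share of word forms whose lemma is in core_lemmas, ignoring proper names.

    tokens: iterable of (lemma, is_proper_name) pairs, one per word form
            (a hypothetical layout, not the LASLA format).
    Returns a fraction between 0 and 1, or None if nothing is left to count.
    """
    # Proper names are left out of the count entirely.
    counted = [lemma for lemma, is_proper in tokens if not is_proper]
    if not counted:
        return None
    core = sum(1 for lemma in counted if lemma in core_lemmas)
    return core / len(counted)


# Toy illustration with invented data: six countable word forms, five of them core.
toy_core = {"sum", "omnis", "in", "pars", "tres"}
toy_tokens = [
    ("Gallia", True), ("sum", False), ("omnis", False), ("divido", False),
    ("in", False), ("pars", False), ("tres", False),
]
print(core_percentage(toy_tokens, toy_core))
```

Note that the count is of word forms (tokens), not of distinct lemmas, so a core lemma that occurs ten times contributes ten to the numerator.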

If we take the 1000-word DCC core Latin vocabulary as the definition of high-frequency lemmas, then 78% of the word forms in Caesar’s Gallic War derive from core lemmas, excluding proper names. The core percentages by book in Caesar’s Gallic War (excluding Hirtius’ Book 8, for which we have no LASLA data) look like this:

Book      Core percentage
1         80%
2         78%
3         77%
4         79%
5         77%
6         78%
7         75%

Individual chapters range from a high of 91% (4.8) to a low of 57% (7.72). Forty-four sentences in the work consist of 100% core vocabulary (e.g. 1.8.3 and 1.10.4), while there are two sentences, 3.13.4 and 3.13.4, which tie for a low of 17%.
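The same count can be broken down by chapter or sentence to find highs and lows like these. The following is only a sketch under assumptions: each word form is given as a hypothetical (section, lemma, is_proper_name) triple, and the section labels and data are invented, not taken from LASLA.

```python
from collections import defaultdict


def core_percentage_by_section(tokens, core_lemmas):
    """Map each section label to its core fraction, ignoring proper names.

    tokens: iterable of (section, lemma, is_proper_name) triples
            (a hypothetical layout chosen for this sketch).
    """
    totals = defaultdict(lambda: [0, 0])  # section -> [core count, total count]
    for section, lemma, is_proper in tokens:
        if is_proper:
            continue  # proper names are excluded from both counts
        totals[section][1] += 1
        if lemma in core_lemmas:
            totals[section][0] += 1
    return {section: core / total for section, (core, total) in totals.items()}


# Toy illustration with invented data.
toy_core = {"sum", "et", "in"}
toy_tokens = [
    ("1.1", "sum", False), ("1.1", "et", False), ("1.1", "Caesar", True),
    ("1.1", "in", False), ("1.1", "essedum", False),
    ("1.2", "et", False), ("1.2", "murus", False),
]
by_section = core_percentage_by_section(toy_tokens, toy_core)
highest = max(by_section, key=by_section.get)
lowest = min(by_section, key=by_section.get)
print(highest, by_section[highest], lowest, by_section[lowest])
```

Short units such as single sentences swing much more widely than chapters or books, simply because a handful of unusual words makes up a larger share of a short passage.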

In the Aeneid (taking the chunks of the text as presented in Perseus) the average chunk is 70% core, with a high of 88% (7.1–4) and a low of 46% (6.417–425). The book-by-book totals are as follows:

Book      Core percentage
1         72%
2         73%
3         70%
4         72%
5         70%
6         71%
7         69%
8         69%
9         71%
10        70%
11        72%
12        70%

Two Dickinson students, Seth Levin and Connor Ford, are working on visualizing the core percentage data for the Aeneid and the Gallic War as part of Dickinson’s Mellon-funded Digital Boot Camp, led by Patrick Belk, starting this week. I look forward to sharing the results in the next few weeks, and hearing what you think of them!


Major, Wilfred E. (2008). “It’s Not the Size, It’s the Frequency: The Value of Using a Core Vocabulary in Beginning and Intermediate Greek.” CPL Online, 4.1, 1-24.

Muccigrosso, John (2004). “Frequent Vocabulary in Latin Instruction.” Classical World, 97, 409-433.

Note: this post was edited Jan. 15, 2016, to take into account some corrections in the data and to add the book-by-book figures for the Aeneid.