{"id":73,"date":"2012-07-09T02:47:52","date_gmt":"2012-07-09T02:47:52","guid":{"rendered":"http:\/\/blogs.dickinson.edu\/dcc\/?p=73"},"modified":"2012-07-09T12:52:50","modified_gmt":"2012-07-09T12:52:50","slug":"types-of-error-in-the-perseus-latin-word-study-tool","status":"publish","type":"post","link":"https:\/\/blogs.dickinson.edu\/dcc\/2012\/07\/09\/types-of-error-in-the-perseus-latin-word-study-tool\/","title":{"rendered":"Types of Error in the Perseus Latin Word Study Tool"},"content":{"rendered":"<p>The <a href=\"http:\/\/www.perseus.tufts.edu\/hopper\/morph?lang=la\">Perseus Latin Word Study Tool<\/a>\u00a0(LWST) is intended to provide dictionary definitions and grammatical analysis of all words in the Latin texts available in the <a href=\"http:\/\/www.perseus.tufts.edu\/hopper\/collection?collection=Perseus:collection:Greco-Roman\">Perseus Digital Library<\/a>, currently 10.5 million words.<\/p>\n<p>A check of the definitions and grammatical analysis of an arbitrarily chosen chunk of Vergil&#8217;s\u00a0<em>Aeneid<\/em>\u00a0(5.1-34, 223 words) found that the tool was incorrect in 79 instances, or 35.4% of the time (and correct 64.6% of the time). The most common type of error (21 instances, 26.6% of all errors, 9.4% of all words) was a mistake of accidence: for example, <em>duri<\/em>\u00a0(5.5) was taken as genitive singular instead of nominative plural. In 17 cases (21.5% of errors, 7.6% of all words) words were assigned to the wrong lemma, as when <em>quoque<\/em> (&#8220;and whither&#8221;) was derived from <em>quoque<\/em>\u00a0(&#8220;also, too&#8221;), or <em>venti<\/em> (&#8220;winds,&#8221; 5.20) was assigned to the verb <em>venio<\/em>, &#8220;come,&#8221; as if it were the perfect participle. This particular mistake occurred three times in this passage, and the correct lemma was not listed as a possible option.\u00a0In 14 instances (17.7% of errors, 6.3% of all words) the dictionary definitions provided were wildly wrong. 
This was true of some very common words. <em>iam<\/em>\u00a0was glossed as &#8220;are you going so soon,&#8221; <em>nec<\/em>\u00a0as &#8220;and not yet,&#8221; <em>ab<\/em>\u00a0as &#8220;all the way from.&#8221; <em>Elissae<\/em> (5.3) was glossed as &#8220;Hannibal.&#8221; In every case this type of error was seen to come from the pulling, seemingly at random, of a word or phrase from the dictionary of Lewis &amp; Short on which the LWST is based. In 11 instances (13.9% of errors, 4.9% of all words), the relevant definition in the context at hand was not provided (though it could be found by clicking through to and reading the full Lewis &amp; Short dictionary entry). For example, <em>cerno<\/em>\u00a0was glossed as &#8220;separate, part, sift,&#8221; but not &#8220;perceive,&#8221; or <em>infelicis<\/em>\u00a0(5.3) glossed as &#8220;unfruitful, not fertile barren,&#8221; rather than &#8220;unfortunate.&#8221;\u00a0More seriously, all relative pronouns were glossed as interrogatives (&#8220;who? which? what? what kind of a?&#8221;), and described simply as &#8220;pron.&#8221; The word &#8220;relative&#8221; did not appear on the page. In 8 instances (10.1% of errors, 3.6% of all words) a word was assigned to the incorrect part of speech, as when <em>medium<\/em>\u00a0(5.1) was called a noun rather than an adjective, or\u00a0<em>locutus <\/em>(5.14) assigned to the rare 4th decl. noun &#8220;a speaking&#8221; rather than to <em>loquor<\/em>. In 4 cases (5.1% of errors, 1.8% of all words), there was no definition available. And in all cases deponent verbs were incorrectly labeled passive (4 instances in this particular section, or\u00a05.1% of errors, 1.8% of all words).<\/p>\n<p>Now, the makers of Perseus are perfectly aware of the flaws in LWST, and attempt to use the power of social media to help remedy the situation. 
Subjoined to the analysis of every ambiguous word,\u00a0after an explanation of the methodology used, one finds a plea to help by voting.<\/p>\n<blockquote><p>The possible parses for this word have been evaluated by an experimental system that attempts to determine which parse is correct in this context. The system is composed of a number of &#8220;evaluators&#8221;&#8211;each of which uses different criteria to score the possibilities&#8211;whose votes are weighted to determine the best answer. The percentages in the table above show each evaluator&#8217;s score for each form, which are then combined to determine each form&#8217;s overall score.<br \/>\nThis selection used the following evaluators:<br \/>\n\u2022 User-voting evaluator: Scores parses based on the number of votes each one has received from users. Weighted more heavily as more users vote for a given word in a text.<br \/>\n\u2022 Prior-form frequency evaluator: Evaluates forms based on the preceding word in the text; finds the most likely parse among this word&#8217;s possible morphological features and the preceding word&#8217;s possible features based on the frequency of each possible pair.<br \/>\n\u2022 Word-frequency evaluator: Scores parses based on how often the dictionary word appears in the Perseus corpus. Only used when a given form could be from more than one possible word.<br \/>\n\u2022 Tagger evaluator: Evaluator based on pre-computed automatic morphological tagging<br \/>\n\u2022 Form frequency evaluator: Scores parses based on how often their morphological features (first-person, indicative, plural, and so on) occur among all the words in the Perseus corpus.<br \/>\nUser votes are weighted more heavily than the other methods, which are all treated equally.<br \/>\nDon&#8217;t agree with the results? Cast your vote for the correct form by clicking on the [vote] link to the right of the form above!<\/p><\/blockquote>\n<p>But here too, some problems arose in my sample. 
First of all, only a handful of doubtful words had any votes. Second, many of the error types identified above do not admit of voting. And third, those that did have votes did not always benefit from having them. Here is the entry on the word\u00a0<em>rates<\/em> in <em>ut pelagus tenuere rates<\/em> (5.8), showing a preference for the (incorrect) accusative, despite nine user votes for the (correct) nominative.<\/p>\n<p><a href=\"http:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-screen-shot-1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-large wp-image-76\" src=\"http:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-screen-shot-1-1024x640.jpg\" alt=\"\" width=\"584\" height=\"365\" srcset=\"https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-screen-shot-1-1024x640.jpg 1024w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-screen-shot-1-300x187.jpg 300w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-screen-shot-1-480x300.jpg 480w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-screen-shot-1.jpg 1280w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>On the word <em>pater<\/em>\u00a0in\u00a0<em>Quidve, pater Neptune, paras?<\/em> (5.14), ten incorrect user votes for the nominative win out over the (obviously correct) vocative.<\/p>\n<p><a href=\"http:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-large wp-image-77\" src=\"http:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-2-1024x640.jpg\" alt=\"\" width=\"584\" height=\"365\" srcset=\"https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-2-1024x640.jpg 1024w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-2-300x187.jpg 300w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-2-480x300.jpg 480w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST-2.jpg 1280w\" sizes=\"auto, (max-width: 584px) 100vw, 
584px\" \/><\/a><\/p>\n<p>More common, however, is the lack of any user votes at all, as in this very confusing jumble of information on the word <em>hoc<\/em> (5.18). <a href=\"http:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST3.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-large wp-image-78\" src=\"http:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST3-1024x640.jpg\" alt=\"\" width=\"584\" height=\"365\" srcset=\"https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST3-1024x640.jpg 1024w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST3-300x187.jpg 300w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST3-480x300.jpg 480w, https:\/\/blogs.dickinson.edu\/dcc\/files\/2012\/07\/LWST3.jpg 1280w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><\/a>Note that the correct lemmatization (&gt; <em>hic<\/em>) has a nonsensical definition; that the morphological analysis states it can only be a pronoun (&#8220;pron.&#8221;) whereas here, as often, it is a demonstrative adjective; and finally that the LWST incorrectly concludes that the form derives from the lemma <em>huc<\/em>.<\/p>\n<p>Another odd and thankfully rare genre of error occurs in the case of <em>deinde<\/em> (5.14), which is correctly analyzed, but put beside a fictional alternative, the present imperative of a verb *<em>deindo<\/em>.<\/p>\n<p>I would like to know if the same level of error and types of errors occur when LWST is unleashed on a prose text. Perhaps there the idea of a &#8220;prior-form frequency evaluator&#8221; would make more sense.<\/p>\n<p>It is not my intent to denigrate the huge achievements of Perseus in our field. It is certainly better to have the LWST than not to have it. My purpose here is just to investigate the nature and extent of its errors. If this sample is at all representative, something along the lines of 3.5 million errors exist in the current database. 
I would also like to ask, is it realistic to think that qualified people can be found to correct the mistakes of the LWST? What is the incentive for professional Latinists to do so?<\/p>\n<p>I also have a proposal for a different kind of tool, which I will save for another post, since this one is already too long. Your thoughts?<\/p>\n<p>&#8211;Chris Francese<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Perseus Latin Word Study Tool\u00a0(LWST) is intended to provide dictionary definitions and grammatical analysis of all words in the Latin texts available in the Perseus Digital Library, currently 10.5 million words. A check of the definitions and grammatical analysis &hellip; <a href=\"https:\/\/blogs.dickinson.edu\/dcc\/2012\/07\/09\/types-of-error-in-the-perseus-latin-word-study-tool\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":65,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[1],"tags":[25148,61727,61730,61728],"class_list":["post-73","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-grammar","tag-perseus","tag-vocabulary","tag-word-study-tool"],"_links":{"self":[{"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/posts\/73","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/users\/65"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/comments?post=73"}],"version-history":[{"count":0,"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/posts\/73\/revisions"}],"wp:attachment":[{"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/
wp\/v2\/media?parent=73"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/categories?post=73"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.dickinson.edu\/dcc\/wp-json\/wp\/v2\/tags?post=73"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}