A gentle introduction to lexicon building and application
Katrien Depuydt of the Institute for Dutch Lexicology (INL) begins with the distinction between a lexicon and an electronic dictionary. Dictionaries are primarily for human use and are organised so that each entry is an item in itself. Lexica, on the other hand, are primarily for computation, and are used for linguistic annotation, enhanced retrieval (like tracking inflected forms of words), and for syntactic parsing and machine retrieval. Where a dictionary gives you the history of words, lexica give you the history of a language.
Various types of lexica: an OCR lexicon is a verified list of words based on a corpus of language material from the same period and text type as the documents you want to OCR. An Information Retrieval (IR) lexicon groups words together across different time periods and contexts, so as to cover as many spelling and context variants as possible. Katrien warns that no lexicon can ever be complete.
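To make the idea of an IR lexicon concrete, here is a minimal sketch of how spelling variants from different periods might be grouped under one modern lemma and then used to expand a search term. The word pairs and the function names (`build_ir_lexicon`, `expand_query`) are illustrative assumptions, not data or tools presented in the talk.

```python
from collections import defaultdict

# Hypothetical toy data: (attested variant, modern lemma) pairs.
# A real IR lexicon would be derived from a verified corpus.
VARIANT_PAIRS = [
    ("mensch", "mens"),
    ("menschen", "mens"),
    ("mens", "mens"),
    ("mensen", "mens"),
    ("huys", "huis"),
    ("huis", "huis"),
    ("huizen", "huis"),
]


def build_ir_lexicon(pairs):
    """Build two lookups: lemma -> all variants, and variant -> lemma."""
    lemma_to_variants = defaultdict(set)
    variant_to_lemma = {}
    for variant, lemma in pairs:
        lemma_to_variants[lemma].add(variant)
        variant_to_lemma[variant] = lemma
    return lemma_to_variants, variant_to_lemma


def expand_query(term, lemma_to_variants, variant_to_lemma):
    """Return every known variant sharing a lemma with the search term.

    Terms missing from the lexicon are returned unchanged, since no
    lexicon can ever be complete.
    """
    lemma = variant_to_lemma.get(term)
    if lemma is None:
        return {term}
    return set(lemma_to_variants[lemma])


lemma_to_variants, variant_to_lemma = build_ir_lexicon(VARIANT_PAIRS)
print(sorted(expand_query("huys", lemma_to_variants, variant_to_lemma)))
```

With a lookup like this, a search for one historical spelling can retrieve documents containing any of the other forms grouped under the same lemma.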
Lexicon use in IMPACT: lexicons for partner languages; tools for lexicon building and enrichment; best practice for lexicon building. Katrien explains the difference between “witnessed” (or verified) lexica and “hypothetical” lexica (those based on language rules) and how combining the two approaches leads to an intelligent engine for language recognition. She mentions the use of Named Entity dictionaries in IMPACT (OCR struggles with proper nouns) and their use in avoiding misrecognitions within the general language lexicon.
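The combination of witnessed and hypothetical lexica can be sketched roughly as follows. This is only an illustration under stated assumptions: the rewrite rules, the word lists and the function names are hypothetical, and the sketch does not represent the actual IMPACT tooling. Attested forms are trusted first, and rule-generated forms serve as a fallback for words the corpus happens not to contain.

```python
# Hypothetical witnessed lexicon: forms actually attested in a corpus.
WITNESSED = {"huys", "mensch", "boeck"}

# Hypothetical rewrite rules (modern pattern -> historical pattern).
# Real rules would come from linguistic analysis of the period.
RULES = [("ui", "uy"), ("k", "ck"), ("aa", "ae")]


def hypothetical_forms(modern_word, rules=RULES):
    """Generate candidate historical spellings by applying each rule once."""
    forms = set()
    for modern, historical in rules:
        if modern in modern_word:
            forms.add(modern_word.replace(modern, historical))
    return forms


def build_hypothetical(modern_words, rules=RULES):
    """Collect the rule-generated forms for a list of modern words."""
    lexicon = set()
    for word in modern_words:
        lexicon |= hypothetical_forms(word, rules)
    return lexicon


def classify(token, witnessed, hypothetical):
    """Prefer attested evidence; fall back on rule-generated forms."""
    if token in witnessed:
        return "witnessed"
    if token in hypothetical:
        return "hypothetical"
    return "unknown"


hypothetical = build_hypothetical(["huis", "boek", "jaar", "tuin"])
for token in ["huys", "tuyn", "xqz"]:
    print(token, classify(token, WITNESSED, hypothetical))
```

Checking the witnessed set first reflects the idea that verified forms are more reliable than generated ones; a recognition engine could then weight candidates according to which lexicon they came from.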
Katrien concludes with a discussion of where this work improves on the state of the art, and the plans to assess both the coverage and the effectiveness of lexica on OCR results and information material in several languages. She outlines the early assessment of the Dutch lexicon: very promising results, with +97% accuracy achieved on difficult material. There’s still a lot of work left for the next year(s), though.
Niall Anderson, BL + Mark-Oliver Fischer, BSB