Skip to content

IMPACT Final Conference – Evaluation of lexicon supported OCR and information retrieval

25 October, 2011
Jesse de Does

Jesse de Does - Evaluation of lexicon supported OCR

Jesse De Does from the INL gave a brief but rich presentation on the evaluation of lexicon supported OCR and the project’s recent improvements. To evaluate lexica in OCR, the FineReader SDK 10 is used. In short, the software measures OCR with a default included dictionary, and, for each word or fuzzy set, it gives a number of alternatives and segmentations. It is then up to the user to manually select the most suitable or probable option. Lexicon, however, may include errors and the fuzzy sets created by FineReader may be too small (we will never have all spelling variations or compounds). Thus, a number of actions, including word recall, dictionary cleaning and implementation of historical dictionaries, are taken in order to increase performance, even if by small percentages.

The languages analysed and improved so far are Bulgarian (the only non-Latin based language analysed), Czech, English (initially and mistakenly thought to be a no-brainer), French (good improvements overall), German (progress mainly accounted for 16th century), Polish, Slovene and Spanish. The use of historical lexica has produced an overall progress of 10% to 36%.

Jesse de Does

Jesse de Does

Finally, De Does mentioned experiments undertaken for the evaluation of IR. While a more complete evaluation is coming soon, performance in experiments with the English and Spanish languages has been measured using lemmatisation of modern lexica (e.g. OED IR for English).

View the presentation here:

and the video here:

No comments yet

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: