
Optical Character Recognition – introduction and overview

7 May, 2010

Michael Fuchs of Abbyy starts by explaining his company's role within IMPACT: Abbyy gives the other partners access to its FineReader SDK, and uses the experiments and digital text material within IMPACT to hone and test its own products and technology.  He identifies Fraktur/Gothic script as a particular difficulty for state-of-the-art OCR engines.

Michael illustrates the difficulties facing recognition technology with reference to Captcha sites: the text strings in Captcha windows look remarkably like some historical documents, exhibiting such characteristics as warp, curl, mixed fonts and Gothic script.  The irony is that Captchas are designed to be machine-unreadable, and yet these are exactly the sort of characters that Abbyy and IMPACT would like OCR engines to recognise better.

He goes on to explain how some of these characteristics come to be exhibited in historic text documents: bad scanning, preprocessing of an image, generic binarisation, colour artefacts, etc.  He then explains how Abbyy technology attempts to get around these problems: through adaptive (image-sensitive) binarisation, structural analysis of a document image, and character classification of different types, including the ability to “train” the OCR engine on particular languages and fonts.
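To make the difference between generic and adaptive binarisation concrete, here is a minimal sketch in plain Python. It is not Abbyy's method, which the talk did not detail: it assumes a simple local-mean threshold, and the function names, the `radius` and `offset` parameters and the tiny sample page are all invented for illustration.

```python
def global_threshold(image, threshold=128):
    """Generic binarisation: one fixed cut-off for the whole page (1 = ink)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def adaptive_threshold(image, radius=1, offset=30):
    """Image-sensitive binarisation (assumed local-mean variant).

    A pixel counts as ink when it is darker than the mean of its
    neighbourhood by more than `offset` grey levels.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the grey values in the window around (x, y),
            # clipped at the page borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


# Hypothetical 3x6 greyscale page: the left half is brightly lit
# (background 220, faded ink 140), the right half is stained
# (background 110, ink 30).
page = [
    [220, 220, 220, 110, 110, 110],
    [220, 140, 220, 110, 30, 110],
    [220, 220, 220, 110, 110, 110],
]

fixed = global_threshold(page)
adaptive = adaptive_threshold(page)
```

On this sample the fixed threshold misses the faded ink on the left and turns the whole stained right half into ink, while the local-mean version marks only the two ink pixels. Real pages need larger windows and tuned offsets, so treat the numbers here as placeholders.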

To conclude, Michael identifies five key areas in which OCR needs to improve:

  • better context-sensitive binarisation and preprocessing;
  • more comprehensive document analysis, perhaps focussing on optimised character patterns;
  • “adaptive” OCR and the creation of special dictionaries;
  • better validation and correction systems, including the mass verification of OCR results;
  • better document synthesis and export, relying on a standard like XML as a format in which mistakes and misrecognitions can be analysed.
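As a rough illustration of how a special dictionary could feed a validation and correction step, the sketch below checks OCR tokens against a small word list using Python's standard `difflib` module. It is an assumption about how such a step might look, not something presented in the talk: the lexicon entries, the `suggest_correction` helper and the `cutoff` value are all hypothetical.

```python
import difflib

# Hypothetical special dictionary of historical German spellings.
HISTORICAL_LEXICON = ["und", "nicht", "seyn", "thun", "daß"]


def suggest_correction(token, lexicon, cutoff=0.7):
    """Return the token itself if it is in the lexicon, the closest
    lexicon entry if one is similar enough, or None to flag the
    token for manual verification."""
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


ocr_tokens = ["nicbt", "seyn", "xyzzy"]
checked = [suggest_correction(t, HISTORICAL_LEXICON) for t in ocr_tokens]
```

Here the misread "nicbt" is mapped to "nicht", the archaic but valid "seyn" is left alone because the dictionary knows it, and "xyzzy" comes back as `None` so that it can be queued for human checking. A modern dictionary would wrongly "correct" spellings like "seyn", which is one reason period-specific word lists matter.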

Niall Anderson, The British Library + Mark-Oliver Fischer, Bavarian State Library
