
IMPACT Final Conference – ABBYY & OCR Improvements for IMPACT

24 October, 2011
Michael Fuchs - ABBYY


Michael Fuchs, Senior Product Marketing Manager at ABBYY Europe, kicked off the session with a brief presentation of ABBYY products. Fuchs claimed ABBYY has achieved a 40% improvement in recognition accuracy since 2007, thanks to software improvements and the increasingly good quality of input images. He walked the audience through the range of improvements, including image binarisation, document layout analysis and text/character recognition, and highlighted the need for well-produced binary images, derived from the greyscale or colour originals, as OCR input, depending on the material being processed. Greyscale and colour images, while easy on the eye, can in fact significantly hinder the OCR process. Amongst other improvements, ABBYY has also extended its character and text recognition system to include Eastern European languages, such as Old Slavonic.
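Fuchs did not go into implementation detail, but a common global binarisation approach is Otsu's method, which picks the threshold that maximises the between-class variance of the resulting black and white pixel groups. A minimal pure-Python sketch (illustrative only, not ABBYY's algorithm):

```python
def otsu_threshold(pixels):
    """Find the 0-255 threshold maximising between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = 0.0       # running sum of intensities in the background class
    weight_b = 0      # running count of background pixels
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_b += hist[t]
        if weight_b == 0:
            continue
        weight_f = total - weight_b
        if weight_f == 0:
            break
        sum_b += t * hist[t]
        mean_b = sum_b / weight_b
        mean_f = (sum_all - sum_b) / weight_f
        between = weight_b * weight_f * (mean_b - mean_f) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

def binarise(pixels, threshold):
    """Map greyscale pixels to pure black (0) or white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

In practice, historical material often calls for adaptive (local) thresholding rather than a single global threshold, since illumination and paper degradation vary across the page.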

Finally, ABBYY’s reduced pricing for historic-font recognition and its flexible rates have contributed to IMPACT’s efforts to make digitisation cheaper and OCR technology more accessible to both individuals and collaborations across Europe and the world.

View the presentation here:

or the video here:


More information on ABBYY Historic OCR can be found here: http://www.frakturschrift.com

You can try out the Gothic/Fraktur OCR portal here: finereader.abbyyonline.com

The public Beta test for the ABBYY online OCR SDK is expected for early 2012.



IMPACT Final Conference – Evaluation Tools, Ground Truth and Datasets: Stefan Pletschacher

24 October, 2011

Stefan Pletschacher (University of Salford) began by presenting an overview of the digitisation workflow and the issues at each stage. These stages are usually given as scanning (whatever goes wrong here is hard to recover later), image enhancement, layout analysis, OCR and post-processing. He explained that you should evaluate at each step but should also consider the workflow as a whole. To carry out performance evaluation you need to begin with images that are representative of the material you will be processing; you then run OCR on them and compare the output against ground truth.
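As a toy illustration of what such an evaluation computes, word accuracy can be sketched as the fraction of ground-truth words the OCR output reproduces. Real evaluation tools align the two texts properly (e.g. by edit distance) before comparing; this naive positional version only shows the idea:

```python
def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Naive word accuracy: fraction of ground-truth words matched by position.

    Illustrative only: real evaluation tools align the texts first,
    so insertions and deletions do not shift every following word.
    """
    gt_words = ground_truth.split()
    ocr_words = ocr_output.split()
    correct = sum(1 for g, o in zip(gt_words, ocr_words) if g == o)
    return correct / len(gt_words) if gt_words else 1.0
```

For example, `word_accuracy("the quick brown fox", "the quiek brown fox")` gives 0.75: one misrecognised word out of four.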

There then followed an explanation of the concept of ground truth: it is not just the final text but also includes other aspects, such as the images it maps to. Stefan explained that to do a good job with ground truth you really need to use several tools, e.g. you cannot use ALTO to look at certain aspects of character formation. The IMPACT ground truths have been produced using Aletheia, now a fairly mature tool, which allows creation of information on page borders, print space, layout regions, text lines, words, glyphs, Unicode text, reading order, layers etc. Ground truth is more than just text: it can take in elements like deskewing, dewarping, border removal and binarisation. He suggested that institutions consider scenarios so they can decide which aspects of OCR and which workflow are important to them.
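The layered nature of such ground truth can be pictured as a nested structure: a page holds regions, regions hold text lines, lines hold words, and words hold glyphs, each with its own outline and Unicode content. The sketch below is a hypothetical illustration of that hierarchy, not the actual PAGE XML schema that Aletheia produces:

```python
from dataclasses import dataclass, field

# Hypothetical model of layered ground truth (illustration only,
# not the PAGE XML schema used by Aletheia).

@dataclass
class Glyph:
    text: str          # single Unicode character
    polygon: list      # outline coordinates, e.g. [(x, y), ...]

@dataclass
class Word:
    glyphs: list = field(default_factory=list)

    @property
    def text(self):
        return "".join(g.text for g in self.glyphs)

@dataclass
class TextLine:
    words: list = field(default_factory=list)

    @property
    def text(self):
        return " ".join(w.text for w in self.words)

@dataclass
class Region:
    region_type: str   # e.g. "paragraph", "heading", "image"
    lines: list = field(default_factory=list)

@dataclass
class Page:
    border: list       # page border polygon
    reading_order: list  # region identifiers in reading order
    regions: list = field(default_factory=list)
```

The point of the layering is that text can always be regenerated from the glyph level up, while layout-only evaluations (segmentation, border removal) can ignore the text entirely.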

Stefan also gave an introduction to the IMPACT Image repository where all the images and metadata have been collected and shared. The repository has allowed central management of metadata, images and ground truth and is searchable so you can filter on aspects of images.

Stefan finished his talk with an overview of the datasets available: approximately 667,120 images, comprising institutional datasets from 10 libraries (602,313 images) and demonstrator sets (56,141 images).

IMPACT Final Conference – The IMPACT Interoperability Framework

24 October, 2011

Clemens Neudecker presenting the IMPACT Interoperability Framework

Clemens Neudecker presented the IMPACT Interoperability Framework which brings together project tools and data. He introduced workflows in the project and the various building blocks used.

Clemens pointed out that there are still places available at a free IMPACT toolhack day, 14th–15th November 2011 at the University of Manchester.

myGrid

The myGrid team produces and uses a suite of tools designed to “help e-Scientists get on with science and get on with scientists”. The tools support the creation of e-laboratories and have been used in domains as diverse as systems biology, social science, music, astronomy, multimedia and chemistry. myExperiment makes it easy to find, use and share scientific workflows and other Research Objects, and to build communities.

Taverna

Taverna is an open source, domain-independent Workflow Management System: a suite of tools used to design and execute scientific workflows and aid in silico experimentation. Taverna was created by the myGrid team and funded through OMII-UK, and the project has guaranteed funding until 2014. The Taverna suite is written in Java and includes the Taverna Engine (used for enacting workflows), which powers both the Taverna Workbench (the desktop client application) and the Taverna Server (which allows remote execution of workflows). Taverna is also available as a command line tool for quick execution of workflows from a terminal.
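The core idea of such a workflow engine, wiring independent processing steps into a pipeline where each service consumes the previous one's output, can be sketched in a few lines (illustrative Python, not Taverna's actual Java API):

```python
def make_workflow(*steps):
    """Compose processing steps into a single runnable workflow.

    Each step receives the output of the previous step, analogous to
    how a workflow engine passes data between connected services.
    """
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run

# Toy steps standing in for real services (e.g. image enhancement,
# OCR, post-correction) in a digitisation workflow.
normalise_whitespace = lambda text: " ".join(text.split())
lowercase = str.lower

pipeline = make_workflow(normalise_whitespace, lowercase)
```

Running `pipeline("  Groot   PLACAET  ")` returns `"groot placaet"`. What a real engine adds on top of this composition is provenance tracking, error handling and remote execution of each step.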

View presentation here:

and video here:

IMPACT Final Conference – IMPACT Evaluation Tools, Ground Truth and Datasets

24 October, 2011

Stefan Pletschacher joined us today from the Pattern Recognition & Image Analysis Research (PRImA) group at the University of Salford and revealed an intriguing set of freely available performance evaluation tools for image enhancement, segmentation and OCR outputs created as part of IMPACT. We were also introduced to the IMPACT dataset, a searchable repository of over 700,000 images from 10 participating project libraries, arranged into various working and demonstrative sets which showcase specific OCR challenges, such as typewritten material or image groups dedicated to dewarping.

For more information on these tools including Aletheia, an Advanced Document Layout and Text Ground-Truthing System for Production Environments, visit: http://tools.primaresearch.org/tools/primaweb/tool.php

View the presentation here:

and the video here:

IMPACT Final Conference – Experiences in Mass Digitisation with the California Digital Library

24 October, 2011

Paul Fogel on Experiences in mass digitisation

Paul Fogel, Technical Lead of the Mass Digitisation team at the California Digital Library (CDL), presented the digitisation experiences and challenges CDL has faced when dealing with OCR text extraction. Fogel emphasised the difficulties posed by bad OCR during the mass indexing and digitisation of cultural records. Marginalia, image/text misinterpretations and unusual fonts, as well as limited resources, the wide range of languages (400, though OCR dictionaries exist for only 20 of them) and disciplines, and the project’s large indexing scale all make ranking results, and making use of them, extremely difficult. Fogel finally echoed Antonacopoulos and stressed the need for high-quality images to ensure the best indexing and query results.


View the presentation here:

and the video here:

IMPACT Final Conference – Applied IMPACT: Claus Gravenhorst, CCS

24 October, 2011

Claus Gravenhorst from CCS presented a case study from the Koninklijke Bibliotheek (KB) and the Content Conversion Specialists GmbH (CCS) considering whether the new FineReader Engine and Dutch Lexicon increase OCR accuracy and production efficiency.

Claus Gravenhorst from CCS

CCS were interested in working with IMPACT as they were aware of the project’s coverage of 9 languages and felt they could benefit from technological improvements in the area of OCR.

Claus reiterated that various pre- and post-processing steps, as well as image quality, can have an effect on accuracy. He explained that the test material chosen was 17th-century Dutch newspapers from the DDD database. A typical page would have two colours and gothic fonts.

The test system used was docWorks, which was developed during the EU FP5 project METAe (in which ABBYY was involved). The system has previously been used for small, mid and large scale projects. The workflow covers item tracking from the shelf, through scanning and back to the shelf, including QA etc. This system was used to integrate the IMPACT tools. There was very little pre-processing, as the focus was the OCR: zones were classified and then passed to the OCR engine, and at the end an analysis was carried out to understand the structure of the page.

The IMPACT tools used were ABBYY FineReader Engine 10 and external dictionaries, applied to the DDD material. The goal was to generate statistical data for character and word accuracy across all 4 test runs. An improvement was shown between FineReader 9 and FineReader 10, and the biggest improvement came from using the dictionaries: a 20.6% word accuracy improvement when using the IMPACT tools. In layperson’s terms this means that where you previously had to correct 100 words, with IMPACT you would only have to correct about 80. Claus showed some screenshots of the docWorks text correction mode.
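The arithmetic behind that layperson’s figure, using the round numbers above rather than data from the talk, is simply a relative reduction in the number of errors:

```python
def corrections_needed(baseline_corrections: float, relative_improvement: float) -> float:
    """Corrections still needed after a relative reduction in OCR errors.

    Illustrative arithmetic only: treats the quoted improvement as a
    relative reduction in the number of words needing correction.
    """
    return baseline_corrections * (1 - relative_improvement)
```

With a 20.6% improvement, `corrections_needed(100, 0.206)` gives 79.4, i.e. roughly 80 of the original 100 words still need correcting.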

To conclude Claus explained that ABBYY OCR and historical dictionaries enable higher text accuracy and lower the correction effort.

View the presentation here:

and video here:

IMPACT Final Conference – Case study: Scanning Parameters

24 October, 2011

Apostolos Antonacopoulos’ (University of Salford) session presented and analysed the effects of scanning parameters on OCR quality, as well as the issues of storage and maintenance costs for content holders. Different experiments were carried out to establish the effects of scanning on OCR quality, including colour vs greyscale vs bitonal, the effect of resolution, and a comparison with images from the National Library of New Zealand (NLNZ).


The images selected for the project were taken from the British Library newspaper collection and varied in quality. To ensure optimal results, only text regions were selected, thus ignoring additional artefacts (e.g. warping). The IMPACT tool Aletheia was used to extract and key the text, and the ABBYY FineReader 9 Engine was used for the OCR process.

Overall, word accuracy improvements were more apparent when using colour, bitonal, and 4- and 8-bit scans, while dithered scans produced the lowest results, with 1.64% word accuracy.

In conclusion, Mr Antonacopoulos stressed the importance of investing in high quality images as they leave room for improvement and can be reused without the need to re-scan. However, different decisions should be taken for different document types.

View presentation here:

and the video here: