IMPACT Final Conference – IBM Adaptive OCR Engine and CONCERT Cooperative Correction

24 October, 2011

Asaf Tzadok (IBM Haifa Research Lab) showed us IBM’s CONCERT tool, which facilitates collaborative OCR correction. CONCERT (Cooperative Engine for the Correction of Extracted Text) works in three steps: character session, word session and page-level session. The character session presents the user with a list of characters the OCR has classified as the same letter. The user can then mark characters as “suspicious”. In the next step, these characters are presented in word context, where the user can again decide if the characters were recognised correctly. In the final step, characters and words that are still marked as suspicious are shown at page level. CONCERT also has a series of games, including “feed the dolphin, he is hungry”.

Lotte Wilms (KB) joined Asaf on stage and gave the library perspective: three libraries were involved, and user tests were carried out by all the libraries with full support from IBM.

IMPACT Final Conference – ABBYY & OCR Improvements for IMPACT

24 October, 2011
Michael Fuchs - ABBYY

Michael Fuchs, Senior Product Marketing Manager at ABBYY Europe, kicked off the session with a brief presentation of ABBYY products. Fuchs claims ABBYY has made a 40% improvement since 2007 thanks to software improvements and the increasingly good quality of the images used. Fuchs walked the audience through the range of improvements, including image binarisation, document layout analysis and text/character recognition. He highlighted the need for well-produced binary images, derived from greyscale or colour, as OCR input, depending on the material being processed. Greyscale and colour images, while easy on the eye, can in fact significantly hinder the OCR process. Amongst other improvements, ABBYY has also extended its character and text recognition system to include Eastern European languages, such as Old Slavonic.

Finally, ABBYY’s reduced cost for historic fonts and its flexible rates have contributed to IMPACT’s efforts to make digitisation cheaper and OCR technology more accessible to both individuals and collaborations across Europe and the world.


More information on ABBYY Historic OCR can be found here: http://www.frakturschrift.com

You can try out the Gothic/Fraktur OCR portal here: finereader.abbyyonline.com

The public Beta test for the ABBYY online OCR SDK is expected for early 2012.

IMPACT Final Conference – Evaluation Tools, Ground Truth and Datasets: Stefan Pletschacher

24 October, 2011

Stefan Pletschacher (University of Salford) began by presenting an overview of the digitisation workflow and the issues at each stage. These stages are usually given as scanning (whatever goes wrong here is hard to recover later), image enhancement, layout analysis, OCR and post-processing. He explained that you should evaluate at each step but should also consider the workflow as a whole. To carry out performance evaluation you need to begin with some images that are representative of the images you will be processing. You then run OCR on them.

There then followed an explanation of the concept of ground truth: it is not just the final text but also includes other aspects, such as images to map to. Stefan explained that to do a good job on ground truth you really need to use several tools, e.g. you can’t use ALTO to look at certain aspects of character formation. The IMPACT ground truths have been produced using Aletheia, now a fairly mature tool, which allows creation of information on page borders, print space, layout regions, text lines, words, glyphs, Unicode text, reading order, layers etc. Ground truth is more than just text. It can take in elements like deskewing, dewarping, border removal and binarisation. He suggested that institutions consider scenarios so they can decide which aspects of OCR and which workflow are important to them.
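For the text layer of ground truth, a comparison against OCR output is often summarised as character and word accuracy. The sketch below is only an illustration under stated assumptions: it uses an edit-distance definition (one minus the number of edits divided by the ground-truth length), which is a common convention but not necessarily the measure used by the IMPACT evaluation tools, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def accuracy(ground_truth, ocr_output):
    """Assumed definition: 1 - (edits / ground-truth length), floored at 0."""
    if len(ground_truth) == 0:
        return 1.0 if len(ocr_output) == 0 else 0.0
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


def character_accuracy(gt_text, ocr_text):
    # Compare character by character.
    return accuracy(gt_text, ocr_text)


def word_accuracy(gt_text, ocr_text):
    # Compare whitespace-separated tokens.
    return accuracy(gt_text.split(), ocr_text.split())


if __name__ == "__main__":
    gt = "the quick brown fox"
    ocr = "the qnick brown fox"  # one misrecognised character
    print(character_accuracy(gt, ocr))
    print(word_accuracy(gt, ocr))
```

A single wrong character costs far more at word level than at character level, which is one reason the two figures are usually reported separately.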

Stefan also gave an introduction to the IMPACT Image repository where all the images and metadata have been collected and shared. The repository has allowed central management of metadata, images and ground truth and is searchable so you can filter on aspects of images.

Stefan finished his talk with an overview of the datasets available: approximately 667,120 images, comprising institutional datasets from 10 libraries (602,313 images) and demonstrator sets (56,141 images).

IMPACT Final Conference – The IMPACT Interoperability Framework

24 October, 2011

Clemens Neudecker presenting the IMPACT Interoperability Framework

Clemens Neudecker presented the IMPACT Interoperability Framework which brings together project tools and data. He introduced workflows in the project and the various building blocks used.

Clemens pointed out that there are still places available at a free IMPACT toolhack day, 14th – 15th November 2011 at the University of Manchester.

myGrid

The myGrid team produces and uses a suite of tools designed to “help e-Scientists get on with science and get on with scientists”. The tools support the creation of e-laboratories and have been used in domains as diverse as systems biology, social science, music, astronomy, multimedia and chemistry. myExperiment makes it easy to find, use and share scientific workflows and other Research Objects, and to build communities.

Taverna

Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation. Taverna has been created by the myGrid team and funded through the OMII-UK. The project has guaranteed funding till 2014. The Taverna suite is written in Java and includes the Taverna Engine (used for enacting workflows) that powers both the Taverna Workbench (the desktop client application) and the Taverna Server (which allows remote execution of workflows). Taverna is also available as a Command Line Tool for a quick execution of workflows from a terminal.

IMPACT Final Conference – IMPACT Evaluation tools, ground truth and datasets

24 October, 2011

Stefan Pletschacher joined us today from the Pattern Recognition & Image Analysis Research (PRImA) group at the University of Salford and revealed an intriguing set of freely available performance evaluation tools for image enhancement, segmentation and OCR outputs created as part of IMPACT. We were also introduced to the IMPACT dataset, a searchable repository of over 700,000 images from 10 participating project libraries, arranged into various working and demonstrative sets which showcase specific OCR challenges, such as typewritten material or image groups dedicated to dewarping.

For more information on these tools including Aletheia, an Advanced Document Layout and Text Ground-Truthing System for Production Environments, visit: http://tools.primaresearch.org/tools/primaweb/tool.php

IMPACT Final Conference – Experiences in Mass Digitisation with the California Digital Library

24 October, 2011

Paul Fogel on Experiences in mass digitisation

Paul Fogel, Technical Lead of the Mass Digitisation team at the California Digital Library (CDL), presented the digitisation experiences and challenges CDL has faced when extracting document text with OCR. Fogel emphasised the obstacles that bad OCR poses during the mass indexing and digitisation of cultural records. Marginalia, image-text misinterpretations and fonts, together with limited resources, the wide range of languages (400, with OCR dictionaries for only 20 of them) and disciplines, and the project’s large indexing scale, make ranking results and using them extremely difficult. Finally, Fogel echoed Antonacopoulos and stressed the need for high quality images to ensure the best indexing and query results.


IMPACT Final Conference – Applied IMPACT: Claus Gravenhorst, CCS

24 October, 2011

Claus Gravenhorst from CCS presented a case study from the Koninklijke Bibliotheek (KB) and Content Conversion Specialists GmbH (CCS) examining whether the new FineReader Engine and Dutch lexicon increase OCR accuracy and production efficiency.

CCS were interested in working with IMPACT as they were aware of the use of 9 languages and felt they could benefit from technological improvements in the area of OCR.

Claus reiterated that various pre- and post-processing steps, as well as image quality, can have an effect on accuracy. He explained that the test material they had chosen was 17th-century Dutch newspapers, part of a DDD database. A typical page would have two colours and gothic fonts and

The test system used was docWorks, which was developed during the EU FP5 project METAe (in which ABBYY was involved). The system has previously been used for small, mid and large scale projects. The workflow covered item tracking from the shelf, through scanning, and back to the shelf, including QA etc. This system was used to integrate IMPACT tools. There was very little pre-processing as the focus was on the OCR. Zones were classified and then passed to the OCR engine. At the end, analysis was carried out to understand the structure of the page.

The IMPACT tools used were ABBYY FineReader Engine 10 and external dictionaries, applied to the DDD material. The goal was to generate statistical data for character and word accuracy across all 4 test runs. An improvement was shown between FineReader 9 and FineReader 10, and the biggest improvement was shown when using the dictionaries. There was a 20.6% word accuracy improvement when using IMPACT tools. In layperson’s terms this means that if you previously had to correct 100 words, with IMPACT you would only have to correct about 80. Claus showed some screenshots of the docWorks text correction mode.
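The “100 words down to 80” illustration can be checked with a line of arithmetic. The sketch below assumes the 20.6% figure is read as a relative reduction in the number of words needing correction, which is how the layperson’s explanation treats it; the write-up does not say how the figure was actually computed, and the function name is hypothetical.

```python
def remaining_corrections(words_to_correct_before, relative_reduction):
    """Words still needing correction after a relative reduction (assumed reading)."""
    return words_to_correct_before * (1.0 - relative_reduction)


if __name__ == "__main__":
    # 100 words to correct before, 20.6% fewer afterwards.
    remaining = remaining_corrections(100, 0.206)
    print(round(remaining, 1))  # 79.4, i.e. roughly 80 words
```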

To conclude Claus explained that ABBYY OCR and historical dictionaries enable higher text accuracy and lower the correction effort.

IMPACT Final Conference – Case study: Scanning Parameters

24 October, 2011

Apostolos Antonacopoulos’ (University of Salford) session presented and analysed the effects of scanning parameters on OCR quality, as well as the issues regarding storage and maintenance costs for content holders. Different experiments were carried out in order to establish scanning effects on OCR quality, including colour vs greyscale vs bitonal, the effect of resolution, and a comparison with images from the National Library of New Zealand (NLNZ).


The images selected for the project were taken from the British Library newspaper collection and varied in quality. To ensure optimal results, only text regions were selected, thus ignoring additional artefacts (e.g. warping). The IMPACT tool Aletheia was used to extract and key the text to be represented, and ABBYY FineReader 9 Engine software was used for the OCR process.

Overall, word accuracy improvements were more apparent when using colour, bitonal, and 4- and 8-bit scanners, while dithered scanners produced the lowest results, with 1.64% word accuracy.

In conclusion, Mr Antonacopoulos stressed the importance of investing in high quality images as they leave room for improvement and can be reused without the need to re-scan. However, different decisions should be taken for different document types.

IMPACT Final Conference – Digitisation challenges & IMPACT Achievements so far

24 October, 2011


No one could say the key objectives of Hildelies Balk-Pennington de Jongh and the IMPACT project are not ambitious! As the shared vision in Europe is that all cultural heritage should become available in digital form in this decade, IMPACT has been working hard for the last four years to overcome the various challenges involved when digitising historical material.

With 26 partners across Europe and headed by the National Library of the Netherlands, the IMPACT project was focused on significantly improving mass digitisation of historical printed text by:
– Innovating OCR software and language technology
– Sharing expertise and building capacity across Europe
– Providing facilities for future research and development

The benefits of their work so far are evident in the production of an improved ABBYY FineReader, several tools ready for testing in a productive environment (exceeding the original project expectations), some tools for future development and a Centre of Competence ready to be launched. Language resources include historical lexica for no fewer than nine languages: Dutch, English, German, Czech, Bulgarian, Polish, French, Slovene and Spanish.

Hildelies then gave an overview of the results, showing how they make text digitisation better, faster, cheaper. Some of the key enhancements of the state of the art included:
– Page splitting of images has gone up from 73% accuracy to 94%
– Segmentation improved from 19% to 98% in finding accurate lines
– Recognition of old fonts gives 25% better recognition in FR10 compared to FR9
– Post correction with the Error profiler is 2.7 times faster than without

Hildelies ended with a nice example showing the benefits for end users (researchers in the humanities and the greater public). The end user is only interested in finding the words he searches for. Preliminary results of OCR combined with a dictionary on difficult 17th century Dutch material already indicate a 15% increase in words found. This means that for 1 million words, 150,000 more words will be found!
From the ambitious objectives set out at the beginning of Hildelies’ talk, the benefits for researchers in the humanities, academics, the public, and amateur classicists like me are now shown to be a reality.

IMPACT Final Conference: 1st Keynote: The Strategic Digital Overview

24 October, 2011

Richard Boulderstone from the British Library gives the opening keynote

Richard Boulderstone, Director of eStrategy and Programs at the British Library, kicked off the IMPACT Conference this morning with a suitably impactful statement of scope: the British Library, he estimates, has nearly 5 billion physical pages in a 150 million object collection. Add this statistic to the Conference of European National Libraries 2006 Survey which estimates the national libraries of Europe are holding over 13 billion pages to be digitised and growing and one gets a sense of the magnitude of digitisation potential. The British Library has digitised about 1% of its collections thus far, and with a clear mission to advance the world’s knowledge, and a vision to be a leading hub in the global information network, the core of the strategy is necessarily digital. Boulderstone views the IMPACT project as a key to helping us add greater value to these growing digitised collections and provide users with a deeper cultural understanding of the nation’s holdings.

What makes IMPACT a uniquely “fantastic” project, he emphasises, is how it addresses a common set of issues across Europe, and seeks to resolve them through wide collaboration and the piloting of systems which will benefit libraries and the citizens of Europe for many years to come. OCR works very well for modern collections, and users enjoy and expect high accuracy rates, but there is some way to go for older material. Improving access to this text is an imperative and the strong collaborative basis of the IMPACT project is exactly the sort of engagement the British Library feels will guarantee this. The project has already made significant progress, as we’ll see throughout this two-day workshop.
