Evaluation Framework and Taverna – with Clemens Neudecker
IMPACT Interoperability Framework (IIF)
After seeing so many tools introduced during the day, Clemens now considers how you can pull them all together into a usable service for the mass digitisation and OCR of historic text.
We know that there are many challenges, arising not only from the OCR process but also from the language of the text itself. The day has shown that there are also multiple solutions to these problems. IMPACT has now developed 22 different tools, written by a diverse selection of developers for a range of tasks and using a range of different technologies:
- OCR Tools
- Image Processing & Lexica
- Command line tools
- Java, Ruby, PHP, Perl etc.
This modular approach has been very useful, as it allows users to ‘pick and choose’ the tools that work best for them in each individual case. However, it meant that a framework had to be developed to control how the tools are used within a workflow. This has become the IMPACT Interoperability Framework.
The architecture of the ‘Interoperability Framework’ is built around ‘Java’, ‘Apache’ and Taverna workflow technology. The tools are offered as a web service, available from a wide set of nodes hosted by the major IMPACT partners across Europe.
Framework integration is handled by an easy-to-use, open source, generic command line wrapper.
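To illustrate what a generic command line wrapper does, here is a minimal Python sketch. It is not the IMPACT wrapper itself: the function name `run_tool`, the `{input}` and `{output}` placeholders and the stand-in `DEMO_TOOL` are all assumptions made for this example. The idea is that each tool is described by a command template, and the wrapper fills in the file paths, runs the tool and reports the result in the same shape whatever technology the tool is written in.

```python
import subprocess
import sys

# Hypothetical stand-in for a real tool: a tiny Python one-liner that
# copies the input file to the output file in upper case.
DEMO_TOOL = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
    "{input}",
    "{output}",
]


def run_tool(command_template, input_path, output_path, timeout=60):
    """Fill in the path placeholders, run the tool and return a uniform result."""
    args = [
        part.replace("{input}", str(input_path)).replace("{output}", str(output_path))
        for part in command_template
    ]
    # Passing a list of arguments (no shell) avoids quoting problems with file names.
    completed = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "output_path": str(output_path),
    }
```

Calling `run_tool(DEMO_TOOL, "in.txt", "out.txt")` would run the stand-in tool on `in.txt`. A real deployment would swap `DEMO_TOOL` for the command line of an actual OCR or image processing tool.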
An OCR workflow is in effect a data pipeline, assembled from a selection of these building blocks, which can be used together in a mash-up.
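The pipeline idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step names `binarise`, `recognise` and `post_correct`, the dictionary passed between them and the file name are hypothetical stand-ins, not IMPACT tools or Taverna constructs.

```python
def make_pipeline(*steps):
    """Chain processing steps so each one receives the previous step's output."""
    def pipeline(document):
        for step in steps:
            document = step(document)
        return document
    return pipeline


# Hypothetical stand-ins for real tools; each takes and returns a dict.
def binarise(document):
    return {**document, "binarised": True}


def recognise(document):
    # Pretend OCR: a real step would call an OCR engine here.
    return {**document, "text": "Hiſtoric text"}


def post_correct(document):
    # Pretend lexicon-based correction: normalise the long s.
    return {**document, "text": document["text"].replace("ſ", "s")}


ocr_workflow = make_pipeline(binarise, recognise, post_correct)
result = ocr_workflow({"image": "page_0001.tif"})
print(result["text"])  # prints "Historic text"
```

Because every step has the same shape, steps can be reordered, dropped or replaced without touching the others, which is the property that makes the ‘pick and choose’ approach workable.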
Workflows are managed through myExperiment, a Web 2.0 style registry, and can be run using either a local Taverna Workbench client or a web client on the project website, using SOAP.
This approach also supports the creation of a social community around this research, allowing results and ideas to be discussed by the users within the workflow management software.
The main dataset for the IMPACT project is hosted by the PRIMA research group at the University of Salford.
This now holds more than 500,000 digital images from the IMPACT partner libraries, with more than 50,000 ground truth representations. The dataset receives up to 10,000 direct access calls per month, uses up to 4TB of space and is growing fast. It is fully described by its metadata, which now allows very specific types of images to be searched for and downloaded, including a range of ‘problem’ type material.
Evaluation of the dataset is ongoing, using text-based comparison of OCR results with the ground truth by the Levenshtein distance method, and layout-based comparison of results with the original ground truth.
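For readers unfamiliar with it, the Levenshtein distance between two strings is the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. The following is a minimal textbook sketch in Python, not the code used by the IMPACT evaluation tools; the function name and the sample strings are made up for illustration.

```python
def levenshtein(reference, hypothesis):
    """Return the minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


ground_truth = "Historic text"
ocr_output = "Hiftoric tcxt"
print(levenshtein(ground_truth, ocr_output))  # prints 2
```

The smaller the distance between the OCR output and the ground truth, the closer the recognised text is to the correct transcription.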
A range of tools has been made for ground-truthing:
- FineReader PAGE exporter
- GT Validator
- GT normalizer
Clemens talked about evaluating OCR accuracy and stressed the importance of using ‘word accuracy’ for normal evaluation rather than ‘character accuracy’.
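The difference between the two measures can be shown with a small Python sketch. It assumes one common definition, accuracy = 1 minus the edit distance divided by the length of the ground truth, counted in characters or in words; the talk summary does not say which exact formula was used, and the function names and sample sentence are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance over any two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis):
    """Assumed definition: 1 - distance / reference length, floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))


def character_accuracy(ground_truth, ocr_text):
    return accuracy(ground_truth, ocr_text)


def word_accuracy(ground_truth, ocr_text):
    # Compare whitespace-separated words instead of single characters.
    return accuracy(ground_truth.split(), ocr_text.split())


ground_truth = "the quick brown fox"
ocr_text = "tbe quick brovvn fox"
print(round(character_accuracy(ground_truth, ocr_text), 3))  # prints 0.842
print(word_accuracy(ground_truth, ocr_text))                 # prints 0.5
```

In this made-up sample only three character edits are needed, so character accuracy looks high, yet half of the words are wrong. A figure like that may matter more in practice, for example when the text is meant to be searched by word.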
Clemens’s presentation is available in full here: