IMPACT Final Conference – Digitisation challenges & IMPACT Achievements so far
No one could say the key objectives of Hildelies Balk-Pennington de Jongh and the IMPACT project are not ambitious! As the shared vision in Europe is that all cultural heritage should become available in digital form in this decade, IMPACT has been working hard for the last four years to overcome the various challenges involved in digitising historical material.
With 26 partners across Europe and headed by the National Library of the Netherlands, the IMPACT project was focused on significantly improving mass digitisation of historical printed text by:
– Innovating OCR software and language technology
– Sharing expertise and building capacity across Europe
– Providing facilities for future research and development
The benefits of their work so far are evident in the production of an improved ABBYY FineReader, several tools ready for testing in a production environment (exceeding the original project expectations), some tools for future development and a Centre of Competence ready to be launched. Language resources include historical lexica for no fewer than nine languages: Dutch, English, German, Czech, Bulgarian, Polish, French, Slovene and Spanish.
Hildelies then gave an overview of the results, showing how they make text digitisation better, faster, cheaper. Some of the key enhancements of the state of the art included:
– Page-splitting accuracy for images has gone up from 73% to 94%
– Segmentation accuracy in finding lines improved from 19% to 98%
– Recognition of old fonts is 25% better in FineReader 10 (FR10) than in FineReader 9 (FR9)
– Post-correction with the Error profiler is 2.7 times faster than without
Hildelies ended with a nice example showing the benefits for end users (researchers in the humanities and the wider public). These end users are only interested in finding the words they search for. Preliminary results of OCR combined with a dictionary on difficult 17th-century Dutch material already indicate a 15% increase in words found. This means that for 1 million words, 150,000 more words will be found!
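For anyone who wants to try the arithmetic on their own collection, here is a minimal sketch. It assumes, as the figure above implies, that the 15% gain is measured against the total number of words in the text; the function name `extra_words_found` is purely illustrative and not part of any IMPACT tool.

```python
def extra_words_found(total_words: int, gain_percent: float) -> int:
    """Extra words a search would find, assuming the gain is a
    percentage of the total word count (an assumption, see above)."""
    return round(total_words * gain_percent / 100)


# The example from the talk: 1 million words, 15% more words found.
print(extra_words_found(1_000_000, 15))  # 150000
```

If the 15% were instead relative to the words already found before the improvement, the absolute gain would be smaller, so the result depends on which baseline is used.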
The ambitious objectives set out at the beginning of Hildelies's presentation have turned into benefits that are now a reality for researchers in the humanities, academics, the public, and amateur classicists like me.