
IMPACT Final Conference – The Functional Extension Parser: A Document Understanding Platform with Günter Mühlberger

25 October, 2011
Günter Mühlberger

Günter Mühlberger chaired the 3rd block: Tools for Improved Text Recognition.

However, as time ran out in the session, he chose to postpone his talk on the FEP until the Research Parallel Sessions on day two.

The presentation is available below:

IMPACT Final Conference – Introduction to the IMPACT Centre of Competence

25 October, 2011

Hildelies Balk-Pennington De Jongh and Aly Conteh introduce the IMPACT Centre of Competence

Aly Conteh (BL), Hildelies Balk-Pennington De Jongh (KB) and Lieke Ploeger (KB) introduced and launched the IMPACT Centre of Competence!

While still facing many challenges, IMPACT aims to provide solid foundations: good-quality OCR text will enable the project to pursue new avenues of research. Furthermore, IMPACT's strength lies in the very strong collaboration it has developed between communities, a collaboration the project strives to maintain and build on. After all, the Centre of Competence is exactly this: the bringing together of communities and experts to look at ways to optimise the tools used to extract text from historical documents. In addition, IMPACT has developed a knowledge bank offering guidelines and materials for institutions working on OCR projects all over the world. These resources can be found on the Centre of Competence website, www.digitisation.eu. Other valuable services offered by the Centre include a Ground Truth dataset, as well as a helpdesk designed specifically for institutions to put questions to experts, who can in turn give some insight into how to address the issues raised.

However, as with every enterprise, there is the issue of sustainability. How do we sustain the Centre? Over the last four years the project has received continuous financial support, but what about the future? The sustainability model, Conteh explained, is built around four key points: the website, membership packages, partner contributions and a Centre office established to provide assistance. Can IMPACT partners leverage some level of resourcing to allow the Centre to develop products and services? The answer is yes! Some of the partners, in fact, are already providing valuable support to enable the Centre of Competence to administer and deliver its services in a timely manner.

T-shirt from the IMPACT Centre of Competence

One of the major commitments when setting up a business is to find an institution, or multiple institutions, willing to bear the risk and support the enterprise in its initial stages. IMPACT has found its support: representatives from the Bibliothèque nationale de France (BNF) and the Fundación Biblioteca Virtual Miguel de Cervantes took to the floor to present and discuss the reasons behind their decision to back the Centre. These include the need to sustain present in-house OCR projects, to which the Centre of Competence would provide invaluable support and advice, not to mention the opportunity to exchange expertise and data. In turn, the BNF and the Biblioteca Virtual can provide experience, a wide range of partner support, sponsorship and networks, technology transfer and dissemination, as well as the preparation and dissemination of open-source software.

To sign up to the Centre of Competence, email impact@kb.nl or visit the website at www.digitisation.eu.

View the presentation here:

and the video here:

IMPACT Final Conference – The EC Digital Agenda and Official Launch of the IMPACT Centre of Competence

25 October, 2011

Khalil Rouhana, Director for digital content and cognitive systems in DG Information Society and Media at the European Commission, gave an overview of the EC Digital Agenda.

Khalil explained that the IMPACT project has been one of the biggest projects in terms of size. Initially there were doubts that large-scale projects were the right way of supporting innovation, and the FP7 team were involved in much discussion on the approach to take. However, IMPACT has been a good example of why large-scale integrating projects are still valuable for the ICT agenda of the EU.

Khalil talked about the Digital Agenda for Europe, which includes creating a sustainable financing model for Europeana and creating a legal framework for orphan and out-of-commerce works. He stressed three important aspects: 1) 'access' is the central concept; 2) digitisation is a moral obligation, to maintain our cultural resources; 3) common platforms are central to the strategy. The bottlenecks are the funding challenge and copyright.

Khalil referred to the Comité des Sages report. He also pointed attendees to future calls in the areas of digital preservation and cultural heritage: Call 9 (deadline 17 April 2012), while the orientation for Work Programme 2013 is currently in preparation.

Khalil then officially launched the IMPACT Centre of Competence.

View the presentation here:

View the video here (including an introduction from Hildelies Balk):

IMPACT Final Conference – CLARIN and IMPACT: Crossing Paths

25 October, 2011
Steven Krauwer

Steven Krauwer of CLARIN brought everyone back from the coffee break and entertained and informed us with his reflections on the future possibilities of e-Humanities.

Steven talked us through CLARIN's resource and technical infrastructure, all based on language in whatever form it takes: text, speech, programming or multimodal resources. The goal is to allow users to interact with data, not just view it. Steven envisages a single portal, a one-stop shop for everyone involved in research on culture and literature. CLARIN is based in the Humanities but not limited to them. For now, CLARIN is pragmatically based in the EU and caters especially for it. He stressed its academic bias and democratic ethos, drawing distinctions between languages without considering any commercial relevance. Each language is equally dear to CLARIN.

Steven hopes to serve all scholars, whether they are looking for documents concerning the Black Plague in European archives, all newspaper coverage of Islam, German television speakers with a Dutch accent, or pronoun systems in the languages of Nepal. Whatever your field, CLARIN wants to provide the platform to research it.

Moving towards e-Science, CLARIN sees a continual decline of old traditions, as inefficient ways of working give way to methods for future generations increasingly based in the digital realm.

Steven highlighted the problems of cross-compatibility between tools, systems and standards that must be solved in order to make access sustainable, right from the researcher's desk and without costly journeys to the archives themselves.

CLARIN hopes to integrate into existing systems and institutions rather than build a whole new system for others to adopt.

The aim is that "if it exists, it exists on CLARIN". The system should be adaptable to any individual's online environment, and it should be technically competent, building modular features that can be used for unique problems, tasks and specific workflows. CLARIN aims to be fully compatible with OCR, workflow and post-correction processes, but will be receptive to new ways of working as the source material changes to more complex forms of media.

So, down to reality: what is the state of play today? CLARIN began with a preparatory phase from 2008 to 2011. This ended in the summer, and they are now building the infrastructure. Although the official launch of the services is two years away, tools and resources will start feeding into the public domain before then. After launch, Steven stressed, CLARIN will be in a "continuous state of evolution" and hopes to be agile enough to adapt to constantly changing needs.

Steven introduced the structure envisaged, with governmental bodies called ERICs governing the cooperation between countries, and universities and institutions controlling individual input. It should be funded by governments.

ERIC membership will involve a fee, with each member creating a governing body, making material available to CLARIN and cooperating with other countries.

ERIC groups will be launched on 1 January 2012, with many countries already online (although there are those who have been less than proactive; Steven managed a little hint at the host nation, perhaps!).

What sort of animal will CLARIN be? Standards must be agreed on, although CLARIN will not try to impose them; it hopes instead to agree on and support a limited number of standards, with mappability and conversion between them.

There will be no closing for holidays: CLARIN will be a 24/7 service. It will ensure access to all libraries everywhere. The IMPACT OCR centre is an existing model CLARIN would like to follow. CLARIN wants to cater for any archive with paper material, and so will be integral to IMPACT and future developments like it.

The operational levels include universities, national academies, national language institutions and research institutions. At present CLARIN deals mostly with written resources, but it would like to increase its scope to other media as well.

The old adage "together we are stronger" is key to CLARIN's ethos, and collaboration is central to reaching the ambitious agenda that CLARIN has set down.

e-Humanities through collaboration is not without its problems, but the benefits of this grand vision are now clear to all, and with the work of CLARIN, together with IMPACT, it is not so far from a reality.

View presentation here:

And video here:

IMPACT Final Conference – Evaluation of lexicon supported OCR and information retrieval

25 October, 2011
Jesse de Does - Evaluation of lexicon supported OCR

Jesse De Does from the INL gave a brief but rich presentation on the evaluation of lexicon-supported OCR and the project's recent improvements. To evaluate lexica in OCR, the FineReader SDK 10 is used. In short, the software runs OCR with a default built-in dictionary and, for each word or fuzzy set, returns a number of alternatives and segmentations; it is then up to the user to select the most suitable or probable option. Lexica, however, may include errors, and the fuzzy sets created by FineReader may be too small (we will never have all spelling variants or compounds). Thus a number of actions, including measuring word recall, cleaning the dictionaries and incorporating historical dictionaries, are taken in order to increase performance, even if by small percentages.
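
To make the selection step concrete, here is a minimal sketch of the idea behind lexicon-supported OCR; this is not IMPACT's or ABBYY's actual code, and the lexicon entries and confidence scores are invented for illustration:

```python
# Toy historical lexicon; a real one holds a very large set of attested forms.
HISTORICAL_LEXICON = {"werelt", "waerelt", "wereld"}

def pick_best(alternatives):
    """Choose among OCR alternatives, given as (word, confidence) pairs.

    Prefer forms attested in the historical lexicon; break ties by the
    engine's own confidence score.
    """
    return max(
        alternatives,
        key=lambda alt: (alt[0].lower() in HISTORICAL_LEXICON, alt[1]),
    )[0]

# The engine's top guess is a non-word; the lexicon rescues the right form.
print(pick_best([("wcrelt", 0.71), ("werelt", 0.65)]))  # -> "werelt"
```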

The languages analysed and improved so far are Bulgarian (the only non-Latin-script language analysed), Czech, English (initially, and mistakenly, thought to be a no-brainer), French (good improvements overall), German (progress mainly in 16th-century material), Polish, Slovene and Spanish. The use of historical lexica has produced overall improvements of 10% to 36%.

Jesse de Does

Finally, De Does mentioned experiments undertaken for the evaluation of information retrieval (IR). While a more complete evaluation is coming soon, performance in experiments on English and Spanish has been measured using lemmatisation against modern lexica (e.g. the OED for English IR).

View the presentation here:

and the video here:

IMPACT Final Conference – Overview of language work in IMPACT

25 October, 2011
Katrien Depuydt gives an overview of language work in IMPACT

Katrien Depuydt provided a brief overview of the IMPACT project's work packages devoted to creating language tools and lexica to aid both information retrieval and OCR processing. How might one measure a successful improvement in access to text? She cleverly posits that the key is in asking ourselves: "Can we handle the 'world'?"

In an 18th-century Dutch periodical, 'werried' was the spelling of the day; using OCR built with a simple Dutch dictionary, you would need to begin your search with that term, and the results would necessarily be limited. What we really want, she notes, is to key in the modern term 'world' and retrieve all the appropriate variants in the text.
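
As a hedged illustration of the idea (the variant table below is invented, not IMPACT's real IR lexicon), the core mechanism can be sketched as a lookup from a modern lemma to its attested historical variants, used to expand the user's query:

```python
# Toy IR lexicon: modern lemma -> historical spelling variants.
IR_LEXICON = {
    "world": ["world", "worlde", "werried"],
}

def expand_query(term):
    """Return the set of historical forms to search for a modern term."""
    return set(IR_LEXICON.get(term.lower(), [term]))

# Search every attested variant, not just the modern spelling.
print(expand_query("world"))
```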

Katrien Depuydt

This is where IMPACT's work in building lexica comes in, and we start to discover that yes, we CAN handle 'the world'. In the course of the project an OCR lexicon, an IR lexicon and an NE (named entity) lexicon were created for nine languages, and these plug into ABBYY FineReader, enhancing both the OCR and the retrieval. No simple task: the work required analysing the different language resources available for each language, identifying the tools already available, special character sets and the like. She gave the example of Bulgarian, which had no existing dictionaries or lexica, and some of whose characters were not recognised by ABBYY FineReader, creating a unique set of challenges. How these challenges were overcome will be explored in more detail in the forthcoming Parallel Session 2: Language Session later this afternoon.

View the presentation here:

and the video here:

IMPACT Final Conference – Keynote: OCR and the transformation of the Humanities

25 October, 2011
Gregory Crane from Tufts University

Gregory Crane (Tufts University) introduced day 2 with a presentation on the significance of OCR in the Humanities. In particular, Crane listed 3 basic changes:

1. The transformation of the scale of questions in terms of breadth and depth;

2. The rise of student researchers and citizen scholars: these figures play a critical role as professionals alone can no longer tackle the large amount of data out there;

3. The globalisation of cultural heritage: dealing with global activity and cultural heritage has to be, as the word suggests, a global effort as Europe and North America’s expertise is no longer enough.

Gregory Crane gives Keynote - OCR & the transformation of the humanities

Crane then moved on to describe dynamic variorum editions as one of OCR's greatest challenges. How do we create self-organising collections? Crane stressed that even with all possible crowd-sourcing, we still need to process data as automatically as possible. The 'New' Variorum Shakespeare series (140 years old) is a good example of this. After all, the winning system is the most useful one, not the smartest!

A Classicist by origin, Crane then shifted his focus to the Graeco-Roman world and illustrated the problems ancient languages such as Latin and Ancient Greek pose for OCR technology. What do we do with 2000+ years of Latin? What do you do with dirty OCR? However bad, Crane explained, OCR helped Tufts detect how many of the 25,000 books selected as Latin were actually in Latin. Unsurprisingly, OCR analysis revealed that many of them were actually Greek. Crane's next statement was self-explanatory: "OCR often tells us more than metadata can". Ancient languages such as Classical Greek, Crane continued, can cause numerous problems for OCR technology, as we often encounter polysemy, ambiguity and changes in terms. So how do we deal with a cultural heritage language? The key, Crane claimed, is to have multiple open-source OCR engines in order to produce better results.
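
As a toy illustration of that point (my own sketch, not Crane's pipeline; the threshold and sample strings are invented), a book catalogued as Latin can be flagged as Greek simply by counting characters from the Greek Unicode blocks in its OCR output:

```python
def looks_greek(text, threshold=0.5):
    """Flag text whose letters are mostly from the Greek Unicode blocks."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    greek = sum(
        1 for c in letters
        if "\u0370" <= c <= "\u03ff" or "\u1f00" <= c <= "\u1fff"
    )
    return greek / len(letters) >= threshold

print(looks_greek("arma virumque cano"))  # False: Latin text
print(looks_greek("μῆνιν ἄειδε θεὰ"))     # True: Greek, whatever the metadata says
```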

Finally, Crane explained that we are not just producing OCR data, we are changing connections around the world, enabling a transformation of the humanities and the way in which the world as a whole relates to its cultural heritage.

View the presentation here:

and the video here:

IMPACT Final Conference – Summary of Day 1 and Intro to Day 2

25 October, 2011


Day 2 of the conference began with Hildelies Balk-Pennington de Jongh thanking the sponsors ABBYY and Planman, without whom the conference would not have been possible.

She then provided us with her own summary of the previous day’s key messages:

  • When it comes to accessing collections, digital always wins
  • Improving access is imperative!
  • Larger volumes often pose problems; indexing will often grind to a halt
  • Do not economise on the quality of your images: they will be the basis for future OCR improvement
  • In IMPACT, every individual step in the digitisation workflow has already been made better, faster or cheaper
  • In the first productive OCR test on 17th-century newspapers, 15% more words were already found
  • Post-correction will always be necessary
  • For correction we need the crowd
  • What will we do with the mountains of information? It may become too big to search efficiently!

The messages led succinctly into the introductions of the day-two speakers.

IMPACT Final Conference – Post-Correction on IMPACT with Ulrich Reffle

25 October, 2011
Ulrich Reffle - University of Munich

Working in a team led by Prof. Schulz at the University of Munich, Ulrich Reffle, along with Annette Gotscharek, Christoph Ringlstetter and Thorsten Vobl, has developed a unique user tool that has potentially revolutionised the speed at which researchers can analyse texts.

Ulrich pointed out some of the major problems facing scholars, starting with the non-standardisation of spelling variants in historical texts. He also cited the problems with specialised words, including technical terms and the names of people and places, alongside antiquated words that are not included in the lexicon. And this is on top of regular OCR errors in recognising hazardous printing methods.

Ulrich and the team have developed Error Profiles for individual texts. These recognise a particular set of characteristics within an individual text and can create adaptive solutions for that set of problems. These could be different spellings of vowel and diphthong sounds, or the regular swapping of particular letters. These rules can then be applied to correct errors automatically, saving the scholar's time and patience.

For uncertain words the profile tool will flag up suggestions; Ulrich gave the example of the older English term 'hath' and its corresponding modern equivalents 'has', 'hat', etc.
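
To give a flavour of the idea, here is a minimal sketch under my own assumptions, not the Munich team's implementation (the rewrite rules and lexicon are invented): an error profile can be thought of as a set of text-specific rewrite rules whose output is checked against a lexicon before being offered as a suggestion:

```python
MODERN_LEXICON = {"has", "hat", "was"}

# Toy error profile: rewrite patterns observed in this particular text.
PROFILE_RULES = [("th", "s"), ("th", "t")]

def suggestions(word):
    """Apply the profile's rules, keeping candidates attested in the lexicon."""
    candidates = set()
    for old, new in PROFILE_RULES:
        if old in word:
            candidates.add(word.replace(old, new))
    return sorted(c for c in candidates if c in MODERN_LEXICON)

print(suggestions("hath"))  # -> ['has', 'hat'], offered as a drop-down
```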

The second pillar of the work done by Ulrich and his team is a post-correction system. For this they created from scratch an interface that gives users novel possibilities for the detection, presentation and correction of OCR mistakes. It allows the user to see, on one screen, the image of the original page alongside the OCR editor tool with a special functionality window. With the help of historical lexica, this functionality window can provide suggestions for corrections alongside the word itself, via a drop-down menu much like a spell checker's.

The team evaluated their tool with 14 participants and found that, when working with the Post-Correction Tool, researchers completed tasks 2.7 times faster than without it.

The interface technology is open source and available to all; although the error profiles are protected by a US patent, a web service is available free of charge. Contact Ulrich for more details on these remarkable advances in efficiency.

View the presentation here:

and the video here:

IMPACT Final Conference – Crowdsourcing in the Digitalkoot Project

24 October, 2011
Majlis Bremer-Laamanen

Digi: to digitise. Talkoot: a gathering of people to work together voluntarily (without payment).

Majlis Bremer-Laamanen (National Library of Finland) shared their unique experiment in crowdsourcing OCR correction through gaming: the Digitalkoot project, launched in February 2011. As Richard Boulderstone (British Library) touched on in his keynote, the National Library of Finland faces the same issues of digitising millions of pages of historical newspapers, books, journals, ephemera and sound at scale, and of processing and enhancing them in the most cost-effective manner. Partnering with Microtask, they launched two web-based games which turned the meaningful but arduous task of human review of OCR'd text into a collaborative, manageable and useful game.
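
The aggregation behind such games can be imagined as simple majority voting; this is a speculative sketch rather than Microtask's actual mechanics, with the vote threshold and sample answers invented:

```python
from collections import Counter

def aggregate(verdicts, min_votes=3):
    """Combine several players' answers for one word image by majority vote.

    Returns the winning answer, or None if there are too few votes or no
    clear majority, in which case the word goes back into the game.
    """
    if len(verdicts) < min_votes:
        return None
    answer, count = Counter(verdicts).most_common(1)[0]
    return answer if count > len(verdicts) / 2 else None

print(aggregate(["wereld", "wereld", "werold"]))  # -> "wereld"
```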

Majlis Bremer-Laamanen

DigiTalkoot Mole Bridge


DigiTalkoot Mole Hunt


View the presentation here:

and the video here: