
IMPACT Final Conference – Evaluation of lexicon supported OCR and information retrieval

25 October, 2011
Jesse de Does

Jesse de Does - Evaluation of lexicon supported OCR

Jesse de Does from the INL gave a brief but rich presentation on the evaluation of lexicon-supported OCR and the project's recent improvements. To evaluate lexica in OCR, the FineReader SDK 10 is used. In short, the software measures OCR with a default included dictionary and, for each word or fuzzy set, it gives a number of alternatives and segmentations; it is then up to the user to manually select the most suitable or probable option. Lexica, however, may include errors, and the fuzzy sets created by FineReader may be too small (we will never have all spelling variants or compounds). Thus a number of actions, including word-recall measurement, dictionary cleaning and the implementation of historical dictionaries, are taken in order to increase performance, even if by small percentages.
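To make the word-recall idea concrete, here is a minimal sketch of how lexicon coverage might be measured against ground-truth text. The toy lexicon, tokens and normalisation step are illustrative assumptions, not the actual IMPACT evaluation pipeline.

```python
# Minimal sketch: word recall of a lexicon against ground-truth tokens.
# The lexicon, tokens and lower-casing are invented for illustration.

def word_recall(ground_truth_tokens, lexicon):
    """Fraction of alphabetic ground-truth tokens covered by the lexicon."""
    tokens = [t.lower() for t in ground_truth_tokens if t.isalpha()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)

lexicon = {"world", "werried", "hath", "has"}       # toy lexicon
tokens = ["In", "the", "werried", "hath", "qwzrt"]  # toy ground truth
print(f"word recall: {word_recall(tokens, lexicon):.2f}")  # -> 0.40
```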

The languages analysed and improved so far are Bulgarian (the only language analysed with a non-Latin script), Czech, English (initially, and mistakenly, thought to be a no-brainer), French (good improvements overall), German (progress mainly in 16th-century material), Polish, Slovene and Spanish. The use of historical lexica has produced overall improvements of 10% to 36%.

Jesse de Does

Finally, De Does mentioned experiments undertaken to evaluate IR. While a more complete evaluation is coming soon, retrieval performance for English and Spanish has been measured using lemmatisation of modern lexica (e.g. OED-based IR for English).


IMPACT Final Conference – Overview of language work in IMPACT

25 October, 2011
Katrien Depuydt

Katrien Depuydt gives an overview of language work in IMPACT

Katrien Depuydt provided a brief overview of the IMPACT project's work packages devoted to creating language tools and lexica to aid both information retrieval and OCR processing. How might one measure successful improvement in access to text? She cleverly posits that the key lies in asking ourselves: "Can we handle the 'world'?"

In an 18th-century Dutch periodical, 'werried' was the spelling of the day; with OCR built on a simple Dutch dictionary, you would need to begin your search with that term, and the results would necessarily be limited. What we really want, she notes, is to key in the modern term "world" and retrieve all the appropriate variants in the text.
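To make the idea concrete, here is a minimal sketch of IR-lexicon query expansion for the 'world' example. The variant lists and function names are invented for illustration and do not reflect the IMPACT lexicon format.

```python
# Minimal sketch of IR-lexicon query expansion: a modern lemma is mapped
# to attested historical spellings, so a search for "world" also
# retrieves older variants. The variant lists are invented examples.

IR_LEXICON = {
    "world": {"world", "worlde", "werried"},  # hypothetical variants
    "has":   {"has", "hath"},
}

def expand_query(term: str) -> set[str]:
    """Return all spellings to search for, falling back to the term itself."""
    return IR_LEXICON.get(term.lower(), {term.lower()})

def search(term: str, tokens: list[str]) -> list[int]:
    """Positions in the token stream matching any variant of the query."""
    variants = expand_query(term)
    return [i for i, t in enumerate(tokens) if t.lower() in variants]

text = "the werried is wide".split()
print(search("world", text))  # -> [1]
```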


This is where IMPACT's work in building lexica comes in, and we start to discover that yes, we CAN handle 'the world'. In the course of the project, an OCR lexicon, an IR lexicon and an NE (named entity) lexicon were created for nine languages, and these plug into ABBYY FineReader, enhancing both the OCR and the retrieval. No simple task: the work required analysing the different language resources available for each language, identifying existing tools, special character sets and the like. She gave the example of Bulgarian, which had no existing dictionaries or lexica, and some of whose characters were not recognised by ABBYY FineReader, creating a unique set of challenges. How these challenges were overcome will be explored in more detail in the forthcoming Parallel Session 2: Language Session later this afternoon.


IMPACT Final Conference – Keynote: OCR and the transformation of the Humanities

25 October, 2011
Gregory Crane from Tufts University

Gregory Crane (Tufts University) introduced Day 2 with a presentation on the significance of OCR for the Humanities. In particular, Crane listed three basic changes:

1. The transformation of the scale of questions in terms of breadth and depth;

2. The rise of student researchers and citizen scholars: these figures play a critical role as professionals alone can no longer tackle the large amount of data out there;

3. The globalisation of cultural heritage: dealing with global activity and cultural heritage has to be, as the word suggests, a global effort as Europe and North America’s expertise is no longer enough.

Gregory Crane gives Keynote - OCR & the transformation of the humanities

Crane then moved on to describe dynamic variorum editions as one of OCR's greatest challenges. How do we create self-organising collections? Crane stressed that even with all possible crowd-sourcing, we still need to process data as automatically as possible. The 'New' Variorum Shakespeare series (now 140 years old) is a good example of this. After all, the winning system is the most useful one, not the smartest!

A Classicist by origin, Crane then shifted his focus to the Graeco-Roman world and illustrated the problems ancient languages such as Latin and Ancient Greek pose for OCR technology. What do we do with 2000+ years of Latin? What do we do with dirty OCR? However bad, Crane explained, OCR helped Tufts detect how many of the 25,000 Latin books selected were actually in Latin; unsurprisingly, the analysis revealed that many of these were actually Greek. Crane's next statement speaks for itself: "OCR often tells us more than metadata can". Ancient languages such as Classical Greek, Crane continued, can cause numerous problems for OCR technology, as we often encounter polysemy, ambiguity and changes in terms. So how do we deal with a cultural-heritage language? The key, Crane claimed, is to have multiple open-source OCR engines in order to produce better results.
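One common way to realise Crane's multi-engine point is to vote across engine outputs. The sketch below is deliberately simplified: it assumes the outputs are already aligned token by token, sidestepping the sequence-alignment step a real ensemble would need, and the engine outputs are invented.

```python
from collections import Counter

# Simplified OCR-ensemble sketch: majority vote per token position,
# assuming the three engines' outputs are already aligned.

def vote(aligned_outputs: list[list[str]]) -> list[str]:
    """Pick the most common token at each aligned position."""
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*aligned_outputs)]

engine_a = ["Gallia", "est", "omnis", "divisa"]
engine_b = ["Ga11ia", "est", "omnis", "divisa"]
engine_c = ["Gallia", "esl", "omnis", "divisa"]
print(" ".join(vote([engine_a, engine_b, engine_c])))
# -> "Gallia est omnis divisa"
```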

Finally, Crane explained that we are not just producing OCR data, we are changing connections around the world, enabling a transformation of the humanities and the way in which the world as a whole relates to its cultural heritage.


IMPACT Final Conference – Summary of Day 1 and Intro to Day 2

25 October, 2011


Day 2 of the conference began with Hildelies Balk-Pennington de Jongh thanking the sponsors ABBYY and Planman, without whom the conference would not have been possible.

She then provided us with her own summary of the previous day’s key messages:

  • When accessing collections, digital always wins
  • Improving access is imperative!
  • Larger volumes often pose problems; indexing will often grind to a halt
  • Do not economise on the quality of your images; they will be the basis for future OCR improvement
  • In IMPACT, all individual steps in the digitisation workflow have already been made better, faster or cheaper
  • The first productive OCR test on 17th-century newspapers already found 15% more words
  • Post-correction will always be necessary
  • For correction we need the crowd
  • What will we do with the mountains of information? It may become too big to search efficiently!

These messages led succinctly into introductions of the Day 2 speakers.

IMPACT Final Conference – Post-Correction in IMPACT with Ulrich Reffle

25 October, 2011
Ulrich Reffle - University of Munich

Ulrich Reffle, together with Annette Gotscharek, Christoph Ringlstetter and Thorsten Vobl in a team led by Prof. Schulz at the University of Munich, has developed a unique user tool that could revolutionise the speed at which researchers analyse texts.

Ulrich pointed out some of the major problems facing scholars: the non-standardised spelling variants of historical texts, and specialised words, including technical terms, personal names and place names, alongside antiquated words that are not included in the lexicon. And this is on top of the regular OCR errors produced by difficult historical printing.

Ulrich and the team have developed Error Profiles for individual texts. These recognise a particular set of characteristics within an individual text and allow adaptive solutions for that set of problems: different spellings of vowel and diphthong sounds, say, or the regular swapping of particular letters. Errors matching these rules can then be corrected automatically, saving the scholar time and patience.

For uncertain words the profile tool flags up suggestions; Ulrich gave the example of the old English term "hath" and corresponding modern equivalents "has", "hat", etc.
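As a rough illustration of how an error profile can drive correction, the sketch below applies a text's learned character substitutions to an uncertain token and keeps candidates attested in a historical lexicon. The profile, lexicon and single-substitution restriction are toy assumptions, not the Munich team's (patented) method.

```python
# Toy sketch of error-profile-driven correction: apply the substitutions
# learned for a particular text to an uncertain OCR token and keep
# candidates found in a historical lexicon. All data here is invented.

ERROR_PROFILE = {"b": ["h"], "rn": ["m"], "u": ["v"]}  # OCR output -> likely truth
HISTORICAL_LEXICON = {"hath", "same", "have"}

def candidates(token: str) -> set[str]:
    """Single-substitution corrections licensed by profile and lexicon."""
    found = set()
    for wrong, rights in ERROR_PROFILE.items():
        start = token.find(wrong)
        while start != -1:
            for right in rights:
                cand = token[:start] + right + token[start + len(wrong):]
                if cand in HISTORICAL_LEXICON:
                    found.add(cand)
            start = token.find(wrong, start + 1)
    return found

print(candidates("bath"))   # -> {'hath'}
print(candidates("sarne"))  # -> {'same'}
```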

The second pillar of the team's work has been a post-correction system. For this they created from scratch an interface that gives users novel possibilities for the detection, presentation and correction of OCR mistakes. It allows the user to see on one screen the image of the original page alongside the OCR editor tool with a special functionality window. With the help of historical lexica, this window can provide suggestions for corrections alongside the word itself, via a drop-down menu much like a spell checker.

The team evaluated their tool with 14 participants and found that, when working with the post-correction tool, researchers completed tasks 2.7 times faster than without it.

The interface technology is open source and available to all; although the error profiling is protected by a US patent, it is offered as a web service free of charge. Contact Ulrich for more details on these remarkable advances in efficiency.


IMPACT Final Conference – Crowdsourcing in the Digitalkoot Project

24 October, 2011
Majlis Bremer-Laamanen

'Digi' – to digitise; 'talkoot' – people gathering to work together voluntarily (without payment)

Majlis Bremer-Laamanen (National Library of Finland) shared the library's unique experiment in crowdsourcing OCR correction through gaming: the Digitalkoot project, launched in February 2011. As Richard Boulderstone of the British Library touched on in his keynote, the National Library of Finland faces the same issue of digitising millions of pages of historical newspapers, books, journals, ephemera and sound at scale, processing and enhancing them in the most cost-effective manner. Partnering with Microtask, they launched two web-based games which turned the meaningful but arduous task of human review of OCR'd text into a collaborative, manageable and useful game.
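Since each word is typically shown to several players, the redundant answers have to be aggregated somehow. Below is a minimal sketch of one plausible scheme; the two-thirds threshold and data layout are assumed parameters, not anything Digitalkoot documented.

```python
from collections import defaultdict

# Minimal sketch of aggregating redundant game answers: each OCR word is
# judged by several players; accept or reject once enough votes agree.
# The 2/3 threshold is an assumption, not Digitalkoot's actual rule.

def aggregate(votes: list[tuple[str, bool]], threshold: float = 2 / 3):
    tally = defaultdict(lambda: [0, 0])  # word -> [yes, no]
    for word, is_correct in votes:
        tally[word][0 if is_correct else 1] += 1
    verdicts = {}
    for word, (yes, no) in tally.items():
        total = yes + no
        if yes / total >= threshold:
            verdicts[word] = "accept"
        elif no / total >= threshold:
            verdicts[word] = "reject"
        else:
            verdicts[word] = "needs review"
    return verdicts

votes = [("talkoot", True), ("talkoot", True), ("talkoot", False),
         ("wor1d", False), ("wor1d", False), ("wor1d", False)]
print(aggregate(votes))  # -> {'talkoot': 'accept', 'wor1d': 'reject'}
```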

Majlis Bremer-Laamanen

DigiTalkoot Mole Bridge


DigiTalkoot Mole Hunt



IMPACT Final Conference – IBM Adaptive OCR Engine and CONCERT Cooperative Correction

24 October, 2011

Asaf Tzadok (IBM Haifa Research Lab)

Asaf Tzadok (IBM Haifa Research Lab) showed us IBM's CONCERT tool, which facilitates collaborative OCR correction. CONCERT (Cooperative Engine for the Correction of Extracted Text) works in three steps: a character session, a word session and a page-level session. The character session presents the user with a list of characters the OCR has classified as the same letter; the user can mark characters as "suspicious". In the next step, these characters are presented in word context, where the user can again decide whether the characters were recognised correctly. In the final step, characters and words that are still marked as suspicious are shown at page level. CONCERT also has a series of games, including "feed the dolphin, he is hungry".
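Schematically, that three-pass escalation might look like the sketch below. All names, data structures and the example data are invented for illustration; this is not IBM's API.

```python
# Schematic sketch of CONCERT's three review passes: characters grouped
# under one OCR letter are checked in bulk, survivors are re-checked in
# word context, and what remains suspicious is routed to page level.
# Everything here is invented for illustration.

def character_session(groups, judge_char):
    """groups: letter -> occurrence ids; returns suspicious occurrences."""
    return {occ for letter, occs in groups.items()
            for occ in occs if not judge_char(letter, occ)}

def word_session(suspicious, word_of, judge_word):
    """Re-check suspicious characters in their word context."""
    return {occ for occ in suspicious if not judge_word(word_of[occ])}

def page_session(still_suspicious, page_of):
    """Whatever survives both passes goes to page-level review."""
    return sorted((page_of[occ], occ) for occ in still_suspicious)

groups = {"e": ["p1-37", "p1-88"], "c": ["p2-05"]}
word_of = {"p1-37": "werelt", "p1-88": "the", "p2-05": "courant"}
page_of = {"p1-37": 1, "p1-88": 1, "p2-05": 2}

suspects = character_session(groups, lambda letter, occ: occ != "p1-37")
suspects = word_session(suspects, word_of, lambda w: w in {"the"})
print(page_session(suspects, page_of))  # -> [(1, 'p1-37')]
```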


Lotte Wilms (KB) joined Asaf on stage and gave the library perspective: three libraries were involved, and user tests were carried out by all of them with full support from IBM.