Parallel Session 1: Research was dedicated to presentations and discussions around the state-of-the-art research tools for document analysis developed via the IMPACT project.
As you might guess from the slides below, the information packed into these presentations could fill a whole new two-day conference! For now, though, a brief summary will have to suffice, and I implore you to visit the tools section of the freshly launched IMPACT Centre of Competence website for more details.
A video of the session is available here:
Impact Tools Developed by NCSR (Basilis Gatos)
The folks at the Computational Intelligence Laboratory at the National Centre of Scientific Research (DEMOKRITOS) in Athens focus their activity on “research & development methods, techniques and prototypes in the areas of telecommunication systems, networks, and informatics”. Involved with IMPACT since 2008, they have partnered in the production of nine software tools supporting binarisation, border removal, page splitting, page curl correction, OCR results, character segmentation and word spotting.
OCR for Typewritten Documents (Stefan Pletschacher)
Stefan explained that typewritten documents from roughly the 1870s to the 1970s pose a unique challenge to OCR. He pointed out that each character is produced on the page independently of the rest, and that characters can appear with different weights, even within the same word, due to the mechanical nature of the process. Typical typewritten documents in archives are actually carbon copies with blurred type and a textured background, and administrative documents at that, rife with names, abbreviations and numbers, which render typical lexicon-based recognition approaches less useful. A system was developed in IMPACT to tackle these issues by incorporating background knowledge of typewritten documents and through improved segmentation and enhancement of glyph images, while “performing language independent character recognition using specifically trained classifiers”.
Image Enhancement, Segmentation and Experimental OCR (A. Antonacopoulos)
Representing the work of PRImA (Pattern Recognition & Image Analysis Research) at the University of Salford, Apostolos demonstrated their approach to the digitisation workflow and the tools developed for image enhancement (border removal, page curl removal, correction of arbitrary warping) as well as segmentation (recognition-based and stand-alone).
Named Entity Work in IMPACT: Frank Landsbergen
Frank began by defining named entities (NEs): a word or string referring to a location, person or organisation (or a date, time, etc.). Within IMPACT the term is limited to locations, persons and organisations. These words receive extra focus primarily because they may be of particular interest to end users, and because they are usually not in dictionaries, so they offer the greatest scope for improving the lexicon and ultimately the OCR. Note that a lexicon here is a list of related entities in the database that are linked, e.g. Amsterdams, Amsteldam, Amsteldamme = Amsterdam.
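The variant linking Frank describes can be sketched as a simple lookup table: every historical spelling resolves to one modern canonical form. The data structure and function name below are purely illustrative, not IMPACT's actual lexicon format.

```python
# Minimal sketch of a named-entity variant lexicon: historical spellings
# are linked to one modern canonical form, so a match on any variant
# resolves to the same entity. Illustrative only -- not IMPACT's schema.

VARIANTS = {
    "Amsterdams": "Amsterdam",
    "Amsteldam": "Amsterdam",
    "Amsteldamme": "Amsterdam",
}

def canonical(token: str) -> str:
    """Return the modern form of a historical variant, or the token itself."""
    return VARIANTS.get(token, token)

print(canonical("Amsteldamme"))  # -> Amsterdam
```

In practice such a table would be populated from the lexicon database, so that a search for "Amsterdam" also retrieves pages containing any of its attested historical variants.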
He then spent the majority of his talk walking us through the four-step process of building an NE lexicon:
- Data collection
- NE tagging – possibilities include NE extraction software, the Stanford University module, statistical NE recognition (the software ‘trains’ itself) or manual tagging. Many of these tools currently work best with contemporary data.
- Enrichment (POS tagging, lemmatising, adding person name structure, variants)
- Database creation
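The four steps above can be sketched as a toy pipeline. Everything here is a hypothetical illustration – in particular, the naive capitalised-word "tagger" merely stands in for real NE extraction software such as a trained statistical recogniser:

```python
# Hypothetical sketch of the four-step NE-lexicon pipeline described above:
# collect text, tag entities, enrich them, then build database records.
# All function names and logic are illustrative, not IMPACT's toolkit.

def collect(texts):
    """Step 1: data collection -- here, just tokenise on whitespace."""
    return " ".join(texts).split()

def tag_entities(tokens):
    """Step 2: NE tagging -- stand-in that keeps capitalised tokens."""
    return [t for t in tokens if t[:1].isupper()]

def enrich(entity):
    """Step 3: enrichment -- real systems add POS tags, lemmas,
    person-name structure and spelling variants."""
    return {"form": entity, "lemma": entity.lower(), "variants": []}

def build_lexicon(texts):
    """Step 4: database creation -- emit one record per tagged entity."""
    return [enrich(e) for e in tag_entities(collect(texts))]

lexicon = build_lexicon(["De stad Amsterdam groeide snel."])
```

The sketch also shows why contemporary-trained taggers struggle: a rule as crude as "capitalised word" picks up sentence-initial tokens too, and historical spelling only makes matters worse.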
So far the majority of IMPACT’s NE work has been on creating a toolkit for lexicon building (NERT, Attestation tool) and creating NE-lexica for Dutch, English and German.
Special Resources to Access 16th Century Germany: Annette Gotscharek
The 16th-century German printed book collection was a special case because the resources were so old, and therefore very challenging. The historical language has a number of special features at the word level, and the historical variants were all linked to the modern lemma.
The diachronic ground-truth corpus consisted of text files of what appeared on the scans, collected from different resources on the Web and from non-public electronic corpora. They worked on areas including the creation of a hypothetical lexicon and the manual verification of the IR lexicon.
Polish Language Resources in IMPACT: Janusz S. Bień
Janusz and his team faced a number of challenges when working with Polish text. They did not use the oldest Polish dictionary but focused on later historical dictionaries and the sources used by those dictionaries. The earlier dictionaries were rejected because, although they contained relevant information, it was too difficult to extract. They also struggled to use other later dictionaries because of copyright issues. In the end they did manage to use a selection of dictionaries and texts, including the Benedykt Chmielowski encyclopedia, which is famous for its memorable “definitions”: “Horse is as everyone can see.”
They looked at a number of tools, including lemmatisation in the INL Lexicon tool (SAM, SpXViiw). More information is available at http://bc.klf.uw.edu.pl
Slovene Language Resources in IMPACT: Tomaž Erjavec
Tomaž previously worked on the AHLib project looking at transcription correction and markup. At the same time as their work on the IMPACT project they also won a Google award so have been able to develop language models for historical Slovene.
Their methodology has been to develop three resources: transcribed texts, a hand-annotated corpus and a lexicon of historical words. They have also developed an annotation tool, ToTrTaLe, which aids in tagging and lemmatising historical Slovene. The main issues have been tokenisation (words were split differently in historical language), variability and extinct words. During the project they have transcribed over 10 million words, comprising the AHLib corpus/DL, NUK GTD, Google Books and Wikisource – all freely available.
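The tokenisation issue Tomaž mentions – words split differently in historical printing than in the modern language – can be illustrated with a small sketch that re-joins adjacent tokens when the joined form exists in a modern wordlist. The wordlist and example are invented, and real systems must also handle the opposite case (historical forms that modern spelling splits):

```python
# Illustrative sketch of one historical-tokenisation fix: re-join two
# adjacent tokens when their concatenation is a known modern word form.
# The wordlist and example tokens are invented for illustration.

MODERN_WORDS = {"nebo"}  # joined modern forms (hypothetical entry)

def rejoin(tokens):
    """Greedily merge adjacent token pairs found in MODERN_WORDS."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in MODERN_WORDS:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

A greedy pairwise merge like this is deliberately simplistic; an actual tool such as ToTrTaLe combines tokenisation with tagging and lemmatisation so that context can disambiguate genuine two-word sequences from split words.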
A video of the session is here:
Aly Conteh (BL) hosted the parallel Q&A session on digitisation tips. Panel members included Astrid Verheusen (Manager of the Digital Library Programme and Head of the Innovative Projects Department, KB National Library of the Netherlands), Geneviève Cron (OCR expert, Bibliothèque nationale de France), Christa Müller (Director Digital Services Department, Austrian National Library), Majlis Bremer-Laamanen (Director of the Centre for Preservation and Digitisation, National Library of Finland) and Alena Kavčič-Colic (Head of the Research and Development Unit, National and University Library of Slovenia).
Here are some of the questions and issues addressed:
- Q: How do digital libraries exist without METS/ALTO, and how do you support retrospective production of OCR? A: You always have to find the best solution in terms of time, cost, scope and user needs. Currently, some libraries use only ALTO because it better suits their projects' needs. Standards like ALTO, however, don't always support certain features. While new standards releases are being reviewed and will soon be published, it is paramount that libraries evaluate their data and resources and adopt the necessary measures accordingly. The problem of continuous migration due to updated standards will remain for as long as we digitise. If, however, OCR is in raw plain text, retrofitting it into METS/ALTO is encouraged, as it transforms the user's experience of working with that information. It is relatively straightforward and not highly technical, but it does, of course, need some financial support.
- Q: Many libraries digitising newspaper collections clean the head titles of their documents. Will this still happen in the future? Why insist on cleaning head titles rather than invest in digitising more pages? A: Good point! Some libraries have already abandoned the cleaning of headings in favour of a larger number of digitised pages. However, the higher the accuracy of article titles, the higher the relevance of the article to search terms. On the other hand, OCR of headings does cost more, and it limits the number of pages you can digitise. It comes down to choices. A possible solution: invest the money in good scans, let the software do the OCR and live with the automated results, rather than spending money and time on manual correction. And remember to always consult your user community.
- Q: How do you measure capture quality when you lack ground truth? A: It is impossible to ground-truth everything in a digitisation project, but what you can do is sample some of the pages rather than check every single one. OCR engines do also report a certain level of confidence in their output.
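The sampling approach suggested in that answer can be sketched as follows: transcribe a random sample of pages and estimate character accuracy from the edit distance between OCR output and the sample transcriptions. The edit-distance routine is standard Levenshtein; the function names and page data are illustrative, not a tool from the project:

```python
# Sketch of sampling-based quality estimation: rather than ground-truthing
# every page, transcribe a random sample and compute character accuracy.
# Illustrative names; pages are (ocr_text, ground_truth_text) pairs.

import random

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sampled_accuracy(pages, sample_size, seed=0):
    """Estimate character accuracy from a random sample of pages."""
    sample = random.Random(seed).sample(pages, sample_size)
    errors = sum(edit_distance(ocr, gt) for ocr, gt in sample)
    chars = sum(len(gt) for _, gt in sample)
    return 1 - errors / chars
```

The estimate's reliability depends on sample size and how representative the sampled pages are of the collection, which is why the panel also pointed to engine confidence scores as a complementary signal.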
- Q: What are library priorities for the next 10 years? A: To obtain copyright exceptions and extended licensing so that we can publish material currently protected. As regards the British Library, current and near future projects include theatre playbills, newspaper collections, maps and medieval manuscripts.
IMPACT Final Conference – The Functional Extension Parser: A Document Understanding Platform with Günter Mühlberger
Günter Mühlberger chaired the 3rd block: Tools for Improved Text Recognition.
However, as time ran out in the session, he chose to postpone the delivery of his talk on the FEP until the second day's Research Parallel Sessions.
The presentation is available below:
Aly Conteh (BL), Hildelies Balk-Pennington De Jongh (KB) and Lieke Ploeger (KB) introduced and launched the IMPACT Centre of Competence!
While still facing many challenges, IMPACT aims to provide solid foundations by producing good-quality OCR text, which will enable the project to pursue new avenues of research. Furthermore, IMPACT's strength lies in the very strong collaboration it has developed between communities, a collaboration the project strives to maintain and build on. After all, the Centre of Competence is exactly this: the bringing together of communities and experts to look at ways to optimise the tools used to extract text from historical documents. In addition, IMPACT has developed a knowledge bank offering guidelines and materials for institutions working on OCR projects all over the world. These resources can be found at the Centre of Competence website, www.digitisation.eu. Other valuable services offered by the Centre include a Ground Truth dataset, as well as a helpdesk designed for institutions to put questions to experts who can, in turn, give some insight into how to address the issues raised.
However, as with every enterprise, there exists the issue of sustainability. How do we sustain the Centre? Over the last four years, the project has received continuous financial support – but what about the future? The sustainability model, Conteh explained, works around four key points: the website, membership packages, partner contributions and a Centre office established to provide assistance. Can IMPACT partners leverage some level of resourcing to allow the Centre to develop products and services? The answer is yes! Some of the partners, in fact, are already providing valuable support to enable the Centre of Competence to administer and deliver its services in a timely manner.
One of the major commitments when setting up a business is to find an institution, or multiple institutions, willing to bear the risk and support the enterprise in its initial stages. IMPACT has found its support: representatives from the Bibliothèque nationale de France (BnF) and the Fundación Biblioteca Virtual Miguel De Cervantes took to the floor to present and discuss the reasons behind their decision to back the Centre. These include the need to sustain present in-house OCR projects, to which the Centre of Competence would provide invaluable support and advice, not to mention the opportunity to exchange expertise and data. In turn, the BnF and the Biblioteca Virtual can provide experience, a wide range of partner support, sponsorship and networks, technology transfer and dissemination, as well as the preparation and dissemination of open source software.
To sign up to the Centre of Competence, email firstname.lastname@example.org or visit the website at www.digitisation.eu.
View the presentation here:
and the video here:
IMPACT Final Conference – The EC Digital Agenda and Official Launch of the IMPACT Centre of Competence
Khalil explained that IMPACT has been one of the biggest projects in terms of size. Initially there were doubts that large-scale projects were the right way of supporting innovation, and the FP7 team were involved in a lot of discussion about the approach to take. However, IMPACT has been a good example of why large-scale integrating projects are still valuable for the ICT agenda of the EU.
Khalil talked about the Digital Agenda for Europe, which includes creating a sustainable financing model for Europeana and creating a legal framework for orphan and out-of-commerce works. Three important aspects were stressed: 1) ‘access’ is the central concept; 2) digitisation should be a moral obligation, to maintain our cultural resources; 3) common platforms are central to the strategy. The bottlenecks are the funding challenge and copyright.
Khalil referred to the Comité des Sages report. He also pointed attendees to future calls in the areas of digital preservation and cultural heritage: Call 9 (deadline 17 April 2012), with the orientation for WP 2013 currently in preparation.
Khalil then officially launched the IMPACT Centre of Competence.
View the presentation here:
View the video here (including an introduction from Hildelies Balk):
Steven Krauwer of CLARIN brought everyone back from coffee break and entertained and informed us with his reflections on the future possibilities of e-Humanities.
Steven talked us through CLARIN's resource and technical infrastructure, all based in language in whatever form it takes: text, speech, programming or multimodal. The goal is to allow users to interact with data, not just view it. Steven envisages a single portal – a one-stop shop for everyone involved in research into culture and literature. CLARIN is based in the Humanities but not limited to them. For now, pragmatically, CLARIN is based in the EU and caters especially for the EU. He stressed its academic bias and democratic ethos: languages are distinguished without regard to commercial relevance, and each language is equally dear to CLARIN.
Steven hopes to serve all scholars, whether they are, for example, looking for documents concerning the Black Plague in European archives, all newspaper coverage of Islam, German television speakers with a Dutch accent, or pronoun systems in the languages of Nepal. Whatever your field, CLARIN wants to provide the platform to research it.
Moving towards e-Science, CLARIN sees a continual decline of old traditions as inefficient ways of working give way to future methods for future generations increasingly based in the digital realm.
Steven highlighted the problems of cross-compatibility between tools, systems and standards that must be solved to make access sustainable right from the researcher's desk, without the need for costly journeys to the archives themselves.
CLARIN hopes to integrate into existing systems and institutions rather than build a whole new system for others to adopt.
The aim is that “if it exists, it exists on CLARIN”. The system should be adaptable to any individual's online environment, and it should be technically competent, building modular features that can be used for unique problems, tasks and specific workflows. CLARIN aims to be fully compatible with OCR, workflow and post-correction processes, but will be receptive to new ways of working as the source material changes to more complex forms of media.
So, down to reality: what is the state of play today? CLARIN began with a preparatory phase from 2008 to 2011. This ended in the summer, and they are now building the infrastructure. Although the official launch is two years away, services, tools and resources will start feeding into the public domain before then. After the launch, Steven stressed, CLARIN will be in a “continuous state of evolution” and hopes to be agile enough to adapt to constantly changing needs.
Steven introduced the structure envisaged, with governmental bodies called ERICs governing the cooperation between countries, and universities and institutions controlling individual input. It should be funded by governments.
ERIC membership will involve a fee, with each member creating a governing body, making material available to CLARIN and co-operating with the other countries.
ERIC groups will be launched on 1 January 2012, with many countries already on board (although there are those who have been less than proactive – Steven managed a little hint at the host nation, perhaps!).
What sort of animal will CLARIN be? Standards must be agreed on, although CLARIN will not try to impose them; rather, it hopes to agree on and support a limited number of standards with ‘mapability’ and conversion between them.
There will be no closing for holidays: CLARIN will be a 24/7 service. It will ensure access to all libraries everywhere. The IMPACT OCR centre is an existing model CLARIN would like to follow. CLARIN wants to cater for any archive with paper material, and so will be integral to IMPACT and future developments like it.
Operational levels include universities, national academies, national language institutions and research institutions. At present CLARIN deals mostly with written resources, but it would like to increase its scope to other media as well.
The old adage ‘together we are stronger’ is key to CLARIN's ethos, and collaboration is central to reaching the ambitious goals that CLARIN has set down.
e-Humanities through collaboration is not without its problems, but the benefits of this grand vision are now clear to all and with the work of CLARIN, together with IMPACT, it is not so far from a reality.
View presentation here:
And video here: