
Closing the IMPACT Project blog

3 January, 2012

The IMPACT project has now officially finished and been superseded by the Centre of Competence.

This blog has now been frozen. Comments have been disabled and we do not intend to publish further posts. We have published the following statistics for future reference; they are intended to inform others about the lifecycle of the blog and to assist people wishing to reuse its resources, for example by identifying the authors of articles.

Active Dates: From 10 December 2009 to 31 December 2011
Number of posts: 117
Number of comments: 16
Akismet statistics: 1,750 spam comments caught, with an overall accuracy rate of 100%.
Details of contributors: The IMPACT project (used as a generic log-in for IMPACT staff), impacteib, mariekeguy, Nora Daly, Greta Franzini, simonaitken
Categories used: admin, Bratislava (May 2010), British Library, conference, Demo Day, Deutsch (German), English, Final Conference 2011, hackday, Munich (March 2010), Munich (October 2011), myGrid – Taverna Hackathon, Nederlands (Dutch), Rouen (March 2011), taverna, The Hague (Feb 2011)
Details of blog theme: Vigilance with 4 Widgets
Details of type and version of software used: This blog was run on the free hosted version of WordPress at www.wordpress.com
Blog licence: All items on the blog are copyright of the IMPACT project and, unless otherwise stated, have been released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

BSB/ÖNB Demo Day – Videos online!

21 November, 2011

It took a little longer than expected, but all talks from our dual event "Turning Historical Documents into Digital Full Texts" (11–12 October 2011) are now online and embedded in the relevant blog posts.

As before, everything about the first day, the "IMPACT Demo Day", can be found here on the IMPACT blog, while everything worth knowing about the second day, "Experiences from Digitisation Practice", is available on the blog of the Munich DigitiZation Center (Münchener DigitalisierungsZentrum).

Have fun watching!

Mark-Oliver Fischer (BSB)

IMPACT/myGrid Hackathon – Taverna Roadmap

14 November, 2011
Shoaib Sufi talks about the Taverna Roadmap

In the afternoon, after everyone had worked through the three group tasks in the practical session 'Workflow Development in Digitisation', we returned to hear from the Taverna manager, Shoaib Sufi.

Shoaib gave an interesting talk about where he sees Taverna going over the next few years and about the further development of Taverna 3, including some of the projects that the team hopes to work with.

IMPACT/myGrid Taverna Hackathon – Taverna Server as a Portal

14 November, 2011

Clemens Neudecker led a session on using Taverna Server as a portal, using IMPACT workflows to demonstrate the functionality.

This was followed by Rob Haines from myGrid, who gave more examples of Taverna Server interfaces.

The IMPACT Framework – From Tools to Workflows

14 November, 2011

This practical session started with the attendees introducing themselves and splitting up into 3 groups, so that each could work on a different set of tasks based on a Case Study:

Sven Schlarb at IMPACT/myGrid Hackathon

Case Study:

A collection holder wants to reduce storage costs for collections that are
currently available as TIFF master files. She/he has heard that JPEG2000 is
a good candidate for storing digital master files, and that image compression
is particularly efficient when lossy compression is used.

She/he knows that JPEG2000 compression can be "visually lossless", i.e. the
information loss is not perceptible to the eye, but she/he is still concerned
about the impact that JPEG2000 compression could have on OCR.

We suggest a Taverna workflow that creates an executable processing pipeline
for studying the question.

The workflow should take one TIFF image as input, together with a list of
increasing compression parameters (e.g. compression ratios) to be used when
encoding the image. Each encoded image should then be decompressed before
applying the OCR. Finally, the impact of the compression on the OCR should be
measured by comparing the OCR output of the original image with the OCR output
of each compressed image.
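
To make the case study concrete, here is a rough, stand-alone sketch of such a pipeline in plain Java (not the actual Taverna workflow built at the hackathon). It assumes the OpenJPEG command-line tools (opj_compress, opj_decompress) and Tesseract are installed and on the PATH; the file names, the chosen compression ratios and the use of plain Levenshtein distance as the comparison measure are illustrative only.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Illustrative sketch of the case-study pipeline: encode one TIFF master at
    // increasing JPEG2000 compression ratios, decode again, run OCR, and compare
    // each OCR result with the OCR of the uncompressed original.
    public class CompressionOcrStudy {

        public static void main(String[] args) throws Exception {
            String master = "page.tif";                       // the single TIFF input
            int[] ratios = {10, 20, 40, 80};                  // increasing compression ratios

            String referenceText = ocr(master, "reference");  // OCR of the uncompressed master

            for (int ratio : ratios) {
                run("opj_compress", "-i", master, "-o", "page_r" + ratio + ".jp2",
                        "-r", String.valueOf(ratio));
                run("opj_decompress", "-i", "page_r" + ratio + ".jp2",
                        "-o", "page_r" + ratio + ".tif");
                String text = ocr("page_r" + ratio + ".tif", "ocr_r" + ratio);
                System.out.printf("ratio 1:%d  edit distance to reference: %d%n",
                        ratio, editDistance(referenceText, text));
            }
        }

        // Run Tesseract on an image and return the recognised plain text.
        static String ocr(String image, String outBase) throws Exception {
            run("tesseract", image, outBase);                 // writes outBase.txt
            return new String(Files.readAllBytes(Paths.get(outBase + ".txt")), "UTF-8");
        }

        // Run an external command and wait for it to finish.
        static void run(String... cmd) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0) throw new IOException("command failed: " + String.join(" ", cmd));
        }

        // Plain Levenshtein distance as a crude measure of OCR degradation.
        static int editDistance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }

In the Taverna version, each of these steps becomes a service in the workflow, which is essentially how the three group tasks below divide the work.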

IMPACT myGrid Taverna Hackathon

The Three Groups:

Group 1

Use the toolwrapper for providing access to a JPEG2000 encoding/decoding tool:

Group 2

Use Taverna for creating the workflow:

Group 3

Use a Taverna Beanshell for creating the text comparison:

  • commons-lang-2.4.jar (/home/<youruser>/.taverna-home/lib/commons-lang-2.4.jar)
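
For illustration, here is a minimal Beanshell sketch of such a comparison script, assuming the commons-lang-2.4.jar dependency listed above has been added to the service and that the workflow wires two input ports (here called originalText and compressedText) into it; the port names and the derived similarity output are illustrative, not the script written on the day.

    // Beanshell text comparison (illustrative sketch).
    // Input ports assumed: originalText (OCR of the original TIFF),
    //                      compressedText (OCR of the decompressed JPEG2000 image).
    import org.apache.commons.lang.StringUtils;

    // Levenshtein distance from commons-lang 2.4: 0 means the compression
    // had no measurable effect on the OCR output.
    int d = StringUtils.getLevenshteinDistance(originalText, compressedText);

    // Output ports (names illustrative): the raw distance and a normalised similarity.
    distance = String.valueOf(d);
    similarity = String.valueOf(
            1.0 - (double) d / Math.max(1, Math.max(originalText.length(), compressedText.length())));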

Carl Wilson from the BL concentrates on Taverna

The selection of groups showed a definite preference for the more 'user'-based tasks over the 'developer' tasks, with 12 people working on Group 1, 6 on Group 2 and only 3 on Group 3. However, quite a few attendees seemed happy to be involved in more than one group, or to work in one while supporting users in another.

The general feeling is that this bodes well for tomorrow, which has a more practical, hands-on timetable.

IMPACT/myGrid Taverna Hackathon

14 November, 2011

Full details of this workshop are available through the workshop wiki at:

http://impact-mygrid-taverna-hackathon.wikispaces.com/Background+Materials

Clemens Neudecker at the IMPACT myGrid Taverna Hackathon

The day started with an introduction to IMPACT from Clemens Neudecker:

and then an introduction to Taverna from Katy Wolstencroft:

Katy Wolstencroft gives an introduction to Taverna

IMPACT Final Conference – Blog-index

26 October, 2011

The whole conference was blogged and photographed, with presentations uploaded to SlideShare and videos to Vimeo.

These are also embedded within the blogs on this site.

This post contains direct links to all posts made at the Final Conference.  Please do feel free to add comments or thoughts below the posts.

Monday 24 October 2011

BLOCK 1: OPERATIONAL CONTEXT

BLOCK 2: FRAMEWORK AND EVALUATION

BLOCK 3: TOOLS FOR IMPROVED TEXT RECOGNITION

Tuesday 25 October 2011

BLOCK 4: LANGUAGE TOOLS AND RESOURCES

BLOCK 5: IMPACT CENTRE OF COMPETENCE

PARALLEL SESSIONS

  • Research Session: Presentation and discussion of state-of-the-art research tools for document analysis and OCR, hosted by Apostolos Antonacopoulos (University of Salford).
  • Language Session: Presentation and demonstration of the IMPACT language tools & resources in further detail, hosted by Katrien Depuydt (INL).
  • Digitisation Tips Session: Meet the expert: questions & answers on digitisation issues, hosted by Aly Conteh (The British Library).

IMPACT Final Conference – Research Parallel Sessions Brief Summary

25 October, 2011

Parallel Session 1: Research was dedicated to presentations and discussions of the state-of-the-art research tools for document analysis developed within the IMPACT project.

As you might guess from the slides below, the information packed into these presentations could fill a whole new two-day conference! For now a brief summary will have to suffice, and I encourage you to visit the tools section of the freshly launched IMPACT Centre of Competence website for more details.

A video of the session is available here:

IMPACT Tools Developed by NCSR (Basilis Gatos)

The folks at the Computational Intelligence Laboratory of the National Centre for Scientific Research "Demokritos" in Athens focus their activity on "research & development methods, techniques and prototypes in the areas of telecommunication systems, networks, and informatics". Involved with IMPACT since 2008, they have partnered in the production of nine software tools supporting binarisation, border removal, page split, page curl correction, OCR results, character segmentation and word spotting.

OCR for Typewritten Documents (Stefan Pletschacher)

Stefan explained that typewritten documents, from roughly the 1870s to the 1970s, pose a unique challenge to OCR. He pointed out that each character is actually produced on the page independently of the rest, so characters can appear with different weights, even within the same word, due to the mechanical nature of the process. Typical typewritten documents in archives are actually carbon copies with blurred type and a textured background, and administrative documents at that, rife with names, abbreviations and numbers, which render typical lexicon-based recognition approaches less useful. A system was developed in IMPACT to tackle these issues by incorporating background knowledge of typewritten documents and through improved segmentation and enhancement of glyph images, while "performing language independent character recognition using specifically trained classifiers".

Image Enhancement, Segmentation and Experimental OCR (A. Antonacopoulos)

Representing the work of PRImA, the Pattern Recognition & Image Analysis Research group at the University of Salford, Apostolos demonstrated their approach to the digitisation workflow and the tools developed for image enhancement (border removal, page curl removal, correction of arbitrary warping) as well as for segmentation (recognition-based and stand-alone).

IMPACT Final Conference – Language Parallel Session

25 October, 2011

The language parallel session consisted of a series of presentations and demonstrations of the IMPACT language tools and was hosted by Katrien Depuydt (INL).

Named Entity Work in IMPACT: Frank Landsbergen

Frank began by defining named entities (NEs): a word or string referring to a proper name, i.e. a location, person or organisation (or a date, time, etc.). Within IMPACT the term is limited to locations, persons and organisations. The extra focus on these words is primarily because they may be of particular interest to end users, and because they are usually not in dictionaries, so tackling them brings more improvement to the lexicon and ultimately to the OCR. Note that, in this context, a lexicon is a list of related entities in the database that are linked, e.g. Amsterdams, Amsteldam, Amsteldamme = Amsterdam.

He then spent the majority of his talk walking us through the four-step process of building an NE lexicon:

  1. Data collection
  2. NE tagging – possibilities include NE extraction software, use of the Stanford University module, statistical NE recognition (the software 'trains' itself) or manual tagging. Many of these tools currently work best with contemporary data.
  3. Enrichment (POS tagging, lemmatising, adding the person-name structure, variants)
  4. Database creation

So far the majority of IMPACT’s NE work has been on creating a toolkit for lexicon building (NERT, Attestation tool) and creating NE-lexica for Dutch, English and German.
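
As a rough illustration of the NE-tagging step (the Stanford University module mentioned in step 2 above), the fragment below applies a pre-trained Stanford NER classifier to a sentence. This is not IMPACT's own NERT tool; the model named here is the standard 3-class (person/location/organisation) classifier shipped with Stanford NER, and, as noted above, such contemporary models may well miss historical spellings like "Amsteldam".

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    // Illustrative NE-tagging step: apply a pre-trained Stanford NER model to a
    // sentence and print the result with inline tags, e.g. <LOCATION>London</LOCATION>.
    public class TagNamedEntities {
        public static void main(String[] args) throws Exception {
            CRFClassifier<CoreLabel> ner =
                    CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");

            String sentence = "The ship left Amsteldam for London in 1672.";

            // Tagged entities can then be collected, enriched with variants and
            // loaded into the lexicon database (steps 3 and 4 above).
            System.out.println(ner.classifyWithInlineXML(sentence));
        }
    }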

Special Resources to Access 16th Century Germany: Annette Gotscharek

The 16th-century German printed book collection was a special case because the resources were so old, and therefore very challenging. The historical language has a number of special features at the word level. The historical variants were all attached to the modern lemma.

The diachronic ground-truth corpus consisted of text files containing exactly what appeared on the scans. It was collected from different resources on the Web and from non-public electronic corpora. They worked on areas including the creation of a hypothetical lexicon and the manual verification of the IR lexicon.


Polish Language Resources in IMPACT: Janusz S. Bień

Janusz and his team faced a number of challenges when working with Polish text. They did not use the oldest Polish dictionary but focused on later historical dictionaries and on the sources used by those dictionaries. The earlier dictionaries were rejected because, although they contained relevant information, it was too difficult to extract. They also struggled to use other later dictionaries because of copyright issues. In the end they did manage to use a selection of dictionaries and texts, including the Benedykt Chmielowski encyclopedia, which is famous for its memorable "definitions", such as: "Horse is as everyone can see."

They looked at a number of tools, including lemmatisation in the INL lexicon tool (SAM, SpXViiw). More information is available at http://bc.klf.uw.edu.pl

Slovene Language Resources in IMPACT: Tomaž Erjavec

Tomaž previously worked on the AHLib project, looking at transcription correction and markup. In parallel with their work on the IMPACT project they also won a Google award, which has allowed them to develop language models for historical Slovene.

Their methodology has been to develop three resources: transcribed texts, a hand-annotated corpus and a lexicon of historical words. They have also developed the annotation tool ToTrTaLe, which aids in tagging and lemmatising historical Slovene. The main issues have been tokenisation (words were split differently in the historical language), variability and extinct words. During the project they have transcribed over 10 million words; these comprise the AHLib corpus/DL, NUK GTD, Google Books and Wikisource texts, all of which are freely available.

A video of the session is here:

IMPACT Final Conference – Digitisation Tips Parallel Session

25 October, 2011

Aly Conteh (BL) hosted the parallel Q&A session on digitisation tips. Panel members included Astrid Verheusen (Manager of the Digital Library Programme and Head of the Innovative Projects Department, KB National Library of the Netherlands), Geneviève Cron (OCR expert, Bibliothèque nationale de France), Christa Müller (Director, Digital Services Department, Austrian National Library), Majlis Bremer-Laamanen (Director of the Centre for Preservation and Digitisation, National Library of Finland) and Alena Kavčič-Colic (Head of the Research and Development Unit, National and University Library of Slovenia).

Here are some of the questions and issues addressed:

  • Q: How do digital libraries exist without METS/ALTO, and how do you support retrospective production of OCR? A: You always have to try to find the best solution in terms of time, cost, scope and user needs. Currently, some libraries use only ALTO because it better suits their projects' needs. Standards like ALTO, however, don't always support certain features. While new standards releases are being reviewed and will soon be published, it is paramount that libraries evaluate their data and resources and adopt the necessary measures accordingly. The problem of continuous migration due to updated standards will remain for as long as we digitise. If, however, OCR exists only as raw, plain text, retrofitting it into METS/ALTO is encouraged, as it transforms the users' experience of working with that information. It is relatively straightforward and not highly technical, but it does, of course, need some financial support.
  • Q: Many libraries digitising newspaper collections manually clean the headlines of their documents. Will this still happen in the future? Why insist on cleaning headlines rather than invest in digitising more pages? A: Good point! Some libraries have already stopped cleaning headings in favour of a larger number of digitised pages. However, the higher the accuracy of article titles, the higher the relevance of the article to search terms. On the other hand, cleaning the OCR of headings does cost more and limits the number of pages you can digitise. It comes down to choices. A possible solution: invest the money in good scans, let the software do the OCR and live with the automated results; do not spend money and time on manual correction. And remember to always consult your user community.
  • Q: How do you measure capture quality when you lack ground truth? A: It is impossible to ground-truth everything in a digitisation project, but you could sample some of the pages rather than check every single one. OCR engines do also report a confidence level, which can serve as a rough indicator.
  • Q: What are library priorities for the next 10 years? A: To obtain copyright exceptions and extended licensing so that we can publish material that is currently protected. As regards the British Library, current and near-future projects include theatre playbills, newspaper collections, maps and medieval manuscripts.