
The challenges of historical materials and an overview on the technical solutions in IMPACT

7 May, 2010

Sven Schlarb of the Österreichische Nationalbibliothek (Austrian National Library) now talks about the challenges that historical materials pose for OCR and the solutions IMPACT has devised to deal with them. Having outlined the individual tools and the partners responsible for their development, he describes in detail the ideal IMPACT workflow in which they can all be used.

One important novelty of the IMPACT approach is that it begins with image enhancement (border detection, geometric correction and binarisation) before the page is segmented into text blocks. This links to a point that both Aly and Günter have made this morning: that a huge amount of digitised material exists that was not captured with OCR in mind. Enhancing the images up front brings them as close as possible to an optimal scan for OCR.
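To make the binarisation step more concrete, here is a minimal sketch in Python. It assumes a greyscale page given as rows of 0-255 integers and uses Otsu's global threshold, a standard textbook method chosen purely for illustration; the talk summary does not say which algorithm the IMPACT tools use, and the function names `otsu_threshold` and `binarise` are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the grey level t that maximises between-class variance.

    Pixels <= t are treated as one class (ink), pixels > t as the other.
    Illustrative only: not the IMPACT binarisation tool.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of grey levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(image: List[List[int]]) -> List[List[int]]:
    """Map a greyscale image to 0 (ink) and 1 (background)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p > t else 0 for p in row] for row in image]


# Tiny made-up page: dark strokes (20-30) on a light ground (200-220).
page = [[20, 30, 200],
        [25, 210, 220]]
bitmap = binarise(page)
```

A single global threshold like this tends to struggle with stained or unevenly lit pages, which is one reason historical material is hard for OCR; locally adaptive methods are a common alternative.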

Having discussed the various tools in depth, Sven talks about interoperability and the sharing of results from one tool to another. IMPACT is using XML-based standards for all OCR outputs, combining them in an “embracing” or translation standard called PAGE XML, which is being developed by the University of Salford.

Niall Anderson, BL + Mark-Oliver Fischer, BSB
