The challenges of historical materials and an overview of the technical solutions in IMPACT
Sven Schlarb of the Österreichische Nationalbibliothek (Austrian National Library) now talks about the challenges that digitised historical text poses for OCR and the solutions IMPACT has devised to deal with them. Having outlined the individual tools and the partners responsible for developing them, he describes in detail the ideal IMPACT workflow in which they can all be used.
One important novel feature of the IMPACT approach is that it begins with image enhancement (border detection, geometric correction and binarisation) before the page is segmented into text blocks. This links to a point that both Aly and Günter have made this morning: that a huge amount of digitised material exists that was not captured with OCR in mind. Enhancing the images up front is meant to turn them into something as close as possible to an optimal scan for OCR.
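As a rough illustration of this ordering (enhance first, segment afterwards), here is a minimal Python sketch. It is not IMPACT code: the function names, the list-of-lists image representation and the fixed thresholds are all assumptions made for the example, and geometric correction is left out because it cannot be shown meaningfully in a few lines.

```python
from typing import List

# Assumption: a greyscale page is a list of rows, 0 = black, 255 = white.
Image = List[List[int]]


def crop_dark_border(img: Image, dark: int = 50) -> Image:
    """Crude border detection: drop outer rows and columns that are entirely dark."""
    rows = [r for r, row in enumerate(img) if any(p > dark for p in row)]
    cols = [c for c in range(len(img[0])) if any(row[c] > dark for row in img)]
    if not rows or not cols:
        return img  # nothing but border found; leave the page untouched
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def binarise(img: Image, threshold: int = 128) -> Image:
    """Global fixed-threshold binarisation (a stand-in for a real binarisation tool)."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def enhance(img: Image) -> Image:
    """Run the enhancement steps in order, before any segmentation or OCR."""
    return binarise(crop_dark_border(img))


# Hypothetical 4x4 scan with a black scanner border around a 2x2 page area.
page = [
    [0, 0, 0, 0],
    [0, 200, 90, 0],
    [0, 30, 220, 0],
    [0, 0, 0, 0],
]
enhanced = enhance(page)
```

The point of the sketch is only the sequence: the page is cleaned up as a whole, and whatever segmentation and OCR follow would then work on `enhanced` rather than on the raw scan.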
Having discussed the various tools in depth, Sven talks about interoperability and the sharing of results between tools. IMPACT is using XML-based standards for all OCR outputs, combining them in an “embracing” or translation standard called PAGE XML, which is being developed by the University of Salford.
Niall Anderson, BL + Mark-Oliver Fischer, BSB