IMPACT Final Conference – Digitisation Tips Parallel Session
Aly Conteh (BL) hosted the parallel Q&A session on Digitisation tips. Panel members included Astrid Verheusen (Manager of the Digital Library Programme and Head of the Innovative Projects Department, KB National Library of the Netherlands), Geneviève Cron (OCR expert, Bibliothèque nationale de France), Christa Müller (Director Digital Services Department, Austrian National Library), Majlis Bremer-Laamanen (Director of the Centre for Preservation and Digitisation, National Library of Finland) and Alena Kavčič-Colic (Head of the Research and Development Unit, National and University Library of Slovenia).
Here are some of the questions and issues addressed:
- Q: How do Digital Libraries exist without METS/ALTO and how do you support retrospective production of OCR? A: You always have to try to find the best solution in terms of time, cost, scope and user needs. Currently, some libraries use only ALTO because it better suits their projects' needs. Standards like ALTO, however, don't always support certain features. While new releases of the standards are being reviewed and will soon be published, it is paramount that libraries evaluate their data and resources and adopt the necessary measures accordingly. The problem of continuous migration due to updated standards will remain for as long as we digitise. If, however, the OCR exists only as raw, plain text, retrofitting it into METS/ALTO is encouraged, as it transforms the users' experience of working with that information. It is relatively straightforward and not highly technical, but it does, of course, need some financial support.
- Q: Many libraries digitising newspaper collections clean the head titles of their documents. Will this still happen in the future? Why insist on cleaning head titles rather than invest in digitising more pages? A: Good point! Some libraries have already stopped cleaning headings in favour of a larger number of digitised pages. However, the higher the accuracy of article titles, the higher the relevance of the article to search terms. On the other hand, cleaning the OCR of headings does cost more and it limits the number of pages you can digitise. It comes down to choices. A possible solution: invest the money in good scans, let the software do the OCR and live with the automated results. Do not spend money and time on manual correction. And remember to always consult your user community.
- Q: How do you measure capture quality when you lack ground truth? A: It is impossible to create ground truth for everything in a digitisation project, but what you could do is sample some of the pages rather than check every single one. In addition, OCR engines do report a certain level of confidence for their output, which can serve as a rough indicator of quality.
- Q: What are library priorities for the next 10 years? A: To obtain copyright exceptions and extended licensing so that we can publish material that is currently protected. As regards the British Library, current and near-future projects include theatre playbills, newspaper collections, maps and medieval manuscripts.
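As an illustration of the sampling idea raised in the capture-quality question above, here is a minimal Python sketch. It is not something the panel presented. It assumes that each page has an ALTO file whose `String` elements carry the optional `WC` (word confidence, 0 to 1) attribute; the function names `sample_pages` and `mean_word_confidence` are hypothetical and chosen only for this example. The idea is to draw a random sample of pages, check those by hand, and compare the manual findings with the confidence the OCR engine reported.

```python
import random
import xml.etree.ElementTree as ET


def sample_pages(page_ids, k, seed=0):
    """Pick up to k page identifiers at random (hypothetical helper).

    A fixed seed makes the sample reproducible, so the same pages
    can be re-checked later.
    """
    pages = list(page_ids)
    rng = random.Random(seed)
    return rng.sample(pages, min(k, len(pages)))


def mean_word_confidence(alto_xml):
    """Average the WC attribute over all String elements of one ALTO page.

    Assumption: the OCR engine wrote WC values; the attribute is optional
    in ALTO, so None is returned when no String element carries it.
    """
    root = ET.fromstring(alto_xml)
    scores = []
    for element in root.iter():
        # Compare the local tag name so any ALTO namespace version matches.
        local_name = element.tag.rsplit("}", 1)[-1]
        confidence = element.get("WC")
        if local_name == "String" and confidence is not None:
            scores.append(float(confidence))
    if not scores:
        return None
    return sum(scores) / len(scores)
```

A sample drawn this way only gives an estimate, and engine-reported confidence is not the same as measured accuracy, so pages with low average confidence are best treated as candidates for manual checking rather than as a final verdict.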