IMPACT Final Conference – Experiences in Mass Digitisation with the California Digital Library

24 October, 2011

Paul Fogel on Experiences in mass digitisation

Paul Fogel, Technical Lead of the Mass Digitisation team at the California Digital Library (CDL), presented CDL's digitisation experiences and the challenges it faces when extracting document text with OCR. Fogel emphasised the obstacles that bad OCR poses to the mass indexing and digitisation of cultural records. Marginalia, confusion between images and text, and unusual fonts all degrade the extracted text. Combined with limited resources, the wide range of languages (400, with OCR dictionaries available for only 20 of them) and disciplines, and the project's large indexing scale, these problems make it extremely difficult to rank search results and to use them. Finally, Fogel echoed Antonacopoulos and stressed the need for high-quality images to ensure the best indexing and query results.

View the presentation here:

and the video here:
