Use of digitised and OCRed text collections by end users

7 May, 2010

Geneviève Cron of the Bibliothèque nationale de France (BnF) begins by discussing the BnF’s digital library, Gallica.  A million documents have been digitised since 1992, with OCR as standard since 2005.  OCR accuracy for newspapers is 98% at word level, but results vary widely, from 60% upwards.  For books, the average accuracy is 90%.
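The talk did not spell out how word-level accuracy is calculated, so the following is only a minimal sketch of one common definition: one minus the word-level edit distance between a ground-truth transcription and the OCR output, divided by the number of ground-truth words. The function names `word_edit_distance` and `word_accuracy` and the sample strings are illustrative assumptions, not BnF's or IMPACT's actual tooling.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of word tokens."""
    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the OCR output
                current[j - 1] + 1,      # extra word in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(ground_truth, ocr_output):
    """Word-level accuracy in [0, 1] under the assumed definition above."""
    reference = ground_truth.split()
    hypothesis = ocr_output.split()
    if not reference:
        raise ValueError("ground truth must contain at least one word")
    errors = word_edit_distance(reference, hypothesis)
    # Clamp at zero: many inserted words can push the error count
    # above the number of reference words.
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    truth = "la bibliothèque nationale de france"
    ocr = "la bibliotheque nationale de france"  # one misrecognised word
    print(f"word accuracy: {word_accuracy(truth, ocr):.0%}")
```

Under this definition a single wrong character makes the whole word count as an error, which is one reason word-level figures tend to be lower than character-level figures for the same page.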

She describes the users of the digital services: mostly French or Francophone, some with special access needs (vision impairment).  Queries about the digital store go up every year, and most relate to content rather than bibliographic information.  Content queries split into thematic, geographic, history, genealogy and newspapers.

Geneviève goes on to describe the Gallica workflow: a volume is OCR’d; some books are sent straight to the store, but newspapers are manually corrected by a service provider, and some other books are manually corrected to reach almost 100% accuracy.  ABBYY FineReader is used as a validation tool for the manually corrected text.  OCR is useful when the words in user queries are not in the bibliographic data, hence the subject spread of content queries.  She outlines the Wikimedia/BnF Collaborative Correction plan, which aims for 100% accuracy through user collaboration.  Text-to-speech and EPUB projects are in progress, and ground-truthed datasets are being created within IMPACT to aid further research into improving OCR accuracy.

Niall Anderson, BL + Mark-Oliver Fischer, BSB
