A number of projects are creating searchable digital libraries of printed books, including the Million Book Project, the Google Books project, and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting the printed text into machine-readable (e.g., ASCII) text using an optical character recognition (OCR) engine and then performing full-text search on the results. Many of these books are old, and a variety of processing steps are required to build an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, so a good automatic statistical evaluation of OCR performance on book-length material is needed.

Evaluating OCR performance over an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the full length of a book. This may be viewed as the problem of aligning two very long sequences (easily a million elements each), complicated further by OCR errors and by the possibility of large chunks of material missing from one of the sequences.

We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align the OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking the problem of aligning two long sequences into the problem of aligning many smaller subsequences, which can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works well even when the OCR output has a high recognition error rate. Finally, we use the alignment results to evaluate the performance of a commercial OCR engine over a large dataset of books.
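The divide-and-conquer idea behind hierarchical alignment can be illustrated with a minimal sketch. This is not the authors' HMM implementation; it is a simplified stand-in that uses words occurring exactly once in both sequences as anchor points (a common heuristic in long-sequence alignment), splits the sequences at those anchors, and aligns each small chunk with ordinary dynamic-programming edit alignment. The function names and the anchor heuristic are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch of hierarchical sequence alignment: anchor on words
# that are unique in both sequences, then DP-align the short chunks between
# anchors. Aligning many small chunks is far cheaper than one O(n*m) pass
# over two book-length sequences.
from collections import Counter


def dp_align(a, b):
    """Classic edit-distance alignment; returns (token_a, token_b) pairs,
    with None marking an insertion or deletion."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    out, i, j = [], n, m          # backtrace
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out.append((a[i - 1], None)); i -= 1
        else:
            out.append((None, b[j - 1])); j -= 1
    return out[::-1]


def anchors(a, b):
    """(pos_in_a, pos_in_b) for words unique in both sequences, keeping
    only anchors whose positions increase in both (order-consistent)."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {w: i for i, w in enumerate(b) if cb[w] == 1}
    kept, last = [], -1
    for i, w in enumerate(a):
        if ca[w] == 1 and cb.get(w) == 1 and pos_b[w] > last:
            kept.append((i, pos_b[w]))
            last = pos_b[w]
    return kept


def align(a, b):
    """Split at anchors, DP-align the small chunks in between."""
    cut = anchors(a, b)
    if not cut:                    # no anchors: align the chunk directly
        return dp_align(a, b)
    out, pa, pb = [], 0, 0
    for ia, ib in cut:
        out += dp_align(a[pa:ia], b[pb:ib])
        out.append((a[ia], b[ib]))  # the anchor itself matches exactly
        pa, pb = ia + 1, ib + 1
    out += dp_align(a[pa:], b[pb:])
    return out
```

Run on a ground-truth line and a noisy OCR version of it, `align` pairs the erroneous tokens with their originals while never running the quadratic DP on more than a few tokens at a time; a full hierarchy would simply apply the same split recursively (and, in the paper's setting, replace the exact-match DP with an HMM that models OCR error probabilities).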