A number of projects are creating searchable digital libraries of printed books, including the Million Book Project, the Google Books project, and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting the printed text into machine-readable (e.g., ASCII) text using an optical character recognition (OCR) engine and then performing full-text search on the results. Many of these books are old, and a variety of processing steps are required to build an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, so a good automatic statistical evaluation of OCR performance on book-length material is needed.

Evaluating OCR performance over an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the full length of a book. This may be viewed as the problem of aligning two very long sequences (easily a million elements each), complicated further by OCR errors and by the possibility of large chunks of material missing from one of the sequences.

We propose a Hidden Markov Model (HMM) based hierarchical alignment algorithm to align the OCR output and the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process works by breaking the problem of aligning two long sequences into the problem of aligning many smaller subsequences, which can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works well even when the OCR output has a high recognition error rate. Finally, we use the alignment results to evaluate the performance of a commercial OCR engine over a large dataset of books.
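The divide-and-conquer idea behind hierarchical alignment can be illustrated with a minimal sketch. This is not the authors' HMM implementation; it is a simplified stand-in that uses words occurring exactly once in both sequences as anchor points (a common heuristic in long-sequence alignment), splits the sequences at those anchors, and aligns each small chunk with ordinary dynamic-programming edit alignment. The function names and the anchor heuristic are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch of hierarchical sequence alignment: anchor on words
# that are unique in both sequences, then DP-align the short chunks between
# anchors. Aligning many small chunks is far cheaper than one O(n*m) pass
# over two book-length sequences.
from collections import Counter


def dp_align(a, b):
    """Classic edit-distance alignment; returns (token_a, token_b) pairs,
    with None marking an insertion or deletion."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    out, i, j = [], n, m          # backtrace
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out.append((a[i - 1], None)); i -= 1
        else:
            out.append((None, b[j - 1])); j -= 1
    return out[::-1]


def anchors(a, b):
    """(pos_in_a, pos_in_b) for words unique in both sequences, keeping
    only anchors whose positions increase in both (order-consistent)."""
    ca, cb = Counter(a), Counter(b)
    pos_b = {w: i for i, w in enumerate(b) if cb[w] == 1}
    kept, last = [], -1
    for i, w in enumerate(a):
        if ca[w] == 1 and cb.get(w) == 1 and pos_b[w] > last:
            kept.append((i, pos_b[w]))
            last = pos_b[w]
    return kept


def align(a, b):
    """Split at anchors, DP-align the small chunks in between."""
    cut = anchors(a, b)
    if not cut:                    # no anchors: align the chunk directly
        return dp_align(a, b)
    out, pa, pb = [], 0, 0
    for ia, ib in cut:
        out += dp_align(a[pa:ia], b[pb:ib])
        out.append((a[ia], b[ib]))  # the anchor itself matches exactly
        pa, pb = ia + 1, ib + 1
    out += dp_align(a[pa:], b[pb:])
    return out
```

Run on a ground-truth line and a noisy OCR version of it, `align` pairs the erroneous tokens with their originals while never running the quadratic DP on more than a few tokens at a time; a full hierarchy would simply apply the same split recursively (and, in the paper's setting, replace the exact-match DP with an HMM that models OCR error probabilities).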