Bounding the probability of error for high precision optical character recognition

Authors:
Gary B. Huang;Andrew Kae;Carl Doersch;Erik Learned-Miller
Affiliations:
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA;Department of Computer Science, University of Massachusetts Amherst, Amherst, MA;Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA;Department of Computer Science, University of Massachusetts Amherst, Amherst, MA
Venue:
The Journal of Machine Learning Research
Year:
2012


Abstract

We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst-case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% relative to an initial OCR system's translation.
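
To make the precision/recall trade-off behind the "clean set" concrete, the following is a minimal sketch, not the authors' method. It assumes each OCR word already comes with some upper bound on its probability of being incorrect; how the paper derives that bound is not reproduced here. The `Word` class, the `select_clean_set` and `precision_recall` helpers, the threshold, and all data values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str            # OCR output for the word
    error_bound: float   # hypothetical upper bound on P(word is incorrect)
    is_correct: bool     # ground truth, used only for evaluation


def select_clean_set(words, max_error_bound):
    """Keep only words whose error bound is at or below the threshold."""
    return [w for w in words if w.error_bound <= max_error_bound]


def precision_recall(clean_set, words):
    """Precision: fraction of selected words that are correct.
    Recall: fraction of all correct words that were selected."""
    selected_correct = sum(w.is_correct for w in clean_set)
    total_correct = sum(w.is_correct for w in words)
    precision = selected_correct / len(clean_set) if clean_set else 1.0
    recall = selected_correct / total_correct if total_correct else 0.0
    return precision, recall


# Made-up toy document: four correct words and two OCR errors.
words = [
    Word("the", 0.001, True),
    Word("Senate", 0.004, True),
    Word("tbe", 0.300, False),
    Word("passed", 0.050, True),
    Word("bi11", 0.450, False),
    Word("today", 0.020, True),
]

# A strict threshold keeps few words, but the ones kept are reliable.
clean = select_clean_set(words, max_error_bound=0.01)
precision, recall = precision_recall(clean, words)
print([w.text for w in clean], precision, recall)
```

In this toy case the strict threshold keeps only two of the four correct words, so recall is low while precision is perfect. That is the regime the abstract describes: the selected words can then serve as document-specific training data, at the cost of discarding many words that were in fact correct.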