We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This "clean set" is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst-case analysis. We give empirical results on a data set of difficult historical newspaper scans, demonstrating that our method for identifying correct words makes only two errors in 56 documents. Using document-specific character models generated from this data, we are able to reduce the error over properly segmented characters by 34.1% from an initial OCR system's translation.
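
The trade-off between high precision and low recall in a "clean set" can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not say how the bound on a word's error probability is computed, so the `error_bound` field, the threshold value, and the toy words below are all hypothetical. The sketch only shows how a strict cutoff on such a bound keeps a small but reliable subset of words.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str           # OCR output for the word
    error_bound: float  # hypothetical upper bound on P(word is incorrect)
    is_correct: bool    # ground truth, used only to evaluate the selection


def select_clean_set(words, max_error_bound):
    """Keep only words whose error bound is at or below the threshold."""
    return [w for w in words if w.error_bound <= max_error_bound]


def precision_recall(clean, words):
    """Precision: fraction of the clean set that is correct.
    Recall: fraction of all correct words that reached the clean set."""
    correct_in_clean = sum(w.is_correct for w in clean)
    total_correct = sum(w.is_correct for w in words)
    precision = correct_in_clean / len(clean) if clean else 1.0
    recall = correct_in_clean / total_correct if total_correct else 0.0
    return precision, recall


# Toy data: invented values, not taken from the paper.
words = [
    Word("the", 0.001, True),
    Word("of", 0.002, True),
    Word("tbe", 0.200, False),   # a misrecognized word with a loose bound
    Word("paper", 0.050, True),  # correct, but the bound is too loose to keep
    Word("newspaper", 0.004, True),
]

clean = select_clean_set(words, max_error_bound=0.01)
precision, recall = precision_recall(clean, words)
print([w.text for w in clean], precision, recall)
```

In this toy case the strict threshold discards one correct word ("paper") along with the error, so recall drops while every retained word is correct. That is the regime the abstract describes: the retained words are reliable enough to serve as document-specific training data.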