Our previous work has shown that error correction of optical character recognition (OCR) output on degraded historical machine-printed documents improves when multiple information sources and multiple OCR hypotheses are used, including hypotheses drawn from multiple binarizations of the document image. This paper demonstrates how diversity among those binarizations makes the improvements in OCR accuracy possible. We demonstrate the degree and breadth to which the information required for correction is distributed across multiple binarizations of a given document image. Our analysis reveals that the sources of these corrections are not limited to any single binarization and that the full range of binarizations holds information needed to achieve the best result, as measured by the word error rate (WER) of the final OCR decision. Even binarizations with high WERs contribute to improving the final OCR output. For the corpus used in this research, 2.68% of all tokens are corrected using hypotheses not found in the OCR of the binarized image with the lowest WER. Further, we show that the higher the overall WER of the OCR, the more widely the corrections are distributed among all binarizations of the document image.
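The two quantities the abstract relies on can be illustrated with a minimal sketch. It assumes WER is the word-level edit distance divided by the reference length, and that each binarization's OCR output has already been aligned token-for-token with the reference transcription. The function names `word_error_rate` and `recoverable_outside_best` and the toy data are hypothetical, chosen only for illustration; this is not the paper's alignment or correction method.

```python
from typing import Dict, Sequence, Tuple


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # Row 0 of the edit-distance table: empty reference vs. hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # reference[:i] vs. empty hypothesis
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(reference)


def recoverable_outside_best(
    reference: Sequence[str],
    aligned_outputs: Dict[str, Sequence[str]],
) -> Tuple[str, float]:
    """Share of tokens the lowest-WER output gets wrong but another output gets right.

    Assumption: every sequence in `aligned_outputs` is aligned 1:1 with `reference`.
    """
    best = min(aligned_outputs,
               key=lambda name: word_error_rate(reference, aligned_outputs[name]))
    others = [toks for name, toks in aligned_outputs.items() if name != best]
    hits = sum(
        1
        for i, ref_tok in enumerate(reference)
        if aligned_outputs[best][i] != ref_tok
        and any(toks[i] == ref_tok for toks in others)
    )
    return best, hits / len(reference)


# Toy data (invented): two binarization thresholds of the same line of text.
reference = ["the", "quick", "brown", "fox"]
outputs = {
    "threshold_low": ["the", "quick", "brovvn", "fox"],
    "threshold_high": ["tbe", "qu1ck", "brown", "fox"],
}
best_name, share = recoverable_outside_best(reference, outputs)
```

In this toy case the higher-WER output is the only one that reads "brown" correctly, which mirrors the abstract's point that binarizations with high WERs can still supply corrections.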