Why multiple document image binarizations improve OCR

Authors:
William B. Lund;Douglas J. Kennard;Eric K. Ringger
Affiliations:
Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah;Brigham Young University, Provo, Utah
Venue:
Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing
Year:
2013


Abstract

Our previous work has shown that error correction of optical character recognition (OCR) output for degraded historical machine-printed documents improves when multiple information sources and multiple OCR hypotheses are used, including hypotheses drawn from multiple binarizations of the document image. This paper demonstrates how diversity among those binarizations makes the improvements in OCR accuracy possible. We show the degree and breadth to which the information required for correction is distributed across multiple binarizations of a given document image. Our analysis reveals that the sources of these corrections are not limited to any single binarization and that the full range of binarizations holds information needed to achieve the best result, as measured by the word error rate (WER) of the final OCR decision. Even binarizations with high WERs contribute to improving the final OCR output. For the corpus used in this research, 2.68% of all tokens are corrected using hypotheses not found in the OCR output of the binarized image with the lowest WER. Further, we show that the higher the overall WER of the OCR, the more the corrections are distributed among all binarizations of the document image.
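
To make the WER-based comparison in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes WER as word-level edit distance divided by the number of reference tokens, picks the lowest-WER hypothesis, and lists reference tokens that the best hypothesis lacks but some other binarization supplies. The reference string, the hypothesis strings, and the binarization labels (`t96`, `t128`, `t160`) are invented for illustration, and the bag-of-words membership test is a crude stand-in for the token alignment a real system would need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical data: one reference line and the OCR output obtained from
# three made-up binarizations of the same image.
reference = "the quick brown fox"
hypotheses = {
    "t96": "the quick brovvn fox",
    "t128": "tbe quick brown f0x",
    "t160": "the qu ick brown fox",
}

wers = {name: word_error_rate(reference, text) for name, text in hypotheses.items()}
best = min(wers, key=wers.get)

# Tokens absent from the lowest-WER hypothesis but present in another one.
# Assumption: plain membership is used instead of a proper alignment.
best_tokens = set(hypotheses[best].split())
recoverable = [
    tok for tok in reference.split()
    if tok not in best_tokens
    and any(tok in text.split() for name, text in hypotheses.items() if name != best)
]
```

In this toy case the lowest-WER hypothesis still misses "brown", which both higher-WER hypotheses contain, mirroring the abstract's point that binarizations with worse overall WER can still carry the information needed for a correction.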