Using topic models for OCR correction

Authors:
Faisal Farooq; Anurag Bhardwaj; Venu Govindaraju
Affiliations:
Siemens Medical Solutions, Image and Knowledge Management, Malvern, PA, USA; University at Buffalo, Department of Computer Science and Engineering, Buffalo, NY, USA; University at Buffalo, Department of Computer Science and Engineering, Buffalo, NY, USA
Venue:
International Journal on Document Analysis and Recognition - Special Issue NOISY
Year:
2009

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing (OCR correction) technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the predicted topic to generate smaller topic-specific lexicons, thereby improving recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each of these methods and report results on the publicly available IAM database.
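The general idea behind the first method, topic-based lexicon reduction, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy lexicons, the hit-counting topic guesser (a stand-in for a trained topic categorizer), and the edit-distance matching step are all assumptions made purely for illustration.

```python
# Hypothetical toy lexicons; a real system would derive these from topic-labelled text.
TOPIC_LEXICONS = {
    "sports": {"match", "goal", "team", "coach", "score"},
    "finance": {"market", "stock", "bank", "share", "score"},
}


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def guess_topic(words):
    """Assumed stand-in for a topic categorizer: pick the topic whose
    lexicon contains the most of the (noisy) recognized words."""
    return max(TOPIC_LEXICONS,
               key=lambda t: sum(w in TOPIC_LEXICONS[t] for w in words))


def correct(words):
    """Restrict matching to the predicted topic's lexicon, then map each
    recognized word to its nearest lexicon entry by edit distance."""
    topic = guess_topic(words)
    lexicon = sorted(TOPIC_LEXICONS[topic])  # sorted for deterministic tie-breaking
    return topic, [min(lexicon, key=lambda c: edit_distance(w, c)) for w in words]


if __name__ == "__main__":
    print(correct(["team", "gool", "coech", "match"]))
```

Because the candidate set shrinks to a single topic's vocabulary, each noisy word competes against fewer look-alike entries, which is the intuition behind using a reduced lexicon.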