Evaluating models of latent document semantics in the presence of OCR errors
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
Content level access to digital library of India pages
Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents remains a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing or OCR correction technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that uses topic categorization of handwritten documents to generate smaller topic-specific lexicons, which in turn improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
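The idea behind the first method, lexicon reduction by topic categorization, can be illustrated with a minimal sketch. This is not the authors' implementation. The toy corpus in `TOPIC_DOCS`, the function names, the add-one-smoothed unigram classifier (a naive Bayes stand-in for whatever categorizer the paper uses), and the edit-distance matcher (a stand-in for a real word recognizer's scoring) are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training text per topic (illustrative only).
TOPIC_DOCS = {
    "sports": "the team won the match and the coach praised the players".split(),
    "finance": "the bank raised the interest rate and the market fell".split(),
}


def train_topic_models(topic_docs):
    """Return per-topic word counts and the full (unreduced) lexicon."""
    counts = {topic: Counter(words) for topic, words in topic_docs.items()}
    full_lexicon = set().union(*topic_docs.values())
    return counts, full_lexicon


def categorize(words, counts, full_lexicon):
    """Pick the topic with the highest add-one-smoothed unigram log-likelihood.

    Assumption: a simple naive Bayes classifier with a uniform prior;
    words outside the lexicon (likely recognition errors) are ignored.
    """
    def score(topic):
        c = counts[topic]
        total = sum(c.values()) + len(full_lexicon)
        return sum(math.log((c[w] + 1) / total) for w in words if w in full_lexicon)

    return max(counts, key=score)


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct(words, lexicon):
    """Replace each out-of-lexicon word with its nearest lexicon entry."""
    ordered = sorted(lexicon)  # sorted so that ties break deterministically
    return [
        w if w in lexicon else min(ordered, key=lambda v: edit_distance(w, v))
        for w in words
    ]


counts, full_lexicon = train_topic_models(TOPIC_DOCS)
noisy = ["the", "team", "won", "the", "mate"]  # "mate" is a misrecognized "match"

topic = categorize(noisy, counts, full_lexicon)
topic_lexicon = set(TOPIC_DOCS[topic])

print(topic)                          # sports
print(correct(noisy, full_lexicon))   # last word becomes "rate"
print(correct(noisy, topic_lexicon))  # last word becomes "match"
```

With the full lexicon, the misrecognized word is pulled toward "rate", its nearest neighbor overall. Once the document has been categorized as sports, the reduced lexicon no longer contains that competitor and the intended word "match" is recovered. This is the effect the abstract attributes to smaller topic-specific lexicons.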