Using topic models for OCR correction

  • Authors:
  • Faisal Farooq;Anurag Bhardwaj;Venu Govindaraju

  • Affiliations:
  • Siemens Medical Solutions, Image and Knowledge Management, Malvern, PA, USA;University at Buffalo, Department of Computer Science and Engineering, Buffalo, NY, USA;University at Buffalo, Department of Computer Science and Engineering, Buffalo, NY, USA

  • Venue:
  • International Journal on Document Analysis and Recognition - Special Issue NOISY
  • Year:
  • 2009

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy remains below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing (OCR correction) technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the result to generate smaller, topic-specific lexicons that improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy-based topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
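
The first idea in the abstract, topic-driven lexicon reduction, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy lexicons in `TOPIC_LEXICONS`, the overlap-count `categorize` function (a simple stand-in for a trained topic categorizer), and the use of string similarity from Python's `difflib` in place of re-running a handwriting recognizer against the reduced lexicon.

```python
from difflib import get_close_matches

# Hypothetical toy lexicons; the paper's actual lexicons are not given here.
TOPIC_LEXICONS = {
    "sports": ["match", "player", "goal", "team", "score"],
    "finance": ["market", "stock", "bank", "price", "share"],
}


def categorize(words, lexicons):
    """Pick the topic whose lexicon contains the most recognized words.

    This overlap count is an assumed stand-in for a trained categorizer.
    """
    best_topic, best_count = None, -1
    for topic, lexicon in lexicons.items():
        vocab = set(lexicon)
        count = sum(1 for w in words if w in vocab)
        if count > best_count:
            best_topic, best_count = topic, count
    return best_topic


def correct(words, lexicon, cutoff=0.6):
    """Replace each word with its most similar lexicon entry, if any.

    Words with no entry above the similarity cutoff are left unchanged.
    """
    corrected = []
    for w in words:
        matches = get_close_matches(w, lexicon, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else w)
    return corrected


def reduce_and_correct(words, lexicons=TOPIC_LEXICONS):
    """Categorize the document, then correct it against the reduced lexicon."""
    topic = categorize(words, lexicons)
    return topic, correct(words, lexicons[topic])
```

For a noisy recognizer output such as `["team", "plaver", "goal", "scare"]`, the exact matches are enough to select the hypothetical "sports" lexicon, and the remaining words are then compared against that smaller vocabulary only. The sketch shows why a smaller, topic-specific lexicon helps: fewer candidate words compete for each noisy token.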