Topic based language models for OCR correction

  • Authors:
  • Anurag Bhardwaj; Faisal Farooq; Huaigu Cao; Venu Govindaraju

  • Affiliations:
  • University at Buffalo, Amherst, NY; University at Buffalo, Amherst, NY; University at Buffalo, Amherst, NY; University at Buffalo, Amherst, NY

  • Venue:
  • Proceedings of the second workshop on Analytics for noisy unstructured text data
  • Year:
  • 2008

Abstract

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon, but in the absence of such a lexicon, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process noisy recognizer output and eliminate spurious recognition choices using a topic-based language model. We construct a topic-based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained and used to generate the topic distribution of a test document. A given test word image is processed by the recognizer, and its word recognition likelihood is refined by incorporating the topic distribution of the document and the topic-based language model probability. The proposed method is evaluated on the publicly available IAM dataset, and experimental results show a significant improvement in word recognition accuracy, from 32% to 40%, on a test set of 4033 word images extracted from 70 handwritten document images.
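The rescoring idea the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function and variable names are hypothetical, and the combination rule (recognizer likelihood multiplied by a language-model probability marginalized over the document's topic distribution) is assumed from the abstract's description.

```python
def rescore(candidates, topic_dist, topic_lm):
    """Re-rank recognizer candidates with a topic-based language model.

    candidates: {word: recognizer likelihood for this word image}
    topic_dist: {topic: P(topic | document)} from a topic categorizer
                (a Maximum Entropy model in the paper)
    topic_lm:   {topic: {word: P(word | topic)}} per-topic language model
    Returns candidate (word, score) pairs sorted by the combined score.
    """
    scored = {}
    for word, rec_lik in candidates.items():
        # P(word | document), marginalized over topics; a tiny floor
        # stands in for smoothing of unseen words.
        lm_prob = sum(p_t * topic_lm[t].get(word, 1e-9)
                      for t, p_t in topic_dist.items())
        scored[word] = rec_lik * lm_prob
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Toy example: the recognizer slightly prefers "dug", but a document
# dominated by a "pets" topic pushes "dog" to the top after rescoring.
topic_dist = {"pets": 0.8, "mining": 0.2}
topic_lm = {"pets": {"dog": 0.05, "dug": 0.001},
            "mining": {"dog": 0.001, "dug": 0.04}}
candidates = {"dug": 0.55, "dog": 0.45}
ranking = rescore(candidates, topic_dist, topic_lm)
print(ranking[0][0])  # → dog
```

The example shows why the method helps without a restricted lexicon: the topic prior supplies the contextual constraint that the lexicon would otherwise provide.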