Context-Sensitive Error Correction: Using Topic Models to Improve OCR

Authors:
M. Wick;M. Ross;E. Learned-Miller
Affiliations:
University of Massachusetts Amherst;University of Massachusetts Amherst;University of Massachusetts Amherst
Venue:
ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
Year:
2007

Citing 0
Cited 8

Latent Style Model: Discovering writing styles for calligraphy works
Journal of Visual Communication and Image Representation

Evaluating models of latent document semantics in the presence of OCR errors
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Audio lifelog search system using a topic model for reducing recognition errors
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II

Bounding the probability of error for high precision optical character recognition
The Journal of Machine Learning Research

Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications
The Journal of Machine Learning Research

Measuring contextual fitness using error contexts extracted from the Wikipedia revision history
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Content level access to digital library of India pages
Proceedings of the Eighth Indian Conference on Computer Vision, Graphics and Image Processing

On handling textual errors in latent document modeling
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Abstract

Modern optical character recognition software relies on human interaction to correct misrecognized characters. Even though the software often reliably identifies low-confidence output, the simple language and vocabulary models employed are insufficient to automatically correct mistakes. This paper demonstrates that topic models, which automatically detect and represent an article's semantic context, reduce error by 7% over a global word distribution in a simulated OCR correction task. Detecting and leveraging context in this manner is an important step towards improving OCR.
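
The contrast the abstract draws, between a global word distribution and a topic-conditioned one, can be illustrated with a minimal sketch. It is not the authors' implementation: the topics, probabilities, candidate words, and the helper names `doc_word_prob` and `pick` are all hypothetical, and the document's topic mixture is assumed to be already known. The sketch scores each candidate for a low-confidence token by p(w | d) = sum over topics t of p(t | d) * p(w | t) and compares the result with the choice made from corpus-wide frequencies alone.

```python
# Toy illustration only: every number and name here is an assumption,
# not data or code from the paper.

def doc_word_prob(word, topic_mix, topic_word):
    """p(word | document) = sum_t p(t | document) * p(word | t)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in topic_mix.items()
    )

def pick(candidates, score):
    """Return the candidate word with the highest score."""
    return max(candidates, key=score)

# Hypothetical per-topic word distributions p(w | t).
topic_word = {
    "finance": {"bank": 0.30, "loan": 0.25, "band": 0.01, "rate": 0.44},
    "music": {"band": 0.40, "song": 0.35, "bank": 0.02, "tour": 0.23},
}

# Hypothetical corpus-wide (global) word distribution p(w).
global_word = {"bank": 0.20, "band": 0.15}

# Assumed topic mixture p(t | d) for an article that is mostly about music.
topic_mix = {"finance": 0.1, "music": 0.9}

# Candidate corrections for one low-confidence OCR token.
candidates = ["bank", "band"]

global_choice = pick(candidates, lambda w: global_word.get(w, 0.0))
topic_choice = pick(candidates, lambda w: doc_word_prob(w, topic_mix, topic_word))

print(global_choice, topic_choice)
```

With these made-up numbers the global distribution prefers the more frequent word, while the topic-conditioned score prefers the word that fits the article's context. A real system would also have to estimate the topic mixture from the surrounding high-confidence words and combine it with the recognizer's own character-level evidence.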