Context-Sensitive Error Correction: Using Topic Models to Improve OCR

  • Authors:
  • M. Wick; M. Ross; E. Learned-Miller

  • Affiliations:
  • University of Massachusetts Amherst; University of Massachusetts Amherst; University of Massachusetts Amherst

  • Venue:
  • ICDAR '07 Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02
  • Year:
  • 2007

Abstract

Modern optical character recognition software relies on human interaction to correct misrecognized characters. Even though the software often reliably identifies low-confidence output, the simple language and vocabulary models employed are insufficient to automatically correct mistakes. This paper demonstrates that topic models, which automatically detect and represent an article's semantic context, reduce error by 7% over a global word distribution in a simulated OCR correction task. Detecting and leveraging context in this manner is an important step towards improving OCR.
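
The contrast between a global word distribution and a topic-conditioned one can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the two topics, and every probability below are invented for illustration, and the document's topic mixture is assumed to be given rather than inferred. The sketch scores each candidate reading of a low-confidence token either by a context-free P(w) or by P(w | doc) = Σ_k P(w | topic k) · P(topic k | doc), and picks the highest-scoring candidate.

```python
# Toy illustration only: all words, topics, and probabilities are hypothetical.

# Assumed global word distribution P(w), which ignores document context.
GLOBAL = {"bank": 0.6, "tank": 0.4}

# Assumed per-topic word distributions P(w | topic).
TOPICS = {
    "finance": {"bank": 0.9, "tank": 0.1},
    "military": {"bank": 0.1, "tank": 0.9},
}


def topic_word_prob(word, doc_topics):
    """P(w | doc) = sum over topics of P(w | topic) * P(topic | doc)."""
    return sum(weight * TOPICS[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())


def correct(candidates, score):
    """Return the candidate word with the highest score."""
    return max(candidates, key=score)


# Plausible readings of one low-confidence OCR token.
candidates = ["bank", "tank"]

# Assumed topic mixture P(topic | doc) for the surrounding article.
doc_topics = {"finance": 0.2, "military": 0.8}

global_choice = correct(candidates, lambda w: GLOBAL.get(w, 0.0))
topic_choice = correct(candidates, lambda w: topic_word_prob(w, doc_topics))

print("global model picks:", global_choice)
print("topic model picks: ", topic_choice)
```

In this toy setting the global model prefers the more frequent word regardless of the article, while the topic-conditioned score follows the article's dominant topic. A real system would estimate both the topic-word distributions and the document's topic mixture from data.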