Recognizing garbage in OCR output on historical documents

  • Authors: Richard Wudtke; Christoph Ringlstetter; Klaus U. Schulz
  • Affiliations: CIS, University of Munich (all authors)
  • Venue: Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
  • Year: 2011

Abstract

Erroneous tokens in the output of an OCR engine can be roughly divided into two categories. For less serious OCR errors, human readers, and in many cases text correction systems as well, can typically reconstruct the correct original word or suggest a small set of plausible corrections. Sometimes, however, the OCR output contains "garbage" tokens for which it is impossible to predict the correct word. Garbage tokens arise, for example, when the OCR engine misinterprets graphics in a page image as text. In this paper we report on the development of a classifier for garbage tokens in OCR output on historical documents. The classifier is based on a specific feature set and is implemented as a support vector machine. In our experiments it clearly outperformed simple rule-based predecessor solutions for OCR garbage detection.
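
As a rough illustration of what a token-level feature set for such a classifier might look like, the following sketch maps an OCR token to a small numeric vector. The abstract does not list the paper's actual features, so every feature here (token length, letter ratio, non-alphanumeric ratio, vowel ratio, longest repeated-character run) and the names `token_features`, `longest_run`, and `VOWELS` are assumptions chosen for illustration only.

```python
from typing import List

# Assumed vowel inventory for illustration; a real system for historical
# documents would likely need a language- and period-specific set.
VOWELS = set("aeiouäöüy")


def longest_run(token: str) -> int:
    """Return the length of the longest run of one repeated character."""
    best = 0
    run = 0
    prev = None
    for ch in token:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


def token_features(token: str) -> List[float]:
    """Map a token to a hypothetical five-dimensional feature vector.

    Features (all assumptions, not taken from the paper):
      0. token length
      1. fraction of alphabetic characters
      2. fraction of characters that are neither letters nor digits
      3. fraction of vowels among the alphabetic characters
      4. longest repeated-character run, relative to token length
    """
    n = len(token)
    if n == 0:
        return [0.0] * 5
    letters = [ch for ch in token if ch.isalpha()]
    digits = sum(1 for ch in token if ch.isdigit())
    others = n - len(letters) - digits
    vowels = sum(1 for ch in letters if ch.lower() in VOWELS)
    return [
        float(n),
        len(letters) / n,
        others / n,
        vowels / len(letters) if letters else 0.0,
        longest_run(token) / n,
    ]
```

Vectors of this kind, paired with manually assigned garbage/non-garbage labels, could then be passed to any support vector machine implementation for training. A plausible word such as `"Haus"` yields a high letter ratio and a moderate vowel ratio, while a token such as `"~~|.;"` consists entirely of non-alphanumeric characters.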