The effects of OCR error on the extraction of private information

  • Authors:
  • Kazem Taghva;Russell Beckley;Jeffrey Coombs

  • Affiliations:
  • Information Science Research Institute, University of Nevada, Las Vegas;Information Science Research Institute, University of Nevada, Las Vegas;Information Science Research Institute, University of Nevada, Las Vegas

  • Venue:
  • DAS'06 Proceedings of the 7th international conference on Document Analysis Systems
  • Year:
  • 2006

Abstract

OCR error has been shown not to affect the average accuracy of text retrieval or text categorization. Recent studies, however, have indicated that information extraction is significantly degraded by OCR error. We experimented with information extraction software on two collections: one of OCR-ed documents and another of manually corrected versions of the same documents. We found a significant reduction in accuracy on the OCR text compared with the corrected text. The majority of errors were attributable to zoning problems rather than OCR classification errors.