Quality enhancement in information extraction from scanned documents

  • Authors:
  • Atsuhiro Takasu;Kenro Aihara

  • Affiliations:
  • National Institute of Informatics;National Institute of Informatics

  • Venue:
  • Proceedings of the 2006 ACM symposium on Document engineering
  • Year:
  • 2006

Abstract

When constructing a large document archive, an important element is the digitizing of printed documents. Although various techniques for document image analysis such as Optical Character Recognition (OCR) have been developed, error handling is required in constructing real document archive systems. This paper discusses the problem from the quality enhancement perspective and proposes a robust reference extraction method for academic articles scanned with OCR mark-up. We applied the proposed method to articles appearing in various journals, and these experiments showed that the proposed method achieved a recognition accuracy of more than 94%. This paper also discusses manual correction and investigates experimentally the relationship between extraction accuracy and cost reduction.