Quality enhancement in information extraction from scanned documents

Authors:
Atsuhiro Takasu; Kenro Aihara
Affiliations:
National Institute of Informatics; National Institute of Informatics
Venue:
Proceedings of the 2006 ACM symposium on Document engineering
Year:
2006

Citing 4

The RightPages Image-Based Electronic Library for Alerting and Browsing
Computer

Learning String-Edit Distance
IEEE Transactions on Pattern Analysis and Machine Intelligence

Bibliographic attribute extraction from erroneous references based on a statistical model
Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries

DVHMM: Variable Length Text Recognition Error Model
ICPR '02: Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02), Volume 3

Cited 1

CRF-based authors' name tagging for scanned documents
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries

Abstract

Digitizing printed documents is an important step in constructing a large document archive. Although various document image analysis techniques, such as Optical Character Recognition (OCR), have been developed, real document archive systems still require error handling. This paper discusses the problem from the perspective of quality enhancement and proposes a robust reference extraction method for academic articles scanned with OCR mark-up. We applied the proposed method to articles from various journals, and in these experiments it achieved a recognition accuracy of more than 94%. The paper also discusses manual correction and experimentally investigates the relationship between extraction accuracy and cost reduction.