Performing information extraction to improve OCR error detection in semi-structured historical documents

Authors:
Thomas L. Packer
Affiliations:
Brigham Young University, Provo, Utah
Venue:
Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
Year:
2011


Abstract

Optical character recognition (OCR) produces transcriptions of document images. These transcriptions often contain incorrectly recognized characters, which must be avoided or corrected downstream. The ability to both identify OCR errors and extract information from OCR output would allow us to extract and index only correct information and to post-process specific parts of the OCR output with targeted resources (e.g., re-OCR using specialized dictionaries). We present a general approach to OCR error detection that uses a hidden Markov model (HMM) trained to simultaneously detect OCR errors and extract information. We evaluate this approach in two information extraction settings on semi-structured text from two machine-printed family history documents. The joint approach improves OCR error detection over two alternatives: one based on dictionary matching, and one using an HMM trained only to detect OCR errors. In particular, we report an average increase of 8% in macro-averaged F-measure from the dictionary approach to our best HMM. Our contribution is to show that an OCR error detection approach based on a word model can be improved by joining this task with an information extraction task, and that the improvement in OCR error detection is achieved regardless of the information extraction task.
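The joint formulation described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: the field labels (NAME, DATE, PLACE, OTHER), the "FIELD|STATUS" label encoding, and the helper names are all hypothetical, and the macro-average is assumed to be the unweighted mean of per-class F-measures. The sketch shows only how a joint label space might be built and scored for the error detection task. It does not train or decode an HMM and is not the paper's implementation.

```python
from itertools import product

# Hypothetical label sets; the abstract does not list the paper's actual labels.
FIELD_LABELS = ["NAME", "DATE", "PLACE", "OTHER"]
OCR_LABELS = ["OK", "ERR"]


def joint_states(fields, ocr):
    """Build the cross product so each state carries a field and an OCR status."""
    return [f"{f}|{o}" for f, o in product(fields, ocr)]


def project_to_ocr(joint_sequence):
    """Drop the field part of each joint label, keeping only the OCR status."""
    return [label.split("|")[1] for label in joint_sequence]


def macro_f1(gold, predicted, classes):
    """Unweighted mean of per-class F1 (assumed meaning of 'macro-averaged')."""
    scores = []
    for c in classes:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy data: a decoded joint label sequence and gold OCR statuses for four tokens.
decoded = ["NAME|OK", "NAME|ERR", "DATE|ERR", "PLACE|ERR"]
gold_ocr = ["OK", "OK", "ERR", "ERR"]
score = macro_f1(gold_ocr, project_to_ocr(decoded), OCR_LABELS)
```

Under this encoding, a model trained only to detect OCR errors would use just the two OCR_LABELS states, while the joint model would use all eight combined states and project its output back to OK/ERR for scoring.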