Optical character recognition (OCR) produces transcriptions of document images. These transcriptions often contain incorrectly recognized characters, which must be avoided or corrected downstream. The ability to both identify OCR errors and extract information from OCR output would allow us to extract and index only correct information and to post-process specific parts of the OCR output with targeted resources (e.g., re-OCR using specialized dictionaries). We present a general approach to OCR error detection that uses a hidden Markov model (HMM) trained to simultaneously detect OCR errors and extract information. We evaluate this approach in two information extraction settings on semi-structured text from two machine-printed family history documents. We show that this joint approach improves on two alternatives: one based on dictionary matching, and one using an HMM trained only to detect OCR errors. In particular, we report an average 8% increase in macro-averaged F-measure from the dictionary approach to our best HMM. Our contribution is to show that an OCR error detection approach based on a word model can be improved by joining it with an information extraction task, and that the improvement in OCR error detection is achieved regardless of which information extraction task is used.
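To make the dictionary-matching baseline and the evaluation metric concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the baseline flags any token absent from a lexicon as a suspected OCR error, and that macro-averaging means an unweighted mean of per-class F-measures over the "error" and "correct" classes; the abstract specifies neither detail. The function names, the toy lexicon, and the gold labels are all hypothetical.

```python
def dictionary_flags(tokens, lexicon):
    """Flag a token as a suspected OCR error when it is not in the lexicon.

    Assumption: case-insensitive exact matching against a set of known words.
    """
    return [tok.lower() not in lexicon for tok in tokens]


def f_measure(gold, pred, positive):
    """Balanced F-measure (F1) for one class, treating `positive` as the target."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f_measure(gold, pred):
    """Unweighted mean of per-class F-measures (assumed macro-averaging scheme)."""
    classes = sorted(set(gold) | set(pred))
    return sum(f_measure(gold, pred, c) for c in classes) / len(classes)


# Hypothetical toy data: "Srnith" is an OCR corruption of "Smith",
# and "1850" is a misread year that happens to be in the lexicon.
tokens = ["John", "Srnith", "born", "1850"]
lexicon = {"john", "smith", "born", "1850"}
gold_is_error = [False, True, False, True]

predicted = dictionary_flags(tokens, lexicon)
score = macro_f_measure(gold_is_error, predicted)
```

The toy data illustrates one weakness of pure dictionary matching: an error that yields a valid lexicon entry goes undetected, which is the kind of case where contextual evidence from a sequence model could help.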