More and more OCRed historical documents are becoming available in search engines and digital libraries. Still, access to these texts is often unsatisfactory because of two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation is a barrier to search even when the text has been reconstructed correctly. As one step towards a solution, we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) "global" information on typical recognition errors found in the OCR output, typical patterns of historical spelling variation, and the vocabulary and word frequencies of the underlying text, and (2) "local" hypotheses on the OCR errors and historical orthography of particular tokens in the OCR output. We argue that the availability of this kind of knowledge is a key step towards improving OCR and Information Retrieval (IR) on historical texts: profiles can be used, for example, to automatically fine-tune postcorrection systems, to adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation shows a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and the estimated ranks and scores computed automatically in the profiles. As a specific application, we show how profiles can be used in a postcorrection system to improve the output of a commercial OCR engine.
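To make the idea of a two-channel profile concrete, the following is a minimal Python sketch of how local hypotheses and global pattern counts could fit together. It is not the authors' method, whose estimation procedure is not described in the abstract. The toy lexicon, the spelling pattern `t -> th`, the OCR error patterns `u -> n` and `h -> b`, the limit of one pattern application per channel, and the uniform weighting of competing hypotheses are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

Pattern = tuple[str, str]  # (left side, right side), e.g. ("t", "th")


@dataclass(frozen=True)
class Interpretation:
    """One local hypothesis for an OCR token (hypothetical structure)."""
    modern: str                      # assumed modern lexicon word
    hist_pattern: Optional[Pattern]  # channel 1: historical spelling variation
    ocr_pattern: Optional[Pattern]   # channel 2: OCR recognition error


def apply_once(word, patterns):
    """Yield the unchanged word, then every single application of a pattern."""
    yield word, None
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)


def interpret(token, lexicon, hist_patterns, ocr_patterns):
    """Collect every (modern word, spelling pattern, OCR error) that yields token."""
    found = []
    for modern in lexicon:
        for hist_form, hist_pattern in apply_once(modern, hist_patterns):
            for ocr_form, ocr_pattern in apply_once(hist_form, ocr_patterns):
                if ocr_form == token:
                    found.append(Interpretation(modern, hist_pattern, ocr_pattern))
    return found


def build_profile(tokens, lexicon, hist_patterns, ocr_patterns):
    """Return local hypotheses per token plus global counts for both channels."""
    local = []
    hist_counts, ocr_counts = Counter(), Counter()
    for token in tokens:
        hypotheses = interpret(token, lexicon, hist_patterns, ocr_patterns)
        local.append((token, hypotheses))
        for hyp in hypotheses:
            weight = 1.0 / len(hypotheses)  # assumption: uniform weighting
            if hyp.hist_pattern is not None:
                hist_counts[hyp.hist_pattern] += weight
            if hyp.ocr_pattern is not None:
                ocr_counts[hyp.ocr_pattern] += weight
    return local, hist_counts, ocr_counts


# Toy input: all words and patterns below are illustrative assumptions.
lexicon = ["teil", "und"]
hist_patterns = [("t", "th")]
ocr_patterns = [("u", "n"), ("h", "b")]
tokens = ["theil", "nnd", "tbeil"]

local, hist_counts, ocr_counts = build_profile(
    tokens, lexicon, hist_patterns, ocr_patterns
)
for token, hypotheses in local:
    print(token, hypotheses)
print(hist_counts.most_common())
print(ocr_counts.most_common())
```

In this sketch the list `local` plays the role of the token-level hypotheses, while `hist_counts` and `ocr_counts` stand in for the global channel. A ranking of patterns obtained this way is the kind of estimate that could be compared against a ground-truth distribution.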