Unsupervised profiling of OCRed historical documents

Authors:
Ulrich Reffle;Christoph Ringlstetter
Affiliations:
University of Munich, Center of Information and Language Processing, Germany;University of Munich, Center of Information and Language Processing, Germany
Venue:
Pattern Recognition
Year:
2013

Abstract

More and more OCRed historical documents are becoming available in search engines and digital libraries. Still, access to these texts is often unsatisfactory because of two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation is a barrier to search even when texts are properly reconstructed. As one step towards a solution, we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) "global" information on typical recognition errors found in the OCR output, typical patterns of historical spelling variation, and the vocabulary and word frequencies of the underlying text, and (2) "local" hypotheses on OCR errors and historical orthography for particular tokens of the OCR output. We argue that the availability of this kind of knowledge is a key step towards improving OCR and information retrieval (IR) on historical texts: profiles can be used, for example, to automatically fine-tune postcorrection systems or adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation results show a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and the estimated ranks and scores computed automatically in the profiles. As a specific application, we show how to improve the output of a commercial OCR engine using profiles in a postcorrection system.
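To make the idea of a profile with "local" and "global" parts more concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes that the two channels correspond to historical spelling variation and OCR errors, that each token involves at most one rewrite per channel, and that the lexicon and pattern lists are given in advance. All names and data (`MODERN_LEXICON`, `HIST_PATTERNS`, `OCR_PATTERNS`, `interpret`, `build_profile`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy data, not taken from the paper.
MODERN_LEXICON = {"teil", "tun", "und"}
HIST_PATTERNS = [("t", "th"), ("ei", "ey")]  # modern spelling -> historical spelling
OCR_PATTERNS = [("u", "n"), ("h", "b")]      # printed character -> OCR output

Pattern = Tuple[str, str]


@dataclass(frozen=True)
class Interpretation:
    """One local hypothesis explaining how an OCR token came about."""
    modern: str
    hist_pattern: Optional[Pattern]
    ocr_pattern: Optional[Pattern]


def apply_once(word, patterns):
    """Yield (rewritten_word, pattern) for every single application of a pattern."""
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)


def interpret(token):
    """Local part: all (modern word, historical pattern, OCR error) explanations."""
    found = []
    for modern in sorted(MODERN_LEXICON):
        # Channel 1 (assumed): zero or one historical spelling rewrite.
        hist_forms = [(modern, None)] + list(apply_once(modern, HIST_PATTERNS))
        for hist, hist_pat in hist_forms:
            # Channel 2 (assumed): zero or one OCR error on the historical form.
            ocr_forms = [(hist, None)] + list(apply_once(hist, OCR_PATTERNS))
            for ocr, ocr_pat in ocr_forms:
                if ocr == token:
                    found.append(Interpretation(modern, hist_pat, ocr_pat))
    return found


def build_profile(tokens):
    """Return local hypotheses per token plus global pattern counts."""
    local = {}
    hist_counts, ocr_counts = Counter(), Counter()
    for token in tokens:
        candidates = interpret(token)
        local[token] = candidates
        if not candidates:
            continue  # token stays unexplained in this toy model
        weight = 1 / len(candidates)  # split the vote evenly among hypotheses
        for cand in candidates:
            if cand.hist_pattern:
                hist_counts[cand.hist_pattern] += weight
            if cand.ocr_pattern:
                ocr_counts[cand.ocr_pattern] += weight
    return local, hist_counts, ocr_counts


local, hist_counts, ocr_counts = build_profile(["theil", "tbeil", "nnd", "und"])
```

Here the token `tbeil` is explained as the modern word `teil` written with the historical pattern `t -> th` and then misread with the OCR error `h -> b`. Splitting each token's vote evenly among its hypotheses is a simplification made for this sketch; the abstract states only that the profile contains estimated ranks and scores, and does not say how they are computed.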