On lexical resources for digitization of historical documents

Authors:
Annette Gotscharek;Ulrich Reffle;Christoph Ringlstetter;Klaus U. Schulz
Affiliations:
University of Munich, Munich, Germany;University of Munich, Munich, Germany;University of Munich, Munich, Germany;University of Munich, Munich, Germany
Venue:
Proceedings of the 9th ACM symposium on Document engineering
Year:
2009



Abstract

Many European libraries are currently engaged in mass digitization projects that aim to make historical documents and corpora available online. In this context, appropriate lexical resources play a double role. First, they are needed to improve OCR of historical documents, which currently does not yield satisfactory results. Second, even assuming perfect OCR, historical language differs considerably from modern language, so matching queries submitted to search engines against the variants of the search terms found in historical documents needs special support. While the usefulness of special dictionaries for both problems seems undisputed, concrete knowledge and experience are still missing. There is no guidance on what optimal lexical resources for historical documents should look like, and the actual benefit gained from optimized lexical resources is unclear. Both questions are rather complex, since the answers depend on the point in history at which the documents were written. We present a series of experiments that illuminate these points. For our evaluations we collected a large corpus of German historical documents ranging from before 1500 to 1950 and constructed various types of dictionaries. We report the coverage reached with each dictionary for ten subperiods. Additional experiments show the improvements in OCR accuracy and information retrieval (IR) that can be reached, again looking at distinct dictionaries and periods. For both OCR and IR, our lexical resources lead to substantial improvements.
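
The abstract does not define how coverage is measured. As a rough illustration only, the sketch below assumes token-level coverage, that is, the fraction of corpus tokens in each period that appear in a given dictionary. The function name `coverage_by_period`, the period labels, and the toy word lists are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def coverage_by_period(tokens_with_period, lexicon):
    """Return {period: fraction of that period's tokens found in lexicon}.

    Assumption: coverage is counted per token occurrence, case-insensitively.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, token in tokens_with_period:
        totals[period] += 1
        if token.lower() in lexicon:
            hits[period] += 1
    return {period: hits[period] / totals[period] for period in totals}

# Toy data (hypothetical): two periods, a modern and an extended lexicon.
corpus = [
    ("1500-1550", "vnd"), ("1500-1550", "thun"), ("1500-1550", "und"),
    ("1900-1950", "und"), ("1900-1950", "tun"),
]
modern_lexicon = {"und", "tun"}
historical_lexicon = modern_lexicon | {"vnd", "thun"}

modern_cov = coverage_by_period(corpus, modern_lexicon)
historical_cov = coverage_by_period(corpus, historical_lexicon)
```

On this toy input the modern lexicon covers only part of the older period, while the extended lexicon covers all of it. This mirrors the kind of per-period, per-dictionary comparison the abstract describes, not its actual figures.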