Efficiently generating correction suggestions for garbled tokens of historical language

Authors:
Ulrich Reffle
Affiliations:
Centrum für Informations- und Sprachverarbeitung, University of Munich, Germany. Email: uli@cis.uni-muenchen.de
Venue:
Natural Language Engineering
Year:
2011


Abstract

Text correction systems rely on a core mechanism that generates suitable correction suggestions for garbled input tokens. Current systems, which are designed for documents written in modern language, use some form of approximate search in a given background lexicon. Because historical documents exhibit a large amount of spelling variation, special lexica for historical language can offer only restricted coverage. Hence historical language is often described in terms of a matching procedure to be applied to modern words. Given such a procedure and a base lexicon of modern words, the question arises of how to generate correction suggestions for garbled historical variants. In this paper we suggest an efficient algorithm that solves this problem. The algorithm is used for postcorrection of optical character recognition results on historical document collections.
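To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the efficient algorithm the paper proposes: it simply expands every modern lexicon word into hypothetical historical variants and compares each variant to the garbled token by Levenshtein distance. The function names, the tiny lexicon, and the rewrite patterns (`t` to `th`, `ei` to `ey`) are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def historical_variants(word, patterns, max_patterns=2):
    """Map each variant of `word` to the patterns used to derive it.

    Patterns are (modern, historical) pairs with a non-empty modern side.
    They are applied left to right at non-overlapping positions, at most
    `max_patterns` times. The unchanged word is always included.
    """
    results = {}

    def expand(pos, prefix, applied):
        if pos == len(word):
            results.setdefault(prefix, applied)
            return
        # Option 1: copy the current character unchanged.
        expand(pos + 1, prefix + word[pos], applied)
        # Option 2: apply a rewrite pattern that matches at this position.
        if len(applied) < max_patterns:
            for left, right in patterns:
                if word.startswith(left, pos):
                    expand(pos + len(left), prefix + right,
                           applied + ((left, right),))

    expand(0, "", ())
    return results


def suggest(token, lexicon, patterns, max_dist=1):
    """Return sorted (distance, n_patterns, variant, modern_word) tuples."""
    suggestions = []
    for modern in lexicon:
        for variant, applied in historical_variants(modern, patterns).items():
            dist = levenshtein(token, variant)
            if dist <= max_dist:
                suggestions.append((dist, len(applied), variant, modern))
    return sorted(suggestions)


# Hypothetical data: a toy modern lexicon and two assumed rewrite patterns.
LEXICON = ["teil", "turm", "tal"]
PATTERNS = [("t", "th"), ("ei", "ey")]

if __name__ == "__main__":
    # "tbeyl" stands in for an OCR-garbled reading of the variant "theyl".
    for entry in suggest("tbeyl", LEXICON, PATTERNS):
        print(entry)
```

For the garbled token `tbeyl`, every suggestion this sketch returns points back to the modern word `teil`, via the hypothetical variants `teyl` and `theyl`. Enumerating all variants of all lexicon words grows quickly with lexicon and pattern-set size, which is the cost an efficient algorithm has to avoid.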