Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary?

Authors:
Christian M. Strohmaier;Christoph Ringlstetter;Klaus U. Schulz;Stoyan Mihov
Venue:
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
Year:
2003

Abstract

Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable number of tokens. Furthermore, if word frequencies are stored with the entries, these frequencies will not properly reflect the frequencies found in the given thematic area. Correction adequacy suffers from these two shortcomings. We report on a series of experiments where we compare (1) the use of fixed, static large-scale dictionaries (including proper names and abbreviations) with (2) the use of dynamic dictionaries retrieved via an automated analysis of the vocabulary of web pages from a given domain, and (3) the use of mixed dictionaries. Our experiments, which address English and German document collections from a variety of fields, show that dynamic dictionaries of the above-mentioned form can improve the coverage for the given thematic area in a significant way and help to improve the quality of lexical postcorrection methods.
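
The notion of coverage used in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the tokenizer, the function names (`tokenize`, `build_dynamic_dictionary`, `coverage`), and the toy word lists are all assumptions made for illustration, and the "web pages" are plain strings rather than crawled documents. The sketch builds a frequency dictionary from domain texts, merges it with a static one, and measures what fraction of a document's tokens each dictionary contains.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def build_dynamic_dictionary(pages):
    """Count token frequencies over a collection of page texts (hypothetical stand-in for a web crawl)."""
    counts = Counter()
    for page in pages:
        counts.update(tokenize(page))
    return counts


def coverage(dictionary, tokens):
    """Return the fraction of tokens that occur as entries in the dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in dictionary) / len(tokens)


# Toy data, invented for this example only.
static_dict = Counter({"the": 100, "of": 80, "is": 60, "a": 50, "plant": 5})
pages = [
    "Chlorophyll is the pigment of a plant.",
    "Photosynthesis requires chlorophyll.",
]
dynamic_dict = build_dynamic_dictionary(pages)

# A mixed dictionary: entries of both sources, with frequencies added together.
mixed_dict = static_dict + dynamic_dict

document_tokens = tokenize("Photosynthesis is a process of the plant")

for name, d in [("static", static_dict), ("dynamic", dynamic_dict), ("mixed", mixed_dict)]:
    print(f"{name}: {coverage(d, document_tokens):.2f}")
```

In this toy setting the static dictionary misses the domain term "photosynthesis", while the dynamic and mixed dictionaries contain it. Adding frequencies in the mixed dictionary is only one possible way of combining the two sources.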