Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary?

  • Authors:
  • Christian M. Strohmaier; Christoph Ringlstetter; Klaus U. Schulz; Stoyan Mihov

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2
  • Year:
  • 2003

Abstract

Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable number of tokens. Furthermore, if word frequencies are stored with the entries, these frequencies will not properly reflect the frequencies found in the given thematic area. Correction adequacy suffers from these two shortcomings. We report on a series of experiments where we compare (1) the use of fixed, static large-scale dictionaries (including proper names and abbreviations) with (2) the use of dynamic dictionaries retrieved via an automated analysis of the vocabulary of web pages from a given domain, and (3) the use of mixed dictionaries. Our experiments, which address English and German document collections from a variety of fields, show that dynamic dictionaries of the above-mentioned form can improve the coverage for the given thematic area in a significant way and help to improve the quality of lexical postcorrection methods.
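
As a rough illustration of what "coverage" could mean for the three dictionary types compared above, the following minimal Python sketch measures the fraction of tokens that have a dictionary entry. It is not the authors' implementation: the function names (`build_frequency_dictionary`, `merge_dictionaries`, `coverage`), the simple tokenization, the choice to sum frequencies in the mixed dictionary, and all toy data are assumptions made here for illustration only.

```python
from collections import Counter


def build_frequency_dictionary(texts):
    """Count lowercase alphabetic tokens across a list of text strings.

    Hypothetical stand-in for analysing the vocabulary of domain web pages.
    """
    counts = Counter()
    for text in texts:
        for raw in text.split():
            token = raw.strip(".,;:!?()\"'").lower()
            if token.isalpha():
                counts[token] += 1
    return counts


def merge_dictionaries(static_dict, dynamic_dict):
    """Mixed dictionary: sum the frequencies of both sources (an assumption)."""
    return Counter(static_dict) + Counter(dynamic_dict)


def coverage(tokens, dictionary):
    """Fraction of tokens that have an entry in the dictionary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() in dictionary) / len(tokens)


# Toy data, invented for illustration only.
static_dict = Counter({"the": 100, "patient": 5, "was": 80, "treated": 3, "with": 90})
domain_pages = ["Heparin and warfarin are anticoagulants.", "Warfarin dosing varies."]
dynamic_dict = build_frequency_dictionary(domain_pages)
mixed_dict = merge_dictionaries(static_dict, dynamic_dict)

ocr_tokens = ["the", "patient", "was", "treated", "with", "warfarin"]
for name, d in [("static", static_dict), ("dynamic", dynamic_dict), ("mixed", mixed_dict)]:
    print(name, round(coverage(ocr_tokens, d), 3))
```

In this toy setup the static dictionary misses the domain term, the dynamic dictionary misses the common words, and the mixed dictionary covers both, which mirrors the kind of comparison the abstract describes without reproducing any of its results.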