Post-processing of OCR results for automatic indexing

Authors:
L. Wiedenhöfer;H.-G. Hein;A. Dengel
Affiliations:
-;-;-
Venue:
ICDAR '95 Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 2)
Year:
1995

Abstract

Indexing inaccurately recognized OCR text yields unsatisfactory results: the quality of the index terms decreases rapidly as the quality of the documents gets worse. Index terms of OCR-processed documents can be used for archiving or classification tasks. We present an indexing component whose input is a set of character hypothesis lattices. These are post-processed by a generate-and-test component that feeds word candidates to a morphology component, a rule-based substitution system, and a trigram correction component. Stop words are filtered by a Levenshtein-based elimination routine. The recognized words are subsequently processed by our indexing component. Our system maximizes the number of generated index terms that are correct German words. The experiments have shown an increase in accuracy of nearly 10%.
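
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simplified lattice (one list of alternative characters per position), and a plain lexicon lookup stands in for the morphology, rule-based substitution, and trigram components. All function names, the sample lexicon, and the distance threshold are hypothetical.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def generate_candidates(lattice):
    """Generate step: spell out every word obtainable by choosing one
    character hypothesis per position (assumed lattice shape)."""
    return ["".join(chars) for chars in product(*lattice)]


def is_stop_word(candidates, stop_words, max_distance=1):
    """True if any candidate lies within max_distance edits of a stop word."""
    return any(levenshtein(c, s) <= max_distance
               for c in candidates for s in stop_words)


def index_terms(lattices, lexicon, stop_words, max_distance=1):
    """Test step: drop near-stop-words, then keep the first candidate
    found in the lexicon (a stand-in for the paper's correction components)."""
    terms = []
    for lattice in lattices:
        candidates = generate_candidates(lattice)
        if is_stop_word(candidates, stop_words, max_distance):
            continue
        accepted = [c for c in candidates if c in lexicon]
        if accepted:
            terms.append(accepted[0])
    return terms


# Hypothetical sample data: lowercase German words with OCR confusions.
lexicon = {"rechnung", "lieferung"}
stop_words = {"und", "der", "die"}
lattices = [
    [["u", "v"], ["n"], ["d"]],                                    # "und"
    [["r", "b"], ["e"], ["c"], ["h"], ["n"], ["u", "o"], ["n"], ["g"]],
    [["d"], ["c"], ["r"]],                                         # "dcr", near "der"
    [["l", "1"], ["i"], ["e"], ["f"], ["e"], ["r"], ["u"], ["n"], ["g"]],
    [["x"], ["q"], ["z"], ["k"]],                                  # noise
]
print(index_terms(lattices, lexicon, stop_words))  # ['rechnung', 'lieferung']
```

A threshold of one edit is enough to catch the misrecognized "dcr" as the stop word "der", but such a small threshold can also discard short content words, so in practice it would need tuning against the stop-word list.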