Information Retrieval can Cope with Many Errors
Information Retrieval
Indexing inaccurately recognized OCR text yields unsatisfactory results: the quality of the index terms decreases rapidly as the quality of the documents gets worse. Index terms of OCR-processed documents can be used for archiving or classification tasks. We present an indexing component whose input is a set of character hypothesis lattices. These are post-processed by a generate-and-test component that feeds word candidates to a morphology component, a rule-based substitution system, and a trigram correction component. Stop words are filtered by a Levenshtein-based elimination routine. The recognized words are subsequently processed by our indexing component. Our system minimizes the number of generated index terms which are correct German words. The experiments have shown an increase in accuracy of nearly 10%.
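To illustrate what a Levenshtein-based stop word elimination might look like, here is a minimal Python sketch. It is not the authors' routine: the stop list, the distance threshold `max_distance`, the length guard `min_length`, and the function names are all assumptions chosen for illustration. The idea is that an OCR-damaged token such as "nlcht" is still recognized as the stop word "nicht" because it lies within a small edit distance of it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical, tiny German stop list; a real one would be much larger.
STOP_WORDS = {"der", "die", "das", "und", "oder", "nicht", "eine"}


def is_stop_word(token, stop_words=STOP_WORDS, max_distance=1, min_length=4):
    """Return True if the token matches a stop word exactly or approximately.

    Approximate matching is only attempted when both the token and the stop
    word have at least min_length characters (an assumed guard), because
    very short words are within one edit of too many content words.
    """
    word = token.lower()
    if word in stop_words:
        return True
    if len(word) < min_length:
        return False
    return any(
        len(stop) >= min_length and levenshtein(word, stop) <= max_distance
        for stop in stop_words
    )


def filter_stop_words(tokens):
    """Drop every token that is judged to be a (possibly misrecognized) stop word."""
    return [t for t in tokens if not is_stop_word(t)]
```

Under these assumptions, `filter_stop_words(["Die", "Rechnung", "nlcht", "bezahlt"])` keeps only the two content words. The threshold trades recall for precision: a larger `max_distance` catches more damaged stop words but also risks discarding legitimate index terms.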