Retrieval methods for English-text with missrecognized OCR characters

Authors:
Manabu Ohta; Atsuhiro Takasu; Jun Adachi
Venue:
ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Year:
1997

Citing 0
Cited 13

Information Retrieval from Documents: A Survey
Information Retrieval

Reduction of Expanded Search Terms for Fuzzy English-Text Retrieval
ECDL '98 Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries

Word Searching in Document Images Using Word Portion Matching
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V

Spotting Where to Read on Pages - Retrieval of Relevant Parts from Page Images
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V

Information Retrieval in Document Image Databases
IEEE Transactions on Knowledge and Data Engineering

Noisy Text Categorization
IEEE Transactions on Pattern Analysis and Machine Intelligence

Indexing and retrieval of handwritten medical forms
dg.o '07 Proceedings of the 8th annual international conference on Digital government research: bridging disciplines & domains

Document image analysis for active reading
SADPI '07 Proceedings of the 2007 international workshop on Semantically aware document processing and indexing

A Framework for Managing Multimodal Digitized Music Collections
ECDL '08 Proceedings of the 12th European conference on Research and Advanced Technology for Digital Libraries

Handwritten document retrieval strategies
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

A survey of keyword spotting techniques for printed document images
Artificial Intelligence Review

Keyword spotting on korean document images by matching the keyword image
ICADL'05 Proceedings of the 8th international conference on Asian Digital Libraries: implementing strategies and sharing experiences

The impact of OCR accuracy and feature transformation on automatic text classification
DAS'06 Proceedings of the 7th international conference on Document Analysis Systems

Abstract

This paper presents three probabilistic text retrieval methods for full-text search of English documents that contain OCR errors. Because each query term is searched for on the premise that the recognized text contains errors, the methods tolerate such errors, and costly manual post-editing after OCR is not required. The approach uses confusion matrices to store, for each character, the characters it is likely to be misrecognized as and the probability of each such substitution. In addition, a 2-gram matrix stores character connection probabilities, i.e., how likely one letter is to follow another. Multiple search terms are generated for an input query term by referring to the confusion matrices, and a full-text search is then run for each search term. The validity of each retrieved term is determined from the error-occurrence and character connection probabilities. The methods are evaluated experimentally in terms of retrieval effectiveness, i.e., recall and precision rates. The results indicate a marked improvement over exact matching.
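
The pipeline outlined in the abstract (expand a query term through a confusion matrix, search for every variant, then attach probabilities to each hit) can be illustrated with a minimal Python sketch. It does not reproduce any of the paper's three methods. The tables `CONFUSION` and `BIGRAM`, the constant `BIGRAM_FLOOR`, the function names, and all probability values are hypothetical and chosen only for illustration. The abstract does not say how the two probabilities are combined into a validity decision, so the sketch reports them side by side.

```python
from itertools import product

# Hypothetical toy confusion matrix: CONFUSION[c] maps a true character c to
# the strings OCR might output in its place, with made-up probabilities.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.10},
    "m": {"m": 0.85, "rn": 0.15},
    "o": {"o": 0.95, "0": 0.05},
}

# Hypothetical toy 2-gram table: BIGRAM[(a, b)] approximates P(b follows a).
BIGRAM = {("m", "o"): 0.20, ("o", "d"): 0.10, ("d", "e"): 0.30, ("e", "l"): 0.25}
BIGRAM_FLOOR = 0.01  # assumed probability for character pairs not in the table


def expand_term(term, min_prob=0.001):
    """Return {search_term: error_probability} for OCR variants of `term`."""
    # Characters without a confusion entry are assumed to be read correctly.
    options = [CONFUSION.get(ch, {ch: 1.0}).items() for ch in term]
    variants = {}
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob >= min_prob:  # prune very unlikely variants
            text = "".join(s for s, _ in combo)
            variants[text] = variants.get(text, 0.0) + prob
    return variants


def connection_prob(s):
    """Product of 2-gram probabilities over adjacent characters of `s`."""
    prob = 1.0
    for a, b in zip(s, s[1:]):
        prob *= BIGRAM.get((a, b), BIGRAM_FLOOR)
    return prob


def search(text, term):
    """Find every variant of `term` in `text`.

    Returns (position, variant, error_prob, connection_prob) tuples,
    sorted by descending error probability.
    """
    hits = []
    for variant, err_prob in expand_term(term).items():
        start = text.find(variant)
        while start != -1:
            hits.append((start, variant, err_prob, connection_prob(variant)))
            start = text.find(variant, start + 1)
    return sorted(hits, key=lambda h: -h[2])


if __name__ == "__main__":
    ocr_text = "the rnodel and the mode1 differ from the model"
    for hit in search(ocr_text, "model"):
        print(hit)
```

On the sample string, exact matching finds only the last occurrence of "model", while the expanded search also reaches the misrecognized forms "rnodel" and "mode1". A retrieval system would then accept or reject each hit using the two reported probabilities.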