Reduction of Expanded Search Terms for Fuzzy English-Text Retrieval

Authors:
Manabu Ohta;Atsuhiro Takasu;Jun Adachi
Venue:
ECDL '98 Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries
Year:
1998

Citing 3

Retrieval methods for English-text with missrecognized OCR characters
ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition

Fuzzy Full-Text Searches in OCR Databases
ADL '95 Selected Papers from the Digital Libraries, Research and Technology Advances

Robust Retrieval of Noisy Text
ADL '96 Proceedings of the 3rd International Forum on Research and Technology Advances in Digital Libraries

Cited 1

Probabilistic Automaton Model for Fuzzy English-Text Retrieval
ECDL '00 Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries

Abstract

Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval in digital libraries. We have proposed fuzzy retrieval methods that assume errors remain in the recognized text instead of correcting them manually, which reduces cost. These methods generate multiple search terms for each input query term by referring to confusion matrices, which store the characters likely to be misrecognized and the probability of each misrecognition. The methods can improve recall without decreasing precision. However, in English fuzzy retrieval, the expansion occasionally generates a few million search terms, which slows retrieval intolerably. This paper therefore presents two heuristics that reduce the number of generated search terms by restricting the number of errors included in each expanded search term, while maintaining retrieval effectiveness.
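
The expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce either of the two heuristics, which the abstract does not specify. It only shows the general idea of generating variants from a confusion matrix and capping the number of substituted characters per variant. The matrix contents, the function name `expand_term`, the `max_errors` parameter, and the scoring (a plain product of substitution probabilities, with the unmodified term given a placeholder score of 1.0) are all assumptions made for illustration.

```python
from itertools import combinations, product

# Hypothetical confusion matrix: true character -> {OCR output: probability}.
# The entries and probabilities are invented for illustration only.
CONFUSION = {
    "l": {"1": 0.05, "i": 0.03},
    "o": {"0": 0.04},
    "e": {"c": 0.02},
}


def expand_term(term, confusion, max_errors):
    """Return {variant: score} for variants with at most max_errors substitutions.

    The score is a simplified stand-in: the product of the probabilities of
    the substitutions applied. The unmodified term gets a placeholder 1.0.
    """
    variants = {term: 1.0}
    # Positions whose character has at least one recorded confusion.
    candidates = [i for i, ch in enumerate(term) if ch in confusion]
    for k in range(1, max_errors + 1):
        # Choose which k positions are misrecognized ...
        for positions in combinations(candidates, k):
            choices = [confusion[term[i]].items() for i in positions]
            # ... and which wrong character appears at each of them.
            for combo in product(*choices):
                chars = list(term)
                score = 1.0
                for i, (wrong, p) in zip(positions, combo):
                    chars[i] = wrong
                    score *= p
                variants["".join(chars)] = score
    return variants


if __name__ == "__main__":
    for limit in (1, 2, 4):
        print(limit, len(expand_term("hello", CONFUSION, limit)))
```

With this toy matrix, "hello" has four confusable positions. Allowing every combination of substitutions gives 36 search terms, while capping at two errors gives 20 and capping at one gives 7. With a realistic confusion matrix and longer query terms, the uncapped count grows multiplicatively with term length, which is the blow-up the abstract refers to.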