An approximate multi-word matching algorithm for robust document retrieval

Authors:
Atsuhiro Takasu
Affiliations:
National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Abstract

Generating documents from low-level data and making use of them is one of the most challenging tasks in document engineering. Detecting word occurrences is a fundamental problem when using documents produced by a recognizer, such as OCR or speech recognition. Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm that finds the occurrences of each word in a text. String similarity is measured by a statistical similarity model that allows similarities to be defined at the character level as well as at the edit-operation level. The proposed algorithm represents both the word set and the text as tree structures, so that the similarity of a substring appearing in several places in the text or in the words is not measured repeatedly. The time complexity of the proposed algorithm is O(|W|⋅|S|⋅|Q|), where |W| (resp. |S|) denotes the number of nodes in the tree representing the word set (resp. the text), and |Q| denotes the number of states of the model used for string similarity. Experiments show that the proposed algorithm is about six times faster than a naive DP algorithm.
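The idea of sharing DP work across words can be illustrated with a minimal sketch. This is not the paper's algorithm: it replaces the statistical similarity model with plain unit-cost edit distance, and it builds a tree (a trie) only over the word set, not over the text. The names `TrieNode`, `build_trie`, `find_occurrences`, and `max_errors` are hypothetical and chosen for illustration only. Each trie node computes one DP row over the text from its parent's row, so words with a common prefix reuse the rows of that prefix.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a trie over the word set (illustrative helper)."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set when a word ends at this node


def build_trie(words: List[str]) -> TrieNode:
    """Insert every word into a trie so that common prefixes share nodes."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def find_occurrences(
    words: List[str], text: str, max_errors: int
) -> Dict[str, List[Tuple[int, int]]]:
    """Return, for each word, pairs (end, distance): some substring of
    `text` ending just before index `end` is within `max_errors` unit-cost
    edits of the word."""
    n = len(text)
    hits: Dict[str, List[Tuple[int, int]]] = {w: [] for w in words}

    def visit(node: TrieNode, row: List[int]) -> None:
        if node.word is not None:
            hits[node.word] = [
                (j, d) for j, d in enumerate(row) if j > 0 and d <= max_errors
            ]
        for ch, child in node.children.items():
            # One DP row per trie node, derived from the parent's row.
            new_row = [row[0] + 1]
            for j in range(1, n + 1):
                cost = 0 if text[j - 1] == ch else 1
                new_row.append(
                    min(row[j - 1] + cost,    # match or substitution
                        row[j] + 1,           # word character missing from text
                        new_row[j - 1] + 1)   # extra character in text
                )
            # Row minima never decrease with depth, so pruning is safe.
            if min(new_row) <= max_errors:
                visit(child, new_row)

    # A row of zeros lets an occurrence start at any position in the text.
    visit(build_trie(words), [0] * (n + 1))
    return hits


if __name__ == "__main__":
    result = find_occurrences(
        ["retrieval", "retrieve"], "document retrieva1 system", 1
    )
    for word, found in result.items():
        print(word, found)
```

In this sketch the words "retrieval" and "retrieve" share the seven DP rows for the prefix "retriev", which is the kind of saving the tree representation is meant to provide. Replacing the unit costs with the scores of a statistical model would add a further dimension for the model states, which is consistent with the |Q| factor in the stated complexity.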