An approximate multi-word matching algorithm for robust document retrieval

  • Authors:
  • Atsuhiro Takasu

  • Affiliations:
  • National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

  • Venue:
  • CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
  • Year:
  • 2006

Abstract

Generating documents from low-level data and making use of them is one of the most challenging tasks in document engineering. Detecting word occurrences is a fundamental problem when using documents produced by a recognizer, such as OCR or speech recognition. Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm that finds the occurrences of each word in a text. String similarity is measured by a statistical similarity model that allows similarities to be defined at the character level as well as at the edit-operation level. The proposed algorithm uses tree structures to represent both the word set and the text, so that the similarity of a substring that appears in several places in the text or the words is not computed repeatedly. The time complexity of the proposed algorithm is O(|W|⋅|S|⋅|Q|), where |W| (resp. |S|) denotes the number of nodes in the tree representing the word set (resp. the text), and |Q| denotes the number of states of the model used for string similarity. Experiments show that the proposed algorithm is about six times faster than a naive DP algorithm.
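The idea of sharing DP work through a tree can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes plain unit-cost edit distance in place of the statistical similarity model, and it builds a tree (a trie) only over the word set, not over the text. The names `build_trie` and `find_occurrences` and the threshold `max_dist` are hypothetical and chosen for illustration only.

```python
def build_trie(words):
    """Build a trie so that words sharing a prefix share trie nodes."""
    root = {"children": {}, "word": None}
    for w in words:
        node = root
        for ch in w:
            node = node["children"].setdefault(
                ch, {"children": {}, "word": None})
        node["word"] = w
    return root


def find_occurrences(words, text, max_dist):
    """Return {word: [(end_index, distance), ...]} for approximate matches.

    Assumption: unit-cost edit distance stands in for the statistical
    similarity model described in the abstract.
    """
    root = build_trie(words)
    hits = {}
    # Row for the empty prefix: a match may start anywhere in the text,
    # so every column costs 0.
    stack = [(root, [0] * (len(text) + 1))]
    while stack:
        node, row = stack.pop()
        if node["word"] is not None:
            ends = [(j, d) for j, d in enumerate(row)
                    if j > 0 and d <= max_dist]
            if ends:
                hits[node["word"]] = ends
        for ch, child in node["children"].items():
            # One DP row per trie node: computed once and reused by
            # every word that passes through this node.
            new_row = [row[0] + 1]
            for j in range(1, len(text) + 1):
                cost = 0 if text[j - 1] == ch else 1
                new_row.append(min(row[j - 1] + cost,    # match / substitute
                                   row[j] + 1,           # skip a word char
                                   new_row[j - 1] + 1))  # skip a text char
            # The row minimum never decreases with depth, so a subtree
            # whose row already exceeds the threshold can be pruned.
            if min(new_row) <= max_dist:
                stack.append((child, new_row))
    return hits


if __name__ == "__main__":
    noisy = "robust docurnent retrieva1"  # simulated OCR errors
    print(find_occurrences(["document", "retrieval"], noisy, max_dist=2))
```

In this sketch each trie node costs one pass over the text, so the total work is proportional to the number of trie nodes times the text length, rather than the summed length of all words times the text length as in a word-by-word DP.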