WallBreaker: overcoming the wall effect in similarity search

  • Authors:
  • Stefan Gerdjikov;Stoyan Mihov;Petar Mitankin;Klaus U. Schulz

  • Affiliations:
  • Sofia University, Sofia, Bulgaria;Institute of Information and Communication Technologies, Bulgarian Academy of Science, Sofia, Bulgaria;Sofia University, Sofia, Bulgaria;Centrum für Informations- und Sprachverarbeitung, Ludwig-Maximilians-Universität München, München, Germany

  • Venue:
  • Proceedings of the Joint EDBT/ICDT 2013 Workshops
  • Year:
  • 2013

Abstract

In this paper we present the WallBreaker system for similarity search as used in the String Similarity Search/Join Competition 2013, organized by the Humboldt University of Berlin [1]. We consider the problem of efficiently finding, for a given string P (the pattern), all words W in a lexicon such that the distance between P and W does not exceed a given bound b. Classical solutions to this problem try to align P with suitable lexicon words in a strict left-to-right manner, starting at the left border of the pattern. During the search, only those prefixes of lexicon words are visited whose distance to some prefix P' of the pattern does not exceed the given bound b. The main problem with this solution is the so-called "wall effect": if we tolerate b errors and search the lexicon from left to right, then in the first b steps we have to consider all prefixes of lexicon words. Eventually, only a tiny fraction of these prefixes lead to a useful lexicon word, so the exhaustive initial search is largely wasted time. To avoid the wall effect, WallBreaker implements our new method, first presented in [3]. To sketch it, assume that the pattern can be aligned with a lexicon word with at most b errors. Clearly, if we divide the pattern into b + 1 pieces, then at least one piece exactly matches the corresponding substring of that lexicon word. In our approach we first find the lexicon substrings that exactly match such a piece of the pattern. We then extend this alignment stepwise, attaching new pieces on the left or right side. At each step more errors are tolerated for the alignment of the new pieces, so that eventually up to b errors are permitted. Since at these later steps the set of substrings still to be extended is already small, the wall effect is avoided and tolerating more errors does no harm.
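The pigeonhole argument above can be illustrated with a small sketch. It assumes that the distance is the Levenshtein edit distance (the abstract does not name the measure), and the function names `edit_distance`, `split_pattern` and `similarity_search` are hypothetical. The sketch only uses the b + 1 pieces as a filter over a linear scan of the lexicon, followed by a full verification; it does not reproduce the stepwise left and right extension of WallBreaker.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def split_pattern(pattern, b):
    """Split the pattern into b + 1 contiguous pieces of nearly equal length."""
    k = b + 1
    base, extra = divmod(len(pattern), k)
    pieces, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        pieces.append(pattern[start:end])
        start = end
    return pieces


def similarity_search(pattern, lexicon, b):
    """Return all lexicon words within edit distance b of the pattern.

    A word is verified only if it contains at least one piece exactly;
    by the pigeonhole argument no true answer is lost by this filter.
    """
    pieces = split_pattern(pattern, b)
    result = []
    for word in lexicon:
        if any(piece in word for piece in pieces):
            if edit_distance(pattern, word) <= b:
                result.append(word)
    return result


lexicon = ["wallbreaker", "wallpaper", "icebreaker", "lexicon"]
print(similarity_search("walbreaker", lexicon, 1))  # ['wallbreaker']
```

In this example "wallpaper" is rejected by the filter alone, while "icebreaker" shares the piece "eaker" with the pattern but fails the verification. If the pattern is shorter than b + 1 characters, some pieces are empty and the filter accepts every word, which remains correct but gives no speed-up.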
For this kind of search strategy, a new representation of the lexicon is needed in which traversal can start at any point of a word. In our approach, the lexicon is represented as a symmetric compact directed acyclic word graph (SCDAWG). This index structure can be seen as part of a longer development of related index structures. Our implementation executes the search queries in parallel. It is written in ANSI C, compiled with GCC, and uses no additional libraries besides LIBC and POSIX threads. On average it performs a similarity search for a 100-character pattern with up to 16 errors in a lexicon of 750 000 entries in about 0.088 ms.
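To make concrete what "starting traversal at any point of a word" requires, here is a deliberately naive stand-in: a table that maps short substrings to the positions where they occur in lexicon words. This is not an SCDAWG and is far less compact; the names `build_infix_index` and `occurrences` and the length limits are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def build_infix_index(lexicon, min_len=2, max_len=4):
    """Map every substring of length min_len..max_len to its occurrences.

    Each occurrence is a pair (word_id, start offset), so a search can
    resume in the middle of a word and extend to the left or right.
    """
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for start in range(len(word)):
            for length in range(min_len, max_len + 1):
                if start + length > len(word):
                    break
                index[word[start:start + length]].add((word_id, start))
    return index


def occurrences(index, piece):
    """Return the sorted (word_id, start) pairs where piece occurs exactly."""
    return sorted(index.get(piece, ()))


lexicon = ["wallbreaker", "icebreaker", "lexicon"]
index = build_infix_index(lexicon)
print(occurrences(index, "eak"))  # [(0, 6), (1, 5)]
```

Each returned pair identifies a word and an offset, from which the text to the left and to the right of the matched piece can be read off for further alignment. The table stores every substring in the chosen length range explicitly, so its size grows quickly with the lexicon; a compact graph-based index is meant to avoid this kind of redundancy.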