Lossless filter for multiple repetitions with Hamming distance

  • Authors:
  • Pierre Peterlongo;Nadia Pisanti;Frédéric Boyer;Alair Pereira do Lago;Marie-France Sagot

  • Affiliations:
  • IRISA / INRIA, CNRS, Campus de Beaulieu, 35042 Rennes Cedex, France;Dipartimento di Informatica, Università di Pisa, 56127 Pisa, Italy;INRIA Rhône-Alpes and Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Université Claude Bernard, Lyon, 69622 Villeurbanne, France;Instituto de Matemática e Estatística Universidade de São Paulo, 05508-090 São Paulo, Brazil;INRIA Rhône-Alpes and Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Université Claude Bernard, Lyon, 69622 Villeurbanne, France and King's College London, London WC ...

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2008

Abstract

Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created to speed up the solution of the problem. However, previous filters were designed to speed up pattern matching, or to find repetitions between two strings or repetitions occurring twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bi-factor array, which is also presented in this paper. Experimental results show that the filter can be very efficient: when Nimbus is used to preprocess a data set in which one wants to find functional elements with a multiple local alignment tool such as Glam, the overall execution time can be reduced from 7.5 hours to 2 minutes.
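
To make the notion of a gapped seed concrete, the following Python sketch indexes a string by pairs of short factors separated by a fixed gap and keeps only the pairs that occur at least r times. It is an illustration under stated assumptions, not the Nimbus algorithm or the bi-factor array from the paper: the function names, the dictionary-based index, and the parameters k (factor length), gap, and r (minimum number of occurrences) are all hypothetical choices made for this example.

```python
from collections import defaultdict


def hamming(u, v):
    """Number of mismatching positions between two equal-length strings."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))


def bifactor_index(text, k, gap):
    """Map each (left factor, right factor) pair to its start positions.

    Assumption for this sketch: a gapped seed is two factors of length k
    separated by `gap` unconstrained characters. A plain dictionary stands
    in for the paper's bi-factor array.
    """
    index = defaultdict(list)
    span = 2 * k + gap
    for i in range(len(text) - span + 1):
        key = (text[i:i + k], text[i + k + gap:i + span])
        index[key].append(i)
    return index


def shared_bifactors(text, k, gap, r):
    """Keep only the gapped seeds that occur at least r times."""
    return {
        key: positions
        for key, positions in bifactor_index(text, k, gap).items()
        if len(positions) >= r
    }


if __name__ == "__main__":
    text = "ACGTTACCTTACATT"
    seeds = shared_bifactors(text, k=2, gap=1, r=3)
    for (left, right), positions in seeds.items():
        windows = [text[p:p + 5] for p in positions]
        print(left, right, positions, windows)
```

In this toy input the three windows starting at positions 0, 5, and 10 differ only in the gap character, so they are pairwise at Hamming distance 1 and share one gapped seed. A filter in this spirit would discard regions that share too few such seeds before a costlier multiple alignment step is run.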