Lossless filter for finding long multiple approximate repetitions using a new data structure, the bi-factor array

Authors:
Pierre Peterlongo;Nadia Pisanti;Frederic Boyer;Marie-France Sagot
Affiliations:
Institut Gaspard-Monge, Université de Marne-la-Vallée, France;Dipartimento di Informatica, Università di Pisa, Italy;INRIA Rhône-Alpes and LBBE, Univ. Claude Bernard, Lyon, France;INRIA Rhône-Alpes and LBBE, Univ. Claude Bernard, Lyon, France
Venue:
SPIRE'05 Proceedings of the 12th international conference on String Processing and Information Retrieval
Year:
2005

Citing (6):
q-gram based database searching using a suffix array (QUASAR). RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
Indexing Text with Approximate q-Grams. CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Better Filtering with Gapped q-Grams. CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Linear work suffix array construction. Journal of the ACM (JACM)
Linear-time construction of suffix arrays. CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching

Cited by (5):
Algorithms for computing variants of the longest common subsequence problem. Theoretical Computer Science
Lossless filter for multiple repetitions with Hamming distance. Journal of Discrete Algorithms
Succinct gapped suffix arrays. SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Algorithms for computing variants of the longest common subsequence problem. ISAAC'06 Proceedings of the 17th international conference on Algorithms and Computation
Rime: Repeat identification. Discrete Applied Mathematics

Abstract

Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been developed to speed up solving this problem. However, previous filters were designed either to speed up pattern matching or to find repetitions between two sequences or occurring twice in the same sequence. In this paper, we present an algorithm called NIMBUS for filtering sequences prior to finding repetitions that occur more than twice in a sequence or in more than two sequences. NIMBUS uses gapped seeds that are indexed with a new data structure, called a bi-factor array, which is also presented in this paper. Experimental results show that the filter can be very efficient: when NIMBUS is used to preprocess a data set in which one wants to find functional elements with a multiple local alignment tool such as GLAM ([7]), the overall execution time can be reduced from 10 hours to 6 minutes while obtaining exactly the same results.
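
The abstract does not define the bi-factor array, so the following Python sketch only illustrates the general idea of indexing gapped seeds. It assumes that a bi-factor is two factors of length k separated by a fixed number of unconstrained characters, and that the array lists start positions sorted by bi-factor, much as a suffix array sorts suffixes. The function names, the parameters `k`, `gap` and `min_occurrences`, and the naive sort-based construction are all hypothetical; this is neither the NIMBUS filter nor the construction given in the paper.

```python
from collections import defaultdict


def bifactor_key(text, i, k, gap):
    """Return the pair of length-k factors starting at i, skipping `gap` characters.

    Assumed definition for illustration: the characters in the gap are ignored.
    """
    second_start = i + k + gap
    return (text[i:i + k], text[second_start:second_start + k])


def bifactor_array(text, k, gap):
    """Naively list start positions sorted by their bi-factor (ties by position)."""
    span = 2 * k + gap  # total length covered by one bi-factor
    positions = range(len(text) - span + 1)
    return sorted(positions, key=lambda i: (bifactor_key(text, i, k, gap), i))


def repeated_bifactors(text, k, gap, min_occurrences):
    """Group positions that share a bi-factor and keep the frequent groups.

    Positions sharing a gapped seed are candidates for a repetition that
    occurs several times; a real filter would apply further conditions.
    """
    span = 2 * k + gap
    groups = defaultdict(list)
    for i in range(len(text) - span + 1):
        groups[bifactor_key(text, i, k, gap)].append(i)
    return {bf: pos for bf, pos in groups.items() if len(pos) >= min_occurrences}
```

For instance, on the toy text `"ABxCD_AByCD_ABzCD"` with `k=2`, `gap=1` and `min_occurrences=3`, the only bi-factor retained is `("AB", "CD")`, at positions 0, 6 and 12: the three occurrences differ in the gap character but share both flanking factors.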