Lossless filter for multiple repetitions with Hamming distance

Authors:
Pierre Peterlongo;Nadia Pisanti;Frédéric Boyer;Alair Pereira do Lago;Marie-France Sagot
Affiliations:
IRISA / INRIA, CNRS, Campus de Beaulieu, 35042 Rennes Cedex, France;Dipartimento di Informatica, Università di Pisa, 56127 Pisa, Italy;INRIA Rhône-Alpes and Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Université Claude Bernard, Lyon, 69622 Villeurbanne, France;Instituto de Matemática e Estatística Universidade de São Paulo, 05508-090 São Paulo, Brazil;INRIA Rhône-Alpes and Laboratoire de Biométrie et Biologie Évolutive, UMR 5558, Université Claude Bernard, Lyon, 69622 Villeurbanne, France and King's College London, London WC ...
Venue:
Journal of Discrete Algorithms
Year:
2008

Abstract

Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created to speed up the solution of the problem. However, previous filters were designed either to speed up pattern matching or to find repetitions that occur between two strings or twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: when a data set is preprocessed with Nimbus before searching for functional elements with a multiple local alignment tool such as Glam, the overall execution time can be reduced from 7.5 hours to 2 minutes.
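
The idea of filtering with gapped seeds can be illustrated with a minimal Python sketch. It is not the Nimbus algorithm and it does not implement the bi-factor array: it assumes a gapped seed is simply two words of length k separated by a fixed number of ignored positions, and it uses a plain dictionary as the index. The function names and the parameters k, gap and r are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def gapped_seed_index(text, k, gap):
    """Map each (left word, right word) pair to its start positions.

    Assumption for this sketch: a seed is two length-k words separated
    by `gap` characters that are ignored when comparing seeds.
    """
    index = defaultdict(list)
    span = 2 * k + gap
    for i in range(len(text) - span + 1):
        left = text[i:i + k]
        right = text[i + k + gap:i + span]
        index[(left, right)].append(i)
    return index


def repeated_seeds(text, k, gap, r):
    """Keep only the seeds that occur at least r times in the text."""
    index = gapped_seed_index(text, k, gap)
    return {seed: pos for seed, pos in index.items() if len(pos) >= r}


# Three copies of a 5-letter word that differ only in the middle letter.
text = "ACAGTACCGTACTGT"
candidates = repeated_seeds(text, k=2, gap=1, r=3)
```

Because the gap position is ignored, the three occurrences share one seed even though they mismatch there; regions that contain no seed repeated at least r times could be discarded before a more expensive alignment step is run.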