Rime: Repeat identification

Authors:
Maria Federico;Pierre Peterlongo;Nadia Pisanti;Marie-France Sagot
Affiliations:
Dipartimento di Ingegneria dell'Informazione, University of Modena and Reggio Emilia, Italy;INRIA Rennes - Bretagne Atlantique, EPI Symbiose, Rennes, France;Dipartimento di Informatica, University of Pisa, Italy and LIACS Leiden University, The Netherlands;Université Lyon 1, CNRS, UMR5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France and INRIA Grenoble Rhône-Alpes, France
Venue:
Discrete Applied Mathematics
Year:
2014



Abstract

We present an algorithm for detecting long similar fragments that occur at least twice in a set of biological sequences. The problem becomes computationally challenging when the number of occurrences of a repeat is allowed to grow and when a non-negligible number of insertions, deletions and substitutions is allowed. The algorithm, Rime (for Repeat Identification: long, Multiple, and with Edits), performs this task and handles instances whose size and combination of parameters cannot be handled by other existing methods. It achieves this by using a filter as a preprocessing step and then exploiting the information gathered by the filter in the subsequent repeat inference step. To the best of our knowledge, Rime is the first algorithm that can accurately deal with very long repeats (length up to a few thousand) that may occur several times, with a rate of differences (substitutions and indels) among copies of the same repeat of 10-15% or even more.
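
To make the "rate of differences" criterion concrete, the following minimal sketch checks whether two candidate copies of a repeat differ by at most a given fraction of substitutions and indels. It is not Rime's filter or inference step, which the abstract does not specify. It uses the standard Levenshtein (edit) distance, and the function names, the normalization by the longer copy, and the default threshold of 0.15 are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of substitutions,
    insertions and deletions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def within_difference_rate(a: str, b: str, max_rate: float = 0.15) -> bool:
    """Hypothetical helper: True if the edit distance between a and b is at
    most max_rate times the length of the longer copy (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) <= max_rate * longest
```

This quadratic-time check is only practical for comparing a pair of candidate copies directly. For repeats a few thousand characters long occurring several times, that cost is what makes a preprocessing filter attractive.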