De novo identification of repeat families in large genomes

Authors:
Alkes L. Price;Neil C. Jones;Pavel A. Pevzner
Affiliations:
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093-0114, USA;Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093-0114, USA;Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093-0114, USA
Venue:
Bioinformatics
Year:
2005

Cited by (5):

A study of the repetitive structure and distribution of short motifs in human genomic sequences
International Journal of Bioinformatics Research and Applications

A Novel Heuristic for Local Multiple Alignment of Interspersed DNA Repeats
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

An algorithm for the reconstruction of consensus sequences of ancient segmental duplications and transposon copies in eukaryotic genomes
International Journal of Bioinformatics Research and Applications

RepFrag: a graph based method for finding repeats and transposons from fragmented genomes
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology

Image and fractal information processing for large-scale chemoinformatics, genomics analyses and pattern discovery
PRIB'06 Proceedings of the 2006 international conference on Pattern Recognition in Bioinformatics

Abstract

"Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats." Webb Miller (personal communication)

Motivation: De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis.

Results: Our RepeatScout algorithm is more sensitive and is orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that ∼2% of the human genome and 4% of the mouse and rat genomes consist of previously unannotated repetitive sequence.

Availability: Source code is available for download at http://www-cse.ucsd.edu/groups/bioinformatics/software.html

Contact: ppevzner@cs.ucsd.edu