Storage and Retrieval of Individual Genomes

Authors:
Veli Mäkinen;Gonzalo Navarro;Jouni Sirén;Niko Välimäki
Affiliations:
Department of Computer Science, University of Helsinki, Finland;Department of Computer Science, University of Chile, Chile;Department of Computer Science, University of Helsinki, Finland;Department of Computer Science, University of Helsinki, Finland
Venue:
RECOMB 2'09 Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology
Year:
2009

Citing 16
Cited 6

Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
An analysis of the Burrows—Wheeler transform

Journal of the ACM (JACM)
Compact representations of ordered sets

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
New text indexing functionalities of the compressed suffix arrays

Journal of Algorithms
Indexing compressed text

Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

SIAM Journal on Computing
Compressed Data Structures: Dictionaries and Data-Aware Measures

DCC '06 Proceedings of the Data Compression Conference
Succinct suffix arrays based on run-length encoding

Nordic Journal of Computing
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Compressed representations of sequences and full-text indexes

ACM Transactions on Algorithms (TALG)
Compressed Suffix Trees with Full Functionality

Theory of Computing Systems
An(other) Entropy-Bounded Compressed Suffix Tree

CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
Dynamic Fully-Compressed Suffix Trees

CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Fully-compressed suffix trees

LATIN'08 Proceedings of the 8th Latin American conference on Theoretical informatics

Compressed Suffix Arrays for Massive Data

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Sampled longest common prefix array

CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Indexing similar DNA sequences

AAIM'10 Proceedings of the 6th international conference on Algorithmic aspects in information and management
Iterative Dictionary Construction for Compression of Large DNA Data Sets

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Unified view of backward backtracking in short read mapping

Algorithms and Applications
Improved grammar-based compressed indexes

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N . Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O (N logN ) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O (N log*** ) bits, where *** is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N /n . We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.