Indexing similar DNA sequences

Authors:
Songbo Huang; T. W. Lam; W. K. Sung; S. L. Tam; S. M. Yiu
Affiliations:
Department of Computer Science, The University of Hong Kong, Hong Kong; Department of Computer Science, The University of Hong Kong, Hong Kong; Department of Computer Science, National University of Singapore, Singapore; Department of Computer Science, The University of Hong Kong, Hong Kong; Department of Computer Science, The University of Hong Kong, Hong Kong
Venue:
AAIM'10 Proceedings of the 6th international conference on Algorithmic aspects in information and management
Year:
2010

Abstract

To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. Building an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N log N) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m + m log N + m log(rk) psc(P) + occ log n) time, where psc(P) is the number of prefixes of P that appear as a suffix of some common segment and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution and evaluated it using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that of a BWT index built on the concatenation of all sequences. Compared with the other existing solution (RLCSA), our index uses less memory and searches faster.
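
To make the two quantities in the search bound concrete, the following is a minimal brute-force sketch of psc(P) and occ as the abstract defines them. It is not the paper's index and does not achieve the stated bounds; it only illustrates the definitions. The function names and the toy strings are hypothetical, and the sketch assumes that psc(P) counts non-empty prefixes, including P itself, and that overlapping occurrences are counted separately.

```python
def psc(pattern, common_segments):
    """Count the non-empty prefixes of `pattern` that are a suffix of
    at least one common segment (assumption: P itself counts as a prefix)."""
    count = 0
    for length in range(1, len(pattern) + 1):
        prefix = pattern[:length]
        if any(segment.endswith(prefix) for segment in common_segments):
            count += 1
    return count


def count_occurrences(pattern, sequences):
    """Count occurrences of `pattern` across all sequences by plain scanning
    (assumption: overlapping occurrences are counted separately)."""
    total = 0
    for sequence in sequences:
        start = sequence.find(pattern)
        while start != -1:
            total += 1
            # Advance by one position so overlapping matches are found too.
            start = sequence.find(pattern, start + 1)
    return total


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    segments = ["ACGTAC", "GGATTC"]
    sequences = ["ACGTAC", "ACAC"]
    print(psc("ACGG", segments))
    print(count_occurrences("AC", sequences))
```

In the toy data, only the prefix "AC" of "ACGG" ends a common segment, which matches the abstract's remark that psc(P) tends to be small in practice.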