Efficient and scalable indexing techniques for biological sequence data

Authors:
Mihail Halachev;Nematollaah Shiri;Anand Thamildurai
Affiliations:
Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada;Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada;Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada
Venue:
BIRD'07 Proceedings of the 1st international conference on Bioinformatics research and development
Year:
2007

Abstract

We investigate indexing techniques for sequence data, which are crucial in a wide variety of applications that require efficient, scalable, and versatile search algorithms. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. Existing solutions for very long sequences, however, provide either efficient index construction or efficient search, but not both. We propose a new ST representation, STTD64, which has reasonable construction time and storage requirements and supports efficient search. We have implemented the construction and search algorithms for the proposed technique and conducted numerous experiments to evaluate its performance on various types of real sequence data. Our results show that while the construction time for STTD64 is comparable to that of current ST-based techniques, it outperforms them in search. Compared to ESA, the best known SA technique, STTD64 is slower to construct but has similar space requirements and comparable search time. Unlike ESA, which is memory-based, STTD64 is scalable and can handle very long sequences.
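
As background for the index structures named in the abstract, the following is a minimal sketch of a plain suffix array used for exact-match search. It is not STTD64 or ESA, whose layouts are not described here. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the construction is deliberately naive, so it would not scale to the very long sequences the paper targets.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction that sorts the suffixes directly; for illustration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of exact matches of pattern in text."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGTAC"  # assumed toy sequence, not data from the paper
    index = build_suffix_array(dna)
    print(find_occurrences(dna, index, "ACG"))
```

Because the suffixes are sorted, every suffix that starts with the pattern occupies one contiguous range of the array, so two binary searches locate all matches. A suffix tree answers the same query by walking edges from the root, typically trading more space per input character for faster traversal.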