Efficient and scalable indexing techniques for biological sequence data

  • Authors:
  • Mihail Halachev;Nematollaah Shiri;Anand Thamildurai

  • Affiliations:
  • Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada;Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada;Dept. of Computer Science and Software Engineering, Concordia University, Montreal, Canada

  • Venue:
  • BIRD'07 Proceedings of the 1st international conference on Bioinformatics research and development
  • Year:
  • 2007

Abstract

We investigate indexing techniques for sequence data, which are crucial in a wide variety of applications that require efficient, scalable, and versatile search algorithms. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. Existing solutions for very long sequences, however, provide either efficient index construction or efficient search, but not both. We propose a new ST representation, STTD64, which has reasonable construction time and storage requirements and is efficient in search. We have implemented the construction and search algorithms for the proposed technique and conducted numerous experiments to evaluate its performance on various types of real sequence data. Our results show that while the construction time of STTD64 is comparable to that of current ST-based techniques, it outperforms them in search. Compared to ESA, the best-known SA technique, STTD64 is slower to construct but has a similar space requirement and comparable search time. Unlike ESA, which is memory-based, STTD64 is scalable and can handle very long sequences.
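
To make the notion of a suffix-based index concrete, the following is a minimal sketch of a plain suffix array with binary-search lookup. It is not STTD64 or ESA, whose layouts are not described in this abstract. The function names `build_suffix_array` and `find_occurrences` and the sample DNA string are assumptions chosen for illustration, and the naive construction is only practical for short inputs.

```python
def build_suffix_array(text):
    """Naive construction (illustrative only): sort the start positions
    of all suffixes by the lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of `pattern` in `text`.

    Suffixes that begin with `pattern` form one contiguous block in the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


# Hypothetical sample sequence, not taken from the paper's data sets.
text = "GATTACATTAG"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "TTA"))
```

Each lookup in this sketch takes a number of string comparisons logarithmic in the sequence length, but the whole array must be held in memory, which is the kind of limitation that motivates disk-friendly representations for very long sequences.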