Exact match search in sequence data using suffix trees

  • Authors:
  • Mihail Halachev;Nematollaah Shiri;Anand Thamildurai

  • Affiliations:
  • Concordia University, Montreal, Quebec, Canada;Concordia University, Montreal, Quebec, Canada;Concordia University, Montreal, Quebec, Canada

  • Venue:
  • Proceedings of the 14th ACM international conference on Information and knowledge management
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

We study suitable indexing techniques to support efficient exact match search in large biological sequence databases. We propose a suffix tree (ST) representation, called STA-DF, as an alternative to the array representation of ST (STA) proposed in [7] and utilized in [18]. To study the performance of STA and STA-DF, we develop a memory efficient ST-based Exact Match (STEM) search algorithm. We implemented STEM and both representations of ST and conducted extensive experiments. Our results indicate that the STA and STA-DF representations are very similar in construction time, storage utilization, and search time using STEM. In terms of the access patterns by STEM, our results show that compared to STA, the STA-DF representation exhibits better spatial and sequential locality of reference. This suggests that STA-DF would require less number of disk I/Os, and hence is more amenable to efficient and scalable disk-based computation.