Obtaining provably good performance from suffix trees in secondary storage

  • Authors:
  • Pang Ko; Srinivas Aluru

  • Affiliations:
  • Department of Electrical and Computer Engineering; Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University

  • Venue:
  • CPM'06: Proceedings of the 17th Annual Conference on Combinatorial Pattern Matching
  • Year:
  • 2006

Abstract

Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological sequence data. The suffix tree is an important indexing structure that provides optimal algorithms for memory-bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees in external memory or on layout schemes for suffix trees that preserve link locality. In this paper, we present a new suffix tree layout scheme for secondary storage, together with construction, substring search, insertion and deletion algorithms that are competitive with the string B-tree. For a set of strings of total length n, a pattern p and disk blocks of size B, we provide a substring search algorithm that uses O(|p|/B + log_B n) disk accesses. We present algorithms for insertion and deletion of all suffixes of a string of length m that take O(m log_B(n+m)) and O(m log_B n) disk accesses, respectively. Our results demonstrate that suffix trees can be directly used as efficient secondary storage data structures for string and sequence data.
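To get a feel for the bounds quoted in the abstract, the following minimal sketch evaluates only their leading terms for given values of |p|, m, n and B. It is an illustration, not the paper's algorithm: the function names are hypothetical, the constants hidden by the O-notation are ignored, and the results are not measured disk access counts.

```python
import math


def _check(n: int, block_size: int) -> None:
    # log_B n is only meaningful for B >= 2 and n >= 1.
    if block_size < 2 or n < 1:
        raise ValueError("need block_size >= 2 and n >= 1")


def search_io_bound(pattern_len: int, n: int, block_size: int) -> float:
    """Leading terms of the search bound: |p|/B + log_B n."""
    _check(n, block_size)
    return pattern_len / block_size + math.log(n, block_size)


def insert_io_bound(m: int, n: int, block_size: int) -> float:
    """Leading term of the insertion bound: m * log_B(n + m)."""
    _check(n + m, block_size)
    return m * math.log(n + m, block_size)


def delete_io_bound(m: int, n: int, block_size: int) -> float:
    """Leading term of the deletion bound: m * log_B n."""
    _check(n, block_size)
    return m * math.log(n, block_size)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print(search_io_bound(pattern_len=1024, n=1024**3, block_size=1024))
    print(insert_io_bound(m=1024, n=1024**2 - 1024, block_size=1024))
    print(delete_io_bound(m=10, n=1024**2, block_size=1024))
```

Because the logarithm is taken to base B, the log_B n term grows very slowly for realistic block sizes, so for long patterns the |p|/B term dominates the search bound.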