Engineering a Fast Online Persistent Suffix Tree Construction

  • Authors:
  • Srikanta J. Bedathur; Jayant R. Haritsa

  • Venue:
  • ICDE '04 Proceedings of the 20th International Conference on Data Engineering
  • Year:
  • 2004

Abstract

Online persistent suffix tree construction has been considered impractical due to its excessive I/O costs. However, prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on the I/O behavior of construction and of subsequent retrievals over the tree. In this paper, we study these two issues in detail in the context of large genomic DNA and protein sequences. In particular, we make the following contributions: (i) a novel, low-overhead buffering policy called TOP-Q, which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space-efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable than currently perceived, in terms of the length of sequences that can be indexed with a fixed memory budget.
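
To make the contrast in contribution (ii) concrete, the following is a minimal sketch of the two node layouts. It is not the authors' implementation: the class names `ArrayNode` and `ListNode`, the four-letter DNA alphabet, and the use of "nodes touched" as a rough stand-in for disk accesses are all assumptions made for illustration. The array layout reaches a child directly through a per-symbol slot, while the linked-list layout saves space but may have to walk a sibling chain.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALPHABET = "ACGT"  # assumed DNA alphabet, for illustration only


@dataclass
class ArrayNode:
    """Hypothetical array layout: one child slot per alphabet symbol."""
    children: List[Optional["ArrayNode"]] = field(
        default_factory=lambda: [None] * len(ALPHABET))

    def add_child(self, symbol: str, node: "ArrayNode") -> None:
        self.children[ALPHABET.index(symbol)] = node

    def find_child(self, symbol: str):
        # Direct slot lookup: at most one child node is touched.
        child = self.children[ALPHABET.index(symbol)]
        return child, (1 if child is not None else 0)


@dataclass
class ListNode:
    """Hypothetical linked-list layout: first-child / next-sibling pointers."""
    symbol: str = ""
    first_child: Optional["ListNode"] = None
    next_sibling: Optional["ListNode"] = None

    def add_child(self, symbol: str, node: "ListNode") -> None:
        # Prepend the new child to the sibling chain.
        node.symbol = symbol
        node.next_sibling = self.first_child
        self.first_child = node

    def find_child(self, symbol: str):
        # Walk the sibling chain, counting every node visited on the way.
        touched = 0
        cur = self.first_child
        while cur is not None:
            touched += 1
            if cur.symbol == symbol:
                return cur, touched
            cur = cur.next_sibling
        return None, touched


# Build one root of each kind with a child for every symbol.
a_root, l_root = ArrayNode(), ListNode()
for s in ALPHABET:
    a_root.add_child(s, ArrayNode())
    l_root.add_child(s, ListNode())

print(a_root.find_child("A")[1])  # 1: direct slot access
print(l_root.find_child("A")[1])  # 4: "A" was added first, so it is last in the chain
```

In a persistent tree each node touched may sit on a different disk page, so the extra sibling visits in the linked-list layout are one plausible source of additional I/O. The sketch only counts node visits; it does not model pages, buffering, or the TOP-Q policy.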