Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory

  • Authors:
  • Ching-Fung Cheung;Jeffrey Xu Yu;Hongjun Lu

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2005

Abstract

Mammalian genomes are typically 3 Gbps (gigabase pairs) in size. The largest public DNA database, NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov), contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure that efficiently supports exact and approximate sequence matching queries, as well as repetitive structure finding, when they can reside in main memory. However, it has proven difficult to handle long DNA sequences using suffix trees because of the so-called memory bottleneck problem. The most space-efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB of memory to index the human genome [19]. In this paper, we show that suffix trees for long DNA sequences can be constructed efficiently on disk using a small, bounded amount of main memory and, therefore, that all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) construct a disk-based suffix tree without suffix links and 2) rebuild suffix links on the suffix tree constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(n log n) experimental behavior in CPU cost and linear behavior in I/O cost. DynaCluster needs only 16 MB of main memory to construct suffix trees for DNA sequences of more than 200 Mbps, and it significantly outperforms existing disk-based suffix tree construction algorithms that use prepartitioning techniques in both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.
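
To make concrete the idea of a suffix index built without suffix links and then used for exact matching, here is a minimal in-memory sketch. It is not DynaCluster and is not taken from the paper: it builds a naive, uncompressed suffix trie in quadratic time and space, so it is only suitable for toy inputs. The names `build_suffix_trie`, `find_occurrences`, and `LEAF` are hypothetical, and the `$` terminator is assumed not to occur in the input.

```python
LEAF = "#pos"  # hypothetical marker key; cannot clash with single-character edges


def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie.

    No suffix links are created. Time and space are O(n^2), unlike a
    real (compressed) suffix tree.
    """
    text += "$"  # unique terminator, assumed absent from the input
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start  # the leaf records where its suffix begins
    return root


def find_occurrences(root, pattern):
    """Return the sorted start positions of all exact matches of pattern."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that starts with the pattern.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("GATTACA")
print(find_occurrences(trie, "A"))
print(find_occurrences(trie, "TA"))
```

A query walks one edge per pattern character and then collects the leaves below the node it reaches, which is why matching cost does not depend on scanning the whole sequence. A disk-based construction such as the one the abstract describes has to achieve the same structure without holding it all in main memory.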