Suffix Array Construction in External Memory Using D-Critical Substrings

  • Authors:
  • Ge Nong;Wai Hong Chan;Sen Zhang;Xiao Feng Guan

  • Affiliations:
  • Sun Yat-sen University and SYSU-CMU Shunde International Joint Research Institute;Hong Kong Institute of Education;SUNY College at Oneonta;Sun Yat-sen University

  • Venue:
  • ACM Transactions on Information Systems (TOIS)
  • Year:
  • 2014

Abstract

We present a new suffix array construction algorithm that aims to build, in external memory, the suffix array for an input string of length n measured in the magnitude of tens of Giga characters over a constant or integer alphabet. The core of this algorithm is adapted from the framework of the original internal memory SA-DS algorithm that samples fixed-size d-critical substrings. This new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct a suffix array in a sequential scanning manner with good data spatial locality: data is read from or written to disk sequentially. On the assumed external-memory model with RAM capacity Ω((nB)^0.5), disk capacity O(n), and size of each I/O block B, all measured in log n-bit words, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based solution that could be further exploited to develop external-memory solutions for other suffix-array-related problems, for example, computing the longest-common-prefix array, using a modern personal computer with a typical memory configuration of 4GB RAM and a single disk.
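
For readers unfamiliar with the two objects the abstract refers to, the following minimal Python sketch shows what a suffix array and a longest-common-prefix (LCP) array contain for a small in-memory string. It is not EM-SA-DS or SA-DS: it uses a naive comparison sort and the well-known Kasai-style LCP scan, and the function names `naive_suffix_array` and `lcp_array` are illustrative assumptions rather than anything taken from the paper.

```python
def naive_suffix_array(s):
    """Return the start positions of all suffixes of s in lexicographic order.

    Naive in-memory approach (sorting suffix slices); only meant to show
    what the output of a suffix array construction looks like.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    """Return lcp, where lcp[k] is the length of the longest common prefix
    of the suffixes starting at sa[k-1] and sa[k], and lcp[0] is 0.

    Kasai-style linear scan over the text positions.
    """
    n = len(s)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k  # rank[i] = position of suffix i in the suffix array

    lcp = [0] * n
    h = 0  # length of the current common prefix, reused between steps
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


text = "banana"
sa = naive_suffix_array(text)
lcp = lcp_array(text, sa)
print(sa)   # [5, 3, 1, 0, 4, 2]
print(lcp)  # [0, 1, 3, 0, 0, 2]
```

The sort here touches suffixes in arbitrary order, which is exactly the access pattern that becomes expensive once the text no longer fits in RAM; the abstract's emphasis on sequential disk scans addresses that problem.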