A data structure for a sequence of string accesses in external memory

  • Authors:
  • Valentina Ciriani;Paolo Ferragina;Fabrizio Luccio;S. Muthukrishnan

  • Affiliations:
  • University of Milano, Via Bramante, Crema;University of Pisa, Largo Pontecorvo, Pisa;University of Pisa, Largo Pontecorvo, Pisa;Rutgers University, Piscataway, NJ

  • Venue:
  • ACM Transactions on Algorithms (TALG)
  • Year:
  • 2007

Abstract

We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of $n$ strings $S_1, \ldots, S_n$, we aim at supporting a search sequence for $m$ not necessarily distinct strings $T_1, T_2, \ldots, T_m$, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches $B$ items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, which is also interesting per se. The search for the whole sequence $T_1, T_2, \ldots, T_m$ can be done in an expected number of I/Os: $O\left(\sum_{j=1}^{m} |T_j|/B + \sum_{i=1}^{n} n_i \log_B (m/n_i)\right)$, where each $T_j$ may or may not be present in the dictionary, and $n_i$ is the number of times $S_i$ is queried (i.e., the number of $T_j$'s equal to $S_i$). Moreover, inserting or deleting a string $S_i$ takes an expected amortized number $O(|S_i|/B + \log_B n)$ of I/Os. The term $\sum_{j=1}^{m} |T_j|/B$ in the search formula is a lower bound for reading the input, and the term $\sum_{i=1}^{n} n_i \log_B (m/n_i)$ (entropy of the query sequence) is a standard information-theoretic lower bound. We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally, we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.
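
To make the two terms of the search bound concrete, the following Python sketch evaluates them for a given query sequence. It is only an illustration of the formula stated in the abstract, not of the SASL structure itself: the function name `search_bound_terms`, the representation of the dictionary as a Python set, and measuring string length in characters are all assumptions made here, and the constants hidden by the O-notation are ignored.

```python
import math
from collections import Counter


def search_bound_terms(dictionary, queries, B):
    """Evaluate the two terms of the search bound, constants dropped.

    dictionary: set of strings S_1..S_n (hypothetical representation)
    queries:    list of strings T_1..T_m, possibly repeated or absent
    B:          number of items fetched per disk page (must be >= 2)

    Returns (scan_term, entropy_term), where
      scan_term    = sum_j |T_j| / B
      entropy_term = sum_i n_i * log_B(m / n_i)
    """
    if B < 2:
        raise ValueError("B must be at least 2 so that log base B is defined")
    m = len(queries)

    # First term: cost of reading the query strings themselves.
    scan_term = sum(len(t) for t in queries) / B

    # n_i counts only queries equal to some dictionary string S_i;
    # strings with n_i = 0 contribute nothing to the second term.
    counts = Counter(t for t in queries if t in dictionary)
    entropy_term = sum(n_i * math.log(m / n_i, B) for n_i in counts.values())

    return scan_term, entropy_term
```

Under these assumptions, a sequence that repeats a single dictionary string m times has n_i = m, so log_B(m/n_i) = log_B 1 = 0 and the entropy term vanishes, while a sequence spread evenly over many dictionary strings makes that term grow. This is the sense in which the bound rewards skewed access sequences.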