A data structure for a sequence of string accesses in external memory

Authors:
Valentina Ciriani;Paolo Ferragina;Fabrizio Luccio;S. Muthukrishnan
Affiliations:
University of Milano, Via Bramante, Crema;University of Pisa, Largo Pontecorvo, Pisa;University of Pisa, Largo Pontecorvo, Pisa;Rutgers University, Piscataway, NJ
Venue:
ACM Transactions on Algorithms (TALG)
Year:
2007

References: 22
Cited by: 4

Self-adjusting binary search trees

Journal of the ACM (JACM)
Skip lists: a probabilistic alternative to balanced trees

Communications of the ACM
Self-adjusting multi-way search trees

Information Processing Letters
Bit-Tree: a data structure for fast file processing

Communications of the ACM
Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Self-adjusting k-ary search trees

Journal of Algorithms
Optimal prefetching via data compression

Journal of the ACM (JACM)
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
The string B-tree: a new data structure for string search in external memory and its applications

Journal of the ACM (JACM)
Alternatives to splay trees with O(log n) worst-case access times

SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms
Biased dictionaries with fast insert/deletes

STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
External memory algorithms and data structures: dealing with massive data

ACM Computing Surveys (CSUR)
Database indexing for large DNA and protein sequence collections

The VLDB Journal — The International Journal on Very Large Data Bases
Algorithm Design and Software Libraries: Recent Developments in the LEDA Project

Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture - Information Processing '92, Volume 1
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Proceedings of the 27th International Conference on Very Large Data Bases
A Fast Index for Semistructured Data

Proceedings of the 27th International Conference on Very Large Data Bases
Topology B-Trees and Their Applications

WADS '95 Proceedings of the 4th International Workshop on Algorithms and Data Structures
Cache-Oblivious Algorithms

FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Engineering a Fast Online Persistent Suffix Tree Construction

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory

IEEE Transactions on Knowledge and Data Engineering
Practical suffix tree construction

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30

Dynamic optimality for skip lists and B-trees

Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
On searching compressed string collections cache-obliviously

Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
B-tries for disk-based string management

The VLDB Journal — The International Journal on Very Large Data Bases
A distribution-sensitive dictionary with low space overhead

Journal of Discrete Algorithms

Abstract

We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of $n$ strings $S_1, \ldots, S_n$, we aim at supporting a search sequence for $m$ not necessarily distinct strings $T_1, T_2, \ldots, T_m$, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches $B$ items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se. The search for the whole sequence $T_1, T_2, \ldots, T_m$ can be done in an expected number of I/Os: $O\left(\sum_{j=1}^{m} |T_j|/B + \sum_{i=1}^{n} n_i \log_B (m/n_i)\right)$, where each $T_j$ may or may not be present in the dictionary, and $n_i$ is the number of times $S_i$ is queried (i.e., the number of $T_j$s equal to $S_i$). Moreover, inserting or deleting a string $S_i$ takes an expected amortized number $O(|S_i|/B + \log_B n)$ of I/Os. The term $\sum_{j=1}^{m} |T_j|/B$ in the search formula is a lower bound for reading the input, and the term $\sum_{i=1}^{n} n_i \log_B (m/n_i)$ (entropy of the query sequence) is a standard information-theoretic lower bound. We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally, we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.
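
To make the two terms of the search bound concrete, the following minimal sketch evaluates them for a given query sequence. It is only an illustration of the formula stated in the abstract, not the SASL data structure itself. The function name `static_optimality_bound` and its parameters are hypothetical, constant factors hidden by the O-notation are ignored, and dictionary strings that are never queried are assumed to contribute zero to the entropy term.

```python
import math
from collections import Counter


def static_optimality_bound(queries, dictionary, B):
    """Evaluate the two terms of the search bound, ignoring constant factors.

    queries    -- the sequence T_1..T_m (strings, not necessarily distinct)
    dictionary -- the set of strings S_1..S_n
    B          -- number of items fetched per disk page (assumed >= 2)
    Returns (scan_term, entropy_term).
    """
    if B < 2:
        raise ValueError("B must be at least 2 for log base B to be defined")
    m = len(queries)
    # First term: sum over j of |T_j| / B, the cost of reading the queries.
    scan_term = sum(len(t) for t in queries) / B
    # n_i = number of queries equal to dictionary string S_i.
    # Queries absent from the dictionary only count toward m and the scan term.
    counts = Counter(t for t in queries if t in dictionary)
    # Second term: sum over i of n_i * log_B(m / n_i); unqueried strings add 0.
    entropy_term = sum(n_i * math.log(m / n_i, B) for n_i in counts.values())
    return scan_term, entropy_term


if __name__ == "__main__":
    dictionary = {"ab", "cd", "ef"}
    skewed = ["ab", "ab", "ab", "cd"]   # one string dominates the sequence
    uniform = ["ab", "cd", "ef", "ab"]  # accesses spread more evenly
    print(static_optimality_bound(skewed, dictionary, B=2))
    print(static_optimality_bound(uniform, dictionary, B=2))
```

Under these assumptions, a sequence that repeats a single dictionary string has an entropy term of zero, while a sequence spread evenly over many strings pushes that term toward $m \log_B m$. This is the sense in which the bound rewards skewed access patterns.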