Compressed indexes for dynamic text collections

  • Authors:
  • Ho-Leung Chan;Wing-Kai Hon;Tak-Wah Lam;Kunihiko Sadakane

  • Affiliations:
  • University of Hong Kong;National Tsing Hua University;University of Hong Kong;Kyushu University

  • Venue:
  • ACM Transactions on Algorithms (TALG)
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Let T be a string with n characters over an alphabet of constant size. A recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [Ferragina and Manzini 2000; Grossi and Vitter 2000]. Yet the compressed nature of such indexes also makes them difficult to update dynamically. This article extends the work on optimal-space indexing to a dynamic collection of texts. Our first result is a compressed solution to the library management problem, where we show an index of O(n) bits for a text collection L of total length n, which can be updated in O(|T| log n) time when a text T is inserted or deleted from L; also, the index supports searching the occurrences of any pattern P in all texts in L in O(|P| log n + occ log2 n) time, where occ is the number of occurrences. Our second result is a compressed solution to the dictionary matching problem, where we show an index of O(d) bits for a pattern collection D of total length d, which can be updated in O(|P| log2 d) time when a pattern P is inserted or deleted from D; also, the index supports searching the occurrences of all patterns of D in any text T in O((|T| + occ)log2 d) time. When compared with the O(d log d)-bit suffix-tree-based solution of Amir et al. [1995], the compact solution increases the query time by roughly a factor of log d only. The solution to the dictionary matching problem is based on a new compressed representation of a suffix tree. Precisely, we give an O(n)-bit representation of a suffix tree for a dynamic collection of texts whose total length is n, which supports insertion and deletion of a text T in O(|T| log2 n) time, as well as all suffix tree traversal operations, including forward and backward suffix links. This work can be regarded as a generalization of the compressed representation of static texts. In the study of the aforementioned result, we also derive the first O(n)-bit representation for maintaining n pairs of balanced parentheses in O(log n/log log n) time per operation, matching the time complexity of the previous O(n log n)-bit solution.