Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays

  • Authors:
  • Veli Mäkinen;Gonzalo Navarro;Kunihiko Sadakane

  • Affiliations:
  • Dept of Computer Science, Univ of Helsinki, Finland;Center for Web Research, Dept of Computer Science, Univ of Chile, Chile;Dept of Computer Science and Communication Engineering, Kyushu Univ., Japan

  • Venue:
  • ISAAC'04 Proceedings of the 15th international conference on Algorithms and Computation
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000] The CSA needs n(H0+O(log logσ)) bits of space, where n is the text size, σ is the alphabet size, and H0 the zero-order entropy of the text The number of occurrences of a pattern of length m can be computed in O(mlog n) time Most notably, the CSA does not need the text separately available to operate The CSA simulates a binary search over the suffix array, where the query is compared against text substrings These are extracted from the same CSA by following irregular access patterns over the structure Sadakane [SODA 2002] has proposed using backward searching on the CSA in similar fashion as the FM-index of Ferragina and Manzini [FOCS 2000] He has shown that the CSA can be searched in O(m) time whenever σ = O(polylog(n)). In this paper we consider some other consequences of backward searching applied to CSA The most remarkable one is that we do not need, unlike all previous proposals, any complicated sub-linear structures based on the four-Russians technique (such as constant time rank and select queries on bit arrays) We show that sampling and compression are enough to achieve O(mlog n) query time using less space than the original structure It is also possible to trade structure space for search time Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m logBn) disk accesses, being B the disk block size Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.