Text Indexing, Suffix Sorting, and Data Compression: Common Problems and Techniques

  • Authors: Roberto Grossi
  • Affiliations: Università di Pisa, Italy
  • Venue: CPM '09 Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching
  • Year: 2009

Abstract

The talk is a guided tour of text indexing data structures, suffix sorting, and data compression. We discuss how they share common problems on text suffixes, showing the interplay among some of the algorithmic techniques that have been devised so far. In the following, given a text T = T[1,n] of n symbols, we denote by s_i its suffix s_i = T[i,n] for 1 ≤ i ≤ n. A text indexing data structure stores the suffixes s_1, s_2, ..., s_n of T at preprocessing time, in a suitable format that can support pattern matching queries over T. For example, given a pattern string P of m symbols, one type of query computes how many times P appears in T; its O(m + log n) time complexity in the comparison model compares favorably with the O(m + n) cost required by full text scanning [8]. Notable examples of text indexing data structures are suffix trees [10,14] and suffix arrays [9] for use in main memory, and string B-trees [4] and cache-oblivious string B-trees [1] for use in external and hierarchical memory, to name a few. Suffix sorting requires arranging the suffixes s_1, s_2, ..., s_n in lexicographic order. This is the major computational bottleneck in suffix-based algorithms, and can be solved in O(n log n) time in the comparison model (e.g. [7]). Having sorted the suffixes, it is not difficult to build a text indexing data structure in (nearly) linear time. Suffix sorting is also crucial in data compression, as witnessed by the importance of the Burrows-Wheeler transform [3]. The techniques adopted in the aforementioned topics have converged in several ways into the rich fields of compressed text indexing [5,6,11,13] and succinct data structures [2,12], with some old and new open problems.
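
To make the shared role of sorted suffixes concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not the method presented in the talk. The function names (suffix_array, count_occurrences, bwt) are hypothetical, and positions are 0-based rather than the 1-based T[1,n] of the abstract. The sort is a naive comparison sort on whole suffixes, so it does not achieve the O(n log n) bound cited above. The counting query uses plain binary search, which costs O(m log n) symbol comparisons; the O(m + log n) bound needs additional longest-common-prefix information that is omitted here. The transform assumes a sentinel symbol "$" that does not occur in the text and is smaller than every text symbol.

```python
def suffix_array(text):
    """Start positions (0-based) of the suffixes of text, in lexicographic order."""
    # Naive suffix sorting: each comparison may inspect up to n symbols,
    # so this is for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def _boundary(text, sa, pattern, past_equal):
    """First rank whose m-symbol prefix is >= pattern (or > pattern if past_equal)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = text[sa[mid]:sa[mid] + m]
        if prefix < pattern or (past_equal and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(text, sa, pattern):
    """Number of occurrences of pattern in text, using the suffix array sa."""
    # Suffixes starting with pattern occupy a contiguous range of ranks in sa.
    first = _boundary(text, sa, pattern, past_equal=False)
    last = _boundary(text, sa, pattern, past_equal=True)
    return last - first


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform of text + sentinel, read off the suffix array."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # The symbol preceding each sorted suffix; index -1 wraps to the sentinel.
    return "".join(t[i - 1] for i in suffix_array(t))
```

For example, suffix_array("banana") yields [5, 3, 1, 0, 4, 2], count_occurrences("banana", [5, 3, 1, 0, 4, 2], "ana") yields 2, and bwt("banana") yields "annb$aa". Both the counting query and the transform are read directly off the same sorted order of suffixes, which is the common ground the abstract refers to.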