Text Indexing, Suffix Sorting, and Data Compression: Common Problems and Techniques

Authors:
Roberto Grossi
Affiliations:
Università di Pisa, Italy
Venue:
CPM '09 Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching
Year:
2009

Citing 11
Cited 0

Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing
The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM (JACM)
A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM (JACM)
Membership in Constant Time and Almost-Minimum Space. SIAM Journal on Computing
New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms
Indexing compressed text. Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing
Cache-oblivious string B-trees. Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Compressed full-text indexes. ACM Computing Surveys (CSUR)
Linear work suffix array construction. Journal of the ACM (JACM)
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms (TALG)

Abstract

The talk is a guided tour of text indexing data structures, suffix sorting, and data compression. We discuss how they share common problems on text suffixes, showing the interplay among some of the algorithmic techniques that have been devised so far. In the following, given a text T = T[1,n] of n symbols, we denote by s_i its suffix s_i = T[i,n] for 1 ≤ i ≤ n. A text indexing data structure stores the suffixes s_1, s_2, ..., s_n of T at preprocessing time, in a suitable format that can support pattern matching queries over T. For example, given a pattern string P of m symbols, one type of query counts how many times P occurs in T; its O(m + log n) time complexity in the comparison model compares favorably with the O(m + n) cost of scanning the full text [8]. Notable examples of text indexing data structures are suffix trees [10,14] and suffix arrays [9] for use in main memory, and string B-trees [4] and cache-oblivious string B-trees [1] for use in external and hierarchical memory, to name a few. Suffix sorting requires arranging the suffixes s_1, s_2, ..., s_n in lexicographic order. This is the major computational bottleneck in suffix-based algorithms, and can be solved in O(n log n) time in the comparison model (e.g. [7]). Once the suffixes are sorted, it is not difficult to build a text indexing data structure in (nearly) linear time. Suffix sorting is also crucial in data compression, as witnessed by the importance of the Burrows-Wheeler transform [3]. The techniques adopted in the aforementioned topics have converged in several ways into the rich fields of compressed text indexing [5,6,11,13] and succinct data structures [2,12], with some old and new open problems.
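
The following Python sketch is illustrative only and is not taken from the talk. It shows how the three topics of the abstract meet on the sorted suffixes: a suffix array, a counting query answered by binary search, and the Burrows-Wheeler transform read off the same array. The function names (suffix_array, count_occurrences, bwt) are hypothetical, positions are 0-based rather than the 1-based indexing used above, and the text is assumed to end with a unique sentinel '$' that is smaller than every other symbol. The construction is a naive comparison sort of the suffix strings, and the query costs O(m log n), so neither matches the bounds quoted in the abstract.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Return the 0-based start positions of the suffixes in lexicographic order.

    Naive approach: sort positions by the suffix string they start.
    This is simple but slower than the O(n log n) methods cited above.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: List[int], pattern: str) -> int:
    """Count occurrences of pattern in text using two binary searches on sa.

    Each probe compares at most len(pattern) symbols, giving O(m log n) time.
    The O(m + log n) bound needs additional LCP information, omitted here.
    """
    m = len(pattern)

    def first_index(strict: bool) -> int:
        # First rank whose length-m prefix is >= pattern (strict=False)
        # or > pattern (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # Suffixes that start with pattern occupy a contiguous range of ranks.
    return first_index(strict=True) - first_index(strict=False)


def bwt(text: str, sa: List[int]) -> str:
    """Burrows-Wheeler transform: the symbol preceding each sorted suffix.

    Assumes text ends with a unique, smallest sentinel; for the suffix at
    position 0 the index -1 wraps around to that sentinel.
    """
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    print(sa)                                   # [6, 5, 3, 1, 0, 4, 2]
    print(count_occurrences(text, sa, "ana"))   # 2
    print(bwt(text, sa))                        # annb$aa
```

The suffixes that begin with P occupy a contiguous block of the sorted order, so counting reduces to locating the two ends of that block. The same sorted order yields the Burrows-Wheeler transform directly, which is one way to see why suffix sorting underlies both indexing and compression.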