Self-indexes for natural-language texts, which treat the text as a sequence of tokens (words or separators), achieve very attractive space and search time. However, they suffer a space penalty due to their large vocabulary. In this paper we show that replacing the Huffman encoding they implicitly use with the slightly weaker Hu-Tucker encoding, which respects the lexical order of the vocabulary, improves both their space and time.
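To make the "slightly weaker" trade-off concrete, the following minimal sketch compares the total encoded length of a Huffman code with that of an optimal order-preserving (alphabetic) code over the same token frequencies. It is an illustration under stated assumptions, not the paper's implementation: the function names and the toy frequencies are hypothetical, and the alphabetic cost is computed with a simple cubic-time interval dynamic program rather than the Hu-Tucker algorithm itself, which reaches the same optimal cost in O(n log n) time.

```python
import heapq
from functools import lru_cache


def huffman_cost(freqs):
    """Total weighted code length (sum of freq * depth) of a Huffman code."""
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        cost += a + b  # each merge adds one bit to every leaf below it
        heapq.heappush(heap, a + b)
    return cost


def alphabetic_cost(freqs):
    """Total weighted code length of an optimal order-preserving code.

    Stand-in for Hu-Tucker (assumption): an interval DP over the
    frequencies taken in lexical order, so leaves keep that order.
    """
    n = len(freqs)
    prefix = [0]
    for f in freqs:
        prefix.append(prefix[-1] + f)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest tree whose leaves are exactly tokens i .. j-1, in order.
        if j - i == 1:
            return 0
        weight = prefix[j] - prefix[i]
        return weight + min(best(i, k) + best(k, j) for k in range(i + 1, j))

    return best(0, n)


# Hypothetical frequencies for five tokens listed in lexical order.
freqs = [10, 1, 10, 1, 10]
print(huffman_cost(freqs), alphabetic_cost(freqs))
```

On inputs like this one, where rare and frequent tokens alternate in lexical order, the order-preserving code cannot pair the two rare tokens together as Huffman does, so it spends somewhat more bits. That extra cost is what the abstract refers to as "slightly weaker"; in exchange, codeword order matches vocabulary order.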