Fast text searching for regular expressions or automaton searching on tries
Journal of the ACM (JACM)
Fast algorithms for sorting and searching strings
SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
Managing gigabytes (2nd ed.): compressing and indexing documents and images
An analysis of the Burrows-Wheeler transform
Journal of the ACM (JACM)
Modern Information Retrieval
Two-dimensional substring indexing
Journal of Computer and System Sciences - Special issue on PODS 2001
Journal of the ACM (JACM)
Cache-oblivious string B-trees
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
ACM Computing Surveys (CSUR)
A taxonomy of suffix array construction algorithms
ACM Computing Surveys (CSUR)
Succinct indexes for strings, binary relations and multi-labeled trees
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Introduction to Information Retrieval
CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
On searching compressed string collections cache-obliviously
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Compressed text indexes: From theory to practice
Journal of Experimental Algorithmics (JEA)
On compressing the textual web
Proceedings of the third ACM international conference on Web search and data mining
Index structures for efficiently searching natural language text
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Compression, indexing, and retrieval for massive string data
CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Engineering basic algorithms of an in-memory text search engine
ACM Transactions on Information Systems (TOIS)
Data structures: time, I/Os, entropy, joules!
ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Indexing methods for approximate dictionary searching: Comparative analysis
Journal of Experimental Algorithmics (JEA)
Space-efficient substring occurrence estimation
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient fuzzy search in large text collections
ACM Transactions on Information Systems (TOIS)
Recently, [Manning et al., 2007] revived the Permuterm index of Garfield (1976) as a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wild-card symbol (the Tolerant Retrieval problem). Unfortunately, the Permuterm index is space inefficient because it quadruples the dictionary size. In this paper we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in optimal query time, i.e. time proportional to the length of the searched pattern, and in space close to the k-th order empirical entropy of the indexed dictionary. Our index can also be used to answer more sophisticated queries that involve several wild-card symbols or that require prefix-matching multiple fields in a database of records.

The result is based on an elegant variant of the Burrows-Wheeler Transform defined on a dictionary of strings of variable length, which makes it easy to adapt known compressed indexes [Makinen-Navarro, 2007] to solve the Tolerant Retrieval problem. Experiments show that our index supports fast queries within a space occupancy close to that achievable by compressing the string dictionary via gzip, bzip or ppmdi. This improves on known approaches based on front-coding by more than 50% in absolute space occupancy, while still guaranteeing comparable query time.
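For readers unfamiliar with the classic (uncompressed) Permuterm index mentioned in the abstract, the following is a minimal Python sketch of the idea, not the compressed index proposed in the paper. It assumes that `$` never occurs in a dictionary term and that a query contains exactly one `*`; the names `build_permuterm` and `wildcard_lookup` are hypothetical and chosen only for illustration. Every rotation of `term + "$"` is stored, and a query `P*Q` becomes a prefix search for `Q$P` among the rotations.

```python
from bisect import bisect_left

END = "$"  # assumed end-of-term marker; must not occur in any term


def build_permuterm(terms):
    """Return a sorted list of (rotation, term) pairs, one per rotation of term + END."""
    entries = []
    for term in terms:
        marked = term + END
        for i in range(len(marked)):
            entries.append((marked[i:] + marked[:i], term))
    entries.sort()
    return entries


def wildcard_lookup(entries, query):
    """Answer a query containing exactly one '*' via a prefix search on rotations."""
    if query.count("*") != 1:
        raise ValueError("this sketch handles exactly one wildcard")
    prefix, suffix = query.split("*")
    # Rotate the query so that the wildcard falls at the end: P*Q -> Q$P
    key = suffix + END + prefix
    # (key,) sorts before every (key..., term) pair, so this finds the first candidate
    start = bisect_left(entries, (key,))
    matches = set()
    for rotation, term in entries[start:]:
        if not rotation.startswith(key):
            break
        matches.add(term)
    return sorted(matches)


if __name__ == "__main__":
    index = build_permuterm(["hello", "help", "held", "shell", "yellow"])
    print(wildcard_lookup(index, "he*"))   # terms starting with "he"
    print(wildcard_lookup(index, "*ll"))   # terms ending with "ll"
    print(wildcard_lookup(index, "h*o"))   # terms starting with "h" and ending with "o"
```

The sketch stores one rotation per character of each term plus one for the marker, which gives a sense of why the uncompressed structure takes so much more space than the dictionary itself.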