The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text, which can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but also efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases, which is useful for computing the relevance of words or phrases. To build them, we adapt self-indexes that succeeded in indexing arbitrary strings within compressed space so that they can deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc., as is usual on inverted indexes.
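The idea of treating a text as a sequence of words rather than characters can be illustrated with a small sketch. It is not the article's compressed self-index: it assumes a plain, uncompressed suffix array over word identifiers, and every name in it (`tokenize`, `build_word_index`, `count_phrase`) is hypothetical. It shows only how phrase counting reduces to two binary searches over a word-level suffix array, with no list intersections, and how case folding can be applied before indexing.

```python
import re


def tokenize(text):
    """Case-fold and split into words (a stand-in for the presentation layer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(text):
    """Map words to integer ids and build a suffix array over the id sequence."""
    words = tokenize(text)
    # The vocabulary is the "large alphabet": one symbol per distinct word.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    seq = [vocab[w] for w in words]
    # Naive construction for illustration only; real indexes use
    # linear-time builders and compressed representations.
    sa = sorted(range(len(seq)), key=lambda i: seq[i:])
    return vocab, seq, sa


def _boundary(seq, sa, pat, strict):
    """First suffix-array slot whose m-word prefix is >= pat (or > pat if strict)."""
    lo, hi = 0, len(sa)
    m = len(pat)
    while lo < hi:
        mid = (lo + hi) // 2
        window = seq[sa[mid]:sa[mid] + m]
        if window < pat or (strict and window == pat):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_phrase(index, phrase):
    """Count occurrences of a word or phrase via two binary searches."""
    vocab, seq, sa = index
    query = tokenize(phrase)
    if not query or any(w not in vocab for w in query):
        return 0
    pat = [vocab[w] for w in query]
    return _boundary(seq, sa, pat, True) - _boundary(seq, sa, pat, False)


index = build_word_index("To be or not to be, that is the question: to be")
print(count_phrase(index, "to be"))
print(count_phrase(index, "not to be"))
```

Because all suffixes starting with the same phrase are contiguous in the suffix array, the count is simply the width of that range, which is why counting is cheaper here than intersecting per-word posting lists.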