Compression, indexing, and retrieval for massive string data

Authors: Wing-Kai Hon; Rahul Shah; Jeffrey Scott Vitter
Affiliations: National Tsing Hua University, Taiwan; Louisiana State University; Texas A&M University
Venue: CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Year: 2010

Abstract

The field of compressed data structures seeks to achieve fast search time while using a compressed representation, ideally one that requires less space than the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
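
To illustrate what "more general search capabilities" can mean in the comparison with inverted indexes, here is a minimal sketch. It is not taken from the presentation: it uses a plain, uncompressed suffix array built by naive sorting, and the function names (`build_inverted_index`, `build_suffix_array`, `find_occurrences`) are hypothetical. The compressed indexes the abstract refers to aim to support the same kind of query in far less space.

```python
from collections import defaultdict


def build_inverted_index(text):
    """Map each space-delimited word to the list of its start offsets."""
    index = defaultdict(list)
    pos = 0
    for word in text.split(" "):
        if word:
            index[word].append(pos)
        pos += len(word) + 1  # skip the word and the following space
    return index


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically.

    Illustrative only; this is neither compressed nor I/O-efficient.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start offsets of every occurrence of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "compressed suffix arrays compress text"
inverted = build_inverted_index(text)
sa = build_suffix_array(text)

word_hits = inverted.get("press", [])            # word-level lookup
substring_hits = find_occurrences(text, sa, "press")  # arbitrary substring
```

In this sketch the word-level index can only answer queries for whole words, so a lookup for "press" comes back empty, while the suffix array finds it inside "compressed" and "compress". The suffix array shown here takes more space than the text itself, which is the cost that compressed text indexes are designed to remove.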