The SBC-tree: an index for run-length compressed sequences

Authors:
Mohamed Y. Eltabakh; Wing-Kai Hon; Rahul Shah; Walid G. Aref; Jeffrey S. Vitter
Affiliations:
Purdue University; National Tsing Hua University; Louisiana State University; Purdue University; Purdue University
Venue:
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Year:
2008

Citing 37
Cited 5

Two algorithms for maintaining order in a list

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
Matching patterns in strings subject to multi-linear transformations

Theoretical Computer Science
Information retrieval: data structures and algorithms

Information retrieval: data structures and algorithms
Efficient pattern matching with scaling

Journal of Algorithms
Edit distance of run-length coded strings

SAC '92 Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing: technological challenges of the 1990's
Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
The string B-tree: a new data structure for string search in external memory and its applications

Journal of the ACM (JACM)
On two-dimensional indexability and optimal range search indexing

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Matching for run-length encoded strings

Journal of Complexity
Let sleeping files lie: pattern matching in Z-compressed files

SODA '94 Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms
Prefix B-trees

ACM Transactions on Database Systems (TODS)
PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric

Journal of the ACM (JACM)
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Inplace run-length 2d compressed search

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Ubiquitous B-Tree

ACM Computing Surveys (CSUR)
Trie memory

Communications of the ACM
External memory algorithms and data structures: dealing with massive data

ACM Computing Surveys (CSUR)
Edit distance of run-length encoded strings

Information Processing Letters
An Efficient Multiversion Access Structure

IEEE Transactions on Knowledge and Data Engineering
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
SEQ: A Model for Sequence Databases

ICDE '95 Proceedings of the Eleventh International Conference on Data Engineering
A Database Index to Large Biological Sequences

Proceedings of the 27th International Conference on Very Large Data Bases
Optimal Two-Dimensional Compressed Matching

ICALP '94 Proceedings of the 21st International Colloquium on Automata, Languages and Programming
Approximate Matching of Run-Length Compressed Strings

CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
An asymptotically optimal multiversion B-tree

The VLDB Journal — The International Journal on Very Large Data Bases
Opportunistic data structures with applications

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search

DCC '02 Proceedings of the Data Compression Conference
The suffix binary search tree and suffix AVL tree

Journal of Discrete Algorithms
Regular expression searching on compressed text

Journal of Discrete Algorithms
Engineering a Fast Online Persistent Suffix Tree Construction

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Tight bounds for the partial-sums problem

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
When indexing equals compression: experiments with compressing suffix arrays and applications

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Longest common subsequence between run-length-encoded strings: a new algorithm with improved parallelism

Information Processing Letters
C-store: a column-oriented DBMS

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Practical suffix tree construction

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Dynamic entropy-compressed sequences and full-text indexes

CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching

Algorithms and data structures for external memory

Foundations and Trends® in Theoretical Computer Science
Efficient indexing algorithms for one-dimensional discretely-scaled strings

Information Processing Letters
Reordering columns for smaller indexes

Information Sciences: an International Journal
Compressed indexes for aligned pattern matching

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Fast algorithms for computing the constrained LCS of run-length encoded strings

Theoretical Computer Science

Abstract

Run-length encoding (RLE) is a data compression technique used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + (|p| + T)/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations efficiently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m log_B(N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage while retaining the optimal search performance achieved by the String B-tree over the uncompressed sequences.
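
To make the idea of searching RLE-compressed data without decompressing it concrete, the following is a minimal sketch in Python. It is not the SBC-tree: it is a plain linear scan over run lists, with no String B-tree, no 3-sided range structure, and no external-memory layout. The function names `rle_encode` and `rle_find` are hypothetical and chosen only for this illustration. The sketch assumes both sequences are in canonical RLE form, meaning adjacent runs never share a symbol.

```python
from itertools import groupby


def rle_encode(s):
    """Compress a string into a list of (symbol, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_find(text_runs, pattern_runs):
    """Return the start offsets (in uncompressed coordinates) of every
    occurrence of the pattern, inspecting only the run lists."""
    k = len(pattern_runs)
    if k == 0:
        return []

    # Uncompressed start offset of each text run.
    starts = []
    pos = 0
    for _, n in text_runs:
        starts.append(pos)
        pos += n

    hits = []
    first_sym, first_len = pattern_runs[0]

    if k == 1:
        # A single-run pattern fits anywhere inside a long enough run.
        for (sym, n), start in zip(text_runs, starts):
            if sym == first_sym and n >= first_len:
                hits.extend(range(start, start + n - first_len + 1))
        return hits

    last_sym, last_len = pattern_runs[-1]
    middle = pattern_runs[1:-1]
    for i in range(len(text_runs) - k + 1):
        sym, n = text_runs[i]
        end_sym, end_n = text_runs[i + k - 1]
        # First run: pattern run must be a suffix of the text run.
        # Last run: pattern run must be a prefix of the text run.
        # Middle runs: must match exactly in symbol and length.
        if (sym == first_sym and n >= first_len
                and end_sym == last_sym and end_n >= last_len
                and text_runs[i + 1:i + k - 1] == middle):
            hits.append(starts[i] + n - first_len)
    return hits


text = rle_encode("aaaabbbccaabbbd")
print(rle_find(text, rle_encode("abbbc")))
print(rle_find(text, rle_encode("aa")))
```

The matching rule works on runs instead of characters: only the first and last pattern runs may be shorter than the text runs they align with, while every interior run must agree exactly. The work per candidate alignment therefore depends on the number of runs, not on the uncompressed length.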