Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

Authors:
Jouni Sirén; Niko Välimäki; Veli Mäkinen; Gonzalo Navarro
Affiliations:
Dept. of Computer Science, Univ. of Helsinki, Finland; Dept. of Computer Science, Univ. of Helsinki, Finland; Dept. of Computer Science, Univ. of Helsinki, Finland; Dept. of Computer Science, Univ. of Chile,
Venue:
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Year:
2008

Abstract

A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper studies how to store massive, highly repetitive sequence collections in a space-efficient manner while still supporting time-efficient retrieval of the content and queries on it. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
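
To illustrate why run-length encoding can suit such collections, the following minimal sketch computes a Burrows-Wheeler transform with a naive rotation sort and counts the runs of equal symbols in it. It is not one of the structures engineered in the paper; the function names, the `"\0"` sentinel, the `"#"` separator, and the toy data are all assumptions made for illustration, and the quadratic construction is only practical for very small inputs.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Burrows-Wheeler transform by naive rotation sorting (illustration only).

    Assumes `sentinel` does not occur in `text` and sorts before every
    other symbol, so rotation order equals suffix order.
    """
    s = text + sentinel
    # Sort rotation start positions by the rotation they begin.
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The last symbol of the rotation starting at i is s[i - 1].
    return "".join(s[i - 1] for i in order)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs: list[list] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]


def run_length_decode(runs: list[tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(c * n for c, n in runs)


if __name__ == "__main__":
    # Hypothetical toy collection: a base sequence plus copies that each
    # differ from it by one substitution, joined with a separator.
    base = "GATTACAGATTTCAGGACCA"
    copies = [base]
    for k in range(1, 8):
        pos = (3 * k) % len(base)
        copies.append(base[:pos] + "N" + base[pos + 1:])
    collection = "#".join(copies)

    transformed = bwt(collection)
    runs = run_length_encode(transformed)
    print("total length N:", len(collection))
    print("runs in BWT:   ", len(runs))
```

In a collection like this, suffixes that share long contexts across copies sort next to each other and tend to be preceded by the same symbol, so the number of runs in the transform is usually much smaller than N. A run-length compressed index stores the runs rather than the individual symbols.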