Replacing suffix trees with enhanced suffix arrays

Authors:
Mohamed Ibrahim Abouelhoda;Stefan Kurtz;Enno Ohlebusch
Affiliations:
Faculty of Computer Science, University of Ulm, 89069 Ulm, Germany;Center for Bioinformatics, University of Hamburg, 20146 Hamburg, Germany;Faculty of Computer Science, University of Ulm, 89069 Ulm, Germany
Venue:
Journal of Discrete Algorithms - SPIRE 2002
Year:
2004

Abstract

The suffix tree is one of the most important data structures in string processing and comparative genomics. However, the space consumption of the suffix tree is a bottleneck in large-scale applications such as genome analysis. In this article, we overcome this obstacle. We show how every algorithm that uses a suffix tree as its data structure can systematically be replaced with an algorithm that uses an enhanced suffix array and solves the same problem in the same time complexity. The generic name enhanced suffix array stands for data structures consisting of the suffix array and additional tables. Our new algorithms are not only more space efficient than previous ones, but they are also faster and easier to implement.
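
To make the idea of "a suffix array and additional tables" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the suffix array is built by naive sorting (not one of the linear-time constructions), the only additional table shown is a longest-common-prefix (lcp) table computed with the well-known linear-time scan over text positions, and the function names are hypothetical. The last function shows one task usually solved with a suffix tree, finding a longest repeated substring, answered from the two tables alone.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting suffixes; chosen for clarity, not speed.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_table(text, suffix_array):
    """Return lcp, where lcp[k] is the length of the longest common prefix of
    the suffixes at suffix_array[k - 1] and suffix_array[k], and lcp[0] = 0.

    Visits suffixes in text order so that the previous match length can be
    reused (it shrinks by at most one per step).
    """
    n = len(text)
    rank = [0] * n  # rank[i] = position of suffix i in the suffix array
    for pos, start in enumerate(suffix_array):
        rank[start] = pos

    lcp = [0] * n
    h = 0  # current number of matched characters
    for start in range(n):
        if rank[start] == 0:
            h = 0  # the smallest suffix has no predecessor
            continue
        prev = suffix_array[rank[start] - 1]
        while start + h < n and prev + h < n and text[start + h] == text[prev + h]:
            h += 1
        lcp[rank[start]] = h
        if h > 0:
            h -= 1
    return lcp


def longest_repeated_substring(text):
    """Return a longest substring occurring at least twice (may overlap)."""
    if not text:
        return ""
    suffix_array = build_suffix_array(text)
    lcp = build_lcp_table(text, suffix_array)
    best = max(range(len(text)), key=lambda k: lcp[k])
    start = suffix_array[best]
    return text[start:start + lcp[best]]


if __name__ == "__main__":
    sa = build_suffix_array("banana")
    print(sa)                                   # [5, 3, 1, 0, 4, 2]
    print(build_lcp_table("banana", sa))        # [0, 1, 3, 0, 0, 2]
    print(longest_repeated_substring("banana")) # ana
```

In a suffix tree the same answer would be read off the deepest internal node; here the maximum entry of the lcp table plays that role, which hints at how tree traversals can be replaced by scans over flat arrays.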