Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than those previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.
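To make the idea of a quadratic-time construction and an exact-match query concrete, here is a minimal in-memory sketch. It is not the cache-efficient or disk-based algorithm the abstract refers to: it only illustrates the simplest O(n^2) approach, inserting each suffix into a compressed trie one at a time. The names `Node`, `build_suffix_tree`, and `find_occurrences` are hypothetical, and the sketch assumes the terminator character `$` does not occur in the input text.

```python
class Node:
    """One suffix tree node (hypothetical layout for this sketch)."""

    def __init__(self):
        # Maps the first character of an edge label to (start, end, child),
        # where the label is s[start:end] in the terminated text s.
        self.children = {}
        # For leaves: starting position of the suffix; None for internal nodes.
        self.suffix_start = None


def build_suffix_tree(text, terminator="$"):
    """Naive O(n^2) worst-case construction: insert every suffix in turn."""
    assert terminator not in text, "terminator must not occur in the text"
    s = text + terminator
    n = len(s)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            edge = node.children.get(s[pos])
            if edge is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.suffix_start = i
                node.children[s[pos]] = (pos, n, leaf)
                break
            start, end, child = edge
            # Walk along the edge label while it agrees with the suffix.
            k = 0
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                # Whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and attach a new leaf.
            mid = Node()
            mid.children[s[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.suffix_start = i
            mid.children[s[pos + k]] = (pos + k, n, leaf)
            node.children[s[start]] = (start, start + k, mid)
            break
    return root, s


def find_occurrences(root, s, pattern):
    """Return the sorted start positions of all exact matches of pattern."""
    node, i = root, 0
    while i < len(pattern):
        edge = node.children.get(pattern[i])
        if edge is None:
            return []
        start, end, child = edge
        k = 0
        while start + k < end and i + k < len(pattern):
            if s[start + k] != pattern[i + k]:
                return []
            k += 1
        i += k
        node = child
    # Every leaf below the reached node marks one occurrence.
    hits, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.suffix_start is not None:
            hits.append(cur.suffix_start)
        stack.extend(child for _, _, child in cur.children.values())
    return sorted(hits)


root, s = build_suffix_tree("banana")
print(find_occurrences(root, s, "ana"))  # [1, 3]
```

Once the tree is built, a query walks at most one edge per pattern character and then collects the leaves below the match point, so lookup cost depends on the pattern length and the number of hits rather than on the text length. The sketch keeps everything in Python objects in memory, so it says nothing about the cache behavior or buffer management that the abstract identifies as the deciding factors for large inputs.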