Fast algorithms for finding nearest common ancestors
SIAM Journal on Computing
On finding lowest common ancestors: simplification and parallelization
SIAM Journal on Computing
Introduction to algorithms
Genetic sequence data retrieval and manipulation based on generalized suffix trees
Algorithms on strings, trees, and sequences: computer science and computational biology
A Space-Economical Suffix Tree Construction Algorithm
Journal of the ACM (JACM)
Reducing the space requirement of suffix trees
Software—Practice & Experience
On the sorting-complexity of suffix tree construction
Journal of the ACM (JACM)
A guided tour to approximate string matching
ACM Computing Surveys (CSUR)
A Database Index to Large Biological Sequences
Proceedings of the 27th International Conference on Very Large Data Bases
Optimal Logarithmic Time Randomized Suffix Tree Construction
ICALP '96 Proceedings of the 23rd International Colloquium on Automata, Languages and Programming
LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
COCOON '96 Proceedings of the Second Annual International Conference on Computing and Combinatorics
Optimal suffix tree construction with large alphabets
FOCS '97 Proceedings of the 38th Annual Symposium on Foundations of Computer Science
Overcoming the Memory Bottleneck in Suffix Tree Construction
FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Practical methods for constructing suffix trees
The VLDB Journal — The International Journal on Very Large Data Bases
A data structure for a sequence of string accesses in external memory
ACM Transactions on Algorithms (TALG)
Genome-scale disk-based suffix tree indexing
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Improving suffix array locality for fast pattern matching on disk
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
A new method for indexing genomes using on-disk suffix trees
Proceedings of the 17th ACM conference on Information and knowledge management
DRFP-tree: disk-resident frequent pattern tree
Applied Intelligence
Reducing Space Requirements for Disk Resident Suffix Arrays
DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
B-tries for disk-based string management
The VLDB Journal — The International Journal on Very Large Data Bases
Serial and parallel methods for i/o efficient suffix tree construction
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Engineering a compressed suffix tree implementation
Journal of Experimental Algorithmics (JEA)
Abstractions in Process Mining: A Taxonomy of Patterns
BPM '09 Proceedings of the 7th International Conference on Business Process Management
Space-economical partial gram indices for exact substring matching
Proceedings of the 18th ACM conference on Information and knowledge management
Indexing genomic sequences on the IBM Blue Gene
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Suffix tree construction algorithms on modern hardware
Proceedings of the 13th International Conference on Extending Database Technology
Engineering a compressed suffix tree implementation
WEA'07 Proceedings of the 6th international conference on Experimental algorithms
I/O efficient algorithms for serial and parallel suffix tree construction
ACM Transactions on Database Systems (TODS)
On-line suffix tree construction with reduced branching
Journal of Discrete Algorithms
Clustering near-identical sequences for fast homology search
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Efficient techniques on retrieving bio-information for active U-healthcare
Personal and Ubiquitous Computing
Periodic pattern analysis of non-uniformly sampled stock market data
Intelligent Data Analysis
Mammalian genomes are typically 3 Gbps (gigabase pairs) in size. The largest public DNA database, that of NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov), contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure that efficiently supports exact and approximate sequence matching queries, as well as the finding of repetitive structures, when they can reside in main memory. However, handling long DNA sequences with suffix trees has proved difficult because of the so-called memory bottleneck problem. The most space-efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB of memory to index the human genome [19]. In this paper, we show that suffix trees for long DNA sequences can be constructed efficiently on disk using a small, bounded amount of main memory and that, therefore, all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) construct a disk-based suffix tree without suffix links and 2) rebuild the suffix links once the suffix tree has been constructed on disk, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(n log n) experimental behavior in CPU cost and linear behavior in I/O cost. DynaCluster needs only 16 MB of main memory to construct suffix trees for DNA sequences of more than 200 Mbps and significantly outperforms the existing disk-based suffix tree construction algorithms that use prepartitioning techniques, in terms of both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.
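The two-phase idea in the abstract (build the tree without suffix links first, then rebuild the links afterwards) can be illustrated with a small in-memory sketch. This is not DynaCluster and involves no disk I/O or clustering: it is a naive quadratic-time construction chosen only to show that suffix links can be recovered from a finished tree. The names `Node`, `build_without_links`, `rebuild_links`, `locate`, and `contains` are hypothetical, and the sketch assumes the text ends with a unique terminator character such as `$`.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start = start    # incoming edge label is text[start:end]
        self.end = end
        self.children = {}    # first character of an edge -> child Node
        self.link = None      # suffix link, filled in during phase 2


def build_without_links(text):
    """Phase 1: insert every suffix naively; no suffix links are used."""
    root, n = Node(), len(text)
    for i in range(n):
        node, pos = root, i
        while pos < n:
            child = node.children.get(text[pos])
            if child is None:                  # no edge: hang a new leaf
                node.children[text[pos]] = Node(pos, n)
                break
            k = child.start                    # match along the edge label
            while k < child.end and pos < n and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:                 # whole edge matched: descend
                node = child
                continue
            mid = Node(child.start, k)         # mismatch: split the edge
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            if pos < n:                        # always true with a terminator
                mid.children[text[pos]] = Node(pos, n)
            break
    return root


def locate(root, text, lo, hi):
    """Follow text[lo:hi] from the root, skipping whole edges at a time."""
    node = root
    while lo < hi:
        node = node.children[text[lo]]
        lo += node.end - node.start
    return node


def rebuild_links(root, text):
    """Phase 2: give each internal node labelled aW a link to the node W."""
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node is not root:
            # The path label of node is text[node.end - depth : node.end].
            node.link = locate(root, text, node.end - depth + 1, node.end)
        for child in node.children.values():
            if child.children:                 # leaves carry no suffix link
                stack.append((child, depth + child.end - child.start))


def contains(root, text, pattern):
    """Exact substring query on the finished tree."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        label = text[child.start:child.end]
        m = min(len(label), len(pattern) - i)
        if label[:m] != pattern[i:i + m]:
            return False
        i += m
        node = child
    return True


text = "banana$"
root = build_without_links(text)
rebuild_links(root, text)
print(contains(root, text, "ana"))   # True
print(contains(root, text, "nab"))   # False
```

Phase 2 relies on a standard property of suffix trees: if an internal node has path label aW, then W is also the path label of an explicit internal node, so `locate` can skip whole edges without comparing characters. Exact matching as in `contains` never touches the links, which is consistent with the abstract's "if needed" wording.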