I/O efficient algorithms for serial and parallel suffix tree construction

Authors:
Amol Ghoting; Konstantin Makarychev
Affiliations:
IBM Thomas J. Watson Research Center, NY; IBM Thomas J. Watson Research Center, NY
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2010

Abstract

Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered because suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, a scalable suffix tree construction algorithm is imperative. The past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting: for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this article, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation: to achieve reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation stems from the poor locality exhibited by existing suffix tree construction algorithms, and it inhibits both sequential and parallel scalability. To address this limitation, we show that, through careful algorithm design, one of the simplest suffix tree construction algorithms can be rearchitected to build a suffix tree in a tiled manner, allowing the execution to operate within a fixed main memory budget when indexing strings of any size. We also present a parallel extension of our algorithm that is designed for massively parallel systems like the IBM Blue Gene. An experimental evaluation shows that the proposed approach affords an improvement of several orders of magnitude in serial performance when indexing large strings. Furthermore, the proposed parallel extension is shown to be scalable: it is now possible to index the entire human genome on a 1024-processor IBM Blue Gene system in under 15 minutes.