Distributed string mining for high-throughput sequencing data

Authors:
Niko Välimäki; Simon J. Puglisi
Affiliations:
Helsinki Institute for Information Technology, Finland, Department of Computer Science, University of Helsinki, Finland; Department of Computer Science, University of Helsinki, Finland
Venue:
WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics
Year:
2012

Abstract

The goal of frequency-constrained string mining is to extract substrings that discriminate between two (or more) datasets. Known solutions to the problem range from an optimal-time algorithm to various time-space tradeoffs. However, all existing algorithms are designed to run sequentially and require the whole input to fit in main memory. Because of these limitations, they are practical only for inputs of up to a few gigabytes. We introduce a distributed algorithm that has a novel time-space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from, e.g., Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.
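
To make the problem statement concrete, the following is a minimal brute-force sketch of frequency-constrained string mining on two tiny datasets. It is not the distributed algorithm described in the abstract; it only illustrates what "substrings that discriminate two datasets" can mean. It assumes that the frequency of a substring is the number of strings in a dataset that contain it; the function names and the parameters `min_pos`, `max_neg`, and `max_len` are hypothetical and chosen for illustration.

```python
from collections import Counter


def substring_support(dataset, max_len):
    """For every substring of length 1..max_len, count how many strings
    of the dataset contain it (assumed definition of frequency)."""
    support = Counter()
    for s in dataset:
        seen = set()  # count each substring at most once per string
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                seen.add(s[i:j])
        support.update(seen)
    return support


def mine_discriminating(pos, neg, min_pos, max_neg, max_len=8):
    """Return substrings contained in at least min_pos strings of `pos`
    and in at most max_neg strings of `neg` (hypothetical thresholds)."""
    pos_support = substring_support(pos, max_len)
    neg_support = substring_support(neg, max_len)
    return sorted(
        w for w, count in pos_support.items()
        if count >= min_pos and neg_support[w] <= max_neg
    )


if __name__ == "__main__":
    pos = ["ACGTAC", "TTACGA", "GACGTT"]
    neg = ["TTTTGG", "GGTTAA"]
    # Substrings present in all three positive strings and in no negative one.
    print(mine_discriminating(pos, neg, min_pos=3, max_neg=0, max_len=3))
```

This enumeration holds every candidate substring in memory and scans the input sequentially, which is exactly the kind of limitation the abstract attributes to existing approaches at scale.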