Distributed string mining for high-throughput sequencing data

  • Authors:
  • Niko Välimäki; Simon J. Puglisi

  • Affiliations:
  • Helsinki Institute for Information Technology, Finland, and Department of Computer Science, University of Helsinki, Finland; Department of Computer Science, University of Helsinki, Finland

  • Venue:
  • WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics
  • Year:
  • 2012

Abstract

The goal of frequency-constrained string mining is to extract substrings that discriminate between two (or more) datasets. Known solutions range from an optimal-time algorithm to various time-space tradeoffs. However, all existing algorithms are designed to run sequentially and require the whole input to fit in main memory. Because of these limitations, they are practical only for inputs of up to a few gigabytes. We introduce a distributed algorithm that has a novel time-space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from a provider such as Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.
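
To make the problem statement concrete, the following is a minimal brute-force sketch of frequency-constrained string mining on toy data. It is not the distributed algorithm of the paper, and it does not scale beyond small inputs. It assumes the common formulation in which a substring's frequency is the fraction of strings in a dataset that contain it, and a substring is reported when that frequency is at least `min_pos` in one dataset and at most `max_neg` in the other. The function names, the parameters, and the `max_len` cutoff are hypothetical and chosen for illustration only.

```python
from collections import Counter


def substrings(s, max_len):
    """Return the set of distinct substrings of s with length 1..max_len."""
    return {
        s[i:j]
        for i in range(len(s))
        for j in range(i + 1, min(len(s), i + max_len) + 1)
    }


def document_frequency(dataset, max_len):
    """Count, for each substring, how many strings of the dataset contain it."""
    counts = Counter()
    for s in dataset:
        # Using a set per string ensures each string is counted at most once.
        counts.update(substrings(s, max_len))
    return counts


def mine(pos, neg, min_pos, max_neg, max_len=8):
    """Report substrings frequent in `pos` and infrequent in `neg`.

    Assumed constraint (illustrative): frequency in `pos` >= min_pos and
    frequency in `neg` <= max_neg, both measured as fractions of strings.
    """
    pos_df = document_frequency(pos, max_len)
    neg_df = document_frequency(neg, max_len)
    return sorted(
        p
        for p, count in pos_df.items()
        if count / len(pos) >= min_pos and neg_df[p] / len(neg) <= max_neg
    )


# Toy datasets: substrings present in every "pos" string and in no "neg" string.
pos = ["ACGT", "ACGA", "TACG"]
neg = ["GGTT", "CGTT"]
result = mine(pos, neg, min_pos=1.0, max_neg=0.0)
```

On this toy input, `result` contains the substrings shared by all three strings of `pos` that never occur in `neg`. This sketch enumerates every substring explicitly and keeps all counts in memory, which is exactly the kind of cost that the time-space tradeoffs mentioned in the abstract aim to avoid.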