Approximate all-pairs suffix/prefix overlaps

  • Authors:
  • Niko Välimäki; Susana Ladra; Veli Mäkinen

  • Affiliations:
  • Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, P.O. Box 68, 00014, Finland; Department of Computer Science, University of A Coruña, Spain; Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, P.O. Box 68, 00014, Finland

  • Venue:
  • Information and Computation
  • Year:
  • 2012

Abstract

Finding approximate overlaps is the first phase of many sequence assembly methods. Given a set of strings of total length $n$ and an error rate $\varepsilon$, the goal is to find, for all pairs of strings, their suffix/prefix matches (overlaps) that are within edit distance $k = \lceil \varepsilon \ell \rceil$, where $\ell$ is the length of the overlap. We propose a new solution for this problem based on backward backtracking (Lam et al., 2008) and suffix filters (Kärkkäinen and Na, 2008). Our technique uses $nH_k + o(n \log \sigma) + r \log r$ bits of space, where $H_k$ is the $k$-th order entropy and $\sigma$ is the alphabet size. In practice, it scales better in space than q-gram filters (Rasmussen et al., 2006) while remaining comparable in running time. Our method is also easy to parallelize and scales up to millions of DNA reads.
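
To make the problem statement concrete, the following is a minimal Python sketch of a naive dynamic-programming baseline. It is not the method proposed in the paper, which relies on backward backtracking over a compressed index and on suffix filters; it only illustrates what an approximate suffix/prefix overlap is. The function names `approx_overlaps` and `all_pairs_overlaps`, the `min_len` parameter, and the choice to measure the overlap length as the length of the prefix of the second string are all assumptions made for this illustration.

```python
import math


def approx_overlaps(a, b, eps, min_len=1):
    """Naive baseline (not the paper's algorithm).

    Returns (length, distance) pairs such that the prefix of `b` of that
    length is within edit distance ceil(eps * length) of some suffix of `a`.
    Assumption: the overlap length is taken to be the prefix length in `b`.
    """
    m = len(a)
    # Row for the empty prefix of b: it matches the empty suffix of a[:i]
    # at cost 0 for every i, so the suffix of a may start anywhere.
    prev = [0] * (m + 1)
    hits = []
    for j in range(1, len(b) + 1):
        # cur[i] = min edit distance between b[:j] and any suffix of a[:i]
        cur = [j] + [0] * m
        for i in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # b[j-1] has no partner in a
                         cur[i - 1] + 1)      # a[i-1] has no partner in b
        # Small tolerance guards against floating-point error in eps * j.
        k = math.ceil(eps * j - 1e-9)
        # cur[m] forces the matched part of a to end at the end of a.
        if j >= min_len and cur[m] <= k:
            hits.append((j, cur[m]))
        prev = cur
    return hits


def all_pairs_overlaps(reads, eps, min_len=1):
    """Run the pairwise baseline on every ordered pair of distinct reads."""
    result = {}
    for x, a in enumerate(reads):
        for y, b in enumerate(reads):
            if x != y:
                hits = approx_overlaps(a, b, eps, min_len)
                if hits:
                    result[(x, y)] = hits
    return result
```

Each pair costs time proportional to the product of the two string lengths, so this baseline is quadratic in the total input size and is only practical for tiny inputs. That cost is the motivation for index-based and filter-based approaches such as the one described in the abstract.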