On approximating string selection problems with outliers

  • Authors:
  • Christina Boucher;Gad M. Landau;Avivit Levy;David Pritchard;Oren Weimann

  • Affiliations:
  • Department of Computer Science, University of California, San Diego, USA;Department of Computer Science, University of Haifa, Haifa 31905, Israel and Polytechnic Institute of NYU, Brooklyn, NY 11201-3840, USA;Shenkar College for Engineering and Design, Ramat-Gan 52526, Israel and CRI, University of Haifa, Mount Carmel, Haifa 31905, Israel;CEMC, University of Waterloo, Canada;Department of Computer Science, University of Haifa, Haifa 31905, Israel

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2013

Abstract

Many problems in bioinformatics are about finding strings that approximately represent a collection of given strings. We look at more general problems where some input strings can be classified as outliers. The Close to Most Strings problem is, given a set S of strings of the same length and a parameter d, to find a string x that maximizes the number of "non-outliers" within Hamming distance d of x. We prove that this problem has no polynomial-time approximation scheme (PTAS) unless NP has randomized polynomial-time algorithms, correcting a decade-old erroneous proof in the literature. The Most Strings with Few Bad Columns problem is to find a maximum-size subset of input strings so that the number of non-identical positions is at most k; we show it has no PTAS unless P=NP. We also observe that Closest to k Strings has no efficient PTAS (EPTAS) unless a parameterized complexity hierarchy collapses. In sum, outliers help model problems associated with using biological data, but we show that finding an approximate solution is computationally difficult.
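
To make the two problem definitions concrete, the following is a minimal illustrative sketch in Python, not an algorithm from the paper. It evaluates the Close to Most Strings objective for a candidate string and finds an optimal center by exhaustive search, which takes exponential time and is usable only on tiny instances. It also counts the non-identical ("bad") columns of a subset, the quantity bounded by k in Most Strings with Few Bad Columns. All function names are hypothetical, and restricting candidates to characters that occur in the input is an assumption made for the illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def count_close(x, strings, d):
    """Objective of Close to Most Strings: how many input strings lie
    within Hamming distance d of the candidate x (the non-outliers)."""
    return sum(hamming(x, s) <= d for s in strings)


def close_to_most_strings_bruteforce(strings, d):
    """Exhaustive search over all candidate strings (hypothetical helper).

    Assumption: candidates use only characters that occur in the input.
    Runs in time exponential in the string length; illustration only.
    """
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all input strings must have the same length")
    alphabet = sorted(set("".join(strings)))
    best_x, best_count = None, -1
    for candidate in product(alphabet, repeat=length):
        x = "".join(candidate)
        c = count_close(x, strings, d)
        if c > best_count:
            best_x, best_count = x, c
    return best_x, best_count


def bad_columns(strings):
    """Number of positions at which the given strings are not all identical,
    the quantity bounded by k in Most Strings with Few Bad Columns."""
    return sum(len(set(column)) > 1 for column in zip(*strings))


if __name__ == "__main__":
    S = ["AAAA", "AAAT", "AATA", "TTTT"]
    x, non_outliers = close_to_most_strings_bruteforce(S, d=1)
    print(x, non_outliers)        # "TTTT" is left out as an outlier
    print(bad_columns(S[:3]))     # bad columns among the first three strings
```

In this toy instance, no string can be within distance 1 of both "AAAA" and "TTTT", so any optimal center must leave at least one input string as an outlier. The hardness results in the abstract concern how well such optima can be approximated in polynomial time, which the exhaustive search above does not address.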