Hard problems in similarity searching

  • Authors:
  • Christophe Moan;Irena Rusu

  • Affiliations:
  • LINA, Université de Nantes, 2, Rue de la Houssiniere, B.P. 92208, 44322 Nantes Cedex 3, France;LINA, Université de Nantes, 2, Rue de la Houssiniere, B.P. 92208, 44322 Nantes Cedex 3, France

  • Venue:
  • Discrete Applied Mathematics - Discrete mathematics & data mining (DM & DM)
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

The Closest Substring Problem is one of the most important problems in the field of computational biology. It is stated as follows: given a set of t sequences s1, s2,...,st over an alphabet Σ, and two integers k,d with d ≤ k, can one find a string s of length k and, for all i=1,2,...,t, substrings oi of si, all of length k, such that d(s,oi) ≤ d (for all i= 1,2,...,t)? (here, d(.,.) represents the Hamming distance). Closest Substring was shown to be NP-hard (Proceedings of 10th SODA, 1999, pp. 633-642) and W[1]-hard with respect to the number t of input sequences (Proceedings of STACS'02, Lecture Notes in Computer Science, Vol. 2285, 2002, pp. 262-273); recently, an important number of results concerning the parameterized computational complexity of Closest Substring has been added in Evans et al. (Theoret. Comput. Sci. 306 (1-3) (2003) 407). In this paper we introduce and analyze two variants of the Closest Substring Problem, obtained by imposing restrictions on the pairwise distances between the substrings oi: • the bounded Hamming distance constraint asks that d(oi,oj) ≤ p, for all i,j ∈ {1,2,...,t} (where p i≤ij≤t d(oi,oj) ≤ P (where P dt(t - 1) is a given constant) and yields the problem called SCCS. We motivate the introduction of these problems, and we show that while SCCS is very close to Closest Substring, BCCS is a non-trivial restriction of Closest Substring more suitable to use in certain practical applications. We then concentrate on BCCS and show that all the hardness results available for Closest Substring remain valid for BCCS even when the parameter p is restricted to a certain range.