On spaced seeds for similarity search

  • Authors:
  • Uri Keich; Ming Li; Bin Ma; John Tromp

  • Affiliations:
  • Computer Science & Engineering Department, University of California, San Diego, CA; Bioinformatics Lab, Computer Science Department, University of California, Santa Barbara, CA; Computer Science Department, University of Western Ontario, London, Canada N6A 5B8; CWI, P.O. Box 94079, 1090 GB Amsterdam, Netherlands

  • Venue:
  • Discrete Applied Mathematics
  • Year:
  • 2004

Abstract

Genomics studies routinely depend on similarity searches that find short seed matches (k contiguous matching bases) and then extend them. The choice of the seed length, k, is determined by the tradeoff between search speed (a larger k reduces chance hits) and sensitivity (a smaller k finds weaker similarities). Ma et al. (Bioinformatics (2002) 18) introduced the novel idea of using a single, deterministic, optimized spaced seed in this search process, and demonstrated empirically that the optimal spaced seed quadruples the search speed without sacrificing sensitivity. Multiple randomly spaced patterns, spaced q-grams, and spaced probes were also studied by Califano and Rigoutsos (Technical Report, IBM T.J. Watson Research Center (1995)), Burkhardt and Kärkkäinen (CPM (2001)), and Buhler (Bioinformatics 17 (2001) 419), and in other applications (RECOMB (1999) 295, RECOMB (2000) 245). All were found to perform better than their contiguous counterparts. In this paper we study some of the theoretical and practical aspects of optimal seeds. In particular, we demonstrate that the commonly used contiguous seed is in some sense the worst one, and we offer an algorithmic solution to the problem of finding the optimal seed.
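
To make the notion of seed sensitivity concrete, the following Python sketch compares a contiguous seed with a spaced seed of the same weight. It is an illustration under stated assumptions, not the algorithm of the paper: a seed is written as a string in which '1' marks a position that must match and '0' a position that is ignored, a homologous region is modeled as independent positions that each match with probability p, and sensitivity is taken to be the probability that the seed fits somewhere in the region. The function names, the region length, and the value of p are hypothetical choices, and the exhaustive enumeration is only practical for short regions.

```python
from itertools import product


def seed_hits(seed, region):
    """Return True if the seed fits at some offset of the region.

    seed   -- string over '1' (position must match) and '0' (don't care)
    region -- sequence of 0/1 flags, 1 meaning the aligned bases match
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    return any(
        all(region[start + i] for i in required)
        for start in range(len(region) - span + 1)
    )


def hit_probability(seed, length, p):
    """Exact probability that the seed hits a random region.

    Assumes each of the `length` positions matches independently with
    probability p. Enumerates all 2**length regions, so keep length small.
    """
    total = 0.0
    for region in product((0, 1), repeat=length):
        if seed_hits(seed, region):
            matches = sum(region)
            total += p ** matches * (1 - p) ** (length - matches)
    return total


if __name__ == "__main__":
    # Illustrative parameters (assumptions, not values from the paper).
    length, p = 16, 0.7
    for seed in ("111", "1101"):  # both seeds have weight 3
        print(seed, hit_probability(seed, length, p))
```

Both seeds require the same number of matching bases, so they produce chance hits at a comparable rate; the sketch lets the two hit probabilities be compared directly for a chosen region length and p.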