Generalised Sequence Signatures through symbolic clustering

  • Authors: Dietmar H. Dorr; Anne M. Denton
  • Affiliations: Research and Development, Thomson Reuters, St. Paul, MN 55123, USA; Department of Computer Science, North Dakota State University, Fargo, ND 58102, USA
  • Venue: International Journal of Data Mining and Bioinformatics
  • Year: 2010

Abstract

Traditionally, sequence motifs and domains are defined such that insertions, deletions and mismatched regions are small compared with matched regions. We introduce an algorithm for identifying Generalised Sequence Signatures (GSS), which can be composed of windows distributed throughout the sequence. Our approach is based on cluster analysis of recurring subsequences of a predefined length, which we refer to as symbols. Sequences are grouped so as to maximise the number of symbols they share. We show that using GSS to derive sequence annotations yields higher confidence values than other signature recognition approaches.
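
The idea of grouping sequences by shared fixed-length symbols can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details the abstract does not give; it swaps in a simple greedy agglomerative merge as a stand-in. The function names and the parameters `k`, `min_support` and `min_shared` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def windows(seq, k):
    """Return the set of all length-k windows of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recurring_symbols(seqs, k, min_support=2):
    """Windows found in at least min_support sequences (assumed cutoff)."""
    counts = Counter(w for s in seqs for w in windows(s, k))
    return {w for w, c in counts.items() if c >= min_support}


def group_by_shared_symbols(seqs, k=3, min_support=2, min_shared=2):
    """Greedy stand-in for the clustering step, not the published method.

    Repeatedly merges the two clusters sharing the most symbols and stops
    once the best pair shares fewer than min_shared symbols. Each cluster
    is a pair (member indices, symbols common to all members).
    """
    symbols = recurring_symbols(seqs, k, min_support)
    clusters = [([i], windows(s, k) & symbols) for i, s in enumerate(seqs)]
    while len(clusters) > 1:
        # Pick the pair of clusters with the largest symbol overlap.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: len(clusters[p[0]][1] & clusters[p[1]][1]),
        )
        shared = clusters[a][1] & clusters[b][1]
        if len(shared) < min_shared:
            break
        merged = (clusters[a][0] + clusters[b][0], shared)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters


# Toy sequences: each pair agrees at both ends but differs in the middle.
toy = ["ACDEFGHIK", "ACDXXGHIK", "MNPQRSTVW", "MNPYYSTVW"]
groups = group_by_shared_symbols(toy)
```

On the toy input, the symbols shared within each group come from the two ends of the sequences, separated by a mismatched middle region. That is the kind of distributed signature a contiguous motif definition would not capture.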