Sparse nonnegative matrix factorization for protein sequence motif discovery

  • Authors:
  • Wooyoung Kim;Bernard Chen;Jingu Kim;Yi Pan;Haesun Park

  • Affiliations:
  • Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA;Department of Computer Science, University of Central Arkansas, Conway, AR 72035, USA;School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA;Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA;School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Abstract

Discovering motifs in protein sequences is a critical and challenging task in bioinformatics. The task involves clustering relatively similar protein segments from a huge collection of protein sequences and culling high-quality motifs from the resulting clusters. A granular computing strategy combined with the K-means clustering algorithm was previously proposed for the task, but that strategy requires manual selection of biologically meaningful clusters to serve as the initial condition. This manually guided clustering is both undisciplined and computationally expensive. In this paper, we use sparse non-negative matrix factorization (SNMF) to cluster a large protein data set. We show how to combine this method with the Fuzzy C-means algorithm and incorporate biostatistical information to increase the number of clusters with high structural similarity. Our experimental results show that the SNMF approach provides better protein groupings in terms of secondary-structure similarity while maintaining similarity in the primary sequences.
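
To make the clustering idea in the abstract concrete, the following is a minimal sketch of clustering with a sparse NMF, not the authors' implementation. The abstract does not say how SNMF is computed or how protein segments are encoded, so several things here are assumptions: multiplicative updates with an L1 penalty on the coefficient matrix stand in for whatever solver the paper uses, the data matrix `A` holds one nonnegative feature vector per column, and the names `sparse_nmf`, `cluster_labels`, and the parameter `beta` are hypothetical. The Fuzzy C-means and biostatistics steps are not shown.

```python
import numpy as np

def sparse_nmf(A, k, beta=0.1, n_iter=200, seed=0, eps=1e-9):
    """Approximate A (features x samples) by W @ H with W, H >= 0.

    Assumption: sparsity is encouraged by an L1 penalty `beta` on H,
    minimized with multiplicative updates. This is one common variant,
    chosen for brevity.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # The L1 penalty on H appears as `beta` in the denominator.
        H *= (W.T @ A) / (W.T @ W @ H + beta + eps)
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
        # Rescale so each column of W has unit norm; W @ H is unchanged.
        # Heuristic: keeps the penalty from being evaded by shrinking H.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H

def cluster_labels(H):
    """Assign each sample (column) to the component with the largest coefficient."""
    return np.argmax(H, axis=0)
```

With a sparse `H`, most samples load on a single component, so taking the largest coefficient per column gives a hard cluster assignment. The columns of `H` could also be used as soft memberships or passed to a second clustering stage.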