Discovering motifs in protein sequences is a critical and challenging task in bioinformatics. The task involves clustering relatively similar protein segments from a huge collection of protein sequences and culling high-quality motifs from the resulting clusters. A granular computing strategy combined with the K-means clustering algorithm was previously proposed for this task, but that strategy requires manual selection of biologically meaningful clusters to serve as the initial condition. Such a manually guided clustering method is unsystematic as well as computationally expensive. In this paper, we use sparse non-negative matrix factorization (SNMF) to cluster a large protein data set. We show how to combine this method with the Fuzzy C-means algorithm and incorporate bio-statistics information to increase the number of clusters with high structural similarity. Our experimental results show that the SNMF approach provides better protein groupings in terms of similarity in secondary structure while maintaining similarity in primary sequence.
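To make the idea of clustering with a sparse non-negative factorization concrete, the following is a minimal sketch, not the method used in the paper. It assumes each protein segment has already been encoded as a non-negative feature vector (one column of a matrix `V`); the encoding, the function names `sparse_nmf` and `assign_clusters`, the L1 penalty weight `l1`, and the multiplicative update rule are all illustrative assumptions. The Fuzzy C-means step and the bio-statistics information mentioned above are not included.

```python
import numpy as np


def sparse_nmf(V, k, l1=0.1, n_iter=200, seed=0, eps=1e-9):
    """Approximate non-negative V (features x samples) as W @ H.

    Hypothetical helper: multiplicative updates with an L1 penalty on H,
    which is one common heuristic for encouraging sparse coefficients.
    """
    rng = np.random.default_rng(seed)
    n_features, n_samples = V.shape
    W = rng.random((n_features, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Update H; the l1 term in the denominator shrinks small entries toward zero.
        H *= (W.T @ V) / (W.T @ W @ H + l1 + eps)
        # Standard multiplicative update for the basis matrix W.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Rescale W to unit-norm columns and compensate in H, so the product
        # W @ H is unchanged and the penalty on H cannot be dodged by scaling.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H


def assign_clusters(H):
    """Assign each sample (column of H) to the factor with the largest coefficient."""
    return H.argmax(axis=0)
```

A call such as `W, H = sparse_nmf(V, k=2)` followed by `labels = assign_clusters(H)` yields one hard cluster label per column of `V`. Because each column of `H` is encouraged to have few non-zero entries, the largest coefficient tends to give a clear-cut assignment, which is the usual motivation for adding a sparseness constraint when NMF is used for clustering.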