Motivation: Remote homology detection between protein sequences is a central problem in computational biology. Discriminative methods such as the support vector machine (SVM) are among the most effective approaches. Many SVM-based methods focus on finding useful representations of protein sequences, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon seen in many machine-learning methods, because the feature vectors are usually very high-dimensional and noisy data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.
Results: In this study, a latent semantic analysis (LSA) model, an efficient feature extraction technique from natural language processing, is introduced into protein remote homology detection. Several basic building blocks of protein sequences are investigated as the 'words' of the 'protein sequence language', including N-grams, patterns and motifs. Each protein sequence is treated as a 'document' represented as a bag of words. The word-document matrix is constructed first. LSA is then performed on this matrix to produce latent semantic representation vectors of the protein sequences, leading to noise removal and a more concise description of the sequences. The latent semantic representation vectors are then evaluated by SVM. The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms. Furthermore, the performance of this method is comparable with that of complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.
Availability: The source code is freely available at http://www.insun.hit.edu.cn/news/view.asp?id=413 or upon request from the authors.
Contact: qwdong@insun.hit.edu.cn
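The pipeline outlined in the Results (N-gram 'words', a word-document matrix, then LSA) can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the toy sequences, the use of raw counts without any term weighting, and the choice of NumPy's SVD are all assumptions made for illustration. The final SVM step is omitted; the latent vectors produced here are what would be passed to a classifier.

```python
from collections import Counter

import numpy as np


def ngram_counts(sequence, n=3):
    """Count overlapping N-grams ('words') in one protein sequence ('document')."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def word_document_matrix(sequences, n=3):
    """Build a (words x documents) matrix of raw N-gram counts.

    Raw counts are an assumption; a weighting scheme could be applied instead.
    """
    counts = [ngram_counts(seq, n) for seq in sequences]
    vocabulary = sorted(set().union(*counts))
    row_of = {word: row for row, word in enumerate(vocabulary)}
    matrix = np.zeros((len(vocabulary), len(sequences)))
    for col, doc in enumerate(counts):
        for word, freq in doc.items():
            matrix[row_of[word], col] = freq
    return matrix, vocabulary


def lsa_vectors(matrix, rank):
    """Return one latent vector per document, keeping the top `rank` singular values."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Scale the leading right singular vectors by their singular values,
    # then transpose so that each row describes one document.
    return (s[:rank, None] * vt[:rank]).T


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from SCOP.
    toy_sequences = ["MKVLAAGIVG", "MKVLSAGIVG", "GSHMTPEELK"]
    matrix, vocabulary = word_document_matrix(toy_sequences, n=3)
    latent = lsa_vectors(matrix, rank=2)
    print(matrix.shape, latent.shape)
```

Truncating to a small `rank` discards the directions with the smallest singular values, which is the sense in which LSA is said to remove noise while shortening the feature vectors.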