Detecting disease genes based on semi-supervised learning and protein-protein interaction networks

Authors:
Thanh-Phuong Nguyen; Tu-Bao Ho
Affiliations:
Microsoft Research - University of Trento Centre for Computational and Systems Biology, Piazza Manci 17, Trento 38123, Italy; Japan Advanced Institute of Science and Technology, Nomi, Ishikawa 923-1292, Japan and Vietnam Academy of Science and Technology, Caugiay, Hanoi, Viet Nam
Venue:
Artificial Intelligence in Medicine
Year:
2012

Abstract

Objective: Predicting or prioritizing the human genes that cause disease, or "disease genes", is one of the emerging tasks in biomedical informatics. Network-based approaches to this problem rest on the key assumption that "the network-neighbour of a disease gene is likely to cause the same or a similar disease", and they mostly apply supervised learning methods to data on well-known disease genes. This work aims to find an effective method that exploits the disease gene neighbourhood and integrates several useful omics data sources, which can potentially enhance disease gene prediction.

Methods: We present a novel method that predicts disease genes by exploiting, within a semi-supervised learning (SSL) scheme, data on both disease genes and their neighbours in the protein-protein interaction network. Multiple proteomic and genomic data were integrated from six biological databases (Universal Protein Resource, Interologous Interaction Database, Reactome, Gene Ontology, Pfam, and InterDom) and a gene expression dataset.

Results: Under 10 repetitions of stratified 10-fold cross-validation, the SSL method performed better than the k-nearest neighbour method and the support vector machine method, with a sensitivity of 85%, a specificity of 79%, a precision of 81%, an accuracy of 82%, and a balanced F-function of 83%. Further comparative experiments demonstrate the advantages of the proposed method when only a small amount of labeled data is available, where it reached an accuracy of 78%. We applied the proposed method to detect 572 putative disease genes, which were validated biologically by indirect means.

Conclusion: Semi-supervised learning improved the ability to study disease genes, especially for a specific disease, where the known disease genes (the labeled data) are very often limited. In addition to the computational improvement, the analysis of the predicted disease proteins indicates that the findings are beneficial in deciphering pathogenic mechanisms.
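
The evaluation measures named in the Results can be made concrete with a short sketch. The Python below uses the standard textbook definitions of sensitivity, specificity, precision, accuracy, and the balanced F-measure (harmonic mean of precision and sensitivity); it is an illustration under that assumption, not the authors' evaluation code. The names `ConfusionCounts`, `binary_metrics`, and `balanced_f` are hypothetical, and the counts in the usage lines are made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary classifier (disease gene = positive class)."""
    tp: int  # disease genes predicted as disease genes
    fp: int  # non-disease genes predicted as disease genes
    tn: int  # non-disease genes predicted as non-disease genes
    fn: int  # disease genes predicted as non-disease genes


def balanced_f(precision: float, sensitivity: float) -> float:
    """Balanced F-measure: harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def binary_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions of the five measures reported in the abstract."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "balanced_f": balanced_f(precision, sensitivity),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
example = binary_metrics(ConfusionCounts(tp=85, fp=21, tn=79, fn=15))

# Consistency check on the reported figures: precision 81% and
# sensitivity 85% give a balanced F-measure that rounds to 83%.
reported_f = balanced_f(0.81, 0.85)
```

Under these definitions, the reported precision (81%) and sensitivity (85%) yield a balanced F-measure of about 0.83, which agrees with the 83% stated in the Results.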