Document classification for mining host pathogen protein-protein interactions

Authors:
Lanlan Yin;Guixian Xu;Manabu Torii;Zhendong Niu;Jose M. Maisog;Cathy Wu;Zhangzhi Hu;Hongfang Liu
Affiliations:
Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington, DC, USA;Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington, DC, USA and School of Computer Science and Technology, Beijing Institute of Technology, Beijing, ...;Imaging Science and Information Systems Center, Georgetown University Medical Center, Washington, DC, USA;School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China;Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington, DC, USA and Medical Numerics, Inc., Germantown, MD, USA;Protein Information Resources, Georgetown University Medical Center, Washington, DC, USA;Department of Oncology, Georgetown University Medical Center, Washington, DC, USA;Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University, Washington, DC, USA
Venue:
Artificial Intelligence in Medicine
Year:
2010



Abstract

Objective: Scientific findings regarding human pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host pathogen protein-protein interactions (HP-PPIs), from the literature. Methods: In this paper, we report our exploration of an automated system to identify MEDLINE abstracts referring to HP-PPIs. An annotated corpus consisting of 1360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of three feature selection methods: information gain (IG), the χ² test, and specific mutual information (SI). Performance was measured using normalized discounted cumulative gain (NDCG) and positive predictive value (PPV), and all measures were obtained through 10-fold cross-validation. Results: NDCG measures for classification systems using all features, or a subset of features selected using IG or the χ² test, ranged from 0.83 to 0.89, while classification systems built on features selected using SI had relatively lower NDCG measures. The classification system achieved a PPV of 50.7% for the top 10% of ranked documents, compared with a baseline PPV of 10.0%. Conclusions: Our results indicate that document classification systems can be constructed to efficiently retrieve HP-PPI-related documents. Feature selection was effective in reducing the dimensionality of the feature space, yielding a compact system.
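
To make the two evaluation measures concrete, the following is a minimal Python sketch of how NDCG and PPV over the top-ranked fraction of documents could be computed from classifier scores. The abstract does not give the exact NDCG formulation used, so the sketch assumes binary relevance (1 for an HP-PPI abstract, 0 otherwise) and the common log2 rank discount. The function names `dcg`, `ndcg`, and `ppv_at_top` and the toy scores are hypothetical illustrations, not the authors' code or data.

```python
import math


def rank_labels(scores, labels):
    """Return labels ordered by descending classifier score."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return [label for _, label in pairs]


def dcg(relevances):
    """Discounted cumulative gain with a log2 discount (assumed form)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(scores, labels):
    """DCG of the score-induced ranking divided by the ideal DCG."""
    ideal = dcg(sorted(labels, reverse=True))
    if ideal == 0:
        return 0.0  # no positive documents: NDCG is undefined, report 0
    return dcg(rank_labels(scores, labels)) / ideal


def ppv_at_top(scores, labels, fraction=0.10):
    """Fraction of true positives among the top-ranked `fraction` of documents."""
    ranked = rank_labels(scores, labels)
    k = max(1, round(fraction * len(ranked)))
    return sum(ranked[:k]) / k


if __name__ == "__main__":
    # Toy data (hypothetical): 10 documents, 2 of them positive.
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    print("NDCG:", ndcg(scores, labels))
    print("PPV in top 20%:", ppv_at_top(scores, labels, fraction=0.20))
```

If documents were ranked at random, the expected PPV of any top slice would equal the overall fraction of positive documents, which is one plausible reading of the 10.0% baseline reported above.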