A probabilistic model for mining implicit ‘chemical compound–gene’ relations from literature

  • Authors:
  • Shanfeng Zhu;Yasushi Okuno;Gozoh Tsujimoto;Hiroshi Mamitsuka

  • Affiliations:
  • Bioinformatics Center, Institute for Chemical Research, Kyoto University Gokasho, Uji 611-0011, Japan;Graduate School of Pharmaceutical Sciences, Kyoto University Sakyo-ku, Kyoto 606-8501, Japan;Graduate School of Pharmaceutical Sciences, Kyoto University Sakyo-ku, Kyoto 606-8501, Japan;Bioinformatics Center, Institute for Chemical Research, Kyoto University Gokasho, Uji 611-0011, Japan

  • Venue:
  • Bioinformatics
  • Year:
  • 2005

Quantified Score

Hi-index 3.84

Abstract

Motivation: Chemical compounds have become increasingly important in molecular biology, and 'chemical genomics' has attracted a great deal of attention in recent years. An important issue in current molecular biology is therefore to identify biologically related chemical compounds (more specifically, drugs) and genes. Co-occurrence of biological entities in the literature is a simple, comprehensive and popular technique for finding associations between these entities. Our focus is to mine implicit 'chemical compound–gene' relations from co-occurrence in the literature.

Results: We propose a probabilistic model, called the mixture aspect model (MAM), and an algorithm for estimating its parameters that efficiently handles different types of co-occurrence datasets at once. We examined the performance of our approach both by cross-validation on data generated from MEDLINE records and by a test on an independent, human-curated dataset of relationships between chemical compounds and genes in the ChEBI database. In both cases, we experimented with three different types of co-occurrence datasets (i.e. compound–gene, gene–gene and compound–compound co-occurrences). The experimental results show that MAM trained on all datasets outperformed any simple model trained on other combinations of datasets, with the difference being statistically significant in all cases. In particular, we found that incorporating compound–compound co-occurrences is the most effective way to improve predictive performance. Finally, we computed the likelihoods of all unknown compound–gene (more specifically, drug–gene) pairs using our approach, selected the top 20 pairs by likelihood, and validated them from biological, medical and pharmaceutical viewpoints.

Contact: mami@kuicr.kyoto-u.ac.jp
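
The abstract does not give the form of MAM, so the following is only a rough illustration of the general idea of scoring unseen pairs with a latent-variable co-occurrence model. It assumes a plain, single-dataset aspect model, P(c, g) = Σ_z P(z) P(c|z) P(g|z), fitted by EM to a toy compound–gene count matrix. It is not the authors' mixture aspect model, which combines three co-occurrence datasets. The function names, the toy counts and the number of aspects are all hypothetical.

```python
import random

def _normalize(values):
    total = sum(values)
    return [v / total for v in values]

def fit_aspect_model(counts, n_aspects=2, n_iter=100, seed=0):
    """Fit P(c, g) = sum_z P(z) P(c|z) P(g|z) by EM (plain aspect model).

    counts[c][g] is the co-occurrence count of compound c and gene g.
    Assumption: this is a generic aspect model, not the paper's MAM.
    """
    rng = random.Random(seed)
    n_c, n_g = len(counts), len(counts[0])
    aspects = range(n_aspects)
    p_z = [1.0 / n_aspects] * n_aspects
    # Random positive initialization breaks the symmetry between aspects.
    p_c_z = [_normalize([rng.random() + 0.1 for _ in range(n_c)]) for _ in aspects]
    p_g_z = [_normalize([rng.random() + 0.1 for _ in range(n_g)]) for _ in aspects]
    eps = 1e-12  # keeps every probability strictly positive

    for _ in range(n_iter):
        acc_z = [0.0] * n_aspects
        acc_c = [[0.0] * n_c for _ in aspects]
        acc_g = [[0.0] * n_g for _ in aspects]
        for c in range(n_c):
            for g in range(n_g):
                n = counts[c][g]
                if n == 0:
                    continue
                # E-step: posterior P(z | c, g) for this observed pair.
                joint = [p_z[z] * p_c_z[z][c] * p_g_z[z][g] for z in aspects]
                total = sum(joint)
                for z in aspects:
                    w = n * joint[z] / total
                    acc_z[z] += w
                    acc_c[z][c] += w
                    acc_g[z][g] += w
        # M-step: re-estimate parameters from the expected counts.
        p_z = _normalize([v + eps for v in acc_z])
        p_c_z = [_normalize([v + eps for v in row]) for row in acc_c]
        p_g_z = [_normalize([v + eps for v in row]) for row in acc_g]
    return p_z, p_c_z, p_g_z

def pair_probability(model, c, g):
    """Model probability of the (compound c, gene g) pair."""
    p_z, p_c_z, p_g_z = model
    return sum(p_z[z] * p_c_z[z][c] * p_g_z[z][g] for z in range(len(p_z)))

def rank_unseen_pairs(counts, model, top_k=20):
    """Return the top_k pairs with zero observed count, highest probability first."""
    scored = [
        (pair_probability(model, c, g), c, g)
        for c in range(len(counts))
        for g in range(len(counts[0]))
        if counts[c][g] == 0
    ]
    scored.sort(reverse=True)
    return scored[:top_k]

# Toy data (hypothetical): rows are compounds, columns are genes.
counts = [
    [4, 3, 0, 0],
    [5, 0, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 2, 0],
]
model = fit_aspect_model(counts, n_aspects=2)
top = rank_unseen_pairs(counts, model, top_k=3)
for prob, c, g in top:
    print(f"compound {c}, gene {g}: {prob:.4f}")
```

In this sketch, ranking only the pairs with zero observed count mirrors the abstract's step of scoring all unknown compound–gene pairs and keeping the highest-likelihood ones. Extending it to use gene–gene and compound–compound co-occurrences as well would require the model definition from the full paper.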