Inclusion of Textual Documentation in the Analysis of Multidimensional Data Sets: Application to Gene Expression Data

Authors:
Soumya Raychaudhuri;Hinrich Schütze;Russ B. Altman
Affiliations:
Department of Genetics, Stanford University, Stanford, CA 94305-5479, USA & Stanford Medical Informatics, Stanford University, Stanford, CA 94305-5479, USA. tumpa@stanford.edu;Department of Genetics, Stanford Univ., Stanford, CA & Stanford Medical Informatics, Stanford Univ., Stanford, CA;Department of Genetics, Stanford University, Stanford, CA 94305-5479, USA & Stanford Medical Informatics, Stanford University, Stanford, CA 94305-5479, USA. russ.altman@stanford.edu
Venue:
Machine Learning
Year:
2003

Abstract

Recently, biology has been confronted with large multidimensional gene expression data sets where the expression of thousands of genes is measured over dozens of conditions. The patterns in gene expression are frequently explained retrospectively by underlying biological principles. Here we present a method that uses text analysis to help find meaningful gene expression patterns that correlate with the underlying biology described in scientific literature. The main challenge is that the literature about an individual gene is not homogeneous and may address many unrelated aspects of the gene. In the first part of the paper we present and evaluate the neighbor divergence per gene (NDPG) method that assigns a score to a given subgroup of genes indicating the likelihood that the genes share a biological property or function. To do this, it uses only a reference index that connects genes to documents, and a corpus including those documents. In the second part of the paper we present an approach, optimizing separating projections (OSP), to search for linear projections in gene expression data that separate functionally related groups of genes from the rest of the genes; the objective function in our search is the NDPG score of the positively projected genes. A successful search, therefore, should identify patterns in gene expression data that correlate with meaningful biology. We apply OSP to a published gene expression data set; it discovers many biologically relevant projections. Since the method requires only numerical measurements (in this case expression) about entities (genes) with textual documentation (literature), we conjecture that this method could be transferred easily to other domains. The method should be able to identify relevant patterns even if the documentation for each entity pertains to many disparate subjects that are unrelated to each other.
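
The search described in the abstract can be illustrated with a minimal sketch: project each gene's expression vector onto a weight vector, take the positively projected genes as the candidate group, and score that group using only a gene-to-document index. Everything named here is hypothetical. In particular, `toy_group_score` is a stand-in (mean pairwise Jaccard overlap of document sets) and is not the NDPG score, whose definition the abstract does not give. The plain random search is likewise an assumption and is not claimed to be the optimization procedure used by OSP.

```python
import itertools
import random


def project(expression, weights):
    """Dot product of each gene's expression vector with the weight vector."""
    return {
        gene: sum(x * w for x, w in zip(values, weights))
        for gene, values in expression.items()
    }


def positive_group(expression, weights):
    """Genes whose projection is strictly positive, in sorted order."""
    scores = project(expression, weights)
    return sorted(gene for gene, s in scores.items() if s > 0)


def toy_group_score(group, gene_to_docs):
    """Placeholder objective (NOT NDPG): mean pairwise Jaccard overlap
    between the document sets of the genes in the group."""
    pairs = list(itertools.combinations(group, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = gene_to_docs[a] | gene_to_docs[b]
        if union:
            total += len(gene_to_docs[a] & gene_to_docs[b]) / len(union)
    return total / len(pairs)


def random_search(expression, gene_to_docs, n_trials=200, seed=0):
    """Assumed search strategy: sample random weight vectors and keep the
    one whose positively projected group scores highest."""
    rng = random.Random(seed)
    n_conditions = len(next(iter(expression.values())))
    best = (float("-inf"), None, [])
    for _ in range(n_trials):
        weights = [rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
        group = positive_group(expression, weights)
        score = toy_group_score(group, gene_to_docs)
        if score > best[0]:
            best = (score, weights, group)
    return best


# Hypothetical toy data: four genes measured over two conditions.
expression = {
    "g1": [2.0, 0.1],
    "g2": [1.5, -0.2],
    "g3": [-1.0, 0.3],
    "g4": [-2.0, 0.0],
}
# Hypothetical reference index connecting genes to document identifiers.
gene_to_docs = {
    "g1": {"d1", "d2"},
    "g2": {"d2", "d3"},
    "g3": {"d4"},
    "g4": {"d4", "d5"},
}

best_score, best_weights, best_group = random_search(expression, gene_to_docs)
```

The scoring function is passed the group and the index only, mirroring the abstract's statement that the score needs just a reference index and a corpus; swapping in a real NDPG implementation would leave the search loop unchanged.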