Recently, biology has been confronted with large multidimensional gene expression data sets where the expression of thousands of genes is measured over dozens of conditions. The patterns in gene expression are frequently explained retrospectively by underlying biological principles. Here we present a method that uses text analysis to help find meaningful gene expression patterns that correlate with the underlying biology described in scientific literature. The main challenge is that the literature about an individual gene is not homogeneous and may address many unrelated aspects of the gene. In the first part of the paper we present and evaluate the neighbor divergence per gene (NDPG) method, which assigns a score to a given subgroup of genes indicating the likelihood that the genes share a biological property or function. To do this, it uses only a reference index that connects genes to documents and a corpus that includes those documents. In the second part of the paper we present an approach, optimizing separating projections (OSP), that searches for linear projections in gene expression data that separate functionally related groups of genes from the rest of the genes; the objective function in our search is the NDPG score of the positively projected genes. A successful search, therefore, should identify patterns in gene expression data that correlate with meaningful biology. We apply OSP to a published gene expression data set; it discovers many biologically relevant projections. Since the method requires only numerical measurements (in this case expression) about entities (genes) with textual documentation (literature), we conjecture that this method could be transferred easily to other domains. The method should be able to identify relevant patterns even if the documentation for each entity pertains to many disparate subjects that are unrelated to each other.
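The OSP idea described above (search over linear projections, scoring the positively projected genes) can be illustrated with a minimal Python sketch. The abstract does not say how the search is carried out or how the NDPG score is computed, so everything specific here is an assumption: the search is plain random sampling of unit-length directions, and `make_purity_score` is a hypothetical toy stand-in for NDPG that uses invented gene labels rather than literature. The function names, parameters, and data are illustrative only and are not taken from the paper.

```python
from collections import Counter

import numpy as np


def positively_projected(expression, weights, threshold=0.0):
    """Indices of genes (rows) whose projection onto `weights` exceeds `threshold`."""
    projections = expression @ weights
    return np.flatnonzero(projections > threshold)


def make_purity_score(labels):
    """Hypothetical stand-in for the NDPG score (NOT the paper's method).

    Returns the fraction of the selected genes that share the most common label.
    """
    def score(gene_indices):
        counts = Counter(labels[i] for i in gene_indices)
        return max(counts.values()) / len(gene_indices)
    return score


def random_projection_search(expression, score_fn, n_trials=200, min_genes=2, seed=0):
    """Assumed search strategy: sample random unit directions, keep the best-scoring one."""
    rng = np.random.default_rng(seed)
    n_conditions = expression.shape[1]
    best_score, best_weights = -np.inf, None
    best_genes = np.array([], dtype=int)
    for _ in range(n_trials):
        weights = rng.standard_normal(n_conditions)
        weights /= np.linalg.norm(weights)
        genes = positively_projected(expression, weights)
        if genes.size < min_genes:
            continue  # too few genes to score as a group
        score = score_fn(genes)
        if score > best_score:
            best_score, best_weights, best_genes = score, weights, genes
    return best_score, best_weights, best_genes


# Toy data (invented): 6 genes measured over 2 conditions.
expression = np.array([
    [2.0, 0.1], [1.8, -0.2], [2.2, 0.0],    # genes 0-2 behave alike
    [-1.9, 0.3], [-2.1, -0.1], [-2.0, 0.2],
])
labels = ["ribosome", "ribosome", "ribosome", "other_a", "other_b", "other_c"]

best_score, best_weights, best_genes = random_projection_search(
    expression, make_purity_score(labels)
)
print(best_score, best_genes)
```

Passing the scoring function in as an argument keeps the search independent of how a gene group is scored, so a literature-based score could replace the toy one without changing the search loop.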