Clustering Genes Using Gene Expression and Text Literature Data

Authors:
Chengyong Yang;Erliang Zeng;Tao Li;Giri Narasimhan
Affiliations:
Florida International University;Florida International University;Florida International University;Florida International University
Venue:
CSB '05 Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference
Year:
2005

Abstract

Clustering of gene expression data is a standard technique for identifying closely related genes. In this paper, we develop a new clustering algorithm, MSC (Multi-Source Clustering), to perform exploratory analysis using two or more diverse sources of data. In particular, we investigate the problem of improving the clustering by integrating information obtained from gene expression data with knowledge extracted from biomedical text literature. In each iteration of algorithm MSC, an EM-type procedure is employed to bootstrap the model obtained from one data source by starting with the cluster assignments obtained in the previous iteration using the other data sources. Upon convergence, the two individual models are used to construct the final cluster assignment. We compare the results of algorithm MSC for two data sources with the results obtained when clustering is applied to each of the two sources separately. We also compare them with the results of the feature-level integration method, which performs the clustering after simply concatenating the features obtained from the two data sources. We show that the z-scores of the clustering results from MSC are better than those from the other methods. To evaluate our clusters further, functional enrichment results are presented using terms from the Gene Ontology database. Finally, by investigating the success of motif detection programs that use the clusters, we show that our approach, which integrates gene expression data and text data, reveals clusters that are biologically more meaningful than those identified using gene expression data alone.
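The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the MSC algorithm itself: the abstract does not specify the model, so the sketch assumes a hard-assignment, k-means-style stand-in for the EM-type step, and the rule used to combine the two models at the end (summing per-view distances after scaling) is likewise an assumption. The function names, the `init_labels` parameter, and the two feature matrices `X_a` (e.g. expression profiles) and `X_b` (e.g. text-derived vectors) are hypothetical.

```python
import numpy as np

def centroids(X, labels, k, rng):
    """Hard M-step: mean of each cluster; an empty cluster is re-seeded from a random row."""
    C = np.empty((k, X.shape[1]))
    for j in range(k):
        members = X[labels == j]
        C[j] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
    return C

def sq_dists(X, C):
    """Squared Euclidean distance from every row of X to every centroid."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def alternating_two_view_clustering(X_a, X_b, k, n_iter=50, seed=0, init_labels=None):
    """Alternate between two views: each view's model is fitted from the
    assignments produced by the other view, then reassigns the points."""
    rng = np.random.default_rng(seed)
    if init_labels is None:
        labels = rng.integers(k, size=len(X_a))  # random starting assignment
    else:
        labels = np.asarray(init_labels).copy()
    views = (X_a, X_b)
    models = [None, None]
    for _ in range(n_iter):
        previous = labels.copy()
        for v, X in enumerate(views):
            models[v] = centroids(X, labels, k, rng)         # fit from the other view's labels
            labels = sq_dists(X, models[v]).argmin(axis=1)   # reassign using this view only
        if np.array_equal(labels, previous):
            break  # assignments stopped changing
    # Assumed combination rule (not from the paper): scale each view's distances
    # by their mean so neither view dominates, then add them and take the argmin.
    total = 0.0
    for X, C in zip(views, models):
        d = sq_dists(X, C)
        total = total + d / d.mean()
    return total.argmin(axis=1), models
```

By contrast, the feature-level integration baseline mentioned in the abstract would amount to clustering a single matrix such as `np.hstack([X_a, X_b])`, where the source with more or larger-valued features can dominate the distance computation.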