Microarray gene cluster identification and annotation through cluster ensemble and EM-based informative textual summarization

Authors:
Xiaohua Hu; E. K. Park; Xiaodan Zhang
Affiliations:
Yellow-River Scholar at Henan University, Kaifeng, China and College of Information Science and Technology, Drexel University, Philadelphia, PA; School of Computing and Engineering, University of Missouri at Kansas City, Kansas City, MO; College of Information Science and Technology, Drexel University, Philadelphia, PA
Venue:
IEEE Transactions on Information Technology in Biomedicine - Special section on computational intelligence in medical systems
Year:
2009

Abstract

Generating high-quality gene clusters and identifying the biological mechanisms underlying them are important goals of clustering-based gene expression analysis. To obtain high-quality clusters, most current approaches rely on choosing the best clustering algorithm, that is, one whose design biases and assumptions match the underlying distribution of the dataset. This approach has two problems: 1) the underlying data distribution of gene expression datasets is usually unknown, and 2) so many clustering algorithms are available that choosing the proper one is very challenging. For providing a textual summary of gene clusters, the most explored approach is extractive and builds essentially on techniques borrowed from information retrieval, where the objective is to provide terms for query expansion rather than a stand-alone summary of an entire document set. Another drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, that addresses these issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization, and that provides an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization (EM) based algorithm automatically identifies subtopics and extracts the most probable terms for each subtopic. Then, the extracted top k topical terms from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.
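
The abstract does not spell out how the cluster ensemble combines its base clusterings, so the following is only a minimal sketch of the general idea, not the paper's novel approach. It uses a generic co-association (evidence accumulation) consensus: two genes end up in the same consensus cluster when more than a chosen fraction of the base clusterings put them together, with merges applied transitively. The function name `consensus_clusters`, the `threshold` parameter, and the toy labels are all assumptions made for illustration.

```python
from itertools import combinations


def consensus_clusters(partitions, threshold=0.5):
    """Merge items that co-cluster in more than `threshold` of the partitions.

    `partitions` is a list of label lists, one per base clustering, all
    labeling the same items in the same order (an assumed input format).
    """
    n_items = len(partitions[0])
    if any(len(p) != n_items for p in partitions):
        raise ValueError("all partitions must label the same items")

    # Union-find over item indices; each item starts in its own group.
    parent = list(range(n_items))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n_items), 2):
        # Fraction of base clusterings that put items i and j together.
        agree = sum(p[i] == p[j] for p in partitions) / len(partitions)
        if agree > threshold:
            parent[find(i)] = find(j)

    # Relabel group roots as 0, 1, 2, ... in order of first appearance.
    labels, seen = [], {}
    for i in range(n_items):
        root = find(i)
        labels.append(seen.setdefault(root, len(seen)))
    return labels


# Hypothetical labels for six genes from three base clusterings.
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 1, 1, 2],
]
print(consensus_clusters(base_partitions))  # prints [0, 0, 0, 1, 1, 1]
```

Because merges are transitive, genes 3 and 5 land in the same consensus cluster even though only one of the three base clusterings groups them directly; both agree with gene 4 in two of three. A stricter consensus rule would keep them apart, which is one reason this sketch should be read as illustrative rather than as a recommended design.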