Microarray gene cluster identification and annotation through cluster ensemble and EM-based informative textual summarization

  • Authors:
  • Xiaohua Hu;E. K. Park;Xiaodan Zhang

  • Affiliations:
  • Yellow-River Scholar at Henan University, Kaifeng, China and College of Information Science and Technology, Drexel University, Philadelphia, PA;School of Computing and Engineering, University of Missouri at Kansas City, Kansas City, MO;College of Information Science and Technology, Drexel University, Philadelphia, PA

  • Venue:
  • IEEE Transactions on Information Technology in Biomedicine - Special section on computational intelligence in medical systems
  • Year:
  • 2009

Abstract

Generating high-quality gene clusters and identifying the biological mechanisms underlying them are important goals of cluster analysis of gene expression data. To get high-quality clustering results, most current approaches rely on choosing the best clustering algorithm, one whose design biases and assumptions match the underlying distribution of the dataset. There are two issues with this approach: 1) the underlying data distribution of gene expression datasets is usually unknown and 2) so many clustering algorithms are available that choosing the proper one is very challenging. To provide a textual summary of the gene clusters, the most explored approach is the extractive approach, which essentially builds upon techniques borrowed from information retrieval, where the objective is to provide terms for query expansion, not to act as a stand-alone summary of the entire document set. Another drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and are studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, to address these challenging issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization, and to provide an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization-based algorithm automatically identifies subtopics and extracts the most probable terms for each subtopic. Then, the top k topical terms extracted from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.
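
The abstract does not say how the cluster ensemble combines the base clusterings. The following is a minimal sketch of the general idea only, assuming a co-association (evidence accumulation) scheme; it is not confirmed to be the authors' method. Each base clustering votes on whether two genes belong together, and genes that are co-clustered in more than a chosen fraction of runs are linked. The function names, the `threshold` parameter, and the toy labelings are all hypothetical.

```python
from itertools import combinations


def co_association(labelings):
    """Return an n x n matrix whose (i, j) entry is the fraction of base
    clusterings that place items i and j in the same cluster."""
    n = len(labelings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Link items co-clustered in more than `threshold` of the runs and
    return the connected components as consensus labels (assumed scheme)."""
    co = co_association(labelings)
    n = len(co)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy input: three base clusterings of five genes (made-up labels).
labelings = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
consensus = consensus_clusters(labelings)
```

Because the vote depends only on whether two items share a label within a run, the arbitrary cluster numbering used by each base algorithm does not matter, which is one reason co-association is a common starting point for ensembles.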