Clustering of diverse genomic data using information fusion

Authors:
Jyotsna Kasturi;Raj Acharya
Affiliations:
The Pennsylvania State University, University Park, PA;The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 2004 ACM symposium on Applied computing
Year:
2004

Abstract

Genome sequencing projects and high-throughput technologies such as DNA and protein arrays have produced a very large amount of information-rich data. Microarray experimental data are a valuable but limited source for inferring gene regulation mechanisms on a genomic scale. Additional information, such as promoter sequences of genes/DNA-binding motifs, gene ontologies, and location data, can increase the statistical significance of the findings when combined with gene expression analysis. This paper introduces a machine learning approach to information fusion for combining heterogeneous genomic data. The algorithm uses an unsupervised joint learning mechanism that identifies clusters of genes using the combined data. The correlation between gene expression time-series patterns obtained from different experimental conditions and the presence of several distinct and repeated motifs in their upstream sequences is examined here using publicly available yeast cell-cycle data. The results show that the combined learning approach identifies correlated genes effectively. The algorithm provides an automated clustering method but allows the user to specify a priori the influence of each data type on the final clustering using probabilities.
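The idea of weighting each data type by a user-chosen probability can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the learning mechanism, so the sketch assumes a simple stand-in (a weighted sum of normalized distance matrices followed by average-linkage agglomerative clustering). All function names, the toy expression profiles, and the toy motif counts are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_distances(rows):
    """Pairwise Euclidean distances, scaled so the largest distance is 1."""
    n = len(rows)
    d = [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    largest = max(max(r) for r in d) or 1.0
    return [[v / largest for v in r] for r in d]

def fuse(distance_matrices, weights):
    """Weighted sum of distance matrices; weights act as probabilities."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(distance_matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, distance_matrices))
             for j in range(n)] for i in range(n)]

def cluster(dist, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy data for four genes (made up for illustration only).
# Expression profiles pair genes (0, 1) and (2, 3);
# motif counts pair genes (0, 2) and (1, 3).
expression = [[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0], [2.9, 1.9, 1.1]]
motifs = [[2, 0], [0, 2], [2, 0], [0, 2]]

d_expr = normalized_distances(expression)
d_motif = normalized_distances(motifs)

# Expression-dominated weighting groups genes as [[0, 1], [2, 3]].
print(cluster(fuse([d_expr, d_motif], [0.9, 0.1]), 2))
# Motif-dominated weighting groups genes as [[0, 2], [1, 3]].
print(cluster(fuse([d_expr, d_motif], [0.1, 0.9]), 2))
```

Under these assumptions, shifting the weights from expression toward motifs changes which genes end up together, which is the kind of a priori control over each data type's influence that the abstract describes.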