Conceptual Clustering of Heterogeneous Gene Expression Sequences

Authors:
Sally McClean;Bryan Scotney;Steve Robinson
Affiliations:
School of Computing and Information Engineering,;School of Computing and Information Engineering, Faculty of Informatics, University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland (E-mail: bw.scotney@ulster.ac.uk ...;School of Computing and Information Engineering, Faculty of Informatics, University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland (E-mail: s.robinson@ulster.ac.uk ...
Venue:
Artificial Intelligence Review
Year:
2003

Abstract

We are concerned with clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach that uses a Hidden Markov Model (HMM) that has as states the stages of the underlying process that generates the gene sequences, thus allowing us to handle complex and heterogeneous data. Each cluster is described in terms of a HMM where we seek to find schema mappings between the states of the original sequences and the states of the HMM.

The general solution that we propose involves several distinct tasks. Firstly, there is a clustering problem where we seek to group similar sequences; for this we use mutual entropy to identify associations between sequence states. Secondly, because we are concerned with clustering heterogeneous sequences, we must determine the mappings between the states of each sequence in a cluster and the states of an underlying hidden process; for this we compute the most probable mapping. Thirdly, using these mappings we employ maximum likelihood techniques to learn the probabilistic description of the hidden Markov process for each cluster. Fourthly, we use these descriptions to characterise the clusters using Dynamic Programming to determine the most probable pathway for each cluster. Finally, we derive linguistic labels to describe the clusters in a user-friendly manner. Such an approach provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data. Non time-homogeneous HMMs are used to capture the full temporal semantics.
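
The third and fourth tasks named in the abstract (maximum likelihood learning of a non time-homogeneous Markov process, then dynamic programming for the most probable pathway) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that every sequence in a cluster has already been mapped onto the hidden states (coded as integers), that all sequences have the same length, and that the states are fully observed after mapping. The function names `fit_chain` and `most_probable_path` and the toy data are hypothetical.

```python
import math


def fit_chain(sequences, n_states):
    """Maximum-likelihood estimates for a non time-homogeneous Markov chain.

    Assumes equal-length sequences of integer states in range(n_states).
    Returns the initial distribution and one transition matrix per time step.
    """
    length = len(sequences[0])
    init = [0.0] * n_states
    trans = [[[0.0] * n_states for _ in range(n_states)]
             for _ in range(length - 1)]
    # Count initial states and time-specific transitions.
    for seq in sequences:
        init[seq[0]] += 1
        for t in range(length - 1):
            trans[t][seq[t]][seq[t + 1]] += 1
    # Normalise counts into probabilities (relative frequencies are the MLE).
    total = sum(init)
    init = [count / total for count in init]
    for t in range(length - 1):
        for i in range(n_states):
            row_sum = sum(trans[t][i])
            if row_sum > 0:  # rows for unseen states stay all-zero
                trans[t][i] = [count / row_sum for count in trans[t][i]]
    return init, trans


def most_probable_path(init, trans):
    """Dynamic programming (Viterbi-style) search for the likeliest pathway."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n = len(init)
    score = [log(p) for p in init]  # best log-probability of ending in each state
    back = []                       # back-pointers, one list per time step
    for step in trans:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + log(step[i][j]))
            pointers.append(best_i)
            new_score.append(score[best_i] + log(step[best_i][j]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    last = max(range(n), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, math.exp(score[last])


# Toy cluster of four mapped sequences over three hypothetical stages.
sequences = [[0, 1, 2], [0, 1, 2], [0, 1, 1], [1, 1, 2]]
init, trans = fit_chain(sequences, n_states=3)
path, prob = most_probable_path(init, trans)
print(path, prob)  # path is [0, 1, 2], with probability about 0.5625
```

Keeping a separate transition matrix for each time step is what makes the chain non time-homogeneous in this sketch; a time-homogeneous variant would pool the counts across all steps into a single matrix.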