Recent advances in genomic technologies and the availability of large-scale microarray datasets call for the development of advanced data analysis techniques, such as data mining and statistical analysis, to name a few. Among the mining techniques proposed so far, cluster analysis has become a standard method for the analysis of microarray expression data. It can be used both for initial screening of patients and for extraction of disease molecular signatures. Moreover, clustering can be profitably exploited to characterize genes of unknown function and to uncover patterns that can be interpreted as indications of the status of cellular processes. Finally, clustering biological data is useful not only for exploring the data but also for discovering implicit links between the objects. To this end, several clustering approaches have been proposed to obtain a good trade-off between the accuracy and the efficiency of the clustering process. In particular, great attention has been devoted to hierarchical clustering algorithms for their accuracy in unsupervised identification and stratification of groups of similar genes or patients, while partition-based approaches are exploited when fast computations are required. Indeed, it is well known that no existing clustering algorithm completely satisfies both accuracy and efficiency requirements; thus, a good clustering algorithm has to be evaluated with respect to external criteria that are independent of the metric used to compute the clusters. In this paper, we propose a clustering algorithm called M-CLUBS (for Microarray data CLustering Using Binary Splitting) that exhibits higher accuracy than the hierarchical algorithms proposed so far while allowing faster computation than partition-based approaches. As we show in the experimental section, M-CLUBS is faster and more accurate than other algorithms, including k-means and its recently proposed refinements.
The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion possessing unique analytical properties that we exploit to achieve a very fast computation. M-CLUBS derives good clusters without requiring input from users, and it is robust and impervious to noise, while providing better speed and accuracy than methods, such as BIRCH, that are endowed with the same critical properties. Because microarray data are represented as arrays of numeric values, M-CLUBS is well suited to analyzing them, since it is designed to perform well with Euclidean distances. To strengthen the results, the obtained clusters were interpreted by a domain expert and evaluated using quality measures specifically tailored for biological validity assessment.
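
The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the "least quadratic distance criterion" is the within-cluster sum of squared Euclidean distances (SSE), and that splits are axis-parallel. It relies on a standard identity: separating a group into two parts of sizes n1 and n2 with centroids c1 and c2 lowers the SSE by (n1 * n2 / (n1 + n2)) * ||c1 - c2||^2, so every candidate split can be scored from running coordinate sums alone. All function names are hypothetical, and the thresholds `min_gain` and `max_cost` are stand-ins for illustration only; the abstract states that M-CLUBS itself needs no such user input.

```python
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def coord_sums(points: Sequence[Point]) -> List[float]:
    """Per-dimension sum of coordinates (centroid = sums / count)."""
    dims = len(points[0])
    return [sum(p[d] for p in points) for d in range(dims)]


def sse_gap(n1: int, sum1: Sequence[float], n2: int, sum2: Sequence[float]) -> float:
    """SSE difference between keeping two groups apart and merging them:
    (n1 * n2 / (n1 + n2)) * ||c1 - c2||^2."""
    dist2 = sum((a / n1 - b / n2) ** 2 for a, b in zip(sum1, sum2))
    return n1 * n2 / (n1 + n2) * dist2


def best_binary_split(points: Sequence[Point]) -> Optional[Tuple[int, float, float]]:
    """Return (dimension, threshold, SSE reduction) of the best axis-parallel
    split, or None if no split separates the points."""
    n = len(points)
    if n < 2:
        return None
    dims = len(points[0])
    total = coord_sums(points)
    best = None
    for d in range(dims):
        ordered = sorted(points, key=lambda p: p[d])
        left = [0.0] * dims  # running sums of the points left of the cut
        for i in range(n - 1):
            left = [a + b for a, b in zip(left, ordered[i])]
            if ordered[i][d] == ordered[i + 1][d]:
                continue  # cannot cut between equal coordinates
            right = [t - a for t, a in zip(total, left)]
            gain = sse_gap(i + 1, left, n - i - 1, right)
            if best is None or gain > best[2]:
                best = (d, (ordered[i][d] + ordered[i + 1][d]) / 2, gain)
    return best


def divisive_phase(points: Sequence[Point], min_gain: float) -> List[List[Point]]:
    """Split recursively while the best split lowers the SSE by at least
    min_gain (hypothetical stopping rule)."""
    pending, done = [list(points)], []
    while pending:
        cluster = pending.pop()
        split = best_binary_split(cluster)
        if split is None or split[2] < min_gain:
            done.append(cluster)
            continue
        d, threshold, _ = split
        pending.append([p for p in cluster if p[d] <= threshold])
        pending.append([p for p in cluster if p[d] > threshold])
    return done


def agglomerative_phase(clusters: Sequence[Sequence[Point]], max_cost: float) -> List[List[Point]]:
    """Merge the cheapest pair repeatedly while the SSE increase stays within
    max_cost (hypothetical stopping rule)."""
    merged = [list(c) for c in clusters]
    while len(merged) > 1:
        best = None
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                cost = sse_gap(len(merged[i]), coord_sums(merged[i]),
                               len(merged[j]), coord_sums(merged[j]))
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        if best[0] > max_cost:
            break
        _, i, j = best
        merged[i].extend(merged.pop(j))  # j > i, so index i stays valid
    return merged


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    parts = divisive_phase(samples, min_gain=5.0)
    print(agglomerative_phase(parts, max_cost=1.0))
```

In this sketch the divisive phase may over-split, and the agglomerative phase then repairs cuts that cost little SSE to undo; both phases share the same closed-form score, which is what keeps the computation cheap.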