Data transformation for sum squared residue

Authors:
Hyuk Cho
Affiliations:
Computer Science, Sam Houston State University, Huntsville, TX
Venue:
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Year:
2010

Abstract

The sum squared residue is widely used as a quality measure for clustering and co-clustering; however, little research has examined its detailed properties. Recent research shows that the residue is useful for discovering shifting patterns but inappropriate for finding scaling patterns. To remedy this weakness, we propose specific data transformations that adjust latent scaling factors and thereby lower the residue. First, we consider data matrix models with varied shifting and scaling factors. Then, we formally analyze the effect of several data transformations on the residue. Finally, we empirically validate the analysis with publicly available human cancer gene expression datasets. Both the analytical and experimental results reveal that column standard deviation normalization and column Z-score transformation enable the residue to handle scaling factors, which in turn yields better tissue sample clustering accuracy.
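
The claim about shifting versus scaling patterns can be illustrated with a small sketch. It is not the paper's code or its exact matrix models. It assumes the usual residue definition h_ij = a_ij - (row mean) - (column mean) + (overall mean), a hypothetical toy matrix built from one row profile with per-column shift and positive scale factors, and the population standard deviation for normalization. The names `sum_squared_residue` and `transform_columns` are made up for this illustration.

```python
from statistics import mean, pstdev


def sum_squared_residue(a):
    """Sum over all entries of (a_ij - row mean - column mean + overall mean)^2."""
    row_means = [mean(row) for row in a]
    col_means = [mean(col) for col in zip(*a)]
    overall = mean(row_means)  # rows have equal length, so this is the overall mean
    return sum(
        (x - row_means[i] - col_means[j] + overall) ** 2
        for i, row in enumerate(a)
        for j, x in enumerate(row)
    )


def transform_columns(a, center):
    """Divide each column by its standard deviation.

    With center=True the column mean is subtracted first (column Z-score);
    with center=False only the scaling is applied (standard deviation
    normalization). Population standard deviation is an assumption here.
    """
    new_cols = []
    for col in zip(*a):
        mu = mean(col) if center else 0.0
        sd = pstdev(col)
        new_cols.append([(x - mu) / sd for x in col])
    return [list(row) for row in zip(*new_cols)]


# Hypothetical toy data: one row profile, per-column shifts and scales.
base = [1.0, 2.0, 4.0, 7.0]
shift = [0.0, 3.0, -1.0]
scale = [1.0, 2.0, 5.0]

# Shifting pattern only: a_ij = base_i + shift_j
shifted = [[b + s for s in shift] for b in base]
# Scaling plus shifting pattern: a_ij = base_i * scale_j + shift_j
scaled = [[b * k + s for k, s in zip(scale, shift)] for b in base]

ssr_shifted = sum_squared_residue(shifted)
ssr_scaled = sum_squared_residue(scaled)
ssr_sd = sum_squared_residue(transform_columns(scaled, center=False))
ssr_z = sum_squared_residue(transform_columns(scaled, center=True))

print(f"shift only:           {ssr_shifted:.6f}")
print(f"scale and shift:      {ssr_scaled:.6f}")
print(f"after SD normalizing: {ssr_sd:.6f}")
print(f"after column Z-score: {ssr_z:.6f}")
```

In this toy setting the residue is essentially zero for the pure shifting pattern and clearly nonzero once column scaling is introduced. Both column transformations bring it back to essentially zero, because dividing by the column standard deviation cancels the per-column scale factor. Real expression data contains noise, so the residue would be reduced rather than eliminated.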