Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard; moreover, in high-dimensional spaces the classic distance metrics fail to capture the real similarities between objects. Second, the observed phenomena evolve, so the datasets accumulate over time. In this paper we address both problems. To tackle high dimensionality, we apply a co-clustering approach to the matrix that stores the occurrences of features in the observed objects. Co-clustering computes a partition of the objects and a partition of the features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters hierarchically, building two hierarchies: one on the objects and one on the features. The two hierarchies are coupled: the clusters at a given level of one hierarchy are paired with the clusters at the same level of the other hierarchy, and together they form the co-clusters. Each cluster in one hierarchy thus provides insight into the clusters of the other. A further novelty of the proposed solution is that the number of clusters is not fixed in advance. The produced hierarchies nevertheless remain compact, and therefore more readable, because our method allows a cluster to split into multiple children at the lower level. As regards the second challenge, the accumulation of data makes the datasets intractably large over time. An incremental solution alleviates this issue because it partitions the problem. In this paper we therefore introduce an incremental version of our hierarchical co-clustering algorithm: it starts from an intermediate solution computed on the previous version of the data and updates the co-clustering results considering only the newly added block of data.
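To make the co-clustering idea concrete, the following is a minimal sketch of *flat* co-clustering with a squared-error objective: rows (objects) and columns (features) are alternately reassigned so that every matrix entry lies close to the mean of its co-cluster. This is only an illustration of the general principle, not the hierarchical algorithm proposed in the paper; the fixed cluster counts `k_rows`/`k_cols` and the deterministic farthest-first seeding are assumptions made for the demo.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed_labels(vecs, k):
    """Deterministic farthest-first seeding, then nearest-seed assignment."""
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(len(vecs)),
                         key=lambda i: min(dist2(vecs[i], vecs[s]) for s in seeds)))
    return [min(range(k), key=lambda c: dist2(v, vecs[seeds[c]])) for v in vecs]

def cocluster(matrix, k_rows, k_cols, iters=10):
    """Alternately reassign rows and columns so that every entry sits close
    to the mean of its co-cluster (a squared-loss co-clustering objective)."""
    n, m = len(matrix), len(matrix[0])
    row_lab = seed_labels(matrix, k_rows)
    col_lab = seed_labels([[matrix[i][j] for i in range(n)] for j in range(m)],
                          k_cols)

    def means():
        # mean of each (row-cluster, column-cluster) block
        s = [[0.0] * k_cols for _ in range(k_rows)]
        c = [[0] * k_cols for _ in range(k_rows)]
        for i in range(n):
            for j in range(m):
                s[row_lab[i]][col_lab[j]] += matrix[i][j]
                c[row_lab[i]][col_lab[j]] += 1
        return [[s[r][q] / c[r][q] if c[r][q] else 0.0 for q in range(k_cols)]
                for r in range(k_rows)]

    for _ in range(iters):
        mu = means()
        for i in range(n):   # best row-cluster for each object
            row_lab[i] = min(range(k_rows), key=lambda r: sum(
                (matrix[i][j] - mu[r][col_lab[j]]) ** 2 for j in range(m)))
        mu = means()
        for j in range(m):   # best column-cluster for each feature
            col_lab[j] = min(range(k_cols), key=lambda q: sum(
                (matrix[i][j] - mu[row_lab[i]][q]) ** 2 for i in range(n)))
    return row_lab, col_lab
```

On a matrix with two roughly block-diagonal groups, e.g. `cocluster([[5,5,0,0],[5,4,0,1],[0,0,5,4],[1,0,5,5]], 2, 2)`, the row and column partitions recover the two coupled blocks, which is exactly the pairing of object clusters with feature clusters that the hierarchical variant builds level by level.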
This solution speeds up the computation with respect to the original approach, which would recompute the result on the entire dataset. In addition, the incremental algorithm produces approximately the same answer as the original version while saving much computational load. We validate the incremental approach on several high-dimensional datasets and perform a thorough comparison with both the original version of our algorithm and state-of-the-art competitors. The obtained results open the way to a novel usage of co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally, thus "incorporating" data gradually into an ongoing co-clustering solution.
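The incremental step can be sketched as follows: instead of re-clustering the whole dataset, each object in the newly arrived block is folded into the existing co-clustering, and a new cluster is opened only when no existing one fits. This is a simplified, hypothetical illustration of the idea, not the paper's update procedure; the co-cluster means `mu`, the column labels `col_lab`, and the per-entry `threshold` are assumed to come from a previous co-clustering run.

```python
def assign_new_rows(new_rows, col_lab, mu, threshold=4.0):
    """Fold a new block of objects into an existing co-clustering:
    place each new row in the closest existing row-cluster, or route it
    to a fresh cluster index when even the best fit exceeds `threshold`
    squared error per entry."""
    k_rows = len(mu)
    labels = []
    for row in new_rows:
        # cost of putting this row in each existing row-cluster,
        # measured against the current co-cluster means
        costs = [sum((row[j] - mu[r][col_lab[j]]) ** 2 for j in range(len(row)))
                 for r in range(k_rows)]
        best = min(range(k_rows), key=costs.__getitem__)
        if costs[best] / len(row) > threshold:
            labels.append(k_rows)   # poor fit everywhere: open a new cluster
        else:
            labels.append(best)
    return labels
```

For example, with the two-block means `mu = [[4.75, 0.25], [0.25, 4.75]]` and `col_lab = [0, 0, 1, 1]` from a previous run, the rows `[5,5,0,0]` and `[0,0,5,5]` are absorbed into clusters 0 and 1 respectively, while an outlier such as `[9,9,9,9]` is routed to a new cluster. Only the new block is touched, which is what makes the incremental processing of accumulating data cheap.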