Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard; moreover, in high-dimensional spaces the classic distance metrics fail to capture the real similarities between objects. Second, the observed phenomena evolve, so the datasets accumulate over time. In this paper we address both problems. To tackle high dimensionality, we apply a co-clustering approach to the matrix that stores the occurrences of features in the observed objects. Co-clustering computes a partition of the objects and a partition of the features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters hierarchically, building two hierarchies: one on the objects and one on the features. The two hierarchies are coupled: the clusters at a given level of one hierarchy are paired with the clusters at the same level of the other hierarchy, and together they form the co-clusters. Each cluster in one hierarchy thus provides insight into the clusters of the other. A further novelty of the proposed solution is that the number of clusters is not fixed in advance. The produced hierarchies nevertheless remain compact, and therefore more readable, because our method allows a cluster to split into multiple children at the lower level. As regards the second challenge, the accumulation of data makes the datasets intractably large over time. An incremental solution alleviates this issue because it partitions the problem. In this paper we therefore introduce an incremental version of our hierarchical co-clustering algorithm: it starts from an intermediate solution computed on the previous version of the data and updates the co-clustering results considering only the newly added block of data.
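To make the co-clustering idea concrete, the following is a minimal sketch of *flat* co-clustering with a squared-error objective: rows (objects) and columns (features) are alternately reassigned so that every matrix entry lies close to the mean of its co-cluster. This is only an illustration of the general principle, not the hierarchical algorithm proposed in the paper; the fixed cluster counts `k_rows`/`k_cols` and the deterministic farthest-first seeding are assumptions made for the demo.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed_labels(vecs, k):
    """Deterministic farthest-first seeding, then nearest-seed assignment."""
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(len(vecs)),
                         key=lambda i: min(dist2(vecs[i], vecs[s]) for s in seeds)))
    return [min(range(k), key=lambda c: dist2(v, vecs[seeds[c]])) for v in vecs]

def cocluster(matrix, k_rows, k_cols, iters=10):
    """Alternately reassign rows and columns so that every entry sits close
    to the mean of its co-cluster (a squared-loss co-clustering objective)."""
    n, m = len(matrix), len(matrix[0])
    row_lab = seed_labels(matrix, k_rows)
    col_lab = seed_labels([[matrix[i][j] for i in range(n)] for j in range(m)],
                          k_cols)

    def means():
        # mean of each (row-cluster, column-cluster) block
        s = [[0.0] * k_cols for _ in range(k_rows)]
        c = [[0] * k_cols for _ in range(k_rows)]
        for i in range(n):
            for j in range(m):
                s[row_lab[i]][col_lab[j]] += matrix[i][j]
                c[row_lab[i]][col_lab[j]] += 1
        return [[s[r][q] / c[r][q] if c[r][q] else 0.0 for q in range(k_cols)]
                for r in range(k_rows)]

    for _ in range(iters):
        mu = means()
        for i in range(n):   # best row-cluster for each object
            row_lab[i] = min(range(k_rows), key=lambda r: sum(
                (matrix[i][j] - mu[r][col_lab[j]]) ** 2 for j in range(m)))
        mu = means()
        for j in range(m):   # best column-cluster for each feature
            col_lab[j] = min(range(k_cols), key=lambda q: sum(
                (matrix[i][j] - mu[row_lab[i]][q]) ** 2 for i in range(n)))
    return row_lab, col_lab
```

On a matrix with two roughly block-diagonal groups, e.g. `cocluster([[5,5,0,0],[5,4,0,1],[0,0,5,4],[1,0,5,5]], 2, 2)`, the row and column partitions recover the two coupled blocks, which is exactly the pairing of object clusters with feature clusters that the hierarchical variant builds level by level.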
This solution speeds up the computation with respect to the original approach, which would recompute the result on the entire dataset. In addition, the incremental algorithm produces approximately the same answer as the original version while saving much computational load. We validate the incremental approach on several high-dimensional datasets and perform a thorough comparison with both the original version of our algorithm and state-of-the-art competitors. The obtained results open the way to a novel usage of co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally, thus "incorporating" data gradually into an ongoing co-clustering solution.
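The incremental step can be sketched as follows: instead of re-clustering the whole dataset, each object in the newly arrived block is folded into the existing co-clustering, and a new cluster is opened only when no existing one fits. This is a simplified, hypothetical illustration of the idea, not the paper's update procedure; the co-cluster means `mu`, the column labels `col_lab`, and the per-entry `threshold` are assumed to come from a previous co-clustering run.

```python
def assign_new_rows(new_rows, col_lab, mu, threshold=4.0):
    """Fold a new block of objects into an existing co-clustering:
    place each new row in the closest existing row-cluster, or route it
    to a fresh cluster index when even the best fit exceeds `threshold`
    squared error per entry."""
    k_rows = len(mu)
    labels = []
    for row in new_rows:
        # cost of putting this row in each existing row-cluster,
        # measured against the current co-cluster means
        costs = [sum((row[j] - mu[r][col_lab[j]]) ** 2 for j in range(len(row)))
                 for r in range(k_rows)]
        best = min(range(k_rows), key=costs.__getitem__)
        if costs[best] / len(row) > threshold:
            labels.append(k_rows)   # poor fit everywhere: open a new cluster
        else:
            labels.append(best)
    return labels
```

For example, with the two-block means `mu = [[4.75, 0.25], [0.25, 4.75]]` and `col_lab = [0, 0, 1, 1]` from a previous run, the rows `[5,5,0,0]` and `[0,0,5,5]` are absorbed into clusters 0 and 1 respectively, while an outlier such as `[9,9,9,9]` is routed to a new cluster. Only the new block is touched, which is what makes the incremental processing of accumulating data cheap.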