The Relevant-Set Correlation Model for Data Clustering

Authors: Michael E. Houle
Affiliations: National Institute of Informatics, Tokyo, Japan
Venue: Statistical Analysis and Data Mining
Year: 2008

Citing: 0
Cited by: 7

Applying relevant set correlation clustering to multi-criteria recommender systems
Proceedings of the third ACM conference on Recommender systems

SEICOS: semantically enriched interactive collaborative online shopping
Proceedings of the 11th International Conference on Information Integration and Web-based Applications & Services

Active caching for similarity queries based on shared-neighbor information
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management

Can shared-neighbor distances defeat the curse of dimensionality?
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management

Quality of similarity rankings in time series
SSTD'11 Proceedings of the 12th international conference on Advances in spatial and temporal databases

A set correlation model for partitional clustering
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I

Object-based visual query suggestion
Multimedia Tools and Applications

Abstract

This paper introduces a model for clustering, the Relevant-Set Correlation (RSC) model, that requires no direct knowledge of the nature or representation of the data. Instead, the RSC model relies solely on the existence of an oracle that accepts a query in the form of a reference to a data item, and returns a ranked set of references to items that are most relevant to the query. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. The RSC significance measures can be used to evaluate the relative importance of cluster candidates of various sizes, avoiding the problems of bias found with other shared-neighbor methods that use fixed neighborhood sizes. A scalable clustering heuristic based on the RSC model is also presented and demonstrated for large, high-dimensional datasets using a fast approximate similarity search structure as the oracle. © 2008 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 000-000, 2008
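
The abstract does not state the correlation formula, so the following is only a minimal sketch of one natural reading: treat each relevant set as a 0/1 membership vector over the n data items and take the Pearson correlation of two such vectors. The significance score shown is the standardized overlap under the assumption that the sets are drawn uniformly at random, which works out to sqrt(n - 1) times that correlation. The names `make_oracle`, `set_correlation`, and `significance` are hypothetical, and the brute-force 1-D oracle is a stand-in for the similarity search structure mentioned in the abstract.

```python
from math import sqrt


def make_oracle(points):
    """Toy oracle (an assumption): ranks item ids by distance to the query item."""
    def oracle(query_id, k):
        ranked = sorted(
            range(len(points)),
            key=lambda j: (abs(points[j] - points[query_id]), j),
        )
        return ranked[:k]  # ranked references to the k most relevant items
    return oracle


def set_correlation(a, b, n):
    """Pearson correlation between the membership indicators of sets a and b,
    both drawn from a universe of n items."""
    size_a, size_b = len(a), len(b)
    if not (0 < size_a < n and 0 < size_b < n):
        raise ValueError("both sets must be non-empty proper subsets")
    overlap = len(a & b)
    return (n * overlap - size_a * size_b) / sqrt(
        size_a * size_b * (n - size_a) * (n - size_b)
    )


def significance(a, b, n):
    """Standard score of the overlap |a & b| if the sets were chosen at random."""
    return sqrt(n - 1) * set_correlation(a, b, n)


# Two well-separated groups of 1-D points; items are referenced by index only.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
oracle = make_oracle(points)
near_0 = set(oracle(0, 3))
near_1 = set(oracle(1, 3))
near_4 = set(oracle(4, 3))
same_group = set_correlation(near_0, near_1, len(points))
other_group = set_correlation(near_0, near_4, len(points))
```

Because the score depends only on set sizes and overlaps, relevant sets of different sizes can be compared on one scale, which is the property the abstract contrasts with fixed-neighborhood shared-neighbor methods.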