Clustering with Lower Bound on Similarity

Authors:
Mohammad Al Hasan;Saeed Salem;Benjarath Pupacdi;Mohammed J. Zaki
Affiliations:
Department of Computer Science, Rensselaer Polytechnic Institute, Troy,;Department of Computer Science, Rensselaer Polytechnic Institute, Troy,;Chulabhorn Research Institute, Laksi, Bangkok, Thailand;Department of Computer Science, Rensselaer Polytechnic Institute, Troy,
Venue:
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Year:
2009

Abstract

We propose a new method, called SimClus, for clustering with a lower bound on similarity. Instead of taking k, the number of clusters to find, as input, this similarity-based approach imposes a lower bound on the similarity between each object and its cluster representative (with one representative per cluster). SimClus achieves an O(log n) approximation bound on the number of clusters, whereas for the best previous algorithm the bound can be as poor as O(n). Experiments on real and synthetic datasets show that our algorithm produces more than 40% fewer representative objects, yet offers the same or better clustering quality. We also propose a dynamic variant of the algorithm, which can be effectively used in an on-line setting.
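
To make the problem setting concrete, the following is a minimal Python sketch of clustering under a similarity lower bound. It is not SimClus itself, whose details the abstract does not give. It swaps in the standard greedy set-cover heuristic, which repeatedly picks the object that covers the most still-uncovered objects. The names greedy_representatives, assign, sim, and threshold are hypothetical, and the sketch assumes that every object is at least as similar to itself as to any other object.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_representatives(
    objects: Sequence[T],
    sim: Callable[[T, T], float],
    threshold: float,
) -> List[int]:
    """Pick representative indices so every object meets the similarity bound.

    Illustrative greedy set-cover heuristic, not the authors' SimClus.
    """
    n = len(objects)
    # covers[i]: indices whose similarity to object i is at least the bound.
    # Each object is treated as covering itself, which guarantees termination.
    covers = [
        {i} | {j for j in range(n) if sim(objects[i], objects[j]) >= threshold}
        for i in range(n)
    ]
    uncovered = set(range(n))
    reps: List[int] = []
    while uncovered:
        # Choose the object that covers the most still-uncovered objects.
        best = max(range(n), key=lambda i: len(covers[i] & uncovered))
        reps.append(best)
        uncovered -= covers[best]
    return reps


def assign(
    objects: Sequence[T],
    reps: Sequence[int],
    sim: Callable[[T, T], float],
) -> Dict[int, int]:
    """Map each object index to the index of its most similar representative."""
    return {
        j: max(reps, key=lambda r: sim(objects[r], objects[j]))
        for j in range(len(objects))
    }


if __name__ == "__main__":
    # Hypothetical 1-D data; similarity decays with absolute distance.
    points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
    reps = greedy_representatives(points, similarity, threshold=0.8)
    print(reps, assign(points, reps, similarity))
```

The number of clusters is an output here rather than an input: raising threshold generally forces more representatives, and lowering it allows fewer.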