Optimising sum-of-squares measures for clustering multisets defined over a metric space

  • Authors: George Kettleborough; V. J. Rayward-Smith

  • Venue: Discrete Applied Mathematics
  • Year: 2013

Abstract

Clustering is the problem of dividing a dataset into subsets, called clusters, that are both homogeneous and well separated. Many criteria have been devised that measure both of these properties simultaneously. Two such criteria are centroid-distance, used by the popular k-means algorithm, and the sum of all squared intra-cluster distances, which we call all-squares. This paper compares these two criteria in the context of clustering multisets defined over a metric space. We show that optimal clusterings according to both criteria can be consistent, meaning that identical elements belong to the same cluster, but that while centroid-distance always produces linearly separable solutions, all-squares does not. It has recently been shown that finding optimal clusterings according to centroid-distance in Euclidean space is NP-hard. We show that the decision problems associated with both optimisation problems are NP-complete in a simple, three-valued metric space, and that the all-squares decision problem remains NP-complete in Euclidean space. We also show that if the metric is the simple 0/1 metric, then both problems are in P. Finally, we introduce a new metric on clusterings, based on the earth mover's distance and called the assignment metric, and use it to show that optimal clusterings according to the two criteria can be as different as two clusterings can possibly be under both our metric and the well-known variation of information metric.
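
To make the two criteria concrete, the following is a minimal sketch rather than the paper's own formulation. It assumes points in Euclidean space (so that a centroid exists), takes centroid-distance to be the usual k-means objective (sum of squared distances from each point to its cluster mean), and takes all-squares to be the sum of squared distances over unordered pairs within each cluster. The abstract does not state the exact normalisation, so these choices, and the function names `centroid_distance` and `all_squares`, are assumptions made for illustration. Clusters are plain lists, so repeated elements of a multiset simply appear more than once.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_distance(clustering):
    """Assumed k-means objective: squared distances to each cluster's centroid."""
    total = 0.0
    for cluster in clustering:
        c = centroid(cluster)
        total += sum(sq_dist(p, c) for p in cluster)
    return total


def all_squares(clustering):
    """Assumed all-squares objective: squared distances over unordered
    intra-cluster pairs. Duplicates in a multiset contribute zero to each other."""
    return sum(
        sq_dist(p, q)
        for cluster in clustering
        for p, q in combinations(cluster, 2)
    )


# A small multiset on the real line: the element 0 occurs twice.
clustering = [[(0.0,), (0.0,), (2.0,)], [(10.0,)]]

print(centroid_distance(clustering))
print(all_squares(clustering))
```

Under these assumed definitions, the Euclidean case satisfies a standard identity: within a single cluster, the pairwise sum equals the cluster size times the centroid sum. The two criteria therefore weight clusters differently by size, which is one way to see why their optimal clusterings can disagree.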