Exact algorithms for minimum sum-of-squares clustering

  • Authors: Daniel Aloise
  • Affiliations: Ecole Polytechnique, Montreal (Canada)
  • Venue: Exact algorithms for minimum sum-of-squares clustering
  • Year: 2009

Abstract

Minimum sum-of-squares clustering (MSSC) is the following problem: given a set of n entities associated with points in s-dimensional Euclidean space, partition this set into k clusters so that the sum of squared distances from each entity to the centroid of its cluster is minimum. This much-studied problem is a basic one in cluster analysis and has applications in numerous and diverse fields. Many heuristic algorithms for MSSC have been, and continue to be, regularly proposed. Exact solution methods are rare, but a variety of approaches have been explored.

The first chapter of the thesis concerns the complexity analysis of MSSC, a topic in which there seems to have been much confusion. Indeed, we note that several dozen papers have made incorrect or unjustified statements about the NP-hardness of MSSC, usually confusing it with some other clustering problem. Recently, a proof was proposed by Drineas, Frieze, Kannan, Vempala and Vinay (Machine Learning, 2004). Unfortunately, as shown in this chapter, this proof is not correct. An alternate short proof, due to Amit Deshpande and Preyas Popat, is then provided.

The next three chapters of the thesis consider three of the main approaches to the exact solution of MSSC. In Chapter 2, we study a recent paper of Sherali and Desai (Journal of Global Optimization, 2005). In this paper, the authors proposed a reformulation-linearization based branch-and-bound algorithm for this problem, claiming to solve instances with up to 1000 points. We investigated their method in further detail, reproducing some of their computational experiments. However, our computational times turned out to be drastically larger. Indeed, for two data sets from the literature, only instances with up to 20 points could be solved in less than 10 hours of computer time. Possible reasons for this discrepancy are discussed. The effect of a symmetry-breaking rule due to Plastria (European Journal of Operational Research, 2002), and of the introduction of valid inequalities of the convex hull of points in two dimensions which may belong to each cluster, is also explored.

In Chapter 3, we study the work of Peng and Xia (Studies in Fuzziness and Soft Computing, 2005) on a 0-1 semidefinite programming (0-1 SDP) reformulation of MSSC. In view of the rapid increase in size of the set of constraints in their model, the authors only sketched an algorithm to solve the problem exactly. We then developed a branch-and-cut algorithm along those lines, adding only sets of violated constraints. The algorithm obtains exact solutions with computing times comparable to those of the best exact method previously found in the literature.

Finally, Chapter 4 is devoted to the column generation approach of du Merle, Hansen, Jaumard and Mladenović (SIAM Journal on Scientific Computing, 2000) and its improvements. The bottleneck of that algorithm is the solution of the auxiliary problem of finding a column with negative reduced cost. We propose a new way to solve this auxiliary problem based on geometric arguments. This greatly improves the efficiency of the whole algorithm and leads to the exact solution of instances in the plane with up to n = 2392 entities and k ≥ 2 clusters, i.e., more than 10 times as many entities as previously solved. Moreover, instances in up to 19 dimensions and with up to n = 2310 entities could be solved exactly when there are many clusters.
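
As a minimal illustration of the MSSC objective defined at the start of the abstract, the following Python sketch evaluates the sum of squared distances from each entity to the centroid of its cluster for a given partition. It only scores a candidate partition and is not one of the exact algorithms studied in the thesis. The function name `mssc_objective`, the label encoding (integers 0 to k-1), and the toy data are assumptions made for illustration.

```python
def mssc_objective(points, labels, k):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    points: list of equal-length coordinate lists (n entities in s dimensions)
    labels: list of cluster indices in range(k), one per point
    """
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            continue  # an empty cluster contributes nothing to the sum
        dim = len(members[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


# Hypothetical toy instance: n = 4 entities in the plane, k = 2 clusters.
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]
good_labels = [0, 0, 1, 1]   # groups nearby points together
poor_labels = [0, 1, 0, 1]   # groups distant points together

print(mssc_objective(toy_points, good_labels, k=2))
print(mssc_objective(toy_points, poor_labels, k=2))
```

An exact method for MSSC must find, among all partitions into k clusters, one that minimizes this value; the sketch above only compares two hand-picked partitions.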