Optimal Time Bounds for Approximate Clustering
Machine Learning
Performance guarantees for hierarchical clustering
Journal of Computer and System Sciences - Special issue on COLT 2002
Optimal combinatorial electricity markets
Web Intelligence and Agent Systems
Optimal time bounds for approximate clustering
UAI'02 Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence
Hi-index | 0.00 |
Given a set of n points and their pairwise distances, the goal of clustering is to partition the points into a “small” number of “related” sets. Clustering algorithms are used widely to manage, classify, and summarize many kinds of data. In this dissertation, we study the classic facility location and k-median problems in the context of clustering, and formulate and study a new optimization problem that we call the online median problem. For each of these problems, it is known to be NP -hard to compute a solution with cost less than a certain constant factor times the optimal cost. We give simple constant-factor approximation algorithms for the facility location, k-median, and online median problems with optimal or nearoptimal time bounds. We also study distance functions that are “approximately” metric, and show that such distance functions allow us to obtain a faster online median algorithm and to generalize our analysis to other objective functions, such as that of the well-known k-means heuristic. Given n points, the associated interpoint distances and nonnegative point weights, and a nonnegative penalty for each point, the facility location problem asks us to identify a set of cluster centers so that the weighted average cluster radii and the sum of the cluster center penalties are both minimized. The k-median problem asks us to identify exactly k cluster centers while minimizing just the weighted average cluster radii. We give a simple greedy algorithm for the facility location problem that runs in O(n 2) time and produces a solution with cost at most 3 times optimal. For the k-median problem, we develop and make use of a sampling technique that we call successive sampling, and give a randomized constant-factor approximation algorithm that runs in O( n(k + log n + log2 n)) time. We also give an Ω(nk) lower bound on the running time of any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible constant probability. In many settings, it is desirable to browse a given data set at differing levels of granularity (i.e., number of clusters). To address this concern, we formulate a generalization of the k-median problem that we call the online median problem. The online median problem asks us to compute an ordering of the points so that, over all i, when a prefix of length i is taken as a set of cluster centers, the weighted average radii of the induced clusters is minimized. We show that a natural generalization of the greedy strategy that we call hierarchically greedy yields an algorithm that produces an ordering such that every prefix of the ordering is within a constant factor of the associated optimal cost. Furthermore, our algorithm has a running time of Θ(n 2). Finally, we study the performance of our algorithms in practice. We present implementations of our k-median and online median algorithms; our experimental results indicate that our approximation algorithms may be useful in practice.