Fitting tree metrics: Hierarchical clustering and Phylogeny

Authors:
Nir Ailon;Moses Charikar
Affiliations:
Princeton University;Princeton University
Venue:
FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science
Year:
2005

Citing 11
Cited 11

On the Approximability of Numerical Taxonomy (Fitting Distances by Tree Metrics)

SIAM Journal on Computing
Rank aggregation methods for the Web

Proceedings of the 10th international conference on World Wide Web
Correlation Clustering

FOCS '02 Proceedings of the 43rd Symposium on Foundations of Computer Science
A tight bound on approximating arbitrary metrics by tree metrics

Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
Clustering with Qualitative Information

FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
Integrating Microarray Data by Consensus Clustering

ICTAI '03 Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence
Clustering Aggregation

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Learning nonsingular phylogenies and hidden Markov models

Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
Aggregating inconsistent information: ranking and clustering

Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
Δ additive and Δ ultra-additive maps, Gromov's trees, and the Farris transform

Discrete Applied Mathematics
Approximating the best-fit tree under Lp norms

APPROX'05/RANDOM'05 Proceedings of the 8th international workshop on Approximation, Randomization and Combinatorial Optimization Problems, and Proceedings of the 9th international conference on Randomization and Computation: algorithms and techniques

Hierarchical mixture models: a probabilistic analysis

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Deterministic pivoting algorithms for constrained ranking and clustering problems

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Aggregation of partial rankings, p-ratings and top-m lists

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Approximation algorithms for embedding general metrics into trees

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Theory research at Google

ACM SIGACT News
Aggregating inconsistent information: Ranking and clustering

Journal of the ACM (JACM)
Linear time approximation schemes for the Gale-Berlekamp game and related minimization problems

Proceedings of the forty-first annual ACM symposium on Theory of computing
Correlation Clustering Revisited: The "True" Cost of Error Minimization Problems

ICALP '09 Proceedings of the 36th International Colloquium on Automata, Languages and Programming: Part I
Deterministic Pivoting Algorithms for Constrained Ranking and Clustering Problems

Mathematics of Operations Research
Deterministic algorithms for rank aggregation and other ranking and clustering problems

WAOA'07 Proceedings of the 5th international conference on Approximation and online algorithms
Fitting Tree Metrics: Hierarchical Clustering and Phylogeny

SIAM Journal on Computing

Abstract

Given dissimilarity data on pairs of objects in a set, we study the problem of fitting a tree metric to this data so as to minimize additive error (i.e., some measure of the difference between the tree metric and the given data). This problem arises in constructing an M-level hierarchical clustering of objects (or an ultrametric on objects) so as to match the given dissimilarity data, a basic problem in statistics. Viewed in this way, the problem is a generalization of the correlation clustering problem (which corresponds to M = 1). We give a very simple randomized combinatorial algorithm for the M-level hierarchical clustering problem that achieves an approximation ratio of M+2. This is a generalization of a previous factor 3 algorithm for correlation clustering on complete graphs. The problem of fitting tree metrics also arises in phylogeny, where the objective is to learn the evolution tree by fitting a tree to dissimilarity data on taxa. The quality of the fit is measured by taking the $\ell_p$ norm of the difference between the tree metric constructed and the given data. Previous results obtained a factor 3 approximation for finding the closest tree metric under the $\ell_\infty$ norm. No non-trivial approximation for general $\ell_p$ norms was known before. We present a novel LP formulation for this problem and obtain an $O((\log n \log \log n)^{1/p})$ approximation using this. En route, we obtain an $O((\log n \log \log n)^{1/p})$ approximation for the closest ultrametric under the $\ell_p$ norm. Our techniques are based on representing and viewing an ultrametric as a hierarchy of clusterings, and may be useful in other contexts.
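To make the objective in the abstract concrete, the following is a minimal sketch of how an ultrametric can be read off a hierarchy of nested clusterings and how its fit to dissimilarity data can be scored under an $\ell_p$ norm. It is not the paper's algorithm; it only illustrates the quantity being minimized. The function names, the dictionary-based input format, and the convention that the distance between two objects equals the number of levels at which they are separated (so distances lie in 0..M) are assumptions made for this illustration and may differ from the paper's normalization.

```python
from itertools import combinations

def ultrametric_from_hierarchy(levels):
    """Build pairwise distances from M nested clusterings.

    `levels` is a list of dicts (point -> cluster label), finest first;
    each level is assumed to coarsen the previous one (not checked here).
    Assumed convention: d(i, j) = number of levels separating i and j.
    """
    points = sorted(levels[0])
    return {
        (i, j): sum(1 for labels in levels if labels[i] != labels[j])
        for i, j in combinations(points, 2)
    }

def is_ultrametric(d, points):
    """Three-point condition: in every triple the two largest distances tie."""
    for i, j, k in combinations(sorted(points), 3):
        _, mid, top = sorted([d[(i, j)], d[(i, k)], d[(j, k)]])
        if mid != top:
            return False
    return True

def lp_error(d, data, p):
    """l_p norm of the difference between fitted distances and the data."""
    diffs = [abs(d[pair] - data[pair]) for pair in data]
    if p == float("inf"):
        return max(diffs)
    return sum(x ** p for x in diffs) ** (1.0 / p)

# Hypothetical example with M = 2 levels on four objects.
levels = [
    {"a": 0, "b": 0, "c": 1, "d": 2},  # finest clustering
    {"a": 0, "b": 0, "c": 0, "d": 1},  # coarser clustering
]
fitted = ultrametric_from_hierarchy(levels)
data = {
    ("a", "b"): 0, ("a", "c"): 2, ("a", "d"): 2,
    ("b", "c"): 1, ("b", "d"): 1, ("c", "d"): 2,
}
print(is_ultrametric(fitted, "abcd"), lp_error(fitted, data, 1))
```

With a single level (M = 1) and 0/1 dissimilarity data, the $\ell_1$ error under this convention counts the pairs on which the clustering disagrees with the data, which is the correlation clustering objective on complete graphs.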