A divide-and-merge methodology for clustering

Authors:
David Cheng; Ravi Kannan; Santosh Vempala; Grant Wang
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA; Yale University, New Haven, CT; Massachusetts Institute of Technology, Cambridge, MA; Massachusetts Institute of Technology, Cambridge, MA
Venue:
ACM Transactions on Database Systems (TODS)
Year:
2006


Abstract

We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods to construct a hierarchical clustering, or produce a flat clustering using local search (e.g., k-means). For the divide phase, which produces a tree whose leaves are the elements of the set, we suggest an efficient spectral algorithm. When the data is in the form of a sparse document-term matrix, we show how to modify the algorithm so that it maintains sparsity and runs in linear space. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, such as k-means, min-diameter, min-sum, and correlation clustering. We present a thorough experimental evaluation of the methodology. We describe the implementation of a meta-search engine that uses this methodology to cluster results from web searches. We also give comparative empirical results on several real datasets.
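
The merge phase can be illustrated with a small dynamic program over the tree. The sketch below is not the authors' implementation: it assumes the k-means objective, a binary tree written as nested Python tuples whose leaves are integer indices into a list of points, and a hand-built tree standing in for the output of the spectral divide phase. The names `leaves`, `kmeans_cost`, and `merge` are hypothetical. For each node it records the cheapest way to split the leaves below that node into i clusters, where every cluster is the leaf set of some subtree.

```python
def leaves(tree):
    """Return the leaf indices under a node (an int leaf or a (left, right) pair)."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def kmeans_cost(points, idx):
    """Sum of squared distances from the points in idx to their centroid."""
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]
    return sum((points[i][d] - centroid[d]) ** 2 for i in idx for d in range(dim))


def merge(tree, points, k):
    """Return {i: (cost, clusters)} for each feasible i in 1..k.

    clusters is the cheapest partition of the leaves under `tree` into i
    clusters that respects the tree (assumed objective: k-means cost).
    """
    members = leaves(tree)
    # One cluster: keep every leaf under this node together.
    table = {1: (kmeans_cost(points, members), [members])}
    if isinstance(tree, int):
        return table  # a single leaf cannot be split further
    left = merge(tree[0], points, k)
    right = merge(tree[1], points, k)
    for i in range(2, k + 1):
        best = None
        # Give j clusters to the left subtree and i - j to the right one.
        for j in range(1, i):
            if j in left and (i - j) in right:
                cost = left[j][0] + right[i - j][0]
                if best is None or cost < best[0]:
                    best = (cost, left[j][1] + right[i - j][1])
        if best is not None:
            table[i] = best
    return table


# Hand-built example: five points on a line and an assumed divide-phase tree.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
tree = (((0, 1), (2, 3)), 4)
cost, clusters = merge(tree, points, 3)[3]
print(cost, clusters)
```

With this tree, asking for three clusters groups points 0 and 1, points 2 and 3, and leaves point 4 alone. The sketch recomputes each cluster cost from scratch for clarity; a practical version would likely cache per-node sums, and other objectives would need a different `kmeans_cost` and possibly a different way of combining child costs (for example, a maximum rather than a sum for min-diameter).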