Clustering Large Datasets in Arbitrary Metric Spaces

Authors:
Venkatesh Ganti;Raghu Ramakrishnan;Johannes Gehrke;Allison Powell
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison;University of Wisconsin-Madison;University of Virginia at Charlottesville
Venue:
ICDE '99 Proceedings of the 15th International Conference on Data Engineering
Year:
1999

Citing 0
Cited 38

A framework for measuring changes in data characteristics

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Scalable algorithms for mining large databases

KDD '99 Tutorial notes of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering

Journal of Global Optimization
Mining Very Large Databases

Computer
Chameleon: Hierarchical Clustering Using Dynamic Modeling

Computer
Redefining Clustering for High-Dimensional Applications

IEEE Transactions on Knowledge and Data Engineering
Fully Dynamic Clustering of Metric Data Sets

BNCOD 19 Proceedings of the 19th British National Conference on Databases: Advances in Databases
A Visual Method of Cluster Validation with Fastmap

PAKDD '00 Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications
M-FastMap: A Modified FastMap Algorithm for Visual Cluster Validation in Data Mining

PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
COFE: A Scalable Method for Feature Extraction from Complex Objects

DaWaK 2000 Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery
A Human-Computer Interactive Method for Projected Clustering

IEEE Transactions on Knowledge and Data Engineering
Hypergraph Models and Algorithms for Data-Pattern-Based Clustering

Data Mining and Knowledge Discovery
A top-down approach for density-based clustering using multidimensional indexes

Journal of Systems and Software - Special issue: Performance modeling and analysis of computer systems and networks
Clustering in Dynamic Spatial Databases

Journal of Intelligent Information Systems
Antipole Tree Indexing to Support Range Search and K-Nearest Neighbor Search in Metric Spaces

IEEE Transactions on Knowledge and Data Engineering
Exploration of textual document archives using a fuzzy hierarchical clustering algorithm in the GAMBAL system

Information Processing and Management: an International Journal - Special issue: Cross-language information retrieval
Making SVMs Scalable to Large Data Sets using Hierarchical Cluster Indexing

Data Mining and Knowledge Discovery
QROCK: A quick version of the ROCK algorithm for clustering of categorical data

Pattern Recognition Letters
Approximate data mining in very large relational data

ADC '06 Proceedings of the 17th Australasian Database Conference - Volume 49
Data bubbles for non-vector data: speeding-up hierarchical clustering in arbitrary metric spaces

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Semantic peer, here are the neighbors you want!

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Research on Spatial Clustering Acetabuliform Model and Algorithm Based on Mathematical Morphology

ISNN '08 Proceedings of the 5th international symposium on Neural Networks: Advances in Neural Networks, Part II
Image-mapped data clustering: An efficient technique for clustering large data sets

Intelligent Data Analysis
A scalable framework for cluster ensembles

Pattern Recognition
Distributed clustering based on sampling local density estimates

IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
Extending fuzzy and probabilistic clustering to very large data sets

Computational Statistics & Data Analysis
An incremental clustering scheme for data de-duplication

Data Mining and Knowledge Discovery
Agent-based distributed data mining: the KDEC scheme

Intelligent information agents
Information theoretic criteria for community detection

SNAKDD'08 Proceedings of the Second international conference on Advances in social network mining and analysis
A unified multimedia and semantic perspective for data retrieval in the semantic web

Information Systems
Distributed antipole clustering for efficient data search and management in Euclidean and metric spaces

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Measure based metrics for aggregated data

Intelligent Data Analysis
Distributed spatial clustering in sensor networks

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Multiple-Winners randomized tournaments with consensus for optimization problems in generic metric spaces

WEA'05 Proceedings of the 4th international conference on Experimental and Efficient Algorithms
An indexing approach for representing multimedia objects in high-dimensional spaces based on expectation maximization algorithm

MIS'05 Proceedings of the 11th international conference on Advances in Multimedia Information Systems
On approximation algorithms for data mining applications

Efficient Approximation and Online Algorithms
Improved tangent space based distance metric for accurate lithographic hotspot classification

Proceedings of the 49th Annual Design Automation Conference
Knowledge augmentation via incremental clustering: new technology for effective knowledge management

International Journal of Business Information Systems

Abstract

Clustering partitions a collection of objects into groups called clusters, such that similar objects fall into the same group. Similarity between objects is defined by a distance function satisfying the triangle inequality; this distance function along with the collection of objects describes a distance space. In a distance space, the only operation possible on data objects is the computation of distance between them. All scalable algorithms in the literature assume a special type of distance space, namely a k-dimensional vector space, which allows vector operations on objects. We present two scalable algorithms designed for clustering very large datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the first scalable clustering algorithm for data in a distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE by reducing the number of calls to the distance function, which may be computationally very expensive. Both algorithms make only a single scan over the database while producing high clustering quality. In a detailed experimental evaluation, we study both algorithms in terms of scalability and quality of clustering. We also show results of applying the algorithms to a real-life dataset.
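To make the distance-space setting concrete, the following is a minimal sketch of single-scan clustering that touches the data only through a distance function. It is not the BUBBLE or BUBBLE-FM algorithm from the paper. The names `single_scan_cluster`, `clustroid`, and `threshold`, the choice of edit distance on strings, and the rule of representing each cluster by the member with the smallest sum of squared distances to the others are all assumptions made for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric on strings (it satisfies the triangle inequality)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def clustroid(members: List[T], dist: Callable[[T, T], float]) -> T:
    """Hypothetical representative: the member minimizing the sum of squared distances."""
    return min(members, key=lambda c: sum(dist(c, m) ** 2 for m in members))


def single_scan_cluster(objects: List[T],
                        dist: Callable[[T, T], float],
                        threshold: float) -> List[List[T]]:
    """Assign each object, in one pass, to the nearest cluster representative
    if it lies within `threshold`; otherwise start a new cluster.
    Only `dist` is ever applied to the objects (no vector operations)."""
    clusters: List[List[T]] = []
    reps: List[T] = []
    for obj in objects:
        best_idx, best_d = None, None
        for idx, rep in enumerate(reps):
            d = dist(obj, rep)
            if best_d is None or d < best_d:
                best_idx, best_d = idx, d
        if best_idx is not None and best_d <= threshold:
            clusters[best_idx].append(obj)
            # Recomputing the representative costs many distance calls; this
            # is the kind of expense a scalable method has to keep down.
            reps[best_idx] = clustroid(clusters[best_idx], dist)
        else:
            clusters.append([obj])
            reps.append(obj)
    return clusters


words = ["kitten", "sitten", "mitten", "database", "databases", "datbase"]
groups = single_scan_cluster(words, edit_distance, threshold=2)
```

Strings under edit distance are a typical case in which averaging objects into a centroid is undefined, so the representative has to be one of the objects themselves. The sketch keeps every member in memory and recomputes the representative from scratch, which would not scale to the very large datasets the abstract targets; it only illustrates the restriction to distance computations.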