Clustering in massive data sets

Authors: Fionn Murtagh
Affiliations: School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland
Venue: Handbook of massive data sets
Year: 2002

Abstract

We review the time and storage costs of search and clustering algorithms, illustrating them with case studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s very often remain topical. More recent work is also covered, including a solution to the statistical question of how many clusters there are in a dataset. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.