Extending fuzzy and probabilistic clustering to very large data sets

Authors:
Richard J. Hathaway; James C. Bezdek
Affiliations:
Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA; Department of Computer Sciences, University of West Florida, Pensacola, FL 32514, USA
Venue:
Computational Statistics & Data Analysis
Year:
2006

Abstract

Approximating clusters in very large (VL = unloadable) data sets has been considered from many angles. The proposed approach has three basic steps: (i) progressive sampling of the VL data, terminated when a sample passes a statistical goodness-of-fit test; (ii) clustering the sample with a literal (or exact) algorithm; and (iii) non-iterative extension of the literal clusters to the remainder of the data set. Extension accelerates clustering on all (loadable) data sets. More importantly, extension provides feasibility, that is, a way to find (approximate) clusters, for data sets that are too large to be loaded into the primary memory of a single computer. A good generalized sampling and extension scheme should be effective for acceleration and feasibility using any extensible clustering algorithm. A general method for progressive sampling in VL sets of feature vectors is developed, and examples are given that show how to extend the literal fuzzy (c-means) and probabilistic (expectation-maximization) clustering algorithms onto VL data. The fuzzy extension is called the generalized extensible fast fuzzy c-means (geFFCM) algorithm and is illustrated using several experiments with mixtures of five-dimensional normal distributions.
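
The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' geFFCM implementation. The function names (`progressive_sample`, `fcm`, `extend`) are hypothetical. The abstract does not say which goodness-of-fit test is used, so a caller-supplied predicate `passes_test` stands in for it. The doubling schedule and the redrawing of a fresh sample at each size are also simplifications, and an in-memory list plays the role of the VL data.

```python
import random

def sq_dist(x, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def memberships(x, centers, m=2.0):
    """Standard fuzzy c-means memberships of one point; they sum to 1."""
    d2 = [max(sq_dist(x, v), 1e-12) for v in centers]  # guard against d = 0
    inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
    total = sum(inv)
    return [w / total for w in inv]

def progressive_sample(data, passes_test, start=100, growth=2, seed=0):
    """Step (i): grow the sample until the (assumed) test accepts it."""
    rng = random.Random(seed)
    n = start
    while True:
        n = min(n, len(data))
        sample = rng.sample(data, n)  # simplification: fresh draw each round
        if n == len(data) or passes_test(sample):
            return sample
        n *= growth

def fcm(sample, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Step (ii): literal fuzzy c-means on the loadable sample."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(sample, c)]
    dim = len(sample[0])
    for _ in range(n_iter):
        num = [[0.0] * dim for _ in range(c)]
        den = [0.0] * c
        for x in sample:
            u = memberships(x, centers, m)
            for i in range(c):
                w = u[i] ** m
                den[i] += w
                for j in range(dim):
                    num[i][j] += w * x[j]
        new = [[num[i][j] / den[i] for j in range(dim)] for i in range(c)]
        shift = max(sq_dist(a, b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:  # stop once no center moves appreciably
            break
    return centers

def extend(chunks, centers, m=2.0):
    """Step (iii): one non-iterative pass over the rest of the data.

    Centers stay fixed; each chunk is loaded, labeled, and released.
    """
    for chunk in chunks:
        yield [memberships(x, centers, m) for x in chunk]
```

Because `extend` never updates the centers, each chunk of the full data set needs to be read only once, which is what would make the approach feasible when the data cannot be held in memory. In a real VL setting the chunks would be streamed from disk instead of being sliced from a list.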