FREM: fast and robust EM clustering for large data sets

Authors:
Carlos Ordonez;Edward Omiecinski
Affiliations:
Teradata, a division of NCR, San Diego, CA;Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the eleventh international conference on Information and knowledge management
Year:
2002

Citing (references): 27
Cited by: 19

Hierarchical mixtures of experts and the EM algorithm

Neural Computation
Statistical physics, mixtures of distributions, and the EM algorithm

Neural Computation
BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The KDD process for extracting useful knowledge from volumes of data

Communications of the ACM
CURE: an efficient clustering algorithm for large databases

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Automatic subspace clustering of high dimensional data for data mining applications

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A unifying review of linear Gaussian models

Neural Computation
CACTUS—clustering categorical data using summaries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Accelerating exact k-means algorithms with geometric reasoning

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding generalized projected clusters in high dimensional spaces

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
SQLEM: fast clustering in SQL using the EM algorithm

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
SMEM algorithm for mixture models

Proceedings of the 1998 conference on Advances in neural information processing systems II
Scalability for clustering algorithms revisited

ACM SIGKDD Explorations Newsletter
Outlier detection for high dimensional data

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Data bubbles: quality preserving performance boosting for hierarchical clustering

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

Data Mining and Knowledge Discovery
A Fast Algorithm to Cluster High Dimensional Basket Data

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Mining Constrained Association Rules to Predict Heart Disease

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Finding Intensional Knowledge of Distance-Based Outliers

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
C2P: Clustering based on Closest Pairs

Proceedings of the 27th International Conference on Very Large Data Bases
Efficient and Effective Clustering Methods for Spatial Data Mining

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
ROCK: A Robust Clustering Algorithm for Categorical Attributes

ICDE '99 Proceedings of the 15th International Conference on Data Engineering
On Convergence Properties of the EM Algorithm for Gaussian Mixtures

Mining complex databases using the EM algorithm

Pattern Classification (2nd Edition)

On-line EM Algorithm for the Normalized Gaussian Network

Neural Computation

Clustering binary data streams with K-means

DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Horizontal aggregations for building tabular data sets

Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Efficient Disk-Based K-Means Clustering for Relational Databases

IEEE Transactions on Knowledge and Data Engineering
Programming the K-means clustering algorithm in SQL

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Gradual Model Generator for Single-Pass Clustering

ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Integrating K-Means Clustering with a Relational DBMS Using SQL

IEEE Transactions on Knowledge and Data Engineering
Adherence clustering: an efficient method for mining market-basket clusters

Information Systems
Effective document clustering for large heterogeneous law firm collections

ICAIL '05 Proceedings of the 10th international conference on Artificial intelligence and law
Vector and matrix operations programmed with UDFs in a relational DBMS

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Gradual model generator for single-pass clustering

Pattern Recognition
Network anomaly detection with incomplete audit data

Computer Networks: The International Journal of Computer and Telecommunications Networking
A convergence theorem for the fuzzy subspace clustering (FSC) algorithm

Pattern Recognition
Data Set Homeomorphism Transformation Based Meta-clustering

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part III: ICCS 2007
Generalized fuzzy C-means clustering algorithm with improved fuzzy partitions

IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
Adherence clustering: an efficient method for mining market-basket clusters

Information Systems
Legal document clustering with built-in topic segmentation

Proceedings of the 20th ACM international conference on Information and knowledge management
Autonomous and deterministic probabilistic neural network using global k-means

ISNN'06 Proceedings of the Third international conference on Advances in Neural Networks - Volume Part I
Leveraging network structure for incremental document clustering

APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Optimized query-driven appointment routing based on Expectation-Maximization in wireless sensor networks

Journal of Network and Computer Applications


Abstract

Clustering is a fundamental data mining technique. This article presents an improved EM algorithm for clustering large data sets that exhibit high dimensionality, noise, and zero-variance problems. The algorithm incorporates several improvements that increase both solution quality and speed. In general, it can find a good clustering solution in three scans over the data set; alternatively, it can be run until it converges. It has a few parameters that are easy to set and that have default values suitable for most cases. The proposed algorithm is compared against the standard EM algorithm and the On-Line EM algorithm.
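
For orientation, the standard EM baseline named in the abstract can be sketched as follows. This is a minimal 1-D Gaussian-mixture illustration and not the FREM algorithm itself: the function name em_1d, the var_floor parameter, and its default value are assumptions made for this example. The variance floor is one common guard against the zero-variance problem and may differ from what the paper actually does.

```python
import math


def em_1d(xs, means, n_iter=20, var_floor=1e-6):
    """Standard EM for a 1-D Gaussian mixture (baseline sketch, not FREM).

    var_floor is an assumed guard: each variance is clamped to stay positive,
    because a zero variance would make the Gaussian density undefined.
    """
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E step: responsibility of each cluster for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            if total == 0.0:
                # Every density underflowed; the point has no influence this pass.
                resp.append([0.0] * k)
            else:
                resp.append([d / total for d in dens])
        # M step: re-estimate weight, mean, and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # empty cluster: keep its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, var_floor)
            weights[j] = nj / len(xs)
    return weights, means, variances


# Usage: two well-separated groups, with deliberately rough initial means.
weights, means, variances = em_1d([-0.1, 0.0, 0.1, 4.9, 5.0, 5.1], [1.0, 4.0])
```

Each iteration of this sketch makes one full pass over the data, so running it to convergence on a large data set costs many scans. That cost is the motivation for an algorithm that aims at a good solution in about three scans.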