Clustering binary data streams with K-means

Authors:
Carlos Ordonez
Affiliations:
Teradata, a division of NCR, San Diego, CA
Venue:
DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Year:
2003

Abstract

Clustering data streams is an interesting data mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams: On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant that finds higher-quality solutions in less time. Higher-quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
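
To make the ideas in the abstract concrete, the following is a minimal sketch of a generic on-line K-means update over sparse binary points. It is not the article's Incremental K-means, and it does not reproduce the article's initialization. The class name `OnlineBinaryKMeans`, its methods, and the choice to count each initial centroid as one pseudo-point are assumptions made for illustration. The sketch relies on one property that holds for any binary value, x squared equals x, so a per-cluster count and a per-dimension sum are enough to recover the centroid. Each point is given as the set of its nonzero dimensions.

```python
from typing import List, Set, Tuple


class OnlineBinaryKMeans:
    """Hypothetical helper: on-line K-means over sparse binary points."""

    def __init__(self, initial_centroids: List[List[float]]):
        self.k = len(initial_centroids)
        self.d = len(initial_centroids[0])
        # Sufficient statistics per cluster: point count N and per-dimension sum M.
        # Assumption: each initial centroid counts as one pseudo-point.
        self.N = [1.0] * self.k
        self.M = [list(c) for c in initial_centroids]

    def centroid(self, j: int) -> List[float]:
        # The centroid is the per-dimension sum divided by the point count.
        return [m / self.N[j] for m in self.M[j]]

    def _sq_dist(self, items: Set[int], j: int) -> float:
        # For binary x: ||x - c||^2 = ||c||^2 + sum over nonzero i of (1 - 2*c_i).
        # Only the nonzero entries of the point enter the second term.
        # ||c||^2 is recomputed here for clarity; it could be cached.
        c = self.centroid(j)
        base = sum(v * v for v in c)
        return base + sum(1.0 - 2.0 * c[i] for i in items)

    def update(self, items: Set[int]) -> int:
        # Assign the point to the nearest centroid, then fold it into
        # that cluster's statistics. The point is read only once.
        j = min(range(self.k), key=lambda q: self._sq_dist(items, q))
        self.N[j] += 1
        for i in items:
            self.M[j][i] += 1
        return j

    def summary(self) -> List[Tuple[float, List[float]]]:
        # Summary table: (cluster weight, centroid) for each cluster.
        total = sum(self.N)
        return [(self.N[j] / total, self.centroid(j)) for j in range(self.k)]
```

Calling `update({0, 1})` on a model built from centroids such as `[[1, 1, 0, 0], [0, 0, 1, 1]]` assigns the transaction to a cluster and adjusts that cluster's count and sums in place, and `summary()` can be read at any time while the stream is being processed.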