Statistical grid-based clustering over data streams

Authors:
Nam Hun Park;Won Suk Lee
Affiliations:
Yonsei University, Seoul, Korea;Yonsei University, Seoul, Korea
Venue:
ACM SIGMOD Record
Year:
2004

Abstract

A data stream is a massive, unbounded sequence of data elements generated continuously at a rapid rate. For this reason, most algorithms for data streams trade some accuracy in their results for fast processing. Processing time depends heavily on the amount of information that must be maintained. This paper proposes a statistical grid-based approach to clustering the data elements of a data stream. Initially, the multidimensional data space of the stream is partitioned into a set of mutually exclusive, equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. The dense region of each initial cell is partitioned recursively until it is reduced to the smallest cell, called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. To keep the number of cells small, a sparse intermediate or unit cell is pruned when its support falls far below a minimum support. Furthermore, to confine memory usage, the size of a unit cell is dynamically minimized so that the clustering result is as accurate as possible. The proposed algorithm is evaluated through a series of experiments that identify its characteristics.
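The cell-splitting idea in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation. The names `Cell`, `GridClusterer1D`, `unit_width`, `split_support` and `min_support` are hypothetical. Splitting a dense cell at its running mean is only one plausible statistic-based cut and is not claimed to be one of the paper's three partitioning methods. Giving each child half of the parent's count is a crude assumption, and pruning and dynamic unit-cell sizing are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    lo: float
    hi: float
    count: float = 0.0  # elements seen (an estimate after a split)
    total: float = 0.0  # running sum of values, used for the mean

    @property
    def width(self) -> float:
        return self.hi - self.lo


class GridClusterer1D:
    """Hypothetical 1-D illustration of support-driven cell splitting."""

    def __init__(self, lo, hi, n_initial, unit_width, split_support, min_support):
        step = (hi - lo) / n_initial
        # Mutually exclusive, equal-size initial cells, kept sorted by position.
        self.cells = [Cell(lo + i * step, lo + (i + 1) * step) for i in range(n_initial)]
        self.unit_width = unit_width
        self.split_support = split_support
        self.min_support = min_support
        self.n = 0  # number of in-range elements seen so far

    def is_unit(self, c):
        # Assumption: a cell too narrow to yield two children of at least
        # unit_width each is treated as a unit cell.
        return c.width < 2 * self.unit_width

    def support(self, c):
        return c.count / self.n if self.n else 0.0

    def insert(self, x):
        for i, c in enumerate(self.cells):
            if c.lo <= x < c.hi:
                self.n += 1
                c.count += 1
                c.total += x
                if not self.is_unit(c) and self.support(c) >= self.split_support:
                    self.cells[i:i + 1] = self._split(c)
                return
        # Out-of-range elements are ignored in this sketch.

    def _split(self, c):
        # Cut at the running mean, clamped so both children keep unit_width.
        mean = c.total / c.count
        cut = min(max(mean, c.lo + self.unit_width), c.hi - self.unit_width)
        half = c.count / 2  # assumption: the parent's mass is shared equally
        left = Cell(c.lo, cut, half, half * (c.lo + cut) / 2)
        right = Cell(cut, c.hi, half, half * (cut + c.hi) / 2)
        return [left, right]

    def clusters(self):
        # A cluster is a maximal run of adjacent dense unit cells.
        found, run = [], None
        for c in self.cells:
            if self.is_unit(c) and self.support(c) >= self.min_support:
                run = (run[0], c.hi) if run else (c.lo, c.hi)
            else:
                if run:
                    found.append(run)
                run = None
        if run:
            found.append(run)
        return found


# Usage: a stream alternating between two hot spots.
g = GridClusterer1D(0.0, 8.0, n_initial=2, unit_width=1.0,
                    split_support=0.3, min_support=0.2)
for x in [1.5, 6.5] * 20:
    g.insert(x)
print(g.clusters())  # [(1.5, 2.5), (6.5, 8.0)]
```

Because the cut is placed at the mean, a hot spot tends to end up on a cell boundary, as the value 1.5 does here. A cut offset from the mean, for example by a multiple of the standard deviation, would require tracking a running sum of squares as well.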