BIRCH: A New Data Clustering Algorithm and Its Applications

Authors:
Tian Zhang;Raghu Ramakrishnan;Miron Livny
Affiliations:
Computer Sciences Department, University of Wisconsin, Madison, WI 53706, U.S.A. E-mail: zhang@cs.wisc.edu, raghu@cs.wisc.edu, miron@cs.wisc.edu
Venue:
Data Mining and Knowledge Discovery
Year:
1997

Abstract

Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
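
The abstract does not spell out what a CF-tree stores. As a rough illustration only, the sketch below assumes the usual description of a BIRCH clustering feature as the triple (N, LS, SS): the point count, the per-dimension linear sum, and the sum of squared norms. The class and method names are hypothetical and are not taken from the paper or from any library. The sketch shows why such a summary fits in memory: two summaries merge by simple addition, and the centroid and radius can be recovered without revisiting the raw points.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Hypothetical summary of a group of points: (N, LS, SS)."""
    n: int            # number of points summarized
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        # A single point is a summary with n = 1.
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Summaries are additive, so merging never touches raw data.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        # Root-mean-square distance of the summarized points to the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


cf = ClusteringFeature.from_point((0.0, 0.0)).merge(
    ClusteringFeature.from_point((2.0, 0.0))
)
```

In this sketch, a tree node would hold a bounded number of such summaries, which is what would let the structure stay within a fixed memory budget as the dataset grows. The tree layout itself is not described in the abstract and is not modeled here.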