BIRCH: A New Data Clustering Algorithm and Its Applications

  • Authors:
  • Tian Zhang; Raghu Ramakrishnan; Miron Livny

  • Affiliations:
  • Computer Sciences Department, University of Wisconsin, Madison, WI 53706, U.S.A. E-mail: zhang@cs.wisc.edu, raghu@cs.wisc.edu, miron@cs.wisc.edu

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 1997

Abstract

Data clustering is an important technique for exploratory data analysis, and has been studied for several years. It has been shown to be useful in many practical domains such as data classification and image processing. Recently, there has been a growing emphasis on exploratory analysis of very large datasets to discover useful patterns and/or correlations among attributes. This is called data mining, and data clustering is regarded as a particular branch. However, existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (e.g., memory and CPU cycles). So as the dataset size increases, they do not scale up well in terms of memory requirement, running time, and result quality. In this paper, an efficient and scalable data clustering method is proposed, based on a new in-memory data structure called CF-tree, which serves as an in-memory summary of the data distribution. We have implemented it in a system called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and studied its performance extensively in terms of memory requirements, running time, clustering quality, stability and scalability; we also compare it with other available methods. Finally, BIRCH is applied to solve two real-life problems: one is building an iterative and interactive pixel classification tool, and the other is generating the initial codebook for image compression.
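
The abstract does not spell out what a CF-tree stores. In the full BIRCH paper, each tree entry is a clustering feature: a triple (N, LS, SS) holding the point count, the per-dimension linear sum, and the sum of squared norms of the points it summarizes. The sketch below is an illustration only, not the authors' implementation; the class and method names are hypothetical. It shows why such a triple can act as a compact in-memory summary: two summaries merge by component-wise addition, and the centroid and radius can be recovered without revisiting the raw points.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    """Hypothetical container for a (N, LS, SS) summary of a set of points."""
    n: int        # number of points summarized
    ls: list      # linear sum of the points, one value per dimension
    ss: float     # sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        # A single point is summarized by itself.
        return cls(1, list(point), sum(x * x for x in point))

    def merge(self, other):
        # Additivity: the summary of a union is the component-wise sum.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # Root-mean-square distance to the centroid: sqrt(SS/N - ||LS/N||^2).
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


# Usage: summarize two 2-D points, then query the merged summary.
a = ClusteringFeature.from_point([0.0, 0.0])
b = ClusteringFeature.from_point([2.0, 0.0])
merged = a.merge(b)
print(merged.n, merged.centroid(), merged.radius())
```

Because merging touches only the three stored quantities, memory use depends on the number of summaries kept rather than on the number of points seen, which is the property the abstract's scalability claim relies on.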