Learning structure and concepts in data through data clustering

  • Authors:
  • Gregory James Hamerly; Charles P. Elkan

  • Venue:
  • Learning structure and concepts in data through data clustering
  • Year:
  • 2003

Abstract

Data clustering is an important, application-oriented branch of machine learning. Its goal is to estimate the structure or density of a data set without a training signal. Because clustering algorithms serve a wide range of applications, there are many approaches that vary in complexity and effectiveness. The explosive growth in the amount of data that people want to analyze makes fast (e.g., linear-time) algorithms necessary, but such algorithms often give poor-quality results. We show modifications that improve clustering algorithms in two ways while maintaining the runtime characteristics of the fast algorithms. The first focus is on finding better solutions for a fixed number of clusters: we decompose the algorithms into fundamental parts and analyze how each part affects the quality of clustering solutions. The second focus is on estimating the number of clusters efficiently using statistical hypothesis tests, and on how this approach may be applied in novel ways. We also discuss the application of data clustering to the task of learning the structure of computer programs. We show how clustering may be used to improve the accuracy of computer processor simulations while simultaneously improving their efficiency.