Efficient Clustering of Web-Derived Data Sets

  • Authors:
  • Luís Sarmento; Alexander Kehlenbeck; Eugénio Oliveira; Lyle Ungar

  • Affiliations:
  • Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal 4200-465; Google Inc, New York, NY, USA; Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal 4200-465; University of Pennsylvania - CS, Philadelphia, USA

  • Venue:
  • MLDM '09: Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
  • Year:
  • 2009

Abstract

Many data sets derived from the web are large, high-dimensional, and sparse, and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from fragmentation, in which large classes are incorrectly divided into many smaller clusters, and their computational efficiency drops significantly. We present a new clustering algorithm based on connected components that addresses these issues and therefore works well on web-type data.
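
The abstract does not give the details of the algorithm, so the following is only a generic illustration of clustering by connected components, not the authors' method. It assumes sparse items stored as feature-to-weight dictionaries, cosine similarity as the link criterion, and a fixed threshold; the names `cosine`, `find`, `cluster_by_components`, and `threshold` are hypothetical. It also compares all pairs, which a scalable method would need to avoid.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_by_components(items, threshold=0.4):
    """Link items whose similarity meets the threshold (an assumed criterion),
    then return the connected components of that graph as lists of indices."""
    parent = list(range(len(items)))
    # All-pairs comparison: quadratic, kept only for clarity in this sketch.
    for i, j in combinations(range(len(items)), 2):
        if cosine(items[i], items[j]) >= threshold:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[rj] = ri  # merge the two components
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Toy data: items 0-1 and 1-2 each share one feature; item 3 shares none.
docs = [
    {"apple": 1.0, "fruit": 1.0},
    {"apple": 1.0, "pie": 1.0},
    {"pie": 1.0, "crust": 1.0},
    {"python": 1.0, "code": 1.0},
]
print(cluster_by_components(docs, threshold=0.4))
```

In this toy example, items 0 and 2 have no feature in common but still end up in the same cluster because both link to item 1. That transitive merging is what lets a connected-components approach keep a large class together rather than fragmenting it.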