Efficient Clustering of Web-Derived Data Sets

Authors:
Luís Sarmento; Alexander Kehlenbeck; Eugénio Oliveira; Lyle Ungar
Affiliations:
Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal 4200-465; Google Inc, New York, NY, USA; Faculdade de Engenharia da Universidade do Porto - DEI - LIACC, Porto, Portugal 4200-465; University of Pennsylvania - CS, Philadelphia, USA
Venue:
MLDM '09 Proceedings of the 6th International Conference on Machine Learning and Data Mining in Pattern Recognition
Year:
2009

Abstract

Many data sets derived from the web are large, high-dimensional, sparse and have a Zipfian distribution of both classes and features. On such data sets, current scalable clustering methods such as streaming clustering suffer from fragmentation, where large classes are incorrectly divided into many smaller clusters, and computational efficiency drops significantly. We present a new clustering algorithm based on connected components that addresses these issues and so works well on web-type data.
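
The abstract does not describe the algorithm itself, so the following is only a generic sketch of what clustering by connected components can look like, not the authors' method. It assumes each item is a sparse set of features, uses Jaccard similarity as a stand-in measure, and links any two items whose similarity reaches a hypothetical `threshold`; the function names and the threshold value are illustrative choices. The all-pairs comparison is quadratic and is used here only for clarity, so it would not scale to web-sized data as written.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sparse feature sets (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_by_components(items, threshold=0.5):
    """Link items with similarity >= threshold and return the connected
    components of the resulting graph as sorted lists of item indices."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Illustrative all-pairs comparison: O(n^2) similarity computations.
    for i, j in combinations(range(len(items)), 2):
        if jaccard(items[i], items[j]) >= threshold:
            union(i, j)

    # Group item indices by the root of their component.
    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    docs = [
        {"a", "b", "c"},
        {"a", "b", "d"},
        {"b", "d", "e"},
        {"x", "y"},
        {"x", "y", "z"},
    ]
    print(cluster_by_components(docs))
```

In this toy input, items 0 and 2 are not directly similar enough to be linked, but both are linked to item 1, so all three land in one component. That transitive linking is one plausible reason a component-based approach could avoid splitting a large class into fragments.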