A scalable algorithm for high-quality clustering of web snippets

Authors:
Filippo Geraci;Marco Pellegrini;Paolo Pisati;Fabrizio Sebastiani
Affiliations:
Consiglio Nazionale delle Ricerche, Pisa, Italy and Università di Siena, Via Roma, Siena, Italy;Consiglio Nazionale delle Ricerche, Pisa, Italy;Consiglio Nazionale delle Ricerche, Pisa, Italy;Università di Padova, Padova, Italy
Venue:
Proceedings of the 2006 ACM symposium on Applied computing
Year:
2006

Abstract

We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangle inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical iterative k-means algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does so in a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes traditional, poorly scalable clustering methods unsuitable.
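
The abstract names the technique but does not spell out the filtering rule, so the following is only a minimal sketch of furthest-point-first k-center clustering with one plausible triangle-inequality filter, not the authors' implementation. It assumes Euclidean points, an arbitrary first center (index 0), and this skip rule: if a point p currently belongs to center c and d(c, c_new) >= 2 d(p, c), then d(p, c_new) >= d(c, c_new) - d(p, c) >= d(p, c), so p cannot move to the new center and its distance need not be computed. The names `fpf_filtered` and `euclidean` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def fpf_filtered(points, k, dist=euclidean):
    """Furthest-point-first k-center clustering with a triangle-inequality filter.

    Returns (centers, assign, evals), where centers holds point indices,
    assign[i] is the position in centers of the center nearest to point i,
    and evals counts how many distance computations were performed.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(points)")

    centers = [0]                      # assumption: first center is arbitrary
    assign = [0] * n                   # every point starts in cluster 0
    near = [dist(points[0], p) for p in points]  # distance to nearest center
    evals = n

    while len(centers) < k:
        # Next center: the point furthest from its current nearest center.
        new = max(range(n), key=near.__getitem__)
        # Distances from the new center to every existing center.
        to_new = [dist(points[c], points[new]) for c in centers]
        evals += len(centers)
        new_pos = len(centers)
        centers.append(new)

        for i in range(n):
            if i == new:
                assign[i], near[i] = new_pos, 0.0
                continue
            # Filter: d(c, c_new) >= 2 * d(p, c) implies d(p, c_new) >= d(p, c),
            # so point i keeps its current center and no distance is computed.
            if to_new[assign[i]] >= 2.0 * near[i]:
                continue
            d = dist(points[i], points[new])
            evals += 1
            if d < near[i]:
                assign[i], near[i] = new_pos, d

    return centers, assign, evals
```

The filter changes only how many distances are evaluated, not the resulting partition: every skipped point is provably no closer to the new center than to its current one. The savings grow when clusters are well separated, since most points then fail the test against a distant new center.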