The Anatomy of a Hierarchical Clustering Engine for Web-page, News and Book Snippets

  • Authors:
  • Paolo Ferragina;Antonio Gulli

  • Affiliations:
  • Università di Pisa, Italy;Università di Pisa, Italy

  • Venue:
  • ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we investigate the web snippet hierarchical clustering problem in its full extent by devising an algorithmic solution, and a software prototype called SnakeT (accessible at http://roquefort.di.unipi.it/), that: (1) draws the snippets from 16 Web search engines, the Amazon collection of books a9.com, the news of Google News and the blogs of Blogline; (2) builds the clusters on-the-fly (ephemeral clustering) in response to a user query without adopting any pre-defined organization in categories; (3) labels the clusters with sentences of variable length, drawn from the snippets and possibly missing some terms, provided they are not too many;