The Anatomy of a Hierarchical Clustering Engine for Web-page, News and Book Snippets

Authors: Paolo Ferragina; Antonio Gulli
Affiliations: Università di Pisa, Italy; Università di Pisa, Italy
Venue: ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Year: 2004

Cited by 7:

Graph Visualization Techniques for Web Clustering Engines. IEEE Transactions on Visualization and Computer Graphics.
An integrated system for building enterprise taxonomies. Information Retrieval.
A Method for Automatic Text Categorization Using Word Sense Disambiguation. ICCSA '08 Proceedings of the international conference on Computational Science and Its Applications, Part II.
Refining the results of automatic e-textbook construction by clustering. ICWL'05 Proceedings of the 4th international conference on Advances in Web-Based Learning.
A topology-driven approach to the design of web meta-search clustering engines. SOFSEM'05 Proceedings of the 31st international conference on Theory and Practice of Computer Science.
Topic structure mining for document sets using graph-based analysis. DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications.
Selecting labels for news document clusters. NLDB'07 Proceedings of the 12th international conference on Applications of Natural Language to Information Systems.

Abstract

In this paper, we investigate the web snippet hierarchical clustering problem in its full extent by devising an algorithmic solution, and a software prototype called SnakeT (accessible at http://roquefort.di.unipi.it/), that: (1) draws the snippets from 16 Web search engines, the Amazon collection of books a9.com, the news of Google News and the blogs of Blogline; (2) builds the clusters on-the-fly (ephemeral clustering) in response to a user query, without relying on any predefined organization into categories; (3) labels the clusters with sentences of variable length, drawn from the snippets and possibly missing some terms, provided the missing terms are not too many;