Cluster generation and cluster labelling for web snippets: a fast and accurate hierarchical solution

Authors:
Filippo Geraci;Marco Pellegrini;Marco Maggini;Fabrizio Sebastiani
Affiliations:
Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy;Istituto di Informatica e Telematica, Consiglio Nazionale delle Ricerche, Pisa, Italy;Dipartimento di Ingegneria dell’Informazione, Università di Siena, Siena, Italy;Istituto di Scienza e Tecnologia dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy
Venue:
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
Year:
2006

Citing 15
Cited 16

Elements of information theory

Elements of information theory
Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Web document clustering: a feasibility demonstration

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Sublinear time algorithms for metric space problems

STOC '99 Proceedings of the thirty-first annual ACM symposium on Theory of computing
Deciphering cluster representations

Information Processing and Management: an International Journal
Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Evaluating strategies for similarity search on the web

Proceedings of the 11th international conference on World Wide Web
The effectiveness of query-specific hierarchic clustering in information retrieval

Information Processing and Management: an International Journal
Generating hierarchical summaries for web searches

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Relationship-based clustering and cluster ensembles for high-dimensional data mining

Relationship-based clustering and cluster ensembles for high-dimensional data mining
A hierarchical monothetic document clustering algorithm for summarization and browsing search results

Proceedings of the 13th international conference on World Wide Web
Learning to cluster web search results

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A personalized search engine based on web-snippet hierarchical clustering

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
A scalable algorithm for high-quality clustering of web snippets

Proceedings of the 2006 ACM symposium on Applied computing
Clustering information retrieval search outputs

IRSG'99 Proceedings of the 21st Annual BCS-IRSG conference on Information Retrieval Research

Extraction and classification of dense communities in the web

Proceedings of the 16th international conference on World Wide Web
VISTO: visual storyboard for web video browsing

Proceedings of the 6th ACM international conference on Image and video retrieval
The opposite of smoothing: a language model approach to ranking query-specific document clusters

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Collection Browsing through Automatic Hierarchical Tagging

AH '08 Proceedings of the 5th international conference on Adaptive Hypermedia and Adaptive Web-Based Systems
Dynamic user-defined similarity searching in semi-structured text retrieval

Proceedings of the 3rd international conference on Scalable information systems
A Co-occurrence Based Hierarchical Method for Clustering Web Search Results

WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Web Search Clustering and Labeling with Hidden Topics

ACM Transactions on Asian Language Information Processing (TALIP)
Feature extraction and clustering for dynamic video summarisation

Neurocomputing
Using semantic techniques to access web data

Information Systems
The role of queries in ranking labeled instances extracted from text

COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters
Exploiting user feedback to improve quality of search results clustering

Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication
Nonlinear evidence fusion and propagation for hyponymy relation mining

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
The opposite of smoothing: a language model approach to ranking query-specific document clusters

Journal of Artificial Intelligence Research
Beyond precision@10: clustering the long tail of web search results

Proceedings of the 20th ACM international conference on Information and knowledge management
A transduction-based approach to fuzzy clustering, relevance ranking and cluster label generation on web search results

Journal of Intelligent Information Systems
Search result presentation based on faceted clustering

Proceedings of the 21st ACM international conference on Information and knowledge management

Abstract

This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
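The furthest-point-first (FPF) heuristic for metric k-center clustering mentioned in the abstract can be illustrated with a minimal sketch. This is the textbook greedy procedure, not the "fast version" used in Armil, whose details the abstract does not give. The function name `furthest_point_first`, the Euclidean distance, and the toy 2-D points are assumptions made for illustration only; a real system would plug in a metric defined over snippets.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; stands in for whatever metric is used on snippets."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def furthest_point_first(
    points: Sequence[Point],
    k: int,
    dist: Callable[[Point, Point], float] = euclidean,
) -> Tuple[List[int], List[int]]:
    """Greedy k-center clustering (hypothetical helper, textbook FPF).

    Returns (centers, assignment): `centers` holds indices into `points`,
    and `assignment[i]` is the position in `centers` of the center
    closest to point i.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Assumption: the first point is the initial center (any point works).
    centers = [0]
    nearest = [dist(p, points[0]) for p in points]  # distance to closest center
    assignment = [0] * len(points)

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        new_center = max(range(len(points)), key=nearest.__getitem__)
        centers.append(new_center)
        # Reassign only the points that are closer to the new center.
        for i, p in enumerate(points):
            d = dist(p, points[new_center])
            if d < nearest[i]:
                nearest[i] = d
                assignment[i] = len(centers) - 1

    return centers, assignment


# Toy usage: three visually separated groups of 2-D points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 20.0)]
toy_centers, toy_assignment = furthest_point_first(toy_points, k=3)
print(toy_centers, toy_assignment)
```

Each round costs one pass over the points, so this plain version runs in O(nk) distance computations. It yields disjoint clusters, matching the abstract's description of the output, but it does nothing towards labelling them.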