Generating unambiguous URL clusters from web search

  • Authors: G. Smith; T. Brailsford; C. Donner; D. Hooijmaijers; M. Truran; J. Goulding; H. Ashman

  • Affiliations: University of South Australia; University of Nottingham; University of South Australia; University of South Australia; University of Teesside; University of Nottingham; University of South Australia

  • Venue: Proceedings of the 2009 workshop on Web Search Click Data
  • Year: 2009

Abstract

This paper reports on the generation of unambiguous clusters of URLs from clickthrough data in the MSN search query log excerpt (the RFP 2006 dataset). Selections (clickthroughs) made by a single user in response to a single query can be assumed to have some mutual semantic relevance, and URLs coselected in this way can be aggregated to form single-sense clusters. When the graph for a single term separates into distinct clusters, each cluster can be interpreted as a disambiguated aggregation of URLs, that is, one sense of the term. This principle has previously been tested on smaller and more constrained datasets; this paper reports findings from applying a method based on it to the RFP 2006 dataset. The proposed coselection method for generating single-sense clusters is evaluated against two other methods, with varying parameters. The evaluation combines a human assessment of the quality of the clusters generated by each method with a simple "edit distance" analysis of how much the content of the methods' outputs differs. The main questions addressed are i) whether it is feasible to generate single-sense (sense-coherent) clusters, and ii) whether, in a closed world, it would be feasible to discover ambiguous terms. The experiments showed that sense-coherent clusters were found, and further indicated that ambiguous terms could be detected by observing small overlap between large clusters.
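
The coselection principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the log schema, any thresholds, or the clustering procedure, so the `(session_id, query, url)` record format, the function name `coselection_clusters`, the sample URLs, and the use of plain connected components are all assumptions made for illustration.

```python
from collections import defaultdict


def coselection_clusters(click_log, term):
    """Group URLs clicked for `term` into connected components.

    `click_log` is an assumed format: (session_id, query, url) tuples,
    where one session_id stands for one user issuing one query.
    """
    # Collect the URLs each session selected for the term.
    selections = defaultdict(set)
    for session_id, query, url in click_log:
        if query == term:
            selections[session_id].add(url)

    # Union-find over URLs: URLs coselected in a session are linked.
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for urls in selections.values():
        ordered = sorted(urls)
        find(ordered[0])  # register single-click sessions too
        for other in ordered[1:]:
            union(ordered[0], other)

    # Gather URLs by their component root.
    clusters = defaultdict(set)
    for url in list(parent):
        clusters[find(url)].add(url)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical click log for illustration only.
log = [
    ("s1", "jaguar", "cars.example/jaguar"),
    ("s1", "jaguar", "dealer.example/xk"),
    ("s2", "jaguar", "zoo.example/jaguar"),
    ("s2", "jaguar", "wildlife.example/big-cats"),
    ("s3", "jaguar", "dealer.example/xk"),
    ("s3", "jaguar", "reviews.example/xk"),
    ("s4", "python", "python.example"),
]
clusters = coselection_clusters(log, "jaguar")
for cluster in clusters:
    print(cluster)
```

In this toy log the term "jaguar" separates into two components, one about cars and one about animals, which is the kind of separation the abstract interprets as distinct senses. Connected components never overlap, so this sketch does not reproduce the overlap-based detection of ambiguous terms mentioned in the abstract.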