Diversifying query results on semi-structured data

  • Authors:
  • Mahbub Hasan;Abdullah Mueen;Vassilis Tsotras;Eamonn Keogh

  • Affiliations:
  • University of California, Riverside, Riverside, CA, USA;University of California, Riverside, Riverside, CA, USA;University of California, Riverside, Riverside, CA, USA;University of California, Riverside, Riverside, CA, USA

  • Venue:
  • Proceedings of the 21st ACM international conference on Information and knowledge management
  • Year:
  • 2012

Abstract

Queries on the web can easily return a large number of results. Result diversification, a process by which the query returns the k most diverse matches, helps the user understand and explore such large result sets. Computing the diverse subset from a large set of results requires a massive number of pair-wise distance computations, as well as finding the subset that maximizes the total pair-wise distance; the latter problem is NP-hard and calls for an efficient approximation algorithm. The problem becomes more difficult when querying semi-structured data, since diversity can occur not only in the document content but also (and more importantly) in the document structure; thus one needs to efficiently measure the structural differences between results. The tree edit distance is the standard choice but is too expensive for large result sets. Moreover, the generalized tree edit distance ignores both the context of the query and the content of the documents, resulting in poor diversification. We present a novel algorithm for meaningful diversification that considers both the structural context of the query and the content of the matched results while computing pair-wise distances. Our algorithm is an order of magnitude faster than the tree edit distance and comes with an elegant worst-case guarantee. We also present a novel algorithm that finds the top-k diverse subset of matches in time linear in the size of the result set. We experimentally demonstrate the utility of our algorithms as a plugin for standard query processors without introducing large error or latency to the output.
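To make the selection problem concrete, the following minimal sketch shows a generic greedy heuristic for choosing k results with a large total pair-wise distance. It is not the algorithm proposed in the paper. The function names, the representation of each result as a set of root-to-leaf label paths, and the Jaccard distance standing in for a structural distance are all assumptions made for illustration. The seeding step is quadratic in the number of results, so the sketch does not reproduce the linear running time claimed in the abstract.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Toy stand-in for a structural distance (assumption, not the paper's measure).

    Each result is treated as a set of root-to-leaf label paths.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_diverse_subset(items, k, dist=jaccard_distance):
    """Return indices of k items chosen greedily for large total pairwise distance."""
    n = len(items)
    if k <= 0 or n == 0:
        return []
    if k >= n:
        return list(range(n))
    if k == 1:
        return [0]

    # Seed with the farthest pair (quadratic; acceptable for a small illustration).
    first, second = max(
        combinations(range(n), 2),
        key=lambda p: dist(items[p[0]], items[p[1]]),
    )
    selected = [first, second]

    # gain[i] is the sum of distances from candidate i to all selected items.
    gain = {
        i: dist(items[i], items[first]) + dist(items[i], items[second])
        for i in range(n)
        if i not in (first, second)
    }

    while len(selected) < k:
        best = max(gain, key=gain.get)
        selected.append(best)
        del gain[best]
        # Update the remaining candidates incrementally.
        for i in gain:
            gain[i] += dist(items[i], items[best])
    return selected


def total_pairwise_distance(items, indices, dist=jaccard_distance):
    """Objective value: sum of distances over all pairs of selected items."""
    return sum(dist(items[i], items[j]) for i, j in combinations(indices, 2))


if __name__ == "__main__":
    # Hypothetical results, each described by its label paths.
    results = [{"a/b", "a/c"}, {"a/b", "a/c"}, {"x/y"}, {"a/b", "x/y"}]
    chosen = greedy_diverse_subset(results, 3)
    print(chosen, total_pairwise_distance(results, chosen))
```

The two identical results (indices 0 and 1) are never selected together for k = 3, which is the behaviour diversification aims for. Replacing `jaccard_distance` with a context-aware structural distance would change which results count as redundant, but not the selection loop.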