Diversifying query results on semi-structured data

Authors:
Mahbub Hasan; Abdullah Mueen; Vassilis Tsotras; Eamonn Keogh
Affiliations:
University of California, Riverside, Riverside, CA, USA; University of California, Riverside, Riverside, CA, USA; University of California, Riverside, Riverside, CA, USA; University of California, Riverside, Riverside, CA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

Queries on the web can easily return a large number of results. Result diversification, a process by which the query returns the k most diverse matches, helps the user understand and explore such large result sets. Computing the diverse subset of a large result set requires a massive number of pairwise distance computations, as well as finding the subset that maximizes the total pairwise distance; the latter problem is NP-hard and requires an efficient approximation algorithm. The problem becomes more difficult when querying semi-structured data, since diversity can occur not only in the document content but also (and more importantly) in the document structure; thus one needs to efficiently measure the structural differences between results. The tree edit distance is the standard choice, but it is too expensive for large result sets. Moreover, the generalized tree edit distance ignores both the context of the query and the content of the documents, resulting in poor diversification. We present a novel algorithm for meaningful diversification that considers both the structural context of the query and the content of the matched results when computing pairwise distances. Our algorithm is an order of magnitude faster than the tree edit distance, with an elegant worst-case guarantee. We also present a novel algorithm that finds the top-k diverse subset of matches in time linear in the size of the result set. We experimentally demonstrate the utility of our algorithms as a plugin for standard query processors, without introducing large error or latency to the output.
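
To make the diversification objective concrete, the sketch below picks k results so that the summed pairwise distance is large. It is not the algorithm presented in the paper: it is a generic greedy heuristic, and it is quadratic rather than linear in the size of the result set. The distance function is also an assumption, namely a Jaccard distance over sets of root-to-leaf paths standing in for a structural distance. All names (`jaccard_distance`, `greedy_diverse_subset`, the sample paths) are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Toy structural distance between two sets of root-to-leaf paths (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_diverse_subset(items, k, distance):
    """Greedy heuristic for choosing k items with a large total pairwise distance."""
    n = len(items)
    if k <= 0 or n == 0:
        return []
    if k >= n:
        return list(items)
    if k == 1:
        return [items[0]]

    # Seed with the farthest pair; this step alone costs O(n^2) distance calls.
    i, j = max(
        combinations(range(n), 2),
        key=lambda p: distance(items[p[0]], items[p[1]]),
    )
    selected = [i, j]

    # gain[c] holds the summed distance from candidate c to the selected items.
    gain = {
        c: distance(items[c], items[i]) + distance(items[c], items[j])
        for c in range(n)
        if c not in (i, j)
    }

    while len(selected) < k:
        best = max(gain, key=gain.get)
        selected.append(best)
        del gain[best]
        # Update the remaining candidates incrementally.
        for c in gain:
            gain[c] += distance(items[c], items[best])

    return [items[s] for s in selected]


# Hypothetical query results, each reduced to its set of element paths.
results = [
    frozenset({"book/title", "book/author"}),
    frozenset({"book/title", "book/author", "book/year"}),
    frozenset({"article/title", "article/journal"}),
    frozenset({"book/title", "book/author"}),
]
diverse = greedy_diverse_subset(results, 3, jaccard_distance)
```

In this toy input the first and last results are structurally identical, so a selection of three keeps only one of them alongside the two structurally different results. A path-set distance of this kind ignores the query context and the document content, which the abstract names as a limitation of purely structural measures.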