SAIL: Structure-aware indexing for effective and progressive top-k keyword search over XML documents

  • Authors:
  • Guoliang Li;Chen Li;Jianhua Feng;Lizhu Zhou

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science, University of California, Irvine, CA 92697-3435, USA;Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2009

Quantified Score

Hi-index 0.07

Visualization

Abstract

Keyword search in XML documents has recently gained a lot of research attention. Given a keyword query, existing approaches first compute the lowest common ancestors (LCAs) or their variants of XML elements that contain the input keywords, and then identify the subtrees rooted at the LCAs as the answer. In this the paper we study how to use the rich structural relationships embedded in XML documents to facilitate the processing of keyword queries. We develop a novel method, called SAIL, to index such structural relationships for efficient XML keyword search. We propose the concept of minimal-cost trees to answer keyword queries and devise structure-aware indices to maintain the structural relationships for efficiently identifying the minimal-cost trees. For effectively and progressively identifying the top-k answers, we develop techniques using link-based relevance ranking and keyword-pair-based ranking. To reduce the index size, we incorporate a numbering scheme, namely schema-aware dewey code, into our structure-aware indices. Experimental results on real data sets show that our method outperforms state-of-the-art approaches significantly, in both answer quality and search efficiency.