Structure and content scoring for XML

  • Authors:
  • Sihem Amer-Yahia;Nick Koudas;Amélie Marian;Divesh Srivastava;David Toman

  • Affiliations:
  • AT&T Labs-Research;University of Toronto;Columbia University;AT&T Labs-Research;University of Waterloo

  • Venue:
  • VLDB '05 Proceedings of the 31st international conference on Very large data bases
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

XML repositories are usually queried both on structure and content. Due to structural heterogeneity of XML, queries are often interpreted approximately and their answers are returned ranked by scores. Computing answer scores in XML is an active area of research that oscillates between pure content scoring such as the well-known tf*idf and taking structure into account. However, none of the existing proposals fully accounts for structure and combines it with content to score query answers. We propose novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations. Twig scoring, accounts for the most structure and content and is thus used as our reference method. Path scoring is an approximation that loosens correlations between query nodes hence reducing the amount of time required to manipulate scores during top-k query processing. We propose efficient data structures in order to speed up ranked query processing. We run extensive experiments that validate our scoring methods and that show that path scoring provides very high precision while improving score computation time.