Compact Labeling Scheme for XML Ancestor Queries

  • Authors:
  • Haim Kaplan;Tova Milo;Ronen Shabo

  • Affiliations:
  • School of Computer Science, Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel;School of Computer Science, Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel;School of Computer Science, Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel

  • Venue:
  • Theory of Computing Systems
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

XML documents are often viewed as trees (basically the parse tree of the document), and queries over such documents typically test for ancestor relationships among tree nodes. Search engines process such queries using an index structure summarizing the ancestor relations. In the index, each document item (tree node) is identified using some logical id (node label), such that, given two labels, the engine can determine the ancestor relationship between the corresponding nodes. The length of the labels is a main factor of the index size. Therefore, reducing this length, even by a constant factor, is a critical issue. In this work we consider the following problem. Given a rooted XML tree T, label the nodes of T in the most compact way such that given the labels of two nodes, one can determine in constant time, by looking at the labels only, whether one node is an ancestor of the other. Labelings currently being used are all variants of the following interval scheme. Number the leaves say from left to right and label each node with a pair consisting of the numbers of its smallest and largest leaf descendants. An ancestor query then amounts to an interval containment test on the labels. The maximum label length using this scheme is 2 log n, where n is the number of nodes in the tree. (All logarithms in this paper are to base 2.) The focus of this work is finding a scheme that works best in practice on real XML data. We suggest an orthogonal prefix-based approach, where the labeling is such that an ancestor query roughly amounts to testing whether one label is a prefix of the other. We present several new labeling schemes based on this approach and analyze their performance both theoretically and empirically.