DTD based costs for tree-edit distance in structured information retrieval

  • Authors:
  • Cyril Laitang;Karen Pinel-Sauvagnat;Mohand Boughanem

  • Affiliations:
  • IRIT-SIG, Toulouse Cedex 9, France;IRIT-SIG, Toulouse Cedex 9, France;IRIT-SIG, Toulouse Cedex 9, France

  • Venue:
  • ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper we present a Structured Information Retrieval (SIR) model based on graph matching. Our approach combines content propagation, which handles sibling relationships, with a document-query structure matching process. The latter is based on Tree-Edit Distance (TED) which is the minimum set of insert, delete, and replace operations to turn one tree to another. To our knowledge this algorithm has never been used in ad-hoc SIR. As the effectiveness of TED relies both on the input tree and the edit costs, we first present a focused subtree extraction technique which selects the most representative elements of the document w.r.t the query. We then describe our TED costs setting based on the Document Type Definition (DTD). Finally we discuss our results according to the type of the collection (data-oriented or text-oriented). Experiments are conducted on two INEX test sets: the 2010 Datacentric collection and the 2005 Ad-hoc one.