A generative retrieval model for structured documents

  • Authors:
  • Le Zhao;Jamie Callan

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA

  • Venue:
  • Proceedings of the 17th ACM conference on Information and knowledge management
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Structured documents contain elements defined by the author(s) and annotations assigned by other people or processes. Structured documents pose challenges for probabilistic retrieval models when there are mismatches between the structured query and the actual structure in a relevant document or erroneous structure introduced by an annotator. This paper makes three contributions. First, a new generative retrieval model is proposed to deal with the mismatch problem. This new model extends the basic keyword language model by treating structure as hidden variable during the generation process. Second, variations of the model are compared. Third, term-level and structure-level smoothing strategies are studied. Evaluation was conducted with INEX XML retrieval and question-answering retrieval tasks. Experimental results indicate that the optimal structured retrieval model is task dependent, two-level Dirichlet smoothing significantly outperforms two-level Jelinek-Mercer smoothing, and with accurate structured queries, the proposed structured retrieval model outperforms keyword retrieval significantly, on both QA and INEX datasets.