UJM at INEX 2007: Document Model Integrating XML Tags

  • Authors:
  • Mathias Géry;Christine Largeron;Franck Thollard

  • Affiliations:
  • Hubert Curien Lab, Jean Monnet University, Saint-Étienne, France;Hubert Curien Lab, Jean Monnet University, Saint-Étienne, France;Hubert Curien Lab, Jean Monnet University, Saint-Étienne, France

  • Venue:
  • Focused Access to XML Documents
  • Year:
  • 2008

Abstract

Textual documents have been represented with several approaches, based on the Boolean model, the vector space model, or probabilistic models. In text mining, as in information retrieval (IR), these models have given good results for modeling textual documents. They nevertheless do not take document structure into account. In many applications, however, documents are inherently structured (e.g., XML documents). In this article, we propose an extended probabilistic representation of documents that takes into account a certain kind of structural information: logical tags that represent the different parts of the document and formatting tags used to emphasize text. Our approach includes a learning step that estimates the weight of each tag. This weight is related to the probability that a given tag distinguishes the relevant terms.
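
To make the learning step more concrete, here is a minimal sketch of one way a per-tag weight could be estimated from labeled data. It is not the authors' formula, which the abstract does not give. It assumes training data is available as (tag, is_relevant) pairs, one per term occurrence, and it uses a smoothed log-ratio as the weight. The function name `estimate_tag_weights`, the `smoothing` parameter, and the example tags are all hypothetical.

```python
import math
from collections import Counter


def estimate_tag_weights(observations, smoothing=0.5):
    """Estimate one weight per tag from (tag, is_relevant) pairs.

    Assumption (not from the paper): the weight is the log-ratio of
    P(tag | relevant term) to P(tag | non-relevant term), with additive
    smoothing so that unseen combinations do not produce log(0).
    """
    relevant = Counter()
    non_relevant = Counter()
    for tag, is_relevant in observations:
        if is_relevant:
            relevant[tag] += 1
        else:
            non_relevant[tag] += 1

    tags = set(relevant) | set(non_relevant)
    total_relevant = sum(relevant.values())
    total_non_relevant = sum(non_relevant.values())
    n_tags = len(tags)

    weights = {}
    for tag in tags:
        # Smoothed estimates of how often the tag marks each kind of term.
        p = (relevant[tag] + smoothing) / (total_relevant + smoothing * n_tags)
        q = (non_relevant[tag] + smoothing) / (total_non_relevant + smoothing * n_tags)
        weights[tag] = math.log(p / q)
    return weights


# Hypothetical training data: "b" (a formatting tag) mostly marks relevant
# terms, "p" (a logical tag) mostly marks non-relevant ones.
training = (
    [("b", True)] * 8 + [("b", False)] * 2
    + [("p", True)] * 2 + [("p", False)] * 8
)
tag_weights = estimate_tag_weights(training)
```

Under this assumed scheme, a positive weight means the tag tends to mark relevant terms and a negative weight means the opposite. A retrieval model could then use the weights to boost or damp term scores according to the tags enclosing each term.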