Probabilistic Model for Structured Document Mapping

  • Authors:
  • Guillaume Wisniewski;Francis Maes;Ludovic Denoyer;Patrick Gallinari

  • Affiliations:
  • LIP6 -- University of Paris 6, 104 avenue du Président Kennedy, 75015 Paris;LIP6 -- University of Paris 6, 104 avenue du Président Kennedy, 75015 Paris;LIP6 -- University of Paris 6, 104 avenue du Président Kennedy, 75015 Paris;LIP6 -- University of Paris 6, 104 avenue du Président Kennedy, 75015 Paris

  • Venue:
  • MLDM '07 Proceedings of the 5th international conference on Machine Learning and Data Mining in Pattern Recognition
  • Year:
  • 2007

Abstract

We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that provides a link between input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML-to-XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.