Document structure meets page layout: loopy random fields for web news content extraction

  • Authors:
  • Alex Spengler;Patrick Gallinari

  • Affiliations:
  • Université Pierre et Marie Curie, Paris, France;Université Pierre et Marie Curie, Paris, France

  • Venue:
  • Proceedings of the 10th ACM symposium on Document engineering
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Web content extraction is concerned with the automatic identification of semantically interesting web page regions. To generalize to pages from unknown sites, it is crucial to exploit not only the local characteristics of a particular web page region, but also the rich interdependencies that exist between the regions and their latent semantics. We therefore propose a loopy conditional random field which combines semantic intra-page dependencies derived from both document structure and page layout, uses a realistic set of local and relational features and is efficiently learnt in the tree-based reparameterization framework. The results of our empirical analysis on a corpus of real-world news web pages from 177 distinct sites with multiple annotations on DOM node level demonstrate that our combination of document structure and layout-driven interdependencies leads to a significant error reduction on the semantically interesting regions of a web page.