Document structure meets page layout: loopy random fields for web news content extraction

Authors:
Alex Spengler; Patrick Gallinari
Affiliations:
Université Pierre et Marie Curie, Paris, France; Université Pierre et Marie Curie, Paris, France
Venue:
Proceedings of the 10th ACM Symposium on Document Engineering
Year:
2010

Abstract

Web content extraction is concerned with the automatic identification of semantically interesting web page regions. To generalize to pages from unknown sites, it is crucial to exploit not only the local characteristics of a particular web page region, but also the rich interdependencies that exist between the regions and their latent semantics. We therefore propose a loopy conditional random field that combines semantic intra-page dependencies derived from both document structure and page layout, uses a realistic set of local and relational features, and is efficiently learnt in the tree-based reparameterization framework. Our empirical analysis, on a corpus of real-world news web pages from 177 distinct sites with multiple annotations at the DOM node level, demonstrates that combining document structure with layout-driven interdependencies leads to a significant error reduction on the semantically interesting regions of a web page.
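
To make the idea of a loopy random field over page regions concrete, here is a minimal sketch, not the authors' model. It assumes a toy page with four hypothetical regions, two labels (0 = boilerplate, 1 = content), hand-picked potentials instead of learnt ones, and one shared attractive pairwise potential for every edge. For inference it uses plain sum-product loopy belief propagation, which is a simpler stand-in for the tree-based reparameterization framework named in the abstract. The function name `loopy_bp` and all node names are illustrative.

```python
def loopy_bp(unary, edges, pairwise, iters=50):
    """Approximate marginals by sum-product loopy belief propagation.

    unary:    dict node -> list of positive potentials, one per label
    edges:    list of undirected (u, v) pairs, each pair listed once
    pairwise: symmetric K x K matrix shared by all edges (an assumption)
    """
    labels = range(len(next(iter(unary.values()))))
    neighbors = {n: [] for n in unary}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # msgs[(u, v)][b]: message from u to v about v taking label b
    msgs = {(u, v): [1.0 for _ in labels] for u in unary for v in neighbors[u]}

    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = []
            for b in labels:
                total = 0.0
                for a in labels:
                    incoming = 1.0
                    for w in neighbors[u]:
                        if w != v:  # exclude the recipient's own message
                            incoming *= msgs[(w, u)][a]
                    total += unary[u][a] * pairwise[a][b] * incoming
                out.append(total)
            z = sum(out)
            new[(u, v)] = [x / z for x in out]  # normalize for stability
        msgs = new

    beliefs = {}
    for n in unary:
        b = []
        for a in labels:
            value = unary[n][a]
            for w in neighbors[n]:
                value *= msgs[(w, n)][a]
            b.append(value)
        z = sum(b)
        beliefs[n] = [x / z for x in b]
    return beliefs


# Hypothetical local evidence per region: [boilerplate, content]
unary = {
    "headline": [0.1, 0.9],
    "para1": [0.5, 0.5],   # locally ambiguous region
    "para2": [0.2, 0.8],
    "sidebar": [0.9, 0.1],
}
# Edges from document structure (e.g. DOM neighbours) ...
structure_edges = [("headline", "para1"), ("para1", "para2")]
# ... and from page layout (e.g. visual adjacency); together they form a loop.
layout_edges = [("headline", "para2"), ("para2", "sidebar")]
# Attractive potential: linked regions tend to share a label.
pairwise = [[2.0, 1.0], [1.0, 2.0]]

beliefs = loopy_bp(unary, structure_edges + layout_edges, pairwise)
for node, belief in beliefs.items():
    print(node, [round(p, 3) for p in belief])
```

In this toy setup the locally ambiguous region `para1` is pulled towards the content label by its structural and layout neighbours, which is the kind of relational effect the abstract attributes to combining both sources of interdependency. On a graph without loops the same routine returns exact marginals; on loopy graphs the result is only an approximation.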