Adaptive web-page content identification

Authors:
John Gibson;Ben Wellner;Susan Lubar
Affiliations:
The MITRE Corporation, Bedford, MA;The MITRE Corporation, Bedford, MA;The MITRE Corporation, Bedford, MA
Venue:
Proceedings of the 9th annual ACM international workshop on Web information and data management
Year:
2007

Abstract

Identifying which parts of a Web page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve hand-crafting rules or scripts to extract the content, customized separately for each Web site. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to extract content properly in some cases and break when the structure of a site's Web pages changes. In this work we treat content identification as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web pages between 80% and 97% of the time, depending on experimental factors such as whether duplicate documents are excluded and whether the model is applied to unseen sources.
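
To make the sequence-labeling framing concrete, the sketch below decodes a label ("content" or "other") for each text block of a page with a small linear-chain model and Viterbi search. It is only an illustration under stated assumptions, not the authors' system: the block segmentation, the three features, and every weight are hypothetical and hand-set, whereas a Conditional Random Field would learn its weights from labeled pages.

```python
from typing import Dict, List

LABELS = ["content", "other"]

def block_features(block: str) -> Dict[str, float]:
    """Hypothetical per-block features; the paper's feature set is not given here."""
    words = block.split()
    return {
        "bias": 1.0,
        "long_text": 1.0 if len(words) >= 10 else 0.0,
        "has_link_markup": 1.0 if "<a " in block else 0.0,
    }

# Hand-set illustrative weights (assumption); a trained CRF would learn these.
EMISSION = {
    "content": {"bias": 0.0, "long_text": 2.0, "has_link_markup": -1.5},
    "other": {"bias": 0.5, "long_text": -1.0, "has_link_markup": 1.0},
}
# Transition weights reward staying in the same label between adjacent blocks.
TRANSITION = {
    ("content", "content"): 1.0, ("content", "other"): -0.5,
    ("other", "content"): -0.5, ("other", "other"): 1.0,
}

def emission_score(label: str, feats: Dict[str, float]) -> float:
    return sum(EMISSION[label][name] * value for name, value in feats.items())

def viterbi(blocks: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for the page's blocks."""
    feats = [block_features(b) for b in blocks]
    if not feats:
        return []
    scores = [{lab: emission_score(lab, feats[0]) for lab in LABELS}]
    backpointers = []
    for f in feats[1:]:
        row, ptr = {}, {}
        for cur in LABELS:
            # Best previous label for reaching `cur` at this position.
            prev = max(LABELS, key=lambda p: scores[-1][p] + TRANSITION[(p, cur)])
            row[cur] = scores[-1][prev] + TRANSITION[(prev, cur)] + emission_score(cur, f)
            ptr[cur] = prev
        scores.append(row)
        backpointers.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda lab: scores[-1][lab])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

page_blocks = [
    '<a href="/">Home</a>',
    "The city council voted on Tuesday to approve the new budget after a long debate.",
    "Short quote.",
    "Council members said the plan would fund road repairs and two new schools next year.",
    '<a href="/about">About us</a>',
]
labels = viterbi(page_blocks)
```

With these toy weights, the short middle block would score slightly higher as "other" if classified on its own, but the transition weights pull it into the surrounding run of "content" blocks. That neighbor dependence is the reason for labeling the page as a sequence rather than block by block.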