News article extraction with template-independent wrapper

Authors:
Junfeng Wang;Xiaofei He;Can Wang;Jian Pei;Jiajun Bu;Chun Chen;Ziyu Guan;Gang Lu
Affiliations:
Zhejiang University, Hangzhou, China;Zhejiang University, Hangzhou, China;Zhejiang University, Hangzhou, China;Simon Fraser University, Central City, Canada;Zhejiang University, Hangzhou, China;Zhejiang University, Hangzhou, China;Zhejiang University, Hangzhou, China;College of Information, Zhejiang University of Finance and Economics, Hangzhou, China
Venue:
Proceedings of the 18th international conference on World wide web
Year:
2009

Abstract

We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations: (1) it cannot correctly extract pages belonging to an unseen template until a wrapper for that template has been generated, and (2) it is costly to maintain up-to-date wrappers for hundreds of websites, because any change to a template may invalidate the corresponding wrapper. In this paper, we formalize news extraction as a machine learning problem and learn a template-independent wrapper from a very small number of labeled news pages from a single site. We develop novel features dedicated to news titles and to news bodies, and we exploit correlations between the news title and the news body. The resulting wrapper can extract news pages from different sites regardless of their templates. In our experiments, a wrapper learned from 40 pages of a single news site achieved 98.1% accuracy over 3,973 news pages from 12 news sites.
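
The general idea of learning from one site and extracting from another can be illustrated with a small sketch. Everything in it is an assumption made for brevity: the abstract does not specify the paper's features or learner, so the three features used here (text length, link density, punctuation count), the nearest-centroid classifier, the helper names, and the two toy pages are all hypothetical stand-ins. The sketch also covers only news bodies and ignores titles and title-body correlations. What it does show is that features computed from the text itself, rather than from tag paths or class names, can carry over to a page with a different template.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption of this sketch).
BLOCK_TAGS = {"p", "div", "h1", "h2", "li", "td"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks, recording how much text sits inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def page_blocks(html):
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up any trailing text after the last block tag
    return parser.blocks


def features(block):
    """Template-independent features: no tag names, ids, or class names are used."""
    text = block["text"]
    n = len(text)
    return [
        min(n / 200.0, 1.0),                                # saturating text length
        min(block["link_chars"] / n, 1.0),                  # link density
        min(sum(text.count(c) for c in ".,;") / 5.0, 1.0),  # punctuation count
    ]


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def train(labeled_blocks):
    """labeled_blocks: (block, is_body) pairs taken from a single site."""
    body = [features(b) for b, is_body in labeled_blocks if is_body]
    other = [features(b) for b, is_body in labeled_blocks if not is_body]
    return centroid(body), centroid(other)


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def extract_body(html, model):
    body_c, other_c = model
    return [b["text"] for b in page_blocks(html)
            if sq_dist(features(b), body_c) < sq_dist(features(b), other_c)]


# Toy labeled page from "site A" (hypothetical markup).
SITE_A = """<div class="nav"><a href="/">Home</a> <a href="/w">World</a></div>
<h1>Storm hits coast</h1>
<p class="story">Heavy rain fell on the coast on Monday, closing roads, schools, and ports. Officials said repairs would take weeks.</p>
<div class="foot"><a href="/about">About us</a></div>"""

# Toy unlabeled page from "site B", which uses a different template.
SITE_B = """<table><tr><td><a href="/">Front page</a></td><td><a href="/s">Sports</a></td></tr></table>
<h2>Team wins final</h2>
<div id="article-text">The home team won the final on Sunday, 3-1, after a tense, rainy match. Fans celebrated in the streets.</div>
<li><a href="/c">Contact</a></li>"""

labeled = [(b, b["text"].startswith("Heavy rain")) for b in page_blocks(SITE_A)]
model = train(labeled)
result = extract_body(SITE_B, model)
print(result)
```

Because the model never sees tag names or attributes, nothing in it is tied to site A's markup. A real system would need far richer features and many more labeled blocks than this toy example uses.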