Can we learn a template-independent wrapper for news article extraction from a single training site?

Authors:
Junfeng Wang;Chun Chen;Can Wang;Jian Pei;Jiajun Bu;Ziyu Guan;Wei Vivian Zhang
Affiliations:
Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China;Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China;Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China;School of Computer Science, Simon Fraser University, Vancouver, Canada;Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China;Zhejiang Key Lab. of Service Robot, College of Computer Science, Zhejiang University, Hangzhou, China;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Abstract

Automatic news extraction from news pages is important in many Web applications such as news aggregation. However, existing news extraction methods based on template-level wrapper induction have three serious limitations. First, they cannot correctly extract pages belonging to an unseen template. Second, it is costly to maintain up-to-date wrappers for a large number of news websites, because any change to a template may invalidate the corresponding wrapper. Last, they can extract only unformatted plain text, and thus are not user-friendly. In this paper, we tackle the problem of template-independent Web news extraction in a user-friendly way. We formalize Web news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. We develop novel features dedicated to news titles and bodies and exploit correlations between news titles and news bodies. Our template-independent wrapper can extract news pages from different sites regardless of templates. Moreover, our approach can extract not only text but also images and animations within the news bodies, and the extracted news articles keep the same visual style as in the original pages. In our experiments, a wrapper learned from 40 pages from a single news site achieved an accuracy of 98.1% on 3,973 news pages from 12 news sites.
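
To make the idea of template independence concrete, the following is a minimal illustrative sketch, not the method of the paper. The abstract does not specify the features or the learning model, so everything here is an assumption: the names `BlockFeatureParser`, `body_score`, and `guess_body_block` are hypothetical, and a hand-written scoring rule stands in for the learned wrapper. The only point it shows is that features such as text length and link-text length are computed from page content rather than from tag paths, so the same rule can be applied to pages built from different templates.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as candidate content blocks.
BLOCK_TAGS = {"div", "td", "article", "section"}


class BlockFeatureParser(HTMLParser):
    """Collects, per block element, total text length and text length inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one feature dict per block element, in document order
        self.open_blocks = []  # stack of indices into self.blocks
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.blocks.append(
                {"tag": tag, "attrs": dict(attrs), "text": 0, "link_text": 0}
            )
            self.open_blocks.append(len(self.blocks) - 1)
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.open_blocks:
            self.open_blocks.pop()
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or not self.open_blocks:
            return
        # Attribute the text to the innermost open block only.
        block = self.blocks[self.open_blocks[-1]]
        block["text"] += n
        if self.link_depth:
            block["link_text"] += n


def body_score(block):
    """Hand-written stand-in for a learned model: length of text outside links."""
    return block["text"] - block["link_text"]


def guess_body_block(html):
    """Returns the feature dict of the highest-scoring block, or None if no block exists."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return None
    return max(parser.blocks, key=body_score)
```

In a learned setting, the per-block feature dictionaries would be fed to a classifier trained on labeled pages instead of being ranked by `body_score`; the sketch only illustrates why such features do not depend on any particular site's template.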