An automatic web news article contents extraction system based on RSS feeds

  • Authors: Hao Han, Tomoya Noro, Takehiro Tokuda
  • Affiliation (all authors): Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
  • Venue: Journal of Web Engineering
  • Year: 2009

Abstract

Extracting the contents of Web news articles is vital for news indexing and search services. Most traditional methods must analyze the layout of news pages to generate wrappers, either manually or automatically. This is costly and requires considerable maintenance when extraction runs over a long period. In this paper, we construct an automatic Web news article contents extraction system based on RSS feeds. We propose an effective and efficient algorithm that extracts news article contents from news pages without analyzing the news sites before extraction. To detect the article contents, we calculate the relevance between the news title and each sentence in the news page. Our approach is applicable to general types of news RSS feeds and is independent of news page layout. Our experimental results show that our approach can extract news article contents automatically, accurately, and consistently.
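
The abstract does not specify how the relevance between the title and a sentence is computed. As a rough illustration only, the sketch below assumes a simple word-overlap measure: the fraction of title words that also appear in the sentence. The function names (`tokenize`, `relevance`, `relevant_sentences`) and the `threshold` parameter are hypothetical and are not taken from the paper; the title would presumably come from the RSS feed item, which is also an assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(title, sentence):
    """Assumed measure: fraction of title tokens that occur in the sentence."""
    title_tokens = tokenize(title)
    if not title_tokens:
        return 0.0
    return len(title_tokens & tokenize(sentence)) / len(title_tokens)


def relevant_sentences(title, sentences, threshold=0.3):
    """Keep sentences whose relevance to the title meets the (hypothetical) threshold."""
    return [s for s in sentences if relevance(title, s) >= threshold]


if __name__ == "__main__":
    title = "Tokyo team wins robot contest"
    page_sentences = [
        "A team from Tokyo wins the annual robot contest.",
        "Subscribe to our newsletter",
        "Copyright 2009 All rights reserved",
    ]
    for sentence in relevant_sentences(title, page_sentences):
        print(sentence)
```

A plain overlap score like this needs no knowledge of the page layout, which matches the layout-independence claimed above, but it is only one of many possible relevance measures and would likely miss article sentences that share no words with the title.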