A unified approach for extracting multiple news attributes from news pages

  • Authors:
  • Wei Liu;Hualiang Yan;Jianwu Yang;Jianguo Xiao

  • Affiliations:
  • Institute of Computer Science & Technology, Peking University, China and Key Laboratory of Computational Linguistics, Peking University, MOE, China;Institute of Computer Science & Technology, Peking University, China;Institute of Computer Science & Technology, Peking University, China and Key Laboratory of Computational Linguistics, Peking University, MOE, China;Institute of Computer Science & Technology, Peking University, China

  • Venue:
  • PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
  • Year:
  • 2010

Abstract

Most previous work on web news article extraction focuses only on content and title. To meet the growing demand from web data integration applications, more news attributes, such as publication date and author, need to be extracted and stored in structured form for further processing. In this paper, we study the problem of automatically extracting multiple news attributes from news pages. Unlike traditional approaches (e.g., extracting news attributes separately or generating template-dependent wrappers), we propose an automatic, unified approach that extracts them based on the visual features of news attributes, which include independent visual features and dependent visual features. The basic idea is as follows: first, candidates for each news attribute are extracted from the news page based on their independent visual features; then, the true value of each attribute is identified from the candidates based on dependent visual features (the layout relations among news attributes). Extensive experiments on a large number of news pages show that the proposed approach is highly effective and efficient.
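
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Block` type, the font-size threshold, the date pattern, and the "date sits just below the title" scoring rule are all hypothetical stand-ins chosen for illustration, since the abstract does not list the actual visual features. Stage one collects candidates for each attribute independently; stage two picks the combination whose layout relation scores best.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered text block (hypothetical representation of a page element)."""
    text: str
    x: float          # left coordinate in pixels
    y: float          # top coordinate in pixels
    font_size: float  # rendered font size in points


# Assumed date format for this sketch only; real pages vary widely.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def title_candidates(blocks, body_font=14.0):
    """Stage 1 (independent feature): short blocks in a larger-than-body font."""
    return [b for b in blocks if b.font_size > body_font and len(b.text) < 150]


def date_candidates(blocks):
    """Stage 1 (independent feature): blocks containing a date-like string."""
    return [b for b in blocks if DATE_RE.search(b.text)]


def layout_score(title, date):
    """Stage 2 (dependent feature): assumed rule that the date lies below
    the title, and that a smaller vertical gap is better."""
    if date.y <= title.y:
        return float("-inf")
    return -(date.y - title.y)


def extract(blocks):
    """Choose the (title, date) candidate pair with the best layout score."""
    pairs = [
        (t, d)
        for t in title_candidates(blocks)
        for d in date_candidates(blocks)
        if t is not d
    ]
    if not pairs:
        return None
    title, date = max(pairs, key=lambda p: layout_score(*p))
    if layout_score(title, date) == float("-inf"):
        return None
    return {"title": title.text, "date": DATE_RE.search(date.text).group(0)}
```

On a page with a large site banner, a headline, a byline date, and a footer date, the banner and the footer both survive stage one as candidates, and it is the layout relation in stage two that rules them out. That is the point of treating the attributes jointly rather than extracting each one separately.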