Extracting multiple news attributes based on visual features

  • Authors:
  • Wei Liu; Hualiang Yan; Jianguo Xiao

  • Affiliations:
  • Institute of Scientific and Technical Information of China, Beijing, China 100038; Institute of Computer Science & Technology, Peking University, Beijing, China 100871; Institute of Computer Science & Technology, Peking University, Beijing, China 100871

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2012

Abstract

The problem of automatically extracting multiple news attributes from news pages is studied in this paper. Most previous work on web news article extraction focuses only on content. To meet a growing demand for web data integration applications, more useful news attributes, such as title, publication date, author, etc., need to be extracted from news pages and stored in a structured way for further processing. An automatic unified approach to extract such attributes based on their visual features, including independent and dependent visual features, is proposed. Unlike conventional methods, such as extracting attributes separately or generating template-dependent wrappers, the basic idea of this approach is twofold. First, candidates for each news attribute are extracted from the page based on their independent visual features. Second, the true value of each attribute is identified from the candidates based on dependent visual features such as the layout relationships among the attributes. Extensive experiments with a large number of news pages show that the proposed approach is highly effective and efficient.
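
The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the concrete features, so everything below is an assumption made for illustration, including the `Block` type, the font-size threshold, the date pattern, the "By " prefix rule, and the `layout_cost` function. Step 1 picks candidates for each attribute from a block's own (independent) features; step 2 chooses the combination whose layout relationships (dependent features) look most plausible, here "date and author sit just below the title".

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Block:
    """A rendered text block (hypothetical representation)."""
    text: str
    x: float          # left edge in pixels
    y: float          # top edge in pixels; larger y is lower on the page
    font_size: float


# Assumed date format, used only for this illustration.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def candidates(blocks):
    """Step 1: select candidates per attribute from independent features."""
    max_font = max(b.font_size for b in blocks)
    return {
        "title": [b for b in blocks if b.font_size >= 0.8 * max_font],
        "date": [b for b in blocks if DATE_RE.search(b.text)],
        "author": [b for b in blocks if b.text.lower().startswith("by ")],
    }


def layout_cost(title, date, author):
    """Step 2 scoring (assumed rule): date and author lie closely below the title."""
    cost = 0.0
    for other in (date, author):
        gap = other.y - title.y
        # Heavy penalty when the block is not below the title.
        cost += gap if gap > 0 else 1000.0 + abs(gap)
    return cost


def extract(blocks):
    """Return the lowest-cost (title, date, author) combination, or None."""
    cands = candidates(blocks)
    if not all(cands.values()):
        return None
    combos = product(cands["title"], cands["date"], cands["author"])
    title, date, author = min(combos, key=lambda c: layout_cost(*c))
    return {"title": title.text, "date": date.text, "author": author.text}


# Made-up page: a banner, the article header, and footer noise.
SAMPLE_BLOCKS = [
    Block("Daily News", 0, 0, 30),
    Block("Markets rally after rate cut", 0, 100, 32),
    Block("By Jane Doe", 0, 140, 12),
    Block("2012-03-04", 120, 140, 12),
    Block("By our sponsors", 0, 900, 12),
    Block("2011-01-01", 0, 950, 12),
]

result = extract(SAMPLE_BLOCKS)
```

In this made-up page, the banner and the footer lines both pass step 1 as candidates, and only the layout scoring in step 2 rules them out. A real system would presumably use richer features and a learned or tuned scoring function rather than the fixed rule assumed here.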