Template-independent news extraction based on visual consistency

  • Authors:
  • Shuyi Zheng;Ruihua Song;Ji-Rong Wen

  • Affiliations:
  • Pennsylvania State University, University Park, PA;Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China

  • Venue:
  • AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Wrapper is a traditional method to extract useful information from Web pages. Most previous works rely on the similarity between HTML tag trees and induced template-dependent wrappers. When hundreds of information sources need to be extracted in a specific domain like news, it is costly to generate and maintain the wrappers. In this paper, we propose a novel template-independent news extraction approach to easily identify news articles based on visual consistency. We first represent a page as a visual block tree. Then, by extracting a series of visual features, we can derive a composite visual feature set that is stable in the news domain. Finally, we use a machine learning approach to generate a template-independent wrapper. Experimental results indicate that our approach is effective in extracting news across websites, even from unseen websites. The performance is as high as around 95% in terms of F1-value.