Learning to Extract Web News Title in Template Independent Way

  • Authors:
  • Can Wang;Junfeng Wang;Chun Chen;Li Lin;Ziyu Guan;Junyan Zhu;Cheng Zhang;Jiajun Bu

  • Affiliations:
  • College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;China Disabled Persons' Federation Information Center;College of Computer Science, Zhejiang University, China

  • Venue:
  • RSKT '09 Proceedings of the 4th International Conference on Rough Sets and Knowledge Technology
  • Year:
  • 2009

Abstract

Many news sites host large collections of news pages that are generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from these pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and is thus unaffected by template updates, which usually invalidate the corresponding extractors. Empirical evaluation over 5,200 news Web pages collected from 13 important online news sites shows that our approach significantly improves the accuracy of news title extraction.