Learning to Extract Web News Title in Template Independent Way

Authors:
Can Wang;Junfeng Wang;Chun Chen;Li Lin;Ziyu Guan;Junyan Zhu;Cheng Zhang;Jiajun Bu
Affiliations:
College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;College of Computer Science, Zhejiang University, China;China Disabled Persons' Federation Information Center;College of Computer Science, Zhejiang University, China
Venue:
RSKT '09 Proceedings of the 4th International Conference on Rough Sets and Knowledge Technology
Year:
2009

Abstract

Many news sites host large collections of news pages that are generated dynamically and continuously from underlying databases. Automatic extraction of news titles and contents from these pages is therefore an important technique for applications such as news aggregation systems. However, previous work has found that accurately extracting news titles from pages of various styles is a challenging task. In this paper, we propose a machine learning approach to tackle this problem. Our approach is independent of templates and is thus unaffected by template updates, which usually invalidate the corresponding extractors. Empirical evaluation on 5,200 news Web pages collected from 13 important online news sites shows that our approach significantly improves the accuracy of news title extraction.