A very efficient approach to news title and content extraction on the web

Authors:
Hualiang Yan;Jianwu Yang
Affiliations:
Institute of Computer Science & Technology, Peking University, Beijing, China;Institute of Computer Science & Technology, Peking University, Beijing, China
Venue:
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Year:
2011

Citing 4
Cited 1

Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Extracting article text from the web with maximum subsequence segmentation

Proceedings of the 18th international conference on World wide web
Can we learn a template-independent wrapper for news article extraction from a single training site?

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Template-independent news extraction based on visual consistency

AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2

Determining the titles of Web pages using anchor text and link analysis

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

We consider the problem of efficient and template-independent news extraction on the Web. The popular news extraction methods are based on visual information, and they can achieve good accuracy performance, but the computational efficiency is poor, because it is very time-consuming to render web page to obtain visual information. In this paper we propose an efficient and effective news extraction approach based on novel features. Our approach neither needs training nor needs visual information, so it is simple and very efficient. And it can extract news information from various news sites without using templates. In our experiments, the proposed approach achieves 99% accuracy over 5,671 news pages from 20 different news sites. And the efficiency is much faster than the baseline machine learning method using visual information.