An automatic web news article contents extraction system based on RSS feeds

Authors:
Hao Han;Tomoya Noro;Takehiro Tokuda
Affiliations:
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan;Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan;Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
Venue:
Journal of Web Engineering
Year:
2009

Abstract

Extracting the contents of Web news articles is vital for news indexing and search services. Most traditional methods analyze the layout of news pages to generate wrappers, either manually or automatically. This is costly and requires considerable maintenance when extraction runs over a long period of time. In this paper, we construct an automatic Web news article contents extraction system based on RSS feeds. We propose an effective and efficient algorithm that extracts the news article contents from news pages without analyzing the news sites beforehand. To detect the news article contents, we calculate the relevance between the news title and each sentence in the news page. Our approach is applicable to general types of news RSS feeds and is independent of news page layout. Our experimental results show that our approach can extract news article contents automatically, accurately and consistently.
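
The idea of scoring each sentence against the news title can be illustrated with a minimal Python sketch. The abstract does not specify the relevance measure, so this sketch assumes a simple word-overlap score (the fraction of title words that reappear in a sentence). The function names `tokenize`, `title_relevance` and `rank_sentences`, and the sample title and sentences, are hypothetical and chosen only for illustration; this is not the authors' algorithm.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_relevance(title, sentence):
    """Fraction of title words that also appear in the sentence (0.0 to 1.0).

    Assumption: plain word overlap stands in for the paper's relevance measure.
    """
    title_words = tokenize(title)
    if not title_words:
        return 0.0
    return len(title_words & tokenize(sentence)) / len(title_words)


def rank_sentences(title, sentences):
    """Return (score, sentence) pairs sorted from most to least relevant."""
    scored = [(title_relevance(title, s), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical input: a title (e.g. taken from a feed item) and page sentences.
title = "City council approves new bridge"
sentences = [
    "Home | World | Sports | Weather",
    "The city council voted on Tuesday to approve a new bridge.",
    "Subscribe to our newsletter for daily updates.",
]

ranked = rank_sentences(title, sentences)
best_score, best_sentence = ranked[0]
print(best_score, best_sentence)
```

In this sketch the navigation and subscription lines share no words with the title and score zero, while the article sentence ranks first. A real system would likely need stemming, stop-word handling and a way to extend from the best-matching sentence to the whole article body, none of which is attempted here.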