News page discovery policy for instant crawlers

Authors:
Yong Wang;Yiqun Liu;Min Zhang;Shaoping Ma
Affiliations:
State Key Lab of Intelligent Tech. & Sys., Tsinghua University;State Key Lab of Intelligent Tech. & Sys., Tsinghua University;State Key Lab of Intelligent Tech. & Sys., Tsinghua University;State Key Lab of Intelligent Tech. & Sys., Tsinghua University
Venue:
AIRS'08 Proceedings of the 4th Asia information retrieval conference on Information retrieval technology
Year:
2008

Abstract

Many news pages with strict freshness requirements are published on the Internet every day. Instant crawlers must download them immediately; otherwise, they quickly become outdated. Previously, instant crawlers downloaded pages only from a manually compiled list of news websites. Because news websites do not publish news pages exclusively, bandwidth is wasted on downloading non-news pages. This paper proposes a novel approach to discovering news pages, which combines seed selection with news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that this approach outperforms the traditional approach in both precision and recall.
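
The abstract compares approaches by precision and recall. As a minimal sketch of how those two measures could be computed for a set of discovered URLs, assuming a labeled set of true news URLs is available, consider the following. The function name `precision_recall` and all URLs are hypothetical illustrations and are not taken from the paper.

```python
def precision_recall(discovered, news):
    """Return (precision, recall) of discovered URLs against known news URLs."""
    discovered, news = set(discovered), set(news)
    hits = discovered & news  # discovered URLs that really are news pages
    precision = len(hits) / len(discovered) if discovered else 0.0
    recall = len(hits) / len(news) if news else 0.0
    return precision, recall


# Hypothetical ground truth: URLs labeled as news pages.
news_urls = {
    "http://example.com/news/1",
    "http://example.com/news/2",
    "http://example.com/news/3",
    "http://example.com/news/4",
}

# Hypothetical crawler output: two news pages and one non-news page.
discovered_urls = {
    "http://example.com/news/1",
    "http://example.com/news/2",
    "http://example.com/about",
}

p, r = precision_recall(discovered_urls, news_urls)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.67 recall=0.50"
```

In this toy case, the wasted download of the non-news page lowers precision, while the two undiscovered news pages lower recall.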