Efficient crawling through URL ordering
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web
Machine Learning - Special issue on information retrieval
Adaptive on-line page importance computation
WWW '03 Proceedings of the 12th international conference on World Wide Web
Design and Implementation of a High-Performance Distributed Web Crawler
ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Effective page refresh policies for Web crawlers
ACM Transactions on Database Systems (TODS)
Proceedings of the 13th international conference on World Wide Web
A large-scale study of the evolution of web pages
Software—Practice & Experience - Special issue: Web technologies
Topical web crawlers: Evaluating adaptive algorithms
ACM Transactions on Internet Technology (TOIT)
Looking at both the present and the past to efficiently update replicas of web content
Proceedings of the 7th annual ACM international workshop on Web information and data management
Many news pages with high freshness requirements are published on the Internet every day. Instant crawlers should download them immediately; otherwise, they soon become outdated. In the past, instant crawlers downloaded pages only from a manually generated list of news websites. Because news websites do not publish news pages exclusively, bandwidth is wasted on downloading non-news pages. In this paper, a novel approach to discovering news pages is proposed. The approach includes seed selection and news URL prediction based on user behavior analysis. Empirical studies on a two-month user access log show that our approach outperforms the traditional approach in both precision and recall.
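The precision and recall used in the comparison above can be illustrated with a minimal sketch. It assumes the standard set-based definitions; the function name and the example URLs are hypothetical and are not taken from the paper's data or code.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted news URLs.

    predicted: URLs the crawler labelled as news (hypothetical input)
    actual:    URLs that really are news pages (hypothetical ground truth)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: fraction of predicted URLs that are truly news.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of true news URLs that were predicted.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative data only.
predicted_news = [
    "http://example.com/news/1",
    "http://example.com/news/2",
    "http://example.com/about",
    "http://example.com/ads",
]
actual_news = [
    "http://example.com/news/1",
    "http://example.com/news/2",
    "http://example.com/news/3",
]

precision, recall = precision_recall(predicted_news, actual_news)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, two of the four predicted URLs are news pages and two of the three real news pages are found. Low precision corresponds to the bandwidth wasted on non-news pages, and low recall corresponds to news pages the crawler misses.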