Towards a quality-oriented real-time web crawler

  • Authors:
  • Jianling Sun;Hui Gao;Xiao Yang

  • Affiliations:
  • VLIS Lab, College of Computer Science and Technology, Zhejiang University, Hangzhou, China (all authors)

  • Venue:
  • WISM'10 Proceedings of the 2010 international conference on Web information systems and mining
  • Year:
  • 2010

Abstract

Real-time search has emerged because a significant amount of time-sensitive information is produced online every minute. Unlike most commercial web sites, which publish content on routine schedules, users of web communities deliver their postings with high variance in both timing and quality. In this work, we address the scheduling problem for web crawlers, with the objective of optimizing the quality of the local index (i.e., minimizing the total weighted delay of postings) under a given quantity of resources. To this end, we combine a posting importance evaluation mechanism with the underlying publishing pattern of the data source to build a model that predicts the generation of posting weights, which the web crawler then uses to decide its retrieval points for better index quality. Extensive experiments on several web communities show that our policy outperforms both uniform scheduling and a policy based purely on the posting generation pattern.
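To make the objective concrete, the following is a minimal sketch of how a total weighted delay could be computed for a given crawl schedule. It is not the authors' implementation: the function name `total_weighted_delay`, the sample postings, and the two schedules are hypothetical, and the delay of a posting is assumed to be the gap between its publication time and the first crawl at or after that time.

```python
from bisect import bisect_left


def total_weighted_delay(postings, crawl_times):
    """Sum of weight * delay over all postings.

    Assumption (not from the paper): a posting's delay is the time from
    its publication to the first crawl at or after that moment.
    """
    crawls = sorted(crawl_times)
    total = 0.0
    for posted_at, weight in postings:
        # Index of the first crawl time >= posted_at.
        i = bisect_left(crawls, posted_at)
        if i == len(crawls):
            raise ValueError(f"posting at {posted_at!r} is never crawled")
        total += weight * (crawls[i] - posted_at)
    return total


# Hypothetical data: (posting time in minutes, importance weight).
postings = [(5, 1.0), (12, 4.0), (14, 3.0), (50, 0.5)]

# Two schedules with the same budget of three crawls.
uniform = [20, 40, 60]        # evenly spaced retrieval points
weight_aware = [15, 30, 60]   # first crawl moved toward the heavy postings

print(total_weighted_delay(postings, uniform))       # 70.0
print(total_weighted_delay(postings, weight_aware))  # 30.0
```

In this toy example both schedules spend the same number of crawls, but placing a retrieval point just after the high-weight postings lowers the total weighted delay, which is the kind of gain a weight-aware scheduling policy aims for.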