On the Clustering of Web Content for Efficient Replication

  • Authors:
  • Yan Chen; Lili Qiu; Weiyu Chen; Luan Nguyen; Randy H. Katz

  • Venue:
  • On the Clustering of Web Content for Efficient Replication
  • Year:
  • 2002

Abstract

Content distribution networks (CDNs) that offer hosting services to Web content providers are increasingly being deployed. How to provision CDNs efficiently is a crucial and challenging issue. In this paper, we first compare pull-based and push-based replication for distributing Web content. Our results show that push-based replication can achieve comparable user-perceived performance with much less replication traffic (4-5% of that in the pull-based scheme). Motivated by this observation, we explore how to push content to CDN nodes efficiently. Using trace-driven simulation, we show that replicating content in units of URLs can yield a 60-70% reduction in client latency compared to replicating in units of Web sites. On the other hand, such fine-grained replication is very expensive to perform. To address this issue, we propose replicating content in units of clusters, each containing objects that have similar access patterns and are likely to be requested by clients that are topologically close. To this end, we describe three clustering techniques and evaluate their performance using various topologies and several real traces from large Web servers. Our results show that cluster-based replication achieves a 40-60% improvement over full Web site replication. In addition, by adjusting the number of clusters, we can smoothly trade off management and computation cost for better client performance. To account for changes in users' access patterns, we also explore incremental clustering, which adaptively adds new documents to the content clusters. We examine both offline and online incremental clustering: the former assumes that access history is available, while the latter predicts access patterns from the hyperlink structure. Our results show that offline incremental clustering performs close to complete re-clustering at much lower overhead.
Online incremental clustering and replication reduce the retrieval cost by a factor of 4.6 to 8 compared to no replication and random replication, so they are especially useful for handling flash crowds and improving document availability.
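
To make the idea of grouping objects with similar access patterns concrete, here is a minimal sketch in Python. It is not one of the three clustering techniques from the paper, which the abstract does not specify. It assumes a hypothetical representation in which each URL has a vector of request counts per client group, and it uses a simple greedy, threshold-based grouping on cosine similarity. The function names, the threshold value, and the sample data are all illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length request-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_urls(access, threshold=0.9):
    """Greedy clustering (illustrative, not the paper's algorithm).

    `access` maps a URL to its request counts per client group.
    Each URL joins the first cluster whose seed vector is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_urls)
    for url, vec in access.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((vec, [url]))
    return [members for _, members in clusters]


# Hypothetical request counts from three client groups.
access = {
    "/index.html": [90, 10, 0],
    "/news.html": [80, 15, 5],
    "/video.mpg": [0, 5, 95],
    "/audio.mp3": [5, 0, 70],
}
clusters = cluster_urls(access)
print(clusters)
```

With this made-up data, the two pages requested mostly by the first client group end up in one cluster and the two media files requested mostly by the third group in another, so each cluster could be replicated as a unit near the clients that request it. Raising or lowering `threshold` changes the number of clusters, which loosely mirrors the trade-off between management cost and client performance described in the abstract.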