Clustering Web Content for Efficient Replication

Authors:
Yan Chen;Lili Qiu;Weiyu Chen;Luan Nguyen;Randy H. Katz
Venue:
ICNP '02 Proceedings of the 10th IEEE International Conference on Network Protocols
Year:
2002

Citing 0
Cited 7

Replication for web hosting systems

ACM Computing Surveys (CSUR)
Increasing the Performance of CDNs Using Replication and Caching: A Hybrid Approach

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
GlobeDB: autonomic data replication for web applications

WWW '05 Proceedings of the 14th international conference on World Wide Web
Coordinated data prefetching for web contents

Computer Communications
Combining replica placement and caching techniques in content distribution networks

Computer Communications
Integrating caching techniques on a content distribution network

ADBIS'06 Proceedings of the 10th East European conference on Advances in Databases and Information Systems

Abstract

Recently there has been an increasing deployment of content distribution networks (CDNs) that offer hosting services to Web content providers. In this paper, we first compare the uncooperative pulling of Web content used by commercial CDNs with cooperative pushing. Our results show that the latter can achieve comparable user-perceived performance with only 4-5% of the replication and update traffic of the former. We therefore explore how to push content to CDN nodes efficiently. Using trace-driven simulation, we show that replicating content in units of URLs can yield a 60-70% reduction in clients' latency compared to replicating in units of Web sites. However, such fine-grained replication is very expensive to perform.

To address this issue, we propose to replicate content in units of clusters, each containing objects that are likely to be requested by clients that are topologically close. To this end, we describe three clustering techniques and use various topologies and several large Web server traces to evaluate their performance. Our results show that cluster-based replication achieves a 40-60% improvement over per-Web-site replication. In addition, by adjusting the number of clusters, we can smoothly trade off management and computation cost for better client performance.

To adapt to changes in users' access patterns, we also explore incremental clusterings that adaptively add new documents to the existing content clusters. We examine both offline and online incremental clusterings, where the former assumes that access history is available while the latter predicts the access pattern based on the hyperlink structure. Our results show that the offline clusterings come close to the performance of complete re-clustering at much lower overhead. The online incremental clustering and replication reduce the retrieval cost by a factor of 4.6-8 compared to no replication and random replication, so they are especially useful for improving document availability during flash crowds.
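
To make the idea of replicating in units of clusters more concrete, the following is a minimal sketch that groups URLs whose requests come from similar sets of client groups. It is not one of the three clustering techniques evaluated in the paper, which the abstract does not specify. The access-vector representation, the cosine similarity measure, the greedy assignment, the 0.8 threshold, and all names and sample data are assumptions made purely for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse access vectors (client group -> request count)."""
    dot = sum(count * b.get(group, 0) for group, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_urls(access, threshold=0.8):
    """Greedy single-pass clustering (hypothetical, not the paper's method).

    Each URL joins the first cluster whose seed vector is at least
    `threshold` similar to its own access vector; otherwise it seeds
    a new cluster. URLs are visited in sorted order for determinism.
    """
    clusters = []  # list of (seed_vector, member_urls)
    for url in sorted(access):
        vec = access[url]
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((vec, [url]))
    return [members for _, members in clusters]


# Made-up request counts per topological client group (g1, g2, g3).
access = {
    "/index.html": {"g1": 90, "g2": 10},
    "/news.html": {"g1": 80, "g2": 15},
    "/jp/top.html": {"g3": 70},
    "/jp/news.html": {"g3": 40, "g2": 2},
}

clusters = cluster_urls(access)
for members in clusters:
    print(members)
```

In this sketch, raising the threshold produces more, smaller clusters and lowering it produces fewer, larger ones, which loosely mirrors the trade-off between management cost and client performance that the abstract attributes to the number of clusters.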