Characterization of a large web site population with implications for content delivery

  • Authors:
  • L. Bent;M. Rabinovich;G. M. Voelker;Z. Xiao

  • Affiliations:
  • University of California - San Diego, La Jolla, CA;AT&T Labs-Research, Florham Park, NJ;University of California - San Diego, La Jolla, CA;AT&T Labs-Research, Florham Park, NJ

  • Venue:
  • Proceedings of the 13th international conference on World Wide Web
  • Year:
  • 2004

Quantified Score

Hi-index 0.01

Visualization

Abstract

This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate potential performance benefits of content delivery networks (CDNs) for these Web sites. We make several interesting observations about the current usage of Web technologies and Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher degree of uncacheable responses and responses that require mandatory cache validations. A significant reason for this is that cookie use is prevalent among our population, especially among more popular sites. However, we found an indication of wide-spread indiscriminate usage of cookies, which unnecessarily impedes the use of many content delivery optimizations. We also found that most Web sites do not utilize the cache-control features ofthe HTTP 1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time in client caches for responses is constrained by the maximum values allowed in the Squid proxy. Finally, our simulation results indicate that most Web sites benefit from the use of a CDN. The amount of the benefit depends on site popularity, and, somewhat surprisingly, a CDN may increase the peak to average request ratio at the origin server because the CDN can decrease the average request rate more than the peak request rate.