Characterization of a Large Web Site Population with Implications for Content Delivery

  • Authors:
  • Leeann Bent; Michael Rabinovich; Geoffrey M. Voelker; Zhen Xiao

  • Affiliations:
  • University of California, San Diego, USA 92093-0114; Case Western Reserve University, Cleveland, USA 44106-7071; University of California, San Diego, USA 92093-0114; IBM T.J. Watson Research Center, New York, USA 10532

  • Venue:
  • World Wide Web
  • Year:
  • 2006

Abstract

This paper presents a systematic study of the properties of a large number of Web sites hosted by a major ISP. To our knowledge, ours is the first comprehensive study of a large server farm that contains thousands of commercial Web sites. We also perform a simulation analysis to estimate the potential performance benefits of content delivery networks (CDNs) for these Web sites, and we validate our analysis for several sites by replaying our trace through a real cache. We make several observations about the current usage of Web technologies and about Web site performance characteristics. First, compared with previous client workload studies, the Web server farm workload contains a much higher proportion of uncacheable responses and of responses that require mandatory cache validation. A significant reason for this is that cookie use is prevalent among our population, especially among the more popular sites. We found indications of widespread indiscriminate use of cookies, which unnecessarily impedes many content delivery optimizations. We also found that most Web sites do not use the cache-control features of the HTTP/1.1 protocol, resulting in suboptimal performance. Moreover, the implicit expiration time in client caches for responses is strongly constrained by the maximum values allowed in the Squid proxy. Thus, supplying explicit expiration information would significantly improve Web sites' cacheability. Finally, our simulation results indicate that while most Web sites benefit from the use of a CDN, the size of the benefit varies widely among the sites, which underscores the need for workload analysis tools.
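
The point about implicit expiration can be illustrated with a minimal sketch of how a cache might choose a freshness lifetime. It is not the paper's simulator. The function name `freshness_lifetime` and the constants `MIN_AGE`, `LM_FACTOR`, and `MAX_AGE` are hypothetical; the values (0, 20%, 3 days) are assumptions modeled on a Squid-style refresh rule and may differ from the configuration the authors used.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed values, modeled on a Squid-style refresh rule (min, percent, max).
MIN_AGE = timedelta(0)
LM_FACTOR = 0.20
MAX_AGE = timedelta(days=3)


def freshness_lifetime(
    date: datetime,
    last_modified: Optional[datetime],
    max_age_seconds: Optional[int] = None,
) -> timedelta:
    """Return how long a cache may serve a response without revalidating."""
    # Explicit expiration supplied by the site takes precedence.
    if max_age_seconds is not None:
        return timedelta(seconds=max_age_seconds)
    # No explicit expiration and no Last-Modified: nothing to base a guess on.
    if last_modified is None:
        return MIN_AGE
    # Implicit expiration: a fraction of the object's age at serving time,
    # clamped to the configured minimum and maximum.
    heuristic = (date - last_modified) * LM_FACTOR
    return min(MAX_AGE, max(MIN_AGE, heuristic))


served = datetime(2006, 1, 1, tzinfo=timezone.utc)
modified = datetime(2005, 1, 1, tzinfo=timezone.utc)

# An object unchanged for a year is still capped at the assumed 3-day maximum.
implicit = freshness_lifetime(served, modified)
# An explicit max-age of 30 days lifts that cap.
explicit = freshness_lifetime(served, modified, max_age_seconds=30 * 24 * 3600)
```

Under these assumptions, a long-stable object is limited by the configured maximum whenever the site supplies no expiration, whereas an explicit `max-age` lets the site state the longer lifetime directly.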