What TCP/IP protocol headers can tell us about the web

  • Authors:
  • F. Donelson Smith; Félix Hernández Campos; Kevin Jeffay; David Ott

  • Affiliations:
  • University of North Carolina at Chapel Hill, Department of Computer Science, Chapel Hill, NC (all four authors)

  • Venue:
  • Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
  • Year:
  • 2001

Abstract

We report the results of a large-scale empirical study of web traffic. Our study is based on over 500 GB of TCP/IP protocol-header traces collected in 1999 and 2000 (approximately one year apart) from the high-speed link connecting The University of North Carolina at Chapel Hill to its Internet service provider. For comparison, we also use a set of smaller traces from the NLANR repository taken at approximately the same times. The principal results from this study are: (1) empirical data suitable for constructing traffic-generating models of contemporary web traffic, (2) new characterizations of TCP connection usage showing the effects of HTTP protocol improvements, notably persistent connections (e.g., about 50% of web objects are now transferred on persistent connections), and (3) new characterizations of web usage and content structure that reflect the influences of "banner ads," server load balancing, and content distribution. A novel aspect of this study is a demonstration that a relatively lightweight methodology based on passive tracing of only TCP/IP headers and off-line analysis tools can provide timely, high-quality data about web traffic. We hope this will encourage more researchers to undertake ongoing data collection and provide the research community with data about the rapidly evolving characteristics of web traffic.
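As an illustration of how header-only analysis of this kind can work, the following Python sketch infers request/response exchanges on a TCP connection from packet lengths and direction alone, then estimates the share of web objects carried on persistent connections. It is a minimal sketch under stated assumptions, not the authors' tooling: the `HeaderRecord` layout and the function names are hypothetical, packets are assumed to arrive in order without retransmissions, and pipelined requests are not handled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class HeaderRecord:
    """One captured packet reduced to header-derived fields (hypothetical layout)."""
    from_client: bool   # True if the packet travels client -> server
    ip_total_len: int   # IP total length field, in bytes
    ip_hdr_len: int     # IP header length, in bytes
    tcp_hdr_len: int    # TCP header length (data offset * 4), in bytes


def payload_len(rec: HeaderRecord) -> int:
    """TCP payload size computed from header fields only."""
    return rec.ip_total_len - rec.ip_hdr_len - rec.tcp_hdr_len


def infer_exchanges(records: Iterable[HeaderRecord]) -> List[Tuple[int, int]]:
    """Return (request_bytes, response_bytes) pairs for one connection.

    Assumption: client data that follows server data starts a new request.
    """
    exchanges: List[Tuple[int, int]] = []
    req = resp = 0
    for rec in records:
        n = payload_len(rec)
        if n <= 0:
            continue  # SYN, FIN, or pure ACK: no application data
        if rec.from_client:
            if resp > 0:
                # The previous exchange is complete; start a new one.
                exchanges.append((req, resp))
                req = resp = 0
            req += n
        elif req > 0:
            resp += n  # server data answering the current request
    if req > 0 and resp > 0:
        exchanges.append((req, resp))
    return exchanges


def persistent_object_share(connections: Iterable[Iterable[HeaderRecord]]) -> float:
    """Fraction of inferred objects carried on connections with more than one exchange."""
    total = persistent = 0
    for records in connections:
        n = len(infer_exchanges(records))
        total += n
        if n > 1:
            persistent += n
    return persistent / total if total else 0.0
```

A real analysis would also need to track sequence and acknowledgment numbers to cope with loss, reordering, and retransmission; the sketch omits that to keep the core idea visible.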