Counting YouTube videos via random prefix sampling

  • Authors:
  • Jia Zhou; Yanhua Li; Vijay Kumar Adhikari; Zhi-Li Zhang

  • Affiliations:
  • University of Minnesota, Minneapolis, USA; University of Minnesota, Minneapolis, USA; University of Minnesota, Minneapolis, USA; University of Minnesota, Minneapolis, USA

  • Venue:
  • Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference
  • Year:
  • 2011

Abstract

Leveraging the characteristics of the YouTube video ID space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the "true" collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were a total of roughly 500 million YouTube videos as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly under-estimating the number of videos with very small view counts (
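
The general idea of estimating a collection's size from random prefixes can be illustrated with a small simulation. The sketch below is not the authors' estimator or their use of the YouTube search API; it assumes a hypothetical 64-character ID alphabet, 11-character IDs, and a local lookup function (`count_with_prefix`) standing in for whatever query returns the number of IDs that start with a given prefix. Under those assumptions, drawing prefixes uniformly at random and scaling the mean match count by the number of possible prefixes gives an unbiased estimate of the collection size, because each ID falls under exactly one prefix of a given length.

```python
import random
import string
from collections import Counter

# Assumed 64-character ID alphabet, chosen for illustration only.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def random_string(rng, length):
    """Draw a string of the given length uniformly from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def estimate_total(count_with_prefix, prefix_len, num_samples, rng):
    """Estimate collection size from match counts under random prefixes.

    count_with_prefix is a hypothetical callable returning how many IDs
    in the collection start with the given prefix.
    """
    matched = 0
    for _ in range(num_samples):
        matched += count_with_prefix(random_string(rng, prefix_len))
    # Scale the mean count per prefix by the number of possible prefixes.
    return len(ALPHABET) ** prefix_len * matched / num_samples


def build_prefix_counter(ids, prefix_len):
    """Index a known ID collection by prefix, for simulation purposes."""
    counts = Counter(video_id[:prefix_len] for video_id in ids)
    return lambda prefix: counts.get(prefix, 0)


rng = random.Random(2011)
# Simulated "true" collection of 11-character IDs.
collection = {random_string(rng, 11) for _ in range(20000)}
lookup = build_prefix_counter(collection, prefix_len=2)
estimate = estimate_total(lookup, prefix_len=2, num_samples=2000, rng=rng)
print(f"true size: {len(collection)}, estimate: {estimate:.0f}")
```

Increasing `num_samples` narrows the spread of the estimate around the true size, which mirrors the abstract's point that variance bounds guide the choice of sample size.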