Generating request streams on Big Data using clustered renewal processes

Authors:
Cristina L. Abad;Mindi Yuan;Chris X. Cai;Yi Lu;Nathan Roberts;Roy H. Campbell
Venue:
Performance Evaluation
Year:
2013

Abstract

The performance evaluation of large file systems, such as those used for storage and media streaming, motivates the scalable generation of representative traces. We focus on two key characteristics of traces: popularity and temporal locality. The common practice of using a system-wide distribution obscures per-object behavior, which is important for system evaluation. We propose a model based on delayed renewal processes which, by sampling interarrival times for each object, accurately reproduces the popularity and temporal locality of the trace. A lightweight version reduces the dimension of the model with statistical clustering. The model is workload-agnostic and object type-aware, making it suitable for testing emerging workloads and 'what-if' scenarios. We implemented a synthetic trace generator and validated it using: (1) a Big Data storage (HDFS) workload from Yahoo!, (2) a trace from a feature animation company, and (3) a streaming media workload. Two case studies in caching and replicated distributed storage systems show that our traces produce application-level results similar to those of the real workload. The trace generator is fast and readily scales to a system of 4.3 million files. It outperforms existing models in accurately reproducing the characteristics of the real trace.
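The per-object sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' generator. It assumes each object has one sampler for the delay until its first access and another for the gaps between later accesses, which is what makes the renewal process "delayed". The exponential distributions, the object names, and the functions `generate_trace`, `exponential`, and `demo_trace` are all hypothetical choices made for illustration.

```python
import heapq
import random
from typing import Callable, Dict, List, Tuple

# A sampler draws one non-negative duration from some distribution.
Sampler = Callable[[random.Random], float]


def generate_trace(
    first_arrival: Dict[str, Sampler],
    interarrival: Dict[str, Sampler],
    duration: float,
    seed: int = 0,
) -> List[Tuple[float, str]]:
    """Merge one delayed renewal process per object into a single trace.

    first_arrival[obj] samples the delay until the first access to obj;
    interarrival[obj] samples the gap between its subsequent accesses.
    Both mappings are assumptions of this sketch, not the paper's API.
    """
    rng = random.Random(seed)
    pending = []
    for obj in sorted(first_arrival):
        t = first_arrival[obj](rng)
        if t <= duration:
            pending.append((t, obj))
    heapq.heapify(pending)

    trace = []
    while pending:
        # Pop the earliest pending access across all objects.
        t, obj = heapq.heappop(pending)
        trace.append((t, obj))
        # Schedule the next access to the same object.
        nxt = t + interarrival[obj](rng)
        if nxt <= duration:
            heapq.heappush(pending, (nxt, obj))
    return trace


def exponential(mean: float) -> Sampler:
    """Exponential sampler with the given mean (an illustrative choice)."""
    return lambda rng: rng.expovariate(1.0 / mean)


def demo_trace() -> List[Tuple[float, str]]:
    # Hypothetical objects: "hot" is accessed often, "cold" rarely.
    first = {"hot": exponential(1.0), "cold": exponential(50.0)}
    gaps = {"hot": exponential(2.0), "cold": exponential(100.0)}
    return generate_trace(first, gaps, duration=1000.0, seed=42)
```

In this sketch an object's popularity follows from how short its interarrival times are, and its temporal locality from the shape of its interarrival distribution. Several objects can share the same sampler instance, which loosely mirrors how a clustered variant could shrink the model, although the clustering procedure itself is not shown here.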