Modeling throughput sampling size for a cloud-hosted data scheduling and optimization service

  • Authors:
  • Esma Yildirim; Jangyoung Kim; Tevfik Kosar

  • Affiliations:
  • Department of Computer Engineering, Fatih University, Buyukcekmece, Istanbul, Turkey; Department of Computer Science & Engineering, University at Buffalo, Buffalo, NY, USA; Department of Computer Science & Engineering, University at Buffalo, Buffalo, NY, USA

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2013

Abstract

As big-data processing and analysis come to dominate the usage of Cloud systems, the need for Cloud-hosted data scheduling and optimization services increases. One key component of such a service is the ability to estimate available bandwidth and achievable throughput, since all scheduling and optimization decisions are built on top of this information. The biggest challenge in providing these estimates is dynamically deciding what proportion of the actual dataset, when transferred, would yield an accurate estimate of the bandwidth and throughput achieved by transferring the whole dataset. That proportion of the data is called the sampling size (or the probe size). Although small fixed sample sizes worked well for high-latency, low-bandwidth networks in the past, high-bandwidth networks require much larger and more dynamic sample sizes, since an accurate estimate now also depends on how fast the transfer protocol can saturate that fat network link. In this study, we present a model that decides the optimal sampling size based on the data size and the estimated capacity of the network. Our results show that, in the majority of cases, the predicted sampling size closely matches the target best sampling size for a given file transfer.
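To make the idea of a sampling (probe) size concrete, the sketch below illustrates the general approach described in the abstract: pick a probe size from the data size and the estimated network capacity, transfer only that much, and use the measured throughput as the estimate for the full transfer. This is a minimal illustrative sketch, not the paper's model; the heuristic of scaling the probe with the bandwidth-delay product, the ramp-up factor, the fraction bounds, and all function names (`estimate_sampling_size`, `probe_throughput`, `transfer_fn`) are assumptions made for illustration only.

```python
import time

def estimate_sampling_size(file_size_bytes, est_bandwidth_bps, rtt_s,
                           ramp_factor=20, min_fraction=0.01, max_fraction=0.25):
    """Pick a probe (sampling) size for a throughput estimate.

    Heuristic (an assumption, not the paper's model): the probe must be
    large enough for the transfer protocol to ramp up and saturate the
    link, so scale it with the bandwidth-delay product (BDP); then keep
    it within a bounded fraction of the full file size.
    """
    bdp_bytes = (est_bandwidth_bps / 8.0) * rtt_s                   # bandwidth-delay product
    probe = max(ramp_factor * bdp_bytes, min_fraction * file_size_bytes)
    return int(min(probe, max_fraction * file_size_bytes))

def probe_throughput(transfer_fn, probe_size_bytes):
    """Transfer a probe of the chosen size and report achieved throughput.

    `transfer_fn(nbytes)` is a hypothetical callable that moves `nbytes`
    over the protocol being modeled (e.g. a partial file transfer).
    """
    start = time.monotonic()
    transfer_fn(probe_size_bytes)
    elapsed = time.monotonic() - start
    return probe_size_bytes * 8.0 / elapsed                         # achieved throughput, bits/s

# Example: a 10 GiB file over an estimated 10 Gbps link with 50 ms RTT.
if __name__ == "__main__":
    size = 10 * 1024**3
    probe = estimate_sampling_size(size, est_bandwidth_bps=10e9, rtt_s=0.05)
    print(f"probe size: {probe / 1024**2:.1f} MiB "
          f"({100.0 * probe / size:.1f}% of the dataset)")
```

Under these assumed parameters, a faster or higher-latency link raises the bandwidth-delay product and therefore the probe size, which mirrors the abstract's point that high-bandwidth networks need larger, more dynamic sample sizes than the small fixed probes used in the past.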