DataGarage: warehousing massive performance data on commodity servers

Authors: Charles Loboz (Microsoft Corporation); Slawek Smyl (Microsoft Corporation); Suman Nath (Microsoft Research)
Venue: Proceedings of the VLDB Endowment
Year: 2010

Abstract

Contemporary datacenters house tens of thousands of servers. These servers are closely monitored for operating conditions and utilization by collecting their performance data (e.g., CPU utilization). In this paper, we show that existing database and file-system solutions are not suitable for warehousing performance data collected from a large number of servers, because of the scale and complexity of that data. We describe the design and implementation of DataGarage, a performance data warehousing system that we have developed at Microsoft. DataGarage is a hybrid solution that combines the benefits of DBMSs, file systems, and MapReduce systems to address the unique challenges of warehousing performance data. We describe how DataGarage allows efficient storage and analysis of years of historical performance data collected from many tens of thousands of servers, all on commodity servers. We also report DataGarage's performance with a real dataset on a 32-node, 256-core shared-nothing cluster, and our experience of using DataGarage at Microsoft over the past year.