Rhea: automatic filtering for unstructured cloud storage

Authors:
Christos Gkantsidis;Dimitrios Vytiniotis;Orion Hodson;Dushyanth Narayanan;Florin Dinu;Antony Rowstron
Affiliations:
Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK;Microsoft Research, Cambridge, UK
Venue:
NSDI '13: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation
Year:
2013

Abstract

Unstructured storage and data processing using platforms such as MapReduce are increasingly popular for their simplicity, scalability, and flexibility. Using elastic cloud storage and computation makes them even more attractive. However, cloud providers such as Amazon and Windows Azure separate their storage and compute resources even within the same data center. Transferring data from storage to compute thus uses core data center network bandwidth, which is scarce and oversubscribed. As the data is unstructured, the infrastructure cannot automatically apply selection, projection, or other filtering predicates at the storage layer. The problem is even worse if customers want to use compute resources on one provider but data stored with one or more other providers. The bottleneck is then the WAN link, which not only limits performance but also incurs egress bandwidth charges. This paper presents Rhea, a system to automatically generate and run storage-side data filters for unstructured and semi-structured data. It uses static analysis of application code to generate filters that are safe, stateless, side-effect free, best effort, and transparent to both storage and compute layers. Filters never remove data that is used by the computation. Our evaluation shows that Rhea filters achieve a reduction in data transfer of 2x to 20,000x, which reduces job run times by up to 5x and dollar costs for cross-cloud computations by up to 13x.
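The safety property stated in the abstract (filters never remove data that the computation uses) can be illustrated with a small sketch. This is not Rhea's implementation: the filters below are written by hand rather than generated by static analysis, and the names `mapper`, `row_filter`, `column_filter`, and `run_job`, as well as the comma-separated log format, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical job code: emit (url, 1) for each request with status 500."""
    fields = line.split(",")
    if len(fields) < 3:
        return  # malformed record, produces no output
    status, url = fields[1], fields[2]
    if status == "500":
        yield (url, 1)


def row_filter(line: str) -> bool:
    """Hand-written stand-in for a generated selection filter.

    Conservative: it keeps every line for which mapper() could emit output,
    and may also keep lines it cannot cheaply rule out. It is stateless and
    side-effect free because it depends only on the line itself.
    """
    return "500" in line


def column_filter(line: str) -> str:
    """Hand-written stand-in for a projection filter.

    mapper() never reads field 0, so its content is blanked while the
    separator is kept, leaving the positions of the other fields unchanged.
    """
    fields = line.split(",")
    fields[0] = ""
    return ",".join(fields)


def run_job(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Run the mapper over all input lines and collect its output."""
    out: List[Tuple[str, int]] = []
    for line in lines:
        out.extend(mapper(line))
    return out


records = [
    "2013-04-02,200,/index.html",
    "2013-04-02,500,/checkout",
    "2013-04-02,404,/500-things",  # passes the row filter, ignored by mapper
    "bad record",
    "2013-04-03,500,/checkout",
]

# What the storage side would ship to the compute side in this sketch.
shipped = [column_filter(r) for r in records if row_filter(r)]

# The job output is identical with and without filtering.
assert run_job(shipped) == run_job(records)
```

The third record shows the best-effort aspect: the row filter lets through a line that the mapper later ignores, which costs some bandwidth but never changes the result. A filter that erred in the other direction, dropping a line the mapper would have used, would violate the safety guarantee.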