Nectar: automatic management of data and computation in datacenters

Authors:
Pradeep Kumar Gunda;Lenin Ravindranath;Chandramohan A. Thekkath;Yuan Yu;Li Zhuang
Affiliations:
Microsoft Research Silicon Valley;Microsoft Research Silicon Valley and Massachusetts Institute of Technology;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley
Venue:
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Year:
2010

Abstract

Managing data and computation is at the heart of datacenter computing. Manual management of data can lead to data loss, wasteful consumption of storage, and laborious bookkeeping. Poorly managed computation can result in lost opportunities to share common computations across multiple jobs or to compute results incrementally. Nectar is a system designed to address these problems. It automates and unifies the management of data and computation within a datacenter. In Nectar, data and computation are treated interchangeably by associating data with its computation. Derived datasets, which are the results of computations, are uniquely identified by the programs that produce them and, together with their programs, are automatically managed by a datacenter-wide caching service. Any derived dataset can be transparently regenerated by re-executing its program, and any computation can be transparently avoided by using previously cached results. This greatly improves datacenter management and resource utilization: obsolete or infrequently used derived datasets are automatically garbage collected, and shared common computations are computed only once and reused by others. This paper describes the design and implementation of Nectar and reports on our evaluation of the system using analytic studies of logs from several production clusters and an actual deployment on a 240-node cluster.
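
The idea of identifying a derived dataset by the program that produces it can be illustrated with a small sketch. This is not Nectar's implementation; the abstract gives no interfaces, so every name below (`fingerprint`, `DerivedDataCache`, `get_or_compute`, `collect_garbage`) is hypothetical, and the in-memory dictionaries and logical clock are assumptions standing in for a datacenter-wide service. It shows only the three behaviors the abstract names: reuse of a cached result, garbage collection of idle datasets, and regeneration by re-executing the program.

```python
import hashlib
from typing import Callable, Dict, Tuple


def fingerprint(program_source: str, input_ids: Tuple[str, ...]) -> str:
    """Hypothetical cache key: a hash of the program text and its input ids."""
    h = hashlib.sha256()
    h.update(program_source.encode("utf-8"))
    for input_id in input_ids:
        h.update(b"\x00" + input_id.encode("utf-8"))
    return h.hexdigest()


class DerivedDataCache:
    """Toy in-memory stand-in for a caching service (an assumption, not Nectar's API)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}    # key -> derived dataset
        self._last_used: Dict[str, int] = {}     # key -> logical time of last access
        self._clock = 0                          # logical clock, one tick per request
        self.executions = 0                      # how many times a program actually ran

    def get_or_compute(self, key: str, program: Callable[[], object]) -> object:
        """Return the cached dataset for key, running program only on a miss."""
        self._clock += 1
        self._last_used[key] = self._clock
        if key not in self._results:
            self.executions += 1
            self._results[key] = program()
        return self._results[key]

    def collect_garbage(self, max_idle: int) -> int:
        """Evict datasets idle for more than max_idle ticks; return the eviction count."""
        evicted = 0
        for key in list(self._results):
            if self._clock - self._last_used[key] > max_idle:
                del self._results[key]
                evicted += 1
        return evicted


# Usage: two requests for the same computation execute the program once.
cache = DerivedDataCache()
data = [1, 2, 3]
key = fingerprint("sum(x * x for x in data)", ("data-v1",))


def job() -> int:
    return sum(x * x for x in data)


first = cache.get_or_compute(key, job)    # miss: runs the program
second = cache.get_or_compute(key, job)   # hit: reuses the cached result
```

Because the key depends only on the program text and its inputs, an evicted dataset costs nothing but recomputation time: the next request with the same key simply re-executes the program.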