Disco: a computing platform for large-scale data analytics

Authors:
Prashanth Mundkur; Ville Tuulos; Jared Flatow
Affiliations:
Nokia Research Center, Palo Alto, CA, USA; Nokia Research Center, Palo Alto, CA, USA; Nokia Research Center, Palo Alto, CA, USA
Venue:
Proceedings of the 10th ACM SIGPLAN workshop on Erlang
Year:
2011

Abstract

We describe the design and implementation of Disco, a distributed computing platform for MapReduce-style computations on large-scale data. Disco is designed to run on clusters of commodity servers and provides both a fault-tolerant scheduling and execution layer and a distributed, replicated storage layer. Disco is implemented in Erlang and Python: Erlang implements the core (cluster monitoring, job management, task scheduling, and the distributed filesystem), while Python implements the standard Disco library. Disco has been used in production at Nokia for several years to analyze tens of terabytes of data daily on a cluster of over 100 nodes. With a small but highly functional codebase, it provides a free, proven, and effective component of a full-fledged data analytics stack.
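
As a rough illustration of the MapReduce-style computation the abstract refers to, the sketch below counts words using separate map, shuffle, and reduce steps in plain Python. It is not Disco's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical, and everything runs in a single process, so the scheduling, fault tolerance, and replicated storage that Disco provides are deliberately left out.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in one process (assumption: no cluster involved)."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["disco runs map tasks", "disco runs reduce tasks"]
    print(run_job(sample))
```

In a platform of the kind described, the map and reduce functions would presumably be the only parts a user writes, with the grouping and the distribution of work across nodes handled by the framework.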