Generative communication in Linda
ACM Transactions on Programming Languages and Systems (TOPLAS)
Distributed Shared Memory: A Survey of Issues and Algorithms
Computer - Distributed computing systems: separate resources acting as one
Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Safe and efficient sharing of persistent objects in Thor
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Lineage retrieval for scientific data processing: a survey
ACM Computing Surveys (CSUR)
A survey of data provenance in e-science
ACM SIGMOD Record
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
IPython: A System for Interactive Scientific Computing
Computing in Science and Engineering
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
MapReduce for Data Intensive Scientific Analyses
ESCIENCE '08 Proceedings of the 2008 Fourth IEEE International Conference on eScience
Provenance in Databases: Why, How, and Where
Foundations and Trends in Databases
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling
Proceedings of the 5th European conference on Computer systems
A common substrate for cluster computing
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Scripting the cloud with skywriting
HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Large-scale incremental processing using distributed transactions and notifications
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Mesos: a platform for fine-grained resource sharing in the data center
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Disk-locality in datacenter computing considered irrelevant
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Non-deterministic parallelism considered useful
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Cost optimized provisioning of elastic resources for application workflows
Future Generation Computer Systems
Managing data transfers in computer clusters with orchestra
Proceedings of the ACM SIGCOMM 2011 conference
PrIter: a distributed framework for prioritized iterative computations
Proceedings of the 2nd ACM Symposium on Cloud Computing
Making time-stepped applications tick in the cloud
Proceedings of the 2nd ACM Symposium on Cloud Computing
Scaling the mobile millennium system in the cloud
Proceedings of the 2nd ACM Symposium on Cloud Computing
A distributed look-up architecture for text mining applications using MapReduce
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Riding the elephant: managing ensembles with hadoop
Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
A down-to-earth look at the cloud host OS
Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
The datacenter needs an operating system
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
TransMR: data-centric programming beyond data parallelism
HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
Foundations and Trends® in Machine Learning
The HaLoop approach to large-scale iterative data analysis
The VLDB Journal — The International Journal on Very Large Data Bases
iMapReduce: A Distributed Computing Framework for Iterative Computation
Journal of Grid Computing
Distributed GraphLab: a framework for machine learning and data mining in the cloud
Proceedings of the VLDB Endowment
PACMan: coordinated memory caching for parallel jobs
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Adaptive MapReduce using situation-aware mappers
Proceedings of the 15th International Conference on Extending Database Technology
MapIterativeReduce: a framework for reduction-intensive data processing on azure clouds
Proceedings of third international workshop on MapReduce and its Applications Date
Accelerate large-scale iterative computation through asynchronous accumulative updates
Proceedings of the 3rd workshop on Scientific Cloud Computing Date
Adapting scientific computing problems to clouds using MapReduce
Future Generation Computer Systems
Streaming graph partitioning for large distributed graphs
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Programming your network at run-time for big data applications
Proceedings of the first workshop on Hot topics in software defined networks
The seven deadly sins of cloud computing research
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Using R for iterative and incremental processing
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
MixApart: decoupled analytics for shared storage systems
HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
Parallelizing ListNet training using spark
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Spinning fast iterative data flows
Proceedings of the VLDB Endowment
REX: recursive, delta-based data-centric computation
Proceedings of the VLDB Endowment
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
Socio-PLT: principles for programming language adoption
Proceedings of the ACM international symposium on New ideas, new paradigms, and reflections on programming and software
Collaborative energy debugging for mobile devices
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
PowerGraph: distributed graph-parallel computation on natural graphs
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
GraphChi: large-scale graph computation on just a PC
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Large scale data analytics on clouds
Proceedings of the fourth international workshop on Cloud data management
True elasticity in multi-tenant data-intensive compute clusters
Proceedings of the Third ACM Symposium on Cloud Computing
MapReduce-Based data stream processing over large history data
ICSOC'12 Proceedings of the 10th international conference on Service-Oriented Computing
Scalable parallel computing on clouds using Twister4Azure iterative MapReduce
Future Generation Computer Systems
Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling
ACM Transactions on Architecture and Code Optimization (TACO)
A bloat-aware design for big data applications
Proceedings of the 2013 international symposium on memory management
Trinity: a distributed graph engine on a memory cloud
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
HyMR: a hybrid MapReduce workflow system
Proceedings of the 3rd international workshop on Emerging computational methods for the life sciences
GPS: a graph processing system
Proceedings of the 25th International Conference on Scientific and Statistical Database Management
Large-scale computation not at the cost of expressiveness
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Distributed data management using MapReduce
ACM Computing Surveys (CSUR)
Exploiting application dynamism and cloud elasticity for continuous dataflows
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SIDR: structure-aware intelligent data routing in Hadoop
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scalable bootstrapping for python
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Data-Intensive Cloud Computing: Requirements, Expectations, Challenges, and Solutions
Journal of Grid Computing
Consolidated cluster systems for data centers in the cloud age: a survey and analysis
Frontiers of Computer Science: Selected Publications from Chinese Universities
Distributed matrix factorization with mapreduce using a series of broadcast-joins
Proceedings of the 7th ACM conference on Recommender systems
PredictionIO: a distributed machine learning server for practical software development
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Clustering on the cloud: reducing CLARA to MapReduce
Proceedings of the Second Nordic Symposium on Cloud Computing & Internet Technologies
Carat: collaborative energy diagnosis for mobile devices
Proceedings of the 11th ACM Conference on Embedded Networked Sensor Systems
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Memory-efficient groupby-aggregate using compressed buffer trees
Proceedings of the 4th annual Symposium on Cloud Computing
Apache Hadoop YARN: yet another resource negotiator
Proceedings of the 4th annual Symposium on Cloud Computing
High performance clustering of social images in a map-collective programming model
Proceedings of the 4th annual Symposium on Cloud Computing
Recommending just enough memory for analytics
Proceedings of the 4th annual Symposium on Cloud Computing
A protocol for simultaneous use of confidentiality and integrity in large-scale storage systems
Proceedings of the 6th International Conference on Security of Information and Networks
PonIC: using stratosphere to speed up pig analytics
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
MR-runner: a modularized map-reduce job management tool
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Hone: "Scaling down" Hadoop on shared-memory systems
Proceedings of the VLDB Endowment
Simplifying Scalable Graph Processing with a Domain-Specific Language
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Resilient X10: efficient failure-aware programming
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
LASER: a scalable response prediction platform for online advertising
Proceedings of the 7th ACM international conference on Web search and data mining
PREDIcT: towards predicting the runtime of large scale iterative analytics
Proceedings of the VLDB Endowment
How to build a bad research center
Communications of the ACM
MixApart: decoupled analytics for shared storage systems
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.02 |
MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.