Today's one-pass analytics applications tend to be data-intensive and must process high volumes of data efficiently. MapReduce is a popular programming model for processing large datasets on a cluster of machines. However, the traditional MapReduce model is not well suited to one-pass analytics, since it is geared towards batch processing and requires the dataset to be fully loaded into the cluster before analytical queries can run. This paper examines, from a systems standpoint, what architectural design changes are necessary to bring the benefits of the MapReduce model to incremental one-pass analytics. Our empirical and theoretical analyses of Hadoop-based MapReduce systems show that the widely used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental one-pass analytics, despite various optimizations. To address these limitations, we propose a new data analysis platform that employs hash techniques to enable fast in-memory processing, and a new frequent-key-based technique to extend such processing to workloads that require a large key-state space. Evaluation of our Hadoop-based prototype using real-world workloads shows that the new platform significantly improves the progress of map tasks, allows reduce progress to keep up with map progress, reduces internal data spills by up to three orders of magnitude, and enables results to be returned continuously during the job.
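The abstract does not spell out the platform's algorithms, so the following is only an illustrative sketch of the general idea of hash-based incremental aggregation with a bounded key-state table. It is not the paper's implementation. The function names `hash_aggregate` and `finalize`, the `capacity` parameter, and the counter-aging eviction rule (borrowed loosely from Misra-Gries style frequent-item counting) are all assumptions made for this example. Frequently seen keys stay in memory and are updated in place, while partial results for rarely seen keys are pushed to a spill list that stands in for disk.

```python
def hash_aggregate(stream, capacity):
    """Incrementally sum values per key, keeping at most `capacity` keys in memory.

    Hypothetical sketch: returns (in_memory_state, spilled_partials).
    """
    state = {}    # key -> running sum (partial aggregate held in memory)
    weight = {}   # key -> frequency counter used to decide evictions
    spilled = []  # (key, partial_sum) pairs; stand-in for writes to disk

    for key, value in stream:
        if key in state:
            # Resident key: update its aggregate in place, no sorting needed.
            state[key] += value
            weight[key] += 1
        elif len(state) < capacity:
            # Room in the table: admit the new key.
            state[key] = value
            weight[key] = 1
        else:
            # Table full: age every resident key and evict those that reach
            # zero, spilling their partial aggregates.
            for k in list(weight):
                weight[k] -= 1
                if weight[k] == 0:
                    spilled.append((k, state.pop(k)))
                    del weight[k]
            # The incoming record itself is spilled rather than admitted.
            spilled.append((key, value))
    return state, spilled


def finalize(state, spilled):
    """Merge in-memory aggregates with spilled partials into final totals."""
    totals = dict(state)
    for key, partial in spilled:
        totals[key] = totals.get(key, 0) + partial
    return totals


if __name__ == "__main__":
    records = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
    state, spilled = hash_aggregate(records, capacity=2)
    print(state)     # current answers for resident keys, available mid-stream
    print(finalize(state, spilled))
```

In this sketch the in-memory table can be read at any point during the stream, which is the property that lets results be reported before the job finishes. Only the spilled partials require a later merge step.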