Performance tradeoffs in read-optimized databases
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Clustera: an integrated computation and data management system
Proceedings of the VLDB Endowment
Read-optimized databases, in depth
Proceedings of the VLDB Endowment
SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?
Communications of the ACM - Amir Pnueli: Ahead of His Time
MapReduce: a flexible data processing tool
Communications of the ACM - Amir Pnueli: Ahead of His Time
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
Towards automatic optimization of MapReduce programs
Proceedings of the 1st ACM symposium on Cloud computing
Improving MapReduce performance in heterogeneous environments
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Column-oriented storage techniques for MapReduce
Proceedings of the VLDB Endowment
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Providing scalable database services on the cloud
WISE'10 Proceedings of the 11th international conference on Web information systems engineering
CoHadoop: flexible data placement and its exploitation in Hadoop
Proceedings of the VLDB Endowment
Query optimization for massively parallel data processing
Proceedings of the 2nd ACM Symposium on Cloud Computing
ActiveSLA: a profit-oriented admission control framework for database-as-a-service providers
Proceedings of the 2nd ACM Symposium on Cloud Computing
Hadoop acceleration through network levitated merge
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Analytics over large-scale multidimensional data: the big data revolution!
Proceedings of the ACM 14th international workshop on Data Warehousing and OLAP
Building wavelet histograms on large data in MapReduce
Proceedings of the VLDB Endowment
Parallel data processing with MapReduce: a survey
ACM SIGMOD Record
PerfXplain: debugging MapReduce job performance
Proceedings of the VLDB Endowment
Apriori-based frequent itemset mining algorithms on MapReduce
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
Optimizing analytic data flows for multiple execution engines
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Clydesdale: structured data processing on MapReduce
Proceedings of the 15th International Conference on Extending Database Technology
Adaptive MapReduce using situation-aware mappers
Proceedings of the 15th International Conference on Extending Database Technology
MapReduce Workload Modeling with Statistical Approach
Journal of Grid Computing
Efficient processing of k nearest neighbor joins using MapReduce
Proceedings of the VLDB Endowment
Efficient multi-way theta-join processing using MapReduce
Proceedings of the VLDB Endowment
Only aggressive elephants are fast elephants
Proceedings of the VLDB Endowment
PRISM: privacy-preserving search in mapreduce
PETS'12 Proceedings of the 12th international conference on Privacy Enhancing Technologies
Optimization of analytic data flows for next generation business intelligence applications
TPCTC'11 Proceedings of the Third TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization
Efficient big data processing in Hadoop MapReduce
Proceedings of the VLDB Endowment
Sailfish: a framework for large scale data processing
Proceedings of the Third ACM Symposium on Cloud Computing
Cloud MapReduce for Monte Carlo bootstrap applied to Metabolic Flux Analysis
Future Generation Computer Systems
Just-in-time data distribution for analytical query processing
ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process
International Journal of Intelligent Systems
Eagle-eyed elephant: split-oriented indexing in Hadoop
Proceedings of the 16th International Conference on Extending Database Technology
Evaluating MapReduce for profiling application traffic
Proceedings of the first edition workshop on High performance and programmable networking
Issues in big data testing and benchmarking
Proceedings of the Sixth International Workshop on Testing Database Systems
Distributed data management using MapReduce
ACM Computing Surveys (CSUR)
Database research at the National University of Singapore
ACM SIGMOD Record
Data warehousing and OLAP over big data: current challenges and future research directions
Proceedings of the sixteenth international workshop on Data warehousing and OLAP
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Proceedings of the 17th International Database Engineering & Applications Symposium
Gunther: search-based auto-tuning of mapreduce
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Piranha: optimizing short jobs in Hadoop
Proceedings of the VLDB Endowment
Hybrid Analytic Flows-the Case for Optimization
Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance, although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open-source implementation of MapReduce, is slower than two state-of-the-art parallel database systems by a factor of 3.1 to 6.5 across a variety of analytical tasks. MapReduce can achieve better performance by allocating more compute nodes from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost-effective in a pay-as-you-go environment. Users want an economical, elastically scalable data processing system and are therefore interested in whether MapReduce can offer both elastic scalability and efficiency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], making it more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and efficient.