Parallel database systems: the future of high performance database systems
Communications of the ACM
Architecture and Algorithm for Parallel Execution of a Join Operation
Proceedings of the First International Conference on Data Engineering
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Supporting table partitioning by reference in oracle
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Self-organizing tuple reconstruction in column-stores
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Proceedings of the VLDB Endowment
PLANET: massively parallel learning of tree ensembles with MapReduce
Proceedings of the VLDB Endowment
MAD skills: new analysis practices for big data
Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
Pregel: a system for large-scale graph processing
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Efficient parallel set-similarity joins using MapReduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MaPreduce
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HadoopDB in action: building real world applications
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Cheetah: a high performance, custom data warehouse on top of MapReduce
Proceedings of the VLDB Endowment
Large-scale machine learning at twitter
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
The equi-join processing and optimization on ring architecture key/value database
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
TEEPA: a timely-aware elastic parallel architecture
Proceedings of the 16th International Database Engineering & Applications Sysmposium
The unified logging infrastructure for data analytics at Twitter
Proceedings of the VLDB Endowment
Just-in-time data distribution for analytical query processing
ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Near real-time analytics with IBM DB2 analytics accelerator
Proceedings of the 16th International Conference on Extending Database Technology
Split query processing in polybase
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Cloudy: heterogeneous middleware for in time queries processing
Proceedings of the 17th International Database Engineering & Applications Symposium
Hi-index | 0.00 |
Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize perfor mance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework. In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.