HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Authors:
Azza Abouzeid;Kamil Bajda-Pawlikowski;Daniel Abadi;Avi Silberschatz;Alexander Rasin
Affiliations:
Yale University;Yale University;Yale University;Yale University;Brown University
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

The production environment for analytical data management applications is rapidly changing. Many enterprises are shifting away from deploying their analytical databases on high-end proprietary machines, and moving towards cheaper, lower-end, commodity hardware, typically arranged in a shared-nothing MPP architecture, often in a virtualized environment inside public or private "clouds". At the same time, the amount of data that needs to be analyzed is exploding, requiring hundreds to thousands of machines to work in parallel to perform the analysis. There tend to be two schools of thought regarding what technology to use for data analysis in such an environment. Proponents of parallel databases argue that the strong emphasis on performance and efficiency of parallel databases makes them well-suited to perform such analysis. On the other hand, others argue that MapReduce-based systems are better suited due to their superior scalability, fault tolerance, and flexibility to handle unstructured data. In this paper, we explore the feasibility of building a hybrid system that takes the best features from both technologies; the prototype we built approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems.