Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis

Authors:
Daniel J. Abadi
Affiliations:
Yale University
Venue:
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Year:
2010

Citing 3
Cited 2

MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment

RDFPath: path query processing on large RDF graphs with mapreduce

ESWC'11 Proceedings of the 8th international conference on The Semantic Web
A short survey on the state of the art in architectures and platforms for large scale data analysis and knowledge discovery from data

Proceedings of the WICSA/ECSA 2012 Companion Volume

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen companies have launched in the past few years that sell parallel database products to meet this market demand. At the same time, MapReduce-based options, such as the open source Hadoop framework are becoming increasingly popular, and there have been a plethora of research publications in the past two years that demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks. Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I describe some experiences in using these systems, and the advantages and disadvantages of the popular implementations of these systems. I then discuss a hybrid system that we are building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, I discuss our experience in using HadoopDB for both traditional decision support workloads (i.e., TPC-H) and also scientific data management (analyzing the Uniprot protein sequence, function, and annotation data).