Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis

  • Authors: Daniel J. Abadi
  • Affiliations: Yale University
  • Venue: SSDBM'10: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management
  • Year: 2010

Abstract

As the market demand for analyzing data sets of increasing variety and scale continues to explode, the software options for performing this analysis are beginning to proliferate. No fewer than a dozen companies selling parallel database products have launched in the past few years to meet this market demand. At the same time, MapReduce-based options, such as the open source Hadoop framework, are becoming increasingly popular, and there has been a plethora of research publications in the past two years that demonstrate how MapReduce can be used to accelerate and scale various data analysis tasks. Both parallel databases and MapReduce-based options have strengths and weaknesses that a practitioner must be aware of before selecting an analytical data management platform. In this talk, I describe some experiences in using these systems, and the advantages and disadvantages of their popular implementations. I then discuss a hybrid system that we are building at Yale University, called HadoopDB, that attempts to combine the advantages of both types of platforms. Finally, I discuss our experience in using HadoopDB for both traditional decision support workloads (i.e., TPC-H) and scientific data management (analyzing the UniProt protein sequence, function, and annotation data).