Integrating hadoop and parallel DBMs

Authors:
Yu Xu;Pekka Kostamaa;Like Gao
Affiliations:
Teradata, San Diego, CA, USA;Teradata, El Segundo, CA, USA;Teradata, San Diego, CA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 11
Cited 7

Optimizing data aggregation for cluster-based internet services

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Map-reduce-merge: simplified relational data processing on large clusters

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment

Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study

Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
A Hadoop based distributed loading approach to parallel data warehouses

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Tagged mapreduce: efficiently computing multi-analytics using mapreduce

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
Split query processing in polybase

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Hadoop GIS: a high performance spatial data warehousing system over mapreduce

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large scale business analysis in various industries over data sets ranging from a few terabytes to multiple petabytes. However, due to the explosive data volume increase in recent years at some customer sites, some data such as web logs and sensor data are not managed by Teradata EDW (Enterprise Data Warehouse), partially because it is very expensive to load those extreme large volumes of data to a RDBMS, especially when those data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and made popular by the open source Hadoop implementation with major support from Yahoo!, is gaining rapid momentum in both academia and industry as another way of performing large scale data analysis. By now most data warehouse researchers and practitioners agree that both parallel DBMS and MapReduce paradigms have advantages and disadvantages for various business applications and thus both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries have seen increasing needs to perform BI over both data stored in Hadoop and data in Teradata EDW. One common thing between Hadoop and Teradata EDW is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW.