Optimizing data aggregation for cluster-based internet services
Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?
Communications of the ACM - Amir Pnueli: Ahead of His Time
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads
Proceedings of the VLDB Endowment
Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study
Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
A Hadoop based distributed loading approach to parallel data warehouses
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Tagged mapreduce: efficiently computing multi-analytics using mapreduce
DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
Parallel data processing with MapReduce: a survey
ACM SIGMOD Record
Split query processing in polybase
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Hadoop GIS: a high performance spatial data warehousing system over mapreduce
Proceedings of the VLDB Endowment
Teradata's parallel DBMS has been successfully deployed in large data warehouses over the last two decades for large-scale business analysis in various industries, over data sets ranging from a few terabytes to multiple petabytes. However, because of the explosive growth in data volume in recent years at some customer sites, some data, such as web logs and sensor data, are not managed by Teradata EDW (Enterprise Data Warehouse), partly because it is very expensive to load such extremely large volumes of data into an RDBMS, especially when the data are not frequently used to support important business decisions. Recently the MapReduce programming paradigm, started by Google and popularized by the open source Hadoop implementation with major support from Yahoo!, has been gaining rapid momentum in both academia and industry as another way of performing large-scale data analysis. By now most data warehouse researchers and practitioners agree that the parallel DBMS and MapReduce paradigms each have advantages and disadvantages for various business applications, and thus both paradigms are going to coexist for a long time [16]. In fact, a large number of Teradata customers, especially those in the e-business and telecom industries, have seen an increasing need to perform BI over both data stored in Hadoop and data in Teradata EDW. One thing Hadoop and Teradata EDW have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for DBMSs running on a single node. In this paper we describe our three efforts towards tight and efficient integration of Hadoop and Teradata EDW.