Parallel Secondo: Boosting Database Engines with Hadoop

Authors:
Jiamin Lu;Ralf Hartmut Guting
Affiliations:
-;-
Venue:
ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems
Year:
2012

Citing 0
Cited 1

CG_Hadoop: computational geometry in MapReduce

Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Hadoop is an efficient and simple parallel framework following the Map Reduce paradigm, and making the parallel processing recently become a hot issue in data-intensive applications. Since Hadoop can be easily deployed on large-scale clusters including up to thousands of computers, various studies intend to process common relational database operations also on this new platform and expect to achieve a remarkable performance. However, these works have to prepare customized programs according to different input format, making the communication between co-workers difficult. Additionally, all intermediate data have to be transformed to key-value pairs and then transferred through the underlying HDFS, making the data processable by Map and Reduce tasks and keeping a balanced workload on the cluster. During this period, unnecessary overhead decreases both the speed-up and scale-up of these systems. Therefore, this paper attempts to propose a light and efficient coupling structure thus to combine Hadoop with single-computer databases on the engine level. On one hand, it uses a well-designed parallel data model to make end-users represent parallel queries like common queries. All current and future data types and algorithms can be used directly, having no need to be specifically changed for the parallel platform. On the other hand, it provides a simple and independent distributed file system to transfer data among database engines directly, without passing through HDFS, hence to remove as much as possible unnecessary transform and transfer overhead. For purpose of demonstration, a prototype Parallel Secondo is introduced in this paper. It has been fully evaluated in both small and large scale clusters, achieving satisfactory performances for different database operations.