Parallel bulk insertion for large-scale analytics applications

Authors:
Antonio Barbuzzi;Pietro Michiardi;Ernst Biersack;Gennaro Boggia
Affiliations:
Politecnico di Bari;Eurecom;Eurecom;Politecnico di Bari
Venue:
Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Year:
2010


Abstract

Modern data analytics applications, such as Internet-scale indexing, system trace analysis, and recommender engines, operate on massive amounts of data and call for a parallel approach to data processing. In this work, we focus on the popular MapReduce framework for carrying out such tasks and identify bulk data insertion as a critical preliminary step toward reduced processing times, especially when new data is generated and processed at regular intervals. We present a parallel approach to bulk data insertion in a system that uses horizontally range-partitioned data and evaluate several variants of the insertion operation, including legacy approaches. Our method exploits the parallel processing framework itself to insert the data, which is stored in a semi-structured format, into the system. Our results indicate that a parallel approach to bulk insertion can substantially reduce the recurrent cost of inserting new data into the system.
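
The general idea of inserting a batch into horizontally range-partitioned data with map and reduce steps can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`partition_of`, `map_chunk`, `bulk_insert`), the in-memory `table` dictionary, and the fixed `split_points` are all assumptions made for illustration, and a thread pool stands in for a real MapReduce cluster, so it shows the structure of the computation rather than any actual speedup.

```python
from bisect import bisect_right
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_of(key, split_points):
    """Return the index of the range partition that owns `key`.

    Partition i holds keys in [split_points[i-1], split_points[i]);
    partition 0 holds everything below split_points[0].
    """
    return bisect_right(split_points, key)


def map_chunk(chunk, split_points):
    """Map step: group one chunk of (key, value) records by target partition."""
    buckets = defaultdict(list)
    for key, value in chunk:
        buckets[partition_of(key, split_points)].append((key, value))
    return buckets


def bulk_insert(table, chunks, split_points, workers=4):
    """Insert every chunk into `table` (hypothetical layout:
    partition index -> list of records sorted by key)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map: each chunk is partitioned independently of the others.
        mapped = list(pool.map(lambda c: map_chunk(c, split_points), chunks))

        # Shuffle: gather the records destined for the same partition.
        shuffled = defaultdict(list)
        for buckets in mapped:
            for part, records in buckets.items():
                shuffled[part].extend(records)

        # Reduce: each partition merges its new records independently.
        def reduce_partition(item):
            part, records = item
            merged = sorted(table.get(part, []) + records, key=lambda r: r[0])
            return part, merged

        for part, merged in pool.map(reduce_partition, list(shuffled.items())):
            table[part] = merged
    return table


# Usage: two split points define three partitions: (<10), [10, 20), (>=20).
table = {0: [(3, "c")]}
chunks = [[(5, "a"), (15, "b")], [(25, "d"), (1, "e"), (10, "f")]]
bulk_insert(table, chunks, split_points=[10, 20])
```

Because each partition covers a disjoint key range, the reduce steps never touch the same partition, which is what allows them to run side by side in this sketch.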