A parallel spatial data analysis infrastructure for the cloud

Authors:
Suprio Ray;Bogdan Simion;Angela Demke Brown;Ryan Johnson
Affiliations:
University of Toronto;University of Toronto;University of Toronto;University of Toronto
Venue:
Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Year:
2013

Abstract

Spatial data analysis applications are emerging from a wide range of domains such as building information management, environmental assessments and medical imaging. Time-consuming computational geometry algorithms make these applications slow, even for medium-sized datasets. At the same time, there is a rapid expansion in available processing cores, through multicore machines and Cloud computing. The confluence of these trends demands effective parallelization of spatial query processing. Unfortunately, traditional parallel spatial databases are ill-equipped to deal with the performance heterogeneity that is common in the Cloud. We introduce Niharika, a parallel spatial data analysis infrastructure that exploits all available cores in a heterogeneous cluster. Niharika first uses a declustering technique that creates balanced spatial partitions. Then, Niharika adapts to performance heterogeneity and processing skew in the spatial dataset using dynamic load-balancing. We evaluate Niharika with three load-balancing algorithms and two different spatial datasets (both from TIGER) using Amazon EC2 instances. Niharika adapts to the performance heterogeneity in the EC2 nodes, thereby achieving excellent speedups (e.g., 63.6X using 64 cores on 16 4-core EC2 nodes, in the best case) and outperforming an approach that does not adapt.
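
The abstract contrasts an approach that adapts to performance heterogeneity with one that does not. The following toy simulation illustrates that general idea only; it is not Niharika's algorithm, and the abstract does not specify the three load-balancing algorithms evaluated. All names here (static_makespan, dynamic_makespan, the partition costs, and the relative node speeds) are hypothetical values chosen for illustration.

```python
import heapq


def static_makespan(costs, speeds):
    """Fixed round-robin assignment made up front; ignores worker speed."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        worker = i % len(speeds)
        finish[worker] += cost / speeds[worker]
    return max(finish)


def dynamic_makespan(costs, speeds):
    """Pull-based assignment: the worker that becomes idle first
    takes the next pending partition."""
    # Heap of (time at which the worker becomes idle, worker id).
    idle_at = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(idle_at)
    for cost in costs:
        now, worker = heapq.heappop(idle_at)
        heapq.heappush(idle_at, (now + cost / speeds[worker], worker))
    return max(t for t, _ in idle_at)


# Hypothetical workload: 12 equally expensive partitions.
costs = [4.0] * 12
# Hypothetical relative speeds of four nodes (one fast, one slow).
speeds = [1.0, 1.0, 2.0, 0.5]

print(static_makespan(costs, speeds))   # 24.0
print(dynamic_makespan(costs, speeds))  # 12.0
```

In this made-up setting the static plan waits 24 time units for the slowest node, while the pull-based plan finishes in 12 because faster nodes absorb more partitions. Real spatial partitions also differ in cost (processing skew), which this sketch models only through the costs list.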