A MapReduce approach to Gi*(d) spatial statistic

  • Authors:
  • Yan Liu;Kaichao Wu;Shaowen Wang;Yanli Zhao;Qian Huang

  • Affiliations:
  • University of Illinois at Urbana-Champaign, Urbana, Illinois;Chinese Academy of Sciences (CAS), Beijing, China;University of Illinois at Urbana-Champaign, Urbana, Illinois;University of Illinois at Urbana-Champaign, Urbana, Illinois;Peking University, Beijing, China

  • Venue:
  • Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
  • Year:
  • 2010

Abstract

Managing and analyzing massive spatial datasets, as supported by GIS and spatial analysis, is becoming crucial to geospatial problem-solving and decision-making. MapReduce provides a data-centric computational model through which highly scalable spatial analysis computation can be achieved. However, it is challenging to leverage multi-dimensional spatial characteristics on the horizontally partitioned and transparently managed MapReduce data system to improve the computational performance of spatial analysis. This paper tackles this challenge through the development of MapReduce-based computation of Gi*(d) -- a spatial statistic for detecting local clustering. Without exploiting spatial characteristics, Gi*(d) computation for a particular location requires pairwise distance calculations against all points of a given dataset. A spatial locality-based storage and indexing strategy is developed to associate spatial locality with storage locality on the MapReduce platform. Using a spatial indexing method, unnecessary map tasks can be eliminated from a MapReduce job, thus significantly improving the overall computation performance. To leverage underlying parallelism on storage nodes, an application-level load balancing mechanism is developed to produce even loads among map tasks based on adaptive spatial domain decomposition. Experiments show the effectiveness of the developed storage and indexing strategy with different distance parameter settings. A significant reduction in execution time for all-point computation is observed through the use of the application-level load balancing mechanism.
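
To make the cost argument in the abstract concrete, the following is a minimal single-machine sketch, not the paper's MapReduce implementation. It assumes the ratio form of the statistic with binary distance weights, Gi*(d) = sum_j w_ij(d) x_j / sum_j x_j with w_ij(d) = 1 when point j lies within distance d of point i (including j = i); the paper may use a standardized variant. The function names `gi_star` and `cells_within_reach`, and the rectangular cell layout, are hypothetical and chosen only for illustration. The first function scans every point, which is the all-pairs cost described above; the second shows how a bounding-box test could rule out whole storage cells (and hence the map tasks that would read them) before any point is touched.

```python
import math


def gi_star(points, values, i, d):
    """Brute-force Gi*(d) for location i (ratio form, binary weights).

    points: list of (x, y) tuples; values: one attribute value per point.
    Assumption: w_ij(d) = 1 when dist(i, j) <= d (j == i included), else 0.
    """
    xi, yi = points[i]
    total = sum(values)
    if total == 0:
        raise ValueError("sum of attribute values must be non-zero")
    within = 0.0
    # Every point is visited: this is the scan a spatial index tries to avoid.
    for (xj, yj), vj in zip(points, values):
        if math.hypot(xj - xi, yj - yi) <= d:
            within += vj
    return within / total


def cells_within_reach(cells, center, d):
    """Return ids of rectangular cells that can contain points within d.

    cells: dict mapping a cell id to (xmin, ymin, xmax, ymax).
    Hypothetical stand-in for a spatial index lookup: cells that fail the
    test need no map task for this query location.
    """
    cx, cy = center
    hits = []
    for cid, (xmin, ymin, xmax, ymax) in cells.items():
        # Distance from the center to the nearest edge of the box (0 if inside).
        dx = max(xmin - cx, 0.0, cx - xmax)
        dy = max(ymin - cy, 0.0, cy - ymax)
        if math.hypot(dx, dy) <= d:
            hits.append(cid)
    return hits


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    values = [2.0, 3.0, 1.0, 4.0]
    print(gi_star(points, values, 0, 1.5))

    cells = {"a": (0.0, 0.0, 2.0, 2.0), "b": (4.0, 4.0, 6.0, 6.0)}
    print(cells_within_reach(cells, points[0], 1.5))
```

In this toy dataset, three of the four points fall within d = 1.5 of the first point, and only cell "a" passes the bounding-box test, so a job restricted to that cell would read the same points the brute-force scan ends up using.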