Connected components labeling for giga-cell multi-categorical rasters

  • Authors:
  • Pawel Netzel;Tomasz F. Stepinski

  • Affiliations:
  • -;-

  • Venue:
  • Computers & Geosciences
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Labeling of connected components in an image or a raster of non-imagery data is a fundamental operation in fields of pattern recognition and machine intelligence. The bulk of effort devoted to designing efficient connected components labeling (CCL) algorithms concentrated on the domain of binary images where labeling is required for a computer to recognize objects. In contrast, in the Geographical Information Science (GIS) a CCL algorithm is mostly applied to multi-categorical rasters in order to either convert a raster to a shapefile, or for statistical characterization of individual clumps. Recently, it has become necessary to label connected components in very large, giga-cell size, multi-categorical rasters but performance of existing CCL algorithms lacks sufficient speed to accomplish such task. In this paper we present a modification to the popular two-scan CCL algorithm that enables labeling of giga-cell size, multi-categorical rasters. Our approach is to apply a divide-and-conquer technique coupled with parallel processing to a standard two-scan algorithm. For specificity, we have developed a variant of a standard CCL algorithm implemented as r.clump in GRASS GIS. We have established optimal values of data blocks (stemming from the divide-and-conquer technique) and optimal number of computational threads (stemming from parallel processing) for a new algorithm called r.clump3p. The performance of the new algorithm was tested on a series of rasters up to 160Mcells in size; for largest size test raster a speed up over the original algorithm is 74 times. Finally, we have applied the new algorithm to the National Land Cover Dataset 2006 raster with 1.6x10^1^0 cells. Labeling this raster took 39h using two-processors, 16 cores computer and resulted in 221,718,501 clumps. Estimated speed up over the original algorithm is 450 times. The r.clump3p works within the GRASS environment and is available in the public domain.