Cogset: A Unified Engine for Reliable Storage and Parallel Processing

  • Authors:
  • Steffen Viken Valvag;Dag Johansen

  • Venue:
  • NPC '09 Proceedings of the 2009 Sixth IFIP International Conference on Network and Parallel Computing
  • Year:
  • 2009

Abstract

MapReduce has become a popular paradigm for parallel data processing, both for ad-hoc schema-less processing using a simple functional interface, and as a building block for higher-level abstractions. Much subsequent work has layered additional functionality on top of MapReduce or similar infrastructures, building powerful software stacks for distributed applications. In this paper, we present Cogset, the result of re-thinking the original MapReduce architecture that sits at the bottom of the stack. We observe that the traditional loose coupling between the distributed file system and the MapReduce processing engine leads to poor data locality for many applications. Accordingly, Cogset offers both reliable storage and parallel data processing, fusing the two components into a single system that ensures good data locality. We also take a new approach to data shuffling, relying on highly efficient static routing, and devise new mechanisms for fault tolerance, load balancing and ensuring consistency. We evaluate Cogset using a suite of benchmark applications, comparing it to Hadoop with very favorable results. For example, on a 12-node cluster, an inverted index that takes 80 minutes to build using Hadoop can be constructed using Cogset in less than 35 minutes.