Efficient analytics on ordered datasets using MapReduce

  • Authors:
  • Jiangtao Yin; Yong Liao; Mario Baldi; Lixin Gao; Antonio Nucci

  • Affiliations:
  • UMass Amherst, Amherst, MA, USA; Narus Inc., Sunnyvale, CA, USA; Narus Inc., Sunnyvale, CA, USA; UMass Amherst, Amherst, MA, USA; Narus Inc., Sunnyvale, CA, USA

  • Venue:
  • Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
  • Year:
  • 2013

Abstract

Efficiently analyzing data at large scale can be vital for data owners seeking useful business intelligence. Event log files are among the most common datasets used for this purpose. Records in event log files are often sorted by time and need to be grouped by user ID or transaction ID, while preserving their time order, in order to mine user behaviors such as click-through rate. This kind of analytical workload is referred to here as RElative Order-pReserving based Grouping (Re-Org). Using MapReduce/Hadoop, a popular big data analysis tool, as-is to execute Re-Org tasks on ordered datasets is inefficient because of its internal sort-merge mechanism. We propose a framework that adopts an efficient group-order-merge mechanism to execute Re-Org tasks faster, and we implement it by extending Hadoop. Experimental results show a 2.2x speedup over executing Re-Org tasks in plain vanilla Hadoop.
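
To make the Re-Org workload concrete, the following is a minimal Python sketch of what "grouping while preserving relative order" means on a small, already time-sorted log. It is an illustration only and not the authors' Hadoop extension: the abstract does not describe the internals of the group-order-merge mechanism, so the function names `reorg` and `merge_groups`, the record layout `(timestamp, user_id, event)`, and the idea of concatenating per-split groups in split order are all assumptions made for this example.

```python
def reorg(records, key):
    """Group already time-sorted records by key in a single pass.

    Appending to per-key lists keeps each group's records in their
    original relative (time) order, so no sort is needed.
    """
    groups = {}
    for rec in records:
        groups.setdefault(key(rec), []).append(rec)
    return groups


def merge_groups(partials):
    """Combine per-split groupings, taken in split (time) order.

    Hypothetical stand-in for a merge step: because the splits are
    themselves ordered, concatenation preserves time order per key.
    """
    merged = {}
    for part in partials:
        for k, recs in part.items():
            merged.setdefault(k, []).extend(recs)
    return merged


# Toy event log, sorted by timestamp: (timestamp, user_id, event)
log = [
    (1, "u1", "view"),
    (2, "u2", "view"),
    (3, "u1", "click"),
    (4, "u2", "view"),
    (5, "u1", "view"),
]

by_user = lambda rec: rec[1]

# Grouping the whole log at once.
whole = reorg(log, by_user)

# Grouping two consecutive splits separately, then merging in split order.
merged = merge_groups([reorg(log[:3], by_user), reorg(log[3:], by_user)])
```

In this sketch both paths yield the same per-user sequences, and a metric such as click-through rate can then be computed per group without ever sorting the records by key, which is the property the Re-Org workload relies on.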