A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets

  • Authors:
  • Jiangtao Yin;Yong Liao;Mario Baldi;Lixin Gao;Antonio Nucci

  • Venue:
  • UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
  • Year:
  • 2013

Abstract

Event log files are among the datasets that corporations most commonly use to gain business intelligence. The records in event log files are often temporally ordered, and they need to be grouped by user ID with the temporal ordering preserved to facilitate mining user behaviors. This kind of analytical workload, here referred to as Relative Order-preserving based Grouping (RE-ORG), is quite common in big data analytics. Using MapReduce/Hadoop to execute RE-ORG tasks on ordered datasets is inefficient because of its internal sort-merge mechanism. In this paper, we propose a distributed framework that adopts an efficient group-order-merge mechanism to provide faster execution of RE-ORG tasks. We demonstrate the advantage of our framework by comparing its performance with Hadoop through extensive experiments on real-world datasets. The evaluation results show that our framework achieves up to a 6.3x speedup over Hadoop in executing RE-ORG tasks.
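To make the RE-ORG workload concrete, the following is a minimal Python sketch of grouping an already time-ordered log by user ID while keeping each user's records in their original relative order. It is not the framework described in the abstract, whose internals are not given here. The record layout `(timestamp, user_id, event)` and the function names `group_preserving_order` and `merge_partitions` are hypothetical, and the sketch assumes the partitions are consecutive chunks of the log, handed over in temporal order.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (timestamp, user_id, event).
Record = Tuple[int, str, str]


def group_preserving_order(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group one time-ordered partition by user ID in a single pass.

    Records are appended in arrival order, so each user's list keeps the
    input's relative order and no sort is needed.
    """
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[1]].append(record)
    return dict(groups)


def merge_partitions(
    partitions: Iterable[Dict[str, List[Record]]]
) -> Dict[str, List[Record]]:
    """Concatenate per-user groups from partitions given in temporal order.

    Assumption: partition k holds only records that precede those of
    partition k + 1, so plain concatenation preserves the ordering.
    """
    merged: Dict[str, List[Record]] = defaultdict(list)
    for partition in partitions:
        for user_id, user_records in partition.items():
            merged[user_id].extend(user_records)
    return dict(merged)


if __name__ == "__main__":
    log = [
        (1, "u1", "login"),
        (2, "u2", "login"),
        (3, "u1", "click"),
        (4, "u2", "logout"),
        (5, "u1", "logout"),
    ]
    # Split the log into two consecutive chunks, group each, then merge.
    parts = [group_preserving_order(log[:3]), group_preserving_order(log[3:])]
    for user_id, user_records in merge_partitions(parts).items():
        print(user_id, user_records)
```

Because the input is already ordered, grouping takes a single linear pass per partition. A general-purpose sort-merge step would instead re-sort records whose relative order is already correct, which is the inefficiency the abstract attributes to MapReduce/Hadoop.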