Efficient analytics on ordered datasets using MapReduce

Authors:
Jiangtao Yin;Yong Liao;Mario Baldi;Lixin Gao;Antonio Nucci
Affiliations:
UMass Amherst, Amherst, MA, USA;Narus Inc., Sunnyvale, CA, USA;Narus Inc., Sunnyvale, CA, USA;UMass Amherst, Amherst, MA, USA;Narus Inc., Sunnyvale, CA, USA
Venue:
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Year:
2013

Citing 6
Cited 1

MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Hadoop: The Definitive Guide

Hadoop: The Definitive Guide
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
PrIter: a distributed framework for prioritized iterative computations

Proceedings of the 2nd ACM Symposium on Cloud Computing
iMapReduce: A Distributed Computing Framework for Iterative Computation

IPDPSW '11 Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum
Accelerating Expectation-Maximization Algorithms with Frequent Updates

CLUSTER '12 Proceedings of the 2012 IEEE International Conference on Cluster Computing

A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets

UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Efficiently analyzing data on a large scale can be vital for data owners to gain useful business intelligence. One of the most common datasets used to gain business intelligence is event log files. Oftentimes, records in event log files that are time sorted, need to be grouped by user ID or transaction ID in order to mine user behaviors, such as click through rate, while preserving the time order. This kind of analytical workload is here referred to as RElative Order-pReserving based Grouping (Re-Org). Using MapReduce/Hadoop, a popular big data analysis tool, in an as-is manner for executing Re-Org tasks on ordered datasets is not efficient due to its internal sort-merge mechanism. We propose a framework that adopts an efficient group-order-merge mechanism to provide faster execution of Re-Org tasks and implement it by extending Hadoop. Experimental results show a 2.2x speedup over executing Re-Org tasks in plain vanilla Hadoop.