A Scalable Distributed Framework for Efficient Analytics on Ordered Datasets

Authors:
Jiangtao Yin;Yong Liao;Mario Baldi;Lixin Gao;Antonio Nucci
Venue:
UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
Year:
2013

Abstract

Event log files are among the datasets that corporations most commonly use to gain business intelligence. The records in event log files are often temporally ordered and need to be grouped by user ID, with the temporal ordering preserved, to facilitate mining user behaviors. This kind of analytical workload, referred to here as Relative Order-preserving based Grouping (RE-ORG), is quite common in big data analytics. Executing RE-ORG tasks on ordered datasets with MapReduce/Hadoop is inefficient because of its internal sort-merge mechanism, which re-sorts records that are already in order. In this paper, we propose a distributed framework that adopts an efficient group-order-merge mechanism to execute RE-ORG tasks faster. We demonstrate the advantage of our framework by comparing its performance with Hadoop's in extensive experiments on real-world datasets. The evaluation results show that our framework achieves up to a 6.3x speedup over Hadoop when executing RE-ORG tasks.
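To make the RE-ORG workload concrete, the following is a minimal single-machine sketch in Python. It is not the framework's implementation, which the abstract does not detail. It only illustrates the general idea that records already ordered by time can be grouped by user ID in one pass without a sort, and that per-user groups from several ordered partitions can then be combined with an order-preserving merge. The record layout `(timestamp, user_id, event)` and the function names `group_preserving_order`, `merge_groups`, and `reorg` are assumptions made for illustration.

```python
from collections import defaultdict
from heapq import merge


def group_preserving_order(records):
    """Group one temporally ordered partition by user ID in a single pass.

    `records` is an iterable of (timestamp, user_id, event) tuples that is
    assumed to be already ordered by timestamp. Appending to per-user lists
    keeps that relative order, so no sort is needed.
    """
    groups = defaultdict(list)
    for timestamp, user_id, event in records:
        groups[user_id].append((timestamp, event))
    return groups


def merge_groups(partition_groups):
    """Merge per-user groups from several partitions, keeping time order.

    Each per-user list is already ordered, so a k-way merge on the
    timestamp is enough; the lists are never re-sorted from scratch.
    """
    users = set()
    for groups in partition_groups:
        users.update(groups)
    return {
        user: list(merge(*(g.get(user, []) for g in partition_groups),
                         key=lambda rec: rec[0]))
        for user in users
    }


def reorg(partitions):
    """Hypothetical end-to-end helper: group each partition, then merge."""
    return merge_groups([group_preserving_order(p) for p in partitions])


if __name__ == "__main__":
    part1 = [(1, "u1", "login"), (2, "u2", "login"), (5, "u1", "click")]
    part2 = [(3, "u1", "search"), (4, "u2", "logout")]
    for user, events in sorted(reorg([part1, part2]).items()):
        print(user, events)
```

By contrast, a sort-merge pipeline would sort all records by key even though the input is already ordered, which is the overhead the abstract attributes to MapReduce/Hadoop for this workload.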