Big data analytics is the process of examining large amounts of data (big data) in an effort to uncover hidden patterns or unknown correlations. Big Data Analytics Applications (BDA Apps) are a new type of software application that analyzes big data using massively parallel processing frameworks (e.g., Hadoop). Developers typically build such applications using a small sample of data in a pseudo-cloud environment. Afterwards, they deploy the applications to a large-scale cloud environment with considerably more processing power and larger input data (a workflow reminiscent of the mainframe days). Working with BDA App developers in industry over the past three years, we noticed that the runtime analysis and debugging of such applications in the deployment phase cannot be easily addressed by traditional monitoring and debugging approaches. In this paper, as a first step in assisting developers of BDA Apps with cloud deployments, we propose a lightweight approach for uncovering differences between pseudo and large-scale cloud deployments. Our approach makes use of the readily available yet rarely used execution logs from these platforms: it abstracts the execution logs, recovers the execution sequences, and compares the sequences between the pseudo and cloud deployments. Through a case study on three representative Hadoop-based BDA Apps, we show that our approach can rapidly direct the attention of BDA App developers to the major differences between the two deployments. Knowledge of such differences is essential in verifying BDA Apps when analyzing big data in the cloud. Using injected deployment faults, we show that our approach not only significantly reduces the deployment verification effort, but also produces very few false positives when identifying deployment failures.
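To make the three steps of the approach concrete (log abstraction, sequence recovery, sequence comparison), the following Python sketch shows one plausible realization. The function names, the masking regular expressions, and the Hadoop-style task ID pattern are illustrative assumptions rather than the paper's actual implementation.

```python
import re
from collections import Counter

# Hypothetical sketch: abstract raw log lines into execution events,
# group them into per-task execution sequences, and diff the sequences
# observed in the pseudo-cloud deployment against the cloud deployment.

def abstract_line(line: str) -> str:
    """Abstract a raw log line into an execution event by masking dynamic values."""
    line = re.sub(r"attempt_\S+|task_\S+|job_\S+", "<ID>", line)   # Hadoop-style IDs (assumed pattern)
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)            # memory addresses
    line = re.sub(r"\b\d+\b", "<NUM>", line)                       # numeric values
    return line.strip()

def recover_sequences(log_lines, id_pattern=r"(task_\S+|attempt_\S+)"):
    """Group abstracted events into execution sequences keyed by task/attempt ID."""
    sequences = {}
    for line in log_lines:
        match = re.search(id_pattern, line)
        if not match:
            continue
        sequences.setdefault(match.group(1), []).append(abstract_line(line))
    return sequences

def compare_deployments(pseudo_logs, cloud_logs):
    """Report abstracted event sequences that appear in one deployment but not the other."""
    pseudo_seqs = Counter(tuple(s) for s in recover_sequences(pseudo_logs).values())
    cloud_seqs = Counter(tuple(s) for s in recover_sequences(cloud_logs).values())
    only_in_pseudo = set(pseudo_seqs) - set(cloud_seqs)
    only_in_cloud = set(cloud_seqs) - set(pseudo_seqs)
    return only_in_pseudo, only_in_cloud
```

Under these assumptions, sequences that occur only in the cloud deployment are candidate deployment differences for a developer to inspect; masking dynamic values first is what lets sequences from runs with different data sizes and task counts be compared at all.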