Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report

Authors:
Weiyi Shang; Bram Adams; Ahmed E. Hassan
Affiliations:
Software Analysis and Intelligence Lab (SAIL), Queen's University, Kingston, Canada K7L 3N6 (all authors)
Venue:
Journal of Systems and Software
Year:
2012

Abstract

The Mining Software Repositories (MSR) field analyzes software repository data to uncover knowledge and assist the development of ever-growing, complex systems. However, existing approaches and platforms for MSR analysis face many challenges in large-scale MSR studies. They rarely scale easily out of the box. Instead, they often require custom scaling tricks and designs that are costly to maintain and cannot be reused for other types of analysis. We believe that the web community has faced many of these scaling challenges before, as web analyses have to cope with the enormous growth of web data. In this paper, we report on our experience in using a web-scale platform (i.e., Pig) as a data preparation language to aid large-scale MSR studies. Through three case studies, we carefully validate the use of this web platform to prepare data (i.e., Extract, Transform, and Load, or ETL) for further analysis. Despite several limitations, we still encourage MSR researchers to leverage Pig in their large-scale studies because of its scalability and flexibility. Our experience report will help other researchers who want to scale their analyses.