The Mining Software Repositories (MSR) field analyzes software repository data to uncover knowledge and assist the development of ever-growing, complex systems. However, existing approaches and platforms for MSR analysis face many challenges when applied to large-scale MSR studies. They rarely scale out of the box. Instead, they often require custom scaling tricks and designs that are costly to maintain and cannot be reused for other types of analysis. We believe that the web community has faced many of these scaling challenges before, as web analyses have to cope with the enormous growth of web data. In this paper, we report on our experience in using a web-scale platform (i.e., Pig) as a data preparation language to aid large-scale MSR studies. Through three case studies, we validate the use of this web platform to prepare data (i.e., Extract, Transform, and Load, or ETL) for further analysis. Despite several limitations, we encourage MSR researchers to leverage Pig in their large-scale studies because of its scalability and flexibility. Our experience report will help other researchers who want to scale their analyses.