Boa: a language and infrastructure for analyzing ultra-large-scale software repositories

Authors:
Robert Dyer;Hoan Anh Nguyen;Hridesh Rajan;Tien N. Nguyen
Affiliations:
Iowa State University, USA;Iowa State University, USA;Iowa State University, USA;Iowa State University, USA
Venue:
Proceedings of the 2013 International Conference on Software Engineering
Year:
2013

Abstract

In today's software-centric world, ultra-large-scale software repositories such as SourceForge (350,000+ projects), GitHub (250,000+ projects), and Google Code (250,000+ projects) are the new Library of Alexandria. They contain an enormous corpus of software and information about software. Scientists and engineers alike are interested in analyzing this wealth of information, both out of curiosity and to test important hypotheses. However, systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard, and best left to mining software repositories (MSR) experts! The goal of Boa, a domain-specific language and infrastructure described here, is to ease the testing of MSR-related hypotheses. We have implemented Boa and provide a web-based interface to its infrastructure. Our evaluation demonstrates that Boa substantially reduces programming effort, thus lowering the barrier to entry. We also see drastic improvements in scalability. Last but not least, reproducing an experiment conducted using Boa is just a matter of re-running the small Boa programs provided by previous researchers.