Declarative visitors to ease fine-grained source code mining with full history on billions of AST nodes

Authors:
Robert Dyer;Hridesh Rajan;Tien N. Nguyen
Affiliations:
Iowa State University, Ames, IA, USA;Iowa State University, Ames, IA, USA;Iowa State University, Ames, IA, USA
Venue:
Proceedings of the 12th international conference on Generative programming: concepts & experiences
Year:
2013

Abstract

Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for tasks such as detecting patterns in software development and testing hypotheses for new software engineering approaches. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse-grained, or sacrifice studying the history of the code due to both human and computational scalability issues. In this paper, we address the substantial challenges of mining source code: a) at a very large scale; b) at a fine-grained level of detail; and c) with full history information. To address these challenges, we present domain-specific language features for source code mining. Our language features are inspired by object-oriented visitors and provide a default depth-first traversal strategy along with two expressions for defining custom traversals. We provide an implementation of these features in the Boa infrastructure for software repository mining and describe a code generation strategy into Java code. To show the usability of our domain-specific language features, we reproduced over 40 source code mining tasks from two large-scale previous studies in just 2 person-weeks. The resulting code for these tasks shows a 2.0x to 4.8x reduction in code size. Finally, we perform a small controlled experiment to gain insights into how easily mining tasks written using our language features can be understood with no prior training. We show that a substantial share of tasks (77%) were understood by study participants, in about 3 minutes per task.
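
The idea of a visitor with a default depth-first traversal that a mining task can override may be easier to picture with a small sketch. The Java below is an illustration only, not the Boa language or its generated code: the `Node` and `Visitor` types, the `before` hook, and the helper method names are all hypothetical. It assumes that a task can suppress the default descent by returning `false` from `before`, and can traverse children manually by calling `visit` itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisitorSketch {
    /** Hypothetical AST node: a kind tag plus ordered children. */
    static final class Node {
        final String kind;
        final List<Node> children = new ArrayList<>();
        Node(String kind, Node... kids) {
            this.kind = kind;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    /** Hypothetical base visitor: depth-first, pre-order by default. */
    static abstract class Visitor {
        /** Return true to let the default traversal descend, false to suppress it. */
        abstract boolean before(Node n);

        final void visit(Node n) {
            if (before(n)) {
                for (Node c : n.children) visit(c);
            }
        }
    }

    /** Relies entirely on the default traversal. */
    static int countAllMethods(Node root) {
        final int[] count = {0};
        new Visitor() {
            @Override boolean before(Node n) {
                if (n.kind.equals("Method")) count[0]++;
                return true;
            }
        }.visit(root);
        return count[0];
    }

    /** Stops the default traversal at each method, so nested methods are skipped. */
    static int countOuterMethods(Node root) {
        final int[] count = {0};
        new Visitor() {
            @Override boolean before(Node n) {
                if (n.kind.equals("Method")) { count[0]++; return false; }
                return true;
            }
        }.visit(root);
        return count[0];
    }

    /** Custom traversal: visits children manually, last child first. */
    static List<String> kindsReversed(Node root) {
        final List<String> kinds = new ArrayList<>();
        new Visitor() {
            @Override boolean before(Node n) {
                kinds.add(n.kind);
                for (int i = n.children.size() - 1; i >= 0; i--) visit(n.children.get(i));
                return false; // children already handled above
            }
        }.visit(root);
        return kinds;
    }

    /** A class with a field, a method containing a local class, and a second method. */
    static Node sampleTree() {
        Node localClass = new Node("Class", new Node("Method"));
        return new Node("Class", new Node("Field"), new Node("Method", localClass), new Node("Method"));
    }

    public static void main(String[] args) {
        Node tree = sampleTree();
        System.out.println(countAllMethods(tree));
        System.out.println(countOuterMethods(tree));
        System.out.println(kindsReversed(tree));
    }
}
```

In this sketch the task author writes only the per-node logic, and the traversal order stays implicit unless the task explicitly takes it over. That separation is the kind of code-size saving the abstract attributes to declarative visitors, though the actual language features and their syntax are defined in the paper itself.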