Programming by voice, VocalProgramming. Assets '00: Proceedings of the Fourth International ACM Conference on Assistive Technologies.
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Metrics and Models in Software Quality Engineering.
An empirical study of smoothing techniques for language modeling. ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Jungloid mining: helping to navigate the API jungle. Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation.
Semantic clustering: identifying topics in source code. Information and Software Technology.
Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery.
Learning from 6,000 projects: lightweight cross-project anomaly detection. Proceedings of the 19th International Symposium on Software Testing and Analysis.
A study of the uniqueness of source code. Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering.
Efficient Malicious Code Detection Using N-Gram Analysis and SVM. NBIS '11: Proceedings of the 2011 14th International Conference on Network-Based Information Systems.
On the naturalness of software. Proceedings of the 34th International Conference on Software Engineering.
Why, when, and what: analyzing Stack Overflow questions by topic, type, and code. Proceedings of the 10th Working Conference on Mining Software Repositories.
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, trained on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model performs significantly better at the code suggestion task than previous models. More broadly, our approach provides a new "lens" for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora; we call these data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module within a software project. In particular, it is possible to distinguish reusable utility classes from classes that form a program's core logic based solely on general information-theoretic criteria.
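To make the idea concrete, here is a minimal sketch of the kind of model the abstract describes: a trigram language model over code tokens whose average per-token cross-entropy serves as a data-driven complexity score (lower means the code is more predictable, i.e. more "natural"). This is only an illustration under simplifying assumptions; it uses basic additive smoothing rather than the stronger smoothing techniques surveyed in the Chen & Goodman reference above, and all function names are hypothetical, not from the paper.

```python
import math
from collections import defaultdict

def train_trigram(corpus_tokens):
    """Count trigrams and their bigram contexts over a token stream."""
    tri = defaultdict(int)   # (context, token) -> count
    bi = defaultdict(int)    # context -> count
    padded = ["<s>", "<s>"] + corpus_tokens
    for i in range(len(corpus_tokens)):
        ctx = tuple(padded[i:i + 2])
        tok = padded[i + 2]
        tri[(ctx, tok)] += 1
        bi[ctx] += 1
    return tri, bi

def cross_entropy(model, vocab_size, tokens):
    """Average negative log2-probability per token: a data-driven
    complexity score. Lower = more predictable ('natural') code."""
    tri, bi = model
    padded = ["<s>", "<s>"] + tokens
    total = 0.0
    for i in range(len(tokens)):
        ctx = tuple(padded[i:i + 2])
        tok = padded[i + 2]
        # Additive (Laplace) smoothing; a real system would use
        # e.g. Kneser-Ney or another Chen & Goodman-style method.
        p = (tri.get((ctx, tok), 0) + 1) / (bi.get(ctx, 0) + vocab_size)
        total -= math.log2(p)
    return total / len(tokens)

# Toy usage: code resembling the training data scores lower
# (less surprising) than unseen code.
corpus = "for ( int i = 0 ; i < n ; i ++ )".split()
model = train_trigram(corpus)
ce_seen = cross_entropy(model, 50, corpus)
ce_novel = cross_entropy(model, 50, "while ( true ) { }".split())
```

At giga-token scale, the same cross-entropy measure can be computed per class or module, which is the basis for the complexity and topical-centrality metrics the abstract proposes.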