Programming by voice, VocalProgramming. Assets '00: Proceedings of the Fourth International ACM Conference on Assistive Technologies.
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Metrics and Models in Software Quality Engineering.
An empirical study of smoothing techniques for language modeling. ACL '96: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Jungloid mining: helping to navigate the API jungle. Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation.
Semantic clustering: identifying topics in source code. Information and Software Technology.
Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery.
Learning from 6,000 projects: lightweight cross-project anomaly detection. Proceedings of the 19th International Symposium on Software Testing and Analysis.
A study of the uniqueness of source code. Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering.
Efficient Malicious Code Detection Using N-Gram Analysis and SVM. NBIS '11: Proceedings of the 2011 14th International Conference on Network-Based Information Systems.
On the naturalness of software. Proceedings of the 34th International Conference on Software Engineering.
Why, when, and what: analyzing Stack Overflow questions by topic, type, and code. Proceedings of the 10th Working Conference on Mining Software Repositories.
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, trained on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model performs significantly better at the code suggestion task than previous models. More broadly, our approach provides a new "lens" for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora; we call these data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module within a software project. In particular, it is possible to distinguish reusable utility classes from classes that form a program's core logic based solely on general information-theoretic criteria.
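To make the idea concrete, here is a minimal sketch of the kind of model the abstract describes: a trigram language model over code tokens whose average per-token cross-entropy serves as a data-driven complexity score (lower means the code is more predictable, i.e. more "natural"). This is only an illustration under simplifying assumptions; it uses basic additive smoothing rather than the stronger smoothing techniques surveyed in the Chen & Goodman reference above, and all function names are hypothetical, not from the paper.

```python
import math
from collections import defaultdict

def train_trigram(corpus_tokens):
    """Count trigrams and their bigram contexts over a token stream."""
    tri = defaultdict(int)   # (context, token) -> count
    bi = defaultdict(int)    # context -> count
    padded = ["<s>", "<s>"] + corpus_tokens
    for i in range(len(corpus_tokens)):
        ctx = tuple(padded[i:i + 2])
        tok = padded[i + 2]
        tri[(ctx, tok)] += 1
        bi[ctx] += 1
    return tri, bi

def cross_entropy(model, vocab_size, tokens):
    """Average negative log2-probability per token: a data-driven
    complexity score. Lower = more predictable ('natural') code."""
    tri, bi = model
    padded = ["<s>", "<s>"] + tokens
    total = 0.0
    for i in range(len(tokens)):
        ctx = tuple(padded[i:i + 2])
        tok = padded[i + 2]
        # Additive (Laplace) smoothing; a real system would use
        # e.g. Kneser-Ney or another Chen & Goodman-style method.
        p = (tri.get((ctx, tok), 0) + 1) / (bi.get(ctx, 0) + vocab_size)
        total -= math.log2(p)
    return total / len(tokens)

# Toy usage: code resembling the training data scores lower
# (less surprising) than unseen code.
corpus = "for ( int i = 0 ; i < n ; i ++ )".split()
model = train_trigram(corpus)
ce_seen = cross_entropy(model, 50, corpus)
ce_novel = cross_entropy(model, 50, "while ( true ) { }".split())
```

At giga-token scale, the same cross-entropy measure can be computed per class or module, which is the basis for the complexity and topical-centrality metrics the abstract proposes.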