On the naturalness of software

Authors:
Abram Hindle;Earl T. Barr;Zhendong Su;Mark Gabel;Premkumar Devanbu
Affiliations:
UC Davis, USA;UC Davis, USA;UC Davis, USA;University of Texas at Dallas, USA;UC Davis, USA
Venue:
Proceedings of the 34th International Conference on Software Engineering
Year:
2012

Abstract

Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension. We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations, and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supporting a positive answer to both questions. We show that code is also very repetitive, and in fact even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
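
To make the n-gram idea in the abstract concrete, the following is a minimal sketch of a token-level n-gram model that scores a token sequence by cross-entropy (lower means more predictable) and proposes likely next tokens. It is not the authors' implementation: the class name `NGramModel`, the add-one smoothing, the `<s>` padding token, and the toy Java-like token sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class NGramModel:
    """Toy n-gram language model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.context_counts = Counter()           # how often each context was seen
        self.next_counts = defaultdict(Counter)   # context -> counts of following tokens
        self.vocab = set()

    def _pad(self, tokens):
        # Prepend start markers so the first tokens also have a full context.
        return ["<s>"] * (self.n - 1) + list(tokens)

    def train(self, tokens):
        padded = self._pad(tokens)
        self.vocab.update(tokens)
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            self.context_counts[context] += 1
            self.next_counts[context][padded[i]] += 1

    def prob(self, context, token):
        # Add-one smoothing; one extra vocabulary slot stands in for unseen tokens.
        context = tuple(context)
        v = len(self.vocab) + 1
        return (self.next_counts[context][token] + 1) / (self.context_counts[context] + v)

    def cross_entropy(self, tokens):
        # Average negative log2 probability per token, in bits.
        padded = self._pad(tokens)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            total -= math.log2(self.prob(context, padded[i]))
        return total / len(tokens)

    def suggest(self, context, k=3):
        # Rank candidate next tokens by how often they followed this context.
        padded = self._pad(context)
        key = tuple(padded[len(padded) - (self.n - 1):])
        return [tok for tok, _ in self.next_counts[key].most_common(k)]


# Hypothetical training data: a pre-tokenized Java-like loop header.
model = NGramModel(n=3)
model.train("for ( int i = 0 ; i < n ; i ++ ) {".split())

familiar = "for ( int i = 0 ;".split()
unfamiliar = "while ( x != null ) {".split()
print(model.cross_entropy(familiar), model.cross_entropy(unfamiliar))
print(model.suggest(["for", "("]))
```

In this sketch the sequence that repeats the training data receives a lower cross-entropy than the unfamiliar one, which is the sense of "repetitive and predictable" used above, and `suggest` shows how the same counts could drive a simple completion list. A practical model would be trained on a large corpus and would likely use a stronger smoothing scheme than add-one.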