A statistical semantic language model for source code

Authors:
Tung Thanh Nguyen;Anh Tuan Nguyen;Hoan Anh Nguyen;Tien N. Nguyen
Affiliations:
Iowa State University, USA;Iowa State University, USA;Iowa State University, USA;Iowa State University, USA
Venue:
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering
Year:
2013



Abstract

Recent research has applied the statistical n-gram language model to source code and shown that code exhibits a good level of repetition. The n-gram model has also been shown to have good predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach captures source code regularities and patterns using only the lexical information in a local context of the code units. To improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It annotates code tokens with semantic information and models the regularities and patterns of these semantic annotations, called sememes, rather than those of their lexemes. It combines the local context captured in semantic n-grams with the global technical concerns and functionality in an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC, we developed a new code suggestion method. In an empirical evaluation on several projects, it achieves 18-68% higher relative accuracy than the state-of-the-art approach.
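
To make the idea of an n-gram model over sememes concrete, the following is a minimal sketch, not the SLAMC implementation. It covers only the local n-gram part and leaves out the topic model and the pairwise associations mentioned in the abstract. The sememe notation (`VAR[Scanner]`, `CALL[Scanner.next]`), the function names `train_ngram` and `suggest`, and the toy corpus are all assumptions made for illustration. The point it tries to show is that once variable names are replaced by semantic annotations, fragments that differ lexically contribute to the same counts.

```python
from collections import Counter, defaultdict

def train_ngram(sequences, n=3):
    """Count how often each token follows each (n-1)-token context (assumes n >= 2)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pad the start so the first tokens also have a full-length context.
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            counts[context][padded[i]] += 1
    return counts

def suggest(counts, context, n=3, k=3):
    """Return up to k next-token candidates with maximum-likelihood probabilities."""
    padded = ["<s>"] * (n - 1) + list(context)
    key = tuple(padded[-(n - 1):])
    followers = counts.get(key, Counter())
    total = sum(followers.values())
    return [(tok, c / total) for tok, c in followers.most_common(k)]

# Hypothetical sememe sequences: variables named `in`, `sc`, or `reader` in the
# original code would all be annotated as VAR[Scanner], so their usages are
# counted together instead of as three unrelated lexeme sequences.
corpus = [
    ["WHILE", "VAR[Scanner]", "CALL[Scanner.hasNext]", "VAR[Scanner]", "CALL[Scanner.next]"],
    ["WHILE", "VAR[Scanner]", "CALL[Scanner.hasNext]", "VAR[Scanner]", "CALL[Scanner.nextLine]"],
    ["WHILE", "VAR[Scanner]", "CALL[Scanner.hasNext]", "VAR[Scanner]", "CALL[Scanner.next]"],
]

model = train_ngram(corpus, n=3)
suggestions = suggest(model, ["CALL[Scanner.hasNext]", "VAR[Scanner]"], n=3)
for token, prob in suggestions:
    print(f"{token}: {prob:.2f}")
```

In this toy corpus the context `CALL[Scanner.hasNext] VAR[Scanner]` is followed by `CALL[Scanner.next]` twice and by `CALL[Scanner.nextLine]` once, so the sketch ranks `CALL[Scanner.next]` first. A realistic model would also need smoothing for unseen contexts, which is omitted here.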