We conduct a large-scale analysis of Java source code vocabulary for 12,151 open source projects from SourceForge and Apache, a corpus substantially larger than those considered previously. Simple statistical analysis demonstrates robust power-law behavior in the word count distributions of multiple program entities. We then identify salient vocabulary trends for classes, interfaces, methods, and fields. Our results provide low-level insight into the vocabulary space governing Java software development, with direct application to program comprehension and software search. Supplementary material may be found at: http://sourcerer.ics.uci.edu/suite2009/suite.html.
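As a rough illustration of what a word count analysis over identifiers could look like, the following minimal sketch splits Java-style identifiers into words, tallies them, and estimates the slope of the rank-frequency curve on a log-log scale. Everything here is an assumption made for illustration: the function names (`split_identifier`, `word_counts`, `rank_frequency_slope`), the camelCase and underscore splitting rule, and the ordinary least-squares fit are not taken from the paper, which does not describe its tokenization or fitting procedure in this abstract.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split an identifier into lowercase words.

    Assumed rule (not from the paper): break on camelCase boundaries,
    underscores, and digit runs, keeping acronyms such as "XML" together.
    """
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def word_counts(identifiers):
    """Count how often each word occurs across a collection of identifiers."""
    counts = Counter()
    for name in identifiers:
        counts.update(split_identifier(name))
    return counts


def rank_frequency_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope that stays roughly constant across the curve is consistent with
    power-law behavior; this simple fit is a hypothetical stand-in for
    whatever estimator a real study would use.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Calling `word_counts(["getUserName", "setUserName", "XMLParser"])` tallies the words `get`, `set`, `user`, `name`, `xml`, and `parser`; passing the resulting counts to `rank_frequency_slope` gives a single number summarizing how steeply word frequency falls off with rank.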