Exploring Java software vocabulary: A search and mining perspective

Authors:
Erik Linstead;Lindsey Hughes;Cristina Lopes;Pierre Baldi
Affiliations:
School of Information and Computer Sciences, University of California, Irvine, USA;Department of Math and Computer Science, Chapman University, Orange, CA, USA;School of Information and Computer Sciences, University of California, Irvine, USA;School of Information and Computer Sciences, University of California, Irvine, USA
Venue:
SUITE '09 Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation
Year:
2009

Abstract

We conduct a large-scale analysis of Java source code vocabulary for 12,151 open source projects from SourceForge and Apache, a corpus substantially larger than those considered previously. Simple statistical analysis demonstrates robust power-law behavior for word count distributions across multiple program entities. We then identify salient vocabulary trends for classes, interfaces, methods, and fields. Our results provide low-level insight into the vocabulary space governing Java software development, with direct application to program comprehension and software search. Supplementary material may be found at: http://sourcerer.ics.uci.edu/suite2009/suite.html.
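
As a rough illustration of the kind of analysis the abstract describes, the sketch below splits Java-style identifiers into words, counts them, and estimates the slope of the rank-frequency curve on log-log axes. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say how identifiers were tokenized or how the power law was fitted, so the splitting rule, the least-squares fit, and the helper names `split_identifier`, `rank_frequency`, and `loglog_slope` are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: words are delimited by camelCase boundaries, underscores,
# and digits. Acronym runs such as "HTTP" are kept as a single word.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(name):
    """Split a Java-style identifier into lowercase words."""
    return [w.lower() for w in _WORD.findall(name)]


def rank_frequency(identifiers):
    """Return word counts sorted from most to least frequent."""
    counts = Counter(
        word for name in identifiers for word in split_identifier(name)
    )
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    A roughly constant negative slope is consistent with power-law
    behavior. This is only a quick diagnostic and needs at least two
    ranks; it is not a rigorous power-law test.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

In use, one would collect the class, interface, method, and field names of a corpus into separate lists and call `loglog_slope(rank_frequency(names))` on each, which gives one slope per program entity to compare. A regression on log-log axes is known to be a crude estimator, so a maximum-likelihood fit would be a reasonable replacement for anything beyond exploration.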