Exploring Java software vocabulary: A search and mining perspective

  • Authors:
  • Erik Linstead; Lindsey Hughes; Cristina Lopes; Pierre Baldi

  • Affiliations:
  • School of Information and Computer Sciences, University of California, Irvine, USA; Department of Math and Computer Science, Chapman University, Orange, CA, USA; School of Information and Computer Sciences, University of California, Irvine, USA; School of Information and Computer Sciences, University of California, Irvine, USA

  • Venue:
  • SUITE '09 Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation
  • Year:
  • 2009

Abstract

We conduct a large-scale analysis of Java source code vocabulary for 12,151 open source projects from SourceForge and Apache, a corpus substantially larger than those considered previously. Simple statistical analysis demonstrates robust power-law behavior for word count distributions across multiple program entities. We then identify salient vocabulary trends for classes, interfaces, methods, and fields. Our results provide low-level insight into the vocabulary space governing Java software development, with direct application to program comprehension and software search. Supplementary material may be found at: http://sourcerer.ics.uci.edu/suite2009/suite.html.
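
As a rough illustration of the kind of analysis the abstract describes, the sketch below splits Java-style identifiers into words, counts them, and estimates the slope of the rank-frequency curve on log-log axes. The abstract does not state the authors' actual tokenization or fitting procedure, so everything here is an assumption: the function names `split_identifier`, `word_counts`, and `loglog_slope` are hypothetical, the splitting rules (underscores, dollar signs, camelCase boundaries) are one common convention, and a least-squares fit in log-log space is only a quick diagnostic, not a rigorous test for power-law behavior.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a Java-style identifier into lowercase words.

    Assumed convention: break on underscores, dollar signs, and camelCase
    boundaries, keeping runs of capitals (acronyms) and digits together.
    """
    words = []
    for part in re.split(r"[_$]+", name):
        # Acronym run | optional capital followed by lowercase letters | digits
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def word_counts(identifiers):
    """Count word occurrences across a collection of identifier names."""
    counts = Counter()
    for name in identifiers:
        counts.update(split_identifier(name))
    return counts


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near a negative constant over a wide range of ranks is
    consistent with a power law. Needs at least two distinct words.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Hypothetical method names standing in for names mined from a corpus.
    sample = ["getFileName", "setFileName", "parseXMLDocument", "MAX_VALUE", "getValue"]
    counts = word_counts(sample)
    print(counts.most_common(3))
    print(loglog_slope(counts))
```

On a real corpus, the same counting could be run separately for class, interface, method, and field names to compare their vocabularies, which is the per-entity breakdown the abstract refers to.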