Quantifying the similarities between source code lexicons

  • Authors:
  • Lauren R. Biggers; Nicholas A. Kraft

  • Affiliations:
  • The University of Alabama, Tuscaloosa, AL; The University of Alabama, Tuscaloosa, AL

  • Venue:
  • Proceedings of the 49th Annual Southeast Regional Conference
  • Year:
  • 2011

Abstract

Several recent static analysis techniques automate software understanding activities by extracting textual information from source code and applying information retrieval models to the extracted corpora. These source code retrieval techniques show efficacy, but the literature provides no guidance regarding configuration of their constituent processes. For example, the literature provides conflicting information regarding the benefit of extracting comments and string literals along with identifiers such as method or variable names. In this paper we present an initial investigation into the similarities between three source code lexicons described in the literature: identifiers, comments, and string literals. We address three research questions using a case study of six open source Java projects. The results indicate that methods uniquely contain from 30% to 60% of the projects' terms, whereas the comments uniquely contain from 22% to 45% of the terms. Future work includes analyzing the extent to which comments and string literals introduce domain terms rather than non-domain terms.
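The "uniquely contain" percentages reported above can be read as set arithmetic over the terms in each lexicon. The following is a minimal sketch under that assumption, not the authors' tooling: the function name `unique_term_shares`, the choice of denominator (all distinct terms in the project), and the toy data are hypothetical and chosen only for illustration.

```python
def unique_term_shares(lexicons):
    """Return, per lexicon, the fraction of all project terms found only in it.

    `lexicons` maps a lexicon name to a set of terms. Using the union of all
    lexicons as the denominator is an assumption made for this sketch.
    """
    all_terms = set().union(*lexicons.values())
    shares = {}
    for name, terms in lexicons.items():
        # Terms that appear in any lexicon other than the current one.
        others = set().union(*(t for n, t in lexicons.items() if n != name))
        # Terms exclusive to this lexicon, as a share of all project terms.
        shares[name] = len(terms - others) / len(all_terms) if all_terms else 0.0
    return shares


# Toy data, invented purely for illustration (not from the case study).
toy_lexicons = {
    "identifiers": {"parse", "buffer", "token", "index"},
    "comments": {"parse", "token", "deprecated", "workaround"},
    "string_literals": {"error", "token"},
}
shares = unique_term_shares(toy_lexicons)
```

In this toy input there are seven distinct terms; two occur only in identifiers, two only in comments, and one only in string literals, so the shares are 2/7, 2/7, and 1/7. A real study would also need a term extraction and normalization step (for example, identifier splitting), which this sketch does not cover.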