The effects of identifier retention and stop word removal on a latent Dirichlet allocation based feature location technique

  • Authors: Lauren R. Biggers
  • Affiliations: The University of Alabama, Tuscaloosa, AL
  • Venue: Proceedings of the 50th Annual Southeast Regional Conference
  • Year: 2012

Abstract

Feature location, an important task in program comprehension, occurs when a developer identifies the source code entity or entities that implement a given functionality. Researchers have applied static analysis techniques to multiple software maintenance tasks, including feature location. These techniques operate on a document corpus, and building a suitable corpus from source code requires a number of configuration and preprocessing decisions. Currently, the software engineering literature offers little guidance for making such decisions. This paper focuses on two preprocessing choices for source code corpora: identifier splitting and stop word lists. We experiment on three open-source Java test suites, namely Mylyn 1.0.1, Rhino 1.5R5, and Rhino 1.6R5. Our results indicate that identifier splitting and stop word list decisions do not significantly affect the performance of the latent Dirichlet allocation (LDA) based feature location technique.
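To make the two preprocessing choices concrete, the following is a minimal Python sketch of identifier splitting and stop word filtering. It is not the paper's implementation. The stop word list, the function names `split_identifier` and `build_terms`, and the `keep_original` flag (one possible reading of "identifier retention", in which the unsplit identifier is kept alongside its parts) are all assumptions made for illustration.

```python
import re

# Hypothetical stop word list, for illustration only; the lists
# evaluated in the paper are not given in the abstract.
STOP_WORDS = {"get", "set", "the", "a", "of", "to", "int", "void", "public", "return"}

# One token is: an acronym run (XML in XMLParser), a word with an
# optional leading capital (get, Parser), or a run of digits.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split a Java-style identifier on underscores, dollar signs, and case changes."""
    parts = []
    for chunk in re.split(r"[_$]+", identifier):
        parts.extend(token.lower() for token in _TOKEN.findall(chunk))
    return parts


def build_terms(identifiers, stop_words=STOP_WORDS, keep_original=False):
    """Turn identifiers into corpus terms.

    keep_original is an assumed interpretation of identifier retention:
    when True, a compound identifier is emitted (lowercased) before its parts.
    """
    terms = []
    for identifier in identifiers:
        parts = split_identifier(identifier)
        if keep_original and len(parts) > 1:
            terms.append(identifier.lower())
        terms.extend(part for part in parts if part not in stop_words)
    return terms


if __name__ == "__main__":
    print(build_terms(["getXMLParser", "taskList"]))
    print(build_terms(["getXMLParser", "taskList"], keep_original=True))
```

The resulting term lists would then form the documents of the corpus on which a topic model such as LDA is trained. Varying `stop_words` and `keep_original` gives the kind of configuration space the abstract describes.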