Estimating the Optimal Number of Latent Concepts in Source Code Analysis

Authors:
Scott Grant; James R. Cordy
Venue:
SCAM '10 Proceedings of the 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation
Year:
2010

Abstract

How many latent topics are needed to model the latent substructure of a source code corpus most accurately remains an open question in source code analysis. Most estimates of the number of latent topics in a software corpus rest on the assumption that the data is similar to natural language, but there is little empirical evidence to support this. To help determine the number of topics needed to represent source code accurately, we generate a series of Latent Dirichlet Allocation (LDA) models with varying topic counts. We use a heuristic to evaluate each model's ability to identify related source code blocks, and demonstrate the consequences of choosing too few or too many latent topics.
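
The experimental loop described in the abstract (fit LDA at several topic counts, then score how well each model groups related code blocks) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the heuristic, so the nearest-neighbour agreement score used here is a stand-in, and the toy corpus, its labels, the hyperparameters, and the names `fit_lda`, `neighbour_agreement`, and `sweep` are all hypothetical.

```python
import random


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic mixtures."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = assign[d][i], word_id[w]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def neighbour_agreement(mixtures, labels):
    """Assumed heuristic: share of blocks whose nearest neighbour has the same label."""
    hits = 0
    for i, m in enumerate(mixtures):
        others = [j for j in range(len(mixtures)) if j != i]
        nearest = max(others, key=lambda j: cosine(m, mixtures[j]))
        hits += labels[i] == labels[nearest]
    return hits / len(mixtures)


# Hypothetical toy corpus: identifier tokens from nine code blocks, hand-labelled.
blocks = [
    ("io", "open file read buffer close file".split()),
    ("io", "file write buffer flush close".split()),
    ("io", "read buffer file seek".split()),
    ("net", "socket connect send packet close socket".split()),
    ("net", "socket listen accept packet recv".split()),
    ("net", "send packet socket retry".split()),
    ("parse", "token lexer parse node tree".split()),
    ("parse", "parse tree node visit token".split()),
    ("parse", "lexer token scan parse".split()),
]
labels = [label for label, _ in blocks]
docs = [tokens for _, tokens in blocks]


def sweep(topic_counts):
    """Score one LDA model per candidate topic count."""
    return {k: neighbour_agreement(fit_lda(docs, k), labels) for k in topic_counts}


if __name__ == "__main__":
    for k, score in sweep([1, 2, 3, 5, 8]).items():
        print(f"topics={k} agreement={score:.2f}")
```

A pure-Python sampler keeps the sketch dependency-free; on a real corpus a library implementation of LDA would be the practical choice, and the labels would have to come from some independent notion of which code blocks are related.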