Using Latent Dirichlet Allocation for automatic categorization of software

Authors:
Kai Tian;Meghan Revelle;Denys Poshyvanyk
Affiliations:
Computer Science Department, The College of William and Mary, Williamsburg, VA 23185 USA;Computer Science Department, The College of William and Mary, Williamsburg, VA 23185 USA;Computer Science Department, The College of William and Mary, Williamsburg, VA 23185 USA
Venue:
MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories
Year:
2009

Abstract

In this paper, we propose LACT, a technique for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation (LDA), an information retrieval method that indexes and analyzes source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages, including C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, a promising improvement over manual categorization and existing techniques.
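To make the idea of treating source code as mixtures of probabilistic topics concrete, the following is a minimal sketch in Python. It is not the authors' LACT implementation. It assumes a toy corpus of four hypothetical systems (the names and identifier strings are invented for illustration), fits LDA with a small collapsed Gibbs sampler, and, as a simplifying assumption, uses each system's dominant topic as its category and the topic's most frequent words as a candidate category name.

```python
import random
import re
from collections import Counter


def tokenize(source):
    """Split identifiers on separators and camelCase; lowercase; drop short tokens."""
    words = re.findall(r"[A-Za-z][a-z]+|[A-Z]+(?![a-z])", source)
    return [w.lower() for w in words if len(w) > 2]


def fit_lda(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (hyperparameters are illustrative assumptions).

    Returns (theta, top_words): per-document topic proportions and the
    three most frequent words assigned to each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]        # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [w for w, c in tw.most_common(3) if c > 0] for tw in topic_word
    ]
    return theta, top_words


# Hypothetical systems: each string stands in for identifiers mined from source code.
systems = {
    "chat_client": "openSocket sendMessage receiveMessage socket connect message",
    "irc_bot": "connectSocket joinChannel sendMessage socket message channel",
    "img_viewer": "loadImage renderPixel scaleImage pixel image render",
    "photo_edit": "cropImage filterPixel saveImage image pixel filter",
}

NUM_TOPICS = 2
docs = [tokenize(src) for src in systems.values()]
theta, top_words = fit_lda(docs, NUM_TOPICS)

# Simplifying assumption: a system's category is its highest-weight topic.
categories = {
    name: max(range(NUM_TOPICS), key=lambda k: row[k])
    for name, row in zip(systems, theta)
}

for name, topic in categories.items():
    print(name, "-> topic", topic, top_words[topic])
```

On a corpus this small the sampled topics depend on the seed and iteration count, so the printed grouping should be read as an illustration of the workflow rather than a stable result.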