How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms

  • Authors:
  • Annibale Panichella;Bogdan Dit;Rocco Oliveto;Massimiliano Di Penta;Denys Poshyvanyk;Andrea De Lucia

  • Affiliations:
  • University of Salerno, Italy;College of William and Mary, USA;University of Molise, Italy;University of Sannio, Italy;College of William and Mary, USA;University of Salerno, Italy

  • Venue:
  • Proceedings of the 2013 International Conference on Software Engineering
  • Year:
  • 2013

Abstract

Information Retrieval (IR) methods, and in particular topic models, have recently been used to support essential software engineering (SE) tasks by enabling textual retrieval and analysis of software artifacts. In these approaches, topic models have been applied to software artifacts in much the same way as to natural language documents (e.g., using the same settings and parameters), under the assumption that source code and natural language documents are similar. However, applying topic models to software data with the same settings used for natural language text did not always produce the expected results. Recent research investigated this assumption and showed that source code is much more repetitive and predictable than natural language text. Our paper builds on this fundamental finding and proposes a novel solution to adapt, configure, and effectively use a topic modeling technique, namely Latent Dirichlet Allocation (LDA), to achieve better (acceptable) performance across various SE tasks. We introduce LDA-GA, which uses Genetic Algorithms (GA) to determine a near-optimal configuration for LDA in the context of three different SE tasks: (1) traceability link recovery, (2) feature location, and (3) software artifact labeling. The results of our empirical studies demonstrate that LDA-GA is able to identify robust LDA configurations, which lead to higher accuracy on all the datasets for these SE tasks compared to previously published results, heuristics, and the results of a combinatorial search.
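
The abstract describes the LDA-GA idea only at a high level. The following is a minimal sketch of that idea, not the authors' implementation: a small genetic algorithm searches over LDA hyperparameters (number of topics, alpha, eta, training passes) and scores each configuration by how cleanly the resulting document-topic vectors cluster. It assumes gensim's LdaModel and scikit-learn's silhouette_score; the parameter ranges, GA operators, and clustering-based fitness shown here are illustrative assumptions and may differ from the configuration used in the paper.

    # Sketch of a GA over LDA hyperparameters with a clustering-quality fitness.
    # Assumes a gensim bag-of-words `corpus` and `dictionary` prepared from the
    # software artifacts (placeholders, not provided by the paper).
    import random
    import numpy as np
    from gensim.models import LdaModel
    from sklearn.metrics import silhouette_score

    def fitness(corpus, dictionary, k, alpha, eta, passes):
        """Train LDA with one configuration and score the document clustering it induces."""
        lda = LdaModel(corpus, id2word=dictionary, num_topics=k,
                       alpha=alpha, eta=eta, passes=passes, random_state=0)
        theta = np.zeros((len(corpus), k))          # dense document-topic matrix
        for i, doc in enumerate(corpus):
            for t, p in lda.get_document_topics(doc, minimum_probability=0.0):
                theta[i, t] = p
        labels = theta.argmax(axis=1)               # dominant topic per document
        if len(set(labels)) < 2:
            return -1.0                             # silhouette undefined for one cluster
        return silhouette_score(theta, labels)

    def lda_ga(corpus, dictionary, pop_size=10, generations=5):
        """Tiny GA: truncation selection, uniform crossover, random-reset mutation."""
        rand_cfg = lambda: (random.randint(2, 50),       # number of topics
                            random.uniform(0.01, 1.0),   # alpha
                            random.uniform(0.01, 1.0),   # eta
                            random.choice([5, 10, 20]))  # passes
        pop = [rand_cfg() for _ in range(pop_size)]
        for _ in range(generations):
            scored = sorted(pop, key=lambda c: fitness(corpus, dictionary, *c),
                            reverse=True)
            parents = scored[:pop_size // 2]
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = random.sample(parents, 2)
                child = tuple(random.choice(pair) for pair in zip(a, b))  # uniform crossover
                if random.random() < 0.2:                                 # mutation
                    child = rand_cfg()
                children.append(child)
            pop = parents + children
        return max(pop, key=lambda c: fitness(corpus, dictionary, *c))

The returned configuration would then be used to train the final LDA model for the target SE task (e.g., traceability link recovery or feature location); re-evaluating fitness inside the loop is deliberately naive here to keep the sketch short.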