Automated topic naming

  • Authors:
  • Abram Hindle;Neil A. Ernst;Michael W. Godfrey;John Mylopoulos

  • Affiliations:
  • Dept. of Computing Science, University of Alberta, Edmonton, Canada;Dept. of Computer Science, University of British Columbia, Vancouver, Canada;David Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada;Dept. Information Eng. and Computer Science, University of Trento, Trento, Italy

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA). LDA is used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current attempts to interpret topics assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy, consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provide fresh insight into their evolving software development activities.