Investigating the Use of Lexical Information for Software System Clustering

  • Authors:
  • Anna Corazza; Sergio Di Martino; Valerio Maggio; Giuseppe Scanniello

  • Venue:
  • CSMR '11 Proceedings of the 2011 15th European Conference on Software Maintenance and Reengineering
  • Year:
  • 2011

Abstract

Developers have considerable freedom in writing comments and in choosing identifiers and method names. These choices are intentional and carry information of differing relevance for understanding what a software system implements and, in particular, the role of each source file. In this paper we investigate the effectiveness of exploiting lexical information for software system clustering. In particular, we explore the contribution of the combined use of six dictionaries, corresponding to the six parts of the source code where programmers introduce lexical information: class names, attribute names, method names, parameter names, comments, and source code statements. The relevance of each dictionary is weighted by a probabilistic model whose parameters are estimated with the Expectation-Maximization algorithm. To group source files accordingly, we use a hierarchical clustering algorithm. The investigation was conducted on a dataset of 13 open source Java software systems.
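
As a rough illustration of the pipeline the abstract describes, the following sketch merges per-zone term lists into one weighted vector per source file and groups files by average-linkage agglomerative clustering on cosine similarity. It is not the authors' implementation: the zone weights in ZONE_WEIGHTS are hypothetical fixed values (the paper estimates such weights with Expectation-Maximization, which is not implemented here), and the function names, linkage criterion, similarity measure, and example files are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical fixed weights for the six zones. The paper estimates zone
# relevance with Expectation-Maximization; this sketch does not.
ZONE_WEIGHTS = {
    "class": 0.30, "attribute": 0.15, "method": 0.20,
    "parameter": 0.10, "comment": 0.15, "statement": 0.10,
}

def weighted_vector(zones):
    """Merge per-zone term lists into a single weighted term vector."""
    vec = Counter()
    for zone, terms in zones.items():
        for term in terms:
            vec[term] += ZONE_WEIGHTS[zone]
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def cluster(files, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    vectors = {name: weighted_vector(zones) for name, zones in files.items()}
    clusters = [[name] for name in files]

    def linkage(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        total = sum(cosine(vectors[x], vectors[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the most similar pair of clusters (i < j by construction).
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up example input: each file maps zone names to extracted terms.
example_files = {
    "AccountDao.java": {"class": ["account", "dao"],
                        "method": ["find", "account"],
                        "comment": ["load", "account", "database"]},
    "AccountService.java": {"class": ["account", "service"],
                            "method": ["open", "account"],
                            "comment": ["account", "business"]},
    "PngWriter.java": {"class": ["png", "writer"],
                       "method": ["write", "image"],
                       "comment": ["encode", "image", "png"]},
    "JpegWriter.java": {"class": ["jpeg", "writer"],
                        "method": ["write", "image"],
                        "comment": ["encode", "image", "jpeg"]},
}
```

Calling cluster(example_files, 2) groups the two account-related files together and the two image writers together, since files in different groups share no terms. Replacing the fixed weights with estimated ones would change only ZONE_WEIGHTS, which is why the weighting is kept separate from the clustering step.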