My Repository Runneth Over: An Empirical Study on Diversifying Data Sources to Improve Feature Search

  • Authors:
  • Sukanya Ratanotayanon;Hye Jung Choi;Susan Elliott Sim

  • Venue:
  • ICPC '10 Proceedings of the 2010 IEEE 18th International Conference on Program Comprehension
  • Year:
  • 2010

Abstract

Research on feature location that applies information retrieval techniques has experimented with the kinds of inputs to the corpus and the algorithms that could be used. At first, only source code was used. Later, extraction techniques were improved, and data from other software tools and analyses were used to expand or augment the repository. But does having more diverse data in the repository always produce better results? In this paper, we report on an empirical study examining the effect of increasing data diversity on feature location through search. In particular, we looked at the effect of including: i) change sets from a revision control system, ii) tickets from issue trackers, and iii) elements from a Static Dependency Graph (SDG). We searched for three features of Jajuk, an open source Java jukebox, and two features of jEdit, an open source Java text editor. We used four different corpuses built with combinations of the above data. We used Eclipse’s code search and an index built with source code as baseline conditions. We found that it is not always better to have more diverse data. Adding SDG data to change sets increased recall, but drove down precision. Adding data from issue trackers had little effect and in one case lowered recall. We also found that large-scale refactoring of the code decreases the effectiveness of using change sets for feature location.