Improving identifier informativeness using part of speech information

  • Authors:
  • Dave Binkley;Matthew Hearn;Dawn Lawrie

  • Affiliations:
  • Loyola University Maryland, Baltimore, USA;Loyola University Maryland, Baltimore, USA;Loyola University Maryland, Baltimore, USA

  • Venue:
  • Proceedings of the 8th Working Conference on Mining Software Repositories
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found is software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. An empirical investigation finds that with minimal guidance an existing tagger was correct 88% of the time when tagging the words found in source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 structure-field names. From patterns in the tags several rules emerge that seek to understand past usage and to improve future naming.