Expanding identifiers to normalize source code vocabulary

Authors:
Dawn Lawrie;Dave Binkley
Affiliations:
Loyola University Maryland, Baltimore, 21210-2699, USA;Loyola University Maryland, Baltimore, 21210-2699, USA
Venue:
ICSM '11 Proceedings of the 2011 27th IEEE International Conference on Software Maintenance
Year:
2011

Abstract

Maintaining modern software requires significant tool support. Effective tools exploit a variety of information and techniques to aid a software maintainer. One area of recent interest in tool development exploits the natural language information found in source code. Such Information Retrieval (IR)-based tools complement traditional static analysis tools and have tackled problems, such as feature location, that otherwise require considerable human effort. To reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirements, design, change requests, tests, and source code) must be consistent. Unfortunately, there is a significant proportion of invented vocabulary in source code. Vocabulary normalization aligns the vocabulary found in the source code with that found in other software artifacts. Most existing work related to normalization has focused on splitting an identifier into its constituent parts. The next step is to expand each part into a (dictionary) word that matches the vocabulary used in other software artifacts. Building on a successful approach to splitting identifiers, an implementation of an expansion algorithm is presented. Experiments on two systems find that up to 66% of identifiers are correctly expanded, which is within about 20% of the current system's best-case performance. Not only is this performance comparable to previous techniques, but the result is achieved in the absence of special purpose rules and is not limited to restricted syntactic contexts. Results from these experiments also show the impact that varying levels of documentation (including both internal documentation such as the requirements and design, and external, or user-level, documentation) have on the algorithm's performance.
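
To make the two normalization steps concrete (splitting an identifier into parts, then expanding each part into a dictionary word), here is a minimal Python sketch. It is an illustration only and is not the algorithm presented in the paper: the vocabulary list, the function names, and the matching rule (same first letter, letters of the part appearing in order in the word, shortest candidate preferred) are all assumptions chosen for brevity.

```python
import re

# Hypothetical vocabulary, standing in for words mined from other
# software artifacts (requirements, design documents, user manuals).
VOCABULARY = ["string", "length", "count", "buffer", "index", "message"]


def split_identifier(identifier):
    """Split on underscores and camelCase boundaries; return lowercase parts."""
    parts = []
    for chunk in identifier.split("_"):
        # Capitalized or lowercase runs, acronym runs, and digit runs.
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
    return [p.lower() for p in parts]


def is_abbreviation(part, word):
    """Assumed rule: same first letter, and part's letters occur in order in word."""
    if not part or not word or part[0] != word[0]:
        return False
    remaining = iter(word)
    # `ch in remaining` consumes the iterator, enforcing left-to-right order.
    return all(ch in remaining for ch in part)


def expand_part(part, vocabulary=VOCABULARY):
    """Return the shortest matching vocabulary word, or the part unchanged."""
    if part in vocabulary:
        return part
    candidates = [w for w in vocabulary if is_abbreviation(part, w)]
    return min(candidates, key=len) if candidates else part


def normalize(identifier, vocabulary=VOCABULARY):
    """Split an identifier, then expand each part against the vocabulary."""
    return [expand_part(p, vocabulary) for p in split_identifier(identifier)]


if __name__ == "__main__":
    for name in ("strLen", "msg_buf_idx", "fooCnt"):
        print(name, "->", normalize(name))
```

With this toy vocabulary, `strLen` becomes `string length`, while a part with no matching word (such as `foo`) is left unchanged. A realistic expander would have to rank competing candidates, which this sketch sidesteps by taking the shortest match.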