Normalizing Source Code Vocabulary

Authors:
Dawn Lawrie;Dave Binkley;Christopher Morrell
Venue:
WCRE '10 Proceedings of the 2010 17th Working Conference on Reverse Engineering
Year:
2010

Abstract

Information retrieval (IR) based tools complement traditional static and dynamic analysis tools by exploiting the natural language found within a program's text. Tools incorporating IR have tackled problems, such as feature location, that previously required considerable human effort. However, to reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirement and design documents, test plans, and the source code) must be consistent. Vocabulary normalization aligns the vocabulary found in source code with that found in other software artifacts. Normalization both splits an identifier into its constituent parts and expands each part into a full dictionary word to match the vocabulary in other artifacts. An algorithm for normalization is presented. Its current implementation incorporates a greatly improved splitter that exploits a collection of resources, including several dictionaries, frequency distributions derived from the corpus of programs, and co-occurrence data. An empirical study of this new splitter, GenTest, on almost 8000 identifiers finds that it correctly splits 82%, outperforming the current state of the art. A preliminary experiment with the normalization algorithm finds that it improves the FLAT feature locator's scores for relevant code from 0.60 to 0.95 on a scale from 0 to 1.
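
The two steps of normalization described in the abstract, splitting an identifier and then expanding each part, can be illustrated with a minimal sketch. This is not the GenTest splitter or the authors' algorithm: the word list, the abbreviation table, and the function names (`hard_split`, `soft_split`, `normalize`) are hypothetical, and the fewest-words segmentation used here is a simple stand-in for the dictionaries, frequency distributions, and co-occurrence data the abstract mentions.

```python
import re

# Hypothetical toy resources (assumptions for illustration only).
DICTIONARY = {"string", "length", "count", "file", "name", "index", "buffer"}
EXPANSIONS = {"str": "string", "len": "length", "cnt": "count",
              "idx": "index", "buf": "buffer"}
VOCAB = DICTIONARY | set(EXPANSIONS)


def hard_split(identifier):
    """Split at underscores and camelCase boundaries; digits are dropped."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)
    return [p.lower() for p in parts]


def soft_split(token):
    """Segment a same-case token into the fewest known words, if possible."""
    best = {0: []}  # best[i] holds the fewest-word segmentation of token[:i]
    for end in range(1, len(token) + 1):
        for start in range(end):
            word = token[start:end]
            if start in best and word in VOCAB:
                candidate = best[start] + [word]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsplit token when no full segmentation exists.
    return best.get(len(token), [token])


def normalize(identifier):
    """Split an identifier, then expand known abbreviations to full words."""
    words = []
    for token in hard_split(identifier):
        for part in soft_split(token):
            words.append(EXPANSIONS.get(part, part))
    return words


print(normalize("strlen"))       # ['string', 'length']
print(normalize("fileNameIdx"))  # ['file', 'name', 'index']
```

In this sketch, `hard_split` handles explicit word markers, `soft_split` handles tokens with no such markers (for example `strlen`), and the expansion step maps each part to a dictionary word so that it can match the vocabulary of other artifacts.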