Normalizing Source Code Vocabulary

  • Authors:
  • Dawn Lawrie;Dave Binkley;Christopher Morrell

  • Affiliations:
  • -;-;-

  • Venue:
  • WCRE '10 Proceedings of the 2010 17th Working Conference on Reverse Engineering
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Information Retrieval (IR) based tools complement traditional static and dynamic analysis tools by exploiting the natural language found within a program's text. Tools incorporating IR have tackled problems, such as feature location, that previously required considerable human effort. However, to reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirement and design documents, test plans, as well as the source code) must be consistent. Vocabulary normalization aligns the vocabulary found in source code with that found in other software artifacts. Normalization both splits an identifier into its constituent parts and expands each part into a full dictionary word to match vocabulary in other artifacts. An algorithm for normalization is presented. Its current implementation incorporates a greatly improved splitter that exploits a collection of resources including several dictionaries, frequency distributions derived from the corpus of programs, and co-occurrence data. Empirical study of this new splitter, GenTest, on almost 8000 identifiers finds that it correctly splits 82%, outperforming the current state-of-the-art. A preliminary experiment with the normalization algorithm finds it improving the FLAT feature locator's scores of relevant code from 0.60 to 0.95 on a scale from 0 to 1.