Information Retrieval (IR) based tools complement traditional static and dynamic analysis tools by exploiting the natural language found within a program's text. Tools incorporating IR have tackled problems, such as feature location, that previously required considerable human effort. However, to reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirement and design documents, test plans, and the source code) must be consistent. Vocabulary normalization aligns the vocabulary found in source code with that found in other software artifacts. Normalization both splits an identifier into its constituent parts and expands each part into a full dictionary word that matches the vocabulary of the other artifacts. This paper presents a normalization algorithm. Its current implementation incorporates a greatly improved splitter that exploits a collection of resources, including several dictionaries, frequency distributions derived from a corpus of programs, and co-occurrence data. An empirical study of this new splitter, GenTest, on almost 8000 identifiers finds that it correctly splits 82% of them, outperforming the current state of the art. A preliminary experiment with the normalization algorithm finds that it improves the scores the FLAT feature locator assigns to relevant code from 0.60 to 0.95 on a scale from 0 to 1.
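The two normalization steps named above, splitting and expansion, can be illustrated with a minimal Python sketch. This is not GenTest or the paper's algorithm. It splits only at explicit markers (underscores, case changes, digits) and expands parts using a small, hypothetical abbreviation table (`EXPANSIONS`), whereas the splitter described above draws on dictionaries, corpus frequency distributions, and co-occurrence data. The function names `split_identifier` and `normalize` are likewise assumptions made for illustration.

```python
import re

# Hypothetical abbreviation table, for illustration only. A real normalizer
# would choose expansions using dictionaries, corpus frequencies, and
# co-occurrence data rather than a fixed lookup.
EXPANSIONS = {
    "str": "string",
    "len": "length",
    "cnt": "count",
    "idx": "index",
    "buf": "buffer",
}

# Matches, in order of preference: an acronym run (uppercase letters not
# followed by a lowercase letter), a word with an optional leading capital,
# or a run of digits. Underscores and other separators are skipped.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier at underscores, case changes, and digit runs."""
    return [m.group(0).lower() for m in _PART.finditer(identifier)]


def normalize(identifier):
    """Split an identifier, then expand each part found in EXPANSIONS."""
    return [EXPANSIONS.get(part, part) for part in split_identifier(identifier)]


if __name__ == "__main__":
    for name in ("strLen", "HTMLParser", "buf_idx2"):
        print(name, "->", normalize(name))
```

A marker-based splitter like this leaves same-case identifiers such as `strlen` unsplit, which is exactly the case that motivates resource-driven splitters like the one the abstract describes.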