Quantifying the similarities between source code lexicons

Authors:
Lauren R. Biggers;Nicholas A. Kraft
Affiliations:
The University of Alabama, Tuscaloosa, AL;The University of Alabama, Tuscaloosa, AL
Venue:
Proceedings of the 49th Annual Southeast Regional Conference
Year:
2011

Abstract

Several recent static analysis techniques automate software understanding activities by extracting textual information from source code and applying information retrieval models to the extracted corpora. These source code retrieval techniques show efficacy, but the literature provides no guidance regarding configuration of their constituent processes. For example, the literature provides conflicting information regarding the benefit of extracting comments and string literals along with identifiers such as method or variable names. In this paper we present an initial investigation into the similarities between three source code lexicons described in the literature: identifiers, comments, and string literals. We address three research questions using a case study of six open source Java projects. The results indicate that methods uniquely contain from 30% to 60% of the projects' terms, whereas the comments uniquely contain from 22% to 45% of the terms. Future work includes analyzing the extent to which comments and string literals introduce domain terms rather than non-domain terms.
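The "uniquely contain" percentages reported above can be read as a set computation: for each lexicon, count the terms that appear in no other lexicon and divide by the number of distinct terms in the project. The following is a minimal sketch under that assumption; it is not the authors' tooling. The function name `unique_term_shares` and the toy term sets are hypothetical, and the abstract does not specify how terms are extracted or normalized (identifier splitting, stemming, stop-word removal), so those steps are omitted.

```python
def unique_term_shares(lexicons):
    """Map each lexicon name to the fraction of all distinct project
    terms that occur only in that lexicon (assumed reading of
    "uniquely contain"; not confirmed by the source)."""
    # Normalize inputs to sets so duplicates do not affect the counts.
    term_sets = {name: set(terms) for name, terms in lexicons.items()}
    all_terms = set().union(*term_sets.values())
    shares = {}
    for name, terms in term_sets.items():
        # Union of every other lexicon's terms.
        others = set().union(
            *(t for n, t in term_sets.items() if n != name)
        )
        unique = terms - others
        shares[name] = len(unique) / len(all_terms) if all_terms else 0.0
    return shares


# Toy, made-up term sets for illustration only (not data from the paper).
lexicons = {
    "identifiers": {"parse", "token", "buffer", "index"},
    "comments": {"parse", "token", "deprecated", "todo"},
    "literals": {"token", "error"},
}

shares = unique_term_shares(lexicons)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of project terms are unique to this lexicon")
```

Because the shares are computed against the union of all three lexicons, they need not sum to 100%: terms shared by two or more lexicons count toward none of them.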