A dataset for evaluating identifier splitters

Authors:
David Binkley;Dawn Lawrie;Lori Pollock;Emily Hill;K. Vijay-Shanker
Affiliations:
Loyola University Maryland, USA;Loyola University Maryland, USA;University of Delaware, USA;Montclair State University, USA;University of Delaware, USA
Venue:
Proceedings of the 10th Working Conference on Mining Software Repositories
Year:
2013

Abstract

Software engineering and evolution techniques have recently started to exploit the natural language information in source code. A key step in doing so is splitting identifiers into their constituent words. While simple in concept, identifier splitting raises several challenging issues, which has led to a range of splitting techniques. Consequently, the research community would benefit from a dataset (i.e., a gold set) that facilitates comparative studies of identifier splitting techniques. A gold set of 2,663 split identifiers was constructed from 8,522 individual human splitting judgements and can be obtained from www.cs.loyola.edu/~binkley/ludiso. The paper describes the set's construction and offers observations aimed at its effective use.
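
To make the idea of a comparative study concrete, the following is a minimal sketch of how a splitter might be scored against a gold set. It is not the authors' method or the dataset's actual format: the function names `split_identifier` and `exact_match_accuracy`, the in-memory `gold` dictionary, and the three sample identifiers are all hypothetical, and the splitter is only a simple baseline that relies on underscores, digits, and camel-case boundaries.

```python
import re


def split_identifier(identifier):
    """Baseline splitter (illustrative only): split on underscores and
    other non-word characters, then on camel-case and digit boundaries."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym runs (HTTP), capitalized or lowercase words, digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [w.lower() for w in words]


def exact_match_accuracy(splitter, gold):
    """Fraction of identifiers whose split equals the gold split exactly."""
    correct = sum(1 for ident, expected in gold.items()
                  if splitter(ident) == expected)
    return correct / len(gold)


# Hypothetical stand-in for a gold set: identifier -> agreed-upon split.
gold = {
    "getHTTPResponse": ["get", "http", "response"],
    "max_size": ["max", "size"],
    "notype": ["no", "type"],  # no boundary marker; the baseline cannot split it
}

accuracy = exact_match_accuracy(split_identifier, gold)
print(f"exact-match accuracy: {accuracy:.2f}")
```

The third entry illustrates why splitting is harder than it looks: an identifier with no case change or separator gives a boundary-based splitter nothing to work with, which is the kind of case a shared gold set lets different techniques be compared on.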