A dataset for evaluating identifier splitters

  • Authors: David Binkley; Dawn Lawrie; Lori Pollock; Emily Hill; K. Vijay-Shanker
  • Affiliations: Loyola University Maryland, USA; Loyola University Maryland, USA; University of Delaware, USA; Montclair State University, USA; University of Delaware, USA
  • Venue: Proceedings of the 10th Working Conference on Mining Software Repositories
  • Year: 2013

Abstract

Software engineering and evolution techniques have recently started to exploit the natural language information in source code. A key step in doing so is splitting identifiers into their constituent words. While simple in concept, identifier splitting raises several challenging issues, leading to a range of splitting techniques. Consequently, the research community would benefit from a dataset (i.e., a gold set) that facilitates comparative studies of identifier splitting techniques. A gold set of 2,663 split identifiers was constructed from 8,522 individual human splitting judgements and can be obtained from www.cs.loyola.edu/~binkley/ludiso. This set's construction and observations aimed at its effective use are described.
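
To make the comparison idea concrete, the following is a minimal sketch of how a splitter might be scored against a gold set. It is not the authors' method: the baseline splitter (breaking on underscores, digits, and case changes), the exact-match accuracy measure, and the entries in `toy_gold` are all assumptions made up for illustration. The dataset's actual file format is not described here, so the gold set is represented as a plain dictionary.

```python
import re


def split_identifier(identifier):
    """Baseline splitter (an assumption, not from the paper):
    break on underscores, other non-word characters, digits, and case changes."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym runs (XML), capitalized or lowercase words (Parser, get), digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [w.lower() for w in words]


def accuracy(splitter, gold):
    """Fraction of identifiers whose split exactly matches the gold split."""
    correct = sum(1 for ident, expected in gold.items() if splitter(ident) == expected)
    return correct / len(gold)


# Hypothetical gold entries, invented for illustration (not taken from the dataset).
toy_gold = {
    "getFileName": ["get", "file", "name"],
    "XMLParser": ["xml", "parser"],
    "max_buf_size": ["max", "buf", "size"],
    "usergroup": ["user", "group"],  # same-case identifier: no marker to split on
}

print(accuracy(split_identifier, toy_gold))
```

The last toy entry illustrates why splitting is harder than it looks: an identifier with no case change or separator gives a marker-based splitter nothing to work with, which is the kind of case that motivates more sophisticated techniques and a shared gold set for comparing them.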