Source code identifier splitting using Yahoo image and web search engine

  • Authors:
  • Ashish Sureka

  • Affiliations:
  • Indraprastha Institute of Information Technology, Delhi (IIIT-D), New Delhi, India

  • Venue:
  • Proceedings of the First International Workshop on Software Mining
  • Year:
  • 2012

Abstract

Source-code or program identifiers are sequences of characters consisting of one or more tokens representing domain concepts. Splitting or tokenizing identifiers that do not contain explicit markers or clues (such as camel-casing or an underscore used as a token separator) is a technically challenging problem. In this paper, we present a technique for automatic tokenization and splitting of source-code identifiers using Yahoo web search and image search similarity distance. We present an algorithm that decides the split position based on several factors: the conceptual correlation and semantic relatedness between the left and right split strings of a given identifier, the popularity of each token, and its length. The number of hits or search results returned by the web and image search engines serves as a proxy for measures such as term popularity and correlation. We perform a series of experiments to validate the proposed approach and present performance results.
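The idea of choosing a split position from hit counts can be illustrated with a minimal sketch. It is not the algorithm from the paper: the abstract does not give the scoring function, so `split_score` below is an assumed combination of popularity, correlation, and length. The hit counts in `TOY_HITS` are made-up numbers standing in for search-engine results, and the names `toy_hits`, `split_score`, and `best_split` are hypothetical.

```python
import math
from typing import Callable, Optional, Tuple

# Made-up hit counts standing in for real search-engine results (assumption).
TOY_HITS = {
    "file": 9_000_000,
    "name": 8_000_000,
    "file name": 3_000_000,
    "fil": 200_000,
    "ename": 50_000,
    "fil ename": 10,
}


def toy_hits(query: str) -> int:
    """Hypothetical hit-count lookup; a real system would query a search engine."""
    return TOY_HITS.get(query, 0)


def split_score(left: str, right: str, hits: Callable[[str], int]) -> float:
    """Assumed scoring function, not the one from the paper.

    Combines three factors named in the abstract:
    popularity (log hit counts of each token), correlation (joint hits
    relative to the rarer token), and length (shorter of the two tokens).
    """
    h_left, h_right = hits(left), hits(right)
    h_joint = hits(f"{left} {right}")
    if min(h_left, h_right, h_joint) == 0:
        return 0.0
    popularity = math.log(h_left) + math.log(h_right)
    correlation = h_joint / min(h_left, h_right)
    length_factor = min(len(left), len(right))
    return popularity * correlation * length_factor


def best_split(
    identifier: str, hits: Callable[[str], int], min_len: int = 2
) -> Optional[Tuple[str, str]]:
    """Try every split position and return the highest-scoring pair, if any."""
    best, best_score = None, 0.0
    for i in range(min_len, len(identifier) - min_len + 1):
        left, right = identifier[:i], identifier[i:]
        score = split_score(left, right, hits)
        if score > best_score:
            best, best_score = (left, right), score
    return best


if __name__ == "__main__":
    print(best_split("filename", toy_hits))
```

With these toy counts, the split "file" + "name" wins because both tokens are popular and co-occur often, while "fil" + "ename" is penalized by its very low joint hit count. Passing the hit-count function as a parameter keeps the sketch independent of any particular search API.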