Source code identifier splitting using Yahoo image and web search engine

Authors:
Ashish Sureka
Affiliations:
Indraprastha Institute of Information Technology, Delhi (IIIT-D), New Delhi, India
Venue:
Proceedings of the First International Workshop on Software Mining
Year:
2012

Abstract

Source-code or program identifiers are sequences of characters consisting of one or more tokens representing domain concepts. Splitting or tokenizing identifiers that do not contain explicit markers or clues (such as camel-casing or an underscore used as a token separator) is a technically challenging problem. In this paper, we present a technique for automatic tokenization and splitting of source-code identifiers using Yahoo web search and image search similarity distance. We present an algorithm that decides the split position based on factors such as the conceptual correlation and semantic relatedness between the left and right split strings of a given identifier, the popularity of each token, and its length. The number of hits or search results returned by the web and image search engines serves as a proxy for measures such as term popularity and correlation. We perform a series of experiments to validate the proposed approach and present performance results.
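
The split-position decision described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the scoring formula, the `min_len` threshold, the `best_split` and `toy_hits` names, and the hit counts in `TOY_HITS` are all assumptions made for illustration. A real implementation would replace `toy_hits` with a lookup that returns the number of results a search engine reports for a query.

```python
import math
from typing import Callable, Optional, Tuple

# Made-up hit counts standing in for search-engine result counts (assumption).
TOY_HITS = {
    "file": 1_000_000,
    "name": 2_000_000,
    "file name": 500_000,
    "fi": 1_000,
    "me": 3_000_000,
    "filena": 10,
}


def toy_hits(query: str) -> int:
    """Hypothetical hit-count lookup; unknown queries get zero hits."""
    return TOY_HITS.get(query, 0)


def best_split(
    identifier: str,
    hits: Callable[[str], int],
    min_len: int = 2,
) -> Optional[Tuple[str, str]]:
    """Return the (left, right) split with the highest score, or None.

    The score (an assumed formula) adds the log popularity of each side
    to the log of their co-occurrence count, so a split is favoured when
    both tokens are common and they also appear together.
    """
    best: Optional[Tuple[str, str]] = None
    best_score = float("-inf")
    # Token length is handled only by requiring min_len characters per side.
    for i in range(min_len, len(identifier) - min_len + 1):
        left, right = identifier[:i], identifier[i:]
        h_left, h_right = hits(left), hits(right)
        if h_left == 0 or h_right == 0:
            continue  # a side with no hits is unlikely to be a real token
        h_both = hits(left + " " + right)
        score = math.log(h_left) + math.log(h_right) + math.log(1 + h_both)
        if score > best_score:
            best, best_score = (left, right), score
    return best


print(best_split("filename", toy_hits))  # ('file', 'name')
```

With these toy counts, the candidate `filena|me` is penalised because `filena` is rare and the two sides never co-occur, while `file|name` scores highly on both popularity and co-occurrence.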