Source-code or program identifiers are sequences of characters consisting of one or more tokens that represent domain concepts. Splitting or tokenizing identifiers that do not contain explicit markers or clues (such as camel-casing or an underscore used as a token separator) is a technically challenging problem. In this paper, we present a technique for automatic tokenization and splitting of source-code identifiers using Yahoo web search and image search similarity distance. We present an algorithm that decides the split position based on several factors: the conceptual correlation and semantic relatedness between the left and right split strings of a given identifier, the popularity of each token, and its length. The number of hits or search results returned by the web and image search engines serves as a proxy for measures such as term popularity and correlation. We perform a series of experiments to validate the proposed approach and present performance results.
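The idea of choosing a split position from hit counts can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the scoring formula, so the sketch swaps in a distance in the style of the Normalized Google Distance for the relatedness term and combines it with token popularity and length using hypothetical weights. The names `best_split`, `normalized_distance`, `hits`, `total`, `length_weight`, and `distance_weight` are assumptions introduced for illustration, and the hit counts come from a made-up lookup table rather than a real search API.

```python
import math
from typing import Callable, Optional, Tuple


def normalized_distance(fx: float, fy: float, fxy: float, total: float) -> float:
    """NGD-style distance from hit counts; smaller means more related.

    fx, fy: hits for each term alone; fxy: hits for both terms together;
    total: assumed size of the search index (a hypothetical constant).
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(total) - min(lx, ly))


def best_split(
    identifier: str,
    hits: Callable[[str], float],
    total: float = 1e10,
    min_len: int = 2,
    length_weight: float = 1.0,    # hypothetical weight, not from the paper
    distance_weight: float = 1.0,  # hypothetical weight, not from the paper
) -> Optional[Tuple[str, str]]:
    """Return the (left, right) split with the highest score, or None."""
    best: Optional[Tuple[str, str]] = None
    best_score = -math.inf
    for i in range(min_len, len(identifier) - min_len + 1):
        left, right = identifier[:i], identifier[i:]
        f_left, f_right = hits(left), hits(right)
        dist = normalized_distance(f_left, f_right, hits(left + " " + right), total)
        if math.isinf(dist):
            continue  # one side (or the pair) has no hits: not a plausible split
        # Popularity: log hit counts of both tokens.
        popularity = math.log(f_left) + math.log(f_right)
        # Length: favour splits whose shorter token is not tiny.
        length_term = length_weight * min(len(left), len(right))
        score = popularity + length_term - distance_weight * dist
        if score > best_score:
            best, best_score = (left, right), score
    return best


# Made-up hit counts standing in for search-engine results (assumption).
TOY_HITS = {
    "file": 1_000_000_000, "name": 2_000_000_000, "file name": 500_000_000,
    "fi": 10_000_000, "lename": 1_000, "fi lename": 10,
    "fil": 1_000_000, "ename": 10_000, "fil ename": 5,
    "filen": 1_000, "ame": 1_000_000, "filen ame": 2,
    "filena": 500, "me": 3_000_000_000, "filena me": 100,
}


def toy_hits(query: str) -> float:
    """Look up a query in the toy table; unknown queries get zero hits."""
    return TOY_HITS.get(query, 0)


print(best_split("filename", toy_hits))
```

With these toy counts the sketch selects the split `file` / `name`, because both tokens are popular, they co-occur often, and neither is very short. In practice the hit counts would come from web and image search queries, and the weights would need tuning against labelled identifiers.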