Improving the tokenisation of identifier names

Authors:
Simon Butler; Michel Wermelinger; Yijun Yu; Helen Sharp
Affiliations:
Computing Department and Centre for Research in Computing, The Open University, Milton Keynes, United Kingdom (all authors)
Venue:
Proceedings of the 25th European Conference on Object-Oriented Programming (ECOOP 2011)
Year:
2011

Abstract

Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available.
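
To make the problem concrete, the following is a minimal sketch in Python. It is not the tokenisation algorithm presented in the paper. It assumes a conventional baseline that splits on separators, internal capitalisation and digit runs, plus a naive word-list lookup for single-case names. The function names `split_by_typography` and `split_with_oracle`, and the tiny word list used as an oracle, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical baseline: tokens are runs of capitals (acronyms), capitalised or
# lower-case words, and runs of digits. Separators such as "_" or "$" match
# none of the alternatives and are therefore dropped.
_TOKEN = re.compile(
    r"[A-Z]+(?![a-z])"   # acronym, e.g. "HTTP" in "HTTPResponse"
    r"|[A-Z]?[a-z]+"     # word with optional leading capital, e.g. "Response"
    r"|[0-9]+"           # run of digits, e.g. "2"
)


def split_by_typography(identifier):
    """Split an identifier name on separators, case changes and digit runs."""
    return _TOKEN.findall(identifier)


def split_with_oracle(token, oracle):
    """Split a single-case token into words found in `oracle` (a set of
    lower-case words), preferring the longest prefix and backtracking when a
    choice leaves an unsplittable remainder. Returns None if no split exists.
    This is an illustrative assumption, not the paper's method."""
    token = token.lower()
    if not token:
        return []
    for end in range(len(token), 0, -1):
        prefix = token[:end]
        if prefix in oracle:
            rest = split_with_oracle(token[end:], oracle)
            if rest is not None:
                return [prefix] + rest
    return None


if __name__ == "__main__":
    print(split_by_typography("getHTTPResponse2"))
    print(split_by_typography("MAX_BUFFER_SIZE"))
    # A single-case name has no typographical boundaries to split on.
    print(split_by_typography("outputfilename"))
    # A word list (oracle) is needed to recover its constituents.
    words = {"output", "file", "name"}
    print(split_with_oracle("outputfilename", words))
```

The typographical split leaves `outputfilename` as a single token, which is the kind of single-case name the abstract identifies as difficult. The word-list lookup recovers its parts only because the toy oracle happens to contain them. Names containing digits, such as `utf8Decoder`, are split mechanically at the digit run, which may or may not reflect the intended meaning.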