Identifying domain expertise of developers from source code

Authors:
Renuka Sindhgatta
Affiliations:
IBM India Research Laboratory, Bangalore, India
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

We are interested in identifying the domain expertise of the developers of a software system. A developer gains expertise in both the code base and the domain of the software system he/she develops, and this information is a useful input when allocating software implementation tasks to developers. Domain concepts represented by the system are discovered by taking into account the linguistic information available in the source code. The vocabulary contained in the source code, namely identifiers (class, method, and variable names) and comments, is extracted. Concepts present in the code base are identified and grouped based on a well-known text processing hypothesis: words are similar to the extent to which they share similar words. The developer's association with the source code, and with the concepts it represents, is derived from version repository information. To this end, the analysis first derives documents from the source code by discarding all programming language constructs. KMeans clustering is then used to cluster the documents and extract closely related concepts. The key concepts present in the documents authored by a developer determine his/her domain expertise. To validate our approach, we apply it to large software systems, two of which are presented in detail in this paper.
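
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword list, the camel-case splitting rule, the raw term-frequency weighting, the cosine-based KMeans with naive seeding, and the file contents and developer names (`asha`, `ravi`) are all assumptions made for illustration. In practice the file-to-author mapping would come from the version repository.

```python
import re
from collections import Counter, defaultdict

# Assumption: a tiny stop list standing in for "programming language constructs".
KEYWORDS = {"public", "private", "class", "void", "int", "double", "return"}

def to_document(source):
    """Reduce source text to its vocabulary (identifier parts and comment words)."""
    words = []
    for token in re.findall(r"[A-Za-z]+", source):
        # Split camelCase identifiers, e.g. "depositAmount" -> deposit, amount.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token):
            part = part.lower()
            if len(part) > 2 and part not in KEYWORDS:
                words.append(part)
    return words

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iterations=10):
    """Plain KMeans over sparse term vectors, assigning by cosine similarity."""
    centroids = [dict(v) for v in vectors[:k]]  # assumption: naive seeding
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = {t: w / len(members) for t, w in total.items()}
    return labels, centroids

def developer_expertise(files, authors, k=2, top=3):
    """Map each developer to the key terms of the clusters their files fall in."""
    names = list(files)
    vectors = [Counter(to_document(files[n])) for n in names]
    labels, centroids = kmeans(vectors, k)
    concepts = [[t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top]]
                for c in centroids]
    expertise = defaultdict(set)
    for name, label in zip(names, labels):
        for dev in authors.get(name, []):
            expertise[dev].update(concepts[label])
    return dict(expertise)

# Hypothetical inputs: four small Java-like files and their authors.
files = {
    "Account.java": "public class Account { private double balance; // account balance\n"
                    " void depositAmount(double amount) { balance += amount; } }",
    "Route.java": "public class Route { // shipping route\n int routeDistance;"
                  " void planRoute(int distance) { routeDistance = distance; } }",
    "Ledger.java": "class Ledger { void postBalance(Account account, double balance) {} }",
    "Shipment.java": "class Shipment { Route route;"
                     " int shipmentDistance(Route route) { return 0; } }",
}
authors = {"Account.java": ["asha"], "Ledger.java": ["asha"],
           "Route.java": ["ravi"], "Shipment.java": ["ravi"]}

expertise = developer_expertise(files, authors)
for dev, terms in sorted(expertise.items()):
    print(dev, sorted(terms))
```

On this toy input the banking files and the shipping files fall into separate clusters, so each developer is associated with the dominant terms of one domain. A fuller treatment would likely weight terms (for example with TF-IDF) and choose the number of clusters more carefully; those choices are left open here.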