Mining concepts from code with probabilistic topic models

Authors:
Erik Linstead; Paul Rigor; Sushil Bajracharya; Cristina Lopes; Pierre Baldi
Affiliations:
University of California, Irvine, CA (all authors)
Venue:
Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
Year:
2007

Abstract

We develop and apply statistical topic models to software as a means of extracting concepts from source code. We demonstrate the effectiveness of the technique on 1,555 projects from SourceForge and Apache, comprising 113,000 files and 19 million lines of code. In addition to providing an automated, unsupervised solution to the problem of summarizing program functionality, the approach provides a probabilistic framework with which to analyze and visualize source file similarity. Finally, we introduce an information-theoretic approach for computing the tangling and scattering of extracted concepts, and present preliminary results.
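
The abstract does not give formulas for tangling and scattering, so the following is only a minimal sketch of one plausible entropy-based reading, not the authors' implementation. It assumes a file-by-topic weight matrix has already been produced by some topic model. The matrix values and the function names (normalize, entropy, scattering, tangling) are hypothetical and chosen for illustration. Under this reading, scattering is the Shannon entropy of one topic's mass across files, and tangling is the Shannon entropy of one file's mass across topics.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def scattering(doc_topic, topic):
    """Entropy of one topic's mass across files (assumed definition).

    High values mean the concept is spread over many files.
    """
    column = [row[topic] for row in doc_topic]
    return entropy(normalize(column))


def tangling(doc_topic, doc):
    """Entropy of one file's mass across topics (assumed definition).

    High values mean the file mixes several concepts.
    """
    return entropy(normalize(doc_topic[doc]))


# Hypothetical file-by-topic weights (rows: files, columns: topics).
doc_topic = [
    [1.0, 0.0],  # file 0 is entirely about topic 0
    [0.0, 1.0],  # file 1 is entirely about topic 1
    [0.5, 0.5],  # files 2 and 3 mix both topics equally
    [0.5, 0.5],
]

topic0_scattering = scattering(doc_topic, 0)
file0_tangling = tangling(doc_topic, 0)
file2_tangling = tangling(doc_topic, 2)
```

With these made-up weights, topic 0 is spread over three files, so its scattering is above zero, while file 0 touches a single topic and therefore has zero tangling. A file that splits its weight evenly between two topics reaches the two-topic maximum of one bit.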