We develop and apply statistical topic models to software as a means of extracting concepts from source code. The effectiveness of the technique is demonstrated on 1,555 projects from SourceForge and Apache, comprising 113,000 files and 19 million lines of code. In addition to providing an automated, unsupervised solution to the problem of summarizing program functionality, the approach provides a probabilistic framework with which to analyze and visualize source file similarity. Finally, we introduce an information-theoretic approach for computing tangling and scattering of extracted concepts and present preliminary results.
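
The abstract does not spell out the information-theoretic measures, so the following is only a minimal sketch of one plausible reading, not the authors' exact formulation. It assumes a file-by-topic weight matrix has already been produced by some topic model, and it treats scattering as the Shannon entropy of a topic's weight across files and tangling as the entropy of a file's weight across topics. The names `entropy`, `scattering`, `tangling`, and the `doc_topic` values are hypothetical, chosen for illustration.

```python
import math

def entropy(weights):
    """Shannon entropy in bits of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

def scattering(doc_topic, topic):
    """Entropy of one topic's weight across files (assumed definition).

    Higher values mean the concept is spread over more files.
    """
    return entropy([row[topic] for row in doc_topic])

def tangling(doc_topic, doc):
    """Entropy of one file's weight across topics (assumed definition).

    Higher values mean the file mixes more concepts.
    """
    return entropy(doc_topic[doc])

# Hypothetical file-by-topic weights: rows are files, columns are topics.
doc_topic = [
    [0.5, 0.5],
    [1.0, 0.0],
    [0.5, 0.5],
]

if __name__ == "__main__":
    for t in range(2):
        print(f"topic {t}: scattering = {scattering(doc_topic, t):.2f} bits")
    for d in range(3):
        print(f"file {d}: tangling = {tangling(doc_topic, d):.2f} bits")
```

Under these assumed definitions, a file devoted to a single topic has zero tangling, and a topic confined to a single file has zero scattering. Entropy is used here because it grows both with the number of files (or topics) involved and with how evenly the weight is shared among them.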