What's the code?: automatic classification of source code archives

Authors:
Secil Ugurel; Robert Krovetz; C. Lee Giles
Affiliations:
The Pennsylvania State University, University Park, PA; NEC Research Institute, Princeton, NJ; The Pennsylvania State University, University Park, PA
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Citing 11

Software reuse
ACM Computing Surveys (CSUR)

Information access tools for software reuse
Journal of Systems and Software - Special issue on software reuse

The reuse of uses in Smalltalk programming
ACM Transactions on Computer-Human Interaction (TOCHI)

Inductive learning algorithms and representations for text categorization
Proceedings of the seventh international conference on Information and knowledge management

Viewing morphology as an inference process
Artificial Intelligence - Special issue on Intelligent internet systems

Support vector machines: hype or hallelujah?
ACM SIGKDD Explorations Newsletter - Special issue on “Scalable data mining algorithms”

Automatically Identifying Reusable OO Legacy Code
Computer

Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning

A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning

Improving Category Specific Web Search by Learning Query Modifications
SAINT '01 Proceedings of the 2001 Symposium on Applications and the Internet (SAINT 2001)

LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)

Cited 16

Classification of source code archives
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval

Organizing and visualizing software repositories using the growing hierarchical self-organizing map
Proceedings of the 2005 ACM symposium on Applied computing

MUDABlue: an automatic categorization system for open source repositories
Journal of Systems and Software - Special issue: Selected papers from the 11th Asia Pacific software engineering conference (APSEC 2004)

Supervised categorization of JavaScript™ using program analysis features
Information Processing and Management: an International Journal - Special issue: AIRS2005: Information retrieval research in Asia

Mining concepts from code with probabilistic topic models
Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering

Mining business topics in source code using latent dirichlet allocation
ISEC '08 Proceedings of the 1st India software engineering conference

Selective dissemination of XML documents based on genetically learned user model and Support Vector Machines
Intelligent Data Analysis

A theory of aspects as latent topics
Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications

Sourcerer: mining and searching internet-scale software repositories
Data Mining and Knowledge Discovery

Classification of software artifacts based on structural information
KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part IV

Source code indexing for automated tracing
Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering

Approximate graph clustering for program characterization
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers

Supporting program indexing and querying in source code digital libraries
AOIS'05 Proceedings of the 7th international conference on Agent-Oriented Information Systems III

Supervised categorization of JavaScript™ using program analysis features
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology

Labeled topic detection of open source software from mining mass textual project profiles
Proceedings of the First International Workshop on Software Mining

Capturing programming content in online discussions
Proceedings of the seventh international conference on Knowledge capture

Abstract

There are various source code archives on the World Wide Web. These archives are usually organized by application categories and programming languages. However, manually organizing source code repositories is not a trivial task since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatic classification of archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of a given programming language or programs in a specified category. We show that source code can be accurately and automatically classified into topical categories and can be identified to be in a specific programming language class.
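As a rough illustration of the idea in the abstract, the sketch below trains one linear SVM per programming language (one-vs-rest) on bag-of-token features and labels a new file with the highest-scoring class. It is a minimal sketch under stated assumptions, not the authors' pipeline: the tokenizer, the Pegasos-style training loop, the hyperparameters, and the six toy snippets in `TOY_SAMPLES` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import random
import re
from collections import Counter

# Identifiers/keywords, or any single punctuation symbol (numeric literals are skipped).
TOKEN = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


def featurize(source):
    """Map source text to an L2-normalised bag-of-tokens vector (a dict)."""
    counts = Counter(TOKEN.findall(source))
    counts["<bias>"] = 1  # constant feature standing in for an intercept
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {tok: v / norm for tok, v in counts.items()}


def score(weights, x):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(tok, 0.0) * v for tok, v in x.items())


def train_binary_svm(xs, ys, lam=0.01, epochs=300, seed=0):
    """Pegasos-style stochastic subgradient descent on the L2-regularised
    hinge loss. ys holds +1 / -1 labels. (Assumed training procedure.)"""
    rng = random.Random(seed)
    weights, step = {}, 0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)
            violated = ys[i] * score(weights, xs[i]) < 1.0
            for tok in weights:  # shrinkage from the regulariser
                weights[tok] *= 1.0 - eta * lam
            if violated:  # hinge-loss subgradient step
                for tok, v in xs[i].items():
                    weights[tok] = weights.get(tok, 0.0) + eta * ys[i] * v
    return weights


def train_one_vs_rest(samples):
    """samples: list of (source_text, label). Returns {label: weights}."""
    xs = [featurize(src) for src, _ in samples]
    labels = sorted({lab for _, lab in samples})
    return {
        lab: train_binary_svm(xs, [1 if l == lab else -1 for _, l in samples])
        for lab in labels
    }


def predict(models, source):
    """Return the label whose classifier gives the highest score."""
    x = featurize(source)
    return max(models, key=lambda lab: score(models[lab], x))


# Hypothetical toy training set; a real archive would supply many files per class.
TOY_SAMPLES = [
    ('#include <stdio.h>\nint main(void) { printf("hi\\n"); return 0; }', "C"),
    ('#include <stdlib.h>\nvoid *p = malloc(sizeof(int));', "C"),
    ('def main():\n    print("hi")\n', "Python"),
    ('import sys\nfor arg in sys.argv:\n    print(arg)\n', "Python"),
    ('public class Hi { public static void main(String[] a) '
     '{ System.out.println("hi"); } }', "Java"),
    ('import java.util.List;\npublic interface Shape { double area(); }', "Java"),
]

models = train_one_vs_rest(TOY_SAMPLES)
print(predict(models, '#include <math.h>\ndouble f(double x) { return sqrt(x); }'))
```

The same one-vs-rest structure would apply to topical classification by swapping language labels for application categories, although with so little training data the toy model's predictions should not be read as indicative of the accuracy reported in the paper.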