Classification of source code archives

Authors:
Robert Krovetz;Secil Ugurel;C. Lee Giles
Affiliations:
NEC Laboratories America, Princeton, NJ;NEC Laboratories America, Princeton, NJ;Pennsylvania State University, University Park, PA
Venue:
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2003

Citing 7

Software reuse
ACM Computing Surveys (CSUR)

The reuse of uses in Smalltalk programming
ACM Transactions on Computer-Human Interaction (TOCHI)

Support vector machines: hype or hallelujah?
ACM SIGKDD Explorations Newsletter - Special issue on “Scalable data mining algorithms”

Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning

Applying information-retrieval methods to software reuse: a case study
Information Processing and Management: an International Journal

What's the code?: automatic classification of source code archives
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)

Cited 2

Supervised categorization of JavaScript™ using program analysis features
Information Processing and Management: an International Journal - Special issue: AIRS2005: Information retrieval research in Asia

Effectively Searching Maps in Web Documents
ECIR '09 Proceedings of the 31st European Conference on IR Research on Advances in Information Retrieval

Abstract

The World Wide Web contains a number of source code archives. Programs are usually classified into various categories within the archive by hand. We report on experiments for automatic classification of source code into these categories. We examined a number of factors that affect classification accuracy. Weighting features by expected entropy loss makes a significant improvement in classification accuracy. We show a Support Vector Machine can be trained to classify source code with a high degree of accuracy. We feel these results show promise for software reuse.
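
The abstract names expected entropy loss as the feature weighting but does not give a formula. As a rough illustration only, the sketch below assumes the standard definition for a binary feature and a binary class: the entropy of the class prior minus the expected class entropy after observing whether the feature is present. The function names and the counts in the usage note are hypothetical and are not taken from the paper.

```python
import math


def entropy(p):
    """Binary entropy in bits; defined as 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Expected entropy loss of one binary feature for one class.

    pos_with / pos_total: in-class documents containing the feature / all in-class documents.
    neg_with / neg_total: the same counts for out-of-class documents.
    (Assumed definition: prior class entropy minus expected posterior entropy.)
    """
    total = pos_total + neg_total
    n_with = pos_with + neg_with
    n_without = total - n_with

    prior = entropy(pos_total / total)
    p_feature = n_with / total

    # Class entropy among documents that do / do not contain the feature.
    h_with = entropy(pos_with / n_with) if n_with else 0.0
    h_without = entropy((pos_total - pos_with) / n_without) if n_without else 0.0

    return prior - (p_feature * h_with + (1.0 - p_feature) * h_without)
```

Under these assumptions, a token that appears in every in-class program and in no out-of-class program (for example, hypothetical counts 10 of 10 versus 0 of 10) removes all uncertainty about the class, while a token that appears equally often in both groups removes none. Ranking tokens by this score and keeping the top-scoring ones is one plausible way to build the feature set passed to a Support Vector Machine.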