Sourcerer: mining and searching internet-scale software repositories

Authors:
Erik Linstead;Sushil Bajracharya;Trung Ngo;Paul Rigor;Cristina Lopes;Pierre Baldi
Affiliations:
Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA;Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA
Venue:
Data Mining and Knowledge Discovery
Year:
2009

Abstract

Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software at Internet scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering and software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10 to 30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu .
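
The abstract does not state which information-theoretic measures are applied to the topic distributions. As a minimal sketch only, assuming Shannon entropy as a measure of how spread out a distribution is (for example, a document's topic mix as a proxy for tangling) and Jensen-Shannon divergence as a measure of dissimilarity between two source files, the idea could look like the following. The function names and example distributions are hypothetical and are not taken from the paper.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p.
    Assumption: used here as a proxy for how widely a document
    spreads over topics (higher means more tangled)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric, bounded by 1.
    Assumption: used here as a dissimilarity between two files'
    document-topic distributions (0 means identical)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical document-topic distributions over three topics.
file_a = [0.7, 0.2, 0.1]
file_b = [0.1, 0.2, 0.7]
file_c = [0.6, 0.3, 0.1]

# file_a should be closer to file_c than to file_b.
closer_to_c = js_divergence(file_a, file_c) < js_divergence(file_a, file_b)
```

Under these assumptions, a file concentrated on a single topic has entropy 0, while a file spread evenly over all topics reaches the maximum, so the same function could be applied to a topic's distribution over files to gauge scattering.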