Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software at Internet scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache, totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary of program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering and software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10 to 30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu .
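To make the AUC retrieval metric mentioned above concrete, the following minimal sketch computes the area under the ROC curve for a single ranked result list, using the standard pairwise definition: the fraction of (relevant, non-relevant) document pairs in which the relevant document scores higher, with ties counted as one half. The function name `retrieval_auc`, the file names, and the scores are hypothetical illustrations; they are not taken from Sourcerer, and this is not a description of how the reported 0.92 figure was computed.

```python
def retrieval_auc(scores, relevant):
    """Area under the ROC curve for one ranked retrieval result.

    scores:   dict mapping document id -> retrieval score (higher ranks earlier)
    relevant: set of document ids judged relevant to the query
    """
    pos = [s for d, s in scores.items() if d in relevant]
    neg = [s for d, s in scores.items() if d not in relevant]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one non-relevant document")

    # Count the pairs in which the relevant document outranks the
    # non-relevant one; a tie contributes half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four source files returned by a query.
scores = {"A.java": 0.9, "B.java": 0.7, "C.java": 0.4, "D.java": 0.2}
relevant = {"A.java", "C.java"}
print(retrieval_auc(scores, relevant))  # 0.75
```

An AUC of 1.0 means every relevant file is ranked above every non-relevant one, while 0.5 corresponds to a random ordering. The quadratic pairwise loop is chosen for clarity; a rank-based formulation would be preferable for long result lists.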