SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects

Authors:
Joel Ossher;Sushil Bajracharya;Erik Linstead;Pierre Baldi;Cristina Lopes
Affiliations:
Bren School of Information and Computer Sciences, University of California, Irvine, USA;Bren School of Information and Computer Sciences, University of California, Irvine, USA;Bren School of Information and Computer Sciences, University of California, Irvine, USA;Bren School of Information and Computer Sciences, University of California, Irvine, USA;Bren School of Information and Computer Sciences, University of California, Irvine, USA
Venue:
MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories
Year:
2009

Abstract

The open source movement has made vast quantities of source code available online for free, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in exploiting this potential fully is that the data are currently scattered across competing source code repositories, none of which is structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from SourceForge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, allowing cross-project usage information to be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
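To make the idea of cross-project usage stored in a relational database concrete, the following is a minimal sketch in Python using an in-memory SQLite database. The table names, column names, entity names and sample rows are all hypothetical and chosen for illustration only; they are not the SourcererDB metamodel, which is described in the paper itself. The sketch only shows how, once references into a shared jar are resolved to a single entity, a plain SQL join can count the usages coming from other projects.

```python
import sqlite3


def build_demo_db():
    """Create a toy database. The schema is an assumption, not SourcererDB's."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE projects  (project_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE entities  (entity_id INTEGER PRIMARY KEY,
                                project_id INTEGER, fqn TEXT);
        CREATE TABLE relations (relation_id INTEGER PRIMARY KEY,
                                project_id INTEGER,   -- project owning the caller
                                relation_type TEXT,
                                lhs_eid INTEGER,      -- source entity (caller)
                                rhs_eid INTEGER);     -- target entity (callee)
        """
    )
    # Hypothetical sample data: two applications and one shared jar.
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?)",
        [(1, "app-one"), (2, "app-two"), (3, "shared-lib.jar")],
    )
    conn.executemany(
        "INSERT INTO entities VALUES (?, ?, ?)",
        [
            (10, 3, "org.example.util.Strings.join"),
            (11, 1, "com.one.Main.run"),
            (12, 2, "com.two.Tool.exec"),
            (13, 1, "com.one.Helper.fmt"),
        ],
    )
    conn.executemany(
        "INSERT INTO relations (project_id, relation_type, lhs_eid, rhs_eid) "
        "VALUES (?, ?, ?, ?)",
        [
            (1, "CALLS", 11, 10),  # app-one calls into the shared jar
            (1, "CALLS", 13, 10),  # app-one calls into the shared jar again
            (2, "CALLS", 12, 10),  # app-two calls into the shared jar
            (1, "CALLS", 11, 13),  # call that stays inside app-one
        ],
    )
    return conn


def cross_project_usage(conn, fqn):
    """Count references to `fqn` per referencing project, skipping the owner."""
    rows = conn.execute(
        """
        SELECT p.name, COUNT(*)
        FROM relations r
        JOIN entities target ON target.entity_id = r.rhs_eid
        JOIN projects p      ON p.project_id = r.project_id
        WHERE target.fqn = ?
          AND target.project_id <> r.project_id
        GROUP BY p.name
        ORDER BY p.name
        """,
        (fqn,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(cross_project_usage(db, "org.example.util.Strings.join"))
```

In this toy setup, a reference counts as cross-project only when the project that owns the target entity differs from the project that owns the referencing relation, so the call that stays inside one application is excluded.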