Amassing and indexing a large sample of version control systems: Towards the census of public source code history

Authors:
Audris Mockus
Affiliations:
Avaya Labs Research, 233 Mt Airy Rd, Basking Ridge, NJ 07901, USA
Venue:
MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories
Year:
2009

Citing: 0
Cited by: 15

Collecting data about FLOSS development: the FLOSSMetrics experience
Proceedings of the 3rd International Workshop on Emerging Trends in Free/Libre/Open Source Software Research and Development

An empirical investigation into a large-scale Java open source code repository
Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement

An experience report on scaling tools for mining software repositories using MapReduce
Proceedings of the IEEE/ACM international conference on Automated software engineering

Recovering inter-project dependencies in software ecosystems
Proceedings of the IEEE/ACM international conference on Automated software engineering

A study of the uniqueness of source code
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering

Growth of newcomer competence: challenges of globalization
Proceedings of the FSE/SDP workshop on Future of software engineering research

Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report
Journal of Systems and Software

Using the GPGPU for scaling up mining software repositories
Proceedings of the 34th International Conference on Software Engineering

Looking for micro-process in large-scale data
Proceedings of the 2nd international workshop on Evidential assessment of software technologies

How do developers react to API deprecation?: the case of a Smalltalk ecosystem
Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering

Review code evolution history in OSS universe
Proceedings of the Fourth Asia-Pacific Symposium on Internetware

Replicating mining studies with SOFAS
Proceedings of the 10th Working Conference on Mining Software Repositories

Using topic models to understand the evolution of a software ecosystem
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering

Risky files: an approach to focus quality improvement effort
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering

Tag recommendation for open source software
Frontiers of Computer Science: Selected Publications from Chinese Universities

Abstract

The source code and its history represent the output and process of software development activities and are an invaluable resource for the study and improvement of software development practice. While individual projects and groups of projects have been extensively analyzed, some fundamental questions, such as the spread of innovation or the genealogy of the source code, can be answered only by considering the entire universe of publicly available source code and its history. We describe methods we developed over the last six years to gather, index, and update an approximation of such a universal repository for publicly accessible version control systems and for the source code inside a large corporation. While challenging, the task is achievable with limited resources. The bottlenecks in network bandwidth, processing, and disk access can be dealt with using the inherent parallelism of the tasks and suitable tradeoffs between storage and computation, but a completely automated discovery of public version control systems may require enticing the participation of the sampled projects. Such a universal repository would allow studies of global properties and origins of the source code that are not possible through other means.
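
The abstract does not describe the implementation, so the following is only an illustrative sketch of the general idea it mentions: repositories can be processed independently (the inherent parallelism), and indexing file versions by a content hash lets identical code found in different projects be located by lookup instead of repeated comparison. The function names, the use of SHA-1, the thread pool, and the toy data are all assumptions made for this example, not the paper's method.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def index_repository(repo):
    """Hash every file version of one repository, independently of all others."""
    name, versions = repo
    return [(hashlib.sha1(content).hexdigest(), name, path)
            for path, content in versions]


def build_index(repos, workers=4):
    """Merge per-repository results into a content-hash -> locations index."""
    index = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input repositories.
        for entries in pool.map(index_repository, repos.items()):
            for digest, name, path in entries:
                index[digest].append((name, path))
    return dict(index)


# Toy data (hypothetical): two projects that share one identical file.
repos = {
    "project_a": [("src/util.c", b"int add(int a, int b) { return a + b; }\n")],
    "project_b": [("lib/util.c", b"int add(int a, int b) { return a + b; }\n"),
                  ("lib/main.c", b"int main(void) { return 0; }\n")],
}

index = build_index(repos)
# Content hashes that occur in more than one place point to shared code.
shared = {digest: locs for digest, locs in index.items() if len(locs) > 1}
```

In this sketch only the digests and locations are kept in the index, which trades a hashing pass over every file version for much cheaper later comparisons. A real collection would read versions from the version control systems themselves and persist the index on disk.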