Source code and its history represent the output and process of software development activities and are an invaluable resource for the study and improvement of software development practice. While individual projects and groups of projects have been extensively analyzed, some fundamental questions, such as the spread of innovation or the genealogy of source code, can be answered only by considering the entire universe of publicly available source code and its history. We describe methods we developed over the last six years to gather, index, and update an approximation of such a universal repository for publicly accessible version control systems and for the source code inside a large corporation. While challenging, the task is achievable with limited resources. The bottlenecks in network bandwidth, processing, and disk access can be dealt with by using the inherent parallelism of the tasks and suitable tradeoffs between the amount of storage and computation, but completely automated discovery of public version control systems may require enticing the sampled projects to participate. Such a universal repository would allow studies of global properties and origins of source code that are not possible through other means.