A scalable crawler framework for FLOSS data

Authors:
Lingxiao Zhang;Yanzhen Zou;Bing Xie
Affiliations:
Peking University and Ministry of Education, Beijing, P.R. China;Peking University and Ministry of Education, Beijing, P.R. China;Peking University and Ministry of Education, Beijing, P.R. China
Venue:
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Year:
2013

Abstract

Free/Libre/Open Source Software (FLOSS) data, such as bug reports, mailing lists, and related web pages, contain valuable information for reusing open source software projects. Before conducting further experiments on FLOSS data, researchers often need to download the data into a local storage system. We refer to this preprocessing step as FLOSS data retrieval, which in many cases can be a challenging task. In this paper, we propose a crawler framework to ease the process of FLOSS data retrieval. To cope with the various types of FLOSS data scattered across the Internet, we designed the framework in a scalable manner, so that a crawler program can be easily plugged into the system to extend its functionality. Researchers can run the retrieval process on datasets of various types and from various sources simply by adding new configurations to the system. We have implemented the framework and provide its basic functions through web-based interfaces. We demonstrate the usage of the system through a detailed case study in which we retrieved various types of datasets related to the Apache Lucene project using our framework.
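
The abstract does not show the framework's interfaces, so the following is only a minimal sketch of how a plug-in crawler design driven by configuration might look. Every name in it (CRAWLERS, register, Crawler, MailingListCrawler, BugReportCrawler, run, fake_fetch, and the example.org URLs) is a hypothetical illustration, not the authors' API. Network access is replaced by a stub so the sketch runs offline.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

# Hypothetical registry: maps a data-type name to the crawler class handling it.
CRAWLERS: Dict[str, Type["Crawler"]] = {}


def register(data_type: str):
    """Class decorator that plugs a crawler into the registry."""
    def decorator(cls):
        CRAWLERS[data_type] = cls
        return cls
    return decorator


class Crawler(ABC):
    """Base class that every plug-in crawler extends (assumed design)."""

    def __init__(self, config: dict, fetch: Callable[[str], str]):
        self.config = config
        self.fetch = fetch  # injected downloader, so it can be stubbed

    @abstractmethod
    def crawl(self) -> List[dict]:
        """Return the retrieved records for this configuration."""


@register("mailing_list")
class MailingListCrawler(Crawler):
    def crawl(self) -> List[dict]:
        base = self.config["base_url"]
        urls = [f"{base}/{month}.mbox" for month in self.config["months"]]
        return [{"type": "mailing_list", "url": u, "body": self.fetch(u)} for u in urls]


@register("bug_report")
class BugReportCrawler(Crawler):
    def crawl(self) -> List[dict]:
        base = self.config["base_url"]
        urls = [f"{base}/{issue}" for issue in self.config["issue_ids"]]
        return [{"type": "bug_report", "url": u, "body": self.fetch(u)} for u in urls]


def run(configs: List[dict], fetch: Callable[[str], str]) -> List[dict]:
    """Dispatch each configuration entry to the crawler registered for its type."""
    storage: List[dict] = []  # stands in for the local storage system
    for config in configs:
        crawler_cls = CRAWLERS[config["type"]]
        storage.extend(crawler_cls(config, fetch).crawl())
    return storage


def fake_fetch(url: str) -> str:
    """Stub downloader; a real one would issue an HTTP request."""
    return f"<content of {url}>"


# Adding a dataset means adding a configuration entry, not changing run().
configs = [
    {"type": "mailing_list", "base_url": "https://example.org/lists/dev",
     "months": ["201301", "201302"]},
    {"type": "bug_report", "base_url": "https://example.org/issues",
     "issue_ids": ["DEMO-1"]},
]
records = run(configs, fake_fetch)
```

In this sketch, supporting a new kind of FLOSS data would mean writing one more registered Crawler subclass, while retrieving another dataset of an existing kind would only require another entry in configs.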