Context: The authors wanted to assess whether the quality of published human-centric software engineering experiments was improving, which required a reliable means of assessing the quality of such experiments.

Aims: The aims of the study were to confirm the usability of a quality evaluation checklist, to determine how many reviewers are needed per paper reporting an experiment, and to specify an appropriate process for evaluating quality.

Method: Eight reviewers applied a nine-question quality checklist to four papers describing human-centric software engineering experiments. The study ran in two parts: the first based on individual assessments, the second on collaborative evaluations.

Results: Inter-rater reliability was poor for individual assessments but much better for joint evaluations. Four reviewers working in two pairs with discussion were more reliable than eight reviewers with no discussion. The sum of the nine criteria was more reliable than individual questions or a simple overall assessment.

Conclusions: If quality evaluation is critical, more than two reviewers are needed and a round of discussion is necessary. We advise using quality criteria and basing the final assessment on the sum of the criteria, not on a separate overall judgment. Our results are limited by the small number of papers assessed and the relatively extensive expertise of the reviewers. In addition, the results of the second part of the study may have been affected by the removal of the time restriction on reviews, as well as by the consultation process.
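The abstract does not report which inter-rater reliability statistic the authors used, nor their raw scores, so the following is an illustrative sketch only. It assumes hypothetical data in which each of four reviewers sums nine binary checklist answers into a 0-9 score per paper, and it measures agreement between reviewers with Kendall's coefficient of concordance (W), one common choice for this kind of multi-rater design; the study itself may have used a different statistic.

```python
# Illustrative sketch only: the reviewer scores and the choice of Kendall's W
# are assumptions for demonstration, not the study's actual data or method.
from statistics import mean

def average_ranks(scores):
    """Rank a list of scores (1 = lowest), assigning average ranks to ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's W for ratings[r][p] = total score reviewer r gave paper p.

    Returns a value in [0, 1]; 1 means all reviewers rank the papers
    identically. Tie correction is omitted to keep the sketch short.
    """
    m, n = len(ratings), len(ratings[0])
    rank_matrix = [average_ranks(r) for r in ratings]
    rank_sums = [sum(rank_matrix[r][p] for r in range(m)) for p in range(n)]
    s = sum((rs - mean(rank_sums)) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical data: 4 reviewers x 4 papers, each cell the sum of nine
# binary checklist answers (0-9), mirroring the study's aggregate scoring.
ratings = [
    [7, 4, 8, 3],
    [6, 4, 7, 2],
    [7, 5, 8, 3],
    [5, 6, 4, 2],
]
print(f"Kendall's W = {kendalls_w(ratings):.2f}")
```

With the hypothetical scores above, the sketch prints W = 0.70: three reviewers rank the papers identically and one disagrees on two papers. Values near 1 indicate near-identical rankings, which is the kind of agreement the joint, discussion-based evaluations improved.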