Context: During systematic literature reviews it is necessary to assess the quality of empirical papers. Current guidelines suggest that two researchers should independently apply a quality checklist and that any disagreements must be resolved. However, there is little empirical evidence concerning the effectiveness of these guidelines.
Aims: This paper investigates three techniques that can be used to improve the reliability (i.e. the consensus among reviewers) of quality assessments: the number of reviewers, the use of a set of evaluation criteria, and consultation among reviewers. We undertook a series of studies to investigate these factors.
Method: Two studies involved four research papers and eight reviewers using a quality checklist with nine questions. The first study was based on individual assessments, the second on joint assessments with a period of inter-rater discussion. A third, more formal randomised block experiment involved 48 reviewers assessing two of the papers used previously, in teams of one, two, and three persons, to assess the impact of discussion in teams of different sizes, using the evaluations of the "teams" of one person as a control.
Results: In the first two studies, inter-rater reliability was poor for individual assessments but better for joint evaluations. However, the results of the third study contradicted those of the second: inter-rater reliability was poor for all groups, and worse for teams of two or three than for individuals.
Conclusions: When performing quality assessments for systematic literature reviews, we recommend using three independent reviewers and adopting the median assessment. A quality checklist seems useful, but it is difficult to ensure that the checklist is both appropriate and understood by the reviewers. Furthermore, future experiments should ensure that participants are given more time to understand the quality checklist and to evaluate the research papers.
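
The recommendation to take the median of three independent assessments is concrete enough to illustrate. Below is a minimal Python sketch, not the authors' instrument: the example scores, the 1/0.5/0 (yes/partly/no) scoring scale, and the exact-agreement proxy for reliability are all assumptions, since the abstract fixes only the number of checklist questions, the number of reviewers, and the use of the median.

from statistics import median

# Hypothetical scores from three independent reviewers on a nine-question
# quality checklist. The 1 / 0.5 / 0 (yes / partly / no) scale is an assumed
# convention; the abstract does not specify the scoring scheme.
reviewer_scores = [
    [1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 1.0, 0.0, 0.5],  # reviewer A
    [1.0, 0.5, 0.5, 0.0, 1.0, 1.0, 1.0, 0.5, 0.5],  # reviewer B
    [0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 0.0, 0.0],  # reviewer C
]

# Group the answers per question across reviewers.
per_question = list(zip(*reviewer_scores))

# Consolidated assessment: the median of the three judgements per question,
# which is robust to a single outlying reviewer.
consolidated = [median(answers) for answers in per_question]
print("per-question medians:", consolidated)
print("overall quality score: %.1f / 9" % sum(consolidated))

# Crude reliability proxy (not the statistic used in the paper): the
# proportion of questions on which all three reviewers agree exactly.
exact_agreement = sum(len(set(a)) == 1 for a in per_question) / len(per_question)
print("exact agreement: %.2f" % exact_agreement)

Taking the median per question, rather than averaging or negotiating a joint score, keeps each reviewer's judgement independent, which is exactly the property the third study suggests matters: discussion in teams of two or three did not improve, and in fact worsened, inter-rater reliability.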