Empirical Evaluation of Scoring Methods

  • Authors: Luca Pulina

  • Affiliations: Laboratory of Systems and Technologies for Automated Reasoning (STAR-Lab), DIST, Università di Genova, Viale Causa 13, 16145 Genova, Italy, pulina@dist.unige.it

  • Venue: Proceedings of the Third Starting AI Researchers' Symposium (STAIRS 2006)

  • Year: 2006

Abstract

The automated reasoning research community has grown accustomed to competitive events where a pool of systems is run on a pool of problem instances with the purpose of ranking the systems according to their performance. At the heart of such a ranking lies the method used to score the systems, i.e., the procedure used to compute a numerical quantity that should summarize the performance of a system with respect to the other systems and to the pool of problem instances. In this paper we evaluate several scoring methods, including methods used in automated reasoning contests, methods based on voting theory, and a new method that we introduce. Our research aims to establish which of these methods maximizes the effectiveness measures that we devised to quantify desirable properties of scoring procedures. Our approach is empirical, in that we compare the scoring methods by computing the effectiveness measures on the data from the 2005 comparative evaluation of solvers for quantified Boolean formulas. The results of our experiments give useful indications of the relative strengths and weaknesses of the scoring methods, and also allow us to draw some conclusions that are independent of the specific method adopted.
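
To make the notion of a scoring procedure concrete, the sketch below implements one voting-theory method of the kind the paper considers: a Borda-style count in which, on each instance, a solver earns one point per competitor it beats (solved versus unsolved first, then CPU time). The solver names, timings, and result data are invented for illustration; this is an assumed example, not the paper's own scoring methods or evaluation data.

```python
# A minimal sketch (assumed illustration, not the paper's actual procedure) of a
# Borda-style scoring method: on each instance, a solver earns one point for
# every competitor it beats, where "beats" means solving an instance the other
# did not, or solving it in strictly less CPU time.

# Hypothetical results: results[solver][instance] = CPU time in seconds,
# or None if the solver timed out or failed on that instance.
results = {
    "solverA": {"i1": 1.2, "i2": None, "i3": 45.0},
    "solverB": {"i1": 3.4, "i2": 120.0, "i3": None},
    "solverC": {"i1": None, "i2": 80.0, "i3": 50.0},
}

def borda_scores(results):
    solvers = list(results)
    instances = {i for runs in results.values() for i in runs}
    scores = dict.fromkeys(solvers, 0)
    for inst in instances:
        for s in solvers:
            for t in solvers:
                if s == t:
                    continue
                time_s, time_t = results[s].get(inst), results[t].get(inst)
                # s beats t if s solved the instance and t did not,
                # or if both solved it and s was strictly faster.
                if time_s is not None and (time_t is None or time_s < time_t):
                    scores[s] += 1
    return scores

# Rank the hypothetical solvers by their aggregate score.
ranking = sorted(borda_scores(results).items(), key=lambda kv: -kv[1])
print(ranking)  # e.g. [('solverA', 4), ('solverC', 3), ('solverB', 2)]
```

A different scoring method (for example, counting solved instances with CPU time as a tie-breaker) could rank the same hypothetical data differently; effectiveness measures of the kind described in the abstract are meant to compare such procedures against desirable properties.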