On SAT Instance Classes and a Method for Reliable Performance Experiments with SAT Solvers
Annals of Mathematics and Artificial Intelligence
Performance testing of combinatorial solvers with isomorph class instances. In Proceedings of the 2007 Workshop on Experimental Computer Science.
Empirical Evaluation of Scoring Methods. In STAIRS 2006: Proceedings of the Third Starting AI Researchers' Symposium.
Statistical methodology for comparison of SAT solvers. In SAT'10: Proceedings of the 13th International Conference on Theory and Applications of Satisfiability Testing.
Multicriterion Decision in Management: Principles and Practice.
Extended failed-literal preprocessing for quantified Boolean formulas. In SAT'12: Proceedings of the 15th International Conference on Theory and Applications of Satisfiability Testing.
In several fields, Satisfiability being one, regular competitions compare multiple solvers in a common setting. Because some benchmarks of interest are too difficult for solvers to complete within the available time, time-outs occur and must be accounted for. Through a curious evolution, the number of time-outs became the only factor considered in evaluation. Previous work at SAT 2010 observed that this evaluation method is unreliable and lacks a way to attach statistical significance to its conclusions. However, the alternative proposed there was quite complicated and is unlikely to see general use.

This paper describes a simpler system, called careful ranking, that permits a measure of statistical significance and still meets many of the practical requirements of an evaluation system. It incorporates one of the main ideas of the previous work: outcomes must be free of assumptions about timing distributions, so non-parametric methods are necessary. Unlike the previous work, it accounts for ties. The careful ranking system has several important nonmathematical properties that are desirable in an evaluation system: (1) the relative ranking of two solvers cannot be influenced by a third solver; (2) after the competition results are published, a researcher can run a new solver on the same benchmarks and determine where it would have ranked; (3) small timing differences can be ignored; (4) the computations are easy to understand and reproduce. Voting systems proposed in the literature lack some or all of these properties.

One property of careful ranking is that the pairwise ranking may contain cycles. Whether this is a bug or a feature is a matter of opinion; whether it occurs among leaders in practice is a matter of experience. The system is implemented and has been applied to the SAT 2009 Competition. No cycles occurred among the leaders, but there was a cycle among some low-ranking solvers. To measure robustness, both the new and the current systems were recomputed over a range of simulated time-outs to see how often the top rankings changed: times above the simulated time-out are reclassified as time-outs, and the rankings are recomputed with this data. Careful ranking exhibited far fewer changes.
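As a concrete illustration, the following Python sketch shows one plausible way to realize the pairwise comparison described above: each benchmark votes for the solver that finishes sufficiently faster, timing differences below a noise margin count as ties, and significance over the non-tied benchmarks is assessed with a non-parametric sign test. The function names, the margin `epsilon`, the time limit, and the choice of the sign test are assumptions made for illustration, not the paper's exact formulation.

```python
from math import comb

TIMEOUT = 5000.0  # assumed competition time limit, in seconds


def careful_compare(times_a, times_b, epsilon=1.0, timeout=TIMEOUT):
    """Pairwise comparison of two solvers over a common benchmark set.

    times_a / times_b hold per-benchmark run times, with None marking a
    time-out. A benchmark is a tie when both solvers time out, or when
    the two times differ by less than epsilon (property 3: small timing
    differences are ignored). A time-out loses to any finished run.
    """
    wins_a = wins_b = ties = 0
    for ta, tb in zip(times_a, times_b):
        ta = timeout if ta is None else ta
        tb = timeout if tb is None else tb
        if (ta >= timeout and tb >= timeout) or abs(ta - tb) < epsilon:
            ties += 1
        elif ta < tb:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties


def sign_test_p(wins_a, wins_b):
    """Two-sided sign-test p-value over the non-tied benchmarks: under
    the null hypothesis that the two solvers are equivalent, each
    non-tied benchmark is a fair coin flip."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)
```

For example, with `epsilon = 1.0`, runs finishing at 12.3 s and 12.9 s tie, while a time-out loses to any finished run; `sign_test_p` then indicates whether the observed win margin is distinguishable from chance. Note that because the comparison is purely pairwise, a ranking assembled from such comparisons can in principle contain cycles, as the abstract notes.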
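The robustness experiment can be sketched in the same style: times at or above a smaller, simulated time-out are reclassified as time-outs and the rankings are recomputed. Here `rank_solvers` is a hypothetical stand-in for either scoring system, and the grid of simulated limits is an assumption.

```python
def with_simulated_timeout(times, new_timeout):
    """Treat runs at or above the simulated limit as time-outs (None)."""
    return [None if t is None or t >= new_timeout else t for t in times]


def top_rank_changes(all_times, rank_solvers,
                     limits=(1000, 2000, 3000, 4000, 5000)):
    """Count how often the top of the ranking changes as the simulated
    time-out varies; fewer changes indicate a more robust scoring system.

    all_times maps solver name -> list of per-benchmark times;
    rank_solvers maps such a table to a ranked list of solver names.
    """
    changes = 0
    previous = None
    for limit in limits:
        censored = {s: with_simulated_timeout(t, limit)
                    for s, t in all_times.items()}
        top = rank_solvers(censored)[0]
        if previous is not None and top != previous:
            changes += 1
        previous = top
    return changes
```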