Careful ranking of multiple solvers with timeouts and ties

  • Authors: Allen Van Gelder

  • Affiliations: University of California, Santa Cruz, CA

  • Venue: SAT'11 Proceedings of the 14th International Conference on Theory and Applications of Satisfiability Testing
  • Year: 2011

Abstract

In several fields, Satisfiability being one, there are regular competitions to compare multiple solvers in a common setting. Because some benchmarks of interest are too difficult for all solvers to complete within the available time, time-outs occur and must be taken into account. Through a strange evolution, time-outs became the only factor considered in evaluation. Previous work at SAT 2010 observed that this evaluation method is unreliable and lacks a way to attach statistical significance to its conclusions. However, the alternative proposed there was quite complicated and is unlikely to see general use. This paper describes a simpler system, called careful ranking, that permits a measure of statistical significance and still meets many of the practical requirements of an evaluation system. It incorporates one of the main ideas of the previous work: that outcomes must be freed of assumptions about timing distributions, so non-parametric methods are necessary. Unlike the previous work, it incorporates ties. The careful ranking system has several important nonmathematical properties that are desirable in an evaluation system: (1) the relative ranking of two solvers cannot be influenced by a third solver; (2) after the competition results are published, a researcher can run a new solver on the same benchmarks and determine where it would have ranked; (3) small timing differences can be ignored; (4) the computations are easy to understand and reproduce. Voting systems proposed in the literature lack some or all of these properties. One consequence of careful ranking is that the pairwise ranking may contain cycles. Whether this is a bug or a feature is a matter of opinion; whether it occurs among the leaders in practice is a matter of experience. The system is implemented and has been applied to the SAT 2009 Competition. No cycles occurred among the leaders, but there was a cycle among some low-ranking solvers. To measure robustness, the new and current systems were recomputed with a range of simulated time-outs: times above the simulated time-out are reclassified as time-outs, and the rankings are recomputed with this data to see how often the top rankings change. Careful ranking exhibited many fewer changes.
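The abstract only sketches the mechanics, so the following is a minimal illustrative sketch, in Python, of how a pairwise comparison with time-outs and ties could be realized. It assumes a per-benchmark time limit TIMEOUT, a relative noise threshold EPSILON below which timing differences are treated as ties, and a two-sided sign test on the non-tied benchmarks for significance; these names, the specific tie rule, and the choice of sign test are assumptions made for illustration, not the paper's exact definitions.

```python
import math
from typing import Optional, Sequence

TIMEOUT = 1200.0   # hypothetical per-benchmark time limit (seconds)
EPSILON = 0.10     # hypothetical noise threshold: differences within 10% count as ties


def pairwise_record(times_a: Sequence[Optional[float]],
                    times_b: Sequence[Optional[float]]):
    """Count wins, losses, and ties of solver A against solver B.

    A time of None (or >= TIMEOUT) is a time-out.  A benchmark is a tie when
    both solvers time out, or when the two times differ by less than EPSILON
    relative to the larger time; otherwise the faster solver wins.  Because
    only A and B are consulted, a third solver cannot affect this record.
    """
    wins = losses = ties = 0
    for ta, tb in zip(times_a, times_b):
        ta = TIMEOUT if ta is None else min(ta, TIMEOUT)
        tb = TIMEOUT if tb is None else min(tb, TIMEOUT)
        if ta >= TIMEOUT and tb >= TIMEOUT:
            ties += 1                              # both timed out
        elif abs(ta - tb) <= EPSILON * max(ta, tb):
            ties += 1                              # small timing difference ignored
        elif ta < tb:
            wins += 1
        else:
            losses += 1
    return wins, losses, ties


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided sign-test p-value on the non-tied benchmarks (ties dropped),
    using the normal approximation to the binomial distribution."""
    n = wins + losses
    if n == 0:
        return 1.0
    z = (wins - losses) / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))      # equals 2 * (1 - Phi(|z|))


def reclassify(times: Sequence[Optional[float]], simulated_limit: float):
    """Robustness experiment from the abstract: times above a smaller,
    simulated time-out are reclassified as time-outs before re-ranking."""
    return [None if t is None or t >= simulated_limit else t for t in times]
```

As a worked example under this sketch, 60 wins, 40 losses, and 20 ties give z = (60 - 40)/√100 = 2.0 and a p-value of about 0.046, so that win margin would be significant at the usual 5% level. The reclassify helper mirrors the robustness experiment described above: rankings are recomputed after times above a smaller, simulated limit are reclassified as time-outs.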