SAT Competition 2020

The SAT Competitions constitute a well-established series of yearly open international algorithm implementation competitions, focusing on the Boolean satisfiability (or propositional satisfiability, SAT) problem. In this article, we provide a detailed account of the 2020 instantiation of the SAT Competition, including the new competition tracks and benchmark selection procedures, an overview of the solving strategies implemented in top-performing solvers, and a detailed analysis of the empirical data obtained from running the competition.


Introduction
From what was once mainly the archetypal intractable (in particular NP-complete) problem, propositional satisfiability (or Boolean satisfiability, SAT) has flourished into a success story of modern computer science [1]. This is due to advances in SAT solvers, i.e., implementations of decision procedures for SAT, which today form a central computational tool for solving real-world problem instances of various kinds of NP-hard search and optimization problems. With standardized input formats, readily-available APIs for incremental applications, and certified proof logging and checking capabilities, applications of SAT solver technology have branched from the first breakthrough applications in automated planning, test pattern generation and hardware verification to thousands of different application settings.
The success of SAT would not be possible without the persistent efforts of the SAT community to further improve the performance and robustness of SAT solvers. The SAT Competition series, with a history dating back to the early 90s, aims to support and provide further incentives for maintaining this progress. Organized yearly as an international open event, SAT Competitions (and their variants in the forms of SAT Races and a SAT Challenge) [2,3,4,5,6,7,8] have a consistent track record in receiving tens of solver submissions yearly, submitted by the community at large for obtaining a snapshot of the current state-of-the-art in practical SAT solving. Alongside participating solvers, the competition invites through open calls submissions of benchmark instances representing, in particular, new interesting applications scenarios of SAT solvers. Indeed, in addition to evaluating recently developed solvers, an important aspect of the SAT competition series is to collect on a yearly basis new benchmark sets, consisting of instances from various different application settings, which together with benchmark sets from previous years constitute a standard dataset for use in research papers and SAT solver development.
This article focuses on the 2020 instantiation of the SAT Competitions. To this end, we provide a detailed account of SAT Competition 2020 in terms of organizational details, competition tracks, participating solvers, benchmarks, and the empirical results from the competition. In terms of competition tracks, two new tracks, namely the cloud track and an application-specific track, were introduced in 2020, in addition to the already earlier established main, parallel, and incremental tracks; we provide motivation and the new organizational details for both of these new tracks. In terms of solvers, we provide an overview of solving strategies and other details implemented in the top-performing solvers from the competition, complementing the individual solver descriptions available in the 2020 competition proceedings [9]. As for benchmarks, we describe how the 2020 benchmark sets were constructed for each of the competition tracks, with an overview of the benchmarks contributed to the 2020 competition. In terms of empirical results, we provide further analysis of the competition results, going beyond the standard rankings provided on the SAT competition web pages. Finally, we also provide a discussion of lessons learned and ideas for future editions of SAT competitions.
This article is organized as follows. We start by providing an overview on the competition, including details on and motivations for the several competition tracks, the rules and other technical requirements of the competition, the ranking schemes used in evaluating the competing solvers, and the computing environments used for executing the competition (Section 2). We then provide an overview of the benchmark sets used in evaluating the solvers, including their origins and the selection process used for constructing the sets (Section 3).
In Section 4 we provide an overview of the competition results, followed by a survey of the solving strategies implemented in the top-ranking solvers in Section 5. Going considerably beyond the plain competition rankings, we provide in Section 6 an in-depth analysis of the competition data from different perspectives, including correlation analysis of the runtime performance of solvers and marginal contributions of individual solvers to the "virtual best solver" and to portfolios constructed from the competing solvers. The article is concluded with future prospects in Section 7.

Overview of SAT Competition 2020
In this section, we describe the individual 2020 SAT Competition tracks, explain the requirements for participation and the ranking criteria, as well as describe the computing infrastructure used for executing the competition.

Competition Tracks
SAT Competition 2020 consisted of seven tracks: Main track, No-Limits track, Planning track, "Glucose hack" track, Incremental Library track, Parallel track, and the Cloud Track for massively parallel SAT solvers.

The Main, No-Limits, Planning, and "Glucose hack" tracks
The focus of the traditional Main track is on sequential SAT solvers and their evaluation on structured, non-random benchmarks coming from various application areas.
To participate in the Main track, solvers needed to output certificates for both satisfiable and unsatisfiable instances. Moreover, the source code of the solver was required to be made publicly available. Solvers not complying with either of these two criteria were only evaluated in a so-called No-Limits track and were not eligible for the Main track awards. The No-Limits track thus enabled participation of closed-source solvers (not being able or willing to expose the source code for legal or other reasons) as well as portfolio solvers (combining two or more core SAT solvers developed by different groups of authors; cf. Sect. 2.2). Without such limits, submissions could even be solvers that use a lookup table or a similar mechanism to determine solutions. Thus, the No-Limits track was only evaluated with respect to newly submitted benchmark instances, i.e., on instances which were submitted to SAT Competition 2020.
However, solvers in No-Limits still competed against all other solvers submitted to the Main Track. Thus, to deserve a mention, a No-Limits solver would need to rank among the best-performing solvers among the Main Track participants. In 2020, the top ranked solvers in the No-Limits track were the same as in the Main track. This also indicates the stability of results under the exclusion of old benchmark instances.
Complementing the generality advocated by the standard SAT Competition tracks, in which solvers are evaluated on a benchmark set including instances from various types of problem domains, for 2020 the organizers aimed to experiment with the potential of a more application-specific track, each year highlighting a different problem domain where SAT solving technology helps to advance the state of the art. In 2020, the Planning track represented the first trial instantiation of this idea. The focus of this track was specifically on efficiently solving instances arising from the domain of SAT-based automated planning [10]. Automated planning was chosen as the target problem domain of this first instantiation of the domain-specific tracks due to its centrality as one of the first breakthrough applications of SAT solvers.
In the past, several advances in SAT solving required only small modifications of an established solver to achieve a considerable contribution. Hack tracks encourage participation of such small modifications. The limit for being considered a "hack" was set, somewhat arbitrarily, to 1000 non-space characters of edit distance from the sources of Glucose 3 [11]; the specific threshold is not central here, the idea being essentially to only allow relatively small changes to the Glucose code base, i.e., "quick hacks" to Glucose. Unfortunately, in 2020 there were not enough participants in this sub-track and so we do not report on it in the results section.
We evaluated all 64 solver submissions (including different configurations of specific solvers) to the Main track. Out of the 64 solvers, eight were explicitly submitted to the No-Limits track. Four solvers were demoted to the No-Limits track due to outputting invalid unsatisfiability proof certificates. Six solvers were disqualified due to outputting truth assignments which did not satisfy the corresponding benchmark instance. This left us with 46 configurations of 22 solvers, including one Glucose hack.

Incremental Library Track
The Incremental Library track was first introduced in SAT Race 2015 [12] and also took place in SAT Competitions 2016 and 2017. In the Incremental Library track the underlying idea is to mimic scenarios where a SAT solver is used as a back-end solver in a more complex tool (typically solving a harder problem than SAT) and is called multiple times before the enclosing tool reaches its final state. "Incremental" here refers to the idea that the individual calls to the SAT solver are not independent, but may share a common subset of the input clauses or differ in the presence of additional unit clause assumptions [13,14,15]. Examples for applications of incremental SAT solving are counterexample-guided abstraction refinement (CEGAR) based approaches, e.g., for Bounded Model Checking [16], SAT-based planning [17], multi-agent path finding [18], and satisfiability modulo theories (SMT) solvers [19].
Instead of using or extending the DIMACS input format, in the Incremental Library track a general incremental interface called IPASIR (Re-entrant Incremental Solver API) is employed [12]. The idea is that we actually run the enclosing tool on its own benchmark and communicate with the competing SAT solver through this interface. SAT solvers that are submitted for this track must hence implement the interface. Furthermore, it should be noted that the solutions output by a solver may, in general, influence the forthcoming invocations of the solver.
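For concreteness, the core of the IPASIR interface consists of a small set of C functions; the following sketch lists the main entry points (with simplified integer types, and omitting the additional callbacks for termination and learned-clause export that the actual header contains).

```cpp
// Sketch of the core IPASIR entry points (C linkage). Literals follow the
// DIMACS convention: non-zero integers, where 0 terminates a clause.
extern "C" {
  const char *ipasir_signature();               // solver name and version
  void *ipasir_init();                          // create a solver instance
  void  ipasir_release(void *solver);           // destroy a solver instance
  void  ipasir_add(void *solver, int lit);      // add a literal; 0 closes the clause
  void  ipasir_assume(void *solver, int lit);   // assumption for the next solve call
  int   ipasir_solve(void *solver);             // 10 = SAT, 20 = UNSAT, 0 = interrupted
  int   ipasir_val(void *solver, int lit);      // value of lit in the model (after SAT)
  int   ipasir_failed(void *solver, int lit);   // 1 if the assumption was needed for UNSAT
}
```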
Six solvers were submitted to the Incremental Library track. Two of the six solvers were disqualified due to outputting wrong answers.

Parallel Track
The Parallel track evaluates the runtime performance of SAT solvers making use of multiple processor cores in terms of wall-clock time. The benchmarks are the same as in the Main track. In contrast to the Main track, proof logging for unsatisfiable instances is not required in the Parallel track. A total of 14 solver configurations, based on 10 solvers, were submitted to the Parallel track. Three solver configurations were disqualified due to wrong answers.

Cloud Track
The Cloud track was a new development in the SAT Competitions for 2020. The track focuses on evaluating distributed solvers running on multiple machines in a network. Communication between the machines is possible using MPI and SSH. We received six solver submissions to the Cloud track.

Mandatory Participation Requirements
The following requirements were imposed for participating in SAT Competition 2020.
Source Code. The source code of submitted SAT solvers had to be made available (licensed for research purposes) except for the solvers participating only in the No-Limits track.
Description. A short system description was required for each solver submission, including a list of all authors involved in developing the solver, description of any non-standard algorithmic techniques and data structures implemented in the solver, as well as references to the relevant literature. These system descriptions have been collected and made available publicly in the non-refereed competition proceedings [9].
Benchmarks. The authors of solvers participating in the Main track were required to submit 20 "new" benchmark instances. The exact details of this rule are further explained in Section 3.1. In short, this rule guaranteed that the competition could be run on instances mostly unseen to the solver developers prior to the competition. Moreover, by making these benchmarks publicly available after the competition, the SAT community benefits by having an ever growing repository of diverse problems that next developments will target. The descriptions of the submitted benchmarks are also made available in the competition proceedings [9].
Input and Output Format. The benchmark instances were presented to the solvers in the de facto standard DIMACS input format for propositional formulas in conjunctive normal form (CNF). A simple extension of this format was to be adhered to when printing the satisfying assignment (see, e.g., [8], Section 2.4).
Where required, proofs of unsatisfiability were to be output in the DRAT format [20], either in its textual version (which is also very similar to the DIMACS input format) or in a more compact binary version (for more details, see [21], Unsat Certificates). Details on certification are further discussed in Section 2.4.
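To illustrate the formats on a toy example (not taken from the competition benchmarks), the following is a small unsatisfiable CNF formula over two variables in DIMACS format; on a satisfiable instance, a solver would instead print the status line "s SATISFIABLE" followed by "v" lines listing a model, terminated by 0.

```
c toy.cnf: three clauses over the variables 1 and 2
p cnf 2 3
1 2 0
-1 2 0
-2 0
```

A textual DRAT refutation of this formula consists of the two lines "2 0" and "0": the unit clause 2 follows by reverse unit propagation from the first two clauses, and together with the clause -2 it yields the empty clause, which closes the proof.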

Number of Submissions.
Due to the sheer number of participants in the SAT Competitions, in order to make it feasible to run the whole competition, specific limits were set on the number of submitted solvers. In particular, each solver author was allowed to be an author of at most four different sequential solvers, two different parallel solvers, and one "Glucose hack" sub-track solver. Two solvers were considered different as soon as their sources differed, the compilation options were different, or different command line options were used (with the exception of an option enabling or disabling the proof output).
Portfolio Solvers. Apart from the No-Limits track, participants were not allowed to submit a portfolio of solvers, i.e., a combination of two or more core SAT solvers developed by different groups of authors. This rule is mainly meant to encourage the SAT community to invest more effort into developing new solver code bases. Moreover, while we acknowledge that research on solver selection tools that typically orchestrate portfolio solvers is interesting, it is not at present the focus of the SAT competitions.
Organizers. The organizers of the competition were not allowed to participate.

Solver Ranking and Disqualification
Solvers were ranked using a PAR-2 score based on a 5,000-second timeout. The PAR-2 scheme assigns to each instance as many points as the number of seconds the solver took to solve it, or twice the time limit, i.e., 10,000 points, if the instance was not solved. In particular, this means that the lower the score a solver obtains, the better the solver performs.
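As an illustrative sketch (the data structures below are hypothetical and not part of the competition infrastructure), the average PAR-2 score of a solver over a benchmark set can be computed as follows.

```cpp
#include <vector>

struct RunResult {
    bool solved;     // solved within the time limit?
    double seconds;  // runtime in seconds (only meaningful if solved)
};

// Average PAR-2 score over a benchmark set: the runtime if the instance was
// solved, and twice the time limit (2 * 5000 = 10,000) otherwise.
double average_par2(const std::vector<RunResult>& runs, double time_limit = 5000.0) {
    double total = 0.0;
    for (const RunResult& r : runs)
        total += r.solved ? r.seconds : 2.0 * time_limit;
    return runs.empty() ? 0.0 : total / runs.size();
}
```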
A solver was disqualified if it produced a wrong answer: specifically, if a solver reported "unsatisfiable" on an instance that was proven to be satisfiable by some other solver, or reported "satisfiable" but provided a wrong certificate. Solvers disqualified from the competition were not eligible for awards.

Certificates
In all tracks, solvers were required to output a solution (a satisfying truth assignment, i.e., a model of the instance in question) to certify recognizing a satisfiable instance. On the other hand, certificates for unsatisfiable instances (proofs) were required only in the Main track (besides the No-Limits track). In some cases, a solver output the correct result, but the respective certificate was wrong. Such solvers were demoted to the No-Limits track of the competition.
Each unsatisfiability proof produced by each solver was validated in a two-step fashion. First, the tool DRAT-trim [20] was used for initial checking and optimizing the proof, thereby obtaining a so-called LRAT proof file. An independent formally verified checker, cake_lpr [22], was then used for validating the LRAT proof as a correct proof of unsatisfiability.
In a few cases DRAT-trim ran into the verification timeout of 45,000 seconds. In the Main track, only those unsatisfiable benchmark instances for which the proof produced by a solver could be validated at least by DRAT-trim were considered solved by the solver. While there were several cases where cake_lpr ran out of resources, there was no case where DRAT-trim would accept a proof and cake_lpr would not.

Computing Environments
The Main, No-Limits, and Planning tracks were run on the StarExec cluster [23] with computing nodes equipped with Intel Xeon 2.4 GHz processors and 128 GB of memory. The time limit enforced on each solver for solving an instance was 5,000 seconds. In the Main track, proof validation was limited to 45,000 seconds per proof. The solvers were allowed to use the full 128 GB of RAM. (Unfortunately, the memory limit of 24 GB that was used in previous years was by mistake advertised on the competition web page prior to solver submission. This could have resulted in some solvers not "daring" to use the full 128 GB in the competition. We do not, however, have concrete evidence to support this possibility.) The Incremental Library track was run on computers with two Intel Xeon E5430 2.66 GHz (4-core) processors and 24 GB of RAM. The Parallel track was run on AWS m4.16xlarge machines with 64 virtual CPUs and 256 GB of memory, while the Cloud track was run on Amazon Web Services (AWS) m4.4xlarge machines.

Benchmarks
For data-driven selection of benchmark instances, we used GBD Tools, which facilitates querying for instances with desired properties, e.g., by instance author, family, result, or solver runtime [24]. We also use GBD Tools for distributing benchmark instances and their attributes to the general public.

Selection of Instances
The "Bring Your Own Benchmarks" (BYOB) rule, first established in SAT Competition 2017 [25], was again followed in 2020. By this rule, solver authors are required to submit 20 benchmark instances to accompany a solver submission in order to participate in the competition. These benchmarks have to be "new" in the sense that instances included in benchmark sets from previous SAT competitions are not allowed. Furthermore, at least ten of the required 20 instances are required to be "interesting", interpreted in loose terms by the standard Minisat SAT solving needing at least one minute of runtime (on typical computing hardware) to solve an instance. It should be noted that new benchmarks could be submitted to the competition without needing to submit a solver. As a results, as detailed in Table 1, 27 authors contributed a set of 1,260 new benchmark instances from a wide range of different instance families.
We decided to include a total of 300 new benchmarks and a further 100 benchmarks from previous SAT Competitions in the main benchmark set of the 2020 competition. Key aims of benchmark selection are to ensure that (i) the benchmark set includes enough relatively hard-to-solve instances in order to differentiate the overall runtime performances of the competing solvers (without actually running the competing solvers during benchmark selection); (ii) the number of benchmarks included in the benchmark set from different problem domains is balanced across the problem domains; and (iii) the benchmark set is also balanced in terms of the number of unsatisfiable and satisfiable instances included in the set.
To compile the set of 300 new instances, we first applied a hardness criterion by filtering out all instances solved by Minisat in less than ten minutes. (Note that the limit of ten minutes is again somewhat arbitrary. This runtime hardness filter essentially aims to make sure that enough instances are included in the final benchmark set which allow for distinguishing the best-performing competing solvers in terms of relative performance. Unfortunately, this limit was much greater than the requirement imposed for "interesting" benchmark instances by the BYOB rule. It could be more sensible to impose the same ten-minute limit also for an instance being "interesting"; however, this would require more effort, at least in terms of computation time, from the solver authors in order to construct the set of interesting new benchmarks required by the BYOB rule.) From the resulting 1,012 instances, in order to obtain a balanced benchmark set, we randomly selected k instances per author, using the value k which ensured that the resulting set contains at least 300 instances. This rule-based randomization procedure is detailed as Algorithm 1. Specifically, we randomly selected seven satisfiable and seven unsatisfiable instances per author (Lines 3 and 4) and added instances of yet unknown result if this did not yield a total of 14 instances (Lines 5-7). Of the 308 instances obtained in this way, we randomly removed eight satisfiable instances, yielding a total of 114 satisfiable, 78 unsatisfiable, and 108 instances of unknown satisfiability status. We augmented the resulting set of 300 new benchmarks with 100 instances from previous SAT competitions as follows. In order to further balance the number of satisfiable and unsatisfiable instances in the new benchmark set, we randomly selected 21 satisfiable, 57 unsatisfiable, and 22 unknown instances. With additional constraints, we made sure not to select instances from benchmark families which are already represented in the set of 300 new instances (cf. Table 1). We also excluded random, agile, and planning instances (the latter due to the Planning track). The final main benchmark set contains 135 satisfiable, 135 unsatisfiable, and 130 instances of "unknown" status (cf. Table 2).

Table 2: Amount of old and new instances by result.
                 SAT  UNSAT  Unknown    Σ
New Instances    114     78      108  300
Old Instances     21     57       22  100
Σ                135    135      130  400
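A sketch of the per-author selection step described above (Algorithm 1 itself is not reproduced here; the data structures and function names below are illustrative only).

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

enum class Status { SAT, UNSAT, UNKNOWN };

struct Instance { std::string name; std::string author; Status status; };

// Randomly pick up to k instances with the given status from `pool`.
static std::vector<Instance> pick(const std::vector<Instance>& pool, Status s,
                                  std::size_t k, std::mt19937& rng) {
    std::vector<Instance> filtered;
    for (const auto& inst : pool)
        if (inst.status == s) filtered.push_back(inst);
    std::shuffle(filtered.begin(), filtered.end(), rng);
    if (filtered.size() > k) filtered.resize(k);
    return filtered;
}

// Per-author step: 7 satisfiable and 7 unsatisfiable instances, topped up
// with instances of unknown status until 14 instances are reached (if possible).
std::vector<Instance> select_per_author(const std::vector<Instance>& of_author,
                                        std::mt19937& rng) {
    std::vector<Instance> chosen = pick(of_author, Status::SAT, 7, rng);
    std::vector<Instance> uns = pick(of_author, Status::UNSAT, 7, rng);
    chosen.insert(chosen.end(), uns.begin(), uns.end());
    if (chosen.size() < 14) {
        std::vector<Instance> unk =
            pick(of_author, Status::UNKNOWN, 14 - chosen.size(), rng);
        chosen.insert(chosen.end(), unk.begin(), unk.end());
    }
    return chosen;
}
```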

Planning Instances
Classical planning is the problem of finding a sequence of actions, a plan, that transforms the world from some initial state to a goal state. In 1992, Kautz and Selman [10] proposed to encode planning as satisfiability, constituting one of the hallmark early adoptions of SAT solving for solving real-world problems. In their encoding, the problem of finding a plan of length i (i.e., the makespan) is translated into a Boolean formula F_i that is satisfiable if a plan of length i or less exists. Their encoding is called sequential, whereas parallel encodings allow the execution of multiple actions in one step [26,27,28]. Finding the smallest makespan i for which F_i is satisfiable is important for SAT-based planning in general and for the generation of this benchmark set in particular. The hardest formulas that a SAT-based planner has to solve are usually the last unsatisfiable F_i before the next higher makespan i + 1 is satisfiable [26]. For the Planning track, the benchmark instances were generated using the two SAT-based planners Madagascar [29] and Pasar [30]. We used Madagascar both in its default configuration to generate a parallel encoding based on ∃-step plans and to generate a sequential encoding. Pasar uses the grounding routine deployed by the well-known planner Fast Downward [31] to translate planning tasks into a different formalism and then encodes them to SAT using a parallel encoding. The classical planning benchmarks were selected from the Satisficing and Optimal tracks of the International Planning Competitions 2014 and 2018. We only selected planning domains with unit cost and eliminated those that take more than 100 GB of memory to encode into SAT. We ran both Pasar and Madagascar's ∃-step configuration with a timeout of three hours on the remaining instances to find the minimal makespans. For each planning task where this process did not time out, we generated a pair of satisfiable and unsatisfiable SAT instances. A significant number of instances from this set were not used as Minisat could solve them in under ten minutes. We augmented the remaining domains with the last unsatisfiable formulas generated for planning tasks where the minimal makespan could not be found. To generate these missing benchmarks, we used a sequential encoding together with bounds on the optimal plan length.
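A minimal sketch of the makespan search underlying this benchmark generation: encode the planning task for increasing makespans and stop at the first satisfiable formula, the last unsatisfiable one being the hard benchmark. The encoding and solving steps are abstracted into a callable here; they are placeholders, not the API of any actual planner.

```cpp
#include <functional>
#include <optional>

// Sketch of the makespan search: formula_is_sat(i) is assumed to encode the
// planning task for makespan i (yielding F_i) and decide its satisfiability,
// e.g., by calling a SAT solver on the encoded formula.
std::optional<int> minimal_makespan(const std::function<bool(int)>& formula_is_sat,
                                    int max_makespan) {
    for (int i = 0; i <= max_makespan; ++i) {
        if (formula_is_sat(i))
            return i;   // F_i satisfiable: a plan of length <= i exists;
                        // F_{i-1} is the "last unsatisfiable" benchmark
    }
    return std::nullopt; // minimal makespan not found within the bound
}
```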

In addition to the classical planning problems, we also included SAT instances generated by Tree-REX [32], a planner for Hierarchical Task-networks (HTN). In HTN planning, additional domain knowledge besides the problem description is provided. The HTN benchmarks were provided by the author of Tree-REX.
The instances of the Planning track are large in size compared to the Main track instances. Using the number of clauses as a metric, out of the 100 largest instances across both tracks, 86 belong to the Planning track benchmark set. The large size of Planning track instances can mainly be attributed to the large numbers of binary clauses that SAT encodings of planning problems naturally produce. On average, more than 98% of the clauses are binary for planning instances. The average for the Main track instances is below 60%. Table 3.2 shows the number of benchmarks generated by each encoding. For a complete list of the encoded planning tasks we refer to the generation script. The benchmarks of the Planning track adhere to the following naming convention: <SAT|UNSAT> <encoding name> <makespan>.cnf

Incremental Library Track Benchmarks
Benchmarks for the Incremental Library track consist of benchmark applications, which implement and use an incremental SAT solver in their back-end, as well as benchmark instances which serve as input to these applications. For evaluating solvers participating in the Incremental Library track, we used six available IPASIR applications. For each of the six applications, we individually selected 50 application instances as follows.
Backbone Computation. Backbone variables [33,34] are variables which take the same value in all models of a given SAT instance. The application genipabones incrementally determines the backbone variables of a given satisfiable SAT instance using the so-called dual-rail encoding [12]. We selected 50 of the smallest and easiest satisfiable instances from previous SAT competitions to evaluate solver performance with this application.
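For illustration, the following is a simplified assumption-based backbone computation on top of the IPASIR functions listed earlier; note that this only sketches the general idea and is not the dual-rail approach that genipabones actually uses.

```cpp
#include <vector>

extern "C" {  // IPASIR entry points as declared earlier
  void *ipasir_init();
  void  ipasir_release(void *solver);
  void  ipasir_add(void *solver, int lit);
  void  ipasir_assume(void *solver, int lit);
  int   ipasir_solve(void *solver);
  int   ipasir_val(void *solver, int lit);
}

// Assumption-based backbone computation for a CNF given as clauses (without
// the terminating 0) over variables 1..nvars. Returns the backbone literals,
// i.e., the literals that are true in every model of a satisfiable formula.
std::vector<int> backbones(const std::vector<std::vector<int>>& clauses, int nvars) {
    void *s = ipasir_init();
    for (const std::vector<int>& c : clauses) {
        for (int lit : c) ipasir_add(s, lit);
        ipasir_add(s, 0);                           // close the clause
    }
    std::vector<int> result;
    if (ipasir_solve(s) == 10) {                    // 10 = satisfiable
        std::vector<int> candidate(nvars + 1, 0);   // literal of each variable in the model
        for (int v = 1; v <= nvars; ++v) {
            int val = ipasir_val(s, v);
            candidate[v] = (val == 0) ? 0 : (val > 0 ? v : -v);
        }
        for (int v = 1; v <= nvars; ++v) {
            if (candidate[v] == 0) continue;        // value irrelevant: not a backbone
            ipasir_assume(s, -candidate[v]);        // try to flip the variable
            if (ipasir_solve(s) == 20)              // 20 = unsatisfiable: cannot flip
                result.push_back(candidate[v]);
        }
    }
    ipasir_release(s);
    return result;
}
```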
Essential Variables. Variables which have to be assigned in all partial models of a formula are called essential (as opposed to don't-care variables) [35]. The application genipaessentials incrementally determines essential variables in a given satisfiable formula [12]. For this application, we used the same 50 satisfiable instances as for backbone computation.
Longest Simple Paths (LSP). The application genipalsp determines longest simple paths in a graph [36]. We selected 50 LSP instances for our evaluation.
Maximum Satisfiability (MaxSAT). The application genipamax solves partial MaxSAT problems by augmenting soft clauses with relaxation (or blocking) variables which are input to a cardinality constraint [37]. The MaxSAT problem is then solved by incrementally minimizing the bound of the cardinality constraint. For this application, we selected 50 instances from MaxSAT Evaluation 2019.
Quantified Boolean Formulas. Ijtihad is a QBF solver which uses counterexample-guided expansion to incrementally solve a given QBF instance with a SAT solver [38]. Here we used 50 instances from QBF Evaluation 2019.
Planning (SAS+). We selected 50 planning instances to evaluate incremental SAT solvers with Pasar, a planner which uses counterexample-guided abstraction refinement (CEGAR) [30].

Competition Results
In this section, we provide a high-level overview of the results of SAT Competition 2020. Later on, we will provide an overview of some of the key and new solving techniques implemented in best-performing solvers (Section 5) as well as a more in-depth analysis of the competition results (Section 6). An overview of the top-10 solvers in each of the competition tracks discussed in the following is provided in Table 4.

Main Track
Starting with the Main track, Figure 1 shows the cumulative numbers of solved instances for the best-performing solver of each of the strongest ten teams (in short, the top-10 solvers) together with the Virtual Best Solver (VBS; see also Section 6.1). The best-performing solver overall on the combination of satisfiable and unsatisfiable instances is Kissat-sat and the runner-up is Relaxed-newTech. Notice that Relaxed-newTech solved more instances within a 2,000-second time limit. Third place, based on the PAR-2 score, went to CMS-ccnr-lsids. It solved two instances fewer than CaDiCaL-alluip-trail, but on the other hand solved various formulas more quickly. Similar observations have been made in earlier SAT competitions where solvers were ranked based on the PAR-2 score. Furthermore, we observe larger differences in overall runtime performance among the top solvers than what has been observed in the recent past competitions.
The four solvers Kissat-sat, Relaxed-newTech, CaDiCaL-alluip-trail, and CMS-ccnr-lsids performed significantly better than all other solvers submitted to the Main track (cf. Table 4). A very interesting observation is that these four top solvers all have different code bases. This has not been observed for many years; more typically, many of the best-performing solvers have been based on the same code bases.
The majority of the overall performance differences between the top-4 solvers and the other solvers is due to performance differences on satisfiable instances; see Figure 2. Indeed, in recent years, several techniques have been added to SAT solvers to improve their performance on satisfiable instances. Examples of such techniques are the integration of a local search solver and alternating between a SAT mode (infrequent restarts) and an UNSAT mode (frequent restarts and variable-move-to-front [39]). The best-performing solver in the Main SAT track on satisfiable instances is Relaxed-newTech, followed by Kissat-sat and CMS-ccnr-lsids.
Overall, solvers performed much more similarly on unsatisfiable instances than on satisfiable instances; see Figure 3 for the runtime performance on unsatisfiable instances. Only Kissat-unsat, the winner of the Main UNSAT track, performed significantly better than all other participating solvers. It is therefore not surprising that the VBS is reasonably close to Kissat-unsat. The solvers CaDiCaL-trail and f2trc-s placed, respectively, second and third in the Main UNSAT track.

Planning Track
The competition in the Planning track was tighter. The best solver, CaDiCaL-alluip-trail, solved only one instance more than the runner-up CMS-ccnr-lsids. The PAR-2 scores of these two solvers were quite similar as well. The third-ranked solver, Kissat, solved fewer instances, but its fast runtimes on several instances resulted in a strong PAR-2 score. Notice that these three solvers were also strong in the Main track. It should be noted that, somewhat disappointingly, none of the participating solvers were actually optimized for planning instances.

Parallel Track
Turning to the Parallel track, Figure 4 shows the performance of all participating parallel solvers. The best solver here is Painless-MCOMSPS-STR32. Interestingly, this solver used 32 threads on the 64 virtual cores that were available. In fact, it has already been observed in recent SAT competitions that using fewer threads than the number of available virtual cores can be helpful; as threads compete for memory, using all virtual cores may be detrimental to overall performance. The runner-up is Plingeling, while the third place goes to ManyGlucose-32. Interestingly, one can observe from Table 4 that only the Painless-MCOMSPS-STR* solvers and Plingeling had a lower PAR-2 score than the winner of the Main track (Kissat-sat). It appears to remain challenging to beat the best-performing sequential solvers with a parallel solver.

Cloud Track
The clear winner of the Cloud track is Mallob-Mono and the runner-up is TopoSAT2 (see Figure 5). Mallob-Mono was able to solve more instances in 1000 seconds than the winner of the Parallel Track in 5000 seconds, which shows the potential of distributed SAT solving. The other four participants performed significantly worse. The massive parallelism in distributed SAT solving imposes additional challenges on scalable information sharing and search diversification. Since 2020 was the first year of this track, we expect a tighter competition in the future.

Winning Solvers
In this section we provide an overview of the participating teams and solvers, and summarize new strategies implemented in the best-performing solvers of the 2020 competition. We start with a few remarks on the evolution of code bases of well-known SAT solvers.
Progress in SAT solvers is often based on successful modifications of existing and openly available solver code bases. One well-known tree of code-base evolution is rooted in the code base of Minisat by Eén and Sörensson [40]. A well-known fork of Minisat is Glucose by Audemard and Simon [11]. In particular, Glucose introduced the influential literal block distance (LBD) heuristic for deciding which learned clauses to keep and which ones to forget during search [41]. The SAT solver RISS by Manthey is a further fork of Glucose, combining Glucose with the Coprocessor preprocessor [42].
A further, more recent line of evolution in SAT solvers is rooted in CoMinisatPS by Oh, which is itself again a fork of Minisat, and which introduced three-tier clause management [43]. Building on CoMinisatPS, the SAT solver Maple appeared as a series of forks presenting innovative branching heuristics at SAT Competition 2016 [44]. The at-the-time award-winning variant MapleCOMSPS by Liang et al. implements a hybrid branching heuristic combining the classic variable-state independent decaying sum (VSIDS) [45] and the newer learning rate based branching (LRB) [46].
For SAT Competition 2017, Luo et al. integrated learned clause minimization based on unit propagation (LCM) in their award-winning Maple LCM Dist [47], which also uses the new branching heuristic Distance (Dist) in an initial solving period [48]. In SAT Competition 2018, Ryvchin and Nadel successfully integrated conditional chronological backtracking (ChronoBT) [49] in their award-winning solver Maple LCM Dist ChronoBT [50].
Kochemazov et al. improved three-tier clause-management by persisting additional clauses through hash-based detection of repeatedly learned clauses and presented their award-winning MapleLCMDistChronoBT-DL in SAT Race 2019 [51]. As can be seen in Table 5, numerous submissions to SAT Competition 2020 are forks of some recent award-winning descendants of a Maple-based solver.
Also starting as a fork of Minisat, with the integration of special treatment for XOR constraints [52], CryptoMiniSat by Soos continues to be a state-of-the-art and feature-rich SAT solver. One highlight of CryptoMiniSat is its advanced data-logging capabilities for statistical analysis of SAT solver behavior [53].
Many independent and award-winning code-bases can be found among the SAT solvers written by Biere. The sequential SAT solver Lingeling has been award-winning since SAT Competition 2011 and is still competitive in its parallel version Plingeling [54]. As of SAT Competition 2017, CaDiCaL by Biere is another independent representative of state-of-the-art SAT solvers and its improved re-implementation Kissat [55] was successful in the 2020 competition.

Sequential SAT Solvers
Sequential SAT solvers were evaluated in the Main, Planning, and Incremental Library tracks of SAT Competition 2020. Eighteen teams submitted a total of 48 solvers and configurations to the Main track and the Planning track of the competition, and four solvers participated in the Incremental Library track. Table 5 displays an overview of the participating teams, base solvers, and their variants. In the following, we provide a short overview of the best-performing solvers of 2020, based mainly on the solver descriptions submitted to the 2020 competition proceedings by the authors of the individual solvers.

Kissat
Three configurations of Kissat were submitted to the 2020 competition, including one default configuration and two specialized configurations which are specifically tailored towards satisfiable and unsatisfiable instances, respectively. Kissat received four awards, achieving the first place in the Main track, the best score on unsatisfiable instances, the second-best score on satisfiable instances and the third place in the Planning track.
Kissat is a low-level re-implementation of CaDiCaL with new sophisticated lazy data structures for clause state monitoring, e.g., through binary clause inlining, sentinel values, and bit stuffing [56,55]. Moreover, forward subsumption for learned clauses is mostly replaced by vivification algorithms [57]. Since the number of conflicts has been observed to be too unstable as a measure for the length of the two alternating restart modes, Kissat uses the new unit "ticks", which approximates the number of cache-line accesses in unit propagation [55]. Kissat also exploits autarkies to account for saved phases. In order to keep the valuable information of saved phases, before each rephasing step Kissat computes the largest autarky for the assignment implied by the current saved phases [58]. As such an autarky might contain satisfying assignments which imply disconnected components, those variables are subject to subsequent variable elimination.

CryptoMiniSat
CryptoMiniSat received four awards, achieving the first place in the Incremental Library track, the second place in the Planning track, the third place in the Main track, and the third-best score on satisfiable instances. Two submitted variants, default and LSIDS, scored mostly adjacent ranks in the individual competition tracks.
The LSIDS variant of CryptoMiniSat comes with a new hybrid phase selection approach [59,60]. CryptoMiniSat comes with an independent implementation of state-of-the-art hybrid branching heuristics which alternate between classic phase saving and target phase selection [56]. CryptoMiniSat-CCAnr regularly schedules short periods of local search and imports the best assignment for phase selection, a procedure which is known as "rephasing" from CaDiCaL [56]. In addition, CryptoMiniSat-CCAnr bumps the VSIDS scores of the first 100 variables in those clauses which the SLS solver weighs as hardest to satisfy [60]. Inprocessing has been extended to include ternary resolution and more vivification [57]. CryptoMiniSat alternates the decay factors of its branching heuristics, thus avoiding the restriction to a "single best" configuration [60]. The submitted version of CryptoMiniSat entails a new optimized implementation of Gauss-Jordan elimination [61]. CryptoMiniSat periodically executes the BreakID algorithm to calculate symmetry-breaking clauses [62].

CaDiCaL AllUip
Based on CaDiCaL, its variants Trail and AllUip present implementations of a new Trail Saving approach [63] and the improved clause-learning heuristic Stable AllUIP [64]. Submitted were the three variants Trail, AllUip and AllUip+Trail. The variants including Stable AllUIP were the most successful in the 2020 competition, achieving in particular the first place in the Planning track.
Stable AllUIP resolves additional clauses beyond the first unit implication point (1-UIP) and keeps them whenever they are of smaller size and their LBD is not greater than that of the 1-UIP clause. By monitoring the frequency of clauses which successfully pass that filter, the solver dynamically limits the number of such extended learning attempts [45,64]. The Trail Saving variant caches backtracked portions of the trail and uses them to restore decision levels during search if possible [63].

Relaxed newTech
The Relaxed fork of MapleLCMDistCBT-DL was first presented in SAT Race 2019 [65]. In SAT Competition 2020, its variant newTech showed a good performance especially on satisfiable instances. The solver received two awards, achieving the second place in the Main track and the best score on satisfiable instances.
Relaxed integrates short runs of the local search solver CCAnr through periodic export and import of assignments [65] and uses a probabilistic schedule for switching between ten phase selection modes. The Relaxed newTech variant uses occurrence counts of variables in unsatisfied clauses during stochastic local search runs to recalculate variable priorities for their modified branching heuristic [66].

Maple F2TRC
The F2TRC fork of MapleLCMDistCBT achieved the third-best score on unsatisfiable instances. F2TRC comes with deterministic re-implementations of former winning strategies in Maple, e.g., replacing time-based intervals with conflict-based intervals [67].
F2TRC introduces improved management of learned clauses in the three tiers core, tier2, and local, which are inherited from CoMinisatPS [43]. A dynamic size limit for the core tier triggers the reassignment of inactive clauses from core to tier2. To counteract an observed starvation of tier2, the conflict-based heuristic that controls demotion of clauses from tier2 to local was replaced by a size-based heuristic [67].

Parallel SAT Solvers
Six teams submitted a total of ten solvers and configurations to the Parallel track. In the following, we outline the best-performing parallel solver implementations.

Painless MapleCOMSPS STR
Painless-MCOMSPS-STR integrates the solver MapleCOMSPS into the Painless parallelization framework [68,69]. The authors submitted a 32-threaded and a 64-threaded variant, which altogether won three awards, achieving the first place overall, the best score on satisfiable instances, and the second-best score on unsatisfiable instances. Interestingly, the 32-threaded variant performed better than the 64-threaded variant.
Painless uses a generic interface to integrate a solver and abstracts away the implementation details of parallelism and concurrent data-structures. Due to this, implementations in Painless boil down to implementing parallelization and clause sharing strategies [68]. Painless-MCOMSPS-STR diversifies mainly via hard-coded configurations of the branching strategies LRB and VSIDS, and via sparse random initialization of variable polarities [70]. Two special solver instances perform concurrent clause strengthening [71] and Gaussian elimination, respectively. Regarding sharing, Painless-MCOMSPS-STR uses an all-to-all strategy with a fixed-size clause buffer and a dynamic LBD filter [9].

Plingeling
The parallel solver Plingeling achieved the best score on unsatisfiable instances, the second place in the overall evaluation, and the third-best score on satisfiable instances. Plingeling is built around the well-known Lingeling and has not changed since 2016. Through a global master queue, Plingeling shares unit clauses, equivalences, and short clauses with a size limit of 40 and an LBD limit of eight. Plingeling uses random seeds for diversification via variable polarities [55,54].

ManyGlucose
ManyGlucose was submitted in 32 and 64 threaded variants. The 32 threaded variant won two awards in this competition, achieving the overall third place as well as the third-best score on unsatisfiable instances. ManyGlucose is a fork of GlucoseSyrup that uses strategies known from ManySat to achieve deterministic solver behavior [72,73,74].

Painless Maple
Painless Maple received the award for the second-best performance on satisfiable instances. Interestingly, Painless Maple at the same time exhibits the worst performance on unsatisfiable instances. Painless Maple integrates the solver ExMapleLCMDistChronoBT into the Painless parallelization framework [68]. It uses a sharing strategy in which the solvers are divided into those which only export clauses and others which import and export clauses, and it was submitted with two diversification variants, v1 and v2. Painless Maple v1 diversifies via hand-crafted heuristic configurations and Painless Maple v2 diversifies via randomized initialization of branching heuristics [9].

Massively Parallel SAT Solvers in the Cloud Track
Five teams submitted a total of six solvers and configurations in the Cloud track. In the following, we outline the best-performing massively parallel solver implementations.

Mallob Mono
Mallob is a fork of the massively parallel SAT solver HordeSAT [70]. Mallob performs dynamic load balancing through malleable job scheduling in case the input contains several SAT instances of varying priority. This functionality is disabled in the submitted variant Mallob-Mono. Mallob uses Lingeling-bcj as its core solver, and every 14th solver instance it spawns is the stochastic local search solver YalSAT [75]. Diversification is done via randomized sparse initialization of branching scores. Mallob shares clauses by organizing solvers in a binary tree in which clauses are asynchronously aggregated in a buffer which is passed along this tree from its leaves to the root. Each node performs a three-way merge of its local export buffer and the two incoming buffers. The aggregate that reaches the root of the binary tree is then broadcast to all solvers [76]. Mallob uses a global clause size limit and a dynamic size limit for the sharing buffer, which depends on the node's position in the binary tree and grows the closer the node is to the root. Clauses are sorted by their size during aggregation, such that smaller clauses are preferred over longer clauses. Duplicates are avoided by using a Bloom filter which is cleared periodically [77].

TopoSAT 2
TopoSAT 2 [78] is a massively parallel SAT solver using Glucose 3 [11]. The solver uses lock-free clause-exchange for solvers on the same machine and the message passing interface (MPI) to share clauses between machines [74]. TopoSAT 2 strengthens clauses before export and delays clause import until the trail-size reaches a local minimum. TopoSAT 2 diversifies via strategies used for branching, restarting, and clause forgetting [79].

Slime
Slime is built from MapleLCMDistChronoBT and was first submitted as a sequential solver, with a new phase selection heuristic, to SAT Race 2019 [80]. The new version of Slime submitted to the 2020 competition came with periodic randomization in geometrically increasing intervals [81]. Even though its sequential version was unsuccessful in the Main track, the MPI-based cloud version of Slime achieved the third place in the Cloud track.

Differentiated Analysis of Main Track Results
In this section, we provide an additional analysis of the Main track results, going beyond the rankings. In particular, we focus on metrics complementing the PAR-2 score used for ranking the solvers in the actual competition.

Contributions to the Virtual Best Solver
The Virtual Best Solver (VBS) is a fictitious solver consisting of all solvers that actually participated in the competition (or a specific track) and an oracle which, when given an input instance, invokes the solver which performed the best on that instance. This way, the performance of the VBS highlights a certain upper bound on the performance achievable in principle by the participating solvers (cf. the figures in Section 4).
One can see that the VBS solves all instances that were solved by at least one solver and solves each instance in the best observed time. By quantifying how much each participating solver contributes to the performance of the VBS, we may attempt to establish which technology (as represented by the solvers) is the most important (and to what degree) in the observed state of the art in SAT solving. We consider here the following three related metrics conceptually derived from the notion of VBS.
VBS-1 "The fastest takes it all": For each solver, we count the number of times the solver was the fastest to solve an instance.   We remark that the sum of VBS-1 points as well as the sum of VBS-3 points computed across all solvers is equal to the number of instances solved by at least one solver (later denoted total ). This is obvious for VBS-1, as exactly one solver scores a point for solving an instance. In the case of VBS-3, where we discard the information about the solution times, we evenly split the one-point reward for solving an instance among those solvers which succeeded in solving the instance. In contrast, VBS-2 does not have this property as it in general distributes more than one point per instance. Similarly as VBS-1, it takes the solution time into account. Similarly as VBS-3, it does not award just the best solver on an instance. For example, a solver that uses twice as much runtime as the fastest solver on an instance receives a half a point. Furthermore, if all solvers solve an instance equally fast, each solver receives a whole point for the instance. Table 6 provides the result of applying the just-described three metrics to the full results of the Main track. We can see that the respective leaderboards are generally dominated by Kissat in at least one of its configurations. VBS-1 tells us that Kissat-unsat was most often the fastest solver, in particular on 11.5 % of the solved instances.
The metric VBS-2 identifies Kissat-sat as the best solver. Its leading score of 32.9 % of the total is more difficult to interpret, though: a solver can score 32.9 % of the VBS-2 total by solving 32.9 % of the solved instances in the best observed time and no others. However, we see from its VBS-1 score that Kissat-sat solved 9.8 % of the solved instances in the best observed time (and some others). A solver can also score 32.9 % of the VBS-2 total by solving all solved instances, but always being roughly three times slower than the VBS. The performance of Kissat-sat lies (clearly) somewhere between these two extremes.
Finally, according to VBS-3, the best solver is Kissat with 13.0 points, which amounts to 3.3 % of the distributed total score. The VBS-3 metric is generally the most evenly distributed one, at least among the first 10 solvers. (The last solver receives 0.9 points, which is 0.221 % of the total.) One can conclude from this that most of the benchmarks are solved by most of the well-performing solvers.
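For concreteness, the following sketch computes the three metrics from a matrix of runtimes (hypothetical input data; a negative entry marks an unsolved instance).

```cpp
#include <vector>

struct VbsScores {
    std::vector<double> vbs1, vbs2, vbs3;   // one entry per solver
};

// times[s][i]: runtime of solver s on instance i, or a negative value if the
// solver did not solve the instance within the time limit.
VbsScores vbs_metrics(const std::vector<std::vector<double>>& times) {
    const std::size_t solvers = times.size();
    const std::size_t instances = solvers ? times[0].size() : 0;
    VbsScores out{std::vector<double>(solvers, 0.0),
                  std::vector<double>(solvers, 0.0),
                  std::vector<double>(solvers, 0.0)};
    for (std::size_t i = 0; i < instances; ++i) {
        double best = -1.0;
        std::size_t best_solver = 0, num_solved = 0;
        for (std::size_t s = 0; s < solvers; ++s) {
            if (times[s][i] < 0) continue;
            ++num_solved;
            if (best < 0 || times[s][i] < best) { best = times[s][i]; best_solver = s; }
        }
        if (num_solved == 0) continue;              // nobody solved this instance
        out.vbs1[best_solver] += 1.0;               // VBS-1: the fastest solver takes the point
        for (std::size_t s = 0; s < solvers; ++s) {
            if (times[s][i] < 0) continue;
            // VBS-2: ratio of the best observed time to the solver's own time
            out.vbs2[s] += times[s][i] > 0.0 ? best / times[s][i] : 1.0;
            // VBS-3: one point shared among all solvers that solved the instance
            out.vbs3[s] += 1.0 / static_cast<double>(num_solved);
        }
    }
    return out;
}
```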

Greedy Set Cover
Another perspective on how much each solver contributes to the state of the art can be obtained by attempting to construct a sequential schedule of solvers (rather than relying on an oracle to pick one solver for each instance, as with the VBS) and observing how big a role each solver plays in such a schedule. Since constructing an optimal schedule tends to be computationally hard, we start here by presenting a computationally more efficient alternative, namely a greedy set cover approach.
With greedy set cover, we start with an empty schedule and iteratively consider each solver for the addition to the schedule obtained so far, picking the one with the highest "marginal contribution" in terms of the number of problems the new schedule will be able to solve. We demonstrate this on the actual data from the competition, again focusing in particular on the Main track results.
A greedy set cover of the solved instances by the solvers of the Main track is presented in Table 7. In the first iteration, the solver which solved the highest number of instances is selected; in our case it was Kissat-sat with 264 instances, as we know already from Table 4. With these 264 instances already covered, CaDiCaL-alluip is the best in further contributing to the set, by an additional 22 instances, in the second iteration. We can see that the further iterations tend to add very little, with the final four iterations adding one instance each. Note that each solver that managed to solve an instance uniquely (i.e., being the only solver that solved a particular instance) shows up in the greedy set cover. Indeed, the greedy set cover metric highlights solvers which are able to uniquely solve specific benchmark instances and thereby contribute to the current state of the art.

Table 7: Greedy set cover of the solved Main track instances.
Iteration  Selected Solver   Solved  Contributes
1          Kissat-sat           264          264
2          CaDiCaL-alluip       250           22
3          f2trc-s              214           10
4          Relaxed-newTech      253            6
5          Kissat-unsat         238            4
6          Relaxed              245            3
7          CMS-walksat          243            3
8          CMS-ccnr-lsids       248            1
9          MapleCBT-DL-v3       211            1
10         DurianSat
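A sketch of the greedy set cover computation over (hypothetical) solver result data, where solved[s] is the set of instance indices solved by solver s.

```cpp
#include <set>
#include <utility>
#include <vector>

// Greedy set cover: repeatedly pick the solver that adds the most not-yet-covered
// instances. Returns pairs (solver index, number of newly covered instances).
std::vector<std::pair<std::size_t, std::size_t>>
greedy_cover(const std::vector<std::set<int>>& solved) {
    std::vector<std::pair<std::size_t, std::size_t>> schedule;
    std::set<int> covered;
    std::vector<bool> used(solved.size(), false);
    while (true) {
        std::size_t best = 0, best_gain = 0;
        for (std::size_t s = 0; s < solved.size(); ++s) {
            if (used[s]) continue;
            std::size_t gain = 0;
            for (int inst : solved[s])
                if (!covered.count(inst)) ++gain;
            if (gain > best_gain) { best_gain = gain; best = s; }
        }
        if (best_gain == 0) break;                 // nothing new can be covered
        used[best] = true;
        covered.insert(solved[best].begin(), solved[best].end());
        schedule.emplace_back(best, best_gain);    // marginal contribution of `best`
    }
    return schedule;
}
```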

Time-Limited Schedules
The greedy set cover disregards the time it would take to execute the obtained "schedule" (of running the solvers that jointly cover all solved instances). However, we can also look at schedules that would fit in a prescribed time budget. A natural choice of the budget seems to be the original time limit of 5000 s.
To this end, by employing a brute-force approach, we first construct a sequence of schedules where the i-th schedule splits the available time of 5000 s uniformly among i solvers and solves the highest number of instances under these constraints. The results are presented in Table 8. We can see that the initial increase from 264 to 278 "covered" instances when using two solvers instead of one (although allowing each to only use half of the time) does not continue further as additional solvers are allowed, although it is still better to use three solvers in a fair time split (and cover 272 instances) than just one. Based on this observation, it is plausible that the really hard instances that were solved may actually require quite a large runtime to get "cracked" by any solver, and thus the advantage of adding more solvers to the schedule quickly diminishes.
We complement this "uniform time split" schedule by formulating the schedule construction problem as a MaxSMT formula and using the Z3 SMT solver [82] in its optimization mode [83] to solve it. For each solver (and its configuration) S we introduce an integer variable R S denoting the number of seconds S runs in the new schedule. We then construct a formula with hard constraints 0 ≤ R S for every S and S∈S R S ≤ 5000 and with one soft constraint for every instance I of the form   where S I is the set of solvers which solved the instance I and T S I is the time it took solver S to solve I here rounded up to the nearest integer. (Note that while R S are variables, i.e., unknowns, the T S I are known constants in the formula.) Finding a solution which satisfies all hard constraints and as many soft constraints as possible, Z3 provided the schedule shown in Table 9 (in under two hours on a single core of a 2.30 GHz CPU). The table is sorted by R S , the time the schedule allocates to individual solvers, with zero entries ignored. It is not clear to what degree is the obtained schedule unique and how much it relies on each solver being present and for how long. Nevertheless, it is interesting to observe the total number of problems covered, here 286, and compare it to the 278 achieved in Table 8 with the uniform split and two solvers.
As can be seen from the "contributes" column, the presence of f2trc-DL in the schedule is not necessary. The 15 instances this solver solves in under 76 seconds were already covered by the preceding three solvers. This result is due to the fact that Z3 was not asked to produce a schedule with a minimal number of participating solvers. Indeed, allowing any other solver to run for the 76 "wasted" seconds would not increase the overall total.
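A sketch of the schedule optimization using Z3's C++ API, assuming Z3 is installed and linked; the solver names and the runtime matrix are placeholder inputs. Each soft constraint asserts that at least one solver that solved instance I is allotted at least the time it needed.

```cpp
#include <string>
#include <vector>
#include <z3++.h>

// times[s][i]: seconds solver s needed on instance i (rounded up), or -1 if unsolved.
// Returns the per-solver time budgets of an (approximately) optimal schedule.
std::vector<int> optimize_schedule(const std::vector<std::string>& solver_names,
                                   const std::vector<std::vector<int>>& times,
                                   int budget = 5000) {
    z3::context ctx;
    z3::optimize opt(ctx);
    std::vector<z3::expr> R;                       // R_S: seconds allotted to solver S
    z3::expr total = ctx.int_val(0);
    for (const std::string& name : solver_names) {
        R.push_back(ctx.int_const(name.c_str()));
        opt.add(R.back() >= 0);                    // hard: 0 <= R_S
        total = total + R.back();
    }
    opt.add(total <= budget);                      // hard: budgets sum to at most the limit
    const std::size_t instances = times.empty() ? 0 : times[0].size();
    for (std::size_t i = 0; i < instances; ++i) {  // soft: instance i is covered
        z3::expr covered = ctx.bool_val(false);
        for (std::size_t s = 0; s < times.size(); ++s)
            if (times[s][i] >= 0)
                covered = covered || (R[s] >= times[s][i]);
        opt.add(covered, 1);                       // weight-1 soft constraint
    }
    opt.check();
    z3::model m = opt.get_model();
    std::vector<int> schedule;
    for (const z3::expr& r : R)
        schedule.push_back(m.eval(r, true).get_numeral_int());
    return schedule;
}
```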

Small Portfolios
The PAR-2 score of the VBS of all 48 submitted solvers in the Main track is 2431.4, which is close to 40% better than the PAR-2 score of 3926.2 of the single best solver Kissat-sat. Given the set of solvers S, the set of solver tuples of size k is defined as P_k := {T | T ∈ 2^S ∧ |T| = k}. We calculate the PAR-2 score of the VBS created from each solver tuple in P_k. In Table 10, we report on the single best-performing k-tuple T_k ∈ P_k (1 ≤ k ≤ 7).
Interestingly, each of the first five tuples T_{k≤5} contains exactly one of the three Kissat variants. The set T_2 is composed of the two winners of the Main SAT and Main UNSAT tracks. For i < 5, the relation T_i ⊂ T_{i+1} holds only under projection to base solvers, due to the fluctuating variants of Kissat.
The composition changes more strongly in T_6. Interestingly, we now have both variants {Kissat-sat, Kissat-unsat} ⊂ T_6, and moreover it holds that T_6 ⊂ T_7. All solvers in T_{k≤7} are among the top-performing solvers which received awards in the Main track, with the only exception of Scavel01 ∈ T_4 ∩ T_5 (cf. Table 4).

Score per Instance Family
Contributions to the VBS can be captured by clustering the instances by their family. We evaluate the runtimes of the three winning solvers of the Main track on those new families which are represented by at least 14 instances (cf. Table 1) and report their places and scores in Table 11. Interestingly, the overall best solver Kissat-sat is outperformed by the second and third ranked solvers Relaxed-newTech and CMS-ccnr-lsids on the Anti-Bandwidth, Vlsat, and Influence Maximization families by a large margin.

Similarity of Solvers
To investigate the similarity of the solvers from the Main track, we define a similarity metric based on the measured runtimes. We start by removing the 84 benchmarks that were not solved by any solver. For the remaining instances, a PAR-2 score is assigned to each instance for every solver, i.e., we set a score of 10,000 for every instance a solver did not solve within the time limit. We calculate the similarity of the 30 solvers with the best average PAR-2 score in the Main track. The results are shown in Figure 6 as a heat map, similar to the visualization in [12]. Additionally, the result of hierarchically clustering the solvers based on their similarity is illustrated as a dendrogram. The height at which two solvers or clusters are joined reflects how similar they are. For example, enabling trail saving in CaDiCaL-alluip has no impact on the runtime, resulting in a similarity above 0.999. Therefore, the two solvers are joined low in the dendrogram.
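The exact similarity metric behind Figure 6 is not restated here; the sketch below shows one plausible way to produce such a heat map and dendrogram with SciPy, assuming a hypothetical (solvers x instances) matrix scores of per-instance PAR-2 values and a list names of solver names. Correlation-based distance and average linkage are illustrative assumptions, not necessarily the choices used in the competition analysis.

```python
# A sketch of the solver-similarity clustering, assuming `scores` is an
# array of shape (num_solvers, num_instances) with per-instance PAR-2
# scores and `names` lists the solver names in the same row order.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

def cluster_solvers(scores, names):
    dist = pdist(scores, metric="correlation")   # 1 - Pearson correlation of score rows
    similarity = 1.0 - squareform(dist)          # square similarity matrix for the heat map
    Z = linkage(dist, method="average")          # agglomerative hierarchical clustering
    plt.imshow(similarity)                       # heat map of pairwise similarities
    plt.figure()
    dendrogram(Z, labels=names)                  # join height reflects dissimilarity
    plt.show()
    return similarity, Z
```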
Interestingly, one identifiable large cluster consists of the Maple descendants, all of which are modifications of the winners of the SAT Competition 2018 and the SAT Race 2019. The similarity within the cluster is high, except for Scavel and exp V MLD CBT DL. These two form a subcluster with a lower similarity to the rest but a high similarity to each other. In fact, the highest measured similarity, apart from the aforementioned CaDiCaL-alluip pair, is observed between them. The two solvers are both based on MapleLCMDistChronoBT-dl-v2.2, but have different authors. This high similarity suggests that the changes they made either result in very similar behavior or do not have a significant impact on runtime performance.
The two configurations of Relaxed by Zhang and Cai use the same codebase as many of the solvers in the Maple cluster. However, their overall performance is better and closer to that of the CMS cluster. The three CMS configurations differ in their implementation of stochastic local search (SLS) and show similar performance. The 2020 version of CaDiCaL exhibits weaker performance than the modifications based on the 2019 version and does not quite fit into any cluster.
The leftmost cluster in the heat map comprises the other solvers originally written by Biere. It is interesting to note that the Kissat configuration specialized for unsatisfiable instances joins the other configurations in the cluster only high up in the dendrogram. In fact, Kissat-unsat has the lowest average similarity to all other solvers in the top 30. This suggests its importance for an optimal portfolio.

Influence of Benchmark Selection on Solver Ranking
To evaluate the impact of benchmark selection on the solver ranking, we follow the experiments described in the tool suite benchfeature [84]. In particular, we first use random sampling to select subsets of the benchmarks used in the Main track. We start with the 316 benchmarks that were solved by at least one solver in the Main track and remove a number of benchmarks at random. For each possible subset size (1 to 316) we generate 50 random samples. The solvers are assigned a new rank in ascending order of their PAR-2 score on each random sample. Note that we never encounter a tie. This ranking can be seen as an estimate of the original ranking: if even relatively small random samples result in a good estimate, we can draw a positive conclusion about the robustness of the ranking.
To determine how similar an estimate is to the original ranking, we calculate Spearman's rank correlation coefficient of the two rankings. For two rankings $r_1$ and $r_2$ it is defined as
$$\rho = 1 - \frac{6 \sum_{S} \left( r_1(S) - r_2(S) \right)^2}{n (n^2 - 1)},$$
where $n$ is the number of solvers in the Main track (48), and the rankings $r_1(S)$ and $r_2(S)$ map a solver $S$ to its rank, i.e., 1 for the best performing solver and 48 for the solver with the highest average PAR-2 score. The coefficient $\rho$ lies in the interval $[-1, 1]$, where a rank correlation of 1 means that the two rankings are equal and $-1$ means that one ranking is the reverse of the other.

To give a better intuition for $\rho$, we list a few modifications to the original ranking together with the resulting rank correlation coefficient. The smallest change we can make is to switch the ranks of two adjacent solvers, resulting in a high rank correlation of $\rho = 0.9999$. Several small changes also result in a high rank correlation; repeating this modification $n/2$ times to switch all pairs of adjacent solvers still gives a value of $\rho = 0.9974$. On the other hand, switching the highest ranked solver (Kissat-sat) with the lowest results in a rank correlation of $\rho = 0.7602$, while moving Kissat-sat to the bottom of the ranking and moving every other solver up one rank gives a higher value.

The mean and standard deviation of the computed correlation coefficients are depicted in Figure 7. The rank correlation is high even for relatively small samples. The average rank correlation drops below 0.99 only after randomly removing at least 95 benchmarks, which is 30% of the considered benchmark set. Accordingly, removing fewer benchmarks randomly has almost no effect on the ranking of the solvers. Furthermore, removing fewer than 200 (63%) benchmarks still results in an average rank correlation above 0.96. This suggests that the impact of the random selection in Algorithm 1 on the solver ranking is limited.

The collected data cannot show whether the benchmarks originally submitted by the solver authors have a systematic bias. However, since each newly submitted benchmark family originates from a different domain and is often the result of current research, we can assume that the submitted families together are representative. Figure 8 shows the rank correlation coefficient resulting from removing a complete benchmark family. As expected, removing a nonrandom subset can have a higher impact on the ranking even if it is small. Removing the 13 hgen benchmarks results in a (still high) rank correlation of 0.9895; moreover, the ranking of the top five solvers stays the same. The individual removal of every other benchmark family results in a rank correlation above 0.99.
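A compact sketch of the sampling experiment is given below. It assumes the same hypothetical par2 dictionary of per-instance PAR-2 scores as in the earlier sketches and uses scipy.stats.spearmanr, which computes exactly the coefficient defined above in the absence of ties.

```python
# A small sketch of the random-subsampling experiment, assuming `par2[s][i]`
# holds per-instance PAR-2 scores for every solver s and instance i.
import random
from scipy.stats import spearmanr

def rank(par2, instances):
    # rank solvers by average PAR-2 over the given instances (1 = best)
    avg = {s: sum(v[i] for i in instances) / len(instances) for s, v in par2.items()}
    ordered = sorted(avg, key=avg.get)
    return {s: r + 1 for r, s in enumerate(ordered)}

def sampled_correlations(par2, instances, subset_size, repeats=50):
    full = rank(par2, instances)
    solvers = list(full)
    coeffs = []
    for _ in range(repeats):
        sample = random.sample(instances, subset_size)
        est = rank(par2, sample)
        coeffs.append(spearmanr([full[s] for s in solvers],
                                [est[s] for s in solvers]).correlation)
    return coeffs
```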

Conclusion and Prospects
The 2020 SAT Competition successfully continues the tradition of the SAT Competition series. In 2020, significant advances in SAT solvers compared to previous years were observed. Some of the more interesting observations on the winning solving strategies include the following. All winning solvers of the Main track periodically schedule runs of a stochastic local search (SLS) solver and import statistical information generated in unsuccessful SLS runs to reconfigure weights in their branching heuristics. As observed from the results of the Parallel track, it appears difficult to make proper use of more than 32 threads for SAT solving, as on some occasions the 32-thread version of a solver outperformed its 64-thread counterpart. However, from the winner of the massively parallel Cloud track we can learn that classical all-to-all clause sharing can be outperformed by a more sophisticated clause-sharing architecture. It appears challenging to integrate and test sophisticated state-of-the-art methods in an incremental SAT solver, and thus solvers usually disable parts of their features in the incremental use case. The winner of the Incremental Library track shows, however, that it is worthwhile to integrate full solver functionality in the incremental setting.

Prospects
In the instance selection, the author-wise balancing of satisfiable and unsatisfiable instances often turned out to be counterproductive, as it did not lead to a more balanced overall selection of new instances. Moreover, this practice discriminated against authors who submitted solely satisfiable or solely unsatisfiable instances. Furthermore, the hardness criterion of 10 Minisat minutes was set higher than the hardness criterion of 1 Minisat minute of the "bring your own benchmarks" rule, which can be viewed as problematic. As lessons learned, in future competitions we will not aim to balance the benchmarks by satisfiability status at the author level, will aim to be more consistent with the imposed hardness criteria for benchmark selection, and will also make sure to clearly communicate what it means for an instance to be counted as unknown.
The IPASIR interface facilitates the integration of SAT solvers into incremental applications. In contrast to benchmarking with instances given in the DIMACS CNF format, only a few benchmark applications are available for the Incremental track. This calls for more community-level efforts towards building a more diverse and well-organized repository of applications for incremental SAT solvers. Proper benchmarking and new tools for testing incremental SAT solvers may also help solver authors to deal with more complex use cases. This year, the Hack track was organized for hacks of Glucose 3. In the next competition, we plan to move to the well-structured and documented state-of-the-art SAT solver CaDiCaL. The CaDiCaL-alluip solver, which is a modified CaDiCaL, has shown competitive performance in this competition.
The first instantiation of an Application track was the Planning track, organized as a one-time track in 2020. The results show that, compared to the overall Main track results, different solvers take the lead when evaluation is restricted to a single application. We intend to run further instantiations of the Application track. While none of the solvers that participated in the Planning track seemed particularly optimized for planning instances, we hope that in future iterations the community will pick up the challenge of optimizing SAT solvers towards different focus applications, as a complementary challenge to the generality of the Main track. In the next competition, the planned focus of the Application track will be on SAT solver applications in cryptography.
The portfolio rule was established to prohibit the participation of pure solver portfolios, to stimulate the development of new codebases, and to ensure fair competition among sequential solvers. The rule was challenged in this competition, as it can be detrimental to cooperation in the community when solver authors use the work of other researchers as fully integrated subsystems in their own codebase. We aim to revisit and refine the portfolio rule in future instantiations of the competition to ensure that it does not unnecessarily hinder interesting algorithmic developments in SAT solving.