The irace package: Iterated racing for automatic algorithm configuration

Modern optimization algorithms typically require the setting of a large number of parameters to optimize their performance. The immediate goal of automatic algorithm configuration is to find, automatically, the best parameter settings of an optimizer. Ultimately, automatic algorithm configuration has the potential to lead to new design paradigms for optimization software. The irace package is a software package that implements a number of automatic configuration procedures. In particular, it offers iterated racing procedures, which have been used successfully to automatically configure various state-of-the-art algorithms. The iterated racing procedures implemented in irace include the iterated F-race algorithm and several extensions and improvements over it. In this paper, we describe the rationale underlying the iterated racing procedures and introduce a number of recent extensions. Among these, we introduce a restart mechanism to avoid premature convergence, the use of truncated sampling distributions to correctly handle parameter bounds, and an elitist racing procedure to ensure that the best configurations returned are also those evaluated on the highest number of training instances. We experimentally evaluate the most recent version of irace and demonstrate with a number of example applications the use and potential of irace, in particular, and of automatic algorithm configuration, in general.


Introduction
Many algorithms for solving optimization problems involve a large number of design choices and algorithm-specific parameters that need to be carefully set to reach their best performance. This is the case for many types of algorithms, ranging from exact methods, such as branch-and-bound and the techniques implemented in modern integer programming solvers, to heuristic methods, such as local search or metaheuristics. Maximizing the performance of these algorithms may involve the proper setting of tens to hundreds of parameters [42,44,59,89]. Even if default parameter settings for the algorithms are available, these have often been determined with other problems or application contexts in mind. Hence, when facing a particular problem, for example, the daily routing of delivery trucks, a non-default, problem-specific setting of algorithm parameters can result in a much higher-performing optimization algorithm.
For many years, the design and parameter tuning of optimization algorithms has been done in an ad-hoc fashion. Typically, the algorithm developer first chooses a few parameter configurations, that is, complete assignments of values to parameters, and executes experiments for testing them; next, she examines the results and decides whether to test different configurations, to modify the algorithm or to stop the process. Although this manual tuning approach is better than no tuning at all, and it has led to high-performing algorithms, it also has a number of disadvantages: (i) it is time-intensive in terms of human effort; (ii) it is often guided by personal experience and intuition and, therefore, biased and not reproducible; (iii) algorithms are typically tested only on a rather limited set of instances; (iv) few design alternatives and parameter settings are explored; and (v) often the same instances that are used during the design and parameter tuning phase are also used for evaluating the final algorithm, leading to a biased assessment of performance.
Because of these disadvantages, this ad-hoc, manual process has been sidelined by increasingly automated and principled methods for algorithm development. The methods used in this context include experimental design techniques [2,29], racing approaches [20], and algorithmic methods for parameter configuration, such as heuristic search techniques [3,10,41,73,81] and statistical modeling approaches [11,43]. These methods have led to an increasing automation of the algorithm design and parameter setting process.
Automatic algorithm configuration can be described, from a machine learning perspective, as the problem of finding good parameter settings for solving unseen problem instances by learning on a set of training problem instances [19]. Thus, there are two clearly delimited phases. In a primary tuning phase, an algorithm configuration is chosen, given a set of training instances representative of a particular problem. In a secondary production (or testing) phase, the chosen algorithm configuration is used to solve unseen instances of the same problem. The goal in automatic algorithm configuration is to find, during the tuning phase, an algorithm configuration that minimizes some cost measure over the set of instances that will be seen during the production phase. In other words, the final goal is that the configuration of the algorithm found during the tuning phase generalizes to similar but unseen instances. The tuning phase may also use automatic configuration methods repeatedly while engineering an algorithm [71]. Due to the separation between a tuning and a production phase, automatic algorithm configuration is also known as offline parameter tuning, to differentiate it from online approaches that adapt or control parameter settings while solving an instance [13,50]. Nevertheless, online approaches also contain parameters that need to be defined offline, for example, which parameters are adapted at run-time and how; such parameters and design choices can be configured by an offline tuning method [59].
In our research on making the algorithm configuration process more automatic, we have focused on racing approaches. Birattari et al. [19,20] proposed an automatic configuration approach, F-Race, based on racing [64] and Friedman's non-parametric two-way analysis of variance by ranks. This proposal was later improved by sampling configurations from the parameter space and refining the sampling distribution by means of repeated applications of F-Race. The resulting automatic configuration approach was called Iterated F-race (I/F-Race) [10,21]. Although a formal description of the I/F-Race procedure is given in those publications, an implementation was not made publicly available. The irace package implements a general iterated racing procedure, which includes I/F-Race as a special case. It also implements several extensions already described by Birattari [19], such as the use of the paired t-test instead of Friedman's test. Finally, irace incorporates several improvements not published before, such as sampling from a truncated normal distribution, a parallel implementation, a restart strategy that avoids premature convergence, and an elitist racing procedure to ensure that the best parameter configurations found are also evaluated on the highest number of training instances.
The paper is structured as follows. Section 2 introduces the algorithm configuration problem and gives an overview of approaches to automatic algorithm configuration. Section 3 describes the iterated racing procedure as implemented in the irace package, as well as several further extensions, including the elitist irace. Section 4 illustrates the steps followed to apply irace to two configuration scenarios and compares experimentally the elitist and non-elitist variants. In Section 5, we give an overview of articles that have used irace for configuration tasks, and we conclude in Section 6. For completeness, we include in Appendix A a brief description of the irace package itself, its components and its main options.

Configurable algorithms
Many algorithms for computationally hard optimization problems are configurable, that is, they have a number of parameters that may be set by the user and affect their results. As an example, evolutionary algorithms (EAs) [36] often require the user to specify settings such as the mutation rate, the recombination operator and the population size. Another example is CPLEX [45], a mixed-integer programming solver that has dozens of configurable parameters affecting the optimization process, for instance, different branching strategies. The reason these parameters are configurable is that there is no single optimal setting for every possible application of the algorithm and, in fact, the optimal setting of these parameters depends on the problem being tackled [2,19,42].
There are three main classes of algorithm parameters: categorical, numerical and ordinal parameters. Categorical parameters represent discrete values without any implicit order or sensible distance measure. An example is the choice among different recombination operators in EAs. Ordinal parameters are seemingly categorical parameters but with an implicit order of their values, e.g., a parameter with the three values {low, medium, high}. Numerical parameters, such as the population size and the mutation rate in EAs, have an explicit order of their values. In addition, parameters may be subordinate or conditional to other parameters, that is, they are only relevant for particular values of other parameters. For example, an evolutionary algorithm may have a parameter that defines the selection operator as either roulette_wheel or tournament. The value roulette_wheel does not have any specific additional parameters, whereas the value tournament requires specifying the value of the parameter tournament_size. In this case, the parameter tournament_size is conditional on the selection operator taking the value tournament. Conditional parameters are not the same as constraints on the values of parameters. For example, given parameters a and b, a constraint may be that a < b. Such constraints limit the range of values that a certain parameter can take depending on the values of other parameters, whereas conditional parameters are either disabled or take a value from a predefined range.
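The distinction between conditional parameters and constraints can be made concrete with a small sketch. The following Python snippet is illustrative only (irace itself reads parameter spaces from its own plain-text description format): it models the selection-operator example above, where tournament_size is enabled only when selection takes the value tournament.

```python
# Illustrative parameter-space description (not irace's actual format).
# Types: "c" = categorical, "i" = integer, "r" = real.
PARAM_SPACE = {
    "selection":       {"type": "c", "domain": ["roulette_wheel", "tournament"]},
    "tournament_size": {"type": "i", "domain": (2, 10),
                        # conditional: only relevant for tournament selection
                        "enabled_if": lambda cfg: cfg.get("selection") == "tournament"},
    "mutation_rate":   {"type": "r", "domain": (0.0, 1.0)},
}

def is_enabled(name, cfg):
    """A parameter is enabled unless its condition evaluates to False."""
    cond = PARAM_SPACE[name].get("enabled_if")
    return cond is None or cond(cfg)
```

With this representation, a disabled conditional parameter simply has no value in a configuration, whereas a constraint such as a < b would instead restrict the sampled range of a value that is always present.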

The algorithm configuration problem
We briefly introduce the algorithm configuration problem here. A more formal definition is given by Birattari [19]. Let us assume that we have a parameterized algorithm with N_param parameters, X_d, d = 1, ..., N_param, each of which may take different values (settings). A configuration of the algorithm θ = {x_1, ..., x_{N_param}} is a unique assignment of values to parameters, and Θ denotes the possibly infinite set of all configurations of the algorithm.
When considering a problem to be solved by this parameterized algorithm, the set of possible instances of the problem may be seen as a random variable I, i.e., a set with an associated probability distribution, from which instances to be solved are sampled. A concrete example of a problem would be the Euclidean symmetric traveling salesman problem (TSP), where each problem instance is a complete graph, each node in the graph corresponds to a point within a square of some dimensions and the distance between the nodes corresponds to the Euclidean distance between their associated points. If the points are randomly and uniformly generated, this class of instances, called RUE, is frequently used in the evaluation of algorithms for the TSP [48,49]. In principle, the set of RUE instances is infinite and all instances are equally interesting, thus I would be an infinite set of equally probable RUE instances. In practice, however, each instance may be generated by a concrete pseudorandom instance generator and we are only interested in a particular range of dimensions and number of points, which can be seen as the random variable I having a particular non-uniform probability distribution, where some elements, i.e., RUE instances outside the range of interest, have a zero probability associated with them.
We are also given a cost measure C(θ, i) that assigns a value to each configuration θ when applied to a single problem instance i. Since the algorithm may be stochastic, this cost measure is often a random variable, and we can only observe the cost value c(θ, i), that is, a realization of the random variable C(θ, i). The cost value may be, for example, the best objective function value found within a given computation time. In decision problems, the cost measure may correspond to the computation time required to reach a decision, possibly bounded by a maximum cut-off time. In any case, the cost measure assigns a cost value to one run of a particular configuration on a particular instance. Finally, when configuring an algorithm for a problem, the criterion that we want to optimize is a function F(θ): Θ → ℝ of the cost of a configuration θ with respect to the distribution of the random variable I. The goal of automatic configuration is to find the best configuration θ* that minimizes F(θ).
A usual definition of F(θ) is the expected cost F(θ) = E[C(θ, I)], where the expectation is taken with respect to the distribution of I. The definition of F(θ) determines how to rank the configurations over a set of instances. If the cost values over different instances are incommensurable, the median or the sum of ranks may be more meaningful. The precise value of F(θ) is generally unknown, and it can only be estimated by sampling. This sampling is performed in practice by obtaining realizations c(θ, i) of the random variable C(θ, i), that is, by evaluating the algorithm configuration θ on instances sampled from I. In other words, as most algorithms of practical interest are sufficiently complex to preclude an analytical computation, the configuration of such algorithms follows an experimental approach, where each experiment is a run of an implementation of the algorithm under specific experimental conditions [11].
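Estimating F(θ) by sampling can be illustrated with a short Monte Carlo sketch. In the following Python snippet, all names are illustrative; `cost` stands in for one run of the target algorithm and returns a realization c(θ, i).

```python
import random

def estimate_F(cost, theta, instances, n_samples=100, seed=42):
    """Monte Carlo estimate of F(theta): the mean cost of configuration
    `theta` over instances drawn from the instance distribution (modeled
    here, for simplicity, as uniform sampling from a list)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        i = rng.choice(instances)   # sample an instance from I
        total += cost(theta, i)     # observe one realization c(theta, i)
    return total / n_samples
```

In a real configuration scenario, `cost(theta, i)` would launch the target algorithm with configuration θ on instance i and parse its output, which is why this evaluation dominates the computational effort of automatic configuration.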

Methods for automated algorithm configuration
The importance of the algorithm configuration problem has been noted by many researchers and, despite its importance, the manual approach has prevailed for a long time. In several papers, proposals were made to exploit techniques from the field of design of experiments (DOE) more systematically to set a, usually small, number of parameters. These methods include CALIBRA [2], which tunes up to five parameters using Taguchi partial designs combined with local search methods, methods based on response surface methodology [29], and more systematic applications of ANOVA techniques [80,83].
In configuration scenarios where all parameters are numerical, configuration approaches may rely on the application of classical black-box numerical optimizers, such as CMA-ES [38], BOBYQA [79], or MADS [5]. Although these methods are designed for continuous optimization, they can often optimize integer parameters by simply rounding the decision variables. MADS was used for tuning the parameters of various other direct search methods for continuous optimization. Later, it was extended to more general tuning tasks within the OPAL framework [6]. Yuan et al. [93] compared the three optimizers CMA-ES, BOBYQA and MADS with irace for tuning numerical parameters, using various techniques for handling the stochasticity of the tuning problem. They concluded that BOBYQA works best for very few parameters (fewer than four or five), whereas CMA-ES is the best for a larger number of parameters. In a follow-up study, they introduced the post-selection method, where the numerical optimizers use few evaluations per configuration in a first phase, and the most promising configurations are then evaluated more carefully by racing in a second post-selection phase, which deals better with the stochasticity of the configuration problem [94].
If we consider the full automatic configuration problem, including conditional and categorical parameters, this problem can essentially be characterized as a stochastic black-box mixed-variables optimization problem. Therefore, apart from the above-mentioned continuous direct search methods, many other heuristic optimization algorithms are natural candidates for tackling the algorithm configuration problem. Among the first proposals is the meta-GA algorithm proposed by Grefenstette [37], who used a genetic algorithm (GA) to tune the parameter settings of another GA. A more recent method is REVAC [74], an evolutionary algorithm that uses multi-parent crossover and entropy measures to estimate the relevance of parameters. The gender-based GA [3] uses various sub-populations and a specialized crossover operator to generate new candidate configurations. Hutter et al. [41] proposed ParamILS, an iterated local search method for automatic configuration that works only on categorical parameters and, hence, requires discretizing numerical ones. The evolutionary algorithm EVOCA [81] generates at each iteration two new candidates using a fitness-proportionate crossover and a local search procedure; the candidates are evaluated a user-defined number of times on each instance to account for the stochastic behavior of the target algorithm.
The evaluation of configurations is typically the most computationally demanding part of an automatic configuration method, since it requires actually executing the target algorithm being tuned. Several methods aim to reduce this computational effort by using surrogate models to predict the cost value of applying a specific configuration to one or several instances. Based on the predictions, one or a subset of the most promising configurations are then actually executed, and the prediction model is updated according to these evaluations. Among the first surrogate-based configuration methods is sequential parameter optimization (SPOT) [12]. A more general method also using surrogate models is sequential model-based algorithm configuration (SMAC) [43]. A recent variant of the gender-based GA also makes use of surrogate models, with promising results [4].
Finally, some methods apply racing [64] for selecting one configuration among a number of candidates using sequential statistical testing [20]. The initial candidate configurations for a race may be selected by DOE techniques, randomly, or based on problem-specific knowledge. In the case of iterated racing, a sampling model is iteratively refined according to the results of previous races. The next section explains iterated racing, as implemented in the irace package.

An overview of iterated racing
The irace package that we describe in this paper is an implementation of iterated racing, of which I/F-Race [10,21] is a special case that uses Friedman's non-parametric two-way analysis of variance by ranks [28].
Iterated racing is a method for automatic configuration that consists of three steps: (1) sampling new configurations according to a particular distribution, (2) selecting the best configurations from the newly sampled ones by means of racing, and (3) updating the sampling distribution in order to bias the sampling towards the best configurations.These three steps are repeated until a termination criterion is met.
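The three steps above can be sketched as a loop. The following Python snippet is a deliberately minimal illustration of this structure, not of the actual irace implementation: configurations are single numbers, the toy `race` simply keeps the two candidates closest to a known optimum, and `update` recenters a normal sampling model on the survivors. All names are illustrative.

```python
import random

def iterated_racing(sample, race, update, budget_iters=5, n_per_iter=8, seed=0):
    """Skeleton of iterated racing: (1) sample new configurations,
    (2) select the best by racing, (3) update the sampling model."""
    rng = random.Random(seed)
    model = None                  # sampling distribution (None = initial/uniform)
    elites = []
    for _ in range(budget_iters):
        new = [sample(model, rng) for _ in range(n_per_iter)]   # step 1
        elites = race(new + elites, rng)                        # step 2
        model = update(elites)                                  # step 3
    return elites

# Toy instantiation: the "true" best parameter value is 0.7.
def sample(model, rng):
    mu, sd = model if model else (0.5, 0.5)
    return min(1.0, max(0.0, rng.gauss(mu, sd)))

def race(candidates, rng):
    # Stand-in for a race: keep the two candidates closest to the optimum.
    return sorted(candidates, key=lambda x: abs(x - 0.7))[:2]

def update(elites):
    mu = sum(elites) / len(elites)
    return (mu, 0.1)   # shrink the spread around the elites
```

Note that the surviving elites re-enter the next race alongside the newly sampled candidates, so the quality of the survivors never degrades across iterations in this sketch.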
In iterated racing as implemented in the irace package, each configurable parameter has an associated sampling distribution that is independent of the sampling distributions of the other parameters, apart from constraints and conditions among parameters. The sampling distribution is either a truncated normal distribution for numerical parameters, or a discrete distribution for categorical parameters. Ordinal parameters are handled as numerical ones (integers). The update of the distributions consists in modifying the mean and the standard deviation in the case of the normal distribution, or the discrete probability values in the case of the discrete distributions. The update biases the distributions to increase the probability of sampling, in future iterations, the parameter values present in the best configurations found so far.
After new configurations are sampled, the best configurations are selected by means of racing. Racing was first proposed in machine learning to deal with the problem of model selection [64]. Birattari et al. [20] adapted the procedure for the configuration of optimization algorithms. A race starts with a finite set of candidate configurations. In the example of Fig. 1, there are ten configurations θ_i. At each step of the race, the candidate configurations are evaluated on a single instance (I_j). After a number of steps, those candidate configurations that perform statistically worse than at least one other are discarded, and the race continues with the remaining surviving configurations. Since the first elimination test is crucial, typically a higher number of instances (T_first) are seen before performing the first statistical test. Subsequent statistical tests are carried out more frequently, every T_each instances (by default, after every instance). This procedure continues until a minimum number of surviving configurations is reached, a maximum number of instances has been used, or a pre-defined computational budget is exhausted. This computational budget may be an overall computation time or a number of experiments, where an experiment is the application of a configuration to an instance.
An overview of the main steps of the iterated racing approach is given in Fig. 2. While the actual algorithm implemented in irace is a search process based on updating sampling distributions [96], the key ideas of iterated racing are more general. From a more general perspective, an iterated racing approach is any process that iterates the generation of candidate configurations with some form of racing algorithm to select the best configurations. Hence, the search process of an iterated racing approach could be, in principle, very different from the current irace algorithm, and make use, for example, of local searches, population-based algorithms or surrogate models. The important element here is the appropriate combination of a search process with an evaluation process that takes the underlying stochasticity of the evaluation into account.
The next subsection (Section 3.2) gives a complete description of the iterated racing algorithm as implemented in the irace package. We mostly follow the description of the original papers [10,21], adding some details that were not explicitly given there. Later, in Section 3.3, we introduce a new "soft-restart" mechanism to avoid premature convergence and, in Section 3.4, we describe a new elitist variant of iterated racing aimed at preserving the best configurations found so far. In Section 3.5, we mention other features of irace that were not proposed in previous publications.

The iterated racing algorithm in the irace package
In this section, we describe the implementation of iterated racing as provided in the irace package. The setup and options of the irace package itself are given in Appendix A. More details about the use of irace can be found in the user guide of the package [62].
An outline of the iterated racing algorithm is given in Algorithm 1. Iterated racing requires as input a set of instances {I_1, I_2, ...} sampled from the random variable I, a parameter space X, a cost function C, and a tuning budget B.
Algorithm 1 Algorithm outline of iterated racing.

Require: instances {I_1, I_2, ...} sampled from I, parameter space X, cost measure C, tuning budget B
1: Θ_1 := SampleUniform(X)
2: Θ_elite := Race(Θ_1, B_1)
3: j := 1
4: while B_used ≤ B do
5:   j := j + 1
6:   Θ_new := Sample(X, Θ_elite)
7:   Θ_j := Θ_new ∪ Θ_elite
8:   Θ_elite := Race(Θ_j, B_j)
9: end while
10: Output: Θ_elite

Iterated racing starts by estimating the number of iterations N_iter (races) that it will execute. The default setting of N_iter depends on the number of parameters, with N_iter = 2 + ⌊log2(N_param)⌋. The motivation for this setting is that we should dedicate more iterations to larger parameter spaces, with a minimum of two iterations per run to allow for some intensification of the search. Each iteration performs one race with a limited computation budget B_j = (B − B_used) / (N_iter − j + 1), where j = 1, ..., N_iter. Each race starts from a set of candidate configurations Θ_j. The number of candidate configurations is calculated as N_j = ⌊B_j / (μ + T_each · min(5, j))⌋, that is, the number of candidate configurations decreases with the number of iterations, which means that more evaluations per configuration are possible in later iterations. The above setting also means that we do not keep decreasing N_j beyond the fifth iteration, to avoid having too few configurations in a single race. The parameter μ is by default equal to the number of instances needed to do a first test (μ = T_first), and allows the user to influence the ratio between budget and number of configurations, which also depends on the iteration number j. The reason behind the formula above is the intuition that configurations generated in later iterations will be more similar and, hence, more evaluations will be necessary to identify the best ones.
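The bookkeeping just described can be made concrete with a few helpers. The Python snippet below implements the default formulas for N_iter, B_j and N_j as given in the published description of irace; the function names are illustrative.

```python
import math

def n_iterations(n_param):
    """Default number of races: N_iter = 2 + floor(log2(N_param))."""
    return 2 + int(math.floor(math.log2(n_param)))

def iteration_budget(B, B_used, n_iter, j):
    """Budget of race j: B_j = (B - B_used) / (N_iter - j + 1)."""
    return (B - B_used) / (n_iter - j + 1)

def n_configurations(B_j, mu, T_each, j):
    """Candidates in race j: N_j = floor(B_j / (mu + T_each * min(5, j))).
    N_j shrinks as j grows (more evaluations per configuration), but not
    beyond the fifth iteration."""
    return int(B_j // (mu + T_each * min(5, j)))
```

For example, with one parameter the default is the minimum of two iterations, while eight parameters yield five iterations, and a first-race budget of 200 experiments with μ = T_first = 5 and T_each = 1 admits 33 initial candidates.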
In the first iteration, the initial set of candidate configurations is generated by uniformly sampling the parameter space X (line 1 in Algorithm 1) and the best configurations are determined by a race (line 2). When sampling the parameter space, parameters are considered in the order determined by the dependency graph of conditions, that is, non-conditional parameters are sampled first; those parameters that are conditional on them are sampled next, if the condition is satisfied, and so on. When a race starts, each configuration is evaluated on the first instance by means of the cost measure C. Configurations are iteratively evaluated on subsequent instances until a certain number of instances (T_first) have been seen. Then, a statistical test is performed on the results. If there is enough statistical evidence to identify some candidate configurations as performing worse than at least one other configuration, the worst configurations are removed from the race, while the others, the surviving configurations, are run on the next instance.
There are several alternatives for selecting which configurations should be discarded during the race. The F-Race algorithm [19,20] relies on the non-parametric Friedman's two-way analysis of variance by ranks (the Friedman test) and its associated post-hoc test described by Conover [28]. Nonetheless, the irace package, following the race package [18], also implements the paired t-test as an alternative option. Both statistical tests use a default significance level of 0.05 (the value can be customized by the user [62]). The statistical tests in irace are used as a selection heuristic, and irace does not attempt to preserve the statistical significance level by sacrificing search performance. For example, the t-test is applied without p-value correction for multiple comparisons, since poor behavior of racing was previously reported when corrections are applied [19], due to the test becoming more conservative and not discarding configurations. Similarly, from a sequential statistical testing perspective, preserving the actual significance level would require additional adjustments that may hinder heuristic performance.
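A simplified sketch of one Friedman-based elimination step is shown below. It uses scipy's Friedman test, but replaces Conover's post-hoc procedure (the one used by irace) with a deliberately crude rank-sum threshold, so it should be read as an illustration of the idea, not as the actual test performed by irace.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def race_step(costs, alpha=0.05):
    """One elimination step, simplified from F-Race.

    costs: (n_configs x n_instances) array of observed costs, with
    configurations as "treatments" and instances as "blocks".
    Returns the indices of the surviving configurations."""
    stat, p = friedmanchisquare(*costs)
    if p >= alpha:
        return list(range(len(costs)))        # no evidence: keep everyone
    # Rank configurations within each instance (1 = best), sum over instances.
    ranks = np.argsort(np.argsort(costs, axis=0), axis=0) + 1
    rank_sums = ranks.sum(axis=1)
    # Crude stand-in for the post-hoc test: keep configurations whose rank
    # sum is in the better half of the observed range.
    threshold = rank_sums.min() + 0.5 * (rank_sums.max() - rank_sums.min())
    return [z for z, r in enumerate(rank_sums) if r <= threshold]
```

A real post-hoc comparison would instead compare each configuration against the current best using the critical difference of Conover's procedure, but the overall shape of the step, test for a global difference and then discard the statistically worse configurations, is the same.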
The most appropriate test for a given configuration scenario depends mostly on the tuning objective F(θ) and the characteristics of the cost function C(θ, i). Roughly speaking, the Friedman test is more appropriate when the ranges of the cost function for different instances are not commensurable and/or when the tuning objective is an order statistic, such as the median, or a rank statistic. On the other hand, the t-test is more appropriate when the tuning objective is the mean of the cost function.
After the first statistical test, a new test is performed every T_each instances. By default T_each = 1, yet in some situations it may be helpful to perform each test only after the configurations have been evaluated on a number of instances. For example, given a configuration scenario with clearly defined instance classes, one may wish to find a single configuration that performs well, in general, for all classes. In that case, the sequence of instances presented to irace should be structured in blocks that contain at least one instance from each class, and T_first and T_each should be set as multiples of the size of each block. This ensures that configurations are only eliminated after evaluating them on every class, which reduces bias towards specific classes. We recommend this approach when configuring algorithms for continuous optimization benchmarks [39], where very different functions exist within the benchmark set and the goal is to find a configuration that performs well on all functions [57]. In this case, each block will contain one instance of every function, and different blocks will vary the number of decision variables and other parameters of the functions to create different instances of the same function.
Each race continues until the budget of the current iteration is not enough to evaluate all remaining candidate configurations on a new instance (B_j < N_j^surv), or when at most N_min configurations remain (N_j^surv ≤ N_min). At the end of a race, the surviving configurations are assigned a rank r_z according to the sum of ranks or the mean cost, depending on which statistical test is used during the race. The N_j^elite = min{N_j^surv, N_min} configurations with the lowest rank are selected as the set of elite configurations Θ_elite.
In the next iteration, before a race, N_j^new = N_j − N_{j−1}^elite new candidate configurations are generated (line 6 in Algorithm 1), in addition to the N_{j−1}^elite elite configurations that continue in the new iteration. To generate a new configuration, first one parent configuration θ_z is sampled from the set of elite configurations Θ_elite with probability p_z = (N_{j−1}^elite − r_z + 1) / (N_{j−1}^elite · (N_{j−1}^elite + 1) / 2), which is proportional to its rank r_z. Hence, higher-ranked configurations have a higher probability of being selected as parents.
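This rank-based parent selection can be sketched in Python as follows, assuming the rank-proportional probabilities described in this section (rank 1 is the best elite; function names are illustrative).

```python
import random

def parent_probability(r_z, n_elite):
    """p_z = (N_elite - r_z + 1) / (N_elite * (N_elite + 1) / 2):
    a linearly decreasing weight, normalized so the p_z sum to 1."""
    return (n_elite - r_z + 1) / (n_elite * (n_elite + 1) / 2)

def choose_parent(elites, rng):
    """elites are assumed ordered best-first (ranks 1, 2, ...)."""
    n = len(elites)
    weights = [parent_probability(r, n) for r in range(1, n + 1)]
    return rng.choices(elites, weights=weights, k=1)[0]
```

With five elites, for example, the best one is selected with probability 5/15 and the worst with probability 1/15, so the search intensifies around the better elites without discarding the others entirely.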
Next, a new value is sampled for each parameter X_d, d = 1, ..., N_param, according to the distribution associated with that parameter in the parent configuration θ_z. As explained before, parameters are considered in the order determined by the dependency graph of conditions: non-conditional parameters are sampled first, followed by the conditional ones. If a conditional parameter that was disabled in the parent configuration becomes enabled in the new configuration, then the parameter is sampled uniformly, as in the initialization phase.
If X_d is a numerical parameter defined within the range [x_d, x̄_d], then a new value is sampled from the truncated normal distribution N(x_d^z, (σ_d^j)^2), such that the new value is within the given range (Footnote 1). The mean of the distribution x_d^z is the value of parameter d in the elite configuration θ_z. The standard deviation σ_d^j is initially set to (x̄_d − x_d)/2, and it is decreased at each iteration before sampling: σ_d^j = σ_d^{j−1} · (1/N_j^new)^{1/N_param}. By reducing σ_d^j in this manner at each iteration, the sampled values are increasingly closer to the value of the parent configuration, focusing the search around the best parameter settings found as the iteration counter increases. Roughly speaking, the multidimensional volume of the sampling region is reduced by a constant factor at each iteration, and the reduction factor is higher when sampling a larger number of new candidate configurations (N_j^new). If the numerical parameter is of integer type, we round the sampled value to the nearest integer. The sampling is adjusted to avoid the bias against the extremes introduced by rounding after sampling from a truncated distribution (Footnote 2).

Footnote 1: Sampling from a truncated normal distribution was never mentioned by previous descriptions of I/F-Race [10,21]. However, naive methods of handling the ranges of numerical parameters, such as "rejection and resampling" or "saturation", may lead to under-sampling or over-sampling of the extreme values; better methods exist [82]. For sampling from a truncated normal distribution, we use code from the msm package [46].
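The numerical sampling step can be sketched as follows. This Python snippet uses scipy's truncnorm (irace itself relies on the R msm package) and reproduces the integer adjustment described in the footnote on rounding bias; the function names and the exact decay formula should be read as an illustration of the published description.

```python
from scipy.stats import truncnorm

def sample_numeric(mu, sigma, lo, hi, integer=False, rng=None):
    """Sample a numerical parameter from a normal distribution truncated
    to [lo, hi]. For integer parameters, the range is widened to
    [lo, hi + 1), the mean shifted by +0.5, and the sample rounded down,
    so that every integer value receives an interval of equal length."""
    if integer:
        lo_c, hi_c, mu_c = lo, hi + 1, mu + 0.5
    else:
        lo_c, hi_c, mu_c = lo, hi, mu
    # truncnorm takes standardized bounds relative to loc and scale.
    a, b = (lo_c - mu_c) / sigma, (hi_c - mu_c) / sigma
    x = truncnorm.rvs(a, b, loc=mu_c, scale=sigma, random_state=rng)
    if integer:
        return min(int(x), hi)   # floor; the endpoint hi + 1 maps back to hi
    return x

def shrink_sigma(sigma, n_new, n_param):
    """sigma_j = sigma_{j-1} * (1 / N_new_j)**(1 / N_param): the sampling
    volume shrinks by a constant factor per iteration."""
    return sigma * (1.0 / n_new) ** (1.0 / n_param)
```

For instance, with two parameters and four new configurations per iteration, each standard deviation is halved every iteration, since (1/4)^(1/2) = 0.5.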
If X_d is a categorical parameter with levels X_d ∈ {x_1, ..., x_{n_d}}, then a new value is sampled from a discrete probability distribution P_{j,z}(X_d). In the first iteration (j = 1), P_{1,z}(X_d) is uniformly distributed over the domain of X_d. In subsequent iterations, it is updated before sampling as follows: P_{j,z}(X_d = x) = P_{j−1,z}(X_d = x) · (1 − (j − 1)/N_iter) + ΔP, where ΔP = (j − 1)/N_iter if x is the value of parameter d in the parent configuration θ_z, and ΔP = 0 otherwise. Finally, the new configurations generated after sampling inherit the probability distributions from their parents. A set with the union of the new configurations and the elite configurations is generated (line 7 in Algorithm 1) and a new race is launched (line 8).
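A minimal Python sketch of this categorical update, assuming the update rule P_j(x) = P_{j−1}(x) · (1 − (j − 1)/N_iter) + ΔP stated in this section (all names are illustrative):

```python
def update_categorical(probs, parent_value, j, n_iter):
    """Shift the discrete sampling distribution of a categorical parameter
    toward the parent's value; the shift grows with the iteration counter j,
    so later iterations concentrate more probability mass on the parent."""
    decay = (j - 1) / n_iter
    return {x: p * (1.0 - decay) + (decay if x == parent_value else 0.0)
            for x, p in probs.items()}
```

Because the old probabilities are scaled by (1 − decay) and exactly decay is added to the parent's value, the updated probabilities still sum to one.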
The algorithm stops if the budget is exhausted (B_used > B) or if the number of candidate configurations to be evaluated at the start of an iteration is not greater than the number of elites (N_j ≤ N_{j−1}^elite), since in that case no new configurations would be generated. If the iteration counter j reaches the estimated number of iterations N_iter but there is still enough remaining budget to perform a new race, N_iter is increased and the execution continues.
Although the purpose of most parameters in irace is to make irace more flexible when tackling diverse configuration scenarios, the iterated F-race procedure implemented in irace has several parameters that directly affect its search behavior.The default settings described here were defined at design time following common sense and experience.A careful fine-tuning of irace would require an analysis over a large number of relevant configuration scenarios.In a preliminary study, we analyzed the effects of the most critical parameters of irace [78] on a few classical scenarios.
We could not find settings that are better for all scenarios than the default ones, and settings need to be adapted to scenario characteristics.The user guide of irace [62] provides advice, based on our own experience, for using different settings in particular situations.

Soft-restart
The iterated racing algorithm implemented in irace incorporates a "soft-restart" mechanism to avoid premature convergence. In the original I/F-Race proposal [10], the standard deviation, in the case of numerical parameters, or the discrete probability of unselected parameter settings, in the case of categorical ones, decreases at every iteration. Diversity is introduced by the variability of the sampled configurations. However, if the sampling distributions converge to a few, very similar configurations, diversity is lost and newly generated candidate configurations will be very similar to the ones already tested. Such premature convergence wastes the remaining budget on repeatedly evaluating minor variations of the same configurations, without exploring new alternatives.
(As an illustration of the integer-rounding adjustment mentioned earlier: assume we wish to sample an integer within the range 1, 2, 3. The naive way would be to sample from a truncated distribution N(μ = 2) over [1, 3] and round to the nearest integer; however, given these ranges, the interval of values that are rounded to 2 is twice the length of the intervals that are rounded to either 1 or 3 and, thus, the sampling would be biased against the extreme values. We remove the bias if we instead sample from N(μ = 2.5) over [1, 4] and round down to the nearest integer.)
The "soft-restart" mechanism in irace checks for premature convergence after generating each new set of candidate configurations. We consider that there is premature convergence when the "distance" between two candidate configurations is zero. The distance between two configurations is the maximum distance between their parameter settings, which is defined as follows:
• If the parameter is conditional and disabled in both configurations, the distance is zero;
• if it is disabled in one configuration but enabled in the other, the distance is one;
• if the parameter is enabled in both configurations (or it is not conditional), then:
  - in the case of numerical parameters (integer or real), the distance is the absolute normalized difference between their values if this difference is larger than a threshold value 10^(−digits), where digits is a parameter of irace; if the difference is smaller, it is taken as zero;
  - in the case of ordinal and categorical parameters, the distance is one if the values are different and zero otherwise.
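The distance just defined can be written down directly. The sketch below uses a hypothetical data layout, not irace's internal representation: disabled conditional parameters are encoded as None, and the parameter space is a dictionary of type and bound descriptors:

```python
def config_distance(theta1, theta2, space, digits=4):
    """Maximum per-parameter distance between two configurations.
    `space` maps parameter names to {'type': 'r'|'i'|'c'|'o', 'lo', 'hi'};
    a disabled conditional parameter is represented as None."""
    threshold = 10 ** -digits
    dist = 0.0
    for name, par in space.items():
        v1, v2 = theta1.get(name), theta2.get(name)
        if v1 is None and v2 is None:        # disabled in both: distance 0
            d = 0.0
        elif (v1 is None) != (v2 is None):   # enabled in only one: distance 1
            d = 1.0
        elif par["type"] in ("r", "i"):      # numerical: normalized abs. difference
            d = abs(v1 - v2) / (par["hi"] - par["lo"])
            if d < threshold:
                d = 0.0
        else:                                # categorical/ordinal: 0 or 1
            d = 0.0 if v1 == v2 else 1.0
        dist = max(dist, d)
    return dist
```

Premature convergence is then flagged whenever two sampled configurations are at distance zero.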
When premature convergence is detected, a "soft-restart" is applied by partially reinitializing the sampling distribution.This reinitialization is applied only to the elite configurations that were used to generate the candidate configurations with zero distance.The other elite configurations do not suffer from premature convergence, thus they may still lead to new configurations.
In the case of categorical parameters, the discrete sampling distribution of elite configuration z, P_{j,z}(X_d), is adjusted by moving each individual probability value p ∈ P_{j,z}(X_d) closer to the uniform distribution, and the resulting probabilities are re-normalized so that they sum to one.
For numerical and ordinal parameters, the standard deviation of elite configuration z, σ_d^{j,z}, is "brought back" two iterations, that is, the last two per-iteration reductions are undone, with its value in the second iteration as an upper limit. After adjusting the sampling distributions of all affected elite configurations, the set of candidate configurations that triggered the soft-restart is discarded and a new set of N_new configurations is sampled from the elite configurations. This procedure is applied at most once per iteration.

Elitist iterated racing
The iterated racing procedure described above does not take into account the information from previous races when starting a new race.This may lead irace to erroneously discard a configuration based on the information from the current race, even though this configuration is the best found so far based on the information from all previous races.For example, if the first race identified a configuration as the best after evaluating it on ten instances, this configuration may get discarded in the next race after seeing only five instances (the default value for T first ), without taking into account the data provided by the ten previous evaluations.This may happen simply because of unlucky runs or because the best configuration overall may not be the best for a particular (small) subset of training instances.An empirical example of this situation is given in Section 4.3 .
Ideally, the best configuration found should be the one evaluated on the largest number of instances, in order to have a precise estimation of its cost statistic F(θ) [41]. Therefore, we present here an elitist variant of iterated racing. This elitist iterated racing aims at preserving the best configurations found so far, called elite configurations, unless they become worse than a new configuration that is evaluated on as many instances as the elite ones.
The main changes are in the racing procedure of irace. After the first race (iteration), the elite configurations have been evaluated on a number e of instances, say I_1, ..., I_e. In the next race, we first randomize the order of the instances in this set and prepend to it T_new newly sampled instances (by default one). Randomizing the order of the instances should help to avoid biases in the elimination test induced by a particularly lucky or unlucky order of instances. The rationale for prepending new instances is to give new configurations the opportunity to survive based on results on new instances, thus reducing the influence of already seen instances. This new set of T_new + e instances is used for the new race. In a race, a new configuration is eliminated as usual, that is, if it is found to be statistically worse than the best one after performing a statistical test. However, elite configurations are not eliminated from the race unless the new configurations have been evaluated on at least T_new + e instances, that is, on as many instances as the elite configurations were evaluated in the previous race plus the T_new new ones. If the race continues beyond T_new + e instances, then new instances are sampled as usual, and any configuration may be eliminated from the race.
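The instance ordering used by a new elitist race (the previously seen instances in randomized order, preceded by T_new fresh ones) can be sketched as follows; `draw_new_instance` is a hypothetical caller-supplied sampler:

```python
import random

def race_instance_order(seen_instances, draw_new_instance, t_new=1, rng=random):
    """Build the instance sequence for the next elitist race: T_new newly
    sampled instances first, then the already-seen instances in random order."""
    order = list(seen_instances)
    rng.shuffle(order)
    return [draw_new_instance() for _ in range(t_new)] + order
```

New configurations thus face unseen instances first, while elites keep their evaluations on the shuffled remainder.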
If a race stops before reaching T_new + e instances, which may happen when the number of remaining configurations is no more than N_min, configurations that were not elite have been evaluated on fewer instances than the ones that were elite at the start of the race, and thus there would be no unique value of e for the next iteration. We avoid this problem by keeping track of the instances on which each configuration has been evaluated, so that we can compute a value of e for each elite configuration.
With the elitist irace procedure described above, the number of configurations sampled at each race is limited by the number of instances seen in previous iterations. In particular, the number of new configurations is bounded by a function of the following quantities: B_j, the computational budget assigned to the current iteration j; N_elite^j, the number of elite configurations; e, the maximum number of instances seen by the elite configurations; T_new, the minimum number of new instances to be evaluated in this iteration; and μ (by default T_first) and T_each, which control the frequency of the statistical tests performed, as explained in Section 3.2. The function nm(x, d), which appears in this bound, gives the smallest multiple of d that is not less than x.
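The helper function nm(x, d), defined in the text as the smallest multiple of d that is not less than x, has a one-line implementation:

```python
import math

def nm(x, d):
    """Smallest multiple of d that is not less than x."""
    return d * math.ceil(x / d)
```

For example, nm(7, 3) is 9, while nm(9, 3) is 9 itself.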
In elitist irace, the number of new configurations that we can sample at each iteration is thus limited by the number of instances seen so far. Hence, if a particular iteration evaluates a large number of instances, subsequent iterations will be strongly limited by this number. This situation may arise, for example, in the first iteration, if most configurations are discarded after the first test: the remaining budget for this iteration is then spent on evaluating the few surviving configurations on new instances, without being able to discard enough configurations to reach N_min and stop the race. This results in a large value of e for the subsequent iterations, thus reducing the number of new configurations that can be sampled. In order to prevent this situation, we added a new stopping criterion that stops the race if there are T_max (by default 2) consecutive statistical tests without discarding any candidate. This stopping criterion is only applied after seeing T_new + e instances, that is, when the statistical test may discard any elite configuration.
The described elitist strategy often results in a faster convergence to good parameter values, which has the disadvantage of reducing the exploration of new alternative configurations. Unfortunately, the soft-restart mechanism explained above is not sufficient to increase exploration in elitist irace, since it applies only when the sampling model of all parameters has converged, that is, when sampling a configuration almost identical to its parent. However, we observed in elitist irace that categorical parameters, in particular, tend to converge quite rapidly to consistently good values. This probably happens because good overall values are easier to find at first and, in the elitist variant, it is harder to discard the elite configurations that contain them. Hence, these values get continuously reinforced when updating their associated probabilities, while differences in numerical values prevent a soft-restart. In order to increase exploration, we limit the maximum probability associated with a categorical parameter value after updating it, and re-normalize the probabilities.

Other features of irace
We have implemented in irace several extensions that were not part of the original I/F-Race proposal.

Initial configurations
We can seed the iterated race procedure with a set of initial configurations, for example, one or more default configurations of the algorithm.In that case, only enough configurations are sampled to reach N 1 in total.

Parallel evaluation of configurations
The training phase carried out by irace is computationally expensive, since it requires many runs of the algorithm being tuned.
The total time required by a single execution of irace is mostly determined by the number of runs of the algorithm being tuned (the tuning budget, B ) and the time required by those runs.In irace , these runs can be executed in parallel, either across multiple cores or across multiple computers using MPI.It is also possible to submit each run as a job in a cluster environment such as SGE or PBS.
The user guide of irace [62] describes all the technical details.

Forbidden configurations
A user may specify that some parameter configurations should not be evaluated by defining such forbidden configurations in terms of logical expressions that valid configurations should not satisfy. In other words, no configuration that satisfies any of these expressions will be evaluated by irace. For example, given a parameter space with two parameters, a numerical one param1 and a categorical one param2, and a forbidden expression that excludes the value ''x1'' of param2 for certain values of param1, a configuration such as {7, ''x1''} will not be evaluated, whereas {7, ''x2''} would be. This is useful, for example, if we know that certain combinations of parameter values lead to high memory requirements and, thus, are infeasible in practice.
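The rejection logic amounts to discarding any sampled configuration that satisfies at least one forbidden expression. A sketch follows; the predicate shown is a hypothetical expression consistent with the example above (in irace itself, the expressions are written in R syntax in the scenario files):

```python
def is_forbidden(config, forbidden_exprs):
    """A configuration is rejected if it satisfies ANY forbidden expression."""
    return any(expr(config) for expr in forbidden_exprs)

# Hypothetical forbidden expression: param2 must not be "x1" while param1 < 10.
forbidden = [lambda c: c["param1"] < 10 and c["param2"] == "x1"]
```

Rejected configurations are simply resampled, so the forbidden region never consumes evaluation budget.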

Applications of irace
In this section, we present two detailed examples of configuration scenarios and how to tackle them using irace .The first scenario illustrates the tuning of a single-objective metaheuristic (ant colony optimization) on a well-known problem (the traveling salesman problem).The second scenario concerns the tuning of a framework of multi-objective ant colony optimization algorithms.We have chosen these scenarios for illustrative purposes and because their setup is available in AClib [44] .

Example of tuning scenario: tuning ACOTSP
ACOTSP [87] is a software package that implements various ant colony optimization algorithms to tackle the symmetric traveling salesman problem (TSP).The configuration scenario illustrated here concerns the automatic configuration of all its 11 parameters.
The goal is to find a configuration of ACOTSP that obtains the lowest solution cost in TSP instances within a given computation time limit.We explain here the setup of this configuration scenario.For a more detailed overview of the possible options in the irace package, we refer to Appendix A .
First, we define a parameter file ( parameters.txt , Fig. 3 ) that describes the parameter space, as explained in Section A.2 .
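A parameter file uses irace's plain-text format: one line per parameter giving its name, its command-line switch, its type (c categorical, i integer, r real, o ordinal), its domain, and an optional condition after "|". A shortened sketch in the style of the ACOTSP scenario (names and ranges abridged for illustration; see Fig. 3 for the actual file):

```
# name       switch          type  domain                     [condition]
algorithm    "--"            c     (as, mmas, eas, ras, acs)
alpha        "--alpha "      r     (0.00, 5.00)
beta         "--beta "       r     (0.00, 10.00)
rho          "--rho "        r     (0.01, 1.00)
ants         "--ants "       i     (5, 100)
q0           "--q0 "         r     (0.0, 1.0)                 | algorithm == "acs"
```

The condition on q0 makes it a conditional parameter: it is only enabled, and only sampled, when algorithm takes the value "acs".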
We also create a scenario file (scenario.txt, Fig. 4) to set the tuning budget (maxExperiments) to 5000 runs of ACOTSP. Next, we place the training instances in the subdirectory ''./Instances/'', which is the default value of the option trainInstancesDir. We create a basic target-runner script that runs the ACOTSP software for 20 CPU-seconds and prints the objective value of the best solution found.
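A target-runner can be any executable. Below is a minimal Python sketch; the executable name, its flags, and the exact argument order are illustrative assumptions rather than the precise interface, which is documented in the user guide:

```python
#!/usr/bin/env python3
"""Hypothetical target-runner: irace calls it with identifiers, a seed, an
instance and the configuration's parameters, and reads one cost on stdout."""
import subprocess
import sys

def build_cmd(seed, instance, params, exe="./acotsp", time_limit="20"):
    # Flag names below are placeholders for the tuned algorithm's interface.
    return [exe, "--time", time_limit, "--seed", seed, "-i", instance] + list(params)

if len(sys.argv) > 4:
    conf_id, inst_id, seed, instance, *params = sys.argv[1:]
    out = subprocess.run(build_cmd(seed, instance, params),
                         capture_output=True, text=True).stdout
    # Assume the algorithm prints the best objective value on its last line.
    print(out.strip().splitlines()[-1])
```

irace treats the printed value as c(θ, i) for the run.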
At the end of a run, irace prints the best configurations found, both as a table and as command-line parameters. In the table, the first number of each row is a unique identifier of a particular configuration within a single run of irace, and NA denotes that a parameter did not have a value within a particular configuration (because it was not enabled). To evaluate the results, each of the 30 configurations produced by irace is run on a test set of 200 instances of size 2000, while the default configuration of ACOTSP is run 30 times on the same set using different random seeds. We then report the percentage deviation from the optimal objective value. Each data point shown in Fig. 5 corresponds to an instance; values are the mean of the results obtained either by the 30 runs of the default configuration of ACOTSP or by the 30 configurations produced by irace. In order to reduce variability, we associate a random seed to each instance and use this seed for all runs performed on that instance. As we can observe, the improvement of the tuned configuration over the default one is significant. In practice, we often observe that the largest improvements are obtained when configuring an optimization algorithm for scenarios that differ substantially from those for which it was designed, either in terms of problem instances or in terms of other characteristics of the scenario, such as termination criteria or the computation environment. Nonetheless, it is not rare that an automatic configuration method finds a better configuration than the default even for those scenarios considered when designing an algorithm, and even when the default configuration is not provided as an initial configuration, as in our examples here.
Fig. 6. Example of the computation of the hypervolume quality measure. Each white diamond represents a solution in the objective space of a bi-objective minimization problem. The black point is a reference point that is worse in all objectives than any Pareto-optimal solution. The area of the objective space dominated by all solutions in the set and bounded by the reference point is called its hypervolume. The larger the hypervolume of a set, the higher the quality of the set.

A more complex example: tuning multi-objective optimization algorithms
In this section, we explain how to apply irace to automatically configure algorithms that tackle multi-objective optimization problems in terms of Pareto optimality.This example illustrates the use of an additional script (or R function) called targetEvaluator .
In multi-objective optimization in terms of Pareto optimality, the goal is to find the Pareto front, that is, the image in the objective space of those solutions for which there is no other feasible solution that is better in all objectives.For many interesting problems, finding the whole Pareto front is often computationally intractable, thus the goal becomes to approximate the Pareto front as well as possible.Algorithms that approximate the Pareto front, such as multi-objective metaheuristics, typically return a set of nondominated solutions, that is, solutions for which no other solution in the same set is better in all objectives.
Automatic configuration methods, such as irace, have been primarily designed for single-objective optimization, where the quality of the output of an algorithm can be evaluated as a single numerical value. In the case of multi-objective optimization, unary quality measures, such as the hypervolume (see Fig. 6) and the ϵ-measure [95], assign a numerical value to a set of nondominated solutions, thus allowing the application of standard automatic configuration methods [58,91]. However, computing these unary quality measures often requires a reference point (or reference set), which depends on the sets being evaluated. One may define the reference point a priori based on some knowledge about the instances being tackled, such as lower/upper bounds. On the other hand, it would be desirable if the reference point could be computed from the results obtained while carrying out the automatic configuration process. In irace, the latter can be achieved by first running all candidate configurations on a single instance and, once all these runs have finished, computing the cost/quality value of each configuration in a separate step.
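For the bi-objective case, the hypervolume of Fig. 6 reduces to an area computation. A minimal sketch for minimization problems (our own illustration, not the evaluator shipped with irace):

```python
def hypervolume_2d(points, ref):
    """Hypervolume (dominated area) of a set of points of a bi-objective
    minimization problem, bounded by the reference point `ref`."""
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in sorted(points):          # ascending in the first objective
        if f2 < prev_f2:                   # skip dominated points
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv
```

Each nondominated point contributes the rectangle between itself, the reference point, and the previous point's second objective, so the sum is exactly the area dominated by the set.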
In practical terms, this means that the targetRunner program is still responsible for running the configuration θ on instance i , but it does not compute the value c ( θ , i ).This value is computed by a different targetEvaluator program that runs after all targetRunner calls for a given instance i have finished.The communication between targetRunner and targetEvaluator is scenario-specific and, hence, defined by the user.
In the case of elitist irace, targetEvaluator may be called with configurations that were evaluated in a previous race, since they were elite. Therefore, one way to dynamically compute the reference point for the hypervolume is for targetRunner to save the nondominated set corresponding to each pair (θ, i), and for targetEvaluator to use these sets (without deleting them, since they might be needed again later) to update the reference point or the normalization bounds.
We have applied the above method to automatically configure various multi-objective optimization algorithms by means of irace .
In particular, we first applied irace to instantiate new algorithmic designs from a framework of multi-objective ant colony optimization algorithms (MOACO) [58]. In that work, we tested the combination of irace with the hypervolume measure and with the ϵ-measure, but we did not find significant differences between the results obtained with each of them. The MOACO algorithms automatically instantiated by irace were able to significantly outperform previous MOACO algorithms proposed in the literature. Fig. 7 compares the results of a configuration obtained by (elitist) irace and the best manually designed configuration from the literature [58] on 60 bi-objective Euclidean TSP instances of sizes {500, 600, 700, 800, 900, 1000}. The complete MOACO scenario is too complex to describe here, but it is provided as an example together with irace and it is also included in AClib [44].

Comparison of irace and elitist irace
In this section, we compare irace with and without elitism on three configuration scenarios. ACOTSP is similar to the scenario described in Section 4.1. We consider a budget of 5000 runs of ACOTSP and 20 CPU-seconds per run. As benchmark set, we consider Euclidean TSP instances of size 2000, in particular, 200 training instances and 200 test instances.
MOACO is similar to the scenario described in Section 4.2 .
SPEAR, where the goal is to minimize the mean runtime of Spear, an exact tree search solver for SAT problems [9] with 26 categorical parameters. We consider a budget of 10,000 runs, a maximum runtime of 300 CPU-seconds per run, and a training and a test set of 302 different SAT instances each [8].
Fig. 8. 95% confidence intervals of the mean differences between the results obtained by θ_0 (an elite configuration discarded by the non-elitist irace using the t-test) and the new elite configurations (θ_1, ..., θ_6) on different subsets of the training set: the 9 instances seen by irace at the iteration in which θ_0 was discarded (top), the 37 instances on which θ_0 was evaluated since the start of the run (middle), and the full training set (bottom). Negative values indicate that θ_0 performs better than the configuration to which it is compared; if the interval contains zero, there is no statistically significant difference.
Instance homogeneity is an important factor when tuning an algorithm. We measure instance homogeneity by means of the Kendall concordance coefficient (W) [78], computed from the results of 100 uniformly randomly generated algorithm configurations executed on the training set. Values of W close to 1 indicate high homogeneity, while values close to 0 indicate high heterogeneity.
The W values for the ACOTSP, MOACO and SPEAR scenarios are 0.98, 0.99 and 0.16, respectively, showing that SPEAR is a highly heterogeneous scenario.
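The concordance coefficient W can be computed from an m × n matrix of results, treating the m random configurations as "raters" that rank the n instances. A sketch without the tie-correction term (sufficient when results are continuous costs):

```python
def kendall_w(ratings):
    """Kendall's coefficient of concordance W for an m x n matrix of results
    (rows: configurations, columns: instances); ranks taken row-wise."""
    m, n = len(ratings), len(ratings[0])
    def ranks(row):
        order = sorted(range(n), key=lambda j: row[j])
        r = [0.0] * n
        i = 0
        while i < n:
            j = i
            while j + 1 < n and row[order[j + 1]] == row[order[i]]:
                j += 1                      # extend run of tied values
            avg = (i + j) / 2 + 1           # average rank for the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rank_sums = [sum(col) for col in zip(*(ranks(row) for row in ratings))]
    mean = m * (n + 1) / 2
    s = sum((rs - mean) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))
```

Perfectly concordant rows (every configuration ranks the instances identically) give W = 1; rows whose rankings cancel out give W = 0.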
Given the characteristics of each scenario [78] , we use as the statistical elimination test in irace the default F -test in the ACOTSP and MOACO scenarios, and the t -test in the SPEAR scenario.For each scenario, we run irace 30 times on the training set, obtaining 30 different algorithm configurations.Then, we run each of these configurations on the test set, which is always different from the training set.
As explained in Section 3.4, the non-elitist irace may discard high-quality configurations due to the use of partial information, ignoring the results obtained in past iterations. A concrete example is shown in Fig. 8 for an actual run of irace on the SPEAR scenario. The plots give the 95% confidence intervals of the mean differences between the results obtained by configuration θ_0, an elite configuration identified in iteration 5, and the elite candidates (θ_1, ..., θ_6) obtained in iteration 6. Configuration θ_0 is discarded by irace in iteration 6 due to statistically significantly worse performance than configuration θ_6 after 9 instances were executed (top plot); θ_0 is also statistically significantly worse than θ_1, θ_3 and θ_4, and it has a worse mean than θ_2 and θ_5. However, θ_0 has statistically better performance on the full training set when compared to all elite candidates of iteration 6 (bottom plot). Even if we consider only the 37 training instances on which θ_0 was evaluated from the start of this run of irace up to the moment it was discarded, θ_0 is significantly better than all configurations that irace selects as final elites (see middle plot of Fig. 8). Hence, using all available information, irace would have detected that θ_0 is the best configuration of the group. The loss of such potentially winning configurations happened in eight of the 30 executions of the non-elitist irace on SPEAR.
Next, we perform experiments comparing both variants of irace on the three configuration scenarios mentioned above across 30 repetitions. Fig. 9 shows the results for each scenario as the mean performance reached on the test set by each of the 30 configurations generated for that scenario. On the left, it shows box-plots of the results and, on the right, scatter plots where each point pairs the executions of elitist and non-elitist irace that used the same set of initial configurations. In none of the three scenarios do we observe statistically significant differences.

Heterogeneous scenario setting
Examining the results of the previous section more closely (middle plots of Fig. 9), one may notice that the non-elitist irace produces the two worst configurations found for SPEAR. These two particularly bad runs of irace discarded elite configurations as explained above. We conjecture that, for heterogeneous training sets such as in the SPEAR scenario, the elitist version may avoid the loss of high-quality configurations, thus producing more consistent results. In fact, facing scenarios with a heterogeneous set of training instances is a difficult task for automatic configuration methods, which normally work better in homogeneous scenarios [85].
In a heterogeneous scenario, measuring the quality of a configuration typically requires evaluating it on a large number of instances, in order to find configurations that optimize the algorithm performance across the training set and also to capture the possible existence of a few rare but hard instances. Unfortunately, evaluating on more instances with the same tuning budget strongly reduces the ability of the tuner to explore new configurations and, hence, there is a trade-off between increasing the confidence in the quality of configurations and sampling the configuration space effectively.
When using irace to tackle very heterogeneous scenarios, it may be useful to adjust the default settings to increase the number of instances evaluated by each configuration. For elitist irace, this can be achieved by increasing the number of new instances added initially to a race (T_new); in non-elitist irace, this can be achieved by increasing the number of instances needed to perform the first statistical test (T_first). Fig. 10 gives the mean runtime per candidate on the test set of the algorithm configurations obtained by 10 runs of the elitist irace using various values of T_new (top) and of the non-elitist irace using various values of T_first (bottom) on the SPEAR scenario. We can observe that using a larger value than the default for T_new and T_first strongly improves the cost (mean runtime) of the final configurations, because configurations are evaluated on more instances before deciding which ones should be discarded. Further increasing the values of T_new and T_first does not lead to further improvements, because enough instances are already seen to account for their heterogeneity. It does lead, however, to fewer configurations being explored; thus, at some point, larger values will actually generate worse configurations. This effect is stronger for T_first, because all configurations at each iteration have to be evaluated on that many instances, which consumes a substantial amount of budget and results in a much lower number of configurations being generated, as shown by the numbers within parentheses in Fig. 10. In the case of T_new, non-elite configurations may be discarded before seeing T_new instances, so the effect on the budget consumed is lower. The same experiment for the ACOTSP scenario showed that the best configurations become worse when T_new or T_first are increased. This is due to the fact that ACOTSP has a homogeneous training set and, therefore, sampling new candidates is more important than evaluating a large number of instances.

Other applications of irace
Since the first version of the irace package became publicly available in 2012, there have been many other applications of irace .In this section, we provide a list of the applications of the irace package of which we are aware at the time of writing.Some of these applications go beyond what is traditionally understood as algorithm configuration, demonstrating the flexibility of irace .
Automatic configuration is also useful when the goal is to analyze the effect of particular parameters or design choices. Instead of a (factorial) experimental design, which is often intractable because of the large number of parameters and/or the limited computation time available, the analysis starts from very high-performing configurations found by an automatic configuration method and proceeds by changing one parameter at a time. An example is the analysis of a hybrid algorithm that combines ant colony optimization and a MIP solver to tackle vehicle routing problems (VRP) with black-box feasibility constraints [67]. Pellegrini et al. [76] applied this principle to the analysis of parameter adaptation approaches in ant colony optimization. More recently, Bezerra et al. [15] have applied the same idea to analyze the contribution of various algorithmic components found in multi-objective evolutionary algorithms.
The idea behind the above analysis is that, in terms of performance, there are many more "uninteresting" configurations than "interesting" ones, and statements about the parameters of uninteresting configurations are rarely useful; thus, it makes more sense to start the analysis with high-performing configurations. In its general form, such a procedure may be used to analyze differences between configurations, which has been described as ablation [32].
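One step of an ablation path between two configurations can be sketched as follows (a minimal greedy variant of the idea in [32]; `evaluate` is a hypothetical caller-supplied cost estimator, e.g. the mean cost over some instances):

```python
def ablation_step(current, target, evaluate):
    """Try flipping each parameter of `current` to its value in `target`;
    keep the single change with the best (lowest) evaluated cost."""
    best, best_cost, best_param = None, float("inf"), None
    for p in current:
        if current[p] == target[p]:
            continue                       # parameter already matches target
        cand = dict(current)
        cand[p] = target[p]
        cost = evaluate(cand)
        if cost < best_cost:
            best, best_cost, best_param = cand, cost, p
    return best, best_param
```

Iterating this step until `current` equals `target` yields a path of one-parameter changes whose costs reveal which parameters matter most.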

Multi-objective optimization metaheuristics
Besides the application to the MOACO framework described above [58] , irace has been applied to aid in the design of other multi-objective optimization algorithms.Dubois-Lacoste et al. [31] used irace to tune a hybrid of two-phase local search and Pareto local search (TP + PLS) to produce new state-of-the-art algorithms for various bi-objective permutation flowshop problems.
Fisset et al. [33] used irace to tune a framework of multi-objective optimization algorithms for clustering.When applied to a sufficiently flexible algorithmic framework, irace has been used to design new state-of-the-art multi-objective evolutionary algorithms [16,17] .

Anytime algorithms (improve time-quality trade-offs)
There is often a trade-off between solution quality and computation time: algorithms that converge quickly tend to produce better solutions for shorter runtimes, whereas more exploratory algorithms tend to produce better solutions for longer runtimes. Improving the anytime behavior of an algorithm amounts to improving this trade-off curve so that the algorithm produces solutions of as high quality as possible at any moment during its execution. López-Ibáñez and Stützle [59] modeled the trade-off curve as a multi-objective optimization problem and measured its quality using the hypervolume measure. This approach allows the application of irace to tune the parameters of an algorithm in order to improve its anytime behavior. They applied this technique to tune parameter variation strategies for ant colony optimization algorithms, and to tune the parameters of SCIP, a MIP solver, in order to improve its anytime behavior. The results show that the tuned algorithms converge much faster to good solutions without sacrificing the quality of the solutions found after relatively long computation times.

Automatic algorithm design from a grammar description
Algorithm configuration methods have been used in the literature to instantiate algorithms from flexible algorithmic frameworks in a top-down manner, that is, the framework is a complex algorithm built from components of several related algorithms, and specific components can be selected through parameters. One example using ParamILS is SATenstein [51]. Examples using irace include the MOACO framework described above [58] and multi-objective evolutionary algorithms [17]. A different approach describes the space of potential algorithm designs as a grammar, which provides much more flexibility when composing complex algorithms. Mascia et al. [66] proposed a method for describing a grammar as a parametric space that can be tuned by means of irace in order to generate algorithms. They applied this technique to instantiate iterated greedy algorithms for the bin packing problem and the permutation flowshop problem with weighted tardiness. Marmion et al. [63] applied this idea to automatically design more complex hybrid local search metaheuristics.

Applications in machine learning
In machine learning, the problem of selecting the best model and tuning its (hyper-)parameters is very similar to automatic algorithm configuration. Thus, it is not surprising that irace has been used for this purpose, for example, for tuning the parameters of support vector machines [70]. Lang et al. [54] used irace for automatically selecting models (and tuning their hyperparameters) for analyzing survival data. The automatically tuned models significantly outperform reference (default) models. The mlr software package [22] uses irace, among other tuning methods, for tuning the hyperparameters of machine learning models as a better-performing alternative to random search and grid search.

Automatic design of control software for robots
A very original application of irace is the automatic design of control software for swarms of robots. Francesca et al. [35] propose a system to automatically design the software that controls a swarm of robots in order to achieve a specific task. The problem is specified as a series of software modules that provide many different robot behaviors and the criteria to transition between behaviors. Each module can be further customized by means of several parameters. A particular combination of behaviors and transitions represents one controller, that is, an instance of the software that controls the robots in the swarm. The performance of a particular controller is evaluated by means of multiple simulations. The search for the best controller over multiple training simulations is carried out by means of irace. The authors report that this system is able to outperform not only a previous system that used F-race [34], but also a human designer, in the scenarios they studied.

Conclusion
This paper presented the irace package, which implements the iterated racing procedure for automatic algorithm configuration. Iterated racing is a generalization of the iterated F-race procedure.
The primary purpose of irace is to automate the arduous task of configuring the parameters of an optimization algorithm. However, it may also be used for determining good settings in other computational systems such as robotic systems, traffic light controllers, or compilers. The irace package has been designed with simplicity and ease of use in mind. Despite being implemented in R, no previous knowledge of R is required. We included two examples for the purpose of illustrating the main elements of an automatic configuration scenario and the use of irace to tackle it. In addition, we provided a comprehensive survey of the wide range of applications of irace.
There are a number of directions in which we are trying to extend the current version of irace. One is the improvement of the sampling model to take into account interactions among parameters. In some cases, irace converges too quickly and generates very similar configurations; thus, additional techniques to achieve a better balance between diversification and intensification seem worth pursuing. In the same direction, techniques for automatically adjusting some settings of irace (such as T^new and T^first) depending on the heterogeneity of a scenario would be useful. Finally, we are currently adding tools to provide a default analysis of the large amount of data gathered during a run of irace, in order to give the user information about the importance of specific parameters and the most relevant interactions among the parameters.
The iterated racing algorithms currently implemented in irace have, however, a few well-known limitations. The most notable is that they were primarily designed for scenarios where reducing computation time is not the primary objective. Methods designed for this type of scenario, such as ParamILS [41] and SMAC [43], dynamically control the maximum time assigned to each run of the target algorithm and use an early pruning of candidate configurations in order not to waste time on configurations that are both time-consuming and poor. Moreover, the default parameters of irace assume that a minimum number of iterations can be performed and a minimum number of candidate configurations can be sampled. If the tuning budget is too small, the resulting configuration might not be better than random ones.
Finally, automatic configuration methods in general may be difficult to apply when problem instances are computationally expensive, for example, when the computational resources available are limited (no access to multiple CPUs or a cluster of computers) or when a single run of the algorithm requires many hours or days. In such situations, two main alternatives have been proposed in the literature. Styles et al. [88] proposed to use easier (less computationally expensive) instances during tuning to obtain several good configurations, and then apply a racing algorithm to these configurations using increasingly difficult instances to discard those configurations that do not scale. Mascia et al. [65] proposed to tune on easy instances, and then identify which parameters need to be modified, and how, in order for the algorithm to scale to more difficult instances.
The main purpose of automatic algorithm configuration methods is to configure the parameters of optimization and other algorithms. Nonetheless, the use of these methods has a crucial role in new ways of designing software, as advocated in the programming by optimization paradigm [40]. Moreover, the importance of properly tuning the parameters of algorithms before analyzing and comparing them is becoming widely recognized. We hope that the development of the irace package will help practitioners and researchers to put these ideas into practice.
When the instances are files, the path given by option trainInstancesDir will be prefixed to them.
Nonetheless, an instance may also be the parameter settings for selecting a benchmark function implemented in the target algorithm or for invoking an instance generator (in that case, option trainInstancesDir should be set to the empty string). If the option trainInstancesFile is not set, then irace considers all files found in trainInstancesDir, and recursively in its subdirectories, as training instances. The order in which instances are considered by irace is randomized if the option sampleInstances is enabled. Otherwise, the order is the one given in trainInstancesFile, if this option is set, or alphabetical otherwise.
In order to reduce variance, irace uses the same random seed to evaluate different configurations on the same instance. If an instance is seen more than once, a different random seed is assigned to each occurrence. Thus, in practice, the sequence of instances seen within a race (Fig. 1) is actually a sequence of instance-seed pairs.
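As a rough illustration, the construction of such a sequence of instance-seed pairs can be sketched in R. This is a simplified sketch of the behavior just described, not irace's actual implementation; the function name build_instance_seed_pairs is made up for this example:

```r
# Sketch (not irace's code): build a sequence of instance-seed pairs.
# All configurations evaluated on the same pair share its seed, reducing
# variance; a repeated instance appears again with a fresh seed.
build_instance_seed_pairs <- function(instances, n_pairs, sample_instances = TRUE) {
  # Randomize the instance order if sampling is enabled, as with the
  # sampleInstances option; otherwise use a fixed (here alphabetical) order.
  instance_order <- if (sample_instances) sample(instances) else sort(instances)
  data.frame(instance = rep_len(instance_order, n_pairs),  # instances repeat in order
             seed = sample.int(.Machine$integer.max, n_pairs))  # one seed per pair
}
```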

A2. Parameter space
For simplicity, the description of the parameter space is given as a table. Each line of the table defines a configurable parameter as:

<name> <label> <type> <domain> [ | <condition> ]

where each field is defined as follows:

<name>
The name of the parameter as an unquoted alphanumeric string, for instance: 'ants'.

<label>
A label for this parameter. This is a string that will be passed together with the parameter to targetRunner. In the default targetRunner provided with the package, this is the command-line switch used to pass the value of this parameter, for instance '"--ants "'.

<type>
The type of the parameter, either integer, real, ordinal or categorical, given as a single letter: 'i', 'r', 'o' or 'c'.

<domain>
The range (for integer and real parameters) or the set of values (for categorical and ordinal parameters) of the parameter.
<condition>
An optional condition that determines whether the parameter is enabled or disabled, thus making the parameter conditional. If the condition evaluates to false, then no value is assigned to this parameter, and neither the parameter value nor the corresponding label are passed to targetRunner. The condition must be a valid R logical expression. The condition may contain the names of other parameters, as long as the dependency graph does not have any cycles; otherwise, irace will detect the cycle and stop with an error.

Parameter types and domains
Parameters can be of four types:

• Real parameters are numerical parameters that can take any floating-point value within a given range. The range is specified as an interval '(<lower bound>, <upper bound>)'. This interval is closed, that is, the parameter value may take the value of either bound. The possible values are rounded to the number of decimal places specified by the option digits. For example, given the default number of digits of 4, the values 0.12345 and 0.12341 are both rounded to 0.1234.

• Integer parameters are numerical parameters that can take only integer values within the given range. The range is specified as for real parameters.

• Categorical parameters are defined by a set of possible values specified as '(<value 1>, ..., <value n>)'. The values are quoted or unquoted character strings. Empty strings and strings containing commas or spaces must be quoted.

• Ordinal parameters are defined by an ordered set of possible values in the same format as for categorical parameters. They are handled internally as integer parameters, where the integers correspond to the indices of the values.
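For illustration, a small parameter file in this format could look as follows. The parameter names, switches and domains below are invented for this sketch (loosely inspired by the ACOTSP example of Fig. 3) rather than taken from any particular scenario:

```
# name        label         type  domain              | condition
ants          "--ants "     i     (5, 100)
alpha         "--alpha "    r     (0.0, 5.0)
localsearch   "--ls "       c     (none, 2opt, 3opt)
nnls          "--nnls "     i     (5, 50)             | localsearch != "none"
```

Here nnls is a conditional parameter: it is assigned a value, and passed to targetRunner, only when the R expression localsearch != "none" evaluates to true.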

A3. Output of irace
During its execution, irace prints a detailed report of its progress. In particular, after each race finishes, the elite configurations are printed; and, at the end, the best configurations found are printed both as a table and as command-line parameters (see the example output shown in Section 4.1). In addition, irace saves an R dataset file, by default as irace.Rdata, which may be read from R by means of the function load(). This dataset contains a list iraceResults, whose most important elements are:

scenario : the configuration scenario given to irace (any option not explicitly set has its default value).

parameters : the parameter space.

seeds : a matrix with two columns, instance and seed. Rows give the sequence of instance-seed pairs seen by irace.

allConfigurations : a data frame with all configurations generated during the execution of irace.

experiments : a matrix storing the result of all experiments performed across all iterations. Each entry is the result of evaluating one configuration on one instance at a particular iteration. Columns correspond to configurations and match the row indexes in allConfigurations. Rows match the row indexes in the matrix seeds, giving the instance-seed pair on which configurations were evaluated. A value of 'NA' means that this configuration was not evaluated on this particular instance, either because the configuration did not exist yet or because it was discarded.
The irace.Rdata file can also be used to resume a run of irace that was interrupted before completion.
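For example, the saved data can be inspected from an R session as follows. This is a minimal sketch that uses only the elements of iraceResults described above:

```r
load("irace.Rdata")       # creates the list 'iraceResults' in the workspace
iraceResults$scenario     # the scenario that was used (including default options)
nrow(iraceResults$allConfigurations)  # number of configurations generated
# Mean cost of each configuration over the instance-seed pairs on which it
# was actually evaluated; 'NA' entries are skipped via na.rm = TRUE.
colMeans(iraceResults$experiments, na.rm = TRUE)
```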

Fig. 1. Racing for automatic algorithm configuration. Each node is the evaluation of one configuration on one instance. '×' means that no statistical test is performed, '−' means that the test discarded at least one configuration, and '=' means that the test did not discard any configuration. In this example, T^first = 5 and T^each = 1.
and for ordinal parameters, x̄_d − x̲_d is replaced by |X_d| − 1, as these are the corresponding upper and lower bounds for an ordinal parameter.

Fig. 3. Parameter file (parameters.txt) for tuning ACOTSP. The first column is the name of the parameter; the second column is a label, typically the command-line switch that controls this parameter, which irace will concatenate with the parameter value when invoking the target algorithm; the third column gives the parameter type (integer, real, ordinal or categorical); the fourth column gives the range (in case of numerical parameters) or domain (in case of categorical and ordinal ones); and the (optional) fifth column gives the condition that enables this parameter.

Fig. 5 compares the configurations obtained by 30 independent runs of elitist irace and the default configuration of ACOTSP. Each run of irace uses the settings described above and a training set of 200 Euclidean TSP instances with 2000 nodes. The best configurations found by each run of irace are run once on a (different)

Fig. 7. Comparison of a configuration obtained by (elitist) irace and the default configuration of MOACO. Hypervolume values should be maximized (since irace minimizes by default, targetRunner multiplies the values by −1 before returning them to irace).

Fig. 9. Comparison between elitist and non-elitist irace. Plots give the mean candidate performance on the test instance set as % deviation from optima, runtime, and hypervolume for the ACOTSP (top), SPEAR (middle), and MOACO (bottom) scenarios, respectively. Hypervolume values are multiplied by −1 so that all scenarios must be minimized. The p-values of the statistical test are reported in the left plots.