Model selection with bootstrap validation

Model selection is one of the most central tasks in supervised learning. Validation set methods are the standard way to accomplish this task: models are trained on training data, and the model with the smallest loss on the validation data is selected. However, it is generally not obvious how much validation data is required to make a reliable selection, which is essential when labeled data are scarce or expensive. We propose a bootstrap‐based algorithm, bootstrap validation (BSV), that uses the bootstrap to adjust the validation set size and to find the best‐performing model within a tolerance parameter specified by the user. We find that BSV works well in practice and can be used as a drop‐in replacement for validation set methods or k‐fold cross‐validation. The main advantage of BSV is that less validation data is typically needed, so more data can be used to train the model, resulting in better approximations and efficient use of validation data.


INTRODUCTION
In machine learning, the model selection task is to select a model from a set of candidate models (or a model class), such that the performance of the chosen model is maximized on new data. Model selection is a fundamental task for which many approaches exist in the literature; see [14] for discussion.
The textbook standard for model selection is the holdout approach: split the data into training and validation sets, train the models on the training set, and then choose the model that performs best on the validation set. Another standard is cross-validation (CV), which utilizes the data more efficiently but at the computational cost imposed by the need to train the model several times: one model trained for each CV fold.
A central question with validation methods is how much data is needed for training the models and how much data should be set aside for validation. On the one hand, we would like to make the training set as large as possible to train the best models, but on the other hand, we need a validation set large enough to choose the best model reliably. However, it is non-trivial to find a statistical guarantee that ensures we are not making a random choice caused by a validation set that is too small. Indeed, determining the correct validation set size and the over-fitting caused by data reuse are significant issues in machine learning model evaluation [20].
The contribution of this paper is a method, called bootstrap validation (BSV) that selects a "good enough" model from k given models using as little data as possible. BSV is of general use and can be used as a drop-in replacement for the standard validation set methods.
The data efficiency provided by BSV is essential when data used to evaluate the model are scarce, such as when collecting new data is expensive or slow. As shown in our experiments, BSV often uses a substantially smaller amount of data than conventional methods, such as the holdout approach. As a result, we can increase the training set size, which results in better models, or use the remaining validation set items later for other purposes, such as estimating generalization performance.
The main idea in BSV is to automatically adjust the validation set size until we reach a user-specified tolerance and statistical guarantee. The statistical guarantee, specified by a parameter α, gives an upper bound for the probability that the selected model is not "good enough." The tolerance, specified by a parameter δ, defines that a model is "good enough" if its actual loss is within δ of the best out of the k models. BSV works with arbitrary loss functions and can be used in most supervised learning setups. The data efficiency of BSV is due to (i) selecting a "good enough" model instead of "the best" and (ii) being based on bootstrapping the validation set, which automatically takes into account correlations between model predictions.
The paper is organized as follows: Section 1.1 discusses potential applications of BSV. Section 1.2 reviews related work on validation set methods. Section 2 describes the problem of finding a model that is approximately the best, introduces the BSV algorithm for solving the problem, and shows its theoretical properties. Section 3 shows experimentally that the algorithm works as described and can be used as a drop-in replacement for validation set methods. Section 4 contains discussion and future work.

Applications of BSV
The main application of BSV is model selection, but it can also be applied in various other settings. We next discuss some of the potential applications.

Replacement for validation set methods
The most obvious application for BSV is to use it as a drop-in replacement for validation set methods. The advantage of BSV is that by specifying the required performance parameters, it automatically adjusts the required size of the validation set to be as small as possible and simultaneously sufficient to reach the required statistical guarantee. The potential drawback of BSV is that there is a non-zero probability that the algorithm needs more validation data than what is available. However, if this happens, BSV becomes equivalent to the traditional training-validation set split without statistical guarantees (the holdout approach). Another advantage of BSV, especially compared to k-fold CV, is that it is often necessary to train the model only once, as opposed to k times in k-fold CV.

Model hyper-parameter tuning
Related to the above application, BSV can also be used to tune hyper-parameters within a model family with the same advantages. BSV is especially useful for hyper-parameter tuning, as predictions from models with slightly different hyper-parameters are often highly correlated. Since the bootstrap uses the joint empirical distribution of the model predictions, it will take their correlations into account and count models with identical predictions effectively as one, resulting in efficient data use. An example of applying BSV for hyper-parameter tuning is given in Section 3.4.

Concept drift detection
An interesting topic for future study is concept drift detection [13,25]. In this scenario, we have trained a set of supervised learning (classification or regression) models and know the models that perform well on the original data. When the models are applied in real-life settings, it is essential to detect if the data distribution has changed in a way that affects the relative performance of the machine learning models. Even if there is a large volume of data, labeled data is often scarce and expensive. An example of such a scenario is sensor calibration, where the idea is to use regression models to "calibrate" the measurements of low-cost sensors [5]. In this application, the performance of a low-cost sensor is affected by environmental conditions; these interactions are captured and modeled by the regression models. Reference measurements (new labeled data points) for the low-cost sensors are usually expensive to produce, so it is vital to use the scarce labeled data items sparingly. We can utilize BSV in this setup by feeding the reference measurements to the algorithm. If BSV detects that some other model starts to perform better, we may conclude that concept drift has occurred and the model may have to be re-trained.

Interactive data exploration
BSV is useful in all setups where it is advantageous to save validation data for further uses. One such generic scenario is interactive data analysis. A typical workflow in such a setup is that a user (or algorithm) sees a view of the data, based on which they proceed to the next step, obtaining another view, and so forth; see, for example, [15,28,29]. During the exploration, the user can observe various patterns in the data. It would be useful if there were a holdout dataset that could be used to test whether an observed pattern is real or just a random artifact [32]. Data-efficient validation methods, such as BSV, can support such interactive data exploration by allowing the user to test more hypotheses with the holdout data than is possible with traditional methods.

Privacy-preserving machine learning
In many applications, it is essential to validate models while simultaneously revealing as little information as possible about the underlying data distribution, such as when the data contain sensitive information. One way to approach the problem is to introduce carefully controlled noise to the validation data [9,11] so that it is possible to select the best model yet minimize information leakage. BSV offers an alternative approach: instead of adding noise to the data, we can equivalently try to minimize the amount of data used for validation. By using less data for validation, BSV can limit information leakage compared to using the whole validation set.

Related work
Model selection is an essential task in machine learning.
There is a plethora of model selection methods that can be mainly differentiated into probabilistic methods, such as Akaike or Bayesian Information Criteria (AIC or BIC), and re-sampling methods, such as CV and bootstrapping; see [8] for a review. Our algorithm uses the bootstrap to select a "good enough" model. Although the bootstrap has been used for model selection previously, the focus has been on choosing optimal features [2,34], reducing the variance of bootstrap estimates [17], accuracy estimation with a dataset of a given size [16], or accuracy estimation after model selection [10]. In contrast, we use a bootstrap confidence interval to obtain a p-value for a derived loss statistic inside an iterative significance testing procedure.
The iterative nature of our algorithm (increasing the validation set until a criterion is satisfied) is similar to early stopping. Early stopping is often seen in the context of neural networks [27], but has also been used to speed up hyper-parameter optimization with adaptive resource allocation [19]. Furthermore, there are similarities to the Hoeffding races algorithm [21,22], which iterates over validation points and eliminates candidates whose lower error bound is worse than the upper error bound of the best candidate. The clinical trial literature also contains various methods that try to minimize data use in order to terminate a drug experiment after enough evidence has been collected on the efficacy of the treatment [26].
The model selection literature includes other related topics, such as over-fitting prevention with the reusable holdout [9], controlling bias via limiting information usage [31], and various data partition practices [3,7,16]. The latter is a major point of interest for various CV techniques [1]. While the literature contains various approaches for model selection of classifiers, they tend to either focus on the case of multiple datasets [6] or make simplistic assumptions on the stopping criteria (boundary behavior) when selecting a winner between two classifiers [18].
The probability bounds given for the model output by our algorithm are similar to bounds in prior work, such as in Rashomon sets [33] (the set of almost-equally-accurate models for a given problem), and predictive multiplicity [23].

METHODS
In this paper, we study the problem of finding a "good enough" supervised learning (classification or regression) model with minimal data. In this section, we formulate the problem, propose the BSV algorithm for solving it, and show the theoretical properties of the algorithm.

Problem definition
We assume we can draw i.i.d. data points (x, y) at random from an unknown but fixed distribution, where x is a covariate (e.g., a real or categorical vector) and y is the discrete class variable (for classifiers), probability (for probabilistic classifiers), or a real number (for regression). We assume that there are k distinct and fixed supervised learning models that have been trained beforehand with separate training data and that all try to estimate y given x, that is, ŷ = f_j(x), where j ∈ [k] and where we use the short-hand notation [k] = {1, …, k}. The functions f_j output discrete classes or class probabilities for classification tasks and real numbers for regression tasks. The (true) loss of a model j is defined by loss_j = E[L(f_j(x), y)], where L is a loss function and the expectation is over sampling of a new data point (x, y) from the distribution.
We can choose the loss function L freely, but in this paper, we will consider (i) classifier models that output the predicted class ŷ ∈ [K], where K is the number of classes, (ii) probabilistic classifiers that output a simplex vector of class probabilities ŷ ∈ Δ_K, and (iii) regression models that output real numbers ŷ ∈ ℝ. The respective loss functions are (i) the 0-1 loss L_01(ŷ, y) = I(ŷ ≠ y) for classification, where y ∈ [K] and I(z) is the indicator function that equals unity if z is true and zero otherwise, (ii) the clipped log-loss L_log(p̂_y, y) = min(−log p̂_y, L_max) for probabilistic classification, where p̂_y is the estimated probability for class y and L_max is a positive real constant, and (iii) the quadratic loss L_2(ŷ, y) = (ŷ − y)² for regression, where y ∈ ℝ. We assume throughout that the value of the loss is finite and has a well-defined mean and variance. This is why we use the clipped log-loss instead of the raw log-loss, which may, in practice, have infinite values. Notice that a well-defined mean is required for any validation set method to work; no validation method can work if the mean of the loss is ill-defined or cannot be estimated with a finite validation set.
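As a minimal sketch, the three losses can be evaluated pointwise as follows (the clipped form of the log-loss is as written above; the function names are ours):

```r
# 0-1 loss for hard classifiers: 1 if the predicted class differs from the true class.
loss_01  <- function(y_hat, y) as.numeric(y_hat != y)

# Clipped log-loss for probabilistic classifiers: p_y is the predicted probability
# of the true class; clipping at L_max keeps the loss finite when p_y is near zero.
loss_log <- function(p_y, L_max = 1) pmin(-log(p_y), L_max)

# Quadratic loss for regression.
loss_2   <- function(y_hat, y) (y_hat - y)^2
```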
Given a tolerance parameter δ and two models i, j ∈ [k], we say that the models i and j are approximately as good if their (true) losses are within δ of each other, that is, |loss_i − loss_j| ≤ δ. We say that a model is approximately the best if it is approximately as good as the model with the lowest loss, loss_min = min_{i∈[k]} loss_i. We now formally define the main problem addressed in this paper:

Problem 1. Given α, δ, and the definitions above, find a model i_alg ∈ [k] that is approximately the best with a probability of at least 1 − α, that is, Pr(loss_{i_alg} − loss_min ≤ δ) ≥ 1 − α.

If our algorithm outputs a model that is not approximately the best, we say that the algorithm made an error. We denote the set of approximately the best models by 𝒜_δ = {i ∈ [k] : loss_i − loss_min ≤ δ}. Our objective is to solve Problem 1 using as few data points as possible. We next describe the algorithms used to solve it in Section 2.2.

BSV: Finding a model that is approximately the best
Our algorithm BSV (Algorithm 1) works as follows: The algorithm takes as input the desired significance level α, the tolerance parameter δ, as defined in Problem 1, the initial number of data points T to sample, and the maximum number of data points T_max available for sampling. We let D_0 be T data points (1, x_1, y_1), …, (T, x_T, y_T) sampled i.i.d. from the data distribution (line 1). Here the data points are described by triplets (j, x, y), where j denotes the unique integer index of the data point.
We denote the empirical loss of model i ∈ [k] on a dataset D by

L_i(D) = (1/|D|) Σ_{(j,x,y)∈D} L(f_i(x), y).    (2)

Algorithm 1: Bootstrap validation (BSV). Input: α, the significance level; δ, the tolerance parameter; T, the initial number of data points; and T_max, the maximum number of data points, where 2T ≤ T_max. Output: a model i that is approximately the best with a probability of at least 1 − α. (The listing samples an initial set D_0 of T points and, at each iteration, finds the candidate with the smallest empirical loss on D_0, samples a set D_1 of at most |D_0| newly sampled data points, computes a bootstrap p-value on D_1 with Algorithm 2, adjusting the number of bootstrap samples so that small p-values can be resolved, and either terminates or appends D_1 to D_0; if T_max is reached, line 11 outputs a warning that we are unable to give a statistical guarantee.)

Furthermore, we define our test statistic for a dataset D and model i ∈ [k] to be

s_i(D) = L_i(D) − min_{j∈[k]\{i}} L_j(D),    (3)

where we have used the empirical loss defined by Equation (2). We observe that the model i ∈ [k] is approximately the best, according to the definition of Problem 1, if and only if, at the limit of an infinitely large D, the inequality S = lim_{|D|→∞} s_i(D) ≤ δ is satisfied. Notice that if the loss has a finite and well-defined mean, then S is always uniquely defined.
At each iteration t ∈ {1, 2, …}, the algorithm uses the data in D_0 to find a candidate model i_t^C = arg min_{i∈[k]} L_i(D_0) (line 5) and samples 2^{t−1} T new data points into D_1 from the distribution (line 6). The data points in D_1 are used by the bootstrap procedure of Algorithm 2 [24] to find a p-value p_t for the null hypothesis that the candidate model i_t^C is not approximately the best (line 7). If p_t ≤ α2^{−t}, the algorithm terminates and outputs i_t^C; otherwise, the counter t is incremented by one, D_1 is appended to D_0, and the iteration is repeated.
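A minimal R sketch of the bootstrap p-value, as we read Algorithm 2 from the description above and the derivation in the Appendix (the basic-bootstrap form p_t = Q^{−1}(2Ŝ − δ)); the argument names, the default number of bootstrap samples M, and T_min are assumptions:

```r
# losses: an n x k matrix of per-point losses L(f_i(x), y) on the freshly
# sampled points D_1; cand: index of the candidate model; delta: tolerance.
bsv_pvalue <- function(losses, cand, delta, M = 1000, T_min = 10) {
  n <- nrow(losses)
  if (n < T_min) return(1)               # too few points to bootstrap (line 1)
  stat <- function(idx) {                # test statistic s_cand on the rows idx
    m <- colMeans(losses[idx, , drop = FALSE])
    m[cand] - min(m[-cand])
  }
  S_hat  <- stat(seq_len(n))             # statistic on the original sample
  S_boot <- replicate(M, stat(sample.int(n, n, replace = TRUE)))
  mean(S_boot <= 2 * S_hat - delta)      # bootstrap CDF evaluated at 2*S_hat - delta
}
```

The returned value is the fraction of bootstrap replicates of the test statistic at or below 2Ŝ − δ, that is, the bootstrap quantile function inverted at 2Ŝ − δ.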
If the algorithm terminates because it reached T_max (the upper limit on the data points to be sampled), and the bootstrap p-value does not satisfy the termination condition, then a warning is output that we cannot give a level-α statistical guarantee that the output model is approximately the best (line 11). We observe that the algorithm uses ∼ min(T_max, 2^t T) data points, where t is the number of iterations run.
When T max is reached, the algorithm has used T max /2 validation points to select a model and T max /2 to test if it is approximately the best. In this case, we can instead use all T max points to select the best model, and BSV then reduces to the holdout approach.
If the data fed into the bootstrap procedure is too small (line 1 of Algorithm 2), then the bootstrap is not run and a p-value p = 1 is returned. The reason is that the bootstrap requires a minimum sample size from which to draw bootstrap samples; for example, we cannot bootstrap one data point. Since this affects both T_min in Algorithm 2 and T in Algorithm 1, we use T_min = T and ensure T is large enough by, for example, using the approximate conservative lower bound for T provided in Section 2.3. In practice, small sample sizes appear on line 6 of Algorithm 1 when fewer than |D_0| points are sampled for D_1. This happens when T_max/T is not a power of 2, resulting in T_max − 2^t T leftover data points in the last iteration. If T_max/T is a power of 2, then T_min is not required.
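A matching sketch of the outer loop of Algorithm 1 under the same assumptions; draw_losses(n) is a hypothetical sampler returning an n × k matrix of per-point losses on n freshly drawn validation points, and when T_max is exhausted the sketch falls back to the holdout choice on all data seen, as described above:

```r
bsv_select <- function(draw_losses, alpha, delta, T, T_max) {
  D0 <- draw_losses(T)                     # initial T points for candidate selection
  used <- T
  t <- 1
  repeat {
    cand <- which.min(colMeans(D0))        # candidate: smallest empirical loss on D0
    n1 <- min(nrow(D0), T_max - used)      # sample at most |D0| new points
    if (n1 <= 0) {                         # out of data: no guarantee can be given
      warning("T_max reached; no statistical guarantee")
      return(cand)
    }
    D1 <- draw_losses(n1)
    used <- used + n1
    p <- bsv_pvalue(D1, cand, delta)       # H0: the candidate is not approximately the best
    if (p <= alpha * 2^(-t)) return(cand)  # terminate with the level-alpha guarantee
    D0 <- rbind(D0, D1)                    # otherwise append D1 to D0 and iterate
    t <- t + 1
  }
}
```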
Algorithm 1 has three user-specified parameters: α, δ, and T. In general, data use increases with smaller α (requiring a smaller error probability), smaller δ (detecting smaller differences), and usually with larger T (more data for the initial guess). The significance level α is selected similarly to the significance level in statistical tests, which by convention take values such as α ∈ {0.1, 0.05, 0.01}. The tolerance δ concerns the difference between losses, and its magnitude depends on the loss function used. Typical values for the 0-1 loss may be δ ∈ {0.1, 0.05, 0.01}, depending on how close the user requires the selected model to be to the true best of the k models. The initial data size T sets the minimum data use at 2T, since T points are used for finding a candidate model and T for testing it. A larger T means the first iteration uses more data, which leads to a better initial candidate model. Selecting the correct value for T is problem dependent and typically requires domain knowledge of the modeling task. A conservative lower bound for T is given by Equation (4), which is based on bounding the bootstrap approximation error. In practice, the experiments show that values such as T = 128 result in the required probabilistic control for the datasets used in this paper.

Theoretical properties of BSV
We next discuss the theoretical properties of BSV. The technical proofs are in the Appendix. The first theorem shows that Algorithm 1 provides the probabilistic guarantee in Problem 1, when it does not run out of data: Theorem 1. Algorithm 1 makes an error with a probability of at most α, if the loss has a finite mean and variance and the algorithm does not run out of data.
Theorem 1 relies on the fact that the p-value given by Algorithm 2 is a proper p-value for the null case of a model not in 𝒜_δ. The next lemma proves that this p-value is approximately proper.

Lemma 1. At iteration t, the p-value p_t approximated by Algorithm 2 is to a good accuracy a proper p-value, that is, it satisfies Pr(p_t < a) ≲ a for any a ∈ [0, 1].

The proof of Lemma 1 contains an approximate sign due to the bootstrap approximation. The following lemma shows that the bootstrap approximation converges asymptotically, at least if the losses have finite means and variances:

Lemma 2. The bootstrap estimate H_BOOT(a′) of the distribution H_T(a′) of the test statistic converges to H_T(a′) almost surely for any a′ ∈ [0, 1] at the limit of infinite data.

We next provide an approximate lower bound for T in Equation (4) that ensures a good enough bootstrap approximation. The bound is derived from the observation that the highest error in the bootstrap approximation happens (at least for the 0-1 loss) when the model losses for the T data points are identical.

Observation 1. A conservative approximate lower bound for the necessary number of data points T for a good enough bootstrap approximation when using the 0-1 loss is given by

T ≳ log((k − 1)/α) / δ.    (4)

Typically, the model losses are correlated, and a smaller value of k can be used. If the loss is instead bounded to an interval [0, L_max], we can replace δ with δ/L_max in Equation (4).
The bound of Equation (4) shows that the number of required data points increases inversely proportionally to the required accuracy δ but only logarithmically in the required probability guarantee α. Notice that Algorithm 1 doubles the number of data points and halves the allocated significance level at each iteration. It follows that if Equation (4) is satisfied at the first iteration of Algorithm 1, then it will also be satisfied at subsequent iterations.
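As a rough worked example, and assuming the reconstructed form of Equation (4) above, k = 10 models with α = 0.05 and δ = 0.05 give

T ≳ log((k − 1)/α)/δ = log(9/0.05)/0.05 ≈ 104,

which is of the same order as the initial size T = 128 used in the experiments.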
The final theorem (Theorem 2) shows that, under a not-too-strict assumption on the test statistic, the algorithm stops after a finite number of steps. The assumption is that the test statistic is sub-Gaussian, which should hold at least for any loss bounded on an interval [0, L_max]. The non-sub-Gaussian case corresponds to a learning problem that is difficult for any validation set based method. Notice that there is still a non-zero probability that the algorithm does not stop before running out of validation data when only a finite amount of validation data is available.
Theorem 2. Algorithm 1 stops almost surely, at least if the distribution of the test statistic s_i(D) for any model i ∈ [k] is sub-Gaussian with variance proxy σ_i² upper bounded by σ_i² ≤ σ²/|D| for some σ, even when there is an unlimited amount of data available, or T_max → ∞.
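To make the stopping argument concrete, a sketch of the comparison used in the proof (c denotes an unspecified positive constant): by the Chernoff bound for sub-Gaussian variables,

Pr(s_i(D) − S ≥ Δ) ≤ exp(−Δ²/(2σ_i²)) ≤ exp(−Δ²|D|/(2σ²)) ∝ exp(−c·2^t),

since |D| ∝ 2^t T, whereas the termination threshold only decays as α2^{−t}; the exponential decay eventually wins, so the p-value falls below the allocated level after finitely many iterations.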

EXPERIMENTS
In the experiments, we apply BSV to synthetic (Section 3.1) and real datasets (Section 3.2). We also briefly discuss the advantages compared to CV (Section 3.3) and describe a parameter tuning use-case (Section 3.4).
The experiments demonstrate that BSV can be used as a drop-in replacement for the standard holdout approach while using fewer data and satisfying the probabilistic guarantees of the previous section. The experiments also examine how different parameters affect its performance in terms of error rate (selecting a model not in 𝒜_δ), actual data use, and power (how often the p-value of a correctly chosen model is below α).
The source code and the results from the experiments are available at github.com/edahelsinki/bsv. All real datasets used in the experiments are available from openml.org.

Datasets
The datasets used in the experiments are empirical losses from given, trained models. We use both synthetically generated losses and losses on real datasets.

Synthetic data
The synthetic losses are 0-1 losses generated by treating a model's loss as a Bernoulli random variable. If we denote the 0-1 loss of model i on a data point (x, y) by ℓ_i = L_01(f_i(x), y), then ℓ_i is a Bernoulli random variable whose mean equals one minus the accuracy of model i.

Real data
We again use model losses for the experiments with real datasets instead of training models and evaluating losses on data. We compute losses directly on model predictions for various datasets retrieved from the openML repository [35] using the openML R API [4]. We primarily focus on the 0-1 loss for binary classification tasks. We also compute the clipped log-loss (for models that output a probability) and the mean square error (MSE) for regression tasks. The log loss is clipped to a maximum value, L max = 1. The datasets were selected based on having available a large enough validation set and predictions from a large enough collection of models. The Appendix contains a list of the datasets with their openML identifiers and visualizations of model losses.
The experiments also include modified versions of the real datasets, with clear and unclear winners. These are denoted by _clear and _unclear, and are visible in Figure 6 and in the additional results in the Appendix. In the _clear versions of the datasets, models are removed such that the difference between the accuracy of the best model and the remaining models is at least 0.1. In the _unclear versions, only the top k_clear best-performing models are kept, where k_clear is the number of models in the respective _clear version, so that the results of the _clear and _unclear versions of a dataset are comparable and any differences are not due to the number of models.

Experimental setup
In the experiments, Algorithm 1 is run on the datasets described above using various values for the parameters α, δ, and T. The performance metrics are the data use n, the error rate error = Pr(i_alg ∉ 𝒜_δ), and the empirical power power = Pr(p_t ≤ α_t), where p_t and α_t = α2^{−t} are the p-value and the allocated significance level in the last iteration. Good performance means low data use while having low error and high power. Another metric, shown in Table 3, is the empirical alpha α_emp = Pr(i_alg ∉ 𝒜_δ ∧ p_t ≤ α_t), which indicates that the p-values produced by the algorithm are calibrated, meaning α_emp ≤ α. Note that the probabilistic guarantee of Algorithm 1 is α_emp ≤ α, which is equivalent to error ≤ α when the data use does not exceed T_max.
The algorithm is run n_run = 1000 times for each parameter combination to compute averages, standard deviations, and empirical probabilities. Throughout the paper, empirical probabilities are computed as Pr(□) ≈ (1/n_run) Σ_{i=1}^{n_run} I(□_i), where I(□_i) is the indicator function which equals unity if the Boolean expression □_i is true on run i and zero otherwise. The experiments were run using R 4.1 [30] on a high-performance computing cluster.
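A minimal sketch of these indicator averages in R (run_once() is a hypothetical wrapper returning, for one BSV run, the selected model, the set of approximately best models, the final p-value, and the allocated level α_t):

```r
n_run <- 1000
runs  <- replicate(n_run, run_once(), simplify = FALSE)
error <- mean(sapply(runs, function(r) !(r$selected %in% r$approx_best)))  # Pr(i_alg not in A_delta)
power <- mean(sapply(runs, function(r) r$p_value <= r$alpha_t))            # Pr(p_t <= alpha_t)
```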

Synthetic data experiments
This section demonstrates simple scenarios using synthetic 0-1 losses for k models. Synthetic 0-1 losses are simulated as Bernoulli variables with a probability of success equal to the model's accuracy. We show that (i) the p-values computed with Algorithm 2 work as expected, (ii) there is a trade-off between error rate and data use, which BSV handles well, and (iii) BSV performs well when there are models that are superior to others.
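A small sketch of such a simulation (the accuracy values are illustrative placeholders, not the actual synth_5 values):

```r
# The 0-1 loss of model i on a point is a Bernoulli draw with mean 1 - accuracy[i].
accuracy <- c(0.90, 0.80, 0.80, 0.80, 0.60)
draw_losses <- function(n) {
  sapply(accuracy, function(a) rbinom(n, size = 1, prob = 1 - a))
}
# e.g. plugged into the bsv_select sketch of Section 2.2:
# bsv_select(draw_losses, alpha = 0.05, delta = 0.1, T = 128, T_max = 5000)
```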

p-value simulation
We first examine the synth_5 dataset, which contains 5 models, whose losses are shown in Figure 1. We examine a tolerance δ = 0.1, such that models 2, 3, and 4 are exactly δ = 0.1 from model 1. If we now consider the p-values for these models, we expect the p-value to often be low for the best model and often large for the non-best models, that is, Pr(p_1 ≤ α) ≫ α, Pr(p_i ≤ α) ≈ α for i = 2, 3, 4, and Pr(p_5 ≤ α) ≪ α. This is confirmed by the QQ plot of the p-values against a uniform distribution, in which the diagonal corresponds to a uniform distribution. We observe that the p-values behave as described: the best model (1) is below the diagonal, meaning its p-values tend to be small (stochastically smaller than uniform), and the worst model (5) is above the diagonal, meaning its p-values tend to be large (stochastically larger than uniform). The middle models (2, 3, and 4) lie on the diagonal, meaning their p-values are approximately uniform.
The power of the procedure (that is, how often p_1 ≤ α) can also be read from the figure for any α as the intersection of the best model's curve with a horizontal line at α (here the dashed line is at α = 0.05). We observe that the power is close to 1, which means the p-value for the best model is almost always less than α.

Data-error tradeoff
In model selection, there is an inherent trade-off between data use and error rate, as shown in Figure 2 for the synth_5 and spambase datasets. If we use n randomly sampled points to select the model with the smallest loss on this sample, then the error rate is the probability that the chosen model is not in 𝒜_δ. We can make this error smaller by increasing the data use n. The gray line indicates this error for a given n, which we call the reference boundary. In BSV, we seek a solution that has an error lower than a threshold α (the dotted line denotes α = 0.05) and that uses as little data as possible (small n). Figure 2 shows the variation in data use for a given error rate and an initial number of data points T. For smaller values of T, such as 16, data use varies widely because BSV automatically adjusts the data use to ensure an error below α = 0.05. The variation for larger values (T = 256) is smaller, since the best model on T points is already a good enough estimate. In both cases, BSV hovers around the reference data use and often requires less data (as indicated by the frequencies of each n, shown as vertical lines). Notice in the spambase dataset that often only a few data points are required to select even between k = 223 models if the goal is to find a model within δ = 0.05 from the best (here 142 models satisfy this requirement). While a standard train-validation split uses the whole validation data at once, BSV adaptively increases the size of the data such that the selected model is within δ of the best with an error level less than α.
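A minimal sketch of estimating such a reference boundary by simulation (true_loss and draw_losses are assumed inputs, for example, the Bernoulli sampler sketched above):

```r
# Probability that the empirical-loss minimizer on n random points falls
# outside the set of approximately best models.
reference_error <- function(n, draw_losses, true_loss, delta, reps = 1000) {
  approx_best <- which(true_loss <= min(true_loss) + delta)
  picks <- replicate(reps, which.min(colMeans(draw_losses(n))))
  mean(!(picks %in% approx_best))
}
# e.g. reference_error(200, draw_losses, true_loss = 1 - accuracy, delta = 0.1)
```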

Effect of clear winners
We next examine how the algorithm's data use and error are affected by the number of models (k) and the number of models that are approximately the best (k_δ = |𝒜_δ| for some δ), by running BSV on N = 5000 points of the synth_k and synth_kbest datasets.
In the synth_k dataset, there is one clear winner and k − 1 models that are not in 𝒜_δ; losses for k = 4 and k = 16 are shown in Figure 3 on the right. Figure 3 shows that in this clear-winner case, increasing the number of models (larger k) does not increase the error rate (solid line, left figure) or the data use (middle figure). This indicates that the algorithm works well in the "needle in a haystack" problem of finding the single best model out of clearly inferior ones. In the synth_kbest dataset, we vary k_δ, the number of models in 𝒜_δ; losses for k_δ = 1 and k_δ = 16 are shown in Figure 4 on the right. Figure 4 shows that even when there are multiple clear winners, the error rate remains constant (solid line, left figure). Notice that the data use is adjusted automatically to ensure a low error rate (middle figure).
The gray line in both figures denotes the error rate of a dummy model selection where a model is selected uniformly at random from the k models, which leads to an error rate of 1 − k_δ/k. The dummy baseline highlights that model selection is challenging when there are many models k or only a small number of "good" models k_δ. BSV performs well in both cases.

Real-world data experiments
We next apply BSV to real datasets and examine how the tolerance parameter δ affects its data use.

Results
The results on the real datasets demonstrate that BSV often requires only a fraction of the available data to make a selection, that it can accurately select a model that is "approximately the best" out of a large number of models k, and that it works with various loss functions. In the results shown in Table 1, we observe that across all datasets and losses, the average data use E[n] is lower than the available validation data N, which is the data used by the standard holdout approach. We also observe that the error is controlled, meaning a correct model is selected, and the power is high, meaning the p-value is low when it should be.
In some cases, BSV requires as little as 2% of the available data on average for a model selection as accurate as the standard holdout for a given tolerance δ. These results indicate that BSV can be used as a data-efficient replacement for validation set methods. While the error is relatively constant over different scenarios, the data use and power vary, mainly due to the tolerance δ, the number of models k, and the initial number of data points T. We next look at the effect of δ in more detail.

Effect of the tolerance δ
Detecting smaller differences (small δ) generally requires more data. Figure 5 illustrates how δ affects data use n for four datasets. There is some variation due to the random nature of the bootstrap, but in all datasets, the data use is less than the maximum available N, shown with a dotted line.
Especially for larger δ, model selection with BSV uses only a fraction of the available data. For example, for δ = 0.1 in the electricity dataset, we can select between 109 models using only 267 ± 53 data points (≈ 2% of the available data), with an error of 0.1%. Generally, a smaller δ also leads to a larger error, which is nevertheless still bounded by α = 0.05.
As was shown in the synthetic experiments, the effect of δ on data use is mediated by whether there is a clear winner, meaning a model with a loss much lower than the others. We also observe this in the real datasets: when there is a clear winner, the data use is approximately constant over different values of δ (Figure 6). The error rate in these scenarios is below 1% for the clear case, while for the unclear case it is zero, since when all models are approximately the best, any selection is correct.

Comparison to cross validation
One advantage of BSV over CV is that the models need to be trained only once. In contrast, CV requires models to be trained once per fold, which, at the limit of leave-one-out CV (LOOCV), is equal to the number of data points. Table 2 compares the number of times k models are trained when using BSV versus when using CV and LOOCV.

FIGURE 5. The effect of the tolerance parameter (δ) on data use (n) using 0-1 loss. Detecting smaller differences (smaller δ) requires more data; for larger differences, the data used is a fraction of the total available data. The upper dotted horizontal line denotes the maximum available validation data for each dataset (N), and the lower dotted line denotes the minimum data use for the given T (2T). Confidence bands denote the standard deviation over 1000 runs with α = 0.05 and T = 128.

In a typical use of BSV, N_tr points are used for training and n < N_va points from a validation set are used for model selection. In contrast, CV with 10 folds uses N = N_tr + N_va points in total and N/10 points per fold for model selection.

Parameter tuning use case
So far, we have considered model selection when given k models. Another common use case in model building is parameter tuning, in which we seek the optimal parameters for a single model. We can perform model selection in continuous parameter spaces by defining a grid over the parameter values; parameter tuning can then be viewed as a model selection task over k different model versions. We here present an example of tuning the regularization parameter λ in LASSO regression on the spambase dataset. We first train the LASSO model (using N_tr = 460 points) on a grid of k = 95 values for λ, producing k versions of the model, and then use BSV (on N_va = 461 validation points) to select a model that is approximately the best. The top images in Figure 7 show the test losses (on N_te = 3680 points).

TABLE 2. Number of times k models are trained when using different validation methods. BSV requires models to be trained only once.

Typically, when CV is used in parameter tuning, it estimates the generalization error and selects the λ_min with the lowest CV estimate of the error. In Figure 7, the shown λ_min is computed by the glmnet R package [12]. As mentioned in the previous section, 10-fold CV offers a lower error rate than BSV, but uses more data (N = N_va + N_tr = 921 points) and trains each model 10 times.
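A hedged sketch of this workflow (x_train, y_train, x_val, and y_val are placeholders, the λ grid is illustrative, and bsv_select refers to the sketch in Section 2.2; here the "sampler" simply draws previously unused points from the fixed validation pool):

```r
library(glmnet)
lambda_grid <- exp(seq(log(1e-4), log(1), length.out = 95))       # illustrative grid of 95 values
fit <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_grid)  # alpha = 1 gives the LASSO

# Per-point quadratic losses of the k = 95 model versions on the validation pool.
pred_val <- predict(fit, newx = x_val, s = lambda_grid)           # n_val x 95 matrix
loss_val <- (pred_val - y_val)^2

# BSV draws previously unused points from the fixed validation pool.
pool <- sample(nrow(loss_val))
draw_losses <- local({
  used <- 0
  function(n) {
    idx <- pool[(used + 1):(used + n)]
    used <<- used + n
    loss_val[idx, , drop = FALSE]
  }
})
# chosen <- bsv_select(draw_losses, alpha = 0.05, delta = 0.1, T = 128,
#                      T_max = nrow(loss_val))
```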
The advantage of BSV is that it allows parameter tuning to trade off between data use and error rate by adjusting the tolerance δ. For example, for δ = 0.1 in Table 3, the error rate is 0.1%, and the data use E[n] can be a fraction of the whole validation data, while for δ = 0.01 we get a 12%-15% error rate if T is too small.
Notice that in Table 3 the error rate is sometimes larger than α. This occurs if Algorithm 1 has not yet confirmed that the candidate model is correct (the p-value is not below the allocated level) when it runs out of data (T_max is reached). In these cases, the algorithm outputs a warning.

DISCUSSION
We have presented the BSV algorithm that can find a "good enough" classification or regression model out of a collection of given models by adaptively changing the validation set size. In BSV, the user can pre-define a probability and a tolerance, such that the algorithm is guaranteed, with a high probability, to output a model that is "approximately the best," that is, a model whose expected loss on new data is within the given tolerance of the smallest loss.

TABLE 3. Parameter tuning for LASSO regression on the spambase dataset using BSV with α = 0.05. See Table 1 for column descriptions.

We have shown theoretically and experimentally that BSV uses data economically and that it can be used as a drop-in replacement for validation methods for black box supervised learning models. In addition to optimizing the usage of validation data, the advantages of BSV include a statistical guarantee for the selected model and that training the models only once is sufficient, as in a simple training-validation set split. The standard CV methods may still be preferable if a statistical guarantee is not required, training the models several times (one for each CV fold) is acceptable, and the validation data is not needed later.
We see a few directions for future work. First, the bootstrap p-value computed by Algorithm 2 is based on the basic bootstrap confidence interval, but it could be based on a different type of bootstrap confidence interval. Second, in Algorithm 1, the total error rate allocated to iteration t is α2^{−t}, while the data use is increased as 2^t T, where T is the initial data size. Other options besides halving the error and doubling the data are possible and may result in improved performance. Third, BSV currently assumes i.i.d. data; therefore, it does not, in theory, handle dependent data, such as time series. That said, the electricity dataset used in the experiments is a time series, so in practice, the approach may work for dependent data under certain conditions. Extensions for dependent data may include using a bootstrap method that handles dependent data and modifying the iterative testing procedure to utilize known data autocorrelations for improved power. BSV can potentially be used in the applications outlined in Section 1.1, such as concept drift detection and interactive data exploration.
Finally, even though in most examples discussed in this paper the k models are from different model families, Problem 1 is a formulation of a generic learning problem. The models f_j, where j ∈ [k], can also represent a model family parametrized by a discretized parameter vector j. For example, the model family could be a given deep learning architecture, and the parameter vector j would consist of the network weights. Solving Problem 1 would then be equivalent to finding the model parameters that provide the best generalization performance. While a naive application of BSV to such a task would be infeasible (except for limited use cases, such as the parameter tuning in Section 3.4), the same ideas could be used to formulate a variant of stochastic gradient descent (SGD). One iteration of this SGD variant would correspond to using (a modified) BSV to select the best-performing parameter vector from a neighborhood of the best-performing parameter vector found so far.

ACKNOWLEDGMENTS
The authors thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. We acknowledge the support of the Doctoral Programme in Computer Science at the University of Helsinki, Academy of Finland (decisions 345704 and 346376), and the Future Makers Program of the Technology Industries of Finland Centennial Foundation and Jane and Aatos Erkko Foundation. Open Access funding enabled and organized by the Helsinki University Library.

DATA AVAILABILITY STATEMENT
The source code to reproduce the experimental results is publicly available at github.com/edahelsinki/bsv. The datasets used in the experiments are publicly available at openml.org.

A.1 Technical proofs
The proofs based on the bootstrap are true asymptotically at the limit T → ∞ if the loss has a finite and well-defined mean and variance. Also, the proofs assume that we can compute the p-values p_t, which are approximated in Algorithm 2 (line 6) with an empirical p-value p̂_t(M) on M bootstrap samples. By the law of large numbers, p̂_t(M) → p_t as M → ∞. Therefore, the proofs hold asymptotically for p̂_t(M) at the limit M → ∞.
Proof of Theorem 1. Algorithm 1 makes an error (assuming the data limit T_max is not reached) only if it outputs a model i that is not approximately the best, that is, i ∉ 𝒜_δ. The error can happen at iteration t only if the candidate model i_t^C is not approximately the best and the p-value p_t satisfies the termination condition p_t ≤ α2^{−t}. Below, we analyze the infinite algorithm that runs an infinite number of iterations and only outputs the approximately best model at the iteration where Algorithm 1 would terminate. Algorithm 1 makes an error only if the infinite algorithm has at least one iteration t where i_t^C ∉ 𝒜_δ and p_t ≤ α2^{−t}. The probability of at least one such event is upper bounded by

Pr(error) ≤ Pr(∪_{t=1}^∞ {i_t^C ∉ 𝒜_δ ∧ p_t ≤ α2^{−t}}) ≤ Σ_{t=1}^∞ Pr(p_t ≤ α2^{−t} ∧ i_t^C ∉ 𝒜_δ) ≲ Σ_{t=1}^∞ α2^{−t} = α.    (A1)

The first inequality follows from the observation that if Algorithm 1 makes an error, then the infinite algorithm must have at least one iteration where i_t^C ∉ 𝒜_δ and p_t ≤ α2^{−t}. The last inequality follows from Lemma 1 below. □

Proof of Lemma 1. We use the bootstrap to study the statistic s_i(D) given by

s_i(D) = L_i(D) − min_{j∈[k]\{i}} L_j(D),    (A2)

which is an approximator for the asymptotic "true" value, denoted here by S = lim_{|D|→∞} s_i(D). In this proof, we take i to be fixed and assume that the model i is not approximately the best, which implies that δ < S. Further, we assume we are given a constant a ∈ [0, 1]. We denote by D* a random variable representing a bootstrap sample of the dataset D, obtained by sampling |D| data points from D randomly with replacement. We denote the corresponding test statistic by S* = s_i(D*), and by Q(a) the a-th quantile of the bootstrap distribution of S*. Furthermore, we use Ŝ = s_i(D) to denote the value of the statistic on the original (non-bootstrapped) data.

We can write

a = Pr*(S* ≤ Q(a)) = Pr*(S* − Ŝ ≤ Q(a) − Ŝ) ≈ Pr(Ŝ − S ≤ Q(a) − Ŝ) = Pr(Δ_a ≤ S),    (A3)

where we defined a new random variable Δ_a = 2Ŝ − Q(a) for convenience. The first equality follows from the definition of the a-th quantile. The approximate sign follows from the assumption that the distribution of the random variable S* − Ŝ converges to that of Ŝ − S almost surely at the limit T → ∞, at least if the loss has a finite mean and variance. For a proof of convergence, we refer to Lemma 2 below. Notice that Equation (A3) gives the level-a basic bootstrap upper bound for the value of S [36,37]; see Sect. 6.2.3 of [37] for discussion and analogous derivations. From δ < S (which follows from our assumption that the model i is not approximately the best) and Equation (A3), it follows that

Pr(Δ_a < δ) ≤ Pr(Δ_a ≤ S) ≈ a.    (A4)

By rearranging Δ_a = 2Ŝ − Q(a) < δ we obtain an equivalent expression p_t = Q^{−1}(2Ŝ − δ) < a, where we have defined a new random variable p_t = Q^{−1}(2Ŝ − δ); p_t is the p-value computed approximately by Algorithm 2 (line 6). Finally, by replacing Δ_a < δ with p_t < a in Equation (A4), we end up with the expression

Pr(p_t < a) ≲ a,    (A5)

which proves the claim. □

Proof of Lemma 2. We refer to Example 3.1 of [38], which shows the consistency of the bootstrap for the sample mean of random p-vectors with a finite mean and variance.
The p-vectors of [38], denoted by X, correspond here to (k − 1)-dimensional vectors (p = k − 1), where each of the elements (indexed by j) is given by the difference in losses, X_j = L(f_k(x), y) − L(f_j(x), y), where we have taken i = k for simplicity of notation. H_BOOT(a′) can be defined analogously.
With these definitions, [38] shows that the bootstrap estimator H_BOOT converges almost surely to the cumulative distribution H_T at the limit of infinite data (T → ∞) and bootstrap samples (M → ∞) if the random variables (or here the losses) have a finite mean and variance. □

Proof of Theorem 2. In the proof, we are interested only in the asymptotic behavior of the algorithm when t → ∞. At this limit, the candidate model i = lim_{t→∞} i_t^C is the model with the smallest loss with a very high probability, since |D| ∝ 2^t. In the remainder of this proof, we therefore assume that loss_i = loss_best. The Chernoff bound for sub-Gaussian random variables reads

Pr(s_i(D) − S ≥ Δ) ≤ exp(−Δ²/(2σ_i²)).

The left-hand side of the equation is a probability that acts as a proxy for the p-value of the best model i, for some constant Δ. The algorithm stops if the p-value goes below α2^{−t}. The right-hand side is proportional to exp(−c|D|) ∝ exp(−c·2^t) for some constant c > 0. This upper bound goes down faster than the algorithm's stopping limit, and therefore Algorithm 1 stops after a finite number of steps almost surely. □

A.1.1. Derivation of Observation 1
We first introduce additional notation and establish an auxiliary lemma. Consider the case of a 0-1 loss with two models (k = 2), and take i = 1 in (2). The test statistic of Equation (A2) is then given by Ŝ = (1/T) Σ_{i=1}^T Z_i, where Z_i ∈ {−1, 0, 1}. Denote the probability of Z_i = −1 by p_− (model 1 makes a wrong prediction and model 2 gives a correct prediction), the probability of Z_i = 0 by p_0 (the models give the same prediction), and the probability of Z_i = 1 by p_+ (model 1 makes a correct and model 2 a wrong prediction). The "true" value of the test statistic (at the limit T → ∞), here the difference in model accuracies, is given by S = p_+ − p_−. Denote by x_−, x_0, and x_+ the counts of observed values −1, 0, and 1 of Z_i, respectively. Notice that p_− + p_0 + p_+ = 1 and x_− + x_0 + x_+ = T by definition.

We can use the observed counts to compute the level-a′ exact upper bound Δ_E for the value of S in terms of B(p; v, w), the p-th quantile of the Beta distribution with shape parameters v and w, where we have assumed a uniform prior.

Proof. The trinomial distribution gives the likelihood of the observed counts, which is proportional to p_−^{x_−} p_0^{x_0} p_+^{x_+}, where x_0 = T − x_− − x_+ and p_0 = 1 − p_− − p_+. We can re-parametrize the likelihood by S and p_−. Due to the uniform prior Pr(S | p_−) for S ∈ [−p_−, 1 − p_−], we can write the posterior for S (where the normalization term Z contains terms constant in S) as in Equation (A10). The claim follows directly from the definition of the Beta distribution after inserting the expression of the posterior from Equation (A10) into the integral of Equation (A11). □

Observation 1. We can now derive a conservative approximate lower bound for T. Equation (A3) essentially tries to estimate the upper bound Δ_E by using the bootstrap approximation, denoted by Δ_a. In practice, the largest error ε′ = Δ_E − Δ_a in the upper bound is obtained when x_0 = T and x_− = x_+ = 0, that is, when the two models give the same predictions for all T data points. If this happens, the bootstrap approximation gives Δ_a = 0, and therefore the bootstrap error is upper bounded by ε′ ≤ Δ_E (Equation (A12)), which applies to any value of p_− ∈ [0, 1]; this allows us to drop the nuisance parameter p_− from consideration, and we have assumed that x_− = x_+ = 0. Rewriting Equation (A12) and solving for T, we obtain T ≈ log(1/a′)/ε′, where we have used log(1 − ε′) ≈ −ε′. The lower bound for the case k ≥ 3 follows by simply noticing that the test statistic S is a maximum of k − 1 differences, for which we can make a Bonferroni correction for a′, giving T ≈ log((k − 1)/a′)/ε′. It therefore follows that if we want the bootstrap error ε′ to be smaller than the tolerance parameter δ, or ε′ ≲ δ, with a probability of at least 1 − α, or a′ ≲ α, we can, by setting ε′ → δ and a′ → α, obtain the conservative approximate lower bound for the number of data points, T ≳ log((k − 1)/α)/δ (Equation (4)). If the number of data points T satisfies this bound, then we expect the bootstrap approximation to behave sufficiently nicely.

A.2 Dataset descriptions
Table A1 contains the OpenML [35] identifiers for the datasets used in the experiments. Data ID refers to a dataset, and Task ID refers to a dataset plus an estimation procedure, either "33% Holdout Set" or "10-fold Cross Validation." Note that the experiments use model predictions on these datasets but not the datasets themselves (except for the parameter tuning example). The log-loss data, denoted with _ll in the experiments, are computed on the same datasets as the ones with the 0-1 loss.

The openML datasets were selected as follows: For the binary classification tasks, we retrieved predictions on a 1/3 holdout set (reported as N in the tables) from models trained on the other 2/3 of the data. The selection criteria were that the datasets should have at least 3000 data points (of which 1/3 are in the validation set) and predictions from at least 10 models. For binary classification, we also retrieved 10-fold cross-validated model predictions. We used the first fold as the validation set (also reported as N in the tables) for models trained on the other folds (9/10 of the data). For these datasets, the selection criteria were a sample size of 10^4 to 10^6 data points and predictions from 10 to 500 models. The clipped log-loss was computed for a subset of the binary classification models that output a probability. For regression tasks, due to the lack of holdout-set predictions in openML that satisfy the above criteria, we only retrieved 10-fold cross-validated model predictions, as described above. The selection criteria for regression were that the datasets should have at least 1000 points and predictions from at least five models. Figures A1 and A2 summarize the empirical model losses on the validation set for the selected datasets. The shape of the "loss profile" shown in the bar chart summarizes the properties of the model selection task, such as whether there are clear or unclear winners.
The columns in the tables are N: number of available validation data points; k: number of models; k_δ: number of models in 𝒜_δ for a given δ; δ: tolerance parameter; E[n]: average data use; sd(n): standard deviation of data use; error: proportion of incorrect selections; power_emp: proportion of correct selections with a final p-value less than the required threshold. Averages and proportions are over 1000 runs of BSV.

FIGURE A2. Empirical losses (y-axis) of models (x-axis) for datasets used in the experiments.