Surrogate Losses in Passive and Active Learning

Active learning is a type of sequential design for supervised machine learning, in which the learning algorithm sequentially requests the labels of selected instances from a large pool of unlabeled data points. The objective is to produce a classifier of relatively low risk, as measured under the 0-1 loss, ideally using fewer label requests than the number of random labeled data points sufficient to achieve the same. This work investigates the potential uses of surrogate loss functions in the context of active learning. Specifically, it presents an active learning algorithm based on an arbitrary classification-calibrated surrogate loss function, along with an analysis of the number of label requests sufficient for the classifier returned by the algorithm to achieve a given risk under the 0-1 loss. Interestingly, these results cannot be obtained by simply optimizing the surrogate risk via active learning to an extent sufficient to provide a guarantee on the 0-1 loss, as is common practice in the analysis of surrogate losses for passive learning. Some of the results have additional implications for the use of surrogate losses in passive learning.


Introduction
In supervised machine learning, we are tasked with learning a classifier whose probability of making a mistake (i.e., error rate) is small. The study of when it is possible to learn an accurate classifier via a computationally efficient algorithm, and how to go about doing so, is a subtle and difficult topic, owing largely to nonconvexity of the loss function: namely, the 0-1 loss. While there is certainly an active literature on developing computationally efficient methods that succeed at this task, even under various noise conditions [e.g., 2, 30-32], it seems fair to say that at present, many of these advances have not yet reached the level of robustness, efficiency, and simplicity required for most applications. In the meantime, practitioners have turned to various heuristics in the design of practical learning methods, in attempts to circumvent these tough computational problems. One of the most common such heuristics is the use of a convex surrogate loss function in place of the 0-1 loss in various optimizations performed by the learning method. The convexity of the surrogate loss allows these optimizations to be performed efficiently, so that the methods can be applied within a reasonable execution time, using modest computational resources. Although classifiers arrived at in this way are not always guaranteed to be good classifiers when performance is measured under the 0-1 loss, in practice this heuristic has often proven quite effective. In light of this fact, most modern learning methods either explicitly make use of a surrogate loss in the formulation of optimization problems (e.g., SVM), or implicitly optimize a surrogate loss via iterative descent (e.g., AdaBoost). Indeed, the choice of a surrogate loss is often as fundamental a part of the process of approaching a learning problem as the choice of hypothesis class or learning bias.
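To make the heuristic concrete, here is a toy illustration (invented for exposition, not drawn from any cited work): replacing the 0-1 loss with the convex quadratic surrogate ℓ(z) = (1 − z)² turns an intractable search over thresholds into a closed-form least-squares problem, and on this separable example the resulting classifier still achieves zero 0-1 loss.

```python
# Toy sketch of the surrogate-loss heuristic: instead of directly minimizing
# the (nonconvex, piecewise-constant) 0-1 loss, we minimize the convex
# quadratic surrogate (1 - y*f(x))^2 over linear scores f(x) = a*x + c,
# which has a closed-form solution, and then classify by sign(f(x)).

xs = [i / 10 for i in range(10)]
ys = [1 if x >= 0.5 else -1 for x in xs]       # noiseless labels (toy)

# since y^2 = 1, (1 - y*(a*x + c))^2 = (y - (a*x + c))^2, so minimizing the
# surrogate risk is ordinary least squares of y on x
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
c = y_bar - a * x_bar

error_rate = sum((1 if a * x + c >= 0 else -1) != y
                 for x, y in zip(xs, ys)) / n
```

Here the surrogate minimizer places the decision boundary at −c/a = 0.45, between the two classes, so the 0-1 error on the sample is zero even though the 0-1 loss was never optimized directly.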
Thus it seems essential that we come to some understanding of how best to make use of surrogate losses in the design of learning methods, so that in the favorable scenario that this heuristic actually does work, we have methods taking full advantage of it.
In this work, we are primarily interested in how best to use surrogate losses in the context of active learning, which is a type of sequential design in which the learning algorithm is presented with a large pool of unlabeled data points (i.e., only the covariates are observable), and can sequentially request to observe the labels (response variables) of individual instances from the pool. The objective in active learning is to produce a classifier of low error rate while accessing a smaller number of labels than would be required for a method based on random labeled data points (i.e., passive learning) to achieve the same. We take as our starting point that we have committed to use a given surrogate loss, and we restrict our attention to just those scenarios in which this heuristic actually does work: specifically, where the minimizer of the surrogate risk also minimizes the error rate, and is contained in our function class. We are then interested in how best to make use of the surrogate loss toward the goal of producing a classifier with relatively small error rate.
In passive learning, the most common approach to using a surrogate loss is to minimize the empirical surrogate risk on the labeled data. One can then derive guarantees on the error rate of this strategy by bounding the surrogate risk via concentration inequalities, and then converting these guarantees on the surrogate risk into guarantees on the error rate, a technique pioneered by Bartlett, Jordan, and McAuliffe [6] and Zhang [51]. Interestingly, we find that this direct approach is not appropriate in the context of active learning: that is, optimizing the surrogate risk to a sufficient extent to guarantee small error rate generally cannot yield large improvements over passive learning. While at first this finding might seem quite negative, it leaves open the possibility of methods making use of the surrogate loss in alternative ways, which still guarantee low error rate and computational efficiency, but for which these guarantees arise via a less direct route. Indeed, since we are interested in the surrogate loss only insofar as it helps us to optimize the error rate with computational efficiency, we may even consider methods that provide no guarantees on the achieved surrogate risk whatsoever (even in the limit).
In the present work, we propose such an alternative approach to the use of surrogate losses in active learning. The insight leading to this approach is that, if we are truly only interested in achieving low 0-1 loss, then once we have identified the sign of the optimal function at a given point, we need not optimize the value of the function at that location any further, and can therefore focus the label requests elsewhere. Based on this insight, we construct an active learning strategy that optimizes the empirical surrogate risk over increasingly focused subsets of the instance space, and derive bounds on the number of label requests the method requires to achieve a given error rate. In many cases, these bounds reflect strong improvements over the analogous results for passive learning by minimizing the given surrogate loss. As a byproduct of this analysis, we find this insight has implications for the use of certain surrogate losses in passive learning as well, though to a lesser extent.
Most of the mathematical tools used in this analysis are inspired by techniques for the study of active learning developed over the past decade [4,23,24,36], in conjunction with the results of Bartlett, Jordan, and McAuliffe [6] bounding the excess error rate in terms of the excess surrogate risk, and the works of Koltchinskii [34] and Bartlett, Bousquet, and Mendelson [8] on local Rademacher complexity bounds.

Related Work
There are many previous works on the topic of surrogate losses in the context of passive learning. Perhaps the most relevant to our results below are the work of Bartlett, Jordan, and McAuliffe [6] and the related work of Zhang [51]. These develop a general theory for converting results on excess risk under the surrogate loss into results on excess risk under the 0-1 loss. Below, we describe the conclusions of that work in detail, and we build on many of the basic definitions and insights pioneered in it.
Another related line of research, explored by Audibert and Tsybakov [3], studies "plug-in rules," which make use of regression estimates obtained by optimizing a surrogate loss, and are then rounded to {−1, +1} values to obtain classifiers. They prove minimax optimality results under smoothness assumptions on the actual regression function. Under similar conditions, Minsker [41] studies an analogous active learning method, which again makes use of a surrogate loss, and obtains improvements in label complexity compared to the passive learning method of Audibert and Tsybakov [3]. Minsker's active learning work has also recently been strengthened and extended in [27,38]. Remarkably, as discussed by Audibert and Tsybakov [3], the rates of convergence obtained in such works are often better than the known results for methods that directly optimize the 0-1 loss, under analogous complexity assumptions on the Bayes optimal classifier (rather than the regression function). As a result, these works raise interesting questions about whether the general analysis of methods that optimize the 0-1 loss remain tight under complexity assumptions on the regression function, and potentially also about the design of optimal methods for classification when assumptions are phrased in terms of the regression function.
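The plug-in idea can be sketched in a few lines. The instance space and the conditional probabilities below are invented for illustration: one estimates the regression function and then rounds the estimate to {−1, +1} by thresholding at 1/2.

```python
# Toy sketch of a plug-in rule: classify by the sign of 2*eta_hat(x) - 1,
# where eta_hat is an estimate of P(Y = +1 | X = x). All values are invented.

eta     = {0: 0.1, 1: 0.45, 2: 0.9}   # true P(Y = +1 | X = x) (assumed)
eta_hat = {0: 0.2, 1: 0.40, 2: 0.8}   # hypothetical regression estimates

plug_in = {x: 1 if 2 * e - 1 > 0 else -1 for x, e in eta_hat.items()}
bayes   = {x: 1 if 2 * e - 1 > 0 else -1 for x, e in eta.items()}

# the estimates are off by up to 0.1, but never cross 1/2, so the plug-in
# classifier agrees with the Bayes classifier at every point
agreement = all(plug_in[x] == bayes[x] for x in eta)
```

This illustrates why plug-in analyses hinge on how well the regression estimate behaves near the level 1/2: estimation error only matters where it can flip the sign of 2η(x) − 1.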
In the present work, we focus our attention on scenarios where the main purpose of using the surrogate loss is to ease the computational problems associated with minimizing an empirical risk, so that our statistical results might typically be strongest when the surrogate loss is the 0-1 loss itself, even if in some cases stronger results might in principle be achievable from assumptions involving the surrogate loss [as in 3,41]. As such, in the specific scenarios studied by Minsker [41], our results are generally not optimal; rather, the main strength of our analysis lies in its generality. In this sense, our results are more closely related to those of Bartlett, Jordan, and McAuliffe [6] and Zhang [51] than to those of Audibert and Tsybakov [3] and Minsker [41]. That said, we note that several important elements of the design and analysis of the active learning method below are already hinted at to some extent in the work of Minsker [41], albeit in a form that also relies heavily on the assumptions and function class specific to that work; the present work takes the general perspective, developing theory and methods applicable to any function class and surrogate loss function.
Our approach to the design of active learning methods below follows the well-studied strategy of disagreement-based active learning, an approach pioneered by Balcan, Beygelzimer, and Langford [4], and further developed by several later works [e.g., 14,24,25,36]. The basic strategy maintains a set V of plausible candidates for the optimal classifier, and requests the labels of samples disagreed-upon by classifiers in V; it periodically updates the set V by eliminating classifiers making an excessive number of mistakes on the requested labels. The analysis of the number of label requests sufficient for this technique to achieve a given error rate in the general case was explored by Hanneke [22,24], Dasgupta, Hsu, and Monteleoni [14], Koltchinskii [36], and others, and the results are typically expressed in terms of a quantity known as the disagreement coefficient. In the present work, we modify the disagreement-based active learning strategy by updating the set V, not based on the number of mistakes, but rather based on the empirical surrogate risk on the queried samples. We derive bounds on the number of label requests this method requires to achieve a given excess error rate, in terms of properties of the surrogate loss. In particular, when the surrogate loss is chosen to be the 0-1 loss itself, this method behaves nearly-identically to previously-studied methods [25,36], and in this special case, our results match those established in the literature (with some small refinements in the logarithmic factors).
There are several interesting works on active learning methods that optimize a general loss function. Beygelzimer, Dasgupta, and Langford [9] and Koltchinskii [36] have both proposed such methods, and analyzed the number of label requests the methods make before achieving a given excess risk for that loss function. The former method is based on importance weighted sampling, while the latter makes clear an interesting connection to local Rademacher complexities. One natural idea for approaching the problem of active learning with a surrogate loss is to run one of these methods with the surrogate loss. The results of Bartlett, Jordan, and McAuliffe [6] allow us to determine a sufficiently small value γ such that any function with excess surrogate risk at most γ has excess error rate at most ε. Thus, by evaluating the established bounds on the number of label requests sufficient for these active learning methods to achieve excess surrogate risk γ, we immediately have a result on the number of label requests sufficient for them to achieve excess error rate ε. This is a common strategy for constructing and analyzing passive learning methods based on a surrogate loss. However, as we discuss below, this strategy does not generally lead to the best results for active learning, and often will not be much better than results available for related passive learning methods. Instead, the method we propose does not aim to optimize the surrogate risk overall, but rather optimizes it on a sequence of increasingly-focused subregions of the instance space, and thereby provides a smaller bound on the number of label requests sufficient to guarantee excess error rate ε.

Definitions
Let (X, B_X) be a measurable space, where X is called the instance space. Let Y = {−1, +1}, and equip the space X × Y with its product σ-algebra: B = B_X ⊗ 2^Y. Let R̄ = R ∪ {−∞, ∞}, let F* denote the set of all measurable functions g : X → R̄, and let F ⊆ F*, where F is called the function class. Throughout, we fix a distribution P_XY over X × Y, and we denote by P the marginal distribution of P_XY over X. In the analysis below, we make the usual simplifying assumption that the events and functions in the definitions and proofs are indeed measurable. In most cases, this holds under simple conditions on F and P_XY [see e.g., 48]; when this is not the case, one may turn to outer probabilities. However, we will not discuss these technical issues further.
For any h ∈ F*, and any distribution P over X × Y, denote the error rate by er(h; P) = P((x, y) : sign(h(x)) ≠ y); when P = P_XY, we abbreviate this as er(h) = er(h; P_XY). Also, let η(X; P) be a version of P(Y = 1|X), for (X, Y) ∼ P; when P = P_XY, abbreviate this as η(X) = η(X; P_XY). In particular, note that er(h; P) is minimized at any h with sign(h(·)) = sign(η(·; P) − 1/2). We will use standard big-O notation to express asymptotic dependences. Our interest here is learning from data, so let Z = {(X_1, Y_1), (X_2, Y_2), . . .} denote a sequence of independent P_XY-distributed random variables, referred to as the labeled data sequence, while {X_1, X_2, . . .} is referred to as the unlabeled data sequence. For m ∈ N, we also denote Z_m = {(X_1, Y_1), . . . , (X_m, Y_m)}. Throughout, we will let δ ∈ (0, 1/4) denote an arbitrary confidence parameter, which will be referenced in the methods and theorem statements.
The active learning protocol is defined as follows. An active learning algorithm is initially permitted access to the sequence X_1, X_2, . . . of unlabeled data. It may then select an index i_1 ∈ N and request to observe Y_{i_1}; after observing Y_{i_1}, it may select another index i_2 ∈ N, request to observe Y_{i_2}, and so on. After a number of such label requests not exceeding a given budget n, the algorithm halts and returns a function ĥ ∈ F*. Formally, this protocol specifies a type of decision rule mapping the random sequence Z to a function ĥ, where ĥ is conditionally independent of Z given X_1, X_2, . . . and (i_1, Y_{i_1}), (i_2, Y_{i_2}), . . . , (i_n, Y_{i_n}), where each i_k is conditionally independent of Z and i_{k+1}, . . . , i_n given X_1, X_2, . . . and (i_1, Y_{i_1}), . . . , (i_{k−1}, Y_{i_{k−1}}).
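The protocol can be illustrated with a toy noiseless threshold problem (a standard textbook example, not the paper's algorithm): labels are sign(x − t), and binary search over the sorted unlabeled pool recovers the decision boundary with O(log m) label requests, the kind of savings over passive labeling that active learning targets.

```python
# Toy illustration of the active learning protocol: the learner adaptively
# chooses which labels to request, up to a budget. Here, binary search finds
# the leftmost +1 point among a sorted pool of unlabeled instances.

def active_threshold(xs, request_label, budget):
    """Binary search for the leftmost +1 point among sorted xs."""
    lo, hi = 0, len(xs)              # invariant: leftmost +1 index lies in [lo, hi]
    queries = 0
    while lo < hi and queries < budget:
        mid = (lo + hi) // 2
        queries += 1
        if request_label(xs[mid]) == 1:
            hi = mid                 # +1 label: boundary is at mid or to its left
        else:
            lo = mid + 1             # -1 label: boundary is strictly right of mid
    boundary = xs[lo] if lo < len(xs) else float("inf")
    return (lambda x: 1 if x >= boundary else -1), queries

xs = [i / 100 for i in range(100)]             # pool of 100 unlabeled points
oracle = lambda x: 1 if x >= 0.63 else -1      # label oracle (toy, noiseless)
h, used = active_threshold(xs, oracle, budget=20)
errors = sum(h(x) != oracle(x) for x in xs)    # 0-1 mistakes over the pool
```

On this pool of 100 points, the learner identifies the boundary exactly with at most ⌈log₂ 100⌉ = 7 label requests, versus the 100 labels a passive method would consume to see the same information.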

Surrogate Loss Functions for Classification
Throughout, we let ℓ : R̄ → [0, ∞] denote an arbitrary surrogate loss function. For simplicity, suppose |z| < ∞ ⇒ ℓ(z) < ∞. Define ℓ̄ = 1 ∨ sup_{(x,y)∈X×Y} sup_{h∈F} ℓ(y h(x)). We will generally suppose ℓ̄ < ∞. In practice, this is more often a constraint on F and X than on ℓ: that is, we could have ℓ unbounded, but due to some normalization of the functions h ∈ F, ℓ is bounded on the corresponding set of values. For any g ∈ F* and distribution P over X × Y, let R_ℓ(g; P) = E[ℓ(g(X)Y)], where (X, Y) ∼ P. This is the ℓ-risk of g under P. When P = P_XY, abbreviate this as R_ℓ(g) = R_ℓ(g; P_XY).
We will be interested in loss functions ℓ whose point-wise minimizer necessarily also optimizes the 0-1 loss. This property was nicely characterized by Bartlett, Jordan, and McAuliffe [6] as follows. For η_0 ∈ [0, 1], define the optimal conditional ℓ-risk ℓ⋆(η_0) = inf_{z∈R̄} (η_0 ℓ(z) + (1 − η_0) ℓ(−z)), along with the constrained value ℓ⋆₋(η_0) = inf {η_0 ℓ(z) + (1 − η_0) ℓ(−z) : z ∈ R̄, z(2η_0 − 1) ≤ 0}; the loss ℓ is said to be classification-calibrated if, for every η_0 ≠ 1/2, ℓ⋆₋(η_0) > ℓ⋆(η_0). In our context, for X ∼ P, ℓ⋆(η(X)) represents the minimum value of the conditional ℓ-risk at X, so that E[ℓ⋆(η(X))] = inf_{h∈F*} R_ℓ(h), while ℓ⋆₋(η(X)) represents the minimum conditional ℓ-risk at X, subject to having a sub-optimal conditional error rate at X: i.e., sign(h(X)) ≠ sign(η(X) − 1/2). Thus, being classification-calibrated implies the minimizer of the conditional ℓ-risk at X necessarily has the same sign as the minimizer of the conditional error rate at X. Since we are only interested here in using ℓ as a reasonable surrogate for the 0-1 loss, for the remainder of this article we suppose ℓ is classification-calibrated.
Though not strictly necessary for our results below, it will be convenient for us to suppose that, for all η_0 ∈ [0, 1], this infimum value ℓ⋆(η_0) is actually obtained by some value z⋆(η_0) ∈ R̄: that is, ℓ⋆(η_0) = η_0 ℓ(z⋆(η_0)) + (1 − η_0) ℓ(−z⋆(η_0)). For instance, this is the case for any nonincreasing right-continuous ℓ, or continuous and convex ℓ, which include most of the cases we are interested in using as surrogate losses anyway. The proofs can be modified in a natural way to handle the general case, simply substituting any z with conditional risk sufficiently close to the infimum value. For any distribution P, denote f⋆_P(x) = z⋆(η(x; P)) for all x ∈ X. In particular, note that f⋆_P obtains R_ℓ(f⋆_P; P) = inf_{g∈F*} R_ℓ(g; P). Furthermore, since ℓ is classification-calibrated, we have sign(f⋆_P(x)) = sign(η(x; P) − 1/2) for all x ∈ X with η(x; P) ≠ 1/2, and hence er(f⋆_P; P) = inf_{h∈F*} er(h; P) as well. When P = P_XY, we abbreviate f⋆ = f⋆_{P_XY}. All of our main results below rely on the assumption that f⋆ ∈ F. When combined with the fact that ℓ is classification-calibrated, this essentially stands as a formal representation of the informal assumption that the surrogate loss ℓ was chosen wisely: that is, that functions in F with relatively low surrogate risk necessarily have relatively low error rate. However, it should be noted that this is often a very strong assumption, significantly restricting the allowed distributions P_XY. For instance, for many losses ℓ in practical use (e.g., the quadratic loss), when F is a parametric family, the assumption that f⋆ ∈ F essentially restricts the allowed functions η(·) to also form a parametric family. This fact underscores the need for great care in selecting a surrogate loss when approaching a given learning problem in practice. In principle, one can relax this assumption slightly, at the expense of significantly more-complicated theorem statements, and we include some superficial remarks on this in Appendix F.
However, it seems any truly-substantial relaxation would require a significantly different approach.
The following definition will enable us to transform guarantees on the excess surrogate risk into guarantees on the excess error rate.

Definition 1. For any distribution P over X × Y, and any ε ∈ [0, 1], define Γ_ℓ(ε; P) = sup{γ ≥ 0 : ∀h ∈ F*, R_ℓ(h; P) − inf_{g∈F*} R_ℓ(g; P) ≤ γ ⇒ er(h; P) − inf_{g∈F*} er(g; P) ≤ ε}; when P = P_XY, abbreviate Γ_ℓ(ε) = Γ_ℓ(ε; P_XY).

By definition, Γ_ℓ has the property that any h ∈ F* with excess ℓ-risk at most Γ_ℓ(ε) also has excess error rate at most ε. In fact, Γ_ℓ is defined to be maximal with this property, in that any function Γ with this same property satisfies Γ(ε) ≤ Γ_ℓ(ε) for all ε ∈ [0, 1]. For this reason, we will be interested in calculating lower bounds on Γ_ℓ. Bartlett, Jordan, and McAuliffe [6] studied various ways to obtain concrete, calculable lower bounds of this type. Specifically, for x ∈ [0, 1], define ψ̃_ℓ(x) = ℓ⋆₋((1 + x)/2) − ℓ⋆((1 + x)/2), and let ψ_ℓ be the largest convex lower bound of ψ̃_ℓ on [0, 1], which is well-defined in this context [6]; for convenience, also define ψ_ℓ(x) for x ∈ (1, ∞) arbitrarily, subject to maintaining convexity of ψ_ℓ. Bartlett, Jordan, and McAuliffe [6] show ψ_ℓ is continuous and nondecreasing on (0, 1), and in fact that x ↦ ψ_ℓ(x)/x is nondecreasing on (0, ∞). They also show every h ∈ F* satisfies ψ_ℓ(er(h) − inf_{g∈F*} er(g)) ≤ R_ℓ(h) − inf_{g∈F*} R_ℓ(g), so that ψ_ℓ ≤ Γ_ℓ, and they find this inequality can be tight for a particular choice of P_XY. They further study more subtle relationships between excess ℓ-risk and excess error rate holding for any classification-calibrated ℓ. The implication of this in our context is the following. Fix any nondecreasing function Ψ_ℓ satisfying

∀ε ≥ 0, Ψ_ℓ(ε) ≤ Γ_ℓ(ε). (2)

In fact, though we do not present the details here, with only minor modifications to the proofs below, when f⋆ ∈ F, all of our results involving Γ_ℓ(ε) also hold while replacing Γ_ℓ(ε) with any nondecreasing Ψ′_ℓ s.t. ∀ε ≥ 0, Ψ′_ℓ(ε) ≤ radius(F(ε; 01)) ψ_ℓ(ε / (2 radius(F(ε; 01)))), which can sometimes lead to tighter results; this is made possible by the more subtle relationships just mentioned, following the argument in the proof of Theorem 3 of [6]. Some of our stronger results below will be stated for a restricted family of losses, originally explored by Bartlett, Jordan, and McAuliffe [6]: namely, smooth losses with convexity quantified by a polynomial, as described in the following condition.
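The quantities ℓ⋆, ℓ⋆₋, and the gap between them can be checked numerically for a concrete loss. The following toy computation (quadratic loss, grid search over prediction values; the grid and test point are invented for illustration) exhibits the gap x² at η_0 = (1 + x)/2, matching the formula ψ_ℓ(x) = x² cited in Example 1 below.

```python
# Toy numerical check for the quadratic loss ell(z) = (1 - z)^2: grid search
# recovers the unconstrained minimum conditional risk and the minimum subject
# to a wrong-signed prediction, and their difference at eta_0 = (1 + x)/2
# should come out as (approximately) x^2.

def cond_risk(z, eta0):                  # eta0*ell(z) + (1 - eta0)*ell(-z)
    return eta0 * (1 - z) ** 2 + (1 - eta0) * (1 + z) ** 2

zs = [i / 1000 for i in range(-2000, 2001)]      # grid over [-2, 2]

def ell_star(eta0):                      # unconstrained minimum
    return min(cond_risk(z, eta0) for z in zs)

def ell_star_minus(eta0):                # minimum over z with z*(2*eta0 - 1) <= 0
    return min(cond_risk(z, eta0) for z in zs if z * (2 * eta0 - 1) <= 0)

x = 0.6
eta0 = (1 + x) / 2                        # = 0.8
gap = ell_star_minus(eta0) - ell_star(eta0)   # should be close to x**2 = 0.36
```

Analytically, the conditional quadratic risk is (z − (2η_0 − 1))² + 4η_0(1 − η_0), so ℓ⋆(η_0) = 4η_0(1 − η_0), the constrained minimum sits at z = 0 with value 1, and the gap is (2η_0 − 1)², which the grid search reproduces.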
Condition 2. F is convex, with sup_{h∈F} sup_{x∈X} |h(x)| ≤ B for some B ∈ (0, ∞), and, for some pseudometric d_ℓ on [−B, B] and some constants L, C_ℓ ∈ (0, ∞) and r_ℓ ∈ (0, ∞], every x, y ∈ [−B, B] satisfy |ℓ(x) − ℓ(y)| ≤ L d_ℓ(x, y) and (ℓ(x) + ℓ(y))/2 − ℓ((x + y)/2) ≥ C_ℓ d_ℓ(x, y)^{r_ℓ}.

In particular, note that if F is convex, and the functions in F are uniformly bounded, and ℓ is convex and continuous, then Condition 2 is always satisfied (though possibly with r_ℓ = ∞) by taking d_ℓ(x, y) = |x − y|/(4B).

A Few Examples of Loss Functions
Here we briefly mention a few loss functions ℓ in common practical use, all of which are classification-calibrated. These examples are taken directly from the work of Bartlett, Jordan, and McAuliffe [6], which additionally discusses many other interesting examples of classification-calibrated loss functions and their corresponding ψ ℓ functions.

Example 1
The quadratic loss (or squared loss), specified as ℓ(x) = (1 − x)², is often used in so-called plug-in classifiers [3], which approach the problem of learning a classifier by estimating the regression function E[Y|X = x] = 2η(x) − 1, and then taking the sign of this estimator to get a binary classifier. The quadratic loss has the convenient property that for any distribution P over X × Y, f⋆_P(·) = 2η(·; P) − 1, so that it is straightforward to describe the set of distributions P satisfying the assumption f⋆_P ∈ F. In classification, this loss is sometimes modified as ℓ(x) = max{1 − x, 0}², called the truncated quadratic loss. Bartlett, Jordan, and McAuliffe [6] show that for the quadratic loss (with or without truncation), ψ_ℓ(x) = x², and Condition 2 is satisfied with L = 2(B + 1), C_ℓ = 1/4, r_ℓ = 2.

Example 2
The exponential loss is specified as ℓ(x) = e^{−x}. This loss function appears in many contexts in machine learning; for instance, the popular AdaBoost method can be viewed as an algorithm that greedily optimizes the exponential loss [18]. Bartlett, Jordan, and McAuliffe [6] show that under the exponential loss, f⋆(x) = (1/2) ln(η(x)/(1 − η(x))) and ψ_ℓ(x) = 1 − √(1 − x²), which is tightly approximated by x²/2 for small x. They also show this loss satisfies the conditions on ℓ in Condition 2 with d_ℓ(x, y) = |x − y|, L = e^B, C_ℓ = e^{−B}/8, and r_ℓ = 2. Note, however, that for noise-free distributions, we would need f⋆(x) = ±∞, which means most common function classes F could not be expected to contain f⋆ for this loss in the noise-free case.
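The stated form of f⋆ can be sanity-checked numerically (a toy grid search, with η = 0.8 chosen arbitrarily): the minimizer of the conditional exponential risk η e^{−z} + (1 − η) e^{z} should land at (1/2) ln(η/(1 − η)).

```python
import math

# Toy check: the pointwise minimizer of the conditional exponential risk
# eta*exp(-z) + (1 - eta)*exp(z) should be z = (1/2)*ln(eta/(1 - eta)).
# Setting the derivative to zero gives exp(2z) = eta/(1 - eta).

eta = 0.8
zs = [i / 10000 for i in range(-30000, 30001)]   # grid over [-3, 3]
z_hat = min(zs, key=lambda z: eta * math.exp(-z) + (1 - eta) * math.exp(z))
z_star = 0.5 * math.log(eta / (1 - eta))          # about 0.6931
```

Since the conditional risk is smooth and strictly convex, the grid minimizer must fall within one grid step of the analytic minimizer.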

Example 3
The hinge loss, specified as ℓ(x) = max{1 − x, 0}, is another common surrogate loss in machine learning practice today. For instance, it is used in the objective of the Support Vector Machine (along with a regularization term) [13]. Bartlett, Jordan, and McAuliffe [6] show that for the hinge loss, f⋆(x) = sign(η(x) − 1/2) and ψ_ℓ(x) = |x|. The hinge loss is Lipschitz continuous, with Lipschitz constant 1. However, for the remaining conditions on ℓ in Condition 2, any x, y ≤ 1 have (ℓ(x) + ℓ(y))/2 − ℓ((x + y)/2) = 0, since ℓ is linear on (−∞, 1]; thus the modulus-of-convexity requirement of Condition 2 can only be satisfied trivially there (i.e., with r_ℓ = ∞).
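Both properties can be verified in a toy computation (grid search, with η = 0.8 and the midpoint test values chosen arbitrarily): the conditional hinge risk is minimized at z = sign(η − 1/2), and the midpoint convexity gap vanishes for arguments at most 1.

```python
# Toy check for the hinge loss: the conditional risk
# eta*max(0, 1 - z) + (1 - eta)*max(0, 1 + z) is minimized at
# z = sign(eta - 1/2), and since the loss is linear on (-inf, 1],
# the midpoint convexity gap vanishes for x, y <= 1.

def hinge(z):
    return max(0.0, 1.0 - z)

def cond_risk(z, eta):
    return eta * hinge(z) + (1 - eta) * hinge(-z)

eta = 0.8
zs = [i / 100 for i in range(-300, 301)]          # grid over [-3, 3]
z_hat = min(zs, key=lambda z: cond_risk(z, eta))  # unique minimizer is z = 1

x, y = -0.5, 0.75                                 # both <= 1
mid_gap = (hinge(x) + hinge(y)) / 2 - hinge((x + y) / 2)
```

The flatness of the gap is exactly what blocks the strong-convexity-type parameters (finite r_ℓ) available for the quadratic and exponential losses.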

Methods Based on Optimizing the Surrogate Risk
Perhaps the simplest way to use a surrogate loss is to attempt to minimize R_ℓ(h) over h ∈ F. In this section, we introduce a classic passive learning method based on this strategy, and discuss the potential drawbacks of this approach for active learning.

Passive Learning: Empirical Risk Minimization
In the context of passive learning, the method of empirical ℓ-risk minimization is one of the most-studied methods for optimizing R_ℓ(h) over h ∈ F. To define this method, we first introduce some notation. For any m ∈ N, g : X → R̄, and S = {(x_1, y_1), . . . , (x_m, y_m)} ∈ (X × Y)^m, we overload the R_ℓ(g; ·) notation, defining the empirical ℓ-risk as R_ℓ(g; S) = m^{−1} Σ_{i=1}^m ℓ(g(x_i) y_i): that is, R_ℓ(g; S) is the ℓ-risk of g under the uniform distribution on S. At times it will be convenient to keep track of the indices for a subsequence of Z, and for this reason we further overload the notation, so that for any Q = {(i_1, y_1), . . . , (i_m, y_m)} ∈ (N × Y)^m, we define S[Q] = {(X_{i_1}, y_1), . . . , (X_{i_m}, y_m)} and R_ℓ(g; Q) = R_ℓ(g; S[Q]). For completeness, we also generally define R_ℓ(g; ∅) = 0.
The method of empirical ℓ-risk minimization, here denoted by ERM_ℓ(H, Z_m), is characterized by the property that it returns ĥ = argmin_{h∈H} R_ℓ(h; Z_m). This is a well-studied and classical passive learning method, presently in popular use in applications, and as such it will serve as our baseline passive learning method for comparison. We review several known performance guarantees for ERM_ℓ below.
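A minimal self-contained sketch of ERM_ℓ over a small finite class may be helpful. The class here (scaled threshold scores f_t(x) = 10(x − t)) and the data are invented for illustration, and the loss is the hinge loss.

```python
# Sketch of empirical ell-risk minimization over a finite class of scaled
# threshold scores f_t(x) = 10*(x - t), using the hinge loss. The scaling
# keeps hinge penalties local to the decision boundary (toy construction).

def hinge(z):
    return max(0.0, 1.0 - z)

def emp_risk(f, sample):                  # empirical ell-risk R_ell(f; S)
    return sum(hinge(f(x) * y) for x, y in sample) / len(sample)

def make_f(t):
    return lambda x: 10.0 * (x - t)

thresholds = [t / 10 for t in range(11)]
sample = [(x / 20, 1 if x / 20 >= 0.4 else -1) for x in range(20)]  # noiseless

t_hat = min(thresholds, key=lambda t: emp_risk(make_f(t), sample))
h_hat = lambda x: 1 if make_f(t_hat)(x) >= 0 else -1
train_error = sum(h_hat(x) != y for x, y in sample) / len(sample)
```

Here the empirical surrogate minimizer coincides with the true boundary t = 0.4, so the induced classifier makes no mistakes on the sample, illustrating the favorable case where optimizing R_ℓ also optimizes the 0-1 loss.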

Negative Results for Active Learning
As mentioned, there are several active learning methods designed to optimize a general loss function [9,36]. However, it turns out that for many interesting loss functions, the number of labels required for active learning to achieve a given excess surrogate risk value is not significantly smaller than that sufficient for passive learning by ERM ℓ .
Specifically, consider a problem with X = {x_0, x_1}, a fixed B̄ ∈ (0, ∞), and F as the set of all functions f with (f(x_0), f(x_1)) ∈ [−B̄, B̄] × (0, B̄]. Let z ∈ (0, 1/2) be a constant, let η(x_1) = 1/2 + z, and suppose that ℓ is a classification-calibrated loss with ℓ̄ < ∞ such that for any η(x_0) ∈ [4/6, 5/6], we have f⋆ ∈ F (the latter condition could equivalently be stated as a constraint on B̄). Given a small value ε ∈ (0, z), consider the task of producing a function ĥ ∈ F whose excess surrogate risk is small enough to guarantee excess error rate at most ε. Existing results of Hanneke and Yang [28] (with a slight modification to rescale for η(x_0) ∈ [4/6, 5/6]) imply that, for many classification-calibrated losses ℓ, the minimax optimal number of labels sufficient for an active learning algorithm to achieve this latter guarantee is Θ(1/ε). Hanneke and Yang [28] specifically show this for losses ℓ that are strictly positive, decreasing, strictly convex, and twice differentiable with continuous second derivative; however, that result can easily be extended to a wide variety of other classification-calibrated losses, such as the quadratic loss, which satisfy these conditions in a neighborhood of 0. It is also known [6] (see also below) that for many such losses (specifically, those satisfying Condition 2 with r_ℓ = 2), Θ(1/ε) random labeled samples are sufficient for ERM_ℓ to achieve this same guarantee, so that error bounds based purely on the surrogate risk of the function produced by an active learning method in this scenario can be at most a constant factor smaller than those provable for passive learning methods.
Below, we provide an active learning algorithm and analysis of its performance which, in the scenario above (with r ℓ = 2), guarantees expected excess error rate less than ε, using a number of label requests O(log(1/ε) log log(1/ε)). The implication is that, to identify the improvements achievable by active learning with a surrogate loss, it is not sufficient to merely analyze the surrogate risk of the function produced by a given active learning algorithm. Indeed, since we are not particularly interested in the surrogate risk itself, we may even consider active learning algorithms that do not actually optimize R ℓ (h) over h ∈ F (even in the limit).

Alternative Use of the Surrogate Loss
Given that we are interested in ℓ only insofar as it helps us to optimize the error rate with computational efficiency, we might ask whether there is a method that makes more effective use of ℓ for optimizing the error rate, while maintaining the computational advantages. To explore this question, we propose the following method, which generalizes the methods of Koltchinskii [36] and Hanneke [25]. Results similar to those proven below should also hold for analogous generalizations of the related methods of [4,9,14].

Algorithm 1: Input: surrogate loss ℓ, unlabeled sample budget u, labeled sample budget n. Output: classifier ĥ.

The intuition behind this algorithm is that, since we are only interested in achieving low error rate, once we have identified sign(f⋆(x)) for a given x ∈ X, there is no need to further optimize the value E[ℓ(ĥ(X)Y)|X = x]. Thus, as long as we maintain f⋆ ∈ V, the data points X_m ∉ DIS(V) are typically less informative than those X_m ∈ DIS(V). We therefore focus the label requests on those X_m ∈ DIS(V), since there remains some uncertainty about sign(f⋆(X_m)) for these points. The algorithm updates V periodically (Step 6), removing those functions h whose excess empirical risks (under the current sampling distribution) are relatively large; by setting this threshold T̂_ℓ appropriately, we can guarantee the excess empirical risk of f⋆ is smaller than T̂_ℓ. Thus, the algorithm maintains f⋆ ∈ V as an invariant, while shrinking the sampling region DIS(V). The actual definition of T̂_ℓ sufficient for the results stated below will be specified in Section 6.3 below, based on data-dependent concentration inequalities.
In practice, the set V can be maintained implicitly, simply by keeping track of the constraints (Step 6) that define it. Then the condition in Step 3 can be checked by solving two constraint satisfaction problems (one for each sign). Likewise, the value inf_{g∈V} R_ℓ(g; Q) in these constraints, as well as the final ĥ, can be found by solving constrained optimization problems. Thus, for convex loss functions and convex finite-dimensional classes of functions, these steps typically have computationally efficient realizations as convex optimization problems, as long as the T̂_ℓ values can also be obtained efficiently.
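The loop described above can be sketched in a few lines. This is a hedged illustration, not the paper's Algorithm 1: the class (scaled thresholds), the constant stand-in T_hat for the data-dependent threshold T̂_ℓ, and the every-20-points update schedule are all invented for the sake of a runnable example.

```python
# Hedged sketch of disagreement-based active learning driven by a surrogate
# loss: request labels only inside the disagreement region DIS(V), and
# periodically prune V by excess empirical hinge risk. The threshold T_hat
# and schedule below are simple stand-ins, not the paper's definitions.

def f(t, x):                              # score of the threshold-t function
    return 10.0 * (x - t)

def hinge(z):
    return max(0.0, 1.0 - z)

def emp_risk(t, Q):                       # empirical ell-risk on queried points
    return sum(hinge(f(t, x) * y) for x, y in Q) / len(Q)

def in_dis(V, x):                         # x in DIS(V): signs disagree within V
    return len({1 if f(t, x) >= 0 else -1 for t in V}) > 1

V = {t / 10 for t in range(11)}           # version space of candidate thresholds
pool = [((i * 37) % 100) / 100 for i in range(100)]   # unlabeled stream (toy)
label = lambda x: 1 if x >= 0.4 else -1   # noiseless oracle; f* is t = 0.4
T_hat = 0.5                               # stand-in excess-risk threshold

Q, queries = [], 0
for m, x in enumerate(pool, 1):
    if in_dis(V, x):                      # request labels only in DIS(V)
        Q.append((x, label(x)))
        queries += 1
    if m % 20 == 0 and Q:                 # periodic update (Step 6 analogue)
        best = min(emp_risk(t, Q) for t in V)
        V = {t for t in V if emp_risk(t, Q) - best <= T_hat}

t_hat = min(V, key=lambda t: emp_risk(t, Q))
pool_error = sum((1 if f(t_hat, x) >= 0 else -1) != label(x)
                 for x in pool) / len(pool)
```

Even in this crude form, the two invariants from the text are visible: the optimal threshold survives every pruning step (its excess empirical risk stays far below T_hat), and once V shrinks, the learner stops paying for labels outside the narrowed disagreement region.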
We include general results on the performance of Algorithm 1 in Section 6 below. For now, we briefly sketch the main ideas of the analysis, in rough outline. For any measurable U ⊆ X, and any h, g ∈ F*, define the spliced function h_{U,g}(x) = h(x)1[x ∈ U] + g(x)1[x ∉ U]. As mentioned, the idea in the analysis is to argue that Algorithm 1 maintains f⋆ ∈ V, while also removing from V any function with relatively large error rate, within a certain number of rounds. More explicitly, the threshold T̂_ℓ(V; Q, m) is defined to provide a concentration inequality for the excess empirical ℓ-risk of f⋆ under the current sampling distribution, which guarantees f⋆ is never removed from V in the update step. One can then show that, upon reaching m of a certain size u_j (quantified below), the value (2|Q|/m) T̂_ℓ(V; Q, m) will be small enough that, in combination with concentration of the R_ℓ(h_{DIS(V),g}; L_m) values, the update step removes from V every function whose excess error rate exceeds 2^{−j}. This provides a sufficient size of u to obtain excess error rate ε. Next, we note that the algorithm requests a label Y_m only if X_m ∈ DIS(V). Thus, the number of labels the algorithm requests among indices m with u_{j−1} < m ≤ u_j is at most the number with X_m ∈ DIS(F(E_ℓ(2^{2−j}); 01)), a number which can easily be upper bounded by a simple Chernoff bound. This provides a sufficient size of n for the algorithm to obtain excess error rate ε.
The number of label requests sufficient for Algorithm 1 to obtain excess error rate ε can often (though not always) be significantly smaller than the number of random labeled data points sufficient for ERM ℓ to achieve the same. This is typically the case when P(DIS(F (ε; 01))) → 0 as ε → 0. When this is the case, the number of labels requested by the algorithm is sublinear in the number of unlabeled samples it processes. Not surprisingly, the magnitude of the improvements of Algorithm 1 over ERM ℓ can be quantified in terms of the rate at which P(DIS(F (ε; 01))) vanishes as ε → 0. In the next section, we quantify this rate in terms of a complexity measure known as the disagreement coefficient.
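The rate at which P(DIS(F(ε; 01))) vanishes can be computed exactly in a toy case (thresholds on [0, 1] with a uniform marginal; the construction is illustrative): the disagreement region of the set of thresholds within distance r of the optimum is an interval of width about 2r, so P(DIS)/r stays bounded near 2, which is the favorable situation for the algorithm above.

```python
# Toy computation of the disagreement mass for threshold classifiers on
# [0, 1] under a uniform marginal: thresholds within r of t_star disagree
# exactly on an interval of width about 2r, so P(DIS)/r stays near 2.

t_star = 0.4

def dis_mass(r, grid=10000):
    # discretized set of thresholds within distance r of t_star
    ball = [i / grid for i in range(grid + 1) if abs(i / grid - t_star) <= r]
    # under the uniform marginal, P(DIS) is the width of the spanned interval
    return max(ball) - min(ball)

ratios = [dis_mass(r) / r for r in (0.2, 0.1, 0.05)]
```

A bounded ratio of this kind is what the disagreement coefficient formalizes in the next section; when the ratio grows as r shrinks, the savings of disagreement-based methods degrade accordingly.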

Main Results
We provide a general analysis of Algorithm 1 in Section 6.4 below. For now, we summarize a few of the most interesting implications of that analysis, under commonly-studied complexity conditions: namely, VC subgraph classes and entropy conditions. Detailed derivations for all of these results (from the abstract theorems) are included in Section 7 below. Appendix C further includes a brief discussion of VC major classes and VC hull classes. In the interest of making the results more concise and explicit, we express them in terms of well-known conditions relating distances to excess risks. We also express them in terms of a lower bound on Γ_ℓ(ε) of the type in (2), with convenient properties that allow for closed-form expression of the results. Throughout, we use the convenient notation Log(x) = max{ln(x), 1}, defined for all x ∈ (0, ∞).

Diameter Conditions
To begin, we first state some general characterizations relating distances to excess risks; these characterizations will make it easier to express our results more concretely below, and make for a more straightforward comparison between results for the above methods. The following condition, introduced by Mammen and Tsybakov [40] and Tsybakov [45], is a well-known noise condition, about which there is now an extensive literature [e.g., 6,24,25,34].
Condition 3 is equivalently expressed in terms of certain noise conditions [6,40,45]. Specifically, satisfying Condition 3 with some α < 1 is equivalent to the existence of some a′ ∈ [1, ∞) such that, for all t > 0, P(x : |η(x) − 1/2| ≤ t) ≤ a′ t^{α/(1−α)}, which is often referred to as a low noise condition. Additionally, satisfying Condition 3 with α = 1 is equivalent to having some a′ ∈ [1, ∞) such that P(x : |η(x) − 1/2| ≤ 1/a′) = 0, often referred to as a bounded noise condition.
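The low noise condition can be checked numerically on a toy example. The following sketch uses a hypothetical construction that is not from the paper: X ~ Uniform[0,1] with η(x) = (1 + x²)/2, for which P(|η(X) − 1/2| ≤ t) = (2t)^{1/2} exactly, so the noise-margin mass scales with exponent 1/2. The code estimates that exponent from samples by a log-log fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
# |eta(X) - 1/2| = X^2 / 2 for this toy regression function
margin = np.abs((1.0 + x ** 2) / 2.0 - 0.5)

def mass(t):
    # empirical estimate of P(|eta(X) - 1/2| <= t)
    return np.mean(margin <= t)

# Fit the exponent kappa in P(margin <= t) ~ t^kappa from a log-log slope.
ts = np.array([1e-3, 1e-2, 1e-1])
kappa = np.polyfit(np.log(ts), np.log([mass(t) for t in ts]), 1)[0]
```

For this construction the fitted exponent recovers 1/2, matching the closed-form mass (2t)^{1/2}.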
For simplicity, we formulate our results in terms of a and α from Condition 3. However, for the abstract results in this section, the results remain valid under the weaker condition that replaces F * by F , and adds the condition that f ⋆ ∈ F . In fact, the specific results in this section also remain valid using this weaker condition while additionally replacing (2) with the F -specific Ψ ′ ℓ requirement mentioned in Section 2.1, as remarked above.
An analogous condition can be defined for the surrogate loss function, as follows. Essentially-similar notions have been explored by Bartlett, Jordan, and McAuliffe [6] and Koltchinskii [34].
Note that these conditions are always satisfied for some values of a, b, α, β, since α = β = 0 trivially satisfies the conditions. However, in more benign scenarios, the conditions can be satisfied with values of α and β strictly greater than 0. Furthermore, for some loss functions ℓ, Condition 4 can even be satisfied universally, in the sense that it holds with a particular value of β > 0 for all distributions. In particular, Bartlett, Jordan, and McAuliffe [6] show that this is the case under Condition 2, as stated in the following lemma (see [6] for the proof).

The Disagreement Coefficient
In order to more concisely state our results, it will be convenient to bound P(DIS(H)) by a linear function of radius(H), for radius(H) in a given range. This type of relaxation has been used extensively in the active learning literature [5, 9, 14, 19, 22-25, 36, 44, 50], and the coefficient in the linear function is typically referred to as the disagreement coefficient. Specifically, the following definition is due to Hanneke [22,24]; related quantities have been explored by Alexander [1] and Giné and Koltchinskii [20].
The value of θ(ε) has been studied and bounded for various function classes F under various conditions on P. In many cases of interest, θ(ε) is known to be bounded by a finite constant [5,19,22,24,39], while in other cases, θ(ε) may have an interesting dependence on ε [5,44,50]. The reader is referred to the works of Hanneke [24,25] for detailed discussions on the disagreement coefficient.
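To make the definition concrete, here is a Monte Carlo sketch for a toy case not taken from the paper: threshold classifiers on [0,1] with P = Uniform[0,1] and f⋆ the threshold at 1/2. Here B(f⋆, r) contains the thresholds t with P between t and 1/2 at most r, so DIS(B(f⋆, r)) = (1/2 − r, 1/2 + r), P(DIS)/r = 2, and the disagreement coefficient is 2.

```python
import numpy as np

def p_dis(r, n=200_000, seed=0):
    x = np.random.default_rng(seed).uniform(0.0, 1.0, n)
    # x is in DIS(B(f*, r)) iff some threshold within P-distance r of 1/2
    # labels it differently from f*, i.e. iff |x - 1/2| < r.
    return np.mean(np.abs(x - 0.5) < r)

def theta_hat(eps):
    # sup over r >= eps of P(DIS(B(f*, r))) / r, estimated on a grid
    rs = np.linspace(eps, 0.4, 50)
    return max(p_dis(r) / r for r in rs)
```

The estimate stabilizes near 2 for all ε in this example, consistent with the constant-θ regime discussed above.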

VC Subgraph Classes
We begin with results for VC subgraph classes. For a collection A of sets, a set of points {z_1, . . . , z_k} is said to be shattered by A if, for every S ⊆ {z_1, . . . , z_k}, there exists A ∈ A with A ∩ {z_1, . . . , z_k} = S. The VC dimension vc(A) of A is then defined as the largest integer k for which there exist k points {z_1, . . . , z_k} shattered by A [49]; if no such largest k exists, we define vc(A) = ∞. For a set G of real-valued functions, denote by vc(G) the VC dimension of the collection {{(x, y) : y < g(x)} : g ∈ G} of subgraphs of functions in G (called the pseudo-dimension [29,43]); to simplify the results below, we adopt the convention that when the VC dimension of this collection is 0, we let vc(G) = 1. G is said to be a VC subgraph class if vc(G) < ∞ [47].
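The shattering definition can be computed by brute force for small finite collections. The following sketch uses a toy instance not from the paper: A is the collection of all integer intervals restricted to a small grid, whose VC dimension is 2.

```python
from itertools import combinations

def shatters(A, pts):
    # A shatters pts if every subset of pts arises as pts intersect S, S in A
    patterns = {tuple(p in S for p in pts) for S in A}
    return len(patterns) == 2 ** len(pts)

def vc_dimension(A, domain):
    # largest k such that some k-point subset of the domain is shattered
    k = 0
    while any(shatters(A, c) for c in combinations(domain, k + 1)):
        k += 1
    return k

grid = range(6)
# all intervals [a, b) on the grid, including the empty interval (b = a)
intervals = [frozenset(range(a, b)) for a in grid for b in range(a, 7)]
```

Intervals shatter any two points but never three (the pattern "in, out, in" is impossible), so the computation returns 2.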
Because we are interested in results concerning values of R ℓ (h) − R ℓ (f ⋆ ), for functions h in certain subsets H ⊆ [F ], we will formulate results below in terms of vc(G H ). In some special cases, such as monotonic ℓ, these results can be rephrased directly in terms of vc(H) if desired [e.g., 17,29].
We can now state the following theorem, providing a sample size sufficient for ERM ℓ to obtain excess error rate ε. This result is implicit in the work of Giné and Koltchinskii [20].
Next, we turn to the analysis of Algorithm 1 under these same conditions. Suppose P_XY satisfies Conditions 3 and 4, and for γ_0 ≥ 0, define the quantities below. We have the following theorem, bounding the number of samples (labeled and unlabeled) sufficient for Algorithm 1 to obtain excess error rate ε, under the same conditions as Theorem 7. As mentioned above, the specific definition of T̂_ℓ sufficient for this theorem will be formally specified in Section 6.3. Also, the specification of ŝ will be given in the proof, in Appendix B.
Theorem 8. For a universal constant c ∈ [1, ∞), if P_XY satisfies Condition 3 and Condition 4, ℓ is classification-calibrated, f⋆ ∈ F, and Ψ_ℓ is as in (3), then for any ε ∈ (0, 1), for any u and n satisfying (6) and (7) respectively, with arguments ℓ, u, and n, and an appropriate ŝ function, Algorithm 1 uses at most u unlabeled samples and makes at most n label requests, and with probability at least 1 − δ returns a function ĥ with er(ĥ) − er(f⋆) ≤ ε.

To be clear, in specifying B_1 and C_1, we adopt the convention that 1/0 = ∞, so that B_1 and C_1 are well-defined even when α = 1 or β = 1. Comparing Theorem 8 to Theorem 7, the conditions on u in (6) and m in (5) are almost identical, aside from a logarithmic factor, so that the total number of data points indicated is roughly the same. However, the number of labels indicated by (7) may often be significantly smaller than the condition in (5), multiplying it by roughly θaε^α. This reduction is particularly strong when θ is bounded by a finite constant and α is large. Moreover, this is the same type of improvement known to occur when ℓ is itself the 0-1 loss [24]; in particular, in this special case, (7) is sometimes nearly minimax [24,44]. Regarding the slight difference between (6) and (5) arising from replacing τ_ℓ by χ_ℓ, the effect is somewhat mixed, and which of these is smaller may depend on F and ℓ. For ℓ the 0-1 loss, τ_ℓ = χ_ℓ = θ(a(ε/2)^α).
In the case when ℓ satisfies Condition 2, we can derive the following sometimes-stronger result with the help of Lemma 5.
Theorem 9. For a universal constant c ∈ [1, ∞), if P_XY satisfies Condition 3, ℓ is classification-calibrated and satisfies Condition 2, f⋆ ∈ F, Ψ_ℓ is as in (3), and b and β are as in Lemma 5, then for any ε ∈ (0, 1), letting θ = θ(aε^α) and A_2 = vc(G_F) Log((l̄²/b)(aθε^α/Ψ_ℓ(ε))^β) + Log(1/δ), and letting C_1 be as in Theorem 8, if u, n ∈ N satisfy (8) and (9) respectively, then, with arguments ℓ, u, and n, and an appropriate ŝ function, Algorithm 1 uses at most u unlabeled samples and makes at most n label requests, and with probability at least 1 − δ returns ĥ with er(ĥ) − er(f⋆) ≤ ε.

The constraint on u in (8) may be smaller than the sample size in (5), while the constraint on n in (9) may be smaller than the label bound in (7). This is noteworthy when θ is small while α > 0 and r_ℓ > 2, for at least two reasons. First, the sufficient size of n in (9) is smaller than that in Theorem 8, multiplying by roughly (aθε^α)^{1−β}.
Second, even the sufficient number of unlabeled samples in (8) may be smaller than the sufficient number of labeled samples for ERM ℓ from Theorem 7, again multiplying by roughly (aθε α ) 1−β . Thus, in the case ℓ satisfies Condition 2 with r ℓ > 2, when Theorem 7 is tight, even with access to a fully labeled data set, we may still prefer to use Algorithm 1 rather than ERM ℓ . This is somewhat surprising, since (as (9) indicates) we expect Algorithm 1 to ignore the vast majority of the labels in this case. That said, it is not clear whether there exist natural losses ℓ of this type for which Theorem 7 is competitive with results for methods directly based on the 0-1 loss. Thus, these improvements in u and n in Theorem 9 may simply indicate that Algorithm 1 is, to some extent, compensating for a choice of ℓ that would otherwise lead to suboptimal error rates.

Entropy Conditions
In this section, we consider characterizations of the complexity of F in terms of entropy conditions. As with the above results, detailed derivations of all of these results are presented in Section 7.3 below, based on the abstract theorems presented in Section 6.4. For a distribution P over X × Y, a set G ⊆ G*, and ε ≥ 0, let N(ε, G, L_2(P)) denote the size of a minimal ε-cover of G (that is, the minimum number of balls of radius at most ε sufficient to cover G), where distances are measured in terms of the L_2(P) pseudo-metric: (f, g) → ‖f − g‖_P. Also, for functions g_1 ≤ g_2, a bracket [g_1, g_2] is the set of functions g ∈ G* with g_1 ≤ g ≤ g_2; [g_1, g_2] is called an ε-bracket under L_2(P) if ‖g_1 − g_2‖_P < ε. Then N_{[]}(ε, G, L_2(P)) denotes the smallest number of ε-brackets (under L_2(P)) sufficient to cover G.
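A covering number can be estimated directly for small finite classes. The following sketch, a toy construction not from the paper, represents each function by its vector of values on a finite sample (so L_2(P) distance for the empirical measure P is just a scaled Euclidean distance) and bounds the minimal cover size from above greedily.

```python
import numpy as np

def l2_dist(f, g):
    # L2(P) pseudo-metric for P = empirical measure on the sample points
    return np.sqrt(np.mean((f - g) ** 2))

def greedy_cover_size(G, eps):
    # Greedy covering: open a new ball whenever a function is uncovered.
    # Returns an upper bound on N(eps, G, L2(P)).
    centers = []
    for g in G:
        if all(l2_dist(g, c) > eps for c in centers):
            centers.append(g)
    return len(centers)

# Toy class: +/-1-valued thresholds evaluated on a grid of 100 points.
xs = np.linspace(0.0, 1.0, 100)
G = [np.where(xs >= t, 1.0, -1.0) for t in np.linspace(0.0, 1.0, 101)]
```

As expected, the cover size grows as ε shrinks, and a single ball suffices once ε exceeds the diameter of the class.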
The following represent two commonly-studied conditions.
Condition 10 requires that either the bracketing entropy bound (10) holds for all ε > 0, or the uniform entropy bound (11) holds for all finitely discrete P and all ε > 0. The following theorem is a classic result on the performance of ERM_ℓ under the above conditions [e.g., 6,47].
Theorem 11. For a universal constant c ∈ [1, ∞), if P_XY satisfies Condition 3 and Condition 4, F and P_XY satisfy Condition 10, ℓ is classification-calibrated, f⋆ ∈ F, and Ψ_ℓ is as in (3), then for any ε ∈ (0, 1) and any sufficiently large m (as quantified in Section 7.3), ERM_ℓ(F, Z_m) achieves, with probability at least 1 − δ, er(ĥ) − er(f⋆) ≤ ε.

Turning to the analogous setting for active learning, we are able to establish the following theorem on the performance of Algorithm 1 under these same conditions.
Theorem 12. For a universal constant c ∈ [1, ∞), if P_XY satisfies Condition 3 and Condition 4, F and P_XY satisfy Condition 10, ℓ is classification-calibrated, f⋆ ∈ F, and Ψ_ℓ is as in (3), then for any ε ∈ (0, 1), letting B_1 and C_1 be as in Theorem 8, for any u and n satisfying (12) and (13) respectively, with arguments ℓ, u, and n, and an appropriate ŝ function, Algorithm 1 uses at most u unlabeled samples and makes at most n label requests, and with probability at least 1 − δ returns ĥ with er(ĥ) − er(f⋆) ≤ ε.

The constraint on u in (12) is identical (up to constant factors) to the sample size in Theorem 11 sufficient for ERM_ℓ to achieve the same. In contrast, when θ is small, the constraint on n in (13) improves on this, multiplying it by a factor ∝ θaε^α.
As before, when ℓ satisfies Condition 2, we can derive sometimes-stronger results via Lemma 5. In this case, we will distinguish between the cases of (11) and (10), as we find a slightly stronger result for the former. We begin with the following result, under the uniform entropy condition (11).

Theorem 13. For a universal constant c ∈ [1, ∞), if P_XY satisfies Condition 3, ℓ is classification-calibrated and satisfies Condition 2, f⋆ ∈ F, Ψ_ℓ is as in (3), b and β are as in Lemma 5, and (11) is satisfied with F ≤ l̄ (for all finitely discrete P and all ε > 0), then for every ε ∈ (0, 1), for C_1 as in Theorem 8 and u, n sufficiently large (as quantified in Section 7.3), with arguments ℓ, u, and n, and an appropriate ŝ function, Algorithm 1 uses at most u unlabeled samples and makes at most n label requests, and with probability at least 1 − δ returns ĥ with er(ĥ) − er(f⋆) ≤ ε.

Compared to Theorem 12, the constraints for u and n here may have improved dependences on ε, multiplying by O((θε^α)^{1−β(1−ρ)}). Furthermore, for small θ, these are also smaller than the size of m for ERM_ℓ(F, Z_m) from Theorem 11. Next, we turn to the bracketing entropy condition (10). For simplicity, we will only consider the case that (10) is satisfied with F = l̄ constant. In this case, we have the following result.
Theorem 14. For a universal constant c ∈ [1, ∞), if P_XY satisfies Condition 3, ℓ is classification-calibrated and satisfies Condition 2, f⋆ ∈ F, Ψ_ℓ is as in (3), b and β are as in Lemma 5, and (10) is satisfied with F = l̄, then for every ε ∈ (0, 1), letting C_1 be as in Theorem 8 and C_2 be as in Theorem 12, for u and n sufficiently large (as quantified in Section 7.3), with arguments ℓ, u, and n, and an appropriate ŝ function, Algorithm 1 uses at most u unlabeled samples and makes at most n label requests, and with probability at least 1 − δ returns ĥ with er(ĥ) − er(f⋆) ≤ ε.

Compared to Theorem 12, the dependence on ε in the sizes for both u and n may be smaller here, multiplying by O((θε^α)^{(1−β)(1−ρ)}), which is sometimes significant, though not quite as dramatic a reduction as we found under (11) in Theorem 13. As with Theorem 13, when θ(ε^α) = o(ε^{−α}), the sizes of u and n indicated by Theorem 14 are smaller than the results for ERM_ℓ(F, Z_m) from Theorem 11.

An Example: Discrete Distributions
As a concrete example applying the above results, we find that Algorithm 1 generally provides some benefits for discrete marginal distributions P. To describe these benefits quantitatively, consider the special case where ∃x_1, x_2, . . . ∈ X with P({x_i}) = 90/(π^4 i^4) for each i ∈ N, and where, for some ν_0 ∈ [0, 1/2), every i has |η(x_i) − 1/2| ≥ 1/2 − ν_0, and take ℓ to be the quadratic loss (in which case l̄ = 4). In particular, since the class contains functions into [−1, 1], the condition f⋆ ∈ F is satisfied in this scenario. We will use Theorem 12 to bound the number of labels sufficient for Algorithm 1 to achieve excess error rate ε. For any g ∈ F*, one can verify that Condition 3 is satisfied with α = 1 and a = 1/(1 − 2ν_0). Furthermore, F is convex, and this ℓ satisfies Condition 2, with β = 1 and b = 32 in Lemma 5. Also, since ψ_ℓ(x) = x² here [6], we have that Ψ_ℓ(ε) = ε^{2−α}/(4a) = (1 − 2ν_0)ε/4. Additionally, this scenario satisfies (10) in Condition 10 with q = 7/ω and ρ = 1/3 + ω, for any choice of ω ∈ (0, 1/2]; we include a simple proof of this fact in Appendix B.1. Finally, we bound θ(r_0) for r_0 ∈ (0, 1]. For any r ∈ (0, 1), we have DIS(B(f⋆, r)) ∩ {x_i : i ∈ N} = {x_i : 90/(π^4 i^4) ≤ r}, so that P(DIS(B(f⋆, r))) ≲ Σ_{i ≳ r^{−1/4}} i^{−4} ≲ r^{3/4}, and therefore θ(r_0) ≲ r_0^{−1/4}. Plugging these values into Theorem 12, and choosing ω appropriately, we find that there is a label budget n, sufficient to guarantee er(ĥ) − er(f⋆) ≤ ε with probability at least 1 − δ in Algorithm 1, with dependence Θ(ε^{−7/12} Log(1/ε)) on ε. For comparison, the corresponding bound for ERM_ℓ from Theorem 11 has dependence Θ(ε^{−4/3} Log(1/ε)). This is larger than the above bound by a factor Θ(ε^{−3/4}). Furthermore, one can show an Ω(ε^{−4/3}) lower bound on the sample size necessary to obtain minimax expected excess error rate ε for passive learning in this scenario. Thus, Algorithm 1 achieves a significant improvement over the guarantees achievable by all passive learning methods. The details of this minimax lower bound are included in Appendix B.1.
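The r^{3/4} scaling of the disagreement mass can be checked numerically. The sketch below assumes the point masses P({x_i}) = 90/(π⁴ i⁴) suggested by the displayed quantities above (these masses sum to 1, since Σ 1/i⁴ = π⁴/90), and sums the masses below the threshold r.

```python
import math

def p_dis(r):
    # P(DIS(B(f*, r))): total mass of the points x_i with P({x_i}) <= r.
    # The smallest qualifying index is i0 = ceil((90 / (pi^4 r))^{1/4}).
    i0 = math.ceil((90.0 / (math.pi ** 4 * r)) ** 0.25)
    return sum(90.0 / (math.pi ** 4 * i ** 4) for i in range(i0, 200_000))

# The ratio P(DIS(B(f*, r))) / r^{3/4} should stay bounded as r -> 0.
ratios = [p_dis(r) / r ** 0.75 for r in (1e-2, 1e-3, 1e-4)]
```

The ratios remain of constant order across three decades of r, consistent with θ(r_0) growing only like r_0^{−1/4}.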

An Example: Linear Functions
As another example applying the above results, consider the class of homogeneous linear functions. Specifically, fix any k ∈ N with k ≥ 5, X = {x ∈ R^k : ‖x‖ ≤ 1}, and consider the class F = {x → w · x : w ∈ R^k, ‖w‖ ≤ 1}. Take ℓ as the quadratic loss (in which case l̄ = 4). Together with the assumption f⋆ ∈ F, this restricts P_XY to have η(x) = (w · x + 1)/2 (almost everywhere), for some w ∈ R^k with ‖w‖ ≤ 1.
Furthermore, this ℓ satisfies Condition 2, with β = 1 and b = 32 in Lemma 5, and has Ψ_ℓ(ε) = ε^{2−α}/(4a). It is also known that vc(G_F) ≲ k (following from arguments of [16,29]). Additionally, for this class F, it is known that if P has a density (with respect to Lebesgue measure), then θ(ε) = o(1/ε) [26]. Together, these facts imply that, if P has a density, the sufficient size of n in Theorem 9 has dependence on ε that is o(ε^{α−2} Log(1/ε)). We also note that, by varying P, it is possible to realize any α value in (0, 1] in Condition 3 [see 12,15]. To exhibit a concrete example, consider the simple scenario of P uniform on {x ∈ R^k : ‖x‖ = 1}, and suppose P_XY is such that f⋆ ∈ F. For simplicity, also suppose the w ∈ R^k with f⋆(x) = w · x satisfies ‖w‖ = 1. In this case, one can show that Condition 3 is satisfied with a ∝ k^{1/4} and α = 1/2. For completeness, a proof of this is included in Appendix B.2. It is also known that θ(ε) ≤ π√k for this scenario [22]. Plugging all of this into Theorem 9 yields a sufficient label budget n for Algorithm 1 to achieve excess error rate ε with probability at least 1 − δ (given sufficiently large u), for a universal constant c > 0. In contrast, Theorem 7 gives a sufficient sample size for ERM_ℓ(F, ·) proportional to (k^{1/4}/ε^{3/2})(k Log(k) + Log(1/δ)), which is significantly larger than the above size of n for ε sufficiently small. To our knowledge, it is not presently known what the optimal sample complexity of passive learning is for this scenario, so that in contrast to the previous example, here we can only claim an improvement in the upper bound. We note that Dekel, Gentile, and Sridharan [15] have also studied active learning with this F and ℓ under the same assumption of f⋆ ∈ F, and established a similar result to the above (with slightly better dependence on k but slightly worse logarithmic factors), via a learning method tailored specifically to this function class.
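The bound θ(ε) ≤ π√k for the uniform-on-sphere scenario can be probed by Monte Carlo. This sketch relies on a geometric characterization that is an assumption of the illustration (not a statement from the paper): for homogeneous linear separators, a unit vector x lies in DIS(B(f⋆, r)) iff its angle to the decision boundary of f⋆ is at most πr, i.e. |w⋆ · x| ≤ sin(πr).

```python
import numpy as np

def p_dis_over_r(k, r, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, k))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform on the sphere
    w_star = np.zeros(k)
    w_star[0] = 1.0
    # fraction of the sphere within angular distance pi*r of the boundary,
    # divided by r: an estimate of P(DIS(B(f*, r))) / r
    return np.mean(np.abs(x @ w_star) <= np.sin(np.pi * r)) / r
```

For moderate k and small r the estimated ratio sits below π√k, consistent with the cited bound.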

General Theorems
The remainder of the article is devoted to a general analysis of Algorithm 1, from which we derive the more-explicit theorems stated above. The results are formulated analogously to localization arguments common in the literature on empirical risk minimization, but with a slight twist to introduce a relevant subregion to the argument. As such, we begin with a discussion of general localized sample complexity bounds.

Localized Sample Complexities
The derivation of localized excess risk bounds is essentially motivated as follows. We are interested in bounding the excess ℓ-risk of the ĥ returned by ERM_ℓ(H, Z_m). Suppose we have a coarse guarantee U_ℓ(H, m) on this value: that is, R_ℓ(ĥ) − inf_{h∈H} R_ℓ(h) ≤ U_ℓ(H, m). In a sense, this guarantee identifies a set H′ ⊆ H of functions that a priori may have the potential to be returned by ERM_ℓ(H, Z_m) (namely, H′ = H(U_ℓ(H, m); ℓ)), while those in H ∖ H′ do not. With this information in hand, we can think of H′ as a kind of effective function class, and we can think of ERM_ℓ(H, Z_m) as equivalent to ERM_ℓ(H′, Z_m). We may then repeat this same reasoning, now thinking of ĥ as the function returned by ERM_ℓ(H′, Z_m): that is, we calculate U_ℓ(H′, m) to determine a further subset H′′ = H′(U_ℓ(H′, m); ℓ) ⊆ H′ of functions that we again expect to contain the empirical minimizer ĥ, so that ERM_ℓ(H′, Z_m) = ERM_ℓ(H′′, Z_m), and so on. This repeats until we identify a fixed-point set H^{(∞)} of functions such that H^{(∞)}(U_ℓ(H^{(∞)}, m); ℓ) = H^{(∞)}, so that no further reduction is possible. Following this chain of reasoning back to the beginning, we find that ERM_ℓ(H, Z_m) = ERM_ℓ(H^{(∞)}, Z_m), so that the function ĥ returned by ERM_ℓ(H, Z_m) has excess ℓ-risk at most U_ℓ(H^{(∞)}, m), which may be significantly smaller than U_ℓ(H, m), depending on how U_ℓ(H, m) varies with H.
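The fixed-point reasoning above can be illustrated with a toy bound. Suppose (purely as an illustration, not the paper's U_ℓ) that the coarse guarantee for an effective class of excess-risk radius γ is U(γ) = sqrt(γ d/m) + d/m, a typical localized form. Iterating γ ← U(γ) from the global guarantee converges to a fixed point of order d/m, improving on the non-localized rate sqrt(d/m).

```python
import math

def localize(d, m, gamma0=1.0, iters=100):
    """Iterate the hypothetical coarse bound U(gamma) to its fixed point."""
    gamma = gamma0
    for _ in range(iters):
        gamma = math.sqrt(gamma * d / m) + d / m
    return gamma

d, m = 10, 10_000
```

Solving γ = sqrt(γ d/m) + d/m directly gives γ ≈ 2.62 d/m, so the iteration lands well below the crude sqrt(d/m) guarantee.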
To formalize this fixed-point argument for ERM ℓ (H, Z m ), Koltchinskii [34] makes use of the following quantities to define the coarse bound U ℓ (H, m) [see also 8,20].
In other cases, for completeness, we define them to be ∞.
In particular, the quantity M_ℓ(γ; F, P_XY, s) is used in Theorem 17 below to quantify the performance of ERM_ℓ(F, Z_m). The primary practical challenge in calculating M_ℓ(γ; H, P, s) is handling the φ_ℓ(H(γ′; ℓ, P); m, P) quantity. In the literature, the typical (only?) way such calculations are approached is by first deriving a bound on φ_ℓ(H′; m, P) for every H′ ⊆ H in terms of some natural measure of complexity for the full class H (e.g., entropy numbers) and some very basic measure of complexity for H′: most often D_ℓ(H′; P), and sometimes a seminorm of an envelope function. After this, one then proceeds to bound these basic measures of complexity for the specific subsets H(γ′; ℓ, P), as a function of γ′. Composing these two results is then sufficient to bound φ_ℓ(H(γ′; ℓ, P); m, P). For instance, bounds based on an entropy integral tend to follow this strategy. This approach effectively decomposes the problem of calculating the complexity of H(γ′; ℓ, P) into the problem of calculating the complexity of H and the problem of calculating some more basic properties of H(γ′; ℓ, P). See [6,20,34,47], or Section 7.1 below, for several explicit examples of this technique.
Another technique often (though not always) used in conjunction with the above strategy when deriving explicit rates of convergence is to relax D ℓ (H(γ ′ ; ℓ, P ); P ) to D ℓ (F * (γ ′ ; ℓ, P ); P ) or D ℓ ([H](γ ′ ; ℓ, P ); P ). This relaxation can sometimes be a source of slack; however, in many interesting cases, such as for certain losses or noise conditions, this approach can still lead to nearly tight bounds [6,40,45].
For our purposes, it is convenient to make these common techniques explicit in the results. This will make the benefits of our proposed method more apparent, while still allowing us to state results in a form abstract enough to encompass the more-specific complexity measures referenced in the theorems of Section 5. Toward this end, we have the following definition (recall the definitions of h U ,g and H U ,g from Section 4 above).
Definition 16. For every distribution P over X × Y, let φ_ℓ(σ, H; m, P) be a quantity defined for every σ ∈ [0, ∞], H ⊆ [F], and m ∈ N, such that the following conditions are satisfied when f⋆_P ∈ H.
It will often be convenient to isolate the terms in Ů_ℓ when inverting for a sufficient m, thus arriving at an upper bound on M_ℓ. Specifically, one defines a corresponding quantity Ṁ_ℓ, and notes that we clearly have M_ℓ ≤ Ṁ_ℓ, so that, in the task of bounding M_ℓ, we can simply focus on bounding Ṁ_ℓ. We will express our main abstract results below in terms of the incremental values M_ℓ(γ_1, γ_2; H, P_XY, s); the quantity M_ℓ(γ; H, P_XY, s) will also be useful in deriving explicit results for ERM_ℓ. When f⋆_P ∈ H, (16) implies a corresponding inequality between these quantities.

General Analysis of Empirical Risk Minimization
Based on Lemma 15 and the above definitions, one can derive a bound on the number of labeled data points m sufficient for ERM ℓ (F , Z m ) to achieve a given excess error rate. Specifically, the following theorem is due to Koltchinskii [34] (slightly modified here, following Giné and Koltchinskii [20], to allow for general s functions). It will be useful for deriving Theorems 7 and 11. For ε > 0, let Z ε = {j ∈ Z : 2 j ≥ ε}.

Specification ofT ℓ in Algorithm 1
The quantity T̂_ℓ in Algorithm 1 can be defined in one of several possible ways. In our present abstract context, we consider the following definition. Let {ξ′_k}_{k∈N} denote independent Rademacher random variables (i.e., uniform in {−1, +1}), also independent from Z; these should be considered internal random variables used by the algorithm, which is therefore a randomized algorithm. For any q ∈ N ∪ {0} and Q = {(X_{i_1}, y_1), . . . , (X_{i_q}, y_q)}, as previously defined, we can define the quantity T̂_ℓ in the method above in terms of these quantities, for some ŝ : N → [1, ∞). This definition has the appealing property that it allows us to interpret the update in Step 6 in two complementary ways: as comparing the empirical risks of functions in V under samples from the conditional distribution of (X, Y) given X ∈ DIS(V), and as comparing the empirical risks of the functions in V_{DIS(V)} under samples from the original distribution P_XY. Our abstract results below are based on this definition of T̂_ℓ. This can sometimes be problematic due to the computational challenge of the optimization problems in the definitions of φ_ℓ and D_ℓ. There has been considerable work on calculating and bounding φ_ℓ for various classes F and losses ℓ [e.g., 7, 33], but it is not always feasible. However, the specific theorems stated in Section 5 above continue to hold if we instead take T̂_ℓ based on a well-chosen upper bound on the respective Ů_ℓ function, such as those obtained in the derivations of those respective results below; we provide descriptions of such efficiently-computable relaxations, for each of these results, in Appendix D (though in some cases, these bounds have a mild dependence on P_XY via certain parameters of the specific noise conditions considered there).
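The core ingredient of such data-dependent quantities is an empirical Rademacher average of the loss class over the labeled sample Q, using independent Rademacher signs ξ′_k as described above. The precise T̂_ℓ in the paper involves additional localization and the ŝ function; the following is only a sketch of the Rademacher term, for a hypothetical finite class whose losses are tabulated in an array.

```python
import numpy as np

def empirical_rademacher(losses, n_draws=2000, seed=0):
    """losses: (num_functions, q) array, losses[h, k] = loss of function h
    on the k-th labeled example in Q.  Returns a Monte Carlo estimate of
    E_xi sup_h (1/q) sum_k xi_k * losses[h, k]."""
    rng = np.random.default_rng(seed)
    q = losses.shape[1]
    xi = rng.choice([-1.0, 1.0], size=(n_draws, q))   # Rademacher signs
    sups = np.max(losses @ xi.T / q, axis=0)          # sup over functions
    return float(np.mean(sups))
```

As expected of a Rademacher average, the value shrinks as the sample size q grows, which is what drives the shrinking thresholds in the update step.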

General Analysis of Algorithm 1
The following theorem represents our main abstract result. The key steps in its proof were already sketched above in Section 4. The complete proof is included in Appendix A.
Theorem 18. Fix any function ŝ : N → [1, ∞). Let j_ℓ = −⌈log_2(l̄)⌉, u_{j_ℓ−2} = u_{j_ℓ−1} = 1, and for each integer j ≥ j_ℓ, let F_j = F(E_ℓ(2^{2−j}); 01)_{DIS(F(E_ℓ(2^{2−j});01))} and U_j = DIS(F_j), and suppose u_j ∈ N satisfies log_2(u_j) ∈ N and the condition in (22). Suppose f⋆ ∈ F. For any ε ∈ (0, 1) and s ∈ [1, ∞), letting j_ε = ⌈log_2(1/Γ_ℓ(ε))⌉, if u ≥ u_{j_ε} and n ≥ s + 2e Σ_{j=j_ℓ}^{j_ε} P(U_j) u_j, then, with arguments ℓ, u, and n, Algorithm 1 uses at most u unlabeled samples, requests at most n labels, and with the stated probability returns ĥ with er(ĥ) − er(f⋆) ≤ ε.

In defining and calculating the values M_ℓ in Theorem 18, it is sometimes convenient to use the alternative interpretation of Algorithm 1, in terms of sampling the set S[Q] from the conditional distribution given the region of disagreement. Specifically, for any measurable U ⊆ X with P(U) > 0, define the probability measure P_U(·) = P_XY(·|U × Y): that is, P_U is the conditional distribution of (X, Y) ∼ P_XY given that X ∈ U. Generally, for any probability measure P on X × Y, and any measurable U ⊆ X × Y with P(U) > 0, define P_U(·) = P(·|U). Also, for any H ⊆ F*, define the region of value-disagreement DISF(H) = {x ∈ X : ∃h, g ∈ H s.t. h(x) ≠ g(x)}, and denote DISF(H) = DISF(H) × Y. The following lemma then allows us to replace calculations in terms of F_j and P_XY with calculations in terms of F(E_ℓ(2^{1−j}); 01) and P_{DIS(F_j)}. Its proof is included in Appendix A.
Lemma 19. Let φ_ℓ be any function satisfying Definition 16, and let P be any distribution over X × Y. For any σ, H, and m with P(DISF(H)) > 0, define φ′_ℓ(σ, H; m, P) by conditioning on the region of value-disagreement, and otherwise let φ′_ℓ(σ, H; m, P) = 0. Then φ′_ℓ also satisfies Definition 16.

Plugging this φ′_ℓ function into Theorem 18 immediately yields the following corollary; the proof is included in Appendix A.

Derivations of the Explicit Results
We are now ready to present derivations of the explicit results of Section 5, based on the general results of the previous section. To simplify the presentation, we often omit numerical constant factors in the inequalities below, and for this we use the common notation f (x) g(x) to mean that f (x) ≤ cg(x) for some implicit numerical constant c ∈ (0, ∞).

Specification ofφ ℓ
We begin by recalling a few well-known bounds on the φ_ℓ function, which lead to a more concrete instance of a function φ_ℓ satisfying Definition 16.
Uniform Entropy: The first bound is based on the work of van der Vaart and Wellner [48]; related bounds have been studied by Giné and Koltchinskii [20], Giné, Koltchinskii, and Wellner [21], van der Vaart and Wellner [47], and others. For σ ≥ 0 and F ∈ G*, define the entropy-integral function, in which Π ranges over all finitely discrete probability measures.
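A numerical sketch of this kind of entropy-integral quantity may be helpful. The specific integrand below is an assumption of the illustration, not the paper's definition: for a parametric-type bound sup_Π N(ε‖F‖, G, L_2(Π)) ≤ (A/ε)^d, the entropy integral J(σ) = ∫_0^σ sqrt(1 + d log(A/ε)) dε is finite for every σ > 0, despite the integrand diverging as ε → 0.

```python
import math

def entropy_integral(sigma, d=5, A=2.0, steps=10_000):
    """Right-endpoint Riemann sum of int_0^sigma sqrt(1 + d*log(A/eps)) deps,
    for the hypothetical covering-number bound N(eps) <= (A/eps)^d."""
    total = 0.0
    for i in range(1, steps + 1):
        eps = sigma * i / steps
        total += math.sqrt(1.0 + d * math.log(A / eps)) * (sigma / steps)
    return total
```

The integral grows roughly like σ sqrt(log(1/σ)) for small σ, which is what makes localized radii σ pay off in the bounds above.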

VC Subgraph Classes
The following is a classic result for VC subgraph classes [see e.g., 47], derived from the works of Pollard [42] and Haussler [29].
When ℓ satisfies Condition 2, we can derive the sometimes-stronger result in Theorem 9 via Corollary 20. Specifically, combining (31), (18), (19), and Lemma 5, we have that if f⋆ ∈ F and Condition 2 is satisfied, then for j ≥ j_ℓ in Corollary 20 a corresponding bound holds, where b and β are as in Lemma 5. Plugging this into Corollary 20, we arrive at Theorem 9; the remaining details proceed similarly to those of Theorem 8, and a detailed sketch appears in Appendix B.
The corresponding result for Algorithm 1, namely Theorem 12, follows by combining (37) with (18), (19), and Theorem 18. The details of the proof follow analogously to that of Theorem 8, and are therefore omitted for brevity.
Next, we turn to deriving the corresponding results stated above under Condition 2. As discussed above, we treat separately the cases of (11) and (10).
First, suppose (11) holds (for all P, ε) with F ≤ l̄. Following the derivation of (37) above, combined with (19), (18), and Lemma 5, for j ≥ j_ℓ in Corollary 20 we obtain a corresponding bound, where b and β are from Lemma 5. This immediately leads to Theorem 13 by reasoning analogous to the proof of Theorem 9. The case (10) can be treated similarly, though the result we obtain (Theorem 14) is slightly weaker. Suppose (10) is satisfied with F = l̄ constant. In this case, l̄ bounds the relevant envelopes, so that F_j and P_{U_j} also satisfy (10) with F = l̄. Thus, based on (37), (18), (19), and Lemma 5, we have that if f⋆ ∈ F and Condition 2 is satisfied, then for j ≥ j_ℓ in Corollary 20 a corresponding bound holds, where b and β are as in Lemma 5. Combining this with Corollary 20 and reasoning analogously to the proof of Theorem 9, we obtain Theorem 14.
The proof has two main components: first, showing that, with high probability, f⋆ ∈ V is maintained as an invariant, and second, showing that, with high probability, the set V will be sufficiently reduced to provide the guarantee on ĥ after at most the stated number of label requests, given that the value of u is as large as stated. Both of these components are served by the following application of Lemma 15. Consider any m ∈ S, and note that ∀h, g ∈ V^{(m)}, (38) holds, and furthermore that (39) holds. Applying Lemma 15 under the conditional distribution given V^{(m)}, combined with the law of total probability, we have that, for every m ∈ N with log_2(m) ∈ N, on an event of probability at least 1 − 6e^{−ŝ(m)}, the inequalities (40), (41), and (42) hold.
By a union bound, on an event of probability at least 1 − Σ_i 6e^{−ŝ(2^i)}, for every m ∈ S with m ≤ u_{j_ε} and f⋆ ∈ V^{(m)}, the inequalities (40), (41), and (42) hold. Call this event E.
In particular, note that on the event E, for any m ∈ S with m ≤ u_{j_ε} and f⋆ ∈ V^{(m)}, (38), (41), and (39) imply that f⋆ ∈ Ṽ^{(m)} as well. Since f⋆ ∈ V^{(2)}, and every m ∈ S with m > 2 has V^{(m)} = Ṽ^{(m/2)}, by induction we have that, on the event E, every m ∈ S with m ≤ u_{j_ε} has f⋆ ∈ V^{(m)} and f⋆ ∈ Ṽ^{(m)}; this also implies that (40), (41), and (42) all hold for these values of m on the event E. We next prove by induction that, on the event E, each set Ṽ^{(u_j)} is contained in the corresponding class F(E_ℓ(2^{−j}); 01); since F(E_ℓ(2^{−j_ℓ}); 01) = F, these initial values can serve as our base case. Now take as an inductive hypothesis that, for some j ∈ {j_ℓ, . . . , j_ε}, if u_{j−2} ∈ S ∪ {1}, then on the event E, Ṽ^{(u_{j−2})} ⊆ F(E_ℓ(2^{2−j}); 01), and suppose the event E occurs. If u_j ∉ S, the claim is trivially satisfied; otherwise, suppose u_j ∈ S, which further implies u_{j−2} ∈ S ∪ {1}. Since u_j ≤ u_{j_ε}, for any h ∈ Ṽ^{(u_j)}, (40) applies. Since we have already established that f⋆ ∈ V^{(u_j)}, (38) and (39) apply as well. The definition of Ṽ^{(u_j)} from Step 6, together with (39) and (42), then bounds the excess empirical risks of the functions in Ṽ^{(u_j)}; altogether, we obtain a bound holding for every h ∈ Ṽ^{(u_j)}. By definition of M_ℓ, monotonicity of m ↦ Ů_ℓ(·, ·; ·, m, ·), and the condition on u_j in (22), we can bound the resulting Ů_ℓ quantity. The fact that u_j ≥ 2u_{j−2}, combined with the inductive hypothesis, implies D_{u_j} ⊆ DIS(F(E_ℓ(2^{2−j}); 01)); combined with (17) and then (16), this yields (43). The inductive hypothesis also bounds the relevant spliced classes; plugging this into (43) implies, for every h ∈ Ṽ^{(u_j)}, the bound (44). In particular, since f⋆ ∈ F, every h ∈ Ṽ^{(u_j)} has er(h) = er(h_{D_{u_j}}), and therefore (by definition of E_ℓ(·)), (44) implies Ṽ^{(u_j)} ⊆ F(E_ℓ(2^{−j}); 01), which completes the inductive proof. This implies that, on the event E, if u_{j_ε} ∈ S, then (by monotonicity of E_ℓ(·) and the definition of j_ε) Ṽ^{(u_{j_ε})} ⊆ F(E_ℓ(2^{−j_ε}); 01), and E_ℓ(2^{−j_ε}) ≤ E_ℓ(Γ_ℓ(ε)) ≤ ε. In particular, since the update in Step 6 always keeps at least one element in V, the function ĥ in Step 8 exists, and has ĥ ∈ Ṽ^{(u_{j_ε})} (if u_{j_ε} ∈ S). Thus, on the event E, if u_{j_ε} ∈ S, then er(ĥ) − er(f⋆) ≤ ε.
Therefore, since u ≥ u jε , to complete the proof it suffices to show that taking n of the size indicated in the theorem statement suffices to guarantee u jε ∈ S, on an event (which includes E) having at least the stated probability.

Appendix B: Proofs of Results in Section 5
This appendix includes the remaining details of the proof of Theorem 8, to complete the derivations from Section 7.2, and also presents the remaining essential details for the proof of Theorem 9.
We note that the values ŝ(m) used in the proof of Theorem 8 have a direct dependence on the parameters b, β, a, α, and χ ℓ . Such a dependence may be undesirable for many applications, where information about these values is not available. However, one can easily follow this same proof, taking ŝ(m) = Log(12 log 2 (2m) 2 /δ) instead, which only leads to an increase by a log log factor: specifically, replacing the factor of A 1 in (6), and the factors (A 1 + Log(B 1 )) and (A 1 + Log(C 1 )) in (7), with a factor of (A 1 + Log(Log(l/Ψ ℓ (ε)))). It is not clear whether it is always possible to achieve the slightly tighter result of Theorem 8 without having direct access to the values b, β, a, α, and χ ℓ in the algorithm.
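To see why this distribution-free choice of ŝ still yields overall confidence 1 − δ, note that the per-round failure probabilities in the proofs have the form 12e −ŝ(m) (cf. Appendix E), which under this choice equals δ/ log 2 (2m) 2 ; summed over the rounds m = 2, 4, 8, . . ., these total less than δ. A quick numeric check (the constant 12 and the summation range follow our reading of the proofs, so treat this as illustrative):

```python
import math

delta = 0.05

def s_hat(m):
    # distribution-free choice: s_hat(m) = ln(12 * log2(2m)^2 / delta)
    return math.log(12 * math.log2(2 * m) ** 2 / delta)

# per-round failure probability is at most 12 * exp(-s_hat(m)) = delta / log2(2m)^2;
# for m = 2^i this equals delta / (i + 1)^2, so the rounds' failures sum below delta
total = sum(12 * math.exp(-s_hat(2 ** i)) for i in range(1, 2000))
print(total, "<", delta)
```

The series is δ times a tail of Σ 1/(i + 1)², which converges to δ(π²/6 − 1) < δ.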
Proof Sketch of Theorem 9. The proof follows analogously to the proof of Theorem 8, with the exception that now, for each integer j with j ℓ ≤ j ≤ j ε , we replace the definition of u ′ j from (51) with a new definition, in which c ′ ∈ [1, ∞) is an appropriate universal constant, and s j is as in the proof of Theorem 8. With this substitution in place, the values u j and s, and the function ŝ, are then defined as in the proof of Theorem 8. Since x → xΨ −1 ℓ (1/x) is nondecreasing, a bit of calculus reveals u j ≥ u j−1 and u j ≥ 2u j−2 . Combined with (35), (19), (18), and Lemma 5, this implies we can choose the constant c ′ so that these u j satisfy (24). By an identical argument to that used in Theorem 8, we obtain the analogous guarantee; it remains only to show that any values of u and n satisfying (8) and (9), respectively, necessarily also satisfy the respective conditions for u and n in Corollary 20.
Toward this end, note that since x → xΨ −1 ℓ (1/x) is nondecreasing on (0, ∞), for an appropriate choice of c, any u satisfying (8) has u ≥ u jε , as required by Corollary 20. Finally, note that for U j as in Theorem 18, and i j = j ε − j, we may change the order of summation, now summing over values of i j from 0 to N = j ε − j ℓ ≤ log 2 (4l/Ψ ℓ (ε)), and noting 2 jε ≤ 2/Ψ ℓ (ε). Considering these sums separately, we have ∑ N i=0 2 i(α−1)(2−β) (A 2 + Log(i + 2)) ≤ (N + 1)(A 2 + Log(N + 2)) and ∑ N i=0 2 i(α−1) (A 2 + Log(i + 2)) ≤ (N + 1)(A 2 + Log(N + 2)). When α < 1, plugging this into (55), we find that for an appropriately large numerical constant c, any n satisfying (9) has n ≥ ∑ jε j=j ℓ P(U j )u j , as required by Corollary 20. We note that, as in Theorem 8, the values ŝ used to obtain Theorem 9 have a direct dependence on certain values, which are typically not directly accessible in practice: in this case, a, α, and θ. However, as was the case for Theorem 8, we can obtain only slightly worse results by instead taking ŝ(m) = Log(12 log 2 (2m) 2 /δ), which again only leads to an increase by a log log factor: replacing the factor of A 2 in (8), and the factor of (A 2 + Log(C 1 )) in (9), with a factor of (A 2 + Log(Log(l/Ψ ℓ (ε)))). As before, it is not clear whether the slightly tighter result of Theorem 9 is always available, without requiring direct dependence on these quantities.
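The two series bounds used above can be checked numerically. The convention Log(x) = max{ln x, 1} is our assumption (the paper defines its own convention earlier), and the parameter values below are arbitrary illustrations with α ≤ 1 and β ∈ [0, 1]:

```python
import math

def Log(x):
    # assumed convention: Log(x) = max(ln(x), 1)
    return max(math.log(x), 1.0)

A2, alpha, beta, N = 1.0, 0.5, 0.5, 20

# each term has 2^(i * nonpositive exponent) <= 1 and Log(i + 2) <= Log(N + 2),
# so both sums are at most (N + 1) * (A2 + Log(N + 2))
s1 = sum(2 ** (i * (alpha - 1) * (2 - beta)) * (A2 + Log(i + 2)) for i in range(N + 1))
s2 = sum(2 ** (i * (alpha - 1)) * (A2 + Log(i + 2)) for i in range(N + 1))
bound = (N + 1) * (A2 + Log(N + 2))
print(s1 <= bound, s2 <= bound)  # → True True
```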

B.1. Derivations for Section 5.5
For completeness, we include here derivations of quantities appearing in the example given in Section 5.5. We begin with the claim that, for any ω ∈ (0, 1/2], (10) is satisfied in Condition 10 with the values q = 7/ω and ρ = 1/3 + ω. Specifically, for a given ε > 0, let i ε = ⌈3/ε 2/3 ⌉, and let G ε be the set of functions g in G * with g(x, y) ∈ {jε/ √ 2 : j ∈ {0, . . . , ⌈4 √ 2/ε⌉ − 1}} for each x ∈ {x i : 1 ≤ i ≤ i ε } and y ∈ Y, and g(x, y) = 0 for every x ∈ X \ {x i : 1 ≤ i ≤ i ε } and y ∈ Y. For each g ∈ G ε , let g ′ be the function in G * with g ′ (x, y) = g(x, y) + ε/ √ 2 for each x ∈ {x i : 1 ≤ i ≤ i ε } and y ∈ Y, and g ′ (x, y) = 4 for each x ∈ X \ {x i : 1 ≤ i ≤ i ε } and y ∈ Y. Note that ∪ g∈Gε [g, g ′ ] contains all functions g in G * having 0 ≤ g(x, y) ≤ 4 for all x ∈ X and y ∈ Y; in particular, this implies it contains G F . Furthermore, for each g ∈ G ε , the resulting bracketing entropy can be bounded; since ln(x) ≤ tx 1/t for any x, t ≥ 1, it is at most (7/ω)ε −2(1/3+ω) when ε ∈ (0, 1), for any value ω ∈ (0, 1/2]. This is trivially also an upper bound on ln N [] (4ε, G F , L 2 (P XY )) for all ε ≥ 1 (since N [] (4ε, G F , L 2 (P XY )) = 1 in that case). Thus, (10) is satisfied with q = 7/ω and ρ = 1/3 + ω, for any choice of ω ∈ (0, 1/2], as claimed.

Next, we present a proof of the claimed Ω(ε −4/3 ) lower bound on the sample size required to obtain an ε bound on the minimax expected excess error rate of passive learning methods in the example scenario. We approach this with the classic technique of Assouad (see e.g., [46]). Specifically, fix any ε ∈ (0, (1 − 2ν 0 )/64) and a sample size m; for v ∈ {0, 1} k , define P v as the probability measure on X × Y with marginal P on X (as specified in the construction), η(x i ; P v ) = 1 for i ∈ N \ {j 0 , . . .
, j 1 }, and with the values of η(x i ; P v ) for i ∈ {j 0 , . . . , j 1 } determined by v. Therefore, Theorem 2.12(ii) of [46] implies that, for any estimator v̂ of v, the expected number of coordinates on which v̂ differs from v is lower bounded. Thus, there exists a choice of v ∈ {0, 1} k such that, defining P XY = P v , the expected excess error rate of the method exceeds ε whenever m ≤ 2 −13 (1 − 2ν 0 ) 1/3 ε −4/3 . Thus, since these P v distributions satisfy the description of the construction in Section 5.5, we see that to guarantee expected excess error rate at most ε for all P XY fitting the description in the construction, any passive learning method would require the sample size m for its input labeled data set to be greater than 2 −13 (1 − 2ν 0 ) 1/3 ε −4/3 = Ω(ε −4/3 ), as claimed. In particular, this agrees with the dependence on ε derived for ERM ℓ in Section 5.5 (up to a logarithmic factor). In contrast, the analysis of Algorithm 1 in Section 5.5 reveals that (by choosing δ = ε/2), Algorithm 1 can achieve E[er(ĥ) − er(f ⋆ )] ≤ ε for all such P XY with a number of label requests n having only O(ε −7/12 Log(1/ε)) dependence on ε, a significant decrease compared to the Ω(ε −4/3 ) lower bound we have just established for all passive learning methods.
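To illustrate the gap between these rates, the following compares the ε-dependence of the passive lower bound ε −4/3 with the active upper bound ε −7/12 Log(1/ε); constant factors are omitted, so only the growth of the ratio is meaningful:

```python
import math

for eps in [1e-2, 1e-3, 1e-4]:
    passive = eps ** (-4 / 3)                       # passive lower bound rate
    active = eps ** (-7 / 12) * math.log(1 / eps)   # active upper bound rate
    print(f"eps={eps:g}  passive~{passive:.3g}  active~{active:.3g}  ratio={passive / active:.3g}")
```

The ratio grows like ε −3/4 / ln(1/ε), so the advantage of the active method widens without bound as ε → 0.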

B.2. Derivations for Section 5.6
For completeness, we include here a derivation of the parameters a and α for which the distributions P XY in the example in Section 5.6 satisfy Condition 3. Specifically, as in Section 5.6, let ℓ be the quadratic loss, fix an integer k ≥ 5, suppose P is uniform on {x ∈ R k : ∥x∥ = 1}, and suppose P XY is such that f ⋆ (x) = w * · x for some w * ∈ R k with ∥w * ∥ = 1. In particular, for this choice of ℓ, this implies η(x) = (w * · x + 1)/2. Therefore, among functions f ∈ F * with a given value p of ∆(f, f ⋆ ), the functions with minimal er(f ) − er(f ⋆ ) are those that minimize E[|2η(X) − 1| | X ∈ DIS({f, f ⋆ })] subject to P(DIS({f, f ⋆ })) = p; since |2η(x) − 1| = |w * · x| is increasing in |w * · x| and t → P(x : |w * · x| ≤ t) is continuous, any f ∈ F * of minimal er(f ) − er(f ⋆ ) subject to ∆(f, f ⋆ ) = p has DIS({f, f ⋆ }) = {x : |w * · x| ≤ γ p } (up to probability zero differences) for some γ p ∈ [0, 1] chosen so that P(x : |w * · x| ≤ γ p ) = p; in particular, this determines the minimum value of er(f ) − er(f ⋆ ) among such functions. For X ∼ P, one can show that the [0, 1]-valued random variable |w * · X| has a density function g expressible in terms of the usual gamma function Γ (see [37] for a derivation of the CDF, from which this g can be derived), from which the values of a and α follow.

This implies that for F a VC major class, and ℓ classification-calibrated and either nonincreasing or Lipschitz on [− sup h∈F sup x∈X |h(x)|, sup h∈F sup x∈X |h(x)|], if f ⋆ ∈ F and P XY satisfies Condition 3 and Condition 4, then the conditions of Theorem 18 can be satisfied with the probability bound being at least 1 − δ, for some n, where θ = θ(aε α ), and Õ(·) hides logarithmic and constant factors. Under Condition 2, with β as in Lemma 5, the conditions of Corollary 20 can be satisfied with the probability bound being at least 1 − δ, for some u of a similar form.
When θ is small, these values of n (and indeed u) compare favorably to the value m = Õ(Ψ ℓ (ε) β/2−2 ), derived analogously from Theorem 17, sufficient for ERM ℓ (F , Z m ) to achieve the same [see 20].
For example, for X = [0, 1] and F the class of all nondecreasing functions mapping X to [−1, 1], F is a VC major class with index 1, and θ(0) ≤ 2 for all distributions P. Thus, for instance, if η is nondecreasing and ℓ is the quadratic loss, then f ⋆ ∈ F , and Algorithm 1 achieves excess error rate ε with high probability for some u = Õ(ε 2α−3 ) and n = Õ(ε 3(α−1) ).
VC major classes are contained in special types of VC hull classes, which are more generally defined as follows. Let C be a VC Subgraph class of functions on X with bounded envelope, and for B ∈ (0, ∞), let Bconv(C) denote the scaled symmetric convex hull of C; then any F = Bconv(C) is called a VC hull class. For instance, these spaces are often used in conjunction with the popular AdaBoost learning algorithm. One can derive results for VC hull classes following analogously to the above, using established bounds on the uniform covering numbers of VC hull classes [see 47, Corollary 2.6.12], and noting that for any VC hull class F with envelope function F, and any U ⊆ X , F U is also a VC hull class, with envelope function F1 U . Specifically, one can use these observations to derive the following results. For a VC hull class F = Bconv(C), if ℓ is classification-calibrated and Lipschitz on [− sup h∈F sup x∈X |h(x)|, sup h∈F sup x∈X |h(x)|], f ⋆ ∈ F , and P XY satisfies Condition 3 and Condition 4, then letting d = 2vc(C), the conditions of Theorem 18 can be satisfied with the probability bound having value at least 1 − δ, for some n with exponent depending on d + 2. Under Condition 2, with β as in Lemma 5, the conditions of Corollary 20 can be satisfied with the probability being at least 1 − δ, for some u. Compare these to the value m = Õ(Ψ ℓ (ε) 2β/(d+2)−2 ), derived analogously from Theorem 17, sufficient for ERM ℓ (F , Z m ) to achieve the same general guarantee [see also 6, 10]. However, it is not clear whether these results for active learning with VC hull classes have any practical implications, since we do not know of any scenarios where this sufficient value of m reflects a tight analysis of ERM ℓ (F , ·) while simultaneously being significantly larger than either of the above sufficient n values.
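For reference, the scaled symmetric convex hull referred to above is standardly defined as follows (our rendering, following the usual definition in empirical process theory; the paper's own display is not reproduced here):

```latex
B\,\mathrm{conv}(\mathcal{C}) \;=\; \biggl\{\, x \mapsto \sum_{j=1}^{N} \lambda_j h_j(x) \;:\; N \in \mathbb{N},\; h_1,\ldots,h_N \in \mathcal{C},\; \sum_{j=1}^{N} |\lambda_j| \le B \,\biggr\}.
```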

Appendix D: Computationally Efficient Updates
As mentioned in Section 6.3, though convenient in the sense that it offers a completely abstract and unified approach, the choice of T ℓ (V ; Q, m) given by (21) may often make Algorithm 1 computationally inefficient. However, for each of the applications studied in this work, we can relax this T ℓ function to a computationally accessible value, which then allows the algorithm to be efficient under convexity conditions on the loss and the class of functions.
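Before turning to the specific relaxations, the overall shape of such an efficient variant can be sketched in code. The following toy implementation is purely illustrative: the threshold class, the logistic surrogate, and the 1/√|Q| elimination radius T_hat are stand-ins for the actual T ℓ values derived below, and no guarantee from the paper is reproduced by it.

```python
import math
import random

def surrogate(margin):
    # logistic loss as an illustrative convex surrogate
    return math.log(1 + math.exp(-margin))

random.seed(2)
true_t = 0.62
pool = [random.random() for _ in range(4096)]       # unlabeled pool
label = lambda x: 1 if x >= true_t else -1          # noiseless oracle, for simplicity

V = [i / 200 for i in range(201)]                   # finite class of threshold rules
queries = []                                        # labeled pairs actually requested

m = 4
while m <= len(pool) and len(V) > 1:
    lo, hi = min(V), max(V)                         # DIS(V) for thresholds is [lo, hi]
    for x in pool[m // 2:m]:
        if lo <= x <= hi:                           # request labels only in DIS(V)
            queries.append((x, label(x)))

    def risk(t):
        # empirical surrogate risk of threshold t on the queried points
        return sum(surrogate(y * (1 if x >= t else -1)) for x, y in queries) / max(len(queries), 1)

    best = min(risk(t) for t in V)
    T_hat = 1 / math.sqrt(max(len(queries), 1))     # hypothetical concentration-style radius
    V = [t for t in V if risk(t) <= best + T_hat]   # surrogate-risk elimination step
    m *= 2

h_hat = V[len(V) // 2]
print(h_hat, len(queries))
```

The radius here is a placeholder; the point is only the control flow: labels are requested solely inside the disagreement region, and the version space is pruned by comparing empirical surrogate risks against the current minimum plus a data-dependent threshold.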
In particular, in the application to VC Subgraph classes, Theorem 8 remains valid if we instead define T ℓ as follows. If we let V (m) and Q m denote the sets V and Q upon reaching Step 5 for any given value of m with log 2 (m) ∈ N realized in Algorithm 1, then consider defining T ℓ in Step 6 inductively, letting γ̂ m be defined accordingly, and taking (with a slight abuse of notation to allow T ℓ to depend on sets V (m ′ ) and Q m ′ with m ′ < m) the resulting value, for an appropriate universal constant c 0 . This value is essentially derived by bounding ((m/2)/(|Q| ∨ 1)) Ũ ℓ (V DIS(V ) ; P XY , m/2, ŝ(m)) (which is a bound on (21) by Lemma 15), based on (30) and Condition 4 (and a Chernoff bound to argue |Q m | ≈ P(DIS(V ))m/2); since the sample sizes derived for u and n in Theorem 8 are based on these relaxations anyway, they remain sufficient (with slight changes to the constant factors) for these relaxed T ℓ values. We include a more detailed proof that these values of T ℓ suffice to achieve Theorem 8 in Appendix E.1. Note that we have introduced a dependence on b and β in (56). These values would indeed be available for some applications, such as when they are derived from Lemma 5 when Condition 2 is satisfied; however, in other cases, there may be more favorable values of b and β than given by Lemma 5, dependent on the specific P XY distribution, and in these cases direct observation of these values might not be available. Thus, there remains an interesting open question of whether there exists a function T ℓ (V ; Q, m) that is efficiently computable (under convexity assumptions) and yet preserves the validity of Theorem 8.
In the special case where Condition 2 is satisfied, it is also possible to define a value for T ℓ that is computationally accessible and preserves the validity of Theorem 9. Specifically, consider instead defining T ℓ in Step 6 by an analogous expression, with b and β as in Lemma 5, and an appropriate universal constant c 0 . This value is essentially derived (following [34]) by using Lemma 15 under the conditional distribution P DIS(V ) , in conjunction with a localization technique similar to that employed in the derivation of Theorem 17. Appendix E.2 includes a proof that the conclusions of Theorem 9 remain valid for this specification of T ℓ in place of (21). That these conclusions remain valid for this bound on excess conditional risks should not be too surprising, since Theorem 9 is itself proven by considering concentration under the conditional distributions P Uj via Corollary 20. Note that, unlike the analogous result for Theorem 8 based on (56) above, in this case all of the quantities in T ℓ (V ; Q, m) are directly observable (in particular, b and β), aside from any possible dependence arising in the specification of ŝ.
It is also possible to define computationally tractable values of T ℓ (V ; Q, m) in scenarios satisfying the entropy conditions (Condition 10), while preserving the validity of Theorem 12. This substitution can be derived analogously to (56) above, this time leading to a definition in which γ̂ m/2 is defined (inductively) as above, and c 0 is an appropriately large universal constant. By essentially the same argument used for (56) (see Appendix E.1), one can show that using (58) in place of (21) preserves the validity of Theorem 12; for brevity, the details are omitted.
In the case that Condition 2 and (11) are satisfied, it is possible to define a computationally accessible quantity T ℓ (V ; Q, m) while preserving the validity of Theorem 13. Specifically, following the same reasoning used to arrive at (57), except using (36) instead of (30), we find that upon replacing (21) with the resulting definition, for b and β as in Lemma 5 and an appropriate universal constant c 0 , the conclusions of Theorem 13 remain valid. The proof follows similarly to the proof (in Appendix E.2) that (57) preserves the validity of Theorem 9, and is omitted for brevity.
Finally, in the case that Condition 2 and (10) are satisfied, we can again derive an efficiently computable value of T ℓ (V ; Q, m), which in this case preserves the validity of Theorem 14. Specifically, noting that the reasoning preceding Theorem 14 also implies ln N [] (εl, G V , L 2 (P DIS(V ) )) ≤ q P(DIS(V )) −ρ ε −2ρ , and following the reasoning leading to (59) while replacing q with q P(DIS(V )) −ρ , combined with a Chernoff bound to argue P(DIS(V )) ≈ 2|Q|/m in the algorithm, we find that Theorem 14 remains valid after replacing (21) with the resulting definition, for an appropriate universal constant c 0 , and where b and β are as in Lemma 5. The proof is essentially similar to that given for (57) in Appendix E.2, and is omitted for brevity.
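Several of these relaxations invoke a Chernoff bound to argue that P(DIS(V )) ≈ 2|Q|/m, where |Q| counts which of the m/2 fresh unlabeled points fall in the disagreement region. A minimal simulation of that plug-in estimate (the value of p below is an arbitrary stand-in for P(DIS(V ))):

```python
import random

random.seed(3)
p = 0.15       # stand-in for P(DIS(V))
m = 10000

# each of the m/2 fresh points lands in DIS(V) independently with probability p
Q = sum(1 for _ in range(m // 2) if random.random() < p)
estimate = 2 * Q / m   # the plug-in estimate used in place of P(DIS(V))
print(p, estimate)
```

Since |Q| is Binomial(m/2, p), the estimate concentrates around p at rate O(1/√m), which is the content of the Chernoff-bound step.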

Appendix E: Proofs for Efficiently Computable Updates
Here we include more detailed proofs of the arguments leading to computationally efficient variants of Algorithm 1, for which the specific results proven in this work for the given applications remain valid. Specifically, we focus on the application to VC Subgraph classes here; the applications to scenarios satisfying the entropy conditions follow analogously. Throughout this section, we adopt the notational conventions introduced in the proof of Theorem 18 (e.g., V (m) , Ṽ (m) , Q m , L m , S), except in each instance here these are defined in the context of applying Algorithm 1 with the respective stated variant of T ℓ .
By essentially the same reasoning used in the proof of Theorem 8, the right hand side of this inequality can be bounded so that the conditions on u and n stated in Theorem 8 (with an appropriate constant c) suffice to guarantee er(ĥ) − er(f ⋆ ) ≤ ε on the event E ′ ∩ ⋂ iε−1 i=1 (E 2 i ∩ E ′′ 2 i+1 ). Finally, the proof is completed by noting that a union bound implies this event has at least the stated probability. Note that, as in Theorem 8, the function ŝ in this proof has a direct dependence on a, α, and χ ℓ , in addition to b and β. As before, with an alternative definition of ŝ, similar to that mentioned in the discussion following the proof of Theorem 8, it is possible to remove this dependence, at the expense of the same logarithmic factors mentioned above.

Theorem 9 under (57). Next, consider the conditions of Theorem 9, and suppose the definition of T ℓ from (57) is used in Step 6. For simplicity, we let V (m) and Q m be defined (though arbitrarily) even when m ∉ S. Fix a function ŝ (to be specified below) and any value of ε ∈ (0, 1). We will prove by induction that there exist events Ê m ′ , for values m ′ with log 2 (m ′ ) ∈ N, each with respective probability at least 1 − 12e −ŝ(m ′ ) , such that, for every m with log 2 (m) ∈ N, on ⋂ log 2 (m) i=1 Ê 2 i , if m ∈ S, we have that f ⋆ ∈ Ṽ (m) and Ṽ (m) ⊆ V (m) (4T m ; ℓ, P Dm ), where T m = T ℓ (V (m) ; Q m , m). This claim is trivially satisfied for m = 2, since T 2 = l, so this will serve as our base case in the inductive proof. Now fix any m > 2 with log 2 (m) ∈ N, and take as an inductive hypothesis that there exist events Ê m ′ for each m ′ < m with log 2 (m ′ ) ∈ N, as above. For each j ∈ N, define
In particular, this definition implies ŝ(m j ) = s j . We next prove by induction that there are events Ê ′ j , for j ∈ N ∪ {0}, each with respective probability at least 1 − 2 −sj , such that for every j ∈ N ∪ {0}, the claim holds on the intersection of these events. This claim is trivially satisfied for j = 0, which therefore serves as the base case for this inductive proof. Now fix any j ∈ N, and take as an inductive hypothesis that there exist events Ê ′ j ′ , as above, for all j ′ < j, such that on their intersection, Ṽ (mj−1 ) ⊆ F (E ℓ (ε j−1 ); 01). By the above, we have that on ⋂ i Ê 2 i , if m j ∈ S, then f ⋆ ∈ Ṽ (mj ) ⊆ V (mj ) (4T mj ; ℓ, P Dm j ). In particular, this implies that every h ∈ Ṽ (mj ) has R ℓ (h Dm j ; P XY ) − R ℓ (f ⋆ ; P XY ) = (R ℓ (h; P Dm j ) − R ℓ (f ⋆ ; P Dm j )) P(D mj ) ≤ 4T mj P(D mj ). (71) By a Chernoff bound and the law of total probability, on an event Ê ′ j of probability at least 1 − 2 −sj , if m j ∈ S, (1/2)m j P(D mj ) − √(s j m j P(D mj )) ≤ |Q mj |.
By monotonicity of m → DIS(Ṽ (m) ), the right hand side of (74) is at most ∑ jε j=0 ∑ min{mj , max S} m=mj−1+1 1 DIS(Ṽ (m j−1 ) ) (X m ). By a Chernoff bound, on an event Ê ′′ of probability at least 1 − δ/2, the right hand side of the above is bounded accordingly.

Furthermore, on
sets V ⊆ F arising in the algorithm. In principle, the results in this work can be generalized to provide guarantees when this condition (suitably formalized) is satisfied. However, the statements of the results become considerably more involved, and moreover we do not know of concise, general, a priori conditions on F , ℓ, and P XY , under which this property will hold. Beyond this, it appears our analysis does not easily extend to the important problem of active learning with surrogate losses in the general case, where results would presumably need to be expressed in terms of the approximation loss inf f ∈F R ℓ (f ) − R ℓ (f ⋆ ) or related quantities (as observed for passive learning [6]). It seems such a generalization would require a significantly different approach.