Large deviations for the perceptron model and consequences for active learning

Active learning is a branch of machine learning that deals with problems where unlabeled data is abundant yet obtaining labels is expensive. The learning algorithm has the possibility of querying a limited number of samples to obtain the corresponding labels, subsequently used for supervised learning. In this work, we consider the task of choosing the subset of samples to be labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix and the ground truth labels to be generated by a single-layer teacher random neural network. We employ replica methods to analyze the large deviations for the accuracy achieved after supervised learning on a subset of the original pool. These large deviations then provide optimal achievable performance boundaries for any active learning algorithm. We show that the optimal learning performance can be efficiently approached by simple message-passing active learning algorithms. We also provide a comparison with the performance of some other popular active learning strategies.


I. INTRODUCTION
Supervised learning consists in presenting a parametric function (often a neural network) with a series of samples and labels, and adjusting (training) the parameters (network weights) so as to match the network output with the labels as closely as possible. Active learning (AL) is concerned with choosing the most informative samples, so that training requires the fewest labeled samples to reach the same test accuracy. Active learning is relevant in situations where the potential set of samples is large, but obtaining the labels is expensive (computationally or otherwise). There exist many strategies for active learning, see e.g. [1] for a review. In membership-based active learning [2], [3], [4] the algorithm is allowed to query the label of any sample, most often one it generates itself. In stream-based active learning [5] an infinite sequence of samples is presented to the learner, which can decide whether or not to query each label. In pool-based active learning, which is the object of the present work, the learner can only query samples that belong to a pre-existing, fixed pool of samples. It therefore needs to choose, according to some strategy, which samples to query so as to reach the best possible test accuracy.
Pool-based active learning is relevant for many machine learning applications, e.g. because not every possible input vector is of relevance. A beautiful recent application of active learning is in computational chemistry [6], where a neural network is trained to predict inter-atomic potentials. In this case the pool of data is large and consists of all possible alloys, but not of arbitrary input vectors, and labelling is extremely expensive, as it demands resource-intensive ab initio simulations. Consequently, only a limited number of samples can be labeled, i.e. one only possesses a certain budget for the cardinality of the training set. Another setting where a cheap, large pool of input data is readily available but labelling is expensive is drug discovery [7], where, given a target molecule, one aims to find new compounds among the pool able to bind it. Another example is text classification [8], [9], [10], where labelling a text requires non-negligible human input, while a large pool of texts is readily available on the internet. Establishing efficient pool-based active learning procedures in this case requires selecting a priori the most informative data samples for labelling.
Mainstream works on active learning focus on designing heuristic algorithmic strategies in a variety of settings, and on analyzing their performance. The information-theoretic limitations that an active learning algorithm faces are very rarely known, and hence evaluating the distance from optimality is mostly an open question. The main contribution of the present work is to provide a toy model that is on the one hand challenging for active learning, and where at the same time the optimal performance of pool-based active learning can be computed, so that heuristic algorithms can be evaluated and benchmarked against the optimal solution. To our knowledge, this is the first work in which optimal performance results for pool-based active learning procedures are computed. More specifically, we study the random perceptron model [11]. The available pool of samples is assumed to consist of i.i.d. normal vectors, and the teacher generating the labels is taken to be also a perceptron, with the vector of teacher weights having i.i.d. normal components. We compute the large deviation function for how likely one is to find a subset of the samples that leads to a given learning accuracy. Our results are based on the replica method computation of this large deviation function, an exact method (modulo the possibility of so-called replica symmetry breaking, which we do not evaluate in the present work) originating in theoretical statistical physics [12], [13], [14]. Providing a rigorous proof of the obtained results, or turning them into rigorous bounds, would be a natural, and rather challenging, next step. In the algorithmic part of this work we benchmark several existing algorithms and also propose two new algorithms relying on the approximate message passing algorithm for estimating the label uncertainty of yet-unlabeled samples, showing that in the studied cases they closely approach the relevant information-theoretic limitations.
The paper is organized as follows: the problem is defined and related work discussed in section II. In section III, we propose a measure to quantify the informativeness of given subsets of samples. In section IV, we derive the large deviation function over all possible subset choices and deduce performance boundaries that apply to any pool-based active learning strategy. In section V, we then compare these theoretical results with the performance of existing active learning algorithms and propose two new ones, based on approximate message passing.

II. DEFINITION OF THE PROBLEM AND RELATED WORK
A natural modeling framework for analyzing learning processes and generalization properties is the so-called teacher-student (or planted) perceptron model [15], where the input samples are assumed to be random i.i.d. vectors, and the ground truth labels are assumed to be generated by a neural network (denoted the teacher) belonging to the same hypothesis class as the student neural network. In this work we restrict ourselves to single-layer neural networks (without hidden units), for which this setting was defined and studied in [16]. Specifically, we collect the input vectors into a matrix F ∈ R^{P×N}, where N is the dimension of the input space and P is the number of samples. The teacher generating the labels, called the teacher perceptron, is characterized by a teacher vector of weights x_0 and produces the label vector Y ∈ R^P according to Y = sign(F · x_0). Learning is then done using a student perceptron and consists in finding a vector x such that on the training set F we have, as closely as possible, Y = sign(F · x). The relevant notion of error is the test accuracy (generalization error), measuring the agreement between the teacher and the student on a new sample not presented in the training set. Since both teacher and student possess the same architecture, the training process can be rephrased as an inference problem (as discussed for instance in [15]): the student aims to infer the teacher weights, used to generate the labels, from the knowledge of a set of input-output associations. This scenario allows for nice geometrical insights (see for example [17]), as the generalization properties are linked to the distance in weight space between the teacher and student functions. Note that, in the case of a noiseless labelling process, the teacher-student scenario guarantees that perfect training is always possible.
Active learning was previously studied in the context of the teacher-student perceptron problem. Best known is the line of work on Query by Committee [4], [18], [19], dealing with the membership-based active learning setting, i.e. where the samples are picked one by one into the training set and can be absolutely arbitrary N-dimensional vectors. Active learning is in that case more a strategy for designing the samples than one for selecting them smartly from a predefined set. In the original work [4] the new samples are chosen so that a committee of several student neural networks has the maximum possible disagreement on the new sample. The paper shows that in this way one can reach a generalization error that decreases exponentially with the size of the training set, while for a random training set the generalization error can decrease only inversely proportionally to the size of the set [17]. However, in many practical applications the possible set of samples to be picked into the training set is not arbitrarily big, e.g. not every input vector represents an encoding of a molecular structure. We hence argue that pool-based active learning, studied in the present paper, where the samples are selected from a predefined set, is of larger relevance to many applications.
The theoretical part of this paper is presented for a generalization of the perceptron model, specifically for the random teacher-student Generalized Linear Models (GLM), see e.g. [20]. An instance of a GLM is specified by a prior measure on the weights P_X(·), from which the true generative model is assumed to be sampled, and an output channel measure P_out(·|·), defining the generative process for the labels given the pre-activations. In the part where the results of this work are presented we focus on the prototypical case of the noiseless continuous perceptron, where P_X(x) = e^{−x²/2}/√(2π) and P_out(y|h) = δ(y − sign(h)), with pre-activations h_µ = F_µ · x. Moreover, we consider the setting where the learning model is matched to the generative model, so that the student has perfect knowledge of the correct form of the two above-defined measures.
The pool-based active learning task can now be stated more formally as follows: given a pool of N-dimensional samples 𝒮 = {F_µ} of cardinality |𝒮| = P = αN, the goal is to select and query the labels of a subset S ⊂ 𝒮 of cardinality |S| = nN, 0 < n ≤ α, according to some active learning criterion. We will refer to n as the budget of the student. The true labels are then obtained through y_µ ∼ P_out(y_µ|F_µ · x_0), with x_0 ∼ P_X(x_0). Henceforth, measures with vector arguments are understood to be products of the corresponding scalar measures over the coordinates. For technical reasons, we rely on the strong (but customary) assumption that the samples are i.i.d. Gaussian distributed, F^µ_i ∼ N(0, 1) for all i ∈ {1, ..., N} and µ ∈ {1, ..., P}. Note that, while this assumption implies that the full pool 𝒮 of input data is globally unstructured and uncorrelated, it does not prevent non-trivial correlations from appearing in a smaller labeled subset S selected through an active learning procedure.
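To make the setting concrete, here is a minimal sketch of the data-generation process just described (all numerical values are illustrative choices, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000                      # input dimension
    alpha = 3.0                   # pool-size ratio, P = alpha * N
    P = int(alpha * N)

    # Pool of i.i.d. standard Gaussian samples, F in R^{P x N}
    F = rng.standard_normal((P, N))
    # Teacher weights drawn from the Gaussian prior P_X
    x0 = rng.standard_normal(N)
    # Noiseless labels from the sign channel P_out(y|h) = delta(y - sign(h))
    Y = np.sign(F @ x0)

    # An active learner with budget n may query at most n*N of these labels
    n = 0.5
    budget = int(n * N)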
In pool-based active learning settings, it is assumed that the student has a fixed budget n for building its training set, i.e. that only up to nN labels can be queried for training. The active learning goal is to select, among the pool 𝒮 of available samples, the nN most informative ones to present to the student, so that the latter achieves the best possible generalization performance. While many criteria of informativeness have been considered in the literature, see e.g. [1], in the teacher-student setting there exists a natural measure of informativeness, which we define in the next section.

III. THE GARDNER VOLUME AS A MEASURE OF OPTIMALITY
A natural strategy for ranking the possible subset selections is to evaluate the mutual information between the teacher vector x_0 and each subset of labels Y, conditioned on the corresponding inputs F. Good selections contain larger amounts of information about the ground truth, encoded in the labels, and make the associated inference problem easier for the student. Conversely, bad selections are characterized by less informative labels. In the case of the teacher-student perceptron, where the output channel P_out(·|·) is completely deterministic and binary, the mutual information can be rewritten (following [20]) in terms of a quantity well known in statistical physics, the so-called Gardner volume [11], [21], [17], denoted in the following by v; this connection is expressed in equation (1). The Gardner volume v represents the extent of the version space [22], i.e. the entropy of hypotheses in the model class consistent with the labeled training set. It provides a natural measure of the quality of the student training. A narrower volume implies less uncertainty about the ground truth x_0 and is thus a desirable objective in an active learning framework. We shall focus the rest of our discussion on the large deviation properties of the Gardner volume, but we invite the reader to keep in mind that this is equivalent to studying the above-defined mutual information.
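For concreteness, one convention for the version-space measure consistent with the definitions above (the per-site normalization v = V^{1/N} is our assumption, chosen so that the volume-halving bound of section V reads v ≥ 2^{−n}):

    V(S) = \int dx \, P_X(x) \prod_{\mu \in S} P_{out}(y_\mu \mid F_\mu \cdot x),
    \qquad v = V(S)^{1/N}.

For the deterministic sign channel the conditional entropy H(Y|F, x_0) vanishes, so the mutual information reduces to

    I(x_0; Y \mid F) = H(Y \mid F) = -\mathbb{E} \ln V(S) = -N \, \mathbb{E} \ln v,

making "small Gardner volume" and "high mutual information" literally equivalent.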
There exist other natural measures of informativeness, e.g. the student generalization error ǫ_g and the magnetization (or teacher/student overlap) m = x · x_0/N. In the thermodynamic limit N ↑ ∞, ǫ_g is a decreasing function of m (see appendix D for more details). Moreover, we will show analytical and numerical evidence that all these measures co-vary, at least in the simple teacher-student setting studied in this work. A numerical check at finite N of the correlation between v and m can be found in appendix E.

IV. LARGE DEVIATIONS OF THE GARDNER VOLUME
We consider the problem of sampling labeled subsets of cardinality nN, 0 < n ≤ α, from a fixed pool of data of cardinality αN, α ∼ O(1), and study the variations in the associated Gardner volumes. We hereby assume that, for any fixed pool and subset size, the Gardner volume probability distribution follows a large deviation principle, i.e. that there exists an exponential number e^{NΣ(n,v)} of subset choices that produce Gardner volumes equal to v. Employing statistical physics terminology, we will refer to the rate function Σ(n, v) as the complexity of labeled subsets associated to a budget n and a volume v.
In the large N limit, the overwhelming majority of subsets thus realizes a Gardner volume v⋆ such that v⋆ = argmax_v Σ(n, v). Since fluctuations around this typical value are exponentially rare, random sampling will almost certainly yield Gardner volumes extremely close to v⋆. However, the aim of active learning is to find strategies for accessing the atypically informative subsets (i.e., the atypically small volumes v < v⋆), whence the necessity of analyzing the large deviation properties of the subset selection process.
We will here give a brief outline of the analytic computation, based on standard methods from the physics of disordered systems [12], [13], [14], and refer the reader to appendices A, B and C for a more detailed derivation. It is convenient to introduce a vector of selection variables {σ_µ}_{1≤µ≤αN} ∈ {0, 1}^{αN}, such that σ_µ = 1 when the sample F_µ ∈ 𝒮 is selected (and added to the labeled training set), while σ_µ = 0 otherwise. In this notation the selected subset S ⊂ 𝒮 is simply S = {F_µ ∈ 𝒮 | σ_µ = 1}.
Since a direct computation of the complexity is not straightforward, as is customary in this type of analysis [23] we derive it by first evaluating its Legendre transform. We introduce the (unnormalized) measure over the selection variables and the associated free entropy,

    Ξ(β, φ) = Σ_{{σ_µ}} V({σ_µ})^β e^{φ Σ_µ σ_µ},    Φ(β, φ) = lim_{N↑∞} (1/N) E ln Ξ(β, φ),

where V({σ_µ}) = v^N is the version-space volume of the selected subset. From a statistical physics perspective, Ξ can be regarded as a grand-canonical partition function, with β playing the role of an inverse temperature coupled to the log-volume N ln v (the Gardner volume thus acting as the associated energy function), and φ an effective chemical potential controlling the cardinality of the selection subset, |S|. In the thermodynamic limit N ↑ ∞, by applying the saddle-point method one easily sees that Φ(β, φ) is dominated by a subset of selection vectors {σ_µ} whose budget and energy, n⋆ and v⋆, are given by

    n⋆ = ∂Φ/∂φ,    ln v⋆ = ∂Φ/∂β.

Thus, inverting the Legendre transform yields the sought complexity

    Σ(n, v) = Φ(β, φ) − β ln v − φ n,   (6)

evaluated at the values of (β, φ) realizing the extremum. At fixed budget n, the range of values of the volume v associated to positive complexities, i.e. with Σ(n, v) > 0, effectively spans all the achievable Gardner volumes for subsets of that given cardinality, agnostic of the actual strategy for selecting them. In particular, inf_v{v | Σ(n, v) > 0} and sup_v{v | Σ(n, v) > 0} define the minimal and maximal Gardner volumes and provide theoretical algorithmic boundaries for any realizable active learning strategy.
Note that this means that our prototypical model, albeit idealized, constitutes a nice benchmark for comparing known pool-based active learning heuristics.

A. Replica symmetric formula for the large deviations
In practice, the analytic evaluation of Φ(β, φ) involves the computation of a quenched average of a log-partition function and is not feasible via rigorous methods. In order to perform the computation, we resort to the replica method from statistical physics [12], [13], [14], based on the identity E ln Ξ = lim_{s→0} (1/s) ln E Ξ^s and on the fact that, for integer values of s, E_{F,x_0} Ξ^s can be computed. We refer the interested reader to appendix B, where the computation is made explicit in the more general case of a generalized linear model [24], [20], and then specialized to the case of interest of a teacher-student perceptron. The final analytic expression for the replica symmetric free entropy Φ_RS in this special case is reported there. The extremum operation entailed in the free entropy computation is performed over a set of overlap order parameters, which admit the following geometric interpretation:
• q, the typical overlap between students trained on different labeled subsets;
• Q, the typical overlap between students trained on the same labeled subset;
• r, the typical (squared) norm of a student;
• m, the typical magnetization of a student, i.e. its overlap with the teacher.
Once the free entropy is evaluated, the complexity can be obtained via a numerical implementation of the extremization prescribed by the inverse Legendre transform (6). We remark at this point that the presented replica calculation was obtained within the so-called replica symmetric ansatz. In general, it is possible for the replica symmetric result not to be exact, requiring replica symmetry breaking (RSB) in order to evaluate the correct free entropy Φ(β, φ) [14]. In this model, while RSB is surely not needed close to the maximum of the complexity curves (as implied by the results of [20]), it plausibly arises in highly frustrated cases, i.e. at very large or very negative β, corresponding to values of the complexity close to zero. At the same time, the presence of RSB usually entails corrections that are very small in magnitude. We leave the (technically more challenging) investigation of the RSB solution of the large deviations for future work.
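As a concrete illustration of this numerical extremization, a small sketch in the conventions of the reconstruction of section IV (the callable Phi, standing for a solver of the RS saddle-point equations, is a placeholder and not code from the paper):

    import numpy as np

    def complexity_point(Phi, beta, phi, eps=1e-4):
        # Invert the Legendre transform (6) numerically at one (beta, phi) point.
        # Saddle-point values follow from derivatives of the free entropy:
        n_star = (Phi(beta, phi + eps) - Phi(beta, phi - eps)) / (2 * eps)
        log_v_star = (Phi(beta + eps, phi) - Phi(beta - eps, phi)) / (2 * eps)
        # Complexity from the inverse Legendre transform
        Sigma = Phi(beta, phi) - beta * log_v_star - phi * n_star
        return n_star, np.exp(log_v_star), Sigma

    # Sweeping beta (with phi tuned to fix n) traces out one curve Sigma(n, v)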

B. Large deviation results
In Fig. 1 we show the results of the large deviation analysis at α = 3. Note that the qualitative picture is unaltered when α is varied (e.g., equivalent results for α = 10 are shown in appendix C). The different curves, obtained at fixed values of the budget n, show the complexity (i.e., the exponential rate of the number) of possible subset choices, Σ, that realize the corresponding Gardner volumes v. As expected, the maximum of each curve is observed at β = 0 and yields the typical Gardner volume of a teacher-student perceptron that has learned to correctly classify nN i.i.d. Gaussian input patterns. The associated complexity is simply the per-site logarithm of the number of ways of choosing nN samples among αN, i.e. a binomial entropy.
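Explicitly, this β = 0 value follows from a standard Stirling computation (our rephrasing of the binomial count, not a formula quoted from the text):

    \Sigma(n, v^\star) = \lim_{N\to\infty} \frac{1}{N} \ln \binom{\alpha N}{n N}
                       = \alpha \ln \alpha - n \ln n - (\alpha - n) \ln(\alpha - n).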
The cases where the extremum in equation (6) is realized for positive values of β describe choices of the labeled subsets that induce atypically large Gardner volumes: these correspond to active learning scenarios where the student query is worse than random sampling. The number of possible realizations of these scenarios decreases exponentially as one approaches the right-hand extremum of the region where the complexity curve is positive, which describes the largest possible volume at the given budget n. An important remark is that as soon as β > 0, the statistics of the input patterns in the labeled set is no longer i.i.d., but displays increasing correlations for larger β.
On the other side, negative β induces atypically small Gardner volumes and labeled subsets with high information content. Again, as one spans smaller and smaller volumes the associated complexity drops, making the problem of finding these desirable subsets harder and harder. The left positive-complexity extremum of the curves in the left plot of Fig. 1 corresponds to the smallest reachable Gardner volumes. We observe in the figure that for larger budgets the complexity curves quickly saturate very close to the smallest possible Gardner volume, namely the Gardner volume v(α) of the entire pool of samples.
In the right plot of Fig. 1, we also show the prediction for the typical value of the magnetization, i.e. the overlap between teacher and student, as the Gardner volume is varied. As mentioned in section III, small Gardner volumes induce high magnetizations and thus low generalization errors.
In Fig. 2 the typical (purple) and corresponding minimum (orange, yellow, cyan) Gardner volumes are depicted as a function of the budget n for various pool sizes α = 3, 10, 100. Note that the qualitative picture is unaltered when α is varied. We further observe that the minimum volume becomes very close to the Gardner volume v(α) of the entire pool of samples already for very small budgets n.

V. ALGORITHMIC IMPLICATIONS

A. Generic considerations
The setting investigated in this paper provides a unique chance to benchmark the algorithmic performance of any given pool-based active learning algorithm against the optimal achievable performance, and to measure how closely the large deviation results are approached. Before turning to such algorithmic performance we should distinguish between two active learning strategies:
• Label-agnostic algorithms, where the student is not able to extract any knowledge about the labels. In this case, for binary labels there is a simple lower bound on the Gardner volume reachable with nN samples, v ≥ 2^{−n}, obtained by the argument that every new sample can at best divide the current volume by a factor of two [4] (see the short derivation after this list). This strategy is explored in the famous query by committee active learning strategy, and the classical work [4] argues that volume halving can actually be achieved when an unlimited set of samples is available. Plotting this volume-halving bound in Fig. 2, we see that even though there exist subsets leading to smaller Gardner volumes, they cannot be found in a label-agnostic way.
• Label-informed algorithms, where external knowledge of the labels is available and can be exploited to extract information about which sample to choose. While in our toy setting the additional information can only consist of disclosing the true labels, which would defy the very point of active learning, in applications with structured data the structure of the input space could be exploited (e.g., through clustering or transfer learning) for making unsupervised guesses of the labels. A concrete example where external insight is available is drug discovery [7], where additional information can be inferred from the presence (or absence) of chemical functional groups on the molecules in the data pool. In the present work, we study whether it is possible, with full access to the labels, to devise an efficient method for finding a subset of samples that achieves close to the minimal Gardner volume (note that this is still an algorithmically non-trivial problem).
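The halving argument in compact form, under the per-site normalization v = (V/V_0)^{1/N} assumed in section III (each binary label can at most halve the version-space volume V):

    V_{t+1} \ge V_t / 2
    \;\Longrightarrow\;
    V_{nN} \ge 2^{-nN} \, V_0
    \;\Longrightarrow\;
    v = (V_{nN}/V_0)^{1/N} \ge 2^{-n}.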
In this section we will investigate both the label-agnostic and label-informed strategies. We will benchmark several well-known active learning algorithms on the model studied in the present paper, as well as design and test a new message-passing-based active learning algorithm. Before doing that, let us describe the general strategy.
Many of the commonly used active learning criteria rely on some form of label-uncertainty measure. Uncertainty sampling [1], [25] is an active learning scheme based on the idea of iteratively selecting and labelling the data points where the prediction of the available trained model is least confident. In general, the computational complexity associated to this type of scheme is of order O(N³), requiring an extensive number of runs of a training algorithm (which can scale as O(N²) at best). Since even training a single model per pattern addition can become expensive at large N, in all our numerical tests we opted for adding to the labeled set batches of k = 20 samples instead of a single sample per iteration. We remark that, despite the k-fold speed-up, the observed performance deterioration is negligible. The structure of this type of algorithm is sketched in Algorithm 1. In general, estimating the Gardner volume of a given training set, or the label uncertainty of a new sample, is a computationally hard problem. However, in perceptrons (or more general GLMs) with i.i.d. Gaussian input data F, at large system size N one can rely on the estimate provided by a well-known algorithm for approximate inference, Approximate Message Passing (AMP) (historically also referred to as the Thouless-Anderson-Palmer (TAP) equations, see [26]). The AMP algorithm [27], [28], [15] yields (at convergence) an estimator of the posterior means, x, and variances, ∆, thus accounting for the uncertainty in the inference process, including that on the label of a new sample. The Gardner volume v (corresponding to the so-called Bethe free entropy) can then be expressed as a simple function of the AMP fixed-point messages (see [29] for an example). We provide a pseudo-code of AMP in the case of the perceptron in Algorithm 2. An important remark is that when the training set is not sampled randomly from the pool, as in the active learning context, correlations can arise and AMP is no longer rigorously guaranteed to converge, nor to provide a consistent estimate of the Gardner volume. In the present work, we can only argue that its use seems justified a posteriori by observing the agreement between theoretical predictions and numerical experiments, for instance for the generalization error.

Algorithm 2 Single-instance AMP for the perceptron
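A minimal runnable sketch of the single-instance AMP iteration for the sign perceptron with Gaussian prior; the update ordering and damping follow common GAMP conventions [27], [28] and are our choices, not necessarily the exact implementation behind the Algorithm 2 box:

    import numpy as np
    from scipy.stats import norm

    def amp_perceptron(F, y, max_iter=200, tol=1e-7, damp=0.5):
        # GAMP for Y = sign(F.x0) with N(0,1) prior on the weights.
        P, N = F.shape
        F2 = F ** 2
        x, Delta = np.zeros(N), np.ones(N)   # posterior means and variances
        g = np.zeros(P)                      # output-channel messages
        for _ in range(max_iter):
            V = F2 @ Delta                   # pre-activation cavity variances
            omega = F @ x - V * g            # cavity means (Onsager correction)
            z = np.clip(y * omega / np.sqrt(V), -30, 30)
            # Sign channel: Z = Phi(z), g = d ln Z / d omega, dg = dg/d omega
            g_new = y * norm.pdf(z) / (norm.cdf(z) * np.sqrt(V))
            dg = -g_new * (omega / V + g_new)
            g = damp * g + (1 - damp) * g_new
            A = F2.T @ (-dg)                 # cavity precisions on the weights
            R = x + (F.T @ g) / A            # cavity means on the weights
            x_new = R * A / (A + 1.0)        # Gaussian-prior posterior mean
            Delta_new = 1.0 / (A + 1.0)      # ... and posterior variance
            if np.max(np.abs(x_new - x)) < tol:
                return x_new, Delta_new
            x = damp * x + (1 - damp) * x_new
            Delta = damp * Delta + (1 - damp) * Delta_new
        return x, Delta

As noted above, mild damping (here damp = 0.5) is what allows fixed points to be reached once the actively selected training set becomes correlated.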

Algorithm 1 Uncertainty sampling
B. Approximate message passing for active learning (AL-AMP)

We use the AMP algorithm to introduce a new uncertainty sampling procedure relying on the information contained in the AMP messages, denoted AL-AMP in the following. At each iteration, the single-instance AMP equations are run on the current training subset to yield the posterior mean estimate x and variance ∆. These quantities can then be used to evaluate, for all the unlabeled samples, the output magnetization (i.e. the Bayesian prediction)

    m_out = ⟨sign(F′ · x)⟩ ≈ E_{h∼N(ω,Γ)}[sign(h)],   (9)

where we introduced the output overlaps ω = F′ · x and variances Γ = (F′ ⊙ F′) · ∆, with ⊙ the component-wise product. The output magnetizations correspond to the weighted average of the output over all the estimators contained in the current version space, and their magnitude represents the overall confidence in the classification of the still-unlabeled samples. This means that AMP provides an extremely efficient way of obtaining the uncertainty information. The specifics of the algorithm can be found in Tab. I. We also explore numerically the label-informed active learning regime introduced in the previous section. We consider its limiting case by introducing the informed AL-AMP strategy, which can fully access the true labels Y in order to query the samples F_µ whose output magnetization m^µ_out (9) is maximally distant from the correct classification y_µ. This selection process can iteratively reduce the Gardner volume by factors larger than 2. Again, the relevant specifics of the informed AL-AMP algorithm are detailed in Tab. I. A sketch of both selection criteria is given below.
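A minimal sketch of the two AL-AMP criteria, assuming the AMP estimates (x, ∆) from the sketch above; for the sign channel the Gaussian approximation gives m_out = erf(ω/√(2Γ)), while the exact batching and tie-breaking of Tab. I may differ:

    import numpy as np
    from scipy.special import erf

    def output_magnetization(F_unlabeled, x, Delta):
        # Bayesian prediction (9): with h ~ N(omega, Gamma),
        # E[sign(h)] = erf(omega / sqrt(2 * Gamma)).
        omega = F_unlabeled @ x
        Gamma = (F_unlabeled ** 2) @ Delta
        return erf(omega / np.sqrt(2.0 * Gamma))

    def select_al_amp(F_unlabeled, x, Delta, k, y_unlabeled=None):
        # Return the indices of the k next queries (unlabeled-set ordering).
        m_out = output_magnetization(F_unlabeled, x, Delta)
        if y_unlabeled is None:
            scores = np.abs(m_out)        # label-agnostic: least confident
        else:
            scores = y_unlabeled * m_out  # informed: farthest from true label
        return np.argsort(scores)[:k]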

C. Other tested measures of uncertainty
One of the widely used uncertainty sampling procedures is the so-called Query by Committee (QBC) strategy [4], [18]. In QBC, at each time step a committee of K students is sampled from the version space (e.g., via the Gibbs algorithm). The committee is then employed to choose the labels to be queried, by identifying the samples on which the maximum disagreement between the committee members' outputs is observed. The QBC algorithm was introduced as a proxy for bisection, i.e. for cutting the version space into two equal-volume halves. As already mentioned, this constitutes the optimal information gain in a label-agnostic setting [30]. Note, however, that the QBC procedure can achieve volume halving only in the infinite-size committee limit, K ↑ ∞, with uniform version-space sampling and with availability of infinitely many samples. Obviously, running a large number K of ergodic Gibbs sampling procedures quickly becomes computationally unfeasible. Moreover, in pool-based active learning the pool of samples is limited. In order to allow comparison with other strategies at finite size, we approximated uniform sampling with a set of greedy optimization procedures (e.g., stochastic gradient descent) from random initial conditions, checking numerically that this yields a committee of students reasonably spread out in the version space. It is possible to ensure a greater coverage of the version space by performing a short Monte Carlo random walk for each committee member; the effect was found to be small for computationally reasonable lengths of the walk.
We also implemented an alternative uncertainty sampling strategy relying on a single training procedure (e.g., training with the perceptron algorithm or logistic regression) per iteration: in this case, the uncertainty information is extracted from the magnitude of the pre-activations measured at the unlabeled samples after each training cycle. This strategy implements the intuitive geometric idea of looking for the samples that are most orthogonal to the current estimator, which are more likely to halve the version space independently of the value of the true label. A sketch of both scoring rules follows.
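Minimal sketches of the two heuristic scores just described; the committee is assumed to be a (K × N) array of student weight vectors, and since the exact disagreement measure used in the experiments is not specified in the text, the variance-of-votes form below is our assumption:

    import numpy as np

    def qbc_disagreement(F_unlabeled, committee):
        # committee: (K, N) array of students sampled from the version space.
        votes = np.sign(F_unlabeled @ committee.T)  # (n_unlabeled, K), in {-1,+1}
        mean_vote = votes.mean(axis=1)
        return 1.0 - mean_vote ** 2                 # maximal for a 50/50 split

    def preactivation_score(F_unlabeled, x_hat):
        # Small |F.x| = sample nearly orthogonal to the current estimator,
        # hence likely to bisect the version space whatever its true label.
        return -np.abs(F_unlabeled @ x_hat)         # higher = more uncertain

With either score, the k highest-scoring unlabeled samples are queried at each iteration of Algorithm 1.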

FIG. 3. (Left) Performance of the label-agnostic (yellow circles) and label-informed (blue circles) AL-AMP, plotted together with the minimum and maximum values of the Gardner volume extracted from the large deviation computation (purple and green) and the volume-halving curve (dotted black). For comparison we also plot the typical Gardner volume (cyan) and the one obtained by random sampling (orange squares). Numerical experiments were run at system size N = 2 · 10³ and pool size α = 3. For each algorithmic performance curve the average over 10 samples is presented. Fluctuations were found to be negligible around the average and are not shown. (Right) The same plot with the Gardner volume replaced by the Bayesian test accuracy, derived in appendix D. For the AL-AMP algorithm the accuracy is evaluated using a test set of size P_test = 5 · 10⁴. The qualitative picture is very similar to that of the Gardner volume curves (left), once more confirming that Gardner volumes and generalization errors both constitute good measures of informativeness.
In Fig. 3 we compare the minimum Gardner volume obtained from the large deviation calculation with the algorithmic performance obtained on synthetic data at finite size, N = 2 · 10³, by the AL-AMP algorithms detailed in Algorithm 1 and Tab. I. The data-pool size is fixed to α = 3. The large deviation analysis yields the minimum and maximum achievable Gardner volumes at any budget n. We also compare the algorithmic results with the prediction for the typical case and with the volume-halving curve 2^{−n}. Since in the considered pool-based setting the volume-halving performance cannot be achieved for volumes smaller than the Gardner volume v(α) of the entire pool, the relevant volume-halving bound is more precisely max(2^{−n}, v(α)). Random sampling displays good agreement with the expected typical volumes. Most notably, the label-agnostic AL-AMP algorithm tightly follows the volume-halving bound max(2^{−n}, v(α)), thus coming close to the optimal possible performance. Since at large α one has v(α) = const./α [17], we conclude that the AL-AMP algorithm reaches close to the minimum possible Gardner volume with a budget n ∼ O(log α). We thus obtain an exponential reduction in the number of samples even in pool-based active learning, similarly to the original Query by Committee work [4].
The label-informed AL-AMP also approaches the theoretically minimal volume, but not as closely. We remark that an important limitation of the AL-AMP algorithm comes from the fact that AMP is not guaranteed to provide good estimators (or to converge at all) with correlated data. For example, in the numerical experiments for the informed AL-AMP curve we had to resort to mild damping schemes in the message passing in order to reach fixed points. This effect was stronger for the label-informed algorithm than for the label-agnostic one.
In Fig. 4 we provide a numerical comparison of the performance of the agnostic AL-AMP and the other above-mentioned label-agnostic active learning algorithms. The finite-size experiments were run at N = 2 · 10³, and here we set α = 10. Note that, while the different active learning strategies mentioned above were employed for selecting the labeled subset, in all cases supervised learning and the related performance estimates were obtained by running AMP.
In the plot we can see that, while AL-AMP is able to extract very close to the maximum amount of information from each query (one bit per pattern, until the volume v(α) is saturated), other heuristics with the same computational complexity are sub-optimal. In particular, for the simplified query-by-committee procedure we observe that increasing the size K of the committee does not noticeably change its performance, most probably because the committee cannot cover a sufficient portion of the version space if the computational cost is to be kept reasonable. On the other hand, using the information in the magnitude of the pre-activations allows better performance while being more time-efficient, since only a single perceptron, rather than a committee thereof, has to be trained at each step. The logistic loss yields a rather good performance, close to that of AL-AMP, while uncertainty sampling with the perceptron algorithm yields a mixed performance.
We leave a more systematic benchmarking of the many existing strategies for future work, stressing that, while there certainly exist more involved procedures that can yield better performance than the presented heuristics, the absolute performance bounds still apply, agnostic of the implemented active learning strategy.

VI. CONCLUSIONS
Using the replica method for the large deviation calculation of the Gardner volume, we computed for the teacher-student perceptron model the minimum Gardner volume (equivalently, the maximum mutual information) achievable by selecting a subset of fixed cardinality from a pre-existing pool of i.i.d. normal samples. We evaluated the large deviation function under the replica symmetric assumption; checking for replica symmetry breaking and evaluating the eventual corrections to the presented results is left for future work, as is a rigorous establishment of the presented results. Our result for the information-theoretic limit of pool-based active learning in this setting complements the already known volume-halving bound for label-agnostic strategies. We hope our results may serve as a guideline to benchmark future heuristic algorithms on the present model, while our modus operandi in the derivation of the large deviations may help future endeavours in the theoretical analysis of active learning in more realistic settings. We presented the performance of some known heuristics, and we proposed the AL-AMP algorithms to perform uncertainty-based active learning. We showed numerically that on the present model the label-agnostic AL-AMP algorithm performs very close to the optimal bound, thus being able to achieve an accuracy corresponding to the entire pool of samples with exponentially fewer samples.
Inverting the Legendre transform is then straightforward and yields Σ(n, v) = Φ(β, φ) − β ln v − φn, evaluated at the saddle point. The spectrum of values of v such that Σ(n, v) > 0 at a given fixed n corresponds to all achievable Gardner volumes at budget n. In particular, inf_v{v | Σ(n, v) > 0} is the minimal Gardner volume when the nN samples are chosen in an optimal way. Contrariwise, sup_v{v | Σ(n, v) > 0} is the maximal Gardner volume when the nN samples are chosen in the least informative way for the student, so that the student's inference problem is hardest. Finally, note that the selection variables {σ_µ} play in the grand-canonical partition function (A2) the role of an annealed disorder in the terminology of disordered systems [14], and shall sometimes be referred to as such in the following.
Appendix B: Replica computation

Replica trick
The standard way of taking care of the logarithm in equation (A2) is the replica trick ([12], [13], [14]),

    E ln Ξ = lim_{s→0} (1/s) ln E Ξ^s.   (B1)

To compute E_{F,Y,x_0} Ξ^s, one needs to replicate a further β times in order to take care of the power β involved in the summand of equation (A2), with replicated measure ∏_{aα} (dx_{aα} P_X(x_{aα})). In the present problem we thus introduced two replication levels. Each replica is hence characterized by a set of two indices: the first index, a, runs from 1 to s and specifies the disorder replica; the second index, α, running from 1 to β, is related to the replication in β. In total there are therefore s × β replicas. The teacher is set as replica 0. Implicitly, henceforth, the pair index aα, when summed over, runs over [1, s] × [1, β]. We defined h^µ_{aα} ≡ F^µ · x_{aα}, which is Gaussian by virtue of the central limit theorem, and enforced the definition of its covariance matrix Q with integral representations of Dirac deltas. The conjugate matrix is Q̂; matrix elements are denoted with lowercase q. One can then factorize both in the i indices (first parenthesis) and in the µ indices (second parenthesis). The free entropy defined in (A2) then reads as in (B5), with an entropic term involving Σ_{aα,cγ} x_{aα} q̂_{aα,cγ} x_{cγ} (B6) and an energetic term I_Y (B7).

Replica Symmetric (RS) ansatz
The extremization in equation (B5) is hard to carry out. As is now standard in the disordered systems literature, we reduce the number of parameters to be extremized over by enforcing the so-called Replica Symmetric (RS) ansatz ([14]) on both replication levels:

    q_{0,0} = r_0,  q̂_{0,0} = r̂_0,   (B8)
    q_{0,aα} = m,  q̂_{0,aα} = m̂,   (B9)
    q_{aα,aα} = r,  q̂_{aα,aα} = −r̂/2,   (B10)
    q_{aα,aγ} = Q,  q̂_{aα,aγ} = Q̂  (α ≠ γ),   (B11)
    q_{aα,cγ} = q,  q̂_{aα,cγ} = q̂  (a ≠ c),   (B12)

where q < Q. Physically, the ansatz (B8)-(B12) means that two replicas seeing the same realisation of the disorder (i.e., possessing the same first index) have an overlap Q greater than the overlap q between students seeing different realisations (and thus possessing different a-indices). The −1/2 in the definition of r̂ (B10) is introduced merely for later convenience.
Note finally that while the ansatz (B8)-(B12) is replica-symmetric at both replication levels, it gives a set of equations that are formally those of a 1RSB problem ([32]). This is also the reason why taking a 1RSB ansatz ([14]) in the present large deviation calculation would be rather involved, as it would lead to equations of the usual 2RSB form, which are numerically demanding to solve.
We plug the RS ansatz (B8)-(B12) into the three contributions that make up equation (B5). The trace term is given in (B13). We can decompose the exponent in (B6) according to the ansatz (B8)-(B12),

    Σ_{aα≠cγ} x_{aα} q̂_{aα,cγ} x_{cγ} = r̂_0 (x_0)² + m̂ x_0 Σ_{aα≠0} x_{aα} + Q̂ Σ_{a; α≠γ} x_{aα} x_{aγ} + (q̂/2) Σ_{aα≠0, cγ≠0} x_{aα} x_{cγ}.   (B14)

In the next-to-last term the index 0 does not intervene. Introducing Hubbard-Stratonovich fields {λ_a} for the next-to-last term and a Hubbard-Stratonovich field ξ for the last one, I_X takes the form (B15). To carry out the computation of I_Y (equation (B7)) we need to explicitly compute the inverse of the Parisi matrix Q involved in (B7). This is done in the following subsection.

Some linear algebra for hierarchical matrices

Denote Q̃ ≡ Q^{−1} the inverse of the overlap matrix Q. Since Q̃ is clearly of the same form as Q, we can parametrize its coefficients in a fashion identical to those of Q: r̃_0, m̃, r̃, q̃, Q̃.
The coefficients of the inverse satisfy, among others,

    m̃m + r̃q + (β − 1)Q̃q + q̃r + (β − 1)q̃Q + β(s − 2)q̃q = 0.   (B21)

We will also need the determinant of Q. The simplest way to obtain it is to guess the eigenvectors: for vectors of the form (x, 1, 1, ..., 1)^T one obtains two eigenvalues λ± whose product can be computed in closed form. The same equality holds with the tilde quantities on the right-hand side provided the signs are inverted, since ln det Q̃ = −ln det Q. Identifying term by term straightforwardly results in a set of relations between tilde and non-tilde quantities (henceforth referred to as the determinant relations), (B22)-(B31). Now decomposing the exponent in I_Y (B7), introducing Hubbard-Stratonovich fields {ζ_a} and η for the last two sums of (B32), and factorizing in the index a, all terms are computed. The next step is then to divide by s and take the s → 0 limit, as prescribed by the replica trick (B1). First, we need to enforce that all non-vanishing order-0 contributions cancel out, since the free entropy should not diverge. Then, one needs to compute the first-order terms that contribute to Φ (B5).

Order 0

At order 0, the cancellation of the order-0 terms imposes

    0 = r̂_0 r_0 + ln ∫ dx_0 P_X(x_0) e^{r̂_0 (x_0)²},   (B35)

where the first term comes from the trace term (B13). It follows that r̂_0 = 0. Moreover, because of the saddle-point equality

    q_{aα,cγ} = [∫ ∏_{dδ} (dx_{dδ} P_X(x_{dδ})) x_{aα} x_{cγ} e^{Σ_{dδ≠eǫ} x_{dδ} q̂_{dδ,eǫ} x_{eǫ}}] / [∫ ∏_{dδ} (dx_{dδ} P_X(x_{dδ})) e^{Σ_{dδ≠eǫ} x_{dδ} q̂_{dδ,eǫ} x_{eǫ}}],   (B36)

derived straightforwardly from (B5), we also have the analogous cancellation at this order.

Order 1

a. I_X. Carrying out the change of variables ξ → ξ + q̂^{−1/2} m̂ x_0 in equation (B15), I_X assumes a compact form in the s → 0 limit, involving the function g_0(y, η) = ∫ dh_0 P_out(y|h_0) e^{…}. Similarly, the first terms in (B49) follow; we again used the determinant identities (B29)-(B31) in the last line. The tilde quantities in the s = 0 limit can then accordingly be replaced by their expressions (B22)-(B26). Ultimately, some changes of variables can be used to bring g_0(0) to a more compact form.

Replica symmetric free entropy for GLM

Putting everything together, the replica free entropy (B5) takes its final form (B62). In writing it we took into account the fact that y = ±1, which also implies replacing the integral over y in the energetic part by a sum over {±1}; the energetic term in equation (B62) follows accordingly.

Appendix D: Bayesian generalization error

We derive here the expression for the optimal generalization error ǫ_g (in the Bayesian sense) associated to a subset of volume v and budget n, as a function of the perceptron order parameters m and q, see appendices B and C.
The Bayesian ǫ_g was introduced for example in [33] and, unlike the test error yielded by training the perceptron on some loss, is independent of the training procedure; it may thus serve as a nice measure of informativeness for subsets. For the usual perceptron model, it is known that the optimal generalization error is achieved when the student classification is performed by averaging the predicted label over the student measure P_X(x)P_out(Y|F · x) and taking the sign thereof. Note that the average predicted label is precisely the output magnetization m_out discussed in section V of the main text. The transposition to the large deviation setting, which allows one to fix the budget n and the volume v, is straightforward provided one averages over the large deviation measure (A2). By definition, the generalization error is the probability that a new sample F_new ∼ N(0, 1) is misclassified by the student according to the output magnetization E^{β,φ}_x m_out, where E^{β,φ}_x denotes the average with respect to the large deviation posterior measure (A2) with control parameters β and φ. Introducing an integral representation for a Dirac delta and expanding the resulting exponential, one is left with terms of the form E_{x_0,Y} Θ[sgn(x_0 · F_new) v] E_F (E^{β,φ}_x ·)^j; for any fixed j, their computation is formally very similar to the one detailed in appendix B and follows the same lines. First notice that using the replica trick on E^{β,φ}_x sgn(x · F_new) prescribes the introduction of βs replicas, as previously. In going from the second to the last line we introduced overlap variables h as in appendix B. The rest of the large deviation measure in equation (D6) factorizes into e^{NsjΦ(β,φ)}, with Φ the free entropy computed in appendices B and C, and goes to 1 as the s → 0 limit is taken. We also introduced an overlap matrix Q of RS form. The order parameters r_0, r, m and q were defined in the replica ansatz equations (B8)-(B12). Note that the relevant overlap is q, rather than Q, since the j-fold replication in equation (D6) affects also the selection variables σ, so that the variables x_{l11} see different disorders, see appendix B. The inverse Q̃ = Q^{−1} is characterized by the coefficients

    r̃_0 = (r + (j − 1)q) / (r_0(r + (j − 1)q) − jm²),   (D13)
    r̃ = (r_0(r + (j − 2)q) − (j − 1)m²) / ((r − q)(r_0(r + (j − 1)q) − jm²)),   (D14)
    m̃ = −m / (r_0(r + (j − 1)q) − jm²),   (D15)
    q̃ = (m² − r_0 q) / ((r − q)(r_0(r + (j − 1)q) − jm²)).   (D16)

The last integral in equation (D8) can be taken care of in the usual manner, by decomposing the exponent and introducing a Hubbard-Stratonovich field η, see for example appendix B. The expression can then similarly be factorized in the l indices; the result is

    ⟨∏_l sgn(h_l)⟩ = ∫ Dη dh_0 / √((2π)^{j+1} det Q) e^{−(1/2) r̃_0 (h_0)²} e^{(j/2)(√(−q̃/(r̃−q̃)) η − (m̃/√(r̃−q̃)) h_0)²}.

This terminates the computation of ǫ_g from equation (D8). We used the fact that x → 1 − 2H(x) is odd, and the fact that r_0 = 1 for the perceptron model with Gaussian prior, see appendix C.

FIG. 1. (Left) Complexity-volume curves Σ(n, v) for various budgets n, at pool size α = 3, extracted from the large deviation computations. These curves reach their maxima at a point whose coordinates correspond to the Gardner volume of randomly chosen nN samples and to the log-number of choices of nN elements among αN ones. (Right) The magnetization order parameter m (in other words the teacher-student overlap) as a function of the Gardner volume v for a pool of cardinality α = 3, as extracted from the large deviation computations. As is physically intuitive, smaller Gardner volumes imply larger values of the magnetization.
FIG. 2. Typical Gardner volume (purple, decreasing linearly at large α) and the information-theoretically smallest achievable one (orange, yellow and blue), extracted from the large deviation computation for α ∈ {3, 10, 100}. The horizontal lines depict the value of the Gardner volume corresponding to the whole pool; note the fast saturation of the lowest volumes at these lines. The information-theoretic volume-halving limit 2^{−n} for label-agnostic active learning procedures is plotted as a dotted line. The qualitative picture is essentially unchanged as α is varied.

FIG. 4. (Left) Performance of the label-agnostic algorithms presented in Tab. I, plotted against the budget n and compared to the volume-halving lower bound. Experiments were performed at system size N = 2 · 10³ and pool size α = 10. For each algorithm the average over 10 samples is presented. Note that error bars are smaller than the marker size. (Right) (Bayesian) test accuracy of the same heuristics for various budgets n. The test set size was chosen to be P_test = 5 · 10⁴. In blue, the Bayesian test accuracy for a typical subset, see appendix D. Again, the qualitative picture is unchanged going from the Gardner volume to the test accuracy.

FIG. 5. Complexity Σ(n, v) as a function of the Gardner volume v for budgets n ≤ 2 at pool size α = 10, extracted from the large deviation computations; see also Fig. 1 for the same curves at a different pool size. For any budget n the maximum complexity corresponds to the volume reached by random subset selection (passive learning), see the discussion in section IV of the main text. Note that the qualitative shape of the complexity curves remains essentially unchanged as the pool size α is varied (see Fig. 1 for α = 3).

FIG. 6. Complexity vs volume curves for α = 3 and n ∈ {0.3, 0.6, 0.9, 2.7}. The dots are values extracted from numerical experiments performed at N = 20 by repeatedly sampling passively, 10⁷ times, a subset of cardinality nN out of a fixed pool of size α = 3. Solid lines are the theoretical complexities predicted by the large deviation computations, see also Fig. 1. Volumes were evaluated using the AMP Algorithm 2. The agreement is rather good, given the discrepancies to be expected when running AMP (Algorithm 2) at small, finite N.

FIG. 7. Magnetization m against Gardner volume v for various subsets. The experiments were performed at system size N = 10³, pool size α = 10 and budget n = 1.5. Subsets covering a wide range of volumes were designed by varying the ratio of informative samples (selected using label-informed AL-AMP, see section V) to uninformative samples (selected using simple passive learning). Magnetizations and volumes were evaluated using the AMP procedure of Algorithm 2. The solid line is the typical m(v) curve predicted by the large deviation computations, which agrees quite well with the numerical simulations.
Select heuristic strategy from Table I
Define batch size k
Initialize S ⊂ 𝒮 = {F_µ}_{1≤µ≤αN}, with |S| > 0
while |S| < nN do
    Obtain the required estimates given S
    Obtain model predictions at the data points in S^c
    Sort the predictions according to the sorting criterion
    Add the first k elements of the sorting permutation to S
end while
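A runnable Python rendering of Algorithm 1; fit and score are placeholders for the strategies of Tab. I (e.g. AMP training plus output-magnetization scoring from section V):

    import numpy as np

    def uncertainty_sampling(F, y_oracle, budget, k, fit, score, seed=0):
        # fit(F_lab, y_lab) -> model state; score(state, F_unlab) -> one
        # uncertainty value per sample (higher = more uncertain).
        rng = np.random.default_rng(seed)
        P = F.shape[0]
        labeled = list(rng.choice(P, size=k, replace=False))  # small seed set
        while len(labeled) < budget:
            unlabeled = np.setdiff1d(np.arange(P), labeled)
            state = fit(F[labeled], y_oracle[labeled])        # retrain on S
            scores = score(state, F[unlabeled])               # predict on S^c
            picks = unlabeled[np.argsort(-scores)[:k]]        # top-k uncertain
            labeled.extend(picks.tolist())                    # query labels
        return labeled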

TABLE I. Summary of the specifics of the uncertainty sampling strategies considered in this paper.