A theory of capacity and sparse neural encoding

Motivated by biological considerations, we study sparse neural maps from an input layer to a target layer with sparse activity, and specifically the problem of storing $K$ input-target associations $(x,y)$, or memories, when the target vectors $y$ are sparse. We mathematically prove that $K$ undergoes a phase transition and that in general, and somewhat paradoxically, sparsity in the target layer increases the storage capacity of the map. The target vectors can be chosen arbitrarily, including in random fashion, and the memories can be both encoded and decoded by networks trained using local learning rules, including the simple Hebb rule. These results are robust under a variety of statistical assumptions on the data. The proofs rely on elegant properties of random polytopes and sub-gaussian random vectors. Open problems and connections to capacity theories and polynomial threshold maps are discussed.


Introduction
Sparse representations of information are often observed in biological and artificial neural systems, and in other statistical systems as well. Arguments in support of sparsity range from low energy consumption in physical systems to interpretability in artificial models. Furthermore, sparsity can be an emergent property, or it can be artificially designed, typically by including penalty functions that favor sparsity. Here we study sparse encoding of information in neural maps and analyze their properties and possible computational advantages, particularly from a storage viewpoint.
1.1. Biological Sparsity. Many examples of sparse representations in neurobiology are found, for instance, in the early processing stages of sensory systems, across both sensory modalities and biological organisms. Together with a change in the activity pattern, from a dense input representation to a sparse target representation in response to a stimulus, one often also observes a significant expansion in the number of neurons in the target layer.
For example, in the mouse visual system there are about 20,000 projecting neurons in the dorsal Lateral Geniculate Nucleus (dLGN) [30], whereas there are 120,000-215,000 neurons in mouse primary visual cortex area V1, where sparse activity is observed ([59] and references therein). In the cat visual cortex, a 25-fold expansion is observed between the number of axons leaving V1 and the number of axons entering V1 from the LGN. However, only 5-10% of V1 neurons respond to any natural scene stimulus [51]. The following additional examples are extracted from [5]. In the olfactory system of the fly, the antenna lobe, comprising 50 glomeruli, projects to the mushroom body, containing about 2,500 Kenyon cells. When an odorant stimulus is presented, 59% of the projection neurons and only 6% of the Kenyon cells are activated [63]. Likewise, in rodents, the olfactory bulb projects to the piriform cortex [48], which hosts millions of pyramidal neurons, roughly three orders of magnitude more than the number of glomeruli in the bulb. While the response of the neurons in the olfactory bulb to odorant stimuli is quite dense [66], only about 10% of the neurons in the piriform cortex show an evoked response to each odorant [60, 55]. Similar ratios are observed in the somatosensory system [14], the auditory system [23], and even the electrosensory system of electric fish [19].
The fact that the same basic strategy seems to have emerged in evolution across a variety of organisms and sensory systems requires an explanation and suggests that this strategy may have specific advantages. There have been attempts, for instance, to explain the emergence of sparse representations in V1 as reflecting the sparse, largely statistically independent components of natural images [50, 12]. However, these arguments do not necessarily apply to other sensory systems, or explain why a sparse basis is chosen over a dense basis that could be more compact or combinatorially richer, or justify the expansion aspect of the strategy.
1.2. Computational Sparsity. On the computational side, sparsity has been studied in several different settings. Regularization terms, or prior distributions, associated with the L1 norm tend to produce sparsely parameterized models where a subset of the parameters are equal to zero, which can increase interpretability in some situations. The L1 approach goes back at least to work done in geology in the 1980s [58] and has been further developed and publicized under the name of LASSO (least absolute shrinkage and selection operator) [61] (see also [62]). Many other sparsity-related priors have been developed in recent years. An example of a continuous "shrinkage" prior centered at zero is the horseshoe prior [17, 18]. However, technically these continuous priors do not have a point mass at zero. Thus another alternative direction is to use discrete mixtures [47, 33], where the prior on each weight w_i consists of a mixture of a point mass at w_i = 0 with an absolutely continuous distribution. A similar approach, applied to pixel intensities rather than weights, has been developed recently to construct effective generative models of very sparse images [45]. Finally, there is a significant literature in compressed sensing research, where efficient sparse coding algorithms have been developed for recovering sparse signals that underwent linear compression [25, 16, 57, 29, 32, 31, 52, 2, 53, 54].
Our main goal in this work is to better understand the computational role of sparsity in neuronal maps. Our work is closest in spirit to [5], but with a number of significant differences. First, although we discuss expansion issues, our primary focus here is on sparsity, not on expansion. Second, our goal is to understand the possible computational advantages of sparsity. Third, our approach is mathematical and aimed at deriving precise theorems, as opposed to approximate results derived using physics approximations or computer simulations.

Basic Framework and Notation
2.1. Neural Maps and Threshold Functions. We wish to understand neural mappings F from a layer of size n to a layer of size m. For simplicity, we call the layer of size n the input layer, the layer of size m the target layer, and the resulting architecture an A(n, m) architecture. The mapping is to be implemented by m linear threshold functions, one of the simplest neuronal models, although we will briefly consider other computational units, such as polynomial threshold functions of low degree [6]. We let T(n, m) denote the set of all such linear threshold maps, and T_d(n, m) denote the set of all such threshold maps of degree d. As a result, the activities in the target layer are always binary with value 0 or 1. When the activities in the layer of size n are also binary with values in {0, 1} or {−1, +1}, the units in the layer of size m implement Boolean functions and we call them linear threshold gates, or polynomial threshold gates in the polynomial case. We let H^n = {0, 1}^n denote the n-dimensional hypercube with individual coordinates in {0, 1}. It is sometimes more convenient to consider input vectors in K^n = {−1, 1}^n, the n-dimensional hypercube with individual coordinates in {−1, 1}. A simple affine transformation transforms one type of hypercube into the other, and such transformations can be absorbed into the weights of the threshold functions, so any result obtained with a threshold map applied to input vectors in H^n can be transformed into an equivalent result with input vectors in K^n, and vice versa.

2.2. Input and Target Models. In general, we imagine that the input layer is presented with dense input vectors x of length n, and we want to explore their mapping into sparse representations y of length m in the target layer. To generate dense input vectors x, one can consider different models, in both the continuous and binary cases, including the following ones:
(1) Gaussian Model [N(0, 1)]^n, in which the components of x are independent and identically distributed with standard normal distribution.
(2) Uniform Model U[S(n − 1)], in which x is sampled uniformly over the unit sphere in n-dimensional Euclidean space.
(3) Bernoulli Model [B(1/2)]^n, in which the components of x are independent and identically distributed with symmetric Bernoulli (coin flip) distribution with parameter p = 0.5.
(4) Uniform Model U(1/2, n), which corresponds to a uniform distribution over all vectors of length n containing n/2 ones and n/2 zeros. The fact that n may be odd is not important for our considerations (in this case use the floor and ceiling operators).
Some of the same notation and models can be used also to generate sparse vectors, so that we let:
(1) Sparse Bernoulli Model [B(p)]^n, in which the components of x are independent and identically distributed with probability p of being one (and zero otherwise), with p small.
(2) Sparse Uniform Model U(p, n), in which x is sampled uniformly over the binary vectors of H^n having a fraction p of their entries equal to one, and the rest equal to zero, with p small. There are of course \binom{n}{np} such vectors, with the same remark as above regarding the use of the floor and ceiling operators when np is not an integer.
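To make these models concrete, here is a minimal numpy sketch that draws samples from each of them (a sketch of our own; the function names and sizes are illustrative and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian(n):                        # [N(0,1)]^n: i.i.d. standard normal coordinates
    return rng.standard_normal(n)

def uniform_sphere(n):                  # U[S(n-1)]: uniform on the unit sphere in R^n
    g = rng.standard_normal(n)
    return g / np.linalg.norm(g)

def bernoulli(n, p=0.5):                # [B(p)]^n: i.i.d. Bernoulli(p) coordinates
    return (rng.random(n) < p).astype(int)

def uniform_fixed_weight(n, p=0.5):     # U(p, n): exactly round(p*n) ones, random positions
    x = np.zeros(n, dtype=int)
    ones = rng.choice(n, size=int(round(p * n)), replace=False)
    x[ones] = 1
    return x

# A dense input x and a sparse target y, as in the (x, y) memories discussed below.
x = gaussian(1000)
y = bernoulli(2000, p=0.05)
```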
Although these sparse models can also be applied to the input layer, they are meant to be applied primarily to the target layer, replacing n with m and x with y. While for certain mathematical considerations one model may be easier to use than the other, it is well known that for many probabilistic considerations, especially in terms of asymptotic results, the corresponding Bernoulli and Uniform models are very similar, and that [B(p)]^n is a slightly "smeared" version of U(p, n). In particular, all the vectors with pn components equal to one have the same probability in [B(p)]^n, but this probability is slightly lower compared to the corresponding uniform model due to the smearing. Most importantly, we will also consider models, other than the uniform models, where the components of x or y are not independent of each other, or where x and y are not independent of each other.
Whatever the model, in the end we assume that we have a set of memories, or training set, consisting of K pairs (x, y), and one of our main goals is to find the maximal number K of memories that can be stored in the neural map.
A Boolean vector of size n is called p-sparse if it contains pn ones and n(1 − p) zeros. Likewise, we call a Boolean function of n variables p-sparse if its vector of assignments, or targets (corresponding to the last column of its truth table), is p-sparse, i.e. the function takes the value 1 for p2^n entries, and 0 otherwise. In general, we will use p and q to denote unrelated probabilities (and thus it is not the case that q = 1 − p). Finally, in order to avoid the use of double indexes, we use x_1, x_2, ..., x_K to denote the set of input training examples, and x_1, x_2, ..., x_n to denote the components of an input vector x. Whenever this notation is used, its meaning should be obvious from the context.

2.3. Storage and Memories. Now let us assume that we have K (dense) real-valued or binary vectors x in the input layer, and that we want to map them into K (sparse) binary vectors, or representations, y = F(x) in the target layer. The K associations (x, y) are called memories and, for concreteness, the reader may think of x as the activity triggered by an odor in a primary sensory layer, and of y as its sparse representation in a subsequent layer. In this work, we are concerned primarily with maximizing K, i.e. the number of memories that are stored in the mapping, and with the effects that the size m of the target layer and the sparsity of the vectors y have on the mapping. There are two additional properties of the mapping F that are important: continuity and un-ambiguity. By continuity, we mean that if x is one of the input memories and x' is close to x, then in general one should expect F(x') = F(x), i.e. the odors of two slightly different bananas should be mapped to identical (or very similar) binary representations. Using linear threshold functions automatically enforces continuity almost everywhere. By un-ambiguity, we mean that the target vectors y should be far apart from each other to avoid any possibility of confusion (the binary representation of the banana odor should not be confused with the binary representation of the odor of any other fruit). This can be formalized, for instance, by maximizing the average Hamming distance between the vectors y = F(x). In short, we want a map F that has maximal memory storage and that is also continuous and un-ambiguous. In the rest of the paper we will prove that maximizing memory storage leads to sparsity in the target layer, and suggest that large target layers support un-ambiguity.
2.4. Paradox. It must be noted from the outset that the maximization of memory storage by sparse neural maps has a paradoxical flavor. For simplicity, let us assume that we want to encode the K input vectors into K p-sparse vectors in the target layer. The total number of such possible vectors is given by \binom{m}{pm}, and this number is maximal when p = 0.5. In other words, there are far more possibilities for choosing the target vectors y when the target vectors are constrained to be dense. Likewise, the total number of p-sparse Boolean functions of n Boolean variables is given by \binom{2^n}{p2^n}, which is also maximal when p = 0.5, again giving the impression that dense representations offer more choices and are easier to realize.
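A quick numerical check of this counting argument (a sketch with illustrative sizes of our own choosing):

```python
from math import comb, log2

m = 100                      # size of the target layer
for p in (0.05, 0.1, 0.25, 0.5):
    k = round(p * m)
    print(f"p = {p:0.2f}:  log2 C(m, pm) = {log2(comb(m, k)):6.1f} bits")
# C(m, pm) is maximal at p = 0.5: dense targets offer far more candidate codewords
# than sparse ones, which is what makes the storage advantage of sparsity proved
# below look paradoxical at first sight.
```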

2.5. Resolution. The resolution of this paradox must come from the constraints we placed on the function F. In particular, consider a single linear threshold function or gate, with K random input vectors of size n. Assume that the targets are assigned randomly with sparsity p. Equivalently, assume that the K points are colored randomly in black and white, where p is the probability of assigning a white color. When are the black and white points linearly separable? If p = 0.5, we know [8] that the maximal number of random memories that can be stored satisfies K ≈ n (related results are known also for polynomial threshold functions [9]). On the other hand, in the binary case, if only one target is equal to 1 and all the other targets are 0, it is easy to see that any K memory associations can be realized, i.e. it is always possible to separate one corner of the hypercube from all the other corners using a hyperplane. Thus, in a sense, this extreme case of sparsity leads to greater storage, i.e. greater values of K. In short, it is intuitively clear that the smaller the fraction of white points, the greater the chance that they can be linearly separated from the black points. Thus what is needed is a quantitative understanding of this phenomenon. As we are going to describe, the solution of this problem is closely related to the theory of random polytopes and is characterized by a phase transition.

Phase Transition
We now provide a formal definition for the neural maps of interest and the underlying question we wish to address.

Definition 3.1 (Threshold maps). A map F : R^n → {0, 1}^m is called a linear threshold map if all m components of F are linear threshold functions. Equivalently, F is a threshold map if it can be expressed as F(x) := h(Wx − b) for some m × n matrix W and some vector b ∈ R^m, where h is the Heaviside function applied component-wise. The Heaviside function has value 0 for negative arguments, and 1 for positive arguments.
Note that the bias can also be included in the weights W by assuming there is one additional input unit always clamped to one. Likewise, we can define polynomial threshold maps of degree d if all m components of F are polynomial threshold functions of degree d. We let T_d(n, m) denote the set of all such threshold maps. In this case, each component i has the form f_i(x) = h(P_d(x)), where P_d is a polynomial of degree d in the variables x_1, . . ., x_n and h is the Heaviside function.

Question 3.2. Let x ∈ R^n and y ∈ {0, 1}^m be random vectors, possibly dependent. Consider a sample of K independent data points (x_k, y_k) drawn from the distribution of (x, y). Does there exist a threshold map F ∈ T(n, m) such that F(x_k) = y_k for all k = 1, . . ., K?

If we require F : R^n → R^m to be a linear map (and the distribution of x is non-degenerate, e.g. absolutely continuous), then the answer to Question 3.2 is Yes if and only if K ≤ n. Remarkably, for the larger class of linear threshold maps, one can fit samples of size much larger than n.

Theorem 3.3 (Phase transition). Assume that x is a standard normal random vector in R^n and y is an independent vector in {0, 1}^m whose coordinates are i.i.d. Bernoulli with parameter q ∈ (0, 1). Fix ε ∈ (0, 1) and let n → ∞, allowing m, K and q to depend on n, with K ≥ n and Kq ≫ 1. Here, and everywhere else, the notation a ≪ b (equivalently b ≫ a) means that a/b → 0 as n → ∞.
1. If 2Kq log(K/n)(1 + ε) < n and Kq ≫ log m, then with probability 1 − o(1) there exists a map F ∈ T(n, m) such that F(x_k) = y_k for all k = 1, . . ., K.
2. If 2Kq log(K/n)(1 − ε) > n, then with probability 1 − o(1) no such map exists.

It is important to note how little this result depends on m. If we consider a single linear threshold neuron (m = 1), corresponding to an A(n, 1) network, we have:

Corollary 3.4 (Phase transition). Assume that x is a standard normal random vector in R^n and y is an independent vector in {0, 1} whose coordinates are i.i.d. Bernoulli with parameter q ∈ (0, 1). Fix ε ∈ (0, 1) and let n → ∞, allowing K and q to depend on n. Assume that K ≥ n and Kq ≫ 1.
1. If 2Kq log(K/n)(1 + ε) < n, then the sample of K points is linearly separable with probability 1 − o(1).
2. If 2Kq log(K/n)(1 − ε) > n, then the sample of K points is not linearly separable with probability 1 − o(1).
To better understand this result, let us first notice that for very small ε we have:
(1) If K = Cn for some constant C > 0, then the phase transition occurs for q = 1/(2C log C). For instance, if C = 10 then the phase transition occurs for q = 1/(20 log 10).
(2) If K = n^{1+α} for α > 0, then the phase transition occurs for q = 1/(2αn^α log n). For instance, if K = n², then α = 1 and the transition occurs for q = 1/(2n log n).

Theorem 3.3 can be deduced from two results on the geometry of Gaussian polytopes. Consider N independent random vectors x_1, . . ., x_N taking values in R^n. Their convex hull is a random polytope in R^n. If the x_k are drawn from the standard Gaussian distribution, the random polytope P := conv(x_1, . . ., x_N) is called a Gaussian polytope. Random polytopes, including random regular polytopes [1, 64, 13, 26], random Gaussian polytopes [39, 10, 24, 36], and more general random polytopes [46, 44, 43, 38, 41, 42], have been extensively studied in the area of asymptotic convex geometry. One remarkable property is that random polytopes in high dimensions are neighborly: the points x_k are likely to form vertices of P (instead of falling into the interior of P), the intervals that join pairs of points x_k are likely to form edges of P, the triangles that are formed by triples of points x_k are likely to form two-dimensional faces of P, and this continues up to faces of a certain dimension s. D. Donoho and J. Tanner were the first to determine the asymptotically sharp threshold for the critical dimension s [24]:

Theorem 3.5 (Typical faces of a Gaussian polytope). Let x_1, . . ., x_N be independent standard Gaussian random vectors in R^n. Fix ε ∈ (0, 1) and let n → ∞, allowing N and s to depend on n.
1. If 2s log(N/n)(1 + ε) < n, then with probability 1 − o(1) the convex hull of x_1, . . ., x_s is a face of the polytope conv(x_1, . . ., x_N).
2. If 2s log(N/n)(1 − ε) > n, then with probability 1 − o(1) the convex hull of x_1, . . ., x_s is not a face of conv(x_1, . . ., x_N).
Motivated by the basic problem of compressed sensing, this theorem sparked many later developments, some of which are summarized e.g. in [27, 3, 11, 40]. In particular, the probability in both parts of Theorem 3.5 can be improved to
1 − exp(−c_ε s),   (3.1)
for some constant c_ε > 0 depending only on ε; see [3, Theorem II].
Proof of Part 1 of Theorem 3.3. Let us first assume that m = 1. Call the points x_k with labels y_k = 0 "black points" and the others "white points". Let s denote the number of white points. The assumption Kq ≫ 1 implies that s = Kq(1 + o(1)) with probability 1 − o(1). Let us condition on the labels (y_k), with the number of white points s satisfying the condition above. Our assumption implies that
2s log(K/n)(1 + ε/2) < n
if n is large. Then, applying part 1 of Theorem 3.5 with ε/2 instead of ε, we see that the convex hull of the white points is a face of the polytope conv(x_1, . . ., x_N) with probability 1 − o(1) as n → ∞. This means that the sets of black and white points are linearly separable, i.e. they can be separated by an affine hyperplane. Equivalently, there exists a threshold function F ∈ T(n, 1) that realizes the data.
The general case where m ≥ 1 follows by taking a union bound over the m events, one for each coordinate of y. Due to (3.1), the probability of failure is bounded by m·exp(−c_ε Kq), which is o(1) under our assumptions.

The second part of Theorem 3.3, unfortunately, does not follow from Theorem 3.5 by a similar argument. While it is true that a set of points x_1, . . ., x_s that spans a face of the polytope P = conv(x_1, . . ., x_N) must be linearly separated from the other points x_{s+1}, . . ., x_N, the converse may not be true. As Figure 1 shows, points x_1, . . ., x_s might still be linearly separated from x_{s+1}, . . ., x_N even if they do not form a face of P. A different property of Gaussian polytopes can be used to deduce the second part of Theorem 3.3: the existence of a round core of P. The following result shows that P contains the centered Euclidean ball of radius r ≈ √(2 log(N/n)).
Theorem 3.6 (Round core of a Gaussian polytope). For every ε ∈ (0, 1) there exists C(ε) > 0 such that the following holds. Assume that N ≥ C(ε)n and let x_1, . . ., x_N be independent standard Gaussian random vectors in R^n. Then
(1 − ε)√(2 log(N/n)) · B(n) ⊂ conv(x_1, . . ., x_N)
with probability at least 1 − e^{−n}. Here B(n) denotes the unit Euclidean ball in R^n centered at the origin.
A weaker version of this result, with an absolute constant factor instead of the constant 2, goes back to Gluskin [35], where the result is stated in the dual form. Gluskin's result inspired many further developments in the area of asymptotic convex geometry. Its ramifications can be found in particular in [34, 44, 22] and [4, Section 7.5]. None of the published versions of Gluskin's theorem, to our knowledge, exhibit the exact absolute constant 2 that is critical for our purposes. We give a proof of Theorem 3.6 in Appendix A, which essentially combines the argument in [34] with an asymptotically sharp tail bound of the normal distribution.

Now we can deduce Part 2 of Theorem 3.3, setting m = 1 for simplicity. There are s ≈ Kq white points (those with labels y_k = 1), and they are independent Gaussian random vectors, so their arithmetic mean x_0 has Euclidean norm r_0 ≈ √(n/Kq). By the assumption, this quantity is smaller than r ≈ √(2 log(N/n)), which is the radius of the round core of the convex hull of the N − s black points. So x_0 falls inside this round core and, as such, it is not linearly separable from the black points, see Figure 2. Hence the black and white points are not linearly separable. Here is a more formal proof.

Proof of Part 2 of Theorem 3.3. Without loss of generality, we can assume that m = 1. Condition on all labels y_k so that the number of white points s (those with labels y_k = 1) satisfies s = Kq(1 + o(1)), just like we did in the proof of the first part of the theorem. Without loss of generality, q ≤ 1/2. The number of black points N := K − s then satisfies N ≥ K/3 for large n. Thus, applying Theorem 3.6 to the black points, we have for large n, with probability at least 1 − e^{−n},
rB(n) ⊂ conv({x_k : y_k = 0}),   where r ≈ √(2 log(N/n)).   (3.3)
On the other hand, the arithmetic mean of the white points, x_0 := (1/s) Σ_{k: y_k = 1} x_k, is a rescaled normal random vector, namely x_0 = g/√s where g is a standard normal random vector in R^n. Due to a standard concentration inequality for the norm, ‖g‖_2 ≈ √n with probability 1 − o(1), which yields ‖x_0‖_2 ≈ √(n/s) ≈ √(n/Kq). Comparing this to (3.3) and using the assumption 2Kq log(K/n)(1 − ε) > n, we see that for large n, ‖x_0‖_2 < r with probability 1 − o(1). This means that x_0 lies in the ball rB(n), which in turn lies in the convex hull of the black points.
Summarizing, we showed that with high probability the arithmetic mean of the white points x_0 lies in the convex hull of the black points. Therefore, the sets of black and white points cannot be separated by any hyperplane. Equivalently, there does not exist any threshold function F ∈ T(n, 1) that realizes the data. The proof is complete.
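The phase transition of Theorem 3.3 can also be observed numerically. The following sketch is our own illustration (not part of the paper's argument): it draws K Gaussian points in R^n, assigns sparse random labels with parameter q, and uses scipy's LP solver as a linear-separability oracle; sweeping q around the predicted value 1/(2C log C) for K = Cn reproduces the transition empirically.

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Check if points X[i] with labels y[i] in {0,1} are linearly separable, by
    testing feasibility of  w.x - b >= 1 (white points) and w.x - b <= -1 (black)."""
    n = X.shape[1]
    signs = np.where(y == 1, -1.0, 1.0)            # flip the inequality for white points
    A_ub = signs[:, None] * np.hstack([X, -np.ones((len(y), 1))])
    res = linprog(np.zeros(n + 1), A_ub=A_ub, b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.status == 0                          # feasible => separable

rng = np.random.default_rng(0)
n, C = 100, 10                                      # K = Cn Gaussian points
K = C * n
q_crit = 1 / (2 * C * np.log(C))                    # predicted transition (~0.022 here)
for q in (0.5 * q_crit, q_crit, 2.0 * q_crit):
    X = rng.standard_normal((K, n))
    y = (rng.random(K) < q).astype(int)
    print(f"q = {q:.4f}   separable: {separable(X, y)}")
```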
3.1. Realizing all label assignments simultaneously. The model of data considered in Theorem 3.3, in which we assumed that the labels y_k are independent of the data points x_k, is not very realistic. Fortunately, this result can be strengthened to allow for any dependence of the labels y_k on x_k. The only requirement is the sparsity of the label assignment. We say that the label assignment is s-sparse if, for each coordinate i ∈ {1, . . ., m}, at most s of the labels y_1(i), . . ., y_K(i) are equal to 1.
Theorem 3.7 (All label assignments simultaneously). Assume that x_1, . . ., x_K are drawn independently from the standard normal distribution in R^n. Fix ε ∈ (0, 1) and let n → ∞, allowing m, K and s to depend on n. If
Cs log(K/n) ≤ n,
where C is a suitably large absolute constant, then the following holds with probability 1 − o(1). For any s-sparse label assignment y_1, . . ., y_K ∈ {0, 1}^m, there exists a function F ∈ T(n, m) such that F(x_k) = y_k for all k = 1, . . ., K.

Up to absolute constant factors, this result is stronger than the first part of Theorem 3.3. Indeed, if Kq ≫ log m, the label assignment is s-sparse with s = Kq(1 + o(1)) with probability 1 − o(1).
Theorem 3.7 follows, in a way similar to the previous theorems in this section, from a stronger form of Donoho-Tanner's Theorem 3.5, which was also proved in [24].

Theorem 3.8 (All faces of a Gaussian polytope). Let x_1, . . ., x_N be independent standard Gaussian random vectors in R^n. Fix ε ∈ (0, 1) and let n → ∞, allowing N and s to depend on n. Then, if s is small enough (below a threshold of the same order as in Theorem 3.5), with probability 1 − o(1) every subset of s points among x_1, . . ., x_N spans a face of the polytope conv(x_1, . . ., x_N) simultaneously.
Theorem 3.3 establishes the existence of a phase transition for the number K of associations that can be stored in a linear threshold map, under the assumptions that x is a standard normal vector and y is independent of x. However, this leaves open two important questions. First, it would be useful to be able to prove a similar result for other realistic distributions of x and y. It would be of particular interest to obtain results for the case where the components of x are binary, or when they are not independent; and similarly for y, for instance when y is not independent of x. Second, one would like to know whether the memories can be stored using learning rules that are plausible for a physical neural system [6]. These questions will be addressed using two key concepts: (1) sub-gaussian distributions; and (2) local learning rules, in particular Hebbian learning rules. We begin by providing some background on learning rules.

Learning algorithms
Before we use sub-gaussian distributions to extend the previous theorems, it is useful to look at the algorithms by which the memories could be learnt. First, it should be clear that in general the m units of an A(n, m) architecture learn independently of each other, and thus it is enough to study learning in a single unit. Second, if the set of data pairs (x, y) is linearly separable, the SVM learning approach of finding a hyperplane with maximum margin can be solved using linear or quadratic programming methods [20, 15, 21]. However, such methods are not necessarily plausible for a physical neural system, as they do not necessarily result in a learning algorithm for the synaptic weights that is local [7], i.e. one that depends only on variables available locally at the synapse. In practice, for the models considered here, it means that we are interested in learning rules of the form
∆w_i = f(x_i, y, o).
Here x_i is the i-th component of the input vector x, y is the target value, and o is the actual output value of the neuron. The rules in this section are written for a single training example, corresponding to on-line learning, but they can be averaged across multiple examples in batch learning. There are three main, different but highly related, local learning rules that can be considered: gradient descent, the perceptron rule, and the simple Hebb rule.
4.1. Gradient Descent Learning Rule. For gradient descent, we modify the Heaviside threshold function to a sigmoidal logistic function. It is well known (e.g. [6]) that, using the relative entropy (or Kullback-Leibler divergence) between the target y and the output o produced by the logistic function, the gradient descent rule has the form
∆w_i = η(y − o)x_i,
where η is the learning rate. The error function is convex and therefore gradient descent, or stochastic gradient descent, with a suitable learning rate converges to a set of weights which minimize the error function.
4.2. Perceptron Learning Rule. The perceptron learning rule [56] is usually written as
∆w_i = η(y − o)x_i,
using a linear threshold function with outputs and targets in {−1, +1}. It is applied to all the weights, including the bias. The perceptron learning algorithm initializes the weight vector to zero, w(0) = 0, and then at each step it selects an element of the training set that is mis-classified and applies the learning rule above. Note that because the weights are initialized to zero, the learning rate simply rescales all the weights, including the bias, and thus it can be chosen to be 1. Notice that the rule above has the same form as the gradient descent rule of the previous section, which shows its connection to gradient descent, and that it reduces to ∆w_i = 2ηyx_i on the examples that are misclassified (and to zero on correctly classified ones), which shows its connection to the simple Hebb rule described below.
The perceptron learning theorem [49] states that if the data is separable, then the perceptron algorithm will converge to a separating hyperplane in finite time. One may suspect that this is the case because the rule amounts to applying stochastic gradient descent to a unit with a sigmoidal (logistic or tanh) transfer function, which is similar to a perceptron. In addition, the rule above clearly improves the performance on an example x that is misclassified. For instance, if the target of x is y = +1 and x is mis-classified and selected at step t, then we must have w(t) · x < 0 and w(t + 1) = w(t) + x. As a result, the performance of the perceptron on example x is improved, since w(t + 1) · x = w(t) · x + ||x||², and similarly for mis-classified examples that have a negative target. To prove convergence more precisely, it is enough to take a unit vector w* that separates the data and show that the cosine of the angle between w(t) and w* increases faster than C√t; since the cosine can never exceed one, the number of weight updates must be finite.
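For completeness, here is a sketch of that standard argument (the margin γ and radius R below are our own notation, not symbols used elsewhere in this paper):

```latex
% Sketch of the perceptron convergence bound (standard argument).
% Assume a unit vector w^* and a margin \gamma > 0 with y_k (w^* \cdot x_k) \ge \gamma
% for all k, and let R := \max_k \|x_k\|. Start from w(0) = 0 and update only on
% misclassified examples, w(t+1) = w(t) + y x. Then after t updates:
\[
  w(t) \cdot w^* \;\ge\; \gamma t ,
  \qquad
  \|w(t)\|^2 \;\le\; \|w(t-1)\|^2 + 2\, y \,\big(w(t-1) \cdot x\big) + \|x\|^2 \;\le\; t R^2 ,
\]
% since y (w(t-1) \cdot x) \le 0 for a misclassified example. Therefore
\[
  \cos \angle\big(w(t), w^*\big) \;\ge\; \frac{\gamma t}{R \sqrt{t}} \;=\; \frac{\gamma}{R}\,\sqrt{t} ,
\]
% and since a cosine is at most 1, the number of updates satisfies t \le R^2 / \gamma^2.
```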

4.3. Simple Hebb Learning Rule. The simple Hebb rule can be written as ∆w_i = yx_i, with a learning rate of one. For the threshold maps F considered here (Definition 3.1), in vector form this translates to
W := Σ_{k=1}^K y_k x_k^T.   (4.1)
The simple Hebb rule is the rule used, for instance, to store memories in Hopfield networks [37], corresponding to networks of symmetrically connected linear threshold gates. As we have seen, the perceptron algorithm is identical to the simple Hebb rule on the examples that are misclassified. Thus a key question to be examined is what happens when the simple Hebb rule is applied once to all the training examples.
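As an illustration, the three rules can be written in a few lines of numpy (a sketch of our own; the function names, sizes, and learning rate are illustrative and not taken from the paper):

```python
import numpy as np

def gradient_rule(w, x, y, eta=0.1):
    """Gradient descent on the relative entropy with a logistic unit (Section 4.1):
    delta w_i = eta * (y - o) * x_i, where o = sigmoid(w.x)."""
    o = 1.0 / (1.0 + np.exp(-w @ x))
    return w + eta * (y - o) * x

def perceptron_rule(w, x, y):
    """Perceptron rule with outputs and targets in {-1,+1}: update only on mistakes."""
    o = 1.0 if w @ x > 0 else -1.0
    return w + y * x if o != y else w

def hebb_rule(w, x, y):
    """Simple Hebb rule: delta w_i = y * x_i, with a learning rate of one."""
    return w + y * x

# One-shot Hebbian storage of K memories, as in (4.1): W = sum_k y_k x_k^T.
rng = np.random.default_rng(0)
n, m, K = 200, 50, 100
X = rng.standard_normal((K, n))
Y = (rng.random((K, m)) < 0.05).astype(float)
W = Y.T @ X                                   # m x n weight matrix
```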
Thus in the next section we extend the previous sparsity results in two directions. First, we allow more general sub-gaussian models for the data, with more complex dependency structures. Second, we show that the neural map can be implemented using the simple Hebb rule.

Computing threshold maps with sub-gaussian data and the simple Hebb rule
In a sense, Theorems 3.3 and 3.7 tell us that threshold maps are able to realize memories that are completely random. But such memories, which lack any pattern, seem to be the hardest data to realize. Thus one can reasonably suspect that threshold maps ought to be able to realize pretty much any kind of data for the same value of K. We are going to show that this is indeed the case. Not only can any dependence of the labels y_k on x_k be allowed, as we saw in Theorem 3.7, but the data points x_k may come from a general distribution in R^n, without any independence requirements on the coordinates of x_k or y_k.
The reader may be quick to realize that this task is impossible in some cases, even for m = 1. If the distribution of the input data consists of three points on a line, with the middle point labeled 1 and the other two labeled 0, then such data is not linearly separable and thus not realizable by a linear threshold function. Remarkably, these impossible cases are rare and there is a simple recipe to rule them out.
We only need to place standard moment assumptions on the distribution of x. Namely, we assume x to be sub-gaussian, which means that all one-dimensional marginals of x are stochastically dominated by λg, where g ∼ N(0, 1) and λ ≥ 0 is some number. The smallest such multiplier λ can be taken as the definition of the sub-gaussian norm ‖x‖_{ψ_2}. The Gaussian, Uniform, and Bernoulli models described in Section 2.2 are all examples of sub-gaussian distributions. In all of these models, the sub-gaussian norms of x are bounded by an absolute constant (irrespective of n or p). Basic definitions about sub-gaussian distributions are given for completeness in Appendix B, while a more extensive treatment can be found in [65, Sections 2.5, 2.6, 3.4].
Let us first state our result informally.
Theorem 5.1 (Informal). If x ∈ R^n is sub-gaussian, all coordinates of y ∈ {0, 1}^m take value 1 with probabilities at most q, and Kq ≳ log m, then the condition
Kq log(Km) log(1/q) ≲ n
guarantees that all data points (x_k, y_k) for which ‖x_k‖_2 ≍ √n can be realized by a threshold map F. Moreover, the map F can be achieved using the simple Hebb rule.
Here and in the following sections, we use the notation a ≍ b if there exist two absolute positive constants c_1 and c_2 such that c_1 b ≤ a ≤ c_2 b. This notation is useful only when a and b vary as a function of other variables, such as n, and the constants are absolute in the sense that they do not depend on these other variables. The condition ‖x_k‖_2 ≍ √n may seem mysterious at first sight. Note, however, that this condition is consistent with the natural scaling: if all coordinates of x have unit variance, then E‖x‖_2² = n, so that the norm of x is expected to be of order √n. If, in addition, the coordinates of x are independent, the concentration of the norm [65, Theorem 3.1.1] guarantees that ‖x‖_2 ≈ √n with probability 1 − exp(−cn). By a union bound, this means that the requirement ‖x_k‖_2 ≍ √n holds automatically for all data points in the sample, so it can be removed from the statement of the theorem.
For general distributions, however, the condition ‖x_k‖_2 ≍ √n cannot be removed. Jointly with the requirement of a sub-gaussian distribution, this condition rules out the data that is impossible to realize. Suppose, for instance, that the distribution of x is supported on a line, like the three-point distribution we mentioned above. Since the distribution is sub-gaussian, the event ‖x_k‖_2 ≍ √n is extremely unlikely: its probability is exponentially small in n. This event is unlikely to hold for any data point in the sample.
Let us now state Theorem 5.1 formally.
Theorem 5.2 (Formal). Assume that x is a mean zero, sub-gaussian random vector in R^n, and y is a random vector in {0, 1}^m. Denote α := ‖x‖_{ψ_2} and q_i := P(y(i) = 1), i = 1, . . ., m. Let m_0 ≥ m be such that Kq_i ≥ C log m_0 for all i. Let β, γ > 0 be such that:

Consider K data points (x_k, y_k), k = 1, . . ., K sampled independently from the distribution of (x, y). Then, with probability at least 1 − 2m/m_0, there exists a map F ∈ T(n, m) such that F(x_k) = y_k for every k with γ√n ≤ ‖x_k‖_2 ≤ β√n. Moreover, the matrix W of the threshold map F = h(Wx − b) can be computed by the Hebb rule (4.1), and b can be any vector (either fixed or dependent on the data) whose coordinates b(i) all satisfy:

Note that in this theorem we do not assume any kind of independence in the distribution of (x, y). In particular, the coordinates of x and of y may be correlated with each other, and the label vector y may be correlated with x. The proof of this theorem is given in Appendix C.
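For intuition, here is a small numerical illustration of this regime (a sketch of our own; the sizes, the sparsity q, and the bias b = n/2 are heuristic choices consistent with the scaling signal ≈ ‖x_k‖² ≈ n, not values prescribed by the theorem):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, K, q = 1000, 200, 2000, 0.005    # dense Gaussian inputs, q-sparse targets

X = rng.standard_normal((K, n))        # K memories x_k in R^n
Y = (rng.random((K, m)) < q).astype(float)

W = Y.T @ X                            # simple Hebb rule (4.1): W = sum_k y_k x_k^T
b = n / 2.0                            # heuristic bias: half of the typical signal ||x_k||^2 ~ n

Y_hat = (X @ W.T - b > 0).astype(float)
print("fraction of target bits reproduced:", (Y_hat == Y).mean())
```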

Binary Input Vectors
Theorem 3.3 dealt with inputs given by a standard normal random vector, and remains true under any rescaling, i.e. if the standard deviation of the normal components is not one. From Theorem 5.2, we can immediately derive corollaries to deal with binary vectors drawn according to the models [B(p)]^n or U(p, n) with p = 0.5, as well as other values of p (as long as p is not too small). When p = 0.5, these models are very close to the standard normal model. In the [B(0.5)]^n model over K^n, all the components are i.i.d. with mean zero and variance 1, as in the standard normal model. In the U(0.5, n) model over K^n, all the components are identically distributed with mean zero and variance 1, and with an identical small negative covariance for all non-diagonal terms (see [26] for results on randomly projected hypercubes).

Corollary 6.1 (Informal). If x ∈ K^n, all coordinates of y ∈ {0, 1}^m take value 1 with probabilities at most q, and Kq ≳ log m, then the condition
Kq log(Km) log(1/q) ≲ n
guarantees that all data points (x_k, y_k) can be realized by a threshold map F. Moreover, the map F can be achieved using the simple Hebb rule.
More precisely, one has the following result.
Corollary 6.2 (Formal). Assume that x is a mean zero random binary vector in K^n, and y is a random vector in {0, 1}^m. Denote α(n) := ‖x‖_{ψ_2} and q_i := P(y(i) = 1), i = 1, . . ., m. Let m_0 ≥ m be such that Kq_i ≥ C log m_0 for all i. Consider K data points (x_k, y_k), k = 1, . . ., K sampled independently from the distribution of (x, y), with K satisfying:
≤ cn,   i = 1, . . ., m.   (6.1)
Then, with probability at least 1 − 2m/m_0, there exists a map F ∈ T(n, m) such that F(x_k) = y_k for all k. Moreover, the matrix W of the threshold map F = h(Wx − b) can be computed by the simple Hebb rule (4.1), and b can be any vector (either fixed or dependent on the data) whose coordinates b(i) all satisfy:

This corollary is obtained immediately by applying Theorem 5.2, noting that the binary vector x is sub-gaussian and that every vector in K^n satisfies ‖x‖_2 = √n. As previously stated, α(n), which appears in (6.1), is bounded. An obvious special case of this corollary is obtained when the components of x are i.i.d. symmetric Bernoulli random variables (i.e. Rademacher random variables). In Appendix B, we show that in this case the sub-gaussian norm α = α(n) is bounded by, and as n → ∞ converges to, √8/√3.

Input Sparsity
Theorem 5.2 holds for considerably general input distributions, in particular distributions that produce dense input vectors. However, one can also consider cases where the input vectors themselves tend to be sparse. In particular, this situation could occur if the first sparse target layer became the input layer for a subsequent, new target layer. Theorem 5.2 does allow the data points x_k to be sparse, having most of their coordinates equal to zero.
However, sparsity reduces the norms of the x_k, thereby making the condition ‖x_k‖_2 ≍ √n harder to satisfy, which in turn demands more sample points K in (5.1).
As we will show now, the data points x_k can be made almost arbitrarily sparse for free. Surprisingly, the sparsity has almost no effect on the sample size. Let us first state this result informally.
Theorem 7.1 (Informal). If x ∈ {0, 1}^n and y ∈ {0, 1}^m are independent random vectors whose coordinates are i.i.d. and take value 1 with probabilities p and q respectively, and Kq ≳ log m and np ≳ log(Km), then the condition
Kq log(Km) ≲ n
guarantees that the answer to Question 3.2 is Yes with probability 1 − o(1). Moreover, the threshold map F can be computed by the Hebb rule.
And here is a formal version of the result, with more controls.
Theorem 7.2. Assume that x is a random vector in {0, 1}^n and y is an independent random vector in {0, 1}^m. Assume that the coordinates of x are i.i.d. Bernoulli with parameter p ∈ (0, 1/2] and the coordinates of y are i.i.d. Bernoulli with parameter q ∈ (0, 1). Consider K data points (x_k, y_k), k = 1, . . ., K sampled independently from the distribution of (x, y). Let m_0 ≥ m be such that Kq ≥ C log m_0, np ≥ C log(Km_0), and
Kq log(Km_0) ≤ cn.
Then, with probability at least 1 − 3m/m_0, there exists a map F ∈ T(n, m) such that F(x_k) = y_k for all k = 1, . . ., K. Moreover, the matrix W in the threshold map F = h(Wx − b) can be computed by a version of the Hebb rule, W := Σ_{k=1}^K y_k x̃_k^T where x̃_k := x_k − E x_k, and b can be any vector (either fixed or dependent on the data) whose coordinates are all of order np.

This result can be proved in a similar way to Theorem 5.2.
Proof. Let us first assume that m = 1 and check that the map F satisfies F(x_1) = y_1 with high probability. Once we have done this, a union bound over the K data points and the m coordinates of y will finish the argument. When m = 1, the function F can be expressed as F(x) = h(⟨w, x⟩ − b), where
w = Σ_{k=1}^K y_k x̃_k   and   x̃_k := x_k − E x_k.

Step 1. Decomposition into signal and noise. In order to prove that F(x_1) = y_1, let us expand ⟨w, x_1⟩ as follows:
⟨w, x_1⟩ = y_1 ⟨x̃_1, x_1⟩ + Σ_{k=2}^K y_k ⟨x̃_k, x_1⟩ =: signal + noise.   (7.3)
We would like to show that the signal to noise ratio is large. To this end, consider the random sets
I := {i ∈ [n] : x_1(i) = 1}   and   J := {k ∈ {2, . . ., K} : y_k = 1}.
Let us estimate the strength of the signal and noise in (7.3). If y_1 = 0, the signal is obviously zero, and when y_1 = 1, we have
signal = ⟨x̃_1, x_1⟩ = (1 − p)|I|.

Step 2. Bounding the noise. The noise in (7.3) can be expressed as
noise = Σ_{k∈J} ⟨x̃_k, x_1⟩ = Σ_{k∈J} Σ_{i∈I} (x_k(i) − p).
The sets I and J are fixed by conditioning, and the noise is then the sum of |I||J| i.i.d. random variables with mean zero, variance p(1 − p), and which are uniformly bounded by 1. Bernstein's inequality then implies that:

(by (7.4) and (7.5)) with a suitably large constant C_1. Thus, with (conditional) probability at least 1 − 1/(Km_0), the noise satisfies:

The last bound follows from the assumptions of the theorem with sufficiently large constant C and sufficiently small constant c.
Step 4. Union bound. We can repeat this argument for any fixed k = 1, . . ., K and thus obtain F(x_k) = y_k with probability at least 1 − 1/m_0 − 2/(Km_0). Now take a union bound over all k = 1, . . ., K. This should be done carefully: recall that the term 1/m_0 in the probability bound appears because we wanted the set I to satisfy (7.4). The set I obviously does not depend on our choice of a particular k; it is fixed during the application of the union bound, and the term 1/m_0 does not increase in this process. Thus, we showed that the conclusion F(x_k) = y_k for all k = 1, . . ., K holds with probability at least 1 − 1/m_0 − 2/m_0 = 1 − 3/m_0. This completes the proof of the theorem in the case m = 1. To extend it to general m, we apply the argument above for each coordinate i = 1, . . ., m of y and finish by taking the union bound over all m coordinates.

Further results
8.1. Autoencoders. It is easy to check that the conclusion of Theorem 7.1 remains the same if we center the label vectors y_k in the Hebb rule, i.e. set
W := Σ_{k=1}^K ỹ_k x̃_k^T,   where ỹ_k := y_k − E y_k.
One can check that the effect of the centering of the y_k on the signal-to-noise ratio is negligible; we skip the routine details. This version of the Hebb rule is symmetric, so we can apply Theorem 7.1 again, swapping x_k, n and p with y_k, m and q respectively. It follows that F can be inverted on the data, and the inverse is again a threshold function! Moreover, the inverse G, which satisfies G(y_k) = x_k for all k, is given by the same Hebb rule (up to the swapping), namely by the weight matrix
W' := Σ_{k=1}^K x̃_k ỹ_k^T = W^T.
This, of course, holds under the mild assumptions that Kq ≳ log m, np ≳ log(Km), Kp ≳ log n, mq ≳ log(Kn), as well as the key assumptions Kq log(Km) ≲ n and Kp log(Kn) ≲ m.
This observation has an unusual consequence for "Hebb networks", i.e. two-layer neural networks whose weights are trained by the Hebb rule. If we feed x_k into the input layer, the network computes y_k in the output layer. Furthermore, we can reverse the direction of this computation by feeding y_k into the output layer; the network then computes x_k in the input layer.
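Here is a minimal numpy sketch of such a forward/backward Hebb network (our own illustration; the layer sizes, the sparsity levels p and q, and the half-signal biases are heuristic choices that may need tuning to land inside the regime of the theorems):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, K, p, q = 1000, 1000, 300, 0.02, 0.02     # sparse binary inputs and targets

X = (rng.random((K, n)) < p).astype(float)       # memories x_k, p-sparse
Y = (rng.random((K, m)) < q).astype(float)       # codes y_k, q-sparse

# Centered Hebb rule, used symmetrically in both directions (Section 8.1).
W_enc = (Y - q).T @ (X - p)                      # m x n: W = sum_k (y_k - q)(x_k - p)^T
W_dec = (X - p).T @ (Y - q)                      # n x m: same rule with x and y swapped

b_enc = n * p / 2.0                              # heuristic biases: half the typical signal
b_dec = m * q / 2.0

H = (X @ W_enc.T - b_enc > 0).astype(float)      # input layer  -> hidden codes
X_rec = (H @ W_dec.T - b_dec > 0).astype(float)  # hidden codes -> reconstructed inputs

print("hidden bits matching y_k:", (H == Y).mean())
print("input bits reconstructed:", (X_rec == X).mean())
```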
One can interpret this as a construction of a "Hebb autoencoder" with three layers of sizes n, m and n. If we feed the data x_k into the input layer, it is transformed into y_k in the hidden layer, and back into x_k in the output layer. Up to logarithmic factors, we can train such an autoencoder to zero error on a sample of size K ∼ nm if we set p ∼ 1/n and q ∼ 1/m.

8.2. Robustness. The Hebb rule is very robust. Indeed, we can replace the exact formula W := Σ_{k=1}^K y_k x_k^T in (4.1) by the approximate version
W := Σ_{k=1}^K y_k x̂_k^T,   (8.1)
where the x̂_k are any sub-gaussian i.i.d. random vectors in R^n whose distribution is positively correlated with x_k, i.e.:

Our analysis of the signal-to-noise ratio remains mostly the same, and the results are modified in a natural way (the constant c enters the formulas). We skip the details. This robustness may be useful during development and learning. In addition, it has two other consequences.
1. Quantization. The weights can be updated using just three values: −1, 0, 1. This can be seen if we use the Hebb rule (8.1) with x̂_k := sign(x_k − E x_k), where the sign is applied coordinate-wise.
2. Sparsification. The weight matrices associated with Hebbian learning can easily be sparsified. All we have to do is multiply the weights by independent Bernoulli(ρ) random variables with small ρ. The sparsified weights are positively correlated with the original weights, and thus versions of our results hold for sparse networks as well.

8.3. Learning. In the terminology of [65, Section 8.4], we showed that the empirical risk, or in-sample risk, is R_K(f*_K) = 0. The expected error, or expected learning risk, then satisfies:

(The last bound can be found in [65, Section 8.4.4].)

Sparsity and Expansion
The results above show that a computational advantage of sparsity in the target layer is that it increases the number of memories that can be stored in the map. However, they do not say anything about the expansion often observed in the target layer. Indeed, we have already noted how little the theorems derived in the previous sections depend on the size m of the target layer. Is there, then, an explanation for the expansion?
There could be many reasons behind the expansion, for instance developmental constraints. However, one obvious computational reason that may be taken into consideration is producing maps that are un-ambiguous (see Section 2). In order to minimize the risk of ambiguity, it is reasonable to try to maximize the Hamming distance between patterns in the target layer. If we have two q-sparse binary patterns in the target layer, their maximal Hamming distance is 2qm, and it is easy to see that only a linear number of patterns can be selected so that any pair of them is at maximal Hamming distance. Thus the number of such memories must grow linearly in m; at the same time it must be equal to K, which is significantly larger than n given the results in the previous theorems. Thus maximizing the pairwise distances of the target memories leads to layer expansion, where m is significantly larger than n, in order to minimize the overlap between the encodings of different memories.

Conclusion and Open Problems
In this work, we have shown that neural maps with a sparse hidden layer can store more memories, and that both effective encoding and decoding can be achieved using the simple Hebb learning rule. However, many open problems remain to be investigated, including further tightening the bounds of some of the theorems, or obtaining results that are not necessarily asymptotic but hold exactly in some finite regime.

10.1. Polynomial Threshold Maps. Superficially it may seem that the results in this work are restricted to the case of linear threshold functions or gates, but this is not the case. Similar results may hold for other classes of functions, such as polynomial threshold functions or gates of degree d, with the functional form
f(x) = h( b + Σ_I w_I x_I ).
Here I runs over all non-empty subsets of [n] = {1, 2, . . ., n} of size at most d, and if I = {i_1, . . ., i_k} we let x_I = x_{i_1} · · · x_{i_k}. Note that in this notation we allow only pure monomials, where all the powers associated with each variable are equal to one. While the more general case can be analyzed similarly, focusing on pure monomials simplifies things; furthermore, when x ∈ K^n, x_i² = 1 for every i = 1, . . ., n and thus higher power terms are not needed. Note also that the bias b corresponds to I = ∅. We call homogeneous the case where all the monomials have degree exactly d:
f(x) = h( Σ_{|I| = d} w_I x_I ).
For a given n-dimensional vector x, we let x^{⊗d} denote the tensor of all the monomials of order exactly d, and x^{⊗≤d} denote the tensor of all non-constant monomials of order d or less. Thus a polynomial threshold function (or gate) can be viewed as a linear function (or gate) applied to the corresponding tensors.
Next, consider the case where x is a random vector with i.i.d. symmetric Bernoulli components. Note that in this case x_I is also a symmetric Bernoulli random variable for any non-empty I ⊆ [n]. Furthermore, for any pair of distinct subsets I and J, the variables x_I and x_J are independent, i.e. there is pairwise independence but not global independence. Using the results from Section 5 leads to the following corollaries, stated first informally and then more formally.
Corollary 10.1 (Informal). If x ∈ K^n has i.i.d. symmetric Bernoulli components, all coordinates of y ∈ {0, 1}^m take value 1 with probabilities at most q, and Kq ≳ log m, then the condition

guarantees that all data points (x_k, y_k) can be realized by a polynomial (resp. homogeneous polynomial) threshold map F of degree d.

Corollary 10.2 (Formal). Assume that x ∈ K^n has i.i.d. symmetric Bernoulli components and y is a random vector in {0, 1}^m. Denote α = α(n, d) := ‖x^{⊗≤d}‖_{ψ_2} (resp. ‖x^{⊗d}‖_{ψ_2}) and q_i := P(y(i) = 1), i = 1, . . ., m. Let m_0 ≥ m be such that Kq_i ≥ C log m_0 for all i.
Consider K data points (x_k, y_k), k = 1, . . ., K sampled independently from the distribution of (x, y), with K satisfying:

or, respectively, in the homogeneous case:

Then, with probability at least 1 − 2m/m_0, there exists a polynomial (resp. homogeneous polynomial) threshold map F of degree d such that F(x_k) = y_k for all k = 1, . . ., K.

The proof of this statement is an immediate application of Theorem 5.2, noting that: (1) the tensors x^{⊗≤d} (resp. x^{⊗d}) are sub-gaussian; and (2) for every x ∈ K^n, the Euclidean norm of x^{⊗≤d} (resp. x^{⊗d}) is constant, equal to the square root of the number of monomials. However, the bounds above depend on the value of α = α(n, d), the sub-gaussian norm of the corresponding Bernoulli tensors. Thus open problems here include estimating the value of α(n, d), and finding better estimates associated with the phase transition for polynomial threshold maps with d > 1, in both the asymptotic and non-asymptotic regimes (see additional discussion at the end of Appendix B).
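To illustrate the reduction to the linear case, here is a sketch of our own (for d = 2 and homogeneous pure monomials) that expands ±1 inputs into pairwise products and then applies the same Hebb-rule threshold map as before; all sizes and the bias are heuristic choices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, m, K, q = 40, 100, 1500, 0.005

X = rng.choice([-1.0, 1.0], size=(K, n))         # i.i.d. symmetric Bernoulli inputs in K^n
pairs = list(combinations(range(n), 2))           # pure monomials x_i x_j of degree d = 2
Phi = np.stack([X[:, i] * X[:, j] for i, j in pairs], axis=1)   # K x C(n,2) tensor features

Y = (rng.random((K, m)) < q).astype(float)
W = Y.T @ Phi                                     # Hebb rule applied to the expanded features
b = Phi.shape[1] / 2.0                            # heuristic bias: half of ||x^{(2)}||^2 = C(n,2)

Y_hat = (Phi @ W.T - b > 0).astype(float)
print("monomial features:", Phi.shape[1], " target bits reproduced:", (Y_hat == Y).mean())
```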
10.2. Neuronal Capacity and Storage. Finally, it is useful to view the results in this paper in terms of neuronal capacity, storage, and information theory, where neural learning is seen as a communication process whereby information is transferred from the training data to the synaptic weights. The amount of information that can be communicated, essentially the capacity of the channel, can be estimated in two different ways, one at each end of the channel. At the synaptic end, we can investigate how much information can be stored in the synapses; at the data end, we can investigate how much information can be extracted from the training set. The apparent paradox alluded to in Section 2 is that in the case of sparse functions, information seems to decrease at the synaptic end, but to increase at the training data end. We now treat these questions more precisely by defining and comparing different notions of storage and capacity.
For simplicity, we look at A(n, 1) Boolean architectures, but the same ideas can be extended to other architectures, including A(n, m) maps, as well as to non-Boolean cases. Thus in general we assume that we are considering a class C of Boolean functions of n variables. Of particular interest here are the cases where the Boolean functions are linear threshold gates, and the training sets have targets that are sparse. At the level of the class itself, we can first define the cardinal capacity.
10.2.1. The Synaptic View: the Cardinal Capacity. The cardinal capacity C of the class C is defined by C(C) := log_2 |C|. The capacity can be interpreted as the minimum average number of bits required to communicate an element of C in a very long message consisting of a random sequence of elements of C taken with a uniform distribution (which corresponds to the worst case in terms of the number of required bits). In the case of linear threshold gates, it can be viewed as the number of bits that must be "communicated" from the world (i.e. the training set) to the synaptic weights, and stored in the synaptic weights, in order to select a specific input-output function. The set of all Boolean functions has capacity 2^n. The set of all p-sparse Boolean functions has obviously a small cardinal capacity, given by log_2 \binom{2^n}{p2^n}. The set T(n, 1) of all linear threshold gates of n variables has capacity log_2 |T(n, 1)| ≈ n² ([8] and references therein). The work presented here leads to an interesting open question: what is the fraction of p-sparse Boolean functions that can be implemented by linear threshold gates? Or, equivalently, what is the fraction of linear threshold Boolean functions that are also p-sparse? And obviously a similar question can be posed for polynomial threshold gates of degree d > 1. If the linear threshold functions were to intersect the p-sparse Boolean functions roughly in the same way as all other Boolean functions do as a function of p, then one would conjecture that the number of p-sparse linear threshold gates is approximately given by:

It is worth noting that the value of |T_p(n, 1)| is known exactly in some simple cases corresponding to the lowest values of p. In particular, |T_{2^{−n}}(n, 1)| = 2^n, since it is always possible to linearly separate one vertex of the hypercube from all the other vertices. Likewise, |T_{2^{−(n−1)}}(n, 1)| = n2^{n−1}, since two vertices can be linearly separated if and only if they are adjacent. And similarly for p = 3/2^n and p = 4/2^n (e.g. four vertices can be linearly separated if and only if they form a face).
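The exact values quoted above can be checked by brute force for small n (a sketch of our own, using an LP feasibility test as the separability oracle; the enumeration is exponential and only meant for tiny n):

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def separable(X, white):
    """Can the vertices indexed by `white` be cut off from the rest of X by a hyperplane?"""
    n = X.shape[1]
    y = np.zeros(len(X)); y[list(white)] = 1
    signs = np.where(y == 1, -1.0, 1.0)
    A = signs[:, None] * np.hstack([X, -np.ones((len(X), 1))])
    res = linprog(np.zeros(n + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.status == 0

n = 4
H = np.array([[(v >> i) & 1 for i in range(n)] for v in range(2 ** n)], dtype=float)

for k in (1, 2):                 # k = number of vertices mapped to 1, i.e. p = k / 2^n
    count = sum(separable(H, S) for S in combinations(range(2 ** n), k))
    print(f"{k}-sparse separable labelings of the {n}-cube:", count)
# Expected: 2^n = 16 for k = 1 (any single vertex), and n 2^(n-1) = 32 for k = 2
# (exactly the pairs of adjacent vertices), matching the counts quoted above.
```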
Now we look at the other end of the communication channel, at the information contained in the data, which itself can be captured using different notions, such as the VC dimension, the discriminant dimension, and the training capacity.

10.2.2. VC dimension. The VC dimension V of C is the size of the largest set S of input vectors that can be shattered by C:
V := max { |S| : S ⊆ H^n and S can be shattered by C }.
Thus obviously we have 2^V ≤ 2^C = |C|. In addition, the Sauer-Shelah lemma gives the upper bound
|C| ≤ \binom{2^n}{≤ V},
where \binom{2^n}{≤ V} denotes the sum of all binomial coefficients \binom{2^n}{k} with k ≤ V. The VC dimension of all Boolean functions of n variables is 2^n. The VC dimension of all p-sparse Boolean functions is p2^n. The VC dimension of all linear threshold gates is n + 1, which raises another problem: what is the VC dimension of the set T_p(n, 1) of all p-sparse linear threshold gates?

B.2. Sub-gaussian norm of symmetric Bernoulli vectors. In connection with Corollary 6.1, we assume that x = (x_1, . . ., x_n), where the x_i are i.i.d. Bernoulli ±1 random variables with probability p = 0.5. The sub-gaussian norm of x is given by ‖x‖_{ψ_2} = sup_{u ∈ S^{n−1}} ‖⟨x, u⟩‖_{ψ_2}, where S^{n−1} is the sphere of radius 1 in R^n. Now we can write
‖x‖_{ψ_2} = sup_{u ∈ S^{n−1}} inf { t > 0 : E exp(⟨x, u⟩²/t²) ≤ 2 }.   (B.1)
Note that for fixed u the expectation is a continuous, strictly monotone, decreasing function of t ∈ (0, +∞), decreasing in value from +∞ to 0. Thus the value 2 is achieved by the expectation for a single value of t, and inf can be replaced by min in Equation (B.1). The corresponding value of t varies continuously as u is varied over the closed set S^{n−1}. Thus the maximum value of the corresponding t is achieved on S^{n−1} (at multiple points, for symmetry reasons) and sup can be replaced by max in Equation (B.1). The following theorem provides the bound and asymptotic value of the sub-gaussian norm.
Theorem B.1. Let Z be a standard normal random variable, Z ∼ N(0, 1), and let x = (x_1, . . ., x_n) be a vector of i.i.d. symmetric Bernoulli random variables. Fix u ∈ S^{n−1} and let X = ⟨u, x⟩. Then, for any σ > 0, we have
E exp(σ²X²/2) ≤ E exp(σ²Z²/2).
Furthermore, the sub-gaussian norm α(n) of x satisfies α(n) ≤ √8/√3, and α(n) → √8/√3 as n → ∞.
Proof of Theorem B.1. The proof is based on the Chernoff bound on the moment generating functions of Z and X.
Lemma B.2 (Chernoff's bound). For any λ ∈ R, we have
E exp(λZ) = exp(λ²/2)   and   E exp(λX) ≤ exp(λ²/2).
To prove this bound, note that the identity for Z is the basic formula for the moment generating function of the normal distribution. For X = ⟨u, x⟩, we have
E exp(λX) = Π_{i=1}^n E exp(λu_i x_i) = Π_{i=1}^n cosh(λu_i) ≤ Π_{i=1}^n exp(λ²u_i²/2) = exp(λ²/2).
Now, to finish the proof of Theorem B.1, we first note that the following identity holds for every x ∈ R and σ > 0:
exp(σ²x²/2) = (2πσ²)^{−1/2} ∫ e^{λx} e^{−λ²/(2σ²)} dλ,
since each side represents the moment generating function of a N(0, σ²) random variable evaluated at x, i.e. E exp(Y x) where Y ∼ N(0, σ²). We then substitute x = X and take expectations on both sides. This yields
E exp(σ²X²/2) = (2πσ²)^{−1/2} ∫ E(e^{λX}) e^{−λ²/(2σ²)} dλ ≤ (2πσ²)^{−1/2} ∫ e^{λ²/2} e^{−λ²/(2σ²)} dλ.
If we repeat the same computation for Z, the inequality (due to the application of Lemma B.2) becomes an equality, and the first part of the theorem is proven. As a consequence, setting σ² = 3/4, we obtain
E exp(3X²/8) ≤ E exp(3Z²/8) = 2,
so that ‖X‖_{ψ_2} ≤ √(8/3) = √8/√3 for every u ∈ S^{n−1}, and hence α(n) ≤ √8/√3. The convergence α(n) → √8/√3 as n → ∞ follows from the Central Limit Theorem applied to X = ⟨x, a⟩ with a = (1/√n, . . ., 1/√n), whose distribution tends to N(0, 1), together with the fact that E exp(Z²/t²) ≤ 2 precisely when t² ≥ 8/3.

B.3. Sub-gaussian norm of symmetric Bernoulli tensors. Unlike the case d = 1, here the numbers α(n, d) are not bounded as n → ∞. To see this, let us allow for simplicity repetitions in the sets I of indices defining the tensor. This makes the tensor x^{⊗d} have dimension n^d (as opposed to \binom{n}{d}). With this in mind, for every vector a in R^n we have ⟨x^{⊗d}, a^{⊗d}⟩ = ⟨x, a⟩^d. Let a be the unit vector with all coefficients equal to 1/√n. By the Central Limit Theorem, ⟨x, a⟩ → G where G is N(0, 1), the convergence being in distribution as n → ∞. Thus
E exp(⟨x, a⟩^{2d}/t²) → E exp(G^{2d}/t²) = +∞
for every t > 0, as long as d > 1. This shows that the sub-gaussian norm of X := ⟨x^{⊗d}, a^{⊗d}⟩ is larger than t (for large enough n). Since t is arbitrary, it follows that the sub-gaussian norm of X goes to infinity. Thus the Central Limit approximation that gives a bounded norm in the case d = 1 does not help in the case d > 1.
Appendix C. Proof of Theorem 5.2

Our proof of Theorem 5.2 is based on standard facts about sub-gaussian distributions (see [65]) and on the following lemma.

Lemma C.1 (Conditioning sub-gaussian distributions). Let x be a sub-gaussian random vector taking values in R^n. Then, for any event E with positive probability, we have
‖x‖_{ψ_2(·|E)} ≤ C ‖x‖_{ψ_2} √( log(2/P(E)) ).
In the statement of this lemma and thereafter, we write ‖x‖_{ψ_2(·|E)} to indicate that the sub-gaussian norm of x is computed with respect to the conditional distribution given the event E.
Proof. Taking the inner product of x with a fixed unit vector, we can reduce the problem to the case n = 1, where x is a random variable. Furthermore, by homogeneity we can assume that ‖x‖_{ψ_2} = 1. Then, denoting q := P(E), we have
P( |x| > t | E ) ≤ P( |x| > t ) / q ≤ (2/q) exp(−ct²) ≤ 2 exp(−ct²/2)
as long as t ≥ t_0 := √( (2/c) log(1/q) ).
In the range where t < t_0, a trivial bound holds because the right hand side is greater than 1. Combining the two bounds, we conclude by the definition of the sub-gaussian norm that
‖x‖_{ψ_2(·|E)} ≤ C √( log(2/q) ).
The proof is complete.
Proof of Theorem 5.2. Let us first assume that m = 1 and check that the map F satisfies F(x_1) = y_1 with high probability. Once we have done this, a union bound over the K data points and the m coordinates of y will finish the argument. When m = 1, the function F can be expressed as F(x) = h(⟨w, x⟩ − b), where
w = Σ_{k=1}^K y_k x_k.   (C.1)

Step 1. Decomposition into signal and noise. In order to prove that F(x_1) = y_1, let us expand ⟨w, x_1⟩ as follows:
⟨w, x_1⟩ = y_1 ‖x_1‖_2² + Σ_{k=2}^K y_k ⟨x_k, x_1⟩ =: signal + noise.   (C.2)

The equality in (C.5) is due to independence. The inequality in (C.5) uses the fact that the events {y_k = 0} and {y_k = 1} have probabilities 1 − q and q, both of which can be bounded below by q(1 − q).
This implies that, conditioned on x_1 and y_2, . . ., y_K, the noise term in (C.2) is sub-gaussian:
‖noise‖_{ψ_2(·|x_1, y_2, ..., y_K)} = ‖ Σ_{k=2}^K y_k ⟨x_k, x_1⟩ ‖_{ψ_2(·|x_1, y_2, ..., y_K)} ≤ R.
By the definition of the sub-gaussian norm, this yields the tail bound
P( |noise| > t | x_1, y_2, . . ., y_K ) ≤ 2 exp(−c_0 t²/R²) ≤ 1/(Km_0)
if we choose t := C_1 √(log(Km_0)) R with a suitably large constant C_1. Thus, with (conditional) probability at least 1 − 1/(Km_0), the noise satisfies |noise| ≤ t ≤ ½ γ²n. The last bound follows from the definitions of t and R in (C.6) and the key assumption (5.1) with a suitably large constant C.
Step 3. Estimating the signal-to-noise ratio. Lifting the conditioning on x_1 and y_2, . . ., y_K, we conclude the following with (unconditional) probability at least 1 − 1/m_0 − 1/(Km_0). If y_1 = 0 then signal = 0; otherwise signal ≥ γ²n, as long as x_1 has moderate norm per (C.4); and the noise satisfies |noise| ≤ ½ γ²n. Putting this back into (C.2), we see that F(x_1) = y_1 whenever the bias b satisfies the condition in the statement of the theorem.

Step 4. Union bound. We can repeat this argument for any fixed k = 1, . . ., K and thus obtain F(x_k) = y_k with probability at least 1 − 1/m_0 − 1/(Km_0). Now take a union bound over all k = 1, . . ., K. This should be done carefully: recall that the term 1/m_0 in the probability bound appears because we wanted the set I to satisfy (7.4). The set I obviously does not depend on our choice of a particular k; it is fixed during the application of the union bound, and the term 1/m_0 does not increase in this process. Thus, we showed that the conclusion F(x_k) = y_k for all k = 1, . . ., K holds with probability at least 1 − 1/m_0 − K/(Km_0) = 1 − 2/m_0.
This completes the proof of the theorem in the case m = 1. To extend it to general m, we apply the argument above for each coordinate i = 1, . . ., m of y and finish by taking the union bound over all m coordinates.

Figure 1. Proof of Part 1 of Theorem 3.3: The white points x_k (labeled y_k = 1) form a face of the Gaussian polytope conv(x_1, . . ., x_N) and thus are linearly separated from the black points. However, this reasoning cannot be reversed: black points may be linearly separated from the white ones without forming a face of the Gaussian polytope.

Figure 2. Proof of Part 2 of Theorem 3.3: The arithmetic mean of the white points (labeled y_k = 1) has norm r_0 ≈ √(n/Kq). This is smaller than the radius r ≈ √(2 log(N/n)) of the round core of the Gaussian polytope formed by the black points. Hence the black and white points are not linearly separated.
