Convergence guarantees for forward gradient descent in the linear regression model

Renewed interest in the relationship between artificial and biological neural networks motivates the study of gradient-free methods. Considering the linear regression model with random design, we theoretically analyze in this work the biologically motivated (weight-perturbed) forward gradient scheme that is based on a random linear combination of the gradient. If d denotes the number of parameters and k the number of samples, we prove that the mean squared error of this method converges for $k\gtrsim d^2\log(d)$ with rate $d^2\log(d)/k.$ Compared to the dimension dependence d for stochastic gradient descent, an additional factor $d\log(d)$ occurs.


1 Introduction
Looking at the past developments, it is apparent that artificial neural networks (ANNs) became more powerful the more they resembled the brain. It is therefore anticipated that the future of AI will be even more biologically inspired. As in the past, the bottlenecks towards more biologically inspired learning are computational barriers. For instance, shallow networks only became computationally feasible after the backpropagation algorithm was proposed. Deep neural networks had been proposed long before, but deep learning only became implementable after the development of large-scale GPU computing. Neuromorphic computing aims to imitate the brain on computer chips, but is currently not fully scalable.
The mathematics of AI has focused on explaining the state-of-the-art performance of modern machine learning methods and empirically observed phenomena such as the good generalization properties of extreme overparametrization. To shape the future of AI, statistical theory needs to put more emphasis on anticipating future developments and proposing biologically motivated methods already at a stage before scalable implementations exist. This work aims to analyze a biologically motivated learning rule, building on the renewed interest in the differences and similarities between ANNs and biological neural networks (BNNs) [16,23,30], which is rooted in the foundational literature from the 1980s [8,6]. A key difference between ANNs and BNNs is that ANNs are usually trained based on a version of (stochastic) gradient descent, while this seems prohibitive for BNNs. Indeed, to compute the gradient, knowledge of all parameters in the network is required, but biological networks do not possess the capacity to transport this information to each neuron. This suggests that biological networks cannot directly use the gradient to update their parameters [6,16,27].
The brain still performs well without gradient descent and can learn tasks with far fewer examples than ANNs. This sparks interest in biologically plausible learning methods that do not require (full) access to the gradient. Such methods are called derivative-free. A simple example of a derivative-free method is to randomly sample a new parameter in each step and to keep it only if it decreases the loss, otherwise it is discarded (a minimal sketch of this strategy follows below). There is a wide variety of derivative-free strategies [5,13,26]. Among those, so-called zeroth-order methods use evaluations of the loss function to build a noisy estimate of the gradient. This substitute is then used to replace the gradient in the gradient descent routine [17,7]. [23] establishes a connection between the Hebbian learning underlying the local learning of the brain (see e.g. Chapter 6 of [27]) and a specific zeroth-order method. A statistical analysis of this zeroth-order scheme is provided in the companion article [24].
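To make the random-search strategy mentioned above concrete, the following minimal sketch applies it to an arbitrary loss function. The function name random_search, the sampling distribution, the loss, and the number of iterations are illustrative choices and not part of the methods referenced above.

import numpy as np

def random_search(loss, d, steps=1000, rng=np.random.default_rng(0)):
    """Derivative-free baseline: propose a random parameter, keep it only if the loss decreases."""
    theta = rng.standard_normal(d)
    best = loss(theta)
    for _ in range(steps):
        proposal = rng.standard_normal(d)     # sample a completely new parameter
        value = loss(proposal)
        if value < best:                      # keep the proposal only if it improves the loss
            theta, best = proposal, value
    return theta

# illustrative quadratic loss with minimum at the all-ones vector
theta_hat = random_search(lambda t: np.sum((t - 1.0) ** 2), d=5)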
In this article, we study (weight-perturbed) forward gradient descent. This method is motivated by biological neural networks [2,22] and lies between full gradient descent methods and derivative-free methods, as only random linear combinations of the gradient are required. The form of the random linear combination is related to zeroth-order methods, see Section 2. Settings with partial access to the gradient have been studied before. For example, [19] proposes a learning method based on directional derivatives for convex functions. In this work we specifically derive theoretical guarantees for forward gradient descent in the linear regression model with random design. Theorem 3.1 establishes an expression for the expectation. A bound on the mean squared error is provided in Theorem 3.3.
The structure of the paper is as follows. In Section 2 we describe the linear regression model and define the update rule. We present the main theorems and some discussion thereof in Section 3. Proofs can be found in Section 4.

1.1 Notation
Vectors are denoted by bold letters and we write $\|\cdot\|_2$ for the Euclidean norm. We denote the largest and smallest eigenvalue of a matrix $A$ by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$, respectively. The spectral norm is $\|A\|_S := \sqrt{\lambda_{\max}(A^\top A)}$. The condition number of a positive semi-definite matrix $B$ is $\kappa(B) := \lambda_{\max}(B)/\lambda_{\min}(B)$.
For a random variable $U$ we denote the expectation with respect to $U$ by $\mathbb{E}_U$. The symbol $\mathbb{E}$ stands for an expectation taken with respect to all random variables inside that expectation. The (multivariate) normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ is denoted by $\mathcal{N}(\mu, \Sigma)$.
2 Weight-perturbed forward gradient descent
Suppose we want to learn a parameter vector $\theta$ from training data. Stochastic gradient descent (SGD) is given by the update rule

$\theta_{k+1} = \theta_k - \alpha_{k+1}\,\nabla L(\theta_k), \qquad k = 0, 1, \ldots \qquad (2.1)$

with $\theta_0$ some initial value, $\alpha_{k+1} > 0$ a (possibly time-varying) learning rate, and $L(\theta_k) := L(\theta_k, X_k, Y_k)$ a loss that depends on the data only through the $k$-th sample $(X_k, Y_k)$.
For a standard normal random vector $\xi_{k+1} \sim \mathcal{N}(0, I_d)$ that is independent of all the other randomness, the quantity $(\nabla L(\theta_k))^\top \xi_{k+1}\,\xi_{k+1}$ is called the (weight-perturbed) forward gradient [2,22]. (Weight-perturbed) forward gradient descent is then given by the update rule

$\theta_{k+1} = \theta_k - \alpha_{k+1}\,\big(\nabla L(\theta_k)\big)^\top \xi_{k+1}\,\xi_{k+1}, \qquad k = 0, 1, \ldots \qquad (2.2)$

Assuming that the exogenous noise has unit variance is sufficient, as generalizing to $\xi_{k+1} \sim \mathcal{N}(0, \sigma^2 I_d)$ with variance parameter $\sigma^2$ does not add flexibility to the procedure. Indeed, by rescaling the learning rate $\alpha_{k+1} \to \sigma^{-2}\alpha_{k+1}$, we recover (2.2).
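The update rule (2.2) requires only the directional derivative of the loss in the random direction $\xi_{k+1}$. The following is a minimal NumPy sketch of one update step for a generic loss supplied through its gradient; the function name forward_gradient_step and the illustrative squared-loss example are our choices and not part of [2,22].

import numpy as np

def forward_gradient_step(theta, grad, lr, rng):
    """One step of weight-perturbed forward gradient descent, cf. (2.2)."""
    xi = rng.standard_normal(theta.shape[0])    # exogenous noise xi ~ N(0, I_d)
    directional = grad(theta) @ xi              # scalar (grad L(theta))^T xi
    return theta - lr * directional * xi        # move along xi, scaled by the directional derivative

# illustrative usage on the squared loss of a single data point (X, Y)
rng = np.random.default_rng(1)
d = 10
X, Y = rng.standard_normal(d), 0.5
grad = lambda theta: -(Y - X @ theta) * X       # gradient of 0.5 * (Y - X^T theta)^2
theta = np.zeros(d)
theta = forward_gradient_step(theta, grad, lr=0.01, rng=rng)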
Since for a deterministic $d$-dimensional vector $v$, one has $\mathbb{E}\big[v^\top \xi_{k+1}\,\xi_{k+1}\big] = v$, taking the expectation of the weight-perturbed forward gradient descent scheme with respect to the exogenous randomness induced by $\xi_1, \xi_2, \ldots$ gives

$\mathbb{E}_\xi\big[\theta_{k+1}\big] = \mathbb{E}_\xi\big[\theta_k\big] - \alpha_{k+1}\,\mathbb{E}_\xi\big[\nabla L(\theta_k)\big], \qquad (2.3)$

resembling the SGD dynamic (2.1). If $\nabla L(\theta_k)$ depends on $\theta_k$ linearly, then also $\mathbb{E}_\xi[\theta_{k+1}] = \mathbb{E}_\xi[\theta_k] - \alpha_{k+1}\,\nabla L\big(\mathbb{E}_\xi[\theta_k]\big)$. While forward gradient descent is thus related to SGD in expectation, the randomness of the $d$-dimensional random vectors $\xi_{k+1}$ induces a large amount of noise. Controlling the high noise level in the dynamic is the main obstacle in the mathematical analysis. One of the implications is that one has to make small steps by choosing a small learning rate to avoid completely erratic behavior. This particularly affects the first phase of the learning.
A first-order multivariate Taylor expansion shows that $L(\theta_k + \xi_{k+1}) - L(\theta_k)$ and $(\nabla L(\theta_k))^\top \xi_{k+1}$ are close. Therefore, forward gradient descent is related to the zeroth-order method [17,23]. Consequently, forward gradient descent can be viewed as an intermediate step between gradient descent, with full access to the gradient, and zeroth-order methods that are solely based on (randomly) perturbed function evaluations.
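The following short numerical check illustrates this Taylor argument for the squared loss of a single data point: a rescaled zeroth-order difference of the loss is compared with the directional derivative used by the forward gradient. The small perturbation scale h is introduced here only to make the first-order Taylor approximation accurate; the concrete loss and all numerical choices are illustrative.

import numpy as np

rng = np.random.default_rng(2)
d = 10
X, Y = rng.standard_normal(d), 1.0
theta = rng.standard_normal(d)

loss = lambda t: 0.5 * (Y - X @ t) ** 2
grad = lambda t: -(Y - X @ t) * X

xi = rng.standard_normal(d)
h = 1e-3                                                   # small perturbation scale, for illustration only
zeroth_order = (loss(theta + h * xi) - loss(theta)) / h    # rescaled loss difference
forward_grad = grad(theta) @ xi                            # directional derivative used in (2.2)
print(zeroth_order, forward_grad)                          # the two numbers agree up to O(h)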
To complete this section, we briefly compare forward gradient descent with feedback alignment, as both methods are motivated by biological learning and are based on additional randomness. As mentioned in the introduction, the brain cannot do gradient descent.
Backpropagation is an algorithm to compute the gradient and consists of a forward pass and a backward pass. While biological neural networks can execute a forward pass and evaluate the loss for a training sample, issues arise in the implementation of the backward pass. Inspired by biological learning, feedback alignment proposes to replace the learned weights in the backward pass by random weights chosen at the start of the training procedure [15,16]. The so-called direct feedback alignment method goes even further: instead of back-propagating the gradient through all the layers of the network by the chain rule, layers are updated with the gradient of the output layer multiplied with a fixed random weight matrix [20,14]. (Direct) feedback alignment causes the forward weights to change in such a way that the true gradient of the network weights and the substitutes used in the update rule become more aligned [15,20,16]. The linear model can be viewed as a neural network without hidden layers. The absence of layers means that in the backward step, no weight information is transported between different layers. As a consequence, both feedback alignment and direct feedback alignment collapse in the linear model into standard gradient descent. The conclusion is that feedback alignment and forward gradient descent are not comparable. The argument also shows that, to unveil nontrivial statistical properties of feedback alignment, one has to go beyond the linear model. We leave this statistical analysis as an open problem.

3 Convergence rates in the linear regression model
We analyze weight-perturbed forward gradient descent for data generated from the $d$-dimensional linear regression model with Gaussian random design. In this framework, we observe i.i.d. pairs $(X_i, Y_i)$, $i = 1, 2, \ldots$, with

$Y_i = X_i^\top \theta_\star + \epsilon_i, \qquad X_i \sim \mathcal{N}(0, \Sigma), \qquad (3.1)$

with $\theta_\star$ the unknown $d$-dimensional regression vector, $\Sigma$ an unknown covariance matrix, and independent noise variables $\epsilon_i$ with mean zero and variance one.
For the analysis, we consider the squared loss $L(\theta_k) = L(\theta_k, X_k, Y_k) = \tfrac{1}{2}\big(Y_k - X_k^\top\theta_k\big)^2$. The gradient is given by

$\nabla L(\theta_k) = -\big(Y_k - X_k^\top\theta_k\big)\,X_k. \qquad (3.2)$

We now analyze the forward gradient estimator, assuming that the initial value $\theta_0$ can be random or deterministic but should be independent of the data. We employ a similar proving strategy as in the recent analysis of dropout in the linear model in [4]. In particular, we will derive a recursive formula for $A_k := \mathbb{E}\big[\|\theta_k - \theta_\star\|_2^2\big]$. In contrast to that work, we consider a different form of noise and non-constant learning rates.
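For concreteness, the following snippet generates one observation from the Gaussian-design linear regression model and evaluates the squared loss and its gradient (3.2). The dimension, the covariance matrix, and the true parameter are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(3)
d = 5
Sigma = np.diag(np.linspace(1.0, 2.0, d))        # illustrative positive definite covariance
theta_star = np.ones(d)                          # regression vector, chosen here for illustration

def sample(rng):
    """Draw one observation (X, Y) from the linear regression model with Gaussian design."""
    X = rng.multivariate_normal(np.zeros(d), Sigma)
    eps = rng.standard_normal()                  # noise with mean zero and variance one
    return X, X @ theta_star + eps

X, Y = sample(rng)
theta = np.zeros(d)
loss = 0.5 * (Y - X @ theta) ** 2                # squared loss of the single sample
grad = -(Y - X @ theta) * X                      # gradient, cf. (3.2)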
The first result shows that forward gradient descent does gradient descent in expectation.
Theorem 3.1. Consider forward gradient descent (2.2). Then, for all $k \geq 0$,

$\mathbb{E}\big[\theta_{k+1} - \theta_\star\big] = \big(I_d - \alpha_{k+1}\Sigma\big)\,\mathbb{E}\big[\theta_k - \theta_\star\big].$
The proof does not exploit the Gaussian design and only requires that the $X_i$ are centered with covariance matrix $\Sigma$. The exogenous randomness induced by $\xi_1, \xi_2, \ldots$ disappears in the expected values but heavily influences the recursive expressions for the second moments.
Since $A_k$ depends quadratically on $\theta_k$, the fourth moments of the design vectors $X_i$ and of the exogenous random vectors $\xi_k$ play a role in this recursion.
Keeping track of the condition number $\kappa(\Sigma) = \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)$ and building on Theorem 3.2, we can establish the following risk bound for forward gradient descent.
Theorem 3.3 (Mean squared error). Consider forward gradient descent (2.2) and assume that $\Sigma$ is positive definite. For a constant $a > 2$, the learning rate is chosen as

$\alpha_k = \frac{a}{\lambda_{\min}(\Sigma)\big(k + a\kappa^2(\Sigma)(d+2)^2\big)}. \qquad (3.4)$
In the upper bound of Theorem 3.3, the risk $\mathbb{E}\|\theta_0 - \theta_\star\|_2^2$ of the initial estimate $\theta_0$ appears. A realistic scenario is that the entries of $\theta_\star$ and $\theta_0$ are all of order one. In this case, the risk of the initial estimate scales with the number of parameters $d$. Taking $a = \log(d)$ (for $d \geq 8 > e^2$, such that $a > \log(e^2) = 2$), Theorem 3.3 yields the convergence rate $d^2\log(d)/k$ for all $k \geq e^2 d^2\log(d)$. This means that forward gradient descent has dimension dependence $d^2\log(d)$. This is by a factor $d\log(d)$ worse than the minimax rate for the linear regression problem [29,10,18]. In contrast, methods that have access to the gradient can achieve the optimal dimension dependence in the rate [21,12]. The obtained convergence rate is in line with results for zeroth-order methods, which show that for convex optimization problems these methods have a higher dimension dependence [7,17,19].
Assuming that the covariance matrix Σ is positive definite is standard for linear regression with random design [10,18,25].
For $k \gtrsim d^2$, the decrease of the learning rate $\alpha_k$ is of the order $1/k$, which is the standard choice [11,9,3]. A constant learning rate is used for Ruppert-Polyak averaging in [21,9]. For least squares linear regression, it is possible to achieve (near) optimal convergence with a constant (universal) stepsize [1]. Conditions under which a constant (universal) stepsize works or fails in more general settings than linear least squares are investigated in [12].
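The learning-rate schedule of Theorem 3.3 can be computed directly from $d$, $a$, and the spectrum of $\Sigma$. The sketch below evaluates (3.4) for an illustrative covariance matrix and shows the nearly constant initial phase up to $k \approx a\kappa^2(\Sigma)(d+2)^2$ and the $1/k$ decay afterwards; all numerical choices are ours.

import numpy as np

def learning_rate(k, d, Sigma, a):
    """Learning rate alpha_k of (3.4): a / (lambda_min(Sigma) * (k + a * kappa(Sigma)^2 * (d+2)^2))."""
    eigvals = np.linalg.eigvalsh(Sigma)
    lam_min, lam_max = eigvals[0], eigvals[-1]
    kappa = lam_max / lam_min                          # condition number of Sigma
    return a / (lam_min * (k + a * kappa**2 * (d + 2) ** 2))

d = 10
Sigma = np.diag(np.linspace(1.0, 2.0, d))
a = np.log(d)                                          # the choice a = log(d) discussed above (> 2 for d >= 8)
for k in [1, 10**3, 10**5]:
    print(k, learning_rate(k, d, Sigma, a))            # roughly constant at first, later of order 1/k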

4 Proofs
Proof of Theorem 3.1. By (3.2) and the linear regression model, $\nabla L(\theta_k) = -\big(Y_k - X_k^\top\theta_k\big)X_k = -\epsilon_k X_k - X_kX_k^\top(\theta_\star - \theta_k)$. Since $(X_k, \epsilon_k)$ is independent of $\theta_k$, $\epsilon_k$ has mean zero and $\mathbb{E}[X_kX_k^\top] = \Sigma$, this gives $\mathbb{E}[\nabla L(\theta_k)] = -\Sigma\,\mathbb{E}[\theta_\star - \theta_k]$. Combined with (2.3), we find $\mathbb{E}[\theta_{k+1}] = \mathbb{E}[\theta_k] + \alpha_{k+1}\Sigma\,\mathbb{E}[\theta_\star - \theta_k]$. The true parameter $\theta_\star$ is deterministic. Subtracting $\theta_\star$ on both sides yields the claimed identity $\mathbb{E}[\theta_{k+1} - \theta_\star] = (I_d - \alpha_{k+1}\Sigma)\,\mathbb{E}[\theta_k - \theta_\star]$.

Proof of Theorem 3.2
Lemma 4.1. If $Z \sim \mathcal{N}(0, \Gamma)$ is a $d$-dimensional random vector and $U$ is a $d$-dimensional random vector that is independent of $Z$, then

$\mathbb{E}\big[ZZ^\top UU^\top ZZ^\top\big] = 2\,\Gamma\,\mathbb{E}\big[UU^\top\big]\,\Gamma + \mathbb{E}\big[U^\top\Gamma U\big]\,\Gamma.$

Proof. Because $U$ and $Z$ are independent, the $(i,j)$-th entry of the $d \times d$ matrix $\mathbb{E}[ZZ^\top UU^\top ZZ^\top]$ is $\sum_{k,\ell=1}^d \mathbb{E}[U_kU_\ell]\,\mathbb{E}[Z_iZ_kZ_\ell Z_j]$. Since $Z \sim \mathcal{N}(0, \Gamma)$, $\mathbb{E}[Z_iZ_kZ_\ell Z_j] = \Gamma_{ik}\Gamma_{\ell j} + \Gamma_{i\ell}\Gamma_{kj} + \Gamma_{ij}\Gamma_{k\ell}$, see for instance the example at the end of Section 2 in [28]. Thus

$\Big(\mathbb{E}\big[ZZ^\top UU^\top ZZ^\top\big]\Big)_{ij} = 2\sum_{k,\ell=1}^d \Gamma_{ik}\,\mathbb{E}[U_kU_\ell]\,\Gamma_{\ell j} + \Gamma_{ij}\,\mathbb{E}\big[U^\top\Gamma U\big].$

For a vector $a = (a_1, \ldots, a_d)^\top$, the scalar $a_ia_j$ is the $(i,j)$-th entry of the matrix $aa^\top$. Combined with the previous display, the result follows.
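A quick Monte Carlo check of the fourth-moment identity in Lemma 4.1 can be carried out as follows; the covariance matrix $\Gamma$ and the distribution of $U$ are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 500_000
Gamma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

Z = rng.multivariate_normal(np.zeros(d), Gamma, size=n)
U = rng.standard_normal((n, d)) * np.array([1.0, 2.0, 0.5])   # any square-integrable U independent of Z

# Z Z^T U U^T Z Z^T = (Z^T U)^2 * Z Z^T, so average this quantity over the samples
s = (Z * U).sum(axis=1)                                        # scalars Z^T U, one per sample
lhs = np.einsum('n,ni,nj->ij', s**2, Z, Z) / n                 # Monte Carlo estimate of the left-hand side

EUU = U.T @ U / n                                              # estimate of E[U U^T]
rhs = 2 * Gamma @ EUU @ Gamma + np.trace(EUU @ Gamma) * Gamma  # right-hand side of the identity
print(np.round(lhs, 2))
print(np.round(rhs, 2))                                        # the two matrices agree up to Monte Carlo error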
Proof of Theorem 3.2. As Theorem 3.2 only involves one update step, we can simplify the notation by dropping the index $k$ and analyzing $\theta'' = \theta' - \alpha\,\big(\nabla L(\theta')\big)^\top \xi\,\xi$ for one data point $(X, Y)$ and independent $\xi \sim \mathcal{N}(0, I_d)$. We then have to prove the claimed recursion for this one update step.
Substituting the update rule (2.2) in $A_k$ and expanding, using the linearity of the transpose, splits the expression into a squared term and terms carrying a minus sign. First, consider the terms with the minus sign. The random vector $\xi$ is independent of all the other randomness and satisfies $\mathbb{E}[\xi\xi^\top] = I_d$. Taking the transpose and the tower rule, we can evaluate these terms.

In a next step, we derive an expression for the squared term. Arguing as in the proof of Theorem 3.1 gives $\nabla L(\theta') = -\epsilon X - XX^\top(\theta_\star - \theta')$. Because $\epsilon$ has mean zero and variance one and is independent of $(X, \theta')$, the cross terms involving $\epsilon$ vanish; for the remaining terms we use that $X^\top(\theta_\star - \theta')$ is a scalar and that $X \sim \mathcal{N}(0, \Sigma)$. Since $X \sim \mathcal{N}(0, \Sigma)$ is independent of $\theta'$, Lemma 4.1 can be applied with $Z = X$, $\Gamma = \Sigma$ and $U = \theta_\star - \theta'$. Substituting this back yields the claimed recursion.

Building on the recursion of Theorem 3.2, the risk bound of Theorem 3.3 follows by bounding and iterating. For a positive semi-definite matrix $A$ and a vector $v$, the min-max theorem states that $\lambda_{\min}(A)\,\|v\|_2^2 \le v^\top A v \le \lambda_{\max}(A)\,\|v\|_2^2$. Using that for a vector $x$ it holds that $\operatorname{tr}(xx^\top) = \|x\|_2^2$, and that the spectral norm of a positive semi-definite matrix is equal to its largest eigenvalue, the recursion can be iterated, where we use the convention that the (empty) product over zero terms is assigned the value $1$. For ease of notation define $c_d := a\kappa^2(\Sigma)(d+2)^2$, with condition number $\kappa(\Sigma) = \|\Sigma\|_S/\lambda_{\min}(\Sigma)$. From the definition of $\alpha_k$ in (3.4), it follows that $\alpha_k = \frac{a}{\lambda_{\min}(\Sigma)} \cdot \frac{1}{k + c_d}$. Using that $1 + x \le e^x$ for all real numbers $x$, we obtain a bound on the product of the contraction factors for all integers $k_* < k$. Using that $0 < a/(a-1) < 2$ for $a > 2$ now yields the result.