Low-Variance Forward Gradients using Direct Feedback Alignment and Momentum

Supervised learning in Deep Neural Networks (DNNs) is commonly performed using the error Backpropagation (BP) algorithm. The sequential propagation of errors and the transport of weights during the backward pass limit its efficiency and scalability. Therefore, there is growing interest in finding local alternatives to BP. Recently, methods based on Forward-Mode Automatic Differentiation have been proposed, such as the Forward Gradient algorithm and its variants. However, Forward Gradients suffer from high variance in large DNNs, which affects convergence. In this paper, we address the variance problem of Forward Gradients by proposing the Forward Direct Feedback Alignment (FDFA) algorithm, which combines Activity-Perturbed Forward Gradients with Direct Feedback Alignment and momentum to compute low-variance gradient estimates.


Introduction
Over the past decades, the Backpropagation (BP) algorithm (Rumelhart et al., 1986) has emerged as a crucial technique for training Deep Neural Networks (DNNs). However, despite its success, BP has limitations that restrict its efficiency and scalability.
First, by sequentially propagating errors through multiple layers, BP limits the ability to parallelize the backward pass; this is often referred to as Backward Locking (Nøkland, 2016; Launay et al., 2020; Huo et al., 2018) and leads to time-consuming gradient computations. Second, BP relies on the transport of symmetric weights during the backward pass. Known as the weight transport problem (Akrout et al., 2019; Lillicrap et al., 2016; Nøkland, 2016), this is a source of significant power consumption on dedicated neural processors (Crafton et al., 2019; Launay et al., 2020; Han and Yoo, 2019; Han et al., 2019) and represents a major obstacle to the implementation of BP on low-powered continuous-time neuromorphic hardware (Neftci et al., 2017). Therefore, there is a growing need for alternatives to BP that can parallelize gradient computations without requiring global knowledge of the entire network. Such algorithms can help overcome the limitations of BP and enable energy-efficient and scalable deep learning.
Recently, approaches based on Forward-Mode Automatic Differentiation have received attention (Margossian, 2019). Referred to as Forward Gradients (Baydin et al., 2022; Silver et al., 2021; Ren et al., 2023), these methods evaluate directional derivatives along random directions during the forward pass to compute unbiased gradient estimates without backpropagation. However, in large DNNs, forward gradients can yield high-variance gradient estimates, which has a detrimental effect on convergence (Ren et al., 2023; Silver et al., 2021).
While stochastic gradient descent (SGD) with unbiased gradient estimates is guaranteed to converge for sufficiently small learning rates (Robbins and Monro, 1951), theoretical analyses have shown that the convergence of SGD improves as the variance of the gradient estimates decreases (Bubeck et al., 2015; Bottou et al., 2018; Murata, 1999; Moulines and Bach, 2011; Needell et al., 2014; Chee and Toulis, 2018; Gower et al., 2019; Faghri et al., 2020). Low-variance estimators are more consistent and therefore converge better than high-variance ones.
One approach to address the variance issue of forward gradients is to perturb neuron activations instead of weights (Ren et al., 2023). Because DNNs typically have fewer neurons than weights, perturbing neurons requires estimating fewer derivatives through forward gradients, leading to lower variance. Additionally, local greedy loss functions, where each loss function trains only a small portion of the network, can further reduce the variance of the estimates (Ren et al., 2023). Ren et al. showed that this approach improves the convergence of Local MLP Mixers over the original FG algorithm. However, the method only circumvents large gradient variance by ensuring that each local loss function trains a small number of neurons; alternative solutions may better reduce the variance of forward gradients in DNNs.
In this work, we propose the Forward Direct Feedback Alignment (FDFA) algorithm, a method that combines Activity-Perturbed Forward Gradients (Ren et al., 2023) with Direct Feedback Alignment (Nøkland, 2016) and momentum to compute low-variance gradient estimates. Our method addresses the limitations of BP by avoiding sequential error propagation and the backward transport of weights. We present theoretical and empirical results that demonstrate the effectiveness of FDFA in reducing the variance of gradient estimates, enabling fast convergence with DNNs. Compared to other Forward Gradient and Direct Feedback Alignment methods, FDFA achieves better performance, making it a promising alternative for scalable and energy-efficient training of DNNs.

Background
We start by reviewing the Forward Gradient algorithm (Baydin et al., 2022) and its DNN applications (Silver et al., 2021; Ren et al., 2023), as well as Direct Feedback Alignment (Nøkland, 2016), which form the technical foundation of our FDFA algorithm.

Forward Gradients
Figure 1: Figure 1a: Projection of the Jacobian J at w onto a given direction v. The vector d · v is obtained by scaling the direction v by the directional derivative d evaluated at w in the direction of v. Figure 1b: The expected directional derivative (green arrow), computed by averaging directional gradients (red arrows) over many random directions (black arrows), is an unbiased estimate of the true gradient (blue arrow).
The Forward Gradient (FG) algorithm (Baydin et al., 2022; Silver et al., 2021) is a recent weight perturbation technique that uses Forward-Mode Automatic Differentiation (AD) (Margossian, 2019) to estimate gradients without backpropagation. Consider a differentiable function $f : \mathbb{R}^m \to \mathbb{R}^n$ and a vector $v \in \mathbb{R}^m$. Forward-Mode AD evaluates the directional derivative $d = J \cdot v$ of $f$ in the direction $v$. Here, $J \in \mathbb{R}^{n \times m}$ is the Jacobian matrix of $f$, and $d$ is obtained by computing the matrix-vector product between $J$ and $v$ during the function evaluation.
By sampling each element of the direction vector $v_i \sim \mathcal{N}(0, 1)$ from a standard normal distribution and multiplying each computed directional derivative back by $v$, an unbiased estimate of the Jacobian is computed:
$$\hat{J} = d \otimes v = (J \cdot v) \otimes v,$$
where $\otimes$ is the outer product. See (Baydin et al., 2022) or Theorem 4 in the Appendix for the proof of unbiasedness, and Figure 1 for a visual representation of forward gradients.
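To make the mechanics concrete, here is a minimal sketch (not from the paper) of a forward gradient computed with JAX's forward-mode AD; the toy objective `f` and all sizes are illustrative assumptions.

```python
# Hypothetical sketch of a forward gradient estimate with JAX (not code from the paper).
import jax
import jax.numpy as jnp

def f(w):
    # Toy scalar objective; stands in for a network loss L(w).
    return jnp.sum(jnp.tanh(w) ** 2)

key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (5,))

def forward_gradient(key, w):
    # Draw a random direction v ~ N(0, I).
    v = jax.random.normal(key, w.shape)
    # Forward-mode AD returns f(w) and the directional derivative d = <grad f(w), v>.
    _, d = jax.jvp(f, (w,), (v,))
    # Scaling v by d gives an unbiased estimate of grad f(w).
    return d * v

# Averaging over many directions recovers the true gradient (Figure 1b).
keys = jax.random.split(key, 10_000)
estimate = jnp.mean(jax.vmap(forward_gradient, in_axes=(0, None))(keys, w), axis=0)
print(jnp.allclose(estimate, jax.grad(f)(w), atol=1e-1))
```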

Weight-Perturbed Forward Gradient
When the Weight-Perturbed FG algorithm is applied to DNNs, a random perturbation matrix $V^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is drawn from a standard normal distribution for each weight matrix $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ of each layer $l \leq L$. Forward-Mode AD thus computes the directional derivative $d^{FG}$ of the loss $\mathcal{L}(x)$ along the drawn perturbations:
$$d^{FG} = \sum_{l=1}^{L} \left\langle \frac{\partial \mathcal{L}(x)}{\partial W^{(l)}},\, V^{(l)} \right\rangle.$$
The weight-perturbed forward gradient $g^{FG}(W^{(l)})$ for the weights of layer $l$ is obtained by scaling its perturbation matrix $V^{(l)}$ with the directional derivative $d^{FG}$:
$$g^{FG}(W^{(l)}) = d^{FG}\, V^{(l)}.$$
See Algorithm 2 in the Appendix for the full algorithm applied to a fully-connected DNN.
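As an illustration only, the following sketch applies this idea to every weight matrix of a small two-layer network; the layer sizes, squared-error loss, and parameter names are hypothetical and not the paper's experimental setup.

```python
# Hypothetical sketch of Weight-Perturbed FG for a small fully-connected network.
import jax
import jax.numpy as jnp

def loss(params, x, y):
    h = jax.nn.relu(x @ params["W1"])
    logits = h @ params["W2"]
    return jnp.mean((logits - y) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = {"W1": jax.random.normal(k1, (8, 16)) * 0.1,
          "W2": jax.random.normal(k2, (16, 4)) * 0.1}
x, y = jnp.ones((2, 8)), jnp.zeros((2, 4))

# Draw a perturbation matrix V^(l) ~ N(0, I) for every weight matrix.
V = {name: jax.random.normal(k, W.shape)
     for (name, W), k in zip(params.items(), jax.random.split(k3, len(params)))}

# One forward pass gives d_FG = sum_l <dL/dW^(l), V^(l)>.
_, d_fg = jax.jvp(lambda p: loss(p, x, y), (params,), (V,))

# g_FG(W^(l)) = d_FG * V^(l): one shared scalar scales every perturbation matrix.
g_fg = {name: d_fg * v for name, v in V.items()}
```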
However, it has been shown that the variance of Weight-Perturbed FG scales poorly with the number of parameters in DNNs, which impairs the convergence of SGD (Ren et al., 2023).

Activity-Perturbed Forward Gradient (FGA)
To address the variance issues of Weight-Perturbed FG, Ren et al. proposed the Activity-Perturbed FG (FGA) algorithm, which perturbs neuron activations instead of weights.
In FGA, a perturbation vector $u^{(l)} \in \mathbb{R}^{n_l}$ is drawn from a multivariate standard normal distribution for each layer $l \leq L$. Forward-Mode AD thus computes the following directional derivative:
$$d^{FGA} = \sum_{l=1}^{L} \left\langle \frac{\partial \mathcal{L}(x)}{\partial y^{(l)}},\, u^{(l)} \right\rangle,$$
where $y^{(l)}$ are the activations of the $l$-th layer. Note that the directional derivative computed in the FGA algorithm is defined as a sum over all neurons rather than all weights. The activity-perturbed forward gradient $g^{FGA}(W^{(l)})$ is then obtained by treating $d^{FGA} u^{(l)}$ as an estimate of $\partial \mathcal{L}(x) / \partial y^{(l)}$ and applying the chain rule locally to recover a weight gradient. See Algorithm 3 in the Appendix for the full algorithm applied to a fully-connected DNN.
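A hedged sketch of the activity-perturbed variant is shown below; for illustration the perturbation is injected into each layer's activities during the forward pass, which is an assumed convention rather than the paper's exact definition, and all sizes are illustrative.

```python
# Sketch of Activity-Perturbed FG (FGA) on a toy two-layer network (not the paper's code).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3, k4, k5 = jax.random.split(key, 5)
W1 = jax.random.normal(k1, (8, 16)) * 0.1
W2 = jax.random.normal(k2, (16, 4)) * 0.1
x = jax.random.normal(k3, (8,))
target = jnp.zeros(4)

def loss_with_perturbation(eps):
    # eps = (eps1, eps2) is added to each layer's activities, so the directional
    # derivative with respect to eps equals the one with respect to the activities.
    z1 = x @ W1 + eps[0]
    h1 = jax.nn.relu(z1)
    z2 = h1 @ W2 + eps[1]
    return jnp.sum((z2 - target) ** 2), (x, h1)

# Perturbation vectors u^(l) ~ N(0, I), one per layer.
u = (jax.random.normal(k4, (16,)), jax.random.normal(k5, (4,)))
zeros = (jnp.zeros(16), jnp.zeros(4))

# One forward pass yields d_FGA = sum_l <dL/dz^(l), u^(l)> plus the layer inputs.
(_, (y0, y1)), (d_fga, _) = jax.jvp(loss_with_perturbation, (zeros,), (u,))

# d_FGA * u^(l) estimates dL/dz^(l); an outer product with the previous layer's
# activations turns it into a weight-gradient estimate.
g_W1 = jnp.outer(y0, d_fga * u[0])   # same shape as W1
g_W2 = jnp.outer(y1, d_fga * u[1])   # same shape as W2
```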
This method reduces the number of derivatives to estimate, since the number of neurons is considerably lower than the number of weights (see Table 5 in the Appendix for examples). Consequently, the method achieves lower variance than Weight-Perturbed FG.
Activity-Perturbed FG demonstrated improvements over Weight-Perturbed FG on several benchmark datasets (Ren et al., 2023) but still suffers from high variance, as the number of neurons in DNNs remains large. To circumvent this issue, Ren et al. proposed the Local Greedy Activity-Perturbed Forward Gradient (LG-FGA) method, which uses local loss functions to partition the gradient computation and decrease the number of derivatives to estimate. By adopting this local greedy strategy, LG-FGA greatly improved the performance of Local MLP Mixers, a specific architecture that uses shallow multi-layer perceptrons to perform vision tasks without convolutions. However, no results were reported for conventional fully-connected or convolutional neural networks, where the number of neurons per layer is larger than in MLP Mixers.

Figure 2: Illustrations of error backpropagation (Figure 2a) and Direct Feedback Alignment (Figure 2b). Solid arrows represent forward paths and dotted arrows represent backpropagation paths.

Direct Feedback Alignment (DFA)
While the Backpropagation (BP) algorithm relies on symmetric weights to propagate errors to hidden layers, it has been shown that weight symmetry is not mandatory to achieve learning (Lillicrap et al., 2016). For example, Random Feedback Alignment (FA) (Lillicrap et al., 2016) demonstrated that random fixed weights can be used to backpropagate errors and still achieve learning.
The Random Direct Feedback Alignment (DFA) algorithm takes the idea of FA one step further by directly projecting output errors to hidden layers through fixed linear feedback connections (Nøkland, 2016) (see Figure 2). Feedback matrices $B^{(l)} \in \mathbb{R}^{n_L \times n_l}$ replace the derivative $\partial y^{(L)} / \partial y^{(l)}$ of the output neurons with respect to the hidden neurons. The approximate gradient $g^{DFA}(W^{(l)})$ for the weights of hidden layer $l$ is then computed as follows:
$$g^{DFA}(W^{(l)}) = \left[ \left(B^{(l)}\right)^{\top} e \odot f'\!\left(z^{(l)}\right) \right] \left(y^{(l-1)}\right)^{\top},$$
where $e = \partial \mathcal{L}(x) / \partial y^{(L)}$ is the output error, $f'$ is the derivative of the activation function, and $z^{(l)}$ are the pre-activations of layer $l$. In DFA, the feedback matrices are usually randomly drawn and kept fixed during training.
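For concreteness, a minimal sketch of one Random DFA update for a single hidden layer might look as follows; the squared-error loss, shapes, and variable names are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch of a Random DFA update for one hidden layer.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
W1 = jax.random.normal(k1, (8, 16)) * 0.1      # hidden weights
W2 = jax.random.normal(k2, (16, 4)) * 0.1      # output weights
B1 = jax.random.normal(k3, (4, 16)) * 0.1      # fixed random feedback, shape (n_L, n_1)
x, target = jax.random.normal(k4, (8,)), jnp.zeros(4)

# Forward pass.
z1 = x @ W1
h1 = jax.nn.relu(z1)
y_out = h1 @ W2

# Output error e = dL/dy_out for a squared-error loss.
e = y_out - target

# DFA: project e directly to the hidden layer through B1 instead of W2^T.
delta1 = (e @ B1) * (z1 > 0)                   # (n_L,) @ (n_L, n_1) -> (n_1,)
g_W1 = jnp.outer(x, delta1)                    # replaces the BP gradient for W1
g_W2 = jnp.outer(h1, e)                        # output layer is trained as in BP
```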
The success of the Random DFA algorithm depends on the alignment between the forward and feedback weights, which results in alignment between the approximate and true gradients (Lillicrap et al., 2016; Nøkland, 2016; Refinetti et al., 2021). When the angle between these gradients is within 90 degrees, the update direction is a descending direction (Lillicrap et al., 2016; Nøkland, 2016).
DFA is able to scale to modern deep learning architectures such as Transformers (Launay et al., 2020) but is unable to train deep convolutional layers (Launay et al., 2019) and fails to learn challenging datasets such as CIFAR-100 or ImageNet without transfer learning (Bartunov et al., 2018; Crafton et al., 2019). However, recent methods that learn symmetric feedback, such as the Direct Kolen-Pollack (DKP) algorithm (Webster et al., 2020), have shown promising results with convolutional neural networks due to improved gradient alignment.
Algorithm 1 Forward Direct Feedback Alignment algorithm with a fully-connected DNN.


Method
In this section, we describe our proposed Forward Direct Feedback Alignment (FDFA) algorithm, which uses forward gradients to estimate derivatives as direct feedback connections.
Similarly to FGA, we sample perturbation vectors $u^{(l)} \in \mathbb{R}^{n_l}$ for each layer $l \leq L$ from a multivariate standard normal distribution. However, in contrast to FGA, we use the directional derivatives computed at the output layer rather than at the loss function:
$$d = \sum_{l \leq L} \frac{\partial y^{(L)}}{\partial y^{(l)}}\, u^{(l)},$$
which produces a vector of directional derivatives. Rather than relying solely on the most recent forward gradient, as in FG and FGA, a more accurate estimate of $\partial y^{(L)} / \partial y^{(l)}$ can be obtained by averaging the forward gradients over past training steps.
Formally, we define an update rule for the feedback connections that performs an exponential average of the estimates:
$$B^{(l)} \leftarrow (1 - \alpha)\, B^{(l)} + \alpha\, d \otimes u^{(l)},$$
which can also be rewritten in a form compatible with the stochastic gradient descent algorithm:
$$B^{(l)} \leftarrow B^{(l)} - \alpha\, \Delta B^{(l)}, \quad \text{where} \quad \Delta B^{(l)} = B^{(l)} - d \otimes u^{(l)},$$
and $\alpha$ is the feedback learning rate. The full algorithm applied to a fully-connected DNN is given in Algorithm 1.
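A minimal sketch of this feedback update, assuming the exponential-average rule written above and an illustrative two-layer network, could look as follows (this is not the paper's reference implementation, and the perturbation convention is an assumption).

```python
# Minimal sketch of the FDFA feedback update (illustrative shapes and names).
import jax
import jax.numpy as jnp

alpha = 1e-4                                     # feedback learning rate
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
W1 = jax.random.normal(k1, (8, 16)) * 0.1
W2 = jax.random.normal(k2, (16, 4)) * 0.1
B1 = jnp.zeros((4, 16))                          # learned feedback, initialized to 0
x = jax.random.normal(k3, (8,))

def output(eps1):
    # Network output with a perturbation added to the hidden activations.
    h1 = jax.nn.relu(x @ W1) + eps1
    return h1 @ W2

u1 = jax.random.normal(k4, (16,))
# Forward-mode AD gives the vector of directional derivatives d = (dy_out/dh1) u1.
y_out, d = jax.jvp(output, (jnp.zeros(16),), (u1,))

# d ⊗ u1 is a one-sample estimate of dy_out/dh1; the feedback matrix accumulates
# these estimates with an exponential moving average (momentum).
B1 = (1.0 - alpha) * B1 + alpha * jnp.outer(d, u1)

# The weight update itself then follows the DFA rule, with the learned B1 in place
# of a fixed random feedback matrix (see Algorithm 1 in the Appendix).
```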
Momentum is believed to mitigate the effect of gradient noise on the convergence of SGD, especially during the early iterations (Defazio, 2020). Hence, incorporating momentum into the derivative estimates of FDFA reduces the variance of the gradient estimates, ultimately leading to improved convergence.

Results
We now present detailed empirical results with our proposed FDFA algorithm and other local alternatives to BP.
To ensure consistency, all results were obtained by implementing each method and training DNNs with identical experimental settings. Details about our experimental settings are given in the Appendix.

Performance
We compared the performance of the proposed Forward Direct Feedback Alignment (FDFA) method with various local alternatives to BP, including Weight-Perturbed Forward Gradient (FG), Activity-Perturbed Forward Gradient (FGA), Local Greedy Activity-Perturbed Forward Gradient (LG-FGA), Random Direct Feedback Alignment (DFA), and the Direct Kolen-Pollack (DKP) algorithm. Overall, the weight-perturbed FG algorithm achieves poor generalization compared to BP. More importantly, its performance significantly decreases with the size of the network and the complexity of the task. For example, the method is unable to converge with AlexNet on the CIFAR100 or Tiny ImageNet 200 datasets, reaching only 3.43% and 0.70% test accuracy respectively. The activity-perturbed FG method slightly improves the generalization of forward gradients but still fails to match the performance of BP. The greedy approach used in LG-FGA further improves performance when the number of neurons per layer is relatively low, as in fully-connected networks for MNIST and Fashion MNIST. However, the method does not perform as well as FGA on networks that contain large layers such as convolutions. This suggests that LG-FGA is most suitable for specific architectures where each local loss function sends error signals to a small number of neurons.
Random DFA achieves performance closer to BP than forward gradient methods.
The performance of direct feedback connections is significantly increased by the DKP algorithm, but a gap still exists with BP, especially on difficult tasks such as CIFAR100 and Tiny ImageNet 200. In contrast, our proposed FDFA method provides performance closer to BP than Random DFA and DKP on all benchmarked networks and datasets. For example, our method doubles the test accuracy of Random DFA and improves the performance of DKP by at least 10% on Tiny ImageNet 200.

Convergence
To evaluate the convergence improvements of our method, we measured the evolution of the training loss and test accuracy during the training process. As shown in Figure 3, the FG algorithm exhibited slower convergence compared to both Random DFA and BP.
Although FGA slightly improves the convergence of FG, it was still unable to converge as quickly as BP and Random DFA. Both the DKP and FDFA algorithms showed better convergence rates than FG, FGA, and Random DFA. However, DKP seems unable to reduce the loss as low as BP with convolutional layers. Finally, our proposed FDFA algorithm achieved a convergence rate similar to BP with fully-connected networks and substantially improved upon the convergence rates of Random DFA and DKP with CNNs (see Figure 3).

Variance
To gain a deeper understanding of the improvements observed with FDFA compared to FG and FGA, we now compare the theoretical variances of each method and empirically demonstrate the significance of gradient variance for the convergence of Stochastic Gradient Descent (SGD).
In Proposition 1, we derive the theoretical variance of our proposed FDFA algorithm under the assumption that the feedback connections have converged to symmetry with the output weights of a two-layer DNN. Proposition 1 analytically shows that the variance of FDFA estimates decreases quadratically with the feedback learning rate $\alpha$ (see Figure 4 for numerical verification). Therefore, by choosing values of $\alpha$ close to zero, the FDFA algorithm produces low-variance gradient estimates, which allows fast convergence with SGD.
Proposition 1 (Variance of Forward Direct Feedback Alignment estimates). Let $W^{(1)} \in \mathbb{R}^{n_1 \times n_0}$ be the hidden weights of a two-layer fully-connected neural network evaluated with an input sample $x \in \mathbb{R}^{n_0}$. We denote by $g^{FDFA}(w^{(1)}_{i,j})$ the forward DFA gradient estimate for the weight $w^{(1)}_{i,j}$ and assume that all elements of the perturbation vector $u^{(2)}$ for the activations of the output layer are 0. We also assume that the feedback matrix $B^{(1)}$ has converged to $\partial y^{(2)} / \partial y^{(1)}$. If each element $u^{(1)}_i \sim \mathcal{N}(0, 1)$ of $u^{(1)}$ follows a standard normal distribution, then the variance of $g^{FDFA}(w^{(1)}_{i,j})$ scales quadratically with the feedback learning rate $\alpha$. Proof. Starting from the gradient estimate $g^{FGA}(w^{(1)}_{i,j})$ and applying the two assumptions above, the estimate reduces to an expression whose stochastic part is scaled by $\alpha$, so that its variance scales quadratically with $\alpha$, which concludes the proof.
Using similar methods, we can derive the theoretical variance of the FG and FGA estimates in a two-layer neural network.
Proposition 2 (Variance of Weight-Perturbed Forward Gradients). Let $W^{(1)} \in \mathbb{R}^{n_1 \times n_0}$ be the hidden weights of a two-layer fully-connected neural network evaluated with an input sample $x \in \mathbb{R}^{n_0}$. We denote by $g^{FG}(w^{(1)}_{i,j})$ the weight-perturbed forward gradient for the weight $w^{(1)}_{i,j}$ and assume that all elements of the perturbation matrix $V^{(2)}$ for the weights of the output layer are 0. If each element $v^{(1)}_{i,j} \sim \mathcal{N}(0, 1)$ of $V^{(1)}$ follows a standard normal distribution, then the variance of $g^{FG}(w^{(1)}_{i,j})$ grows linearly with the number of parameters $n_1 n_0$, in the limit of large $n_1$ and large $n_0$.
Proposition 3 (Variance of Activity-Perturbed Forward Gradients). Let $W^{(1)} \in \mathbb{R}^{n_1 \times n_0}$ be the hidden weights of a two-layer fully-connected neural network evaluated with an input sample $x \in \mathbb{R}^{n_0}$. We denote by $g^{FGA}(w^{(1)}_{i,j})$ the activity-perturbed forward gradient for the weight $w^{(1)}_{i,j}$ and assume that all elements of the perturbation vector $u^{(2)}$ for the activations of the output layer are 0. If each element $u^{(1)}_i \sim \mathcal{N}(0, 1)$ of $u^{(1)}$ follows a standard normal distribution, then the variance of $g^{FGA}(w^{(1)}_{i,j})$ grows linearly with the number of neurons $n_1$, in the limit of large $n_1$ and large $n_0$.
Full proofs of Propositions 2 and 3 are given in the Appendix, and numerical verifications are shown in Figure 5. Table 4 compares these theoretical results.
Equations 16 and 17 show that the variances of FG and FGA scale differently with the number of neurons and parameters in the network: FG scales linearly with the number of parameters, while FGA scales linearly with the number of neurons. This indicates that, in large DNNs, gradient estimates produced by the FG algorithm exhibit larger variance than FGA estimates by a factor of $n_0$. However, as DNNs often contain large numbers of neurons, FGA only mitigates the variance issue of the FG algorithm. In comparison with our proposed FDFA method, both FG and FGA produce high-variance gradient estimates.

Figure 5: Variance of the FG (red triangles) and FGA (blue squares) gradient estimates as a function of the number of neurons $n_1$ (Figure 5a) and the number of inputs $n_0$ (Figure 5b) in a two-layer fully-connected network. Each point was produced by computing the variance of the gradient estimates over 10 iterations of the MNIST dataset. The pixels of the input images were duplicated to increase the number of inputs $n_0$. The red and blue lines were fitted using linear regression; the slope $a$ and the asymptotic standard error of each line are given in the same color. These figures show that the variance of FG scales linearly with both the number of neurons and the number of inputs, while the variance of FGA scales linearly with the number of neurons only.
To understand the significance that the variance of gradient estimates has for convergence, we artificially varied the variance of BP gradient estimates by adding noise and compared the resulting losses with those obtained by FG, FGA, and FDFA (see Figure 6). Both FG and FGA exhibit high variance and correspondingly slow convergence, with a slight advantage for the FGA algorithm. In contrast, our proposed method achieves significantly lower variance and converges faster than the FG and FGA algorithms. These results highlight the necessity of developing low-variance gradient estimators and explain the observed improvement of our method compared to FG and FGA.

Figure 7: Layerwise alignment between gradient estimates and the true gradient computed using BP. The figure shows that the proposed FDFA method (Figure 7c) produces gradient estimates that align better with the true gradients than Random DFA (Figure 7a), which suggests improved descending directions.

Gradient Alignment
We observed that the FDFA algorithm exhibits better convergence compared to Random DFA and DKP, suggesting that it could benefit from stronger gradient alignment. To test this hypothesis, we evaluated the layer-wise gradient alignment of the Random DFA, DKP, and FDFA algorithms in a 5-layer DNN trained on the MNIST dataset. The evolution of the angle between the approximate and true gradients for each layer during 100 epochs is shown in Figure 7. Our results show that the proposed FDFA method achieves faster alignment with the true gradient compared to Random DFA and DKP, which suggests earlier descending updates. Moreover, the gradient alignment is globally improved by 30 degrees compared to Random DFA, indicating that our method provides estimates that are closer to backpropagation. This observation explains the differences in convergence and performance between the methods.
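The alignment reported here can be measured with a simple angle computation between flattened gradient tensors; the sketch below is illustrative and not the paper's evaluation code.

```python
# Hypothetical sketch of the layer-wise alignment measurement: the angle between a
# gradient estimate and the true BP gradient for one layer.
import numpy as np

def gradient_angle_deg(g_estimate, g_true):
    """Angle in degrees between two flattened gradient tensors."""
    a, b = np.ravel(g_estimate), np.ravel(g_true)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Angles below 90 degrees indicate that the approximate update is a descent direction.
print(gradient_angle_deg(np.array([1.0, 0.5]), np.array([1.0, 1.0])))
```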

Discussion
With the increasing size of DNNs and the emergence of low-powered neuromorphic hardware, exploring local alternatives to BP is becoming increasingly important. In this work, we proposed the FDFA algorithm, which combines the FG and DFA algorithms to train DNNs without relying on backpropagation. Due to the averaging process provided by its feedback learning rule, which acts as momentum, our method provides more accurate estimates of the derivatives between output and hidden neurons, improved convergence, and greater performance than other alternatives to BP.
To gain insights into the improvements offered by our proposed method, we compared the theoretical variances of the FDFA, FG, and FGA algorithms and demonstrated the influence of variance on the convergence of SGD. Our theoretical analysis revealed that the variance of FG and FGA estimates scales with the size of DNNs. In contrast, the variance of our proposed FDFA algorithm is quadratically reduced by the feedback learning rate, which is typically chosen close to zero. By showing that convergence improves as variance diminishes, we provided an explanation for the enhanced convergence observed with FDFA. These results shed light on the underlying mechanisms that contribute to the improvements achieved by our approach and highlight the importance of developing low-variance alternatives to BP.
When learning with feedback connections, the alignment between the gradient estimates and the true gradients computed with BP reflects the accuracy of the update directions. Gradients that are close to the true gradient tend to converge faster than those that deviate significantly. In our experiments, we conducted a comparative analysis and demonstrated that our method notably improves the alignment of the approximate gradient with the true gradient, surpassing the alignment achieved by Random DFA and DKP. These results explain the performance and convergence enhancements observed in our experiments on benchmark datasets, particularly when compared with Random DFA and DKP. By improving gradient alignment, our method enables more effective weight updates, resulting in better performance and faster convergence.
Memory requirements are also an important aspect to consider when developing training algorithms. The learning mechanisms introduced in each evaluated algorithm scale differently with the number of neurons and parameters in the network. For example, the local output layers introduced in the LG-FGA algorithm increase the number of neurons and, consequently, the number of weights. Similarly, the number of parameters in DFA, DKP, and our proposed FDFA algorithm is greater than in BP due to the additional direct feedback connections between output and hidden neurons, although the number of neurons remains the same as in BP for these algorithms. In contrast, both FG and FGA have the same number of neurons and parameters as BP, as no local output layers or additional connections are used. While direct feedback alignment methods offer better performance, the additional parameters pose a significant memory limitation, especially as networks grow larger. This highlights an important trade-off between memory usage and performance in the context of local learning. Future work could explore mechanisms such as sparse feedback matrices or reduced weight precision to mitigate the memory impact of the FDFA algorithm, further enhancing its practicality and scalability.
Overall, the FDFA algorithm represents a promising alternative to the error backpropagation algorithm, effectively resolving backward locking and the weight transport problem. Its ability to approximate backpropagation with low variance not only opens new possibilities for the creation of efficient and scalable training algorithms but also holds significant importance for neuromorphic computing. By solely propagating information in a forward manner, the FDFA algorithm complies with the online constraints of neuromorphic systems, presenting new prospects for developing algorithms specifically tailored to the requirements of this hardware. Therefore, our findings highlight the potential of FDFA as a promising direction of research for online learning on neuromorphic systems.

Network Settings
Each architecture uses the ReLU activation function in the hidden layers and a linear activation in the output layer. For LG-FGA, additional local linear output layers with their own loss functions were added after each layer $l < L - 1$ to perform local greedy learning (Ren et al., 2023). In BP, FG, FGA, LG-FGA, DKP, and FDFA, forward weights were initialized with the uniform Kaiming initialization (He et al., 2015). For FDFA, feedback connections were initialized to 0. For Random DFA, forward weights were initialized to 0 and feedback weights were drawn from a uniform Kaiming distribution and kept fixed during training. We removed the dropout layer in AlexNet as we found that it negatively affects feedback learning. Finally, we also added batch normalization (Ioffe and Szegedy, 2015) after each convolutional layer of AlexNet to help training.

Training Settings
We train each network for 100 epochs with Adam (Kingma and Ba, 2014) and a softmax cross-entropy loss. We also used Adam for the updates of the feedback matrices in the FDFA and DKP algorithms. Adam was used because it is invariant to the rescaling of gradients (Kingma and Ba, 2014), making it a good optimization method for benchmarking convergence with gradient estimators that exhibit different scales. We used a learning rate of $\lambda = \alpha = 10^{-4}$ and the default Adam parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. No regularization or data augmentation was applied. Finally, we used learning rate decay with a decay rate of 0.95 after every epoch.
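For reference, these settings could be expressed as follows with the optax library; the paper does not state which framework was used, and steps_per_epoch is a placeholder value.

```python
# Hedged sketch of the optimizer settings described above using optax.
import optax

steps_per_epoch = 500                      # hypothetical; depends on dataset and batch size
schedule = optax.exponential_decay(
    init_value=1e-4,                       # lambda = alpha = 1e-4
    transition_steps=steps_per_epoch,      # decay applied after every epoch
    decay_rate=0.95,
    staircase=True,
)
optimizer = optax.adam(learning_rate=schedule, b1=0.9, b2=0.999, eps=1e-8)
# The same configuration is used for the forward weights and, in FDFA and DKP,
# for the feedback matrices (with alpha playing the role of the feedback learning rate).
```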

Unbiasedness of Forward Gradients
Theorem 4 (Unbiasedness of Forward Gradients (Baydin et al., 2022)). Let $x \in \mathbb{R}^n$ be a vector of size $n$ and $v \in \mathbb{R}^n$ be a random vector of $n$ independent variables. If $v \sim \mathcal{N}(0, I)$ follows a multivariate standard normal distribution, then:
$$\mathbb{E}\left[ (x \cdot v)\, v \right] = x.$$
Proof. Focusing on the $i$-th element, we have:
$$\mathbb{E}\left[ (x \cdot v)\, v_i \right] = \mathbb{E}\left[ \sum_j x_j v_j v_i \right] = \sum_j x_j\, \mathbb{E}\left[ v_j v_i \right].$$
Because the elements of $v$ are independent with zero mean, $\mathbb{E}[v_j v_i] = 0$ for $j \neq i$. Moreover, we know that $v_i \sim \mathcal{N}(0, 1)$, therefore $\mathbb{E}[v_i^2] = \mathrm{Var}[v_i] = 1$. Using these properties, the expression reduces to:
$$\mathbb{E}\left[ (x \cdot v)\, v_i \right] = x_i\, \mathbb{E}\left[ v_i^2 \right] = x_i,$$
which concludes the proof.

Variance of Forward Gradients
Lemma 5. Let $x \in \mathbb{R}^n$ be a vector of size $n$ and $v \in \mathbb{R}^n$ be a random vector of $n$ independent variables. If $v \sim \mathcal{N}(0, I)$ follows a multivariate standard normal distribution, then:
$$\mathrm{Var}\left[ (x \cdot v)\, v_i \right] = \|x\|^2 + x_i^2.$$
Proof. Because the elements of $x$ are considered constants and all the elements of $v$ are independent of each other, the variance of $(x \cdot v)\, v_i$ decomposes as follows:
$$\mathrm{Var}\left[ (x \cdot v)\, v_i \right] = \sum_j x_j^2\, \mathrm{Var}\left[ v_j v_i \right].$$
We can show that, if $j \neq i$:
$$\mathrm{Var}\left[ v_j v_i \right] = \mathbb{E}\left[ v_j^2 v_i^2 \right] - \mathbb{E}\left[ v_j v_i \right]^2 = 1,$$
and if $j = i$:
$$\mathrm{Var}\left[ v_i^2 \right] = \mathbb{E}\left[ v_i^4 \right] - \mathbb{E}\left[ v_i^2 \right]^2 = 3 - 1 = 2.$$
Therefore, by plugging these two cases into the decomposition, we find:
$$\mathrm{Var}\left[ (x \cdot v)\, v_i \right] = \sum_{j \neq i} x_j^2 + 2 x_i^2 = \|x\|^2 + x_i^2,$$
which concludes the proof.
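Both results are easy to verify numerically; the following quick Monte-Carlo check (not from the paper) confirms $\mathbb{E}[(x \cdot v)\, v] = x$ and $\mathrm{Var}[(x \cdot v)\, v_i] = \|x\|^2 + x_i^2$.

```python
# Quick Monte-Carlo check of Theorem 4 and Lemma 5 for v ~ N(0, I).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
v = rng.normal(size=(1_000_000, 4))            # one million random directions
samples = (v @ x)[:, None] * v                 # (x·v) v for every direction

print(np.allclose(samples.mean(axis=0), x, atol=2e-2))                   # Theorem 4
print(np.allclose(samples.var(axis=0), np.sum(x**2) + x**2, rtol=0.05))  # Lemma 5
```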

Figure 3:
Training loss and test accuracy of a 2-layer fully-connected network (Figures 3a and 3b) and a CNN (Figures 3c and 3d) trained on the CIFAR10 dataset. FDFA has a convergence rate similar to BP on fully-connected networks. With the CNN, FDFA is not able to overfit the training data; however, our method has the highest convergence rate compared to FG, FGA, and Random DFA.

Figure 4:
Variance of FDFA gradient estimates as a function of the feedback learning rate $\alpha$ in a two-layer fully-connected network. Each point was produced by computing the variance of gradient estimates over 10 iterations of the MNIST dataset. The blue line was fitted using linear regression; the slope $a$ and the asymptotic standard error are given in the same color. This figure shows that the variance of FDFA scales quadratically with $\alpha$.

Figure 6:
Correlation between the normalized variance of gradient estimates and the loss of a two-layer network with 1000 hidden neurons, following a single training epoch on the MNIST dataset. The variance of BP was artificially increased by adding Gaussian noise to the gradients to simulate the stochasticity of forward gradients. All gradient variances were normalized by the expected squared norm of the gradient estimates to ensure invariance with respect to the norm. Variance-loss pairs for the FG, FGA, and FDFA algorithms are shown in green, red, and blue, respectively. This figure shows that the differences in convergence are solely due to the variance of the gradient estimates.

Figure 6 shows the relationship between the variance and the measured loss. The training loss achieved after one epoch strongly depends on the variance, with low-variance gradients converging towards lower loss values than high-variance gradients. Therefore, the key factor determining convergence in this context is the variance. Figure 6 also shows the variance and loss values for the FG, FGA, and FDFA algorithms. Notably, these variance-loss pairs align with the line formed by the noisy gradients computed with BP. This demonstrates that the differences in variance of forward gradients are solely responsible for their differences in convergence rates. It can be observed that both the FG and FGA algorithms exhibit high variance, hindering convergence.


Table 1 :
Performance of 4-layer fully-connected DNNs trained on the MNIST, Fashion MNIST, and CIFAR10 datasets.
We evaluated each method with the same experimental settings on a set of benchmark datasets (MNIST, Fashion MNIST, CIFAR10, CIFAR100, and Tiny ImageNet 200) with fully-connected DNNs and convolutional neural networks, including AlexNet (Krizhevsky et al., 2012). For each method, the average test performance over 10 trainings is reported in Tables 1, 2, and 3. Experimental settings and additional results with fully-connected DNNs of different depths are given in Table 6 in the Appendix.

Table 4 :
Theoretical variance of gradient estimates produced by the FG, FGA, and FDFA algorithms, given a single input sample.

Table 5 :
Number of neurons and number of parameters in fully-connected DNNs with different depths.

Table 6 :
Performance of fully-connected DNNs with different depths. LG-FGA was not evaluated with two-layer DNNs, as these networks are not deep enough to require greedy learning.