Biologically plausible deep learning -- but how far can we go with shallow networks?

Training deep neural networks with the error backpropagation algorithm is considered implausible from a biological perspective. Numerous recent publications suggest elaborate models for biologically plausible variants of deep learning, typically defining success as reaching around 98% test accuracy on the MNIST data set. Here, we investigate how far we can go on digit (MNIST) and object (CIFAR10) classification with biologically plausible, local learning rules in a network with one hidden layer and a single readout layer. The hidden layer weights are either fixed (random or random Gabor filters) or trained with unsupervised methods (PCA, ICA or Sparse Coding) that can be implemented by local learning rules. The readout layer is trained with a supervised, local learning rule. We first implement these models with rate neurons. This comparison reveals, first, that unsupervised learning does not lead to better performance than fixed random projections or Gabor filters for large hidden layers. Second, networks with localized receptive fields perform significantly better than networks with all-to-all connectivity and can reach backpropagation performance on MNIST. We then implement two of the networks (fixed, localized random and random Gabor filters in the hidden layer) with spiking leaky integrate-and-fire neurons and spike timing dependent plasticity to train the readout layer. These spiking models achieve > 98.2% test accuracy on MNIST, which is close to the performance of rate networks with one hidden layer trained with backpropagation. The performance of our shallow network models is comparable to that of most current biologically plausible models of deep learning. Furthermore, our results with a shallow spiking network provide an important reference and suggest the use of datasets other than MNIST for testing the performance of future models of biologically plausible deep learning.


Introduction
While learning a new task, synapses deep in the brain undergo task-relevant changes [1]. These synapses are often many neurons downstream of sensors and many neurons upstream of actuators. Since the rules that govern such changes deep in the brain are poorly understood, it is appealing to draw inspiration from deep artificial neural networks (DNNs) [2]. DNNs and the cerebral cortex share that information is processed in multiple layers of many neurons [3,4] and that learning depends on changes of synaptic strengths [5]. However, learning rules in the brain are most likely different from the backpropagation algorithm [6,7,8]. Furthermore, biological neurons communicate by sending discrete spikes as opposed to the real-valued numbers used in DNNs. Differences like these suggest that there exist other, possibly nearly equally powerful, algorithms that are capable of solving the same tasks by using different, more biologically plausible mechanisms. Thus, an important question in computational neuroscience is how to explain the fascinating learning capabilities of the brain with biologically plausible network architectures and learning rules. Moreover, from a pure machine learning perspective, there is increasing interest in neuron-like architectures with local learning rules, mainly motivated by current advances in neuromorphic hardware [9]. Image recognition is a popular task to test the performance of neural networks. Because of its relative simplicity and popularity, the MNIST dataset (28×28-pixel grey level images of handwritten digits, LeCun [10]) is often used for benchmarking. Typical performances of existing models are around 97-99% classification accuracy on the MNIST test set (see section 2 and Table 2). Since the performances of many classical DNNs trained with backpropagation (but without data augmentation or convolutional layers, see table in LeCun [10]) also fall in this region, accuracies around these values are assumed to be an empirical signature of
backpropagation-like deep learning [11,12,13,8]. It is noteworthy, however, that several of the most promising approaches that perform well on MNIST have been found to fail on harder tasks [14] or at least need major modifications to scale to deeper networks [15]. There are two obvious alternatives to supervised training of all layers with backpropagation. The first one is to fix the weights in the first layer(s) at random values, as proposed by general approximation theory [16] and the extreme learning field [17]. The second alternative is unsupervised training of the first layer(s). In both cases, only the weights of a readout layer are learned with supervised training. Unsupervised methods are appealing since they can be implemented with local learning rules, see e.g. "Oja's rule" [18,19] for principal component analysis, nonlinear extensions for independent component analysis [20], or the algorithms in Olshausen and Field [21], Rozell et al. [22], Liu and Jia [23], Brito and Gerstner [24] for sparse coding. A single readout layer can be implemented with a local rule as well. A candidate is the delta rule (also called "perceptron rule"), which may be implemented by pyramidal spiking neurons with dendritic prediction of somatic spiking [25]. Since straightforward stacking of multiple fully connected layers of unsupervised learning does not reveal more complex features [21], we focus here on networks with a single hidden layer (see also Krotov et al. [26]). The main objective of this study is to see how far we can go with networks with a single hidden layer and biologically plausible, local learning rules, preferably using spiking neurons. To do so, we first compare the classification performance of different rate networks: networks trained with backpropagation, networks with fixed random projections or random Gabor filters in the hidden layer, and networks where the hidden layer is trained with unsupervised methods (subsection 3.1). Since sparse connectivity is sometimes superior to dense connectivity [27,14] and successful convolutional networks leverage local receptive fields, we investigate sparse connectivity between input and hidden layer, where each hidden neuron receives input only from a few neighboring pixels of the input image (subsection 3.2). Finally, we implement the simplest, yet promising and biologically plausible models (localized random projections and random Gabor filters) with spiking leaky integrate-and-fire neurons and spike timing dependent plasticity (subsection 3.3). We discuss the performance and implications of this simplistic model with respect to current models of biologically plausible deep learning.

Results
We study networks that consist of an input layer, one hidden layer and an output layer of (nonlinear) units, connected by weight matrices W1 and W2 (Figure 1). Training the hidden layer weights W1 with standard supervised training involves (non-local) error backpropagation using a summation over output units, the derivative of the units' nonlinearity (f′(·)) and the transposed weight matrix W2 (Figure 1a). In the biologically plausible networks considered in this paper (Figure 1b & c), the input-to-hidden weights W1 are either fixed (random or random Gabor filters) or learned with an unsupervised method (Principal/Independent Component Analysis or Sparse Coding). The unsupervised learning algorithms assume recurrent inhibitory weights V1 between hidden units to implement competition, i.e. to make different hidden units learn different features. For more model details we refer to Appendix A - Appendix D. Code for all (rate & spiking) models discussed below is publicly available at https://github.com/EPFL-LCN/pub-illing2019-nnetworks.
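In code, this feed-forward architecture is only a few lines. A minimal NumPy sketch (layer sizes, the ReLU nonlinearity and the variance-preserving random initialization are illustrative choices, not the exact settings used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 784, 1000, 10        # layer sizes (MNIST-like, illustrative)

# Fixed random projection W1 (variance-preserving scaling), linear readout W2
W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_h, n_in))
b1 = np.zeros(n_h)
W2 = np.zeros((n_out, n_h))
b2 = np.zeros(n_out)

def forward(x):
    a1 = np.maximum(0.0, W1 @ x + b1)   # hidden layer with ReLU nonlinearity
    a2 = W2 @ a1 + b2                   # readout layer, one unit per class
    return a1, a2

a1, a2 = forward(rng.random(n_in))      # stand-in for a flattened image
```

Only W2 (and b2) would be trained in the biologically plausible variants; W1 stays fixed or is learned with an unsupervised rule.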

Benchmarking biologically plausible rate models and backpropagation
To see how far we can go with a single hidden layer, we systematically investigate rate models using different methods to initialize or learn the hidden layer weights W1 (see Figure 1 and methods Appendix A-Appendix C for details). We use two different ways to set the weights W1 of the hidden layer: either using fixed Random Projections (RP) or random Gabor filters (RG), see Figure 1b & blue curves in Figure 2, or using one of the unsupervised methods Principal Component Analysis (PCA), Independent Component Analysis (ICA) or Sparse Coding (SC), see Figure 1c & red curves in Figure 2. All these methods can be implemented with local, biologically plausible learning rules [18,20,21]. We refer to the methods Appendix B for further details. As a reference, we train networks with the same architecture with standard backpropagation (BP, see Figure 1a). As a step from BP towards increased biological plausibility, we include Feedback Alignment (FA, Lillicrap et al. [11]) with fixed random feedback weights for error backpropagation (see methods Appendix D for further explanation). A Simple Perceptron (SP) without a hidden layer serves as a further reference, since it corresponds to direct classification of the input. We expect any biologically plausible learning algorithm to achieve results somewhere between SP ("lower") and BP ("upper performance bound"). The hidden-to-output weights W2 are trained with standard stochastic gradient descent (SGD), using a one-hot representation of the class label as target. Since no error backpropagation is needed for a single layer, the learning rule is local (the "delta" or "perceptron" rule). Therefore the two-layer network as a whole is biologically plausible in terms of online learning and synaptic updates using only local variables. For computational efficiency, we first train the hidden layer and then the output layer; however, both layers could be trained simultaneously. We compare the test errors on the MNIST digit recognition data set
for varying numbers of hidden neurons n_h (Figure 2). The PCA (red dashed) and ICA (red dotted) curves in Figure 2 end at the vertical line n_h = d = 784 because the number of principal/independent components (PCs/ICs), i.e. the number of hidden units n_h, is limited by the input dimension d. Since the PCs span the subspace of highest variance, classification performance quickly improves when adding more PCs for small n_h and then saturates for larger n_h. ICA does not seem to discover significantly more useful features than PCA, leading to similar classification performance. SC (red solid line) extracts sparse representations that can be overcomplete (n_h > d), leading to a remarkable classification performance of around 96% test accuracy. This suggests that the sparse representation and the features extracted by SC are indeed useful for classification, especially in the overcomplete case. As expected, the performance of RP (blue solid) for small numbers of hidden units (n_h < d) is worse than for feature extractors like PCA, ICA or SC. Also for large hidden layers, performance improves only slowly with n_h, which is in line with theory [16] and findings in the extreme learning field [17]. However, for large hidden layer sizes, RP outperforms SC. As a reference, we also studied fixing the hidden layer weights to Gabor filters of random orientation, phase and size, located at the image center (RG, blue dashed, see Appendix C). For hidden layers with more than 1000 neurons, SC is only marginally better than the network with fixed random Gabor filters.
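The supervised delta rule for the readout can be sketched in a few lines; the hidden activations and labels below are synthetic stand-ins (the actual model trains on hidden representations of MNIST images):

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_out, eta = 50, 10, 0.01          # hidden size, classes, learning rate

# Toy stand-in for hidden-layer activations, with linearly decodable labels
a1 = rng.random((500, n_h))
labels = np.argmax(a1 @ rng.normal(size=(n_h, n_out)), axis=1)

W2 = np.zeros((n_out, n_h))
for epoch in range(20):
    for a, y in zip(a1, labels):
        target = np.zeros(n_out)
        target[y] = 1.0                         # one-hot coding of the label
        out = W2 @ a                            # linear readout
        W2 += eta * np.outer(target - out, a)   # local delta rule (online SGD)

accuracy = np.mean(np.argmax(a1 @ W2.T, axis=1) == labels)
```

Each weight update uses only its own pre-synaptic activity and the post-synaptic error, which is what makes the readout rule local.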
For all tested methods and hidden layer sizes, performance is significantly worse than that reached with BP (black solid in Figure 2). In line with [11], we find that FA (black dashed) performs as well as BP on MNIST. Universal function approximation theory predicts lower bounds for the squared error that follow a power law in the hidden layer size n_h, for both BP (O(1/n_h)) and RP (O(1/n_h^{2/d}), where d is the input dimension [60,16]). In the log-log plot of Figure 2, this would correspond to a factor d/2 = 784/2 = 392 between the slopes of the BP and RP curves, or at least a factor d_eff/2 ≈ 10 using an effective dimensionality of MNIST (see methods A). We find a much faster decay of the classification error for RP and a smaller difference between the RP and BP slopes than suggested by these theoretical lower bounds. Taken together, these results show that the high dimensionality of the hidden layer is more important for reaching high performance than the global features extracted by PCA, ICA or SC. Tests on the object recognition task CIFAR10 lead to the same conclusion, indicating that this observation is not entirely task-specific (see subsection 3.2 for further analysis on CIFAR10).

Figure 2: Test error on MNIST as a function of hidden layer size. The test error decreases for increasing hidden layer size n_h for all methods, i.e. Principal/Independent Component Analysis (PCA/ICA, curves are highly overlapping), Sparse Coding (SC), fixed Random Projections (RP) and fixed random Gabor filters (RG), as well as for the fully supervised reference algorithms Backpropagation (BP) and Feedback Alignment (FA). The dash-dotted line at 90% is chance level; the dotted line around 8% is the performance of a Simple Perceptron (SP) without hidden layer. The vertical line marks the input dimension d = 784, i.e. the transition from under- to overcomplete hidden representations. Note the log-log scale.

Localized receptive fields boost performance
There are good reasons to reduce the connectivity from all-to-all to localized receptive fields (Figure 1e & f): local connectivity patterns are observed in real neural circuits [61], are useful theoretically [27] and empirically [14], and are successfully used in convolutional neural networks (CNNs). Even though this modification seems well justified from both the biological and the algorithmic side, it reduces the generality of the algorithm to input data such as images, where neighborhood relations between pixels (i.e. input dimensions) are important. To obtain localized receptive fields (called "l-" methods in the following), patches spanning p × p pixels in the input space are assigned to the hidden neurons. The centers of the patches are chosen at random positions in the input space, see Figure 1e & f. For localized Random Projections (l-RP) and localized random Gabor filters (l-RG) the weights within the patches are randomly drawn from the respective distribution and then fixed. For the localized unsupervised learning methods (l-PCA, l-ICA & l-SC) the hidden layer is split into 500 independent populations. Neurons within each population compete with each other while different populations are independent, see Figure 1f. This split implies a minimum number of n_h = 500 hidden neurons for these methods. For l-PCA and l-ICA a thresholding nonlinearity was added to the hidden layer to leverage the local structure (otherwise PCA/ICA act globally due to their linear nature, see methods Appendix B). We test l-RP for different patch sizes and find an optimum around p ≈ 10 (see Figure 3a). Note that p = 1 corresponds to resampling the data with random weights, and p = 28 recovers fully connected RP performance. The other methods show similar optimal values around p = 10 (not shown). The main finding here is the significant improvement in performance using localized receptive fields. All tested methods improve by a large margin when switching from full images to localized patches and some methods (l-RG and l-ICA) even
reach BP performance for n_h = 5000 hidden neurons (see Figure 3b). To achieve a fair comparison, BP is also implemented with localized receptive fields (l-BP), which leads to a minor improvement compared to global BP. This makes local random projections and local unsupervised learning strong competitors to BP as biologically plausible algorithms in the regime of large, overcomplete hidden layers (n_h > d), at least for MNIST classification. To test whether localized receptive fields only work for the relatively simple MNIST data set (centered digits, uninformative margin pixels, no clutter, uniform features and perspective, etc.) or generalize to more difficult tasks, we apply them to the CIFAR10 data set [62]. We first reproduce a typical benchmark performance of a fully connected network with one hidden layer trained with standard BP (≈ 56% test accuracy, n_h = 5000, see also Lin and Memisevic [63]). Again, classification performance increases with hidden layer size n_h, and localized receptive fields perform better than full connectivity for all methods. Furthermore, as on MNIST, we see similar performances for local feature learning methods (l-PCA, l-ICA & l-SC) and local random features (l-RP, l-RG) in the case of large, overcomplete hidden layers (see Table 3). Also on CIFAR10, localized random filters and local feature learning reach the performance of biologically plausible models of deep learning [14,26] and come close to the performance of the reference algorithm l-BP. However, the difference remains statistically significant here. Given that the state-of-the-art performance on CIFAR10 with deep convolutional neural networks is close to 98% (e.g. Real et al.
[64]), the limitations of our shallow local network and the well-known differences in difficulty between MNIST and CIFAR10 become apparent. In summary, the main message of this section is that unsupervised methods, as well as random features, perform significantly better when applied locally. Equipped with local receptive fields, our shallow network can outperform many current models of biologically plausible deep learning (see Table 2). On MNIST some models (l-RG & l-ICA) even reach backpropagation performance, while on CIFAR10 a small but significant gap to l-BP remains.

Figure 3: Effect of localized connectivity on MNIST. a: Test error for localized Random Projections (l-RP), dependent on receptive field size p for different hidden layer sizes n_h. The optimum at receptive field size p = 10 is more pronounced for large hidden layer sizes. Full connectivity is equivalent to p = 28. Note the log-lin scale. b: Localized receptive fields decrease test errors for all tested networks (compare Figure 2): Principal/Independent Component Analysis (l-PCA/l-ICA), Sparse Coding (l-SC), Random Projections (l-RP), Random Gabor filters (l-RG) and Backpropagation (l-BP). The effect is most significant for l-ICA and l-RG, which approach l-BP performance for large n_h and p = 10, while all other methods reach test errors between 1-2%. All other reference lines as in Figure 2. l-PCA/l-ICA & l-SC use 500 independent populations in the hidden layer (see Figure 1f), which constrains the hidden layer size to n_h ≥ 500. Note the log-log scale.
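A possible construction of such localized filters (here for l-RP): each hidden unit's weight vector is zero outside a randomly placed p × p patch. A NumPy sketch, where the boundary handling (patches fully inside the image) and the weight scaling are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
side, p, n_h = 28, 10, 200       # image side length, patch size, hidden units

# Each hidden unit gets a random p x p patch of the image as receptive field.
W1 = np.zeros((n_h, side * side))
for i in range(n_h):
    r = rng.integers(0, side - p + 1)            # random patch position (row)
    c = rng.integers(0, side - p + 1)            # random patch position (col)
    patch = np.zeros((side, side))
    patch[r:r + p, c:c + p] = rng.normal(0.0, 1.0 / p, size=(p, p))
    W1[i] = patch.ravel()                        # flatten to match input vector
```

With p = 28 every patch covers the whole image and the fully connected case is recovered, matching the limit discussed above.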

Table 3
Test accuracies (%) on MNIST and CIFAR10 for rate networks and spiking LIF models. The Simple Perceptron (SP) is equivalent to direct classification of the data without a hidden layer. All other methods use n_h = 5000 hidden neurons and receptive field size p = 10. Note that CIFAR10 has d = 32×32×3 = 3072 input channels (the third factor is due to the color channels), MNIST only d = 28×28 = 784. The rate (spiking) models are trained for 167 (117) epochs. Best performing in bold.

Spiking localized random projections
Real neural circuits communicate with short electrical pulses, called spikes, instead of real numbers such as rates. We thus extend our shallow network model to networks of leaky integrate-and-fire (LIF) neurons. The network architecture is the same as in Figure 1b. To keep it simple, we implement the two models with fixed random weights with LIF neurons: fixed localized Random Projections (l-RP) and fixed localized random Gabor filters (l-RG) with patches of size p × p, as in subsection 3.2. The output layer weights W2 are trained with a supervised spike timing dependent plasticity (STDP) rule.
The spiking dynamics follow the usual LIF equations (see methods Appendix E) and the readout weights W2 evolve according to a supervised delta rule, implemented via spike timing dependent plasticity (STDP) using post-synaptic spike traces tr_i(t) and a post-synaptic target trace tr_tgt,i(t):

Δw2_ij = η (tr_tgt,i(t) − tr_i(t)),    (1)

where η is the learning rate. Thus, for a specific readout weight w2_ij, the post-synaptic trace tr_i is updated at every post-synaptic spike time t_post and the weight is updated at every pre-synaptic spike time t_pre. The target trace is constant while a pattern is presented and uses a standard one-hot coding for the supervisor signal in the output layer.
To illustrate the LIF and STDP dynamics, a toy example consisting of one pre-synaptic neuron connected to one post-synaptic neuron is integrated for 650 ms. The pre- and post-synaptic membrane potentials show periodic spiking (Figure 4a), which induces post-synaptic spike traces and corresponding weight changes (Figure 4b), according to Equation 1. For the MNIST task, Figure 4c shows a raster plot for an exemplary training and testing protocol. During activity transients after a switch from one pattern to the next, learning is disabled until regular spiking is recovered. We found that without disabling learning during these transient phases the networks never reached a low test error. This is not surprising, since in this phase the network activities carry information both about the previously presented pattern and the current one, but the learning rule is designed for network activities in response to a single input pattern. It is also known that LIF neurons differ from biological neurons in their response to step currents (see Naud et al. [65] and references therein). During the testing period, learning is shut off permanently (see methods section E for more details). The LIF and STDP dynamics can be mapped to a rate model (see e.g. [51] and Appendix E for details). However, all following results are obtained with the fully spiking LIF/STDP model.
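The toy example above can be reproduced in a few lines. The sketch below implements only the trace dynamics and the weight update of Equation 1; spike times are prescribed rather than generated by LIF dynamics, and all constants are illustrative assumptions:

```python
import numpy as np

# One pre/post pair: a decaying post-synaptic spike trace is incremented at
# post-synaptic spikes, and the weight moves toward the target trace at
# pre-synaptic spikes (Equation 1).
dt, tau_tr, eta = 1.0, 20.0, 0.05      # time step (ms), trace tau (ms), lr
pre_spikes = set(range(0, 650, 10))    # periodic pre-synaptic spike times
post_spikes = set(range(3, 650, 25))   # periodic post-synaptic spike times
tr_tgt = 0.5                           # constant target trace while active

w, tr, w_history = 0.0, 0.0, []
for t in range(650):                   # integrate for 650 ms, as in Figure 4a
    tr *= np.exp(-dt / tau_tr)         # exponential decay of the post trace
    if t in post_spikes:
        tr += 1.0                      # increment at post-synaptic spikes
    if t in pre_spikes:
        w += eta * (tr_tgt - tr)       # delta-rule update at pre-synaptic spikes
        w_history.append(w)
```

The sign of the update depends on whether the post-synaptic trace runs above or below the target trace, which is exactly the supervised delta-rule behavior of the readout.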
When directly trained with the STDP rule of Equation 1, the spiking LIF models closely approach the performance of their rate counterparts. Table 3 compares the performances of the rate and spiking LIF l-RP & l-RG models with the reference algorithm l-BP (for the same hidden layer size n_h and patch size p, see subsection 3.2). The remaining gap (< 0.3%) between rate model and spiking LIF model presumably stems from noise introduced by the spiking approximation of rates and the activity transients mentioned above. Both the rate and the spiking LIF models of l-RP/l-RG achieve accuracies close to the backpropagation reference algorithm l-BP and fall in the range of performance of prominent, biologically plausible models, i.e. 98-99% test accuracy (see section 2 and Table 2). Based on these numbers we conclude that the spiking LIF model of localized random projections using STDP is capable of learning the MNIST task to a level that is competitive with known benchmarks for spiking networks.

Discussion
In contrast to biologically plausible deep learning algorithms that are derived from approximations of the backpropagation algorithm [8,11,12,43], we focus here on shallow networks with only one hidden layer. The weights from the input to the hidden layer are either learned by unsupervised algorithms with local learning rules, or they are fixed.
If fixed, they are drawn randomly or represent random Gabor filters. The readout layer is trained with a supervised, local learning rule. When applied globally, randomly initialized fixed weights/Gabor filters (RP/RG) in large hidden layers lead to better classification performance than training them with unsupervised methods like Principal/Independent Component Analysis (PCA/ICA) or Sparse Coding (SC). Similar observations have been made in different contexts, e.g. Dasgupta et al. [66] showed that (sparse) random projections combined with dimensionality expansion outperform known algorithms for locality-sensitive hashing. It may be interesting to search for alternative unsupervised, local learning rules with an inductive bias that is better adapted to image processing tasks than that of SC. Replacing all-to-all connectivity with localized input filters is such an inductive bias; it already proved useful in supervised models [14] but turns out to be particularly powerful in conjunction with unsupervised learning (l-PCA, l-ICA & l-SC). Interestingly, none of the local unsupervised methods could significantly outperform localized random Gabor filters (l-RG). Furthermore, we find that the performance scaling with the number of hidden units n_h is orders of magnitude better than the lower bound suggested by universal function approximation theory [16]. To move closer to realistic neural circuits, we implement our shallow, biologically plausible network with spiking neurons and spike timing dependent plasticity to train the readout layer. Spiking localized random projections (l-RP) and localized random Gabor filters (l-RG) reach > 98% test accuracy on MNIST, which lies within the range of current benchmarks for biologically plausible models of deep learning (see section 2 and Table 2). Our network model is particularly simple, i.e.
it has only one trainable layer and does not depend on the sophisticated architectural or algorithmic features typically necessary to approximate backpropagation [8]. Instead it relies only on the properties of high-dimensional localized random projections. Since we want to keep our models as simple as possible, we use online stochastic gradient descent (SGD, no minibatches) with a constant learning rate. There are many known ways to further tweak the final performance, e.g. with adaptive learning rate schedules or data augmentation, but our goal here is to demonstrate that even a simple model with a constant learning rate achieves results comparable with more elaborate approaches that use e.g. convolutional layers with weight sharing [56], backpropagation approximations [38], multiple hidden layers [11], dendritic neurons [12], recurrence [55] or conversion from rate to spikes [51]. Above 98% accuracy we also have to take into account a saturating effect of network training: better models will only lead to subtle improvements in accuracy. It is not obvious whether such improvements are really proof of having achieved deep learning or just the result of tweaking the models towards the peculiarities of the MNIST dataset. Localized random filters and local unsupervised feature learning perform remarkably well compared to fully connected backpropagation in shallow networks, even on more challenging data sets such as CIFAR10. This makes our model an important benchmark for future, biologically plausible models, but it also clearly highlights the limitations of our shallow two-layer model. State-of-the-art deep learning moved from MNIST to harder datasets, such as CIFAR10 or ImageNet [67], a long time ago. Yet MNIST seems to be the current reference task for most biologically plausible deep learning models (see section 2 and Table 2). We suggest that novel, progressive approaches to biologically plausible deep learning should significantly outperform the results presented here. Furthermore, they should be tested on tasks other than MNIST, where real deep learning capabilities become necessary.

Acknowledgments
This research was supported by the Swiss National Science Foundation (no. 200020_165538 and 200020_184615) and by the European Union Horizon 2020 Framework Program under grant agreement no. 785907 (Human Brain Project, SGA2).

A. General rate model details
We use a 3-layer feed-forward rate-based architecture with an input, a hidden and an output layer of sizes n_0 (input), n_1 = n_h (hidden) and n_2 (output, with n_2 = 10 = number of classes). The layers are connected via weight matrices W1 ∈ ℝ^{n_1×n_0} and W2 ∈ ℝ^{n_2×n_1}, and each neuron receives a bias from the bias vectors b1 ∈ ℝ^{n_1} and b2 ∈ ℝ^{n_2}, respectively (see Figure 1). The neurons themselves are nonlinear units with an element-wise, possibly layer-specific, nonlinearity a = f(u). The feed-forward pass of this model thus reads

a1 = f1(W1 x + b1),    a2 = f2(W2 a1 + b2),

for input x. The Simple Perceptron (SP) only consists of one layer, a = f(W x + b). The sparse coding (SC) model assumes recurrent inhibition within the hidden layer. This inhibition is not modeled by an explicit inhibitory population, as required by Dale's principle [68]; instead, direct, plastic, inhibitory synapses V1 ∈ ℝ^{n_1×n_1} are assumed between neurons of the hidden layer. Classification error variances in Figure 2 & Figure 3 are displayed as shaded, semi-transparent areas in the same colors as the corresponding curves. Their lower and upper bounds correspond to the 25% and 75% percentiles of at least 10 independent runs. An effective dimensionality d_eff of the MNIST data set can be obtained, e.g. via eigen-spectrum analysis, keeping 90% of the variance. We obtain values around d_eff ≈ 20. The measure proposed by Litwin-Kumar et al. [27] gives the same value d_eff ≈ 20. We checked that training a perceptron (1 hidden layer, n_h = 1000, 10^7 iterations, ReLU, standard BP) on the first 25 PCs of MNIST instead of the full data set leads to comparable MNIST performance (1.7% vs. 1.5% test error, respectively). Together, these findings suggest that the MNIST dataset lies mostly in a low-dimensional linear subspace with d_eff ≈ 25 ≪ d. The MNIST (& CIFAR10) data was rescaled to values in [0,1] and mean-centered, i.e. the pixel-wise average over the data was subtracted from the pixel values of every image. Simulations were implemented and performed in the Julia language. The code for the implementation of our rate network models is publicly available at https://github.com/EPFL-LCN/pub-illing2019-nnetworks.
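The eigen-spectrum estimate of the effective dimensionality can be sketched as follows; a synthetic data set with a known low-dimensional structure stands in for MNIST:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-in for MNIST: 784-dim data lying mostly in a 20-dim subspace
d, n, d_true = 784, 2000, 20
basis = rng.normal(size=(d_true, d))
X = rng.normal(size=(n, d_true)) @ basis + 0.01 * rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)                        # mean-center, as in the paper
eig = np.linalg.eigvalsh(Xc.T @ Xc / n)[::-1]  # eigen-spectrum, descending
cum = np.cumsum(eig) / eig.sum()               # cumulative explained variance
d_eff = int(np.searchsorted(cum, 0.90) + 1)    # components for 90% variance
```

On MNIST this procedure yields d_eff ≈ 20, as stated above; on the synthetic data it recovers a value at most equal to the planted subspace dimension.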

B. Unsupervised methods (PCA, ICA & SC)
In this paper we do not implement PCA/ICA learning explicitly as a neural learning algorithm but use standard PCA/ICA algorithms (MultivariateStats.jl), since biologically plausible online algorithms for both methods are well known [19,20]. For d-dimensional data, such algorithms output the values of the k ≤ d first principal/independent components as well as the corresponding subspace projection matrix P ∈ ℝ^{k×d}. This matrix can directly be used as the feedforward matrix W1 in our network, since the rows of P correspond to the projections of the data onto the individual principal/independent components. In other words, each neuron of the hidden layer extracts one principal/independent component of the data. ICA was performed with the usual pre-whitening of the data. Since PCA/ICA is a linear model, the biases b1 were set to 0 and f1(u) = u. With this, we can write the (trained) feedforward pass of the first layer of our PCA/ICA model as

a1 = W1 x = P x.

Since the maximum number of principal/independent components that can be extracted is the dimensionality of the data, k_max = d, the number of neurons in the hidden layer is limited by d. This makes PCA/ICA unusable for overcomplete hidden representations as investigated for SC and RP. In the localized version of PCA/ICA we assume the hidden layer to consist of independent populations, each extracting PCs/ICs of its respective localized receptive field (see Figure 1). The hidden layer was divided into 500 such populations, resulting in a minimum number of n_h = 500 hidden neurons (1 PC/IC per population) for these methods (and up to 10 PCs/ICs per population for n_h = 5000). The classifier was then trained on the combined activations of all populations of the hidden layer. Because PCA/ICA are linear methods, the localized PCA/ICA version would not extract significantly different features unless we introduce a nonlinearity in the hidden units. This was done by simply thresholding the hidden activations (ReLU with threshold 0). No further optimization in terms of nonlinearity and threshold tuning was performed.
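As a sketch of how the projection matrix P doubles as the feedforward matrix W1 (here PCA via SVD on toy data; MultivariateStats.jl plays this role in the actual Julia implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 64, 500, 16                       # input dim, samples, # components

X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # correlated toy data
Xc = X - X.mean(axis=0)                     # mean-center before PCA

# PCA via SVD: the rows of P are the k leading principal directions
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k]                                  # P in R^{k x d}

W1, b1 = P, np.zeros(k)                     # projections act as hidden layer
a1 = Xc @ W1.T + b1                         # linear hidden activations (scores)
```

Each hidden unit's activation is then the score of one principal component, and the scores of different components are mutually uncorrelated across the data set.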
Sparse coding (SC) aims at finding a feature dictionary W ∈ ℝ^{n_h×d} (for d-dimensional data) that leads to an optimal representation a1 ∈ ℝ^{n_h} which is sparse, i.e. has as few non-zero elements as possible. The corresponding optimization problem reads

min over W and a1 of  ||x − Wᵀ a1||²₂ + λ ||a1||₁,

where λ controls the sparsity of the representation. Since this is a nonlinear optimization problem with latent variables (the hidden activities), it cannot be solved directly.
Usually an iterative two-step procedure (akin to the expectation-maximization algorithm) is applied until convergence: first, optimize with respect to the activities a1 with fixed weights W; second, assuming fixed activities, perform a gradient step with respect to the weights. We implement a biologically plausible SC model using a 2-layer network with recurrent inhibition and local plasticity rules similar to the one of Brito and Gerstner [24]. For a rigorous motivation (and derivation) that such a network architecture can indeed implement sparse coding, we refer to Olshausen and Field [21], Zylberberg et al. [69], Pehlevan and Chklovskii [70], Brito and Gerstner [24]. We apply the above-mentioned two-step optimization procedure to solve the SC problem for our network model. The following two steps are repeated in alternation until convergence of the weights:

1. Optimizing the hidden activations: We assume given and fixed weights W1 and V1 and ask for the optimal hidden activations a1. Because of the recurrent inhibition V1, the resulting equation for the hidden activities a1 is nonlinear and implicit. To solve this equation iteratively, we simulate the dynamics of a neural model with time-dependent internal and external variables u1(t) and a1(t), respectively. The dynamics of the system is then given by (Zylberberg et al. [69], Brito and Gerstner [24]):

τ_u du1(t)/dt = −u1(t) + W1 x − V1 a1(t),    a1(t) = f(u1(t)).    (5)

In practice the dynamics is simulated for n_iter = 50 iterations, which leads to satisfying convergence (change in hidden activations < 5%).
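Step 1 can be sketched as a leaky-integrator loop. The sketch below uses a thresholded ReLU as the nonlinearity f and arbitrary fixed weights; in the real model W1 and V1 are learned, and the time constant and threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_h, n_iter = 64, 100, 50

x = rng.random(d)                                   # one input pattern
W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_h, d))  # feedforward dictionary
V1 = np.abs(rng.normal(0.0, 0.05, size=(n_h, n_h)))    # inhibitory recurrence
np.fill_diagonal(V1, 0.0)                           # no self-inhibition

u = np.zeros(n_h)                                   # internal variables u1(t)
a = np.zeros(n_h)                                   # activations a1(t)
lam, tau = 0.1, 10.0                                # threshold, time constant
for _ in range(n_iter):
    # leaky integration: feedforward drive minus recurrent inhibition
    u += (-u + W1 @ x - V1 @ a) / tau
    a = np.maximum(0.0, u - lam)                    # thresholded activations
```

The recurrent inhibition suppresses units that are driven by features already explained by other units, which is what produces the sparse code.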

2. Optimizing the weights: Now the activities a¹ are kept fixed and we update the weights following the gradient of the loss function. The weight update rules are Hebbian-type local learning rules [24]:

  ΔW¹ ∝ a¹ (a⁰)ᵀ,   ΔV¹ ∝ a¹ (a¹)ᵀ − ⟨a¹ (a¹)ᵀ⟩,   (6)

where ⟨⋅⟩ is a moving average (low-pass filter) over several past hidden representations (after convergence of the recurrent dynamics) with some time constant n_mav, e.g. n_mav = 100 patterns. At the beginning of the simulation the effective averaging window is increased from 0 to n_mav during the first n_mav pattern presentations. The rows of W¹ are normalized after each update; this could also be achieved by adding a weight decay term. Additionally, the entries of V¹ are clamped to positive values after each update to ensure that the recurrent input is inhibitory, and the diagonal of V¹ is kept at zero to avoid self-inhibition.
During SC learning, the variables u¹(t) and a¹(t) are reset before each input presentation (to avoid transients). For every input, (5) is iterated for n_iter steps and the weights are then updated according to (6). Similar to localized PCA/ICA, the localized version of SC uses independent populations in the hidden layer (see Figure 1). The SC algorithm above was applied to each population and its respective receptive field independently. The classifier was then trained on the combined activations of all populations of the hidden layer.
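A minimal sketch of step 1, the iterative solution of the implicit equation for the hidden activations under recurrent inhibition (the paper's code is Julia; this NumPy version uses illustrative parameter values and our own names):

```python
import numpy as np

def sc_hidden_activations(x, W, V, tau=0.1, dt=0.01, n_iter=50, theta=0.0):
    """Iteratively relax the hidden activations a1 for fixed feedforward
    weights W and recurrent inhibitory weights V (illustrative sketch)."""
    u = np.zeros(W.shape[0])                 # internal variables u1(t)
    a = np.maximum(u - theta, 0.0)           # hidden activations a1(t)
    for _ in range(n_iter):
        # leaky integration of feedforward drive minus recurrent inhibition
        u += (dt / tau) * (-u + W @ x - V @ a)
        a = np.maximum(u - theta, 0.0)       # rectification
    return a

rng = np.random.default_rng(1)
d, nh = 16, 8
W = rng.normal(scale=1 / np.sqrt(d), size=(nh, d))
V = np.abs(rng.normal(scale=0.1, size=(nh, nh)))  # inhibition kept positive
np.fill_diagonal(V, 0.0)                          # no self-inhibition
a1 = sc_hidden_activations(rng.normal(size=d), W, V)
print(a1.shape)                                   # (8,)
```

In the full algorithm this inner loop alternates with the Hebbian/anti-Hebbian weight updates until the weights converge.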

C. Fixed Random Filters (RP & RG)
For RP, the weight matrix W¹ between input and hidden layer is initialized randomly, W¹ ∼ 𝒩(0, σ²), with variance-preserving scaling σ² ∝ 1/n₀. The biases b¹ are initialized by sampling from a uniform distribution 𝒰([0, 0.1]) between 0 and 0.1. In practice we used this specific initialization for RP (keeping the weights fixed), SC, SP and also BP & FA (for both layers, i.e. also W² and b²).
For localized RP (ℓ-RP), neurons in the hidden layer receive input only from a fraction of the input units, called a receptive field. Receptive fields are chosen to form a compact patch over neighbouring pixels in image space. For each hidden neuron a receptive field of size s × s (s ∈ ℕ) input neurons is created at a random position in the input space. The weight values within each receptive field (rf) and the biases are initialized as for RP, but with the variance scaled by the receptive field size instead of the full input dimension; the scaling parameter (= 3) was found empirically through a grid-search optimization of classification performance. For exact parameter values, see Table 4.

The (localized) random Gabor filters in RG have the same receptive field structure as in ℓ-RP (see Appendix C), but instead of choosing the weights within the receptive field as random values, they are chosen according to Gabor filters, W¹ ∝ G(x, y). Here, x and y denote the pixel coordinates within the localized receptive field relative to the patch center. The Gabor filters have the following functional form:

  G(x, y) = exp(−(x′² + γ²y′²) / (2σ²)) ⋅ cos(2π x′/λ + ψ),
  x′ = x cos Θ + y sin Θ,   y′ = −x sin Θ + y cos Θ.

To obtain diverse, random receptive fields we draw the parameters σ, Θ, λ, γ, ψ of the Gabor functions from uniform distributions over some intervals. The bounds of the sampling intervals are optimized using Bayesian optimization (BayesianOptimization.jl) with respect to classification accuracy on the training set.
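A sketch of how such localized random Gabor receptive fields can be generated (NumPy; the sampling intervals below are illustrative placeholders, not the Bayesian-optimized bounds of the paper):

```python
import numpy as np

def localized_gabor_weights(n_hidden, img_side, s, rng):
    """Build an (n_hidden x img_side**2) weight matrix where each hidden unit
    has an s x s receptive field at a random position, filled with a random
    Gabor filter. Parameter ranges here are illustrative assumptions."""
    W = np.zeros((n_hidden, img_side * img_side))
    half = (s - 1) / 2.0
    # pixel coordinates relative to the patch center
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    for i in range(n_hidden):
        sigma = rng.uniform(1.0, 4.0)        # envelope width
        theta = rng.uniform(0.0, np.pi)      # orientation
        lam = rng.uniform(2.0, 10.0)         # wavelength
        gamma = rng.uniform(0.5, 1.0)        # aspect ratio
        psi = rng.uniform(0.0, 2 * np.pi)    # phase offset
        xp = x * np.cos(theta) + y * np.sin(theta)
        yp = -x * np.sin(theta) + y * np.cos(theta)
        g = np.exp(-(xp**2 + (gamma * yp)**2) / (2 * sigma**2)) \
            * np.cos(2 * np.pi * xp / lam + psi)
        r = rng.integers(0, img_side - s + 1)  # random patch position
        c = rng.integers(0, img_side - s + 1)
        rows = np.arange(r, r + s)
        cols = np.arange(c, c + s)
        idx = (rows[:, None] * img_side + cols[None, :]).ravel()
        W[i, idx] = g.ravel()
    return W

rng = np.random.default_rng(2)
W1 = localized_gabor_weights(n_hidden=16, img_side=28, s=10, rng=rng)
print(W1.shape)                              # (16, 784)
```

Each row of W1 is zero outside its randomly placed s × s patch, implementing the localized connectivity illustrated in Figure 1e.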

D. Classifier & Supervised reference algorithms (BP, FA & SP)
The connections W² from hidden to output layer are updated by a simple delta-rule, which is equivalent to BP in a single-layer network and hence is biologically plausible. To have a reference for our biologically plausible models (Figure 1b & c), we compare them to networks with the same architecture (number of layers, neurons, connectivity) but trained in a fully supervised way with standard backpropagation (Figure 1a). The forward pass of the model reads

  a¹ = f¹(W¹a⁰ + b¹),   a² = f²(W²a¹ + b²).

Given the one-hot encoded target activations a_tgt, the error ẽ is ẽ = a² − a_tgt when minimizing the mean squared error (MSE), or ẽ = softmax(a²) − a_tgt for the softmax/cross-entropy loss (CE). Classification results (on the test set) for MSE and CE loss were found to be not significantly different. Rectified linear units (ReLU) were used as nonlinearity f(u) for all layers (MSE loss) or for the first layer only (CE loss). In BP the weight and bias updates are obtained by stochastic gradient descent, i.e. ΔW_ij ∝ −∂L/∂W_ij. The full BP algorithm for deep networks reads [71]

  e² = ẽ ⊙ f²′(u²),
  e¹ = ((W²)ᵀ e²) ⊙ f¹′(u¹),
  ΔWˡ = −η eˡ ⊗ aˡ⁻¹,   Δbˡ = −η eˡ,   (15)

where ⊙ stands for element-wise multiplication, ⊗ is the outer (dyadic) product, f′(⋅) is the derivative of the nonlinearity and η is the learning rate. FA [11] uses a fixed random matrix R instead of the transpose of the weight matrix Wᵀ for the error backpropagation step in (15).
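The delta-rule training of the readout can be sketched as follows (a hedged NumPy illustration for the MSE loss with a linear readout; the names are ours, not the paper's):

```python
import numpy as np

def delta_rule_step(W2, b2, a1, target, eta=0.01):
    """One delta-rule update of the readout weights, equivalent to a BP/SGD
    step when only a single layer is trained (illustrative sketch)."""
    a2 = W2 @ a1 + b2                  # readout activation
    err = a2 - target                  # error e~ = a2 - a_tgt (MSE)
    W2 -= eta * np.outer(err, a1)      # local: postsynaptic error x pre
    b2 -= eta * err
    return W2, b2

rng = np.random.default_rng(3)
nh, n_out = 20, 10
W2 = rng.normal(scale=0.1, size=(n_out, nh))
b2 = np.zeros(n_out)
a1 = rng.normal(size=nh)               # hidden activations for one pattern
tgt = np.eye(n_out)[4]                 # one-hot target
before = np.sum((W2 @ a1 + b2 - tgt) ** 2)
W2, b2 = delta_rule_step(W2, b2, a1, tgt)
after = np.sum((W2 @ a1 + b2 - tgt) ** 2)
print(after < before)                  # squared error on this pattern shrinks
```

Since the update uses only the presynaptic activity a¹ and the error at the postsynaptic unit, it is local in the sense required above.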
To allow for a fair comparison with ℓ-RP, BP and FA were implemented both with full connectivity and with localized receptive fields, using the same initialization as in ℓ-RP. During training with BP (or FA), the usual weight update (15) was applied only to the weights within the receptive fields. The exact parameter values can be found in Table 4.
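Restricting BP/FA updates to the receptive fields can be implemented by masking the gradient, e.g. (illustrative NumPy sketch, not the paper's Julia code):

```python
import numpy as np

def masked_bp_update(W1, mask, grad, eta=0.1):
    """Apply a gradient step only inside the receptive fields, encoded as a
    binary mask over W1; weights outside the fields stay untouched."""
    W1 -= eta * (grad * mask)
    return W1

rng = np.random.default_rng(4)
W1 = rng.normal(size=(6, 9))
mask = (rng.random((6, 9)) < 0.3).astype(float)  # random localized pattern
W1 *= mask                                       # localized connectivity
grad = rng.normal(size=(6, 9))                   # some backpropagated gradient
W1 = masked_bp_update(W1, mask, grad)
print(np.allclose(W1 * (1 - mask), 0.0))         # outside-rf weights stay zero
```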

E. Spiking implementation of RP & RG
The spiking simulations were performed with a custom-made event-based leaky integrate-and-fire (LIF) integrator written in the Julia language. Code is available at https://github.com/EPFL-LCN/pub-illing2019-nnetworks. For large network sizes, the exact, event-based integration can be inefficient due to the large frequency of events. We thus also added an Euler-forward integration mode to the framework. For sufficiently small time discretization (e.g. Δt ≤ 5 ⋅ 10⁻² ms for the parameters given in Table 6) the error of Euler-forward integration does not have negative consequences on the learning outcome. The dynamics of the LIF network is given by

  τ_m du_i(t)/dt = −u_i(t) + R I_i(t),   I_i(t) = Σ_j w_ij ε_j(t) + I_ext,i(t),   (16)

and the spiking condition u_i(t) ≥ θ : u_i → u_reset, where u(t) is the membrane potential, τ_m the membrane time constant, R the membrane resistance, w_ij the synaptic weights, ε_j(t) the post-synaptic potential evoked by a pre-synaptic spike arrival, θ the spiking threshold and u_reset the reset potential after a spike. The input is thus split into a feed-forward (Σ_j w_ij ε_j(t)) and an external (I_ext(t)) contribution. Each neuron in the input layer 0 (n₀ = d) receives only external input proportional to one pixel value of the data. To avoid synchrony between the spikes of different neurons, the starting potentials and parameters (e.g. thresholds) of the different neurons are drawn from a (small) range around the respective mean values.

We implement STDP using post-synaptic spike traces tr(t) and a post-synaptic target trace tgt(t): the weight change of the STDP rule (17) is proportional to the difference between the post-synaptic target trace and the post-synaptic spike trace, gated by presynaptic activity. To train the network, we present patterns to the input layer and a target trace to the output layer. The MNIST input is scaled by the input amplitude amp_inp; the targets tgt(t) of the output layer are the one-hot-coded classes, scaled by the target amplitude amp_tgt. Additionally, every neuron receives a static bias input I_bias ≈ θ to avoid silent units in the hidden layer. Every pattern is presented as fixed input for a time t_pat, and the LIF dynamics as well as the learning evolve according to (16) and (17), respectively. Learning is disabled after pattern switches for a duration of t_trans = 4 (see Table 6), since the noise introduced by these transient phases was found to deteriorate learning progress. With the parameters we used for the simulations (see Table 6), firing rates of single neurons in the whole network stayed below 1 kHz, which was considered a biologically plausible regime. For the toy example in Figure 4a & b, […].

Most of the parameters of the spiking and the LIF rate models can be mapped to each other directly (see Table 6). The learning rate must be adapted, since the LIF weight change depends on the presentation time t_pat of a pattern. In the limit of long pattern presentation times (t_pat ≫ τ_m, τ_tr), the theoretical transition from the learning rate η̃ of the LIF rate model to the learning rate η of the spiking LIF model is

  η = (1000 ms / t_pat [ms]) ⋅ 1000 ⋅ η̃,

where the second factor comes from a unit change from Hz to kHz. It is also possible to train the weight matrices computationally efficiently in the LIF rate model and plug them into the spiking LIF model afterwards. The reasons for the remaining difference in performance presumably lie in transients and single-spike effects that cannot be captured by the rate model. Furthermore, the new target was presented immediately after a pattern switch, even though the activity needs at least a couple of time constants (τ_tr or τ_m) to propagate through the network. Removing this asynchrony between input and target should further shrink the discrepancy between rate and spiking models.
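For intuition, Euler-forward integration of a single LIF neuron with constant input can be sketched as follows (an illustrative Python sketch with made-up parameter values; see Table 6 of the paper for the actual ones):

```python
def lif_euler(I_ext, tau=10.0, R=1.0, theta=1.0, u_reset=0.0,
              dt=0.05, t_end=200.0):
    """Euler-forward integration of one LIF neuron driven by a constant
    external current; returns the spike times (ms)."""
    n_steps = int(t_end / dt)
    u, spikes = 0.0, []
    for step in range(n_steps):
        u += (dt / tau) * (-u + R * I_ext)   # leaky integration
        if u >= theta:                       # spiking condition
            spikes.append(step * dt)
            u = u_reset                      # reset after spike
    return spikes

# supra-threshold constant input -> regular firing
spikes = lif_euler(I_ext=1.5)
print(len(spikes) > 0)
```

With a supra-threshold constant input the neuron fires regularly; an event-based integrator instead advances directly from one spike event to the next, which is exact but becomes inefficient at high event rates.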

Figure 1: The proposed network model has one hidden layer (1) and one readout layer (2) of nonlinear units (nonlinearity f(⋅)). Respective neural activations (e.g. a⁰_i) and update rules (e.g. ΔW¹_ij) are added. g(⋅) & h(⋅) (g̃(⋅) & h̃(⋅)) are (non-)local plasticity functions, i.e. using only variables (not) available at the synapse for the respective update. a Training with backpropagation (BP) through one hidden layer is biologically implausible since it is nonlocal (e.g. using W² & f′(⋅) from higher layers to update W¹, see Appendix D). b & c Biologically plausible architecture with fixed Random Projections (RP) or fixed random Gabor filters (RG) (blue box in b) or unsupervised feature learning in the first layer (red box in c), and a supervised classifier in the readout layer 2 (green boxes). All weight updates are local. W stands for feed-forward, V for recurrent, inhibitory weights. (Crossed-out) brain icons in a, b & c stand for (non-)bio-plausibility of the whole network. d & e Illustration of fully connected and localized receptive fields of W¹. f For localized Principal/Independent Component Analysis (ℓ-PCA/ℓ-ICA) and Sparse Coding (ℓ-SC) the hidden layer is composed of independent populations. Neurons within each population share the same localized receptive field and compete with each other, while the populations are conditionally independent. For more model details, see Appendix A - Appendix D.

Figure 2: MNIST classification with rate networks according to Figure 1a-c with full connectivity (Figure 1d). The test error decreases with increasing hidden layer size n_h for all methods, i.e. Principal/Independent Component Analysis (PCA/ICA, curves are highly overlapping), Sparse Coding (SC), fixed Random Projections (RP) and fixed random Gabor filters (RG), as well as for the fully supervised reference algorithms Backpropagation (BP) and Feedback Alignment (FA). The dash-dotted line at 90 % is chance level; the dotted line around 8 % is the performance of a Simple Perceptron (SP) without hidden layer. The vertical line marks the input dimension d = 784, i.e. the transition from under- to overcomplete hidden representations. Note the log-log scale.

Figure 4: Spiking LIF and STDP dynamics. a Dynamics of the pre- and postsynaptic membrane potentials, spike traces and the weight value (b) of a toy example with two neurons and one interconnecting synapse. The weight decreases when the post-trace is above the post-target-trace (see Equation 1 and Appendix E). Both neurons receive static supra-threshold external input: I_ext,pre ≫ I_ext,post ≈ θ (spiking threshold). Note that presynaptic spikes only slightly alter the postsynaptic potential since the weight is initially zero. c Raster plot of a network trained on MNIST, where every spike is marked with a dot. The background color indicates the corresponding layer: input (blue, n₀ = 144 neurons), hidden (green, n₁ = n_h = 100) and output (red, n₂ = 10). Bold vertical lines indicate pattern switches, thin lines indicate ends of transient phases (indicated by semi-transparency), during which learning is disabled. Left: behaviour at the beginning of the training phase. Right: testing period (learning off) after 6 ⋅ 10⁴ presented patterns (1 epoch). As can be seen in the zoomed view of the 10 output layer neurons (red), the output layer has started to learn useful, one-hot encoded class predictions. A downsampled (12 × 12) version of MNIST is used for improved visibility.

Table 1: Alphabetical list of abbreviations in this paper.

Table 6: (Hyper-)Parameters for the spiking LIF ℓ-RP & ℓ-RG models (apart from weight initialization, Appendix C). Input and target amplitudes are implausibly high due to the arbitrary convention R = 1 Ω. Best performing parameters in bold.