NMDA-driven dendritic modulation enables multitask representation learning in hierarchical sensory processing pathways

Significance In deep learning, the standard approach to accommodate changing task demands is to train new output layers on top of a common trunk network and, if needed, to relearn synapses throughout the whole network. The brain, however, appears to adopt a radically different strategy, as neurons in all processing layers are modulated by contextual information. We show that context-dependent dendritic afferents can powerfully modulate the neuronal output and that this modulation dynamically reshapes network function to solve new tasks, without adapting any feedforward synapses. We furthermore show that these dendritic modulations could underlie self-supervised learning of deep networks, without relying on the backpropagation of errors across the layers of the network.


Supporting Information Text
Methods

Biophysical modelling. The morphology and ion channels for the L5 PC model were taken from Hay et al. (1) and implemented in the NEURON simulator (2). In Figs 1, 2, and 6, we only retained the most important somatic Na+ and K+ channels (NaTa and Kv3.1) and used a passive dendritic membrane, with physiological parameters that were proposed by Major et al. to reproduce the amplitudes of glutamate-uncaging evoked NMDA-spikes in L5 PC dendrites and somata (3), combined with a spine correction as in Rhodes et al. (4). For Fig S7, we used the parameters for the fully active L5 PC as provided by template 1 in Hay et al. (1).
Contextual and background synapses were conductance based, containing either AMPA+NMDA (excitatory) or GABA (inhibitory) receptors. AMPA and GABA receptors were implemented as the product of a double exponential conductance profile (5) g with a driving force, i_syn = g (e_r − v), with

g(t) = w n(τ_r, τ_d) ( e^(−t/τ_d) − e^(−t/τ_r) ). [1]

Here, e_r is the synaptic reversal potential, τ_r and τ_d are the synaptic rise and decay time constants, and n is a normalization constant that depends on τ_r and τ_d and normalizes the peak of the conductance window to the synaptic weight w. AMPA rise and decay times were τ_r = 0.2 ms and τ_d = 3 ms, and the AMPA reversal potential was e_r = 0 mV. For GABA, we set τ_r = 0.2 ms, τ_d = 10 ms and e_r = −80 mV. N-methyl-D-aspartate (NMDA) currents (6) were implemented as the same double exponential conductance profile, multiplied by the voltage-dependent magnesium block and the driving force,

i_NMDA = g(t) σ(v) (e_r − v), [2]

with τ_r = 0.2 ms, τ_d = 43 ms, and e_r = 0 mV, while σ(v), the channel's magnesium block, had the sigmoidal voltage dependence given in (7) [3]. The weight w of a synapse signifies the maximum value of its conductance profile. For an AMPA+NMDA synapse, the weight is the maximal value of the AMPA conductance profile, and the maximal value of the NMDA profile is twice that of the AMPA window (NMDA ratio of 2). Feedforward synapses were current based, with a double exponential current profile with τ_r = 0.2 ms and τ_d = 3 ms. The excitatory synaptic weight corresponded to the maximal value of the current profile; for inhibitory feedforward synapses, the weight was negative and corresponded to the minimum value. For dendritic modulation, we identified 20 dendritic compartments suited for semi-independent NMDA-spike generation (e.g. avoiding compartments on sister branches, Fig 1F) and equipped each of these compartments with an AMPA+NMDA synapse. Both the feedforward and shunt inputs targeted the somatic compartment. The 20 dendritic compartments as well as the soma were also equipped with an AMPA and a GABA background synapse. All parameters for the simulations are given in Table S1.
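For illustration, a minimal sketch of the normalized double-exponential conductance window of Eq. (1); the numerical values are the AMPA parameters given above:

```python
import numpy as np

def double_exp_conductance(t_ms, w, tau_r=0.2, tau_d=3.0):
    """Double-exponential conductance window (Eq. 1), normalized so its peak equals w.

    t_ms: time since the presynaptic spike (ms); tau_r, tau_d: rise/decay constants (ms).
    """
    # Time of the conductance peak, from d/dt [exp(-t/tau_d) - exp(-t/tau_r)] = 0
    t_peak = (tau_r * tau_d) / (tau_d - tau_r) * np.log(tau_d / tau_r)
    # Normalization factor n(tau_r, tau_d) such that g(t_peak) = w
    n = 1.0 / (np.exp(-t_peak / tau_d) - np.exp(-t_peak / tau_r))
    return w * n * (np.exp(-t_ms / tau_d) - np.exp(-t_ms / tau_r))

# Example: AMPA-like window with weight 1 nS; the peak should equal the weight.
t = np.linspace(0.0, 20.0, 2001)
g_ampa = double_exp_conductance(t, w=1.0)
assert np.isclose(g_ampa.max(), 1.0, atol=1e-3)
```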
Note that for dendritic modulation, we set the number of compartments with NMDA-spikes, n_c, by targeting n_c compartments with an AMPA+NMDA synapse of high weight and 20 − n_c compartments with an AMPA+NMDA synapse of low weight, and delivered the same amount of input spikes in each case. For somatic modulation, the weight of the shunt synapses remained the same, but the number of input spikes was modified. Note that the burst width for somatic modulation was enlarged to account for the shorter timescale of the GABA conductance window compared to the NMDA conductance window.
To measure the membrane conductance change induced by the contextual inputs (Fig 1H), we followed Haider et al. (8) and measured the time-dependent electrode current in voltage clamp (i(t; v_h), with v_h the holding potential) for two different holding potentials, v_h1 = −80 mV and v_h2 = 0 mV. Following the Ohmic relationship g(t) (v_h − v_0) = i(t; v_h), with v_0 the equilibrium potential, we found the membrane conductance as

g(t) = ( i(t; v_h1) − i(t; v_h2) ) / (v_h1 − v_h2). [4]

To visualize g(t) as in Fig 1H, we subtracted the pre-stimulus baseline and plotted the min-max envelope obtained over 10 trial runs.
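A minimal sketch of the conductance estimate from the two voltage-clamp recordings (Eq. 4); the baseline window used for Fig 1H-style plots is an assumption:

```python
import numpy as np

def membrane_conductance(i_vh1, i_vh2, v_h1=-80.0, v_h2=0.0):
    """Time-dependent membrane conductance from two voltage-clamp current traces (Eq. 4).

    i_vh1, i_vh2: electrode currents recorded at holding potentials v_h1 and v_h2 (same time base).
    The Ohmic relation g(t)*(v_h - v_0) = i(t; v_h) holds at both potentials, so the
    (unknown) equilibrium potential v_0 cancels when the two traces are subtracted.
    """
    return (np.asarray(i_vh1) - np.asarray(i_vh2)) / (v_h1 - v_h2)

# Visualization as in Fig 1H: subtract a pre-stimulus baseline (here, the first 100 samples).
# g = membrane_conductance(i_trace_80mV, i_trace_0mV)
# g_plot = g - g[:100].mean()
```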
IO curve parameter fitting. To obtain the IO curves (Fig 2), we changed the number of feedforward inputs in each burst (weights as in Table S1), and either had only excitatory inputs (no. of feedforward inputs > 0) or inhibitory inputs (no. of feedforward inputs < 0). For each number of inputs, we presented ten independently sampled bursts featuring that number of inputs, and measured the average number of output spikes generated in response.
To fit the IO curves, we used a linear least squares fit (argmin_u ∥A u − y∥²) both for the shared-gain and the shared-bias case, but constructed the feature matrix A and parameter vector u differently. On the domain where y > 0, the IO curve is

y = g_m n_ff + b for gain modulation and y = g n_ff + b_m for bias modulation, [5]

with n_ff the number of feedforward inputs. For M modulation levels, we have parameter vectors u = (g_1, g_2, . . . , g_M, b) for gain modulation and u = (g, b_1, b_2, . . . , b_M) for bias modulation, and corresponding feature matrices with one row per data point (i, m), i = 1, 2 and m = 1, . . . , M: for gain modulation, that row holds n_ff,i|m in column m (the column of gain g_m) and a 1 in the last (bias) column, with zeros elsewhere; for bias modulation, it holds n_ff,i|m in the first (gain) column and a 1 in column m + 1 (the column of bias b_m), with zeros elsewhere. Here, n_ff,1|m corresponds to the number of feedforward inputs where the output transitioned from 0 to 1 spikes and n_ff,2|m from 1 to 2 spikes, given a modulation level m. The output vector y = (0.5, 1.5, 0.5, 1.5, . . . , 0.5, 1.5) was identical in both cases. To extend gain modulation with a shared x-shift, we modified the entries in the gain modulation feature matrix A from n_ff,i|m to n_ff,i|m − x_shift (i = 1, 2, m = 1, . . . , M) and minimized the fit residual min_u ∥A(x_shift) u − y∥² over the x_shift values.
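A sketch of the shared-bias (gain-modulation) fit as a concrete instance of the least-squares construction described above; the transition counts in the example are hypothetical:

```python
import numpy as np

def fit_shared_bias_gains(nff_transitions):
    """Least-squares fit of per-modulation gains g_m with a shared bias b (Eq. 5).

    nff_transitions: array of shape (M, 2); row m holds the number of feedforward inputs
    at which the output transitioned from 0->1 and from 1->2 spikes for modulation level m.
    Returns the parameter vector (g_1, ..., g_M, b).
    """
    M = nff_transitions.shape[0]
    A = np.zeros((2 * M, M + 1))
    y = np.tile([0.5, 1.5], M)                        # target spike counts at the transitions
    for m in range(M):
        A[2 * m : 2 * m + 2, m] = nff_transitions[m]  # gain column for modulation level m
        A[2 * m : 2 * m + 2, -1] = 1.0                # shared bias column
    u, *_ = np.linalg.lstsq(A, y, rcond=None)
    return u

# Hypothetical transition counts for three modulation levels:
u = fit_shared_bias_gains(np.array([[10., 20.], [8., 16.], [6., 12.]]))
```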
Supervised learning in the fully connected networks. The fully connected network architectures with neuron-specific modulations (Fig 3) consisted of a fixed number of layers of equal size, with the final layer connected to a single output unit. The neuron-specific modulations were implemented in an abstract fashion, as task-specific parameters (a detailed mathematical description of each model is given in Table S2), although it is straightforward to extend the network architecture to implement the modulations as synaptic inputs. The networks were trained through end-to-end error backpropagation. For the networks with task-specific readouts, the hidden units had unit gain (fixed) and a trained bias shared across tasks, and there were as many output units as there were tasks. The hidden layer neurons had ReLU transfer functions and the output unit a tanh(·) transfer function. The networks were implemented in PyTorch (9) and trained on batches of size 1 000 using Adam (10). We used the mean squared error between target and network output as the loss. The targets were defined as either −1 or +1, and a custom sampler ensured that batches consisted of an equal number of samples from each task and task-class. Training was interrupted by an early stopping criterion with a patience of 5 epochs on the validation performance, evaluated after each epoch on a validation set of 47 932 samples, or after 50 epochs, whichever came first. Independent learning rates for shared and task-specific parameters were optimized for each form of multitask learning separately with an iterative grid search (Fig S1C). Performances were measured by averaging over all tasks, and by additionally averaging over twenty initialization seeds (error bars show the standard deviation of task-performance across seeds, averaged over all tasks). In the case of transfer learning, each of the twenty pseudo-random seeds led to an independently drawn random subset of tasks (with fixed subset size) that was used to train all model parameters with gradient descent using the multitask learning approach described above. For each seed, all the available tasks that were not used in pre-training were then learned by adapting only the task-dependent parameters of the model (see Table S2). This transfer learning phase also employed early stopping on a validation set, and test performance was evaluated on a testing set of data samples unused during any phase of training.
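As an illustration of the "task-specific gain & shared x-shift, bias" model, the following is a minimal PyTorch sketch; layer sizes, initialization, and learning rates are placeholders, not the values used in the paper:

```python
import torch
import torch.nn as nn

class GainModulatedLayer(nn.Module):
    """Fully connected layer with shared weights/x-shift/bias and task-specific gains.

    Implements y = ReLU(g_t * (W x - x_shift) + b): only `gains` is task-dependent;
    all other parameters are shared across tasks (a sketch of the abstract model in Table S2).
    """
    def __init__(self, n_in, n_out, n_tasks):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_out, n_in) / n_in ** 0.5)  # shared feedforward weights
        self.x_shift = nn.Parameter(torch.zeros(n_out))                # shared x-shift
        self.bias = nn.Parameter(torch.zeros(n_out))                   # shared bias
        self.gains = nn.Parameter(torch.ones(n_tasks, n_out))          # task-specific gains

    def forward(self, x, task_idx):
        pre = x @ self.W.T - self.x_shift
        return torch.relu(self.gains[task_idx] * pre + self.bias)

# Separate learning rates for shared and task-specific parameters, as in the grid search:
layer = GainModulatedLayer(784, 100, n_tasks=47)
opt = torch.optim.Adam([
    {"params": [layer.W, layer.x_shift, layer.bias], "lr": 1e-3},
    {"params": [layer.gains], "lr": 1e-2},
])
```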

Model type | Mathematical formulation | Task-dependent parameters
task-specific readout | y^(l) = σ(W^(l) x + b^(l)), output y_t = tanh(w_t^T y^(L)) * | readout weights w_t
task-specific gain & shared x-shift, bias | y^(l) = σ(g_t^(l) ⊙ (W^(l) x − x_shift^(l)) + b^(l)) | gains g_t^(l)
task-specific gain & shared bias | y^(l) = σ(g_t^(l) ⊙ (W^(l) x) + b^(l)) | gains g_t^(l)

Table S2. Comparison of the task-dependent parameters for the different multitask and transfer learning approaches.
The mathematical formulations describe the output vector y^(l) for neurons in layer l as a function of the input vector x (corresponding either to the output of the previous layer or to the input data vector), for a feedforward network of L layers solving a task t. Note: ⊙ denotes the Hadamard product (element-wise multiplication) and σ describes the element-wise transfer function mapping a vector of pre-activations in layer l to an output vector. * In the case of task-specific readouts, the readout weights map the final hidden layer to as many output units as there are tasks.

Decision boundary normal vectors. To explain our approach, we consider the pre-activation a : R^n → R : x → a(x) of a neuron in a feedforward network as a function of the sensory input x (the activation y being given by y(x) = σ(a(x))). Biophysically, this quantity corresponds most closely to the somatic voltage under Na+-channel blockage; it is the aggregate of all inputs, and when it crosses a threshold, the neuron would emit a spike if the non-linear activation function (the Na+ channels) were applied. The neuron may be active (a(x) > 0) or inactive (a(x) < 0), and its decision boundary on the input domain is given by the set D = {x_D ∈ R^n | a(x_D) = 0}. In a small enough region around a point x_D ∈ D, a(x) can be approximated as being linear,

a(x) ≈ w_⊥^T (x − x_D), [8]

and w_⊥ = ∇_x a(x)|_{x=x_D} is the local normal vector of the decision boundary. This normal vector is always a linear sum of the input weight vectors w_j to the first-layer neurons (j = 1, . . . , k, with k the number of neurons in the first layer) and is perpendicular to the local decision boundary (see the derivation below), thus capturing the local input features that a uses to make a decision about whether to become active close to x_D (Fig 4A).
First, we note that for input x ∈ D on the decision boundary close enough to x_D, a(x) = 0, and thus, following Eq. (8),

0 = a(x) ≈ w_⊥^T (x − x_D). [9]

By consequence, w_⊥ is perpendicular to the local decision boundary and indeed its normal vector. Second, we compute w_⊥ explicitly, and to that purpose write the preactivation a of a neuron in layer K of the network as

a(x) = g (w^T y^(K−1) − x_shift) + b,   y^(i) = σ( g^(i) ⊙ (W^(i) y^(i−1) − x_shift^(i)) + b^(i) ),   y^(0) = x, [10]

with y^(i) the neural activations in layer i, σ : R → R the neural activation function applied element-wise to its inputs, W^(i) the weight matrix from layer i − 1 to layer i, g^(i), x_shift^(i) resp. b^(i) the gains, x-shifts resp. biases in layer i, w the weight vector to the neuron, and g, x_shift resp. b its gain, x-shift and bias. w_⊥ is then found as

w_⊥^T = g w^T ( dy^(K−1)/dy^(K−2) ) · · · ( dy^(1)/dy^(0) ), [11]

with dy^(i)/dy^(i−1) the Jacobian matrix of y^(i) with respect to y^(i−1). This Jacobian is given by

dy^(i)/dy^(i−1) = D^(i) W^(i),   D^(i) = diag( σ′( g^(i) ⊙ (W^(i) y^(i−1) − x_shift^(i)) + b^(i) ) ⊙ g^(i) ). [12]

Thus, we find for Eq. (11):

w_⊥^T = g w^T D^(K−1) W^(K−1) · · · D^(1) W^(1). [13]

By rearranging this matrix product, we obtain a linear weighted sum of the input weight vectors,

w_⊥^T = Σ_j ( g w^T D^(K−1) W^(K−1) · · · D^(2) W^(2) d^(1)_:j ) w^(1)_j:, [14]

with w^(1)_j: the j'th row of W^(1) and d^(1)_:j the j'th column of D^(1).
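As a numerical cross-check of Eqs. (11)-(14), w_⊥ can also be obtained directly by automatic differentiation of the pre-activation with respect to the input; a minimal PyTorch sketch (layer sizes are placeholders):

```python
import torch

def decision_boundary_normal(layers, x, neuron_idx):
    """Local normal vector w_perp of a neuron's decision boundary (Eqs. 11/13).

    `layers` is a list of (W, gain, x_shift, bias) tuples defining
    y = relu(gain * (W @ y_prev - x_shift) + bias); the normal vector of the selected
    neuron's pre-activation in the last listed layer is obtained by autodifferentiating
    that pre-activation with respect to the input x.
    """
    x = x.clone().requires_grad_(True)
    y = x
    for i, (W, g, xs, b) in enumerate(layers):
        pre = g * (y @ W.T - xs) + b
        if i == len(layers) - 1:
            a = pre[neuron_idx]          # pre-activation of the chosen neuron
            break
        y = torch.relu(pre)
    (w_perp,) = torch.autograd.grad(a, x)
    return w_perp                         # a weighted sum of the rows of the first-layer W

# Hypothetical two-layer example on a 784-dimensional input:
layers = [(torch.randn(100, 784), torch.ones(100), torch.zeros(100), torch.zeros(100)),
          (torch.randn(1, 100), torch.ones(1), torch.zeros(1), torch.zeros(1))]
w_perp = decision_boundary_normal(layers, torch.randn(784), neuron_idx=0)
```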

Table S3. The unsupervised loss functions with their constraints and parameter values (table columns: unsupervised loss function, constraints, parameter values).
Without regularizer or constraint, the optimum of the reconstruction loss for k ≤ n is given by the singular value decomposition of X or ∆X, and the rows of W^(1) are the principal components if X is centered (∆X is centered by definition). With the L1 regularizer, C^(1) is commonly referred to as the sparse code (SC) and W^(1) as the sparse dictionary (SD).
Unsupervised learning of weights combined with supervised gain modulation. The unsupervised optimization problems for PCA, ∆PCA, SC, SD, and ∆SD were solved using Scikit-learn (11), and for PMD and ∆PMD we used a custom implementation. The networks consisted of a single hidden layer, with weights to the hidden layer neurons resulting from optimising the loss functions in Table S3 (for RP, weights were drawn from a Gaussian distribution and then normalized to unit Euclidean norm). Weights to the output unit were uniform and also normalized to have unit Euclidean norm (w_out = (1, . . . , 1)/√k, with k the dimensionality of the hidden layer). Activation functions, target outputs, the optimizer and the way in which performance was measured were identical to the fully supervised networks.
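A minimal sketch of how such hidden-layer weight matrices could be obtained with Scikit-learn; how the difference vectors ∆X are formed (here, by randomly pairing samples) and the dictionary-learning settings are assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.decomposition import PCA, MiniBatchDictionaryLearning

def unsupervised_hidden_weights(X, k=100, method="delta_pca", rng=None):
    """Hidden-layer weight matrix W (k x n) from an unsupervised decomposition of the data.

    X: data matrix (samples x pixels). For the 'delta' variants the decomposition is applied
    to difference vectors between randomly paired samples rather than to the samples themselves.
    """
    rng = np.random.default_rng(rng)
    if method.startswith("delta"):
        idx = rng.permutation(len(X))
        X = X - X[idx]                      # difference vectors Delta-X
    if method.endswith("pca"):
        return PCA(n_components=k).fit(X).components_
    elif method.endswith("sd"):             # sparse dictionary (L1-regularized reconstruction)
        return MiniBatchDictionaryLearning(n_components=k).fit(X).components_
    elif method == "rp":                    # random projection baseline, unit-norm rows
        W = rng.standard_normal((k, X.shape[1]))
        return W / np.linalg.norm(W, axis=1, keepdims=True)
    raise ValueError(method)
```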
We optimized gains for each task separately by performing gradient descent with batches of 100 samples, and performed an evolutionary meta-parameter optimization using DEAP (12) to find optimal values for x-shift, bias, and learning rate by maximizing validation performance on a subset of 10 tasks (Fig S4C).
To construct Fig S2, we evaluated the residual min_C ∥∆X − C W∥ of the reconstruction loss Eq. (3) during supervised training, without regularizer or constraint (i.e. the least-mean-squares optimum with respect to C), on matrices ∆X containing 1 000 differences.
Loss gradient with respect to task gains as a Hebbian learning rule with global error modulation. Here, we show that in the network architecture described in the previous section, the error gradient with respect to the task-dependent gains can be expressed as a Hebbian learning rule modulated by a global error signal. We interpret task-dependent gains to the hidden neurons with activity y_i (i = 1, . . . , k) as synaptic inputs originating from task-encoding neurons. With g_i,t the weight of the connection to neuron i associated with task t, and z_t ∈ {0, 1} the activity of the associated task-encoding neuron (1 if the task is active and 0 otherwise), the total gain is g_i = Σ_t g_i,t z_t. The structure of these networks then becomes

y_i = σ( (Σ_t g_i,t z_t)(w_i^T x − x_shift) + b ),   y_o = tanh( (Σ_t g_o,t z_t) Σ_i w_o,i y_i + b_o ), [15]

with y_o the activity of the output neuron, b_o its bias and g_o,t the weight of the task-gain connection to the output neuron. y_o,t ∈ {−1, 1} is the task-dependent target value for a given input sample. Computing the gradient of the task-loss L_t = (y_o − y_o,t)²/2 with respect to the task-gains in the hidden layer yields

∂L_t/∂g_i,t = (y_o − y_o,t) tanh′(a_o) g_o,t w_o,i · z_t σ′(a_i) (w_i^T x − x_shift), [16]

with a_o and a_i the pre-activations of the output and hidden neurons. One crucial observation here is that, since the feedforward weights to the output unit are all identical, they do not modify the error signal in a neuron-specific manner. By consequence, the error signal to the hidden layer is global, i.e. identical for all hidden neurons. A second crucial observation is that, while it may seem as if the post factor could change the sign of the gradient if

w_i^T x − x_shift < 0, [17]

this is actually impossible with the ReLU activation function (which results in σ′ ∈ {0, 1}) and positive gains together with negative bias, as obtained from the L5 PC model. Indeed, for the neuron to be active (σ′ = 1), the inputs need to be sufficiently strong, so that

w_i^T x − x_shift > −b/g_i ≥ 0, [18]

where the second inequality holds because we have negative bias and positive gains. Thus, whenever Eq. (17) is satisfied, the neuron is inactive (σ′ = 0) and w_i^T x − x_shift does not influence the update of g_i,t. We will leverage this fact in the spiking neural network, where we replace the post factor by a low-pass filter of the somatic output spikes to obtain a learning rule that follows the approximate gradient.
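A sketch of Eq. (16) as a three-factor (pre × post × global error) update, assuming a squared-error task loss and an output unit with its own task gain g_out and bias b_out as described above:

```python
import numpy as np

def task_gain_gradient(x, W, w_out, gains, x_shift, b, b_out, g_out, target):
    """Gradient of the squared task loss with respect to the hidden task gains (Eq. 16),
    written as pre * post * global-error. Shapes: W (k, n), gains (k,), w_out (k,) uniform.
    """
    pre = W @ x - x_shift                          # presynaptic factor w_i^T x - x_shift
    a_hidden = gains * pre + b
    y_hidden = np.maximum(a_hidden, 0.0)           # ReLU activations
    a_out = g_out * (w_out @ y_hidden) + b_out
    y_out = np.tanh(a_out)
    # Global error: identical for all hidden neurons because w_out is uniform.
    err = (y_out - target) * (1.0 - y_out ** 2) * g_out * w_out[0]
    post = (a_hidden > 0).astype(float)            # sigma'(a_i) for ReLU
    return err * post * pre                        # one entry per hidden task gain
```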
Spiking network. In order to simulate the network model, which consisted of one hidden layer with 100 neurons, over sufficiently long time scales to allow significant learning, we reduced the passive L5 PC model by retaining only the dendritic compartments and the soma (Fig 6A, Fig S5), using the method proposed by Wybo et al. (13) to conserve dendro-somatic response properties. We equipped each dendritic compartment with an AMPA+NMDA synapse for each context (47 in multitask EMNIST, Fig 6, and 14 for the boolean tasks, Fig S6), whose weight evolved according to

ẇ = η_dend(u_dend) u_pre u_post ϵ, [19]

with η_dend a learning rate dependent on a low-pass filter u_dend of the local dendritic voltage v_dend, u_pre a low-pass filter of the input spikes to the contextual synapse, u_post a low-pass filter of the output spikes, and ϵ a global error signal, implemented as the low-pass filter of an error pulse whose amplitude a_e was proportional to the difference between the number of generated output spikes in the 'reward window' and the expected number (i.e. 0 or 1). This 50 ms reward window opened upon the arrival of the first feedforward input spike associated with a data sample. A delta pulse with amplitude a_e was then injected into ϵ at the closure time t_e of this reward window. Summarizing the above, the filter dynamics were

τ_dend du_dend/dt = −u_dend + v_dend,   τ_pre du_pre/dt = −u_pre + Σ_{t_pre} δ(t − t_pre),
τ_post du_post/dt = −u_post + Σ_{t_post} δ(t − t_post),   τ_ϵ dϵ/dt = −ϵ + a_e δ(t − t_e). [20]

Note that η_dend, u_dend, u_pre and u_post are all specific to the synapse, whereas ϵ is global and shared across all synapses. Model parameters are summarized in Table S4.
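A minimal Euler-integration sketch of Eqs. (19)-(20); the time constants and the form of the voltage-dependent learning rate η_dend are placeholders to be taken from Table S4:

```python
def step_plasticity(state, dt, v_dend, pre_spike, post_spike, err_pulse, p):
    """One Euler step of the dendritic plasticity rule (Eqs. 19-20).

    All traces are exponential low-pass filters; the weight change is the product of the
    pre- and post-synaptic traces, gated by the global error trace. `p` holds the assumed
    time constants and the learning-rate function eta_dend(u_dend). pre_spike, post_spike
    and err_pulse are the impulse inputs arriving during this time step (0 if none).
    """
    u_dend, u_pre, u_post, eps, w = state
    u_dend += dt / p["tau_dend"] * (v_dend - u_dend)      # filtered dendritic voltage
    u_pre  += -dt / p["tau_pre"] * u_pre + pre_spike      # filtered contextual input spikes
    u_post += -dt / p["tau_post"] * u_post + post_spike   # filtered somatic output spikes
    eps    += -dt / p["tau_eps"] * eps + err_pulse        # global error trace (shared)
    w      += dt * p["eta_dend"](u_dend) * u_pre * u_post * eps
    return u_dend, u_pre, u_post, eps, w
```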
To ensure that the combined feedforward post-synaptic potential (PSP) varied within a reasonable dynamic range, we took the feedforward weights derived previously (∆PMD, ∆SD, PCA, and RP to the hidden layer, and (1, . . . , 1)/√k to the output neuron) and applied a scale factor

s_w = v_range / (∥w_in∥_1 z_in n_in) [21]

to the input weight vector w_in of each neuron, with z_in the somatic input resistance of the L5 PC, and n_in the number of feedforward inputs (n = 784 for a neuron in the hidden layer and k = 100 for the output neuron in multitask EMNIST, and n = 2 and k = {4, 10} for the boolean tasks). v_range is a parameter that fine-tunes the dynamic range of the combined feedforward PSPs. For the hidden neurons, v_range was set heuristically to 40 mV on multitask EMNIST and 20 mV on the boolean tasks, and for the output neuron, v_range was set to 300 mV on multitask EMNIST and 200 mV on the boolean tasks. Together with z_in, this yielded feedforward weights in nA. Note that for the output weights, we additionally added Gaussian noise to the weight vector, with σ = 0.1 µ and µ = s_w/√k. To train this system on multitask EMNIST, we presented 200 000 samples for each task (balanced across task-classes, so repetition of the same sample may occur) by converting them to Gaussian input spike bursts (the number of input spikes in a burst varied between 0 and 10 and was proportional to pixel intensity), and injecting these bursts into the network at 150 ms intervals. We then froze the plasticity rule and tested performance on 500 samples for each task taken from the EMNIST test set (again balanced across task-classes). For the boolean tasks, there were either no spikes in the burst when the input was 0 or 10 spikes when the input was 1. In Table S4, sets denote that a hyper-parameter was part of a grid search over the given values.
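A sketch of the intensity-to-burst conversion used for the input spike trains; the Gaussian burst width is an assumed parameter (it is not specified for EMNIST in the text):

```python
import numpy as np

def pixel_to_burst(intensity, t_onset, max_spikes=10, burst_sd_ms=2.0, rng=None):
    """Convert a pixel intensity in [0, 1] into a Gaussian spike burst.

    The number of spikes is proportional to the intensity (0..max_spikes); spike times are
    drawn from a Gaussian centred on t_onset. burst_sd_ms is an assumed width parameter.
    """
    rng = np.random.default_rng(rng)
    n_spikes = int(round(intensity * max_spikes))
    return np.sort(t_onset + burst_sd_ms * rng.standard_normal(n_spikes))

# One sample presented at t = 0 ms, the next at t = 150 ms, and so on:
# spike_trains = [pixel_to_burst(p, t_onset=0.0) for p in image.flatten()]
```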
Task-modulated contrastive learning. For TMCL (Fig 5), we converted CIFAR-10 (14) and STL-10 (15) into multitask learning problems (multitask CIFAR-10 and multitask STL-10) as before, by defining 10 1-vs-all classification tasks, and sampled data in a balanced manner across tasks and task-classes. We kept 9 000 samples as a validation set for CIFAR-10 and 1 350 for STL-10. Each color channel was centered and then normalized to unit variance. We implemented a visual geometry group-like (VGG) architecture following Illing et al. (16). For multitask CIFAR-10, we trained a stack of L convolutional layers with kernel size 3, stride 1 and 64 channels, and applied batch normalization (17) to the task-modulated convolution outputs before applying the ReLU activation function. For multitask STL-10, we employed a kernel size of 7 and stride of 3 in the first layer, and an identical architecture as for multitask CIFAR-10 everywhere else. Batch normalization renders the task-independent bias b obsolete; it was not included in our simulations. By consequence, the output of a convolutional unit was

ReLU( BatchNorm( g_t (w ∗ x − x_shift) ) ), [22]

where x denotes the image patch, w refers to the respective convolutional filter, ∗ denotes the convolution operation, and g_t the task-specific gain. Every second layer was succeeded by a MaxPool layer with stride 2×2. Our approach learns a stack of convolutional layers iteratively, by adding the next layer on top of the previously learnt ones. Each iteration consists of two distinct phases: first, feedforward filters to the next layer are learned through CL, and subsequently we learn task- and neuron-specific gains for the new layer in a supervised manner.
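A minimal PyTorch sketch of the task-modulated convolutional block of Eq. (22), assuming one gain per output channel and task; padding and initialization details are placeholders:

```python
import torch
import torch.nn as nn

class TaskModulatedConv(nn.Module):
    """Convolutional block computing ReLU(BatchNorm(g_t * (w * x - x_shift))) (Eq. 22).

    Layer sizes follow the text (64 channels, 3x3 kernels, stride 1); everything else
    here is an assumption.
    """
    def __init__(self, in_channels, n_tasks, out_channels=64, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.x_shift = nn.Parameter(torch.zeros(out_channels))        # shared, task-independent
        self.gains = nn.Parameter(torch.ones(n_tasks, out_channels))  # task- and channel-specific
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x, task_idx):
        pre = self.conv(x) - self.x_shift[None, :, None, None]
        g = self.gains[task_idx][..., None, None]   # broadcast gains over the spatial dimensions
        return torch.relu(self.bn(g * pre))
```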
For the CL phase, we followed the SimCLR algorithm (18). We used the same image augmentations that the authors used in their CIFAR-10 experiments, followed by standard score normalization according to the statistics of the original dataset. To train the filters from layer l − 1 to layer l (l = 1, . . . , L), we first generated batches of augmented hidden representations in layer l − 1. With x_1, . . . , x_N a batch of input samples (see Table S5 for batch size), we generated an augmented batch x̃_1, . . . , x̃_2N of twice the original size, with samples x̃_k, x̃_{N+k} being positive pairs, i.e. generated through image augmentations from the same source sample x_k. We then propagated this augmented batch through the hierarchy to layer l − 1, while applying task-gains from a random task to each augmented sample, to obtain a batch of task-modulated hidden representations ỹ_1, . . . , ỹ_2N. Next, the convolutions with the filters to be learned from layer l − 1 to layer l were computed and the ReLU was applied to obtain the representation that was fed into the CL-MLP. The CL-MLP consisted of a hidden layer and an output layer with dimension 64, and used ReLU activation. The similarity s_i,j between representations z_i and z_j obtained from the final layer of the CL-MLP was computed as their cosine similarity

s_i,j = z_i · z_j / (∥z_i∥_2 ∥z_j∥_2). [23]

The loss for a positive pair (i, j) was computed as

ℓ_i,j = −log [ exp(s_i,j / τ) / Σ_{k=1, k≠i}^{2N} exp(s_i,k / τ) ], [24]

summed over all positive pairs, with temperature τ = 0.5. The error gradient of this loss was then used to train the filters from layer l − 1 to layer l. All parameters, including convolutional filters, for layers < l − 1 remained frozen. For l = 1, layer l − 1 is the input layer and no task-modulation could be applied, so that ỹ_k ≡ x̃_k.
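A compact PyTorch sketch of this contrastive objective (Eqs. 23-24), assuming the positive pairs are arranged as z_k, z_{N+k} as described above:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """SimCLR-style contrastive loss (Eqs. 23-24) for a batch of 2N projections,
    where z[k] and z[N + k] form a positive pair (a sketch of the formulation in ref. 18)."""
    z = F.normalize(z, dim=1)                      # unit norm, so dot products are cosine similarities
    sim = z @ z.T / temperature                    # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))              # exclude self-similarity from the denominator
    n = z.shape[0] // 2
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # index of each positive partner
    return F.cross_entropy(sim, targets)           # -log softmax at the positive pair, averaged
```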
To perform the classification for all one-vs-all tasks, we reduced the height and width axes by averaging, retaining a 64-dimensional vector with the per-channel averages. Finally, an output unit applied a tanh(·) nonlinearity to the inner product of the 64-dimensional representation vector and a learned, task-independent weight vector w_out. Task-specific gains in layer l were then trained to minimize the classification loss (the same as in our fully connected architectures) at the output unit (test performance at this output unit is what is reported in Fig 5B,C), and the process was continued at layer l + 1.
Except for the task-specific gains, all parameters (i.e. the CL-MLP parameters, convolutional filters, the x-shift and the output unit) followed the Kaiming initialization, a standard approach in the VGG literature (19). For the task-specific gains, we tested the following initialization strategies (a code sketch follows the list):
• Constant(1): Initialize each entry to 1.
• KaimingUniform: Each entry is sampled i.i.d. from Uniform(−γ, γ) with γ = √(2/(3 n_in)), where n_in is the number of inputs targeting a given unit.
• Rademacher+KaimingUniform: Sample each entry from Rademacher, then add a sample from KaimingUniform to each entry.
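A sketch of the three gain-initialization strategies; the KaimingUniform bound γ = √(2/(3 n_in)) is our reading of the formula above and should be treated as an assumption:

```python
import torch

def init_gains(n_tasks, n_units, n_in, strategy="Constant(1)"):
    """Task-gain initialization strategies tested for TMCL (gamma is an assumed bound)."""
    if strategy == "Constant(1)":
        return torch.ones(n_tasks, n_units)
    gamma = (2.0 / (3.0 * n_in)) ** 0.5
    uniform = torch.empty(n_tasks, n_units).uniform_(-gamma, gamma)
    if strategy == "KaimingUniform":
        return uniform
    if strategy == "Rademacher+KaimingUniform":
        rademacher = torch.randint(0, 2, (n_tasks, n_units)).float() * 2 - 1  # random +/-1 entries
        return rademacher + uniform
    raise ValueError(strategy)
```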
The final performance numbers reported were selected from the grid search described in Table S5, according to the best accuracy on the validation set. All experiments were implemented in PyTorch (9) and training was performed using Adam (10). We compared our TMCL algorithm with four control cases: (i) error backpropagation, (ii) CL without task-similarity, (iii) RP, and (iv) stacking RP layers on top of a TMCL-trained first layer. For (i), we trained filters, x-shifts, and task-gains across all layers on the supervised classification loss. For (ii), we applied the same task-gains to both augmented data samples in a CL pair. For (iii), filters remained unchanged from their initialization, while task-gains were trained in a supervised manner. Finally, for (iv), we trained the initial layer with plain TMCL and then applied the same procedure as in (iii). All control experiments used the applicable parts of the grid search as outlined in Table S5. For Fig 5D, we split the tasks into two disjoint subsets, one with T ∈ {1, . . . , 10} tasks that were different across CL-pairs, and one with 10 − T tasks that were the same. We then sampled a random task for the first pair element and, when the task was in the first subset, sampled another random task from the same subset for the second pair element. When the task was in the other subset, we applied the same task to the second pair element.

Fig. S1. … For neuron-specific modulations (middle and bottom rows), the optimal (green cross) modulation learning rate (y-axis) was generally larger than the weight learning rate (x-axis). C: Performances on multitask EMNIST of task-specific networks (no parameters shared across tasks, grey), task-specific readouts (red) and the various possible neuron-specific modulations, for all architectures (x-labels denote the number of hidden units in each layer).

Fig. S2. … Fig 3) of the input weight matrix W, for an architecture with one hidden layer of 100 units, and for task-specific networks (grey) and neuron-specific modulations (blue, dashed). The reconstruction loss was evaluated on W during supervised training and compared with the case where W was given by ∆PMD (purple, dash-dotted), with the aim to investigate whether the supervised approach, trained purely on the classification loss, also minimizes Eq. (3), and thereby to elucidate the extent to which supervised training adapts the input weight matrix to the subspace of the data, and how that depends on the task set. To that end, we compare the value of the residual min_C ∥∆X − C W∥ of the reconstruction loss Eq. (3) during training of W in a fully supervised fashion with the reconstruction loss when W was given by ∆PMD.
We assess the case where the entire network was task-specific (feedforward weights included, grey) and the gain-modulated case (blue), and find that the sharp initial reduction in train loss (i.e. the classification loss, top) is associated with a sharp reduction in reconstruction loss (bottom). The reconstruction loss for the gain-modulated network, where a single feedforward weight matrix has to contribute to solving a multitude of tasks, decreased to a value much closer to the ∆PMD optimum than the task-specific network average, where the feedforward weight matrix only has to solve a single task. Thus, supervised training forgets the initialization by adapting the feedforward weights to the data subspace, and then fine-tunes to the specific set of tasks, as validation performance during the later phase increases from the unsupervised value (obtained with ∆PMD) to its maximum. The unsupervised approach, on the other hand, is generalist and admits any task that might be defined post hoc, at the cost of forgoing task-specific fine-tuning.

Fig. S4. … Fig 4, i.e. principal component analysis (top), applied to data samples (PCA) or difference vectors (∆PCA), sparse dictionary learning (middle), applied to data samples (SD) or differences (∆SD), and penalized matrix decomposition (bottom), applied to data samples (PMD) or differences (∆PMD). B: Comparison of performance of gain-modulated networks on multitask EMNIST, between feedforward weights given by applying the respective unsupervised learning algorithms to the data samples versus the difference vectors, as a function of layer size. For PCA and ∆PCA, weight vectors are very similar (A), resulting in similar performance for k ≤ 100. For larger k, performance decreases as the extra principal components are no longer useful for the classification tasks. For SD and PMD applied to data samples vs. differences, the distinct weight vectors (A) result in a performance increase for ∆SD resp. ∆PMD vs. SD resp. PMD. C: To combine the unsupervised weights with supervised gain modulation to learn the specific tasks, we performed an evolutionary meta-parameter optimization on the shared bias, the shared x-shift, the gain learning rate and the gain initialization. Note that shared parameters were not only shared across tasks, but also across neurons. Gain initialization was Gaussian with optimized mean (gain init avg) and standard deviation (gain init stdev). Plots show performance of a randomly chosen subset of ten tasks evaluated on the validation set for ∆PMD with k = 100, for all configurations tried by the evolutionary algorithm. The ten best meta-parameter configurations are marked by red crosses. The optimal meta-parameter configuration was determined in this way for each input weight matrix and hidden layer size shown in Fig 4F.

Fig. S5. … Note that the simplification needs to incorporate all bifurcations between branches with input sites (grey squares on the right) to be accurate (13). B: Matrix of steady-state input and transfer resistances (diagonal and off-diagonal elements, respectively) between the synaptic input sites and the soma, computed for the full model (left) and the reduction (right). C: Input and transfer resistance kernels (also termed impulse response kernels or the Green's function) between the soma and three randomly selected dendritic sites (as indicated in A). Note that these kernels are symmetric, z_ij(t) = z_ji(t) ≡ z_{i↔j}(t).
D: Responses in the full model (grey, full line) and the reduction (colored, dashed line) for the soma (teal) and the three dendritic sites (blue). Only the dendritic sites receive synaptic input, in the form of excitatory inputs to AMPA + NMDA synapses (blue) and inhibitory inputs to GABA_A synapses (red). E: Number of compartments in the full and reduced models. F: Runtime of the full and reduced models on an Apple M1 Max MacBook Pro with 32 GB of RAM for a simulation time of 10 000 ms.

Fig. S6. Application of the biophysically realistic spiking network architecture to all non-trivial boolean functions. A: Two network configurations were trained, one with 4 hidden neurons and one with 10 hidden neurons. The weight vectors were distributed so as to allow construction of decision boundaries with normal vectors in any direction of the two-dimensional input space. For 4 hidden neurons (top), we chose the centers of the 4 quadrants, whereas for 10 hidden neurons (bottom), we divided the unit circle into 10 equal parts and sampled a vector at random from each of these parts. B: We learn the 14 possible non-trivial boolean classification tasks on inputs x, y ∈ {0, 1}. Twelve of these tasks are linearly separable (left) and two are linearly non-separable (right). The output neuron of the network should learn to spike in response to the blue crosses, and not to spike in response to the red circles. C: Voltage response of the output neuron (in the network with 10 hidden neurons) to inputs x, y ∈ {0, 1}. Note that input values were sampled in such a way that, for each task, the network received a balanced set of input combinations with + and • targets. Input values {0, 1} were converted to a Gaussian burst of 6 ms width containing 20 spikes in the case of 1, and no spikes otherwise. The output neuron initially spikes indiscriminately (left) but learns to spike correctly after learning (right), shown here for the "and" (top) and "xor" (bottom) tasks. The apparent variability in spike amplitude is due to a recording time step of 1 ms. D: Performance (best of 3 initialization seeds) for the networks with 4 and 10 hidden neurons.

Fig. S7. … (1). A: Configuration of contextual input sites (blue triangles), the soma (teal square) and a recording site in the apical trunk (near the Ca2+ hotzone, red square). Labeled sites (1-4, Ca hotzone) correspond to the plotted traces in B. B: Responses to identical feedforward input (top, green) for various levels of contextual input, recorded at the soma (teal), the Ca2+ hotzone (red), and selected dendritic compartments (blue, locations as indicated in A). C: Effective membrane conductance change for the three modulation levels shown in B, as in Fig 1H. D-F: Modeling of the IO relationship as in Fig 2A, B & E. Gain modulation with a shared x-shift and bias is most accurate.