Stochasticity from function - why the Bayesian brain may need no noise

An increasing body of evidence suggests that the trial-to-trial variability of spiking activity in the brain is not mere noise, but rather the reflection of a sampling-based encoding scheme for probabilistic computing. Since the precise statistical properties of neural activity are important in this context, many models assume an ad-hoc source of well-behaved, explicit noise, either on the input or on the output side of single neuron dynamics, most often assuming an independent Poisson process in either case. However, these assumptions are somewhat problematic: neighboring neurons tend to share receptive fields, rendering both their input and their output correlated; at the same time, neurons are known to behave largely deterministically, as a function of their membrane potential and conductance. We suggest that spiking neural networks may, in fact, have no need for noise to perform sampling-based Bayesian inference. We study analytically the effect of auto- and cross-correlations in functionally Bayesian spiking networks and demonstrate how their effect translates to synaptic interaction strengths, rendering them controllable through synaptic plasticity. This allows even small ensembles of interconnected deterministic spiking networks to simultaneously and co-dependently shape their output activity through learning, enabling them to perform complex Bayesian computation without any need for noise, which we demonstrate in silico, both in classical simulation and in neuromorphic emulation. These results close a gap between the abstract models and the biology of functionally Bayesian spiking networks, effectively reducing the architectural constraints imposed on physical neural substrates required to perform probabilistic computing, be they biological or artificial.


Introduction
A ubiquitous feature of in-vivo neural responses is their stochastic nature [1][2][3][4][5][6]. The manifest saliency of this variability has spawned many functional interpretations, with the Bayesian-brain hypothesis arguably being the most notable example [7][8][9][10][11][12]. Under this assumption, the activity of a neural network is interpreted as representing an underlying (prior) probability distribution, with sensory data providing the evidence needed to constrain this distribution to a (posterior) shape that most accurately represents the possible states of the environment given the limited available knowledge about it.
Neural network models have evolved to reproduce this kind of neuronal response variability by introducing noise-generating mechanisms, be they extrinsic, such as Poisson input [13][14][15][16] or fluctuating currents [17][18][19][20][21], or intrinsic, such as stochastic firing [22][23][24][25][26][27] or membrane fluctuations [28,29,19]. However, while representing, to some degree, reasonable approximations, none of the commonly used sources of stochasticity is fully compatible with biological constraints. Contrary to the independent white noise assumption, neuronal inputs are both auto- and cross-correlated to a significant degree [30][31][32][33][34][35][36], with obvious consequences for a network's output statistics [37]. At the same time, the assumption of intrinsic neuronal stochasticity is at odds with experimental evidence of neurons being largely deterministic units [38][39][40]. Therefore, it remains an interesting question how cortical networks that use stochastic activity as a means to perform probabilistic inference can realistically attain such apparent randomness in the first place.
We address this question within the normative framework of sampling-based Bayesian computation [41][42][43][44][45], in which the spiking activity of neurons is interpreted as Markov Chain Monte Carlo sampling from an underlying distribution over a high-dimensional binary state space. We demonstrate how an ensemble of dynamically fully deterministic but functionally probabilistic networks can learn a connectivity pattern that enables probabilistic computation with a degree of precision that matches the one attainable with idealized, perfectly stochastic components. The key element of this construction is self-consistency: all input activity seen by a neuron is the result of output activity of other neurons that fulfill a functional role in their respective subnetworks. The present work thereby reconciles sampling-based probabilistic computation with experimental evidence from biology and suggests a resource-efficient implementation of stochastic computing that completely removes the need for any form of explicit noise.
Contributions: MAP, DD and IB conceived and designed the study. DD performed the analytical calculations and simulations. OB developed a software module based on NEST and PyNN which enabled faster, larger-scale simulations. AFK provided Python code for setting up SSNs on BrainScaleS. DD and MAP wrote the paper. DD designed and created the figures. All authors reviewed the manuscript.

Neuron model and simulation details
We consider (deterministic) LIF neurons with conductance-based synapses and dynamics described by C_m du/dt = g_l (E_l − u) + Σ_k g_k^syn(t) (E_{e/i}^{rev} − u), with synaptic conductances g_k^syn(t) = w_k Σ_{t_s} exp(−(t − t_s)/τ^syn) Θ(t − t_s), membrane capacitance C_m, leak conductance g_l, leak potential E_l, excitatory and inhibitory reversal potentials E_{e/i}^{rev}, synaptic strength w_k, synaptic time constant τ^syn and firing threshold ϑ. During the refractory period τ_ref following a threshold crossing, the membrane potential is clamped to the reset potential. We have chosen the above model because it provides a computationally tractable abstraction of neurosynaptic dynamics [40], but our general conclusions are not restricted to these specific dynamics.
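As an illustration, a minimal forward-Euler sketch of these dynamics follows; all parameter values, input spike times and variable names are hypothetical placeholders chosen for readability, not the values used in the study.

```python
import numpy as np

# Hypothetical parameters (illustrative values only)
C_m, g_l, E_l = 0.2, 0.01, -65.0                              # nF, uS, mV
E_rev = {'exc': 0.0, 'inh': -90.0}                            # reversal potentials (mV)
tau_syn, tau_ref, theta, u_reset = 10.0, 10.0, -52.0, -53.0   # ms, ms, mV, mV
dt, T = 0.1, 300.0                                            # step and duration (ms)
input_spikes = {'exc': [50.0, 60.0, 70.0], 'inh': [150.0]}    # presynaptic spike times (ms)
w = {'exc': 0.03, 'inh': 0.03}                                # synaptic weights (uS)

u, g_syn, refrac_left, out_spikes = E_l, {'exc': 0.0, 'inh': 0.0}, 0.0, []
for step in range(int(T / dt)):
    t = step * dt
    for syn in g_syn:                                         # exponential synaptic kernels
        g_syn[syn] *= np.exp(-dt / tau_syn)
        g_syn[syn] += w[syn] * sum(abs(t - ts) < dt / 2 for ts in input_spikes[syn])
    if refrac_left > 0.0:                                     # refractoriness: clamp to reset
        u, refrac_left = u_reset, refrac_left - dt
        continue
    du = g_l * (E_l - u) + sum(g_syn[s] * (E_rev[s] - u) for s in g_syn)
    u += dt / C_m * du
    if u >= theta:                                            # threshold crossing: spike
        out_spikes.append(t)
        refrac_left = tau_ref
```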
We further use the short-term plasticity mechanism described in [46] to modulate synaptic interaction strengths with an adaptive factor U_SE × R(t), where the time dependence of the reservoir R is given by dR/dt = (1 − R)/τ_rec − U_SE R Σ_s δ(t − t_s), with t_s denoting the times of presynaptic spikes and τ_rec the time scale on which the reservoir R recovers. This enables better control over the inter-neuron interaction, as well as over the mixing properties of our networks [47]. All simulations were performed with PyNN 0.8 [48] and NEST 2.4.2 [49].
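A compact event-driven sketch of the depressing reservoir R(t) under this rule (the values of U_SE and τ_rec are arbitrary illustration choices, not the parameters used in the study):

```python
import numpy as np

def depression_factors(spike_times, U_SE=0.5, tau_rec=200.0):
    """Return the adaptive factor U_SE * R(t_s) applied to each presynaptic spike."""
    R, t_last, factors = 1.0, None, []
    for t_s in spike_times:
        if t_last is not None:
            # the reservoir recovers exponentially towards 1 between spikes
            R = 1.0 - (1.0 - R) * np.exp(-(t_s - t_last) / tau_rec)
        factors.append(U_SE * R)
        R -= U_SE * R            # each spike consumes a fraction U_SE of the reservoir
        t_last = t_s
    return factors

print(depression_factors([10.0, 20.0, 30.0, 200.0]))
```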

Sampling framework
As a model of probabilistic inference in networks of spiking neurons, we adopt the framework introduced in [43,45]. There, the neuronal output becomes stochastic due to a high-frequency bombardment of excitatory and inhibitory Poisson stimuli (Fig. 1A), elevating neurons into a high-conductance state (HCS), in which they attain a high reaction speed due to a reduced membrane time constant. Under these conditions, a neuron's response function becomes approximately logistic and can be written as ϕ(u) = (1 + exp(−(u − u_0)/α))^{−1}, with inverse slope α and inflection point u_0. Together with the mean free membrane potential µ and the mean effective membrane time constant τ_eff, the scaling parameters α and u_0 can be used to translate the weight matrix W and bias vector b of a target Boltzmann distribution p(z) ∝ exp(½ z^T W z + z^T b) over binary random variables z ∈ {0, 1}^n into synaptic weights and leak potentials of a sampling spiking network (SSN) (Eqs. (5) and (6)). This translation effectively enables sampling from p(z), where a refractory neuron is considered to represent the state z_k = 1 (see Fig. 1B,C).
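The state interpretation can be made explicit in a few lines; the following sketch (function and variable names are ours) converts spike times into binary network states and estimates the sampled distribution:

```python
import numpy as np
from collections import Counter

def sampled_distribution(spike_times, n_neurons, tau_ref, t_max, dt=0.1):
    """Interpret each neuron as z_k = 1 for tau_ref after each of its spikes and
    return the empirical joint distribution over binary network states."""
    t_grid = np.arange(0.0, t_max, dt)
    z = np.zeros((len(t_grid), n_neurons), dtype=int)
    for k, times in enumerate(spike_times):        # spike_times: one array per neuron
        for t_s in times:
            z[(t_grid >= t_s) & (t_grid < t_s + tau_ref), k] = 1
    counts = Counter(map(tuple, z))
    return {state: n / len(t_grid) for state, n in counts.items()}

# usage: two neurons observed for 100 ms with tau_ref = 10 ms
p_net = sampled_distribution([np.array([5.0, 40.0]), np.array([12.0])],
                             n_neurons=2, tau_ref=10.0, t_max=100.0)
```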

Measures of network performance
To assess how well a sampling spiking network (SSN) samples from its target distribution, we use the Kullback-Leibler divergence D_KL(p_net || p_target) = Σ_z p_net(z) log[p_net(z)/p_target(z)], which quantifies the dissimilarity between the sampled distribution p_net and the target distribution p_target. For inference tasks, we determine the network's classification rate on a held-out subset of the data set that was put aside during training.
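Continuing the sketch above, the corresponding divergence can be computed as follows (assuming p_target is a dictionary covering all states that occur in p_net):

```python
import numpy as np

def dkl(p_net, p_target):
    """Kullback-Leibler divergence D_KL(p_net || p_target) over binary states."""
    return sum(p * np.log(p / p_target[state])
               for state, p in p_net.items() if p > 0.0)
```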

Learning algorithm
Networks were trained with a Hebbian wake-sleep algorithm which minimizes D_KL(p_net || p_target) [50], with parameter updates

ΔW_ij = η (⟨z_i z_j⟩_target − ⟨z_i z_j⟩_net) ,   (8)
Δb_i = η (⟨z_i⟩_target − ⟨z_i⟩_net) ,   (9)

where η is a learning rate that is either constant or decreases over time as η ∝ 1/t. For high-dimensional data sets (e.g., handwritten letters and digits), binary-unit networks were trained with the CAST algorithm [51], a variant of wake-sleep with a tempering scheme, and then translated to SSN parameters with Eqs. (5) and (6) instead of training the SSNs directly, in order to reduce simulation time.
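An array-based sketch of one such update, following the contrastive form written above (the batching and the function name are ours):

```python
import numpy as np

def wake_sleep_step(W, b, z_target, z_net, eta=1e-3):
    """One contrastive (wake-sleep) update: move the network statistics towards the
    target statistics. z_target, z_net: binary samples, shape (n_samples, n_neurons)."""
    corr_target = z_target.T @ z_target / len(z_target)
    corr_net    = z_net.T    @ z_net    / len(z_net)
    W += eta * (corr_target - corr_net)
    np.fill_diagonal(W, 0.0)                      # no self-connections
    b += eta * (z_target.mean(axis=0) - z_net.mean(axis=0))
    return W, b
```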

Results
We approach the problem of externally induced stochasticity incrementally. Throughout the remainder of the manuscript, we discern between background input, which is provided by other functional networks, and explicit noise, for which we use the conventional assumption of Poisson spike trains. We start by analyzing the effect of correlated background on the performance of SSNs. We then demonstrate how the effects of both auto- and cross-correlated background can be mitigated by Hebbian plasticity. This ultimately enables us to train a fully deterministic network of networks to perform different inference tasks without requiring any form of explicit noise.

[Figure caption fragment: Instead of using Poisson processes as a source of explicit noise, we replace the Poisson input with spikes coming from other networks performing spike-based probabilistic inference, by creating a sparse, asymmetric connectivity matrix between several SSNs. For instance, the red neuron receives not only information-carrying spikes from its home network (black lines), but also spikes from the other two SSNs as background (red arrows), and in turn projects back towards these networks.]
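One possible construction of such a sparse, asymmetric inter-SSN background connectivity, consistent with the description above (the connection probability and all names are ours, chosen only for illustration):

```python
import numpy as np

def inter_ssn_connections(n_ssns, n_per_ssn, epsilon, seed=0):
    """Directed (hence asymmetric) connectivity mask between SSNs: a neuron may project
    to neurons of other SSNs with probability epsilon, never within its own SSN."""
    rng = np.random.default_rng(seed)
    n_total = n_ssns * n_per_ssn
    home = np.repeat(np.arange(n_ssns), n_per_ssn)       # home SSN of each neuron
    mask = rng.random((n_total, n_total)) < epsilon
    mask &= home[:, None] != home[None, :]               # no within-SSN background
    np.fill_diagonal(mask, False)
    return mask

background = inter_ssn_connections(n_ssns=100, n_per_ssn=6, epsilon=0.1)
```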

Background autocorrelations
Unlike ideal Poisson sources, single spiking neurons produce autocorrelated spike trains, with the shape of the autocorrelation function (ACF) depending on their firing rate, or equivalently p(z = 1), and their refractory time τ_ref. For higher output rates, spike trains become increasingly dominated by bursts, i.e., sequences of equidistant spikes with an interspike interval (ISI) close to τ_ref. These fixed structures also remain at the population level, since the population autocorrelation equals the average of the ACFs of the individual spike trains. We investigated the effect of such autocorrelations on the output statistics of SSNs by replacing the Poisson input in the ideal model with spikes coming from other SSNs. As opposed to Poisson noise, the autocorrelation of the SSN-generated background (Fig. 2B) is non-singular and influences the free membrane potential (FMP) distribution (Fig. 2C) and thereby the activation function (Fig. 2D) of individual sampling neurons. With increasing firing rates (controlled by the bias of the neurons in the background SSNs), the number of significant peaks in the ACF increases as well, depending on the probability p̄ for a burst to start. This regularity in the background input manifests itself in a reduced width of the FMP distribution, scaled by a factor √β that depends on the ACF, which in turn translates to a steeper activation function with unchanged inflection point u_0 and a correspondingly reduced inverse slope α′. Thus, autocorrelations in the background input lead to a reduced width of the FMP distribution and hence to a steeper activation function compared to the one obtained with uncorrelated Poisson input. For a better intuition, we used an approximation of the activation function of LIF neurons here, but the argument also holds for the exact expression derived in [43], as verified by simulations (Fig. 2D).

[Figure 2 caption fragment: weight rescaling, followed by a transformation back into the z ∈ {0, 1}^n state space (which affects both weights and biases), can alleviate the effects of correlated background; simulation results shown as red dots, linear fit as red line. (I) Similar experiment as in (E) for a network with ten neurons, with parameters adjusted to compensate for input cross-correlations. As in the case of autocorrelated background, cross-correlations can be cancelled out by appropriate reparametrization.]
Apart from the above effect, the background autocorrelations do not affect neuron properties that depend linearly on the synaptic noise input, such as the mean FMP and the inflection point of the activation function (equivalent to zero bias). Therefore, the effect of the background autocorrelations can be functionally reversed by rescaling the functional afferent synaptic weights (those from other neurons in the principal SSN) by a factor equal to the ratio between the new and the original value of α, i.e., α′/α (Eqs. (5) and (6)), as shown in Fig. 2E.
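A sketch of this compensation, assuming the activation function has been re-measured under the actual autocorrelated background (u_bias: swept mean membrane potentials, p_on: measured probabilities of the neuron being active, alpha_ref: the inverse slope obtained under ideal Poisson input; all names are ours):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(u, u0, alpha):
    return 1.0 / (1.0 + np.exp(-(u - u0) / alpha))

def rescale_weights(w_functional, u_bias, p_on, alpha_ref):
    """Fit the activation function measured under autocorrelated background and
    rescale the functional afferent weights by the ratio alpha'/alpha."""
    (u0_new, alpha_new), _ = curve_fit(logistic, u_bias, p_on,
                                       p0=[np.median(u_bias), 1.0])
    return w_functional * alpha_new / alpha_ref, u0_new
```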

Background cross-correlations
In addition to being autocorrelated, background input to pairs of neurons can be cross-correlated as well, due to either shared inputs or synaptic connections between the neurons that generate said background. These background cross-correlations can manifest themselves in a modified cross-correlation between the outputs of neurons, thereby distorting the distribution sampled by an SSN.
However, depending on the number and nature of presynaptic background sources, background cross-correlations may cancel out to a significant degree. The correlation coefficient (CC) of the FMPs of two neurons fed by correlated noise can be written as a weighted sum over all pairs of background spike trains S_{l,i} projecting to neuron i and S_{m,j} projecting to neuron j. In this sum, C(S_{l,i}, S_{m,j}, ∆) denotes the cross-correlation function of the background inputs, C(κ, κ, ∆) = ⟨κ(t)κ(t + ∆)⟩ the unnormalized autocorrelation function of the postsynaptic potential (PSP) kernel κ, and λ_{li,mj} = Var(S_{l,i}) Var(S_{m,j}). The background cross-correlation is gated into the cross-correlation of the FMPs by the nature of the respective synaptic connections: if the two neurons connect to the cross-correlated inputs via synapses of different type (one excitatory, one inhibitory), the sign of the CC is switched (Fig. 2F). However, individual contributions to the FMP CC also depend on the difference between the mean free membrane potential and the reversal potentials, so the gating of cross-correlations is not symmetric for excitatory and inhibitory synapses. Nevertheless, if the connectivity statistics (in-degree and synaptic weights) from the background sources to an SSN are chosen appropriately and enough presynaptic partners are available, the total pairwise cross-correlation between neurons in an SSN can cancel out to zero, leaving the sampling performance unimpaired (Fig. 2G). It is important to note that this way of reducing cross-correlations is independent of the underlying weight distribution of the networks providing the background; the required cross-wiring of functional networks could therefore, in principle, be encoded genetically and would not need to be learned. Furthermore, a very simple cross-wiring rule, namely independently and randomly drawn connections, already suffices to achieve low background cross-correlations and thereby good sampling performance. While this method is guaranteed to work in an artificial setting, further analysis is needed to assess its compatibility with the cortical connectome with respect to connectivity statistics and synaptic weight distributions. However, even if cortical architecture prevents a clean implementation of this decorrelation mechanism, SSNs can themselves compensate for residual background cross-correlations by modifying their parameters, similar to the autocorrelation compensation discussed above.
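A small numerical experiment illustrating this cancellation; the toy correlation model, the balanced excitatory/inhibitory assignment (one simple instance of "appropriately chosen" connectivity statistics) and all parameter values are ours, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sources, k_in, n_steps, tau = 200, 100, 50000, 10.0     # illustrative values

# background pool whose pairwise correlations stem from one shared global source
shared = rng.random(n_steps) < 0.02
trains = np.array([(shared & (rng.random(n_steps) < 0.5)) |
                   (~shared & (rng.random(n_steps) < 0.01))
                   for _ in range(n_sources)])
kernel = np.exp(-np.arange(0.0, 5 * tau) / tau)            # PSP-like kernel (1 ms bins)

def fmp(signs, idx):
    """FMP proxy: signed, PSP-filtered sum of the background sources a neuron receives."""
    return np.convolve(signs @ trains[idx].astype(float), kernel)[:n_steps]

def draw(mixed_signs):
    idx = rng.choice(n_sources, k_in, replace=False)        # random in-degree pattern
    signs = (rng.permutation(np.repeat([-1.0, 1.0], k_in // 2))
             if mixed_signs else np.ones(k_in))              # balanced exc./inh. vs purely exc.
    return signs, idx

u1, u2 = fmp(*draw(True)), fmp(*draw(True))
print("balanced exc./inh. background:", np.corrcoef(u1, u2)[0, 1])   # close to zero
u1, u2 = fmp(*draw(False)), fmp(*draw(False))
print("purely excitatory background: ", np.corrcoef(u1, u2)[0, 1])   # clearly positive
```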
To demonstrate this ability, we need to switch from the natural state space of neurons, z ∈ {0, 1}^N, to the more symmetric state space z′ ∈ {−1, 1}^N. To conserve state probabilities (and thereby also correlations), the change of state variables z′ = 2z − 1 is accompanied by the linear parameter transformation W′ = W/4 and b′ = b/2 + (W 1)/4, with 1 the vector of ones. In the {−1, 1}^N state space, both synaptic connections w_ij and background cross-correlations ρ(s_i, s_j) shift probability mass from the mixed states (z′_i, z′_j) = (−1, 1) and (1, −1) to the aligned states (z′_i, z′_j) = (−1, −1) and (1, 1) (see SI, Fig. S1). Consequently, by adjusting b′ and W′, it is possible to find a weight matrix W′ (Fig. 2H) that precisely conserves the desired correlation structure between neurons, with the required weight adjustment parametrized by constants g_0 and g_1 (Fig. 2I). Therefore, when an SSN learns a target distribution from data, background cross-correlations are equivalent to an offset in the initial network parameters and are automatically compensated during training.
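The transformation between the two parametrizations is the standard mapping for Boltzmann distributions; as a self-contained sketch:

```python
import numpy as np

def to_pm1(W, b):
    """Boltzmann parameters for z in {0,1}^N -> parameters for z' = 2z - 1 in {-1,1}^N,
    such that all state probabilities (and hence correlations) are conserved."""
    return W / 4.0, b / 2.0 + W.sum(axis=1) / 4.0

def to_01(W_pm, b_pm):
    """Inverse transformation, back to the {0,1}^N parametrization."""
    return 4.0 * W_pm, 2.0 * b_pm - 2.0 * W_pm.sum(axis=1)

# round trip: to_01(*to_pm1(W, b)) recovers (W, b) for symmetric W with zero diagonal
```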
At this point, we can conclude that all effects that follow from replacing input noise in an SSN with functional output from other SSNs (which still receive explicit noise) can be compensated by appropriate parameter adjustments. This is an important preliminary conclusion for the next sections, where we show how all noise can be eliminated in an ensemble of interconnected SSNs endowed with synaptic plasticity without significant penalty to their respective functional performance. We start with larger ensembles of small networks, each of which receives its own target distribution, which allows a straightforward quantitative assessment of their sampling performance D KL (p net p target ). We study the behavior of such ensembles both in computer simulations and on mixed-signal neuromorphic hardware. Finally, we demonstrate the capability of our approach for truly functional, larger-scale networks, trained on high-dimensional visual data.

Sampling without explicit noise in large ensembles
We initialized an ensemble of 100 6-neuron SSNs with an inter-network connection probability of 0.1 and random synaptic weights. No external input is needed to kick-start network activity, as some neurons spike spontaneously due to the random initialization of parameters (see Fig. 3A). The existence of inhibitory weights disrupts the initial regularity, initiating irregular network activity. Ongoing learning (Equations (8) and (9)) shapes the sampled distributions towards their respective targets (Fig. 3B), the parameters of which were drawn randomly. Our ensemble achieved a sampling performance (median D_KL) of 1.06^{+0.27}_{−0.40} × 10^{−3}, which is similar to the median performance of an idealized setup (independent, Poisson-driven SSNs) of 1.05^{+0.15}_{−0.35} × 10^{−3} (errors are given by the first and third quartile). To put the above D_KL values into perspective, we compare the sampled and target distributions of one of the SSNs in the ensemble at various stages of learning (Fig. 3C). Instead of training ensembles, they can also be set up by translating the parameters of the target distributions to neurosynaptic parameters directly, as discussed in the previous section (see SI, Fig. S2).
As we show in the following, this approach to noise-free sampling-based computation can also be applied to physical neural substrates which incorporate unreliable components and are therefore significantly more difficult to control.

Implementation on a neuromorphic substrate
To test the robustness of our results, we studied an implementation of noise-free sampling on an artificial neural substrate. For this, we used the BrainScaleS system [52], a mixed-signal neuromorphic platform with analog neurosynaptic dynamics and digital inter-neuron communication (Fig. 3D, see also SI, Fig. S3). A major advantage of this implementation is the emulation speedup of 10^4 with respect to biological real time; however, for clarity, we shall continue using biological time units instead of actual emulation time.
The additional challenge for our neuronal ensemble is to cope with the natural variability of the substrate, caused mainly by fixed-pattern noise, or with other limitations such as a finite weight resolution (4 bits) or spike loss, which can all be substantial [53,54]. It is important to note that the ability to function when embedded in an imperfect substrate with significant deviations from an idealized model represents a necessary prerequisite for viable theories of biological neural function.

[Figure caption fragment (cf. Fig. 4E): In the first run, we clamped the top arc of a "B", compatible with either a "B" or an "R" (top three rows, red); in the second run, we chose the bottom line of an "L", compatible with an "L", an "E", a "Z" or a "C" (bottom three rows, red). An ensemble of SSNs performs Bayesian inference by implicitly evaluating the conditional distribution of the unstimulated visible neurons, which manifests itself here as sampling from all image classes compatible with the ambiguous stimulus (see also SI, Fig. S6).]
We emulated an ensemble of 15 4-neuron SSNs, with an inter-SSN connection probability of 0.2 and with randomly drawn target distributions. The biases were provided by additional bias neurons and adjusted during learning via the synaptic weights between bias and sampling neurons, along with the synapses within the SSNs, using the same learning rule as before (Equations (8) and (9)). After 200 training steps, the ensemble reached a median D_KL of 3.99^{+1.27}_{−1.15} × 10^{−2} (errors given by the distance to the first and third quartile), compared to 1.18^{+0.47}_{−0.55} before training (Fig. 3E). As a point of reference, we also considered the idealized case by training the same set of SSNs without interconnections and with every neuron receiving external Poisson noise generated from the host computer, reaching a D_KL of 2.49^{+3.18}_{−0.71} × 10^{−2}. This relatively small performance loss of the noise-free ensemble compared to the ideal case confirms the theoretical predictions and simulation results. Importantly, this was achieved with only a rather small ensemble, demonstrating that large numbers of neurons are not needed for realizing this computational paradigm.
In Fig. 3F, we show the sampling dynamics of all emulated SSNs after learning. While most SSNs are able to approximate their target distributions well, some sampled distributions are significantly skewed (Fig. 3G). This is caused by a small subset of dysfunctional neurons, which we have not discarded beforehand, in order to avoid an implausibly fine-tuned use-case of the neuromorphic substrate. These effects become less significant in larger networks trained on data instead of predefined distributions, where learning can naturally cope with such outliers by assigning them smaller output weights. Nevertheless, these results demonstrate the feasibility of self-sustained Bayesian computation through sampling in physical neural substrates, without the need for any source of explicit noise. Importantly, and in contrast to other approaches [55], every neuron in the ensemble plays a functional role, with no neuronal real-estate being dedicated to the production of (pseudo-)randomness.

Ensembles of hierarchical SSNs
When endowed with appropriate learning rules, hierarchical spiking networks can be efficiently trained on high-dimensional visual data [47,54,[56][57][58][59]. Such hierarchical networks are characterized by the presence of several layers, with connections between consecutive layers, but no lateral connections within the layers themselves. When both feedforward and feedback connections are present, such networks are able to both classify and generate images that are similar to those used during training.
In these networks, information processing in both directions is Bayesian in nature. Bottom-up propagation of information enables an estimate of the conditional probability that a particular label fits the input data. Additionally, top-down propagation of neural activity allows generating a subset of patterns in the visible layer conditioned on an incomplete or partially occluded visual stimulus. When no input is presented, such networks produce patterns similar to those enforced during training ("dreaming"). In general, the exploration of a multimodal solution space in generative models is facilitated by some noise-generating mechanism. We demonstrate how even a small interconnected set of hierarchical SSNs can perform these computations self-sufficiently, without any source of explicit noise.
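In the abstract Boltzmann picture from which the SSN parameters are later translated, both directions of processing correspond to Gibbs sampling with (part of) the visible layer clamped; a minimal sketch for a visible-hidden-label hierarchy (shapes, names and the clamping interface are ours, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, h, lab, W_vh, W_hl, b_v, b_h, b_l, clamp_v=None):
    """One Gibbs sweep over a visible-hidden-label hierarchy without lateral connections.
    clamp_v = (mask, values) optionally fixes part of the visible layer (the stimulus)."""
    h   = (rng.random(h.shape)   < sigmoid(v @ W_vh + lab @ W_hl.T + b_h)).astype(float)
    lab = (rng.random(lab.shape) < sigmoid(h @ W_hl + b_l)).astype(float)
    v   = (rng.random(v.shape)   < sigmoid(h @ W_vh.T + b_v)).astype(float)
    if clamp_v is not None:
        mask, values = clamp_v
        v[mask] = values            # clamped visible neurons follow the stimulus
    return v, h, lab

# e.g. clamp part of the visible layer to an ambiguous stimulus and iterate
# gibbs_step to read out the label units' activity over time
```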
We used an ensemble of four 3-layer hierarchical SSNs trained on a subset of the EMNIST dataset [60], an extended version of the widely used MNIST dataset [61] that includes digits as well as capital and lower-case letters. All SSNs had the same structure, with 784 visible units, 200 hidden units and 5 label units (Fig. 4A). To emulate the presence of networks with different functionality, we trained each of them on a separate subset of the data. (To combine sampling in space with sampling in time, multiple networks can also be trained on the same data, see SI Fig. S5.) Since training the spiking ensemble directly was computationally prohibitive, we trained four Boltzmann machines on the respective datasets and then translated the resulting parameters to neurosynaptic parameters of the ensemble using the analytical approximations for correlation compensation described earlier in the manuscript.
To test the discriminative properties of the SSNs in the ensemble, one was stimulated with visual input, while the remaining three were left to sample freely from their underlying distributions. We measured a median classification rate of 91.5^{+3.6}_{−3.0} %, with errors given by the distance to the first and third quartile, which is close to the 94.0^{+2.1}_{−1.5} % achieved by the idealized reference setup provided by the abstract Boltzmann machines (Fig. 4B). At the same time, all other SSNs remained capable of generating recognizable images (Fig. 4C). It is expected that direct training and a larger number of SSNs in the ensemble would further improve these results, but a functioning translation from the abstract to the biological domain already underpins the soundness of the underlying theory.
Without visual stimulus, all SSNs sampled freely, generating images similar to those on which they were trained (Fig. 4D). Without any source of explicit noise, the SSNs were capable of mixing between the relevant modes (images belonging to all classes) of their respective underlying distributions, which is a hallmark of a good generative model. We further extended these results to an ensemble trained on the full MNIST dataset, reaching a similar generative performance for all networks (see SI Fig. S5).
To test the pattern completion capabilities of the SSNs in the ensemble, we stimulated them with incomplete and ambiguous visual data (Fig. 4E). Under these conditions, SSNs only produced images compatible with the stimulus, alternating between different image classes, in a display of pattern rivalry. As in the case of free dreaming, the key mechanism facilitating this form of exploration was provided by the functional activity of other neurons in the ensemble.

Discussion
Based on our findings, we argue that sampling-based Bayesian computation can be implemented in ensembles of spiking networks without requiring any explicit noise-generating mechanism. While various explicit sources of noise exist in biology [62][63][64], these forms of stochasticity are either too weak (in the case of ion channels) or too high-dimensional for efficient exploration (in the case of stochastic synaptic transmission, as used for, e.g., reinforcement learning [65]). On the other hand, neuronal population noise can be highly correlated, affecting information processing by, e.g., inducing systematic sampling biases [32].
In our proposed framework, each network in an ensemble plays a dual role: while fulfilling its assigned function within its home subnetwork, it also provides its peers with the spiking background necessary for stochastic search within their respective solution spaces. This enables a self-consistent and parsimonious implementation of neural sampling, by allowing all neurons to take on a functional role and not dedicating any resources purely to the production of background stochasticity. The underlying idea lies in adapting neurosynaptic parameters through (contrastive) Hebbian learning to compensate for the auto- and cross-correlations induced by interactions between the functional networks in the ensemble. Importantly, we show that this does not rely on the presence of a large number of independent presynaptic partners for each neuron, as often assumed by models of cortical computation that use Poisson noise (see, e.g., [66]). Instead, a small ensemble of interconnected networks already suffices to implement noise-free Bayesian sampling. This becomes particularly relevant for the development of neuromorphic platforms, where it eliminates the computational footprint imposed by the generation and distribution of explicit noise, thereby reducing power consumption and bandwidth constraints.
The suggested noise-free Bayesian brain reconciles the debate on spatial versus temporal sampling [67,41]. In fact, the suggested ensembles of spiking neurons that provide each other with virtual noise may be arranged in parallel sensory streams. An ambiguous stimulus will trigger different representations at each level of these streams, forming a hierarchy of probabilistic population codes. While these population codes learn to cover the full sensory distribution in space, they will also generate samples of the sensory distribution in time (see Fig. S5 in the SI). Attention may select the most likely representation, while suppressing the representations in the other streams. Analogously, possible actions may be represented in parallel motor streams during planning, and a motor decision may select the one to be performed. In recordings from premotor cortex, such a selection manifests as a reduction of neural variability [68], which we suggest is effectively the signature of choosing the most probable action in a Bayesian sense.
In our simulations, we have used a simplified neuron model to reduce computation time and facilitate the mathematical analysis. However, we expect the core underlying principles to generalize, as evidenced by our results on neuromorphic hardware, where the dynamics of individual neurons and synapses differ significantly from the mathematical model. Such an ability to compute with unreliable components represents a particularly appealing feature in the context of both biology and emerging nanoscale technologies.

[Figure caption fragment, panels F-H: (F) Study of the optimal compensation rule in a network with two neurons. For simplicity, the ordinate represents weight changes for a network with states in the {−1, 1}^2 space, which are then translated to corresponding parameters (W, b) for the {0, 1}^2 state space. The colormap shows the difference between the sampled and the target distribution measured by the Kullback-Leibler divergence D_KL(p_net || p_target). The mapping provided by the compensation rule f (see (E)) is depicted by the green curve. Note that the compensation rule ∆W = f(s) provides a nearly optimal parameter translation. Remaining deviations are due to differences between LIF and Glauber dynamics. (G) Compensation of noise correlations in an SSN with ten neurons. The results are depicted for a set of ten randomly drawn Boltzmann distributions over z ∈ {0, 1}^{10} (error bars). For a set of randomly chosen Boltzmann distributions, a ten-neuron network performs sampling in the presence of pairwise-shared noise ratios s (x-axis). The blue line marks the sampling performance without noise-induced correlations (s = 0). For an increasing shared noise ratio, uncompensated noise (green) induces a significant increase in sampling error. After compensation, the sampling performance is nearly completely restored. As before, remaining deviations are due to differences between LIF and Glauber dynamics. (H) An LIF-based ten-neuron network with shared noise sources (s = 0.3 for each neuron pair) is trained with data samples generated from a target Boltzmann distribution (blue bars). During training, the sampled distribution becomes an increasingly better approximation of the target distribution (red line). For comparison, we also show the distribution sampled by an SSN with parameters translated directly from the Boltzmann parameters (purple). The trained network is able to improve upon this result because learning implicitly compensates for the abovementioned differences between LIF and Glauber dynamics.]

Figure S2: (A) A straightforward way to set up the parameters of each network (w_ij and E_l) is to use the parameter translation as described in the main text, i.e., to use the corresponding activation function of each neuron to correctly account for the background noise statistics. This is demonstrated here for the case of (left) 399 networks (only two shown) receiving Poisson noise and one network only receiving ensemble input, and (right) all networks only receiving ensemble input. In both cases, the resulting activation function is the same and we can indeed use it to translate the parameters of the target distribution to neurosynaptic parameters. (B) Using the corresponding activation functions to set up the ensemble (but no training), each network in the ensemble is indeed able to accurately sample from its target distribution without explicit noise, as expected from our considerations in (A) and the main text. This is shown here (in software simulations) for an ensemble of 400 3-neuron SSNs with an interconnection probability of 0.05, reaching a median D_KL of 12.8^{+6.4}_{−5.0} × 10^{−3} (blue), which is close to the ideal result with Poisson noise of 6.2^{+2.0}_{−2.0} × 10^{−3} (black, errors given as the first and third quartile).

Figure S3: (A) A single HICANN chip (High Input Count Analog Neural Network), the elemental building block of the BrainScaleS wafer. The HICANN consists of two symmetric halves and harbors analog implementations of adaptive exponential integrate-and-fire (AdEx) neurons and conductance-based synapses in 180 nm CMOS technology. Floating gates next to the neuron circuits are used to store neuron parameters. Spikes are routed digitally through horizontal and vertical buses (not shown) and translated into postsynaptic conductances in the synapse array. Unlike in simulations on general-purpose CPUs, here neurons and synapses are physically implemented, with no numeric computations being performed to calculate network dynamics. A single wafer consists of 384 HICANN chips. (B) Individual components of the BrainScaleS system, including both wafer and support structure. For instance, FPGA boards provide an I/O interface for wafer configuration and spike data, and Giga-Ethernet slots provide a connection between FPGAs and the control cluster from which users conduct their experiments via Python scripts using the PyNN API. (C) Completely assembled wafer of the BrainScaleS neuromorphic system.

Figure S4: t-SNE representation [70] of consecutively generated images of two of the four SSNs trained on EMNIST. Both SSNs smoothly traverse several regions of the state space representing image classes while dreaming. The red diamond marks the first image in the sequence, gray lines connect consecutive images. Consecutive images are 200 ms apart.

Figure S5: (A) Dreaming ensemble of five hierarchical SSNs with 784 visible, 500 hidden and 10 label neurons (without explicit noise). Each row represents samples from a single network of the ensemble, with samples being 375 ms apart. To set up the ensemble, a restricted Boltzmann machine was trained on the MNIST dataset and the resulting parameters translated to corresponding neurosynaptic parameters of the ensemble. Here, to facilitate mixing, we used short-term depression to modulate synaptic interactions and weaken attractor states that would otherwise be difficult to escape [47]. (B) t-SNE representation [70] of consecutively generated images of two of the five SSNs trained on MNIST digits. Both SSNs are able to generate and mix between diverse images of different digit classes while dreaming. The red diamond marks the first image in the sequence, gray lines connect consecutive images. Consecutive images are 400 ms apart.

[Figure caption fragment: Examples of the visible layer activity while classifying the image as a "T", "X" or "V". In these cases, the images generated by the visible neurons show prominent features of these letters.]