Sub-realtime simulation of a neuronal network of natural density

Full-scale simulations of neuronal network models of the brain are challenging due to the high density of connections between neurons. This contribution reports run times shorter than the simulated span of biological time for a full-scale model of the local cortical microcircuit with explicit representation of synapses on a recent conventional compute node. Realtime performance is relevant for robotics and closed-loop applications, while sub-realtime performance is desirable for the study of learning and development in the brain, processes extending over hours and days of biological time.


INTRODUCTION
The cortical neuronal network of mammals exhibits a two-fold universality: basic characteristics of its architecture are conserved in evolution from mouse to human as well as across brain areas. This has motivated researchers to investigate models of the local cortical microcircuit, the network below a square millimeter of cortical surface, as a universal building block of brain-like computing. It is the smallest network in which both a realistic number of 10,000 synapses per neuron and a connection probability of 0.1 are realized simultaneously.
In a prototype network model of the microcircuit [1], the spatial structure of the cortex is neglected and replaced by cell-type specific random connectivity. Each cortical layer is represented by an excitatory and an inhibitory population of integrate-and-fire model neurons (Fig. 1a).
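To make this construction concrete, the following minimal NEST sketch wires one excitatory and one inhibitory population with cell-type specific random connectivity; population sizes and weights are illustrative placeholders, not the parameters of [1].

    # Minimal sketch of cell-type specific random connectivity in NEST.
    # All parameter values are placeholders, not those of the model [1].
    import nest

    nest.ResetKernel()
    exc = nest.Create("iaf_psc_exp", 400)  # excitatory population of one layer
    inh = nest.Create("iaf_psc_exp", 100)  # inhibitory population of one layer

    conn = {"rule": "pairwise_bernoulli", "p": 0.1}  # connection probability 0.1
    nest.Connect(exc, exc, conn, {"weight": 0.15})
    nest.Connect(exc, inh, conn, {"weight": 0.15})
    nest.Connect(inh, exc, conn, {"weight": -0.6})
    nest.Connect(inh, inh, conn, {"weight": -0.6})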
The microcircuit model has become a benchmark for neuromorphic computing systems: it can be simulated with moderate hardware investments [2, 3], its natural size renders questions of downscaling irrelevant [4], and it marks an upper bound in difficulty, as larger neuronal networks are necessarily less densely connected and thus, relative to the problem size, easier to simulate.
Fast and energy-efficient simulation is a promise of neuromorphic computing [5]; it is desirable for large-scale neuroscientific models [6] and imperative in artificial intelligence and machine learning applications [7]. The first milestone is realtime performance, which was accomplished for the microcircuit model in 2019 on a neuromorphic system [8], followed this year by GPU systems [9, 10], one of them already breaking into the sub-realtime regime [10]. However, these results have to be evaluated in the light of continuously advancing commodity hardware as a reference technology providing more flexibility at potentially lower costs. With this aim we set out to investigate the performance of the general-purpose simulation engine NEST [11] on a recent conventional computing system. Preliminary results have been presented in abstract form [12].
Strong scaling experiments keep the task size fixed while systematically increasing the computational resources (Fig. 1b). Unless stated otherwise, the task is a simulation of 10 s of model time T_Model, the span of biological time described by the model. Measurements start after model instantiation with optimized initial conditions [8] and an initial interval of 0.1 s of model time to ensure that potential transients of the network dynamics are discarded. To assess simulation speed we use the realtime factor RTF = T_Wall / T_Model, where T_Wall denotes the wall-clock time, the time passed in the machine hall until the simulation completes. A realtime factor smaller than 1 implies sub-realtime performance.
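As a minimal illustration of this measurement, the sketch below times a call to nest.Simulate and computes the RTF; the small noise-driven network is a stand-in for the microcircuit model, and the thread count is an arbitrary assumption.

    # Timing sketch for the realtime factor; the network is a stand-in,
    # not the microcircuit model, and all parameters are assumptions.
    import time
    import nest

    nest.ResetKernel()
    nest.SetKernelStatus({"local_num_threads": 4})

    neurons = nest.Create("iaf_psc_exp", 1000)
    noise = nest.Create("poisson_generator", params={"rate": 8000.0})
    nest.Connect(noise, neurons)
    nest.Connect(neurons, neurons, {"rule": "fixed_indegree", "indegree": 100})

    t_model = 10_000.0                 # ms of model time to be measured
    nest.Simulate(100.0)               # 0.1 s transient, excluded from timing

    t_start = time.time()
    nest.Simulate(t_model)
    t_wall = time.time() - t_start     # wall-clock time in seconds

    rtf = t_wall / (t_model / 1000.0)  # RTF = T_Wall / T_Model
    print(f"RTF = {rtf:.2f}")          # RTF < 1: sub-realtime performance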
A common measure for comparing the energy consumption of neuromorphic systems is the energy per synaptic event, defined as the total consumed energy divided by the total number of transmitted spikes (see Supp. Inform. Power measurements). For conducting the benchmarks we employ the JUBE benchmarking environment [17].
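A hedged sketch of this bookkeeping follows, assuming that every delivery of a spike to one of its target synapses counts as one transmitted spike; all numbers are hypothetical placeholders, not measurement results.

    # Energy per synaptic event with purely hypothetical numbers.
    energy_joule = 3.5e4      # energy of the simulation phase (integrated power)
    spikes = 1.0e7            # spikes fired during the simulation phase
    out_degree = 10_000       # synapses per neuron, i.e. deliveries per spike

    events = spikes * out_degree
    energy_per_event = energy_joule / events
    print(f"{energy_per_event * 1e6:.2f} µJ per synaptic event")  # -> 0.35 µJ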

RESULTS
We assess the strong scaling performance of microcircuit model simulations by increasing the number of threads on up to two compute nodes with two different schemes of binding threads to cores, illustrated by the sketch below. In the "sequential" placing scheme, threads are bound to physically consecutive cores per socket (thread counts 1 to 64 in steps of 1), and 1 MPI process per socket is used for simulations on one and two full nodes with 128 and 256 threads, respectively. In the "distant" placing scheme, threads are bound such that L3-cache and chiplet overlap is minimized per node (thread counts 1 to 128 in steps of 1, see Supp. Inform. Distant Placing), and 1 MPI process per node is used.
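The two schemes can be characterized by the core lists they generate. The sketch below assumes a node with 128 cores in 32 four-core groups, each group sharing one L3 cache; this layout is a hypothetical simplification, and the real topology is machine specific.

    # Illustrative core lists for the two pinning schemes; the topology
    # (128 cores, 4 cores per shared L3 cache) is an assumption.
    CORES_PER_NODE = 128
    CORES_PER_L3 = 4
    N_L3_DOMAINS = CORES_PER_NODE // CORES_PER_L3  # 32 L3 domains

    def sequential_placing(n_threads):
        """Fill physically consecutive cores."""
        return list(range(n_threads))

    def distant_placing(n_threads):
        """Round-robin over L3 domains: two threads share an L3 only once
        every domain already hosts one thread (here, from thread 33 on)."""
        order = [d * CORES_PER_L3 + s
                 for s in range(CORES_PER_L3)
                 for d in range(N_L3_DOMAINS)]
        return order[:n_threads]

    print(sequential_placing(4))  # [0, 1, 2, 3]
    print(distant_placing(4))     # [0, 4, 8, 12]

Such core lists could then be handed to a pinning tool (for example numactl or the OMP_PLACES environment variable) when launching the simulation.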
For sequential placing, we observe linear scaling for thread counts between 1 and 32 and super-linear scaling between 32 and 64 (Fig. 1b). A full compute node achieves sub-realtime performance with an RTF of 0.7. Two nodes reduce the realtime factor to 0.59; the simulation runs 1.7 times faster than realtime. The distant placing scheme exhibits super-linear scaling already for small numbers of threads. At 33 threads, we note a sudden rise of the realtime factor: at this point, an L3 cache is shared by two threads for the first time. Nevertheless, sub-realtime performance is already achieved with only 64 threads. Comparing the two placing schemes at 128 and 256 threads, respectively, we observe that sequential placing results in better performance. This is due to 2 MPI processes being used on one node in the sequential placing scheme as compared with 1 for the distant placing. The relative time spent in the update phase on a single node is smaller for the distant placing than for the sequential one, and communication between the two nodes is not a limiting factor. This suggests that simulation time can be further reduced by increasing the number of nodes or, alternatively, by using faster nodes.
We also assess the energy consumption of the simulation phase to investigate how the increased power draw of using more computational resources is counterbalanced by the decreased simulation time (Fig. 1c). For this we compare a configuration using all 128 cores of a node with two configurations using only half of the cores: the first fills the cores of one socket sequentially, the second employs the distant placing scheme.
During simulations of 100 s of model time we record the power consumption and obtain the energy consumed in the simulation phase by integrating over the power readings.
We observe that power consumption during the simulation phase is largest for the distant placing (see Supp. Inform. Low level performance measurements). Ultimately, the 128-thread configuration not only exhibits the shortest time to solution but also requires the smallest amount of energy.
The energies per synaptic event for the two fastest configurations (128 and 256 threads in sequential placing) are 0.33 µJ and 0.48 µJ, respectively.

DISCUSSION
Our study shows that a single compute node achieves sub-realtime performance in the simulation of a local cortical microcircuit model at natural density. To the best of our knowledge, we report the lowest realtime factor so far, at a competitive energy consumption (Table I).
There are, however, preliminary data [18] on an even smaller realtime factor for a dedicated hardware system. This not only shows that specialized systems can be faster still, but also raises hope that methods of prefetching and latency hiding can further improve simulation code without restricting generality [19].
Achieving realtime performance is a criterion for robotics and closed-loop applications. For basic research and medical applications, faster-than-realtime simulations are also of use: biological processes extending over long periods of time can be observed on a compressed time scale, and multiple scenarios can be investigated quickly.
We hope that our results further advance and inspire the constructive competition between neuromorphic hardware and conventional computer architectures [2], which has led to an order-of-magnitude improvement within just four years.

DATA AVAILABILITY
All data and analysis code to reproduce the results of this study can be downloaded from https://doi.org/10.5281/zenodo.5637375.

ACKNOWLEDGMENTS
We are grateful to Tobias Noll and Arne Heittmann for fruitful discussions, to Susanne

Simulations on two nodes are launched by
mpirun --n 2 --npernode 1 --mca pml ucx -x UCX_NET_DEVICES=mlx5_1:1 --bind-to board python3 run_microcircuit.py
with, in this example, 1 MPI process per node.

Power measurements
Power was measured with a Raritan Dominion PX and a Raritan PX3-5190 power distribution unit (PDU). The units have an accuracy of ±5% and a data collection frequency of 1 Hz. The power measurement has a delay of 1 s, so the power readings need to be shifted by 1 s to be aligned with wall-clock time. Since the nodes are connected point-to-point, we do not need to take additional passive energy consumption by an interconnect into account.
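A sketch of this procedure under the stated 1 Hz sampling and 1 s reporting delay; the power trace and the simulation-phase interval are hypothetical.

    # Align 1 Hz PDU power readings (1 s reporting delay) with wall-clock
    # time and integrate over the simulation phase; values are hypothetical.
    import numpy as np

    power_watt = np.array([250.0, 510.0, 505.0, 512.0, 508.0, 260.0])
    t_read = np.arange(len(power_watt), dtype=float)  # reading timestamps, s
    t_true = t_read - 1.0          # shift back by the 1 s measurement delay

    t_start, t_stop = 0.0, 4.0     # simulation phase in wall-clock time, s
    mask = (t_true >= t_start) & (t_true <= t_stop)

    energy_joule = np.sum(power_watt[mask]) * 1.0  # 1 s per sample
    print(f"simulation-phase energy: {energy_joule:.0f} J")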

Low level performance measurements
In order to determine the number of cache misses we employ the perf performance analysis tool of the Linux operating system. We use the command perf stat -a -e task-clock,cycles,instructions,cache-references,cache-misses and increase the simulation time to 100 s. With this we ensure that approximately 80% of the run time of the program is spent in the simulation phase, guaranteeing a reliable assessment of the percentage of cache misses during that phase.
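From the two cache counters the miss percentage follows directly; the counter values below are hypothetical stand-ins for perf stat output.

    # Cache-miss percentage from perf counters; numbers are hypothetical.
    cache_references = 8.1e11
    cache_misses = 2.3e11
    print(f"cache miss rate: {100.0 * cache_misses / cache_references:.1f} %")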