Sparsity through evolutionary pruning prevents neuronal networks from overfitting

Modern machine learning techniques take advantage of the exponentially rising computational power of new-generation processor units. Thus, the number of parameters that are trained to solve complex tasks has increased strongly over the last decades. However, in contrast to our brain, these networks still fail to develop general intelligence in the sense of being able to solve several complex tasks with only one network architecture. This could be the case because the brain is not a randomly initialized neural network that has to be trained by simply investing a lot of computational power, but has some fixed hierarchical structure from birth. To make progress in decoding the structural basis of biological neural networks, we here chose a bottom-up approach, in which we evolutionarily trained small neural networks to perform a maze task. This simple maze task requires dynamic decision making with delayed rewards. We were able to show that, during the evolutionary optimization, random severance of connections led to better generalization performance of the networks compared to fully connected networks. We conclude that sparsity is a central property of neural networks and should be considered in modern machine learning approaches.


Introduction
Sparsity is a characteristic property of the wiring scheme of the human brain, which consists of about $8.6 \times 10^{10}$ neurons Herculano-Houzel (2009), interconnected by approximately $10^{15}$ synapses Sporns et al. (2005); Hagmann et al. (2008). Thus, of almost $10^{22}$ theoretically possible synaptic connections, only one in 10 million is actually realized. This extremely sparse distribution of both neural connections and activity patterns is not a unique feature of the human brain Hagmann et al. (2008) but can also be found in other vertebrate species such as mice and rats Perin et al. (2011); Oh et al. (2014); Kerr et al. (2005); Song et al. (2005). Even evolutionarily very old organisms with a quite simple nervous system, such as the nematode C. elegans with only 302 neurons Jarrell et al. (2012) and over 7000 connections J.G.White, E.Southgate, J.N. Thomson (1986), show sparsity. Besides the described sparsity of virtually all nervous systems, many biological neural networks also show small-world properties Watts and Strogatz (1998); Amaral et al. (2000); Latora and Marchiori (2001); Bassett and Bullmore (2006), such as scale-free connectivity patterns Perin et al. (2011); van den Heuvel and Yeo (2017); Bassett et al. (2006).
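The order of magnitude quoted above can be checked with a short calculation. It assumes that the number of theoretically possible connections is counted as one per ordered pair of neurons; this counting rule is an assumption, as the text does not state how the figure was obtained.

```latex
\begin{align*}
N &\approx 8.6 \times 10^{10}, \\
N^2 &\approx 7.4 \times 10^{21} \approx 10^{22}, \\
\frac{10^{15}}{10^{22}} &= 10^{-7},
\end{align*}
```

That is, roughly one realized connection per 10 million possible ones.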
The described sparsity is the result of both phylogenetic and ontogenetic adaptations. Even though the immature nervous systems of almost all species are already very sparse, this sparsity increases even further during development and maturation of the organisms' nervous systems Low and Cheng (2006). In fact, the infant human brain contains twice as many synapses as the adult brain Kolb and Gibb (2011). Analogously, the immature nervous system of C. elegans contains more synapses than the adult form Oren-Suissa et al. (2016).
But pruning is not restricted to axons and synaptic connections. It even extends to the total number of neurons, which also decreases during development. For instance, the immature nervous system of C. elegans initially consists of 308 neurons Chalfie (1984), whereas the adult form contains only 302 neurons Jarrell et al. (2012). In humans, too, the number of neurons decreases during development Yeo and Gautier (2004). These ontogenetic changes are referred to as pruning Paolicelli et al. (2011), which seems to be a universal phenomenon across species from C. elegans to humans. Furthermore, pruning has been found to be mandatory for healthy development Hong et al. (2016). Failure of normal synaptic pruning may even lead to disorders such as schizophrenia Boksa (2012).
Since sparse connectivity architectures are realized on all scales and in a vast number of organisms of different complexity, it can be assumed that sparse connectivity is a general principle in neural information processing systems, leading to advantages compared to densely connected networks. One major advantage of sparse artificial deep neural networks used for image classification, compared to fully connected networks, is the reduction of computational costs together with an improved ability to generalize Han et al. (2015); Anwar et al. (2017); Wen et al. (2016); Mocanu et al. (2018).
However, these pure feed-forward network architectures show low biological plausibility, as they can neither deal with time series data nor have any memory-like features. Efficient processing of time series data in artificial neural networks is a complex task with several limitations. Training neural networks by unfolding the data in time is computationally expensive and time consuming, and leads to effects such as vanishing or exploding gradients Hochreiter (1998); Pascanu et al. (2012), which have been partly overcome by the introduction of Long Short-Term Memory networks Schmidhuber and Hochreiter (1997). However, these networks are difficult to interpret from a biological point of view.
To overcome these limitations, a novel biologically inspired approach for processing time series data, called reservoir computing, was introduced Verstraeten et al. (2007); Lukoševičius et al. (2012). A so-called reservoir of neurons with fixed (i.e. not adjusted by training) random recurrent connections is used to calculate higher-order correlations of the input signal, which serve as input for a feed-forward output layer that is trained with error back-propagation. The properties of the reservoir networks were found to be ideal for biologically inspired parameters with a high sparsity Alexandre et al. (2009). Moreover, these reservoirs work best at the edge of chaos, meaning that the parameters have to be chosen so that they are balanced between complete chaos and absolute periodicity Schrauwen et al. (2007); Bertschinger and Natschläger (2004); Krauss et al. (2019).
Much effort has been undertaken to apply reservoir computing to tasks with delayed rewards, such as robot navigation in mazes Antonelo and Schrauwen (2012); Antonelo et al. (2007).
However, reservoir computing still requires the output layer to be trained in a supervised way using back-propagation and, thus, complex tasks with a delayed reward are difficult to realize. Therefore, in this study we used an evolutionary approach to train networks to solve a maze task. We did not use reinforcement learning approaches, as these techniques suffer from the hidden state problem, meaning that not all information is present at any time, and from the delayed reward problem Littman (2001), which plays an important role in our maze task Lample and Chaplot (2017).
In our evolutionary system we were able to show that the random severing of connections (evolutionary pruning), without explicitly rewarding sparsity, led to a general sparsification of the networks and to better generalization performance.

Software Resources
All simulations were run on a desktop computer equipped with an i9 extreme processor (Intel) with 10 calculation cores. The complete software was written in Python 3.6 using the libraries sys, os, glob, subprocess, json, natsort, pickle, shutil, and NumPy Van Der Walt et al. (2011). Data visualization was done with Matplotlib Hunter (2007), and plots were arranged using Pylustrator Gerum (2019).

Maze Task
The task for the agents to perform is a maze based on a rectangular grid of 400 × 22 cells (Fig. 1c). There are two types of cells, free cells and wall cells. Free cells can be entered; wall cells cannot. The border of the maze consists of walls to prevent agents from leaving the maze. Starting from the left, every 2 to 10 cells a wall with a length between 4 and 20 cells is inserted. With a probability of 0.25 the wall is inserted from the same side (top or bottom) as the previous wall; with a probability of 0.75 it is inserted from the opposite side.
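The maze construction described above can be sketched as follows. This is a minimal illustration and not the original implementation; the function name `generate_maze`, the uniform integer draws, the random side of the first wall, and the convention that a wall's length is counted from the outer edge (so that a passage always remains) are all assumptions.

```python
import random

WIDTH, HEIGHT = 400, 22   # maze size given in the text
FREE, WALL = 0, 1


def generate_maze(seed=None):
    """Return the maze as a list of HEIGHT rows with WIDTH cells each."""
    rng = random.Random(seed)
    grid = [[FREE] * WIDTH for _ in range(HEIGHT)]

    # The border consists of walls so that agents cannot leave the maze.
    for x in range(WIDTH):
        grid[0][x] = grid[HEIGHT - 1][x] = WALL
    for y in range(HEIGHT):
        grid[y][0] = grid[y][WIDTH - 1] = WALL

    from_top = rng.random() < 0.5      # assumption: side of first wall is random
    x = rng.randint(2, 10)             # assumption: uniform gap in [2, 10]
    while x < WIDTH - 1:
        length = rng.randint(4, 20)    # assumption: uniform length in [4, 20]
        for i in range(length):
            # Assumption: the length is counted from the outer edge, so even
            # a wall of length 20 leaves a passage on the opposite side.
            y = i if from_top else HEIGHT - 1 - i
            grid[y][x] = WALL
        # Same side with probability 0.25, opposite side with probability 0.75.
        if rng.random() >= 0.25:
            from_top = not from_top
        x += rng.randint(2, 10)
    return grid
```

Calling `generate_maze(seed=0)` twice yields the same maze, which mirrors the use of a fixed set of training mazes across repetitions.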
Agents always start at the left end of the maze, facing to the right. As 'sensory' input each agent receives the distance to the wall in front, to the left, and to the right (input neurons 0 to 6, cf. Fig. 1a).
If the distance is larger than 10, it is set to 10 (visual range). The agent also receives the direction it is currently facing as a one-hot encoded, four-neuron input, i.e. one neuron at a time is in state 1 and the others are in state 0. This input serves as a kind of compass. The seven input neurons receive input exclusively from the environment and none from other neurons; thus, they are reset at each time step.
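As an illustration of this input encoding, the seven input values could be assembled as in the sketch below. The function name `encode_inputs`, the heading labels, and the order of the compass neurons are hypothetical; only the visual range of 10 and the 3 + 4 layout come from the text.

```python
VISUAL_RANGE = 10
# Hypothetical labels and ordering of the four compass neurons.
HEADINGS = ("east", "south", "west", "north")


def encode_inputs(d_left, d_front, d_right, heading):
    """Return the 7 input values: 3 clipped distances + one-hot compass."""
    distances = [min(d, VISUAL_RANGE) for d in (d_left, d_front, d_right)]
    compass = [1 if h == heading else 0 for h in HEADINGS]
    return distances + compass
```

For example, `encode_inputs(3, 25, 10, "east")` clips the front distance to the visual range and activates only the first compass neuron.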
The agent can output three values for the three possible actions: go straight, turn right, or turn left. The action with the highest value is selected (winner takes all; output neurons 7 to 9, cf. Fig. 1a). When the agent chooses to go straight and the next cell is a wall, it is not moved.
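The action rules above can be sketched as follows. The names `select_action` and `step`, the order of the output neurons, the coordinate convention (y pointing down), and the assumption that a turn rotates the agent in place without moving it are illustrative choices, not taken from the original software.

```python
# Assumed order of the three output neurons.
ACTIONS = ("straight", "right", "left")
# Headings as (dx, dy) in clockwise order, with y pointing down:
# 0 = right/east, 1 = down/south, 2 = left/west, 3 = up/north.
DIRECTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def select_action(outputs):
    """Winner-takes-all over the three output values (first index wins ties)."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return ACTIONS[best]


def step(grid, x, y, heading, action):
    """Apply one action; grid[y][x] == 1 marks a wall cell."""
    if action == "right":
        return x, y, (heading + 1) % 4   # assumption: turning happens in place
    if action == "left":
        return x, y, (heading - 1) % 4
    dx, dy = DIRECTIONS[heading]
    nx, ny = x + dx, y + dy
    if grid[ny][nx] == 1:                # blocked by a wall: agent is not moved
        return x, y, heading
    return nx, ny, heading
```

Because ties resolve to the first index, an agent whose outputs are all zero goes straight.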
After 400 actions, the x-distance covered by the agent is fed to the fitness function, which is proportional to the covered distance (cf. Fig. 1c). Thus, the reward is delayed by 400 time steps.

Network
The logic of each agent consists of a fully connected network of N = 16 neurons with states s (cf. Fig. 1a,b). The connection weights W are initialized with random values drawn from a uniform distribution on the interval [−σ, +σ] with $\sigma = 4 \cdot \sqrt{6/(N+N)}/10$. For each time step t of the task, the 3 + 4 input values (distances (left, front, right) and one-hot encoded direction) are set as the states of neurons #0 to #6. Then one pass through the network is calculated, $s_{t+1} = \mathrm{ReLU}(W \cdot s_t)$, with $\mathrm{ReLU}(x) = \max(0, x)$ being the rectified linear function. The states of neurons #7 to #9 are used as outputs to choose one of the three possible actions to perform in the maze task (for the connectivity matrix see Fig. 1a). The action is determined by a winner-takes-all method (in the case of no activation, the "move" action is chosen). Our approach is policy based, as the output of the network directly is the action to take and no quality assessment of different states is undertaken.
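A minimal sketch of one network update, written in plain Python for self-containment, may help to make the data flow concrete. The function names, the convention that `weights[i][j]` is the connection from neuron j to neuron i, and the mapping of output index 0 to the "move" action are assumptions; the initialization width `sigma` is passed in as a parameter rather than fixed here.

```python
import random

N = 16                       # number of neurons
N_INPUTS = 7                 # neurons #0..#6: three distances + four compass
OUT_START, OUT_END = 7, 10   # neurons #7..#9 are the outputs


def init_weights(sigma, rng):
    """Draw an N x N weight matrix uniformly from [-sigma, +sigma]."""
    return [[rng.uniform(-sigma, sigma) for _ in range(N)] for _ in range(N)]


def network_step(weights, state, inputs):
    """One pass s_{t+1} = ReLU(W . s_t) after setting the input neurons."""
    s = list(state)
    s[:N_INPUTS] = inputs    # input neurons are overwritten at every step
    # Assumption: weights[i][j] is the connection from neuron j to neuron i.
    return [max(0.0, sum(w * v for w, v in zip(row, s))) for row in weights]


def choose_action(state):
    """Winner-takes-all over the outputs; index 0 ("move") wins ties."""
    outputs = state[OUT_START:OUT_END]
    return max(range(len(outputs)), key=lambda i: outputs[i])
```

A typical loop would call `network_step` once per time step with the freshly encoded inputs and feed the returned state back in as `state` for the next step.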

Evolutionary Algorithm
To optimize the networks for the maze task, we use an evolutionary algorithm Fekiac et al. (2011). To this end, a pool of 1000 agents is created with random initialization, and 10 mazes are created for the agents to be trained on.
In each iteration, all agents perform the maze task and are assigned a fitness depending on their score in the task. The best half of the agent pool then has the chance to create offspring. The probability of an agent generating offspring is proportional to its fitness relative to the other agents. Agents with a probability of 10% or more are set to a maximal probability of 10% to retain biodiversity. For each of the old agents to be replaced, a parent agent is selected at random according to these reproduction probabilities. Agents can have multiple offspring or no offspring at all.
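The selection step can be sketched as below. This is an illustration under stated assumptions: fitness values are non-negative, relative fitness is normalized within the best half of the pool, and the capped probabilities are used as relative weights without explicit renormalization. The function names are hypothetical.

```python
import random

MAX_PROBABILITY = 0.10   # cap from the text, meant to retain diversity


def reproduction_probabilities(fitness):
    """Relative fitness of each candidate parent, capped at 10%."""
    total = sum(fitness)
    if total <= 0:       # degenerate case, not covered by the text
        return [min(1.0 / len(fitness), MAX_PROBABILITY)] * len(fitness)
    return [min(f / total, MAX_PROBABILITY) for f in fitness]


def assign_parents(fitness, rng):
    """Map the index of every replaced agent to the index of its parent."""
    ranked = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    half = len(ranked) // 2
    survivors, replaced = ranked[:half], ranked[half:]
    probs = reproduction_probabilities([fitness[i] for i in survivors])
    # random.choices treats the capped values as relative weights, so one
    # parent can be drawn several times and others not at all.
    parents = rng.choices(survivors, weights=probs, k=len(replaced))
    return dict(zip(replaced, parents))
```

Each replaced agent would then be overwritten by a copy of its assigned parent before the mutation step.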
After offspring generation, each agent is mutated. We used three different mutation types:
• Weight mutation: The connection weights W are each mutated by addition of a Gaussian distributed random variable ($\mu = 0$, $\sigma = |\sigma_\mathrm{mut}|$).
• Connection mutation: Existing connections are removed with a probability $p_\mathrm{disconnect}$ and non-existing connections are added with a probability $p_\mathrm{connect} = p_\mathrm{disconnect}$. Removed connections have a weight of 0 and are not subject to weight mutations. Thus, a removed connection cannot be recovered by a simple weight mutation step, but only by a reconnection mutation.
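The weight and connection mutations listed above could be implemented along the following lines. The explicit boolean `connected` mask (needed to tell a removed connection from a weight that happens to be 0), the function names, and the choice that a restored connection restarts at weight 0 are assumptions made for this sketch.

```python
import random


def mutate_weights(weights, connected, sigma_mut, rng):
    """Add Gaussian noise (mean 0, std |sigma_mut|) to existing connections."""
    std = abs(sigma_mut)
    return [
        [w + rng.gauss(0.0, std) if c else 0.0 for w, c in zip(w_row, c_row)]
        for w_row, c_row in zip(weights, connected)
    ]


def mutate_connections(weights, connected, p_disconnect, rng):
    """Remove existing and restore missing connections with equal probability."""
    p_connect = p_disconnect
    new_weights = [list(row) for row in weights]
    new_connected = [list(row) for row in connected]
    for i, row in enumerate(connected):
        for j, is_connected in enumerate(row):
            if is_connected and rng.random() < p_disconnect:
                new_connected[i][j] = False
                new_weights[i][j] = 0.0      # removed connections have weight 0
            elif not is_connected and rng.random() < p_connect:
                new_connected[i][j] = True
                new_weights[i][j] = 0.0      # assumption: restart at weight 0
    return new_weights, new_connected
```

Because `mutate_weights` skips entries whose mask is `False`, a removed connection stays at zero until a connection mutation restores it.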
The fitness (eq. 2) is calculated from the squared mean of the square-root distances the agent reached in all 10 training mazes (SMR, eq. 3; this favors generalizing agents that perform acceptably in all mazes over specializing agents that perform well in only one maze), the maximum mean activation of the neurons, and optionally the mean number of active connections.
All experiments are repeated for 5 different seeds of the random number generator that is used for the initial weights and the mutations. The different repetitions were performed on the same 10 training mazes to keep them comparable.
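The squared mean of the square-root distances (SMR) follows directly from its verbal definition and can be sketched as below; the function name is hypothetical, and the remaining terms of the fitness (activation and sparsity) are not reproduced here. The example distances are made up purely to illustrate the effect.

```python
import math


def squared_mean_root(distances):
    """Square of the mean of the square roots of the reached distances."""
    mean_root = sum(math.sqrt(d) for d in distances) / len(distances)
    return mean_root ** 2


# Illustrative numbers only: a generalist that reaches 64 cells in every
# maze versus a specialist with a similar arithmetic mean (62.5 cells).
generalist = [64] * 10
specialist = [400] + [25] * 9
```

Here `squared_mean_root(generalist)` is 64 while `squared_mean_root(specialist)` is 42.25, so the measure penalizes the specialist far more strongly than the arithmetic mean would.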

Results
The evolutionary algorithm was able to find solutions enabling the agents to efficiently navigate through the mazes.
In all experiments except the one without weight mutations, the agents gradually learned to perform better in the maze task over the generations (Fig. 2h-m). Convergence was quite slow, as about 5000 generations were needed to converge to a stable solution. The convergence behaviour was mostly independent of the seed of the random number generator, except for the "no mutation" condition, which relied strongly on the initial weights. The fitness during training was best for the conditions with no or low sparsification pressure (Fig. 2h,j,m).
The fitness in validation mazes, which the agents had not seen during training, was more sensitive to the seed than the fitness during training. For the experiments with more sparsification pressure (Fig. 2b,e), the validation fitness exceeded the fitness during training, showing good generalisation, whereas the experiments with lower sparsification pressure (Fig. 2a,c,f) showed more problems with over-fitting, meaning that the fitness is higher during training than during validation.
Validation fitness increased in most runs during training, although some drops in validation fitness were observed. Sometimes the fitness recovered after a few generations, but in some cases it continued to fluctuate (see Supplementary Material). In general, when the mean validation fitness dropped, the variability of the fitness also increased, showing that even if some agents found an "over-fitting" solution, it was not quickly adopted by all agents, whereas more general solutions seemed more stable and were adopted by the whole pool of agents.
The weight distribution of the final generation was different for each experiment. The experiments with low sparsification pressure, which also showed overfitting, have more connections and more weights with large absolute values (Fig. 2n,p,s) than the other experiments, which have more small weight values (Fig. 2o,q). Apart from "connection severance no mut" and "connection severance sparsity reward" (48% and 49% negative weights), all experiments show more negative weights (52-55%), corresponding to inhibitory synapses, which indicates more interesting behaviour and more efficient information processing Krauss et al. (2019).
In most experiments, the sparsity increased over time (Fig. 3a), but in "connection severance" it even decreased slightly after the initial rise, and in "sparsity reward" the sparsity fluctuated strongly over time. A higher sparsity was in all cases (except the no mutation case) associated with a higher validation fitness (Fig. 3b), showing that sparsity improves the generalisation behaviour of evolutionarily trained networks. A comparison of the training fitness to the validation fitness also shows that, for the sparser cases, the training fitness decreases while the validation fitness increases. Therefore, the sparsification prevents overfitting and enhances the generalisation performance. In addition, it improves computational efficiency in the evolutionarily trained networks, as the computations, which are divided between many neurons in the fully connected control group, are forced in sparser networks to be carried out on a small subset of neurons.
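The sparsity measure plotted in Fig. 3a, one minus the fraction of non-zero connections, can be computed from a weight matrix as in the following sketch. It assumes that every entry of the N × N matrix counts as a possible connection, including self-connections and connections into input neurons; the text does not specify this.

```python
def sparsity(weights):
    """Return 1 - n_nonzero / n_possible for a weight matrix (list of rows)."""
    n_possible = sum(len(row) for row in weights)   # assumption: all entries
    n_nonzero = sum(1 for row in weights for w in row if w != 0)
    return 1 - n_nonzero / n_possible
```

A fully connected matrix thus has sparsity 0 and a matrix without any connection has sparsity 1.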
Furthermore, it could be shown that the networks which perform best in the test mazes develop simple feed-forward structures (cf. Fig. 1a). Additionally, some asymmetry in the connectivity matrix can be observed (cf. Fig. 1b). On the one hand, the bias units prefer the turn towards a certain direction. On the other hand, the connection from the input distance sensor to the turn output (e.g. from the distance input d_left to the left-turn output in the example in Fig. 1b) is over-represented for one side. Thus, the networks have a preference to go to one side (e.g. to turn left in the example in Fig. 1b) if there is enough space. The bias unit serves as a counterpart when the agent moves along the upper edge (i.e. the left side as seen when moving along the x-axis) and guarantees that the agent can move away from the wall. The simple network architectures allow for the analysis of the functional tasks of certain neurons. This understanding of the functional tasks of neurons in artificial neural networks could potentially help to understand biological neural networks Jonas and Kording (2017); Kriegeskorte and Douglas (2018). Different experiments develop quite different solutions to the maze task (see Supplementary Material S1). Interestingly, in some experiments similar structures emerge regardless of the seed. This is especially the case in the "connection severance" experiment, where 4 out of 5 solutions are strikingly similar. This hints at the existence of strongly attractive maxima in the space of possible solutions.

Discussion
In this study we showed that evolutionary pruning of artificial neural networks, evolutionarily trained to solve a simple maze task, leads to sparser networks with better generalization properties compared to dense networks trained without pruning. The evolutionary pruning is realized via two mutation mechanisms. First, the networks can set a threshold below which the absolute value of a weight is effectively set to zero. Second, some connections are removed (weight set to 0) at random with a given probability. The random removal of existing connections leads to more robust networks with better generalization abilities and can thus be added as an additional mechanism to improve the evolutionary training of neural networks.
However, the maze task is still not complicated enough to trigger the development of complex recurrent neural networks with abilities such as memory. As the organisms in the maze are always provided with a "compass", meaning that they always know in which direction they are pointing, the task can be solved with a Markov-like decision process, as the information of one time step is enough to decide the next action. Thus, after 5000 epochs the best performances are achieved by simple feed-forward networks (cf. Fig. 4). Nevertheless, it has to be considered that these networks were not forced to develop feed-forward architectures; these architectures are a result of the Markov properties of the task they were trained on. The simulation shows that simple feed-forward networks are able to navigate through a maze using only 16 sparsely connected neurons.
These networks demonstrate that only a small number of neurons is needed to navigate through an environment, and suggest that, for example, C. elegans with its 302 neurons J.G.White, E.Southgate, J. N.Thomson (1986) should indeed be able to perform complex tasks. It has already been shown that C. elegans shows thermo- and electrotaxis Gabel et al. (2007), which, in contrast to its chemotaxis, are not simply a biased random walk (for certain temperatures thermotaxis also changes to a random walk) Pierce-Shimomura et al. (1999); Ryu and Samuel (2002). C. elegans moves directly along the electric gradient Gabel et al. (2007). Thus, the amphid sensory neurons of C. elegans could be seen as an equivalent to the "compass" neurons in our simulation.
The study described here does not use a biologically inspired neuron configuration; however, it demonstrates that 16 neurons are enough to perform relatively simple navigation tasks and hints that the development of sparsity has evolutionary advantages. These findings contrast with current developments in the field of artificial intelligence, where networks are scaled up as more computational power becomes available Xu et al. (2018).
Furthermore, the maze task could be extended so that it is no longer a simple Markov process, which means that the next decision cannot be made by simply analyzing the current position in the maze Wierstra and Wiering (2004); Littman (2012). This increase in complexity can be achieved, e.g., by removing the compass neurons. Consequently, a neural network that is to achieve a performance similar to that of networks with compass neurons needs to develop some "memory" features. The networks would have to remember which movements they recently executed, and they would have to dynamically recall this information. An even more demanding task would be to force the agents to go directly back to the starting point after they have passed the maze. This task requires path integration Wehner and Wehner (1986); Etienne and Jeffery (2004), which in turn requires the ability to flexibly navigate in physical space like insects Wehner and Wehner (1986); Müller and Wehner (1988, 1994); Andel and Wehner (2004), and, at least in mammals, episodic memory Etienne et al. (1996); Séguinot et al. (1998); McNaughton et al. (2006). It has been demonstrated that these abilities can only be achieved within highly recurrent networks, as they can be found in the hippocampus Etienne and Jeffery (2004). Recently it has been shown that the network architecture of the hippocampus is not limited to spatial navigation, but seems to be domain-general Aronov et al. (2017); Killian and Buffalo (2018); Nau et al. (2018) and even allows navigation in abstract high-dimensional cognitive feature spaces Eichenbaum (2015); Constantinescu et al. (2016); Garvert et al. (2017); Bellmund et al. (2018); Theves et al. (2019). Future work will have to investigate whether networks with the above-mentioned abilities can also be found by evolutionary algorithms.

Figure 1: Example network and mazes. a, Weight matrix of an exemplary network. b, The same network displayed as connections. c, Mazes are 400 cells wide and 22 cells high. Walls (red) are at all borders and randomly placed in between. The green line depicts one ideal path through the maze.

Figure 2: Performance of networks of the different experiments during training and validation. a-f, Fitness over generations for the different experiments in 10 validation mazes. Curves show 5 different training seeds, evaluated on the same 10 validation mazes. h-m, Fitness during training over generations for the 5 different training seeds. Labels stand for the different properties of the condition: "mut" for mutation of the weights, "sev" for severing/restoring connections, "sprs." for adding a sparsity reward to the fitness function. n-s, Histograms of the connection weights for the different experiments. Zero connections are not included in the histograms. Gray shaded areas in the center indicate which weights can be removed without reducing the fitness.

Figure 3: Correlation of validation fitness to sparsity and training fitness. a, Sparsity ($1 - n_\text{non-zero}/n_\text{possible}$) as a function of the epochs (generations). b, Correlation of validation fitness to sparsity. Except for the case of "connection severance no mutation", validation fitness increases on average with increased sparsity. c, Correlation of validation fitness to training fitness. Higher training fitness leads in all experiments also to higher validation fitness, indicating that none of the networks runs into severe overfitting.

Figure S1: Best network connections for all experiments and seeds. The network connections of the best network visualised as a connected graph and a connection matrix for each seed (column) and each experiment (row).

Figure S2: Validation fitness in the "control" experiment. The validation fitness over generations for different seeds (columns) and different validation mazes (rows). Shaded area denotes the range from minimum to maximum, dotted lines indicate the 25th and 75th percentiles, and the solid line denotes the mean.

Figure S3: Validation fitness in the "connection severance" experiment. The validation fitness over generations for different seeds (columns) and different validation mazes (rows). Shaded area denotes the range from minimum to maximum, dotted lines indicate the 25th and 75th percentiles, and the solid line denotes the mean.

Figure S4: Validation fitness in the "connection severance (low rate)" experiment. The validation fitness over generations for different seeds (columns) and different validation mazes (rows). Shaded area denotes the range from minimum to maximum, dotted lines indicate the 25th and 75th percentiles, and the solid line denotes the mean.

Figure S5: Validation fitness in the "connection severance no mut" experiment. The validation fitness over generations for different seeds (columns) and different validation mazes (rows). Shaded area denotes the range from minimum to maximum, dotted lines indicate the 25th and 75th percentiles, and the solid line denotes the mean.

Figure S6: Validation fitness in the "connection severance sparsity reward" experiment. The validation fitness over generations for different seeds (columns) and different validation mazes (rows). Shaded area denotes the range from minimum to maximum, dotted lines indicate the 25th and 75th percentiles, and the solid line denotes the mean.

Figure S7: Validation fitness in the "sparsity reward" experiment. The validation fitness over generations for different seeds (columns) and different validation mazes (rows). Shaded area denotes the range from minimum to maximum, dotted lines indicate the 25th and 75th percentiles, and the solid line denotes the mean.

Table 1: Settings for the different experiments: the initial mutation rate $\sigma_\mathrm{mut}$, the probability of removing or restoring a connection $p_\mathrm{connect}$, and the sparsity reward factor $f_\mathrm{sparsity}$.