Neural network surgery: Combining training with topology optimization

With ever increasing computational capacities, neural networks become more and more proficient at solving complex tasks. However, picking a sufficiently good network topology usually relies on expert human knowledge. Neural architecture search aims to reduce the extent of expertise that is needed. Modern architecture search techniques often rely on immense computational power, or apply trained meta-controllers for decision making. We develop a framework for a genetic algorithm that is both computationally cheap and makes decisions based on mathematical criteria rather than trained parameters. It is a hybrid approach that fuses training and topology optimization together into one process. Structural modifications that are performed include adding or removing layers of neurons, with some re-training applied to make up for any incurred change in input–output behaviour. Our ansatz is tested on several benchmark datasets with limited computational overhead compared to training only the baseline. This algorithm can achieve a significant increase in accuracy (as compared to a fully trained baseline), rescue insufficient topologies that in their current state are only able to learn to a limited extent, and dynamically reduce network size without loss in achieved accuracy. On standard ML datasets, accuracy improvements compared to baseline performance range from 20% for well performing starting topologies to more than 40% in case of insufficient baselines; alternatively, network size can be reduced by almost 15% without loss of accuracy.


Introduction
A common problem for any given machine learning task making use of artificial neural networks (ANNs) is how to choose a sufficiently good network topology. Picking one that is too small may not yield acceptable prediction accuracy. To improve results, one can keep adding structural elements to the network until the desired accuracy value has been reached. Networks that are too large, on the other hand, may cause an explosion in computational cost for both training and evaluation. Finding the optimal balance is heavily dependent on the given task, dataset and further hyperparameters, and often requires expert domain knowledge. A priori optimization is not easily possible, since reliable estimates of network behaviour already require training results, and no generalization exists as to which topology will fit which problem. Researchers have applied a number of search strategies such as random search (Li & Talwalkar, 2019) and Bayesian optimization (Kandasamy, Neiswanger, Schneider, Poczos, & Xing, 2018).
In this paper we propose a novel training regime incorporating a genetic algorithm that reduces computational cost compared to state of the art approaches of this kind (Dong & Yang, 2019; Li & Talwalkar, 2019). We achieve this by re-using network weights for competing modification candidates instead of retraining each net from scratch, branching off modification candidates during training, and letting them compete against each other until a new main branch is selected. This fuses the evolutionary optimization paradigm with ANN training into an integrated framework that folds both processes into a single training/topology optimization hybrid. As such, evolutionary steps are not carried out by a meta-controller or other black-box-like implementations, but instead make use of mathematical tools such as singular value decomposition (SVD) and the Bayesian information criterion (BIC) (Schwarz, 1978) for network weight analysis, decision making, and structural modifications. Network modifications are performed by adapting existing weights so as to incur minimal changes to input-output behaviour.
Our framework for a combined ANN training and neural architecture search consists of three main components: a module that can perform a number of minimally invasive network operations (''surgeries''), a module that analyses network weights and can give recommendations as to which modifications are most likely to increase (validation) accuracy, and finally a module that serves as a genetic algorithm (the ''Surgeon''), containing the former two while gradually evolving any given starting network. With the Surgeon, we are able to evolve and improve models for several benchmark datasets and varying starting topologies. We achieve particularly good results on starting topologies that would a posteriori have proven to be suboptimal. A great benefit of our approach is that it adds topology optimization to ML training while incurring very limited additional computational cost. Convergence is reached for all test cases within a few hours.
The supporting code can be accessed via https://github.com/ElisabethJS/neural-network-surgery. This paper contributes a computationally cheap ansatz for a genetic neural architecture search algorithm that makes evolutionary decisions based on mathematical analysis.

Related work
Neural architecture search (NAS) has been an increasingly popular research topic for many years (Elsken et al., 2019), starting as early as Miller et al. (1989), who presented one of the earliest neuro-evolutionary algorithms to search for suitable network topologies. Recent approaches by Dong and Yang (2019), Li and Talwalkar (2019), and Zoph and Le (2017) reach competitive performance on benchmark datasets such as CIFAR-10. However, this often comes at the cost of vast computational resources, with Zoph and Le (2017) making use of up to 800 GPUs for several weeks. Cai, Chen, Zhang, Yu, and Wang (2018) attempt to reduce computational costs by re-using network weights, as well as training and applying a reinforcement meta-controller for structural decisions. They make use of a number of function-preserving transformations (net2net) introduced by Chen, Goodfellow, and Shlens (2016), and extend them to also allow non-sequential network structures, such as DenseNet (Huang, Liu, van der Maaten, & Weinberger, 2017). DiMattina and Zhang (2010) introduce and rigorously prove conditions under which the parametrization of a neural network can be gradually changed while keeping the input-output behaviour constant.
İrsoy and Alpaydın (2020) learn the network structure via so-called ''budding perceptrons'', in which an extra parameter is learned that indicates whether any given node needs to branch out again or be removed altogether. Their method focuses on growing the network to the required size from a minimal starting topology. Frankle and Carbin (2019) present a method to identify particularly good network initializations that can train sparse networks to competitive accuracy. Another approach in NAS is to prune down from a larger starting topology (Blalock, Gonzalez Ortiz, Frankle, & Guttag, 2020). Popular pruning techniques include applying SVD to existing network weights (Denton, Zaremba, Bruna, LeCun, & Fergus, 2014; Girshick, 2015; Xue, Li, & Gong, 2013).
There are also a number of neural architecture search strategies that do not depend on manual network modifications. Liu et al. (2019) introduced DARTS, a method for differentiable architecture search that re-formulates the task of searching for network architectures as a graph optimization problem, where all possible network configurations are represented as nodes on a directed acyclic graph. This technique has rapidly become very popular and has seen a great number of extensions in various directions, such as Dong and Yang (2019), Li et al. (2021), Wang et al. (2021) and Xu et al. (2020). The downside of DARTS-based algorithms is that all possible network variations are predefined in advance and cannot be adapted during training based on the network state. Additionally, since all nodes of a certain hierarchy level need to be interchangeable or even skippable, the connections between network blocks are highly restricted with regard to network structure.
The novelty of our research lies in combining existing tools such as net2net (Chen et al., 2016) and SVD with a genetic algorithm that modifies the given network in a decision-based process instead of utilizing a black-box-like decision module, while retaining a very high level of structural freedom. We repeatedly generate functionally equivalent but structurally different networks that are then trained for a short number of epochs, after which their performance is compared and the best candidates are retained. Ultimately this yields a fully trained neural network with an optimized topology. To the best of our knowledge, no such method has yet been proposed.

Methods
This work introduces and utilizes three main modules:
• modification module: performs network modifications (''surgeries'') so as to incur minimal changes to input-output behaviour.
• recommendation module: analyses network weights and gives recommendations on which operations are most likely to improve network accuracy.
• ''the Surgeon'': a genetic algorithm that links the above two modules, and gradually evolves a given starting network.

The modification module
We want to be able to carry out a number of different modifications that can restructure network architecture while keeping the input-output behaviour intact, and in particular avoid loss of prediction quality. In cases where this is not possible we aim for minimal impact changes instead. This yields network variations that are structurally different but functionally equivalent.
In mathematical terms, let the output of a dense layer be given by

f(x) = Φ(Ax + b),     (1)

with activation function Φ, weight matrix A and bias vector b, as well as an arbitrary input x to that layer. We perform four different types of modifications, namely adding or removing neurons or whole layers. In particular, we are looking for a modified layer f̃ such that

f̃(x) = f(x) for all x.     (2)

Activation functions used in neural networks are usually non-linear, or piecewise linear at best. Thus changing the activation function will in general not produce equal results for arbitrary input values. Finding modifications that still satisfy Eq. (2) is therefore synonymous with adequately adapting the affected network weight matrices and bias vectors.
Adding layers. Under (at least) piecewise linear activation functions such as ReLU, one can always add neurons to a hidden layer, or even add whole layers, without any change to the overall network behaviour (Chen et al., 2016). Adding a whole layer is done by using an identity matrix as the new weight matrix of the inserted layer. In particular, the added layer will have the same number n of neurons as the following layer. We initialize its weights as an identity matrix I ∈ R^{n×n}. Let Φ denote the activation function of a dense layer and x an arbitrary input; then

Φ(I Φ(x)) = Φ(x)     (3)

if and only if Φ is at least piecewise linear. In particular, for ReLU activation, adding layers without changing the input-output behaviour is possible.
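The identity-layer insertion can be verified numerically; the following is our own minimal numpy sketch (not the authors' code), exploiting the fact that ReLU applied twice equals ReLU applied once:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))   # weights of an existing dense layer
b = rng.normal(size=5)        # its bias vector
x = rng.normal(size=8)        # arbitrary input to the layer

# Original output of the layer.
y = relu(A @ x + b)

# Insert a new layer with identity weights and zero bias after it.
I = np.eye(5)
y_deeper = relu(I @ relu(A @ x + b) + np.zeros(5))

# ReLU(ReLU(z)) == ReLU(z), so the network output is unchanged.
assert np.allclose(y, y_deeper)
```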
Adding neurons. For adding neurons to an existing layer, consider the following example. Let x ∈ R^n be the input to a layer with weights A ∈ R^{m×n}, let B ∈ R^{k×m} be the next layer's weights, and let the layer's bias vector be zero w.l.o.g. Assume the activation function Φ acting between the two layers to be an identity mapping. Then the output y ∈ R^k is given by

y = B Φ(Ax) = BAx.     (4)

In particular, the jth entry of y is

y_j = Σ_{i=1}^{m} b_{ji} Σ_{l=1}^{n} a_{il} x_l,     (5)

where a_{il} is the il element of A, and b_{ji} the ji element of B. We now arbitrarily pick unit m and duplicate its incoming and outgoing weights. Note that we also have to divide the duplicated outgoing weights by 2 in order to keep the total sum constant:

ỹ_j = Σ_{i=1}^{m-1} b_{ji} Σ_{l=1}^{n} a_{il} x_l + (b_{jm}/2) Σ_{l=1}^{n} a_{ml} x_l + (b_{jm}/2) Σ_{l=1}^{n} a_{ml} x_l     (6)
    = y_j.     (7)
This method yields the exact same output y ∈ R^k given input x ∈ R^n, but the weight matrices A and B are now of dimension (m + 1, n) and (k, m + 1) respectively, and the construction can be extended to include a non-zero bias term in a similar fashion.
Recall now that we chose the activation Φ to be the identity mapping. For any other activation function, the equality in line (7) holds true if and only if this activation is at least piecewise linear. Chen et al. (2016) base their Net2WiderNet and Net2DeeperNet transformations on these steps.
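The duplication step can be sketched in a few lines of numpy (our own illustration of the idea, not the paper's implementation); here unit m is duplicated and the two copies of its outgoing weights are halved:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 4, 3, 2
A = rng.normal(size=(m, n))   # first layer weights
B = rng.normal(size=(k, m))   # next layer weights
x = rng.normal(size=n)

y = B @ (A @ x)               # identity activation between the layers

# Duplicate unit m: copy its incoming weights (last row of A), then
# halve both copies of its outgoing weights so the total sum is constant.
A_wide = np.vstack([A, A[-1:]])        # shape (m + 1, n)
B_wide = np.hstack([B, B[:, -1:]])     # shape (k, m + 1)
B_wide[:, m - 1] /= 2.0
B_wide[:, m] /= 2.0

y_wide = B_wide @ (A_wide @ x)
assert np.allclose(y, y_wide)          # output unchanged
```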
Removing neurons. Removing neurons from a layer is rarely possible without changing the input-output behaviour unless some of the units are degenerate to begin with, i.e. the weight matrix is of reduced rank. For a modification with minimal impact on input-output behaviour we need the closest possible projection onto a lower rank subspace. As a measure for closeness we employ the Frobenius norm, defined as follows.
Let ∥ · ∥_F denote the Frobenius norm, with

∥A∥_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |a_{ij}|^2 )^{1/2},

where A ∈ R^{m×n} is an arbitrary, real-valued matrix of rank k ≤ min(m, n).
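The identity between the Frobenius norm and the singular values used below can be checked numerically (our own sketch):

```python
import numpy as np

A = np.random.default_rng(2).normal(size=(6, 4))
sigma = np.linalg.svd(A, compute_uv=False)

fro_direct = np.sqrt((A ** 2).sum())   # definition of the Frobenius norm
fro_svd = np.sqrt((sigma ** 2).sum())  # via the singular values of A

assert np.isclose(fro_direct, fro_svd)
assert np.isclose(fro_direct, np.linalg.norm(A, 'fro'))
```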
The Eckart-Young(-Mirsky) theorem states that the closest projection onto a lower rank subspace can be found by applying Singular Value Decomposition (SVD) to the weight matrix: let A ∈ R^{m×n} be the (weight) matrix of some layer L_j of interest in a neural network; then there exists a representation

A = U Σ V^T,

with orthogonal U ∈ R^{m×m}, V ∈ R^{n×n} and rectangular diagonal Σ ∈ R^{m×n} with k ≤ min(m, n) non-negative real entries σ_i along its diagonal. This representation is called the Singular Value Decomposition of A. The σ_i are called singular values of A, and are usually given in descending order, i.e. σ_1 ≥ σ_2 ≥ · · · ≥ σ_k > 0. It can be shown that ∥A∥_F^2 = Σ_{i=1}^{k} σ_i^2, where each σ_i is the square root of the ith non-zero eigenvalue of A^T A. Note that U and V map between the space in which A operates and the k-dimensional space spanned by Σ. In order to reduce the rank of A to r < k, we set σ_i = 0 for all i > r, and drop the associated columns/rows of U and V, which is feasible since by definition σ_i ≥ σ_{i+1}. We then project to a smaller matrix Ã ∈ R^{m×r} by computing UΣ. In order to properly re-connect the reduced layer L_j to the following layer L_{j+1}, we have to modify L_{j+1}'s weight matrix to accept input of length r. In particular, we need to project back to the original layer size, to ensure that the layer weight shapes match again. We achieve this by multiplying the weight matrix of L_{j+1} from the left with V^T. Projecting onto a lower rank subspace by setting singular values to zero is sometimes called truncated SVD, and is used in network pruning (Denton et al., 2014; Girshick, 2015; Xue et al., 2013). This technique adds an intermediate layer of (potentially much fewer) neurons, which in case of very large weight matrices can drastically reduce the overall connection count.
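A minimal numpy sketch of this reduction (our own reading of the truncated-SVD step; the assignment of the two factors to the layers may differ from the paper's convention, but their product is the same). Here A is constructed with exact rank r, so the surgery is lossless:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k_next, r = 6, 8, 3, 2

# A layer weight matrix of true rank r, so truncation to rank r is exact.
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
B = rng.normal(size=(k_next, m))   # the following layer's weights
x = rng.normal(size=n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r largest singular values and the associated vectors.
W1 = Vt[:r]                        # (r, n): the reduced, r-neuron layer
W2 = U[:, :r] * s[:r]              # (m, r): projection back to size m

# Fold W2 into the following layer so the weight shapes match again.
B_new = B @ W2                     # (k_next, r)

# Ignoring activations, the input-output behaviour is preserved.
assert np.allclose(B @ (A @ x), B_new @ (W1 @ x))
```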
Note that this projection method is not concerned at all with a potential activation function after the network's modified layer. As mentioned previously, changing a layer's activation function is in general not possible without changes to input/output behaviour, since activation functions are usually non-linear. We therefore do not perform any additional modifications to counterbalance a change of activation.
Removing layers. We remove whole layers in a similar fashion. Let A_j, b_j be the weight matrix and bias vector of some dense layer L_j, and A_{j+1}, b_{j+1} those of the subsequent layer L_{j+1}. We remove the layer L_j by simply dropping it, and modify L_{j+1}'s weights by matrix multiplication, yielding

Ã_{j+1} = A_{j+1} A_j,  b̃_{j+1} = A_{j+1} b_j + b_{j+1}.

This again ignores any activation function between the two layers, and will thus cause a change in input-output behaviour.

Recuperation from surgery. As we have seen, network modifications will in general cause a small change in input-output behaviour, typically leading to a loss in prediction accuracy. In order to make up for this, all network modifications are given a small amount of recuperative training (several batches) before any comparison is made. The retraining amount was determined by trials and statistical analysis based on a dataset 1 which we do not otherwise use in our results, to avoid overfitting the search algorithm to a specific dataset.
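The layer-merging step can be sketched as follows (our own numpy illustration); the merge is exact only when the activation between the two layers is linear, as the text notes:

```python
import numpy as np

rng = np.random.default_rng(4)
A1, b1 = rng.normal(size=(5, 7)), rng.normal(size=5)   # layer L_j
A2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)   # layer L_{j+1}
x = rng.normal(size=7)

# Drop L_j and fold its weights and bias into L_{j+1}.
A_new = A2 @ A1
b_new = A2 @ b1 + b2

y_two_layers = A2 @ (A1 @ x + b1) + b2   # with identity activation
y_merged = A_new @ x + b_new
assert np.allclose(y_two_layers, y_merged)
```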

The recommendation module
For the decision when to execute which network modification, we perform a two-step analysis.
Analytical criterion for model selection. As a first order criterion, we compute the amount of information carried by each neuron by looking at the layer's singular values. The number n_r of neurons that may be removed from a layer is given by the count of singular values that are (close to) zero, or are several orders of magnitude smaller than the layer's largest singular value. This number is compared to the layer's total neuron count n_l. Let h be the number of hidden layers in the network, and i the index of the layer in which the modification was performed. Each modification candidate is then given a score between 0 and 1; for example, adding neurons to a layer is scored as 1 − n_r/n_l, while the scores for adding or removing whole layers additionally take the relative layer position i/h into account. In particular, the higher the layer number, the more likely a layer is added or removed. For further refinement in the future, we aim to replace this selection criterion with a more sophisticated formula, which should improve the optimization process beyond the results presented herein.

Statistical criterion for model selection. The second order decision basis is derived from the Bayesian information criterion (BIC), also known as the Schwarz information criterion. It was derived by Schwarz (1978) to address the problem of selecting between (statistical) models of different dimension. It takes into account the number of parameters of the given model, the sample size of the input data, and the a-posteriori model error computed from a likelihood function of the model given its parameters and input data. Since methods from Bayesian statistics are applied, it is assumed that the underlying data are independent and identically distributed from a family of allowed distributions.
The BIC is given by

BIC = k ln(n) − 2 ln(L),     (14)

where k is the number of model parameters, n the sample size, and L the likelihood function.
In our application, the sample size is constant throughout the whole process. L needs to be estimated or calculated a-posteriori after a network operation has been performed. Our modifications are intended to keep the change in input-output behaviour minimal, therefore the difference in L will be very small between any two modifications. Eq. (14) thus becomes

BIC ≈ c_1 k + c_2 + ∆L,     (15)

where c_1 and c_2 are constant and ∆L is the (small) change in error depending on the performed operation. Since the number of parameters k may well exceed 10^6 and c_1 = ln(n) ≫ 1 is dependent on the sample size, we neglect ∆L and the constants, and directly use k as a second order constraint when deciding which modifications to apply.
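To see why the parameter-count term dominates, consider some illustrative numbers (our own, not taken from the paper):

```python
import math

# BIC = k * ln(n) - 2 * ln(L). For fixed sample size n, compare two
# candidates whose likelihoods differ only marginally (minimal-impact
# surgery) but whose parameter counts differ by ~10%.
n = 50_000                     # sample size (constant during a run)
ln_n = math.log(n)             # ~10.8, i.e. c_1 >> 1

k_small, k_large = 1_000_000, 1_100_000
delta_log_L = 5.0              # small a-posteriori likelihood advantage

bic_small = k_small * ln_n
bic_large = k_large * ln_n - 2 * delta_log_L

# The k * ln(n) term dwarfs the likelihood term, which motivates
# ranking candidates directly by k.
assert bic_small < bic_large
assert (k_large - k_small) * ln_n > 2 * delta_log_L
```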
Thus, all potential network modifications are ranked by two parameters. The Surgeon has two modes: it can either pick the top n ranking operations per decision step (with n being a tunable hyperparameter), or select the highest ranking operation of each type. Our experiments are performed with the latter option.
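The first order criterion's count of negligible singular values can be sketched as follows (our own illustration; the relative threshold tol is our own choice, since the text only requires values several orders of magnitude below the layer's largest singular value):

```python
import numpy as np

def removable_neurons(W, tol=1e-3):
    # Count singular values negligible relative to the largest one.
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s < tol * s[0]))

# Build a 6x6 layer weight matrix with singular values (3, 1, 0, 0, 0, 0):
# only two directions carry information, so four neurons are removable.
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
W = (Q[:, :2] * np.array([3.0, 1.0])) @ Q[:, :2].T

assert removable_neurons(W) == 4
```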

''The Surgeon'': A genetic algorithm for neural architecture search
The final module is the Surgeon, the genetic algorithm that searches for an optimized network architecture while training the network weights at the same time. We use the term genetic algorithm to describe a search metaheuristic inspired by biological processes such as evolution, mutation, and selection. In this first paper about our new ansatz, we limit our focus to perhaps the most ubiquitous type of architecture: sequential networks (i.e. without recurrent or skip connections) consisting only of fully connected layers. As Cai et al. (2018) have shown, however, a number of the above described tools can easily be generalized to non-sequential networks, as well as convolutional layers.
The rough idea behind the Surgeon is to alternate training and network optimization phases. Several competing topologies, called branches, may be retained concurrently. During the optimization phase, modification candidates are created for each such branch. From these, the n best performing ones are kept and put through another training step. One such training and optimization cycle is depicted in Fig. 1. The overall structure of the Surgeon can be seen in algorithm 1. First, the provided model is pre-trained for an initial number of epochs, then the list of current branches is initialized with it. The maximum number of branches kept concurrently is a hyperparameter setting, and has a great influence on the total computational cost of the Surgeon.
We then evolve and continuously update the list of concurrent branches until termination criteria, such as the maximal number of epochs (if limited by computational resources) or a minimum accuracy threshold (if the topology optimization is used without a computational resource bottleneck), are met. At each decision point, the recommendation module analyses all networks in the list of current branches, and ranks all potential modifications. From these, in line 15 of algorithm 1, we select the most promising candidates, and perform the selected operations using the modification module.
The generated candidates are re-trained for several batches to make up for lost performance. Our main objective is to maximize the accuracy achieved by any given neural network through continuous network surgery. It is therefore sensible to retain those network candidates in algorithm 1, line 19, that reach the highest validation accuracy score. Note that we score on validation accuracy instead of training accuracy to avoid overfitting.
Focusing on accuracy alone is a greedy approach and carries the risk of getting stuck in local optima. We overcome this by additionally rewarding modification candidates that show a greater accuracy gain. However, we need to make a distinction when rewarding this gain. We share the rationale of Cai et al. (2018) that an increase of 1% needs to be weighted higher if it happens from 90 to 91% accuracy rather than from 70 to 71%. At the same time, if an operation keeps the accuracy constant at e.g. 95%, we can assume that a local optimum may have been reached. Therefore an operation that leads to an accuracy increase from 90% to 94%, showing potential for further improvement, should be rated higher even though the reached accuracy is lower. Lastly, we do not wish to neglect network size. A 1% accuracy increase might not be favourable at all if the required increase in network size is ''too big''.
Note that it is hard to define when a network has indeed become ''too big'', a fact that is emphasized by the large number of publications dealing with pruning techniques (Blalock et al., 2020).
We need to find a way to balance these three components (accuracy, accuracy gain, network size), and to create a composite score by which we can rank the performance of current branches. As an additional restriction, the composite score should not depend on any global candidate statistics, nor do we want to set a global limit for network size. We therefore cannot meaningfully use the total number of parameters of a modification candidate, since we lack an overall basis for comparison. Instead, for each candidate, we store the network size as a fraction of its parent's network size. Thus, a network size fraction greater than 1 indicates growth, a fraction smaller than 1 indicates shrinking, and the identity operation yields a size fraction of exactly 1.
We want the scoring function as well as its first order derivative to be strictly increasing with accuracy gain, but decreasing with size fraction. This behaviour needs to hold even when the accuracy gain becomes 0 or the network size fraction becomes 1. The scoring function we use (Eq. (16)) takes as inputs the current accuracy a, the current accuracy gain ∆a, and the current network size fraction ∆f, and is adapted from Cai et al. (2018). A scoring example can be seen in Table 1.

Table 1
Scoring example. A denotes accuracy, AG accuracy gain, PF parameter fraction and S the calculated score. The layer number is given in reference to the hidden layers. Winning operations, which are kept as new concurrent branches, are indicated with an asterisk.
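The required monotonicity properties can be illustrated with a hypothetical scoring function of our own (it is NOT the paper's Eq. (16)): increasing in accuracy and in the Cai-style rescaled gain ∆a/(1 − a), and decreasing in the size fraction ∆f:

```python
def score(a, da, df):
    """Hypothetical composite score: a = accuracy, da = accuracy gain,
    df = network size fraction (parent = 1). Illustrative only."""
    return (a + da / (1.0 - a + 1e-8)) / df

# The same absolute gain counts more at higher accuracy ...
assert score(0.90, 0.01, 1.0) > score(0.70, 0.01, 1.0)
# ... growth is penalized and shrinking rewarded ...
assert score(0.90, 0.0, 0.9) > score(0.90, 0.0, 1.0) > score(0.90, 0.0, 1.1)
# ... and a large gain can outrank a higher but stagnant accuracy.
assert score(0.90, 0.04, 1.0) > score(0.95, 0.0, 1.0)
```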
As an option to alleviate greediness, the Surgeon can keep a one cycle memory. In this case, the newly selected concurrent branches are compared to previously best scoring ones, and retained only if their achieved accuracy is at least as good as the previous value. Should this not be the case, they are discarded and we backtrack one step. New potential candidates are provided by the recommendation module, and we train and compare these.
Finally the best scoring branch is returned as the optimized network.
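The overall loop can be summarized in a deliberately simplified toy sketch (our own; train, propose_candidates, and score are stand-ins for actual training, the recommendation/modification modules, and the composite scoring function):

```python
import random

random.seed(0)

def train(net, epochs):                # stand-in: training nudges accuracy up
    net = dict(net)
    net["acc"] = min(1.0, net["acc"] + 0.02 * epochs * random.random())
    return net

def propose_candidates(net):           # stand-in for the recommendation module
    return [dict(net, size=net["size"] * f) for f in (0.9, 1.0, 1.1)]

def score(net):                        # stand-in: favour accuracy, punish growth
    return net["acc"] / net["size"]

net = {"acc": 0.1, "size": 1.0}
branches = [train(net, epochs=10)]     # pre-train the starting model

for decision_point in range(9):        # e.g. decision points every 10 epochs
    candidates = [train(c, epochs=1)   # short recuperative training
                  for b in branches for c in propose_candidates(b)]
    candidates.sort(key=score, reverse=True)
    branches = candidates[:2]          # keep at most two concurrent branches
    branches = [train(b, epochs=9) for b in branches]

best = max(branches, key=score)        # return the best scoring branch
assert 0.0 < best["acc"] <= 1.0
```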

Table 2
Starting topologies for the Surgeon. Each topology additionally has a reshape layer as its input layer, as well as a dense layer without activation function as its output layer. The neuron count in the input and output layers is dependent on the dataset.

Data
We evaluate the performance of the Surgeon on several standard benchmark datasets (SVHN, CIFAR-10, CIFAR-100, EuroSAT, EMNIST, Fashion-MNIST), which are described below. The SVHN dataset was downloaded manually from http://ufldl.stanford.edu/housenumbers/. The CIFAR-10 dataset was fetched from the keras.datasets catalogue. All others are fetched from the tensorflow-datasets catalogue, batched, and shuffled. As an additional preprocessing step, we normalize the data to be within the range [0, 1].
We pick three starting topologies that are described in Table 2, and perform several runs of the Surgeon on each one. Note that Table 2 omits the input layer, which is always a reshape layer that ensures the input data is formatted as a 1-dimensional vector instead of a multi-dimensional array, as well as the final dense layer. The Surgeon never adapts these two layers, as they are inherently determined by the dataset through the shape of the training data and number of output classes.
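The normalization and flattening described above amount to something like the following sketch (our own illustration using random stand-in image data):

```python
import numpy as np

# Stand-in batch of four 32x32 RGB images with 8-bit pixel values.
images = np.random.default_rng(6).integers(0, 256, size=(4, 32, 32, 3))

x = images.astype(np.float32) / 255.0   # normalize to the range [0, 1]
x = x.reshape(len(x), -1)               # flatten, as the reshape layer does

assert x.shape == (4, 32 * 32 * 3)
assert x.min() >= 0.0 and x.max() <= 1.0
```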
The small starting topology is consciously chosen to be insufficient for most of the regarded datasets, to mimic a case where a model is unknowingly trained with an inadequate network architecture. We average our results over several runs and several random seeds, and compare to results achieved by simply training the starting topology for the same number of epochs. Detailed machine properties and hyperparameter settings are listed in Appendix.
The SVHN dataset. The (Google) Street View House Number (SVHN) dataset was published by Netzer et al. (2011). It contains 73,257 training, as well as 26,032 validation colour images. We use the cropped version, where images are of size 32 × 32 pixels, and fall into 10 classes according to the numbers 0-9. Additionally, 531,131 extra images of lower difficulty are available but not currently used.
The CIFAR-10 dataset. The CIFAR-10 dataset contains 60,000 32 × 32 pixel colour images that are evenly divided into 10 classes. 10,000 of the images are set aside for validation purposes.
The CIFAR-100 dataset. The CIFAR-100 dataset is equivalent to the CIFAR-10 dataset, except that samples are evenly divided into 100 classes.
The EuroSAT dataset. The EuroSAT dataset was published by Helber, Bhatti, Dethlefs, and Borth (2018, 2019). It is based on Sentinel-2 satellite data and consists of 27,000 labelled and geo-referenced colour images of size 64 × 64 pixels that belong to 10 classes. We make use of the RGB version that contains only the optical red, green and blue frequency bands.
The EMNIST dataset. The EMNIST dataset was published by Cohen, Afshar, Tapson, and van Schaik (2017). It contains greyscale handwritten digits of size 28 × 28 pixels that are derived from the NIST special database. They depict the numbers 0-9 and thus fall into 10 classes. 697,932 training as well as 116,323 validation samples are provided.

The Fashion-MNIST dataset. The Fashion-MNIST dataset was published by Xiao, Rasul, and Vollgraf (2017). It contains greyscale fashion images taken from Zalando's 2 article catalogue that fall into 10 categories. The dataset consists of 60,000 training as well as 10,000 validation samples of size 28 × 28 pixels.

Results
We apply the Surgeon to each combination of the above datasets and starting topologies (cf. Table 2). We do so a total of 15 times, re-initializing the numpy and tensorflow modules with a new random seed after every 3 runs, and report average statistics (cf. Table 3). As a baseline, we train the starting topology for the same number of epochs with each random seed. Note that throughout the entire section, unless otherwise stated, we report achieved validation accuracies rather than training accuracies.
To avoid overfitting on any specific dataset, we fix some hyperparameters for training (such as batch size, optimizer, and learning rate) before starting any runs with the Surgeon (cf. Appendix). No additional fine tuning is performed on any model.

Overall Surgeon performance. Table 3 and Fig. 2 both show an overview of average Surgeon performance for all starting topologies and datasets. We can see that in all cases, the Surgeon reaches or outperforms the baseline result with regard to validation accuracy. We are able to observe two types of behaviour. In cases where the baseline accuracy is already high to begin with, the relative accuracy increase achieved by the Surgeon is comparatively low. However, the Surgeon is often able to decrease the required number of network parameters without loss in accuracy. On the other hand, we observe quite significant improvements in accuracy in cases where the starting topology is suboptimal. In fact, as can be seen in Fig. 2, there are cases where the baseline is not learning at all, whereas the Surgeon is able to overcome the initial local minimum.
We will subsequently highlight a few interesting cases.

Topology rescue. We recall that the small starting topology, consisting of only one hidden layer with 10 neurons, was purposefully chosen to be insufficient for convergence. In fact, as we can see in Fig. 2, the small topology baseline learns for neither SVHN, CIFAR-10, nor EuroSAT. The Surgeon on average manages to improve the topology and learn at least a little, even reaching a validation accuracy above 50% in case of the EuroSAT dataset. In case of the SVHN dataset, the global average over all runs contains several instances where even with the aid of the Surgeon, the model is not able to learn at all and stays stuck in the initial state, as well as a number of runs where the Surgeon very quickly leaves this local sink and then in fact provides a model that learns very well.
In Fig. 3 (left), we can see a single run of the Surgeon using the EuroSAT dataset and small starting topology, where an early Add Layer operation allows the network to train properly. The total parameter increase in this case is less than 1%, with the Surgeon preferring to add several small layers rather than widening existing layers. Note that due to the shape of our training data and choice of starting topology, a large portion of the network's parameters is required to connect the input layer to the first hidden layer. Adding a whole layer after the first 10-unit layer causes only a small overall increase, whereas widening the first hidden layer might cause a greater increment in parameter count. Fig. 3 (right) shows the evolution of the topology over the course of the run.
Accuracy increase. The Surgeon is able to detect and improve sub-optimally sized network architectures. This works in cases such as above (cf. Fig. 3 left), where it is very obvious that little to no learning happens, but also in less apparent ones, where the base topology does learn to a certain extent. In Fig. 4 (left) we can see a single run of the Surgeon trained on CIFAR-10 where both the large starting topology as well as the Surgeon start learning in a similar fashion and soon reach a plateau. After a while however, the Surgeon is able to perform an Add Layer operation that allows the topology to overcome the local optimum which had been reached.
Parameter reduction. As mentioned previously, using the large topology as a starting point for the Surgeon does not improve achieved accuracy by any large margin (cf. Table 3). In Fig. 4 (right) we can see a single run of the Surgeon on the Fashion-MNIST dataset using the large starting topology. The baseline in this case is already performing quite well given that we are regarding a very basic network architecture. In this case we can observe pruning by the Surgeon, such that an early Remove Layer operation allows parameter reduction of almost 15%.
For the large starting topology, the composite scoring function given in Eq. (16) prevents any big jumps in accuracy, since they would most likely come at the cost of a (potentially rather large) increase in network size. Instead, the scoring function favours network operations that keep the accuracy more or less constant while reducing network size.
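This trade-off can be illustrated with a toy composite score. The actual form of Eq. (16) differs; `size_weight` and the specific numbers below are hypothetical and chosen only to show the qualitative behaviour:

```python
def composite_score(accuracy, n_params, baseline_params, size_weight=0.5):
    """Toy composite score: reward accuracy, penalize growth in parameter
    count relative to the baseline topology. Illustrative only -- not the
    scoring function of Eq. (16)."""
    size_penalty = size_weight * (n_params / baseline_params - 1.0)
    return accuracy - size_penalty

# A small accuracy gain bought with a large parameter increase scores
# worse than keeping accuracy constant while shrinking the network.
grow = composite_score(0.82, 1_500_000, 1_000_000)   # +2% acc, +50% params
shrink = composite_score(0.80, 850_000, 1_000_000)   # same acc, -15% params
```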
Computational costs. In our trials, the Surgeon is configured to produce a resulting topology that has been trained for exactly 100 epochs, with decision points every 10 epochs. We allow for a maximum of two concurrent branches, as well as re-drawing from potential branches up to two times, cf. Appendix. On average, 0.56 re-draws are necessary per decision step, resulting in an average total training amount of around 290 epochs per run of the Surgeon. The additional overhead produced by the analysis and modification modules mostly consists of matrix calculations and manipulations, which are highly optimized in standard Python libraries such as NumPy, and thus causes only negligible additional computation time. In particular, access to large GPU clusters is not a requirement for running the Surgeon. For verification purposes, and in order to ensure scalability of our code, we ran our experiments both on a standard office computer without any GPU acceleration and on a GPU cluster (cf. Appendix for detailed hardware specifications).
We can see from Fig. 2 that 100 epochs in most cases is longer than necessary for the Surgeon to reach convergence. With appropriate early stopping techniques and/or a more dynamic training schedule, the total training amount for the Surgeon could be reduced considerably, and the resulting model (including its trained weights) used either as is or further fine-tuned (cf. Fig. 5).
Fig. 5. Example of a topology evolution performed by the Surgeon on the SVHN dataset and medium starting topology. Early on, reducing the network size helps to increase the accuracy level (at epochs 10 and 30). Adding a whole layer in epoch 50 is still able to achieve some increase in accuracy. From epoch 60 on, the network oscillates between very similar states. Ideally, this behaviour can be used as an indicator for early stopping in future versions of the Surgeon.
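The oscillation between very similar states could be detected with a simple heuristic over a history of topology snapshots. The sketch below is our own illustrative assumption, not part of the published algorithm; topologies are represented as tuples of hidden-layer widths:

```python
def oscillating(topology_history, window=4):
    """Return True if the last `window` topology snapshots only cycle
    between at most two near-identical states -- a possible early
    stopping signal. Heuristic sketch; thresholds are arbitrary."""
    if len(topology_history) < window:
        return False
    recent = topology_history[-window:]
    return len(set(recent)) <= 2  # at most two distinct states repeating

# The run alternates between two states near the end -> candidate for stopping.
history = [(64,), (32,), (32, 16), (32, 16), (32, 10), (32, 16), (32, 10)]
```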

Discussion and outlook
In this paper, we presented the Surgeon, a hybrid ANN/evolutionary-algorithm optimization designed for neural architecture search. The algorithm utilizes a modification module that is able to perform minimally invasive network surgeries, in which the topology of the network is modified with as little change to the overall input–output behaviour as possible. Additionally, it uses a recommendation module that analyses a given neural network and indicates which structural changes may be most beneficial. Those changes can be either increases of network width or depth in case of high information density, or respective decreases in case the network size can be reduced without fear of too high an accuracy loss. Neither module relies on any black-box behaviour; both are based on the mathematical tools we presented in Section 3.
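One way such a minimally invasive Add Layer surgery can be realized is to initialize the new dense layer as an identity map, in the spirit of function-preserving (Net2Net-style) transforms. This is a sketch of the general idea; the Surgeon's actual initialization and re-training scheme may differ:

```python
import numpy as np

def add_identity_layer(width):
    """New dense layer initialized as the identity: W = I, b = 0.
    With a ReLU activation and non-negative inputs (e.g. the output of a
    preceding ReLU layer), inserting it leaves the network's
    input-output behaviour unchanged."""
    W = np.eye(width)
    b = np.zeros(width)
    return W, b

relu = lambda x: np.maximum(x, 0.0)

rng = np.random.default_rng(0)
h = relu(rng.standard_normal((5, 10)))  # activations of a 10-unit ReLU layer
W, b = add_identity_layer(10)
h_new = relu(h @ W + b)                 # identical to h: surgery is invisible
```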
We put the Surgeon to the test on several combinations of starting topologies and datasets. We saw that the network generated by the Surgeon is able to outperform the baseline in the case of suboptimal topologies, or to reach comparable accuracies while pruning the underlying network structure to less resource-intensive topologies.
A very important feature of the Surgeon is that the algorithm itself is computationally cheap, with little overhead compared to simple baseline training. Via hyperparameter settings, it is possible to make use of larger computational resources if required and available.

Limitations of this work and future goals
For this proof-of-concept work, we limited ourselves to the most basic and ubiquitous network structure: dense layers linked strictly in sequence. While such neural networks are the best studied and easiest to understand and manipulate from a mathematical perspective, there are some drawbacks.
Fully connected neural networks require a fixed input shape, so they are not suited for a variety of classical deep learning tasks such as natural language processing (NLP) or image detection. Large tabular datasets often contain ordinal or categorical variables, which are not well suited for deep learning tasks, as gradient calculation becomes somewhat ill-defined. In particular, we are mostly limited to image classification benchmark sets. For these, the state of the art is driven by large, computationally expensive algorithms that allow more complex topological elements such as convolutional or recurrent layers, skip connections, etc.; see for example Liu et al. (2019), Xu et al. (2020) and Zoph and Le (2017). In general, few benchmark results exist for fully connected neural networks trained on these datasets.
For future work we wish to extend and improve both the Surgeon and its underlying modules. Currently, the Surgeon follows a static routine, with hyperparameters such as epoch steps, number of selected candidates, or number of concurrent branches staying constant throughout the whole process. This could be changed to a more dynamic approach, which might integrate further features such as adaptive learning rates, a more sophisticated memory, or early stopping mechanisms to improve performance gains compared to baseline training even further. The recommendation module can be improved by finding a closer approximation for the BIC. For larger starting topologies, the scoring function given in Eq. (16) seems to be chosen too restrictively; the balancing between its components could be adapted to allow for more drastic additions even when the network is already quite large to begin with.
Lastly, we want to expand the modification module to allow more complex topologies as well, and to include an option to cross architecture types. This would allow us to include e.g. convolutional or recurrent elements, change a network from one type to another, or even freely mix and match as required.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Appendix. Hyperparameter settings and machine specifications
Simulations for the SVHN and CIFAR-10 datasets were performed on a Windows 10 machine with an Intel(R) Core(TM) i7-9700K CPU @ 3.6 GHz and 64.0 GB RAM. The code was implemented in Python 3.7.7 using tensorflow 2.1.0.
Simulations for the CIFAR-100, EuroSAT, EMNIST, and Fashion-MNIST datasets were performed on a virtual machine running on a 24-core 2.1 GHz Intel Xeon Scalable Platinum 8160 processor, equipped with a Tesla V100 GPU card with 16 GB memory. The code for these datasets was ported to Python 3.8.2 using tensorflow 2.3.0 and tensorflow-datasets 4.1.0.
We chose the following hyperparameters: The modification module.