Trainless Model Performance Estimation for Neural Architecture Search

Neural architecture search has become an indispensable part of the deep learning field. Modern methods make it possible to find one of the best performing architectures, or to build one from scratch, but they typically make decisions based on trained accuracy information. In the present article we explore instead how the architectural component of a neural network affects its prediction power. We focus on relationships between the trained accuracy of an architecture and its accuracy prior to training, by considering statistics over multiple initialisations. We observe that minimising the coefficient of variation of the untrained accuracy, $CV_{U}$, consistently leads to better performing architectures. We test $CV_{U}$ as a neural architecture search scoring metric using the NAS-Bench-201 database of trained neural architectures. The architectures with the lowest $CV_{U}$ value have on average an accuracy of $91.90 \pm 2.27$, $64.08 \pm 5.63$ and $38.76 \pm 6.62$ for CIFAR-10, CIFAR-100 and a downscaled version of ImageNet, respectively. Since these values are statistically above the random baseline, we conclude that a good architecture should be stable against weight initialisation. Processing $100$ architectures, on a batch of $256$ images, with $100$ initialisations, takes about $190$ s for CIFAR-10 and CIFAR-100 and $133.9$ s for ImageNet16-120.


Introduction
Since the beginning of the boom in the field of artificial intelligence, there has been a continuous increase in data complexity and quantity and in neural architecture designs, as well as an ever-growing choice of powerful hardware. These factors make the process of building a neural architecture complex. Given an extremely large number of parameters to be tuned, it can be extremely slow when decisions are made on a trial and error basis. Neural architecture search (NAS) is a way to automatise and accelerate this decision making, shifting the task from humans to machines. It comes as no surprise that NAS has recently become one of the most popular topics in the deep learning community.
The first attempts to find the most suitable network structure were made through evolutionary algorithms [1,2,3,4]. There, several architectures are mutated in various ways (e.g. adding or removing a layer, changing the activation function, etc.), and the resulting offspring are evaluated through training. The best performing offspring are added to the population for the next step, and the procedure is repeated for a given number of steps. This method has been used since the 1990s [5] and still shows some of the best performance to date [6].
Similarly, Bayesian optimisation [7] is used to predict the best performing architecture out of many by training a subset of architectures [8]. This method has produced several state-of-the-art results between 2013 and 2020 [9,10,11,12].
In 2016 Zoph et al. [13] proposed to use reinforcement learning to build neural architectures from scratch. There, a so-called controller neural network is trained to build a child network, the network to be used for the final training and prediction. The original method demands a tremendous amount of child-model training and is extremely lengthy. Several related works show significant acceleration of the process by reducing the search space [14] or introducing weight sharing [15]. An extensive overview of NAS methods has recently been given by Elsken et al. [16].
The common point of all the above-mentioned NAS algorithms is that at some point they all require model training. Not only does that mean longer search times, but also higher uncertainty, since model training brings extra parameters to be tuned (e.g., batch size or learning rate).
As a step towards trainless NAS, in 2018 Istrate et al. [17] introduced a small LSTM-based model that predicts an architecture's performance without training it on the data of interest. The model predicts an architecture's potential for a given data complexity, with the data taken from a so-called lifelong database of experiments. An immediate restriction of this method is that some data of a similar complexity must already exist within the database, and the available networks are limited to already existing ones (focused on image classification). Moreover, with time the overall procedure might lead to a bias: an architecture predicted most often in the beginning will have an ever greater chance of being output in the future, thus "locking" it at the top position.
A similar approach is proposed by Deng et al. [18]. They encode the layers composing the network into vectors, and bring them together with a predictive LSTM layer to build a numerical representation of the network. A multilayer perceptron model is then trained to predict the architecture with the highest prediction accuracy. Therefore, to use this method one first needs to train a set of architectures to acquire their trained accuracies, and then to train the predictive model on top. Since the final decision is made by a neural model, this method does not provide a reason why a given architecture was chosen.
The first work that investigates fundamental architectural properties of neural networks in order to attain fully trainless NAS was proposed by Mellor et al. [19] in 2020. The authors assess a neural architecture's potential by passing a single minibatch of the data through the network forwards and backwards one single time. Based on the results of the backpropagation, they measure the correlation between the calculated gradients associated with the input layer. Using the NAS-Bench-201 benchmark database [20], the authors show that their metric is able to distinguish one of the best neural architectures among many with consistent success. To the best of our knowledge, this is the only approach that aims to explain a neural network's performance based on its structure.
On another front, there are a few papers indicating that the best trained neural architecture often shows better untrained accuracy. For example, the work of the UBER team [21] mentions that the best final architecture shows nearly 40% accuracy on the MNIST dataset [22] at initialisation. David Ha and Adam Gaier [23] presented a NAS algorithm which builds an architecture based on an untrained score. Their score takes into account both the number of parameters contained within a model, which they seek to minimise, and the mean accuracy, which is maximised. The mean accuracy is computed over several initialisations of the child model using a set of constant weights (a single value for all the weights). They report that the resulting model achieves 82.0% ± 18.7% on MNIST data with random weights at initialisation, and over 90% when the weights are fixed to the best performing constant ones.
These findings imply that neural networks might have an intrinsic property which defines their prediction performance prior to training. Such a property should not depend on the values of the trainable parameters (weights), but only on the network's topology. In order to cancel out the influence of the weights and to bring out the architectural component, we perform multiple random weight initialisations to assess averaged network performance. We compute several untrained statistics and explore their relationships with the trained accuracy. Based on the results of these tests, we deduce a trainless NAS scoring metric.
Our work can be divided into two parts. First, we conduct an extensive MNIST study to explore dependencies between various untrained statistics and the trained accuracy. For this, we train a range of fully-connected neural networks on reduced MNIST data, with multiple seeds and learning rates. Then, the most promising statistical property, the coefficient of variation CV, is tested on larger datasets and more complex neural geometries, to confirm its generality as a scoring metric for NAS.
The paper is structured as follows: Section 2 details the search spaces, datasets and training schemes used for the scoring metric search (2.1) and application (2.2). We present and discuss the results in Section 3. Subsection 3.1 presents the selected scoring metric, while in Subsection 3.2 we provide the results of the experiments with CIFAR-10, CIFAR-100 [24] and ImageNet16-120 [25]. Conclusions and future improvements are proposed in Section 4.

First, we explore correlations between some of the untrained performance statistics and the resulting trained accuracy evaluated on the test set. For this purpose we use a reduced version of the MNIST [22] dataset, containing images of handwritten digits from 0 to 9. We reduce the size of the training set, leaving 20 data points per class (200 data points in total). This is done to accelerate the training process and to train more models for better statistics. Besides, the reduced training set makes the prediction task harder, which brings out the differences between architectures more clearly. Note that both the validation and test sets are entirely preserved, containing 5000 data points each. No data augmentation is applied.
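The class-balanced subsampling described above (20 training points per class) can be sketched as follows; this is a minimal illustration with synthetic labels standing in for MNIST, and `balanced_subsample` is a hypothetical helper, not code from the paper:

```python
import random
from collections import defaultdict

def balanced_subsample(labels, per_class=20, seed=0):
    """Pick `per_class` example indices for each class label (class-balanced subset)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    chosen = []
    for y in sorted(by_class):
        chosen.extend(rng.sample(by_class[y], per_class))
    return sorted(chosen)

# Synthetic stand-in for MNIST labels: 100 examples of each of the 10 digits.
labels = [d for d in range(10) for _ in range(100)]
subset = balanced_subsample(labels, per_class=20)
print(len(subset))  # 200 indices in total, 20 per digit
```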

Search space
In order to reduce the uncertainty brought by complex neural structures (effects of initialisation, activation, etc.), the search region is limited to fully connected neural networks consisting of 2 hidden layers. The number of units in each hidden layer is set to one of the 12 values in [8, 16, 24, 32, 56, 64, 96, 128, 256, 512, 1024, 2048], making a total of 144 possible architectures.
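The search space above is small enough to enumerate directly. A minimal sketch (encoding each architecture as a pair of layer widths is our illustration, not the paper's code):

```python
from itertools import product

# The 12 allowed hidden-layer widths from the paper's search space.
WIDTHS = [8, 16, 24, 32, 56, 64, 96, 128, 256, 512, 1024, 2048]

# Every architecture is a pair (units in hidden layer 1, units in hidden layer 2).
search_space = list(product(WIDTHS, repeat=2))
print(len(search_space))  # 12 * 12 = 144 architectures
```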

Training scheme
Every neural network is initialised and trained with 100 different seeds between 0 and 99, and 6 learning rates [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03] (600 trainings per architecture, 86400 trainings overall). The batch size N_BS is fixed to 50, which we found to give the best results across a wide range of architectures within the search space. The models are built with Keras [26] and Tensorflow [27] and trained for 200 epochs using 3 NVIDIA Titan V GPUs. Weights are initialised using the He uniform initialiser [28], together with the ReLU activation function [29] for hidden layers and the Adam optimiser [30] with default decay rates (0.9 and 0.99 for the first and second moments, respectively).
The final weights are based on the epoch with the best validation accuracy after a burn-in period of 50 epochs. Ignoring the first quarter of the training process is based on experience, since the validation loss of small noisy data tends to demonstrate random behaviour in the beginning of the training, leading to faulty results.
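The best-epoch selection with a burn-in period can be sketched as follows (`best_epoch` is a hypothetical helper and the toy accuracy history is invented for illustration):

```python
def best_epoch(val_accuracies, burn_in=50):
    """Index of the best validation accuracy, ignoring the first `burn_in` epochs."""
    tail = val_accuracies[burn_in:]
    return burn_in + max(range(len(tail)), key=tail.__getitem__)

# Toy 200-epoch history: a spurious early peak at epoch 3 is skipped by the burn-in.
history = [0.1] * 3 + [0.95] + [0.2] * 96 + [0.6] * 50 + [0.8] + [0.5] * 49
print(best_epoch(history))  # 150, not the noisy epoch 3
```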
The pseudocode for the MNIST [22] training process is given in Algorithm 1.
Once the training is complete, only the learning rate showing the highest average trained accuracy µ_T is selected for each architecture. This is done to ensure that neural architectures are compared in a fair way, each showing its best potential.

Algorithm 1 MNIST training with multiple seeds and learning rates
for arch in the search space do
    for lr in learning rates do
        for seed in range(100) do
            Initialise arch with seed and compute its untrained accuracy U_i
            Train arch with learning rate lr and record its trained accuracy T_i
        end for
        Compute [µ_T, σ_T, µ_U, σ_U] over seeds
    end for
    Select the best performing learning rate based on max(µ_T): one set of [µ_T, σ_T, µ_U, σ_U] per architecture
end for

Table 1: A summary over the datasets used in this paper: number of classes, image resolution and splitting schemes (in thousands) for reduced MNIST [22], CIFAR-10, CIFAR-100 [24] and ImageNet16-120 [25].

Search space

To test more complex geometries on challenging datasets, we use a modified version of the code published by Mellor et al. [19] together with their paper. To check the validity of their NAS search metric, the authors use the NAS-Bench-201 search space [20]. It is a set of architectures with a fixed skeleton, consisting of a convolution layer and three stacks of cells connected by residual blocks. Each cell is a densely-connected directed acyclic graph with 4 nodes, 5 possible operations and no limit on the number of edges, providing a total of 15,625 possible architectures.

Datasets
Each of the architectures from NAS-Bench-201 [20] is trained on three major datasets: CIFAR-10, CIFAR-100 [24] and ImageNet [25]. Since the original CIFAR datasets do not contain a validation set, the NAS-Bench-201 authors created one by splitting the original data. In the case of CIFAR-10, the training set is split into halves to form the validation set, leaving the test set unchanged; for CIFAR-100, the test set is split into halves to form the validation set and the new test set. For the sake of computational tractability, a simplified version of ImageNet is used [25]. All images are down-scaled to 16x16 pixels, and 120 classes are kept, forming the new ImageNet16-120 dataset. Data augmentation is used for all datasets; the augmentation schemes differ slightly between CIFAR [24] and ImageNet [25] due to the difference in input image sizes.
An overview on all the data used in the present work is given in Table 1.

Training
The training is done using up to 3 different seeds, with the same fixed set of hyperparameters for each dataset. The authors use stochastic gradient descent with Nesterov momentum, a batch size N_BS = 256, a learning rate annealed from 0.1 to 0 following a cosine schedule, and a weight decay of 5 × 10^−4. Architectures are trained for 200 epochs.
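The cosine-annealed learning rate schedule described above can be sketched as follows (a standard cosine annealing formula; the function name and the exact endpoint handling are our assumptions, not taken from the NAS-Bench-201 code):

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_max=0.1, lr_min=0.0):
    """Cosine-annealed learning rate, decaying smoothly from lr_max to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0))    # 0.1 at the start of training
print(cosine_lr(200))  # ~0.0 at the end
```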

Experimental scheme
The goal of this part of the study is to determine how efficiently a given scoring metric selects a good architecture among many random ones. In order to obtain statistically significant information, the selection process is run N_runs = 500 times, each time choosing N_a architectures at random (among the 15,625 available). Each architecture is initialised N_init times, in order to assess the mean µ_U and standard deviation σ_U of the untrained performance. The batch of data used for the accuracy computation is fixed for every individual run, so that all architectures are compared fairly and there is no uncertainty coming from the data choice. The pseudocode for this part of the study is given in Algorithm 2. For this, we use a modified version of the code provided by Mellor et al. together with their paper [19].

Algorithm 2 CV_U tests on NAS-Bench-201
for run in range(N_runs) do
    Randomly select N_BS images from the training dataset
    Randomly select N_a architectures arches from the whole space
    for arch in arches do
        for seed in range(N_init) do
            Initialise arch with seed
            Forward propagate the selected N_BS images
            Compute untrained accuracy U_i
        end for
        Compute the mean µ_U and standard deviation σ_U of the untrained accuracies over initialisations
    end for
    Select the architecture with the minimum score value (CV_U > 0)
    Retrieve the trained accuracy T for the selected architecture from the database
end for
Average the trained accuracies of the selected architectures over N_runs

Filtering out scores equal to zero is necessary for random architectures containing no meaningful layers (for example, architectures consisting of skip-connection layers only). Such architectures naturally show random accuracy with no deviation (σ_U = 0).
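The per-run selection step of Algorithm 2 can be sketched in Python as follows. Here `untrained_accuracy` is a deterministic random stub standing in for "initialise the network and forward-propagate a minibatch"; a real implementation would build the NAS-Bench-201 architecture and measure its minibatch accuracy:

```python
import random
import statistics

def untrained_accuracy(arch, seed):
    """Stub for 'initialise with `seed` and forward-propagate a minibatch'.
    A real implementation would build the network and measure minibatch accuracy."""
    rng = random.Random(arch * 1000 + seed)
    return rng.uniform(0.05, 0.25)

def select_architecture(arches, n_init=100):
    """Pick the architecture with the smallest CV_U = sigma_U / mu_U (zeros filtered out)."""
    best_arch, best_cv = None, float("inf")
    for arch in arches:
        accs = [untrained_accuracy(arch, seed) for seed in range(n_init)]
        cv = statistics.stdev(accs) / statistics.mean(accs)
        if 0 < cv < best_cv:  # CV_U == 0 means no meaningful layers
            best_arch, best_cv = arch, cv
    return best_arch, best_cv

arch, cv = select_architecture(arches=range(100), n_init=100)
print(arch, cv)
```

In the full experiment this selection would be repeated N_runs times with fresh random subsets, and the trained accuracy of each selected architecture looked up in the NAS-Bench-201 database.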

Scoring metric search with MNIST
The aim of the experiments related to MNIST [22] is to explore dependencies between various untrained statistics and the trained accuracy. The existing machine learning literature suggests that the best trained architecture may also show high untrained performance [23,21]. We thus expect to see some tendency between mean accuracies prior to and after the training. We denote these accuracies as µ_U and µ_T, respectively. Against our expectations, there is no clear correlation between these two metrics, as shown in Figure 1a. Instead, surprisingly, the mean trained accuracy µ_T seems to be related to the untrained standard deviation σ_U: even though there is no linear correlation, the lowest σ_U values belong to architectures from the top performance range (Figure 1b). We have also observed that lower means µ_U correspond to lower standard deviations σ_U (Figure 2). Indeed, lower accuracy values lead to proportionally lower mean and standard deviation. Therefore, minimising the standard deviation alone may bias towards networks that show overall low untrained accuracies U_i. To compensate for this effect, we normalise the standard deviation σ_U by the mean µ_U: CV_U = σ_U / µ_U. The resulting parameter CV_U is known in statistics as the coefficient of variation, or relative standard deviation. When plotting the coefficient of variation CV_U against the trained accuracy µ_T in Figure 3, the tendency becomes even clearer: selecting the architectures with low CV_U leads to high trained accuracy µ_T.
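The CV_U computation can be sketched in a few lines (the example accuracy lists are invented for illustration):

```python
import statistics

def coefficient_of_variation(accuracies):
    """CV_U = sigma_U / mu_U over untrained accuracies from many initialisations."""
    return statistics.stdev(accuracies) / statistics.mean(accuracies)

stable   = [0.38, 0.40, 0.39, 0.41, 0.40]  # varies little across initialisations
unstable = [0.10, 0.55, 0.22, 0.48, 0.15]  # strongly initialisation-dependent
print(coefficient_of_variation(stable) < coefficient_of_variation(unstable))  # True
```

Dividing by the mean is what keeps the metric from favouring architectures whose accuracies are simply uniformly low.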
When choosing a NAS scoring metric, one has to consider how it correlates with the number of parameters contained within the network. It has been shown earlier that bigger does not necessarily mean better [31,32]. Even though there is a higher chance for a bigger network to contain a subnetwork capable of successfully fitting the data [33], there is also an increasing risk of overfitting and an increasing training time. We can confirm the effect of performance saturation with our toy MNIST model, both for the total number of parameters and for the parameters in a single layer, as demonstrated in Figure 4. Therefore, in order to find an optimal architecture regardless of the number of parameters, one should use a scoring metric uncorrelated with them. Figure 5 shows that there is no significant correlation between CV_U and the number of parameters.
Taking all the above into consideration, we conclude that CV U is a suitable scoring metric for NAS.

Testing the scoring metric CV U
The results of the CV_U performance with CIFAR-10, CIFAR-100 [24] and ImageNet16-120 [25] are given in Table 2. We present our results based on 100 initialisations (N_init = 100) with N_BS = 256, the batch size used both by Mellor et al. [19] and during the NAS-Bench-201 training. We also provide two-sample t-test p-values for the statistical significance of the differences between our results and those of Mellor et al. [19], as well as against the random baseline (a p-value < 0.05 means that the results are statistically different; otherwise, they are considered similar). Comparisons are made both with the best performing N_a and with N_a fixed to 100 architectures. The results show that the performance of the CV_U scoring metric is clearly above random for all three datasets.
The effects of the number of iterations and the number of selected architectures are shown in Figures 6 and 7, respectively, using CIFAR-10 [24] as an example. Increasing the number of sampled architectures considerably improves the overall performance, since there is a higher chance of involving a good architecture. The number of iterations improves the precision of the method. Similar plots for CIFAR-100 [24] and ImageNet16-120 [25] can be found in the Appendix (Figures A.8, A.9, A.10, A.11). Table A.3 shows the performance of our metric with various combinations of batch size, N_init and N_a.
We compare our results against the results presented by Mellor et al., since in their work they also aim to discover a direct architectural property. We do not make comparison with other trainless NAS methods, since they rely on a supplementary model responsible for the architecture choice. In Table 2, similar overall performances are observed. Methods are also similar in the sense that they filter out bad architectures, rather than choose the best one.
Mellor et al. focus on correlations between linear maps (Jacobians) of input entries. The Jacobian of a given input expresses how much local perturbations within this input impact the corresponding output. Their metric minimises the correlation between the Jacobians within a minibatch using the eigenvalues of the correlation matrix Σ:

S = −∑_i [log(σ_J,i + k) + 1/(σ_J,i + k)],

where σ_J,i are the eigenvalues of Σ, and k is a small constant added for numerical stability (k = 10^−5). It is worth noting that the choice of the final score's shape is not clearly explained in [19].

Table 2: Results on the CIFAR-10, CIFAR-100 [24] and ImageNet16-120 [25] datasets. On the top, we list the best performing methods that require training (REA [6], random search, REINFORCE [34], BOHB [35]). As a low limit reference, the random and optimal values for N_a ∈ {10, 100} are given. Then, the results from Mellor et al. [19] and our results are reported for N_a ∈ {10, 25, 100} with N_BS = 256. Our elapsed times are reported in CIFAR-10/CIFAR-100/ImageNet16-120 format. Finally, the two-sample t-test p-values are provided for two cases: when comparing the best performing N_a (bold), and with N_a fixed to 100.
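Assuming the score of Mellor et al. takes its published form, summing log(σ_J,i + k) + 1/(σ_J,i + k) over the eigenvalues with a sign flip, a minimal sketch is (the eigenvalue spectra below are hypothetical):

```python
import math

def mellor_score(eigenvalues, k=1e-5):
    """S = -sum_i [log(sigma_i + k) + 1/(sigma_i + k)] over the eigenvalues of the
    Jacobian correlation matrix; each term is smallest at sigma = 1, so the score is
    highest when the Jacobians are decorrelated (all eigenvalues near 1)."""
    return -sum(math.log(s + k) + 1.0 / (s + k) for s in eigenvalues)

# Hypothetical eigenvalue spectra of the correlation matrix for a batch of 4 inputs:
decorrelated = [1.0, 1.0, 1.0, 1.0]   # Jacobians roughly orthogonal
correlated   = [3.7, 0.1, 0.1, 0.1]   # one dominant correlated direction
print(mellor_score(decorrelated) > mellor_score(correlated))  # True
```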
The success of the S metric means that when inputs affect the output in an uncorrelated way, the neural network has a higher chance to distinguish between them, and therefore better trainability. Their method, however, depends slightly on the values of the initial weights.
Our approach, on the other hand, focuses on how much variation in weights affects the outputs. CV U quantifies the stability of the network against initialisations for the same fixed data minibatch. Intuitively, if a network is stable against random weights, it will also be less affected by weights fluctuations during the training. It might suggest that the function representing a stable network is relatively smooth, which allows for more efficient training and lower overfitting risks.
As mentioned above, our algorithm involves two extra hyperparameters, which may be considered a disadvantage. The first one is the batch size: there are significant deviations in prediction power (with a different optimal N_BS for each dataset, see Table A.3). The second is the number of initialisations. Moreover, the fact that our method requires multiple initialisations makes it significantly slower than that of Mellor et al. (the running time grows linearly with the number of initialisations). Yet, compared to the methods that require training, the absolute speed remains high (tens to hundreds of seconds).
Prediction accuracy improves with the number of sampled architectures (for any batch size). This is a natural consequence of the fact that the chance of having a well performing architecture among many architectures is higher than among few (which is confirmed by random selection tests, see Table 2). Note that in the work of Mellor et al. [19] increasing the number of sampled architectures does not improve the result, which is counterintuitive. While this could be a statistical artefact for CIFAR [24] data, for ImageNet [25] the difference between N a = 10 and N a = 100 is statistically significant (p-value of 5.8e−7).
Nevertheless, the CV_U metric alone is not sufficient for successful NAS. This can be partly justified by the fact that all architectures within the NAS-Bench-201 benchmark are trained with the same fixed set of hyperparameters. For some of the networks contained within the benchmark this set may not be optimal, so there is a possibility that the architectures selected by our metric could have achieved better accuracies. We plan to investigate this in future work, as well as to try combining our metric with other NAS methods (for example, the one from Mellor et al. [19]).

Conclusions
In this work we explore the relationship between the prediction performance of an architecture and its accuracy prior to training. The principal objective is to better understand how a neural network's geometry affects its prediction power. For this, we evaluate untrained accuracy over multiple random weight initialisations. We observe that the architectures with a low coefficient of variation of untrained accuracy, CV_U = σ_U / µ_U, show overall better performance. We use this observation to develop an entirely trainless NAS technique. Our metric achieves accuracies of 91.90 ± 2.27, 64.08 ± 5.63 and 38.76 ± 6.62 for CIFAR-10, CIFAR-100 [24] and a downscaled version of ImageNet [25], respectively (when choosing among 100 architectures, with 100 random initialisations, and evaluating accuracies on a minibatch of 256 data points). These accuracies are statistically above the random baseline, which leads to the conclusion that the stability of a network against initialisations is an indicator of its trainability. However, since the metric does not guarantee the best architecture at all times, we consider that stability is not the only property influencing a neural architecture's performance. Combining our method with others (for example, the one from Mellor et al. [19]) might lead to more stable results. We plan to explore various combinations of the CV_U metric with other methods in future work.

Acknowledgement
We would like to express our deepest gratitude to Dr. Ayako Nakata and Dr. Guillaume Lambard for their continuous support and valuable discussions.