Neural Architecture Search and Hardware Accelerator Co-Search: A Survey

Deep neural networks (DNNs) now dominate the most challenging applications of machine learning. As DNNs can have complex architectures with millions of trainable parameters (the so-called weights), their design and training are difficult even for highly qualified experts. In order to reduce human effort, neural architecture search (NAS) methods have been developed to automate the entire design process. The NAS methods typically combine searching in the space of candidate architectures and optimizing (learning) the weights using a gradient method. In this paper, we survey the key elements of NAS methods that, to various extents, consider hardware implementation of the resulting DNNs. We classify these methods into three major classes: single-objective NAS (no hardware is considered), hardware-aware NAS (the DNN is optimized for a particular hardware platform), and NAS with hardware co-optimization (hardware is directly co-optimized with the DNN as a part of NAS). Compared to previous surveys, we emphasize the multi-objective design approach that must be adopted in NAS and focus on co-design algorithms developed for concurrent optimization of DNN architectures and hardware platforms. As most research in this area deals with NAS for image classification using convolutional neural networks, we follow this trajectory in our paper. After reading the paper, the reader should understand why and how NAS and hardware co-optimization are currently used to build cutting-edge implementations of DNNs.


I. INTRODUCTION
Machine learning (ML) technology is now routinely applied in cutting-edge applications such as image, speech, and natural language recognition, data mining, autonomous car driving, and automated system design, in which humans had no competitors a few years ago. The core ML algorithms utilize deep neural networks (DNNs), i.e., complex computational models that must be designed and then trained using suitable data from a given application domain [1]. For example, the DNN called ViT-H/14, which shows state-of-the-art classification accuracy on one of the most significant image classification benchmarks (ImageNet), consists of 632 million trainable parameters [2]. As designing such complex DNNs is very time-consuming and requires skilled experts, much effort has been invested in recent years to automate this work.
Neural architecture search (NAS) [3]-[5] is a method for the automated design of complex neural networks (NNs) such as DNNs. In its single-objective setup, it creates neural networks that are optimized according to one objective (typically the quality of output, expressed in terms of accuracy or a similar metric). NAS has to solve two problems concurrently: designing the NN's architecture (including its size, structure, and types of components) and optimizing its trainable parameters (the so-called weights). Each of these problems is difficult in itself, and solving it requires considerable computing resources. The NAS methods typically combine searching in the space of candidate NNs (see Sect. IV-B) and optimizing (learning) the weights using a gradient method. However, NAS methods are most beneficial when they are used in a multi-objective setup and optimize not only the NN quality but also other parameters such as size or latency. If more objectives are considered, NAS methods currently provide state-of-the-art DNN implementations [6]-[8].
Only a fully trained NN is needed in most ML applications, while its training is performed on a computer cluster before its deployment. However, even such a fully trained NN must execute billions of elementary arithmetic operations to process a single input, i.e., to perform the so-called inference. Note that each of the trainable parameters typically undergoes at least one multiplication during inference.
In order to reach the desired latency or energy efficiency, hardware accelerators for NN inference have been developed in recent years [9], [10]. In this direction, hardware-aware NAS methods were proposed to design the NN architecture (and weights) optimally for a given hardware platform. Compared to single-objective methods, hardware-aware NAS delivered similar accuracy but reduced latency, size, and/or power consumption, which was demonstrated across different hardware platforms [7], [11], [12]. However, these methods only optimize the NN architecture. They do not extend the search space to co-optimize the NN architecture with parameters of hardware platforms (such as the amount and type of resources, dataflow strategies, buffer sizes, and compiler options). Thus, they neglect the hardware design freedom provided by many platforms (e.g., FPGAs) [13]. Hence, the newest NAS methods co-optimize NN architectures and hardware configurations to further improve latency and other parameters that are important in many applications such as IoT or mobile phones. These methods work in three search spaces (weights, NN architectures, and hardware configurations) and must orchestrate several search algorithms to produce the best trade-offs between the accuracy and various hardware-relevant metrics [8], [14], [15].
Modern NAS methods have thus increased design productivity by freeing the capacity of human experts, and they have improved the quality, performance, and energy requirements of the resulting neural accelerators by adopting a data-driven, automated, search-based design approach.

A. PREVIOUS SURVEYS
The body of work dealing with NAS has significantly increased since 2015, as illustrated in Fig. 1. This rapid development is captured by detailed survey papers published since 2018 [4], [5], [16]-[18]. Some specialized surveys focused on particular search methods such as metaheuristics [19], neuroevolution [20], and reinforcement learning-based NAS [21]. The most recent survey [22] provides a unifying view on NAS in terms of representation of DNNs, variation operators, multi-objective search, constraint handling, and performance estimation. However, these surveys only briefly mention hardware-aware NAS, if at all. The hardware design community only very recently started to survey relevant papers on hardware-aware NAS [23]-[25]. A detailed survey is currently provided only by the unpublished paper [25]. As that work was finished in 2020, it does not fully cover the most recent approaches in which the neural architecture is co-designed with a hardware accelerator.

B. PLAN OF THE SURVEY
In this paper, we focus on this new category: NAS with hardware co-design. Rather than providing a unified view, which is hard to establish in this rapidly expanding field, our survey follows the development from single-objective NAS methods, via hardware-aware NAS (in which the DNN is optimized for a particular hardware platform), to NAS with hardware co-search, in which the hardware design space has to be explored in addition to the space of network architectures. Relevant NAS methods are classified according to carefully selected criteria and arranged in tabular form. For the last category, we propose a new classification approach to understand how multiple search algorithms interact to construct an efficient DNN and its hardware accelerator. We emphasize the importance of fair comparison and benchmarking methodology for NAS methods and show examples of such comparisons. The entire topic is challenging to present because it requires that the reader be familiar with different fields: DNNs, methods for multi-objective optimization, and hardware accelerator design. Hence, the survey starts with a brief introduction to DNNs and explains the principles of accelerator design. Most research in this field deals with NAS for convolutional neural networks (CNNs) applied to the image classification task. Our survey is primarily focused on this area. Table 1 provides a list of abbreviations used throughout this paper. As a core database of relevant papers, we selected automl.org [26], which contains not only papers from the standard databases of IEEE or ACM but also unpublished preprints. The rest of the paper is organized as follows. Section II summarizes relevant principles of neural networks, emphasizing CNNs, and their optimization and benchmarking. Section III is devoted to hardware accelerators developed for CNN inference. NAS methods that optimize only the CNN accuracy are surveyed in Section IV. Hardware-aware NAS is introduced in Section V, in which we also describe common multi-objective optimization approaches. NAS with hardware co-design is discussed in Section VI. Specialized techniques and hardware platforms in NAS are treated separately in Section VII. Section VIII deals with the evaluation and benchmarking of NAS methods. Concluding remarks are given in Section IX.

II. ARTIFICIAL NEURAL NETWORKS
Artificial neural networks are computational models inspired by biological brains. They are used in machine learning tasks such as classification, prediction, control, and function approximation. An NN is defined by its architecture and weights [1]. The number and type of layers, neurons, and other parameters that define the architecture of the NN are called hyperparameters. Once the architecture and hyperparameters of the NN are specified, the NN can undergo a training procedure whose goal is to optimize the trainable parameters (the so-called weights) to minimize a given loss function. In classical supervised gradient learning, the training algorithm works in iterations (epochs). For each data input (i.e., an input vector or a subset of input vectors called a batch), it computes the output vector, which is compared with the desired vector to determine the error. The error is propagated back through the network, and the weights are updated using the gradient of the loss function. While training is performed with a training data set, the final quality score of the NN (such as the classification accuracy) is determined for a test data set. The accuracy gives the proportion of correct classifications over a given data set. In this paper, the accuracy always refers to the accuracy on the test data. The top-n accuracy is the proportion of test data for which the correct class is among the n highest-probability predictions.
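To make the top-n accuracy concrete, the following minimal Python sketch computes it from per-class scores. The function name `top_n_accuracy` and the toy scores are illustrative assumptions and are not taken from any of the surveyed works.

```python
def top_n_accuracy(scores, labels, n=1):
    """Fraction of samples whose true label is among the n highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda k: row[k], reverse=True)
        if label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical class scores for three test samples and three classes.
scores = [
    [0.7, 0.2, 0.1],  # true class 0: hit already for n = 1
    [0.1, 0.3, 0.6],  # true class 1: hit only for n >= 2
    [0.5, 0.4, 0.1],  # true class 2: hit only for n = 3
]
labels = [0, 1, 2]

top1 = top_n_accuracy(scores, labels, n=1)
top2 = top_n_accuracy(scores, labels, n=2)
```

For this toy data, one of three samples is correct under top-1 and two of three under top-2, which shows why top-5 figures reported for ImageNet are always at least as high as the top-1 figures.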
In this paper, we will primarily deal with convolutional neural networks (CNNs) whose architecture is defined as a sequence of layers with no feedback that are composed of artificial neurons and other elements, and at least one of the layers is the so-called convolutional layer [27]. We will only briefly mention recurrent neural networks (RNNs) that have been developed for time-dependent problems. They support both feedback and feedforward connections and can store intermediate results internally in the NN. Long short-term memory networks (LSTMs) are the most popular variant of RNNs capable of capturing long-term time dependencies [1].
After introducing selected basic types of NNs, this section will focus on well-known CNN models developed by human experts and various techniques improving the performance of CNNs. As the main focus of this paper is the automated design of hardware-aware CNNs, we will not further deal with other types of DNNs, training algorithms, and application domains.

A. NEURONS AND MULTI-LAYER NETWORKS
A basic artificial neuron is an elementary building block of complex NNs. A neuron has $n$ inputs $(x_1, x_2, \ldots, x_n)$ and returns a scalar output $y$ (Fig. 2). The inputs are multiplied by the weights $(w_1, w_2, \ldots, w_n)$ and summed together with a bias term $b$. A non-linear activation function $\sigma(z)$ is then applied to calculate the output of the neuron, i.e.,
$$y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right).$$
Common activation functions are the Rectified Linear Unit (ReLU), sigmoid, and hyperbolic tangent [1]. When used as a classifier, one neuron can, after training, only separate input vectors into linearly separable classes. To approximate general functions, a multi-layer NN is used, which consists of multiple layers of neurons, where a suitable number of neurons working in parallel constitute one layer (Fig. 2). Fully connected (FC) layers are usually organized in such a way that if a layer contains $n$ neurons, then each neuron is connected to all $m$ neurons of the previous layer, which requires $m \times n$ weights. If the NN has to approximate a function $F(x_1, x_2, \ldots, x_n)$, then the last (output) layer contains just one neuron. If the NN works as a classifier into $k$ classes, then the output layer has $k$ neurons. The $i$-th neuron gives the probability of the $i$-th class, which is often obtained using a softmax function:
$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}},$$
where $z_i$ is the weighted sum (including the bias) computed by the $i$-th output neuron.
The number of layers and the number of neurons in each layer are hyperparameters of a multi-layer fully connected NN. When the trained NN is deployed, it only performs feedforward computations to obtain the result for every input; this is called inference.
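The neuron, the fully connected layer, and the softmax output described above can be sketched in a few lines of plain Python. This is only an illustration; the function names and the toy weights are assumptions made for this example.

```python
import math


def relu(z):
    return max(0.0, z)


def neuron(x, w, b, activation=relu):
    """Weighted sum of the inputs plus bias, followed by the activation function."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


def fully_connected(x, weights, biases, activation=relu):
    """One FC layer: `weights` has n rows (neurons), each with m = len(x) entries."""
    return [neuron(x, w, b, activation) for w, b in zip(weights, biases)]


def softmax(z):
    """Map k raw outputs to k class probabilities that sum to 1."""
    m = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


x = [1.0, -2.0, 0.5]                      # m = 3 inputs
W = [[0.2, -0.1, 0.4], [-0.3, 0.8, 0.1]]  # n = 2 neurons, i.e., 3 x 2 = 6 weights
b = [0.1, 0.0]

hidden = fully_connected(x, W, b)
probs = softmax(hidden)
```

Inference in a multi-layer NN is then just a chain of such `fully_connected` calls, which is why the number of multiplications grows with the number of weights.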

B. CONVOLUTIONAL NEURAL NETWORKS
In order to reduce the number of weights and automatically extract essential features from raw data, CNNs have been introduced in the image processing domain. CNNs contain between five and hundreds of layers. Each convolutional layer generates, by applying one or several convolutional kernels (filters), a successively higher level of abstraction of the input data, called a feature map. The core computational procedure of a convolutional layer is a high-dimensional convolution (Fig. 3). The convolutional layers take input activation maps, arranged in three dimensions (i.e., height $H_{in}$, width $W_{in}$, and channel $C_{in}$), and generate output activation maps, arranged in three dimensions (i.e., height $H_{out}$, width $W_{out}$, and channel $C_{out}$). Mathematically, it is the convolution between the input activation maps and a set of $C_{out}$ 3D filters. More precisely, every single 2D $H_{out} \times W_{out}$ plane of the output activation maps is the result of the convolution between the 3D input activation maps and one of the 3D filters. The final step is adding a 1D bias. Formally, with the input activation maps, the output activation maps, the filters, and the bias denoted as $X$, $Y$, $W$, and $B$, respectively, the convolutional process (padding omitted) can be expressed as
$$Y[m, p, q] = B[m] + \sum_{c=0}^{C_{in}-1} \sum_{i=0}^{K_h-1} \sum_{j=0}^{K_w-1} X[c,\, S p + i,\, S q + j]\; W[m, c, i, j],$$
for $0 \le m < C_{out}$, $0 \le p < H_{out}$, and $0 \le q < W_{out}$, where $K_h$ and $K_w$ denote the filter height and width. The stride $S$ is the number of pixels by which the filter is shifted after each convolution. The parameters used in the convolutional process are summarized in Table 2.
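The convolutional layer described above can be written as a direct loop nest. The sketch below is a plain-Python illustration under stated assumptions: no padding, nested lists instead of tensors, and the names `k_h` and `k_w` for the filter height and width, which are chosen for this example. As is common in CNNs, the filter is not flipped.

```python
def conv2d(X, W, B, stride=1):
    """Direct convolution without padding.

    X: input activation maps  [C_in][H_in][W_in]
    W: filters                [C_out][C_in][k_h][k_w]
    B: bias                   [C_out]
    Returns the output maps   [C_out][H_out][W_out]
    """
    c_in, h_in, w_in = len(X), len(X[0]), len(X[0][0])
    c_out, k_h, k_w = len(W), len(W[0][0]), len(W[0][0][0])
    h_out = (h_in - k_h) // stride + 1
    w_out = (w_in - k_w) // stride + 1
    Y = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for m in range(c_out):              # one 3D filter per output channel
        for p in range(h_out):
            for q in range(w_out):
                acc = B[m]
                for c in range(c_in):   # accumulate over all input channels
                    for i in range(k_h):
                        for j in range(k_w):
                            acc += X[c][stride * p + i][stride * q + j] * W[m][c][i][j]
                Y[m][p][q] = acc
    return Y


# Toy example: one 3x3 input channel, one 2x2 filter, bias 0.5.
X = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
W = [[[[1, 0], [0, 1]]]]
B = [0.5]
Y = conv2d(X, W, B)
```

The six nested loops make the cost structure visible: every output value needs `c_in * k_h * k_w` multiply-and-accumulate operations, and each filter weight is reused for every output position.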
CNNs are organized to learn the non-linear mapping between the features and resulting classes, layer by layer, where higher-level features are extracted from lower-level features obtained in previous layers. A non-linear activation function typically follows each convolutional layer. CNNs also contain pooling layers, normalization layers, and specialized blocks of layers (e.g., residual and inception blocks, which will be discussed in Section II-C). Pooling layers combine, by applying the averaging or maximum operators, a set of input values into a small number of output values to reduce the dimension of feature maps. Normalization layers make it possible to control the input distribution across layers, which can help to speed up the training process and improve accuracy.
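As a small illustration of how pooling reduces the dimension of a feature map, the sketch below applies non-overlapping 2×2 maximum and average pooling to one 2D feature map. The helper names and the sample values are assumptions made for this example.

```python
def pool2d(fmap, size=2, op=max):
    """Non-overlapping pooling of one 2D feature map (stride equals the window size)."""
    h_out, w_out = len(fmap) // size, len(fmap[0]) // size
    out = []
    for r in range(h_out):
        row = []
        for c in range(w_out):
            window = [fmap[size * r + i][size * c + j]
                      for i in range(size) for j in range(size)]
            row.append(op(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 3, 7, 8],
]
max_pooled = pool2d(fmap, size=2, op=max)
avg_pooled = pool2d(fmap, size=2, op=mean)
```

Both variants turn the 4×4 map into a 2×2 map, so the following layer processes four times fewer values.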

C. HUMAN-CREATED CNN MODELS AND BENCHMARKS
The use of a CNN called AlexNet [27] (Fig. 4a) trained on GPUs led to a breakthrough result in the ImageNet 2012 challenge focused on image classification. AlexNet achieved a top-5 error of 15.3%, i.e., 10.8 percentage points lower than that of the other competitors, which utilized conventional classifiers. The ImageNet benchmark data set contains 1.2 million training images, with roughly 1000 images in each of 1000 categories [28].
Since that time, various innovations have been proposed to improve CNNs, leading to CNNs showing around 90% top-1 accuracy on ImageNet. In addition to maximizing the classification accuracy, many efforts have been invested in minimizing the size and latency of CNNs to deploy them in resource-constrained devices such as mobile phones. Major innovations are presented by CNN models given in Table 3; a detailed benchmarking analysis is in [29]. Please note that some of the following networks exist in several versions (e.g., ResNet-8, ResNet-14 etc.), which differ in the number of layers, the structure of building blocks, and some other parameters. The top-1 accuracy depends on the training algorithm setup and time and resources available for training. We will briefly introduce some CNN models important for this paper.
GoogLeNet [30] is the first complex CNN formed by stacking inception modules (Fig. 4c), in which various convolutional operations with different sizes are performed in parallel and their results are aggregated by concatenation. ResNet [31] introduced the so-called residual blocks (Fig. 4b) containing a shortcut connection, which makes it possible to eliminate the vanishing gradient problem without degeneration in CNNs since the gradient is directly passed through shortcut connections.
There are many software platforms (such as TensorFlow [36] and PyTorch [37]) that make it possible to use an existing CNN model or create a new one and evaluate it on pre-prepared data sets. CNNs are evaluated on benchmark problems (or data sets), among which image classification is the most popular. ImageNet is considered a difficult and highly important benchmark, for which the ML community carefully monitors the progress in the top-1 accuracy. Some smaller CNNs (developed for, e.g., low-power devices) are only evaluated on less complex image sets such as CIFAR-10 (10 image classes) [38], CIFAR-100 (100 classes) [38], MNIST (10 classes) [39], Fashion-MNIST (F-MNIST, 10 classes), SVHN (10 classes) [40], and NORB [41].
Some other data sets are utilized in the papers that will be surveyed in the next sections. Table 4 gives their abbreviations and brief descriptions. Further details are available in the particular papers referenced in Sections IV, V, and VI.

D. CNN OPTIMIZATION
CNNs are typically used for error-resilient applications in which a minor error introduced by inexact computing is often invisible to the end-user. Hence, CNNs can be simplified to reduce hardware resources, power consumption, or latency.
By pruning, some connections, neurons, filters, and channels can be removed [42]. By quantization, the most suitable number of bits and data format are assigned to selected weights, activations, and other intermediate results in the network instead of using the common 32-bit floating-point (FP) data type [42]-[44]. Recent studies have shown that with novel quantization methodologies, namely PACT and SAWB, and specialized number formats (DLFloat16 (16 bit) and Hybrid-FP8 (8 bit) for training, and INT4 for inference), 4-bit inference can be achieved without loss in accuracy for common CNN models on ImageNet [45]. Model compression tries to reduce the number of different weight values to minimize the CNN memory footprint [42], [46].
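The basic ideas of pruning and quantization can be sketched as follows. This is a deliberately simple illustration, assuming magnitude-based pruning and symmetric uniform quantization of a flat list of weights; it is not an implementation of PACT, SAWB, or any other method cited above, and the function names are chosen for this example.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization to signed integers with `bits` bits.

    Assumes that at least one weight is non-zero.
    """
    q_max = 2 ** (bits - 1) - 1  # e.g., 7 for 4 bits
    scale = max(abs(w) for w in weights) / q_max
    q = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integer codes back to approximate real-valued weights."""
    return [v * scale for v in q]


def prune_by_magnitude(weights, fraction):
    """Set the given fraction of weights with the smallest magnitude to zero."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


weights = [0.70, -0.33, 0.04, -0.02, 0.21, -0.70]
q, scale = quantize_uniform(weights, bits=4)
restored = dequantize(q, scale)
pruned = prune_by_magnitude(weights, fraction=0.5)
```

With 4 bits, each weight is stored as an integer between -7 and 7 plus one shared scale, and the rounding error per weight is bounded by half of the scale. Whether such an error is acceptable depends on the error resilience of the application, which is why the bit width is a natural search variable for NAS.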
NAS algorithms that will be discussed in Sections V and VI implicitly perform pruning. Searching for the optimal bit widths is directly performed by, e.g., [47]- [51].

III. HARDWARE IMPLEMENTATION OF NEURAL NETWORKS
Accelerating DNN training as well as inference in specialized hardware has been a vital research topic since 2014. While training is typically performed on (clusters of) graphics processing units (GPUs) or Tensor Processing Units (TPUs), accelerated inference is carried out on a variety of computing platforms ranging from low-power processors to high-performance specialized multi-chip systems. As the paper deals with hardware-aware NAS, we will primarily focus on accelerators devoted to inference. This topic is well covered in the literature; see, e.g., a recent book [52] or detailed surveys from 2020 [10], [45], [46], [53]-[55].
The CNN accelerator design is motivated by the fact that most computations are carried out in convolutional layers, whose computation is suitable for parallelization. Moreover, the parameters (weights) associated with convolutional filters are reused many times. For example, while 666 million multiply-and-accumulate (MAC) operations are performed in the convolutional layers of AlexNet, only 58.6 million MACs are performed in its fully connected layers. In the case of ResNet-50, the ratio of MACs conducted in convolutional to fully connected layers is 1930× [9]. Hence, by a smart organization of the convolutional operations, which involves supplying the relevant data on time, introducing a suitable data reuse strategy, and setting the bit widths, a significant improvement in latency and power consumption can be obtained using specialized parallel hardware. Chen et al. suggested directly optimizing the data reuse, which is the number of MACs that use the same piece of data, i.e., MACs/data, to maximize the energy efficiency [56]. For example, if all data reuse is exploited, DRAM accesses in AlexNet can be reduced from 2896 million to 61 million [52].
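The figures quoted above can be put into perspective with a short calculation. The sketch below counts MACs and weight reuse for a single convolutional layer and derives two ratios from the AlexNet numbers in the text. The layer dimensions are hypothetical and serve only as an example.

```python
def conv_macs(h_out, w_out, c_out, c_in, k_h, k_w):
    """MACs of one convolutional layer: one MAC per filter weight per output position."""
    return h_out * w_out * c_out * c_in * k_h * k_w


def conv_weights(c_out, c_in, k_h, k_w):
    """Number of filter weights of the layer (biases not counted)."""
    return c_out * c_in * k_h * k_w


# Hypothetical layer: 56x56 output, 64 input channels, 128 filters of size 3x3.
macs = conv_macs(56, 56, 128, 64, 3, 3)
weights = conv_weights(128, 64, 3, 3)
reuse_per_weight = macs // weights  # each weight is used once per output position

# Ratios derived from the AlexNet figures quoted in the text.
alexnet_conv_macs, alexnet_fc_macs = 666e6, 58.6e6
conv_share = alexnet_conv_macs / (alexnet_conv_macs + alexnet_fc_macs)
dram_reduction = 2896 / 61  # DRAM accesses without vs. with full data reuse
```

In this hypothetical layer, every weight takes part in 3136 MACs, whereas a weight of a fully connected layer is used only once per input. For AlexNet, roughly 92% of all MACs fall into convolutional layers, and exploiting all data reuse cuts DRAM accesses by a factor of about 47.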
Google's TPU accelerators exist in several versions [57]. TPUv1, introduced in 2016, provided a systolic array of 256×256 8-bit FX multipliers, which significantly accelerates matrix multiplications for CNN inference (with a peak performance of 92 TOPS at 75 W). TPUv2 and TPUv3 provide increased performance and support FP operations, which makes them usable for DNN training. EdgeTPU is a version developed for edge computing and smartphones.
A detailed survey of FPGA-based acceleration techniques for DNNs was recently published in [10], [63]. On the other hand, DNN accelerators based on processors and microcontrollers are an attractive solution not only because of their low-power operation, which is useful in the IoT domain, but also because of easier programmability and flexibility compared to specialized hardware accelerators. Thanks to CMSIS-NN [65], these processors can implement complex CNNs such as MobileNetV1. To enable energy-efficient inference of quantized DNNs (with up to 550 GOPS/W) for IoT applications, the RISC-V instruction set was extended with low-bit-width SIMD arithmetic instructions [66].
It is important to emphasize that the acceleration approaches mentioned above show very different trade-offs between two critical evaluation criteria: performance (inferences/s) and energy efficiency (inferences/s/W). Using AlexNet as an example, Table 5 indicates that GPUs provide the highest performance, while specialized ASICs lead to the most energy-efficient implementations. A detailed evaluation methodology for DNN accelerators was published in [67].

B. TEMPORAL ARCHITECTURES
The temporal architecture (typical for CPUs and GPUs) employs a set of ALUs with a fixed connection pattern and a hierarchical memory subsystem. Because this architecture is primarily intended for general-purpose computing, it is in principle less energy efficient than specialized architectures with a dedicated data flow organization. When CPU and GPU implement DNNs, their performance can be increased by suitable algorithmic techniques and compiler-level optimizations whose goal is to reduce the number of expensive arithmetic operations, maximize the degree of parallel processing, and optimize the memory access pattern. Libraries such as MKL [69] and cuDNN [70] provide optimized algorithms for efficient computing of matrix multiplication, multi-dimensional convolution, Fast Fourier Transform, and other valuable operations for DNNs. GPUs, as the most popular platforms for DNNs, range from small devices (e.g., NVIDIA Jetson Nano with 472 GFLOPS and 5-10 W) to high-performance nodes of supercomputers (e.g., NVIDIA V100 with 100 TFLOPS and 300 W).

C. SPATIAL ARCHITECTURES
Spatial DNN accelerators that are usually implemented in ASICs or FPGAs consist of an array of Processing Elements (PEs), on-chip buffers, a controller, and external DRAM memory (Fig. 5). Each PE contains a MAC circuit to multiply the input data with a weight and add the product to a partial sum. A small local memory (register file or buffer) implemented in each PE can store local data such as weights, activations, and partial sums. Another memory, a global buffer, is used to prefetch from DRAM the activations and weights associated with a part of the DNN that will be processed in the next step.
Dataflow is a general term covering the computation order, parallelization strategy, and tiling strategy applied in the accelerator. Tiles are chunks of data that fit the resources that are available to process them. For example, tiling of an input feature map means that instead of loading an entire feature map, only a few rows and columns of that feature map are loaded and processed by the PE array. Dataflow is closely related to the data reuse strategy. The data elements (either weights, inputs, or partial sums) that have to be reused are mapped into local memory in PEs and kept there (stationary) until all relevant computations are performed with them. The common dataflow strategies are:
• Weight Stationary, in which the weights are stored in PEs, e.g., used in [57].
• Output Stationary, in which the partial sums are stored in PEs, e.g., [62].
• Row Stationary, in which the weights are stored in PEs, and the operations of a row of the convolution are mapped to the same PE, e.g., [56].
• No Local Reuse, in which only the global buffer is used, e.g., [61].
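The difference between the weight-stationary and output-stationary strategies can be illustrated by the loop order of a 1D convolution. The sketch below is a strongly simplified software model of a single PE, written under our own assumptions: it only counts how often a weight is fetched into the PE and how often a partial sum is moved, and it ignores parallelism, tiling, and buffer sizes.

```python
def conv1d_weight_stationary(x, w):
    """Outer loop over weights: each weight is fetched once and stays in the PE."""
    n_out = len(x) - len(w) + 1
    y = [0.0] * n_out
    weight_fetches = psum_transfers = 0
    for k, wk in enumerate(w):
        weight_fetches += 1         # wk is loaded once and reused n_out times
        for o in range(n_out):
            y[o] += wk * x[o + k]   # the partial sum y[o] is read and written back
            psum_transfers += 1
    return y, weight_fetches, psum_transfers


def conv1d_output_stationary(x, w):
    """Outer loop over outputs: each partial sum stays in the PE until it is complete."""
    n_out = len(x) - len(w) + 1
    y = []
    weight_fetches = psum_transfers = 0
    for o in range(n_out):
        acc = 0.0                   # the partial sum is kept locally
        for k, wk in enumerate(w):
            acc += wk * x[o + k]
            weight_fetches += 1     # weights are fetched again for every output
        y.append(acc)
        psum_transfers += 1         # only the finished output leaves the PE
    return y, weight_fetches, psum_transfers


x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
ws_result = conv1d_weight_stationary(x, w)
os_result = conv1d_output_stationary(x, w)
```

Both loop orders compute the same outputs, but the data movement differs: in this toy model, the weight-stationary order needs 3 weight fetches and 9 partial-sum transfers, whereas the output-stationary order needs 9 weight fetches and 3 partial-sum transfers. Which order is cheaper on real hardware depends on the layer shape and the memory hierarchy.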

FIGURE 5. Typical organization of a DNN accelerator and its programming
The critical parameters of any DNN accelerator (i.e., the runtime, throughput, and energy efficiency) depend on a given DNN architecture and its mapping (via a suitable dataflow organization) to available resources. The data can be reused across time (via buffers) and space (over wires). The tile sizes are bound by buffer sizes within the accelerator. The total number of tiles depends on the DNN model size and the dataflow strategy. The total number of PEs in the accelerator determines the peak throughput [9].
Using cost models (such as MAESTRO [71]), the run time, resource utilization, delay, and power can be estimated for a given accelerator, DNN, and data set. A DNN is executed either layer by layer, or the entire DNN is pipelined across the accelerator, e.g., by means of systolic array principles. While the former approach is easier to schedule, it usually leads to less efficient utilization of resources. The latter approach requires a suitable dynamic partitioning of resources, which is challenging to manage for some DNN models, but the accelerator can be used more effectively. Some accelerators can effectively exploit data sparsity (many zeroes in the weights and activations) and configure an optimal bit width for arithmetic operations to improve performance [45], [56].
Depending on a given DNN, accelerator (specified in terms of architecture, available resources, and dataflow options), and constraints (such as the maximum latency), a compiler generates the control sequence for the accelerator. Many authors have addressed the optimal mapping and execution of the DNN on a given accelerator, see [55], [62], [71], [72]. The more challenging problem is to implement multiple accelerators on a single chip or to deploy multiple accelerators on multiple chips [13], [73]. Note that the search for an optimized implementation of a DNN for a given accelerator is one of the problems addressed by hardware-aware NAS (see Section VI).

D. APPROXIMATE IMPLEMENTATIONS OF CNNS
One of the most prominent approaches developed to reduce the power consumption of computer systems is approximate computing [74]. According to [75], approximations have been introduced into CNNs at the level of data type quantization, microarchitecture (e.g., pruning, weight sharing, and dataflow organization), MAC circuits, and memory (utilizing approximate memory cells, architecture, and weight compression). The RAPID AI accelerator was built from the ground up to investigate the impact of approximation techniques in CNNs. The authors revealed that the most significant gains are obtained when a cross-layer approximation approach is adopted, involving software, architecture, and hardware, in contrast to conventional methods focused on optimizing each layer of abstraction independently [45]. NAS combined with hardware accelerator co-search has great potential to address this problem.

IV. SINGLE-OBJECTIVE NAS
Developing a high-quality DNN model for a given (previously unseen) data set and creating its implementation optimized for a target platform is a very time-consuming task because it is inherently based on performing many experiments. The difficulty usually increases when the data set size grows and challenging constraints (such as the maximum allowed latency or power consumption) are introduced. NAS methods were invented to automate this design process. The whole NAS field started with approaches in which only one objective, the accuracy, is optimized [3], see Fig. 6. Other objectives such as the number of parameters or FLOPS were not explicitly considered, but they have often been reported for the resulting networks. Another goal, not explicitly formulated within the NAS methods, was to minimize the NAS execution time (or the consumed energy or CO2 emissions [76]). As we primarily deal with CNNs, the next paragraphs will mainly discuss CNN-oriented single-objective NAS.
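For given training and test data sets, $D = \{D_{trn}, D_{tst}\}$, and loss function $\mathcal{L}$, the NAS problem can be formalized as a bilevel optimization problem [77]:
$$\min_{\alpha \in \Omega_\alpha} \mathcal{L}_{D_{tst}}\big(\alpha, w(\alpha)\big) \quad \text{subject to} \quad w(\alpha) \in \arg\min_{w \in \Omega_w} \mathcal{L}_{D_{trn}}(\alpha, w),$$
where the upper-level variable $\alpha$ defines a candidate CNN architecture, and the lower-level variable $w(\alpha)$ defines the associated weights. $\Omega_\alpha$ and $\Omega_w$ denote the space of CNN architectures and the space of CNN weights, respectively. $\mathcal{L}_D(\alpha, w)$ is the cross-entropy loss on the data set $D$ for architecture $\alpha$ and weights $w$.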
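The bilevel structure described above (an outer search over architectures and an inner training of the weights) can be illustrated with a toy problem. In the sketch below, which rests entirely on our own simplifying assumptions, the "architecture" is the degree of a polynomial model, the loss is the mean squared error instead of the cross-entropy, the lower level is plain gradient descent, and the upper level simply enumerates a tiny search space.

```python
def predict(alpha, w, x):
    """Toy model: polynomial of degree alpha with alpha + 1 weights."""
    return sum(w[j] * x ** j for j in range(alpha + 1))


def loss(alpha, w, data):
    """Mean squared error (a stand-in for the cross-entropy loss)."""
    return sum((predict(alpha, w, x) - y) ** 2 for x, y in data) / len(data)


def train(alpha, d_trn, steps=2000, lr=0.1):
    """Lower level: gradient descent over the weights for a fixed architecture."""
    w = [0.0] * (alpha + 1)
    for _ in range(steps):
        grad = [0.0] * (alpha + 1)
        for x, y in d_trn:
            err = predict(alpha, w, x) - y
            for j in range(alpha + 1):
                grad[j] += 2 * err * x ** j / len(d_trn)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def nas(search_space, d_trn, d_tst):
    """Upper level: select the architecture whose trained weights give the lowest test loss."""
    results = {a: loss(a, train(a, d_trn), d_tst) for a in search_space}
    best = min(results, key=results.get)
    return best, results


def target(x):
    return 1.0 + 2.0 * x * x


d_trn = [(x / 10, target(x / 10)) for x in range(-10, 11, 2)]
d_tst = [(x / 10, target(x / 10)) for x in range(-9, 10, 2)]
best_alpha, results = nas([0, 1, 2, 3], d_trn, d_tst)
```

Degrees 0 and 1 cannot represent the quadratic target, so the search selects degree 2 or 3. The example also shows why NAS is expensive: every candidate in the search space requires a complete training run before it can be scored.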
The difficulty of the NAS problem lies in its complexity; the search space is enormous, and its dimension is variable. The search algorithm is often constructed as multi-objective (mandatory for hardware-aware NAS) and tries to balance the exploration and exploitation aspects. Moreover, because candidate networks are complex objects, their evaluation, which typically involves training, is very computationally expensive. The following sections survey the key principles of NAS algorithms.

FIGURE 6. Single-objective NAS (left) and hardware-aware NAS (right). From the search space, the search algorithm samples a candidate DNN architecture α, which is trained to get the weights w, and tested to get the test accuracy Acc. The implementation cost is evaluated only for the hardware-aware NAS.

A. SEARCH SPACE
A common practice is to model a candidate CNN using a directed acyclic graph encoded as a variable-length string.
All possible strings describing valid CNNs constitute the search space. Three strategies for building the search spaces have dominated in recent years: (i) a macro search space which describes the entire CNN, (ii) a micro search space which defines the architecture of a subgraph (or several subgraphs) which is then repeatedly reused in the CNN, and (iii) a hierarchical search space.
In the case of the macro search space (Fig. 7a), the search space is determined by a set of possible operations for each node, hyperparameters of the network architecture, and a network template. The template can be a simple linear sequence of N nodes, or it can support branches and skip connections. Independent branches are at some point merged using a suitable operator such as concatenation or sum [3], [78]. Another option is to parameterize a well-known CNN (such as ResNet [31] or MobileNetV3 [33]) and use it as a template (e.g., in [7], [51], [79]- [81]) for building and constraining the search space. Some parts of CNN can be fixed by an engineer, and their implementation is then not subject to optimization. For example, the last fully connected layer is always present in a CNN-based classifier, and, hence, it usually makes no sense to search for its hyperparameters.
In the case of the micro search space (Fig. 7b), the architecture of a subgraph (also denoted as a cell, block, or segment) or several subgraphs is sought by NAS. This technique effectively reduces the search space compared with the macro search space. Each subgraph consists of several layers whose hyperparameters and connections have to be determined by NAS. The resulting subgraphs are then reused in the target CNN. For example, NASNet [82] proposes two types of cells: a normal cell, which is used to extract advanced features, and a reduction cell, whose task is to reduce the spatial resolution. Fig. 7c shows an example of a cell's encoding using a string of integers.
In the case of the hierarchical search space [83]- [86], a small set of primitives, including elementary operations like convolution, pooling, and identity is specified. Small subgraphs (the so-called motifs) that consist of these elementary operations are then recursively used to establish the entire network. MnasNet [12] organizes target CNN into multiple sequentially connected segments, each having its separate repeating structure.
A CNN can also be encoded indirectly, using a procedure (a generator) which generates it in a number of construction steps, e.g., [20], [88]. While indirect encoding can significantly reduce the search space size and produce complex networks, it has not become popular in current NAS methods. The reason is that it is tricky to devise a suitable unbiased generator for CNNs.
It is an open research problem of search space engineering to devise unbiased search spaces that can effectively be explored by search algorithms and, at the same time, enable the discovery of novel and competitive CNNs.
Fig. 7c (caption): Example of a cell's encoding according to [87] in which each node is defined by a five-tuple (Node ID; Operation; Parameter; Source ID 1; Source ID 2). The Operation is either (1) convolution, (2) max. pooling, (3) average pooling, (4) identity, (5) add, (6) concatenation, or (7) terminal node.
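The five-tuple encoding described in the caption can be illustrated by a minimal Python sketch. The operation codes follow the caption; the function name `decode_cell`, the example genome, the meaning of the Parameter field (kernel size), and the convention that source ID 0 denotes the cell input are assumptions made for illustration only and are not taken from [87].

```python
# Operation codes as listed in the figure caption.
OPS = {1: "conv", 2: "max_pool", 3: "avg_pool", 4: "identity",
       5: "add", 6: "concat", 7: "terminal"}

def decode_cell(genome):
    """Split a flat list of integers into five-tuples and name the operations.

    Assumption (not from the source): source ID 0 means "cell input / unused".
    """
    if len(genome) % 5 != 0:
        raise ValueError("genome length must be a multiple of 5")
    nodes = []
    for k in range(0, len(genome), 5):
        node_id, op, param, src1, src2 = genome[k:k + 5]
        nodes.append({
            "id": node_id,
            "op": OPS[op],
            "param": param,
            "sources": [s for s in (src1, src2) if s != 0],
        })
    return nodes

# Hypothetical cell: two convolutions on the cell input, summed, then terminated.
example = [1, 1, 3, 0, 0,   # node 1: conv, kernel 3, reads the cell input
           2, 1, 5, 0, 0,   # node 2: conv, kernel 5, reads the cell input
           3, 5, 0, 1, 2,   # node 3: add(node 1, node 2)
           4, 7, 0, 3, 0]   # node 4: terminal, fed by node 3
cell = decode_cell(example)
```

Because the genome is a flat string of integers, standard mutation operators can modify a single field of a single node without touching the rest of the cell.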

B. SEARCH ALGORITHM
Reinforcement learning (RL) and evolutionary algorithms (EA) dominate NAS methods. We will survey their principles; other relevant algorithms will be briefly mentioned.
A special section is devoted to the differentiable NAS.

1) RL-based methods
In the pioneering work, Zoph and Le [3] used a recurrent network, the so-called controller, to sequentially generate vectors representing (hyperparameters of) candidate CNNs. The controller produces hyperparameters such as the filter height and width, the stride height and width, and the number of filters for one layer, and then repeats the process for the next layer. Every prediction is carried out by a softmax classifier and then fed into the next time step of the RNN as input. After the entire CNN description has been generated, the candidate CNN is assembled and trained, and its validation accuracy serves as the reward signal to update the controller's parameters (the RNN's weights). Reinforcement learning thus tries to maximize the reward (validation accuracy) obtained for the actions performed by the controller (the decisions that define a candidate CNN). The controller's parameters are iteratively updated by a policy gradient method such as REINFORCE [89]. Fig. 8 shows how a candidate CNN (including its architecture) can be generated. This approach or its various extensions were used in many NAS methods; see 'RL' in the following tables.
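The policy-gradient loop can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the method of [3]: the controller is reduced to one independent softmax per decision (no RNN), `toy_reward` is a hypothetical stand-in for the validation accuracy of a trained CNN, and all names, choices, and hyperparameters are assumptions.

```python
import math
import random

# Hypothetical decision space of the controller.
CHOICES = {"filter_size": [1, 3, 5, 7], "num_filters": [16, 32, 64]}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_reward(arch):
    # Stand-in for "train the candidate CNN and measure validation accuracy".
    return 0.5 * (arch["filter_size"] == 3) + 0.5 * (arch["num_filters"] == 64)

def reinforce(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = {k: [0.0] * len(v) for k, v in CHOICES.items()}
    baseline = 0.0  # moving average of rewards, reduces gradient variance
    for _ in range(steps):
        # Sample one action per decision from the current policy.
        picks = {}
        for k in CHOICES:
            probs = softmax(logits[k])
            picks[k] = rng.choices(range(len(probs)), weights=probs)[0]
        arch = {k: CHOICES[k][i] for k, i in picks.items()}
        reward = toy_reward(arch)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        # REINFORCE update: grad of log pi(action) w.r.t. logits is onehot - probs.
        for k, i in picks.items():
            probs = softmax(logits[k])
            for j in range(len(probs)):
                grad = (1.0 if j == i else 0.0) - probs[j]
                logits[k][j] += lr * advantage * grad
    # Return the most probable choice for every decision.
    return {k: CHOICES[k][max(range(len(v)), key=v.__getitem__)]
            for k, v in logits.items()}
```

In a real NAS run, each call of the reward function corresponds to training one candidate CNN, which is why the basic RL-based approach is computationally expensive.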

2) Evolutionary Search
Evolutionary algorithms have been used for neural network design and optimization since the 1980s. Surveys [90], [91] provide an overview of the early methods developed not only for the architecture design but also for the optimization of the weights by EAs as a complement to gradient methods. The most influential method developed in the pre-DNN era is the neuroevolution of augmenting topologies (NEAT) [92], which was quite competitive on small networks. NEAT was extended to CoDeepNEAT to produce DNNs through a coevolutionary approach, with good results on CIFAR-10 and the Omniglot multitask learning domain problem [84]. A survey of recent neuroevolutionary methods was published in Nature [20].
The first study dealing with the evolution of undoubtedly complex CNNs was presented by Real et al. [93], who evolved a competitive solution to the CIFAR-10 problem. In their method, a candidate CNN is encoded as a graph whose nodes are rank-3 tensors or activations and whose edges are convolutions or identity connections. The initial population consists of 1000 single-layer networks with no convolutions. After their training and evaluation, parents are selected using a tournament selection, and offspring networks are then generated by mutation. The mutation operator either adds or removes a layer, alters the hyperparameters of a layer, adds skip connections, or alters training hyperparameters. Whenever possible, learned weights and parameters are inherited by the offspring from their parents, which is a form of weight sharing (weight inheritance). These steps are repeated until a pre-defined number of generations is exhausted. Real's EA works in the macro search space. While the EA is responsible for delivering the architecture of the CNN, candidate CNNs are trained using a standard gradient descent algorithm. Figure 9a summarizes the main steps of the evolutionary NAS method. The initial population is seeded either randomly or with existing models (to reduce the search time). Selection, recombination (mutation and crossover), and replacement are standard steps of a typical EA. As training is the most time-consuming step, various methods were developed to simplify it; see Section V-B.
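The main loop of a mutation-only evolutionary search can be sketched as follows. The sketch is a toy under stated assumptions: a genome is just a list of layer widths, `fitness` is a hypothetical stand-in for training a CNN and measuring its validation accuracy, and selection is simplified to a pairwise tournament in which the loser is replaced by a mutated copy of the winner. None of the names or constants are taken from [93].

```python
import random

WIDTHS = [16, 32, 64, 128]  # hypothetical layer widths

def fitness(genome):
    # Stand-in for "train the CNN and return its validation accuracy":
    # rewards depth up to 6 layers and widths close to 64.
    depth_term = min(len(genome), 6) / 6
    width_term = sum(1 - abs(w - 64) / 64 for w in genome) / len(genome)
    return 0.5 * depth_term + 0.5 * width_term

def mutate(genome, rng):
    child = list(genome)
    move = rng.choice(["add", "remove", "alter"])
    if move == "add":
        child.insert(rng.randrange(len(child) + 1), rng.choice(WIDTHS))
    elif move == "remove" and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    else:  # alter one layer (also used when a removal would empty the genome)
        child[rng.randrange(len(child))] = rng.choice(WIDTHS)
    return child

def evolve(pop_size=20, steps=500, seed=0):
    rng = random.Random(seed)
    # Seed the population with single-layer networks.
    population = [[rng.choice(WIDTHS)] for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(steps):
        i, j = rng.sample(range(pop_size), 2)      # pairwise tournament
        if fitness(population[i]) < fitness(population[j]):
            i, j = j, i                            # i is the winner, j the loser
        population[j] = mutate(population[i], rng) # offspring replaces the loser
    best = max(population, key=fitness)
    return best, fitness(best), initial_best
```

Since the winner of each tournament always survives, the best fitness in the population never decreases in this scheme.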
Candidate CNNs are typically encoded using strings of integers as illustrated in Fig. 7c. A very efficient binary encoding was proposed in GeneticCNN [78]. A candidate network is composed of N stages, and each stage (the yellow box in Fig. 9b) contains up to K nodes. Fig. 9b shows that j − 1 bits are devoted to the encoding of the j-th node (K = 6 in our example). Each of these j − 1 bits determines whether there is (1) or is not (0) a connection between the j-th node and nodes 1, . . . , j − 1. The last bit indicates whether the skip connection is active. Each node can represent one of the pre-defined layers (convolution, pooling, etc.) whose selection is encoded using one integer. The complete CNN encoding then consists of N parts, each of them devoted to one stage. The main advantages of this encoding are its compactness and the possibility of using a binary crossover operator, which proved to generate good offspring. This encoding was later reused in NSGANet methods [6], [94]. Note that genetic operators must be tuned for a particular encoding to produce good offspring. For example, [93], [95] only employ mutation operators; NSGANet methods [6], [94] utilize the aforementioned crossover. AmoebaNet [96] provided the first large-scale comparison of EA and RL methods. Their simple EA searched over the same space as NASNet [82] and led to faster convergence to an accurate network when compared to RL and random search. Similar to the basic RL-based NAS methods, the main limitation of the EA-based NAS methods is their high computational overhead.
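The connectivity part of this binary encoding can be decoded as sketched below. The sketch assumes that the bits of one stage are laid out node by node (j − 1 bits for node j, for j = 2, . . . , K) followed by one skip-connection bit, as described above; the function name `decode_stage` and the example bit string are illustrative assumptions.

```python
def decode_stage(bits, k):
    """Decode one stage with k nodes into (edges, skip_active).

    Layout assumed here: for j = 2..k, j-1 bits tell whether node i (1 <= i < j)
    feeds node j; one final bit switches the skip connection on or off.
    """
    expected = k * (k - 1) // 2 + 1
    if len(bits) != expected:
        raise ValueError(f"expected {expected} bits, got {len(bits)}")
    edges, pos = [], 0
    for j in range(2, k + 1):
        for i in range(1, j):
            if bits[pos] == 1:
                edges.append((i, j))
            pos += 1
    return edges, bits[-1] == 1

# Hypothetical stage with K = 4 nodes: 1 + 2 + 3 connection bits and 1 skip bit.
stage_bits = [1,        # node 2: connected to node 1
              0, 1,     # node 3: connected to node 2 only
              1, 0, 1,  # node 4: connected to nodes 1 and 3
              1]        # skip connection active
edges, skip = decode_stage(stage_bits, k=4)
```

For K = 6, one stage needs 15 connection bits plus the skip bit, which illustrates the compactness of the encoding.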

3) Other Search Strategies
Sequential Model Based Optimization (SMBO) is similar to the mutation-based EA. After generating and evaluating several smaller CNN models, it uses mutation to create more complex models. Their quality is predicted using a surrogate function, typically based on RNN. The surrogate function is updated using data collected from already evaluated networks. Having a large pool of candidate models, a selection strategy is needed to navigate in the search space. In PNAS, SMBO selects top-performing models based on predicted accuracy [97]. Note that in the Monte Carlo Tree Search (MCTS), a random selection is taken to choose which branch to expand for each node in the search tree [98].
Bayesian optimization methods (e.g., [85], [99]- [101]) employ a combination of a probabilistic surrogate model and an acquisition function to obtain suitable candidates. The acquisition function measures the utility by accounting for both the predicted response and the uncertainty in the prediction. The surrogate is constructed using the Gaussian process (GP), random forest, or similar methods. The idea is to limit the evaluation of the objective function by spending more time in choosing the most suitable candidates for the next step.
Training-free approaches utilize results of theoretical analysis of DNNs. For example, TE-NAS [102] sorts candidate architectures according to a score obtained by analyzing the spectrum of the neural tangent kernel and the number of linear regions in the input space, both of which can be computed without training. The authors observed that these characteristics strongly correlate with the network's test accuracy.

C. SUPERNET AND ONE-SHOT METHODS
The authors of ENAS [104] observed that each candidate CNN could be seen as a subnetwork of a larger network. Hence, they constructed a generic network so that it is overparameterized, i.e., it contains all possible CNN realizations. It is also known as a supernet. Its nodes represent local computations, and the edges represent the flow of information. The nodes have their parameters, but they are used only when a particular node is activated. These parameters are shared among all subgraphs that can be sampled from the supernet. This idea is illustrated in Fig. 10. While it is time-consuming to train the supernet, obtaining a trained CNN (i.e., a subnetwork) from the supernet is computationally significantly less expensive, as it only requires calling a simple sampling algorithm. Sampled subnetworks either require no training or are fine-tuned to improve their accuracy further. The methods which train the network just once are also known as one-shot NAS methods. They can be applied to the micro as well as the macro search space and combined with other optimization methods. Various extensions of this idea have been proposed, including hardware-aware NAS methods [7], [11], [86], [105], [106].
SinglePathNAS [107] considers all candidate convolutional operations as subsets of a single "superkernel". Rather than choosing among different paths/operations in the supernet, the NAS problem is solved by finding which subset of kernel weights should be used in each convolutional layer. By sharing the convolutional kernel weights, all candidate NAS operations are encoded into a single "superkernel", i.e., with a single path, for each layer of the one-shot NAS supernet. This approach further reduces the number of trainable parameters and the search time.
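Parameter sharing among sampled subnetworks can be illustrated by a toy sketch. It assumes a chain-structured supernet in which every layer offers the same set of candidate operations; the operation names, the use of short lists as stand-ins for weight tensors, and the function names are hypothetical.

```python
import random

OPS = ["conv3x3", "conv5x5", "identity"]  # hypothetical candidate operations

def build_supernet(num_layers, rng):
    # One parameter list per (layer, operation); stand-ins for real weight tensors.
    return {(layer, op): [rng.random() for _ in range(4)]
            for layer in range(num_layers) for op in OPS}

def sample_subnet(supernet, num_layers, rng):
    """Pick one operation per layer; the subnetwork references the shared weights."""
    choice = [rng.choice(OPS) for _ in range(num_layers)]
    weights = [supernet[(layer, op)] for layer, op in enumerate(choice)]  # no copies
    return choice, weights

rng = random.Random(0)
supernet = build_supernet(num_layers=3, rng=rng)
choice_a, weights_a = sample_subnet(supernet, 3, rng)
choice_b, weights_b = sample_subnet(supernet, 3, rng)
```

Whenever two sampled subnetworks select the same operation in the same layer, they refer to the very same parameter object, so training one of them also updates the other.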

D. CONTINUOUS SEARCH SPACE AND GRADIENT SEARCH
Previously introduced methods operate in discrete search spaces and can be seen as black-box optimizers. They need a huge computational effort to discover an interesting CNN. In order to reduce computational requirements of NAS, DARTS [108] introduces a simple continuous relaxation scheme for the micro search space, which leads to a differentiable learning objective. The architecture and its weights can then be jointly optimized by a gradient method which is less computationally demanding than a black box optimizer.
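The continuous relaxation replaces the hard choice of one operation on an edge by a softmax-weighted mixture of all candidate operations, which makes the output differentiable with respect to the architecture parameters. A minimal scalar sketch follows; the three toy operations, the function names, and the use of plain floats instead of feature maps are assumptions made for illustration.

```python
import math

# Toy candidate operations acting on a scalar "feature map" (hypothetical).
CANDIDATES = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def softmax(gamma):
    m = max(gamma)
    exps = [math.exp(g - m) for g in gamma]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, gamma):
    """Relaxed edge: softmax(gamma)-weighted sum of all candidate operations."""
    weights = softmax(gamma)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATES.values()))

def discretize(gamma):
    """After the search, keep only the operation with the largest parameter."""
    names = list(CANDIDATES)
    return names[max(range(len(gamma)), key=gamma.__getitem__)]
```

With equal parameters, `mixed_op(1.0, [0.0, 0.0, 0.0])` averages the three operations; `discretize` shows the final step in which the relaxed edge is turned back into a single operation.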
A subnetwork (cell) is modeled as a directed acyclic graph consisting of N nodes with two input nodes and one output node. Each node $x^{(i)}$ is a potential feature map and each directed edge $(i, j)$ represents an operation $o^{(i,j)}$ that transforms $x^{(i)}$. The output of the cell is calculated by a suitable reduction operation. Still considering a discrete space, each intermediate node is expressed as:

$$x^{(j)} = \sum_{i<j} o^{(i,j)}\left(x^{(i)}\right).$$

To make the search space continuous, the categorical choice of a particular operation is relaxed using a softmax function:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in O} \frac{\exp\left(\gamma_o^{(i,j)}\right)}{\sum_{o' \in O} \exp\left(\gamma_{o'}^{(i,j)}\right)} \, o(x),$$

where $O = \{o_1, o_2, \ldots, o_k\}$ is a set of candidate operations (e.g., convolution, max pooling, zero) and $\gamma$ denotes the continuous architecture parameters, i.e., the mixing weights $\gamma_o^{(i,j)}$ of the operations on each edge. The architecture parameters and the weights $w$ are then optimized either to solve

$$\min_{\gamma} \; L_{D_{tst}}(\gamma, w(\gamma)), \qquad (9)$$

or form a bilevel optimization problem:

$$\min_{\gamma} \; L_{D_{vld}}(\gamma, w(\gamma)) \quad \text{s.t.} \quad w(\gamma) = \arg\min_{w} L_{D_{trn}}(\gamma, w).$$

Finally, the parameters $\gamma$ have to be discretized to obtain the final network architecture. DARTS reduced the search time to 4 GPU days while delivering an accuracy comparable with other methods at that time. Differentiable NAS has quite often been combined with the supernet approach. However, many practitioners observed an inconsistency between the performance of the parent network and that of the derived network. One of the reasons is that DARTS jointly optimizes network weights and architectural parameters, whereas the subnetwork needs to optimize only a subset of weights for a few selected operations. Various approaches have been proposed to eliminate this behavior [109], [110]. P-DARTS employed the so-called progressive search to gradually increase the depth of the network during the search phase in order to avoid another problem associated with this approach: only shallow architectures are typically derived from the parent network [111].

Table 6 summarizes key properties of single-objective NAS methods. NAS methods are sorted according to the year of publication (of a regular paper even if a pre-print was published earlier) and then alphabetically. Every NAS method is characterized in terms of the Search Algorithm, Search Space, and the utilization of a SuperNet.
Instead of using 'micro' to identify a micro search space, we use terminology taken from particular papers, i.e., 'cell', 'block', or 'stage', to characterize searched subnetworks. For ImageNet (ImgNet) and CIFAR-10 (C-10) data sets, we provide the best top-1 accuracy (and the number of parameters) presented in a particular paper, i.e., independently of the NAS method setup or used resources. When compared with human-created CNNs given in Table 3, NAS methods are quite competitive in terms of accuracy; however, only if the last generation of human-created CNNs (such as Vision Transformers) is not considered. We also list other data sets (denoted according to Table 4) that were employed to evaluate a particular NAS method. No performance indicators are reported for them to keep the table easily readable.

E. SELECTED METHODS
Please note that the objective of this survey is not to perform a detailed quantitative comparison of NAS methods. Despite some efforts towards a correct comparison methodology (see the discussion in Section IX), no comprehensive benchmarking of NAS methods (especially the multi-objective ones) has been reported in the literature.

V. HARDWARE-AWARE NAS METHODS
Hardware-aware NAS methods were introduced to optimize neural networks not only for accuracy but also with respect to the target hardware platform, where the trained network is implemented. It was later shown by a detailed design space exploration [121] that optimal CNN architectures for different devices are not the same.
Typical objectives to be optimized for a given hardware platform are latency, throughput, energy efficiency, and memory usage. Hence, in specific steps of the NAS algorithm, all relevant objectives have to be evaluated, either by direct measurement on real hardware or estimated using software models (see Fig. 6). As the NAS algorithms are usually very time demanding, many techniques have been proposed for their acceleration, particularly for shortening the candidate network evaluation time.
The execution time of NAS is often seen as an additional objective to be optimized (minimized). For example, NAG [79] is a Pareto frontier-aware neural architecture generator that takes an arbitrary budget as input and produces the Pareto optimal architecture for the target budget.
In this section, we first introduce the principles of multi-objective optimization methods (Section V-A). Then, in Section V-B, we briefly survey the techniques developed to reduce the time needed to evaluate candidate designs. In Section V-C, we propose our classification of hardware-aware NAS methods.

A. MULTI-OBJECTIVE OPTIMIZATION
By extending the single-objective NAS formulation from Section IV, the NAS problem can be seen as a multi-objective optimization problem, i.e., an optimization problem that involves multiple objective functions $f_i$, $i = 1, \ldots, m$ (all to be minimized, without loss of generality):

$$\min_{\alpha} \; \left( f_1(\alpha, w(\alpha)), \ldots, f_m(\alpha, w(\alpha)) \right) \quad \text{s.t.} \quad w(\alpha) = \arg\min_{w} L_{D_{trn}}(\alpha, w),$$

where the upper-level variable $\alpha$ defines a candidate neural network architecture, and the lower-level variable $w(\alpha)$ defines the associated weights. One of the objective functions is typically the loss on the test data.
In the multi-objective optimization, there does not usually exist one solution that minimizes all objective functions simultaneously because the design objectives are conflicting.
Hence, rather than one (optimal) solution, the optimization results in a set of non-dominated solutions, i.e., solutions that cannot be improved in any of the objectives without degrading at least one of the other objectives. Formally, assuming that all $f_i$ have to be minimized, a solution $a$ is said to (Pareto) dominate another solution $b$ if $f_i(a) \le f_i(b)$ for all $i \in \{1, 2, \ldots, m\}$ and $f_j(a) < f_j(b)$ for at least one index $j \in \{1, 2, \ldots, m\}$. A solution $a^{+}$ is called a non-dominated solution if there does not exist another solution that dominates it. The set of non-dominated solutions is called the Pareto front. We say that non-dominated solutions are Pareto optimal solutions if all possible candidate solutions are considered during the optimization, and there are no provably better non-dominated solutions in the search space. In practice, we are almost always faced with a situation in which a given method produces suboptimal solutions, i.e., the Pareto front contains the best non-dominated solutions obtained during the experiments conducted with the method. As it is not known "how far" the obtained solutions are from the truly Pareto optimal solutions, a common practice is to introduce a quality metric capable of measuring the distance between two sets of solutions obtained with two multi-objective optimization methods (see, for example, [122]) and compare them under this metric. For example, NSGANetV1 [94] employs the hypervolume performance metric, which calculates the dominated area (hypervolume, in the general case) between the set of solutions and a reference point. The reference point is usually an estimate of the nadir point, i.e., a vector composed of the worst objective values on the Pareto front.
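The dominance relation, the extraction of non-dominated solutions, and the hypervolume metric can be sketched for two minimized objectives as follows. The candidate points (interpreted here as classification error in % and latency in ms) and the reference point are hypothetical values chosen for illustration.

```python
def dominates(a, b):
    """True if a Pareto-dominates b (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

def hypervolume_2d(points, ref):
    """Area dominated by the front of `points` and bounded by the reference point."""
    front = sorted(p for p in non_dominated(points)
                   if p[0] <= ref[0] and p[1] <= ref[1])
    area, prev_y = 0.0, ref[1]
    for x, y in front:          # x ascending, hence y descending on the front
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area

# Hypothetical candidates: (error in %, latency in ms).
candidates = [(5, 30), (7, 20), (6, 35), (9, 10), (8, 25)]
front = non_dominated(candidates)
volume = hypervolume_2d(candidates, ref=(10, 40))
```

A larger hypervolume indicates a front that is closer to the ideal point and better spread, which makes the metric suitable for comparing the outputs of two multi-objective methods.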
A common approach to solve the multi-objective NAS problem adopted by the NAS community is either (i) to transform it into a single-objective one (using suitable constraints, prioritization, or aggregation techniques) and solve it with a common single-objective method or (ii) to employ a truly multi-objective approach (the so-called aposteriori methods) [123].
In the constraint-based method, only one of the objective functions is optimized while the remaining ones are expressed as constraints $h_i(a) \le c_i$. A penalty function $\lambda$ is then introduced to punish any violation of the constraints $c_i$.
For example, in order to constrain the latency ($h_i$), Tan et al. [12] defined the penalty function as

$$\lambda = \left( \frac{h_i(a)}{c_i} \right)^{p},$$

where $p$ is treated as a hyperparameter controlling the desired tradeoff. However, this method does not guarantee that hard constraints are never violated.

TABLE 6. Single-objective NAS methods. The top-1 accuracy (Acc.) and the number of parameters (Param.) are given for the CNN that was created by a particular NAS method and showed the highest accuracy in the corresponding paper. Symbol '-' denotes a non-reported value.

Prioritization means that the most crucial objective is optimized first. When a suitable solution is obtained, the second most crucial objective is optimized while ensuring that the first one is not worsened. This is repeated for all the objective functions according to their priority. The prioritization is also taken into account when multiple objectives are evaluated for a candidate solution $a$. First, the easiest-to-quantify objective is determined. If its value is not satisfactory, the remaining objectives are not evaluated, and $a$ is discarded. Otherwise, the next objective is evaluated. For example, Smithson [103] first evaluated the number of MAC operations, as it is easy to determine, and only if the candidate satisfied a certain limit was its accuracy (whose evaluation requires time-consuming training) assessed.

The aggregation methods introduce a suitable aggregation function (such as the weighted sum, weighted exponential sum, or weighted product) for the objective functions and optimize the composition. In the case of the linear weighted sum, the new objective function is

$$f(a) = \sum_{i=1}^{m} v_i f_i(a),$$

where $v_i$ is the weight of the i-th objective and $\sum_{i=1}^{m} v_i = 1$. This approach suffers from several problems. First, it is not easy to find suitable values of $v_i$. Second, the linear weighted sum only works for problems with convex Pareto fronts, i.e., solutions on non-convex segments are unreachable. Third, similar to the constrained optimization, the method has to be executed several times with different weight settings to approximate the Pareto optimal front.
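The second limitation of the linear weighted sum can be demonstrated on a tiny example. The three candidates below are hypothetical and mutually non-dominated (both objectives are minimized); candidate C lies on a non-convex part of the front, so no weight setting ever selects it.

```python
def weighted_sum(objectives, weights):
    return sum(v * f for v, f in zip(weights, objectives))

# Hypothetical, mutually non-dominated candidates (both objectives minimized).
candidates = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}

def best_for(v1):
    weights = (v1, 1.0 - v1)  # the weights sum to 1
    return min(candidates,
               key=lambda name: weighted_sum(candidates[name], weights))

# Sweep the weight of the first objective from 0 to 1 in steps of 0.01.
winners = {best_for(step / 100) for step in range(101)}
```

The aggregated value of C is always 0.6, whereas the better of A and B never exceeds 0.5; hence `winners` contains only A and B even though C is a legitimate trade-off.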
Truly multi-objective optimization methods iteratively build the Pareto front in the course of optimization by comparing candidate solutions using the non-dominance relation, promoting reasonable solutions, and trying to cover the expected Pareto front. This approach has significantly been developed within the evolutionary computation community. One of the most popular methods is NSGA-II [124]. It is based on sorting individuals in a population according to the dominance relation into multiple fronts. The first front contains all non-dominated solutions. Each subsequent front is constructed by removing all the preceding fronts from the population and finding a new Pareto front in the remaining individuals. The solutions within the individual fronts are then sorted according to the crowding distance metric. This metric helps to preserve the diversity of the population along the fronts. Best individuals then serve as parents for the new population. NSGA-II thus always produces a set of non-dominated solutions (i.e., a Pareto front) when it is terminated.
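The two ranking mechanisms of NSGA-II described above, sorting into fronts and the crowding distance, can be sketched as follows. The sketch uses the straightforward "peel off the current front" procedure rather than the bookkeeping of the original fast non-dominated sorting in [124]; function names and the example points are assumptions.

```python
def dominates(a, b):
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def sort_into_fronts(points):
    """Return lists of indices: front 0 is non-dominated, front 1 is
    non-dominated once front 0 has been removed, and so on."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(points, front):
    """Larger values mean a less crowded region; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(points[0])):
        order = sorted(front, key=lambda i: points[i][m])
        lo, hi = points[order[0]][m], points[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (points[nxt][m] - points[prev][m]) / (hi - lo)
    return dist

# Hypothetical candidates: (error in %, latency in ms).
points = [(5, 30), (7, 20), (6, 35), (9, 10), (8, 25)]
fronts = sort_into_fronts(points)
distances = crowding_distance(points, fronts[0])
```

Parents are then chosen by preferring a lower front index and, within the same front, a larger crowding distance.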

B. SHORTENING THE EVALUATION TIME
This section deals with techniques developed to reduce the evaluation time of candidate neural networks.

1) Accuracy Estimation
The quality of a candidate CNN architecture is typically obtained by training the CNN using a training data set and then measuring the final accuracy on the test set. Common strategies introduced to reduce the training time are decreasing the number of epochs and employing a proxy training data set (e.g., in [105]). The learning curve can also be extrapolated to estimate the performance of training. The extrapolation is based either on the number of iterations (or training time), or the size of the available data set for training. MetaQNN [112] compares the performance of the candidate CNN after the first training epoch with the performance of a random predictor to check if it is helpful to decrease the learning rate and restart training. Large-scale Evolution [93] enables the CNN obtained from a mutation to inherit the parent's weights whenever possible.
Although the authors of ProxylessNAS [11] argue that architectures optimized on proxy tasks are not guaranteed to be optimal on the target task, many successful NAS methods developed after ProxylessNAS used surrogate models. The accuracy is predicted using neural networks, classification trees, regression trees, or Gaussian Process [6], [103], [125]- [127]. In general, surrogate models replace expensive objectives with pre-trained models that provide desired approximation.

2) Latency and Other Hardware Parameters
The first multi-objective hardware-aware NAS methods considered the number of parameters, FLOPS, and MACs as the additional objectives to optimize because they are inexpensive to quantify for each candidate CNN. Later, when CNNs were adopted for hardware accelerators and it became necessary to consider real latency and area overhead, which are expensive to quantify exactly, various proxy measures were proposed to reduce the computational effort. We will deal with latency in the next paragraph (as most papers on NAS do), but other parameters can be estimated similarly. Latency can be estimated using:
• a surrogate model, e.g., [6], [126], [128]- [131];
• a suitable hardware simulator executing a candidate CNN, e.g., [51], [100], [132]- [135];
• a formula or model derived after analyzing the search space of possible CNN architectures, e.g., [14], [47], [49], [101], [136]- [138];
• a LUT-based model [105], [139], [140].
In LUT-based models, latency (and other parameters of interest) is stored in a LUT for each possible operation (e.g., a neuron, a convolution layer) from the space of all CNN architectures. These LUT values are either obtained by measurement on real hardware or estimated according to the datasheets for a given implementation technology. The total latency is then estimated using LUT values for the operations on the longest path from the input to the output of the CNN model.
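The LUT-based estimate can be sketched as a longest-path computation over the network graph. The latency values in the table, the operation names, and the graph representation (a mapping from each node to its predecessors) are hypothetical and serve only to illustrate the principle.

```python
from functools import lru_cache

# Hypothetical per-operation latencies (ms), e.g., measured once on the target device.
LATENCY_LUT_MS = {"conv3x3": 1.8, "conv5x5": 3.1, "max_pool": 0.4, "add": 0.1}

def estimate_latency(nodes, preds, output):
    """nodes: node name -> operation; preds: node name -> list of predecessors.

    Returns the summed LUT latency along the longest input-to-output path.
    """
    @lru_cache(maxsize=None)
    def longest(node):
        upstream = max((longest(p) for p in preds.get(node, [])), default=0.0)
        return LATENCY_LUT_MS[nodes[node]] + upstream
    return longest(output)

# Hypothetical CNN fragment with two parallel branches merged by an addition.
nodes = {"a": "conv3x3", "b": "conv5x5", "c": "max_pool", "d": "add"}
preds = {"c": ["a"], "d": ["b", "c"]}
latency_ms = estimate_latency(nodes, preds, output="d")
```

Here the branch through `b` (3.1 ms + 0.1 ms) is slower than the branch through `a` and `c`, so it determines the estimate. On hardware that executes branches sequentially, the sum over all operations would be the appropriate variant.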

C. NAS FOR PARTICULAR HARDWARE
An obvious approach to optimizing the CNN architecture for given hardware is employing only hardware-friendly hyperparameters and operations (suitable convolution types, arithmetic operator implementations, quantization schemes, or memory access mechanisms). For example, based on benchmarking 32 different operators, Hurricane [131] uses different subsets of operator choices for three types of hardware platforms. This way, the search space is narrowed towards CNN architectures suitable for a given hardware platform. An essential feature of the hardware-aware NAS methods is that they do not directly optimize the configuration of the hardware platform, i.e., there is no additional search space that can be explored to improve the implementation further. Table 7 surveys key properties of hardware-aware NAS methods and classifies them according to several criteria. If a method is not presented under any abbreviation in the source paper, we identify it in the table according to the first author. Some of the criteria are identical to those introduced for the single-objective methods in Table 6 (i.e., Search Space, Search Algorithm, SuperNet). It can be seen that the newest methods frequently employ the concept of supernet, allowing them to reduce their execution time.

1) Proposed Classification
The search algorithms optimize the design objectives that are listed in the Objectives column, together with the accuracy, which is not mentioned as it is always involved. The Estimation Method column indicates whether and how particular hardware parameters are estimated (the methods are abbreviated according to Section V-B). We observe that latency (Lat) and Energy are often estimated rather than measured. According to [11], [67], the number of parameters (weights) or FLOPS is not a good proxy for latency for complex CNNs. However, when smaller CNNs are developed for tiny MCUs, the number of parameters or FLOPS is often used for this purpose. The reason is that CNNs are executed on a processor with limited options for parallel processing and, hence, the correlation between the CNN complexity and execution time is high [99], [141], [142]. If the accuracy (Acc) is estimated, then a NN-based predictor (surrogate) is almost always utilized for this purpose [6], [103], [125], [126]. However, for example, ChamNet [127] utilizes a Gaussian Process-based surrogate.
In order to provide basic information about the performance, we again present only the best top-1 accuracy reported in a particular paper for ImageNet (ImgNet) and CIFAR-10 (CF-10) and list other data sets used for evaluation. Compared to the single-objective NAS methods, it is even more challenging to present a fair quantitative evaluation. Section VIII will provide a comparison for Pixel 1 phone as the target platform. If accuracy is provided for ImageNet (i.e., the ImgNet column is not empty), the NAS method aims at solving complex problems; otherwise, it is focused on less challenging problems such as CIFAR-10, or even MNIST only.

2) Selected Methods
The NAS methods surveyed in Table 7 show a series of innovative approaches which, since the year 2016, have enabled continuous improvement of state-of-the-art results. The improvements are in: (1) providing a better trade-off between the accuracy and latency (or other hardware parameters) on various hardware platforms and (2) reducing the design time and resources needed to achieve innovative solutions. Because of the space limitation, we briefly survey only some of the hardware-aware NAS methods in the following paragraphs.
The first genuinely multi-objective hardware-aware NAS is DSE [103], in which a candidate DNN is optimized using an adapted Metropolis-Hastings (M-H) algorithm. Its accuracy is predicted by an MLP which reads the DNN's hyperparameters. The second objective is a normalized cost evaluating the number of MAC operations and memory accesses carried out during inference. The MLP predictor is very accurate, with a mean error of only 0.35%. A limitation of the method is that DSE's maximum accuracy on CIFAR-10 (86%) is very low compared to current approaches.
DNAS [81] is a framework based on a differentiable NAS capable of selecting the most suitable number of bits for FX operations conducted in each block of the CNN. DNAS creates a supernet whose layers contain several parallel edges representing convolution operators with quantized weights and activations with different precisions. As all layers in one block use the same precision, the search is conducted at the block level. If the precision is 0, the block is skipped, which changes the size of CNNs. It is assumed that the underlying hardware supports this type of quantization.
MnasNet [12] uses a factorized hierarchical search space that provides CNN architectures suitable for hardware accelerators. The layers are grouped into blocks based on their input resolutions and filter sizes. The RL-based search is constrained by maximum latency, which is directly measured on a mobile phone. However, the method requires 40 thousand GPU hours to produce a CNN, which is not competitive today.
In FBNet [105], the NAS algorithm first trains a stochastic supernet using SGD to optimize the architecture distribution. Using the LUT-based approach, a differentiable loss function is created for latency. This innovation allows one to use gradient-based optimization to solve the NAS problem together with optimizing latency. State-of-the-art tradeoffs between the accuracy and latency were reported on ImageNet for several mobile phones. A limitation is that FBNet searches on a "proxy" dataset (i.e., a subset of the ImageNet dataset) and the entire supernet must be maintained in memory during the search.
ProxylessNAS [11] is another differentiable NAS in which an over-parameterized network that contains all candidate paths is trained. To guarantee its fidelity, no proxies such as reduced training data sets or shorter training periods are allowed. Specialized architecture parameters are introduced to learn which paths of the net are redundant. These parameters effectively switch off redundant parts of the network. To handle non-differentiable objectives such as latency during learning, network latency is modeled as a continuous function and optimized as a regularization loss.
NSGANetV2 [6] extends NSGANet [94]. Both methods work in the search space proposed by GeneticCNN [78] but employ a truly multi-objective evolutionary algorithm, NSGA-II. Instead of the gradient-based relaxations used in FBNet and ProxylessNAS, it builds surrogate models to predict the accuracy of candidate CNNs. It also uses a supernet trained with a progressive shrinking algorithm and weight sharing to reduce the training time. On nine data sets, including ImageNet, NSGANetV2 improved the state-of-the-art results.
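Modeling latency as a continuous function can be sketched for a single searchable block: the expected latency is the probability-weighted sum of the latencies of the candidate operations, which is differentiable with respect to the architecture parameters. The function names, the latency values, and the regularization weight `lam` are assumptions made for illustration; for a whole network, the expected latencies of all blocks would be summed.

```python
import math

def softmax(alpha):
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def expected_latency(alpha, op_latency_ms):
    """E[latency] = sum_i p_i * latency_i with p = softmax(alpha)."""
    return sum(p * lat for p, lat in zip(softmax(alpha), op_latency_ms))

def expected_latency_grad(alpha, op_latency_ms):
    """Analytic gradient: dE/dalpha_k = p_k * (latency_k - E[latency])."""
    probs = softmax(alpha)
    mean = sum(p * lat for p, lat in zip(probs, op_latency_ms))
    return [p * (lat - mean) for p, lat in zip(probs, op_latency_ms)]

def total_loss(task_loss, alpha, op_latency_ms, lam=0.1):
    # The latency term acts as a regularizer added to the task loss.
    return task_loss + lam * expected_latency(alpha, op_latency_ms)

# Hypothetical block with two candidate operations taking 2 ms and 4 ms.
e_lat = expected_latency([0.0, 0.0], [2.0, 4.0])
grad = expected_latency_grad([0.0, 0.0], [2.0, 4.0])
```

The negative gradient component of the faster operation shows that a gradient descent step increases its architecture parameter, i.e., the regularizer pushes the search towards low-latency choices.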
In order to effectively develop CNNs for different accelerators with different latency constraints, OFA [7] proposes to train a once-for-all (OFA) network that supports diverse architectural settings. Its training is expensive (1 200 GPU hours on V100 GPUs) but is amortized. As different subnetworks are interfering with each other, the training process of the whole OFA network is inefficient. Hence, instead of directly optimizing the OFA from scratch, it is proposed to first train the largest CNN with maximum depth, width, and kernel size; and then progressively fine-tune the OFA network to support smaller sub-networks that share weights with the larger ones. Specialized sub-networks for diverse hardware platforms (from the cloud to the edge) and various constraints were derived from OFA using a pre-trained predictor in constant time.
APQ performs a joint search for architecture, pruning, and quantization policy using an evolutionary algorithm [126] starting with the MobileNetV2 network. The accuracy is predicted using a quantization-aware predictor implemented as a three-layer feed-forward NN. The input to the predictor is the encoding of the network architecture, the pruning strategy, and the quantization policy. The predictor is first trained without quantization, then transfers its weights to train the quantization-aware predictor, which largely reduces the data collection time. The latency and energy of each layer are pre-computed and stored in LUTs.
HTAS [158] first expands the global search space. The reason is that more efficient architectures can be wider than the original network structure in some layers, and it would be impossible to find them in the limited search space. Then, based on the latency measurements over the channel numbers, the hardware-friendly channel choices are selected to construct the hardware-aware search space. A differentiable NAS with the latency regularizer is then employed to seek the most suitable CNN for the target CPU or GPU, including the optimal selection of its hyperparameters.
MicroNets [141] are devoted to resource-constrained microcontrollers. They exploit a specialized software called TinyML, allowing ML tasks to be implemented on IoT devices. MicroNets are optimized for MCU inference performance using differentiable NAS with constraints on latency and memory (both SRAM and eFlash sizes are considered). The authors observed that the number of operations is a viable proxy for both latency and energy. For inference, MCUs predominantly use 8-bit FX operations. However, MicroNets also support sub-byte quantization to 4 bits. A similar approach, but targeting even smaller microcontrollers, was presented in µNAS [142].

VI. NAS WITH HARDWARE CO-DESIGN
When the hardware-aware NAS is connected with a hardware co-design algorithm, the CNN accelerator can be co-optimized with the CNN architecture. The authors of [13] observed that "the hardware-aware NAS has a much narrower search space than the proposed co-exploration approach. Basically, hardware-aware NAS will prune the architectures with high accuracy but fail to meet hardware specifications on fixed hardware design. However, by opening the hardware design space, it is possible to find a tailor-made hardware design for the pruned architectures to make them meet the hardware specifications. Therefore, compared with the HW-aware NAS, the co-exploration approach enlarges the search space. As a result, it can make better tradeoffs between accuracy and hardware efficiency." Table 8 surveys major NAS methods utilizing hardware co-design. In addition to the columns used in Tables 6 and 7, we added some new columns that characterize these methods. Their meaning is defined in the next paragraphs.
Hardware platforms can be configured in multiple dimensions, including the PE array size, MAC circuit configuration, dataflow organization, tiling strategy, memory subsystem size and organization, and preferences for high-level synthesis software. In addition to the architecture search space and parameter search space (weights), there is an additional search space, called the hardware search space, containing all possible hardware configurations. The Accelerator Co-design: Search Space column shows the major hardware parameters optimized by a given NAS method.
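A hardware search space of this kind can be sketched as a Cartesian product of discrete design choices. The parameter names and value ranges below are hypothetical and much smaller than in any real accelerator template; they only illustrate how quickly the number of configurations grows.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class AcceleratorConfig:
    """One point in a toy hardware search space (all fields are illustrative)."""
    pe_rows: int
    pe_cols: int
    buffer_kib: int
    dataflow: str

# Hypothetical design choices; a real template exposes many more dimensions.
HW_SPACE = {
    "pe_rows": [8, 16, 32],
    "pe_cols": [8, 16, 32],
    "buffer_kib": [64, 128, 256, 512],
    "dataflow": ["weight_stationary", "output_stationary", "row_stationary"],
}

def enumerate_configs(space):
    """Yield every configuration in the Cartesian product of the choices."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield AcceleratorConfig(**dict(zip(keys, values)))

configs = list(enumerate_configs(HW_SPACE))
```

Even this toy space contains 3 x 3 x 4 x 3 = 108 configurations, and each of them must be paired with every candidate network, which is why exhaustive enumeration of the joint space is rarely feasible.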
The space of hardware configurations can be searched together with the space of DNN architectures using the same search algorithm, such as in [14], [48], [49], [129], [130], [139], [170]; see also 'in NAS' in the Accelerator Co-design: Search Alg. column of Table 8. However, another option for optimizing the hardware configurations is to use an independent search algorithm such as dynamic programming (DP) in [47], EA in [8], [51], gradient search in [15], or integer linear programming (ILP) in [73], as seen in the Accelerator Co-design: Search Alg. column of Table 8. The Quant column indicates that the method is also searching for a suitable quantization scheme.
In the Multi-objective: strategy column, the 'co-search' means that there are two independent search algorithms, i.e., a co-search is conducted; one search algorithm operates in the network architecture space and the other in the hardware search space. The Multi-objective strategy is based either on a Pareto front construction method ('Pareto'), aggregation method ('Agg'), or applying some constraints ('Constr').
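The three multi-objective strategies can be contrasted on a toy example. The sketch below assumes that all objectives are minimized and uses hypothetical (error, latency) pairs; the weights and the latency limit are arbitrary illustrative values.

```python
def dominates(a, b):
    """True if point a is no worse than b in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """'Pareto' strategy: keep only the non-dominated points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def aggregate(point, weights):
    """'Agg' strategy: collapse the objectives into one weighted sum."""
    return sum(w * x for w, x in zip(weights, point))

def feasible(point, limits):
    """'Constr' strategy: accept a point only if every objective meets its limit."""
    return all(x <= lim for x, lim in zip(point, limits))

# Hypothetical candidates given as (top-1 error in %, latency in ms).
candidates = [(25.0, 40.0), (24.0, 60.0), (26.0, 45.0), (23.0, 90.0)]

front = pareto_front(candidates)
best_agg = min(candidates, key=lambda p: aggregate(p, (1.0, 0.1)))
best_constr = min(
    (p for p in candidates if feasible(p, (float("inf"), 60.0))),
    key=lambda p: p[0],
)
```

The Pareto strategy returns a set of trade-offs and leaves the final choice to the designer, whereas the aggregation and constraint strategies return a single solution whose identity depends on the chosen weights or limits.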
Compared to previous tables, new objectives are defined in the Objectives column: Energy-Delay Product (EDP), Frames Per Second (FPS), and Energy-Delay-Area product (EDA). The meaning of the Data set column is the same as in Table 7.
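As the names suggest, these composite objectives are simple products or ratios of basic quantities. A minimal sketch follows; the units and the assumption that FPS equals the batch size divided by the inference delay (i.e., no pipelining across frames) are ours.

```python
def edp(energy_j, delay_s):
    """Energy-Delay Product: lower is better."""
    return energy_j * delay_s

def eda(energy_j, delay_s, area_mm2):
    """Energy-Delay-Area product: lower is better."""
    return energy_j * delay_s * area_mm2

def fps(delay_s, batch=1):
    """Frames per second, assuming `batch` frames finish every `delay_s` seconds."""
    return batch / delay_s
```

For example, an accelerator that needs 2 mJ and 10 ms per inference has an EDP of 2e-5 J·s; halving the delay at constant energy halves the EDP.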

A. CO-SEARCH ORCHESTRATION
A straightforward approach to organizing the co-search is to generate a CNN-accelerator pair, which is evaluated by training the network to obtain its accuracy and measuring the hardware parameters (Fig. 12). Based on this evaluation, the next candidate pairs are generated until the desired solution is obtained. However, this general approach leads to a time-consuming search process because the joint space, composed of the coupled yet different network and accelerator spaces, is prohibitively large and its optima are extremely sparse. To reduce the design time, a supernet is often constructed (e.g., in [129], [130]) before starting the co-search and then sampled to quickly obtain a candidate CNN and its accuracy.
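The generate-evaluate loop described above can be sketched as follows. Random sampling stands in for the RL or evolutionary generator, and the accuracy and latency models are hypothetical closed-form placeholders for network training and hardware measurement; none of the names come from a particular method.

```python
import random

DEPTHS = [1, 2, 3, 4, 5]      # hypothetical CNN choices (number of blocks)
PE_COUNTS = [8, 16, 32]       # hypothetical accelerator choices (number of PEs)

def sample_pair(rng):
    """Generate one candidate CNN-accelerator pair."""
    return rng.choice(DEPTHS), rng.choice(PE_COUNTS)

def evaluate(pair):
    """Placeholder for training the CNN and measuring the hardware."""
    depth, pes = pair
    accuracy = 0.60 + 0.05 * depth
    latency_ms = 8.0 * depth / pes
    return accuracy, latency_ms

def score(accuracy, latency_ms):
    """Scalar feedback that trades accuracy against latency."""
    return accuracy - 0.1 * latency_ms

def co_search(budget, rng):
    """Evaluate `budget` sampled pairs and return the best one found."""
    best_pair, best_score = None, float("-inf")
    for _ in range(budget):
        pair = sample_pair(rng)
        s = score(*evaluate(pair))
        if s > best_score:
            best_pair, best_score = pair, s
    return best_pair, best_score

best_pair, best_score = co_search(budget=50, rng=random.Random(0))
```

In a real flow, each call to `evaluate` hides a complete training run and a hardware simulation or synthesis, which is why supernets and performance predictors are introduced to make this loop affordable.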
In another co-search strategy, used, e.g., in [47], [51], the hardware optimization algorithm receives a CNN as the input and optimizes the hardware accelerator with respect to the desired objectives (Fig. 13). If the accelerator optimizer produces a valid hardware configuration (i.e., all constraints are satisfied), the original CNN can undergo full training, and its accuracy, together with the hardware cost, is sent back to the NAS algorithm to generate another candidate CNN. Otherwise, the original CNN is discarded and a new candidate CNN is generated.
Following the idea of differentiable NAS in DARTS [108] and hardware-aware differentiable NAS in FBNet [105] and ProxylessNAS [11], a differentiable network-accelerator co-search framework was proposed in EDD [48], and later in DNA [15]. Let us use DNA to explain the method.
DNA enables co-searching for the CNN architecture together with the accelerator's configuration (e.g., the PE array size, the local and global buffer sizes, dataflow) and the mapping method (e.g., loop tiling strategy and loop size/order). DNA consists of two search algorithms: (1) the Differentiable Accelerator Search (DAS) in a generic accelerator design space, and (2) the Differentiable Network Search (DNS) based on FBNet [105]. In each iteration, the global co-search algorithm samples M networks from the current network distribution NET(α) and obtains the optimal accelerator for each of them using DAS. To continue the search in the CNN architecture space by DNS, the hardware-cost loss is needed. It is obtained as the average hardware cost of each operator on the M optimized accelerators generated in the previous step. The optimization tasks performed by DNA are formalized in [15] in terms of the following quantities: ω, α, and γ are the supernet weights, DNN architecture parameters, and accelerator parameters, respectively; NET(α) and HW(γ) denote the network and the accelerator space parameterized by α and γ, respectively; and L_hw is the hardware-cost loss determined by both the network and its accelerator. The accelerator is characterized by its parameters γ_s (s = 1, ..., S), where each γ_s is a normalized vector for the s-th accelerator parameter whose elements define the probabilities of the corresponding choices of that parameter.
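With the symbols defined above, the co-search can be sketched as a bi-level problem. The form below is an assumption based on the usual formulation of differentiable NAS; L_val and L_train (the validation and training losses) and the trade-off coefficient λ are names introduced here for illustration, and [15] gives the authoritative statement.

```latex
\begin{aligned}
\min_{\alpha}\;& L_{val}\bigl(\omega^{*}, NET(\alpha)\bigr)
      + \lambda\, L_{hw}\bigl(HW(\gamma^{*}), NET(\alpha)\bigr) \\
\text{s.t.}\;& \omega^{*} = \arg\min_{\omega}\, L_{train}\bigl(\omega, NET(\alpha)\bigr), \\
& \gamma^{*} = \arg\min_{\gamma}\, L_{hw}\bigl(HW(\gamma), NET(\alpha)\bigr).
\end{aligned}
```

In this reading, the inner problem over γ corresponds to DAS, while the outer problem over α (with the weights ω trained in the supernet) corresponds to DNS.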

B. SELECTED METHODS
In order to demonstrate how NAS and hardware co-search can operate together, we briefly present the most interesting methods of this category.
Lu et al. [47] introduce a joint exploration of the space of neural architectures, FPGA implementations, and layerwise quantization. The RL controller samples parameters of a candidate CNN architecture and its possible quantization. For the sampled network, the hardware builder searches the hardware space to find a suitable hardware model. Each candidate hardware model is validated against the specification (latency constraint) during the search, and the result is sent back to the controller. If there is a valid FPGA model, the sampled quantized CNN is trained, and its accuracy also serves as feedback to the controller. The hardware search space is determined by tiling parameters and partitioning the layers of CNN into clusters of the tile-based FPGA accelerator that are sought by a dynamic programming method (minimizing the number of LUTs and latency).
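The control flow of this scheme, in which a network is trained only after a valid hardware model has been found, can be sketched as follows. A linear scan over a few hypothetical tiling options stands in for the dynamic programming search of [47], and the latency and accuracy models are illustrative placeholders.

```python
TILE_CHOICES = [1, 2, 4, 8]   # hypothetical hardware options, cheapest first
LATENCY_LIMIT_MS = 20.0       # hypothetical latency specification

def find_hardware(num_layers):
    """Return the cheapest tiling that meets the latency limit, or None."""
    for tiles in TILE_CHOICES:
        if 10.0 * num_layers / tiles <= LATENCY_LIMIT_MS:
            return tiles
    return None

def train_and_eval(num_layers):
    """Placeholder for the expensive training of the sampled quantized CNN."""
    return 0.5 + 0.04 * num_layers

def nested_co_search(candidates):
    """Collect controller feedback as (cnn, hardware, accuracy) triples."""
    feedback = []
    for cnn in candidates:
        hw = find_hardware(cnn)
        if hw is None:
            # No valid design exists: report the failure and skip training.
            feedback.append((cnn, None, None))
            continue
        feedback.append((cnn, hw, train_and_eval(cnn)))
    return feedback

feedback = nested_co_search([2, 8, 20])
```

The benefit of this ordering is that training time is spent only on networks that are known to be implementable under the given specification.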
QNAS [51] focuses on optimizing the parameters of a mixed-precision systolic-array-like architecture (the array size, buffer input/weight/output size) while searching the quantized neural architecture. It includes an EA-based hardware architecture search and a one-shot supernet-based quantized neural architecture search. First, a suite of neural architectures is sampled as a benchmark to find the hardware architecture that achieves the best performance on the benchmark. The hardware architecture is fixed, and the quantized neural architecture search (QNAS) is then performed to determine the neural architecture and quantization policy.
In YOSO [130], each candidate solution in the search space concatenates the DNN architecture and the ASIC accelerator configuration. For experiments with CIFAR-10, there are 40 hyperparameters for CNN and four accelerator parameters that the RL controller generates. Hardware parameters are predicted using the Gaussian Process model to eliminate an ordinary time-consuming simulation.
In AutoDNN [14], each candidate CNN consists of several hardware-aware parameterizable cells called Bundles. By means of these cells, specialized software can map any candidate CNN generated by NAS to an FPGA accelerator based on a fine-grained tile-based pipeline architecture whose components are predesigned and stored in a component library. Latency and resources are estimated and used back in the NAS algorithm. The application is an object detection task targeting a PYNQ-Z1 embedded FPGA. Two design problems are solved simultaneously: the bottom-up CNN model exploration, and the top-down FPGA accelerator generation.
In Codesign-NAS [139], an RL controller selects a CNN architecture from a CNN search space and a hardware architecture from an accelerator design space. Both are sent to the evaluator that implements the CNN on the proposed accelerator to find accuracy and efficiency metrics, such as latency, area, and power (based on pre-computed models). The authors enumerated 4 billion model-accelerator pairs to study the Pareto front in a representative co-design search space. They proposed three different search strategies to navigate the co-design search space under one or two constraints. The CNN search space is based on NASBench [174], and the accelerator design space utilizes CHaiDNN, a library for the acceleration of CNNs on system-on-chip FPGAs.
EDD [48] is the first co-exploration approach utilizing a differentiable problem formulation inspired by DARTS. DNN hyperparameters and parameters of a simplified hardware platform (i.e., the parallel and tiling factors of an FPGA accelerator template) are integrated into one solution space so that a gradient descent algorithm can be applied to find accurate and hardware-friendly CNN implementations. Parallel factors describe parallelism, indicating how many multiplications can be done concurrently. A limitation of this method is a simplified evaluation of hardware parameters.
Another framework based on differentiable neural architecture search, DANCE [170], was developed with Eyeriss as the backbone hardware platform. An RL controller generates parameters describing the CNN architecture (based on ProxylessNAS) as well as the hardware parameters such as the number of PEs, buffer size, and dataflow pattern. At the heart of DANCE is the modeling of the accelerator evaluation software using a neural network (the evaluator) that can be used as a differentiable loss function. DANCE thus introduces a novel differentiable evaluator, which takes the architecture parameters from the RL controller, searches for the optimal hardware accelerator design, and evaluates its cost metrics. The evaluator is a pre-trained neural network frozen during search and used only to connect the architecture to the hardware cost metrics.
HotNAS [49] works in two steps: (1) it uses a Monte Carlo test to select several backbone architectures that meet a latency constraint from a model zoo of pre-trained models (the so-called hot models); (2) an RNN-based reinforcement learning optimizer tunes hyperparameters of the neural architecture and the hardware design simultaneously. After setting the parameters of the FPGA accelerator, latency can be estimated using a simple latency model. HotNAS combines three compression techniques (pattern pruning, channel pruning, and quantization) with the neural architecture search (filter expansion) and hardware optimization.
Liang et al. [136] deal with FPGA accelerators of CNNs in which irregular connections in the sparse convolutional layers have to be handled effectively. To this end, a weight-oriented dataflow is proposed that exploits element-matrix multiplication as the critical operation. The corresponding accelerator features a tile lookup table and a channel multiplexer. To connect this accelerator with NAS, an analytical model is developed to estimate the latency and resources in the FPGA. NAS searches hardware design parameters (parallelization and buffering factors) and possible CNN models under resource constraints. The search space is developed using MobileNet's inverted residual block as the basic building block of the supernet.
Focusing on Google's EdgeTPU, NAHAS [129] performs a joint search in the space of CNN architectures and hardware accelerator configurations, where hardware resources and latency are constraints. The architecture search space is based on a new fused inverted bottleneck layer with tunable parameters. The accelerator search space is defined by seven parameters (e.g., the PE array size, the number of SIMD units, register file capacity). An in-house simulator is used to estimate latency and other hardware-related parameters.
NAAS [8] holistically searches the neural network architecture, accelerator architecture, and unlike other methods (e.g., [13], [73]), compiler mapping. The accelerator search space is defined by the number of processing elements, local memory size, global buffer size, memory bandwidth, and connectivity parameters. NAAS employs EA to optimize these parameters as well as the compiler mapping (the execution order and the tiling size). It introduces a special encoding, called importance-based encoding, for the accelerator space and the compiler mapping strategies to avoid enumerating all possible situations and representing them by indexes. First, NAAS generates a pool of accelerator candidates. For each accelerator candidate, a network architecture is sampled from a pre-trained OFA network [7] that satisfies the pre-defined accuracy requirement. Since each subnet of OFA network is well trained, the accuracy evaluation is fast. Finally, the compiler mapping strategy is sought for the network candidate on the corresponding accelerator candidate. A comparison of CNNs generated by NAAS and QNAS for an ASIC accelerator will be presented in Section VIII.

VII. OTHER APPROACHES IN NAS AND HARDWARE CO-OPTIMIZATION
This section is devoted to NAS working with specialized hardware, which includes multiple accelerators and unconventional accelerators. Relevant NAS methods are given as a part of Table 8.

A. MULTIPLE NETWORKS AND MULTIPLE ACCELERATORS
In this category, three application scenarios are considered: (i) a single CNN is executed on several pipelined accelerators to maximize the performance, (ii) a single CNN is executed on one accelerator, but the CNN is optimized for a group of accelerators to ensure good portability, and (iii) m CNNs are optimized for m tasks executed on k sub-accelerators available on a hybrid accelerator.
(i) Since the timing performance of a single FPGA is constrained by its limited resources, multiple FPGAs are often organized in a pipelined fashion to provide high throughput for image and video processing applications. This problem was addressed by FNAS [172], and later by HWSW-CoExp [13], which supports multiple FPGAs to implement one CNN. The design problem is formulated as follows. Given a dataset, a pool of FPGAs, and a throughput specification, the objective is to find a CNN (parameters of all layers, the partition of the layer set, and the assignment of pipelined stages to the set of FPGAs) such that the accuracy of the resulting network is maximized, the pipelined FPGA system meets the required throughput, and the average utilization of all FPGAs is maximized. The co-exploration algorithm iteratively performs two actions: fast and slow exploration. In the fast exploration, a CNN architecture is predicted for which the design space is explored to generate a pipelined FPGA system meeting the throughput requirement. There are as many RNN controllers as pipeline stages. This level explores the hardware design space without training child networks. In the slow exploration, the child network obtained from the previous step is trained. After that, a reward based on both the yielded accuracy and pipeline efficiency is generated, which is used to update the RNN controller.
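The two hardware-side quantities in this formulation, throughput and average utilization, can be illustrated with a simplified pipeline model. The sketch assumes that consecutive layers are grouped into stages (one stage per FPGA), that the steady-state throughput is bounded by the slowest stage, and that utilization is the fraction of time a stage is busy; the per-layer latencies are hypothetical, and the model in [13] is more detailed.

```python
def pipeline_metrics(layer_latencies_ms, cut_points):
    """Evaluate one partition of consecutive layers into pipeline stages.

    cut_points: indices at which a new stage (FPGA) begins.
    Returns (per-stage latency, throughput in frames/s, average utilization).
    """
    bounds = [0] + list(cut_points) + [len(layer_latencies_ms)]
    stage_ms = [sum(layer_latencies_ms[a:b]) for a, b in zip(bounds, bounds[1:])]
    bottleneck = max(stage_ms)                 # the slowest stage sets the pace
    throughput_fps = 1000.0 / bottleneck
    utilization = sum(stage_ms) / (len(stage_ms) * bottleneck)
    return stage_ms, throughput_fps, utilization

layers = [2, 3, 5, 4, 6]                       # hypothetical per-layer latencies (ms)
unbalanced = pipeline_metrics(layers, [2, 3])  # stages: [5, 5, 10]
balanced = pipeline_metrics(layers, [2, 4])    # stages: [5, 9, 6]
```

Moving one layer from the last stage to the middle stage shortens the bottleneck from 10 ms to 9 ms, which raises both the throughput and the average utilization; this is the kind of trade-off explored at the fast level without training any network.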
(ii) Multi-HW [128] proposes another scenario, multi-hardware models, where a single CNN is optimized for multiple hardware platforms but executed on one of them. The challenge is that hardware platforms differ in many aspects, including their data flows, supported operations, and latencies. A multi-hardware search space is defined as a set of CNN architectures that belong to the intersection of the architectures supported by the individual hardware platforms. The search space is constructed with a modified MobileNetV3 architecture as the template, and the search method is based on TuNAS [80]. To accelerate the evaluation step, latency is obtained from pre-trained linear cost models for the target hardware. The reward function for the RL controller considers the worst and average latency across all the target hardware platforms. It was reported that multi-hardware models could provide state-of-the-art performance across multiple hardware platforms in both average and worst-case scenarios.
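A reward of this kind can be sketched with linear cost models and an absolute-deviation latency penalty. The feature vector, model coefficients, target latency, and penalty weight below are hypothetical, and the exact reward used in [128] may differ.

```python
def predict_latency(features, weights, bias):
    """Linear cost model: latency = bias + sum_i weights[i] * features[i]."""
    return bias + sum(w * f for w, f in zip(weights, features))

def multi_hw_reward(accuracy, features, cost_models, target_ms,
                    beta=-0.1, mode="worst"):
    """Accuracy penalized by the deviation of latency from the target.

    cost_models: platform name -> (weights, bias) of its linear model.
    mode: 'worst' uses the maximum latency over platforms, 'avg' the mean.
    """
    latencies = [predict_latency(features, w, b) for w, b in cost_models.values()]
    lat = max(latencies) if mode == "worst" else sum(latencies) / len(latencies)
    return accuracy + beta * abs(lat / target_ms - 1.0)

# Hypothetical architecture features and per-platform cost models.
features = [100.0, 2.0]
cost_models = {
    "cpu": ([0.1, 1.0], 2.0),    # predicts 14 ms for the features above
    "dsp": ([0.05, 0.5], 1.0),   # predicts 7 ms for the features above
}

r_worst = multi_hw_reward(0.75, features, cost_models, target_ms=14.0, mode="worst")
r_avg = multi_hw_reward(0.75, features, cost_models, target_ms=14.0, mode="avg")
```

Using the worst-case latency steers the search toward architectures that meet the target on every platform, whereas the average tolerates a slow platform if the others compensate.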
(iii) In the context of multitask workflow, which is typical for some ASIC accelerators on edge, the goal is to optimize m CNNs, each of them executing a different task, using k sub-accelerators. These sub-accelerators differ in the dataflow style, and they are connected using NoC. In ASICNAS [73], the RL controller simultaneously generates hyperparameters of CNNs (m segments) together with the parameters of hardware resource allocation for different accelerator templates (k segments). CNNs are specified by CNN type and hyperparameters, while the accelerators are specified by the accelerator type, the number of PEs, and bandwidth. The goal is to map network layers to a pool of available sub-accelerators and determine their execution orders on each sub-accelerator. Mapping and scheduling are solved by ILP combined with a heuristic approach. Circuit parameters are estimated using MAESTRO and serve as feedback to the RL controller.

B. UNCONVENTIONAL HARDWARE
Emerging technologies are investigated to find solutions that can improve the critical parameters of computer systems, in particular, performance, storage capacity, and energy efficiency. Some parts of CNN accelerators were implemented using such technologies. Recently, these accelerators were connected with NAS algorithms to find best-performing CNN-accelerator pairs. A typical feature of these methods is that they employ very specific simulators of the underlying hardware.
For example, PABO [173] uses NAS connected with a memristive crossbar-based CNN accelerator, where the CNN is mapped across the on-chip crossbar storage spatially. Note that the memristive devices have a high storage density, but the write cost is relatively high. PABO was later extended, and its efficiency was demonstrated on more benchmarks in [175]. NAS4RRAM is a NAS method for optimizing CNNs and Resistive Random Access Memory (RRAM)-based accelerators [132]. NACIM [50] jointly explores device, circuit, and architecture design space and also takes device variation into account to find the most robust neural architectures, coupled with the most efficient hardware design for an in-memory computing ASIC. The joint search space involves decision parameters determining the neural architecture, quantization, data flow, circuit design strategy, and low-level device selection (ReRAM, FeFET, STT-MRAM).

VIII. EVALUATION OF NAS METHODS
NAS methods are evaluated according to the quality of produced DNNs and the resources needed to generate them. Note that the NAS methods are multi-objective and have to be compared under all relevant objectives. Hence, fair benchmarking of an extensive collection of NAS methods (particularly the hardware-aware NAS methods) remains an open research problem. The difficulty is that too many aspects have to be considered during the comparison, and their deep cross-analysis is expensive to perform. The most relevant factors that have to be considered are:
• stochastic nature of search algorithms whose performance depends on many parameters and algorithmic settings;
• (multiple) objectives to be optimized;
• constraints handling;
• quantization options in DNNs;
• the computing time devoted to the search, training, and post-optimization;
• available computational resources;
• architecture and parameters of target hardware in which the resulting CNNs have to be deployed;
• the quality of estimation methods used for accuracy, latency, energy, and other objectives;
• hardware synthesis and optimization algorithms;
• re-using of pre-designed networks and other knowledge;
• implementation aspects involving the quality of software and hardware libraries used, compilers, and parallelization strategies.
Using the data available in the literature, we can only compare those NAS methods whose evaluation was conducted under comparable conditions and for the same target hardware. Fig. 14 compares CNNs obtained using NAS methods that consider the so-called mobile setting (up to 600 million MACs per CNN inference). The criteria are the top-1 accuracy on ImageNet, the number of MACs, latency on the Pixel 1 phone, and the total design time of the NAS method.
As these data are available only for a few methods (OFA [7], ProxylessNAS [11], MNASNet [12], MobileNetV2 [32], MobileNetV3 [33], AutoSlim [176], and NetAdaptV2 [166]), we also included NSGANetV2 [6], DARTS [108], SPOS [163], FBNet [105], and NASNet [82], for which the latency on Pixel 1 is not reported but the design time is measured using the same methodology. Note that NASNet and DARTS were not optimized for latency. The total design time (in GPU hours) covers the search, training, and supernet training (for one deployment). It ranges from 150 (MobileNetV2) to 48 000 (NASNet) GPU hours. For other deployments (e.g., for another hardware), the design cost can partly be amortized. With a latency of 51 ms, the CNN created by NetAdaptV2 is the fastest network on the Pixel 1 phone; its top-1 accuracy is high (77%), and the design time is less than 400 GPU hours. Under the mobile setting, NAS methods such as OFA and NSGANetV2 provide better trade-offs than human-crafted CNNs reported in Table 3. Fig. 14 also illustrates that if more resources are available, these methods can provide CNNs showing higher accuracy.
Another example deals with NAS utilizing hardware co-design. Fig. 15 shows the impact of using various NAS approaches in optimizing the accuracy and EDP of ImageNet classifiers based on ResNet-50 and implemented with resources similar to Eyeriss. The original implementation (black point) of ResNet-50 (no NAS employed) is improved by QNAS [51] (green point), which searches the network architecture and the accelerator sizing. Additional improvement is provided by NAAS performing the hardware and compiler mapping co-search (orange point). The best tradeoffs are reported for NAAS utilizing the hardware, compiler mapping, and CNN architecture co-search (blue points). These results (adopted from [8]) demonstrate that exploiting more design spaces can lead to better CNN implementations.
Fig. 15. Normalized EDP and top-1 accuracy (on ImageNet) for CNNs running on an ASIC that were obtained by NAS methods [8]: NAAS co-optimizing HW, compiler mapping and NN architecture (blue); NAAS co-optimizing HW and compiler mapping (orange); QNAS (green); No NAS conducted (black).
In order to introduce a setup enabling designers to compare NAS methods and obtain reproducible results, almost half a million fully-trained CNNs sampled from a well-defined compact search space were included in public data sets such as NAS-Bench-101 [174] and NAS-Bench-201 [177]. They can be used to compare new search algorithms intended for NAS methods that utilize the same search space. Instead of training each candidate CNN, the accuracy can be obtained by querying the pre-computed data set.
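The idea of replacing training by a table lookup can be sketched as follows. The dictionary below is a hypothetical stand-in for a tabular benchmark; it is not the API of NAS-Bench-101 or NAS-Bench-201, and the encoding and accuracy values are made up for illustration.

```python
import random

# Hypothetical benchmark: architecture encoding -> pre-computed test accuracy.
BENCH = {
    (a, b, c): round(0.80 + 0.03 * a + 0.02 * b - 0.01 * c, 4)
    for a in range(3) for b in range(3) for c in range(3)
}

def query(arch):
    """Return the stored accuracy instead of training the network."""
    return BENCH[arch]

def random_search(budget, rng):
    """Baseline search algorithm evaluated purely by table lookups."""
    archs = list(BENCH)
    sampled = [rng.choice(archs) for _ in range(budget)]
    best = max(sampled, key=query)
    return best, query(best)

best_arch, best_acc = random_search(budget=10, rng=random.Random(0))
```

Since every query costs only a lookup, a search algorithm can be run many times with different random seeds, which makes statistically sound comparisons between algorithms affordable.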
Considering the hardware-aware NAS methods, HW-NAS-Bench was introduced in 2021 [121]. In addition to the accuracy, it contains estimated hardware performance (e.g., energy cost and latency) of all the networks in the search spaces of both NAS-Bench-201 and FBNet, for six hardware platforms, including commercial edge devices, FPGAs, and ASICs. For example, the HW-NAS-Bench provides the test accuracy vs. hardware cost of all the architectures in NAS-Bench-201 considering the ImageNet16-120 data set. It thus enabled the identification of the optimal CNN architecture-hardware pairs under a clearly defined setup. We expect other detailed studies to appear shortly, dealing with large-scale comparisons of multi-objective NAS methods. They should reveal the most suitable search algorithms and provide hints for building even better NAS methods. They should also focus on analyzing the relation between the quality of resulting CNNs/accelerators and the design time (or CO2 emissions) needed for obtaining them [76].

IX. CONCLUDING REMARKS
We surveyed the key elements of recent NAS methods that – to various extents – consider hardware implementation of the resulting CNN. We classified these NAS methods into three major classes: single-objective NAS (no hardware is considered), hardware-aware NAS, and NAS with hardware co-optimization. Within each class, we further categorized each method according to several criteria, as shown in Tables 6, 7, and 8. We also provided additional details about selected methods that are essential in each class.
We showed that NAS methods improve design productivity and enable the designer to automatically obtain competitive CNNs for various hardware platforms and data sets. The original NAS approach [3] was significantly accelerated by using pre-trained supernets, adopting surrogate models, and incorporating the differentiable architecture search. Introducing the hardware search space has led to more efficient implementations of CNNs on particular hardware platforms. However, several search algorithms working in the space of weights, neural architectures, and hardware configurations have to be coordinated, making the entire method complicated. We proposed the first classification of NAS methods utilizing the hardware co-design.
As (hardware-aware) NAS methods are multi-objective, their fair assessment consists of evaluating multiple parameters of resulting implementations of NNs and the design cost (time). It thus leads to an expensive construction and comparison of multidimensional Pareto fronts (one Pareto front for one NAS method), which is often hard to perform because of incomplete information about some NAS methods. To support a fair benchmarking methodology and accelerate the development of new NAS methods, the open-source data sets containing many pre-trained and evaluated CNNs from a well-defined search space were introduced in the literature.
We can conclude that NAS methods have been significantly improved (in terms of search performance and the quality of resulting DNNs) since their first utilization by the ML community in 2016. Because hardware acceleration of DNNs is crucial for many application domains and DNNs are frequently applied in entirely new contexts, the importance of fully automated hardware-aware NAS as well as the NAS utilizing hardware co-design will grow in the following years. A barrier potentially slowing down their further expansion is the ever-increasing pressure to reduce the enormous energy requirements (and CO2 emissions) of ML methods [76].
We pointed out in Section VII-B that DNNs are now implemented using emerging technologies. In the future, they can be deployed on more exotic platforms (such as nanoparticle networks configured using evolutionary algorithms [178]) that could provide richer and deeper interaction of machine learning and configurable physical materio and lead to more compact and energy-efficient solutions.