Multi-Branch Neural Architecture Search for Lightweight Image Super-resolution

Deep convolutional neural networks (CNNs) are widely used to improve the performance of image restoration tasks, including single-image super-resolution (SISR). Researchers typically hand-design deeper and more complex CNNs to push performance further. As an alternative to hand-crafted design, neural architecture search (NAS) methods have been developed to find an optimal architecture for a given task automatically. For example, NAS-based SR methods find optimized network connections and operations by reinforcement learning (RL) or evolutionary algorithms (EA). These methods find a network automatically, but most of them need a very long search time. In this paper, we propose a new search method for SISR that significantly reduces the overall design time by applying a weight-sharing scheme. We also employ a multi-branch structure to enlarge the search space for capturing multi-scale features, resulting in better reconstruction of textured regions. Experiments show that the proposed method finds an SISR network about twenty times faster than the existing methods, while showing comparable performance in terms of PSNR vs. parameters. A comparison of visual quality validates that the obtained SISR network reconstructs textured areas better than previous methods because of the enlarged search space for multi-scale features.

DLSR [32]), object detection [33], and other dense predictions such as segmentation, pose estimation and 3D detection by encoding multi-scale image contexts in the search space [34]. For the SISR, FALSR and MoreMNAS used a reinforced evolution algorithm and solved the image SR task as a multi-objective problem. However, the reinforced evolution method took a tremendous amount of time to derive an optimal network. Additionally, FALSR did not use a complete training scheme, but measured the performance of the network only approximately. DLSR extended the differentiable NAS [25] and its improved version MiLeNAS [35] to the SISR and achieved state-of-the-art performance while requiring about ten times less design time than FALSR, which is based on the reinforced evolution method.
In this paper, we adopt the weight-sharing scheme of ENAS [21] as our baseline search algorithm because it is known to provide faster design time than its predecessors. As in the original ENAS for classification problems, we configure a controller and a child network in the search process. The controller generates a sequence, and a child network is constructed from that sequence. The REINFORCE algorithm is used to train the controller network to generate better child networks. For the SISR task, the reward signal in REINFORCE is the PSNR between the generated child network's output and the ground truth. We share the parameters of the child networks during the search phase. In addition, we propose a complexity-based penalty that reduces the reward of a network that needs a large number of parameters. With the complexity-based penalty, the controller tends to recommend powerful but lightweight networks.
Image super-resolution is a regression task that generally requires a more precise and complex network than a classification task. For this reason, we search for a new SR architecture on a multi-branch search space as stated above. To be specific, we develop a Multi-Branch Neural Architecture Search (MBNAS) algorithm, which tries to find optimal connections of multi-scale features. The MBNAS search space consists of partially shared nodes (PSNs) for the multi-scale block, a local feature fusion layer, and a global feature fusion layer. The PSNs share their parameters across network branches to transmit information efficiently with fewer parameters. For simplicity, we use only 3 × 3 convolution and 3 × 3 dilated convolutions [36] as basic building blocks, and let the search algorithm find optimal connections. Even with these simple building blocks, the search yields an efficient architecture, which is validated by extensive experiments. The experimental results show that our network obtained by the MBNAS, named MBNASNet, performs comparably to human-crafted networks and the existing NAS-based SR networks [28]-[30].
Our main contributions are summarized as follows:
1) New NAS-based SR: We propose a new NAS-based SR network design method, named MBNAS, which searches for higher-performance networks by combining multi-scale information efficiently. The resulting SR network is the MBNASNet.
2) Complexity-based penalty: We propose a complexity-based penalty and add it to the reward signal of the REINFORCE algorithm. This enables us to search for an efficient network that combines high performance with a lightweight structure.
3) Multi-scale feature extraction: We construct the network with a multi-branch structure, which has been used in existing lightweight SR network design [8], [14], [17], [18].
4) Partially shared node (PSN): We partially share the parameters of branches to connect each other's information and construct a lightweight structure. The partially shared structure efficiently reduces the searched network's parameter count without performance degradation.
We presented a preliminary work on NAS-based image super-resolution with a single-branch network in [37], called DeCoNASNet. The major difference between this work and our previous version is that we propose an expanded search space for NAS to capture multi-scale information, which brings a significant performance gain with reduced parameters. For this, we modify the algorithm to include the multi-branches in the search space. Also, we provide detailed analysis and explanations of the search process and results, and present more experimental results, including results on higher-rate SR.
The rest of this paper is organized as follows. Section 2 summarizes related works on single-image super-resolution and neural architecture search methods. In Section 3, we explain our proposed search method for SISR. Section 4 includes the details of our implementation settings and dataset configurations, followed by experimental results. We discuss our main contributions and conduct ablation experiments in Section 5. Finally, we provide a summary and concluding remarks in Section 6.

II. RELATED WORK

A. SINGLE IMAGE SUPER-RESOLUTION
A number of methods have been proposed for learning the mapping function from LR images to the appropriate HR counterparts [4]-[14]. Dong et al. proposed SRCNN [4], which is the first deep learning structure for the SISR. It used a three-layer convolutional neural network (CNN) and outperformed non-learning-based conventional methods by a large margin. FSRCNN [5] and ESPCN [6] used specific structures to reduce the computational cost of deep neural networks in the SISR networks. They proposed deconvolution layers and sub-pixel convolution layers, respectively, to upsample LR features to an HR image. VDSR [7] used residual learning and a gradient clipping strategy to increase the depth and thus the performance. Lim et al. [8] introduced residual blocks with extensive features (EDSR) and a multi-scale structure (MDSR) to improve the performance further. MemNet [10], MSRN [14], and DenseSR [9] proposed the memory block, multi-scale residual block, and dense block, respectively, for better SR restoration. SelNet [38] improved the performance by replacing the ReLU operation with the selection unit. Zhang et al. proposed the residual dense block and a dense feature fusion algorithm in RDN [11] to extract abundant information from the input image. RCAN [39] proposed a channel attention scheme that improved the representational ability of the neural network.

B. NEURAL ARCHITECTURE SEARCH
In designing a deep network, we should select a considerable number of network configurations such as connection, operation type, the number of feature channels, depth, etc. Researchers have designed their structures through a large number of trials to achieve competitive performance. However, this is tedious, and it is difficult to find an optimal system for a given task. The NAS algorithms have been proposed to alleviate this burden, especially in image classification research [19]-[26].
As the first study of NAS, Zoph et al. [19] proposed a reinforcement learning (RL) based algorithm. They configured a controller network to generate a child network and trained it by REINFORCE [40], which is a kind of policy gradient algorithm. The performance of the child network was used as a reward signal of the controller network, where the child network was trained from scratch. Therefore, it took a huge amount of time to get a reward signal from the child network. To reduce the time to measure the performance, PNAS by Liu et al. [20] used the sequential model-based optimization (SMBO) with a surrogate model which predicts its performance instantly. On the other hand, Pham et al. [21] proposed ENAS that constructs a weight sharing child network to reduce the reward calculation time. This method configured a large graph and regarded each child network as a sub-graph. The parameters of the child network were shared in the search phase by storing their weights in the main graph.
Evolutionary methods [22]- [24] are another trend of the NAS algorithm. They pick a population of architectures randomly at first and then encode these networks as binary codes. Genetic modifications such as crossover or mutations are applied to the sequence, suggesting a better structure. Lu et al. [23] proposed another method that takes advantage of search history by using a Bayesian optimization algorithm. AmoebaNet [24] applies an aging evolution method to NAS to discard the earliest trained network.
DARTS [25], SGAS [41], NAO [26] and CSA-NAS [27] proposed different approaches from RL and evolutionary methods. Specifically, DARTS applies continuous relaxation to the neural architecture's connections for optimizing the connections and parameters simultaneously. SGAS applies a greedy operation selection method to the DARTS and obtains the best architecture without retraining. NAO projects the encoded sequence to the learnable embedding space of structures and recommends the best architecture as a result. CSA-NAS adopts a binary crow search algorithm to find the optimal architecture. More recently, HR-NAS [34] was proposed to exploit multiscale features by adopting a multi-branch architecture. As a result, they could effectively learn high-resolution representations and showed improved performance in several dense prediction tasks, as well as in image classification.
Regarding the search space design, neural architecture search methods can be categorized into two groups: methods dealing with (1) flat search space or (2) cell-based search space. The methods with flat search space [19], [21]-[23] aim to find the optimal setting for the number of channels (width), number of layers (depth), and types of operations (convolution or max pooling) for the whole structure, while cell-based algorithms [20], [21], [23]-[26] try to find a structure of the cell before stacking cells to form the final architecture. The cell-based search space design is inspired by the split-transform-merge strategy used in the Inception block [42], hence it can approximate the optimal solution for a given task.
Unlike the above algorithms, CSNAS [43], UnNAS [44], and SSNAS [45] discard supervised settings which suffer from the high cost of data labeling. CSNAS and SSNAS adopt a self-supervised setting, and UnNAS applies unsupervised learning to search for promising architectures with unlabeled data. Recently, researchers are also trying to overcome the reproduction challenge and fairly compare search methods by proposing benchmarks for the NAS and providing some important principles for scientific research in the community [46]- [48].
There have been many NAS methods as stated above, among which we choose ENAS as our SR design baseline for its fast design time and for the ease of including the network complexity in the design constraints. Regarding the design time, DARTS [25], FBNet [49], and FBNetV2 [50] also provide fast design time for practical use. However, the complexity constraint is easy to incorporate within the ENAS framework: as ENAS is based on REINFORCE, we modify the reward signal of REINFORCE to consider the network complexity as well as the SR performance.

C. IMAGE SUPER-RESOLUTION WITH NEURAL ARCHITECTURE SEARCH
Some researchers recently adopted NAS methods to design image super-resolution CNNs [28]-[30], [32]. MoreMNAS [28] adopted the multi-objective genetic algorithm NSGA-II [51] for the model generation and proposed a reinforced mutation method. FALSR [29] used a hybrid controller instead of a reinforced controller and proposed an elastic search space for macro and micro search. The search space complexity of both methods is $9.6 \times 10^{15}$. HNAS [30] adopted a hierarchical search algorithm with reinforcement learning to simultaneously find promising cell structures and upsampling layer positions. They also considered the computational cost (FLOPs) to meet resource constraints. More recently, DLSR [32] adopted DARTS for SR network search, which is shown to require less design time than the preceding design methods. They also showed that SISR models could be searched on both the cell level and the network level by their method and reported state-of-the-art models.
Regarding the architecture and the search space thereof, these previous NAS-based methods prepare basic building blocks, which consist of convolutional layers, ReLU, etc., in cascade. Then, they let the NAS algorithm determine the number of layers and connections inside the cells. Meanwhile, we prepare a sophisticated architecture to have an expanded search space, i.e., a structure with more different functional elements to connect. Specifically, we prepare several branches of building blocks, consisting of multi-rate dilated convolutions, ReLU, and attention, and let the NAS algorithm find the connections among the various-scale convolutions. By expanding the search space through the multi-branch of dilated convolutions, we can exploit multi-scale features for better SR reconstruction than a conventional single-branch architecture.

A. OVERVIEW OF THE PROPOSED MBNAS
Our MBNASNet (a child network) is shown in Fig. 1, whose components (MSBs) are designed by a controller in Fig. 2, according to the MBNAS algorithm of Fig. 3. The automated design cycle in Fig. 3 illustrates that the controller is trained to generate a potent network, and the child network is trained to measure its performance, which is used to calculate the reward signal. Fig. 1 shows the overview of MBNASNet, which consists of a shallow feature extraction network (SFENet), an upscaling network (UPNet), and a multi-branch network (MBNet). The MBNet, which consists of several branches, is designed by the NAS. The MSB (multi-scale block) in the figure is the basic building block detailed in Fig. 2. We extract a shallow feature by the SFENet and feed it to each branch. The partially shared parameters in each branch extract multi-scale features with different receptive fields. Results from each branch are combined and upsampled by pixel-shuffle layers [6] to create HR residual information. Finally, the residual information is added to the upsampled LR input to make the final HR result. Fig. 2 shows the details of the MSB and illustrates its internal connections according to sequences from the controller.
We use Long Short-Term Memory (LSTM) [52] to create the controller, whose parameters are updated by the REINFORCE algorithm. While conventional RL methods calculate the reward signal of REINFORCE as the performance on validation sets, we consider both performance and network complexity. For this, we design a complexity-based penalty and add it to the reward signal to find a more efficient architecture. The details of the controller, MBNASNet, and design procedure are explained in the rest of this Section.
Fig. 2: The upper part of (a) shows details of our MSB, and the lower part illustrates that a controller determines the connections inside the MSBs of branch 1 according to the controller sequence (outputs of FC layers), with an example that there are two partially shared nodes (PSNs) (M = 2) and two branches (B = 2). (b) shows the example for branch 2, where the elements inside the MSB are connected differently from the above case according to the corresponding controller. The two branches share the parameters of the light purple box. The dashed arrows and colored arrows mean that these connections are to be searched.
[Figure panel label: Search phase]
VOLUME 4, 2016

1) Controller configuration
We use a two-layer LSTM as our controller, as shown in the lower part of Fig. 2. It generates a sequence for creating a child network at the outputs of the fully connected (FC) layers. The output sequence $S_c$ for a child network $c$ is defined as

$$S_c = \{s_1, s_2, \ldots, s_B\}, \quad s_b = \{(s_b)_{1,1}, \ldots, (s_b)_{1,N}, \ldots, (s_b)_{M,1}, \ldots, (s_b)_{M,N}\} \tag{1}$$

in the case that the child network consists of $B$ branches, $M$ PSNs in one multi-scale block (MSB), and each node has $N$ layers. $S_c$ consists of $B$ sequences, and each $s_b$ denotes the sequence of the $b$-th branch structure. We need $N$ sequence elements to create the $m$-th PSN for one branch. As a result, our controller consists of $M \times N \times B$ LSTM blocks, where each block is followed by an FC layer. The FC layer has $K$ outputs, where $K$ is the number of candidate operations of our network (written $|K|$ below). The example sequence and the constructed block architecture are shown in Fig. 2, which generates eight outputs for a two-branch structure ($B = 2$) with two PSNs ($M = 2$) that have two layers ($N = 2$). In our search space, the total number of possible directed acyclic graphs (DAGs) is $|K|^{B \times M \times N}$. The set of all possible neural architectures is expanded by a factor of $|K|^{M \times N}$ each time a branch is added. The search space is also expanded if we increase the number of PSNs or their layers. Hence, to limit the number of possible architectures to a manageable size, we choose $B = 3$, $M = 2$ and $N = 2$ in our MBNASNet. Because we have three candidate operations ($|K| = 3$), as will be addressed in Sec. III-D, the number of possible architectures is $5.3 \times 10^5$. Finally, to ensure that the number of parameters is less than 2M, we construct our MBNASNet with four multi-scale blocks ($D = 4$).
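The search-space count quoted above can be checked with a few lines of Python. This is a minimal sketch; the function name `search_space_size` and its argument names are illustrative and do not come from the paper.

```python
def search_space_size(num_ops: int, branches: int, psns: int, layers: int) -> int:
    """Count controller sequences: one of num_ops choices at each of B*M*N positions."""
    return num_ops ** (branches * psns * layers)


# Setting used for MBNASNet: |K| = 3, B = 3, M = 2, N = 2.
size = search_space_size(num_ops=3, branches=3, psns=2, layers=2)
# Growth factor contributed by one additional branch: |K|^(M*N).
growth_per_branch = search_space_size(num_ops=3, branches=1, psns=2, layers=2)

print(size)               # 531441, i.e. about 5.3e5
print(growth_per_branch)  # 81
```

Adding a fourth branch would multiply the count by 81, which illustrates why the number of branches is kept small.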

2) Complexity-based penalty
The REINFORCE algorithm uses a reward signal to train the parameters of the controller. While ENAS uses only the task performance as the reward signal, we modify the reward signal to find a more powerful and lightweight architecture, as stated in the overview section. Specifically, we propose a complexity-based penalty to penalize a structure with a large number of parameters, and define the reward signal $R$ as

$$R(c; w) = p(c; w) - \lambda \cdot cb(c) \tag{2}$$

where $p(c; w)$ is the PSNR of model $c$ and $w$ denotes the parameters of a child network. The complexity-based penalty $cb(c)$ is defined as

$$cb(c) = \frac{n_c}{n_{max}} \tag{3}$$

where $n_{max}$ denotes the number of parameters of the model that uses all candidates in the search space, and $n_c$ is the number of parameters of the designed child network. To set a trade-off between the parameters and the performance, we multiply the complexity-based penalty by $\lambda$.
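As an illustration of how such a penalized reward behaves, the following sketch assumes that the penalty is the parameter ratio n_c / n_max and that it is scaled by λ and subtracted from the PSNR. Both the functional form and the function names are assumptions made for this example, not a statement of the authors' implementation.

```python
def complexity_penalty(n_child: int, n_max: int) -> float:
    """Assumed form of cb(c): fraction of the full model's parameters that the child uses."""
    return n_child / n_max


def reward(psnr_db: float, n_child: int, n_max: int, lam: float = 2.0) -> float:
    """Assumed form of R: validation PSNR minus the lambda-weighted complexity penalty."""
    return psnr_db - lam * complexity_penalty(n_child, n_max)


# Two hypothetical child networks with equal PSNR: the smaller one earns the larger reward.
small = reward(psnr_db=37.0, n_child=500_000, n_max=1_000_000)
large = reward(psnr_db=37.0, n_child=1_000_000, n_max=1_000_000)
print(small)  # 36.0
print(large)  # 35.0
```

Under this assumed form with λ = 2, the penalty spans at most 2 dB, so a heavier network is preferred only if it gains more PSNR than it loses through the penalty.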

C. MBNASNET
As shown in Fig. 1, we first extract a shallow feature $F_0$ from an input low-resolution image $I_{LR}$ by the SFENet (3 × 3 convolution layers). The $F_0$ is then fed to the first MSB of each branch. Formally, the $F_0$ is expressed as

$$F_0 = H_3(I_{LR}) \tag{4}$$

where $H_3(\cdot)$ denotes the 3 × 3 convolution operation. The MBNet is constructed to have $B$ branches, where each branch is a cascade of MSBs followed by the concatenation of their outputs and a 1 × 1 convolution to make a feature map. The searched MSBs in each branch have different receptive fields, and thus each branch learns multi-scale characteristics for image super-resolution. We multiply the outputs of each node and block by an independent scalar weight to adjust the gradient magnitude in back-propagation. A similar technique was used in [14]. We call these weights gradient flow control weights and denote them as $\alpha$, as illustrated in the last part of the MBNet block in Fig. 1.
Formally, the output of the $d$-th MSB in the $b$-th branch, $F_{b,d}$, is

$$F_{b,d} = \alpha_{skip} F_{b,d-1} + \alpha_{res} H_1([F_{b,d,1}, \ldots, F_{b,d,M}]) \tag{5}$$

where $F_{b,d,m}$ denotes the output of the $m$-th PSN of the $d$-th multi-scale block (MSB) in the $b$-th branch, $[\cdot]$ denotes channel-wise concatenation, $F_{b,0} = F_0$, and $H_1(\cdot)$ denotes the 1 × 1 convolution operation for the local feature fusion layer. Also, $\alpha_{skip}$ and $\alpha_{res}$ are the gradient flow control weights for the skip connection and the residual feature, respectively. $F_{b,d,m}$ will be detailed in the following subsection, with Fig. 2 and Eq. 9. Then, the output of the MBNet is a weighted sum of all the branch outputs:

$$F_{MB} = \sum_{b=1}^{B} (\alpha_{gff})_b (F_{gff})_b \tag{6}$$

where

$$(F_{gff})_b = H_1([F_{b,1}, \ldots, F_{b,D}]) \tag{7}$$

and $(\alpha_{gff})_b$ is the gradient flow control weight for the global feature fusion layer. Also, $(F_{gff})_b$ is the output of the global feature fusion layer of the $b$-th branch. Finally, we obtain the reconstructed high-resolution image $I_{HR}$ by combining the up-sampled low-resolution image, denoted $I_{LR}^{\uparrow}$, and the residual information that the UPNet produces from $F_{MB}$. Formally, the $I_{HR}$ is computed as

$$I_{HR} = H_{ps}(F_{MB}) + I_{LR}^{\uparrow} \tag{8}$$

where $H_{ps}(\cdot)$ denotes a 3 × 3 convolution followed by a periodic shuffling layer as in ESPCN [6]. We fix the structure of the SFENet and UPNet while searching the connections of the MBNet.
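The periodic shuffling step of the UPNet rearranges channels into spatial positions. The pure-Python sketch below illustrates the rearrangement on nested lists; the channel ordering follows the convention of common sub-pixel convolution implementations and is an assumption here, as is the function name `pixel_shuffle`.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r] (periodic shuffling)."""
    channels = len(x) // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(channels)]
    for c in range(channels):
        for i in range(r):          # vertical offset inside each r x r output cell
            for j in range(r):      # horizontal offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 feature maps become one 2x2 map for upscaling factor r = 2.
features = [[[0]], [[1]], [[2]], [[3]]]
print(pixel_shuffle(features, 2))  # [[[0, 1], [2, 3]]]
```

Because the upscaling happens only in this final rearrangement, all preceding convolutions operate at the low resolution, which keeps the computation light.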

D. MULTI-SCALE BLOCK WITH PARTIALLY SHARED NODES
We apply a cell structure for the MSB, which means that all MSBs in the same branch have the same connection and operation. Each MSB consists of M PSNs as shown in the upper part of Fig. 2 (a) and (b). The dashed arrows and colored arrows in Fig. 2 mean that these connections are to be searched. The candidate operations of the PSN are 1) 3 × 3 convolution, 2) 3 × 3 dilated convolution with rate two, 3) 3 × 3 dilated convolution with rate three.
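To see how the three candidate operations differ in receptive field, the sketch below computes the effective kernel size of a dilated 3 × 3 convolution and the receptive field of a stride-1 stack. The index-to-rate mapping (0, 1, 2 to rates 1, 2, 3) follows the order in which the candidates are listed and is an assumption, as are the function names.

```python
# Assumed mapping from controller output index to dilation rate.
DILATION_OF_OP = {0: 1, 1: 2, 2: 3}


def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution."""
    return kernel + (kernel - 1) * (dilation - 1)


def stacked_receptive_field(op_indices, kernel: int = 3) -> int:
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for op in op_indices:
        field += effective_kernel(kernel, DILATION_OF_OP[op]) - 1
    return field


print([effective_kernel(3, d) for d in (1, 2, 3)])  # [3, 5, 7]
print(stacked_receptive_field([0, 0]))              # 5
print(stacked_receptive_field([1, 2]))              # 11
```

A node built from two plain convolutions sees a 5 × 5 neighborhood, whereas a node built from the rate-2 and rate-3 operations sees 11 × 11 with the same number of weights.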
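Following the signal flow in Fig. 2, $F_{b,d,m}$ is computed as

$$F_{b,d,m} = (H_b)_{PSN,m}(F_{b,d,m-1}) \tag{9}$$

where $(H_b)_{PSN,m}(\cdot)$ denotes the operation of the $m$-th PSN in the $b$-th branch, and $F_{b,d,0} = F_{b,d-1}$ is the input of the MSB. The $(H_b)_{PSN,m}(\cdot)$ can be expressed as

$$(H_b)_{PSN,m}(x) = CA\left(H_{(s_b)_{m,2}}\left(\mathrm{ReLU}\left(H_{(s_b)_{m,1}}(x)\right)\right)\right) \tag{10}$$

where $H_{(s_b)_{m,n}}(\cdot)$ denotes the operation among the $K$ candidates that is chosen by the configuration sequence element $(s_b)_{m,n}$. We construct the PSN with two operations and one ReLU activation as shown in Eq. 10. $CA(\cdot)$ denotes the channel attention layer of RCAN [39].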
To reduce the number of network parameters and spread the information through the branches, the parameters of PSNs have common weights if the configuration sequence of different branches activates an identical position in their sequence. For example, if two branches' configuration sequences are '001' and '011,' the operation corresponding to the first and the third digit share their weights. In Fig. 2, we emphasize the shared positions in the controller sequence (FC outputs) by big bold digits.
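The sharing rule can be made concrete with a small sketch: two branches share the weights at every position where their configuration sequences select the same operation. The helper names below are illustrative, and counting one weight set per (position, operation) pair is an assumption about how the shared parameters are stored.

```python
def shared_positions(seq_a: str, seq_b: str):
    """Positions where two branches pick the same operation and can share weights."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a == b]


def count_weight_sets(sequences, share: bool = True) -> int:
    """Number of distinct operation weight sets needed by all branches."""
    if not share:
        return sum(len(seq) for seq in sequences)
    # One weight set per unique (position, operation) pair across branches.
    return len({(pos, op) for seq in sequences for pos, op in enumerate(seq)})


branches = ["001", "011"]
print(shared_positions(*branches))               # [0, 2]
print(count_weight_sets(branches))               # 4
print(count_weight_sets(branches, share=False))  # 6
```

For the example sequences '001' and '011', the first and third positions are shared, so four weight sets serve both branches instead of six.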

E. MBNAS
Like conventional RL-based NAS methods [19], [21], our algorithm has θ and w, which represent the parameters of the controller and the child network, respectively. In the search phase, θ and w are trained alternately, one epoch each. After the search phase is finished, we sample sequences from the trained controller. Then, the best sequence among the sampled ones is chosen and the corresponding network is trained from scratch.

1) Training the child network
We first train the parameters of a child network to calculate the reward signal of the controller. The problem is formulated as

$$\min_{w} \; \mathbb{E}_{c \sim \pi(c;\theta)}[L(c; w)] \tag{11}$$

where $L(\cdot)$ denotes the loss function for the task, which is the L1 loss in our setting. The controller's policy $\pi(c; \theta)$ is fixed when training the child network. The Adam optimizer [53] is used to optimize $w$. We estimate the gradient of $\mathbb{E}_{c \sim \pi(c;\theta)}[L(c; w)]$ with the Monte Carlo estimate

$$\nabla_w \mathbb{E}_{c \sim \pi(c;\theta)}[L(c; w)] \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_w L(c_i; w) \tag{12}$$

where $c_i$ denotes a child network sampled by the controller's policy. Here $M$ denotes the number of sampled child networks, not the number of PSNs. We choose $M = 1$, which means that we sample just one child network for each mini-batch.

2) Training the controller
In the controller training phase, $w$ is fixed, and $\theta$ is trained by the REINFORCE [40] algorithm. We optimize $\theta$ to maximize the expectation of the reward signal, which can be expressed as

$$J(\theta) = \mathbb{E}_{P(a_{1:T};\theta)}[R] \tag{13}$$

where $a_{1:T}$ is the configuration sequence for the child network $c$. In REINFORCE, the gradient of the expected reward is approximated as

$$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log P(a_t \mid a_{1:(t-1)}; \theta)\,(R - b) \tag{14}$$

where $b$ is the baseline, which is used to reduce the variance. The moving average of the reward signal is used for the baseline in our algorithm. As explained with Eq. 2, we use the PSNR on the validation set and the complexity-based penalty to calculate the reward signal. Adam [53] is used to update $\theta$.
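The policy-gradient step with a moving-average baseline can be sketched for a single controller decision. This is a toy example with a softmax over K = 3 candidate operations, not the LSTM controller itself; the function names and the decay value 0.95 are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, action, reward, baseline):
    """Gradient of log pi(action) * (reward - baseline) with respect to the logits."""
    probs = softmax(logits)
    advantage = reward - baseline
    return [((1.0 if k == action else 0.0) - p) * advantage
            for k, p in enumerate(probs)]


def update_baseline(baseline, reward, decay=0.95):
    """Exponential moving average of past rewards (decay is an assumed value)."""
    return decay * baseline + (1.0 - decay) * reward


# One decision over three candidate operations; the sampled action beat the baseline.
logits = [0.0, 0.0, 0.0]
grad = reinforce_gradient(logits, action=1, reward=37.0, baseline=36.0)
logits = [v + 0.1 * g for v, g in zip(logits, grad)]  # gradient ascent step
baseline = update_baseline(36.0, 37.0)
```

After the update, the logit of the sampled operation rises and the others fall, so operations that earned an above-baseline reward become more likely to be sampled again.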

A. SETTINGS

1) Datasets, degradation methods, and metrics
We choose the DIV2K [54] dataset for training and validation. The DIV2K dataset is widely used as a training set for various image restoration tasks. It contains 1,000 images, consisting of 800 for training, 100 for validation, and the other 100 for testing. The validation images are used as the data for measuring the reward signal of the controller network.
We measure the performance on four different benchmark datasets: Set5 [55], Set14 [56], BSDS100 [57], and Urban100 [58]. To compare the performance with others, we measure the PSNR and SSIM [59] of the test images on the Y channel of the YCbCr color domain. We create the synthetic low-resolution images by applying Matlab's imresize function [60].
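A minimal sketch of the Y-channel PSNR computation is given below. It assumes 8-bit RGB inputs, the ITU-R BT.601 luma conversion that is customary in SR evaluation, and no border cropping; these choices and the function names are assumptions, since the text only states that the metrics are measured on the Y channel.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma for 8-bit RGB values; the result lies in [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr_pixels, hr_pixels, peak=255.0):
    """PSNR in dB between two equally sized lists of (R, G, B) tuples, on the Y channel."""
    squared = [(rgb_to_y(*s) - rgb_to_y(*h)) ** 2
               for s, h in zip(sr_pixels, hr_pixels)]
    mse = sum(squared) / len(squared)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical two-pixel images: the reconstruction is slightly off in one pixel.
hr = [(120, 200, 30), (10, 10, 10)]
sr = [(122, 198, 30), (10, 10, 10)]
score = psnr_y(sr, hr)
```

Evaluating on luma only keeps the comparison focused on structural detail, which is where SR methods differ most.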

2) Implementation details
We construct the controller as a two-layer stacked LSTM network with 64 hidden states. We connect three fully connected layers to the end of each LSTM block to get the configuration sequence for the child network. We use word embedding [61] to make the input of the LSTM layer from the previous LSTM block's output.

3) Hyper-parameter settings
In the search phase, we alternately train the controller and the child network for one epoch each. We initialize both the controller parameter θ and the child network parameter w by using the variance scaled initialization [62] with a scaling value of 0.02. We train the controller and the child network for 500 epochs. For one epoch, we apply 100 iterations for the controller and 1,000 iterations for the child network. The learning rate of the controller is fixed to $3 \times 10^{-4}$. The learning rate of the child network is initialized to $3 \times 10^{-4}$ and halved every 100 epochs. We use 16 low-resolution image patches of size 64 × 64 from DIV2K training images as a mini-batch of the child network. We augment the patches by randomly applying horizontal flip and 90°, 180°, and 270° rotation. The λ in Eq. 2 is set to 2, and p(c; w) is the validation PSNR of the child network. We randomly extract 1,000 low-resolution image patches from DIV2K validation images and compute the PSNR to calculate the reward.
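The step-wise learning-rate decay of the child network can be written as a one-line schedule. The function name and the convention that epochs are counted from zero are assumptions of this sketch.

```python
def child_learning_rate(epoch: int, base_lr: float = 3e-4, halve_every: int = 100) -> float:
    """Step decay: start at base_lr and halve it every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)


# Search phase: 500 epochs, halving every 100 epochs.
search_schedule = [child_learning_rate(e) for e in (0, 100, 200, 300, 400)]

# Retraining of the selected network halves the rate every 200 epochs instead.
retrain_lr = child_learning_rate(200, halve_every=200)
```

Over the 500 search epochs the rate is halved four times, ending at one sixteenth of its initial value.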
In the training phase, we sample 500 configuration sequences from the trained controller network and choose the architecture that has the best performance on the DIV2K validation set as our MBNASNet. We train the selected network for 1,000 epochs and fine-tune the trained network for 1,000 more epochs. The hyper-parameter settings are the same as in the search phase except for the learning rate. The learning rate of the child network is initialized to $3 \times 10^{-4}$ and halved every 200 epochs.

1) MBNAS search result
The proposed MBNASNet has four multi-scale blocks (D = 4) and two PSNs (M = 2) with three branches (B = 3). We sample 500 architectures and choose the best architecture from them. For ×2 scale, the configuration sequence of each branch is found to be $s_1 = \{0, 0, 1, 2\}$,
Table 2: PSNR and SSIM on benchmark datasets (Set5, Set14, B100, and Urban100) for ×2 and ×3 SR tasks. We emphasize the best and the second-best performances with the red and blue colors, respectively. Methods with bold characters are NAS-based methods, and the "Design time" in the last column indicates the time taken for the search process. All four indicated design times are calculated with the same GPU (NVIDIA Tesla V100). Other NAS-based methods do not report results beyond ×2 SR due to huge search times, whereas we could. *In the case of the HNAS, the complexity is an estimated one because they do not explicitly reveal the number of parameters. Also, the + sign at the HNAS denotes that they used self-ensemble, which generally gives higher PSNR than the baseline.
We note that our searched structure has two identical blocks with different channel attention and one block with a larger receptive field to capture multi-scale features efficiently.

On the other hand, the searched configuration sequence for ×3 scale is
The ×3 scale SR task generally needs a larger receptive field than the ×2 task to extract multi-scale features, and our searched ×3 network satisfies this property. It takes about 24 hours to train the controller and the child network on one Tesla V100 GPU in the search phase, which is far less than other NAS-based methods such as MoreMNAS [28] and FALSR [29].
To show the robustness of our search algorithm, we search three times from different random seeds.

2) Image super-resolution results
Bicubic image down-sampling is widely used as the image degradation setting of the super-resolution task. We measure PSNR and SSIM on four public benchmark datasets to compare our method with twelve state-of-the-art methods: SRCNN [4], VDSR [7], LapSRN [13], MemNet [10], MSAN [17], SelNet [38], CARN [12], A²F [63], MoreMNAS [28], FALSR [29], HNAS [30], and DeCoNASNet [37]. Among these, MoreMNAS, FALSR, DeCoNASNet, HNAS, and ours are NAS-based approaches. HNAS uses a large training patch (96 × 96) and applies self-ensemble to get better performance. Table 2 shows the comparison with several state-of-the-art SISR networks, where boldfaced methods are NAS-based ones like ours, and non-bold ones are conventional hand-crafted designs. Since our NAS-based approach is based on an efficient search algorithm, which is about twenty times faster than MoreMNAS and FALSR, we could conduct experiments on ×3 super-resolution tasks while other NAS-based methods did not. As shown in Table 2, MBNASNet performs comparably to hand-crafted state-of-the-art methods and outperforms the NAS-based ones in many situations. Specifically, HNAS shows good performance on the Set5 dataset, but MBNASNet performs better on complex datasets such as Urban100 and B100 because we extract multi-scale features successfully. Compared to a state-of-the-art hand-crafted design, A²F-M [63], MBNASNet shows comparable results in the case of ×2 SR, but slightly worse results for ×3. We believe A²F-M shows higher PSNR because it uses more elements and techniques (such as attentive auxiliary feature blocks and dense block connections) than our automatic design, which has only channel attention and feature fusion at the block output. We believe we can possibly obtain better results by employing more elements in our automated design, i.e., by further expanding the search space. However, this may also induce huge design times, so we leave it as future work. Since different initial conditions may lead to different results, we perform the design four times with different initial hyperparameters. There are only slight differences among all the cases in Table 2, with PSNR variance under $10^{-4}$, validating the robustness of our method against different initial conditions. Hence, we report the best PSNR among the four experiments, following the convention.
Fig. 9: The blue dots are from the CBP, the reds are the Baseline, and the greens are Random settings. The "Relative Complexity" is defined the same as cbp in equation (3), meaning the cbp in the case of NAS design results. In the case of random and baseline, since the "penalty" is not defined, we denote it as "Relative Complexity."
In Fig. 4 and Fig. 7, we display the qualitative result of our method and conventional methods. As shown in the figures, MBNASNet successfully restores the structures of the images. Specifically, our network recovers the gray vertical lines and holes in each image while other methods do not. In summary, we compare the overall ×2 performance of lightweight models graphically in Fig. IV-B1.

V. DISCUSSION
In this section, we discuss the effect of the proposed method's contributions: the complexity-based penalty, the multi-branch structure, and the partially shared parameters.
Figure caption: The PSNR on Set5 for three structures. The red line indicates our MBNASNet structure, the green line is the multi-branch structure with separate parameters, and the blue is the single-branch structure.

A. EFFECT OF THE COMPLEXITY-BASED PENALTY TO THE PERFORMANCE OF CONTROLLER
To evaluate the controller's performance and the effect of the complexity-based penalty in the search phase, we conduct three experiments. The first experiment uses a non-trained controller, which generates a random controller sequence (denoted as Random). The controller trained with the PSNR reward but without the complexity-based penalty is denoted as Baseline, and the one including the complexity-based penalty is denoted as CBP. We choose λ = 2 for the complexity-based penalty.
We sample 100 structures for each controller setting and measure the average and the best performance, as shown in Table 3. Also, their distributions are illustrated in Fig. 9, where blue dots are the results of the CBP with λ = 2, red dots correspond to the Baseline, and the greens to the Random. We can see that the Baseline setting finds better architectures than the Random in terms of PSNR, sometimes with increased complexity. On the other hand, the CBP setting successfully generates lightweight sequences that have comparable PSNR to the Baseline.
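The per-setting statistics described above (average and best performance over the sampled structures) can be sketched as follows. This is only an illustration: the function name `summarize` and the sample values are hypothetical placeholders, not results from Table 3, and each sample is assumed to be a pair of PSNR (dB) and relative complexity.

```python
from statistics import mean


def summarize(samples):
    """Summarize (psnr_db, relative_complexity) pairs from one controller."""
    psnrs = [p for p, _ in samples]
    complexities = [c for _, c in samples]
    return {
        "mean_psnr": mean(psnrs),
        "best_psnr": max(psnrs),
        "mean_complexity": mean(complexities),
    }


# Hypothetical placeholder samples (NOT measurements from the paper).
# In the experiment above, each setting would contain 100 sampled structures.
settings = {
    "Random": [(37.0, 0.5), (37.5, 0.75), (36.5, 0.25)],
    "Baseline": [(37.75, 1.0), (38.0, 0.75), (37.5, 0.5)],
    "CBP": [(37.75, 0.5), (37.5, 0.25), (38.0, 0.75)],
}

for name, samples in settings.items():
    stats = summarize(samples)
    print(name, stats)
```

Reporting both the mean and the best value separates how good a typical sampled structure is from how good the single best structure found by each controller is.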

B. EFFECT OF MULTI-BRANCH STRUCTURE AND PARTIAL PARAMETER SHARING SCHEME
To compare and visualize the effect of the multi-branch structure and the partial parameter sharing (PPS) scheme, we create three networks: single-branch, multi-branch without PPS, and multi-branch with PPS. We set the number of parameters in the three experiments to approximately 1,000K to compare the results fairly.
We measure the PSNR of each structure on the Set5 dataset. Fig. 10 shows the results of three structures for 400 epochs. We can find that the multi-branch structure converges faster than the single branch structure. Furthermore, with the partial parameter sharing scheme, we can successfully overcome the performance degradation phenomenon in the multi-branch structure.

C. EFFECT OF GRADIENT FLOW CONTROL WEIGHTS AND COMPLEXITY-BASED PENALTY COEFFICIENT
Gradient flow control weights allow MBNASNet to overcome the gradient vanishing problem by adjusting the gradient magnitude in the back-propagation process. We train MBNASNet with and without the gradient flow control weights α and compare their performance in Table 4 and Fig. 11. The results show that α helps MBNASNet converge to a better point and achieve better performance.
To compare the effect of the CBP weight λ, we train the controller with different λ values (λ = 0.5, 1, 2, 4) and compare their search results in Table 5. We can see that the mean CBP value tends to decrease (a lighter network is found), and the mean PSNR slightly decreases as λ becomes larger. When λ becomes too large (λ = 4), the controller fails to find a promising network in the search space. The experiments validate that λ efficiently controls the trade-off between the performance and the number of parameters up to λ = 2, and hence we use λ = 2 in the other experiments.
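The trade-off controlled by λ can be illustrated with a small sketch. It assumes a subtractive form of the penalized reward (PSNR minus λ times the cbp value). This form is an assumption made for illustration only; the actual cbp is defined in equation (3), and the function name `penalized_reward` and the candidate values below are hypothetical.

```python
def penalized_reward(psnr_db, cbp, lam):
    """Assumed reward shaping: PSNR reward minus lam times the penalty.

    This subtractive form is an illustrative assumption, not necessarily
    the exact reward used to train the controller.
    """
    return psnr_db - lam * cbp


# Two hypothetical candidates: (PSNR in dB, cbp value). Not from the paper.
candidates = {"heavy": (37.9, 1.0), "light": (37.7, 0.6)}


def preferred(lam):
    """Return the candidate with the highest penalized reward for a given lam."""
    return max(candidates, key=lambda n: penalized_reward(*candidates[n], lam))


for lam in (0.0, 0.5, 2.0):
    print(f"lambda = {lam}: preferred candidate is '{preferred(lam)}'")
```

Under this assumed form, a small λ leaves the ranking driven by PSNR alone, while a larger λ shifts the preference toward the lighter candidate at a small cost in PSNR, which mirrors the tendency reported in Table 5.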

VI. CONCLUSION
We have proposed a new NAS-based SR network, named MBNASNet. We have attempted to improve the performance of NAS-based SR by adopting a multi-branch network that can extract multi-scale features. In other words, we could obtain a better SR model by expanding the search space. We also regularized the reward signal of the REINFORCE algorithm with a complexity-based penalty to favor a lightweight network. In addition, the partial parameter sharing scheme successfully reduces the number of parameters and helps the information transfer between branches. It takes 24 hours to find promising network structures, which is considerably faster than the existing NAS-based design methods. The results show that the proposed method performs comparably to the conventional hand-crafted structures and other NAS-based networks. We will release our code and more result images at https://github.com/Junem360/MBNAS.