NAS-HRIS: Automatic Design and Architecture Search of Neural Network for Semantic Segmentation in Remote Sensing Images

The segmentation of high-resolution (HR) remote sensing images is very important in modern society, especially in the fields of industry, agriculture and urban modelling. With a neural network, a machine can extract surface feature information effectively and accurately. However, traditional deep learning methods require considerable effort to find a robust architecture. In this paper, we introduce a neural network architecture search (NAS) method, called NAS-HRIS, which automatically searches for a neural network architecture on a given dataset. The proposed method embeds a directed acyclic graph (DAG) into the search space and designs a differentiable searching process, which enables it to learn an end-to-end searching rule by gradient descent optimization. It uses the Gumbel-Max trick to draw samples efficiently from a non-continuous probability distribution, which improves the efficiency of searching and reduces memory consumption. Compared with other NAS methods, NAS-HRIS consumes less GPU memory without reducing accuracy, which suits the large data volumes of HR remote sensing imagery. We carried out experiments on the WHUBuilding dataset and achieved 90.44% MIoU. To further demonstrate the feasibility of the method, we built a new urban Beijing Building dataset and conducted experiments on satellite images and non-single source images, achieving better results than the SegNet, U-Net and Deeplab v3+ models, while the computational complexity of our network architecture is much smaller.


Introduction
In recent years, with the progress and popularization of remote sensing technology, satellite imaging and aerial photography have become more and more advanced [1]. We can obtain images that contain large amounts of information. These images have been applied in many fields, such as agriculture [2], forestry, geology, military, environmental protection [3] and urban planning [4]. High-resolution (HR) remote sensing images include high spatial, temporal and spectral resolution. The HR remote sensing image in this paper mainly refers to the high spatial resolution (2 m resolution and better) remote sensing image. The high spatial resolution remote sensing images capture the make the work of ordinary researchers unrealistic. Therefore, how to reduce the cost of the search has been a problem that NAS has had to face since its birth. Researchers have done a lot of work in recent years to reduce its high memory consumption [32,33].
Before doing an architecture search, we need to define the search space. The common search space is chain-structured and is formed by stacking layers of operators. Many deep neural networks contain similar repeated parts, which have gradually been abstracted into cells, so the search space is greatly simplified. A cell is usually designed as a directed acyclic graph (DAG) [32,34-36].
There are three main types of search strategy. The first is based on reinforcement learning: the generation of the architecture is regarded as an agent choosing actions, and the reward is obtained through a performance prediction function on a test set [30,31]. The second is based on the Genetic Algorithm (GA), a derivative-free optimization algorithm that may yield a globally optimal solution but is relatively inefficient [36,37]. The third, the gradient-based method, makes the discrete search space continuous so that the objective function becomes differentiable, which makes it possible to use gradient-based optimization to find the optimal structure. We adopt the cell-based search space in our work and use a gradient descent search strategy to search it [32].
Here, we propose an improved HR remote sensing image segmentation method based on a neural architecture search, named NAS-HRIS. We applied NAS-HRIS to three different types of HR remote sensing dataset to efficiently search for an architecture suited to each of them.
In summary, our contributions are as follows:
1. NAS for HR remote sensing image segmentation is explored for the first time;
2. Our work embeds a DAG into the search space and designs a differentiable searching process, which enables learning an end-to-end searching rule by gradient descent optimisation [38]. We use the Gumbel-Max trick to draw samples efficiently from a non-continuous probability distribution, which improves the efficiency of searching and reduces memory consumption;
3. We provide a new HR remote sensing image segmentation dataset, the Beijing Building Dataset (BBD), which can be useful for image segmentation applications such as building segmentation for urban planning (Figure 1);
4. We conducted searches on a variety of remote sensing images and trained on aerial images, satellite images and Google Earth images, obtaining 98.52% pixel accuracy and 90.44% Mean Intersection over Union (MIoU) with NAS-HRIS on the WHU dataset.
The rest of this paper is structured as follows. In Section 2, we describe the proposed methodology in detail. The datasets, experimental settings and comparison results are presented in Section 3. Finally, we discuss our work and put forward prospects for future work in Section 4. We have released our code at https://github.com/zhangmingwei98/NAS-HRIS.

Methodology
In this article, we used NAS to construct the encoder architecture of the segmentation model (Figure 2). The neural architecture search consists of three parts: search space design, search strategy formulation and evaluation method. We defined a search space composed of several cells, used a gradient descent search strategy to select the weights of each edge of the directed acyclic graph, and used the Gumbel-Max trick for continuous relaxation. A cell is designed as a DAG [32,34-36]. The preceding node n_i becomes the subsequent node n_j after applying operation p, as given in Equation (1). Here, n_i denotes the ith node, P the set of candidate operations, and p the selected operation. In NAS-HRIS, the candidate operation set P has nine operations: (1) identity, (2) 3 × 3 avg pooling, (3) 3 × 3 max pooling, (4) 3 × 3 separable conv, (5) 5 × 5 separable conv, (6) 7 × 7 separable conv, (7) 3 × 3 dilated separable conv, (8) 5 × 5 dilated separable conv, (9) none.

Network Level
We search for two different cells, a normal cell and a reduction cell. They share the same structure, and their feature maps are padded. They differ in stride: all operations in a normal cell use a stride of 1, whereas all operations in a reduction cell use a stride of 2. The purpose of the reduction cell is to reduce the feature map resolution.
In NAS-HRIS, a cell is treated as the basic block, and cells are stacked according to fixed rules to form the neural network. We also use a DAG to structure the network topology. The two input nodes of Cell_k are the output nodes of the preceding cells Cell_{k-1} and Cell_{k-2}, respectively. 1 × 1 convolutions are inserted where necessary. In the network, reduction cells are placed at 1/3 and 2/3 of the total network depth. We denote the architecture variable by α and the network weights by ω. α is composed of α_normal and α_reduction, which are shared by all the normal cells and all the reduction cells, respectively. In our work, we search for the values of α_normal and α_reduction. In the search procedure, NAS-HRIS selects the optimal operations from the candidate operations according to their weight values. In the training procedure, we update the weights of the selected operations by gradient descent.
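The stacking rule above can be illustrated with a minimal sketch. It assumes that the 1/3 and 2/3 positions are obtained by integer division on a 0-indexed depth (a common convention in cell-based NAS code, not confirmed by this paper), and the function names and the 512-pixel input size are hypothetical.

```python
def cell_layout(depth):
    """Return the cell type ("normal" or "reduction") for each stacked cell."""
    # Assumption: reduction cells sit at the 0-indexed positions depth//3 and 2*depth//3.
    reduction_at = {depth // 3, (2 * depth) // 3}
    return ["reduction" if i in reduction_at else "normal" for i in range(depth)]


def feature_map_sizes(input_size, depth):
    """Spatial size after each cell: a stride-2 reduction cell halves it."""
    sizes, size = [], input_size
    for kind in cell_layout(depth):
        if kind == "reduction":
            size //= 2
        sizes.append(size)
    return sizes


# Example with an encoder depth of eight cells and a hypothetical 512-pixel input.
layout = cell_layout(8)
sizes = feature_map_sizes(512, 8)
print(layout)
print(sizes)
```

Under these assumptions, an eight-cell encoder contains two reduction cells, so the feature map leaving the encoder has one quarter of the input side length.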

Continuous Relaxation and Search Strategy
As shown in Figure 3, before NAS-HRIS searches the architecture, the operation on each edge of the DAG is unknown (a). We place a set of candidate operations on each edge to continuously relax the search space (b). The operation on each edge is finalized by applying the reparameterization trick to sampling (c,d).
Our goal is to obtain the optimal architecture α* and its weights ω* over all candidate operations. We introduce the loss function L to achieve this goal; L_train and L_valid are the training loss and the validation loss, respectively. We regard this as a bi-level optimization problem: we find the α* that minimizes L_valid(ω*, α*), given the optimal weights ω*(α) for that architecture, as shown in (2) and (3).
$$\min_{\alpha} \; L_{valid}(\omega^{*}(\alpha), \alpha) \tag{2}$$
$$\text{s.t.} \quad \omega^{*}(\alpha) = \arg\min_{\omega} \; L_{train}(\omega, \alpha) \tag{3}$$
An architecture α consists of many repeating cells. λ^p_{i,j} is the p-th element of a |P|-dimensional learnable vector α_{i,j}. We adopt the softmax function to obtain the normalized probability f^p_{i,j} of sampling operation p between N_i and N_j. The selection of an operation is relaxed in this way, as shown in (4).
$$f^{p}_{i,j} = \frac{\exp(\lambda^{p}_{i,j})}{\sum_{p' \in P} \exp(\lambda^{p'}_{i,j})} \tag{4}$$
In order to back-propagate gradients through λ_{i,j}, we use the Gumbel-Max trick [39,40] to re-formulate Equation (1), which makes it possible to sample from a discrete probability distribution efficiently, as can be seen in (5) and (6). This method was first used for NAS in GDAS [41]. DARTS needs to keep the intermediate results of all candidate operations in memory, whereas the Gumbel-Max trick selects only one operation at a time. Therefore, with |P| candidate operations, the computing resource consumption is about 1/|P| of that of DARTS. Because the search efficiency of DARTS is mainly limited by memory resources, NAS-HRIS searches faster in an environment with the same memory. In these equations, ς^p is Gumbel-distributed noise, drawn as independent and identically distributed samples from Gumbel(0, 1), as in (7). The resulting vector ϕ_{i,j} is a one-hot vector; multiplying it by the vector of possible values of x yields the sampled x. ω^p_{i,j} is the weight of operation p ∈ P between N_i and N_j.
We apply softmax to relax the argmax in Equation (6), so that Equation (5) becomes differentiable: during back-propagation, ϕ^p_{i,j} is replaced with its approximation ϕ̃^p_{i,j}, where τ is the softmax temperature.
NAS-HRIS uses gradient descent to optimize L_valid, analogous to RL-based or evolutionary architecture search, where validation set performance is treated as the reward or fitness. See Algorithm 1 for the detailed search process, which uses gradient descent to update α and ω, where ξ is the learning rate.
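The sampling step described above can be sketched in plain Python. This is an illustration of the general Gumbel-Max trick and its softmax relaxation, not the authors' implementation: it assumes that the Gumbel noise is added to the log of the normalized probabilities f^p_{i,j}, and the names `sample_edge`, `logits` and `tau` are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform of a uniform draw."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def sample_edge(logits, tau, rng):
    """Return (one_hot, relaxed) for one edge with |P| candidate operations."""
    probs = softmax(logits)  # normalized probabilities f^p
    perturbed = [math.log(p) + gumbel_noise(rng) for p in probs]
    # Hard sample (Gumbel-Max): argmax of the perturbed log-probabilities.
    hard_index = max(range(len(perturbed)), key=perturbed.__getitem__)
    one_hot = [1.0 if k == hard_index else 0.0 for k in range(len(perturbed))]
    # Soft sample: softmax with temperature tau replaces the argmax for gradients.
    relaxed = softmax([v / tau for v in perturbed])
    return one_hot, relaxed


rng = random.Random(0)
one_hot, relaxed = sample_edge([0.0] * 9, tau=1.0, rng=rng)
print(one_hot)
print(relaxed)
```

Only the operation marked by the one-hot vector needs to be evaluated in the forward pass, which is where the memory saving relative to evaluating all |P| operations comes from. A lower `tau` makes the relaxed vector closer to the one-hot vector.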

Algorithm 1 NAS-HRIS Search Encoder for High-Resolution Remote Sensing Image Segmentation
Require: D_train: the training set; D_valid: the validation set; n: batch size; initialized operation set P;

Ensure:
1: Initialize the architecture variable α and the weights ω randomly; set the learning rate ξ and the number of search epochs
2: repeat
3:   Sample a batch of data D_t from D_train;
4:   Compute L_train(ω, α) on D_t;
5:   Update ω by gradient descent: ω ← ω − ξ ∇_ω L_train(ω, α);
6:   Sample a batch of data D_v from D_valid;
7:   Compute L_valid(ω, α) on D_v;
8:   Update α by gradient descent: α ← α − ξ ∇_α L_valid(ω, α);
9: until the search epochs are completed

Compared with DARTS [32], NAS-HRIS reduces the GPU memory cost by a factor of about |P|, making it possible to run NAS on large-scale datasets. This suits the large data volumes that characterize high-resolution remote sensing images.
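The alternating updates in Algorithm 1 can be illustrated with a deliberately small toy problem. The sketch below is not the released NAS-HRIS code: it assumes a single edge with two candidate operations (identity and none), a scalar weight `w` in place of ω, a deterministic softmax mixture instead of Gumbel sampling, a fixed initialization instead of a random one, and hand-written gradients. All names and the target function y = 2x are hypothetical.

```python
import math


def softmax2(a0, a1):
    """Softmax over two architecture parameters."""
    m = max(a0, a1)
    e0, e1 = math.exp(a0 - m), math.exp(a1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)


def loss_and_grads(w, alpha, xs):
    """Toy model: output = w * (f0 * identity(x) + f1 * none(x)), target = 2x."""
    f0, f1 = softmax2(*alpha)
    loss = grad_w = grad_f0 = 0.0
    for x in xs:
        residual = w * f0 * x - 2.0 * x  # none(x) contributes zero
        loss += residual ** 2
        grad_w += 2.0 * residual * f0 * x
        grad_f0 += 2.0 * residual * w * x
    n = len(xs)
    loss, grad_w, grad_f0 = loss / n, grad_w / n, grad_f0 / n
    # Chain rule through the softmax: df0/da0 = f0*f1 and df0/da1 = -f0*f1.
    grad_alpha = (grad_f0 * f0 * f1, -grad_f0 * f0 * f1)
    return loss, grad_w, grad_alpha


def search(train_xs, valid_xs, xi=0.05, steps=200):
    """Alternate a weight step on training data and an alpha step on validation data."""
    w, alpha = 1.0, [0.0, 0.0]
    for _ in range(steps):
        _, grad_w, _ = loss_and_grads(w, alpha, train_xs)
        w -= xi * grad_w  # step 5: update the weights
        _, _, grad_alpha = loss_and_grads(w, alpha, valid_xs)
        alpha = [a - xi * g for a, g in zip(alpha, grad_alpha)]  # step 8: update alpha
    return w, alpha


w, alpha = search(train_xs=[1.0, 2.0], valid_xs=[1.5])
best_op = ["identity", "none"][alpha.index(max(alpha))]
print(best_op, w, alpha)
```

In this toy setting the identity operation is the only one that can fit the target, so its architecture parameter grows during the search and it is the operation that would be kept when the final cell is derived.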

Evaluation Criteria
There are many criteria for evaluating segmentation quality, most of which are based on accuracy and IoU, and different criteria capture different aspects of performance. We selected several representative indicators to measure the performance of the segmentation task. To express these criteria concisely, we denote the number of positive samples correctly predicted as positive by TP, the number of negative samples wrongly predicted as positive by FP, the number of negative samples correctly predicted as negative by TN, and the number of positive samples wrongly predicted as negative by FN.

Pixel Accuracy (PA)
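This is one of the simplest metrics, and it represents the percentage of pixels that are correctly classified.
$$\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN}$$

F1 Score
F1 Score is defined as the harmonic mean of precision and recall.
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$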

Mean Intersection over Union (MIoU)
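This is the standard metric for segmentation tasks. It represents the ratio of the intersection to the union of the predicted set and the ground-truth set, averaged over the k classes.
$$\mathrm{MIoU} = \frac{1}{k} \sum_{c=1}^{k} \frac{TP_c}{TP_c + FP_c + FN_c}$$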
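The three criteria can be computed directly from the confusion counts. The following sketch uses the standard definitions, with IoU for one class taken as TP / (TP + FP + FN) and MIoU as the mean over classes. The function names and the example counts are hypothetical, and the functions do not guard against empty classes (zero denominators).

```python
def pixel_accuracy(tp, fp, tn, fn):
    """Fraction of pixels that are correctly classified."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def iou(tp, fp, fn):
    """Intersection over union for one class."""
    return tp / (tp + fp + fn)


def mean_iou(per_class_counts):
    """Average IoU over classes; each entry is a (tp, fp, fn) tuple."""
    scores = [iou(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)


# Hypothetical binary example: 200 pixels, "building" as the positive class.
tp, fp, tn, fn = 80, 10, 100, 10
pa = pixel_accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)
# For the background class the roles swap: its TP is tn, its FP is fn, its FN is fp.
miou = mean_iou([(tp, fp, fn), (tn, fn, fp)])
print(pa, f1, miou)
```

For a binary task such as building extraction, MIoU averages the IoU of the building class and the IoU of the background class, which is why the second tuple swaps the roles of the counts.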

Experiments and Results
We describe the implementation of NAS-HRIS on three different datasets in detail. All the experiments were run on a single Tesla V100 GPU with 32 GB of memory. Our experiments consist of three stages. First, we use NAS to search for the optimal architecture on the specified dataset, according to Algorithm 1. This step yields the normal cell and the reduction cell. The second stage retrains the optimal architecture to obtain a better-performing model. The first two stages use the training set and the validation set. Finally, we use the testing set to assess the performance of the searched architecture. We define each cell as consisting of seven nodes and eight candidate operations, and the depth of the encoder is eight layers. The learning rate is 0.025.

Experiments on Aerial Dataset
We chose the WHUBuilding dataset [42] for aerial images. The dataset is composed of more than 220,000 independent buildings in Christchurch, New Zealand. These buildings are extracted from aerial images with a spatial resolution of 0.075 m and a coverage area of 450 km². Most of the images are down-sampled to 0.3 m spatial resolution and cropped into 8189 non-overlapping blocks to form the whole dataset. They are divided into three parts: 4736 images for training, 1036 images for validation, and 2416 images for testing.
The architecture search on the WHUBuilding dataset ran for 30 epochs and took about 12 hours. The resulting normal cell is shown in Figure 4, and the reduction cell in Figure 5. We ran NAS-HRIS three times, and the deviations of the PA, F1, and MIoU were 0.12%, 0.38%, and 0.25%, respectively, indicating that the MIoU is nearly invariant across runs. We compared NAS-HRIS with SegNet, U-Net and Deeplab v3+. The comparison results are shown in Table 1 and Figure 6. The MIoU of NAS-HRIS is 5.93% higher and the F1 is 4.81% higher than those of SegNet. Due to the simple design of the search space, our model is very small, only about 1/164 the size of SegNet. As can be seen in Figure 6, SegNet separates independent buildings well, with little adhesion between buildings, but the segmented buildings are often incomplete. On aerial HR remote sensing images, U-Net does not perform as well as it does on medical images. Although its MIoU is higher than that of SegNet, it separates buildings poorly and has difficulty distinguishing the areas between them. Among the three baselines, Deeplab v3+ performs best: the edges of the buildings are clearly delineated, but regions in the middle of buildings are sometimes misclassified. As the third picture shows, distinguishing roads from houses remains difficult in building segmentation, especially in areas with similar features. NAS-HRIS gives the best results: the edges are clear, and the building segmentation is complete.
We used search time and training time to measure our approach. Because SegNet, U-Net and Deeplab v3+ are fixed architectures, they have no search time, so we list only their training times in the relevant experiments. It is worth mentioning that DARTS consumes a lot of memory, especially on high-resolution remote sensing images with such a large scale of data, so its experiments could not be run on the 32 GB GPU and we do not report the corresponding results. This reflects the significance of the memory reduction achieved by our method.

Experiments on Satellite Dataset
The Gaofen Image Dataset (GID) is a dataset for land cover classification. It contains 150 HR images captured from more than 60 cities in China [43]. Each original image is 7200 × 6800 pixels, and we cut each of them into 182 images of size 512 × 512. Due to some problems with the image labels, we selected 10,000 images as our dataset. Among them, 6000 images are for training, 2000 images for validation, and 2000 images for testing. GID has five label classes: built-up, farmland, forest, meadow, and waters, as can be seen in Figure 7.
As with WHUBuilding, we used the SegNet, U-Net and Deeplab v3+ architectures for comparison. The MIoU of NAS-HRIS is 7.37% to 8.84% higher than that of the other three methods (see Table 2), which shows the advantage of a customized architecture obtained by architecture search on complex datasets. Because the source images contain many unlabelled regions, we deliberately selected four images for the visual comparison in this experiment. As can be seen from Figure 7, on satellite images neither of the two methods shown delineates segmentation boundaries satisfactorily. SegNet makes errors in the classification of forest, and NAS-HRIS makes errors in the classification of meadow. Note that in the last image there are some ships parked on the water; although they are not marked in detail in the label, both methods detect them. The cells shown in Figures 8 and 9 contain a large number of avg_pooling operations. Our explanation is that GID is a satellite image dataset covering large areas with many colors and complex features, and avg_pooling retains more background information from such wide-area images.
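The tile count above follows from cutting each image into non-overlapping 512 × 512 blocks and discarding the incomplete blocks at the borders. A minimal sketch of that arithmetic is given below. The function name is hypothetical, and discarding border remainders is an assumption that happens to reproduce the stated count of 182.

```python
def tile_origins(width, height, tile=512):
    """Top-left corners of all full, non-overlapping tiles inside an image."""
    return [
        (x, y)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


# One GID image: 14 full tiles along the 7200-pixel side, 13 along the 6800-pixel side.
origins = tile_origins(7200, 6800)
print(len(origins))       # 182 tiles per image
print(150 * len(origins))  # 27300 candidate tiles before label-based selection
```

The 10,000 images used in the experiments are therefore a subset of the 27,300 tiles that this cropping would produce from the 150 source images.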

Experiments on Non-single Source Dataset
In order to evaluate NAS-HRIS in multiple environments, we built a non-single source dataset, the Beijing Building Dataset (BBD). BBD not only meets the requirements of HR image segmentation labels, but is also convenient for practical applications. BBD is an elevation satellite image dataset that integrates satellite images and aerial photographs for building extraction and identification. It contains 2000 images of five different areas in Beijing, taken from the Google Earth History Map dated 24 November 2016. All these images are 512 × 512 pixels with a resolution of 0.458 m. The dataset covers more than 100 km² of Beijing, in both suburban and urban areas. We split the dataset into three parts: 1200 images for training, 400 images for validation and 400 images for testing.
In this experiment, we used the architecture searched on the WHUBuilding dataset and retrained it on BBD. The results of NAS-HRIS compared with SegNet, U-Net and Deeplab v3+ are shown in Table 3 and Figure 10.

Discussions and Conclusions
We proposed an improved image segmentation algorithm for high-resolution (HR) remote sensing images based on a neural architecture search (NAS-HRIS). NAS-HRIS uses a gradient descent search strategy to search in a cell-based search space. Compared with traditional methods, NAS-HRIS realizes the automatic design of neural networks and reduces the memory resources used in the automatic search process. We created a new urban Beijing Building Dataset (BBD), an elevation satellite image dataset that integrates satellite images and aerial photographs for urban building extraction and identification. We applied NAS-HRIS to aerial images, satellite images, and non-single source images, and achieved 90.44% MIoU on the WHUBuilding dataset. Although NAS-HRIS performs well in segmenting the HR remote sensing datasets, it still consumes considerable computing resources when searching for the architecture. Therefore, in future work, we will further optimize the search space and search strategy to reduce the constraints that computing resources place on the neural architecture search.