Performance Improvement of Deep Convolutional Networks for Aerial Imagery Segmentation of Natural Disaster-Affected Areas

Abstract: This study proposes a framework for improving performance and exploring the application of Deep Convolutional Networks (DCN) using the best parameters and criteria to accurately produce aerial imagery semantic segmentation of natural disaster-affected areas. This study utilizes two models: U-Net and Pyramid Scene Parsing Network (PSPNet). Extensive study results show that the Grid Search algorithm can improve the performance of both models; previous research has not used the Grid Search algorithm to improve performance in aerial imagery segmentation of natural disaster-affected areas. The Grid Search algorithm performs parameter tuning on DCN, data augmentation criteria tuning, and dataset criteria tuning for pre-training. The most optimal DCN model is PSPNet(152) (bpc), using the best parameters and criteria, with a mean Intersection over Union (mIoU) of 83.34%, a significant mIoU increase of 43.09% compared to using only the default parameters and criteria (baselines). Validation with the k-fold cross-validation method on the most optimal DCN model produced an average accuracy of 99.04%. PSPNet(152) (bpc) can detect and identify various objects with irregular shapes and sizes, important objects affected by natural disasters such as flooded buildings and roads, and small objects such as vehicles and pools, which are the most challenging targets for semantic segmentation network models. This study also shows that increasing the number of network layers in the PSPNet-(18, 34, 50, 101, 152) models, which use the best parameters and criteria, improves the model's performance. The results of this study indicate the need to utilize a special dataset of aerial imagery originating from Unmanned Aerial Vehicles (UAV) during the pre-training stage for transfer learning to improve DCN performance in further research.


I. INTRODUCTION
Image segmentation is an essential topic in image processing and computer vision. Classification, detection, localization, and segmentation are the four primary steps in identifying objects in an image through image processing. Image segmentation can be defined as the problem of classifying pixels with semantic labels (semantic segmentation) or partitioning individual objects. In contrast, image classification only has to identify what is in the image. Semantic segmentation performs pixel-level labeling with a set of object categories for all image pixels, so it is generally a more difficult task than image classification, which predicts a single label for the entire image [1]. In computer vision, semantic segmentation is a fundamental task that assigns a label to each pixel, also known as pixel-level classification [2]. Since the advent of deep neural networks, segmentation has made tremendous progress. We refer to [1]-[4] for a full description of deep learning techniques for semantic segmentation.
Numerous researchers are interested in image segmentation because of its many application domains. In addition, a growing number of datasets are accessible over the internet and are becoming easier to acquire. Due to this convenience, it is necessary to automate image segmentation operations to handle various daily life challenges. These tasks include urban remote sensing image segmentation to map land cover [5]-[7], river segmentation on remote sensing imagery [8], image segmentation for building extraction [9], [10], forest fire segmentation [11]-[14], segmentation of diverse roads and buildings [15], and coconut tree segmentation [16].
In recent years, image segmentation in the context of natural disasters has attracted considerable interest. It is one of the essential research topics in artificial intelligence and image processing. Advanced technology for capturing natural disaster events has become increasingly available. An Unmanned Aerial Vehicle (UAV) is one such device that captures aerial images of natural disaster damage and the affected area. The use of aerial images for monitoring and responding to natural disasters is gaining popularity. Even for areas that are difficult to explore on the ground, it is possible to create aerial imagery rapidly. These images can then be used to identify the locations most in need of support. Such image analysis is typically performed manually (manual interpretation by eye). The procedure is time-consuming and frequently yields incorrect findings.
Researchers have proposed many methods to produce accurate segmentation of aerial imagery. The current state-of-the-art methods fall into two groups: models that rely on conventional handcrafted features, as in studies [17]-[19], and deep neural networks. The benefit of the latter is the ability to learn from data end to end. Driven by the success of Deep Convolutional Networks (DCN), several researchers have applied them to segmentation and natural disaster problems. Research [20] used AlexNet to detect landslides and floods; [21] used AlexNet and VGG to detect drifting buildings from aerial imagery taken before and after tsunamis; [22] used Nazr-CNN to detect objects and classify damage after typhoons; [23] integrated CNN and RNN networks for semantic segmentation of flooded areas; [24] used Generative Adversarial Networks (GAN) to detect flood areas; [25] used a Fully Convolutional Network (FCN) based on Visual Geometry Group (VGG) to extract flooded areas from UAV imagery; and [26] used multiple models, e.g., U-Net, LinkNet, and ENet, for binary and multi-class semantic segmentation of aerial images to identify affected areas and access roads in post-disaster scenarios.
Other studies analyzed and evaluated the performance of the popular semantic segmentation methods DeepLabv3+, Pyramid Scene Parsing Network (PSPNet), and ENet on natural disaster datasets [27]; detected and segmented important objects in aerial footage of disaster locations using Mask Region-Based Convolutional Neural Networks (Mask R-CNN) and PSPNet [28]; segmented damage to buildings after a natural disaster using MSNet [29]; applied a self-attention-based semantic segmentation model named ReDNet to a disaster UAV dataset and compared it with three other advanced segmentation models, ENet, DeepLabv3+, and PSPNet [30]; performed flood detection with the AlexNet CNN to extract flood-related features from disaster zone images [31]; performed semantic segmentation of aerial images for post-flood landscape understanding by applying three advanced semantic segmentation networks, namely ENet, PSPNet, and DeepLabv3+ [32]; detected buildings damaged after an earthquake using the convolutional neural network models VGG-16, VGG-19, and NASNet [33]; performed semantic segmentation of natural disaster datasets using self-attention-based methods combined with Global Average Pooling and U-Net [34]; performed semantic segmentation of post-flood datasets with U-Net, PSPNet, and DeepLabV3+ [35]; detected flooding using segmentation with three deep neural networks, PSPNet, DeepLabV3, and U-Net [36]; extracted residential buildings with a modified Mask R-CNN [37]; performed semantic segmentation of volcanic ash eruptions using the SegNet and U-Net convolutional neural networks for volcano monitoring [38]; detected and identified landslides using Lightweight Attention U-Net [39]; found buildings damaged by disasters using a transfer-learning deep attention network (TDA-Net) [40]; and detected landslides through semantic segmentation using U-Net [41], [42] and a self-training method [43].
Although DCN performance is highly dependent on architectural modifications, as shown in the studies above, tuning and selecting appropriate parameters and criteria has enormous potential to further improve DCN performance for aerial imagery segmentation of natural disaster-affected areas. The main challenge is to improve the performance of DCN so that it accurately produces aerial imagery semantic segmentation of natural disaster-affected areas. Therefore, this study presents a framework and reveals practical knowledge through experimental studies for aerial imagery segmentation of natural disaster-affected areas. We summarize the experiments carried out and the knowledge gained in the trials, which are the main contributions of this paper, as follows:
• Improve the performance of the network model for aerial imagery semantic segmentation of natural disaster-affected areas by integrating the Grid Search algorithm and DCN and validating the results with the k-fold cross-validation method.
• Conduct comprehensive testing using several parameters, data augmentation, an evaluation dataset of aerial imagery of natural disaster-affected areas, large-scale datasets for pre-training, and DCN, which accurately produces the best parameters and criteria for aerial imagery semantic segmentation of natural disaster-affected areas.
• Based on the best parameters, appropriate data augmentation criteria, and suitable pre-training dataset criteria, conduct a comprehensive test and produce the most optimal DCN performance for aerial imagery semantic segmentation of natural disaster-affected areas.
We utilize two semantic segmentation network models, U-Net [44] and PSPNet [45], two advanced semantic segmentation networks that have demonstrated promising performance on various segmentation benchmarks. We used the Residual Network (ResNet) architecture [46] as the backbone of an information encoder capable of extracting fine image patterns. Using the PSPNet model, we also determined the relationship between the number of layers and performance improvements, utilizing PSPNet-(18, 34, 50, 101, 152). The U-Net and PSPNet models with the best parameters and criteria resulting from this study were compared with the same network models using only default parameters and criteria (baselines). We also compared the results of our study with those in the literature review. We conducted these comparisons to show that our proposed framework significantly increases DCN's performance for aerial imagery semantic segmentation of natural disaster-affected areas.
This study addresses the recognition of aerial imagery of natural disaster-affected areas through semantic segmentation, resulting in improved DCN performance. Therefore, in the following we describe representative research that is closely related to the use of parameters, data augmentation criteria, dataset criteria in pre-training, and natural disaster aerial imagery datasets.

A. Use of Parameters and Criteria on Deep Convolutional Networks for Aerial Imagery Segmentation
Previous researchers have used parameters such as learning rate, data split, and optimizer, as well as data augmentation criteria, on DCN for segmentation. Research [21] used Stochastic Gradient Descent (SGD) with a learning rate of 0.001 and data augmentation with vertical and horizontal flipping. In [23], the CNN was trained using the Adam optimizer with a learning rate of 1e-5 and a batch size of 12, and the RNN was trained using Adagrad with a learning rate of 0.01 and a batch size of 8. Research [25] used a learning rate of 0.0001 and a maximum of 6 epochs for all classes. In [26], the model was trained using the Adam optimizer with a learning rate of 10^-4 for 600 epochs, with horizontal and vertical flipping data augmentation. Research [27], [32] used random shuffling, scaling, flipping, and random rotation data augmentation with a batch size of 2; the learning rate was 0.0001 for PSPNet, 0.0005 for ENet, and 0.01 for DeepLabv3+. In [28], the dataset was separated into training (60%), validation (20%), and testing (20%); the Mask R-CNN learning rate was set to 10^-3; and visual augmentation was done with zoom, blur, pixel dropout, added clouds, and color equalization operations. Research [29] trained the model on 80% of the dataset and tested it on the remaining 20% for 100 epochs, with an initial learning rate of 0.003 that was reduced to 0.001 after 10 epochs, using SGD as the optimizer with a batch size of 8. Research [30] used a learning rate of 0.0001 and augmentation with random shuffling, scaling, flipping, and random rotation.
Research [32] divided the dataset into training, validation, and test sets, with 70% for training and 30% for validation and testing. In [33], 70%, 15%, and 15% of the dataset were used for the training, validation, and test sets, respectively, with 300 epochs and a learning rate of 0.0001. Research [34] used a learning rate of 0.001 and random shuffling, scaling, flipping, and random rotation augmentations. In [35], the learning rate was 0.01 for U-Net and 0.001 for PSPNet and DeepLabV3+; image augmentation used shuffling, rotation, scaling, shifting, and brightness contrast; and the Adam optimizer and a batch size of 24 were used for all models. In [36], PSPNet and U-Net were trained with a learning rate of 0.001 for 15 epochs and DeepLabV3 with a learning rate of 0.01 for 10 epochs; image augmentation used the Albumentations tools, which implement various image transformation operations such as Resize, ShiftScaleRotate, RGBShift, RandomBrightnessContrast, and Normalize. The optimizer used for Mask R-CNN training in [37] was SGD, with a learning rate of 0.0025 and a batch size of 3.
Research [38] used a learning rate of 0.0001, a batch size of 4, 100 epochs, the Adam optimizer, and data augmentation with horizontal flips, zoom, random noise, and rotations; the dataset was divided into training and validation sets in proportions of 80% and 20%, respectively. Research [39] used a learning rate of 1 × 10^-5, a maximum of 150 epochs, a batch size of 16, a dataset split of 70% for training and 30% for validation, and the Adam optimizer. Research [40] chose Adam as the optimizer with a learning rate of 1 × 10^-4. Research [41] used augmentation consisting of random rotations and vertical and horizontal flips; the model was trained for 200 epochs with a dynamic learning rate of 0.001 using the Adam optimizer and four different batch sizes (16, 32, 64, 128), and 30% of each dataset was used as validation data. Research [42] used the Adam optimizer with 100 epochs and a learning rate of 0.01. For the overall training setting, research [43] used the SGD optimizer, a batch size of 16, and data augmentation with random flipping, random resizing, and cropping.
Research [20] reported F-scores for landslide and flood detection in the range of 80%-90% but detected only one class; [21] achieved a classification accuracy of 94%-96% in all conditions but detected only one class, namely building; [23] obtained an accuracy and mean Intersection over Union (mIoU) for semantic segmentation of 96% and 92%, respectively, but used only one class, namely flooded areas; [24] reported flood segmentation accuracy of 89%-95.5% in rural areas and 80.5%-88% in urban areas; [27] achieved the highest mIoU of 79.43% with the PSPNet method using 9 classes; [30] reported an mIoU of 80.27% for the PSPNet method with 9 classes; [31] had an accuracy of 91% for segmentation with only one class, namely flooding; [32] reported a segmentation mIoU of 80.35% using the PSPNet method with 9 classes; [33] achieved the highest accuracy of 70% for the VGG-19 model using 3 building classes, namely normal, less damaged, and damaged; [34] achieved a PSPNet mIoU of 79.43% for 9 classes; and [35] reported the best segmentation mIoU of 52.23% using the DeepLabV3+ (pseudo-labels) method. Research [38] obtained an mIoU of 90.13% for the U-Net architecture and 88% for SegNet, calculated on a validation dataset, for automatically extracting volcanic ash eruption forms. Research [39] reported mIoU and F1-score values of 82.29% and 87.45% for Lightweight Attention U-Net, the best performance for landslide segmentation. Research [41] achieved the highest mIoU value of 43%. Research [42] achieved an Area Under the Precision-Recall Curve (AUPRC) value exceeding 0.7.
Datasets from open sources such as images.google.com and images.baidu.com were used in the study [33], and datasets taken from the Geospatial Information Authority of Japan (GSI) were used in the study [37]. Research [38] used the Etna_NETVIS dataset. The Red Relief Image Map (RRIM) dataset was used in the study [39]. The xView2 dataset, WHU Building, and other data from Google Earth were used in the study [40]. Research [41] used three different datasets created by RapidEye, the Normalized Vegetation Index (NDVI), and the digital elevation model (DEM). Synthetic Aperture Radar (SAR) datacubes were used in the study [42]. Research [43] used Sentinel-2 and ALOS PALSAR data.
With transfer learning, research [22] obtained a best overall segmentation accuracy of 40.90% with the Nazr-CNN model for three damage classes; [25] obtained the highest overall accuracy on FCN-8s of 95.520% for four classes: water, building, vegetation, and road; [26] achieved the highest segmentation mIoU of 44.99% on the UNetUp (VGG16) model using only the road and building classes; [28] achieved its best performance, an mIoU of 32.17% and an accuracy of 77.01%, on the PSPNet model with the classes flood area, debris, roads, and vegetation; [29] achieved an AP value (averaged over all IoU thresholds) of 37.2% for MSNet; [36] reached 56% mIoU on the PSPNet model with the ResNet-152 encoder; [37] obtained the highest segmentation mAP of 37.3% using the Mask R-CNN model for four levels of damage; [40] obtained F1-score (F1), precision (P), and recall (R) values of 95.6%, 94.9%, and 96.4%, respectively, on TDA-Net for detecting damaged buildings; and [43] obtained the highest F1-score of 73.50%.
The related studies above produced various accuracy values for aerial imagery segmentation according to the parameters and criteria used. The resulting accuracy is quite high in several studies that used only one class or a small number of classes. However, the resulting accuracy is quite low in studies that used a large number of classes. High accuracy does not necessarily result in high mIoU values, so mIoU values should be reported with every final test result so that the quality of the segmentation produced by the DCN model can be judged. Some of the mIoU values reported in these related studies are still quite low, especially for segmentation tasks with many classes; this is due to an unsuitable choice of parameters and criteria for the DCN model. Our study used nine object classes and presents a complete performance evaluation consisting of accuracy, precision, recall, F1-score, and Intersection over Union (IoU).
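The gap between accuracy and mIoU noted above can be illustrated with a small sketch. The two-class toy data below (95 background pixels, 5 vehicle pixels) is an assumption made purely for illustration and is not taken from any of the cited studies; the function names are likewise hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy image (assumed): class 0 = background, class 1 = vehicle.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a model that never predicts the small class

cm = confusion_matrix(y_true, y_pred, num_classes=2)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
print(f"accuracy={acc:.3f}, mIoU={miou:.3f}")
```

In this toy case the model ignores the small class entirely, yet its pixel accuracy stays high because the background dominates, while the mIoU is pulled down by the zero IoU of the missed class.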
The previous studies described above used parameters (such as learning rate, data split, and optimizer), data augmentation criteria, and pre-training dataset criteria that were determined only from their respective literature studies, applied different settings in a trial-and-error manner, or used only the default parameters and criteria. None of these previous studies has improved DCN performance for aerial imagery semantic segmentation of natural disaster-affected areas; none has used the Grid Search algorithm to comprehensively tune DCN parameters (such as learning rate, data split, and optimizer), data augmentation criteria, and pre-training dataset criteria for aerial imagery semantic segmentation of natural disaster-affected areas; none has searched for the best combination of parameters and criteria; and none has validated the most optimal DCN model using the k-fold cross-validation method.
These previous studies also have not conducted tests to verify the relationship between the number of layers and increased performance. No one has carried out transfer learning using a combination of general datasets (real-world images + urban images + road images) and a combination of special aerial imagery datasets originating from UAVs for aerial imagery semantic segmentation of natural disaster-affected areas. No one has tested and compared model performance across several scenarios, namely using default parameters and criteria, using the best parameters, and using the best parameters and criteria.
Our study proposes a framework for improving performance and exploring the application of DCN using the best parameters and criteria to accurately produce aerial imagery semantic segmentation of natural disaster-affected areas. Our study takes the initiative to perform aerial imagery semantic segmentation of natural disaster-affected areas by integrating the Grid Search algorithm and DCN. This study tunes parameters and criteria comprehensively using the Grid Search algorithm and validates the results using the k-fold cross-validation method, taking into account the parameters used in DCN, various appropriate data augmentation methods, and various datasets that are suitable for pre-training. Combinations of each parameter and criterion were tried to obtain the most optimal performance in accurately producing aerial imagery semantic segmentation of natural disaster-affected areas. We provide the results of tuning the best combination of parameters and criteria and compare performance with models using default parameters and criteria (baselines). We also try to optimize PSPNet with multiple layer counts using the best parameters and criteria. This effort is beneficial for revealing practical knowledge and for fair comparison across several approaches and scenarios.
We believe that transfer learning considerations make the aerial imagery semantic segmentation of natural disaster-affected areas more reliable and knowledgeable. We validated the effect of general datasets (real-world, urban, or road images), a special dataset of aerial imagery derived from UAVs, and a combined dataset for transfer learning on the performance of DCNs for semantic segmentation. We also display the results of aerial imagery semantic segmentation of natural disaster-affected areas visually to show the accuracy of the DCN model.
We organize this paper as follows: Section II, Materials and Method, describes Deep Convolutional Networks (DCN), the dataset used, the Grid Search algorithm, the k-fold cross-validation method, the implementation of semantic segmentation, and the proposed framework or method. The experimental results are presented and discussed in Section III, Results and Discussion. Finally, Section IV presents our conclusions and suggests further research.

A. Deep Convolutional Networks (DCN)
We mainly use U-Net [44] and PSPNet [45] as DCN models for semantic segmentation in this study, with the ResNet architecture [46] as the backbone. At the start of the test, we confirmed performance with the PSPNet(50) model and added layers to obtain PSPNet(101). Next, we retested with other layer counts, namely PSPNet(18), PSPNet(34), and PSPNet(152). In addition, we compared the results with U-Net. All these network models use the best parameters, the appropriate data augmentation criteria, and the suitable pre-training dataset criteria, and their results are compared with the network models using the default parameters and criteria (baselines).
1) U-Net: U-Net modifies and extends the FCN architecture so that the network works with fewer training images and generates more accurate segmentation. The main idea is to supplement the conventional contracting network with successive layers in which upsampling operators replace pooling operators, so that these layers increase the output resolution. One of the most significant changes in the U-Net architecture is in the upsampling part, which contains a large number of feature channels, allowing the network to propagate context information to higher-resolution layers. The resulting architecture is shaped like a U [44]. U-Net was initially designed for biomedical image segmentation tasks. In recent years, research has demonstrated that U-Net is also applicable to, and has significant potential for, semantic segmentation of aerial imagery.
2) PSPNet: Scene parsing is a fundamental task in computer vision based on semantic segmentation. Scene parsing aims to comprehensively understand the scene by predicting object labels, locations, and shapes. Previously developed frameworks for advanced scene parsing relied heavily on Fully Convolutional Networks (FCN). FCN-based approaches face a number of challenges because of the diversity of scenes. To overcome these challenges, the Pyramid Scene Parsing Network (PSPNet) was introduced [45]. Pixel prediction in PSPNet is based on FCN. In addition, the pixel-level features are extended with global pyramid pooling features, in which local and global cues are merged to produce more accurate final predictions. An optimization strategy with a deeply supervised loss has also been integrated. To build a global scene prior on the final-layer feature map of the neural network, the Pyramid Pooling Module was implemented to reduce the loss of context information between distinct sub-regions. This module operates at four different pyramid levels. PSPNet is a proven and effective network for comprehending complex scenes. PSPNet achieves state-of-the-art performance on various datasets, including the ImageNet scene parsing challenge 2016, PASCAL VOC 2012, and Cityscapes benchmarks. PSPNet utilizes ResNet as its backbone with a dilated network to extract feature maps. Then a 4-level pyramid pooling is applied to the feature map to extract the global context prior. The final prediction map is produced by combining these global priors with the original feature map, followed by a convolution layer.
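The 4-level pyramid pooling described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the implementation used in this study: it works on a single-channel feature map, uses the bin sizes 1, 2, 3, and 6 from the original PSPNet paper, replaces bilinear upsampling with nearest-neighbor upsampling, and omits the 1×1 convolutions that reduce channel dimensions. The function names are hypothetical.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool an H x W map (list of lists) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        r0, r1 = (i * h) // bins, ((i + 1) * h + bins - 1) // bins
        row = []
        for j in range(bins):
            c0, c1 = (j * w) // bins, ((j + 1) * w + bins - 1) // bins
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(pooled, h, w):
    """Resize a bins x bins grid back to H x W (nearest neighbor, a simplification)."""
    bins = len(pooled)
    return [[pooled[(r * bins) // h][(c * bins) // w] for c in range(w)]
            for r in range(h)]


def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the original map plus one upsampled context map per pyramid level."""
    h, w = len(fmap), len(fmap[0])
    channels = [fmap]
    for b in bin_sizes:
        channels.append(upsample_nearest(adaptive_avg_pool(fmap, b), h, w))
    return channels  # concatenated along the channel axis in a real network


# Toy 6 x 6 feature map (assumed values).
fmap = [[r * 6 + c for c in range(6)] for r in range(6)]
out = pyramid_pooling(fmap)
print(len(out), "channels; global-context value:", out[1][0][0])
```

The coarsest level (one bin) carries a single global average for the whole map, while finer levels keep progressively more local detail; stacking them with the original map is what lets the final convolution see both local and global context.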
3) ResNet: The PSPNet model utilizes a backbone, in the form of an information encoder, that is capable of extracting fine image patterns. A Microsoft Research team developed Deep Residual Learning for Image Recognition to solve fundamental issues of VGG and AlexNet. Scaling network depth is a challenge for AlexNet and VGG. When increasingly deep networks begin to converge, the degradation problem becomes apparent: as the network depth increases, the accuracy saturates and then rapidly falls. ResNet is based on residual blocks with "identity shortcut connections" that skip one or more layers. When an identity mapping is optimal, it is easier to push the residual to zero than to fit the identity mapping with a stack of nonlinear layers. With these modifications, ResNet outperforms previous state-of-the-art convolutional networks [46].
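The identity shortcut idea can be made concrete with a minimal sketch. The block below is a deliberate simplification and not the ResNet implementation used in this study: it uses small dense weight matrices in place of convolutions, omits batch normalization, and all names are hypothetical. It computes y = ReLU(F(x) + x), where F is the residual branch.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x) with F(x) = w2 @ ReLU(w1 @ x)."""
    fx = matvec(w2, relu(matvec(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]

# With the residual branch driven to zero, the block reduces to the identity
# (for non-negative inputs, since ReLU leaves them unchanged).
zeros = [[0.0] * 3 for _ in range(3)]
y_identity = residual_block(x, zeros, zeros)

# With non-zero weights, the block adds a learned correction on top of x.
eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
half = [[0.5 if i == j else 0.0 for j in range(3)] for i in range(3)]
y_residual = residual_block(x, eye, half)

print(y_identity, y_residual)
```

The first call shows why deeper networks built from such blocks need not be worse than shallower ones: a block can fall back to passing its input through unchanged simply by zeroing its residual branch.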

B. Datasets
The datasets used in this study are divided into pre-training datasets and evaluation datasets for training, validation, and testing. The selection of the two kinds of datasets is based on the availability of datasets that include segmentation annotations and are publicly available and easily accessible. The datasets used in this study and their characteristics are shown in Table I.

1) Pre-Training Dataset:
For the pre-training dataset, we used two types of external datasets: general datasets (real-world images, urban images, or road images) and special datasets of aerial imagery derived from UAVs. Both types of datasets were tested to verify their effect on DCN performance for semantic segmentation. To successfully optimize the DCN model for semantic segmentation, a large pre-training dataset is required. We define such a dataset as one that is larger in scale than the evaluation dataset, is easy to obtain, and has segmentation annotations. The transfer learning procedure consists of pre-training with a large-scale dataset and then training with a relatively small evaluation dataset. However, due to the limited capabilities of personal computers and the availability of existing datasets, we limited the number of images in each dataset, as shown in Table I. We selected the COCO, VOC, Cityscapes, DSRS, and Mapillary Vistas datasets as the pre-training datasets containing real-world, urban, or road images and the USS and Semantic Drone datasets as those derived from UAV aerial imagery. The VOC and DSRS datasets have a single label on each image, while the COCO, Cityscapes, Mapillary Vistas, USS, and Semantic Drone datasets have multiple labels. We collected these datasets from the relevant sites (COCO, VOC, Cityscapes, Mapillary Vistas, USS, and Semantic Drone) and the data science community site Kaggle (DSRS). Transfer learning requires a trained model, called a pre-trained model. Pre-trained models are usually trained on larger, structured, and labeled datasets. Currently, many pre-trained models are provided for various needs, such as image classification and object detection. Still, it is not easy to obtain a pre-trained model for image segmentation that fits the overall DCN model we use in this study. Therefore, in this study, we use all of the above datasets and their annotations in the pre-training process to create our own pre-trained models, so that we obtain good-quality pre-trained models specifically for image segmentation.
2) Evaluation Dataset:
We use FloodNet as the evaluation dataset for training, validation, and testing in image recognition for aerial imagery semantic segmentation of natural disaster-affected areas. FloodNet is an aerial imagery dataset originating from UAVs. We obtained this dataset from research [32], which collected high-resolution aerial image data to understand post-disaster (flood) landscapes. FloodNet delivers high-resolution images taken from low altitudes, which have an advantage over satellite images captured from higher altitudes, which clouds and smoke may obscure. The collection was acquired using a small UAV platform, DJI Mavic Pro quadcopters, at an altitude of 60 meters, resulting in images with a very high spatial resolution (about 1.5 centimeters) that distinguishes it from previous natural disaster datasets. Post-flood damage in the affected area is shown in all images. This dataset contains pixel-level semantic segmentation annotations. There are 2343 images and their respective annotations, categorized into 9 classes: building-flooded, building-non-flooded, road-flooded, road-non-flooded, water, tree, vehicle, pool, and grass.

C. Implementation Details
1) Optimization of Parameters and Criteria Using Grid Search Algorithm and K-Fold Cross-Validation Method:
The Grid Search (GS) algorithm is an exhaustive search method over a uniform grid defined in the parameter search space. The primary purpose of this method is to identify optimal model parameters so that model performance can be improved as much as possible [55]. The basic principle of the GS method is to divide the search space into a grid over a certain range and traverse all points in the grid, evaluating the model with the parameter values at each point. Finally, the parameter combination with the highest accuracy is selected as the best [56]. GS was developed to match parameters and criteria and optimize the solution of complex problems, which in this study means accurately producing aerial imagery semantic segmentation of natural disaster-affected areas.
The Grid Search algorithm is used in this study to tune and exhaustively identify the parameters and criteria that lead to the highest accuracy. The parameters consist of the learning rate, data split, and optimizer. The criteria consist of data augmentation criteria and pre-training dataset criteria. For the data augmentation criteria, we use several augmentation methods, namely photometric distortion, geometric distortion, cutout, and a combination of all data augmentation methods. The pre-training dataset criteria consist of general and special aerial imagery datasets, using the datasets described in Table I. The highest accuracies of the DCN model for semantic segmentation across all parameters and criteria are compared to determine the best combination of parameters and criteria, which produces the most optimal model performance in accurately producing aerial imagery semantic segmentation of natural disaster-affected areas.
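The exhaustive traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the candidate values in `param_grid` are invented for the example (the actual values are those listed in Table II), and `toy_evaluate` is a hypothetical stand-in for training a DCN with the given settings and returning its validation accuracy.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


# Hypothetical grid: the values are illustrative only.
param_grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "optimizer": ["sgd", "adam"],
    "train_split": [0.7, 0.8],
    "augmentation": ["photometric", "geometric", "cutout", "all"],
}


def toy_evaluate(params):
    """Stand-in for 'train the model and return validation accuracy'."""
    score = 0.50
    if params["learning_rate"] == 1e-3:
        score += 0.20
    if params["optimizer"] == "adam":
        score += 0.10
    if params["train_split"] == 0.8:
        score += 0.05
    if params["augmentation"] == "all":
        score += 0.10
    return score


best_params, best_score, n_tried = grid_search(param_grid, toy_evaluate)
print(n_tried, "combinations tried; best:", best_params)
```

Because every combination is evaluated, the cost grows multiplicatively with each added parameter or criterion, which is one reason to keep the candidate lists short when each evaluation is a full training run.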
The results of the most optimal model using the best parameters and criteria were validated using the k-fold cross-validation method.
The performance of the DCN model can be improved by using a combination of GS and k-fold cross-validation, and the model's performance can be evaluated based on the cross-validation method. In k-fold cross-validation, the training set is first divided into k subsets of equal size. The model is trained and tested k times. In each round, one subset is used as test data, while the rest are used as training data. Sequentially, each subset is tested by a model trained on the other k−1 subsets. Therefore, each sample in the training set is tested exactly once. As a result, the cross-validation accuracy is the percentage of data tested correctly. The k-fold cross-validation accuracy is estimated as the average of the k per-fold accuracy measurements (Equation 1), where A is the accuracy of the model on each fold and k is the number of subsets or groups used.

$$\text{Cross-Validation Accuracy} = \frac{1}{k}\sum_{i=1}^{k} A_i \qquad (1)$$
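The k-fold procedure and the averaging in Equation 1 can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `k_fold_indices`, `cross_validation_accuracy`, and `toy_scorer` are hypothetical, the folds are contiguous and unshuffled (the text does not say whether the data were shuffled), and `toy_scorer` stands in for training on the training indices and measuring accuracy on the test indices.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_accuracy(n_samples, k, train_and_score):
    """Average the k per-fold accuracies, as in Equation 1."""
    folds = k_fold_indices(n_samples, k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        accuracies.append(train_and_score(train_idx, test_idx))
    return sum(accuracies) / k


def toy_scorer(train_idx, test_idx):
    """Stand-in (an assumption): share of even indices in the test fold."""
    return sum(1 for j in test_idx if j % 2 == 0) / len(test_idx)


folds = k_fold_indices(10, 5)
cv_acc = cross_validation_accuracy(10, 5, toy_scorer)
print(folds, cv_acc)
```

Because the folds partition the index range, every sample appears in exactly one test fold, which matches the statement that each sample is tested once.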
The flowchart of the Grid Search algorithm with k-fold cross-validation proposed in this study is shown in Fig. 1, and the complete parameters and criteria used in the Grid Search algorithm are shown in Table II.
2) Implementation of Semantic Segmentation: This section explains how DCN is implemented for aerial imagery semantic segmentation of natural disaster-affected areas. The Keras framework with the TensorFlow backend was used to build the model and implement the segmentation network. All semantic segmentation experiments were run on a personal computer (PC) with a 12-core 12th Gen Intel® Core™ i7 processor (turbo frequency up to 4.90 GHz), 32 GB of RAM, and an NVIDIA GeForce RTX 3080 GPU with 10 GB of memory. This study uses a categorical cross-entropy loss function (Equation 2), where $y_i$ is the true label (the ground truth label of each labeled image), $\hat{y}_i$ is the predicted label (the predicted result for a classified image), $N$ is the total number of samples used in each epoch, and Loss is the average cross-entropy between the predicted distribution $\hat{y}_i$ and the ground truth $y_i$.
$$\text{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] \qquad (2)$$
All models were trained for 50 epochs for a fair comparison between different models. During training and validation, we resized all original images to 473 × 473 pixels, set the batch size to 2, and set the number of steps (no_of_step) to the number of images in the dataset divided by the batch size. We also use a checkpoint callback to save the best model over the course of training. In addition, we apply early stopping to prevent overfitting by stopping the training process when the training loss no longer decreases.
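The step count and the stopping rule described above can be sketched in plain Python. This is a sketch under assumptions rather than the authors' Keras configuration: `steps_per_epoch` and the `EarlyStopping` class here are hypothetical helpers, the `patience` value is assumed because the text does not state one, and the loss curve is invented for illustration.

```python
def steps_per_epoch(num_images, batch_size):
    """Number of full batches per epoch (an incomplete last batch is dropped)."""
    return num_images // batch_size


class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience  # assumed value; not given in the text
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.wait = 0

    def update(self, epoch, loss):
        """Record one epoch; return True if training should stop."""
        if loss < self.best_loss - self.min_delta:
            # A checkpoint callback would save the model at this point.
            self.best_loss, self.best_epoch = loss, epoch
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


# Hypothetical training-loss curve, for illustration only.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
print(steps_per_epoch(1640, 2), stopped_at, stopper.best_epoch)
```

With this curve, training halts at epoch 6 and the best checkpoint is the one from epoch 3, so the later improvement at epoch 7 is never reached. That trade-off is why the patience value matters.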
To assess the performance of the DCN model for semantic segmentation, this study reports accuracy, precision, recall, F1-score, and Intersection over Union (IoU), all of which are derived from a confusion matrix with four components: false negative (FN), false positive (FP), true negative (TN), and true positive (TP).
TP is a pixel correctly predicted according to its class, which is one of 9 object classes: building-flooded, building-non-flooded, road-flooded, road-non-flooded, water, tree, vehicle, pool, and grass. FP is a pixel incorrectly identified as belonging to a class when it actually does not belong to that class. FN is a pixel incorrectly identified as not belonging to a class when, in fact, it does. TN is a pixel correctly predicted as not belonging to the class in question.
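The four per-class counts can be made concrete with a small sketch. The function name `confusion_counts` and the eight-pixel label lists are hypothetical and serve only to illustrate the one-versus-rest counting for a single class.

```python
def confusion_counts(y_true, y_pred, cls):
    """Return (TP, FP, FN, TN) for class `cls` over flattened pixel labels."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == cls and t == cls:
            tp += 1  # predicted cls, truly cls
        elif p == cls:
            fp += 1  # predicted cls, truly another class
        elif t == cls:
            fn += 1  # truly cls, predicted as another class
        else:
            tn += 1  # neither predicted nor truly cls
    return tp, fp, fn, tn


# Toy 8-pixel example with made-up labels.
y_true = ["water", "water", "tree", "grass", "water", "tree", "grass", "grass"]
y_pred = ["water", "tree", "tree", "grass", "water", "water", "grass", "tree"]
water_counts = confusion_counts(y_true, y_pred, "water")
print(water_counts)
```

Note that a pixel misclassified between two other classes (grass predicted as tree, for example) still counts as a TN for the water class, which is why per-class accuracy tends to be high even when IoU is modest.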
Accuracy is the ratio between the number of correctly predicted pixels and the total number of pixels. It is calculated as the number of TP and TN pixels for each class divided by the total number of pixels (Equation 3). Precision is the fraction of positive predictions that actually belong to the positive class (Equation 4). Recall is the fraction of all positive samples that are predicted as positive (Equation 5). The F1-score provides a single numerical value that balances precision and recall (Equation 6). For each class, the IoU is calculated by dividing the number of TP pixels for that class by the total number of TP, FP, and FN pixels (Equation 7). The mean IoU (mIoU) across all classes reflects the overall performance of the DCN model.
The proposed overall framework for improving the performance of deep convolutional networks for aerial imagery segmentation of natural disaster-affected areas is shown in Fig. 2. The segmentation results display the performance evaluation, segmented objects, object class labels, object class probability, and the number of each object.
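Equations 3 to 7 are referenced in the text but do not appear in this version of the document. The standard confusion-matrix definitions that match the descriptions above are given here as a reconstruction, not as the authors' exact typesetting. The mIoU expression is unnumbered, and the symbol $C$ for the number of classes is an assumption.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3} \\
\text{Precision} &= \frac{TP}{TP + FP} \tag{4} \\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{5} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6} \\
\text{IoU}       &= \frac{TP}{TP + FP + FN} \tag{7}
\end{align}
\[
\text{mIoU} = \frac{1}{C}\sum_{c=1}^{C} \text{IoU}_c
\]
```

As a consistency check against the reported figures, a precision of 90.84% and a recall of 90.47% give an F1-score of about 90.65% under Equation 6, which agrees with the value stated for PSPNet(152) (bpc).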

A. Parameters and Criteria Testing Results
This section presents the test results of comprehensively tuning the parameters and criteria using the Grid Search algorithm and DCN for aerial imagery semantic segmentation of natural disaster-affected areas. The detailed settings for tuning the parameters and criteria are shown in Table II. The test results for tuning parameters on DCN using the Grid Search algorithm, which produces 90 combinations of parameters, are shown in Fig. 3. The best parameters of the DCN model for aerial imagery semantic segmentation of natural disaster-affected areas reach the highest accuracy of 98.48% with the following combination: a learning rate of 0.0001; a data split with 90% for training and validation (70% training and 20% validation) and 10% for testing; and the RMSProp optimizer. Fig. 4 shows the test results of tuning the data augmentation criteria. We confirm that an appropriate data augmentation method improves DCN performance and helps prevent overfitting: the highest accuracy of 91.10% is obtained with the geometric distortion data augmentation method, which consists of flipping (horizontal and vertical), rotating (30°, 60°, 90°), and random scaling (X scale, Y scale).
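For segmentation, a geometric augmentation has to be applied identically to an image and its label mask, otherwise the annotations no longer line up with the pixels. The sketch below illustrates this with flips and a 90° rotation on small nested lists. The helper names (`hflip`, `vflip`, `rot90`, `augment_pair`) and the toy data are hypothetical, and the 30° and 60° rotations and random scaling mentioned above are omitted because they require interpolation (typically nearest-neighbour for masks so that class labels stay discrete).

```python
def hflip(grid):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Mirror a 2-D grid top to bottom."""
    return grid[::-1]


def rot90(grid):
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment_pair(image, mask, transform):
    """Apply the same geometric transform to an image and its label mask."""
    return transform(image), transform(mask)


# Toy 2x2 image and mask with made-up values.
image = [[1, 2], [3, 4]]
mask = [["water", "tree"], ["grass", "pool"]]
aug_image, aug_mask = augment_pair(image, mask, rot90)
print(aug_image, aug_mask)
```

After the rotation, each label in `aug_mask` sits at the same position as its corresponding pixel in `aug_image`, which is the property the training pipeline depends on.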
Fig. 4 The test results of tuning data augmentation criteria using a grid search algorithm for aerial imagery semantic segmentation of natural disaster-affected areas
The test results of tuning the dataset criteria for pre-training using the Grid Search algorithm show which kind of dataset is suitable for transfer learning in aerial imagery segmentation of natural disaster-affected areas, that is, which dataset leads to the highest accuracy and the most optimal performance of the DCN model. We compared several datasets suitable for transfer learning. To confirm the effect of the pre-training dataset, we performed transfer learning from general datasets (real-world images, urban images, or road images) and from special datasets of aerial imagery derived from UAVs to the evaluation dataset (FloodNet). We used five general datasets, two special datasets of aerial imagery, and three combined datasets, which are suitable for transfer learning and match the object classes used in aerial imagery segmentation. Fig. 5 shows the effect of pre-training with all datasets.
As shown in Fig. 5, the Cityscapes pre-training model achieved the best performance for transfer learning with a single pre-training dataset, with an accuracy of 92.396%. This is because Cityscapes is a general dataset that contains images of urban landscapes with multiple labels and a fairly high image resolution, so it remains suitable as a single pre-training dataset for transfer learning to datasets of aerial imagery of natural disaster-affected areas, which contain images of more complex urban and natural landscapes.
To improve accuracy further, we organized larger datasets by combining existing ones. As shown in Fig. 5, the combined general dataset (unbalanced) pre-training model achieved the best performance for transfer learning with a combined pre-training dataset, with an accuracy of 92.4%. This dataset combines all general pre-training datasets (COCO + VOC + Cityscapes + DSRS + Mapillary Vistas) and offers multiple labels, good segmentation annotations, and complex images (real-world images + urban images + road images). Overall, the combined general dataset (unbalanced) pre-training model achieved the best performance across all transferred tasks, the second best was the combined aerial imagery dataset pre-training model, and the third best was the Cityscapes pre-training model. The combined general dataset (unbalanced) exceeds the second-best dataset, the combined aerial imagery dataset, by 0.002 percentage points (92.4% − 92.398%), and the third-best dataset, Cityscapes, by 0.004 percentage points (92.4% − 92.396%). The difference between the combined general dataset (unbalanced) and the combined aerial imagery dataset is very small even though the difference in the number of images is very large: the combined general dataset (unbalanced) has 20,888 images, while the combined aerial imagery dataset has only 670 images. This suggests that a combined aerial imagery pre-training dataset derived from UAVs, with high image resolution, good segmentation annotations, and multiple labels, is well suited to transfer learning to the aerial imagery of natural disaster-affected areas dataset, provided that it has a large enough number of images. The combined aerial imagery dataset has similar characteristics to the evaluation dataset (FloodNet): both consist of UAV aerial imagery containing images of more complex urban and natural landscapes. However, this cannot be done yet because a special dataset of UAV aerial imagery that is large enough for pre-training has not been available to date.
We also found that the amount of data can be increased in practice simply by combining multiple datasets, and that the combined datasets improve performance compared to pre-training on only one dataset. Combining multiple datasets containing complex images (real-world images + urban images + road images) and multiple labels can improve accuracy. We also show that a carefully annotated composite of pre-training datasets is effective for training on aerial imagery datasets for semantic segmentation tasks.

B. Results of Deep Convolutional Networks (DCN) Performance Testing
After comprehensively tuning the DCN parameters, the data augmentation criteria, and the pre-training dataset criteria using the Grid Search algorithm for aerial imagery semantic segmentation of natural disaster-affected areas, which yields the best parameters and criteria with the highest accuracy, we apply them to two semantic segmentation network models, U-Net and PSPNet, to produce the most optimal DCN performance. We also carried out comprehensive testing of the PSPNet model with different numbers of layers to verify the relationship between the number of layers and performance improvements. We used PSPNet-(18, 34, 50, 101, 152).
Tests were carried out with several scenarios, and performance was compared with models using default parameters and criteria (baselines). Each PSPNet model at every depth and the U-Net model were tested in three scenarios: using the default parameters and criteria, using the best parameters, and using the best parameters and criteria. The default parameters and criteria for the U-Net model were obtained from [44], and the default parameters and criteria for the PSPNet model were obtained from [45]. To distinguish these scenarios, we append an abbreviation to the model name. The abbreviation "bp" means that the model uses the best parameters, while "bpc" means that the model uses the best parameters and criteria. A model without an abbreviation uses the default parameters and criteria (baselines). The models marked "bp" and "bpc" are the results of our proposed framework (this study).
The comparison of the results of the overall performance testing of the U-Net and PSPNet models is presented in Table III, and the results of the network model testing with intersection over union values for each object class are shown in Table IV.
Based on the test results shown in Table III and Table IV, the most optimal DCN performance is achieved by the PSPNet(152) (bpc) model, which uses the ResNet-152 architecture as the backbone. This network model fully uses the best parameters and criteria. The test results also show that the PSPNet(152) (bpc) model can detect and identify various objects with irregular shapes and sizes, can detect and identify various important objects affected by natural disasters, such as flooded buildings and roads, and can detect and identify small objects such as vehicles and pools, which is the most challenging task for semantic segmentation network models. This ability can be seen from the fairly high IoU value for each object class and the mIoU. The test results in this study show an increase in DCN performance in accurately producing aerial imagery semantic segmentation of natural disaster-affected areas.
Based on the test results, we learned that using the best parameters, appropriate data augmentation criteria, and suitable pre-training dataset criteria can significantly improve DCN performance in aerial imagery semantic segmentation of natural disaster-affected areas, compared to using only default parameters and criteria (baselines). In addition, both in the scenario that uses the best parameters and in the scenario that uses the best parameters and criteria, increasing the number of layers in the PSPNet-(18, 34, 50, 101, 152) model improves the performance of the network model, which can be seen from the rise in the mIoU value.
Our test results compare favorably with those of several studies in the literature review that also use FloodNet as an evaluation dataset for training, validation, and testing in recognizing aerial images of natural disasters with semantic segmentation. The study in [32] reported a highest mIoU value of 80.35% for the PSPNet(101) model, while our study produced a higher mIoU value of 81.71% for the same model. In [35], the mIoU value was 23.9% for the U-Net model and 46.65% for the PSPNet(101) model, while our study produced higher mIoU values of 55.22% for the U-Net model and 81.71% for the PSPNet(101) model. In [36], the highest mIoU value for the PSPNet(152) model was 56%, while in our study, the highest mIoU value for the PSPNet(152) model was 83.34%.
The most optimal DCN model using the best parameters and criteria, namely the PSPNet(152) (bpc) model, was validated using the k-fold cross-validation method to evaluate the performance and validate the accuracy of the model. The validation results are shown in Table V. To show visually how accurately the DCN model produces aerial imagery semantic segmentation of natural disaster-affected areas, we present a visual comparison of the DCN model using the best parameters and criteria in Fig. 6.
This study revealed an effective way of improving the performance of Deep Convolutional Networks (DCN) for aerial imagery semantic segmentation of natural disaster-affected areas. An experimental study was conducted using the parameters in DCN, the appropriate data augmentation criteria, and the suitable dataset criteria in pre-training to produce the most optimal performance. In this study, we have integrated the Grid Search algorithm and DCN, and validated the results with the k-fold cross-validation method. The Grid Search algorithm performs parameter tuning on DCN, data augmentation criteria tuning, and dataset criteria tuning for pre-training.
This study uses U-Net and PSPNet as DCN models for semantic segmentation. The results of this study indicate that the Grid Search algorithm obtained the best parameters and criteria and improved the performance of the two models used. The most optimal DCN performance is achieved by the PSPNet(152) (bpc) model, which fully uses the best parameters and criteria, with an accuracy of 98.99%, precision of 90.84%, recall of 90.47%, F1-score of 90.65%, and mean intersection over union (mIoU) of 83.34%. Validation using the k-fold cross-validation method on the most optimal DCN model, namely PSPNet(152) (bpc), resulted in an average accuracy of 99.04%, precision of 90.77%, recall of 90.92%, F1-score of 90.83%, and mIoU of 83.63%. The mIoU of the PSPNet(152) (bpc) model increased significantly, by 43.09 percentage points (83.34% − 40.25%), compared to using only default parameters and criteria (baselines). Likewise, the other models, namely U-Net (bpc), PSPNet(18) (bpc), PSPNet(34) (bpc), PSPNet(50) (bpc), and PSPNet(101) (bpc), show increased mIoU compared to using only default parameters and criteria (baselines).
The PSPNet(152) (bpc) model is able to detect and identify various objects with irregular shapes and sizes, various important objects affected by natural disasters such as flooded buildings and roads, and small objects such as vehicles and pools, which is the most challenging task for semantic segmentation network models. This capability can be seen from the fairly high IoU value for each object class, the mIoU value, and the visual results. This study also shows that increasing the number of layers in the PSPNet-(18, 34, 50, 101, 152) model improves the model's performance. The results of this study show that the proposed framework contributes to improving DCN performance to accurately produce aerial imagery semantic segmentation of natural disaster-affected areas.
We obtained several findings in this study: 1) a combined aerial imagery pre-training dataset originating from UAVs, with high image resolution, good segmentation annotations, and multiple labels, is well suited to transfer learning to datasets of aerial imagery of areas affected by natural disasters, provided that the number of images is large enough. For future research, we therefore suggest utilizing a special dataset of aerial imagery originating from UAVs at the pre-training stage for transfer learning to improve DCN performance; 2) the amount of data can be increased in practice simply by combining multiple datasets, and the combined datasets improve performance compared to pre-training on only one dataset. For future research, we suggest combining multiple datasets containing complex images (real-world images + urban images + road images) and multiple labels to improve accuracy; 3) carefully annotated combined pre-training datasets are effective for training on aerial imagery datasets for semantic segmentation tasks; and 4) using the best parameters, appropriate data augmentation criteria, and suitable pre-training dataset criteria can significantly improve DCN performance in aerial imagery semantic segmentation of natural disaster-affected areas, compared to using only default parameters and criteria (baselines).
We also see opportunities for further research. With the advent of Transformers, these models can also be explored for aerial imagery semantic segmentation of natural disaster-affected areas, and their results compared with DCN to obtain the most optimal performance.

Fig. 1 Flowchart of the grid search algorithm with k-fold cross-validation

Fig. 2 Overview of the proposed framework for improving the performance of deep convolutional networks for aerial imagery semantic segmentation of natural disaster-affected areas

Fig. 5 Transfer learning on the evaluation dataset (FloodNet) with all pre-training datasets

Fig. 6 Visual comparison of deep convolutional networks model for aerial imagery semantic segmentation of natural disaster-affected areas using the best parameters and criteria

TABLE II PARAMETERS AND CRITERIA USED IN THE GRID SEARCH ALGORITHM

Fig. 3 The test results of tuning parameters on deep convolutional networks using a grid search algorithm for aerial imagery semantic segmentation of natural disaster-affected areas

TABLE III PERFORMANCE TESTING RESULTS OF DEEP CONVOLUTIONAL NETWORKS MODELS FOR AERIAL IMAGERY SEMANTIC SEGMENTATION OF NATURAL DISASTER-

TABLE IV PERFORMANCE TESTING RESULTS OF DEEP CONVOLUTIONAL NETWORKS MODELS WITH INTERSECTION OVER UNION VALUE (IN %) FOR EACH OBJECT CLASS