Superpixel-Based Multiscale CNN Approach Toward Multiclass Object Segmentation From UAV-Captured Aerial Images

Unmanned aerial vehicles (UAVs) are promising remote sensors capable of reforming remote sensing applications. However, for artificial-intelligence-guided tasks, such as land cover mapping and ground-object mapping, most deep-learning-based architectures fail to extract scale-invariant features, resulting in poor accuracy. In this context, the article proposes a superpixel-aided multiscale convolutional neural network (CNN) architecture to avoid misclassification in complex urban aerial images. The proposed framework is a two-tier deep-learning-based segmentation architecture. In the first stage, the superpixel-based simple linear iterative clustering (SLIC) algorithm produces superpixel images with crucial contextual information. The second stage comprises a multiscale CNN architecture that uses these information-rich superpixel images to extract scale-invariant features for predicting the object class of each pixel. Two UAV-based aerial image datasets, 1) the NITRDrone dataset and 2) the urban drone dataset (UDD), are considered for the experiments. The proposed model outperforms the considered state-of-the-art methods with an intersection over union of 76.39% and 86.85% on the UDD and NITRDrone datasets, respectively. The experimental results show that the proposed architecture achieves better accuracy in complex and challenging scenarios.


I. INTRODUCTION
Manuscript received 16 November 2022; revised 26 December 2022; accepted 13 January 2023. Date of publication 26 January 2023; date of current version 13 February 2023. This work was supported in part by the project titled "Deep learning applications for computer vision task" funded by NITROAA with support of Lenovo P920 and Dell Inception 7820 workstation and NVIDIA Corporation with support of NVIDIA Titan V and Quadro RTX 8000 GPU, in part by the project titled "Applications of Drone Vision using Deep Learning" funded by Technical Education Quality Improvement Programme (referred to as TEQIP-III), National Project Implementation Unit, Government of India, and in part by the Project entitled "Computer vision-based smart solutions for UAV remote sensing applications through semantic segmentation" funded by Vishlesan I-Hub Foundation, IIT Patna (NMCPS-DST), Government
Our world has come a long way since the launch of the first satellite into space, and we are in an era fifty centuries ahead of it, significantly changing our daily lives. The technological advancement in space technology and the remote sensing (RS) sector can be gauged from the ever-growing number of satellites operated around the earth from 1957 to date. As per one statistic, a remarkable jump of 1070 satellites was recorded from 2019 to 2020, bringing the total number of satellites orbiting the earth to 3368 [1]. A massive number of very high-resolution (VHR) images are generated daily by earth observation satellites, such as the WorldView series, Landsat series, and RESOURCESAT [2], [3]. These captured images have been used to address many societal issues at a higher level through different RS applications. However, certain gaps in satellite-based RS applications make it difficult to observe tropical regions, which are mostly covered by clouds [4].
This opens up a space for the new edge remote sensors in the form of unmanned aerial vehicles (UAVs) that can genuinely improve the spatial, temporal, and spectral resolution of satellite-captured data at different scales. UAVs can help satellites overcome their limitations and accomplish particular tasks through real-time assessment and monitoring actions in different scenarios. These small devices have taken their usage to a whole different level, managing various issues of our day-to-day lives through several RS applications, such as traffic management, urban management in smart cities, land cover classification, fishery management, forest area management, etc., at a lower scale as compared to the satellites.
Most images captured by UAVs are of high resolution and provide a detailed view of a particular area in a scene. Among the several image analysis tasks for UAV RS images, semantic segmentation is an emerging and challenging area for computer vision researchers. Here, the task is to predict the pixel-level object class according to the semantic information represented by that pixel in the captured aerial image. Recent years have witnessed tremendous progress in deep-learning-based approaches like CNNs, which have proved their significance in addressing semantic segmentation tasks [5], [6], [7].
UAV-based aerial image analysis systems differ from satellite image analysis systems in their use cases and in their approaches to solving tasks in various application domains. Some of these applications include detecting objects such as roads, buildings, vegetation, and vehicles that play a vital role in critical applications like military target identification, damage estimation, and rescue operations in natural disasters [8], [9]. Therefore, a robust aerial image segmentation algorithm is needed for such critical tasks. However, several inherent challenges, such as image resolution, large field of view (FOV), and diversified and complex backgrounds, make UAV-inspired segmentation tasks more challenging. Many popular semantic segmentation frameworks designed for satellite-captured images are unsuitable for UAV-borne RS image-based tasks, generally because of the specificity of UAV-captured RS images. Another limitation of these popular approaches is that the purpose of UAV-inspired RS differs from that of satellite RS. Satellite-borne RS images focus on object extraction and land cover analysis over a larger area. In contrast, UAV-borne RS images are meant to extract information at a smaller scale in a smaller area. Hence, these large numbers of high-resolution UAV RS images aim to analyze objects more accurately, because UAV-borne RS images possess richer contextual information for addressing UAV RS-inspired tasks.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A. Motivation
1) Motivation of Using Superpixel Algorithm: A group of pixels can be termed a superpixel, where the members of the superpixel share some common attributes compared to the nonmembers. As the definition suggests, superpixel techniques are beneficial for image segmentation tasks [10], [11]. Superpixel images have several advantages, such as reducing the computational cost by letting one superpixel represent the pixels inside it. Thus, they can be used to reduce the overhead incurred by deep learning frameworks in terms of time and memory. Similarly, superpixels can extract essential regional features, which are more distinctive than the standard pixelwise features used in several computer vision tasks. They are adaptive in shape and size and contain more local and spatial features [12]. With these properties, superpixels can be generated at different scales, which can be used in multiscale-inspired applications with specific parameter settings [13].
2) Motivation of Using a Multiscale Architecture: Traditional segmentation techniques usually suffer from a low generalization ability when producing high-quality segmentation maps. Thus, developing a robust deep-learning-based framework is essential to strengthening the aerial image segmentation process. However, certain underlying complications in these deep learning frameworks could lead to false classification. The issues lie within the process through which the image patches are fed to the architecture during the training phase. With strict image sizes, the CNN architecture misses many high-level feature sets, thereby losing crucial contextual information. These missing features are essential in multiobject semantic segmentation, especially in aerial images. The different flight heights of a UAV can create ambiguity for a model, leading to poor generalization for several small-scale objects. This is where a multiscale sampling process can help in extracting and gathering the spatial-level object features. Multiscale features are desirable to realize the abstraction of the image at different scales. Introducing a multiscaling process to the CNN framework can help it learn multiple heterogeneous scale-invariant features, which can lower the misclassification rate.

B. Contribution
In this work, we have proposed a multiscale CNN framework for UAV-captured images. Additionally, the proposed approach benefits from the simple linear iterative clustering (SLIC)-inspired superpixel techniques to generate the superpixel images, which act as the input for the multiscale CNN architecture. Some of the major contributions of this work are summarized as follows.
1) The proposed deep-learning framework is a two-staged architecture for aerial scene segmentation. The first stage uses the UAV-captured images to perform coarse-level segmentation using the SLIC superpixel technique to generate superpixel images. Superpixels carry more spatial information than normal pixels and provide a more compact and convenient representation. Hence, they are useful for computationally demanding applications.
2) In the second stage, a multiscale CNN architecture is proposed to analyze the given superpixels for pixel-level classification. Here, the superpixel images are sampled at different scales and fed to the multiscale module to extract the scale-invariant features needed for multiclass segmentation.
3) The proposed model is evaluated on two UAV-borne aerial image datasets to ensure the robustness of the proposed architecture in real-world settings.
4) Moreover, the model is also evaluated by changing some important parameters to show its improved behavior with the superpixel and multiscale convolution in detecting small-scale ground objects.
The rest of the article is organized as follows: The existing semantic object segmentation approaches are discussed in Section II. Section III presents the different methodologies used in the proposed approach. Sections IV and V discuss the detailed structure of the implementation and an overview of the obtained results, respectively. A discussion is added as Section VI. Finally, Section VII briefly describes the conclusion drawn from the article.

II. RELATED WORK AND BACKGROUND STUDY
This section briefly discusses the different approaches proposed by researchers in aerial scene understanding. It also covers the evolution of deep learning and multiscale learning algorithms for aerial scene understanding problems.

A. Traditional Approaches in Aerial Image Segmentation
Aerial images are images of the earth captured from above. Spaceborne remote sensors, or satellites, were the only option until UAV-based technology stepped in to relieve the load placed on satellites at smaller scales. UAVs have been widely used in various RS applications, such as the detection of ground objects (cars, roads, buildings, trees, and pedestrians), which is an essential aspect of many projects, viz. agriculture mapping, urban mapping, and forest mapping. UAV images are more complicated than satellite images because of the detailed and dense population of diversified objects, which makes the task more complex and challenging. Earlier research by computer vision researchers relied on rule-descriptor-based methods for object-level feature extraction, specifically in building extraction [14] and road detection [15], [16]. However, due to poor generalization on aerial data, the hierarchical rule-based approaches miss several significant features. Conventional classifiers employed machine learning techniques that extract local features from the input pixel intensities through simple arithmetic combinations [17], [18]. Researchers also proposed discriminative classifiers like boosting and random forest to evaluate the redundant local feature maps for training purposes [19], [20], [21]. In an aerial image segmentation problem, global features are as essential as local features [22]. In [23], Ortner et al. used marked point processes to build architectural models and road network topologies through probabilistic priors defined for global knowledge gain. Conditional random fields (CRFs) are also used for object-level segmentation and detection from aerial images [24]. Similarly, Wang et al. [25] proposed a fusion approach using a superpixel-based labeling technique and a Markov random field for aerial video segmentation.

B. Deep Learning Approaches in Aerial Image Segmentation
Unlike conventional machine-learning techniques, deep-learning algorithms have no requirement for feature definition steps. They learn the critical distinguishing features from an input dataset according to the provided task. These methodologies were proposed back in the 1980s, when computing power and available training data were limited [26], [27]. These algorithms made their return [28] in 2012 and achieved impressive outcomes in the ImageNet challenge [29], creating hope and many opportunities in the research community. In the proposed baseline models, several layers are stacked on one another to learn and analyze the essential local-global feature sets from the input images. One of the crucial aspects of deep CNN architectures lies in their ability to parallelize both training and inference on GPUs.
CNNs started their journey with the image classification problem, and in a short period, they have successfully addressed computer vision problems like object detection [30], tracking [31], and object-level segmentation [32]. The usage of convolutional network frameworks has not been restricted to classical image classification tasks but can also be seen in aerial scene parsing using RS images [33]. Common RS tasks in this domain include building extraction [34], [35], road network extraction [36], [37], [38], and vegetation extraction [39]. Aerial scene understanding based on an encoder-decoder-based fully convolutional network (FCN) structure is proposed in [5] and [40]; it yields an explicitly labeled image depicting the contexts associated with each pixel. Then, the extracted feature maps propagate through an expansion module to upsample the reduced image back to the original resolution. In [41], Xie et al. proposed a multiscale densely-connected CNN architecture for RS-based hyperspectral aerial image (HSAI) classification.
Similarly, Fan et al. [42] have presented a superpixel-aided deep-sparse-representation technique to construct hierarchical architecture to understand HSAI context information. This gathered information (features) obtained from the multilayered network is concatenated and trained by a support vector machine classifier. Moreover, UAV usage is increasing for small-scale applications and collected data have been utilized in many crucial RS applications. Computer vision researchers [43], [44] have provided several solution approaches to address the existing issues using deep learning-based architectures. Authors have recommended a deep-learning-based framework inspired by Fast R-CNN and Faster R-CNN for vehicle extraction from aerial images [45]. The two networks are combined to gather important feature space, which can be used to detect vehicles semantically. Moreover, datasets are the backbone of the success behind deep learning frameworks. A thorough and detailed analysis of the available UAV image datasets for computer vision researchers to conduct research toward UAV-inspired applications is presented in [46].

III. PROPOSED METHODOLOGY
The article proposes a superpixel-aided multiscale deep learning framework that semantically segments the aerial images captured by UAVs. This section discusses each module used in the proposed deep architecture.

A. Overview
The proposed Superpixel_MCNN_AerialSegNet framework consists of two modules: 1) a superpixel module and 2) a multiscale CNN module to work on extracting the scale-invariant features. At the backend of the architecture, the superpixel algorithm works to determine the essential scale-invariant features. As the first phase of the segmentation process, the superpixel technique narrows down the texture and color-based features. These extracted features are considered the input to the second phase of the proposed framework, where these superpixel images are used to produce the final segmentation map. The superpixel images help the deep learning architecture to be implemented quickly, reducing the overall training and validation/testing time (in most instances). The architectural overview is presented in Fig. 1. Each module of the proposed architecture is explained in the following sections.

B. Superpixel Method
A number of pixels sharing common characteristics can be referred to as a superpixel. They can carry more information than simple pixels and provide a more convenient and compact representation that could be useful for computationally demanding applications. Some of these applications include medical imaging [47], object detection, scene segmentation, video surveillance, etc. Among the superpixel algorithms, SLIC has been widely used in various application platforms [48], [49].
Generally, SLIC-based superpixel algorithms generate relatively uniform and compact superpixels based on the spatial and color proximity of pixels in an image plane. Clustering is performed in the five-dimensional [labxy] space, where [lab] represents the pixel color vector and [xy] indicates the position of a pixel. The color and spatial components should be normalized so that the Euclidean distance can be employed in 5D space. Hence, the maximum spatial distance within a cluster should lie within a sampling interval $S$, which can be represented as follows:
$$S = \sqrt{N/K}$$
where $N$ is the number of pixels in the input image, $K$ is the number of superpixels required, and $N/K$ is the approximate area of a superpixel. The superpixel algorithm takes the desired number $K$ of approximately equal-sized superpixels. The cluster centers $C_k$ can be represented as
$$C_k = [l_k, a_k, b_k, x_k, y_k]^T$$
where $k$ varies from 1 to $K$ at a regular interval of $S$ within a grid. The spatial extent of a superpixel is generally $S^2$ (the approximate area of a superpixel). Thus, it can be assumed that the pixels associated with a cluster center fall within a $2S \times 2S$ area around the superpixel center on the xy plane. Hence, the normalized distance ($D_s$) can be calculated as the sum of the lab color space distance ($d_{lab}$) and the xy plane distance ($d_{xy}$) normalized by the grid interval $S$, and is given as follows:
$$D_s = d_{lab} + \frac{m}{S}\, d_{xy}$$
where
$$d_{lab} = \sqrt{(l_k - l_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2}$$
$$d_{xy} = \sqrt{(x_k - x_i)^2 + (y_k - y_i)^2}$$
Like the spatial distance, the color-related distance plays a crucial role in estimating the normalized distance ($D_s$) in the SLIC algorithm. Estimating color distance is a complex task, as the color-based distance may vary rapidly from cluster to cluster and from image to image. Thus, to avoid such a problem, a constant $m$ is introduced that controls the compactness of a superpixel. The higher the value of $m$, the more compact the cluster is. Reducing the compactness factor $m$ (which lies within [5, 40]) gives images that are more closely related to the original images, keeping the relevant object features.
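To make the role of the compactness constant concrete, the following minimal Python sketch evaluates the sampling interval and the normalized distance for one pixel and one cluster center. The function names, the 512 × 512 image size, the value K = 1024, and the sample lab/xy values are illustrative assumptions, not settings taken from the experiments.

```python
import math

def sampling_interval(num_pixels, num_superpixels):
    """Grid interval S = sqrt(N / K)."""
    return math.sqrt(num_pixels / num_superpixels)

def slic_distance(center, pixel, m, S):
    """D_s = d_lab + (m / S) * d_xy for 5-D points ordered as (l, a, b, x, y)."""
    d_lab = math.dist(center[:3], pixel[:3])   # colour distance in lab space
    d_xy = math.dist(center[3:], pixel[3:])    # spatial distance on the xy plane
    return d_lab + (m / S) * d_xy

# Assumed toy setting: a 512 x 512 image split into K = 1024 superpixels.
S = sampling_interval(512 * 512, 1024)
center = (50.0, 10.0, 10.0, 100.0, 100.0)
pixel = (53.0, 14.0, 10.0, 106.0, 108.0)

loose = slic_distance(center, pixel, m=5, S=S)    # colour term dominates
tight = slic_distance(center, pixel, m=35, S=S)   # spatial term dominates
print(S, loose, tight)
```

With a larger m, the spatial term weighs more heavily in the distance, so pixels far from a center are less likely to join its cluster, which is what makes the resulting superpixels more compact.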
The superpixel module acts as the first-level optimizer to transform complex aerial images into more compact-sized superpixel images. Pixels representing a single superpixel share similar visual attributes in a superpixel image. Thus, the superpixel images carry more information values than the usual ones. The UAV-based VHR aerial imageries are given as inputs to the superpixel module to produce the superpixel images using the SLIC-based algorithm. It is a linear-time algorithm and can generate superpixel images that are lightweight in terms of memory space, thus consuming less storage space. They can provide a compact and convenient representation of standard images, which can be very useful for computationally demanding applications that process RS images in a low-bandwidth environment. Further optimization takes place at the CNN module on these superpixel images.

C. Convolutional Neural Network
Convolutional neural networks (also known as CNNs/ConvNets) are enhanced neural networks most commonly applied to analyze visual images. The structure of a CNN is distinctive: convolutional layers are stacked upon one another, interleaved with a few pooling layers, and followed by a few fully-connected layers (for image classification) or upsampling layers (for image segmentation). The convolutional layer is the core of a CNN, which extracts the high-level features through the local perception and weight-sharing mechanism of the kernels/filters. The pooling layer is another essential component of a CNN and sits between two convolutional layers, like the filling of a sandwich. It is used to enhance efficiency and avoid overfitting during training. It downsamples the input feature map using a nonlinear max function, which reduces the number of parameters to be used for calculations in the following convolutional layers. The deep architecture used in the proposed multiscale CNN (MCNN) approach is an encoder-decoder-based convolutional framework (also known as AerialSegNet [50]) that is composed of the following four stages.
1) Contraction path: The input RGB images get decomposed to provide spatial and temporal features through convolution operations.
2) Dense modules: Each stage of the architecture contains densely connected modules to pass the learned feature maps to the follow-up stages to enhance the feature set without increasing the number of parameters.
3) Bottleneck layer: At this stage, the extracted features from the contraction path are then fed to the decoder blocks in the expansion path.
4) Expansion path: Here, the shrunken image (in the encoder path) is reshaped to its original shape to produce the desired segmented map through some deconvolution operation (using transpose convolution or bilinear interpolation techniques).
The architecture overview can be seen from the middle blocks in Fig. 1, where the combined use of dense and skip connections can be observed.
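The max-based downsampling performed by the pooling layer, described before the four-stage list, can be illustrated with a small sketch. The 2 × 2 window with stride 2 is an assumption made for illustration; the text does not state the pooling window size used in AerialSegNet.

```python
def max_pool_2x2(feature_map):
    """Downsample a 2-D list by taking the max over non-overlapping 2x2 windows.

    A trailing odd row or column, if any, is dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 5], [6, 7]]
```

Each pooling step of this kind halves the height and width of the feature map, so the next convolutional layer operates on a quarter of the activations.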

D. Multiscale Module
The correctness of an image segmentation model depends on integrating pixel-level accuracy with multiscale context reasoning. Deep CNNs combine multiscale context feature maps through consecutive pooling and convolution layers, which reduce image resolution [28]. Moreover, the dense/deeper layers require more context information in addition to full resolution [51]. The input images can be downscaled and upscaled with a proper interpolation technique to obtain the multiscale resolution images (Fig. 2). As shown in Fig. 1, these multiscale images are given as inputs to the corresponding CNN modules to obtain the scale-invariant feature sets. Each CNN framework processes an image scene at a different scale, extracting multiscale feature maps, which are further aggregated to form a multiscale context feature map that can predict the pixel-level object class. The aggregation is performed through resizing and concatenation to keep the process simple. The aggregation process can be understood from the following equations, where
$$D_s = ds_{f_1}(I_{img}) + ds_{f_2}(I_{img}) + \cdots + ds_{f_m}(I_{img})$$
$$U_s = us_{1/f_1}(I_{img}) + us_{1/f_2}(I_{img}) + \cdots + us_{1/f_m}(I_{img})$$
Here, ds and us represent downsampling and upsampling of an input image, respectively. Similarly, f, I img , and M img denote the scale factor used for downsampling or upsampling, input image, and the obtained multiscale feature map, respectively.
In our experiment, we have used 512 × 512 image tiles as input, which are then upscaled and downscaled by a factor of 2 to get 256 × 256, 1024 × 1024 resolution images. All these three different resolution images are trained individually through the encoder-decoder CNN architecture to fetch the multiscale feature maps that decide the pixel class.
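The sampling and the resize-and-concatenate aggregation can be sketched as follows. This is an illustrative sketch only: nearest-neighbour interpolation, the helper names, and the use of the resampled inputs themselves in place of learned CNN feature maps are all assumptions, and an 8 × 8 grid stands in for a 512 × 512 tile.

```python
def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list to (out_h, out_w)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def multiscale_inputs(tile, factor=2):
    """Return the downscaled, original, and upscaled versions of a tile."""
    h, w = len(tile), len(tile[0])
    return {
        "down": resize_nearest(tile, h // factor, w // factor),
        "base": tile,
        "up": resize_nearest(tile, h * factor, w * factor),
    }

def aggregate(feature_maps, out_h, out_w):
    """Resize each per-scale map to a common size and stack them per pixel."""
    resized = [resize_nearest(fm, out_h, out_w) for fm in feature_maps]
    return [
        [[fm[r][c] for fm in resized] for c in range(out_w)]
        for r in range(out_h)
    ]

# An 8 x 8 grid stands in for a 512 x 512 tile (4 x 4 and 16 x 16 for the other scales).
tile = [[r * 8 + c for c in range(8)] for r in range(8)]
scales = multiscale_inputs(tile)
# Stand-in for the per-scale CNN outputs: the resampled inputs are reused directly.
fused = aggregate(list(scales.values()), 8, 8)
print(len(scales["down"]), len(scales["up"]), fused[7][7])
```

In a real pipeline, each entry of `scales` would first pass through its own encoder-decoder network, and the concatenated per-pixel vector would feed the final classification layer.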

IV. EXPERIMENTATION
In order to assess the performance of the proposed ensemble superpixel-MCNN architecture, extensive experimentation has been conducted on the NITRDrone scene understanding dataset, which is described in Section IV-A. Moreover, the proposed approach is compared to some of the chosen state-of-the-art methodologies for semantic segmentation tasks, viz. [5], [6], [7], [40], [52].

A. Data Description
To perform the experimentation, we have considered the following two datasets.
1) NITRDrone Dataset: The NITRDrone dataset [53] was proposed and built to satisfy the rising demand for UAV-based scene understanding applications that use semantic-segmentation-based techniques. The dataset contains around 101 VHR images of variable resolution captured with DJI Phantom 4 and DJI Mavic Mini drones, having a ground sampling distance (GSD) of 0.025 sq.cm/pixel. The resolution of an image in the dataset can be any of the following: 1280 × 720, 4000 × 3000, or 4096 × 2160. A pixel can belong to any of the four considered classes, named "road," "vegetation," "occluded_road," and "_background_." Some of the sample images and their corresponding ground truths are presented in Fig. 3.

2) Urban Drone Dataset (UDD): The UDD is a UAV-based image dataset that was proposed by Chen et al. [54] for semantic segmentation problems in computer vision. The dataset was collected by a DJI Phantom 4 UAV operated at an altitude of 60 m to 100 m. The resolution of each image in the dataset is either 3000 × 4000 or 4096 × 2160. The dataset is divided into three types, UDD-3, UDD-5, and UDD-6, which have three, five, and six classes, respectively. Of these, we have considered UDD-5, on which the proposed model is implemented and validated. UDD-5 has five pixel classes, named vegetation, buildings, roads, vehicles, and others (denoting the rest of the objects in a scene other than the mentioned classes). The dataset comprises two sets, a training set and a validation set, consisting of 160 and 45 image frames, respectively. Sample images and the masks are shown in Fig. 4.

B. System Setup
The first stage of the proposed architecture uses the SLIC superpixel algorithm to produce superpixel images. These superpixel images are the inputs for the second stage and are sampled at different scales and fed to multiple deep CNN frameworks, which are then trained to extract the features required to classify each pixel into one of the four classes in the NITRDrone dataset and one of the five classes in UDD. The flow of operations of the experimentation is presented in Fig. 5. The implementation and validation of the proposed architecture are carried out on the datasets mentioned above and compared with the benchmark and peer-reviewed state-of-the-art methods. All the considered models are implemented with the deep learning library PyTorch [55] and are trained on an NVIDIA TITAN V graphics card with 12 GB of GPU memory.

C. Dataset Preprocessing
The proposed architecture is evaluated on the semantic drone datasets NITRDrone [53] and UDD [54]. The images of the considered datasets have different resolutions, such as 1280 × 720, 4000 × 3000, and 4096 × 2160. Hence, we apply a sliding window technique with a constant stride over these images to extract image tiles of 576 × 576 from both datasets. Through this operation, we generate around 3470 images from the NITRDrone dataset and 3500 images from the UDD dataset. Out of the total number of images extracted from the NITRDrone dataset, we have considered 2590 and 880 images as training and testing sets, respectively. Similarly, for the UDD, 3100 images are considered for training the model, and the remaining 400 images are equally divided between the validation and testing sets.
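The sliding-window tiling can be sketched as below. The text specifies only a constant stride, so the stride value used here (equal to the tile size, that is, non-overlapping windows) and the function name are assumptions for illustration.

```python
def tile_origins(width, height, tile=576, stride=576):
    """Top-left (x, y) corners of every full tile that fits inside the image."""
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]

origins = tile_origins(4000, 3000)     # one 4000 x 3000 frame
small = tile_origins(1280, 720)        # one 1280 x 720 frame
print(len(origins), len(small))
```

A smaller stride produces overlapping tiles and therefore more training samples per frame, so the tile counts reported above depend on the stride actually chosen.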

D. Preprocessing With SLIC
This is the first phase of segmentation in our proposed architecture. The image tiles produced by the sliding window are fed to this module. One of the popular superpixel algorithms, SLIC, is applied to produce semisegmented superpixel images. There are two important parameters of the SLIC algorithm, N and m, representing the number of superpixels in a superpixel image and the compactness control parameter, respectively. They play a vital role in preserving the natural properties of the ground objects. We have considered different combinations of N and m to find the best combination with which the SLIC algorithm can be applied to the raw input images while preserving the integral properties of the objects to be segmented. The values of N and m are taken from N = [500, 1000] and m = [5, 15, 25, 35]. Thus, eight types of superpixel images can be generated from this module, which will be the inputs for the next stage of CNN implementation.
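The eight parameter settings follow directly from the Cartesian product of the two value lists, as this short sketch shows. The variable names are illustrative.

```python
from itertools import product

superpixel_counts = [500, 1000]        # N: number of superpixels per image
compactness_values = [5, 15, 25, 35]   # m: compactness control parameter

slic_settings = list(product(superpixel_counts, compactness_values))
for n, m in slic_settings:
    print(f"N={n}, m={m}")
```

Each pair would then be passed to a SLIC implementation; for example, scikit-image exposes `slic(image, n_segments=..., compactness=...)`, although the text does not state which implementation was used.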

1) Input Preprocessing: The images from the superpixel module can be denoted as Image X. These images are collected at the CNN module, where they pass through a simple preprocessing step before being considered for training. Each Image X is reduced from 576 × 576 to 512 × 512 (denoted as Image Y) through random cropping or center cropping. These cropped images (Image Y) are used by the deep CNN framework to train individual models. The resolution of the input and target images for the proposed architecture and the state-of-the-art methods remains the same. The only difference lies in the type of images considered in both cases: The proposed approach uses superpixel aerial images, whereas the state-of-the-art models use stock aerial imagery.
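The two cropping options can be sketched as follows, using nested lists as a stand-in for an image array. The helper names and the `rng` parameter are hypothetical and introduced only for illustration.

```python
import random

def crop(image, top, left, size=512):
    """Cut a size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]

def center_crop(image, size=512):
    h, w = len(image), len(image[0])
    return crop(image, (h - size) // 2, (w - size) // 2, size)

def random_crop(image, size=512, rng=random):
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)     # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    return crop(image, top, left, size)

# Image X: a 576 x 576 tile whose entries record their own (row, column).
image_x = [[(r, c) for c in range(576)] for r in range(576)]
image_y = center_crop(image_x)
image_y_random = random_crop(image_x, rng=random.Random(0))
print(len(image_y), len(image_y[0]), image_y[0][0])
```

When random cropping is used, the same offsets must be applied to the target mask so that every pixel keeps its label.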
2) Target Preprocessing: The corresponding targets, or masks, need to be reduced to 512 × 512 in the same way as the input images described in the previous section. The training is performed with one-hot coded masks. The one-hot coded target images are then color-coded with a different color for each object class for better visualization. These RGB color-coded masks are presented in Figs. 3 and 4.
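A minimal sketch of the one-hot coding and color coding of a mask is given below. The class index order and the palette colors are hypothetical; the text does not list the actual colors used in Figs. 3 and 4.

```python
def one_hot(mask, num_classes):
    """Turn an H x W mask of class indices into H x W x num_classes 0/1 vectors."""
    return [
        [[1 if label == k else 0 for k in range(num_classes)] for label in row]
        for row in mask
    ]

# Hypothetical palette for four classes (colours chosen only for illustration).
PALETTE = {0: (0, 0, 0), 1: (128, 64, 128), 2: (0, 128, 0), 3: (255, 0, 0)}

def colorize(mask, palette=PALETTE):
    """Map each class index to an RGB triple for visualization."""
    return [[palette[label] for label in row] for row in mask]

mask = [[0, 1], [2, 3]]
encoded = one_hot(mask, num_classes=4)
rgb = colorize(mask)
print(encoded[1][0], rgb[0][1])  # [0, 0, 1, 0] (128, 64, 128)
```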
3) Implementation Details: Adaptive moment estimation (Adam) [56] is used as the optimizer, and cross-entropy (CE) is used as the loss function. The learning rate is initialized to 5e-3, whereas momentum and batch size are set to 0.9 and 3, respectively. A weight decay of 0.002 is introduced to handle the problem of overfitting in the long run during training. The ReLU activation function [57] is employed to improve the convergence and accuracy of the network. Moreover, two types of augmentation techniques are applied to the superpixel images, doubling the number of images and thus extending the training process. The learning rate is set to be decreased after every 30 epochs by a factor of 0.002 to maintain the regularization. The training operation continues until the learning rate reaches $10^{-20}$. After 450 epochs, the proposed architecture converges, which is observed through minor changes in loss and accuracy.

F. Loss Function
The choice of the loss function is vital in carrying out neural network-based optimization. The loss-weighting scheme of the network architecture targets the interior pixels and the border of the segmented objects. The CE loss, also known as logarithmic or logistic loss, is chosen to train the baseline models. The predicted class probability is compared with the desired class output, 0 or 1. The corresponding loss for each pixel class is computed to measure the deviation from the actual (true) value. As a penalty, the error is propagated backward to correct the weights, giving a better understanding of the object feature map. The SoftMax function ($S_i$) is also used with CE, which aims at minimizing the loss during training, i.e., the smaller the loss value, the better the model. CE can be defined as follows:
$$CE = -\sum_{i} T_i \log(S_i)$$
where $T_i$ and $S_i$ are the truth value $\in \{0, 1\}$ and the SoftMax probability for the $i$th class, respectively.
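For a single pixel, the SoftMax probabilities and the CE loss can be computed as in the following sketch. The logits are made-up values, and the small `eps` term is an added numerical safeguard that is not part of the definition above.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """CE = -sum_i T_i * log(S_i) for a one-hot target T and probabilities S."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

probs = softmax([2.0, 1.0, 0.1, -1.0])        # four classes, as in NITRDrone
loss_correct = cross_entropy([1, 0, 0, 0], probs)   # true class has the top score
loss_wrong = cross_entropy([0, 0, 0, 1], probs)     # true class has the lowest score
print(loss_correct, loss_wrong)
```

The loss is small when the true class receives a high probability and grows as that probability approaches zero, which is the deviation that backpropagation then reduces.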

G. Tasks and Metrics
The primary objective of the proposed Superpixel_MCNN_AerialSegNet framework is scene parsing, i.e., segmenting the aerial images into the given number of object classes (four for the NITRDrone dataset and five for the UDD). Both quantitative and qualitative results play vital roles in analyzing the architecture's performance. Widely accepted performance metrics, namely precision, recall, F-score, intersection over union (IoU), and overall accuracy, are used to examine the performance of the proposed framework. These metrics can be formulated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot P \cdot R}{P + R}$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad A = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, P, R, F, IoU, and A represent precision, recall, F-score (dice score), intersection over union, and overall accuracy, respectively. TP and TN stand for true positive and true negative, respectively, i.e., the numbers of pixels whose predicted label agrees with the ground truth. Additionally, FP and FN denote false positive and false negative, respectively.
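As a minimal sketch, the metrics can be computed for one class from flattened prediction and ground-truth label lists. The standard definitions of precision, recall, F-score, IoU, and accuracy are used; the function names and the toy labels below are illustrative assumptions, and the class is treated in a one-vs-rest manner.

```python
def confusion_counts(pred, truth, cls):
    """Count TP, FP, FN, TN for class `cls` (one-vs-rest) over flat label lists."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def segmentation_metrics(tp, fp, fn, tn):
    """Return precision, recall, F-score, IoU, and accuracy (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn) if tp + tn + fp + fn else 0.0
    return {"P": precision, "R": recall, "F": f_score, "IoU": iou, "A": accuracy}

# Toy example: six pixels, metrics evaluated for class 1.
pred = [1, 1, 0, 0, 1, 2]
truth = [1, 0, 0, 0, 1, 1]
counts = confusion_counts(pred, truth, cls=1)
scores = segmentation_metrics(*counts)
print(counts, scores)
```

Mean scores such as mIoU would then be obtained by averaging the per-class values over all classes.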

V. EXPERIMENTAL RESULTS AND OBSERVATION
This section presents the results obtained with the proposed model, compares them extensively with state-of-the-art methodologies, and highlights the improvements achieved by the proposed architecture in semantically segmenting the object classes from UAV images.

A. Observation
As discussed earlier, the SLIC-based superpixel algorithm is governed by two core parameters, N and m. N decides the number of superpixels in a superpixel image, whereas m is a scale variation parameter that plays an important role in determining the size of a superpixel. The value of m falls within the range [5, 35]. The greater the value of m, the more compact the cluster.
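The influence of m on compactness can be illustrated with the distance measure of the original SLIC formulation, $D = \sqrt{d_c^2 + (d_s / S)^2 m^2}$, where $d_c$ is the color distance, $d_s$ the spatial distance, and $S$ the grid interval derived from the number of pixels and N. The sketch below assumes that this standard measure is the one in use; the function names and sample values are illustrative.

```python
import math

def grid_interval(num_pixels, n_superpixels):
    """Approximate spacing S between superpixel centers on a regular grid."""
    return math.sqrt(num_pixels / n_superpixels)

def slic_distance(color_a, color_b, xy_a, xy_b, m, s):
    """Standard SLIC distance: color term plus spatial term weighted by m / S."""
    d_c = math.dist(color_a, color_b)  # distance in color space
    d_s = math.dist(xy_a, xy_b)        # distance in the image plane
    return math.sqrt(d_c ** 2 + (d_s / s) ** 2 * m ** 2)

# Illustrative pixel/center pair in a 512 x 512 image with N = 1000.
s = grid_interval(512 * 512, 1000)
for m in (5, 15, 25, 35):
    d = slic_distance((50, 10, 10), (55, 12, 8), (100, 100), (108, 106), m, s)
    print(m, round(d, 3))
```

As m grows, the spatial term dominates the distance, so pixels are assigned to the nearest center in the image plane and the resulting clusters become more compact.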
In the experiment, we consider m ∈ {5, 15, 25, 35}. That is, when m = 5, each image patch (superpixel) in the superpixel image is of size 5 × 5. Small-scale patches (m = 5) expose the features inside a superpixel efficiently. In contrast, large values (m = 25 or 35) work better at the border regions of an object, distinguishing it from the others. This exposes the training block to meaningful, distinctive features of the object it needs to segment. To demonstrate the efficiency, every possible combination of m and N is considered. The produced superpixel images are then given as input to the CNN block, and the outcomes are listed in Tables I and II.

[Table I: Comparison of performance evaluation of various state-of-the-art mechanisms on superpixel images of the NITRDrone dataset]

The aim of the multiscale design (see Fig. 1) is to have a vast feature space of the ground objects at different scales, sampled through the multisampling process. The ConvNet module in the MCNN architecture is an encoder-decoder architecture inspired by skip connection mechanisms (dense module within a stage) that pass the parameters learned in an encoder stage to the equivalent decoder stage. This architecture can capture edge-level object features as well as objects of the imbalanced occlusion class. These features are crucial for a segmentation task, as even a few misclassified pixels may affect the accuracy of the architecture.

B. State-of-the-Art Comparison
The proposed model is also compared with state-of-the-art methodologies using the evaluation metrics described in the previous section. The baseline models are validated with the raw input images considered for the multiscale CNN and AerialSegNet architectures. These models are trained for around 450 epochs until convergence. The results obtained from the experiments are listed in Tables III and IV, and the qualitative results are presented in Figs. 6 and 7.
As shown in Figs. 6 and 7, among the state-of-the-art models, U-Net [40], FCN-32s [5], and DeepLabV3+ with an Xception backbone [7] segment the vegetation and road class pixels well, achieving a reasonable IoU score of 75%–80%. However, they fail to capture the outlines of objects of different classes, resulting in a slight drop in accuracy. This is where the multiscale feature fusion technique is useful: it aggregates the scale-invariant features that recover the missing feature sets, thus improving the accuracy. Hence, it can be concluded from Tables III and IV that the proposed model performs better than existing methods, such as [5] and [40], both in classifying pixels and in achieving smoother object boundaries. Along with performance measures such as F-score and IoU, we also consider precision and recall. Table III shows that, in terms of mean precision (mPrecision), DeepLabV3+ with Xception [7] performs better on the NITRDrone dataset than the proposed architecture. However, the table also shows a large gap between its recall and precision scores, which is unacceptable for a semantic segmentation task. In contrast, the proposed framework maintains a decent balance of true positives and true negatives, as can be judged from the scores in Table III. The improvement achieved through the proposed architecture is also presented in Table V.

VI. DISCUSSION
To provide a more comprehensive comparison of the proposed approach, experimental observations on external factors, such as space complexity, are also noted and discussed in this section.

A. Parameter Comparison
The number of learnable parameters plays a crucial role in assessing the performance of a model in terms of speed (see Table VI). It can be observed that the state-of-the-art models, except a few, such as FC_DenseNet-103 [52], AerialSegNet [50], and UNet [40] with ResNet-18 as the backbone, comprise fewer trainable parameters than the proposed architecture. However, the proposed architecture overcomes the underlying issues of these baselines, achieving better performance accuracy while having fewer trainable parameters than most of the considered baselines. Therefore, it can be deployed on various edge devices (such as UAVs), where memory and computing power are constrained. In the following section, we discuss the optimization achieved through the use of superpixels.

[Figure caption (partial): ... [40], and (f) FCN-16s [5]. Color coding of the semantic classes matches Fig. 3.]
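To make the link between trainable parameters and memory concrete, the following sketch counts the parameters of convolutional layers and estimates their 32-bit floating-point storage. The layer shapes are hypothetical and do not correspond to the proposed architecture or to any entry in Table VI.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Trainable parameters of a 2-D convolution with a square kernel."""
    return (kernel * kernel * c_in + (1 if bias else 0)) * c_out

# Hypothetical two-layer stack: (kernel size, input channels, output channels).
layers = [(3, 3, 64), (3, 64, 128)]

total = sum(conv2d_params(k, c_in, c_out) for k, c_in, c_out in layers)
megabytes = total * 4 / 1e6  # 4 bytes per float32 weight

print(total, round(megabytes, 3))
```

Summing such counts over all layers gives the model size that must fit into the memory of an edge device.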

B. Space Efficiency
The superpixel technique provides a partially segmented image that helps the CNN module extract object-level features while reducing the space complexity. In our study, images with smaller m perform slightly better than the others because they preserve features and remain close to a natural image (with meaningful information). In terms of space consumption, even the superpixel images with the highest space complexity are 60% smaller than the original images while performing as well as or better than the original images. Bar graphs of the space consumption of all the considered images are presented in Figs. 8 and 9. Among the superpixel images considered in the experiment, the space consumption (spc) decreases as m increases: spc(5) > spc(15) > spc(25) > spc(35). For example, if N = 1000, then spc(1000_5) > spc(1000_15) > spc(1000_25) > spc(1000_35). This space saving matters when the proposed approach is deployed over IoT and networked systems, as transferring the superpixel images requires less bandwidth, leaving bandwidth available for other network-related operations. Thus, IoT-based RS applications can benefit from the proposed architecture.

[Figure caption (partial): ... [40], and (f) FCN-16s [5]. Color coding of the semantic classes matches Fig. 4.]

C. Observed Limitations
The current study has a few limitations. Under low-light conditions, the model performs poorly (at the beginning of training) in segmenting similar-looking objects. For example, the road surface may look similar to a rooftop (a tar-covered sheet), which confuses the model and leads to a low IoU. Similarly, for minority-class objects, such as the occlusion class in the NITRDrone dataset and the vehicle class in the UDD, the model needs a long time to obtain the required feature maps before it classifies the pixels of these classes correctly. A further limitation is the increased number of trainable parameters due to the multiple CNN modules, which may introduce architectural issues. In future work, a few cues presented in [58] can be considered to develop a multiscale CNN architecture and verify its effectiveness.

VII. CONCLUSION
This article presents a superpixel-based multiscale CNN framework to address UAV aerial image-based semantic segmentation problems. The first-level segmentation is achieved using the SLIC superpixel algorithm, which produces superpixel images from the input UAV images; these become the input to the CNN architecture for the final segmentation. The proposed CNN architecture combines the strengths of skip connections and the multiscale context aggregation strategy to extract the crucial scale-invariant features that can uniquely classify a pixel of the corresponding object class. The multiscale CNN module is good at extracting scale-invariant features, which are essential for UAV imagery, as the same ground objects may look small or large depending on the operating height of the UAV. Furthermore, the proposed architecture is evaluated on the NITRDrone dataset and the UDD and compared with state-of-the-art methods. The experimental results demonstrate the superiority of the ensemble framework (of the superpixel technique and the deep multiscale architecture) in segmenting UAV aerial images. Moreover, the proposed architecture provides a robust solution for the semantic segmentation of object classes such as road, vehicle, and vegetation, which the other considered state-of-the-art methodologies failed to achieve. The proposed approach can be integrated with robotics-based artificial intelligence solutions to provide intelligent road extraction and vegetation detection from panoptic UAV aerial imagery.
Similarly, the proposed approach can be combined with IoT and cloud concepts to actively support critical operations, such as disaster management and surveying.
As an extension of this work, different superpixel techniques, such as SLICO and SEEDS, may be tested to find a better-performing superpixel technique for road and vegetation extraction from aerial images. Similarly, the proposed architecture can be implemented in a simulated IoT environment to demonstrate the efficiency of this approach in managing operations in low-bandwidth environments.