Abstract

The urban impervious surface has been identified as an important measurable indicator of urbanization and its environmental implications. This paper presents a strategy based on three-dimensional convolutional neural networks (3D CNNs) for extracting urban impervious surfaces from LiDAR datasets using deep learning technology. Various 3D CNN parameters are tested to see how they affect impervious surface extraction. To alleviate the restrictions imposed by single-sensor data, this study investigates the synergistic integration of multiple remote sensing datasets of Azad Kashmir, State of Pakistan, for urban impervious surface delineation. Our suggested 3D CNN approach achieved an overall accuracy greater than 95% and an overall kappa value greater than 90%, which shows tremendous promise for impervious surface extraction. Because it uses multiscale convolutional processes to combine spatial and spectral information with texture and feature maps, our proposed 3D CNN approach characterizes urban impervious surfaces better than the commonly utilized pixel-based support vector machine classifier. Although image analysis presents significant obstacles in the fast-growing big data era, our proposed 3D CNNs can effectively extract urban impervious surfaces at scale.

1. Introduction

Global urbanization has accelerated dramatically in the past few decades; around 68% of the global population is projected to live in cities by 2050 [1]. Rapid urbanization can lead to various environmental problems, such as ecological degradation, poor air quality, deterioration of public health, and changes in the microclimate that bring extreme weather, higher temperatures, limited water availability, and continued vulnerability to natural disasters [2]. The spectral and structural diversity and complexity of urban surfaces over small areas make sophisticated urban analysis difficult [3]. As a result, continuous monitoring of urban areas is imperative. In metropolitan settings, where many objects are mobile (vehicles and temporary buildings) and where infrastructure, vegetation, and construction are continuously changing, systematic monitoring and updating of maps are vital.

With advances in remote sensing technology [4], spatial and temporal analyses of metropolitan areas are now possible [5]. Airborne remote sensing is a powerful new tool for urban analysis, providing fast mapping of a city for planning [6], management operations [7], and monitoring of urban and suburban land use [8]. For example, social preferences, the regional ecosystem, urbanization change, and biodiversity can all be studied using this method [2]. An important component of studying three-dimensional city geometry and modelling urban morphology is using urban remote sensing to identify various objects, heterogeneous materials, and mixtures [9, 10]. However, emerging problems require sensors and analysis methodologies at the cutting edge of development. Urban land cover types can be identified using the spectral, spatial, and structural capabilities of remote sensing devices, which are constantly improving. The use of LiDAR (light detection and ranging) in urban mapping has increased, as has the use of hyperspectral (HS) and synthetic aperture radar (SAR) data. Urban settings can be studied using various parts of the electromagnetic spectrum, ranging from microwave radar to the reflective spectral range. However, because SAR images require oblique illumination of the scene, they are subject to occlusion and layover, making it difficult to analyze dynamic metropolitan environments [11, 12].

Single-sensor urban land cover classification in densely populated areas is frequently inadequate in accuracy and interpretability [13, 14]. Extensive analyses are required because urban heterogeneity causes substantial spectral variance within a single land cover type. The spectral and spatial-structural properties of impervious surfaces (such as roofs, parking lots, highways, and pavements) differ significantly. Urban heterogeneity can be characterized in several ways, including by scale and spatial resolution. At the level of scale, heterogeneity is defined by whether materials are treated individually or grouped into a single class, such as individual trees, forest type, or vegetation in general [15, 16]. The amount of pixel mixing depends on the spatial resolution. Nevertheless, increased spatial resolution also increases the observed variability of each physical material, making analysis more challenging.

HS data provides spectral information about the materials under study, allowing materials to be distinguished without reference to elevation context. The problem with pure spectral analysis is that it ignores object identity, and most objects are made of various materials, resulting in very high intraobject heterogeneity. LiDAR data differs from such data types: consider asphalt open parking lots and highways, which share a surface material but differ in elevation context [17, 18]. In addition, passive remote sensors such as HS are more sensitive to changes in atmospheric conditions and lighting than active sensors such as LiDAR. When paired with HS data, this LiDAR characteristic allows physical correction of shadow and lighting, and its intensity measurements support urban land cover mapping in shadowed areas.

Urban environments exhibit spectral ambiguity and reduced spectral values in the shadows of terrain features, buildings, and trees; these problems can be addressed by adding LiDAR data, as shown in [19, 20], regardless of the spatial and spectral resolution of airborne HS sensors. Many new urban surface classification methods combine active and passive remote sensing techniques, such as airborne LiDAR and hyperspectral data, to overcome the limitations of individual sensors (HL-Fusion). Using HL-Fusion for land cover classification can provide additional information on three-dimensional topography, spatial structure, and spectral properties.

For this reason, it is important to combine spectral, spatial, and elevation data when studying the urban environment [21, 22]. Airborne HL-Fusion has been tested for the classification of urban land cover. However, various combination methods based on physical or empirical approaches are put into practice at different data and product levels. Furthermore, due to the complexity of the fusion processes, no standard framework exists for fusing these sensors. As a result, a complete review of prior data fusion research may help researchers better grasp the potential, constraints, and prevalent issues that limit classification outcomes in cities [23–25].

Classifiers for HS data have been developed using machine learning (ML) approaches. Different mapping approaches are used to achieve the classification goal depending on the target. Machine learning algorithms are constantly being improved, allowing them to extract increasingly complex features in an organized manner; this capability is credited to deep learning (DL), a subfield of machine learning [26, 27]. Spatiospectral feature extraction from HS data has been accomplished using DL [1, 2]. Both ML and DL methods are useful for classification in remote sensing, although different algorithms are better at extracting different attributes from pixels or objects. It is necessary to understand which features can be extracted before selecting a classification technique for HS data [28, 29]. However, although per-pixel classification can uncover unique deep characteristics at the pixel level, it can produce noisy results in an urban setting given the large spatial variability.

When the training dataset is insufficient, classification results will be limited in performance and accuracy, since they depend heavily on the number of training samples. We therefore investigated incorporating contextual information around pixels and object-oriented classification to improve the results and reduce heterogeneity. ML and DL algorithms [23, 30] outperform classical classifiers in the urban setting, especially when training time is limited.

The primary goal of this study is to learn more about the ability of 3D CNNs to extract urban impervious surfaces from high-resolution (HR) data. We compared the results with those of support vector machine (SVM) and 2D CNN approaches to evaluate performance. We also show how the various parameters of the suggested 3D CNN model affect impervious surface extraction.

2. Literature Review

To map impervious surfaces, scientists have relied on satellite images with a range of spatial and temporal resolutions. Medium and low spatial resolution images, such as Landsat and MODIS data, are appropriate for large-scale urbanization mapping due to their spectral information and high temporal resolution [26, 27]. However, in large-area mapping, mixed pixels cause confusion when extracting impervious surfaces. High spatial resolution images can provide rich information on land use and land cover; still, the spectral similarities of diverse objects, as well as the shadows cast by large structures or tall trees, make it difficult to extract impervious surfaces [28, 29]. Hyperspectral images can alleviate both the spectral similarity between different land classes and the spectral heterogeneity within a single class [21, 22]. Using SAR, land information previously obscured by clouds can be retrieved, and impervious surfaces under large tree crowns can be found in SAR images [19, 20]. However, the coherent noise of SAR images poses a substantial challenge for impervious surface extraction. As a result, urban impervious surface mapping using single-source imagery has several limitations. Several datasets derived from diverse image capture technologies have lately been deemed useful in addressing these difficulties [9, 17], including medium–low spatial resolution (MLR) images, optical images, SAR images, high spatial resolution (HSR) images, and light detection and ranging (LiDAR) data [16, 19]. The height information provided by LiDAR data can greatly help distinguish between objects with identical spectral characteristics, which aids impervious surface extraction [18]. For example, even though the spectral properties of houses, roads, and bare land are often comparable, there is a significant height disparity between them, so LiDAR height information is useful for separating objects with similar spectra. In addition, building roofs are flatter than tree crowns, so buildings and trees can be distinguished using the LiDAR height variance.

However, for the most part, image classification has been carried out using one- or two-dimensional (1D/2D) CNNs. Three-dimensional (3D) CNNs can model spatial, textural, spectral, and other information simultaneously because of their 3D convolution operation. Although 3D CNNs are already widely utilized for video and volumetric images, their performance in extracting urban impervious surfaces from satellite images has yet to be proven.

3. Methodology

The best performance in remote sensing comes from CNNs, i.e., multistage feed-forward neural networks. Computing advances such as high-performance GPUs, rectified linear units (ReLU), and dropout and data augmentation methodologies have only recently become practical in remote sensing, long after they were first proposed. We propose employing 3D convolutions with a 3D kernel in CNNs to compute multiple features from LiDAR data.
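As a minimal illustration of this 3D convolution idea (not the authors' actual implementation), the sketch below applies a 3D kernel to a multi-channel LiDAR-derived patch using PyTorch; the patch size and channel counts are assumptions chosen for demonstration only.

```python
# A minimal sketch of a 3D convolution over a LiDAR-derived volume (assumed
# shapes; the paper does not specify its exact tensor dimensions).
import torch
import torch.nn as nn

# Input volume: (batch, channels, depth, height, width)
patch = torch.randn(1, 4, 16, 16, 16)

# A 3x3x3 kernel slides along depth, height, and width simultaneously,
# combining spatial and elevation neighborhoods in a single operation.
conv3d = nn.Conv3d(in_channels=4, out_channels=8, kernel_size=3, padding=1)
features = conv3d(patch)
print(features.shape)  # torch.Size([1, 8, 16, 16, 16]) -- the output is itself 3D
```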

The 3D CNN employed in our work is based on the notion of a convolutional feed-forward neural network with numerous hidden layers to enhance abstraction capability, as shown in Figure 1.

The input layer of the DNN structure employed in this study has 64, 32, 16, or 8 neurons. After that, we used four dense layers with $2^{10}$, $2^9$, $2^8$, and $2^7$ neurons, followed by a sigmoid classification layer with two outputs to separate the abnormal and benign traffic categories. For the experiment, only five neurons with numerical and categorical information are used in the input layer; after that, there are two dense layers with $2^8$ and $2^7$ neurons, as well as an output layer with a sigmoid activation function, which determines whether the mobile activity is benign or pathological. The deep neural network model can be expressed as

$$a_i^{(l)} = \sigma\left(\sum_{j=1}^{n^{(l-1)}} w_{ij}^{(l)}\, a_j^{(l-1)} + b_i^{(l)}\right),$$

where $a_j^{(l-1)}$ is the $j$th neuron in the $(l-1)$th activation layer and $n^{(l-1)}$ is the total number of neurons in the $(l-1)$th layer.
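For illustration, the following is a hedged sketch of the dense stack described above, assuming PyTorch and an input width of 64 (one of the sizes mentioned in the text); the study's actual framework and training details are not specified.

```python
# A sketch of the described dense (fully connected) stack: four hidden layers
# of 2^10, 2^9, 2^8, and 2^7 neurons feeding a two-output sigmoid layer.
# The 64-wide input is one of the sizes mentioned in the text (assumed here).
import torch
import torch.nn as nn

dnn = nn.Sequential(
    nn.Linear(64, 2**10), nn.ReLU(),
    nn.Linear(2**10, 2**9), nn.ReLU(),
    nn.Linear(2**9, 2**8), nn.ReLU(),
    nn.Linear(2**8, 2**7), nn.ReLU(),
    nn.Linear(2**7, 2), nn.Sigmoid(),  # two-output classification layer
)
outputs = dnn(torch.randn(5, 64))      # forward pass for a batch of 5 samples
```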

3.1. 3D CNN Model Architecture

A 3D CNN is a powerful model for learning representations of volumetric data, since it takes a 3D volume or a sequence of 2D frames as input. 3D CNNs are utilized when extracting 3D characteristics or establishing 3D associations. 3D convolutions apply a filter that moves in three directions across the dataset to calculate low-level feature representations. Typically, their output takes the form of a cube or cuboid. A 3D filter can move in all three directions (height, width, and channel of the image). At each position, element-by-element multiplication and addition yield one number. Because the filter moves through a 3D space, the output numbers are also arranged in 3D, and the final result is 3D data. Figure 2 shows the 3D CNN image classification model.
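The sketch below is a minimal PyTorch rendering of a 3D CNN classifier of the kind shown in Figure 2, stacking 3D convolution, ReLU, and 3D pooling stages before a fully connected classification layer; all channel counts and the patch size are illustrative assumptions, not values from the paper.

```python
# A minimal 3D CNN image classifier sketch (assumed shapes and channel counts).
import torch
import torch.nn as nn

class Simple3DCNN(nn.Module):
    def __init__(self, in_channels=4, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # 16^3 -> 8^3
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                       # 8^3 -> 4^3
        )
        self.classifier = nn.Linear(32 * 4 * 4 * 4, num_classes)

    def forward(self, x):                          # x: (batch, C, 16, 16, 16)
        x = self.features(x).flatten(1)            # -> (batch, 2048)
        return self.classifier(x)                  # raw class scores

model = Simple3DCNN()
scores = model(torch.randn(2, 4, 16, 16, 16))      # -> (2, 5)
```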

3.2. Model Training

Buildings, roads and other impervious surfaces, trees, grasslands, and bare soils are the five land cover classes in our research area. The number of classes in our research area determines the size of the output layer. The output layer uses softmax nonlinearity for multiclass logistic regression, producing a $K$-dimensional vector whose elements represent the likelihood of each class:

$$p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}.$$

For each input sample in a mini-batch $B$, the network is trained to minimize the cross-entropy loss $L = -\frac{1}{|B|}\sum_{i \in B} \log p_{i, y_i}$, where $y_i$ is the true class of sample $i$. Figure 3 shows the output layer employing softmax nonlinearity in multiclass logistic regression.
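As a brief worked example of the softmax output and mini-batch cross-entropy loss above (assuming PyTorch; the batch size is arbitrary, and K = 5 matches the land cover classes):

```python
# Softmax class probabilities and cross-entropy loss for a mini-batch B.
import torch
import torch.nn.functional as F

scores = torch.randn(8, 5)              # raw network outputs for |B| = 8 samples
labels = torch.randint(0, 5, (8,))      # true class index per sample

probs = F.softmax(scores, dim=1)        # K-dimensional probability vector per sample
loss = F.cross_entropy(scores, labels)  # applies log-softmax internally
```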

3.3. Hypertuning of Parameters of 3D CNN Model

The 3D CNN model’s starting parameters significantly impact the extracted information and classification results. To improve the training accuracy of the CNN model, it is critical to choose an ideal choice of CNN hyperparameters. The training and validation samples were used to select hyperparameters in our work. The size of the input image (), the size of the convolution kernel (), the pooling dimension (), the number of feature mappings (), and the number of CNN layers are all investigated to see how they influence classification accuracy (). The best hyperparameters are then used to build a 3D CNN model.

First, we test the accuracy of 3D CNNs using different convolutional kernel sizes $c$ and pooling dimensions $d$ while the input image size $m$, the number of feature maps $n$, and the number of layers $l$ are held fixed. The best combination of $c$ and $d$ is selected after a fixed number of epochs. Second, using the optimal $c$ and $d$ values and a preset number of layers $l$, we investigate the accuracy of various input image sizes $m$ and output feature counts $n$; this iterative approach is used to find the best values for $m$ and $n$. Finally, using the optimal combination of $c$, $d$, $m$, and $n$, we examine how changing the input image size $m$ and the number of CNN layers $l$ affects the classification performance of CNNs with varied layer counts. $m$ and $l$ are revised iteratively until they are found to be optimal.
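This staged search can be summarized schematically as below; the candidate grids and the train_and_validate() helper are hypothetical placeholders, since the paper does not list its actual candidate values.

```python
# Schematic of the staged hyperparameter search (hypothetical grids and helper).
import itertools
import random

def train_and_validate(**params):
    """Hypothetical stand-in: train a 3D CNN with these hyperparameters
    and return its validation error."""
    return random.random()

# Stage 1: search kernel size c and pooling dimension d with m, n, l fixed.
kernel_sizes = [3, 5, 7]   # candidate values of c (assumed)
pool_dims = [2, 3]         # candidate values of d (assumed)
best_err, best_c, best_d = min(
    (train_and_validate(c=c, d=d), c, d)
    for c, d in itertools.product(kernel_sizes, pool_dims)
)

# Stage 2: with (c, d) fixed, sweep the input image size m and feature maps n.
# Stage 3: with (c, d, m, n) fixed, sweep the number of layers l, repeating
# the sweeps until the chosen values stabilize.
```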

4. Results and Discussions

The experimental site is located in Azad Kashmir, State of Pakistan. The WV-2 data was collected in October 2021, whereas the airborne LiDAR data was collected in August 2021. The data period spans 2016 to 2021. Based on the 3D CNN model, we divided the study area into three land classes: urban surfaces, vegetation, and bare soil. Table 1 shows the pixels utilized in 3D CNN classification for training, validation, and testing.

Consistent with other studies using deep learning approaches for remote sensing image classification, we used roughly the same number of training samples, about 3000–7000 in total. The 3D CNN model uses the training samples for model training, the validation samples for hyperparameter adjustment, and the testing samples for the final accuracy assessment. Figure 4 shows sample images taken from the datasets.

4.1. Hypertuning of Model

To determine the best 3D CNN parameters for model construction, randomly selected training and validation samples were used to evaluate performance. Figure 5 shows the error (%) for different pixel sizes with the pretrained kernel size, while Figure 6 shows the impact of the input image size on the error (%).

Figures 5, 6, and 7 show the pretraining accuracy of 3D CNN models under different parameter settings. The kernel size $c$ and pooling dimension $d$ have a clear effect on accuracy, and suitable values of the input image size $m$ and the number of feature maps $n$ also perform well. (A pixel is the smallest possible unit of image input size.)

Figure 8 shows the urban areas classified from the Azad Kashmir datasets using LiDAR with 3D CNNs. The accuracy assessment is conducted using the error matrix approach.

Various accuracy metrics are calculated, including the producer's accuracy, the user's accuracy, the overall accuracy, and the overall kappa coefficient (Table 2).
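For reference, these metrics can be computed directly from the error matrix; the sketch below uses a made-up 3×3 matrix for illustration, not the study's actual Table 2 values.

```python
# Accuracy metrics from an error (confusion) matrix; the entries are invented.
import numpy as np

cm = np.array([[50, 2, 1],     # rows: reference class, cols: predicted class
               [3, 45, 2],
               [1, 2, 44]])

total = cm.sum()
overall_accuracy = np.trace(cm) / total
producers_accuracy = np.diag(cm) / cm.sum(axis=1)  # correct / reference totals
users_accuracy = np.diag(cm) / cm.sum(axis=0)      # correct / predicted totals

# Kappa: agreement beyond chance, from the marginal totals.
expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2
kappa = (overall_accuracy - expected) / (1 - expected)
```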

5. Conclusions

This study proposed and applied a convolution-based 3D CNN technique to extract urban impervious surfaces from LiDAR datasets. We also investigated the effects of various 3D CNN settings on impervious surface extraction in detail. Through deep learning with 3D convolutional, ReLU, and pooling operators, our suggested 3D CNN technique can automatically extract spectral and spatial data as well as textural and elevation features, resulting in improved extraction performance (particularly for building roofs and roads). According to the findings of this study, impervious surface extraction is significantly influenced by the 3D CNN settings. The best results come from the optimal combination of the convolutional kernel size $c$ and the pooling dimension $d$. The input image size $m$ impacts both the accuracy and the computation time of the algorithm; for $m$, a value in the 20–40 range works best. The impervious surface extraction's performance is most stable at the optimal number of CNN layers $l$, which is also true for the conventional CNN. For the 3D CNN approach, the benefit of using several sources of data is less pronounced: the 3D CNN model can extract impervious surfaces accurately from a single-source HR image.

Data Availability

Regarding the availability of the code and practical work, they are confidential to the author's lab and cannot be shared with anyone until the work is published. The code and lab work will then be published according to the instructions of the author's supervisor.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

I would like to acknowledge my indebtedness and render my warmest thanks to my supervisor, Professor Yang Fengbao, who made this work possible. His friendly guidance and expert advice have been invaluable throughout all stages of the work. I would also like to express my gratitude to Miss Gao Min for her extended discussions and valuable suggestions, which have contributed greatly to the improvement of the paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672472 and 61972363), the Science Foundation of North University of China, the Postgraduate Science and Technology Projects of North University of China (Grant No. 20181530), and the Postgraduate Education Innovation Project of Shanxi Province.