Real-time scene classification of unmanned aerial vehicles remote sensing image based on Modified GhostNet

Unmanned Aerial Vehicles (UAVs) play an important role in remote sensing image classification because they can autonomously monitor specific areas and analyze images. Embedded platforms and deep learning are used to classify UAV images in real time. However, given the limited memory and computational resources, deploying deep learning networks on embedded devices and analyzing ground scenes in real time remain challenging in practical applications. To balance computational cost and classification accuracy, a novel lightweight network based on the original GhostNet is presented. The computational cost of this network is reduced by changing the number of convolutional layers. Meanwhile, the fully connected layer at the end is replaced with a fully convolutional layer. To evaluate the performance of the Modified GhostNet in remote sensing scene classification, experiments are performed on three public datasets: UCMerced, AID, and NWPU-RESISC. Compared with the basic GhostNet, the Floating Point Operations (FLOPs) are reduced from 7.85 MFLOPs to 2.58 MFLOPs, the memory usage is reduced from 16.40 MB to 5.70 MB, and the predicted time is reduced by 18.86%. The Modified GhostNet also increases the average accuracy (Acc) by 4.70% in the AID experiments and by 3.39% in the UCMerced experiments. These results indicate that the Modified GhostNet can improve the performance of lightweight networks for scene classification and effectively enable real-time monitoring of ground scenes.


Introduction
Unmanned Aerial Vehicles (UAVs) serve as remote sensing platforms in various applications, such as search and rescue [1], disaster evaluation and traffic monitoring [2], and wireless communications [3]. UAVs are equipped with deep learning networks and embedded devices to carry out image classification tasks autonomously. An autonomous UAV rapidly monitors hazards and disasters by classifying the captured images in real time, which relies heavily on its onboard sensors and microprocessors [4]. Local embedded devices are superior to cloud storage in scenarios involving privacy, latency, and limited connectivity. However, because embedded devices are limited in memory and computing power, efficient scene classification remains challenging.
In this paper, we analyze the basic architecture of the GhostNet model and then adjust the structure of the network, developing a Modified GhostNet model that has lower network complexity while preserving accuracy. The Modified GhostNet model designed in this paper can be deployed on an embedded device for image scene classification tasks with high requirements for both accuracy and real-time processing.

GhostNet model
The GhostNet model is a lightweight network model jointly proposed by Huawei Noah's Ark Lab, Peking University, and the University of Sydney in 2020. Its authors observed that the intermediate feature maps output by each layer of a trained network contain considerable redundancy. They concluded that a linear operation with a lower computational cost can replace part of the convolution operations, which have a higher computational cost, to reduce the amount of computation and save computing resources. They therefore proposed the Ghost module, which replaces half of the convolution operations with linear operations to reduce the computational complexity of the model. The structure of the original GhostNet model is shown in Fig 1; it includes ordinary convolutional layers and Ghost modules. The compression ratio between a network that uses the Ghost module and a network with pure convolution operations can reach a factor of 2.
Suppose n is the number of feature maps that a convolution layer should produce, m is the number of feature maps actually output by the convolution operation, and s is the factor of the linear operation. Since half of the general convolution operations are replaced by linear operations in this paper, s = 2. Because the total number of feature maps output by a layer must remain unchanged, $n = m \cdot s$, so the number of feature maps generated by linear operations is $m \cdot (s - 1) = \frac{n}{s} \cdot (s - 1)$. Suppose the average kernel size of each linear operation is d × d. The compression ratio between the network that uses the Ghost module and the network with pure convolution operations is then
$$r_c = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s - 1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s$$
where c is the number of input channels and k is the kernel size of the ordinary convolution operation. The approximations hold when d is close to k and s is much smaller than c.
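The parameter count behind this compression ratio can be checked numerically. The following is a minimal sketch, assuming bias terms are ignored and that each of the (s - 1) · n/s ghost feature maps costs one d × d kernel; the function name and the example layer sizes are illustrative and do not come from the paper.

```python
def ghost_compression_ratio(n: int, c: int, k: int, d: int, s: int = 2) -> float:
    """Parameters of an ordinary convolution divided by those of a Ghost module.

    n: feature maps the layer should output, c: input channels,
    k: kernel size of the ordinary convolution, d: kernel size of the
    linear (cheap) operations, s: factor of the linear operation.
    """
    if n % s != 0:
        raise ValueError("n must be divisible by s")
    m = n // s                           # feature maps produced by the ordinary convolution
    conv_params = n * c * k * k          # ordinary convolution producing all n maps
    primary_params = m * c * k * k       # ordinary convolution producing only m maps
    cheap_params = (s - 1) * m * d * d   # one d x d linear operation per ghost map
    return conv_params / (primary_params + cheap_params)


# Hypothetical layer: 32 input channels, 64 output maps, 3 x 3 kernels, s = 2.
ratio = ghost_compression_ratio(n=64, c=32, k=3, d=3, s=2)
approx = 2 * 32 / (2 + 32 - 1)           # s * c / (s + c - 1)
print(f"ratio = {ratio:.3f}, s*c/(s+c-1) = {approx:.3f}")
```

With d = k the exact ratio coincides with s·c/(s + c - 1), which approaches s as the number of input channels grows.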
The authors designed two residual structures as the Ghost bottlenecks of the GhostNet model: one with a stride of 1 and one with a stride of 2. The two structures are shown in Fig 2. A Ghost bottleneck mainly consists of two stacked Ghost modules. For a stride of 2, the shortcut path uses a downsampling layer, and a depthwise convolution with a stride of 2 is inserted between the two Ghost modules. The original GhostNet model is formed by stacking these two types of bottleneck. The overall architecture is shown in Table 1.

Modified GhostNet model
In this paper, we propose a network model based on structural compression. We analyze the feature map redundancy that arises when the GhostNet model is applied to UAV image scene classification datasets. We remove the Ghost bottlenecks that generate redundant feature maps in the GhostNet model and replace the final fully connected layer with a fully convolutional layer. The Modified GhostNet has lower memory usage and energy consumption. Computationally intensive models drain the batteries of embedded devices quickly. For example, if the battery of an embedded device running a computationally heavy model lasts only 30 minutes, halving the amount of computation lets the device work for at least 10 more minutes. Therefore, reducing the computational load of network models is crucial for running them on embedded devices. The improvement scheme of the GhostNet model is mainly based on the following two points:
1. The 2nd, 4th, 9th, 10th, 15th, and 17th layers in the GhostNet model are extracted and visualized. The visualization shows a large amount of feature redundancy between these layers and their adjacent layers.
2. The fully connected layer at the end is replaced with a fully convolutional layer to reduce the amount of network computation caused by the fully connected layer.
The structure of the Modified GhostNet network model based on these two points is shown in Table 2. K is the number of classes, and SE indicates whether the Squeeze-and-Excite module is used.

Dataset
With the development of neural networks, the number of data samples, the amount of computation, and the algorithm have become the three key factors with the greatest impact on a neural network model. Among them, sample data is the basis for all research work. When the dataset is unsuitable, that is, when its samples are scarce or lack diversity, the research is difficult to carry out smoothly [38]. Brill et al. [39] reached a relevant conclusion on this problem: different algorithms achieve almost the same performance on the same problem, and increasing the amount of sample data improves the overall accuracy of an algorithm to a certain extent. The quality of a dataset is reflected in the richness and diversity of its samples. For natural images, samples of the same type should have different shapes, angles, sizes, and so on. For UAV images, samples of the same type should cover different factors such as climate change, illumination change, viewing angle change, and spatial resolution change. The current public UAV image scene datasets have some deficiencies. The commonly used public UAV image scene datasets are listed in Table 3, including UCMerced, AID, and NWPU-RESISC [40].
UCMerced is a dataset with 21 categories that was released by the Computer Vision Laboratory of the University of California, Merced in 2010, and it is the earliest UAV image dataset collected. However, the number of samples per category is relatively small: each category contains only 100 images. Additionally, the scenes in this dataset are limited to American cities, so the samples lack diversity.
Wuhan University and Huazhong University of Science and Technology released a UAV image dataset called AID. The dataset contains 30 categories with a total of 10,000 UAV images, and the resolution of each image is 600 × 600. Each category contains about 220-420 images. Although the dataset has some variation in spatial resolution, the number of images per category is inconsistent.
NWPU-RESISC is a large-scale benchmark dataset created by Northwestern Polytechnical University in 2016, and it is the most commonly used dataset for validating network performance. Although it has certain advantages over the first two datasets, the number of sample types in the dataset is relatively small, and images of the same type are mostly captured under ideal conditions, so the samples lack diversity.

Data preprocessing
The datasets used in this paper are small, and the similarity between data samples is extremely high, which may cause the network model to overfit. Therefore, we use image augmentation methods to expand the image scene data, enriching the scene information under different environments and improving the generalization capacity of the network. The approaches include geometric transformation, pixel color transformation, and compound transformation. Geometric transformation includes flipping, rotating, translating, scaling, cropping, and other operations. Pixel color transformation includes noise interference and blurring. Combinations of different rotation angles and different noise are shown in Fig 3. In addition, compound transformation, which combines geometric transformation and pixel color transformation, is important for enriching the sample characteristics of UAV image datasets. For example, combining rotation and noise interference can simulate complex sample data taken from different directions.
We use transfer learning techniques to generate the Modified GhostNet model. Transfer learning means that knowledge learned in one domain (such as natural scenes) is applied to another domain (such as defect detection or drone images) to improve generalization ability [41]. The pre-trained model in this paper is the model trained on NWPU-RESISC. Its weights are used as the initial weights for AID and UCMerced, and the network model is then fine-tuned.
A neural network model inevitably exhibits feature redundancy during training. This paper trains the GhostNet network model on the NWPU-RESISC dataset and visualizes the feature maps of the output channels from the 8th to the 11th layers to observe this redundancy.
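The compound transformation described in this section, a rotation followed by noise interference, can be sketched as follows. This is only an illustration under stated assumptions: images are plain nested lists of grayscale values in [0, 255], rotation is restricted to multiples of 90 degrees, and the noise is Gaussian. The paper does not specify its rotation angles or noise model, and all function names here are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def rotate90(img: Image, turns: int = 1) -> Image:
    """Geometric transformation: rotate clockwise by 90 degrees, `turns` times."""
    out = [list(row) for row in img]          # copy so the input is not modified
    for _ in range(turns % 4):
        out = [list(row) for row in zip(*out[::-1])]
    return out


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Pixel transformation: add zero-mean Gaussian noise, clipped to [0, 255]."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def compound_augment(img: Image, turns: int, sigma: float,
                     rng: random.Random) -> Image:
    """Compound transformation: rotation followed by noise interference."""
    return add_gaussian_noise(rotate90(img, turns), sigma, rng)


rng = random.Random(0)
sample = [[10.0, 20.0, 30.0],
          [40.0, 50.0, 60.0]]                 # toy 2 x 3 image
augmented = compound_augment(sample, turns=1, sigma=5.0, rng=rng)
print(len(augmented), "rows x", len(augmented[0]), "columns")
```

Varying `turns` and `sigma` independently yields several distinct variants per source image, which is the kind of combination the section describes.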

Experiment environment
To ensure the training speed of the model, the model is trained on Ubuntu 16.04 with an NVIDIA Quadro P5000 GPU as the hardware platform. The model was built on the Keras deep learning framework. During training, Adam is selected as the optimizer, the initial learning rate is set to 0.001, the number of training rounds is set to 100, and the batch size is set to 8. The image resolution in the dataset is set to 256 × 256, and the same image augmentation operations are performed, including geometric transformation, pixel color transformation, and compound transformation.
Since the purpose of this project is to apply the lightweight network model to embedded devices, the trained network model is ported to an embedded device for prediction. The embedded device selected in this experiment is the Jetson TX2. The test in this paper is mainly a simulation test. The embedded device is connected to a camera through USB, the camera captures the picture to be recognized, and the recognized category is drawn as text in the upper left corner of the picture. The display screen shows the ground object information captured by the drone. On the left side of the Jetson TX2 device is a monocular camera, which is used to capture the ground object information.

Training strategy
In this paper, the dataset is expanded through image augmentation, which prevents overfitting during network training and improves the generalization ability of the network. The image resolution in the dataset is set to 256 × 256. The dataset is shuffled and divided into a training set, a validation set, and a test set in a ratio of 6:2:2. During training, the Adam optimizer is used, the initial learning rate (lr) is set to 0.001, the number of training rounds is set to 100, the batch size is set to 8, and the dropout ratio is set to 5%. The entire training strategy of the Modified GhostNet model is shown in Fig 6. Val Loss is the loss on the validation set. Training is complete when the number of training rounds reaches 100, and ε is a constant value.

Image augmentation
To verify the generality of the scheme, this paper conducts experiments on three small UAV image scene classification datasets: UCMerced, AID, and NWPU-RESISC. Each sample is expanded to 20 times the original, so the total number of samples in each dataset is also 20 times the original. The training set, validation set, and test set are divided in a ratio of 6:2:2. The number of training samples in a single category before and after augmentation is shown in Table 4. The number of training samples is greatly increased, providing rich sample features for model training.
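The 6:2:2 division and the 20-fold expansion can be sketched as below, using the 100 images per UCMerced category stated earlier. Two assumptions are made here that the text does not confirm: the split is applied to the original images before expansion (which keeps all variants of one image in the same subset), and each augmented variant is represented only by an index. The file names are hypothetical, and the printed counts are derived from these assumptions rather than taken from Table 4.

```python
import random
from typing import List, Sequence, Tuple


def split_6_2_2(samples: Sequence[str],
                seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the samples and divide them 6:2:2 into train/validation/test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10        # integer arithmetic avoids float rounding
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def expand(samples: Sequence[str], factor: int = 20) -> List[Tuple[str, int]]:
    """Represent each image by `factor` variants (variant index only, no pixels)."""
    return [(name, variant) for name in samples for variant in range(factor)]


# One hypothetical category with 100 original images.
originals = [f"category_a_{i:03d}.png" for i in range(100)]
train, val, test = split_6_2_2(originals)
train_aug = expand(train, factor=20)
print(len(train), len(val), len(test), len(train_aug))
```

Under these assumptions a 100-image category yields 60 training images before expansion and 1,200 after it.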
GhostNet is then trained on the datasets after image augmentation. The average accuracy of GhostNet improves when the augmented datasets are used, as shown in Table 5.
UCMerced. The loss and accuracy of the GhostNet model before and after UCMerced image augmentation are shown in Fig 7. The red line represents the trend after image augmentation, the blue line represents the trend before image augmentation, and steps represent the training rounds.
Fig 7(a) indicates that the network model converges more slowly on UCMerced after image augmentation than before. Fig 7(c) shows that, before image augmentation, the convergence of the network model on the validation set keeps fluctuating within a certain range, whereas after image augmentation it remains relatively stable after 50 rounds of training. Comparing Fig 7(b) and Fig 7(d) shows that the network model before image augmentation overfits severely. On the training set, the accuracy of the blue line equals that of the red line and is close to 100%. On the validation set, however, the accuracy of the blue line is below 80%, while the red line differs little from its accuracy on the training set. The experimental results indicate that image augmentation as a preprocessing method alleviates overfitting well.
AID. Before experimenting with AID, this paper reduces the sample resolution in AID to 256 × 256, which is consistent with the resolution of the other two datasets. Fig 8 shows the loss and accuracy of the GhostNet model before and after AID image augmentation.
As shown in Fig 8, the training loss after image augmentation falls to a lower value than the training loss before image augmentation, indicating that the model converges earlier after augmentation. In Fig 8(a), the loss before image augmentation fluctuates throughout training, which is caused by the imbalance of sample data in the dataset before augmentation. As shown in Fig 8(b) and Fig 8(d), the fitting behavior on AID is similar before and after image augmentation, but the overall accuracy after augmentation is higher.
NWPU-RESISC. Fig 9 shows the loss and accuracy of the GhostNet model before and after NWPU-RESISC image augmentation. The red line represents the trend after image augmentation, the blue line represents the trend before image augmentation, and steps represent the training rounds. Fig 9(b) and Fig 9(d) indicate that the validation and training accuracies show a small degree of overfitting both before and after image augmentation, and that the overall accuracy after augmentation is higher than before. The experimental results show that image augmentation can indeed improve the accuracy of the network to a certain extent.

Comparison of models
To verify the validity of the GhostNet model based on structure compression, we compare the MobileNetV3-Small, GhostNet, and Modified GhostNet models using different evaluation metrics, including FLOPs, memory usage, predicted time, and average accuracy.
FLOPs. In lightweight neural networks, Floating Point Operations (FLOPs) are commonly used to measure the complexity of the network model. FLOPs denotes the number of floating-point operations, that is, the amount of computation. The FLOPs of a convolutional layer are calculated from the following quantities: H and W are the height and width of the input feature map, $C_{in}$ is the number of channels of the input feature map, $C_{out}$ is the number of channels of the output feature map, and F is the size of the convolution kernel.
Since the amount of computation of a model is independent of the dataset and depends only on the structure of the model itself, only the results on one dataset are shown below; the others are the same. Table 6 shows the FLOPs of different models on the UCMerced dataset. The amount of computation of the Modified GhostNet model is about one third of that of the original GhostNet model. Compared with MobileNetV3-Small, it is also reduced to nearly one third.
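The exact FLOPs expression used by the paper is not reproduced in this text. As an illustration only, one common convention counts the multiply-accumulate operations of a convolutional layer as H · W · C_in · C_out · F², assuming stride 1, padding that preserves the spatial size, and no bias. The sketch below applies that assumed convention to a hypothetical layer and also checks the overall reduction factor from the two totals reported in the abstract (7.85 MFLOPs and 2.58 MFLOPs).

```python
def conv_flops(h: int, w: int, c_in: int, c_out: int, f: int) -> int:
    """Multiply-accumulate count of one convolutional layer.

    Assumed convention (not confirmed by the source): stride 1, output
    spatial size equal to the input size h x w, no bias term.
    """
    return h * w * c_in * c_out * f * f


# Hypothetical layer: 56 x 56 feature map, 16 -> 32 channels, 3 x 3 kernel.
layer = conv_flops(h=56, w=56, c_in=16, c_out=32, f=3)
print(f"layer cost: {layer / 1e6:.2f} M multiply-accumulates")

# Reduction factor between the reported totals of GhostNet and Modified GhostNet.
reduction_factor = 7.85 / 2.58
print(f"FLOPs reduction factor: {reduction_factor:.2f}x")
```

The count depends only on layer shapes, which is why the result is the same regardless of the dataset used for training.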
Memory usage and predicted time. Embedded devices place high demands on model memory usage and real-time performance. Fewer parameters mean lower memory usage, and less computation means shorter running time. Together, these two factors determine the prediction speed of the network model. The memory usage and predicted time of different models are shown in Table 7.
The quantitative results show that the number of parameters of the Modified GhostNet is reduced to roughly one third of that of the original GhostNet model, and the overall prediction time is reduced from 52.5 ms to 42.6 ms. The Modified GhostNet model also outperforms the MobileNetV3-Small model in both memory usage and forward inference time.
Acc. Besides FLOPs, memory usage, and real-time performance, the average accuracy (Acc) of models deployed on embedded devices is one of the most critical metrics. The Acc of different models on the three datasets is shown in Table 8.
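The reported gains can be verified with simple arithmetic. The sketch below uses only figures stated in this paper (prediction time of 52.5 ms and 42.6 ms, memory of 16.40 MB and 5.70 MB); the helper name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent: (before - after) / before * 100."""
    return (before - after) / before * 100.0


time_reduction = relative_reduction(52.5, 42.6)   # predicted time on Jetson TX2, in ms
memory_factor = 16.40 / 5.70                      # memory usage in MB, GhostNet vs Modified
print(f"predicted time reduced by {time_reduction:.2f}%")
print(f"memory reduced by a factor of {memory_factor:.2f}")
```

The time reduction agrees with the 18.86% stated in the abstract, and the memory factor of about 2.9 corresponds to the "nearly 3 times" reduction.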
Dropout with a ratio of 0.05 and weight transfer are employed in the Modified GhostNet model proposed in this paper. The average accuracy of the three models on the three datasets indicates that the Modified GhostNet has higher accuracy. On the UCMerced dataset, the average accuracy of the Modified GhostNet is 96.19%, which is better than that of MobileNetV3-Small and the original GhostNet. On the AID dataset, the average accuracy of the Modified GhostNet is 92.05%, which is much higher than that of MobileNetV3-Small and the original GhostNet. Compared with the original network structure, the Modified GhostNet improves accuracy by 3.39% on UCMerced and by 4.70% on AID. However, the average accuracy of the three models is similar on the NWPU-RESISC dataset, where the Modified GhostNet improves by 0.28% over the original GhostNet.
The effect of dropout position on average accuracy is shown in Fig 10. Experiments are performed on the three datasets, with dropout applied at the 5th, 7th, 9th, and 11th layers of the Modified GhostNet. Dropout can improve the classification accuracy of the model, and its effect depends on where it is placed. On UCMerced and AID, adding dropout at the 9th layer greatly improves the accuracy of the model, while on NWPU-RESISC dropout at the 7th layer is most effective.
For small datasets, fine-tuning the network model by weight transfer can usually achieve good results. Weight transfer is a technique commonly used in transfer learning, where a model pre-trained on a large dataset is used as the starting point for training a model on a smaller dataset. NWPU-RESISC is the larger dataset, while UCMerced and AID are smaller. In this experiment, the weights obtained by training on NWPU-RESISC are used as pre-trained weights, and weight transfer training is applied to UCMerced and AID. This strategy fine-tunes the network model to fit each dataset. Fig 11 shows the change in the classification accuracy of the model before and after using transfer learning. The quantitative results show that adjusting the network model by weight transfer can avoid network overfitting and effectively improve the prediction accuracy of the network.
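To make the dropout ratio of 0.05 concrete, the following is a minimal sketch of inverted dropout on a flat list of activations. Inverted dropout (scaling the kept activations by 1 / (1 - rate) during training) is the behavior of the Keras Dropout layer; applying it to a plain Python list and the function name used here are simplifications for illustration, not the paper's implementation.

```python
import random
from typing import List


def dropout(values: List[float], rate: float, rng: random.Random,
            training: bool = True) -> List[float]:
    """Inverted dropout: zero each value with probability `rate` and scale
    the kept values by 1 / (1 - rate); identity at inference time."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.05, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction of activations set to zero: {zero_fraction:.3f}")
```

With a ratio of 0.05 only about one activation in twenty is removed per step, so the regularization is mild; the position-dependent effect reported in Fig 10 is an empirical result of the paper and is not reproduced by this sketch.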

Discussion
The evaluation metrics examined in this study were FLOPs, memory usage, predicted time, and Acc. The values obtained for the modified network are as follows: the FLOPs were 2.58 MFLOPs, the memory usage was 5.7 MB, the predicted time on Jetson TX2 was 42.6 ms, the Acc on the UCMerced dataset was 96.19%, and the Acc on the AID dataset was 92.05%. Relative to the basic GhostNet, the proposed network achieved a reduction of 5.27 MFLOPs in FLOPs, a reduction of 18.86% in predicted time, and gains of 3.39% in Acc on the UCMerced dataset and 4.70% in Acc on the AID dataset, indicating an improvement in the classification performance of the proposed model.
In recent years, artificial intelligence and deep learning methods have become some of the most popular and useful approaches in scene classification. LeCun et al. [42] introduced the LeNet5 convolutional neural network model. Lin et al. [26] proposed Network in Network (NIN), in which the entire network model is formed by stacking sub-networks and can be changed arbitrarily. Based on the idea of increasing network depth, VGG [22] increased the depth of the CNN model to about 20 layers. The error rate of image recognition also decreased from 11% to 6.7%, gradually approaching the human eye error rate of 5.1%. He et al. [23] formulated ResNet, improving network accuracy by increasing network depth. However, the real-time performance of these studies still needs improvement. In addition, one limitation of these studies is that the computing performance of most embedded devices cannot support deep convolutional neural network models.
Previous research has focused on improving accuracy, and few studies have actually been applied to embedded environments. Moreover, no matter how high the classification accuracy is, the experiments are carried out on high-performance equipment in the laboratory, and the network models do not leave the laboratory and enter practice. Thus, porting neural network models to embedded devices is still an important issue for UAV image scene classification. The main goal of this study is to reduce computational costs and apply the lightweight network model to real-time scene classification of UAV images.
In this study, the GhostNet was modified to address the challenges of distinguishing ground objects in UAV images in real time. The Modified GhostNet model not only reduces the memory usage and the amount of computation of the model, thereby improving its prediction speed, but also improves the accuracy on some datasets. Because a network model with high complexity easily overfits when trained on small datasets, structural compression can reduce the complexity of the network and alleviate overfitting to some extent. Compared with MobileNetV3-Small and the basic GhostNet, the Modified GhostNet has faster speed and higher classification accuracy.

Conclusion
Based on the original GhostNet, we built a lightweight neural network, the Modified GhostNet model, which can be ported to an embedded device for real-time scene classification of UAV imagery. We use image augmentation methods to expand the sample diversity of the UAV image datasets, including geometric transformation, pixel color transformation, and compound transformation. The Modified GhostNet model is trained on three datasets with transfer learning and is ported to the embedded device Jetson TX2. The performance of the Modified GhostNet is evaluated in terms of FLOPs, memory usage, predicted time, and Acc. Compared with MobileNetV3-Small and the original GhostNet, the Modified GhostNet proposed in this paper reduces the amount of computation, improving the real-time processing rate while reducing memory usage.