Tree Species Classification in UAV Remote Sensing Images Based on Super-Resolution Reconstruction and Deep Learning

Abstract: We studied the use of self-attention mechanism networks (SAN) and convolutional neural networks (CNNs) for forest tree species classification using unmanned aerial vehicle (UAV) remote sensing imagery in Dongtai Forest Farm, Jiangsu Province, China. We trained and validated representative CNN models, such as ResNet and ConvNeXt, as well as the SAN model, which incorporates Transformer models such as Swin Transformer and Vision Transformer (ViT). Our goal was to compare and evaluate the performance and accuracy of these networks when used in parallel. Due to various factors, such as noise, motion blur, and atmospheric scattering, the quality of low-altitude aerial images may be compromised, resulting in indistinct tree crown edges and deficient texture. To address these issues, we adopted Real-ESRGAN technology for image super-resolution reconstruction. Our results showed that the image dataset after reconstruction improved classification accuracy for both the CNN and Transformer models. The final classification accuracies, validated by ResNet, ConvNeXt, ViT, and Swin Transformer, were 96.71%, 98.70%, 97.88%, and 98.59%, respectively, with corresponding improvements of 1.39%, 1.53%, 0.47%, and 1.18%. Our study highlights the potential benefits of Transformer and CNN for forest tree species classification and the importance of addressing the image quality degradation issues in low-altitude aerial images.


Introduction
Traditional tree species classification methods primarily rely on the expertise of forest workers who visually identify and judge trees based on features such as leaf shape, crown shape, and texture. These methods are often subjective and labor-intensive, requiring extensive fieldwork and manual identification. To improve the accuracy of tree species classification, LiDAR data are often combined with hyperspectral (HS) data [1][2][3]. However, the processing of airborne LiDAR data can be costly and complex, which makes this method unsuitable for large-scale forest classification [4]. Conventional research methods for tree species classification include manual feature extraction and classical machine learning algorithms, such as Support Vector Machines (SVMs) [5][6][7], Artificial Neural Networks (ANNs) [8,9], and Random Forest (RF) [10][11][12]. Burai, P. et al. [13] used airborne HS imagery and image classification methods (multi-label classification and SVM) combined with feature extraction to discriminate between species and clones of energy trees. They proposed an adaptive binary tree SVM classifier (ABTSVM) to improve the species-level classification accuracy. Rocha, S.J.S.S.D. et al. [14] used ANNs based on competition index and climatic and categorical variables to predict tree survival and mortality in the semideciduous seasonal forests of the Atlantic Forest biome, reaching a high classification performance. Freeman, E.A. et al. [15] proposed strategies to address the issues of varying sampling intensity across different strata and the imbalanced presence of target species in training data when using the RF model for species distribution modeling. However, traditional machine learning methods for tree species classification often rely on handcrafted features, including various vegetation indices and texture features. 
In contrast, CNNs offer a significant advantage over traditional approaches in extracting essential features from raw data, which enables their widespread application in tree species classification tasks.
Since LeCun, Y. and his colleagues pioneered the use of CNNs for image classification tasks in 1998 [16], it has been demonstrated that CNNs have significant advantages in extracting low-level features and visual structures. More recently, Krizhevsky, A. and his team introduced a deep CNN architecture called AlexNet [17], which was trained on a large scale using GPUs and obtained breakthrough results on the ImageNet dataset. At the same time, the development of low-altitude UAVs has made it easier to acquire high-resolution aerial images with richer texture information than satellite images [18]. This combination has resulted in significant advances in tree species classification. For example, Nezami, S. et al. [19] proposed a deep learning method based on 3D-CNN for the high-accuracy classification of three major tree species in a boreal forest using RGB and HS data layers. Kapil, R. et al. [20] proposed a RetinaNet-based method that reached an average accuracy of 98.95% in classifying different stages of bark beetle attacks on individual trees. Hu, M. et al. [21] used a transfer-learning-based approach that fused multiple deep learning models to solve tree species classification in complex backgrounds, attaining an overall average accuracy of 93.75%. Natesan, S. et al. [22] used DenseNet for classifying forest tree species at the individual tree level using high-resolution RGB images from UAVs. Their validation results demonstrated an accuracy of over 84% in distinguishing coniferous tree species in eastern Canada. Ford, D. J. [23] investigated the use of high-resolution RGB imagery from UAVs for tree species classification in a tropical wet forest. The study compared three classifiers and found that U-Net obtained the highest overall accuracy of 71.2%, suggesting the suitability of CNNs for fine-grained species-level classification using UAV data.
In addition, some researchers have used deep learning in combination with UAV-Borne LiDAR data for individual tree crown segmentation studies [24]. However, it is noted that canopy images are different from natural images, and mutual relationships between tree canopies can affect classification results in forests with medium to high canopy density. In addition, low-altitude drone canopy images exhibit intra-canopy heterogeneity at the level of individual tree crowns, while displaying repeated similarity at the overall level. Additionally, due to hardware limitations of the drone imaging equipment, the images may suffer from blurred tree crown boundaries and weak texture factors. Therefore, simply adopting a residual network model to identify tree species can lead to overlooking the importance of different channel feature maps for classification results and result in a limitation of accuracy. However, attention mechanism models have a natural long-range modeling ability that enables the full utilization of effective global information from shallow to deep layers. By taking advantage of this ability, researchers can more effectively classify tree species and overcome the unique challenges posed by canopy images.
The goal of this research article is to investigate tree species classification methods using both CNN models and Transformer models on UAV tree crown images. Our proposed approach involves several steps. Firstly, we utilized a sliding window cropping technique on low-altitude drone canopy images to obtain smaller image patches. Then, we applied Real-ESRGAN technology for super-resolution reconstruction to restore the blurry canopy images and enhance their spatial resolution. Next, we made use of both CNN models and Transformer models to extract features from the tree crown images, and performed a differential comparative analysis to evaluate the performance of these two approaches. Due to the relatively modest scale of the canopy sample set in this experiment, and taking into account the constraints imposed by computational resources, this study opted to train and validate the ResNet-50, ConvNeXt-T, Swin-T, and ViT-B models. These models boast fewer parameters and exhibit lower computational intricacy, thereby enhancing the efficacy of the training and inference processes. Through this efficient and accurate tree species classification method based on low-altitude aerial tree crown images, we aim to delve into the potential applications of artificial intelligence in forestry intelligence and information construction, and explore the possibilities of improving tree species classification accuracy using advanced deep learning techniques.

Study Area
The study area was located in Dongtai Forest Farm, Yancheng City, Jiangsu Province, with geographical coordinates ranging from 120°47′11″E to 120°52′0″E and 32°53′30″N to 32°51′17″N, as shown in Figure 1. We used a Liortho high-resolution imaging system mounted on a Digital Green Earth octocopter UAV for data collection, with a flight altitude of 200 m and image resolution of 0.2 m. In the Dongtai Forest Farm study area, we picked out three representative plots as experimental areas. From these areas, we selected four predominant tree species, Poplar, Metasequoia, Bamboo, and Ginkgo, as the primary subjects of our research.
Given that neural network models, especially Transformer models and their variants, can have millions or even billions of parameters, it is crucial to have sufficient samples for effective training. To increase the sample size, we performed data augmentation on the training and validation sample sets.
The data augmentation techniques applied to the tree canopy image samples included random rotations by 90°, 180°, and 270°, horizontal flipping, and brightness adjustment, as depicted in Figure 3. These transformations were implemented using the 'RandomRotation', 'RandomHorizontalFlip', and 'ColorJitter' functions provided by the Torchvision library. By incorporating these techniques into the data augmentation pipeline, variations were introduced to the dataset, allowing for enhanced training and improved model generalization. Random rotations provided diverse perspectives of the tree canopies, while horizontal flipping increased the dataset's diversity by mirroring the images. Additionally, brightness adjustment ensured robustness to varying lighting conditions, enabling the model to generalize better across different environments. These techniques collectively contributed to the overall robustness and performance of the tree canopy classification model. Furthermore, we randomly divided the samples into training and validation sets in an 8:2 ratio, respectively. This ensured that both sets had a representative distribution of samples from each tree species, allowing for robust model evaluation and performance estimation. The augmented dataset with increased sample size and diversity, along with the appropriate training and validation set partitioning, facilitated the training of the neural network models for accurate tree species recognition in the study area.
Remote Sens. 2023, 15, x FOR PEER REVIEW 5 of 23
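As a rough illustration of the augmentation and partitioning described in this section, the following dependency-free sketch applies the three kinds of transformation to a toy single-channel image and performs an 8:2 split. It is not the Torchvision pipeline used in the study: the function names, the flip probability of 0.5, and the brightness range of 0.8 to 1.2 are all assumptions chosen for the example.

```python
import random


def rotate90(img, k):
    """Rotate a 2-D grid (list of rows) clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        img = [list(row) for row in zip(*img[::-1])]
    return img


def hflip(img):
    """Mirror every row left to right."""
    return [row[::-1] for row in img]


def adjust_brightness(img, factor):
    """Scale pixel values by factor and clamp them to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img, rng):
    """One random rotation, an optional flip, and a brightness jitter."""
    out = rotate90(img, rng.choice([1, 2, 3]))             # 90, 180 or 270 degrees
    if rng.random() < 0.5:                                  # assumed flip probability
        out = hflip(out)
    return adjust_brightness(out, rng.uniform(0.8, 1.2))    # assumed jitter range


def split_8_2(samples, rng):
    """Shuffle a copy of the samples and split it 8:2 into train and validation."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
patch = [[10, 20], [30, 40]]            # toy 2 x 2 single-channel "canopy patch"
augmented = augment(patch, rng)
train, val = split_8_2(range(100), rng)
print(len(train), len(val))
```

Restricting the rotation to multiples of 90° keeps the patch square and avoids interpolation or empty corners, which may be why these angles were chosen.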

Super-Resolution Reconstruction
Owing to hardware limitations and acquisition conditions, such as dense forest stands, aerial photography angles, limited resolution, and complex geographic backgrounds, drone aerial images often suffer from degradation issues such as lost details, reduced brightness, and partially blurred tree crowns. To solve these problems, this paper uses Real-ESRGAN technology [25] to perform super-resolution reconstruction, denoising, and deblurring on the original tree crown images. Classical first-order degradation models, as expressed in Equation (1), often consider only one type of degradation operation, such as blur or noise, while ignoring the simultaneous presence of multiple degradation operations.

$$x = D(y) = \left[(y \circledast k)\downarrow_{r} + n\right]_{\mathrm{JPEG}} \tag{1}$$

Here, $x$ denotes the degraded image, $D(\cdot)$ denotes the degradation process, $y$ denotes the input image, $k$ denotes the blur kernel, $\downarrow_{r}$ denotes downsampling by a factor of $r$, $n$ denotes the noise, and $[\cdot]_{\mathrm{JPEG}}$ denotes the compression of the obtained result in JPEG format.
However, in practical scenarios, images may undergo multiple degradations simultaneously. For example, during transmission, images may experience blur, resolution reduction, and the introduction of noise. In order to more accurately simulate the degradation process of real images, Real-ESRGAN proposes a high-order degradation model, as expressed in Equation (2), which utilizes multiple repeated degradation processes, with each degradation process representing a classical degradation model.

$$x = D^{N}(y) = (D_{N} \circ \cdots \circ D_{2} \circ D_{1})(y) \tag{2}$$

Equation (2) is a repeated application of the first-order degradation, where each $D_{i}$ represents the execution of one first-order degradation and $N$ is the order of the model. Through ablation experiments, the second-order model ($N = 2$) has been shown to offer excellent performance and practicality, effectively meeting the requirements of most image processing tasks [25].
Real-ESRGAN's second-order degradation model, as shown in Figure 4, aims to simulate the degradation process of real images through several steps, including blur processing, downsampling, noise addition, and JPEG compression. Firstly, in the blur processing stage, isotropic and anisotropic Gaussian blur kernels are applied to the original image, resulting in the loss of details and clarity to mimic real-world blurring effects. Subsequently, the blurred image undergoes downsampling using randomly selected methods such as bicubic interpolation, bilinear interpolation, and area interpolation, further reducing the image's resolution and diminishing details and clarity to simulate actual resolution reduction effects. Next, noise is added based on the image type (color or grayscale). For color images, both Gaussian noise and noise following the Poisson distribution are added to simulate environmental and sensor noise, while grayscale images only receive Gaussian noise. Finally, the image is subjected to JPEG compression based on a compression quality parameter ranging from 0 to 100, where lower compression quality leads to poorer image quality and more severe distortion.
A sinc filter is used both in the initial blur processing step and in the final synthesis step. The expression for the sinc filter is shown in Equation (3).

$$k(i, j) = \frac{\omega_{c}}{2\pi\sqrt{i^{2} + j^{2}}}\, J_{1}\!\left(\omega_{c}\sqrt{i^{2} + j^{2}}\right) \tag{3}$$

Here, $(i, j)$ is the kernel coordinate, $\omega_{c}$ is the cutoff frequency, and $J_{1}$ is the first-order Bessel function of the first kind. The sinc filter exhibits high selectivity in the frequency domain, responding differently to signals of various frequencies. Consequently, the sinc filter can smooth and blur specific frequency details in the image, thereby reducing its details and clarity. Additionally, the sinc filter possesses inverse filtering properties, allowing it to repair blurred or degraded images during the restoration process, mitigating degradation effects and improving image quality. It is worth noting that in the final synthesis step, the order of applying the sinc filter and JPEG compression is randomly exchanged to cover a broader range of degradation scenarios.
In summary, Real-ESRGAN's second-order degradation model adopts various degradation operations, such as blur processing, downsampling, noise addition, and JPEG compression, to simulate the degradation process of real images. The resulting degraded images lose detail and clarity, exhibiting visual effects such as blurring, reduced resolution, noise, and distortion, thereby providing challenging inputs for the subsequent Real-ESRGAN super-resolution reconstruction process.
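The idea of a high-order degradation as a repeated first-order degradation can be illustrated with a small, dependency-free sketch. The operators here are deliberately crude stand-ins, not the Real-ESRGAN implementation: a 3 × 3 mean filter replaces the Gaussian and sinc kernels, strided sampling replaces interpolation-based resizing, and coarse quantization replaces JPEG compression. All function names and parameter values are assumptions made for the example.

```python
import random


def box_blur(img):
    """3 x 3 mean filter with edge clamping (stand-in for the blur kernel k)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            vals = [img[min(h - 1, max(0, i + di))][min(w - 1, max(0, j + dj))]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            row.append(sum(vals) / 9.0)
        out.append(row)
    return out


def downsample(img, r):
    """Keep every r-th pixel in both directions (stand-in for resizing by r)."""
    return [row[::r] for row in img[::r]]


def add_noise(img, sigma, rng):
    """Additive Gaussian noise n with standard deviation sigma."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in img]


def quantize(img, step):
    """Coarse quantization clamped to 0-255 (crude stand-in for JPEG)."""
    return [[min(255, max(0, step * round(p / step))) for p in row] for row in img]


def first_order(img, r, sigma, step, rng):
    """One pass of D: blur, downsample, add noise, compress."""
    return quantize(add_noise(downsample(box_blur(img), r), sigma, rng), step)


def high_order(img, order, r, sigma, step, rng):
    """Apply the first-order degradation `order` times in sequence."""
    for _ in range(order):
        img = first_order(img, r, sigma, step, rng)
    return img


rng = random.Random(0)
y = [[float(i * 16 + j) for j in range(16)] for i in range(16)]   # toy 16 x 16 image
x = high_order(y, order=2, r=2, sigma=2.0, step=8, rng=rng)
print(len(x), len(x[0]))
```

With a downsampling factor of 2 per pass, two passes reduce the 16 × 16 toy image to 4 × 4, which shows how the degradations compound across orders.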
Real-ESRGAN builds upon the generator structure of ESRGAN [26], which comprises numerous Residual-in-Residual Dense Blocks (RRDB), and refines it through further design improvements. The images obtained through the preceding pre-processing steps are fed into this generator, as shown in Figure 5. First, a pixel-unshuffle operation is applied to reduce the spatial dimensions and increase the number of channels. The result is then fed into the main architecture of ESRGAN for super-resolution reconstruction.
Moreover, because Real-ESRGAN aims to handle a considerably wider range of degradations than ESRGAN, the original VGG-style discriminator design in ESRGAN is no longer suitable. Instead, Real-ESRGAN introduces a U-Net framework with skip connections for the discriminator, inspired by [27,28]. Finally, the generated images are mixed with the input images and fed into the discriminator, a spectrally normalized U-Net that mitigates the excessive sharpness and artifacts introduced by GAN training. The original tree species canopy images restored by Real-ESRGAN super-resolution reconstruction are shown in Figure 6. The restored canopy texture details make it easier for the neural network models to extract clear canopy edge contour features and detailed texture features.
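The rearrangement applied at the generator input, which trades spatial size for channels, is commonly called pixel-unshuffle. A minimal sketch on nested lists follows, assuming an H × W × C layout and a block size `s`. The function name and the ordering of the output channels are illustrative choices and do not necessarily match any particular library.

```python
def pixel_unshuffle(img, s):
    """Rearrange an H x W x C image into (H/s) x (W/s) x (C*s*s)."""
    h, w = len(img), len(img[0])
    if h % s or w % s:
        raise ValueError("height and width must be divisible by s")
    out = []
    for i in range(0, h, s):
        row = []
        for j in range(0, w, s):
            # Gather the s x s block whose top-left corner is (i, j)
            # into a single, longer channel vector.
            pixel = []
            for di in range(s):
                for dj in range(s):
                    pixel.extend(img[i + di][j + dj])
            row.append(pixel)
        out.append(row)
    return out


img = [[[r * 4 + c] for c in range(4)] for r in range(4)]   # toy 4 x 4 x 1 image
out = pixel_unshuffle(img, 2)                                # becomes 2 x 2 x 4
print(out[0][0])
```

No pixel value is discarded: the operation only moves information from the spatial axes into the channel axis, so the subsequent convolutions run on a smaller grid.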

ResNet Model
Residual blocks proposed by ResNet [29], as shown in Figure 7, can effectively solve the problem of network degradation in deep networks. Two kinds of shortcut connections are adopted within the layers of ResNet: identity shortcuts are used when the dimensions of the input and output are equal, while projection shortcuts are used to align dimensions [29]. In this paper, we use ResNet-50 as a representative of such models. The decision to adopt ResNet-50 was grounded on empirical observations and experiments conducted by the researchers who proposed the ResNet architecture. Their findings revealed that surpassing a certain depth threshold in the network resulted in diminishing performance improvements or even a decline in accuracy, primarily due to the issue of vanishing gradients [29]. ResNet-50 strikes a good balance between model depth and performance. As the name suggests, ResNet-50 consists of 50 layers, as shown in Figure 8, divided into five stages. Stage 0 can be regarded as the preprocessing of input data, while stages 1 to 4 are composed of 3, 4, 6, and 3 bottleneck blocks, respectively.
(1) In Stage 0, the first step was to convert the canopy image, which has a size of 224 × 224, into a digital matrix with dimensions [224, 224, 3]. Subsequently, we used a convolutional layer with a 7 × 7 kernel, a stride of 2, and 64 output channels, resulting in a feature map measuring 112 × 112 × 64. Furthermore, a 3 × 3 max pooling layer with a stride of 2 was applied to reduce the feature map's size, yielding a 56 × 56 × 64 representation.
(2) Moving forward, the 56 × 56 × 64 feature map underwent processing in Stage 1. At this stage, we adopted three bottleneck blocks. Each block consisted of a sequence of convolutional operations: a 1 × 1 convolution with 64 output channels, a 3 × 3 convolution with 64 output channels, and another 1 × 1 convolution with 256 output channels. These operations reshaped the feature map, resulting in a 56 × 56 × 256 representation.
(7) Finally, the 1 × 1 × 2048 feature map was flattened into a one-dimensional vector and passed through a fully connected layer for classification. Given that this experiment encompasses four tree species, the output provides probability values for the four categories. The detailed architecture specifications of ResNet-50 are described in Table 1.
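The feature-map sizes quoted for ResNet-50 follow from the usual convolution output-size formula, floor((in + 2p − k) / s) + 1. The sketch below traces them for a 224 × 224 input. The padding values and the channel counts and strides of stages 2 to 4 are not stated in the text; they are taken from the standard ResNet-50 design and should be read as assumptions.

```python
def conv_out(size, kernel, stride, padding):
    """Output length along one axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet50_shapes(size=224, num_classes=4):
    """Return (name, height, width, channels) after each part of a ResNet-50."""
    shapes = []
    size = conv_out(size, kernel=7, stride=2, padding=3)    # Stage 0: 7x7 conv, stride 2
    shapes.append(("stage0_conv", size, size, 64))
    size = conv_out(size, kernel=3, stride=2, padding=1)    # Stage 0: 3x3 max pool, stride 2
    shapes.append(("stage0_pool", size, size, 64))
    # (output channels, stride of the stage's 3x3 convolution); the stages hold
    # 3, 4, 6 and 3 bottleneck blocks. Stage 1 matches the text; stages 2-4 use
    # the standard ResNet-50 values (assumption).
    stages = [(256, 1), (512, 2), (1024, 2), (2048, 2)]
    for idx, (channels, stride) in enumerate(stages, start=1):
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((f"stage{idx}", size, size, channels))
    shapes.append(("global_avg_pool", 1, 1, 2048))
    shapes.append(("fc_logits", 1, 1, num_classes))
    return shapes


for row in resnet50_shapes():
    print(row)
```

The trace ends with a 7 × 7 × 2048 map that global average pooling reduces to 1 × 1 × 2048, consistent with the vector passed to the fully connected layer in the final step.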

ConvNeXt Model
The ConvNeXt model [30] improves upon the ResNet architecture by incorporating ideas from the Transformer network [31], as depicted in Figure 9. Specifically, ConvNeXt adopts depthwise convolutions to build its convolution blocks and borrows the design of the inverted bottleneck structure, resulting in enhanced performance. In this study, we made use of the ConvNeXt-T model. The process of classifying tree species with ConvNeXt-T involved the following steps:
(1) To begin, we started with an input canopy image of size 224 × 224 × 3. This image was passed through a convolutional layer with a kernel size of 4 and a stride of 4. Afterward, a Layer Norm (LN) was applied to normalize the feature map, resulting in a 56 × 56 × 96-sized feature map. The LN operation plays a crucial role in enhancing network stability and generalization by standardizing the feature map.
(2) Moving on, the feature map underwent four stages built from ConvNeXt blocks, containing 3, 3, 9, and 3 blocks, respectively. Each ConvNeXt block was composed of a 7 × 7 depthwise convolution with a stride of 1 and a padding of 3, followed by an LN. In addition, two 1 × 1 Conv2d layers with a stride of 1 and a GeLU activation function [32] were adopted. The output channels of the ConvNeXt blocks in the four stages were 96, 192, 384, and 768, respectively. Before each of the second to fourth stages, a downsampling operation was performed on the feature map, consisting of an LN and a Conv2d layer with a stride of 2. This operation halves the height and width of the feature map relative to the previous stage. As a result, the feature map contained 768 channels after passing through all four stages.
(3) Next, the feature map was passed through a global average pooling layer, and a feature vector with a size of 1 × 1 × 768 was obtained.
(4) The final step involved passing the feature vector through a fully connected layer to convert the 768-dimensional vector into an output vector of size 4. This output vector represents the probability values of the four tree species categories. Table 2 shows the detailed structural specifications of the ConvNeXt-T.
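The ConvNeXt-T steps can be checked with a short shape trace. This is only a bookkeeping sketch, assuming an unpadded stem and a 2 × 2 kernel for the stride-2 downsampling layers (the kernel size is not given in the text); the function name and tuple layout are made up for the example.

```python
def convnext_t_shapes(size=224, num_classes=4):
    """Return (name, height, width, channels, blocks) for each part of ConvNeXt-T."""
    depths = [3, 3, 9, 3]              # ConvNeXt blocks per stage (from the text)
    dims = [96, 192, 384, 768]         # output channels per stage (from the text)
    size = (size - 4) // 4 + 1         # stem: 4x4 convolution, stride 4, no padding
    trace = [("stem+LN", size, size, dims[0], 0)]
    for stage, (depth, dim) in enumerate(zip(depths, dims), start=1):
        if stage > 1:
            # LN followed by a stride-2 convolution; the 2x2 kernel is an assumption.
            size = (size - 2) // 2 + 1
        # The 7x7 depthwise convolution (stride 1, padding 3) preserves the size,
        # since (size + 2*3 - 7) // 1 + 1 == size.
        trace.append((f"stage{stage}", size, size, dim, depth))
    trace.append(("global_avg_pool", 1, 1, dims[-1], 0))
    trace.append(("fc_logits", 1, 1, num_classes, 0))
    return trace


for row in convnext_t_shapes():
    print(row)
```

The spatial size goes 56 → 28 → 14 → 7 while the channels go 96 → 192 → 384 → 768, so the pooled vector has 768 entries as stated in step (3).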

ViT Model
The ViT model [33] is a Transformer model designed specifically for image classification. It is composed of three modules: Embedding, Transformer Encoder, and Multi-Layer Perceptron (MLP) Head. The Transformer Encoder comprises LN, Multi-head Attention, Dropout, and MLP Block. For this paper, we used the ViT-B model, as shown in Figure 10. The image classification process using ViT-B worked as follows:
(1) Initially, we input a tree canopy image with dimensions of 224 × 224 × 3. It underwent convolution using a 16 × 16 kernel and a stride of 16, resulting in a feature map sized 14 × 14 × 768.
(2) Next, we flattened the feature map in both the height and width directions, transforming its size to 196 × 768. Subsequently, we concatenated a class token and applied positional encoding to the feature map, yielding a transformed size of 197 × 768.
(3) Following this step, we applied Dropout and passed the input through 12 stacked Encoder Blocks. The output from the Encoder was then processed with LN, maintaining the feature map size at 197 × 768. We proceeded to extract the output corresponding to the class token by slicing, obtaining a vector of size 1 × 768. This vector was then fed into the MLP Head.
(4) Finally, the feature vector was fed into a fully connected layer with four neurons for classification, where each neuron represented a tree species category, resulting in the final classification results.
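The token counts in the ViT-B steps follow directly from the patch size. The helper below is a small sketch of that arithmetic; its name and dictionary keys are invented for the example, and the attention-matrix entry is added only to show why the cost of self-attention grows with the square of the token count.

```python
def vit_b_shapes(image=224, patch=16, dim=768, num_classes=4):
    """Shapes produced by the patch embedding and class token of ViT-B."""
    grid = image // patch              # 16x16 kernel with stride 16: 224 / 16 = 14
    tokens = grid * grid               # 14 * 14 = 196 patch tokens
    return {
        "patch_embedding": (grid, grid, dim),
        "flattened": (tokens, dim),
        "with_class_token": (tokens + 1, dim),
        "attention_matrix_per_head": (tokens + 1, tokens + 1),
        "class_token_output": (1, dim),
        "logits": (num_classes,),
    }


for name, shape in vit_b_shapes().items():
    print(name, shape)
```

Each of the 12 encoder blocks keeps the 197 × 768 shape, so only the class-token row is needed by the classification head.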

Swin Transformer Model
Due to the high computational cost and memory consumption of the self-attention mechanism in ViT models when processing high-resolution images, the Swin Transformer model was proposed [34], which adopts a hierarchical structure as shown in Figure 11. The model comprises four key components: Patch Partition, Linear Embedding, Swin Transformer Block, and Patch Merging. Two successive Swin Transformer Blocks, illustrated in Figure 12, incorporate Window-based Multi-head Self-attention (W-MSA) and Shifted Window-based Multi-head Self-attention (SW-MSA) to reduce memory consumption while maintaining efficient performance. Additionally, Patch Merging adopts pooling-like operations to progressively reduce the feature map size and merge image blocks, constructing a hierarchical feature map in deeper layers. LN layers are used before each MSA module and each MLP, and residual connections are used after each MSA and MLP. These characteristics make it a versatile backbone for image classification and dense recognition tasks. In this study, we adopted the Swin-T model as a representative example of such models. The classification process with Swin-T was as follows:
(1) The input was a tree crown image with dimensions 224 × 224 × 3, which underwent Patch Partition. This process divided the image into fixed-size blocks using a 4 × 4 convolutional kernel. The resulting feature map dimensions were 56 × 56 × 48.
(2) The Linear Embedding layer was applied to the channel data of each pixel, a linear transformation that changed the feature map dimensions to 56 × 56 × 96.
(3) The feature map then passed through a series of Swin Transformer blocks. These blocks came in two variations: one using the W-MSA structure and the other using the SW-MSA structure. Consequently, the Swin Transformer blocks appeared in even numbers, with 2 blocks in each of the first, second, and fourth stages, and 6 blocks in the third stage. Each Swin Transformer block incorporated window partitioning and window reverse operations, so its input and output sizes were identical; in the first stage, the feature map size therefore remained 56 × 56 × 96.
(4) Patch Merging was performed to reduce the spatial dimensions by half and double the channel count. This operation was applied ahead of the second, third, and fourth stages, eventually transforming the feature map dimensions to 7 × 7 × 768.
(5) Finally, global average pooling was used to reduce the spatial dimension to 1, resulting in a feature vector of size 1 × 1 × 768. A linear classifier with four neurons mapped this vector to probability values corresponding to the four tree species categories, yielding the final prediction. The detailed Swin-T architecture specification is described in Table 3.
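The stage-by-stage sizes, and the reason windowed attention is cheaper than global attention, can be sketched as follows. This is a hedged illustration rather than the authors' code: the two cost expressions are the multiply-add estimates given in the Swin Transformer paper [34], and the window size of 7 is the Swin-T default, which is not stated in the text above.

```python
def swin_t_shapes(image_size=224, patch_size=4, embed_dim=96, depths=(2, 2, 6, 2)):
    """Trace (height, width, channels) through the Swin-T steps in the text."""
    size = image_size // patch_size
    # Patch Partition: each 4 x 4 x 3 patch becomes one 48-value vector.
    shapes = [("patch_partition", (size, size, patch_size * patch_size * 3))]
    channels = embed_dim
    shapes.append(("linear_embedding", (size, size, channels)))
    for stage, _depth in enumerate(depths, start=1):
        if stage > 1:
            # Patch Merging: halve height and width, double the channels.
            size //= 2
            channels *= 2
        # Swin Transformer blocks keep the feature map size unchanged.
        shapes.append((f"stage{stage}", (size, size, channels)))
    return shapes


def msa_cost(h, w, c):
    """Multiply-adds of global multi-head self-attention on an h x w x c map."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_cost(h, w, c, window=7):
    """Multiply-adds of window-based self-attention (window size assumed to be 7)."""
    return 4 * h * w * c ** 2 + 2 * window ** 2 * h * w * c


for name, shape in swin_t_shapes():
    print(name, shape)

ratio = msa_cost(56, 56, 96) / wmsa_cost(56, 56, 96)
print(f"global / windowed attention cost at 56 x 56 x 96: {ratio:.1f}x")
```

The global term grows quadratically with the number of tokens, while the windowed term grows linearly, which is why the saving is largest in the first, highest-resolution stage.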

Experimental Environment
The experiments were conducted on a computer with the following specifications: Windows 10 operating system, an AMD Ryzen 5 5600X 6-core CPU, and an NVIDIA GeForce RTX 3070 Ti GPU with 8 GB of memory. The deep learning platform used for training and evaluation was PyTorch 1.12.0, along with cudatoolkit 11.3 for GPU acceleration. ArcGIS 10.8 was used for raster and vector data processing, and Matplotlib 3.5.2 was used for data visualization. Python 3.9.13 was the programming language used for implementation and analysis.

Comparison of CNN and Transformer Models for Tree Species Classification
The main objective of this investigation was to evaluate the impact of different models on tree species classification using canopy images. To this end, we selected two representative pre-trained models from each of the CNN and Transformer families, specifically ResNet-50 and ConvNeXt-T for the CNNs and ViT-B and Swin-T for the Transformers. For all four models, we set the image input size to 224 × 224 pixels. During training, we used a batch size of 16 and the AdamW optimization algorithm, and trained for a total of 150 epochs. The initial learning rate was set to 4 × 10⁻⁴, and the weight decay factor was set to 5 × 10⁻². We also used the LambdaLR strategy for learning rate adjustment.
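A LambdaLR-style schedule multiplies the initial learning rate by a user-supplied function of the epoch index. The text does not state which function was used, so the cosine decay and the `final_ratio` parameter in the sketch below are assumptions chosen purely for illustration; only the initial learning rate, weight decay, batch size, and epoch count come from the text.

```python
import math

# Hyperparameters quoted in the text.
BASE_LR = 4e-4
WEIGHT_DECAY = 5e-2
BATCH_SIZE = 16
EPOCHS = 150


def lr_lambda(epoch, total_epochs=EPOCHS, final_ratio=0.01):
    """Hypothetical multiplier: cosine decay from 1.0 down to final_ratio.

    The schedule actually used in the study is not specified.
    """
    cosine = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return cosine * (1 - final_ratio) + final_ratio


def learning_rate(epoch):
    """Learning rate at a given epoch: initial rate times the multiplier."""
    return BASE_LR * lr_lambda(epoch)


for epoch in (0, 50, 100, 150):
    print(f"epoch {epoch:3d}: lr = {learning_rate(epoch):.2e}")
```

In PyTorch, a function such as `lr_lambda` would typically be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, wrapped around a `torch.optim.AdamW` optimizer created with the learning rate and weight decay above.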
To assess the performance of the models, we monitored the changes in the loss function and recognition accuracy of the original image data samples on the training set and validation set across 150 epochs. These changes are visualized in Figure 13 to show the training progress and stability of each model. For all models, the accuracy and loss tend to stabilize after around 120 to 130 epochs, with classification accuracy exceeding 95%, indicating strong stability and high accuracy in the tree species classification task. Furthermore, we evaluated the overall classification accuracy (OA), Kappa coefficient, and confusion matrix of each model, as shown in Figure 14. These evaluation metrics provide quantitative measures of the performance of the models in the tree species classification task.
The superiority of the Transformer models in OA and Kappa coefficient could be attributed to their multi-head attention mechanism, which allows them to capture richer feature information more effectively. The multi-head attention mechanism enables the models to attend to different regions of the input image simultaneously, capturing both local and global contextual information.
In contrast, the CNN models, ResNet-50 and ConvNeXt-T, were found to be sensitive to factors such as blurry edges and weak texture in the original canopy images, which could impact the matching accuracy of feature points during feature extraction.
In summary, the results show that the ViT-B and Swin-T models perform better than the ResNet-50 and ConvNeXt-T models in terms of OA and Kappa coefficient on the original canopy images, most likely because the CNN models are more strongly affected by factors such as blurry edges and weak textures. This suggests that the ViT-B and Swin-T models have the potential to be effective in tree species classification tasks. However, an accurate judgment of model performance requires a comprehensive consideration of multiple evaluation metrics, as well as the needs of practical application scenarios, dataset characteristics, and model strengths and weaknesses. Further research and empirical analysis are needed to validate the performance of these models on different datasets and tasks.
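Both OA and the Kappa coefficient used above are derived from the confusion matrix. The following sketch shows the standard definitions (OA as the fraction of correctly classified samples, Cohen's Kappa as agreement corrected for chance). The 4 × 4 matrix is made-up illustrative data, not a result from this study.

```python
def overall_accuracy(cm):
    """OA: correctly classified samples (diagonal) divided by all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    # Expected agreement if predictions were independent of the true labels.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
example_cm = [
    [48, 1, 1, 0],
    [2, 46, 1, 1],
    [0, 1, 49, 0],
    [1, 0, 1, 48],
]
print(f"OA    = {overall_accuracy(example_cm):.4f}")
print(f"Kappa = {cohen_kappa(example_cm):.4f}")
```

Kappa is always somewhat lower than OA for an imperfect classifier because it discounts the agreement that would be expected by chance, which is consistent with the paired OA and Kappa values reported for the four models.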

Super-Resolution Reconstruction for Improved Tree Species Classification
Real-ESRGAN, a proven super-resolution image restoration algorithm [25], was used in this study to reconstruct the original image dataset. The reconstructed dataset was then used as input for training and validation in the four models. To ensure a fair comparison, the same hyperparameter settings were applied during training. Figure 15 shows the changes in the loss function and recognition accuracy on the training and validation sets of the reconstructed dataset for each of the four models, allowing a detailed analysis of the training progress and stability of each model. After approximately 90 to 100 epochs, the accuracy and loss of each model tend to stabilize, indicating convergence of the training process. This implies that, compared to the original dataset, the dataset reconstructed and repaired through super-resolution leads to faster convergence with fewer training iterations and to improved stability on the training set and validation set. Furthermore, Figure 16 presents the OA, Kappa coefficient, and confusion matrix of each model. These metrics provide a comprehensive assessment of the performance of each model in terms of accuracy, agreement, and confusion among different classes, and give insight into how effectively each model classifies tree species based on the reconstructed dataset. The model parameters and tree species classification results in Table 4 show the following: the ResNet-50 model reached an OA of 96.71% and a Kappa coefficient of 0.9558; the ConvNeXt-T model reached an OA of 98.70% and a Kappa coefficient of 0.9826; the ViT-B model reached an OA of 97.88% and a Kappa coefficient of 0.9716; and the Swin-T model reached an OA of 98.59% and a Kappa coefficient of 0.9810.
Compared with the data samples that were not reconstructed using Real-ESRGAN, the recognition accuracy of the four models increased by 1.39, 1.53, 0.47, and 1.16 percentage points, respectively. Among them, the ConvNeXt-T model reached the best result. Therefore, we can conclude that original image data affected by factors such as blurry canopy edges and weak texture can benefit from reconstruction using Real-ESRGAN, and that this reconstruction method can improve the accuracy of tree species classification to a certain extent.
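The reported gains are differences in OA, so the accuracy on the original dataset can be recovered by subtraction. The short sketch below does only this arithmetic on the figures quoted above; the resulting "implied" values are derived here rather than quoted, although the Swin-T value can be cross-checked against the 97.43% reported for the original dataset.

```python
# OA on the reconstructed dataset and reported gain, both in percent / points.
reconstructed_oa = {
    "ResNet-50": 96.71,
    "ConvNeXt-T": 98.70,
    "ViT-B": 97.88,
    "Swin-T": 98.59,
}
reported_gain = {
    "ResNet-50": 1.39,
    "ConvNeXt-T": 1.53,
    "ViT-B": 0.47,
    "Swin-T": 1.16,
}

# Implied OA on the original dataset: reconstructed OA minus the gain.
implied_original_oa = {
    model: round(reconstructed_oa[model] - reported_gain[model], 2)
    for model in reconstructed_oa
}

for model, oa in implied_original_oa.items():
    print(f"{model}: implied original OA = {oa:.2f}%")
```

All implied values stay above 95%, in line with the statement that every model exceeded 95% accuracy on the original dataset.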

Distribution Map of Tree Species in Dongtai Forest Plot
In this study, we selected a typical plot from the study area of Dongtai Forest Farm, covered by UAV remote sensing images, as a test sample set. To match the sample size of the original dataset, we cropped this sample with the same sliding window method. We then applied the ConvNeXt-T model to this sample set and created a tree species distribution map based on the model's predictions, as shown in Figure 17. The plot is laid out as follows: the left part is mainly planted with Metasequoia, the middle part with Poplar, and the right part with Ginkgo, while Bamboo mainly grows in the lower part.
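The cropping and mapping procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the 224-pixel window and the non-overlapping stride are placeholders (the 224 × 224 value matches the model input size, but the actual window and stride are not given in this section), and the per-tile labels stand in for model predictions.

```python
def sliding_windows(width, height, window=224, stride=224):
    """Yield (left, top, right, bottom) crop boxes in row-major order.

    Window and stride values are assumptions for illustration.
    """
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield (left, top, left + window, top + window)


def to_grid(labels, columns):
    """Arrange per-tile labels (row-major) into rows of a distribution map."""
    return [labels[i:i + columns] for i in range(0, len(labels), columns)]


# Hypothetical 672 x 448 pixel plot: 3 columns x 2 rows of tiles.
boxes = list(sliding_windows(672, 448))
columns = len({box[0] for box in boxes})

# Placeholder labels; in practice each tile would be classified by the model.
labels = ["Metasequoia", "Poplar", "Ginkgo", "Bamboo", "Bamboo", "Bamboo"]
for row in to_grid(labels, columns):
    print(row)
```

Each tile's predicted class is written back to the tile's position, so the map resolution is set by the stride: a smaller stride gives a finer map at the cost of more tiles to classify.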

This study examined the impact of different models on tree species classification in crown images. Two representative pre-trained models were selected from each of the CNN and Transformer families: ResNet-50 and ConvNeXt-T, and ViT-B and Swin-T. The experimental results show that, on the original dataset, the Transformer models are more stable in feature extraction than traditional CNN models and reach better classification accuracy. Additionally, this study used the Real-ESRGAN algorithm to perform super-resolution reconstruction and repair on the original image dataset, which improved the accuracy of tree species classification, as demonstrated in the results. Finally, the study presents a distribution map of tree species in Dongtai Forest Farm, demonstrating the practical application of the Real-ESRGAN algorithm and serving as a reference for further research.

Performance of CNN and Transformer in Classifying Tree Species Using the Original Dataset
For tree species classification in low-altitude remote sensing images obtained from UAVs, this paper further evaluated the classification accuracy of four models, namely ResNet-50 and ConvNeXt-T as representatives of CNN models and ViT-B and Swin-T as representatives of Transformer models, using the original canopy image dataset. The Transformer has demonstrated an exceptional ability to capture global information, benefiting a wide range of vision tasks such as image classification, object detection, and particularly semantic segmentation [33,34]. Both CNN and Transformer models use object-based classification to achieve end-to-end tree species classification and avoid the non-transferability of manual feature extraction. As depicted in Figure 18, all four models reach classification validation accuracies exceeding 95%. Notably, the Swin Transformer reaches the highest classification accuracy, with an OA of 97.43% and a Kappa coefficient of 0.9671. Conversely, the CNN models, particularly the traditional CNN model, are more susceptible to the challenges posed by low-altitude aerial images, including detail loss, brightness reduction, blurred canopy edges, and weak image texture. These issues, coupled with small inter-class differences and significant intra-class differences, adversely affect the feature point matching accuracy of the CNN models, whereas the Transformer models are comparatively less affected.

Performance of CNN and Transformer in Tree Species Classification Using Super-Resolution Reconstructed Dataset
To address these issues, this paper introduced the Real-ESRGAN super-resolution reconstruction technique to recover low-quality tree canopy images captured by UAVs. The recovery process improved the validation accuracy of all four models. For example, the OA of the ConvNeXt-T model increased by 1.53 percentage points and its Kappa coefficient increased by 0.0205. Figure 19 compares the validation accuracy of the models trained on the original dataset with that of the models trained on the Real-ESRGAN processed dataset. The Real-ESRGAN super-resolution reconstruction technique has some limitations and shortcomings, but it can be improved in future research by introducing more blur kernels and enhancing the image super-resolution algorithm model. These findings suggest that models trained on datasets restored and reconstructed by super-resolution may stabilize faster while reaching higher accuracy on both training and validation sets than models trained on the original dataset. This may be because the restored and reconstructed datasets provide higher-quality images, which help the models to quickly acquire features related to tree species classification. However, further empirical evidence and validation are needed to confirm this inference.
Figure 19. Comparison of the validation accuracy obtained by the model using the original dataset and the Real-ESRGAN processed dataset for training.

Conclusions
The assessment of tree species classification in Dongtai Forest utilizing RGB images captured by UAV has yielded promising outcomes through deep-learning-based approaches. Four models, including CNN models (ResNet-50 and ConvNeXt-T) and Transformer models (ViT-B and Swin-T), were trained and validated using UAV RGB tree crown images, achieving classification accuracies surpassing 95%. CNN models have been extensively used in forest resource surveys for tree species classification tasks, demonstrating exceptional classification accuracy [22,35]. Transformer models have also started finding applications in plant classification using UAV imagery [36] and exhibit significant potential for future advancements in forest surveys. However, the limited spatial resolution of aerial images introduces degradation challenges, such as detail loss, decreased brightness, blurred tree crown edges, and weak image texture. These issues negatively impact the feature point matching accuracy of CNN models and the capture of crucial information in the images. In contrast, Transformer models, with their inherent attention mechanisms, effectively leverage contextual information and global correlations in the images, resulting in comparatively less susceptibility to such issues. To address these challenges, Real-ESRGAN technology was adopted to perform super-resolution reconstruction and restoration on the original tree crown image dataset, leading to improved classification accuracy across all four models. This study confirms and underscores the observed enhancement in classification accuracy when using neural network models trained on images reconstructed through super-resolution. Super-resolution reconstruction techniques facilitate the restoration of low-quality images by recovering details, enhancing brightness, improving tree crown edge clarity, and augmenting image texture.
These reconstructed images provide higher-quality information, enabling CNN and Transformer models to more accurately learn and extract features relevant to tree species classification. Consequently, when trained on these repaired and reconstructed image datasets, the four models exhibit improved stability and accuracy on validation sets.