A CNNA-Based Lightweight Multi-Scale Tomato Pest and Disease Classification Method

Abstract: Tomato is generally cultivated by transplanting seedlings in ridges and furrows. During growth, tomatoes are affected by various types of pests and diseases, making it challenging to identify them simultaneously. To address this issue, conventional convolutional neural networks have been investigated, but they have a large number of parameters and are time-consuming. In this paper, we propose a lightweight multi-scale tomato pest and disease classification network, called CNNA. Firstly, we constructed a dataset of tomato diseases and pests consisting of 27,193 images in 18 categories. Then, we compressed and optimized the ConvNeXt-Tiny network structure to maintain accuracy while significantly reducing the number of parameters. In addition, we proposed a multi-scale feature fusion module to improve the model's ability to extract features of different spot sizes and pests, and a global channel attention mechanism to enhance the network's sensitivity to spot and pest features. Finally, the model was trained and deployed to the Jetson TX2 NX for inference on tomato pest and disease video stream data. The experimental results showed that the proposed CNNA model outperformed pre-trained lightweight models such as MobileNetV3, MobileVit, and ShuffleNetV2 in terms of accuracy and all complexity metrics, with a recognition accuracy of 98.96%. Meanwhile, the error rate, inference time for a single image, number of network parameters, FLOPs, and model size were only 1%, 47.35 ms, 0.37 M, 237.61 M, and 1.47 MB, respectively.


Introduction
Tomatoes are highly sought after worldwide as both a fruit and a vegetable. However, the growth of tomato plants is often plagued by a variety of pests and diseases, which seriously hinder the development of the tomato growing industry and affect farmers. Effective and accurate identification of tomato pests and diseases is an important tool for their management and an important prerequisite for improving tomato production [1,2]. Traditional tomato disease identification is typically carried out by experts or technicians, whose diagnoses are highly accurate, but such methods are not widely applicable because they are time-consuming, costly, and inefficient [3].
Conventional machine learning-based methods have been widely employed in research exploring the intelligent identification of plant diseases [4][5][6]. Mokhtar et al. [7] utilized a support vector machine (SVM) algorithm with different kernel functions for the classification and identification of tomato mosaic disease with an average accuracy of 92%. Johannes et al. [8] identified three diseases of wheat leaves using the Naïve Bayes technique with an accuracy of 85%. Rumpf et al. [9] used SVM to detect three diseases in sugar beet root images with an accuracy of 86%. Although these machine learning algorithms can generally satisfy the requirements of disease identification, feature extraction is a complex process that can reduce recognition accuracy and efficiency.
With the development of deep learning, convolutional neural network (CNN) models have been developed to autonomously extract image features and perform classification with higher accuracy and efficiency [10]. Yang et al. [11] proposed a fine-grained classification model, LFC-Net, with a self-supervised mechanism to classify images of eight tomato diseases and healthy leaf images with 99.7% accuracy. Ji et al. [12] proposed a joint model architecture based on an integrated approach to classify four grape leaf diseases in the open dataset of PlantVillage with 98.57% accuracy. Anandhakrishnan et al. [13] proposed a deep convolutional neural network model to classify tomato leaf disease using the open dataset in PlantVillage with 98.40% accuracy. Despite their high accuracy in plant disease identification, these convolutional neural networks had certain limitations, such as a large number of network parameters and slow model inference, which required attention.
Therefore, researchers have started to apply lightweight modeling algorithms to disease identification [14][15][16][17]. Elhassouny et al. [18] used the lightweight network model MobileNet to identify 10 common tomato leaf diseases and compared the results of several different optimizers to achieve a final accuracy of 90.3%. Agarwal et al. [19] proposed a convolutional neural network model consisting of only three convolutional layers, three max pooling layers, and two fully connected layers to classify tomato diseases using the PlantVillage open dataset with an accuracy of 91.2%. Wang et al. [20] proposed a shallow network with two to ten convolutional layers to classify four apple diseases using the PlantVillage open dataset and obtained an accuracy of 79.3%. Hamid et al. [21] used MobileNetV2 for the classification of 14 different categories of seeds and achieved an accuracy of 95% on the test sets. While these lightweight models featured short training times and fast inference speeds, their recognition accuracy was relatively low.
To further improve model recognition accuracy, attention mechanisms focusing on improving the model's attention to key information have been proposed, with common attention mechanisms such as squeeze-and-excitation (SE) [22], convolutional block attention module (CBAM) [23], etc. Yin et al. [24] proposed a deep learning network, DISE-Net, introducing a coordinate attention mechanism to construct and classify a field corn leaflet spot dataset with an accuracy of 97.12%, which was 2.21% higher than the model without introducing the attention mechanism. Gao et al. [25] proposed a dual-branch, efficient, channel attention mechanism-based DECA_ResNet model for cucumber disease recognition. The model was trained on the Global AI Challenge 2018 dataset, the PlantVillage dataset, and a self-collected cucumber disease dataset. The model accuracy reached 98.54%, which was 7.66% higher than the recognition accuracy without introducing the attention mechanism. The introduction of attention mechanisms has been able to improve the accuracy of models in identifying diseases. In addition, in current research work, most models are limited to plant diseases, and there are very few studies on the simultaneous identification of plant diseases and insect pests. In fact, insect pests also hinder the growth of plants, so a network model for the simultaneous identification of diseases and insect pests is needed. The introduction of the above attention mechanisms can improve the sensitivity of the model in the disease region, whereas the model is less sensitive in the insect pest region. Therefore, improving the attention mechanism to simultaneously increase the sensitivity of the model to both diseases and pests is urgently needed.
In summary, current research efforts to identify tomato pests and diseases face the following challenges: (1) In terms of recognition content, it is still difficult to collect pest datasets in the field and obtain high model performance for the simultaneous recognition of pests and diseases. Researchers are currently focused on disease identification without achieving unified identification of diseases, insect pests, and healthy leaves. (2) In terms of model performance, most studies improve a single aspect such as model accuracy, size, or robustness, without achieving a comprehensive balance among them.
To address these challenges, our study makes the following contributions: (1) We built a dataset consisting of 27,193 images across 18 categories, including tomato diseases, pests, and healthy leaves. (2) We proposed an efficient and lightweight classification model named ConvNeXt-Nano-Adjust (CNNA), which accurately and rapidly classified images of tomato diseases and pests. (3) We embedded the CNNA model into the Jetson TX2 NX with an inference time of only 47.35 ms per image, which is suitable for practical production applications. This approach can provide technical support for the development of a management and control system for tomato diseases and pests.

Materials and Methods
The workflow of this study is shown in Figure 1. Firstly, we constructed a dataset of images of tomato diseases and insect pests, as shown in Figure 1a. The tomato disease dataset was obtained from the PlantVillage open dataset, whereas the tomato pest leaf images were collected from tomato plants in an experimental field. Image preprocessing was performed on the original images. Images of leaves with diseases, pests, and in healthy condition were combined and divided into the training and validation sets. Secondly, the ConvNeXt-Tiny network was developed and optimized, and the lightweight multi-scale feature fusion module (MFFM) and lightweight global channel attention (GCA) mechanism were proposed to further optimize the model, as shown in Figure 1b. Specifically, we conducted horizontal and vertical performance comparison experiments for optimizing the model; the model weight file is shown in Figure 1d. Finally, the model with the best validation accuracy was deployed to the Jetson TX2 NX for inference of tomato leaf diseases and pests in video stream data, as shown in Figure 1e.

Acquisition of Images
In this study, we collected a total of 22,930 images from the PlantVillage dataset depicting both healthy and diseased plants, categorized into 10 classes, as shown in Figure 2a-j. In addition, to validate our proposed model, we collected images of tomato leaves infested with pests from the experimental tomato field of Jilin Agricultural University in Changchun, Jilin Province, China. The images were collected between 9:00 a.m. and 5:00 p.m. using a smartphone (Xiaomi 10 in macro mode) with a resolution of 3120 × 3120 pixels. Finally, we obtained 431 images of tomato pests with complex backgrounds, belonging to 8 categories, as shown in Figure 2k-r. These images were screened by agricultural experts, and images in which the pest subject was unclear were eliminated. In addition, tomato pest and disease video stream data were collected according to the experimental requirements.

Image Preprocessing and Expansion
The tomato disease dataset was divided into training and validation sets in a ratio of approximately 80%: 20%. The resolution of all images was adjusted to 224 × 224 before data division to improve the efficiency of image processing. The distribution of the number of tomato disease images from PlantVillage for the training and validation sets is shown in Table 1.
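The split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the file names are placeholders, and the 224 × 224 resize is assumed to happen separately during image loading.

```python
# Sketch of the ~80%:20% train/validation split described above.
# File names are hypothetical placeholders.
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Shuffle reproducibly, then split into training and validation lists."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# Example with placeholder file names:
images = [f"tomato_leaf_{i:05d}.jpg" for i in range(1000)]
train_set, val_set = split_dataset(images)
```

Shuffling before the cut ensures each class is spread across both sets rather than split in file order.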

The acquisition of tomato leaf pest images was more difficult, resulting in a smaller number of pest images compared to the larger number of tomato disease images available in the open dataset. To avoid overfitting the model and to improve its robustness, we augmented the original pest sample dataset, increasing its size approximately five-fold through two random 40-degree rotations and horizontal flips [26], which expanded the dataset to 4263 images. After enhancement, we constructed a tomato disease and pest dataset consisting of 27,193 images in 18 categories. The distribution of tomato pest images before and after the expansion is shown in Table 2.

Compression and Optimization of the ConvNeXt-Tiny Network
ConvNeXt, a convolutional neural network model proposed in 2022 [27], is designed on the basis of ResNet50 [28] and the Swin transformer structure [29]. ConvNeXt-Tiny is the smallest version of ConvNeXt, but it still suffers from model overfitting due to the stacking of modules and the excessive number of channels. To avoid saturating the model with disease and pest features due to the deep network layers of ConvNeXt-Tiny, we designed four variants of the ConvNeXt-Nano model based on the ConvNeXt-Tiny network by compressing the ConvNeXt model in two dimensions: the number of channels and the number of module stacks. ConvNeXt-Nano-1, ConvNeXt-Nano-2, ConvNeXt-Nano-3, and ConvNeXt-Nano were designed and validated using images collected in the field to fully verify the effects of the channel number and the module stacking number on the global features.
Additionally, we selected the optimal deep neural network based on model complexity and model performance, i.e., ConvNeXt-Nano.
All variants of ConvNeXt-Nano were created by reducing the number of stages in ConvNeXt-Tiny from four to three (removing one stage of operations) and then adjusting the number of module stacks and the number of channels per stage. The number of module stacks in ConvNeXt-Tiny was [3,3,9,3], with channel numbers of [96,192,384,768]. For ConvNeXt-Nano-1, the channel numbers were adjusted to [48,96,192] and the number of module stacks was [3,9,3]. To further improve the accuracy, we used the control variable method to separately adjust the number of module stacks and the number of channels per layer to obtain ConvNeXt-Nano-2 and ConvNeXt-Nano-3. The number of module stacks of ConvNeXt-Nano-2 was adjusted to [2,6,2], and the channel numbers were [48,96,192], the same as in ConvNeXt-Nano-1. For ConvNeXt-Nano-3, the number of channels per stage was adjusted to [24,48,96], and the number of module stacks was the same as in ConvNeXt-Nano-1. For ConvNeXt-Nano, the number of module stacks was adjusted to [3,7,2], and the channel numbers to [24,48,96]. These improvements greatly reduced the numbers of network parameters and floating point operations (FLOPs). The above compression yielded the efficient lightweight network ConvNeXt-Nano; an internal parameter comparison with the ConvNeXt-Tiny network is shown in Table 3.
Table 3. Internal parameters of the ConvNeXt-Nano model before and after compression.
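The variant configurations above can be expressed compactly as stage depth/width lists. The following PyTorch sketch (not the authors' code) pairs each variant with its configuration and shows a minimal ConvNeXt-style block; the stem and downsampling layers between stages are omitted for brevity.

```python
# Sketch: parameterizing the ConvNeXt-Nano variants by stage depths and widths.
# Block structure follows the public ConvNeXt design: 7x7 depthwise conv ->
# LayerNorm -> 1x1 expand (4x) -> GELU -> 1x1 project, with a residual add.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # over channels (last dim, NHWC)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv expressed as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                       # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)               # NCHW -> NHWC
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)               # NHWC -> NCHW
        return shortcut + x

# Stage configurations (depths, channels) from the text and Table 3:
VARIANTS = {
    "ConvNeXt-Tiny":   ([3, 3, 9, 3], [96, 192, 384, 768]),
    "ConvNeXt-Nano-1": ([3, 9, 3],    [48, 96, 192]),
    "ConvNeXt-Nano-2": ([2, 6, 2],    [48, 96, 192]),
    "ConvNeXt-Nano-3": ([3, 9, 3],    [24, 48, 96]),
    "ConvNeXt-Nano":   ([3, 7, 2],    [24, 48, 96]),
}

def build_stages(depths, dims):
    """Stack ConvNeXt blocks per stage (stem/downsampling omitted)."""
    return nn.ModuleList(
        nn.Sequential(*[ConvNeXtBlock(d) for _ in range(n)])
        for n, d in zip(depths, dims)
    )
```

Counting `sum(p.numel() for p in build_stages(...).parameters())` for each entry shows how shrinking the depth and width lists drives the parameter reduction reported in Table 4.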

The overall recognition of disease spots and pests by the ConvNeXt-Nano network was difficult due to the small number of parameters of the lightweight ConvNeXt-Nano network and the large variety of target species with different disease and pest characteristics in this study. To resolve this, we proposed a lightweight multi-scale feature fusion module (MFFM) and a lightweight global channel attention (GCA) mechanism. The CNNA network model, which optimizes the ConvNeXt-Nano network with MFFM and GCA, is shown in Figure 3.

Multi-Scale Feature Fusion Module
Due to the different pest sizes and the uneven distribution of leaf pests and diseases in the images, the feature distribution was uneven, making it hard to achieve high overall recognition. To improve the sensitivity of the CNNA model to features of different sizes, we proposed MFFM, as shown in Figure 4a. MFFM contains three branches with receptive fields of 3 × 3, 5 × 5, and 7 × 7 in depthwise convolution, as shown in Figure 4b. Depthwise separable convolution is a convolution operation with a small number of parameters but a large memory access cost (MAC) [30], and placing multiple depthwise convolutions in series increases inference time. Thus, using one large convolution kernel in depthwise convolution instead of multiple small convolution kernels can reduce the inference time; the model variant built this way was named CNNA0 (compared with CNNA, only the structure of the multi-scale feature fusion module was replaced). Meanwhile, using multiple small convolution kernels in series instead of one large convolution kernel [31] can reduce the number of model parameters and can also reduce the image inference time. The specific comparison experiments for the two schemes are described in Section 3.6. This study finally adopted the scheme of multiple small convolution kernels in depthwise convolutions instead of a large convolution kernel. As shown in the dashed part of Figure 4, DWLN represents one 3 × 3 depthwise convolution followed by a feature normalization operation; two serial 3 × 3 depthwise convolutions replace one 5 × 5 depthwise convolution, and three serial 3 × 3 depthwise convolutions replace one 7 × 7 depthwise convolution.
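A minimal PyTorch sketch of the MFFM branch structure described above. The element-wise summation used to fuse the branches and the choice of BatchNorm as the feature normalization are assumptions; the exact layout is defined in the paper's Figure 4.

```python
# Sketch of MFFM: three depthwise branches whose stacked 3x3 depthwise
# convolutions emulate 3x3, 5x5, and 7x7 receptive fields.
import torch
import torch.nn as nn

class DWLN(nn.Module):
    """One 3x3 depthwise convolution followed by feature normalization."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)  # normalization variant is an assumption

    def forward(self, x):
        return self.norm(self.dwconv(x))

class MFFM(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # 1, 2, and 3 serial 3x3 depthwise convs ~ 3x3, 5x5, 7x7 receptive fields
        self.branch3 = nn.Sequential(DWLN(dim))
        self.branch5 = nn.Sequential(DWLN(dim), DWLN(dim))
        self.branch7 = nn.Sequential(DWLN(dim), DWLN(dim), DWLN(dim))

    def forward(self, x):
        # Fusion by element-wise summation (assumed); shape is preserved.
        return self.branch3(x) + self.branch5(x) + self.branch7(x)
```

Two stacked 3 × 3 depthwise kernels cost 2 × 9 = 18 weights per channel versus 25 for one 5 × 5 kernel, which is the parameter saving the text refers to.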

Global Channel Attention for Optimizing the Model
The pest images with complex backgrounds in this experimental dataset were collected from the real field environment using a smartphone. To improve the global dependency of the CNNA model and increase its accuracy, this study proposed a lightweight global channel attention (GCA) mechanism, which encodes the channel relationships and long-range dependencies through precise location information, fully integrating information from the horizontal and vertical coordinates into the channels. Such a network forms features that are sensitive to direction and location information, which is beneficial for capturing the pest subject information while suppressing useless information such as the background. The GCA structure is shown in Figure 5.
Figure 5. Global channel attention mechanism.

GCA encodes, through adaptive global pooling, the vertical and horizontal coordinates of each channel separately: the input feature vector X is pooled using a kernel of size (H, 1) in the horizontal direction and a kernel of size (1, W) in the vertical direction. Equation (1) gives the vertical feature output value of the feature vector X at height h in the c-th channel:

Z_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i)   (1)

where Z_c^h(h) denotes the output of the c-th channel at the specific height, x_c denotes the input feature vector of the c-th channel, and W denotes the feature map width.
Equation (2) gives the horizontal feature output value of the feature vector X at width w in the c-th channel:

Z_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w)   (2)

where Z_c^w(w) denotes the output of the c-th channel at the specific width, and H denotes the feature map height. Equations (1) and (2) obtain the encoded information at positions C × H × 1 and C × 1 × W, respectively, which then forms a vector of C × 1 × 1 in each of the two coordinate directions after a global average pooling operation without dimensionality reduction. Then, following the weight-sharing idea of convolutional neural networks, the coverage of cross-channel information interaction (the kernel size of the one-dimensional convolution) is made proportional to the channel dimension C, and the size of the adaptive convolution kernel k is obtained, which computes the weight of each channel while also reducing the number of parameters. Equation (3) gives the formula for the convolution kernel size k:

k = |log2(C)/γ + b/γ|_odd   (3)

where C denotes the number of channels; |·|_odd denotes that the convolution kernel size k takes the nearest odd value; and γ and b are used to change the ratio between the number of channels C and the kernel size k. The two-dimensional feature maps are stitched one-dimensionally in the width direction, then reconstructed to the initial feature dimension and subjected to the feature normalization operation. To extract the information of the global feature fusion after convolution, this study used the lighter hard sigmoid activation function, as shown in Equation (4):

hard sigmoid(x) = ReLU6(x + 3)/6   (4)

where ReLU6 is the activation function and x is the dimension information of the input. The hard sigmoid activation function significantly reduced the computation and computation time compared with the sigmoid activation function because of the absence of exponential operations. Finally, the global dimensional information was multiplied with the input dimensional information matrix for feature fusion.
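The GCA computation in Equations (1)-(4) can be sketched as follows. This is an illustrative implementation, not the authors' code: the way the two directional descriptors are merged before the one-dimensional convolution (here, an average) is an assumption, since the exact layout is defined in Figure 5.

```python
# Sketch of GCA: directional pooling (Eqs. (1)-(2)), an adaptive 1D
# cross-channel convolution (Eq. (3)), and a hard sigmoid gate (Eq. (4)).
import math
import torch
import torch.nn as nn

def adaptive_kernel(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Equation (3): k = |log2(C)/gamma + b/gamma|, taken to the nearest odd value."""
    k = int(abs(math.log2(channels) / gamma + b / gamma))
    return k if k % 2 == 1 else k + 1

class GCA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        k = adaptive_kernel(channels)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Hardsigmoid()   # Equation (4): ReLU6(x + 3) / 6

    def forward(self, x):              # x: (N, C, H, W)
        zh = x.mean(dim=3)             # Eq. (1): pool over width  -> (N, C, H)
        zw = x.mean(dim=2)             # Eq. (2): pool over height -> (N, C, W)
        # Collapse each directional map to a C-dim descriptor and merge
        # (averaging the two descriptors is an assumption).
        z = 0.5 * (zh.mean(dim=2) + zw.mean(dim=2))       # (N, C)
        w = self.conv(z.unsqueeze(1)).squeeze(1)          # cross-channel 1D conv
        w = self.gate(w).view(x.size(0), -1, 1, 1)        # per-channel weights
        return x * w                   # fuse the weights with the input features
```

Because the attention is a single shared one-dimensional convolution, its parameter count is just k weights, which matches the "lightweight" claim.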

Test Setup
To ensure fairness, the following experiments were conducted in Python 3.8 using the PyTorch framework, with the server using PyTorch-GPU 1.9 on Windows 10 and the Jetson TX2 NX using PyTorch 1.8 on Ubuntu 18.04 [32]. The server was powered by an Intel Core i7-7820X processor with 32 GB of RAM and an NVIDIA TITAN Xp graphics card with 12 GB of video memory. The Jetson TX2 NX had a CPU cluster consisting of a dual-core Denver2 processor and a quad-core ARM Cortex-A57, 4 GB of LPDDR4 memory, and a 256-core Pascal GPU with the power mode set to MAXN.
Each input image was normalized channel-wise before being fed into the network, as shown in Equation (5):

x' = (x − μ)/σ   (5)

where x' denotes the output after normalization, x denotes the input image, and μ and σ denote the per-channel mean and standard deviation derived from large-scale training on ImageNet; the normalization was computed separately for each of the three channels, with the mean values taken as (0.485, 0.456, 0.406). The optimizer was AdamW (adaptive momentum with weight decay) with a cross-entropy loss function, the batch size was set to 64, the learning rate was initialized to 0.001, the number of epochs was 50, and the learning rate declined by a factor of 0.1 every 10 epochs.
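The training configuration above can be sketched as follows. The model is a placeholder, and the standard-deviation values are the common ImageNet ones, which are an assumption since the text only lists the channel means.

```python
# Sketch of the training setup: Equation (5) normalization, AdamW with
# cross-entropy loss, lr = 0.001 decayed by 0.1 every 10 epochs.
import torch
import torch.nn as nn

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)  # from the text
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)   # assumed std

def normalize(image: torch.Tensor) -> torch.Tensor:
    """Equation (5): x' = (x - mean) / std, per channel, for a CHW image in [0, 1]."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD

model = nn.Linear(224 * 224 * 3, 18)     # placeholder standing in for CNNA
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
# Multiplies the learning rate by 0.1 every 10 epochs over the 50-epoch run.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```

In a training loop, `scheduler.step()` is called once per epoch after `optimizer.step()`, reproducing the decay schedule described above.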

Model Evaluation
To comprehensively evaluate the performance of each model, we utilized precision, recall, F1 score, and accuracy as evaluation indexes, as shown in Equations (6)-(9):

Precision = TP/(TP + FP)   (6)
Recall = TP/(TP + FN)   (7)
F1 = 2 × Precision × Recall/(Precision + Recall)   (8)
Accuracy = (TP + TN)/(TP + TN + FP + FN)   (9)

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively. Precision estimates how many of the predicted positive samples are truly positive (Equation (6)). Recall assesses how many of all positive samples can be correctly predicted as positive (Equation (7)). F1 score is the harmonic mean of precision and recall (Equation (8)). Accuracy is the most intuitive measure of model quality (Equation (9)). The model size, number of parameters, and floating point operations (FLOPs) are usually used to measure the complexity of the model. The FLOPs were calculated as shown in Equation (10):

FLOPs = 2 × H × W × (C_i × K^2 + 1) × C_O + (2 × I − 1) × O   (10)

where C_i is the number of input channels of the i-th convolutional layer; K is the convolutional kernel size; H and W are the height and width of the output feature map of the convolutional layer, respectively; C_O is the number of output channels of the convolutional layer; and I and O are the numbers of inputs and outputs of the fully connected layer, respectively.
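Equations (6)-(10) can be written directly as small helper functions. The confusion-matrix metrics are standard; the layer dimensions passed to the FLOPs helpers in any usage are hypothetical examples.

```python
# Sketch of the evaluation metrics (Eqs. (6)-(9)) and the FLOPs terms (Eq. (10)).
def precision(tp, fp):
    return tp / (tp + fp)                         # Eq. (6)

def recall(tp, fn):
    return tp / (tp + fn)                         # Eq. (7)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)                    # Eq. (8): harmonic mean

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)        # Eq. (9)

def conv_flops(h, w, c_in, k, c_out):
    """Eq. (10), convolutional term: 2 * H * W * (C_i * K^2 + 1) * C_O."""
    return 2 * h * w * (c_in * k ** 2 + 1) * c_out

def fc_flops(i, o):
    """Eq. (10), fully connected term: (2 * I - 1) * O."""
    return (2 * i - 1) * o
```

Summing `conv_flops` over all convolutional layers and `fc_flops` over all fully connected layers gives the model's total FLOPs.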

Complexity and Performance Comparison of CNNA Model Variants
Too many layers of network stacking can cause model overfitting and lead to accuracy degradation. This experiment was conducted to compare the number of parameters and model performance before and after model pruning. The experiments and results are shown in Table 4.
As can be seen in Table 4, ConvNeXt-Nano-1 reduced the number of parameters from 27.83 M to 1.79 M and the FLOPs from 4457.49 M to 964.02 M compared to ConvNeXt-Tiny, at a cost of only 0.37% accuracy, a significant reduction in both parameters and FLOPs. In addition, the results confirmed that the ConvNeXt-Tiny network suffered from model overfitting due to excessive module stacking. ConvNeXt-Nano-2 stacked five fewer modules than ConvNeXt-Nano-1, and its numbers of parameters and FLOPs were reduced by 31% and 32%, respectively, while the accuracy improved by 0.22%. Similarly, ConvNeXt-Nano-3 improved the accuracy compared to ConvNeXt-Nano-1 while halving the number of channels per layer; the numbers of parameters and FLOPs were reduced by 74% and 73%, respectively, and the accuracy improved by 0.54%. The experimental results showed that reducing the number of module stacks did not significantly affect the accuracy, whereas increasing the number of channels per layer increased the numbers of model parameters and FLOPs and had a greater impact on the accuracy. ConvNeXt-Nano kept the same number of channels as ConvNeXt-Nano-3 while adjusting the module stacks to [3,7,2], further balancing the numbers of parameters and FLOPs. CNNA further improved ConvNeXt-Nano by introducing MFFM and GCA into its structure; the numbers of model parameters and FLOPs increased by only 6% and 13%, respectively, while the accuracy improved substantially, by 2.65%. Thus, this variant had the best overall performance.

Performance Evaluation of Different Attention Mechanisms Models
To verify the performance of SE, CA [33], ECA [34], and the GCA proposed in this paper, each attention mechanism was introduced into the CNNA network in the same way for performance comparison experiments, and the results are shown in Table 5. As shown in Table 5, the accuracy after introducing an attention mechanism was significantly improved compared to that without one, which verified the necessity of introducing an attention mechanism. The model complexity after introducing SE and CA was almost identical, and the accuracy with CA was 0.09% higher than with SE, but the precision, recall, and F1 score of CA were lower than those of SE. The index results with ECA were better than those with SE and CA, and the accuracy with ECA was 0.54% and 0.63% higher than with CA and SE, respectively. This may be related to ECA's efficient channel dependence and weight-sharing learning. Meanwhile, introducing GCA into the model achieved the highest accuracy, precision, recall, and F1 score while requiring the fewest parameters. The accuracy was improved by 0.65% compared to ECA, which verified that GCA was more capable of acquiring global features.

Performance and Parameter Comparison with Other Lightweight Networks
To validate the parameters and performance of the CNNA network on the dataset constructed in this study, the ConvNeXt-Nano and CNNA networks were compared with several widely used and effective lightweight networks: MobileNetV2 [35], MobileNetV3 [36], GhostNet [37], ShuffleNetV2 [30], MixNet [38], and MobileVit [39]. As shown in Table 6, the CNNA network significantly outperformed the other networks in all evaluation metrics. The recognition accuracies of GhostNet and MobileVit were relatively low, at 94.14% and 94.30%, respectively. Compared with these two models, ConvNeXt-Nano achieved a higher recognition accuracy of 96.31%, proving the effectiveness of the network compression and optimization. The recognition accuracy of the CNNA network reached 98.96%, with only a small increase in the numbers of parameters and FLOPs compared with ConvNeXt-Nano but a 2.65% improvement in accuracy, proving the effectiveness of introducing MFFM and GCA. Compared with MixNet, the CNNA model had 85% fewer parameters, 5% fewer FLOPs, an 86% smaller model size, and 0.64% higher accuracy. In summary, among the seven lightweight models compared, the CNNA network had the smallest number of parameters apart from ConvNeXt-Nano, the smallest model size, and fewer FLOPs, yet it achieved the best convergence and the highest accuracy.
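The relative savings quoted in these comparisons (e.g., 85% fewer parameters than MixNet, or the roughly 75× reduction from ConvNeXt-Tiny's 27.83 M parameters to CNNA's 0.37 M) follow from a simple relative-reduction calculation, sketched here:

```python
def reduction(baseline: float, model: float) -> float:
    """Relative reduction of `model` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - model) / baseline

# Illustrative check with the parameter counts reported in the text:
# 27.83 M (ConvNeXt-Tiny) -> 0.37 M (CNNA) is about a 75x reduction.
print(f"{reduction(27.83, 0.37):.1f}% fewer parameters")
print(f"{27.83 / 0.37:.0f}x smaller")
```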
To further verify the performance and parameters of the CNNA network, the accuracy and loss curves of the six lightweight networks, ConvNeXt-Nano, and the CNNA network on the validation set of the tomato pest dataset were plotted, as shown in Figure 6. The accuracy of each model on the validation set stabilized after 50 iterations. The accuracy curves showed that the CNNA network had the highest recognition accuracy, although its accuracy exhibited a sudden dip followed by a recovery. This fluctuation may have been caused by the large differences between pest subjects and background environments across images, since the CNNA network learned less information about features shared within the same pest category. As a result, the model occasionally misrecognized pests of the same kind, but this did not affect its overall recognition accuracy for tomato pests and diseases. The loss curve showed that the loss value of the CNNA network kept decreasing over the first 10 rounds and then began to converge, with a brief increase in the middle; over the last 30 rounds it smoothed out and converged to almost 0. In summary, the improved CNNA network achieved higher accuracy and lower loss than the other networks.


Ablation Experiments
To verify the performance improvements achieved by model compression and optimization, MFFM, and GCA in the CNNA model, ablation experiments were conducted by introducing MFFM and GCA into the ConvNeXt-Nano network. The results are shown in Table 7. Adding the MFFM to the ConvNeXt-Nano network increased the number of parameters by only 0.01 M, but the accuracy and F1 score increased by 0.86% and 0.92%, respectively, demonstrating that this module enabled the network to extract information on pest features of different sizes. Embedding only GCA efficiently incorporated global coordinate information into the channels, improving both accuracy and F1 score. Finally, with both the multi-scale feature fusion module and global channel attention introduced, CNNA achieved an accuracy of 98.96% and an F1 score of 97.76% on the tomato pest and disease validation set, which were 2.65% and 4.23% better than those of ConvNeXt-Nano; compared with ConvNeXt-Tiny, the number of parameters was 75 times lower, while the accuracy and F1 score improved by 2.68% and 2.50%, respectively. Analysis of the confusion matrix in Figure 7 shows that most pests and diseases were correctly predicted by the CNNA model despite the unbalanced dataset: 15 images of cotton bollworm were incorrectly predicted, 7 images of late blight were predicted as early blight, and no other error category exceeded 10 images, demonstrating the CNNA model's strong recognition and generalization ability.
Figure 7. Confusion matrix (where label 0 corresponds to American leaf miner, 1 corresponds to aphid, 2 corresponds to bacterial spot, 3 corresponds to cotton bollworm, 4 corresponds to Diaphania indica, 5 corresponds to early blight, 6 corresponds to healthy leaves, 7 corresponds to late blight, 8 corresponds to leaf mold, 9 corresponds to septoria leaf spot, 10 corresponds to spider leaf mite, 11 corresponds to two-spotted spider mite, 12 corresponds to target spot, 13 corresponds to tea mite, 14 corresponds to tobacco budworm, 15 corresponds to tomato yellow leaf curl virus, 16 corresponds to tomato mosaic virus, and 17 corresponds to whitefly).


Network Attention Visualization
To better observe the learning ability of the proposed CNNA model for tomato pest features, the classification results were visualized using Grad-CAM [40]. Specifically, feature visualization was performed on some data samples of the validation set, as shown in Figure 8. The MixNet, ConvNeXt-Nano, and CNNA network models were compared, with the last layer of each network used as the feature visualization layer. From the figure, it can be seen that the heatmap of MixNet focused on small disease patches and local areas of insect infestation, but its localization was not highly accurate. The heatmap of ConvNeXt-Nano focused on patch areas but contained much irrelevant background information and could not identify whitefly. Compared with the MixNet and ConvNeXt-Nano models, the heatmap of CNNA accurately identified the key regions of late blight, with high accuracy and little attention paid to irrelevant and complex backgrounds. Meanwhile, the addition of MFFM with small-sized convolutional kernels resulted in more accurate identification of whitefly in small regions.
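The core Grad-CAM computation used for these visualizations can be sketched in a few lines: each channel of the last convolutional layer's activations is weighted by the spatial mean of its class-score gradient, the weighted maps are summed, and a ReLU keeps only positive evidence. The array shapes below are illustrative stand-ins for real network activations:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Core Grad-CAM step on (C, H, W) arrays taken from the last
    convolutional layer for the target class: per-channel importance
    weights, weighted sum over channels, ReLU, then normalization."""
    weights = gradients.mean(axis=(1, 2))             # (C,) channel importance
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0.0)                        # keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

rng = np.random.default_rng(1)
acts = rng.standard_normal((32, 7, 7))   # dummy activations
grads = rng.standard_normal((32, 7, 7))  # dummy class-score gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (7, 7)
```

In practice the low-resolution heatmap is upsampled to the input image size and overlaid on it, which is how the patch- and whitefly-focused regions described above are rendered.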

Model Deployment and Comparison
To further validate the performance of the CNNA model for identifying tomato pests and diseases, as well as the inference time of the CNNA0 network mentioned in Section 2.2.2, the high-accuracy MixNet model, the widely used MobileNetV3 model, the CNNA0 model with multi-scale modules using large convolutional kernels, and the CNNA model proposed in this study were deployed to the Jetson TX2 NX and the server, respectively. Image inference was performed on the constructed tomato pest and disease validation set, and the inference time for a single image was calculated. Then, the captured video stream data were imported into the Jetson TX2 NX, which performed classification on the video data and displayed the most likely pest categories on the monitor (Figure 9). Comparisons of single-image inference time, numbers of model parameters and FLOPs, and accuracy are shown in Table 8, where each single-image inference time is reported as the average of five experiments.
From Table 8, it can be seen that MixNet had the largest number of parameters, performed the worst on both the Pascal and TITAN Xp GPUs, and took the longest to infer a single image. The average inference times of the CNNA model were 47.35 ms on the Pascal GPU and 11.42 ms on the TITAN Xp. These results showed that the depthwise-separable convolutional model was faster than CNNA0 when the model was relatively small and contained few serial depthwise-separable convolutions. In addition, inference speed depended more heavily on the numbers of model parameters and FLOPs.
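The reported single-image latencies are averages over five runs; a minimal timing sketch of that protocol is shown below, with a stand-in callable in place of the deployed network (the warm-up count, run count, and dummy workload are assumptions):

```python
import time

def average_inference_ms(model, image, runs=5, warmup=2):
    """Average wall-clock latency (ms) of `model(image)` over `runs`
    timed calls, after a few warm-up calls to exclude one-off setup
    costs (e.g., JIT compilation or cache population)."""
    for _ in range(warmup):
        model(image)
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        model(image)
        total += time.perf_counter() - start
    return 1000.0 * total / runs

# Stand-in "model": any callable taking one input; a real deployment
# would invoke the network's forward pass on a preprocessed image here.
dummy = lambda img: sum(img)
latency = average_inference_ms(dummy, list(range(10_000)))
print(f"{latency:.3f} ms per image")
```

Warm-up runs matter on devices like the Jetson TX2 NX, where the first inference often includes kernel compilation and memory allocation that would otherwise inflate the average.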

Discussion and Conclusions
In this study, a lightweight classification model was developed for 18 categories of tomato leaf pests and diseases. The experimental data comprised open-source images of 10 classes of diseased and healthy tomato leaves from PlantVillage (Figure 2a-j) and images of eight insect pests on tomato leaves taken in an experimental field to validate the proposed deep neural network (Figure 2k-r). In addition, the tomato pest image data were augmented to enrich the sample features.
We compressed and optimized the structure of the ConvNeXt-Tiny network by pruning the original four modules to three and reducing the number of channels in each module, thus obtaining a lightweight convolutional neural network. We added MFFM to improve the model's recognition of different disease spot sizes and pests, and we embedded GCA to incorporate global feature information into the channels and enhance the network's sensitivity to spot and pest features. The CNNA network had a parameter count of 0.37 M, 237.61 M FLOPs, and a model size of 1.47 MB, which were 75, 18, and 72 times smaller than those of ConvNeXt-Tiny, respectively. Meanwhile, the error rate of the model was only 1% and the accuracy was 98.96%, further demonstrating its outstanding performance.
Furthermore, the CNNA model was deployed on an edge intelligence device, the Jetson TX2 NX, which performed inference and recognition on video stream data and displayed the most likely pest categories on the monitor, with an average inference time of 47.35 ms per image. This verified the model's advantage in inference time and provides technical support for the development of tomato pest and disease control systems.
In summary, the CNNA model can classify tomato pest and disease image data quickly and efficiently, thereby assisting agriculture-related personnel in improving production efficiency, reducing labor intensity, and decreasing the use of pesticides and other chemicals. The model is also applicable to non-tomato data and can be extended to tasks such as object detection and image segmentation. In future work, we will install the Jetson TX2 NX device in the experimental tomato field of Jilin Agricultural University and build a real-time tomato pest and disease classification platform. We will collect tomato leaf video data using a mobile USB camera, and the Jetson TX2 NX will perform real-time classification and identification on the video, displaying the most likely pest and disease categories for further processing by the lower computer. Furthermore, we will acquire more tomato pest images with complex backgrounds in the field as input for model training, in order to improve the model's ability to generalize to tomato pests and diseases in complex backgrounds.