An Approach for Rice Bacterial Leaf Streak Disease Segmentation and Disease Severity Estimation

Rice bacterial leaf streak (BLS) is a serious disease in rice leaves and can seriously affect the quality and quantity of rice growth. Automatic estimation of disease severity is a crucial requirement in agricultural production. To address this, a new method (termed BLSNet) was proposed for rice and BLS leaf lesion recognition and segmentation based on a UNet network in semantic segmentation. An attention mechanism and multi-scale extraction integration were used in BLSNet to improve the accuracy of lesion segmentation. We compared the performance of the proposed network with that of DeepLabv3+ and UNet as benchmark models used in semantic segmentation. It was found that the proposed BLSNet model demonstrated higher segmentation and class accuracy. A preliminary investigation of BLS disease severity estimation was carried out based on our BLS segmentation results, and it was found that the proposed BLSNet method has strong potential to be a reliable automatic estimator of BLS disease severity.


Introduction
Rice (Oryza sativa L.) is one of the three main crops in Asia and is widely planted. Due to the complex characteristics of rice growing in relatively hot and humid conditions, it is vulnerable to diseases. Rice bacterial leaf streak (BLS) is one of the most serious diseases affecting rice production, occurring early in the growth cycle, spreading quickly and causing severe damage [1]. The timely monitoring and prediction of the occurrence and development of BLS is of great significance to maintain rice production. The leaves are usually damaged by BLS and the traditional disease severity estimation of BLS depends on the lesion area as a proportion of the total leaf area. The estimation of leaf damage is largely dependent on the level of experience of agronomists and farmers, which is labor intensive and time consuming, and the estimations are often subjective and unreliable [2].
In the early studies of plant disease recognition, feature descriptors were designed to extract internal features from pictures obtained in the laboratory or the agricultural production environment using edge detection, color space transform, feature space transform theories, etc. [3][4][5]. These features were usually classified using a support vector machine (SVM) and linear discrimination, among others. For example, a prototype system for rice plant disease was proposed by Prajapati et al. [6], in which the background was removed and SVM was used to classify the types of diseases according to color, shape, and texture features of the digital pictures of infected rice. In another example, four types of local descriptors were compared in soybean leaf disease recognition and the effectiveness of near infrared vine imageries acquired by an unmanned aerial vehicle (UAV) [25]. Wang et al. used the VGG16 model and an encoder-decoder structure to segment diseased leaf spots in corn, and this overcame the influence of complex backgrounds and solved the labor-intensive problem of feature extraction template design [26]. Tusubira et al. evaluated Cassava brown streak (CBS) loss according to the segmented pixels number of necrotized and non-necrotized cells based on CBS lesion segmentation results [27]. In the area of weed identification, Abdalla et al. segmented images taken in the field and efficiently distinguished crops from weed species [28].
In general, the development of agricultural disease recognition and severity estimation has gone through two stages, i.e., machine learning methods based on manually crafted features and deep learning-based methods, both of which can be framed as classification problems. According to the complexity of the models, the former can be regarded as shallow classifiers and the latter as deep classifiers. After the data acquisition stage (e.g., by color cameras, mobile devices, and multispectral or hyperspectral imaging devices), the former methods need pre-processing, segmentation, feature extraction and classification modules [29]. A systematic summary of the classical classification methods used in the first stage of disease recognition and severity estimation research was also given in that literature review. The performance of this type of method largely depends on the quality of feature extraction. Deep learning-based methods have risen with hardware computing power; they do not rely on manual feature extraction, have excellent generalization ability [30], and have been applied in disease recognition studies for many types of crops [31][32][33]. Wang et al. used apple black rot data from the PlantVillage dataset; the data were further labeled into different disease levels by agronomy experts. The disease classification accuracy of the VGG16, VGG19, Inception-V3 and ResNet50 models was compared, and VGG16 combined with a transfer learning strategy was found to achieve 90.4% accuracy on the test samples [34]. Verma et al. used tomato data from the PlantVillage dataset and labeled three stages (early, middle and end) for a tomato late blight severity estimation study. Two types of model design methods, i.e., transfer learning-based and feature extraction-based, were compared, and it was demonstrated that AlexNet achieved the best accuracy with both methods [35].
Although great progress has been made in the study of disease classification, existing research on agricultural disease classification has mainly compared the performance of existing computer vision networks directly. Meanwhile, the number of disease levels used in these studies was small, which possibly causes the proposed methods to estimate disease severity only roughly and restricts the practical application of the models. For further research on the fine classification of disease severity, Lin et al. used deep learning-based semantic segmentation to quantify the degree of cucumber powdery mildew, which outperformed the k-means, random forest and GBDT methods on 20 test samples [36]. For the specific rice disease recognition task, Shrivastava used a pre-trained convolutional neural network as a feature extractor and a support vector machine as a classifier, then classified four types of rice diseases in images taken in actual fields; the encouraging results showed that the approach can be used for early warning of disease occurrence [37]. Chen et al. combined the squeeze-and-excitation block (SE block) with MobileNet to propose a rice disease classification model, which achieved 99.33% classification accuracy [38]. Das et al. first segmented the rice disease zone; a CNN was then used to extract features and a dimension reduction method was used to reduce redundant features; finally, various classifiers were used to classify rice disease types [39]. Sethy et al. proposed a model for detecting rice false smut based on the Faster R-CNN object detection method [40]. Most studies of rice disease concern whether rice disease occurs or the classification of multiple types of rice diseases; however, studies on disease lesion segmentation and fine assessment of rice BLS severity are still few, and the background of the analyzed images was usually simple.
There is an urgent demand to design a better severity estimation model and improve the robustness of the proposed model for efficient plant protection. With the improvement of the performance of models, some research pays attention to the lightweight model of disease recognition, Agriculture 2021, 11, 420 4 of 18 which can be deployed in a mobile device. Elhasouny et al. proposed an effective and lightweight mobile application to classify and diagnose ten types of common tomato plant diseases [41].
The studies mentioned above concerning semantic segmentation applications in agriculture have made a beneficial step forward in high accuracy uses of the method. However, to the best of our knowledge, the research on rice BLS lesion segmentation and disease severity estimation is still limited, which is of great significance to the crop protection of rice and the increase in rice yield.
Based on the above discussion, in this study, a new model named BLSNet for rice BLS lesion recognition and segmentation is proposed. The main contribution of this paper can be stated as follows: (1) this is the first step for the evaluation of the capacity of deep learning semantic segmentation for rice BLS lesion segmentation; (2) it fills the research gap of rice BLS lesion segmentation and fine disease severity estimation. The lesion area value can be drawn for BLS disease severity assessment based on the segmentation results. The model was trained and validated using real images captured in the production environment to ensure real application utility in the future. Preliminary studies suggest that the proposed method is effective in estimating rice BLS severity.

Data Collection
Data were collected in the fields around Shanji, Xuzhou, Jiangsu Province, China (117°34′22″ E, 34°13′58″ N, 30 m above sea level), on 18-19 September 2020 at 10:00-16:00 under different sunlight intensities and capture angles. A portable digital camera (D850, Nikon Inc., Tokyo, Japan) was used to capture RGB pictures of paddy rice. Most of the collected lesion images were at the growth stage and ranged from small water-spot-like short spots to vertical long stripes. The initial size of the images was 8256 × 5504 pixels (Figure 1a). Because the initial images were large, they were first cropped to the central area determined by the actual size of the 1 × 1 m white box shown in the images (Figure 1b). Then, after randomly cropping the central area, the final size of the images used to train and validate BLSNet was 256 × 256 pixels. The dataset consisted of 109 images, which were labeled into three classes (BLS lesion, rice leaves, and background) using LabelMe software [42]. Figure 2 shows the data labels.
To avoid model overfitting, data augmentation was applied through random vertical and horizontal flip, contrast stretch, brightness stretch, sharpness stretch, salt and pepper noise, Gaussian noise, random rotation (rotation angle = 0-360°), random resize (resize factor = 0.1-1.0), and random blur. These augmentations were performed using the open source OpenCV [43] and Python Imaging Library (PIL) [44]. A total of 1199 images were acquired using the above data augmentation techniques, ~80% of which were randomly selected as the training dataset, with the remaining ~20% used for model validation.
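A few of the augmentations listed above, together with the ~80%/~20% split, can be sketched as follows. This is a minimal illustration using NumPy only, not the OpenCV/PIL pipeline used in the study; the function names, the noise amounts, and the random seed are hypothetical choices. Geometric transforms such as flips are applied identically to the image and its label mask so that the pixel labels stay aligned.

```python
import numpy as np

def random_flip(img, mask, rng):
    """Flip an H x W x C image and its H x W label mask with the same random choices."""
    if rng.random() < 0.5:                      # vertical flip
        img, mask = img[::-1], mask[::-1]
    if rng.random() < 0.5:                      # horizontal flip
        img, mask = img[:, ::-1], mask[:, ::-1]
    return np.ascontiguousarray(img), np.ascontiguousarray(mask)

def salt_and_pepper(img, rng, amount=0.02):
    """Set a random fraction of pixels to black (pepper) or white (salt)."""
    out = img.copy()
    noise = rng.random(img.shape[:2])
    out[noise < amount / 2] = 0
    out[noise > 1 - amount / 2] = 255
    return out

def gaussian_noise(img, rng, sigma=10.0):
    """Add zero-mean Gaussian noise (sigma is an assumed value) and clip to 8 bits."""
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def split_indices(n, train_fraction=0.8, seed=0):
    """Randomly partition n sample indices into training and validation sets."""
    order = np.random.default_rng(seed).permutation(n)
    n_train = int(round(n * train_fraction))
    return order[:n_train], order[n_train:]
```

With 1199 augmented images, `split_indices(1199)` yields 959 training and 240 validation indices under these assumptions.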

Disease Severity Label
The disease severity label was determined from the BLS lesion area as a proportion of the total leaf area. Level 0 included leaves with no lesion; level 1, below 10% lesion; level 2, 11-25% lesion; level 3, 26-45% lesion; level 4, 46-65% lesion; and level 5, more than 65% lesion. Due to the data collection time constraints, BLS was mostly in the intermediate or more serious severity classes, so the sample sizes of levels 1 and 2 were relatively small.
For this study, as the training data were randomly partitioned from the initial data, the proportion of lesion area in the total leaf area of each sample was used to investigate disease severity. Table 1 shows the disease severity data distribution in the training and validation datasets.
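The mapping from a segmentation mask to a severity level can be sketched as below. This is an illustrative sketch rather than the procedure used in the study: the class indices are hypothetical, the total leaf area is assumed to be the sum of lesion and healthy-leaf pixels, and values falling between the stated ranges (e.g., between 10% and 11%) are assigned to the lower level.

```python
# Hypothetical class indices for the three labeled classes.
LESION, LEAF, BACKGROUND = 0, 1, 2

def lesion_percentage(mask):
    """Lesion area as a percentage of total leaf area (lesion + healthy leaf).

    mask is a 2-D sequence (rows of class indices), e.g. a predicted label map.
    """
    lesion = sum(1 for row in mask for px in row if px == LESION)
    leaf = sum(1 for row in mask for px in row if px == LEAF)
    total = lesion + leaf
    return 100.0 * lesion / total if total else 0.0

def severity_level(pct):
    """Map a lesion percentage to levels 0-5 using the thresholds in the text."""
    if pct <= 0:
        return 0
    # (upper bound in percent, level); the boundary handling is an assumption.
    for upper, level in ((10, 1), (25, 2), (45, 3), (65, 4)):
        if pct <= upper:
            return level
    return 5
```

For example, a mask with 1 lesion pixel and 7 healthy-leaf pixels gives 12.5% lesion area, which falls in level 2.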

Model Architecture
CNN blocks are usually used in semantic segmentation models because CNN can extract and exploit local contextual features. The attention mechanism and multi-scale features methods can discriminate parts and make full use of local and global information, improving the efficiency and accuracy of the model training process. Therefore, both the attention mechanism and multi-features were used in BLSNet for rice BLS lesion segmentation in our research.
Based on the encoder-decoder architecture used in UNet [45], BLSNet was designed with five stages or blocks: a down-sample stage, an up-sample stage, a skip connection with an attention block (SE block), an atrous spatial pyramid pooling (ASPP) block [46], and a multi-scale spatial channel attention block (MSC block). Figure 3 presents the overall network architecture.
Similar to the architecture of UNet, BLSNet was divided into three parts: left, middle, and right. Feature extraction and semantic feature extraction were achieved in the left part, which included four down-sample blocks. Skip connections with attention mechanisms were used in the middle part. According to the level of feature complexity, two types of attention blocks, i.e., the SE block and the MSC block, were used in BLSNet. Shallow, simple, and edge features were obtained in the 1st, 2nd, and 3rd down-sample blocks, and the simple channel attention block (SE block) was used to process them. High-level semantic features were extracted in the 4th down-sample block and combined with the ASPP and MSC blocks. The right part of BLSNet was the decoder, in which information flows of different levels and scales were combined, and the feature map size was gradually recovered to the original 256 × 256 pixels from bottom to top. Convolution blocks throughout the model were composed of convolution, normalization, and activation operations to improve training performance.

Figure 3. Overall architecture of BLSNet. Conv stride = 2 represents the down-sample block with a convolution of kernel size 3 and stride 2. Conv + MaxPool stands for the down-sample block composed of a convolution of kernel size 3, followed by a maximum pooling of kernel size 2. DeConv + ResBlock stands for the up-sample block, i.e., DeConv is a transposed convolution of kernel size 3, and ResBlock uses a residual connection. The detailed implementation process is described in the following sections.

Down-Sample Stage
The down-sample stage was composed of four down-sample blocks, each of which halved the size of its input, meaning the final features were one sixteenth of the original size. In the first down-sample block, a convolution of kernel size 3 and stride 2 was adopted to avoid the loss of edge information that can occur with the maximum pooling adopted in UNet. In the second, third, and fourth down-sample blocks, a convolution of kernel size 1 was used to integrate information, and a maximum pooling operation of stride 2 was used to down-sample and extract salient features. An increasing amount of contextual information was drawn as the number of down-sample operations increased.
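The spatial sizes produced by the four down-sample blocks can be checked with the usual convolution output-size formula. The sketch below is only an arithmetic illustration; the padding of 1 for the stride-2 convolution and the pooling kernel size of 2 are assumptions chosen so that each block exactly halves its input.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def downsample_sizes(size=256):
    """Feature-map side length after each of the four down-sample blocks."""
    sizes = [size]
    # Block 1: convolution with kernel 3 and stride 2 (padding 1 assumed).
    size = conv_out(size, kernel=3, stride=2, padding=1)
    sizes.append(size)
    # Blocks 2-4: 1 x 1 convolution (size unchanged), then max pooling with stride 2.
    for _ in range(3):
        size = conv_out(size, kernel=1, stride=1, padding=0)
        size = conv_out(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes
```

For a 256 × 256 input this gives side lengths of 256, 128, 64, 32, and 16, i.e., the final feature map is 1/16 of the input along each side.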

Skip Connection Stage with Attention Block
As shown in Figure 4 [47], in the skip connection stage, feature maps obtained from the 1st, 2nd, and 3rd down-sample stages were processed with squeeze-and-excitation attention blocks (SEBlock). The feature maps extracted in the 4th down-sample stage were 1/16 of the size of the initial input image and contained rich semantic information, so the MSCBlock was utilized to process them.
The SEBlock was composed of three branches: the upper branch generated the weight of every channel, i.e., the channel weights were extracted through a convolutional block and global pooling, followed by a convolution block comprising two 1 × 1 convolutions. The sigmoid activation function was then used to scale the channel weights to the range of 0 to 1. The output of the SEBlock was the sum of the initial features multiplied by the channel weights and the original features.
The MSCBlock consisted of the ASPP block and a spatial channel attention mechanism. Figure 5 shows the information flow of the MSCBlock, in which the higher semantic features are first filtered with ASPP to integrate multi-scale features and are then separately fed into two branches, i.e., the position attention module (PAM) and the channel attention module (CAM). Figure 6 shows the structure of PAM and Figure 7 presents the diagram of CAM. The position-enhanced and channel-enhanced features from the PAM and CAM branches were then added, and a convolutional block with a 1 × 1 convolution was used to reduce the number of channels and refine the semantic contextual content to acquire the final high-level semantic features.
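The channel re-weighting performed by the SEBlock can be illustrated numerically as below. This is a framework-agnostic NumPy sketch, not the PyTorch module used in BLSNet: the initial convolutional block is omitted, the two 1 × 1 convolutions are written as matrix products (equivalent on a globally pooled 1 × 1 map), and the ReLU between them and the channel reduction are assumptions borrowed from the common squeeze-and-excitation design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-excitation with a residual sum.

    x  : feature map of shape (C, H, W)
    w1 : weights of the first 1 x 1 convolution, shape (C_reduced, C)
    w2 : weights of the second 1 x 1 convolution, shape (C, C_reduced)
    """
    squeezed = x.mean(axis=(1, 2))             # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # first 1 x 1 conv + ReLU (assumed)
    weights = sigmoid(w2 @ hidden)             # channel weights in (0, 1)
    # Re-weighted features plus the original features.
    return x * weights[:, None, None] + x, weights
```

Because the re-weighted features are added back to the originals, a channel with weight near 0 is passed through unchanged rather than suppressed, which is one plausible reading of the summation described above.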
In the PAM branch, the features extracted by ASPP were first processed with a convolutional block, after which three 1 × 1 convolutional branches were used in the position attention module (PAM) to generate key feature map B, query feature map C, and value feature map D. Feature maps B and C were then reshaped and multiplied to generate the spatial feature relation matrix E. Feature map D was reshaped and multiplied with E to generate the enhanced position feature maps. The output of the PAM was the sum of the enhanced position feature maps and the original features.
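The reshape-and-multiply steps of the PAM can be sketched as follows. This is a NumPy illustration under stated assumptions, not the module used in BLSNet: the 1 × 1 convolutions are written as matrix products over flattened positions, and the softmax normalization of E is an assumption taken from the usual position attention design, since the text does not state how E is normalized.

```python
import numpy as np

def position_attention(x, w_key, w_query, w_value):
    """Position attention over an input of shape (C, H, W).

    w_key, w_query : 1 x 1 convolution weights, shape (C_reduced, C)
    w_value        : 1 x 1 convolution weights, shape (C, C)
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)          # N = H * W spatial positions
    key = w_key @ flat                  # B: (C_reduced, N)
    query = w_query @ flat              # C: (C_reduced, N)
    value = w_value @ flat              # D: (C, N)
    energy = key.T @ query              # (N, N); energy[i, j] = B_i . C_j
    energy = energy - energy.max(axis=0, keepdims=True)   # numerical stability
    attn = np.exp(energy)
    attn /= attn.sum(axis=0, keepdims=True)   # E: each column sums to 1
    enhanced = value @ attn             # column j = sum_i attn[i, j] * D_i
    # Enhanced position features plus the original features.
    return enhanced.reshape(c, h, w) + x, attn
```

The relation matrix has one entry per pair of positions, so its size grows with the square of H × W; applying it only to the 1/16-scale feature maps keeps this cost small.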
Similar to the design of the PAM branch, after the convolutional block, three 1 × 1 convolutional branches were used in the channel attention module (Figure 7) to generate value feature maps B, query feature maps C, and key feature maps D. The channel correlation feature maps E were obtained from C and D; specifically, C and D were first reshaped and then matrix multiplication was performed. The final channel-enhanced feature maps were acquired by matrix multiplication of E with feature maps B. The output of the CAM was the sum of the channel-enhanced feature maps and the original features.
The ASPP block (Figure 8) consisted of five branches, four of which were two-dimensional dilated convolutions with a kernel size of 3 and dilation rates of 1, 3, 6, and 9, respectively; the last branch utilized average pooling to obtain global semantic information.
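The multi-scale effect of the four dilated branches can be seen from their effective kernel sizes. The sketch below is a small arithmetic illustration; the padding rule (padding equal to the dilation rate for a kernel of size 3) is an assumption that keeps every branch output the same size as its input so that the branches can be combined.

```python
def effective_kernel(kernel, dilation):
    """Side length of the region covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)

def same_padding(kernel, dilation):
    """Padding that preserves the spatial size for stride 1 and an odd kernel."""
    return dilation * (kernel - 1) // 2

def dilated_out(size, kernel, dilation, padding, stride=1):
    """Output length of a dilated convolution along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
```

For kernel size 3 and dilation rates 1, 3, 6, and 9, the effective kernel sizes are 3, 7, 13, and 19, and a 16 × 16 feature map keeps its size in every branch under the assumed padding.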

Up-Sample Stage
As the size of the final feature map was 1/16 of the input image, an up-sample operation on the feature maps was required to recover the size of the input images. The up-sample stage of the proposed network had four up-sample blocks, with the 1st, 2nd, 3rd, and 4th up-sample stages corresponding to the respective down-sample stages. During the up-sample stage, from bottom to top, a deconvolution of stride 2 was first used to double the size of the highest-level feature maps generated from the MSCBlock. Then, in the 3rd, 2nd, and 1st up-sample blocks, features enhanced by the skip connection stage were concatenated with the up-sampled features, and a residual operator composed of convolutions of kernel size 3 and kernel size 1 was used to process the concatenated features to prevent degradation of network performance. After four up-sample operations, the size of the feature map was the same as that of the input image. A convolution of kernel size 1 was used to make the final number of channels equal to the number of categories to be distinguished. Each pixel was assigned the label given by the index of the maximum value among the probabilities in the three channels.

Experimental Process
Our model was implemented in the PyTorch deep learning framework. The model was trained on a Tesla V100 graphics processing unit (GPU) with 16 GB of memory under the Ubuntu 18.04 LTS environment. The batch size for training and validation was set to 32, according to the capacity of the GPU and the size of the sample images. A "Loss" value was computed according to the multi-class cross-entropy loss function given in Equation (1), where y_n indicates the class label, ŷ_n represents the predicted class label, and K represents the total number of categories. The parameters of the proposed model were updated using the backward gradient propagation algorithm.
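A standard form of the multi-class cross-entropy that is consistent with the symbols defined above is given below. It is offered as an assumption about Equation (1) rather than a transcription of it; here y_n is taken as the one-hot label for category n and ŷ_n as the predicted probability for that category, with the loss evaluated per pixel and averaged over the pixels in a batch.

```latex
\mathrm{Loss} = -\sum_{n=1}^{K} y_n \log \hat{y}_n \tag{1}
```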
The model was trained for a maximum of 300 epochs, and the performance on the validation dataset was recorded after every two training epochs. Adam optimizer was chosen as the network optimizer. The initial learning rate was set to 0.001, and the poly learning rate strategy was used to update the learning rate dynamically. For parameter initialization, the parameters of the convolutional layer were initialized using the Kaiming initialization method, while other parameters were set using the default values provided in the PyTorch framework.
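The poly learning rate strategy is commonly written as the initial rate multiplied by a polynomially decaying factor. The sketch below assumes that common form; the exponent of 0.9 and the per-epoch update are assumptions, since neither is specified in the text.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomially decayed learning rate: base_lr * (1 - step / max_steps) ** power."""
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power
```

With an initial rate of 0.001 and 300 epochs, the rate starts at 0.001, decreases monotonically, and reaches 0 at the final epoch under these assumptions.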
BLSNet was compared with UNet and DeepLabv3+ based on segmentation accuracy, pixel accuracy, BLS lesion segmentation accuracy, rice segmentation accuracy, background segmentation accuracy, and the mean intersection over union (mIoU). The mIoU was the ratio of the intersection area to the union area of the input label mask and the predicted result mask, averaged over the categories and calculated using Equation (2). The mIoU reflects the quality of the segmentation on a scale of 0 to 1, where 1 means the predicted result is exactly the same as the label, which is the ideal situation for semantic segmentation. In Equation (2), k represents the total number of categories, which in this case was set to 3, and p_ij represents the number of pixels of category i predicted as category j.
Pixel accuracy (PA) represents the proportion of correctly predicted pixels among all pixels, as shown in Equation (3), where k indicates the number of classes, p_ii represents the number of correctly predicted pixels of category i, and p_ij represents the number of pixels of category i predicted as category j.
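The standard definitions of mIoU and PA in terms of p_ij, written to match the symbols defined above, are given below. They are assumed forms of Equations (2) and (3), with the category index running from 1 to k.

```latex
\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k}
  \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}} \tag{2}

\mathrm{PA} = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}} \tag{3}
```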
Class accuracy (CA) is commonly used to measure the prediction accuracy of each category and was computed using Equation (4).
Figure 9 shows the training accuracy curves for the mIoU and the segmentation of the different categories. All the criteria for the BLSNet model were the highest at the beginning of training, and BLSNet required the fewest epochs to converge. BLSNet showed an overall upward trend during training and validation, indicating the stability of the method. By contrast, two obvious oscillations were observed in the training process of UNet, reflecting instability that is not acceptable in practical applications.
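The evaluation metrics described in this section (mIoU, PA, and CA) can all be derived from a confusion matrix whose entry p[i][j] counts the pixels of category i predicted as category j. The sketch below uses plain Python and the standard definitions of these metrics; taking CA for category i as p_ii divided by the number of pixels labeled i is an assumption about Equation (4), and the function names are hypothetical.

```python
def confusion_matrix(labels, preds, k=3):
    """p[i][j] = number of pixels of category i predicted as category j."""
    p = [[0] * k for _ in range(k)]
    for true, pred in zip(labels, preds):
        p[true][pred] += 1
    return p

def pixel_accuracy(p):
    """Correctly predicted pixels divided by all pixels."""
    total = sum(sum(row) for row in p)
    return sum(p[i][i] for i in range(len(p))) / total

def class_accuracy(p):
    """Per-category accuracy: p_ii divided by the pixels labeled as category i."""
    return [p[i][i] / sum(p[i]) if sum(p[i]) else 0.0 for i in range(len(p))]

def mean_iou(p):
    """Mean over categories of intersection / union."""
    k = len(p)
    ious = []
    for i in range(k):
        labeled = sum(p[i])                           # pixels labeled i
        predicted = sum(p[j][i] for j in range(k))    # pixels predicted as i
        union = labeled + predicted - p[i][i]
        ious.append(p[i][i] / union if union else 0.0)
    return sum(ious) / k
```

The flattened label and prediction masks of a validation image (or of the whole validation set) can be passed as `labels` and `preds`.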

Validation Results
In the training process, the performance of the trained models used in the validation dataset was recoded every two epochs. From Figure 10, it can be seen that compared with the UNet and DeepLabV3+ methods, the change of BLSNet was smallest, indicating the stability and transfer capacity of this method.

Validation Results
In the training process, the performance of the trained models used in the validation dataset was recoded every two epochs. From Figure 10, it can be seen that compared with the UNet and DeepLabV3+ methods, the change of BLSNet was smallest, indicating the stability and transfer capacity of this method.

Validation Results
In the training process, the performance of the trained models used in the validation dataset was recoded every two epochs. From Figure 10, it can be seen that compared with the UNet and DeepLabV3+ methods, the change of BLSNet was smallest, indicating the stability and transfer capacity of this method. The mIoU and CA values from the three models are listed in Table 2 for comparison. BLSNet performed better than UNet and DeepLabV3+ in terms of mIoU, and the segmentation accuracy of the BLS disease lesion, rice leaves and background outperformed the next best method by 4%, 4%, 2%, and 1% in terms of accuracy, respectively. The bold in each row in the table is the highest accuracy achieved by model.

Ablation Study on Attention Block
Ablation studies were performed to investigate the effectiveness of each attention block used in the skip connection stage. In this experiment, the first, second, third, and fourth attention blocks in the skip connections were removed one at a time, and the first, second, and third attention blocks were also removed together. The mIoU and the segmentation accuracy of the background, rice, and BLS classes were compared independently (Table 3). The ablated models mostly performed worse than the full BLSNet, demonstrating the effectiveness of the attention mechanism in the segmentation process. In terms of mIoU, the lowest accuracy was obtained by the model without the fourth attention block. A possible reason is that the fourth attention block mainly handles high-level semantic features containing rich contextual and spatial information, and it also helps to produce distinguishable features for further segmentation. This finding is useful for the architectural design of future models for agricultural disease segmentation: such models should make fuller use of high-level semantic and contextual features.
The second attention block did not significantly improve the performance of the BLSNet model. A possible reason is that the output features of the second down-sample block had 128 channels, which already separated the intermediate features reasonably well from the low-level edge features and the high-level semantic features.
Comparing the per-class accuracy of the ablation models listed in Table 3, it can be seen that the background accuracy was not obviously affected by removing attention blocks. A possible reason is that the background occupies a relatively large area of an image, so the attention blocks contributed little there. The accuracy of BLS segmentation and rice segmentation decreased by about 3% when attention blocks were removed, especially the fourth attention block. This verifies the effectiveness of using the MSC block to distinguish high-level semantic features.

Prediction Results Comparison
To investigate the segmentation performance of the BLSNet, UNet, and DeepLabV3+ models, sample images and their segmentation results are presented in Figure 11. The first-row sample, taken under dark light conditions, was processed using the flip augmentation operation. The results for this sample suggest that the performance of all models was similar, with the BLSNet segmentation being relatively consistent and tending to be complete. The second-row sample was augmented by an image rotation operation, and it is clear that some leaf segments are missing in the results of the UNet and DeepLabV3+ models. Again, the segmentation performed by BLSNet tended to be complete, indicating its ability to perform well even when sample rotation occurs. This characteristic is desirable for disease segmentation in the natural agricultural production environment. The third-row sample was augmented by a scale transformation, and the detailed features of lesion segmentation from the BLSNet model were better than those produced by UNet and DeepLabV3+. This suggests that BLSNet adapts better to scale changes in images. In the fourth row of Figure 11, it can be seen that the detailed edges of BLS lesions were segmented well by the BLSNet model, which is consistent with the conclusion of the ablation studies, i.e., attention blocks can significantly improve detailed segmentation. In the fifth row of Figure 11, panicles were misclassified as BLS by the BLSNet model. A possible reason is that the color characteristics of the panicles were similar to those of the BLS disease spots, so the network may not be able to discriminate the fine differences between disease spots and panicles, which ultimately leads to misclassifications.

Comparison of Model Prediction Time
To compare the prediction time of different models, the saved parameter files of UNet, DeepLabV3+, and the proposed BLSNet were used to load the trained models for the segmentation of validation dataset images. The per-image segmentation time was calculated by averaging the prediction time over all images in the validation dataset. Table 4 shows that the prediction time of BLSNet was slightly longer than that of UNet and shorter than that of DeepLabV3+, while the segmentation accuracy achieved by BLSNet was the highest. This suggests that BLSNet is well-suited for real-time BLS lesion segmentation.
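The averaging procedure described above can be sketched as follows. This is an illustrative outline only, not the timing code used in the experiments: `mean_prediction_time` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for the forward pass of a trained model loaded from its parameter file.

```python
import time


def mean_prediction_time(predict, images):
    """Return the average wall-clock time in seconds that `predict` takes per image."""
    if not images:
        raise ValueError("images must not be empty")
    total = 0.0
    for image in images:
        start = time.perf_counter()
        predict(image)
        total += time.perf_counter() - start
    return total / len(images)


def dummy_predict(image):
    """Hypothetical stand-in for a trained model's forward pass."""
    return [[0 for _ in row] for row in image]


# Five toy 8x8 "images" in place of the validation dataset.
images = [[[0] * 8 for _ in range(8)] for _ in range(5)]
avg_seconds = mean_prediction_time(dummy_predict, images)
```

Only the call to `predict` is timed, so image loading and preprocessing are excluded from the average. When a model runs on a GPU, the device usually needs to be synchronized before each timer reading, because the computation may otherwise still be in flight when the clock is read.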

Disease Severity Estimation Comparison
The precise lesion areas needed for the investigation of disease severity can be extracted from the segmentation results. The trained models were used to segment the validation dataset, and the percentage of the BLS diseased lesion area in each sample was then recorded to obtain the disease severity of each image according to the criteria outlined in Section 2.2. To compare the effectiveness of BLSNet, UNet, and DeepLabV3+ in determining different levels of BLS disease severity, confusion matrices were used, as shown in Figures 12-14. The classification accuracy for the different levels of disease severity is shown in Table 5. In the confusion matrices, each row corresponds to a true category, and the numbers in that row give how many of its samples were predicted as each category. It can be seen that for level 2 disease severity, UNet achieved the highest accuracy, while BLSNet achieved the best results for the other four levels. The overall classification accuracy of BLSNet was higher than that of UNet and DeepLabV3+.
To improve our understanding of the causes of these results, the misclassification problems were further analyzed. A total of 14 samples were misclassified by BLSNet, of which 11 had their disease severity underestimated. This underestimation is likely because the segmented lesion zones tended to be clustered due to the attention mechanism, which led to the estimated area of disease infection being smaller than the actual area. Correspondingly, twenty-three samples were misclassified by the DeepLabV3+ model. A possible reason for this misclassification is that the receptive fields of the ASPP used in DeepLabV3+ were large and the model lacked sufficient attention blocks, increasing the difficulty of efficiently detecting BLS lesion zones. The performance of the UNet model clearly differed across disease levels: the classification accuracies for levels 2 and 3 were 1 and 0.85, respectively, demonstrating that UNet was not always able to determine BLS disease severity well at certain levels. The results obtained from the disease severity investigation suggest that the proposed BLSNet model could be used in practical applications to automatically determine BLS disease severity with a high level of accuracy.
Although BLSNet achieved higher overall classification accuracy in BLS severity estimation, its accuracy for levels 1 and 2 was lower than that for levels 3-5. The capability to identify low-level BLS disease severity was limited, i.e., the segmentation of small lesion zones needs to be improved in BLSNet. The performance of the proposed model can be improved by using a larger volume of data covering the different levels of BLS disease severity. A priori knowledge of plant protection and other relevant agricultural information, together with fuller use of computer vision, could help with the design and implementation of an effective severity estimation process. In addition, automatic labeling, semi-supervised training, and assisted labeling methods could be used to speed up agricultural sample processing, helping to overcome the workload problem of manual semantic labeling.
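The severity estimation procedure discussed in this section (lesion area percentage, severity level, confusion matrix, and analysis of under- and overestimation) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the threshold values are hypothetical placeholders rather than the criteria of Section 2.2, the percentage is assumed to be taken over leaf pixels (healthy plus lesion), and all function names and the label encoding are illustrative.

```python
from bisect import bisect_left

# Hypothetical upper bounds (percent lesion area) separating levels 1..5.
# Placeholders only; these are NOT the criteria from Section 2.2.
EXAMPLE_THRESHOLDS = [5.0, 10.0, 25.0, 50.0]


def lesion_percentage(mask, leaf_label=1, lesion_label=2):
    """Lesion pixels as a percentage of all leaf pixels (healthy + lesion)."""
    lesion = sum(row.count(lesion_label) for row in mask)
    leaf = sum(row.count(leaf_label) for row in mask)
    total = lesion + leaf
    return 100.0 * lesion / total if total else 0.0


def severity_level(percentage, thresholds=EXAMPLE_THRESHOLDS):
    """Map a percentage to a level in 1..len(thresholds)+1.

    A value exactly equal to a bound falls into the lower level.
    """
    return bisect_left(thresholds, percentage) + 1


def severity_confusion(true_levels, pred_levels, num_levels=5):
    """Rows are true levels, columns are predicted levels (both 1-based)."""
    cm = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(true_levels, pred_levels):
        cm[t - 1][p - 1] += 1
    return cm


def error_breakdown(true_levels, pred_levels):
    """Count samples whose severity was underestimated and overestimated."""
    pairs = list(zip(true_levels, pred_levels))
    under = sum(1 for t, p in pairs if p < t)
    over = sum(1 for t, p in pairs if p > t)
    return under, over


# Toy predicted mask (assumed encoding: 0 = background, 1 = leaf, 2 = lesion).
mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 2],
    [0, 1, 2, 2],
]
pct = lesion_percentage(mask)
level = severity_level(pct)

# Toy true and predicted levels for six samples.
true_levels = [1, 2, 3, 4, 5, 3]
pred_levels = [1, 1, 3, 4, 4, 4]
cm = severity_confusion(true_levels, pred_levels)
under, over = error_breakdown(true_levels, pred_levels)
```

Because a level is determined only by the lesion percentage, a segmentation that shrinks lesion zones lowers the percentage and can push a sample into a lower level, which is the underestimation effect described above.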

Conclusions
This paper proposed BLSNet for rice BLS lesion segmentation and disease severity estimation, based on semantic segmentation combined with an attention mechanism and multi-scale methods. BLSNet was trained on images taken in actual fields. Its segmentation accuracy was the highest compared with the benchmark DeepLabV3+ and UNet models. In the case of level 5 disease severity, the accuracy of classification of the proposed model was higher in level 4, demonstrating the effectiveness of disease severity estimation of the proposed model.
The main contribution of this paper is the application of deep learning-based semantic segmentation to rice BLS segmentation and severity estimation. To our knowledge, this is the first work to take advantage of the pixel-wise classification of semantic segmentation to obtain fine rice BLS lesion areas, which are then used for disease severity determination. The comparative experiments showed that introducing the attention mechanism and multi-scale feature aggregation can improve disease classification accuracy; these findings provide a reference for the deep learning architecture design of other plant disease classification tasks.
It is worth noting that this method still has some limitations. The accuracy of low-level disease severity estimation needs improving. Due to restrictions on data collection time and experimental site, only one disease, i.e., rice BLS, was studied for segmentation and severity estimation. In future studies of lesion segmentation and disease severity level estimation, an advanced segmentation model can be developed for more types of plants, more types of diseases, and more levels of disease severity. Multiple types of data sources can also be considered, e.g., multispectral and hyperspectral data, whose spectral characteristics can be used to show the growth state of crops, and aerial imagery, which can be used for effective plant disease monitoring at a large scale. In theory, the accuracy of the model can be improved by increasing the amount of data; data crowdsourcing can be used to label adequate crop disease samples. In terms of model size, work can also be carried out to reduce the number of parameters and to propose a lightweight model for mobile and IoT devices in the actual field.