Attention Based Quick Network With Optical Flow Estimation for Semantic Segmentation

Video semantic segmentation is a challenging vision task because its temporal-spatial characteristics are difficult to model while satisfying real-time and accuracy requirements simultaneously. To tackle this problem, this paper proposes a novel optical flow based method. We propose an adaptive threshold key frame scheduling strategy that models the temporal information by estimating the inter-frame similarity. To ensure segmentation accuracy, we construct a convolutional neural network named Quick Network with attention (QNet-attention), a lightweight image semantic segmentation model with a spatial-pyramid-pooling-attention module. The proposed network is further combined with optical flow estimation to realize a video semantic segmentation framework. The performance of the proposed method is verified against existing benchmark methods. The experimental results indicate that our method achieves an excellent balance between accuracy and speed.


I. INTRODUCTION
Semantic segmentation, a challenging subject in computer vision, assigns a given semantic label to each pixel of an image or video to achieve the purpose of object detection. Semantic segmentation can be widely applied to obstacle avoidance, tracking, and path planning in intelligent scenes such as autopilot and unmanned aerial vehicles (UAV) [1]. In the past decades, with the development of deep learning and hardware equipment, deep learning models have been widely used in image and video semantic segmentation and have achieved clearly better performance than classical machine learning methods, leading to great progress in computer vision [3], [4].
Videos are essentially composed of a series of temporally continuous images. The abundant temporal information in videos can be integrated into an image segmentation model by using dedicated modules that extract effective features to improve the segmentation accuracy. In recent works, the long short-term memory (LSTM) module is applied to learn the temporal features of video and assist the propagation of spatial features [7]. The Netwarp structure uses optical flow to fuse the features of the previous frame with those of the current frame [8]. The spatial-temporal transformer gate recurrent unit (STGRU) module takes the neighbor frames of the current frame as inputs for training the optical flow based semantic segmentation model [9]. Reference [10] proposes the segmentation transformer in the coding stage to process continuous video sequences for global context information.
The associate editor coordinating the review of this manuscript and approving it for publication was Prakasam Periasamy.
Because adjacent frames are similar, videos often contain redundant information that should be reduced. The classical deep feature flow (DFF) structure is combined with the fixed interval key frame selection strategy, where the features of the previous key frame are directly converted by the optical flow method in the feature extraction process of the current frame [11]. Optical flow estimation is computationally much cheaper than feature extraction. Because it is difficult to determine a suitable time interval for fixed key frame selection, the adaptive key frame selection strategy is investigated [12]. Based on the DFF structure, a shallow neural network is added to the dynamic video segmentation network (DVSNet) to judge whether the current frame is a key frame [13].
Although the aforementioned works focus on the real-time performance of semantic segmentation methods, most of them pay for it with a loss of accuracy, especially for small targets. To tackle this problem, this research investigates real-time video semantic segmentation with sound accuracy under limited computing and storage conditions. We first propose a module combining the spatial pyramid pooling module (SPP) [14] and the attention mechanism (AM), which can extend the range of the receptive field and effectively realize the classification of small targets. Then, we construct a lightweight image semantic segmentation model based on QNet [15], called QNet-attention, which is further combined with FlowNet2-s [16] to realize a video semantic segmentation framework. Finally, a comparative experiment with other state-of-the-art frameworks is conducted on the Cityscapes [17] dataset. The experimental results verify the excellent performance of the proposed video semantic segmentation framework.
In summary, this paper proposes a lightweight deep learning neural network based on the simplification of network structures, global and local feature fusion, and key frame selection. This research makes the following contributions:
1) We propose a module combining the spatial pyramid pooling module (SPP) [14] and the attention mechanism (AM), which can extend the range of the receptive field and effectively realize the classification of small targets. We name this new module the SPP-attention Module (SPP-A).
2) We construct a lightweight image semantic segmentation model based on QNet [15], called QNet-attention. The experimental results show that it realizes an excellent balance between accuracy and speed.
3) We propose an adaptive threshold key frame scheduling strategy combined with the optical flow method, which not only ensures the overall segmentation accuracy but also improves the model reasoning speed.
4) We propose a video semantic segmentation framework, QNet-attention + FlowNet2-s [16], and carry out comparative experiments with other state-of-the-art frameworks on the Cityscapes [17] dataset. The experimental results verify the excellent performance of the video semantic segmentation framework proposed in this paper.
The rest of this paper is organized as follows. Section II introduces several related works. Section III introduces the proposed video semantic segmentation method. Experiments are described in Section IV. Finally, the paper is concluded in Section V.

II. RELATED WORKS
In this section, related works are introduced from two aspects: we first introduce several methods for video segmentation acceleration and then review the attention mechanism.
A. FRAME PROCESS STRATEGY
Videos often contain redundant temporal-spatial information, which can bring huge computational cost. Hence, the frame process strategy has been widely considered to reduce the redundancy.

1) INTER FRAME FEATURE PROPAGATION STRATEGY
Optical flow prediction usually estimates, from a pair of temporally adjacent images, where each pixel of one image is located in the other [19]. The DFF algorithm [13] carries out inter-frame feature propagation through the optical flow method, which reduces the number of feature extraction passes and significantly improves the calculation speed. The low-latency video semantic segmentation method uses a convolutional neural network (CNN) to propagate the previous deep features to the current frame and fuses them with the low-level features of the current frame [12]. The temporally distributed network (TDNet) circularly allocates several sub-networks to frames in chronological order and performs a lightweight forward propagation on each frame [20]. Finally, all features are aggregated by reusing the sub-features extracted in the previous frames. Unlike DFF, we directly propagate the segmentation result of the key frame, instead of its features, to the current frame.

2) KEY FRAME SCHEDULING STRATEGY
The scheduling of key frames is an essential step in the feature propagation process. At present, there are mainly three types of methods for extracting key frames: clustering, optical flow, and quality [28]. The clustering method maps the image information to a high-dimensional space composed of feature vectors and then classifies it by clustering [21]. The optical flow method selects the key frame according to the motion information obtained from the optical flow between video frames, for example with the Lucas-Kanade optical flow method [26]. The quality method scores the image according to different measurement standards [29]. These traditional methods often cannot meet the real-time requirement. To improve the real-time efficiency, the fixed interval key frame scheduling strategy has gradually become one of the research focuses [30]. This method is simple and greatly improves the efficiency of key frame scheduling. DVSNet measures the similarity between video frames through a neural network. If the value exceeds a certain threshold, the similarity of the two frames is high and the current frame is a non-key frame. Otherwise, it is a key frame.

B. ATTENTION MECHANISM
The attention mechanism is used to determine where to focus and assists in making adaptive feature refinement. Recently, several attempts [22], [23], [24], [25] have been made to incorporate attention mechanisms into semantic segmentation tasks. Dual attention network (DANet) [22] appends two types of attention modules on top of traditional dilated fully convolutional networks (FCN) [34] to adaptively integrate local features with their global dependencies. Criss-cross attention network (CCNet) [23] uses a novel criss-cross attention module to capture contextual information from long-range dependencies in a more efficient and effective way. Spatial and channel squeeze & excitation (scSE) [24] proposes three modules, cSE, sSE and scSE, which can enhance meaningful features and suppress useless features. Hierarchical multi-scale attention (HMSA) [25] can alleviate the problem of category confusion and find the best prediction results from multiple scales.
Motivated by the successes of attention mechanisms, we propose an attention mechanism module combined with the spatial pyramid pooling module [14] for multi-feature fusion. It applies global-level, channel-level, and spatial-level attention to refine the fused features, which enables the model to select the meaningful ones.

III. PROPOSED METHOD
To satisfy the requirements of real-time and accuracy, we propose a novel video semantic segmentation framework. In this section, we introduce our method in detail.

A. OVERALL FRAMEWORK
This framework is divided into two branches: the optical flow branch and the segmentation branch. Notations used in this section are shown in Table 1.
1) Step 1: The current frame I_i and the key frame I_k are fed simultaneously into the optical flow network to obtain the optical flow field between the two frames. The optical flow field is then input to the decision network (DN), which analyzes the similarity between the two input video frames and outputs a predicted confidence score. This score is compared with the preset confidence threshold t. If the score is greater than the threshold, the current frame is sent to the optical flow branch for further processing; otherwise, it is processed by the segmentation branch. As shown in Fig. 1, the current frame I_i is processed by the optical flow branch (red flowchart in Fig. 1), while the next frame I_{i+1} is processed by the segmentation branch (blue flowchart). A high confidence score implies that the current frame is similar to the key frame and that good segmentation results can be obtained by the optical flow branch. Meanwhile, t determines how often each of the two branches is used and therefore affects the final segmentation speed and accuracy. For more discussion on the decision network, readers are referred to Section III-D.
2) Step 2: According to the similarity between the current frame and the key frame, the decision network sends each video frame into one of the two subsequent branches to obtain the segmentation result of the current frame. The segmentation branch directly sends the current frame to the semantic segmentation network for processing, as in the general image semantic segmentation process. The optical flow branch takes the optical flow field between the current frame and the key frame from Step 1 as input and converts the previously computed key frame segmentation map into the segmentation result of the current frame through the propagation function W, so the current frame does not need to pass through the segmentation network. Note that the optical flow branch cannot obtain the segmentation result by relying on the optical flow network alone; the segmentation result of the nearest preceding key frame and the propagation function must also be used. Readers are referred to Section III-C for the specific discussion of the propagation process.
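The two-step routing described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this paper: `estimate_flow`, `decision_network`, `segment`, and `warp` are hypothetical callables standing in for FlowNet2-s, the DN, QNet-attention, and the propagation function W, and treating the first frame as a key frame is an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def schedule_frames(
    frames: Iterable,
    estimate_flow: Callable,     # hypothetical: (key_frame, frame) -> flow field
    decision_network: Callable,  # hypothetical: flow field -> confidence score
    segment: Callable,           # hypothetical: frame -> segmentation map
    warp: Callable,              # hypothetical: (key_segmentation, flow) -> map
    t: float,
) -> List[Tuple[str, object]]:
    """Route each frame to the optical flow branch or the segmentation branch."""
    results = []
    key_frame, key_seg = None, None
    for frame in frames:
        if key_frame is None:
            # Assumption: the first frame has no key frame to compare with,
            # so it is segmented directly and becomes the key frame.
            key_frame, key_seg = frame, segment(frame)
            results.append(("segmentation", key_seg))
            continue
        flow = estimate_flow(key_frame, frame)
        confidence = decision_network(flow)
        if confidence > t:
            # Similar to the key frame: propagate the key frame result.
            results.append(("flow", warp(key_seg, flow)))
        else:
            # Dissimilar: segment directly and use it as the new key frame.
            key_frame, key_seg = frame, segment(frame)
            results.append(("segmentation", key_seg))
    return results
```

A larger t sends more frames through the segmentation branch (higher accuracy, lower speed), while a smaller t favors the cheaper optical flow branch.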

B. QNet-ATTENTION NETWORK
To improve the segmentation accuracy of small-scale targets, we combine the channel attention mechanism and spatial attention mechanism with the spatial pyramid pooling module, and design a lightweight semantic segmentation network based on QNet network, called QNet-attention. The structure diagram of the model is shown in Fig. 2, and the network structure is depicted in Table 2.
QNet-attention mainly uses the feature extraction unit based on Channel Split and Channel Shuffle [31]. Block 1 is composed of five basic units and block 2 is composed of nine basic units, where the first unit of each block is a downsampling unit. The decoding stage mainly uses the spatial pyramid pooling-attention mechanism fusion module designed in this paper, as shown in Fig. 2. We apply average pooling with kernel sizes and strides of 1 × 1, 2 × 2, 4 × 4, and 8 × 8 to the feature maps obtained at the coding stage and obtain four feature maps of different sizes. To maintain the weight of global features, we use 1 × 1 convolution to reduce the channels of each feature map by half. These low-dimensional feature maps are then upsampled by bilinear interpolation to recover their scale. Meanwhile, the four feature maps are combined into two groups: the channel attention mechanism module follows the first group, and the spatial attention mechanism module follows the second group. Finally, the two groups of feature maps are concatenated with the original input feature map, and the feature maps at different levels are superimposed into the final global feature. This module uses receptive fields of different sizes to aggregate the information of different regions of the input feature map, which reduces the information loss between different regions. At the same time, the global information and the local information of different scales are fully integrated to obtain additional dimensional information. This improves the ability of the network to pay attention to small-scale targets and the overall reasoning ability of the decoding end. Reducing the number of channels before the final concatenation also keeps the amount of calculation low.
The above operations improve the ability to understand the coded feature map and fully utilize the spatial information through multi-scale feature maps and the attention mechanism, which effectively improves the recognition of small-scale targets while taking both computational cost and efficiency into account.
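The multi-scale pooling step of the module can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions: feature maps are stored as (C, H, W) arrays whose height and width are divisible by 8, nearest-neighbour repetition replaces the bilinear upsampling described above for brevity, and the 1 × 1 convolutions and both attention modules are omitted. The function names are hypothetical.

```python
import numpy as np


def avg_pool(x: np.ndarray, k: int) -> np.ndarray:
    """Average-pool a (C, H, W) map with kernel size and stride k."""
    c, h, w = x.shape
    assert h % k == 0 and w % k == 0, "H and W must be divisible by k"
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))


def upsample_nearest(x: np.ndarray, k: int) -> np.ndarray:
    """Restore the spatial size by repeating each value k times per axis.

    Nearest-neighbour repetition is a stand-in; the text uses bilinear
    interpolation at this step.
    """
    return x.repeat(k, axis=1).repeat(k, axis=2)


def pyramid_features(x: np.ndarray, kernels=(1, 2, 4, 8)):
    """Return one restored-size feature map per pooling scale."""
    return [upsample_nearest(avg_pool(x, k), k) for k in kernels]
```

Each returned map has the same shape as the input, so the maps can be grouped, passed through attention, and concatenated with the input along the channel axis.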

C. FEATURE PROPAGATION STRATEGY
The optical flow method is an essential research topic in video analysis. Variational methods, such as the classical Lucas-Kanade optical flow method, are some of the most widely used methods. These methods mainly handle small displacements of moving objects in video. In recent years, methods based on deep learning and semantic information have been widely used to address large displacements and to improve the robustness of the optical flow method. FlowNet first applies a deep CNN to estimate motion directly and obtains good results, and the optical flow method is also used to assist visual tasks, where it can speed up conventional video recognition. In this paper, the optical flow method is used to represent the inter-frame correlation of video and to propagate the inter-frame feature. Generally speaking, the similarity between consecutive video frames is significant, and the current frame is similar to the nearest preceding key frame, with only local differences. Feature propagation can therefore be carried out between the two frames through optical flow. Introducing the optical flow method for inter-frame feature propagation can significantly reduce the use of the image segmentation branch and greatly improve the overall reasoning speed of the model on a single frame. Compared with frame-by-frame image semantic segmentation, the overall accuracy may be slightly reduced. However, in practical application scenarios, the trade-off between speed and accuracy is more important. To further improve the computational efficiency and reasoning speed, we do not propagate the feature in the feature extraction stage, but directly propagate the segmentation map of the key frame to the current frame through the optical flow method to obtain the segmentation result of the current frame.
After the current frame I_i is input, the optical flow field F_{k→i} between it and the nearest preceding key frame I_k is calculated by the optical flow network. The position of pixel p in the current frame I_i is projected back to the corresponding position p + δp in the key frame I_k through the optical flow field, where δp = F_{k→i}(p). Since δp is generally non-integer, feature conversion can be realized through bilinear interpolation, as shown in (1a) and (1b), where g(a, b) = max(0, 1 − |a − b|). This propagation process can be abbreviated as (2).
In this paper, FlowNet2-s [16] is used as the optical flow computing network, which is significantly improved compared with FlowNet in data training and model structure, and the overall performance of the model is also the best at present. The semantic segmentation network adopts the lightweight image semantic segmentation network QNet-attention designed in this paper. The overall segmentation performance, especially the real-time performance, is significantly improved compared with other mainstream methods. The schematic diagram of the feature propagation strategy based on the optical flow method is shown in Fig. 3.
The feature propagation strategy proposed in this paper integrates the semantic segmentation network and the optical flow computing network into a new model. Both networks can be pre-trained to reduce the computational cost. By combining the two networks, this strategy avoids semantic segmentation of every video frame, effectively reduces the amount of calculation, and thereby accelerates the overall network. Meanwhile, the advanced optical flow computing network ensures the accuracy of feature propagation.
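The bilinear propagation described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the paper's implementation, and it rests on several assumptions: the key frame result is stored as per-class score maps of shape (C, H, W), since interpolating discrete labels directly is not meaningful; `flow[0]` and `flow[1]` hold the x and y components of δp; and samples that fall outside the key frame contribute zero. The function name `warp` is hypothetical.

```python
import numpy as np


def g(a, b):
    """Bilinear kernel from the text: g(a, b) = max(0, 1 - |a - b|)."""
    return np.maximum(0.0, 1.0 - np.abs(a - b))


def warp(key_scores: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp (C, H, W) key frame scores to the current frame.

    flow has shape (2, H, W): flow[0] is the x component and flow[1] the
    y component of delta_p for every pixel p of the current frame
    (an assumed layout).
    """
    c, h, w = key_scores.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    sx = xs + flow[0]  # x coordinate of p + delta_p in the key frame
    sy = ys + flow[1]  # y coordinate of p + delta_p in the key frame
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    out = np.zeros((c, h, w), dtype=float)
    # Only the four integer neighbours q of p + delta_p have non-zero weight.
    for dy in (0, 1):
        for dx in (0, 1):
            qx, qy = x0 + dx, y0 + dy
            weight = g(qx, sx) * g(qy, sy)
            inside = (qx >= 0) & (qx < w) & (qy >= 0) & (qy < h)
            qxc = np.clip(qx, 0, w - 1)
            qyc = np.clip(qy, 0, h - 1)
            out += key_scores[:, qyc, qxc] * (weight * inside)
    return out
```

Taking `warp(key_scores, flow).argmax(axis=0)` then yields a label map for the current frame without running the segmentation network.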

D. KEY FRAME SCHEDULING STRATEGY
Previous research works on key frame scheduling have been conducted based on fixed intervals or simple heuristic methods. Such methods usually cannot deal with complex changes in the video scene, such as sudden camera movement or large changes in the scene structure, which seriously affects the overall performance of the model. At present, the existing key frame selection methods can be divided into three categories: clustering based, optical flow based, and quality based. In this paper, we use the quality-based key frame selection method together with the optical flow method to improve the overall quality and efficiency. Fig. 4 shows the overall structure of the key frame scheduling strategy based on the decision network and the training strategy of the decision network. The decision network is a lightweight convolutional neural network composed of only a single convolution layer and three fully connected layers. Its input is the output of the optical flow network. In the training phase, the current frame I_i and the nearest key frame I_k are used as inputs. The training purpose is to obtain a predicted confidence score that represents the similarity of the two input images. The optical flow field F_{k→i} and the warp function W are calculated by the optical flow network FlowNet2-s, and the semantic segmentation result of the key frame is then passed through W to calculate the semantic segmentation output O_i of the current frame. The other branch calculates the semantic segmentation output S_i directly through the proposed image semantic segmentation network QNet-attention. We define the ground-truth confidence score as the expected similarity between O_i and S_i:

$$\text{confidence score} = \frac{1}{P}\sum_{p=1}^{P} C\big(O_i(p), S_i(p)\big)$$

where P is the total number of pixels in the current frame, p indexes these pixels, O_i(p) and S_i(p) are the semantic category labels of pixel p calculated by the two branches, respectively, and C(u, v) is an indicator function that outputs 1 when u is equal to v and 0 otherwise.
The output of the decision network branch is the predicted confidence score, while the output of the segmentation branch is the ground-truth confidence score. Based on them, a regression model can be trained with the mean squared error (MSE) loss function. The current frame is identified as a new key frame only when the similarity between the current frame and the previous key frame is lower than a preset confidence threshold t.
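The ground-truth confidence score described above, that is, the share of pixels on which O_i and S_i agree, can be computed with a few lines of NumPy. The sketch below is illustrative only: the function names are hypothetical, and the score is returned as a fraction in [0, 1], so it would need to be multiplied by 100 before comparison with a percentage-style threshold such as 95 (the scale is an assumption).

```python
import numpy as np


def confidence_score(o_i: np.ndarray, s_i: np.ndarray) -> float:
    """Fraction of pixels whose labels agree between the two branches."""
    assert o_i.shape == s_i.shape
    # (o_i == s_i) plays the role of the indicator C(u, v) for every pixel.
    return float(np.mean(o_i == s_i))


def is_key_frame(score: float, t: float) -> bool:
    """A frame becomes a new key frame when its score falls below t."""
    return score < t
```

For example, two 2 × 2 label maps that differ in one pixel give a score of 0.75, which is below a threshold of 0.95 and would therefore trigger a new key frame.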

E. ADAPTIVE THRESHOLD STRATEGY
A well-chosen threshold leads to an appropriate number of key frames, but no single fixed threshold suits all videos. Therefore, an adaptive threshold strategy is proposed in this paper. Define the confidence score between the i-th frame and its nearest key frame as score_i, set the initial threshold to 95, and then define the threshold: where k can be set manually. Through this method, the threshold can be adjusted adaptively and flexibly according to the video content, which helps to improve the efficiency and accuracy of subsequent segmentation tasks.

IV. EXPERIMENTS

A. DATASET
To verify the real-time performance and accuracy of the proposed method, the benchmark dataset Cityscapes [17] is selected for training, testing, and performance comparison with other state-of-the-art semantic segmentation networks.
Cityscapes is a large-scale dataset of driving scene images from 50 different cities, with 19 categories of dense pixel annotation, eight of which have instance-level segmentation. There are two sets of evaluation standards, fine and coarse. The former provides 5000 finely labeled images, and the latter provides the 5000 finely labeled images plus 20000 coarsely labeled images, with a maximum resolution of 1024 × 2048. In this paper, only the finely labeled data are used, of which 2975 images are used for training, 500 for validation, and the remaining 1525 for testing.

B. EXPERIMENTAL SETTINGS

1) IMPLEMENTATION DETAILS
The video semantic segmentation framework based on the optical flow method designed in this paper is mainly composed of the segmentation branch, the optical flow branch, and the decision network. Considering the training cost and efficiency, we test these three parts separately and finally integrate them for testing. The segmentation branch uses the lightweight semantic segmentation network QNet-attention proposed in this paper, and the optical flow branch uses the optical flow computing network FlowNet2-s. As a well pre-trained FlowNet2-s model is available, we use it directly. The semantic segmentation network QNet-attention and the decision network DN are trained separately. All training and testing are implemented in the TensorFlow deep learning framework and carried out on a single NVIDIA Titan RTX GPU.

2) TRAINING DETAILS OF DECISION NETWORK
To improve the training efficiency and to account for the continuity and stability of video, we first calculate, for the decision network, the confidence scores of 38675 frames taken from the 7th frame to the 19th frame of the 2975 video clips in the training set of the leftImg8bit_sequence video frame dataset. In the calculation process, the 20th frame of each video clip, which carries the semantic label, is the key frame and serves as the reference frame for the other frames in the clip. Then we build a regression model for training and take the MSE as the loss function:

$$L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where N represents the total number of pictures in the training set, y_i represents the ground-truth confidence score, and ŷ_i represents the predicted confidence score. The Adadelta optimizer is used to train the decision network. Considering the small scale of the decision network, we train for 100 epochs with a batch size of 32. The initial learning rate is set to 0.002 and decays at a rate of 0.99 after each epoch.
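The training quantities mentioned above can be checked with a small sketch. It is illustrative only and makes one assumption explicit: "decays at a rate of 0.99 after each epoch" is read as multiplying the learning rate by 0.99 per epoch. The function names are hypothetical.

```python
import numpy as np


def mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error between ground-truth and predicted scores."""
    return float(np.mean((y_true - y_pred) ** 2))


def learning_rate(epoch: int, initial: float = 0.002, decay: float = 0.99) -> float:
    """Learning rate after `epoch` completed epochs (assumed multiplicative decay)."""
    return initial * decay ** epoch


# Frames 7 to 19 (inclusive) of each of the 2975 training clips.
frames_per_clip = len(range(7, 20))
total_frames = frames_per_clip * 2975
```

With 13 frames per clip, `total_frames` reproduces the 38675 frames stated in the text.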

3) LOSS FUNCTION
Inter-class sample imbalance is a common problem in semantic segmentation. The total number of pixels belonging to small and medium-sized targets is much smaller than that of background pixels. Because the traditional semantic segmentation training process calculates the loss pixel by pixel from isolated pixels, it is difficult for the network to obtain the global context information. Therefore, to strengthen the ability of the network to learn contextual semantic information and to improve the segmentation accuracy of small-size targets, we add a fully connected layer branch with sigmoid activation functions on the coding layer and use binary cross-entropy loss to predict the target categories in the image scene. The loss function of the whole network is the weighted sum of the cross-entropy loss of the final decoding layer and the loss of the class prediction branch, in which the weight of the class prediction branch loss is 0.4. Experiments show that this improves the segmentation accuracy of small-size targets.
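The weighted loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the training code of this paper. It assumes that the decoder outputs softmax probabilities of shape (C, H, W), that the auxiliary branch outputs one sigmoid value per class, and that the auxiliary target marks which classes are present in the label map; the function names are hypothetical.

```python
import numpy as np


def pixel_cross_entropy(probs: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy; probs is (C, H, W) softmax output, labels is (H, W)."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    picked = probs[labels, ys, xs]  # probability of the true class per pixel
    return float(-np.mean(np.log(picked + 1e-12)))


def class_presence_bce(pred: np.ndarray, labels: np.ndarray) -> float:
    """Binary cross-entropy on class presence; pred is (C,) sigmoid output."""
    num_classes = pred.shape[0]
    # Assumed target: 1 if the class occurs anywhere in the label map, else 0.
    target = np.isin(np.arange(num_classes), labels).astype(float)
    eps = 1e-12
    return float(-np.mean(target * np.log(pred + eps)
                          + (1 - target) * np.log(1 - pred + eps)))


def total_loss(probs, presence_pred, labels, alpha: float = 0.4) -> float:
    """Pixel-wise loss (weight 1) plus alpha times the class prediction loss."""
    return (pixel_cross_entropy(probs, labels)
            + alpha * class_presence_bce(presence_pred, labels))
```

The default `alpha=0.4` mirrors the branch weight stated in the text.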

4) EVALUATION METRICS
This paper mainly considers evaluation indexes for the accuracy and speed of semantic segmentation. The accuracy indexes mainly include pixel accuracy (PA), intersection over union (IoU), and mean IoU (mIoU). In terms of speed and cost, frames per second (FPS), floating point operations (FLOPs), the total number of parameters, and the model size are used as the main evaluation indexes.
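The accuracy indexes can be computed from a confusion matrix, as in the following sketch. It is an illustration with hypothetical function names; skipping classes that appear in neither the prediction nor the ground truth when averaging is an assumption, not a rule taken from this paper.

```python
import numpy as np


def confusion_matrix(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index the true class, columns the predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def pixel_accuracy(cm: np.ndarray) -> float:
    """Share of correctly labeled pixels."""
    return float(np.trace(cm) / cm.sum())


def iou_per_class(cm: np.ndarray) -> np.ndarray:
    """IoU = TP / (TP + FP + FN) for each class."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    # Classes absent from both maps get NaN and are skipped by nanmean.
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)


def mean_iou(cm: np.ndarray) -> float:
    """Average IoU over the classes that occur."""
    return float(np.nanmean(iou_per_class(cm)))
```

For a two-class example with ground truth [0, 0, 1, 1] and prediction [0, 1, 1, 1], PA is 0.75 and the per-class IoU values are 1/2 and 2/3.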

C. MODEL ANALYSIS

1) EFFECTIVENESS OF SPP-ATTENTION
To demonstrate the effectiveness of the proposed SPP-Attention module, we test SPP-Attention modules with different numbers of layers, k = 2, 3, 4, 5, 6, whose details are shown in Table 3. As shown in Fig. 5, we find that k = 4 yields the best performance, because once the number of layers k reaches 4, the increase in mIoU is close to saturation while the decrease in speed becomes more obvious. Then we compare the accuracy of QNet and QNet-attention on the Cityscapes dataset, using Class IoU and Category IoU as the accuracy evaluation indexes. To verify the real-time performance of the SPP-Attention module and to reflect the computing and storage constraints of mobile devices, we use a lower performance processor (GTX Titan GPU) as the test machine and take FLOPs, parameters, model size, and FPS as the evaluation indexes. As shown in Table 4, without using any pre-trained model, the Class IoU and Category IoU of QNet-attention are both higher than those of QNet. At the same time, QNet-attention has lower computational requirements and fewer parameters than QNet. The overall model is small, which is conducive to storage. This shows that the SPP-Attention module proposed in this paper meets the real-time requirements and is suitable for practical application scenarios.

2) EFFECTIVENESS OF LOSS
To further improve the performance of the model, we introduce a loss function that differs from the pixel-by-pixel cross-entropy loss function. We add a fully connected layer branch with a sigmoid activation function and use binary cross-entropy loss to predict the target categories in the image scene. The loss function of the whole network is the weighted sum of the per-pixel cross-entropy loss and the branch loss. The weight of the per-pixel cross-entropy loss is always 1, and the weight of the branch loss is a positive constant α less than 1.
To verify the effectiveness of this loss function, we test the segmentation accuracy of the model under different α based on the Cityscapes dataset. The results show that when α = 0.4, the performance is the best, and the segmentation accuracy of small-size targets (such as traffic sign, car, person, etc.) is improved to a certain extent. The test results are shown in Fig. 6 and Table 5, in which the bold part is the optimal value of the same group.

3) EVALUATION OF THRESHOLD ALGORITHM
To evaluate the effectiveness of the adaptive threshold algorithm proposed in this paper, we compare it with the results obtained with a fixed threshold t of 95. The experimental results in Table 7 show that although the adaptive threshold algorithm still produces some redundancy, it follows the principle of ''better more than less'' in key frame extraction.
In addition, most of the key frames extracted by the adaptive threshold algorithm can well represent the video content. Compared with the methods with constant thresholds, our method shows fewer wrong key frames and less redundancy.

D. COMPARISON

1) COMPARISON WITH OTHER NETWORKS
The comparison is conducted with PSPNet [38], ICNet [43], and ENet [40]. Both PSPNet and ICNet use pre-trained models for transfer learning. Table 6 shows the Class IoU, Category IoU, FPS, model size, and other indicators of each network on the test images. (The bold part in the table is the optimal value of the same group.) Compared with the other networks, QNet-attention has lower computational requirements and fewer parameters without losing too much segmentation accuracy. At the same time, the whole model is small. The image processing speed achieved on the low-performance processor reaches 18.2 FPS, which is significantly higher than that of the other networks. This result indicates that our lightweight network model meets the real-time requirements. Example results are shown in Fig. 7.

2) COMPARISON WITH OTHER FRAMEWORKS
We compare the accuracy and speed of the video semantic segmentation frameworks PSPNet + FlowNet2-s, ICNet + FlowNet2-s, and QNet-attention + FlowNet2-s corresponding to each network. Class IoU, Category IoU, FLOPs, FPS, model size, and other indicators are adopted. The experimental results are shown in Table 8. Compared with the other frameworks, the QNet-attention + FlowNet2-s video semantic segmentation framework proposed in this paper has no advantage in accuracy indicators such as IoU, but it performs better in FLOPs, parameters, and model size. The number of video frames it processes per second reaches 23.5 FPS, which is significantly higher than that of the other frameworks. This shows that our proposed framework has more advantages under limited computing power and storage and is more suitable for mobile devices in actual scenarios.
Fig. 8 compares the segmentation results of the various video semantic segmentation frameworks. It can be found that the video semantic segmentation framework based on the optical flow method is effective. Each object is clearly segmented with clear edges, which accurately reflects the semantic information of the scene. Among them, ICNet + FlowNet2-s has the better overall effect, with fewer error points and clear object contours. The accuracy of the QNet-attention + FlowNet2-s framework proposed in this paper is guaranteed to a certain extent. At the same time, small objects such as electric poles can be segmented, and each segmented object corresponds to the actual picture. Fig. 9 compares the segmentation results of the image semantic segmentation network QNet-attention and the video semantic segmentation framework QNet-attention + FlowNet2-s proposed in this paper. It can be found that pure image semantic segmentation has a better effect than video semantic segmentation with the optical flow method, and its segmented object edges contain less noise.
In other words, with the same segmentation method under the same conditions, video semantic segmentation gives up a certain amount of accuracy in exchange for a faster segmentation process.

E. SCENARIO TEST
We also carried out experiments in real scenes. We used a DJI Matrice 210 RTK V2 UAV equipped with a ZENMUSE X5 camera, flying at low altitude over Beichen Road on the main campus of Dalian University of Technology, to obtain video data similar to the Cityscapes street view dataset. Then we used the QNet-attention + FlowNet2-s framework proposed in this paper to directly perform semantic segmentation. The overall effect is shown in Fig. 10. It can be found that the overall segmentation effect is good. The segmentation of main objects such as roads, pedestrians, cars, and trees is clear, but some images have more noise and some false segmentation, especially when the flight speed of the UAV is unstable or the scene is too complex. To ensure the segmentation effect in practical scenes, the generalization ability of the segmentation model should be further improved.
For implementation, the proposed framework can be applied to outdoor scenes of low-altitude, low-speed UAV flight or to outdoor street scenes captured from automobiles. In terms of hardware, the platform needs to be equipped with a camera and have a certain computing power. It is also better to use the model in bright daytime environments. In a simple environment with little scene change, the model reasoning speed is faster and more efficient than in a complex environment with fast scene change.

V. CONCLUSION
In this paper, we present an SPP-Attention module and analyze its effectiveness. Along with this, we also propose a lightweight image semantic segmentation model, QNet-attention. Based on QNet-attention, a video semantic segmentation framework is proposed. The proposed framework consists of the QNet-attention segmentation branch and the FlowNet2-s optical flow branch, in which the segmentation branch performs semantic segmentation of key frames, and the optical flow branch performs inter-frame feature propagation and key frame scheduling. We have proposed an adaptive threshold key frame scheduling strategy based on the decision network for the key frame scheduling problem. In the experimental part, we conducted comparative experiments with other frameworks on the Cityscapes dataset under the same conditions, and the experimental results verified the excellent performance of the proposed framework. In addition, we have tested the proposed video semantic segmentation framework in a UAV cruise scene, where the segmentation effect is good and meets a certain degree of accuracy and real-time performance.

His research interests include deep learning, bioinformatics, and statistical modeling.
PAN QIN received the B.S. and M.S. degrees in the study and research of aircraft power supply system from Northwest Polytechnic University, Shaanxi, China, and the Ph.D. degree in the research of identification and predictive control algorithms for multi rate systems from Kyushu National University, Japan. He worked as an Academic Researcher for nearly six years with the School of Mathematics and Science and the Institute of Mathematics for Industry (IMI). He is currently an Associate Professor with the Dalian University of Technology. His current research interests include statistics and data mining. VOLUME 11, 2023