AI-based optical-thermal video data fusion for near real-time blade segmentation in normal wind turbine operation

Blade damage inspection without stopping the normal operation of wind turbines has significant economic value. Blade segmentation is a fundamental task for blade damage inspection in the field without stopping wind turbines. This study proposes an AI-based method, AQUADA-Seg, to segment blade images from complex backgrounds by fusing optical and thermal videos taken of wind turbines in normal operation. The method follows an encoder-decoder architecture and uses both optical and thermal videos to overcome the challenges associated with field application. A memory is designed between the encoder and decoder to improve the method's performance by utilizing time-history information in the videos to achieve temporal complementarity. The designed memory shares information between the optical and thermal modalities to achieve multimodal complementarity. We collected a large-scale dataset, i.e., 100 video pairs and over 55,000 images, of optical-thermal videos of blades on operational wind turbines to train and test the method. Experimental results show that AQUADA-Seg: i) achieves near real-time optical-thermal blade video segmentation and can analyze videos with complex backgrounds in real-world field applications; ii) achieves 0.996 and 0.981 MIoU on optical and thermal videos, respectively, outperforming state-of-the-art methods, particularly on videos with complex backgrounds. This study provides an essential step towards automated blade damage detection using computer vision without stopping the normal operation of wind turbines.


Introduction
Rotor blades are critical components of wind turbine systems and often operate in harsh environments. This makes blade failures the most important contributor to wind turbine failures, followed by control system failures and electrical failures (Pérez et al., 2013; Van Bussel and Zaaijer, 2001). Therefore, inspecting blades regularly to prevent blade failure is an important task in wind turbine operation and maintenance.
Blade damage inspection has seen dramatic advancement in the last decades. The traditional blade inspection method requires professionals to manually check blades with rope and basket, which is labor-intensive and time-consuming. Moreover, wind turbines must be stopped during inspection, resulting in extra turbine downtime and operational costs. To reduce inspection costs, more and more computer-vision-based methods have emerged, including ground-based telescopes (Wallace and Dawson, 2009), drone-based cameras (Shihavuddin et al., 2019; Wang et al., 2019), and infrared thermography (Chen et al., 2021, 2022, 2023; Sheiati and Chen, 2023). For example, Shihavuddin et al. (2019) proposed a Faster R-CNN and Inception-ResNet-v2 based method that detects blade surface damages from images taken by drone-based optical cameras. Unlike optical-camera-based methods that only detect surface damages, some laboratory studies demonstrate that infrared thermography can detect and evaluate underneath blade damages, which are often more severe and require more attention (Chen et al., 2021, 2022, 2023; Sheiati and Chen, 2023). For example, Chen et al. (2023) presented a computer vision and thermal imagery based blade damage inspection method named AQUADA PLUS. This method can automatically localize, track, and evaluate multiple damages in blades under cyclic loading simulating operational fatigue loads. Although these methods demonstrated encouraging results in laboratories, they struggle to focus on blades in the field because of distraction from noisy and complex backgrounds, resulting in severe performance degradation. Thus, blade segmentation becomes a fundamental task when applying computer-vision-based blade damage inspection methods in the field.
https://doi.org/10.1016/j.engappai.2023.107325 Received 16 August 2023; Received in revised form 12 October 2023; Accepted 16 October 2023

Fig. 1. Cases where a single modality fails to segment the blades but multimodal complementarity can be utilized to improve the segmentation performance. Thus, this study proposes using both the optical and thermal modalities in blade segmentation instead of a single one.

In the past few years, much effort has been devoted to building wind turbine blade segmentation models. For example, Xu et al. (2019) presented an optical blade segmentation method based on Canny edge detection and morphology. This method first segments blade images with Canny edge detection, then applies morphological opening to erode and dilate the segmentation masks. Tang et al. (2021) proposed an adaptive wind turbine blade segmentation method based on Hough line detection and Otsu threshold segmentation. This method first preliminarily locates the edges of the blade using Hough line detection, then uses the grab-cut algorithm with Otsu threshold segmentation and morphological operations to segment blade images in the target area. Inspired by the huge success of deep learning (LeCun et al., 2015; Silver et al., 2016; Senior et al., 2020; Bi et al., 2023) in semantic segmentation (Oh et al., 2019; Caelles et al., 2017; Long et al., 2015; Noh et al., 2015; Ronneberger et al., 2015; Strudel et al., 2021), several deep-learning-based blade segmentation methods have emerged recently (Wang et al., 2022; Yang et al., 2021; Yu et al., 2023; Pérez-Gonzalo et al., 2023). For example, Yu et al. (2023) presented a U-Net-based thermal blade image segmentation model, in which a hierarchical-split depth-wise separable convolution block is designed to balance speed and accuracy. Wang et al. (2022) proposed a U-Net-based optical wind turbine segmentation model, in which two types of attention mechanisms (ECA-Net and PSA-Net) were incorporated to enhance the model's detail-capture ability. Yang et al. (2021) presented a blade segmentation method based on a CNN and the Otsu threshold; in addition, ensemble learning was introduced to improve the segmentation performance. Pérez-Gonzalo et al. (2023) proposed a U-Net and hole-filling based optical blade image segmentation method. This method first employs a U-Net to generate a preliminary blade segmentation, then applies three hole-filling-based postprocessing steps and a random forest to improve its segmentation accuracy.

Motivation
Motivation for using both optical and thermal modalities: For segmentation: Existing blade segmentation methods use either the optical or the thermal modality, but in real-world applications, we found numerous cases where single-modal methods fail. Take Fig. 1(a) as an example: the optical modality fails to segment the blade because the boundaries between the blade and the clouds are too difficult to identify. But if a model takes both optical and thermal modalities as input, it can solve this case by utilizing complementary information provided by the thermal modality. Similarly, in Fig. 1(b), the thermal modality fails to segment the blade, but complementarity from the optical modality can be utilized to help solve this case. Thus, we should fuse the optical and thermal modalities in blade segmentation to achieve multimodal complementarity.
For damage detection: Although we focus on blade segmentation here, our long-term objective is to detect blade damage. For damage detection, the motivation for using multimodal data is twofold. On the one hand, using both optical and thermal modalities can detect surface and underneath damages simultaneously. The optical modality can be used to detect surface damage, but cannot be used to detect underneath damage, which is much more important than surface damage in wind turbine blades; the thermal modality can help to detect underneath damage. On the other hand, infrared thermography suffers from reflectivity-emissivity issues, which cause temperature measurement errors (Moradi and Sfarra, 2021; Gao and Tian, 2018). It has been verified that the optical modality can complement the thermal modality to correct reflectivity-emissivity problems (Moradi et al., 2022; Tong et al., 2023). Thus, to facilitate damage detection and the correction of reflectivity-emissivity issues in the future, the optical and thermal modalities should be used together.

Fig. 2. Cases where optical and thermal both fail to segment the blade tip but temporal complementarity can be utilized to improve the segmentation. Thus, this study proposes using videos instead of images in blade segmentation.
Motivation for using videos instead of images: For segmentation: Existing blade segmentation methods only work on static images, but in real-world applications, we found many cases where optical and thermal both fail when using only images. In cases with complex backgrounds, the optical and thermal modalities may fail at the same time instant, which cannot be handled with multimodal complementarity (see Fig. 2(a)). However, blade segmentation has a unique advantage: except for orientation, the segmentation shapes of blades do not change much over time. If taking videos as input, a model can solve these cases by utilizing temporal complementarity in the video. Take Fig. 2 as an example: a model can utilize complementary historical segmentation information from a few seconds ago (Fig. 2(b)) to help with segmenting the current frame (Fig. 2(a)). Therefore, we should use videos that contain historical information to achieve temporal complementarity.
For damage detection: Another motivation for using videos is that temporal information plays a key role in thermal-modality-based detection of underneath blade damage. Because it takes time for thermal waves to reach the surface from subsurface defects, temporal information is significant in thermal-modality-based damage detection. Static images cannot provide temporal information. Therefore, videos, not images, should be used.
Altogether, the objective of this work is to develop a novel AI-based model, which achieves multimodal and temporal complementarity by fusing optical and thermal data, to segment blades from complex backgrounds in real-world field application videos. However, real-world hardware differences pose a challenge on the way to this objective: where to get complementary information? Knowing where to get complementary information is a prerequisite for the model to utilize complementarity. Ideally, optical and thermal videos should be perfectly synchronized and have the same field of view (FOV). In this way, when a modality fails in a certain area at a certain moment, the model can directly obtain complementary information in the corresponding area and moment from the other modality. Nevertheless, real-world optical and thermal cameras are not perfectly synchronized, and they have different FOVs, spatial resolutions, frame frequencies, and reception fields (see Fig. 3). The model cannot easily get complementary information as in the ideal case. Hence, where to get complementary information is a challenge the model needs to overcome while utilizing complementarity.

Contributions
This paper contributes to the existing knowledge base as follows:

• This study presents a novel AI-based optical-thermal blade video segmentation model named AQUADA-Seg. AQUADA-Seg achieves near real-time optical-thermal blade video segmentation without stopping turbines and outperforms state-of-the-art blade segmentation methods.
• By taking both optical and thermal videos as input to achieve multimodal and temporal complementarity with a tailored memory, AQUADA-Seg shows that using multimodal videos instead of single-modal images significantly improves blade segmentation performance, especially in real-world applications with complex backgrounds.
• This study contributes a large-scale optical-thermal wind turbine blade video dataset. It contains 100 optical-thermal video pairs and over 55,000 images, among which 36 video pairs and 20,778 images were published to facilitate future studies.

Paper structure
The rest of this paper is organized as follows: we start by introducing our proposed method, AQUADA-Seg, in Section 2, then move to the experimental results, including a comparison with the state of the art and ablation studies, in Section 3. Finally, we conclude the paper in Section 4.

Proposed method
Fig. 4 illustrates the overall architecture of AQUADA-Seg. AQUADA-Seg follows an encoder-decoder architecture. The optical and thermal modalities each have their own encoder, decoder, and value encoder. For each modality, we add a memory between its encoder and decoder to store history segmentation masks. The memory adopts a key-value structure and accesses data through an attention mechanism. The decoder of each modality gets input from its encoder, its memory, and, importantly, the other modality's memory, then outputs a segmentation mask of the current frame. Finally, the value encoder of each modality encodes the segmentation mask and stores it in memory. In the following subsections, we introduce the encoder-decoder architecture of AQUADA-Seg, details of the designed memory, the loss function, and our collected optical-thermal wind turbine blade video dataset, respectively.

Encoder-decoder architecture
Overall, AQUADA-Seg is an encoder-decoder style network. Each modality has its own encoder, decoder, and lightweight value encoder.
We built the encoders and decoders following the segmentation network STCN (Cheng et al., 2021). Specifically: The encoder of each modality takes an image as input and outputs a representation of the image and a query key. The representation is the ''code'' and will be input into the decoder. The query key, which also works as a memory key, will be used when reading memory. Following common practice (Cheng et al., 2021; Oh et al., 2019), we constructed the encoder based on ResNet-50 (He et al., 2016), removing its last convolutional layer and classification layer.
The decoder of each modality outputs the segmentation mask of the current input image. It takes the following three types of information as inputs:

• the representation of the input image, which is obtained from the encoder;
• history segmentation information read from memory;
• multimodal complementary information obtained from the counterpart modality.
We constructed the decoder following the STM network (Oh et al., 2019). In particular, the decoder first fuses the image representation and the memory readout, which are obtained from the encoder and the memory respectively, with a group convolutional neural network. Then it upscales the fused feature. Finally, the masks output by the decoder are bilinearly upsampled to the original resolution.
The value encoder of each modality encodes the information that will be stored in the value part of the memory. Since the memory of each modality stores the history segmentation masks, the value encoder encodes the masks generated by the decoder. Because segmentation masks are easier to encode than input images, we construct the value encoder based on a lightweight network, ResNet-18 (He et al., 2016), removing its last convolutional layer and classification layer.

Memory
On top of the encoder-decoder architecture, we design a memory that utilizes temporal complementarity and multimodal complementarity to enhance the model's performance. In the following subsections, we first introduce the details of this memory, including the key-value memory structure, memory writing, attention-based memory reading, and memory management, then move on to how AQUADA-Seg utilizes temporal and multimodal complementarity with this memory.

Key-value Memory Structure
As illustrated in Fig. 4, we designed a key-value memory for the optical and thermal modalities respectively. The key works as an index and is responsible for memory reading. The value stores history segmentation masks. The key comes from the encoder and is essentially a compressed image representation. The value comes from the value encoder and is essentially a compressed segmentation mask. After the decoder outputs a segmentation mask of the current frame, the model updates the memory by appending a new key and value to it.
Attention-based Memory Reading

AQUADA-Seg reads memory in an attention-based way. When segmenting the $(T+1)$th frame of the input video, the model first encodes it with the encoder, then starts to read memory. At this time, the memory stores segmentation information of the previous $T$ frames. Let $k^M \in \mathbb{R}^{C^k \times THW}$ be the memory key, $v^M \in \mathbb{R}^{C^v \times THW}$ be the memory value, and $q \in \mathbb{R}^{C^k \times HW}$ be the query key obtained from the encoder, where $C^k$ and $C^v$ denote the dimensions of key and value, and $H$ and $W$ denote the spatial dimensions. We employ a similarity function $F(\cdot)$ to compute the similarity matrix $S \in \mathbb{R}^{THW \times HW}$ of $k^M$ and $q$:

$$S = F(k^M, q).$$

In practice, we use the L2 similarity function proposed in STCN (Cheng et al., 2021) and normalize the similarity matrix with $\sqrt{C^k}$. Then, we pass $S$ through a softmax function (over the memory dimension) to get the softmax-normalized attention weight matrix $W \in \mathbb{R}^{THW \times HW}$:

$$W = \operatorname{softmax}(S).$$

Finally, the memory readout of the $(T+1)$th frame, $m_{T+1}$, can be computed as the weighted sum of the memory value:

$$m_{T+1} = v^M W.$$

The memory readout $m_{T+1}$ works as the temporal complementary information and is input into the decoder to assist the segmentation of the $(T+1)$th frame.
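This attention-based read can be sketched in a few lines of NumPy (an illustrative sketch with toy dimensions, not the authors' implementation; the negative squared L2 similarity follows STCN):

```python
import numpy as np

def read_memory(k_mem, v_mem, q):
    """Attention-based memory read.
    k_mem: (Ck, T*H*W) memory keys, v_mem: (Cv, T*H*W) memory values,
    q: (Ck, H*W) query keys from the encoder for the current frame."""
    Ck = k_mem.shape[0]
    # Negative squared L2 distance between every memory key and every query
    # key, normalized by sqrt(Ck) -> similarity matrix S of shape (T*H*W, H*W).
    d2 = (k_mem**2).sum(0)[:, None] - 2.0 * k_mem.T @ q + (q**2).sum(0)[None, :]
    S = -d2 / np.sqrt(Ck)
    # Softmax over the memory dimension (axis 0), numerically stabilized.
    S = S - S.max(axis=0, keepdims=True)
    W = np.exp(S) / np.exp(S).sum(axis=0, keepdims=True)
    # Readout: weighted sum of memory values -> (Cv, H*W).
    return v_mem @ W

rng = np.random.default_rng(0)
readout = read_memory(rng.normal(size=(64, 3 * 20)),   # T=3 stored frames
                      rng.normal(size=(512, 3 * 20)),
                      rng.normal(size=(64, 20)))
```

Each query location thus attends over all stored memory locations and pulls back a convex combination of the stored mask features.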

Memory Management
As we update the memory for each frame of the video input, the memory size gradually increases with the number of frames. If we do not manage the memory, it will soon grow out of bounds, especially when training the model with long videos. Following Cheng and Schwing (2022), we divide the memory into different segments and start to clean up the oldest saved masks when the memory reaches its limit. In addition, since wind turbines rotate periodically, blade segmentation does not require a large memory.
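A minimal sketch of this write-then-prune bookkeeping (illustrative only; the segment-based cleanup of Cheng and Schwing (2022) is more elaborate than the simple FIFO cap shown here):

```python
import numpy as np

class BladeMemory:
    """Per-modality key-value memory with a hard frame cap (FIFO pruning)."""
    def __init__(self, max_frames=150):
        self.max_frames = max_frames
        self.keys, self.values = [], []   # one (Ck, HW) / (Cv, HW) entry per frame

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.max_frames:   # memory full:
            self.keys.pop(0)                   # drop the oldest saved mask
            self.values.pop(0)

    def as_arrays(self):
        # Concatenate along the spatial axis -> (Ck, T*HW) and (Cv, T*HW).
        return np.concatenate(self.keys, axis=1), np.concatenate(self.values, axis=1)

mem = BladeMemory(max_frames=2)
for _ in range(5):
    mem.write(np.zeros((64, 20)), np.zeros((512, 20)))
k, v = mem.as_arrays()   # only the 2 most recent frames survive
```

Because the blade motion is periodic, a modest cap (150 frames in the ablation of Section 3) already covers a full rotation.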

Memory design for utilizing temporal complementarity and multimodal complementarity
With the designed memory, AQUADA-Seg can utilize temporal complementarity when segmenting new frames. Specifically, AQUADA-Seg saves history masks in memory and reads these masks when segmenting the current frame. Because the shape of blade segmentations does not change much over time, historical segmentation information is of great value for segmenting the current frame. Moreover, attention-based memory reading helps AQUADA-Seg find the most useful information for segmenting the current frame. Thus, AQUADA-Seg utilizes temporal complementarity when segmenting new frames.
AQUADA-Seg utilizes multimodal complementarity by sharing complementary information between the optical and thermal modalities via the memory. As described above, AQUADA-Seg inputs the information read from memory into the decoder of the current modality to help with its segmentation. Inspired by Jia et al. (2023), we made AQUADA-Seg also share this information with the other modality (see the blue lines in Fig. 4). Since the optical and thermal videos are almost synchronized, information read from the optical memory, which is most useful for segmenting the current optical frame, is also of great help for segmenting the current thermal frame, and vice versa. Thus, when a modality fails (i.e., its encoder and memory fail to provide useful information for its segmentation), it can still utilize complementary information obtained from the other modality to assist its segmentation. With this cross-modal sharing of complementary information, AQUADA-Seg utilizes multimodal complementarity in blade segmentation.
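The sharing described above can be sketched as follows (a hypothetical wiring: the paper fuses the inputs with group convolutions, while this sketch simply concatenates the three decoder inputs channel-wise to show what each decoder consumes):

```python
import numpy as np

def decode_inputs(feat, own_readout, other_readout):
    """Assemble the three decoder inputs by channel-wise concatenation:
    the encoder representation, this modality's memory readout, and the
    readout shared from the other modality's memory."""
    return np.concatenate([feat, own_readout, other_readout], axis=0)

feat = np.zeros((1024, 20))       # image representation from the encoder
opt_read = np.zeros((512, 20))    # readout from this modality's memory
thm_read = np.zeros((512, 20))    # shared readout from the other modality
fused = decode_inputs(feat, opt_read, thm_read)   # (2048, 20)
```

The key point is that the cross-modal readout is a third input stream, so a failing modality still receives usable mask evidence from its counterpart.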

Loss function
Following previous semantic segmentation studies (Cheng et al., 2021; Cheng and Schwing, 2022; Wang et al., 2022), we employ binary cross-entropy loss (BCE loss) and Dice loss (Milletari et al., 2016) to train AQUADA-Seg. The loss function of AQUADA-Seg can be written as:

$$\mathcal{L} = \mathcal{L}^{\text{Thermal}} + \mathcal{L}^{\text{Optical}},$$

$$\mathcal{L}^{\text{Thermal}} = \lambda\,\mathcal{L}^{\text{Thermal}}_{\text{BCE}} + (1-\lambda)\,\mathcal{L}^{\text{Thermal}}_{\text{Dice}},$$

$$\mathcal{L}^{\text{Thermal}}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr],$$

$$\mathcal{L}^{\text{Thermal}}_{\text{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_i\hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N}\hat{y}_i},$$

where $\lambda$ is a trade-off parameter, $N$ is the number of pixels in a thermal frame, $y_i$ is the binary label of the $i$th pixel of this frame, and $\hat{y}_i$ is the model's prediction for the same pixel. Since $\mathcal{L}^{\text{Optical}}$ is defined analogously to $\mathcal{L}^{\text{Thermal}}$, we do not repeat it here.
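As an illustrative NumPy sketch of this per-modality loss (the value of the trade-off parameter below is an assumption; the paper does not report this exact code):

```python
import numpy as np

def bce_dice_loss(y_hat, y, lam=0.5, eps=1e-7):
    """BCE + Dice loss for one modality; lam is the trade-off parameter."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dice = 1.0 - 2.0 * np.sum(y * y_hat) / (np.sum(y) + np.sum(y_hat) + eps)
    return lam * bce + (1 - lam) * dice

y = np.array([1.0, 1.0, 0.0, 0.0])                     # binary pixel labels
perfect = bce_dice_loss(np.array([1.0, 1.0, 0.0, 0.0]), y)   # near zero
poor = bce_dice_loss(np.array([0.1, 0.2, 0.9, 0.8]), y)      # much larger
```

Combining BCE with Dice is common for thin structures such as blade tips, since Dice directly penalizes overlap errors that pixel-wise BCE under-weights.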

Optical-thermal wind turbine blade video dataset
To train AQUADA-Seg, we collected a large-scale optical-thermal wind turbine blade video dataset. Moreover, we make it publicly available to facilitate future studies; it can be accessed here.¹ This dataset contains 100 optical-thermal video pairs and over 55,000 images from 22 different wind turbines. The videos were collected from each turbine at different times and under different environmental conditions. We only published the data collected from the DTU Vestas V52 wind turbine, i.e., 36 optical-thermal video pairs and 20,778 images. The data from the other 21 commercial turbines are not published due to confidentiality. Table 1 tabulates the information of this dataset. Table 2 compares some existing datasets (Zampokas et al., 2022; Pérez-Gonzalo et al., 2023; Wang et al., 2022; Yu et al., 2023) that can be used for blade segmentation. To the best of our knowledge, our dataset is the largest wind turbine blade dataset to date. All blade videos were taken with a DJI Zenmuse H20T² or a DJI Mavic 2 Enterprise Advanced³ while the wind turbines were in normal operation. We fixed the frame frequencies of the optical and thermal cameras to 30 FPS. The fusion color palette was chosen for the thermal cameras. We first fly the drone to a position where the horizontal distance from the hub nose is 12 ± 4 m and the vertical distance is 2 m (see Fig. 5). We tilt the camera up 15 degrees to avoid capturing the thermal source from the nacelle. Then, we take optical and thermal videos in pairs. For long blades, we film them horizontally or vertically in several segments. The interval between the filming positions of different segments is about 5 m. We take videos from both sides of the blades, i.e., from both the upwind and downwind directions. In addition, to increase the diversity of the data and improve the robustness of the model, we also take various videos from different angles and distances. Fig. 6 demonstrates some optical-thermal images in this dataset and their segmentation masks.

Metrics
We use two commonly used segmentation metrics, MIoU and MPA, to compare all the results. Mean Intersection over Union (MIoU) is a common metric for semantic segmentation. It computes the overlap ratio between the ground truth and the model's prediction. MIoU is defined as follows:

$$\mathrm{MIoU} = \frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k},$$

where $K$ is the number of classes, TP is true positive, FP is false positive, and FN is false negative. Mean Pixel Accuracy (MPA) is also a popular segmentation metric, which computes the mean of the correctly predicted pixel ratio over the different classes. MPA is defined as follows:

$$\mathrm{MPA} = \frac{1}{K}\sum_{k=1}^{K}\frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}.$$
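For a binary blade/background task ($K = 2$), these metrics can be computed from per-class confusion counts as follows (an illustrative sketch, not the authors' evaluation code):

```python
import numpy as np

def miou_mpa(pred, gt, num_classes=2):
    """Compute MIoU and MPA from per-class TP/FP/FN counts."""
    ious, accs = [], []
    for k in range(num_classes):
        tp = np.sum((pred == k) & (gt == k))
        fp = np.sum((pred == k) & (gt != k))
        fn = np.sum((pred != k) & (gt == k))
        ious.append(tp / (tp + fp + fn))   # per-class IoU
        accs.append(tp / (tp + fn))        # per-class pixel accuracy
    return np.mean(ious), np.mean(accs)

gt   = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])           # 1 = blade, 0 = background
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])           # one blade pixel missed
miou, mpa = miou_mpa(pred, gt)            # 0.775, 0.9
```

Note that MIoU penalizes both missed blade pixels and false blade pixels, whereas MPA only measures per-class recall.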

Settings
Compared Methods: To the best of our knowledge, we are the first to segment wind turbine blades from videos, and there are no existing wind turbine blade video segmentation methods. Hence, we compare AQUADA-Seg with two state-of-the-art blade image segmentation methods: Improved-UNet-Thermal (IUNet-T) (Yu et al., 2023) and Improved-UNet-Optical (IUNet-O) (Wang et al., 2022). Table 3 gives an overview of these two state-of-the-art methods and AQUADA-Seg.
Because these methods work on either optical or thermal data, we train them with data from only one modality. Because these methods only work on images, we test them on all video frames and take the average as their result on a video. Following Wang et al. (2022), 10% of the training data is used as the validation set for IUNet-O.
Data Preprocessing: AQUADA-Seg shares complementary information between the different modalities. To unify the shape of the shared information across modalities and to reduce the computational burden, we first resize the frames of the optical and thermal videos to 852 × 480. Then, we conduct data augmentation, including random rotation, random crop, random horizontal flip, and random color jitter.

Training Details: Following previous work (Cheng et al., 2021; Cheng and Schwing, 2022), we train AQUADA-Seg in different stages. In the first stage, we train the model with static images. In the second stage, we mix videos from the DAVIS video segmentation dataset (Perazzi et al., 2016; Pont-Tuset et al., 2017) and our collected dataset to train the model. At this stage, since the DAVIS dataset only contains optical data, we make a copy of the optical data to serve as thermal data. In the third stage, we train the model with our collected optical-thermal wind turbine blade video data. These three stages were iterated 5k, 8k, and 8k times, respectively. The model is implemented in PyTorch (v1.13.0) and optimized by AdamW with an initial learning rate of 1 × 10⁻⁵. Besides, MultiStepLR is employed to adjust the learning rate. We train the model on a computer provided by the Technical University of Denmark (DTU) Computing Centre. This computer is equipped with two 32-core Intel Xeon Gold 6226R CPUs, 756 GB of memory, and two NVIDIA A100 (40 GB) GPUs. The entire training takes approximately 52 h.
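The resize-then-augment step can be sketched as follows (an illustrative NumPy sketch; the crop size and the nearest-neighbour resize are assumptions, and the paper's pipeline additionally applies random rotation and color jitter):

```python
import numpy as np

rng = np.random.default_rng(42)

def resize_nn(frame, out_h=480, out_w=852):
    """Nearest-neighbour resize to the shared 852 x 480 working resolution."""
    h, w = frame.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows][:, cols]

def augment(frame, crop=432):
    """Random horizontal flip followed by a random square crop."""
    if rng.random() < 0.5:
        frame = frame[:, ::-1]
    h, w = frame.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return frame[top:top + crop, left:left + crop]

optical = rng.random((1080, 1920, 3))   # raw optical frame (H x W x C)
thermal = rng.random((512, 640))        # raw thermal frame (H x W)
opt_small = resize_nn(optical)          # (480, 852, 3)
thm_small = resize_nn(thermal)          # (480, 852)
patch = augment(opt_small)              # (432, 432, 3)
```

Resizing both modalities to the same grid is what lets the cross-modal readouts be shared without any further spatial alignment.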

Results and discussion
In the test phase, we tested the model on a single GPU. To reduce the impact of hardware on the results, we ran the test 10 times and recorded the average.
Across the 10 tests, the average peak GPU memory allocated by AQUADA-Seg was 1584 MB. The average test FPS of AQUADA-Seg is 26.75, showing that it achieves near real-time wind turbine blade segmentation without stopping turbines. Notably, AQUADA-Seg segments RGB and thermal videos simultaneously. This new capability opens vast opportunities for real-world applications. For example, AQUADA-Seg provides at least the following three possibilities if it is applied to blade damage detection: (i) Unlike previous methods that either detect surface damages based on optical data or detect underneath damages based on thermal data, detecting both types of damage simultaneously is now possible. (ii) Unlike previous image-based blade damage detection methods, which can only obtain the damage status at a certain moment, AQUADA-Seg enables the detection of and intervention against blade damage in near real time, thus avoiding significant property loss. (iii) Obtaining detailed damage progression in normally operating wind turbines becomes possible. By analyzing the damage progression, blade researchers not only can get a better understanding of damages but also gain clues for blade structure design.
To intuitively compare the performance of these methods, we conducted a case study. Specifically, we selected some cases with simple or complex backgrounds from the test set and compared the segmentations of these methods. Table 6 compares the results. From Table 6 we can see that:

• Both the relevant methods and AQUADA-Seg are capable of segmenting simple cases. The backgrounds of these cases are simple, with only sky in the background and relatively few clouds, which makes the boundaries between blades and backgrounds clear.
• For complex cases, however, AQUADA-Seg clearly outperforms the relevant methods. The backgrounds of these cases are complex, with either a landscape (e.g., column 2, row 7 and column 5, row 6) or a dense layer of clouds (e.g., column 2, row 5 and column 5, row 7). Blades and backgrounds are mixed together, and the boundary between them is difficult to distinguish. Single-modal state-of-the-art methods fail to segment these cases. However, AQUADA-Seg achieves remarkable results even in these complex cases due to multimodal complementarity.
• Our dataset contains numerous complex cases, which are closer to real-world field situations.

Table 6. Comparison of segmentations from AQUADA-Seg and state-of-the-art methods. All these methods are capable of handling cases where backgrounds are simple and boundaries between blades and backgrounds are clear. In cases where the background is complex and blades and background are mixed, AQUADA-Seg clearly outperforms state-of-the-art methods.
We designed software with a user-friendly GUI as shown in the video. 4

Multimodal vs. single-modal
The first big difference between AQUADA-Seg and existing blade segmentation methods is that AQUADA-Seg takes multimodal data as input while existing methods take single-modal data as input. Here, we investigate the effectiveness of multimodal data on blade segmentation with experiments.

Settings
In this experiment, we compare the following three methods:

• This study, the AQUADA-Seg method.
• Thermal-only, which is implemented by removing the optical parts from AQUADA-Seg.
• Optical-only, which is implemented by removing the thermal parts from AQUADA-Seg.
For AQUADA-Seg, we use the same experimental setting as in Section 3.2. For Thermal-only and Optical-only, we also train them in 3 stages with the same numbers of iterations used for AQUADA-Seg. But in the third stage, we train Thermal-only only with thermal data and Optical-only only with optical data. Other settings stay unchanged.

Results and discussion
Tables 7 and 8 compare the results of these methods in terms of MIoU and MPA. From these results, we can see that: (i) The model trained with multimodal data outperforms the models trained with single-modal data. This confirms that using multimodal data can improve blade segmentation performance. (ii) Among the methods trained with single-modal data, the model trained with optical data outperforms the one trained with thermal data. This may be because, unlike thermal data, which can only provide temperature information, optical data provide richer information, such as color and texture; the thermal modality is therefore more likely to fail than the optical modality. (iii) If one mainly focuses on thermal blade segmentation, introducing the optical modality and utilizing the complementarity of multimodal data can significantly improve the segmentation performance.
To intuitively investigate the effectiveness of multimodal complementarity, we conducted a case study. Table 9 compares the results from models trained with single-modal and multimodal data. From Table 9 we can see that, since the information provided by a single modality is limited, it is inevitable that single-modal methods fail, for example in cases 1, 3, and 8. In these cases, blades and backgrounds are mixed together. It is too difficult for models trained with single-modal data to handle these cases (see the third column in Table 9). AQUADA-Seg takes both thermal and optical modalities as inputs and can handle these cases by utilizing the complementarity between the two modalities. Thus, better segmentation performance is achieved.

Videos vs. images
The second big difference between AQUADA-Seg and existing blade segmentation methods is that AQUADA-Seg takes videos as input while existing methods take images as input. Here, we investigate the effectiveness of temporal complementarity on blade segmentation by comparing the results from models with access to different amounts of temporal information.

Settings
We investigate the effectiveness of temporal complementarity by controlling the temporal information that can be utilized by the model. AQUADA-Seg saves history segmentation information in memory and updates the memory for every segmented frame. Therefore, we can control the maximum number of history frames that AQUADA-Seg can access by controlling the memory size, thus simulating situations where the model obtains different amounts of temporal information. Here, we compare the performance of AQUADA-Seg with access to different numbers of frames: 0, 25, 75, 100, 150, 200, 250, and 300. MIoU is selected as the evaluation metric. Other settings stay unchanged from Section 3.2.

Results and discussion
Fig. 7 illustrates the performance of AQUADA-Seg with access to different numbers of history video frames. From Fig. 7, we can see that when the number of frames AQUADA-Seg can access is less than 150, the performance of AQUADA-Seg gradually improves with the growth of the memory. When this number exceeds 150, the performance of the model gradually stabilizes. This verifies that we can indeed improve blade segmentation performance by utilizing temporal complementarity with a designed memory. In addition, since wind turbine blade videos are periodic, there is an upper bound to the segmentation performance improvement achievable by increasing the memory size. The recommended memory size for our case is 150 frames.

AQUADA-Seg's robustness against noisy input
Although deep-learning-based methods achieve state-of-the-art performance in various real-world tasks, their robustness against input noise remains an active research topic, because noisy input may cause dramatic performance degradation and thereby lead to disasters in real-world applications (Maulik et al., 2020; Li et al., 2020). In this subsection, we investigate the robustness of AQUADA-Seg against noisy input.

Settings
In this experiment, we investigate the relationship between the performance of AQUADA-Seg and the magnitude of input noise. We first randomly replace 1%, 2%, 3%, 4%, and 5% of the optical videos in the test set with noisy input, and then observe AQUADA-Seg's performance. We construct the noisy input by weighted stacking of frames from different time instants of the same video. Specifically, for a randomly selected test optical video, we first randomly select an integer offset between 3 and 10. Then, starting from Frame 1, we stack Frame 1 and the frame that many frames later with weights of 0.8 and 0.2, respectively. Fig. 8 shows a frame of the noisy input.
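The noisy-input construction can be sketched with NumPy. The weights 0.8/0.2 and the offset range 3–10 follow the text above; the toy video array and the function name are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_noisy_frame(frame_t: np.ndarray, frame_t_plus: np.ndarray) -> np.ndarray:
    """Blend two frames of the same video with weights 0.8 and 0.2."""
    blended = 0.8 * frame_t.astype(np.float64) + 0.2 * frame_t_plus.astype(np.float64)
    return blended.astype(frame_t.dtype)


# Toy "video": 20 grayscale frames of 8x8 pixels.
video = rng.integers(0, 256, size=(20, 8, 8), dtype=np.uint8)

# Random integer offset in [3, 10], as described in the experiment.
offset = int(rng.integers(3, 11))
noisy = make_noisy_frame(video[0], video[offset])
print(noisy.shape, noisy.dtype)
```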

Results and discussion
Table 10 shows the optical segmentation performance of AQUADA-Seg under different magnitudes of input noise. From Table 10 we can see that, as the magnitude of input noise increases, AQUADA-Seg's performance drops only slightly. This may be because, when the optical modality is affected by a small amount of noise, the model can use the information from the thermal modality to assist its segmentation. Hence, we conclude that AQUADA-Seg is highly robust when a single modality is affected by noisy input.

Conclusion and future work
In this paper, we propose AQUADA-Seg, an AI-based encoder–decoder method that achieves near real-time optical-thermal wind turbine blade video segmentation. AQUADA-Seg fuses optical and thermal videos captured from normally operating wind turbines and improves blade segmentation performance by exploiting temporal and multimodal complementarity with a tailored memory. It utilizes temporal complementarity by storing history segmentations in the memory and reading them when segmenting new frames, and multimodal complementarity by sharing complementary segmentation information between the two modalities via the memory. Experimental results on a large-scale optical-thermal video dataset show that AQUADA-Seg considerably outperforms state-of-the-art optical or thermal blade segmentation methods, particularly when complex backgrounds are present in real-world applications.
For neural-network-based methods, the reliability of the generated results is important, especially in real-world applications where large property losses and personal casualties may occur. In the future, we intend to extend AQUADA-Seg by introducing probabilistic neural networks, enabling the method to output a prediction distribution rather than only the best prediction and thus to assess the uncertainty of its output. Another future study is near real-time wind turbine blade damage detection by utilizing multimodal complementarity.

CRediT authorship contribution statement
Xiaodong Jia: Developed the method, implemented the code, wrote the original manuscript. Xiao Chen: Generated the research ideas, designed and supervised the study, reviewed and revised the manuscript, and acquired the funding.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 5 .
Fig. 5. Drone-based optical-thermal blade video data acquisition when the wind turbine is in normal operation.

Fig. 6 .
Fig. 6. Some optical-thermal images in our dataset and their segmentation masks. These images vary considerably in filming distance, background, filming angle, and lighting, indicating that they are close to the images taken in real-world applications.

Fig. 7 .
Fig. 7. Comparison of the performance of AQUADA-Seg with access to different numbers of history video frames.

Fig. 8 .
Fig. 8. A frame of noisy input. We construct the noisy input by weighted stacking of the original frame and another frame at a different time instant from the same video.

Table 1
Information on the optical-thermal wind turbine blade video dataset used in this study. We collected videos from each turbine at different times and under different environmental conditions. Due to confidentiality, only part of the data is published, i.e., 36 optical-thermal video pairs and 20,778 images, available at https://aquada-go.github.io/.

Table 3
Overview of AQUADA-Seg and two relevant state-of-the-art methods that only work on single-modal data.

Table 4
Comparison of MIoU between different methods (higher is better, N.A. for not applicable).

Table 5
Comparison of MPA between different methods (higher is better, N.A. for not applicable).

Table 7
Comparison of contribution from different modalities in terms of MIoU.

Table 9
Segmentation comparison between single-modal methods and our multimodal method.

Table 10
AQUADA-Seg's performance under different magnitudes of input noise.