Automatic Detection and Identification of Defects by Deep Learning Algorithms from Pulsed Thermography Data

Infrared thermography (IRT) is one of the most interesting techniques to identify different kinds of defects, such as delamination and damage, for the quality management of materials. Object detection and segmentation algorithms based on deep learning have been widely applied in image processing, although very rarely in the IRT field. In this paper, spatial deep-learning image processing methods for defect detection and identification are discussed and investigated. The aim of this work is to integrate such deep-learning (DL) models to enable automatic interpretation of thermal images for quality management (QM). This requires each deep-learning method to reach a high enough accuracy, based on its training, so that it can be used to assist human inspectors. Several alternative deep convolutional neural networks were employed in this work for detection in the images. These included: 1. The instance segmentation methods Mask-RCNN (Mask Region-based Convolutional Neural Networks) and Center-Mask; 2. The independent semantic segmentation methods U-net and Resnet-U-net; 3. The objective localization methods You Only Look Once (YOLO-v3) and Faster Region-based Convolutional Neural Networks (Faster-RCNN). In addition, a regular infrared image segmentation processing combination method (absolute thermal contrast (ATC) with a global threshold) was introduced for comparison. A series of academic samples composed of different materials and containing artificial defects of different shapes and natures (flat-bottom holes, Teflon inserts) were evaluated, and all results were studied to assess the efficacy and performance of the proposed algorithms.


Introduction
Modern industrial production relies heavily on quality management (QM) [1], which is crucial for maintaining high standards in various manufacturing applications, including the aerospace industry. Implementing an efficient QM and control system can provide a significant technological boost to such fields. Structural monitoring is vital for ensuring the health of production lines, and visual inspection systems are increasingly necessary for achieving this goal. However, manual inspection during the quality control stages can be hindered by inspector fatigue, making automatic quality control and defect detection more crucial for improving inspection rates and achieving cost-effective condition monitoring [2].
Among various non-destructive testing (NDT) techniques, infrared thermography (IRT) [3] is used to visualize the temperature distribution on the surface of materials, allowing us to "see the unseen." Infrared Non-Destructive Evaluation (INDE) aims to identify and categorize Regions of Interest (ROIs) as either defects or anomalies through the analysis of thermal data. The main contributions of this work are as follows:
1. A comprehensive and systematic investigation and comparison of three classical deep-learning method families was conducted to analyze the accuracy and efficiency of defect detection using pulsed thermography.
2. An innovative instance segmentation method was introduced to predict the irregular shape of each defect instance in thermal images at the pixel level, enabling efficient defect segmentation and identification for each defect type across different specimens.
3. Experimental modeling and analysis for the post-processing of inspected data based on deep-learning feature extraction techniques have also been introduced.
The structure of this paper is as follows: Section 2 outlines the main principles and methods utilized in this research. Section 3 provides an introduction to pulsed thermography (PT). Section 4 describes the experimental setup, including details on data collection, defective features, and samples. Section 5 presents the spatial deep-learning models used in the investigation, including YOLO-V3 [23], Faster-RCNN [24], U-net [25], Resnet-U-net [26], Mask-RCNN [27], and Center-Mask [28]. Section 6 offers a detailed account of the experimental results and training procedures for each method. Section 7 analyzes the results obtained from the experiments. Finally, Section 8 concludes the research and highlights future work in this area.

Principles
In this section, a detection system trained with pulsed thermography data is proposed to segment and identify defects in thermal images. Each spatial deep-learning model is introduced separately and comparatively in this strategy, as shown in Figure 1. The design of this defect detection system is based on the three types of detection frameworks. The implementation steps can be illustrated as follows:
1. First, the infrared thermal sequences are acquired by the pulsed thermography (PT) system.
2. Secondly, the raw thermal sequences are preprocessed and decomposed by augmentation methods: 1. Principal Component Thermography (PCT), where the sequence is decomposed into several orthogonal functions (Empirical Orthogonal Functions: EOFs); 2. flip; 3. random crop; 4. shift; 5. rotation, etc.
3. In the final step, the defect regions are recognized via deep neural networks, which visualize the defects with bounding boxes. All defects must be labeled with their locations, then trained with the deep region neural network.

Thermography Consideration-Optical Pulsed Thermography
In PT [29,30], a high-power thermal pulse is applied to the surface of the specimen through heat radiation. Due to the heat conduction of the thermal front absorbed by the specimen's surface, the thermal front travels from the surface and propagates through the material. As time elapses, the surface temperature decreases uniformly over a zone without defects. Conversely, if there is an internal defect beneath the surface (e.g., delamination, disbonds, damage, etc.), this defect acts as a resistance to heat flow that produces higher temperature patterns at the surface during the temperature decay, which can be inspected by an infrared (IR) camera. Figure 2 indicates the fundamental principle of pulsed thermography.
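The absolute thermal contrast (ATC) baseline mentioned in the abstract builds directly on this principle: a defect shows up as a temperature difference against a defect-free reference. The sketch below is a minimal illustration, assuming a known sound (defect-free) reference area; the function name and toy values are illustrative, not taken from the paper.

```python
# Minimal sketch of absolute thermal contrast (ATC), assuming a known
# sound (defect-free) reference area: ATC(t) = T_def(t) - T_sound(t).
import numpy as np

def absolute_thermal_contrast(frames, sound_mask):
    """frames: (Nt, H, W) surface temperatures; sound_mask: (H, W) bool.
    Returns per-pixel contrast against the mean sound-area temperature."""
    sound = frames[:, sound_mask].mean(axis=1)          # (Nt,) reference decay
    return frames - sound[:, None, None]                # (Nt, H, W) contrast

# Toy sequence: a 4x4 plate cooling uniformly, with one 'defect' pixel
# that stays 0.5 K warmer than its surroundings.
t = np.arange(1, 6, dtype=float)
frames = 300.0 + 10.0 / np.sqrt(t)[:, None, None] * np.ones((5, 4, 4))
frames[:, 1, 1] += 0.5                                  # defective pixel
sound = np.ones((4, 4), dtype=bool)
sound[1, 1] = False
atc = absolute_thermal_contrast(frames, sound)
print(atc[0, 1, 1])   # ~0.5 K over the defect
print(atc[0, 0, 0])   # ~0.0 K over sound material
```

A global threshold on the ATC map then yields the regular (non-DL) segmentation used for comparison in the paper.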

Sensors 2023, 23, 4444
In a semi-infinite isotropic solid, the 1D solution for the propagation of a Dirac heat pulse is given by the Fourier equation in Equation (1), where Q [J/m²] is the energy absorbed by the surface and T0 [K] is the initial temperature. The surface temperature evolution T(0, t) can be written as:

T(0, t) = T0 + Q / (e √(π t))    (1)

where e = √(kρc) is the thermal effusivity. Following a Dirac heat pulse, the surface temperature decays monotonically as t^(−1/2) over defect-free areas, while defective areas diverge more or less from this behavior depending on the actual thermo-physical properties of the region.

Specimens and Experimental Setting Up

Experiment Setup
Infrared measurement and the inspection system are the essential parts of collecting infrared data from pulsed thermography. To evaluate the robustness of the proposed algorithms, a certain number of samples were tested. In general, the inspection system used in this experiment consists of two photographic flash lamps (Balcar FX 60, 5 ms thermal pulse, 6.4 kJ/flash), an infrared thermal camera, and a personal computer (PC, Ubuntu 14.04), as shown in Figure 3. In more detail, the sampling rate was 157 Hz, and a total of eight specimens of three material types (steel, CFRP, plexiglass) were inspected.

The analysis of the thermography process was conducted on the PC (Intel(R) Core(TM) i7-2600 CPU, 3.40 GHz, 16.0 GB RAM, 64-bit operating system), and the thermal data were processed using MATLAB R2019a and the TensorFlow deep-learning open-source library. A mid-wave infrared (MWIR) camera with a special mid-infrared lens (to filter the MWIR spectrum) and two normal lamps were utilized for collecting the infrared data. The normal lamp (covering the entire visible spectrum) was used as an illumination source to illuminate the specimen during the inspection performed inside the laboratory.

Validation Samples Preparation
To evaluate the performance of the proposed method, academic samples were collected independently from three types of materials: plexiglas (Plexi), carbon fiber-reinforced polymer (CFRP), and steel. All the experiments with DL models were conducted under the databases collected from these samples.
As shown in Table 1, the eight validation samples used in this work are described. The aspect ratio (size/depth) of all trained and validated samples is designed within (0, 60) to reveal whether the detection model performs flexibly when detecting defects. The eight validated specimens can be described as follows:
1. The first sample (a) is made of plexiglass, with 25 sub-surface circular defects of different diameters and depths.
2. The second sample (b) has eight multiple-angle defects embedded in the surface of the plexiglass specimen.
3. The third sample (c) is made of plexiglass, with 25 sub-surface circular defects of the same diameter but different depths, increasing (deeper) from the left to the right column.
4. The fourth sample (d), plexiglass, has 25 circular and quadrilateral defects of various depths and sizes.
5. The fifth sample (e) is a steel sample with circular defects of three different diameters, becoming shallower from top to bottom.
6. The sixth sample (f), CFRP, has 25 triangular defects embedded in the specimen in the form of a folding plane.
7. The seventh sample (g), CFRP, has 25 triangular defects embedded in the specimen in the form of a flat plane.
8. The eighth sample (h), CFRP, has 25 triangular defects embedded in the specimen in the form of a curved plane.

Acquisition of the Training Database
To maximize the probability of detection, we independently sampled a total of 4000 thermal images from the pulsed thermography experiments on the three types of materials (plexiglass, carbon fiber-reinforced polymer (CFRP), and steel) to build a training and testing database. As the images used for training must share the same size, every image in the database was set to 512 × 640 pixels.
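Forcing every frame to a common 512 × 640 size can be sketched as below; the exact resizing used in the paper is not specified, so the center-crop-plus-pad strategy and the function name here are assumptions.

```python
# Sketch of forcing every frame to the 512 x 640 training size by center
# cropping (larger frames) and zero padding (smaller ones); the exact
# resizing used in the paper is not specified, so this is an assumption.
import numpy as np

def to_training_size(img, out_h=512, out_w=640):
    h, w = img.shape
    # center-crop any excess
    top = max(0, (h - out_h) // 2); left = max(0, (w - out_w) // 2)
    img = img[top:top + out_h, left:left + out_w]
    # zero-pad any shortfall
    ph, pw = out_h - img.shape[0], out_w - img.shape[1]
    return np.pad(img, ((ph // 2, ph - ph // 2), (pw // 2, pw - pw // 2)))

print(to_training_size(np.zeros((600, 800))).shape)  # (512, 640)
print(to_training_size(np.zeros((480, 600))).shape)  # (512, 640)
```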

Calibration of the Data
The marking process was conducted with two labeling tools, chosen according to the model type: the Colabeler toolkit (YOLO-V3, Faster-RCNN) and the LabelMe 2.5 toolkit (Mask-RCNN, Center-Mask, U-net, Res-U-net).
Each representative image file from the four types of samples was extracted from the .sfmov sequence files or raw matrix files. These samples contributed multiple defect shapes to the database, such as squares and rectangles.
In the Colabeler toolkit, only one label (a square-shape label) was used for all of the different kinds of marks. The bounding boxes were drawn by hand for each image and then exported to a .xml file by Colabeler. Each bounded defect was used for training the algorithm. The process had to be repeated for all training images.
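The kind of per-image .xml annotation produced by such a bounding-box tool can be sketched with the standard library; the Pascal VOC-style layout, the file name, and the single "defect" label below are assumptions for illustration, not the exact Colabeler schema.

```python
# Sketch of a Pascal VOC-style .xml annotation exported per image
# (the exact Colabeler schema is assumed here, not taken from the paper).
import xml.etree.ElementTree as ET

def boxes_to_voc_xml(filename, boxes, label="defect"):
    """boxes: list of (xmin, ymin, xmax, ymax) hand-drawn bounding boxes."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    for xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label       # single square-shape label
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"),
                            (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(root, encoding="unicode")

xml_str = boxes_to_voc_xml("frame_001.png", [(40, 60, 90, 110)])
root = ET.fromstring(xml_str)
print(root.find("./object/bndbox/xmin").text)  # 40
```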
In the LabelMe toolkit, a labeling curve is drawn around each object in the images regardless of the shape of the defects, so arbitrary outlines can be segmented; each labeled image is then exported to a .json file by LabelMe and converted into the large-scale object segmentation (COCO) format. The labeling procedure is explicitly depicted in Figure 4a-c, providing a comprehensive representation of the precise steps involved in processing the data.
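Converting one such labeled polygon into a COCO-style annotation entry can be sketched as follows; the field names follow the public COCO format, while the IDs and helper name are illustrative.

```python
# Sketch of converting one LabelMe-style polygon into a COCO-style
# annotation entry (field names follow the COCO format; IDs are illustrative).
def polygon_to_coco(points, image_id=1, ann_id=1, category_id=1):
    xs = [p[0] for p in points]; ys = [p[1] for p in points]
    # polygon area via the shoelace formula
    area = 0.5 * abs(sum(xs[i] * ys[(i + 1) % len(points)]
                         - xs[(i + 1) % len(points)] * ys[i]
                         for i in range(len(points))))
    return {
        "id": ann_id, "image_id": image_id, "category_id": category_id,
        "segmentation": [[c for p in points for c in p]],  # flattened x,y pairs
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "area": area, "iscrowd": 0,
    }

ann = polygon_to_coco([(10, 10), (30, 10), (30, 20), (10, 20)])
print(ann["area"], ann["bbox"])  # 200.0 [10, 10, 20, 10]
```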

Preprocessing and Data Augmentation
Data augmentation plays a significant role in mitigating overfitting during training. We encourage the model to learn invariance to transformations by rotating and flipping the raw images. Since the defects in these materials have fixed positions and shapes, images must be captured under diverse conditions. In addition, defects are often not clearly visible because of the shaping process and/or the material specifications, which leads to images captured against a cluttered background. For these reasons, augmenting the captured images before feeding them into the deep-learning network is important. Partial images for training are produced in a preprocessing stage.
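The geometric augmentations named above can be sketched with NumPy as below; the crop size, shift range, and probabilities are illustrative, and in a real pipeline the same transforms would be applied to the image and its label mask together.

```python
# Minimal sketch of the geometric augmentations named above (flip, rotation,
# shift, random crop); parameters are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    if rng.random() < 0.5:
        img = np.fliplr(img)                        # horizontal flip
    img = np.rot90(img, k=rng.integers(0, 4))       # 0/90/180/270 rotation
    img = np.roll(img, rng.integers(-5, 6), axis=1) # small horizontal shift
    top = rng.integers(0, img.shape[0] - 63)        # random 64x64 crop
    left = rng.integers(0, img.shape[1] - 63)
    return img[top:top + 64, left:left + 64]

out = augment(np.zeros((128, 128)))
print(out.shape)  # (64, 64)
```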
We preprocessed the sequence images with feature extraction methods, including Principal Component Thermography (PCT), which extracts meaningful features through dimensionality reduction and reflects the underlying structure of the data. For example, when the data are high-dimensional (leading to sparse and unstable estimation), PCT reduces redundancy and provides the classifier with a compact representation that enables better decisions.
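The PCT decomposition described above can be sketched via the singular value decomposition: the thermal sequence is flattened to a time-by-pixels matrix, standardized, and decomposed, and the leading right singular vectors reshaped to images are the Empirical Orthogonal Functions (EOFs). The standardization choice and component count below are illustrative.

```python
# PCT sketch: the thermal sequence (Nt, H, W) is flattened to a (Nt, H*W)
# matrix, standardized, and decomposed by SVD; the right singular vectors,
# reshaped to (H, W), are the Empirical Orthogonal Functions (EOFs).
import numpy as np

def pct(frames, n_components=3):
    nt, h, w = frames.shape
    A = frames.reshape(nt, h * w)
    A = (A - A.mean(axis=0)) / (A.std(axis=0) + 1e-8)     # standardize per pixel
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:n_components].reshape(n_components, h, w)  # EOF images

rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 16, 20))                    # toy sequence
eofs = pct(frames)
print(eofs.shape)  # (3, 16, 20)
```

The EOFs are mutually orthonormal by construction, which is what makes the decomposition compact and non-redundant.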

Methodologies: Defect Detection Methods by Deep Learning Algorithms
As shown in Figure 5, the investigated deep-learning defect detection methods fall into three groups: objective localization, semantic segmentation, and instance segmentation.

Objective Localization Algorithms
Method 1: Real-time defect localization (YOLO-V3). YOLO-v3 is a supervised deep-learning algorithm with excellent detection capability on both large and small objects, owing to a concatenation that merges features from earlier layers with features from deeper layers; this is especially valuable in infrared non-destructive evaluation with an automatic defect detection task (subsurface defect case).
Processing images with YOLO-v3 is quite fast and simple, allowing defects to be detected and localized directly. To perform feature extraction, residual networks and successive 3 × 3 and 1 × 1 convolutional layers are used in YOLO-v3, as shown in Figure 6. The skip-connection mechanism is achieved through residual networks with multiple residual units [9,10], which were proposed to improve object detection performance and to solve the vanishing gradient issue. In this research, the YOLO-v3-based deep neural network architecture is proposed to detect defects of various sizes. The algorithm involves three steps. First, the pictures are resized to the input size. Then, a single convolutional network is run over each picture. Lastly, the detection results are thresholded based on the model's confidence scores.
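The final thresholding step, together with the standard non-maximum suppression used by YOLO-style detectors, can be sketched as below; the confidence and IoU thresholds are illustrative defaults, not values reported in the paper.

```python
# Sketch of the final YOLO-style step: keep detections above a confidence
# threshold, then apply non-maximum suppression (thresholds are illustrative).
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def filter_detections(dets, conf_thr=0.5, iou_thr=0.45):
    """dets: list of (x1, y1, x2, y2, confidence)."""
    dets = sorted((d for d in dets if d[4] >= conf_thr),
                  key=lambda d: d[4], reverse=True)
    keep = []
    for d in dets:
        # keep a box only if it does not heavily overlap a stronger one
        if all(iou(d, k) < iou_thr for k in keep):
            keep.append(d)
    return keep

dets = [(10, 10, 50, 50, 0.9), (12, 12, 52, 52, 0.8),
        (80, 80, 120, 120, 0.7), (0, 0, 20, 20, 0.3)]
print(len(filter_detections(dets)))  # 2 surviving defect boxes
```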
In Figure 7, an example is shown of an original image (a) and a detected image (b) from the YOLO-V3 network. The CNN was able to distinguish components that have a thermal pattern similar to defects during thermal diffusion, which indicates that the supervised learning method (YOLO-V3) is less influenced by boundary information in the components.
Method 2: Defect localization with Faster-RCNN. Faster-RCNN is a real-time detector that achieved satisfying accuracy in several previous object localization applications in NDT [31]. In 2018, Faster-RCNN was used for crack detection in an eddy current thermography diagnosis system, where a deep neural network addressed accurate crack detection and localization in combination with an unsupervised preprocessing method (Principal Component Analysis).
The deep architecture of Faster-RCNN is composed of several modules (Figure 8):
1. A fully convolutional network of five blocks of basic convolutional layers, ReLU layers, and pooling layers, which extracts features from the input images.
2. A region proposal network (RPN), connected to the fully convolutional network, which produces regions of interest (ROIs).
3. A Fast-RCNN detector that uses the feature regions extracted in (1)-(2) to perform bounding box regression and SoftMax classification.
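The Fast-RCNN detector head relies on RoI max pooling so that every proposed region, whatever its size, yields a fixed-size feature. A minimal sketch of that idea is given below; the grid size and feature values are illustrative.

```python
# Sketch of the RoI max-pooling idea behind the Fast-RCNN detector head:
# each proposed region is divided into a fixed grid of bins and max-pooled,
# so every proposal yields a feature of the same size (grid is illustrative).
import numpy as np

def roi_max_pool(fmap, roi, grid=2):
    """fmap: (H, W) feature map; roi: (x1, y1, x2, y2) in feature coords."""
    x1, y1, x2, y2 = roi
    out = np.empty((grid, grid))
    xs = np.linspace(x1, x2, grid + 1).astype(int)
    ys = np.linspace(y1, y2, grid + 1).astype(int)
    for i in range(grid):
        for j in range(grid):
            out[i, j] = fmap[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36.0).reshape(6, 6)
pooled = roi_max_pool(fmap, (0, 0, 4, 4))
print(pooled.shape)  # (2, 2), regardless of the RoI size
```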
The Faster R-CNN, trained on multiple properties rather than with a regular unsupervised method, was limited with respect to certain properties contained in the defect information. An example image detected by Faster-RCNN, together with the corresponding original thermal image, is shown in Figure 9.

Semantic Defect Segmentation Method
Method 3: Defect segmentation with the U-net network. The U-net is an excellent auto-encoder-style model for handling training data with dimensionality reduction and data augmentation. It is worth evaluating the performance of semantic segmentation by U-net after extracting objective features from the temporal infrared sequence. In a previous article [32], the U-net was employed for the segmentation of wildland and forest fires as a deep-fire convolutional network, obtaining very good performance.
The convolutional architecture of U-net is inspired by the auto-encoder network architecture, as indicated in Figure 10. The contracting path maps the original image to a low-dimensional vector by extracting meaningful feature representations, and the expansive path reconstructs the desired feature maps. The contracting path is composed of a group of convolutional blocks: convolutional layers, rectified linear units (ReLU) [33], and max pooling (dimension reduction). The expansive path includes groups of reconstruction blocks to upsample the features: up-convolutions (which halve the number of feature channels), concatenation with the cropped feature map from the contracting path, and so on.
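The shape bookkeeping of one U-net level (pool down, upsample back, concatenate with the skip feature) can be sketched as below; NumPy stand-ins replace the convolutional blocks, which are omitted, and the channel/spatial sizes are illustrative.

```python
# Shape walk-through of one U-net level: max-pool down, upsample back, and
# concatenate with the skip feature from the contracting path (NumPy stand-ins
# for the convolutional blocks, which are omitted here).
import numpy as np

def max_pool2(x):                      # (C, H, W) -> (C, H/2, W/2)
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2(x):                      # nearest-neighbour 2x upsampling
    return x.repeat(2, axis=1).repeat(2, axis=2)

skip = np.random.rand(64, 32, 32)      # feature kept from contracting path
down = max_pool2(skip)                 # (64, 16, 16) deeper representation
up = upsample2(down)                   # (64, 32, 32) expansive path
merged = np.concatenate([skip, up], axis=0)  # channel concat -> (128, 32, 32)
print(merged.shape)
```

The channel concatenation is the skip connection: fine spatial detail from the contracting path is merged with the coarser, deeper features before the next reconstruction block.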
In the final layer, the feature vectors are classified into the target number of classes by a 1 × 1 convolution.
Moreover, this architecture relies heavily on data augmentation for its performance, as explained below. The data augmentation strategy of the U-net architecture brings a significant benefit to training. Due to the characteristics of the spatial-thermal temperature sequence, the infrared thermal profiles of defect and non-defect pixels can be distinguished based on the labeling, which enforces the supervised learning method (U-net segmentation).
During the cooling period of the thermal data, a temperature change curve over time is obtained for the given image sequence. Therefore, each single thermal frame is fed into this model at the pixel level, and U-net can gradually capture the physical properties of the temperature variation in the thermal image. The input values of U-net are the thermal temporal evolution vectors of each pixel. The output label is set to either 1 or 0, corresponding to defect or non-defect regions. During the validation stage, an acquired thermal sequence is selected as input data after de-backgrounding and normalization. The output is a segmented image reconstructed from the predicted values, as shown in Figure 11b; Figure 11a is the corresponding original thermal image.
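The per-pixel preprocessing described above can be sketched as follows; subtracting a pre-pulse (cold) frame for de-backgrounding and min-max normalizing each pixel's temporal evolution vector are assumptions about the concrete operations, which the paper does not detail.

```python
# Sketch of the per-pixel preprocessing described above: subtract the
# pre-pulse (cold) frame to remove the background, then normalize each
# pixel's temporal evolution vector to [0, 1] (both steps are assumptions).
import numpy as np

def preprocess(frames, cold_frame):
    seq = frames - cold_frame                       # de-background, (Nt, H, W)
    nt = seq.shape[0]
    vecs = seq.reshape(nt, -1).T                    # (H*W, Nt) pixel vectors
    lo = vecs.min(axis=1, keepdims=True)
    hi = vecs.max(axis=1, keepdims=True)
    return (vecs - lo) / (hi - lo + 1e-8)           # each vector scaled to [0, 1]

rng = np.random.default_rng(2)
frames = rng.random((30, 8, 8)) + 5.0               # toy warm sequence
vecs = preprocess(frames, cold_frame=np.full((8, 8), 5.0))
print(vecs.shape)  # (64, 30): one evolution vector per pixel
```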
Method 4: Res-U-net for defect semantic segmentation. It is worthwhile to comparatively evaluate the thermal sequence databases with these different defect segmentation methods. As indicated in Figure 12, Res-U-net is a novel encoder/decoder structure evolved from U-net that combines several structures: residual connections [34], atrous convolutions [35], and pyramid scene parsing pooling [36]. Res-U-net can sequentially infer the boundaries of the objects, the distance transform of the segmentation mask, the segmentation mask itself, and a colored reconstruction of the input.
Since residual blocks in Res-U-net can remove vanishing and exploding gradients [37] to a great extent to improve the implementation efficacy of the learning mode and to achieve the pixel level of the segmenting of defects and classification, Res-U-net was compared with other state-of-the-art DL algorithms. The Res-U-net original was performed on the monotemporal aerial images for the task of semantic segmentation. The framework adapted here for segmenting defects included a Res-U-net framework and a corresponding novel loss function: Dice loss [38]. This reliable framework can perform semantic segmentation, resulting in high-resolution images. To avoid the overfitting, the Res-U-net relied on the data augmentation strategy as well. Each image was rotated to the angle, zoom in/out, flip, and so on to enlarge the datasets of Res-U-net. In Figure 13, a segmented sample from Res-U-net (b) and the corresponding raw images (a) are shown. During the cooling period of the thermal data, a temperature change curve over time is obtained on the given image sequence. Therefore, each single thermal frame is fed into this model at the pixel level, and the thermal image can gradually capture the physical properties of temperature variation by U-net. The input values of U-net are thermal temporal evaluation vectors from each pixel. The output label is set either as 1 or 0 corresponding to the defect or non-defect region. During the validation stage, an obtained thermal sequence is selected as the input data after de-background and normalization. The output is a segmented image reconstructed from the predicted value as shown in Figure 11b. Figure 11a is the corresponding original thermal image. Method 4: Res-U-net for defect semantic segmentation It is worth investigating comparatively to evaluate thermal sequence databases based on these different defect segmentation methods. 
As indicated in Figure 12, Res-U-net is an adapted novel encoder/decoder structure evolved from U-net in combination with several structures: residual connections [34]; atrous convolutions [35]; pyramid scene parsing pooling [36]. Res-U-net can infer sequentially the boundary of the objects, the distance transforms of the segmentation mask, the segmentation mask, and a colored reconstruction of the input. Since residual blocks in Res-U-net can remove vanishing and exploding gradients [37] to a great extent to improve the implementation efficacy of the learning mode and to achieve the pixel level of the segmenting of defects and classification, Res-U-net was compared with other state-of-the-art DL algorithms. The Res-U-net original was performed on the mono-temporal aerial images for the task of semantic segmentation. The framework adapted here for segmenting defects included a Res-U-net framework and a corresponding novel loss function: Dice loss [38]. This reliable framework can perform semantic segmentation, resulting in high-resolution images. To avoid the overfitting, the Res-U-net relied on the data augmentation strategy as well. Each image was rotated to the angle, zoom in/out, flip, and so on to enlarge the datasets of Res-U-net. In Figure 13, a segmented sample from Res-U-net (b) and the corresponding raw images(a) are shown. parsing pooling [36]. Res-U-net can infer sequentially the boundary of the objects, the distance transforms of the segmentation mask, the segmentation mask, and a colored reconstruction of the input. Since residual blocks in Res-U-net can remove vanishing and exploding gradients [37] to a great extent to improve the implementation efficacy of the learning mode and to achieve the pixel level of the segmenting of defects and classification, Res-U-net was compared with other state-of-the-art DL algorithms. The Res-U-net original was performed on the mono-temporal aerial images for the task of semantic segmentation. 
The framework adapted here for segmenting defects included a Res-U-net framework and a corresponding novel loss function: Dice loss [38]. This reliable framework can perform semantic segmentation, resulting in high-resolution images. To avoid the overfitting, the Res-U-net relied on the data augmentation strategy as well. Each image was rotated to the angle, zoom in/out, flip, and so on to enlarge the datasets of Res-U-net. In Figure 13, a segmented sample from Res-U-net (b) and the corresponding raw images(a) are shown.
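Since the adapted framework pairs Res-U-net with the Dice loss [38], a minimal, framework-agnostic sketch of that loss may help; the function name and the smoothing constant are illustrative choices, not taken from the paper:

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Dice loss for a binary segmentation map.

    pred:   array of predicted defect probabilities in [0, 1]
    target: array of 0/1 ground-truth labels (1 = defect pixel)
    The smooth term avoids division by zero on empty masks.
    """
    pred = pred.reshape(-1).astype(float)
    target = target.reshape(-1).astype(float)
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)
    return 1.0 - dice
```

A perfect prediction yields a loss near 0, while a completely missed mask yields a loss near 1, which is why this loss copes well with the small defect-to-background ratio typical of thermal images.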

Instance Defect Segmentation Algorithm
Method 5: Mask-RCNN for defect segmentation
The Mask-RCNN detection procedure can be considered as either an object detection or an object segmentation function. In contrast to semantic segmentation, instance segmentation associates each pixel of an image with an instance label: it predicts a whole segmentation mask for each object and determines which pixels in the input image correspond to each object instance, rather than only predicting a group of bounding boxes, which relaxes the restriction on the position of defects. Mask-RCNN is a classical instance segmentation method extended intuitively from Faster-RCNN; it is an end-to-end trainable model that achieves pixel-to-pixel alignment between the inputs and outputs of a convolutional backbone architecture. ROI Align preserves the spatial orientation of features with no loss of data when extracting them over the entire image. This approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
Each thermal image was fed into the backbone convolutional network of Mask-RCNN, and the learned region proposals obtained from the backbone were projected onto the convolutional feature maps. Mask-RCNN uses ROI Align [39] to warp each feature from the convolutional feature map into the right shape and then passes it to two different branches. As shown in Figure 14, these two branches provide the predicted results.
The top branch (blue line box) outputs a classification score for the categories of the region proposals and a bounding-box coordinate regression. At the bottom (red line box), a segmentation mask is predicted for each region proposal, classifying each pixel of that proposal as object or background. Figure 15 provides an example of an original image from pulsed thermography (a) and a segmented image from Mask-RCNN (b).

Method 6: Center-Mask for defect segmentation
Since Mask-RCNN relies on pre-defined anchors, its detection speed and accuracy are limited. Center-Mask is a simple yet efficient real-time anchor-free instance segmentation method. Structurally, Center-Mask can be regarded as a novel spatial attention-guided mask (SAG-Mask) branch added to the anchor-free one-stage object detector FCOS [40]. A segmentation mask head is placed on each detected box, with a spatial attention map that helps focus attention on informative pixels and suppress noise. Figure 16 shows the overall architecture of Center-Mask: a feature pyramid extractor is combined with the FCOS box head to predict classification scores and bounding-box regressions, and the SAG-Mask branch predicts the segmentation map for the defects from each bounding box using a spatial attention module [41], which concentrates on meaningful pixels and suppresses the influence of noise. Center-Mask achieves faster speed and better accuracy than other state-of-the-art instance segmentation approaches such as Mask-RCNN. In this work, we adapted the Center-Mask network for feature extraction and defect segmentation. The main goal is to precisely detect and analyze defect information from the thermal images; the core strategy of this network is to extract the meaningful thermal pattern from the sequence for each specific defect.
Figure 17 shows a raw thermal image (a) and a corresponding segmented thermal image (b) from Center-Mask. Each defect is precisely localized and segmented by the mask.
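The distinction drawn above between semantic and instance output can be illustrated with a small sketch (the binary defect map and the use of SciPy's connected-component labelling are illustrative, not part of the paper's pipeline): a semantic map only marks defect pixels, while an instance labelling step separates them into individual defects, which is what the Mask-RCNN and Center-Mask mask branches provide directly per detected box.

```python
import numpy as np
from scipy import ndimage

# Hypothetical binary semantic map: 1 = defect pixel, 0 = sound material.
semantic = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
])

# Instance labelling: connected defect regions receive distinct integer ids,
# mimicking the per-object masks an instance-segmentation head outputs.
instances, n_defects = ndimage.label(semantic)
print(n_defects)  # number of separate defects found in the map
```

Here the semantic map alone cannot say how many defects exist; the instance labels can, which is why per-defect statistics (size, position) come more naturally from the instance segmentation methods.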

Regular Infrared Defect Detection Algorithm
Method 7: Absolute thermal contrast (ATC) with global threshold (GT)
In combination with a global threshold (GT) method, the ATC was adapted for the procedure of segmenting defect areas. The key concept of this method, often adopted in infrared image processing, is to compare the grey level of the pixel at image coordinate (x, y) with the average grey level of a sound region of the sample. Equation (3) describes how this method works:

T_atc(x, y) = T_d(x, y) − T_s(x, y)    (3)

where T_atc(x, y) is the grey level at coordinate (x, y) of the ATC image, T_d(x, y) is the average grey level of the group of pixels in the defect region, and T_s(x, y) is the average temperature of a nearby sound region. Figure 18 provides an example of segmentation with this method: (a) the raw image from pulsed thermography; and (b) the corresponding segmented image from Method 7. This method makes it possible to reduce the effect of non-uniform heating and remove some thermal pattern noise.
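The ATC-plus-GT procedure above can be sketched in a few lines; the function name, the way the sound region is supplied as a mask, and the threshold value are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def atc_segment(frame, sound_mask, threshold):
    """Absolute thermal contrast (Eq. (3)) followed by a global threshold.

    frame:      2-D array of grey levels (one thermal image)
    sound_mask: boolean mask selecting a defect-free (sound) region
    threshold:  global grey-level threshold applied to the contrast image
    """
    t_sound = frame[sound_mask].mean()   # average grey level of the sound region
    contrast = frame - t_sound           # absolute thermal contrast image
    return contrast > threshold          # binary defect map

# Toy example: one hot pixel over a uniform sound background.
frame = np.array([[1.0, 1.0, 1.0],
                  [1.0, 5.0, 1.0],
                  [1.0, 1.0, 1.0]])
defects = atc_segment(frame, frame < 2.0, threshold=2.0)
```

Subtracting the sound-region average before thresholding is what suppresses the non-uniform heating offset; a raw global threshold on the frame itself would be far more sensitive to it.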

Training
The training procedure for the deep-learning models was set according to the following principles, with the parameters of the different neural network architectures adjusted based on the PyTorch framework. Training was conducted on a GeForce GTX 1080 Ti and took about 30 min. The configuration was: operating system: Ubuntu 16.04; learning-model framework: Darknet; CPU: i7-7700K; memory: 16 GB; GPU: NVIDIA GeForce GTX 1080 Ti.
For each model's training procedure and hyperparameter setting, we configured the parameters and measured the time needed to process 1000 thermal images with defects purely on the CPU during training, as shown in Table 2. As indicated, Center-Mask ran the fastest and spent the lowest CPU time on training over all the thermal images compared with the other baseline models, whereas the U-net network was the slowest, taking the longest to process the thermal images and extract the defect features. Further, the CPU times show that the objective-localization methods hold an obvious speed advantage over the other models. Training time and procedure can therefore be a crucial factor when industry selects an optimized model for automatic defect detection in infrared non-destructive evaluation. The optimizer parameters (β1, β2) were set to (0.9, 0.09) [44], and the initial learning rate was 0.0005.


Evaluation Metrics
The F-score and the probability of detection (POD) [45] are introduced to analyze the detection capability of each deep-learning model, as expressed in Equations (4)-(7):

Precision = TP / (TP + FP)    (4)
Recall = TP / (TP + FN)    (5)
F_β = (1 + β^2) × Precision × Recall / (β^2 × Precision + Recall)    (6)
POD = TP / (TP + FN)    (7)

Precision is the fraction of the cases the system flags as defective that actually contain defects; it represents how accurate the system is in identifying defects. Recall is the fraction of the cases that actually contain defects that the system correctly recognizes. Both values depend heavily on the confidence score the system is set to. The F-score estimates the detection and segmentation capability of these algorithms, where β weights the balance between precision and recall; since recall is the more influential metric for evaluating performance in this work, β is set to 2. The POD reveals the accuracy of the method in detecting the defects and is always calculated at a specific confidence score. Although the POD has the same mathematical form as the recall, it carries a further interpretation in quantitative studies with NDT inspectors. In this work, we set the confidence threshold score (CTS) at 75% for the POD metric.
Here TP is the number of true positives, FN is the number of false negatives (defects that were not detected), and FP is the number of false positives (regions wrongly detected as defects when they are in fact not defects). The confidence threshold score (CTS) is defined as a standard for measuring the accuracy of detecting the corresponding objects in each dataset; it is a simple measurement standard applicable to any task that yields a prediction range (bounding boxes, segmented maps) in the output with respect to the ground truth.
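Equations (4)-(7) reduce to simple arithmetic on the detection counts; the following sketch (function name and example counts are illustrative) computes them for a given confidence threshold:

```python
def detection_metrics(tp, fp, fn, beta=2.0):
    """Precision, recall, and F-beta score (Eqs. (4)-(7)) from detection counts.

    beta = 2 weights recall more heavily than precision, as in this work;
    the POD shares the recall formula, evaluated at a fixed confidence threshold.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the POD at the chosen threshold
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# Hypothetical counts: 8 defects found, 2 false alarms, 2 defects missed.
p, r, f = detection_metrics(8, 2, 2)
```

With β = 2, a model that misses defects (low recall) is penalized more than one that raises extra false alarms, which matches the stated priority of recall in this evaluation.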

Learning Curves
In Figure 19a-f, each deep-learning model was trained for 1000 epochs. Figure 19a shows the average loss curves of the training and validation process for the Mask-RCNN model. The training-loss curve decreased as the number of iterations increased: the loss dropped significantly during the first 200 batches, then gradually flattened out around 0.225 and remained steady. The validation loss converged to a similar value, indicating that the performance of Mask-RCNN was promising during the training procedure. In Figure 19b, the loss curves of the Center-Mask model follow a similar trend to those of Mask-RCNN, but more smoothly: the training loss likewise decreased stably as the number of iterations increased and then converged around 0.341. Compared with Figure 19a,b, the four other DL models in Figure 19c-f show similar behaviour; their average training-loss curves oscillated more strongly while decreasing over the first 500 epochs before flattening out, and the loss curves for the steel sample stabilized at a value below 0.5 after 500 epochs. As a result, the losses of the six different deep-learning models further indicate an impressive performance throughout the whole procedure (training and validation) when applied to defect segmentation and localization in composite materials.
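The "flattening out" of the curves described above can be quantified with a simple moving-window check on the loss history; this is a generic illustrative heuristic (window size and tolerance are assumed values), not a procedure from the paper:

```python
import numpy as np

def has_converged(losses, window=50, tol=1e-3):
    """Heuristic convergence check on a training-loss history.

    Returns True when the mean loss over the last `window` epochs differs
    from the mean over the preceding window by less than `tol`, i.e. the
    curve has flattened out as in Figure 19.
    """
    if len(losses) < 2 * window:
        return False
    recent = np.mean(losses[-window:])
    previous = np.mean(losses[-2 * window:-window])
    return abs(previous - recent) < tol
```

Such a check is one way to decide automatically when further epochs stop paying off, rather than reading the flattening off the plotted curves by eye.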

Detection Results
Each model provided the shape and location of the detected defects, evaluated against the labeled ground-truth images.
In Table 3f, the noise of the input image is the main factor affecting the segmentation results. As indicated by the U-net result, the segmented image is not clear and the segmentation boundary is still blurry. A preprocessed image from principal component analysis (PCA) was added to the validation database to verify whether the segmentation improves after denoising in the Res-U-net model training.
From the results, it seems the performance improved to some extent, and the test result of Resnet-U-net gave a better performance than the original U-net.

Table 3. Results with semantic segmentation and object localization algorithms.

(Table 3 panels: Res-U-Net, U-Net, Faster-RCNN, Yolo-v3.)
Table 3 also shows the visualized results from the six deep-learning algorithms. Specifically, in Sample (g), the defect features indicate clearly that these deep-learning methods have excellent defect detection capability. However, the semantic segmentation comparison methods have a substantial shortcoming: their results are affected by the non-defect area in Sample (g), whereas Resnet-U-net runs without false detections. Compared with Resnet-U-net, the original U-net is more sensitive to fixed-pattern noise and non-uniform heating from thermography, as shown by the higher false-detection rate in its result for Sample (g); U-net therefore cannot handle this specific thermal data very well, because it is not sensitive enough to the defect information. Note that the introduced Res-U-net model ensures correct detection while effectively suppressing noise interference.
Specifically, in Sample (g), the defect feature from the sample indicated clearly that these deep-learning methods show excellent defect detection capability. However, it is obvious that the comparison methods (semantic segmentation) have a substantial shortcoming. These method results are affected by the non-defect area in Sample (g), whereas the Resnet-U-net can be conducted without false detection. Compared with Resnet-U-net, the original U-net is more sensitive to fix patterns-noise and non-uniform heating from thermography due to the higher false detection rates in the result of Sample (g). Therefore, U-net cannot detect the specific thermal data very well because U-net is too insensitive to defect information. Note that the introduced model Res-U-net can ensure correct detection while effectively prohibiting noise interference.  Table 3 also shows the visualized results from six deep-learning algorithms. Specifically, in Sample (g), the defect feature from the sample indicated clearly that these deep-learning methods show excellent defect detection capability. However, it is obvious that the comparison methods (semantic segmentation) have a substantial shortcoming. These method results are affected by the non-defect area in Sample (g), whereas the Resnet-U-net can be conducted without false detection. Compared with Resnet-U-net, the original U-net is more sensitive to fix patterns-noise and non-uniform heating from thermography due to the higher false detection rates in the result of Sample (g). Therefore, U-net cannot detect the specific thermal data very well because U-net is too insensitive to defect information. Note that the introduced model Res-U-net can ensure correct detection while effectively prohibiting noise interference. Table 3 also shows the visualized results from six deep-learning algorithms. 
Table 3 also shows the visualized results from the six deep-learning algorithms. Specifically, in Sample (g), the defect features indicate clearly that these deep-learning methods have excellent defect detection capability. However, the semantic-segmentation comparison methods show a substantial shortcoming: their results are affected by the non-defect area in Sample (g), whereas Resnet-U-net operates without false detections. Compared with Resnet-U-net, the original U-net is more sensitive to fixed-pattern noise and non-uniform heating in the thermographic data, as reflected in its higher false-detection rate on Sample (g). Consequently, U-net cannot handle this type of thermal data well because it is not selective enough between noise and defect information.
Note that the introduced Res-U-net model ensures correct detection while effectively suppressing noise interference.
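One plausible reason for Res-U-net's noise robustness is the residual (skip) connection, whose identity path passes the input through unchanged so the learned branch only models a correction. A toy numpy sketch of this idea (illustrative only; the function and its `weight` parameter are not taken from the paper's architecture):

```python
import numpy as np

def residual_block(x, weight):
    """Toy residual unit: output = x + ReLU(x @ weight).

    The identity path (the bare `x` term) preserves the input signal,
    so the learned branch only needs to model a residual correction.
    `weight` is an illustrative parameter, not from the paper.
    """
    return x + np.maximum(0.0, x @ weight)

x = np.array([[1.0, -2.0]])
print(residual_block(x, np.eye(2)))  # [[ 2. -2.]]
```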
For Sample (d) (steel) and Sample (e) (plexiglass), detecting the defects and abnormal areas remains quite challenging because background and noise occupy a high proportion of the region surrounding the defect information in each sample. The result on Sample (d) indicates that Faster-RCNN failed to detect the less visible defects. Although Faster-RCNN introduces a hierarchical deep architecture to extract semantic information from the images, it still fails to distinguish the boundary noise in the steel sample. On the other hand, YOLO-V3 is slightly more effective than Faster-RCNN on the eight evaluation samples (mAP at the 0.75 IoU threshold). This further shows that YOLO-V3, as a single-stage detector, achieves good identification accuracy in comparison with the other state-of-the-art methods.
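The 0.75 IoU criterion behind the mAP figure can be made concrete: a predicted bounding box counts as a true positive only when its overlap with the ground-truth box reaches the threshold. A minimal sketch (function names are illustrative, not from the paper's code):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# At the 0.75 threshold, this prediction counts as a true positive:
print(iou((0, 0, 10, 10), (0, 0, 10, 8)))  # 0.8
```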
For the instance-segmentation methods, the segmented images (Center-Mask, Mask-RCNN) are nearly indistinguishable from the ground truth. Several types of defects are detected, including circular, square, and rectangular shapes. Table 4 shows the defect detection results obtained by training the instance-segmentation models Mask-RCNN and Center-Mask. Notably, although the training database is composed of regular shapes (circles and squares) at fixed angles, the testing results in Table 4 show that irregularly distributed defects at multiple angles are detected accurately, which indicates that the Mask-RCNN/Center-Mask spatial detection models can enhance detection performance through pixel-to-pixel instance segmentation. As a result, it is not enough to capture only high-level semantic information; it is more important to extract the weak semantic information of defects under noisy measurement conditions. The instance-segmentation models not only achieved better segmentation performance on the plexiglass samples but also showed excellent detection capability on the steel and CFRP samples.
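Agreement between a predicted mask and the ground truth can be quantified with pixel-wise overlap scores. A minimal sketch of two common choices, mask IoU and the Dice coefficient (names illustrative; not the paper's evaluation code):

```python
import numpy as np

def mask_iou(pred, gt):
    """Pixel-wise IoU between a predicted binary mask and the ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    """Dice coefficient, an alternative overlap score for segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2 * np.logical_and(pred, gt).sum() / denom if denom else 1.0
```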
To further analyze the robustness of the learning models in comparison with the state-of-the-art ATC, Samples (a)-(h) were used for a detailed analysis. For Samples (a)-(e) (plexiglass and steel), the poorer detection of the comparison method (ATC) is obvious relative to the introduced instance- and semantic-segmentation methods, since the semantic information is unclear and the segmented defects are indistinct. For Samples (f)-(h) (CFRP), which are limited by the accuracy of the IR camera, the detection results of absolute thermal contrast (ATC) are still disappointing, whereas the results of the DL methods are far superior. Therefore, for both regular- and irregular-shaped specimens, the overall performance of the DL methods is markedly better than that of the state-of-the-art baseline (ATC with global threshold).
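The ATC-with-global-threshold baseline can be sketched as follows, assuming a single thermogram frame and a known defect-free (sound) reference region; the function name and threshold value are illustrative, not taken from the paper:

```python
import numpy as np

def atc_segment(frame, sound_region, threshold):
    """Absolute thermal contrast followed by a global threshold.

    frame        : 2-D array of temperatures at one time step
    sound_region : boolean mask of a known defect-free (sound) area
    threshold    : contrast level above which a pixel is flagged as defective
    """
    t_sound = frame[sound_region].mean()   # reference temperature of sound area
    contrast = np.abs(frame - t_sound)     # absolute thermal contrast per pixel
    return contrast > threshold            # binary defect map
```

Because the threshold is global, non-uniform heating or fixed-pattern noise that raises the contrast of sound pixels produces false detections, which is the failure mode observed for ATC above.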

Reliability Assessment Using Probability of Detection (POD)
The probability of detection (POD) [28] was further assessed as a reliability metric for subsurface defect detection, quantifying the performance of the six DL models in this task; it can be expressed as a function of aspect ratio through a POD curve. Each curve is plotted in Figure 20 with respect to the aspect ratio (defect size to depth ℎ) to provide a quantitative analysis of the deep-learning models across various defect sizes and depths. The POD results are based on the defect regions detected by the deep neural network methods, referenced against the ground truth.
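As an illustration of the underlying computation (a simple empirical hit/miss binning, not the fitted POD model of [28]), the detection rate can be tabulated per aspect-ratio bin:

```python
import numpy as np

def empirical_pod(aspect_ratios, detected, bins):
    """Empirical probability of detection per aspect-ratio bin.

    aspect_ratios : aspect ratio (defect size / depth) of each defect
    detected      : 1 if the model found the defect, else 0 (hit/miss)
    bins          : bin edges over the aspect ratio
    """
    aspect_ratios = np.asarray(aspect_ratios, float)
    detected = np.asarray(detected, float)
    idx = np.digitize(aspect_ratios, bins) - 1  # bin index per defect
    pod = []
    for b in range(len(bins) - 1):
        hits = detected[idx == b]
        pod.append(hits.mean() if hits.size else np.nan)
    return np.array(pod)
```

In practice POD studies fit a smooth (e.g. log-odds) curve through such hit/miss data rather than reporting raw bin averages; this sketch only shows the raw quantity the curve summarizes.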
To further analyze the robustness of the learning model in comparison with the stateof-the-art ATC, Samples (a)-(h) were further adapted to carry out a detailed analysis. For instance, in Samples (a)-(e) of plexiglass and steel, in comparison with the introduced method from instance and semantic segmentation, the poorer detection of the comparison methods (ATC) is obvious since the semantic information is unclear, and the segmented defects are not obvious. Then, for Samples (f)-(h) from CFRP, which were limited by the accuracy of IR camera, the detection results of absolute thermal contrast (ATC) are still disappointing, whereas for the DL methods, the results are far superior to ATC. Therefore, for the regular-and irregular-shaped specimens, the overall performance of the DL methods is markedly better than all the state-of-the-art methods (ATC and global threshold).

Reliability Assessment Using Probability of Detection (POD)
The reliability assessment metric of subsurface defects detection. The probability of detection [28] has been further assessed to quantify the performance of these six DL models in this task, which can be expressed as a function of aspect ratio through a POD curve. Each curve was plotted in Figure 20 with respect to the aspect ratio ( / ℎ) to indicate the quantitative analysis for various sizes and depths of defects with deeplearning models. The results of probability of detection (POD) are based on the defect regions detected from the deep neural network methods based on the referenced ground truth. As a result, it is not enough to only acknowledge the semantic information; it is more impactful to know how to obtain the low semantic information from defects under the interference of objective noisy conditions. In contrast, for the instance-segmentation models, not only did this illustrate a better segmentation performance for the plexiglass samples, but it also has excellent detection capability for the steel and CFRP samples.
To further analyze the robustness of the learning model in comparison with the stateof-the-art ATC, Samples (a)-(h) were further adapted to carry out a detailed analysis. For instance, in Samples (a)-(e) of plexiglass and steel, in comparison with the introduced method from instance and semantic segmentation, the poorer detection of the comparison methods (ATC) is obvious since the semantic information is unclear, and the segmented defects are not obvious. Then, for Samples (f)-(h) from CFRP, which were limited by the accuracy of IR camera, the detection results of absolute thermal contrast (ATC) are still disappointing, whereas for the DL methods, the results are far superior to ATC. Therefore, for the regular-and irregular-shaped specimens, the overall performance of the DL methods is markedly better than all the state-of-the-art methods (ATC and global threshold).

Reliability Assessment Using Probability of Detection (POD)
The reliability assessment metric of subsurface defects detection. The probability of detection [28] has been further assessed to quantify the performance of these six DL models in this task, which can be expressed as a function of aspect ratio through a POD curve. Each curve was plotted in Figure 20 with respect to the aspect ratio ( / ℎ) to indicate the quantitative analysis for various sizes and depths of defects with deeplearning models. The results of probability of detection (POD) are based on the defect regions detected from the deep neural network methods based on the referenced ground truth. As a result, it is not enough to only acknowledge the semantic information; it is more impactful to know how to obtain the low semantic information from defects under the interference of objective noisy conditions. In contrast, for the instance-segmentation models, not only did this illustrate a better segmentation performance for the plexiglass samples, but it also has excellent detection capability for the steel and CFRP samples.
To further analyze the robustness of the learning model in comparison with the stateof-the-art ATC, Samples (a)-(h) were further adapted to carry out a detailed analysis. For instance, in Samples (a)-(e) of plexiglass and steel, in comparison with the introduced method from instance and semantic segmentation, the poorer detection of the comparison methods (ATC) is obvious since the semantic information is unclear, and the segmented defects are not obvious. Then, for Samples (f)-(h) from CFRP, which were limited by the accuracy of IR camera, the detection results of absolute thermal contrast (ATC) are still disappointing, whereas for the DL methods, the results are far superior to ATC. Therefore, for the regular-and irregular-shaped specimens, the overall performance of the DL methods is markedly better than all the state-of-the-art methods (ATC and global threshold).

Reliability Assessment Using Probability of Detection (POD)
The reliability assessment metric of subsurface defects detection. The probability of detection [28] has been further assessed to quantify the performance of these six DL models in this task, which can be expressed as a function of aspect ratio through a POD curve. Each curve was plotted in Figure 20 with respect to the aspect ratio ( / ℎ) to indicate the quantitative analysis for various sizes and depths of defects with deeplearning models. The results of probability of detection (POD) are based on the defect regions detected from the deep neural network methods based on the referenced ground truth. As a result, it is not enough to only acknowledge the semantic information; it is more impactful to know how to obtain the low semantic information from defects under the interference of objective noisy conditions. In contrast, for the instance-segmentation models, not only did this illustrate a better segmentation performance for the plexiglass samples, but it also has excellent detection capability for the steel and CFRP samples.
To further analyze the robustness of the learning model in comparison with the stateof-the-art ATC, Samples (a)-(h) were further adapted to carry out a detailed analysis. For instance, in Samples (a)-(e) of plexiglass and steel, in comparison with the introduced method from instance and semantic segmentation, the poorer detection of the comparison methods (ATC) is obvious since the semantic information is unclear, and the segmented defects are not obvious. Then, for Samples (f)-(h) from CFRP, which were limited by the accuracy of IR camera, the detection results of absolute thermal contrast (ATC) are still disappointing, whereas for the DL methods, the results are far superior to ATC. Therefore, for the regular-and irregular-shaped specimens, the overall performance of the DL methods is markedly better than all the state-of-the-art methods (ATC and global threshold).

Reliability Assessment Using Probability of Detection (POD)
The probability of detection (POD) [28] was assessed to quantify the performance of the six DL models in this task; it can be expressed as a function of the defect aspect ratio through a POD curve. Each curve is plotted in Figure 20 with respect to the aspect ratio (size/depth) to provide a quantitative analysis across the various defect sizes and depths for each deep-learning model. The POD results are based on the defect regions detected by the deep neural networks, evaluated against the referenced ground truth. Consequently, it is not enough to recognize only the high-level semantic information; it is more important to recover the weak semantic information of defects under noisy measurement conditions. The instance-segmentation models, in contrast, not only showed better segmentation performance on the plexiglass samples but also excellent detection capability on the steel and CFRP samples.
To further analyze the robustness of the learning models in comparison with the state-of-the-art ATC, Samples (a)-(h) were examined in detail. In Samples (a)-(e) (plexiglass and steel), the comparison method (ATC) performed visibly worse than the introduced instance- and semantic-segmentation methods: its semantic information is unclear and the segmented defects are indistinct. For Samples (f)-(h) (CFRP), which are limited by the accuracy of the IR camera, the ATC detection results remain disappointing, whereas the results of the DL methods are far superior. For both regular- and irregular-shaped specimens, therefore, the overall performance of the DL methods is markedly better than that of the state-of-the-art methods (ATC and global threshold).
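For reference, the ATC-plus-global-threshold baseline used for comparison can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame layout, the choice of sound (defect-free) reference region, and the threshold value are all assumptions.

```python
import numpy as np

def atc_segment(frame, sound_mask, threshold):
    """Absolute thermal contrast (ATC) with a global threshold.

    frame      : 2-D array of temperatures for one thermal image
    sound_mask : boolean mask marking a defect-free reference area
    threshold  : global contrast threshold (user-chosen)
    """
    t_sound = frame[sound_mask].mean()      # reference temperature of sound area
    contrast = np.abs(frame - t_sound)      # absolute thermal contrast per pixel
    return contrast > threshold             # binary defect map

# toy example: a warm 2x2 "defect" on a uniform background
frame = np.full((6, 6), 20.0)
frame[2:4, 2:4] = 23.0
sound = np.zeros((6, 6), dtype=bool)
sound[0, :] = True                          # top row taken as sound region
defects = atc_segment(frame, sound, threshold=1.5)
print(int(defects.sum()))                   # 4 pixels flagged
```

The global threshold must be tuned per material and heating condition, which is exactly the fragility the DL methods avoid.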
The POD of the instance-segmentation method Center-Mask is notable: its POD scores are the highest among the compared approaches, representing the best detectability. The staged object localization methods (YOLO-V3; Faster-RCNN) are faster, with medium detection accuracy. The semantic segmentation methods (U-net; Res-U-net) obtained lower accuracy due to the fixed-pattern noise and non-uniform heating in the infrared thermal data. Nevertheless, all six DL models surpass the state-of-the-art method (absolute thermal contrast, ATC) and operate automatically.
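A POD curve of the kind plotted in Figure 20 can be estimated empirically from hit/miss detection results binned by aspect ratio. The sketch below uses hypothetical hit/miss data and a simple binned estimator rather than the fitted POD model of [28]:

```python
import numpy as np

def empirical_pod(aspect_ratios, detected, bins):
    """Empirical probability of detection per aspect-ratio (size/depth) bin.

    aspect_ratios : size/depth value for each known defect
    detected      : 1/0 (hit/miss) flag for each defect
    bins          : bin edges along the aspect-ratio axis
    Returns bin centers and the detection fraction in each non-empty bin.
    """
    aspect_ratios = np.asarray(aspect_ratios, dtype=float)
    detected = np.asarray(detected, dtype=float)
    idx = np.digitize(aspect_ratios, bins) - 1   # bin index of each defect
    centers, pod = [], []
    for b in range(len(bins) - 1):
        mask = idx == b
        if mask.any():
            centers.append(0.5 * (bins[b] + bins[b + 1]))
            pod.append(detected[mask].mean())    # fraction detected in bin
    return np.array(centers), np.array(pod)

# hypothetical hit/miss data: larger/shallower defects are detected more often
ar = [0.5, 0.8, 1.2, 1.5, 2.5, 3.0, 4.0, 5.0]
hit = [0, 0, 1, 0, 1, 1, 1, 1]
centers, pod = empirical_pod(ar, hit, bins=[0, 1, 2, 3, 6])
print(pod)  # rises toward 1.0 as the aspect ratio grows
```

A parametric fit (e.g., a logistic hit/miss model) would smooth this into the continuous curves shown in Figure 20.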
The POD validation results for the different samples are given in Table 5 (CTS = 0.75), compared using the thermal frames acquired by pulsed thermography (PT) on each plexiglass/CFRP/steel sample in Section 4.2. In Table 5, the instance-segmentation method (Center-Mask) shows an acceptable segmentation result for all samples. Since instance segmentation can capture feature differences at each pixel, the pretrained models (Center-Mask; Mask-RCNN) have the intrinsic capability to separate defects from background information through the learning and labeling process. For the semantic-segmentation methods, as discussed previously, the original U-net model fails to detect the deeper defect regions, such as some defect regions of Sample (c), because the spatial features of those regions are difficult to extract from the infrared sequences under the non-uniform heating and fixed-pattern noise near the boundary. In contrast, it is worth noting that the instance-segmentation model (Center-Mask) obtained the best performance on Sample (c) among all methods. On Samples (a)-(d), YOLO-V3 and Center-Mask show comparable results, while the other algorithms (U-net; Res-U-net; Mask-RCNN; Faster-RCNN; ATC) obtain less accurate performance. Owing to the hierarchical structure of their deep architectures, the DL models have a strong capability to separate low-level background information in the raw thermal sequences (fixed-pattern noise; non-uniform heating) from the defect feature components.
However, it is worth mentioning that the semantic method (U-net) still fails in challenging situations (non-uniform heating), as in Sample (e), whereas the instance-segmentation method (Center-Mask) maintains high POD performance.
Furthermore, the results validated on the full databases from the eight representative Samples (a)-(h) are given in Table 6. They indicate that the Center-Mask learning model achieves the best performance compared with the other benchmark detection methods (semantic segmentation; defect localization). This network can obtain relatively stable performance because its deep architecture extracts features driven by both the tasks and the instance masks. Table 6 lists the precision, recall, and F-score of each algorithm (CTS = 0.75); Pr denotes the precision value and Re the recall value. The F-scores on Specimens (a)-(c) are relatively high for every DL model. However, on Sample (g), the F-score of U-net is only 57.2%, indicating that U-net detects the defects only partially because of noise influences, whereas the instance-segmentation methods (Mask-RCNN; Center-Mask) reach 76.5% and 83%, respectively. On Sample (h), the F-scores of semantic segmentation (U-net; Res-U-net) reach only 71.8% and 77%, respectively, while the remaining two comparison methods give the following results: Center-Mask (86%) and YOLO-v3 (79%).
The average F-scores over all eight specimens for the six DL algorithms (Faster-RCNN, YOLO-V3, U-net, Res-U-net, Mask-RCNN, Center-Mask) are 72.62%, 79.8%, 67.25%, 73.66%, 74.8%, and 82.55%, respectively. The instance-segmentation method (Center-Mask) thus gives the highest detection capability on average. Under the noise conditions of thermography, the instance-segmentation methods therefore outperform the semantic-segmentation methods in detection ability.
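The precision, recall, and F-scores reported in Table 6 follow the standard definitions; a quick sketch with illustrative counts (not the paper's data):

```python
def f_score(tp, fp, fn):
    """Precision, recall and F1 from detection counts.

    tp : correctly detected defects
    fp : false alarms
    fn : missed defects
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 12 defects correctly found, 3 false alarms, 4 missed
pr, re, f1 = f_score(tp=12, fp=3, fn=4)
print(round(pr, 3), round(re, 3), round(f1, 3))  # 0.8 0.75 0.774
```

Because F1 is the harmonic mean of Pr and Re, it penalizes a model that trades one heavily for the other, which is why it is used here alongside recall.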
In this experiment, the samples from the three materials cover different geometric distributions (regular- and irregular-shaped defects). The different geometries of the regular and irregular defects lead to inconsistent thermal diffusion, which in turn makes the data distributions quite different. The comparison of the algorithms shows that it can be difficult for a single DL model to detect all defects effectively. We therefore trained the three types of deep-learning algorithms separately for the different sample types and compared the results with the other state-of-the-art methods. Precision and recall tend to be negatively correlated when evaluating DL models; in this project, we placed more emphasis on the recall values and the comprehensive F-scores to assess model performance in terms of the number of defects successfully detected.

Mean-Average Precision (mAP)
Average Precision (AP) [40] is another indicator of the relationship between the precision and recall values. In this section, the mean-average precision (mAP) metric is introduced to further analyze the four top-ranked detection models from the POD analysis in Section 6.5 (Center-Mask; Mask-RCNN; YOLO-V3; Faster-RCNN).
The average precision (AP) [46] is calculated from the predicted bounding boxes and the confidence-threshold scores produced by the DL models for object localization and instance segmentation. The total AP and the precision-recall plots obtained with the four deep-learning architectures are compared in Figure 21 (including all confidence-threshold values). The detection results show that Center-Mask again gives the best performance. Over the 1000 infrared images from the eight representative specimens, defects are detected and recall and precision are computed from the confidence scores, as shown in Figure 21. The AP over all 1000 thermal images is 75.05%. For the instance-segmentation models, the AP is 75.33% and 81.06% for Mask-RCNN and Center-Mask, respectively (Figure 21a,b); the detection performance is therefore better for Center-Mask than for Mask-RCNN. For the object localization models, the AP is 76.63% and 71.06% for YOLO-V3 and Faster-RCNN, respectively (Figure 21c,d); the detection performance is therefore better for YOLO-V3 than for Faster-RCNN.
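The AP values above are areas under precision-recall curves obtained by sweeping the confidence threshold over every detection. A minimal sketch of this computation, using a simple rectangle-rule integration and hypothetical detections (the exact interpolation scheme used in the paper is not specified here):

```python
import numpy as np

def average_precision(confidences, is_true_positive, n_ground_truth):
    """AP: area under the precision-recall curve built by sweeping the
    confidence threshold over all detections (rectangle-rule variant)."""
    order = np.argsort(confidences)[::-1]              # highest confidence first
    flags = np.asarray(is_true_positive, dtype=bool)[order]
    tp = np.cumsum(flags)                              # running true positives
    fp = np.cumsum(~flags)                             # running false positives
    precision = tp / (tp + fp)
    recall = tp / n_ground_truth
    # integrate precision over the recall steps
    ap = recall[0] * precision[0]
    ap += np.sum((recall[1:] - recall[:-1]) * precision[1:])
    return ap

# hypothetical detections from one model, 4 ground-truth defects
conf = [0.9, 0.8, 0.7, 0.6, 0.5]
tp_flags = [True, True, False, True, False]
ap = average_precision(conf, tp_flags, n_ground_truth=4)
print(round(ap, 4))  # 0.6875
```

mAP is then the mean of such AP values over classes; with a single "defect" class, as here, mAP and AP coincide.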

Running Time Complexity
Frames per second (fps) [35] was further introduced to quantify how many images each deep-learning model can process per second, as a measure of running time complexity. Figure 22 shows the running time complexity of each model, i.e., the average rate at which a DL model detects or predicts defects per frame (image): the higher the value in the graph, the faster the algorithm.
Based on Figure 22, the object localization approaches achieved the fastest speed among the models. Among the instance-segmentation detectors, Center-Mask raises the throughput from 0.5 fps for the state-of-the-art ATC to 12 fps, and Mask-RCNN reaches 5 fps; both are significant speedups over the conventional thermal threshold segmentation method (ATC) in thermography.
We further analyzed the speed in comparison with the other state-of-the-art networks: YOLO-V3 still obtained the fastest running speed because it is a one-stage real-time detector, making it much faster than the other detectors (such as Mask-RCNN; Faster-RCNN). The RCNN methods are relatively slow because they are two-stage procedures (Region Proposal Network (RPN); ROI pooling). However, as shown by the POD curves in Figure 20 (Section 6.5), Center-Mask achieved the highest POD scores across the whole validation process for the different aspect ratio values (size/depth). In this work, therefore, Center-Mask is the most accurate, while YOLO-V3 is the most efficient, with the fastest frame rate.
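The fps figures discussed above can be measured by timing a detector over a stack of frames. A minimal sketch with a stand-in thresholding "model" (the real DL models and the frame size are assumptions for illustration):

```python
import time
import numpy as np

def measure_fps(model_fn, frames, warmup=2):
    """Frames per second of a detection function over a list of frames."""
    for f in frames[:warmup]:              # warm-up runs excluded from timing
        model_fn(f)
    start = time.perf_counter()
    for f in frames:
        model_fn(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# stand-in "model": a simple thresholding pass over each frame
dummy_model = lambda frame: frame > frame.mean()
frames = [np.random.rand(240, 320) for _ in range(50)]
print(f"{measure_fps(dummy_model, frames):.1f} fps")
```

For GPU-based detectors, the warm-up runs matter in practice, since the first inference typically includes model loading and kernel compilation.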

Results Analysis
The deep segmentation models gave promising results for defect identification in the plexiglass/CFRP/steel materials. This project focused on building the models and fine-tuning the training parameters for these defects. The way the dataset is built has a significant impact on the accuracy of the detection model.
According to the results obtained, the following analyses and conclusions were drawn from this experiment:

1.
To implement a robust detection model, the database must include enough samples. One effective improvement is to increase the size of the dataset by including multiscale images: a database composed of images at different scales (larger or smaller) makes the training sensitive to those new dimensions. This would increase the robustness of the deep segmentation algorithms on larger defects and improve the results on blurry pictures. To reduce false alarms and make the results more convenient for the user, different types of labels should be implemented. In this project, each region was labeled simply as a defect in the spatial segmentation training (Mask-RCNN; U-net; Res-U-net). The proposal is to add further classifications, for example the shape of the defect (circle, triangle) or false-positive cases (lighting spots, scratches). The algorithm would then learn not to detect these shapes as defects, reducing the number of false alarms.

2.
Another critical point in this experiment is the marking process. In contrast to other object detection methods, Mask-RCNN/Center-Mask use a pixel-based marking approach that can mark the defects accurately, rather than marking a considerable area around each defect, and can annotate objects rapidly and easily without the restrictions of bounding boxes in most cases. Compared with the instance-segmentation methods, U-net and Res-U-net are autoencoder-format DL models that are trained at the pixel level to semantically segment defect pixels from sound pixels. However, due to the burden of handling the massive temporal data of the thermal frames, U-net and Res-U-net have lower time efficiency and higher time complexity on the thermal data than the instance-segmentation models. Building more diverse and representative training samples is therefore a key point of future work in this research. The size of the dataset can be effectively increased in several ways. Through data augmentation involving rotation, horizontal flipping, and vertical shifts, the deep neural network can learn these transformations. Training images at different (larger or smaller) scales make the learning procedure more sensitive to those new dimensions, which would also enhance the robustness of the algorithm for detecting large defects and improve the results on grayscale images.

3.
In addition, the training gave results for the specific defects in the academic samples. In this work, training involved only square, circular, and rectangular defects in the plexiglass, CFRP, and steel samples. The detection results indicate that similar defects could be detected in other types of samples. However, the results also show that a learning model tested on defect types it was not trained on would not be an accurate system to rely on. Hence, to use a deep-learning algorithm, we should clearly define the type of sample being inspected and improve the robustness of the system for that sample type during the neural network training procedure. In addition, due to time limitations, we labeled only the visible defects of each sample in this experiment. To extract the feature map completely for each defect area, localizing less visible defects in infrared data will be a significant but challenging issue for further research.
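The data-augmentation transforms suggested in point 2 (rotation, horizontal flipping, vertical shifts) can be sketched for a single thermal frame as follows; the shift amount and frame size are illustrative assumptions:

```python
import numpy as np

def augment(frame):
    """Simple augmentations for one thermal frame: the four 90-degree
    rotations, each also horizontally flipped, plus a vertical shift."""
    out = []
    for k in range(4):                       # four 90-degree rotations
        rot = np.rot90(frame, k)
        out.append(rot)
        out.append(np.fliplr(rot))           # horizontal flip of each rotation
    out.append(np.roll(frame, 5, axis=0))    # vertical shift by 5 rows
    return out

frame = np.random.rand(64, 64)
variants = augment(frame)
print(len(variants))                         # 9 augmented variants per frame
```

For thermography, geometric transforms are safe because defect appearance is pose-invariant, whereas intensity-distorting transforms would have to respect the physics of thermal contrast.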

4.
A specific limitation of the object localization algorithms is the influence of the labeling process. Although fast and efficient to use, bounding boxes also impose restrictions in most cases. When a circular defect is enclosed by a bounding box, the entire defect is contained, but the box also covers non-defect area, which can introduce errors and reduce the accuracy of the results. The proposal is to use pixel-based labeling to achieve integrity in the image segmentation, labeling only the defects rather than a considerable area around each defect. This proposition can be realized by the segmentation methods. The results presented here lead to more reliable defect characterization with pulsed thermography (PT).
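The gap between a bounding-box label and a pixel-accurate mask described above can be quantified with intersection-over-union (IoU). The sketch below compares a hypothetical circular defect with its tight bounding box; even a tight box leaves the IoU well below 1 because of the covered non-defect corners:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

# a circular defect of radius 10 centered in a 32x32 patch
yy, xx = np.mgrid[:32, :32]
circle = (yy - 16) ** 2 + (xx - 16) ** 2 <= 10 ** 2

box = np.zeros((32, 32), dtype=bool)
box[6:27, 6:27] = True          # tight 21x21 bounding box around the circle

# the box fully contains the circle yet also covers the non-defect corners
print(round(iou(circle, box), 2))
```

This corner area is exactly the non-defect region that a pixel-based label would exclude.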

5.
Good defect characterization is essential both to avoid replacing parts that could still be used and to avoid leaving critically damaged components without the needed repair. These results are therefore important, e.g., for designing autonomous NDT diagnosis systems that can decide on the integrity of an inspected part by themselves. In this work, combining the three types of automatic, intelligent detection techniques with infrared thermography could improve detection in industrial applications, based on each group of results in the previous section. With algorithms achieving a high AP rate (81.06%), critically damaged components could be identified more easily while still-usable components are retained. Among the three algorithm families, instance segmentation (e.g., Center-Mask) provided the highest detection rate together with vivid segmentation results, offering better detection capability than the conventional thermal inspection methods used in industry. It could therefore be applied to, and contribute to, current industrial infrared inspection and control systems.

6.
Future work includes: (a) tests with the instance-segmentation method and other image-based NDT techniques such as stereography and holography; (b) improving the best-performing method, instance segmentation (Center-Mask), by tuning the network parameters; (c) since the CNN technique achieves excellent performance, testing and comparing other network architectures to identify the best intelligent tool for defect measurement with infrared images.

Conclusions
In this work, six spatial deep-learning models, covering instance segmentation (Mask-RCNN; Center-Mask), autoencoder-format semantic segmentation (U-net; Res-U-net), and object localization (YOLO-V3; Faster-RCNN), were applied to defect detection in infrared thermography. The evaluation results and analysis for the different geometric specimens of plexiglass, CFRP, and steel with different aspect ratios (size/depth) are given in Section 6. Each POD curve relates detectability to the defect sizes and provides a reliability assessment of the results. The spatial deep-learning models are discussed separately and comparatively in brief. Future work will focus on the detection of more complicated structured materials through the modification and combination of different spatial and transient deep-learning models.