Steel surface defect detection algorithm based on ESI-YOLOv8

To improve the precision of defect detection on steel plate surfaces and reduce false and missed detections, the ESI-YOLOv8 algorithm is proposed. The algorithm introduces a novel EP module, integrates the large separable kernel attention (LSKA) module with the spatial pyramid pooling module to form the SPPF-LSKA module, and replaces the original CIoU loss function with the Inner-CIoU loss function. The EP module reduces redundant computation and model parameters and adds a multi-scale fusion mechanism that enlarges the receptive field. The SPPF-LSKA module reduces computational complexity, accelerates model operation, and improves detection accuracy. The Inner-CIoU loss function improves detection speed and model accuracy by controlling the scale of the auxiliary bounding box. Experimental results show that the improved algorithm reaches a detection accuracy of 78%, which is 3.7% higher than the original YOLOv8, with fewer model parameters. Verification on the COCO dataset yields an average accuracy of 77.8%. In conclusion, the algorithm detects steel plate surface defects efficiently and accurately.


Introduction
Steel, a metallic material alloyed from iron, carbon and other elements, is one of the most widely used metallic materials in the world. With its high strength, wear resistance and corrosion resistance, it plays an irreplaceable role in modern industry. Depending on alloy composition and processing, steel can be divided into carbon steel, alloy steel, stainless steel and other types, each with its own properties and uses. In aerospace, automotive, construction and many other industries, the performance requirements for steel keep rising. The demands on the flatness and quality of steel surfaces are rising as well, and steel surface inspection has become an active research topic. Studying and inspecting steel surface quality matters not only for industrial development but also for product quality and safety.
The development of steel surface defect detection technology can be divided into three stages: the first stage is the traditional target detection method [1,2]; the second stage is the machine vision-based detection method; and the third stage is the deep learning-based detection method. At present, traditional methods and machine vision-based methods struggle to meet the accuracy requirements of steel inspection, so more and more researchers are turning to deep learning-based detection.
Deep learning algorithms for target detection include two-stage and one-stage algorithms. Two-stage algorithms include R-CNN [3], which extracts candidate regions by selective search and then performs classification and position regression on each candidate region; Fast R-CNN [4], which uses a convolutional neural network to extract features from the whole image and then uses a candidate region pooling layer to classify the candidate regions and regress their positions; and Faster R-CNN [5], which introduces a region proposal network to achieve end-to-end target detection. One-stage algorithms mainly comprise the YOLO series [6] and SSD [7]. The YOLO series treats target detection as a regression problem and predicts the location and category of objects simultaneously with a single neural network, while SSD predicts the location and category of targets at different scales and combines multiple feature maps for detection.
With the continuous development of one-stage target detection algorithms, the YOLO family has been updated repeatedly, from the initial YOLOv1 [8,9] and YOLOv2 [10,11], to YOLOv5 [12][13][14][15] with faster inference, to YOLOv7 [16][17][18][19] with higher accuracy, and on to the current YOLOv8. Both the average accuracy and the inference speed of the YOLO algorithms have improved significantly, and the models have become more lightweight; these advances reflect the potential of the YOLO algorithm. As the technology matures, the YOLO algorithm can be expected to keep improving and to play a more important role in computer vision.
In recent years, many scholars have carried out in-depth research on deep learning-based target detection algorithms. Xu [20] et al detected defects in fused deposition modelling using an improved YOLOv4. Li [21] et al improved a remote sensing image target detection algorithm based on YOLOv5. Li [22] et al proposed a YOLOv5 vehicle detection model for fog based on channel attention enhancement. Yi [23] et al proposed an efficient pavement distress detection method based on an improved YOLOv7. YOLO-based target detection algorithms are thus widely used in many fields.
The focus of this paper is to improve the average accuracy of steel surface defect detection without increasing the number of model parameters, in order to reduce the missed-detection rate and the false-detection rate. To this end, this paper proposes a steel surface defect detection algorithm based on ESI-YOLOv8, which has a higher average accuracy and fewer model parameters than the YOLOv8 algorithm.

Related work
The YOLOv8 model has some shortcomings in detecting defects on steel surfaces. Firstly, its detection accuracy is low, and it may suffer from missed or false detections. Secondly, the model is large, requiring more computational resources and resulting in slower performance. Additionally, the loss function in the YOLOv8 model generalises weakly and converges slowly. To detect steel surface defects more effectively, the YOLOv8 model requires further improvement and optimization.
The C2f module in YOLOv8 has a large number of parameters and complex computations, which makes training slow and demands substantial computational resources and storage space. To address this issue, Du [24] et al were inspired by GhostConv and proposed a GhostC2f module to reduce the model parameters and computational load. In YOLOv8, CIoU overlooks the imbalance problem in bounding box regression (BBR), resulting in slow convergence and inaccurate regression results. To address this issue, Zhou [25] et al replaced CIoU with EIoU in YOLOv8, improving the network's regression capability. Similarly, Zhao [26] et al replaced CIoU with DIoU in YOLOv3 to enhance model accuracy. Additionally, the pyramid pooling module in YOLOv8 has a limited receptive field and is ineffective for multi-scale feature extraction. To overcome this limitation, Yu [27] et al introduced a bi-directional feature pyramid network to improve the fusion of multi-scale features.
In response to the above problems in YOLOv8, many researchers have proposed improvements and made great contributions, but there is still much room for improvement.

Introduction to the model
YOLOv8 is the most recent iteration of the You Only Look Once (YOLO) algorithm series. It is a single-stage, end-to-end detection algorithm that efficiently and accurately recognizes the targets present in an image. Compared with earlier versions of YOLO, YOLOv8 has an enhanced network structure and incorporates a range of more advanced techniques.
For the backbone, YOLOv8 uses Darknet, a deep convolutional neural network composed of convolutional and pooling layers with reduced parameters and computation. Its primary function is to extract features from input images so that targets can be detected in real time. To detect targets at varying scales, YOLOv8 uses a feature pyramid network, which produces feature maps at different scales by applying up-sampling and down-sampling operations to the backbone features. These feature maps capture target details of different sizes and thereby strengthen multi-scale detection. To improve the fusion of feature information across scales, YOLOv8 incorporates a path aggregation network [28], which enables cross-layer transfer and fusion of feature information. Relative to the feature pyramid network, it introduces lateral connections and a top-down deconvolution operation, further improving the detection of targets at multiple scales.
The prediction layer of YOLOv8 generates detection results from the outputs of the feature pyramid network and the path aggregation network. At this stage, convolution operations map the feature maps at each scale to bounding boxes and class probabilities. Non-maximum suppression [29] is then applied to eliminate overlapping bounding boxes and produce the final detection result. The structure of YOLOv8 is shown in figure 1.
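The non-maximum suppression step mentioned above can be illustrated with a minimal, class-agnostic sketch in plain Python. It assumes boxes given as (x1, y1, x2, y2) corners and an IoU threshold of 0.5; the function names, the threshold and the sample boxes are illustrative choices, not values taken from the paper or from the YOLOv8 implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any already kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover nearly the same region.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In practice the suppression is usually applied per class, so that overlapping boxes of different defect categories do not suppress one another.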

EP module
The EP module comprises two principal components: Efficient Multi-Scale Convolution (EMSC) and pointwise convolution. The EMSC component captures features at various scales in the image, enlarging the network's receptive field. This allows the network to capture both details and global contextual information from the input image while reducing its computational cost. Pointwise convolution has no receptive field in the spatial dimension and operates only across channels, but it can change the number of channels in the input tensor and therefore gives control over the model's parameters and computational complexity.
The principle of the EP module is depicted in figure 2. Pairs of similar feature maps carry redundant information. We therefore adopt the idea of GhostNet and leave half of the channels unprocessed, which reduces parameter redundancy and computation in the module and improves network efficiency. The remaining channels are divided into two groups: the left group passes through a 3 × 3 convolution and the right group through a 5 × 5 convolution, capturing features at different scales and enlarging the receptive field. Because each feature is derived from a single group of channels and the channels remain independent of one another, pointwise convolution is used to exchange information across channels and fuse their features. This technique also limits the module's number of parameters and computations.
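The parameter saving described above can be made concrete with a rough count. The sketch below is only an illustration under stated assumptions: each branch is taken to be an ordinary convolution over a quarter of the channels, the pointwise convolution maps all channels to all channels, and biases and normalisation layers are ignored. The paper does not specify these details, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a plain 2-D convolution without bias."""
    return c_in * c_out * kernel * kernel


def ep_params(channels):
    """Weight count of an EP-style block under the assumptions stated above."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4 for this sketch")
    group = channels // 4                          # half the channels stay untouched, the rest split in two
    branch_3x3 = conv_params(group, group, 3)      # left group: 3 x 3 convolution
    branch_5x5 = conv_params(group, group, 5)      # right group: 5 x 5 convolution
    pointwise = conv_params(channels, channels, 1) # 1 x 1 convolution fusing all channels
    return branch_3x3 + branch_5x5 + pointwise


channels = 64
baseline = conv_params(channels, channels, 3)  # a full 3 x 3 convolution for comparison
ep = ep_params(channels)
print(baseline, ep, round(ep / baseline, 3))
```

Under these assumptions the block needs roughly a third of the weights of a full 3 × 3 convolution at the same width, which is consistent with the qualitative claim that the module limits parameters and computation.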

SPPF-LSKA module
The SPPF-LSKA module is constructed by adding the Large Separable Kernel Attention (LSKA) module to the Spatial Pyramid Pooling - Fast (SPPF) module. It pools feature maps at different scales without increasing the computational effort, extracting richer feature information and adapting to targets of different sizes. It also enlarges the receptive field of the module and strengthens its ability to attend to the target. The module therefore improves the detection accuracy of the network and the robustness of the model; its schematic diagram is shown in figure 3.
The schematic diagram of the LSKA [30] module is shown in figure 4. A given feature map F ∈ R^(C×H×W) is described by equation (1), where C is the number of input channels and H and W denote the height and width of the feature map, respectively. The output of LSKA is given by equations (2)-(5), where * and ⊗ represent the convolution and the Hadamard product, respectively, and Z^C is the output of the depth-wise convolution obtained by convolving a kernel W of size k × k with the input feature map F. The left side of equation (2) is the output of the depth-wise convolution with a kernel of size 1 × (2d-1), which captures local spatial information and compensates for the gridding effect of the subsequent dilated depth-wise convolution in equation (3). The kernel size of that dilated depth-wise convolution is ⌊k/d⌋ × 1, where k represents the receptive field of the kernel W and d denotes the dilation rate. The attention map A^C is obtained by applying a 1 × 1 convolution, as in equation (4). Equation (5) gives the output as the Hadamard product of the attention map A^C and the input feature map F^C.
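For reference, the equations referred to in the text can be sketched as follows. This is a reconstruction that assumes the module follows the LSKA formulation of [30]; the equation numbering is inferred from the references in the text, and the intermediate symbol $\bar{Z}^{C}$ and the output symbol $\bar{F}^{C}$ are notation introduced here.

```latex
\begin{align}
F &\in \mathbb{R}^{C \times H \times W} \tag{1}\\
\bar{Z}^{C} &= \sum_{H,W} W^{C}_{(2d-1)\times 1} * \Bigl(\sum_{H,W} W^{C}_{1\times(2d-1)} * F^{C}\Bigr) \tag{2}\\
Z^{C} &= \sum_{H,W} W^{C}_{\lfloor k/d \rfloor \times 1} * \Bigl(\sum_{H,W} W^{C}_{1\times \lfloor k/d \rfloor} * \bar{Z}^{C}\Bigr) \tag{3}\\
A^{C} &= W_{1\times 1} * Z^{C} \tag{4}\\
\bar{F}^{C} &= A^{C} \otimes F^{C} \tag{5}
\end{align}
```

In this form each two-dimensional depth-wise kernel is split into a horizontal and a vertical one-dimensional kernel, which is where the reduction in computation comes from.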

Inner-CIoU loss function
Because steel defects are small targets, the CIoU [31] loss function cannot effectively describe the objective of bounding box regression (BBR) and ignores the imbalance problem in BBR, resulting in slow convergence and inaccurate regression results. Therefore, to exploit the potential of the BBR loss, Inner-IoU is introduced: it computes the loss on auxiliary bounding boxes whose scale is controlled by a scale factor, ratio. The schematic diagram is shown in figure 5. The ground-truth (GT) box and the anchor box are denoted by B^gt and B, respectively. The centre shared by the GT box and the inner GT box is denoted by (x_c^gt, y_c^gt), and the centre shared by the anchor box and the inner anchor box is denoted by (x_c, y_c). The width and height of the GT box are denoted by w^gt and h^gt, and those of the anchor box by w and h. In equations (6)-(14), the ratio and the union are abbreviated as ra and un, respectively, to keep the formulae compact. The ratio is the scale factor, which takes a value between 0.5 and 1.5.
Inner-CIoU applies this auxiliary-box computation to the CIoU loss and raises the detection accuracy of the model. Repeated experiments show that the steel surface defect detection model performs best with a ratio of 1.2.
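The auxiliary-box computation can be sketched in plain Python. This is a minimal sketch, assuming the commonly used Inner-IoU formulation in which both boxes are scaled by the ratio around their own centres, the standard CIoU loss 1 - IoU + ρ²/c² + αv, and the combination L_Inner-CIoU = L_CIoU + IoU - IoU_inner. The function names, the (xc, yc, w, h) box layout and the epsilon guard are illustrative and not taken from the paper.

```python
import math

EPS = 1e-9  # guard against division by zero (illustrative choice)


def scaled_corners(xc, yc, w, h, ratio):
    """Corners (left, top, right, bottom) of the box scaled by ratio around its centre."""
    return (xc - w * ratio / 2, yc - h * ratio / 2, xc + w * ratio / 2, yc + h * ratio / 2)


def inner_iou(gt, pred, ratio=1.2):
    """IoU of the auxiliary boxes; gt and pred are (xc, yc, w, h). ratio=1.0 gives plain IoU."""
    g = scaled_corners(*gt, ratio)
    p = scaled_corners(*pred, ratio)
    inter_w = max(0.0, min(g[2], p[2]) - max(g[0], p[0]))
    inter_h = max(0.0, min(g[3], p[3]) - max(g[1], p[1]))
    inter = inter_w * inter_h
    union = gt[2] * gt[3] * ratio ** 2 + pred[2] * pred[3] * ratio ** 2 - inter
    return inter / (union + EPS)


def ciou_loss(gt, pred):
    """Standard CIoU loss: 1 - IoU + centre-distance term + aspect-ratio term."""
    iou = inner_iou(gt, pred, ratio=1.0)
    g = scaled_corners(*gt, 1.0)
    p = scaled_corners(*pred, 1.0)
    rho2 = (gt[0] - pred[0]) ** 2 + (gt[1] - pred[1]) ** 2  # squared centre distance
    enclose_w = max(g[2], p[2]) - min(g[0], p[0])
    enclose_h = max(g[3], p[3]) - min(g[1], p[1])
    c2 = enclose_w ** 2 + enclose_h ** 2                     # squared diagonal of enclosing box
    v = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(pred[2] / pred[3])) ** 2
    alpha = v / (1 - iou + v + EPS)
    return 1 - iou + rho2 / (c2 + EPS) + alpha * v


def inner_ciou_loss(gt, pred, ratio=1.2):
    """CIoU loss with the IoU term replaced by the auxiliary-box IoU."""
    return ciou_loss(gt, pred) + inner_iou(gt, pred, 1.0) - inner_iou(gt, pred, ratio)


gt_box = (50.0, 50.0, 20.0, 10.0)
pred_box = (53.0, 51.0, 18.0, 12.0)
print(inner_iou(gt_box, pred_box, ratio=1.0), inner_iou(gt_box, pred_box, ratio=1.2))
print(inner_ciou_loss(gt_box, pred_box, ratio=1.2))
```

With a ratio above 1 the auxiliary boxes are larger than the real ones, so two nearby boxes that do not overlap at all can still produce a non-zero overlap term.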
In summary, the improved ESI-YOLOv8 structure is shown in figure 6. The improvements are the addition of five EP modules, the fusion of the large separable kernel attention module into the spatial pyramid pooling module to form the SPPF-LSKA module, and the introduction of the Inner-CIoU loss function.

Experimental preparation
The dataset used in this experiment is the NEU-DET steel surface defect dataset, which contains 1800 greyscale images and is randomly divided into a training set, a validation set and a test set in an 8:1:1 ratio. The images fall into six categories, namely Crazing, Inclusion, Patches, Pitted-Surface, Rolled-in-Scale and Scratches; examples of the six categories of surface defects are shown in figure 7.
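The random 8:1:1 split can be sketched as follows. This is a minimal sketch using only the Python standard library; the image identifiers and the fixed seed are hypothetical, since the paper does not state how the split was drawn.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items reproducibly and split them into train, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical identifiers standing in for the 1800 images.
image_ids = [f"img_{i:04d}" for i in range(1800)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```

For 1800 images this gives 1440 training, 180 validation and 180 test images.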
The experimental conditions and parameter configurations are as outlined below.

Experimental comparison of different loss functions
To assess the efficacy of the Inner-IoU loss function on this model, several loss functions were compared experimentally. Figure 8 compares the average accuracy of Inner-IoU with that of CIoU, DIoU and GIoU. As the figure shows, the average accuracy of Inner-IoU surpasses that of the other three loss functions as the number of training epochs increases. After 40 epochs, the average accuracy of Inner-IoU rises more smoothly, whereas CIoU, DIoU and GIoU fluctuate more. Between 60 and 80 epochs, the average accuracy remains between 0.1 and 0.3. Table 1 lists the average accuracies of the different loss functions for direct comparison. At IoU thresholds of 0.5 and 0.5-0.95, the average accuracy of Inner-IoU is higher than that of the other loss functions: it is 1.8%, 0.7% and 1.6% higher than that of CIoU, GIoU and DIoU, respectively.
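The metric behind these comparisons can be illustrated with a short sketch. It assumes that "average accuracy" refers to mean average precision (mAP), as is conventional for detection benchmarks, and that detections have already been matched to ground-truth boxes at a chosen IoU threshold; the function names and the sample flags are illustrative.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation).

    is_true_positive: flags for detections sorted by descending confidence.
    num_ground_truth: number of ground-truth boxes of this class.
    """
    true_positives = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        true_positives += 1 if hit else 0
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(per_class_ap):
    """Mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)


# Hypothetical class with two ground-truth defects and three ranked detections.
ap = average_precision([True, False, True], num_ground_truth=2)
print(round(ap, 4))
```

The value at threshold 0.5 uses a single matching threshold, while the 0.5-0.95 value averages the same quantity over a range of thresholds.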

Analysis of experimental results before and after YOLOv8 improvement
As shown in table 3, the enhanced model increases the average accuracy by 3.7% over the original YOLOv8 model, with fewer parameters and less computation. Figure 9 shows the average accuracy of the improved model for each class of steel surface defect. The highest average accuracy, 92.2%, was achieved for scratches, whereas the average accuracy for most classes remained around 80%. A few classes exhibited low average accuracy, which leaves scope for further enhancement.
To assess the model further, we evaluated it on the test set and obtained an average accuracy of 77.9%, only 0.1% lower than the average accuracy obtained in training. We also validated it on the COCO dataset, with an average accuracy of 77.8%, and on the VOC2007 dataset, with an average accuracy of 77.6%, meeting the required standard. The results are given in table 2.
To illustrate the improvement, figure 10 presents a visual comparison between the original and improved YOLOv8 models. Although the original model can locate defects, it fails to detect a wide range of them; for the Pitted_surface class in figure 10, for example, surface pockmarks go undetected in several cases. The improved model has a lower missed-detection rate and detects the various defect types better. For small defects, the thermal imaging area of the original model is larger than the defect area, as with the Crazing class in figure 10, whereas the improved model is more precise. The marks in figure 10 also show that the original model produces some false detections and that its detected locations are not aligned with the defect locations, while the improved model localizes defects precisely and achieves a higher average accuracy than the original model.
Figure 11 plots the average accuracy of the improved ESI-YOLOv8 model and of the original model, with the horizontal axis indicating the number of training epochs and the vertical axis the average accuracy. During the first 80 epochs the average accuracy of the improved model is lower than that of the original model; after 80 epochs the improved model overtakes the original and remains higher through epoch 120.
According to figure 12, the ESI-YOLOv8 model exhibits a lower classification loss than the YOLOv8 model as the number of training epochs increases. The model also distinguishes between the defect categories with higher accuracy.

Ablation experiment
To analyse the contribution of each improvement objectively and to determine how effectively the proposed model detects defects on steel surfaces, seven sets of ablation experiments were designed and compared against the original YOLOv8 model.
Experimental group 1 added the EP module to YOLOv8. Group 2 replaced the spatial pyramid pooling module of YOLOv8 with the SPPF-LSKA module. Group 3 changed the loss function of YOLOv8 to Inner-IoU. Group 4 added both the EP module and the SPPF-LSKA module. Group 5 combined the EP module with the Inner-IoU loss function. Group 6 combined the SPPF-LSKA module with the Inner-IoU loss function. Group 7 combined the EP module, the SPPF-LSKA module and the Inner-IoU loss function. The groups were run under the same experimental conditions on the steel surface defect dataset, and the results are presented in table 4.
Table 4 shows that experimental group 1, which adds the EP module, achieves an average accuracy 0.9% higher than the original model with fewer parameters and less computation. Experimental group 2, which uses the SPPF-LSKA module, improves the average accuracy of the original model by 2.3%, although it slightly increases the number of parameters and the amount of computation. Experimental group 3, which uses the Inner-IoU loss function, improves the average accuracy by 1.8% with the same number of parameters and the same amount of computation as the original model. The SPPF-LSKA module therefore raises average accuracy more than the other two improvements, at the cost of more parameters and computation. The Inner-IoU loss function improves average accuracy without changing the number of parameters or the amount of computation. The EP module yields the smallest accuracy improvement but requires the fewest parameters and the least computation.
Since each of the three improvements has its own strengths, they were next combined in pairs. Experimental group 4 shows that adding both the EP and SPPF-LSKA modules raises the average accuracy beyond that of either module alone; the resulting accuracy is 2.6% higher than that of the original model. Experimental group 5, which added the EP module and the Inner-IoU loss function, had the same number of parameters as the group that added only the Inner-IoU loss function, yet its accuracy was 0.6% higher. Experimental group 6, which added the SPPF-LSKA module and the Inner-IoU loss function, showed the largest improvement in average accuracy among the pairwise combinations, an increase of 2.9% over the original model. It had the same number of parameters as the group that added only the SPPF-LSKA module, but with the additional Inner-IoU loss function its accuracy was 0.6% higher. Combining all three improvements is more effective than using any one of them alone, as the results of experimental group 7 show. The average accuracy of this group reached 78%, higher than that of any other experimental group and 3.7% higher than that of the original model. Notably, the combined model also uses fewer parameters and requires less computation than the original model.
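The seven ablation groups described above correspond to every non-empty combination of the three improvements. The following sketch enumerates them with the Python standard library; the component labels are shorthand used here for illustration.

```python
from itertools import combinations

COMPONENTS = ("EP", "SPPF-LSKA", "Inner-IoU")


def ablation_groups(components):
    """All non-empty combinations, ordered from single components to the full set."""
    groups = []
    for size in range(1, len(components) + 1):
        groups.extend(combinations(components, size))
    return groups


groups = ablation_groups(COMPONENTS)
for number, group in enumerate(groups, start=1):
    print(f"group {number}: YOLOv8 + " + " + ".join(group))
```

The enumeration order matches the numbering used in the text: groups 1-3 test single improvements, groups 4-6 test pairs, and group 7 is the full ESI-YOLOv8 configuration.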

Comparative experiments of different algorithms
To ensure a more objective comparison, we analysed the improved model alongside current mainstream models, including YOLOX, Faster R-CNN, SSD, ResNet-50, YOLOv3, YOLOv4, YOLOv5, YOLOv7-tiny and YOLOv8. We also compared it with several other models: the CBAM-MobilenetV2-YOLOv5 model (CM-YOLOv5) proposed by Yang [32] et al, the YOLO-ACG model by Wang [33] et al, the AGCN model by Zhang [34] et al, the improved YOLOv8 model by Wei [35] et al, the multi-scale lightweight neural network model (MM) proposed by Shao [36] et al, and the model combining CNN and Transformer proposed by Zhang [37] et al. The experiments were conducted with identical hardware and software configurations and the same steel surface defect dataset. Table 5 presents the performance comparison for each model. The YOLOv8 model exhibits higher average accuracy than the other YOLO series models, 11.8% higher than the YOLOv3 model, which showed the lowest average accuracy. The YOLOv8 model also reduces the parameter size by 13.6 MB and the computation to approximately 1/25 of that of YOLOv3. These findings suggest that the YOLOv8 model outperforms the other YOLO series models in average accuracy, parameter quantity and computation quantity. Specifically, the model uses only half of the parameters of YOLOv7 while achieving 8.6% higher average accuracy, and its computational demands are significantly lower than those of its counterparts. Compared with the mainstream Faster R-CNN, SSD and ResNet-50 models, the model proposed in this study has fewer parameters, and its average accuracy is 4.9%, 16.3% and 7.1% higher, respectively. The proposed model thus outperforms these existing models in accuracy, parameter count and computational efficiency.
Numerous scholars have made significant contributions to research on steel surface defect detection. This paper compares our model with several other models from this field: the CM-YOLOv5 model, the YOLO-ACG model, the AGCN model, the improved model proposed by Wei et al, the MM model and the improved model proposed by Zhang et al. Although our model may not match these models in processing speed, it has fewer parameters. Its average accuracy is higher than theirs by 1.7%, 14.3%, 6.7%, 2.6%, 0.9% and 0.2%, respectively. Compared with Zhang's model, our model has fewer parameters, although the average accuracy is similar. Although our model may sacrifice some processing speed, it provides a more reliable solution for real-world steel surface defect detection. The detection results of the model are shown in figure 13.

Conclusion
This paper first analyses the limitations of existing steel surface defect detection, including the large number of model parameters and the low average accuracy. To improve the average accuracy of steel surface defect detection and reduce the missed-detection and false-detection rates, this paper proposes the ESI-YOLOv8 model. To reduce the number of model parameters and the amount of computation and thereby improve detection efficiency, this paper introduces a new EP module, which makes the number of parameters and the amount of computation smaller than in the original YOLOv8 model. Additionally, the paper integrates the large separable kernel attention module with the spatial pyramid pooling module to form the SPPF-LSKA module. This integration reduces computational complexity and improves the average detection accuracy. To address the slow convergence and inaccurate regression results caused by the imbalance in BBR, this paper adopts the Inner-IoU loss function, which controls the scale of the auxiliary bounding box.
Experimental validation shows that the ESI-YOLOv8 model requires less computation and fewer parameters than the YOLOv8 model, and that its average accuracy is 3.7% higher. Compared with other mainstream steel surface defect detection models, the model presented in this paper has advantages in average accuracy and number of parameters. Testing and validation on the COCO dataset show that the model is stable: the average accuracy decreases by only 0.1% on the test set and 0.2% on the COCO dataset. The ESI-YOLOv8 model therefore meets the requirements of small model size and high accuracy and is well suited to steel surface defect detection.

Figure 5. Schematic diagram of the Inner-IoU loss function.

Figure 7. Schematic diagram of steel surface defects.

Figure 8. Plots of the average precision of different loss functions.

Figure 9. Improved average accuracy plot for each category.

Figure 11. Comparison of average accuracy before and after improvement.

Figure 12. Comparison of classification losses before and after improvement.

Figure 13. Renderings of the improved algorithm.

Table 1. Comparison of the results of different loss functions.

Table 2. Comparison of before and after results.

Table 3. Comparison of before and after results.

Table 4. Comparison of ablation experimental results.

Table 5. Comparison of experimental results of different models.