Article

MCAW-YOLO: An Efficient Detection Model for Ceramic Tile Surface Defects

1 The College of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China
2 The Key Laboratory of Images and Graphics Intelligent Processing of State Ethnic Affairs Commission, North Minzu University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(21), 12057; https://doi.org/10.3390/app132112057
Submission received: 29 August 2023 / Revised: 27 October 2023 / Accepted: 30 October 2023 / Published: 5 November 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Traditional manual visual inspection is inefficient, subjective, and costly, making it prone to false and missed detections. Deep-learning-based defect detection identifies the types of defects and pinpoints their locations; this approach can streamline the production workflow, boost production efficiency, reduce company expenses, and lessen the workload on workers. In this paper, we propose a lightweight tile-defect detection algorithm that strikes a balance between model parameters and accuracy. Firstly, we introduced the mobile-friendly vision transformer into the backbone network to capture global and local information, allowing the model to better comprehend the image content and enhancing defect feature extraction. Secondly, we designed a lightweight feature fusion network, which amplified the network’s ability to detect defects of different scales and mitigated blurriness and redundancy in the feature maps while reducing the model’s parameter count. We then devised a convolution module incorporating the normalization-based attention module to direct the model’s focus toward defect features, reducing background noise and filtering out features irrelevant to defects. Finally, we utilized a bounding box regression loss with a dynamic focusing mechanism, which facilitated the prediction of more precise object bounding boxes, thereby improving the model’s convergence rate and detection precision. Experimental results demonstrated that the improved algorithm achieved a mean average precision of 71.9%, marking a 3.1% improvement over the original algorithm, while reducing the model’s parameters by 26.2% and its computations by 20.9%.

1. Introduction

Ceramic tiles are commonly used for floor and wall decoration in construction. However, ceramic tiles can suffer various defects during manufacturing [1]. These defects can arise from many sources, including the raw materials used, the intricacies of the production process, and the reliability of the manufacturing equipment. Commonly observed defects include physical damage, such as chips or breaks; structural flaws, like cracks or deformations; and cosmetic issues, including discolorations or variances in hue and pattern. The presence of these defects not only diminishes the aesthetic value of the tiles but can also compromise their structural integrity, potentially shortening their functional lifespan and posing safety risks. Moreover, tiles that consistently exhibit a high defect rate can tarnish a manufacturer’s reputation and erode consumers’ trust. Such tiles invariably fail to meet industry standards, reducing market competitiveness. Consequently, the sales and profitability of these subpar products take a hit, jeopardizing the company’s brand image and long-term viability.
The detection of defects in ceramic tiles has predominantly depended on manual visual inspection by skilled workers [2]. In this traditional method, inspectors meticulously examine each tile for defects, ranging from minute cracks to discolorations. However, despite the expertise of the workers, the manual process comes with inherent challenges: subjectivity, fatigue, and the propensity for human error render it inefficient, especially for the demands of large-scale, modern production lines. In light of these challenges, computer vision offers promising solutions. The introduction of detection algorithms, particularly those harnessing the power of deep learning, has revolutionized tile-defect detection. These algorithms, known for their high precision, excel at classifying the nature of defects and pinpointing their exact locations on the tiles. The resulting reduction in errors, previously arising from manual inspection, leads to a noticeable improvement in production efficiency. Moreover, the adoption of computer vision transcends mere defect detection: it automates image and video analysis, significantly reducing the need for manual oversight and, in turn, easing the burden on human inspectors. The benefits of integrating this technology are multifaceted. Beyond the obvious savings in labor costs, there is a strategic advantage: companies embracing this innovation stand out in the competitive market, reinforcing their brand reputation and ensuring continued consumer trust in the quality of their products.
This paper proposes a lightweight ceramic-tile defect-detection algorithm called MCAW-YOLO. The algorithm aims to improve the model’s detection accuracy while reducing the number of parameters. The effectiveness of the improved algorithm was verified through ablation experiments, and comparisons with other YOLO-series algorithms underscored the superior performance of MCAW-YOLO. The main contributions are as follows:
(1)
Introducing the mobile-friendly vision transformer (MobileViTv3) into the backbone network to capture defect information. By combining the local information provided by convolution with the global information provided by the transformer, the model can better capture features at different scales and abstraction levels, thereby improving model performance.
(2)
Designing the lightweight bidirectional feature pyramid network (Light-BiFPN) feature fusion network in the neck network to obtain feature maps at different scales through cross-layer connections. This promoted the interaction between shallow localization and deep semantic information, enhanced the fusion of multi-scale features, and effectively reduced the number of model parameters.
(3)
Designing the ghost shuffle convolution module with an attention mechanism (GSAConv) to improve the model’s focus on defects by concentrating the network’s attention on the ceramic-tile defect regions. This reduced the influence of the background on detection, thereby improving detection accuracy and recall.
(4)
Introducing the bounding box regression loss function with a dynamic focusing mechanism (Wise-IoU) to better adapt to defects of different shapes and accelerate the convergence of the model. This loss function reduced the impact of geometric factors on bounding box regression and improved the model’s learning ability for complex samples.

2. Related Work

2.1. Defect Detection

Defect-detection algorithms can be categorized into single-stage and two-stage detection algorithms, depending on whether the model generates candidate bounding boxes.
The mainstream methods for two-stage detection are the R-CNN series [3,4,5]. For example, Zhang et al. [6] presented an improved Faster R-CNN model for detecting shriveled and empty-shell defects in in-shell walnuts. They incorporated a feature pyramid network and utilized region-of-interest alignment and Softer-NMS modules to enhance the model’s detection precision. Xu et al. [7] introduced the multi-stage balanced R-CNN defect-detection method, incorporating deformable convolutions and balanced feature pyramids to enhance feature-extraction capabilities. Furthermore, they applied balanced L1 loss and IoU-balanced sampling to improve the model’s detection accuracy. Zhu et al. [8] proposed the IA-Mask R-CNN detection method for surface-defect detection on automotive engine parts, determining the optimal anchor scale through labeled data analysis. Although two-stage detection algorithms can reduce reliance on manual intervention, they require separate training for candidate box generation and object detection, which consumes more training time and resources. They are also slower, making them challenging to apply in real-time scenarios.
The mainstream methods for single-stage detection are the You Only Look Once (YOLO) series [9,10,11,12,13]. These algorithms directly predict defect categories and regress location information, usually boasting faster detection speed. For example, Li et al. [14] proposed an improved YOLOv5 model for surface-defect detection in aerospace engine components. This model utilizes the K-means clustering algorithm to optimize anchor sizes and incorporates an efficient channel attention mechanism in the backbone network to enhance feature representation. Kang et al. [15] introduced the DME-YOLO model for defect detection in the appearance of high-frequency transformers. By reusing features, they accelerated detection; additionally, they designed a multi-information-source spatial attention module to enhance feature extraction and employed the EIoU loss function to improve bounding box regression accuracy. Zheng et al. [16] proposed an insulator defect-detection algorithm that improves upon YOLOv7, utilizing the K-means++ clustering algorithm to generate defect bounding boxes better suited to the model. They then incorporated coordinate attention and HorBlock modules to bolster the network’s expressive capacity, and introduced SIoU and focal loss functions to expedite model convergence. Wang et al. [17] presented the ODCA-YOLO algorithm for wood-defect detection, integrating coordinate attention and omni-dimensional dynamic convolution to amplify the model’s feature extraction. Moreover, they applied the shuffle concept within the HorBlock module to heighten recognition accuracy. Although the studies above apply YOLO-series algorithms to defect detection with good generalization and detection performance, the improvements usually increase the depth and complexity of the network, resulting in more parameters, which is unsuitable for deployment on edge devices.

2.2. YOLOv5 Algorithm

YOLOv5 is an efficient and accurate single-stage detection algorithm that enables fast and accurate object detection tasks in scenarios with limited computational resources. It has been widely applied in various computer vision applications, such as intelligent security, autonomous driving, intelligent logistics, industrial quality inspection, and other fields.
The YOLOv5 model consists of four main components: the input end, the backbone network, the neck network, and the output end. The input end uses Mosaic data augmentation, which combines multiple images into one by scaling and shifting, increasing the diversity of training data and improving the model’s robustness and generalization in complex scenes. The model also adopts an adaptive anchor box calculation to determine anchor sizes suitable for the current dataset, reducing manual adjustment of anchor sizes and speeding up training. The backbone network utilizes a cross-stage partial architecture to improve the speed and efficiency of the model and incorporates the spatial pyramid pooling-fast module, which concatenates feature maps pooled at different scales to enhance the model’s learning ability. The neck network combines the feature pyramid network (FPN) and path aggregation network (PAN) structures: FPN propagates deep semantic information down to the shallow, detail-rich layers through nearest-neighbor upsampling, while PAN passes shallow detail information up to the deep semantic layers through convolution. Their combination gives feature maps at different scales a strong perception of defects’ color, texture, and other details, as well as rich semantic information. The output end uses Complete-IoU as the bounding box regression loss function, which accurately evaluates the discrepancy between predicted and ground truth boxes.

3. MCAW-YOLO Network Model

The improved MCAW-YOLO network boasts superior feature-extraction capabilities, all while optimizing the number of parameters and calculations. The architecture of the network is shown in Figure 1.

3.1. MobileViTv3 Block

In tile-defect detection, the model needs to analyze the tile surface to identify defect structures. To extract global information on defects during detection, the vision transformer has been proposed. However, the vision transformer requires self-attention calculations between every pair of pixels, has limitations in capturing location and spatial structure, and carries many parameters. Therefore, this paper introduced the lightweight MobileViTv3 block [18] to alleviate these drawbacks. The structure of the MobileViTv3 block is shown in Figure 2; it mainly comprises the local representation module, the global representation module, and the fusion module.
The shape of the input feature map was Cin × W × H, where Cin represented the number of input channels, and W and H represented the width and height of the feature map, respectively. Firstly, the local representation module processed the input feature map to extract features. This module comprised two sub-modules: depth-wise separable convolution (DWConv) and standard convolution (Conv). Depth-wise separable convolution performed convolution independently on each input channel, reducing the computational cost of the model, while the standard convolution integrated channel information and changed the feature dimensions. Secondly, the global representation module extracted the global information of the defects. This module consisted of a transformer module and a convolution module: the transformer performed global self-attention over the input feature map to capture its global information, and the convolution changed the feature dimension in preparation for concatenation in the fusion module. The fusion module concatenated the outputs of the local and global representation modules to fuse their information, capturing the overall characteristics of the defects while compensating for the transformer’s shortcomings in capturing location and spatial information. A convolution was then applied to the concatenated feature map to improve the expressive power of the model. Finally, the output of the fusion module was added to the original input feature map to obtain the final output feature map, with shape Cout × W × H.
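To make this flow concrete, below is a minimal PyTorch sketch of the local → global → fusion pipeline described above. The channel width, the single transformer encoder layer, the treatment of each spatial position as a token, and the 1 × 1 projection for the residual when Cin ≠ Cout are illustrative assumptions, not the exact configuration of the block in [18].

```python
import torch
import torch.nn as nn

class MobileViTv3Sketch(nn.Module):
    """Simplified local/global/fusion flow of a MobileViTv3-style block."""
    def __init__(self, c_in, c_out, dim=64):
        super().__init__()
        # Local representation: depth-wise conv, then 1x1 conv to change dimensions.
        self.local = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # DWConv
            nn.Conv2d(c_in, dim, 1),                           # Conv: channel mixing
        )
        # Global representation: transformer over spatial tokens, then 1x1 conv.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.proj = nn.Conv2d(dim, dim, 1)
        # Fusion: concatenate local and global features, then 1x1 conv.
        self.fuse = nn.Conv2d(2 * dim, c_out, 1)
        # Assumed 1x1 projection so the residual add works when c_in != c_out.
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        local = self.local(x)                              # (B, dim, H, W)
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)          # (B, H*W, dim) token sequence
        glob = self.encoder(tokens)                        # global self-attention
        glob = self.proj(glob.transpose(1, 2).reshape(b, c, h, w))
        out = self.fuse(torch.cat([local, glob], dim=1))   # fuse the two branches
        return out + self.skip(x)                          # add the original input back

x = torch.randn(1, 32, 20, 20)
print(MobileViTv3Sketch(32, 64)(x).shape)  # torch.Size([1, 64, 20, 20])
```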
Compared to the traditional direct use of the transformer module, the MobileViTv3 block embedded the transformer into the convolution pipeline, which significantly reduced the number of parameters. At the same time, the MobileViTv3 block used the transformer to replace the local modeling in convolution with global modeling, combining the spatial inductive bias of convolution with the global processing of ViT. As a result, the MobileViTv3 block possessed the properties of both CNN and ViT, alleviating convolution’s inability to capture long-range information and ViT’s shortcomings in capturing location and spatial structure. By introducing the MobileViTv3 block, the network could consider global information such as the shape, size, and position of tile defects, better capturing the image’s positional and spatial structure. This helped the model better understand the tiles’ overall shape and surface features, improving the accuracy and robustness of tile-defect detection and classification.

3.2. Light-BiFPN Neck

In order to improve the utilization of features at different scales while reducing model parameters and increasing detection speed, a lightweight neck network called Light-BiFPN was designed in this section. This network performs feature fusion and aims to maintain model accuracy while cutting parameters and accelerating detection.
In the feature fusion network of YOLOv5, the scale of the feature maps varied greatly, which could easily result in the loss of details or global information, affecting the model’s perception of fine-grained and holistic features. Additionally, multiple transmissions of the feature maps could lead to gradual blurring of the feature information, making it difficult for the model to accurately capture the features and positions of defects. Therefore, to improve the expressive power and accuracy of the model, it was necessary to introduce the semantic information of the original image. This paper referred to the bi-directional feature pyramid network [19] and introduced cross-scale connections to fuse input and output features from the same layer. Based on the structure of the YOLOv5 network, a three-channel BiFPN was constructed. By introducing the original semantic information, more features and information could be provided to compensate for the missing information in feature propagation, integrating high-resolution and high-semantic information and improving feature fusion’s accuracy and expressive power.
Several techniques were employed to reduce model parameters, memory consumption, computational burden, hardware costs, and usage thresholds. Firstly, in the downsampling steps of the network, max pooling was used instead of convolution to reduce computation; by selecting the maximum value, the most salient features in the input region were captured, enhancing feature discriminability. Secondly, cost-effective modules such as ghost shuffle convolution (GSConv) [20] and partial convolution (PConv) [21] were introduced. These modules achieved high detection accuracy with relatively few parameters and low computational cost, providing efficient computation and inference speed. The structures of ghost shuffle convolution and partial convolution are shown in Figure 3.
The GSConv module consisted of Conv, DWConv, concatenate (concat), and shuffle modules. Firstly, the convolution module extracted defect features and reduced the number of channels to lower the computational load. Secondly, depth-wise separable convolution performed a separate convolution on each channel, reducing the number of parameters. The feature maps from the two modules were then concatenated along the channel dimension to reach the desired number of output channels. Finally, the features from the two modules were fused through a shuffle operation to enhance feature diversity; the shuffle alleviated the weak feature extraction and fusion caused by the limited capability of depth-wise separable convolution. Partial convolution, on the other hand, divided the input features into two parts along the channel dimension: one part underwent convolution for feature extraction, while the other remained unchanged to reduce computation. Based on the GSConv and PConv modules, we redesigned the C3 module as PGSC3, as shown in Figure 4.
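The following is a compact PyTorch sketch of the two modules as described above: GSConv (Conv → DWConv → concat → shuffle) and PConv (convolve only a slice of the channels). The kernel sizes, the shuffle group count, and the 1/4 partial ratio follow common re-implementations and are assumptions, not values taken from [20,21].

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    # Interleave channels from the two branches to mix their features.
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class GSConvSketch(nn.Module):
    """Conv halves the channels, DWConv processes them cheaply, concat + shuffle fuses."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2), nn.BatchNorm2d(c_half), nn.SiLU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half), nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.conv(x)      # standard conv: extract features, reduce channels
        y2 = self.dwconv(y1)   # depth-wise conv: per-channel features, few params
        return channel_shuffle(torch.cat([y1, y2], dim=1))  # fuse the two branches

class PConvSketch(nn.Module):
    """Partial convolution: convolve a fraction of the channels, pass the rest through."""
    def __init__(self, c, ratio=0.25):
        super().__init__()
        self.c_conv = int(c * ratio)
        self.conv = nn.Conv2d(self.c_conv, self.c_conv, 3, 1, 1)

    def forward(self, x):
        x1, x2 = x[:, :self.c_conv], x[:, self.c_conv:]
        return torch.cat([self.conv(x1), x2], dim=1)  # untouched part saves computation

x = torch.randn(1, 32, 40, 40)
print(GSConvSketch(32, 64)(x).shape, PConvSketch(32)(x).shape)
```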

3.3. Lightweight Attention Mechanism

During the tile-production process, there are often defects that are difficult to identify: they are usually small, have weak features, provide limited information, or closely resemble the background, which makes them hard to distinguish and detection challenging. In addition, the depth-wise separable convolution used in GSConv can only perform convolution within each channel, limiting information interaction between channels and reducing expressive power.
Therefore, this paper introduced the normalization-based attention module (NAM) [22] into ghost shuffle convolution. NAM requires no additional calculations or parameters, such as fully connected layers or convolutions; instead, it directly uses the scaling factors in batch normalization to calculate attention weights. The resulting module is referred to as the GSAConv module, shown in Figure 5. Introducing channel attention addressed the lack of inter-channel correlation in depth-wise separable convolution, allowing the model to automatically learn the importance of different positions and adaptively adjust the feature map weights based on the characteristics of the input data. As a result, the model could focus on more important features, reduce the false-negative rate for minor defects, decrease interference from irrelevant information, and lower the false-positive rate.
Specifically, the normalization-based attention mechanism avoids fully connected and convolutional layers, reducing the number of parameters and the computational complexity. NAM measures channel variation through the scaling factors of batch normalization, as shown in Equation (1). A larger scaling factor indicates greater variation, richer information, and greater importance of the feature; conversely, a smaller scaling factor indicates less diverse feature information and lower importance. The structure of NAM is shown in Figure 6.
After calculating the attention weights w_r, the sigmoid function is applied to obtain the weight coefficients M_c. These coefficients are then multiplied with the original features to obtain new features that reflect the importance of the information. The calculation of the attention module based on batch normalization is shown in Equation (1).
$$B_{out} = BN(B_{in}) = r\,\frac{B_{in}-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}} + \beta, \qquad M_c = \mathrm{sigmoid}\big(w_r \cdot BN(F_1)\big), \qquad F_2 = M_c \otimes F_1 \quad (1)$$
In this equation, $\mu_B$ and $\sigma_B^2$ represent the mean and variance of the mini-batch, respectively; $r$ and $\beta$ are trainable affine transformation parameters; $F_1$ is the input feature map and $F_2$ the output feature map; and $w_r = r_i / \sum_j r_j$ represents the weight of each channel.
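As a sanity check of Equation (1), here is a minimal PyTorch sketch of the channel branch of NAM: the batch-normalization scaling factors are normalized into per-channel weights and gated through a sigmoid, with no extra fully connected or convolutional layers. Taking the absolute value of the scaling factors, to keep the weights non-negative, is an assumption.

```python
import torch
import torch.nn as nn

class NAMChannelSketch(nn.Module):
    """Channel attention from BN scaling factors (Equation (1)); no extra FC/conv layers."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, f1):
        x = self.bn(f1)                                  # B_out = BN(B_in)
        r = self.bn.weight.abs()                         # scaling factors r
        w_r = r / r.sum()                                # w_r = r_i / sum_j r_j
        mc = torch.sigmoid(x * w_r.view(1, -1, 1, 1))    # M_c = sigmoid(w_r * BN(F1))
        return f1 * mc                                   # F2 = M_c (x) F1

f1 = torch.randn(2, 16, 20, 20)
print(NAMChannelSketch(16)(f1).shape)  # torch.Size([2, 16, 20, 20])
```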

3.4. IoU

Intersection-over-union (IoU) is a metric used to measure the overlap between predicted and ground truth bounding boxes. In defect-detection models, intersection-over-union is often used as part of the bounding box regression loss function to improve the accuracy of the predicted boxes.
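For reference, the metric itself reduces to a few lines for axis-aligned boxes in (x1, y1, x2, y2) form; this is the standard definition rather than code from the paper.

```python
import torch

def box_iou(box1, box2, eps=1e-7):
    """IoU of two boxes given as (x1, y1, x2, y2) tensors."""
    # Intersection rectangle: overlap of the two boxes, clamped at zero.
    inter_w = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0)
    inter_h = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0)
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter + eps)

print(box_iou(torch.tensor([0., 0., 2., 2.]), torch.tensor([1., 1., 3., 3.])))  # 1/7 ≈ 0.1429
```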
The Complete-IoU bounding box regression loss used in YOLOv5 and many of its variants introduces additional geometric factors, leading to excessive penalties for low-quality samples and perturbing samples whose predicted boxes already overlap the target boxes well, thereby reducing the robustness of the model. Additionally, Complete-IoU does not balance hard-to-identify samples, causing the model to favor easy-to-classify samples while ignoring some hard ones, which reduces the model’s generalization ability. Therefore, this paper adopted the Wise-IoU [23] bounding box regression loss to address these issues, as shown in Equation (2).
$$
\begin{aligned}
L_{WIoUv3} &= r\,R_{WIoU}\,L_{IoU}, \qquad r = \frac{\beta}{\delta\,\alpha^{\beta-\delta}}, \qquad \alpha = 1.9,\ \delta = 3\\
\beta &= \frac{L_{IoU}^{*}}{\overline{L_{IoU}}}, \qquad \overline{L_{IoU}} \leftarrow (1-m)\,\overline{L_{IoU}} + m\,L_{IoU}^{*}, \qquad m = 1 - \sqrt[tn]{0.05},\ t = 60,\ n = 208\\
R_{WIoU} &= \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\big(W_g^2 + H_g^2\big)^{*}}\right)
\end{aligned}
\quad (2)
$$
In this equation, α and δ control the shape of the focusing-coefficient curve and adjust the weights of complex samples once the loss value stabilizes, and * indicates that the variable is detached from the computational graph. The momentum $m = 1 - \sqrt[tn]{0.05}$ adjusts the overall magnitude of the focusing coefficient. $x$, $y$, $x_{gt}$, and $y_{gt}$ denote the center coordinates of the predicted box and the ground truth box, and $W_g$ and $H_g$ denote the width and height of the minimum enclosing rectangle of the predicted and ground truth boxes.
Specifically, Wise-IoU constructs a dual-layer attention mechanism: it uses $R_{WIoU} \in [1, e]$ to amplify the error of ordinary predicted boxes and $L_{IoU} \in [0, 1]$ to reduce the error of high-quality predicted boxes. When a predicted box overlaps well with the object box, the loss focuses on the distance between their center points.
Furthermore, β is introduced to increase the weight of difficult-to-classify defects in the loss function, focusing training on challenging defects and reducing the model’s classification errors. Based on β, a non-monotonic focusing coefficient r is constructed to assign different gradient gains. Predicted boxes with smaller β values are of higher quality and closer to the actual target boxes, so they are assigned smaller gradient gains and receive more minor corrections, helping the model quickly learn from high-quality predicted boxes. Predicted boxes with larger β values typically have lower quality and less overlap with the target boxes; they are also assigned smaller gradient gains, preventing the model from overly focusing on low-quality predicted boxes and reducing their harmful impact on bounding box regression. This allows the model to pay more attention to ordinary and higher-quality predicted boxes, reducing reliance on low-quality ones and thereby improving the model’s robustness and generalization ability.
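To show how the pieces of Equation (2) interact in training code, below is an illustrative PyTorch reading of Wise-IoU v3: the running mean of L_IoU is held outside the computational graph, β is computed from detached values, and the denominator of R_WIoU is detached, as marked by * in Equation (2). The batched IoU helper and the exact update schedule of the running mean are assumptions; this is a sketch of [23], not the reference implementation.

```python
import torch

def batched_iou(a, b, eps=1e-7):
    # Element-wise IoU for (N, 4) boxes in (x1, y1, x2, y2) form.
    inter_w = (torch.min(a[:, 2], b[:, 2]) - torch.max(a[:, 0], b[:, 0])).clamp(0)
    inter_h = (torch.min(a[:, 3], b[:, 3]) - torch.max(a[:, 1], b[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + eps)

class WiseIoUv3Sketch:
    """Illustrative WIoU v3 loss following Equation (2): L = r * R_WIoU * L_IoU."""
    def __init__(self, alpha=1.9, delta=3.0, t=60, n=208):
        self.alpha, self.delta = alpha, delta
        self.m = 1 - 0.05 ** (1 / (t * n))  # momentum m = 1 - (0.05)^(1/(t*n))
        self.iou_mean = 1.0                 # running mean of L_IoU, outside the graph

    def __call__(self, pred, target):
        l_iou = 1.0 - batched_iou(pred, target)
        # Distance attention over the minimum enclosing box; denominator detached (*).
        cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        cxg, cyg = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
        wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
        hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
        r_wiou = torch.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2)
                           / (wg ** 2 + hg ** 2).detach())
        # Outlier degree beta (detached) and non-monotonic focusing coefficient r.
        beta = l_iou.detach() / self.iou_mean
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        # Update the running mean of L_IoU with momentum m.
        self.iou_mean = (1 - self.m) * self.iou_mean + self.m * l_iou.detach().mean().item()
        return (r * r_wiou * l_iou).mean()

loss_fn = WiseIoUv3Sketch()
pred = torch.tensor([[0.0, 0.0, 2.0, 2.0]], requires_grad=True)
gt = torch.tensor([[0.5, 0.5, 2.5, 2.5]])
loss_fn(pred, gt).backward()  # gradients flow through L_IoU and R_WIoU only
```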
By using the Wise-IoU loss function to address the issues of Complete-IoU, the model can obtain more accurate predicted bounding boxes, accelerate the training process, improve training efficiency, and enhance the network’s performance.

4. Experimental Results and Analyses

4.1. Dataset

The dataset used in this study was sourced from the public datasets on the PaddlePaddle platform, originally maintained by the Tianchi Data Science Team [24]. It comprised 3613 images stored in JPG format and in standard RGB format. The images were captured under two different lighting conditions and encompassed eight distinct types of defects: white-dot defects, light-block defects, dark-block defects, angle defects, edge defects, aperture defects, marker-pen defects, and scratch defects. White-dot defects referred to small, localized, typically white spots on the tile surface. Light-block defects involved irregular, light-colored patterns or blocks on the tile surface; dark-block defects similarly featured irregular patterns or blocks, but dark in color. Angle defects pertained to issues at the corners of ceramic tiles, such as chipping, cracking, or other irregularities. Edge defects referred to irregularities along the edges of ceramic tiles, including chipping, cracking, or abnormalities along the tile’s perimeter. Aperture defects involved irregularly shaped or sized openings or holes in the ceramic tiles. Marker-pen defects typically consisted of marks or stains made by marker pens on the tile surface. Scratch defects encompassed scratches or abrasions on the tile surface. White-dot defects were the most prevalent, totaling 1868 instances, whereas angle defects were less frequent, with only 410 instances.
In the YOLOv5 model, the input image size was 640 pixels × 640 pixels, whereas most images in the dataset measured 2000 pixels × 2000 pixels. Simply resizing the original images to a fixed dimension could lose information, especially for minor defects, while feeding the original images directly into the detection model could easily cause memory overflow and hinder training. Therefore, inspired by the approach used in YOLT [25], this paper adopted an offline sliding-window cropping method: images were divided into 640 × 640 windows from left to right and top to bottom, with an overlap ratio of 0.2 between adjacent windows to ensure that defects at window borders were detected completely, resulting in a final dataset of 6712 images.
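A minimal sketch of this offline sliding-window cropping is given below: 640 × 640 windows scanned left to right and top to bottom with a 0.2 overlap ratio (an effective stride of 512 pixels), with the last window snapped to the image edge. The edge snapping and the omitted label remapping are assumptions about the exact scheme used in the paper.

```python
from PIL import Image

def window_starts(total, win, step):
    starts = list(range(0, max(total - win, 0) + 1, step))
    if starts[-1] + win < total:       # snap a final window to the image edge
        starts.append(total - win)
    return starts

def sliding_crops(image_path, win=640, overlap=0.2):
    """Yield win x win crops, left to right and top to bottom, with the given overlap."""
    img = Image.open(image_path)
    w, h = img.size
    step = int(win * (1 - overlap))    # 0.2 overlap -> 512-pixel stride
    for top in window_starts(h, win, step):
        for left in window_starts(w, win, step):
            yield img.crop((left, top, left + win, top + win))

# A 2000 x 2000 tile image yields a 4 x 4 grid of overlapping 640 x 640 crops.
```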
Given the imbalance among the types of ceramic-tile defects, data-augmentation techniques such as horizontal flipping and mirror flipping were applied to images of the under-represented defect types. Additionally, the copy-paste technique [26] was employed to expand the number of minor defects: small defect samples were duplicated and pasted into non-defective areas to increase the count of minor ceramic-tile defects. Since the copy-paste method primarily caters to segmentation tasks requiring pixel-level annotation, this paper directly overlaid defect regions onto non-defective areas for the detection task. Random outward expansion was applied to the cropped areas to prevent the model from merely learning the boundaries of the pasted defects. Finally, the average size and standard deviation of each defect type in the dataset were computed, as presented in Table 1.
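Below is a toy NumPy sketch of this detection-oriented copy-paste: a defect patch is expanded outward by a random margin (so the model does not learn the paste boundary) and overlaid onto another location, emitting a new bounding box. The placement policy and margin size are illustrative assumptions; a real pipeline would also verify that the target region is defect-free.

```python
import random
import numpy as np

def copy_paste(img, box, expand=8):
    """Paste the defect patch in `box` (x1, y1, x2, y2) onto a random location of `img`.

    Returns the augmented image and the new bounding box. Assumes the target
    region is defect-free; a real pipeline would check overlaps with existing labels.
    """
    h, w = img.shape[:2]
    x1, y1, x2, y2 = box
    # Random outward expansion so the model does not learn the paste boundary.
    x1, y1 = max(x1 - random.randint(0, expand), 0), max(y1 - random.randint(0, expand), 0)
    x2, y2 = min(x2 + random.randint(0, expand), w), min(y2 + random.randint(0, expand), h)
    patch = img[y1:y2, x1:x2].copy()
    ph, pw = patch.shape[:2]
    nx, ny = random.randint(0, w - pw), random.randint(0, h - ph)
    out = img.copy()
    out[ny:ny + ph, nx:nx + pw] = patch      # overlay the defect region
    return out, (nx, ny, nx + pw, ny + ph)

img = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)
aug, new_box = copy_paste(img, (100, 120, 140, 150))
```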
To better assess the model’s robustness and generalization, the processed images were partitioned according to defect types and lighting conditions and randomly split into training and validation sets at a ratio of 8:2. The training set comprised 6674 images and the validation set 1680 images.

4.2. Experimental Setup and Training Process

The experimental environment for this study consisted of an NVIDIA TITAN V GPU and the CentOS 7 operating system. Python was used as the programming language with the PyTorch deep learning framework. All models were trained from scratch without loading pre-trained weights, each for 200 epochs. Stochastic gradient descent was chosen as the optimizer to help prevent the models from getting stuck in local optima. The initial learning rate was set to 0.01, with a momentum of 0.937, a weight decay coefficient of 0.0005, and a batch size of 32. The loss curves of the training process are depicted in Figure 7: the loss value decreased gradually as training progressed, and the improved model converged faster than the original model. These results suggest that the improvement strategy and parameter settings of this study are reasonable.
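The hyperparameters above translate directly into a PyTorch optimizer configuration; the sketch below uses a placeholder module in place of the MCAW-YOLO network, and the learning-rate schedule, which the paper does not specify, is omitted.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)  # placeholder for the MCAW-YOLO network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,             # initial learning rate
    momentum=0.937,      # SGD momentum
    weight_decay=0.0005  # weight decay coefficient
)
EPOCHS, BATCH_SIZE = 200, 32
```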

4.3. Evaluation Metrics

Commonly used detection metrics were employed to evaluate the algorithm’s performance. Mean average precision (mAP) assessed the model’s accuracy improvements, while the number of model parameters, the computational complexity (GFLOPs), and frames per second (FPS) evaluated deployment efficiency: parameters and computational complexity gauge the improved model’s weight footprint and computational burden, and FPS measures processing speed. Together, these metrics provide a comprehensive assessment of the model’s detection efficacy.
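Parameter count and FPS, two of the deployment metrics above, can be measured as in the sketch below; computational complexity (GFLOPs) typically requires a profiling tool and is not shown. The input size and warm-up length are assumptions.

```python
import time
import torch

def count_parameters_m(model):
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, size=640, runs=100):
    """Average single-image inference throughput on the current device."""
    model.eval()
    x = torch.randn(1, 3, size, size)
    for _ in range(10):               # warm-up iterations
        model(x)
    start = time.time()
    for _ in range(runs):
        model(x)
    return runs / (time.time() - start)
```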

4.4. Ablation Experiments

To validate the effectiveness of the proposed improvements, this section designed ablation experiments for comparative analysis. These experiments evaluated the impact of different enhancement strategies on the YOLOv5 model, as shown in Table 2. Specifically, experiments 2 and 3 introduced modules into the model that offered improved efficiency. While achieving nearly identical mAP values, they reduced parameters by approximately 30% and enhanced detection speed. In experiment 4, Light-BiFPN was employed as the feature fusion module, while experiment 5 introduced the MobileViTv3 module to the backbone network. Experiment 6 replaced the original loss function with Wise-IoU.
In experiment 7, the lightweight MobileViTv3 was introduced after the neck improvement. Although the parameter count differed slightly from experiment 4 (by 0.08 M), compared to the base model this configuration reduced parameters by 26.5% and computational complexity by 27.9%, while the mAP value increased by 1.5%, validating the performance gain of the improvements. Experiment 8, building on experiment 7, introduced the lightweight NAM attention mechanism; with minimal parameter increase, the mAP value improved by 2.6% over the base model. Finally, introducing the Wise-IoU loss function raised the mAP value by 3.1% over the original model at the cost of a slight decrease in detection speed, while model parameters and computational complexity decreased by 26.5% and 20.9%, confirming the effectiveness of combining Light-BiFPN, MobileViTv3, NAM, and Wise-IoU. Overall, the improved YOLOv5n network achieved a lightweight model with enhanced accuracy, making it feasible for deployment on edge terminal devices.
In order to visually demonstrate the effectiveness of the proposed improvement algorithm, several representative ceramic-tile defects were selected for detection. The visualization of the results is shown in Figure 8. The improved algorithm’s detection outcomes were more accurate, enhancing precision and reducing missed detection rates.

4.5. Comparative Experiments

To validate the rationality of the designed Light-BiFPN as a feature fusion mechanism, this section compared Light-BiFPN with relatively mainstream lightweight networks [27,28,29,30,31,32,33] used as replacement backbone networks. The results are shown in Table 3.
The data in Table 3 showed that Light-BiFPN-YOLOv5n outperformed the base model in all aspects. Among the other lightweight networks, ShuffleNetv2 [30] had the fewest parameters and the least computational complexity, but it incurred a significant accuracy loss and exhibited the poorest performance in the comparative experiments: its mAP value was 14.4% lower than Light-BiFPN’s. PPLCNet [31], although it achieved the fastest detection speed, had relatively low detection accuracy, which could not guarantee satisfactory detection outcomes. RepghostNet [33] performed best among the compared lightweight backbone networks in terms of detection accuracy; however, its mAP value was still 1.8% lower than Light-BiFPN’s, and it was also at a disadvantage in parameter count, computational complexity, and detection speed.
Light-BiFPN achieved the highest detection accuracy among the compared lightweight networks, with its detection speed falling behind only PPLCNet [31]. As Table 3 shows, the Light-BiFPN feature fusion structure achieved a favorable balance among detection accuracy, model parameters, and detection speed, indicating that the designed Light-BiFPN structure is well suited as a fusion network.
In order to validate the effectiveness of using Wise-IoU, a comparison was conducted with other loss functions [34,35,36] on the YOLOv5n model. As seen from Table 4, employing the Wise-IoU loss function holds a notable advantage in enhancing accuracy compared to other loss functions. Compared to the base model, only a marginal sacrifice in detection speed results in a 1% increase in the model’s mAP value.
To further validate the effectiveness of the proposed algorithm, this section compared it with other YOLO-series algorithms; the results are presented in Table 5. Compared to the deepened-network detection model YOLOv5s+ [37], the proposed ceramic-tile defect-detection algorithm exhibited favorable detection performance with far fewer parameters and lower computational complexity. Table 5 also showed that the improved model had the smallest parameter count and computational complexity. Compared to YOLOv3-tiny, YOLOv4-tiny, and YOLOv7-tiny, the mAP values improved by 7.3%, 6.1%, and 8.2%, respectively, while the parameter counts decreased by 85%, 77.9%, and 78.4%, and the computational complexity decreased by 73.8%, 79%, and 74.2%, respectively. This indicated that the proposed algorithm possesses significant advantages in detection performance. Compared to YOLOv6n and YOLOv8n, the proposed algorithm’s mAP was lower by 1.3% and 0.9%, respectively; however, it outperformed both in parameters and computational complexity. Hence, the algorithm proposed in this paper demonstrated feasibility and practicality for real-world applications.

4.6. Model Generalization

To verify the generalization of the proposed model for ceramic-tile-defect detection tasks, the improved model was trained and tested on the magnetic-tile dataset [38] open-sourced by the Chinese Academy of Sciences Institute of Automation. A comparison was also made with more recent detection algorithms, and the experimental results were shown in Table 6.
Table 6 showed that the improved model achieved a mAP of 80.4% on the magnetic-tile dataset, a 3.6% improvement over the base model. While this was lower than YOLOv6n and YOLOv8n, the model had significantly fewer parameters and lower computational complexity. The experimental results demonstrated the proposed model’s strong generalization for tile-defect detection.

5. Conclusions

In response to the challenges in ceramic-tile surface-defect detection, this paper proposed an MCAW-YOLO defect-detection algorithm based on YOLOv5. The algorithm aimed to address the issues of low detection accuracy, high model parameters, and computational complexity in ceramic-tile defect-detection models. The proposed approach incorporated the MobileViTv3 module to extract global information, model spatial relationships among defects, and better capture features at varying scales and abstraction levels to enhance model performance. Furthermore, a Light-BiFPN structure was designed to improve multi-scale fusion networks, enhancing the utilization of multi-scale semantic features, augmenting the network’s understanding of defect semantics, and reducing model parameters. Subsequently, a lightweight NAM attention mechanism was introduced to enhance the focus on defects, reducing missed detection and false positives. Finally, using the Wise-IoU loss function enhanced the model’s ability to learn from challenging samples. Through comparative experiments on ceramic-tile surface-defect datasets, the detection accuracy of the proposed improved algorithm reached 71.9%, representing a 3.1% improvement over the original model. Simultaneously, model parameters and computational complexity decreased by 26.5% and 20.9%, respectively.
In this paper, we conducted offline segmentation of images to enhance the model’s ability to detect small defect objects. However, this approach was prone to losing defect labels in boundary regions and introducing noise, thus preventing the model from achieving satisfactory results. In future work, we will further refine the methods to enhance the detection accuracy of the model. Additionally, we plan to collect images of tiles with various defects and multicolored patterns to enrich the dataset and improve the model’s performance.

Author Contributions

Conceptualization, X.Y. and Q.M.; methodology, X.Y.; software, X.Y.; validation, X.Y., Q.M. and J.X.; formal analysis, Z.H.; investigation, Z.H.; resources, X.Y.; data curation, J.X.; writing—original draft preparation, X.Y.; writing—review and editing, X.Y., Q.Y. and Q.M.; visualization, Z.H.; supervision, Q.Y.; project administration, Q.Y.; funding acquisition, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was financially supported by: Ningxia Key Research and Development Program (Key Project) (2023BDE02001); Ningxia Key Research and Development Program (Talent Introduction Special Project) (2022YCZX0013); North Minzu University 2022 School-level Research Platform, “Digital Agriculture Empowerment for Ningxia Rural Revitalization Innovation Team”, Project number 2022PT_S10; Yinchuan City School-Enterprise Joint Innovation Project (2022XQZD009); and “Innovation Team for Imaging and Intelligent Information Processing” of the National Ethnic Affairs Commission.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in PaddlePaddle at [https://aistudio.baidu.com/datasetoverview, accessed on 28 August 2023].

Acknowledgments

The author thanks the School of Computer Science and Engineering of North Minzu University for providing equipment support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hanzaei, S.H.; Afshar, A.; Barazandeh, F. Automatic detection and classification of the ceramic tiles’ surface defects. Pattern Recognit. 2017, 66, 174–189. [Google Scholar] [CrossRef]
  2. Shire, A.N.; Khanapurkar, M.M.; Mundewadikar, R.S. Plain ceramic tiles surface defect detection using image processing. In Proceedings of the 2011 Fourth International Conference on Emerging Trends in Engineering & Technology, Port Louis, Mauritius, 18–20 November 2011; pp. 215–220. [Google Scholar]
  3. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  4. Ren, S.; He, K.; Girshick, R. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  5. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef]
  6. Zhang, H.; Ji, S.; Shao, M.; Pu, H.; Zhang, L. Non-destructive Internal Defect Detection of In-Shell Walnuts by X-ray Technology Based on Improved Faster R-CNN. Appl. Sci. 2023, 13, 7311. [Google Scholar] [CrossRef]
  7. Xu, Z.; Lan, S.; Yang, Z.; Cao, J.; Wu, Z.; Cheng, Y. MSB R-CNN: A Multi-Stage Balanced Defect Detection Network. Electronics 2021, 10, 1924. [Google Scholar] [CrossRef]
  8. Zhu, H.; Wang, Y.; Fan, J. IA-Mask R-CNN: Improved Anchor Design Mask R-CNN for Surface Defect Detection of Automotive Engine Parts. Appl. Sci. 2022, 12, 6633. [Google Scholar] [CrossRef]
  9. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  10. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2–7. [Google Scholar]
  11. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.Y.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  12. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  13. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2023, arXiv:2305.09972. [Google Scholar]
  14. Li, X.; Wang, C.; Ju, H. Surface defect detection model for aero-engine components based on improved YOLOv5. Appl. Sci. 2022, 12, 7235. [Google Scholar] [CrossRef]
  15. Kang, Z.; Jiang, W.; He, L.; Zhang, C. A Novel DME-YOLO Structure in a High-Frequency Transformer Improves the Accuracy and Speed of Detection. Electronics 2023, 12, 3982. [Google Scholar] [CrossRef]
  16. Zheng, J.; Wu, H.; Zhang, H.; Wang, Z.; Xu, W. Insulator-Defect Detection Algorithm Based on Improved YOLOv7. Sensors 2022, 22, 8801. [Google Scholar] [CrossRef]
  17. Wang, R.; Liang, F.; Wang, B.; Mou, X. ODCA-YOLO: An Omni-Dynamic Convolution Coordinate Attention-Based YOLO for Wood Defect Detection. Forests 2023, 14, 1885. [Google Scholar] [CrossRef]
  18. Wadekar, S.N.; Chaurasia, A. Mobilevitv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. arXiv 2022, arXiv:2209.15159. [Google Scholar]
  19. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  20. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
  21. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  22. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based attention module. arXiv 2021, arXiv:2111.12419. [Google Scholar]
  23. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  24. PaddlePaddle Baidu, Ceramic Tile Defect Detection Data Set. Available online: https://aistudio.baidu.com/datasetoverview (accessed on 15 June 2022).
  25. Van Etten, A. You only look twice: Rapid multi-scale object detection in satellite imagery. arXiv 2018, arXiv:1805.09512. [Google Scholar]
  26. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
  27. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  28. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  29. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  30. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  31. Cui, C.; Gao, T.; Wei, S.; Du, Y.; Guo, R.; Dong, S.; Lu, B.; Zhou, Y.; Lv, X.; Liu, Q.; et al. PP-LCNet: A lightweight CPU convolutional neural network. arXiv 2021, arXiv:2109.15099. [Google Scholar]
  32. Zhou, D.; Hou, Q.; Chen, Y.; Feng, J.; Yan, S. Rethinking bottleneck structure for efficient mobile network design. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer International Publishing: Cham, Switzerland, 2020; pp. 680–697. [Google Scholar]
  33. Chen, C.; Guo, Z.; Zeng, H.; Xiong, P.; Dong, J. RepGhost: A Hardware-Efficient Ghost Module via Reparameterization. arXiv 2022, arXiv:2211.06088. [Google Scholar]
  34. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  35. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  36. Gevorgyan, Z. SIoU Loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  37. Wan, G.; Fang, H.; Wang, D.; Yan, J.; Xie, B. Ceramic tile surface defect detection based on deep learning. Ceram. Int. 2022, 48, 11085–11093. [Google Scholar] [CrossRef]
  38. Huang, Y.; Qiu, C.; Yuan, K. Surface defect saliency of magnetic tile. Vis. Comput. 2020, 36, 85–96. [Google Scholar] [CrossRef]
Figure 1. MCAW-YOLO architecture.
Figure 2. MobileViTv3 block architecture.
Figure 3. Structures of GSConv and PConv.
Figure 4. PGSC3 module.
Figure 5. Structure of GSAConv module.
Figure 6. Normalization-based attention module.
Figure 7. Model loss curve.
Figure 8. Comparison of detection results between YOLOv5n and MCAW-YOLO.
Table 1. The average size and standard deviation of each category of defects.

| Defect Category | Average Width/Pixel | Average Height/Pixel | Standard Deviation Width/Pixel | Standard Deviation Height/Pixel |
|---|---|---|---|---|
| edge defect | 63.1 | 78.5 | 57.6 | 60.5 |
| angle defect | 73.1 | 54.2 | 79.9 | 74.6 |
| white-dot defect | 10.4 | 10.6 | 4.4 | 4.2 |
| light-block defect | 19.2 | 20.8 | 17.4 | 23.2 |
| dark-block defect | 12.5 | 13.4 | 9.5 | 10.7 |
| aperture defect | 32.5 | 33.7 | 18.1 | 19.9 |
| marking-pen defect | 181.3 | 234.4 | 142.3 | 170.2 |
| scratch defect | 98 | 65.9 | 121.5 | 88.2 |
Table 2. Results of ablation experiments with different modification methods.

| Number | Model | Parameters/M | GFLOPs | mAP@0.5/% | FPS |
|---|---|---|---|---|---|
| 1 | Baseline (YOLOv5n) | 1.77 | 4.3 | 68.8 | 158 |
| 2 | Baseline + GSConv | 1.23 | 3.1 | 67 | 163 |
| 3 | Baseline + GSConv + PGSC3 | 1.22 | 3.1 | 68.9 | 166 |
| 4 | Baseline + Light-BiFPN | 1.25 | 3.3 | 69.4 | 161 |
| 5 | Baseline + MobileViTv3 | 1.79 | 4.4 | 69.9 | 149 |
| 6 | Baseline + Wise-IoU | 1.77 | 4.3 | 69.8 | 155 |
| 7 | Baseline + Light-BiFPN + MobileViTv3 | 1.3 | 3.4 | 70.3 | 152 |
| 8 | Baseline + Light-BiFPN + MobileViTv3 + NAM | 1.3 | 3.4 | 71.4 | 147 |
| 9 | Baseline + Light-BiFPN + MobileViTv3 + NAM + Wise-IoU | 1.3 | 3.4 | 71.9 | 142 |
Table 3. The comparative experiments of Light-BiFPN and different lightweight models.

| Model | Parameters/M | GFLOPs | mAP@0.5/% | FPS |
|---|---|---|---|---|
| GhostNet-YOLOv5n [27] | 1.18 | 2.9 | 66.7 | 140 |
| MobileNetv3-YOLOv5n [28] | 1.3 | 2.2 | 61.4 | 120 |
| EfficientNet-YOLOv5n [29] | 0.94 | 2.2 | 64.3 | 126 |
| ShuffleNetv2-YOLOv5n [30] | 0.6 | 1.3 | 55 | 158 |
| PPLCNet-YOLOv5n [31] | 0.76 | 1.6 | 60 | 172 |
| MobileNeXt-YOLOv5n [32] | 1.14 | 2.4 | 59.4 | 147 |
| RepghostNet-YOLOv5n [33] | 1.53 | 3.6 | 67.6 | 147 |
| Light-BiFPN-YOLOv5n | 1.28 | 3.3 | 69.4 | 161 |
| YOLOv5n | 1.77 | 4.3 | 68.8 | 158 |
Table 4. Comparison of results with different IoU losses under the same model.

| Model | IoU | mAP@0.5/% | FPS |
|---|---|---|---|
| YOLOv5n | Complete-IoU [34] | 68.8 | 158 |
| YOLOv5n | EIoU [35] | 67.6 | 156 |
| YOLOv5n | SIoU [36] | 69.1 | 160 |
| YOLOv5n | Wise-IoU [23] | 69.8 | 155 |
Table 5. Comparison of different model results.

| Model | Parameters/M | GFLOPs | mAP@0.5/% | FPS |
|---|---|---|---|---|
| YOLOv3-tiny [9] | 8.68 | 13 | 64.6 | 169 |
| YOLOv4-tiny [10] | 5.89 | 16.2 | 65.8 | 153 |
| YOLOv6n [11] | 4.47 | 10.8 | 73.2 | 119 |
| YOLOv7-tiny [12] | 6.03 | 13.2 | 63.7 | 116 |
| YOLOv8n [13] | 3.01 | 8.2 | 72.8 | 154 |
| YOLOv5s+ [37] | 12.63 | 14.9 | 70.8 | 96 |
| MCAW-YOLO | 1.3 | 3.4 | 71.9 | 142 |
Table 6. Experimental comparison on the magnetic-tile dataset.

| Model | Parameters/M | GFLOPs | mAP@0.5/% | FPS |
|---|---|---|---|---|
| YOLOv5n | 1.77 | 4.3 | 76.8 | 153 |
| YOLOv6n [11] | 4.47 | 10.8 | 82.1 | 126 |
| YOLOv7-tiny [12] | 6.03 | 13.2 | 66.8 | 122 |
| YOLOv8n [13] | 3.01 | 8.2 | 85.3 | 150 |
| MCAW-YOLO | 1.3 | 3.4 | 80.4 | 140 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
