CJS-YOLOv5n: A high-performance detection model for cigarette appearance defects

Abstract: In tobacco production, cigarettes with appearance defects are inevitable and dramatically impact the quality of tobacco products. Currently available methods fail to balance detection accuracy against speed. To achieve accurate detection on a cigarette production line running at 200 cigarettes per second, we propose a defect detection model for cigarette appearance based on YOLOv5n (You Only Look Once Version 5 Nano), called CJS-YOLOv5n (YOLOv5n with C2F (Cross Stage Partial (CSP) Bottleneck with 2 convolutions-fast), Jump Concat, and SCYLLA-IoU (SIoU)). This model incorporates the C2F module proposed in the state-of-the-art object detection network YOLOv8 (You Only Look Once Version 8). This module optimizes the network by parallelizing additional gradient flow branches, enhancing the model's feature extraction capability and obtaining richer gradient information. Furthermore, this model uses Jump Concat to preserve minor defect feature information during the fusion process in the feature fusion pyramid's P4 layer. Additionally, this model integrates the SIoU localization loss function to improve localization accuracy and detection precision. Experimental results demonstrate that our proposed CJS-YOLOv5n model achieves superior overall performance. It maintains a detection speed of over 500 FPS (frames per second) while increasing the recall rate by 2.3% and mAP (mean average precision)@0.5 by 1.7%. The proposed model is suitable for application in high-speed cigarette production lines.


Introduction
The quality of tobacco products is an essential indicator of a tobacco enterprise's production level and brand establishment. Cigarettes inevitably acquire appearance defects on the production line, such as stains, scratches and wrinkles. Tobacco companies therefore urgently need strict quality control measures to prevent cosmetically defective products from entering the sales market. In the past, manual inspection was the main method for cigarette factories to detect appearance defects. However, this method suffers from strong subjectivity, slow detection speed, low efficiency and a high missed-detection rate, and it cannot meet the high-speed, high-precision detection requirements of modern industry. There is therefore an urgent need to develop an efficient, accurate and cost-effective automated detection method to replace manual inspection.
Product appearance defect detection based on conventional computer vision algorithms often relies on template matching to identify defect locations, entailing complications in template design and yielding suboptimal accuracy. With the advancements in deep learning technology, deep learning methods have also been harnessed for product appearance defect detection, resulting in enhanced detection accuracy.
Product appearance defect detection represents a specific application of object detection models. Based on deep learning, object detection models can be broadly categorized into two types: two-stage detectors and one-stage detectors. Two-stage detectors, exemplified by region-based convolutional neural networks (RCNN) [1] and Faster-RCNN [2,3], and one-stage detectors, represented by You Only Look Once (YOLO) [4][5][6][7] and Single Shot MultiBox Detector (SSD) [8], have been employed in product appearance defect detection. For instance, Liu et al. [9] utilized an improved YOLOv5 model to detect surface defects on diode metal bases, incorporating attention mechanisms and the K-means++ clustering algorithm, achieving mAP@0.5 of 84.0%. Chen et al. [10] applied an improved YOLOv3 model to detect surface defects on surface-mounted device (SMD) LED chips, replacing the backbone network with DenseNet and achieving mAP@0.5 of 95.28%. Hu and Wang [11] employed an improved Faster R-CNN model to detect surface defects on printed circuit boards (PCB), replacing the model's backbone network and feature fusion pyramid, achieving mAP@0.5 of 94.2%. Li and Xu [12] used an improved SSD model for appearance defect detection on electronic products, replacing the backbone network with MobileNet and optimizing parameters to achieve mAP@0.5 of 88.6%. Duan et al. [13] utilized an improved YOLOv3 model for surface defect detection in casting, introducing dual-density convolutions and low-resolution detection layers to achieve mAP@0.5 of 88.02%. Kou et al. [14] employed an improved YOLOv3 model for surface defect detection on steel strips, modifying the detection head to Anchor-Free and achieving mAP@0.5 of 71.3% on the GC10-DET dataset.
In defect detection research for cigarette appearance, Yuan et al. [15] employed an improved ResNeSt model, achieving a classification accuracy of 94.01% through transfer learning, multi-scale learning methods and modified activation functions. Qu et al. [16] utilized an improved SSD model for cigarette appearance defect detection, achieving mAP@0.5 of 90.26% by improving the pyramid convolution. However, the detection speed still fell short of 100 frames per second (FPS). Liu and Yuan [17] employed an improved YOLOv5s model for cigarette appearance defect detection, achieving mAP@0.5 of 94.0% by introducing attention mechanisms and modifications to the loss function. Nonetheless, the recall rate reached only 86.8%. Yuan et al. [18] employed an improved YOLOv4 model for cigarette appearance defect detection, achieving mAP@0.5 of 91.46% through introducing attention mechanisms, dilated pyramid spatial pooling and modifications to anchor clustering algorithms and loss functions. However, the recall rate remained at 88.81%. Liu et al. [19] utilized an improved CenterNet model for cigarette appearance defect detection, achieving mAP@0.5 of 95.01% by introducing attention mechanisms, deformable convolutions, feature fusion pyramids and modifications to activation functions and data augmentation methods. Nonetheless, the recall rate reached only 85.96%.
Although there has been considerable research on industrial appearance defect detection using deep learning, studies specifically addressing cigarette appearance defect detection remain limited. Existing research on cigarette appearance defect detection still suffers from low detection recall rates and slow detection speeds. With the advancement of automation in tobacco production, cigarette production lines can now achieve speeds of up to 200 cigarettes per second, posing challenges for real-time detection using existing methods. To address this practical demand, we propose a real-time defect detection method for cigarette appearance based on YOLOv5n. Building upon the YOLOv5 model, our approach incorporates the C2F convolution module, Jump Concat fusion pyramid and SIoU loss function. Experimental results demonstrate that our improved model enhances the precision and recall rate of cigarette appearance defect detection, effectively reducing the presence of defective cigarettes in the consumer market. This model provides robust support for tobacco companies in improving the quality of their cigarette products.

Dataset
The cigarette images used in our experiments are from the Yunnan Branch of China Tobacco Industry Company Limited. High-speed industrial cameras capture the images on the automated production line. The front and back of each cigarette are captured at different positions on the production line. A standard cigarette is 84 mm in length and 7.8 mm in diameter.
Cigarettes are primarily composed of two parts: the cigarette rod, depicted as the white portion on the left side in Figure 1, and the filter, represented by the dark portion on the right side. During cigarette manufacturing, excessive tobacco stems can puncture cigarette rod packaging during rolling [20]. During the twisting process, a certain amount of pressure is applied to ensure a secure bond between the tipping paper and the cigarette rod. However, variations in filter elasticity, roundness of non-filter cigarettes, adhesive properties of latex and the absorptive characteristics of the tipping paper can result in insufficient bonding, misalignment, filter detachment, as well as folding, creasing and misalignment of the tipping paper [21]. Furthermore, improper printing on the cigarette packaging paper or staining during production and transportation can result in visual contamination of the cigarette's appearance. To improve the detection of cigarette appearance defects, we categorized defective cigarettes into seven types: misplacement, stain, scratched, wrinkle, unpacked, no filter and bend, as shown in Figure 2.

Methods
YOLOv5 comprises five network variants: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x. These models share the same network architecture but differ in module depth and width.
YOLOv5n and YOLOv5s have the same depth, with YOLOv5n being the shallowest among the YOLOv5 series. To improve the network's runtime on small-scale devices, YOLOv5n halves the network's width compared to YOLOv5s, significantly reducing the scale of parallel computations and enhancing runtime performance on low-capacity devices [22]. To meet the real-time detection requirements and improve recognition speed, we select YOLOv5n as the baseline model.
YOLOv5 comprises four components: input, backbone network, neck network and detection head. The input component incorporates mosaic data augmentation, adaptive anchor calculation and adaptive image scaling. The backbone network performs feature extraction. The neck network employs a top-down and bottom-up feature fusion pathway to merge image features. The detection head includes three layers corresponding to feature maps of sizes 80 × 80, 40 × 40 and 20 × 20 for detecting objects of different scales. Finally, the Complete IoU (CIoU [23]) loss function is utilized to calculate the distance between predicted and ground truth bounding boxes, and non-maximum suppression (NMS) is applied to remove redundant boxes while retaining those with the highest confidence. Due to the smaller depth and width of YOLOv5n, its ability to fit complex features is relatively limited. Therefore, this research introduces the C2F module proposed in the state-of-the-art object detection model YOLOv8 [24]. Compared to the original C3 (CSP Bottleneck with 3 convolutions) module [25], the C2F module incorporates bottleneck skip connections, allowing for better feature extraction by increasing the gradient flow and channel capacity. Additionally, only half of the feature matrix participates in the subsequent bottleneck operations, since the channels are split equally after one convolution. This design enhances feature extraction capabilities without compromising GPU inference speed, thus avoiding false detections and omissions.
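As a concrete illustration of the NMS post-processing step described above, the following is a minimal NumPy sketch of greedy non-maximum suppression (the threshold value is illustrative, not the one used in the paper):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.45):
    """Keep the highest-confidence box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep
```

In practice the batched, GPU-accelerated `torchvision.ops.nms` is used instead; the sketch only shows the greedy logic.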
The original YOLOv5's feature fusion pyramid performs multiple fusion operations on the P4 feature layer, which can occlude and cover the subtle texture folds of cigarettes, increasing the difficulty of detection. Therefore, this research applies Jump Concat to the P4 feature layer to mitigate the occlusion of fine-textured crease defects resulting from feature map fusion. The original YOLOv5 utilizes the CIoU localization loss function, which calculates the localization loss based on IoU (Intersection over Union), center point distance and aspect ratio between predicted and ground truth bounding boxes. However, it does not account for mismatched orientations between the predicted and ground truth boxes. This limitation slows down convergence and efficiency, as predicted boxes may "hover" during training, ultimately leading to suboptimal model performance. Therefore, SIoU [26] is introduced, which considers the angle between the center point vectors and includes angle penalty metrics. This loss function enables predicted boxes to converge rapidly towards the nearest axis and then converge along a single axis, effectively accelerating the convergence of predicted boxes and improving the model's localization accuracy and confidence in object detection.
Our improved model structure is shown in Figure 3.

C2F module
The C2F module, proposed in the latest object detection model YOLOv8, enhances feature extraction capabilities compared to the original C3 module by introducing bottleneck skip connections and chunk operations, resulting in increased parameter quantity, gradient flow and channel capacity. Figure 4(a) depicts the C2F module, while Figure 4(b) shows the original C3 module.
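For reference, a minimal PyTorch sketch of a C2F-style block, written from the description above (the exact channel widths and bottleneck internals in YOLOv8 differ slightly; this is an illustration, not the official implementation):

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic block used throughout YOLOv5/v8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 CBS blocks with a residual skip connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = CBS(c, c, 3)
        self.cv2 = CBS(c, c, 3)
        self.add = shortcut
    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2F(nn.Module):
    """Chunk the features in half; only one half passes through the n
    bottlenecks, and every intermediate output is concatenated, adding
    extra gradient-flow branches at modest computational cost."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = CBS(c_in, 2 * self.c, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = CBS((2 + n) * self.c, c_out, 1)
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))  # split along channels
        for m in self.m:
            y.append(m(y[-1]))                 # keep each branch output
        return self.cv2(torch.cat(y, dim=1))
```

Note how the concatenation collects two split halves plus one output per bottleneck, which is where the extra gradient paths come from.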

Jump Concat feature pyramid
The YOLOv5n model incorporates both FPN (Feature Pyramid Network) [27] and PANet (Path Aggregation Network) [28] for multi-scale feature fusion. FPN enhances semantic information in a top-down manner, while PANet reinforces positional information in a bottom-up way. This combination enhances the feature fusion capability of the neck layer. However, in the P4 feature layer, multiple feature fusion operations may cause certain feature information to be overshadowed. To address this issue, this study introduces an additional Jump Concat at the end of the P4 feature layer to prevent the coverage of subtle textures with minimal pixel variations, such as small wrinkles in cigarette appearances. For P4, the two fusion processes take the form:

$$P_4^{td} = \mathrm{Conv}\big(\mathrm{Concat}\big(P_4^{in}, \mathrm{Resize}(P_5^{in})\big)\big), \qquad P_4^{out} = \mathrm{Conv}\big(\mathrm{Concat}\big(P_4^{in}, P_4^{td}, \mathrm{Resize}(P_3^{out})\big)\big) \tag{1}$$

where Resize(·) is usually an upsampling or downsampling operation for resolution matching, Conv(·) is usually a convolutional operation for feature processing, $P_i^{out}$ is the output of the i-th feature map and $P_i^{in}$ is its input. The second concatenation of $P_4^{in}$ in the bottom-up step is the added Jump Concat.
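The two fusion steps, with the Jump Concat re-injecting the untouched P4 input, can be sketched as follows (the `conv_td`/`conv_out` arguments stand in for the neck's convolution blocks, and the resize operations are illustrative choices):

```python
import torch
import torch.nn.functional as F

def fuse_p4(p3_out, p4_in, p5_in, conv_td, conv_out):
    """Sketch of the two P4 fusion steps with the extra jump connection.
    conv_td / conv_out are placeholders for the neck's conv blocks."""
    # Top-down step: upsample P5 and merge with P4.
    p4_td = conv_td(torch.cat(
        [p4_in, F.interpolate(p5_in, scale_factor=2)], dim=1))
    # Bottom-up step: downsample P3 and merge; the Jump Concat re-injects
    # the untouched P4 input so subtle textures are not overwritten.
    p3_down = F.max_pool2d(p3_out, kernel_size=2)
    p4_out = conv_out(torch.cat([p4_in, p4_td, p3_down], dim=1))
    return p4_out
```

With identity placeholders for the convolutions, the output channel count is simply the sum of the concatenated inputs, which makes the extra skip branch easy to verify.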

SIoU loss
The SIoU loss function is employed for bounding box regression. Compared to CIoU, DIoU (Distance IoU) [29] and GIoU (Generalized IoU) [30], SIoU further considers the vector angle between the ground truth and predicted anchor boxes, leading to a redefinition of the loss function. It facilitates faster regression of anchor boxes towards the nearest axis (x or y), thereby enhancing convergence speed and localization accuracy.
In Figure 6, BB represents the predicted anchor box, GT denotes the ground truth box, σ represents the Euclidean distance between the centers of the predicted and ground truth boxes, Ch represents the projection of that distance on the y-axis, Cw its projection on the x-axis, α the angle between the center line and the x-axis and β the angle between the center line and the y-axis. The SIoU loss function comprises three components: angle loss, distance loss and intersection over union loss.

$$IoU = \frac{|BB \cap GT|}{|BB \cup GT|} \tag{2}$$

Equation (2) calculates the intersection over union, where BB denotes the area of the predicted bounding box and GT the area of the ground truth box; the ratio of their intersection to their union yields the IoU value.

$$\Delta = \sum_{t=x,y}\left(1 - e^{-(2-\Lambda)\rho_t}\right), \qquad \Lambda = 1 - 2\sin^2\!\left(\arcsin\frac{C_h}{\sigma} - \frac{\pi}{4}\right) \tag{3}$$

Equation (3) gives Δ, which combines the angle and distance calculations of SIoU. $C_h/\sigma = \sin\alpha$ is the sine of the angle between the center points of the ground truth and predicted boxes, while $\rho_x = \big((b_{c_x}^{gt} - b_{c_x})/W_c\big)^2$ and $\rho_y = \big((b_{c_y}^{gt} - b_{c_y})/H_c\big)^2$ are the squared ratios of the center-point offsets to the width $W_c$ and height $H_c$ of the minimum enclosing rectangle of the two boxes; here, e denotes Euler's number.

$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \omega_w = \frac{|w - w^{gt}|}{\max(w, w^{gt})}, \quad \omega_h = \frac{|h - h^{gt}|}{\max(h, h^{gt})} \tag{4}$$

Equation (4) computes the Ω shape loss, where w and h represent the width and height of the predicted box, $w^{gt}$ and $h^{gt}$ those of the ground truth box, and θ denotes the attention coefficient of the shape term.
Finally, the SIoU loss combines the three components above:

$$L_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2} \tag{5}$$
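A direct NumPy transcription of Eqs (2)-(5) might look like the following sketch (single-box version for clarity, not the batched implementation used in training; `theta` defaults to the value 4 commonly used in SIoU implementations):

```python
import numpy as np

def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """SIoU loss sketch for boxes in [x1, y1, x2, y2] format:
    IoU, angle, distance and shape terms as in Eqs (2)-(5)."""
    # IoU term (Eq. 2)
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # Box centers and minimum-enclosing-rectangle extents
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wc = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hc = max(pred[3], gt[3]) - min(pred[1], gt[1])
    sigma = np.hypot(gcx - pcx, gcy - pcy) + eps

    # Angle cost: zero when the centers are axis-aligned
    sin_alpha = abs(gcy - pcy) / sigma
    lam = 1 - 2 * np.sin(np.arcsin(min(sin_alpha, 1.0)) - np.pi / 4) ** 2

    # Distance cost (Eq. 3), modulated by the angle cost
    gamma = 2 - lam
    rho_x = ((gcx - pcx) / (wc + eps)) ** 2
    rho_y = ((gcy - pcy) / (hc + eps)) ** 2
    delta = (1 - np.exp(-gamma * rho_x)) + (1 - np.exp(-gamma * rho_y))

    # Shape cost (Eq. 4)
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    omega_w = abs(pw - gw) / max(pw, gw)
    omega_h = abs(ph - gh) / max(ph, gh)
    omega = (1 - np.exp(-omega_w)) ** theta + (1 - np.exp(-omega_h)) ** theta

    # Final loss (Eq. 5)
    return 1 - iou + (delta + omega) / 2
```

For a perfectly matching prediction all four terms vanish and the loss is (numerically) zero, which is a convenient sanity check.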

Enhancement and partitioning of experimental datasets
On high-speed cigarette production lines, the probability of encountering cigarettes with visual defects is approximately 1%. The manual screening process to obtain a substantial amount of defective cigarette data is labor-intensive. We gathered 900 valid images of cigarettes with appearance defects through meticulous manual selection. Due to the limited availability of the appearance defect dataset, we employed data augmentation techniques, including flipping, brightness adjustment and Gaussian noise addition, to expand the dataset. Consequently, the augmented dataset comprises 6200 images. It is important to note that since the images were captured in pairs, the actual research dataset consists of 12,400 cigarette images. To ensure the model's effectiveness, we divide the dataset into training, validation and testing sets in a 6:2:2 ratio, as detailed in Table 1.
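The augmentation and splitting steps can be sketched as follows (the brightness factor, noise level and random seed are illustrative choices, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Generate the three augmented variants used to expand the dataset:
    horizontal flip, brightness adjustment and additive Gaussian noise."""
    flipped = img[:, ::-1]                              # horizontal flip
    bright = np.clip(img * 1.2, 0, 255)                 # +20% brightness (illustrative)
    noisy = np.clip(img + rng.normal(0, 10, img.shape), 0, 255)
    return flipped, bright, noisy

def split_indices(n, seed=0):
    """Shuffle and split n samples into train/val/test with a 6:2:2 ratio."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Applied to the 12,400-image dataset, the 6:2:2 split yields 7440 training, 2480 validation and 2480 test images.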

Experiment parameter setting
The model was trained and tested on a Windows 10 system running PyTorch 1.12.1, using the following hardware: an AMD R5600 processor @ 3.50 GHz, 32 GB of memory and an NVIDIA GeForce RTX3060 graphics card with 12 GB of VRAM. The software environment comprised CUDA 11.6, Torchvision 0.13.1 and Python 3.7, with PyCharm and Anaconda as the integrated development environments. The initial learning rate was 0.01, and a cosine annealing strategy was employed to reduce the learning rate during training. Additionally, the neural network parameters were optimized using the stochastic gradient descent (SGD) method with a momentum value of 0.937 and a weight decay of 0.0005. The training process consisted of 300 epochs, with a batch size of 64 images. The input image resolution was uniformly adjusted to 640 × 640. The adjusted training parameters are summarized in Table 2.
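The optimizer and cosine-annealing schedule described above can be set up in PyTorch roughly as follows (`build_optimizer_and_schedule` and the one-cycle-style lambda are our illustrative choices; YOLOv5's actual training script additionally uses warm-up and separate parameter groups):

```python
import math
import torch

def build_optimizer_and_schedule(model, epochs=300, lr0=0.01, lrf=0.01):
    """Optimizer/scheduler setup mirroring the settings in Table 2:
    SGD with momentum 0.937 and weight decay 5e-4, cosine-annealed LR
    decaying from lr0 to lr0 * lrf over `epochs` epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=lr0,
                          momentum=0.937, weight_decay=5e-4)
    # Cosine schedule: lf(0) = 1, lf(epochs) = lrf.
    lf = lambda e: (1 + math.cos(e * math.pi / epochs)) / 2 * (1 - lrf) + lrf
    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lf)
    return opt, sched
```

One scheduler step is taken per epoch, so after 300 epochs the learning rate has decayed from 0.01 to 0.0001.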

Evaluation indicators
The experimental evaluation encompasses two aspects: performance evaluation and complexity evaluation. Performance metrics include precision, recall, mAP@0.5 and mAP@0.5-0.95. Complexity metrics consist of the model parameter size, floating point operations (FLOPs) and frames per second (FPS), which assess the computational efficiency and image processing speed of the model.
Precision measures the proportion of correctly predicted positive samples among all predicted positives, assessing the model's classification ability, while recall measures the proportion of correctly predicted positive samples among all actual positives. AP is the area under the precision-recall curve, and mAP is the average AP across categories, reflecting the model's overall detection and classification performance. These metrics are computed as in Eqs (6)-(9).

$$Precision = \frac{TP}{TP + FP} \tag{6}$$

$$Recall = \frac{TP}{TP + FN} \tag{7}$$

where TP is the number of true positive samples correctly detected, FP the number of false positive samples incorrectly detected, and FN the number of false negative samples missed.

$$AP = \int_0^1 P(R)\,dR \tag{8}$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i \tag{9}$$

where n is the number of categories.

$$Params = C_{in} \times K^2 \times C_{out} \tag{10}$$

$$FLOPs = C_{in} \times K^2 \times C_{out} \times W_{out} \times H_{out} \tag{11}$$

Model size refers to the amount of memory required to store the model. FLOPs measure model complexity by quantifying the total number of multiplication and addition operations performed during model execution; a lower FLOPs value indicates lower computational requirements for inference and faster computation. Here, Cin represents the number of input channels, Cout the number of output channels, K the convolutional kernel size, and Wout and Hout the width and height of the output feature map, respectively.
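For reference, the metric and complexity formulas above translate directly into code (per-layer counts for a single convolution; bias terms and the optional factor of 2 for counting multiplies and adds separately are omitted):

```python
def precision_recall(tp, fp, fn):
    """Eqs (6)-(7): precision and recall from detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def conv_params(c_in, c_out, k):
    """Eq (10): weight count of one k x k convolution layer (no bias)."""
    return c_in * k * k * c_out

def conv_flops(c_in, c_out, k, w_out, h_out):
    """Eq (11): multiply-add count of one convolution layer; every output
    pixel costs c_in * k * k multiply-adds per output channel."""
    return c_in * k * k * c_out * w_out * h_out
```

For example, a 3 × 3 convolution from 3 to 16 channels on a 10 × 10 output map has 432 parameters and 43,200 multiply-adds.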

Training process analysis
Figure 8 presents the training loss curves for the CJS-YOLOv5n method and YOLOv5n on the cigarette appearance defect dataset. The graph illustrates the overall loss values during the training process. In the initial 30 epochs, the model experiences a rapid decline in loss. Subsequently, from epoch 50 to 200, a gradual decrease in loss is observed. Between epochs 200 and 300, the loss values stabilize and approach convergence. Thus, 300 epochs are determined to be an appropriate training iteration count for the model.
Furthermore, the dashed line represents the loss curve of CJS-YOLOv5n, while the solid line represents the loss curve of YOLOv5n. The graph shows that, under the same conditions of rapid loss reduction and convergence, the improved model reaches a lower final convergence value than the original model. Additionally, our improved model exhibits smaller fluctuations in the loss curve during training. Consequently, our improved model demonstrates superior behavior during the training process.

Ablation experiment
This section verifies the effectiveness of the different optimization modules through ablation experiments. Several improved models are constructed by sequentially adding the C2F module, Jump Concat and SIoU localization loss function to the baseline model YOLOv5n. The results are compared on the same test data, and the gains in model performance due to the added modules are presented in Table 3.
Table 3 shows the performance gains from adding each module. Because of the small depth and width of the original YOLOv5n network, its ability to fit complex features is limited. To enhance the network's feature extraction ability, we replace the C3 convolution module with the C2F convolution module, which increases mAP@0.5 by 0.6% and the recall rate by 0.6%. Next, to prevent feature fusion from covering up fine texture features, we add a Jump Concat to the P4 feature layer, which increases mAP@0.5 by 1% and the recall rate by 1%. Finally, to enhance localization accuracy and avoid false detections, we replace the CIoU loss function with the SIoU loss function, which increases mAP@0.5 by 1% and further improves the recall rate. Combining the three modules, the improved model achieves the best performance. The added computation of the improved modules does not affect the model's speed on the GPU, but the additional parallel computation of the C2F convolution module reduces the model's inference speed on the CPU by nearly 10%.

Comparative experiments
The proposed method was trained with the same parameters as other advanced lightweight methods, and the experimental results are compared in Table 4, which shows the performance comparison with other object detection models. The selected model is not optimal in every performance indicator, but it is faster than higher-accuracy models such as YOLOv8n, and its detection accuracy is higher than that of models with comparable speed such as YOLOv3-tiny. Since our model has the fewest parameters among all compared models, it reduces the deployment cost of practical applications and strikes a balance between detection speed and accuracy. While maintaining detection speed, the improved CJS-YOLOv5n increases mAP@0.5 by 1.7% and the recall rate by 2.3% compared with the original model. YOLOv8 is a state-of-the-art object detection model with high accuracy in many applications; compared with YOLOv8n, our model achieves competitive average detection accuracy and recall at a higher detection speed, which can meet most current detection needs. The experimental results show that the proposed model is an effective detection method for cigarette appearance defects and can meet the needs of cigarette appearance defect detection on the production line.
Our ablation experiments and comparative experiments demonstrate the effectiveness of our improvements. Among them, the C2F module achieves better feature extraction at a small computational cost, improving the performance indicators to varying degrees. Jump Concat reduces the occlusion of subtle features during feature fusion, effectively improving the recall rate. The SIoU loss function improves the localization accuracy of anchor boxes by optimizing the anchor box loss calculation, thus increasing the confidence and recall of predicted anchor boxes.
Compared with the latest improved-YOLOv4 research on cigarette appearance defect detection and the current state-of-the-art object detection model YOLOv8, the improved CJS-YOLOv5n model gains to varying degrees in both detection performance and detection speed.

Conclusions
Given the high speed of actual cigarette production, we use an improved YOLOv5n network to detect cigarette appearance defects, achieving good detection results and detection speed despite a limited dataset. In this paper, the C2F module, which has stronger feature extraction ability than the C3 convolution module, is selected as the network's basic convolution module; a Jump Concat is added to the P4 feature layer to further strengthen feature fusion and prevent information coverage; and the SIoU localization loss function is used to improve localization accuracy and convergence speed. The experimental results prove the effectiveness of the algorithm for the task of cigarette appearance defect detection. The method achieves good detection speed and recall and exhibits a degree of robustness, effectively improving the recall rate and detection accuracy without affecting detection speed.
Our model still has shortcomings. For example, its detection performance does not match that of the larger YOLOv8n network. Although our improvements effectively increase the recall rate and help control the quality of cigarettes entering the market, the model is limited by its network depth and width, and its ability to fit complex defects still needs improvement. In future work, we will focus on improving detection accuracy while maintaining detection speed and reducing deployment cost, for example by adding the lightweight attention mechanism CBAM (Convolutional Block Attention Module) [31], adopting an Anchor-Free detection head [32], further expanding the dataset and using lightweight convolutions.
Figure 4(b) represents the original C3 module. The C2F module possesses more channels and gradient flow than the C3 module. The chunk operation slices the input feature map, dividing it equally into two parts along the channel dimension; only half of the feature map participates in the bottleneck convolution operations, effectively reducing computational load. Additionally, to maintain a lightweight structure, the C2F module reduces the computation of the right-branch CBS (Convolution, Batch normalization and SiLU activation function) module. (a) C2F module (b) C3 module

Figure 9(a)-(h) shows the detection results of YOLOv5n on the seven kinds of defects, and Figure 10(a)-(h) shows the detection results of CJS-YOLOv5n on the same defects. The detection results confirm our improvements: the latter model performs better on all seven defect types. Comparing panels (b), (d) and (h) of Figures 9 and 10, the improved model better identifies the subtle wrinkle and stain defects missed by the original network, and its anchor box for the fold defect is more complete. In complex scenes with multiple defects, the improved model also avoids the false detections of the original model. Across the other defects, the improved network shows varying confidence improvements over the original network.

Table 4. Comparison of different detection models.