Temporal-Quality Ensemble Technique for Handling Image Blur in Packaging Defect Inspection

Despite numerous successes in deep-learning-based surface defect inspection, the industry still faces challenges in inspecting packaging defects that involve critical information such as ingredient lists. In particular, previous work primarily focuses on defect inspection in high-quality images and does not consider inspection of low-quality images, such as those degraded by blur. To address this issue, we propose a novel inference technique named temporal-quality ensemble (TQE), which combines temporal and quality weights. The temporal weight assigns importance to input images according to their capture timing relative to the observed image. The quality weight prioritizes high-quality images so that the inference process emphasizes clear and reliable inputs. Together, these two weights improve both the accuracy and reliability of inference on low-quality images. In addition, to experimentally evaluate the general applicability of TQE, we adopt widely used convolutional neural networks (CNNs) such as ResNet-34, EfficientNet, ECAEfficientNet, GoogLeNet, and ShuffleNetV2 as backbone networks. In conclusion, in cases where at least one low-quality image is included, TQE achieves an F1-score approximately 17.64% to 22.41% higher than single CNN models and about 1.86% to 2.06% higher than an average voting ensemble.


Introduction
Packaging is widely utilized in logistics and transportation [1] but typically suffers from diverse surface defects stemming from external forces and repetitive compression. These defects cause significant damage or deterioration of the product inside. Moreover, damage to important information, such as ingredient lists on packaging surface labels, is also critical. Traditionally, defect inspection of packaging surfaces was performed manually. However, manual inspection is slow, inefficient, and has a high error rate in finding defects [2] in large volumes of packaging. To replace manual inspection, several studies investigated surface defect inspection using computer vision [3][4][5]. Nevertheless, these existing methods are not widely used in inspection industries due to a lack of generalization, robustness, and accuracy.
Recently, deep learning [6] has received considerable attention for surface defect inspection [7][8][9][10]. Deep learning models can identify the unique features of packaging and their defects by training on datasets. However, when relying solely on convolutional neural network (CNN)-based architectures, degradations such as motion blur [11] that occur in packaging-related industrial environments cannot be anticipated during model training. Such environments commonly produce two types of low-quality images: motion blur caused by conveyor movement, and focus blur, known as the out-of-focus phenomenon.
First, the main causes of motion blur are as follows. Although the conveyor contributes to the movement of packaging, it often generates significant vibrations when foreign substances adhere to the packaging or when the exposed surface of the conveyor belt is worn or damaged. Additionally, roller-based conveyors, which are commonly used in warehouses, can also cause vibration, as shown in Figure 1a. This vibration temporarily increases the moving speed of the packaged product relative to the camera shutter speed, which results in motion blur [12][13][14][15][16]. Second, focus blur mainly occurs when a camera or lens with an autofocus function reacts to changing brightness levels or loses focus [17,18]. For example, focus blur occurs when packaged products on a conveyor belt have different brightness levels due to changing lighting, or when the focus moves irregularly from one packaged product to the next. Figure 1b shows an example of the out-of-focus problem. These blur problems are not only the main causes of low-quality images but also degrade the performance of CNN models. To this end, adopting ensemble techniques that refer to images captured at various times can mitigate these issues. Many academics have presented research over the last decade indicating that ensemble approaches outperform single classifiers [19,20]. Ensemble learning is a machine learning technique that combines multiple models to obtain better prediction performance, typically by aggregating their predictions [21]. Among these, the average voting ensemble (AVE) [22] is a popular strategy [23] that returns the most common prediction among the models as the final output. However, when low-quality (e.g., blurred) and high-quality images are mixed, AVE suffers performance degradation because it allocates equal weight to all images regardless of quality, so low-quality images negatively impact the prediction.

To address this challenge, we propose a novel inference technique named temporal-quality ensemble (TQE), which integrates information from multiple images of different qualities. It consists of two key components: a quality weight and a temporal weight. The quality weight evaluates the quality of individual frames and prioritizes high-quality images, ensuring that the inference process emphasizes reliable information while deprioritizing low-quality images. The temporal weight accounts for temporal continuity and assigns weights to images based on their capture timing relative to the observed image. By emphasizing the importance of up-to-date images, this weight balances the quality weight so that low-quality images, which may still contain important defect information, are not simply excluded.
The main contributions of this paper are summarized as follows.
(1) The proposed inference technique, termed TQE, combines temporal and quality weights to integrate information from multiple images, including blurred ones. By leveraging temporal continuity and prioritizing superior clarity, it mitigates the effects of image blur and improves overall accuracy for identifying defects. To the best of our knowledge, the proposed approach is the first ensemble technique to overcome image blur in packaging inspection.
(2) Our new private database provided more realistic results for training and evaluating deep learning models, since it reflected motion blur in images acquired by deploying a real machine vision camera, conveyor belt, etc.
(3) Through comparative experiments with AVE, TQE exhibited effectiveness in terms of accuracy, precision, recall, and F1-Score for identifying defects.
Xu et al. [30] proposed the Bilinear-VGG16 model to address quality problems that occur during the packaging process, such as packaging damage, packaging side ears opening, and scratches during printing. The Bilinear-VGG16 model improves performance by combining the VGG-16 network with a Bilinear-CNN, using global average pooling and global maximum pooling to extract fine-grained features of defective packaging images. This allows for robust representation learning and enhances the model's ability to distinguish between various types of defects. They achieved a recognition accuracy of 96.3% with the improved model, better than that of previously popular network models. Zhou et al. [31] proposed a method that combines traditional computer vision (CV) techniques and deep learning models to classify all 12 types of packaging defects. It detects some easy-to-detect defects and position-offset defects using traditional CV methods and uses ResNet-34 to detect the remaining defects. Experiments on the 12 defect types verified that it fulfills the defect recognition requirements. Sheng et al. [32] proposed a method based on ECA-EfficientDet to detect multiple classes of defects and to address the generalization problem caused by the lack of packaging defect samples in industry. It improves detection accuracy by designing an ECA-Convblock that predicts channel importance and suppresses channels that carry no information. Additionally, the mosaic augmentation technique and Mish activation function were applied to the sample data to improve generalization and robustness in complex environments. As a result, this method achieved an accuracy of over 99% for almost all categories except a few.
Although these studies achieve good defect inspection results on high-quality images, none of them address the issue of low-quality images caused by motion blur and focus blur. Thus, in this paper, we propose a technique that delivers excellent inspection performance even in environments where high-quality and low-quality (blurred) images are mixed.

Our Method
Single CNN models suffer degraded performance when processing low-quality images affected by motion blur or focus blur. To address this challenge, we present an inference method designated as TQE, which combines a temporal weight and a quality weight. The methodology rests on three premises. First, we deal with images that have been observed and memorized: observed images are the most recent images, while memorized images are previously seen images on which prediction has already been performed and which have a temporal order. Second, we do not process all acquired images. Packaging generally moves gradually on the conveyor, resulting in small changes between consecutive frames; therefore, instead of using all consecutive images, we sample 2 to 4 images as the input set. Lastly, low-quality images usually reduce inference performance; however, we assume that some low-quality images may contain important defect information that high-quality images lack. This premise is validated through the experiments in Section 4.
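As an illustrative sketch of the first two premises, an observed/memorized buffer might be maintained as follows (the class name and interface are our own assumptions, not the paper's implementation):

```python
from collections import deque

class FrameBuffer:
    """Holds the observed image (t = 1) plus the most recent K - 1
    memorized images, discarding the oldest frame beyond K."""

    def __init__(self, K):
        self.frames = deque(maxlen=K)  # left end = newest frame

    def capture(self, frame):
        # A newly captured frame becomes the observed image (t = 1);
        # older frames shift toward larger t automatically.
        self.frames.appendleft(frame)

    def inputs(self):
        # Index 0 is the observed image; the rest are memorized images
        # in increasing temporal distance.
        return list(self.frames)
```

For example, with K = 3, capturing frames 1 through 5 leaves frames 5, 4, and 3 as the observed image and two memorized images.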
The following subsection provides a detailed exposition of our method.

Overall Architecture
The overall architecture is shown in Figure 2. It is largely divided into training and test parts. We aim to ensure that a CNN trained solely on high-quality images also performs well on low-quality images using TQE; thus, TQE is used only in the test part. The training part adopts a widely used CNN as the backbone network. CNNs consist of a feature extractor layer and a classifier layer; when the classifier outputs a probability, the loss value is calculated and used to update parameters through backpropagation [33]. The test part performs predictions using the single CNN models learned in the training part together with TQE, which combines temporal and quality weights. Following the premises above, the image from the current moment is captured and used as the observed image, and previously captured images are loaded as memorized images. These images are selected as the input data. The input data then pass through the pretrained model to perform inference and, at the same time, are transmitted for the weight allocation of TQE. TQE's weights are determined based on the relative quality and capture order of the observed and memorized images, and they guide the final prediction by assigning different priorities to the observed and previously memorized images. The following subsections provide a detailed exposition of the temporal and quality weights, which are the main components of our method.

Temporal Ensemble
Temporal ensemble (TE) is a method that assigns temporal weights to input images based on their timing order. This principle is widely used across various fields, where closely timed images often share similar features [34]: images taken close in time tend to have similar features, allowing them to complement each other. Expanding on this idea, we introduce a temporal weighting mechanism in which importance gradually changes as the time difference between the observed image and a memorized image grows. This weighting strategy enhances temporal smoothing and integrates relevant information from observed and memorized images. The temporal weight is defined as

TW_t = exp(−|t − 1|/τ) / Σ_{j=1}^{K} exp(−|j − 1|/τ),

where t is a positive integer timestamp and K represents the total number of images. When a new image is captured, it becomes the observed image and is assigned t = 1. As the next image is captured, the previously observed image (t = 1) shifts to t = 2 and is categorized into the memorized images. As t increases up to K, the K-th image becomes the oldest memorized image in the sequence. This cycle repeats with each new image capture.
The absolute value |t − 1| is the sequence difference between the observed and memorized images. τ is a positive parameter that adjusts the importance of the temporal weight. As τ approaches 0, the influence of the observed and most recent images on the temporal weight becomes dominant, whereas as τ approaches ∞, the weights of recent and old images gradually equalize. Thus, τ enables higher weighting for recently observed images while balancing the quality weights even when those images are of low quality. However, τ requires user-specific optimization, making it difficult to pick an arbitrary value. Therefore, as a guideline, we present a binary search algorithm that recommends τ when the desired range of the temporal weight for the observed image is given. The guideline for computing τ using binary search is shown in Algorithm 1.
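Since the temporal-weight equation is not reproduced above, the following sketch assumes a normalized exponential decay consistent with the described behavior of τ; the function names and the concrete binary-search loop for the Algorithm 1 guideline are our own illustrative choices:

```python
import math

def temporal_weights(K, tau):
    """Normalized temporal weights TW_t for t = 1..K, assuming an
    exponential decay in the sequence difference |t - 1| controlled
    by tau (tau -> 0 concentrates weight on the observed image,
    tau -> infinity equalizes the weights)."""
    raw = [math.exp(-abs(t - 1) / tau) for t in range(1, K + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def search_tau(target_tw1, deviation, min_tau, max_tau, K, iters=100):
    """Binary search: find tau such that the weight of the observed
    image, TW_1, lands within `deviation` of `target_tw1`.
    TW_1 decreases monotonically as tau grows."""
    lo, hi = min_tau, max_tau
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        tw1 = temporal_weights(K, mid)[0]
        if abs(tw1 - target_tw1) <= deviation:
            return mid
        if tw1 > target_tw1:
            lo = mid  # weight too concentrated: increase tau
        else:
            hi = mid  # weight too flat: decrease tau
    return (lo + hi) / 2.0
```

Under this assumed form, requesting TW_1 ≈ 0.5 with K = 3 returns τ ≈ 2.1.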

Algorithm 1 Binary Search for τ
1: Input: Target weight (TW_1), range (deviation), minimum tau (min_τ), maximum tau (max_τ), total number of input images (K)
2: Output: τ

Quality Ensemble

Quality ensemble (QE) is a method that assigns quality weights to input images based on their sharpness. High-quality images depict sharp high-frequency features such as small details or sharp edges. In contrast, low-quality images exhibit blurred low-frequency features such as object boundaries or edges [35]. We use the Laplacian operator [36] to evaluate the quality of observed and previously memorized images. The Laplacian operator is a mathematical operator used to measure the spatial variation of a scalar field. It can be expressed as a combination of the gradient and divergence operators, defined as follows:

∇²f = ∇ · (∇f),

where ∇f represents the gradient of the scalar function f, quantifying its rate of change in space, and ∇ · (∇f) denotes the divergence of that gradient field, measuring the extent to which it diverges from a specific point. Therefore, the Laplacian operator takes the gradient first and then calculates the divergence of the outcome. Essentially, it assesses the spatial variation of the function, aiding the identification of areas with significant change or curvature in the scalar field. We evaluate the relative image quality of the observed image and the previously memorized images using Laplacian variance [37]. Initially, we apply a Laplacian filter to transform the observed and memorized images into Laplacian images. Then, we gather the Laplacian values of each pixel and compute their variance to quantify the intensity of the high-frequency components in the image. A higher Laplacian variance indicates a high-quality image with high-frequency features. It is defined as

L = ∇²I,

where I is the given image and L is the result of applying the Laplacian filter to it.
Here, the variance is computed as the average of the squared differences between each pixel's value and the mean value, expressed as

Var(L) = (1/n) Σ_{i=1}^{n} (L_i − µ)²,

where L_i is the value of the i-th pixel of the image after applying the Laplacian filter, µ is the mean value of all pixels in L, and n is the total number of pixels in the image.
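As a sketch, the Laplacian-variance sharpness score can be computed with a discrete 4-neighbour Laplacian (in practice a library filter such as OpenCV's `cv2.Laplacian` would typically be used; the function below is a minimal NumPy stand-in):

```python
import numpy as np

def laplacian_variance(img):
    """Variance of the discrete Laplacian response of a 2-D grayscale
    image: sharp, high-frequency images score high, blurred ones low."""
    I = np.asarray(img, dtype=np.float64)
    # 4-neighbour Laplacian: sum of neighbours minus 4x the centre pixel.
    L = (I[:-2, 1:-1] + I[2:, 1:-1] + I[1:-1, :-2] + I[1:-1, 2:]
         - 4.0 * I[1:-1, 1:-1])
    # The variance of L quantifies the strength of high-frequency content.
    return float(L.var())
```

A checkerboard (maximal high-frequency content) scores far above a uniform image, whose Laplacian variance is zero.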
To calculate the variance ratio of a specific image among the observed and memorized images, we divide the variance of that image by the sum of the variances of all images. The quality weight of the t-th image can therefore be calculated as

QW_t = Var(L)_t / Σ_{j=1}^{K} Var(L)_j,

where t is a positive integer timestamp, K represents the total number of images, and Var(L)_j denotes the Laplacian variance of the j-th image. This value indicates how much the variance of the t-th image contributes to the total variance over all images.
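The variance-ratio weighting reduces to a simple normalization; a minimal sketch (the uniform fallback for the degenerate all-flat case is our own choice):

```python
def quality_weights(variances):
    """Quality weight QW_t = Var(L)_t / sum_j Var(L)_j for each of the
    K input images, given their Laplacian variances."""
    total = sum(variances)
    if total == 0.0:
        # All images completely flat; fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]
```

For example, variances of 30, 10, and 10 yield weights 0.6, 0.2, and 0.2.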

Temporal-Quality Ensemble
The proposed TQE weight considers quality and temporal weights together. It computes the average of the weights TW_t and QW_t, denoted as TQ_t, defined as

TQ_t = (TW_t + QW_t) / 2,

where t is a positive integer timestamp, TW_t is the temporal weight of the t-th image, and QW_t is the quality weight of the same image. The algorithm for computing the proposed ensemble is shown in Algorithm 2.
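Combining the two weight vectors and applying them at inference can be sketched as follows; the soft-voting aggregation over class probabilities is our own assumption, since the text above defines only the weight TQ_t:

```python
def tqe_weights(temporal_w, quality_w):
    """TQ_t = (TW_t + QW_t) / 2 for each of the K input images."""
    return [(tw + qw) / 2.0 for tw, qw in zip(temporal_w, quality_w)]

def tqe_predict(class_probs, weights):
    """Weighted soft voting: sum the per-image class-probability rows
    under the TQE weights and return the index of the winning class."""
    n_classes = len(class_probs[0])
    scores = [sum(w * probs[c] for w, probs in zip(weights, class_probs))
              for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)
```

With weights such as [0.6, 0.25, 0.15], a confident observed image can outvote two disagreeing memorized images.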

Datasets
In this section, we present a comprehensive description of our collected dataset. In general, it is difficult to find an open dataset containing both high-quality and low-quality packaging images. Therefore, we collected a new dataset through our proprietary image acquisition system, shown in Figure 3. The acquisition system employed a range of illumination angles and positions, along with a machine vision camera (BFLY-PGE-31S4C-C, FLIR, Wilsonville, OR, USA) and a webcam (Brio, Logitech, Lausanne, Switzerland) with a frame rate of 10 FPS. The machine vision camera was fitted with a 16 mm fixed-focal-length megapixel lens (LM16JC5M2, KOWA, Nagoya, Japan). LED line illuminations were used to adjust the angle independently, with bar illuminations mounted in four directions (LDBQ300, LFINE, Incheon, Republic of Korea). A backlight (LXL300, LFINE, Incheon, Republic of Korea) was mounted beneath the transparent conveyor, with LEDs placed at regular intervals to provide a wide illumination angle and high uniformity. To intentionally simulate motion blur caused by external factors, the image acquisition system captured moving packaging on a conveyor belt operating at 15 cm/s. In addition, to reproduce focus blur, we collected data while randomly switching the camera's focusing function between automatic and manual modes. This configuration yields both low-quality and high-quality images, as shown in Figure 4. Figure 4 shows the original captured images without manual cropping.
Packaging defects are classified into five categories: label loss, deformation, cracks, surface damage, and surface dirt. These categories are broadly grouped into edge defects and surface defects [38]. Among them, we focus on the two types of defects formed around product labels, since the label surface of packaging inherently contains important ingredient lists, as depicted on the left of Figure 4. For example, edge defects occur when packaging collides with other objects while traveling on a conveyor or is torn when improperly stacked, as depicted in the middle of Figure 4, while surface defects are caused by ink bleeding or damage adsorbed on the label surface, as shown on the right of Figure 4.

This experiment utilized a total of 9000 packaging images, resized to 256 × 256 pixels without cropping, across three classes: non-defect (3000), edge defect (3000), and surface defect (3000). Samples of the dataset are shown in Figure 5. This dataset aims to verify that CNN models trained solely on high-quality images perform effectively not only on high-quality images but also on low-quality ones. For this purpose, low-quality images were selected as those whose Laplacian variance was 50% or less of that of the high-quality images. Low-quality images were collected for two types of blur, motion blur (1500) and focus blur (1500), for a total of 3000 images. Each type includes three classes: non-defect (500), edge defect (500), and surface defect (500). The distribution of images in the dataset for each class is shown in Table 1.
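The 50% Laplacian-variance selection rule can be sketched as follows (the function name and label strings are illustrative, not from the paper):

```python
def split_by_quality(variances, ref_variance):
    """Label each image 'low' if its Laplacian variance is at most 50%
    of a high-quality reference variance, and 'high' otherwise."""
    return ['low' if v <= 0.5 * ref_variance else 'high'
            for v in variances]
```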

Evaluation Metrics
To evaluate the performance of the proposed methods, the following evaluation metrics are used:

Accuracy = (TP + TN) / (TP + TN + FP + FN), (10)
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-Score = 2 × Precision × Recall / (Precision + Recall).

True positives (TP) are the cases where the prediction is positive and the actual value is positive. True negatives (TN) are the cases where the prediction is negative and the actual value is negative. Conversely, false positives (FP) are the cases where the prediction is positive but the actual value is negative, and false negatives (FN) are the cases where the prediction is negative but the actual value is positive. Precision is the ratio of true positives to all predicted positives. Recall is the ratio of true positives to all actual positives. F1-Score is the harmonic mean of precision and recall. Accuracy measures how often predictions are correct over all cases.
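The four metrics follow directly from the confusion counts; a minimal sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-Score from the confusion
    counts, guarding against division by zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

For example, tp = 8, tn = 5, fp = 2, fn = 2 gives precision = recall = F1 = 0.8.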

Implemental Details
Ensembles based on CNN models can serve as outstanding feature extractors for classification tasks. In particular, CNN models pretrained on ImageNet [39] require only minor fine-tuning, saving time and computational cost compared to training from scratch on private datasets. Therefore, we considered CNN models that support transfer learning from ImageNet and have proven reliability across many studies as baseline models. As a result, we chose ResNet-34, EfficientNet, ECAEfficientNet [40], GoogLeNet, and ShuffleNetV2 [41].
All experiments were performed on Ubuntu 18.04; the CPU was an Intel Core i9-13900K (32-core) (Santa Clara, CA, USA) and the GPU an NVIDIA GeForce RTX 4080 (16 GB) (Santa Clara, CA, USA). We adjusted the hyperparameters listed in Table 2 throughout the model training process. We trained the networks for 6 epochs, using a warmup scheduler to automatically adjust the learning rate. All models were trained using AdamW with a learning rate of 1.25 × 10^−4 and a weight decay of 0.05. In this experiment, the performance of the CNN models was evaluated using single images of either high or low quality, without employing an ensemble. This study aimed to ascertain whether CNN models could effectively learn from the private dataset and to evaluate the performance disparity between high-quality and low-quality input images. The loss and accuracy curves obtained during training of the five CNN models are shown in Figure 6. In Figure 6, the first row compares the loss convergence rates for the high-quality images. The loss steadily decreased with each epoch, and the validation loss decreased similarly to the training loss, showing consistent performance even on data the CNN models had not seen and indicating that the models generalized to validation data without overfitting. The second row of Figure 6 compares the accuracy convergence rates of the CNN models on the test dataset for high- and low-quality images. In all graphs, the training accuracy consistently increased as the epochs progressed, reaching over 0.9 with stable performance, and the test accuracy showed a similar increasing trend. In contrast, the test accuracy for low-quality images was relatively low in all graphs but gradually increased over the epochs, showing that the models improved their adaptability even to low-quality images. However, the accuracy on low-quality images remained at a significantly lower level than on high-quality images.

The class-wise results obtained on the test set are shown in Table 3, which presents a performance comparison of the CNN models on high-quality and low-quality images across various metrics. The metrics include precision, recall, F1-Score, and accuracy, with each model evaluated on the different classes (non-defect, edge defect, and surface defect) together with its estimated total size. Analyzing the results for high-quality images, GoogLeNet showed high precision, recall, and F1-Score in all classes and recorded the highest F1-Score (0.9975) in the surface defect class. ResNet-34 also showed consistent performance in all classes, recording an F1-Score of 0.9930 in the surface defect class. ShuffleNetV2 had a precision of 0.8560 in the non-defect class, lower than the other models, but it had the smallest model size (68.15 MB), with the advantages of fast computation and low memory usage. On the other hand, EfficientNet and ECAEfficientNet had the largest model size (242.78 MB), showing high performance at the cost of large memory usage. On low-quality images, the performance of all models deteriorated compared to high-quality images, with recall decreasing most significantly. The ECAEfficientNet model showed the highest performance on low-quality images, with an accuracy of 0.8447. ShuffleNetV2 performed poorly on low-quality images, especially in the non-defect class, where a recall of only 0.070 meant it hardly detected defects in that class. Overall, performance on both high-quality and low-quality images improved as model size increased. In contrast, ResNet-34 maintained a relatively balanced performance, with an accuracy of 0.7670, precision of 0.8079, recall of 0.7670, and F1-Score of 0.7497, while also offering a balanced model size. Given these attributes, ResNet-34 was selected as the representative model for the other experiments, serving as a benchmark of stable, balanced performance across different image-quality scenarios with a model size that balances efficiency and capability.

Table 4 compares three ensemble techniques, AVE, TE, and QE, using ResNet-34; TE and QE are separated for TQE's ablation analysis. In this experiment, τ was fixed at 3, and K represents the total number of images used in each scenario. The scenarios are combinations of high-quality (H) and low-quality (L) test images. In each scenario string, the observed image is on the far left, and memorized images with larger time intervals appear toward the right. For example, in the HLH scenario, the observed image is high quality, the nearest memorized image is low quality, and the subsequent time frame presents a high-quality image. Similarly, in the LHH scenario, the observed image is low quality, the memorized image closest in time is high quality, and the subsequent time frame also contains a high-quality image. Among the various combinations, K = 3 is the minimal combination that simultaneously evaluates the performance of the temporal and quality weights, for the following reasons. Focusing on a single-type combination behaves the same as a single model, negating the essence of ensembles reliant on multiple inputs. Similarly, restricting to only two images makes it challenging to validate the temporal and quality weights simultaneously. For example, in the HL or LH combination, temporal weighting can be evaluated since H and L are distinct in time.
However, as the frequencies of H and L are equal in that case, evaluating quality weighting becomes less significant. Thus, at least three images are required to evaluate the temporal and quality weights at the same time. AVE weighted all inputs equally; as a result, its predictions were inaccurate in scenarios where low-quality images were as frequent as or more frequent than high-quality images, such as the HL, HLL, and HLHL scenarios. In contrast, TE considered the input order in time and assigned higher weights closer to the observed image. This method maintained relatively high performance even in scenarios mixing high-quality and low-quality images, but performance tended to deteriorate when low-quality images dominated. For example, in the HL scenario at K = 2, the F1-Score reached the highest value of 0.9900, because H was the observed image and was assigned a higher weight than L. Conversely, in the LH scenario, since the observed image was low quality, H received a relative penalty. QE demonstrated high performance across scenarios and provided the most consistent performance overall. In particular, it maintained good performance not only in scenarios with many high-quality images but also in those mixing in low-quality images. Interestingly, performance tended to be better in scenarios with a mix of low-quality images than in scenarios with only high-quality images; for example, the HHL, LHH, HLHH, and LHHH scenarios outperformed the H-only scenario. This may be because including images of various qualities provides broader information: although QE prioritizes image quality, detailed defects or patterns that are difficult to see in high-quality images may be more evident in lower-quality ones. Table 5 shows the performance of the TQE inference methodology, which combines TE and QE. This experiment focused on combinations of high-quality and low-quality test images and was performed for various values of K and τ. The experimental results showed that the performance of TQE improved as K and τ increased. In detail, for K = 2, the aggregate F1-Score was 0.9448 at τ = 1, improved to 0.9500 at τ = 3, and further improved to 0.9501 at τ = 5. For K = 3, the aggregate F1-Score was 0.9513 at τ = 1, improved to 0.9537 at τ = 3, and reached 0.9540 at τ = 5. Finally, for K = 4, the aggregate F1-Score was 0.9555 at τ = 1, improved to 0.9594 at τ = 3, and reached 0.9597 at τ = 5. From this, we conclude that TQE's aggregate performance improves as K and τ increase, because as τ increases, the behavior of TQE approaches that of QE.
τ is a factor that determines the weight of TE: as τ increases, the proportion of QE grows and the proportion of TE becomes relatively small. These results confirm that as τ increases, the performance of TQE approaches that of QE. Particularly noteworthy among the experimental results were the LHH, LHLL, and LHHH scenarios. Despite having a low-quality image as the observed image, these scenarios outperformed the others. In the LHH scenario with τ = 3, precision, recall, and F1-Score all recorded the highest value of 0.9923. In the LHLL scenario with τ = 3, precision, recall, and F1-Score were 0.9233, 0.9160, and 0.9152, respectively, proving the effectiveness of TQE even in scenarios involving low-quality images. The LHHH scenario also performed best at τ = 3, exceeding QE. These results show that TE respects temporal priority while balancing against the quality weights, so that a minority of low-quality images is not simply excluded by a majority of high-quality ones. When comparing TQE and AVE at K = 4, ShuffleNetV2 achieved a higher F1-Score with TQE despite a lower recall. This means that ShuffleNetV2-based TQE made conservative predictions, resulting in low recall and high precision: it was less effective at identifying all defect cases but more accurate when it did make a prediction. ECAEfficientNet showed higher performance with AVE than with TQE as K increased, because ECAEfficientNet did not show a significant performance gap relative to TQE on low-quality images, even as a single CNN model. In other words, because ECAEfficientNet itself performed well on low-quality images, the performance of AVE also increased with K and could eventually outperform TQE. For the other models, single CNN performance on low-quality images lagged far behind TQE; as a result, as K increased, TQE became significantly better than AVE. This means that TQE has a greater effect on networks that perform poorly on low-quality images as single CNN models.

Conclusions
This paper presented TQE for identifying packaging defects under image blur. We conducted experiments to verify the performance of CNN models trained on high-quality images when identifying defects in low-quality images. The CNNs identified more than 94% of defects in high-quality images, but their accuracy dropped to about 10% to 50% on low-quality images. Additionally, we conducted experiments identifying defects in low-quality images using ensembles, confirming that both AVE and TQE outperformed the CNN models alone. However, when low-quality images comprised more than half of the input images, AVE significantly decreased the performance of the CNN models. In contrast, TQE increased performance by prioritizing image quality while preserving the temporal importance of low-quality images. As a result, in cases where at least one low-quality image was included, TQE achieved an F1-score approximately 17.64% to 22.41% higher than single CNN models and about 1.86% to 2.06% higher than AVE. These results confirm the efficiency and improvement of TQE inference on both low-quality and high-quality data across a broad range of CNN models. Additionally, ensembles that included a few low-quality images outperformed ensembles consisting solely of high-quality images, suggesting that low-quality images can provide useful features for defect inspection when included in the ensemble.
As future work, we plan to collect and analyze more types of defect patterns to improve the performance of the proposed ensemble method. We also plan to expand the scope of industrial application by addressing not only low-quality conditions caused by camera blur but also external environmental factors such as contamination and damage.

Figure 1. Challenges in conveyor systems and packaging defect inspection. (a) Vibration with a roller conveyor. (b) Out of focus with machine vision.

Figure 2. The detailed architecture of our method.

Figure 3. The image acquisition system has an image acquisition unit, light sources, a backlight, a transparent conveyor belt, and a chamber.

Figure 4. Samples of collected images. The red circles indicate defects. Left: non-defect. Middle: edge defect. Right: surface defect. (a) High-quality images. (b) Low-quality images.

Table 1. The distribution of images by class in the training, validation, and test sets of the packaging dataset.

Table 3. Performance comparison of CNN models on high-quality and low-quality images.

Table 5. Comparison of TQE based on ResNet-34 according to K and τ.

TQE showed improvement over AVE, especially at high K values. For example, in the ShuffleNetV2 model at K = 4, recall and F1-Score improved by 0.2961 and 0.3619, respectively.