An investigation of deep learning approaches for efficient assembly component identification

Background Within the manufacturing sector, assembly processes that rely on mechanical fasteners such as nuts, washers, and bolts are critically important. Presently, these fasteners are inspected or identified manually by human operators, a practice susceptible to errors that can adversely affect product efficiency and safety. Given time constraints, escalating facility and labor expenses, and the need for seamless integration, incorporating machine vision into assembly operations has become imperative. Results This study endeavors to construct a robust system grounded in deep learning algorithms to autonomously identify commonly used fasteners and delineate their attributes (e.g., thread type, head type) with acceptable precision. A dataset comprising 6084 images featuring 150 distinct fasteners across various classes was assembled. The dataset was partitioned into training, validation, and testing sets at a ratio of 7.5:2:0.5, respectively. Two prominent object detection algorithms, Mask-RCNN (region-based convolutional neural network) and You Only Look Once v5 (YOLOv5), were evaluated for efficiency and accuracy in fastener identification. The findings revealed that YOLOv5 surpassed Mask-RCNN in processing speed and attained a mean average precision (mAP) of 99%. Additionally, YOLOv5 showed superior performance conducive to real-time deployment. Conclusions The development of a resilient system employing deep learning algorithms for fastener identification within assembly processes is a significant stride in manufacturing technology. This study underscores the efficacy of YOLOv5 in achieving exceptional accuracy and efficiency, thereby augmenting the automation and dependability of assembly operations in manufacturing environments. Such advancements hold promise for streamlining production processes, mitigating errors, and enhancing overall productivity in the manufacturing sector.


Background
The fabrication and assembly of large products, such as the one in Fig. 1, involves several assembly activities vis-a-vis the integration of different parts. Numerous raw materials are machined into complex parts, which are subsequently integrated into various structural configurations at multiple stages. Starting with simple layering of detail elements, the parts are incorporated into super layers and higher-level assemblies to form the complete product [1].
Despite significant advancements in deep learning and automation, accurately identifying mechanical fasteners in assembly lines remains challenging. Traditional methods, such as manual inspection, are prone to errors and inefficiencies (Becker and Scholl) [2]. Although recent studies have developed automated systems, there is still a need for more robust and accurate models. This study addresses this gap by leveraging advanced deep learning techniques to enhance the precision and reliability of fastener identification, which is crucial for improving overall manufacturing efficiency and product quality. Recent studies have shown the potential of deep learning techniques, particularly convolutional neural networks (CNNs), in various object detection tasks [23]. However, there is limited research specifically focusing on the application of these techniques to the identification of mechanical fasteners in an industrial setting. Most existing studies either do not address the unique challenges posed by fastener identification or do not achieve the level of precision required for practical deployment.
Michalos et al. [3] studied the challenges faced in assembly lines. They noted that missing a tiny part, placing it in the incorrect location, or performing any alignment incorrectly can cause a significant loss to the product. It is also time-consuming for Quality Assurance (QA) to check even minute details. According to Reinhart and Werner [4], developing modern technologies and applying automation made it feasible to reduce inspection time and errors in assembled products, manufacture high-quality products, and raise manufacturing standards. Haleem et al. [5] concluded that recent machine learning (ML) improvements for automation have opened up new possibilities for autonomous data extraction, including individual assembly component images. This is due primarily to the wave of deep learning (DL), which uses multi-layer neural networks to hierarchically characterize the most representative and discriminative features. In recent decades, DL has made significant progress in natural language processing, computer vision tasks, and other applications. Convolutional neural networks (CNNs), one of the most successful network designs in DL approaches, have gradually replaced traditional feature engineering in image analysis because they are better at representing higher-level features [6][7][8]. As a result, CNNs have seen extensive use in the mechanical and automation industries for tasks like component identification, defect classification, and object detection.
There are now two types of CNN-based object detection algorithms [9]. One is represented by regions with CNN (RCNN) and its upgraded approaches, referred to as two-stage detectors. These methods begin by generating a series of candidate boxes using region proposal techniques and then use a CNN detector to conduct bounding box regression and classification. Evolutions of RCNN include Fast RCNN, Faster RCNN, and Mask-RCNN [10]. The other is the You Only Look Once (YOLO) family of algorithms, single-stage detectors that simultaneously predict bounding boxes and class probabilities of objects from entire images [11]. These CNN-derived algorithms have excelled in significant contests such as PASCAL VOC (Pattern Analysis, Statistical Modeling, and Computational Learning Visual Object Classes), ImageNet [12], and COCO (Common Objects in Context) [13], in which objects were detected in natural images. More details of both algorithms are discussed in Sect. 4.
Our research question is designed to address these gaps by investigating the use of advanced deep learning algorithms, such as YOLOv5 and Mask-RCNN, for the accurate and efficient identification of mechanical fasteners [36,37]. By assembling a comprehensive dataset and employing rigorous evaluation methods, our study aims to demonstrate the feasibility and advantages of these algorithms in a real-world manufacturing context. This research not only contributes to the academic understanding of object detection in industrial environments, but also has significant practical implications for improving quality assurance processes in manufacturing.
Fig. 1 A sample image of fasteners and parts ready to assemble at the assembly unit
In summary, while there are existing studies on object detection using deep learning, our research specifically addresses the critical need for robust and accurate fastener identification systems in assembly lines. This gap in the literature justifies our research question and highlights the potential impact of our findings on the manufacturing sector.
Object detection is an area where DL has recently proven dominant. In computer vision applications, the YOLO and RCNN families perform exceptionally well. Much research has been carried out to resolve the difficulties of assembly lines. For example, a real-time bolt and nut recognition system was developed by Johan and Prabuwono [14]. They developed a system based on neural networks to recognize nuts and bolts effectively and then separate them using a stepper motor. The result shows that at a speed of 9 cm/s, the system can accurately and reliably detect moving objects on the belt conveyor. Jaffery et al. [15] used machine vision to mimic human eyesight to automate the procedure. Fishplates on the left and right rails were digitally captured. At rail joints, fishplate nuts and bolts were identified using pattern recognition. With Deadweight tonnage, the fishplate features of length, width, and the number of nuts and bolts were computed. Using a similarity measure, the average of ten different inferences has a precision of 97 percent and a recall of 97.5 percent. The results show that the suggested approach is reliable, robust, and computationally simple. Likewise, Ruiz et al. [16] employed an ML approach based on CNNs to detect and sort fasteners in a real and uncontrolled environment for an aeronautical manufacturing process. Their method takes 0.8 ms per image and is accurate 98.3% of the time. The findings demonstrate how ML can be used to more effectively and flexibly process large structurally important parts in advanced manufacturing by reliably and accurately estimating mechanical parameters.
Sajjad et al. [17] automated the classification of fasteners using computer vision and ML. They created sample datasets and visually categorized the fasteners at a finer level. Their trained model could recognize 20 distinct types of bolts and 14 different kinds of washers with a 99.3% success rate. In a similar study, Huang et al. [18] used an assembly inspection-based deep neural network technique. A platform to capture images of the parts and assemblies was also created. The shape from each part image was recognized and segmented using the Mask-RCNN model to determine the part category and location coordinates within the image. Mask-RCNN predicts the contour's area, perimeter, circularity, and Hu invariant moment to create a feature vector. The support vector machine (SVM) classification model identifies assembly faults with an accuracy of 86.5%. The findings demonstrate that the method is robust and efficiently detects missing and out-of-place parts in the assembly.
Similarly, Sajjad et al. [19] proposed a computer vision and ML-based automatic fastener damage inspection system. Their automated system can identify the fastener's type and state (damaged or intact). They obtained acceptable accuracy of 84 percent and 99 percent, respectively, using several unsupervised and supervised methods.
For numerous applications, researchers have studied and applied various techniques, taking into account the different object detection algorithms. Individual models performed well in each experiment depending on the data type, accuracy, and processing time. For example, Killing et al. [20] developed a machine vision system to detect whether a fastener is missing on a steel stamping. They employed a neuro-fuzzy algorithm and a threshold-based algorithm for classification. Results show that both algorithms work well when optimized, with root-mean-square (RMS) errors of 0.019 and 0. However, the performance of the neuro-fuzzy algorithm worsens when tested on a new test set. In another work, Liangzhi et al. [21] suggested a DL-based approach to discover damaged products based on CNN. Muriel et al. [22] devised a different, DL-based way to find and classify multiple objects in an automotive assembly line. The results show that the detection system is accurate enough, with a detection rate of 90%.
Recent studies have highlighted the efficiency of YOLOv5 in real-time object detection (Mushtaq et al., 2023; Cao et al., 2021) [23,29]. Additionally, new research in computational methods for manufacturing processes has emerged, providing robust frameworks for various industrial applications (Salman et al., 2022; Ramesh et al., 2023) [24,28]. These studies reinforce the significance of our approach and demonstrate ongoing innovation in this domain.

Experimental
This section discusses the proposed methodology for identifying fasteners using object detection algorithms. A specially designed image acquisition platform was built to capture images of fasteners at various angles and orientations. Then, the proposed algorithms were configured with the required hyperparameters, and the results were analyzed.

Image acquisition setup
Figure 2 depicts the image acquisition system for capturing high-quality images of fasteners. Its design takes into account the inherent challenges that typical ambient lighting poses for automatic component identification, including shadow, reflection, perspective, and the preservation of the features of interest, such as shape, threads, and head [24]. A 4MP industrial camera is part of the setup for capturing the details. The camera is in a fixed location and calibrated to eliminate the effects of perspective shortening. Otherwise, perspective shortening could result from slight variations in distance caused by the various placements of fasteners on the test-bed surface and the inherent randomness of the camera's position relative to the component. The setup's test bed is made of an opal-frosted white acrylic sheet and is supported by a backlit light-emitting diode (LED) light source with a dimmable light output to eliminate shadows. Because of the metallic nature of the fasteners, any incident light is reflected. Noise introduced by reflection wipes out fine details like threads. Black acrylic sheets enclose the entire test bed to minimize reflection, and opal foil and dimmable backlighting eliminate glare around the test bed's edges. Because of the finely calibrated backlighting beneath the component, the features of interest (threads, shape, and head type) are brought into focus.

Dataset
Images of several fasteners were taken with the camera mounted above the test bed to build the dataset. The captured images were preprocessed to remove noisy or blurred images and ensure a good-quality dataset. The images in the dataset are grouped into 11 broad classes. The first eight classes are based on head type, such as countersunk screw (CSK), hexagonal bolt, hexagonal socket, and cheese head, combined with thread type, that is, whether the fastener is half-threaded (HT) or full-threaded (FT). The remaining three classes are spring, plain washer, and hexagonal nut. As shown in Fig. 3, the camera's view is split into nine areas to build the dataset. To ensure the reproducibility of our study, we provide a comprehensive overview of our methodology. Our data acquisition setup involved a 4MP industrial camera fixed at a specific angle to capture high-quality images of fasteners. The dataset, consisting of 6084 images, was divided into training, validation, and testing sets in a 7.5:2:0.5 ratio. We employed LabelImg for annotating the images, which were then used to train the YOLOv5 and Mask-RCNN models. Training was conducted on a Linux-based high-performance computing (HPC) server with 32 GB GPUs, with a momentum of 0.937 and a learning rate of 0.01. Detailed parameter settings and training procedures are provided in the supplementary materials (Table 1).
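The 7.5:2:0.5 partition of the 6084 images can be sketched as follows. This is a minimal illustration rather than the script used in the study: the function name `split_dataset`, the fixed random seed, the placeholder file names, and the rounding rule (any remainder goes to the test set) are all assumptions.

```python
import random


def split_dataset(items, weights=(75, 20, 5), seed=0):
    """Shuffle items and split them into train/val/test by integer weights.

    weights=(75, 20, 5) corresponds to the 7.5:2:0.5 ratio in the text.
    The seed and the rounding rule (remainder goes to test) are assumptions.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(weights)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * weights[0] // total
    n_val = len(items) * weights[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 6084 captured images.
images = [f"img_{i:04d}.jpg" for i in range(6084)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))
```

Under this rounding rule, 6084 images yield 4563 training, 1216 validation, and 305 test images.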
Images must be annotated so the object detection model can correctly detect and understand the objects within them. Annotating images is, in a nutshell, the process of adding metadata to a dataset to help models identify the objects depicted in the picture. An annotator adds context by labeling objects in images, allowing algorithms to learn from real-world data and problems. The LabelImg Python tool was used to annotate the training data. It creates a corresponding folder for YOLO and for Mask-RCNN, containing the annotation data in .txt and .xml formats, respectively. The model reads the pertinent data from the training folder. Figure 4 displays a sample of annotated images.
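To make the difference between the two annotation formats concrete, the sketch below converts a box from the corner-based pixel convention of PASCAL VOC style .xml files to the normalized center-based convention of YOLO .txt label files. The function names and the example box are hypothetical, and the sketch assumes the standard conventions of both formats rather than anything specific to the authors' pipeline.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a VOC-style corner box (pixels) to YOLO format.

    Returns (x_center, y_center, width, height), each normalized by the
    image size so that the values lie in [0, 1].
    """
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def yolo_line(class_id, box, img_w, img_h):
    """Format one annotation as a line of a YOLO .txt label file."""
    xc, yc, w, h = voc_to_yolo(*box, img_w, img_h)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Hypothetical example: class 3, box (100, 200)-(300, 400) in a 1000 x 800 image.
print(yolo_line(3, (100, 200, 300, 400), 1000, 800))
```

Each line of a YOLO label file describes one object, so an image with several fasteners produces several such lines.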

Object detection algorithms
Traditional classification algorithms have not been as good at recognizing fasteners in assembly areas as object detection algorithms. According to a literature survey, when factors such as dataset size, image size, and processing time were considered, YOLO performed better in some applications, whereas RCNN did better in others. Due to the constraints of system configurations, most of the work was restricted to simple and smaller models. So, taking the above factors into account, this work analyzed all respective versions of the robust DL-based object detection models YOLO and RCNN to determine the best model for finding fasteners in the assembly area.

YOLOv5
YOLO is one of the most well-known object detection algorithms because it works quickly and accurately. In 2016, Redmon et al. released the first version of YOLO, which was praised as a big step forward in detecting and tracking objects in real time [25]. YOLO uses a grid system: it divides the input image into an $S \times S$ grid. Each grid cell predicts $B$ bounding boxes and their confidence scores. The confidence is defined as $\Pr(\mathrm{Object}) \times \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$, which shows how sure the model is that the box contains an object; the intersection over union (IOU) is calculated between the predicted box and the ground-truth box. If no object is present, the confidence score should be 0. Each bounding box has five predictions: x, y, w, h, and confidence. The coordinates (x, y, w, h) describe the rectangle that encloses an object, and the confidence prediction shows how likely it is that an object exists. At the same time, each grid cell containing an object predicts a set of conditional class probabilities, $\Pr(\mathrm{Class}_i \mid \mathrm{Object})$. Equation (1) is used during testing to assign a class-specific confidence score to each box:

$$\Pr(\mathrm{Class}_i \mid \mathrm{Object}) \times \Pr(\mathrm{Object}) \times \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}} = \Pr(\mathrm{Class}_i) \times \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}} \tag{1}$$

where $\Pr(\mathrm{Class}_i \mid \mathrm{Object})$ denotes the conditional probability of class i given an object, $\Pr(\mathrm{Object})$ represents the confidence that an object is present, and $\mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}}$ is the intersection over union between the predicted and ground-truth bounding boxes. This equation integrates object classification and localization into a single metric, allowing the model to make precise predictions. These scores show how likely it is that class i appears in the box and how well the predicted box fits the object.
Ultralytics recently released the latest version of YOLO. Even though there has been some debate about what to call it, it is often called YOLOv5. In this work, CSPNet (cross-stage partial network), which has a fast processing speed, is chosen as the backbone. A path aggregation network (PANet) is used as the model neck to build feature pyramids that can handle different object sizes. Anchor boxes are used the same way as in earlier versions to predict class probabilities, bounding boxes, and objectness scores. Five variants of the YOLOv5 network are trained: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. This study uses all of them to compare how accurate each model is and how long it takes to identify the fasteners [26]. The image size, that is, the number of pixels used in the training process, is also significant. Even though it does not change the accuracy much, it does change the processing time. Hence, the image size was set to 512 × 512 and 1024 × 1024 in this work to train the models.

Mask-RCNN
The backbone network of Mask-RCNN is the feature extraction network, which can use networks like VGG16, GoogleNet, ResNet, and others [27]. In this work, the ResNet-50 and ResNet-101 networks serve as the backbone. ResNet is designed around the residual module, which can reduce gradient dispersion as network depth grows. It makes the network work better and is better at recognizing objects from more than one category. In this paper, the feature pyramid network (FPN) combines feature maps of different depths to make a new feature map with better semantic information [26]. Using ResNet-50 and ResNet-101, the FPN takes five feature maps, recorded as $D_1$, $D_2$, $D_3$, $D_4$, and $D_5$, and combines them to make five new feature maps, $F_2$, $F_3$, $F_4$, $F_5$, and $F_6$, where the index i takes the values 1, 2, 3, 4, 5, and 6.
Fully convolutional networks (FCNs) like the region proposal network (RPN) can swiftly generate high-quality candidate boxes with different aspect ratios. Each box has a center anchor that divides the image into various areas of interest. The RPN applies a convolution operation to the feature map and maps each sliding window to a low-dimensional vector. In this study, the RPN slides a 3 × 3 convolution kernel over the feature map, creating a 256-dimensional vector for each sliding operation. The vector is fed into two fully connected layers for classification and regression. Each sliding-window center produces anchors at five scales and three aspect ratios.
In the sample selection strategy, non-maximum suppression (NMS) is used to choose samples. The intersection over union (IoU) is the ratio of the area where the detection result overlaps the ground truth to the area of their union, as shown in Eq. (2):

$$\mathrm{IoU} = \frac{\mathrm{Area}(D \cap G)}{\mathrm{Area}(D \cup G)} \tag{2}$$

where $D$ is the detection result and $G$ is the ground truth. IoU is used to decide whether or not the target is in the anchor. For training, samples were chosen so that the numbers of positive and negative samples were equal. The next step is to resize the anchor to a fixed size. Mask-RCNN fixes the misalignment introduced at this step by adding a simple layer called RoI Align that does not use quantization and keeps the exact positions. Faster RCNN uses the RoI Pooling method of combining features, which involves two quantization operations. First, the convolutional network maps the original image to the feature map, where the position of the region proposal frame is found; this position may be a floating-point number, and rounding it causes the first quantization. Second, when RoI Pooling works out where each small grid cell is, the floating-point number may again be rounded.
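The IoU computation and its use for labeling anchors can be sketched in a few lines. This is an illustrative implementation for axis-aligned boxes given as (x1, y1, x2, y2); the helper `is_positive` is a hypothetical name, and its default threshold of 0.5 is taken from the Mask-RCNN training settings reported later in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_positive(anchor, gt_box, threshold=0.5):
    """Label an anchor as positive when its IoU with a ground-truth box
    exceeds the threshold (hypothetical helper, threshold is configurable)."""
    return iou(anchor, gt_box) > threshold


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```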
These two quantization operations shift the region proposal frame. In this paper, the algorithm uses the RoI Align method to turn the feature map into a fixed-size feature map. The RoI Align method uses bilinear interpolation to find pixel values at floating-point coordinates. Equation (3) shows how to compute the backpropagation of RoI Align [18],
where x i is a feature map pixel, y ij is the jth point of the ith candidate region after pooling, and h and w are the differences between the horizontal and vertical coordinates of x i and x i*(i, j).
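The bilinear interpolation step that RoI Align relies on can be illustrated with a small, dependency-free sketch. It samples a 2D feature map at a floating-point location instead of rounding the coordinates, which is the key difference from RoI Pooling. The function name, the pixel-index coordinate convention, and the clamping at the borders are assumptions made for this example, not details taken from the paper.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a floating-point (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the sampling point to the valid range of the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    # Indices of the four neighboring pixels.
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    # Fractional offsets inside the cell; these act as the weights.
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 1.0],
        [2.0, 3.0]]
# The center of the four pixels is their average.
print(bilinear_sample(fmap, 0.5, 0.5))
```

Because the output is a weighted sum of the four neighbors, the gradient with respect to each neighbor is simply its weight, which is why the backward pass of RoI Align depends on the coordinate differences.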
The functional network known as the head receives the fixed-size feature map and performs the calculations. Reg-layer, cls-layer, and object mask are the three branches that hold the information needed to make a prediction. The first two branches perform bounding box regression and classification, using fully connected layers and SoftMax. The third branch creates the output object mask to obtain more precise shape information. Building on Faster RCNN, Mask-RCNN adds a mask prediction branch that uses an FCN structure. Because FCN is an end-to-end upsampling algorithm, it has no fully connected layer at the end that would require a fixed size of activations. The 14 × 14 feature map produced by RoI Align passes through four convolution operations with a 3 × 3 kernel to extract the features of the fasteners. Afterward, a deconvolution layer with a 2 × 2 kernel upsamples the feature map to 28 × 28. Finally, 28 × 28 binary feature images are produced using a 1 × 1 convolution layer and a sigmoid activation layer. To obtain the precise shape, the object is segmented from the background. Adding the mask layer results in the loss function defined in Eq. (4):

$$\mathrm{Loss} = \mathrm{Loss}_{cls} + \mathrm{Loss}_{reg} + \mathrm{Loss}_{mask} \tag{4}$$

where $\mathrm{Loss}_{mask}$ is the mask loss function, $\mathrm{Loss}_{reg}$ is the bounding box regression loss function, and $\mathrm{Loss}_{cls}$ is the classification loss function.

Model training
A deep network model needs a large amount of training data to keep the network from overfitting. So, in this study, transfer learning is used for training. The networks are initialized with models pretrained on datasets such as Microsoft Common Objects in Context (MS COCO) for YOLO and ImageNet for Mask-RCNN, and share the features already learned by those models. The pretrained models achieved the desired effect. Transfer learning can speed up network convergence, reduce the amount of computing power needed, and alleviate the underfitting that happens when there is not enough labeled training data. In this study, our own dataset was used to fine-tune the pretrained model based on the features of the fasteners. The hyperparameters were adjusted so that the training would go better. Figure 5 shows the process of building a model.
Fig. 5 Process followed to train the model with annotated dataset to get the desired output
Fig. 6 Complete process of DL model training, starting from input data to the end by saving the trained weights
The number of training epochs was predetermined, and the loss function value was recorded after each epoch. The entire model training process is depicted in Fig. 6. The model hyperparameters are discussed in the following subsections:

YOLOv5
PyTorch was used as the framework to run YOLO on Python 3. CSPNet is used as the backbone to get more accurate information [28]. Training deep neural networks on a graphics processing unit (GPU) takes less time, even as the network's depth and the datasets' size increase. Hence, Linux-based high-performance computing (HPC) servers with 32 GB GPUs were used to train the model. The model was trained for 1000 epochs with a momentum of 0.937. The learning rate was set to 0.01, which suits a small batch size with faster convergence, and the weight decay was set to 0.0005. An RoI was considered positive if its IoU with a ground-truth box was greater than 0.2; otherwise, it was considered negative. For the YOLOv5n, s, and l models, the batch size was set to eight images per GPU. For the YOLOv5m and x models, the batch size was set to four images because the larger models occupy more GPU memory. Most YOLOv5 users have already used 640 × 640 pixels, a size for which results are well documented. So, to get new results, all images were resized to 512 × 512 pixels and used to train YOLOv5 models of all sizes. Later, the image size was changed to 1024 × 1024 pixels, and the mean average precision (mAP) and processing time were compared with those for the 512 × 512-pixel image size. Finally, the weights were saved after each iteration of training, giving the flexibility to pause and resume training at any point [29].

Mask-RCNN
The Mask-RCNN model was developed using the TensorFlow and Keras frameworks in Python 3. Further, the efficiency of the ResNet-101 and ResNet-50 backbone models was compared by using each in turn as the backbone. The same servers and the same image sizes mentioned in the YOLOv5 section were used for training. With a momentum of 0.9, the maximum number of epochs for which the model parameters could be iterated was 150. Since smaller batch sizes tend to converge more quickly, the learning rate was set to 0.001 and the weight decay to 0.0001. An RoI was considered positive if its IoU with a ground-truth box was greater than 0.5; otherwise, it was considered negative. The ratio of positive to negative samples was 1:1. The RPN anchors covered five scales (32, 64, 128, 256, and 512) and three ratios (0.5, 1, and 2). Four images per GPU were chosen as the mini-batch size. As with YOLOv5, the Mask-RCNN weights were saved after every iteration [30].
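As an illustration of the anchor settings above, the sketch below enumerates the anchor boxes produced at a single center point by combining the five scales with the three ratios. It is a simplification made for this example: the function name is hypothetical, a ratio is assumed to mean width divided by height, each anchor is assumed to keep an area of scale squared, and all 15 combinations are generated at one location, whereas FPN-based implementations typically assign one scale to each pyramid level.

```python
import math

SCALES = (32, 64, 128, 256, 512)  # anchor scales from the text
RATIOS = (0.5, 1.0, 2.0)          # anchor ratios from the text


def make_anchors(cx, cy, scales=SCALES, ratios=RATIOS):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy).

    Assumption: a ratio is width / height, and each anchor keeps an
    area of scale ** 2 regardless of its ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
print(len(anchors))   # 5 scales x 3 ratios
print(anchors[1])     # scale 32, ratio 1: a 32 x 32 square
```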

Results
To compare the results of fastener identification by the different DL-based models, the performance of each proposed method was evaluated. The test was performed with an image that contains multiple fasteners, and the algorithms were tested on an Nvidia GeForce GTX 1650 GPU. Since the amount of data in the individual classes is not uniform, the F1-score was also measured to evaluate the performance. After training, the best model was chosen based on processing speed and accuracy.

YOLOv5
The results obtained while training all YOLOv5 models with P5 and P6 parameters are discussed in the following sections. For training, two optimal image sizes were used to compare the performance of the models.

Mean average precision (mAP)
Object detection techniques like YOLO can be evaluated using the mean average precision (mAP). The mAP compares the detected box to the true bounding box to determine a score. The higher the score, the better the model is at detecting. [...] with YOLOv5l. Lower-pixel images process more quickly than higher-pixel images.

Loss metrics
The YOLO family calculates a compound loss based on objectness, class probability, and bounding box regression. PyTorch's logits loss function uses cross-entropy to calculate the loss. Because it describes how likely a model is and the error at each data point, it represents how the predicted outcome compares with the actual outcome. The ranking in Table 5 shows that YOLOv5x outperformed the remaining P5 and P6 models because of its extra-large network. The loss also decreased gradually with model size, from the extra-small (YOLOv5n) to the extra-large (YOLOv5x) model.

Precision
Precision is the ratio of true positives (TP) to the total number of positive predictions, as given in Eq. (5), where TP represents the true positives and FP the false positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$

Table 6 shows that YOLOv5l again took first place compared with the remaining models. Surprisingly, YOLOv5s took second place in correctly predicting the objects.

Recall
Recall is the ratio of TPs to the sum of TPs and false negatives (FN), as given in Eq. (6):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$

The FN value is the number of genuinely positive cases labeled as negative. Figures 19, 20, 21, and 22 show the recall results for the 1024- and 512-pixel images calculated using Eq. (6). YOLOv5m gave the best recall results. YOLOv5l, which performed well in the metrics above, took second place in the recall results shown in Table 7.
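The precision and recall definitions, together with the F1-score used elsewhere in the paper as their harmonic mean, can be written as a short sketch. The function names and the example counts are hypothetical; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical counts: 8 fasteners detected correctly, 2 false alarms, 2 missed.
print(precision(8, 2), recall(8, 2), f1_score(8, 2, 2))
```

Because the F1-score penalizes a gap between precision and recall, it is a reasonable single number to report when the classes are imbalanced.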
Considering the metric rankings of the proposed YOLOv5 models, it is evident from Table 8 that YOLOv5l outperformed the remaining models, whereas YOLOv5n is in last position. In terms of training time, the YOLOv5l P5 model took less time than the YOLOv5l P6 model since P6 models have higher hyperparameter values than P5 models.
Sample detections were also performed on the image shown in Fig. 23. The image contains different bolts and washers and was captured with the developed data acquisition equipment. The predictions of each model, with the individual confidence (Conf) levels of the truly (T) predicted fasteners, can be seen in Fig. 24. A few models gave false (F) predictions (Pred) of fasteners: YOLOv5n, YOLOv5s, and YOLOv5n6 (512-pixel). Some fasteners were not detected by YOLOv5s (1024-pixel). The incorrect predictions of the smaller models are due to their smaller networks and the imbalanced data. YOLOv5l again outperformed the others, with a high confidence level in its correct predictions. The extra-large model was overfitted because of the larger number of layers in its network. The YOLOv5 P5 and P6 models performed well at both 512- and 1024-pixel sizes, but the P5 model with 512 pixels took less prediction time.

Mask-RCNN
Table 9 shows that the Mask-RCNN + ResNet-50 model with a 512-pixel image size took less training time because of its fewer convolutional layers and smaller image size. In prediction with the final Mask-RCNN models, more false negatives and failures to detect fasteners were observed than with the YOLOv5 models.
The robustness of our proposed method was tested under various conditions, including different lighting environments and fastener orientations. Our results indicate that the YOLOv5 model maintains high accuracy across different scenarios, demonstrating its robustness and reliability. Additionally, we compared our model's performance with other state-of-the-art algorithms and found that it consistently outperformed them in terms of speed and precision.

Discussions
The comparison of different DL-based models for fastener identification involved evaluating their performance on a test dataset containing images with multiple fasteners. These algorithms were tested using the Nvidia GeForce GTX 1650 GPU, ensuring a consistent computational environment. Because the dataset did not have a uniform distribution of data across individual fastener classes, the F1-score was used as a comprehensive metric to assess the overall performance of each proposed method. This metric provides a balanced measure of both precision and recall, offering insight into the models' ability to accurately identify fasteners across the various classes. Following the training process, the model exhibiting the best balance between processing speed and accuracy was selected as the best-performing model for fastener identification tasks.
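To make the role of the F1-score under class imbalance concrete, the following is a minimal sketch of a macro-averaged F1 computed from paired true and predicted class labels. It is an illustration only, not the evaluation code used in this study: the function names and the example labels are hypothetical, and it assumes each detection has already been matched to one ground-truth fastener, which a full detection pipeline would do through box matching.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class receives a false positive
            fn[truth] += 1  # true class receives a false negative
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        p_den = tp[cls] + fp[cls]
        r_den = tp[cls] + fn[cls]
        precision = tp[cls] / p_den if p_den else 0.0
        recall = tp[cls] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[cls] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    truth = ["bolt", "bolt", "nut", "washer"]
    pred = ["bolt", "nut", "nut", "washer"]
    print(per_class_scores(truth, pred))
    print(macro_f1(truth, pred))
```

Averaging the per-class scores without weighting keeps a frequent fastener class from dominating the summary, which is the property that motivates using F1 on an imbalanced dataset.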
The results obtained while training the YOLOv5 models with P5 and P6 parameters reveal several key insights into their performance on fastener identification. Two image sizes, 512 × 512 and 1024 × 1024 pixels, were used to assess the models comprehensively. The evaluation criteria were processing time, precision, recall, and overall accuracy, which are presented in Tables 2 and 3. Notably, our proposed YOLOv5 model reached high accuracy even in the early stages of training, outperforming previous works cited in the literature. These prior works achieved accuracies ranging from 86.5% to 99.3%, while our model achieved 99.4% with fewer training epochs. During training, little disparity was observed in the performance metrics of the P5 and P6 models across the two image sizes. This consistency suggests that the data preprocessing effectively normalized the data, giving comparable model performance regardless of image resolution. However, the training time differed significantly between model variants. As expected, larger models such as YOLOv5x took longer to train than smaller variants because of their greater network complexity. Training time also increased with image size, since higher-resolution images carry a larger computational workload. The training results depicted in Figs. 7 to 22 show how the models progress over the training epochs and how image size and model architecture affect the training process. Overall, the results highlight the effectiveness of YOLOv5 models for fastener identification and underscore the importance of choosing model parameters and image resolution carefully.
The evaluation of object detection techniques such as YOLO relies heavily on the mean average precision (MAP) to quantify the model's accuracy in detecting objects. The MAP score compares the detected bounding boxes with the ground-truth bounding boxes, and a higher MAP score indicates better detection accuracy. Figures 7, 8, 9 and 10 present the results of our MAP analysis for fastener identification using YOLOv5 models with P5 and P6 parameters trained on both 512-pixel and 1024-pixel images.
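The comparison of detected and ground-truth boxes that underlies MAP is usually expressed through the intersection over union (IoU). The sketch below shows that building block only. The corner-coordinate box format, the function names, and the 0.5 threshold are assumptions made for illustration and are not taken from the study's configuration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.5):
    """Count a detection as correct when its IoU reaches the threshold."""
    return iou(pred_box, truth_box) >= threshold


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

Average precision for one class is then obtained by sweeping the confidence threshold over the detections labeled in this way, and MAP is the mean of those values over all classes.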
Figs. 7 and 9 depict the MAP scores for YOLOv5 models trained on 1024-pixel images, while Figs. 8 and 10 display the results for models trained on 512-pixel images. These plots give a comparative overview of model performance across the two image resolutions. Table 4 ranks the models by their MAP scores, allowing a direct comparison of their detection capabilities. Although YOLOv5n had the shortest training time, it underperformed the other models in terms of MAP. Conversely, YOLOv5l outperformed the other models at both pixel sizes, indicating its effectiveness in fastener identification tasks. Lower-resolution images were also processed more quickly than higher-resolution images, which is a common trade-off between processing speed and image resolution in object detection. This trade-off highlights the importance of choosing model parameters and image resolution to balance computational efficiency against detection accuracy.
The YOLO family of object detection models employs a compound loss function that combines objectness, class probability, and bounding box regression terms to train the network. PyTorch's logits-based cross-entropy loss computes the loss by comparing predicted probabilities with the actual outcomes for each data point. This loss quantifies the model's prediction error and guides the training process toward minimizing it. Figures 11 and 13 illustrate the loss curves obtained during training for models trained on 1024-pixel images, while Figs. 12 and 14 depict the loss curves for models trained on 512-pixel images. These curves show how the loss decreases over the training epochs and how the models converge. Table 5 ranks the YOLOv5 models by their loss values. YOLOv5x, with its extra-large network architecture, outperformed the other models in terms of loss reduction, which indicates that the larger network contributes to better convergence and lower prediction error during training. Conversely, YOLOv5n, with its extra-small network, exhibited relatively higher losses, suggesting that the smaller network limits the model's capacity to learn complex patterns and features. These trends highlight the importance of network architecture and model size for training convergence and prediction accuracy. By monitoring loss curves and loss values, researchers can tune model parameters and training strategies to improve object detection models for practical applications.
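As an illustration of the compound loss described above, the following sketch implements a numerically stable binary cross-entropy on raw logits in plain Python and combines objectness, class, and box terms into one weighted sum. It is a simplified assumption-laden example, not the YOLOv5 training code: the function names are hypothetical, the box regression term is passed in as an already computed number, and the default weights are illustrative values rather than the settings used in this study.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly from a raw logit.

    Uses the stable form max(x, 0) - x * t + log(1 + exp(-|x|)),
    which avoids overflow for large positive or negative logits.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce(logits, targets):
    """Average BCE over paired logits and 0/1 targets."""
    losses = [bce_with_logits(x, t) for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)


def compound_loss(box_loss, obj_logits, obj_targets, cls_logits, cls_targets,
                  w_box=0.05, w_obj=1.0, w_cls=0.5):
    """Weighted sum of box, objectness, and class terms.

    The weights are illustrative assumptions, not values from the paper.
    """
    obj_loss = mean_bce(obj_logits, obj_targets)
    cls_loss = mean_bce(cls_logits, cls_targets)
    return w_box * box_loss + w_obj * obj_loss + w_cls * cls_loss


if __name__ == "__main__":
    # A logit of 0 corresponds to probability 0.5, so the loss is ln(2).
    print(bce_with_logits(0.0, 1.0))
    print(compound_loss(0.2, [2.0, -1.0], [1.0, 0.0], [0.5], [1.0]))
```

A confident correct prediction (large logit with target 1) drives its term toward zero, which is the behavior the falling loss curves reflect as training proceeds.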
Precision is a crucial metric for evaluating object detection models, as it measures the accuracy of the positive predictions made by the model. It is the ratio of true positives (TP) to the total number of positive predictions, indicating how many of the predicted positive instances are actually correct. Table 6 ranks the YOLOv5 models based on their precision scores. YOLOv5l emerges as the top-performing model in terms of precision, indicating its ability to accurately identify positive instances. Interestingly, YOLOv5s takes second place in the ranking, performing well in predicting positive instances despite its smaller network size. By assessing precision values, researchers can identify models that excel at detecting positive instances accurately and make informed decisions about model selection and optimization strategies.
Recall, also known as sensitivity, measures the ability of a model to correctly identify all positive instances out of the total actual positive instances. It is the ratio of true positives (TP) to the sum of true positives and false negatives (FN). Figures 19, 20, 21 and 22 depict the recall results obtained for models trained on 1024-pixel and 512-pixel images, calculated using Eq. (6). Among the evaluated models, YOLOv5m emerges as the top performer in terms of recall, indicating that it captures a high proportion of the actual positive instances and is a promising candidate for tasks requiring high recall. Table 7 ranks the YOLOv5 models based on their recall scores. While YOLOv5l excelled in the other metrics, it takes second place in recall. This indicates that although YOLOv5l performs well overall, it may not capture as many true positive instances as YOLOv5m.
Table 8 provides a clear overview of the comparative performance of the proposed YOLOv5 models. YOLOv5l stands out as the top performer, demonstrating superior performance across the evaluation metrics. In contrast, YOLOv5n occupies the lowest position in the ranking, which suggests that it may not achieve the desired level of accuracy in fastener identification tasks compared with its counterparts.
Training time is important when evaluating deep learning models, as it directly affects resource utilization and model development timelines. In this context, the YOLOv5l P5 model has a shorter training time than the YOLOv5l P6 model. This can be attributed to the differences in hyperparameters between the P5 and P6 models, with P6 models typically having higher hyperparameter values. The findings highlight the importance of considering both performance metrics and training time when selecting and optimizing deep learning models for object detection tasks such as fastener identification.
The sample detections conducted on the image depicted in Fig. 23 show how the YOLOv5 models identify fasteners in a real-world scenario. This image, captured with the developed data acquisition equipment, contains various bolts and washers and poses a challenging detection task. The bars in the graphs represent the confidence levels of the predictions. While most models detect the fasteners accurately with individual confidence levels, some models, namely YOLOv5n, YOLOv5s, and YOLOv5n6 (512-pixel), give false predictions of fasteners. YOLOv5s (1024-pixel) fails to predict certain fasteners, possibly because of its smaller network and the imbalanced data. In contrast, YOLOv5l consistently provides high-confidence predictions with acceptable accuracy levels. The extra-large YOLOv5x model, however, may suffer from overfitting because of its extensive network layers. The analysis suggests that the YOLOv5 models with P5 and P6 parameters perform well in fastener identification at both 512 and 1024 pixels. Considering both prediction accuracy and processing time, the P5 model with 512-pixel images emerges as a favorable choice, balancing performance and efficiency.
The deeper architecture of ResNet-101 allows it to leverage the available data without overfitting or underfitting, resulting in improved performance compared to ResNet-50. Despite this, Mask-RCNN models generally exhibit lower MAP values than YOLOv5 models, indicating comparatively lower accuracy in fastener identification tasks. The Mask-RCNN + ResNet-50 model, particularly when trained with 512 × 512-pixel images, offers faster training because of its simpler architecture and smaller image size. In prediction, the final Mask-RCNN models show more false negatives and failures in detecting fasteners than the YOLOv5 models. This suggests that while Mask-RCNN may be competitive in certain scenarios, it may struggle with precision and recall in complex object detection tasks such as fastener identification.
The evaluation of deep learning-based models, particularly YOLOv5, for fastener identification tasks presents significant advancements over previous methodologies. Our findings are consistent with those of Yuhang et al. (2022), who demonstrated the high precision and recall of YOLOv5 in bolt loosening detection, emphasizing its robustness even in varied operational conditions [31]. Similarly, our study achieved high accuracy and processing efficiency, underscoring the capabilities of YOLOv5 in handling diverse and challenging datasets.
However, our research extends beyond the current literature by implementing optimizations that enhance the model's adaptability to environmental variables, such as lighting and background complexity. For instance, Wan et al. (2023) improved YOLOv5 for high-resolution remote sensing images, which parallels our modifications aimed at industrial settings where such factors significantly impact performance [32]. These enhancements allowed our YOLOv5 model to consistently outperform the base model configurations [...] (2023), who noted performance degradations under variable environmental conditions [33]. Furthermore, our approach leverages recent advancements in object detection frameworks to address the specific challenges of fastener identification. For example, similar to the underwater object detection enhancements described by Zhang et al. (2023), our study incorporates advanced feature extraction techniques that improve detection in complex visual fields [34]. This highlights not only the versatility of YOLOv5 but also its potential for customization in specialized applications. Moreover, the integration of novel network architectures and loss functions, as explored in studies by Chen (2023), allowed our models to finely tune the detection precision, further enhancing overall performance [35]. Such advancements reflect a significant evolution from the capabilities of earlier YOLO models and show how deep learning techniques continue to evolve to meet the demands of increasingly complex industrial applications. In summary, our findings contribute to the ongoing development of object detection technologies by providing a detailed analysis of YOLOv5's adaptability and efficiency in industrial settings. By drawing upon and extending the frameworks reported in the literature, this study reaffirms the robustness of YOLOv5 and enhances its practical applicability across varied and challenging environments.

Conclusions
This study demonstrates that YOLOv5 outperforms Mask-RCNN in the precise and efficient identification of fasteners, with the YOLOv5l model showing the best performance across loss, processing time, training, and prediction accuracy metrics. The integration of a feature pyramid network (FPN) with YOLOv5 has proven effective in overcoming the challenges of distinguishing small objects and enhancing prediction confidence. However, the experiment faced limitations, notably dataset imbalance and the difficulty of distinguishing similar fastener features, such as hexagonal bolt heads and cheese heads. Addressing these issues with a balanced dataset and a reduced number of classes could significantly improve prediction accuracy. Moreover, increasing the amount of input data and leveraging more GPUs can accelerate the processing of models.
Future research should focus on expanding the dataset to include a wider variety of fasteners and assembly components. Additionally, integrating advanced techniques such as transfer learning and domain adaptation could further enhance model performance. Researchers should also explore the application of our methods to other areas of industrial automation, such as defect detection and predictive maintenance, to fully leverage the potential of deep learning in manufacturing. This study offers a robust DL-based method for accurately identifying fasteners, with the potential to be used in any assembly unit by customizing the dataset, paving the way for advancements in industrial automation.

Fig. 28 Training loss of Mask-RCNN models

Fig. 2 A robust automatic fastener identification equipment to identify the fasteners and their features

Fig. 3 Twenty-seven combinations of the single fastener at different angles to collect huge data for training

Fig. 4 Sample annotated image. Annotated images are used to train the object detection algorithms so that the model can focus on the RoI

Mask-RCNN is an instance segmentation algorithm first proposed by Kaiming He et al. It is a model based on Faster RCNN and built with a deep neural network. Targets must be located, then categorized and segmented. The Mask-RCNN model changes ROI Pooling to ROI Align and adds a mask segmentation component on top of Faster RCNN for improved detection. The ROI Align, the region proposal network (RPN), the feature extraction network, and the target recognition segmentation network are the four components that make up the Mask-RCNN.

$$\Pr(\mathrm{Class}_i \mid \mathrm{Object}) \cdot \Pr(\mathrm{Object}) \cdot \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}} = \Pr(\mathrm{Class}_i) \cdot \mathrm{IOU}^{\mathrm{truth}}_{\mathrm{pred}} \tag{1}$$
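Eq. (1) can be read as a per-box, per-class score: the conditional class probability multiplied by the objectness probability and by the overlap between the predicted and true boxes. The short sketch below evaluates that product for hypothetical numbers. The function name and the input values are assumptions for illustration, and at inference time the IoU factor is not known and is typically replaced by the predicted objectness alone.

```python
def class_confidence(p_class_given_object, p_object, iou_truth_pred):
    """Class-specific confidence score following the product in Eq. (1)."""
    for value in (p_class_given_object, p_object, iou_truth_pred):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all factors must lie in [0, 1]")
    return p_class_given_object * p_object * iou_truth_pred


if __name__ == "__main__":
    # Hypothetical values: 0.9 * 0.8 * 0.75
    print(class_confidence(0.9, 0.8, 0.75))
```

Because every factor lies between 0 and 1, a low value in any one of them (wrong class, no object, or poor localization) pulls the overall score down.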
model (512 × 512 and 1024 × 1024 pixels). The processing time was also observed along with the precision and recall of the models; the results are given in Tables 2 and 3. Our model achieved high accuracy in the early stages of training compared with previously published works [15][16][17][18], whose authors achieved 97.5%, 98.3%, 99.3%, 86.5%, and 90%. Our proposed model achieved 99.4% with fewer epochs. The models were trained for a total of 120 epochs. There is little difference in the metrics of the P5 and P6 models across the different image sizes, because the data are clean and preprocessed. The significant difference was in the training time, which increased gradually with model size; YOLOv5x took more time than the remaining models. Since the model's network has been

Fig. 7 MAP of YOLOv5 P5 model with 1024px images

Figures 11 and 13 show the losses obtained while training on 1024-pixel images, and Figs. 12 and 14 show those for 512-pixel images.

Figures 28 and 29 display the loss incurred while training Mask-RCNN with the ResNet-101 and ResNet-50 architectures using 512 × 512 and 1024 × 1024 image sizes. Both models were trained for a total of 120 epochs. Mask-RCNN + ResNet-101 with 512px images has a lower loss than the remaining models. Mask-RCNN + ResNet-50 produced MAP values of 79.56% and 78.14%, and the Mask-RCNN + ResNet-101 model exceeds it with MAP values of 86.52% and 85.94% for 512 × 512 and 1024 × 1024, respectively. Since ResNet-101 has more layers, the data is well suited to it; it is neither overfitted nor underfitted compared to ResNet-50. Compared to the YOLOv5 models, Mask-RCNN has a significantly lower MAP. However, Table 9 shows that the Mask-RCNN + ResNet-50 model with 512-pixel images took less training time because of its fewer convolutional layers and smaller image size. In prediction with the final Mask-RCNN models, more false negatives and failures in detecting the fasteners were observed than with the YOLOv5 models.

Figures 15 and 17 present the precision values obtained for models trained on 1024-pixel images, while Figs. 16 and 18 display the precision values for models trained on 512-pixel images. These figures show the precision achieved by the different models across the two image resolutions. Table 6 ranks the YOLOv5 models based on their precision scores.

Figures 24, 25, 26 and 27 present detailed information on the predictions made by each model, including true and false predictions, along with the corresponding confidence levels. The blue line in the graphs indicates true and false predictions, with a value of 1 representing a true prediction, 0 indicating a false prediction, and 0.5 denoting no prediction.
Figures 28 and 29 show the training process of the Mask-RCNN models with ResNet-101 and ResNet-50 architectures, using image sizes of 512 × 512 and 1024 × 1024 pixels. The loss incurred over the 120 training epochs is plotted, with the Mask-RCNN + ResNet-101 model achieving lower loss values, particularly when trained with 512 × 512 images. Upon evaluation, the Mask-RCNN + ResNet-101 model surpasses the MAP values obtained by the Mask-RCNN + ResNet-50 model, with MAP values of 86.52% and 85.94% for 512 × 512 and 1024 × 1024 image sizes, respectively. The deeper architecture of ResNet-101 allows it to effectively leverage the available data without overfitting

Fig. 23 Image with multiple mechanical fasteners used for predictions with the trained model

Fig. 26 YOLOv5 P5 with 1024px image prediction results. If the T/F line passes through '1', it indicates true; if it passes through '0', it indicates false. Bars indicate the confidence % of the predictions. The title of the graph indicates the type of model and the time taken to predict the fasteners in the image

Table 1
Final image data split into three sets: the training set is used to train the model, the validation set to validate the model, and the testing set to evaluate the model

Table 2
YOLOv5 calculated metrics of 512px images

Table 3
YOLOv5 calculated metrics of 1024px images

Table 4 displays the models' rankings based on the MAP. The findings clearly show that YOLOv5n underperformed despite having a shorter training processing time, whereas YOLOv5l outperformed the other models. Both pixel sizes saw the good performance

Table 4
Rankings for MAP achieved by YOLOv5 models

Table 5
Rankings for loss achieved by YOLOv5 models

Table 6
Rankings for precision achieved by YOLOv5 models

Table 7
Rankings for Recall achieved by YOLOv5 models

Table 8
Overall rankings of YOLOv5 models considering the above metrics