Improved YOLOv7 Network Model for Gangue Selection Robot for Gangue and Foreign Matter Detection in Coal

Coal production often involves a substantial presence of gangue and foreign matter, which not only impacts the thermal properties of coal but also leads to damage to transportation equipment. Selection robots for gangue removal have garnered attention in research. However, existing methods suffer from limitations, including slow selection speed and low recognition accuracy. To address these issues, this study proposes an improved method for detecting gangue and foreign matter in coal, utilizing a gangue selection robot with an enhanced YOLOv7 network model. The proposed approach entails the collection of coal, gangue, and foreign matter images using an industrial camera, which are then used to create an image dataset. The method involves reducing the number of convolution layers in the backbone, adding a small-size detection layer to the head to enhance small target detection, introducing a contextual transformer networks (COTN) module, employing a distance intersection over union (DIoU) border regression loss function to calculate the overlap between predicted and real frames, and incorporating a dual-path attention mechanism. These enhancements culminate in a novel YOLOv71 + COTN network model, which was then trained and evaluated using the prepared dataset. Experimental results demonstrated the superior performance of the proposed method compared to the original YOLOv7 network model: precision increased by 3.97%, recall by 4.4%, and mAP0.5 by 4.5%. Additionally, the method reduced GPU memory consumption during runtime, enabling fast and accurate detection of gangue and foreign matter.


Introduction
In recent years, the integration of robots in gangue and foreign matter screening has gained significant momentum as a notable development trend. Researchers have shown considerable interest in the visual information-based gangue robot selection method [1][2][3]. The efficacy of a gangue selection robot, which relies on visual information, is contingent upon precise gangue and foreign object recognition, as well as the speed of the manipulator. Currently, gangue selection robots face challenges, including slow selection speed and limited recognition accuracy. Attaining high-speed and accurate target recognition is a pivotal factor in enabling the automatic screening of a visual information-based gangue selection robot [4].
In the field of target recognition, researchers are progressively transitioning from relying on physical differences to exploiting faster and more intuitive visual distinctions. The integration of machine learning techniques, especially deep learning, has brought about improved model robustness. Unlike l2 loss, DIoU loss directly minimizes the distance between the two bounding boxes, leading to faster convergence and better detection performance; furthermore, the DIoU loss function is scale-invariant, further contributing to its effectiveness.
Prior to training the model on the dataset, the input sample images underwent a normalization process. This step played a vital role in enhancing the accuracy and efficiency of gangue and foreign matter image detection, thereby endowing the new model with robust multi-scale image processing capabilities. As a result, the new model demonstrated rapid detection and identification of gangue and foreign matter within coal samples. Furthermore, even in scenarios where these objects are covered with coal dust, they can still be reliably identified based on their distinctive physical characteristics. The new and improved model achieved these capabilities while reducing the number of parameters and the complexity of the network structure, leading to enhanced speed and accuracy in object identification. These advancements offer the benefits of low detection costs, high accuracy, and reliability.

Related Works
The presence of gangue and foreign matter in coal production has a substantial impact on the heating value of coal and can cause damage to transportation equipment, such as belt conveyors. Thus, it is essential to detect and effectively remove gangue and foreign matter to ensure reliable, efficient, and environmentally friendly sorting processes [32]. The development of automatic systems for accurate and efficient detection and removal of gangue and foreign matter holds paramount practical significance in the coal industry.
Traditional methods for gangue selection in coal processing include manual gangue selection, jigging gangue selection, heavy medium gangue selection, and ray gangue selection [33]. Manual gangue selection is characterized by high labor intensity and low efficiency. Jigging gangue selection offers a simple process, ease of operation, and good processing capacity; however, it struggles with smaller gangue blocks. Heavy medium gangue selection exhibits high efficiency and wide applicability to different gangue types, but it suffers from a complex process system, significant equipment wear, and high cost. Ray gangue selection involves X-ray and γ-ray techniques, providing high efficiency and reliability. Nevertheless, environmental concerns arise with this method, making it less aligned with the principles of green development.
The YOLOv7 network is composed of three main parts [34]: the input, backbone, and head, as shown in Figure 1. In contrast to YOLOv5, YOLOv7 combines the neck and head layers into a single head layer, while retaining functionality similar to that of each part in YOLOv5. The backbone is responsible for extracting features, and the head is utilized for making predictions. The network process in YOLOv7 begins with the preprocessing of the input image into a standardized RGB format with dimensions of 640 × 640. The processed image is then fed into the backbone network. From the outputs of three layers within the backbone network, feature maps of different sizes are generated and subsequently passed through the head layer. Following the RepVGG block and Conv, target detection takes place, ultimately yielding the final output.
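As an illustration of standardizing an input image to the 640 × 640 network input described above, the following dependency-free sketch performs a nearest-neighbour resize. Note this is a simplification: YOLOv7 implementations typically use aspect-ratio-preserving letterbox resizing with padding, and `resize_nearest` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def resize_nearest(img: np.ndarray, size: int = 640) -> np.ndarray:
    """Nearest-neighbour resize of an H x W x C image to size x size.

    A minimal stand-in for the standardized 640 x 640 RGB preprocessing;
    production pipelines would use letterbox resizing instead.
    """
    h, w = img.shape[:2]
    # map each output row/column back to its nearest source row/column
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]
```
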
The backbone part of the YOLOv7 network model comprises four CBSs (i.e., convolution, batch normalization, and Sigmoid-weighted linear unit (SiLU) activation) followed by an efficient layer aggregation network (ELAN). By combining the outputs of three MaxPooling (MP) layers with ELAN, feature maps of sizes 80 × 80 × 512, 40 × 40 × 1024, and 20 × 20 × 1024 are obtained. Each MP layer consists of 5 layers, while ELAN is composed of 8 layers, resulting in a total of 51 layers in the backbone section. The head section of YOLOv7 adopts a structure similar to the PAFPN observed in YOLOv4 and YOLOv5. Initially, the final output of the 32-fold downsampled feature map C5 from the backbone is obtained. Subsequently, spatial pyramid pooling with a cross stage partial structure (SPPCSP) is applied to reduce the number of channels from 1024 to 512.


YOLOv7 Network Model Improvement
The YOLOv7 backbone feature extraction network is a CNN-based network known for its translation invariance and localization capabilities. However, it lacks the ability to model globally and over long distances. To overcome this limitation, the transformer framework, widely used in natural language processing, is introduced to construct the CNN + transformer architecture, forming the COTN module. By integrating the transformer framework, the target detection capabilities are significantly enhanced, especially for detecting small gangue blocks, coal blocks, and densely packed objects on the conveyor. This adaptation is vital to effectively handle the presence of a large number of coal blocks on the conveyor belt. Furthermore, insights from other advancements in the YOLO model are incorporated to further improve the overall performance [35][36][37].
The introduced transformer framework, implemented in this study, incorporates a novel spatial modeling mechanism based on dot-product self-attention. This framework leverages recursive gated convolution and enables higher-order spatial interactions through gated convolution and recursive design. Consequently, it offers a high degree of flexibility and customizability. The YOLOv7 model's substantial number of parameters and computational resources enables compatibility of the CNN + transformer architecture with various convolutional variants. Furthermore, it allows second-order interactions in self-attention to be extended to arbitrary orders without incurring additional computational overhead.
The improvements to the YOLOv7 model are depicted in Figure 2. Specifically, in the initial segment of the backbone, enhanced performance is achieved by reducing the number of convolutional layers by half. This reduction decreases the overall number of network layers and the spatial size of the model, resulting in a shorter model running time. These modifications give rise to the YOLOv71 network structure. Table 1 presents a comprehensive comparison of the number of layers and running time between the two model structures.
To enhance the learning ability for small target detection, a small-sized detection layer is introduced in the head section. The inclusion of the COTN module further enhances the capabilities of the model. The DIoU border regression loss function is employed to calculate the overlap between the predicted frame and the ground truth frame, enabling the identification of gangue and foreign objects based on this overlap. Additionally, a dual-path attention mechanism is incorporated to improve recognition accuracy.
In this study, it was observed that approximately 70% of the gangue particles had a size greater than 50 mm, indicating a prevalence of large gangue. The remaining 30% of the gangue particles had a size less than 50 mm, representing the category of small gangue. The coal mine environment poses challenges in detecting smaller objects during target detection. To account for the variations in particle sizes encountered in real-world scenarios, the gangue used in the experiment encompassed both large and small gangue.
This study presents the process of gangue and foreign matter detection in coal using an improved YOLOv7 network model for a gangue selection robot, as shown in Figure 3. An industrial camera was used to capture images of coal, gangue, and foreign matter. These images underwent classification, annotation, and augmentation to generate a dataset. The dataset was then utilized to train and test the improved YOLOv7 network model, enabling the determination of optimal model weights. Subsequently, the obtained model weights were applied to the improved YOLOv7 network model to facilitate the detection and identification of gangue and foreign objects in coal on a belt conveyor.

Re-Parameterization
Two convolutional layers were incorporated into the batch normalization (BN) layer in the head section. This reparameterization of the three components is illustrated in Figure 4. The convergence equation for Conv and BN can be expressed as follows:

y = γ · (w ∗ x + b − m) / v + β (1)

where w denotes the convolutional weight, b denotes the convolutional bias, γ and β denote the parameters that can be learned in BN, m denotes the input mean in BN, and v denotes the input standard deviation in BN. The fused weight and bias are defined as Ŵ = γw / v (2) and b̂ = γ(b − m) / v + β (3). Equations (2) and (3) are combined and incorporated into Equation (1), resulting in the derivation of the new fusion Equation (4):

y = Ŵ ∗ x + b̂ (4)
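As a hedged illustration of this Conv + BN fusion, the following PyTorch sketch folds a `BatchNorm2d` layer into the preceding convolution using the fused weight Ŵ = γw/v and fused bias b̂ = γ(b − m)/v + β. The helper name `fuse_conv_bn` and the layer sizes are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an (eval-mode) BatchNorm layer into the preceding convolution."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    # v: running standard deviation; gamma, beta: learnable BN parameters
    v = torch.sqrt(bn.running_var + bn.eps)
    gamma, beta, m = bn.weight, bn.bias, bn.running_mean
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    with torch.no_grad():
        # W_hat = gamma * W / v ; b_hat = gamma * (b - m) / v + beta
        fused.weight.copy_(conv.weight * (gamma / v).reshape(-1, 1, 1, 1))
        fused.bias.copy_(gamma * (b - m) / v + beta)
    return fused
```

In eval mode the fused convolution produces the same output as the Conv + BN pair in a single layer, which is the source of the speed-up discussed below.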
The YOLOv7 network model structure consists of 415 layers, contributing to a significant number of network layers. However, this extensive layer count led to prolonged training and recognition time, creating challenges in achieving fast recognition of foreign matter in coal on a rapidly moving belt conveyor. To tackle this issue, a reparameterized convolution process was introduced before image output. This process enhanced the model's running speed, resulting in improved recognition speed while maintaining consistent model performance.

COTN Module
COTN utilized the transformer framework to replace the convolution in ResNet, serving as the backbone of the network. This replacement involves the utilization of a 1 × 1 convolution, enabling the seamless integration of contextual information mining and self-attentive learning within a unified architecture. Through the enhancement of self-attention, COTN enables efficient learning of contextual information, leading to improved expressiveness of the output features.
While the transformer framework demonstrates strong global modeling capability for long-distance interactions, it primarily calculates the attention matrix based on the interaction between query and key, neglecting the connection between adjacent keys. To overcome this limitation, a 3 × 3 convolution was applied to the key to model static contextual information. This convolution operation, illustrated in Figure 5, captures localized information. The key was combined with the modeled query and context information using the COTN module. Following this, two consecutive 1 × 1 convolutions were employed to generate dynamic contexts through self-attention. Finally, the static and dynamic context information was fused together to produce the output.
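The static/dynamic context pipeline described above can be sketched as a compact PyTorch module: a 3 × 3 convolution models static context over the keys, two consecutive 1 × 1 convolutions derive a dynamic attention map from the concatenated key context and query, and the two contexts are fused. This is a simplified approximation under stated assumptions (channel-wise softmax attention instead of the full local-grid attention of the published CoT block; the class name `CoTAttention`, the group count, and the reduction ratio are hypothetical choices).

```python
import torch
import torch.nn as nn

class CoTAttention(nn.Module):
    """Simplified sketch of a Contextual Transformer (CoT) style block."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # static context: grouped 3x3 convolution over the keys
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2,
                      groups=4, bias=False),
            nn.BatchNorm2d(dim), nn.ReLU())
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False), nn.BatchNorm2d(dim))
        # two consecutive 1x1 convolutions produce the dynamic attention
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, dim // 4, 1, bias=False),
            nn.BatchNorm2d(dim // 4), nn.ReLU(),
            nn.Conv2d(dim // 4, dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k_static = self.key_embed(x)                # static context from keys
        v = self.value_embed(x)
        a = self.attn(torch.cat([k_static, x], 1))  # attention from [K, Q]
        k_dynamic = a.softmax(dim=1) * v            # dynamic context
        return k_static + k_dynamic                 # fuse static + dynamic
```
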

Margin Regression Loss Function
In the target detection process, the target bounding box is commonly represented by four variables (i.e., x, y, w, h). In this study, the predicted frame was denoted as (x1, x2, x3, x4) and the ground truth frame was denoted as (x1′, x2′, x3′, x4′). Notably, the intersection over union (IoU) loss treats both large and small bounding boxes equally. When dealing with images of varying resolutions (i.e., different bounding box sizes), the prediction box obtained by l2 loss is more influenced by the size of the ground truth bounding box. Conversely, the prediction box obtained with IoU loss is less affected by the ground truth bounding box's size, leading to improved robustness. Compared to l2 loss, IoU loss directly measures the overlap between two bounding boxes, resulting in faster convergence speed and enhanced detection performance. Additionally, IoU loss demonstrates scale invariance, meaning it performs consistently well for both large and small bounding boxes.
This natural normalized loss function enhances the model's ability to handle multiscale images. The IoU value is 1 when the prediction perfectly matches the ground truth, and 0 when there is no overlap between the prediction and the ground truth. IoU loss approaches positive infinity as IoU approaches 0 and decreases monotonically from positive infinity to 0 as IoU increases within the interval [0, 1]. IoU loss can thus be viewed as a form of cross-entropy loss that quantifies the dissimilarity between the predicted and ground truth bounding boxes.
The l2 border regression loss can be calculated as follows:

L_l2 = ‖Prediction − Truth‖²

The IoU border regression loss can be calculated as follows:

L_IoU = −ln( Intersection(Prediction, Truth) / Union(Prediction, Truth) )

where Prediction denotes the prediction box for the detection target, and Truth denotes the detection target truth box.
Figure 6 illustrates the data calculations for the predicted and ground truth boxes in relation to each detection target. The red boxes represent the ground truth boxes, while the green boxes correspond to the predicted boxes generated by the model. In this study, anchor rods and I-beams are selected as representative foreign matter. When Prediction and Truth do not intersect, the IoU value is 0; this value does not capture the spatial distance between the two frames, rendering the loss function non-differentiable, so IoU loss cannot effectively optimize scenarios where the boxes do not intersect. Additionally, assuming fixed sizes for both Prediction and Truth, their IoU value remains the same regardless of the specific intersection pattern between the boxes. Consequently, IoU values alone fail to provide information about the actual characteristics of the intersection. To overcome these limitations, DIoU loss is introduced as a border regression loss function, as depicted in Figure 7. DIoU loss considers the distance between the prediction and target boxes, offering a more comprehensive optimization measure, and can be calculated as follows:

L_DIoU = 1 − IoU + R(B, B_gt),  R(B, B_gt) = ρ²(B, B_gt) / c²

where B denotes the Prediction box, B_gt denotes the Truth box, R(B, B_gt) denotes the added penalty term, ρ(B, B_gt) denotes the Euclidean distance between the center points of the two boxes, and c² denotes the square of the diagonal length of the minimum enclosing box. The DIoU loss introduces the direct Euclidean distance between the two boxes as a penalty term, leading to a faster convergence rate compared to generalized intersection over union (GIoU) loss. Furthermore, DIoU loss takes into account the relative proportions of the rectangular boxes in the penalty term, which helps resolve the case in which Prediction and Truth do not intersect.
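A minimal sketch of the DIoU computation for axis-aligned boxes in (x1, y1, x2, y2) format, assuming the standard DIoU formulation (1 − IoU plus the normalized squared center distance); the function name `diou_loss` is illustrative, not the paper's implementation.

```python
import torch

def diou_loss(pred: torch.Tensor, truth: torch.Tensor) -> torch.Tensor:
    """DIoU loss for boxes given as (..., 4) tensors in (x1, y1, x2, y2)."""
    # intersection area
    lt = torch.max(pred[..., :2], truth[..., :2])
    rb = torch.min(pred[..., 2:], truth[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (truth[..., 2] - truth[..., 0]) * (truth[..., 3] - truth[..., 1])
    iou = inter / (area_p + area_t - inter)
    # rho^2: squared distance between box centers (the DIoU penalty numerator)
    cp = (pred[..., :2] + pred[..., 2:]) / 2
    ct = (truth[..., :2] + truth[..., 2:]) / 2
    rho2 = ((cp - ct) ** 2).sum(-1)
    # c^2: squared diagonal of the minimum enclosing box
    elt = torch.min(pred[..., :2], truth[..., :2])
    erb = torch.max(pred[..., 2:], truth[..., 2:])
    c2 = ((erb - elt) ** 2).sum(-1)
    return 1 - iou + rho2 / c2
```

Unlike plain IoU loss, the penalty term remains informative (and non-zero) even when the two boxes do not overlap at all.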

Attention Mechanism
The attention mechanism in artificial neural networks is inspired by the information acquisition behavior of the human brain. When humans gather information through their senses, the presence of numerous stimuli can create distractions, making it challenging to focus on the desired information. To solve this problem, the brain employs specialized processing units that facilitate effective processing and monitoring of relevant information. In the context of artificial neural networks, the attention mechanism aims to replicate this behavior by assigning specific weights to different targets during the feature extraction process. This allows the neural network to prioritize and emphasize important targets or regions of interest while suppressing or disregarding irrelevant or uninformative regions.
The attention mechanism, as shown in Figure 8, involves the output of a green square region denoted as D(tx, ty, t1). In this representation, the central coordinates are represented by tx and ty, while half of the side length of the square region is represented by t1. The upper left coordinates are given by (tx − t1, ty − t1), and the lower right coordinates are given by (tx + t1, ty + t1). This square region serves as the core region that captures image category features under the attention mechanism. The attention extraction network is composed of two main parts. The first part, denoted as 'a', represents the features extracted by the attention mechanism. This component focuses on capturing important information within the region of interest. The second part, denoted as 'c', represents the features extracted through the convolution operations. This component is responsible for capturing features at different scales and resolutions within the image.

Configuration of Experimental Environment
The models in this study were trained and tested using the PyTorch open-source framework, renowned for its flexibility and versatility. The experiments were conducted on a Windows server running Windows Server 2019 with a 64-bit Windows 10 operating system. The central processing unit (CPU) employed was an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz. To expedite the training and testing processes, a graphics processing unit (GPU) was utilized; specifically, an NVIDIA Tesla T4 (8 GB) was chosen for its computational prowess. The deep learning environment was established using Conda, with the following specifications: Python 3.9.0, torch 1.10.0, torchvision 0.11.0, torchaudio 0.10.0, and CUDA 10.2. As shown in Table 2, these hardware and software configurations were selected to ensure optimal performance and compatibility throughout the training and testing phases of the models. To comprehensively analyze the performance of the YOLOv71 + COTN network model, two additional modules were introduced: the simulated attention mechanism (simAM) and background object training (BOT). These modules played a crucial role in facilitating a longitudinal comparison during both the model training and testing phases. To provide a cross-sectional comparison, the dataset was also used to train and test the YOLOv5 model. The pre-training parameter settings for each model are detailed in Table 3, which helps to establish a solid foundation for the subsequent experiments and evaluations.

Dataset Creation
In the testing phase, a variety of targets were selected, including coal blocks, gangue blocks, anchor rods, and I-beams. These targets were deliberately chosen to encompass a wide range of physical characteristics: gangue blocks closely resemble coal blocks in their physical form, anchor rods exhibit a slender shape, and I-beams possess a flat surface. Furthermore, both anchor rods and I-beams are primarily composed of iron. To acquire the necessary images for testing, an industrial camera was employed. Each detection target was photographed, and the dataset was subsequently augmented using techniques such as rotation, mirroring, panning, brightness adjustment, and Gaussian blurring. The efficacy of these augmentation techniques can be observed in Figure 9. The objective of this augmentation process was to more accurately simulate the real-world conditions of each detection target within a coal mine environment and to enhance the dataset's representativeness.
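The geometric and photometric augmentations listed above reduce to simple array operations. The sketch below illustrates three of them on a grayscale image represented as a list of pixel rows; a real pipeline would use a library such as OpenCV or torchvision, and the function names here are illustrative:

```python
def mirror(img):
    """Horizontal mirroring: reverse each pixel row."""
    return [row[::-1] for row in img]

def rotate90(img):
    """Rotate 90 degrees clockwise: reverse the row order, then transpose."""
    return [list(row) for row in zip(*img[::-1])]

def adjust_brightness(img, delta):
    """Brightness adjustment: shift every pixel, clipped to [0, 255]."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]

# A tiny 2x2 "image" for demonstration.
img = [[10, 20],
       [30, 40]]
```

Applying each function to new copies of the image (rather than chaining them) is what multiplies the dataset size, which is the point of the augmentation step described above.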

Figure 9 shows augmented samples for each target category: coal, gangue, bolt, and I-beam.

The targets in the images were manually annotated using an open-source labeling tool. The dataset was organized into four categories: bolt, coal, gangue, and I-beam. Each target was assigned a numerical label: 0 for anchor, 1 for coal block, 2 for gangue block, and 3 for I-beam. The annotation process involved carefully marking the targets within the images to ensure accurate labeling based on human visual perception. The annotation results were saved in visual object classes (VOC) format, with the corresponding XML files stored in a predetermined folder. Great care was taken to label all visible targets while avoiding the inclusion of ambiguous ones. This approach prevents unannotated targets from being mistaken for negative samples, thus preserving the algorithm's ability to effectively distinguish between positive and negative samples.
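The VOC annotation files mentioned above are plain XML. The following stdlib-only sketch shows the general shape of one record; the element names follow the common VOC layout, and the concrete file name and box coordinates are illustrative, not taken from the actual dataset:

```python
import xml.etree.ElementTree as ET

def make_voc_annotation(filename, width, height, objects):
    """Build a minimal VOC-style XML annotation string.

    objects: list of (label, xmin, ymin, xmax, ymax) tuples.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    for label, xmin, ymin, xmax, ymax in objects:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"),
                            (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(root, encoding="unicode")

# One hypothetical gangue box in a 640 x 640 image.
xml_text = make_voc_annotation("0001.jpg", 640, 640,
                               [("gangue", 120, 80, 260, 210)])
```

Annotation tools emit richer records (pose, truncation, difficulty flags), but this is the subset a detection training pipeline typically consumes.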

Analysis of Experimental Results
When the network structure of the YOLOv7 network model is fixed, the perceptual area of the network is predetermined. The resolution of the input image plays a critical role in determining the proportion of the perceptual area within the image: higher-resolution images result in a lower percentage of perceptual area, which reduces the effectiveness of capturing local information for predicting objects at different scales and can therefore decrease detection accuracy. Variance in input image size thus has a significant impact on the model's detection performance. The underlying network often generates feature maps that are 10 times smaller than the original image, so features of small targets, particularly relatively small gangue blocks on the belt conveyor, may not be adequately captured by the detection network. To mitigate these challenges, the input images in the experiments were resized to a standardized size of 640 × 640, as indicated in Table 4. Note that the resized images did not exceed the size of the original images in the dataset. This standardization of image size enhances, to a certain extent, the robustness of the detection model to variations in target size.

The number of parameters in a network model refers to the total size, in bytes, of each network layer; it quantifies the amount of video memory occupied by the model. In this study, the spatial complexity of the model is evaluated based on the number of parameters, which provides insight into the model's size and memory requirements. The time complexity of the model, on the other hand, is evaluated using the amount of computation required: the computation volume measures the duration of the model's detection process and is expressed as the number of floating-point operations per second. Additionally, the GPU runtime memory represents the amount of server space occupied by the model when running.
It reflects the memory usage during the model's execution on the GPU and is a crucial consideration for ensuring efficient utilization of server resources. The YOLOv71 network model, derived from the YOLOv7 network model, exhibits a reduction in both network layers and parameters compared to the original YOLOv7 network model; specifically, the YOLOv71 network model has 22 fewer layers and 1,701,083 fewer parameters, and its GPU memory usage is reduced by 0.15 G. Interestingly, despite the YOLOv71 model having fewer layers and parameters, the YOLOv5 network model occupies an additional 3.93 G of space when compared to it. To further analyze the operating parameters, the additional modules simAM, BOT, and COTN were incorporated into the models; among these, the COTN module demonstrated superior performance compared to the simAM and BOT modules.

Table 5 presents the detection results for each detection target in the new improved model, including precision, recall, and mean average precision. Upon analyzing the results in Table 5, it is evident that anchor rods achieve higher identification accuracy than coal and gangue targets. This difference in performance can be attributed to the similarity in shape and appearance between gangue and coal, which poses challenges for accurate discrimination. Additionally, the I-beam, characterized by a larger surface area and a tendency to accumulate coal dust, visually resembles coal, thereby impeding precise recognition. Consequently, the recognition performance for gangue and I-beams is relatively poorer than for other targets. The YOLOv71 + simAM network model demonstrates improved recognition specifically for coal when compared to the YOLOv71 network model; however, the overall detection results for each target in the YOLOv71 network model are lower.
On the other hand, the YOLOv71 + BOT network model exhibits enhanced recognition capabilities specifically for anchor rods following the model's improvements. Notably, the YOLOv71 + COTN network model demonstrates improvements in accuracy, recall, and average accuracy for each detection target when compared to the YOLOv71 model.
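The 640 × 640 standardization discussed above is commonly implemented as a letterbox resize: the image is scaled by its limiting dimension and the remainder of the canvas is padded. A minimal sketch of that computation follows, assuming a symmetric padding convention (implementations differ in how the border is split):

```python
def letterbox_params(w, h, target=640):
    """Compute the scale and symmetric padding that fit a w x h image
    into a target x target canvas without distorting the aspect ratio."""
    scale = min(target / w, target / h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_x = (target - new_w) // 2   # left/right border
    pad_y = (target - new_h) // 2   # top/bottom border
    return scale, (new_w, new_h), (pad_x, pad_y)

# A 1280x720 frame is scaled by 0.5 to 640x360 and padded 140 px top/bottom.
scale, size, pad = letterbox_params(1280, 720)
```

Because the scale and padding are known, predicted boxes can be mapped back to the original image coordinates by inverting the same two operations.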
The box metric represents the mean value of the bounding box regression loss, where a smaller value indicates more accurate localization. The objectness metric represents the mean value of the target detection loss, with a smaller value indicating more accurate detection of targets. The classification metric represents the mean value of the classification loss, where a smaller value indicates more accurate classification. The comparison results presented in Figure 10 illustrate that the YOLOv71 + COTN network model achieves lower loss values on the training set than the other models; the data curve exhibits a relatively flat trend with a general decrease in values. These findings indicate that the YOLOv71 + COTN model outperforms the other three models in terms of detection accuracy. The mAP0.5 metric represents the mean accuracy value at a threshold of 0.5, while the mAP0.5:0.75 metric represents the mean accuracy values for thresholds ranging from 0.5 to 0.75 with an interval of 0.05. Upon comparing the information presented in Figure 11c,d, it is evident that the YOLOv71 + COTN network model achieves the highest mean accuracy for the detection of gangue and foreign matter at the specified thresholds.
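The DIoU-based box loss adopted in this work penalizes both the lack of overlap and the normalized distance between box centers. A minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) form, following the standard DIoU definition (loss = 1 − IoU + d²/c², where d is the center distance and c the diagonal of the smallest enclosing box):

```python
def diou_loss(box_a, box_b):
    """DIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
       + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest enclosing box.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2
    return 1.0 - iou + (d2 / c2 if c2 > 0 else 0.0)
```

Unlike plain IoU loss, the distance term still provides a gradient when the predicted and ground-truth boxes do not overlap, which is why DIoU converges faster for poorly initialized predictions.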
The simAM, BOT, and COTN modules were integrated into the YOLOv71 network model to conduct ablation experiments and evaluate their impact on performance. The test results for each network model are shown in Table 6. It is evident that the network models with the additional modules exhibit notable improvements in accuracy and recall compared to the YOLOv7 network model. Among the three modules, the COTN module is the most effective, achieving an identification accuracy of 91.3%.

Precision represents the accuracy of positive predictions, while recall represents the proportion of positive samples correctly recalled by the classifier; it describes how well the classifier identifies positive instances. The comparison results depicted in Figure 11a,b show that the YOLOv71 + COTN network model exhibits higher accuracy and recall values than the other models during training. These results suggest that the YOLOv71 + COTN model achieves better recognition performance, with improved accuracy in identifying positive instances and a higher recall rate for true positive examples.
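The precision and recall figures reported above follow the usual definitions over true positives, false positives, and false negatives; a minimal sketch (the counts in the usage line are illustrative, not taken from the experiments):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative detection counts for one class.
p, r = precision_recall(tp=90, fp=10, fn=20)
```

mAP extends these quantities by integrating precision over recall at each IoU threshold and averaging across classes, which is what the mAP0.5 and mAP0.5:0.75 curves in Figure 11 report.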
Figure 12 provides a visual representation of the detection results obtained by each network model for the same target. Analysis of the model output images reveals a clear pattern: as the network model undergoes continuous improvement, there is a noticeable enhancement in the accuracy and reliability of target detection. Of particular note is the performance of the new YOLOv71 + COTN model, which achieves the highest confidence for the target, reaching 90% when the target is clearly visible.

Figure 13 represents the models' performance in identifying targets during the testing phase. Each target is assigned a numerical label: 0 for anchor, 1 for coal, 2 for gangue, and 3 for I-beam. Figure 13 reveals that both the YOLOv7 and YOLOv71 network models exhibit recognition errors in the randomly outputted detection images. Specifically, the YOLOv7 network model has two instances of recognition errors, while the YOLOv71 network model has one instance of recognition error. However, with the incorporation of the new module, the network model demonstrates a notable absence of recognition errors.

Conclusions
This study presents a novel approach for detecting gangue and foreign matter in coal using an improved YOLOv7 network model, specifically tailored to achieve accurate and reliable detection in coal samples. To enhance the performance of the YOLOv7 network model, several key improvements were introduced. First, the number of convolutional layers in the backbone was halved, resulting in a more streamlined and efficient detection process; this reduction not only accelerated detection but also enhanced the overall detection efficiency for gangue and foreign matter. Furthermore, a small-size detection layer was incorporated into the head to enhance the model's ability to detect and classify small targets; by specifically focusing on small-sized objects, the model improves its accuracy in identifying and localizing these challenging targets. Additionally, the COTN module further enhanced detection accuracy by refining the model's feature representation, ultimately improving its ability to discriminate between different classes of gangue and foreign matter. To calculate the overlap between the predicted and real frames, the DIoU bounding box regression loss function was employed; this loss function provides a comprehensive measure of the intersection between the predicted and ground truth bounding boxes, and by considering both the spatial distance and the proportional characteristics of the boxes, the model can effectively identify gangue and foreign matter based on the calculated overlap. Finally, to improve recognition accuracy, a dual-path attention mechanism was integrated into the model, allowing it to selectively focus on relevant features while suppressing irrelevant or distracting information.
By effectively attending to important regions of the input, the model achieved enhanced recognition accuracy and robustness.
This research proposes an improved YOLOv7 network model specifically designed for the identification of gangue and foreign matter in coal, with a focus on its application in gangue sorting robots. The proposed model, YOLOv71 + COTN, was trained and evaluated using a dedicated dataset to ensure accurate identification of gangue and foreign matter. The experimental results revealed significant improvements over the original YOLOv7 network model: precision was improved by 3.97%, recall was increased by 4.4%, and mAP0.5 was improved by 4.5%. By reducing the number of parameters, the model's efficiency was optimized, resulting in reduced GPU memory requirements during operation. This reduction in memory consumption contributes to improved gangue selection speed and recognition accuracy, further enhancing the overall performance of the model. As a result, the YOLOv71 + COTN network model is well-suited for integration into belt conveyor gangue sorting robots.