Driver distraction detection based on lightweight networks and tiny object detection

Abstract: Real-time and efficient driver distraction detection is of great importance for road traffic safety and assisted driving. The design of a real-time lightweight model is crucial for in-vehicle edge devices that have limited computational resources. However, most existing approaches focus on lighter and more efficient architectures and ignore the loss of tiny target detection performance that comes with lightweighting. In this paper, we present MTNet, a lightweight detector for driver distraction detection scenarios. MTNet consists of a multidimensional adaptive feature extraction block and a lightweight feature fusion block, and uses an IoU-NWD weighted loss function, all designed with the accuracy of tiny target detection in mind. In the feature extraction component, a lightweight backbone network is combined with four attention mechanisms integrated across the kernel space, which raises the performance limits of the lightweight network. The lightweight feature fusion module is designed to reduce computational complexity and memory access, and it improves the interaction of channel information through lightweight operators. Additionally, the CFSM and EPIEM modules are employed to minimize redundant feature map computations and strike a better balance between model weight and accuracy. Finally, the IoU-NWD weighted loss function is formulated to enable more effective detection of tiny targets. We assess the performance of the proposed method on the LDDB benchmark. The experimental results demonstrate that our proposed method outperforms multiple advanced detection models.


Introduction
Object detection plays a crucial role in monitoring driver distraction. Since the introduction of AlexNet in 2012 [1], deep convolutional neural networks have been widely adopted in computer vision [1,2]. In pursuit of higher accuracy on major benchmarks, these networks have become deeper, with more parameters and more complex structures, at the cost of increased computation. Today, one-stage object detectors [3,4] are preferred in real-time applications because of their excellent balance of speed and accuracy. The YOLO series, particularly YOLOv5 [5], is the most prominent architecture among one-stage detectors. It achieves an excellent trade-off between accuracy and speed on the COCO dataset and is widely adopted in industry. In the context of driver distraction monitoring [6,7], real-time computation is essential. The model must be deployed on in-vehicle edge devices, which have limited CPU performance and memory. To run real-time algorithms on these low-power devices, the available computing power must be fully exploited and the algorithms themselves must be lightweight. The emergence of lightweight networks has focused attention on the trade-off between speed and accuracy. The challenge lies in making the network more lightweight and compact for easy deployment while preserving accuracy. In addition, as network optimization targets acceleration, the accuracy of detecting challenging objects such as tiny ones must also be improved.
Our contributions to addressing these concerns are as follows:
1) We propose a deep learning-based method for monitoring driver distraction that improves the detection of small targets while reducing model weight and computational complexity.
2) We propose a multidimensional adaptive feature extraction block (MAFEB) that integrates four attention mechanisms across the kernel space into the network using a parallel strategy to raise its performance bounds. Additionally, we reduce the model parameters and storage space by fusing convolutional and batch normalization layers, resulting in a lighter model and faster network inference.
3) We introduce the Lightweight Feature Fusion Block (LFFB), which consists of a CFSM module and an EPIEM module, to effectively reduce computational complexity and memory access without significant loss of accuracy.
4) We propose the IoU-NWD weighted loss function to address the degraded detection capability caused by the sensitivity of the IoU metric to position deviations of small targets. This loss function aims to improve the overall performance of the detector on tiny targets.
The remainder of this paper is structured as follows: Section 2 provides an overview of the related work. Section 3 presents a detailed description of the proposed method. Section 4 presents the experimental results and corresponding discussion. Lastly, Section 5 concludes the paper.

Driver distraction detection
Inspired by the advancements in deep learning for object detection, several convolutional neural network (CNN) models have recently been proposed for driver distraction detection. J. V. Abdul et al. [8] used a CNN-based approach to improve the accuracy of classifying distracted drivers by studying driver actions from an image dataset. C. Huang et al. [9] introduced a hybrid CNN framework (HCF) that combines ResNet50, Inception V3 [10] and Xception [11] to achieve high accuracy in classifying distracted drivers' behavior through deep learning-based image feature processing. S. Faiqa et al. [12] explored the EfficientDet architecture, which uses EfficientNet as the backbone network, the BiFPN architecture for feature extraction and a class and box prediction network for classifying and localizing distracted drivers. L. N. Duy et al. [13] proposed a lightweight CNN architecture for driver behavior identification. The architecture leverages standard and depthwise separable convolutions, adaptive connectivity for feature extraction and a CBAM attention mechanism to focus on salient features. The number of network parameters is reduced while classification accuracy improves. Furthermore, Zhu et al. [38] introduced the MT-DTA model, which combines an autoencoder with an attention model in a cascade structure of CNNs.

Efficient CNNs
SqueezeNet [14] proposed Fire modules, which consist of Squeeze and Expand operations, to compress the model and make the network lightweight. MobileNet [15-17] uses depthwise separable convolutions instead of standard convolutions to maintain high accuracy while reducing the number of parameters. This approach provides guiding principles for the design of subsequent lightweight networks. ShuffleNet [18,19] is specifically designed for mobile devices with limited computational power. To simplify the model and reduce the computational cost of pointwise convolutions, it employs pointwise group convolutions and channel shuffling. Xception [11] replaces the traditional Inception [10,20-22] blocks with depthwise separable convolutions to fully decouple cross-channel correlation from spatial correlation.
The EfficientNet family [23,24] introduces a new baseline network and scales the three dimensions of depth, width and resolution with a simple and efficient compound coefficient. GhostNet [25] introduces an inexpensive linear operation to generate more feature maps.
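To make the parameter savings of depthwise separable convolution concrete, the following minimal sketch counts weights for a standard convolution and for its depthwise separable replacement. The layer sizes (64 input channels, 128 output channels, 3 x 3 kernel) and the function names are hypothetical values chosen for illustration and are not taken from any of the cited networks; biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Number of weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 128 channels with a 3 x 3 kernel.
std = standard_conv_params(64, 128, 3)
dsc = depthwise_separable_params(64, 128, 3)

# The ratio simplifies to 1/c_out + 1/k**2.
ratio = dsc / std
print(std, dsc, round(ratio, 4))
```

For this layer the separable version needs roughly 12% of the weights, which is why it is the default building block in lightweight backbones.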

Tiny object detection
Object detection has long been an important research direction in computer vision and pattern recognition. Tiny object detection is particularly challenging because small objects carry little detailed information and may even disappear in the deeper layers of a network. To address this problem, researchers have improved small target detection through the loss function. J. He et al. [26] designed alpha-IoU, a formulation that generalizes existing IoU-based losses to a new family of power IoU losses. By modulating the power parameter alpha, alpha-IoU provides the flexibility to achieve different degrees of bounding box regression accuracy when training the detector. Researchers have also improved the feature pyramid network to detect small targets more accurately. C. Deng et al. [27] proposed the extended feature pyramid network (EFPN) with additional high-resolution pyramid layers, using a feature texture transfer (FTT) module to extract both super-resolution features and plausible regional details. Other researchers have designed dedicated detectors for small targets. X. Yang et al. [28] designed a sampling fusion network that incorporates multilayer features and effective anchor sampling to improve sensitivity to small targets. They also explored a supervised pixel attention network together with a channel attention network for small and cluttered target detection, suppressing noise and highlighting target features. Zhu et al. [39] aim to make fuller use of multimodal information by fusing features that have special meaning and importance.

Overview
Generally, a convolutional neural network (CNN) detector comprises three major components: the backbone, the neck and the head. The backbone extracts features from the input, while the neck improves the assignment and fusion of these features before they are passed to the head. The head then detects objects using the features from the neck. Figure 1 shows the framework of our proposed driver distraction detection method. This framework mainly consists of a multidimensional adaptive feature extraction block and a lightweight feature fusion block.
The multidimensional feature extraction block uses lightweight convolutional networks based on channel-sparse convolution as the backbone. Additionally, we introduce the multidimensional adaptive module (MDAM) to give the depthwise separable convolutional kernel dynamic properties. This module enables the dynamic extraction of feature information at varying scales and dimensions while keeping the computational effort low. The Channel Feature Shuffle Module (CFSM) uses depthwise separable convolution and channel shuffling to facilitate information interaction between channel-dense convolution and depthwise separable convolution. This approach effectively reduces the model's weight while maintaining accuracy.
The lightweight feature fusion block consists primarily of the CFSM and the Efficient Partial Information Extraction Module (EPIEM). These modules aim to decrease the computational complexity and the number of memory accesses required for feature fusion across different scales. EPIEM reduces cost by exploiting feature map redundancy. It employs partial convolution to reduce redundant computations and frequent memory accesses, resulting in more efficient extraction of spatial features.
To address the sensitivity of IoU to small deviations in target position, the NWD metric is introduced as a more appropriate measure of similarity between two bounding boxes. This, however, may negatively impact convergence speed. To mitigate this issue, a weighting factor is applied to NWD, which is then integrated into the loss function of the detector.

Multi-dimensional feature extraction block
We incorporate the improved multidimensional adaptive attention into a lightweight convolutional neural network to enhance feature extraction in the multidimensional feature extraction block. This lets us inject features with multidimensional information to improve the detection task. The feature extraction architecture consists of multiple convolutional layers and residual blocks. Within these layers and blocks, MDAM extends the performance capabilities of lightweight CNNs. This is achieved by integrating attention mechanisms into the convolutional blocks, allowing for better adaptation to the detection of small targets.
The implementation details of MDAM are illustrated in Figure 2. This approach incorporates a novel multidimensional attention mechanism, which computes four complementary types of attention $\alpha_{si}$, $\alpha_{ci}$, $\alpha_{fi}$ and $\alpha_{wi}$ for the kernel $W_i$ along four dimensions of the kernel space. These dimensions are the number of convolutional kernels, the spatial size of the convolutional kernel, the number of input channels and the number of output channels. The parallel computation enables the convolution kernel to capture diverse contextual information. Moreover, the adoption of a single kernel space structure in MDAM strikes a better balance between model accuracy and efficiency than existing dynamic convolution designs.
MDAM uses this multidimensional attention mechanism with a parallel strategy to learn complementary attentions for the convolution kernels along all four dimensions of the kernel space in any convolution layer. The four kinds of attention are complementary to each other, and the order in which they are applied does not matter.
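The way four attentions can modulate a kernel along different dimensions is sketched below in plain Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes each attention multiplies the kernel element-wise along its own dimension and that the modulated kernels are summed, and the attention values are fixed hypothetical numbers, whereas in MDAM they would be predicted from the input features. The function name `modulate_kernels` is likewise hypothetical.

```python
def modulate_kernels(kernels, a_w, a_f, a_c, a_s):
    """Combine n kernels of shape [c_out][c_in][k][k] into one modulated kernel.

    a_w[i]       : scalar attention for kernel i
    a_f[i][o]    : attention over output channels (filters)
    a_c[i][c]    : attention over input channels
    a_s[i][u][v] : attention over the k x k spatial positions
    """
    n = len(kernels)
    c_out = len(kernels[0])
    c_in = len(kernels[0][0])
    k = len(kernels[0][0][0])
    out = [[[[0.0] * k for _ in range(k)] for _ in range(c_in)]
           for _ in range(c_out)]
    for i in range(n):
        for o in range(c_out):
            for c in range(c_in):
                for u in range(k):
                    for v in range(k):
                        # Assumed form: product of the four attentions and the weight.
                        out[o][c][u][v] += (a_w[i] * a_f[i][o] * a_c[i][c]
                                            * a_s[i][u][v] * kernels[i][o][c][u][v])
    return out


# Hypothetical toy case: one kernel, 2 output channels, 1 input channel, 1 x 1 size.
kernels = [[[[[2.0]]], [[[4.0]]]]]
modulated = modulate_kernels(
    kernels,
    a_w=[0.5],
    a_f=[[1.0, 0.5]],
    a_c=[[2.0]],
    a_s=[[[1.0]]],
)
print(modulated)
```

Because the four factors are multiplied together, the result is the same whichever attention is applied first, which matches the statement that their order does not matter.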
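As denoted in Figure 3, the MDAM block can be defined as follows:

$$ y = \left( \sum_{i=1}^{n} \alpha_{wi} \odot \alpha_{fi} \odot \alpha_{ci} \odot \alpha_{si} \odot W_i \right) * x, \tag{2.1} $$

where $\alpha_{wi} \in \mathbb{R}$ denotes the attention scalar of the convolution kernel $W_i$; $\alpha_{si} \in \mathbb{R}^{k \times k}$, $\alpha_{ci} \in \mathbb{R}^{C_{in}}$ and $\alpha_{fi} \in \mathbb{R}^{C_{out}}$ denote the attentions computed along the spatial dimension, the input channel dimension and the output channel dimension of the kernel space of $W_i$, respectively; $\odot$ denotes multiplication along the corresponding dimension of the kernel space; and $*$ denotes convolution of the input $x$.
We assume a convolutional kernel $W$ for the convolutional layer. The convolution process performs a sliding window computation on the input feature map using $W$. Taking $w$ as an element of the convolutional kernel $W$, $x$ as an element of the input feature map and $b$ as the bias term, the computation for $w$ and $x$ is as follows:

$$ y = w \cdot x + b. \tag{2.2} $$

The BN layer calculates the mean and variance of the elements in a minibatch of size $m$:

$$ \mu = \frac{1}{m} \sum_{j=1}^{m} y_j, \qquad \sigma^2 = \frac{1}{m} \sum_{j=1}^{m} (y_j - \mu)^2. $$

Next, the mean is subtracted from each element and the result is divided by the standard deviation, with a small constant $\epsilon$ added for numerical stability. Finally, the BN output is obtained by an affine transformation with $\gamma$ and $\beta$:

$$ \hat{y}_j = \frac{y_j - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad z_j = \gamma \hat{y}_j + \beta. $$

At inference time, convolutional layers and BN layers can both be regarded as linear operations, although they differ in their specific implementation and mechanism of action. Consequently, the two can be combined:

$$ z = \gamma \frac{w \cdot x + b - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, $$

from which the fused weight and bias are obtained as:

$$ w' = \frac{\gamma w}{\sqrt{\sigma^2 + \epsilon}}, \qquad b' = \frac{\gamma (b - \mu)}{\sqrt{\sigma^2 + \epsilon}} + \beta, $$

and as a result:

$$ z = w' \cdot x + b'. $$

This operator fusion reduces the amount of computation and the volume of data moved through memory, and it improves the locality of computation, thereby improving efficiency.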
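The fusion of a convolution with the batch normalization layer that follows it can be checked numerically. The sketch below works on a single scalar weight and input for clarity and assumes the BN statistics are the fixed running mean and variance used at inference time; all numeric values are hypothetical.

```python
import math


def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN parameters into the preceding convolution's weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference path: convolution followed by a separate BN step."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


# Hypothetical parameters for one weight element.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

w_fused, b_fused = fuse_conv_bn(w, b, gamma, beta, mean, var)

x = 2.0
unfused = conv_then_bn(x, w, b, gamma, beta, mean, var)
fused = w_fused * x + b_fused
print(unfused, fused)
```

The two paths agree up to floating-point rounding, so the BN layer can be removed after training without changing the output, saving one pass over the feature map.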

Lightweight feature fusion block
Studies in neuroscience have demonstrated that models with a larger number of neurons tend to exhibit stronger nonlinear representation. However, the human brain outperforms these models in both information processing capability and power consumption. Consequently, for current vision-based assisted driving systems, it is possible to reduce computational cost without introducing additional operations and still achieve significant accuracy improvements through lightweight design. Nonetheless, depthwise separable convolution (DSC), although common in lightweight networks, separates channel information during computation. Over-reliance on DSC alone degrades the feature extraction and fusion capabilities of the network and consequently reduces the overall performance of the model. Therefore, enhancing the interaction of inter-channel information is crucial for obtaining richer features during the lightweight optimization of the network.
We introduce the Channel Feature Shuffle Module (CFSM), which comprises a channel-dense standard convolution (Conv) and a channel-information-separated depthwise convolution (DWConv). As depicted in Figure 4, the CFSM receives the input and exchanges information between Conv and DWConv through shuffling. Shuffling is used to uniformly and effectively mix the information generated by Conv and DWConv. First, the feature map is split into two groups, each with half the original number of channels. The two groups are then interleaved along the channel dimension, alternating channels from each group, to create the output feature map. This approach lets Conv's information enhance the model's representation and generalization performance by uniformly exchanging local feature information across channels and fully integrating it into the DSC output.
The visualization in Figure 5 illustrates that the feature maps of different channels exhibit a high degree of similarity during network computation. Many other lightweight optimization schemes [29,30] address this feature redundancy but do not explore how to exploit it simply and efficiently. We reduce both redundant computation and memory access by introducing the Efficient Partial Information Extraction Module (EPIEM) to improve the extraction of spatial features.
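The shuffling step of the CFSM described above can be sketched on channel labels alone. This is a minimal illustration assuming a standard two-group channel shuffle in which the Conv and DWConv outputs are concatenated and then interleaved; the channel names and the function `channel_shuffle` are hypothetical.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels so that consecutive outputs come from different groups."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = n // groups
    # Equivalent to reshaping to (groups, per_group), transposing and flattening.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Hypothetical channel labels for the two branches.
conv_out = ["conv0", "conv1", "conv2"]
dw_out = ["dw0", "dw1", "dw2"]

mixed = channel_shuffle(conv_out + dw_out)
print(mixed)
```

After the shuffle, every pair of adjacent channels contains one dense-convolution channel and one depthwise channel, so the next layer sees information from both branches.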
EPIEM, as depicted in Figure 6, is implemented using partial convolution (PConv) and pointwise convolution. PConv performs a standard convolution (SC) on a subset of the input channels to extract spatial features and leaves the remaining channels unchanged, without interacting with them. The number of channels used for SC is determined by the occupancy factor r = cp/c, typically set to 1/4 to balance computational complexity and the reuse of channel features. Because the cost of a convolution scales with the product of its input and output channel counts, PConv with r = 1/4 requires only r² = 1/16 of the computation (FLOPs) of a standard convolution. To use the information from all channels fully and efficiently, a pointwise convolution (PWConv) is applied after PConv.
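The 1/16 figure follows directly from counting multiply-accumulate operations. The sketch below does this for a hypothetical 40 x 40 feature map with 64 channels and a 3 x 3 kernel; the sizes and function names are illustrative assumptions, and stride 1 with same-size output is assumed.

```python
def conv_flops(height, width, kernel, c_in, c_out):
    """Multiply-accumulate count of a standard convolution (stride 1, same-size output)."""
    return height * width * kernel * kernel * c_in * c_out


def pconv_flops(height, width, kernel, channels, divisor=4):
    """PConv: a standard convolution applied to only channels // divisor channels."""
    c_p = channels // divisor
    return conv_flops(height, width, kernel, c_p, c_p)


def pwconv_flops(height, width, channels):
    """1 x 1 pointwise convolution over all channels."""
    return conv_flops(height, width, 1, channels, channels)


# Hypothetical feature map: 40 x 40 spatial size, 64 channels, 3 x 3 kernel.
h, w, k, c = 40, 40, 3, 64
standard = conv_flops(h, w, k, c, c)
partial = pconv_flops(h, w, k, c)                # r = 1/4
pconv_plus_pw = partial + pwconv_flops(h, w, c)  # PConv followed by PWConv

print(standard, partial, pconv_plus_pw)
```

Even with the pointwise convolution added to mix all channels, the combined cost stays well below that of a single standard convolution for this configuration.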

Weighted IoU-NWD loss function for tiny targets
The proposed detection network is trained with three loss terms, Lcls, Lobj and Lloc. Lcls is the classification loss; it uses the binary cross-entropy (BCE) loss and is computed only on positive samples. Lobj is the objectness loss; it is based on the CIoU between the bounding box predicted by the network and the ground-truth box (GT box) and is computed over all samples. Lloc is the localization loss; it uses the CIoU loss and is computed only on positive samples. The total loss is

$$ Loss = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}, \tag{2.11} $$

where λ1, λ2 and λ3 are the balance coefficients.
Current intersection-over-union (IoU) based metrics, including IoU itself and its extensions, are highly sensitive to positional deviations of small objects. The same slight pixel deviation produces very different changes in IoU for small and large targets, which harms the detection performance of anchor-based detectors. As depicted in Figure 7, each grid cell in the diagram corresponds to one pixel. Box A represents the ground-truth bounding box, whereas boxes B and C represent predicted bounding boxes. For the same positional deviation, the IoU value of a normal-sized object changes very little, whereas the IoU value of a tiny object drops notably. Consequently, an IoU-based loss function is suboptimal for detecting small targets. Drawing inspiration from the Wasserstein-distance-based metric for tiny object detection [31], we propose an IoU-NWD weighted loss function that aims to improve small target detection. This is achieved by tuning the weighting factor α in Eq (2.12), which depends on the proportion of tiny targets in the dataset.
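The sensitivity of IoU to small offsets can be reproduced with a few lines of arithmetic. The sketch below shifts a tiny box and a normal-sized box by the same one-pixel diagonal offset; the box sizes (6 x 6 and 36 x 36 pixels) are hypothetical and are not read from Figure 7.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union


def shifted(box, dx, dy):
    """Return the box translated by (dx, dy) pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Hypothetical ground-truth boxes.
tiny = (0, 0, 6, 6)
normal = (0, 0, 36, 36)

# Same one-pixel diagonal offset applied to both predictions.
iou_tiny = iou(tiny, shifted(tiny, 1, 1))
iou_normal = iou(normal, shifted(normal, 1, 1))
print(round(iou_tiny, 3), round(iou_normal, 3))
```

With these sizes the tiny box loses almost half of its IoU while the normal-sized box stays near 0.9, which is why a fixed IoU threshold penalizes tiny targets disproportionately.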
The bounding box is first modeled as a 2D Gaussian distribution. The similarity between the resulting Gaussian distributions is then measured with the Normalized Wasserstein Distance (NWD) [32]. Unlike IoU, NWD takes the width-to-height ratio of the bounding box into account and is well suited to boxes of different shapes, whereas IoU does not properly account for boxes that differ significantly in their width-to-height ratio. Additionally, the NWD metric is robust to scale variations and can measure the similarity between two distributions even when the bounding boxes do not overlap or overlap only minimally. As a result, NWD significantly outperforms IoU for detecting small objects. To account for the NWD metric's influence on convergence speed, we incorporate it into the Lobj component of the loss function, which improves the ability to detect small targets.
We use the Wasserstein distance from optimal transport theory to calculate the distance between distributions. For two two-dimensional Gaussian distributions µ1 = N(m1, Σ1) and µ2 = N(m2, Σ2), the second-order Wasserstein distance between µ1 and µ2 is defined as:

$$ W_2^2(\mu_1, \mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right). $$

It can be simplified as follows:

$$ W_2^2(\mu_1, \mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \left\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \right\rVert_F^2, $$

where ∥ · ∥F is the Frobenius norm. Furthermore, for the Gaussian distributions Na and Nb modeled from the bounding boxes A = (cxa, cya, wa, ha) and B = (cxb, cyb, wb, hb), the expression above can be further simplified as:

$$ W_2^2(N_a, N_b) = \left\lVert \left[ cx_a, cy_a, \frac{w_a}{2}, \frac{h_a}{2} \right]^{T} - \left[ cx_b, cy_b, \frac{w_b}{2}, \frac{h_b}{2} \right]^{T} \right\rVert_2^2. \tag{2.16} $$

However, $W_2^2(N_a, N_b)$ is a distance metric and cannot be used directly as a similarity metric (i.e., a value between 0 and 1, as IoU is). Therefore, we normalize it in exponential form to obtain a new metric called the Normalized Wasserstein Distance (NWD):

$$ NWD(N_a, N_b) = \exp\left( -\frac{\sqrt{W_2^2(N_a, N_b)}}{C} \right), $$

where C is a constant that is closely related to the dataset. In the following experiments, we empirically set C to the average absolute object size in the dataset, which gives the best performance.
To address the above problem, we formulate the NWD metric as a loss function LNWD and apply it to Lobj.
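A minimal sketch of the NWD computation for two boxes in (cx, cy, w, h) form is given below, together with one possible way of blending it with IoU. The blending function is an assumption made for illustration: the text only states that a weighting factor α balances the two terms, so the convex combination used here, the choice of which term α multiplies, and the value C = 12.8 are all hypothetical.

```python
import math


def wasserstein2_sq(box_a, box_b):
    """Squared 2-Wasserstein distance between Gaussians modeled from two boxes."""
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    return ((cxa - cxb) ** 2 + (cya - cyb) ** 2
            + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)


def nwd(box_a, box_b, c):
    """Normalized Wasserstein Distance in (0, 1]; c is a dataset-dependent constant."""
    return math.exp(-math.sqrt(wasserstein2_sq(box_a, box_b)) / c)


def weighted_similarity(iou_value, nwd_value, alpha):
    """ASSUMED blending: convex combination of IoU and NWD with weight alpha on NWD."""
    return (1.0 - alpha) * iou_value + alpha * nwd_value


# Hypothetical boxes and constant.
C = 12.8
gt = (10.0, 10.0, 6.0, 6.0)
near = (11.0, 11.0, 6.0, 6.0)   # one-pixel diagonal offset, overlaps gt
far = (30.0, 30.0, 6.0, 6.0)    # no overlap with gt, so IoU would be 0

nwd_near = nwd(gt, near, C)
nwd_far = nwd(gt, far, C)
blended = weighted_similarity(iou_value=25 / 47, nwd_value=nwd_near, alpha=0.5)
print(round(nwd_near, 3), round(nwd_far, 3), round(blended, 3))
```

Unlike IoU, the NWD value for the non-overlapping pair is small but non-zero, so it still provides a usable training signal; a loss term can then be taken as one minus the similarity.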

Experiment environment
We employed the PyTorch framework to construct our models. The models were trained on a dual-GPU setup using NVIDIA TITAN RTX or NVIDIA GeForce RTX 4090 cards running on the Windows 10 operating system. The hyperparameters are as follows: the optimizer is stochastic gradient descent; the batch size is 32; a linear decay learning rate schedule is used with an initial learning rate of 0.01 and a cyclic learning rate of the same value; the momentum and weight decay are set to 0.937 and 0.0005, respectively. All validation experiments were conducted on the NVIDIA GeForce RTX 4090. In the tables and figures, AP0.5 denotes the average precision across all categories at an Intersection over Union (IoU) threshold of 0.5. AP denotes the average precision averaged over IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05.
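The relationship between AP0.5 and AP can be made explicit with a short sketch. It assumes the COCO convention of an equally weighted mean over the ten IoU thresholds; the per-threshold AP values are hypothetical and do not come from the experiments.

```python
def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """IoU thresholds from start to stop inclusive, rounded to two decimals."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]


def mean_ap(ap_per_threshold):
    """Equally weighted mean of AP values, one per IoU threshold (assumed convention)."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


thresholds = iou_thresholds()

# Hypothetical AP values, typically decreasing as the threshold gets stricter.
ap_values = [0.90, 0.88, 0.85, 0.80, 0.74, 0.66, 0.55, 0.40, 0.22, 0.05]

ap_50 = ap_values[0]        # AP at IoU = 0.5
ap = mean_ap(ap_values)     # AP averaged over 0.5:0.05:0.95
print(thresholds, ap_50, round(ap, 3))
```

Because AP at stricter thresholds cannot exceed AP at looser ones, the averaged AP is never larger than AP0.5 under this convention.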

Dataset
We used two datasets in our study: the publicly available StateFarm dataset [33] and the non-public Lilong Distracted Driving Behavior (LDDB) dataset [34]. We selected the StateFarm dataset to assess the practical effectiveness of MDAM and various other lightweight architectures as the feature extraction backbone of a detector. This dataset, illustrated in Figure 8, is a driver distraction detection dataset comprising ten categories. It contains 22,424 training images and 67,272 images generated by offline data augmentation techniques such as Gaussian blur, Gaussian noise and CutMix. The LDDB dataset comprises 14,808 videos captured by infrared cameras that record six driving behaviors of 2468 participants. These videos were manually annotated at five frames per second, resulting in a total of 287,804 images. To validate the module's efficacy, we compared it with mainstream object detection methods on the LDDB dataset.

Comparison experiments
To evaluate our proposed method for detecting driver distraction, we compared it with several other detection methods on the LDDB dataset. These methods included YOLOv5s and YOLO variants that use lightweight networks (MobileNet, GhostNet and FasterNet) as the backbone, alongside our proposed MTNet. The evaluation results for these detectors are presented in Table 1. The best performance values are highlighted in bold.
Figure 9 displays the corresponding results for easier comparison. Figure 9(a),(b) depict the performance of the various detectors in terms of accuracy and model weight. In each case, the best performing method is indicated with an asterisk on the respective bar. Our proposed MTNet performs best in the detection of tiny objects on the LDDB dataset, achieving an APS of 36.9%. Additionally, the computational complexity of the model, at 6.5 GFLOPs, is lower than that of the other methods.
The detection performance comparison on the LDDB dataset is visualized in Figure 10. Other methods exhibit false detections and misses on small objects such as cigarette butts. In terms of small object detection, MTNet outperforms the other detection methods.
As the results in the table and figures indicate, the proposed method outperforms the other methods. It achieves higher accuracy than YOLOv5s on the LDDB benchmark, with values of 97.6% AP, 83.7% AP50 and 36.9% APS. Moreover, it reduces the number of model parameters by 1.38M and the computation by 9.3 GFLOPs. The method exhibits a better balance between speed and accuracy than the other lightweight networks used as feature extraction backbones. It performs best in all categories of distracted driving behavior, with notable improvements in small object detection and reduced model computation.
Table 2 compares our proposed multidimensional adaptive module (MDAM) with other lightweight classification networks on the StateFarm dataset. The results indicate that MDAM achieves better classification accuracy and lower computational complexity while maintaining a smaller number of model parameters.

Ablation study
To further validate the effectiveness of each component of our method, namely the multidimensional adaptive feature extraction block, the lightweight feature fusion block and the IoU-NWD weighted loss function, we conducted an ablation study to assess the impact of each component on overall performance.
MobileNetV2 serves as the backbone network of our baseline model. We incrementally add components to assess their effectiveness and compare the performance of the following four detection models:
• Baseline: We simply replace YOLOv5's feature extraction backbone network with the lightweight MobileNetV2. This is the baseline model.
• Baseline+A(MAFEB): We incorporate the multidimensional adaptive feature extraction block into the backbone network of the detector architecture used for driver distraction behavior detection.
• Baseline+A(MAFEB)+B(LFFB): The lightweight feature fusion block is further added to the aforementioned model to reduce the model's overall weight.
• Completed Model: Our proposed complete detector model.
Comparing Baseline with Baseline + A shows the efficacy of incorporating the multidimensional adaptive feature extraction block into the model. Comparing Baseline + A with Baseline + A + B confirms the better balance between speed and accuracy achieved by the lightweight feature fusion block. Comparing Baseline + A + B with the completed model demonstrates the benefit of the IoU-NWD weighted loss function for detecting tiny targets.
Table 3 presents the detection performance of the various models. The table shows that each of the aforementioned components contributes to improving the detection results. Notably, the multidimensional adaptive feature extraction block and the IoU-NWD weighted loss function are the most effective in improving the detection accuracy of small targets. The performance comparison is plotted in Figure 13. Additionally, Figure 14 visualizes the detection results of the different models on the LDDB benchmark in our ablation study.

Conclusions
In this paper, we propose a novel method for detecting driver distraction using deep learning. We combine a lightweight network architecture with optimization techniques for detecting tiny objects. Our approach is built from three functional blocks. First, we propose a feature extraction block based on an improved lightweight network. This block injects four adaptive attentions along the kernel space into the network using a parallel strategy. This enables the extraction of features with multidimensional information and improves efficiency through operator fusion. Second, we design a feature fusion block based on lightweight operators. The purpose of this block is to reduce computational complexity and memory access. Finally, we improve the detection of tiny targets by designing a weighted loss function based on the NWD metric. Our experimental results demonstrate the effectiveness of the key components of our method. Additionally, our proposed method outperforms some state-of-the-art methods on the LDDB and StateFarm benchmarks. For future work, we plan to explore the feasibility of other lightweight optimizations.

Figure 3. Four kinds of attention along the kernel dimension.

Figure 7. Sensitivity of the IoU metric to small target position offsets.

Figure 9. Comparison of the performance of different detectors on the LDDB dataset. The best performing methods are marked with an asterisk.

Figure 10. Visualization of the performance of different detectors on the LDDB dataset.

Figure 11. Performance comparison of different classification methods on the StateFarm dataset. The best performing method is marked with an asterisk.

Figure 12. Performance visualization of different classification methods on the StateFarm dataset.

Figure 13. Comparison of the performance of different detectors on the LDDB dataset in the ablation experiment. The best performing method is marked with an asterisk.

Table 1. Performance data for different detectors on the LDDB dataset.