Real-time Monitoring System for Driver Phone Usage Based on Improved YOLOv5s

In response to the impact of driver violations, such as mobile phone use, on vehicle safety during driving, we propose an improved real-time monitoring algorithm based on YOLOv5s with lightweight optimization. First, we replace the C3 module (CSP Bottleneck with 3 convolutions) in the YOLOv5s backbone network with a lightweight Ghost Module to reduce the model's parameter count and increase detection speed while keeping inference accuracy largely unaffected, thus meeting the requirements of real-time monitoring. Second, we introduce the RepConv (re-parameterizable convolution) module into the Path Aggregation Network (PANet) structure to enlarge the network's receptive field over the input image and further reduce the model's computational load. Experimental results show that the improved network achieves an mAP@0.5 of 95.7%, a detection speed of 140 FPS, and a model size reduced to 10.6 MB, meeting the demand for real-time and reliable detection on embedded devices.


Introduction
The driver's control of the vehicle is crucial to its safe operation, and normal driving behavior greatly contributes to road safety. In actual driving, however, drivers are often seen engaging in violations such as using mobile phones, which poses a significant threat to road safety [1]. According to relevant studies, approximately 25% to 50% of traffic accidents are caused by improper driver behavior, such as using a mobile phone while driving [2]. To prevent drivers from failing to respond promptly to unexpected situations because of phone use, real-time monitoring of driver behavior is particularly important.
Currently, detection algorithms for driver phone usage fall broadly into two types: those based on mobile signal detection and those based on computer vision. Scholars such as Ascariz [3] and Jie Yang [4] proposed identifying whether drivers are using phones through mobile signal detection. However, this approach is susceptible to interference from passengers' mobile signals, leading to high detection error. Wei Minguo [5] and others proposed detecting phones by extracting F-B Error information to obtain facial features; however, this method has low robustness and is easily affected by factors such as lighting, causing the detection algorithm to fail. Wang Dan [6] and other researchers decomposed the action of making a phone call into a series of sub-actions with certain temporal relationships and detected phone use in video through statistical analysis; however, this method is easily disturbed by external factors, leading to detection failures. Wu Chenmou [7] and colleagues proposed a method based on human pose estimation that estimates the three-dimensional coordinates of eight upper-body skeletal nodes and determines phone use through spatial analysis of those coordinates; however, this method may misclassify other postures similar to phone usage as phone call actions.
Given the shortcomings of current mainstream detection methods, such as poor robustness, complex algorithm structures, and slow inference, this paper proposes an approach to real-time detection of driver phone usage. The method focuses on achieving high detection accuracy while also improving detection speed to some extent. This is accomplished through targeted enhancements to the YOLOv5s algorithm, a novel and active research direction [8][9][10]. The proposed improvements replace the original C3 and Conv modules in the backbone network with the more lightweight Ghost-Module and introduce RepConv into the Neck structure. These modifications reduce redundancy in the model's information processing and enlarge its receptive field. Together, they enable the detection method to adapt to complex driving scenarios and achieve better real-time detection performance on embedded devices.

The overall structure of YOLOv5
YOLOv5 is a one-stage object detection algorithm that integrates numerous advantages from the previous YOLO series, demonstrating outstanding accuracy and real-time performance in the field of object detection. It encompasses multiple versions, such as YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, allowing flexible adjustment of the network's depth and width through configuration file tuning. Moreover, YOLOv5 boasts remarkable portability and has been widely applied and deployed in practical production environments. The network structure of YOLOv5 is illustrated in Figure 1.

Loss Function
YOLOv5 employs three types of loss functions, namely classification loss, localization loss, and confidence loss, to assess the algorithm's detection performance. Among them, Intersection over Union (IoU) is a simple function used to calculate the localization loss. It evaluates the matching degree of two bounding boxes by measuring their overlap, after which Non-Maximum Suppression (NMS) filters out the optimal detection results.
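The overlap-then-suppress procedure described above can be sketched as a minimal greedy NMS in plain Python (corner-format boxes; the 0.45 threshold and the epsilon guard are illustrative choices, not the YOLOv5 defaults):

```python
def box_iou(a, b):
    # Boxes given as (x1, y1, x2, y2) corners.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)  # epsilon avoids 0/0

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it by more than iou_thresh, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

For example, two heavily overlapping detections of the same phone collapse to the higher-scoring one, while a distant detection survives.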
However, in practical applications the IoU function has deficiencies that make it difficult to meet the requirements for localization loss in complex environments. Consequently, various improved localization loss formulations have emerged, common ones being Generalized IoU (GIoU), Distance IoU (DIoU), and Complete IoU (CIoU). Among these, CIoU incorporates additional optimization terms and gives a more accurate evaluation.
YOLOv5 adopts the CIoU function as the localization loss of the network model. The expressions for IoU and the CIoU loss are shown in equations (1) and (2):

IoU = |B ∩ B^gt| / |B ∪ B^gt|  (1)

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv  (2)

where B and B^gt are the predicted and ground-truth boxes, ρ(b, b^gt) is the Euclidean distance between their center points b and b^gt, c is the diagonal length of the smallest box enclosing both, v measures the consistency of their aspect ratios, and α is a positive trade-off weight.
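To make the CIoU terms concrete, here is a sketch in plain Python using the standard CIoU definition (corner-format boxes; the epsilon guards and the helper names are illustrative, and this is not the YOLOv5 source implementation):

```python
import math

def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corners.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ciou_loss(box_a, box_b):
    i = iou(box_a, box_b)
    # rho^2: squared distance between the two box centers
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    rho2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    # c^2: squared diagonal of the smallest enclosing box
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # v: aspect-ratio consistency term; alpha: trade-off weight
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - i + v + 1e-9)
    return 1 - i + rho2 / c2 + alpha * v
```

For perfectly matching boxes all three penalty terms vanish and the loss is (numerically) zero; as boxes drift apart or change shape, the distance and aspect-ratio terms grow.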

Improved YOLOv5s Network Model
As a currently highly regarded object detection algorithm, YOLOv5s has made significant progress by integrating numerous optimization strategies. However, due to its large number of parameters, relatively low inference speed, and substantial model file size, real-time detection on embedded devices remains a challenge. Recognizing this issue, this paper designs an improved version of the YOLOv5s detection algorithm, aiming to more effectively meet the real-time requirements of driver behavior detection tasks.

Backbone Network Improvement
For efficient deployment on low-computational-power devices, lightweight and low-latency characteristics are particularly important for convolutional neural networks. Recently, Han and colleagues introduced a new neural network called GhostNet [15]. This network is based on the Ghost-Module and is constructed by stacking Ghost bottlenecks, resulting in an efficient and lightweight network structure. Experiments verified that GhostNet maintains high accuracy while exhibiting superior computational efficiency and inference speed: compared with other models, GhostNet achieves the highest throughput on GPUs and lower latency on CPUs and ARM.
Building upon this foundation, this paper introduces the Ghost bottleneck structure and Ghost-Module into the backbone network. With only a minimal decline in detection accuracy, this significantly enhances the lightweight and low-latency characteristics of the network model.

Ghost-Module and Ghost bottleneck
The Ghost bottleneck, as the core structure of the lightweight GhostNet, demonstrates outstanding performance. Its architecture is built on the Ghost-Module, a "phantom" convolution module. As Figure 3 shows, the Ghost-Module first generates some feature maps through a regular convolution, then applies a cheap operation to those feature maps to produce additional, redundant feature maps; the convolution used in this step is a depth-wise separable convolution. Finally, the feature maps from the regular convolution are concatenated with those from the cheap operation. Stacking Ghost-Modules yields the Ghost bottleneck structure. As shown in the figure, the Ghost bottleneck is similar to the basic residual block in ResNet, incorporating multiple convolutional layers and a shortcut. It primarily consists of two stacked Ghost modules: the first serves as an expansion layer, increasing the number of channels, while the second reduces the number of channels to match the shortcut path. The shortcut then connects the inputs and outputs of these two Ghost modules.
In this paper, Ghost-Modules and Ghost bottleneck structures are introduced into the backbone network of YOLOv5, optimizing the design to create a more lightweight feature extraction network structure.
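The Ghost-Module described above can be sketched in PyTorch. The layer widths, the 1×1 primary convolution, and the 1:1 ratio of intrinsic to ghost maps below are illustrative choices, not the exact GhostNet or modified-YOLOv5s configuration:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost module: a regular convolution produces a few
    intrinsic feature maps, then a cheap depth-wise convolution derives
    the remaining ("ghost") maps, and the two sets are concatenated."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        init_ch = out_ch // ratio      # intrinsic maps from the costly conv
        cheap_ch = out_ch - init_ch    # ghost maps from the cheap operation
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            # groups=init_ch makes this a depth-wise convolution:
            # one cheap filter per intrinsic channel
            nn.Conv2d(init_ch, cheap_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # intrinsic + ghost maps
```

With ratio=2, half the output channels cost a full convolution and half cost only a depth-wise pass, which is the source of the parameter and FLOP savings.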

Feature Pyramid Network Improvement
In object detection tasks, multi-scale features play a crucial role in encoding objects with scale variations. Common strategies for multi-scale feature extraction include classical top-down and bottom-up feature pyramid networks. The YOLOv5 network model adopts a bidirectional fusion structure known as the Path Aggregation Network (PANet), which operates in a top-down and bottom-up manner to enhance the feature pyramid. Additionally, a shortcut path is added between the bottom and top layers to shorten the fusion path between high- and low-level features. In this paper, the re-parameterizable convolution (RepConv) module [17] is introduced on top of the PANet feature extraction network. This module adjusts the shape and parameters of the convolution kernel according to the network's requirements; compared with a regular convolution, it can adapt to a broader range of tasks and network architectures. Specifically, during training the module operates as a multi-branch structure, and for inference the multi-branch structure is equivalently transformed into a single-path module.
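The train-time/inference-time equivalence behind re-parameterization can be illustrated with a minimal two-branch block. This sketch omits BatchNorm folding and the identity branch of the full RepConv for brevity; by linearity of convolution, summing a 3×3 branch and a zero-padded 1×1 branch equals a single fused 3×3 convolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConvSketch(nn.Module):
    """Illustrative re-parameterizable block: training runs parallel
    3x3 and 1x1 branches; fuse() folds both into one 3x3 convolution
    for single-path inference."""
    def __init__(self, ch):
        super().__init__()
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv1 = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)  # multi-branch at training time

    def fuse(self):
        # Pad the 1x1 kernel to 3x3 (value lands in the center tap)
        # and add it to the 3x3 kernel; biases add directly.
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          3, padding=1)
        k = self.conv3.weight.data + F.pad(self.conv1.weight.data, [1, 1, 1, 1])
        fused.weight.data.copy_(k)
        fused.bias.data.copy_(self.conv3.bias.data + self.conv1.bias.data)
        return fused
```

The fused module produces the same outputs as the two-branch form but with one convolution per layer, which is what lets the Neck stay expressive in training yet cheap at inference.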

Dataset and Experimental Environment
This experiment utilized a self-built dataset, consisting of 1400 photos captured from real driving scenes and 221 photos collected from the internet.The dataset comprises a total of 1621 images captured by smartphones, including targets of various scales.These photos cover diverse scenarios such as real driving environments, streets, and subways, enhancing the complexity of the photo environment and the diversity of the data.Following the principle of an 8:1:1 ratio, the photos were divided into training, validation, and test sets.The dataset was further categorized into three classes: head, body, and phone.
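One simple way to realize the 8:1:1 split described above is a single seeded shuffle followed by two cuts; the filenames and seed below are illustrative, as the paper does not specify its splitting procedure:

```python
import random

def split_dataset(filenames, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle once with a fixed seed, then cut the list at the
    train/val boundaries; the remainder becomes the test set."""
    files = list(filenames)
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

# Hypothetical filenames standing in for the 1621 dataset images.
train, val, test = split_dataset([f"img_{i}.jpg" for i in range(1621)])
# 1621 images -> 1296 train, 162 val, 163 test
```

Fixing the seed keeps the split reproducible across training runs, so the comparison and ablation experiments see identical data partitions.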
The hardware setup used an Nvidia GeForce RTX 3060 graphics card. The experiment was conducted on a system running Windows 10 with 16 GB of RAM and a 12th Gen Intel(R) Core(TM) i5-12400F processor. The deep learning framework PyTorch 1.12.0 was employed in the experimental environment.

Evaluation Metrics
To quantitatively evaluate the detection performance of YOLOv5s for detecting drivers using mobile phones, this paper selects detection accuracy, detection speed, and model size as performance evaluation metrics. Specifically, these include precision, recall, mean average precision (mAP), and model file size (MB). We use mAP at an IoU threshold of 0.5 (mAP@0.5) as the accuracy metric. These metrics are calculated as follows:

Precision = TP / (TP + FP)  (3)

Recall = TP / (TP + FN)  (4)

mAP = (1/N) Σᵢ APᵢ  (5)

where TP, FP, and FN denote true positives, false positives, and false negatives, APᵢ is the area under the precision-recall curve for class i, and N is the number of classes.
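These metrics can be sketched in code. The AP routine below is a simplified step-wise integration of the precision-recall curve; real mAP tooling additionally matches detections to ground truth by IoU and interpolates precision, which is omitted here:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN), guarding empty cases."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the precision-recall curve by step summation;
    points must be sorted by increasing recall."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For instance, a detector with 8 true positives, 2 false positives, and 2 missed targets scores 0.8 on both precision and recall.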

Model Training
In this paper, the models before and after improvement were analyzed under the same hardware and parameter conditions, based on the training loss curves of the experiments. During model training, the network parameters were set as shown in Table 1.

Comparative Experiments
To effectively demonstrate the superior performance of the proposed algorithm for real-time detection of drivers using mobile phones, it is compared with the current mainstream lightweight detection algorithms YOLOv5s, FasterNet, and RepViT on the self-built dataset, using the same hardware and training parameters. The results are shown in Table 2. The comparative results show that the improved algorithm achieves an average precision of 95.6%. Compared with the current mainstream detection algorithms, there is a slight decrease in accuracy, but a significant improvement in FPS and model memory usage.
This comparison shows that although the proposed algorithm loses a small amount of detection accuracy, this is an expected cost of lightweighting the model, and the decrease is within an acceptable range. In terms of lightweight design, the proposed algorithm clearly outperforms the current mainstream detection algorithms.

Ablation Experiment
An ablation experiment was conducted to confirm the effectiveness of the proposed improvements and to comprehensively evaluate the performance impact of each one on the algorithm. The introduced enhancements are the Ghost-Module and RepConv. The experimental results are presented in Table 3.

Model Deployment
For a more direct understanding of the detection performance of the algorithm before and after improvement, the improved model was deployed on the RK3399Pro development board for inference. The deployment results are shown in Table 4. The detection results are shown in Figure 7: (a) a driver using a mobile phone in the daytime and (b) a driver using a mobile phone at night. As shown in Figure 7(a) and (b), the improved network model can accurately and efficiently detect drivers using mobile phones in complex environments, including both daytime and nighttime scenarios.

Summary
This paper addresses the shortcomings of existing real-time algorithms for detecting drivers' mobile phone use, such as high memory usage and poor real-time performance, by proposing a series of improvement strategies. Introducing the Ghost-Module into the backbone network reduces the model's parameter count while keeping feature extraction effective, and introducing the RepConv module into the Neck enlarges the receptive field and further lightens the feature fusion network. The experimental results demonstrate that the improved YOLOv5s algorithm proposed in this paper has a significant advantage over other mainstream detection algorithms for real-time detection of drivers' mobile phone use. Compared with the original algorithm, it has undergone substantial lightweighting while still meeting real-time detection requirements. However, because the modified model's accuracy is slightly lower, future work will focus on improving both accuracy and detection speed to achieve better detection performance on embedded devices.

Figure 6. The improved Neck section structure.

Figure 7. Detection results of the improved model.

Table 1. Network parameters.

Table 2. Comparison with current mainstream detection algorithms.

Table 3. Ablation experiment results.

Table 4. Comparison between baseline and improved models.