Multi-objective pedestrian tracking method based on YOLOv8 and improved DeepSORT

A multi-objective pedestrian tracking method based on you only look once-v8 (YOLOv8) and an improved simple online and real-time tracking with a deep association metric (DeepSORT) was proposed to cope with the local occlusion and dynamic ID changes that frequently arise when tracking target pedestrians in real, complex traffic scenarios. First, to enhance the feature extraction network's capacity to learn target feature information in busy traffic situations, the detector adopted YOLOv8, which has a high level of small-scale feature expression. Second, the omni-scale network (OSNet) was used as the feature extraction network of DeepSORT to accomplish real-time synchronized target tracking; it increases the effectiveness of image edge recognition by dynamically fusing the feature information collected at various scales. Third, a new adaptive forgetting smoothing Kalman filtering algorithm (FSA) was created to adapt to the nonlinear pedestrian trajectories in traffic scenes and to address the poor prediction attributed to the linear state equation of Kalman filtering. Fourth, the original intersection over union (IOU) association matching algorithm of DeepSORT was replaced by the complete-intersection over union (CIOU) association matching algorithm to reduce missed and false detections of target pedestrians and to improve the accuracy of data matching. Finally, the generalized trajectory feature extractor model (GFModel) was developed to tightly merge local and global information through an average pooling operation, in order to obtain precise tracking results and further decrease the impact of disturbances on target tracking. The fusion of YOLOv8 and the improved DeepSORT method based on OSNet, FSA and GFModel was named YOFGD. According to the experimental findings, YOFGD's final accuracy can reach 77.9% and its speed can reach 55.8 frames per second (FPS), which is sufficient to meet the demands of real-world scenarios.


Introduction
Since the start of the twenty-first century, urbanization has intensified and city traffic problems have spread to involve both motor vehicles and pedestrians. The majority of pedestrian-motor vehicle collisions are caused by pedestrians' failure to follow traffic regulations. The impact of a traffic accident on pedestrians and on the flow of traffic along the entire route is very serious. Real-time traffic systems have become the primary way for police to identify the cause of traffic accidents, so that accidents can be handled swiftly and regular road traffic restored. Owing to the rise in automobile traffic in recent years, urban road traffic frequently presents complicated scenarios, which makes it difficult for the traffic system to identify partially occluded pedestrian targets. Currently, target detection algorithms for complicated traffic scenarios fall into two main types: deep learning-based detection algorithms and classical detection algorithms [1]. Traditional detection methods typically first extract candidate image regions using sliding windows, then extract features from the local information of each window, and finally classify the extracted features [2]. Accordingly, the typical flaws of conventional detection algorithms can be summed up as poor target recognition precision, slow computing performance and insufficient classification of the derived image categories. With the growing development of deep learning, its use for multi-objective recognition and tracking has gained widespread acceptance in the research community. Therefore, an innovative YOFGD model is proposed in this research to enhance the performance of the target identification and tracking model. The detector in the model is you only look once-v8 (YOLOv8) [3], which can extract more detailed feature information. The model uses OFGD, an enhanced version of the simple online and real-time tracking with a deep association metric (DeepSORT) [4] network, as the tracker, which significantly boosts tracking efficiency and accuracy.
The remainder of this paper is structured as follows. Related work is introduced in Section 2. The design and implementation of the model are described in Section 3. The tracking performance of the YOFGD model is compared with that of existing target identification and tracking models in Section 4. The summary and future prospects of the study are presented in Section 5.

Related works
Numerous academics have made tremendous progress in the field of multi-objective tracking since the integration of deep learning and pedestrian tracking.
G. Yang et al. in [5] proposed incorporating the Kalman filter into the K-KCF visual tracking framework based on deep learning to solve the tracking failures caused by pedestrian occlusion in densely populated situations.
In [6], M. I. H. Azhar et al. proposed using YOLO in combination with DeepSORT to build a system for real-time tracking of pedestrians in rows, which was able to detect and track people's movement paths at an average rate of 2.59 frames per second (FPS). D. Stadler et al. in [7] proposed cluster-aware non-maximum suppression (CA-NMS) to reduce the missed detections that frequently arise in multi-objective pedestrian tracking. To enhance the association performance in cluttered settings, they also presented a new tracking pipeline that blends tracking-by-detection and regression-based tracking patterns. Their findings showed considerable progress in tracking effectiveness.
A multi-pedestrian tracking algorithm based on the attention mechanism and double data association is proposed in [8]. It adds a feature pyramid network and a high-resolution feature map at the neck layer of the network to further improve the network's capacity to extract epistatic information. Additionally, it improves the spatial attention mechanism module, which increases the model's accuracy in pedestrian spatial localization. The algorithm's effective tracking performance is demonstrated empirically. However, its pedestrian tracking performance in small-scale situations still needs to be improved.
In [9], Q. Gao et al. established a target tracking approach based on DeepSORT and an optimized version of YOLOv5, which employs complete-intersection over union (CIOU) to compute the loss function and incorporates the attention mechanism into YOLOv5 to increase the model's tracking accuracy. In testing, the model reached a tracking accuracy of 54.3%.
The aforementioned studies show that, while the combination of deep learning and multi-objective tracking enhances the accuracy and speed of pedestrian tracking to some extent, the algorithms are frequently subject to missed and false detections when performing target tracking. This research proposes a pedestrian tracking method based on YOLOv8 and the enhanced DeepSORT to reduce such issues.
The primary contributions of this work are as follows:
1) A novel adaptive forgetting Kalman filtering [10] technique, the adaptive forgetting smoothing (FSA) Kalman filter, is proposed to improve the feature extraction and association matching components of DeepSORT.
2) This study presents a generalized trajectory feature extractor model (GFModel) to more fully extract contextual information.
3) To quantify the match between the detection frame and the prediction frame and to increase the accuracy of target matching, this work utilizes the CIOU [11] correlation matching metric.

Model design and implementation
This study proposes a multi-objective pedestrian tracking model (YOFGD) based on YOLOv8 and improved DeepSORT, which addresses the low accuracy and efficiency of target identification and tracking caused by occlusion and very small targets.

YOLOv8 network
To further enhance the accuracy and sensitivity of the network, YOLOv8 improves the backbone network, the detection head and the loss function of the basic framework used by the previous YOLO series. This algorithm is currently recognized as an advanced target detection algorithm. The overall network structure of YOLOv8 is depicted in Figure 1. While the general structure of the backbone network is similar to that of YOLOv5, YOLOv8 does not adopt the backbone's C3 module (Conv1, Conv2 and Conv3) from YOLOv5. Instead, it borrows the idea of efficient layer aggregation networks (ELAN) [12] from YOLOv7 and combines C3 and ELAN to form the CSPDarknet53 to 2-stage feature pyramid network (C2F) module [13]. This enhancement enables the network to acquire richer gradient flow information, increasing the YOLOv8 network's accuracy in image recognition. In contrast to the coupled head used in the previous YOLO series, the detection head of YOLOv8 is a decoupled head. The decoupled head extracts the target location and classification information, learns each separately through a classification branch and a detection branch, and then fuses the results. This explicit branch-wise learning lowers the network's computational cost and helps prevent overfitting. Additionally, it improves the model's generalization and robustness.
The YOLOv8 loss function calculation consists of two branches: classification and regression. As in earlier versions, the classification branch continues to employ the binary cross entropy (BCE) loss [14], while the regression branch uses the distribution focal loss (DFL) [15] and the CIOU loss [16]. Combining the two regression losses makes it possible to obtain the bounding-box regression information of targets more accurately, which significantly improves target identification.

Improved DeepSORT network
DeepSORT is an end-to-end tracking method that can extract both the appearance and the motion of tracked targets [17]. Nonetheless, owing to the growing complexity of tracking scenarios, the original appearance and re-identification model, motion model and data association components of DeepSORT can no longer meet current requirements for real-time and efficient target tracking. For this reason, the feature extraction and association matching components of DeepSORT are optimized in this study. The enhanced DeepSORT algorithm is referred to as the OFGD tracking algorithm, and its flowchart is displayed in Figure 2. The three major components of the OFGD tracking method are detection, feature extraction and association matching. First, the feature information of the input video sequence is extracted using the omni-scale network (OSNet) [18]. The target trajectory is then obtained by prediction with the FSA Kalman filter proposed in this research, and the predictions are associated with the detections using CIOU. The Hungarian algorithm [19] is subsequently combined with the GFModel to extract global contextual information, which mitigates problems that hinder accurate tracking, such as occlusion and changes in target IDs caused by changes in the scene. This improves tracking accuracy.
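
As a rough illustration of the association step described above, the following sketch assigns tracks to detections by minimizing the total cost over a small cost matrix. It uses an exhaustive search over permutations, which returns the same optimum as the Hungarian algorithm on small inputs but is not the Hungarian algorithm itself. The function name `best_assignment`, the `max_cost` gate and the example cost values are assumptions made for illustration and are not taken from OFGD.

```python
from itertools import permutations


def best_assignment(cost, max_cost=0.7):
    """Return (track, detection) pairs that minimize the total cost.

    cost[i][j] is the cost of matching track i to detection j.
    Assumes there are no more tracks than detections.
    Exhaustive search, so only suitable for small matrices.
    """
    n_tracks = len(cost)
    n_dets = len(cost[0]) if cost else 0
    best_total, best_perm = float("inf"), ()
    # Try every way of giving each track a distinct detection.
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    # Gate (assumed): drop pairs whose cost is too high to be plausible.
    return [(i, j) for i, j in enumerate(best_perm) if cost[i][j] <= max_cost]


# Hypothetical costs, e.g. 1 - overlap between predicted and detected boxes.
cost_matrix = [
    [0.10, 0.90, 0.80],
    [0.85, 0.20, 0.95],
]
matches = best_assignment(cost_matrix)
```

In this example track 0 is paired with detection 0 and track 1 with detection 1, while detection 2 stays unmatched and would start a new track in a DeepSORT-style tracker.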

OSNet feature extraction network
Because the Jetson [20] computing platform is used for network deployment in this article, the feature extraction network frequently fails to keep up in real time while operating. DeepSORT's original feature extraction network is a simple convolutional neural network (CNN). In this paper's solution, OSNet is used as the feature extractor, which reduces the size of the model from the original 45 to 2.5M while maintaining the tracking accuracy; the network topology is depicted in Figure 3. As shown in Figure 3, OSNet is made up of several residual blocks with convolutional feature streams. It also incorporates the unified aggregation gate (AG) [21] for dynamic scale fusion, a structure that makes it simpler for the network to extract global features. Moreover, OSNet's architecture is lightweight, which increases its efficiency and ease of deployment on devices.
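
The dynamic scale fusion performed by the aggregation gate can be sketched as follows. This is a minimal NumPy illustration that assumes the gate is a small shared network (global average pooling, one hidden layer with ReLU, then a sigmoid) producing one weight per channel for each feature stream. The function name, the array shapes and the random weights are placeholders; in OSNet the gate weights are learned during training.

```python
import numpy as np


def aggregation_gate(streams, w1, b1, w2, b2):
    """Fuse multi-scale feature streams with a shared channel-wise gate.

    streams: list of arrays shaped (C, H, W), one per scale.
    w1 (C_hidden, C), b1 (C_hidden,), w2 (C, C_hidden), b2 (C,):
    weights of the small gate network, shared by all streams.
    """
    fused = np.zeros_like(streams[0], dtype=float)
    for x in streams:
        pooled = x.mean(axis=(1, 2))                      # global average pooling -> (C,)
        hidden = np.maximum(0.0, w1 @ pooled + b1)        # hidden layer with ReLU
        gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))  # sigmoid, one weight per channel
        fused += gate[:, None, None] * x                  # input-dependent weighting
    return fused


rng = np.random.default_rng(0)
C, H, W, hidden_dim = 8, 4, 2, 4
streams = [rng.standard_normal((C, H, W)) for _ in range(4)]
w1, b1 = rng.standard_normal((hidden_dim, C)), np.zeros(hidden_dim)
w2, b2 = rng.standard_normal((C, hidden_dim)), np.zeros(C)
fused = aggregation_gate(streams, w1, b1, w2, b2)
```

Because the gate depends on the input, the contribution of each scale can change from image to image, which is the sense in which the fusion is dynamic.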

FSA Kalman filter algorithm
The Kalman filter method, which is used to characterize uniform linear motion [22], has two fundamental components: prediction and update. Prediction estimates the state at the present moment from the posterior estimate of the previous moment, yielding the prior estimate for the current moment. The update can be further subdivided into the time update and the measurement update. Equation (3.1) gives the time update equation for the Kalman filter method, while Eq (3.2) gives the measurement update equation.
$$
\begin{aligned}
\hat{x}_t^- &= A\hat{x}_{t-1} + Bu_{t-1},\\
P_t^- &= AP_{t-1}A^{\mathsf{T}} + Q.
\end{aligned}
\tag{3.1}
$$
Here, $\hat{x}_t^-$ represents the prior state estimate at moment $t$, $\hat{x}_{t-1}$ represents the posterior state estimate at moment $t-1$, $A$ represents the state transition matrix, $B$ represents the matrix that converts the input to the state, $u_{t-1}$ represents the input at moment $t-1$, $P_t^-$ represents the prior estimation covariance at moment $t$, $P_{t-1}$ represents the posterior estimation covariance at moment $t-1$ and $Q$ represents the process excitation noise covariance.
$$
\begin{aligned}
K_t &= P_t^- H^{\mathsf{T}}\left(HP_t^- H^{\mathsf{T}} + R\right)^{-1},\\
\hat{x}_t &= \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right),\\
P_t &= \left(I - K_tH\right)P_t^-.
\end{aligned}
\tag{3.2}
$$
where $K_t$ represents the Kalman gain matrix, $H$ is the observation matrix, $R$ is the measurement noise covariance, $z_t$ is the observation value and $I$ is the identity matrix. When the system parameters are subsequently updated, the new data affect the parameter update faster than the old data because of the covariance, while the other metrics cannot distinguish between the old data from earlier observations and the new data from more recent observations [23]. To enhance the sensitivity of the algorithm when responding to parameter changes, this study proposes a novel adaptive forgetting Kalman filter method, the FSA Kalman filter, whose precise steps are given in Algorithm 1.
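
The prediction and update steps described in this section can be sketched in NumPy as follows. This is a minimal sketch of the standard Kalman filter, not the FSA filter of Algorithm 1. The function names `kf_predict` and `kf_update`, the constant-velocity model and the measurement values are assumptions for illustration. The optional `fading` factor shows one common textbook way of discounting older data by inflating the propagated covariance; it is not claimed to be the forgetting mechanism used by FSA.

```python
import numpy as np


def kf_predict(x, P, A, Q, B=None, u=None, fading=1.0):
    """Time update: return the prior state estimate and covariance.

    fading >= 1 inflates the propagated covariance so that older data
    are gradually forgotten; fading = 1.0 gives the standard filter.
    """
    x_prior = A @ x
    if B is not None and u is not None:
        x_prior = x_prior + B @ u
    P_prior = fading * (A @ P @ A.T) + Q
    return x_prior, P_prior


def kf_update(x_prior, P_prior, z, H, R):
    """Measurement update: return the posterior state estimate and covariance."""
    S = H @ P_prior @ H.T + R                 # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    return x_post, P_post


# Constant-velocity model for one coordinate: state = [position, velocity].
A = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

x, P = np.array([0.0, 0.0]), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:                # made-up position measurements
    x, P = kf_predict(x, P, A, Q, fading=1.05)
    x, P = kf_update(x, P, np.array([z]), H, R)
```

In a DeepSORT-style tracker the state would hold the bounding-box center, aspect ratio, height and their velocities, but the two steps keep the same form.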

CIOU association matching
The DeepSORT algorithm matches the positions predicted by the Kalman filter with the detected positions using a cost matrix and the Hungarian algorithm. DeepSORT employs intersection over union (IOU) for association matching to improve tracking by further determining the match between the actual detection frame and the predicted frame. Equation (3.3) gives the IOU calculation formula.
$$
\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|} \tag{3.3}
$$
where $A$ and $B$ represent the predicted and real target bounding boxes, respectively. IOU has some limitations, however, when dealing with special situations. For instance, when there is no overlap between the two bounding boxes, IOU is zero, which makes the gradient zero as well and makes secondary optimization of the data impossible [24]. To solve this issue, CIOU is chosen to replace IOU for association matching in this study, and its formula is shown in Eq (3.4).
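$$
\mathrm{CIOU} = \mathrm{IOU} - \frac{\rho^2\left(b, b^{gt}\right)}{c^2} - \alpha v, \qquad
\alpha = \frac{v}{(1 - \mathrm{IOU}) + v}, \qquad
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2
\tag{3.4}
$$
where $\alpha$ represents the weight function and $v$ measures the consistency of the aspect ratios of the detection frame and the target frame. In the standard CIOU definition, $\rho(b, b^{gt})$ is the Euclidean distance between the center points of the two frames, $c$ is the diagonal length of the smallest box enclosing both frames, and $w$, $h$, $w^{gt}$ and $h^{gt}$ are the widths and heights of the two frames. Figure 4 displays the matching effect of CIOU when IOU is zero. As shown in Figure 4, CIOU can still guide the movement of the target frame even if there is no overlap between the target frame and the prediction frame. The association matching of the OFGD algorithm is substantially accelerated and refined by CIOU's ability to swiftly return the target frame to its original position without shifting the position of the prediction frame.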
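
The difference between IOU and CIOU for non-overlapping boxes can be checked with a short sketch. It follows the standard CIOU definition (IOU minus a center-distance penalty and an aspect-ratio penalty) and assumes boxes given as `(x1, y1, x2, y2)` with positive width and height. The function names and example boxes are illustrative only.

```python
import math


def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou(a, b):
    """CIOU = IOU - center-distance penalty - aspect-ratio penalty."""
    i = iou(a, b)
    # Squared distance between the two box centers.
    rho2 = ((a[0] + a[2] - b[0] - b[2]) ** 2 + (a[1] + a[3] - b[1] - b[3]) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    # Aspect-ratio consistency term v and its weight alpha.
    v = (4.0 / math.pi ** 2) * (
        math.atan2(b[2] - b[0], b[3] - b[1]) - math.atan2(a[2] - a[0], a[3] - a[1])
    ) ** 2
    alpha = v / ((1.0 - i) + v) if v > 0 else 0.0
    return i - rho2 / c2 - alpha * v


pred = (0.0, 0.0, 2.0, 2.0)   # predicted box
near = (3.0, 0.0, 5.0, 2.0)   # detection close to the prediction, no overlap
far = (8.0, 0.0, 10.0, 2.0)   # detection far from the prediction, no overlap
```

Both detections have an IOU of zero with the prediction, so IOU cannot rank them, whereas `ciou(pred, near)` is larger than `ciou(pred, far)` and therefore still prefers the closer box.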

GFModel trajectory feature extractor
The generalized trajectory feature extractor GFModel, which has strong generalization ability, is proposed in this study as a means of reducing the noise resulting from occlusion and scene changes. Its network structure is depicted in Figure 5. Its input is a collection of frames, which allows it to extract the global contextual information and spatial properties more thoroughly.
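
The internal structure of GFModel is given only in Figure 5, so the following is not its architecture. It is a minimal sketch of the average pooling idea mentioned in the abstract, under the assumption that each frame of a track yields one appearance feature vector and that the vectors are averaged over the frames. The function name and the feature values are hypothetical.

```python
def average_pool_tracklet(frame_features):
    """Average per-frame feature vectors into one trajectory descriptor."""
    n_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n_frames for d in range(dim)]


# Hypothetical 3-dimensional appearance features for four frames of one track;
# the third frame is disturbed, for example by a partial occlusion.
features = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.1, 0.8],
    [0.9, 0.0, 0.1],
]
descriptor = average_pool_tracklet(features)
```

Averaging over several frames dampens the influence of a single disturbed frame, which is one plausible reason for feeding a collection of frames rather than a single image.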

Experimental results and analysis
This experiment is based on the YOLOv8 and OFGD networks in a Windows environment, with Python 3.6.13 as the development language, an NVIDIA GeForce RTX 2070 SUPER (8G) as the GPU and an Intel(R) Core(TM) i5-10500 CPU @ 3.10GHz as the CPU.

Experimental data sets
Market-1501 [25] and CUHK03 (The Chinese University of Hong Kong) [26] are two large-scale pedestrian re-identification datasets that are appropriate for the analysis in this paper. Additionally, because the Market-1501 and CUHK03 datasets have the same structure, they can be used together for training, which provides more data and increases the tracker's accuracy.
Each dataset is divided into four parts: Bounding box train, Bounding box test, Gt query and Gt gallery, which represent the training set, the test set, the real to-be-queried images and the real queried images, respectively.

Evaluation index of experiment
In this paper, a total of four common evaluation metrics for multi-objective tracking are chosen, including multiple object tracking accuracy (MOTA), identification F-score (IDF1), mostly tracked (MT) and FPS, in order to scientifically evaluate the performance of the YOFGD model proposed in this paper from a holistic viewpoint.

1) MOTA
The MOTA index integrates four factors: false positive (FP), false negative (FN), ID switch (IDSW) and ground truth (GT). FP represents the number of falsely detected targets, FN represents the number of real targets not detected, IDSW represents the number of ID switches for the same target and GT represents the number of real objects. The specific formula is shown in Eq (4.1).
$$
\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDSW}}{\mathrm{GT}} \tag{4.1}
$$
The closer MOTA is to one, the better the model's overall performance.

2) IDF1
According to Eq (4.2), IDF1 is the F-score of pedestrian ID recognition, that is, the harmonic mean of identification precision (IDP) and identification recall (IDR).
$$
\mathrm{IDF1} = \frac{2 \times \mathrm{IDP} \times \mathrm{IDR}}{\mathrm{IDP} + \mathrm{IDR}} \tag{4.2}
$$

3) MT
As schematically depicted in Figure 6, MT stands for the number of GT trajectories for which the percentage of successfully tracked frames exceeds 80% of the total number of frames.

4) FPS
Target detection speed is frequently assessed using FPS, which is the number of images that can be processed in one second. A higher FPS indicates a faster detection speed.
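
The four metrics defined above can be computed from raw counts as in the following sketch. The function names, the argument names (`idtp`, `idfp` and `idfn` for identity true positives, false positives and false negatives) and all numbers are made up for illustration and are unrelated to the results reported in this paper. `idf1` assumes at least one identity true positive.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (fn + fp + idsw) / gt


def idf1(idtp, idfp, idfn):
    """IDF1 as the harmonic mean of IDP and IDR."""
    idp = idtp / (idtp + idfp)   # identification precision
    idr = idtp / (idtp + idfn)   # identification recall
    return 2.0 * idp * idr / (idp + idr)


def mostly_tracked(tracked_ratios, threshold=0.8):
    """Count GT trajectories tracked in more than `threshold` of their frames."""
    return sum(1 for r in tracked_ratios if r > threshold)


def fps(num_frames, seconds):
    """Frames processed per second."""
    return num_frames / seconds


example_mota = mota(fn=200, fp=80, idsw=20, gt=1000)
example_idf1 = idf1(idtp=750, idfp=250, idfn=250)
example_mt = mostly_tracked([0.95, 0.82, 0.80, 0.40])
example_fps = fps(num_frames=600, seconds=12.0)
```

MT is returned here as a count of trajectories; dividing it by the number of GT trajectories gives the percentage form used when MT is reported as a ratio.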

Analysis of experimental results of YOFGD model
This paper first trains the OSNet network for pedestrian tracking, with the initial learning rate set to 0.00005. After 180,000 iterations, different pedestrians can be distinguished well, which allows a reasonable evaluation of the final tracking effect. The model is further trained in this study using the Torch framework [27], and the resulting training loss curve and classification loss accuracy are displayed in Figure 7. Some sequences from multiple object tracking 17 (MOT17) were selected to illustrate the effect of the YOFGD model, as seen in Figure 8.

Mathematical Biosciences and Engineering, Volume 21, Issue 2, 1791-1805.

According to the results shown in Figure 8, every pedestrian was accurately tracked. The YOFGD model did not exhibit any missed detection or misdetection during tracking.
To illustrate the differences between tracking algorithms more effectively, the model combining YOLOv8 with DeepSORT was chosen for experimental comparison with the YOFGD model described in this paper, as seen in Figure 9. The real frame is (a), the predicted frame generated by the YOLOv8-DeepSORT model is (b) and the predicted frame generated by the YOFGD model is (c). In (b), the tracked target is the pedestrian with ID 13 in the green frame of frame 217. This target can be tracked normally at frame 438; however, in frame 634, occlusion by the device and the target's distance cause the ID to be misaligned. In (c), the person's ID remains unchanged and the distant individual is tracked successfully. It is clear that the YOFGD model continues to track well even in complicated scenarios.
In this study, the MOT16 and MOT17 datasets are used to test YOFGD, and the test results are displayed in Tables 1 and 2. The experimental results demonstrate that the MOTA of YOFGD on the MOT16 dataset reaches 69.7%, which is competitive among algorithms of the same type and indicates that its tracking accuracy has reached a high level compared with other models. IDF1 reaches 76.0% on the MOT17 dataset, which means that a significant percentage of the detected and tracked targets receive correct IDs and demonstrates that the model in this work has made significant progress in the continuity and accuracy of tracking. The MT value of 45.7% shows that trajectories accurately tracked in more than 80% of their frames account for a large fraction of all trajectories, which indicates that the model in this work has a high tracking accuracy in real nonlinear motion situations. While the tracking speed may not be the fastest, the model can fully address practical concerns because pedestrian tracking in the real world requires a speed of 30 frames per second.
In comparison with similar research, the MOTA and IDF1 of the model proposed in [28] on the MOT17 dataset were 32.587% and 43.793%, respectively, whereas the MOTA and IDF1 of the model proposed in [29] on the MOT17 dataset were 56.215% and 62.823%, respectively. The model presented in this study performs better on both measures. Furthermore, the experiments in [28,29] did not show a balance between tracking accuracy and computational complexity. The tracking efficiency of this paper is not the best, but it is still quite competitive.

Conclusions
This research proposed an innovative multi-objective pedestrian tracking technique based on YOLOv8 and the enhanced DeepSORT. Starting from feature extraction and association matching in the multi-objective tracking problem, the Kalman filter, the feature extraction network and the IOU matching were each improved. To increase the tracker's accuracy, this research first proposed a new adaptive forgetting Kalman filter technique called the FSA Kalman filter. A GFModel was then proposed to enhance the tracker's capacity to extract global information. To further increase the tracker's accuracy, the match between the detection and prediction frames was measured using the CIOU association matching metric. In the experiments, the YOFGD model put forward in this paper significantly improved tracking accuracy. The research in this paper still needs to be extended in certain areas. For example, the model's accuracy has to be further enhanced because it occasionally switches pedestrian IDs incorrectly in scenes with a high density of pedestrians. Moreover, although YOFGD compares favorably with other models in terms of tracking accuracy, its tracking speed is not at a high level; hence, future work will focus on the model's speed.

Figure 7. The training curve of the tracking model.

Figure 9. Comparison of partial tracking results.

Algorithm 1: Adaptive forgetting Kalman filter algorithm FSA
Input: Initialize the measurement matrix $z_t$, the confidence of the measured value $c_t$, the predicted mean $x'_{t-1}$, the variance $P_{t-1}$, observation matrix $H_t$, noise covariance $R_t$, Kalman gain $K_t$, adaptive factor $\mu$
Output: Output the final predicted mean $x_t$ and variance $P_t$
1 Predict target status;
2 The results of tracking and detection are matched;
3 The matching and detection results are updated according to $y_t = z_t - H_t x'_{t-1}$;
4 The measurement noise covariance $R'_t = (1 - c_t) \times 1$
6 Update parameters,

Table 1. The performance of the YOFGD model on MOT16.

Table 2. The performance of the YOFGD model on MOT17.