Performance Evaluation of YOLOv8-Based Bib Number Detection in Media Streaming Race

The evolution of telecommunication networks unlocks new possibilities for multimedia services, including enriched and personalized experiences. However, ensuring high Quality of Service and Quality of Experience requires intelligent solutions at the edge. This study investigates the real-time detection of race bib numbers using YOLOv8, a state-of-the-art object detection framework, within the context of 5G/6G edge computing. We train various YOLOv8 models (nano to extreme) on the BDBD and SVHN datasets and analyze them across two diverse racing datasets (TGCRBNW and RBNR), encompassing varied environmental conditions (daytime and nighttime). Our assessment focuses on key performance metrics, including processing time, efficiency, and accuracy. For instance, on the TGCRBNW dataset, the extreme-sized model shows a marked reduction in prediction time when the more powerful GPU is used, from 1,161 to 54 seconds on a desktop computer. Similarly, on the RBNR dataset, the prediction time of the extreme-sized model drops from 373 to 15 seconds with the more powerful GPU. In terms of accuracy, we found varying performance across scenarios and datasets. For example, results are insufficient in most scenarios on the TGCRBNW dataset (below 50% in all sets and models), while YOLOv8m achieves the highest accuracy in several scenarios on the RBNR dataset (almost 80% in the best set). Prediction times varied between computer architectures, highlighting the importance of selecting appropriate hardware for specific tasks. These results emphasize the importance of aligning computational resources with the demands of real-world tasks to achieve timely and accurate predictions.


Rafael Martínez, Álvaro Llorente, Alberto del Rio, Javier Serrano, and David Jimenez

Index Terms-YOLO, object detection, bib number detection, cognitive networks, media streaming, broadcasting, edge computing, runner segmentation, image quality.

I. INTRODUCTION
The evolution of mobile communication technologies has triggered a significant paradigm shift in multimedia services, ushering in a new era of enriched and personalized offerings tailored to individual preferences [1], [2]. This transformation is underscored by the increasing softwarization of mobile core network functions, which is driving the evolution of the mobile network architecture itself. In its fifth generation (5G) and beyond, mobile networks have transitioned towards a service provider/consumer framework, facilitated by service-based interfaces [3], [4].
The capabilities inherent in 5G networks, including enhanced bandwidth, reduced latency, and improved reliability, hold immense significance for the delivery of audiovisual media services [5], [6]. These capabilities enable the seamless transmission of high-quality video content and support emerging technologies such as augmented reality (AR) and virtual reality (VR), thereby revolutionizing the landscape of media consumption [7], [8]. Among the many applications that have surfaced in this transformation, the integration of multimedia broadcast enrichment through cognitive services has emerged as a frontier [9], [10].
Central to this evolution is the proliferation of edge computing capabilities, which play a pivotal role in real-time multimedia content processing while ensuring compliance with Quality of Service (QoS) and Quality of Experience (QoE) standards. Moreover, the integration of Multi-access Edge Computing (MEC) is crucial for fully leveraging the potential of 5G networks to enrich and tailor media services [11], [12]. The convergence of edge computing and 5G/6G networks underscores the need for an infrastructure that can seamlessly handle the increased data load and complexity associated with sophisticated media services, thereby enhancing the overall user experience [13].
The enhanced capabilities of MEC environments become critical enablers for deploying Artificial Intelligence (AI) frameworks in real-time applications. MEC's proximity to end-users ensures minimal latency and high computational efficiency, which are essential for the effective implementation of AI-driven solutions such as object detection. Within this context, state-of-the-art object detection frameworks like YOLO (You Only Look Once) [14] have demonstrated exceptional performance in various tasks (classification, detection, segmentation, etc.) across successive versions, the latest of which is version 8. We intend to leverage its capabilities for a specific and practical application: race bib detection. The integration of YOLO in our proposed system is motivated by its proven ability to outperform other tools in object detection, and by its support for several tasks within the same framework, ensuring accuracy and reliability. Understanding the different YOLO model sizes, from nano to extreme, and configuring them for varied inference scenarios is crucial. This approach allows us to tailor the system's performance to specific needs, balancing processing speed against accuracy, and to establish a baseline MEC configuration that delivers optimal results in diverse conditions.

A. Research Challenges
At the heart of this transformation is the integration of Artificial Intelligence (AI) into media services, enabling intelligent content processing to deliver valuable insights to both content providers and end-users. However, challenges arise when these edge services demand excessive resources or fail to deliver accurate results, emphasizing the importance of predicting real-time performance and configuring robust service architectures [15].
To address these challenges, a specific approach to managing the various layers of networking and computing infrastructure is essential. This requires an overall intelligence framework capable of orchestrating data, control, and service layers to optimize performance and ensure seamless delivery of media content. Furthermore, the validation of models and analysis of their accuracy are critical aspects of this evolving landscape. Rigorous testing and evaluation frameworks are needed to assess the adaptability and robustness of cognitive services, particularly in meeting the stringent performance requirements of multimedia applications [16]. Racing events, characterized by their diverse and dynamic sequences, serve as a canvas for the analysis and application of cognitive services, especially object detection [17], [18].
In this field, companies and developers face significant challenges [19] when trying to implement object detection solutions. One of the main problems is the difficulty in selecting the most suitable model [20] for their specific use case. This selection process is restricted by the lack of complete documentation and detailed specifications, which are essential for making informed decisions. Understanding the specifics of each model, such as its performance capabilities, processing speed, and suitability for various tasks, is crucial to optimizing their applications.
For example, when it comes to real-time processing [21], detection in high-resolution images [22], or object identification in high-definition videos [23], the absence of detailed information can lead to suboptimal choices. Users need clarity on which model is most suitable for their particular requirements, taking into account factors such as image characteristics and available computational resources, whether GPU or CPU. Without this data, implementing effective and efficient object detection solutions becomes a complicated task.
In the context of YOLO models, although there is technical documentation and comparisons of YOLO on state-of-the-art datasets, the challenge lies in making these results more accessible and understandable to users. These data need to be presented clearly so that users can more easily determine which variant is most appropriate for their specific needs. The lack of easily interpretable information on the speed and processing time of each model complicates the selection process, as these factors are essential for implementing effective solutions.

B. Objective and Contributions
The above problems show the importance of our research objective, which is to analyze the different models and sizes of the latest version of YOLO (YOLOv8), focusing on its performance in detecting runner bibs across different race datasets and environmental conditions.
In particular, at the forefront of our research is the real-time object detection system known as YOLO. By offering a spectrum of models with varying parameters such as size, inference speed, and task-specific adaptability, YOLO provides a versatile toolkit. In our case, the main goal is to leverage the capabilities of YOLOv8 to detect and decipher the bib numbers worn by runners in different scenarios, running several tests and reporting their results.
Our work focuses on the optimization of object detection systems for specific tasks in racing events. To that end, we examine the efficiency, accuracy, and processing time of the YOLOv8 framework in different scenarios, ranging from daytime clarity to nighttime challenges. Furthermore, this study aims to determine the most effective conditions for each YOLOv8 model size, providing guidance for improving detection performance in real-time racing scenarios.
The use of open-source datasets for both training and evaluation purposes, together with the wide range of running scenarios under several environmental conditions, ensures that our results can be validated and extended by other studies.

A. Significance in the Context of 5G/6G Multicast/Broadcast Services
Emerging multimedia services and applications differ from traditional ones by offering an increasingly immersive experience. 4K and 8K video streaming, virtual reality, augmented reality, and 360 omnidirectional video applications have popularized new scenarios and media use cases [24], [25]. Audiovisual content providers and broadcasters are highly motivated to use IP-based, mobile, and cellular distribution technologies to deliver their media services to end-users, since these form a broadly accessible and unified distribution platform [26].
5G mobile networks have brought a great revolution to the communications field. High bitrates, low latency, security, and improved reliability are provided by 5G technologies, enabling success in multimedia streaming, where it is critically important to guarantee the stability of the transmission [27]. 5G networks, together with the new video compression standards, the evolution of the technology, and the availability of UHD portable consumer devices, provide the infrastructure for "anywhere anytime" access to real-time broadcast media for emerging video services. In the current era of information explosion, applications such as 5G autonomous driving, UHD video, 4K video, 8K video, 360 video, gaming, and holographic metaverse applications bring massive data increments, imposing more stringent requirements on the performance of 5G wireless communication networks and motivating the leap to 6G networks [28], [29].
Mobile networks are characterized by frequent changes in latency and bandwidth conditions, which may result in unstable and poor video streaming [30]. For this reason, assuring the QoE and QoS of applications and services in challenging network scenarios (e.g., live streaming or video on demand) is one of the main objectives of 5G networks [31], so that the quality perceived by the end-user is preserved through intelligent network management [32], [33].
The 3GPP specifications in Release 17 [34] include those for 5G Multicast-Broadcast Services (5G MBS) [35], which govern multicast and broadcast delivery over 5G networks [36]. In recent years, broadcast and multicast technology in 5G networks has evolved continuously [37] owing to its versatility, flexibility, and efficiency, and its easy integration with deployed mobile communication networks [38]. While 3GPP offers a set of specifications for the media industry and for the distribution of TV services to mobile devices, the 5G Media Action Group (5G-MAG) has undertaken the task of developing open-source implementations of 3GPP specifications.

B. Multimedia Applications on Edge Computing
The advent of cloud computing and virtualization paradigms created new market gaps for multimedia applications, driving new opportunities for the multimedia content and entertainment industries [39]. The application of Network Functions Virtualisation (NFV) and serverless paradigms for multimedia applications over 5G has been widely analysed [40] with the use of open-source Function-as-a-Service (FaaS) enablers, such as OpenWhisk, for multimedia services. The EU H2020 5G-PPP 5G-MEDIA project [41] developed a transparent Service Virtualisation Platform (SVP), where the vertical service provider can deploy its virtualized service from an application-level perspective [42]. These platforms, already proven for multimedia content, provide in some cases a complete 5G infrastructure for testing verticals [43]. Another platform, in this case focused on immersive multimedia content, is the one proposed by the EU H2020 5G-PPP 5G-Xcast project [8].
From the perspective of the vertical service provider, the simplification of service deployment procedures is a key factor in terms of cost reduction [44]. This simplification reduces the time required to deploy the service [45]. Simplification and automation techniques facilitate the deployment, execution, and analysis of vertical services [46]. In the case of multi-site virtualized architectures, it is possible to deploy monitoring systems that allow the analysis of virtualized services in multiple geographical locations [47]. This makes it possible to create procedures for the use of 5G-enabled end-to-end platforms for the creation and performance analysis of vertical services [48].

C. Object Detection Framework
The object detection field has been a hot topic in recent years, driven by advances in artificial intelligence [49] and the growing need for automated solutions [50] in various applications. Many studies have focused on object detection to provide solutions in diverse areas such as surveillance [51], [52], autonomous driving [53], [54], medicine [55], and many others. Object detection involves identifying and classifying objects in an image or video, and has proven to be crucial in the digital transformation of numerous industries.
Text detection is an important subcategory within object detection, with significant applications in document scanning [56], translating text in images [57], and assisting the visually impaired [58]. Although text detection is a specific topic, it faces challenges similar to those of general object detection [59]. This need has led to the development of specialized tools for text detection and recognition.
Traditionally, optical character recognition (OCR) tools such as Tesseract and EasyOCR have been widely used. These tools have proven to be effective in certain contexts, such as food identification and tracking [60] and handwritten character extraction [61]. However, the primary use of these tools is for character recognition on car license plates [62], [63].
Despite their usefulness, these tools have significant limitations in terms of accuracy and the ability to handle objects in complex environments with high variability. Tesseract, for example, can struggle with low-quality images [64], and its setup and configuration are fairly complex for non-technical users [65], which limits its accessibility. EasyOCR, while improving on some aspects of ease of use and configuration, faces similar challenges. Its ability to handle multiple languages and diverse fonts sometimes results in decreased accuracy [66] when faced with non-standard text or less controlled situations.
To overcome these difficulties in object detection in general, more complex neural network models have been developed, which significantly improve detection results. A prime example of such models is YOLO, which has revolutionized the computer vision field. YOLO is based on a convolutional neural network architecture that enables real-time object detection by analyzing an entire image in a single pass [67]. Since its introduction, YOLO has evolved through eight versions, each improving in terms of accuracy, speed, and versatility [68]. The latest version of YOLO includes several models and sizes, allowing it to be adapted to different needs and hardware constraints.
The use of YOLO has been wide and varied in multiple applications. In passenger detection and counting, its implementation makes it possible to optimize the accuracy and efficiency of the Automatic Passenger Counting (APC) system [69]. In the automotive industry, YOLO has been used for various tasks. For example, a novel lightweight vehicle detection method called MA-YOLO (MobileNet Attention YOLO) has been proposed [70]. This tool reduces the number of parameters by almost half compared to YOLOv8, while maintaining similar accuracy.
In addition, in the field of license plate detection and authentication, YOLOv4 and YOLOv5 have been used to solve specific problems. In one study [71], YOLOv4 was employed for license plate detection, while YOLOv5 was utilized for license plate class identification for authentication purposes. Similarly, in autonomous driving research [72], the YOLO algorithm has been applied to detect and classify various objects on the road using bounding boxes.
In the case of bib detection, the dynamism and variability in bib position and appearance during a competition present significant challenges. Similar studies, such as one on bib number recognition in running competitions [73], have addressed these issues. That system, which faces variability in bib appearance, size, and deformations, improves recognition accuracy using facial detectors and stroke width transforms (SWT).
Another study has presented modifications to SWT to improve its performance in detecting bib numbers in racing competition images [74]. These modifications, such as hue channel similarity testing and stroke length limitation, have been shown to significantly improve bib detection in assorted images.
Finally, a multimodal technique has been presented that combines biometric and textual features to detect and recognize bib numbers in natural images of marathons and sports competitions [75]. This technique uses face and skin features to identify candidate text regions, improving the accuracy and performance of bib recognition.

A. Experimental Setup
Our experimental setup aimed to analyze the performance of the YOLOv8 models under various computational constraints, while evaluating their accuracy and efficiency in race bib detection. We employed two different hardware configurations.
The neural networks for bib and number detection were trained from scratch. For this training, we used a high-performance desktop computer equipped with an Intel Core i9-10900 CPU and an NVIDIA GeForce RTX 3090 graphics card (10,496 CUDA cores, 328 Tensor cores, 24 GB of memory). This system efficiently managed the intense computation required for training. Specifically, the training process ran exclusively on the GPU to exploit its parallel processing capabilities, reducing training time to approximately 3 hours per model for bib detection, and considerably longer for number detection (from half a day up to two days for the extreme model). Our training datasets included around 600 sequences featuring diverse bib sizes and angles, and almost 100,000 digits in different real-world scenarios with variations in object appearance.
For the inference phase, in addition to the desktop setup, we adopted a more portable setup, using a laptop equipped with an Intel Core i5-7200U CPU and an NVIDIA GeForce MX150 GPU (384 CUDA cores, 2 GB of memory). This configuration met the minimum inference requirements while offering lower computational power. This choice allowed us to evaluate the feasibility of deploying trained models on resource-constrained edge devices, paving the way for potential real-world implementations in resource-limited environments.
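To make the hardware comparison concrete, the following minimal sketch computes the speedup factor implied by the prediction times quoted in the abstract for the extreme-sized model. The function and variable names are illustrative assumptions and are not part of the authors' tooling; only the four timing values come from the paper.

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times faster the fast run is than the slow run."""
    if slow_seconds <= 0 or fast_seconds <= 0:
        raise ValueError("durations must be positive")
    return slow_seconds / fast_seconds


# Prediction times in seconds (slower run, faster run) for the extreme-sized
# model, as quoted in the abstract. The dictionary layout is an assumption.
REPORTED_TIMES = {
    "TGCRBNW": (1161.0, 54.0),
    "RBNR": (373.0, 15.0),
}

# Speedup factor per dataset, rounded to one decimal place.
SPEEDUPS = {
    name: round(speedup(slow, fast), 1)
    for name, (slow, fast) in REPORTED_TIMES.items()
}

for name, factor in SPEEDUPS.items():
    print(f"{name}: {factor}x faster with the more powerful GPU")
```

Under these figures the more powerful GPU is roughly 21.5 times faster on TGCRBNW and about 24.9 times faster on RBNR.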

B. Media Architecture
The research presented in this work is closely related to the Cognitive Service module of a general multimedia broadcasting architecture. Although the overall goal of the architecture is to capture and enrich User-Generated Content (UGC), it is vital to contextualize YOLO's performance within the broader architecture. The other components serve to illustrate the composition of the real scenario, demonstrating how the Cognitive Service module can operate within a dynamic environment.
As illustrated in Figure 1, our architecture is an interconnected set of components. First, event stream acquisition ensures that the infrastructure manages access to the broadcast stream. Next, the stream transcoder optimizes the media formats, ensuring compatibility, and also adapts the bitrate of the stream. The Cognitive Services module, the main component of this work, assumes the fundamental role of enriching multimedia content, providing intelligent capabilities to enhance the overall viewing experience. In this scenario, the objective is to segment and identify the runners' bibs.
The information bus is responsible for real-time communication and coordination between the different components, acting as a data exchange channel. Finally, the production control supervises the orchestration, processing, and rendering of the content, while the media delivery manager is in charge of distributing the selected content to the different channels for end-users.
In this manuscript, we focus on a single test case, based on the Cognitive Services module. Our objective is to evaluate the performance of different YOLO models, focusing on the detection and prediction aspects to enrich the multimedia content, as shown in Figure 2. The process started with training YOLOv8, using data from the BDBD (Bib Detection Big Data) dataset [76] (examples in Figure 3) for physical bib detection and the SVHN (Street View House Numbers) dataset [77] (additional examples in Figure 4) for the identification of numbers within those bibs.
To guarantee the relevance and robustness of our findings, we chose to analyze two distinct datasets, in addition to those employed for training. After training our different-sized models, we conducted the prediction process using images extracted from the TGCRBNW (Trans Gran Canaria Race Bib Number in the Wild) dataset [78] and the RBNR (Racing Bib Number Recognition) dataset [79]. The selection of these datasets is deliberate, as they encompass images of runners in a variety of scenarios, thus presenting intriguing challenges for evaluating the robustness of our neural networks under different conditions. These conditions include scenarios with a single runner, multiple runners, daytime conditions, and nighttime conditions (see Figure 5 for visual references on daytime, nighttime, and a crowded environment).
In this framework, a pre-trained YOLOv8 neural network [80] is first deployed to detect each individual runner present in the image. Next, a detection process is performed within each identified runner to locate the physical bib. Subsequently, an additional detection is executed within the bib to determine the number associated with each runner. This stepwise filtering process, narrowing from the overall image to the individual, the bib, and the number, allows our tool to mitigate errors associated with the detection of extraneous elements, such as background signs or irrelevant objects, particularly in challenging scenarios.
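The stepwise filtering described above can be sketched as a three-stage cascade in which each stage only looks inside the region found by the previous one. This is a minimal illustration, not the authors' implementation: the `Detector` interface, the helper names, and the stub detectors in the demo are all hypothetical stand-ins for trained YOLOv8 models, and images are represented as plain nested lists to keep the example self-contained.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]           # (x1, y1, x2, y2) in pixels
Image = List[List[int]]                   # grayscale rows; stand-in for a real frame
Detector = Callable[[Image], List[Box]]   # hypothetical detector interface


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by `box` out of `image`."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def to_parent(box: Box, parent: Box) -> Box:
    """Shift a box from crop coordinates into the parent's coordinate frame."""
    dx, dy = parent[0], parent[1]
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def cascade(image: Image, detect_runners: Detector, detect_bibs: Detector,
            detect_digits: Detector) -> List[Dict[str, object]]:
    """Run runner -> bib -> digit detection, reporting boxes in frame coordinates."""
    results = []
    for runner in detect_runners(image):
        runner_img = crop(image, runner)
        # Bibs are searched only inside the runner crop, so background signs
        # elsewhere in the frame never reach the bib detector.
        for bib in detect_bibs(runner_img):
            bib_img = crop(runner_img, bib)
            bib_abs = to_parent(bib, runner)
            digits = [to_parent(d, bib_abs) for d in detect_digits(bib_img)]
            results.append({"runner": runner, "bib": bib_abs, "digits": digits})
    return results


# Demo with stub detectors that return fixed boxes (purely illustrative).
frame = [[0] * 100 for _ in range(100)]
demo = cascade(
    frame,
    detect_runners=lambda img: [(10, 20, 60, 90)],
    detect_bibs=lambda img: [(5, 10, 35, 30)],
    detect_digits=lambda img: [(2, 3, 10, 15)],
)
```

Because every stage works in the coordinate frame of its crop, the boxes are translated back with `to_parent` so that the final results can be drawn on the original frame.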
Visual analysis of our two validation datasets reveals a rich variety of conditions and scenarios, allowing us to evaluate the YOLOv8 detection system. We begin with daytime races under ideal lighting conditions (Figure 5(a)), which serve as a baseline due to their inherent ease of detection. Subsequently, we investigate nighttime races (Figure 5

C. Cognitive Services
Having reviewed the evolution of YOLO in general terms, we now focus on YOLOv8. The reason for this decision lies in the advances and improvements introduced in YOLOv8 with respect to its predecessors [82]. In detail, Ultralytics, the developers of YOLOv8, introduced several configuration sizes, each tailored to specific needs, as illustrated in Table I.
This table presents performance metrics for the different sizes of YOLOv8: nano, small, medium, large, and extreme. The metrics include input size (in pixels), mean Average Precision (mAP) over the range 50-95 [83], inference speed on CPU, inference speed with TensorRT, number of parameters (Params), and number of floating-point operations (FLOPs). These variations in model size allow users to choose a YOLOv8 configuration that fits their needs, whether prioritizing speed, accuracy, or a balance between both. For example, the nano version is optimized for speed, with fewer parameters and FLOPs, while the extreme version provides higher accuracy at the cost of higher computational complexity.
TABLE I YOLOV8 MODEL PERFORMANCE METRICS FOR COCO DATASET DETECTION [81]

In addition to the different model configurations, YOLOv8 provides users with a set of hyperparameters [84] that can be adjusted to further optimize model performance for specific use cases. It is often necessary to experiment with different combinations and values to find the optimal configuration. Some of the key hyperparameters in YOLOv8 are as follows:
• Learning Rate. A crucial parameter that determines the step size during the optimization process. Proper adjustment of the learning rate is essential to achieve a balance between fast convergence and avoiding overfitting.
• Input Size. The resolution of the input images. Smaller inputs speed up processing but may reduce the accuracy of the model.
• Epochs. The number of times the entire training dataset is processed during training. Modifying the number of epochs can significantly influence the learning and convergence of the model. In the context of YOLO, a common practice is to terminate the training process after a predefined number of epochs without any observed improvement in learning. This strategy, known as early stopping, serves to mitigate the risk of overfitting.
Our methodology employs a set of evaluation metrics to capture various aspects of the system's functionality and effectiveness. These metrics serve as key benchmarks to evaluate system performance across different models and sequences.
• Time efficiency. Measure the speed at which each YOLOv8 model processes and analyzes image sequences.
• Accuracy. Evaluate the system's ability to accurately identify and locate bib numbers on runners in different scenarios.
• Real-world applicability. Evaluate the adaptability of the YOLOv8 system to various real-world racing environments.
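The early stopping rule mentioned under the Epochs hyperparameter can be illustrated with a short patience-based sketch. This is a generic version of the technique written for this explanation, not the code used by the YOLOv8 trainer; the class name, the `patience` parameter, and the sample metric values are assumptions made for illustration.

```python
from typing import List, Tuple


class EarlyStopping:
    """Stop training once the metric has not improved for `patience` epochs."""

    def __init__(self, patience: int) -> None:
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.patience = patience
        self.best = float("-inf")   # best validation metric seen so far
        self.best_epoch = -1        # epoch at which it was observed
        self.stale = 0              # consecutive epochs without improvement

    def step(self, epoch: int, metric: float) -> bool:
        """Record a validation metric (higher is better); return True to stop."""
        if metric > self.best:
            self.best, self.best_epoch, self.stale = metric, epoch, 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def train_until_stalled(metrics: List[float], patience: int) -> Tuple[int, int]:
    """Replay per-epoch metrics; return (last epoch run, epoch of best metric)."""
    stopper = EarlyStopping(patience)
    for epoch, metric in enumerate(metrics):
        if stopper.step(epoch, metric):
            return epoch, stopper.best_epoch
    return len(metrics) - 1, stopper.best_epoch


# Hypothetical validation scores for seven epochs.
history = [0.30, 0.42, 0.45, 0.44, 0.45, 0.43, 0.50]
print(train_until_stalled(history, patience=3))
```

With a patience of 3, training halts at epoch 5 and the best weights date from epoch 2, so the later improvement at epoch 6 is never reached. This shows why the patience value is itself worth tuning.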

D. Test Cases and Scenarios
The datasets used in this research come from four different sources. Two of these datasets are intended for training the neural networks for bib and digit detection. The third and fourth datasets are intended to evaluate the performance of the models under different conditions and image qualities.
The first dataset employed for bib detection (BDBD) contains photos of runners participating in various races. Each photo captures a runner wearing a race number on their clothing, providing data to train and test the bib detection model (Table II).
The next dataset, SVHN, is used to train the neural network to detect digits within each bib number. It is a dataset designed for developing machine learning algorithms, similar to MNIST [85] but incorporating an order of magnitude more labeled data. Upon closer examination of its specifications, which are described in Table II, it becomes evident that the digits designated for training and testing exhibit a noticeably higher level of complexity than their supplementary counterparts. This observation justifies our decision to use only the training and testing digits, for two main reasons: their higher complexity in terms of discernibility, and the computational overhead that integrating the additional digits would entail, given their substantially larger volume.
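Training a YOLO-style detector on SVHN digits requires each digit annotation to be expressed in the detector's label format. The sketch below converts one pixel-space digit box into a normalized `class x_center y_center width height` line, which is the text label layout commonly used with YOLO models. It assumes, as we understand the SVHN annotation format, that the digit 0 is stored with label 10; the function name and argument layout are hypothetical, and the paper does not describe the authors' own conversion step.

```python
def svhn_box_to_yolo(label: int, left: float, top: float, width: float,
                     height: float, img_w: int, img_h: int) -> str:
    """Convert one SVHN digit box (pixels) to a normalized YOLO label line."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    # Assumption: SVHN encodes the digit 0 as label 10; map it back to class 0.
    cls = 0 if label == 10 else int(label)
    x_center = (left + width / 2) / img_w
    y_center = (top + height / 2) / img_h
    return (f"{cls} {x_center:.6f} {y_center:.6f} "
            f"{width / img_w:.6f} {height / img_h:.6f}")


# A digit "0" occupying a 10x20 box at (20, 10) inside a 100x50 image.
print(svhn_box_to_yolo(10, left=20, top=10, width=10, height=20,
                       img_w=100, img_h=50))
```

The sketch does not clamp boxes that extend past the image border, which a real conversion script would likely need to handle.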
In addition, the SVHN dataset offers flexibility in download formats, with two viable options: first, the entire image corpus in PNG format and, second, a format akin to MNIST where all digits are uniformly resized to a fixed resolution of 32-by-32 pixels.We opted for the first format, the entire image corpus in PNG format, for several reasons, but mostly because by retaining the original image resolution, we allow for more nuanced feature extraction and preserves finer details that could be crucial for our tasks.Finally, we proceed to explain the datasets used to test and validate the trained models, named TGCRBNW and RBNR.This first dataset, TGCRBNW, comprises over 3,000 samples from more than 400 different individuals and provides a diverse set of samples, reflecting a wide range of conditions and scenarios (Table III).Upon further investigation, the provided dataset is divided into 5 folders simulating different race scenarios.
Set 1 collects images of nighttime runs in which the camera is positioned at the end of a slope. This location ensures that the runners are captured in a frontal orientation. Set 2 also presents nighttime races, but with the camera situated along a curve, which complicates runner detection because the runners appear at an oblique angle to the camera's field of view.
Set 3 depicts daytime races that start in shadow and gradually transition to sunlight. The camera placement in this scenario provides clear frontal views of the runners' bibs, facilitating identification. Set 4 collects races under direct sunlight, with runners directly facing the camera, which favors visibility and detection accuracy.
Finally, set 5 depicts races during the twilight hours, starting with ample illumination but ending in dimmer conditions. The camera angle in this scenario is noticeably skewed, capturing the runners in an almost profile orientation as they approach, which poses a challenge for detection algorithms.
The second test dataset is RBNR, as detailed in Table III. This dataset consists of 217 color images, each annotated with ground truth Race Bib Numbers (RBNs). The dataset is divided into three sets, each derived from a different race. The first and second sets exhibit similar compositions of runners within the images, although the latter shows greater variability in brightness and contrast. Lastly, the third set contains images with a substantial number of runners, potentially posing challenges for our neural network's detection capabilities.

E. Picture Analysis
The evolution of AI detection and recognition tasks has been remarkable, but significant challenges persist when these systems are exposed to real-world conditions. Factors such as the brightness and contrast of an image, or the positioning of the camera, are critical and have a real impact on the performance of such systems. Several image enhancement techniques that modify these factors have been developed to increase both the quality of images and the efficiency of image processing-based applications [86], [87], [88], [89], [90].
In low-light environments, such as at night or in dark scenarios, captured images often have low brightness, low contrast, and limited visibility to the human eye. On the other hand, a high level of contrast is usually associated with good visual quality [91], [92], [93].
Figure 6 shows the brightness and contrast values for the images of the different sets used in the validation of our work and described in Table III. For a grayscale image, we represent brightness as the mean luminance value and contrast as the variance of the luminance values. The brightness information can be used to characterize the type of scene. In that figure, sets 1 and 2 from TGCRBNW contain dark images from a night scenario, while the rest contain daytime images. In these two sets, the brightness value is below 70 on a scale from 0 to 255, where 0 indicates pure black and 255 indicates pure white. In addition, sets 1 and 5, also from TGCRBNW, contain the lowest-contrast images.
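The two image statistics defined above can be sketched in a few lines. The following is a minimal illustration, not the code used in the study: it assumes a grayscale image is available as a list of rows of 0-255 values, and the helper `looks_like_night_scene` is a hypothetical name that simply applies the brightness value of 70 mentioned above as a cut-off.

```python
from statistics import fmean, pvariance

# Brightness value quoted in the text for the night-time sets (scale 0-255).
NIGHT_BRIGHTNESS_THRESHOLD = 70


def brightness_and_contrast(gray_image):
    """Return (mean luminance, variance of luminance) for a grayscale image.

    `gray_image` is a list of rows, each a list of pixel values in 0-255.
    """
    pixels = [value for row in gray_image for value in row]
    return fmean(pixels), pvariance(pixels)


def looks_like_night_scene(gray_image, threshold=NIGHT_BRIGHTNESS_THRESHOLD):
    """Hypothetical helper: flag an image whose mean luminance is below the threshold."""
    brightness, _ = brightness_and_contrast(gray_image)
    return brightness < threshold


# Tiny synthetic images, used only to exercise the functions.
dark = [[10, 20], [30, 40]]
bright = [[200, 220], [240, 180]]

print(brightness_and_contrast(dark))
print(looks_like_night_scene(dark), looks_like_night_scene(bright))
```

In practice the pixel values would come from an image library after conversion to grayscale; that step is omitted here to keep the sketch dependency-free.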
This variety of scenarios and lighting conditions allows us to evaluate the robustness of our YOLOv8-based bib detection system. The results of this evaluation are presented in Section IV.

IV. RESULTS
In this section, we present the results of our analysis and experiments.

A. Training Time YOLO
In our study, we begin by examining the training durations necessary for the various versions of YOLOv8 applied to both the BDBD dataset and the SVHN dataset. In this context, the measured time encompasses the entire duration of an execution, from its initiation to completion, including periods of process blocking such as during input/output (I/O) operations or when other processes are active. Table IV reports the training duration of the different versions of YOLOv8 on the two datasets.
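The distinction drawn above, between elapsed time that includes blocking and time spent computing, can be illustrated with the Python standard library. This is a sketch only: `fake_training_step` is a hypothetical stand-in for a training run, and the study does not state which timer was used.

```python
import time


def timed_run(task):
    """Run `task()` and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time, includes blocking
    cpu_start = time.process_time()    # CPU time of this process only
    task()
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return wall_elapsed, cpu_elapsed


def fake_training_step():
    """Hypothetical stand-in for a training run that blocks, e.g. on I/O."""
    time.sleep(0.2)                      # blocked: adds to wall-clock time only
    sum(i * i for i in range(100_000))   # compute: adds to both measurements


wall, cpu = timed_run(fake_training_step)
print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

The wall-clock figure is the one that corresponds to the definition of training time given above, since it keeps counting while the process waits.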
TABLE IV TRAINING TIME IN MINUTES

Analyzing the results, it is evident that the training times do not vary greatly between the different versions of the YOLOv8 model. For instance, for the BDBD model, the nano version requires 158 minutes of training, while the extreme version requires 175 minutes, roughly 11% more. This trend is repeated across versions, with larger models generally requiring more training time.
When we consider the SVHN dataset, the training times increase significantly across all model sizes. For example, while the nano version of the SVHN-trained model took 800 minutes to train, the extreme version extended this duration to 2,640 minutes. This highlights the impact of dataset complexity on training time, with more complex datasets requiring proportionally more training time, regardless of model size.

B. Comparative Performance of YOLO Models During the Training
As the YOLOv8 neural network completes its training, it generates a set of metrics to evaluate its performance on the given dataset and its predictive accuracy. These metrics serve as an initial indicator of how well each model version aligns with the characteristics of the dataset and its intended task. They provide information about the model's precision, recall, and overall performance, which are essential for understanding its suitability for real-world applications.
In the context of this study, these metrics are applied to the validation set of the BDBD dataset and the SVHN dataset. Before delving into the specific results for each dataset, it is essential to provide a general explanation of the metrics used to evaluate the performance of the YOLOv8 models. These metrics include precision, recall, and mean average precision (mAP) at different Intersection over Union (IoU) thresholds [94]. Precision measures the accuracy of positive predictions, recall quantifies the model's ability to detect all relevant instances, and mAP provides an assessment of the model's object detection capabilities across various IoU thresholds.
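To make these definitions concrete, the sketch below computes IoU for a pair of boxes and then precision and recall at a chosen IoU threshold. It is a simplified illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, matching is greedy and ignores confidence scores, and the example boxes are invented. A full mAP computation additionally ranks predictions by confidence and averages over recall levels, which is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)
    false_positives = len(predictions) - true_positives
    false_negatives = len(unmatched)
    precision = true_positives / (true_positives + false_positives) if predictions else 0.0
    recall = true_positives / (true_positives + false_negatives) if ground_truth else 0.0
    return precision, recall


# Invented example: one slightly loose detection and one spurious detection.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(0, 0, 10, 8), (50, 50, 60, 60)]

print(precision_recall(pred_boxes, gt_boxes, iou_threshold=0.5))
print(precision_recall(pred_boxes, gt_boxes, iou_threshold=0.95))
```

The first prediction overlaps its target with an IoU of 0.8, so it counts as correct at a threshold of 0.5 but not at 0.95, which is why scores drop as the threshold is tightened.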
When analyzing the performance of YOLOv8 models on the BDBD dataset, as shown in Table V, it is evident that the effectiveness of the model is similar across size categories. Across all sizes (nano, small, medium, large, and extreme), the model consistently demonstrates high precision, with scores ranging from 91.2% to 93.8%. Recall rates are similarly robust, ranging from 89.5% to 93.2%. In particular, the medium-sized model presents the highest precision and recall among the different sizes. Moreover, the mean average precision (mAP) at IoU thresholds of 0.5 and 0.95 confirms the effectiveness of the model on the BDBD dataset, with mAP50 scores ranging from 95.1% to 96.4% and mAP0.95 scores ranging from 68.7% to 72.1%. The disparity between the mAP scores at these thresholds suggests that, while the model detects bibs reliably at the looser IoU threshold (0.5), it has difficulty maintaining precision at the stricter threshold (0.95), where fewer predicted boxes overlap the ground truth closely enough to count as correct, potentially increasing false negatives and lowering recall.
On the other hand, the evaluation on the SVHN dataset reveals exceptional performance for all versions of the YOLOv8 model, as shown in Table V. The models consistently achieve high precision and recall scores, indicating their effectiveness on digit recognition tasks. Interestingly, the variation in performance across model sizes is minimal, suggesting that smaller models are equally effective in this context. The high mean average precision scores further confirm the models' precision in recognizing digits in the SVHN dataset, underscoring the adaptability of the YOLOv8 framework across datasets and tasks.

C. Application on Real-World Scenarios
In this subsection, we detail the practical application of our trained models in real-world scenarios. After successfully training our models to detect people, race bibs, and numbers, we proceeded to evaluate their performance on two different real-life datasets: TGCRBNW and RBNR. The workflow of our application process is illustrated in Figure 7. First, the whole image is processed, employing segmentation to isolate individual runners. Next, each detected runner is cropped, and another neural network, trained to detect bibs, is used to locate and extract the bibs. Subsequently, the same procedure is repeated to identify and predict the digits present in each bib. Cropping the identified items helps minimize detection errors, such as incorrectly recognizing advertisements or other shapes that resemble numbers, thus improving the precision of our models. To analyze the prediction times for each scenario and dataset, we refer to Table VI. The two computers used in this analysis are equipped with both CPU and GPU resources, one being more powerful than the other. The flexibility of the YOLO tool allowed us to choose between CPU and GPU for inference, facilitating the extraction and comparison of inference speeds on models of different sizes. Across all models and datasets, using the more powerful GPU results in reduced prediction times compared to CPU-only processing.
For instance, on the TGCRBNW dataset, the extreme-sized model shows a noticeable reduction in prediction time when the more powerful GPU is used, with times decreasing from 373 to 160 seconds. Similarly, on the RBNR dataset, the extreme-sized model exhibits a significant reduction in prediction time from 42 to 17 seconds when using the more powerful GPU. When comparing the prediction times between datasets, it is evident that the TGCRBNW dataset typically requires more computational time than the RBNR dataset, likely due to the number of images and their complexity. These findings highlight the importance of resource optimization and indicate that leveraging more powerful hardware can substantially improve the efficiency of our models in real-world applications.
The approximate prediction time per image in each scenario varies greatly with the computer and the device (GPU or CPU) used, as can be seen in Figure 8. On the desktop computer, the gap between GPU and CPU depends strongly on model size: for small models it is barely noticeable, on the order of tenths of a second, whereas for larger models it reaches several seconds. On the laptop, whose computational resources are lower, the times increase significantly with respect to the desktop, and a similar CPU-GPU relationship holds. Here, however, the CPU reaches approximately 22 seconds to perform a detection on a single image.
As for the accuracy of the different versions of YOLOv8 in each scenario, we also evaluated them on the TGCRBNW and RBNR datasets. The accuracy results for each model version in both scenarios are summarized in Figure 9. These results provide insight into the performance of each version of YOLOv8 in different scenarios. It is clear that the accuracy varies significantly depending on the model version and the dataset. For instance, on the TGCRBNW dataset, YOLOv8n shows the highest accuracy in most scenarios, while YOLOv8x consistently exhibits lower accuracy. On the RBNR dataset, however, the performance differs, with YOLOv8m obtaining the highest accuracy in several scenarios.
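The workflow of Figure 7 (runner, then bib, then digits, with a crop between stages) can be sketched as a small cascade. This is an illustrative outline only, not the implementation used in the study: the three detectors are hypothetical callables standing in for the trained networks, the image is a plain 2D list so the example runs without any library, and reading digits from left to right by their horizontal position is an assumption made here.

```python
def crop(image, box):
    """Crop a 2D image (list of rows) to an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def read_bib_numbers(image, detect_runners, detect_bibs, detect_digits):
    """Three-stage cascade: runners -> bibs -> digits.

    Each detector is a hypothetical callable that takes an image and returns
    detections in that image's own coordinates. `detect_digits` returns
    (digit, box) pairs.
    """
    numbers = []
    for runner_box in detect_runners(image):
        runner = crop(image, runner_box)
        for bib_box in detect_bibs(runner):
            bib = crop(runner, bib_box)
            digits = detect_digits(bib)
            # Assumption: digits are read left to right by their x1 coordinate.
            digits = sorted(digits, key=lambda item: item[1][0])
            if digits:
                numbers.append("".join(str(d) for d, _ in digits))
    return numbers


# Stub detectors returning fixed boxes, used only to exercise the cascade.
def stub_runners(image):
    return [(0, 0, 6, 10)]


def stub_bibs(runner_crop):
    return [(1, 2, 5, 5)]


def stub_digits(bib_crop):
    return [(7, (2, 0, 3, 3)), (4, (0, 0, 1, 3)), (1, (1, 0, 2, 3))]


blank_image = [[0] * 10 for _ in range(10)]
result = read_bib_numbers(blank_image, stub_runners, stub_bibs, stub_digits)
print(result)
```

Because each stage only sees the crop produced by the previous one, a number printed on an advertisement outside any bib never reaches the digit detector, which is the error-reduction effect described above.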

V. ANALYSIS AND DISCUSSION
In examining the results obtained, a first observation concerns the relationship between training times and performance metrics. As the model size increases, so does the training time, with the exception of the YOLOv8s model, which converges faster than YOLOv8n. This discrepancy can be attributed to the early stopping mechanism, whereby the YOLOv8s model stops training earlier because it achieves convergence sooner. In addition, the model trained with SVHN data requires much more time to train than the BDBD model due to its larger dataset size, although both models return moderately similar and generally favorable metrics.
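The effect of early stopping on training duration can be illustrated with a generic patience-based rule. This is a sketch under assumptions: the exact stopping criterion and settings used in the study are not given here, and the two validation curves below are invented solely to show how a model that plateaus sooner ends up running fewer epochs.

```python
def epochs_run_with_early_stopping(validation_scores, patience=3):
    """Return the number of epochs actually run.

    `validation_scores` is the per-epoch validation metric (higher is better).
    Training stops once `patience` consecutive epochs fail to beat the best score.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_scores)


# Invented curves: one plateaus early, the other keeps improving.
plateaus_early = [0.50, 0.70, 0.80, 0.80, 0.79, 0.80, 0.80, 0.80]
keeps_improving = [0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.82]

print(epochs_run_with_early_stopping(plateaus_early))
print(epochs_run_with_early_stopping(keeps_improving))
```

Under such a rule, total training time depends on when the validation metric stops improving as well as on the cost per epoch, which is consistent with a larger model occasionally finishing before a smaller one.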
Moreover, performing predictions solely on a CPU is impractical due to long processing times. On the other hand, processing times on GPU vary depending on hardware specifications. Furthermore, the choice of model version significantly influences the processing time, as larger models require more time due to the larger number of layers through which the input must pass. However, it is essential to note that larger models do not always produce better results, as they may require larger and more diverse datasets to learn features effectively.
A more detailed exploration of accuracy metrics for the different test datasets uncovers interesting trends. For example, the difference in performance between the TGCRBNW and RBNR datasets can be attributed to multiple factors, such as camera position (angles, distance to the runners, etc.), general image quality (brightness and contrast level), or image resolution differences (lower quality images in TGCRBNW). In addition, challenges arise in the detection of subjects that are distant or oriented to the side. Moreover, the presence of shadows on the bib area significantly complicates the detection. In particular, if a bib number is imperceptible to the human eye, machine detection becomes equally challenging. In general, images in which a detection occurs take longer to process than those without one, and larger models widen this gap.
Furthermore, the accuracy achieved on a laptop GPU is comparable to that of a desktop CPU; the choice of hardware mainly affects processing time. Consequently, a GPU is preferred because it significantly reduces processing time while maintaining identical metrics.

VI. CONCLUSION
The work presented in this paper is a comparative study of YOLOv8-based bib number detection on several race media datasets, comparing not only the performance resulting from the different dataset features, but also the training performance in terms of time and accuracy, depending on the YOLOv8 model and the hardware used for training and prediction.
Understanding the suitability of different YOLOv8 models under varying circumstances is crucial for real-time applications. Factors such as image resolution (e.g., High Definition or Ultra High Definition), video frame rate (e.g., 30 fps or 60 fps), deployment environment (edge vs. cloud), and the availability of computational resources influence the choice of model.
A set of datasets was carefully selected and analysed, studying their characteristics in terms of size, quality and variability of real-world conditions, with the aim of covering the widest possible range of image conditions for training.
The results demonstrated the significant impact of hardware selection on prediction times in object detection tasks. For instance, on the TGCRBNW dataset, the extreme-sized model shows a significant reduction in prediction time from 1,161 seconds (5.66 seconds per image) to 54 seconds (0.26 seconds per image) when using a more powerful GPU on a desktop computer. Similarly, the RBNR dataset exhibits a reduction from 373 seconds (4.05 seconds per image) to 15 seconds (0.16 seconds per image) for the same model. For the laptop, the difference in prediction time between CPU and GPU on TGCRBNW is most noticeable in the extreme model, decreasing from 3,674 seconds (17.92 seconds per image) to 826 seconds (4.02 seconds per image). For RBNR, the time is also significantly reduced in the extreme model, from 1,440 seconds (15.65 seconds per image) to 239 seconds (2.59 seconds per image).
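As a worked example of the arithmetic behind these figures, the sketch below derives the GPU speedup for each case and the number of images implied by the total and per-image times. It uses only the values quoted above for the extreme-sized model, and it assumes that in each pair the first figure is the CPU time and the second the GPU time on the same machine.

```python
# (CPU total seconds, GPU total seconds) for the extreme-sized model, as quoted in the text.
# Assumption: the first value of each pair is CPU, the second is GPU.
TOTALS = {
    ("desktop", "TGCRBNW"): (1161, 54),
    ("desktop", "RBNR"): (373, 15),
    ("laptop", "TGCRBNW"): (3674, 826),
    ("laptop", "RBNR"): (1440, 239),
}


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


def implied_image_count(total_seconds, per_image_seconds):
    """Number of images implied by a total time and a per-image time."""
    return round(total_seconds / per_image_seconds)


for (machine, dataset), (cpu_s, gpu_s) in TOTALS.items():
    print(f"{machine:8s} {dataset:8s} GPU speedup: {speedup(cpu_s, gpu_s):.1f}x")

# Per-image times quoted for the desktop CPU runs.
print(implied_image_count(1161, 5.66), implied_image_count(373, 4.05))
```

Read this way, the desktop GPU gives a speedup above 20x on both datasets, while the laptop GPU gives roughly 4x to 6x.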
After studying the prediction phase on the two types of hardware architecture, it becomes clear that the time needed for prediction is much higher on the laptop (by a factor of about 3 on average). One of the main reasons for this difference lies in the hardware used to perform the predictions. While CPUs are generally more versatile and efficient at handling a wide variety of tasks, GPUs tend to excel at parallelizable tasks. However, this advantage has its own costs. Unlike CPUs, GPUs tend to be more expensive and consume more power, which can be a limiting factor in resource-constrained environments such as edge devices or virtualization systems. This notable difference in prediction time between GPU and CPU can have important implications for the feasibility of real-time implementations. For example, although smaller models, such as the nano- or medium-sized ones, exhibited higher accuracy than the large and extreme versions, model choice should not be based on accuracy alone. It is also crucial to consider the prediction time associated with each model. Given that prediction with these models takes 1 second or less, they can be considered suitable for real-time (GPU-enabled) or near real-time (CPU-enabled) object detection in a multimedia streaming use case. However, increasing the prediction time by even one second could compromise the system's responsiveness, which is critical for the quality of user experience and for processing a high number of images efficiently.
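One way to relate a per-image prediction time to a video stream is to ask how many frames arrive while a single image is being processed. The sketch below is a back-of-the-envelope illustration, assuming frames are processed one at a time with no batching or pipelining; the function names are hypothetical, and the per-image times are the desktop GPU values quoted earlier for the extreme-sized model.

```python
import math


def sustained_fps(per_image_seconds):
    """Images per second the detector can sustain when run sequentially."""
    return 1.0 / per_image_seconds


def frame_stride(per_image_seconds, stream_fps):
    """Process one frame out of every `stride` so the detector keeps pace with the stream."""
    return max(1, math.ceil(per_image_seconds * stream_fps))


for per_image in (0.16, 0.26, 1.0):
    print(
        f"{per_image:.2f} s/image -> {sustained_fps(per_image):.1f} images/s, "
        f"stride at 30 fps: {frame_stride(per_image, 30)}"
    )
```

Under these assumptions, a prediction time of 1 second means sampling about one frame per second from a 30 fps stream, and each additional second of prediction time doubles the gap between analyzed frames.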
Accuracy varied across scenarios and datasets. On the TGCRBNW dataset, results were generally below 50% across all sets and models, whereas the YOLOv8m model achieved nearly 80% accuracy on the RBNR dataset in the best scenario. It is important to note that these values remained unchanged regardless of the hardware configuration selected (GPU versus CPU, or laptop versus desktop).
Our study also revealed several key insights that could have been incorporated into our methodology, such as image augmentation techniques [95], dataset division based on lighting conditions, and the integration of explainable AI to improve model robustness and interpretability [96]. Ensuring image quality throughout the audiovisual transmission chain is essential to guarantee the correct operation of our system. In future work, we plan to analyze how brightness, contrast and sharpness affect the accuracy of the YOLOv8-based bib detection system, and to apply image enhancement techniques to improve detection accuracy.
The application of the trained neural networks can be extended beyond their initial tasks. For instance, the neural network trained specifically for bib detection can be used in other similar events, such as marathons or cycling races, to identify participants by their bib numbers. Additionally, this neural network could be adapted for detecting other types of identifiers in various contexts, such as vehicle identification in toll systems or product identification on production lines.
Similarly, the neural network trained for number detection can be valuable in diverse scenarios, such as OCR in printed or digital documents, vehicle license plate recognition, or barcode reading. When these two neural networks are combined (bib and number detection), the range of applications expands even further. For example, in sporting events, the network trained for bib detection could work alongside the network trained for number detection to identify participants and automatically record their times. In commercial environments, the combination of both networks could facilitate automated inventory tracking by reading barcodes and product identification numbers.
Future research can focus on developing guidelines or frameworks for selecting the most appropriate YOLOv8 model based on specific application requirements and deployment constraints. Such frameworks could make it possible to dynamically adapt and optimize object detection algorithms based on contextual factors such as network conditions, user preferences, and environmental constraints.
These emerging trends and opportunities may enable researchers to contribute to the advancement of the object detection field and its integration in different areas, such as smart cities, autonomous vehicles, healthcare systems and video surveillance. This interdisciplinary nature will offer the possibility to exploit the potential of YOLOv8 and similar algorithms, driving innovation and addressing challenges in the audiovisual sector.
(b)), where low light, shadows, and artificial illumination pose significant challenges. Finally, we investigated variations in crowd density (Figure 5(c)), evaluating how well each YOLOv8 model adapts to handle congested environments and complex interactions between objects.

• Batch Size. The number of training samples used in an iteration. Adjusting the batch size can impact the convergence speed and memory requirements during training.
describes the details of the BDBD dataset, indicating 440 images for training, 30 for testing, and 130 for validation. To optimize the performance of our model,

TABLE III SPECIFICATIONS OF THE TGCRBNW AND RBNR DATASETS

TABLE V METRICS OBTAINED ON THE BDBD AND SVHN TEST SET

TABLE VI PREDICTION TIME IN SECONDS
Fig. 8. Prediction time per image.