Learning end-to-end respiratory rate prediction of dairy cows from RGB videos

Respiratory rate (RR) is an important indicator of the health and welfare status of dairy cows. In recent years, progress has been made in monitoring the RR of dairy cows using video data and learning methods. However, existing approaches often involve multiple processing modules, such as region of interest (ROI) detection and tracking, which can introduce errors that propagate through successive steps. The objective of this study was to develop an end-to-end computer vision method to predict the RR of dairy cows continuously and automatically. The method leverages the capabilities of a state-of-the-art Transformer model, VideoMAE, which divides video frames into patches as input tokens, enabling the automated selection and featurization of relevant regions, such as a cow's abdomen, for predicting RR. The original encoder of VideoMAE was retained, and a classification head was added on top of it. Further, the weights of the first 11 layers of the pre-trained model were kept, while the weights of the final layer and classifier were fine-tuned using video data collected in a tie-stall barn from 6 dairy cows. Respiratory rates measured using a respiratory belt for individual cows served as the ground truth (GT). The developed model was evaluated using multiple metrics, yielding a mean absolute error (MAE) of 2.58 breaths per minute (bpm), a root mean squared error (RMSE) of 3.52 bpm, a root mean squared prediction error (RMSPE; as a proportion of the observed mean) of 15.03%, and a Pearson correlation (r) of 0.86. Compared with a conventional method involving multiple processing modules, the end-to-end approach performed better in terms of MAE, RMSE, and RMSPE. These results suggest the potential of implementing the developed computer vision method as an end-to-end solution for monitoring the RR of dairy cows automatically in a tie-stall setting. Future research on integrating this method with other behavioral detection and animal identification algorithms for animal monitoring in a free-stall dairy barn would be beneficial for broader application.


INTRODUCTION
Health and welfare of dairy cows are essential from food security and ethical perspectives, as milk production from dairy cows is a crucial part of meeting global nutritional demands and offers economic livelihood to millions (Tremetsberger & Winckler, 2015). Among various physiological indicators, respiratory rate (RR) serves as a critical parameter of a cow's health and welfare status. For example, increased RR is a primary sub-clinical symptom of respiratory diseases, which, as reported by the United States Department of Agriculture National Animal Health Monitoring Service, led to a mortality rate of 2.3% in calves aged between birth and 16 weeks (Ferraro et al., 2021; Gorden & Plummer, 2010). In addition, RR plays a key role in monitoring heat stress of animals, as it reflects their physiological response to heat load, which is not immediately aligned with environmental conditions due to cow-based factors, such as breed, age, and production level (Atkins et al., 2018). Unlike indices based solely on environmental parameters, RR provides a direct measure of an animal's response, capturing the lag in physiological adaptation to heat stress and offering a more accurate reflection of an individual cow's experience (Atkins et al., 2018). Therefore, keeping track of RR offers an opportunity for the detection of diseases, heat stress, and other adverse conditions, allowing for timely interventions (Jorquera-Chavez et al., 2019), which is important in modern dairy farming practices.
Traditional methods for monitoring RR in dairy cows involve manual observation of flank movements. This approach is labor-intensive, and its accuracy can be influenced by variability between observers, making it less feasible, especially for large-scale operations (Dißmann et al., 2022). With advancements in precision technologies, there has been a shift toward using contact-based approaches to predict and monitor RR (Handa & Peschel, 2022). While contact-based technologies such as accelerometers, pressure sensors, and thermistors can partially replace human observation, they also come with challenges, such as discomfort for animals, alteration of animals' natural behavior, and maintenance issues like battery replacement (Handa & Peschel, 2022). Recently, non-contact monitoring solutions have gained increasing attention thanks to their ability to address some of these concerns. For instance, Yan et al. (2024) utilized environmental variables such as ambient temperature and relative humidity to predict RR. While this method is effective, it does not allow for the assessment of RR on an individual basis. To date, improvements in camera and computer vision (CV) technologies have enabled the monitoring of respiratory rates at the individual level via video footage. Several studies have attempted to predict the RR of dairy cows using videos, predominantly utilizing infrared thermography (IRT), although there is a trend toward using RGB (red, green, blue) videos. For techniques employing IRT, respiratory patterns were identified by detecting temperature fluctuations between the region of interest (ROI), typically around the nostrils, and the ambient environment (Jorquera-Chavez et al., 2019; Lowe et al., 2019; Zhao et al., 2023). Similarly, by identifying an ROI in RGB videos, Wu et al. (2020) and Shu et al. (2024) computed RR based on changes in pixel intensity related to the cow's breathing movements. Although these methods have been shown to be effective, they depend on multiple processing modules that involve the initial selection and tracking of the ROI, followed by the estimation of RR from the respective ROI. The sequential nature of these methods introduces potential complications, because errors can arise at each step and subsequently affect the accuracy of RR predictions. This limitation is why such methods are typically effective only for short-term RR predictions, such as those spanning 30 s.
Additionally, methods with multiple processing modules might not be convenient for monitoring multiple animals simultaneously in commercial settings, such as a free-stall dairy barn, given the increased complexity of detecting and tracking multiple ROIs. In a free-stall barn, cows may obstruct other cows from the camera's viewpoint, leading to partial or complete visual obstruction of the ROI. These overlaps can result in losing the target ROI or switching the ROI to other cows. The task of monitoring RR is to consistently estimate it over specified time intervals (e.g., every minute). The RR for each minute can then serve as the label for the video segment of that minute, and the task becomes accurately assigning labels to their corresponding video segments. Essentially, this represents the categorization of video clips based on their contents, which aligns with the well-established challenge of video classification in the CV community. Therefore, the RR monitoring task can be restructured as a video classification problem. This reformulation aims to simplify the process by eliminating intermediate steps, reducing complexity, and minimizing potential technical challenges. The objective of this study was to develop an end-to-end method for monitoring RR from RGB video data. We hypothesized that this method would outperform conventional methods that involve multiple processing modules. To achieve this, this study employed a state-of-the-art methodology known as the Transformer (Vaswani et al., 2017).

MATERIALS AND METHODS

Animal experiment and data collection
To perform the study, we collected an ample amount of video and sensor data from dairy cows in a tie-stall setting. The data include high-quality video data for the computer vision model and ground-truth RR measurements to train and test the model.
Data were collected during an animal experiment conducted from October 2022 to June 2023 at AgroVet-Strickhof, ETH Zurich, Switzerland, which investigated nutritional strategies to decrease methane emission in dairy cows (Ma et al., 2023). All procedures were approved by the Cantonal Veterinary Office of Zürich, Switzerland (license no. ZH207/2021). The current study used data from 6 dairy cows (1 Holstein Friesian and 5 Brown Swiss) across 3 periods (period 1: November 2022; period 2: May 2023; period 3: June 2023). Each period involved 2 cows, and each cow was individually kept in her designated stall in a tie-stall barn. For each cow, two 2D cameras (DAHUA, model DH-SD1A404XB-GNR, 2.8-12 mm lens) were installed, one on the side and one above, to capture abdominal movements related to respiration. The cameras were linked to a DAHUA network video recorder (NVR, model DHI-NVR4216-16P-4KS2/L). The NVR was responsible for synchronizing the cameras, and it automatically recorded and stored videos on its internal hard drives. The positions of the cameras and the infrastructure of the data collection system are illustrated in Figure 1. Videos were recorded continuously for 24 h in each period at a resolution of 2560 × 1440 pixels and 25 fps. The lighting in the barn generally followed natural daylight. No supplemental lighting was provided at night, meaning that videos from these intervals were captured in grayscale. Following the recording, videos were screened through the NVR, and relevant clips were selected and saved for further analysis.
During the data collection periods, each cow wore an Embla XactTrace respiration belt (product reference: Single Use Cut-To-Fit Universal Respiratory Inductance Plethysmography Belt) and a recording device (Embletta® MPR PG) to collect the RR measurements serving as the ground truth (GT) for the CV method. The elastic belt measures RR based on respiratory inductance plethysmography (RIP); it is embedded with wire coils placed around the chest and abdomen. The expansion and contraction of the chest and abdomen cause the coils to stretch and relax, leading to changes in their inductance. These inductance changes are detected by connected electronics, which then produce an electrical signal corresponding to the magnitude of change. Respiratory rate can be calculated by analyzing the frequency and pattern of these signals. RIP-based devices have been demonstrated to work well in measuring RR for humans (Ranta et al., 2019). Generally, the respiratory patterns around the abdomen of cows are clearer compared with humans; thus, this study used the RR measurements collected by the respiratory belt as GT. The position of the belt was monitored every 30 min during data collection and adjusted if needed.
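As an illustration of the last step, the sketch below estimates RR from a belt signal by peak counting. This is a minimal example, assuming the belt output is exported as a 1D array sampled at a known rate; the peak-detection parameters are illustrative and are not the device's proprietary method.

```python
# Minimal sketch: estimating respiratory rate (RR) from a RIP belt signal.
# Assumes the belt output is a 1-D array `signal` sampled at `fs` Hz;
# the peak-detection settings are illustrative, not the device's method.
import numpy as np
from scipy.signal import find_peaks

def rr_from_belt(signal: np.ndarray, fs: float) -> float:
    """Count inhalation peaks over the window and convert to breaths/min."""
    # Each breath produces one inductance peak; enforce a minimum spacing
    # of 0.5 s between peaks (i.e., RR <= 120 bpm, plausible under heat stress).
    peaks, _ = find_peaks(signal, distance=int(fs * 0.5))
    duration_min = len(signal) / fs / 60.0
    return len(peaks) / duration_min
```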

Data selection
To facilitate model training and analysis, we selected representative video clips and scenarios from the large-scale data described above. The extraction of RR from RGB videos relies on the computation of RGB values in the image, which can fluctuate significantly under varying lighting conditions. Videos captured during daylight are in color, while those taken at night are in grayscale.
To ensure a comprehensive analysis, this study selected videos obtained during both day- and nighttime, and from both the top and side views (Figure 2). The essence of RR prediction from video relies on identifying the periodic patterns associated with RR around the abdominal region. However, when a cow is moving, these patterns become less recognizable. Figure 3 illustrates the raw readings obtained from the respiratory belt over 1-min intervals for the same cow in moving and resting states, respectively, indicating that, based on our data, the respiratory patterns are clear and periodic when the cow is resting. Therefore, to build a model with more precise GT data, the current study selected videos wherein cows were resting with minimal movement. Minor disturbances such as eructation, which occurs approximately once per minute in dairy cows, are inevitable and were deemed acceptable in the video selection process. Additionally, motion artifacts from the background, such as movements of other cows, were also accepted, as they can help with training a more robust model. Based on these selection criteria, a total of 18 video segments (18.18 ± 13.73 min; mean ± SD) were identified to train and test the model. Further details regarding the samples can be found in Table 1.

End-to-end RR prediction from videos
To briefly introduce the concept of the Transformer, we adopt a naming convention similar to that in Vaswani et al. (2017). Let v denote the video, and patch_{i,j,t} represent a 3D patch from v with width w, height h, and depth d in time. Each frame in the video sequence is divided into fixed-size image patches (w × h), and a sequential collection of these patches across a specified time duration (d) from multiple frames is aggregated to form 3D video patches. Each 3D patch is converted into a 1D vector through the flatten function as described in Equation (1), followed by linear embedding into a higher-dimensional space apt for model processing (Equation (2)). Flattening reduces each patch to the 1D input required by the linear embedding layer, ensures uniform patch size for efficient batch processing, and maintains the integrity of the information by reconfiguring the shape without discarding any original values, thus allowing the model to leverage the full information contained within each patch in subsequent processing phases.

$$\mathrm{patch}_{i,j,t} = \mathrm{flatten}\left(v\left[i:i+w,\; j:j+h,\; t:t+d\right]\right) \tag{1}$$

$$\mathrm{token}_{i,j,t} = W_e \cdot \mathrm{patch}_{i,j,t} + b_e \tag{2}$$

where $W_e$ and $b_e$ are the embedding weights and biases. Note that spatial and temporal positional embeddings are also added to the tokens to retain positional information.
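To make Equations (1) and (2) concrete, the sketch below cuts a dummy video tensor into 3D patches, flattens each, and embeds it into a token. The sizes (w = h = 16, d = 2, 768-dimensional embedding) follow common ViT/VideoMAE defaults and are assumptions, not necessarily the exact values used here.

```python
# Sketch of Equations (1)-(2): 3-D patching, flattening, linear embedding.
import torch
import torch.nn as nn

T, H, W, C = 16, 224, 224, 3   # frames, height, width, channels
w = h = 16                     # spatial patch size (assumption)
d = 2                          # temporal patch depth (assumption)
embed_dim = 768

video = torch.randn(T, H, W, C)

# Equation (1): gather all 3-D patches and flatten each to a 1-D vector.
patches = []
for t in range(0, T, d):
    for i in range(0, H, h):
        for j in range(0, W, w):
            patch = video[t:t + d, i:i + h, j:j + w, :]
            patches.append(patch.reshape(-1))   # (d*h*w*C,)
patches = torch.stack(patches)                  # (num_patches, d*h*w*C)

# Equation (2): token = W_e * patch + b_e, realized as a linear layer.
embed = nn.Linear(d * h * w * C, embed_dim)
tokens = embed(patches)                         # (num_patches, embed_dim)
```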
The input is projected onto the W_Q, W_K, and W_V weight matrices to compute the queries (Q), keys (K), and values (V), as in Equations (3) to (5). Queries act as probes within the data set, searching for relevant pieces of information. Keys function as identifiers, enabling queries to evaluate which parts of the data are pertinent by comparison against them. Values hold the actual information that queries seek, with their importance gauged by the degree of match between the queries and their respective keys, ensuring the model emphasizes the most pertinent details for processing. The process unfolds as follows: given a piece of input data, a query q is formulated. This q is then compared against all keys to assess their relevance. The closer a key k aligns with q, the greater the attention weight assigned to its associated value v. These attention weights are then used to combine the values, yielding an attention output that accentuates the information most applicable in response to q (Equation (6)). Figure 4 (a) illustrates a single-head self-attention block.

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V \tag{3--5}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{6}$$

where $X$ denotes the input tokens and $d_k$ corresponds to the dimension of $Q$ and $K$.
In single-head self-attention, the input is processed once to generate the Q, K, and V matrices. However, for complex vision tasks such as video classification, multi-head attention is usually employed, which replicates the single-head mechanism multiple times in parallel, each with a different set of weight matrices. These parallel "heads" allow the model to focus on different aspects or features of the input simultaneously. After processing through all heads, their respective outputs are concatenated and passed through a final linear layer to produce the multi-head attention output. The construction of multi-head attention is illustrated in Figure 4 (b).
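The sketch below implements Equation (6) and the multi-head construction directly; the dimensions (196 tokens, a 768-dimensional model, 12 heads) are illustrative assumptions. In practice, a fused implementation such as PyTorch's nn.MultiheadAttention would be used instead.

```python
# Sketch of scaled dot-product attention (Eq. 6) and its multi-head form.
import torch
import torch.nn.functional as F

def single_head_attention(x, W_q, W_k, W_v):
    """Self-attention over a token sequence x of shape (n_tokens, dim)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (n_tokens, n_tokens)
    return F.softmax(scores, dim=-1) @ V

n_tokens, dim, n_heads = 196, 768, 12   # illustrative dimensions
head_dim = dim // n_heads
x = torch.randn(n_tokens, dim)

# Multi-head: run the mechanism in parallel with per-head weights, then
# concatenate the head outputs and mix them with a final linear layer.
heads = [
    single_head_attention(
        x,
        torch.randn(dim, head_dim),   # W_Q for this head
        torch.randn(dim, head_dim),   # W_K
        torch.randn(dim, head_dim),   # W_V
    )
    for _ in range(n_heads)
]
W_out = torch.randn(dim, dim)
out = torch.cat(heads, dim=-1) @ W_out            # (n_tokens, dim)
```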
In this study, to adapt the Transformer model (Vaswani et al., 2017) for RR prediction, videos were initially segmented into patches and transformed into a series of vectors through linear embedding. Positional encodings were then added to retain spatial context within the sequence. Following this, the model's self-attention mechanism computed attention scores, allowing it to prioritize video patches based on their relevance to each other. This was a key advancement for RR prediction, as it permitted the model to dynamically focus on patches indicative of respiratory patterns, such as the movement of the cow's flank. The focused attention on these significant video regions enabled the Transformer model to parse out the temporal and spatial features critical for accurate RR prediction. The model subsequently employed these features to estimate RR values using a classification head at the end of the Transformer architecture, translating the feature data into quantified RR values.
The training of self-attention-based Transformer models generally involves 2 phases: first, pre-training is performed on a large-scale data set (sometimes a combination of several available data sets); later, the pre-trained weights are adapted to downstream tasks using small-scale data sets. Examples of downstream tasks include video classification, object detection, question answering, image generation, and so on. The pre-training phase demands substantial computational resources, and constructing such large data sets often requires significant effort, such as rigorous labeling. Given these resource-intensive prerequisites, this study followed the second phase and leveraged a pre-trained model, fine-tuning it further using the RR monitoring data set.
The pre-trained model utilized in this study was VideoMAE from Tong et al. (2022). VideoMAE was originally developed for video classification, showing particularly good performance when trained on small-scale data sets. The original VideoMAE model adopted a high masking ratio, which reduced computational complexity while maintaining favorable performance. Further details regarding the model's architecture can be found in Tong et al. (2022). Based on this architecture, several pre-trained models were made available to the public (https://huggingface.co/MCG-NJU). The one used in this study is the base model, which was trained on Kinetics-400 (Kay et al., 2017) over 1,600 epochs (https://huggingface.co/MCG-NJU/VideoMAE-base).
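For readers who want to reproduce this starting point, the snippet below loads the public checkpoint with the Hugging Face transformers library. It is a sketch: the lowercase checkpoint identifier and the size of the label space are assumptions on our part.

```python
# Sketch: load the pre-trained VideoMAE encoder with a fresh
# classification head via Hugging Face transformers. This class keeps
# only the encoder (the MAE decoder is not part of it); the head
# weights are newly initialized and must be fine-tuned.
from transformers import VideoMAEForVideoClassification

model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base",   # public base checkpoint
    num_labels=60,             # hypothetical label space: RR classes in bpm
)
```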
We kept the encoder part of the model, which includes 12 hidden layers, each with 12 attention heads. The decoder part of the raw model architecture was discarded. A classification head was added on top of the encoder, consisting of a fully connected linear layer mapping the encoder's output features to the classification labels. The details of the architecture are illustrated in Figure 5. The selected videos were segmented into shorter clips, each lasting 1 min, with an overlap of 30 s between consecutive clips. A total of 538 clips were derived from 17 videos, resulting in a total duration of 538 min and a total of 807,000 frames for training and testing the model. Data from 4 cows were allocated for training, while data from 1 cow each were selected for validation and testing, respectively. This process was repeated 6 times, ensuring that data from each cow were tested once independently. Note that the longest video in the data set, with a duration of 52 min and 46 s, was excluded from the training, validation, and testing sets. Instead, this video was segmented into 104 video clips, which were used to assess the model's performance in continuous monitoring scenarios, thus providing a more robust evaluation. For each video clip, 64 frames were uniformly extracted, and each frame was resized to a resolution of 224 × 224 pixels. To fine-tune the model efficiently, the weights of the first 11 layers were frozen, meaning that only the weights of the last layer and the classification head were retrained. The fine-tuning was conducted on 2 GPUs (NVIDIA® GeForce RTX 4090) with 24 GB of memory each, and each training run took about 4 h. The details of the training parameters and configurations are displayed in Table 2.
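A minimal sketch of this freezing scheme is shown below, continuing from the loading snippet above; the attribute names follow the Hugging Face VideoMAE implementation and may differ across library versions.

```python
# Sketch: freeze the embeddings and the first 11 encoder layers; only
# the 12th layer and the classification head remain trainable, as
# described above. Attribute names are assumptions tied to the Hugging
# Face implementation of VideoMAEForVideoClassification.
for param in model.videomae.embeddings.parameters():
    param.requires_grad = False
for layer in model.videomae.encoder.layer[:11]:
    for param in layer.parameters():
        param.requires_grad = False

# Sanity check: the names printed here should belong to the last
# encoder layer (index 11) and the classifier head.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```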

Model comparison with a conventional multi-module processing method
In this study, we conducted a comprehensive comparison between our end-to-end approach and a conventional multi-module processing method frequently used in previous research (Khanam et al., 2019; Massaroni et al., 2018). The workflow of the method is illustrated in Figure 6. This conventional approach involves manually selecting the ROI first, followed by predicting the RR from the selected ROI. Specifically, the abdomen area was chosen as the ROI, as depicted in Figure 7, and the prediction of RR from this area was carried out using the fast Fourier transform (FFT) technique.
Unlike the end-to-end method, the conventional method does not require a training phase; thus, all 18 videos were directly processed with this approach. Each video was segmented into shorter clips using a sliding window. These windows were 3 min in length, with a 1-min overlap between successive windows. This window duration was chosen to balance the trade-off between time and frequency resolution in the FFT analysis: a longer window improves frequency resolution, facilitating easier identification of the RR frequency, but at the expense of time resolution. Regarding computational requirements, the conventional method is not computationally intensive; therefore, the analysis was performed on a MacBook Pro equipped with 16 GB of memory. Note that the computation time for the conventional method was not recorded, because the time taken to manually select the ROI can differ significantly among individuals.
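The sketch below illustrates the FFT step under stated assumptions: the per-frame mean pixel intensity of the manually selected ROI serves as the respiratory signal, and the dominant frequency within a plausible respiratory band (0.1-2 Hz, i.e., 6-120 bpm; our assumption) is converted to bpm.

```python
# Sketch of the FFT-based RR estimate in the conventional pipeline.
import numpy as np

def rr_from_roi(roi_means: np.ndarray, fps: float = 25.0) -> float:
    """roi_means: per-frame mean ROI intensity over one 3-min window."""
    signal = roi_means - roi_means.mean()        # remove the DC component
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 0.1) & (freqs <= 2.0)       # assumed respiratory band
    dominant = freqs[band][np.argmax(spectrum[band])]
    return dominant * 60.0                       # Hz -> breaths per minute
```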

Model evaluation
Following common practice, the mean absolute error (MAE), root mean squared error (RMSE), and root mean squared prediction error (RMSPE; as a proportion of the observed RR mean) were employed in this study to assess the model's performance, as defined in Equations (7), (8), and (9), respectively. Additionally, correlation and agreement analyses were conducted to further validate the model's accuracy and reliability.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|rr_{gt,i} - rr_{CV,i}\right| \tag{7}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(rr_{gt,i} - rr_{CV,i}\right)^{2}} \tag{8}$$

$$\mathrm{RMSPE} = \frac{\mathrm{RMSE}}{\overline{rr}_{gt}} \times 100 \tag{9}$$

where $rr_{gt}$ and $rr_{CV}$ represent the RR values from the GT and the developed CV method, respectively, with $n$ being the total number of RR values.
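A direct implementation of Equations (7) to (9), plus the Pearson correlation used in the correlation analysis, might look as follows; it is a sketch over paired GT and CV arrays.

```python
# Sketch: evaluation metrics from paired GT and CV predictions.
import numpy as np
from scipy.stats import pearsonr

def evaluate(rr_gt: np.ndarray, rr_cv: np.ndarray) -> dict:
    err = rr_gt - rr_cv
    mae = np.mean(np.abs(err))                    # Eq. (7)
    rmse = np.sqrt(np.mean(err ** 2))             # Eq. (8)
    rmspe = rmse / np.mean(rr_gt) * 100.0         # Eq. (9), % of GT mean
    r, _ = pearsonr(rr_gt, rr_cv)                 # Pearson correlation
    return {"MAE": mae, "RMSE": rmse, "RMSPE_%": rmspe, "r": r}
```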

RESULTS AND DISCUSSION
The attention mechanism within Transformers, as utilized in the current study, enables the VideoMAE model to assign varying weights to different patches of an image. It can focus on critical areas, such as the abdomen, and recognize their importance for respiration. By giving priority to these regions, the model was able to effectively learn and extract breathing patterns from videos, leading to the accurate RR predictions described below.

Results from end-to-end method
The model evaluation outputs from the end-to-end method, averaged over the 6 test sets, are presented in Table 3. The MAE, RMSE, RMSPE, and Pearson correlation coefficient were 2.58 bpm, 3.52 bpm, 15.03%, and 0.86, respectively. The Pearson correlation coefficient and the regression from the highest-performing test set are displayed in Figure 8. To assess the model's performance during different periods, RR predictions were compared with the GT for video clips recorded during the day and at night. Note that the results shown in Table 4 were calculated as averages over the 6 test sets. The Pearson correlation coefficient derived from the daytime clips was 0.88, higher than the 0.84 obtained at nighttime. However, the MAE, RMSE, and RMSPE from the nighttime clips were lower, at 2.13 bpm, 3.26 bpm, and 13.79%, respectively, compared with 2.59 bpm, 3.68 bpm, and 14.64% for the daytime clips. These findings suggest the model performed better at nighttime. This may be due to fewer movement artifacts at night, such as disturbances from other cows, which are reduced because cows are more likely to rest and lie down during these hours.
Additionally, Figure 9 displays the evaluation results obtained from the long video: the RR captured by the GT against the predictions made by the CV method, illustrating the continuous monitoring capabilities of the CV method. Table 5 lists the specific outcomes: an MAE of 2.49 bpm, an RMSE of 2.91 bpm, an RMSPE of 8.78%, and a Pearson correlation coefficient of 0.74. These results demonstrate the method's applicability to continuous RR monitoring in a different scenario, indicating its potential application in animal health monitoring to improve farm management.

Results comparison with the conventional method
Our approach employed an end-to-end computer vision model tailored to predict RR directly from RGB videos, eliminating the need for manual ROI selection. This method simplifies the process compared with traditional techniques, which typically require multiple processing steps that can lead to the accumulation of errors. Table 6 shows the results of the conventional method compared with the GT. In terms of MAE, RMSE, and RMSPE, the conventional method underperformed relative to the results obtained from the end-to-end method, as shown in Table 4. However, it exhibited a higher correlation coefficient. In addition, the standard deviation of the prediction results from the end-to-end method was 6.95, less than that of the conventional method, which had a standard deviation of 10.01. This increased variability could come from the manual selection of the ROI, a process lacking standardization and potentially introducing bias. Figure 10 presents both the regression and Bland-Altman plots in comparison with the GT, further illustrating the conventional method's inferior performance relative to the end-to-end approach. Considering both performance metrics and practical applicability, the findings indicate that the end-to-end method for RR prediction enhances both the precision and dependability of RR measurements and their practical use in actual farming scenarios. This advancement is particularly pertinent in the current shift toward non-invasive and automated monitoring solutions within precision livestock farming. Such progress is in step with the broader objective of improving animal health and welfare via technological innovation. Utilizing readily accessible video data to evaluate a vital physiological indicator such as RR provides a meaningful, albeit modest, advancement in the field. This method offers a scalable and efficient means of monitoring livestock health and seeks to augment current practices by prioritizing both accessibility and precision.

Implications and future perspectives
In the current study, videos were selected wherein cows were resting with minimal movement. This selection criterion was established primarily because videos with minimal motion artifacts had more reliable GT. Although videos of good GT quality with slight cow movements exist, the current study decided not to use them. This decision was based on findings from previous studies (e.g., Lowe et al., 2019; Wu et al., 2020; Zhao et al., 2023), which suggested that movement artifacts could affect the model's performance, and it aligns with the common practice of traditional approaches with multiple processing modules. To validate whether predicting RR using videos with more animal movement yields comparable results, model development and evaluation based on more comprehensive data sets are required. Training with such an expanded data set demands more computational capability, especially when employing architectures like the Transformer, which has a substantial number of parameters (Khan et al., 2022).
Regarding the collection of GT data for cows' RR, manually counting flank movements in videos has been the predominant method used in prior studies (Jorquera-Chavez et al., 2019; Shu et al., 2024; Wu et al., 2020; Yan et al., 2024; Zhao et al., 2023). These studies have focused exclusively on cows without movement. To the best of our knowledge, no existing study has managed to collect RR references while cows are consistently moving, due to the above-mentioned challenges. However, because dairy cows generally spend approximately 12 h per day lying down (Tucker et al., 2021), it is plausible to acquire substantial RR data during these significant periods of rest. In addition, abrupt changes in RR, potentially caused by acute stressors such as injuries, are often accompanied by noticeable behavioral changes. In these cases, behavioral measurements might offer a more immediate and precise indication of the cow's condition than RR. Predicting RR during resting phases, particularly when movements are minimal, can therefore be considered a more readily implementable technology and may be sufficient for routine monitoring of farm animals.
The focus on behavior detection in dairy cows, such as identifying lying behavior, has grown in recent studies due to its significance in assessing heat stress and mastitis risks (Heinicke et al., 2021; McDonagh et al., 2021; Porto et al., 2013). Adopting a methodology akin to ours might be beneficial for identifying lying behavior as a preliminary step to determining RR. This advancement could facilitate the development of an automated RR prediction system designed for monitoring multiple subjects in free-stall barn environments. Enhancements to this approach, including the integration of cow detection and individual identification, would enable more personalized monitoring of well-being, utilizing RR as a key health indicator for each cow.
The end-to-end RR monitoring method, whether used for research purposes to compare different algorithms or for on-farm monitoring, is currently challenged by the computational power required. This is similar to other computer vision-based methods developed for other purposes. Given the rapid advancements in hardware, it is becoming increasingly important to know how to implement and train robust large models on advanced hardware platforms for agricultural applications. Once the developed method is capable of analyzing larger data sets, its potential can be fully elucidated, leading to true value creation by improving our knowledge and implementing RR monitoring for animal welfare assessment and disease prediction.

CONCLUSIONS
The current study presents an end-to-end respiratory rate monitoring method for dairy cows. By reformulating RR extraction as a video classification problem, a state-of-the-art computer vision model, VideoMAE, was adapted and fine-tuned to predict RR directly from RGB videos.

Figure 1 .
Figure 1. Illustration of the positions of the cameras (annotated by the red circle): (a) side view; (b) top view; (c) the infrastructure of the data collection pipeline.
Figure 2. Frame examples illustrating the 4 scenarios: (a) side view during the day; (b) top view during the day; (c) side view at night; (d) top view at night.
Figure 3. Respiratory rate patterns obtained from the respiratory belt in 1-min intervals when: (a) a cow is significantly moving; (b) the cow is resting. The time between every 2 consecutive peaks is considered one respiratory cycle.

Figure 4 .
Figure 4. The structure of: (a) a single-head self-attention block; (b) multi-head attention, which consists of multiple attention layers running in parallel.

Figure 5 .
Figure 5. Model overview. The video was split into fixed-size patches, which were then linearly embedded and fed to the VideoMAE encoder. To perform classification, an extra learnable classification head was added on top of the encoder.

Figure 6 .
Figure 6. Workflow of the conventional method for RR prediction based on multiple processing modules.

Figure 7 .
Figure 7. Illustration of the manual selection of the region of interest (ROI) in the conventional method involving multiple processing modules.

Figure 8 .
Figure 8. (a) Regression analysis of the relationship between respiratory rate (RR) obtained from ground truth (GT) and the predictions using the end-to-end method (95% CI: 95% confidence interval; bpm: breaths per minute); (b) Bland-Altman plot showing the agreement between RR obtained from GT and the predictions using the end-to-end method. The 3 dashed lines represent the 95% confidence intervals and the mean of the difference between GT and CV.

Figure 9 .
Figure 9. Illustration of the evaluation results on the long video: respiratory rate (RR) measured by the ground truth (GT) against the predictions derived from the computer vision (CV) method.

Figure 10. (a) Regression analysis of the relationship between respiratory rate (RR) obtained from ground truth (GT) and the predictions using the conventional method (95% CI: 95% confidence interval; bpm: breaths per minute); (b) Bland-Altman plot showing the agreement between RR obtained from GT and the predictions using the conventional method. The 3 dashed lines represent the 95% confidence intervals and the mean of the difference between GT and CV.

Table 1 .
Summary of the selected samples

Table 2 .
The configurations for the training parameters

Table 3 .
Comparison of respiratory rate predicted by computer vision (CV) with the ground truth (GT) on the test set (averaged over 6 test sets). 1 bpm: breaths per minute.

Table 4 .
Comparison of model performance during the day and night on the test set (averaged over 6 test sets)

Table 5 .
Comparison of respiratory rate predicted by computer vision (CV) with the ground truth (GT) on the long video. ***(P < 0.001). 1 bpm: breaths per minute.

Table 6 .
Comparison of respiratory rate (RR) predicted by the conventional method (CM) involving multiple processing modules with the ground truth (GT)