Reinforcement Learning Assisted Oxygen Therapy for COVID-19 Patients Under Intensive Care

Background: Patients with severe Coronavirus disease 19 (COVID-19) typically require supplemental oxygen as an essential treatment. We developed a machine learning algorithm, based on deep Reinforcement Learning (RL), for continuous management of the oxygen flow rate for critically ill patients under intensive care, which can identify the optimal personalized oxygen flow rate with strong potential to reduce the mortality rate relative to current clinical practice. Methods: We modeled the oxygen flow trajectory of COVID-19 patients and their health outcomes as a Markov decision process. Based on individual patient characteristics and health status, an optimal oxygen control policy is learned using the deep deterministic policy gradient (DDPG) and recommends the oxygen flow rate in real time to reduce the mortality rate. We assessed the performance of the proposed methods through cross validation using a retrospective cohort of 1,372 critically ill patients with COVID-19 from New York University Langone Health ambulatory care with electronic health records from April 2020 to January 2021. Results: The mean mortality rate under the RL algorithm is lower than the standard of care by 2.57% (95% CI: 2.08–3.06; P<0.001), from 7.94% under the standard of care to 5.37% under our proposed algorithm. The average recommended oxygen flow rate is 1.28 L/min (95% CI: 1.14–1.42) lower than the rate actually delivered to patients. Thus, the RL algorithm could potentially lead to better intensive care treatment that reduces the mortality rate while conserving scarce oxygen resources, helping to alleviate oxygen shortages and improve public health during the COVID-19 pandemic. Conclusions: A personalized reinforcement learning oxygen flow control algorithm for COVID-19 patients under intensive care showed a substantial reduction in 7-day mortality rate as compared to the standard of care.
In the overall cross validation cohort independent of the training data, mortality was lowest in patients for whom intensivists’ actual flow rate matched the RL decisions.


Background
Over the course of the past year, the rapid global spread of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) has motivated multidisciplinary investigation efforts to identify effective medical management against coronavirus disease 2019 (COVID-19). Respiratory distress, including mild or moderate respiratory distress, acute respiratory distress syndrome (ARDS), and hypoxia, is a common complication of COVID-19. The treatment of COVID-19 is guided by the knowledge and experience of moderate-to-severe ARDS treatment [1]. Oxygen therapy is recommended as the first-line therapy for COVID-19-induced respiratory failure and hypoxia by the Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO). Oxygen therapy consists of different kinds of supplemental oxygen delivery methods, including nasal cannula, simple mask, venturi mask, non-rebreather masks, and high-flow oxygen systems. The key factor distinguishing these supplemental oxygen methods is the setting of different oxygen flow rates [2]. Thus, the selection of an appropriate oxygen flow rate is a crucial decision in COVID-19 treatment.
To improve treatment efficiency, the administration of oxygen therapy should be determined by the severity of COVID-19-induced respiratory failure, incorporating into the oxygen decisions the uncertainties in measurements of patient health status and in the prediction of individual outcomes. This requires a comprehensive investigation of the optimal and personalized oxygen flow rate. Our research aims to explore effective oxygen therapy for COVID-19 patients based on continuous respiratory support and vital signs monitoring.
A large collection of artificial intelligence (AI) and deep learning (DL) approaches have been proposed to accelerate the drug discovery and the process of diagnosis and treatment of COVID-19 disease [3,4]. Clinical studies in oxygen therapy and respiratory support have been made in a short period in the treatment of COVID-19 pneumonia [5,6].
However, respiratory failure remains the leading cause of death (69.5%) for SARS-CoV-2 [7]. Thus, we provide an AI algorithm for oxygen flow control, based on the deep deterministic policy gradient (DDPG) [8], a widely used reinforcement learning (RL) method for continuous state and action spaces. DDPG uses off-policy data and the Bellman equation to learn the Q-function and then utilizes the resulting Q-function (critic network) to learn a deterministic policy (actor network). To stabilize the training, it uses slow-learning target networks, i.e., the actor/critic target networks are updated slowly, keeping the estimated targets stable. The optimized policy can recommend personalized optimal oxygen flow rates for COVID-19 patients based on the knowledge of patient health status estimated from patients' electronic health records (EHRs). Reinforcement learning has been successfully applied in the past to different healthcare problems such as multimorbidity management [9], HIV therapy [10], cancer treatment [11], and anemia treatment in hemodialysis patients [12]. For critical care, given the large amount and granular nature of electronically recorded data, RL is well suited for providing sequential optimal treatment recommendations and improving health outcomes for new ICU patients [13]. Recent studies include treatment strategies for sepsis in intensive care [14] and personalized regimes of sedation dosage and ventilator support for patients in Intensive Care Units (ICUs) [15].
Focusing on RL-based oxygen flow rate control (RL-oxygen), we studied its impact on mortality in COVID-19 patients with respiratory failure. The evolution of patients' ICU histories, including treatments, vitals, and health outcomes, was modeled using a Markov decision process (MDP) [14,16]. At each decision epoch, based on the state (observed patient characteristics, including age, sex, race, smoking status, BMI, and comorbidity diagnoses, 36 daily observed lab test values, and 6 unique vitals), RL selected an oxygen flow rate (ranging from 0 to 60 L/min) and obtained a reward defined based on the patient's 7-day survival. Then, following the oxygen flow rate suggested by the RL policy, an estimated mortality rate was predicted to compare against the mortality rate in actual practice.

Study Design and Participants
Our research team used a retrospective cohort of the New York University Langone Health (NYULH) EHR data on COVID-19 patients to derive and validate the RL algorithm. Eligible patients had a positive COVID-19 PCR test and received oxygen therapy in hospital between March 1, 2020 and January 19, 2021. We excluded COVID-19 patients aged below 50 and those not hospitalized, as they lacked consistent documentation of vital signs, treatments, and laboratory tests. This study was approved by the NYULH IRB, and the data were de-identified to ensure anonymity.
For each patient, we had access to demographic data, including age, sex, race, ethnicity, and smoking status, ICU admission and discharge information, in-hospital living status, comorbidities, treatments, and laboratory test data. The comorbidities, including hyperlipidemia, coronary artery disease, heart failure, hypertension, diabetes, asthma or chronic obstructive pulmonary disease, dementia, and stroke, are defined based on International Classification of Diseases (ICD)-10 diagnosis codes. To reduce the feature dimensionality, we selected 36 laboratory tests based on two criteria: (1) less than 28% missing values; and (2) COVID-19-related tests and vital signs. Specifically, we explored the associations between laboratory tests and COVID-19 based on existing literature and clinical findings. For example, recent studies have shown that a reduced estimated glomerular filtration rate (eGFR), low platelet count, low serum calcium level, increased white blood cell count, neutrophil-to-lymphocyte ratio (NLR), and red blood cell distribution width-coefficient of variation (RDW-CV) are related to a high risk of severity and mortality in patients with COVID-19 [17][18][19][20][21]. Additionally, some research suggests that well-controlled blood glucose is associated with lower mortality in COVID-19 patients with Type-2 diabetes [22], and serum potassium is relevant because hypokalemia is common among patients with COVID-19 [23]. Arterial blood gas analysis, including pH, oxyhemoglobin saturation (O2Hb), oxygen saturation (SpO2), partial pressure of oxygen (PaO2), and bicarbonate (HCO3), provides commonly used biomarkers for measuring the severity of ARDS [24,25].
In this study, we employed leave-one-hospital-out validation to evaluate model performance. The whole dataset was divided into 4 batches by hospital; in each simulation, we took one batch as the validation set and the rest as the training set.
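The leave-one-hospital-out scheme described above can be sketched as follows. This is a minimal illustration only; the record structure and hospital labels are hypothetical placeholders, not the study's actual data format.

```python
# Sketch of leave-one-hospital-out validation: each fold holds out all
# patients from one hospital and trains on the remaining hospitals.
from collections import defaultdict

def leave_one_hospital_out(records):
    """Yield (held_out_hospital, train, validation) splits."""
    by_hospital = defaultdict(list)
    for rec in records:
        by_hospital[rec["hospital"]].append(rec)
    for held_out in sorted(by_hospital):
        validation = by_hospital[held_out]
        train = [r for h, rs in by_hospital.items() if h != held_out for r in rs]
        yield held_out, train, validation

# Toy cohort: 8 patients spread across 4 hospitals (A-D).
records = [{"hospital": h, "id": i} for i, h in enumerate("AABBCCDD")]
splits = list(leave_one_hospital_out(records))  # 4 folds
```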

RL algorithm Overview
We model the patient health trajectory and the clinical decisions during a course of intensive care over a period of ICU stay as a Markov decision process (MDP) with states, actions, and rewards. The state of a patient includes the observed patient demographics, vital signs, and laboratory tests at each time t. The action refers to the oxygen flow rate. After a sequence of actions, the patient receives a reward if he/she survives the next 7 days; otherwise, a penalty for death is given. The cumulative return is defined as the discounted sum of all rewards each patient received during the ICU stay. The intrinsic design of RL provides a powerful tool to handle sparse and time-delayed reward signals, which makes it well suited to overcome the heterogeneity of patient responses to actions and the delayed indications of treatment efficacy [14]. The details of the state, action, and reward are listed as follows.
• State s_t: the patient's observed characteristics at time t, including demographics, COVID-19 lab tests, and vital signs.
• Action a_t: the oxygen flow rate, ranging from 0 L/min to 60 L/min.
• Reward r_t: the reward of an action at time t is measured by its associated ultimate health outcome given the patient's health state. Similar to [14], we used in-hospital mortality as the system-defined penalty and reward.
When a patient survived, a positive reward was released at the end of the patient's trajectory (i.e., a 'reward' of +15); a negative reward (i.e., a 'penalty' of -15) was issued if the patient died. We find that such a reward function can propagate the final health outcome backward to each decision and intervention over the period, so that RL can predict long-term effects and dynamically guide the optimal oxygen flow treatment.
• Discount factor γ: determines how much the RL agent balances rewards in the distant future relative to those in the immediate future; it takes values between 0 and 1 [16]. Considering that ICU stays tend to be short, and after conducting side experiments, we chose a value of 0.99, which places nearly as much importance on late deaths as on early deaths for each recommended oxygen flow rate.
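The reward design above (a single terminal reward of +15 or -15, discounted backward with γ = 0.99) can be illustrated with a short sketch. The function name and trajectory layout are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the terminal-reward MDP design: per-step rewards are zero
# except at the end of the trajectory (+15 survival, -15 death), and the
# return at each step is the discounted sum of future rewards.
def discounted_returns(n_steps, survived, gamma=0.99):
    """Return the discounted return at every step of one patient trajectory."""
    rewards = [0.0] * (n_steps - 1) + [15.0 if survived else -15.0]
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # backward propagation of the final outcome
        returns.append(g)
    return list(reversed(returns))

returns = discounted_returns(3, survived=True)
```

With γ close to 1, early decisions receive nearly the full credit (or blame) for the final outcome, which is the intended backward propagation of the terminal reward.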

Model Evaluation
We evaluated the RL-recommended oxygen therapy by comparing its efficacy with the observed one on the cohort from each validation hospital. At each decision time, the RL algorithm recommends an oxygen flow rate for the patient.
If the absolute difference of recommended and the observed oxygen flow rate is less than 10 L/min, we say that RL is "consistent" with the critical care physicians.
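The consistency criterion above reduces to a simple threshold check; a sketch follows, with the function name chosen for illustration.

```python
# "Consistent" means the RL-recommended and observed oxygen flow rates
# differ by less than 10 L/min, per the definition in the text.
def is_consistent(rl_rate, observed_rate, tol=10.0):
    return abs(rl_rate - observed_rate) < tol

def consistency_rate(pairs):
    """Fraction of decision times at which RL agrees with the physician."""
    return sum(is_consistent(r, o) for r, o in pairs) / len(pairs)
```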
When RL is discrepant with the oxygen flow rate used by physicians, the efficacy of the RL-recommended oxygen therapy is not directly observed. The problem then becomes how to assess the health outcomes after taking the RL recommendations. For this reason, we predicted the outcome of the RL-recommended treatment using a Cox proportional hazards model, a regression model commonly used in medical observational studies for investigating the association between the survival probability of patients over a period and predictor variables of interest [26,27]. In short, a patient was labeled as "alive" if he/she survived within seven days after a treatment; otherwise, the patient was labeled as "deceased". We then fitted a Cox survival model with demographics, vital signs, and lab tests as predictors and evaluated the effect of each decision using leave-one-hospital-out validation.
To assess the performance of the survival models, we compared predicted and observed outcomes (7-day living status) using 4 metrics: similarity, accuracy, Chi-squared test, and concordance index. Overall, the cosine similarity between predicted and actual survival is greater than 99.9%, and the concordance index is 0.83. Both metrics indicate that the predictive model can effectively estimate unobserved health outcomes. Moreover, the paired Chi-squared test (p-value < 0.0001) shows no significant difference between true and predicted survival.

Results
Overall, 1,362 patients in the NYULH EHR sample had a PCR-based COVID-19 diagnosis between March 2020 and January 2021. The demographic and clinical characteristics of the analysis cohort are summarized in Table 1. Table 2 depicts the characteristics of the oxygen flow rates following the recommendations from both RL-oxygen and physicians. On average, the overall RL-oxygen flow rate was 1.28 L/min (95% CI: 1.14-1.42) lower than the rate delivered to patients. The efficacy of the RL prescriptive algorithm was consistently observed across age, gender, BMI, and comorbidity subgroups (Table 2). Compared with the observed efficacies in patients aged 75 and younger, COVID-19 patients older than 75 observed higher efficacies from RL-oxygen-recommended therapy than from physicians' recommendations. For example, the 7-day estimated mortality rate under RL-oxygen for patients older than 80 was 5.87% (95% CI: 4.67-7.07) lower than under physicians' therapy. In contrast, the 7-day estimated mortality rate under RL-oxygen was 0.55% (95% CI: 0.39-0.71) lower than that under physicians' therapy for patients aged between 50 and 65. Table 2 also shows that RL-oxygen tends to be more effective for patients with comorbidities. In particular, for COVID-19 patients with asthma or chronic obstructive pulmonary disease, dementia, and stroke, RL-oxygen reduced the 7-day mortality by 5.69%, 5.11%, and 3.8% on average, respectively. We further studied 7-day mortality when the actually administered oxygen flow rate differed from the rate suggested by RL-oxygen (Fig. 2), which shows how the observed mortality changes with the flow rate difference between RL-oxygen and physicians. Increasing differences between RL-oxygen and the observed delivered oxygen were associated with increasing observed mortality rates in a rate-dependent fashion. When the difference is at its minimum, we obtain the lowest 7-day mortality rate of 1.7%.
Another observation from Fig. 2(A) is that the mortality rate increases when the RL-oxygen flow rate is lower or higher than the physicians' rate. This suggests that both oxygen deficit (a lower oxygen flow rate than the RL-oxygen recommendation) and oxygen excess are suboptimal for patients' outcomes. We observed a trend that RL-oxygen was in general lower than what was prescribed by the physicians and might result in better outcomes under a lower flow rate, which suggests that oxygen flow rates prescribed by doctors tend to be excessively high for some patients. Last, we observed that RL-oxygen and the physicians recommended consistent flow rates most of the time; see Fig. 2(B).
The overall distributions of oxygen flow rates recommended by RL-oxygen and physicians are presented in Fig. 3, which depicts how often each oxygen flow rate was recommended by RL-oxygen and by physicians. Twenty-nine percent of the time, the patients received an oxygen flow within 5 L/min of the suggested rate, while forty-four percent of the time the difference between the administered and suggested oxygen flow rates was within 10 L/min. Since high-flow nasal oxygen (HFNO) therapy often increases the flow rate in increments of 10 L/min up to 60 L/min [28], this suggests that RL-oxygen is consistent with physicians about 40-50% of the time.

Discussion
We used an RL approach to learn an optimal policy to continuously control the oxygen device for critically ill patients with COVID-19 who require oxygen therapy. As most people who become seriously unwell with COVID-19 have an acute respiratory illness [29,30], our algorithm has strong potential to improve individual health outcomes and reduce the COVID-19 mortality rate caused by respiratory failure. We designed the reward as the ultimate health outcome, which is used to assess the performance of oxygen flow decisions along the treatment trajectory. As such, the reinforcement learning approach took uncertain outcomes and long-term treatment effects into consideration, making it better able to capture the long-term impact of an early decision on the final outcomes.
Our analysis suggests that current practice leaves room for improvement, as the actual oxygen flow rates administered by intensivists differed from RL-oxygen recommendations more than fifty percent of the time.
Importantly, we observe that RL-oxygen tends to prescribe lower oxygen flow rates than physicians while leading to better outcomes. This finding is especially important in the context of the ongoing and persistent medical oxygen shortages in some developing regions. As COVID-19 patient-care protocols have evolved, medical-grade oxygen is still considered an essential resource for treating critically ill patients. In regions such as Africa, the Middle East, and Asia, the surge in demand for medical oxygen to treat COVID-19 has exacerbated preexisting gaps in medical-oxygen supplies, leading to substantial shortages.
Our analysis also identified some clinical settings in which RL-oxygen works particularly well. For example, high-risk patients (i.e., those older than 75) observed higher efficacies than patients aged between 50 and 75, using a relatively lower average oxygen flow rate than actually administered, while for patients aged 50 to 65, RL-oxygen suggests that a higher average oxygen flow rate may improve health outcomes. Moreover, we also noticed significant therapeutic discrepancies in patients with stroke and diabetes comorbidities. In both cases, RL-oxygen recommended a higher average oxygen flow rate than doctors while showing a significant reduction in estimated mortality. In fact, these findings agree with recent studies which reported that "stroke survivors who underwent COVID-19 developed more acute respiratory distress syndrome and received more noninvasive mechanical ventilation" [31] and that "diabetic patients required more oxygen therapy (60% versus 26.9%)" [32]. To confirm these findings, randomized clinical trials with patients randomly assigned to RL and clinician oxygen therapy would be needed.

Conclusion
Through analyzing the EHR data from multiple ambulatory care centers, we demonstrated the feasibility of using reinforcement learning based oxygen therapy to improve intensive care for COVID-19 patients. RL-oxygen showed medium concordance (44%) with the current practice of critical care physicians. For all COVID-19 patients requiring oxygen therapy, the RL recommendations significantly reduced the estimated mortality rate compared with current practice. The algorithm has the potential to be integrated into clinical decision support systems and to assist physicians in providing timely, personalized recommendations of oxygen flow rates for COVID-19 patients in the ICU.

Cox Proportional Hazards Model
• t: denotes the duration since the patient admission time;
• S(t): denotes the survival rate of patients at time t;
• x: denotes the health state as predictor variables related to the survival rate;
• β: denotes the coefficients of the corresponding variables.
The Cox PH model used to predict the survival rate assumes that the hazard at time t for an individual with health state x is h(t | x) = λ0(t) exp(β⊤x), where λ0(t) is a baseline hazard function and exp(β⊤x) is the relative risk, a proportionate increase or reduction in risk associated with the set of characteristics x. Note that this increase or reduction in risk is the same at all durations t. Given the health state x of a patient, we predict the 7-day mortality rate as 1 − S(7 | x).
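Given a fitted coefficient vector β and the baseline survival at day 7, the 7-day mortality prediction above follows from S(t | x) = S0(t)^exp(β⊤x). The sketch below assumes hypothetical values for β and S0(7); the paper's fitted values are in Table S.1.

```python
# Minimal sketch of 1 - S(7 | x) under a Cox PH model, assuming a known
# baseline 7-day survival S0(7) and coefficient vector beta (both
# hypothetical here, for illustration only).
import math

def seven_day_mortality(x, beta, s0_7=0.95):
    """Predict 7-day mortality: 1 - S0(7) ** exp(beta . x)."""
    risk = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    return 1.0 - s0_7 ** risk
```

For a patient at the baseline (all covariates zero), the prediction reduces to 1 − S0(7) as expected.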

Feature Selection
Feature selection is critical for the establishment of the Cox PH model. Unrelated risk factors and high multicollinearity between predictor variables will cause low concordance and degrade the prediction. In addition to the selected 36 laboratory tests (see Study Design and Participants), we also included 25 additional demographic predictor variables, 2 vital signs (temperature and systolic blood pressure), and the oxygen flow rate. Since there were high correlations among the selected features, we conducted feature selection based on Pearson correlation to remove multicollinear features. We found high linear correlations (>0.7) within each of the following groups of features: (1) red blood cell distribution width-coefficient of variation (RDW-CV) and red cell volume distribution width-standard deviation (RDW-SD); (2) eGFR and creatinine; (3) red blood cell count, hemoglobin, and hematocrit; (4) neutrophils and lymphocytes; and (5) SpO2, oxyhemoglobin, and methemoglobin. We kept the first predictor in each group and removed the rest, retaining RDW-CV, eGFR, red blood cell count, neutrophils, and SpO2.
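The Pearson-correlation pruning described above (keep the first feature in each highly correlated group, drop the rest) can be sketched as follows; the greedy keep-first rule and threshold of 0.7 match the text, while the data here is synthetic.

```python
# Greedy correlation-based pruning: a feature is kept only if its absolute
# Pearson correlation with every already-kept feature is <= threshold.
import numpy as np

def prune_correlated(X, names, threshold=0.7):
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Synthetic example: column 1 is a scaled copy of column 0 (corr = 1).
t = np.arange(100.0)
X = np.column_stack([t, 2.0 * t, np.sin(t)])
kept = prune_correlated(X, ["rdw_cv", "rdw_sd", "other"])
```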
For the remaining feature selection, we used elastic net regularization [2] with grid search [3] to select the features. The procedure is as follows. 1. We specified a grid of candidate L1 and L2 penalty values. 2. For each combination of L1 and L2 penalty values, we fitted a Cox model with elastic net regularization and recorded the performance measured by the concordance score.
3. Finally, we chose the L1 and L2 regularizers with the best performance.
In the study, the selected coefficients of the L1 and L2 regularizers are 0.04 and 0.02, respectively.
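The grid search loop above can be sketched generically. Here `score_fn` is a stand-in for fitting a penalized Cox model and computing its concordance index on held-out data; the grids and scoring function below are illustrative assumptions.

```python
# Sketch of grid search over (L1, L2) elastic-net penalties, selecting the
# pair with the best concordance-style score. `score_fn` stands in for the
# fit-and-evaluate step (not implemented here).
from itertools import product

def grid_search(l1_grid, l2_grid, score_fn):
    return max(product(l1_grid, l2_grid), key=lambda p: score_fn(*p))

# Toy scoring function peaking at (0.04, 0.02), the values chosen in the study.
score = lambda l1, l2: -((l1 - 0.04) ** 2 + (l2 - 0.02) ** 2)
best = grid_search([0.01, 0.04, 0.1], [0.01, 0.02, 0.05], score)
```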

Training Process
We apply leave-one-hospital-out cross validation to evaluate the models and predict the 7-day survival probability to assess the performance of the RL-oxygen models. To train the general model (including data from different hospitals), we randomly selected 80% of the cohort as the training set and the remaining 20% as the test set. The coefficients of the predictor variables of the general Cox PH model are shown in Table S.1.

Reinforcement Learning Algorithm
A Markov decision process (MDP) was used to model the decision-making process and approximate individual patient health trajectories. We formalize the MDP by the tuple (S, A, P, R, γ).

• S: a finite set of states, typically including patients' demographic information, ICU admission and discharge information, comorbidities, treatments, and laboratory tests.
• A: the action space, which includes the oxygen flow rate over time. In this problem, we consider a continuous oxygen flow rate a ∈ A.
• P(s′ | s, a): the state transition probability model, giving the probability that taking action a in state s at time t leads to state s′ at time t + 1 (i.e., the patient's health state changes to s′ at t + 1 after receiving oxygen therapy with flow rate a at time t); it describes the dynamics of the treatment and health interactions.
• R: the immediate reward received for transitioning to state s′. Transitions to desirable states yield a positive reward, while reaching undesirable states incurs a penalty.
• γ: the discount factor, which makes immediate rewards more valuable than long-term rewards and determines the temporal impact of the current action. A greater γ indicates a longer-lasting impact of the current therapy action.
The process is observed at discrete time steps. At each time t, the agent observes the current state s_t ∈ S. Then, we choose an action a_t ∈ A (i.e., an oxygen flow rate), the patient's health condition moves to a new state s_{t+1}, and we receive a reward signal r_{t+1} associated with the one-step transition (s_t, a_t, s_{t+1}). The oxygen flow rate decision-making strategy is called the policy, denoted by a mapping π from the state space to the action space, i.e., a = π(s). The performance of a policy is measured using the value function V^π(s), defined as the expected cumulative discounted reward starting from state s, given that policy π is used to make decisions. The goal of a reinforcement learning agent is then to learn the optimal policy π* that maximizes the expected cumulative discounted reward V^{π*}(s).

Learning the Optimal Policy
We utilize the Deep Deterministic Policy Gradient (DDPG) to concurrently learn the Q-function and optimal policy.
In each iteration, we use off-policy data and the Bellman equation to learn the Q-function, and then the estimated Q-function is used to learn the optimal policy.
This approach is closely connected to Q-learning. In reinforcement learning, many algorithms focus on estimating the so-called "Q-function" of a policy π, denoted by Q^π(s, a). The Q-function measures the expected return, or discounted sum of rewards, obtained by starting from state s, following the policy π, and acting a = π(s). Thus, the Q-function represents the expected value of state-action pairs, and it can be connected to the value function through the equation V^π(s) = Q^π(s, π(s)). DDPG interleaves the learning process for a good approximator to Q*(s, a) with the learning process for an approximator to the optimal policy π*(s). The optimal Q-function is defined as the maximum return that can be obtained starting from state s, taking action a, and following the optimal policy π* thereafter. It is known to obey the following Bellman optimality equation: Q*(s, a) = E_{s′∼P(·|s,a)}[R(s, a, s′) + γ max_{a′} Q*(s′, a′)], where the next state s′ is sampled from the state transition distribution P(· | s, a). For a continuous action space, the function Q*(s, a) is presumed to be differentiable with respect to the action argument.
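For a single transition, the Bellman optimality backup above yields the familiar TD target y = r + γ max_{a′} Q*(s′, a′) used to train the critic (in DDPG the max over a continuous action space is approximated by the actor's action). The sketch below illustrates the discretized form with hypothetical values.

```python
# TD target for one transition: y = r + gamma * max_a' Q(s', a').
# q_next_values are hypothetical critic outputs over candidate actions.
import numpy as np

def bellman_target(reward, q_next_values, gamma=0.99, done=False):
    """Terminal transitions take only the reward; otherwise bootstrap."""
    if done:
        return reward
    return reward + gamma * float(np.max(q_next_values))
```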
We use a nonlinear function, such as a neural network with parameters θ, to approximate the state-action value function, i.e., Q(s, a) ≈ Q(s, a; θ). Such a neural network is called a Q-network [29].
The policy learning step in DDPG obtains a deterministic policy π(s; φ) that gives the action maximizing Q(s, a; θ). Because the action space is continuous, we assume the Q-function is differentiable with respect to the action argument. Considering the discounted expected reward (actor objective function) J(φ) = E_{s∼D}[Q(s, π(s; φ); θ)], we can perform gradient ascent with respect to the policy parameters φ (see [4] for more details), where the expectation is estimated using the training set, denoted by D, of tuples (s, a, s′, r) from the EHR data.
Then, we update the parameters of the Q-function and the policy function using noisy gradient estimates, while the corresponding target networks are updated slowly via Polyak averaging, θ′ ← τθ + (1 − τ)θ′, where τ is a hyperparameter between 0 and 1.
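The slow target-network update can be written in a few lines. This sketch treats parameters as flat lists of floats for clarity; real implementations apply the same rule per tensor.

```python
# Polyak averaging of target-network parameters: theta' <- tau*theta + (1-tau)*theta'.
# A small tau keeps the TD targets slowly moving and stabilizes training.
def soft_update(target_params, online_params, tau=0.005):
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_params, online_params)]
```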
The Q-network model, a.k.a. the critic network in our paper, uses a multi-layer feed-forward architecture which evaluates each state-action pair (s, a). Specifically, the model architecture contains a state input layer followed by a dense layer with 32 neurons and an action input layer; they are concatenated and then followed by a 16-dimensional dense layer; the output layer is 1-dimensional with a linear activation function. The policy model, a.k.a. the actor network, uses a two-layer neural network with the state input followed by a 32-dimensional intermediate layer and a 1-dimensional action output layer. We also use batch normalization [5] after each dense layer to standardize the units of low-dimensional features. This is particularly useful for healthcare data, as most biomarkers and vital signs have different physical units and characteristics by nature, and even statistics of the same type may vary substantially across patients. Batch normalization can fix this issue by normalizing every dimension across the samples in one minibatch.
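As a rough illustration of the critic's forward pass described above (state input → 32-unit dense layer, concatenation with the action, a 16-unit dense layer, then a 1-dimensional linear output), the following NumPy sketch uses randomly initialized hypothetical weights and omits batch normalization for brevity.

```python
# Forward-pass sketch of the critic layout: state -> dense(32) -> concat
# with action -> dense(16) -> dense(1, linear). Weights are random
# placeholders, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, relu=True):
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

state_dim, action_dim = 8, 1  # assumed dimensions for illustration
w1, b1 = rng.normal(size=(state_dim, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32 + action_dim, 16)), np.zeros(16)
w3, b3 = rng.normal(size=(16, 1)), np.zeros(1)

def critic(state, action):
    h = dense(state, w1, b1)
    h = dense(np.concatenate([h, action]), w2, b2)
    return dense(h, w3, b3, relu=False)  # scalar Q(s, a)

q = critic(rng.normal(size=state_dim), np.array([30.0]))
```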
We used early stopping [6,7] to prevent overfitting. Two metrics were used as early stopping criteria: the mean squared TD error and the consistency of recommendations between physicians and RL. First, since the objective of DDPG is to minimize the mean squared TD error, it is natural to use it as a metric. Second, as we did not want RL-oxygen to deviate too far from the standard of care, we used the consistency of recommendations as another metric, defined as the mean squared deviation between RL's and physicians' recommended oxygen flow rates. In the study, we noticed that this second metric tends to converge later than the TD error. Thus, during training, we monitored both metrics and set the early stopping criterion to be that the mean squared deviation has not improved in the last 500 iterations.
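The stopping rule above (no improvement in the monitored metric for 500 iterations) can be captured by a small monitor class; the class name and interface are illustrative.

```python
# Early-stopping monitor: signals a stop when the tracked metric (here, the
# mean squared deviation between RL and physician flow rates) has not
# improved for `patience` consecutive iterations.
class EarlyStopper:
    def __init__(self, patience=500):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, metric):
        """Record one iteration's metric; return True when training should stop."""
        if metric < self.best:
            self.best, self.bad = metric, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```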
Our training scheme is as follows:
1. Split the dataset into 4 groups (one hospital per fold).
2. For each unique group: 1) take the group as a hold-out test data set; 2) take the remaining groups as a training data set; 3) fit a model on the training set and evaluate it on the test set; 4) retain the evaluation score; 5) repeat this process until every group has served as the test set.
3. Take the average of the recorded scores as the performance metric for the model.
In reinforcement learning, learning an optimal policy from observational data is referred to as offline RL [13]. This approach uses a set of one-step transition tuples D = {(s_i, a_i, r_i, s′_i): i = 1, …, |D|} to estimate the Q-function Q(s, a; θ) and the oxygen flow policy π(s). The learning algorithm follows [23] with a batch size of 64 and a learning rate of 0.002 for both the critic and the actor network.

Missing Data Imputation
Our dataset contains a set of historically observed health states, but not every possible health state, and the time series data such as lab tests, vital signs, and oxygen flow rate are sampled unevenly. To learn an optimal policy, RL requires a way to estimate values in any state, including those not in the original data. Thus, we imputed data for such missing states based on the information from nearby measurements using a linear interpolation method.
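The linear interpolation of unevenly sampled series described above can be sketched with NumPy's `interp`; the function name and toy series are illustrative.

```python
# Impute missing values in an unevenly sampled time series (labs, vitals,
# oxygen flow) by linear interpolation between nearby observed measurements.
import numpy as np

def interpolate_series(times, values, query_times):
    times, values = np.asarray(times, float), np.asarray(values, float)
    mask = ~np.isnan(values)  # interpolate only from observed points
    return np.interp(query_times, times[mask], values[mask])

filled = interpolate_series([0.0, 2.0, 4.0], [1.0, float("nan"), 3.0], [2.0])
```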