In this study, we used a reinforcement-learning-based AI model, BCQ, to learn an optimal ventilation policy for critically ill patients requiring mechanical ventilation. We validated the policy using two large public datasets from the US: eICU and MIMIC-IV. In both datasets, the learnt policy outperformed the observed physician policy on several quantitative and qualitative evaluation metrics.
We first formulated the clinical problem of choosing optimal ventilator settings in the ICU as a reinforcement learning problem. We used relevant physiological variables to represent patients’ health status as states and cut the ventilator treatment trajectories into time-varying steps to reflect changes in patients’ condition. We designed a set of flags to capture sudden changes in patients’ health and used the flag timings to further cut the trajectories, because such timings were the likely decision points at which physicians made necessary interventions. From the visualization of time-varying intervals in Fig. 6, we observed that when flags were raised (vertical dotted lines), the time-varying interval setting (red lines) reflected changes in the raw ventilator-setting data (blue lines) in a more timely manner than fixed 4-hourly intervals.
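The segmentation described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the 4-hour base interval matches the fixed baseline mentioned in the text, while the function name and the representation of flag timings as a list of hours are assumptions for the example.

```python
from bisect import insort

def cut_trajectory(start_h, end_h, flag_times, base_interval_h=4.0):
    """Split the window [start_h, end_h) into time steps: start from a
    fixed grid of base_interval_h, then add an extra cut point at every
    flag timing (a sudden change in the patient's health status)."""
    boundaries = [start_h]
    t = start_h + base_interval_h
    while t < end_h:
        boundaries.append(t)
        t += base_interval_h
    boundaries.append(end_h)
    # Flags further cut the trajectory at likely physician decision points.
    for f in flag_times:
        if start_h < f < end_h and f not in boundaries:
            insort(boundaries, f)
    return list(zip(boundaries[:-1], boundaries[1:]))
```

For example, a 12-hour trajectory with one flag at hour 5 yields steps of 4, 1, 3 and 4 hours, so the step boundary tracks the flag rather than waiting for the next fixed 4-hourly mark.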
The action space was designed as 18 discrete actions comprising low/medium/high combinations of three ventilator settings: PEEP, \(Fi{O}_{2}\) and ideal body weight-adjusted tidal volume. At each timestep, the AI model took as input the patients’ status and physicians’ actions, received evaluative feedback (reward) on those actions, and adjusted itself to maximize survival and keep \(Sp{O}_{2}\) and MBP within their optimal ranges. Hospital mortality was used as the terminal reward, whereas \(Sp{O}_{2}\) and MBP were applied as intermediate rewards.
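The reward structure described above (terminal reward from hospital mortality, intermediate rewards for keeping \(Sp{O}_{2}\) and MBP in range) can be sketched as below. The range bounds and reward magnitudes are illustrative assumptions, not the values used in the study.

```python
def step_reward(spo2, mbp, terminal=False, died=False,
                spo2_range=(94, 98), mbp_range=(70, 80)):
    """Reward-shaping sketch: a large terminal reward driven by hospital
    mortality, plus small intermediate rewards for keeping SpO2 and MBP
    within their optimal ranges. Bounds and magnitudes are illustrative."""
    if terminal:
        # Terminal reward: survival is rewarded, in-hospital death penalized.
        return -1.0 if died else 1.0
    r = 0.0
    r += 0.1 if spo2_range[0] <= spo2 <= spo2_range[1] else -0.1
    r += 0.1 if mbp_range[0] <= mbp <= mbp_range[1] else -0.1
    return r
```

Splitting the signal this way lets the agent receive dense feedback on physiological targets at every step while the sparse mortality signal dominates the return at the end of the trajectory.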
Although patients from the MIMIC-IV dataset were generally sicker than those in the eICU dataset, this provided an opportunity for us to test the extrapolation ability of the BCQ model. Given the consistently superior performance of the BCQ-derived RL policy compared with the physicians’ policy in both datasets, we considered the model’s extrapolation ability satisfactory.
From the action frequency distribution plot (Fig. 3) for patients in MIMIC-IV, we found that the actions taken by physicians (red) and the actions recommended by the AI policy (blue) differed across all ventilator settings. This result is desirable, because the supervisor network in the BCQ model does not aim to copy physicians’ choices completely; rather, it learns good action patterns from physicians and constrains the choice of actions. In addition, we found that the learnt policy recommended low-level PEEP and high-level ideal body weight-adjusted tidal volume more frequently than physicians’ current practice across all SOFA level groups. This finding suggests that the high PEEP-low tidal volume strategy for acute respiratory distress syndrome (ARDS) (2, 9) may not be optimal for all mechanically ventilated patients and should not be applied as a one-size-fits-all approach. For the management of \(Fi{O}_{2}\), the learnt policy suggested more frequent use of low and medium levels and avoided high-level \(Fi{O}_{2}\) for all SOFA groups. This suggestion is in line with the known harm from excessive oxygenation, which has been observed across different types of critical illness (5, 10, 11).
We computed the learnt policy’s expected return and plotted it against mortality risk in Fig. 4. We observed an inverse relationship between expected return and mortality (red) in both the validation and testing datasets, indicating that the optimal policy (high return) results in lower patient mortality. For the secondary outcomes of maintaining \(Sp{O}_{2}\) and MBP within their respective optimal ranges, the expected return showed positive relationships (green for \(Sp{O}_{2}\), blue for MBP), indicating that the optimal policy (high return) leads to higher proportions of \(Sp{O}_{2}\) and MBP within their optimal ranges.
We also examined observed mortality as a function of the difference between the AI-recommended and physician-administered ventilator settings (Fig. 5). For an effective policy, mortality should be lowest when the recommended and administered settings coincide (x-axis value of zero), i.e., when practice strictly follows the AI policy, and should increase as the administered settings deviate from the recommendations. Accordingly, an effective policy should produce a V-shaped curve with its minimum at zero, which we observed for the AI policy in all three action groups (PEEP, \(Fi{O}_{2}\) and tidal volume).
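The V-shaped-curve analysis amounts to binning patient-steps by the action-level difference and computing observed mortality per bin. A minimal sketch, assuming actions are encoded as integer levels and outcomes as 0/1 death indicators (the function name and data layout are illustrative):

```python
from collections import defaultdict

def mortality_by_action_diff(recommended, administered, died):
    """Group observations by (administered - recommended) action level and
    compute observed mortality in each bin. An effective policy shows a
    V-shaped curve over the bins with its minimum at a difference of zero."""
    counts = defaultdict(lambda: [0, 0])  # diff -> [deaths, total]
    for rec, adm, d in zip(recommended, administered, died):
        diff = adm - rec
        counts[diff][0] += int(d)
        counts[diff][1] += 1
    return {diff: deaths / total
            for diff, (deaths, total) in sorted(counts.items())}
```

Plotting the returned mortality rates against the difference bins reproduces the curve shape inspected in Fig. 5 for each action group.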
From the quantitative evaluation using consistent weighted per-decision importance sampling (CWPDIS), we found that the learnt policy had the lowest estimated mortality compared with all three benchmark policies. At the same time, the learnt policy achieved the highest proportion of optimal \(Sp{O}_{2}\) and MBP in both datasets. As expected, the random policy had the worst outcomes among all the policies.
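For readers unfamiliar with CWPDIS, the estimator can be sketched as below. Each trajectory is represented as a list of per-step pairs \((\rho_t, r_t)\), where \(\rho_t\) is the importance ratio between the evaluated and behaviour policies at step \(t\) and \(r_t\) the reward; this data layout and the function signature are assumptions for the example, not the study's code.

```python
def cwpdis(trajectories, gamma=0.99):
    """Consistent weighted per-decision importance sampling (CWPDIS).
    At each step t, rewards are weighted by the cumulative product of
    per-step ratios and normalized by the sum of those weights:
        V = sum_t gamma^t * (sum_i w_{i,t} r_{i,t}) / (sum_i w_{i,t})."""
    horizon = max(len(traj) for traj in trajectories)
    cum = [1.0] * len(trajectories)  # cumulative ratio per trajectory
    value = 0.0
    for t in range(horizon):
        num = den = 0.0
        for i, traj in enumerate(trajectories):
            if t < len(traj):
                rho, r = traj[t]
                cum[i] *= rho
                num += cum[i] * r
                den += cum[i]
        if den > 0:
            value += (gamma ** t) * (num / den)
    return value
```

The per-step normalization is what makes the estimator "consistent weighted": trajectories with large cumulative ratios cannot dominate the estimate the way they can in unnormalized per-decision importance sampling.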
Although our study harnessed two large databases for derivation and external validation of an RL model, several limitations remain. Firstly, our study is retrospective, and the results require prospective validation. Secondly, patients were treated in the US, a high-income country with advanced medical care; whether the RL model would perform similarly in a lower-resourced country is unknown and requires further study. Thirdly, our model is a standalone AI model, which may be made more effective by combining AI with human input (i.e. collaborative intelligence(12)).
Despite the above limitations, our study highlights the potential of AI and RL to personalize medical care by accounting for the myriad variations in patients’ clinical features and tailoring treatment recommendations according to those variations. Our method may also be applied to complex clinical decision making beyond mechanical ventilation, such as sepsis management(13) and drug dosing(14).