Learning dynamic treatment strategies for coronary heart diseases by artificial intelligence: real-world data-driven study

Background Coronary heart disease (CHD) has become the leading cause of death and one of the most serious epidemic diseases worldwide. CHD is characterized by urgency, danger and severity, and dynamic treatment strategies for CHD patients are needed. We aimed to build and validate an AI model for dynamic treatment recommendations for CHD patients with the goal of improving patient outcomes and learning best practices from clinicians to help clinical decision support for treating CHD patients. Methods We formed the treatment strategy as a sequential decision problem, and applied an AI supervised reinforcement learning-long short-term memory (SRL-LSTM) framework that combined supervised learning (SL) and reinforcement learning (RL) with an LSTM network to track patients’ states to learn a recommendation model that took a patient’s diagnosis and evolving health status as input and provided a treatment recommendation in the form of whether to take specific drugs. The experiments were conducted by leveraging a real-world intensive care unit (ICU) database with 13,762 admitted patients diagnosed with CHD. We compared the performance of the applied SRL-LSTM model and several state-of-the-art SL and RL models in reducing the estimated in-hospital mortality and the Jaccard similarity with clinicians’ decisions. We used a random forest algorithm to calculate the feature importance of both the clinician policy and the AI policy to illustrate the interpretability of the AI model. Results Our experimental study demonstrated that the AI model could help reduce the estimated in-hospital mortality through its RL function and learn the best practice from clinicians through its SL function. The similarity between the clinician policy and the AI policy regarding the surviving patients was high, while for the expired patients, it was much lower. The dynamic treatment strategies made by the AI model were clinically interpretable and relied on sensible clinical features extracted according to monitoring indexes and risk factors for CHD patients. Conclusions We proposed a pipeline for constructing an AI model to learn dynamic treatment strategies for CHD patients that could improve patient outcomes and mimic the best practices of clinicians. And a lot of further studies and efforts are needed to make it practical. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-01774-0.

died of CHD per year [1], and that 18.2 million American adults have CHD and 363, 452 died from CHD in 2016 [2]. Eleven million Chinese residents had CHD, and the mortality rate was 120.18 per 100 thousand in 2017 [3]. CHD is characterized by urgency, danger and severity; thus, personalized dynamic treatment in the ICU is particularly important [4]. A series of general guidelines on rational drug use for CHD have been made by experts [5][6][7]. However, one ideal treatment strategy may be effective for some patients but not for others, and even the same patient might need different treatment strategies during different stages of the CHD process. Additional file 2: Figure S1 shows an example of the dynamic treatment strategies administrated to a CHD patient.
The methods used to design artificial intelligence (AI) to identify dynamic treatment strategies can be classified into three main categories: (1) rule-based expert systems, which map diseases to treatments based on heuristic rules [44]; (2) supervised learning (SL) methods, which generate treatment recommendations by utilizing the similarity of patients or match diseases with treatments via classification, including pattern-based methods and deep learning [12,49], and more recently attention and memory-augmented network (AMANet) [50]; and (3) reinforcement learning (RL) methods [51,52], which address delayed rewards and infer an optimal strategy based on non-optimal treatment behaviours, including value-based RL [18][19][20] and direct policy optimization [53]. Other methods include outcome weighted learning [54], augmentation and relaxation learning [55], and ensemble machine learning [56]. Each kind of method has respective advantages and drawbacks. Taking SL methods as an example, on the one hand, they are adept at mining the experience of doctors from labelled data; on the other hand, their prerequisite of assuming that the treatment label provided by doctors is optimal is not always the case [57], so they may learn some wrong things. RL methods infer an optimal treatment strategy according to the delayed reward set up mainly by a patient outcome, but they may recommend treatments that are obviously different from a doctor's prescription due to the lack of supervision, which may cause high risk in clinical practice [58]. These two kinds of methods can supplement each other by combining an evaluation signal and an indicator signal to learn an integrated treatment policy [14,15].
We aimed to build and validate an AI model for dynamic treatment recommendations for CHD patients with the goal of improving patient outcomes and learning the best practices of clinicians to help clinical decision support in treating CHD patients. Inspired by Wang et al. [15] and the aforementioned studies, we applied an AI model of supervised reinforcement learning-long short-term memory (SRL-LSTM) to learn dynamic treatment strategies for CHD patients by using real-world EHRs and compared it with several state-ofthe-art SL models and RL models. We used a random forest algorithm to calculate the feature importance of both the clinician policy and the AI policy to illustrate the interpretability of the AI model. Case studies were conducted to analyse the similarities and differences between the AI-recommended treatment actions and the clinicians' actual treatment decisions. Figure 1 shows the framework of our study. To help clinicians prescribe efficiently and efficaciously to treat patients admitted to ICU with CHD, as shown in Fig. 1, we first train an AI model SRL-LSTM based on historical CHD cohort extracted from MIMIC-III V1. 4, an open access and anonymized real-world ICU database. The model combines SL and RL with an LSTM network to track the patients' states and learns dynamic treatment policies. When a patient with CHD is admitted to the ICU, we feed the patient's evolving health status to the model, including diagnoses, demographics and time-series variables of the patient till the current day extracted according to CHD guidelines. The model can then provide us a daily treatment recommendation in the form of whether to take specific drugs.

Overall approach and cohort
We formed the treatment strategy as a sequential decision problem and applied an AI SRL-LSTM framework  [59][60][61]. We included adult patients diagnosed with CHD. CHD is a class of heart diseases caused by myocardial ischaemia, hypoxia, or necrosis because of narrowing or occlusion of the lumen as a result of coronary atherosclerosis [7], which is also called ischaemic heart disease [62]. Following [2], we used the International Classification of Diseases (ICD), 9th edition revision codes 410 to 414 [63] to indicate CHD. The patient admission inclusion diagram is shown in Fig. 2.

Data extraction and preprocessing
We extracted the prescriptions of the CHD cohort, which contained 1,292,650 records and 2477 drugs. To ensure statistical significance, we selected the top 500 drugs that covered 98.0% of the prescriptions. The prescription action was coded as one-hot for each day.
If the prescription of a patient contained drugs out of the selected drug list on one day, then the patient was still included and only the drugs within the list were scanned and coded. The data were included from hospital admission to discharge, resulting in a total of 140,176 action days.
Each patient had 1 to 39 diagnoses coded by ICD-9, and a total of 3719 diseases were involved. All the diagnoses of a CHD patient were used as the input layer, and each of them was embedded with 40 hidden nodes regarding to the maximum number of diagnoses observed. The embedding was conducted as follows. First, all ICD-9 codes used to identify the CHD cohort were sorted in frequency descending order; then, the ICD-9 codes of each CHD patient were replaced by their indexes in descending order, and zeros were padded to make the length of each patient's diagnosis sequence equal to that of the largest diagnoses sequence. The model can be adjusted to deal with longer or shorter sequence and easily retrained on new dataset containing new patients with more or less diagnoses. Finally, the embedding layer in Keras [64] was applied to turn the positive integers (indexes) into dense vectors of fixed size. We analysed the rationality of the embedding method in both development set and test set. For both the development set and test set, we first calculated the Euclidean distances between each two patients in both the original index space and the embedding space respectively. Then we divided the distances in the original index space into 3 equal groups: the closer group, the middle-distance group, and the distant group. We calculated the mean distances of each group in the original index space and the mean distances of corresponding patient pairs in each group in the embedding space. Additional file 1: Table S1 shows that, for both development set and test set, the closer groups in the original index spaces have closer mean distances in the embedding spaces, the distant groups in the original index spaces have greater mean distances in the embedding spaces. And the mean distances of corresponding groups in the development set and test set are close. This demonstrated that the embedding method could rationally keep the relative distances between patients in terms of their diagnoses. We identified monitoring indicators and risk factors for the CHD patients by searching CHD related guidelines [5][6][7], handbooks [4], reports [3], and papers [1,2]. Then, for each hospital admission, we extracted static variables and time-series variables that were recorded for at least 20% of the sampled hospital admissions. Finally, the model features included diagnoses, demographics, electrocardiogram and haemodynamic monitoring results, vital signs, ventilation parameters, lab values, and output events. Among them, demographics such as gender, age, and weight are the basic risk factors for the CHD patients, and electrocardiogram monitoring of heart rhythm and heart rate are basic monitoring items for the CHD patients for detecting each kind of arrhythmia and the situation of myocardial ischaemia. Haemodynamic instability is a prominent manifestation in patients with severe cardiovascular disease. The monitoring of haemodynamic indexes is particularly important in the condition evaluation and rescue treatment of patients with severe cardiovascular disease. Systolic, diastolic, and mean blood pressure, systolic, diastolic, and mean  , CK-MB isoenzyme, and troponin T are markers of myocardial injury and play important roles in clinical diagnosis, condition monitoring and risk stratification of acute myocardial infarction and other diseases associated with myocardial injury. Renal insufficiency is a common and important complication in patients with severe cardiovascular disease and is one of the predictors of poor prognosis; thus, we included indexes such as creatinine, blood urea nitrogen, and daily urine output, which could reflect renal injury. Heart disease is often associated with liver insufficiency, so indexes that could reflect liver function were included, such as alkaline phosphatase, serum glutamic-oxaloacetic transaminase (SGOT), serum glutamic pyruvic transaminase (SGPT), total bilirubin, albumin, partial thromboplastin time (PTT), prothrombin time (PT), and international normalized ratio (INR). Partial pressure of oxygen (PaO2) was included to reflect hypoxia. Other basic laboratory values were also used, including routine blood indexes such as haemoglobin, white blood cell count, and platelet count; electrolyte indexes such as potassium, sodium, magnesium, calcium, ionized calcium, and chloride; and acid base balance indexes including pH, carbon dioxide (CO2), PaCO2, base excess, bicarbonate, and lactate. Variable heart rhythms were divided into 25 sub-types, including atrial fibrillation, atrial flutter, A paced, V paced, AV paced, left bundle branch block, right bundle branch block, sinus arrhythmia, sinus bradycardia, sinus rhythm, sinus tachycardia, supra ventricular tachycardia, ventricular tachycardia, ventricular fibrillation, multifocal atrial tachycardia, paroxysmal atrial tachycardia, wandering atrial pacemaker, first degree AV block, second degree AV block Wenckebach-Mobitz1, second degree AV block-Mobitz 2, complete heart block, junctional rhythm, idioventricular, asystole, and others, which were coded as one-hot sub-variables. We divided the time-series data of each hospital admission into different units, which were set to 24 h following [15] since it was the minimum interval of prescription in MIMIC-III. Following [15,18], variables with multiple data points in one unit were averaged (for instance, systolic blood pressure) or summed (for instance, urine output).
The quality of the data was improved in the preprocessing step. Variables with different measurement units were unified. For example, pound weights were converted to kilograms, and temperatures in Fahrenheit were converted to temperatures in Celsius. Variables extracted from different tables, such as labevents and chartevents in MIMIC-III, were combined, and duplicates were dropped according to the keys of hospital admission ID and chart time. Several composite variables were calculated by their composing sub-variables. For instance, some of the GCS values were summed by their sub-variables: GCS eye, GCS verbal, and GCS motor; and the shock index was calculated by heart rate dividing systolic blood pressure. Because pulse was not available in MIMIC-III, it was replaced by heart rate according to [18]. We detected the outliers with a frequency histogram and normal probability graph and removed them to cap all the variables to clinically plausible values. Variables not normally distributed were transformed to their logarithms as appropriate, all the variables were normalized, and the missing variables were imputed by k-nearest neighbours (KNN).

AI models Model preliminaries
In this paper, the dynamic treatment strategy was modelled as a partially observed Markov decision process with finite time steps. Let denote the observed dataset, where n is the number of patient admission trajectories. For each patient admission trajectoryi , T i is the total hospitalization days; S i,t , A i,t , S i,t+1 , r i,t shows the transitions from the t th day to the (t + 1) th day, where S i,t is the current observed state; is the actual medications prescribed by clinicians, where K = 500 and a k i,t ∈ {0, 1} represents whether or not to take drugk ; S i,t+1 is the state on the next day after taking actionA i,t , S i,T i +1 = 0 denotes the termination of the trajectory, and r i,t is the reward gained. Given the current observed stateS i,t , our goal was to learn a policy µ(S i,t |θ µ ) to select an action (drug combinations) A i,t by maximizing the expected return and minimizing the difference from clinicians' decisionA i,t , where θ refers to the parameters within the respective network. A critic network Q(S, A|θ Q ) was built to estimate the expected return, which was the accumulated discount reward from this state to the end of the trajectory [65].

SRL-LSTM
We applied an AI SRL-LSTM framework [15] to minimize the following objective loss function: where L RL (θ µ ) is the loss for RL, which aims at minimizing the negative expected return to maximize the positive expected return; L SL (θ µ ) is the loss for SL, which aims at minimizing the difference between the recommended action and the real action made by the clinicians; and ε is a weight parameter to trade off the objective between them. We used a deep deterministic policy gradient (DDPG) for the RL part and cross entropy as the supervisor. In DDPG [66], there are two agents, actor µ(S|θ µ ) and critic Q(S, A|θ Q ) , each of which has a target network µ ′ (S|θ µ ′ ) and Q ′ (S, A|θ Q ′ ) with the same structure but different parameter update frequencies to ensure the robustness of the model. LSTM with a time window of 5 days was adopted to track the historical observed states within the actor and critic networks. Their network architectures are shown in Additional file 3: Figure S2, and the parameters were updated by the Adam optimizer.
For a random minibatch of N transitions S i,t , A i,t , S i,t+1 , r i,t from D , the critic Q S, A|θ Q can be updated by minimizing the mean square error (MSE) loss: , and γ is the discount ratio of reward. The RL loss of the actor µ(S|θ µ ) is defined as: The SL loss of the actor is defined as: Then, the target networks are updated by: where τ is a weight to control the updating ranges of the parameters. (1)

Models for comparison
We compared the SRL-LSTM model with the following state-of-the-art models that were suitable for learning daily medication recommendation strategies: Dual-LSTM: Dual-LSTM [50] took the treatment recommendation as a classification problem, which encodes disease information and time-series variables using LSTM respectively and then concatenates the two encoding vectors for the final classification layer. AMANet: AMANet is a classification model for dual-view sequential learning based on attention and memory mechanisms [50]. The original AMANet predicted medications for each patient visit based on ordered diagnoses and medical procedures by treating the diagnoses and procedures in the current visit as two sequential views. Our purpose was to predict the daily medications based on diagnoses and daily time-series variables, so we replaced the token embedding of procedures in the original AMANet with an LSTM layer for the timeseries variables. Direct policy optimization (DPO) with LSTM: DPO is a RL method to directly learn a policy without learning an extra model of treatment effectiveness [53]. We used the same architecture of the actor network and reward r as defined in the SRL-LSTM model and learned a single model that directly predicts which treatment is optimal by optimizing a surrogate loss: where and SRL-Multimorbidity: SRL-Multimorbidity [15] was developed for dynamic medication recommendations for multimorbidity with a combination of SL and RL. Its time-series features only include diastolic blood pressure, fraction of inspiration O2, Glasgow coma scale score, blood glucose values, systolic blood pressure, heart rate, pH, respiratory rate, blood oxygen saturation, body temperature, and urine output.

Experiment setup
We randomly split the CHD dataset into a development set (80%) and a test set (20%) and applied 5-fold cross validation to the development set to investigate the The training process of the SRL-LSTM model is shown in Additional file 3: Figure S2 and described in the text in Additional file 1, and the hyperparameters are summarized in Additional file 1: Table S2.

Evaluation
We adopted the estimated in-hospital mortality rates on both state-wise and trajectory-wise (or admission-wise) to measure whether the AI strategy model could reduce patient mortality. The estimated mortality rate is a universal metric for computational testing of treatment recommendation models when only retrospective data are available [15,[18][19][20]. The state-wise estimated in-hospital mortality rate was computed as follows.
Step 1: We obtained the recommended actions for each patient state by the actor evaluation network and the expected returns of both the clinicians' actual actions and the AI model recommended actions by the critic evaluation network.
Step 2: We assigned the in-hospital mortality flag for all the expected returns of actual actions.
Step 3: We discretized all the expected returns of actual actions into different units according to their distribution with 5% in each unit.
Step 4: We calculated the average estimated mortality rate for each unit by the bootstrapping with 2000 re-samplings.
Step 5: We discretized all the expected returns of recommended actions into each unit, and calculated the expected mortality number in each unit according to the number of states and the average mortality rate.
Step 6: We calculated the state-wise expected in-hospital mortality rate by using the total expected mortality number to divide the total states. The state-wise estimated in-hospital mortality rate took each patient state to represent a patient admission to make full use of the data; however, it was not the direct estimated in-hospital mortality rate. Therefore, we also calculated the trajectory-wise (admission-wise) estimated in-hospital mortality rate according to the above steps, with the expected returns of each stateaction pair replaced by that of the initial state-action pair in each trajectory.
We used the mean Jaccard coefficient to measure the degree of consistency between prescription actions taken by the clinicians and those recommended by the AI model since the task belongs to multilabel classification [14][15][16][17]. The mean Jaccard is defined as follows: where M is the number of patient admissions in the valid/test set. J ∈ [0, 1] , where J = 1 indicates that the daily treatment actions recommended by the AI policy are exactly the same as those proposed by the clinicians; in contrast, J = 0 indicates that none of the drugs recommended by the AI policy are the same as the drugs proposed by the clinicians for each day. We analysed how the observed in-hospital mortality rate varies with the difference in treatment actions between the AI policy and the clinician policy. The treatment difference for the ith patient on the tth hospitalization day was defined as: Furthermore, a case study was conducted to see the similarities and differences between the AI-recommend treatment actions and the clinicians' actual treatment decisions on both the surviving and expired patients.

Interpretability analysis
We used a random forest model to estimate the importance of the features in decision making for the AI model and the clinicians [18] to gain some insight into the model representations and interpretability. The random forest model was fitted by all the test data over all the treatment periods. The independent variables of the model are patients' feature (except diagnosis). The dependent variable is the real prescription action taken by the clinicians when calculating feature importance for the clinician policy, and the recommended action generated by the AI model of SRL-LSTM when calculating feature importance for the AI policy. We calculated the importance of the heart rhythm by accumulating the importance of its 25 sub-types.

Distribution of the feature variables
Following the cohort selection and exclusion criteria, we included 13,762 admitted patients diagnosed with CHD. A detailed description of the cohort is provided in Additional file 1: Table S3. Table 1 shows the distribution of model features before normalization and imputation by KNN. In all, 65.6% of the CHD patients were male. CHD is a disease of old age, and 79.8% of the patients were over 60-years of age. A total of 29.1% suffered severe coma, and 38.0% suffered moderate coma during the hospital stay. The troponin T, CK and CK-MB isoenzyme levels were higher than the normal values because the patients' myocardia were infarcted by the ischaemia hypoxia. The average values of blood glucose were higher than those in the common population, since four out of five patients had endocrine, nutritional, metabolic, and immune diseases. More than half of the CHD patients suffered kidney damage according to the creatinine and blood urea nitrogen values. Figure 3 presents the performance of the models with different weights to balance RL and SL in 5-fold cross validation. The SL-LSTM model ( ε = 0.0, 100%SL ) achieved the highest Jaccard value, indicating that it had the best ability to learn the experience of the clinicians; however, it did not show any ability to improve the outcome of the CHD patients on either the state-wise or trajectory-wise estimated in-hospital mortality rate. The RL-LSTM model ( ε = 1.0, 100%RL ) seemed to have the ability to substantially reduce the state-wise estimated in-hospital mortality rate; however, it had the lowest Jaccard value, indicating that its recommendation was quite different from the clinicians' decision. The SRL-LSTM model with weight parameter ε = 0.4 exhibited a relatively high performance in reducing both the trajectory-wise and statewise estimated in-hospital mortality rates, while its Jaccard value was not greatly harmed by RL, and all three evaluation indicators had relatively small variation. Therefore, the AI model with 40% RL and 60% SL was preferable. The proposed SRL-LSTM model ( ε = 0.4, 40%RL, 60%SL ) was then tested against several recent methods, and was found to outperform them all ( Table 2). The test dataset contained 2,752 patient admissions (trajectories) with 28,417 hospitalization days (states). A total of 9.56% of the patients died in the hospital under the clinicians' actual treatment policy, and the state-wise mortality rate was 9.59%. The preferred AI model of SRL-LSTM could help reduce the trajectory-wise and state-wise estimated in-hospital mortality rate by 3.13% and 0.81% respectively, while keeping the Jaccard similarity (0.3110) close to the SL-LSTM model (0.3432), and its average recommended drug amount (27) was close to that of the clinicians. Considering that diagnoses codes might not be available at bedside, we conducted an ablation experiment by removing the disease information out from the SRL-LSTM model, and the result showed that it would slightly harm the performance by increasing the trajectory-wise estimated mortality rate by 0.98% and decreasing the Jaccard similarity score by 8.77%. Figure 4 further indicates the effectiveness and stability of the AI model. Figure 4A shows the expected return and Jaccard value obtained in each learning epoch, demonstrating that the AI model can maximize both the expected return and the similarity to the clinician policy and achieved stability after approximately 200,000 epochs. Figure 4B shows that the observed mortality rates varied with the difference in treatment actions between the AI policy and the clinician policy. The smallest treatment action difference was associated with the best survival rates. When the difference was not greater than 6, the in-hospital mortality rate was zero; the greater the discrepancy was in the clinicians prescribed drugs and those recommended by the AI model, the worse the outcome. Figure 4C, D shows the correlation between the expected returns of the clinicians' treatment actions and the trajectory-wise and state-wise in-hospital mortality rates. We observed that treatment actions with low returns were associated with a high risk of mortalities, whereas treatments with high returns achieved better survival rates. This demonstrated that the estimated mortality calculating method could effectively reflect that inhospital mortalities have clear negative correlation with the expected returns. Thus, the estimated mortalities generated according to the relationship between the distribution of the expected returns and the mortality rates were relatively reliable. Figure 5 shows the feature (except diagnosis) importance gained from the random forest model for the clinician policy and the AI policy generated by SRL-LSTM method respectively. These results confirmed that the treatment decisions made by both the clinician policy and the AI policy were clinically interpretable and relied primarily on sensible clinical and biological parameters. Among the ten most important features, both the clinician policy and the AI policy emphasized heart rhythm, WBC, urine output, platelet count, GCS score, and age; the clinician policy paid more attention to weight, creatinine, blood urea nitrogen, and ionized calcium; the AI policy was more concerned with haemoglobin, albumin, PTT, and lactate dehydrogenase.   Fig. 6, who survived to discharge after 5 days in the hospital, the similarity between the daily prescription of the clinician and AI policies was high, indicating that AI was able to learn the best practices of the clinicians. For patient 2 in Fig. 7, who expired after 14 days in the hospital, the daily prescription of the AI policy was quite different from that of the clinicians, and its rationality needs to be further examined by experts.

Comparison with recent methods
The AI model of SRL-LSTM outperformed the recent methods in learning dynamic treatment strategies for CHD patients (Table 2). It could not only improve patient outcome, but also mimic the best practices of clinicians. The Dual-LSTM, AMANet and SL-LSTM were SL models with relatively high Jaccard values, indicating that they were capable of learning the experiences of the clinicians but had little ability to reduce the in-hospital mortality rate. The RL-LSTM model was supposed to be the most effective model for reducing the estimated inhospital mortality rate. However, the Jaccard value was only 0.0342, one-tenth of the SL-LSTM model, and the average number of drugs recommended per patient per day was 230, ten times the average amount (23) clinicians prescribed. It was obviously not reasonable. DPO-LSTM was inferior to the SRL-LSTM model in both reducing the in-hospital mortality rate and mimicking the behaviours of the clinicians. SRL-Multimorbidity was inferior to the SRL-LSTM model for the CHD patients in both improving patient outcomes and mimicking the behaviours of the clinicians, indicating that the AI model developed for multimorbidity should not be directly used for a specific disease, such as CHD. It is essential to modify the models according to the characteristics and risk factors for the specific disease.

Comparison with similar studies
The results of this study complied with the theoretical analysis and experimental results in similar studies in the literature. Many studies [12][13][14][15][16] have demonstrated that SL approaches are adept at learning the behaviours of doctors. RL approaches generate treatment recommendations by maximizing the expected return according to the reward mechanism, so they have the potential to recommend better treatment than those of the clinicians to improve patient outcomes [51]. For example, Weng [20] adopted an RL paradigm using policy iteration to learn the optimal glycaemic control policy for septic patients and found that the best optimal policy could potentially reduce the estimated mortality rate by 6.3%; Komorowski et al. [18] developed an AI clinician by using an RL approach of policy iteration to learn the optimal dynamic dosing of intravenous fluids and vasopressors for sepsis treatment. They found that the AI clinician recommended lower doses of intravenous fluids and higher doses of vasopressors than the clinicians' actual treatments, the smallest dose difference was associated with the best survival rates, and the further away the dose received was from the suggested dose, the worse the outcome. Additionally, the SL approaches did not take the outcome of the patients into consideration when mimicking the practice of the clinicians; the RL approach, on the contrary, may recommend treatments that are obviously different from clinicians' prescription due to the lack of supervision, which may be of high risk in the clinical practice [58]. These two approaches can complement each other. For example, Wang et al. [15] proposed a SRL framework for dynamic medication recommendations for multimorbidity, and the experiment on MIMIC-III illustrated that the SRL-multimorbidity model could reduce the estimated mortality, while providing promising accuracy in matching doctors' prescriptions, which provided a prospect for combing SL and RL approaches.

Limitations and future directions
This study was a preliminary exploration of learning dynamic treatment strategies for CHD patients, and more work is needed to make it practical. It is worthwhile to explore how to combine the structured data and the unstructured information (including the narrative diagnosis and other free-text records at bedside) to learn more practical dynamic treatment strategies. In addition, the AI model built in a pure data-driven way might be improved by leveraging domain knowledge of medicine and clinical guidelines to avoid major adverse drug-drug interactions. Moreover, this study focused only on whether to take specific drugs on each hospitalization day, and future studies need to further investigate the impact of drug doses. Further investigation is required to validate the effectiveness of the AI model in various CHD cohort and obtain careful evaluations from experts in medical domain. This study aimed to learn optimal dynamic treatment strategies by using real-world data, thus was based on the following assumptions [32]: a. the consistency assumption that the state and results observed in each stage for each patient are such that potentially would be seen after the patient actually receives the corresponding treatment; b. the stable unit treatment value assumption that the potential outcome for each patient is not affected by treatments applied to other patients; c. the no unmeasured confounders, also referred to as the sequential randomization assumption. These assumptions were defaulted in this study, and further clinical trials or other prospective studies are needed to make the AI model practical.
Besides, despite AI technologies have the promise to support clinicians making more efficient and high quality treatment decisions, there are formidable obstacles and pitfalls, including risks for bias and overfitting, limited generalizability, risks of privacy and data security, and cause or exacerbate inequities [67][68][69][70]. Many initially promising technologies have failed in broader testing and applications. For example, Watson for Oncology that used by hundreds of hospitals worldwide for recommending treatments for cancer patients, provided many erroneous treatment recommendations, such as suggesting using bevacizumab in a patient with severe bleeding, which is an explicit contraindication [67,71]. The Automated Retinal Disease Assessment (ARDA) tool was developed by Google to detect a condition that causes blindness in diabetic patients. Though ARDA was effective working with sample data, it struggled with images taken in field clinics during the test in a hospital in India [72]. The potential for an AI algorithm inducing iatrogenic risk is vast if it was widely applied. Therefore, when the AI algorithm is to be unleashed in clinical practice, systematic debugging, audit, extensive simulation and validation on various patient groups, along with prospective scrutiny and participation of different stakeholders, are required to ensure its efficiency and generalization [67][68][69][70]. Regulatory, governance and ethical guidelines are also necessary to ensure the information security, ethics and equity [67][68][69][70].

Conclusion
We proposed a pipeline for constructing an AI model to learn dynamic treatment strategies for patients with CHD, the leading cause of death and one of the most serious epidemic diseases worldwide [1]. The cohort was selected following strict inclusion and exclusion criteria, and the features were extracted according to monitoring indexes and risk factors for CHD patients by referring to CHD-related guidelines [5][6][7], handbooks [4], reports [3], and papers [1,2]. The AI model combining SL and RL resulted in better performance than using either SL or RL alone. The combined approach can help improve the outcomes of CHD patients and learn the best practices of clinicians and is clinically interpretable by relying on sensible clinical features. And a lot of further studies and efforts are needed to make it practical.