An automatic deep reinforcement learning bolus calculator for automated insulin delivery systems

In hybrid automatic insulin delivery (HAID) systems, meal disturbance is compensated by feedforward control, which requires the announcement of the meal by the patient with type 1 diabetes (DM1) to achieve the desired glycemic control performance. The calculation of insulin bolus in the HAID system is based on the amount of carbohydrates (CHO) in the meal and patient-specific parameters, i.e. carbohydrate-to-insulin ratio (CR) and insulin sensitivity-related correction factor (CF). The estimation of CHO in a meal is prone to errors and is burdensome for patients. This study proposes a fully automatic insulin delivery (FAID) system that eliminates patient intervention by compensating for unannounced meals. This study exploits the deep reinforcement learning (DRL) algorithm to calculate insulin bolus for unannounced meals without utilizing the information on CHO content. The DRL bolus calculator is integrated with a closed-loop controller and a meal detector (both previously developed by our group) to implement the FAID system. An adult cohort of 68 virtual patients based on the modified UVa/Padova simulator was used for in-silico trials. 
The percentage of the overall duration spent in the target range of 70–180 mg/dL was 71.2% and 76.2%, below 70 mg/dL was 0.9% and 0.1%, and above 180 mg/dL was 26.7% and 21.1%, respectively, for the FAID system and the HAID system utilizing a standard bolus calculator (SBC) including CHO misestimation. The proposed algorithm can be exploited to realize FAID systems in the future.


Methodology
In this work, a DRL-based insulin bolus calculator is designed and integrated with a closed-loop controller and a UKF-based meal detector to compensate for unannounced meals in patients with DM1. The proposed DRL-based insulin bolus calculator is an advanced version of an algorithm published by our group 28. The DRL algorithm is driven by meal detection and does not require information on the CHO content of meals, thereby fully closing the AID control loop. Continuous insulin delivery is achieved by a closed-loop PD controller with a safety auxiliary feedback element (SAFE) introduced in 29. The detection of meals is based on an in-house algorithm utilizing an augmented minimal model and a UKF, along with the insulin and CGM data 30. A schematic of the overall strategy is given in Fig. 1.

PD Controller
The control strategy involves two loops: an inner loop comprising the insulin feedback system (IFB) that relies on the PD algorithm and an outer loop that provides a safety layer to exploit the concept of insulin on board (IOB).
Three insulin components constitute the inner control action: u_bl, the basal insulin profile of the patient; u_bolus, the insulin bolus; and the PD control action, resulting in an insulin action given by: where k_p = 60 × TDI/(τ_d × 1500) (U/hr) is the proportional gain, TDI is the total daily insulin, e(t) is the error in glucose concentration, and τ_d = 90 min is the derivative time constant. The safety layer is based on sliding mode reference conditioning (SMRC) and comprises three parts: 1) a model to estimate IOB; 2) a sliding mode referencing block (SMR); and 3) a 1st-order low-pass filter to smooth the reference adaptation. The outer safety layer modifies the reference glucose concentration (G_ref) under defined conditions to ensure that the IOB is bounded (IOB ∈ [0, ĪOB]). Essentially, this is accomplished by a suspension of insulin infusion caused by the controller's reference modification. G_ref is modified to a virtual reference G_vref in case the estimated IOB dangerously approaches or exceeds the maximum allowed IOB (ĪOB). This mechanism provides robustness against delays in the subcutaneous route.
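Read literally, the garbled gain expression groups as k_p = 60 × TDI/(τ_d × 1500); a small sketch makes the dependence on TDI explicit (this grouping is an interpretation of the typeset formula, not confirmed by the source):

```python
def proportional_gain(tdi_units, tau_d_min=90.0):
    # k_p = 60 * TDI / (tau_d * 1500), in U/hr per mg/dL of glucose error
    # (the grouping of the garbled typeset formula is an interpretation)
    return 60.0 * tdi_units / (tau_d_min * 1500.0)

kp_40 = proportional_gain(40.0)  # e.g., a patient with TDI = 40 U/day
```

Under this reading, the gain scales linearly with the patient's total daily insulin, which is the usual way such controllers are individualized.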
The insulin absorption model 31 is utilized to account for the estimated IOB and is given below.
where u(t) = u_pd(t) + u_bl + u_bolus, c_1(t) and c_2(t) are two compartments representing the basal and bolus IOB conditions, and k_dia is a time constant that accounts for the duration of insulin action. The SMR block is based on the concept of invariance control 32, with IOB(t) being the variable to be bounded, belonging to the set: where x(t) is the state of the system and s(t) is the sliding surface, defined as: The invariance of the region is achieved using the following discontinuous function.
Finally, the smoothness of the reference change is achieved by applying a first-order low-pass filter: A widely used mechanism of IFB in AP systems is also implemented. The plasma insulin concentration is estimated online; then, the insulin control action is inhibited proportionally. This gives rise to a new insulin control action given by: where i_p(t) is the estimated value and i_pss(t) is the steady-state estimated value of the plasma insulin concentration. Δi_pss(t) is the deviation of the plasma insulin concentration from the basal infusion. Further details are presented in 29.
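The IOB estimate that drives the SAFE layer can be sketched with a simple two-compartment discretization. The exact compartment equations are not reproduced in the text, so the model below is an assumption that only mirrors the stated structure (compartments c_1, c_2 and a time constant k_dia):

```python
import numpy as np

def estimate_iob(insulin_doses, dt_min=5.0, k_dia=60.0):
    """Two-compartment IOB estimate (a sketch; the exact compartment
    equations are an assumption, only c1, c2 and k_dia come from the text).
    insulin_doses: delivered insulin per step (U), sampled every dt_min minutes."""
    c1 = c2 = 0.0
    iob = []
    for u in insulin_doses:
        c1 += dt_min * (u - c1 / k_dia)       # absorption compartment
        c2 += dt_min * (c1 - c2) / k_dia      # action compartment
        iob.append(c1 + c2)                   # total insulin still active
    return np.array(iob)
```

A single bolus followed by no insulin yields an IOB trace that rises and then decays with the duration of insulin action, which is the quantity the SMRC layer bounds.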

Meal Detector
The meal detector algorithm 30 takes the rate of insulin infusion and the CGM value as inputs and estimates a disturbance term via an extended minimal model utilizing the UKF. The glucose subsystem comprises the Bergman equations 33 as follows: where G_pl(t) is the blood plasma glucose concentration, X(t) reflects insulin in the remote compartment, G_bl is the basal glucose, p_1 is the insulin-independent rate of plasma glucose utilisation, D(t) is the disturbance term included as an extended model state, and V_g is the distribution volume.
Subcutaneous glucose is represented by a first-order system 34 as given below: where G_s(t) is the subcutaneous glucose concentration, τ is the time constant of the system, and the static gain is represented by g. X(t) reflects insulin in the remote compartment, p_2 is the disappearance rate of remote insulin, and p_3 captures insulin sensitivity. The insulin subsystem model is the same as that represented by equation 2, and the concentration of plasma insulin 34 is given by: where V_i is the distribution volume, k_f is the fractional rate of disappearance, and t_max,I is the time to maximum absorption of insulin.
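Since the display equations are not reproduced in this extraction, a textbook Bergman-type formulation consistent with the variable definitions above can be sketched; the signs and terms below are assumptions based on the cited minimal model, not the paper's exact augmented equations:

```python
def minimal_model_step(G, X, Gs, D, I_dev, params, dt=1.0):
    """One Euler step of a Bergman-type minimal model (textbook form; an
    assumption, as the paper's augmented equations are not shown here).
    G: plasma glucose, X: remote insulin action, Gs: subcutaneous glucose,
    D: meal disturbance term, I_dev: plasma insulin deviation from basal."""
    p1, p2, p3, Gb, Vg, tau, g = params
    dG = -p1 * (G - Gb) - X * G + D / Vg      # glucose subsystem
    dX = -p2 * X + p3 * I_dev                 # remote insulin compartment
    dGs = (g * G - Gs) / tau                  # first-order subcutaneous sensor
    return G + dt * dG, X + dt * dX, Gs + dt * dGs

# illustrative parameter values: p1, p2, p3, Gb, Vg, tau, g
params = (0.01, 0.02, 1e-5, 100.0, 1.6, 10.0, 1.0)
```

At basal insulin with no meal (D = 0, I_dev = 0, G = G_b), the state is an equilibrium, which is the sanity check one expects from this model class.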
After estimation of the model states given by equations 2 and 8 to 11 through the UKF, the cross-covariance is calculated between the two sequences G_s(k) (from the CGM data) and D_diff(k) (the forward difference of the disturbance term) over a window of specified length. G_s and D_diff are assumed to be jointly stationary random processes, and their cross-covariance sequence is defined as the cross-correlation of the mean-removed sequences 35, as given below: where the mean values of the random processes are represented by µ_Gs and µ_Ddiff, E stands for the expectation operation, and * represents the complex conjugate.
Meal consumption is assumed if the cross-covariance between G_s and D_diff exceeds a predefined threshold over the last three consecutive samples (15 min). As a safety measure, meals are not detected during the night period (23h-6h).
The meal detector can be tuned via three settings with respect to the threshold and window size for the cross-covariance 30. The three settings refer to 1) highest sensitivity (high true positives (TP)); 2) trade-off (high TP and low false positives (FP)); and 3) lowest FP. In this study, the trade-off tuning is used because the highest-sensitivity setting is prone to FPs and would result in the delivery of insulin boluses at times other than meals, leading to extreme hypoglycemia. The third setting was not used because it decreases the TP substantially.
A meal detection flag is triggered if: where T is the predefined threshold and c_{Gs,Ddiff}(m) represents the raw cross-covariance, as given in 30.
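The detection statistic and the safety rules above can be sketched as follows. This is a simplified single-lag version: the exact windowing and the three-sample persistence rule of the original detector are not reproduced here and are left out as assumptions:

```python
import numpy as np

def cross_covariance(gs, d_diff):
    """Lag-0 cross-covariance of the mean-removed CGM and disturbance-difference
    windows (a sketch of the detection statistic; real signals are real-valued,
    so the complex conjugate in the definition is a no-op here)."""
    gs = np.asarray(gs, float) - np.mean(gs)
    dd = np.asarray(d_diff, float) - np.mean(d_diff)
    return float(np.mean(gs * dd))

def meal_flag(gs_win, dd_win, threshold, hour):
    """Trigger only if the statistic exceeds the threshold and it is not the
    night period (23h-6h), mirroring the safety rule in the text."""
    if 23 <= hour or hour < 6:
        return False
    return cross_covariance(gs_win, dd_win) > threshold
```

A rising CGM window that co-varies with a growing disturbance estimate produces a positive statistic and, during the day, a triggered flag.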

The DRL algorithm
The problem is first formulated as a Markov decision process (MDP) to enable the training of the RL agent. An MDP is defined in terms of a state space S, an action space A, the transition probability P(s_{t+1} | s_t, a_t) of the next state (s_{t+1}) given that action (a_t) is taken in the current state (s_t), and an immediate reward r_t, mathematically represented as a tuple M(S, A, P, r). In DRL, the agent is based on a combination of RL and a category of artificial neural networks (ANNs), specifically deep neural networks (DNNs), and is termed a deep Q-network (DQN).
The DQN aims to learn actions that result in the maximum total expected reward. The total expected reward can be represented as E_R = E[r_t + γ r_{t+1} + γ² r_{t+2} + ...], where γ ∈ [0, 1) is the discount factor defining the contribution of future rewards and r_t is the immediate reward at time step t.
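The total discounted reward for a finite episode can be evaluated with the usual backward recursion:

```python
def discounted_return(rewards, gamma=0.9):
    """E_R = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite episode,
    computed backwards so each step adds one multiplication."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

For example, rewards [1, 1, 1] with γ = 0.5 yield 1 + 0.5 + 0.25 = 1.75.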
In DRL, the mapping of states into actions to be taken by the DQN is termed the policy and is represented by π: S → A. The quality of the policy is represented by the action-value function Q^π(s, a). The policy that leads to the maximum E_R is the unique optimal policy π*, resulting in the unique optimal action-value function Q*(s, a). In this work, a fully connected DNN is used to learn π* by approximating Q(s, a, θ) ≈ Q*(s, a), where θ refers to the parameters of the DNN. The final goal of training the DQN is to learn π*, which implies that the agent will take the best possible action in a given state. In RL, the optimal action-value function is obtained on the basis of the Bellman equation 36, given below: The optimal policy is obtained by dynamic programming to iteratively evaluate: According to Bellman's identity, Q_t converges to Q* as t → ∞, where α ∈ [0, 1) is the learning rate. This approach to RL (Q-learning) requires the states to be discrete and lacks generalization. Therefore, in DRL, Q*(s, a) is approximated by a nonlinear function approximator such as a DNN. To estimate Q*(s, a), the DQN uses fixed Q-targets by maintaining the online network Q(s, a, θ) and the target network Q(s, a, θ⁻), both having the same architecture. The two approximators improve the stability of optimization by periodically updating the parameters θ⁻ of the target network to the latest parameters θ of the online network 37. The parameters are updated every 15 iterations during the training phase in the proposed algorithm.
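The Bellman update that the DQN generalizes can be illustrated in its tabular Q-learning form. This is a generic textbook sketch, not the paper's DNN implementation:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_b Q(s',b) - Q(s,a)).
    Q maps (state, action) pairs to values; unseen pairs default to 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_error = r + gamma * best_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)]
```

In the deep variant, the table Q is replaced by the DNN, and `best_next` is read from the periodically frozen target network for stability.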
In this work, multiple DQNs are implemented and trained. Typically, there are three meals per day, i.e., breakfast, lunch, and dinner. The protocol for meals is described later in the scenario subsection under Results. For each meal, the action space is divided into 8 subaction spaces based on the 8 ranges defined for the CGM value before meal intake. The action space is explained later in Sect. 2.3. First, the DQN agents are personalized for each patient. Second, a DQN is trained for each subaction space, resulting in the implementation of 8 DQNs for each meal and leading to a total of 24 DQNs corresponding to the three meals a day.
The motivation behind introducing a multi-DQN strategy is to obtain a personalized DRL agent for each subaction space with respect to meals. This approach limits the learning experience of each DQN to that specific subaction space and meal, thereby improving the chances of better performance. In summary, it is the personalization of a DQN based on the meal and the CGM value before meal intake.
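The mapping from (meal, G_BM) to one of the 24 DQNs can be sketched as a simple lookup. The bin edges below are hypothetical, since the paper does not list the 8 G_BM ranges:

```python
def select_dqn(meal, g_bm, ranges):
    """Index one of the 24 DQNs from the meal and the pre-meal CGM value G_BM.
    `ranges` holds the 8 upper bounds of the G_BM bins (values above the last
    bound fall in the last bin)."""
    meals = {"breakfast": 0, "lunch": 1, "dinner": 2}
    sas = next((i for i, ub in enumerate(ranges) if g_bm <= ub), len(ranges) - 1)
    return meals[meal] * len(ranges) + sas  # index in 0..23

example_ranges = [70, 90, 110, 130, 150, 180, 250, 400]  # hypothetical bin edges (mg/dL)
```

Each index then names an independently trained network, so every agent only ever sees experiences from its own meal and glycemic range.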
A fully connected ANN composed of three hidden layers is considered to represent a DQN for the approximation of Q*(s, a, θ). Each hidden layer is composed of 28 nodes. The whole network consists of 5 layers, including the input and output layers. The input layer represents 15 parameters (defining the state), and the output layer gives the Q-value of each action taken in that particular state. The Q-value used in RL measures the effectiveness of the action taken in a certain state. The DQN architecture is presented in Fig. 2.
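A minimal forward pass matching the stated shapes (15 inputs, three hidden layers of 28 nodes, one Q-value per action) can be sketched as follows; the ReLU activation and random initialization are assumptions, as the paper does not specify them:

```python
import numpy as np

def dqn_forward(state, weights, biases):
    """Forward pass of the fully connected DQN sketch: hidden layers use ReLU
    (an assumption), the output layer is linear and returns one Q-value per action."""
    h = np.asarray(state, float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)        # hidden layers
    return weights[-1] @ h + biases[-1]       # linear Q-value outputs

# Layer sizes: 15 inputs -> 28 -> 28 -> 28 -> 15 actions (one subaction space)
sizes = [15, 28, 28, 28, 15]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
q_values = dqn_forward(np.ones(15), weights, biases)
```

The action delivered to the patient corresponds to the argmax over the 15 Q-values of the selected subaction space.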
The main components of the MDP model considered in this study are explained below:

State space
The states are represented as the current state and the next state. The DQN takes the action in the current state, which is then evaluated in the next state during the training process. In DRL, the states are continuous in nature, and discretization of states is not required. The state is defined in terms of the following quantities: G_max, the maximum CGM value; G_min, the minimum CGM value; t_m, the meal detection time; k, the sample index; G_{t_m−k}, the CGM value at t_m − k; and AUC, the area under the curve over 4 hours of CGM data corresponding to hyper- and hypoglycemia only.

Action space
The action space for a certain meal is classified into 8 subaction spaces (SASs) corresponding to 8 different BG ranges.The number of SASs in a previous study 28 was 7, but the number has now been increased to 8 to enhance safety based on BG before a meal and to provide greater flexibility to the agent in the choice of insulin bolus.
According to the CGM value (sample) before meal intake (G_BM), belonging to one of the 8 defined ranges, the corresponding SAS is selected for action by the DQN agent. The actions considered in this study are discrete and are the bolus insulin units to be delivered to the patient, as described in 28. The action space can be represented as: where A is the action space and A_i | i = 1, 2, ..., 8 represents the SASs. A_i = {a_1, a_2, ..., a_j}, where a_1, ..., a_j are the bolus insulin units calculated based on the total daily insulin requirement of the patient and the value of G_BM. In this study, j = 15, i.e., an agent can choose among 15 actions from a chosen SAS. The selection of a SAS for a single iteration is demonstrated in Fig. 3.
The insulin bolus selected as an action is further adjusted according to the bolus insulin on board (BOB) to ensure safety and avoid extreme hypoglycemic events. The adjustment can be represented as a piece-wise function: where u_ad is the adjusted insulin bolus to be delivered, a_j is the action chosen by the agent, BOB is the estimated BOB, and k_BOB is a hyperparameter that is tuned separately for all SASs and the three meals. A two-compartment model is used to estimate the BOB 38.
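A common form of such a BOB safety adjustment is to subtract a scaled BOB estimate from the chosen action and clip at zero. The exact piece-wise rule (equation 18) is not reproduced in this extraction, so this form is an assumption:

```python
def adjust_bolus(a_j, bob, k_bob):
    """BOB safety adjustment (a sketch: the paper's piece-wise rule is not
    shown here, so u_ad = max(0, a_j - k_BOB * BOB) is an assumed form).
    a_j: bolus chosen by the agent (U); bob: estimated bolus insulin on board."""
    return max(0.0, a_j - k_bob * bob)
```

With k_BOB = 1, a 5 U action with 2 U still on board is reduced to 3 U, and an action smaller than the active insulin is suppressed entirely.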

Reward function
An immediate reward is assigned to the actions of the DQN based on the next state. If the postprandial BG is in the normal range (70-180 mg/dL), a high reward is given to the DQN. If the action taken by the DQN results in hyper- or hypoglycemia, the agent is penalized. The numerical values assigned to the immediate rewards are illustrated in Fig. 4 and can be expressed as a piece-wise defined function: where G_maxp and G_minp represent the maximum and minimum glucose values in the postprandial period, respectively. In the case of the simultaneous occurrence of G_maxp and G_minp excursions, the value associated with G_minp is considered. The reward function is designed to reward the DQN agent for optimal performance, i.e., maintaining postprandial glucose in the normal range. The reward values are kept positive for mild hyperglycemia to avoid hypoglycemic episodes; there is a trade-off between avoiding hyper- and hypoglycemia, as no information on the meal content is available. The occurrence of hypoglycemia, on the other hand, is penalized proportionally to the intensity of the event to avoid severe postprandial hypoglycemia.
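The shape of the reward can be sketched as a piece-wise function. The numeric levels and thresholds below are illustrative assumptions matching the description: high reward in range, a small positive reward for mild hyperglycemia, penalties growing with the depth of hypoglycemia, and the hypoglycemic value dominating when both excursions occur:

```python
def immediate_reward(g_maxp, g_minp):
    """Piece-wise reward shaped like Fig. 4 (numeric levels are illustrative
    assumptions, only the ordering of the regions comes from the text)."""
    if g_minp < 70:                    # hypoglycemia dominates; penalty grows with depth
        return -1.0 - (70.0 - g_minp) / 10.0
    if g_maxp <= 180:                  # postprandial glucose kept in 70-180 mg/dL
        return 1.0
    if g_maxp <= 250:                  # mild hyperglycemia: small positive reward
        return 0.5
    return -0.5                        # severe hyperglycemia
```

Note how a postprandial minimum of 60 mg/dL outweighs any peak value, encoding the asymmetry between hypo- and hyperglycemic risk described above.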

Implementation
The concept of experience replay is typically used in DRL for the stability and convergence of the DNN 37. This concept is also implemented in the proposed methodology. A memory is defined for each DQN. The memory buffer (MB) consists of the past experiences of the agent and can be represented as: where n is the size of the MB and ξ is a single-iteration experience given by: To generate the memory, a simulation is performed for 1500 days, where the actions are taken randomly and the experiences are stored in the MB. The size of the MB varies for each DQN and depends on the number of occurrences of a specific A_i during the whole simulation. The MB is generated for each virtual patient.
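A minimal memory buffer with uniform sampling, as used for experience replay, can be sketched as:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next) experiences with uniform sampling.
    The oldest experiences are discarded automatically once capacity is reached."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buf, min(batch_size, len(self.buf)))
```

Sampling mini-batches uniformly from this buffer breaks the temporal correlation between consecutive experiences, which is what stabilizes the DNN updates.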
A cohort of 68 virtual patients previously developed by our group is considered in this study 39. A protocol of three meals (breakfast at 08:00 of 30-50 g, lunch at 14:00 of 50-70 g, and dinner at 20:00 of 60-80 g) was considered during the training session. The CHO content of the meals was chosen randomly from the indicated ranges. All the meals were unannounced, and the agent only took action whenever it received a positive indicator from the meal detector. The sources of intrapatient variability included sinusoidal variations in insulin pharmacodynamics and insulin sensitivity (circadian variability) and randomness in the rate of absorption of meals 40. An epsilon-greedy policy is used to choose the action, and an immediate reward is assigned to the DQN agent according to the reward function presented in equation 19. In a single iteration, the corresponding MB is updated with the new experience, and the weights of the DQN are updated based on past experiences from the MB. The loss function used to optimize the DQN's weights is based on the Bellman equation and is given for the k-th iteration as follows: During learning, the Q-learning updates are applied to mini-batches (s_t, a_t, r_t, s_{t+1}) ∼ U(MB) extracted randomly from the MB through a uniform distribution, where γ is the discount factor and Q(s_{t+1}, a_{t+1}; θ⁻_k) is the target DQN in iteration k, whose weights θ⁻_k are updated periodically with the weights θ_k of the online DQN Q(s, a; θ_k). The DRL training algorithm implemented in this study to calculate the insulin bolus is presented in Algorithm 1. The training is performed for each patient, resulting in individually trained DQN agents. The per-iteration steps of Algorithm 1 are:
6: if the meal detection flag is triggered then
7: Choose the SAS A_i based on the value of G_BM
8: Explore with probability ε: take a random action a_j
9: Exploit with probability 1 − ε: a = max_{a_j} Q(s, a_j; θ)
10: Apply the BOB adjustment according to equation 18 and take the action
11: Observe the next state and assign the immediate reward
12: Update the MB with the new experience {s_k, a_k, r_k, s_{k+1}}
13: Sample a random mini-batch of N experiences from the MB
14: Set A_max ← arg max_a Q(s_{k+1}, a; θ_k) (double DQN algorithm)
15: Perform a gradient descent step on the loss function given above
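The explore/exploit choice in steps 8-9 corresponds to a standard epsilon-greedy rule:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy action choice used during training: with probability
    epsilon pick a random action index (explore), otherwise take argmax Q (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda j: q_values[j])
```

With ε = 0 the rule is purely greedy; during training ε is typically kept nonzero so that all 15 actions of a subaction space keep being sampled.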

In-silico scenario and benchmark
The virtual cohort from 39 was used for the final testing simulations. However, the training of the DQN was not successful for one of the virtual patients; therefore, that patient was removed from the subsequent analysis. The simulation time for the in-silico trials is 14 days. The meals delivered include breakfast at 07:00, lunch at 13:00, a snack at 17:00, and dinner at 20:00, with a CHO content selected randomly from 30-50 g, 50-70 g, 30-50 g, and 60-80 g, respectively. During the simulations, the meal time is varied ±30 min around the times mentioned above. Variability is also incorporated, including randomness in the rate of absorption of meals, random CHO content in meals, and circadian variability in insulin sensitivity, to emulate real-life conditions 40.
Three insulin delivery systems are compared in this study, and they all utilize a PD closed-loop controller for continuous insulin delivery. First, the HAID system is implemented utilizing the SBC for the insulin bolus calculation, and the CHO misestimation error is included to be more realistic. This baseline system is denoted HAID SBC MCHO. The CHO misestimation error is incorporated as a Gaussian distribution according to a recently published methodology 41. To implement the SBC, the parameters required are the carbohydrate-to-insulin ratio (CR) and the correction factor (CF), calculated based on clinical guidelines 42. The formula for the SBC used in this study is given below 43: where u_bolus is the bolus insulin, BG_k is the CGM value at the time of delivering the bolus, BG_T is the target glucose value, and IOB is the estimated insulin on board.
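The SBC described here follows the standard meal-plus-correction form, u_bolus = CHO/CR + (BG_k − BG_T)/CF − IOB; a sketch (the clipping at zero is an added safety assumption, not stated in the text) is:

```python
def standard_bolus(cho_g, cr, cf, bg_k, bg_target, iob):
    """Standard bolus calculator: meal bolus plus glucose correction minus IOB,
    u_bolus = CHO/CR + (BG_k - BG_T)/CF - IOB, clipped at zero (an assumption)."""
    return max(0.0, cho_g / cr + (bg_k - bg_target) / cf - iob)
```

For example, a 60 g meal with CR = 10 g/U, CF = 50 mg/dL/U, CGM of 180 mg/dL against a 130 mg/dL target, and 1 U on board yields 6 + 1 − 1 = 6 U; the CHO term is exactly where a misestimation error propagates into the dose.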
Second, the HAID system with the proposed DRL insulin bolus calculator is denoted HAID DRL. As the DRL bolus calculator is independent of the CHO content of meals, CHO misestimation is not an issue in this case. In both HAID systems, all the meals are announced, hence the name hybrid. In this case (HAID DRL), the DRL algorithm was tuned and trained in the setting of announced meals. This implies that the meal detector was not used and the insulin bolus was delivered at meal time during the training session of the DQN agents. The simulation performed to generate the memory (required for the memory replay concept in the DRL algorithm) was also based on announced meals. HAID DRL is included to explicitly show the difference in glycemic performance induced by unannounced meals.

Comparative analysis
To draw a comparison and investigate the performance of the proposed FAID system, the outcomes of the insilico simulations are presented in the standardized core CGM metrics, as reported in a consensus report by the American Diabetes Association (ADA) and the European Association for the Study of Diabetes (EASD) 44 .
The standardized CGM metrics and insulin information are presented in Table 1. The mean and median CGM values reported for the FAID system were statistically similar to those of the HAID systems, as indicated by the p-values. The extreme CGM values, i.e., the minimum and maximum, were more spread in the FAID system, leading to a slightly higher glycemic variability, as indicated by the higher CV compared to that of the HAID systems. The FAID system achieved a similar glucose management indicator (GMI), as reflected by the p-value.
The percentage of CGM values (PCGM) reported for the ranges provided in Table 1 showed an overall increase of 5% in the PCGM below 70 mg/dL and above 250 mg/dL (hypoglycemia and hyperglycemia) for the FAID system. Specifically, the difference in hypoglycemia (below 70 mg/dL) was 0.9%, and that in hyperglycemia (above 250 mg/dL) was 4.1%, which is in accordance with the designed reward function. Hypoglycemia was penalized more than hyperglycemia since a hypoglycemic excursion is riskier than a hyperglycemic excursion of the same magnitude.
According to the p-values, the differences in the PCGM ranges are significant, except for the tight target range (70-140 mg/dL). Importantly, all the values achieved were in the range recommended by the ADA consensus report 44. Moreover, the glycemic risk index (GRI), a measure of the quality of glycemia based on the hypoglycemia and hyperglycemia components of CGM tracings 45, is also provided.
The performance of the FAID system is coupled with the accuracy of the meal detector and the time needed for detection. The performance metrics of the meal detector are presented in Table 2, which summarizes the populational detection performance for meals. The detection of lunch and dinner was better, as evidenced by the sensitivity and true positives, whereas the snacks were barely detected. The detection of breakfast was approximately 60%. The time taken to detect a meal ranged between 30 and 40 min. As reported in Table 2, the FPs amounted to fewer than 1 meal in the cases of breakfast, lunch, and snacks, and none resulted in a hypoglycemic event. However, in the case of dinner, this number was approximately 2.4 meals, and a total of 8 hypoglycemic events were observed.
To exemplify the performance of the approach, the four-hour postprandial BG curves for each meal are illustrated in Figs. 5, 6, and 7. The BG followed a similar trajectory in all three cases. The postprandial peak BG values were higher in the case of the FAID system, reflecting the 30 to 40 min of delay in the delivery of the insulin bolus as a consequence of meal detection. The populational values of the meal detection time in minutes are represented by filled circles (pink) in the case of the FAID. Points on top of each other represent meals on different days with the same time of detection, whereas points along the x-axis represent meals with different times of detection. The time of detection is represented on the x-axis in minutes, with the meal appearing at t = 0.

Table 1.
Comparison of standardized CGM metrics and insulin data for the FAID system. 1 HAID SBC MCHO = Hybrid automatic insulin delivery (closed-loop) with standard bolus calculator and CHO misestimation.

Discussion
The development of reliable and safe FAID systems is one of the current mainstreams in DM1 technology research. Although many disturbances affect people with DM1, such as exercise, stress, or other medications, it is common practice to classify FAID systems as those that do not require meal input. To accomplish a FAID system, meals first have to be detected accurately and in a timely manner, and then compensated. Therefore, the performance of these types of systems can be affected by two core steps: 1) detection and 2) compensation.
Several attempts have been made in the pursuit of a reliable FAID system. A learning-MPC algorithm was validated in an inpatient clinical study of a single unannounced meal in 29 patients with DM1 46. No severe hypoglycemia was recorded, and it was suggested to extend the duration of the clinical trials and the number of unannounced meals in a future study. An analysis of the initial safety and efficacy of a FAID system based on a multiple-model probabilistic controller was presented for patients with DM1 47. Thirty hours of inpatient study in 10 patients and 54 hours of supervised hotel study in 15 patients were performed, challenging the controller with unannounced meals. It was concluded that there exists a greater risk of hypoglycemia compared to that of the HAID algorithms. A meal detection and estimation module relying on a fuzzy logic algorithm was also presented 48. The algorithm was evaluated in a retrospective study on a total of 117 meals and 11 patients. The percentage of FPs reported was 20.8%. The detector was integrated with the AP system, but the calculation of the insulin bolus was still dependent on the patient's CR. In a more recent study, an internal model control approach was used to derive a feedback controller for a FAID system and was tested in the UVa/Padova DM1 simulator. The outcome was presented in terms of the CGM curve and compared with open-loop therapy, and it was reported that the postprandial peak was reduced by approximately 8% 49.
In this work, we have proposed a FAID system to compensate for meal disturbances by utilizing a DRL insulin bolus calculator. Three core components were integrated to implement the FAID system: a closed-loop PD controller for continuous insulin delivery, a detection algorithm for meal disturbances, and the DRL-based insulin bolus calculator. The proposed DRL insulin bolus calculator builds on top of our previous work 28 and goes one step further. The key novelties of this paper include: 1) the complete elimination of meal announcements; 2) the improvement of the RL algorithm by using DRL based on DNNs; and 3) the integration of the closed-loop controller and meal detector algorithms together with the DRL system. Specifically, the state space and action space of the DRL algorithm have been reworked and improved. On the one hand, the use of DNNs allowed the state space to be described in continuous form; it is now composed of 15 continuous parameters. On the other hand, an additional subspace was added to the action space to increase the range of actions available to the DRL agent. Additionally, one design benefit of the proposed system is that it could also accommodate announced meals without knowing the CHO content, unlike the methodologies presented in the literature. In such cases, the insulin bolus calculator could be fed by the meal announcement instead of the meal detector.

Performance analysis
The primary CGM metrics are presented in Table 1. CHO misestimation is included in the HAID with SBC to depict a real-life scenario. The absolute CGM values (mean, median, and maximum) are similar, whereas the minimum CGM is lower in the case of the FAID system because the insulin bolus calculation does not utilize CHO information and there is an inherent delay in bolus delivery due to meal detection. The CV was slightly higher for the FAID system but was within the acceptable range of < 36% recommended by an international consensus report 44. The GMI, an approximation of the A1C level based on the average BG from CGM 50, was similar in all cases.
The PCGM in the tight target range (70-140 mg/dL) was similar, and that in the target range (70-180 mg/dL) was lower by 5% in the FAID system. First, the PCGM in the range below 70 mg/dL accounted for approximately 1%, owing to the reasons mentioned above. Second, an increase was observed in the PCGM in the range above 180 mg/dL. This increase was induced by the delay in bolus insulin delivery proportional to the meal detection duration. Moreover, a less aggressive dosing of bolus insulin, as reflected by the greater penalties for hypoglycemia, also lowers the PCGM in the target range (70-180 mg/dL).
A comparison of the postprandial performance is explicitly presented in terms of populational postprandial BG curves for the three major meals in Figs. 5, 6, and 7. For all three meals, a similar pattern was observed, i.e., the peak was higher and the slope of the BG dip was steeper in the case of the FAID system as a consequence of the delay in insulin bolus delivery. Despite the steeper slope of the BG dip, there was no risk of severe hypoglycemic events owing to the higher peaks in the postprandial period. To show the overall daily glucose profiles, Fig. 8 is presented.
The improvement in the policy and performance of the DQN agents during the training session is presented in terms of the total number of hypoglycemia events in Fig. 9. Each point in the plot represents the median number of hypoglycemia events per day over all patients for 25 days. A window of 25 days was selected to highlight the trend in the number of hypoglycemia events as training progressed. During training, an epsilon-greedy policy consisting of both exploitation and exploration was used; therefore, the trend was not downward throughout, but the overall tendency was. As is clear from Table 1, the time spent in hypoglycemia was approximately 1% when the trained DRL agents were deployed.

Comparison with the state of the art
Two RL algorithms are considered for comparison in this subsection; both represent HAID systems. The RL algorithm presented in 27 learns the programmable basal rates and the CRs for insulin bolus calculation; the simulator used for its in-silico validation was based on the Hovorka model 51. The DRL algorithm proposed in 52 is based on a double deep Q-learning topology and is validated on the UVa/Padova simulator. The major advantage of our work compared to the algorithms presented in the literature is that it does not require estimating the CHO content of meals and works in a fully automatic fashion. A comparison in terms of the key percentages of time in CGM ranges is provided in Table 3. It is evident from the table that the safety mechanisms presented in this study to avoid hypoglycemia are reflected in the results. A head-to-head comparison is not possible because of the differences in the simulation environments used for the validation of the algorithms. The RL algorithms developed for other therapies, such as multiple daily injections 53 or basal insulin dosing 54, are not considered. A comparison with other FAID systems is not provided because, to the best of the authors' knowledge, this is the first attempt to analyze the performance of DRL in a FAID system.

Figure 2.
Figure 2. Representation of the DRL algorithm based on the DQN. The states feed the DQN to approximate the optimal policy Q*(s, a). A randomly extracted mini-batch of experiences is also utilized by the DQN. The action A_t corresponds to the maximum Q-value, which is the insulin bolus to be delivered to the patient. As a result, a transition occurs to the state S_{t+1}, and the memory buffer is updated with the new experience.
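The experience-replay and mini-batch mechanics described in the caption can be sketched as follows. This is a schematic of a generic DQN memory buffer and Bellman-target computation; the class and function names, capacity, and discount factor are illustrative and not taken from the paper:

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) experiences."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Randomly extracted mini-batch, as in the figure.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q(s', a'), zeroed at terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)
```

The network parameters are then fit toward these targets for the sampled mini-batch, which decorrelates consecutive experiences and stabilizes training.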

Figure 3.
Figure 3. Demonstration of the selection of a subaction space based on the CGM value before a meal.
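The idea behind Fig. 3 can be sketched as follows. The thresholds and the fractions of the action space retained below are hypothetical placeholders, not the paper's values; the intent is only to show how a lower pre-meal CGM maps to a smaller, more conservative set of candidate boluses:

```python
def select_subaction_space(cgm_before_meal, full_action_space):
    """Restrict the discrete bolus action space based on the pre-meal CGM (mg/dL).

    Thresholds and fractions are illustrative assumptions.
    """
    if cgm_before_meal < 70:     # hypoglycemia risk: smallest boluses only
        return full_action_space[: max(1, len(full_action_space) // 4)]
    if cgm_before_meal < 140:    # normoglycemia: moderate boluses
        return full_action_space[: max(1, len(full_action_space) // 2)]
    return full_action_space     # elevated glucose: full bolus range available
```

Constraining the agent's choices in this way is a safety mechanism: actions that could plausibly cause hypoglycemia are simply removed from consideration before the Q-values are compared.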

Figure 4.
Figure 4. Reward function for the proposed DRL algorithm. The green region represents the immediate reward when G_pp is in the healthy range, yellow for hyperglycemia, and red for hypoglycemia.
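A piecewise reward of the shape described in the caption can be sketched as below. The numeric reward values and range boundaries are assumptions chosen only to reflect the stated structure (positive in the healthy range, negative for hyperglycemia, and more strongly negative for hypoglycemia, consistent with the greater hypoglycemia penalties mentioned earlier):

```python
def immediate_reward(g_pp):
    """Illustrative piecewise reward on the postprandial glucose G_pp (mg/dL)."""
    if g_pp < 70:        # red region: hypoglycemia, heaviest penalty
        return -2.0
    if g_pp <= 180:      # green region: healthy range, positive reward
        return 1.0
    return -1.0          # yellow region: hyperglycemia, milder penalty
```

The asymmetry between the hypoglycemia and hyperglycemia penalties is what steers the learned policy toward less aggressive bolus dosing.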

Figure 5.
Figure 5. Four-hour postprandial BG curves for breakfast. The solid lines (middle curves) represent median values, whereas the dotted lines (upper and lower curves) correspond to the 25% and 75% interquartile range, respectively. The filled circles are the points where meals were detected, plotted against the time of detection in minutes, in the case of the FAID system.

Figure 6.
Figure 6. Four-hour postprandial BG curves for lunch. The solid lines (middle curves) represent median values, whereas the dotted lines (upper and lower curves) correspond to the 25% and 75% interquartile range, respectively. The filled circles are the points where meals were detected, plotted against the time of detection in minutes, in the case of the FAID system.

Figure 7.
Figure 7. Four-hour postprandial BG curves for dinner. The solid lines (middle curves) represent median values, whereas the dotted lines (upper and lower curves) correspond to the 25% and 75% interquartile range, respectively. The filled circles are the points where meals were detected, plotted against the time of detection in minutes, in the case of the FAID system.

Figure 8.
Figure 8. Median daily CGM profile of the whole cohort. The day starts at 12:00 AM. The three peaks correspond to breakfast, lunch, and dinner, respectively; the small spike between lunch and dinner represents the snacks. The solid lines (middle curve) represent median values, whereas the dotted lines (upper and lower curves) correspond to the 25% and 75% interquartile range, respectively.

Figure 9.
Figure 9. Populational number of hypoglycemia events throughout the training of the deep reinforcement learning algorithm for insulin bolus calculation for the FAID; the training period lasted 1500 iterations. An epsilon-greedy policy was followed during training.

Table 3.
Performance metrics of the meal detector. TBR = % of CGM values below 70 mg/dL; TIR = % of CGM values in the range 70-180 mg/dL; TAR = % of CGM values above 180 mg/dL.