Control of Blood Glucose for Type-1 Diabetes by Using Reinforcement Learning with Feedforward Algorithm

Background: Type-1 diabetes is a condition caused by the lack of the insulin hormone, which leads to an excessive increase in blood glucose level. The glucose kinetics process is difficult to control because of its complex, nonlinear nature and its state variables, which are difficult to measure.
Methods: This paper proposes a method for automatically calculating the basal and bolus insulin doses for patients with type-1 diabetes using reinforcement learning with a feedforward controller. The algorithm is designed to keep the blood glucose stable and to compensate directly for external events such as food intake. Its performance was assessed through simulation on a blood glucose model. The use of a Kalman filter together with the controller was demonstrated to estimate unmeasurable state variables.
Results: Comparison simulations between the proposed controller, the optimal reinforcement learning controller, and the proportional-integral-derivative controller show that the proposed methodology performs best in regulating the fluctuation of the blood glucose. The proposed controller also improved the blood glucose responses and prevented hypoglycemia. Simulations of the control system under different uncertain conditions provided insights into how inaccuracies in carbohydrate counting and meal-time reporting affect the performance of the control system.
Conclusion: The proposed controller is an effective tool for reducing the postmeal blood glucose rise, countering the effects of known external events such as meal intake, and maintaining blood glucose at a healthy level under uncertainties.


Introduction
Type-1 diabetes is a chronic condition characterized by an excessive increase in blood glucose level because the pancreas does not produce the insulin hormone, owing to the autoimmune destruction of pancreatic beta cells. High blood glucose can lead to both acute and chronic complications and eventually result in the failure of various organs.
To date, there are many challenges in the control of blood glucose in type-1 diabetes. One of them is that the glucose kinetics process is complex, nonlinear, and only approximately known [1].
There are also many known and unknown external factors that affect the blood glucose level, such as food intake, physical activity, stress, and hormonal changes. Generally, it is difficult to predict and quantify these factors and disturbances.
By using control theories, various studies have been conducted to design a control system for patients with type-1 diabetes. For example, Marchetti et al. [2] derived an improved proportional-integral-derivative controller for blood glucose control. Soylu et al. [3] proposed a Mamdani-type fuzzy control strategy for exogenous insulin infusion. Model predictive control has also been widely used in type-1 diabetes and artificial pancreas development [4, 5]. Recently, together with the development of artificial intelligence and machine learning, reinforcement learning (RL) has emerged as a data-driven method to control unknown nonlinear systems [6, 7] and has been used as a long-term management tool for chronic diseases [8, 9]. The biggest advantage of RL compared to other methods is that the algorithm depends only on interactions with the system and does not require a well-represented model of the environment. This makes RL especially well suited for type-1 diabetes, since modelling the insulin-glucose kinetics is complex and requires either invasive measurements on the patient or fitting to a large dataset. Hence, by using RL as the control algorithm, the modelling process can be bypassed, which makes the algorithm insensitive to modelling errors.
In diabetes, controlling the blood glucose requires actions, in the form of insulin doses or food intake, taken at specific instants throughout the day. The actions are based on the currently observable states of the patient (e.g., blood glucose measurement and heart rate). The effectiveness of an action is judged by how far the measured blood glucose value is from the healthy level. In RL, an agent makes decisions based on the current state of the environment. The task of the algorithm is to maximize a cumulative reward function or to minimize a cumulative cost function. Based on these similarities in the decision-making process between a human being and an RL agent, RL may be key to the development of an artificial pancreas system.
When dealing with meal disturbances, modelling glucose ingestion is the norm as well as the first step in designing a controller for disturbance rejection [10]. Feedforward control has proven to be an effective tool for improving disturbance rejection performance [11, 12]. In control system theory, feedforward describes a controller that acts on a measured or predicted disturbance signal rather than waiting for the disturbance to appear in the output. Compared to feedback control, where action is taken only after the output has moved away from the setpoint, the feedforward architecture is more proactive, since it uses the disturbance model to suggest the time and size of the control action. Furthermore, building a meal disturbance model is simpler and requires less data to fit than identifying the insulin-glucose kinetics. Based on the model, the necessary changes in insulin actions can be calculated to compensate for the effect of carbohydrate on the blood glucose level.
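The distinction can be illustrated with a minimal static sketch; the gains g_d and g_u below are assumed illustrative values, not parameters of the glucose model. Feedforward inverts the known disturbance path, so a measured disturbance is cancelled before feedback would even see an error:

```python
# Minimal sketch of static feedforward disturbance rejection.
# Assumed illustrative gains: the measured disturbance d raises the
# output through gain g_d, and the control input lowers it through g_u.

def feedforward_action(d, g_d=2.0, g_u=-4.0):
    """Return the control move that cancels a measured disturbance d."""
    return -g_d / g_u * d

def output(u, d, g_d=2.0, g_u=-4.0):
    """Static plant: output deviation caused by input u and disturbance d."""
    return g_u * u + g_d * d

d = 1.5                       # measured disturbance (e.g., a meal announcement)
u_ff = feedforward_action(d)  # proactive correction, applied before feedback reacts
print(output(u_ff, d))        # -> 0.0: the disturbance is cancelled at the output
```

A pure feedback loop would only start correcting after the output had already deviated; the feedforward term removes the deviation at its source.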
A challenge in the control of blood glucose is the lack of real-time measurement techniques. With the development of continuous glucose monitoring sensors, the blood glucose level can be measured and provided to the controller at minute intervals. However, the blood glucose value alone is usually not enough to describe the states of the system for control purposes. Therefore, an observer is needed to estimate the other state variables from the blood glucose measurement. In this paper, the Kalman filter was chosen for this purpose, since it provides an optimal estimation of the state variables when the system is subject to process and measurement noise [13, 14].
Vrabie et al. [15] established methodologies to obtain optimal adaptive control algorithms for dynamical systems with unknown mathematical models by using reinforcement learning. Based on that, Ngo et al. [16] proposed a reinforcement learning algorithm for updating basal rates in patients with type-1 diabetes.
This paper completes the framework for blood glucose control with both basal and bolus insulin doses. The framework includes the reinforcement learning algorithm, the feedforward controller for compensating for food intake, and the Kalman filter for estimating unmeasurable state variables during the control process.
This paper also presents simulations under uncertain information to evaluate the robustness of the proposed controller.

Problem Formulation.
The purpose of our study is to design an algorithm to control the blood glucose in patients with type-1 diabetes by means of changing the insulin concentration. The blood glucose metabolism is a dynamic system in which the blood glucose changes over time as the result of many factors, such as food intake, insulin doses, physical activity, and stress level. The learning process of RL is based on the interaction between a decision-making agent and its environment, which leads to an optimal action policy that results in desirable states [17]. The RL framework for type-1 diabetes includes the following elements: (i) The state vector at time instance k consists of the states of the patient:

x_k = [g(k) − g_d(k), χ(k)]^T,

where g(k) and g_d(k) are the measured and desired blood glucose levels, respectively, and χ(k) is the interstitial insulin activity (defined in the appendix).
(ii) The control variable (insulin action) u_k, which is part of the total insulin i_k, a combination of the basal and the bolus insulin (Figure 1):

i(k) = u_basal(k) + u_bolus(k),

where u_basal(k) and u_bolus(k) are the basal and bolus insulin at time instance k, respectively.
(iii) The cost received one time step later as a consequence of the action. In this paper, the cost was calculated by the following quadratic function:

r_{k+1} = x_k^T Q x_k + u_k^T R u_k, (3)

where Q = [1 0; 0 0.1] and R = 0.01. Each element of matrix Q and the value of R are weighting factors of the cost function. The element in the first row and first column of Q has the highest value; it weights the difference between the measured blood glucose and the prescribed healthy value. Since our ultimate goal is to reduce this difference, this factor should be the largest in the cost function. The element in the second row and second column of Q weights the interstitial insulin activity. The value of R is the weighting factor of the action (basal update). Minimizing the cost function therefore becomes the problem of minimizing the difference between the measured blood glucose and the desired value, the interstitial insulin activity, and the change in basal insulin.
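The stage cost of this quadratic function can be sketched in code as follows; the state composition [g − g_d, χ] follows the formulation above, and the numerical example is purely illustrative:

```python
import numpy as np

# Stage cost of equation (3), with the paper's weights Q and R.
Q = np.array([[1.0, 0.0],
              [0.0, 0.1]])
R = 0.01

def stage_cost(x, u):
    """r_{k+1} = x^T Q x + u^T R u for state x = [g - g_d, chi] and scalar action u."""
    x = np.asarray(x, dtype=float)
    return float(x @ Q @ x + R * u * u)

# Example: a 20 mg/dL glucose error dominates the cost, as intended.
print(stage_cost([20.0, 0.5], 1.0))  # ~ 400 + 0.025 + 0.01 = 400.035
```

The dominance of the first diagonal entry of Q is visible directly: the glucose error contributes 400 to the cost while the insulin terms contribute only fractions.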
At time instance k + 1, a sequence of observations would be x_k, u_k, r_{k+1}, x_{k+1}, and u_{k+1}. Based on this observation, the agent receives information about the state of the patient and chooses an insulin action. The body reacts to this action and transitions to a new state. This determines the cost of the action.
For the control design purpose, the blood glucose model (Appendix) was divided into three submodels: the meal (G_meal), the insulin (G_ins), and the glucose kinetics (G_glucose). The controller has three main components: the actor, the critic, and the feedforward algorithm. The critic is used to estimate the action-value function, the actor's task is to obtain the optimal basal insulin, and the feedforward algorithm is used to propose the bolus insulin profile for disturbance compensation (food intake). The purpose of the Kalman filter is to estimate unmeasurable states of the patient.

Basal Update by Actor and Critic.
When the patient is in a fasting condition, the controller only needs to change the basal insulin level through the actor and the critic. Based on the current state x_k, the actor proposes an insulin action u_k through the policy π: u_k = π(x_k), from which the updated basal rate is obtained. After each action, the patient transitions to a new state, and the cost associated with the previous action can be calculated using equation (3). The action-value function (Q-function) of action u is defined as the accumulated cost when the controller takes action u_k = u at time instance k and then continues following policy π(x_{k+1}):

Q^π_k(x, u) = Σ_{j=0}^∞ γ^j r_{k+1+j},

where γ (with 0 < γ ≤ 1) is the discount factor that indicates the weighting of future cost in the action-value function.
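As a toy illustration, the accumulated cost of a finite sequence of stage costs can be computed as below; the discount factor value 0.9 is an assumption for illustration, since the section does not fix γ:

```python
# Discounted accumulation of stage costs, as in the Q-function definition.
# gamma = 0.9 is an assumed illustrative value.

def discounted_return(costs, gamma=0.9):
    """Accumulated cost sum_{j>=0} gamma^j * r_{k+1+j} over a finite horizon."""
    return sum((gamma ** j) * r for j, r in enumerate(costs))

# A policy that drives the glucose error down yields a bounded accumulated cost.
costs = [400.0, 100.0, 25.0, 6.25]
print(discounted_return(costs, gamma=0.9))
```

With γ < 1, costs incurred far in the future are weighted down geometrically, which is what makes the infinite sum finite for a stabilizing policy.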
The action-value function depends on the current state and the next action. It was shown that the action-value function satisfies the following recursive equation (Bellman equation) [15, 17]:

Q^π_k(x_k, u_k) = r_{k+1} + γ Q^π_k(x_{k+1}, π(x_{k+1})). (6)

Since the state space and action space are infinite, function approximation was used in this paper to estimate the Q-function. In this case, the Q-function was approximated as a quadratic function of the vectors x_k and u_k:

Q^π_k(x_k, u_k) ≈ z_k^T P z_k,

where the symmetric and positive definite matrix P is called the kernel matrix and contains the parameters that need to be estimated. The vector z_k is the combined vector of x_k and u_k:

z_k = [x_k^T, u_k^T]^T.

With the Kronecker operation, the approximated Q-function can be expressed as a linear combination of the basis functions:

Q^π_k(x_k, u_k) ≈ w^T Φ(z_k), Φ(z_k) = z_k ⊗ z_k,

where w is the vector that contains the elements of P and ⊗ is the Kronecker product. By substituting Q^π_k(x, u) in equation (6) with w^T Φ(z_k) and using the policy iteration method with the least squares algorithm [15], the elements of the vector w can be estimated. Matrix P can then be obtained from w using the Kronecker transformation.
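The identity behind this linear-in-parameters form can be checked numerically with NumPy's Kronecker product; the kernel matrix P below is an assumed example, not the learned one:

```python
import numpy as np

# Sketch of the quadratic Q-function approximation Q(x,u) = z^T P z with
# z = [x; u], expressed as w^T Phi(z) where Phi(z) = z kron z.
# P is an assumed symmetric positive definite example kernel.

def phi(z):
    return np.kron(z, z)

P = np.array([[2.0, 0.1, 0.3],
              [0.1, 1.0, 0.2],
              [0.3, 0.2, 0.5]])
w = P.flatten()                 # parameter vector: the elements of P

x = np.array([20.0, 0.5])       # state [g - g_d, chi]
u = 1.0                         # scalar insulin action
z = np.concatenate([x, [u]])

# The linear combination of basis functions reproduces the quadratic form.
print(z @ P @ z, w @ phi(z))    # the two values agree to floating-point precision
```

This is why least squares applies: Q is nonlinear in z but linear in the parameter vector w, so each observed transition yields one linear equation in w.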
By decomposing the kernel matrix P into the smaller matrices P_xx, P_xu, P_ux, and P_uu, the approximated Q-function can be written as follows:

Q^π_k(x, u) = [x^T, u^T] [P_xx P_xu; P_ux P_uu] [x; u].

The current policy is improved with actions that minimize the Q-function Q^π_k(x, u). This can be done by first taking the partial derivative of the Q-function and then solving ∂Q^π_k(x, u)/∂u = 0. The optimal solution can thereafter be obtained as follows [15]:

u_k = −P_uu^{−1} P_ux x_k. (12)

With that, the update of basal insulin is

u_basal(k) = i_be + u_k,

where i_be is the equilibrium basal plasma insulin concentration.
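A sketch of this policy-improvement step, with an assumed example kernel matrix P in place of the learned one:

```python
import numpy as np

# Policy improvement for a quadratic Q-function: decompose the kernel matrix P
# into blocks and take the action minimizing Q, i.e. u* = -P_uu^{-1} P_ux x.
# P is an assumed example; in the algorithm it comes from policy evaluation.

n_x, n_u = 2, 1                 # state and action dimensions
P = np.array([[2.0, 0.1, 0.3],
              [0.1, 1.0, 0.2],
              [0.3, 0.2, 0.5]])
P_ux = P[n_x:, :n_x]            # (n_u, n_x) block
P_uu = P[n_x:, n_x:]            # (n_u, n_u) block

def greedy_action(x):
    """Action minimizing the quadratic Q-function at state x."""
    return -np.linalg.solve(P_uu, P_ux @ x)

x = np.array([20.0, 0.5])       # state [g - g_d, chi]
print(greedy_action(x))
```

Because Q is quadratic in u with positive definite P_uu, the stationary point of ∂Q/∂u = 0 is the unique minimizer, so no numerical search over actions is needed.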

Bolus Update by Feedforward Algorithm.
When the patient consumes a meal, in addition to the basal insulin, the controller calculates and applies boluses to compensate for the rise of blood glucose caused by the carbohydrate in the food. The feedforward algorithm first predicts how much the blood glucose level will rise and then suggests a bolus profile. Descriptions and values of τ_D, p_2, and p_3 are shown in Tables 1 and 2. The transfer function F(s) from the meal intake D(s) to the blood glucose level g(s) can be calculated as

F(s) = G_meal(s) + G_ff(s) G_ins(s).

In order to compensate for the meal, the gain of the open-loop system F(s) must be made as small as possible. Hence, the feedforward transfer function was chosen such that G_meal(s) + G_ff(s) G_ins(s) → 0, which leads to

G_ff(s) = −G_meal(s) / G_ins(s).

The meal compensation bolus in the s-domain can be calculated from the feedforward transfer function:

u_bolus(s) = G_ff(s) D(s).

Hence, the feedforward action becomes the output of the corresponding dynamic system, which can be solved easily using any ordinary differential equation solver.
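A minimal sketch of this cancellation, with assumed first-order stand-ins for G_meal and G_ins (the paper's submodels are more detailed); the resulting lead-lag G_ff is realized as an ODE and stepped with forward Euler:

```python
# Feedforward cancellation G_ff = -G_meal / G_ins with assumed first-order
# models G_meal(s) = k_m/(tau_m*s + 1) and G_ins(s) = k_i/(tau_i*s + 1).
# All parameter values below are illustrative assumptions.

k_m, tau_m = 1.0, 50.0          # meal-to-glucose gain and time constant
k_i, tau_i = -2.0, 30.0         # insulin-to-glucose gain (negative) and time constant
dt = 0.1                        # Euler step

# G_ff(s) = -(k_m/k_i) * (tau_i*s + 1)/(tau_m*s + 1), realized with state x:
#   tau_m * x' = d - x,   u_ff = -(k_m/k_i) * (tau_i * x' + x)
def simulate_ff(d_profile):
    x, out = 0.0, []
    for d in d_profile:
        dx = (d - x) / tau_m
        out.append(-(k_m / k_i) * (tau_i * dx + x))
        x += dt * dx
    return out

# Step meal announcement: the bolus acts immediately, then settles.
u_ff = simulate_ff([1.0] * 1000)
print(u_ff[0], u_ff[-1])
```

The immediate jump of the output at t = 0 is the point of the feedforward path: the bolus starts acting on the announced meal before any glucose rise is measured.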

Kalman Filter for Type-1 Diabetes System.
Since the interstitial insulin activity and the amounts of glucose in compartments 1 and 2 cannot be measured directly during implementation, a Kalman filter was used to provide an estimation of the state variables from the blood glucose level. The discretized version of the type-1 diabetes system can be written in the following form:

x(k + 1) = A_K x(k) + B_K u(k) + H_K w(k),
y_K(k) = C_K x(k) + v(k),

where A_K, B_K, and C_K are linearized coefficient matrices of the model, H_K = [0, 0, 0, p_3 V]^T is the noise input matrix, the output value y_K(k) = g(k) − g_d(k) is the measured blood glucose deviation from the desired level, w(k) is the insulin input noise, and v(k) is the blood glucose measurement noise with zero-mean Gaussian distribution. The variances of w(k) and v(k) are denoted R_w and R_v, respectively. Based on the discretized model, a Kalman filter was implemented through the following equation:

x̂(k + 1 | k) = A_K x̂(k | k − 1) + B_K u(k) + L [y_K(k) − C_K x̂(k | k − 1)],

where x̂(k + 1 | k) denotes the estimation of x(k + 1) based on measurements available at time k. The gain L is the steady-state Kalman filter gain, which can be calculated by

L = A_K M C_K^T (C_K M C_K^T + R_v)^{−1}, (23)

where M is the solution of the corresponding algebraic Riccati equation [13, 14, 18]:

M = A_K M A_K^T − A_K M C_K^T (C_K M C_K^T + R_v)^{−1} C_K M A_K^T + H_K R_w H_K^T. (24)

By assuming the noise variances to be R_w = R_v = 0.01, the Kalman filter gain was calculated from equation (23) as

L = [0, 0, 8.32 · 10^−4, −6.40 · 10^−7]^T. (25)
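The steady-state gain computation can be sketched with SciPy's discrete algebraic Riccati solver; the matrices below are small assumed examples, not the linearized glucose model:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Sketch of the steady-state Kalman gain L = A M C^T (C M C^T + R_v)^{-1},
# where M solves the filtering algebraic Riccati equation.
# A, C, H below are assumed 2-state examples, not the paper's model.

A = np.array([[0.99, 0.10],
              [0.00, 0.95]])
C = np.array([[1.0, 0.0]])      # only the first state (glucose deviation) is measured
H = np.array([[0.0], [1.0]])    # process noise enters through the second state
R_w, R_v = 0.01, 0.01           # noise variances, as assumed in the text

# solve_discrete_are solves the dual (estimation) Riccati equation when
# called with A^T and C^T in place of the control pair (A, B).
M = solve_discrete_are(A.T, C.T, H @ H.T * R_w, np.array([[R_v]]))
L = A @ M @ C.T @ np.linalg.inv(C @ M @ C.T + R_v)
print(L.shape)                  # (2, 1): one gain entry per state
```

With the stabilizing Riccati solution, the predictor matrix A − L C has all eigenvalues inside the unit circle, so the estimation error of the unmeasured state decays.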

Simulation Setup.
First, a pretraining of the algorithm was conducted on the type-1 diabetes model in the scenario where the patient is in a fasting condition (without food intake). The purpose of the pretraining simulation is to obtain an initial estimation of the action-value function for the algorithm. The learning process was conducted by repeating the experiment multiple times (episodes). Each episode starts with an initial blood glucose of 90 mg/dL and ends after 30 minutes. The objective of the algorithm is to search for and explore actions that can drive the blood glucose to its target level of 80 mg/dL. Using the initial estimation of the action-value function, the controller was then tested in a daily scenario with food intake. Comparisons were made between the proposed reinforcement learning with feedforward (RLFF) controller, the optimal RL (ORL) controller [15], and the proportional-integral-derivative (PID) controller. The ORL was designed with the same parameters and pretrained in the same scenario as the RLFF. The PID controller gains were chosen to produce a blood glucose settling time similar to that of the RLFF, where K_p = 1. In order to understand the effects of different food types on the controlled system, two sets of simulations were conducted for foods with slow and fast glucose absorption rates but a similar amount of carbohydrate. Absorption rates in the simulations are characterized by the parameter τ_D from the model, where τ_D = 50 corresponds to food with a slow absorption rate and τ_D = 10 corresponds to food with a fast absorption rate. The amount of carbohydrate (CHO) per meal can be found in Figure 2.
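For reference, a minimal discrete PID of the kind used for comparison might look as follows; only K_p = 1 is stated above, so the integral and derivative gains here are assumed placeholders:

```python
# Minimal discrete PID sketch of the comparison controller.
# Only kp = 1 is taken from the text; ki, kd, and dt are assumed values.

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        """One control update from the current error (e.g., g - g_d)."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=1.0, ki=0.05, kd=0.2, dt=1.0)
# First step includes a derivative kick; afterwards the integral accumulates.
print(pid.step(10.0), pid.step(10.0))
```

Unlike the RLFF, such a controller reacts only to the measured error, which is why its postmeal excursions in the comparison are larger.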
Next, the performance of the proposed controller was evaluated under uncertainties of meal information. Two cases of uncertainty were considered: uncertain CHO estimation and uncertain meal-recording time. In the uncertain CHO estimation case, the estimated CHO information provided to the controller was assumed to follow a normal distribution with a standard deviation of 46% of the correct carbohydrate value shown in Figure 2. This standard deviation was based on the comparison of average adult estimates with computerized evaluations by a dietitian [19]. For the uncertain meal-recording time, the estimated meal starting time was assumed to follow a normal distribution with a standard deviation of two minutes from the real starting time. This standard deviation was chosen arbitrarily, because systematic research on the accuracy of meal-time recording for patients with type-1 diabetes could not be found. For each case, multiple simulations were conducted in the same closed-loop system with the corresponding random variables. From the obtained results, the mean and standard deviation of the blood glucose responses at each time point were calculated and analyzed.
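The meal-uncertainty sampling described above can be sketched as follows; the clipping of negative carbohydrate draws and the example meal values are added assumptions:

```python
import numpy as np

# Monte Carlo sampling of announced meal information: CHO with 46% relative
# standard deviation, meal time with a 2-minute standard deviation.
# The example meal (60 g at minute 480) is an assumed illustration.

rng = np.random.default_rng(0)

def sample_meal(cho_true, t_true, n=1000):
    """Draw n announced (CHO, time) pairs around the true meal."""
    cho = rng.normal(cho_true, 0.46 * cho_true, size=n)
    t = rng.normal(t_true, 2.0, size=n)
    return np.clip(cho, 0.0, None), t   # negative carb estimates are not physical

cho, t = sample_meal(cho_true=60.0, t_true=480.0)
print(cho.mean(), t.std())
```

Each sampled pair would drive one closed-loop simulation; the pointwise mean and standard deviation of the resulting glucose traces give the shaded bands of Figures 7 and 8.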

Results
After pretraining in the no-meal scenario, an initial estimate of the Q-function was obtained, and the initial basal policy was derived from it using equation (12). Both were used for the subsequent testing simulations of the control algorithm.
During the simulation with correct meal information, the blood glucose responses of the RLFF, the ORL, and the PID are shown in Figures 3 and 4. The insulin concentrations during the process can be found in Figures 5 and 6. With slow-absorption food, the fluctuation range of the blood glucose was approximately ±30 mg/dL from the desired value for all three controllers (Figure 3). However, with fast-absorption meals, the fluctuation range of the postmeal blood glucose level was within ±40 mg/dL with the RLFF, compared to ±60 mg/dL with the ORL, and is significantly smaller than the ±80 mg/dL fluctuation range of the PID (Figure 4). Figures 7 and 8 show the blood glucose variation under uncertain meal time and CHO counting. The upper and lower bounds of the shaded areas show the mean blood glucose value plus and minus the standard deviation at each instance. Under uncertain meal information, the upper bound was kept within 40 mg/dL of the desired blood glucose value for fast-absorption food and 15 mg/dL for slow-absorption food. The lower bound is within 15 mg/dL of the desired value for fast-absorption food and 5 mg/dL for slow-absorption food.

Discussion
The controller has shown its capability to reduce the rise of postmeal blood glucose in our simulations. It can be seen in Figures 3 and 4 that all three controllers were able to stabilize the blood glucose. However, with the RLFF, the added bolus makes the insulin response much faster when there is a change in blood glucose level, which reduces the peak of the postmeal glucose rise by approximately 30 percent compared to the ORL and 50 percent compared to the PID in the fast-absorption case. It can also be seen that the blood glucose undershoot (the distance between the lowest blood glucose and the desired blood glucose value) of the PID controller is much larger than that of the RLFF and the ORL. The RLFF has the smallest glucose undershoot among the three controllers. A low blood glucose value (hypoglycemia) can be very dangerous for patients with type-1 diabetes; the simulation results therefore show the advantage of the RLFF in improving safety for patients. In general, with the feedforward algorithm, the proposed algorithm is an effective tool for countering the effects of external events such as meal intake.
Among the uncertainties, carbohydrate counting had a greater effect on the variation of the blood glucose than meal-time recording, especially with slow-absorbing food. The uncertainty in recording meal time may also lead to a larger undershoot of blood glucose below the desired level, as can be seen in Figure 7. Following the same trend as the previous simulations, the fluctuation range of the blood glucose with slow-absorbing food is smaller than that with fast-absorbing food. In general, the control algorithm kept the blood glucose at a healthy level, although the uncertainties affected the variation of the responses. However, accurate carbohydrate counting and meal-time recording methods remain important for blood glucose control in order to completely avoid the chance of hypoglycemia.

Conclusion
This paper proposes a blood glucose controller based on reinforcement learning and a feedforward algorithm for type-1 diabetes. The controller regulates the patient's glucose level using both basal and bolus insulin. Simulation results of the proposed controller, the optimal reinforcement learning controller, and the PID controller on a type-1 diabetes model show that the proposed algorithm is the most effective. The basal updates can stabilize the blood glucose, and the bolus can reduce the glucose undershoot and prevent hypoglycemia. Comparison of the blood glucose variation under different uncertainties provides insight into how the accuracy of carbohydrate estimation and meal-time recording affects the closed-loop responses.
The results show that the control algorithm was able to keep the blood glucose at a healthy level, although the uncertainties create variations in the blood glucose responses.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.