Personalized control system via reinforcement learning: maximizing utility based on user ratings

In this paper, we address the design of personalized control systems, which pursue individual objectives defined for each user. To this end, a reinforcement learning problem is formulated in which an individual objective function is estimated from the user's rating of his/her current control system, and the corresponding optimal controller is updated. The novelty of the problem setting lies in the modelling of the user rating. The rating is modelled as a quantization of the user utility gained from the control system, defined by the value of the objective function at the user's control experience. We propose an estimation algorithm to update the control law. Through a numerical experiment, we find that the proposed algorithm realizes a personalized control system.


Introduction
Robust design has been a basic design concept of control systems: common control systems are designed for various system-users such that the systems operate stably regardless of the differences between user environments. Beyond robustness, "personalization" can be an advanced design concept of control systems: an individual control system is designed for each system-user such that the system operates while pursuing high control performance and improving user utility [1,2]. In this paper, we address the personalization of control systems, in particular, a design methodology for control systems that possess an adaptation function for improving user utility.
As studied in other research fields, a key to personalization is the modelling of system-users [3,4]. For example, in [5,6], data on manual driving is used to model driver intent. In these previous works, the user modelling is based on measured data of the user's actions on the control system. In this paper, by contrast, the data on the actions is not available; instead, the result of a user rating of the control system is available. The user rating is collected for every control operation and is used to estimate the private objective function of the user, and the optimal controller is updated based on the estimated objective function. By repeating the controller update, we aim at maximizing the user utility gained from the control system.
As an application of the presented personalized control system, let us imagine an automatic driving system.
The system drives automatically, and the user rates the driving control system based on his/her comfort, for example, once a month. The control system accesses the result of the user rating to estimate his/her preference, modelled by a parameterized objective function. Then, the implemented control law is updated based on the estimated objective function.
In general, a problem of reinforcing system performance by learning techniques is called reinforcement learning (RL), which has been addressed in the literature; see e.g. [7] and the references therein. RL has been applied to the design of feedback control systems [8-11]. In [8,9], continuous control problems are addressed, unlike standard RL problems, and the control law is updated directly. Furthermore, in [10,11], RL is combined with model predictive control, and the objective function is tuned to update the optimal control law indirectly. The main difference between this paper and the literature is the assumption on the objective function and/or reward. In the literature, the objective function is designable, while in this paper, the function is not designable: it is pre-defined but hidden by the system-user.
The rest of this paper is organized as follows. In Section 2, the models of the control system and the system-user are given, and the problem of the control system update is formulated. In Section 3, we propose an algorithm for estimating the user objective function, which plays a central role in personalization. In Section 4, we give a convergence analysis of the proposed algorithm. In Section 5, we present a numerical experiment using the proposed algorithm. In Section 6, the conclusion is given.

Model of control systems
We consider a control system that is composed of a plant system and a controller, denoted by P and K, respectively. The plant system is modelled by the discrete-time state space equation

P : x(k + 1) = f(x(k), u(k)), x(0) = x_0,

where k ∈ N_+ is the discrete time, and x ∈ R^n and u ∈ R^m are the state and control input, respectively. The controller is designed based on the estimate of an individual objective function, which is private and is defined for each user. Let J and Ĵ represent the objective function and its estimate, respectively. Then, the control law is described by the optimization problem

K : min_u Ĵ(x, u) subject to x ∈ X, u ∈ U,

where x ∈ R^{nN} and u ∈ R^{mN} are stacked vectors composed of the sequences of the state and input, respectively, i.e. x = [x(1)ᵀ, ..., x(N)ᵀ]ᵀ and u = [u(0)ᵀ, ..., u(N − 1)ᵀ]ᵀ, and X and U are the state and input constraints, respectively. In the following discussion, {x^(d), u^(d)} represents the measured data on x and u, obtained in a control experiment and called the experience for the control system.
In addition to P and K, we should note that a system-user participates in the control system. His/her objective function is modelled by

J(x, u) = f_0(x, u) + Σ_{i=1}^{ℓ} q_i f_i(x, u), (1)

where f_i(x, u) : R^{nN} × R^{mN} → R_+, i ∈ {0, 1, ..., ℓ} are non-negative functions defined by a system-designer and q_i > 0, i ∈ {1, 2, ..., ℓ} are weighting parameters to be estimated. As in (1), the user's objective is parameterized by q_i, i ∈ {1, 2, ..., ℓ}. The system-user has the objective function (1) in his/her mind, and the weighting parameters q_i, i ∈ {1, 2, ..., ℓ} are not directly accessible for the control system update. Instead, the user rating, which includes information on q_i, i ∈ {1, 2, ..., ℓ}, is available for the update.
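As a concrete illustration of the parameterized objective (1), the following sketch evaluates J for ℓ = 2; the particular basis functions f_0, f_1, f_2 are hypothetical choices (terminal error, state energy, input effort), not the paper's definitions.

```python
import numpy as np

# Hypothetical basis functions f_i for an l = 2 objective:
# f_0 penalizes the terminal state, f_1 the state trajectory, f_2 the input effort.
def f0(x, u): return float(x[-1] @ x[-1])
def f1(x, u): return float(sum(xk @ xk for xk in x))
def f2(x, u): return float(sum(uk @ uk for uk in u))

def J(x, u, q):
    """User objective J = f_0 + sum_i q_i f_i with weights q = (q_1, ..., q_l)."""
    basis = [f1, f2]
    return f0(x, u) + sum(qi * fi(x, u) for qi, fi in zip(q, basis))

x = np.ones((5, 2))   # stacked state sequence: N = 5 steps, n = 2 states
u = np.zeros((5, 1))  # stacked input sequence
print(J(x, u, (50.0, 1.0)))  # -> 502.0, with weights like user A's (50, 1)
```

The controller only ever sees an estimate (q̂_1, q̂_2) of the weights; the true pair stays in the user's head.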
In this paper, we aim at maximizing the user's utility achieved in the control system (P, K) by updating K. To this end, the individual and private objective function J is estimated, and the control law K is updated based on the estimate Ĵ. In other words, such "personalization" of the control system is achieved by an accurate estimate of q_i, i ∈ {1, 2, ..., ℓ}. The estimation of J, i.e. that of q_i, i ∈ {1, 2, ..., ℓ}, is based on the user rating without accessing the bare data {x^(d), u^(d)}, unlike the conventional works [10,11].

Model of system-user
In the problem setting, we explicitly take into account the presence of a system-user. The user rates the control system (P, K) based on the utility determined by his/her experience, denoted by J*(x^(d), u^(d)). An example of the user rating is realized by a questionnaire: the user gives an m-grade evaluation based on his/her satisfaction with the control system, as illustrated in Figure 1.
Recall that in the control system (P, K), the controller K pursues performance in the sense of the estimated objective function Ĵ, parameterized by q̂_i, i ∈ {1, 2, ..., ℓ}, to give the experience {x^(d), u^(d)} to the user. There exists a gap between the estimated utility Ĵ(x^(d), u^(d)) and the true utility J*(x^(d), u^(d)). We assume that the user rating depends on the gap: the user rates the control system high (low) if the gap is small (large).
To model the user rating, we define the gap in the utility gained from the experience {x^(d), u^(d)} as

E_J(x^(d), u^(d)) := ( Ĵ(x^(d), u^(d)) − J*(x^(d), u^(d)) ) / J*(x^(d), u^(d)). (2)
Based on (2), the user rating r is modelled by the piecewise constant function

r(x^(d), u^(d)) = M_i if a_{i−1} ≤ |E_J(x^(d), u^(d))| < a_i, i ∈ {1, 2, ..., m}, (3)

where M_i, i ∈ {1, 2, ..., m} are positive constants satisfying M_1 > M_2 > ... > M_m, and a_i, i ∈ {1, 2, ..., m − 1} are also positive constants that indicate the ranges of |E_J|, with a_0 := 0 and a_m := ∞. One can take, for example, (M_1, M_2, ..., M_5, M_6) = (100, 80, ..., 20, 0) for a six-grade evaluation. In this setting, since the value of r is quantized, the controller K cannot access the exact value of E_J(x^(d), u^(d)).
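The quantization (3) can be sketched as a small lookup over the thresholds; the grade values follow the six-grade example in the text, while the threshold values a_i are illustrative placeholders.

```python
def rating(E_J, M=(100, 80, 60, 40, 20, 0), a=(0.05, 0.1, 0.2, 0.4, 0.8)):
    """Quantized user rating as in (3): r = M_i when a_{i-1} <= |E_J| < a_i,
    with a_0 = 0 and a_m = infinity. The threshold values here are illustrative."""
    gap = abs(E_J)
    for M_i, a_i in zip(M[:-1], a):
        if gap < a_i:
            return M_i
    return M[-1]  # gap beyond the largest threshold: lowest grade

print(rating(0.01))  # small utility gap  -> 100 (highest grade)
print(rating(0.5))   # large utility gap  -> 20
```

Note that the controller receives only the grade M_i, so the exact value of E_J is irrecoverable from r alone.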
Remark 2.1: E J > −1 holds since Ĵ > 0. This fact is used for the analysis given in Section 4.
We impose a technical assumption on the model of the system-user, denoted by H: in addition to the user rating r, the system-user H gives the sign of E_J to the control system based on his/her experience {x^(d), u^(d)}. Then, the model of H is described by

H : R(x^(d), u^(d)) = { r(x^(d), u^(d)), sgn( Ĵ(x^(d), u^(d)) − J*(x^(d), u^(d)) ) }, (4)

where sgn(·) is the sign function. We see that this R is a "reward" in the reinforcement learning framework. The controller K can access the reward R, which depends on the user experience {x^(d), u^(d)}, to estimate J* and to update its control law.
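Combining (2)-(4), the user model H can be sketched as a function returning the rating together with the sign of the gap; the normalization of the gap by J* is taken to be consistent with Remark 2.1, and the function names here are illustrative.

```python
import math

def reward(J_hat_d, J_star_d, rating_fn):
    """User model H of (4): the rating and the sign of the utility gap.
    J_hat_d and J_star_d are the estimated and true objective values at the
    experience {x^(d), u^(d)}; rating_fn maps the relative gap E_J to a grade."""
    E_J = (J_hat_d - J_star_d) / J_star_d   # relative gap as in (2)
    return rating_fn(E_J), math.copysign(1.0, J_hat_d - J_star_d)

r, s = reward(5.5, 5.0, lambda e: 100 if abs(e) < 0.2 else 0)
print(r, s)  # -> 100 1.0 (small gap, estimate above the true utility)
```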

Remark 2.2:
A similar problem of estimating objective functions and/or rewards that generate control actions is known as inverse reinforcement learning (IRL); see e.g. [12,13] for the problem setting and e.g. [14-16] for its applications. In most IRL frameworks, the control law is pre-defined and fixed, and the data it generates is available for the estimation.
On the other hand, in this paper, the control law is not fixed but is to be updated, and the rating of a system-user, who is not included in the control loop, is available. The block diagram of the control system with the user rating is illustrated in Figure 2. In the figure, the blue line connecting the controller and the plant indicates the loop of the control operation, while the red line connecting the user, controller, and plant indicates the loop of the controller update.

Problem of controller update
The control system (P, K) is updated based on the user rating R(x^(d), u^(d)). The flow of the update is given as follows.
Flow of controller update
(1) The control system at version τ, denoted by (P, K_τ), is operated, and the user gains the experience {x_τ^(d), u_τ^(d)}.
(2) The user gives his/her rating R(x_τ^(d), u_τ^(d)), i.e. a reward for (P, K_τ) is given to the controller.
(3) The parameters q_i, i ∈ {1, 2, ..., ℓ} are estimated.
(4) The control law K is updated based on the estimated objective function Ĵ_τ.
(5) The version of the control system is updated as τ ← τ + 1, and go to Stage 1.
Note that the control law K is uniquely determined once the parameter estimation is performed. This implies that the essence of the controller update is the parameter estimation, addressed at Stage 3. The estimation problem is given as follows:
Problem 2.1: Given R(x^(d), u^(d)), estimate q*_i, i ∈ {1, 2, ..., ℓ}.

Parameter estimation algorithm
In this section, we propose an algorithm for estimating the parameters that characterize the user objective function given in (1). To simplify the discussion, the objective function is characterized by only two parameters q_1 and q_2, i.e.

J(x, u) = f_0(x, u) + q_1 f_1(x, u) + q_2 f_2(x, u).

The following discussion and the derived algorithm extend to the general ℓ-parameter case in a straightforward manner.

We aim at deriving the parameter estimation algorithm. Since the user rating is modelled by the quantized function (3), the parameter estimation reduces to a class of set-membership estimation [17], as studied for state estimation problems [18,19]. Section 3.1 is devoted to the estimation of a parameter region. In Section 3.2, the parameter estimate is obtained from the region, and the estimation algorithm is presented.

Estimate of parameter region
Recall first that the user rating R, given in (4), includes rough information on the utility gained from the experience {x^(d), u^(d)}. We suppose that r = M_s holds for the experience, which implies that the user gives the s-grade to the current control system. Further supposing sgn( Ĵ^(d) − J*^(d) ) ≥ 0, we have the inequality

a_{s−1} ≤ E_J^(d) < a_s, (5)

where E_J^(d) := E_J(x^(d), u^(d)). Further, we let f_i^(d) := f_i(x^(d), u^(d)), i ∈ {0, 1, 2}. Then, we see that E_J^(d) is described by

E_J^(d) = ( f_0^(d) + q̂_{1,τ−1} f_1^(d) + q̂_{2,τ−1} f_2^(d) ) / ( f_0^(d) + q*_1 f_1^(d) + q*_2 f_2^(d) ) − 1, (6)

where q̂_{1,τ−1} and q̂_{2,τ−1} are the estimated parameters at the (τ − 1)th trial of the controller update. Then, by substituting (6) into (5), we have

(1 + a_{s−1})( f_0^(d) + q*_1 f_1^(d) + q*_2 f_2^(d) ) ≤ f_0^(d) + q̂_{1,τ−1} f_1^(d) + q̂_{2,τ−1} f_2^(d) < (1 + a_s)( f_0^(d) + q*_1 f_1^(d) + q*_2 f_2^(d) ). (7)

The set of linear inequalities in (7) represents the existence region of q*_1 and q*_2. In a similar manner, supposing sgn( Ĵ^(d) − J*^(d) ) < 0, we see that

(1 − a_s)( f_0^(d) + q*_1 f_1^(d) + q*_2 f_2^(d) ) ≤ f_0^(d) + q̂_{1,τ−1} f_1^(d) + q̂_{2,τ−1} f_2^(d) < (1 − a_{s−1})( f_0^(d) + q*_1 f_1^(d) + q*_2 f_2^(d) ) (8)

holds.
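Each rating thus pins (q*_1, q*_2) between two half-planes. A minimal sketch of this construction follows; the function name, the tuple encoding of a half-plane as (w1, w2, b) meaning w1·q1 + w2·q2 ≤ b, and the assumption of finite thresholds are illustrative, not from the paper.

```python
def halfspaces(J_hat, f, s, a, sign_nonneg):
    """Linear constraints on (q1*, q2*) implied by a rating of grade s,
    in the spirit of (7)/(8). f = (f0, f1, f2) evaluated at the experience;
    a = thresholds with a[0] = 0, assumed finite in this sketch.
    Each row (w1, w2, b) means w1*q1 + w2*q2 <= b."""
    f0, f1, f2 = f
    # Bounds on J_hat / J*(q): (5) for a non-negative gap sign, its mirror otherwise.
    lo, hi = (1 + a[s - 1], 1 + a[s]) if sign_nonneg else (1 - a[s], 1 - a[s - 1])
    return [
        (lo * f1, lo * f2, J_hat - lo * f0),    # lo * J*(q) <= J_hat
        (-hi * f1, -hi * f2, hi * f0 - J_hat),  # J_hat < hi * J*(q)
    ]

# Grade 1 rating with a perfect estimate: the true (2, 2) must satisfy both rows.
rows = halfspaces(5.0, (1.0, 1.0, 1.0), 1, [0.0, 0.1, 0.4], True)
print(all(w1 * 2 + w2 * 2 <= b + 1e-9 for w1, w2, b in rows))  # -> True
```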
Consider again the τth trial of the controller update in order to derive the parameter estimation algorithm: a control operation is performed, and the system-user rates the control system (P, K_τ) based on his/her experience {x_τ^(d), u_τ^(d)}. We suppose here that in the rating R(x_τ^(d), u_τ^(d)) = M_s holds, i.e. the user rates the system by the s-grade. Then, letting S_τ denote the region of (q*_1, q*_2) determined by (7) or (8) at the τth trial (9), and recalling the user ratings given at the first to the (τ − 1)th control operations, we see that the parameter existence region is given by

Q_τ = S_0 ∩ S_1 ∩ ... ∩ S_τ, (10)

where S_0 is the initial guess of the parameter region. The update of the parameter existence region is illustrated in Figure 3. In the figure, the region enclosed by the red line is S_0, and the region enclosed by the blue line is S_1. The coloured area represents the parameter existence region Q_1. In this way, we contract the parameter existence region by taking the intersection repeatedly. In the next subsection, we define the parameter estimate (q̂_{1,τ}, q̂_{2,τ}) from Q_τ.

Update method of estimated parameters
One can define the estimate (q̂_{1,τ}, q̂_{2,τ}) as the "centre" of the region Q_τ. A drawback of taking the centre is its complexity: as the number of controller updates τ increases, it becomes difficult to find the centre of Q_τ in (10), which is a polytope. To find the estimate (q̂_{1,τ}, q̂_{2,τ}) in a computationally tractable way, we apply an approximation of Q_τ. Let Q_rect,τ be the rectangular region that approximates Q_τ, defined by

Q_rect,τ := [q̲_{1,τ}, q̄_{1,τ}] × [q̲_{2,τ}, q̄_{2,τ}], (11)

where q̲_{i,τ} and q̄_{i,τ} are defined by

q̲_{i,τ} := min_{(q_1, q_2) ∈ Q_τ} q_i, q̄_{i,τ} := max_{(q_1, q_2) ∈ Q_τ} q_i,

respectively. Note that the region Q_rect,τ is an "outer" approximation of Q_τ, i.e. it holds that Q_τ ⊆ Q_rect,τ.
By taking the approximation (11) of Q_τ at every controller update τ, we can find the estimate (q̂_{1,τ}, q̂_{2,τ}) in a simplified manner as

(q̂_{1,τ}, q̂_{2,τ}) = C(Q_rect,τ), (12)

where C(·) represents the centre of gravity, i.e. C(Q_rect,τ) = ( (q̲_{1,τ} + q̄_{1,τ})/2, (q̲_{2,τ} + q̄_{2,τ})/2 ).
The approximation of the parameter existence region and the parameter estimate in (12) are illustrated in Figure 4. In the figure, the black dotted line represents the rectangular region Q_rect,1, while the blue triangle represents the estimated parameters (q̂_{1,1}, q̂_{2,1}). Finally, the algorithm for estimating the parameters {q*_1, q*_2} is summarized in Algorithm 1.
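The bounds in (11) are each a linear program over the polytope Q_τ, so the rectangle and its centre (12) can be sketched with an off-the-shelf LP solver; the function name and the unit-square test polytope are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def rect_center(A, b):
    """Outer rectangle (11) of the polytope {q : A q <= b} and its centre (12).
    Each coordinate bound is found by minimizing/maximizing q_i with an LP."""
    lo, hi = np.zeros(2), np.zeros(2)
    for i in range(2):
        c = np.zeros(2)
        c[i] = 1.0
        lo[i] = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2).fun
        hi[i] = -linprog(-c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2).fun
    return (lo + hi) / 2  # centre of gravity of the rectangle

# Unit square 0 <= q1, q2 <= 1: its outer rectangle is the square itself.
A = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
b = np.array([1., 0., 1., 0.])
print(rect_center(A, b))  # -> approximately (0.5, 0.5)
```

Two LPs per coordinate keep the per-update cost low even as the intersection (10) accumulates many half-planes, which is exactly the tractability argument for the outer approximation.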

Analysis of algorithm
In this section, we address the convergence analysis of the proposed algorithm. We present the following theorem, which states the contraction of the region Q_rect,τ.

Theorem 4.1:
For every τ ∈ N_+, it holds that vol(Q_rect,τ) ≤ vol(Q_rect,τ−1), (13) where vol(S) represents the volume of S.
To prove (13), we show that one of the following four conditions holds.
The graphical interpretation of conditions (i)-(iv) is given in Figure 5. We see that (i) holds if and only if the corresponding line intersects the edge that is a part of the boundary of Q_rect,τ. In the same way as the derivation of (7) from (5), the condition for the intersection is described by a linear inequality; in a similar manner, (ii)-(iv) are equivalently reduced. Now, we suppose that none of (i)-(iv) holds in order to prove (13) by contradiction. Then, both of the corresponding inequalities follow. Here, we note that q̂_{1,τ} f_1^(d) + q̂_{2,τ} f_2^(d) must exceed the corresponding bound since a_{s−1} > 0 and f_0 ≥ 0. This contradicts that none of (i)-(iv) holds. Next, we consider the case sgn( Ĵ^(d) − J*^(d) ) < 0, which implies that (14) is reduced to −a_s ≤ E_J^(d) < −a_{s−1}.
In the same way as in the case sgn( Ĵ^(d) − J*^(d) ) ≥ 0, to prove (13), we show that at least one of the following conditions holds.
Supposing that none of (i)-(iv) holds, we derive a contradiction. Recall E_J > −1, as stated in Remark 2.1. It follows that −a_{s−1} > −1 holds. Consequently, q̂_{1,τ} f_1^(d) + q̂_{2,τ} f_2^(d) is bounded accordingly, which contradicts that none of (i)-(iv) holds. This concludes the proof of the theorem. The asymptotic convergence of Q_τ to (q*_1, q*_2) is not guaranteed by this analysis but is numerically verified in the demonstration given in Section 5.

Numerical experiment
In this section, we present a numerical experiment of the proposed control system with Algorithm 1. In the experiment, we demonstrate personalization using the proposed control system, considering two users.

Problem setting
We address the LQR problem. The plant system is given by a discrete-time linear state space equation (16) with system matrices A and B. The objective function of a system-user is described by a quadratic form with weighting matrices Q and W. The corresponding optimal control law is given by

u = −K(Q, W) x, (17)

where K(Q, W) = W⁻¹BᵀP is the optimal feedback gain and P is the solution to the Riccati equation PA + AᵀP − PBW⁻¹BᵀP + Q = 0. We consider two users, user A and user B. The parameters in J for user A are given by Q*_A = diag{q*_{1,A}, q*_{2,A}} = diag{50, 1} and W = 5, and those for user B by Q*_B = diag{q*_{1,B}, q*_{2,B}} = diag{1, 50} and W = 5. It should be noted again that J is private, i.e. the system-designer cannot access Q directly. In this experiment, we try to estimate q*_1 and q*_2 based on the user rating (18). The flow of the experiment is shown below.
(2) A control experiment is performed, where the initial state x_0 is determined by random values chosen from [0, 10]².
(3) A system-user evaluates the temporal control system and gives a rating to the system-designer based on (18).
(4) Algorithm 1 is applied to the user rating to update the estimates of q_i, i ∈ {1, 2}.
(5) The control law (17) is updated with the updated parameters.
(6) Go back to Stage 2.
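The gain synthesis in Stage 5 can be sketched with a standard Riccati solver, using the Riccati equation as written in the text; the A and B matrices below are illustrative stand-ins, since the paper's numerical values are not reproduced here.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative system matrices (not the paper's values).
A = np.array([[0., 1.], [0., -1.]])
B = np.array([[0.], [1.]])
W = np.array([[5.]])

def lqr_gain(Q, W):
    """Optimal gain K = W^{-1} B^T P, with P solving
    P A + A^T P - P B W^{-1} B^T P + Q = 0."""
    P = solve_continuous_are(A, B, Q, W)
    return np.linalg.solve(W, B.T @ P)

Q_A = np.diag([50., 1.])  # user A's private weights
Q_B = np.diag([1., 50.])  # user B's private weights
print(lqr_gain(Q_A, W))   # different weights yield different personalized gains
print(lqr_gain(Q_B, W))
```

In the algorithm, Q is replaced by the current estimate diag{q̂_1, q̂_2}, so the gain changes at every controller update until the estimates converge.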

Experiment results
The results of the experiment are given in Figures 6-8. The transitions of the parameters q_1 and q_2 obtained from the experiment are shown in Figure 6. In the figure, the horizontal axis represents the number of parameter updates and the vertical axis represents the parameter value. We see that the parameter estimate (q̂_1, q̂_2) converges to (q*_{1,A}, q*_{2,A}) = (50, 1) for user A and to (q*_{1,B}, q*_{2,B}) = (1, 50) for user B. Next, the transitions of the user rating are shown in Figure 7. In the figure, the horizontal axis represents the number of parameter updates and the vertical axis represents the value of the utility gained from the control system. For both user A and user B, we see that the utility is maximized by the algorithm even though the system-designer cannot access full information on the objective function.
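The contraction mechanism behind this convergence can be illustrated in miniature with a one-parameter toy version of the scheme, in which only the sign part of the reward (4) is used and the region update degenerates to bisection; all names and values here are illustrative, not the paper's experiment.

```python
import numpy as np

# One-parameter toy: J*(x, u) = f0 + q* f1 with the weight q* hidden by the user.
q_star = 50.0
f0, f1 = 1.0, 2.0        # basis values at some fixed experience

lo, hi = 0.0, 100.0      # initial guess S_0 on the parameter region
for tau in range(30):
    q_hat = (lo + hi) / 2                         # centre of the region, as in (12)
    gap = (f0 + q_hat * f1) - (f0 + q_star * f1)  # J_hat - J* at the experience
    if np.sign(gap) >= 0:                         # estimate too large: cut from above
        hi = q_hat
    else:                                         # estimate too small: cut from below
        lo = q_hat
print(lo, hi)  # the interval brackets q* = 50 and contracts monotonically
```

The interval [lo, hi] plays the role of Q_rect,τ: each reward removes part of the region, matching the monotone contraction stated in Theorem 4.1.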
Finally, we show the state transitions of the personalized control systems in Figure 8, which use the converged parameter estimates for control. In the figure, the horizontal axis represents the discrete time k and the vertical axis represents the state of the plant system (16). We see that the control behaviour differs for each user, which indicates that the control system is personalized.

Conclusion
In this paper, we addressed the personalization of control systems, where the optimal controller is updated according to the user's private objective. We formulated the problem of estimating the individual objective function based on user ratings and proposed a solution algorithm. The algorithm was analysed for the special case where the objective function is characterized by only two parameters. Finally, a numerical experiment showed the usefulness of the algorithm.
We have not proved that Theorem 4.1 holds for the case with three or more parameters, so future work includes the extension of the analysis to a general objective function. Another direction of future work is to model the user rating in a manner different from (3).

Figure 1 .
Figure 1. An example of user rating: a five-grade evaluation is given to the control system, such as Excellent/Very Good/Good/Average/Poor.

Figure 2 .
Figure 2. Personalized control system updated based on user rating.

Figure 3 .
Figure 3. Estimate of parameter existence region.

Remark 4.1:
As implied by Theorem 4.1, by iterating the controller update with Algorithm 1, the parameter region Q_τ contracts monotonically in the sense of the outer approximation.

Figure 6 .
Figure 6. Transitions of the parameter estimates. (a) User A and (b) user B.

Figure 7 .
Figure 7. Transitions of utility. (a) User A and (b) user B.

Figure 8 .
Figure 8. State transitions of the personalized control system. (a) User A and (b) user B.