Adaptive self-learning controllers with disturbance compensation for automatic track guidance of industrial trucks

ABSTRACT This paper presents an extended control concept for automatic track guidance of industrial trucks in intralogistic systems. It is based on Reinforcement Learning (RL), a method of Artificial Intelligence (AI). The presented approach is able to adapt itself to different industrial truck variants and to the associated specific vehicle parameters. In order to avoid starting the whole training of the controller from scratch for each truck variant, the training process is divided into two steps. In the first step, the controller is trained on a simplified linear model using the parameters of a nominal vehicle variant. Based on this, the control parameters are only fine-tuned in the second step using a more complex nonlinear model representing the real industrial truck. In this way, the controller is adapted to the actual truck variant and the corresponding parameter values. By using the nonlinear model, it can be ensured that the forklift's dynamics are approximated within the entire operating range, even at high steering angles. Moreover, the influence of the disturbance variable of the system (path curvature) is compensated by considering this a priori knowledge within the control design. For this purpose, the Artificial Neural Networks (ANN) of the RL controller and the observation vector are suitably adjusted. In this way, the occurring path curvatures can be considered in both training steps and the control parameters can be optimized accordingly. Thus, the influence of the disturbance variable can be compensated, which significantly improves the control quality. In order to demonstrate this, the new approach is compared to an RL control concept that does not consider the disturbance variable and to a classical two-degrees-of-freedom (2DoF) control approach.


Problem description and requirements
In times of global economic markets and increasing competition, the automation of logistic processes is a basic requirement for corporate success. An important objective of research and development is to increase the internal material flow via an autonomous and intelligent networked fleet, which usually consists of a wide variety of different truck variants.
An essential element in this environment is the automatic track guidance of individual industrial trucks. The main objective is to guide the truck as accurately as possible along a predefined path so that only small lateral deviations occur. For the automatic track guidance of a heterogeneous logistics fleet, the classical model-based control methods are disadvantageous for several reasons. On the one hand, these control approaches prove to be time-consuming, since the modelling of the plant and the design of the controller have to be carried out separately for each truck variant. On the other hand, the use of the extensive methods of linear control theory requires a linear model that describes the plant as accurately as possible.
However, considering the entire range of applications of forklifts, the widely used linear single-track model (Section 2) reaches its limits. This is due to the simplifying assumptions made during the development of the model. Especially the small-angle approximation leads to problems with respect to industrial trucks. Due to the high demands on manoeuvrability, forklifts are designed with rear-axle steering systems, allowing steering angles of up to 90° [1]. As a result, the linear model is not able to approximate the vehicle dynamics over the entire operating range, which can lead to significant disadvantages in the model-based design of the controller.
Since the varying path curvature during operation has a significant influence on the automatic track guidance system and the trajectory is predefined, this information represents a priori knowledge and should be exploited by the control concept. Consequently, an approach has to be developed that independently adapts to different industrial truck variants, takes into account the actual dynamics of the forklift within the entire range of applications and considers existing a priori knowledge.

Related research
The papers [2][3][4][5][6] deal with the topic of automatic track guidance of industrial trucks, but each of them focuses only on a single truck variant.
In the publications [7][8][9][10], a 2DoF control concept for automatic track guidance of vehicles is presented that specifically considers the influence of the disturbance variable (path curvature) as a priori knowledge. These control structures consist of a linear disturbance compensation (feedforward controller, FFC) combined with different kinds of feedback controllers (FBC). These approaches proved to be very effective, since the influence of the changing path curvature can almost be compensated. Compared to a classical feedback control concept, significant advantages can be achieved using the 2DoF approach. Since both parts of the lateral controller (FFC and FBC) depend on the plant, this concept is likewise suitable for only one single truck variant.
In order to consider multiple forklift variants, new methods based on AI are used in addition to the classical adaptive control concepts given in Refs. [11][12][13]. An overview as well as a classification of the different AI approaches is given in Ref. [14]. The well-known RL control methods suffer from the fact that a priori knowledge is not integrated into the training process [15,16]. Therefore, a new approach has been presented in Ref. [14] that will be called Reinforcement Learning Control Considering a priori Plant Knowledge (RLCCPK) in the following. Its basic idea consists of integrating a priori knowledge of the plant into the training, which significantly increases the efficiency of the whole process. The presented RLCCPK approach considers a priori knowledge of the controlled system but neglects the influence of the varying path curvature during operation. Since the path is available in advance, this a priori knowledge should be taken into account by the control concept. Therefore, an extension of the RLCCPK approach has been presented in Ref. [17]. However, this approach uses only a simplified linear model during the training process of the RL controller, which approximates the vehicle dynamics only for a limited range of applications (Section 2.4).

Main contribution and outline of this paper
This paper presents an extended control concept for the automatic track guidance of industrial trucks which is based on RL. It adapts itself to different vehicle variants and also takes into account a priori plant knowledge. RL is implemented in the form of the so-called Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, as it proves to be suitable for the application of automatic track guidance [18]. The method of integrating a priori plant knowledge into the training process known from RLCCPK is extended to compensate the influence of the disturbance variable and to ensure steady-state accuracy in analogy to a classical 2DoF control concept [17]. By means of an appropriate extension of the so-called observation vector, the path curvature is provided to the RL controller. Furthermore, the structures of the RL controller's ANN have to be adjusted to process the information of the enlarged observation vector.
In order to guarantee a high control quality within the entire application range of the real industrial truck, the training process of the AI-based controller is divided into two steps using different plant models. In the first step, the controller is pre-trained on the basis of a simplified linear model representing a priori knowledge of the basic lateral dynamic vehicle behaviour. Since this model is derived for an industrial truck with average vehicle parameter values, a fine-tuning of the control parameters with respect to the actual vehicle variant is performed in the second training step. For this purpose, a more complex nonlinear model is used, representing the real industrial truck's lateral dynamic behaviour. Using this advanced model, the actual dynamics of the industrial trucks can be approximated within the entire operating range, even at high steering angles. This two-stage procedure using different plant models offers the possibility to investigate the adaptability of the already pre-trained controller to the real vehicle behaviour in simulation. In this way, both the control quality and the controller's training efficiency can be improved significantly. To demonstrate this, the control concept proposed in this paper, called Reinforcement Learning Control with Disturbance Compensation (RLCDC), is compared to the RLCCPK and a 2DoF control approach.
This paper is organized as follows. Section 2 provides the system overview and addresses both the development of the linear and nonlinear plant models and their validation using real measurement data. This section ends with the introduction of the structures of the different control approaches. In Section 3, the design of the 2DoF controller is carried out using the root locus method. The fundamentals of RL as well as the AI-based control approaches (RLCCPK and RLCDC) are introduced in Section 4. Subsequently, the simulation results of the control concepts are assessed (Section 5). Finally, the main conclusions are discussed in Section 6.

System overview and modelling of the plant
At the beginning of this section, the principle of the automatic track guidance and the fundamental control structure are presented. Subsequently, both the linear (Section 2.2) and the nonlinear plant model (Section 2.3) are introduced. In order to illustrate the advantages of the nonlinear model, especially in the range of high steering angles, a validation of the models is presented in Section 2.4. Section 2.5 is dedicated to the structure of the classical 2DoF control concept, since it is used as a comparison control approach in this paper. Finally, the proposed AI-based control concept for the specific consideration of the disturbance variable (RLCDC) is presented in analogy to the 2DoF control approach in Section 2.6.

System overview
Figure 1 depicts the principle of automatic steering control of an industrial truck. First of all, the desired vehicle trajectory (predefined path) is calculated and stored as a data set on the real-time computer. The data record includes the necessary setpoint information for the automated vehicle guidance, such as the Cartesian coordinates and the curvature of the trajectory. The main objective is to guide the truck as accurately as possible along the path. In Ref. [14] it is shown that it is of benefit to the control of the system if a preview point P_p is guided instead of the vehicle's centre of gravity (CoG). For this purpose, P_p is defined at the preview distance l_p in front of the industrial truck's CoG [19]. The lateral deviation a_p corresponds to the distance between the preview point P_p and the reference point R_p on the predefined path. Using this information, the controller calculates an appropriate control signal for the steering actuator in order to reduce the lateral deviation.
The fundamental structure of the proposed vehicle guidance system is provided in Figure 2. The plant model consists of three parts, starting with the position-controlled steering actuator. The second part is the so-called single-track model, which describes the lateral vehicle dynamics depending on the steering angle δ_r. The last part represents the kinematics of the vehicle, i.e. its relative motion with respect to the predefined path. The curvature χ_p of the path in the reference point R_p represents one input of the controlled system and is considered a disturbance variable. The second input is the control signal δ_set, which is calculated by the lateral controller depending on the lateral deviation a_p, the output signal of the controlled system.
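The geometric relationship between the CoG, the preview point and the lateral deviation described above can be illustrated with a minimal sketch. It assumes a planar pose (x, y, heading) for the vehicle and a reference point with a known path tangent direction; the function names, the sign convention (positive to the left of the path) and the numerical values are hypothetical and not taken from the paper.

```python
import math

def preview_point(x_cog, y_cog, heading, l_p):
    """Point located l_p ahead of the CoG along the vehicle's longitudinal axis."""
    return (x_cog + l_p * math.cos(heading),
            y_cog + l_p * math.sin(heading))

def lateral_deviation(p, ref, ref_heading):
    """Signed distance of p from the path at ref, measured along the path normal.

    Assumed convention: positive values mean p lies to the left of the path.
    """
    dx, dy = p[0] - ref[0], p[1] - ref[1]
    return -dx * math.sin(ref_heading) + dy * math.cos(ref_heading)

# Hypothetical example: vehicle at the origin, heading along the x-axis,
# path running parallel to the x-axis at y = -0.2 m.
p_p = preview_point(0.0, 0.0, 0.0, l_p=1.0)
a_p = lateral_deviation(p_p, ref=(1.0, -0.2), ref_heading=0.0)
```

In this example the preview point lies 0.2 m to the left of the path, so the controller would have to steer towards the path to reduce a_p.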

Linear plant model
Based on the structure of the mathematical plant model presented in Figure 2, the individual parts can be modelled. The transmission behaviour of the position-controlled steering actuator is approximated as a first-order delay element with the delay time constant T_s.
The second part is the so-called single-track model [7][8][9][10][20]. This well-known model from the literature is valid for vehicles with front-axle steering. An extension in order to describe the lateral dynamic behaviour of industrial trucks with rear-axle steering has been derived in Ref. [14]. It is obtained under the assumption that the CoG of the vehicle is at road level, which neglects the influence of wheel load distributions. Thus, the wheels of each axle can be combined into one resulting wheel. Figure 3 depicts the single-track model for industrial trucks with rear-axle steering. An overview of the associated variables is given in Table 1. Furthermore, the following simplifying assumptions are made during the mathematical description of the lateral dynamic vehicle behaviour:
• Neglect of longitudinal dynamic forces such as traction forces, braking forces and aerodynamic drag forces
• Constant or only slowly changing vehicle speed
• Small steering angles, slip angles and side slip angles
The first two assumptions can be made since the lateral vehicle dynamic motion generally changes significantly faster than the longitudinal dynamic motion. The third assumption limits the range of validity of the model. It describes the lateral vehicle dynamic behaviour only with restricted accuracy for manoeuvres with high steering angles and at the limits of driving physics. Taking into account these assumptions and adapting the linear model to forklifts with rear-axle steering, the equations of motion are obtained, where m is the vehicle mass and J_z is the moment of inertia about the vertical axis through the forklift's CoG.
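The first-order delay approximation of the steering actuator mentioned above corresponds to the differential equation T_s · dδ_r/dt + δ_r = δ_set. The following sketch simulates its step response with an exact zero-order-hold discretisation. The time constant and step size are hypothetical values chosen only for illustration; the actual value of T_s is given in Table 2 of the paper.

```python
import math

def simulate_pt1(delta_set, T_s, dt, delta_r0=0.0):
    """Discrete simulation of T_s * d(delta_r)/dt + delta_r = delta_set.

    Uses the exact discretisation for an input held constant over each step.
    """
    a = math.exp(-dt / T_s)
    delta_r = delta_r0
    out = []
    for u in delta_set:
        delta_r = a * delta_r + (1.0 - a) * u
        out.append(delta_r)
    return out

T_S = 0.1    # hypothetical actuator time constant in s
DT = 0.001   # hypothetical sample time in s
response = simulate_pt1([1.0] * 1000, T_S, DT)  # unit step over 1 s
```

After one time constant (t = T_s) the simulated steering angle has reached about 63 % of the setpoint, which is the characteristic behaviour of a first-order delay element.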
Assuming small steering angles, the tyre forces F_y,f and F_y,r can be linearised. Thus, the tyre forces are assumed to be proportional to the vehicle's slip angles α_f and α_r, while the lateral tyre stiffness coefficients c_f and c_r are assumed to be constant.
To use the plant model for the design of the lateral controller, the model equations have to be extended to describe the relative motion of the vehicle with respect to the path (third part of the model). Specifically, a relationship results for the lateral deviation (a_p) and the course angle (κ_path) with respect to the preview point P_p and the reference point R_p on the predefined path [21,22].
Finally, the linear plant model can be given in state-space representation (Equation (9)), where x = [β, ψ̇, κ, a_p, δ_r]^T describes the state vector of the system and u = [δ_set, χ_p]^T represents the vector of its input signals. These are the steering angle setpoint calculated by the lateral controller (control signal) and the curvature of the predefined path, considered as disturbance variable. Using this information and the state-space model (Equation (9)), the transfer functions of the plant can be specified. The disturbance transfer function G_a,χ(s) and the control transfer function G_a,δ(s) are given in Equations (10) and (11) and are used for the classical model-based design of the 2DoF controller in Section 3. Here, G_a,χ(s) describes the effect of the path curvature χ_p on the system's output a_p, while G_a,δ(s) characterizes the dynamic behaviour of the vehicle and the steering actuator.
Table 2 provides an overview of the associated values of the model parameters used in Equations (9)-(11) for a nominal truck variant, the Linde E30 [1].

Nonlinear plant model
As already mentioned, the second part of the plant model in Figure 2 describes the vehicle's dynamics. Since the rear-axle steering system of forklifts allows high steering angles, a more complex nonlinear single-track model is used [20,23,24]. In contrast to the linear plant model, the equations of motion are formulated without the small-angle approximation. Furthermore, the nonlinear tyre forces F_y,f and F_y,r are calculated using the arc-tangent approximation described in Refs. [20,22] as a function of the cornering stiffness coefficients c_f1, c_f2, c_r1, c_r2 and the slip angles α_f, α_r [10,23,24].
μ_f,max and μ_r,max correspond to the adhesion coefficients at the front and rear tyres, whereas c_f and c_r correspond to the cornering stiffness coefficients. Substituting the resulting slip angles and the nonlinear tyre forces into the equations of motion yields the final differential equations of the nonlinear model, which are given in Equations (16)-(20).
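To illustrate why an arc-tangent tyre characteristic matters at high slip angles, the following sketch compares a linear tyre force with a generic saturating arc-tangent curve. The functional form F = c1 · arctan(c2 · α) and all parameter values are assumptions made for illustration only; the exact parameterisation used in the paper (including the adhesion coefficients) is not reproduced here.

```python
import math

def tyre_force_linear(alpha, c):
    """Linear tyre model: lateral force proportional to the slip angle."""
    return c * alpha

def tyre_force_arctan(alpha, c1, c2):
    """Generic arc-tangent tyre model (assumed form): saturates at c1 * pi / 2."""
    return c1 * math.atan(c2 * alpha)

# Hypothetical parameters; the linear stiffness is matched to the initial
# slope of the arc-tangent curve (c = c1 * c2).
C1, C2 = 8000.0, 10.0
C_LIN = C1 * C2

small, large = 0.01, 0.5  # slip angles in rad
f_lin_small = tyre_force_linear(small, C_LIN)
f_atan_small = tyre_force_arctan(small, C1, C2)
f_lin_large = tyre_force_linear(large, C_LIN)
f_atan_large = tyre_force_arctan(large, C1, C2)
```

For the small slip angle both models agree to within about 1 %, whereas for the large slip angle the linear model predicts a force several times higher than the saturating curve. This is the qualitative effect behind the model mismatch at high steering angles.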

Validation of the plant models
To investigate the validity of the presented single-track models, three different manoeuvres are performed. For this purpose, a forklift that is comparable to the nominal Linde E30 is equipped with an Inertial Measurement Unit (IMU), and the state variables β and ψ̇ as well as the vehicle speed v and the steering angle δ_r are recorded while driving. In Figure 4, the measured variables β and ψ̇ (blue) are compared to the corresponding simulation results based on the linear (red) and nonlinear (orange) models derived above. The first manoeuvre performed in the driving test deliberately involves only steering angles of up to 20° on the rear axle. Both simulation models represent the real vehicle behaviour quite accurately and behave comparably.
If the steering angle is increased, the advantage of the nonlinear model becomes clear (Figure 5). The deviations of the linear model are due to the small-angle approximation and the linear tyre characteristics assumed during its development. This illustrates that the linear model is not able to approximate the lateral vehicle dynamic behaviour of industrial trucks sufficiently well over the entire operating range, i.e. during operations with high steering angles. The nonlinear model, in contrast, is able to represent the real vehicle behaviour for small as well as for higher steering angles. The fact that high steering angles actually occur during the operation of industrial trucks is demonstrated by a turning manoeuvre (Figure 6). This validation illustrates again the advantage of the nonlinear model for describing the vehicle dynamics of industrial trucks. Nevertheless, the linear model is quite suitable to approximate the vehicle behaviour in a limited range of operation. This a priori knowledge in the form of a validated simplified linear model will be used to pre-train the RL controller in simulation (first step), in order to build up experience regarding the basic vehicle behaviour of a nominal forklift variant (Linde E30). Thus, based on this pre-trained RL controller, only the fine-tuning has to be done using the more accurate and complex nonlinear model to simulate real-time operation, which significantly accelerates the training process.

2DoF control structure
Figure 7 presents the structure of the 2DoF control concept, which can be used to compensate the influence of the disturbance variable, i.e. the path curvature [7][8][9][10][25]. The control signal δ_set is formed by superposition of two signal components. The first part (δ_FFC) is calculated by an FFC that uses a priori knowledge in the form of the detailed path information, which is available in advance [21]. It determines the control signal as a function of the path curvature χ_p in the current reference point R_p based on the linear plant model (Section 2.2) [7].
The FBC calculates the second component (δ_FBC) of the control signal. Its task is to stabilize the plant and to compensate for the lateral deviation a_p caused by model inaccuracies and other disturbances. Despite the described advantages, this control concept has a decisive disadvantage with regard to the automation of a heterogeneous fleet: the FFC is not adaptive to different vehicle variants. Although the FBC can compensate for minor variances during operation, an adaptation to another truck variant is not possible with this control approach.

Proposed AI-based control structure
In order to take into account the influence of the disturbance variable, the structure of the RLCCPK control concept given in Ref. [14] is extended in analogy to the 2DoF concept. Since the path is defined in advance and stored on the real-time computer (Section 1), the path curvature in the reference point R_p can be used as a priori knowledge. Thus, this information (χ_p) is provided to the lateral controller as an additional input signal.
The calculation of the control signal δ_set is based on the current system state x on the one hand and on the current path curvature in the reference point R_p on the other hand. With this new control structure, the advantages of the RLCCPK and the FFC of the 2DoF control concept can be combined. The result is a new approach that is able to adapt to different vehicle variants, taking into account the a priori plant knowledge, and to compensate the influence of the varying path curvature during operation.

Design of the 2DoF controller
In Section 2, it was pointed out that the curvature of the path χ_p in the current reference point R_p can be regarded as a disturbance variable of the lateral vehicle guidance system. Since the path is predefined and stored on the real-time computer, this a priori knowledge offers the possibility to reduce the influence of the varying path curvature during operation by means of a disturbance rejection [25]. Assuming that the mathematical model describes the controlled system accurately, the influence of the disturbance variable can be compensated completely with a suitable definition of the FFC, G_FFC(s).
Figure 9 shows the structure of the 2DoF control concept. Its design is based on the disturbance transfer function G_a,χ(s) and the control transfer function G_a,δ(s) of the plant, given in Equations (10) and (11) in Section 2.2. Based on this, G_FFC(s) can be derived (Equation (21)). Since the resulting transfer function G_FFC(s) has a higher number of zeros than poles, a first-order low-pass filter with a small time constant T_FFC has to be added. As the FFC does not ensure precise track guidance by itself, an FBC is used to compensate the occurring lateral deviation a_p. This increases the robustness of the control system with respect to imprecisely known model parameters and stabilizes the controlled system. The FBC is designed using the root locus method in order to achieve a damping of the dominant poles of about D = 0.7. A detailed description of the control design using the root locus method has already been given in Refs. [7,9,10]. The associated transfer function G_FBC(s) represents the FBC as a PDT1 controller (Equation (22)), where K_FBC is the gain factor, T_D is the derivative time and T_FBC is the time constant of a first-order low-pass filter. The associated control parameters of the 2DoF controller are given in Table 3.
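The reasoning behind the feedforward design can be sketched as follows. The derivation assumes that the plant output is the sum of the control and disturbance paths and that the two controller outputs are added, as described in Section 2.5. The signs and the exact form of the PDT1 transfer function are assumptions based on the standard textbook forms and may differ from Equations (21) and (22) of the paper.

```latex
% Assumed plant and control structure
a_p(s) = G_{a,\delta}(s)\,\delta_{set}(s) + G_{a,\chi}(s)\,\chi_p(s), \qquad
\delta_{set}(s) = \underbrace{G_{FFC}(s)\,\chi_p(s)}_{\delta_{FFC}(s)} + \delta_{FBC}(s)

% Ideal compensation: the total effect of \chi_p on a_p vanishes
G_{a,\delta}(s)\,G_{FFC}(s) + G_{a,\chi}(s) = 0
\;\;\Rightarrow\;\;
G_{FFC}(s) = -\frac{G_{a,\chi}(s)}{G_{a,\delta}(s)}

% Realisable version with the additional first-order low-pass filter
G_{FFC}(s) = -\frac{G_{a,\chi}(s)}{G_{a,\delta}(s)} \cdot \frac{1}{1 + T_{FFC}\,s}

% Standard PDT1 form assumed for the feedback controller
G_{FBC}(s) = K_{FBC}\,\frac{1 + T_D\,s}{1 + T_{FBC}\,s}
```

The second line makes the plant dependence explicit: G_FFC(s) is built entirely from the two plant transfer functions, which is why the FFC cannot be transferred to another truck variant without redesign.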

AI-based control approaches
This section is dedicated to the AI-based control approaches for the automatic track guidance of industrial trucks. At the beginning, the methodology used and the basics of RL are presented. Section 4.2 introduces the RLCCPK approach given in Ref. [14], since it is used as a comparison control concept in Section 5. Finally, the proposed RLCDC approach, which specifically considers the varying path curvature during operation, is discussed in Section 4.3.

Reinforcement learning basics
RL is a well-known approach in the domain of control systems [26][27][28]. It is assigned to the methods of direct neural control, since the AI acts as a controller and calculates the control signal by itself. The training of the RL controller takes place in closed-loop operation and is done in analogy to the human learning process: experience is built up by interacting with the system. The principle of the RL process is displayed in Figure 10 and essentially consists of three blocks. The bottom block (vehicle) represents the controlled system, in this case the industrial truck. Its current state at time step k is provided to the RL controller. This second block describes the lateral controller, which calculates the control signal u_k in order to affect the controlled system. The third block (reward function) evaluates the control signal u_k based on the current state at step k and the following state at step k+1 in the form of a feedback signal called reward r_k. It is a measure of control quality. In analogy to the human learning process, the control strategy is adapted in order to optimize the reward. The described basic idea of RL can be implemented using different methods. In this paper, the TD3 algorithm is used, which is an extension of the Deep Deterministic Policy Gradient (DDPG) algorithm [27]. The TD3 algorithm is well suited for the application of automatic track guidance for two main reasons. On the one hand, the RL controller is able to calculate a continuous-valued control signal, which is important for smooth vehicle track guidance. On the other hand, the training process proves to be more stable compared to the DDPG algorithm due to additional target networks [18]. TD3 is a so-called actor-critic method that uses separate memory structures to distinguish between the control strategy μ(·) (actor-ANN) and the value function Q(·, u) (critic-ANN). Q(·, u) is a function that calculates the expected cumulative reward r based on its input signals, the observation and the control signal u, and represents the knowledge of the plant. This means that it evaluates the expected reaction of the controlled system with respect to the calculated control signal in the current system state. The optimization of the parameters φ of the critic-ANN is done by supervised learning, based on the obtained reward [29,30]. The task of the actor-ANN consists of calculating the control signal u_k as a function of the current system state at step k and of the actor-ANN parameters θ. These parameters (θ) are optimized in order to maximize the output of the critic-ANN and thus the reward. To implement this, a criterion J is defined based on the start distribution of Q(·, u) [27]. The basic idea is to adjust the parameters θ of the actor-ANN in the direction of the gradient ∇_θ J [18,27,31]. This is done by applying the chain rule with respect to the actor-ANN parameters θ (Equation (23)).
The observation vector, which reflects the state of the system, depends on the chosen methodology. Depending on whether the disturbance variable is taken into account (RLCDC) or not (RLCCPK), the observation vector is composed differently (Sections 4.2 and 4.3).
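For orientation, the chain-rule step described above can be written in the standard form of the deterministic policy gradient used by DDPG and TD3. The symbol o for the observation and the sample-based expectation are notational assumptions; the expression follows the general literature and is not copied from Equation (23) of the paper.

```latex
\nabla_{\theta} J \;\approx\;
\mathbb{E}\!\left[
  \left.\nabla_{u} Q(o, u \mid \varphi)\right|_{u = \mu(o \mid \theta)}
  \;\nabla_{\theta}\, \mu(o \mid \theta)
\right]
```

The first factor states how the critic's estimate changes with the control signal, the second how the control signal changes with the actor parameters. Their product gives the direction in which θ has to be adjusted to increase the expected reward.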

RLCCPK approach
The RLCCPK approach does not consider the influence of the disturbance variable (χ_p). The observation vector used is formed similarly to the state vector x of the models described in Section 2 and is given in Equation (24). The behaviour of the RL controller can be specified by the definition of the reward function. The study [14] demonstrates that the closed-loop behaviour of optimal state control can be approximated by choosing the reward function r_k in analogy to the quadratic cost function of the classical LQR [32]. In this application, the reward function of the RLCCPK is defined to focus on minimizing the lateral deviation a_p of the vehicle with respect to the path. Therefore, the weighting factor of a_p,k² is chosen significantly larger than the weightings of the other signals (Equation (25)).
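A reward function shaped like a negative quadratic LQR cost can be sketched as follows. The selection of signals, the weighting values and the names `Q_DIAG` and `R_U` are hypothetical; only the idea of weighting the squared lateral deviation much more strongly than the other signals is taken from the text.

```python
# Hypothetical diagonal weights; the squared lateral deviation a_p dominates.
Q_DIAG = {"beta": 1.0, "yaw_rate": 1.0, "kappa": 1.0, "a_p": 100.0, "delta_r": 1.0}
R_U = 1.0  # hypothetical weight on the control signal

def reward(obs, u):
    """Negative quadratic cost: zero in the ideal case, negative otherwise."""
    state_cost = sum(Q_DIAG[name] * obs[name] ** 2 for name in Q_DIAG)
    return -(state_cost + R_U * u ** 2)

zero = {name: 0.0 for name in Q_DIAG}
r_ideal = reward(zero, 0.0)
r_lateral = reward({**zero, "a_p": 0.1}, 0.0)    # 0.1 m lateral deviation
r_course = reward({**zero, "kappa": 0.1}, 0.0)   # 0.1 rad course angle error
```

With these weights, a lateral deviation of 0.1 m is penalised a hundred times more than a course angle error of the same numerical size, which steers the learned control strategy towards small lateral deviations.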

Proposed RLCDC approach
In order to compensate the influence of the varying path curvature in the reference point R_p during operation, the observation vector of the RLCCPK (Equation (24)) is extended by the disturbance variable χ_p. The resulting extended observation vector of the RLCDC approach is given in Equation (26). Thus, the current path curvature can be used for the calculation of the ideal control signal δ_set (actor-ANN). Since χ_p only takes a non-zero value while driving along a curve, the RLCDC approach offers the opportunity to generate an additional control signal component. In the case of a control deviation caused by model inaccuracies or other disturbances, the extension of the observation vector has no effect.
Since the signals of the observation vector form the inputs of the actor-ANN and the critic-ANN of the RL controller, the structure of these networks has to be adjusted. A further neuron is integrated in the input layers of the ANN in order to process the information of the enlarged observation vector. Figure 11 depicts a simplified representation of the structure of the actor-ANN (left) and the critic-ANN (right). The first hidden layer of both fully connected feed-forward ANN contains 400 neurons. Therefore, the extension of the input layer with an additional neuron results in a large number of further ANN parameters. Since the path curvature χ_p is integrated into the extended observation vector, this information is available to both the critic-ANN and the actor-ANN. Thus, it can be used both to estimate the expected reward (r) and to calculate the ideal control signal δ_set. Consequently, the influence of the disturbance variable can be compensated and the control quality can be significantly improved.
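The effect of the additional input neuron on the number of parameters can be estimated with a short calculation. The sketch assumes that the observation enters the first hidden layer through a fully connected layer with bias terms, and that the original observation has five entries like the state vector x. How the control signal enters the critic-ANN is not modelled here.

```python
def dense_params(n_in, n_out):
    """Number of weights and biases of one fully connected layer."""
    return n_in * n_out + n_out

N_HIDDEN_1 = 400  # neurons in the first hidden layer, as stated in the text

params_rlccpk = dense_params(5, N_HIDDEN_1)  # five observation signals (assumed)
params_rlcdc = dense_params(6, N_HIDDEN_1)   # path curvature added as sixth signal
extra_per_network = params_rlcdc - params_rlccpk
```

Under these assumptions, each network gains 400 additional weights in its first layer, all of which have to be optimized during training.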
In order to compare the different RL control concepts (RLCCPK and RLCDC) with each other, the reward function given in Equation (25) is used for the RLCDC approach as well.

Control design and simulation results
In this section, the simulation results of both RL approaches (RLCCPK and RLCDC) and the 2DoF control concept are presented and compared with each other. Section 5.1 focuses on the results after the first training step of the AI-based approaches, called pre-training, which is carried out using the simplified linear plant model and the parameters of the nominal truck variant (Linde E30). The fine-tuning of the control parameters with respect to the actual dynamic behaviour of the forklift is carried out in the second training step. For this purpose, the more accurate nonlinear plant model is used, representing the real industrial truck. The simulation results of the fine-tuned AI-based controllers are presented in Section 5.2.
Finally, the adaptability of the RL concept to another vehicle variant, such as the Linde E80, will be discussed in Section 5.3. For this purpose, the second training step is performed based on the pre-trained controllers with respect to the Linde E80. Since the 2DoF control concept is not adaptive, the simulation results of the Linde E80 are also presented using the 2DoF controller designed for the nominal truck variant.

Simulation results after the first training step
This section compares the RLCCPK and RLCDC controllers after completion of the first training step with the 2DoF controller, considering the scenario given in Figure 12 [17].
The upper part of the figure displays the course of the predefined path [0-20 m]. The path initially runs as a straight line [0-10 m] and merges into a curve with a constant curve radius ρ_path [12-20 m]. The transition between these segments is realized as a clothoid [10-12 m], along which the curvature increases linearly until the final curve radius is reached. Since the control concepts refer to a constant velocity of v = 2 m/s during the entire test scenario, the path curvature can be calculated as a function of time. It is shown in the bottom part of Figure 12 and is applied to the system as the disturbance variable χ_p (Section 1). The industrial truck starts with an initial lateral deviation of the preview point of a_p = 0.2 m, i.e. offset from the path.
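The curvature profile of this scenario can be sketched as a piecewise function of the arc length and, via the constant velocity, of time. The segment boundaries and the velocity are taken from the text; the curve radius `RHO_PATH` is a hypothetical value, since the radius used in the paper is not stated here.

```python
def path_curvature(s, rho_path, s_clothoid_start=10.0, s_curve_start=12.0):
    """Curvature in 1/m at arc length s in m: straight, clothoid, constant curve."""
    chi_max = 1.0 / rho_path
    if s < s_clothoid_start:
        return 0.0  # straight line
    if s < s_curve_start:
        # clothoid: curvature grows linearly with arc length
        return chi_max * (s - s_clothoid_start) / (s_curve_start - s_clothoid_start)
    return chi_max  # curve with constant radius

V = 2.0         # vehicle velocity in m/s, as stated in the text
RHO_PATH = 5.0  # hypothetical curve radius in m

def curvature_over_time(t):
    """Disturbance signal chi_p(t) for a vehicle travelling at constant speed V."""
    return path_curvature(V * t, RHO_PATH)
```

With v = 2 m/s the curve begins after 5 s and the constant curvature is reached after 6 s, so the second half of a 10 s simulation is dominated by the disturbance variable.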
The closed-loop simulation results using the different control approaches with respect to the nominal vehicle variant (Linde E30) are presented. It shall be shown how the different control concepts deal with the scenario given in Figure 12 and compensate the influence of the disturbance variable. Since RLCCPK does not consider the varying path curvature during operation, this approach is trained without disturbance signals in all training epochs. In order to take into account occurring path curvatures during operation, the structure of the ANN of the RLCDC is adjusted as discussed in Section 4.3. The training process of the RLCDC controller is divided into several epochs, each [...]

Figure 13 shows the closed-loop simulation results using the linear plant model. In the upper part of the figure, the time courses of the control variable (δ_set) are depicted. The controlled variable (a_p) is illustrated below. All three concepts behave comparably in the range [0-5 s]. The lateral deviation of the RLCCPK differs from the other control concepts in the rear part [5-10 s] and exhibits a permanent control deviation of about a_p = 4 cm. This is due to the fact that the path curvature is applied to the system but not taken into account by the RLCCPK. The extension of the RL approach (RLCDC) almost completely compensates the influence of the disturbance variable and leads to a steady-state accuracy comparable to the 2DoF control approach. By extending the observation vector (ext) with the signal of the disturbance variable (χ_p) in the current reference point R_p, this information is made available to the controller. Due to the additional neuron in the ANN's input layer, the path curvature can be incorporated directly into the calculation of the control signal, which leads to a significant improvement in the control quality while driving along curves. Since the ANN used are fully connected, the additional neuron within the input layer, in combination with the high number of neurons of the first hidden layer, leads to a more complex ANN with a large number of additional ANN parameters. This results in a higher degree of freedom with respect to the design and improves the control quality by compensating the influence of the disturbance variable.
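The growth in parameters caused by one additional input neuron can be made concrete with a small count. The sketch below is only an illustration: the layer sizes are hypothetical and are not taken from the paper. It shows that, in a fully connected network, one extra input adds as many weights as the first hidden layer has neurons.

```python
def mlp_parameter_count(layer_sizes):
    """Number of weights and biases of a fully connected network."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector per layer
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

# Hypothetical layer sizes: 4 observations without curvature, 5 with it.
base_net = [4, 64, 64, 1]
extended_net = [5, 64, 64, 1]

extra_parameters = mlp_parameter_count(extended_net) - mlp_parameter_count(base_net)
print(extra_parameters)  # equals the width of the first hidden layer
```

The same increase applies to every network of the RL controller that receives the extended observation vector.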
However, the RLCDC approach has a negative effect on the training efficiency. The additional control parameters have to be taken into account in the training process, so significantly more optimization steps have to be performed. This can be seen by comparing the optimization steps carried out in the first training step of RLCCPK and RLCDC (Table 4).

In order to ensure safe vehicle guidance in the entire operating range of industrial trucks, the control concepts compensating the influence of the disturbance variable (RLCDC and 2DoF) are now tested on the basis of the nonlinear model. Using this model, the actual dynamic vehicle behaviour is approximated, even in applications with large steering angles. In order to reach this range, the test scenario is changed: a tighter curve was designed, resulting in a higher path curvature χ_p in the reference point R_p [8-15 s] and thus in higher steering angles during operation (Figure 14). The vehicle velocity is constant within this scenario as well (v = 2 m/s).
If the controllers designed on the basis of the simplified linear model are tested using the nonlinear model, the control quality suffers. Both approaches (RLCDC and 2DoF), which ensured steady-state accuracy during the simulation tests using the linear model (Figure 13), cannot guarantee it using the nonlinear one representing the real vehicle dynamics (Figure 15). In each case, a permanent control deviation occurs in the range [8-15 s], which is about a_p = 2.8 cm using the 2DoF and a_p = 6.4 cm using the RLCDC.
With respect to the 2DoF controller, the loss of steady-state accuracy in case of an applied disturbance variable [8-15 s] is due to the FFC (Section 3), since the nonlinear plant model differs from the linear model used for the design of the FFC. Focusing on how the 2DoF controller corrects the initial lateral deviation [0-5 s], no significant difference can be seen compared to the investigation presented in Figure 13. Regarding the RLCDC, the nonlinear plant model has an even stronger impact on the steady-state behaviour in case of an applied disturbance variable. It is striking that the dynamics of the closed-loop control behaviour using the RLCDC change at the beginning of the scenario as well. This can be explained by the fact that the control signal is not calculated exclusively on the basis of the controlled variable (a_p), as it is with the 2DoF controller. Since the calculation is done by the Actor ANN, it is based on the ANN's input signals and thus on the observation vector ext. Due to the high steering angle at the beginning of the control process, both the side slip angle β and the yaw rate ψ, as well as the other values of the observation vector, differ significantly from the values that occur when the linear model is used. Thus, the simulation results in Figure 15 illustrate the importance of the fine-tuning step using the more accurate nonlinear plant model, which will be tested in the following section.
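Why a feedforward term designed on a linearized model loses accuracy at large steering angles can be illustrated with a deliberately simplified stand-in. The sketch below assumes a purely kinematic single-track relation between curvature and steering angle, which is not the dynamic plant model of the paper and not the authors' FFC; the wheelbase value is hypothetical.

```python
import math

def steering_linear(curvature, wheelbase):
    """Small-angle (linearized) steering angle [rad] for a given curvature."""
    return wheelbase * curvature

def steering_nonlinear(curvature, wheelbase):
    """Kinematic steering angle [rad] without the small-angle approximation."""
    return math.atan(wheelbase * curvature)

def feedforward_mismatch(curvature, wheelbase=2.0):
    """Steering error [rad] of a linear feedforward on the nonlinear kinematics.

    wheelbase = 2.0 m is a hypothetical value chosen for illustration only.
    """
    return steering_linear(curvature, wheelbase) - steering_nonlinear(curvature, wheelbase)

for kappa in (0.05, 0.2, 0.5):  # increasingly tight curves [1/m]
    print(kappa, feedforward_mismatch(kappa))
```

The mismatch vanishes for a straight path and grows with the curvature, so under these assumptions a tighter curve leaves a larger residual for the feedback part to handle.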

Simulation results after the second training step
In Section 5.2, the second training step of the AI-based controller is presented, which is performed using the nonlinear plant model representing the real industrial truck. Starting from the pre-trained controller (first training step), the control parameters can be adapted to the actual lateral vehicle dynamics using the nonlinear model. Figure 16 presents the simulation results of the retrained RLCDC controller (red line), using the nonlinear model and the parameters of the nominal vehicle variant. In order to illustrate the advantage of adapting the control parameters in the second training step, the course of the controlled variable after the first training step is given again (blue line). Fine-tuning the RLCDC using the nonlinear plant model yields significant advantages. As few as 2000 optimization steps in the second training phase increase the control quality and are sufficient to reduce the permanent control deviation of the RLCDC approach to a_p = 0.7 cm in the rear part [10-15 s] of the scenario given in Figure 14.

Investigation of the AI-based controller's adaptability
In this section, the adaptation of the control concepts to another industrial truck variant is investigated. To avoid starting the entire training process from scratch, the pre-trained RLCDC of Section 5.1 is used as the starting point for the second training step. In this second step (fine-tuning), the pre-trained controller has to be adapted to the actual industrial truck variant, in this case the Linde E80, and the associated vehicle parameters (Table 5). In this way, the number of optimization steps can be significantly reduced compared to a training that has to be started from scratch. This can be illustrated by comparing the required optimization steps of the RLCDC in the first and second training steps given in Table 6.
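The saving from starting with pre-trained parameters can be illustrated with a toy optimization. The sketch below is not an RL training and does not reproduce the numbers of Table 6: it runs plain gradient descent on a quadratic cost, and both optima (`nominal_opt`, `variant_opt`) are hypothetical stand-ins for the best control parameters of the nominal and the new vehicle variant.

```python
import numpy as np

def steps_to_converge(theta0, theta_opt, lr=0.1, tol=1e-3, max_steps=10_000):
    """Gradient descent on 0.5 * ||theta - theta_opt||^2.

    Returns the final parameters and the number of update steps performed.
    """
    theta = np.array(theta0, dtype=float)
    theta_opt = np.asarray(theta_opt, dtype=float)
    for step in range(max_steps):
        if np.linalg.norm(theta - theta_opt) < tol:
            return theta, step
        theta -= lr * (theta - theta_opt)  # gradient of the quadratic cost
    return theta, max_steps

nominal_opt = [1.0, 2.0]  # hypothetical optimum for the nominal variant
variant_opt = [1.3, 1.6]  # hypothetical optimum for the new variant

# First step: train from scratch on the nominal variant.
pretrained, steps_pretrain = steps_to_converge([0.0, 0.0], nominal_opt)
# Second step: fine-tune the pre-trained parameters on the new variant.
_, steps_finetune = steps_to_converge(pretrained, variant_opt)
# Reference: train on the new variant from scratch.
_, steps_scratch = steps_to_converge([0.0, 0.0], variant_opt)

print(steps_finetune, steps_scratch)
```

In this toy setting the warm start needs fewer steps because the pre-trained parameters already lie close to the new optimum; how large the saving is in the RL case depends on how similar the vehicle variants are.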
Figure 17 shows the control quality of the RLCDC and 2DoF control concepts using the nonlinear plant model for scenario II (Figure 14). It can be seen that the 2DoF controller designed for the Linde E30 is not able to stabilize the Linde E80. This is due to the fact that the model parameters and the resulting dynamics of the two vehicle variants are significantly different (Tables 2 and 5), which affects the design of both the FFC and the FBC.
The RLCDC is adapted to the changed vehicle variant and the actual industrial truck's dynamics in the second training step using the nonlinear model. After the fine-tuning process (Table 5), the retrained RLCDC is capable of stabilizing the vehicle. The AI-based controller is still able to almost completely compensate the influence of the disturbance variable, which can be seen in the range [5-15 s] in Figure 17. A permanent control deviation of a_p = 1.6 cm remains. In order to evaluate the RLCDC's control quality precisely, the course of the controlled variable (a_p) is given individually (bottom part of the figure) in addition to the comparison with the controlled variable using the 2DoF controller.

Conclusion
This paper presents an extension of an AI-based control approach for the automatic track guidance of industrial trucks. By separating the training process into two steps, existing a priori plant knowledge can be integrated during the training. In the first step, the controller is trained on a simplified linear model using parameters of a nominal vehicle variant. Based on this, the control parameters are only fine-tuned in the second step using a more complex nonlinear model in order to adapt to the actual vehicle variant. The more complex nonlinear model represents the real industrial truck and ensures that the forklift's dynamics are approximated within the entire operating range of industrial trucks, even in operations with high steering angles. By extending the observation vector and the ANN used in the RL controller, the influence of the path curvature can be compensated. Thus, the control quality of the concept can be improved and a stable control loop behaviour for different industrial truck variants can be ensured in the investigated scenarios. The new AI-based control concept with disturbance compensation (Reinforcement Learning Control with Disturbance Compensation, RLCDC) combines the advantages of the other presented control concepts: the adaptability to new industrial truck variants of the self-learning controller presented in Ref. [14] is combined with the ability of the 2DoF control concept to compensate the influence of disturbance variables. Finally, it should be mentioned that the RL concepts have a significant disadvantage compared to the 2DoF approach. In this configuration of the RL control approaches, all state variables of the system have to be available to the controller, whereas the 2DoF concept only requires the output variable of the plant.

Figure 1. Principle of automatic track guidance.

Figure 4. Validation of the plant models for small steering angles.

Figure 5. Validation of the plant models for higher steering angles.

Figure 8 depicts the control structure of the proposed AI-based control concept. It is based on the RLCCPK

Figure 7. Structure of the 2DoF control concept.

Figure 8. Structure of the proposed AI-based vehicle guidance system.

Figure 9. Structure of the 2DoF control concept.

Figure 11. Simplified representation of the extended ANN structure of the RLCDC approach.

Figure 13. Steering angle and lateral deviation using the linear plant model (Linde E30).

Figure 15. Steering angle and lateral deviation using the nonlinear plant model (Linde E30).

Figure 16. Steering angle and lateral deviation of the retrained controller using nonlinear model (Linde E30).

Table 1. Variables of single-track model in Figure 3.

Table 3. Control parameters of the 2DoF approach.

Table 4. Training efficiency of the AI-based approaches.

Table 6. Training efficiency of the RLCDC approach.