Reinforcement learning-based control to suppress the transient vibration of semi-active structures subjected to unknown harmonic excitation

The problem of adaptive semi-active control of transient structural vibration induced by unknown harmonic excitation is studied. The controller adaptation is attained by using a specially designed reinforcement learning algorithm that adjusts the parameters of a switching control policy to guarantee efficient dissipation of the structural energy. This algorithm relies on an efficient gradient-based sequence that accelerates the learning protocol and results in suboptimal control. The performance of this method is examined through numerical experiments for a span structure that is equipped with a semi-active device of controlled stiffness and damping parameters. The experiments cover a selection of control learning scenarios and comparisons to optimal open-loop and heuristic state-feedback control strategies. This study has confirmed that the developed method has high stabilizing performance, and the relatively low computational burden of the incorporated iterative learning algorithm facilitates its application to multi-degree-of-freedom structures.


Motivation
The rapid growth of the scale and complexity of modern designs in civil and mechanical engineering (e.g., bridges, overpasses, skyscrapers, and also automotive, railway, aerospace, and robotic technologies), and the evidence that large-scale systems can be exceptionally sensitive to external perturbations, have motivated intensive research into the design of reliable and high-performance structural controllers (Ghaedi et al., 2017; Gutierrez Soto & Adeli, 2017c; Li & Adeli, 2018). In line with the recent concepts of smart cities (Li & Adeli, 2018) and smart structures (Adeli & Saleh, 1997), special attention has been devoted to adaptive controllers that can operate in dynamic and uncertain environments. A prominent adaptive scheme is model predictive control (MPC), in which a finite-horizon optimal control problem is repeatedly solved while the time horizon is constantly rolled back (this is often referred to as receding horizon control). Even though MPC controllers rely on the system model by definition, some level of uncertainty in the model parameters or inaccuracies in forecasting the external disturbances can be compensated by a state-feedback loop that accommodates the actual system response in the subsequent optimal control problems. Numerous MPC applications can be found in the optimization of industrial processes (Bordons & Camacho, 1998) and traffic flows (Ferrara et al., 2015), where the controllers cope with time-varying parameters and evolving boundary conditions. MPC is of special importance for the coordination of wind farms (Vali et al., 2019), which are subject to permanent changes in wind direction. MPC-based controllers have also confirmed their efficiency in autonomous driving, where vehicles confront dynamic obstacles (Babu et al., 2018). In structural control, the majority of MPC controllers rely on specifically designed dynamic models that predict the evolution of the external excitation forces. Oveisi et al. (2018) developed a recursive least squares algorithm to estimate the disturbance signal, which is constantly updated and used to determine the receding horizon control.
The method was successfully validated for a piezo-laminated beam subjected to harmonic disturbances. In Wasilewski et al. (2019), earthquake excitation is recovered from an autoregressive model and fed forward to the MPC controller, which stabilizes the vibration of a multistorey building with hydraulic actuators. In Zelleke and Matsagar (2019), an energy-based predictive control algorithm was developed to suppress the vibration of a multistorey building subjected to wind excitation. An alternative method to mitigate the vibration of slender buildings exposed to uncertain excitation, based on the probabilistic robust control approach, was proposed by Yuen et al. (2007). Five optimal and suboptimal MPC methods were tested in Takacs and Rohal'-Ilkiv (2014) to determine their computational complexity and capabilities for online implementation to mitigate the free, steady-state, and transient vibration of a cantilever beam equipped with piezoceramic control devices. The authors observed no significant difference in control performance between the optimal and suboptimal strategies. They suggested that, in practice, the computationally efficient suboptimal methods (e.g., minimum-time explicit or Newton-Raphson MPC) may be implemented for systems of larger dimensions without a considerable loss of performance.
The majority of the MPC-based adaptive methods have been confirmed to offer decent stabilization performance. Nevertheless, due to the high computational complexity of the search for the optimal solution, they are mostly restricted to applications in linear structures with active force actuators. The recent trend in structural control promotes the use of semi-active devices (Cundumi & Suárez, 2008; Gutierrez Soto & Adeli, 2019; Naderpoor Shad & Taghikhany, 2021), in particular those based on intelligent materials (Ostrowski et al., 2021; Szmidt et al., 2019) that offer robust, energy-efficient operation and relatively easy deployment. However, semi-active devices introduce nonlinearities, which in the case of multi-degree-of-freedom systems result in highly complex optimal control problems. It is therefore essential to search for alternative approaches that allow for efficient online control adaptation. Appealing perspectives are offered by recent nonclassical computational approaches such as replicator dynamics (Gutierrez Soto & Adeli, 2017a, 2017b, 2018) or reinforcement learning (RL). The latter is a subfield of machine learning (Adeli & Hung, 1994; Amezquita-Sancheza et al., 2020) that is grounded in the idea of learning from interaction (Sutton & Barto, 2020). From the perspective of adaptive control design, an RL algorithm enables the control decisions to be adjusted based on the controller-system interaction. Therefore, the knowledge of the system model and its parameters may be much poorer than in the case of MPC. Furthermore, the computational complexity of successive updates of the RL control is significantly lower than that of searching for the optimal control solutions.

RL approaches
Approaches based on RL have recently achieved exceptionally successful results in a variety of hard real-world control-like problems, ranging from a superhuman level of proficiency in the games of chess and Go (Silver et al., 2018), through the thermal soaring of gliders (Reddy et al., 2016) and swimming by body undulation (Jiao et al., 2021), to bus traffic control (Shi et al., 2021) and autonomous car driving (Sallab et al., 2017; Shi et al., 2022). Despite these outstanding achievements and the conducive algorithmic structure, the possibilities offered by RL are only occasionally exploited in the field of structural control. There is only a handful of related publications. Although pioneering, they concern only active control or structures with a very limited number of degrees of freedom. Qiu et al. (2021) adopted a deep deterministic policy gradient RL algorithm to train the neural networks that are responsible for controlling a flexible hinged plate. The control was realized through piezoelectric actuators. The experiments confirmed that the developed method is superior to a PD (proportional-derivative) controller. In Nagendra et al. (2017), three RL algorithms (i.e., temporal-difference, policy gradient actor-critic, and value function approximation) were studied in the context of stabilizing a benchmark cart-pole system with no prior knowledge of its parameters. The authors compared the algorithms for their convergence and control performance, and concluded that the value function approximation method was the preferred option. In Khalatbarisoltani et al. (2019), the Q-learning RL algorithm was applied to tune a fuzzy-PD controller to stabilize the vibration of a high-rise building. The method was successfully verified for a selection of seismic scenarios, while taking into account time delays in the control loop. The potential of using RL to control the shape of an active tensegrity structure was studied in Adam and Smith (2008).
The developed algorithm combines case-based reasoning and learning from errors. It improves the control quality while reducing the time needed for the control computation. Dengler and Lohmann (2018) employed the RL actor-critic algorithm to stabilize a swinging chain at a desired position. The proposed active force control used incomplete state information. Although this method was outperformed by the control relying on the analytic solution, it can be viewed as a viable alternative to classical control designs when the model cannot be made sufficiently accurate.

Objectives and organization of this paper
This paper proposes a new RL-based control method to mitigate the transient vibration of semi-active structures subjected to an unknown repetitive harmonic excitation force. The method is dedicated to a class of bilinear systems that represent a wide range of structures with controlled internal parameters. The main contribution lies in the control design: due to the nonlinear nature of the considered dynamical system, it is necessary to construct a new control policy that guarantees the asymptotic stability of the homogeneous system and facilitates an efficient optimization to suppress the transient vibration induced by unknown excitation. The optimization is performed by a specially developed actor-only RL algorithm that relies on the state measurements and structural model parameters, with no information about the external excitation force. It adapts the control policy using the derivatives of an energy-related cost functional, which is defined directly over the parameter space of the assumed control policy. This technique exploits the convergence that is naturally inherited from the incorporated gradient descent method, accelerates the iterative learning protocol, and results in a suboptimal control. The proposed method is validated by numerical experiments for a span structure equipped with a semi-active device with controlled stiffness and damping parameters. The convergence and performance are examined for several learning scenarios, including random perturbations in the frequency of the excitation force. The designed RL controller is compared to the optimal open-loop solution and a heuristic strategy that relies on an equivalent control function precomputed offline. The relatively low computational complexity of the iterative learning algorithm opens up new perspectives for its application in large-scale complex structures.
The remainder of this work is structured as follows. Section 2 provides the assumptions and definitions of the considered system. The problem of RL control under an unknown excitation force is formulated and resolved in Section 3: the switching parameterized control policy is defined, and then an updating sequence for the policy parameter is derived and accommodated in an iterative learning algorithm. The comparative control methods are defined in Section 4. In Section 5, the proposed controller is investigated by means of numerical experiments. The concluding remarks are given in Section 6.

THE INVESTIGATED SYSTEM
This paper studies a class of semi-active vibrating structures that are governed by the following dynamical equation:

ẋ(t) = A x(t) + Σ_{k=1}^{m} u_k(t) B_k x(t) + f(t),  x(0) = x_0.  (1)

In Equation (1), x = x(t) : [0, T] → ℝⁿ represents the state vector at time t ∈ [0, T], where T > 0 is a considered control time. The initial state is denoted by x_0. Each of the control inputs u_1, …, u_m is assumed to be bounded by the minimum and maximum admissible values; that is, u_k(t) : [0, T] → U = [u_min, u_max], k = 1, …, m, u_min < u_max. These bounds correspond to the physical constraints of a semi-active device (e.g., extreme voltages). The n × n matrices A and B_k, k = 1, …, m are assumed to be constant. The n × 1 vector f(t) represents a repetitive excitation defined by a harmonic function with unknown amplitude, frequency, and phase shift (see Figure 1), which repeats in identical time windows corresponding to the control time interval [0, T]. For each time window, the excitation force f(t) ≠ 0 acts on the structure for t ∈ [0, t_e], where t_e < T. The time t_e is assumed to be sufficiently small, so that the vibrations in the whole time interval [0, T] are transient (steady-state vibration is not considered here). For the remaining control time, that is, for t ∈ (t_e, T], the problem of controlling free vibration is considered by setting f(t) = 0.
Equation (1) can represent a wide range of semi-actively controlled structures, such as cantilever beams with elastomer-based blocks (Szmidt et al., 2017), span structures supported with magneto-rheological dampers (Pisarski & Myśliński, 2017; Wasilewski & Pisarski, 2020), frames with dry friction-based joints (Popławski et al., 2019), vehicular suspensions (Pepe & Carcaterra, 2016), or buildings with semi-active tuned mass dampers (Runlin et al., 2002). The assumed excitation is typical for repetitive industrial operations, such as drilling or grinding. Similar characteristics can model the influence of vehicle formations on the neighboring infrastructure. Temporary harmonic excitation is also found in the repetitive starting-up of rotor machines (e.g., mills, blowers, pumps, compressors), which is a result of subsynchronous resonances that are induced by the electro-mechanical interaction between the motor's electric circuit and the shaft structure (Szolc et al., 2019).

FIGURE 1 The assumed repetitive harmonic excitation with unknown amplitude, frequency, and phase shift. For the adaptation of the control decision, a learning time window denoted by t_l will be used.
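To make the bilinear form of Equation (1) concrete, the following sketch simulates a hypothetical single-degree-of-freedom oscillator whose damper is modulated by a bounded control input. All numerical values (mass, stiffness, excitation amplitude and frequency, control bounds) are illustrative assumptions, not the paper's span-structure parameters.

```python
import numpy as np

# Hypothetical single-DOF instance of the bilinear model of Equation (1):
# x' = A x + u B x + f(t), with u constrained to [u_min, u_max].
m, k, c = 1.0, 100.0, 0.5                      # mass, stiffness, base damping
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
B = np.array([[0.0, 0.0], [0.0, -1.0 / m]])    # control adds viscous damping
u_min, u_max = 0.02, 1.0

amp, freq, t_e, T = 100.0, 25.0, 0.2, 2.0      # excitation acts only on [0, t_e]

def f(t):
    """Harmonic excitation, switched off after t_e (transient setting)."""
    if t <= t_e:
        return np.array([0.0, amp * np.sin(2.0 * np.pi * freq * t) / m])
    return np.zeros(2)

def simulate(u_of_tx, dt=1e-4):
    """Explicit Euler integration of the bilinear dynamics."""
    n = int(round(T / dt))
    x = np.zeros(2)
    traj = np.empty((n + 1, 2))
    traj[0] = x
    for i in range(n):
        t = i * dt
        u = np.clip(u_of_tx(t, x), u_min, u_max)   # semi-active bounds
        x = x + dt * (A @ x + u * (B @ x) + f(t))
        traj[i + 1] = x
    return traj

traj = simulate(lambda t, x: u_max)   # passive strategy: maximal damping
```

Any state-feedback policy can be passed in place of the constant lambda, which makes the same loop reusable for the switching policies discussed later.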

RL-BASED CONTROL DESIGN
The aim is to design state-feedback control functions u_1, …, u_m (referred to as the policy) for the system Equation (1) (the environment in RL terminology) that guarantee efficient suppression of the vibration induced by the excitation f. Regarding the excitation structure (Figure 1), two control phases are distinguished:

Phase I. The first phase is concerned with the transient vibration that is observed for t ∈ [0, t_e]. Here, an actor-only RL algorithm will be developed that allows the policy to be adapted to the unknown characteristics of the excitation f(t) ≠ 0. The algorithm will employ state measurements x(t) for some learning time window t ∈ [0, t_l] (where t_l ≤ T) and it will provide successive reductions of the value of the cost functional:

J = ∫₀^{t_l} E(x(t)) dt.  (2)

In Equation (2), E(x) stands for the structural energy:

E(x) = xᵀ Q x,  (3)

where Q ≻ 0 is a positive definite n × n matrix.

Phase II. For the free vibration at t ∈ (t_e, T], a policy will be used that assures the asymptotic stability of Equation (1) for f(t) = 0. It will be derived using the Lyapunov function method and the structural energy matrix Q.
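The cost functional of Equation (2) with the quadratic energy of Equation (3) can be approximated from sampled state measurements. A minimal sketch, with an assumed energy matrix and a synthetic decaying trajectory in place of real measurements:

```python
import numpy as np

def cost(traj, Q, dt, t_l):
    """Trapezoidal approximation of J = integral of x(t)' Q x(t) over [0, t_l]."""
    n = int(round(t_l / dt))
    e = np.einsum('ij,jk,ik->i', traj[:n + 1], Q, traj[:n + 1])  # energy samples
    return dt * (e.sum() - 0.5 * (e[0] + e[-1]))                 # trapezoid rule

Q = np.diag([100.0, 1.0])              # assumed positive definite energy matrix
t = np.linspace(0.0, 2.0, 2001)
traj = np.stack([np.exp(-t) * np.cos(10 * t),
                 -np.exp(-t) * np.sin(10 * t)], axis=1)          # synthetic decay
J = cost(traj, Q, dt=t[1] - t[0], t_l=2.0)
```

Since the integrand is a positive definite quadratic form, shortening the learning window t_l can only decrease the computed cost.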

Parameterized policy
To construct a policy that is uniform for control phases I and II, a parameterized state-dependent control law is employed. For each control time t ∈ [0, T], the switching policy u_1(t), …, u_m(t) is defined for Equation (1) as follows:

u_k(t) = u_max if xᵀ(t) K_k(t) x(t) < 0,  u_k(t) = u_min otherwise.  (4)

Here, K_k(t) is an n × n matrix referred to as the policy parameter. Each of the policy parameters K_k(t), k = 1, …, m is structured by two subparameters. The first is denoted by K_k* and iterated online by the learning algorithm in phase I; that is, for the time t ∈ [0, t_e], where the nonzero excitation force acts on the structure. The second subparameter is K_k⁰. It is precomputed offline based on the system structure and remains constant for the free vibration in phase II, that is, for t ∈ (t_e, T] when f = 0. Formally, the policy parameters can be written as follows:

K_k(t) = K_k* for t ∈ [0, t_e],  K_k(t) = K_k⁰ for t ∈ (t_e, T].  (5)

The method for the iterative learning of K_k*, k = 1, …, m will be discussed in detail in the next section. First, the computation of the constant parameters K_k⁰, k = 1, …, m is considered. For each K_k⁰, k = 1, …, m, it is assumed that:

K_k⁰ = B_kᵀ P + P B_k,  (6)

where P is an n × n symmetric matrix, which is computed as the solution to the following Lyapunov equation:

Aᵀ P + P A = −Q.  (7)

In Equation (7), the matrix Q ≻ 0 is the same as in the definition of the energy function Equation (3). The assumed structure of the policy parameters K_k⁰ guarantees that the system Equation (1) for f(t) = 0 is asymptotically stable. To inspect this, let V = V(x) be the Lyapunov function, which is defined by (see, e.g., Sastry, 1999):

V(x) = xᵀ P x.  (8)

The time derivative of V is:

V̇ = ẋᵀ P x + xᵀ P ẋ.  (9)

The insertion of Equation (1) with f(t) = 0 into Equation (9) yields:

V̇ = xᵀ (Aᵀ P + P A) x + Σ_{k=1}^{m} u_k xᵀ (B_kᵀ P + P B_k) x,  (10)

which can be written in the following form:

V̇ = xᵀ (Aᵀ P + P A) x + Σ_{k=1}^{m} u_k xᵀ K_k⁰ x.  (11)

From the Lyapunov equation (7) and the symmetry of the matrix P, it can be eventually concluded that:

V̇ = −xᵀ Q x + Σ_{k=1}^{m} u_k xᵀ K_k⁰ x.  (12)

The application of the switching policy Equation (4) ensures that:

Σ_{k=1}^{m} u_k(t) xᵀ(t) K_k⁰ x(t) ≤ 0.  (13)

From Q ≻ 0 and Equation (13), it follows that V̇ < 0 for every x ≠ 0, which guarantees the asymptotic stability of the closed-loop system Equation (1) with Equations (4) and (5) for t ∈ (t_e, T].
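The offline computation of the stabilizing subparameter can be sketched as follows: solve a Lyapunov equation for P and form the switching-surface matrix from P and the input matrix. The toy system matrices and the switching rule shown are illustrative assumptions, not the paper's span structure.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Toy Hurwitz system (damped oscillator) with a controlled-damping input matrix.
m, k, c = 1.0, 100.0, 0.5
A = np.array([[0.0, 1.0], [-k / m, -c / m]])
B = np.array([[0.0, 0.0], [0.0, -1.0 / m]])
Q = np.eye(2)

P = solve_continuous_lyapunov(A.T, -Q)   # solves A' P + P A = -Q
K0 = B.T @ P + P @ B                     # offline switching-surface matrix

def policy(x, K, u_min=0.02, u_max=1.0):
    """Switching law in the spirit of Eq. (4): u_max where x'Kx < 0, else u_min."""
    return u_max if x @ K @ x < 0.0 else u_min
```

Note the SciPy convention: `solve_continuous_lyapunov(a, q)` solves a X + X aᴴ = q, hence the transposed first argument and the negated Q.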

Policy parameter update
The policy parameters K_k*, k = 1, …, m in Equation (5) will be updated using an approach that is in line with the actor-only RL method. The actor-only method relies on the optimization of a cost functional that is defined directly over the parameter space of the policy (Grondman, 2015). Here, the energy-related objective functional defined in Equation (2) will be optimized over the parameter space K_k*, k = 1, …, m. For each matrix K_k*, the admissible set is defined as:

𝒦_k* = { K_k* : k*_min ≤ (K_k*)_{ij} ≤ k*_max, i, j = 1, …, n },  (14)

where k*_min < k*_max are given real constants. The solution x(t) to the system Equation (1) depends continuously on the policy parameters K_k*, k = 1, …, m (see Chicone, 2006, Theorem 1.3). This result implies the continuity of the cost functional J in Equation (2) with respect to K_1*, …, K_m* on the admissible set 𝒦_1* × ⋯ × 𝒦_m*. This set is finite-dimensional and compact. Therefore, from the Weierstrass theorem (Liberzon, 2012), it follows that there is a set of policy parameters K_k*, k = 1, …, m that minimizes J. The method of steepest descent is used to search for the optimal policy parameters, which relies on the following updating sequence:

K_k*(j) = Proj_{𝒦_k*} ( K_k*(j−1) − α_k dJ/dK_k* |_{K_k* = K_k*(j−1)} ),  j = 1, …, N,  (15)

where α_k > 0, and N is the maximal number of learning iterations. The sequence Equation (15) will be initialized by setting

K_k*(0) = K_k⁰,  (16)

where K_k⁰ is computed as in Equation (6).
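One iteration of a projected steepest-descent update of the kind used in Equation (15) might look as follows; the box bounds, step size, and gradient values are assumed for illustration.

```python
import numpy as np

def update_policy_parameter(K, grad, alpha, k_lo=-10.0, k_hi=10.0):
    """Gradient step followed by entrywise projection onto the admissible box."""
    return np.clip(K - alpha * grad, k_lo, k_hi)

K = np.zeros((2, 2))                               # current policy parameter
grad = np.array([[100.0, -3.0], [0.5, -200.0]])    # illustrative cost derivative
K_new = update_policy_parameter(K, grad, alpha=0.1)
```

The entrywise clipping is exactly the projection onto a box-shaped admissible set: any entry that a gradient step pushes past a bound is snapped back to that bound.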
To derive the formula for the derivative of the objective functional J with respect to the policy parameters K_k*, k = 1, …, m in Equation (15), the policy Equation (4) is first rewritten for t ∈ [0, t_e] using the unit step function Θ(⋅), as follows:

u_k(t) = u_min + (u_max − u_min) Θ(−xᵀ(t) K_k* x(t)).  (17)

Here, it is assumed that Θ(y) = 1 for y ≥ 0 and Θ(y) = 0 otherwise. Next, the Hamiltonian for the cost functional Equation (2) is defined:

H(x, p) = xᵀ Q x + pᵀ ( A x + Σ_{k=1}^{m} u_k B_k x + f ),  (18)

with the adjoint state p = p(t) : [0, t_l] → ℝⁿ satisfying the following differential equation:

ṗ = −∂H/∂x,  p(t_l) = 0,  (19)

where the differentiation of the unit step function in Equation (17) introduces into the right-hand side terms with the Dirac delta function δ(⋅). From Equation (3) and Equation (18), the cost functional Equation (2) can be represented by:

J = ∫₀^{t_l} ( H(x, p) − pᵀ ẋ ) dt.  (20)

Let the functions ξ : [0, t_l] → ℝⁿ and π : [0, t_l] → ℝⁿ denote the perturbations of the functions x and p with respect to the infinitesimal changes dK_k* of K_k*, k = 1, …, m, respectively. From the differentiability of the state vector with respect to K_k*, the perturbation ξ is differentiable and satisfies ξ(0) = 0, since the initial state in Equation (1) does not depend on the policy parameters. Now let k_{k,i}*, i = 1, …, n denote the vector corresponding to the ith column of the matrix K_k*. Consistently, dk_{k,i}* will stand for the perturbation of the vector corresponding to the ith column of the matrix K_k*. From Equation (20), it follows that the differential dJ of the cost functional Equation (2) with respect to the perturbations dk_{k,i}*, k = 1, …, m is given by

dJ = ∫₀^{t_l} ( (∂H/∂x)ᵀ ξ + Σ_{k=1}^{m} Σ_{i=1}^{n} (∂H/∂k_{k,i}*)ᵀ dk_{k,i}* − pᵀ ξ̇ + (∂H/∂p − ẋ)ᵀ π ) dt.  (22)

Recall that (Chicone, 2006):

∂H/∂p = ẋ,  (23)

and thus the last term in Equation (22) can be canceled. Furthermore, integration by parts yields:

∫₀^{t_l} pᵀ ξ̇ dt = [ pᵀ ξ ]₀^{t_l} − ∫₀^{t_l} ṗᵀ ξ dt.  (24)

From p(t_l) = 0 and the initial condition in Equation (1), which implies ξ(0) = 0, it follows that the boundary term in Equation (24) vanishes:

[ pᵀ ξ ]₀^{t_l} = 0.  (25)

Taking into account Equations (23), (24), and (25) and inserting them into Equation (22), one obtains:

dJ = ∫₀^{t_l} ( (∂H/∂x + ṗ)ᵀ ξ + Σ_{k=1}^{m} Σ_{i=1}^{n} (∂H/∂k_{k,i}*)ᵀ dk_{k,i}* ) dt.  (26)

From the definition of the adjoint state in Equation (19), it can be observed that:

dJ = ∫₀^{t_l} Σ_{k=1}^{m} Σ_{i=1}^{n} (∂H/∂k_{k,i}*)ᵀ dk_{k,i}* dt.  (27)

From Equation (27), it follows that the derivative of the functional J with respect to the matrix K_k* is given by:

dJ/dK_k* = ∫₀^{t_l} ∂H/∂K_k* dt.  (28)

Using the definition of the Hamiltonian (18), it follows that:

∂H/∂K_k* = −(u_max − u_min) ( pᵀ B_k x ) δ(−xᵀ K_k* x) x xᵀ.  (29)

Applying the Dirac delta function's sifting property, the explicit formula is obtained:

dJ/dK_k* = −(u_max − u_min) Σ_{s=1}^{r} [ ( pᵀ B_k x ) x xᵀ / | d(xᵀ K_k* x)/dt | ] |_{t = τ_s},  (30)

where τ_1, …, τ_r denotes the sequence of time instants when the argument of the Dirac delta function in Equation (29) equals zero:

xᵀ(τ_s) K_k* x(τ_s) = 0,  s = 1, …, r.  (31)

To evaluate the cost derivative Equation (30) and perform the updating of the policy parameters Equation (15), the method has to rely on the information of the state x(t) and adjoint state p(t) for t ∈ [0, t_l]. While the state is assumed to be measurable or accessible through a state observer, the collection of the adjoint state requires an integration of the differential equation (19). The right-hand side of Equation (19) includes the Dirac delta function terms, so the solution is piecewise continuous and composed of a set of limiting functions; see Nedeljkov and Oberguggenberger (2012, Proposition 2.1). A convenient way to generate the trajectory of p(t) is to start by detecting the sequence of time instants τ_1, …, τ_r as in Equation (31). Next, the sequence of time steps t_0 = 0, t_1, …, t_M = t_l is assumed for the purpose of the backward integration of the equation:

ṗ = −( 2 Q x + Aᵀ p + Σ_{k=1}^{m} u_k B_kᵀ p ),  (32)

that is, Equation (19) with the Dirac delta terms omitted, for t ∈ [t_{s_r}, t_l].
Here, t_{s_r} denotes the time step that is the closest to the time instant τ_r, that is,

s_r = argmin_{i = 0, 1, …, M} | t_i − τ_r |.  (33)

Then, for t = t_{s_r} the following jump is performed:

p(t_{s_r}) ← p(t_{s_r}) + Δp(t_{s_r}),  (34)

where the increment of the adjoint state Δp(t_{s_r}) is computed by applying the Dirac delta function's sifting property and the specific structure of the right-hand side of the adjoint state dynamical equation (19). Next, taking into account the value of p(t_{s_r}) updated by the jump Equation (34), the backward integration of Equation (32) is continued until t = t_{s_{r−1}}; that is, until the time step that is the closest to the time instant τ_{r−1} and found in analogy to Equation (33). Again, a jump at t = t_{s_{r−1}} is made in the same manner as in Equation (34). The operation is repeated until t = t_0. A complete procedure for updating the policy parameters is demonstrated in Algorithm 1.

ALGORITHM 1 Iterative learning algorithm to update the policy parameters K_1*, …, K_m*. (Step 6: check if any of the terminal conditions is fulfilled, ‖dJ/dK*‖ |_{K* = K*(j−1)} < ε or j = N; if yes, then STOP, otherwise go to Step 2.)

In Figure 2, the RL process is visualized in the context of an actor-environment interaction.
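The backward sweep with adjoint-state jumps can be sketched generically: integrate the smooth part of the dynamics backward in time and add a finite jump at the grid point closest to each detected switching instant. The dynamics, jump value, and grid below are placeholders, not the paper's adjoint equation.

```python
import numpy as np

def integrate_adjoint_backward(t_grid, rhs, jump_times, jump_value, p_T):
    """Backward Euler sweep; adds a finite jump at each detected instant."""
    p = np.empty((len(t_grid), len(p_T)))
    p[-1] = p_T
    jump_idx = {int(np.argmin(np.abs(t_grid - tau))) for tau in jump_times}
    for i in range(len(t_grid) - 1, 0, -1):
        dt = t_grid[i] - t_grid[i - 1]
        p[i - 1] = p[i] - dt * rhs(t_grid[i], p[i])   # step backward in time
        if (i - 1) in jump_idx:
            p[i - 1] = p[i - 1] + jump_value          # Dirac-induced jump
    return p

t_grid = np.linspace(0.0, 1.0, 1001)
rhs = lambda t, p: -0.5 * p                  # placeholder smooth adjoint dynamics
p = integrate_adjoint_backward(t_grid, rhs,
                               jump_times=[0.5],
                               jump_value=np.array([1.0]),
                               p_T=np.array([0.0]))
```

With the zero terminal condition, the trajectory stays at zero until the jump instant is reached (going backward) and evolves under the smooth dynamics from there to t = 0.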
Remarks

R1. The value of the cost derivative in Equation (30) is related to the number r of time instants τ_1, …, τ_r defined in Equation (31). To guarantee a substantial decrease of the cost functional value when executing the sequence Equation (15), the selection of the step sizes α_1, …, α_m ∈ (0, 1) in Step 1 should depend on the number r and may vary between subsequent iterations j. Here, it will be assumed that α_k = ᾱ_k / r, k = 1, …, m, for some small positive numbers ᾱ_1, …, ᾱ_m.

R2. Respecting the structure of the admissible sets Equation (14), the projection Proj_{𝒦_k*}(K_k*) used in Step 5 is defined entrywise as follows:

( Proj_{𝒦_k*}(K_k*) )_{ij} = min{ k*_max, max{ k*_min, (K_k*)_{ij} } },  i, j = 1, …, n.  (36)

R3. The norm of the cost derivative in the terminal condition in Step 6 is defined as the maximal absolute value of the entries of the matrices Equation (30) for k = 1, …, m.  (37)

FIGURE 2 Scheme of the reinforcement learning process. Updating the control policy parameters K_1*, …, K_m* to the unknown excitation force is based on the actor-environment (controller-dynamic system) interaction.

COMPARATIVE CONTROLS
To assess the efficiency of the developed method, it will be compared with the optimal open-loop solution, a heuristic control, and a passive strategy. The focus will be on optimality in suppressing the transient vibration in control phase I. For that purpose, the open-loop optimal control for t ∈ [0, t_e] will be computed assuming complete information about the excitation f(t). An examination of the overall stabilization capabilities, including control phase I and the free vibration in control phase II (i.e., for t ∈ [0, T]), will be performed by comparison to the heuristic and passive strategies.

Open-loop optimal control
The open-loop optimal control u_1(t), …, u_m(t) for t ∈ [0, t_e] will be established as the solution to the problem of minimizing the cost functional J as in Equation (2); that is,

min_{u_1, …, u_m ∈ U} J.  (38)

Assuming the set of admissible controls U = [u_min, u_max] and employing the Pontryagin Maximum Principle (Pontryagin et al., 1962) lead to the following solution to the problem Equation (38):

u_k(t) = u_max if pᵀ(t) B_k x(t) < 0,  u_k(t) = u_min otherwise,  (39)

where p(t) stands for the adjoint state that is computed using the Hamiltonian associated with the problem Equation (38) (see Mohler, 1973). To determine the trajectories of u_1(t), …, u_m(t), a method based on gradient descent will be used (see Pisarski, 2012).
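The bang-bang structure implied by the Pontryagin Maximum Principle reduces, at each instant, to a sign test on a switching function; the form pᵀB_k x below is assumed from the bilinear Hamiltonian, and the matrices are illustrative.

```python
import numpy as np

def open_loop_control(p, x, B, u_min=0.02, u_max=1.0):
    """Pick the admissible bound minimizing the control term of the Hamiltonian."""
    sigma = p @ (B @ x)        # switching function
    return u_min if sigma > 0.0 else u_max

B = np.array([[0.0, 0.0], [0.0, -1.0]])    # illustrative input matrix
u = open_loop_control(np.array([0.0, 1.0]), np.array([0.0, 1.0]), B)
```

Because the Hamiltonian is linear in each u_k, the minimizer always sits on one of the two admissible bounds, which is why the optimal open-loop control is of the switching type.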

Heuristic control
Heuristic control is based on the concept of instantaneous optimization of the rate of change of the system energy (Pisarski, 2018). The control functions u_1(t), …, u_m(t) for t ∈ [0, T] that provide the best instantaneous decrease of the energy E(x) (see Equation (3)) are sought. Computing the time derivative of the energy function (see Equation 3):

Ė = ẋᵀ Q x + xᵀ Q ẋ,  (40)

and inserting the dynamics Equation (1) yields:

Ė = xᵀ (Aᵀ Q + Q A) x + Σ_{k=1}^{m} u_k xᵀ (B_kᵀ Q + Q B_k) x + 2 fᵀ Q x.  (41)

The best instantaneous decrease of E is then provided by the switching functions:

u_k(t) = u_max if xᵀ(t) (B_kᵀ Q + Q B_k) x(t) < 0,  u_k(t) = u_min otherwise.  (42)

It can be observed that the control Equation (42) guarantees the asymptotic stability of Equation (1) if f(t) = 0 and A is a Hurwitz matrix (the last condition is fulfilled for the majority of structures as a consequence of material or viscous damping).
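The heuristic rule can be sketched directly from the sign of each device's contribution to the energy rate Ė; the energy and input matrices below are illustrative assumptions.

```python
import numpy as np

def heuristic_control(x, B, Q, u_min=0.02, u_max=1.0):
    """Minimize the instantaneous energy rate: per unit u, the device adds
    x'(B'Q + QB)x to dE/dt, so pick u_max when that term is negative."""
    s = x @ (B.T @ Q + Q @ B) @ x
    return u_max if s < 0.0 else u_min

Q = np.diag([100.0, 1.0])                  # assumed energy matrix
B = np.array([[0.0, 0.0], [0.0, -1.0]])    # illustrative input matrix
u = heuristic_control(np.array([0.1, 1.0]), B, Q)
```

Unlike the RL policy, this rule needs no learning: it is fully precomputable from Q and B_k, which is why it serves as an offline-designed benchmark.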

Passive strategy
In this method, constant control functions u_1(t), …, u_m(t) for t ∈ [0, T] will be assumed, where each actuator operates at the maximal admissible value; that is, u_k(t) = u_max, k = 1, …, m. In the majority of semi-actively controlled structures, this operation is equivalent to the optimal passive strategy (see, e.g., Szmidt et al., 2017).

The analyzed structure
A span structure supported by a semi-active actuator will be investigated, as depicted in Figure 3. For the span, a slender elastic body is assumed that is subjected to small deflections. The height and the depth of the span are small compared to its length L. The span can thus be represented by the Euler-Bernoulli beam equation parameterized by the bending stiffness EI and the linear density ρA. It is subjected to the external damping of air, which is characterized by the coefficient ν. The semi-active actuator is attached at the position a = 0.4 L. Note that the ith mode shape of the assumed simply supported beam at the coordinate z is characterized by φ_i(z) = sin(iπz/L), and for the first four modes i = 1, …, 4 the assumed actuator position guarantees φ_i(a) ≠ 0. Therefore, the actuator's position allows the first four modes to be controlled but is otherwise selected arbitrarily. For the actuator, a controlled input u is assumed that influences the damping c(u) and stiffness k(u) parameters. Each of these parameters depends linearly on the control variable; that is,

c(u) = ĉ u,  k(u) = k̂ u,  (43)

where ĉ and k̂ are assumed to be constant. The force generated by the actuator is assumed to be equal to the sum of the elastic and damping forces that are, respectively, proportional to the beam's transverse deflection and velocity at the point a. The unknown external force is of short duration and acts on the span at the point b = 0.6 L. The parameters assumed for the simulations are listed in Table 1.
A deflection of the span at the coordinate z and time t is denoted by w(z, t). The Dirac delta function δ(⋅) is used to describe the contact point between the span and the actuator or the external force. Based on these assumptions, the structure can be represented by the following partial differential equation:

ρA ∂²w/∂t² + ν ∂w/∂t + EI ∂⁴w/∂z⁴ = −δ(z − a) [ k(u) w(a, t) + c(u) ∂w(a, t)/∂t ] + δ(z − b) f(t).  (44)

The left-hand side of Equation (44) consists of the common elements of the Euler-Bernoulli beam equation that characterize the inertial, air damping, and potential forces of the span. On the right-hand side there are the terms that stand for the viscoelastic forces generated by the semi-active actuator and the unknown force acting on the structure. The assumed endpoint supports (see Figure 3) enforce the boundary conditions:

w(0, t) = w(L, t) = 0,  ∂²w(0, t)/∂z² = ∂²w(L, t)/∂z² = 0.  (45)

For each simulation, a zero initial condition is assumed:

w(z, 0) = 0,  ∂w(z, 0)/∂t = 0.  (46)

The finite element method is employed to represent Equation (44) in the form of an ordinary differential equation, as in Equation (1). For the span structure, 10 identical elements and 11 uniformly distributed nodes are used (nodes 1 and 11 are located at the positions z = 0 and z = L, respectively). Introducing the vector of nodal displacements (q_1, …, q_11) = (w_1, …, w_11) (w_i, i = 1, …, 11 represents the span's displacement at the ith node's position) and the angles of rotation (q_12, …, q_22) = (θ_1, …, θ_11) (θ_i, i = 1, …, 11 represents the span's angle of rotation at the ith node's position), the system Equation (44) can be approximated by the second-order differential equation:

M q̈ + ( C + u C₁ ) q̇ + ( K + u K₁ ) q = f̄(t).  (47)

In Equation (47), M, C, and K are, respectively, the 22 × 22 mass, damping, and stiffness matrices, K₁ and C₁ are the 22 × 22 matrices that accommodate the elastic and damping forces generated by the actuator, and f̄ represents the 22 × 1 vector that incorporates the unknown external force. The composition of these terms results from a standard finite element approach that involves shape functions based on third-degree polynomials (Bathe, 1996). Define the state vector:

x = ( qᵀ, q̇ᵀ )ᵀ,  (48)

the system matrices as:

A = [ 0, I ; −M⁻¹K, −M⁻¹C ],  B = [ 0, 0 ; −M⁻¹K₁, −M⁻¹C₁ ],  (49)

and the external force vector:

f(t) = ( 0ᵀ, ( M⁻¹ f̄(t) )ᵀ )ᵀ.  (50)

With these definitions, the structure takes the bilinear state-space form of Equation (1):

ẋ = A x + u B x + f(t).  (51)
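A minimal finite-element assembly in the spirit of Equation (47) can be sketched with standard Hermite-cubic beam elements (Bathe, 1996). The parameter values are placeholders, and the degrees of freedom are ordered per node as (deflection, rotation), unlike the (all deflections, then all rotations) ordering used above.

```python
import numpy as np

def assemble_beam(n_el, L, EI, rho_A):
    """Assemble mass and stiffness matrices for an Euler-Bernoulli beam
    discretized with n_el identical Hermite-cubic elements."""
    Le = L / n_el
    Ke = (EI / Le**3) * np.array([
        [12.0, 6 * Le, -12.0, 6 * Le],
        [6 * Le, 4 * Le**2, -6 * Le, 2 * Le**2],
        [-12.0, -6 * Le, 12.0, -6 * Le],
        [6 * Le, 2 * Le**2, -6 * Le, 4 * Le**2]])
    Me = (rho_A * Le / 420.0) * np.array([
        [156.0, 22 * Le, 54.0, -13 * Le],
        [22 * Le, 4 * Le**2, 13 * Le, -3 * Le**2],
        [54.0, 13 * Le, 156.0, -22 * Le],
        [-13 * Le, -3 * Le**2, -22 * Le, 4 * Le**2]])
    n_dof = 2 * (n_el + 1)
    K = np.zeros((n_dof, n_dof))
    M = np.zeros((n_dof, n_dof))
    for e in range(n_el):                 # overlap consecutive 4x4 blocks
        idx = slice(2 * e, 2 * e + 4)
        K[idx, idx] += Ke
        M[idx, idx] += Me
    return M, K

M, K = assemble_beam(n_el=10, L=2.0, EI=210e9 * 1e-8, rho_A=10.0)
```

Pinning the endpoint deflections (the simply supported conditions of Equation 45) and solving the generalized eigenproblem recovers the analytic fundamental frequency ω₁² = (π/L)⁴ EI/(ρA) to well within a percent for 10 elements.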

Controller settings
For each control function (the RL policy of Section 3.1, the open-loop optimal control of Section 4.1, the heuristic control of Section 4.2, and the passive strategy of Section 4.3), it is assumed that u_min = 0.02 and u_max = 1. To simulate the transient response of the semi-active device, any change between the extreme control values is realized at a constant rate, and this change takes 0.002 (s); for an alternative filtering approach, see Wang and Adeli (2015a). For the control time, T = 2 (s) is selected (see Figure 1). An unknown excitation force f(t) ≠ 0 acts on the structure for t_e = 0.2 (s).
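The rate-limited transition between the extreme control values (the 0.002 s ramp mentioned above) can be sketched as a simple slew-rate limiter; the simulation time step is an assumed value.

```python
import numpy as np

def rate_limited(u_cmd, u_prev, dt, u_min=0.02, u_max=1.0, t_ramp=0.002):
    """Move the applied control toward the command at a bounded rate, so that
    a full swing u_min -> u_max takes t_ramp seconds."""
    max_step = (u_max - u_min) * dt / t_ramp
    return u_prev + np.clip(u_cmd - u_prev, -max_step, max_step)

u = 0.02
for _ in range(20):            # 20 steps of dt = 1e-4 s cover the 0.002 s ramp
    u = rate_limited(1.0, u, dt=1e-4)
```

Inserting this limiter between the switching policy and the plant turns the ideal bang-bang command into the finite-rate transition that the simulated device can actually realize.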
For the designed RL algorithm, the maximal number of iterations N = 500 (Step 1) was assumed, along with the terminal condition parameter ε = 0.001 (Step 6). The learning process was repeated for three lengths of the learning time window: t_l = 0.4 T, t_l = 0.75 T, and t_l = T. Based on several test runs, the step size for the updating sequence Equation (15) was selected as α = 0.0015/r for t_l = 0.4 T and t_l = 0.75 T, and α = 0.0005/r for t_l = T, where r was computed at each iteration using Equation (31) (see Step 1 and Remark R1). To solve the adjoint state equation (Step 3), the Runge-Kutta fourth-order scheme was employed with a time step of 0.0001 (s). The controller was implemented in the MATLAB programming language and run on a workstation with an Intel Xeon, 3.00 GHz, and 16 GB of RAM, operating on the Linux platform.

Simulation results
The proposed method will be examined for three scenarios of the external excitation force f(t). In the first scenario (Case A), the excitation will repeat at each learning iteration with an identical frequency. The convergence of the policy parameters will then be analyzed, as well as the stabilization performance, in comparison to the optimal, heuristic, and passive strategies. The two subsequent scenarios (Cases B and C) will validate the robustness of the learning process. Here, the policy parameter will be updated for the excitation force with either a randomly perturbed frequency (Case B) or an additional high-frequency harmonic perturbation of a random amplitude (Case C). For the comparisons, the cost functional as in Equation (2) will be used, computed either for the transient vibration in control phase I (that is, over [0, t_e]) or for the overall process, including the transient and free vibration in control phases I and II (that is, over [0, T]). Finally, the computational effort will be assessed by varying the number of finite elements that compose the structure from 10 to 100.

Case A
The external excitation is assumed to repeat constantly for t_e = 0.2 (s) with an identical characteristic given by f(t) = A_e sin(2π ν_e t), where the amplitude A_e = 100 (N) and the frequency ν_e = 25 (Hz) are constant. The duration between subsequent repetitions is assumed to be sufficiently large. Therefore, for each repetition, zero initial conditions are assumed for Equation (51). It can be emphasized that the specified excitation parameters are unknown to the designed controller. The reader can also observe that the proposed controller does not require any information on which point of the structure the force is acting.

FIGURE 4 Evolution of the derivative norm with respect to the updating iteration for the assumed learning time windows. For each case, the norm is normalized to its initial value.

FIGURE 5 Evolution of the cost functional value computed for the assumed learning time windows. For each case, the cost is normalized to its initial value.
To investigate the proposed control learning process realized through Algorithm 1, the evolution of the derivative norm (see Equation 37) with respect to the updating iterations can be analyzed for the assumed learning time windows, as depicted in Figure 4. For each case, the optimization procedure was terminated by reaching the maximal number of iterations. In each of the curves, smooth sections can be observed that are separated by instant jumps. The latter indicate the iterations at which the derivative changes the number of summed components (see Equation 31). Although there are sections where the derivative norm increases, the general downward trend (for each case, the final value is lower than the initial one) validates the convergence of the sequence for updating the policy parameter, Equation (15). The convergence of the steepest-descent procedure can also be inspected by analyzing the evolution of the cost functional value, Equation (2), as presented in Figure 5. Here, a substantial decrease in the cost functional value can be observed for each of the assumed learning time windows. The notably slower convergence rate in the case of the full-length window follows from the selection of a lower step size compared to the remaining cases (see Section 5.2). This selection was motivated by the fact that a longer learning time window exhibits a larger variation in the number of summed components (see Equation 31), which in turn implies a larger variation in the derivative value, Equation (30). The lower step size assumed for the largest time window allowed overshooting to be avoided and guaranteed a stable reduction of the cost functional value. It should be noted that a too-short learning time window may significantly degrade the control efficiency. The piecewise sections in the cost trajectories stem from the discrete-time solution of Equation (51), which for some subsequent algorithm iterations yields identical switching times of the control policy function, Equation (4). The effect of some policy parameter updates on the state response then cannot be detected.

Table 2: Comparison of the cost functional obtained with the designed method and the comparative controls. For each control case, the values are normalized to the passive strategy.

Figure 6: Comparison of the beam's deflection for the considered controllers, measured for control phase I at the actuator's position.
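The learning loop described above can be sketched as a steepest-descent iteration with a gradient-normalized step and two stopping rules (iteration limit and derivative-norm threshold). The following Python sketch is only a schematic analogue of Algorithm 1, with assumed names and a toy quadratic cost standing in for the actual adjoint-based gradient of Equation (31):

```python
import numpy as np

def learn_policy_parameter(grad, theta0, base_step=0.0015,
                           max_iter=500, tol=1e-3):
    """Schematic steepest-descent update of the policy parameter:
    move against the cost gradient with a step scaled by the current
    gradient norm, until the norm falls below a fraction `tol` of its
    initial value or `max_iter` iterations are exhausted."""
    theta = np.asarray(theta0, dtype=float)
    g0 = np.linalg.norm(grad(theta))
    for _ in range(max_iter):
        g = grad(theta)
        gnorm = np.linalg.norm(g)
        if gnorm <= tol * g0:                     # terminal condition
            break
        theta = theta - (base_step / gnorm) * g   # normalized descent step
    return theta

# Toy usage: minimize J(theta) = ||theta||^2 / 2, whose gradient is
# simply theta; each step then shortens the parameter vector by a
# fixed amount base_step along its own direction.
theta_star = learn_policy_parameter(lambda th: th, np.array([1.0, -2.0]))
```

Normalizing the step by the gradient norm, as hinted at in Section 5.2, keeps the update magnitude bounded even when the number of summed derivative components jumps between iterations.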
After completing Algorithm 1 for the three learning time windows, the comparative study can be performed. Regarding the assumed cost functional, Equation (2), computed for the transient vibration in control phase I (see Section 3), the RL-based control (in short, RL control) exhibits comparable performance for all of the assumed learning time windows (see Table 2). The moderately poorer efficiency in the case of the full-length window is related to the previously investigated convergence of the learning protocol. For the window of 0.75 of the control horizon, the RL control is marginally outperformed by the optimal one, by 1.13%. Compared to the heuristic and passive strategies, this RL control obtains a cost reduction of 27.6% and 55.9%, respectively. Figure 6 depicts the beam's deflection simulated for control phase I at the location of the semi-active device, comparing the RL controller with the 0.75 learning time window to the other controllers. It can be observed that the RL and optimal control result in almost identical responses over roughly the first half of the phase. For the remaining time, a gradual divergence of the trajectories generated by the RL and optimal control can be detected, with a 30.6% relative difference in their last peak amplitudes near the end of the phase. A significantly larger divergence in the deflection trajectories is found for the heuristic and passive strategies, where (compared to the RL control) the final peak amplitude increases by 84.8% and 221%, respectively. The similarity of the system's dynamic response under the RL and optimal control is also confirmed by the characteristics of the energy function, Equation (3), as demonstrated in Figure 7. At the ending time of control phase I, the RL control results in a negligible increase of 1.91% in the energy when compared to the optimal solution.

Figure 7: Comparison of the system's energy for the considered controllers, measured for control phase I.

For the heuristic and passive methods, this increase is 31.0% and 72.1%, respectively. Figure 8 compares the switching patterns of the RL control with the 0.75 learning time window and the optimal open-loop control. They do not match, although both functions are generated through the optimization of the same cost functional. This mismatch essentially stems from the state-feedback structure imposed on the RL control. To a lesser, but not negligible, extent, it is caused by the different time horizons assumed for the optimization (0.75 of the control horizon for the RL control versus the full horizon for the optimal control). Even though the RL control is unable to reproduce the optimal control pattern, the critical switches of the optimal control are replicated much more accurately than in the case of the heuristic control. In particular, the first two switching actions of the RL and optimal controls, at approximately 0.03 and 0.11, almost coincide, while these switching times are evidently advanced and retarded in the case of the heuristic control.

Figure 8: Comparison of the control signals for control phase I.
The overall stabilization performance of the proposed method can be justified by investigating the deflection (Figure 9a), energy (Figure 9b), and frequency (Figure 9c) characteristics obtained for the transient and free vibration, that is, over control phases I and II, where the RL control with the 0.75 learning time window and the heuristic strategy were employed. For the deflection and energy trajectories, the RL control resulted in reduced peak amplitudes in phase I and faster convergence to the equilibrium in phase II. The reduction of the deflection amplitudes is also confirmed by the frequency characteristics (Figure 9c), where a decrease of 19.4% in the peak value at 6 Hz can be observed, which corresponds to the first natural frequency of the beam structure. The implementation of the RL control also resulted in a decrease of the cost functional value by 15.2% and 47.8% when compared to the heuristic and passive methods, respectively.

Case B
The following simulation was carried out to investigate how changes in the frequency of the excitation force during the execution of Algorithm 1 influence the efficiency of the resulting RL control. For this purpose, the learning time window of 0.75 of the control horizon was selected and 500 iterations were performed following Steps 2-6. At each iteration, the excitation force was a harmonic signal with, as in Case A, a constant amplitude of 100 N, but with its frequency equal to the nominal value of 25 Hz plus a perturbation that was selected randomly at each iteration, its magnitude relative to the nominal frequency being bounded by a prescribed level. The learning protocol was performed for different relative frequency perturbation magnitudes, from 0.05 to 0.20. For each perturbation magnitude, the algorithm was repeated three times. Next, each of the obtained RL controls was applied to the unperturbed case, that is, to the purely harmonic excitation of Case A. Eventually, for each perturbation magnitude, the cost functional value was computed and averaged over the three repetitions. The attained controls were compared to the corresponding RL control computed in the previous section, that is, for the unperturbed excitation (see Table 3). The increase in the cost functional value remains below 0.5% for all of the considered cases, which confirms that moderate perturbations have no significant impact on the control performance. The learning protocol was also carried out for relative perturbation magnitudes above 0.2, where difficulties gradually appeared in selecting a step size that yields a stable cost descent (owing to an increased variation in the cost derivative values). As a result, there was a loss in control performance (an 8.2% increase in the cost functional value for the magnitude of 0.3).
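The Case B perturbation scheme, drawing a bounded random frequency offset at every learning iteration, can be sketched as follows (illustrative Python; the uniform distribution and all names are assumptions, as the paper does not specify how the perturbation is drawn):

```python
import math
import random

NOMINAL_FREQ_HZ = 25.0   # dominant excitation frequency, as in Case A
AMPLITUDE_N = 100.0      # constant excitation amplitude

def draw_iteration_frequency(rel_magnitude, rng):
    """Frequency used in one Case B learning iteration: the nominal
    25 Hz plus a random perturbation whose magnitude relative to the
    nominal value is bounded by `rel_magnitude`."""
    delta = rng.uniform(-rel_magnitude, rel_magnitude) * NOMINAL_FREQ_HZ
    return NOMINAL_FREQ_HZ + delta

def excitation(t, freq_hz):
    """Harmonic force applied during that iteration."""
    return AMPLITUDE_N * math.sin(2.0 * math.pi * freq_hz * t)

rng = random.Random(0)
# Mimic the 500 learning iterations at the largest tested magnitude 0.20:
freqs = [draw_iteration_frequency(0.20, rng) for _ in range(500)]
```

A fresh frequency is drawn per iteration but held fixed within it, so each learning pass sees a coherent harmonic signal.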

Case C
In order to examine the developed method for a more complex excitation force, Algorithm 1 was executed for an excitation composed of two harmonics. The first term is the dominant harmonic excitation with, as in Case A, a constant amplitude of 100 N and frequency of 25 Hz. The second term characterizes an additional harmonic disturbance with a constant frequency of 100 Hz and an amplitude that was randomly selected at each of the 500 learning iterations, bounded by a prescribed limiting value. Assuming the learning time window of 0.75 of the control horizon, the procedure was carried out for different limiting values of the disturbance amplitude, ranging from 20 to 100 N, that is, 10%-50% of the amplitude of the dominant excitation. For each limiting value, the procedure was repeated three times. The obtained RL controls were then applied in two scenarios. In the first scenario (Case C1), the excitation force was assumed to be unperturbed, that is, the purely harmonic excitation of Case A. In the second scenario (Case C2), the applied excitation force included the additional harmonic disturbance with a constant amplitude equal to half of the limiting value. For each scenario and limiting value, the cost functional value was computed and averaged over the three repetitions (see Table 4). Analyzing the cost values obtained for Case C1, where the drop in control performance for each of the perturbed cases remains below 1.9%, it can be concluded that the learning algorithm is robust to disturbances imposed on the dominant characteristic of the excitation. Furthermore, the results obtained for Case C2 confirm that the method can be successfully applied to more complex polyharmonic forces, also in the case of random perturbations of the amplitude.

Table 3: Comparison of the cost functional in the case of the RL control, where the updating algorithm is performed with different magnitudes of perturbations in the frequency of the excitation force. For each perturbation magnitude, the cost functional value was averaged over the three repetitions and normalized to the case with no perturbation.
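The Case C excitation, a dominant harmonic plus a randomly scaled high-frequency disturbance, can be sketched as follows (illustrative Python; the uniform amplitude distribution and all names are assumptions):

```python
import math
import random

DOMINANT_AMPL_N = 100.0   # dominant harmonic amplitude, as in Case A
DOMINANT_FREQ_HZ = 25.0   # dominant harmonic frequency
DISTURB_FREQ_HZ = 100.0   # frequency of the additional disturbance

def case_c_excitation(t, disturb_ampl_n):
    """Polyharmonic excitation of Case C for one learning iteration:
    the dominant 25 Hz harmonic plus a 100 Hz disturbance whose
    amplitude is fixed within the iteration."""
    return (DOMINANT_AMPL_N * math.sin(2.0 * math.pi * DOMINANT_FREQ_HZ * t)
            + disturb_ampl_n * math.sin(2.0 * math.pi * DISTURB_FREQ_HZ * t))

rng = random.Random(1)
limit_n = 50.0  # limiting value of the disturbance amplitude (25% of 100 N)
# A disturbance amplitude is drawn anew at each of the 500 learning
# iterations, bounded by the limiting value:
amplitudes = [rng.uniform(0.0, limit_n) for _ in range(500)]
```

Case C2 then corresponds to evaluating `case_c_excitation` with the amplitude fixed at `0.5 * limit_n`.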

The computational effort
The final set of simulations was performed to investigate the capabilities of the designed algorithm with respect to systems with a larger number of state variables. The aim was to analyze the computational time required for updating the policy parameter (Steps 3-6 in Algorithm 1) when assuming different sizes for the dynamic Equation (51). In Cases A-C, the investigated structure was represented by 10 finite elements, which resulted in a state vector of 44 components. The learning procedure was repeated assuming 20, 40, 60, 80, and 100 elements in the finite element model, which resulted in state vectors in Equation (51) of sizes 84, 164, 244, 324, and 404, respectively. The greatest computational cost was associated with the integration of the adjoint state equation in Step 3 (performed using the Runge-Kutta fourth-order scheme), which for the assumed time step of 0.0001 s and the 0.75 learning time window required 1500 time samples. The obtained computational times are summarized in Table 5. A single iteration in the case of 60 finite elements (244 components in the state vector) remained below 1 s.

Table 4: Comparison of the cost functional in the case of the RL control with the 0.75 learning time window, where the updating algorithm is performed with an additional harmonic disturbance of constant frequency and a randomly selected amplitude limited by different bounds. In Case C1, each value is averaged over the three repetitions and normalized to the case with no disturbance. In Case C2, each value is averaged over the three repetitions and normalized to the case where the learning protocol was carried out with the amplitude of the additional disturbance held constant at half of the limiting value.
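The reported model sizes follow an affine pattern of four state components per finite element plus four. The relation below is inferred from the reported pairs (10 elements give 44 components, up to 100 elements giving 404); it is not stated explicitly in the paper:

```python
def state_vector_size(n_elements):
    """State dimension of the finite element model in Equation (51),
    inferred from the reported pairs (10 -> 44, ..., 100 -> 404):
    four state components per element plus four."""
    return 4 * n_elements + 4

# Reproduce the element counts used in the scaling study:
sizes = {n: state_vector_size(n) for n in (10, 20, 40, 60, 80, 100)}
```

Such a relation is consistent with a beam element contributing translational and rotational degrees of freedom (and their velocities) per node, with shared nodes between neighboring elements.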
Furthermore, the use of the steepest-descent approach to update the policy parameter guaranteed that an increase in the size of the system did not significantly influence the convergence rate of the overall learning process (in the case of 60 finite elements, the algorithm required 612 iterations to reach the same cost functional value as in Case A). It can be concluded that the method can be effectively used for multidimensional systems.

CONCLUSIONS
An RL-based semi-active control method for suppressing structural vibration induced by unknown harmonic excitation has been proposed. This method relies on a state-feedback switching control law that includes a parameter matrix updated by means of the developed actor-only iterative learning algorithm. In view of its stabilization performance, this method can be regarded as suboptimal. Its efficiency has been validated via numerical experiments for a span structure equipped with an actuator of controlled stiffness and damping parameters. In terms of the assumed energy-related cost functional, the method resulted in a marginal degradation (of 1.13%) when compared to the optimal open-loop control, while it significantly outperformed a heuristic control (by 27.6%) that employed an analogous control law and an identical amount of state information. The relatively low computational burden of the proposed iterative learning algorithm allows this method to be applied to multidimensional systems (a single iteration required less than 0.08 s for the system represented by a state vector of 44 components). The method has been designed and validated for repetitive transient vibration. Nevertheless, with an appropriately selected moving learning time window, the proposed algorithm can also be adapted to steady-state vibration. Ongoing work includes the development of an adaptive scheme for selecting the moving learning time window that guarantees the best convergence of the updating sequence, and the design of a test stand platform that simulates a real environment for experimental validation.