AN EVALUATIVE ANALYSIS OF PARTICLE SWARM OPTIMIZATION FOR REINFORCEMENT LEARNING IN PENDULUM TASK

ABSTRACT


I. INTRODUCTION
In supervised learning, neural networks can be optimized using gradient-based methods with labeled training data. This involves computing the difference between the network's outputs and their respective target values, and then adjusting connection weights and unit biases through backpropagation of errors. However, reinforcement learning tasks require the use of gradient-free training algorithms since labeled training data are not available. Applying swarm intelligence algorithms [1,2] to reinforcement learning of neural networks is practical because they do not rely on gradients. On the other hand, Q-learning [3,4] is a popular reinforcement learning method that selects subsequent actions based on the reward r(t) for action a(t) in state s(t) at each time step t. Unlike Q-learning, swarm algorithms do not require the calculation of r(t) at every step; instead, they evaluate the reward after the completion of an episode. This feature relieves the practitioner of the burden of designing appropriate rewards for every combination of states and actions.
Particle swarm optimization (PSO) [5,6], ant colony optimization (ACO) [7,8], and artificial bee colony (ABC) [9,10] are representative swarm algorithms. However, to use these algorithms effectively for training neural networks, it is essential to select appropriate variations and to choose their hyperparameters carefully, as both have a significant impact on performance. In this paper, the author experimentally evaluates the effectiveness of PSO in the reinforcement learning of multilayer perceptrons, using a pendulum control task.

II. PENDULUM TASK
As a reinforcement learning task, this study utilizes the pendulum task available in the OpenAI Gym (footnotes 1, 2). The goal of the task is to maintain the pendulum in an upright position by applying torque. The system is depicted in Fig. 1, which shows a screenshot of the task, with the round arrow indicating the direction and magnitude of the torque applied by the controller. The study aims to provide insights into the performance of PSO on this task.
The author modified the system such that the task starts with the pendulum hanging in the position opposite to the goal, as shown in Fig. 2(a). The objective is to manipulate the pendulum to reach and maintain the state depicted in Fig. 2(b). Additionally, the author adjusted the system so that the control task begins with zero angular velocity. An episode in the simulation consists of 200 time steps; at each step the controller observes the current state and determines the corresponding action. The state is characterized by three values: the cosine and sine of the angle θ, each within the range of -1.0 to 1.0, and the angular velocity, within the range of -8.0 to 8.0. The action taken by the controller is the torque applied to the pendulum, within the range of -2.0 to 2.0. A constant torque of 2.0 (or -2.0) is not sufficient to bring the pendulum from its initial position to the goal position: the controller must actively swing the pendulum back and forth, leveraging gravity to build up enough angular velocity to carry it to the top.
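The claim that a constant torque cannot lift the pendulum can be checked with a small stand-alone simulation. The sketch below is not the Gym implementation; it is a simplified stand-in whose constants (gravity 10.0, unit mass and length, time step 0.05) and update rule are assumptions modeled on the commonly used pendulum environment. The function names are hypothetical.

```python
import math

# ASSUMED physical constants (not stated in the text).
G, MASS, LENGTH, DT = 10.0, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE = 8.0, 2.0  # ranges given in the text


def wrap_angle(theta):
    """Map an angle to [-pi, pi); 0 means upright."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def step(theta, omega, torque):
    """Advance the pendulum by one time step (velocity first, then angle)."""
    torque = max(-MAX_TORQUE, min(MAX_TORQUE, torque))
    omega += (3.0 * G / (2.0 * LENGTH) * math.sin(theta)
              + 3.0 / (MASS * LENGTH ** 2) * torque) * DT
    omega = max(-MAX_SPEED, min(MAX_SPEED, omega))
    theta += omega * DT
    return theta, omega


def observe(theta, omega):
    """The three state values seen by the controller."""
    return (math.cos(theta), math.sin(theta), omega)


def closest_approach(torque, steps=200):
    """Smallest |angle from upright| reached under a constant torque."""
    theta, omega = math.pi, 0.0  # hanging straight down, at rest
    best = abs(wrap_angle(theta))
    for _ in range(steps):
        theta, omega = step(theta, omega, torque)
        best = min(best, abs(wrap_angle(theta)))
    return best


if __name__ == "__main__":
    for u in (2.0, -2.0):
        print(u, closest_approach(u))
```

Under these assumed constants the pendulum never comes close to the upright position when the torque is held at either limit, which is why a swing-up strategy is needed.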
In this study, the author defines the fitness of a controller in terms of the angular error of the pendulum. Let θ(t) denote the angle at time step t, measured from the upright position, and let Error(t) = |θ(t)|/π. Initially, the error is Error(0) = |±π|/π = 1, indicating that the pendulum is in the position opposite to the desired goal state. As the pendulum moves towards the goal state, the error decreases. At the goal state, the error is 0/π = 0, indicating that the pendulum is upright.
The fitness score rewards the controller more for achieving the desired goal state more quickly and maintaining it longer, i.e., a higher fitness score is obtained when the error is lower for more time steps.
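The following sketch illustrates one way such a score could be computed. The error term follows the |θ|/π form used in the text; the aggregation into a single fitness value (one minus the mean error over the episode) is an assumption made here for illustration, chosen only because it gives higher scores when the error is low for more time steps. The function names are hypothetical.

```python
import math


def angle_error(theta):
    """Error(t) = |theta(t)| / pi, with theta wrapped to [-pi, pi]."""
    wrapped = math.atan2(math.sin(theta), math.cos(theta))
    return abs(wrapped) / math.pi


def fitness(thetas):
    """ASSUMED aggregation: 1 - mean error over the episode (higher is better)."""
    errors = [angle_error(th) for th in thetas]
    return 1.0 - sum(errors) / len(errors)


if __name__ == "__main__":
    never_moves = [math.pi] * 200           # stays hanging down
    late_success = [math.pi] * 150 + [0.0] * 50
    early_success = [math.pi] * 50 + [0.0] * 150
    print(fitness(never_moves), fitness(late_success), fitness(early_success))
```

With this form, a controller that uprights the pendulum earlier and holds it there scores higher than one that succeeds only late in the episode.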

III. MULTILAYER PERCEPTRON
This study adopts a multilayer perceptron (MLP) [11] as the pendulum controller. The MLP is a three-layered feedforward neural network. The topology is illustrated in Fig. 3, while the feedforward computations are shown in (3)-(7).

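Input layer:
$$\mathrm{out}_i^{(1)} = u_i, \quad i = 1, 2, \ldots, N \tag{3}$$
Hidden layer:
$$\mathrm{in}_j^{(2)} = w_{0j}^{(1)} + \sum_{i=1}^{N} w_{ij}^{(1)} \, \mathrm{out}_i^{(1)}, \quad j = 1, 2, \ldots, M \tag{4}$$
$$\mathrm{out}_j^{(2)} = h\left(\mathrm{in}_j^{(2)}\right), \quad j = 1, 2, \ldots, M \tag{5}$$
Output layer:
$$\mathrm{in}_k^{(3)} = w_{0k}^{(2)} + \sum_{j=1}^{M} w_{jk}^{(2)} \, \mathrm{out}_j^{(2)}, \quad k = 1, 2, \ldots, L \tag{6}$$
$$\mathrm{out}_k^{(3)} = h\left(\mathrm{in}_k^{(3)}\right), \quad k = 1, 2, \ldots, L \tag{7}$$
Here $u_i$ is the i-th input value, $w_{ij}^{(1)}$ and $w_{jk}^{(2)}$ are connection weights, and $w_{0j}^{(1)}$ and $w_{0k}^{(2)}$ are unit biases. N, M, and L are the numbers of input, hidden, and output units, respectively.
The activation function h() is the hyperbolic tangent function, whose shape is illustrated in Fig. 4. This activation function is widely used in neural networks because it produces a smooth nonlinear output that ranges from -1.0 to 1.0. The MLP plays the role of a policy function in which the action at time t is a function of the observation at time t, i.e., action(t) = F(observation(t)). The input layer of the MLP comprises three units that receive the values of cos(θ), sin(θ), and the angular velocity. To ensure that the input values are within the range of [-1.0, 1.0], the angular velocity is normalized by dividing it by 8.0. The output layer of the MLP consists of one unit, which outputs the torque applied to the pendulum. To ensure that the torque falls within the range of [-2.0, 2.0], the output value is scaled by multiplying it by 2.0.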
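The policy described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's code: the function names and the nested-list layout of the weights are assumptions, while the tanh activation, the division of the angular velocity by 8.0, and the multiplication of the output by 2.0 follow the text.

```python
import math


def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Three-layer feedforward pass with tanh in the hidden and output layers.

    w_hidden[j][i] connects input i to hidden unit j;
    w_out[k][j] connects hidden unit j to output unit k (ASSUMED layout).
    """
    hidden = [math.tanh(b + sum(w * x for w, x in zip(ws, inputs)))
              for ws, b in zip(w_hidden, b_hidden)]
    return [math.tanh(b + sum(w * h for w, h in zip(ws, hidden)))
            for ws, b in zip(w_out, b_out)]


def policy(observation, params):
    """Map (cos, sin, angular velocity) to a torque in [-2.0, 2.0]."""
    cos_t, sin_t, omega = observation
    inputs = [cos_t, sin_t, omega / 8.0]  # normalize velocity to [-1, 1]
    (out,) = mlp_forward(inputs, *params)  # single output unit
    return 2.0 * out                       # scale tanh output to torque range


if __name__ == "__main__":
    # An untrained network with all parameters set to zero outputs zero torque.
    zero = ([[0.0] * 3 for _ in range(8)], [0.0] * 8, [[0.0] * 8], [0.0])
    print(policy((-1.0, 0.0, 0.0), zero))
```

Because the output unit also uses tanh, the scaled torque cannot leave the permitted range, so no additional clipping is needed in this sketch.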

IV. TRAINING OF MLPS USING PSO
The MLP illustrated in Fig. 3 comprises M + L units and NM + ML connections, giving a total of D = M + L + NM + ML parameters. To train the MLP, the author formulates the problem as the optimization of a D-dimensional real-valued vector, $\mathbf{x} = (x_1, x_2, \ldots, x_D)$, where each $x_d$ corresponds to one of the D parameters in the MLP. The feedforward computation, as described in (3)-(7), involves applying the values of $\mathbf{x}$ to their corresponding connection weights or unit biases.
In this study, PSO is applied to optimize the D-dimensional vector $\mathbf{x}$. PSO is one of the swarm intelligence algorithms, which are population-based stochastic search algorithms. PSO uses $\mathbf{x}$ as a particle position in the D-dimensional search space.
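As a quick check of the parameter count, the sketch below evaluates D = M + L + NM + ML for the pendulum controller (N = 3 inputs, L = 1 output) and shows one possible way to split a flat vector into weights and biases. The ordering of parameters inside the vector is an assumption; the text does not specify it, and any fixed ordering works as long as it is used consistently. The function names are hypothetical.

```python
def num_parameters(n_in, n_hidden, n_out):
    """D = M + L + N*M + M*L (biases plus connection weights)."""
    return n_hidden + n_out + n_in * n_hidden + n_hidden * n_out


def decode(x, n_in, n_hidden, n_out):
    """Split a flat vector into (w_hidden, b_hidden, w_out, b_out).

    ASSUMED ordering: hidden weights, hidden biases, output weights, output biases.
    """
    if len(x) != num_parameters(n_in, n_hidden, n_out):
        raise ValueError("vector length does not match the network size")
    it = iter(x)
    w_hidden = [[next(it) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [next(it) for _ in range(n_hidden)]
    w_out = [[next(it) for _ in range(n_hidden)] for _ in range(n_out)]
    b_out = [next(it) for _ in range(n_out)]
    return w_hidden, b_hidden, w_out, b_out


if __name__ == "__main__":
    for m in (8, 16, 32):
        print(m, num_parameters(3, m, 1))
```

For hidden-layer sizes of 8, 16, and 32, the formula gives search spaces of 41, 81, and 161 dimensions, respectively.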
In Step 1, vectors $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^S$ are initialized randomly, where S represents the swarm size (the number of particles in the swarm). $\mathbf{x}^s$ denotes the position vector of the s-th particle in the D-dimensional search space, i.e., $\mathbf{x}^s = (x_1^s, x_2^s, \ldots, x_D^s)$, $s = 1, 2, \ldots, S$. The swarm size is predetermined. In Step 2, the fitness of each particle is evaluated using (1). In Step 3, the loop of the swarm process is terminated when a specific termination condition is satisfied. In this study, the loop is terminated when the loop counter reaches a predetermined value. In Step 4, the personal best (Pbest) of each particle and the global best (Gbest) in the swarm are updated according to their fitness scores. The Pbest of a particle is the position vector that has achieved the highest fitness score for that particle up to the current iteration. The Gbest is the position vector with the highest fitness score among all the Pbests in the swarm. Let us denote each Pbest as $\mathbf{p}^s$ and the Gbest as $\mathbf{g}$. In Step 5, the velocity of each particle is updated. Let $\mathbf{v}^s = (v_1^s, v_2^s, \ldots, v_D^s)$ represent the velocity of the s-th particle. The velocity $\mathbf{v}^s$ is updated by (8).
$$\mathbf{v}^s \leftarrow w \, \mathbf{v}^s + c_1 r_1 (\mathbf{p}^s - \mathbf{x}^s) + c_2 r_2 (\mathbf{g} - \mathbf{x}^s) \tag{8}$$
Here, $w$ denotes the inertia weight, while $c_1$ and $c_2$ are coefficients. Additionally, $r_1$ and $r_2$ are uniformly distributed random values within the interval [0,1]. In Step 6, each particle moves in the search space according to its velocity. The position vector $\mathbf{x}^s$ is updated by (9).
$$\mathbf{x}^s \leftarrow \mathbf{x}^s + \mathbf{v}^s \tag{9}$$
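The six steps can be sketched as a short, self-contained routine. This is an illustrative sketch of the standard inertia-weight form of PSO, not the author's implementation. Several details are assumptions: the names `w`, `c1`, and `c2` for the inertia weight and coefficients, the example values 0.7 and 1.5 (the paper's Table 1 values are not reproduced here), drawing fresh random numbers per dimension, and clamping positions to the search range. The toy objective in the usage example is a stand-in for the pendulum fitness.

```python
import random


def pso(fitness, dim, swarm_size, iterations, w, c1, c2, lo, hi, seed=0):
    """Maximize `fitness` over [lo, hi]^dim; returns (gbest, gbest_fitness)."""
    rng = random.Random(seed)
    # Step 1: random initial positions, zero initial velocities.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(swarm_size)]
    vs = [[0.0] * dim for _ in range(swarm_size)]
    pbest = [x[:] for x in xs]
    pbest_fit = [float("-inf")] * swarm_size
    gbest, gbest_fit = xs[0][:], float("-inf")
    # Step 3: terminate after a fixed number of loop iterations.
    for _ in range(iterations):
        # Step 2: evaluate every particle.
        fits = [fitness(x) for x in xs]
        # Step 4: update personal bests and the global best.
        for s in range(swarm_size):
            if fits[s] > pbest_fit[s]:
                pbest[s], pbest_fit[s] = xs[s][:], fits[s]
            if fits[s] > gbest_fit:
                gbest, gbest_fit = xs[s][:], fits[s]
        for s in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Step 5: velocity update.
                vs[s][d] = (w * vs[s][d]
                            + c1 * r1 * (pbest[s][d] - xs[s][d])
                            + c2 * r2 * (gbest[d] - xs[s][d]))
                # Step 6: move, then clamp to the search space (ASSUMED).
                xs[s][d] = min(hi, max(lo, xs[s][d] + vs[s][d]))
    return gbest, gbest_fit


if __name__ == "__main__":
    # Toy objective with its maximum (0.0) at the origin.
    def neg_sphere(x):
        return -sum(v * v for v in x)

    best, score = pso(neg_sphere, dim=5, swarm_size=20, iterations=100,
                      w=0.7, c1=1.5, c2=1.5, lo=-10.0, hi=10.0)
    print(score)
```

To train the controller, `fitness` would decode a position vector into MLP parameters, run one 200-step episode, and return the episode's fitness score.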

V. EXPERIMENT
The MLP's capability to model nonlinear functions is influenced by the number of hidden units. A smaller MLP is easier to optimize with swarm algorithms because it has fewer variables (a shorter position vector $\mathbf{x}$). However, this reduction in hidden units may impede the MLP's capability to control the pendulum effectively. Conversely, a larger MLP is more capable of controlling the pendulum successfully, but optimizing it becomes more challenging due to the longer position vector $\mathbf{x}$. Moreover, implementing a larger MLP requires additional computational resources. Therefore, striking a balance between these trade-offs is essential for determining the optimal number of hidden units for the given task. This study explores three different configurations of hidden units: 8, 16, and 32.
PSO hyperparameter values were determined through empirical analyses, as shown in Table 1. The number of iterations was set to 500 or 100, corresponding to swarm sizes of 10 and 50, respectively. Consequently, the total number of fitness evaluations remained constant at 5,000 (the product of the number of iterations and the swarm size). It is important to choose an appropriate search space because the values in $\mathbf{x}^s$ are used as connection weights or unit biases in the neural network. The range should be neither excessively large nor excessively small. In this experiment, the search space is $[-10.0, 10.0]^D$. The position vectors $\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^S$ are randomly initialized within this space, and the velocities $\mathbf{v}^1, \mathbf{v}^2, \ldots, \mathbf{v}^S$ are initially zero vectors.
Comparing the scores in Table 2 between configurations (a) and (b), the values obtained using configuration (b) are higher than those obtained using configuration (a). This result indicates that configuration (b) is better than configuration (a). A Wilcoxon signed rank test revealed that this difference is statistically significant (p=1.52e-5). Therefore, in this study, increasing the swarm size rather than the number of iterations allowed PSO to discover better solutions. In PSO, increasing the swarm size promotes global exploration in the early stages, while increasing the number of iterations enhances local exploitation in the later stages. The results of this experiment suggest that, in this learning task, early-stage global exploration is more important.
Next, comparing the fitness scores obtained using configuration (a) among the three variations of M (the number of hidden units), it is observed that even for the smallest size, M=8, the scores are not inferior to those of M=16 or M=32. In fact, the average and worst values across the 11 trials indicate that M=8 is the most desirable. Increasing the number of hidden units would typically enhance the MLP's nonlinear modeling capability and improve the performance of pendulum control. However, increasing the number of hidden units to 16 or 32 does not improve the control performance and instead tends to lower the scores obtained through PSO, although a Wilcoxon rank sum test revealed that the difference between M=8 and M=16 (or 32) is not statistically significant (p=0.55 and p=0.42, respectively). As the number of hidden units increases, the dimensionality of the search space also increases, making global exploration more difficult. Therefore, in the learning task of this study, the swarm size of 50 particles appears to be sufficient for M=8 but insufficient for M=16 and M=32, indicating that the global exploration was not adequate in those cases.
Fig. 6 presents the learning curves of the best, median, and worst runs among the 11 trials, where M=8 and the configuration is (b). These learning curves show that the fitness score progresses more slowly within the ranges of [0.4, 0.5] and [0.6, 0.7]. Consequently, attaining a fitness score of 0.4 is relatively straightforward for PSO in training MLPs, while challenges arise in achieving the higher scores needed for improved pendulum control. Remarkably, even in the worst of the 11 trials, PSO successfully trained the MLP to reach a score of 0.807 (as shown in Table 2), demonstrating the robustness of PSO in discovering desirable solutions. Fig. 7(i) illustrates the actions and errors of the MLP over the 200 steps prior to training, while Fig. 7(ii) displays the corresponding actions and errors after training. In this scenario, the MLP employed 8 hidden units, and configuration (b) was used. Fig. 7(i) reveals that the MLP prior to training outputs torque values that vary widely between 2.0 and -2.0 during the initial and middle stages of the 200 steps, indicating an attempt to lift the pendulum. However, from the middle stage to the end, the torque value remains approximately constant at 2.0, leading to pendulum rotation and substantial fluctuations in error. In contrast, Fig. 7(ii) reveals that the MLP after training switches the polarity of the torque appropriately within the first 50 steps, lifting the pendulum upward and rapidly reducing the error to nearly zero. Furthermore, it maintains the pendulum in the upward position with zero error by setting the torque value to 0 for the remaining steps, exerting no unnecessary force on the pendulum. Supplementary videos are provided which demonstrate the pendulum controlled by the MLPs (footnotes 3, 4).

VI. CONCLUSIONS
In this study, the neural network controller for the pendulum task was trained using Particle Swarm Optimization. The results demonstrated the successful training of an MLP with 8 hidden units, enabling rapid uprighting of the pendulum. Notably, it was found that a larger swarm size yielded greater effectiveness compared to increasing the number of iterations. In future work,

Fig. 5 shows the process of training neural networks by PSO.
Figure 5: The process of particle swarm optimization.

Table 1: PSO hyperparameters.
An MLP with 8, 16, or 32 hidden units was trained 11 times independently. Table 2 displays the best, worst, average, and median fitness scores achieved by the trained MLPs among the 11 trials. Each of the two hyperparameter configurations (a) and (b) was applied.