QL-ADIFA: Hybrid optimization using Q-learning and an adaptive logarithmic spiral-levy firefly algorithm

Abstract: Optimization problems are ubiquitous in engineering and scientific research, with a large number of such problems requiring resolution. Meta-heuristics offer a promising approach to solving optimization problems. The firefly algorithm (FA) is a swarm intelligence meta-heuristic that emulates the flashing patterns and behaviour of fireflies. Although FA has been significantly enhanced to improve its performance, it still exhibits certain deficiencies. To overcome these limitations, this study presents the Q-learning based adaptive logarithmic spiral-Levy flight firefly algorithm (QL-ADIFA). The Q-learning technique empowers the improved firefly algorithm to leverage each firefly's awareness and memory of the environment while in flight, allowing further refinement of the enhanced algorithm. Numerical experiments demonstrate that QL-ADIFA outperforms existing methods on 15 benchmark optimization functions and twelve engineering problems: cantilever beam design, pressure vessel design, three-bar truss design, and 9 constrained optimization problems from CEC2020.


Introduction
Optimization problems are ubiquitous in engineering and scientific research, involving finding the optimal parameter values that meet specific objective functions while satisfying constraints [1][2][3]. Traditional gradient-based optimization algorithms have limitations and may not be effective in practical scenarios. In recent years, numerous meta-heuristic algorithms have been developed, showing better performance in terms of objective function values while reducing computational cost. Among them, nature-inspired meta-heuristic algorithms have gained attention and have been applied to various optimization problems [4,5].
In the field of meta-heuristic algorithms, there are four main categories: swarm intelligence optimization algorithms, evolutionary algorithms, physics-based algorithms, and human-based algorithms [6,7]. The firefly algorithm (FA) [8], an example of a swarm intelligence optimization algorithm, is inspired by the behaviour and flashing patterns of fireflies. It has been found to be more effective than other algorithms, such as particle swarm optimization, in solving optimization problems. Consequently, various versions of FA have been applied to a wide range of fields, including architectural material design [9], travel itinerary design [10], tumour classification [11], and image segmentation [12]. To further enhance the performance of FA, scholars have made improvements in various respects.
Improvements to FA fall mainly into two categories: modifying the algorithm itself and hybridizing it with other algorithms. Modified FAs chiefly target two important factors: light variation and attraction [13]. For example, [14] introduced Levy flights into FA to create the Levy-flight firefly algorithm (LF-FA), which increased its global search capability. A cooperative hybrid firefly algorithm was proposed by [1] with multiple firefly populations, where each population maintains diversity through hybridization and communication with the others to prevent the algorithm from falling into local optima. Many hybrid FAs arise when firefly algorithms incorporate machine learning, heuristics, hybridization, and other techniques. A novel adaptive hybrid evolutionary firefly algorithm (AHEFA) [15] mixes FA with the differential evolution (DE) algorithm and selects mutation operators according to the results of previous iterations, balancing exploration and exploitation; elitist techniques are adopted in the selection phase to carry the viable solutions of the target individuals to the next generation. Another example is the adaptive logarithmic spiral-Levy FA (AD-IFA) proposed by [16], which combines LF-FA with logarithmic spiral paths and an adaptive approach to balance exploration and exploitation. [17] provided a novel chaotic sine-cosine firefly (CSCF) algorithm with numerous variants, integrating the chaotic form of the sine cosine algorithm (SCA) with FA; CSCF chooses the best variant from various chaotic forms, improving convergence speed and efficiency. To address FA's weaknesses in exploration and its tendency toward premature convergence, [18] introduced an opposition-based method into FA and combined it with a symbiotic organisms search (SOS) algorithm, called IOFASOS. The impact of the SOS algorithm on solutions is large in the early stages of IOFASOS and diminishes as iterations progress.
While previous improvements to the firefly algorithm have been effective in improving the reliability of firefly positions, they do not fully utilize the information generated during previous path changes. As a result, they do not make full use of the firefly's knowledge and memory of the environment when flying. Using this information as the basis for position changes could lead to faster discovery of the global optimal solution.
Q-learning involves the agent selecting actions based on the current state of the environment, receiving rewards or penalties based on the outcomes of its actions, accumulating knowledge, and making future predictions to maximize cumulative returns [19]. Many studies have combined Q-learning with other algorithms and achieved good results. [20] utilizes a Q-learning model for adaptive parameter control in the differential evolution (DE) algorithm, where Q-learning uses information from its memory to select the best combination of parameters at the beginning of each iteration. Q-learning can also be used in a local reinforcement stage [21], dedicated to selecting the optimal state-action pair based on accumulated knowledge and completing the transition from one heuristic algorithm to another. Q-learning has also been applied to the marine predators algorithm to help leverage historical iteration information and balance exploration and exploitation [19]. To address the issue of not fully utilizing the information generated during previous path changes, this paper proposes the integration of Q-learning into AD-IFA, resulting in a new algorithm named the Q-learning based adaptive logarithmic spiral-Levy FA (QL-ADIFA). By incorporating Q-learning into AD-IFA, fireflies can choose the optimal strategy from two operations, thereby achieving a better balance between exploration and exploitation. The contributions of this paper are two-fold: 1) proposing QL-ADIFA to solve global optimization problems with faster convergence and superior solutions compared to earlier improvements of the firefly algorithm; and 2) testing QL-ADIFA on 15 benchmark functions and twelve engineering problems, demonstrating its improved convergence performance over the improved firefly algorithms.
The structure of this paper is organized as follows: Section 2 presents an overview of the improved firefly algorithm and the Q-learning algorithm. Section 3 provides a detailed description of the proposed Q-learning based adaptive logarithmic spiral-Levy flight firefly algorithm (QL-ADIFA). In Section 4, the performance of QL-ADIFA is evaluated using 15 benchmark functions. In Section 5, the effectiveness of the proposed algorithm is demonstrated through its application to twelve engineering problems. Finally, Section 6 provides a summary of the contributions and conclusions of this paper.

Q-learning
Q-learning is an off-policy temporal-difference method proposed by Watkins, which estimates the value of the Q-function for each state-action pair to determine the optimal action strategy [22]. A state represents the agent's current situation, and an action represents a transition from one state of the agent to another. Since the state and action spaces of the problem addressed in this paper are finite and discrete, the value function can be recorded using a matrix. The Reward table stores the reward or penalty for each state-action pair, while the Q-table records the corresponding Q-value for each pair. At each decision point, the action is selected by comparing the Q-values of the available actions in the current state, and the Q-table is iteratively updated using the Bellman equation to reduce the temporal difference between the Q-values of adjacent states [23]. The update equation is

Q(s_Iter, a_Iter) ← Q(s_Iter, a_Iter) + λ [ r_{Iter+1} + θ max_a Q(s_{Iter+1}, a) − Q(s_Iter, a_Iter) ], (1)

where s_Iter and a_Iter represent the state and action in this iteration respectively, λ is the learning rate, θ is the attenuation factor, and r_{Iter+1} is the immediate return. Set the maximum iteration number Max_Iter = 100, let s_1 and s_2 represent different states, and let a_1 and a_2 represent different actions. The pseudo-code for the Q-learning algorithm is provided in Algorithm 1.
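As an illustration, the tabular update above can be sketched in a few lines of Python. This is a minimal sketch with our own function and variable names (the paper's implementation is in Matlab and not reproduced here); `lam` stands for the learning rate λ and `theta` for the attenuation factor θ.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, lam=0.1, theta=0.9):
    """One tabular Q-learning update, as in Eq (1)."""
    td_target = r + theta * np.max(Q[s_next])   # immediate return plus discounted best future value
    Q[s, a] += lam * (td_target - Q[s, a])      # move Q(s, a) toward the TD target
    return Q

# Two states and two actions, as in the QL-LSLFA setting described later.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

With an all-zero table, a reward of +1 raises Q(s=0, a=1) to lam × 1 = 0.1, illustrating how the table accumulates evidence about good state-action pairs.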
Algorithm 1: Q-learning algorithm pseudo-code.
The Q-learning algorithm diagram is shown in Figure 1.

The firefly algorithm
The firefly algorithm is a meta-heuristic algorithm proposed in [24] based on the characteristics of firefly flashes, and it is effective for nonlinear and multi-modal optimization problems. The simplified firefly algorithm follows three rules [8]:
1) All fireflies are of the same gender, and each firefly can only be attracted to brighter ones.
2) The attraction of fireflies increases with brightness, while brightness decreases with distance. Therefore, fireflies always move toward brighter ones, and the brightest firefly moves randomly.
3) The brightness of a firefly is equivalent to the value of the objective function.
The Euclidean distance between the ith and jth fireflies in d-dimensional space can be expressed as

r_ij = ||x_i − x_j|| = sqrt( Σ_{p=1}^{d} (x_i,p − x_j,p)^2 ), (2)

where x_i,p is the pth component of the spatial coordinate of the ith firefly. Since light is absorbed by the medium during propagation, the brightness of fireflies decreases as the distance increases, so the attraction of fireflies can be expressed as

β(r) = β_0 e^{−γ r^2}, (3)

where β_0 is the original attraction (at r = 0), γ is the light absorption coefficient, and r is the distance between fireflies.
The updated position of the ith firefly after it is attracted to a brighter jth firefly can be expressed as

x_{i,t+1} = x_{i,t} + β_0 e^{−γ r_ij^2} (x_{j,t} − x_{i,t}) + α (rand − 1/2), (4)

where x_{i,t} represents the position of the ith firefly at time t, rand is a d-dimensional uniform random vector in [0, 1]^d, and α is a parameter in [0, 1].
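The standard FA update above can be sketched as follows. This is an illustrative implementation with our own function names, combining the distance, attraction, and position-update formulas; the default parameter values match those used in the experiments later in the paper (β_0 = 1, γ = 1, α = 0.2).

```python
import numpy as np

def firefly_move(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=None):
    """Standard FA position update: move firefly i toward brighter firefly j."""
    rng = np.random.default_rng() if rng is None else rng
    r2 = np.sum((x_i - x_j) ** 2)          # squared Euclidean distance, Eq (2)
    beta = beta0 * np.exp(-gamma * r2)     # distance-dependent attraction, Eq (3)
    rand = rng.random(x_i.shape)           # uniform random vector in [0, 1]^d
    return x_i + beta * (x_j - x_i) + alpha * (rand - 0.5)
```

Note that as γ → 0 the attraction approaches β_0, so with α = 0 the firefly lands exactly on the brighter firefly; larger γ or larger distances weaken the pull.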

The Levy-flight firefly algorithm
The firefly algorithm easily falls into local minima when dealing with global continuous optimization problems. To solve this problem, [25] proposed the Levy-flight firefly algorithm (LF-FA), inspired by the sudden, large turns that insects make during otherwise straight flight.
The Levy-flight firefly algorithm replaces the uniform distribution with a Levy distribution, and updates the position of the ith firefly after it is attracted by the brighter jth firefly as

x_{i,t+1} = x_{i,t} + β_0 e^{−γ r_ij^2} (x_{j,t} − x_{i,t}) + α sign(rand − 1/2) ⊗ Levy, (5)

where Levy is a d-dimensional vector of Levy-distributed step lengths, ⊗ is the Hadamard product, and sign(·) is the sign function.
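A sketch of this update is given below. The paper does not state how its Levy steps are generated; Mantegna's algorithm with stability exponent 1.5 is the usual choice in LF-FA variants, so that choice (and all function names here) is our assumption.

```python
import math
import numpy as np

def levy_steps(d, beta=1.5, rng=None):
    """Draw d Levy-distributed step lengths via Mantegna's algorithm
    (an assumption: the paper does not specify the sampling method)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, d)
    v = rng.normal(0.0, 1.0, d)
    return u / np.abs(v) ** (1 / beta)

def levy_firefly_move(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=None):
    """LF-FA position update: attraction term plus a signed Levy step, Eq (5)."""
    rng = np.random.default_rng() if rng is None else rng
    r2 = np.sum((x_i - x_j) ** 2)
    attraction = beta0 * np.exp(-gamma * r2)
    step = np.sign(rng.random(x_i.shape) - 0.5) * levy_steps(x_i.size, rng=rng)
    return x_i + attraction * (x_j - x_i) + alpha * step
```

The heavy-tailed Levy steps occasionally produce very long jumps, which is exactly what gives LF-FA its stronger global exploration.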

The logarithmic spiral path
Although the Levy-flight firefly algorithm significantly enhances global exploration, it neglects the algorithm's local exploitation ability and the balance between exploration and exploitation.
[26] proposed a logarithmic spiral path (LS) that improves the local exploitation ability of an algorithm by mimicking the flight path a peregrine falcon takes when hunting for food. Accordingly, [16] introduced the logarithmic spiral path as a movement direction in the improved firefly algorithm, yielding a new firefly position update mode (Eq (6)), where I is a d-dimensional uniform random vector in [−1, 1]^d and a constant defines the shape of the logarithmic spiral.

An adaptive logarithmic spiral-Levy FA (AD-IFA)
The adaptive logarithmic spiral-Levy firefly algorithm (AD-IFA) proposed in [16] resolves the imbalance between exploration (Levy flight) and exploitation (the logarithmic spiral). An adaptive switch (ratio) method is proposed in AD-IFA, and the new position update formula (Eq (7)) selects between the two paths, where u is a uniform random number in [0, 1] and R_t is calculated in the previous iteration. The value of R_{t+1} lies in [0.5, 1], with an initial value of 0.5; its specific expression is given in Eq (8), where θ = 10. In Eq (8), f*_t is the best fitness function value at the tth iteration, lg(·) = log_10(·), and ⌊·⌋ is the floor function.
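The adaptive switch can be sketched as follows. This is only an outline: the two update rules are passed in as functions, and which branch corresponds to u ≤ R_t is our assumption, since the exact comparison is defined in [16].

```python
import numpy as np

def adifa_step(x_i, x_j, R_t, explore, exploit, rng=None):
    """AD-IFA adaptive switch: draw u ~ U[0, 1] and pick one of the two
    position-update rules (Levy exploration, Eq (5), or spiral
    exploitation, Eq (6)). The branch direction is an assumption."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random()
    return explore(x_i, x_j) if u <= R_t else exploit(x_i, x_j)
```

Because R_t is recomputed each iteration from the best fitness so far, the ratio of exploration to exploitation adapts to the progress of the search rather than being fixed in advance.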

The proposed algorithm
This paper proposes the Q-learning based adaptive logarithmic spiral-Levy flight firefly algorithm (QL-ADIFA), which combines the strengths of the Q-learning algorithm and the adaptive logarithmic spiral-Levy flight firefly algorithm. QL-ADIFA leverages Q-learning to enhance the efficiency of the exploration and exploitation stages of the meta-heuristic algorithm. Additionally, it utilizes the meta-heuristic algorithm to better retain the information about the search space obtained during the iterative process. As a result, the QL-ADIFA algorithm achieves higher efficiency and effectiveness. The proposed QL-ADIFA consists of two parts. The first is the adaptive logarithmic spiral-Levy flight firefly algorithm (AD-IFA), in which an adaptive switching (ratio) method addresses the imbalance between exploration and exploitation. The second is the Q-learning based logarithmic spiral-Levy flight firefly algorithm (QL-LSLFA), in which Q-learning addresses the same imbalance. During each iteration, one of these two components is randomly selected to update the positions of the fireflies; after testing, the selection probability is set in this article to 0.7 for AD-IFA and 0.3 for QL-LSLFA. To improve readability, the abbreviations used in this article are listed in Table 1. The QL-LSLFA algorithm treats fireflies as agents with two states, the exploration stage (the Levy-flight path) and the exploitation stage (the logarithmic spiral path), and two actions, i.e., switching from one stage to the other. The flow chart of QL-LSLFA is shown in Figure 2.
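The per-iteration selection between the two components can be sketched as follows (a minimal illustration with our own names, using the 0.7/0.3 probabilities stated above):

```python
import numpy as np

def choose_component(rng, p_adifa=0.7):
    """Randomly pick which component of QL-ADIFA updates the firefly
    positions in this iteration (0.7 / 0.3, as stated in the text)."""
    return "AD-IFA" if rng.random() < p_adifa else "QL-LSLFA"

# Empirically, about 70% of iterations should use the AD-IFA component.
rng = np.random.default_rng(42)
picks = [choose_component(rng) for _ in range(10_000)]
frac_adifa = picks.count("AD-IFA") / len(picks)
```

Over many iterations the empirical fraction of AD-IFA updates converges to the configured 0.7, so both components contribute throughout the run.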
The Q-learning algorithm controls the actions of fireflies by adapting state transitions based on the Q-table. The fireflies learn good or bad behaviour via the Reward table, which rewards good behaviour (+1) and punishes bad behaviour (−1) as it is updated. The Q-table better represents the firefly's performance during the process, allowing the firefly to choose more appropriate actions. In QL-LSLFA, the position of the ith firefly at time t changes according to Eq (9), in which a_1 represents the switch from the exploitation stage to the exploration stage, and a_2 represents the switch from the exploration stage to the exploitation stage.
In QL-LSLFA, fireflies make adaptive judgments and choose the most appropriate actions according to the Q-learning algorithm. The improved Q-learning procedure can be summarized in five steps:
1) Initialize the Q-table as a 2 × 2 zero matrix; the specific form of the Reward table is shown in Eq (10).
2) According to the Q-table values in the current state, select the action with the highest score as the best action for the current stage.
3) Perform the selected action and calculate the new fitness value and the corresponding immediate reward.
4) Update the Q-table with Eq (1).
5) Update the location of the agent based on the new state.
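Steps 2-5 above can be sketched as one decision cycle. Two details are our assumptions, flagged in the comments: minimization is assumed (improvement means a lower fitness), and taking action a_k is assumed to leave the agent in stage k.

```python
import numpy as np

def ql_lslfa_step(Q, state, fitness_old, fitness_new, lam, theta):
    """One QL-LSLFA decision cycle (steps 2-5), as a sketch."""
    action = int(np.argmax(Q[state]))                    # step 2: greedy action
    reward = 1.0 if fitness_new < fitness_old else -1.0  # step 3: +1 / -1 reward (minimization assumed)
    next_state = action                                  # assumption: action a_k moves the agent to stage k
    Q[state, action] += lam * (reward + theta * np.max(Q[next_state])
                               - Q[state, action])       # step 4: Eq (1)
    return Q, next_state                                 # step 5: new state

Q = np.zeros((2, 2))
Q, new_state = ql_lslfa_step(Q, state=0, fitness_old=5.0, fitness_new=3.0,
                             lam=0.5, theta=0.9)
```

Here a successful move (fitness 5.0 → 3.0) earns reward +1 and raises the Q-value of the chosen state-action pair, making that stage more likely to be chosen again.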
Figure 2 shows the general flow chart of the proposed QL-ADIFA. From the initial phase of QL-ADIFA, each search agent is independent of the others and continuously improves its behaviour based on Q-learning. QL-ADIFA is executed iteratively until the termination condition is satisfied.
A key feature of this algorithm is that it can effectively switch between the two stages according to its own needs, allowing it to find the global solution effectively while improving the efficiency of local searches.
The pseudo-code of the proposed QL-ADIFA algorithm is shown in Algorithm 2.
Algorithm 2: QL-ADIFA algorithm pseudo-code.
1 Initialize the Reward table;
2 Set the Q-table as an m × n zero matrix, with m = 2 and n = 2;
3 Set Iter = 0 and Max_Iter = 500; N is the population size of fireflies;
The light intensity I_i of x_i is calculated by the fitness function

Numerical simulations
Numerical simulations based on 15 benchmark functions are performed to verify the performance of the proposed method. Table 2 shows the details of these functions.
Table 2. Description of the 15 benchmark functions.

Parameter settings
This section reports the results of the proposed QL-ADIFA and compares it with the original algorithm on different test functions. QL-ADIFA is executed with 100 iterations and 25 search agents. To demonstrate the capabilities of the proposed method, QL-ADIFA is compared and analysed against QL-LSLFA and the following algorithms: 1) AD-IFA [16]; 2) FA [24]; 3) LF-FA [16]; 4) NQL-ADIFA, the adaptive logarithmic spiral-Levy flight firefly algorithm with Q-learning removed; 5) QQLMPA, a quasi-opposition learning and Q-learning based marine predators algorithm [19]; and 6) INFO, an innovative optimizer based on the weighted mean of vectors [27]. NQL-ADIFA eliminates the impact of Q-learning in QL-ADIFA: in the QL-LSLFA part, the Q-learning algorithm is no longer used to select between the exploration and exploitation strategies; instead, a strategy is chosen at random. The experiments were conducted in the Matlab R2017a environment on a machine with a 1.4 GHz quad-core Intel Core i5 and 8 GB of RAM. The experimental parameter settings for all firefly algorithms (FA, LF-FA, AD-IFA, and our QL-ADIFA) are as follows: the randomization parameter α = 0.2, the fixed light absorption coefficient γ = 1, and the attractiveness at r = 0 is β_0 = 1.
To ensure the fairness of the experiment, this study averages the results over 25 runs. In particular, since the rate θ is usually set to a high value at the beginning and then gradually decreased over the time steps, it is set to decay with the iteration count according to Eq (12), where Iter is the current iteration and Max_Iter is the total number of iterations.
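A decaying schedule of this kind can be illustrated as follows. The paper's exact expression was not recoverable from the text, so the linear form and the 0.9/0.1 endpoints below are purely illustrative assumptions.

```python
def decayed_rate(iteration, max_iter, rate_max=0.9, rate_min=0.1):
    """A linearly decaying rate schedule: starts at rate_max and falls to
    rate_min at the final iteration. The linear form and the endpoint
    values are assumptions for illustration, not the paper's formula."""
    return rate_max - (rate_max - rate_min) * iteration / max_iter
```

Such a schedule makes the agent rely more on fresh feedback early in the search and more on accumulated knowledge near the end.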

Result analysis
Tables 3 and 4 provide the performance details, where the mean (Avg) and standard deviation (Std) values are used to evaluate the results of QL-ADIFA and the other algorithms. The best results are presented in bold, and all algorithms are ranked from best to worst based on their average performance. Additionally, the Wilcoxon rank sum test is conducted at a 95% confidence level to calculate the p-values and h-values, which establishes whether QL-ADIFA is significantly different from the other algorithms. 'NaN' represents 'not available'. Moreover, Figure 3 illustrates the convergence curves of six functions for all algorithms, which indicates that QL-ADIFA outperforms the other algorithms, achieving better results in fewer iterations. QL-ADIFA also shows stronger exploration and exploitation capabilities, making it easier to avoid local optima.
QL-ADIFA exhibits superior exploration capabilities that enable it to discover better solutions than the original algorithm. The results presented in Tables 3 and 4 show that QL-ADIFA achieves an Avg closer to the global optimum than the other algorithms. However, compared with the two excellent algorithms QQLMPA and INFO, QL-ADIFA still has considerable room for improvement. Compared only with the FA-series algorithms, QL-ADIFA performs best in terms of Avg except on f_3. Furthermore, the Std value of QL-ADIFA is the lowest on 50% of the benchmark functions, indicating that its excellent performance is robust. The convergence curves of the six functions shown in Figure 3 illustrate that QL-ADIFA outperforms the other algorithms in terms of convergence speed. Overall, the proposed algorithm achieves good performance in terms of accuracy and speed after the improvements are made.
Tables 3 and 4 present the p-values and h-values obtained from the non-parametric Wilcoxon rank sum statistical test, which was conducted to determine whether QL-ADIFA performs significantly better than the other algorithms. In most benchmark functions, the p-value is less than 0.05 and the h-value is 1 between QL-ADIFA and each of the other algorithms, indicating that the advantage of QL-ADIFA over the other algorithms is credible and significant. Specifically, the h-value is 1 in 52% of the functions, indicating a significant difference between QL-ADIFA and the other algorithms, and correspondingly the p-value is below 5% in 52% of the benchmark functions, indicating that QL-ADIFA is effective on most functions.
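For readers unfamiliar with this test, the sketch below computes the rank-sum statistic and a two-sided p-value using the normal approximation (a minimal sketch: no tie handling and no continuity correction; in practice a library routine such as scipy.stats.ranksums does this). The per-run values are hypothetical, not taken from the paper's tables.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    W = sum(pooled.index(v) + 1 for v in x)              # rank sum of sample x
    mu = n1 * (n1 + n2 + 1) / 2.0                        # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)    # std of W under H0
    z = (W - mu) / sigma
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# Hypothetical per-run best values for two algorithms on one function.
ql_adifa = [0.012, 0.009, 0.011, 0.010, 0.013, 0.008, 0.014, 0.007]
baseline = [0.041, 0.038, 0.045, 0.039, 0.044, 0.040, 0.043, 0.042]
z, p = rank_sum_test(ql_adifa, baseline)
h = int(p < 0.05)   # h = 1: significant difference at the 95% level
```

Since every value of the first sample is smaller than every value of the second, the ranks separate completely and the test reports a significant difference (h = 1), mirroring how the h-values in Tables 3 and 4 are read.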

Engineering design problems
This section presents the practical application of QL-ADIFA to engineering design problems, including the cantilever beam design, the pressure vessel design, the three-bar truss design, and constrained optimization problems from CEC2020. As in the benchmark function tests, QL-ADIFA is run for 100 iterations, repeated 25 times for each problem.

The cantilever beam design
The first practical engineering problem involves the weight optimization of a square-section cantilever beam. The beam is made up of 5 hollow square blocks of constant thickness, where the height of each block is a decision variable and the thickness is fixed. One end of the beam is rigidly supported, and a vertical force acts on the free node of the cantilever. The objective is to minimize the weight of the cantilever beam. Table 5 presents a comparison of the best results achieved by the different algorithms. The best result obtained by QL-ADIFA outperforms those of the other algorithms: the optimal value of QL-ADIFA is 1.7505, with decision variables x_1 = 4.7861, x_2 = 7.9613, x_3 = 7.7762, x_4 = 4.6585, and x_5 = 2.8702.
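The objective of this benchmark can be sketched as follows. The 0.0624 weight coefficient is the standard one in the literature and reproduces the paper's reported optimum of 1.7505 at the solution above; the constraint expression is the standard formulation and is our assumption, since the paper's equations were not recoverable from the text.

```python
def cantilever_weight(x):
    """Weight of the cantilever beam: 0.0624 * (x1 + x2 + x3 + x4 + x5)."""
    return 0.0624 * sum(x)

def cantilever_constraint(x):
    """Standard constraint (assumed formulation):
    g(x) = 61/x1^3 + 37/x2^3 + 19/x3^3 + 7/x4^3 + 1/x5^3 - 1 <= 0."""
    coeffs = [61.0, 37.0, 19.0, 7.0, 1.0]
    return sum(c / xi ** 3 for c, xi in zip(coeffs, x)) - 1.0

x_best = [4.7861, 7.9613, 7.7762, 4.6585, 2.8702]  # the paper's reported solution
w_best = cantilever_weight(x_best)                  # ~1.7505, matching Table 5
```

Evaluating the reported solution gives a weight of about 1.7505 and satisfies the (assumed) constraint, consistent with the figures quoted above.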

The pressure vessel design
The second example pertains to the design of a pressure vessel. The objective is to minimize the total cost, which includes welding, material, and pressure vessel moulding costs. The problem involves four design variables: 1) x_1, the thickness of the shell; 2) x_2, the thickness of the head; 3) x_3, the inner radius of the head; and 4) x_4, the length of the cylindrical section.

Table 7 displays the best results achieved by the various algorithms for the three-bar truss design problem. QL-ADIFA outperforms all other algorithms with the best result. Specifically, when the design variables are x_1 = 0.79 and x_2 = 0.41, the optimal value obtained by QL-ADIFA is 263.85.

There are 57 constrained optimization problems in CEC2020, with dimensions ranging from 2 to 158 variables and with between 2 and 148 equality and inequality constraints. Comprehensive information regarding these problems can be found in [29]. To evaluate the performance of the QL-ADIFA algorithm proposed in this study, 9 constrained optimization problems were selected for a series of experiments.
The detailed results of the selected CEC2020 functions obtained by the proposed QL-ADIFA and the contrast algorithms are presented in Table 8, including the average (Avg), best (Best), and standard deviation (Std) for each problem. As shown in Table 8, QL-ADIFA significantly improves average values, stability, and convergence speed compared with AD-IFA.

Conclusions
The present study proposes a new optimization algorithm, called QL-ADIFA, which is a hybrid of the Q-learning based logarithmic spiral-Levy firefly algorithm (QL-LSLFA) and the adaptive logarithmic spiral-Levy firefly algorithm (AD-IFA). The QL-LSLFA algorithm improves the efficiency of the original FA by introducing Q-learning, which enables the fireflies to better adapt to state transitions and use the information obtained from previous iterations. In addition, through the union with AD-IFA, QL-LSLFA can largely avoid falling into local optima. To evaluate the performance of QL-ADIFA, the algorithm was tested on 15 benchmark functions and twelve engineering problems. The experimental results demonstrated that QL-ADIFA outperforms other algorithms in terms of solution quality, stability, and convergence speed for most of the tested functions and problems. The proposed hybrid algorithm thus represents an effective and promising approach to solving global optimization problems.
In future work, combining Q-learning with other variants of the firefly algorithm could be considered to improve the convergence speed of the algorithm. In addition, preliminary experiments on parameter settings could be added to further optimize the performance of the algorithm.

Figure 1. The flowchart of Q-learning.

Table 1. Abbreviations used in this article.

Abbreviation   Full name
FA             Firefly algorithm
LF-FA          The Levy-flight firefly algorithm
AD-IFA         The adaptive logarithmic spiral-Levy flight firefly algorithm
QL-LSLFA       The Q-learning based logarithmic spiral-Levy flight firefly algorithm
QL-ADIFA       The Q-learning based adaptive logarithmic spiral-Levy flight firefly algorithm
NQL-ADIFA      The adaptive logarithmic spiral-Levy flight firefly algorithm with Q-learning removed
(Algorithm 2, continued)
... with various initial rates, and define the light absorption coefficient γ;
9 while Iter < Max_Iter do
10   for i = 1 : N do
11     for j = 1 : N do
...        According to Eq (5), enter the exploration stage and update the position of the ith firefly;
           else, according to Eq (6), enter the exploitation stage and update the position of the ith firefly;
           else use AD-IFA for the ith firefly's position update;
           if the new function value has improved then Reward = Reward + 1; else Reward = Reward − 1;
26     The attractive force varies with the distance r according to exp(−γ r^2);
27   Evaluate new solutions and update light intensities;
28   Rank fireflies and find the fitness value f* of the firefly in the best position;
29   Iter = Iter + 1;
30 return Final result;

Figure 3. Convergence curves of the compared algorithms on selected benchmark functions.

(Algorithm 1, continued)
8 Execute action a_Iter and obtain the immediate feedback r_{Iter+1};
9 Acquire the corresponding maximum Q-table value for s_{Iter+1};
...
13 Iter = Iter + 1;
14 return Q-table;

Table 5. Comparison of the cantilever beam optimization designs obtained by different algorithms.

Table 7. Comparison of the three-bar truss optimization designs obtained by different algorithms.

Table 8. Results of QL-ADIFA and other algorithms on 9 constrained optimization problems.