Generative Design by Using Exploration Approaches of Reinforcement Learning in Density-Based Structural Topology Optimization

Abstract: A central challenge in generative design is the exploration of a vast number of solutions. In this work, we extend the two major density-based structural topology optimization (STO) methods with four classes of exploration algorithms from reinforcement learning (RL), which approaches generative design in a new way. The four methods are: first, using the ε-greedy policy to disturb the search direction; second, using the upper confidence bound (UCB) to add a bonus to the sensitivity; and last, using Thompson sampling (TS) as well as information-directed sampling (IDS) to direct the search, where the posterior reward function is fitted by a Beta or Gaussian distribution. These combined methods are evaluated on several structural compliance minimization tasks from 2D to 3D, including the variable-thickness design problem of an atmospheric diving suit (ADS). We show that all methods can generate various acceptable design options simply by varying one or two parameters, except that IDS fails to converge for complex structures due to its computational cost. We also show that both the Beta and Gaussian distributions describe the posterior probability well.


Introduction
Artificial intelligence (AI) has developed rapidly and has been applied to structural topology optimization (STO) efficiently in recent years. Around 2000, evolutionary algorithms like the genetic algorithm [1,2], as well as swarm intelligence algorithms like the ant colony algorithm [3,4] and particle swarm optimization [5], were used in STO. Some conventional machine learning algorithms such as linear regression, support vector machines [6], and shallow neural networks [7] were also tried on STO problems. Since 2012, fast-developing computing power has enabled deep learning, with deeper networks and stronger learning ability, to become a popular tool.
Reinforcement learning is another branch of AI, whose samples are created during the learning process rather than prepared in advance. In theory, it can solve any sequential decision problem. Although it has hardly ever been used in STO, its value has been proven in various research areas including robotics, games, recommendation systems, neural architecture design, music generation, and so on [8].
Generative design is the process of exploring the variants of a design and assessing the performance of the output designs [9,10]. Instead of focusing on a single optimal design like traditional design optimization, generative design aims to provide different designers with suitable designs, and these variants help meet the demands of practical uses that are hard to describe mathematically. Aesthetics is also an important factor for customers and should be balanced against engineering performance.
In the area of reinforcement learning, many algorithms have been proposed to solve exploration/exploitation problems such as the multi-armed bandit problem. The progress in reinforcement learning over recent years is evident: the size of the state and action spaces has grown dramatically from simple Atari games [22] to the game of Go [23], and then to the more complex StarCraft II [24]. As for exploration strategies, the simplest idea is adding a random disturbance to change the search direction, whose representative methods are the ε-greedy policy [8] and the Boltzmann policy [25]; due to its simplicity (just adding uniform noise), this is called naïve exploration. Another popular algorithm is the upper confidence bound, or UCB [26,27], which belongs to optimistic exploration: it is prone to choose uncertain actions by adding a weighted term to the action-value function. Probability matching is a less known family of algorithms, whose most famous method is Thompson sampling (TS) [28]; it can search more efficiently using posterior probability, and although its theoretical analysis is hard, it remains convenient to implement. A newer approach is information-directed sampling (IDS), which uses the information gain to make decisions; for some simple multi-armed bandit problems it performs better than the popular approaches, but it has a high computational demand [29].

Research Purpose
In this work, we extended four representative exploration methods of RL to direct the updates of SIMP and BESO for structural topology optimization problems, which can generate a set of meaningful design options considering both aesthetics and engineering performance, and we evaluated the results by attributes such as compliance and novelty.
The proposed framework consists of traditional SIMP and BESO, iterative design exploration, and final design evaluation. The traditional part involves finite element analysis, sensitivity analysis, and filtering. Iterative design exploration means changing the sensitivity values or the element-density update scheme. Design evaluation includes quantifying the novelty and mechanical performance of the generated structures compared with previous designs. Finally, the proposed framework is demonstrated on several 2D and 3D structural compliance minimization cases.
The rest of the paper is structured as follows. Section 2 briefly describes the density-based method. Section 3 introduces and explains how to use the exploration methods in SIMP and BESO. Section 4 demonstrates and discusses case study results. In the end, Section 5 summarizes the conclusions and introduces future work.

Density-Based Structural Topology Optimization
SIMP is the most widely used STO approach in major commercial software, and BESO is called "the discrete SIMP" by Sigmund [30] because it is very similar to SIMP; the difference is that the design variable is discrete in BESO, while it is continuous in SIMP. In this section, we describe the procedures of SIMP and BESO in detail.

Problem Statements
As the most typical STO problem, the structural compliance minimization problem under a volume constraint for density-based methods is defined as

Min: C = (1/2) f^T u,
Subject to: V* = Σ_{e=1}^{N} V_e x_e,
x_e = x_min or 1 (BESO),  x_e ∈ [x_min, 1] (SIMP),  (1)

in which C represents the structural compliance, also called the strain energy; the force vector is f = Ku; and the global stiffness matrix is K = Σ_e x_e^p K_e, where K_e is the element stiffness matrix, p is the penalty exponent in SIMP, and u denotes the displacement vector. x_e denotes the density of an individual element e: x_e = 1 means that the element is full of material and x_e = x_min means void; typically we set x_min = 0.001 to avoid possible singularity of the equilibrium problem [31]. V* is the prescribed total structural volume, V_e is the volume of element e, and N is the number of elements inside the design domain.
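As a concrete illustration, the objective C = (1/2) f^T u can be evaluated directly once the equilibrium equation Ku = f is solved. The sketch below (our own Python illustration with a hypothetical 2-DOF stiffness matrix, not code from the paper) shows the computation:

```python
import numpy as np

def compliance(K, f):
    """Structural compliance C = 1/2 f^T u, with u solved from K u = f."""
    u = np.linalg.solve(K, f)
    return 0.5 * f @ u

# Hypothetical 2-DOF system for illustration only.
K = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
f = np.array([0.0, 1.0])
print(compliance(K, f))  # a stiffer structure yields a lower compliance
```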

Sensitivity Analysis and Filter Schemes
In density-based methods, the sensitivity of an element equals the change of the total compliance when this element is deleted from the structure [32], which is calculated as

α_e = (1/2) x_e^{p−1} u_e^T K_e u_e,

in which u_e is the element displacement vector of the e-th element.
To suppress checkerboard patterns and mesh dependency, the sensitivity numbers are smoothed with a filter,

α̂_e = Σ_j H_ej α_j / Σ_j H_ej,

where the weight factor H_ej is defined as H_ej = max(0, r_min − r_ej), where r_ej is the distance between the centers of element e and element j and r_min denotes the filter radius. Furthermore, in BESO, the discrete variables change sharply between 1 (solid) and x_min (void), which makes the objective function and topology hard to converge. To solve this problem, element sensitivity numbers are averaged with those of the last iteration [33]:

α_e^t ← (α_e^t + α_e^{t−1}) / 2,

in which t represents the current iteration number.
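A minimal sketch of these two smoothing steps, assuming a simple array of element centers (function names such as `filter_sensitivities` are ours, not the paper's):

```python
import numpy as np

def filter_sensitivities(alpha, centers, r_min):
    """Mesh-independency filter: weighted average of neighboring sensitivities
    with weights H_ej = max(0, r_min - r_ej)."""
    out = np.empty_like(alpha)
    for e in range(len(alpha)):
        r = np.linalg.norm(centers - centers[e], axis=1)  # distances r_ej
        H = np.maximum(0.0, r_min - r)                    # weight factors H_ej
        out[e] = H @ alpha / H.sum()
    return out

def smooth_with_history(alpha_t, alpha_prev):
    """BESO stabilization: average current and previous sensitivity numbers."""
    return 0.5 * (alpha_t + alpha_prev)
```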

Adding and Removing Elements
The difference between SIMP and BESO is large when updating the design variables: the former usually uses highly efficient algorithms like the optimality criteria (OC) method or the method of moving asymptotes (MMA). A standard OC updating scheme is used in this paper, which can be formulated as [31]

x_e^{t+1} = max(x_min, x_e^t − m),  if x_e^t (B_e)^η ≤ max(x_min, x_e^t − m),
x_e^{t+1} = min(1, x_e^t + m),  if x_e^t (B_e)^η ≥ min(1, x_e^t + m),
x_e^{t+1} = x_e^t (B_e)^η,  otherwise,

where x_e^t is the value of the design variable at iteration t, η is a numerical damping coefficient (equal to 0.5 in this paper), m is the positive move limit, and B_e is given by the optimality condition as

B_e = −(∂C/∂x_e) / (λ ∂V/∂x_e),

in which λ is a Lagrangian multiplier that can be determined using the bisection method. As for BESO, in each iteration the volume is evolved by

V_{t+1} = V_t (1 − ER),

where ER is the evolutionary volume ratio. Another parameter, the maximum volume addition ratio (AR), is usually used to guarantee that not too many elements are added in a single iteration; it is not needed in our method, so we ignore it. The elements are then removed strictly in ascending order of their sensitivity numbers until the prescribed volume for this iteration is reached, like a greedy heuristic search algorithm.
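The OC step with the bisection on λ can be sketched as follows (our simplified version, assuming unit element volumes so that ∂V/∂x_e = 1; parameter names follow the paper where possible):

```python
import numpy as np

def oc_update(x, dc, vol_frac, m=0.2, eta=0.5, x_min=0.001):
    """One OC update: bisect the Lagrange multiplier lambda until the
    move-limited, damped update x * Be^eta meets the target volume fraction."""
    lo, hi = 1e-9, 1e9
    while (hi - lo) / (hi + lo) > 1e-8:
        lam = 0.5 * (lo + hi)
        Be = -dc / lam                            # optimality condition, dV/dx = 1
        x_new = np.clip(x * Be**eta,
                        np.maximum(x_min, x - m), # lower move limit
                        np.minimum(1.0, x + m))   # upper move limit
        if x_new.mean() > vol_frac:
            lo = lam                              # too much material: raise lambda
        else:
            hi = lam
    return x_new

x = 0.5 * np.ones(4)
dc = np.array([-1.0, -2.0, -3.0, -4.0])           # compliance sensitivities
x_new = oc_update(x, dc, vol_frac=0.5)
```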

Convergence Criterion
The convergence criterion is defined by the change of the objective function. After reaching the volume constraint V*, the update process does not end until the change in compliance over the past iterations is small enough:

| Σ_{i=1}^{M} C_{t−i+1} − Σ_{i=1}^{M} C_{t−M−i+1} | / Σ_{i=1}^{M} C_{t−i+1} ≤ τ,

where M is an integer and τ is a small constant representing the convergence tolerance. Because exploration hampers convergence, it is not used once the volume of the topology decreases near the prescribed value.
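With M and τ as defined above, the stopping test can be written as follows (a sketch; `C_hist` is assumed to be the list of compliance values per iteration):

```python
def converged(C_hist, M=5, tau=0.001):
    """True when the relative change between the compliance sums of the last M
    and the preceding M iterations falls below the tolerance tau."""
    if len(C_hist) < 2 * M:
        return False
    recent = sum(C_hist[-M:])
    older = sum(C_hist[-2 * M:-M])
    return abs(recent - older) / recent <= tau
```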

Evaluation
To guarantee the engineering performance and the novelty of the results, two metrics are calculated to evaluate the final structure in each episode. First, the difference in performance (e.g., compliance) between the current design and the reference design is considered; structures with low performance are rejected.
The second metric, IoU (intersection-over-union), comes from a common evaluation metric for object detection (Figure 1) and represents the similarity between the current structure and previous structures. It is calculated as the ratio of the intersection to the union of two rectangles called the candidate bound CB and the ground truth bound GB:

IoU = area(CB ∩ GB) / area(CB ∪ GB).

Furthermore, IoU can be extended to 3D flexibly by changing 'area' to 'volume'.
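For pixel or voxel designs, the same ratio can be computed directly on binary density arrays, which also covers the 3D extension (our own sketch, not the paper's code):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary density maps (2D or 3D arrays)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return (a & b).sum() / (a | b).sum()

# Two 2x2 designs sharing one of their three occupied cells.
print(iou([[1, 1], [0, 0]], [[1, 0], [1, 0]]))  # -> 0.3333...
```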


Using Exploration Approaches of Reinforcement Learning in STO
The exploration-exploitation dilemma is a fundamental problem in reinforcement learning. Exploitation means making the best decision based on current information, while exploration means gathering more information, which helps make the best overall decision at the expense of efficiency. Figure 2 shows briefly how the exploration approaches are combined with STO: the ε-greedy policy disturbs the search direction when adding and removing elements; UCB adds a bonus when calculating the sensitivity; TS and IDS replace the sensitivity function with posterior reward distributions and update the element densities by sampling. Among them, UCB is used in both SIMP and BESO, while the remaining exploration strategies are used only in BESO because they are inconvenient for continuous variables. In this section, we describe the procedures in detail.



Naïve Exploration
In reinforcement learning, the ε-greedy policy is a common exploration method, mostly used for discrete action spaces. Following [8], the ε-greedy policy is defined as

π(a|s) = 1 − ε + ε/|A(s)|,  if a = a*,
π(a|s) = ε/|A(s)|,  otherwise,

where a* denotes the best action for maximizing the predicted value, and |A(s)| is the total number of actions in state s. By choosing an appropriate value of ε, the balance between exploration and exploitation can be achieved. In general, a decay schedule for ε is added as the algorithm runs.
There are many variants of ε-greedy. One is called 'greedy with optimistic initialization', which initializes the action-value function to a high value so that almost every action is tried in theory, but it can still lock on to sub-optimal actions. Another is the 'decaying ε_t-greedy policy': ε_t decreases as the iterations and the gaps (the difference in value between the current action and the optimal action) increase. The regret of this method is lower, but advance knowledge of the gaps is required, which is unrealistic in practice. Moreover, there is another approach named Boltzmann exploration, inspired by thermodynamics, which uses a temperature to control the choice of actions, where the temperature plays the role of ε in ε-greedy.
In a word, exploration in these naïve methods is achieved by randomization, while in BESO, elements are deleted strictly by the rank of their sensitivity numbers, like a greedy algorithm. To use naïve exploration in BESO, a random perturbation of the process of adding and removing elements is needed. To prevent the whole structure from collapsing suddenly, we divide the elements into two classes based on the ranking of sensitivity numbers, inspired by the genetic algorithm used in BESO [2]; the proportion of the higher class is V*/2, and this mechanism is also used in all the other algorithms in Section 3. When deploying the ε-greedy policy, after sorting the elements by their sensitivity numbers, the elements are divided into a lower class and a higher class; the bottom (1 − ε) of the elements to be deleted in this iteration are first removed as in BESO, while the remaining ε portion is removed randomly from the lower class.
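A sketch of this ε-greedy removal step (function and parameter names are ours; `protected_frac` plays the role of the higher-class proportion V*/2 described above):

```python
import numpy as np

def epsilon_greedy_removal(alpha, n_remove, eps, protected_frac=0.5, rng=None):
    """Remove (1 - eps) of the quota greedily (lowest sensitivity first) and
    the remaining eps portion at random from the lower class only, so the
    top-ranked (protected) elements are never touched."""
    rng = rng or np.random.default_rng()
    order = np.argsort(alpha)                               # ascending sensitivity
    lower = order[: int(len(alpha) * (1 - protected_frac))]  # lower class
    n_greedy = int(round((1 - eps) * n_remove))
    chosen = list(order[:n_greedy])                          # greedy removals
    pool = [e for e in lower if e not in chosen]
    chosen += list(rng.choice(pool, size=n_remove - n_greedy, replace=False))
    return np.array(chosen)
```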
Moreover, ε is decayed with the iterations to make the solutions converge, like the decaying ε-greedy policy; the schedule is determined by ε_0, the value of ε at the beginning of the episode, and t_0, the iteration at which the volume fraction first reaches the minimum. For some structures with fine meshes, the effect of removing just one element is not obvious. Therefore, we use a 'search window' (Figure 3) when removing elements, which means deleting one element together with its neighbors. The size of the search window is controlled by two parameters, winx and winy, and it can be expanded to 3D cases easily.

Optimistic Exploration
Optimistic exploration means that uncertain actions are more likely to be chosen. To measure the uncertainty, the frequency of each action is used as an additive term, namely a bonus, when choosing actions:

a_t = argmax_a [ Q_t(a) + c · U_t(a) ],

in which Q_t is the action-value function and c is a positive parameter that controls the degree of exploration; a bigger c means more exploration. Upper confidence bound, or UCB [26,27], is the most popular algorithm of this kind, where the bonus is

U_t(a) = sqrt( ln t / N(a) ),

in which N(a) is the number of times action a has been selected and t is the number of iterations. There are also some variants of UCB, but whatever form the bonus takes, it is the main difference between these methods and greedy exploration. In a word, the key point of optimistic exploration is keeping the balance between uncertainty and optimality. In STO, the idea of UCB is adopted when calculating the sensitivity, with N(a_e)_0 = 1 in the beginning. In SIMP, N(a) is accumulated from the increment of the density; as for BESO, because the initial elements are all solid and their density is discrete, N(a) is updated by the number of iterations each element has remained solid. In this way, elements with a low probability of being removed are more likely to be deleted now, and the probability increases as c increases. Normally, c should be chosen so that the magnitudes of the sensitivity and the bonus are similar.
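With the bonus form above, the modified sensitivity used for ranking can be sketched as follows (our illustration, assuming the standard sqrt(ln t / N) UCB bonus; c and N are as in the text):

```python
import numpy as np

def ucb_sensitivity(alpha, N, t, c=1.0):
    """Add a UCB exploration bonus to the sensitivity numbers; N counts how
    many iterations each element has stayed solid (initialized to 1)."""
    return alpha + c * np.sqrt(np.log(t) / N)
```

Elements with a small visit count N receive a larger bonus, so their ranking, and hence their chance of being added or removed, changes the most.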

Probability Matching
Probability matching, also called posterior sampling, uses posterior probability to achieve exploration based on Bayesian theory; the idea is again to encourage the exploration of uncertain actions. This scheme is also called Thompson sampling [28], which chooses each action according to its probability of being optimal:

π(a|h_t) = P[ Q(a) > Q(a'), ∀a' ≠ a | h_t ] = E_{R|h_t}[ 1(a = argmax_a Q(a)) ],  (18)

where h_t is the history sequence of actions and rewards, π(a|h_t) is the probability of choosing action a given h_t, Q(a) represents the action-value function, and R is the reward function. The brief steps of Thompson sampling are:
• Sample a reward function from the posterior distribution P
• Take the optimal action by Equation (18)
• Execute the chosen action in the actual environment and get the reward r_t
• Update the posterior distribution P by Equations (19) and (20)
As the learning process goes on, the confidence interval of each action becomes narrower and narrower, toward convergence.
In BESO, as in the multi-armed bandit problem, elements are just like the arms of a bandit, and the exploration methods help decide which arm is chosen (deleted). The reward is defined as the rank of the sensitivity numbers of the elements, from 0 to 1 as the sensitivity decreases, so elements with a higher reward (lower sensitivity) are more likely to be chosen (removed). The actual reward distribution of each action is obtained from the previous ranks. To describe the reward distribution, two frequently used models, the Beta distribution and the Gaussian distribution, are shown in Figure 4. The Beta distribution is the conjugate prior of the binomial distribution, with two parameters Beta(α, β). Although its derivation is a bit complex, its update scheme is simple and describes the sampling process well:

(α, β) ← (α, β) + (r_t, 1 − r_t).  (19)

For example, the initial distribution is Beta(1, 1) without any prior knowledge, so every action has the same expected reward. Then α and β increase as the number of samples increases; as a result, the distribution function becomes narrower and narrower around the mean value.
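The Beta-posterior Thompson step for BESO can be sketched as follows (names are ours; the hypothetical `scale` factor corresponds to the 0.1* to 10* zooming of the update used later in the experiments):

```python
import numpy as np

def thompson_select(ab, n_remove, rng):
    """Sample each element's reward from Beta(alpha, beta) and remove the
    n_remove elements with the highest sampled rewards."""
    samples = rng.beta(ab[:, 0], ab[:, 1])
    return np.argsort(samples)[-n_remove:]

def beta_update(ab, chosen, rewards, scale=1.0):
    """Equation (19), optionally scaled: (alpha, beta) += scale * (r, 1 - r)."""
    ab[chosen, 0] += scale * rewards
    ab[chosen, 1] += scale * (1.0 - rewards)
    return ab
```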

The Gaussian distribution also has two parameters, N(µ, σ^2), which are updated by the conjugate rule

µ ← (σ_0^2 µ + σ^2 Y_t) / (σ^2 + σ_0^2),  σ^2 ← σ^2 σ_0^2 / (σ^2 + σ_0^2),  (20)

in which Y_t is an observation drawn independently from the actual reward distribution N(µ_0, σ_0^2). The variance σ^2 decreases as the number of samples increases, so the distribution function concentrates more and more on the mean value µ.

Information State Search
Information state search views information as part of the state. The idea of this kind of method is to choose the action with the largest information gain. The most representative approach is information-directed sampling (IDS) [29], in which the action is sampled by minimizing the ratio of the squared expected single-period regret to the information gain:

a_t = argmin_a Δ_t(a)^2 / g_t(a),  (21)

where Δ(a) = E[r(a*) − r(a)] represents the difference between the rewards earned by the optimal action and the actual action, and g(a) is the information gain of a, equal to the expected reduction in the entropy H of γ_t due to the observation, where γ_t is the posterior distribution of the optimal action A*. To calculate Δ and g, approximate methods must be used, because the posterior integrals have to be computed over high-dimensional spaces. First, near-exact approximations can be obtained by evaluating the integrands on a discrete grid of points. Another approach is to replace the integrals with sample-based estimates, which reduces the time complexity, especially for integration over high-dimensional spaces; we prefer the latter method in this paper. The brief steps are:
• Get the samples for estimation by interacting with the actual environment in the first iteration, and from the updated posterior reward distribution P in the following steps
• Compute Δ, g, and the information gain ratio
• Take the optimal action by Equation (21)
• Execute the chosen action in the actual environment and get the reward
• Update the posterior reward distribution by Equations (19) and (20)
In BESO, the definition of the reward is the same as in Section 3.3, but the number of actions must be reduced because of the high time complexity. Thus, we run standard BESO in the first episode, then divide the elements by the rank of their sensitivity numbers into 10 to 20 groups; each group corresponds to one action that can be chosen (deleted) a number of times equal to the number of solid elements it contains.
In the following episodes, the number of elements to delete in each group is first allocated by IDS; after that, the elements within each group are chosen by sampling according to the actual reward function of each element.
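The decision rule of Equation (21) itself is a one-liner once Δ and g have been estimated by sampling (our sketch, with hypothetical estimates for three grouped actions):

```python
import numpy as np

def ids_action(delta, gain):
    """Information-directed sampling: minimize squared expected regret divided
    by information gain; zero-gain actions are never chosen."""
    delta = np.asarray(delta, dtype=float)
    gain = np.asarray(gain, dtype=float)
    ratio = np.where(gain > 0, delta**2 / np.maximum(gain, 1e-12), np.inf)
    return int(np.argmin(ratio))

# Action 1 has a larger regret but a much larger information gain.
print(ids_action([1.0, 2.0, 1.0], [1.0, 8.0, 0.0]))  # -> 1
```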

Cases and Discussion
This section applies the different exploration methods above, combined with SIMP and BESO, to several well-known 2D and 3D minimum compliance problems, and then filters the final structures by evaluation. After that, a set of design options and stacked views are generated, and the degree of variation can be controlled simply by one parameter for each method: when increasing ε of the ε-greedy policy, c of UCB, or the variance of the posterior distribution of TS and IDS, more novel structures are created, but at a higher risk of divergence.

Cantilever Beam
A 2D cantilever beam is given to demonstrate the performance of the different exploration algorithms. Figure 5 shows the design domain of the structure with its load and boundary conditions. A Young's modulus E = 1 MPa and Poisson's ratio υ = 0.3 are assumed. The objective volume fraction is set to 50% of the total area of the design domain. The other BESO parameters are: evolutionary volume ratio ER = 0.04, filter radius r_min = 4 mm, minimum design variable x_min = 0.001, M = 5, τ = 0.1% for the convergence criterion, and penalty exponent p = 3.0. The SIMP parameters are m = 0.2, r_min = 4 mm, and τ = 1%. Moreover, the evaluation standard accepts those designs whose compliance is below twice that of the benchmark structure and whose IoU is below 0.9.

First, the decaying ε-greedy policy is used. The first result is calculated by standard BESO, and the rest are obtained by setting ε = 0.05 or ε = 0.1 with winx = winy = 3. It is worth mentioning that the structures tend to be asymmetric; therefore, to keep the diversity, a symmetry constraint is added to obtain some symmetric results.
Second, the UCB algorithm is applied with BESO. Four results satisfying the requirements are obtained with c = 1 in 10 episodes, and the remaining 16 results are generated by varying c from 1 to 1000.
Third, TS is utilized with the two distributions. The diversity of the topologies can be controlled by zooming the variance of the posterior reward distribution. For the Beta distribution, diverse final topologies are obtained by varying the increment of α and β from 0.1* to 10*; for the Gaussian distribution, by varying σ of the model from 1* to 100*.
Fourth, IDS is also applied with the two distributions. The rate for the Beta distribution varies from 0.1 to 10, and that for the Gaussian distribution varies from 1 to 100.
Finally, UCB is combined with SIMP, and c ranges from 0.05 to 0.1 to obtain the designs. All topologies are shown in Figure 6; we can see that all seven methods successfully generate diverse acceptable structures. UCB can keep the symmetry of the original structure, while the ε-greedy policy cannot without the symmetry constraint; the results of TS and IDS are symmetric only by chance. Furthermore, IDS needs more computational resources. A notable phenomenon is that UCB, TS, and IDS usually generate divergent structures in the first one or two episodes because of the exploration, but after that the quality of the structures improves as the search space shrinks. As for the difference between SIMP and BESO, the former sometimes generates intermediate densities (gray elements), which makes the solutions unable to be adopted directly.
To enable a clear visualization of the exploration process, we also show a stacked mode for all results in Figure 6. With this view, users can directly see discrepancies as well as common features among different topologies. The color of each pixel is obtained by averaging the colors at that pixel over all individual designs, so the stacked view equals the average of the generative designs. In these stacked views, the importance of different parts is shown clearly: the main frame is invariant across the designs, while the inner microstructures always change.
The details of the design by UCB with BESO are presented as an example. The information about compliance, IoU, and iterations of the design in each episode is listed in Table 1. The compliance and IoU of all topologies satisfy the evaluation criteria, which means that sufficient engineering performance and diversity of the designs can be guaranteed. Furthermore, the numbers of iterations of most episodes are similar to that of standard BESO, meaning that the computational cost is acceptable. The evolution histories of the objective function for five episodes are shown in Figure 7. The curves are usually not monotonic, with one or more peaks, and the iteration at which a peak occurs and its amplitude differ between episodes, representing various search directions.


L-Shaped Beam
In the second example, we demonstrate the design problem for an L-shaped beam sketched in Figure 8, which is loaded at the center of the rightmost free end by F = −1 N. The beam is discretized into 1600 quadrilateral elements with a 40% maximum volume fraction. Material properties are E = 1 MPa and υ = 0.3. The parameters of BESO are set to ER = 0.03, r_min = 2 mm, x_min = 0.001, M = 5, τ = 1%, and p = 3.0. SIMP parameters are m = 0.2, r_min = 1.5 mm, and τ = 0.01%. Evaluation standards are the same as those in Section 4.1. The parameters of the exploration approaches are as follows: first, ε = 0.05, 0.10, and 0.15 with winx = winy = 1 is used for the ε-greedy policy; then, in UCB, c = 1, 10, and 100 for BESO, and c = 0.3 to 10 for SIMP; as for TS and IDS, we test 0.1* to 10* for the Beta distribution and 1* to 100* for the Gaussian distribution. Figure 9 shows all the final topologies generated by the five methods except IDS, because IDS almost always creates divergent results for this case; the reason may be that the fragile structure around the corner and the small filter radius make the structure more likely to be disconnected.
The results of the other methods show no problems; the only imperfection is that some results have small holes. This matters less as the number of results increases, or it can be resolved by raising the filter radius.
From the stacked views in Figure 9, we can see that the structures of each method share the same clear main frame, while the changeable parts concentrate on the bottom.
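As a concrete illustration of the ε-greedy disturbance of the search direction, consider the element-removal step of a BESO-like update. The following is a simplified Python sketch, not the paper's MATLAB implementation; the function name and the removal-list interface are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_removal(sensitivity, n_remove, eps=0.10):
    """Pick n_remove elements to delete: with probability eps each pick
    is a random (exploratory) element; otherwise it is the element with
    the lowest sensitivity (the greedy choice), mimicking the eps-greedy
    disturbance of the search direction."""
    candidates = list(np.argsort(sensitivity))  # ascending: least useful first
    chosen = []
    for _ in range(n_remove):
        if rng.random() < eps:
            idx = rng.integers(len(candidates))  # explore: random element
        else:
            idx = 0                              # exploit: lowest sensitivity
        chosen.append(candidates.pop(idx))
    return chosen
```

With eps = 0 the rule reduces to standard greedy BESO removal; raising eps toward 0.15 (the largest value tested above) injects more randomness and yields more diverse topologies.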

3D Cantilever Beam
Those proposed exploration approaches can be extended to three dimensions straightforwardly; an STO problem for a 3D cantilever beam is shown in Figure 10. The design domain is 50 × 20 × 10 mm and is discretized using 10,000 eight-node cubic elements. Only 10% of the volume of the design domain is available. The material has E = 1 MPa and υ = 0.3. BESO parameters ER = 0.03, r_min = 1.5 mm, x_min = 0.001, M = 5, τ = 1%, and p = 3.0 are used. SIMP parameters are m = 0.2, r_min = 1.5 mm, and τ = 1%. The evaluation for compliance is the same as above, but the maximum IOU is set to 0.7, because the IOU of the generated designs is usually lower in 3D cases. Due to the time-consuming nature of complex 3D cases, only 10 designs are shown for each method in Figure 11, and the Beta distribution need not be considered because it has shown the same effect as the Gaussian distribution. Four exploration methods are used: the ε-greedy policy with ε = 0.10 and winx = winy = winz = 1, UCB with c = 1 to 10 for both BESO and SIMP, as well as TS with 1* and 10* increments for the Gaussian distribution. Diverse design options are generated successfully. As shown in Figure 11d, the phenomenon of gray elements is more severe for the case with a low volume fraction, but the good thing is that SIMP needs less computation time than BESO.
Results by IDS are below our expectations: they are always divergent. The reason is that IDS tends to disconnect the structure easily under small volume constraints.
Figure 10. Design domain of a 3D cantilever beam with dimensions and boundary and loading conditions.
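The TS update with a Beta posterior used above can be sketched as follows. This is a minimal Python sketch assuming a binary per-element reward (an element is either kept in a good design or removed), with the increment parameter playing the role of the 1*/10* settings; the class and argument names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

class BetaTS:
    """Per-element Beta posterior over a binary 'usefulness' reward.
    Elements kept in a good design receive a success (alpha grows),
    removed ones a failure (beta grows); the increment scales how fast
    the posterior sharpens."""
    def __init__(self, n_elem, increment=1.0):
        self.alpha = np.ones(n_elem)  # uniform Beta(1, 1) prior
        self.beta = np.ones(n_elem)
        self.inc = increment

    def sample(self):
        # Draw one plausible reward per element from its posterior;
        # the optimizer then ranks elements by these sampled values.
        return rng.beta(self.alpha, self.beta)

    def update(self, kept_mask):
        self.alpha[kept_mask] += self.inc
        self.beta[~kept_mask] += self.inc

ts = BetaTS(n_elem=5, increment=10.0)
kept = np.array([True, True, False, False, True])
for _ in range(20):
    ts.update(kept)
scores = ts.sample()  # kept elements now score near 1, removed near 0
```

Larger increments sharpen the posteriors faster, so the search commits to a topology sooner; smaller increments keep the sampling noisier and the designs more diverse.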

Upper Body of an Atmospheric Diving Suit
The final case is based on a practical engineering problem of the China Ship Scientific Research Center: the design of an atmospheric diving suit (ADS), seen in Figure 12a. We chose its upper body as an example to test our exploration algorithms. To simplify the problem, stability, gravity, hydrodynamics, and the minor parts, such as holes or lugs, are not considered.
The main dimensions are shown in Figure 12b, and the structure is set to be fixed at the connections of the head, arms, and the lower part. The maximum working depth is 500 m, and the calculation pressure is set to 9 MPa after considering safety. Material properties are E = 68.9 GPa, υ = 0.3, and the allowable stress is 208 MPa. Because the thickness is much smaller than the overall shape, the mesh would have to be very fine to guarantee enough elements to be deleted, which requires huge computational resources. To avoid this problem, we transformed it into a variable thickness design problem, which means changing the density of elements on the outer surface instead of removing them. This can be simply achieved by setting p = 1 in SIMP; then the thickness can be represented by density. The goal is to optimize the distribution of thickness without changing the weight.
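The density-to-thickness correspondence described above can be written out explicitly. A minimal sketch, assuming linear scaling (p = 1) with the 40 mm element size and the 0.2 lower bound used for this case; the function name is our own.

```python
def density_to_thickness(x, t_max=40.0, x_min=0.2):
    """With p = 1 in SIMP, element stiffness scales linearly with
    density, so density can stand in for shell thickness. Densities are
    clamped to [x_min, 1], giving thicknesses in [t_max * x_min, t_max]
    mm (8-40 mm for the ADS case)."""
    x = min(max(x, x_min), 1.0)
    return t_max * x

density_to_thickness(0.2)  # 8.0 mm minimum (leakage/corrosion safeguard)
density_to_thickness(0.5)  # 20.0 mm initial value
density_to_thickness(1.0)  # 40.0 mm maximum
```

The clamp at x_min = 0.2 enforces the 8 mm minimum thickness, preventing any region from thinning to zero.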
Because the code is written in MATLAB, the geometric model should first be imported from Solidworks, as shown in Figure 13. The upper body is discretized using 8-node hexahedral elements of size 20 × 20 × 20 mm. The densities of the elements are all initialized to 0.5 and vary between 0.2 and 1, which means the thickness of the outer elements varies between 8 and 40 mm. In order to avoid the thickness declining to 0 in some region, which may cause structural failure by leakage or corrosion, the minimum thickness is set to 40 × 0.2 = 8 mm. Figure 14 shows the stress contour of the original structure; the stress is higher near the inner surface of the neck as well as the connections of the body and arms. The maximum stress is 96.1 MPa, and the total compliance is 230.2 N·m.
The design domain is only the outer surface (not including the connected parts). SIMP parameters are m = 0.2, p = 1, r_min = 40 mm, M = 3, and τ = 1%. The evaluation for compliance is the same as above, but the maximum IOU is set to 0.95 because the space of optimization is limited for this case. In addition, the maximum element stress is also an evaluation metric, which cannot exceed the allowable stress.
UCB with c = 30,000 is used to generate four topologies (Figure 15), whose information is shown in Table 2 and whose contour distributions can be seen in Figure 16. According to the first design calculated by SIMP, the material accumulates around the neck, back, and chest, meaning these parts need to be thickened. From Figure 16a-d, the more uniform the thickness is, the higher the compliance of the structure. In addition, the maximum and distribution of stress are almost invariant, which means our proposed method can decrease the compliance by controlling the thickness distribution without changing the weight and stress distribution of the structure.
The stress calculated by our code in MATLAB is not highly precise, because the elements are large and the hexahedral elements cannot fit the curved surface very well. Thus, the results focus more on providing guidance for designers during conceptual design.
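The UCB-style bonus on the sensitivity numbers used throughout these examples can be sketched as follows. This is a simplified Python illustration of the idea (a c-scaled bonus that shrinks as an element is selected more often), not the paper's exact formula; the function and argument names are our own.

```python
import numpy as np

def ucb_sensitivity(alpha, counts, t, c=1.0):
    """Sensitivity numbers plus a UCB exploration bonus: elements that
    have been selected fewer times (small counts) receive a larger
    bonus, nudging the optimizer toward less-exploited regions. The
    single parameter c controls the degree of exploration."""
    bonus = c * np.sqrt(np.log(t + 1.0) / (counts + 1.0))
    return alpha + bonus
```

Raising c makes the bonus dominate the raw sensitivities (hence the very large c = 30,000 for the ADS case, where sensitivities span a wide numeric range), producing more diverse designs at the cost of slower convergence.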


Conclusions and Future Work
In this paper, we focus on adding the features of the exploration approaches in reinforcement learning to the procedures of SIMP and BESO, so as to increase the search ability of structural topology optimization methods to generate multiple solutions. The capability of those combined methods is demonstrated on three minimum compliance STO problems, including 2D and 3D cases, and the degree of exploration can easily be controlled with one parameter. The source code can be found at https://github.com/Nick0095/STO_exploration. By comparison, the ε-greedy policy with random search is easy to deploy, but to get satisfying designs, additional schemes are sometimes necessary, such as the search window for dense meshes and the symmetric constraint for symmetric structures. UCB is used as a bonus when calculating the sensitivity numbers, which encourages exploration of less-exploited elements, and it is able to create acceptable results without dramatically increasing computational complexity. TS is a method based on Bayesian theory and needs somewhat more computational resources; it estimates the value of elements by updating the posterior probabilities, which is more accurate compared with random search and heuristic rules. Finally, IDS is a new information-directed sampling method, but it is limited by its high demand for computing capacity. When dealing with STO problems, an approximation step is added to decrease the number of design variables, which may hamper its performance. In our cases, it can work for the simple cantilever beam, but divergence always occurs when facing more complex structures. We believe it has potential as computation ability increases in the future. Furthermore, the Beta distribution and the Gaussian distribution have similar functions, and both work well in describing the posterior reward distribution of elements.
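For reference, the IDS selection rule discussed above minimizes the information ratio ∆(a)²/g(a): squared expected regret over information gain, in the notation of the Nomenclature. A toy Python sketch of the selection step only, with illustrative inputs; the guard constant is our own addition.

```python
import numpy as np

def ids_action(regret, info_gain):
    """Pick the action minimizing the information ratio
    Delta(a)^2 / g(a). Actions with modest regret but high information
    gain can beat purely greedy choices; the small epsilon guard avoids
    division by zero for zero-information actions."""
    ratio = np.square(regret) / np.maximum(info_gain, 1e-12)
    return int(np.argmin(ratio))

# An action with modest regret but high information gain wins here:
ids_action(np.array([0.1, 0.5, 1.0]), np.array([0.001, 10.0, 0.1]))  # -> 1
```

This trade-off is what makes IDS sample-efficient in principle, but evaluating g(a) over every element of a fine mesh is exactly the computational burden that limits it on complex structures.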
As for SIMP and BESO, although the solving algorithm of BESO is not so precise, this is not a big problem for generative design, and the discrete variables make it easier to combine with various exploration approaches. SIMP can be solved by more efficient algorithms, but the gray-element problem cannot be neglected for designs with low volume fractions; on the other hand, that is what makes SIMP suitable for variable thickness design problems.
Low efficiency is still the bottleneck of the development of reinforcement learning; a highly efficient exploration approach that balances search ability and computational complexity is needed. Following the trend of development in RL, more and more experience can be drawn upon to deal with large-scale combinatorial optimization problems like structural topology optimization. Furthermore, the compliance minimization problem is only a basic problem to test our proposed method; after that, we will try it on more kinds of problems with different objective functions and constraints, so as to make it more widely applicable. Also, implementing these algorithms in commercial software with more kinds of elements is another option to improve efficiency and accuracy.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature
A: action vector
a: action
a*: best action for maximizing the predicted value
B: bonus
Beta(α, β): Beta distribution; α, β: two parameters
C: structure compliance (strain energy)
c: a positive parameter that controls the degree of exploration in UCB
E[·]: operator of calculating the mathematical expectation
ER: evolutionary volume ratio
f: force vector
g(a): information gain of action a
H(·): operator of the entropy
H_ej: weight factor of the filter
h_t: history sequence of actions and rewards
K: global stiffness matrix
K_e: element stiffness matrix
M: an integral number in the convergence criterion
m: positive move limit
N(a): number of times action a has been selected
N(µ, σ²): Gaussian distribution; µ: mean; σ²: variance
P: posterior reward distribution
p: penalty exponent in SIMP
Q: action-value function
R: reward function
r: sampled reward
r_ej: distance between centers of element e and element j
r_min: filter radius
s: state
t: current iteration number
t_0: iteration number when the volume fraction just reaches the minimum
u: displacement vector
V*: prescribed total structural volume
x_e: density of element e
Y_t: observation drawn independently from the actual reward distribution
α_e: sensitivity of element e
γ_t: posterior distribution of A*
∆(a): difference of rewards earned by the optimal action and the actual action
ε: a probability value defined in ε-greedy
ε_0: value of ε at the beginning of the episode
π: policy
τ: a specified small value representing the convergence tolerance
Subscript e: an individual element