Bacterial Foraging Algorithm Based on Activity of Bacteria for DNA Computing Sequence Design

Since the quantity and quality of DNA sequence directly affect the accuracy and efficiency of computation, the design of DNA sequence is essential for DNA computing. In order to improve the efficiency and reliability of DNA computing, there is a rich literature targeting at generating DNA sequences with lower similarity that can hybridize at a lower melting temperature. However, it is not trivial to improve both melting temperature and similarity for the DNA sequence, since DNA sequence design problem under the constraints of Hamming distance, secondary structure and molecular thermodynamic is known to be NP-hard. For the sake of achieving the lower melting temperature and similarity for the generated DNA sequence, we proposed an improved method for the bacterial foraging algorithm based on activity of bacteria (BFA-A). In particular, the effect of bacterial vitality on foraging ability is considered, and a competitive exclusion mechanism is introduced to improve the quality of the generated DNA sequences. In BFA-A, high-quality DNA strands are replicated to avoid the participation of inferior strands in the operation, and the active regulation mechanism and the competitive rejection mechanism are used to improve and accelerate the chemotaxis process. Experiments show that our proposed approach significantly outperforms existing methods in terms of melting temperature and similarity. In addition, the experimental results also show that our method can reduce the number of iterations, and has guiding significance to generate high-quality DNA sequences more efficiency.


I. INTRODUCTION
DNA computing is one of novel computational models by combining computer science and biological science. Owing to biological characteristics, DNA computing can take advantage of the high parallelism, high storage density and low power consumption. Therefore, it has great potential in solving complex combinatorial optimization problems such as NP-complete problems, which making it is one of the important ways to develop non-traditional high-performance computing. A large number of studies have shown that DNA computing can be widely used in various fields, such as Hamiltonian loop problem solution [1], artificial neural network establishment [2], handwritten content recognition [3], The associate editor coordinating the review of this manuscript and approving it for publication was Xiaokang Wang. circuit logic gate design [4]- [6], Nano robot design [7]- [9], DNA origami nanomaterial design [10], image encryption [11]- [13], and storage technology innovation [14]. With the development of biotechnology, it is conceivable that DNA computing will be used to solve more realistic problems. For instance, in the latest research of the synthetic molecular science, 4 new bases ('Z' and 'P' are complementary; 'S' and 'B' are complementary) have been artificially synthesized [15]. The number of DNA bases has been expanded from 4 to 8, which enables a richer coding format of DNA, and thus effectively reducing the similarity between DNA sequences.
Although researchers from different fields such as computer science, biology, and mathematics research on DNA computing from various aspects, there are still lots of theoretical challenges to truly realize DNA computers. In particular, DNA sequence coding is the major problem, since the quantity and quality of the generated DNA sequence have a strong influence on the reliability and accuracy of DNA computing [16]. Generally, low similarity and low melting temperature are required to reduce the probability of mismatch hybridization and to increase the reaction efficiency [17]. This can be explained by the fact that the DNA sequences involved in the reaction need to be simultaneously melted to ensure that each sequence reacts with each other. For instance, we consider 2 DNA strands X 1 and X 2 , and they hybridize with each other. We suppose that the melting temperature (Tm) of X 1 is 59 • C and the Tm of X 2 is 67 • C. Therefore, when the environment temperature reaches 59 • C, X 1 will melt, but X 2 cannot. Due to the large difference between the Tm of DNA strands X 1 and X 2 , the efficiency of the reaction is greatly reduced. Moreover, multiple constraints on the biological properties should be satisfied for the DNA sequences to decrease the error of DNA hybridization reaction. Generally, these constraints are defined based on molecular thermodynamic, secondary structure, Hamming distance, etc. Therefore, the problem of DNA sequence coding can be transformed into a multi-objective optimization problem with low reaction temperature and low mismatch hybridization as the main objective. In the previous work, researchers generally considered six constraints, and they are melting temperature (Tm), GC base content (GC), Continuity (Con), Hairpin structure (Hair), Similarity (Sim) and H-measure (Hm). To achieve the objective of low mismatch, false negative and false positive hybridization should be prevented in DNA sequences generation. Moreover, in order to improve the efficiency of reaction, all the DNA sequences should be melted at a similar lower temperature, and it is also one of the main objectives.
The problem of DNA sequences coding is a typical multi-objective and multi-constraint problem, which is known to be NP-Hard [18]. Among multiple optimization goals, to generate the DNA sequences with lower Tm is the main goal. However, due to multiple constraints and biochemical characteristics, it is not trivial to achieve this goal [19], [20]. In our previous work, we have effectively reduced the Tm of the generated DNA sequence by bacterial foraging algorithm (BFA) [21]. Different from the other existing methods, the half-replication characteristic of the BFA is used to eliminate inferior DNA sequences. With this characteristic, we can maintain the intrinsic similarity between different sequences, and thus the Tm variance of the multiple sequences can be reduced. A large number of experiments have shown that the coding method based on BFA can generate DNA strands with lower Tm. In addition, we also give an evaluation strategy to better quantify the quality of DNA sequences.
As an extension of this work, in this paper, we make the following further contributions: • Active regulation mechanism and competition exclusion mechanism are considered in BFA to improve the generating efficiency and the quality of DNA sequences; • A special relationship between Sim and Hm that is approximately equal to a fixed value is found and analyzed; • A novel constraint on special DNA sequence pieces that can be recognized by endonucleases is proposed.
Extensive experiment demonstrates that the proposed method outperforms the existing methods in terms of Tm, Sim and Hm, especially Tm.
The remainder of the paper is structured as follows. In Section II, we discuss the related work. In Section III, we present the objectives and constraints of the problem. The detailed description of BFA-A based DNA computing sequence coding algorithm is given in Section IV. In Section V, we present the experimental results and finally conclude in Section VI.

II. RELATED WORKS
DNA sequence coding design is a complex problem, since it has different design goals for different experimental purposes, and the coding itself subject to the biological characteristics of DNA needs to meet various constraints. The initial theoretical basis for coding design is still not sufficient. Therefore, the early researchers often selected the expected sequences from a large number of generated codes. Hartemink et al. [22] used an exhaustive method to traverse all possible coding combinations. Frutos and Thiel [23], Arita and Kobayashi [24], and Liu et al. [25] etc. respectively used a specially designed template and mapping strategy to select dissimilar sequences from a large number of automatically generated sequences. This method of screening from a large number of codes is simple and straight forward, but the screening work often takes a long time due to the large number of potential DNA sequences. In the early research, there is a coding design based on graph theory. Feldkamp et al. [26] designed the DNA sequence based on the directed graph. The idea of this method is relatively complicated, and the obtained code cannot well meet the actual requirements. In the subsequent research, researchers often aimed at reducing melting temperature and similarity, and improving stability. At the same time, with the study of the biological characteristics of DNA, researchers began to consider the multiple constraints of coding design. The coding design problem is formed as a single-objective multi-constraint optimization problem, or multi-objective multi-constraint optimization problem. Usually the reduction of melting temperature and similarity and the optimization of stability and code acquisition efficiency are considered as the objective, and hairpin structure, repeatability, GC base content, Hamming distance, etc. are considered as constraints. After that, a large number of coding design methods based on heuristic algorithms are proposed by solving the formed multi-constraint optimization problem.
In view of Hamming distance constraints, there are some methods to reduce coding similarity. Marathe et al. [27] adopt the dynamic programming technology to design DNA sequences based on Hamming distance. Tanaka and Nakatsugawa [28] proposed a method based on simulated annealing to generate more stable and reliable sequences. Zhang et al. [29] proposed an algorithm based on the minimum free energy criterion to generate DNA sequences with almost no false hybridization. Zhang et al. [30] also proposed a method based on the improved genetic algorithm to design the stable DNA sequences. Compared with the original genetic algorithm for the DNA coding, this method can effectively improve the sequence stability. Luo and Luo [31] proposed a method based on weed algorithm to generate DNA sequences with high Hamming distance. Yang et al. [16] proposed a niche-based weed algorithm to improve the lack of the original weed algorithm for DNA coding to reduce the similarity and improve the stability.
There are also several DNA sequence design methods considering the constraints on melting temperature, free energy and stability. For example, Shin et al. [32]. proposed a multi-objective evolutionary algorithm to solve the DNA sequence design problem using the nucleic acid computing simulation toolkit/sequence generator (NACST) system. Xu and Zhang [33], [34] proposed a genetic algorithm optimized by particle swarm to generate DNA sequences with low Tm. Guo et al. [35] proposed a coding design method based on Bloch's quantum chaos algorithm, which dynamically adjusts the quantum rotation angle to obtain DNA codes with better thermodynamic characteristics. Liu and Wang [36] combined the bat algorithm and particle swarm algorithm to design the DNA code with good stability and thermodynamic properties.
To improve the efficiency of DNA sequence acquisition, Yin and Ye [37], [38] proposed a particle swarm optimization algorithm based on cultural evolution, which combines cultural evolutionary algorithm with particle swarm optimization. By taking full advantage of evolutionary information, their methods can improve the search ability and the efficiency of DNA code acquisition. Xiao and Jiang [39] proposed a dynamic membrane evolution algorithm, which uses particle swarm optimization to update and improve the global search ability by using adaptive differential evolution algorithm. Choong and Lee [40] proposed a DNA sequence generation method based on convolutional neural networks, which can obtain DNA codes very quickly. Wang and Shen [41] proposed an improved non-dominated sorting genetic algorithm II (INSGA-II) for DNA coding design, which introduces constraints into the non-dominated sorting process, which has good convergence and good population diversity. Liu and Wang [42] modified the simulated annealing algorithm and the group search algorithm, and combined them into a hybrid iterative search algorithm to improve the efficiency of the DNA sequence generation.
All of the above methods can optimize the quality of the generated DNA sequences to a certain degree, but they are not capable of producing DNA sequences with sufficiently low Tm and Sim. In order to address this problem, we propose a DNA sequence design method based on the bacterial foraging algorithm (BFA). It should be noted that the bacterial foraging algorithm has been extensively used in image matching and job scheduling, etc. [43], [44] to avoid exhaustive search and improve search efficiency. It has the advantages of convenient and fast convergence owing to its parallel search of the swarm. In addition, it does not require gradient information of the object in the optimization process and has strong versatility. Although the BFA method for DNA sequence design can effectively reduce the variance of the Tm, it cannot generate the DNA sequences with sufficiently low Sim and Hm, since the replication mechanism can reduce bacterial diversity.
In order to improve the problem of insufficient optimization of Sim and Hm in the BFA method, we propose the bacterial foraging algorithm based on activity of bacterial (BFA-A) by introducing activity regulation mechanism and competitive exclusion mechanism. These two mechanisms play a critical role in the process of chemotaxis to effectively reduce the melting temperature and similarity. BFA-A is composed by the four behaviors (i.e., initialization, chemotaxis, replication, and dispersion), which is an abstraction of the natural growth of bacterial colonies [45], [46].

III. OBJECTIVES AND CONSTRAINTS
In order to obtain DNA sequences with error-free hybridization at lower melting temperature, we consider six constraints based on the basic biological properties of the double helix structure, and they are Similarity (Sim), H-measure (Hm), Continuity (Con), Hairpin structure (Hair), melting temperature (Tm), and GC base pair content (GC). With these constraints, reaction temperature, the occurrence of mismatch hybridization and the complexity of the synthesis can be decreased and the stability of the DNA strand will be improved. Note that, considering that biological reactions often require multiple enzymes, we also introduce the seventh constraint, i.e., specific sequence constraints. That specific DNA fragments can be recognized by several endonucleases for specific tailoring. The cleavage site forms a blunt end or a sticky end, and some special portions can be inserted into these blunt or sticky end, such as fluorophores, then the end can be combined by the DNA ligase. Different from the constraints based on biological characteristics, the seventh constraint is a constraint condition oriented to actual experimental requirements, which makes coding design move from theoretical design to practical application.

A. OPTIMIZATION OBJECTIVE
In order to improve the reaction efficiency and reduce the mismatch probability, Similarity is considered as one of optimization targets. Low Tm means low average and low variance of Tm. Low Similarity means low Sim and low Hm. The DNA strands which participate in reaction can reform a stable double helix, and thus the experiments result can be detected or separated easily. Low Tm and low Sim (Hm) could improve the stability of reformed double helix. In this paper, we try to design DNA sequences with low Tm and low Sim (Hm).

B. CONSTRAINTS 1) MELTING TEMPERATURE
Melting temperature is defined as the temperature at which 50% of the base pairs in double-strand DNA are melted into a single strand. In the DNA computing experiment, all DNA strands should be placed in the same solution, and the temperature is slowly raised up to reach Tm. At the same time, all DNA strands are required to be simultaneously melted. At this temperature, the double helix structure starts to open to form a single strand, and then they recover to the double helix structure with the temperature dropping. Therefore, the lower Tm implies higher reaction efficiency. For this reason, low average temperature and especially low variance of temperature should be satisfied to ensure the reaction efficiency. In this paper, Tm is calculated based on the nearest-neighbor model [47].

2) SIMILARITY
Similarity is used to describe the degree of similarity between the base compositions of two DNA sequences. In order to ensure low mismatch hybridization, the isotropic sequence of two DNA strands should be as unique as possible, and the sequence should be not repeated after the sliding. The similarity is defined based on the sliding hamming distance [48], i.e., where n represents the total number of DNA strands, Ham(X i , shift(X k j )) is the Hamming distance between X i and X j , and m is the total number of bases in each DNA strand. For sliding sequence X j , k > 0 means that X j slides to the right with |k| step size, and k < 0 means that X j slides to the left with |k| step size. min j=1:n&&i =j −m<k<m indicates the minimum Hamming distance between X i and X j , when X j slides from the left end to the right end.

3) H-MEASURE
H-measure is used to describe the degree of similarity between the base compositions of DNA sequences X i and its complementary strand X c j . It is defined as follows [48]: where the meaning of n, m and k is same as (1), and the Hamming distance between X i and sliding complementary sequence X c j is defined as Ham(X i , shift(X ck j )). Here, sequence X c j is slidable. min j=1:n −m<k<m indicates the minimum Hamming distance between X i and X c j , where X c j slides from the left end to the right end.
H-measure and Similarity can influence the probability of a mismatch hybridization in the experiment. For instance, Fig. 1 shows the DNA sequences X 1 and X 2 with different Hamming distances at different relative positions, and the resulting incorrect hybridization. For X 1 and X 2 at the relative position in Fig. 1 (1), there is only one identical base. However, when X 2 slides to right until the relative position in Fig. 1 (2) is achieved, there are eight identical bases. Too much identical bases will cause DNA strand X 1 to make a partial hybridization with X C 2 (the complementary strand of X 2 ) as shown in Fig. 1 (3), and this kind of hybridization should be avoided. Therefore, the DNA sequences with too much identical bases should be avoided in the DNA sequence coding.

4) GC BASE PAIR CONTENT
GC content is a kind of molecular thermodynamic constraint. Because the content of GC base affects the stability and Tm of the DNA strand, it has to be considered as a constraint in the DNA sequence optimization. The GC content of the DNA sequence is defined as the relative percentage of bases 'G' and 'C', i.e., where n represents the total number of DNA strands, Ham(X i , shift(X k j )) is the Hamming distance between X i and X j , and m is the total number of bases in each DNA strand. For sliding sequence X j , k > 0 means that X j slides to the right with |k| step size, and k < 0 means that X j slides to the left with |k| step size. min j=1:n&&i =j −m<k<m indicates the minimum Hamming distance between X i and X j , when X j slides from the left end to the right end.

5) CONTINUITY
Continuity refers to the continuous appearance of a certain base in a DNA sequence. This constraint is based on the secondary structure. It is necessary to avoid too much continuous same bases, since this kind of DNA strands is unstable, and it is hard to synthesize [16].

6) HAIRPIN STRUCTURE
The Hairpin structure is a typical secondary structure caused by single-stranded DNA molecules folded by themselves. The bases in Hairpin loop and stem cannot combine with other bases, and thus the single DNA strand with Hairpin structure cannot hybridize with other single strand. For this VOLUME 9, 2021 reason, the Hairpin structure has to be avoided. Hairpin structure can be calculated as follows [32]: where X c 2 and X 2 are complementary strands, p is the stem length of the Hairpin, r is the loop length of the Hairpin, R min is the minimum length forming the Hairpin loop, and P min is the minimum length forming the Hairpin stem. As shown in Fig. 2, it is an example of a Hairpin structure.

7) SPECIAL SEQUENCE
In the DNA biochemical reaction experiments, multiple biological enzymes are usually used to process DNA strands. For instance, ligase can be linked to DNA molecules, and endonucleases can cleave DNA molecules. In this paper, four common endonucleases are selected, and they are AluI, HaeIII, EcoRI, and HindIII [50]- [52]. As showed in Fig. 3, the sequence ''AGCT'', ''GGCC'', ''GAATTC'' and ''AAGCTT'' can be recognized by AluI, HaeIII, EcoRI, and HindIII, respectively. The cutting method is shown with the red line in Fig. 3. Among them, the incision of AluI and HaeIII is called blunt end, and the incision of EcoRI and HindIII is called sticky end. In general, the sticky end is suitable for the attachment point of the DNA strand displacement reaction, and the blunt end is suitable for the work at the end of the reaction to form a stable structure without exposed bases. Therefore, the cut of DNA strands with endonucleases is also a way to ensure reaction efficiency.

IV. ALGORITHM DESIGN
The DNA sequence coding problem is a multi-constraint multi-objective optimization problem, which is an NP-hard problem. To this end, in this paper, we consider a heuristic algorithm to solve it. In consideration of the process of chemotaxis and replication characteristics, BFA-A is designed for the DNA sequence coding problem.
BFA-A is divided into four stages, i.e., initialization ( Fig. 4 (1)), chemotaxis ( Fig. 4 (2)), replication ( Fig. 4 (3)), and dispersion ( Fig. 4 (4)). Each DNA sequence is defined as a bacterium in the algorithm, and a base change in a DNA sequence is defined as a chemotaxis. For convenience of expression, in the following algorithm description, the DNA sequence is described as bacteria. In Fig. 4 (1), we initiate a population of bacteria, and represent them in format ''number -Generation''. The ''number'' is based on the total number of populations. ''Generation'' is the generation of this bacteria, and it represents whether it is younger or not. In Fig. 4 (2), chemotaxis is a foraging behavior that implements a type of optimization where bacteria try to move to the nutrient substances, to avoid poisonous substances. In Fig. 4 (3), the bacteria in the good environment (i.e., the blue zone shown in Fig. 4 (2)) will be replicated, since this part of bacterial has strong survivability. Once the replication is finished, the ''Generation'' of parental bacteria increases 1, the ''Generation'' of new offspring bacteria is set to 1, and the children will inherit the parent's ''number''. In Fig. 4 (4), a few bacterial in a specific area are eliminated with a small probability because of the dispersion event.   Fig. 4 (a) and (b) are the activity regulation mechanism, (c) and (d) are competitive exclusion mechanism. We consider that the bacteria will gradually get old with growing, and the aging bacteria mean that their activity will decrease. Therefore, the movement step size of the bacteria in different age periods should be dynamically adjusted. The younger ones have a bigger movement step size (see Fig. 4 (a) and (b)), which can facilitate the algorithm to expand the global search range and avoid early maturity [53]- [55]. While the older bacteria movement step size is smaller, which is beneficial to the convergence of the algorithm. The difference in chemotactic ability of bacteria of different ages is also reflected in the total number of chemotaxis in each chemotaxis cycle, so the total number of chemotaxis times of the older one also decreases with aging. Considering that bacteria concentrate on the eutrophic zone with chemotactic, excessive bacteria quickly consume nutrients and make the eutrophic zone to low-nutrient zone (Fig. 4 (c)). This can cause bacteria unable to survive in the natural area. In order to avoid this phenomenon, we introduce a competitive exclusion mechanism. When bacteria are too concentrated in the same area, the bacteria are forced to separate (Fig. 4 (d)).
The characteristics of BFA-A can be summarized as follows: • The bacterial activity affects the chemotactic ability of bacteria, and dynamic chemotaxis increases global search ability and improves convergence; • The bacteria which live in a favorable position will be replicated, so that the population can always move towards a favorable living environment; • The bacteria are not overly concentrated in a certain area, due to competitive exclusion mechanism. Therefore, the rapid consumption of nutrients in the area can be avoided; • A few of bacteria will be eliminated during the dispersion events to avoid trapping in local optimum.

A. BFA-A ALGORITHM OVERVIEW
Before giving a detailed description of the algorithm, we first give some relevant definitions and data structures used in the algorithm. In order to obtain a set of sequences with low melting temperature, low similarity, and high stability, we first define some key initial information. This initial information determines the execution of the algorithm in the stages of chemotaxis, replication, and dispersion, and it affects the effects of competitive exclusion mechanisms and activity regulation mechanisms. As shown in Algorithm 1, it is the pseudocode of BFA-A. In the first step, the DNA sequence set (DNA set ) and the other main parameters (N re , N c , DNA num , P ed , E de , Lim) are initialized. Here, N re represents the total replication steps, N c means the total chemotaxis steps in one loop of replication, DNA num is the scale of DNA set , P ed is the probability of dispersion event, E de is the number of DNA sequences that have been eliminated in a dispersion event, and Lim is used to describe the similarity threshold between sequences. Note that N c will be changed by the age of bacterial due to the activity regulation mechanism. N c = ActivityChemotaxisNum(X i .Gen); 7: S 0 = Analysis (X i ); 8: The algorithm has three layers of loops, the outermost layer is the number of replication N re , the middle layer is the traversal of all sequences in the set of DNA sequences DNA set (the number of sequences in the DNA set is DNA num ), and the innermost layer is the number of chemotaxis N c . Here, each DNA sequence simulates a bacterium, and the base conversion is bacteria foraging. This algorithm enables all DNA sequences to achieve base conversion by the middle layer loop and the innermost layer loop.
In the innermost layer, a sequence X i is extracted from DNA set as the first step. An attribute of X i is generation (abbreviated as X i .Gen), which represents the age of bacteria. Young bacteria have high activity, which means high foraging ability. Bacteria map to DNA sequences, and thus younger DNA sequences have a stronger ability to change bases.
According to the X i .Gen information, the function ActivityBaseNum() is used to get the number of changed base E c of sequence X i during a loop of chemotaxis. Then, the number of chemotaxis N c of sequence X i is obtained by function ActivityChemotaxisNum(). E c and N c depend on X i .Gen, and if X i .Gen is bigger, E c and N c are smaller. After that, function Analysis() (this function is based on the equations in section IV.B.1) is used to calculate the score of X i , and the score is denoted as S 0 . The score of X i measures VOLUME 9, 2021 the quality of the sequence. Then, sequence X i changes its bases by function Chemotaxis(), the number of changed base is determined by E c , and the new sequence is denoted as X i . Analogously, we get the score S 1 of X i by Analysis(). After we get the score, we can determine the difference before and after base changing. If S 1 is bigger than S 0 , it means that the bases are indeed changing in the direction to which we want them to move. Therefore, the new sequence X i will replace the original sequence X i . On the contrary, it means that after changing the base, it deviates from the desired target, and the original X i remains unchanged. Then, we need to determine the similarity between sequences, SIMILARITY is another attribute of X i . If X i .SIMILARITY is bigger than Lim, it indicates that the density of bacteria is too high. We use the function Competition() to realize competitive exclusion process. That is to keep the various base components in X i unchanged, and rearrange the base positions. After that, the adjusted X i is restored in the new set DNA newset . At the same time, the chemotactic process of the DNA sequence ends.
Next, the dispersion process for the random E de sequences with the probability of occurrence P de for DNA newset is implemented by Dispersion(). Then, we sort the sequences in DNA newset from high to low, and take out the top half of the sequence to form a high-quality sequence set DNA betterhalfset . After that, we replicate this set to obtain DNA newbetterhalfset . Note that, these two sets are not exactly the same, and their attribute Gen is different. DNA newbetterhalfset .Gen is set to 1, while DNA betterhalfset .Gen increases by one for each replication. At last, we can obtain the final DNA set by the union of these two sets.

B. DNA SEQUENCE DESIGN BASED ON BFA-A
In this subsection, five main strategies of the BFA-A method for the DNA sequence design are introduced, namely DNA sequence score strategy, initialization, chemotaxis, replication and dispersion.

1) DNA SEQUENCE SCORE STRATEGY
In order to make a judgment about the direction of chemotaxis and the choice of replication, the quality of each DNA sequence should be quantified for BFA-A. To this end, we propose a score strategy to quantify the quality of a DNA sequence. In order to make a fair comparison with our previous work [21] in the experiments, the same score strategy is used as the one used in BFA [21]. In this paper, Tm is considered as the essential constraint, since the low Tm is beneficial to the operation of the experiment. In addition, Sim and Hm are also the essential constraints, since Hamming distance directly affects the probability of the mismatch hybridization. In the evaluation function, we mainly consider six factors, and they are Con, Hair, Hm, Sim, Tm and GC. Based on the results of the simulation experiments, different weights are assigned to these factors, and Tm, Sim, Hm, Con account for 40 points, 20 points, 20 points and 20 points, respectively. The total score of each DNA sequence is denoted by S, and S Tm , S Sim , S Hm , S Con are the score of Tm, the score of Sim, the score of Hm, the score of Con, respectively. They are calculated as follows: Sim ≥ 13 (9) where Sim = Sim/(NUM DNA − 1), NUM DNA is the total number of DNA sequences that generated by initialization.
According to the above evaluation criterion, for a DNA sequence with lower Tm, Sim, Hm, and Con, its score is higher; for a DNA sequence with GC = 0.5 or Hair = 0, its score is zero. Therefore, the idea of scoring strategy is keeping GC and Hair stable, while reducing Tm, Sim, Hm, and Con as low as possible.

2) INITIALIZATION
The initialization process mainly includes the initialization of N re , N c and P ed . N re is the number of the replication, N c is the maximum number of chemotaxis and P ed is the probability of dispersion. The initial DNA sequences are generated randomly. In addition, the population scale of DNA sequences is also initiated in a random way.

3) CHEMOTAXIS
In the natural environment where the bacteria live, some are eutrophic area and some are not, even poisonous area. Therefore, the bacteria will move from one place to another to find food, and this process is often referred to as chemotaxis [56]. Fig. 5 (1) indicates the initial state of bacteria. In the next moment, bacteria are moving to look for food for survive. Here, the pink square is eutrophic zone and the green square is poisonous zone. In Fig. 5 (2), bacteria start to move for a while, and bacteria gradually move closer to the eutrophic zone and away from the poisonous zone. For the score of a DNA sequence, most DNA sequences are transformed from a low score to a high score through adjusting the bases of each DNA sequence.
In order to improve the original BFA method, the BFA-A method makes certain adjustments. They are the bacterial activity regulation mechanism and competitive exclusion mechanism. For the BFA method, the characteristics of chemotaxis can be summarized as following: • The length of chemotaxis step is 1, and it means that only one base will change for each chemotaxis; • The total number of chemotaxis N c is fixed; • Due to chemotaxis and replication, bacteria will gradually concentrate in a certain area, and the diversity of DNA sequences will decrease, which leads to high Sim and Hm. In the above process, the chemotaxis step size is fixed, and it limits the diversity of DNA sequences and the global search ability of the algorithm. Note that insufficient DNA sequence diversity will lead to excessive Sim and Hm. To this end, we try to improve the chemotaxis operation [57]- [59] in BFA-A, and its characteristics are summarized as following: • Since the generation of each DNA sequence is known, the length of chemotaxis step (E c ) can be adaptively adjusted based on the generation of each DNA sequence according to the related rule given in Table 1. Owing to this adjustment rule, the algorithm has strong global search ability in the early stage, and has better convergence in the later stage. Note that this rule is based on the fact that the maximum chemotaxis time (N c ) is 20.
• N c is dynamically adjusted based on the generation of each DNA sequence, and the corresponding rule is defined in Table 1. With this mechanism, the convergence can be ensured by reducing the chemotactic ability of bacteria getting old.
• If there are lots of bacteria concentrate to a certain zone, we will force them to separate away. The corresponding rule is that when the Sim or Hm of two DNA sequences is greater than a threshold (here, threshold is set to 10), the bases in the DNA sequence will be rearranged, while ensuring that the content of ''A T G C'' is unchanged. By this way, Sim and Hm can be effectively reduced, and thus, the probability of mismatch is reduced. From Fig. 5, we can find that there are two small pink eutrophic zones. However, they are not as good as the big zone. Small nutrient areas are not sufficiently attractive to bacteria, so the bacteria gather at large nutrient areas. From Fig. 5 (3), we can find that lots of bacteria gather at the big upper eutrophic zone, and the nutrition will be consumed rapidly. As shown in Fig. 5 (4), we introduce the competitive exclusion mechanism to avoid nutrition consumption quickly, and we can see that some individual will be separated to other zones. A large number of bacteria gather around the large nutrient areas, and thus it is more competitive for the large nutrient areas. In addition, some of the less competitive bacteria are squeezed into small nutrient areas. With this mechanism, the load balance can be achieved by the eutrophic zone, and bacteria can survive longer in eutrophic environment. Moreover, the diversity of bacteria that multiply in multiple areas is higher than that of bacteria that are propagated in a specific area. Therefore, the diversity of DNA sequence is also correspondingly improved with more eutrophic zones, and thus Sim and Hm can be reduced with this mechanism.

4) REPLICATION
After chemotaxis is finished, only DNA sequences with high scores will be replicated, while the ones with low scores will be not. Note that the total population is unchanged for each replication operation. Replication has the characteristics of retaining a better half of the DNA sequences and replacing the poorer half of the DNA sequences. DNA sequences with high scores can be retained to the next generation. This evolutionary process of the bacterial foraging algorithm advantageously guarantees that the base components of the DNA sequences with high scores are retained, and low scores sequences are eliminated. So it is very useful for the optimization of DNA coding design.

5) DISPERSION
Dispersion event will occur with a specific probability after the completion of each replication. The random elimination can facilitate avoiding local optimizations; however, the optimal solution may also be eliminated. In order to avoid dispelling the optimal solution, the dispersion event is only carried out for the DNA sequences with middle scores. VOLUME 9, 2021

V. SIMULATION RESULT
In order to show that the algorithm in this paper has the ability to optimize DNA coding design, we designed 2 simulation experiments for the BFA-A method to evaluate the melting temperature and similarity of the DNA sequence. In the first part of this section, we analyze the performance of BFA-A method under 6 traditional constraints (Sim, Hm, Tm, Con, GC, Hair), and we compare BFA-A method with 5 other methods of DNA sequence coding. In the second part, we analyze the performance of BFA-A under the constraint of the special sequence which can be recognized by four common kinds of endonuclease.

A. THE PERFORMANCE OF BFA-A
We evaluate the performance of BFA-A against NCIWO [16], NACST [32], IWO [31], IGA [30] and BFA [21]. The setting of main parameters used in our experiment is given in Table 2. N re represents the replication steps of each bacteria and it is set to 20. N c is the maximum chemotaxis steps of each bacteria and it is set to 20 as the initial value. Note that N c changes with the ''age'' of the bacteria. P ed is the probability of dispersion and it is set to 0.1. As shown in Table 3, there are 7 sequences with 20 bases for BFA-A, BFA, NCIWO, NACST, IWO and IGA in the experiment. Based on the definition of each constraint given in Section III, we consider 6 constraints (Sim, Hm, Tm, Con, GC, Hair) of all DNA sequences generated by different algorithms. Sim (Hm) and Tm are the main optimization targets in this paper, so we first analyze their values. After that, other values such as Con, GC, and Hair that affects structural stability and secondary structural properties are analyzed. The comparison of Sim and Hm is illustrated in Fig. 6. In Fig. 6, two adjacent bars in the histogram represent Sim and Hm of the same sequence. Sim and Hm together reflect the similarity of DNA sequences. We can clearly see that when one of Sim and Hm is larger, the other is smaller, which seems to indicate that Sim and Hm are complementary. To verify this relationship, we calculate the sum of Sim and Hm for each DNA sequence generated by different algorithms, and the result is illustrated in Fig. 7. We can clearly find that the sum of Sim and Hm for each DNA sequence is approximately equal to a constant value. Now, we discuss this complementary relationship through a simple mathematical reasoning.
We assume that Sim + Hm approximately equals to a constant value (expressed as C), i.e., According to section III. B, we can get the specific mathematical expressions of Sim and Hm, and they are, According to the known formulas and existing conditions, it can be clearly seen that, when i = j, Sim + Hm ≈ C can be transformed into, + Ham(X i , shift(X ck j ))] ≈ C (15) In (15), it is obvious that C will change with n. Now, we assume n = 2, since 2 DNA strands are required at least, and the formula (15) transforms into, (when n = 2, the result is C') where k is the sliding distance of DNA strand Xj. We assume that k and j are fixed values, then the formula can be transformed into, [Ham(X i , shift(X j )) + Ham(X i , shift(X c j ))] ≈ C (17) For any DNA sequence in ideal condition, bases 'A', 'T', 'G', 'C' should account for 25%, respectively. For instance, if the 4 th base of DNA sequence X i is 'A', which is expressed as Xi [4] = A , then in DNA sequence Xj (or X c j ), the probability of Xj [4] = A (or X c j [4] = A ) should also be 25%. It means that for any DNA sequence set in ideal conditions, the sum of Sim and Hm for any two sequences in this set should be equal to 25% of the length of DNA sequence. Now, we assume that k is not a fixed value, and this means that Xj (or X c j ) can shift to left or right. According to the above analysis, the probability of X k j [4] = A (or X ck j [4] = A ) is still 25%. Sim+Hm is still equal to a constant value.
Then, we assume that j is not a fixed value and n > 2, and it means that DNA strand Xj (or X c j ) can be changed. The above inference for the sum of Sim and Hm for each DNA sequence is approximately equal to a constant value is still true obviously. For more DNA sequences, even if some sequences are not ideal, the overall result is still ideal when n is large enough. Therefore, we can get Sim + Hm ≈ C.
Note that, the actual value of C has a substantial relationship with n, and that inference is under the ideal condition (uniform distribution of four kinds of base). For DNA computing, we need to consider some constraints, Tm, Hair and so on, and these constraints will limit the base composition. Therefore, Sim or Hm will be larger than 25%. However, in Fig. 7, Sim and Hm are approximately equal to a constant value, and this means that even under these constraints, the inference is still correct.
In addition, from Fig. 7, we can find that the sum of Sim and Hm for BFA-A is smaller than the others. It demonstrates that the DNA sequences obtained by BFA-A have lower Sim and Hm, which means that these DNA sequences have lower mismatch probability. We also can find that BFA, NCIWO, NACST and IWO perform similarly on Sim and Hm, and IGA is inferior.
Considering that Tm affects the efficiency and difficulty of experiment, as shown in Fig. 8. The DNA sequences obtained by BFA-A have the smallest maximum, minimum and average values of Tm. IWO is slightly worse than BFA-A, but better than other methods. We also can find that NCIWO, NACST, BFA and IGA perform worse on these 3 values of Tm. In order to comprehensively analyze Tm of DNA sequences obtained by different algorithms, we report the variance of Tm in Fig. 9. We can find that BFA-A outperforms all existing algorithms in terms of the Tm variance, and it means that BFA-A can generate DNA sequences with relatively lower and closer melting temperature. From the base composition given in Table 3, we can find that the reason of the best performance of Tm is that different adjacent bases have different stability. It can be clearly seen that there are lots of ''T·C'', ''A·C'' and ''C·C'' neighbor combinations in the DNA sequences generated by BFA-A. According to [49], the thermodynamic stability of adjacent base pairs conforms to inequality (18), where the ''T·C'', ''A·C'' and ''C·C'' neighbor combination has a lower thermodynamic stability in (18), which means that the DNA strand has a lower Tm. This explains that BFA-A has a good performance in terms of Tm. The constraints GC, Hair and Con affect the stability of the DNA strand, the ability to fold itself, and the difficulty of synthesis, respectively. We compare these 6 methods in Table 4. For all methods except IWO, GC is set to 50%, and for IWO, GC is larger than 45% and less than 55%. BFA-A and BFA strictly control Con and Hair which are always 0. Note that, for NCIWO, NACST and IGA, their Con is worse than BFA-A and BFA, although they are as good as BFA-A and BFA in terms of Hair. For BFA, due to the replication progresses, bacteria will gradually gather around a certain zone. It means that the diversity of DNA sequences is reduced, and thus the Sim and Hm of DNA sequence will increase. Instead, in BFA-A, we fix out the excessive concentration problem by introducing the competitive exclusion mechanism. Therefore, Sim and Hm of DNA sequences obtained by BFA-A are greatly decreased compared with BFA. In addition, since the activity regulation mechanism, the dynamic chemotaxis step length makes the BFA-A method have greater global search ability and convergence ability. In the initial stage of the algorithm, the obtained DNA sequences are more diverse. In the later stages of the algorithm, better DNA sequences can be generated after the algorithm convergence.
The simulation results are summarized as follows: • BFA-A is slightly better than the other algorithms on Sim and Hm; • Compared with other algorithms, BFA-A has an obvious good performance in the Tm.
• In terms of Con, Hair and GC, BFA-A is as good as BFA. To further illustrate the experimental result performance of the BFA-A method, we continue to make several groups of the experiment with larger bacterial scale as showed in Table 5.
In Table 5, Gen is the average generation of bacteria (DNA strands) in this group, and Sim, Hm, Con, GC and Hair are the average values of different constraints index in this group. We can find that the average generation is below 5, and it means that the bacteria are young. Young bacteria have high life activity and strong chemotaxis ability, and they are more likely to find a nutritious environment. The experimental results show that the method can reduce the number of iterations and has guided significance for improving the efficiency of DNA sequence acquisition. From Table 5, we can see Con and Hair always equal to 0, and GC is 50%. In addition, we can find that the average and variance of Tm are low. From the results, we can confirm that DNA sequences generated by BFA-A have good performance in terms of Tm, Sim and Hm.

B. THE PERFORMANCE OF BFA-A UNDER ENDONUC-LEASE RECOGNITION SEQUENCE CONSTRAINT
Biochemical reactions require a variety of enzymes, and for instance, endonuclease and ligase are two commonly used enzymes. DNA endonuclease can specifically recognize a piece of sequence and cut it off to form a blunt end or a sticky end in the incision. DNA ligase will reconnect the incisions again. Different endonucleases can recognize different DNA sequences. Therefore, we should consider retaining some specific sequence pieces for the design of DNA sequences. In this section, we consider 4 kinds of common DNA endonuclease, AluI, HaeIII, EcoRI, and HindIII. As shown in Fig. 3, they can recognize special sequence ''AGCT'', ''GGCC'', ''GAATTC'' and ''AAGCTT'', respectively. Under this constraint, the experiments are performed with the parameters given in Table 2 to verify the effect of BFA-A on decreasing the melting temperature and the similarity. Table 6 shows the DNA sequence sets under 4 kinds of endonuclease constraints, which are generated by BFA-A. In Table 6, identifiable special fragments are bolded. In the previous section, we obtained the inference that Sim + Hm is approximately equal to a constant. The experiments in this section can further confirm the correctness of this inference. The result for Sim and Hm is given in Table 6, and the comparison result is shown in Fig. 10. When Sim is high, Hm will be low. This result is similar to Fig. 6.
To further illustrate the correctness of the inference, as shown in Fig. 11, we report the sum of Sim and Hm in each group. We can find that the sum of Sim and Hm is approximately equal to a constant. This phenomenon is similar to the result discussed in section V.A. However, the performance is not as good as the situation of absence of specific sequence restriction. Since our inference is based on the ideal situation, but specific sequence constraint is too strong to break the  ideal situation, it directly leads to the deterioration of the phenomenon. In addition, we can find that the DNA sequences with a long base length (such as EcoRI and HindIII) are worse for Sim and Hm than the sequence with a shorter base length (such as AluI and HaeIII).
Considering that Tm affects the efficiency and difficulty of experiment, we report the average and variance of Tm under a specific sequence constraint, as shown in Fig. 12. It can be seen that BFA-A performs well at Tm, even under specific sequence constraint.
To summarize this subsection, under the constraint of specific base sequences, BFA-A performs good for Tm. However, BFA-A performs poorly for Sim and Hm in absence of specific base sequences constraint. Considering that there is VOLUME 9, 2021  a certain correlation between Sim and Hm, we can consider only one constraint to reduce the complexity of the algorithm.

VI. CONCLUSION
In this paper, we proposed a DNA sequence coding method named BFA-A to generate DNA sequences with low mismatch probability and stable melting temperature, which solves the problem that DNA sequences are prone to mismatch and experimental temperature instability in DNA computing. In order to obtain DNA sequences with high quality, we introduced a variety of constraints and DNA sequence scoring strategies. By introducing bacterial activity regulation mechanism and competitive exclusion mechanism, the global search ability and convergence of the algorithm are significantly improved, and thus a better DNA sequences are obtained. A large number of simulation experiments show that the DNA sequences obtained by BFA-A have improved performance on Sim, Hm and Tm. As part of our future work, we will pay close attention to the following issues: • We will consider more constraints in DNA sequence design to adapt to more practical problems; • We will try more new methods to enrich the technical routes in this field; • We will explore the application scenarios of the BFA-A method to enable it to be more commonly used in a variety of problems.