Internal and online simplification in genetic programming: an experimental comparison

Genetic programming is an evolutionary algorithm that makes symbolic regression possible: the important task of obtaining the analytical form of a model from the data produced by that model. One of the known problems of genetic programming is expression bloating, which results in ineffectively long expressions. To prevent bloating, symbolic simplification of expressions is used. We introduce a new approach to simplification in genetic programming, making it a uniform part of the evolutionary process. To do that, we develop genetic programming on the basis of transformation rules, similarly to computer algebra systems. We compare our approach with an existing solution, and show its adequacy and effectiveness.


I. INTRODUCTION
Symbolic regression is an approach to data mining that accepts data generated by some model and produces an analytic form of this model. Probably the earliest and best-known successful application of symbolic regression is Johannes Kepler's astronomical laws, which mathematically describe observations made by Tycho Brahe. Symbolic regression is an important step in the scientific method, which prescribes explaining observed data through the construction of a mathematical model. By close examination of such a mathematical model, scientists understand its internal structure and suggest hypotheses about the underlying nature of the data.
We should stress the difference between symbolic regression and numerical regression methods, such as linear, segmented linear or polynomial regression. In the case of numerical regression the model is fixed, and only its coefficients are to be found. For example, by applying polynomial regression to the data, we explicitly assume that the model is a polynomial function. If the actual model is a trigonometric function, like the sine, the regression can be made arbitrarily accurate by choosing an appropriate degree of the polynomial. However, no matter how accurate it is in the sense of mean square error, the polynomial regression is still incorrect, because it will unavoidably miss the fact that the observed model is a trigonometric function. Symbolic regression allows finding the model itself, and therefore the sine function will be recognized as a sine.
Until recently, symbolic regression could be performed only manually, and no algorithm for symbolic regression was available. With the invention of the genetic programming technique by John Koza [Poli et al., 2008], it became possible to automate symbolic regression. Now automated symbolic regression is widely used in natural sciences [Schmidt and Lipson, 2009], robotics [Robertson and Dumont, 2002], economics [Koza, 1994], medicine [Zhang and Wong, 2008], etc.
The algorithm processes hypotheses about the model behind the data. These hypotheses are expressions, encoded as trees and stored in the pool. Initially, these expressions are random. Then, the algorithm alters expressions with the following procedures.
• Mutation. A randomly chosen expression is changed by the replacement of a node.
• Crossover. Two randomly chosen expressions exchange subtrees.
After all the mutations and crossovers are performed, the resulting set of expressions is subjected to selection, which evaluates how well each expression fits the experimental data. The least valuable expressions are then removed from the population. With time, expressions become better until a satisfactory solution is found.
A known problem of genetic programming is expression bloating, which means that expressions become ineffectively long. For example, the expression $(x + 1)^2 - (x - 1)^2 - 3x$ is bloated, because it is actually equal to $x$ and should be replaced by $x$ in the pool. One result of bloating is an unacceptable form of the algorithm's output. This can be resolved by simplifying the algorithm's result. However, bloating also hampers the algorithm's work by increasing the length of expressions and therefore the time required to compute them, and also by leading the algorithm into a dead end. This can be resolved with online simplification [Zhang et al., 2006], [Kinzett et al., 2008], when all expressions in the pool are simplified with some frequency. Other approaches exist ([Poli et al., 2008], [Mori et al., 2009]); however, online simplification is considered to be more effective.
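The bloated example above can be checked by direct expansion:

```latex
\begin{align*}
(x + 1)^2 - (x - 1)^2 - 3x
  &= (x^2 + 2x + 1) - (x^2 - 2x + 1) - 3x \\
  &= 4x - 3x \\
  &= x .
\end{align*}
```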
We argue that online simplification is too rough. Simplifying an expression inevitably leads to the elimination of potential growing points. For example, while approximating the function $(x + 1)y^2$, the intermediate solution $(1 + 1)y^{1+1}$ can be found. This solution will be simplified to $2y^2$, which requires at least two mutations to become a correct answer, e.g. $2y^2 \Rightarrow xy^2 \Rightarrow (x + 1)y^2$. The initial solution $(1 + 1)y^{1+1}$ requires only one mutation, $(1 + 1)y^{1+1} \Rightarrow (x + 1)y^{1+1}$. Hence, the simplification hampers the evolution in this case. On the other hand, the partial simplification $(1 + 1)y^{1+1} \Rightarrow (1 + 1)y^2$ does not produce such an effect for the function $(x + 1)y^2$, but does so for $2y^{x+1}$. Therefore, the question of where to apply the simplification depends on the problem specification, on the particular expression found, etc. In other words, simplification can alter the evolution of expressions in the same way mutation and crossover do.
In [Borcheninov and Okulovsky, 2011], we introduced an approach that integrates simplification into genetic programming as a uniform part. We call our approach internal simplification genetic programming (ISGP), as opposed to online simplification genetic programming (OSGP). The key aim of this paper is to measure the advantage of ISGP in comparison with OSGP.
Simplification is based on rules, which describe correct ways of transforming expressions. Since we use simplification inside the algorithm, we must base our algorithm on rules. In section 1, we show how to implement OSGP and ISGP as instances of a more general rule-based algorithm. In section 2, we describe experiments that compare internal and online simplification.

A. Expressions, trees and rules
An expression is represented as a tree of nodes. An example of such a tree, which encodes the function f(x) = |x|, is shown in Fig. 1. Three types of nodes are considered: constants, variables and operators. In Fig. 1, node x is a variable node, and node 0 is a constant node. The remaining nodes are operators: addition +, comparison >, and the ternary conditional operator ?, which selects one of two values depending on a Boolean condition. Each node has a return type, which is an arbitrary C# type. Different return types can be used in one expression. For example, in Fig. 1, all nodes have the double return type, except for the node >, which has the bool return type.
We define numerous rules to transform these expressions. Some of these rules are universal, and can be applied to a tree regardless of the data types or operations that are used in it. In the I-Re rule, the select clause specifies the nodes that will be selected as tuples (A, B) and then processed by the rule. The notation ?A(?B) specifies that A is an arbitrary descendant of the root (i.e., an arbitrary node in the tree), and B is an arbitrary descendant of A. Then, the selected tuples are filtered according to the where clause. In I-Re, we accept only the tuples (A, B) whose return types coincide. To the selected tuples, we can apply the mod clause. In the case of I-Re, it replaces A with B. The tree remains correct because of the filtering in the where clause. In the I-Cr rule, the select clause ?A,?B denotes that the rule accepts two trees, and selects an arbitrary node from each of them. Therefore, this rule is binary, while the I-Re rule is unary. Then we demand the equality of their return types, and finally replace A with B and return the root of A as the output. Using the produce clause means that we specify the output of the rule directly. This is necessary because binary rules accept two trees, and it is not clear which one of them should be the output.
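The behaviour of the two universal rules can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the experiments: the `Node` class, the helper functions `walk` and `replace`, and the string encoding of return types are all assumptions made for the example.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # e.g. "+", "x", "0" (hypothetical encoding)
    rtype: str                      # return type, e.g. "double" or "bool"
    children: List["Node"] = field(default_factory=list)


def walk(node):
    """Yield every node of the tree rooted at `node`, the root first."""
    yield node
    for child in node.children:
        yield from walk(child)


def replace(root, target, new):
    """Replace the node `target` (found by identity) with `new`; return the new root."""
    if root is target:
        return new
    root.children = [replace(c, target, new) for c in root.children]
    return root


def rule_i_re(tree, rng):
    """I-Re (unary): select ?A(?B), where A.Type = B.Type, mod A -> B."""
    tree = copy.deepcopy(tree)      # leave the input expression untouched
    pairs = [(a, b) for a in walk(tree) for b in walk(a)
             if b is not a and a.rtype == b.rtype]
    if not pairs:                   # the rule is not applicable
        return tree
    a, b = rng.choice(pairs)
    return replace(tree, a, b)


def rule_i_cr(tree1, tree2, rng):
    """I-Cr (binary): select ?A, ?B, where A.Type = B.Type, produce A -> B, return A's root."""
    tree1 = copy.deepcopy(tree1)
    pairs = [(a, b) for a in walk(tree1) for b in walk(tree2)
             if a.rtype == b.rtype]
    if not pairs:
        return tree1
    a, b = rng.choice(pairs)
    return replace(tree1, a, copy.deepcopy(b))
```

Because both rules only substitute a node with another node of the same return type, the resulting tree stays well typed, which mirrors the role of the where clause described above.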
Most of the rules, however, are not universal. With each data type T, the following rules are associated. The I-Co rule replaces a node with the return type T with a constant of the same type. Here v is a randomly selected value of the constant. The I-Va rule replaces a node with the return type T with a variable. The argument i is the index of the variable in the argument array of the expression. Instances of the I-Va rule have to be created for each variable of type T. We can also define tuning rules that adjust the constants. For Boolean and integer data types, such rules seem to be redundant, because they are just instances of I-Co. However, for the floating point data type, the rule I-Tu can be written. Here R is a random function R(x) that returns a random number from [x(1 − c), x(1 + c)]. The I-Tu rule allows changing the constant value gradually, near its initial value, and therefore differs from the I-Co rule, which does not take the previous value into account.
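The difference between I-Co and I-Tu can be illustrated with a short Python sketch. Only the interval [x(1 − c), x(1 + c)] comes from the text; the uniform distribution inside that interval, the sampling range of I-Co and the function names are assumptions made for the example.

```python
import random


def r(x, c, rng):
    """R(x): a random number from [x(1 - c), x(1 + c)] (uniform, by assumption)."""
    lo, hi = sorted((x * (1 - c), x * (1 + c)))   # sorted handles negative x
    return rng.uniform(lo, hi)


def i_co(rng, low=-10.0, high=10.0):
    """I-Co: a fresh constant that ignores the previous value.

    The sampling range [low, high] is a hypothetical choice.
    """
    return rng.uniform(low, high)


def i_tu(value, c, rng):
    """I-Tu: adjust an existing floating point constant near its current value."""
    return r(value, c, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    value = 4.0
    for _ in range(5):
        value = i_tu(value, 0.1, rng)   # each step moves by at most 10%
        print(round(value, 3))
```

One consequence of this definition of R is that a constant equal to zero cannot be changed by I-Tu, since the interval collapses to a single point; such a constant can only be changed by I-Co.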
Some rules are even more specific, and are associated not with data types, but with an operation domain. A domain is a set of operations that are commonly used together and are bound by some mathematical laws. Examples are the arithmetic domain (addition, multiplication, etc.), the trigonometric domain (sine, cosine, etc.) and the logical domain (conjunction, negation, etc.).
For each operation, we need an introduction rule. Two approaches to introducing an operation are possible. The G-In rule selects a node with the floating point return type, and replaces it with a new multiplication operation. The G-In rule differs from all the rules above, because it does not change the function encoded by the expression. It only inflates the expression and adds a potential growing point to it. Of course, we could combine the G-In rule with I-Co, thereby obtaining G-In*. However, this is not convenient. Suppose our task is to transform x into 2x. With the modified rule G-In*, we need double luck to do that: we need to guess correctly both the operation and the constant. A wrong choice of the constant may lead to a significant decrease in the correctness of the expression, and therefore the expression will be removed without a chance to adjust the constant. The original G-In rule does not affect correctness, and therefore the modified expression can remain in the pool for a long time, so different mutations by the I-Co rule can occur in the future, and the right constant has more chances to be chosen.
For each operation, we also define simplification rules, for example transforming a multiplication of two constants into a constant equal to their product, or transforming the multiplication of any node and zero into zero. We call such simplifying rules S-rules. They are known from computer algebra systems, so we will not study them deeply. Some rules are developed not for a single operation, but for several operations in the domain. An example is the distributivity of addition and multiplication, which is a G-rule for the transformation a · (b + c) → ab + ac and an S-rule for the reverse transformation.
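The two S-rules mentioned above (folding a product of two constants, and replacing a product with zero by zero) can be sketched in Python as follows. The tuple encoding of expressions, with numbers as constants and strings as variables, is an assumption made for the example and is much simpler than the typed trees described earlier.

```python
def is_const(e):
    """Constants are plain numbers; variables are strings; operators are tuples."""
    return isinstance(e, (int, float))


def simplify_once(e):
    """Apply the two multiplication S-rules in a single bottom-up pass."""
    if not isinstance(e, tuple):
        return e                      # a constant or a variable
    op, a, b = e[0], simplify_once(e[1]), simplify_once(e[2])
    if op == "*":
        if is_const(a) and is_const(b):
            return a * b              # const * const -> const
        if (is_const(a) and a == 0) or (is_const(b) and b == 0):
            return 0                  # anything * 0 -> 0
    return (op, a, b)


if __name__ == "__main__":
    print(simplify_once(("+", "x", ("*", 2, 3))))
    print(simplify_once(("*", ("*", 2, 0), "y")))
```

Since subexpressions are simplified before their parent, a zero produced deep in the tree propagates upwards through enclosing multiplications in the same pass.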
Aside from simplification rules, we can also define crossover rules for a domain, with a very natural meaning. The absence of question marks before A and B means that they are not descendants of the roots, but the roots themselves. The crossover I-CA is reasonable: if two expressions fit the task, their half-sum may fit even better.

B. Implementation of genetic programming algorithms
To define a concrete algorithm in the family of genetic programming algorithms, we need to specify the operations mentioned in the Introduction: mutation, crossover and evaluation. We define the mutation and crossover operations on the basis of a collection of rules. The algorithm has two sets of rules: the set of unary rules for mutation, and the set of binary rules for crossover. In order to perform mutation, the algorithm randomly selects expressions for mutation. Then, for each expression, it randomly selects a rule and applies it to obtain a mutated expression. Correspondingly, in order to cross two randomly selected expressions, the algorithm chooses a binary rule from the collection and applies it.
From the first observations it became clear that different rules must have different probabilities of being applied. Each rule has multiple tags that describe the place of the rule in our classification. We assign a weight to each tag, and calculate the weight of a rule as the product of the weights of its tags. The greater the weight of a rule, the higher the probability of its application.
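The weighting scheme can be sketched in Python as follows. The tag names other than Inductive and Simplification, the numeric weights, the rule name S-MulZero and the assumption that the probability of a rule is proportional to its weight are all hypothetical and serve only to illustrate the computation.

```python
import math
import random

# Hypothetical tag weights; only the tags Inductive and Simplification
# are named in the text.
TAG_WEIGHTS = {"Inductive": 1.0, "Simplification": 0.5, "Arithmetic": 2.0}

# Hypothetical assignment of tags to rules.
RULE_TAGS = {
    "I-Co": ["Inductive"],
    "G-In": ["Inductive", "Arithmetic"],
    "S-MulZero": ["Simplification", "Arithmetic"],
}


def rule_weight(tags, tag_weights):
    """The weight of a rule is the product of the weights of its tags."""
    return math.prod(tag_weights[t] for t in tags)


def pick_rule(rule_tags, tag_weights, rng):
    """Choose a rule with probability proportional to its weight (an assumption)."""
    names = list(rule_tags)
    weights = [rule_weight(rule_tags[n], tag_weights) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print({n: rule_weight(t, TAG_WEIGHTS) for n, t in RULE_TAGS.items()})
    print(pick_rule(RULE_TAGS, TAG_WEIGHTS, rng))
```

With a multiplicative scheme, setting a single tag weight to zero disables every rule that carries this tag, which gives a simple way to switch whole groups of rules on and off.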
The most important tags are the Inductive and Simplification tags. The Inductive tag marks all the rules that enlarge expressions (G-rules from section 1.1) or change the function that the expression encodes (I-rules). The Simplification tag marks the rules that make the expression shorter (S-rules). The ratio κ of the weights of the Inductive and Simplification tags is the first important parameter of our algorithm.
The evaluation of an expression is performed by calculating several metrics and obtaining their weighted total. The fitness metric $\mu_f$ describes how well the found expression $g$ fits the given data $(x_{1,j}, \ldots, x_{n,j}, y_j)$, and is calculated as a reciprocal value. Taking the reciprocal is important, because it allows bounding the value of ρ, and provides correspondence between a higher value of ρ and a better expression. The length metric $\mu_l$ is the reciprocal of the count of operations in $g$. The valuation of an expression is determined as the weighted total $e(g) = w_f \mu_f(g) + w_l \mu_l(g)$. The ratio between the weights of the length metric and the fitness metric, $\lambda = w_l / w_f$, is the second important parameter of our algorithm.
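The weighted total e(g) can be sketched in Python as follows. Two parts of the sketch are assumptions and are not taken from the text: the fitness metric is taken as 1 / (1 + RMSE), which is one possible bounded reciprocal form, and an expression without operations is given the length metric 1 to avoid division by zero.

```python
import math


def fitness_metric(g, data):
    """mu_f, ASSUMED to be 1 / (1 + RMSE); the text only says it is a reciprocal.

    `data` is a list of rows (x_1, ..., x_n, y).
    """
    squared = [(g(*xs) - y) ** 2 for *xs, y in data]
    rmse = math.sqrt(sum(squared) / len(squared))
    return 1.0 / (1.0 + rmse)


def length_metric(n_ops):
    """mu_l: the reciprocal of the count of operations.

    The guard for n_ops == 0 is an assumption.
    """
    return 1.0 / max(n_ops, 1)


def evaluate(g, n_ops, data, w_f=1.0, lam=0.1):
    """e(g) = w_f * mu_f(g) + w_l * mu_l(g), with w_l = lam * w_f."""
    w_l = lam * w_f
    return w_f * fitness_metric(g, data) + w_l * length_metric(n_ops)


if __name__ == "__main__":
    data = [(x, 2.0 * x) for x in range(10)]       # rows (x, y) for y = 2x
    print(evaluate(lambda x: 2.0 * x, 1, data))    # exact and short
    print(evaluate(lambda x: 2.0 * x, 5, data))    # exact but bloated
    print(evaluate(lambda x: x, 0, data))          # short but wrong
```

With λ = 0 the length term vanishes and bloated and compact forms of the same function receive the same valuation; with a positive λ the compact form is preferred.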
To perform online simplification, we modify the described algorithm. First, only I- and G-rules are allowed to be used in the algorithm. Second, the weight of the length metric is set to zero, because the algorithm does not have the means to decrease the length of an expression. Finally, after every ξ iterations, we apply a simplification algorithm to each expression in the pool. Namely, we apply S-rules to an expression while it is possible, and return the resulting expression to the pool. The online simplification algorithm has only one parameter, ξ.
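The scheduling of online simplification can be sketched in Python as follows. The sketch is illustrative only: the function names, the tuple encoding of expressions, the single S-rule x · 1 → x and the toy evolution step that multiplies every expression by 1 are assumptions made to keep the example self-contained.

```python
def simplify_fully(expr, s_rules):
    """Apply S-rules while at least one of them still changes the expression."""
    changed = True
    while changed:
        changed = False
        for rule in s_rules:
            new = rule(expr)
            if new != expr:
                expr, changed = new, True
    return expr


def run_osgp(pool, evolve_step, s_rules, xi, iterations):
    """Run `iterations` steps and simplify the whole pool after every `xi` steps."""
    for it in range(1, iterations + 1):
        pool = evolve_step(pool)          # mutation, crossover and selection
        if it % xi == 0:
            pool = [simplify_fully(e, s_rules) for e in pool]
    return pool


def mul_one(e):
    """A toy S-rule on tuple expressions: x * 1 -> x and 1 * x -> x."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], mul_one(e[1]), mul_one(e[2])
    if op == "*" and b == 1:
        return a
    if op == "*" and a == 1:
        return b
    return (op, a, b)


def inflate(pool):
    """A toy evolution step: every expression is multiplied by 1."""
    return [("*", e, 1) for e in pool]


if __name__ == "__main__":
    print(run_osgp(["x"], inflate, [mul_one], xi=3, iterations=6))
    print(run_osgp(["x"], inflate, [mul_one], xi=3, iterations=7))
```

In the toy run, all growing points added between two simplifications are removed at once when the iteration counter reaches a multiple of ξ, which is the effect discussed in the Introduction.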

III. EXPERIMENTAL RESULTS
We conducted the following experiments to compare online and internal simplification in genetic programming. First, we prepared test sets to run the algorithms on. Then, we found the optimal parameters of both algorithms to obtain their best performance. Finally, we compared the performance of both algorithms.
In order to achieve a reasonable balance between the representativeness of the experiments and the computation time, we followed the guidelines below. We limited the domain of expressions to algebraic expressions that contain addition, subtraction, multiplication, division and power operations and integer constants. The reason is to limit the number of parameters of the algorithms. Two parameters are unavoidable: the length/fitness metrics ratio λ and the inductive/calculation tags ratio κ. Introducing floating point constants requires us to use the tuning rule (I-Tu). Our observations showed that the intensity of this rule should be much greater than that of the others in order to find the appropriate values of constants. This adds one more parameter. Correspondingly, the introduction of trigonometric functions leads to various expressions like sin(sin(cos(...))), and therefore these operations need to have their own tag with a reduced weight. Therefore, widening the domain increases the number of parameters. Since we needed to obtain the optimal parameters in order to compare the approaches, we decided to limit the domain.
On the other hand, we placed high demands on the algorithm's outcome. The algorithm was provided with a very limited number of data points: 10 for a unary function and 100 for a binary one. The number of iterations was limited to 10000, which takes about 15 minutes to compute. We also required the algorithm to find the exact function used to generate the data, not a good approximation of it. The function may be presented as different expressions, however. This is a very strict requirement: sometimes the algorithm found a solution that was very close to the data (the root mean error was about 2-3%), and nevertheless we rejected such a solution and required the exact solution to be found. In summary, the algorithm had to find an exact function from a limited data set in a short time. We believe that the complexity of this task compensates for the narrowness of the domain.
To build the test set, we made a rundown over different expressions, tested them with our algorithm and thereby obtained knowledge about the "complexity" of these expressions in terms of the algorithm. The considered parameters of expressions were the number of arguments of the expression, the number of operations used in the expression, and the level of white noise applied to the data. First, we built a random tree with the desired count of operations and tested whether the expression truly depends on all its arguments. Then, we formed the test set as an array of data points computed from the expression f with added noise, where p is the white noise level and α is a uniform random number between 0 and 1. If f could not be calculated for some j, we dropped the expression and searched again. On each data set, we ran the algorithm several times and measured the average success rate. If the algorithm had accidentally found a form of the expression containing fewer operations than planned, the data set was also considered invalid and was excluded from the experimental results.
For each set of parameters, we ran 50 successful data sets, and each data set was processed by the algorithm 10 times. The obtained results are presented in Table III. The overall tendency is clear. The complexity is determined mostly by the count of operations, and then by the level of white noise. Additional variables seem to reduce the complexity, probably because of the widening of the data set from 10 to 100 samples. We can also conclude that the algorithm is functional, even though the initial parameters could be far from optimal.
We selected 9 expressions as the test set for the comparison of OSGP and ISGP. The selected expressions are listed in Table III. We did not select expressions with a 0% success rate, because in this case the difference between hard and impossible is not clear. For the same reason, we omitted expressions with a 100% success rate.
We ran the ISGP algorithm with different values of the length/fitness metrics ratio λ and the calculation/induction tags ratio κ, and obtained the results presented in Table III. We see that the algorithm is stable on the whole, and its success rate varies in the range 60-70%. It is unlikely that any local maxima exist outside the considered range of parameters. The parameters λ and κ by definition cannot be negative. When λ = 0 or κ = 0, the simplification is simply not performed, and expressions bloat rapidly, blocking the algorithm. When λ > 1 or κ > 1, the simplification is too strong: by our observations, no expressions of length more than 3 can be produced. Therefore we believe that the best success rate of our algorithm is about 70% on our test set.
For OSGP, we need to determine the count of iterations between simplifications, ξ. The results of OSGP for different ξ are presented in Table IV. Again, it is unlikely that the optimal value of ξ is greater than 160, because such rare simplification is hardly noticeable. On the other hand, when the simplification is performed too often (ξ < 5), long expressions almost never appear in the pool.
In Table V, we present the success rates of the best variants of both algorithms on the test set. We can conclude that the algorithms are very close in terms of performance. It is also obvious that an accurate choice of parameters is important, and improves effectiveness significantly, at least for some functions.

IV. CONCLUSION
The research presented in this article shows internal simplification genetic programming to be an operational technique that prevents the bloating of expressions and provides effective symbolic regression. The only way to implement ISGP is to base genetic programming on rules of expression transformation, as described in section 1.
The performance comparison shows that ISGP is not worse than the existing online simplification approach. ISGP also opens a road for further research in the following areas. First, we plan to explore a more precise division of rules into groups, and to find the appropriate tags for such a division. This task can be considered even for the algebraic domain: for example, we could consider different tags for I- and G-rules. For larger domains, this task is even more important, because additional tags emerge anyway.
A more intriguing branch of research is adjusting the tag weights during the algorithm's work. Tentative observations show that such adjustment can sometimes drive the algorithm out of a local minimum by speeding up induction, or narrow the search around the best expression by increasing the weight of the fitness metric.
We also plan to develop a distributed version of our OSGP implementation, and to test it on real-world problems, mostly from the field of robotics.
Rule listing fragments:
I-Cr: produce A→B; ret A.Root
I-Tu: mod A→new Const(R(A.Value))
G-In*: A.Type = double mod A→new Mult(A,new Const(v))
I-CA: [...] and B.Type=double produce new Div(new Plus(A,B),2)

TABLE III. SUCCESS RATES, IN PERCENTS, OF ISGP WITH VARIOUS VALUES OF THE PARAMETERS κ AND λ. THE LOWER TABLE GIVES A CLOSER LOOK AT THE AREA WHERE LOCAL MAXIMA SEEM TO BE.

TABLE IV. SUCCESS RATES, IN PERCENTS, OF OSGP WITH VARIOUS VALUES OF THE PARAMETER ξ.

TABLE V. SUCCESS RATES, IN PERCENTS, OF ALGORITHMS ON TEST SET. THE SECOND COLUMN REPRESENTS THE INITIAL SUCCESS RATES, GENERATED WHEN BUILDING TEST SET. THE THIRD AND FOURTH COLUMNS ARE BEST RESULTS OF OSGP AND ISGP, CORRESPONDINGLY.