Bias-variance decomposition in Genetic Programming

Abstract We study properties of Linear Genetic Programming (LGP) through several regression and classification benchmarks. In each problem, we decompose the results into bias and variance components, and explore the effect of varying certain key parameters on the overall error and its decomposed contributions. These parameters are the maximum program size, the initial population, and the function set used. We confirm and quantify several insights into the practical usage of GP, most notably that (a) the variance between runs is primarily due to initialization rather than the selection of training samples, (b) parameters can be reasonably optimized to obtain gains in efficacy, and (c) functions detrimental to evolvability are easily eliminated, while functions well-suited to the problem can greatly improve performance—therefore, larger and more diverse function sets are always preferable.


Introduction
Bias-variance decomposition is a fundamental method in machine learning. It allows for the decomposition of the error rate of a predictor into two components: the bias, representing the systematic error made by a model, and the variance, representing the error associated with the particularities of a training set. There are many "non-parametric" estimators characterized by a learning error that always tends to zero as the number of samples becomes large. Unfortunately, these learners become computationally expensive to deal with as the number of training samples increases, especially when problem dimensionality is high. Given a fixed number of training samples, non-parametric estimators typically encounter the "bias-variance trade-off", where greater complexity is required to exclude model bias but too much complexity will cause over-specialization to the training data. In light of this trade-off, several authors suggest that the "hard" part of solving a complex problem is precisely finding a proper model with a bias suited for the domain at hand [10,11], for instance, via the inclusion of appropriate heuristics.
Genetic Programming (GP) refers to the use of evolutionary computation to generate computer programs or mathematical expressions. The most typical form of GP is the tree-based version pioneered by Koza [17]. There are many other types, however, including a family easily expressed as graph-based networks, which include Cartesian Genetic Programming [24], Parallel Distributed Genetic Programming [27], Linear Genetic Programming (LGP) [3], and others [26]. These forms of GP are often static-length, while the complexity of the program is derived from the ratio of neutral to active code in the representation. In this work, we concentrate on LGP, an attractive choice for applications due to its tendency to produce parsimonious solutions [3] and its capacity to be converted directly to machine code [28]. We use a variable-length version, in which the effective length of a program will be less than some varying maximum value.
Our goal in this study is to explore the application of LGP to several benchmark problems under the lens of bias-variance decomposition. Some analysis of this form has already been conducted for tree-based GP, where it was shown that GP is generally a low-bias approach for regression problems. Some authors have used variance adjustment [1,13] or other strategies [8] to improve generalization of discovered solutions. In particular, Fitzgerald and Ryan hypothesize that low operator complexity (a smaller or simpler function set) corresponds to lower variance [7]. In this study, we find some evidence to the contrary.
Here we investigate more deeply this breakdown for several problems, both regression and classification, by considering the effect of various key parameters on this decomposition. First, we look at program length, a key parameter for the complexity of GP programs and the variable leading to the classic bias-variance trade-off. Next, we examine more closely choices involving initialization and function sets as potential means of reducing either the bias or the variance portions of the decomposition. In these latter cases, we do not attempt to produce yet other versions of the usual bias-variance trade-off, but rather explore means toward the amelioration of either component of error given realistic constraints. We believe this analysis will help guide practitioners in their choice of models and parameter setting.

Bias-variance decomposition
Let x 2 R p be a real-valued input vector, and y 2 R be an output value, with joint probability distribution P .x; y/. Let our loss function in output space be L.y; y 0 /, some measure of similarity between y and y 0 in R. We seek a function f which, when provided with an input x, will predict an "appropriate" output value f .x/, i.e., one that tends to minimize the average loss at that point: R L.y; f .x//P .yjx/dy.

Regression problems
For regression problems, the most common loss function is L.y; f .x// D .y f .x// 2 . The best choice of predictor, in this case, is the simple conditional expectation, f .x/ D EOEyjx D R yP .yjx/dy, also known as the regression function.
Assume that we have some predictor f , generated through the use of a data sample T , and tuned by a set of parameter values . In the case of a stochastic system, it also takes an initial seed I . For a given , we will write the output of f at x as f .xI T; I / to recall the other two dependencies. Then, the mean square error (MSE) of f at point x 0 can be expressed as the expected loss between f .x 0 / and f .x 0 / over various instances of T and I : in which we impose fixed-size training sets T . Note here that mse.x 0 / refers to the error of the expected predictor in x 0 , and as such, is a measure of the efficacy of the technique (and parameters) which spawned the predictor. Via algebraic manipulation, and making the assumption that our problem is deterministic, we can break the MSE into two components: var.
where O f .x 0 / D E T;I OEf .x 0 I T; I / denotes the average response of the predictor over T and I . The bias-variance dilemma refers to the trade-off between the two components of error: whereas the bias should be reduced, to prevent var.
Note that P OEY D yjx will be 1 or 0, due to the determinism of the problem. The term P T;I OEY f D yjx is the probability of guessing value y via the learning algorithm over all possible training samples T and initial seeds I . As with previous expectations, we estimate this probability via 50 runs of the system.

Integrated statistics
Finally, we will report the integrated forms of the MSE, MMR, bias and variance, respectively denoted b mse, b mmr, b bias and c var, to compare predictors on the basis of a single global measure in each category. For regression problems, integrals are computed numerically over a large set of uniformly distributed samples Q D fx j 0 g as follows: where jQj D 360;000. Note that since our numerical integration uses independently chosen samples, it can also be considered as an independent test set, and hence would detect any overfitting. For classification problems, we can similarly approximate the integrated mean misclassification rate, denoted b mmr, over the test problem instances.

Linear Genetic Programming
We follow Brameier and Banzhaf's definitions of LGP closely [3], with some minor modifications. An LGP individual consists of a collection of registers, n reg , equal to the number of inputs, n in , plus a number of additional registers initially filled with genetically defined constant values, n const . Thus n reg D n in C n const . Following this is a list of n program statements, with n ranging from 1 to a maximum program length, n prog . A program is executed by loading the input values into the initial registers, executing the program statements, then reading the output from the 0-th register. Figure 1 shows an example program. LGP program, instantiating the 2D Euclidean distance function, written in pseudo-Java notation. Note the existence of ineffective ("neutral") code, commented out in light gray.  LGP individual is initialized by generating a sufficient number of constants to fill the additional registers, then generating a series of program statements. The constants are chosen uniformly and randomly from [0,1]. The number of program statements is selected randomly and uniformly between 1 and a maximum initialization length n init Ä n prog . The statements are generated by selecting three registers r a , r b and r c uniformly and randomly from all the value registers n reg , and then selecting a function g from the function set to generate the statement r c D g.r a ; r b /, or r c D g.r a / if g takes only one variable. Finally, any output from the LGP individual is constrained within a certain range, where outlying values are rounded to the closest extreme bound. These problem-specific output bounds were added to prevent undue influence of singularities on statistical analysis. The global output function produced by the LGP individual is denoted f as before: it is equal to some (more or less complicated) composition of a certain number of functions g. We use two function sets to explore our problems: one short list, G short , and one long list, G long . In some experiments, we utilize arbitrary subsets of G long . All possible functions are listed in Table 1. Generally, in this article, we will write our LGP individuals as mathematical expressions. The reader should be aware that: (1) they are an expanded view of the program, since modules are written out explicitly, and (2) while we remove unused code, we do not remove any redundancy (i.e., a statement such as a a is not replaced by 0), in order to give a realistic view of the raw evolutionary outputs.

Evolutionary algorithm
For all problems, we use a steady-state evolutionary algorithm. In the beginning, a population of N pop randomly initialized individuals f is created and each of them is evaluated by calculating its fitness F .f / (see below). The population is maintained at a constant size N pop throughout the evolutionary search by performing one-to-one replacements of individuals. During the search, an additional N new evaluations are performed as follows: for each evaluation, a deterministic tournament is held between two randomly chosen individuals of the current population. The worse of the two is replaced by either a cross between the winner and a second tournament-selected individual (with probability p cross ), or a mutation of the winner (individual elements are mutated with probability p mut and equal chances of macro-or micro-mutation). We also sometimes include a "parsimony pressure": if the difference between the fitness of the two individuals selected for tournament is less than F pars , then we select the individual with the smaller number of program statements. In sum, while the population's size remains N pop , the total number of evaluations is N eval D N pop C N new and the total number of individuals that are effectively replaced is comprised between 0 and N new .

Four benchmarks
We perform our investigations on several benchmark problems. For each benchmark, we execute approximately 500 runs with randomly chosen parameter values similar to the original source. From these runs, we estimate the combination leading to the lowest test fitness (since the "fitness" represents an error or a mismatch to be minimized). The problems and their associated search parameters are summarized in Table 2.

MexHat
The first problem is the "Mexican hat" (MexHat), borrowed from [3] and so named after the shape of its 2D manifold in 3D space. The MexHat function is reduced here to its 2D expression, denoting x D .a; b/: Note that Euler's number e is not included in the function set, and hence must be approximated genetically. For this regression problem, the fitness value F of an LGP individual f is defined as the sum of squared errors (SSE), with respect to the target f Mex , approximated over the training samples T D fx i g:

DistND
Next, we evaluate several "distance" problems (Dist1D, Dist2D, ..., DistND). These are a series of regression problems based on Euclidean distance in any even number of dimensions: Note that, since x D .a 1 ; :::; a N ; b 1 ; :::; b N /, the dimensionality of the problem is actually 2N . The distance functions are useful for investigating a related series of problems in increasing dimensionality. The fitness function F SSE is calculated as in Eq. (10), with target f Dist .

Spiral
We also include two classification problems. The first one is the artificial "spiral" problem, as described by Lang and Witbrock [19]. Here, the target f Spir is a binary function whose domains are intertwined in a double spiral pattern ( Figure 2). There has been significant research on this problem, including in the GP community, due to both its difficulty and its capacity to be easily visualized [5]. For this classification problem, the fitness function is an approximation of the misclassification rate (MR), i.e., the average of all binary mismatches:

Cenparmi
The second classification problem is the Cenparmi database of hand-written characters, collected by the CENPARMI lab at Concordia University ( Figure 3). This real-world supervised learning challenge consists of 6000 image samples of hand-written digits, and constitutes a high-dimensional problem with tightly correlated features. We scaled the images to a size of 12 12 D 144 integer inputs between 0 and 255 representing gray-level pixels. Our treatment made the classification problem binary by distinguishing between a selected class and the remainder of the set (e.g., between "4" and "not-4" instances). At the start of each run, we randomly selected the particular class, involving jT j D 300 training samples from the training pool, and jV j D 600 test samples from the test pool. Note that our approach to the database remained "naive" on purpose, i.e., we did not include geometric information about the relative location of the pixels. Our goal here was to test the bias-variance limits of genetic programming, not achieve competitive performance. The fitness function F MR was calculated as in Eq. (12), with target f Cenp representing the correct binary answer for the chosen class. Note that the state-of-the-art MR, across all learning techniques and available information, is approximately 0.02 [20].

Initial exploration
As expected, LGP was generally successful at evolving good regression functions and classifiers. Some examples of outputs for the Spiral problem are shown in Figure 2, and for the MexHat problem in Table 3. In further sections we will look closely at the variance associated with the individual runs of the evolutionary algorithm. Recall that we are discussing the variation of the solutions in terms of their behaviour in input, i.e., "phenotypic", space. While this is typically the object of interest during the use of genetic programming-as practitioners care how well they achieve some objective or fit some data-it should be noted that this is not the same as variance in genotypic space. In other words, two largely different mathematical expressions (genotypes) might have nearly identical performance (phenotypes), while, conversely, two genetic programs differing by a single instruction might produce dramatically different output functions.
The question here is about genetic convergence, that is, the propensity of evolutionary methods to find the same or equivalent mathematical expressions for the same problem on different runs. Indeed, this is a difficult concept to evaluate, since while there are ways to detect when some related programs are similar, there is no way, in general, to determine if two arbitrary programs are equivalent. There exist several measures of genotypic dissimilarity (termed "diversity", after their typical usage 2 ) for GP. For instance, edit distance (a measure of the number of steps required to transform one program into another, adapted for LGP in [3]), entropy, and behavioural diversity (comparing distributions of program outputs) are known to correlate well with expected fitness for some problem domains [4,12]. Unfortunately, in the case of edit distance and entropy, typical applications of these measures to GP tend to make the assumption that individuals are genetically related, hence are not useful for programs generated via independent means. Furthermore, some identities, such as the capacity to construct one primitive function from combinations of other functions, are not detectable. Informally speaking, in most cases that we examined, some form of genetic convergence was the norm. For instance, consider the solutions evolved from the Dist2D problem using the G short function set (see Table 2). We show below the two functions of x D .a 1 ; a 2 ; b 1 ; b 2 / that had the best fitness values among 33 inexact solutions: In these cases, the evolved solutions strikingly "resemble" the target function f Dist .x/ D ..a 1 b 1 / 2 C .a 2 b 2 / 2 / 1=2 , genetically speaking. To obtain exactly f Dist , all that would be required is minor tweaks and some elimination of redundancy. One could reasonably expect that additional computational effort would achieve at least the former, and under parsimony pressure, possibly the latter.
In the case of the Spiral classification problem, we also observed convergence to a similar form of solution, even if there were differences between the individual runs. For instance, using the G long function set, 5 out of 50 runs found zero-fitness solutions to the problem. Each of the 5 runs admits a similar structure of concentric rings (Figure 2), in which some additional singular boundary, like a flaw in a crystal, separates regions of rings to compensate for the ascending radius of the spirals. All five solutions make prodigious use of the div, mag2, and thresh functions, and one of either sin or cos. While this strategy is relatively consistent for the best of the LGP runs, it is by no means the only solution to the problem generally. For instance, techniques using constructive neural networks generate quite different output patterns, including true spirals [6].
On the other hand, consider the best solutions to the MexHat problem using its G long function set. The three best solutions are shown in Table 3. Notice how all three individuals depend on a 2 C b 2 , i.e., they discovered the radial symmetry of the MexHat target. Otherwise, these solutions display much more genetic variance than in the previously discussed problems. The edit distance between these statements is evidently quite high, as both the statement structure and the functions used differ wildly. Regardless, the outputs of these functions in the 2D plane are quite similar, and do not significantly overfit. Hence, despite great genotypic variation, they are all highly successful examples of regression solutions.
While our description is rather informal (the development of a more rigorous measure of genetic convergence lying beyond the scope of this study), we believe that it highlights the possibility of phenotypic convergence in the absence of genotypic convergence.

Typical bias-variance numbers
The bias-variance decomposition of each of our benchmark problems, including five different instances of DistND, is shown in Table 4. Parameters were set as indicated in Table 2. As proceeding sections will show, the listed values are typical. Nearly everywhere, the variance portion of the error dominates the bias component, often by several multiples. This is generally consistent with a view of GP as a low-bias/high-variance approach, which suggests that overfitting should be of concern for GP practitioners. In all cases, the use of G long also outperforms G short , especially Table 3. The three best solutions on a run of the MexHat problem using the G long function set.

Detailed analysis
In this section, we examine the effects of varying one of four control parameters separately from the others. First, we look at a key parameter of GP complexity, the maximum program length n prog . It is here that we expect to see the classic bias-variance trade-off, and the existence of a range corresponding to the optimal point in that trade-off. Next, we examine parameters related to the genetic initialization I , population size N pop , and choice of function set G. Our goal here is to explore the potential for reducing the error due to variance and bias, respectively, in a manner achievable by a GP practitioner. As such, in these latter experiments we aim not to generate new forms of the bias-variance trade-off, but instead, to study the error components under computationally constrained experimentation.

Varying maximum LGP length
First, a series of experiments were undertaken in which the maximum length of the LGP expressions was varied. The n init value (the maximum initial number of program statements) was chosen randomly and uniformly from the range OE1; 150, and the maximum LGP program length was set to n prog D 2n init . Over 200 values of n prog (including repeats) were explored, and b mse (or b mmr), b bias and c var were computed for each value. The G short function set was used throughout. Table 4. Integrated error and bias-variance decomposition on the target benchmarks. "Exact" solutions, for the regression problems DistND, refer to individuals whose symbolic expression reduces exactly to the target function f (impossible for MexHat, in the absence of Euler's number e) and, for the classification problems Spiral and Cenparmi, to those achieving a zero fitness. Our first question dealt with the best choice of model for the integrated error quantities over an independent parameter (such as n prog here). Based on experimentation with several curve types, we elected to fit b mse (or b mmr) and b bias to a four-parameter model err 4 . / D˛eˇ C 2 C , and c var to either the same curve, or to a straight line. Our choices are motivated in Annex B. Figure 4 shows the results. The data fits closely to the expected view of the bias-variance decomposition of a non-parametric learner over a complexity measure 3 . Indeed, as the maximum complexity of the evolved solutions increases, the bias term quickly drops to a level close to zero. Simultaneously, however, the variance term rises, showing an increased propensity to overfit. Clearly, selecting a maximum length too low will significantly sabotage results. An important question is, on the contrary, whether a practitioner could plausibly be expected to choose higher or intermediate values so as to favor good results. To verify this, we broke our independent variable n prog into a series of equally sized bands of length 50.
We report the mean b mse (or b mmr) over the best band, as opposed to over all runs, including the improvement and the certainty (according to a two-sample t -test): Hence, we conclude that some reasonable amount of one-dimensional experimentation with n prog could be expected to lead to improvements.

Variance due to genetic initialization
Here, we confirm our intuition regarding the role of choice of random seed for the efficacy of the evolutionary algorithm. Our goal is to estimate the proportion of variance resulting from the training samples T versus the random seed I . For our regression problems, MexHat and DistND, using the G short function set, we computed b mse, b bias and c var as described in Section 2. Note that the Spiral classification problem uses a static set of samples, hence could not be analyzed in this fashion. For the other classification problem, Cenparmi, we calculated b mmr. First, we used different random seeds I .k/ with the same training sample set T ("same T "); then, we used different seeds I .k/ paired with different sample sets T .k/ ("normal"). In both cases, the size of the sample set jT j was fixed, as indicated in Table 2. Finally, we computed a third experiment in which the training sets were larger ("large jT j"), with jT .k/ j D 6400 for the regression benchmarks and jT .k/ j D 600 for the Cenparmi benchmark. The results over approximately 50 runs for each trial are shown in Figure 5 Comparing the "normal" runs against the "same T " runs, we see statistically significant gains in performance for the latter in the case of the Cenparmi benchmark only, although the absolute difference is small. This implies a smaller role for the particularities of training set selection in the generation of variance, relative to the role of the initialization seed. Similarily, comparing the "normal" runs against the "large jT j" runs shows statistically significant gains for the latter in the MexHat benchmark only. This time, the reduction in variance due to the increased training set size is approximately 40% of total variance, leaving 60% due to initialization seed. For the other two benchmarks, there is negligible difference in c var. Again, we see that the selection of the initialization seed has more influence on the variance than the size of the training set, even when increased by a factor 16.
Therefore, it is clear that in these examples the majority of the variance associated with the error rates stems from the initial sample of genetic space. We would expect this to be reflected in the final genetic outputs. var on three benchmarks. Plots are "Tukey-style" boxplots: dark lines are median values, boxes are based on quintiles, whiskers represent the 95% confidence interval, circles are outliers.

Varying the population size
In this third series of experiments, we elected to explore the effect of the population size N pop (the steady number of evolved individuals f ) on the performance of the algorithm, given a constant number of evaluations N eval . This parameter plays here the role of a trade-off, which involves the amount of initial exploration taken by the EA (in a larger population), as opposed to the exploitation of the better individuals (in a smaller population). In order to avoid greater amounts of computation, we maintain the number of evaluations N eval D N pop C N new constant, i.e., diminish the number of individuals created via genetic operators, N new , as N pop grows.
We computed over 200 samples for each problem, with ranges of OE1; 4000. Due to the different role of parameter N pop , i.e., that of a trade-off rather than a measure of model complexity like n prog , we re-evaluated our choice of fitting curves to model the data and decided to use err 4 for all three error measures. These results are shown in Figure 6.
In two cases, we observe that the variance component of error drops to some minimal level, and then plateaus. This suggests that following some critical population size, an adequate sample of genotypic space is found. The bias component of error also drops initially, and then gradually begins to climb. This is likely due to a decrease in evolutionary evaluations, where unnecessarily large initial population sizes encroach on the time devoted to exploitation in the algorithm. (It is unlikely that the bias is caused by the discovery of difficult-to-find local minima via larger samples, since these would not only increase bias but also lower variance.) In the other two cases, no significant effect on variance was observed, suggesting that small populations sample the genetic space sufficiently well.
A key point here is that the lowest values of variance for all these problems is still significantly higher than the variance we associate with the selection of the training set (see Section 4.1). That is, we cannot reasonably expect larger initial samples of the genomic space to eliminate the variance due to initialization.
An interesting effect can be seen with the Dist3D benchmark: namely, the best results were observed with a very small population, followed by an increase in error rates, and finally a decrease. Indeed, the error scores seen at the larger population sizes are significantly better than the middle range. This difference is driven largely by bias, not variance. We are at a loss to explain this behaviour.
Again, we asked whether or not a practitioner could hope to select optimal values of N pop in order to increase success. We broke the possible values into bands of size 500. We summarize our results below (noting that no significant changes were observed with the Cenparmi runs): The conclusion here is that, in some cases, modest but significant improvements can be made by adjusting N pop .

Varying the function set
In a final series of experiments, we elected to vary the size of the function set, jGj, and its membership. Here, each run uses a random subset of G long as a selection of available choices for the evolutionary algorithm. In each subset G, the basic functions fplus, minus, times, divg were included by default. Next, an integer was chosen randomly and uniformly in OE0; 14 and that many additional functions were drawn from G long (see Table 1) to form the pool available to the evolutionary algorithm. Each function was equally likely to be chosen by genetic initialization or mutation. Results are shown in Figure 7. For all benchmarks save Cenparmi, there is a sharp increase in performance with the number of functions included (in the case of Cenparmi, the performance is unchanged at all sizes). It is immediately evident that the expected performance, in terms of b mse or b mmr, improves rapidly with more functions. Although there is some drop in variance, too, especially with values near four functions, the primary gains are made via reduction in b bias, until the value drops nearly to zero. The c var score, on the other hand, appears to plateau before this. The fact that c var does not begin to increase with more functions is interesting. It suggests that the addition of new choices to the function set is not an increase in model complexity, i.e., that it does not generally enable the production of things previously impossible. Instead, we should view it as a means of skewing the distribution of solutions so as to make relevant solutions more probable. Thus, we propose that the function set can be used to control the bias of the system, that is, introduce heuristics that may (or may not) be appropriate to a given problem.
If the above hypothesis is correct, we should be able to see changes to output associated with the addition of particular functions while the set size is held constant. To test this, we generated over 50 runs of the system where the function set was selected as above, but the size was fixed to 11 (the necessarily included functions fplus, minus, times, divg along with 7 additional randomly chosen functions from G long ). For each function, we compared the mean of b mse in those runs which included the function versus those runs which did not. Indeed, we discovered several important results. Below we list all those functions with significant certainty (p < 0:05) for the MexHat problem, noting that the mean b mse score over all runs is 0.0082 (see Figure 8 for a graphic comparison): The majority of the functions had an adverse effect (8 out of 14 increased b mse), implying that either they tended towards overfitting or that evolutionary effort was wasted on removing them from the potential solutions. The most significant single function, mag2, had a highly beneficial effect, decreasing the expected b mse score by about 67%.
Most of the improvement in efficacy when augmenting the function set can probably be ascribed to this single function. The most useful function, mag2, accounts for all gains when increasing the function set. This is not surprising, as mag2 is a repeated element in the target solution f Dist3D . Again, the majority of additional functions (10 of 13) have an adverse effect on the problem (increasing the fitness), but a very useful function can compensate this, and provide large gains to overall performance, primarily through elimination of bias. For the Cenparmi problem, there are several moderately significant functions, but none whose effect increased or decreased error by more than 0.005.

Conclusions
Our study has generally confirmed the view of GP as a low-bias and high-variance approach to regression and classification problems. Furthermore, our analysis of variation on the maximum program length n prog has shown results consistent with the bias-variance decomposition of a non-parametric learning technique.
We have reached several key conclusions from this study. While these results have been seen in particular contexts in the literature, here we contrast their effects between benchmarks, and quantify the expected effect. The conclusions are: -Initialization creates the most variance: The variance associated with GP runs is largely due to the initialization seed, and secondarily to the selection of training samples. Further, increasing the sample of the genomic space taken in the population cannot be realistically expected to reduce this variance. -Parameters can be optimized: For all three parameters that we examined (maximum program length n prog , population size N pop , and function set G), one-dimensional selection of a reasonably sized band of values usually led to significant improvements in overall results. Of the three parameters, the largest gains were obtained by making minor changes to the function sets: indeed, in three of four benchmarks, the inclusion of one appropriately chosen function affected performance more than the best expected gains from tuning the other two parameters. In none of the benchmarks was the inclusion of all functions detrimental. The consistency of these results between benchmarks suggests that this conclusion can be generalized. -Population size effects are unclear: The choice of population size, N pop , led to largely inconsistent results.
For two benchmarks, variance could be decreased with larger initial populations. Along with this decrease was an increase in bias, due to the lessened efforts devoted to genetic optimization. For the other two benchmarks, significant changes in variance were not seen. -Larger function sets are better: Regarding the choice of function set G for inclusion in the genetic search space, the widest possible space was consistently preferred by evolution, reflected in a steady decrease of regression or classifier bias to near-zero levels. This was true despite the fact that the majority of functions were demonstrably detrimental to the evolvability of the problem. Thus it appears easier for evolution to eliminate ill-suited heuristics than to construct well-suited heuristics from more primitive operators. In particular, when increasing the function set size, we found no increase in either the average error or the variance of the results, thus providing evidence against the hypothesis of Fitzgerald and Ryan [7]. -Well-chosen functions are best: In most cases we explored here, there existed some non-standard functions in the larger function set very well suited to the problem at hand. It is these functions which accounted for the majority of gain in efficacy.

Future directions
Today, GP is increasingly being applied to knowledge extraction, where a symbolic description of a given database is desired. For instance, GP serves to extract scientific hypothesis from selected databases [2,30,33]; extract symbolic features from image databases [15,16,25,34]; explore the space of network generators to create predictive descriptions of agent behaviours or complex networks [22,23]; and other engineering-related applications [18]. In all these tasks, the genetic component of the evolved solution has definite meaning, possibly independently of the evaluation of the solution.
The most popular philosophy of science generally admits any model which makes useful and testable (falsifiable) predictions, and is parsimonious [21]. These conditions, however, are the product of an age in which the general assumption was made that only a few competing hypothesis would be available at any time, and hence, that determination of the most accurate or parsimonious solution would be simple. In the case of automated knowledge extraction, the possibility exists that indefinitely many models can be posited without any clear means of determining a best one: generalization becomes a multi-dimensional question, and parsimony, if at all definable, is potentially subject to the non-computability of minimal program length.
While in some cases human-understandable (or even elegant) solutions are discovered [2,30], generally speaking, little attention has been paid to the matter. This study has shown that these issues are problem-dependent: in cases where a clear solution existed in the space (such as the DistND regression problems) genotypic convergence was possible, while in other cases (such as the MexHat regression problem) many competing genotypically distinct solutions existed. The consequences of consistent genetic diversity on the capacity to extract knowledge automatically remains to be investigated.

A A note on several other UCI databases
In the course of conducting this research, we also experimented with several popular data sets from the University of California, Irvine (UCI) [9]. They were explored in some detail, but ultimately rejected as inappropriate for this style of research. Specifically, we worked with the Breast Cancer Wisconsin (Diagnostic) data set [32], the Pima Indians Diabetes data set [31] (both original and corrected versions), and the Statlog (Australian Credit Approval) data set [29]. Performance was systematically tested by measuring MR for different program lengths: n prog 2 f1; 2; 20; 50; 100g. We discovered that in all three cases the naive application of GP was incapable of improvement when given additional complexity (i.e., increasing n prog ), relative to the natural stochasticity due to the selection of training and test sets. For all data sets, the difference in MR on randomly chosen test samples was not significantly different between n prog D 2 and n prog D 100 (over 30 runs, a textbook t-test did not discover any trends with p < 0:1). This implied that a simple threshold on one or two input variables was the best discoverable performance by a naive technique.  Lest our results be interpreted as a failure of our particular approach to GP, we re-ran the same experiments using another non-parametric learner, a neural network. Specifically, neural networks with a varying number of hidden nodes were trained and tested on the above databases, and trained via backpropagation (using a sigmoid activation function .v/ D 1:7159 tanh.2v=3/, and 50 epochs of training). The number of hidden neurons used varied over the set f1; 2; 5; 10; 50; 100g and 30 runs. In all cases, the test error did not change significantly, save for the Diabetes database, where in fact the test error worsened significantly. An illustration of these results is shown in Figure 9. In conclusion of these findings, we deemed the above three databases too noisy for non-parametric learning, and recommend future researchers to proceed with caution.

B Curve selection
Selection of a model (curve) for data fitting was carried out using the MexHat domain, using the G short function set, and over 100 runs. All curves were fit using the Gauss-Newton method of non-linear regression. Goodness-offit error was the residual standard error. All polynomials up to degree seven were fit. We also tested three curves designed to resemble expected curve shape (from previous experiments with MSE): err 5 . / D˛eˇ C 2 C ı C err 4 . / D˛eˇ C 2 C err 4 0 . / D˛eˇ C ı C The best error rate for fitting the mse data was achieved by the err 4 curve (0.00809), slightly outperforming the other two exp curves, and even outperforming the more complex 7-term polynomial (0.00865). Further simplifications to the exp curves rapidly increased the error rate. Hence, we elected to use err 4 as a default guess for all curves, with other err curves substituted in the case of an improvement of error greater than 0.001. The var curves were typically modelled via straight lines, unless a curve improved error by more than 0.001.