DESIGNING NEURAL NETWORKS FOR MODELING BIOLOGICAL DATA : A STATISTICAL PERSPECTIVE

In this paper, we propose a strategy for the selection of the hidden layer size in feedforward neural network models. The procedure herein presented is based on comparison of different models in terms of their out of sample predictive ability, for a specified loss function. To overcome the problem of data snooping, we extend the scheme based on the use of the reality check with modifications apt to compare nested models. Some applications of the proposed procedure to simulated and real data sets show that it allows to select parsimonious neural network models with the highest predictive accuracy.

1. Introduction.It is widely accepted that complex structures, in time and in space, exist in biological data and that non-linear models, both parametric and non-parametric, can effectively be used to reveal these patterns.In this context, artificial neural networks are one of the most popular artificial learning tools due to their ability to accurately represent the complex, non linear behaviour of relatively poorly understood processes without any a priori knowledge of input and output relationships.
The growing interest in neural networks, with respect to other non parametric techniques, is due to their versatility which comes from the high capability of providing, under quite general conditions, an arbitrarily accurate approximation to an unknown target function of interest.Barron [2] obtained a deterministic approximation rate for a class of single hidden layer feedforward neural networks with r hidden units and sigmoid activation functions when the target function satisfies certain smoothness conditions.Hornik et al. [16] extended Barron's result to a class of neural networks with possibly non-sigmoid activation approximating the target function and its derivatives simultaneously.More recently, Makovoz [23] and Chen and White [5] obtained an improved degree of approximation of Barron's neural networks with sigmoid activation function.Moreover, neural networks do not suffer for the so-called 'curse of dimensionality'.Theoretically, they are expected to perform better than other approximation methods since the approximation form is not so sensitive to the increasing data space dimension, at least within the confines of particular classes of functions.
The reliability of using neural networks in practice has been affirmed in many different applications ranging from pattern recognition [3], in chromatographic spectra [15,4], and expression profiles [36,18], to functional analyses of genomic and proteomic sequences [6] to QSAR models [10,1].

MICHELE LA ROCCA AND CIRA PERNA
Many studies [24,40,32] agree that one of the main difficulties in applying neural networks in biotechnology is the choice of an adequate model.The problem can be faced referring to the many proposals based on trial and error procedures, adequately combined with pruning of the network graph, based on weight selection or classical information criteria.Alternatively, the neural network model building strategy could be faced in a statistical perspective, relating it to the classical model selection approach.In the case of the single hidden feedforward neural network class, model selection basically involves the choice of the number and type of input neurons and the number of neurons in the hidden layer.In our opinion, a model building process should highlight the different role of these two types of neurons.The input neurons are related to the explanatory variables and, as a consequence, are useful for the identification and interpretation of the model.Therefore, although many techniques have been proposed in the literature, their selection should be addressed focusing on statistical test procedures for variable selection in regression models [19,20] in order to give information explicitly on the 'relevance' of the variable to the model.On the contrary, the hidden layer size takes into account the trade-off between estimation bias and variability and so it is related to the complexity of the model.Of course, the selection of this parameter plays an important role since under-parametrized (over-parametrized) models can lead to heavy consequences on underfitting (overfitting) and, consequently, poor modeling performance or reduced ex-post forecast accuracy.Generally, it is chosen according to one of the information criteria available in the statistical literature even if many statistical studies [31,26] agree on the failure of these measures in choosing the best forecasting model.
In this paper, we propose a strategy in which the hidden layer size is selected by comparing models in terms of their out of sample predictive ability, for a specified loss function.In this context, since a given set of data is used more than once for inference and model selection, there is the possibility that any satisfactory results may simply be due to chance rather than to the model itself (a problem known as 'data snooping').To overcome this problem, we extend the procedure based on the use of the reality check proposed in White [39] with the modification for nested models as proposed in Clark and McCracken [7,8].
The paper is organized as follows.In the next section, we describe the structure of the data generating process and the neural network model.In section 3 we discuss the proposed test procedure for model selection.Numerical examples on simulated data and a moderate Monte Carlo experiment are reported in section 4 while in section 5 two applications to biological data are discussed.Some final remarks close the paper.
2. Neural network models.Let the observed data be the realization of a sequence Z i = Y i , X T i T of independent identically distributed (iid) random vectors of order (d + 1), with i ∈ N.
The random variables Y i represent targets (in the neural network jargon) and it is usually of interest the probabilistic relationship with the variables X i , described by the conditional distribution of the random variable Y i |X i .Certain aspects of this probability law are relevant in interpreting what is the modeling role of artificial neural network models.If E (Y i ) < ∞, then E (Y i |X i ) = g (X i ) and we can write where ε i ≡ Y i − g (X i ) and g : R d → R is a measurable function.Clearly, by construction the error term ε i is such that E (ε i |X i ) = 0.The function g embodies the sistematic part of the stochastic relation between Y i and X i .It can be approximated by using the output of a single hidden layer feedforward artificial neural network of the form where w ≡ w 00 , w 01 , . . .w 0r , w T 11 , . . ., w T 1r T is a r(d + 2) + 1 vector of network weights, w ∈ W with W compact subset of R r(d+2)+1 , and x ≡ 1, x T T is the input vector augmented by a bias component 1.The network (2) has d input neurons, r neurons in the hidden layer and identity function for the output layer.The (fixed) hidden unit activation function ψ is chosen in such a way that f (x, •) : W → R is continuous for each x in the support of the marginal distribution of X i and f (•, w) : R d → R is measurable for each w in W.
The main difference with a parametric model is that instead of postulating a specific non-linear function, a neural network model is constructed by combining many "simple" non-linear functions via a multi-layer structure.In a feedforward network, the explanatory variables simultaneously activate r hidden units in an intermediate layer through some function ψ, and the resulting hidden-unit activations ψ xT w 1j , j = 1, . . ., r, then activate output units to produce the network output.
Given a training set of N observations, D = {(Y i , X i ) , i = 1, 2, . . ., N } estimation of the network weights (learning) is obtained by solving the optimization problem where q(•) is a proper chosen loss function.Under general regularity conditions (White, [37]), denoting π (z) the distribution of Z i , a weight vector ŵn solving equation (3) exists and converges almost surely to w * , which solves provided that the integral exists and the optimization problem has a unique solution vector interior to W. This is not necessarily true for neural networks without considering appropriate restrictions, since the parametrization of the network function is not unique.However, Ossen and Rügen [25] provide sufficient conditions to ensure uniqueness of w * in a suitable parameter space W for specific network configurations.Moreover, from an asymptotic point of view, the possible presence of multiple minima has no essential effect for solutions to equation (4) (see [37]), while, from a computational point of view, several global optimization strategies (simulation annealing, genetic algorithms, etc.) have been successfully employed to avoid to be trapped in local minima.Finally, when the focus is on prediction, as in this paper, it can be shown that the unidentifiability can be overcome and the problem disappears [17].
The estimated neural network model should capture the real functional relationship existing between the inputs and the output, so it should well approximate the observed data and, at the same time, it should perform reasonably well on new data (prediction).Clearly, a good neural network model is characterised by its ability to approximate and to generalise in a proper way.A model structure that is chosen to be too complex in relation to the real functional relationship, captures the noise contained in the data (overfitting).Such a model will perform well in approximating the data used for the estimation of its parameters but very poorly on new data.
The trade-off between approximation accuracy and generalization ability is governed by the hidden layer size which, in neural network framework, plays the role of a smoothing parameter.
This issue is usually addressed by splitting the available data set into two subsets: (i) the training set and the test data set.The first is used for estimating the weights of a certain model structure with a specified number of hidden units; then the test data set is fed into the neural network model which runs in the prediction mode.The discrepancy between the computed and the observed data of the second subset, expressed as mean square error, is a measure of the generalisation property of the network.The relationship between training and test data is usually chosen to be 70% to 30%.To apply this method one usually starts with a small model structure which is stepwise increased by adding hidden units.For every structure the approximation and generalisation errors are computed as mean square error on the basis of the training and the test data respectively.Obviously, the approximation error is expected to decrease continuously with increasing complexity of the model.The generalisation will also improve with increasing number of hidden nodes,but beyond a certain complexity the model will have a poor generalisation performance, due to overfitting problems, even if the number of hidden nodes is still increasing.The number of hidden units corresponding to the minimum of the generalisation error determines the optimal model structure and represents the solution to the model selection problem.
In any case, all these model selection procedures are not entirely satisfactory.Since model selection criteria depend on sample information, their actual values are subject to statistical variations.As a consequence, a model with higher model selection criterion value may not significantly outperform its competitors.In recent years there is a growing literature addressing the problem of comparing different models and theories through the use of predictive performance and predictive accuracy tests ( [9] and the references therein).In this literature, it is quite common to compare multiple models, which are possibly misspecified (they are all approximations of some unknown true model), in terms of their out of sample predictive ability, for a specified loss function.In such context data snooping, which occurs when a given set of data is used more than once for inference or model selection, can be a serious problem.When such data reuse occurs, there is always the possibility that any satisfactory results obtained may simply be due to chance rather than any merit inherent to the model yielding to the result.In other words, by looking long enough and hard enough at a given data set it will often reveal one or more forecasting models that look good but are in fact useless.
The data snooping can be particularly serious especially when there is no theory supporting the modeling strategy, as it happens when using neural network models which are basically atheoretical.Unfortunately, as far as we know, there are no results addressing the problem just described in a neural network framework.

3.
Testing superior predictive ability for neural network modeling design.Let (Y τ , X τ ) denote a future observation that satisfies At this stage, assume that the set of explicative variable has been selected by an appropriate variable selection technique (see [19,21] inter alia).Moreover, assume that k + 1 alternative forecasting neural network models are available, namely f j (x, w), j = 0, 1, . . ., k, where j denotes the hidden layer size and k is fixed to maximum level of desired complexity.Obviously, f 0 (x, w) is a neural network with skip layer and r = 0 neurons in the hidden layer (that is the linear model).In our framework it is assumed to be the benchmark model.
Let the generic forecast error be u j,τ = Y τ − f j (X τ , w * ) , j = 0, 1, . . ., k where w * is defined as in the previous section.Let h be a loss function chosen to properly weight the forecasting error [12] and define Clearly, if model j beats the benchmark (i.e. it shows better expected predictive performances) we have θ j > 0, otherwise θ j ≤ 0 and our goal is to identify as many models for which θ j > 0. In other words, for a given model j, consider and, in a multiple testing framework, take a decision concerning each individual testing problem by either rejecting H j or not.In this framework, to avoid declaring true null hypotheses to be false, the familywise error rate, defined as the probability of rejecting at least one of the true null hypotheses, should be taken under control.This can be done by using the well known Bonferroni method or stepwise procedures such as Holm's approach, which are more powerful.Unfortunately, all these procedures are conservative since they do not take into account the dependence structure of the individual p-values [27].To avoid these issues, it is possible to use the reality check as in White [39] and the modification for nested models as proposed in Clark and McCracken [7,8].
This latter approach can be easily extended to our neural network framework.Let S = {(Y i , X i ) , i ∈ S} and P = {(Y i , X i ) , i ∈ P } denote, respectively, the estimation data set and the test data set, where P is the complement set of S with respect to D, with |P| = N − |S|.Let the estimated forecast error be ûj,τ = Y τ − f j (X τ , ŵ), j = 0, 1, . . ., k and let MPE j = τ ∈P h (û j,τ ), where P is the cardinality of the set P.
The test procedure can be based on the F-type statistic defined as It has a clear interpretation: large values of F p j indicate evidence against the null H j .The procedure for testing the system of hypotheses (7) keeping under control the family wise error rate, runs as follows.Relabel the hypothesis from H r1 to H r k in redescending order with respect to the value of the test statistics F p j , that is F p r1 ≥ F p r2 ≥ . . .≥ F p r k .The procedure focuses on testing the joint null hypothesis that all hypotheses H j are true, that is no competing model is able to beat the benchmark model.This hypothesis is rejected if F p r1 is large, otherwise all hypotheses are accepted.In other words, the procedure constructs a rectangular joint confidence region for the vector (F p r1 , . . ., F p r k ) T , with nominal joint coverage probability 1−α.The confidence region is of the form where the common value c 1−α is chosen to ensure the proper joint (asymptotic) coverage probability.If a particular individual confidence interval F p rj − c 1−α , ∞ does not contain zero, the corresponding null hypothesis H rj is rejected.So, the testing procedure will select a set of models which delivers the greatest predictive ability, when compared to the benchmark model.All these models are somewhat equivalent and, for a parsimony principle, the one with smallest hidden layer size should be selected.If all the nulls are not rejected in the first step, there is no neural network model which is able to outperform the linear model (assumed as a benchmark) in terms of predictive ability.The quantile of order c 1−α is estimated by using the bootstrap [29].
The pseudo-code for the complete testing procedure is described in algorithm (1).
Algorithm 1 Testing algorithm for superior predictive ability.
reject H rs and include s in K 10: end if 11: end for 12: Deliver the set K (if it is an empty set, no neural network model is able to beat the benchmark model) 4. Some numerical results.To illustrate the performance of the proposed model selection procedure, we use simulated data sets generated by models with known structure.The simulated data sets were generated by using different models often employed in the neural network literature as data generating processes.The first model is the same used in De Veaux et al. [11] and it is defined as where ε is gaussian with zero mean and variance equal to 0.1 and X = (X 1 , X 2 , X 3 ) T is drawn randomly from the unit hypercube.The function is radially symmetric in these three variables.Clearly, the number of the neurons in the hidden layer is unknown and the model we try to identify is, by construction, misspecified.
The second model has been used by Friedman [13] and it is defined as where X = (X 1 , X 2 , X 3 , X 4 , X 5 ) T is a vector of multivariate uniform random variables and ε is gaussian with zero mean and variance equal to 1.The model includes both linear and nonlinear relationships.The third model is the same one used by Tibshirani [33] and it is defined as where ψ is the logistic activation function, X = (X 1 , X 2 , X 3 , X 4 ) T is a vector of multivariate gaussian random variables with zero mean, unit variance and pairwise correlation equal to 0.5 and ε is gaussian with zero mean and variance equal to 0.25.
Clearly a neural network with logistic activation function, four input neurons and two hidden neurons is a correctly specified model and no misspecification is present.The last model is the same model used by Turlach [35] and it is defined as T is a vector of multivariate uniform random variables and ε is gaussian with zero mean and variance equal to 0.05.The model includes both linear and nonlinear relationships.
For the numerical examples, we have considered a quadratic loss function h and N = 600, P = 180, B = 1000 and k = 8.All neural network models have been estimated by using nonlinear least squares, including a weight decay in the objective function to control overfitting.Moreover, to avoid to be trapped in local minima, the estimation procedure has been initialized 25 times with random starting values, keeping the estimated network with the lowest residual sum of squares.The results of the testing procedure for typical realizations are reported in figure (1).In the Tibshirani model case, the hidden layer size is known and equal to 2. The procedure correctly identifies the hidden layer size and indicates that it is not possible to improve accuracy by increasing the hidden layer size.All models with r ranging from 2 to 8 are basically equivalent with respect to the predictive accuracy.Similar remarks apply also to all other models.Note that for the DeVeaux and the Friedman data simply considering the statistical index would indicate r = 8 as the best choice, but this does not give any significant improvement with respect to r = 4.
A moderate Monte Carlo experiment has also been performed considering the same data generating processes as before.We have considered 240 Monte Carlo runs with three different sample sizes N = 300, 400, 600 using the last 30% observations for prediction.The results are reported in figure (2).In the Tibshirani case, the hidden layer size (which is known and equal to 2) the proportion of correct identification is very high for all the sample sizes, reaching 100% for N = 600.For the other data sets, the simulations confirm the results shown by the numerical examples and highlight the steep improvement as the sample size increases.
Figure 2. Proportion of hidden layer size identification by using the testing procedure for superior predictive ability 5. Real data applications.To validate the performance of the proposed procedure two applications to real data are discussed in this section.The aim is to verify if the proposed procedure is able to correctly identify an appropriate model structure for the data at hand.
As a first example, we use the Prostate Cancer data set which comes from a study by Stamey et al [30] and also used by Hastie et al [14].The dependent variable is the level of prostate-specific antigen which depends on 8 clinical measures in men who were about to receive prostatectomy.The data set, already used in many biostatistical studies, has a well known regression structure and so it is suitable for testing new procedures.The data set has 97 observations and it is splitted in two subsets: 67 observations have been used for the modeling step while 30 observations have been used for the validation step.By using a linear model and a best subset variable selection rule, just two explanatory variables (out of eight) are identified as relevant: lweight (log prostate weight) and lcavol (log cancer volume).For sake of comparison, as identification tools for the number of hidden neurons, we also use the k-fold Cross-Validation (CV) selection rule (see [14] inter alia) and the Bayesian Information Criterion (BIC) [28], proved to be consistent (almost surely) in the case of multi-layer perceptrons with one hidden layer in [38].
Clearly, the BIC identifies a neural network with skip layer and zero hidden neurons (i.e a linear model), for all the weight decay values considered (see figure 3, left panel).In the following, all the computations are based on a weight decay equal to zero, since it delivers the lowest BIC value.The CV also confirms the model selected by using the BIC and the same conclusions can be drawn by using the proposed test of superior predictive ability (see figure 4, left panel).To validate these results, a linear model and neural networks with hidden neurons ranging from 1 to 8 have been estimated and used to predict the observations in the validation set.The distributions of the absolute prediction errors are reported in figure 4, right panel.The plot shows that the neural networks considered are not able to provide better predictions with respect to the linear model (as predicted by the CV, the BIC and the novel test).Even a neural network with 6 hidden neurons (which shows the lowest median absolute prediction error) does not appear to provide prediction errors statistically different from those provided by the linear model.These results are confirmed by a formal statistical comparison between the two distributions: the Brunner Munzel test and the Wilcoxon rank sum test give p-values equal to 0.497 and 0.495, respectively.The second data set used as an example has been downloaded from the UCI Machine Learning Repository and is composed of a range of biomedical voice measurements from 42 people with early-stage Parkinson's disease recruited to a sixmonth trial of a telemonitoring device for remote symptom progression monitoring (for details see [34]).The data set has 5,875 observations on age, gender and on 16 biomedical voice measures.The statistical model is used to predict the total UP-DRS score.For computational reasons, just the subset of the first 887 observations (corresponding to the first 5 patients) has been considered.Again, the data set is splitted in two subsets: 731 observations (the first 4 individuals) have been used for the modeling step while 156 observations (corresponding to the 5th patient) are used for validation purposes.In this case, the CV model selection rule would suggests a neural network model with 4 hidden neurons while, following the BIC, also a neural network with 2 neurons would be appropriate.On the contrary, by using the proposed test, there is no superior predictive ability of neural network models with respect to linear models (see figure 4, left panel).This is confirmed by the distribution of the absolute predictive errors reported in figure 4, right panel: the linear model and the neural networks with 2 or 4 neurons in the hidden layer perform similarly.This latter result is also supported by the Brunner Munzel test and the Wilcoxon rank sum test whose p-values are, respectively, equal to 0.219 and 0.218.In this example, the regression model uses 17 explanatory variables so networks with 2 or 4 hidden neurons include 39 or 77 parameters, respectively.These networks appear to be heavily overparametrized, with no clear advantages in terms of predictive ability.
6. Concluding remarks.In this work, the main point was to introduce a strategy for the selection of the hidden layer size in feedforward neural network models.
The numerical examples and the Monte Carlo experiment show that looking at the predictive ability of the model, simply measured by statistical indexes of predictive accuracy, might be misleading.In this case, the selected model might be overparametrized with heavy consequences on the generalization ability of the network.A better approach should be based on testing procedures of superior predictive ability.However, this strategy generates a sequence of tests and as a consequence the data snooping problem arises.This multiple testing structure, which is inherent to most model selection strategies, can be effectively addressed by reality check type tests.The proposed testing procedure, which takes under control the familywise error rate, is able to select parsimonious neural network models with the highest predictive accuracy.The real data analysis also supports this latter conclusion showing also that the CV and the BIC might lead to neural network models much more complex than necessary.

Figure 3 .
Figure 3. Bayesian Information Criterion values for different hidden layer sizes and different weight decay values (left panel).k-fold cross validation values for k = 5 and k = 10 using a weight decay equal to zero (right panel).

Figure 4 .
Figure 4. Joint confidence regions with nominal coverage probability 1 − α = 0.95 (left panel).Absolute prediction error distributions computed on the test set for linear models and neural networks with hidden layer size ranging from 1 to 8 (right panel).

1 :
Relabel the hypothesis from H r1 to H r k in redescending order of the value of the test statistics F p j , that is F p r1 ≥ F p r2 ≥ . . .≥ F p r k .