Optimal Training of Mean Variance Estimation Neural Networks

This paper focuses on the optimal implementation of a Mean Variance Estimation network (MVE network) (Nix and Weigend, 1994). This type of network is often used as a building block for uncertainty estimation methods in a regression setting, for instance Concrete dropout (Gal et al., 2017) and Deep Ensembles (Lakshminarayanan et al., 2017). Specifically, an MVE network assumes that the data is produced from a normal distribution with a mean function and variance function. The MVE network outputs a mean and variance estimate and optimizes the network parameters by minimizing the negative loglikelihood. In our paper, we present two significant insights. Firstly, the convergence difficulties reported in recent work can be relatively easily prevented by following the simple yet often overlooked recommendation from the original authors that a warm-up period should be used. During this period, only the mean is optimized with a fixed variance. We demonstrate the effectiveness of this step through experimentation, highlighting that it should be standard practice. As a side note, we examine whether, after the warm-up, it is beneficial to fix the mean while optimizing the variance or to optimize both simultaneously. Here, we do not observe a substantial difference. Secondly, we introduce a novel improvement of the MVE network: separate regularization of the mean and the variance estimate. We demonstrate, both on toy examples and on a number of benchmark UCI regression data sets, that following the original recommendations and the novel separate regularization can lead to significant improvements.


I. INTRODUCTION
Neural networks are gaining tremendous popularity both in regression and classification applications. In a regression setting, the scope of this paper, neural networks are used for a wide range of tasks such as the prediction of wind power (Khosravi and Nahavandi, 2014), bone strength (Shaikhina and Khovanova, 2017), and floods (Chaudhary et al., 2022).
Due to the deployment of neural networks in these safety-critical applications, uncertainty estimation has become increasingly important (Gal, 2016). The uncertainty in the prediction can be roughly decomposed into two parts: epistemic or model uncertainty, the reducible uncertainty that captures the fact that we are unsure about our model, and aleatoric uncertainty, the irreducible uncertainty that arises from the inherent randomness of the data (Abdar et al., 2021; Hüllermeier and Waegeman, 2019). In this paper, we refer to the latter as the variance of the noise, to avoid any confusion or philosophical discussions. The variance of the noise can be homoscedastic if it is constant, or heteroscedastic if it depends on the input x.
There is a vast amount of research that studies the model uncertainty. Notable approaches include Bayesian neural networks (MacKay, 1992;Neal, 2012), dropout (Gal and Ghahramani, 2016;Gal et al., 2017), and ensembling (Heskes, 1997). Conversely, a lot less emphasis is often placed on the estimation of the variance of the noise. Monte-Carlo dropout, for example, simply uses a single homoscedastic hyperparameter. Some other methods, such as concrete dropout and the hugely popular Deep Ensembles (Lakshminarayanan et al., 2017), use a Mean Variance Estimation (MVE) network (Nix and Weigend, 1994).
An MVE network, see Figure 1, works as follows. We assume that we have a data set consisting of n pairs $(x_i, y_i)$, with $y_i \sim \mathcal{N}(\mu(x_i), \sigma^2(x_i))$. An MVE network consists of two sub-networks that output a prediction for the mean, $\hat{\mu}(x)$, and for the variance, $\hat{\sigma}^2(x)$. These sub-networks only share the input layer and do not have any shared weights or biases. In order to enforce positivity of the variance, a transformation such as a softplus or an exponential is used. The network is trained by using the negative loglikelihood of a normal distribution as the loss function:
$$\mathcal{L} = \sum_{i=1}^{n} \frac{1}{2}\log\big(2\pi\hat{\sigma}^2(x_i)\big) + \frac{(y_i - \hat{\mu}(x_i))^2}{2\hat{\sigma}^2(x_i)}.$$

Since the MVE network is often used as the building block for complex uncertainty estimation methods, it is essential that it works well. Multiple authors have noted that the training of an MVE network can be unstable (Seitzer et al., 2021; Skafte et al., 2019; Takahashi et al., 2018). The main argument, elaborated on in the next section, is that the network will start focusing on areas where the network does well at the start of the training process while ignoring poorly fitted regions. However, Nix and Weigend (1994) already warned of the possibility of harmful overfitting of the variance and gave the solution: the training of an MVE network should start with a warm-up period during which the variance is fixed and only the mean is optimized.
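The negative-loglikelihood loss described above can be sketched in a few lines of NumPy. The softplus transform for positivity and the small variance floor follow choices mentioned in this paper; the function names and the concrete numbers below are our own:

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus, used here to keep the variance positive.
    return np.logaddexp(0.0, z)

def gaussian_nll(y, mu, raw_var):
    """Negative loglikelihood of y under N(mu, sigma^2), averaged over samples.

    `raw_var` is the unconstrained variance output of the network; softplus
    maps it to a positive variance, and a small floor is added for stability.
    """
    var = softplus(raw_var) + 1e-6
    return np.mean(0.5 * np.log(2 * np.pi * var) + 0.5 * (y - mu) ** 2 / var)

# A correct mean estimate should score a lower loss than a biased one.
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=0.5, size=1000)
mu_good, mu_bad = np.full_like(y, 1.0), np.full_like(y, 3.0)
raw = np.zeros_like(y)  # softplus(0) gives a moderate initial variance
```

In a real MVE network, `mu` and `raw_var` would be the outputs of the two sub-networks rather than fixed arrays.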
Additionally, the variance is initialized at a constant value in order to make all data points contribute equally to the loss. Nix and Weigend (1994) did not demonstrate the importance of this warm-up period in the original paper. In this paper, we empirically demonstrate that using a warm-up period can greatly improve the performance of MVE networks and fixes the instability noted by other authors.

A limited amount of research has investigated possible improvements of the MVE network (Seitzer et al., 2021; Skafte et al., 2019). Most improvements require a significant adaptation to the training procedure, such as a different loss function or locally aware mini-batches. However, to the best of our knowledge, none have investigated our proposed easy-to-implement improvement: The mean and variance in an MVE network should be regularized separately.
Most modern methods (Egele et al., 2021;Gal et al., 2017;Jain et al., 2020;Lakshminarayanan et al., 2017) appear to use the same regularization for both the mean and the variance.
In fact, the current use of the MVE network often does not even easily allow for different regularizations. Typically, only a second output node is added to represent the variance, instead of an entire separate sub-network (see Figure 2). As we will demonstrate in this paper, separate regularization can be very beneficial to the predictive results.

Contributions:
• We provide experimental results that demonstrate the importance of a warm-up period as suggested by Nix and Weigend (1994).
• We investigate if it is beneficial to update the mean and variance simultaneously after the warm-up, as opposed to keeping the mean fixed after the warm-up and only learning the variance.
• We provide a theoretical motivation why different regularization for the mean and variance in an MVE network is desirable and experimentally demonstrate that this can lead to significant improvements.

Organisation:
This paper consists of 6 sections, this introduction being the first. In Section II, we go through the problems with MVE networks that have recently been reported in the literature. In the same section, we show that these problems can be resolved by following the recommendation of using a warm-up period. The following Section III gives a theoretical motivation in favor of updating both the mean and the variance after the warm-up as opposed to keeping the mean fixed and only learning the variance. Section IV explains, both intuitively and using classical theory, why we expect to need different amounts of regularization for the mean and the variance estimates.
Both the effect of the warm-up and of separate regularization are experimentally examined in Section V. The final section summarizes the results, gives a list of recommendations when training an MVE network, and provides possible avenues for future work.
All the code used in the experiments of this paper can be found at https://github.com/LaurensSluyterman/Mean Variance Estimation.

II. DIFFICULTIES WITH TRAINING MVE NETWORKS
It is known that the training of an MVE network can be unstable (Seitzer et al., 2021;Skafte et al., 2019;Takahashi et al., 2018). The main argument is that the network may fail to learn the mean function for regions where it initially has a large error. In these regions, the variance estimate will increase, which implies that the residual does not contribute to the loss as much. The network will start to focus more on regions where it is performing well, while increasingly ignoring poorly fit regions.
To illustrate what can happen, we reproduced an experiment from Seitzer et al. (2021). We sampled 1000 covariates, $x_i$, uniformly between 0 and 10, and subsequently sampled the targets, $y_i$, from a $\mathcal{N}(0.4\sin(2\pi x_i), 0.01^2)$ distribution. Figure 3 shows that the MVE network is unable to properly learn the mean function. Increasing the training time does not solve this. A network with a similar architecture that was trained using the mean-squared-error loss was able to learn the mean function well.
We provide a second explanation for this behaviour by noting that the loss landscape is likely to have many local minima. We already encounter this in a very simple example. Suppose we have a data set consisting of two parts: 100 data points from a $\mathcal{N}(2, 0.5^2)$ distribution and 100 data points from a $\mathcal{N}(5, 0.1^2)$ distribution. For each part, we are allowed to pick a separate variance estimate, $\hat{\sigma}^2_1$ and $\hat{\sigma}^2_2$, but we can only pick a single estimate for the mean. In this situation, there are two local minima of the negative loglikelihood (see Figure 4): we can set $\hat{\mu}$ to approximately 2 with a small $\hat{\sigma}^2_1$ and a large $\hat{\sigma}^2_2$, or set $\hat{\mu}$ to 5 with a large $\hat{\sigma}^2_1$ and a small $\hat{\sigma}^2_2$. While this simplified setting is of course not a realistic representation of a neural network, it does illustrate that there can easily be many local minima when dealing with complex functions for the mean and the variance. When we start from a random estimate for the mean, it is therefore not unlikely that we end up in a bad local minimum.
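The two-cluster example can be checked numerically. For any fixed mean estimate, the negative-loglikelihood-optimal variance of each cluster is its mean squared residual, so the profile negative loglikelihood over $\hat{\mu}$ has a closed form. The following NumPy sketch (our own construction, following the setup above) locates its two local minima:

```python
import numpy as np

# Two clusters as in the example: N(2, 0.5^2) and N(5, 0.1^2).
rng = np.random.default_rng(1)
y1 = rng.normal(2.0, 0.5, 100)
y2 = rng.normal(5.0, 0.1, 100)

def profile_nll(mu):
    # Optimal per-cluster variances given mu are the mean squared residuals.
    s1 = np.mean((y1 - mu) ** 2)
    s2 = np.mean((y2 - mu) ** 2)
    # Plugging them back in leaves 0.5 * n * log(s) per cluster (n = 100),
    # plus additive constants that do not depend on mu.
    return 50 * np.log(s1) + 50 * np.log(s2)

mus = np.linspace(0.0, 7.0, 1401)
nll = np.array([profile_nll(m) for m in mus])
# Interior local minima: grid points lower than both neighbours.
local_min = [mus[i] for i in range(1, len(mus) - 1)
             if nll[i] < nll[i - 1] and nll[i] < nll[i + 1]]
```

The profile has exactly two local minima, one near 2 and one near 5, with the one near 5 being the global minimum.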

A. The Solution: Warm-up
The original authors advised to use a warm-up, which indeed alleviates most problems. After initialization, the variance is fixed at a constant value and the mean estimate is learned. In a second phase, the mean and variance are updated simultaneously.
We can motivate why a warm-up is beneficial, both from the loss-contribution perspective and from the local-minima perspective. From the loss-contribution perspective, we no longer have the problem that regions with poor initial predictions fail to learn: because the variance estimate at initialization is constant, all residuals contribute to the loss equally. From the loss-landscape perspective, we are less likely to end up in a bad local minimum if we start from a sensible mean function.

Figure 5 shows that adding a warm-up period indeed solves the convergence problem that we previously had in the sine example in Figure 3: by using a warm-up where only the mean is updated, the MVE network is able to learn the mean function well, although in this example the variance appears to be overfitting slightly.

[Figure 4 caption: The data consist of two parts: 100 data points from a $\mathcal{N}(2, 0.5^2)$ distribution and 100 data points from a $\mathcal{N}(5, 0.1^2)$ distribution. The graph shows the negative loglikelihood as a function of $\hat{\mu}$, where we take the optimal variance estimates for each value of $\hat{\mu}$.]

III. WHAT TO DO AFTER THE WARM-UP?
After the warm-up period, we could either update the variance while keeping the mean estimate fixed or update both simultaneously. In the original MVE paper, the authors argue that simultaneously estimating the mean and the variance is also advantageous for the estimate of the mean. The reasoning is that the model will focus its resources on low noise regions, leading to a more stable estimator.
We go through some classical theory that shows that this is the case for a linear model. All derivations for the statements in this paper regarding linear models can be found in van Wieringen (2015).
We assume that we have a data set consisting of n data points $(x_i, y_i)$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. With $X$, we denote the $n \times p$ design matrix which has the n covariate vectors $x_i$ as rows. With $Y$, we denote the $n \times 1$ vector containing the observations $y_i$. We assume $X$ to be of full rank. Consider the linear model
$$Y = X\beta + U, \qquad U \sim \mathcal{N}(0, \Sigma), \qquad (1)$$
where $\Sigma$ can be any invertible covariance matrix, possibly heteroscedastic and including interaction terms. Suppose this covariance matrix is known; then classical theory tells us that it is beneficial for our estimate of $\beta$ to take this into account. To see this, we will compare the linear model in Equation (1) with a rescaled version that takes the covariance matrix into account. Since $\Sigma$ is positive definite, we can write it as $BB^{T}$ and rescale our model by multiplying with $B^{-1}$:
$$B^{-1}Y = B^{-1}X\beta + B^{-1}U.$$
When finding the least-squares estimators, both formulations lead to different estimators of $\beta$, denoted by $\hat{\beta}$ and $\hat{\beta}^{*}$ respectively. In Appendix A, we show that both estimators are unbiased estimators of $\beta$ and that $\hat{\beta}^{*}$ has a lower variance.
We want to emphasize that this leads to improved metrics such as the RMSE. Let us for instance look at the difference between the expected quadratic errors of a new pair $(x_{\text{new}}, y_{\text{new}})$ when using $\hat{\beta}$ and $\hat{\beta}^{*}$. In Appendix A, we prove that
$$\mathbb{E}\big[(y_{\text{new}} - x_{\text{new}}^{T}\hat{\beta})^{2}\big] - \mathbb{E}\big[(y_{\text{new}} - x_{\text{new}}^{T}\hat{\beta}^{*})^{2}\big] = x_{\text{new}}^{T}\big(\Sigma_{\hat{\beta}} - \Sigma_{\hat{\beta}^{*}}\big)x_{\text{new}} \geq 0.$$
This short example illustrates that it can be beneficial for the mean estimate to take the variance into account. Besides the obvious benefit of having an estimate of the variance, it may therefore be a good idea to use an MVE network in order to get a better estimate for the mean, as the authors of the original paper also pointed out.
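A quick simulation illustrates the effect. We compare the ordinary least-squares estimator $\hat{\beta}$ with the rescaled estimator $\hat{\beta}^{*}$ on heteroscedastic data with a known diagonal covariance; the dimensions, coefficients, and noise scales below are arbitrary choices of ours:

```python
import numpy as np

# Heteroscedastic linear data with known per-point noise standard deviations.
rng = np.random.default_rng(2)
n, p = 200, 3
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
sigmas = 0.1 + 2.0 * rng.uniform(size=n)  # known noise std per data point
Binv = np.diag(1.0 / sigmas)              # Sigma = B B^T with B = diag(sigmas)

errs_ols, errs_gls = [], []
for _ in range(300):
    y = X @ beta + sigmas * rng.normal(size=n)
    # Unscaled formulation: ordinary least squares.
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    # Rescaled formulation: least squares after multiplying by B^{-1}.
    b_gls = np.linalg.lstsq(Binv @ X, Binv @ y, rcond=None)[0]
    errs_ols.append(np.sum((b_ols - beta) ** 2))
    errs_gls.append(np.sum((b_gls - beta) ** 2))
```

Averaged over repetitions, the rescaled estimator recovers $\beta$ with a smaller squared error, as the theory predicts.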
In summary, focusing on low-noise regions is beneficial. However, the estimate of the noise is made using the mean predictor. If the mean predictor is bad, we do not focus on low-noise regions but on high-accuracy regions, which can be very detrimental. We therefore need a warm-up period, after which classical theory would suggest that estimating the mean and variance simultaneously has advantages. In Section V, we test if estimating the mean and variance simultaneously is indeed beneficial for the mean estimate.

IV. THE NEED FOR SEPARATE REGULARIZATION
In this section, we give a theoretical motivation for the need for a different regularization of the parts of the network that give the mean and variance estimate. Intuitively, this makes sense: there is no reason to assume that the mean function and the variance function are equally complex. If one is much more complex than the other, we do not want to regularize them the same way.
We can give a more rigorous argument by again analyzing a classical linear model. We do this by considering two linear models that most closely resemble the scenario of an MVE network. The first model will estimate the mean while knowing the variance, and the second model will estimate the log of the variance¹ while knowing the mean. Both models will generally have a different optimal regularization constant.

A. Scenario 1: Estimating the Mean With a Known Variance
We use the same notation as in the previous example and assume a homoscedastic noise. The goal is to find the estimator that minimizes the mean-squared-error loss plus a regularization term,
$$\lVert Y - X\beta \rVert_{2}^{2} + \lambda \lVert \beta \rVert_{2}^{2}.$$
Depending on the value of $\lambda$, we get different estimators $\hat{\beta}(\lambda)$.
In Appendix B, we show that the optimal regularization constant, $\lambda^{*}$, satisfies
$$\lambda^{*} = p\sigma^{2}(\beta^{T}\beta)^{-1}.$$
We defined optimal as the $\lambda$ for which
$$\mathrm{MSE}(\hat{\beta}(\lambda)) := \mathbb{E}\big[\lVert \beta - \hat{\beta}(\lambda) \rVert^{2}\big]$$
is minimal.
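Under the orthonormal-design simplification used in Appendix B, this optimum can be verified numerically: the bias-variance decomposition of $\mathrm{MSE}(\hat{\beta}(\lambda))$ is available in closed form, and a grid search recovers $\lambda^{*} = p\sigma^{2}(\beta^{T}\beta)^{-1}$. The concrete values of $\beta$ and $\sigma^{2}$ below are arbitrary:

```python
import numpy as np

# With an orthonormal design (X^T X = I_p), the ridge estimator is
# beta_hat(lambda) = beta_hat / (1 + lambda), and its MSE decomposes as
# bias^2 + variance, both of which are known analytically.
p, sigma2 = 4, 0.5
beta = np.array([1.0, 0.5, -0.3, 2.0])
btb = beta @ beta

def mse(lam):
    bias2 = (lam / (1 + lam)) ** 2 * btb        # squared bias
    var = p * sigma2 / (1 + lam) ** 2           # total variance
    return bias2 + var

lams = np.linspace(0.0, 2.0, 20001)
lam_best = lams[np.argmin([mse(l) for l in lams])]
lam_star = p * sigma2 / btb  # claimed optimum: p * sigma^2 * (beta^T beta)^{-1}
```

The grid minimizer coincides with the analytic optimum, which also beats the unregularized estimator ($\lambda = 0$).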
1 An MVE network often uses an exponential transformation in the output of the variance neuron to ensure positivity. The network then learns the log of the variance.

B. Scenario 2: Estimating the Log-Variance With a Known Mean
Next, we examine a linear model that estimates the logarithm of the variance. We again have n data points $(x_i, y_i)$, where we assume the log of the variance to be a linear function of the covariates:
$$\log(\sigma^{2}(x_i)) = x_i^{T}\tilde{\beta}.$$
We use the same covariates and for the targets we define
$$z_i := \log\big((y_i - \mu)^{2}\big) - C, \qquad \text{with } C = \psi(1/2) + \log(2),$$
where $\psi$ is the digamma function. This somewhat technical choice for $C$ is made such that
$$z_i = \log(\sigma^{2}(x_i)) + \tilde{\epsilon},$$
where $\tilde{\epsilon}$ has expectation zero and a constant variance. The details can be found in Appendix B. In the same appendix, we repeat the same procedure, i.e., minimizing the mean-squared error with a regularization term, and demonstrate that the optimal regularization constant, $\lambda^{*}$, satisfies
$$\lambda^{*} = p\,\mathbb{V}[\tilde{\epsilon}]\,(\tilde{\beta}^{T}\tilde{\beta})^{-1}.$$
The conclusion is that for these two linear models, which most closely resemble the scenario of regularized neural networks that estimate the mean and log-variance, the optimal regularization constants rely on the true underlying parameters $\beta$ and $\tilde{\beta}$. Since there is no reason to assume that these are similar, there is also no reason to assume that the mean and variance should be similarly regularized.
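The role of the constant $C$ can be verified by simulation. Using $\psi(1/2) = -\gamma - 2\log 2$, we have $C = -\gamma - \log 2 \approx -1.27$, and the empirical mean of $\log(\zeta)$ for $\zeta \sim \chi^{2}(1)$ should match it (a small NumPy check of our own):

```python
import numpy as np

# C = psi(1/2) + log(2); with psi(1/2) = -euler_gamma - 2*log(2),
# this simplifies to C = -euler_gamma - log(2).
C = -np.euler_gamma - np.log(2.0)

# Subtracting C from log((y - mu)^2) centres the noise: E[log chi2(1)] = C.
rng = np.random.default_rng(3)
zeta = rng.chisquare(df=1, size=1_000_000)
emp = np.mean(np.log(zeta))
```

The empirical mean agrees with $C$ up to Monte Carlo error, confirming that $\tilde{\epsilon} = \tilde{\epsilon}^{*} - C$ has expectation zero.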

C. Separate Regularization of the Variance Alleviates the Variance-Overfitting
While the problem of ignoring initially poorly fit regions is still present, proper regularization of the variance can alleviate the harmful overfitting of the variance. To illustrate this effect, we trained 4 MVE networks, without a warm-up period, on a simple quadratic function with heteroscedastic noise. The x-values were sampled uniformly from $[-1, 1]$ and the y-values were subsequently sampled from a $\mathcal{N}(x^2, (0.1 + 0.2x^2)^2)$ distribution. We used the original MVE architecture, which has two sub-networks that estimate the mean and the variance. We used separate $l_2$-regularization constants for both sub-networks in order to be able to separately regularize the mean and the variance. We used the same mean regularization in all networks and gradually decreased the regularization of the variance.

Figure 6 demonstrates the effect of different amounts of regularization of the variance. When the variance is regularized too much, the network is unable to learn the heteroscedastic variance. This is problematic both because the resulting uncertainty estimates will be wrong and because we lose the beneficial effect on the mean that we discussed in the previous subsection. In the second subfigure, the network was able to correctly estimate both the mean and variance. When we decreased the regularization of the variance further, however, we see that the network simply increased the variance on the right side instead of learning the function. When we remove regularization of the variance altogether, the network was completely unable to learn the mean function.

Additionally, we repeated the sine experiment while using a higher regularization constant for the variance than for the mean. In Figure 7, we see that the MVE network is now able to learn the sine function well, even without a warm-up period. We were unable to achieve this when using the same regularization constant for both the mean and the variance.
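The effect of separate penalties can be sketched without a full neural network. Below, linear-in-features models play the role of the two sub-networks on the quadratic toy problem, and only the two $l_2$ constants differ between the parameter groups; the features, learning rate, and step count are arbitrary choices of ours:

```python
import numpy as np

# Quadratic toy data: y ~ N(x^2, (0.1 + 0.2 x^2)^2), x uniform on [-1, 1].
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 500)
y = rng.normal(x ** 2, 0.1 + 0.2 * x ** 2)
Phi = np.stack([np.ones_like(x), x, x ** 2], axis=1)  # shared features

def train(lam_mean, lam_var, steps=6000, lr=0.01):
    # w_m parameterizes the mean, w_v the log-variance; both start at zero,
    # so the initial variance is constant, as the paper recommends.
    w_m, w_v = np.zeros(3), np.zeros(3)
    for _ in range(steps):
        mu, logv = Phi @ w_m, Phi @ w_v
        inv_var = np.exp(-logv)
        r = y - mu
        g_mu = -r * inv_var                   # dNLL/dmu per point
        g_lv = 0.5 - 0.5 * r ** 2 * inv_var   # dNLL/dlogvar per point
        # Each parameter group gets its own l2 penalty constant.
        w_m -= lr * (Phi.T @ g_mu / len(x) + 2 * lam_mean * w_m)
        w_v -= lr * (Phi.T @ g_lv / len(x) + 2 * lam_var * w_v)
    return w_m, w_v

_, wv_heavy = train(lam_mean=1e-4, lam_var=1.0)   # variance strongly penalized
_, wv_light = train(lam_mean=1e-4, lam_var=1e-4)  # variance lightly penalized
```

With a heavy variance penalty, the log-variance weights stay near zero and the predicted variance remains nearly constant; with a light penalty, they grow to fit the heteroscedastic pattern.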

V. EXPERIMENTAL RESULTS
In this section, we experimentally demonstrate the benefit of a warm-up period and separate regularization. In Subsection V-A, we specify the three training strategies that we compare. Subsections V-B and V-C give details on the data sets, experimental procedure, and architectures that we use. Finally, the results are given and discussed in Subsection V-D.

A. Three Approaches
We compare three different approaches:
1) No warm-up: This is the approach that is used in popular methods such as Concrete dropout and Deep Ensembles. The mean and the variance are optimized simultaneously.
2) Warm-up: This is the approach recommended in the original paper. We first optimize the mean and then both the mean and the variance simultaneously.
3) Warm-up fixed mean: We first optimize the mean and then optimize the variance while keeping the mean estimate fixed. We add this procedure to test if optimizing both the mean and the variance after the warm-up further improves the mean estimate.
For each approach, we consider two forms of $l_2$-regularization:
1) Separate regularization: The part of the network that estimates the mean has a different regularization constant than the part of the network that estimates the variance.
2) Equal regularization: Both parts of the network use the same regularization constant.

B. Data Sets and Experimental Procedure
We compare the three approaches on a number of regression UCI benchmark data sets. These are the typical regression data sets that are used to evaluate neural network uncertainty estimation methods (Gal and Ghahramani, 2016;Hernández-Lobato and Adams, 2015;Lakshminarayanan et al., 2017;Pearce et al., 2018).
For each data set we use a 10-fold cross-validation and report the average loglikelihood and RMSE on the validation sets along with the standard errors. For each of the 10 splits, we use another 10-fold cross-validation to obtain the optimal regularization constants. The entire experimental procedure is given in Algorithm 1.
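The nested cross-validation procedure can be sketched as follows; the `score` callback is a placeholder for training an MVE network with a given regularization constant and returning its held-out loglikelihood, and the candidate grid is left to the caller:

```python
import numpy as np

def nested_cv(n, candidates, score, k=10, seed=0):
    """Nested k-fold cross-validation over n data points.

    `score(train_idx, val_idx, c)` should return the validation
    loglikelihood of a model trained with regularization constant c
    (a placeholder for the actual network training).
    Returns, per outer fold, the selected constant and the outer
    validation indices.
    """
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(n), k)
    results = []
    for i in range(k):
        val = outer[i]
        train = np.concatenate([outer[j] for j in range(k) if j != i])
        # Inner CV on the training part only: pick the constant with the
        # highest mean held-out loglikelihood.
        inner = np.array_split(rng.permutation(train), k)
        best = max(candidates, key=lambda c: np.mean(
            [score(np.setdiff1d(train, inner[m]), inner[m], c)
             for m in range(k)]))
        results.append((best, val))
    return results
```

Each outer validation fold is scored with constants selected without ever touching it, which is what allows the outer folds to give an unbiased comparison of the approaches.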

C. Architecture and Training Details
• We use a split architecture, meaning that the network consists of two sub-networks that output a mean and a variance estimate. Each sub-network has two hidden layers with 40 and 20 hidden units and ELU (Clevert et al., 2015) activation functions. The mean-network has a linear transformation in the output layer and the variance-network an exponential transformation to guarantee positivity. We also added a minimum value of $10^{-6}$ for numerical stability.
• All covariates and targets are standardized before training.
• We use the Adam optimizer (Kingma and Ba, 2014) with gradient clipping set at value 5. We found that this greatly improves training stability in our experiments.
• We use a default batch size of 32.
• We use 1000 epochs for each training stage. We found that this was sufficient for all networks to converge.
• We set the bias of the variance to 1 at initialization. This makes sure that the variance predictions are more or less constant at initialization.
Algorithm 1: Our experimental procedure. A 10 fold cross-validation is used to compare the different methods. In each fold, a second 10-fold cross-validation is used to obtain the optimal regularization constants. We use the same splits when comparing approaches.
1 Input: Data set (X, Y);
2 Divide (X, Y) into 10 distinct subsets, denoted $(X^{(i)}, Y^{(i)})$;
3 for i from 1 to 10 do
    Use 10-fold cross-validation (using only $(X_{\text{train}}, Y_{\text{train}})$) to find the optimal regularization constants. This is done by choosing the constants for which the loglikelihood on the left-out sets is highest. The possible regularization constants are

D. Results and Discussion
The results are given in Tables I and II. Bold values indicate that, for that specific training strategy (no warm-up, warm-up, or warm-up fixed mean), there is a significant difference between equal and separate regularization. This means that every row can have up to three bold values. Significance was determined by taking the differences per fold and testing whether the mean of these differences is significantly different from zero using a two-tailed t-test at a 90% confidence level. We see that a warm-up is often very beneficial. For the yacht data set, we observe a considerable improvement in the RMSE when we use a warm-up period. A warm-up also drastically improves the result on the energy data set when we do not allow separate regularization.
Generally, the difference between keeping the mean fixed after the warm-up and optimizing the mean and variance simultaneously after the warm-up is less pronounced. For a few data sets (Concrete, Kin8nm, Protein) we do observe a considerable difference in root-mean-squared error if we only consider equal regularization. If we allow separate regularization, however, these differences disappear.
A separate regularization often drastically outperforms equal regularization. The energy data set gives the clearest example of this. For all three training strategies, a separate regularization performs much better than an equal regularization of the mean and variance. A similar pattern can be seen for the yacht data set. The optimal regularization for the variance was typically similar to, or an order of magnitude larger than, the optimal regularization of the mean, never lower. We would like to stress that statistically significant results are difficult to obtain with only 5 to 10 folds, but the pattern emerges clearly: separate regularization often improves the results while never leading to a significant decline.
Equal regularization and no warm-up perform as well as the other strategies for some data sets, although never considerably better. For Boston Housing, for example, using a warm-up and separate regularization yields very similar results as the other strategies. This can happen since the problem may be easy enough that the network is able to simultaneously estimate the mean and the variance without getting stuck. Additionally, while there is no reason to assume so a priori, the optimal regularization constant for the mean and the variance can be very similar. In fact, for the Boston Housing experiment we often found the same optimal regularization constant for the mean and variance during the cross-validation.

VI. CONCLUSION
In this paper, we tested various training strategies for MVE networks. Specifically, we investigated if following the recommendations of the original authors solves the recently reported convergence problems and we proposed a novel improvement, separate regularization.
We conclude that the use of a warm-up period is often essential to obtaining the best results and fixes the convergence problems. Contrary to what classical theory suggests, we do not observe a significant advantage to estimating the mean and variance simultaneously after a warm-up as opposed to estimating the mean while assuming a constant variance and fixing it afterwards. Optimizing the mean while accounting for heteroscedastic noise was seemingly only beneficial when separate regularization was not allowed.
We argued that we expect to need different regularization constants for the mean and variance. There are no reasons to assume that the mean and variance functions are equally complex and we therefore should not expect a similar regularization constant. We experimentally demonstrated that a separate regularization constant indeed often leads to large improvements.

A. Recommendations
Based on our experiments, we have come to the following recommendations when training an MVE network:
• Use a warm-up, as also suggested by the original authors. It is important to initialize the variance such that it is more or less constant for all inputs; otherwise, some regions may be neglected. This is easily achieved by setting the bias of the variance neuron to 1 at initialization.
• Use gradient clipping. We found gradient clipping to yield more stable optimization when optimizing the mean and variance simultaneously.
• Use separate regularization for the mean and variance. If a hyperparameter search is computationally infeasible, we found that the variance should typically be regularized an order of magnitude more strongly than the mean.

[Caption of Tables I and II: The average loglikelihoods of the 10 cross-validation splits along with the standard errors. For each split, the optimal regularization constants were obtained with a second 10-fold cross-validation. We used 5-fold cross-validation for the larger Kin8nm and Protein data sets. Bold values indicate a significant difference between equal and separate regularization at a 90% confidence level. Equal regularization never performs significantly better than separate regularization.]

B. Future Work
The results on separate regularization indicate that estimating the variance and estimating the mean are often not equally difficult problems. It is therefore likely not optimal to use a similar architecture and training procedure for both. It would be interesting to investigate if the use of a separate architecture and training procedure leads to further improvements. On a more general note, it may be very worthwhile to get better estimates of the data noise variance. A vast amount of very intricate work is being done to improve the model uncertainty estimate, with approaches such as dropout, ensembling, and Bayesian neural networks, to name a few. For the predictive uncertainty, however, it may be just as important, or even more important, to properly estimate the variance of the noise. This, of course, highly depends on the data set in question, but it could be that the uncertainty due to a heteroscedastic variance has much more influence on the predictive uncertainty than the model uncertainty. We therefore think that it is very worthwhile to investigate the optimal way to estimate the variance of the (possibly non-Gaussian) noise.

APPENDIX A DETAILS ON THE ADVANTAGE OF TAKING THE VARIANCE INTO ACCOUNT FOR A LINEAR MODEL
In this appendix, we provide additional details for the claims regarding linear models in the main text. All derivations for the statements in this paper regarding linear models can be found in van Wieringen (2015).
To reiterate, we assume that we have a data set consisting of n data points $(x_i, y_i)$, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. With $X$, we denote the $n \times p$ design matrix which has the n covariate vectors $x_i$ as rows. With $Y$, we denote the $n \times 1$ vector containing the observations $y_i$. We assume $X$ to be of full rank.
We consider the linear model
$$Y = X\beta + U, \qquad U \sim \mathcal{N}(0, \Sigma), \qquad (2)$$
where $\Sigma$ can be any invertible covariance matrix, possibly heteroscedastic and including interaction terms. Suppose this covariance matrix is known; then classical theory tells us that it is beneficial for our estimate of $\beta$ to take this into account. To see this, we will compare the linear model in Equation (2) with a rescaled version that takes the covariance matrix into account. Since $\Sigma$ is positive definite, we can write it as $BB^{T}$ and rescale our model by multiplying with $B^{-1}$:
$$B^{-1}Y = B^{-1}X\beta + B^{-1}U.$$
Both formulations lead to different estimators of $\beta$. The unscaled formulation leads to
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y,$$
and the rescaled formulation leads to
$$\hat{\beta}^{*} = (X^{T}\Sigma^{-1}X)^{-1}X^{T}\Sigma^{-1}Y.$$
Both estimators are linear unbiased estimators of $\beta$. However, the Gauss-Markov theorem (Gauss, 1823), see Dodge (2008) for a version that is not in Latin, tells us that the variance of $\hat{\beta}^{*}$ is lower than the variance of $\hat{\beta}$.
Gauss-Markov Theorem (Gauss, 1823). In the notation introduced above, consider the linear model $Y = X\beta + U$. Under the following assumptions:
1) $\mathbb{E}[U] = 0$,
2) $\mathbb{V}[U] = \sigma^{2}I_{n}$,
3) $X$ is of full rank,
the ordinary least squares (OLS) estimator for $\beta$, $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$, has the lowest variance of all unbiased linear estimators of $\beta$, i.e., the difference of the covariance matrix of any unbiased linear estimator and the covariance matrix of the OLS estimator is positive semi-definite.
We note that in the rescaled formulation all the conditions of the theorem are met. We therefore know that $\hat{\beta}^{*}$ has the lowest variance of all unbiased linear estimators of $\beta$, and thus, in particular, we know that it has a lower variance than $\hat{\beta}$.
We want to emphasize that this leads to improved metrics such as the RMSE. Let us for instance look at the difference between the expected quadratic errors of a new pair $(x_{\text{new}}, y_{\text{new}})$ when using $\hat{\beta}$ and $\hat{\beta}^{*}$:
$$\begin{aligned}
\mathbb{E}\big[(y_{\text{new}} - x_{\text{new}}^{T}\hat{\beta})^{2}\big] - \mathbb{E}\big[(y_{\text{new}} - x_{\text{new}}^{T}\hat{\beta}^{*})^{2}\big] &= \mathbb{V}\big[x_{\text{new}}^{T}\hat{\beta}\big] - \mathbb{V}\big[x_{\text{new}}^{T}\hat{\beta}^{*}\big] \\
&= x_{\text{new}}^{T}\big(\Sigma_{\hat{\beta}} - \Sigma_{\hat{\beta}^{*}}\big)x_{\text{new}} \\
&\geq 0,
\end{aligned}$$
where we used that $\hat{\beta}$ and $\hat{\beta}^{*}$ are both unbiased and independent of $y_{\text{new}}$. In the final line, we applied the Gauss-Markov theorem, which guarantees that $\Sigma_{\hat{\beta}} - \Sigma_{\hat{\beta}^{*}}$ is a positive semi-definite matrix.

APPENDIX B DETAILS ON THE DIFFERENT OPTIMAL REGULARIZATION CONSTANTS FOR LINEAR MODELS
We consider two linear models that most closely resemble the scenario of an MVE network. The first model will estimate the mean while knowing the variance, and the second model will estimate the log of the variance while knowing the mean. An MVE network often uses an exponential transformation in the output of the variance neuron to ensure positivity; the network then learns the log of the variance. We show that both models will generally have a different optimal regularization constant.
A. Scenario 1: Estimating the Mean With a Known Variance

We use the same notation as in the previous example and assume a homoscedastic noise with variance $\sigma^{2}$. If we do not consider regularization, the goal is to find the estimator that minimizes the sum of squared errors,
$$\lVert Y - X\beta \rVert_{2}^{2}.$$
The solution is given by
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y,$$
for which we know that $\mathbb{E}[\hat{\beta}] = \beta$ and $\mathbb{V}[\hat{\beta}] = \Sigma_{\hat{\beta}} = \sigma^{2}(X^{T}X)^{-1}$.
In particular, we used that $\mathbb{E}[\epsilon] = 0$ and $\mathbb{V}[\epsilon] = \sigma^{2}$. When we add a regularization constant, $\lambda$, the objective becomes to minimize
$$\lVert Y - X\beta \rVert_{2}^{2} + \lambda \lVert \beta \rVert_{2}^{2}.$$
The solution to this problem is given by
$$\hat{\beta}(\lambda) = (X^{T}X + \lambda I_{p})^{-1}X^{T}Y.$$
For simplicity, we assume that we have an orthonormal basis, in which case $X^{T}X = I_{p}$. This does not change the essence of the argument and makes the upcoming comparison clearer. Our new estimator is no longer unbiased but has a lower variance:
$$\mathbb{E}[\hat{\beta}(\lambda)] = \frac{1}{1+\lambda}\beta \quad \text{and} \quad \mathbb{V}[\hat{\beta}(\lambda)] = \sigma^{2}(1+\lambda)^{-2}I_{p}.$$
Our goal is to answer the question what the optimal value for $\lambda$ is. We define optimal as the $\lambda$ for which
$$\mathrm{MSE}(\hat{\beta}(\lambda)) := \mathbb{E}\big[\lVert \beta - \hat{\beta}(\lambda)\rVert^{2}\big]$$
is minimal. Theobald (1974) showed that there exists $\lambda > 0$ such that $\mathrm{MSE}(\hat{\beta}(\lambda)) < \mathrm{MSE}(\hat{\beta})$. Typically, the exact value of $\lambda$ is unknown, but in our controlled example, we can derive the optimal value analytically:
$$\lambda^{*} = p\sigma^{2}(\beta^{T}\beta)^{-1}.$$

B. Scenario 2: Estimating the Log Variance With a Known Mean
Next, we examine a linear model that estimates the logarithm of the variance. We again have n data points $(x_i, y_i)$, where we assume the log of the variance to be a linear function of the covariates:
$$\log(\sigma^{2}(x_i)) = x_i^{T}\tilde{\beta}.$$
We use the same covariates and for the targets we define
$$z_i := \log\big((y_i - \mu)^{2}\big) - C, \qquad \text{with } C = \psi(1/2) + \log(2),$$
where $\psi$ is the digamma function. This somewhat technical choice for $C$ is made such that
$$z_i = \log(\sigma^{2}(x_i)) + \tilde{\epsilon},$$
where $\tilde{\epsilon}$ has expectation zero and a constant variance, as can be seen from the following derivation:
$$\begin{aligned}
\log\big((y_i - \mu)^{2}\big) &= \log\left(\sigma^{2}(x_i)\,\frac{(y_i - \mu)^{2}}{\sigma^{2}(x_i)}\right) \\
&= \log(\sigma^{2}(x_i)) + \log\left(\frac{(y_i - \mu)^{2}}{\sigma^{2}(x_i)}\right) \\
&= \log(\sigma^{2}(x_i)) + \log(\zeta), \qquad \zeta \sim \chi^{2}(1) \\
&= \log(\sigma^{2}(x_i)) + \tilde{\epsilon}^{*}.
\end{aligned}$$
The random variable $\tilde{\epsilon}^{*}$ has an expectation $C$ and a constant variance that does not depend on $\mu$ or $\sigma^{2}(x)$ (Pav, 2015). The key result of this specific construction of $z$ is that we have recovered a linear model with additive noise that has zero mean and a constant variance. This allows us to repeat the procedure from the previous subsection, i.e., minimizing the sum of squared errors with a regularization term. We obtain the following optimal regularization constant:
$$\lambda^{*} = p\,\mathbb{V}[\tilde{\epsilon}]\,(\tilde{\beta}^{T}\tilde{\beta})^{-1}.$$
The conclusion is that for these two linear models, which most closely resemble the scenario of regularized neural networks that estimate the mean and log-variance, the optimal regularization constants rely on the true underlying parameters $\beta$ and $\tilde{\beta}$.