Generalised additive dependency inflated models including aggregated covariates

Let us assume that X, Y and U are observed and that the conditional mean of U given X and Y can be expressed via an additive dependency of X, λ(X)Y and X + Y for some unspecified function λ. This structured regression model can be transferred to a hazard model or a density model when applied on some appropriate grid, and has important forecasting applications via structured marker dependent hazards models or structured density models including age-period-cohort relationships. The structured regression model is also important when the severity of the dependent variable has a complicated dependency on waiting times X, Y and the total waiting time X + Y . In case the conditional mean of U approximates a density, the regression model can be used to analyse the age-period-cohort model, also when exposure data are not available. In case the conditional mean of U approximates a marker dependent hazard, the regression model introduces new relevant age-period-cohort time scale interdependencies in understanding longevity. A direct use of the regression relationship introduced in this paper is the estimation of the severity of outstanding liabilities in non-life insurance companies. The technical approach taken is to use B-splines to capture the underlying one-dimensional ∗Research of Young K. Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2015R1A2A2A01005039 and NRF-2018R1A2B6001068). †Research of Enno Mammen was supported by Deutsche Forschungsgemeinschaft through the Research Training Group RTG 1953. ‡Research of Jens P. Nielsen was supported by the Institute and Faculty of Actuaries, London, UK. §Research of B. U. Park was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2015R1A2A1A05001753).


Introduction
Let us assume that X, Y and U are observed and that the conditional mean of U given X and Y can be expressed via an additive dependency of X, λ(X)Y and X + Y for some unspecified funtion λ, leading to the following mathematical definition of the "Generalised Additive Dependency Inflated Model including Aggregated Covariate" (GADIMAC): for a link function where the constant m 0 and the functions m 1 , m 2 , m 3 , λ are unknown.Notice that this is a special case of the generalised structured regression model considered in Mammen and Nielsen (2003).In the case where U is the number of events within a suitable grid of X and Y , the conditional mean of U is essentially a density and one can use the GADIMAC model to identify and analyse the density version of the age-period-cohort model.Without the time acceleration λ(X), the density age-period-cohort model is known to be hard to visualize, analyse and forecast, because the entering effects are only identifiable up to a line, see Kuang et al. (2008aKuang et al. ( ,b, 2011) ) and Antonczyk et al. (2017).Also, one is left with complicated second order differences in the discrete case and complicated second order derivatives in our continuous case when working with canonical and well-defined parametrisation, see Nielsen and Nielsen (2014), O'Brien (2014), Riebler et al. (2012), Smith and Wakefield (2016) and Beutner et al. (2017).
The relevant age-period-cohort density version of GADIMAC is used to forecast the future asbestos-related deaths in the United Kingdom in the application in Section 4.3.Asbestos mortality data is characterized by its lack of proper exposure data and its complicated dependency structures on the entering time effects, see Peto et al. (1995) for an early approach and Hodgson et al. (2005), Rake et al. (2009), and Tan et al. (2010Tan et al. ( , 2011) ) for later approaches building various micro models to overcome the lack of exposure data.The recent approach by Martinez-Miranda et al. (2015, 2016) uses updated data and is simpler because exposure is directly modelled and estimated from the observed deaths.This paper adds to this latter approach by including further time scales, forecasting the peak of asbestos related deaths in the UK to be 2572 in the year 2018 and the total future UK asbestos-related deaths until the year 2032 to be forty eight thousands.
There are many potential applications of GADIMAC.One further potential application provides a solution to an omnipresent challenge in non-life insurance when estimating the severity of outstanding liabilities.Here X is the waiting time from an insurance claim has happened till it is reported, Y is the waiting time from the claim being reported till its final settlement and U is the size of the claim.Another potential further application is within the current and important theme of longevity estimation, where X, Y and X + Y represent cohort, age and period and U represents raw occurrence divided by exposure of some grid-points of discretised X's and Y 's.In the longevity forecasting case the conditional mean of U is approximately equal to a two-dimensional hazard function as a function of cohort X and age Y .The GADIMAC model introduced by this paper provides a structured nonparametric representation of both past and future mortality when the calendar effect of X + Y is extrapolated into the future.An authoritative exposition of current efforts on longevity forecasting could be the six models reviewed in Cairns et al. (2011), where a discrete time series is used to forecast calendar effects in all models as have become standard, see also some of the original proposals of Lee and Carter (1992), Lee and Miller (2001) and Renshaw and Haberman (2006).The GADIMAC model could introduce a welcome alternative modeling deterministic trends first before time series or other uncertainties complicate the visual and analytic impression of future mortality.
This paper develops theory identifying the GADIMAC model and introduces estimation techniques and asymptotic theory of GADIMAC models based on B-splines.Finite sample simulation studies show good performance of our new methodology and the usefulness of the new class of regression models is illustrated via a timely application to forecasting future asbestos related deaths in the UK.In Section 2 below the B-spline based estimation of the nonparametric structured model is constructed and the asymptotic theory is derived.Identifiability is discussed in Section 3. Practical implementation is considered in Section 4 with an implementation guide in Section 4.1 and finite sample simulation studies in Section 4.2 showing good performance of the estimation of the model.Finally the important practical forecast of future UK asbestos related deaths is given in Section 4.3.

Estimation and asymptotic properties of GADIMAC models
We assume that one observes independent real valued random variables U i for 1 ≤ i ≤ n with mean μ(x i , y i ) where x 1 , . . ., x n , y 1 , . . ., y n are some deterministic design points on the real line.The regression function μ(x, y) satisfies for some unknown functions m 1 , m 2 , m 3 , λ and for a known invertible link function G. Later, we will also apply this model to a random design case where one observes i.i.d.copies (X i , Y i , U i ) of (X, Y, U ) such that ((2.1)) holds for the conditional mean μ(x, y) = E(U |X = x, Y = y).We assume that the tuples (x i , y i ) lie in a connected subset I of a two-dimensional bounded rectangle.
Let I 1 denote the projection of I onto the x-axis, I 2 (λ) = {λ(x)y : (x, y) ∈ I} and I 3 = {x + y : (x, y) ∈ I}.In this section we discuss the estimation of the regression function μ with this structure using a set of observations U i and design points (X i , Y i ) ∈ I.We will show that the function μ can be estimated with a one-dimensional nonparametric rate.Throughout this paper, we assume that I contains a rectangle [α, β]× [0, γ] for some β > α > 0 and γ > 0. Without loss of generality, we take α = 0, since otherwise we may shift I along the x-axis and redefine the component functions m 1 , m 3 and λ accordingly.
Later we will use a simplified version of μ in our numerical studies, where we replace M by a subspace containing only linear λ and spline functions m j .
In our theory and also in our numerical studies we will choose The following theorem shows that, under the assumption that the functions m 1 , m 2 , m 3 and λ allow derivatives up to order k, the choice ρ n n −2k/(2k+1) leads to an estimator μ that achieves a one-dimensional nonparametric rate.
Theorem 1.For the components (m 0 , m 1 , m 2 , m 3 , λ) ∈ M of the underlying regression function suppose that m 1 , m 2 , m 3 and λ have derivatives of order k with bounded L 2 -norm, where k is the constant in (2.3).Furthermore, we assume that G has an absolutely bounded derivative which is continuous at the point m 0 with G (m 0 ) > 0, that the distributions of the error variables Furthermore, we assume that there exists a sequence δ n → 0 with n k/(2k+1) δ n → ∞ such that the number of indices i with |x i | ≤ δ n and |y i | ≤ δ n can be bounded from below by C δ nδ 2 n for some constant C δ > 0.Then, it holds with We note that this result does not imply that the functions m 1 , m 2 , m 3 and λ can be estimated with the rate O P (n −k/(2k+1) ).A first step to such kind of results are identification results for our model that we discuss in the next section, which constitute the main contributions of this paper.Now for the random design case, let m = ( m0 , m1 , m2 , m3 , λ) be defined as the minmizer of Then, we obtain the following analogue of Theorem 1.
Theorem 2. Suppose that one observes i.i.d.copies (X i , Y i , U i ) of (X, Y, U ) for 1 ≤ i ≤ n, where μ(x, y) = E(U |X = x, Y = y) satisfies (2.1) and the distribution of (X, Y ) is supported on I.We assume that the conditions of Theorem 1 on m 1 , m 2 , m 3 , λ and G hold, and that the conditional distributions of the error Furthermore, we assume that there exists a sequence 2k+1) ).

Identification of GADIMAC models
We discuss identification of the model.Suppose that there exists a consistent estimator μ of the function μ defined by The question is if this implies that there exist consistent estimators of the functions m 1 , m 2 , m 3 and λ.We study the following question: Given a function (x, y) → μ(x, y), does this function identify m 1 , m 2 , m 3 and λ up to a constant in the model (3.1)?Note that the same question arises in in-sample density forecasting where the model f (x, y) = f 1 (x)f 2 (λ(x)y)f 3 (x+y) can be transferred to this model by putting μ(x, y) We say that m 1 , m 2 , m 3 and λ in (3.1) are identified up to a constant if μ(x, y) = G m1 (x) + m2 ( λ(x)y) + m3 (x + y) for some functions m1 , m2 , m3 and λ implies that for some real numbers c 1 , c 2 , and c 3 .We first discuss identification in case λ is known, and then in case λ is a linear function.We treat a more general case at the end.

The case of known λ
Let us assume that λ is known.The following theorem demonstrates that, in this case, m j are identified up to a constant under some smoothness conditions and conditions on the shape of λ.This theorem serves as a basic building block for the discussion of the identification of m j and λ in more general cases.For the formulation of the theorem we need the following additional notation.Denote by I 0 the interior of I. Put I 0 1 = {x : (x, y) ∈ I 0 for some y}, I 0 2 = {λ(x)y : (x, y) ∈ I 0 }, and I 0 3 = {x + y : (x, y) ∈ I 0 }.Theorem 3. Suppose that λ is a fixed continuously differentiable strictly positive function that is not equal to the function x → (ξ 1 + ξ 0 x) −1 for some ξ 0 , ξ 1 ∈ R. In particular, we do not allow that λ is a constant function.Assume that the closures of I 0 1 , I 0 2 and I 0 3 are equal to I 1 , I 2 = I 2 (λ) and I 3 , respectively and that I 0 1 and I 0 3 are connected sets.Furthermore, we assume that m 1 , m 2 and m 3 are twice continuously differentiable and that G is strictly monotone.Then in model (3.1), m 1 , m 2 , and m 3 are identified up to a constant.
Thus for known λ we have a clear picture.Given some smoothness conditions, the model is identified up to constants if and only if λ is not equal to the function x → (ξ 1 + ξ 0 x) −1 for some ξ 0 , ξ 1 ∈ R. Note that for constant λ, the functions m 1 , m 2 and m 3 are only identified up to addition or substraction of linear functions, see Antonczyk et al. (2017).If λ is not equal to the function x → (ξ 1 + ξ 0 x) −1 for some ξ 0 , ξ 1 ∈ R and if we put some constraints on m 1 , m 2 and  m 3 such as m 1 (0) = m 2 (0) = m 3 (0) = 0, then m 1 , m 2 and m 3 are completely identified.On the other side, if λ(x) is equal to (ξ 1 + ξ 0 x) −1 for some ξ 0 , ξ 1 ∈ R then such a constraint does not lead to unique identification.To see this one can choose arbitrary choices of m 1 , m 2 and m 3 and one can define Then one can easily verify that for all choices of ξ ∈ R it holds that Thus we have no unique identification.

The case of unknown linear λ
Recall that, in our setting, I contains a nontrivial rectangle [0, β] × [0, γ] for some β, γ > 0. We note that this implies that J 1 (0 We consider the following constraints: The first three conditions can be always achieved by redefining the functions m 1 , m 2 and m 3 .For the fourth condition this can be done if m 2 (0) > 0. If m 2 (0) < 0, we consider the link G(z) = G(−z) so that the model (3.1) can be written as We consider the case where λ is an unknown linear function λ(x) = ax + b with a = 0. We note that, if a equals zero, then the functions m j are not identifiable.The following theorem demonstrates that λ is identifiable under these conditions if m 2 is not linear, and that λ is identifiable only up to a constant if m 2 is linear.In the case where m 2 is linear, it follows that m 2 (z) = z from the constraints on m 2 in (3.2).This means that the model (3.1) reduces to In this case, we may have different sets of (m 1 , m 3 , b) that give the same function μ.For example, we have we assume that m 1 is differentiable, m 2 is two times continuously differentiable and G is strictly monotone.Then in model (3.1) under the constraints (3.2), m 1 , m 2 , m 3 and λ are identified if m 2 is not linear.In case m 2 is linear, the function λ is identified up to a constant.
We now give a heuristic explanation why m 1 , m 2 , m 3 and λ can be consistently estimated under the assumptions of Theorem 4. For a rigorous proof of identifiability of the functions we refer to the appendix.We make the additional assumption that there exists an estimator ν of ν(x, y) , where f x (x, y) and f y (x, y) for a bivariate function f (x, y) denote its partial derivatives with respect to x or y, respectively.For the case where G is the identity function, such an estimator can be achieved by kernel smoothing of the estimator μ, i.e.
for x and y in the interior of the support of μ and with boundary corrections of the kernel K at the boundary of the support.Let for ax + b = 0 and q(x, y) = 0 for ax + b = 0. We have that uniformly in x and y Furthermore, by more lengthy but straightforward calculations one can show that For linear functions m 2 we have q(x, y) ≡ 0 under our norming conditions.For nonlinear m 2 , the functions q(x, y) and y 2 are linearly independent.Thus (3.6) can be used to get a consistent estimator â of a in the case of a linear function m 2 , and consistent estimators (â, b) of (a, b) can be achieved in the case of a nonlinear function m 2 .In the latter case we can replace a and b by â and b in (3.5) which gives a consistent estimator of m 3 (x).Integration of this estimator results in a consistent estimator m3 of m 3 .Using this estimator we get with the help of (3.4) a consistent estimator of m 1 .Finally using all these estimators we can use (3.3) and we get a consistent estimator of m 2 .Thus, we can consistently estimate all component functions m 1 , m 2 , m 3 and λ.Note that for this estimation we only need estimators of ν and of the integrals of partial derivatives of ν.In Theorem 1 we have argued that ν can be estimated with a one-dimensional nonparametric rate.We conjecture that, as in many other nonparametric models, the integrals of partial derivatives of ν can be estimated with the same rate as ν, see Lee et al. (2017) for example.This would imply that m 1 , m 2 , m 3 and λ can be estimated with a one-dimensional nonparametric rate.

More general cases
We now come to a more general case that includes nonlinear λ.The next lemma is a first step for analyzing identification of the component functions in the general case.We continue to make the norming constraints (3.2).
Lemma 1. Suppose that the functions m 2 and m2 are continuously differentiable, that the functions m 1 , m 2 , m 3 and m1 , m2 , m3 fulfill the norming constraints (3.2), and that the functions λ and λ with m j and mj satisfy m 1 (x) + m 2 (λ(x)y) + m 3 (x + y) = m1 (x) + m2 ( λ(x)y) + m3 (x + y).(3.7)Also, we assume that λ(0) = 0.Then, for all (x, y) ∈ I with x+y ∈ J 1 (0)∩J 2 (0) and λ(x)y/ λ(0) ∈ J 1 (0) ∩ J 2 (0), it holds that (3.8) We make use of Lemma 1 for identification of unknown λ under the assumption that λ is linear in a small neighborhood of zero.Specifically, we assume that λ(0) = 0 and there exists ε > 0 such that (3.9) The assumption is clearly more general than the one in Theorem 4. It only requires linearity in an arbitrarily small neighborhood of zero and thus allows a general class of nonlinear functions λ.The following theorem demonstrates that λ is identifiable within this wider class if m 2 is not linear in any neighborhood of zero and if J 1 (0) ∩ J 2 (0) equals I 1 , the domain of λ.The theorem is based on the same set of conditions for m j in Theorem 4. The condition that m 2 is not linear in any neighborhood of zero is implied by the condition that m 2 (0) = 0 if m 2 is continuous.

Finite sample studies and an application forecasting asbestos related deaths in UK
This section first introduces in Section 4.1 the necessary considerations when implementing the method in practice, and Section 4.2 presents finite sample simulation studies showing good performance of our B-spline estimation approach.The important forecasting result on future asbestos related deaths are presented in Section 4.3 with some new and interesting forecasts that might be of interest for some policy makers, health economists or non-life insurers.Note that the methods considered here do not strictly belong to the recently defined class of "in-sample forecasters".To qualify as an in-sample forecaster, the forecast should be a function of one-dimensional functions that are fully estimated in-sample, see Martinez-Miranda et al. (2015), Lee et al. (2015Lee et al. ( , 2017) ) and Hiabu et al. (2016).However, the calendar effect has to be extrapolated when applying GADIMAC to the modeling of the age-period-cohort relationship of asbestos related deaths, disqualifying the approach as in-sample forecasting.The necessary extrapolation of the calendar effect does compromise the otherwise simple interpretation of the GADIMAC model making the forecasting exercise non-trivial and non-automatic.There will be more comments on this in Section 4.3 below.

Practical implementation of GADIMAC models
Before we present the results of our simulation study and real data example, we describe briefly how we get the estimators that minimize the objective function (2.4).For simplicity, we choose λ to be linear and thus assume λ(x) = θ 0 +θ 1 x for some unknown θ 0 and θ 1 .The procedure may be generalized to any parametric function or even to a nonparametric model for λ.
We are to minimize the objective function at (2.4) over (m 0 , θ 0 , θ 1 ) ∈ R 3 and (m 1 , m 2 , m 3 ) with each m j in the space of cubic B-splines (k = 2) satisfying the constraints Instead of the penalty at (2.3) we use a simpler version given by where Tj (m) = Ij m (k) (x) 2 dx for j = 1, 3, and T2 (λ) = I1 λ(x) 2 dx.For the penalty at (4.2) we omit m j (u) 2 du for 1 ≤ j ≤ 3 in (2.3) since we may show that Theorem 1 remains to hold under this change at the cost of much more involved arguments based on a decomposition of the functions mj as sums of a polynomial and a smooth function.In an additional technical step we have to show that the polynomials can be estimated with the parametric rate n −1/2 .This would be more involved than the discussion of smoothing splines in van de Geer (2000) because of the nonlinear nature of our model.Furthermore, we omit λ (u) 2 du + λ (k) (u) 2 du in T 2 (λ) because we use parametric linear fits for λ.
The minimisation problem is nonlinear since we have a link function G.Even if we choose the identity link, it is still a nonlinear optimization problem since θ 0 and θ 1 enter the objective function in the arguments of the basis functions for cubic B-splines.We suggest an iteration scheme, which we actually employed in simulation study and real data example.The procedure starts by initializing θ0 and θ1 .Then, (i) find m0 and the B-spline coefficients of mj under the constraints (4.1) that minimize (2.4); (ii) after we get the estimators m0 and mj , we update θ0 and θ1 by minimizing (2.4) with the estimated mj being plugged into (2.4).The minimization problem in (i) requires another iteration if G is not the identity link, while the updating task in (ii) is nonlinear and thus needs an iteration even if G is the identity link.
We iterate the above procedures (i) and (ii) until convergence.We stop the iteration when the changes in mj are sufficiently small.To be more specific, let m[ ] j for 0 ≤ j ≤ 3 and λ[ ] denote the component estimators at the th iteration step.When falls below a threshold value, we stop the iteration for mj taking mj = m[ ] j and then find the final update λ[ ] for λ by minimizing (2.4).This iteration scheme worked very well in our numerical studies.As the threshold value in the stopping criterion, we chose 10 −4 .
We found that the MISE (Mean Integrated Squared Error) properties of the proposed estimators do not depend much on the choice of the initial values of θ0 and θ1 .The results presented in Table 1 are for the choice θ0 = 1.3 and θ1 = 0.5.For these results, we also used equally spaced knots and chose the number of knots that minimizes the mean integrated squared error where μ(x, y) = m0 + m1 (x) + m2 ( λ(x)y) + m3 (x + y).The table includes the values of MISE, as defined above, IV (Integrated Variance) and ISB (Integrated Squared Bias) defined by It also reports those values for each component estimator mj , where for m0 they are the values of E( m0 − m 0 ) 2 , var( m0 ) and (E m0 − m 0 ) 2 , respectively.
The results in Table 1 support that our proposed method works very well for finite sample sizes.Overall, the values of MISE, IV and ISB of the regression function estimator μ(x, y) decrease very fast as the sample size n increases, or as the noise level σ 2 gets smaller.For the component function estimators mj the values of MISE also decreases rather fast as n increases, except m2 .For the latter component the MISE goes down slowly.If we investigate the numbers more closely, we find that the bias of m2 stays unchanged or even gets larger slightly as n increases, which results in the slow decline in MISE.We think this is partly owing to the structural complexity that the function m 2 enters our model in the form of m 2 (λ(x)y) with another unknown function λ.The bias of the estimator comes from the inaccuracy of the spline approximation, which may be hard to reduce, for this particular component, by choosing the tuning parameters depending on the sample size and noise level.Indeed, we also find in the table that the bias of m2 does not improve as the noise level σ 2 decreases.The biases of other component estimators mj do not change much as the sample size or the noise level changes.However, the variances of all component estimators mj decrease as n increases or σ 2 decreases.

Forecasting asbestos related deaths in the UK
As an example of implementing our method, we considered the UK mesothelioma mortality dataset.It consists of the counts of deaths caused by exposure to asbestos, given by year  and age (25-94) at the time of death.The total number of the deaths during the period and in the range of age is 46,348.Basically, for this dataset one may take the variable x to be the cohort and y the age of death.Thus x = (year of death) − y.To put the support of where λ is a linear function.In general, it is not easy to check if the dataset comes from a distribution that satisfies the subexponentiality condition in Theorem 1.However, the condition holds if U (k, l) for each given (k, l) follows a Poisson distribution, which one typically assumes for count data such as ours.To choose K j , the numbers of knots for the cubic B-splines in the approximation of m j , we employed a cross validatory (CV) criterion.For the CV criterion, we chose 100 × α % among those ages in {25, 26, . . ., 94} for each calendar year l.Call the set of chosen integers V l .We then fitted the model at (4.4) with those remaining 100 × (1 − α) % of the data on the grid Denote the estimated constant and component functions by mT j and λT .We computed In our application we chose α = 0.1 in the above CV criterion.The penalty constants ρ j,n were set to ρ n = 0.0022.Actually, we found that the associated Hessian matrix of the quadratic objective function at (2.4) was not invertible for too small values of ρ n .The results of the application of our method to the mortality data are shown in Figures 1.We used the estimated model to forecast the death counts in the future years 2012 + δ for δ ≥ 1.For this we considered only the cohort group who were born during the years 1886-1987, among whom the numbers of deaths in the future years were our target for forecasting.The set of transformed (x, y) for those who will die in the year 2012 + δ is We computed the forecasted number for the year 2012 + δ by the formula Note that our estimate m3 is nonparametric and thus it is defined only on the range of x + y, which is the interval [0, 101].We used a quadratic extrapolation of m3 for future times x+y > 101 in the forecasting formula.We briefly compare this forecasting methodology to the three forecast options I 0 , I 1 and I 2 from the discrete age-period-cohort model considered in the applied claims study of Kuang et al. (2011).Had we chosen to use a line instead of a quadratic extrapolation then it would correspond to the I 0 forecast as defined in the latter paper.Had we used a minimum of recent information to estimate our quadratic extrapolation, then it would have corresponded to the I 2 forecast of the latter paper that defines I 1 as the extrapolated line based on a minimum of recent information.In our method we used all available information on the calendar time to estimate a quadratic extrapolation.

Estimates of the three component functions m j and the time transformation λ obtained by applying the model (4.4) to the asbestos mortality data.
The forecasting result is shown in Figure 2. The peak year is 2018 and the peak number of deaths is 2,572.The total number of deaths during the period 2013-2032 is predicted to be 48,007.Martinez-Miranda et al. (2016) worked on the male-subset of the same data set and estimated male asbestos related deaths to peak in the year 2017 and to total around 2,100 deaths that year.Our predictions are at a higher level because both males and females are considered.However, the two studies do seem to be in a reasonable relationship to each other given that more males than woman seemed to have been exposed to asbestos at life threatening levels.While it is beyond the scope of this paper, it would be interesting to undertake a detailed applied statistical study, where male, female and joint male-female mortality rates of asbestos related deaths in the UK are closely investigated and where the methodology of the current paper and that of Martinez-Miranda et al. (2016) are implemented, compared and discussed.Also, it would be interesting to collect data up till and including 2017 to see whether short term forecasting of the two alternative methods do indeed work around the peak year of number of deaths.
In Figure 1 we notice that time acceleration λ is decreasing indicating that time goes slower implying increasing lifetimes in later calendar years.There is a complicated interaction between the mortality with age component m 2 that has a reasonable mortality shape and then complicated m 1 and m 3 functions that rapidly increase and decrease respectively and in this way to some extent leveling each other out over time.The intuition of the four one-dimensional functions and their interplay is not easy and it is the resulting forecasts, as shown in Figure 2 with future number of deaths over the years, that perhaps is the best driver of our intuition.However, the forecasts are based on just four functions and that is after all easier to understand than hundreds of discretely estimated parameters as in Martinez-Miranda et al. (2015, 2016).But it still does take a trained applied statistician to do these forecasts well.It will most likely never be a fully automatic exercise.

Appendix. Proofs
Proof of Theorem 1.The main idea of the proof is to apply uniform bounds from empirical process theory and to proceed as e.g. in the proof of Theorem 10.2 in van de Geer (2000).For this purpose we have to check entropy conditions for function classes defined by constraints of our penalty terms.For a related proof in a neural network model, see also Horowitz and Mammen (2007).For a constant c > |m * 0 |, where m * 0 denotes the true value of m 0 , we define where a n is the empirical norm such that a This result can be achieved from Corollary 8.8 in van de Geer (2000).See also the remark after the statement of this corollary.
One can easily check that the elements of G mc are absolutely bounded by a constant times J( mc ).To see this, note that for a function f : 2k+1) .For the first event we get from the basic inequality (A.2) and Ĵ ≤ Δ 2 n 2k/(2k+1) that Thus we have Δ 2 = O P (ρ n ) on the first event.Note that because of Ĵ ≤ Δ 2 n 2k/(2k+1) this also implies that Ĵ = O P (1) on the first event.On the second event, we get from the basic inequality (A.2) Thus on this event we have Ĵ = O P (1) and because of Δ Ĵ−1/2 ≤ n −k/(2k+1) we get that Δ = O P (n −k/(2k+1) ) .This shows that Compare also Mammen and van de Geer (1997) for a related application of the above empirical process bound.
We now argue that G mc = G m with probability tending to one.For this claim we make use of the result J( mc ) = O P (1) that we have just proved.We now argue that this implies that the derivatives of mc 1 , mc 2 , λc , and mc 3 are uniformly bounded by a random variable that is of order O P (1).For a proof of this claim we argue first that the L 2 norms of the first order and second order derivatives of these functions are of order O P (1).This gives a bound of order O P (1) for the supnorm of the first order derivatives where in this step we make use of the same argument used above for showing that the elements G mc are absolutely bounded by a constant times J( mc ).For bounding the L 2 norms of the first order and second order derivatives one can make use of interpolation inequalities: it holds that (ϕ (j) (x)) 2 dx ≤ Cγ −2j (ϕ(x)) 2 dx + Cγ 2(l−j) (ϕ (l) (x)) 2 dx for functions ϕ with (ϕ (l) (x)) 2 dx < ∞, for 1 ≤ j ≤ l and for γ > 0 with a constant C depending only on the integration region, see Agmon (1965).
By applying these bounds for the first-order derivatives we obtain for a random variable R n = O P (1), where δ n is defined in the statement of the theorem and fulfills n k/(2k+1) δ n → ∞ and δ n → 0. Choose δ > 0. We get with some further constants C 1 , C 2 , . . .> 0: ).Thus we have that the probability of the event {| mc 0 − m * 0 | ≥ δ} converges to 0 and mc 0 = m * 0 + o P (1).This shows that with probability tending to one the constraint |m 0 | ≤ c is not active in the calculation of mc 0 , mc 1 , . ... This implies that G mc = G m with probability tending to one.
We now come to the proof of the entropy bound (A.1).For the proof we make use of the entropy bound for Sobolev classes, .4)for some constant C 1 > 0. For (A.4) the reader is referred to Birman and Solomjak (1967).Using similar arguments as above we now argue that λ, m 1 , m 2 and m 3 are uniformly absolutely bounded.Note that for (m 0 , m 1 , m 2 , m 3 , λ) ∈ M * c we have, because of T 2 (λ) = 1, that |λ(x λ )| < C 2 for an element x λ of I 1 depending on λ with some constant C 2 > 0 (not depending on λ).Thus λ ∞ ≤ In particular, the length of the intervals I 2 (λ) can be uniformly bounded.We have T 1 (m 1 ) ≤ 1, T 3 (m 3 ) ≤ 1 and From all these inequalities we get that m j ∞ ≤ C 4 for 1 ≤ j ≤ 3 for some constant C 4 > 0. Thus, we can apply the entropy bound (A.4) for 1 ≤ j ≤ 3 to all functions m j with (m 0 , m 1 , m 2 , m 3 , λ) ∈ M * c for some m k (k = j) and λ, and to the Sobolev class of all functions λ with (m 0 , m 1 , m 2 , m 3 , λ) ∈ M * c for some m 0 , m 1 , m 2 , m 3 .Thus, the bound (A.1) follows if we prove that m 2 ∞ ≤ C 5 for some constant C 5 > 0 as long as (m 0 , m 1 , m 2 , m 3 , λ) ∈ M * c for some m 0 , m 1 , m 3 and λ.Such a bound for m 2 ∞ can be achieved by noting that I2(λ) m 2 (x) 2 dx < C 6 for some constant C 6 > 0. This inequality follows with j = 1, l = k − 1, ϕ = m 2 from the interpolation inequality that we already used above: (ϕ (j) (x)) 2 dx ≤ Cγ −2j (ϕ(x)) 2 dx + Cγ 2(l−j) (ϕ (l) (x)) 2 dx for functions ϕ with (ϕ (l) (x)) 2 dx < ∞, for 1 ≤ j ≤ l and for γ > 0 with a constant C depending only on the integration region, see Agmon (1965).This concludes the proof of the theorem.
Proof of Theorem 2. The theorem follows by a simple application of Theorem 1. Choose κ n → ∞.Then, Theorem 1 implies that almost surely.This implies that the expectation of this conditional probability converges to zero.Because this holds for all κ n → ∞ we get the statement of the theorem.
Proof of Theorem 3. Choose m1 , m2 , and m3 with for (x, y) ∈ I 0 .Put δ j = mj − m j for j ∈ {1, 2, 3}.Then We have to show that this equality implies for some real numbers c j for j ∈ {1, 2, 3}.From (A.6) we get By taking the difference of the last two equations we get Choose (x 1 , y 1 ) ∈ I 0 .We consider the following three cases of (x 1 , y 1 ).

Fig 2 .
Fig 2. Small circles indicate the observed numbers of asbestos deaths in the UK until the year 2012.Forecasted numbers of deaths are depicted as a dashed curve.

Table 1
Mean integrated squared error (MISE), integrated square bias (ISB) and integrated variance (IV) of the whole regression function estimator m and of the component function estimators mj and λ, based on 100 MC samples of sizes n = 400 and n = 1, 000.