Regularised forecasting via smooth-rough partitioning of the regression coefficients

Abstract: We introduce a way of modelling temporal dependence in random functions X(t) in the framework of linear regression. Based on discretised curves {X_i(t_0), X_i(t_1), ..., X_i(t_T)}, the final point X_i(t_T) is predicted from {X_i(t_0), X_i(t_1), ..., X_i(t_{T−1})}. The proposed model flexibly reflects the relative importance of the predictors by partitioning the regression parameters into a smooth and a rough regime. Specifically, unconstrained (rough) regression parameters are used for observations located close to X_i(t_T), while the regression coefficients for the predictors positioned far from X_i(t_T) are assumed to be sampled from a smooth function. This both regularises the prediction problem and reflects the 'decaying memory' structure of the time series. The point at which the change in smoothness occurs is estimated from the data via a technique akin to change-point detection. The joint estimation procedure for the smoothness change-point and the regression parameters is presented, and the asymptotic behaviour of the estimated change-point is analysed. The usefulness of the new model is demonstrated through simulations and four real data examples, involving country fertility data, pollution data, stock volatility series and sunspot number data. Our methodology is implemented in the R package srp, available from CRAN.


Introduction
Over the last few decades, functional data analysis (FDA) has been growing in importance and enjoying increased attention. Functional objects arise in many contexts and the applications in the literature include prediction of daily curves of particulate matter in the air (Aue et al., 2015), testing stationarity of intraday price curves of a financial asset (Horváth et al., 2014), modelling the dynamics of fertility rate (Chen et al., 2017), studying the effect of air pollution on mortality rate across cities (Kong et al., 2016), prediction of the protein content of meat from spectral curves (Zhu et al., 2014), investigation of a bike sharing system by predicting bike pick-up counts (Han et al., 2017), choosing predictive days from daily egg-laying counts for fruit flies (Ji and …) and predicting sucrose content of orange juice from its near-infrared spectrum (Ferraty et al., 2010).
In this paper, we consider random functions X_i ∈ L2[a, b], where i = 1, ..., n and [a, b] is a compact subset of R. If the functions are used as a predictor for explaining a scalar response variable Y, this simply describes the standard functional linear regression, which has been widely studied in the literature. The reader is referred to Reiss et al. (2017) for a review of numerous approaches to scalar-on-function regression. On the other hand, if the random functions X_i are believed to possess temporal dependence and are analysed by separating the domain they live on into shorter units, we call such a data structure a functional time series. Functional time series analysis has been an active field of research in recent years. The best-known model in this area is the first-order functional autoregressive model proposed by Bosq (2000). Other recent contributions include testing for stationarity (Horváth et al., 2014), testing for mean functions in a two-sample problem (Horváth et al., 2013), testing for error correlation (Gabrys et al., 2010) and prediction (Antoniadis et al., 2006; Aue et al., 2015).
In practice, functional data are often observed on a grid, rather than continuously. The observation of i.i.d. square-integrable random functions X_i(t) ∈ L2[a, b] on an equispaced grid {t_0, t_1, ..., t_T} gives the discretised curves {X_i(t_0), X_i(t_1), ..., X_i(t_T)} for i = 1, ..., n, where t_0 = a and t_T = b. Based on these design points, our objective in this work is to predict the final point X_i(t_T) from the past observations {X_i(t_0), ..., X_i(t_{T−1})}. This is an important applied problem in a variety of fields, including public health, earth sciences, finance and the environment, as our data examples of Section 4 illustrate. Arguably the simplest statistical framework for expressing the dependence of X_i(t_T) on {X_i(t_0), ..., X_i(t_{T−1})} is linearity, and with this in mind, this work focuses on the following model:

X_i(t_T) = μ + Σ_{j=1}^{T} α_j X_i(t_{T−j}) + ε_i,   i = 1, ..., n.   (1)

We now discuss its specifics. In our asymptotic considerations, we work with a fixed T; however, in practice, T can be large (e.g. two of the datasets in Section 4 have T roughly of the order of n), which inevitably brings us into a high-dimensional setting in which the set of parameters α_j cannot be estimated well by classical approaches. In addition, we often experience a high degree of collinearity between the predictors. As a way of regularising the problem, our proposal in this work is to split the set of parameters {α_1, ..., α_T} into two, {α_1, ..., α_q} and {α_{q+1}, ..., α_T}, and to assume that the second set is discretised from a smooth curve β(t), which gives

X_i(t_T) = μ + Σ_{j=1}^{q} α_j X_i(t_{T−j}) + Σ_{j=q+1}^{T} β(t_{T−j}) X_i(t_{T−j}) + ε_i,   i = 1, ..., n,   (2)

where the final point X_i(t_T) is a scalar response variable, {X_i(t_{T−j}), j = 1, ..., T} ∈ R^T represents the scalar predictors and ε_i denotes the error term with E(ε_i | X_i(t_{T−j}), j = 1, ..., T) = 0 and unknown variance σ². Since all the dependent and independent variables are obtained from random functions, we assume them to be random.
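To make the structure of model (2) concrete, it can be mimicked in a few lines of Python (an illustrative sketch with arbitrary, hypothetical values for μ, α and β(t); the paper's actual implementation is the R package srp):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, q = 200, 100, 3

# Rough (unconstrained) coefficients alpha_1, ..., alpha_q for the q nearest
# lags, and a smooth curve beta(t) sampled at the remote design points
# t_0, ..., t_{T-q-1}; both are hypothetical choices for illustration only.
alpha = np.array([0.4, 0.2, 0.1])
t = np.linspace(0.0, 1.0, T + 1)
beta = 0.3 * np.sin(np.pi * t[:T - q])

# X[i, j] plays the role of X_i(t_j) for j = 0, ..., T-1; since alpha_j
# multiplies X_i(t_{T-j}), the nearest column t_{T-1} is paired with alpha_1.
X = rng.standard_normal((n, T))
coef = np.concatenate([beta, alpha[::-1]])
y = 0.5 + X @ coef + 0.1 * rng.standard_normal(n)   # y[i] stands for X_i(t_T)
```

The full coefficient vector is thus smooth over the first T − q entries and unrestricted over the last q, which is exactly the partitioning that the SRP model exploits.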
The unknown parameter set contains a constant μ ∈ R, the real, scalar vector α = (α_1, ..., α_q)^T ∈ R^q, the real, functional β ∈ L2[t_0, t_{T−q−1}] and a change-point index parameter q. Throughout the paper, we will be referring to (2) as the Smooth-Rough Partition (SRP) model. The SRP model assumes that the change-point index q is unknown, and we estimate it from the data via a change-point detection technique. This is possible because we will be assuming that the coefficients α_j are rougher than the coefficients β(t_{T−j}), i.e. exhibit more variation. We now motivate the smooth-rough partitioning idea in more detail. The partitioning of the regression coefficients into two classes of smoothness captures the difference in the relative importance of the observations in predicting the final point X_i(t_T). Constraining the β's to be smooth reflects the relatively lower importance of the more remote observations, whose influence on X_i(t_T) is 'bundled together' by the smoothness restriction in β. By contrast, the unconstrained parameters α are not connected to each other in any (functional) way, so are able to capture any arbitrary linear influence of the near observations on X_i(t_T). The smoothness assumptions on (α, β) will be specified in Section 5.
The smooth-rough partitioning results in regression estimation that is interpretable in the sense that it automatically separates the effects that can be seen as "long-term" (those corresponding to the smooth portion of the parameter vector) from those that can be seen as "instantaneous" (those corresponding to the rough portion of the parameter vector). In other words, the SRP framework can be seen as a "two-scale" approach to linear prediction, where the two scales are defined by both the smoothness and the extent of the regression parameter vector (i.e. the long, smooth portion and the short, rough portion). As will be demonstrated in Section 4, this two-scale framework is useful in various real-world datasets, e.g. fertility rate data, high-frequency stock volatility series, Mexico City pollution data and sunspot number data.
Each of them appears to display both long-term and instantaneous temporal dependences, which (as we illustrate) are well captured by the SRP model. For example, it is reasonable to believe that in the context of the pollution data, the level of pollution in a particular curve at a particular time depends both on the overall shape and level of the curve up until the current time (which could be seen as the long-term effect) and the levels immediately preceding the current time in question (which can be seen as the instantaneous effect). We can attach similar interpretations to the other datasets studied in the paper.
Additionally, the SRP framework can also be useful in the modelling and forecasting of univariate time series, especially those that are believed to be well modelled as AR (autoregressive) processes with large orders, in which case the SRP smoothing device would be able to offer both regularisation and (hopefully) interpretability, especially if the time series is believed to possess long-range dependence (which will typically be the case if an AR model with a large order is used in the first place). Our sunspot example in Section 4 illustrates this.
Model (2) covers two special cases: 1) in the case of q = T, i.e. if we ignore the constrained part, it takes the form of the multiple linear regression X_i(t_T) = μ + Σ_{j=1}^{T} α_j X_i(t_{T−j}) + ε_i; and 2) when q = 0, i.e. without the unconstrained part, if the summation is replaced by integration with a large enough T, it becomes scalar-on-function regression with X_i(t_T) = μ + ∫ β(t) X_i(t) dt + ε_i. Unlike the former, completely unconstrained case, the regularisation in model (2) operates in a way that reduces the model's degrees of freedom. In the examples of Section 4, we empirically show that the full model (2) exhibits better prediction performance than these two extreme cases. This further justifies our efforts in proposing a methodology for detecting the change-point index q automatically from the data.
Other ways of regularising the functional linear regression coefficient have been proposed in the literature. In particular, some researchers have used ideas from variable selection to obtain β(t) = 0 on the non-informative subintervals and β(t) ≠ 0 on the informative ones. James et al. (2009) employ the LASSO and Dantzig selector with the aim of improving the interpretability of β(t), while Zhou et al. (2013) use the Dantzig selector and SCAD approaches. Lin et al. (2015) propose a functional version of SCAD by combining the SCAD method and smoothing splines to obtain a smooth and sparse estimator for the functional coefficient. By contrast, we do not regularise by finding null subregions of β(t) but by imposing different smoothness constraints over different sections of the parameter curve. The 'null subregion' and our approach are compared and contrasted in Sections 3 and 4.
In addition, our approach differs from the functional linear regression model with points of impact (Kneip et al., 2016) in that our unrestricted coefficients are grouped into a single region nearest to the time-location of the response variable. In contrast to Kneip et al. (2016), this allows us not to have to remove the observations adjoining the points of impact when estimating their locations, which would be impossible in our time series context. Other methods related to Kneip et al. (2016), but less so to our work, have also been proposed: McKeague and Sen (2010) aim to estimate a single point of impact, motivated by gene expression data, and Ferraty et al. (2010) fit a nonparametric model after finding several predictive design points. The performance of our technique is compared to that of Ferraty et al. (2010) in Sections 3 and 4.
Change-point detection ideas have been proposed in other functional regression contexts before. Hall and Hooker (2016) find the truncation point θ under the truncated functional linear model, while … (2015) use two functions β_1(t) and β_2(t) by dividing the interval into two with one discontinuity point. They suggest the partitioned functional single index model, Y_i = μ + g_1(∫_{[0,λ]} β_1(t) X_i(t) dt) + g_2(∫_{(λ,1]} β_2(t) X_i(t) dt) + ε_i, where g_1 and g_2 are smooth functions to be estimated and the breakpoint λ identifies a discontinuity in the functional regression coefficient. Neither of these methods uses its concept of change-point detection to differentiate between two classes of smoothness. If q in model (2) were known, the skeleton of our model would be similar to that of partial functional linear regression with both scalar and functional covariates, recently studied by e.g. Kong et al. (2016), Zhou et al. (2016), Zhou and Chen (2012), Shin and Lee (2012), Shin (2009), Aneiros-Pérez and Vieu (2008) and Goia (2012). Apart from assuming q to be unknown, (2) is different in that it operates in a time series context.
The remainder of the article is organised as follows. Section 2 describes the model and the parameter estimation procedure. The supporting simulation studies are outlined in Section 3, with further real-data illustrations in Section 4 regarding country fertility data, Mexico City pollution data, stock volatility series and sunspot number data. The relevant theoretical results are presented in Section 5 and we end with additional discussion in Section 6. The SRP methodology is implemented in the R package srp and the proofs of our main theoretical results are in Appendix A.

Model and its estimation
We work with the discretised curves {X_i(t_0), ..., X_i(t_T)}, i = 1, ..., n, observed from each function X_i(t) on the T + 1 equispaced discrete points including both endpoints. Since the regression coefficients vary with q, we rewrite model (2) as

X_i(t_T) = μ^q + Σ_{j=1}^{q} α_j^q X_i(t_{T−j}) + Σ_{j=q+1}^{T} β^q(t_{T−j}) X_i(t_{T−j}) + ε_i,   i = 1, ..., n,   (3)

where 1 ≤ q ≤ T. The point t_{T−q} is where a sudden smoothness change occurs in the sequence of the regression coefficients, with the coefficients α_j^q being unconstrained in terms of their smoothness and the coefficients β^q(t_{T−j}) assumed to be a sampled version of a smooth function. The change-point location in (3) is the same for all i's. Our expectation is that q is substantially smaller than T, and the optimal q is chosen by examining a number of q's over a subset of {1, ..., T}, which we specify in Section 2.1. The reason why T is assumed to be fixed is that if we were to allow T → ∞, then t_T would asymptotically approach t_{T−1} and we could simply predict X(t_T) by X(t_{T−1}).
The set of unknown parameters in (3) can be categorised into two types: 1) the change-point t_{T−q} and 2) the regression coefficients (μ^q, α^q, β^q). Our interest includes the estimation of the underlying smooth function β(t). Broadly speaking, two possible ways exist: 1) estimate (β^q(t_0), ..., β^q(t_{T−q−1})) and then use interpolation to obtain the functional form of β^q(t), or 2) obtain the interpolant {X̃(t), t ∈ [t_0, t_{T−q−1}]} and then estimate the function β^q(t) through a basis expansion. In this paper, we use the latter approach, as it is more popular and the former approach needs a particular penalty to make it feasible if T is close to or exceeds n. Examples of the former can be found in Cardot et al. (2007) and Crambes et al. (2009).
We interpolate (X_i(t_0), ..., X_i(t_{T−q−1})) using natural cubic splines with knots at (t_0, ..., t_{T−q−1}). As stated in Crambes et al. (2009), the essential property of natural splines is that, for any vector, the unique natural spline interpolant exists and can be expressed as a B-spline expansion with dimension equal to 'number of knots + 2' (in our case T − q + 2) as follows:

X̃_i(t) = Σ_{h=1}^{T−q+2} c_{i,h} B_h(t),   (4)

where {B_h(t)} is a set of basis functions for the normalised B-splines and the c_{i,h} are the corresponding coefficients.

Dimension reduction is necessary for the estimation of β(t). The required regularisation is usually achieved by a basis expansion, which enables a finite number of basis functions to approximate the infinite-dimensional function. Numerous approaches are available, such as the Fourier series, functional principal components (PC), splines or wavelets. The reader is referred to Ramsay and Silverman (2005) for more details. In what follows, we use B-splines. Cardot et al. (2003) argue that spline estimators should be preferred to the functional PC approach when X(t) is rough and the functional coefficient is smooth, which is the case we are interested in. Moreover, a spline estimator is not directly affected by the estimation of the eigenstructure of the covariance operator of X(t).
Let S be the space of splines defined on [t_0, t_{T−q−1}] with degree s and k − 1 equispaced interior knots, where L = k + s denotes the dimension of S. Then one can derive a set of basis functions from the normalised B-splines {B_l}_{l=1,...,L} to approximate β^q(t) as

β^q(t) ≈ Σ_{l=1}^{L} b_l B_l(t),

where b_l represents the corresponding coefficient. For each t_{T−q}, the set of the regression parameters simplifies to δ^q = (μ^q, α^q, b_1^q, ..., b_L^q)^T ∈ R^{1+q+L}, where α^q = (α_1^q, ..., α_q^q)^T. The choice of L is considered in Section 2.2.
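A normalised B-spline basis of dimension L = k + s can be constructed explicitly, for example with scipy (an illustrative Python sketch; the clamped knot placement below is one standard choice and is not claimed to match the srp implementation):

```python
import numpy as np
from scipy.interpolate import BSpline

s, k = 3, 6                      # cubic splines, k - 1 = 5 interior knots
L = k + s                        # dimension of the spline space S
interior = np.linspace(0.0, 1.0, k + 1)[1:-1]
# Clamped knot vector: repeat each boundary knot s + 1 times.
knots = np.concatenate([np.zeros(s + 1), interior, np.ones(s + 1)])

t = np.linspace(0.0, 1.0, 200)
# Columns of B are the basis functions B_l(t), l = 1, ..., L.
B = np.column_stack([BSpline(knots, np.eye(L)[l], s)(t) for l in range(L)])

# Approximating a coefficient curve by its least-squares projection onto
# span{B_1, ..., B_L}; the target curve here is a hypothetical beta(t).
target = np.sin(2 * np.pi * t)
b, *_ = np.linalg.lstsq(B, target, rcond=None)
approx = B @ b
```

With a clamped knot vector the basis functions form a partition of unity on [0, 1], and even the modest dimension L = 9 reproduces a smooth target curve closely, which is why the text can afford to fix L rather than tune it.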

Joint estimation procedure for parameters
We suggest a one-stage estimation procedure for the change-point and the regression parameters. Since the parameter q represents the number of scalar parameters, under fixed L, q itself determines the dimension of the model. Thus, using the well-known criterion of Schwarz (1978), we estimate q by minimising

SIC(q) = log M(q) + q · log(n)/n,   (6)

where

M(q) = (1/n) Σ_{i=1}^{n} { X_i(t_T) − μ̂^q − Σ_{j=1}^{q} α̂_j^q X_i(t_{T−j}) − Σ_{j=q+1}^{T} β̂^q(t_{T−j}) X_i(t_{T−j}) }²,   (7)

and (μ̂^q, α̂_j^q, β̂^q(t_{T−j})) are repeatedly estimated for each q by minimising the following sum of squared errors with appropriate penalisations,

Σ_{i=1}^{n} { X_i(t_T) − μ^q − Σ_{j=1}^{q} α_j^q X_i(t_{T−j}) − Σ_{j=q+1}^{T} β^q(t_{T−j}) X_i(t_{T−j}) }² + λ_1 Σ_{j=1}^{q} (α_j^q)² + λ_2 ∫ { (β^q)^{(m)}(t) }² dt,   (8)

with the positive integer m satisfying m < s, where s denotes the degree of the space S.
The penalty terms in (8) contain two tuning parameters: λ_1 controls a ridge-type penalty and λ_2 governs the smoothness of the estimated β̂^q(t). We do not explicitly specify assumptions for the magnitudes of λ_1 and λ_2; instead, as in Hall and Hooker (2016), our theoretical conditions (in Section 5) are phrased in terms of the appropriate convergence rates (Assumptions 3 and 4). In practice, only the initial values of λ_1 and λ_2 need to be specified by the user and the optimal values are selected automatically via a cross-validation-type criterion described in Section 2.2. If q were known, our task would be to estimate the regression parameters (μ^q, α^q, β^q). However, we assume that q is not known and estimate the parameters (q, μ^q, α^q, β^q) jointly. We preserve the original time scale of β^q(t) instead of rescaling it to [0, 1] so that we can place α̂^q and β̂^q(t) on the same time scale.
Let q_0, α_0, β_0 denote the true values of the parameters q, α, β, respectively. Typically, as a function of q, M(q) decreases sharply as q ↑ q_0 and becomes relatively flat (as n → ∞) for q ≥ q_0. For q > q_0, the smooth function β_0(t) is estimated by the scalar estimators (α̂_q, ..., α̂_{q_0+1}) on the interval [t_{T−q}, t_{T−q_0−1}]. As the smoothness of (α̂_q, ..., α̂_{q_0+1}) is unrestricted, the fit is typically good, which causes the flat shape of M(q). Conversely, when q < q_0, some of the unrestricted parameters, (α_{0,q_0}, ..., α_{0,q+1}), are estimated by the smooth (β̂^q(t_{T−q_0}), ..., β̂^q(t_{T−q−1})), which typically causes M(q) to be away from its minimum for q < q_0. The SIC penalty 'lifts' the flat part of M(q) and enables us to estimate the q parameter close to its true value. This is shown theoretically in Section 5 and numerically in Sections 3 and 4.
When finding the optimal q in (6), although q can in principle be as large as q = T, we recommend examining 1 ≤ q ≤ q̄, where q̄ is substantially smaller than T. In the examples considered in Sections 3 and 4, we take q̄ = min(⌈T × 0.1⌉, 30). Based on our empirical experience, when q is large, there is the possibility that the optimisation of the two tuning parameters, λ_1 and λ_2 in (8), becomes unstable, in that it becomes highly dependent on the selection of their initial values. In addition, examining the entire range 1 ≤ q ≤ T can make the algorithm unnecessarily slow, especially when both T and n are large. In practice, even if we do not restrict q to be small (which introduces the stability and speed issues referred to above), the minimiser q̂ of SIC(q) in (6), if computed successfully despite the potential stability issues, is typically substantially smaller than T.
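The selection loop over candidate values of q can be mimicked in a few lines. The Python sketch below is illustrative only: it replaces the penalised B-spline expansion of the smooth part with an unpenalised polynomial basis (so the tuning of λ_1 and λ_2 is omitted), and uses a Schwarz-type criterion of the same flavour as SIC(q) in (6):

```python
import numpy as np

def select_q(X, y, q_bar, deg=3):
    """For each q in 1..q_bar, fit the q nearest lags unconstrained and force
    the remaining coefficients onto a smooth curve (here a degree-`deg`
    polynomial, standing in for the B-spline expansion), then penalise the
    model dimension in q, Schwarz-style."""
    n, T = X.shape
    sic = []
    for q in range(1, q_bar + 1):
        grid = np.linspace(0.0, 1.0, T - q)
        P = np.vander(grid, deg + 1)                 # smooth-part basis
        Z = np.column_stack([np.ones(n), X[:, T - q:], X[:, :T - q] @ P])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        M = np.mean((y - Z @ coef) ** 2)             # residual mean square
        sic.append(np.log(M) + q * np.log(n) / n)    # Schwarz-type criterion
    return int(np.argmin(sic)) + 1
```

On data where the nearest-lag coefficients are genuinely rough, the criterion drops sharply up to the true q and the log(n)/n penalty flattens it afterwards, so the minimiser typically sits at or near the true q.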

Selection of the tuning parameters
To select the tuning parameters, we use the magic function from the R package mgcv (Wood, 2006). The mgcv package covers various regression models such as GAMs and generalised ridge regression. The magic function is useful in that it is able to optimise over more than one penalty parameter (λ_1 and λ_2 in our case) by minimising GCV based on Newton's method. The results also give the estimators (α̂^q̂, β̂^q̂(t)) in (8) under each selected q̂.
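As a rough illustration of the GCV principle that magic relies on, the following Python sketch selects a single ridge parameter on a grid (magic itself handles several penalties jointly and uses Newton's method rather than a grid search; this one-parameter version is only a stand-in):

```python
import numpy as np

def gcv_ridge(Z, y, lambdas):
    """Pick a ridge parameter by minimising
        GCV(lam) = n * ||y - yhat||^2 / (n - edf)^2,
    where edf is the trace of the ridge hat matrix
    Z (Z'Z + lam I)^{-1} Z', computed via the SVD of Z."""
    n = Z.shape[0]
    U, d, _ = np.linalg.svd(Z, full_matrices=False)
    best_lam, best_gcv = None, np.inf
    for lam in lambdas:
        shrink = d ** 2 / (d ** 2 + lam)
        yhat = U @ (shrink * (U.T @ y))   # hat matrix applied to y
        edf = shrink.sum()                # effective degrees of freedom
        gcv = n * np.sum((y - yhat) ** 2) / (n - edf) ** 2
        if gcv < best_gcv:
            best_lam, best_gcv = lam, gcv
    return best_lam
```

The denominator (n − edf)² is what lets GCV trade goodness of fit against model complexity without an explicit validation set.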
Regarding the dimension of β^q, we typically set L to be large but substantially smaller than T − q. As mentioned in Ruppert (2002), the number of basis functions tends not to play an important role in functional linear regression with a roughness penalty, provided we choose it to be large enough to prevent undersmoothing. Following the rule of thumb from Ruppert (2002), we use L = 35 in Sections 3 and 4, except in cases in which T < 40, when we use L = 9.

Simulations
In this section, we evaluate the finite-sample performance of our approach. We expect the performance of our method to vary depending on the size of the change between β_0(t) and α_0, and on the degree of fluctuations in the α_0 coefficients relative to the smoothness of β_0(t).
Based on the model (3), we consider the following four parametric cases: Case 1: μ_0 = 0.0180, α_0 = (0.4, 0.2, 0.1)^T; Case 2: μ_0 = −0.0836, α_0 = (0.6, −0.5, 0.4)^T; Case 3: μ_0 = −0.0239, α_0 = (0.4, 0.2, 0.1)^T; and Case 4: μ_0 = −0.0742, α_0 = (0.4, −0.2, 0.1)^T. These allow us to investigate how the performance of change-point detection is affected by the degree of change in the regression parameters. The true change-point index parameter is q_0 = 3 for all cases, as shown in Figure 1, and β_0(t) is available from the R package srp. In the data-generating process based on the model (3), we use Gaussian noise ε_i with the signal-to-noise ratio, defined as σ_X²/σ², equal to 4, where σ² is the error variance. In Cases 1 and 3, α_0 shows less fluctuation than in Cases 2 and 4. The size |α_{0,3} − β_0(t_{T−4})| of the change-point is approximately 0.4 in Case 2 and approximately 0.1 in the remaining three cases. Case 3 is similar to Case 1 except that its β_0(t) = b_0 + b_1 t is linear. We simulate n = 300 independent copies of each process, in which the length of the sample is T + 1 = 360 (see formula (3)).
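The signal-to-noise calibration used above can be written down directly: given the linear predictor, the error variance is chosen so that σ_X²/σ² = 4. A minimal Python sketch (with the Case 1 values of μ_0 and α_0 from the text, but a hypothetical linear β_0(t), since the true curve ships with the srp package):

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, q0 = 300, 359, 3                     # T + 1 = 360 design points
alpha0 = np.array([0.4, 0.2, 0.1])         # Case 1 values from the text
t = np.linspace(0.0, 1.0, T - q0)
beta0 = 0.05 + 0.05 * t                    # hypothetical linear beta_0(t)

X = rng.standard_normal((n, T))
signal = 0.0180 + X @ np.concatenate([beta0, alpha0[::-1]])

# Choose the error variance so that var(signal) / sigma^2 = 4,
# taking var(signal) as the proxy for sigma_X^2.
sigma2 = np.var(signal) / 4.0
y = signal + rng.normal(0.0, np.sqrt(sigma2), size=n)
```

This construction keeps the noise level proportional to the strength of the signal, so the difficulty of the change-point detection problem is comparable across the four cases.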

Competing methods
We compare the performance of our approach to the following existing methodologies: multiple linear regression (MLR), functional linear regression (FLR), ridge regression (RIDGE), FLiRTI (James et al., 2009) and the nonparametric approaches MPDP and NP of Ferraty et al. (2010). We also compare our proposal (SRP_C) with its simplified version, named SRP_L, which follows the form of SRP_C except that β_0(t) is estimated as a linear function. The objective function of our method (SRP_C) is in (8), and we determine q̂_1 and q̂_2 for MLR and SRP_L by minimising SIC(q) in (6) with the appropriate M(q) for each. In the implementation of FLR, we use cubic smoothing splines (s = 3) with the dimension L = 35 for both β(t) and X_i(t), where the derivative order of β(t) is m = 2 and λ is selected by minimising GCV. The ridge parameter is also optimised by minimising GCV. For the implementation of the other methods, we follow the suggestions of each paper for selecting the tuning parameters; the R code is available on the web (FLiRTI: http://www-bcf.usc.edu/~gareth/research/Research.html, MPDP and NP: http://www.math.univ-toulouse.fr/~ferraty/).

The top row of Figure 2 shows that the mean of the 100 SIC(q) curves is minimised at the true q_0 = 3 for all cases. Case 2 shows a more rapid decrease than the other cases when q ↑ q_0 due to the larger size of change at the change-point. Similarly, in the bottom row, we see that the mode of q̂ is q_0 = 3 in all cases. Since Cases 1 and 3 have a relatively smooth α, q̂ = 1, 2 (< q_0) are selected more frequently than in Cases 2 and 4, which have relatively more fluctuating α's. Figure 3 provides numerical evidence of the increased closeness of q̂ to q_0 in Case 4 as the sample size n increases. As is apparent from Table 1, FLR and RIDGE perform systematically worse than the others. Our proposal, SRP_C, outperforms the others in Cases 1, 2 and 4, and the difference is most striking in Cases 2 and 4, in which a sudden smoothness change occurs.
In Case 3, in which there is no clear smoothness change around the change-point and the true β(t) is linear, SRP_L turns out to be the best-performing method.

Table 1: The mean (sd) of SSE (×10²) defined in formula (9) over 100 simulation runs for the parametric methods in all cases. Bold: methods with the lowest mean of SSE.

Examining Figure 4, while the misestimation in SRP_C is mainly located around the true change-point, in FLiRTI and FLR it is scattered over the whole interval. In addition, the graph offers visual confirmation of the superior performance of SRP_C in Cases 1, 2 and 4. In particular, in Cases 2 and 4, FLR ignores the sudden fluctuation in α by estimating it as a smooth function. Unlike FLR and FLiRTI, SRP_C shows its advantages not only when scale changes are present (Cases 1 and 3) but also when a sudden smoothness change occurs at the change-point (Cases 2 and 4). Table 2 contains two more columns than Table 1, as the mean-square prediction error can also be obtained for the nonparametric methods, MPDP and NP, which do not involve the estimation of (α, β(t)). In all cases considered, FLR, MPDP, NP and RIDGE show worse prediction performance than the other methods. SRP_C performs better than FLiRTI in all cases (but more noticeably so in Cases 2 and 4). SRP_C is superior to SRP_L in all cases except Case 3 (Table 1).

Table 2: The mean (sd) of MSPE (×10²) defined in formula (10).

This is expected, since Case 3 includes a linear β_0(t). However, SRP_C is not far behind SRP_L in Case 3, as the smoothness of β̂(t) is flexibly controlled by the automatically chosen penalty.

Data applications
In this section, our methodology is applied to country fertility data, Mexico City pollution data, stock volatility series and sunspot number data. The data can be obtained from the Human Fertility Database (https://www.humanfertility.org/), the R package aire.zmvm, the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/) and the base R datasets available from CRAN, respectively.

Country fertility rate data
Forecasting future fertility rates has a great impact on governments in planning children's services and education. We use fertility rates at age 20, recorded over the 36 years from 1974 to 2009 for 31 countries around the world. As shown in Figure 5, the fertility rates at age 20 show an overall decreasing trend in all countries and, although it is not illustrated in this paper, similar patterns are observed at ages 21-26, while fertility rates at ages 30-39 have obvious increasing trends from 1990 onwards, which reflects the phenomenon of more women deferring childbirth to a later age. The final observation, recorded in 2009, is predicted from the past observations from 1974 to 2008. To compare the prediction power of the new model with its competitors, we randomly split the whole dataset into a training sample of size n_1 = 26 and a test set of size n_2 = 5, repeating this 100 times, and compute the mean, median and standard deviation of the 100 mean-square prediction errors defined in (10). In the training set, the B-spline expansion with dimension L = 9 is used for SRP_C, SRP_L and FLR. As found in Table 3, MLR, SRP_C and SRP_L lead to similar performance in prediction, which is better than that of the nonparametric methods (MPDP, NP), the full functional model (FLR), the full scalar setting (RIDGE) and FLiRTI.

Table 3: The mean, median and standard deviation of 100 MSPE's (×10⁶) defined in formula (10) for all methods described in Section 3.1, for the case study in Section 4.1.

Fig 6. Barplots of the 100 q̂_1 for MLR (first), 100 q̂_2 for SRP_L (second) and 100 q̂ for SRP_C (third), estimated by minimising {SIC(q), 1 ≤ q ≤ 4} in formula (6), and the frequency barplot of the best-performing method (with the lowest MSPE) out of the 100 samples (fourth), for the case study in Section 4.1.

As shown in Figure 6, q̂ = 1, 2 are the most frequently selected as the optimal size of the scalar part for MLR, SRP_L and SRP_C. Although MLR and
SRP_L seem to be slightly better than SRP_C in prediction in Table 3, Figure 6 shows that SRP_C is the most frequently selected as the best-performing method in terms of MSPE from the 100 samples. In Figure 7, the functional estimators β̂(t) for FLR and FLiRTI and the discrete ones for RIDGE live on the whole interval t ∈ [t_0, t_{T−1}], while SRP_C, MLR and SRP_L assign the corresponding subintervals to α̂ with the optimally chosen q̂ = 1, q̂_1 = 2 and q̂_2 = 1 (respectively). The estimated curves for FLR and FLiRTI and the estimated coefficients for RIDGE appear to be relatively oscillatory over the entire interval under a fixed smoothness, while the smoothness of the SRP estimators varies as dictated by their design. Interestingly, all parametric methods give a large regression coefficient at year 2008, which contrasts with the coefficients for years 1974-2007, which are close to zero. In a time series context, this indicates that the fertility rate in 2008 is more influential for predicting the fertility rate in 2009 than the older observations are.

Nitrogen oxides in Mexico City
We use the daily curves of hourly average nitrogen oxides levels in Mexico City, recorded at the Pedregal station in 2016. As shown in Figure 8, the daily curves contain 24 observations each and have similar patterns, including two peaks around hours 9 and 21. The final observation, recorded at hour 24, is predicted from the past observations indexed 1 to 23. We randomly split the whole dataset into a training sample of size n_1 = 161 and a test set of size n_2 = 86, repeating this 100 times, and compute the mean, median and standard deviation of the 100 mean-square prediction errors defined in (10). In the training set, the B-spline expansion with dimension L = 9 is used for SRP_C, SRP_L and FLR. As found in Table 4 and Figure 9, SRP_C gives the best prediction among all methods and is also the most frequently selected as the best-performing one from the 100 samples in terms of MSPE. As shown in Figure 9, q̂ = 3 is the most frequently selected as the optimal size of the scalar part for SRP_C, while q̂ = 2 is so for MLR and SRP_L. In Figure 10, it is interesting to observe that the smooth portion of the SRP parameter vector appears to be non-trivially different from zero, which, together with the fact that the SRP model outperforms its competitors in the forecasting exercise reported above, provides evidence for the existence and impact of long-term temporal dependence in this dataset. It is also apparent that all the methods attempt to fit a particularly large regression coefficient at hour 23. The SRP_C curve detects a change at hour 20, where it experiences a seemingly non-trivial drop. It would be difficult for us to conclude that this drop is merely caused by a boundary effect, as the RIDGE solution (in which there are no boundary effects to speak of) also experiences a dip at that point.
In the same manner, the sudden increase observed in the FLR curve at hour 23 does not appear to be a mere boundary effect; rather, it reflects this method's own effort to fit the influential predictor under its own smoothness constraints. The results in Table 4 show that it is useful to apply two different regularisations, as done in SRP_C, depending on the perceived importance of the predictors, rather than estimating the regression coefficients under an unvarying regularisation, as done in RIDGE.

Table 4: The mean, median and standard deviation of 100 MSPE's (×10²) defined in formula (10) for all methods described in Section 3.1, for the case study in Section 4.2. Bold: methods with the three lowest MSPE's.

Fig 9. Barplots of the 100 q̂_1 for MLR (first), 100 q̂_2 for SRP_L (second) and 100 q̂ for SRP_C (third), estimated by minimising {SIC(q), 1 ≤ q ≤ 3} in formula (6), and the frequency barplot of the best-performing method (with the lowest MSPE) out of the 100 samples (fourth), for the case study in Section 4.2.

High frequency volatility series
In financial data analysis, modelling high-frequency volatility has attracted much attention in recent years. In particular, in the functional framework, non-parametric methods have been extensively studied (Bandi and Phillips, 2003; Reno, 2008; Kristensen, 2010). Müller et al. (2011) emphasise the random nature of volatility functions under the assumption that the repeated realisations of the volatility trajectories come from a suitable functional volatility process. Our interest is also in the random nature of functional observations rather than in modelling potential dependencies between curves; therefore, as in Müller et al. (2011), we view the daily curves as i.i.d. random functions. We aim to predict the latest point of the curves from the past observations.
Specifically, our methodology is applied to the prediction of the Disney stock volatility, where the raw observations cover n = 248 trading days from January 2, 2013 to December 30, 2013, and each curve has 395 grid points of closing prices recorded every minute. The volatility trajectories are obtained from the return series in the same way as in Müller et al. (2011); however, we retain the roughness of the volatility trajectories by using natural cubic splines as in (4) rather than smoothing them. This is important as volatility is not observable but is typically estimated to be oscillatory, so an extra smoothing step could cause the loss of important information, as stated in Kneip et al. (2016).
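A rough volatility trajectory can be built from the 1-minute closing prices along the following lines. This is a hedged sketch only: the log-squared-return proxy below is a common construction assumed here for illustration, and the exact transformation of Müller et al. (2011) may differ in its details.

```python
import numpy as np

def volatility_trajectory(prices):
    # 1-minute log returns from closing prices, then the rough proxy
    # log(r_t^2). No extra smoothing is applied, so the trajectory stays
    # oscillatory, mirroring the paper's choice to retain roughness.
    prices = np.asarray(prices, dtype=float)
    r = np.diff(np.log(prices))   # 1-minute log returns
    eps = 1e-12                   # guard against log(0) on zero returns
    return np.log(r ** 2 + eps)

# Toy day: 395 one-minute closing prices from a random walk in log-price.
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0, 1e-3, size=395)))
vol = volatility_trajectory(prices)   # one day's rough trajectory
```

Each trading day then yields one such curve, and the final point of the curve is predicted from the earlier ones, as in the other case studies.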
We randomly split the dataset into a training and a test set of size n_1 = n_2 = 124, repeated 100 times, and in the training set, the B-spline expansion with dimension L = 35 is used for SRP_C, SRP_L and FLR. Figure 11 shows that q̂_1 = 3 is the most frequently chosen for MLR, while q̂_2 = 1 and q̂ = 1 are the most frequently selected for SRP_L and SRP_C, respectively.
Similarly to the previous examples in Sections 4.1 and 4.2, Figure 12 shows that all the parametric methods reflect the 'fading memory' of the time series by assigning large regression coefficients to observations located close to the closing volatilities, in contrast with the coefficients for intervals positioned far from the closing volatility. As found in Table 5 and Figure 11, SRP_C leads to the best prediction among all methods and is also the most frequently selected as the best-performing one in terms of MSPE from the 100 samples.

Fig 11. Barplots of the 100 q̂_1 for MLR (first), 100 q̂_2 for SRP_L (second) and 100 q̂ for SRP_C (third) estimated by minimising {SIC(q), 1 ≤ q ≤ 30} in formula (6), and the frequency barplot of the best-performing method (with the lowest MSPE) out of the 100 samples (fourth) for the case study in Section 4.3.

Monthly numbers of sunspots
In this section, we demonstrate the usefulness of the SRP framework in univariate time series modelling, as an alternative to the AR model, which is often used in time series forecasting. The SRP model is similar to the AR model in that both specify the fading memory structure of the time series under linear dependence of the output variable on its own previous values. In practice, the AR(p) model is usually fitted with a small p for simplicity, interpretability and better forecasting performance; however, it may fail in the presence of longer-range effects. In that case, the SRP model can also be used for the forecasting of a univariate time series, where it becomes an autoregressive (AR) model with a large order (e.g. AR(T) in (2) with a fixed T) under the smooth-rough regularisation. The middle plot of Figure 13 shows that the monthly sunspot number series may need large-order autoregression (even up to or exceeding order 100), in which case it may be advantageous to use the SRP model over plain AR modelling. The sunspot number data contain 3177 observations available from 1749 to 2013, and we apply a square root transformation to the raw data. We split the whole dataset into a training sample of size n_1 = 2223 and a test set of size n_2 = 954 and create the data matrix for each set via a moving window with a prespecified number T + 1 = 151 of discrete points per curve (150 for covariates and 1 for the response variable), i.e. from the univariate time series (x_1, x_2, ..., x_{n_1}) in the training sample, we create 2073 curves, X_1(t) = (x_1, x_2, ..., x_{151}), X_2(t) = (x_2, x_3, ..., x_{152}), ..., X_{n_1−151+1} = (x_{n_1−150}, x_{n_1−149}, ..., x_{n_1}). In the same way, we create 804 curves for the test sample. In each curve, we use the last point as the response variable and the remaining 150 observations as covariates. Due to the temporal dependence in the entire dataset, we do not randomly repeat the construction of the training and test sets.
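The moving-window construction above can be sketched as follows. The monotone placeholder series stands in for the square-root-transformed sunspot numbers, and `lag_matrix` is an illustrative helper, not part of the srp package; only n_1 = 2223 and T = 150 come from the text.

```python
import numpy as np

def lag_matrix(x, T=150):
    # Build curves of length T + 1 via a moving window: row i is
    # (x_{i+1}, ..., x_{i+T+1}). The first T columns are the covariates
    # and the last column is the response.
    x = np.asarray(x, dtype=float)
    n = len(x)
    rows = np.array([x[i:i + T + 1] for i in range(n - T)])
    return rows[:, :T], rows[:, T]

# Placeholder for the square-root-transformed training series of length 2223.
x_train = np.sqrt(np.arange(1, 2224, dtype=float))
X, y = lag_matrix(x_train, T=150)   # n_1 - 151 + 1 = 2073 curves
```

Applying the same helper to a test series of length 954 yields the 804 test curves, since 954 − 150 = 804.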
From the training set, with L = 35, the optimal change-point index parameters for MLR, SRP_L and SRP_C are chosen as q̂_1 = 5, q̂_2 = 6 and q̂ = 2, respectively, from {q : 1 ≤ q ≤ 15}, as shown in Figure 14. As the optimal size q̂_1 = 5 for MLR is obtained by minimising the SIC criterion, the estimated regression coefficients are very close to those of the AR(5) model, and the significance of the first five lags is already revealed in the partial autocorrelation function in Figure 13. In Figure 14, the FLR and RIDGE estimators appear to be relatively oscillatory over the entire interval, while the estimators for FLiRTI and SRP_C are relatively smoother. We also obtain the OLS (ordinary least squares) estimator, which fluctuates slightly more than RIDGE, but it is not included in Figure 14. As is apparent from Table 6, our approach shows an improvement in prediction compared to the other methods. From this example, SRP_C appears to be a useful substitute for a classical AR(p) model with a small p, especially when the memory of a time series is relatively long.

Theoretical results
In this section, we assume that the SRP model in (3) is correct and explore the asymptotic behaviour of q̂, the estimator of the change-point index q_0. There is a one-to-one correspondence between q and t_{T−q}, so we will interchangeably consider q̂ and t_{T−q̂}. We denote the true values of the scalars α and the function β by (α_0, β_0) and assume the following conditions.
As mentioned in the discussion of the shape of the function M(q) in Section 2.1, Assumption 2 quantifies the non-convergences occurring when q < q_0. The next two assumptions list the converging components of M(q) when q ≥ q_0. Our Assumptions 3 and 4 are similar to the assumptions made on estimated regression coefficients in Hall and Hooker (2016).
We are now ready to state our main result.
Theorem 5.1. If q̂ is any value of q which minimises (6) on the interval [q_1, q_2], where q_1 and q_2 are chosen to satisfy 1 ≤ q_1 < q_0 < q_2 < T, then under Assumptions 1-6 we have P(q̂ = q_0) → 1 as n → ∞.
The technical proof of Theorem 5.1 is available in Appendix A. We end this section with a further brief justification of our assumptions by comparing them to similar assumptions made in some related recent works.
The B-spline expansion employed in this article can be replaced with other bases, for instance the set of eigenfunctions of the covariance operator of X(t). Cai and Hall (2006) investigate this case and derive the parametric rates with this methodology. Hall and Hooker (2016) mention that the methods used by Cai and Hall (2006) can give the rate of convergence of β̂_q(t) in (n^{−1/2}, n^0) for Assumption 3-(b) under appropriate smoothness conditions for β(t), X(t) and the covariance function, as measured by the spacing of the eigenvalues, in a fully functional setting (that is, when q = 0 in our case). Similarly, Crambes et al. (2009) derive the rate of convergence for general spline classes, which is comparable to that of Cai and Hall (2006), under the usual smoothness assumptions on β(t) and X(t) defined by the continuity of their derivatives. The methods used in Crambes et al. (2009) can give the rate in Assumption 3-(b) under appropriate smoothness conditions for β(t) and X(t) in a fully functional setting. Since our model contains scalar covariates and has the ridge-type penalty in (8), we postulate the same or slightly slower rates, which are also supported by our numerical experience.

Discussion
The SRP model represents a compromise between a completely unregularised and a completely regularised linear model in that it keeps all the effects non-zero but partitions them into two classes of regularity. This makes it a useful alternative to sparsity-based approaches, since retaining the smooth non-zero regression parameters can be beneficial for prediction, as this paper demonstrates.
The SRP approach can in principle be applied in any context in which the potential regressors have been pre-ordered in terms of their importance, as is the case in the time series setting studied in this paper.