Inference in functional factor models with applications to yield curves

This article develops a set of inferential methods for functional factor models that have been extensively used in modelling yield curves. Our setting accommodates both temporal dependence and heteroskedasticity. First, we introduce an estimation approach based on minimizing the least-squares loss function and establish the consistency and asymptotic normality of the estimators. Second, we propose a goodness-of-fit test that allows us to determine whether a specific model fits the data. We derive the asymptotic distribution of the test statistic, and this leads to a significance test. A simulation study establishes the good finite-sample performance of our inferential methods. An application to US and UK yield curves demonstrates the generality of our framework, which can accommodate both sparsely and densely observed yield curves.


INTRODUCTION
For three decades, models akin to the Nelson-Siegel model have been used to quantify the term structure of various economic and financial variables, including yields, spot rates, and futures, with hundreds of papers using them in various contexts. The general modelling paradigm is well known, but we recall it to be able to explain several issues. At each period i, which generally corresponds to a day, week, or month, we observe a curve X_i(t), where the argument t corresponds to the time to the expiration of a contract or to the maturity of a fixed-income investment. Nelson and Siegel (1987) introduced the following three curves:

f_1(t) = 1,  f_2(t; λ) = (1 − e^{−λt})∕(λt),  f_3(t; λ) = (1 − e^{−λt})∕(λt) − e^{−λt},

known as the 'level', the 'slope', and the 'curvature'. The essence of the modelling approach is that each curve X_i can be in some sense approximated by a linear combination of the above three curves:

X_i(t) ≈ b_{1,i} f_1(t) + b_{2,i} f_2(t; λ) + b_{3,i} f_3(t; λ).

This is definitely not the only approach to modelling the term structure. At about the same time, Fama and Bliss (1987) proposed a fully non-parametric approach which fits a smooth curve to the data. However, the Nelson-Siegel paradigm has become much more popular, chiefly because of its ability to reduce the term structure to the vector [b_{1,i}, b_{2,i}, b_{3,i}]^⊤. Various formal or informal statistical approaches are easier to apply to three-dimensional vectors than to curves. It has, however, been immediately recognized that such a reduction is possible only if the value of the shape parameter λ is available. Nelson and Siegel (1987), in fact, used different values of λ for different periods, but it is now a common practice to fix λ over the time period under consideration. This fixed value is either based on past experience, for example, the value given in Diebold and Rudebusch (2013), or is estimated in some way prior to further analysis. It is well known that with slightly different 'fixed' values of λ, the coefficients b_{k,i} can take on very different, often erratically behaved, values, limiting their usefulness and interpretability. Such findings are discussed in Virmani (2012) and Annaert et al. (2013), among others. The Nelson-Siegel model is highly nonlinear in its four parameters, which may cause estimation difficulties.
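To make the reduction concrete, the following minimal Python sketch (ours, for illustration only; the function names are ours and the maturities and yields below are hypothetical) evaluates the three Nelson-Siegel factor curves and recovers [b_{1,i}, b_{2,i}, b_{3,i}]^⊤ for a single curve by ordinary least squares once λ is fixed.

    import numpy as np

    def ns_factors(t, lam):
        """Nelson-Siegel factor curves at maturities t (in months):
        level f1, slope f2, and curvature f3; lam is the shape parameter."""
        t = np.asarray(t, dtype=float)
        f1 = np.ones_like(t)
        f2 = (1.0 - np.exp(-lam * t)) / (lam * t)
        f3 = f2 - np.exp(-lam * t)
        return np.column_stack([f1, f2, f3])

    def ns_coefficients(t, y, lam):
        """For a fixed lam, the vector b = (b1, b2, b3) minimizing the
        squared error is an ordinary least-squares fit on the factors."""
        F = ns_factors(t, lam)                     # len(t) x 3 design matrix
        b, *_ = np.linalg.lstsq(F, y, rcond=None)  # least-squares solution
        return b

    # Hypothetical yields (percent) at standard maturities (months).
    t = np.array([3.0, 6, 12, 24, 36, 60, 84, 120, 240, 360])
    y = np.array([0.1, 0.2, 0.4, 0.9, 1.3, 1.9, 2.2, 2.5, 3.0, 3.2])
    print(ns_coefficients(t, y, lam=0.0609))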
The objective of this article is to develop estimation and testing theory and methodology for a broad class of yield curve models, which includes the Nelson-Siegel model. Our approach is based on established principles of statistical inference, which involve the specification of model parameters and the structure of model errors. It allows us to formulate results on consistency and the asymptotic distribution of parameter estimators and to derive a goodness-of-fit test. This test permits us to evaluate the admissibility of a specific model over a specific time period within the Neyman-Pearson testing paradigm. Existing literature, including Diebold and Li (2006), focuses on practical aspects of yield curve fitting and forecasting. Our study is different, and complementary, because we provide a set of inferential methods based on a general statistical framework. It is hoped that the rigorous and general development presented in this article will be a useful addition to a large body of applied work on models of this type.
We consider functional observations X_1(t), X_2(t), …, X_N(t) defined on the interval 𝒯. We assume that these observations satisfy

E X_i(t) = ∑_{k=1}^{K} c_{k,0} f_k(t; θ_0), t ∈ 𝒯 and 1 ≤ i ≤ N,

that is, the mean of the observations can be written as a linear combination of the functions f_1(t; θ_0), f_2(t; θ_0), …, f_K(t; θ_0), where the functions f_1, f_2, …, f_K are known, and θ_0 ∈ R^d is the true value of an unknown parameter. The parameter vector specifying the model is a_0 = (c_{1,0}, c_{2,0}, …, c_{K,0}, θ_0^⊤)^⊤, where the subscript 0 indicates the true values. As explained above, in most current applications θ_0 is considered as a fixed value obtained in some way. In subsequent analyses, the curves f_k(t; θ_0) are treated as given deterministic functions. We will call these functions functional factors, or simply factors. In our approach, we focus on the estimation of the whole parameter vector a_0.
To give an idea of the scope of the models in current use, and consequently the scope of the theory and methodology we develop, we list several examples, including the models of Svensson (1994), Christensen et al. (2009), Chambers et al. (1984), and Yallup (2012) considered in Section 4. All functions of t are defined on a compact interval 𝒯 ⊂ [0, ∞). While historical data are generally available only at discrete subsets, which need not be regularly spaced, contracts of any maturity can be traded, so factor models of the type we study have traditionally been considered on a continuous domain 𝒯. As we will see in Section 4, some data are available as densely observed curves.
There are several novel mathematical techniques used in the proofs of the results stated in Section 2. The broadest innovation is the treatment of functional factor models under the assumption of a changing stochastic structure of the errors. We developed several approaches, which may be useful in similar contexts. To prove the almost sure convergence in Theorem 2.1, we must study the behaviour of the variances of partial sums of the errors. We use a blocking technique by considering 2^k ≤ N < 2^{k+1}. This allows us to obtain delicate bounds in probability via the Menshov inequality, and then apply the Borel-Cantelli lemma. The proof of Theorem 2.2 is built up of involved manipulations revealing the asymptotic distribution, which has a non-trivial form. The proofs of the remaining results use this form to justify the goodness-of-fit test. It would not have been possible to derive this test without knowing the precise limit. Numerical implementation of our method is not trivial either. The structure of the yields data must be taken into account. We provide precise algorithms.
The remainder of the article is organized as follows. Section 2 is dedicated to the development of the asymptotic theory for the models discussed above. This theory is conditional on the availability of suitable long-run covariance estimators, which remain consistent under heteroskedastic errors. Such estimators are studied in Section 3. Section 4 illustrates the statistical methodology and theory derived in this article by applying it to US and UK yield curves. Finite-sample properties of our methods are explored in Section 5. Online Supporting Information contains the proofs of the results stated in Sections 2 and 3, details of the numerical implementation of our methods, as well as additional results and figures.

ASSUMPTIONS AND LARGE-SAMPLE RESULTS
We work within the Hilbert space L² = L²(m) of square integrable functions defined on the compact interval 𝒯 ⊂ [0, ∞). The integration over 𝒯 is denoted simply by ∫. To be able to handle various forms of data, we consider an abstract positive measure m on 𝒯. This can be a discrete counting measure, which is convenient in many applications. We assume that the functions f_k(t; θ) are elements of L² for any admissible value of the parameter θ. The functional observations X_i and the errors ε_i are random elements of L².
Our approach consists in minimizing the least-squares loss function

L_N(a) = ∑_{i=1}^{N} ∫ ( X_i(t) − ∑_{k=1}^{K} c_k f_k(t; θ) )² dm(t), a = (c_1, …, c_K, θ^⊤)^⊤ ∈ A. (2.1)

We begin by listing the assumptions on the measure m, the parameter space A, and other objects in (1.1) and (1.2).
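As an illustration (not the implementation used in the paper), the following sketch minimizes a discretized version of (2.1) for the Nelson-Siegel case. It reuses ns_factors from the sketch above; with m the counting measure, the integral reduces to a sum over maturities, and the quasi-Newton optimizer mirrors the implementation choice described in Section 5. The function names loss and fit are ours.

    import numpy as np
    from scipy.optimize import minimize

    def loss(a, t, X, dm=None):
        """Discretized least-squares loss (2.1) for the Nelson-Siegel case:
        a = (c1, c2, c3, lam); X is an N x len(t) array of observed curves.
        With m the counting measure, dm is a vector of ones."""
        c, lam = a[:3], a[3]
        mean_curve = ns_factors(t, lam) @ c   # fitted mean curve on the grid
        resid = X - mean_curve                # broadcasts over the N curves
        if dm is None:
            dm = np.ones_like(t)
        return float(np.sum(resid**2 * dm))

    def fit(t, X, a0):
        """Quasi-Newton (BFGS) minimization; a0 is a starting value for a."""
        res = minimize(loss, a0, args=(t, X), method="BFGS")
        return res.x

    # Example: a_hat = fit(t, X, a0=np.array([5.0, -1.0, 0.0, 0.06]))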
The next condition requires that the model is uniquely defined.
Assumption 2.4. For every a ∈ A such that a ≠ a_0, the set { t ∈ 𝒯 : ∑_{k=1}^{K} c_k f_k(t; θ) ≠ ∑_{k=1}^{K} c_{k,0} f_k(t; θ_0) } has positive m-measure.

The above assumption means that the true mean curve corresponds only to the true parameter a_0. We model heteroskedasticity by assuming that the errors η_i = (e_{i,1}, e_{i,2}, …, e_{i,K}, ε_i(t))^⊤ are piecewise Bernoulli shifts. In particular, between the break points k_j in Assumption 2.5, these errors are strictly stationary. Their dependence is quantified in Assumption 2.6.
Assumption 2.5. The changes occur at times k_j = ⌊Nτ_j⌋, 1 ≤ j ≤ M, where 0 < τ_1 < τ_2 < ⋯ < τ_M < 1. We use the convention k_0 = 0 and k_{M+1} = N. In contrast to Bardsley et al. (2017), we do not require that τ_1, τ_2, …, τ_M are known.
The piecewise Bernoulli shifts are defined in Assumption 2.6. The idea of approximating a stationary sequence with random variables which exhibit finite dependence first appeared in Ibragimov (1962). Examples of models that satisfy Assumption 2.6 (and similar assumptions that follow) are discussed in Chapter 16 of Horváth and Kokoszka (2012). Basically, all stationary time series models in practical use can be represented as Bernoulli shifts, see Wu (2005), Shao and Wu (2007), Aue et al. (2009), Hörmann and Kokoszka (2010), Horváth et al. (2013), Kokoszka et al. (2015), and Zhang (2016), among many other contributions. The nonlinear moving average representation admits heteroskedastic models for the errors, like those studied in Hörmann et al. (2013).
Let f(t; θ) = (f_1(t; θ), f_2(t; θ), …, f_K(t; θ))^⊤, and let D_i, d_i(t), and d_i(s, t) be the second-order characteristics of the errors defined in (2.8). We note that the existence of D_i, d_i(t), and d_i(s, t) follows from Assumption 2.6. Let Γ(t) be a Gaussian process with EΓ(t) = 0 and the covariance function given in (2.9). To derive the limit distribution of N^{1∕2}(â_N − a_0), we need to replace (2.3) and (2.4) with mildly stronger conditions.

Assumption 2.8. There are ν_1 > 2, ν_2 > 2, and C > 0 such that the moment and approximability conditions (2.3) and (2.4) hold with these exponents.

Assumption 2.9. The matrix G = ∫ g(t; a_0) g^⊤(t; a_0) dm(t) is non-singular, where g(t; a_0) denotes the gradient of the mean curve with respect to a, evaluated at a_0.

We note that Assumption 2.9 means that the coordinates of g(t; a_0) are linearly independent, that is, f_1, f_2, …, f_K and their derivatives with respect to θ, evaluated at θ_0, are linearly independent as functions of t in L².
It is clear that the limit in Theorem 2.2 is a (K + d)-dimensional normal random vector with mean zero and covariance matrix ℭ. We discuss the estimation of ℭ in Section 3. We now proceed to results that will allow us to investigate whether a specific model defined by (1.1) and (1.2) fits the data, that is, the goodness-of-fit testing. Recall that the estimator of a_0 has the form â_N = argmin_{a ∈ A} L_N(a). Our testing procedure is based on a functional of the process

Z_N(t) = N^{1∕2} ( X̄_N(t) − ∑_{k=1}^{K} ĉ_{k,N} f_k(t; θ̂_N) ), t ∈ 𝒯,

where X̄_N(t) = N^{−1} ∑_{i=1}^{N} X_i(t) is the sample mean curve.
If the null limit distribution of the process {Z_N(t), t ∈ 𝒯} can be established, in a suitable function space, then any continuous functional of this process can serve as a test statistic. In this article, we focus on the L² norm with respect to another abstract measure, n, which can be taken to be m.
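For a discretized curve, the statistic ∫ Z_N²(t) dn(t) is a weighted sum over grid points; a minimal sketch (our function name; Z_N as displayed above, with n the counting measure by default):

    import numpy as np

    def test_statistic(t, X, a_hat, dn=None):
        """T_N = integral of Z_N(t)^2 dn(t), with Z_N(t) = sqrt(N) times the
        sample mean curve minus the fitted mean curve; dn gives the
        n-measure of each grid point."""
        N = X.shape[0]
        c, lam = a_hat[:3], a_hat[3]
        fitted = ns_factors(t, lam) @ c        # fitted mean curve on the grid
        Z = np.sqrt(N) * (X.mean(axis=0) - fitted)
        if dn is None:
            dn = np.ones_like(t)               # counting measure
        return float(np.sum(Z**2 * dn))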
Assumption 2.10. The measure n satisfies one of the following conditions: (i) n((−∞, t]) = n(t) is a function increasing only at the jump points of m (not all jump points of m need to be used); or (ii) n is absolutely continuous with respect to the Lebesgue measure.

The testing procedure needs a simple modification of Assumption 2.8. We require higher moments and approximability in L² with respect to the measure n on 𝒯.
Assumption 2.11. There are ν_1 > 4, ν_2 > 4, and C > 0 such that the conditions of Assumption 2.8 hold with these exponents, with the approximability measured in L² with respect to n.

To formulate the null limit distribution of our goodness-of-fit statistic, we introduce Gaussian processes Γ⁰_N, defined in terms of processes Γ_N which have the same distribution as the Gaussian process Γ with covariances (2.9). The existence of the sequence {Γ_N} is part of the claim of the following theorem. Each process Γ⁰_N is clearly Gaussian and its distribution does not depend on N.
Theorem 2.3. If (1.1), (1.2), and Assumptions 2.2-2.11 are satisfied, then there exist Gaussian processes Γ⁰_N such that ∫ (Z_N(t) − Γ⁰_N(t))² dn(t) → 0 in probability; consequently, the test statistic

T̂_N = ∫ Z_N²(t) dn(t) (2.15)

converges in distribution to ∫ (Γ⁰(t))² dn(t).

Next, we discuss the consistency of the testing procedure. Under the alternative, model (1.1)-(1.2) does not fit the data, which is formalized as an additional assumption. The distributions of functionals of Γ⁰ depend on several unknown parameters, so it is not immediately obvious how to use Theorem 2.3 to test the validity of Nelson-Siegel-type models. We now explain how critical values can be obtained.
According to the Karhunen-Loève expansion,

∫ (Γ⁰(t))² dn(t) = ∑_{ℓ=1}^{∞} λ_ℓ 𝒩_ℓ², (2.17)

where 𝒩_1, 𝒩_2, … are independent standard normal random variables and λ_1 ≥ λ_2 ≥ ⋯ ≥ 0 are the eigenvalues of the covariance function C(t, s) of Γ⁰. The eigenvalues in (2.17) are unknown, so we need to estimate them from the observations. Assume that Ĉ_N is an L² estimator of C, that is,

∫∫ (Ĉ_N(t, s) − C(t, s))² dn(t) dn(s) → 0 in probability. (2.18)

A construction leading to such an estimator is described in Section 3. Using again the spectral theorem, there are non-negative empirical eigenvalues λ̂_{1,N} ≥ λ̂_{2,N} ≥ λ̂_{3,N} ≥ ⋯ and corresponding orthonormal eigenfunctions φ̂_{ℓ,N} satisfying

∫ Ĉ_N(t, s) φ̂_{ℓ,N}(s) dn(s) = λ̂_{ℓ,N} φ̂_{ℓ,N}(t). (2.19)

The distribution of the truncated sum

∑_{ℓ=1}^{L} λ̂_{ℓ,N} 𝒩_ℓ² (2.20)

is a good approximation for the distribution of the integral in (2.17) if L and N are suitably large. Thus, if we compute the estimates λ̂_{ℓ,N}, we can approximate the null distribution of the test statistic T̂_N in (2.15). These estimates can be obtained by numerically solving (2.19). If the critical value c_N(α) is defined by

P( ∑_{ℓ=1}^{L} λ̂_{ℓ,N} 𝒩_ℓ² > c_N(α) ) = α, (2.21)

then under the null hypothesis we have that

lim_{N→∞} P( T̂_N > c_N(α) ) = α, (2.22)

and under the alternative

lim_{N→∞} P( T̂_N > c_N(α) ) = 1. (2.23)

Thus, our procedure has asymptotic size α and it is consistent. Conditions (2.22) and (2.23) are verified in the online Supporting Information.
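In practice, (2.19)-(2.21) amount to an eigendecomposition of the discretized covariance estimate followed by Monte Carlo simulation of the truncated sum. A sketch (our function name), assuming Ĉ_N has been evaluated on a grid:

    import numpy as np

    def critical_value(C_hat, dn, L=20, alpha=0.05, nmc=100_000, seed=0):
        """Approximate c_N(alpha) in (2.21): solve the discretized
        eigenproblem (2.19) for the gridded covariance estimate C_hat,
        then simulate the truncated sum (2.20) by Monte Carlo."""
        W = C_hat * dn[None, :]                   # weight columns by dn
        lam = np.real(np.linalg.eigvals(W))       # eigenvalues of C dn
        lam = np.clip(np.sort(lam)[::-1][:L], 0.0, None)
        rng = np.random.default_rng(seed)
        chi2 = rng.standard_normal((nmc, L))**2   # iid chi-square(1) draws
        return float(np.quantile(chi2 @ lam, 1.0 - alpha))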

ESTIMATION OF ℭ, C(t, s) AND G
The estimation of the matrix G is conceptually easy. Recall that G = G(a_0) depends on a_0 only through the known factor functions and their derivatives. The plug-in estimator for G is

Ĝ_N = G(â_N). (3.1)

The continuity of G(a) in a neighbourhood of a_0 and Theorem 2.1 immediately imply that

Ĝ_N → G(a_0) almost surely. (3.2)

Thus we need to consider only the long-run covariance matrix ℭ_1 to construct an estimator for ℭ. The observations X_i(t), 1 ≤ i ≤ N, as well as the errors, form heterogeneous sequences. Thus, we need to verify whether the standard methods can be used to estimate ℭ_1 as well as the long-run covariance kernel D(t, s) of the error curves. Using Assumption 2.11, we obtain the moment bounds needed for this verification. The estimation of ℭ_1 is based on the projection vectors of the observations onto the factor curves and their derivatives, evaluated at â_N.

The residuals of the model are

q̂_i(t) = X_i(t) − ∑_{k=1}^{K} ĉ_{k,N} f_k(t; θ̂_N), 1 ≤ i ≤ N.

The data are heteroskedastic, but we show that the long-run covariance matrix estimator still can be used. As usual, we start with the estimation of the covariances at lag ℓ, which are combined into the kernel-weighted sum

Ĉ_{1,N} = γ̂_0 + ∑_{ℓ=1}^{N−1} J(ℓ∕h) ( γ̂_ℓ + γ̂_ℓ^⊤ ), (3.4)

where J is a kernel function and γ̂_ℓ denotes the sample lag-ℓ autocovariance matrix of the projection vectors. In a similar manner, we can define an estimator for D(t, s) so that (2.18) can be established. The estimator is based on the sample covariances defined for ℓ ≥ 0 by

γ̂_ℓ(t, s) = N^{−1} ∑_{i=1}^{N−ℓ} q̂_i(t) q̂_{i+ℓ}(s).

We note that E q_i(t) q_{i+ℓ}(s) depends on i because of the heteroskedasticity of the data. However, we show that the classical kernel estimator for the long-run variance of sums of dependent variables can still be used. The proof of (2.18) requires new arguments under our more general assumptions. The estimator is defined by

D̂_N(t, s) = γ̂_0(t, s) + ∑_{ℓ=1}^{N−1} J(ℓ∕h) ( γ̂_ℓ(t, s) + γ̂_ℓ(s, t) ). (3.5)

The following conditions on the kernel J and the window (smoothing parameter) h are standard. Now it is easy to find an estimator Ĉ_N(t, s) for the covariance function C(t, s) of Γ⁰ by combining D̂_N with the plug-in estimator Ĝ_N defined in (3.1). Observing that f(s; θ) is continuous in θ, Theorem 2.1, (3.2), and Theorem 3.1 imply that (2.18) holds.
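A sketch of the kernel estimator (3.5) for residual curves evaluated on a grid (our function names; the flat-top taper follows Politis (2003) and matches the kernel we use in Section 5, and centering the residuals is our implementation choice):

    import numpy as np

    def flat_top(u):
        """Flat-top kernel: 1 for |u| <= 1/2, 2(1 - |u|) for 1/2 < |u| <= 1."""
        u = abs(u)
        return 1.0 if u <= 0.5 else max(0.0, 2.0 * (1.0 - u))

    def lrv_kernel(q, h):
        """Kernel estimator of the form (3.5) for residual curves q
        (an N x grid array); h is the window. Returns a grid x grid
        long-run covariance estimate."""
        N = q.shape[0]
        q = q - q.mean(axis=0)                  # center the residual curves
        C = q.T @ q / N                         # lag-0 sample covariance
        for ell in range(1, N):
            w = flat_top(ell / h)
            if w == 0.0:
                break                           # kernel support exhausted
            gamma = q[:-ell].T @ q[ell:] / N    # lag-ell sample covariance
            C += w * (gamma + gamma.T)          # lags ell and -ell
        return C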

APPLICATION TO US AND UK YIELD CURVES
This section demonstrates our estimation method and the goodness-of-fit test by applying them to the yield curves in the United States and the United Kingdom. Our framework is general and can accommodate both sparsely and densely observed data. Researchers have the freedom to choose the measures m and n. They can be the counting measure, potentially with some weights, the Lebesgue measure, or any absolutely continuous measure with a specific density on 𝒯. To conserve space, the method of parameter estimation is demonstrated on the traditional Nelson-Siegel model; other models can be estimated in the same manner. 1 The goodness-of-fit test is illustrated by a range of models. 2

US Yield Curves
The website of the Board of Governors of the Federal Reserve System provides daily nominal interest rates for selected non-inflation-indexed US Treasury securities. 3 The market yields are calculated from composites of quotations on actively traded Treasury securities in the over-the-counter market. Such a calculation method provides yields at fixed maturities even if there is no outstanding security with one of those exact maturities. The maturities are fixed at 1, 3, 6, 12, 24, 36, 60, 84, 120, 240, and 360 months. Since the data are sparsely observed at these maturities, we set m and n to be the counting measure for the US yield curves. This corresponds to the established way of fitting yield curves to such data. Additionally, we produce yield curves in the form of functional data from the discrete observations using linear interpolation, as sketched below. 4 The data used in this study are daily yield curves from 1 January 1980 to 31 August 2020. For the purpose of demonstration, we investigate the Nelson-Siegel model over the six most recent contraction periods.
The dates of US contraction periods are obtained from the National Bureau of Economic Research. 5 We apply our estimation method to the yield curves in each contraction period. Table I reports estimated parameters of the Nelson-Siegel model (in Table I, * indicates that the last contraction period ends at the most recent available date). As can be observed, the level parameter ĉ_1 has a substantial reduction from 11.00% in 1980 to 1.91% in 2020. The slope parameter ĉ_2 is positive in the first contraction period (Jan 1980-Jul 1980), indicating that the yield curve is inverted. In the other five periods, the yield curves are generally upward-sloping with negative ĉ_2. The curvature parameter ĉ_3 is positive in the second (Jul 1981-Nov 1982) and fourth (Mar 2001-Nov 2001) contraction periods, implying an upside hump in the yield curves. The negative ĉ_3 in the other four periods indicates a downside hump in the shape of the yield curves. The parameter λ̂ controls the exponential decay rate and the location of the hump. We find that λ̂ changes across periods, unlike Diebold and Li (2006), who assume that λ is constant.
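The linear interpolation step mentioned above is elementary; a sketch for one day's quotes (our function name; the output grid choice is illustrative):

    import numpy as np

    MATURITIES = np.array([1, 3, 6, 12, 24, 36, 60, 84, 120, 240, 360], float)

    def to_curve(day_yields, grid=None):
        """Convert one day's yields at the 11 fixed maturities (in months)
        into a functional observation by linear interpolation."""
        if grid is None:
            grid = np.arange(1.0, 361.0)        # monthly grid on [1, 360]
        return np.interp(grid, MATURITIES, np.asarray(day_yields, float))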
Next, we apply our goodness-of-fit test to a range of yield curve models, including Nelson and Siegel (1987), Svensson (1994), Christensen et al. (2009), Chambers et al. (1984), and Yallup (2012). We set the polynomial basis up to the fourth degree in the model of Chambers et al. (1984), as suggested in their original paper. We also consider different numbers of exponential basis functions in the model of Yallup (2012) in order to explore the effect of more basis functions. Table II shows the p-values of the goodness-of-fit test for those models on US yield curves. The general pattern is that more factor functions allow greater flexibility in the yield curve modelling, which results in fewer rejections by the goodness-of-fit test. The Nelson-Siegel model is the most parsimonious among the considered candidates; it cannot be rejected in the first and last contraction periods, but is rejected at the 5% significance level in the other periods. The model of Yallup (2012) with K = 9, used by the Bank of Canada, is the most flexible of all the models, and it is not rejected in any contraction period. It is also interesting that the Nelson-Siegel model outperforms the models of Svensson (1994) and Christensen et al. (2009), based on the number of rejections. This can possibly be explained by the additional uncertainty of parameter estimates in the extra factor functions, in particular when the second hump is not present. We note that while the Nelson-Siegel model may not offer a good fit, it may be useful for other purposes.

UK Yield Curves
The Bank of England provides daily estimated nominal government yield curves for the United Kingdom on its website. 6 We choose to use the short-end curves with monthly resolution. 7 Unlike the sparsely observed US yield data, the short end of the UK yield data is densely observed. It should be emphasized that the estimated yield data in the United Kingdom are derived from spline-based smoothing methods. Specifically, the Bank of England used the 'variable roughness penalty' (VRP) method of Anderson and Sleath (1999) to fit a curve to the data, with constraints imposed to ensure that the overall curve is continuous and smooth. One can say that the Bank of England has already converted the discrete data to functional data by penalized smoothing. Since the UK yields are available at densely observed time points, m and n are set to be the Lebesgue measure because it can take into account more information contained in the curves. For the UK yield curves, we focus on recent years because of the frequent changes in the uncertainty related to Brexit, which had a substantial impact on the shape of yield curves, reflecting expectations about the future of the UK economy. Following Walker (2020), the timeline of Brexit is summarized in Table III. We are interested in the evolution of the parameters of the Nelson-Siegel model in the different periods related to Brexit. Our sample period is from 1 January 2014 to 31 August 2020, containing 1685 trading days in total. Figure 1 shows the UK yield curves over the whole sample period, and Figure S1 in the online Supporting Information presents the mean yield curves in the different periods related to Brexit. Table IV presents the estimated parameters of the Nelson-Siegel model. The first two parameters, ĉ_1 and ĉ_2, are associated with the level and slope of the yield curves. There is a general pattern that ĉ_1 decreases and ĉ_2 increases as time moves on to the later periods related to Brexit. This indicates that the yield curves gradually move to a lower level and become flatter, which is consistent with the observation in Figure 1. The curvature parameter ĉ_3 is negative in all periods, indicating a downside hump in the shape of the yield curves. In terms of λ̂, it starts from 0.070 in P0, decreases to 0.030 in P2, and generally stays at that level afterwards, except for a bounce in P7. We further perform the goodness-of-fit test for the same range of yield curve models. Table V shows the p-values of the goodness-of-fit test for the different models on UK yield curves. A pattern remarkably similar to Section 4.1 is observed. At the 5% significance level, the Nelson-Siegel model cannot be rejected in P5 and P6, while the model of Yallup (2012) with K = 9 is not rejected in any period. Additionally, the Nelson-Siegel model has fewer rejections than the models of Svensson (1994) and Christensen et al. (2009). Again, this can be explained by the estimation uncertainty in the parameters related to the second hump, which may not exist.
When assessing goodness-of-fit, we assume that one model should be suitable for a given stretch of data. We addressed this issue by considering relatively short time periods. However, the partitions into these shorter periods may not be optimal. It might be useful to develop a change point detection and estimation procedure for these models, and use the resulting data-driven partitions. We hope that our work will motivate research in this direction.

FINITE-SAMPLE PERFORMANCE
We now examine the finite-sample performance of our estimation method and the goodness-of-fit test. We first describe the data-generating process (DGP), designed to be in line with the characteristics of the empirical yield curves in the United States and the United Kingdom analysed in Section 4. We simulate realizations of the Nelson-Siegel model

X_i(t) = b_{i,1} f_1(t) + b_{i,2} f_2(t; λ) + b_{i,3} f_3(t; λ) + ε_i(t), (5.1)

where the random coefficients follow

b_{i,k} = c_k + e_{i,k}, k = 1, 2, 3, (5.2)

and the factor functions are

f_1(t) = 1, f_2(t; λ) = (1 − e^{−λt})∕(λt), f_3(t; λ) = (1 − e^{−λt})∕(λt) − e^{−λt}.

The true values of the model parameters are set to be c_1 = 7.579, c_2 = −2.098, c_3 = −0.162, and λ = 0.0609. The chosen values are taken from Diebold and Li (2006). Recall that Assumptions 2.5 and 2.6 allow for heteroskedasticity in the errors. This enables us to consider both homoskedastic and heteroskedastic errors to check the robustness of the developed methods. Under homoskedasticity, the errors e_{i,k} of the random coefficients in (5.2) are generated as independent zero-mean normal random variables with standard deviation b. The error term ε_i(t) in (5.1) is generated as a weighted sum of Fourier basis functions φ_r(t), r = 1, …, R, with weights Z_r, whose scores ξ_{i,r} follow the autoregressive process

ξ_{i,r} = ρ ξ_{i−1,r} + ν_{i,r}, (5.3)

with independent standard normal innovations ν_{i,r}. Under heteroskedasticity, the errors are generated in the same way, except that their scale changes at the break point k*. For the parameters used to generate errors under both homoskedasticity and heteroskedasticity, we set R = 5, b = 0.2, Z_1 = 0.1, Z_2 = 0.05, Z_3 = 0.05, Z_4 = 0.025, Z_5 = 0.025, and ρ = 0.1 or 0.5. For the heteroskedastic errors, we set k* = N∕2. As shown in Section 4, the US yields data are sparsely observed, while the UK data come in the form of densely observed yield curves. In order to closely mimic the characteristics of the empirical data, we devise two specific versions of the DGP: a functional DGP and a discrete DGP. For the functional DGP, the data are simulated on the domain t ∈ [0, 360], directly in the form of functional data. For the discrete DGP, the data are first generated only at the 11 fixed maturities, t ∈ {1, 3, 6, 12, 24, 36, 60, 84, 120, 240, 360}, and then linear interpolation is employed to convert the discrete data into continuous curves. To compare the two DGPs, Figure S2 in the online Supporting Information shows one realization of simulated yield curves based on the functional DGP (upper panel) and the discrete DGP (lower panel) with sample size N = 100 and heteroskedastic errors.
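The following sketch generates data from the functional DGP. Several details not fully specified above (the exact form of the Fourier basis, the innovation distribution, and the size of the post-break scale change) are filled in with illustrative choices and should not be read as the exact DGP; the function name is ours.

    import numpy as np

    def simulate_ns(N, rho=0.1, b=0.2, hetero=False, seed=0):
        """Functional DGP of Section 5 (some details assumed, see text).
        Returns the grid t and an N x len(t) array of simulated curves."""
        rng = np.random.default_rng(seed)
        t = np.arange(1.0, 361.0)                      # monthly grid on [0, 360]
        c, lam = np.array([7.579, -2.098, -0.162]), 0.0609
        Z = np.array([0.1, 0.05, 0.05, 0.025, 0.025])  # weights Z_1, ..., Z_5
        R = len(Z)
        # Fourier basis functions (an illustrative choice of phi_r).
        phi = np.array([np.sin((r + 1) * np.pi * t / 360.0) for r in range(R)])
        xi = np.zeros((N, R))                          # AR(1) scores, eq. (5.3)
        for i in range(1, N):
            xi[i] = rho * xi[i - 1] + rng.standard_normal(R)
        scale = np.ones(N)
        if hetero:
            scale[N // 2:] = 2.0                       # break at k* = N/2; the
                                                       # factor 2 is illustrative
        e = b * rng.standard_normal((N, 3)) * scale[:, None]  # coefficient errors
        eps = ((xi * Z) @ phi) * scale[:, None]        # error curves eps_i(t)
        return t, (c + e) @ ns_factors(t, lam).T + eps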
Before presenting the simulation results, it is worthwhile to discuss two practical issues: the measures and the kernel estimators. Since our framework allows us to choose various measures m and n, we have explored many options and found that the Lebesgue measure is suitable for the functional DGP and the counting measure is appropriate for the discrete DGP. In the following, the reported simulation results for the functional DGP are based on the Lebesgue measure and those for the discrete DGP on the counting measure. The kernel estimators Ĉ_{1,N} in (3.4) and D̂_N(t, s) in (3.5) need the specification of the kernel and the smoothing bandwidth; we use the flat-top kernel with the bandwidth h = log(N). Our simulations show that this bandwidth produces stable and satisfactory results. In Section D of the online Supporting Information, we show that the test is not sensitive to the bandwidth choice. Data-driven methods of bandwidth selection might improve the results further. Politis (2003) studies adaptive bandwidths for scalar data and Rice and Shang (2017) propose a plug-in bandwidth selection for functional data.
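Combining the pieces sketched in the earlier sections, one test run looks as follows (all function names refer to the previous sketches; m and n are the counting measure and h = log N). For brevity, the sketch uses the residual long-run covariance directly as the covariance of Γ⁰ and omits the finite-dimensional correction involving Ĝ_N discussed in Section 3.

    import numpy as np

    def gof_test(t, X, a0, alpha=0.05):
        """One run of the goodness-of-fit procedure; returns the statistic,
        the approximate critical value, and the rejection decision."""
        a_hat = fit(t, X, a0)                           # Section 2 estimator
        resid = X - ns_factors(t, a_hat[3]) @ a_hat[:3]
        C_hat = lrv_kernel(resid, h=np.log(X.shape[0])) # flat-top, h = log N
        cval = critical_value(C_hat, dn=np.ones_like(t), alpha=alpha)
        T = test_statistic(t, X, a_hat)
        return T, cval, T > cval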

Parameter Estimation
Based on the two DGPs, we investigate the consistency and the normality in finite samples of the estimator developed in Section 2. It is also of interest to compare the asymptotic standard deviation of the estimator in Theorem 2.2 with the sample standard deviation. In Section B of the online Supporting Information, we provide a practical step-by-step algorithm to implement our estimation approach and the details of using a quasi-Newton method to numerically minimize the least-squares loss function in (2.1), together with an improved algorithm with fast computation when the resolution (sampling frequency) of the yield curves is monthly. We generate 1000 independent realizations of the two DGPs and record the estimated values of a = (c_1, c_2, c_3, λ). The summary statistics of the estimated parameters are reported in Tables VI and VII for the functional DGP and the discrete DGP respectively (in Table VII, the measure m is the counting measure, and the true parameter values are c_{1,0} = 7.579, c_{2,0} = −2.098, c_{3,0} = −0.162, and λ_0 = 0.0609). Generally, the consistency of the four parameter estimators is well supported by the simulation results for the two DGPs because the bias and standard deviation become smaller as the sample size increases. Additionally, the median of the asymptotic standard deviation obtained from Theorem 2.2 is close to the standard deviation in the simulations. 8 Importantly, our estimation procedure is robust to heteroskedastic errors in the data. Compared to those based on the functional DGP, the estimated parameters based on the discrete DGP follow a normal distribution more closely. This is also reflected in the QQ-plots. For example, Figure S3 in the online Supporting Information shows QQ-plots under the discrete DGP with heteroskedastic errors for sample size N = 100. For homoskedastic errors and larger sample sizes, the QQ-plots look even better.

The Goodness-of-Fit Test
We study the empirical size and power of the goodness-of-fit test developed in Section 2. In Section C of the online Supporting Information, we provide the detailed steps for implementing our test. Under the null hypothesis, we simulate the data based on the functional DGP and the discrete DGP using 1000 independent replications, calculate the test statistic T̂_N in (2.15), and approximate its limit distribution by (2.20). For each replication, we compare the test statistic T̂_N with the critical values obtained from (2.21) to decide whether to reject the null hypothesis. Table VIII shows the empirical size of the test for the functional DGP in Panel A and for the discrete DGP in Panel B. Generally, the empirical size is close to the theoretical levels for the two DGPs, with a tendency to