A Survey of Dynamic Nelson-Siegel Models, Diffusion Indexes, and Big Data Methods for Predicting Interest Rates*

In this paper we survey a number of recent empirical findings regarding the usefulness of including diffusion indexes in dynamic Nelson-Siegel (DNS) type models used to predict the term structure of interest rates (see e.g., Diebold and Li (2006) and Diebold and Rudebusch (2013)). We also survey various empirical methods used in the construction of DNS models, and used to specify and estimate diffusion index augmented DNS models. In particular, we review (sparse) principal component analysis, factor augmented autoregression, and various dimension reduction, variable selection, machine learning, and shrinkage methods, such as the least absolute shrinkage and selection operator (lasso), the elastic net, and independent component analysis, among others. Finally, we discuss the importance of using real-time data in contexts where datasets are subject to revision; and we compare and contrast the use of targeted versus un-targeted specification methods when including diffusion indexes in DNS type prediction models. Interestingly, as noted in Swanson and Xiong (2018a,b), the usefulness of diffusion indexes is crucially dependent upon whether real-time data are used or not. Specifically, when real-time data are used to estimate the weights in diffusion indexes, it is found that relatively few “data rich” models that use big data are preferred to simpler DNS models, post 2010. Instead, pure DNS models that rely only on historical interest rate data deliver mean square error “best” forecasts. However, when data are not real-time, diffusion indexes always have marginal predictive content for interest rates. Moreover, it is clear that in more volatile interest rate regimes, such as prior to 2010, machine learning and related methods have much to offer, regardless of the type of dataset used in their construction.


Introduction
The term structure of interest rates plays an important role in asset management. One reason for this is that interest rates contain important information for pricing interest rate contingent assets. As a consequence, industries ranging from banking and finance to insurance are interested in forecasting and simulating yields on government, municipal, and corporate bonds. In this paper, we survey two key recent models used for prediction and simulation of interest rates. These include: (1) dynamic Nelson-Siegel (DNS) type models (see e.g., Nelson and Siegel (1985), Svensson (1994), Diebold and Li (2006), Diebold, Rudebusch and Aruoba (2006), Diebold and Rudebusch (2013), and Swanson and Xiong (2018a,b), and the references cited therein); and (2) factor augmented regression models (see e.g., Stock and Watson (2002a,b), Bai and Ng (2006), and Kim and Swanson (2014, 2018)). We also discuss hybrid models which combine (1) and (2), and summarize recent empirical findings in which such models are estimated using both "real-time" and "fully revised" datasets. The hybrid models referred to above are models in which diffusion indexes are included in DNS type models. This variety of model is particularly interesting because DNS models are often estimated using very small datasets that only include interest rate data, while diffusion indexes are typically constructed using so-called "big data" (i.e., datasets with potentially hundreds of variables). In this sense, the hybrid models that we survey combine data-rich models with non-data-rich models.
When the objective is prediction (or simulation), as in this paper, an important issue arises concerning the availability of the data used to estimate forecasting models. This issue involves whether or not to use real-time data in model specification and estimation. In order to understand what is meant by real-time data in this context, consider monthly inflation and interest rates. In the empirical findings surveyed in this paper, monthly interest rates are predicted. Now, historical interest rate data are never revised. Thus, it is clear which interest rate data to use when estimating DNS models that only include historical yields. However, historical inflation rates are regularly revised, and so if one specifies a hybrid DNS type model that includes diffusion indexes constructed using inflation (and other revised macroeconomic variables), one must account for the fact that inflation data are subject to revision.
Why is this important? The reason is that a diffusion index prediction constructed for the calendar dated time period January 2019, say, using data "pulled" in December 2018, and hence using data available through December 2018, will potentially be different from a diffusion index prediction constructed for the same calendar dated time period (i.e., January 2019), using data that were "pulled" at any point in time after December 2018. Why? In 2019, inflation data for calendar periods prior to and including December 2018 may have been revised by the government, so the timing associated with the "pulling" of data becomes relevant. Namely, data for specific calendar dated time periods are revised periodically, and over time, as the government receives new survey and related information. This in turn results in a revision of the model used to construct diffusion index predictions for calendar dates January 2019 and earlier.
To accurately account for new releases of economic variables, the entire history of the diffusion index in question must potentially be revised each month, as new "estimates" of historical data become available. Now, since the diffusion index is one of the inputs into our hybrid DNS type prediction model of interest rates, real-time data matters. In practice, what this means is that analysts and researchers interested in constructing prediction models that reflect the type of prediction models used in industry must use real-time data, and cannot simply download a single set of explanatory variables for use in ex ante prediction experiments. Stated differently, assume that the objective of a researcher is to simulate the environment faced by a practitioner when constructing interest rate forecasts. Given that the practitioner constructs new forecasts on a regular basis, say monthly, they are faced with an entirely new historical dataset each month. For variables like inflation, money, GDP, and unemployment that are regularly revised, many calendar dated observations that were previously available have been revised. In principle, one can collect an entire time series for each calendar dated observation of a variable, where the elements of the time series denote different "revisions" of the same calendar dated observation, which become successively available over time. Such datasets, in which a matrix of data is available for each variable rather than the usual (time series) vector, are referred to as real-time datasets. It should be stressed, however, that not all variables have this feature. Some variables, like interest rates and various asset prices, are never revised, and hence are considered "fully revised".
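The matrix structure of a real-time dataset described above can be illustrated with a minimal sketch. All numbers, dates, and function names below are hypothetical, and a one-month publication lag is assumed purely for illustration; actual real-time databases are organized in their own formats.

```python
# Hypothetical real-time dataset for one revised variable (numbers are made up).
# Rows are calendar dates, columns are vintages (the month the data were "pulled").
# None marks an observation that had not yet been released at that vintage
# (an assumed one-month publication lag).
CALENDAR_DATES = ["2018-10", "2018-11", "2018-12"]
VINTAGES = ["2018-12", "2019-01", "2019-02"]
REALTIME = [
    [2.5, 2.5, 2.6],   # 2018-10: revised in the 2019-02 vintage
    [2.2, 2.3, 2.3],   # 2018-11: revised in the 2019-01 vintage
    [None, 1.9, 2.0],  # 2018-12: first released in 2019-01, then revised
]


def series_as_of(vintage):
    """Return the history a forecaster would have seen when pulling data at `vintage`."""
    col = VINTAGES.index(vintage)
    return {
        date: row[col]
        for date, row in zip(CALENDAR_DATES, REALTIME)
        if row[col] is not None
    }


def revision_history(date):
    """Return the successive estimates of one calendar-dated observation."""
    row = REALTIME[CALENDAR_DATES.index(date)]
    return [value for value in row if value is not None]


real_time_view = series_as_of("2018-12")         # what was known in December 2018
fully_revised_view = series_as_of(VINTAGES[-1])  # a single later "snapshot"
```

A real-time forecasting experiment would call something like `series_as_of` once per forecast origin, whereas an experiment that assumes fully revised data would use the last column throughout.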
In the context of estimating (hybrid) DNS type prediction models, it is important to decide whether to carry out targeted or un-targeted prediction. Here, "targeting" refers to a specific approach used to construct diffusion indexes. Consider one of the approaches used to construct diffusion indexes in this paper, namely principal components analysis (PCA). Diffusion indexes (or latent factors) constructed using PCA can be interpreted as weighted combinations of all of the variables in the large-scale dataset used in their construction. In un-targeted prediction, diffusion indexes included in regression models and DNS type models are usually chosen to be those that explain the maximal correlation across the entire dataset. When using PCA, this amounts to selecting the eigenvectors of a certain correlation matrix that are associated with the largest eigenvalues of a particular eigenvalue-eigenvector decomposition of said matrix. However, there is no guarantee that these eigenvectors have particularly useful information for forecasting interest rates, which are our "target" variable. Namely, achieving maximal correlation with the entire set of variables does not ensure maximal correlation (or predictive content) for a particular variable. In targeted prediction, the variables used in the construction of the diffusion indexes are carefully pre-selected, so as to maximize their predictive content for interest rates. Many methods for doing this are available in the machine learning, dimension reduction, shrinkage, and variable selection literatures. Some such methods that are discussed in this survey include partial least squares, the lasso, the elastic net, ridge regression, and bootstrap aggregation. Additionally, as alternatives to PCA, we discuss sparse principal components analysis and independent components analysis. These methods directly induce sparseness in the coefficient vectors used to construct diffusion indexes.
In the empirical part of this survey, we review various empirical findings from Swanson and Xiong (2018a,b). These findings are based on analyses that utilize real-time datasets, as well as analyses that assume all data are fully revised. In the latter case, this amounts to taking a "snapshot" of the data that are available at a particular point in time, and assuming that the historical elements of that data are never revised, even when carrying out real-time forecasting experiments in which DNS, hybrid DNS, and factor augmented regression models are re-estimated at each point in time, prior to the construction of each new forecast. As alluded to above, the use of real-time data (or not) has broad implications when specifying and utilizing DNS type models with and without diffusion indexes for forecasting interest rates; and we shall see that empirical findings are indeed dependent upon which type of data are used in model specification and estimation. We also review findings indicating that the use of more sophisticated targeted prediction methods results in hybrid DNS-diffusion index prediction models that outperform pure DNS models prior to around 2010. However, the same is not true after 2010. In particular, after 2010, if real-time data are used in hybrid model construction, then hybrid models are (predictively) outperformed by pure DNS models. On the other hand, when non real-time data are used in a similar "post-2010" empirical exercise, hybrid DNS models still (predictively) outperform pure DNS models. This underscores the importance of carefully selecting the data to be used in DNS models, when simulating and predicting the term structure of interest rates.
The rest of the paper is organized as follows. Section 2 reviews dynamic Nelson-Siegel models, and Section 3 discusses modeling with diffusion indexes. In Section 4, factor augmented autoregression is reviewed, and in Section 5, dimension reduction, variable selection, machine learning, and shrinkage methods are surveyed. Section 6 summarizes selected recent empirical evidence concerning the usefulness of diffusion indexes in DNS modelling. Concluding remarks are gathered in Section 7.

Dynamic Nelson Siegel Models
In this section, we discuss two representative DNS type models, drawing on the discussion of Swanson and Xiong (2018b). For a complete survey, see Diebold and Rudebusch (2013). Additionally, refer to Diebold and Li (2006) and Diebold, Rudebusch and Aruoba (2006). First, we discuss the three factor dynamic Nelson-Siegel model. Motivated by rational expectations theory, Nelson and Siegel (1985) express spot interest rates in terms of instantaneous forward rates.
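Namely, the instantaneous forward interest rate of a bond with maturity $m$ is denoted as $f(m)$, and the spot interest rate of a bond with maturity $\tau$ as $y(\tau)$. Then, the yield to maturity of a bond can be written as the average of forward rates:
$$y(\tau) = \frac{1}{\tau}\int_0^{\tau} f(m)\,dm.$$
Nelson and Siegel (1985) motivate the use of the following model of the forward rate, which can generate a range of yield curve shapes, including monotonically increasing, humped, and occasionally S-shaped curves:
$$f(m) = \beta_1 + \beta_2 e^{-m/\theta_t} + \beta_3 \frac{m}{\theta_t} e^{-m/\theta_t},$$
where $\lambda_t = 1/\theta_t$ is the so-called decay parameter, which must be estimated. It is assumed fixed in this model, and is time varying in the dynamic version of the model discussed below. It is then easy to derive the following model for bond yields:
$$y(\tau) = \beta_1 + \beta_2\left(\frac{1-e^{-\lambda_t\tau}}{\lambda_t\tau}\right) + \beta_3\left(\frac{1-e^{-\lambda_t\tau}}{\lambda_t\tau} - e^{-\lambda_t\tau}\right).$$
In the above model, the latent factors (i.e., the "betas") are fixed. Diebold and Li (2006) generalize this model to allow for time-varying betas: $\beta_{1,t}$, $\beta_{2,t}$ and $\beta_{3,t}$. Their so-called Dynamic Nelson-Siegel (DNS) model is estimated using a two-step procedure. First, the rate of decay $\lambda_t$ is set to a constant, $\lambda$.
Next, at each point in time, $t$, the yield cross section is linearly projected onto the set of factor loadings
$$\left\{1,\ \frac{1-e^{-\lambda\tau}}{\lambda\tau},\ \frac{1-e^{-\lambda\tau}}{\lambda\tau} - e^{-\lambda\tau}\right\}.$$
In our experiments, various different dimensions are considered when specifying the yield cross section. Namely, we consider yield cross sections using 10, 12, and 30 different yield maturities. For example, with our 12-dimensional cross section, we estimate the latent factors by fitting the following regression:
$$y_t(\tau_i) = \beta_{1,t} + \beta_{2,t}\left(\frac{1-e^{-\lambda\tau_i}}{\lambda\tau_i}\right) + \beta_{3,t}\left(\frac{1-e^{-\lambda\tau_i}}{\lambda\tau_i} - e^{-\lambda\tau_i}\right) + u_t(\tau_i), \quad i = 1, \ldots, 12,$$
where $\tau_1, \ldots, \tau_{12}$ are the maturities in the cross section and $u_t(\tau_i)$ is a regression error. The estimated betas (i.e., $\hat\beta_{1,t}$, $\hat\beta_{2,t}$, and $\hat\beta_{3,t}$) are called the "level", "slope", and "curvature" factors. In particular, note that the loading on $\hat\beta_{1,t}$ is one, so that this factor is naturally interpreted as the "level" factor. The loading on $\hat\beta_{2,t}$ decreases as bond maturity increases, resulting in an increase of the "slope" of the bond yield curve.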
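The second step of the two-step procedure described above (projecting a yield cross section onto the Nelson-Siegel loadings with the decay parameter held fixed) can be sketched as follows. This is a minimal illustration, not the code used in the surveyed papers: the function names, the twelve maturities, and the "true" betas are hypothetical, and the decay value 0.0609 (maturities in months) is the value fixed by Diebold and Li (2006), adopted here as an assumption.

```python
import numpy as np


def ns_loadings(maturities, lam):
    """Nelson-Siegel loadings (level, slope, curvature) at the given maturities."""
    tau = np.asarray(maturities, dtype=float)
    x = lam * tau
    slope = (1.0 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(tau), slope, curvature])


def estimate_betas(yields, maturities, lam):
    """OLS projection of one yield cross section onto the NS loadings."""
    L = ns_loadings(maturities, lam)
    betas, *_ = np.linalg.lstsq(L, np.asarray(yields, dtype=float), rcond=None)
    return betas


# Hypothetical 12-maturity cross section (in months) and made-up factor values.
maturities = [3, 6, 9, 12, 24, 36, 48, 60, 72, 84, 96, 120]
lam = 0.0609                                   # assumed fixed decay parameter
true_betas = np.array([5.0, -2.0, 1.0])        # level, slope, curvature
yields = ns_loadings(maturities, lam) @ true_betas   # noise-free toy yields
betas_hat = estimate_betas(yields, maturities, lam)  # recovers true_betas
```

Repeating `estimate_betas` for every month in the sample yields the time series of level, slope, and curvature factors that are subsequently modeled and forecast.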
Finally, the loading on the third latent factor, $\hat\beta_{3,t}$, starts from zero at the short end of the yield curve, reaches its peak at some maturity in the middle, and gradually decays to zero as maturity goes to infinity. Figure 3 exhibits the three NS factors estimated with ordinary least squares for the sample period 1988:8-2017:10.[1]
[1] An increase in the "level" component, $\beta_{1t}$, affects all yields equally, thus it determines the level of the yield curve. Also, as maturity $\tau$ goes to infinity, $\beta_{1t} = y_t(\infty)$ by definition. An increase in the "slope" component, $\beta_{2t}$, affects short rates more than long rates, thereby changing the slope, or the so-called "term spread", of the yield curve. Finally, an increase in $\beta_{3t}$, the "curvature" component, will increase medium-term yields and have little effect on the short and long ends of the curve. Therefore, the yield curve will become more hump shaped. As demonstrated in Diebold and Li (2006), the "level" factor can be approximated with the 10-year bond yield, the "slope" factor can be approximated with the spread between the 10-year and 3-month bond yields, and the "curvature" factor moves closely with two times the 2-year yield minus the sum of the 3-month and 10-year yields.
In summary, the DNS model can be written as follows:
$$\hat y_t(\tau) = \hat\beta_{1,t} + \hat\beta_{2,t}\left(\frac{1-e^{-\lambda_t\tau}}{\lambda_t\tau}\right) + \hat\beta_{3,t}\left(\frac{1-e^{-\lambda_t\tau}}{\lambda_t\tau} - e^{-\lambda_t\tau}\right). \tag{2.1}$$
In order to construct predictions using the DNS model, we fit estimated factors to AR and VAR models, as follows:
$$\hat\beta_{i,t+1} = c_i + \gamma_i\hat\beta_{i,t} + \varepsilon_t, \quad i = 1, 2, 3,$$
or,
$$\hat\beta_{t+1} = c + \Gamma\hat\beta_t + \boldsymbol{\varepsilon}_t, \tag{2.2}$$
where $\hat\beta_t = (\hat\beta_{1,t}, \hat\beta_{2,t}, \hat\beta_{3,t})'$, $\varepsilon_t$ is a scalar stochastic disturbance term, $\boldsymbol{\varepsilon}_t$ is a $3 \times 1$ vector of stochastic disturbance terms, and $c_i$, $c$, $\gamma_i$, and $\Gamma$, $i = 1, \ldots, 3$, are conformably defined constants, constant vectors and constant matrices.
With these last two models, one can construct predictions of the $\hat\beta_{i,t}$, for $i = 1, \ldots, 3$, which can in turn be inserted into the above model of $\hat y_t(\tau)$ in order to generate predictions thereof. In all experiments in the sequel, rolling estimation is carried out when estimating the above models (and all other models), using windows of length 120 months, so that "real-time" predictions are constructed in all cases. Additionally, we consider two types of prediction models. In one, the decay parameter is fixed. In the other, the decay parameter is re-estimated prior to the construction of each new prediction.
Svensson (1994) extends the Nelson-Siegel model by adding a fourth term, yielding the so-called Nelson-Siegel-Svensson (NSS) model. This additional term allows for a second "hump" shape in term structures. In particular, he discusses using the following four-factor model for fitting the instantaneous forward interest rate:
$$f(m) = \beta_1 + \beta_2 e^{-m/\theta_1} + \beta_3\frac{m}{\theta_1}e^{-m/\theta_1} + \beta_4\frac{m}{\theta_2}e^{-m/\theta_2}.$$
Notice that in the above equation there are now two different decay parameters controlling the double-hump shape of the forward curve, called $\theta_1$ and $\theta_2$. Similar to the DNS model, we consider a dynamic version of the NSS model. Thus, we utilize the following variant of the DNS model (factor estimation and prediction construction are carried out using the DNS modeling approach discussed above):
$$\hat y_t(\tau) = \hat\beta_{1,t} + \hat\beta_{2,t}\left(\frac{1-e^{-\lambda_{1,t}\tau}}{\lambda_{1,t}\tau}\right) + \hat\beta_{3,t}\left(\frac{1-e^{-\lambda_{1,t}\tau}}{\lambda_{1,t}\tau} - e^{-\lambda_{1,t}\tau}\right) + \hat\beta_{4,t}\left(\frac{1-e^{-\lambda_{2,t}\tau}}{\lambda_{2,t}\tau} - e^{-\lambda_{2,t}\tau}\right),$$
where we now have two decay parameters, as discussed above. These are called $\lambda_{1,t}$ and $\lambda_{2,t}$. As discussed in De Pooter (2007), the second hump in the NSS model is difficult to identify without imposing additional restrictions. We adopt his approach to solving this issue, which includes assumptions that the two humps are at least one year apart, and that the second hump reaches its maximum for a maturity which is at least twelve months shorter than the first hump. Additionally, it is assumed that $\lambda_1 \neq \lambda_2$, in order to avoid multicollinearity. Figures 4A-B plot the four NSS factors estimated with static and dynamic decay parameters, $\lambda_{1,t}$ and $\lambda_{2,t}$. Figure 4C plots estimated rates of decay used in the construction of the four Nelson-Siegel-Svensson factors, where the rates of decay ($\lambda_{1,t}$, $\lambda_{2,t}$) are either set to fixed numbers, or estimated recursively using nonlinear least squares. See Section 3.1.2 for details on model estimation.
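The prediction step described in this section (fit an AR(1) model to each estimated factor over a rolling 120-month window, then map the forecasted factors back into yields through the loadings) can be sketched as below. The sketch assumes a fixed decay parameter and one-step-ahead forecasts; the function names, the two-maturity loading matrix, and the noise-free factor paths are all hypothetical.

```python
import numpy as np


def ar1_forecast(series):
    """One-step-ahead forecast from an AR(1) with intercept, fitted by OLS."""
    s = np.asarray(series, dtype=float)
    Z = np.column_stack([np.ones(len(s) - 1), s[:-1]])   # regressors: 1, beta_t
    (c, gamma), *_ = np.linalg.lstsq(Z, s[1:], rcond=None)
    return c + gamma * s[-1]


def dns_forecast(betas, loadings, window=120):
    """Forecast next-period yields from the last `window` rows of estimated factors.

    betas:    T x 3 array of estimated level, slope, and curvature factors
    loadings: n_maturities x 3 array of Nelson-Siegel loadings (decay held fixed)
    """
    recent = np.asarray(betas, dtype=float)[-window:]
    beta_next = np.array([ar1_forecast(recent[:, i]) for i in range(recent.shape[1])])
    return loadings @ beta_next, beta_next


# Toy check with noise-free AR(1) factor paths (illustrative numbers only).
T = 150
params = [(0.2, 0.9), (-0.1, 0.95), (0.05, 0.97)]   # (c_i, gamma_i) for each factor
betas = np.zeros((T, 3))
betas[0] = [1.0, -1.0, 0.5]
for t in range(1, T):
    for i, (c, g) in enumerate(params):
        betas[t, i] = c + g * betas[t - 1, i]
loadings = np.array([[1.0, 0.9, 0.1], [1.0, 0.3, 0.3]])   # two hypothetical maturities
yield_forecast, beta_next = dns_forecast(betas, loadings)
```

In a rolling experiment, `dns_forecast` would be called once per forecast origin after re-estimating the factors (and, in the variant with a dynamic decay parameter, the loadings) on the current window.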

Modelling with Diffusion Indexes
Continuing to draw on the discussion contained in Swanson and Xiong (2018b), we now turn our attention to so-called diffusion indexes, which have been utilized in numerous recent empirical investigations of economic data (see e.g., Andreou, Gagliardini, Ghysels, and Rubin (2018), Boivin and Ng (2005), Cheng and Hansen (2015), Exterkate, van Dijk, Heij, and Groenen (2013), Ludvigson and Ng (2009), Schumacher (2007, 2009), and the references cited therein). A key open question in this literature remains whether or not macroeconomic, financial and other non-yield information is useful in fitting and forecasting yield curves. As Duffee (2013) points out, assuming that yields follow a Markov process implies that all information in fundamental economic variables should already be embedded in yield cross sections. However, many of the aforementioned papers find that so-called "unspanned risks", as proxied for by additional economic variables and/or diffusion indexes, contain useful predictive content for yields. For example, Ang and Piazzesi (2003) find that macroeconomic variables are significant for explaining Treasury security yield dynamics; Mönch (2008) shows that including diffusion indexes in an affine Gaussian term structure model results in improved predictive performance; and Diebold, Rudebusch, and Aruoba (2006) discover strong evidence in favor of causal linkages between macroeconomic variables and future yield curve dynamics.
In the above paragraph, we refer to diffusion indexes, which summarize information contained in (potentially) large-scale economic datasets (big data). As discussed in the introduction, one important aspect of big data in our context is the use of so-called real-time data. Recently, McCracken and Ng (2016) and the data desk of the Federal Reserve Bank of St. Louis created FRED-MD, which is a large monthly real-time database that contains over 130 macro-variables and all revisions of all of these variables. The dataset contains variables summarizing economic output and income, labor markets, consumption, money and credit, housing, and the stock market, for example. Moreover, they show that diffusion indexes extracted from their FRED-MD dataset contain the same predictive content as diffusion indexes constructed using the classic Stock and Watson dataset (Stock and Watson (2002a,b)). However, FRED-MD is a real-time database, while the Stock and Watson dataset contains only fully revised data. Several studies have revealed the importance of collecting and updating such real-time datasets, including Diebold and Rudebusch (1991), Hamilton and Perez-Quiros (1996), Bernanke and Boivin (2003), and the papers cited therein.
One of our main objectives in this paper is examining whether diffusion indexes are useful for predicting yields when the data used to construct the indexes are purely "real-time", rather than fully revised.
A natural approach for answering this question involves adopting a dynamic factor model framework resembling that used by Coroneo, Giannone and Modugno (2016):
$$\begin{pmatrix} F_{y,t+h} \\ X_t \end{pmatrix} = \begin{pmatrix} c_y \\ c_x \end{pmatrix} + \begin{pmatrix} \Gamma_y & \Gamma_x \\ 0 & \Gamma_{xx} \end{pmatrix}\begin{pmatrix} F_{y,t} \\ F_{x,t} \end{pmatrix} + \begin{pmatrix} e_{y,t+h} \\ e_{x,t} \end{pmatrix},$$
where $F_{y,t}$ and $F_{x,t}$ are the yield factors and macro factors, respectively, $X_t$ is the vector of macroeconomic variables, $c_y$ and $c_x$ are vectors containing constant terms, $h$ is the forecast horizon, $\Gamma_y$ contains factor loadings on yield factors, $\Gamma_{xx}$ contains factor loadings on the macro factors, and $\Gamma_x$ summarizes the marginal effect of macro factors on yield factors. Additionally, $e_{y,t+h}$ and $e_{x,t}$ are idiosyncratic stochastic disturbance terms.
In their paper, Coroneo, Giannone and Modugno (2016) use a so-called expectation conditional restricted maximization algorithm for model estimation, and measure the effect of "unspanned" macroeconomic variables (risks) on the yield curve. We use principal component analysis (PCA) for estimating our macro diffusion indexes (i.e., macro factors), following Stock and Watson (2002a,b), and consider various alternative models that utilize macro diffusion indexes. For instance, we examine whether adding macro diffusion indexes to our DNS and NSS models improves the predictive accuracy of these models. Of course, we also consider baseline DNS (or NSS) models that contain only yield factors. More concretely, $h$-step ahead predictions for yield factors are constructed using the following model:
$$F^f_{y,t+h} = \hat c_y + \hat\Gamma_y F_{y,t},$$
where $F_{y,t}$ is our estimated DNS (or NSS) latent factor vector (i.e., $F_{y,t}$ contains our betas in the above discussion), $F^f_{y,t+h}$ is our prediction, constructed by specifying simple AR(1) or VAR(1) models, $\hat c_y$ is an estimate of $c_y$, and $\hat\Gamma_y$ is an estimate of $\Gamma_y$. We additionally add the first $r_x$ principal components from a PCA of our real-time dataset, denoted as $F_{x,t}$, to the above prediction model, yielding:
$$F^f_{y,t+h} = \hat c_y + \hat\Gamma_y F_{y,t} + \hat\Gamma_x F_{x,t},$$
where $\hat\Gamma_x$ is an estimate of $\Gamma_x$. When predicting yields, in addition to utilizing DNS and NSS models, we also examine whether adding macro diffusion indexes to benchmark AR and VAR models improves predictive accuracy. In particular, we consider the following model:
$$\begin{pmatrix} y_{t+h} \\ X_t \end{pmatrix} = c + \begin{pmatrix} \Delta_y & \Delta_x \\ 0 & \Gamma_{xx} \end{pmatrix}\begin{pmatrix} y_t \\ F_{x,t} \end{pmatrix} + e_{t+h},$$
where $c$ is the vector containing constant terms, all coefficient matrices (i.e., $\Delta_y$, $\Delta_x$, and $\Gamma_{xx}$) are conformably defined, $\Delta_x$ summarizes the marginal effect of macro diffusion indexes on yields, and $e_{t+h}$ is an idiosyncratic stochastic disturbance term. Summarizing, our focus of interest is on $h$-step ahead yield predictions constructed using the following model:
$$\hat y_{t+h}(\tau) = \hat c(\tau) + \hat\delta_y y_t,$$
where $\hat c(\tau)$ is an estimate of $c(\tau)$, which is an element of $c$. Also, $\hat\delta_y$ is an estimate of $\delta_y$, which is a row vector of $\Delta_y$. Here, $y_t$ contains lags of $y_{t+1}(\tau)$ in autoregressive specifications, and contains lags of $y_{t+1}$ in vector autoregressive specifications. We additionally add the macro diffusion indexes discussed above,
$$\hat y_{t+h}(\tau) = \hat c(\tau) + \hat\delta_y y_t + \hat\delta_x F_{x,t},$$
where $\hat\delta_x$ is an estimate of $\delta_x$, which is a row vector of $\Delta_x$. For further discussion of diffusion indexes in macroeconomic forecasting, see Banerjee, Marcellino and Masten (2008), Boivin and Ng (2005), and Kim and Swanson (2014).
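The hybrid prediction model described above (yield factors regressed on their own lags plus macro diffusion indexes extracted by PCA) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data are random stand-ins rather than yields or FRED-MD series, the regression is a direct $h$-step projection estimated by OLS, and in a real-time experiment the PCA step would be repeated on each data vintage.

```python
import numpy as np


def pca_factors(X, r):
    """First r principal components (scores) of the standardized T x N panel X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :r] * S[:r]


def factor_forecast(Fy, Fx, h):
    """Direct h-step forecast from F_{y,t+h} = c_y + G_y F_{y,t} + G_x F_{x,t} + e."""
    T = Fy.shape[0]
    Z = np.column_stack([np.ones(T - h), Fy[:-h], Fx[:-h]])   # regressors dated t
    coef, *_ = np.linalg.lstsq(Z, Fy[h:], rcond=None)          # targets dated t+h
    z_last = np.concatenate([[1.0], Fy[-1], Fx[-1]])           # most recent regressors
    return z_last @ coef


rng = np.random.default_rng(0)
T, N, r_x, h = 200, 30, 2, 1
X = rng.standard_normal((T, N))     # stand-in for one vintage of macro data
Fx = pca_factors(X, r_x)            # macro diffusion indexes
Fy = rng.standard_normal((T, 3))    # stand-in for estimated DNS betas
forecast = factor_forecast(Fy, Fx, h)
```

Dropping the `Fx` columns from the regressor matrix gives the baseline model with yield factors only, so the two specifications can be compared forecast by forecast.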
In the following section, we review methods used to estimate diffusion indexes, which are denoted $F_{y,t}$ and $F_{x,t}$ in the above discussion. This is done in the context of constructing forecasts using factor augmented autoregressions. This class of models nests the model outlined above. Our discussion centers around both the specification of the models and their estimation. When discussing estimation, we focus on the use of "un-targeted" prediction, in which case diffusion indexes are extracted from a large dataset, and the diffusion indexes that are utilized in forecasting models are simply those that are the most information rich, in the sense that they explain the largest share of the overall covariance matrix of all of the variables in the dataset. In the context of PCA, this corresponds to using diffusion indexes which are the eigenvectors associated with the largest eigenvalues of an eigenvector-eigenvalue decomposition of the correlation matrix of the variables in the dataset. Needless to say, such an approach does not guarantee that a particular set of variables (i.e., interest rates) can be predicted with optimal precision using diffusion indexes selected this way.
The above arguments suggest that it may, in some cases, be preferable to use "targeted prediction", where "targeting" means that the variables used in forming diffusion indexes are selected in order to maximize their association with the actual variables that are being predicted. In this approach, one might use machine learning, say, in order to "pre-select" a set of relevant variables for inclusion in the analysis used to construct the diffusion indexes. Examples of variable selection type machine learning methods that can be used for targeted prediction include the elastic net, the lasso, and the non-negative garrote. Before turning to a discussion of such methods, however, we first discuss factor augmented forecasting models and the construction of diffusion indexes.
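One way to make "targeting" concrete is to pre-select predictors with the lasso and then extract principal components from the selected variables only. The sketch below implements the lasso by cyclic coordinate descent so that it is self-contained; the function names, the penalty value, and the simulated data are hypothetical, and the surveyed papers may implement targeting differently.

```python
import numpy as np


def lasso_cd(Z, y, alpha, n_iter=200):
    """Lasso by cyclic coordinate descent: min (1/2T)||y - Zb||^2 + alpha*||b||_1.

    Columns of Z are assumed standardized and y demeaned.
    """
    T, N = Z.shape
    b = np.zeros(N)
    col_ss = (Z ** 2).sum(axis=0) / T
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(N):
            resid += Z[:, j] * b[j]                  # remove variable j's contribution
            rho = Z[:, j] @ resid / T
            b[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_ss[j]  # soft threshold
            resid -= Z[:, j] * b[j]
    return b


def targeted_factors(X, y, alpha, r):
    """Keep the predictors with nonzero lasso coefficients, then run PCA on them.

    X (T x N) holds predictors dated t and y (T,) the target dated t+h, already aligned.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    b = lasso_cd(Z, y - y.mean(), alpha)
    keep = np.flatnonzero(b)
    U, S, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    return keep, U[:, :r] * S[:r]


rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
# Made-up target that depends on only two of the twenty predictors.
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.1 * rng.standard_normal(200)
keep, F_targeted = targeted_factors(X, y, alpha=0.3, r=1)
```

The un-targeted alternative would skip `lasso_cd` and apply the PCA step to all twenty columns, which mixes the two relevant predictors with eighteen irrelevant ones.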

Forecasting Using Factor Augmented Autoregressive Models
In addition to utilizing the models discussed in the previous two sections, the empirical analysis discussed in this paper draws on some of the most highly touted recent developments in forecasting concerning estimation and asymptotic properties of diffusion indexes based on PCA; and the use of diffusion indexes in the construction of forecasting models. Drawing from the discussion in Swanson and Xiong (2018a), we summarize key features of recent developments by considering static and dynamic factor models in order to motivate the use of diffusion indexes in forecasting. For further discussion, refer to Stock and Watson (2002a,b) and Armah and Swanson (2010a,b). Let $y_{t+h}$ be the scalar target forecast variable and $X_t$ be an $N$-dimensional vector of predictor variables, for $t = 1, \ldots, T$. Assume that $(y_{t+h}, X_t)$ has a dynamic factor model representation with $\bar r$ common dynamic factors, $f_t$, which can be written as:
$$y_{t+h} = \alpha(L)' f_t + \beta' W_t + \varepsilon_{t+h} \tag{3.5}$$
and
$$x_{it} = \lambda_i(L)' f_t + e_{it}, \tag{3.6}$$
for $i = 1, 2, \ldots, N$, where $W_t$ is an $l \times 1$ vector of observable variables with $l \ll N$, including lags of $y_t$; $\alpha(L) = \sum_{j=0}^{q}\alpha_j L^j$ and $\lambda_i(L) = \sum_{j=0}^{q}\lambda_{ij}L^j$ are finite order lag polynomials in nonnegative powers of $L$; $\varepsilon_{t+h}$ is a stochastic disturbance term; and $h > 0$ is the forecast horizon. It is important to note that this framework ensures that all variables in $X_t$ can be expressed as a linear function of the dynamic factors (and an idiosyncratic shock, $e_{it}$). For a discussion of approximate factor models, in which this condition does not hold, refer to Carrasco and Rossi (2016). Next, write (3.5) and (3.6) in static form as:
$$y_{t+h} = \alpha' F_t + \beta' W_t + \varepsilon_{t+h} \tag{3.7}$$
and
$$x_{it} = \Lambda_i' F_t + e_{it}, \tag{3.8}$$
where $F_t = (f_t', \ldots, f_{t-q}')'$ is an $r \times 1$ vector of static factors, with $r = (q+1)\bar r$, $\alpha$ is an $r \times 1$ vector, and $\Lambda_i = (\lambda_{i0}', \ldots, \lambda_{iq}')'$ is a vector of factor loadings on the static factors, where $\lambda_{ij}$ is an $\bar r \times 1$ vector for $j = 0, \ldots, q$ and $\beta = (\beta_1, \ldots, \beta_l)'$. The model in (3.7) is called a factor augmented forecasting model (see Stock and Watson (2002a,b) and Bai and Ng (2007)). The static factor model in (3.8) is thus named because of the contemporaneous relationship between $x_{it}$ and $F_t$. One major advantage of the static representation of the dynamic factor model is that it enables one to utilize PCA to estimate the diffusion indexes (factors).
An important theoretical feature of the model in (3.7) is that consistent estimation of the factors in $F_t$, which can be achieved via simple application of PCA, allows for subsequent $\sqrt{T}$ consistent estimation of $\alpha$ and $\beta$ in (3.7) using quasi-maximum likelihood, as long as $\sqrt{T}/N \to 0$, as $N, T \to \infty$. Thus, as shown in Bai and Ng (2006), $F_t$, when estimated using the PCA method outlined in Stock and Watson (2002a,b), can be treated as a vector of observed regressors, eschewing the need to address the generated regressors problem that often arises in applied econometrics. For a discussion of alternative methods for factor forecasting based on estimation of generalized dynamic factor (GDF) models, see Forni, Hallin, Lippi and Reichlin (2005) and Forni, Hallin, Lippi and Zaffaroni (2015). Note also that Boivin and Ng (2005) compare alternative factor based forecast methodologies, and conclude that when the dynamic structure is unknown, the static factor modeling approach of Stock and Watson performs favorably when compared with dynamic factor modeling.
In the literature on factor modeling, many additional questions that arise in the context of diffusion index estimation have been addressed. For example, Bai and Ng (2006b) examine whether observable economic variables can serve as proxies for the underlying unobserved factors. In particular, they use a variety of statistics to determine whether a group of observed variables yields the same information as that contained in the latent factors. Armah and Swanson (2010) and Stock and Watson (2002a) also discuss methods for reducing the complexity of estimated diffusion indexes. Stock and Watson (1998, 2009) demonstrate that when PCA is used in estimation, factors remain consistent even when there is some time variation in factor loadings and small amounts of data contamination, so long as the number of variables in the panel dataset or the number of predictors is very large (i.e., $N \gg T$). The usefulness of factor augmented models that include cointegration restrictions is discussed in Banerjee, Marcellino and Masten (2014). The importance of assessing and testing for structural breaks in these models is [...]
Turning to estimation, let $X$ denote the $T \times N$ matrix of observations on the predictors. For a given number of factors, $k$, the factors and loadings can be estimated by solving the least squares problem
$$V(k) = \min_{\Lambda^k, F^k}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(x_{it} - \Lambda_i^{k\prime}F_t^k\right)^2. \tag{3.9}$$
Let $F^k$ and $\Lambda^k$ be the minimizers of equation (3.9). Since $\Lambda^k$ and $F^k$ are not separately identifiable, if $N > T$, a computationally expedient approach would be to concentrate out $\Lambda^k$ and minimize (3.9) subject to the normalization $F^{k\prime}F^k/T = I_k$. Minimizing (3.9) is equivalent to maximizing $\mathrm{tr}[F^{k\prime}(XX')F^k]$. This optimization is solved by setting $F^k$ to be the matrix of the $k$ eigenvectors of $XX'$ that correspond to the $k$ largest eigenvalues of $XX'$. Note that $\mathrm{tr}[\cdot]$ represents the matrix trace. Let $D$ be a $k \times k$ diagonal matrix consisting of the $k$ largest eigenvalues of $XX'$. The estimated factor matrix, denoted by $\tilde F^k$, is $\sqrt{T}$ times the eigenvectors corresponding to the $k$ largest eigenvalues of the $T \times T$ matrix $XX'$. Given $\tilde F^k$, the corresponding loadings are $\tilde\Lambda^{k\prime} = (\tilde F^{k\prime}\tilde F^k)^{-1}\tilde F^{k\prime}X = \tilde F^{k\prime}X/T$. The solution to the optimization problem in (3.9) is not unique. If $N < T$, it becomes computationally advantageous to concentrate out $F^k$ and minimize (3.9) subject to $\Lambda^{k\prime}\Lambda^k/N = I_k$. This minimization is the same as maximizing $\mathrm{tr}[\Lambda^{k\prime}X'X\Lambda^k]$, the solution of which is to set $\bar\Lambda^k$ equal to $\sqrt{N}$ times the eigenvectors of the $N \times N$ matrix $X'X$ that correspond to its $k$ largest eigenvalues. One can thus estimate the factors as $\bar F^k = X\bar\Lambda^k/N$. $\tilde F^k$ and $\bar F^k$ span the same column spaces, hence for forecasting purposes, they can be used interchangeably. Given $\tilde F^k$, let
$$V(k, \tilde F^k) = \min_{\Lambda}\frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T}\left(x_{it} - \Lambda_i^{k\prime}\tilde F_t^k\right)^2$$
be the sum of squared residuals (divided by $NT$) from regressions of $X_i$ on the $k$ factors, $\forall i$. A penalty function for overfitting, $g(N, T)$, is chosen such that the loss function $V(k, \tilde F^k) + k\,g(N, T)$ can consistently estimate $r$. Let $k_{max}$ be a bounded integer such that $r \leq k_{max}$. Bai and Ng (2002) propose three versions of the penalty function $g(N, T)$, namely,
$$g_1(N,T) = \frac{N+T}{NT}\ln\left(\frac{NT}{N+T}\right), \quad g_2(N,T) = \frac{N+T}{NT}\ln C_{NT}^2, \quad g_3(N,T) = \frac{\ln C_{NT}^2}{C_{NT}^2},$$
where $C_{NT} = \min\{\sqrt{N}, \sqrt{T}\}$, all of which lead to consistent estimation of $r$.
Additional details on the estimation of $r$ are contained in Bai and Ng (2002). Alternative methods for selecting $r$ are discussed in Chen, Huang, and Tu (2010), Onatski (2015), Carrasco and Rossi (2016), and the references cited therein.
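The principal components estimator and the selection of the number of factors discussed above can be sketched as follows. The sketch estimates the factors as $\sqrt{T}$ times the leading eigenvectors of $XX'$ and chooses the number of factors by minimizing $\ln V(k) + k\,g_1(N,T)$, which is the logarithmic information criterion of Bai and Ng (2002) with their first penalty; using the log form rather than the level of $V(k)$ is a choice made here for convenience. Function names and the simulated two-factor panel are hypothetical.

```python
import numpy as np


def estimate_factors(X, k):
    """PCA factor estimates: sqrt(T) times the top-k eigenvectors of XX'."""
    T = X.shape[0]
    _, eigvec = np.linalg.eigh(X @ X.T)          # eigenvalues in ascending order
    F = np.sqrt(T) * eigvec[:, ::-1][:, :k]      # top-k eigenvectors, so F'F/T = I_k
    Lam = X.T @ F / T                            # loadings given the factors
    return F, Lam


def select_num_factors(X, kmax):
    """Choose k in 1..kmax minimizing ln V(k) + k * g1(N, T)."""
    T, N = X.shape
    g1 = (N + T) / (N * T) * np.log(N * T / (N + T))
    ic = []
    for k in range(1, kmax + 1):
        F, Lam = estimate_factors(X, k)
        V = np.mean((X - F @ Lam.T) ** 2)        # (1/NT) * sum of squared residuals
        ic.append(np.log(V) + k * g1)
    return int(np.argmin(ic)) + 1


# Simulated panel with two strong factors plus unit-variance noise (made-up data).
rng = np.random.default_rng(2)
T, N, r_true = 100, 50, 2
X = rng.standard_normal((T, r_true)) @ rng.standard_normal((r_true, N))
X = X + rng.standard_normal((T, N))
F_hat, Lam_hat = estimate_factors(X, r_true)
r_hat = select_num_factors(X, kmax=8)
```

When $N < T$ it would be cheaper to decompose the $N \times N$ matrix $X'X$ instead; the resulting factor estimates span the same column space.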
In the above discussion, we alluded to the fact that diffusion indexes can be quite complex. For example, diffusion indexes constructed using PCA are linear combinations of every variable in the dataset being used by the applied practitioner. We also mentioned that dimension reduction of a dataset can be accomplished prior to application of PCA, for example, by applying machine learning, shrinkage or other variable selection methods. When such methods are used, the variable selection undertaken can be "targeted" so that only variables useful for predicting the target variable are selected. These variables can then in turn be used to construct diffusion indexes that are potentially much less complex than those constructed by directly applying PCA. In the next section, we discuss a variety of such methods.

Machine Learning, Dimension Reduction, Shrinkage, and Variable Selection Methods
In this section, we briefly review select machine learning, dimension reduction, shrinkage, and variable selection methods that are important in economics in general, and in particular for our discussion of the usefulness of "big data" and diffusion indexes in DNS model based forecasting.² For forecasters, a key objective when predicting economic variables (such as interest rates) using big data is to remove redundant and irrelevant information from datasets. This is particularly important if the objective is targeted prediction. This problem has historically been tackled via step-wise regression and ridge regression. However, variables are typically highly correlated in time series applications. Hence, statistical significance tests used in many regression type algorithms suffer from severe size distortion issues. Ghysels, Hill, and Motegi (2017) address this issue by examining multiple parsimonious regressions, each with one key regressor, while jointly accounting for sequential testing problems.
A second solution to the dimension reduction problem with correlated regressors is the use of partial least squares (PLS), which was originally proposed by Herman Wold in the mid 1960s. Broadly speaking, PLS is a latent variable approach to modeling the covariance structure between two sets of variables. One set might be a target variable or variables to be predicted (say Y), while the other might be a very large set of correlated predictor variables, say X. More precisely, the model underlying PLS has

$$X = F_1 L_1' + E_1, \qquad Y = F_2 L_2' + E_2,$$

where $F_1$ and $F_2$ are projection matrices of X and Y; and $L_1$ and $L_2$ are so-called factor loading matrices that operate on the latent factors $F_1$ and $F_2$. Additionally, the error terms, $E_1$ and $E_2$, are assumed to be independently and identically distributed, and all matrices are conformably defined, given the dimensions of X and Y. In this setup, the decompositions of X and Y maximize the covariance between the latent factors $F_1$ and $F_2$.
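The covariance-maximizing property of PLS can be illustrated for the simplest case of a single target variable and a single latent component. The sketch below is an illustration under stated assumptions rather than a full PLS algorithm: it computes only the first PLS weight vector, which is proportional to $X'y$ for centered data, and compares the resulting score with the first principal component. The function names and the simulated data are hypothetical.

```python
import numpy as np

def pls_first_component(X, y):
    """First PLS direction for a single target: w proportional to X'y.

    Among unit-length weight vectors w, this choice maximizes the sample
    covariance between the score X w and y (X and y are centered first).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)
    return w, Xc @ w

def pca_first_component(X):
    """First principal component direction (constructed without using y)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0], Xc @ Vt[0]

rng = np.random.default_rng(1)
T, N = 200, 30
X = rng.standard_normal((T, N))
X[:, 0] *= 3.0                                   # high-variance but irrelevant predictor
y = X[:, 5] - X[:, 7] + 0.1 * rng.standard_normal(T)

w_pls, t_pls = pls_first_component(X, y)
w_pca, t_pca = pca_first_component(X)
yc = y - y.mean()
cov_pls = abs(t_pls @ yc) / T                    # covariance of PLS score with y
cov_pca = abs(t_pca @ yc) / T                    # covariance of PCA score with y
```

In this simulated design the principal component is dominated by the high-variance column, which carries no information about y, while the PLS weights concentrate on the predictors that covary with the target.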
A third solution uses principal components analysis (PCA), in which latent factors (often called diffusion indexes) are again estimated, but this time via use of an eigenvalue-eigenvector decomposition of the covariance or correlation matrix of the data, for example. Just as in PLS, the objective is to "explain" the data using a reduced set of (latent) explanatory variables, with the idea being that the useful information in a large set of predictors is often contained in a (much smaller) set of latent factors, which are themselves simply linear combinations of the original variables. A key difference between PCA and PLS is that PLS directly attempts to account for correlation between the target variable and the predictors, while PCA is "unsupervised", in the sense that correlation with any given target variable is not emphasized in the construction of the latent factors. Rather, overall explanation of the entire dataset is the focus of PCA. Needless to say, this particular feature of PCA is of potential concern when targeting (predicting) a specific variable or variables. For this reason, many supervised versions of PCA have been developed. For example, Carrasco and Rossi (2016) use cross validation methods to supervise PCA, while Bai and Ng (2008) consider targeted forecasting using subsets of X (see also Armah and Swanson (2010a,b) and Cheng, Swanson, and Yang (2017)). Given its ease of application as well as recent empirical evidence on its usefulness, PCA (which is the oldest of the methods discussed in this paper; see Spearman (1904) and the discussion in Swanson (2016) for further details) has received the most attention in economics recently, and hence will be discussed in considerably more detail below.
Penalized regression or shrinkage methods, which reduce or shrink the influence of redundant or irrelevant variables, are also important in big data analysis. Key examples include ridge regression, the lasso, and the elastic net. When viewed through the lens of multivariate regression analysis, all of these methods involve shrinking the magnitude of coefficients in regression models. When the "penalty functions" are carefully designed, and when the "regularization parameters" used to regulate the strength of the penalties in these functions are of sufficient magnitude, then substantial dimension reduction can be achieved. For example, when shrinkage is used in conjunction with PCA, factor loading matrices can be induced to be sparse, in the sense that certain coefficients in the linear combinations of the predictor variables are identically zero. This nice feature imposes parsimony on the number of variables used to form latent factors in PCA, whereas under standard PCA, all predictors receive non-zero weight in each latent factor. Just as in the case of PLS, the number of predictors may be greater than the number of observations in the dataset being analyzed using PCA.
To fix ideas, let's consider the "original" shrinkage estimator. Namely, consider ridge regression and its associated estimator. Assume that we are interested in the following regression model:

$$Y = X\theta + \varepsilon,$$

where Y contains data on a single variable, there are many (possibly highly correlated) variables represented in the data matrix, X, and ε is an error term. Later, we shall introduce the ridge estimator slightly differently, but for now, note that the ridge estimator can be expressed as:

$$\hat\theta_{ridge} = (X'X + \lambda I_N)^{-1} X'Y.$$

The "ridge" down the diagonal in this estimator is equivalent to adding a penalty of $\lambda \sum_{i=1}^{N} \theta_i^2$ to the usual residual sum of squares term that is minimized in least squares estimation of the above regression model, where N is the number of predictors in X. Here, as λ → 0, $\hat\theta_{ridge} \to \hat\theta_{ols}$, and as λ → ∞, $\hat\theta_{ridge} \to 0$.
Evidently, applying the ridge penalty shrinks parameter estimates towards zero, which increases bias and reduces estimator variance. One very important feature of ridge regression is that invertibility problems associated with $X'X$ when the number of predictors is too large relative to the number of observations are no longer an issue, and there is always a unique solution (i.e., $\hat\theta_{ridge}$). Other shrinkage estimators that shall be discussed in the sequel include one where the penalty function is $\lambda \sum_{i=1}^{N} |\theta_i|$ (the lasso) and another that combines both of the above penalty functions (the elastic net).
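The limiting behavior of the ridge estimator, and the fact that it remains well defined when there are more predictors than observations, can be checked numerically. The following is a minimal sketch using simulated data; the function name `ridge`, the dimensions, and the particular values of λ are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X'X + lam * I)^{-1} X'y."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)

rng = np.random.default_rng(2)
T, N = 50, 5
X = rng.standard_normal((T, N))
theta = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ theta + 0.1 * rng.standard_normal(T)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_small = ridge(X, y, 1e-10)     # lam close to 0: approaches OLS
theta_mid = ridge(X, y, 10.0)        # moderate shrinkage towards zero
theta_large = ridge(X, y, 1e10)      # very large lam: approaches zero

# More predictors than observations: X'X is singular, yet ridge is unique.
X_wide = rng.standard_normal((10, 40))
y_wide = rng.standard_normal(10)
rank_wide = np.linalg.matrix_rank(X_wide.T @ X_wide)   # at most 10, below 40
theta_wide = ridge(X_wide, y_wide, 1.0)
```

The wide example shows why the penalty matters in big data settings: least squares has no unique solution there, while the ridge system is always invertible for λ > 0.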
Another shrinkage estimator is based on bootstrap aggregation (bagging), and was introduced by Breiman (1996). Stock and Watson (2012) note that predictions of Y at a point in time, T + 1, conditional on information available up through period T, say $y^f_{T+1|T}$, can be constructed as follows:

$$y^f_{T+1|T} = \sum_{i=1}^{N} \psi(\lambda t_{\hat\theta(i)}) \, \hat\theta(i) \, X_T(i),$$

where $X_T(i)$ is the datum on the i-th variable in X for period T, $\hat\theta(i)$ is the least squares estimator from regressing $Y_T$ on $X_{T-1}(i)$, and $\psi(\lambda t_{\hat\theta(i)})$ is a regularized (through λ) function of the t-statistic associated with the aforementioned regression.³ For bagging, λ = 1, while various Bayesian predictors, including Bayesian model averaging and empirical Bayes, can also be formulated in this manner, by setting λ appropriately. Interestingly, Hirano and Wright (2017) show that forecasting models constructed using out-of-sample or split sample schemes perform well only when combined with other methods, such as bagging. Broadly speaking, their results offer a glimpse into the benefits of using state of the art (asymptotic) statistical analysis in order to examine new methods that combine conventional out-of-sample approaches to model selection and estimation with algorithmic approaches such as bagging. In their paper, they show that out-of-sample schemes so regularly used for model selection (and estimation) are inefficient when applied in the conventional manner. This finding is reversed when bagging or other risk reduction methods are combined with conventional out-of-sample schemes, however.
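As a rough illustration of how bagging smooths a t-statistic based selection rule, the sketch below averages a hard-threshold (pretest) coefficient over bootstrap resamples, predictor by predictor. This is a simplified stand-in and not the ψ function analyzed by Stock and Watson (2012). It assumes mean-zero data, approximately orthogonal predictors, i.i.d. resampling of rows (which ignores serial correlation), a critical value of 1.96, and 200 bootstrap draws; all function names are hypothetical.

```python
import numpy as np

def pretest_coef(x, y, crit=1.96):
    """OLS slope of y on x (no intercept), set to zero if |t| < crit."""
    theta = (x @ y) / (x @ x)
    resid = y - theta * x
    se = np.sqrt(resid @ resid / (len(y) - 1) / (x @ x))
    return theta if abs(theta / se) >= crit else 0.0

def bagged_forecast(X_lag, y, x_last, n_boot=200, seed=0):
    """Average pretest forecasts over bootstrap resamples, one predictor at a time.

    X_lag[t] holds predictors dated one period before y[t]; x_last holds the
    most recent predictor values used to form the forecast.
    """
    rng = np.random.default_rng(seed)
    T, N = X_lag.shape
    forecast = 0.0
    for i in range(N):
        draws = []
        for _ in range(n_boot):
            idx = rng.integers(0, T, size=T)      # resample rows with replacement
            draws.append(pretest_coef(X_lag[idx, i], y[idx]))
        forecast += np.mean(draws) * x_last[i]    # bagged coefficient times datum
    return float(forecast)

# Simulated example: only the first predictor matters (illustrative data only).
rng = np.random.default_rng(5)
T, N = 200, 5
X_lag = rng.standard_normal((T, N))
y = 0.8 * X_lag[:, 0] + 0.5 * rng.standard_normal(T)
x_last = np.ones(N)
forecast = bagged_forecast(X_lag, y, x_last)
```

Averaging over resamples replaces the discontinuous keep-or-drop rule with a smooth shrinkage weight: coefficients with large t-statistics are kept almost in full, while those with small t-statistics are shrunk heavily towards zero.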
As discussed earlier, ongoing research efforts in the study of factor augmented forecasting models include the analysis of problems associated with the "selection" of diffusion indexes that are most useful for predicting $y_{t+1}$. For example, see Bai and Ng (2008, 2009) and Schumacher (2009), who discuss using targeted predictors based on quadratic principal components and thresholding rules for variable subset selection to estimate diffusion indexes. Armah and Swanson (2010a,b) also discuss this issue. Further, Carrasco and Rossi (2016) propose cross validation methods for selecting the "best" diffusion index for use in forecasting. A related area of research, which is the subject of this subsection, is the development of alternative diffusion index estimators, important examples of which use shrinkage methods in order to impose sparseness on the factor loadings used in the construction of diffusion indexes. Two of the many interesting new estimators in this context include sparse principal components analysis (SPCA) and independent component analysis (ICA).
Zou, Hastie, and Tibshirani (2006) note that diffusion indexes estimated using PCA are linear combinations of all underlying predictor variables, and factor loadings are hence all nonzero, which adversely affects the parsimony of forecasting models, a property known to be important in time series forecasting.
Moreover, they stress that diffusion indexes are thus difficult to interpret. In light of this, they propose SPCA, in which the least absolute shrinkage selection operator (lasso) or the related shrinkage estimator called the elastic net is utilized in order to construct principal components with sparse loadings. This is done by first reformulating PCA as a regression type optimization problem, and then by using a lasso (elastic net) on the coefficients in a suitably constrained regression model.
Before further discussing SPCA, it is worth noting that the lasso and elastic net are important techniques for big data analysis in and of themselves, and are related to the venerable ridge regression estimator. Using the above notation, say that $y = X\theta + \varepsilon$. Here, penalized (shrinkage type) regression is carried out as follows. For the ridge estimator, construct:

$$\hat\theta_{ridge} = \arg\min_{\theta} \left\{ \| y - X\theta \|^2 + \lambda \sum_{i=1}^{N} \theta_i^2 \right\},$$

where y is the T × 1 target variable, $X = [X_1, ..., X_N]$ is the T × N predictor matrix, with $X_i = (X_{1,i}, ..., X_{T,i})'$, i = 1, ..., N, and λ > 0 is the tuning parameter. Notice that this is an alternative formulation of $\hat\theta_{ridge}$ to that given earlier. The more recently developed lasso and the elastic net estimators involve imposition of $L_1$ (lasso) and $L_1 + L_2$ norm penalties on parameter magnitudes, and are formulated as:

$$\hat\theta_{lasso} = \arg\min_{\theta} \left\{ \| y - X\theta \|^2 + \lambda_1 \sum_{i=1}^{N} |\theta_i| \right\},$$

$$\hat\theta_{EN} = \arg\min_{\theta} \left\{ \| y - X\theta \|^2 + \lambda_1 \sum_{i=1}^{N} |\theta_i| + \lambda_2 \sum_{i=1}^{N} \theta_i^2 \right\}.$$

Interestingly, SPCA follows directly by formulating PCA as a regression-type optimization problem, and then by subsequently imposing lasso (elastic net) constraints on the regression coefficients in the optimization problem. Put simply, factor loadings can be recovered by regressing principal components on the N variables in $X_t$, as shown in Zou, Hastie, and Tibshirani (2006). Here, imposition of the $L_2$ norm penalty in ridge regression allows for N > T. Moreover, when the lasso or elastic net is utilized in this context, then large enough $\lambda_1$ yields sparse θ. In this sense, SPCA is a natural data reduction method. Since the important paper by Zou et al., many authors have proposed modifications to SPCA, as discussed in Kim and Swanson (2017).
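The way an $L_1$ penalty produces exact zeros can be seen in a small coordinate descent routine. The sketch below is one standard way to minimize a penalized criterion of the form $\| y - X\theta \|^2 + \lambda_1 \sum |\theta_i| + \lambda_2 \sum \theta_i^2$; it is not taken from any of the cited papers, and the function names, the fixed number of sweeps, and the tuning parameter values are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, c):
    """Shrink z towards zero by c, returning exactly zero when |z| <= c."""
    return np.sign(z) * max(abs(z) - c, 0.0)

def elastic_net(X, y, lam1, lam2=0.0, n_iter=500):
    """Coordinate descent for ||y - X theta||^2 + lam1*sum|theta| + lam2*sum(theta^2).

    lam2 = 0 gives the lasso; lam1 = 0 gives ridge regression.
    """
    T, N = X.shape
    theta = np.zeros(N)
    col_ss = (X ** 2).sum(axis=0)                 # X_j'X_j for each column
    for _ in range(n_iter):
        for j in range(N):
            partial = y - X @ theta + X[:, j] * theta[j]   # residual excluding j
            rho = X[:, j] @ partial
            theta[j] = soft_threshold(rho, lam1 / 2.0) / (col_ss[j] + lam2)
    return theta

# Simulated sparse regression: 3 relevant predictors out of 20 (illustrative).
rng = np.random.default_rng(6)
T, N = 100, 20
X = rng.standard_normal((T, N))
theta_true = np.zeros(N)
theta_true[:3] = [2.0, -3.0, 1.5]
y = X @ theta_true + 0.1 * rng.standard_normal(T)

theta_lasso = elastic_net(X, y, lam1=40.0)
theta_en = elastic_net(X, y, lam1=40.0, lam2=10.0)
theta_ridge = elastic_net(X, y, lam1=0.0, lam2=10.0)
```

The soft-thresholding step is what distinguishes the lasso and elastic net from ridge: ridge coefficients are shrunk but generally remain non-zero, whereas a sufficiently large $\lambda_1$ sets coefficients exactly to zero.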
this assumption can be relaxed. Thus, it is assumed that X = SΩ. Stated differently, assume that:

$$X_j = \sum_{i=1}^{N} \omega_{ij} S_i, \quad j = 1, ..., N, \qquad (4.1)$$

where $\omega_{ij}$ is the (i, j) element of Ω. Since Ω and S are unobserved, one must estimate the "demixing matrix", Ψ, which transforms the observed X into the independent components, F. That is, F = XΨ, or F = SΩΨ. As detailed in Kim and Swanson (2017), if Ω is square, then so is Ψ, and $\Psi = \Omega^{-1}$, so that F is exactly the same as S, and perfect separation occurs. In general, it is only possible to find Ψ such that ΩΨ = PD, where P is a permutation matrix and D is a diagonal scaling matrix.
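The algebra of mixing and demixing can be verified directly when the mixing matrix is known. The sketch below is not an ICA estimation algorithm, since in practice Ω is unobserved and Ψ must be estimated; it only illustrates that $\Psi = \Omega^{-1}$ recovers the sources exactly, and that any permuted and rescaled version of Ψ yields an equally valid set of components. The mixing matrix and the source distribution are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 500, 3
S = rng.uniform(-1.0, 1.0, size=(T, N))        # independent, non-Gaussian sources
Omega = np.array([[1.0, 0.5, 0.2],
                  [0.3, 1.0, 0.4],
                  [0.1, 0.6, 1.0]])            # square mixing matrix (hypothetical)
X = S @ Omega                                   # observed signals, X = S Omega

Psi = np.linalg.inv(Omega)                      # demixing matrix when Omega is known
F = X @ Psi                                     # F = S Omega Psi = S

# Any permuted and rescaled demixing matrix is equally valid:
P = np.eye(N)[:, [2, 0, 1]]                     # permutation matrix
D = np.diag([2.0, -1.0, 0.5])                   # diagonal scaling matrix
Psi_alt = Psi @ P @ D                           # Omega @ Psi_alt = P D
F_alt = X @ Psi_alt                             # sources reordered and rescaled
```

The second half of the example is the reason independent components have no natural ordering or scale: the data cannot distinguish between Ψ and ΨPD.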
The independent components, F, are latent variables, and are analogous to the principal components discussed in the case of PCA. In summary, upon estimation of Ω and S, it is feasible to estimate the demixing matrix Ψ, and the independent components, F. However, (4.1) is not identified unless several assumptions are made. The first assumption is that the sources, S, are statistically independent. Since various sources of information (for example, consumer behavior, political decisions, etc.) may have an impact on the values of macroeconomic variables, this assumption is not strong. The second assumption is that the signals are stationary. For further details, see Tong, Liu, Soon, and Huan (1991). ICA maps the N components of X into the rank N matrix, F. However, we can simply construct factors using up to r (< N) components, without loss of generality, for comparability with PCA. Alternatively, one might carry out ICA using r principal components, hence further filtering diffusion indexes constructed using PCA in order to obtain statistically independent variants thereof (see Stone (2004) for further details).
In general, the above model would be more realistic if there were noise terms added. See Hyvärinen and Oja (2000) for a detailed discussion of the noise-free model, and Hyvärinen (1998, 1999) for a discussion of the model with noise added.
For a detailed comparison of ICA with PCA, see Kim and Swanson (2016), who note that the main difference between ICA and PCA is in the properties of the factors obtained. Principal components are uncorrelated and have descending variance so that they are naturally ordered in terms of their variances.
While setting the diffusion index in equation (3.5) equal to the highest variance (correlation) principal components may well not equate with the specification of the indexes that are most useful for forecasting a given variable, say $y_t$, it is certainly the case that components explaining the largest share of the variance are often assumed to be the "relevant" ones. For simplicity, consider two observables, $X = (X_1, X_2)'$. PCA finds a matrix which transforms X into uncorrelated components $F = (F_1, F_2)'$, such that the uncorrelated components have a joint probability density function, $p_F(F)$, with:

$$E[F_1 F_2] = E[F_1] E[F_2].$$

On the other hand, ICA finds a demixing matrix which transforms the observed $X = (X_1, X_2)'$ into independent components $F^* = (F_1^*, F_2^*)'$, such that the independent components have a joint pdf, $p_{F^*}(F^*)$, with:

$$E[F_1^{*p} F_2^{*q}] = E[F_1^{*p}] E[F_2^{*q}],$$

for every positive integer value of p and q. Evidently, ICA is more restrictive, and it should thus not be surprising that implementation is much more difficult than PCA, in which estimation is much simpler, since it just involves finding a linear transformation of components which are uncorrelated. Moreover, there is no natural ordering of latent factors in ICA. This is perhaps a blessing in disguise. Namely, as stated above, there is no a priori reason why the ordinal (correlation) ranking of diffusion indexes corresponds to a ranking of their usefulness for predicting $y_t$ (see Kim and Swanson (2014), Bai and Ng (2008) and Carrasco and Rossi (2016) for further discussion of this issue).
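The gap between "uncorrelated" and "independent" can be illustrated with sample moments. In the sketch below, two independent uniform sources are rotated by 45 degrees; the rotated components are uncorrelated, as PCA would require, yet they fail the moment condition for p = q = 2, so they are not independent. The choice of distribution, rotation, and sample size are assumptions made for illustration, and `moment_gap` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 200_000
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(T, 2))   # independent, unit variance
c = 1.0 / np.sqrt(2.0)
R = np.array([[c, c],
              [c, -c]])                                      # orthogonal 45-degree mixing
F = S @ R                                                     # uncorrelated but dependent

def moment_gap(A, p, q):
    """Sample estimate of E[a1^p a2^q] - E[a1^p] E[a2^q]."""
    a1, a2 = A[:, 0], A[:, 1]
    return np.mean(a1 ** p * a2 ** q) - np.mean(a1 ** p) * np.mean(a2 ** q)

gap_F_11 = moment_gap(F, 1, 1)    # close to zero: F is uncorrelated
gap_F_22 = moment_gap(F, 2, 2)    # clearly negative: F is not independent
gap_S_22 = moment_gap(S, 2, 2)    # close to zero: S is independent
```

For this design the population value of the (2, 2) gap for F is 0.4 − 1 = −0.6, since the fourth moment of each unit-variance uniform source is 1.8. A method that only removes correlation would stop at F, while ICA would continue rotating until the higher-order gaps also vanish.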
Even given all of the recent progress in the area, much remains to be done. There are a vast number of estimators and algorithms that can be utilized for machine learning (we have touched in our discussion on only a very few of these). In the end, what will probably differentiate the "good methods" from the "not so good" is their ability to properly marry the latest tools in statistical inference with the latest algorithmic techniques. For example, step-wise methods now often rely on learning functions and thresholding variables (such as t-statistics) centered around conditional mean type prediction, while there is clearly a need to fully incorporate conditional or predictive density type prediction in new methods. As another example, recall our earlier discussion on the use of asymptotic analysis to examine the combination of conventional out-of-sample schemes with bootstrap aggregation. Many of these sorts of analyses remain to be done in the context of combining conventional forecasting approaches with state of the art dimension reduction, machine learning, and penalized regression algorithms.
5 Survey of Select Recent Empirical Findings

Setup
In this section, we survey recent empirical findings reported in Swanson and Xiong (2018a,b). In these two papers, the authors compare interest rate predictions from a variety of different models, including:

(i) Benchmark time series models:

$$y_{t+h}(\tau) = c(\tau) + \delta_y' W_t + \varepsilon_{t+h},$$

where τ denotes the maturity of a bond (bill) for which the scalar, $y_{t+h}(\tau)$, measures the annual yield. Additionally, $W_t$ may contain lags of $y_t(\tau)$ as well as lags of additional explanatory variables, $\delta_y$ is a conformably defined coefficient vector, and c(τ) is a constant term. In all models, up to 5 lags of $y_t(\tau)$ are included, with the number of lags selected using the Schwarz information criterion (SIC). In addition to estimating AR(SIC) and VAR(SIC) models, straw-man AR(1) and VAR(1) models are estimated. Finally, in experiments where VAR models are estimated, $W_t$ includes five bond yields with maturities 3 months, 1 year, 3 years, 5 years, and 10 years.
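(ii) DNS and DNSS models: These models include the dynamic Nelson-Siegel (DNS) and dynamic Nelson-Siegel-Svensson (DNSS) models discussed above. As outlined in Swanson and Xiong (2018b), forecasting model estimation in this context is carried out by estimating latent factors using the following regression:

$$y_t(\tau) = \beta_{1,t} + \beta_{2,t} \left( \frac{1 - e^{-\lambda \tau}}{\lambda \tau} \right) + \beta_{3,t} \left( \frac{1 - e^{-\lambda \tau}}{\lambda \tau} - e^{-\lambda \tau} \right) + \varepsilon_t,$$

where $\varepsilon_t$ is a stochastic disturbance term and λ is the decay parameter of the Nelson-Siegel loadings. Forecasts of $y_{t+h}$ are constructed using the model:

$$\hat y_{t+h}(\tau) = \beta^f_{1,t+h} + \beta^f_{2,t+h} \left( \frac{1 - e^{-\lambda \tau}}{\lambda \tau} \right) + \beta^f_{3,t+h} \left( \frac{1 - e^{-\lambda \tau}}{\lambda \tau} - e^{-\lambda \tau} \right),$$

where $\hat y_{t+h}(\tau)$ is a scalar, and $\beta^f_{1,t+h}$, $\beta^f_{2,t+h}$, and $\beta^f_{3,t+h}$ are predictions constructed by specifying AR and VAR models for $\beta_{1,t}$, $\beta_{2,t}$, and $\beta_{3,t}$.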
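Analogously, estimates of the DNSS factors (i.e., $\beta_{1,t}$, $\beta_{2,t}$, $\beta_{3,t}$, and $\beta_{4,t}$) are constructed at each point in time by regressing yields on $\left(1, \frac{1 - e^{-\lambda_1 \tau}}{\lambda_1 \tau}, \frac{1 - e^{-\lambda_1 \tau}}{\lambda_1 \tau} - e^{-\lambda_1 \tau}, \frac{1 - e^{-\lambda_2 \tau}}{\lambda_2 \tau} - e^{-\lambda_2 \tau}\right)$. In this case, thus, the regression model is:

$$y_t(\tau) = \beta_{1,t} + \beta_{2,t} \left( \frac{1 - e^{-\lambda_1 \tau}}{\lambda_1 \tau} \right) + \beta_{3,t} \left( \frac{1 - e^{-\lambda_1 \tau}}{\lambda_1 \tau} - e^{-\lambda_1 \tau} \right) + \beta_{4,t} \left( \frac{1 - e^{-\lambda_2 \tau}}{\lambda_2 \tau} - e^{-\lambda_2 \tau} \right) + \varepsilon_t. \qquad (5.4)$$

Forecasts of $y_{t+h}(\tau)$ are constructed using:

$$\hat y_{t+h}(\tau) = \beta^f_{1,t+h} + \beta^f_{2,t+h} \left( \frac{1 - e^{-\lambda_1 \tau}}{\lambda_1 \tau} \right) + \beta^f_{3,t+h} \left( \frac{1 - e^{-\lambda_1 \tau}}{\lambda_1 \tau} - e^{-\lambda_1 \tau} \right) + \beta^f_{4,t+h} \left( \frac{1 - e^{-\lambda_2 \tau}}{\lambda_2 \tau} - e^{-\lambda_2 \tau} \right), \qquad (5.5)$$

where $\hat y_{t+h}(\tau)$ is a scalar, and $\beta^f_{1,t+h}$, $\beta^f_{2,t+h}$, $\beta^f_{3,t+h}$, and $\beta^f_{4,t+h}$ are predictions constructed by specifying AR and VAR models. For complete details, refer to Swanson and Xiong (2018b).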
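The two-step procedure described above for the three-factor DNS case can be sketched as follows: the betas are first estimated date by date via cross-sectional least squares on the Nelson-Siegel loadings, and each beta is then forecast with a direct AR regression. This is a minimal sketch on simulated, noise-free yields, not the estimation code of Swanson and Xiong (2018a,b). The decay parameter value of 0.0609 (with maturities in months, as in Diebold and Li (2006)), the maturities, and all function names are assumptions; the DNSS case would add a fourth loading column.

```python
import numpy as np

def ns_loadings(tau, lam):
    """Nelson-Siegel loadings (level, slope, curvature) at maturities tau."""
    tau = np.asarray(tau, dtype=float)
    slope = (1.0 - np.exp(-lam * tau)) / (lam * tau)
    curvature = slope - np.exp(-lam * tau)
    return np.column_stack([np.ones_like(tau), slope, curvature])

def estimate_betas(yields, tau, lam):
    """Cross-sectional OLS of each date's yield curve on the loadings.

    yields is T x M (dates by maturities); returns a T x 3 array of betas.
    """
    H = ns_loadings(tau, lam)
    coef, *_ = np.linalg.lstsq(H, yields.T, rcond=None)
    return coef.T

def ar_direct_forecast(series, h=1):
    """Direct h-step forecast: regress beta_{t+h} on a constant and beta_t."""
    Z = np.column_stack([np.ones(len(series) - h), series[:-h]])
    c, gamma = np.linalg.lstsq(Z, series[h:], rcond=None)[0]
    return c + gamma * series[-1]

# Simulated betas and noise-free yields (illustrative data only).
rng = np.random.default_rng(7)
tau = np.array([3.0, 12.0, 36.0, 60.0, 120.0])   # maturities in months
lam = 0.0609                                      # assumed decay parameter
T = 150
betas_true = np.cumsum(0.1 * rng.standard_normal((T, 3)), axis=0) + [5.0, -2.0, 1.0]
H = ns_loadings(tau, lam)
yields = betas_true @ H.T

betas_hat = estimate_betas(yields, tau, lam)
beta_fcst = np.array([ar_direct_forecast(betas_hat[:, i], h=1) for i in range(3)])
yield_fcst = H @ beta_fcst                        # one forecast per maturity
```

Because the simulated yields contain no measurement error, the first step recovers the betas exactly; with real data the cross-sectional residuals would be non-zero.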
(iii) Hybrid DNS and DNSS models with diffusion indexes: All of the above models are also estimated with latent factors (i.e., diffusion indexes) added as additional regressors. In particular, for the above benchmark time series models, predictions are constructed using

$$y_{t+h}(\tau) = c(\tau) + \delta_y' W_t + \delta_x' F^x_t + \varepsilon_{t+h}, \qquad (5.6)$$

where $F^x_t$ includes either 1, 2 or 3 diffusion indexes, and $W_t$ is defined as above, yielding AR and VAR variants of these models. Here, c(τ) is a constant term, and $\delta_y$ and $\delta_x$ are conformably defined vectors of coefficients. In these models, diffusion indexes (i.e., $F^x_t$) are estimated using recursive PCA with either fully revised (see Swanson and Xiong (2018a)) or real-time (see Swanson and Xiong (2018b)) macroeconomic datasets.⁴

When constructing DNS type prediction models, diffusion indexes are included by augmenting the models used to predict $\beta^f_{i,t+h}$, for i = 1, 2, 3, 4. Namely, for AR based forecasts, the following prediction models were used:

$$\beta^f_{i,t+h} = \hat c_i + \hat\gamma_{y,i} \beta_{i,t} + \hat\gamma_{x,i}' F^x_t, \quad \text{for } i = 1, 2, 3, 4,$$

where $F^x_t$ includes either 1, 2 or 3 latent factors. All other terms are conformably defined and analogous to our above discussion. We also construct forecasts using the following VAR(1) variant of this model:

$$\beta^f_{t+h} = \hat c + \hat\Gamma_y \beta_t + \hat\Gamma_x F^x_t,$$

where $\beta^f_{t+h} = (\beta^f_{1,t+h}, \beta^f_{2,t+h}, \beta^f_{3,t+h}, \beta^f_{4,t+h})'$, $\hat c$ is a 4 × 1 vector, $\hat\Gamma_y = (\hat\gamma_1, \hat\gamma_2, \hat\gamma_3, \hat\gamma_4)$, $\hat\gamma_j$ is a 4 × 1 vector, for j = 1, 2, 3, 4, and $\hat\Gamma_x$ is a conformably defined matrix of constants.⁵

When comparing the predictive performance of the models detailed below, Swanson and Xiong (2018a,b) report mean square forecast errors (MSFEs), defined as:

$$MSFE = \frac{1}{P} \sum_{t} \left( \hat y_{t+h}(\tau) - y_{t+h}(\tau) \right)^2, \qquad (5.7)$$

where the sum runs over the P forecast origins and $\hat y_{t+h}(\tau)$ is the h-step-ahead forecast of the Treasury bond yield, with maturity τ. Here, P is the number of ex ante predictions. As alluded to above, all model parameters are estimated with maximum likelihood and PCA; and parameters are updated prior to the construction of each forecast using a rolling window of 120 months of historical data. For an analysis of the use of rolling versus recursive and alternative windowing techniques in the context of forecasting, see Clark and McCracken (2009), Hansen and Timmermann (2012), and Rossi and Inoue (2012).
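The rolling-window forecasting and MSFE comparison described above can be sketched for a single yield curve factor. The example below forecasts a simulated beta series with and without one diffusion index as an additional regressor, re-estimating the direct AR regression on the most recent 120 observations before each forecast, and then compares MSFEs. It is a sketch under stated assumptions, not the procedure of Swanson and Xiong (2018a,b): the data generating process, the single index, and the function names are hypothetical, and the index is taken as given rather than re-estimated by PCA in each window.

```python
import numpy as np

def msfe(forecasts, actuals):
    """Mean square forecast error over P ex ante predictions."""
    forecasts = np.asarray(forecasts, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return float(np.mean((forecasts - actuals) ** 2))

def rolling_forecasts(beta, F, window=120, h=1, use_index=True):
    """Rolling-window direct h-step forecasts of a yield curve factor.

    Each forecast regresses beta_{s+h} on (1, beta_s[, F_s]) using only the
    most recent `window` observations, then predicts from the latest values.
    """
    preds, actual = [], []
    for end in range(window, len(beta) - h + 1):
        b = beta[end - window:end]
        cols = [np.ones(window - h), b[:-h]]
        x_new = [1.0, b[-1]]
        if use_index:
            f = F[end - window:end]
            cols.append(f[:-h])
            x_new.append(f[-1])
        coef = np.linalg.lstsq(np.column_stack(cols), b[h:], rcond=None)[0]
        preds.append(float(np.dot(coef, x_new)))
        actual.append(beta[end + h - 1])          # realized value h steps ahead
    return np.array(preds), np.array(actual)

# Simulated factor that truly depends on the lagged index (illustrative only).
rng = np.random.default_rng(8)
T = 300
F = np.zeros(T)
beta = np.zeros(T)
for t in range(1, T):
    F[t] = 0.5 * F[t - 1] + rng.standard_normal()
    beta[t] = 0.5 * beta[t - 1] + 1.0 * F[t - 1] + 0.2 * rng.standard_normal()

pred_hybrid, act = rolling_forecasts(beta, F, use_index=True)
pred_ar, _ = rolling_forecasts(beta, F, use_index=False)
msfe_hybrid = msfe(pred_hybrid, act)
msfe_ar = msfe(pred_ar, act)
```

In this simulated design the index has genuine predictive content, so the augmented model attains the lower MSFE; with real-time data, as the surveyed findings indicate, this ranking need not hold.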
In addition to using un-targeted PCA, Swanson and Xiong (2018b) construct real-time diffusion indexes by implementing machine learning and related techniques to first select a subset of the 130 macroeconomic variables (see discussion below) in their dataset. These "variable subsets" are selected using both the elastic net and the least absolute shrinkage operator (lasso), in which ten-fold cross validation is used, in real-time, to estimate tuning parameters in the operators. Then, the "variable subsets" are passed to PCA in order to construct diffusion indexes (i.e., targeted PCA is carried out).
4 Different lag specifications were examined in the aforementioned papers. Empirical results using only one lag in the above specification were reported on, as one lag yields mean square forecast error "best" models.
5 For DNS models, only the first three diffusion indexes are used in the above AR and VAR forecasting equations (i.e., i = 1, 2, 3), while for DNSS models four diffusion indexes are used in the above AR and VAR forecasting equations (i.e., i = 1, 2, 3, 4).
perform benchmark time series models as well as pure diffusion index models.Moreover, in many cases, augmenting DNS and DNSS models to include diffusion indexes constructed either with PCA or targeted PCA yields even more precise predictions.

Concluding Remarks
In this paper, we survey recent methods used for predicting the term structure of interest rates using dynamic Nelson-Siegel (DNS), dynamic Nelson-Siegel Svensson (DNSS), and various econometric models.
We also survey methods for constructing diffusion indexes using principal component analysis (PCA), as well as various dimension reduction, variable selection, machine learning, and shrinkage methods, all in the context of "mining" big data. We then discuss how diffusion indexes can be used to construct "hybrid" DNS and DNSS prediction models that exhibit good forecasting properties. Finally, we review select recent empirical findings regarding the use of hybrid DNS and DNSS type prediction models that include diffusion indexes constructed using both un-targeted (e.g., PCA) and targeted (e.g., elastic net) methods of variable selection. It is noted that there are many time periods during the last twenty years during which DNS and DNSS models that include diffusion indexes constructed either with PCA or targeted PCA result in predictions of the term structure of interest rates that are more precise, in a mean square forecast error sense, than pure DNS/DNSS type models, pure diffusion index models, and various benchmark linear econometric models. However, in recent years, pure DNS/DNSS models yield more precise predictions of the term structure of interest rates than any of the alternative models that we examine.
The importance of assessing and testing for structural breaks in these models is discussed in Banerjee, Marcellino and Marsten (2008), Stock and Watson (2009), and Chen, Dolado and Gonzalo (2014). Factor loading and parameter stability testing is addressed in Corradi and Swanson (2014), Breitung and Eickmeier (2011), Goncalves and Perron (2014), and Han and Inoue (2014). Finally, the empirical and theoretical properties of factor augmented VARMA models are investigated in Dufour and Stevanovic (2013). Now, consider estimation of the factors appearing in (3.7). Drawing from the discussion in Armah and Swanson (2010a,b) and Swanson and Xiong (2018a), let k (k < min{N, T}) be an arbitrary number of factors, $\Lambda^k$ be the N × k factor loadings matrix, $(\Lambda^k_1, ..., \Lambda^k_N)'$, and $F^k$ be the T × k matrix of factors, $(F^k_1, ..., F^k_T)'$. From (3.8), estimates of $\Lambda^k_i$ and $F^k_t$ are obtained by solving the optimization problem in (3.9).

Swanson and Xiong (2018b) assume that yield curve factors (which are the betas in the above discussion, and are here called $F_{y,t}$) are driven by both past yield curve factors and macro factors (i.e., diffusion indexes), called $F_{x,t}$. Additionally, it is assumed that macroeconomic variables are driven only by $F_{x,t}$. The corresponding model is discussed in Swanson and Xiong (2018b).