Reservoir Computing for Macroeconomic Forecasting with Mixed Frequency Data

Macroeconomic forecasting has recently started embracing techniques that can deal with large-scale datasets and series with unequal release periods. MIxed-DAta Sampling (MIDAS) and Dynamic Factor Models (DFM) are the two main state-of-the-art approaches that allow modeling series with non-homogeneous frequencies. We introduce a new framework, the Multi-Frequency Echo State Network (MFESN), based on a relatively novel machine learning paradigm called reservoir computing. Echo State Networks (ESN) are recurrent neural networks formulated as nonlinear state-space systems with random state coefficients where only the observation map is subject to estimation. MFESNs are considerably faster to estimate than DFMs and can incorporate many series, as opposed to MIDAS models, which are prone to the curse of dimensionality. All methods are compared in extensive multistep forecasting exercises targeting US GDP growth. We find that our MFESN models achieve performance superior or comparable to that of MIDAS and DFMs at a much lower computational cost.


Introduction
The availability of timely and accurate forecasts of key macroeconomic variables is of crucial importance to economic policy makers, businesses, and the banking sector alike. Key macroeconomic variables, such as GDP growth, become available at low frequency with a considerable time lag and are subject to various rounds of revisions after their release. This is particularly problematic in a fast-changing and uncertain economic environment, as experienced during the Great Recession of 2007-2009 (Hindrayanto et al. (2016)) and the recent pandemic (Buell et al. (2021), Huber et al. (2021)). However, a large number of the potentially predictive financial market (and other macroeconomic) indicators are available at a daily or even higher frequency (Andreou et al. (2013)). The desire to utilize such high-frequency data for macroeconomic forecasting has led to the exploration of techniques that can deal with large-scale datasets and series with unequal release periods (see Borio (2011, 2013), Morley (2015); we also refer the reader to Fuleky (2020) for more details regarding high-dimensional data and to Armesto et al. (2010) and Bańbura et al. (2013) for a review on mixed-frequency data).
We contribute to the existing literature by proposing a new macroeconomic forecasting framework that utilizes high-dimensional and mixed-frequency input data, the Multi-Frequency Echo State Network (MFESN). The MFESN originates from a machine learning paradigm called Reservoir Computing (RC). RC is a family of learning models that take advantage of the information processing capabilities of complex dynamical systems (see Maass et al. (2002), Legenstein and Maass (2007), Crutchfield et al. (2010), and Lukoševičius and Jaeger (2009), Tanaka et al. (2019) for reviews). Generally speaking, RC is a versatile class of recurrent neural network (RNN) models (see Salehinejad et al. (2017) for a detailed survey). Although conventional RNNs are well-suited for handling sequence data and dynamic problems, estimating their weights during the training phase is well-known to be inherently difficult (Pascanu et al. (2013), Doya (1992)). Reservoir networks stand out due to the fact that their inner weights can be randomly generated and fixed, and only the output (readout) map is subject to supervised training. The Echo State Network (ESN) is one of the most popular instances of RC models with provable universality and generalization properties (see Grigoryeva and Ortega (2018a,b, 2019), Gonon et al. (2020a, 2022), Gonon and Ortega (2021), and references therein for more details), and excellent performance in forecasting and classification. While RNNs have been adopted for macroeconomic forecasting in a few instances (see, for example, Paranhos (2021)), to the best of our knowledge, we are the first to explore easily-trainable reservoir models in this context. Our main contribution is three-fold. First, inspired by the remarkable empirical success of ESNs in prediction tasks, we propose the so-called Multi-Frequency Echo State Network (MFESN) framework, which allows multistep forecasting of the target variable at frequencies lower than or equal to those of the input series.
Second, we introduce two different approaches to predicting within the MFESN framework, namely the Single-Reservoir MFESN (S-MFESN) and the Multi-Reservoir MFESN (M-MFESN). The S-MFESN is obtained by modifying the ESN architecture to accommodate input and target variables of mixed frequencies. In the M-MFESN, several Echo State Networks are adopted to handle the input time series, each ESN corresponding to a group of input variables quoted at one given frequency. Finally, our third contribution consists of an extensive empirical comparative analysis of the forecasting capability of the proposed approaches in a concrete task of predicting quarterly U.S. output growth. We inspect the forecasting capabilities of the MFESN framework compared to two well-established benchmarks widely used in the macroeconomic literature and among practitioners and show its empirical superiority in several thoroughly conducted forecasting exercises. Moreover, as a by-product, we propose a new data aggregation scheme that bridges these two standard forecasting approaches and that is not available in the literature.
In our empirical study, we evaluate the multistep forecasting performance of the MFESN framework targeting quarterly U.S. output growth (Gross Domestic Product (GDP) growth) and utilizing small- and medium-sized sets of monthly and daily financial and macroeconomic variables. We compare the MFESN approach against two state-of-the-art methods, MIDAS and DFM, known for their ability to incorporate data of heterogeneous frequencies and utilize high-dimensional data inputs. The MIxed DAta Sampling (MIDAS) model developed in Ghysels et al. (2004, 2007) has been adopted widely for macroeconomic forecasting with mixed-frequency data (see for instance Clements and Galvão (2008, 2009), Ghysels and Wright (2009), Francis et al. (2011), Monteforte and Moretti (2012), Galvão and Marcellino (2010), Galvão (2013), Andreou et al. (2013), Ghysels (2016), Jardet and Meunier (2020)). However, MIDAS is prone to curse-of-dimensionality problems and performs poorly when the set of predictors is of even moderate size (Clements and Galvão (2009), Kostrov (2021)) due to optimization-related issues. Recently, some attempts have been made in the literature to overcome these issues by employing variable selection techniques under some additional assumptions. For instance, Babii et al. (2022) propose the MIDAS projection approach, which is more amenable to high-dimensional data environments under the assumption of sparsity. Even with these improvements, practical high-dimensional implementations of MIDAS remain challenging. This is in part caused by the ragged edges of the "raw" macroeconomic data, incomplete observations, and uneven sampling frequencies.
The relative inflexibility of MIDAS regression lag specifications makes integrating daily and weekly data at true calendar frequencies (i.e., without interpolation or aggregation) very complex. State-space models effectively mitigate these issues. A strong state-of-the-art state-space competitor for our MFESN framework is the Dynamic Factor Model (DFM), first introduced in Geweke (1977) and Sargent et al. (1977). DFMs have become the standard workhorse for macroeconomic nowcasting and prediction (for more details, we refer the reader to Stock and Watson (1996), Giannone et al. (2008), Bańbura and Rünstler (2011), Chauvet et al. (2015), Hindrayanto et al. (2016)). Conventional DFMs for data of multiple sampling frequencies are linear state-space models with a latent low-frequency process of interest and high-dimensional input time series. Although their linear structure lends itself to inference with likelihood-based methods and Kalman filtering, using DFMs in the high-dimensional setting is limited by the associated computational effort. For Gaussian state-space models, some of these issues can be mitigated with a more compact matrix representation, as in Delle Monache and Petrella (2019). Still, in the particular settings of nowcasting and forecasting of GDP growth, computational complexity is one of the main reasons why DFMs are rarely used with daily input series (see Bańbura et al. (2013) for a detailed review). We address these numerical difficulties using novel Python libraries for automatic differentiation and using GPUs for parallel computing, which allow the estimation of DFMs even in instances of high-frequency input observations. Further, to adapt the DFM to mixed-frequency tasks, we propose a new DFM aggregation scheme with Almon polynomial structure that bridges MIDAS and the DFM for our forecasting comparison.
To our knowledge, we are the first to present this aggregation scheme which reduces the number of parameters subject to estimation.
In order to carry out a fair comparison of our MFESN framework with the state-of-the-art MIDAS and DFM models, we designed two model evaluation settings that differ regarding whether the financial crisis of 2008-2009 is included in the estimation period or not. In the first forecasting setting, all the competing models are estimated using data from January 1st, 1990, until December 31st, 2007, and their performance in forecasting into and after the financial crisis period is assessed. In the second evaluation setting, fitting is done with data largely encompassing the crisis period, again from January 1st, 1990, but now up to December 31st, 2011. In both cases, the forecasting (testing) period comprises the time period up to the COVID-19 pandemic events, namely the fourth quarter of 2019. Along with the two state-of-the-art DFM and MIDAS models, we use the unconditional mean of the sample as a baseline benchmark against the reservoir models. We find that our ESN-inspired models attain comparable or much better performance than DFMs at a much lower computational cost, even for a relatively long forecasting horizon of four quarters. Additionally, ESNs do not suffer from curse-of-dimensionality problems, which are known to be pervasive for MIDAS models, and hence consistently outperform them in a number of forecasting exercises.
The remainder of the paper is structured as follows: Section 2 introduces the notation and terminology used throughout the paper. Section 3 presents the MFESN framework and discusses its implementations, including model estimation, hyperparameter tuning, penalization, and nonlinear multistep forecast evaluation. In this section, we dedicate our attention to proposing single-reservoir and multi-reservoir MFESN architectures and to spelling out their defining features. Section 4 discusses the specification of the benchmark models and introduces the new Almon polynomial aggregation scheme for the DFMs. Section 5 contains the empirical study: it describes the datasets used for the two model evaluation experiment designs and comparatively assesses one-step and multistep forecasting results. Section 6 concludes with the forecasting comparison and discusses future research avenues and applications.

Notation and Preliminaries
This section introduces the notation used throughout the paper. It also presents what we call temporal notation, which allows us to write multi-frequency models consistently and unambiguously, even when an arbitrary number of sampling frequencies is considered. We also introduce definitions for nowcasting and low-and high-frequency forecasting schemes.

Notation
We use the symbol N (respectively, N + ) to denote the set of natural numbers with the zero element included (respectively, excluded). Z denotes the set of all integers, and Z − (respectively, Z + ) stands for the set of the negative (respectively, positive) integers with the zero element included. We abbreviate the set [n] = {1, . . . , n}, with n ∈ N + .
Vector notation: A column vector is denoted by a bold lower case symbol like r, and r⊤ indicates its transpose. Given a vector v ∈ R n , we denote its entries by v i , with i ∈ {1, . . . , n}; we also write v = (v i ) i∈{1,...,n} . The symbols i n , 0 n ∈ R n stand for the vectors of length n consisting of ones and of zeros, respectively. Additionally, given n ∈ N + , e (i) n ∈ R n , i ∈ {1, . . . , n}, denotes the canonical unit vector of length n determined by e (i) n = (δ ij ) j∈{1,...,n} . Matrix notation: We denote by M n,m the space of real n × m matrices with m, n ∈ N + . When n = m, we use the symbols M n and D n to refer to the spaces of square and diagonal matrices of order n, respectively. Given a matrix A ∈ M n,m , we denote its components by A ij and we write A = (A ij ), with i ∈ {1, . . . , n}, j ∈ {1, . . . , m}. The symbol I n ∈ D n denotes the identity matrix and the symbol O n stands for the zero matrix of dimension n.
Input and target stochastic processes: We fix a probability space (Ω, A, P) on which all random variables are defined. The input and target signals are modeled by discrete-time stochastic processes z = (z t ) t∈Z and y = (y t ) t∈Z taking values in R K and R J , respectively. Moreover, we write z(ω) = (z t (ω)) t∈Z and y(ω) = (y t (ω)) t∈Z for each outcome ω ∈ Ω to denote the realizations or sample paths of z and y, respectively. Since z can be seen as a random sequence in R K , we write interchangeably z : Z × Ω −→ R K and z : Ω −→ (R K ) Z . The same applies to the analogous assignments involving y.
Temporal notation: Let (u t ) t∈I , u t ∈ R, be a (scalar) time series with I some index set (in this paper it will always be discrete). The time series (u t ) t∈I will be denoted just as (u t ) when the index set I is specified by the context. We write u s1:s2 = (u t ) t∈{s1,...,s2} for integers s 1 < s 2 and time series (u t ). To define the concept of the sampling frequency, we must introduce an additional series, call it (z s ) s∈J . The index set J need not be the same as I. We assume that u t is sampled at the coarsest rate; equivalently, it has the lowest sampling frequency, which we call in what follows the reference frequency. In practice, this means that in the same window of time, u t will be observed at most as frequently as z s . The case when the sampling frequency of z s is strictly higher than that of u t is of primary interest.
We assume that all sampling happens at instants which are evenly spaced in time. Series other than the reference one and with higher sampling frequencies are given an additional time index, the tempo index, written t, * |κ , where κ is the frequency multiplier. Our tempo notation assumes that low- and high-frequency series are sampled with temporal alignment: this means that the reference time index t and the tempo index * |κ have the following properties.
Figure 1: Diagram illustrating the fixed forecasting, high-frequency forecasting, and nowcasting schemes in tempo notation. Arrows point to time indices of the forecast target, while dots indicate the high-frequency time placeholder for the constructed high-frequency forecasts.

Definition 2.1 A reference time index t ∈ Z + and a tempo index * |κ for a given high frequency κ ∈ N + are such that the following relations hold:
(i) t, 0|κ ≡ t,
(ii) t, κ|κ ≡ t + 1,
(iii) t, s|κ ≡ t + ⌊s/κ⌋, (s mod κ)|κ for any s ∈ N,
where mod is the modulo operation and, for any x ∈ R, the floor operator ⌊x⌋ outputs the greatest z ∈ Z such that z ≤ x.
We now give an example to clarify the use of the tempo notation.
Example 2.2 (Macroeconomic Mixed Frequency Data) Let (y t ) be the time series of quarterly sampled GDP growth and let (u r ) be the time series of industrial production (IP) available at monthly frequency. Notice that we use two different time indices, t and r. We assume that both series are observed at the end of the relevant time period: GDP is released at the end of each quarter, and IP is released at the end of each month. Additionally, we assume that GDP is observed at the end of the last month of the quarter, which coincides with the release of monthly IP. Formally, we may then write t ≡ r/3 for all t and r, since there are exactly 3 months in each quarter. In our tempo notation, we proceed in a reverse fashion and instead anchor time to the lowest-frequency index. The frequency multiplier is κ = 3, therefore r ≡ t, s|3 with s ∈ {1, 2, 3}. Since in the tempo notation we can exchange "frequency" and "frequency multiplier", we will make no distinction between the two terms in what follows.
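As a sanity check on this convention, the mapping from a monthly index r to the tempo pair of Example 2.2 can be sketched in a few lines of Python; the helper name `to_tempo` is ours, not the paper's.

```python
# Hypothetical helper illustrating the tempo notation of Example 2.2:
# a monthly index r maps to a quarterly reference index t plus a
# within-quarter step s, written (t, s|3) in the paper's notation.

def to_tempo(r: int, kappa: int = 3) -> tuple[int, int]:
    """Map a high-frequency index r >= 1 to a tempo pair (t, s).

    Months 1, 2, 3 belong to quarter t = 0 with steps s = 1, 2, 3,
    so that (t, kappa|kappa) coincides with (t + 1, 0|kappa)."""
    t, s = divmod(r - 1, kappa)
    return t, s + 1

# Months 1..6 of monthly IP against quarterly GDP time:
print([to_tempo(r) for r in range(1, 7)])
# -> [(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3)]
```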

Nowcasting, Forecasting and Multicasting
In order to clarify the design of the forecasting experiments conducted in this paper, we now present the different types of prediction that we shall tackle.
For the sake of simplicity, let t denote time in the reference frequency of (y t ) and suppose that only a single regressor (z r ) of frequency κ is included in the forecasting model along with an autoregressive term. The notation can be readily extended to include multiple regressors. In this section, let h ≥ 0 be a low-frequency prediction horizon for the low-frequency variable of interest: h always stands for the forecasting horizon counted from the last available low-frequency observation of the target series (y t ). Let ℓ ≥ 0 be a high-frequency horizon with respect to frequency κ. We shall consider the following different forecasting exercises, illustrated in Figure 1:
• Fixed forecasting: We call a forecast fixed when predictions for the target variable are constructed only at the end of the low-frequency periods, namely, h ∈ N. We use this terminology because its information set neither changes with different choices of h nor is updated by the information coming from the high-frequency regressors. The information set which is used at the time of h-steps ahead fixed forecasting at t is the σ-algebra defined as F t = σ ( y t , y t−1 , y t−2 , . . . , z t,0|κ , z t,−1|κ , z t,−2|κ , . . . )
(2.1) and the h-steps ahead fixed forecast at t, when using the mean square error as a loss, is hence provided by the conditional expectation E [ y t+h | F t ].
• Nowcasting: We call nowcasting the setup in which one constructs a high-frequency proxy for a yet-unobserved target which will be available at the end of the current low-frequency period. As such, we construct a nowcast only for horizons 0 < ℓ ≤ κ − 1; notice that ℓ = κ yields a contemporaneous regression at t + 1, while ℓ = 0 falls into the category of fixed forecasting considered above, hence both of these two cases are excluded. The σ-algebras that are used in order to construct nowcasts y t+1,ℓ|κ are given by F N t,ℓ = σ ( y t , y t−1 , . . . , z t,ℓ|κ , z t,ℓ−1|κ , . . . ) . The ℓ-steps nowcast for the high-frequency proxy constructed at moment t, ℓ|κ of the current period for the low-frequency variable which becomes available at t + 1, 0|κ ≡ t + 1 is provided by the conditional expectation y t+1,ℓ|κ = E [ y t+1 | F N t,ℓ ].
• High-frequency forecasting: Unlike the fixed forecasting scheme, where one makes forecasts of the target variable in its low (reference) frequency, one may also use high-frequency regressors to produce additional forecasts. For example, in the case of a target released at the end of each year and having monthly quoted covariates, the fixed forecasting scheme would correspond to constructing forecasts for the yearly sampled variable of interest always at the end of the last month of the year (December). At the same time, with all the information collected up to the end of December, there are other possibilities to construct forecasts. In particular, the forecaster could consider placing herself at the end of any other month of the year instead and construct predictions for the monthly proxy of the yearly variable for future years.
It is obvious that in this scheme one often artificially reduces the information set. Although not all the available information is exploited, this procedure has its own benefits: first, it renders high-frequency forecast instances; second, it allows taking into account misspecification due to a seasonal response of (y t ) to (z r ). This is especially important whenever multiple time series with different sampling frequencies are combined in one model and seasonality effects are either difficult to detect or impossible to avoid. In the context of macroeconomic forecasting, we refer the reader to Clements and Galvão (2008, 2009), Chen and Ghysels (2010), and Jardet and Meunier (2020), where these questions are carefully discussed.
• Multicasting: One always aims to construct one-step and multistep forecasts by using all the available information at a given point in time. It is, therefore, natural to compare models by constructing high-frequency nowcasts for the target variable to be released at the end of the current period and its high-frequency proxy forecasts for the next periods. To avoid confusion, we refer to this situation as multicasting. More explicitly, provided that the forecaster finds herself at time index t, s|κ and is interested in all the forecasts up to some maximal low-frequency horizon H ≥ 1, for each 1 ≤ ℓ ≤ Hκ the multicasting scheme yields the following combination:
(a) Nowcasting when 0 < ℓ ≤ s, with information sets F N t,ℓ .
(b) Forecasting when ℓ > s:
(i) Fixed forecasting if ℓ satisfies ℓ mod κ = 0, with information set F t .
(ii) High-frequency forecasting if ℓ mod κ ≠ 0, with information sets F H t,ℓ .
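The case analysis above can be summarized in a short sketch; the function below is our own illustration of the classification rule, with hypothetical names.

```python
# A sketch of the multicasting case analysis: given the within-period step s
# of the forecaster's current tempo position (t, s|kappa) and a high-frequency
# horizon ell, classify which prediction scheme applies.

def multicast_scheme(ell: int, s: int, kappa: int) -> str:
    """Classify horizon ell for a forecaster standing at tempo index (t, s|kappa)."""
    if not 1 <= s <= kappa:
        raise ValueError("s must satisfy 1 <= s <= kappa")
    if 0 < ell <= s:
        return "nowcast"               # target of the current low-frequency period
    if ell % kappa == 0:
        return "fixed forecast"        # horizon lands on a low-frequency release
    return "high-frequency forecast"   # intermediate high-frequency proxy
```

For instance, with monthly regressors (κ = 3) and a forecaster at step s = 2 of the current quarter, horizon ℓ = 2 is a nowcast, ℓ = 3 a fixed forecast, and ℓ = 4 a high-frequency forecast.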

Reservoir Models
In this section we introduce a reservoir computing (RC) model for time series forecasting. We focus on a family of RC systems called Echo State Networks, which have been successfully applied to the forecasting of deterministic dynamical systems (Jaeger and Haas (2004), Pathak et al. (2017, 2018), Wikner et al. (2021), Arcomano et al. (2022)) but have yet to be fully developed for the forecasting of stochastic time series in general, and in the mixed-frequencies context in particular. In the following paragraphs, we explain the linear estimation of ESN model parameters, hyperparameter tuning, the loss penalty selection, and how to carry out nonlinear forecasting. Finally, we propose a multi-frequency ESN modeling strategy, a generalization of the standard Echo State Network setup, that can be implemented with a single unified or with multiple frequency-specific state-space equations.

Reservoir Models
Echo State Networks (ESNs) are nonlinear state-space models belonging to the Reservoir Computing (RC) family, which more broadly can be characterized as recurrent neural network models. The success of RC modeling in a range of scientific data analysis and forecasting applications has been driven by the ease of training and the possibility of implementing RC systems using high-performance dedicated hardware (see Tanaka et al. (2019) for an overview). The following nonlinear state-space system provides a general setup for reservoir computing models:

x t = F (x t−1 , z t ),
y t = h θ (x t ),

for all t ∈ Z, where the state-space map F : R N × R K → R N , N, K ∈ N + , is called in our context the reservoir map, and h θ : R N → R J , J ∈ N + , is called the readout or the observation map. The sequences (z t ) t∈Z , z t ∈ R K , and (y t ) t∈Z , y t ∈ R J , stand for the input and the output (target) of the system, respectively, and (x t ) t∈Z , x t ∈ R N , are the associated reservoir states. In what follows, we shall assume that the readout map h θ (·) is parameterized by θ ∈ Θ and that these parameters need to be estimated using a sample of observations. Moreover, in cases where both inputs and targets are stochastic processes, we consider the following observation equation:

y t = h θ (x t ) + ε t ,

where (ε t ) t∈Z are J-dimensional independent zero-mean innovations with variance σ 2 I J that are also independent of x t across all t. In the particular case of the ESN model, the reservoir map is chosen so that it resembles the neuron dynamics in a recurrent neural network, and the readout map is chosen to be an affine function. The ESN model is hence given by

x t = α x t−1 + (1 − α) σ (A x t−1 + C z t + ζ), (3.4)
y t = W ⊤ x t + a + ε t , (3.5)

where A ∈ M N is the reservoir matrix, C ∈ M N,K is the input matrix, ζ ∈ R N is the input shift, α ∈ [0, 1) is the leak rate, a ∈ R J is the intercept, and W ∈ M N,J are the readout weights. The map σ : R → R is a sigmoid function applied elementwise; we shall generally take σ as tanh.
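As an illustration, the leaky ESN state recursion can be simulated with fixed, randomly drawn reservoir parameters. This is a minimal sketch: the dimensions, leak rate, and spectral-radius scaling below are illustrative choices, not the paper's tuned hyperparameters.

```python
import numpy as np

# Minimal sketch of the ESN state recursion with fixed random reservoir
# parameters A, C, zeta; only the readout W would be estimated.
rng = np.random.default_rng(0)
N, K, T = 100, 3, 200                  # state, input, and sample dimensions
A = rng.normal(size=(N, N))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # scale spectral radius below 1
C = rng.normal(size=(N, K))
zeta = rng.normal(size=N)
alpha = 0.1                            # leak rate

z = rng.normal(size=(T, K))            # stand-in input series
x = np.zeros((T + 1, N))               # x[0] is the initial state
for t in range(T):
    x[t + 1] = alpha * x[t] + (1 - alpha) * np.tanh(A @ x[t] + C @ z[t] + zeta)
```

Because tanh is bounded and the recursion is a convex combination of the previous state and the activation, all states remain in [−1, 1]; the sub-unit spectral radius is a common heuristic aimed at the fading memory property discussed below.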
ESNs have been proven to have universal approximation properties for L p -integrable stochastic processes (Gonon and Ortega (2019)), and estimation and generalization error bounds have been established in Gonon et al. (2020a, 2022). Note that, since the observation equation is linear in x t , the weights W can be estimated via either a least squares estimator or, more commonly, using a ridge regression of y t on x t . To simplify notation, we will suppress the intercept a from the observation equation, since its inclusion in the linear estimation of W is trivial.
The core advantage of an ESN model is that, unlike factor models or recurrent neural networks, the state equation features fixed, randomly sampled parameter matrices A, C, and ζ. Therefore, in ESNs, states are always directly observable given inputs, whereas in a factor model, one must infer latent factors, as well as estimate model parameters such as loading matrices. Also, unlike standard recurrent networks, there is no need to estimate all model parameters jointly via gradient descent algorithms, making model fitting a straightforward linear regression problem.
So far, we have left the nature of ESN inputs (z t ) and targets (y t ) undetermined. Since we are interested in forecasting, we can now be more precise about the timing of targets with respect to the inputs. We write

y t+1 = W ⊤ x t + ε t+1 ,

where y t+1 is the one-step-ahead target observation corresponding to input z t . If y t has J components, but only one series is of interest, the linear regression equation in (3.5) can be partitioned to involve only the jth component y j,t+1 , that is, y j,t+1 = W j ⊤ x t + ε j,t+1 .
Here, W j is the jth column vector of W as defined in (3.5). As we shall see in Section 3.3, it is nonetheless often necessary to consider a multivariate regression equation, especially in multistep forecasting setups.
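The partition amounts to a column selection on the jointly estimated readout; a small sketch with hypothetical stand-in values:

```python
import numpy as np

# Partitioned readout: with W estimated jointly for all J target series,
# the forecast for the j-th series uses only the j-th column of W.
rng = np.random.default_rng(3)
N, J = 8, 4
W = rng.normal(size=(N, J))   # stand-in for an estimated readout matrix
x_t = rng.normal(size=N)      # current reservoir state

y_next = W.T @ x_t            # full J-dimensional one-step forecast
j = 2
y_j_next = W[:, j] @ x_t      # forecast of the j-th component alone
```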

Estimation
Assume that a sample (z t ) 1:T −1 , (y t ) 2:T of inputs and targets is available. Given an initial state x 0 , the states x 1 , . . . , x T −1 can be readily computed according to (3.4), and W in (3.5) can be estimated by ordinary linear regression. Recursive/online versions of the least squares estimator, W RLS , as presented in Young (2012), can be used, although this is a methodology that we do not implement in this paper. Define X = (x 1 , x 2 , . . . , x T −1 ) ∈ M T −1,N and Y = (y 2 , y 3 , . . . , y T ) ∈ M T −1,J . The least squares estimator W LS is given by

W LS = (X ⊤ X) −1 X ⊤ Y.

Even though the result of this estimation problem obviously depends on the initial state condition x 0 , the architecture of the ESN will typically be chosen (see Section 3.2.3) so that the so-called fading memory property holds (Boyd and Chua (1985)), a corollary of which is the state forgetting property (see, for instance, Grigoryeva and Ortega (2019, Theorem 27)), which states that the importance of the choice of x 0 fades away as the time length of the sample increases. Even though W LS is only well-defined when N < T , it is not uncommon for ESN models to require very large state spaces (of the order of 10 3 -10 4 dimensions, see for example Pathak et al. (2017)). This makes it desirable to apply echo state networks with N ≥ T . Given that the OLS estimator has no unique solution in this case, we consider ridge regression for the estimation of W . The ridge estimator is given by

W R λ = (X ⊤ X + λ I N ) −1 X ⊤ Y,

where λ > 0 is the penalty strength. When λ → 0, the estimator W R λ converges to the minimum-norm least squares solution (Ishwaran and Rao (2014)). In applications, ridge regression is the most commonly used estimation method for ESNs, as it provides a straightforward regularization scheme both when N < T and when N ≥ T .
Moreover, a virtue of the ridge regression problem is that its objective function is convex; hence, even in those cases when min{N, T } is large, it can be efficiently solved using stochastic gradient descent.
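For moderate state dimensions, the ridge estimator can also be computed by a single direct linear solve; the sketch below assumes states have already been collected row-wise into X and the corresponding targets into Y.

```python
import numpy as np

# Ridge estimation of the readout W: solve (X'X + lam I) W = X'Y,
# with X in M_{T-1,N} (states as rows) and Y in M_{T-1,J} (targets as rows).

def ridge_readout(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Return the ridge estimator of W in M_{N,J} with penalty lam > 0."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ Y)
```

When N ≥ T the regularized system remains invertible, which is exactly the case the text highlights for very large reservoirs.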
Notice that (3.4)-(3.5) is a nonlinear state-space model in the context of nonparametric estimation. Hence it is clear that a natural trade-off between bias and variance exists due to the choice of N . Since ESNs have a natural connection to random-weights neural networks (Cao et al. (2018)) and random projection regression (Maillard and Munos (2012)), one may consider comparing them to nonparametric sieve methods in an asymptotic framework. If the data were independently sampled, classical results on consistent linear sieve estimation would require that N 2 /T = o(1) (see for example Chen (2007)). Belloni et al. (2015) have proven that under mild conditions one may weaken these rates to N/T = o(1) up to log factors, and Chen and Christensen (2015) have generalized this result by showing that for dependent data N (log(N )) 2 /T = o(1) is sufficient. For single-layer neural networks specifically, Chen and White (1999) demonstrated a rate of N 2d log(N )/T = o(1) with d > 1 for φ-mixing and β-mixing data generating processes. Though there are only limited results on the statistical properties of reservoir models (see, for instance, Grigoryeva and Ortega (2018b), Gonon et al. (2022)), sieve rates seem to suggest that choosing N = O(T ) in echo state networks could lead to nontrivial forecasting bias owing to poor approximation properties. This comparison, however, is only qualitative, as it does not acknowledge the nonlinear state-space structure of ESNs.
The well-known bias-variance trade-off for out-of-sample generalization error also points to poor forecasting performance when a model is at the interpolation threshold. While ridge regression is not necessarily a solution to estimator inconsistency, it is commonly applied to address generalization concerns in statistical learning (see Hastie et al. (2009)). Recently, the link between regularization and generalization has been studied more closely, and Hastie et al. (2022) have shown that "ridgeless" (i.e., interpolating) solutions can be optimal in some scenarios. However, in our empirical evaluations in Section 5, cross-validation (CV) consistently selects non-negligible ridge penalties even for large models, implying that ridge penalization indeed plays an important role in ESN forecasting performance, at least in applications involving macroeconomic-financial data with multiple sampling frequencies. As statistical inference issues are beyond the scope of this work, we consider the use of the ridge estimator W R λ as a means to control the generalization error, in line with the ESN literature (we refer the reader to Hart et al. (2021), where some of these questions are considered for the case of dynamical systems).

Fixed, Expanding and Rolling Window Estimation
Model parameter stability is an important and well-studied question in linear time series analysis. Indeed, the identification and modeling of structural breaks play a key role in macroeconomic modeling. To account for this possibility, we compare multiple estimation setups which may account for possible changes in model parameters.
Suppose again that a sample Y = (y 2 , y 3 , . . . , y T ) ∈ M T −1,J of targets is available, an initial state x 0 is given, and regressors Z = (z 1 , z 2 , . . . , z T −1 ) ∈ M T −1,K are observed. Additionally, the researcher has available an out-of-sample dataset, Y † = (y T +1 , y T +2 , . . . , y T +S ) ∈ M S,J , Z † = (z T , z T +1 , . . . , z T +S−1 ) ∈ M S,K for S ≥ 1. We define the following estimation setups: (i) Fixed parameters: An estimator W is computed strictly over sample observations Y and Z; tuning parameters are also chosen with data available up to time T . Model parameters are kept fixed as the estimated model is applied to construct forecasts y T +1 , y T +2 , . . . , y T +S as out-of-sample regressors z T , z T +1 , . . . , z T +S−1 are added to the information set.
(ii) Expanding window: Parameters are re-estimated as new observations become available, with W EW s computed over the expanding samples Y EW s := (y 2 , y 3 , . . . , y T +s ) and Z EW s := (z 1 , z 2 , . . . , z T +s−1 ) for s = 1, . . . , S, so that each newly released observation is added to the estimation sample.
(iii) Rolling window: Unlike in the above expanding window case, in this setup the within-window sample size is kept fixed across windows -that is, the sample window "rolls" over the data -by defining W RW s as the estimate over Y RW s := (y 2+s , y 3+s , . . . , y T +s−1 , y T +s ) and Z RW s := (z 1+s , z 2+s , . . . , z T +s−2 , z T +s−1 ) for s = 1, . . . , S. Parameters are re-estimated over windows The fixed-parameter setup is the most rigid one. It builds upon the idea that the initial sample contains sufficient information for correct model estimation and forecasting and that the model parameters are constant. Its theoretical analysis is hence relatively easy since there is no need to discuss the stability of the penalty and the hyperparameters across sample windows. An expanding window setup is based on the belief that newly available data contains key information to produce forecasts and, therefore, must be continuously incorporated. In essence, forecasters do this when they re-estimate a model at each data release cycle. In the case of a rolling window estimation strategy, one can theoretically handle model changes. Although proper structural break modeling would require a consistent identification of breakpoints, rolling window estimation can potentially accommodate slow drifts in model parameters over time by directly discarding old data, unlike with an expanding window. We do not explore the selection of an optimal window size, which in rolling window estimation has been shown to improve forecasting performance (Inoue et al. (2017)).
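The three setups can be sketched with ridge readouts as follows; this is a minimal numpy illustration, and the names `window_estimates` and `scheme` are ours, not the paper's:

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator W = (X'X + lam I)^{-1} X'Y."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ Y)

def window_estimates(X, Y, X_oos, Y_oos, lam, scheme="fixed"):
    """Return the readout estimate used at each out-of-sample step s = 1..S.

    scheme: 'fixed'     -- estimate once on (X, Y), never update;
            'expanding' -- refit on all observations up to T + s;
            'rolling'   -- refit on the most recent T observations.
    """
    T, S = len(Y), len(Y_oos)
    Xa, Ya = np.vstack([X, X_oos]), np.vstack([Y, Y_oos])
    Ws = []
    for s in range(1, S + 1):
        if scheme == "fixed":
            Ws.append(ridge(X, Y, lam))
        elif scheme == "expanding":
            Ws.append(ridge(Xa[: T + s], Ya[: T + s], lam))
        elif scheme == "rolling":
            Ws.append(ridge(Xa[s : T + s], Ya[s : T + s], lam))
    return Ws
```

Under the fixed scheme all S estimates coincide, while the rolling scheme always fits on exactly T observations.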

Penalty Selection
Penalty selection via cross-validation with time series data has already been studied, and its validity has been established in Bergmeir et al. (2018). In our empirical study, we use a simple out-of-sample cross-validation (CV) strategy with ten folds over the training split of the data. The construction of the folds is sequential and properly accounts for the time dependence of the data. Because ESN models involve relatively high-dimensional state spaces, we want to split the data in a parsimonious way. To do this, we select a testing window of five target observations, and all preceding observations (at any frequency) are used to fit the model. Splitting proceeds from the end of the sample backward, and we always use one-step-ahead target observations to construct the loss. In expanding or rolling window setups, we further re-validate the penalty via out-of-sample CV as the sample grows: this is important to ensure that the estimated ESN weights do not induce over-smoothing but rather reflect the information contained in the window. For additional details, we refer the reader to Appendix 7.1.
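The fold construction just described can be sketched as follows; `backward_cv_folds` and `cv_ridge_penalty` are illustrative names, and the fold geometry (ten folds, test windows of five, fit on everything preceding) follows the text:

```python
import numpy as np

def backward_cv_folds(T, n_folds=10, test_len=5):
    """Sequential out-of-sample CV folds, built from the end of the sample
    backward: each fold tests on `test_len` consecutive targets and fits on
    every observation that precedes the test window."""
    folds = []
    for k in range(n_folds):
        test_end = T - k * test_len
        test_start = test_end - test_len
        if test_start <= 0:
            break
        folds.append((np.arange(test_start), np.arange(test_start, test_end)))
    return folds

def cv_ridge_penalty(X, Y, lambdas, n_folds=10, test_len=5):
    """Pick the ridge penalty minimizing one-step out-of-sample squared loss."""
    T = len(Y)
    losses = []
    for lam in lambdas:
        loss = 0.0
        for fit_idx, test_idx in backward_cv_folds(T, n_folds, test_len):
            Xf, Yf = X[fit_idx], Y[fit_idx]
            W = np.linalg.solve(Xf.T @ Xf + lam * np.eye(X.shape[1]), Xf.T @ Yf)
            loss += np.sum((Y[test_idx] - X[test_idx] @ W) ** 2)
        losses.append(loss)
    return lambdas[int(np.argmin(losses))]
```

In the expanding and rolling setups this selection would simply be re-run on each window.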

Hyperparameter Tuning
A fundamental result in the ESN literature is that the structure of the matrices (A, C, ζ) determines the dynamics of the states (x_t), and much work has been put into determining optimal specifications (see for example Rodan and Tino (2011), Goudarzi et al. (2016), Farkas et al. (2016), Grigoryeva et al. (2015, 2016), Gonon et al. (2020b)). In what follows, we call the tuple (A, C, ζ) the parameters of the ESN and use it to define what we call the hyperparameters of the echo state model equations. Indeed, rewrite the state equation (3.4) as

x_t = α x_{t−1} + (1 − α) σ( ρ Ā x_{t−1} + γ C̄ z_t + ω ζ̄ ),

where Ā = A/|λ_1(A)|, with λ_1(A) denoting the eigenvalue of A of largest modulus, and where C̄ = C/‖C‖ and ζ̄ = ζ/‖ζ‖ are the normalized input matrix and normalized input shift, respectively. The choice of the norm ‖·‖ is inconsequential, so we will be using the 2-norm in our empirical study. The hyperparameter ρ ∈ R_+ is called the spectral radius of the reservoir; γ ∈ R_+ is the input scaling, ω ∈ R_+ is the shift scaling, and α ∈ [0, 1) is the leak rate. The choice of the hyperparameters defines the properties of the state map. In particular, it is standard to ensure the existence and the uniqueness of solutions of the state equation for semi-infinite inputs via the so-called Echo State Property (see Grigoryeva and Ortega (2018a,b, 2019) for the details), as well as the continuity of the resulting input/output map in order to obtain the so-called Fading Memory Property. We consider α as a hyperparameter since it is often tuned for each specific application of the ESN model. Notice that if ρ = 0 and α = 0 the state equation reduces to x_t = σ(γ C̄ z_t + ω ζ̄): there are no autoregressive dynamics, and the ESN turns into a nonlinear regression model with random coefficients (in other words, a feedforward neural network with random weights, usually referred to as an Extreme Learning Machine; see Cao et al. (2018) for an extensive review).
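The normalization and the hyperparameterized state recursion above can be sketched as follows; `make_reservoir` and `esn_states` are illustrative names, and the leaky form of the recursion is an assumption consistent with the reduction at ρ = α = 0:

```python
import numpy as np

def make_reservoir(N, K, seed=0):
    """Draw random reservoir matrices and normalize them as in the text:
    A by its spectral radius, C and the shift zeta by their 2-norms."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(N, N))
    A /= np.max(np.abs(np.linalg.eigvals(A)))      # |lambda_1(A_bar)| = 1
    C = rng.normal(size=(N, K))
    C /= np.linalg.norm(C)
    zeta = rng.normal(size=N)
    zeta /= np.linalg.norm(zeta)
    return A, C, zeta

def esn_states(Z, x0, A, C, zeta, alpha, rho, gamma, omega):
    """Run the leaky ESN state recursion
    x_t = alpha*x_{t-1} + (1-alpha)*tanh(rho*A x_{t-1} + gamma*C z_t + omega*zeta)."""
    X, x = [], x0
    for z in Z:
        x = alpha * x + (1 - alpha) * np.tanh(rho * A @ x + gamma * C @ z + omega * zeta)
        X.append(x)
    return np.array(X)
```

Setting ρ = α = 0 in `esn_states` reproduces the Extreme Learning Machine degenerate case mentioned above.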
We now propose a general scheme that allows us to jointly select the hyperparameters ϕ := (α, ρ, γ, ω) for a model of the form (3.4)-(3.5). Our approach builds on the idea of leave-one-out cross-validation for time series models. Using a fixed, expanding, or rolling window over the training data, we can always compute the one-step forecasting errors committed by the ESN, given fixed normalized model matrices (Ā, C̄, ζ̄) and a hyperparameter vector ϕ. By choosing an appropriate loss function ℓ : R^J × R^J → R_+ we can thus compute the empirical ESN forecasting error

L_T(ϕ) = Σ_{t=T_0}^{T−1} ℓ( y_{t+1}, W_t(ϕ)^⊤ x_t ),

where W_t(ϕ) is the readout weights estimator involving data available up to time t and 1 < T_0 < T − 1 is the minimum number of observations used for fitting. Notice that if ℓ(y_{t+1}, W_t(ϕ)^⊤ x_t) = ‖y_{t+1} − W_t(ϕ)^⊤ x_t‖²_2, then L_T(ϕ) is simply the squared loss that is minimized in training (modulo a ridge penalty term). Here, however, the interest lies not in finding the matrix W that minimizes L_T, but rather the optimal hyperparameter vector

ϕ* ∈ arg min_ϕ L_T(ϕ). (3.8)

We highlight that to tune ϕ one may choose a norm, metric, or pseudo-metric that is different from the one used in the estimation of the readout map W. We present the entire hyperparameter optimization routine in Algorithm 1. Note that step (i) might entail re-normalizing inputs and targets at each window t. This setup is general and allows applying any global optimization routine to minimize L_T(ϕ). We construct the loss L_T(ϕ_j) sequentially, that is, by summing squared residuals of the model estimated in step (i) of Algorithm 1 when ℓ is a quadratic loss. One can program L_T(ϕ_j) via TensorFlow so that the gradient can be evaluated by backpropagation in Algorithm 1 (iii). Since there is no guarantee that the objective function is convex or even everywhere smooth, we suggest applying optimizers known to explore the search space efficiently.
We emphasize at this point that the lack of convexity guarantees for the loss functions is much more consequential for the other models that we present later on and that we shall be using as benchmarks, in particular for the MIDAS model. Indeed, as we discuss in Appendix 7.5.1, even though explicit gradient calculations might help the optimization of MIDAS models (see Kostrov (2021) for a detailed discussion), multiple suboptimal minima for MIDAS cannot be ruled out.
The parameter space is bounded and, after normalizing both the data and the state-space matrices, we suggest that the upper bounds for ρ, γ, and ω can safely be set to be less than or equal to 10.
One issue with the state formulation in (3.4), and thus with the hyperparameter optimization routine just described, is that ϕ = (α, ρ, γ, ω) cannot always be point identified. For example, if we simplify the state equation to be linear and let α = ω = 0, it is obvious that the ESN model is system isomorphic (in the sense of Grigoryeva and Ortega (2021)) to one in which the input scaling is absorbed by the readout, so that γ is not separately identified. This issue also arises in nonlinear models, for example when σ ≡ tanh and γ is sufficiently small. Parameter identification in nonlinear models has been extensively studied in semi- and nonparametric cross-sectional regressions. For instance, it is known that in certain setups point identification requires a proper normalization to be imposed. The interested reader can refer to Section 6.3 of Horowitz (2009) for a discussion in a similar vein regarding nonparametric transformation models. Since we often set ω = 0, hyperparameter identification can be a significant issue when attempting model tuning. When ω = 0, a simple but valuable reparametrization is given by

x_t = α x_{t−1} + (1 − α) σ( ψ Ā x_{t−1} + C̄ z_t ), where ψ = ρ/γ.

This prescription allows decoupling ρ and γ at the cost of a constant input scaling, which may be undesirable whenever one wants to attenuate the nonlinearity induced by the sigmoid map without also reducing the spectral radius. It is immediate to modify the optimization scheme to deal with the case ϕ = (α, ψ). In the sequel, we assume that the ESN models are estimated using the approaches proposed in this subsection and use the conventional ESN specification as in (3.4)-(3.5) to discuss the forecasting strategy.

Algorithm 1: Hyperparameter tuning
Data: Sample y_{2:T} = {y_2, y_3, …, y_T}, z_{1:T−1} = {z_1, z_2, …, z_{T−1}}, initial state x_0, initial guess ϕ_0, convergence criterion Crit, maximal algorithm iterations MaxIter. If ridge regression is used to estimate W, fixed regularization strength λ > 0.
Result: ϕ*
Fix T_0 and determine the model fit windows for t = T_0, …, T − 1. Choose whether the ESN model is estimated with a fixed or rolling window. Set j ← 0. While Crit holds and j < MaxIter:
(i) Estimate the readout weights W_t(ϕ_j) over each fit window (possibly W_t(ϕ_j) does not depend on t, e.g. in the fixed estimation setup);
(ii) Compute L_T(ϕ_j), the total one-step-ahead forecasting loss;
(iii) Update ϕ_{j+1} ← ϕ_j with an appropriate rule (for example, gradient descent of L_T in the direction of ϕ_j; in our applications, we use variants of L-BFGS-B and pattern search);
(iv) j ← j + 1, update Crit.
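The tuning routine can be sketched end to end as follows; this is a compact self-contained version under simplifying assumptions (expanding fit windows, quadratic loss, fixed ridge penalty), with `cv_loss` and `tune_hyperparameters` as our own illustrative names:

```python
import numpy as np
from scipy.optimize import minimize

def esn_states(Z, x0, A, C, zeta, alpha, rho, gamma, omega):
    """Leaky ESN state recursion with normalized matrices and
    hyperparameters phi = (alpha, rho, gamma, omega)."""
    X, x = [], x0
    for z in Z:
        x = alpha * x + (1 - alpha) * np.tanh(rho * A @ x + gamma * C @ z + omega * zeta)
        X.append(x)
    return np.array(X)

def cv_loss(phi, Z, Y, x0, A, C, zeta, T0, lam=1e-2):
    """L_T(phi): for each t >= T0, estimate the ridge readout on data up to t
    and accumulate the squared one-step-ahead forecast error for y_{t+1}."""
    alpha, rho, gamma, omega = phi
    X = esn_states(Z, x0, A, C, zeta, alpha, rho, gamma, omega)
    loss = 0.0
    for t in range(T0, len(Y) - 1):
        Xf, Yf = X[:t], Y[1:t + 1]          # regress y_{s+1} on x_s, s < t
        W = np.linalg.solve(Xf.T @ Xf + lam * np.eye(Xf.shape[1]), Xf.T @ Yf)
        loss += np.sum((Y[t + 1] - X[t] @ W) ** 2)
    return loss

def tune_hyperparameters(Z, Y, x0, A, C, zeta, T0, phi0=(0.1, 0.9, 1.0, 0.1)):
    """Minimize L_T over phi = (alpha, rho, gamma, omega) with L-BFGS-B,
    using the bounded search region suggested in the text."""
    bounds = [(0.0, 0.99), (0.0, 10.0), (0.0, 10.0), (0.0, 10.0)]
    res = minimize(cv_loss, np.array(phi0), args=(Z, Y, x0, A, C, zeta, T0),
                   method="L-BFGS-B", bounds=bounds)
    return res.x, res.fun
```

A global or pattern-search optimizer can be swapped in for `minimize` when the loss surface is suspected to be rugged.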

ESN Forecasting
We are interested in using ESN models to construct conditional forecasts of target variables. Given that the conditional mean is the best mean square error estimator of the h-step-ahead target y_{t+h}, h ≥ 1, our main focus is approximating

y_{t+h|t} := E[ y_{t+h} | x_{0:t}, z_{0:t} ]. (3.9)

The case h = 1 is trivial, because the ESN model is estimated by regressing y_{t+1} on the state x_t, and thus we can set y_{t+1|t} = W_t(ϕ*)^⊤ x_t. However, when h > 1 the nonlinear state dynamics precludes a direct computation of the conditional mean. This is in contrast to linear models like VARMAs or DFMs, where the assumption of linearity implies that conditional expectations reduce to simple matrix-vector operations. In particular, linear models are such that the variance (and any other higher-order moments) of the noise term does not impact the conditional mean forecast. Let p_θ(x_t | x_{t−1}, z_t) and g_θ(y_{t+1} | x_t) be the state transition and observation densities, respectively. Then, for h > 1,

y_{t+h|t} = ∫ y_{t+h} g_θ(y_{t+h} | x_{t+h−1}) ∏_{j=1}^{h−1} p_θ(x_{t+j} | x_{t+j−1}, z_{t+j}) ν(z_{t+j} | x_{t+j−1}) dy_{t+h} dx_{t+1:t+h−1} dz_{t+1:t+h−1}, (3.10)

where ν(z_{t+j} | x_{t+j−1}) is the conditional density of inputs. Here, we introduce the additional assumption that x_{t+j−1} is sufficient to condition on past states and inputs, that is, ν(z_{t+j} | x_{0:t+j−1}, z_{0:t+j−1}) = ν(z_{t+j} | x_{t+j−1}). Some elements in the expectation integral are not directly available. Specifically, while an ESN explicitly models both p_θ(x_t | x_{t−1}, z_t) and g_θ(y_{t+1} | x_t), the density ν(z_{t+j} | x_{t+j−1}) is unavailable.
In the remaining part of this subsection, we present two approaches to forecasting that are available for the target variable. In our empirical study, we exclusively work with the first one, which is more suitable for forecasting in cases in which the input and target variables are sampled at different frequencies, and shows superior performance compared to the other standard benchmarks, as we demonstrate later in Section 5.
Forecasting of target time series via iterative forecasting of inputs. We are interested in constructing forecasts of target variables which are, in general, not the same as the model inputs. We propose a method that resolves the issue of the intractability of (3.10) and that capitalizes on the available results that use ESNs for the forecasting of dynamical systems. More explicitly, we introduce an equation in addition to the ESN specification that allows us to sidestep modeling the density ν directly, thus making the computation of y_{t+h|t} feasible even when h > 1. Consider the model

x_t = α x_{t−1} + (1 − α) σ( A x_{t−1} + C z_t + ζ ),
z_{t+1} = W̄^⊤ x_t + u_{t+1}, (3.11)

where we use the symbol W̄ for the output matrix in order to separate this case from the general ESN equations, and where (u_t)_{t∈Z} are K-dimensional independent zero-mean innovations with variance σ²_u I_K that are independent of x_t across all t.
In this case, the reservoir map F(x_{t−1}, z_t) in (3.1) is determined by (3.11). This setup allows for re-feeding the forecasted variables back into the state equation as inputs, providing the following state recursion:

x_t = F_θ( x_{t−1}, W̄^⊤ x_{t−1} ) =: G_θ(x_{t−1}),

where the subscript θ denotes the dependence on the model coefficients. In the reservoir computing literature, regimes where the ESN state equation is iteratively fed with the model outputs are called "autonomous" (Gonon et al., 2020b). They are widely and successfully utilized for the prediction of deterministic dynamical systems. Indeed, in those instances, provided that the estimate of W̄ is available from data, the h > 1 steps autonomous state iteration is given by

x_{t+h} = G_θ^h(x_t) = (G_θ ∘ ⋯ ∘ G_θ)(x_t), (3.14)

with h compositions of G_θ. Hence one can directly obtain the h-steps-ahead predictions of the input time series as z_{t+h} = W̄^⊤ x_{t+h−1}.
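The autonomous iteration and the target readout can be sketched as follows; `autonomous_forecast` is an illustrative name, the hyperparameters are taken as absorbed into (A, C, ζ), and the h − 1 compositions follow the forecast construction used later in the text:

```python
import numpy as np

def autonomous_forecast(x_t, A, C, zeta, alpha, W_bar, W, h):
    """h-step-ahead target forecast via the autonomous map
    G(x) = alpha*x + (1-alpha)*tanh(A x + C (W_bar' x) + zeta):
    compose G exactly h-1 times, then apply the target readout W."""
    x = x_t.copy()
    for _ in range(h - 1):
        z_next = W_bar.T @ x                  # re-feed the predicted input
        x = alpha * x + (1 - alpha) * np.tanh(A @ x + C @ z_next + zeta)
    return W.T @ x                            # y_{t+h|t} approx. W' G^{h-1}(x_t)
```

For h = 1 no autonomous step is taken and the forecast is the plain readout of the current state.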
In the case of stochastic target variables, we notice that for the conditional forecast of the states it holds that

E[ x_{t+h} | x_t ] = ∫ x_{t+h} φ(x_{t+h} | x_t) dx_{t+h}, (3.15)

where the h-step state transition density φ is unavailable. Additionally, even under Gaussianity assumptions that are standard in the filtering literature, namely u_t ∼ N(0, Σ_u), the presence of the nonlinear map G_θ makes the computation of the forecasts of z_{t+h} a non-straightforward exercise. Nevertheless, this forecast construction can be readily used when one is interested exclusively in predicting the time series z_t. Whenever the final goal of the exercise is forecasting some other explained variable y_{t+h} h-steps ahead, additional issues arise. In this case, one needs to compute the conditional expectation in (3.10), which is intractable even under Gaussian assumptions on the innovations. One option is to apply particle filtering techniques such as bootstrap sampling or sequential importance sampling (SIS) to evaluate the expectation (Doucet et al., 2001). We emphasize that the state-space dimension is usually chosen to be large, and hence implementing standard filtering or more sophisticated techniques requires some care.
Our approach is to avoid dealing with the nonlinear densities involved in (3.10) with the help of (3.15) and, instead, to reduce the computation of the conditional expectation y_{t+h|t} to a composition of functions. By the linearity of the observation equation (3.5) and the assumed independence of the zero-mean noise ε_{t+h}, we write

y_{t+h|t} = W^⊤ E[ x_{t+h−1} | x_t ], (3.16)

and use the approximation

E[ x_{t+h−1} | x_t ] ≈ G_θ^{h−1}(x_t), (3.17)

where u_t is assumed to be zero-mean. The validity of (3.17) itself requires implicit assumptions on the nature of the distribution of u_t, but here we want to keep the analysis of y_{t+h|t} to a minimum and simply use the insights from the dynamical-systems ESN literature. We hence do not delve deeper into alternative approaches to estimating forecasts or, more generally, to computing conditional expectations of ESN models with stochastic inputs.
Direct forecasting of target time series. Alternatively to forecasting target variables with the iteratively predicted input time series, one could estimate y_{t+h|t} based exclusively on linear projection arguments. Indeed, it is possible to estimate the h-step-ahead regression

y_{t+h} = W_h^⊤ x_t + B_h^⊤ v_{t+h:t} + e_{t+h}, h ≥ 1, (3.18)

where v_{t+h:t} is a vector of additional control variables (regressors) and e_{t+h} is a noise vector independent of the states and of v_{t+h:t}. In general, v_{t+h:t} can consist, for example, of specific lags of (a subset of) regressors, or even other series observed at the same frequency as the target. For clarity, we refer to this approach as direct (one- or multistep) forecasting. The regression coefficients W_h and B_h can be estimated via least squares or ridge regression in a similar fashion as the usual ESN readout weights. In the simple case where v_{t+h:t} is zero, we get the forecast

y_{t+h|t} = W_h^⊤ x_t.

The validity of this direct forecast approximation builds upon an assumption of linearity of E_θ[ y_{t+h} | x_t ], and it is more restrictive than the iterative one in (3.16): we emphasize that (3.17) produces a local approximation at each step 1 ≤ j ≤ h, while all y_{t+h|t} in (3.18) are derived by linearizing at x_t. The literature on various models with the underlying linearity assumption is rich. Although one might compare the performance of the direct multistep forecasting ESN strategy to a plethora of existing linear models, we do not find it interesting to constrain intrinsically nonlinear ESN models to a linear regime. For instance, one could explore the Local Projections (LP) estimation of Jordà (2005), which builds on similar ideas. However, in this paper, we aim to assess the added value offered by the nonlinear state map. Therefore, in the following sections and in our empirical study, we adopt only the strategy based on the iterative ESN forecasting of the input series.
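The direct approach amounts to one ridge regression per horizon; a minimal sketch (with the controls v omitted and `direct_forecasts` as an illustrative name) is:

```python
import numpy as np

def direct_forecasts(X, Y, x_T, horizons, lam=1e-2):
    """For each horizon h, fit y_{t+h} = W_h' x_t by ridge on the training
    states and forecast from the last available state x_T."""
    preds = {}
    N = X.shape[1]
    for h in horizons:
        Xf, Yf = X[:-h], Y[h:]                  # pair state x_t with target y_{t+h}
        W_h = np.linalg.solve(Xf.T @ Xf + lam * np.eye(N), Xf.T @ Yf)
        preds[h] = W_h.T @ x_T
    return preds
```

Unlike the iterative scheme, each horizon has its own readout W_h and no autonomous state iteration is performed.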

Multi-Frequency Echo State Models
In this subsection, we construct a broad class of ESN models that can accommodate input and target time series sampled at distinct frequencies. We call this family of reservoir models the Multi-Frequency Echo State Networks (MFESNs). We extend the prediction strategy discussed in Section 3.3 to the case of input series with mixed frequencies for each of the presented MFESNs. In particular, we show that whenever the forecasted variable of interest is observed at most as frequently as the other covariate series, the proposed reservoir models allow accurate multistep prediction that is superior to the existing benchmarks. The state-space structure of MFESNs is naturally amenable to multi-frequency settings, which is also a virtue of the (dynamic) factor models popular in the macroeconomics literature. We emphasize that, unlike in other state-space prescriptions, the tuples (A, C, ζ) of matrix parameters of the state maps of ESN-based models are randomly generated and hence do not require estimation. Given that the available data is often scarce while the state-space dimensions need to be sufficiently large to account for the complex dynamics of multiple input time series, the fact that the inner weights of ESNs, as recurrent neural networks, are randomly generated helps combat numerous computational issues. This offers additional advantages of our proposed approach compared to the benchmark competitors, as we see later.
We present two groups of MFESN architectures. The first family is based on a single echo state network architecture, and we call these models Single-Reservoir Multi-Frequency Echo State Networks (S-MFESNs). The second group allows for as many state equations as there are distinct sampling frequencies in the data; we call these models Multi-Reservoir Multi-Frequency Echo State Networks (M-MFESNs). We defer the empirical comparative analysis of these groups of models to Section 5, where we use them in a US GDP multistep forecasting exercise.

Single-Reservoir MFESN
We propose several variants of the single-reservoir ESN architecture applicable in the multi-frequency setting. To that end, we formulate a single state equation with a time index that runs at the highest sampling frequency. We start by recalling that, in the temporal notation introduced in Definition 2.1 of Section 2.1, we let t be the reference frequency, which shall also be the frequency of the target variable. All other frequencies, including the highest among all considered time series, will be measured relative to this reference frequency.
Consider L collections of different time series. We assume that each l-th group, l ∈ [L], consists of n_l time series that are sampled at a common frequency κ_l and contain observations z^{(l)}_{t,s|κ_l} ∈ R^{n_l} for all t ∈ Z and s ∈ {0, …, κ_l − 1}. Let κ_max = max_l κ_l be the highest sampling frequency among the L time series groups and let q_l := κ_max/κ_l indicate how much slower the κ_l sampling frequency is with respect to the highest one. We stack together and repeat the observations in a way that is consistent with the high-frequency index and define

z_{t,s|κ_max} := ( z^{(1)}_{t,⌊s/q_1⌋|κ_1}, …, z^{(L)}_{t,⌊s/q_L⌋|κ_L} ) ∈ R^{n_1+⋯+n_L}.

Now it is possible to write a single high-frequency ESN as

x_{t,s|κ_max} = α x_{t,s−1|κ_max} + (1 − α) σ( A x_{t,s−1|κ_max} + C z_{t,s|κ_max} + ζ ),
z_{t,s+1|κ_max} = W̄^⊤ x_{t,s|κ_max} + u_{t,s+1|κ_max},

where W̄ ∈ M_{N, Σ_{l=1}^L n_l}. We term this class of MFESN models the Single-Reservoir Multi-Frequency ESNs (S-MFESNs).
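The stack-and-repeat alignment of the input groups can be sketched as follows; `stack_high_frequency` is an illustrative name, and the sketch handles a single reference period at a time:

```python
import numpy as np

def stack_high_frequency(series, freqs):
    """Align L series groups observed at frequencies kappa_l within one
    reference period t to the highest frequency kappa_max by repeating each
    slower observation q_l = kappa_max / kappa_l times, then stacking.

    series: list of arrays of shape (kappa_l, n_l) for one reference period.
    Returns an array of shape (kappa_max, sum n_l)."""
    kmax = max(freqs)
    cols = []
    for z, k in zip(series, freqs):
        q = kmax // k                          # frequency ratio q_l
        cols.append(np.repeat(z, q, axis=0))   # hold each z_{t,s|k} for q steps
    return np.hstack(cols)
```

With monthly (κ = 3) and daily (κ = 72) groups, each monthly observation is held fixed for q = 24 high-frequency steps, as in the GDP example below.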
Aligned and stacked S-MFESN. Multiple choices of the observation equation are possible depending on the learning task of interest. Given that we aim at forecasting target variables, the following two architecture choices seem the most natural:
• An aligned S-MFESN assumes that the most recent state (with respect to the reference time index t) is sufficient to model the process dynamics. More precisely, in order to obtain the target y_{t+1} ∈ R^J, the state equation of the S-MFESN is iterated κ_max times until the state x_{t−1,κ_max|κ_max} = x_{t,0|κ_max} is observed and used in the observation equation

y_{t+1} = W^⊤ x_{t,0|κ_max} + ε_{t+1}.

Let W and W̄ be the estimates of the readout matrices based on a sample of length T. Then the high-frequency autonomous state transition map is given by

G(x) := α x + (1 − α) σ( A x + C W̄^⊤ x + ζ ), (3.22)

which, composed with itself exactly κ_max times, yields the target-frequency-aligned autonomous state transition map

F := G^{κ_max} = G ∘ ⋯ ∘ G. (3.23)

Finally, from (3.16) the h-steps-ahead forecasts can be computed as

y_{t+h|t} = W^⊤ F^{h−1}( x_{t,0|κ_max} ). (3.24)

• A stacked S-MFESN extends the aligned S-MFESN by taking into account p ∈ N high-frequency state lags in the observation equation, yielding

y_{t+1} = W^⊤ x^{(p)}_t + ε_{t+1},

where x^{(p)}_t ∈ R^{N(p+1)} stacks x_{t,0|κ_max} and its p preceding high-frequency states, and W ∈ M_{N(p+1),J}. This method offers more flexibility, though the number of parameters that need to be estimated grows linearly with p.
In this paper we are not interested in model selection and hence apply only the alignment approach in our practical exercises.
Example 3.1 (Forecasting with aligned S-MFESN) Suppose we want to use an aligned S-MFESN model to forecast a quarterly one-dimensional target (y_t) using n^(m) monthly and n^(d) daily series, (z^(m)_{t,s|κ_1}) and (z^(d)_{t,s|κ_2}), respectively. We adopt the assumption that daily data is released on 24 days of each calendar month, and hence κ_1 = 3, κ_2 = 72 and κ_max = 72, while q_1 = 24 and q_2 = 1. We want to repeat the most recent monthly observation, as the unified monthly-daily state equation is run at the high frequency.
Let t,s|72 be the temporal index with quarterly reference. The input vector for the S-MFESN state equation is given by

z^{(m,d)}_{t,s|72} := ( z^{(m)}_{t,⌊s/24⌋|3}, z^{(d)}_{t,s|72} ) ∈ R^{n^(m)+n^(d)}.

Let the dimension of the state space be N. Then the complete model is written as

x^{(m,d)}_{t,s|72} = α x^{(m,d)}_{t,s−1|72} + (1 − α) σ( A x^{(m,d)}_{t,s−1|72} + C z^{(m,d)}_{t,s|72} + ζ ), (3.25)
z^{(m,d)}_{t,s+1|72} = W̄^⊤ x^{(m,d)}_{t,s|72} + u_{t,s+1|72}, (3.26)
y_{t+1} = W^⊤ x^{(m,d)}_{t,0|72} + ε_{t+1}, (3.27)

where we notice that (3.25) and (3.26) are run in their own temporal index s, and only when the state x^{(m,d)}_{t−1,κ_max|κ_max} = x^{(m,d)}_{t,0|κ_max} is observed is it used in the observation equation. Here, the coefficient matrices W̄ ∈ M_{N,n^(m)+n^(d)} in (3.26) and W ∈ R^N in (3.27) can be easily estimated via ridge regression. We assume, as usual, that the sample available for estimation ends at the reference time index T. From (3.22) the high-frequency autonomous state transition map is given by

G(x) = α x + (1 − α) σ( A x + C W̄^⊤ x + ζ ),

which, composed with itself exactly 72 times, by (3.23) yields the target-frequency-aligned autonomous state transition map F = G^{72}. Starting from x^{(m,d)}_{t,0|72}, we iterate the S-MFESN forward in time to provide an estimate for x^{(m,d)}_{t+1,0|72}, which can then be linearly projected using W to yield a forecast for y_{t+2}. For the target variable, as well as for forecasts, we do not use our temporal notation for the sake of compactness and clarity of exposition. Finally, the multistep forecasts for any h ≥ 1 can be computed using (3.24) as

y_{t+h|t} = W^⊤ F^{h−1}( x^{(m,d)}_{t,0|72} ).

This example provides an explicit illustration of our forecasting strategy in the case of the application to quarterly GDP forecasting using monthly and daily series.

Multi-Reservoir MFESN
Constructing a MFESN with a single reservoir is not necessarily the most effective modeling strategy. Having more than one reservoir allows a more flexible design of states for different subsets of input variables. For example, suppose quarterly and monthly data are used as regressors. In that case, it could be beneficial to handle series of different frequencies, each with their own reservoir, since economic time series observed with the same frequency have similar time dynamics. Our presentation is general enough to accommodate other types of the partitioning of series into the corresponding reservoir models. We leave it to future research to test other approaches based, for instance, on markets or data types as done in van Huellen et al. (2020).
Assume again L series groups with input observations z^{(l)}_{t,s|κ_l} ∈ R^{n_l} for all t ∈ Z and s ∈ {0, …, κ_l − 1}, sampled at common frequencies {κ_1, …, κ_L}, respectively. For each of the L input series groups we define the corresponding ESN model as

x^{(l)}_{t,s|κ_l} = α_l x^{(l)}_{t,s−1|κ_l} + (1 − α_l) σ( A_l x^{(l)}_{t,s−1|κ_l} + C_l z^{(l)}_{t,s|κ_l} + ζ_l ),
z^{(l)}_{t,s+1|κ_l} = W̄_l^⊤ x^{(l)}_{t,s|κ_l} + u^{(l)}_{t,s+1|κ_l},

with W̄_l ∈ M_{N_l,n_l} and N_l the dimension of the state space. Notice that the time index s is different for each l according to our temporal notation introduced in Definition 2.1, and each state equation runs at its own frequency. The dimensions {N_1, N_2, …, N_L} of the state spaces can be chosen for the L reservoir models individually. Additionally, the multiple reservoirs have associated hyperparameter tuples {ϕ_1, …, ϕ_L} to be tuned. This requires some care whenever one wants to optimize all hyperparameters jointly. Since there are L reservoir state equations, we call this class of MFESN models Multi-Reservoir Multi-Frequency ESNs (M-MFESNs). Similarly to our approach for the S-MFESN, the state equations are iterated κ_l times each until the states x^{(l)}_{t,0|κ_l} are obtained, and we can formulate the aligned M-MFESN observation equation as

y_{t+1} = W^⊤ ( x^{(1)}_{t,0|κ_1}, …, x^{(L)}_{t,0|κ_L} ) + ε_{t+1},

where W ∈ M_{Σ_{l=1}^L N_l, J}. The schematic diagram of the aligned M-MFESN for the case of two frequencies of regressor time series is shown in Figure 2.
Let W and W̄_l, l ∈ [L], be the estimates of the readout matrices based on the T-long sample of observations. Then for any l ∈ [L] the κ_l-frequency autonomous state transition map is given by

G^{(l)}(x) := α_l x + (1 − α_l) σ( A_l x + C_l W̄_l^⊤ x + ζ_l ).

The target-frequency-aligned autonomous state transition map associated to each frequency l is hence defined as

F^{(l)} := (G^{(l)})^{κ_l}. (3.32)

Finally, from (3.16) the h-steps-ahead forecasts can be computed as

y_{t+h|t} = W^⊤ ( (F^{(1)})^{h−1}(x^{(1)}_{t,0|κ_1}), …, (F^{(L)})^{h−1}(x^{(L)}_{t,0|κ_L}) ). (3.33)

Similarly to the stacked S-MFESN architecture, defining the stacked M-MFESN is straightforward, even though the notation is burdensome. We emphasize that for a stacked M-MFESN model it is also necessary to choose L state lags, {p_1, …, p_L}, one for each state equation. The proliferation of model hyperparameters due to the additional lag specification required by a stacked M-MFESN can seriously complicate tuning for this class of models, especially when many distinct frequencies are modeled. In our empirical study, we use only the aligned M-MFESN architecture, even though we see stacked M-MFESNs as an interesting avenue for further research.
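The aligned multi-reservoir forecast (3.33) can be sketched as follows; `aligned_step` and `m_mfesn_forecast` are illustrative names, and for brevity the leak term is omitted from the per-reservoir update:

```python
import numpy as np

def aligned_step(x, A, C, zeta, W_bar, kappa):
    """Target-frequency-aligned autonomous transition F^(l): compose the
    kappa_l-frequency autonomous update (input re-fed as z = W_bar' x)
    exactly kappa_l times. The leak rate is omitted for brevity."""
    for _ in range(kappa):
        x = np.tanh(A @ x + C @ (W_bar.T @ x) + zeta)
    return x

def m_mfesn_forecast(states, reservoirs, kappas, W, h):
    """h-step forecast: advance each reservoir h-1 reference periods with its
    own aligned map, then project the stacked states with the readout W.

    reservoirs: list of tuples (A_l, C_l, zeta_l, W_bar_l)."""
    xs = [x.copy() for x in states]
    for _ in range(h - 1):
        xs = [aligned_step(x, *res, k) for x, res, k in zip(xs, reservoirs, kappas)]
    return W.T @ np.concatenate(xs)
```

Note that one reference-period step costs κ_l inner iterations per reservoir, e.g. 3 for a monthly and 72 for a daily reservoir, mirroring the remark in Example 3.2.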
Example 3.2 (Aligned M-MFESN Forecasting) Similarly to Example 3.1, we aim to forecast a quarterly target with monthly and daily series, but this time we use an M-MFESN model. We have to define two independent state equations, one for the monthly and one for the daily series; in the observation equation, the two states must be aligned temporally and stacked to form the full set of regressors. The data consists again of the quarterly target (y_t), n^(m) monthly series (z^(m)_{t,s|3}) and n^(d) daily series (z^(d)_{t,s|72}). The aligned M-MFESN model with two reservoirs of dimensions N^(m) and N^(d), respectively, is given by

x^{(m)}_{t,s|3} = α_m x^{(m)}_{t,s−1|3} + (1 − α_m) σ( A_m x^{(m)}_{t,s−1|3} + C_m z^{(m)}_{t,s|3} + ζ_m ),
x^{(d)}_{t,s|72} = α_d x^{(d)}_{t,s−1|72} + (1 − α_d) σ( A_d x^{(d)}_{t,s−1|72} + C_d z^{(d)}_{t,s|72} + ζ_d ),
y_{t+1} = W^⊤ ( x^{(m)}_{t,0|3}, x^{(d)}_{t,0|72} ) + ε_{t+1}.

Here, the monthly reservoir (x^{(m)}_{t,s|3}) and the daily reservoir (x^{(d)}_{t,s|72}) have autonomous transition maps G^{(m)} and G^{(d)}, as well as their target-frequency-aligned counterparts, by (3.32) given by F^{(m)} = (G^{(m)})^3 and F^{(d)} = (G^{(d)})^{72}, respectively. The h-step-ahead forecasts can be computed using the approximation in (3.33) as

y_{t+h|t} = W^⊤ ( (F^{(m)})^{h−1}(x^{(m)}_{t,0|3}), (F^{(d)})^{h−1}(x^{(d)}_{t,0|72}) ).

In this case, it is important to note that while both F^{(m)} and F^{(d)} are composed h−1 times at step h, the underlying number of autonomous reservoir iterations is different for the monthly and daily reservoirs, namely 3 and 72, and depends on their own frequencies. This also suggests that one should take into account the different time dynamics when, for example, tuning the M-MFESN hyperparameters ϕ^(m) and ϕ^(d).

Benchmark Models
This section gives an overview of two popular methods that can accommodate time series with different sampling frequencies: the MIDAS (MIxed DAta Sampling) framework and the mixed-frequency dynamic factor model (DFM). A new aggregation scheme with an exponential Almon structure is proposed to bridge MIDAS and DFM. Both models serve as baselines against which we compare our proposed MFESN models in the forecasting exercises.

MIDAS
A state-of-the-art methodology for incorporating data of heterogeneous frequencies into one model is the MIDAS framework developed in Ghysels et al. (2004, 2007). Here we present MIDAS in its dynamic form, which allows the inclusion of autoregressive lags of the target series. We use our temporal notation given in Definition 2.1 throughout. If the MIDAS model contains only one explanatory variable (x_r) with frequency multiplier κ, then it can be written as

y_t = α_0 + Σ_{i=1}^p α_i y_{t−i} + β Σ_{k=0}^K ϕ(θ, k) x_{tκ−k} + ε_t, (4.1)

where α_0 is a constant term, {α_i}_{i=1}^p are the autoregressive parameters, β is a scaling parameter, {ϕ(θ, k)}_{k=0}^K are the MIDAS weights given as a parametric function of the lag k and an underlying parameter vector θ ∈ R^q, and (ε_t) is a martingale difference process relative to the filtration {F_t} generated by the past of the target and of the high-frequency regressor. The MIDAS weighting scheme is the core innovation of the model. It borrows parsimony from distributed lag models in the sense that, even if K is large, the vector θ ∈ R^q is usually restricted to contain only a handful of parameters. This greatly reduces the number of coefficients that need to be estimated, and a nonlinear least-squares estimator of θ can be readily implemented. There are alternative formulations of the MIDAS framework where ϕ(θ, k) = θ_k, so that the above reduces to a fully linear model, the so-called unrestricted MIDAS or U-MIDAS (Foroni and Marcellino, 2011). We follow the literature and use the most commonly applied weighting scheme, based on the exponential Almon weighting polynomial map ϕ : R^q × N_+ → R_+ (see Almon (1965) for more details). In particular, for the case q = 2, the two-parameter Almon weighting polynomial is given by

ϕ(θ, k) = ϕ((θ_1, θ_2), k) = exp(θ_1 k + θ_2 k²), k ∈ N_+. (4.2)

Since Almon weights need not sum up to a given constant for different values of θ_1 and θ_2, it is common to consider the normalized Almon scheme

ϕ̃(θ, k) = exp(θ_1 k + θ_2 k²) / Σ_{j=0}^K exp(θ_1 j + θ_2 j²),

which together with (4.1) allows treating β as a rescaling constant.
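The exponential Almon scheme in (4.2), with the normalization just described, can be sketched as follows (`almon_weights` is our own illustrative name):

```python
import numpy as np

def almon_weights(theta1, theta2, K, normalize=True):
    """Exponential Almon lag weights phi(theta, k) = exp(theta1*k + theta2*k^2)
    for k = 0, ..., K; when normalized, the weights sum to one so that the
    MIDAS coefficient beta acts as a pure rescaling constant."""
    k = np.arange(K + 1, dtype=float)
    w = np.exp(theta1 * k + theta2 * k ** 2)
    return w / w.sum() if normalize else w
```

For θ_2 < 0 the profile decays in the lag k, which is the shape typically used to downweight distant high-frequency observations.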
Let us now consider a more general model suitable for situations where time series of different frequencies are available and must be integrated into the MIDAS equation. Let {x^{(l)}_{r_l}}_{l=1}^L be the set of L regressors with corresponding frequencies {κ_1, κ_2, …, κ_L}, respectively, where κ_l ∈ N_+ for l ∈ [L]. In practice, κ_l frequently takes values from a small set of integers: for example, in the case of yearly, quarterly, and monthly data, κ_l ∈ {1, 4, 12}, even though L could be very large (often, hundreds or thousands of series might be of interest). The MIDAS model explaining the low-frequency target variable y_t with the L regressors {x^{(l)}_{r_l}}_{l=1}^L can be written as

y_t = α_0 + Σ_{i=1}^p α_i y_{t−i} + Σ_{l=1}^L β_l Σ_{k=0}^{K_l} ϕ(θ_l, k) x^{(l)}_{tκ_l−k} + ε_t, (4.4)

where the martingale difference process (ε_t) is relative to the filtration generated by sets as in (4.1), modified to include all L regressors considered. The MIDAS framework produces forecasts of the chosen target variable at the low frequency of the target. Yet, due to the MIDAS multi-frequency structure, nowcasting is also a straightforward exercise: if, for example, the high-frequency regressor is a single series (x_r) with frequency multiplier κ, one can construct exactly κ regression equations, one for each high-frequency release within a low-frequency period, and use these to produce high-frequency nowcasts of the target. In fact, due to the convenience of the MIDAS model, it is easy to define high-frequency regression specifications to study high-frequency forecasts and nowcasts.
In practice, implementing (4.4) demands some care. From a computational point of view, as long as the relevant regression matrices can be constructed, estimation amounts to a nonlinear least-squares problem, which can be readily solved. In Appendix 7.2 and Appendix 7.5.1 we discuss the technical aspects of our MIDAS implementation in more detail. One important issue of MIDAS estimation is the non-convexity of the nonlinear least-squares loss as a function of the parameters. A practitioner may obtain different estimation results depending on initialization and, more importantly, results that lead to a different quality of forecasts. Another crucial disadvantage of the MIDAS specification is that practical implementations can be very challenging, mainly because of the ragged edges of "raw" macroeconomic data, incomplete observations, and uneven sampling frequencies. The relative inflexibility of MIDAS regression lag specifications makes integrating daily and weekly data at true calendar frequencies (i.e. without interpolation or aggregation) very complex. State-space models effectively mitigate these issues. We therefore continue our exposition by presenting a state-space benchmark that is very popular in the macroeconomic forecasting literature, the so-called dynamic factor model (DFM).
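The nonlinear least-squares estimation and the initialization sensitivity just discussed can be sketched as follows; this is a single-regressor, no-AR-lags simplification with illustrative names (`midas_residuals`, `fit_midas`), and multiple random restarts are used as a simple guard against suboptimal local minima:

```python
import numpy as np
from scipy.optimize import least_squares

def midas_residuals(params, y, x_hf, kappa, K):
    """Residuals of the single-regressor MIDAS equation
    y_t = a0 + beta * sum_k phi(theta, k) x_{t*kappa - k} + e_t,
    with normalized exponential Almon weights."""
    a0, beta, t1, t2 = params
    k = np.arange(K + 1, dtype=float)
    w = np.exp(t1 * k + t2 * k ** 2)
    w /= w.sum()
    T = len(y)
    # for each low-frequency t, weight the K+1 most recent high-freq obs
    agg = np.array([w @ x_hf[t * kappa - K : t * kappa + 1][::-1]
                    for t in range(1, T + 1)])
    return y - a0 - beta * agg

def fit_midas(y, x_hf, kappa, K, n_starts=10, seed=0):
    """Nonlinear least squares with random restarts, keeping the best local
    minimum found."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        p0 = np.concatenate([rng.normal(size=2), rng.uniform(-0.5, 0.0, size=2)])
        sol = least_squares(midas_residuals, p0, args=(y, x_hf, kappa, K))
        if best is None or sol.cost < best.cost:
            best = sol
    return best
```

Restarts do not guarantee the global minimum; gradient information, as discussed in Appendix 7.5.1, can further help.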

Mixed-frequency DFM
Macroeconomic modeling based on dynamic factor models (DFMs) has been popular since their introduction in Geweke (1977) and Sargent et al. (1977). The premise of DFMs is that a low-dimensional latent factor (f_t)_{t∈Z}, f_t ∈ R^d, drives a high-dimensional observable stochastic process (y_t)_{t∈Z}, y_t ∈ R^n. We consider a time-inhomogeneous state-space model with dynamics

f_{t+1} | f_{1:t}, y_{1:t} ~ h_{t+1,θ}(· | f_t)   (4.5)
y_{t+1} | f_{1:t+1}, y_{0:t} ~ g_{t+1,θ}(· | f_{t+1})   (4.6)

for some time-dependent state transition kernels h_{t,θ} and observation densities g_{t,θ} and some parameter θ in a parameter space Θ. A common example in the literature (see Watson and Engle (1983) for more details) is linear Gaussian factor models with time-inhomogeneous state transitions, with state transition matrix A_θ ∈ M_d, time-dependent factor loading matrices Λ_t ∈ M_{n,d}, where u_t and w_t are independent Gaussian vectors with zero mean and identity covariance matrices of dimension p and n, respectively, and matrices R_θ and S_{t,θ} of appropriate dimensions. It is often assumed that the dimension p of the state noise vector u_t is smaller than the latent state space dimension d, which implies that R_θ R_θ' is rank deficient, such as for AR(p) factor dynamics (Stock and Watson, 2016; Forni et al., 2005; Doz et al., 2011), so that d = kp for some k ∈ N_+: the factor is then a k-dimensional AR(p) process, and it is commonly assumed that Λ^(j)_{t,θ} = O_{n,k} for j > 1. Let the initial state f_0 be distributed according to ν. The joint density of the latent path f_{0:T} and observations y_{0:T} is then determined by these kernels, while the marginal likelihood of y_{0:T} is p_{θ,ν}(y_{0:T}) = ∫ p_{θ,ν}(f_{0:T}, y_{0:T}) df_{0:T}. Popular procedures for learning the static parameters θ ∈ Θ are based on gradient descent of the negative log-likelihood function ℓ_T : Θ → R, θ ↦ −log p_{θ,ν}(y_{0:T}), or on the Expectation Maximization (EM) algorithm introduced in Dempster et al. (1977).
We consider here gradient descent algorithms³ based on a sequence of step sizes γ_k > 0 that update the model parameters iteratively. Assuming a linear Gaussian setting where the transition density of the latent factor process is given by (4.9), so that (v_t)_{t∈Z}, v_t = (v_{1,t}, . . . , v_{k,t}), is an AR(p) process, there remains some flexibility as to how the linear mappings⁴ Agg_θ are specified. We call this linear mapping Agg_θ an aggregation function and consider specific examples below that yield different models for the observation matrices Λ_{t,θ}.
³ Consistency of the maximum likelihood estimate for the dynamics (4.7)-(4.8) in the time-homogeneous case has been established, for instance, in Douc et al. (2011) under regularity assumptions, including the full rank of the noise covariance matrix S_θ, of the controllability matrix C_θ = [R_θ | A_θ R_θ | · · · | A_θ^{d−1} R_θ], and of the observability matrix. It is also possible to consider an online learning setting using a recursive decomposition of the score function as in LeGland and Mevel (1997). For general latent state dynamics (4.5) and observation densities (4.6) that can be nonlinear with non-Gaussian noise, particle filtering algorithms are often utilized that make use of particle approximations in gradient-descent or EM learning approaches; see for instance Kantas et al. (2015). ⁴ The Markovian representation (4.5)-(4.6), that is, the companion form, is based on the autoregressive order p; however, one can set A ( )

This aggregation scheme is motivated by self-attention models (we refer the reader to Bahdanau et al. (2014) and Vaswani et al. (2017) for more details), but to retain linearity it only considers a relative positional encoding with a Toeplitz structure. Observe that the aggregation parameters are shared across all n dimensions, in contrast to the Almon lag scheme in Example 4.2.
Some authors (see for example Mariano and Murasawa (2003), Bańbura and Modugno (2014)) have imposed different restrictions on the form of the emission matrix or aggregation function, motivated by approximations of growth rates, particularly for one-dimensional mixed-frequency factor models of quarterly GDP growth rates and monthly covariates. We do not pursue this additional restriction in this work.
The static parameters θ can be estimated using gradient ascent of the log-likelihood computed using Kalman filtering. For alternative estimation approaches using EM that could be extended to this setting, we refer the reader to Bańbura and Modugno (2014). Bai et al. (2013) discusses connections between mixed-frequency factor models and MIDAS regression. Nonlinear or non-Gaussian dynamic factor models in a mixed frequency setting have been considered in Gagliardini et al. (2017), Leippold and Yang (2019) that rely on particle filtering methods in conjunction with backward simulation algorithms as in Godsill et al. (2004), while Schorfheide et al. (2018) consider a Bayesian approach using particle MCMC (see Andrieu et al. (2010)). Such approaches can become computationally expensive and are not considered for benchmarking purposes.
While previous mixed-frequency DFMs (see Mariano and Murasawa (2003), Bańbura and Modugno (2014) for a more thorough discussion) often consider time series sampled at two frequencies, we introduce here a flexible mixed-frequency DFM that describes L ∈ N collections of distinct time series sampled at frequencies {κ_1, . . . , κ_L} and consisting of {n_1, . . . , n_L} series, respectively. In the same setting as in Subsection 3.4, each group of n_l, l ∈ [L], time series sampled at frequency κ_l contains observations (y^(l)_{t,s|κ_l}) with y^(l)_{t,s|κ_l} ∈ R^{n_l} for all t ∈ Z and s ∈ {0, . . . , κ_l − 1}. Let κ_max = max_l κ_l. Suppose that the latent factor dynamics are updated at the highest sampling frequency based on the linear transition

f_{t,s+1|κ_max} = A_θ f_{t,s|κ_max} + R_θ u_{t,s+1|κ_max},   (4.10)

where f_{t,s|κ_max} = (v_{t,s|κ_max}, . . . , v_{t,s−p+1|κ_max}), with A_θ given in (4.9) for the special case where A ( ) with parameters ρ ∈ (0, 1), Ā ∈ M_k, and with λ_1(Ā) denoting the largest eigenvalue of Ā. In the simplified scenario of first-order autoregressive dynamics, we parameterize R_θ ∈ M_k to be positive definite and diagonal, and the u_{t,s+1|κ_max} are a sequence of IID k-dimensional standard Gaussian variables.
Notice that the Kalman filtering formulas yield the first moment f̂_{t,s|κ_max} = E[f_{t,s|κ_max} | y_{1,0|κ_max}, . . . , y_{t,s|κ_max}] recursively online; see Appendix 7.3 for details in the general time-inhomogeneous case. Due to the linearity in (4.10), for any h ∈ N, E[f_{t,s+h|κ_max} | y_{1,0|κ_max}, . . . , y_{t,s|κ_max}] = A_θ^h f̂_{t,s|κ_max}. Furthermore, from the linearity of the emission model, we obtain the forecasts, for any s, h ∈ N such that s, h mod κ_l = 0,

E[y^(l)_{t,s+h|κ_max} | y_{1,0|κ_max}, . . . , y_{t,s|κ_max}] = Agg_θ^(l) f̂_{t,s+h|κ_max}.   (4.11)

We observe that there is a single latent factor process that describes the observations at all frequencies, in contrast, for instance, to hierarchical Hidden Markov Models (HMMs) (Hihi and Bengio, 1995), where the latent variables evolve a priori at different time scales, or to some of the ESN models developed in this paper. It is possible to write the above model as a general time-inhomogeneous state-space system (4.5)-(4.6) by suitably parameterizing the time dependencies in the emission matrices. We provide more details on implementing our mixed-frequency DFM in Appendix 7.3. The standard Kalman filtering recursions utilized therein for parameter estimation have, at every time step, a cubic complexity in the dimension d of the Markovian factor process f and in the dimension n of the observation process y. The marginal log-likelihood⁵ is optimized based on stochastic gradient methods with adaptive step sizes (Kingma and Ba, 2014) and is generally not a concave function of the parameter values.
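A minimal sketch of the Kalman recursions referenced above, for a time-homogeneous linear Gaussian system (the time-dependent emission matrices of our mixed-frequency model are omitted for brevity; function and variable names are ours):

```python
import numpy as np

def kalman_filter(y, A, R, Lam, S, f0, P0):
    """Kalman filter for the linear Gaussian state space
        f_{t+1} = A f_t + R u_{t+1},   y_t = Lam f_t + S w_t,
    with u, w standard Gaussian. Returns filtered means E[f_t | y_{1:t}]
    and the marginal log-likelihood; each step costs O(d^3 + n^3)."""
    Q, H = R @ R.T, S @ S.T
    f, P = f0, P0
    loglik, means = 0.0, []
    for yt in y:
        # prediction step
        f_pred, P_pred = A @ f, A @ P @ A.T + Q
        # innovation and its covariance
        v = yt - Lam @ f_pred
        V = Lam @ P_pred @ Lam.T + H
        Vinv = np.linalg.inv(V)
        # update step
        K = P_pred @ Lam.T @ Vinv
        f = f_pred + K @ v
        P = P_pred - K @ Lam @ P_pred
        # Gaussian innovation log-density accumulates the log-likelihood
        _, logdet = np.linalg.slogdet(V)
        loglik -= 0.5 * (len(yt) * np.log(2 * np.pi) + logdet + v @ Vinv @ v)
        means.append(f)
    return np.array(means), loglik
```

The accumulated log-likelihood is exactly the quantity that the stochastic gradient methods mentioned above optimize over θ; the mixed-frequency structure only changes which emission rows are active at each step.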
Example 4.4 (Quarterly-Monthly-Daily DFM Model) We consider n^(d) daily time series over 24 days per calendar month that are averaged over 6 days and denoted y^(d). Furthermore, we consider n^(m) monthly series y^(m) as well as n^(q) quarterly series y^(q). Notice that κ_max = 72/6 = 12. The latent factor process of dimension k is updated every 6 days, in line with the 6-day averages of the daily data, and is assumed to have AR(1) dynamics for any s, t ∈ N, with A^(1) ∈ M_{k,k} and R ∈ M_k. The averaged daily data is described by an emission equation with β^(d) ∈ M_{n^(d),k}, S^(d) ∈ M_{n^(d)}, and IID n^(d)-dimensional standard Gaussian variables; the monthly components involve the Almon weight vector ϕ(ψ^(m), ·) = (ϕ(ψ^(m)_1, ·), . . . , ϕ(ψ^(m)_k, ·)) ∈ R^k, where ⊙ stands for the Hadamard or componentwise matrix product. The quarterly components are described analogously as for a stock aggregation scheme, while the Almon scheme writes with β^(q) ∈ M_{n^(q),k}, S^(q) ∈ M_{n^(q)}, IID n^(q)-dimensional standard Gaussian variables u^(q)_{t,0|12}, and ϕ(ψ^(q), ·) = (ϕ(ψ^(q)_1, ·), . . . , ϕ(ψ^(q)_k, ·)) ∈ R^k.

Empirical Study
In this section, we compare the forecasting performance of our proposed MFESN models to that of MIDAS and DFMs. We use a combination of macroeconomic and financial data sampled at low- and high-frequency intervals, respectively. Our empirical exercises encompass several setups: a small and a medium-sized set of regressors, models fit with data before and after the 2007-08 crisis, and different estimation windows.

Data
Two sets of predictors of different sizes are compiled: 'Small-MD' with 9 predictors and 'Medium-MD' with 33 predictors at monthly and daily frequencies. The reference frequency is quarterly: this is the frequency at which the target variable, US GDP growth, is available. Seasonally adjusted quarterly and monthly data is obtained from the Federal Reserve Bank of St. Louis Monthly (FRED-MD) and Quarterly (FRED-QD) Databases for Macroeconomic Research; see McCracken and Ng (2016, 2020) for details. Daily data is obtained from Refinitiv Datastream, a subscription-based data service. All data is the last revised vintage data. The macroeconomic target and predictors, their transformations, and availability are listed in Table 5.1. The selection of predictors follows the seminal work of Stock and Watson (1996, 2006) and of McCracken and Ng (2016, 2020), in which the FRED-MD and FRED-QD data are proposed. Variations of their dataset have been used profusely in the literature; e.g. see Boivin and Ng (2005), Marcellino et al. (2006), Hatzius et al. (2010). Indicators from ten macroeconomic and financial categories are considered: (1) output and income, (2) labor market, (3) housing, (4) orders and inventories, (5) price indices, (6) money and credit, (7) interest rates, (8) exchange rates, (9) equity, and (10) derivatives. The latter five categories represent financial market conditions and are sourced at daily frequency. The exception is interest rates, which move relatively slowly and enter as monthly aggregates, available in the FRED-MD data. A subset of predictors is selected for the 'Small-MD' dataset by choosing variables that have been identified as leading indicators in the empirical literature; e.g. Ingenito and Trehan (1996). Data availability is an additional criterion, and predictors unavailable before 1990 are not considered.⁶ We follow the instructions of McCracken and Ng (2016, 2020) on pre-processing macroeconomic predictors before they are used as input for forecasting; these are mainly differenced for detrending.
We further transform financial predictors to capture market disequilibrium and volatility. Disequilibrium indicators, such as interest rate spreads, have been found to be more relevant for macroeconomic prediction than routine changes captured by differencing; e.g. see Borio and Lowe (2002), Gramlich et al. (2010), Qin et al. (2022). In addition to disequilibrium indicators, realized stock market volatility has been found to improve macroeconomic predictions; e.g. Chauvet et al. (2015). Since intraday trading data is unavailable to us from the 1990s onward, which prevents the use of conventional daily realized volatility indicators, we extract volatility indicators from daily price series by fitting a simple GARCH(1,1) model (Bollerslev, 1986).⁷ In addition to the volatility of stock and commodity prices, term structure indicators are used. The term structure is forward-looking, capturing information about future demand and supply, and has been found to be a leading predictor of GDP growth; e.g. Hong and Yogo (2012), Kang and Kwon (2020).
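A minimal sketch of the volatility extraction step, assuming a Gaussian GARCH(1,1) with constant mean fitted by quasi-maximum likelihood (the initialization h_0 = var(r), the optimizer settings, and the function names are our assumptions, not the paper's exact implementation):

```python
import numpy as np
from scipy.optimize import minimize

def garch_filter(params, r):
    """Conditional variance recursion h_t = omega + alpha*e_{t-1}^2 + beta*h_{t-1}
    for a constant-mean GARCH(1,1); h_0 is initialized at the sample variance."""
    mu, omega, alpha, beta = params
    e = r - mu
    h = np.empty_like(r)
    h[0] = r.var()
    for t in range(1, len(r)):
        h[t] = omega + alpha * e[t - 1] ** 2 + beta * h[t - 1]
    return e, h

def fit_garch_vol(r):
    """Fit by Gaussian quasi-maximum likelihood and return the conditional
    volatility path sqrt(h_t), usable as a daily volatility predictor."""
    def neg_loglik(params):
        e, h = garch_filter(params, r)
        return 0.5 * np.sum(np.log(2 * np.pi * h) + e**2 / h)
    res = minimize(neg_loglik, x0=[r.mean(), 0.1 * r.var(), 0.05, 0.90],
                   method="L-BFGS-B",
                   bounds=[(None, None), (1e-8, None), (0.0, 1.0), (0.0, 1.0)])
    _, h = garch_filter(res.x, r)
    return np.sqrt(h)
```

The bounds keep the variance recursion positive, which plays the same stabilizing role as the convergence control mentioned in footnote 7.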
The data spans the period January 1st, 1990 to December 31st, 2019.⁸ We are interested in evaluating model performance under two stylized settings. First, a researcher fits all models up until the Great Recession, including data from Q1 1990 to Q4 2007. Second, fitting is done with data largely encompassing the crisis period, again from Q1 1990 but now up to Q4 2011. In both cases, the testing sample ranges from the next GDP growth observation after fitting up to Q4 2019. All exercises exclude the global COVID-19 economic depression, as we consider it an extreme, unpredictable event that induces significant structural changes in the underlying macroeconomic dynamics.⁹

⁶ This excludes the VIX volatility index, which has been identified as a leading indicator in some studies; e.g. Andreou et al. (2013), Jardet and Meunier (2020). ⁷ We include a control scale = 1 to ensure convergence of the optimization algorithm and only include a constant mean term in the base process for simplicity. ⁸ In the Small-MD dataset experiments we make a small variation and instead include data starting from January 1st, 1975, but only for the initial CV selection of ridge penalties for MFESN models. Our aim is to make sure that, at least for the fixed-coefficient model, where λ is cross-validated once and only one W is estimated, the ridge estimator is somewhat robust. In practice, when we compare to expanding and rolling window estimators, where λ is re-selected at each window, we find that extending the initial CV data window has little impact on out-of-sample performance.

We emphasize that in the 'Medium-MD' setup, we do not include any MIDAS model specification. The reasons are multiple, and are mainly related to the computational issues associated with the MIDAS regression framework. First, the number of MIDAS parameters increases unless a careful model selection step is performed. Note, however, that our DFM specifications include a MIDAS-type aggregation scheme, in a similar fashion to Marcellino and Schumacher (2010): the factor structure effectively mitigates parameter proliferation. We also observe that MIDAS estimation can be hard to perform in practice due to the complexity of the nonlinear optimization. In Appendix 7.5.1 we document, using a simple replication experiment, how the Almon-scheme MIDAS loss can have a large number of distinct local minima which can be randomly selected depending on the initial conditions of the numerical optimization algorithm. These issues are present notwithstanding the fact that we follow Kostrov (2021) in implementing closed-form gradients for the MIDAS models and employ multi-start optimization routines.
To avoid having to handle the many edge cases that daily data in its "raw" calendar releases involves, we use an interpolation approach. We set ex ante the number of working days in any month to be exactly 24. Given that in forecasting the most recent information sets are more relevant, when interpolating daily data over months with fewer than 24 calendar observations, we linearly interpolate the "missing" data starting from the month's beginning (using the previous month's last observation). The choice of 24 as the daily frequency is transparent by noting that this is the closest number to actual commonly observed data releases, whilst also being a multiple of both 4 (the approximate number of weeks per month) and 6 (an upper bound on the number of working days per week).
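The padding scheme can be sketched as follows for a single series and month; the anchoring convention (the previous month's last observation as the left endpoint of the interpolation) follows the description above, while the function name and exact endpoint handling are our illustrative assumptions:

```python
import numpy as np

def pad_month_to_24(prev_last, month_vals, n_days=24):
    """Expand a month with fewer than n_days releases to exactly n_days by
    linearly interpolating the 'missing' observations at the start of the
    month, anchored at the previous month's last observation."""
    month_vals = np.asarray(month_vals, dtype=float)
    k = n_days - len(month_vals)
    if k <= 0:
        return month_vals[-n_days:]
    # k interior points on the segment from prev_last to the month's first release
    filled = np.linspace(prev_last, month_vals[0], k + 2)[1:-1]
    return np.concatenate([filled, month_vals])
```

Padding at the month's beginning rather than its end keeps the most recent releases untouched, consistent with the observation that recent information matters most for forecasting.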

Models
In this section, we present the set of models that we use throughout our empirical exercises. For a general overview, Table 5.2 summarizes all models, including hyperparameters.
As a baseline benchmark, we use the unconditional mean of the sample used for fitting. For GDP growth forecasting, there is evidence that the unconditional mean produces forecasts which are, in MSE terms, often competitive with linear models such as VARs, even at relatively short horizons (Arora et al., 2013). It is therefore an important point of reference for the performance of all other forecasts: in fact, we use it as the reference when presenting relative MSFE and RMSFE numbers in the tables below.
The non-trivial mixed-frequency model benchmark is given by a MIDAS model that we parametrize to include 30 daily lags and 9 monthly lags for all daily and monthly series, respectively. Since we use a dynamic MIDAS specification, 3 lags of the target variable plus an intercept are included as linear regressors. We believe that this is a reasonable choice for MIDAS, as it strikes a balance between lag depth and parsimony, given that the strength of the Almon polynomial weighting is the reduction of an arbitrary number of lag coefficients to just a couple of parameters. To make optimization efficient, we also implement MIDAS loss gradients explicitly, as detailed in Kostrov (2021). Our initial coefficient values for optimization are all zeros.¹⁰

Two distinct DFM specifications are used. The first uses the standard linear aggregation scheme, as detailed in Example 4.1 (we term it DFM [A]), while the second is a variation that implements an Almon weighting scheme, much like the MIDAS model, as presented in Example 4.2 (we term it DFM [B]). A key choice for a DFM model is the dimension of the factors. While a number of methods have been developed over the years to systematically derive the number of factors (see, for example, the review of Stock and Watson (2016)), commonly used macroeconomic panels feature a number of challenges, such as weak factors (Onatski, 2012). To sidestep these issues, we directly assume that both DFM models have 5 unobserved factors, and further impose that they follow a VAR(1) process. One extant issue with integrating daily data is its very high release frequency compared to monthly and especially quarterly releases: computationally this can be extremely taxing, which might be one of the reasons why, to our knowledge, we are the first to provide DFM forecasts that include daily data. Our solution is to aggregate daily data every 6 days by averaging, thus leaving 4 observations per month.
This eases considerably the computational burden of estimating coefficients and latent states (12 versus 72 daily observations per quarter).

The first set of ESNs we propose is given by two S-MFESN models: one uses a reservoir of 30 neurons (for shortness, we term it singleESN [A]); the other has a larger reservoir of 120 neurons (termed singleESN [B]). The sparsity factor for both models, and for both the A and C matrices, is adjusted to be 10/N, where N is the reservoir size. Both MFESNs share the same hyperparameters, ρ = 0.5, γ = 1, α = 0.1, values we have not tuned but which are credible given other ESN implementations in the literature. To make a fair comparison with the closest alternative approach, dynamic factor models, we also elect to fit models using 6-day-averaged daily data. Note that for MFESN models the computational savings from averaging are negligible, and are mostly apparent when tuning the ridge penalty via cross-validation.
Our second set of proposed models consists of two M-MFESN models. Both have two reservoirs, one for each data frequency (monthly and daily) with 100 and 20 neurons, respectively; sparsity degrees are again adjusted to be 10/N, where N is the reservoir size. The first M-MFESN has hyperparameters hand-selected among reasonable values: the monthly-frequency reservoir has no state leak and a larger input scaling, while the daily-frequency reservoir features smaller scaling than usual (in the hope of avoiding compressing high-volatility events with the sigmoid activation) and the same leak rate as in the S-MFESN models (we term this multiESN [A]). For the second M-MFESN, we change hyperparameters more radically: our aim is to set up a model that has a very high input memory and that also features long-term smoothing of states. Here input scaling factors are small, spectral radii are an order of magnitude smaller than in previous models, and leak rates are large (we term this last model multiESN [B]). Finally, in all S-MFESN and M-MFESN models we use the aligned state design, which allows us to forgo introducing more tuning parameters in the form of state lags in the observation equations.
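To make the reservoir construction concrete, the following is a minimal leaky-integrator ESN sketch with 10/N sparsity and a ridge readout; the exact leak and scaling conventions of our MFESN state equation may differ, so treat this as illustrative rather than as the paper's implementation:

```python
import numpy as np

def make_reservoir(n, n_in, rho=0.5, gamma=1.0, seed=0):
    """Draw sparse random reservoir (A) and input (C) matrices with
    sparsity 10/N, rescaling A to spectral radius rho and C by gamma.
    Only the readout is ever estimated; A and C stay fixed."""
    rng = np.random.default_rng(seed)
    density = 10 / n
    A = rng.standard_normal((n, n)) * (rng.random((n, n)) < density)
    A *= rho / np.max(np.abs(np.linalg.eigvals(A)))
    C = gamma * rng.standard_normal((n, n_in)) * (rng.random((n, n_in)) < density)
    return A, C

def esn_states(A, C, u, alpha=0.1):
    """Leaky-integrator state recursion
    x_t = alpha * x_{t-1} + (1 - alpha) * tanh(A x_{t-1} + C u_t)."""
    x = np.zeros(A.shape[0])
    X = []
    for ut in u:
        x = alpha * x + (1 - alpha) * np.tanh(A @ x + C @ ut)
        X.append(x.copy())
    return np.array(X)

def ridge_readout(X, y, lam=1e-2):
    """Estimate the observation map W by ridge regression on the states."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

An M-MFESN in this spirit would run one such recursion per frequency and stack the (aligned) states before the single ridge readout.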

Results
We now present our empirical results. Competing forecasts are compared using the Model Confidence Set (MCS) test derived by Hansen et al. (2011). One should note that, due to the intrinsic nature of data availability for macroeconomic time series and panels, our sample sizes are modest. This implies that the small-sample sensitivities of the MCS test need to be taken into account when evaluating our comparisons. In fact, recent analyses of the finite-sample properties of the MCS methodology have shown that it requires signal-to-noise ratios which are unattainable in most empirical settings, an issue which undermines its applicability (Aparicio and de Prado, 2018). In view of this fact, we also conduct pairwise model comparison tests with the Modified Diebold-Mariano (MDM) test for predictive accuracy (Diebold and Mariano, 1995; Harvey et al., 1997).
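The MDM statistic can be sketched as follows, with the Harvey et al. (1997) small-sample correction and a Student-t reference distribution (squared-error loss and the rectangular-kernel variance estimate are the standard choices assumed here):

```python
import numpy as np
from scipy import stats

def modified_dm_test(e1, e2, h=1):
    """Modified Diebold-Mariano test (Harvey et al., 1997) on the
    squared-error loss differential d_t = e1_t^2 - e2_t^2 for h-step
    forecasts. Returns the corrected statistic and a one-sided p-value
    for H1: model 1 forecasts more accurately than model 2."""
    d = np.asarray(e1)**2 - np.asarray(e2)**2
    T = len(d)
    dbar = d.mean()
    # long-run variance of dbar using h-1 autocovariances (rectangular kernel)
    gamma = [np.sum((d[k:] - dbar) * (d[:T - k] - dbar)) / T for k in range(h)]
    var_dbar = (gamma[0] + 2.0 * sum(gamma[1:])) / T
    dm = dbar / np.sqrt(var_dbar)
    # Harvey-Leybourne-Newbold small-sample correction, Student-t reference
    correction = np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
    stat = correction * dm
    return stat, stats.t.cdf(stat, df=T - 1)
```

With modest samples like ours, the correction factor and the t(T-1) reference matter: the uncorrected DM statistic over-rejects at multi-step horizons.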
As we also provide multiple-steps-ahead forecasts, we test for the best subset of models uniformly across all horizons using the uniform Multi-Horizon MCS (uMCS) test proposed by Quaedvlieg (2021). Since there is relatively little systematic knowledge regarding the power properties of the uMCS test in small samples, our inclusion of this procedure is meant as a statistical counterpoint to simple relative forecasting error comparisons, which provide limited information about the significance of performance differences. Finally, we do not report uMCS test outcomes for the expanding window setup, because Quaedvlieg (2021) argues that the test is invalid in that context.

Small Dataset
We begin our discussion of the Small-MD forecasting results by reviewing Table 5.3. For both sample setups (2007 and 2011) and all three estimation setups (fixed, expanding and rolling windows) we provide relative MSFE metrics, with the unconditional mean being used as benchmark, and MCS tests at 90% and 75% confidence levels. Plots of each of the model forecasts are given in Figures 3 and 4; additional plots for cumulative SFE, RMSFEs and other metrics can be found in Appendix 7.6.
The overall finding is that MFESN models perform very well and, excluding the 2007 fixed-parameters setup, they perform best. It is easy to see from Figure 3 (a) why the 2007 fixed window estimation case is different from the others: the 2008 Financial Crisis induced a deep fall in quarter-to-quarter GDP growth that was in stark contrast with previous business cycle fluctuations. By keeping model parameters fixed, and using only information from 1990 to 2007, periods where systematic fluctuations are small, DFM and MFESN models are fit to produce smooth, low-volatility forecasts; MIDAS, on the other hand, yields an exponential smoothing which can be more responsive to changes in monthly and daily series. From Figure 3 (b) and (c) it is possible to see that expanding and rolling window estimation resolves this weakness of state-space models. Table 5.3 shows that MFESN models always perform better than the unconditional mean in terms of MSFE, something which no other model class achieves across all setups. Furthermore, at least one MFESN model from each subclass (single or multiple reservoir) is always included in the model confidence set at the highest confidence level. We note again that the MCS test of Hansen et al. (2011) might be distorted due to the modest sample sizes considered, even more so in the 2011 test sample. To complement the MCS, we provide graphical tables for pairwise Modified Diebold-Mariano tests, with 10% level rejections highlighted, in Figure 5. The MDM tests broadly agree with the results of Table 5.3, although of course they do not account for multiple testing, and therefore cannot be interpreted as yielding subsets of the most accurate forecasting models in a statistical sense.
For multiple-steps-ahead forecasts, relative RMSFE and uMCS results are reported in Tables 5.4 and 5.5: we constrain our exercise to h ∈ {1, . . . , 8} steps, since we are interested in GDP growth forecasts within 2 years. Note that for h = 1 our results are similar to, but do not reduce to, the one-step-ahead results. To make correct multistep RMSFE evaluations, one must select h different vectors of residuals of the same length: this implies that residuals at the end of the forecasting sample must be trimmed off to compute short-term multistep RMSFEs that are comparable to the long-term ones. We focus our discussion on Figures 7 and 8, which reproduce graphically the RMSFE numbers of the aforementioned Tables. Generally, MIDAS as well as S-MFESN models provide the worst-performing multistep forecasts, with RMSFEs considerably exceeding the unconditional mean baseline after horizon 1. For MIDAS, we have already discussed above how the existence of multiple loss minima can generate numerical instabilities. Model re-fitting at each horizon can amplify this problem, as the loss landscape itself changes as new observations are added to the fitting sample. We provide more discussion in Appendix 7.5.1. In the case of S-MFESN models, the reason is structural: we have discussed how, in our framework, multistep MFESN forecasting entails iterating the state map, which can have multiple attraction (stable) points. If the hyperparameters and the estimated readout matrices W jointly do not define a contraction, the limit of the multistep forecast need not be the estimated MFESN model intercept. However, Figures 7 and 8 show that our M-MFESN models, multiESN [A] and multiESN [B], both perform on par with or better than DFM models even after horizon h = 4.
For example, in the 2007 expanding and rolling window experiments, multiESN [B] is able to outperform both DFMs and even an unconditional mean forecast by meaningful margins for forecasts up to a year into the future.
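The horizon-wise residual trimming described above can be sketched as follows; the indexing convention (forecasts[h][t] targets y[t + h + 1]) is our assumption for illustration:

```python
import numpy as np

def comparable_rmsfe(y, forecasts, h_max):
    """RMSFE per horizon over a common evaluation span: residual vectors at
    every horizon are trimmed to the same length, so short- and long-horizon
    RMSFEs are directly comparable (forecasts[h][t] targets y[t + h + 1])."""
    y = np.asarray(y, dtype=float)
    n = len(y) - h_max  # number of forecast origins usable at all horizons
    out = []
    for h in range(h_max):
        err = np.array([y[t + h + 1] - forecasts[h][t] for t in range(n)])
        out.append(np.sqrt(np.mean(err**2)))
    return np.array(out)
```

Without this trimming, the one-step RMSFE would be computed over more origins than the eight-step RMSFE, and the horizon profile would mix sample-composition effects with genuine accuracy differences.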

Medium Dataset
We now present the results for the Medium-MD dataset, which includes more than 30 regressors and many high-frequency daily series. The same metrics used in the previous section to evaluate the relative performance of different methods are used for this dataset.
[Figure (MDM tests): Tests are one-sided and carried out row-wise: the null hypothesis for row i and column j reads as "the ith-row model forecasts more accurately than the jth-column model". Rejections at 10% level are highlighted in red.]
[Table 5.6: Relative MSFE and Model Confidence Set (MCS) comparison between models in 1-step-ahead forecasting exercises. Unconditional mean MSFE used as reference. MCS columns show inclusion among best models: * indicates inclusion at 90% confidence, ** indicates inclusion at 75% confidence.]
The main difference in our empirical exercises is that now we a priori exclude MIDAS from the set of forecasting methods. The main reason to avoid MIDAS is that even in medium-sized setups the
computational burden due to numerical optimization makes it impractical to implement, especially in expanding or rolling window estimation setups. In Appendix 7.5.1 we make a more precise evaluation of the issues associated with the nonlinear optimization problem that is inherent to the MIDAS modeling framework. Put simply, even though MIDAS efficiently reduces the number of coefficients involved in multivariate regressions with lags, the loss function becomes highly non-convex. This means that even optimization routines initialized with many starting points might converge to different local minima over different runs. Moreover, in practical applications the optimization domain of MIDAS is often still moderately high-dimensional, because the number of Almon lag coefficients grows linearly with the number of regressors. To give an example, to produce a MIDAS forecast in the Medium-MD setup we would need to optimize a non-convex loss involving at least 100 parameters.

Table 5.6 showcases the relative performance of DFM and MFESN models in the Medium-MD forecast setup. We find that the MFESN model multiESN [B] performs best in all setups, particularly under fixed parameters, where MCS testing reveals that it is the only model included at a 75% confidence level. Of course, for the MCS results we must again take into account the relatively small sample size, which could distort the selection of best model subsets. Modified Diebold-Mariano tests in Figure 11 largely agree with the MCS results: in the fixed-parameter setup, any pairwise comparison of an alternative model against MFESN multiESN [B] is rejected in favor of the latter. A visual inspection of one-step-ahead forecasts in Figures 9 and 10 also shows that DFM models estimated over the Medium-MD dataset produce forecasts with larger variability than MFESN methods, which is likely the key driver of the difference in performance.
The multistep-ahead experiments are run as for the Small-MD dataset, with a maximum horizon of 8 quarters. Tables 5.7 and 5.8 present the relative RMSFE performance of multistep forecasts for all models, and we use Figures 13 and 14 as references for our discussion. What can be seen visually, and is also reproduced in the Tables, is that multi-reservoir MFESN models and DFM model [A] have the better performance up to 4 quarters ahead; overall, taking into account also the longer term, expanding or rolling window estimation of model multiESN [B] yields the best forecasting results in the 2007 sample setup. The post-crisis 2011 sample setup makes comparison harder, as DFM and M-MFESN models largely produce results in line with the unconditional sample mean. This evaluation is confirmed by uMCS tests, consistent with the multistep results obtained with the Small-MD dataset.

Conclusions
Macroeconomic forecasting, especially long-term forecasting of macroeconomic aggregates, is a topic of crucial importance for institutional policymakers, private companies, and economic researchers. Given the modern-day availability of "big data" resources, methods capable of integrating heterogeneous data sources are increasingly sought to provide more precise and robust forecasts. This paper presents a new methodological framework inspired by the Reservoir Computing (RC) literature to deal with data sampled at multiple frequencies and with multiple-steps-ahead forecasts. First, we presented two well-known methods, MIDAS and Dynamic Factor Models, the current benchmarks available in the literature. We then took Echo State Networks, a class of RC models, and formally extended them to allow modeling of data with multiple release frequencies. Our discussion encompasses model fitting, hyperparameter tuning, and forecast computation. As a result, we provide two classes of models, single- and multiple-reservoir multi-frequency ESNs, that can be effectively applied to our empirical setup: forecasting US GDP growth using monthly and daily data series. In our applications, we find that MFESN models are computationally more efficient and easier to implement than DFMs and MIDAS, respectively, and perform better than or as well as benchmarks in terms of mean-square forecasting errors. In a number of setups these improvements are also statistically significant, as shown by our Model Confidence Set and Diebold-Mariano tests. Thus, we argue that our machine-learning-based methodology can be a useful addition to the toolbox of contemporary macroeconomic forecasters.
Lastly, we wish to highlight several potential areas of research that we believe would be interesting to explore in the future. We have not discussed the role of the distribution from which the entries of the reservoir matrices are sampled. While it is known that this choice can have significant effects on the forecasting capacity of an ESN model, the literature lacks definitive theoretical results (even for dynamical systems applications) as well as systematic studies with stochastic inputs and targets. The hyperparameter tuning routine we have developed neither allows tuning individual hyperparameters separately, nor does it tackle the identification problem. Moreover, we assume that the ridge regression penalty strength, λ, is tuned ex ante: it would be interesting and desirable to understand whether it is possible to tune λ and ϕ jointly, or rather whether their selection can be fully separated. In our preliminary experiments, we noticed that the roles of the ridge penalty and the input scaling, for example, cannot be trivially disentangled, which prompted the ψ-form normalization. Model selection for the dimension of MFESN models is another open question that would be key to exploring and designing more efficient and effective ESN models, especially when dealing with multiple frequencies and reservoirs.
Acknowledgments: The authors acknowledge financial support from UK Research and Innovation (grant number ES/V006347/1). GB thanks the University of Warwick for its hospitality and generosity, without which this paper would not have happened. GB is grateful to the Center for Doctoral Studies in Economics (CDSE) at the University of Mannheim for providing travel funding through the 2020 CDSE Teaching Award. The authors acknowledge support by the state of Baden-Württemberg through bwHPC. The authors acknowledge the use of the UCL Myriad High Performance Computing Facility (Myriad@UCL), and associated support services, in completing this work. MH acknowledges that the majority of his work was done while he was affiliated with UCL and thanks UCL for its support of this project. The authors acknowledge the support of the Swiss National Science Foundation (grant number 200021 175801/1). The authors also thank Thanos Moraitis and Alfonso Silva Ruiz for excellent research assistance and help with data collection and curation.

ESN Implementation
Cross-validation. Because the initial cross-validation of λ uses an extended sample to improve generalization (our concern here being specifically the fixed estimation setups), we use two slightly different approaches: • In all setups (fixed, expanding, and rolling), the initial ridge-penalty cross-validation is done on the extended sample (starting January 1st, 1975 instead of January 1st, 1990). We construct 10 folds, each with 5 out-of-sample observations, starting from the end of the sample. Each fold and each out-of-sample observation set is re-normalized.
• Only in the expanding and rolling setups, for each subsequent window (the ones that now include at least one testing observation), we use the true sample (starting January 1st, 1990) and construct 5 folds, again with 5 out-of-sample observations. This is done to keep cross-validation computational complexity low and avoid making some folds too small, which could hurt larger MFESN models.
In practice, simple experiments show that there is not much difference between using 5 or 10 folds in the initial cross-validation.
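For concreteness, the end-of-sample fold construction described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the code from our repository; the per-fold re-normalization step is omitted.

```python
import numpy as np

def end_of_sample_folds(T, n_folds=10, oos_per_fold=5):
    """Build (train, test) index folds counting back from the sample end.

    Fold 0 holds out the last `oos_per_fold` observations, fold 1 the
    block before those, and so on; each fold trains on all observations
    preceding its own out-of-sample block.
    """
    folds = []
    for k in range(n_folds):
        test_end = T - k * oos_per_fold
        test_start = test_end - oos_per_fold
        train_idx = np.arange(0, test_start)
        test_idx = np.arange(test_start, test_end)
        folds.append((train_idx, test_idx))
    return folds

# e.g. 10 folds of 5 observations each, taken from the end of a sample of 200
folds = end_of_sample_folds(T=200, n_folds=10, oos_per_fold=5)
```

Reducing `n_folds` from 10 to 5 reproduces the cheaper scheme used in the expanding and rolling setups.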

MIDAS Implementation
While the MIDAS regression framework is straightforward to discuss in terms of equations, some care must be taken when implementing it computationally. A key assumption that can be imposed is that the integer frequencies κ := {κ_1, . . . , κ_L} of the L regressors are such that κ_max := max(κ) is a multiple of each κ_l, l ∈ [L]. In this case MIDAS parameter estimation can be written in matrix form, which allows for an efficient numerical implementation that we spell out in the following paragraphs. Let q_l = κ_max/κ_l, l ∈ [L], denote the frequency ratios and define y := (y_1, y_2, . . . , y_T) as the vector of target observations, where T is the sample length in the reference time scale. Additionally, let x_l := (x_{l,1}, x_{l,2}, . . . , x_{l,T_l}) be the vector of length T_l = T · κ_l consisting of the observations of the l-th covariate x_l, released with frequency κ_l. In order for the parameters of the MIDAS model in (4.4) to be identifiable, we assume that the aggregation weights are normalized. Since κ_max is a multiple of each of the L frequencies, for each series we introduce the upsampled vectors Y := y ⊗ i_{κ_max} and X_l := x_l ⊗ i_{q_l}, where i_{q_l} and i_{κ_max} are vectors of ones of lengths q_l and κ_max, respectively. In the absence of missing observations, we have Y, X_l ∈ R^{T_max} with T_max = T · κ_max observations. We now construct preliminary regression matrices whose maximal number of rows is T_max, without yet accounting for the lag structure of either the target (autoregressive lags) or the regressors (MIDAS lags), introducing zeros where no observations are available. For p ≥ 1 this yields an autoregressive matrix Y_p, and for K_l ≥ 0 a regressor matrix X_{K_l}. In the special case p = 0 (the MIDAS model is then called static, since it does not contain an autoregressive term) we take Y_p to be empty. Note, however, that one should not use Y_p and X_{K_l} directly as autoregressive and mixed-frequency regression matrices, respectively, because some observations are missing.
To overcome this, we introduce the so-called upper truncation (selection) matrix s, with which we obtain the required response vector Y^resp and regression matrices Y_p^reg and Z^reg, where Y_p^reg is empty whenever p = 0. The vector Y^resp and the matrix Z^reg are then sufficient to construct all MIDAS forecasting and nowcasting regressions. In practice, some care is needed to make sure that the data are correctly aligned: for example, in a nowcasting exercise the regressors in Z^reg and the targets in Y^resp have to be aligned differently than in a forecasting exercise. Provided the alignment is carried out correctly, the estimation of the MIDAS parameters is efficient. We should mention that the truncation via s in (7.3) may be too restrictive, as it can exclude up to K_max − 1 rows from Z^reg that could otherwise be used for estimation. This can be avoided at implementation time. In our repository, available at RCEconModelling/Reservoir-Computing-for-Macroeconomic-Modelling, we account for this detail and exclude from the final regression matrices only those rows that cannot be used due to the lag requirements of the model. We warn the reader that this comes at a cost: the code is lengthier and less elegant.
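To make the alignment machinery concrete, the following sketch (hypothetical names, not the code from our repository) upsamples a regressor to the κ_max grid by repetition and builds a zero-padded lag matrix:

```python
import numpy as np

def upsample(x, q):
    """Repeat each observation q times to place a frequency-kappa_l
    series on the kappa_max grid (the Kronecker product x_l ⊗ i_{q_l})."""
    return np.repeat(np.asarray(x, dtype=float), q)

def lag_matrix(X, K):
    """Columns are X lagged by 1, ..., K steps on the high-frequency
    grid; zeros fill positions where no observation is available yet."""
    T_max = len(X)
    Z = np.zeros((T_max, K))
    for k in range(1, K + 1):
        Z[k:, k - 1] = X[:-k]
    return Z

# Toy example: a kappa_max = 6 grid with a monthly regressor
# (kappa_l = 3, so q_l = 2); numbers are purely illustrative.
x_monthly = np.array([1.0, 2.0, 3.0])
X = upsample(x_monthly, q=2)     # length 6 on the common grid
Z = lag_matrix(X, K=2)
```

Rows containing padding zeros are precisely the ones that a truncation (selection) matrix would subsequently remove.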

Mixed-frequency DFM Implementation
This section gives additional details on implementing non-homogeneous dynamic factor models, such as the mixed-frequency model introduced in the main text. Note that the conditioning notation used in this section should not be confused with our temporal notation in Definition 2.1.
Kalman Filtering and Computational Complexity. In the linear Gaussian setting, the sufficient statistics of the posterior distribution of the latent factor f_t | y_{0:t} can be updated recursively by the Kalman filter. First, propagate the prior to obtain the predictive moments f_{t+1|t,θ} and Σ_{t+1|t,θ}. Compute the innovation covariance Γ_{t+1,θ} = Λ_{t+1,θ} Σ_{t+1|t,θ} Λ'_{t+1,θ} + R_θ R'_θ and the Kalman gain K_{t+1,θ} = Σ_{t+1|t,θ} Λ'_{t+1,θ} Γ^{-1}_{t+1,θ}. Then, update the statistics with the new information y_{t+1}. Notice that the inverse and the log-determinant of the innovation matrices Γ_{t,θ} are required for computing the Kalman gains and the marginal log-likelihood, respectively, which yields a cubic computational complexity in the dimension of the observation process. Alternatively, one can apply matrix inversion or determinant lemmas to obtain a computational complexity that is cubic in the dimension of the Markovian factor process f_t instead. For an alternative approach in high dimensions that imposes a dynamic factor structure after a projection of the observations onto a low-dimensional space, see Jungbacker and Koopman (2015).
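The predict-update step can be sketched in a few lines. This is a generic time-invariant illustration with hypothetical names, not our full mixed-frequency implementation:

```python
import numpy as np

def kalman_step(f, Sigma, y, A, Lam, Q, R):
    """One Kalman predict-update step for f_{t+1} | y_{0:t+1}.

    State:  f_{t+1} = A f_t + state noise,  Cov = Q
    Obs:    y_t     = Lam f_t + obs noise,  Cov = R R'
    """
    # Propagate the prior (predictive moments)
    f_pred = A @ f
    S_pred = A @ Sigma @ A.T + Q
    # Innovation covariance and Kalman gain
    Gamma = Lam @ S_pred @ Lam.T + R @ R.T
    K = S_pred @ Lam.T @ np.linalg.inv(Gamma)   # cubic in dim(y)
    # Update with the new observation y_{t+1}
    innov = y - Lam @ f_pred
    f_new = f_pred + K @ innov
    S_new = S_pred - K @ Gamma @ K.T
    return f_new, S_new, innov, Gamma
```

The explicit inversion of Γ here is cubic in the observation dimension; replacing it with a Woodbury-type identity shifts the cubic cost to the factor dimension, as discussed above.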
Model Selection. The model parameters θ are learned to jointly maximize the log-likelihood of the observed data at all frequencies. This is in contrast to the parameter estimation approach for MIDAS, which maximizes the log-likelihood of the low-frequency process only, conditional on observing the high-frequency time series. We remark that a different log-likelihood weighting for the different frequencies in DFMs has been suggested in Blasques et al. (2016), but this requires cross-validation to optimize the weightings. Nevertheless, the DFM introduced here contains several hyperparameters that need to be chosen, such as the latent state-space dimension k or the order p of the latent Markov process. One possibility is to select such hyperparameters by evaluating the low-frequency predictions on a validation set. Approaches for choosing the dimension of the latent factor process remain under-explored in the mixed-frequency setting, but see Bai and Ng (2007) and Hallin and Liška (2007) for possible criteria in general dynamic factor models. In our implementation, we choose p = 1, as this allows for a differentiable model parametrization with stationary factor dynamics. We set k = 5 for the small dataset and k = 10 for the medium dataset.
Parameter Estimation and Forecasting. Based on the results from the Kalman filtering recursions, the model parameters θ are learned by minimizing the negative marginal log-likelihood ℓ_t(θ) = −log p_θ(y_{0:t}) = −∑_{s=0}^{t} log p_θ(y_s | y_{0:s−1}), where p_θ(y_s | y_{0:s−1}) is Gaussian with mean Λ_{s,θ} f_{s|s−1,θ} and covariance Γ_{s,θ}. Gradients of ℓ_t(θ) can be computed using algorithmic differentiation.
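A sketch of this accumulation, assuming the innovations v_s and covariances Γ_s have already been produced by the Kalman filtering recursions (the function name is hypothetical):

```python
import numpy as np

def neg_log_likelihood(innovations, Gammas):
    """Accumulate l_t(theta) = -sum_s log p(y_s | y_{0:s-1}), where each
    conditional density is Gaussian with innovation v_s and innovation
    covariance Gamma_s from the Kalman filter."""
    nll = 0.0
    for v, G in zip(innovations, Gammas):
        k = len(v)
        sign, logdet = np.linalg.slogdet(G)   # log-determinant term
        quad = v @ np.linalg.solve(G, v)      # Mahalanobis term
        nll += 0.5 * (k * np.log(2.0 * np.pi) + logdet + quad)
    return nll
```

Both the log-determinant and the linear solve involving Γ_s appear here, which is the source of the cubic cost in the observation dimension noted above.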
For fixed θ, the τ-step-ahead prediction function H^τ_{t,θ}(y_{0:t}) = y_{t+τ|t,θ} = Λ_{t+τ,θ} f_{t+τ|t,θ} is linear due to the Kalman filtering recursion. For s ≤ t, consider also the prediction H^τ_{s,t}(y_{0:t}) = E_{θ(y_{0:s})}[y_{t+τ} | y_{0:t}] that is based on the dataset y_{0:t}, where θ(y_{0:s}) = arg min_θ ℓ_s(θ) maximizes the marginal likelihood of the data y_{0:s} only. This setting allows us to implement the different parameter estimation setups from Section 3.2.1. For instance, the fixed-parameter setup corresponds to fixing s, which yields a fixed training set y_{0:s} used to estimate θ. In the expanding window setup, both s and t are expanded, while the rolling window setup updates the set y_{0:s} by rolling over the data.
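The three setups differ only in how the estimation window evolves with the forecast origin; a schematic helper (hypothetical, for illustration only) could read:

```python
def estimation_windows(T_train, T_total, scheme):
    """Yield (lo, hi, t): theta is estimated on observations lo..hi and
    used to forecast from origin t. 'fixed' freezes the training set,
    'expanding' grows it, 'rolling' shifts a constant-length window."""
    for t in range(T_train, T_total):
        if scheme == "fixed":
            lo, hi = 0, T_train - 1
        elif scheme == "expanding":
            lo, hi = 0, t - 1
        elif scheme == "rolling":
            lo, hi = t - T_train, t - 1
        else:
            raise ValueError(scheme)
        yield lo, hi, t
```

For example, with a training sample of length 5 and 3 evaluation points, the rolling scheme shifts both endpoints while the expanding scheme moves only the upper one.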

High-Frequency Forecasts
To better understand how the use of high-frequency data impacts forecasting, as an additional empirical experiment we investigate high-frequency (HF) forecasts of all models for the Small-MD dataset. We restrict our analysis to this dataset because the computational burden of constructing HF forecasts can be high: when using daily data with our suggested 24-days-per-month interpolation, one quarter amounts to 72 daily-frequency observations, which means HF forecasts can involve thousands of data points; for DFM and M-MFESN models this setup can be quite computationally onerous.
Constructing HF forecasts with MIDAS is trivial once the aggregation weights have been estimated, even though a practical implementation requires care in constructing the appropriate lag matrices. Recall from Section 4.1 that the MIDAS equation with L regressors {x^{(l)}_{r_l}}_{l=1}^{L} can be written as a weighted regression over lagged high-frequency observations. For clarity, we suppress the dynamic autoregressive component, as it has the same frequency as the target. Now assume that we include n^{(m)} monthly and n^{(d)} daily frequency regressors in the model, sampled κ_m = 3 and κ_d = 72 times per quarter, so that κ_max = 72. We can therefore partition the regression above into its monthly and daily blocks; once the daily observations up to intra-quarter step s are available, the HF forecast y_{t+1,s|72} is given by evaluating the corresponding partial regression. For DFMs, high-frequency forecasts can be constructed using (4.10) and (4.11) to iterate factors forward in time and then aggregating them according to estimated loadings or a weighting scheme.
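As an illustration of the daily-block aggregation, the following sketch evaluates a normalized exp-Almon weighting as in (4.3) on the s daily observations released so far. Names are hypothetical, and applying the weights only to the released observations is one simple possibility, not necessarily the exact alignment used in our code:

```python
import numpy as np

def exp_almon_weights(theta1, theta2, K):
    """Normalized exponential Almon weights w_k, k = 1..K."""
    k = np.arange(1, K + 1)
    w = np.exp(theta1 * k + theta2 * k**2)
    return w / w.sum()

def hf_forecast(beta0, beta_d, x_daily_avail, theta1, theta2, K):
    """Partial HF forecast using only the s daily observations released
    so far (most recent first, matching the lag ordering of the weights);
    the remaining weights are simply not yet applied."""
    w = exp_almon_weights(theta1, theta2, K)
    s = len(x_daily_avail)
    return beta0 + beta_d * np.dot(w[:s], x_daily_avail)
```

With all K observations released and a constant regressor, the forecast reduces to beta0 + beta_d, since the weights sum to one.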
Multi-frequency ESN models can also yield high-frequency forecasts in a trivial manner. For simplicity, consider, as in Example 3.1, an aligned S-MFESN model that has been fitted to a quarterly target with monthly and daily data. The reservoir is run at high frequency, κ_max steps per quarter, according to its state equation. Suppose a coefficient matrix W has been estimated. Then, as states between low-frequency periods t and t + 1 are collected, we can immediately construct the high-frequency one-step-ahead forecasts by setting y_{t+1,s|72} = W x^{(m,d)}_{t,s|72}. For M-MFESN models, HF forecasts require slightly more care: for example, when dealing with the multi-reservoir MFESN model of Example 3.2, we must repeat the most recent monthly state at daily frequency correctly.
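A minimal sketch of this high-frequency recursion and readout, using generic random matrices and hypothetical names (the actual model uses the mixed-frequency input alignment of Example 3.1):

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 30, 4                     # reservoir and input dimensions
A = rng.normal(size=(N, N))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius < 1
C = rng.normal(size=(N, M))
zeta = rng.normal(size=N)

def esn_step(x, z):
    """One high-frequency reservoir update x_{t,s} -> x_{t,s+1}."""
    return np.tanh(A @ x + C @ z + zeta)

W = rng.normal(size=N)           # stand-in for the estimated readout
x = np.zeros(N)
hf_forecasts = []
for s in range(72):              # kappa_max = 72 intra-quarter steps
    z = rng.normal(size=M)       # stand-in for aligned HF inputs at step s
    x = esn_step(x, z)
    hf_forecasts.append(W @ x)   # y_{t+1, s|72} = W x_{t, s|72}
```

Each intra-quarter state immediately yields a forecast; no re-estimation of W is required as daily data arrive.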

MIDAS
As we discuss briefly in the main text, parameter optimization is a principal problem in implementing any MIDAS model. Even though explicit formulas exist for both the gradient and the Hessian of the MIDAS loss objective when an Almon weighting scheme is used (see Kostrov (2021)), there is no known guarantee that the loss itself is convex, or even locally convex. In practice, for a given starting point (or set of points), a numerical optimizer might only converge to a local minimum.
We observe this in practice, and we explore its effects on the robustness of MIDAS forecasts. We report summary results for our simulations in Figure 17. Our proposal is, given a MIDAS model specification and a set of starting points for evaluating the loss, to run an optimizer (for example, L-BFGS-B with an explicit gradient) and select the minimizer with the smallest loss value. By repeating this procedure multiple times, we collect a set of MIDAS parameter estimates and study both the variation across the parameter vectors and the implied one-step-ahead forecasts.
To be precise, our procedure is as follows:
1. For a total of B repetitions:
(a) Choose M initialization points for the optimizer. We draw 64 points inside the hypercube of edge length 0.025 using a low-discrepancy Sobol sequence. The choice of a down-scaled hypercube as the domain comes from the empirical fact that the Almon exponential scheme may produce extremely large values even for relatively small coefficients. A straightforward way to see this is to notice that, given arbitrarily small positive values for θ_1 and θ_2 in (4.3), for a sufficiently large lag index k the weight exp(θ_1 k + θ_2 k^2) will overflow at any given numerical precision. This means that one should adjust the MIDAS optimization domain based on the number of lags in the model.
(b) Run the optimizer (L-BFGS-B with explicit gradient) from each of the M initialization points.
(c) Among the resulting M (local) loss minimizers, select and store the one with the lowest loss value.
2. With the resulting B stored minimizers:
• Construct a low-dimensional projection of the high-dimensional minima to inspect their relative location in parameter space and to compare their gradient and loss values; see Figure 17 (a)-(b).
• Use each minimizer to produce MIDAS one-step-ahead forecasts and plot quantile frequency plots of the forecast variation due to initialization; see Figure 17 (c).
Figure 17 shows that the best minimizers among the initial Sobol sets are clustered together. To construct this 2D projection of the high-dimensional Almon coefficient space (including autoregressive lags and the intercept), we use the well-known t-SNE procedure of van der Maaten (2009), an unsupervised dimensionality-reduction algorithm capable of preserving latent high-dimensional structure. This approach naturally implies that the Euclidean distances in the plot are suggestive of "clustering" rather than of the actual latent distances between points. In Figures 17 (a)-(b), we see that the L-BFGS-B optimizer with explicit gradient achieves good convergence in terms of gradient norm, and that the resulting cluster of minimizers has close loss values. However, there is no single loss minimum: Figure 17 instead suggests that the local structure of the MIDAS loss function is very uneven, so that many distinct local minima are reached even when a large number of initialization values is used. This means that the "multistart" strategy suggested in Kostrov (2021) to alleviate issues in MIDAS model estimation is insufficient.
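The multistart loop of step 1 can be sketched as follows. The loss below is a toy non-convex surrogate, not the actual MIDAS objective, and the helper names are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import qmc

def midas_loss(theta):
    # Toy non-convex surrogate with many local minima.
    return float(np.sum(theta**2) + 0.1 * np.sum(np.sin(40.0 * theta)))

def multistart_minimize(loss, dim, n_starts=64, edge=0.025, seed=0):
    # Draw low-discrepancy Sobol points inside the down-scaled
    # hypercube [0, edge]^dim, as in step 1(a), to avoid exp-Almon
    # overflow at the starting points.
    sobol = qmc.Sobol(d=dim, scramble=True, seed=seed)
    starts = edge * sobol.random(n_starts)
    results = [minimize(loss, x0, method="L-BFGS-B") for x0 in starts]
    best = min(results, key=lambda r: r.fun)   # keep lowest local minimum
    return best.x, best.fun

theta_hat, loss_val = multistart_minimize(midas_loss, dim=3)
```

For the real problem one would pass the explicit Almon gradient via the `jac` argument; here L-BFGS-B falls back to numerical differentiation.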
The effects of this non-negligible variation in parameter values on forecasts appear to be significant. Looking at Figure 17 (c), we can see wide frequency bands for the one-step-ahead forecasts constructed using the Small-MD dataset and fixed parameter values. In particular, the Financial Crisis period seems to induce larger variation in forecasts, consistent with the intuition that data with larger variation cause stronger model sensitivity when making forecasts.

MFESN
Since ESN models, and thus MFESN models, require random sampling of parameter matrices, which are often large, there is inherent variability in any ESN model forecast. In theory, because all MFESN parameters in the matrices (A, C, ζ) are drawn independently of each other, one could try to decompose the variance of any MFESN forecast into the share due to parameter sampling and the share due to data sampling. Unfortunately, in practice such a decomposition is hard to derive: cross-validation of ridge penalties and rolling and expanding window estimation are non-trivial data-dependent operations that complicate inference. In this work, we limit ourselves to numerically evaluating the effect of reservoir coefficient sampling on MFESN forecast variance.
Our approach is straightforward: given an MFESN model specification, cf. Table 5.2, and a forecasting setup (fixed parameters, expanding or rolling window), we resample the reservoir design matrices, perform cross-validation and possibly train-test sample windowing, and finally construct point-wise forecasts. Once a sufficiently large set of resampled forecasts has been computed, we plot frequency intervals in Figures 19 and 21. From Figure 19, we can see that the single-reservoir MFESN model with reservoir size 30 produces forecasts with a meaningful amount of variability induced by parameter resampling. Forecasts exhibit more variation when using an expanding or rolling window estimation strategy, even though the overall forecasts align with the GDP realizations. A similar discussion to that of MIDAS applies here: forecast sensitivity increases with underlying data variation, which is exacerbated in periods of systemic economic crisis. Figure 21 suggests that larger MFESN models produce forecasts that are significantly more stable with respect to model resampling. Note that the M-MFESN model [A] has a monthly frequency reservoir that is approximately 3 times the size of that of the S-MFESN model [A]. This stability is preserved even in expanding or rolling window settings, even though slightly higher variation is apparent at the height of the 2008 Financial Crisis. We hypothesize that this reduction in variance due to model parameter sampling is due to the concentration-of-measure phenomena that prevail in high-dimensional spaces.
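Schematically, the resampling experiment reads as follows. This is a toy sketch with hypothetical names and simulated data; the real experiment additionally cross-validates the ridge penalty and handles the estimation windows:

```python
import numpy as np

def fit_and_forecast(y, Z, lam, N, seed):
    """Draw one random reservoir, fit the ridge readout W, and produce
    a one-step-ahead point forecast from the last reservoir state."""
    rng = np.random.default_rng(seed)
    T, M = Z.shape
    A = 0.1 * rng.normal(size=(N, N))   # modest scaling for stability
    C = rng.normal(size=(N, M))
    X = np.zeros((T, N))
    x = np.zeros(N)
    for t in range(T):
        x = np.tanh(A @ x + C @ Z[t])
        X[t] = x
    # Ridge readout regressing y_{t+1} on the state x_t
    G = X[:-1].T @ X[:-1] + lam * np.eye(N)
    W = np.linalg.solve(G, X[:-1].T @ y[1:])
    return float(W @ X[-1])

rng = np.random.default_rng(123)
T, M = 80, 3
Z = rng.normal(size=(T, M))               # toy regressor panel
y = rng.normal(size=T)                    # toy target
draws = np.array([fit_and_forecast(y, Z, lam=1.0, N=20, seed=s)
                  for s in range(50)])
band = np.quantile(draws, [0.05, 0.95])   # resampling frequency interval
```

The width of `band` is the quantity plotted in Figures 19 and 21; increasing N should shrink it, in line with the concentration-of-measure hypothesis above.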