Improving the teaching of econometrics

We recommend a major shift in the Econometrics curriculum for both graduate and undergraduate teaching. It is essential to include a range of topics that are still rarely addressed in such teaching, but are now vital for understanding and conducting empirical macroeconomic research. We focus on a new approach to macro-econometrics teaching, since even undergraduate econometrics courses must include analytical methods for time series that exhibit both evolution from stochastic trends and abrupt changes from location shifts, and so confront the “non-stationarity revolution”. The complexity and size of the resulting equation specifications, formulated to include all theory-based variables, their lags and possibly non-linear functional forms, as well as potential breaks and rival candidate variables, places model selection for models of changing economic data at the centre of teaching. To illustrate our proposed new curriculum, we draw on a large UK macroeconomics database over 1860–2011. We discuss how we reached our present approach, and how the teaching of macro-econometrics, and econometrics in general, can be improved by nesting so-called “theory-driven” and “data-driven” approaches. In our methodology, the theory-model’s parameter estimates are unaffected by selection when the theory is complete and correct, so nothing is lost, whereas when the theory is incomplete or incorrect, improved empirical models can be discovered from the data. Recent software like Autometrics facilitates both the teaching and the implementation of econometrics, supported by simulation tools to examine operational performance, designed to be feasibly presented live in the classroom.


PUBLIC INTEREST STATEMENT
We recommend a major change in macroeconometrics teaching for both graduates and undergraduates to include methods for analysing time series that have both stochastic trends and abrupt shifts. To incorporate all theory-based variables, their lags, possibly non-linear functional forms, potential outliers, breaks and rival candidate variables, the resulting specifications place model selection facing non-stationary economic data at the centre of teaching. We illustrate our proposed curriculum using a large UK macroeconomics database over the turbulent period 1860–2011. We describe how the teaching of econometrics can be improved by nesting "theory-driven" and "data-driven" approaches, whereby the theory-model's parameter estimates are unaffected by selection when the theory is correct, whereas improved empirical models can be discovered when the theory is incorrect. Recent software like Autometrics facilitates both the teaching and the implementation of econometrics, supported by simulation tools to examine operational performance, designed to be feasibly presented live in the classroom.

seminar presentations. This was facilitated by the first author developing a suite of programmes derived from those used in his doctoral research for the analysis of autoregressive time series models (see Hendry & Srba, 1980), which was the precursor to the widely used PcGive software of today: Hendry and Doornik (1999) trace that development. Section 2.9 provides a brief overview of this and some of the other econometrics software that has been developed since the late 1970s. The extremely powerful computing equipment and the sophisticated, yet easy to use, software implementing the many advances in modelling strategy that are available today mean that it is now possible for empirical researchers to tackle the vast array of issues that they face in modelling economic systems.
The magnitude of these developments and their success in modelling complex economic systems relative to the achievements of the widely used alternatives that pervade today's econometrics textbooks means that it is important for there to be a major shift in both undergraduate and graduate econometrics curricula. We will now present an overview of the evolution of the elements of our recommended modelling strategy, then illustrate its application in teaching.
The structure of the paper is as follows. Section 2 summarizes the many key concepts that have been clarified. Each is then amplified in Sections 2.1-2.9, followed by subsections to help in teaching undergraduates by explaining the basics of any material that involved substantive mathematics. Section 3 illustrates the new econometrics teaching that this entails, considering in turn the roles of subject matter theory in Section 3.1, the database and software in Section 3.2, computing parameter estimates and evaluating models in Section 3.3, and selecting a representation of the general unrestricted model in Section 3.4. We briefly note testing parameter constancy in Section 3.5, the validity of exogeneity in Section 3.6, checking the need for a non-linear representation in Section 3.7, and using simulation to investigate the reliability of the outcome in Section 3.8. Section 4 concludes.

Key concepts
A set of inter-related concepts sustains our modelling strategy, commencing with the fundamental notion of a data generation process (DGP) and its representation. Data are obviously generated by some process, which in economics is a combination of the behaviour of the individuals in an economy (and other economies they interact with) and how that behaviour is measured. The total number of transactions in an economy far exceeds any hope of detailed modelling of them all, so macroeconometrics focuses on aggregates of variables such as gross national product (GNP), total unemployment (U) and so on. Section 2.1 considers further reductions to create manageably sized formulations. From the perspective of teaching, the first and most crucial step explaining the origin of empirical models through the theory of reduction unfortunately involves many technical concepts, some of which students will not have encountered previously. Moreover, any earlier courses probably assumed, explicitly or implicitly, that the specification of the model was given by an economic analysis. Certainly, the empirical counterpart of the theory model is the objective of the analysis. However, the only hope of ending with a well-specified empirical representation is for the target of the modelling exercise to be the process that actually generated the variables under analysis. Conflating these two distinct entities by asserting that "the model is the mechanism" is all too common, but unlikely to lead to robust empirical evidence. Our approach retains any available theory insights within the general model, so if the theory happened to be complete and correct, that is what would be found, but if it were either incomplete or incorrect, an improved representation would be discovered without the retention of the theory contaminating the final selection in any way. Thus, tackling the theory of reduction is imperative to place empirical modelling on a sound footing. 
A valuable collateral benefit is that many of the central concepts of modern econometrics correspond to reductions moving from the DGP to the model to be empirically analysed, including sufficient statistics, innovation errors, (Granger) causality, exogeneity, constancy and invariance, cointegration, conditioning and simultaneity.
The next question is how to judge whether such reductions lose valuable information about the objectives of the analysis, addressed in Section 2.2 by the concept of congruence between a model and the evidence. A strategy to locate congruent models is described in Section 2.3, followed in Section 2.4 by an explanation of exogeneity and its role in empirical modelling. Empirical analyses almost never exist in isolation, so to evaluate the latest findings we require that a new model encompasses or explains previous results, discussed in Section 2.5. However, economies evolve over time and often suffer abrupt shifts, as with the recent Great Recession, so empirical modelling has to take account of such non-stationarities, the topic of Section 2.6. The final two developments concern the potential non-linearity of relationships, considered in Section 2.7, and how to undertake model selection in Section 2.8 in a way that addresses all of the potential complications inherent in economic data. While we mainly consider macroeconomic time series, similar principles apply to cross section and panel observational data (see e.g. , for an example).

Data generation process and its representation
The most crucial decision in all empirical studies concerns the set of variables to collect observations on and then analyse, which will be a small subset of all the variables in the economy. The factors influencing this decision include the focus of interest in modelling (which could be an economic policy issue, evaluating an economic theory, or forecasting), as well as economic theory, previous empirical results, and related institutional knowledge and economic history. Denoting these $n$ variables by $\{\mathbf{x}_t\}$ to mean the complete set $(\mathbf{x}_1, \ldots, \mathbf{x}_T)$ for a time period $t = 1, \ldots, T$, the focus of attention in modelling is to learn as much as possible about the process that generates $\{\mathbf{x}_t\}$, called the local data generation process (LDGP). For a sample of size $T$, the LDGP is represented by the joint density $D_X(\mathbf{x}_1, \ldots, \mathbf{x}_T \mid \boldsymbol{\theta}, \mathbf{X}_0)$, where $D_X(\cdot)$ is the form of the density function (often assumed to be normal) and $\mathbf{X}_0$ is the set of initial conditions, characterized by the vector of $k$ parameters $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, although these might be time varying so could be denoted $\boldsymbol{\theta}_T^1$. The variables $\{\mathbf{x}_t\}$ are generated as part of a much larger set of variables, $\mathbf{w}_t$, generated in the economy under analysis, which might be the global economy for present-day modelling. Such a DGP is far too high dimensional, heterogeneous and time varying to accurately theorize about or model in detail, whereas theorizing and learning about the LDGP for $\{\mathbf{x}_t\}$ is often feasible. To structure the analysis, we assume that the DGP of $\{\mathbf{w}_t\}$ can be represented by a joint density $D_W(\mathbf{w}_1, \ldots, \mathbf{w}_T \mid \cdot)$. Underlying the decision to model the LDGP rather than the DGP is a series of reductions which are inevitable and intrinsic (see inter alia, Florens, Mouchart, & Rolin, 1990; Hendry, 1995a). Firstly, the decision to model the variables in $\{\mathbf{x}_t\}$ and not those of the DGP $\{\mathbf{w}_t\}$ entails discarding the remaining variables in $\{\mathbf{w}_t\}$, so the decision to model just $\{\mathbf{x}_t\}$ requires that there is no substantive loss of information incurred. This in turn requires that there is no Granger causality from the lagged values of the discarded variables to $\mathbf{x}_t$, which is a demanding requirement (see Granger, 1969; Hendry & Mizon, 1999). Secondly, almost all economic data in $\{\mathbf{x}_t\}$ are aggregated across commodities, agents, space and time, and in some cases, they are estimates of the "correct" aggregate based on small samples of activity. Thirdly, most econometric studies analyse data after transformations such as logarithms and growth rates, which can affect the constancy of, and cross-links between, the resulting parameters. In fact, since aggregates are linear sums of disaggregates, log transformations of the aggregates might be well behaved even though disaggregates are not.
If any of the reductions implied in moving from the DGP $D_W(\cdot)$ to the LDGP $D_X(\cdot)$ are invalid, then the LDGP may be non-constant and so provide a poor representation of the future generation of $\{\mathbf{x}_t\}$ despite describing the sample. By definition, the LDGP changes with changes in the set of variables $\{\mathbf{x}_t\}$ to analyse, which means that the decision about what to include in $\{\mathbf{x}_t\}$ is crucial in determining how good an LDGP will be. Thus, it will be valuable to include a wide range of variables from the outset, rather than beginning with a restricted set that is more likely to be inadequate. Although it is possible to include more variables as the inadequacies of a particular set are revealed, doing so incrementally is fraught with difficulties, e.g. which additional variables to consider and in what order. Once a reasonable choice of $\{\mathbf{x}_t\}$ has been made, its LDGP may be too complicated to be recovered fully using available empirical and theoretical information. Hence, the potentially infinite lags in $D_X(\mathbf{X}_T^1 \mid \cdot)$ must be reduced to a small number of lags, and its parameters $\boldsymbol{\theta}_T^1$ have to depend on a smaller set of parameters $\boldsymbol{\theta}$ that are constant. The validity of the reduced lag length can be checked by testing if longer lags matter, and the constancy of $\boldsymbol{\theta}$, at least within regimes, can also be tested directly. In order to proceed further empirically, a general unrestricted model (GUM) has to be specified that: (a) uses data transformations sufficiently general to capture those in the reduced LDGP; (b) includes all variables $\{\mathbf{x}_t\}$ of the LDGP, with perhaps some additional variables that might transpire to be relevant; and (c) contains long enough lags and sufficient deterministic variables (including indicator, or dummy, variables) to be able to capture a constant parameter representation. With a judicious choice of parameters and variables, the LDGP might be nested within the GUM, and in this case, a well-specified model which embeds the economic theory and can deliver the parameters of interest should be obtainable.
Alternatively, when the LDGP is not nested in the GUM, and so some of the reductions mentioned above involve important mis-specifications, it is difficult to establish what properties the final specific model will have, although a well-specified approximation can often still be found. Section 2.1.1 suggests an approach to teaching the main ideas of intrinsic reductions without the mathematics.
Other important considerations in formulating the GUM include taking into account the wide-sense non-stationarity of economic variables, the possibility of conditioning on exogenous variables $\mathbf{z}_t$ when $\mathbf{x}_t' = (\mathbf{y}_t', \mathbf{z}_t')$, as well as possible simultaneous determination of the endogenous variables, $\mathbf{y}_t$, allowing the more parsimonious combinations $\mathbf{B}\mathbf{y}_t$ (where $\mathbf{B}$ is non-singular), rather than $\mathbf{y}_t$, to be modelled. The consequences of non-stationarity and simultaneous determination are discussed in Section 2.6, and the importance of valid conditioning is discussed in Section 2.4. For more details on each of the issues considered in this section, see Hendry (2009) and Hendry and Doornik (2014).

Explaining the basics of reduction theory
For undergraduates, how can a teacher get across the main ideas of intrinsic reductions without the mathematics? One approach after teaching basic regression is to use a simple linear model fitted to artificial data, with (say) 4 variables: $y_t$ related to $x_{1,t}$, $x_{2,t}$, $x_{3,t}$ and an intercept, where the $x_{i,t}$ are quite highly correlated, are autocorrelated, and have large means, all with non-zero parameters in the DGP except $x_{3,t}$, which is in fact irrelevant. Such data are easily generated in, say, PcNaive, part of OxMetrics (see Doornik & Hendry, 2013b), and an invaluable tool for easy-to-create Monte Carlo simulations. Then, illustrate: (a) the DGP estimates; (b) those where each of $x_{1,t}$, $x_{2,t}$, $x_{3,t}$ and the intercept are omitted in turn; and (c) where a lag of every remaining variable is added when any variable is omitted. Non-constancy could be added where the students can cope. A slight extension of the model in Section 3.1 could be used. Better still, Monte Carlo simulations can show students the estimator and test distributions, and allow comparisons between correctly specified and mis-specified cases in terms of both biased estimates and incorrect estimated standard errors, as against the correct standard deviations of the sampling distributions.
To assess the validity of reductions, t-tests for eliminating variables can be used. The test for $x_{3,t}$ should be insignificant, and eliminating that variable should have little effect on the estimates of the remaining parameters but increase their precision, whereas the other reductions should all reject, with possibly large changes in parameter estimates (did the coefficient of $x_{3,t}$ become "spuriously" significant?), and the appearance of residual autocorrelation. The impacts of incorrectly eliminating the intercept, a "fixed regressor", merit discussion.
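For teachers without access to PcNaive, the experiment just described can be approximated in a few lines of Python. The following is a minimal sketch, not the authors' design: the coefficient values, the AR(1) parameter of 0.7, the regressor mean of 10, the sample size and the function name `simulate_reductions` are all illustrative assumptions. It records the Monte Carlo mean and standard deviation of the estimated coefficient on $x_{1,t}$ for the DGP specification, for the valid reduction (dropping the irrelevant $x_{3,t}$) and for an invalid one (dropping the relevant $x_{2,t}$).

```python
import numpy as np

def simulate_reductions(n_reps=500, T=100, seed=0):
    """Monte Carlo for the coefficient on x1 under three specifications."""
    rng = np.random.default_rng(seed)
    estimates = {"dgp": [], "drop_x3": [], "drop_x2": []}
    for _ in range(n_reps):
        # Assumed design: a common shock makes the regressors correlated,
        # an AR(1) makes them autocorrelated, and they have a mean of 10.
        shocks = 0.8 * rng.normal(size=(T, 1)) + 0.6 * rng.normal(size=(T, 3))
        dev = np.zeros((T, 3))
        dev[0] = shocks[0]
        for t in range(1, T):
            dev[t] = 0.7 * dev[t - 1] + shocks[t]
        x = 10.0 + dev
        # Assumed DGP: x3 has a zero coefficient, so it is irrelevant.
        y = 5.0 + 1.0 * x[:, 0] + 1.0 * x[:, 1] + rng.normal(size=T)
        const = np.ones((T, 1))
        designs = {
            "dgp": np.hstack([const, x]),
            "drop_x3": np.hstack([const, x[:, :2]]),      # valid reduction
            "drop_x2": np.hstack([const, x[:, [0, 2]]]),  # invalid reduction
        }
        for name, X in designs.items():
            coef = np.linalg.lstsq(X, y, rcond=None)[0]
            estimates[name].append(coef[1])  # coefficient on x1
    return {k: (float(np.mean(v)), float(np.std(v))) for k, v in estimates.items()}

results = simulate_reductions()
for name, (mean, sd) in results.items():
    print(f"{name:8s} mean of x1 coefficient = {mean:.3f}, MC sd = {sd:.3f}")
```

Under these assumptions, students should see the coefficient on $x_{1,t}$ centred near its DGP value of unity in the first two cases and noticeably biased in the third, because the omitted $x_{2,t}$ is correlated with the retained regressors.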

Congruence
Although the early econometrics textbooks (see e.g. Goldberger, 1964; Johnston, 1963) emphasized estimation theory and its analysis using matrix algebra, they also described simple hypothesis testing, such as individual coefficient Student's t, goodness of fit F, and Durbin-Watson (DW) statistics. Initially, the adequacy of a particular model specification was assessed by comparing the statistical significance, size and sign of estimated coefficients relative to economic theory, calculating the overall goodness of fit $R^2$ and F, and inspecting mis-specification test statistics such as DW. However, with the continuing increase in computing power, it became possible to consider models with larger numbers of explanatory variables including longer lags, and the possibility that non-linear models might be important. Equally important was the recognition that in developing empirical models, which rarely will be more than an approximation to the high dimensional and complicated LDGP, it is crucial that once the set of variables $\{\mathbf{x}_t\}$ has been chosen, all the relevant information that is available is fully exploited. This idea is captured in the concept of congruence which requires an empirical model not to depart substantively from the evidence. In particular, it is desirable that the empirical model is theory consistent (coherent with well established a priori theory), data consistent (coherent with the observed sample information), and data admissible (consistent with the properties of the measurement system). The theory consistency of a model can be assessed via specification tests of the restrictions implied by the theory relative to the estimated model. Given that the error terms and the corresponding residuals are by definition unexplained by the model, finding that they exhibit systematic behaviour is an indication that there is potentially valuable information in the data that the model has not captured. Whether this is the case can be assessed via mis-specification tests for residual serial correlation, and heteroskedasticity, as well as for invalid conditioning and parameter non-constancies.
By definition, the DGP is congruent with itself, and provided that the LDGP is a valid reduction of the DGP, it will also be congruent. An empirical model is congruent if it is indistinguishable from the LDGP when the LDGP is a valid reduction of the DGP. Given the importance of the GUM in the modelling strategy that we advocate, it is crucial that it is congruent, and that in seeking simplifications of it, congruence is maintained for a simpler model to be acceptable. Although the diagnostic tests used to check the congruence of the GUM and subsequent model selections are designed to affect the operating characteristics of general-to-specific selection algorithms such as Autometrics (e.g. possibly retaining insignificant variables to avoid diagnostic test rejections), selection does not affect their null distributions (see Hendry & Krolzig, 2005). The potential costs of not testing the congruence of the GUM, and simplifications thereof, are that it may be non-congruent and so adversely affect all inferences during the selection process. To re-emphasize, a non-congruent model fails to account for, or make use of, all relevant theoretical and empirical information that is available and so is inadequate. For further details see inter alia Hendry (1995a), Mizon (1995a), Bontemps and Mizon (2003), and Hendry and Doornik (2014).
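One aspect of data consistency, the absence of residual serial correlation, is easy to demonstrate numerically. The sketch below is an illustration under assumed settings (an artificial DGP with AR(1) errors and an autoregressive coefficient of 0.8; all function names are hypothetical). It computes the DW statistic and a Lagrange-multiplier statistic of the $T R^2$ form for a static regression that ignores the dynamics, and for a dynamic regression that adds one lag of each variable.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from a least-squares regression of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ coef

def durbin_watson(resid):
    """DW statistic: near 2 for white noise, near 0 for strong positive autocorrelation."""
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

def lm_autocorrelation(X, resid, lags=1):
    """T * R^2 from regressing the residuals on X and their own lags."""
    lagged = np.column_stack(
        [np.concatenate([np.zeros(k), resid[:-k]]) for k in range(1, lags + 1)]
    )
    aux = ols_residuals(np.column_stack([X, lagged]), resid)
    r2 = 1.0 - (aux @ aux) / (resid @ resid)
    return float(len(resid) * r2)

# Assumed artificial DGP: static relation with AR(1) errors.
rng = np.random.default_rng(1)
T = 200
x = rng.normal(size=T)
e = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.8 * u[t - 1] + e[t]
y = 1.0 + 0.5 * x + u

# Static model: ignores the dynamics, so the residuals are autocorrelated.
X_static = np.column_stack([np.ones(T), x])
res_static = ols_residuals(X_static, y)

# Dynamic model: adding y(t-1) and x(t-1) captures the AR(1) error.
X_dynamic = np.column_stack([np.ones(T - 1), x[1:], y[:-1], x[:-1]])
res_dynamic = ols_residuals(X_dynamic, y[1:])

dw_static, dw_dynamic = durbin_watson(res_static), durbin_watson(res_dynamic)
lm_static = lm_autocorrelation(X_static, res_static)
lm_dynamic = lm_autocorrelation(X_dynamic, res_dynamic)
print(f"static : DW = {dw_static:.2f}, LM = {lm_static:.1f}")
print(f"dynamic: DW = {dw_dynamic:.2f}, LM = {lm_dynamic:.1f}")
```

The LM statistic can be compared with a $\chi^2$ critical value with degrees of freedom equal to the number of lags. The static model should fail this check decisively, whereas the more general dynamic model should not, which illustrates why the GUM should be general enough to be congruent before any simplification is attempted.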

Explaining the basics of congruence
Undergraduates will almost surely have been taught some Euclidean geometry so know about congruent triangles, namely ones which match perfectly, perhaps after rotation. However, they may never have thought that one of the triangles could be the cut-off top of a triangular pyramid, so the match is only two-dimensional, and the third dimension is not "explained". That is precisely why the name congruence was originally introduced, as a model may match the DGP only where tested, and many aspects may not be matched. Thus, congruence is not "truth", though DGPs must be congruent, so non-congruent models cannot be DGPs. However, as discussed in Section 2.5, a sequence of congruent models is feasible in a progressive research strategy, where each explains the results of all earlier models.

General-to-specific
The possibility that higher order autoregressive processes might be important to adequately capture the dynamics in time series models led to analyses of the relative merits of sequential testing from simple to general specifications as opposed to simplifying general models. Parsimony has long been thought to be a highly desirable property for a model to have: why include unnecessary features? However, it is also important to include necessary features in order to find a reliable and well-specified model. The tension between the costs of profligacy, including unnecessary variables, and excessive parsimony, omitting necessary variables, led to the development of a number of alternative modelling strategies. Given the computational difficulties of modelling complicated processes in the 1970s, it was tempting to start with simple formulations, possibly embodying a specific economic theory, then assess the need to generalize them. These expanding, or specific-to-general, model selection methods require a criterion for the termination of the search, and this is often based on a measure of penalized goodness of fit or marginal significance. For example, the next most significant omitted regressor is added to the model with the expansion stopping when no further significant variables can be found. This simple search strategy can be extended, as in stepwise regression, by also removing insignificant regressors from the model. While stepwise regression can work well in some circumstances, such as independent, white-noise regressors, in others involving complex interdependencies it can fail badly. Other expanding search methods have been developed more recently, e.g. RETINA (Perez-Amaral, Gallo, & White, 2005) and Lasso (Efron, Hastie, Johnstone, & Tibshirani, 2004), with a large literature on shrinkage-based methods, but there is also a substantial literature illustrating the drawbacks of such approaches (see e.g. Anderson, 1962; Campos, Ericsson, & Hendry, 2005; Hendry, 1995a).
The benefits of working from general models and testing for the acceptability of simplification were established by Anderson (1962) in the context of ordered sequences for the determination of the order of trend polynomials, and in Anderson (1971) for autoregressive processes. Mizon (1977) extended this analysis for some non-ordered sequences of hypotheses, which is the usual case in econometrics, and pointed out the need for a structured search. A contracting search strategy begins from a general model with insignificant variables being deleted until a termination criterion is reached, e.g. as in running stepwise regression backwards by including all variables initially then eliminating insignificant terms one by one, although there are dangers in only exploring one search path. Hendry and Doornik (1994) list the advantages of a simplification strategy when modelling linear dynamic systems, Mizon (1995a) illustrated the superiority of a general-to-specific strategy in locating the LDGP over a specific-to-general strategy in a simulation study using artificially generated data, and similarly Mizon (1995b) illustrated this point in a study of quarterly aggregate UK wages, prices and unemployment data over the period 1965(1) to 1993(1). However, these examples are for small numbers of variables where there is a limited set of possible search paths, so can be implemented manually. But to capture the complex features of the typical LDGP of today's macro-econometrics, it is necessary to model in high dimensions, and even expert modellers are not capable of handling all the resulting possible search paths. Fortunately, advances in computing power and software development mean that model complexity is no longer a limitation on the choice of modelling strategy, which instead can be based on the desired properties of the resulting selected model.
Indeed, as anticipated by Hendry and Mizon (2000) computer-automated search algorithms are now available that efficiently achieve results beyond human capabilities.
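The simplest contracting search mentioned above, backward elimination along a single path, can be sketched as follows. This is a teaching illustration only and is not the multi-path algorithm of PcGets or Autometrics: it explores one deletion path, applies no diagnostic tests during the search, and uses an assumed critical value of 2.6 (roughly a 1% two-sided level). The function names and the artificial data are hypothetical.

```python
import numpy as np

def t_ratios(X, y):
    """OLS t-ratios for every column of X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef / se

def backward_eliminate(X, y, names, critical=2.6):
    """Delete the least significant regressor until all |t| >= critical."""
    keep = list(range(X.shape[1]))
    while keep:
        t_abs = np.abs(t_ratios(X[:, keep], y))
        weakest = int(np.argmin(t_abs))
        if t_abs[weakest] >= critical:
            break  # every remaining variable is significant
        del keep[weakest]
    return [names[i] for i in keep]

# Assumed artificial data: 10 candidate regressors, only x0 and x1 matter.
rng = np.random.default_rng(2)
T, N = 200, 10
X = rng.normal(size=(T, N))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=T)
names = [f"x{i}" for i in range(N)]

selected = backward_eliminate(X, y, names)
print("retained:", selected)
```

With orthogonal regressors such as these, a single path usually suffices. The dangers of relying on one path arise when the candidate variables are highly intercorrelated, which is the case that motivates multi-path searches.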
Despite a considerable literature arguing against the usefulness of empirical model discovery via general-to-specific searches (see inter alia Faust & Whiteman, 1997; Leamer, 1978, 1983; Lovell, 1983), an impressive record has been built up for this approach. Following the stimulus given by Hoover and Perez (1999), the general-to-specific (Gets) algorithm PcGets implemented within PcGive by Hendry and Krolzig (1999, 2001) and Krolzig and Hendry (2001), quickly established the credentials of the approach via Monte Carlo studies in which PcGets recovered the DGP with an accuracy close to that to be expected if the DGP specification were known, but tests for significant coefficients were undertaken. Also, when PcGets was applied to the data-sets analysed in empirical studies by Davidson, Hendry, Srba, and Yeo (1978) and Hendry and Ericsson (1991), it selected in seconds models that were at least as good as those developed over several years by those authors (see Hendry, 2000). The general-to-specific strategy in Autometrics (see Doornik, 2009) employs a multi-path search algorithm which can combine expanding and contracting searches, so can handle more variables than observations, a feature that is particularly valuable for analysing non-stationary processes. Another key feature of Gets is that it is based on selecting variables rather than whole models and so is more flexible and open to discovery. We discuss this and the subsequent developments of the Autometrics algorithm in Section 2.8.

Explaining the basics of Gets
Although a critical decision in all empirical modelling is where to start from (general, simple or in between), all modelling strategies, including automated ones, require the formulation of a general information set at the outset. Consequently, this general information set provides a well-defined initial model from which a contracting modelling strategy can proceed to test simplifications when the number of candidate variables, $N$, is much smaller than the number of observations, $T$ ($N \ll T$). For specific-to-general modelling strategies on the other hand, the general information set provides a well-defined list of variables from which to select expansions. Important considerations in the choice of the general information set include: the subject matter of the empirical modelling; institutional knowledge; past experience; data quality and availability; and the results of previous investigations. However, choosing a GUM to include as much as possible of the available relevant information makes it more likely that it will be congruent, which is an essential requirement for viable inferences during search procedures. We will consider the case where $N \geq T$ below.

Exogeneity
Exogeneity, in the sense of a variable being determined "outside the model under analysis", has a long history in economics and econometrics. Early textbooks of econometrics concentrated on the estimation and testing of linear regression models in which the regressors were assumed exogenous by being fixed in repeated samples (see e.g. Goldberger, 1964, p. 162), an assumption relevant in experimental sciences but not in economics where data are largely observational. Although a convenient simplification, "fixity" is rarely appropriate in economics and has counter-examples (see e.g. Hendry, 1995a, p. 161), so a more relevant concept of exogeneity was needed. This was important for estimating dynamic models with lagged values of the regressand and regressors, and particularly in simultaneous equation models (SEMs), seeking to analyse the joint determination of several variables where the exogeneity of conditioning variables was questionable. Moreover, an appropriate form of exogeneity is critical for reliable forecasting and policy analyses from conditional models.

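Re-consider the LDGP $D_X(\mathbf{x}_1, \ldots, \mathbf{x}_T \mid \boldsymbol{\theta}, \mathbf{X}_0)$ with $\mathbf{x}_t' = (\mathbf{y}_t', \mathbf{z}_t')$, sequentially factorized and then partitioned into a conditional density for $\mathbf{y}_t$ and a marginal density for $\mathbf{z}_t$:

$$D_X(\mathbf{x}_t \mid \mathbf{X}_{t-1}, \boldsymbol{\theta}) = D_{Y|Z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}, \boldsymbol{\phi}_1)\, D_Z(\mathbf{z}_t \mid \mathbf{X}_{t-1}, \boldsymbol{\phi}_2).$$

If the parameters of interest to the modeller are $\boldsymbol{\psi}$, and these can be recovered solely from $\boldsymbol{\phi}_1$ (i.e. $\boldsymbol{\psi} = \mathbf{f}(\boldsymbol{\phi}_1)$) when the parameters $\boldsymbol{\phi}_1$ and $\boldsymbol{\phi}_2$ are variation free (so there are no parametric restrictions linking $\boldsymbol{\phi}_1$ and $\boldsymbol{\phi}_2$), then $\mathbf{z}_t$ is weakly exogenous for $\boldsymbol{\psi}$ (see Engle, Hendry, & Richard, 1983; Hendry, 1995a). Weak exogeneity is a sufficient condition for inference on $\boldsymbol{\psi}$ to be without loss of information using the conditional distribution $D_{Y|Z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}, \boldsymbol{\phi}_1)$ alone.
However, weak exogeneity in isolation is not sufficient to sustain predictions of $\mathbf{y}_t$ conditional on $\mathbf{z}_t$ more than one period ahead because $\mathbf{z}_t$ may vary with $\mathbf{Y}_{t-1}$. In order for reliable predictions of $\mathbf{y}_t$ to be made from the conditional distribution $D_{Y|Z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}, \boldsymbol{\phi}_1)$, $\mathbf{z}_t$ must both be weakly exogenous for $\boldsymbol{\psi}$ and not vary with $\mathbf{Y}_{t-1}$, in which case $\mathbf{z}_t$ is strongly exogenous for $\boldsymbol{\psi}$. The latter condition entails that $D_Z(\mathbf{z}_t \mid \mathbf{X}_{t-1}, \boldsymbol{\phi}_2) = D_Z(\mathbf{z}_t \mid \mathbf{Z}_{t-1}, \boldsymbol{\phi}_2)$ and is the condition for $\mathbf{y}$ not to Granger cause $\mathbf{z}$ (Granger, 1969).
However, the absence of Granger causality is neither necessary nor sufficient for weak exogeneity, so cannot per se validate conditional inference.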
Conditional econometric models are also important for assessing and predicting the likely effects of policy changes in interest rates, tax rates, welfare benefits, etc. The fact that economic processes intermittently undergo location shifts and intrinsically exhibit stochastic trends and other wide-sense non-stationarities (see Hendry & Mizon, 2014) means that parameter constancy and invariance (i.e. not changing when there is a change in policy regime) cannot be guaranteed, so must be tested. There was a time when the feasibility of such tests was doubted, for example, when Lucas (1976), following the concerns expressed earlier by Frisch (1938), Haavelmo (1944) and Marschak (1953), asserted that "any change in policy will systematically alter the structure of econometric models" and so render policy analysis infeasible. This claim, known as the Lucas critique, appeared to many to be a fatal blow to econometric policy analysis. Fortunately, the concept of super exogeneity defined in Engle et al. (1983) provides the condition for valid econometric policy analysis, and importantly it is testable as shown initially by Favero and Hendry (1992). The conditioning variables $\mathbf{z}_t$ are super exogenous for the parameters of interest $\boldsymbol{\psi}$ if $\mathbf{z}_t$ is weakly exogenous for $\boldsymbol{\psi}$ and the parameters $\boldsymbol{\phi}_1$ of the conditional model are invariant to changes in the parameters $\boldsymbol{\phi}_2$ of the marginal process. The requirement that $\boldsymbol{\phi}_1$ be invariant to changes in $\boldsymbol{\phi}_2$ entails that policy regime shifts in the marginal process for $\mathbf{z}_t$ do not alter the parameters $\boldsymbol{\phi}_1$ of $D_{Y|Z}(\mathbf{y}_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}, \boldsymbol{\phi}_1)$, which are critical in assessing the effect on $\mathbf{y}_t$ of those policy changes in $\mathbf{z}_t$. Note that super exogeneity does not require strong exogeneity, but only weak exogeneity and invariance. This is vital, as the behaviour of past values of $\mathbf{y}$ is usually an important input into changes in the policy variables within $\mathbf{z}$, so $\mathbf{z}$ cannot be strongly exogenous. The testing of super exogeneity, and in particular, invariance, requires a class of changes in $\boldsymbol{\phi}_2$ to be considered, and parameter constancy tests applied to the marginal process for $\mathbf{z}_t$ are described in Engle and Hendry (1993).
Hendry and Santos (2010) introduced automatic testing of super exogeneity using impulse-indicator saturation (IIS) to detect location shifts in the processes for the conditioning variables $\mathbf{z}_t$, then testing the relevance of the significant indicators in the conditional model. This test of super exogeneity can be computed without additional intervention from the investigator, and without knowing ex ante the timings, forms or magnitudes of the breaks in the marginal process for $\mathbf{z}_t$.
Deterministic terms such as dummy variables for seasons, outliers and structural breaks have been routinely used in econometric modelling for many years. However, when investigating why many researchers had experienced difficulties in modelling US demand for food in the 1930s and 1940s, Hendry (1999) found that introducing zero-one indicators for 1931-1953, simplified to dummy variables for 1931-1936, 1938, and 1941-1946, led to a model that appeared to be constant over the whole period. This was tested using the Chow (1960) test of parameter constancy over the period 1953-1989, which Salkever (1976) had shown was equivalent to testing the significance of impulse indicators over that period. Hence, zero-one impulse indicators had been included for every observation from 1931 onwards, but in two large blocks for the periods 1931-1953 and 1953 on, so as many impulse indicators as observations had been used, plus all the regressors. This realization meant that models with more variables, $N$, than observations, $T$, could be investigated in a Gets framework, provided the variables are introduced in blocks. This discovery led to the use of IIS for detecting and removing the effects of structural shifts, outliers and data discrepancies, thus helping to ensure near normality in residuals and sustain valid inferences, and make bias correction after selection viable. IIS creates a zero-one indicator for each observation in the sample, which are then entered in blocks, noting that such indicators are mutually orthogonal. In the simplest case in which just two blocks of $T/2$ are used, the first step is to add half the indicators and use an automatic search procedure (e.g. Autometrics) to select significant variables, then record these results. Next, drop the first set of indicators, and search again for significant indicators in the second set and record the results. Finally, perform the variable search again with the significant indicators combined.
Setting the retention rate of irrelevant variables in the selected model (the gauge) to $\alpha$ means that overall $\alpha T$ indicators will be retained on average by chance. Thus, setting $\alpha \leq r/T$ (with $r$ small, such as unity) ensures on average a false null retention of $r$ indicators, which is a small efficiency loss when testing for any number of breaks at $T$ points. More details of IIS are given by Hendry, Johansen, and Santos (2008), who proposed IIS for detecting and removing outliers when they are present. Johansen and Nielsen (2009) provided a comprehensive theoretical justification of IIS, in particular extending the analysis to dynamic models. When testing exogeneity, IIS can have low power if changes in the conditional process are infrequent, but this problem can be circumvented using step indicator saturation (SIS) instead (see Castle, Doornik, Hendry, & Pretis, 2015). Ericsson and Irons (1994) reprint many of the key papers on exogeneity, Ericsson and Irons (1996) provide an overview of the literature on super exogeneity and its relationship to the Lucas critique, and Hendry and Doornik (2014) give more details of the testing of super exogeneity using IIS and SIS.
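The split-half procedure described above can be sketched in a few lines for classroom use. This is only a minimal illustration under stated assumptions: selection is a simple cut on t-statistics at a fixed critical value (`crit`), not the multi-path search that Autometrics performs, and the function names, simulated data and outlier dates are all hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and t-statistics for y = X b + e."""
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return b, b / np.sqrt(sigma2 * np.diag(xtx_inv))

def split_half_iis(y, X, crit=2.58):
    """Return the observation indices whose impulse indicators are retained."""
    T, k = X.shape
    indicators = np.eye(T)                     # one zero-one indicator per observation
    blocks = [np.arange(T // 2), np.arange(T // 2, T)]
    candidates = []
    for block in blocks:                       # steps 1 and 2: add one block at a time
        _, t = ols(np.column_stack([X, indicators[:, block]]), y)
        candidates.extend(block[np.abs(t[k:]) > crit])
    candidates = np.array(candidates, dtype=int)
    if candidates.size == 0:
        return candidates
    # step 3: re-test the union of the indicators that survived in each block
    _, t = ols(np.column_stack([X, indicators[:, candidates]]), y)
    return candidates[np.abs(t[k:]) > crit]

# Simulated data (assumed for illustration): a static regression with two outliers.
rng = np.random.default_rng(0)
T = 100
x = rng.normal(size=T)
y = 1.0 + 0.5 * x + rng.normal(size=T)
y[[20, 70]] += 10.0                            # outliers of 10 error standard deviations
X = np.column_stack([np.ones(T), x])
retained = split_half_iis(y, X)
print("retained indicator dates:", retained)
```

With a critical value of about 2.58 (roughly a 1% gauge) and $T = 100$, about one indicator would be retained by chance on average, which is the small efficiency loss discussed in the text.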

Explaining the basics of exogeneity
For undergraduates, how can a teacher get across the main ideas of exogeneity without too much distributional mathematics? Again, after teaching basic regression, use the simplest linear model $y_t = \beta x_t + \epsilon_t$ with $\epsilon_t \sim \mathsf{IN}[0, \sigma_\epsilon^2]$, which is independent of all values of $\{x_t\}$, so the Gauss-Markov theorem apparently applies with the least-squares estimator $\hat{\beta}$ of $\beta$ being best linear unbiased. Now, let $x_t \sim \mathsf{IN}[\beta, \sigma_x^2]$ where $\sigma_x^2$ is tiny, but large enough to allow $\hat{\beta}$ to be calculated. Then, $\tilde{\beta} = T^{-1} \sum_{t=1}^{T} x_t$, which is linear in $x_t$, can be a far better unbiased estimator of $\beta$.
This "contradiction" with Gauss-Markov arises because $x_t$ is not weakly exogenous in the conditional model for $y_t$, given the cross-link of the parameter $\beta$ between the conditional and marginal distributions. Thus, independence between errors and regressors is insufficient, and even the venerable Gauss-Markov theorem needs to be supplemented by a weak exogeneity condition.
This example also has implications for super exogeneity. Consider a policy context where an agency controls the $x_t$ process (e.g. interest rates) and can change the parameter $\beta$ by setting the level of the interest rate. Doing so when $\beta$ is also the parameter of the conditional model will always alter that model when there are cross-links. If its parameters change every time the policy changes, then clearly a model is not useful for policy (this is essentially an extreme "Lucas critique"), so failures of super exogeneity have important implications.
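The apparent contradiction with Gauss-Markov is easy to demonstrate by simulation. The sketch below assumes the conditional model is $y_t = \beta x_t + \epsilon_t$ and that the marginal process for $x_t$ has mean $\beta$ with a tiny variance; the numerical values of `beta`, `sigma_eps`, `sigma_x`, `T` and `reps` are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, sigma_eps, sigma_x = 2.0, 1.0, 0.01      # illustrative values (assumed)
T, reps = 50, 2000

ols_est = np.empty(reps)
mean_est = np.empty(reps)
for r in range(reps):
    x = rng.normal(beta, sigma_x, size=T)                  # marginal process: E[x_t] = beta
    y = beta * x + rng.normal(0.0, sigma_eps, size=T)      # conditional model
    ols_est[r] = (x @ y) / (x @ x)                         # least squares, no intercept
    mean_est[r] = x.mean()                                 # uses only the marginal process

mse_ols = np.mean((ols_est - beta) ** 2)
mse_mean = np.mean((mean_est - beta) ** 2)
print(f"least squares: mean {ols_est.mean():.4f}, MSE {mse_ols:.6f}")
print(f"sample mean  : mean {mean_est.mean():.4f}, MSE {mse_mean:.8f}")
```

Both estimators are centred on $\beta$, but the sample mean of $x_t$ has a far smaller mean-square error because it exploits the cross-link that the conditional analysis ignores.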

Encompassing
Encompassing is a principle that aims to reconcile the plethora of empirical models that often can be found to "explain" any given phenomenon. The infamous ability of economists as a profession to develop multiple theories for the explanation of a single phenomenon provides a rich source of potential interpretations of empirical evidence. Equally, in other areas of research such as epidemiology, experts cite polar opposite evidence, as regularly occurs in the TV program "Trust me-I'm a Doctor". Generally, there seems to be a lack of an encompassing approach in other observational disciplines, although "meta analyses" are an approximation. Some interpretations might be complementary, and so could be amalgamated in a single theory, but the majority are usually alternatives, so it is necessary to discriminate between them. Indeed, when there are several distinct competing models, all but one must be either incomplete or incorrect, and all may be false. Adopting the encompassing principle in such situations enables testing whether any model can account for the results of the alternative models and so reduce the set of admissible models, and in addition reveal the directions in which a model under-performs relative to its rivals. This lays the foundations for a progressive modelling strategy in which theory and evidence mutually interact to learn about the LDGP, noting that empirical modelling is not a once-for-all event, but a process in which models evolve to supersede earlier ones.
The early development of the encompassing approach can be found in Davidson et al. (1978) and Davidson and Hendry (1981), which informally sought to find a model that was capable of accounting for the behaviour of alternative competing models in the context of an aggregate UK consumption function. Mizon (1984) provided a formal discussion of the encompassing principle, which was further developed in Mizon and Richard (1986). Adopting their statistical framework, consider two distinct empirical models $\mathsf{M}_1$ and $\mathsf{M}_2$ with parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\delta}$ respectively, each purporting to provide an explanation of the process that generates a variable $y_t$ conditional on $\mathbf{z}_t$ and lagged values of both, namely the LDGP $\mathsf{D}_{y|z}(y_t \mid \mathbf{z}_t, \mathbf{X}_{t-1}, \boldsymbol{\phi}_1)$. Then, $\mathsf{M}_1$ encompasses $\mathsf{M}_2$ (denoted $\mathsf{M}_1 \,\mathcal{E}\, \mathsf{M}_2$) if and only if $\hat{\boldsymbol{\delta}} = \boldsymbol{\delta}(\hat{\boldsymbol{\alpha}})$, where $\hat{\boldsymbol{\delta}}$ is the estimator of $\boldsymbol{\delta}$ under $\mathsf{M}_2$ and $\boldsymbol{\delta}(\hat{\boldsymbol{\alpha}})$ is the estimator of the pseudo-true value of $\boldsymbol{\delta}$ under $\mathsf{M}_1$. The test of encompassing is whether $\mathsf{M}_2$ captures features of the LDGP beyond those already embodied in $\mathsf{M}_1$. If $\mathsf{M}_2$ does not offer any new insights into the LDGP beyond those of $\mathsf{M}_1$, then $\mathsf{M}_1 \,\mathcal{E}\, \mathsf{M}_2$. The encompassing principle implies that if $\mathsf{M}_1 \,\mathcal{E}\, \mathsf{M}_2$, then $\mathsf{M}_1$ ought to be capable of explaining the predictions of $\mathsf{M}_2$ about all features of the LDGP, so $\mathsf{M}_1$ can accurately characterize these, making $\mathsf{M}_2$ redundant in the presence of $\mathsf{M}_1$. Equally, when $\mathsf{M}_1 \,\mathcal{E}\, \mathsf{M}_2$, then $\mathsf{M}_1$ ought to be able to indicate some of the mis-specifications of $\mathsf{M}_2$ such as omitted variables, residual serial correlation and heteroskedasticity, invalid conditioning, or predictive failure.
An important distinction can be drawn between situations in which $\mathsf{M}_1$ is nested within $\mathsf{M}_2$, when encompassing tests the validity of the reductions leading from $\mathsf{M}_2$ to $\mathsf{M}_1$, and those in which $\mathsf{M}_1$ and $\mathsf{M}_2$ are non-nested so that neither model is a restricted version of the other. Cox (1961, 1962) are the seminal papers on testing non-nested hypotheses, with the many related encompassing developments since then reviewed in Mizon (2012). When $\mathsf{M}_1$ is nested within $\mathsf{M}_2$ ($\mathsf{M}_1 \subseteq \mathsf{M}_2$) and $\mathsf{M}_1 \,\mathcal{E}\, \mathsf{M}_2$, then the smaller model explains the results of the larger nesting model and so $\mathsf{M}_1$ is a parsimonious representation of the LDGP relative to $\mathsf{M}_2$. The principle of parsimony has long been an important ingredient of model selection procedures that seek to find the simplest undominated model, but with many different penalty functions adopted for lack of parsimony (e.g. the AIC or Schwarz criteria, see Judge, Griffiths, Hill, Lütkepohl, & Lee, 1985). In the context of the encompassing principle, $\mathsf{M}_1$ parsimoniously encompasses $\mathsf{M}_2$ (denoted $\mathsf{M}_1 \,\mathcal{E}_p\, \mathsf{M}_2$) when $\mathsf{M}_1 \subseteq \mathsf{M}_2$ and $\mathsf{M}_1 \,\mathcal{E}\, \mathsf{M}_2$, making it a suitable strategy for checking reductions from a GUM within the Gets procedure. $\mathcal{E}_p$ also satisfies the three conditions for a partial ordering (see Hendry, 1995a, chapter 14) as it is (i) reflexive since $\mathsf{M}_1 \,\mathcal{E}_p\, \mathsf{M}_1$; (ii) asymmetric since $\mathsf{M}_1 \,\mathcal{E}_p\, \mathsf{M}_2$ implies that $\mathsf{M}_2$ does not parsimoniously encompass $\mathsf{M}_1$ when $\mathsf{M}_1$ and $\mathsf{M}_2$ are distinct; and (iii) transitive since $\mathsf{M}_1 \,\mathcal{E}_p\, \mathsf{M}_2$ and $\mathsf{M}_2 \,\mathcal{E}_p\, \mathsf{M}_3$ imply that $\mathsf{M}_1 \,\mathcal{E}_p\, \mathsf{M}_3$. Thus parsimonious encompassing is a vital principle to incorporate in a modelling strategy such as Gets as it will enable the gradual accumulation of knowledge, and plays a key role in Autometrics (see Doornik, 2008).
The outcome of empirical analysis may suggest that a more general formulation is needed to obtain a better approximation to an LDGP, or that a larger set of variables is required to define a different LDGP that is more constant and interpretable. Importantly, though, White (1990) shows that sufficiently rigorous testing followed by suitable model re-specification ensures the selection of an acceptable data representation of a constant LDGP as the sample size tends to infinity, provided that the significance level of the complete testing process is controlled and in particular declines as the sample size increases. Although any approach might eventually converge on a constant LDGP as the sample size increases, the Gets strategy can do so relatively quickly. Commencing from a sufficiently general GUM that nests, or closely approximates, the LDGP has the advantage of reducing the chance that an extension of the data-set will be required later. In addition, always requiring models to be congruent ensures that seeking parsimonious encompassing of successive models sustains a progressive modelling strategy. For further details see inter alia Mizon (2008) and the papers in Hendry, Marcellino, and Mizon (2008), particularly Bontemps and Mizon (2008).

Explaining the basics of encompassing
Even for undergraduates, the concept of encompassing should be intuitive: if one model cannot explain the results of another model on the same data, it must be incomplete or incorrect; and if it can, the second model is redundant. The illustration in Section 2.1.1 can be re-interpreted as an exercise in encompassing, and reveals the dangers of under-specification and how it can lead to conflicting claims. Slightly more formally, the DGP is:
$$y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \epsilon_t \qquad (1)$$
so $x_{3,t}$ is irrelevant, but estimates of the other three parameters are highly significant. However, an investigator mistakenly fits:
$$\mathsf{M}_1: \quad y_t = \gamma_0 + \gamma_1 x_{1,t} + \gamma_3 x_{3,t} + \nu_t \qquad (2)$$
and finds that $\hat{\gamma}_0$, $\hat{\gamma}_1$ and $\hat{\gamma}_3$ are significant. Another investigator chooses to fit:
$$\mathsf{M}_2: \quad y_t = \delta_0 + \delta_2 x_{2,t} + \delta_3 x_{3,t} + \eta_t \qquad (3)$$
and finds that $\hat{\delta}_0$, $\hat{\delta}_2$ and $\hat{\delta}_3$ are significant. Since (1) includes both $x_{1,t}$ and $x_{2,t}$, neither $\mathsf{M}_1$ encompasses $\mathsf{M}_2$, nor $\mathsf{M}_2$ encompasses $\mathsf{M}_1$, each being inadequate. To deal with the inadequacy of both models to encompass the other, estimation of:
$$y_t = \lambda_0 + \lambda_1 x_{1,t} + \lambda_2 x_{2,t} + \lambda_3 x_{3,t} + u_t \qquad (4)$$
reveals that $x_{3,t}$ is not required, but appeared to be relevant in $\mathsf{M}_1$ and $\mathsf{M}_2$ because of its correlation with $x_{1,t}$ and $x_{2,t}$. Thus, the significance of $x_{1,t}$ in (4) explains the failure of $\mathsf{M}_2$ to encompass $\mathsf{M}_1$, and conversely the significance of $x_{2,t}$ in (4) explains why $\mathsf{M}_1$ does not encompass $\mathsf{M}_2$.
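The two investigators' conflicting findings can be reproduced with simulated data. The following is a sketch under assumed parameter values: the DGP depends on $x_{1,t}$ and $x_{2,t}$ only, while $x_{3,t}$ is irrelevant but correlated with both; the coefficient values, sample size and variable names are illustrative, not those of Section 2.1.1.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and t-statistics for y = X b + e."""
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return b, b / np.sqrt(sigma2 * np.diag(xtx_inv))

rng = np.random.default_rng(2)
T = 1000
x1, x2 = rng.normal(size=T), rng.normal(size=T)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(size=T)   # irrelevant, but correlated with x1 and x2
y = 1.0 + x1 + x2 + rng.normal(size=T)          # DGP excludes x3
c = np.ones(T)

b_m1, t_m1 = ols(np.column_stack([c, x1, x3]), y)       # first investigator's model
b_m2, t_m2 = ols(np.column_stack([c, x2, x3]), y)       # second investigator's model
b_n, t_n = ols(np.column_stack([c, x1, x2, x3]), y)     # model nesting both

print("M1 t-value on x3:", round(t_m1[2], 2))
print("M2 t-value on x3:", round(t_m2[2], 2))
print("nesting model t-values (x1, x2, x3):", np.round(t_n[1:], 2))
```

In each under-specified model $x_{3,t}$ acts as a proxy for the omitted variable, so it appears strongly significant; in the nesting model its coefficient is close to zero while both $x_{1,t}$ and $x_{2,t}$ matter, which is why neither rival encompasses the other.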
To enliven the coverage, a teacher could refer to the history of scientific discovery where encompassing has implicitly been prevalent: examples include Newton's theory of universal gravitation explaining Descartes' vortices, and Einstein's theory of general relativity explaining Newton; or Priestley discovering what he called "dephlogisticated air", explained by Lavoisier as oxygen, thereby replacing the theory of phlogiston with a modern theory of combustion.

Non-stationarity
The world we inhabit, whether viewed from an economic, political, meteorological or cultural perspective, can be beautiful and full of interesting objects and events, but it provides no shortage of challenges and surprises. The extreme weather events throughout the world, the financial crash in 2007-08 and the subsequent economic recession, and the political unrest in Eastern Europe and the Middle East leading to mass migration of homeless and impoverished people, are but recent examples. Thus, change is forever present: the world is not static. However, many economic theories have at their core stable relationships or equilibria between variables, and the concept of stationarity has long played a key role in statistics. A weakly stationary process is one where its mean and variance are finite and constant over time, such as $y_t = \mu + \epsilon_t$ with $\epsilon_t \sim \mathsf{IID}[0, \sigma_\epsilon^2]$, for which $\mathsf{E}[y_t] = \mu$ and $\mathsf{V}[y_t] = \sigma_\epsilon^2$. A feature of a stationary process is that it is ahistorical, in that a sample drawn from one period of time will have the same characteristics as another drawn from a different period, so that knowing the historical dates reveals no additional information. Clearly though, many variables evolve over time including increases in world population, average life expectancy in western countries, and UK wages and prices. Indeed, most economic variables are non-stationary, in that their distributions shift, and we will consider two of the most important sources of such changes, namely stochastic trends and location shifts. Sometimes, it is argued that this non-stationary behaviour can be represented by a trend-stationary process like $y_t = \mu + \beta t + \epsilon_t$, but this ignores the fact that population could not grow without food, with similar prerequisites for growth in other variables. In any case, it is unsatisfactory to attribute non-stationary behaviour to something outside the model.
In a stationary process, the influence of past shocks $\epsilon_{t-s}$ for $s > 0$ must die out, otherwise the variance $\mathsf{V}[y_t]$ could not be constant. One form of stationary process in which past shocks initially affect $y_t$, but their influence declines through time, is an autoregressive process such as $y_t = \mu + \rho y_{t-1} + \epsilon_t$, which is stationary when $|\rho| < 1$. This has the moving-average representation $y_t = \mu/(1 - \rho) + \sum_{i=0}^{\infty} \rho^i \epsilon_{t-i}$. A process in which past shocks do not accumulate is said to be integrated of order zero, denoted I(0). An important and more general form of autoregressive process is the vector autoregression of order $m$ (VAR($m$)), which takes the form $\mathbf{x}_t = \sum_{i=1}^{m} \boldsymbol{\Pi}_i \mathbf{x}_{t-i} + \boldsymbol{\epsilon}_t$, where $\mathbf{x}_t$ and $\boldsymbol{\epsilon}_t$ are $k \times 1$ vectors and the $\boldsymbol{\Pi}_i$ are $k \times k$ matrices. Considering only the case in which $m = 1$, since higher order processes can always be reduced to first order using the companion form (see e.g. Hendry, 1995a, p. 724), the VAR is stationary when the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$ of $|\lambda \mathbf{I}_k - \boldsymbol{\Pi}_1| = 0$ lie inside the unit circle (see e.g. Johansen, 1995, p. 14).
When this condition is satisfied, the VAR consists of k I(0) processes.
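The eigenvalue condition, and the reduction of a higher-order VAR to first order, can be checked numerically. The sketch below is an illustration only: the helper names `companion` and `is_stationary` and the example coefficient matrices are hypothetical, and the VAR is assumed to have no deterministic terms.

```python
import numpy as np

def companion(pis):
    """Stack the k x k lag matrices of a VAR(m) into its (km x km) companion matrix."""
    k, m = pis[0].shape[0], len(pis)
    top = np.hstack(pis)
    if m == 1:
        return top
    bottom = np.hstack([np.eye(k * (m - 1)), np.zeros((k * (m - 1), k))])
    return np.vstack([top, bottom])

def is_stationary(pis):
    """True when all eigenvalues of the companion matrix lie inside the unit circle."""
    eigenvalues = np.linalg.eigvals(companion(pis))
    return bool(np.max(np.abs(eigenvalues)) < 1.0)

var1 = [np.array([[0.5, 0.1], [0.0, 0.3]])]      # stable VAR(1)
walks = [np.eye(2)]                               # two independent random walks
var2 = [0.5 * np.eye(2), 0.2 * np.eye(2)]         # VAR(2), checked via its companion form

for name, pis in [("VAR(1)", var1), ("random walks", walks), ("VAR(2)", var2)]:
    print(name, "stationary:", is_stationary(pis))
```

The random-walk case has eigenvalues exactly on the unit circle, so it fails the condition and corresponds to the I(1) processes discussed next.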
However, stationarity and I(0) processes are the exception; non-stationarity is the norm. What we observe is that, as well as evolving, time series processes are greatly influenced by specific events, including key discoveries like vaccination and antibiotics; inventions like the steam engine and dynamo; major wars, pandemics and massive volcanic eruptions; financial innovations, etc., all of which can cause persistent shifts in the means and variances of the data, thereby violating stationarity. Processes in which the effects of shocks persist are therefore common, and are said to be integrated of order greater than zero. For example, $y_t = \mu + y_{t-1} + \epsilon_t$ can be written after successive substitution for lagged $y$s as $y_t = y_0 + \mu t + \sum_{i=0}^{t-1} \epsilon_{t-i}$, revealing that the shocks $\epsilon_{t-i}$ accumulate. This is an example of the I(1) processes that are often observed in practice, an example of which in economics is the stock of a variable such as an inventory that cumulates its net inflow. Thus, unlike an I(0) process, which varies around a constant mean, an I(1) process has an increasing variance, usually called a stochastic trend, and may "drift" in a general direction over time to induce an actual trend when $\mu \neq 0$. Perhaps the best known example of an I(1) process is a random walk, first proposed by Bachelier (1900) to describe the behaviour of prices set in speculative markets. Another feature of an I(1) process is that since successive observations share a large number of past inputs, the correlation between them will be high and only decline slowly as their distance apart increases. Not only will the serial correlation coefficients $r_p = \mathsf{E}[(y_t - \mathsf{E}[y_t])(y_{t-p} - \mathsf{E}[y_{t-p}])] / \sqrt{\mathsf{V}[y_t]\,\mathsf{V}[y_{t-p}]}$ remain high, only declining very slowly with $p$, but also there can be a high correlation between different I(1) variables that should be unrelated.
This is known as the "nonsense correlation" problem first identified by Yule (1926), and illustrated by Hendry (1980) who created an example between the price level in the UK and cumulative annual rainfall. Granger and Newbold (1974) emphasized that a supposedly "significant relation" between variables, but where there was serial correlation in the residuals from that relation, was a symptom associated with nonsense regressions. Phillips (1986) provided a technical analysis of the sources and symptoms of nonsense regressions. Noting that differencing is the opposite of integration suggests that differencing an I(1) variable will render it I(0), and this is indeed the case, as transforming the I(1) process $y_t = \mu + y_{t-1} + \epsilon_t$ into the I(0) process $\Delta y_t = \mu + \epsilon_t$ illustrates. This idea underlies the approach in Box and Jenkins (1970/1976) which was very popular in the 1970s and early 1980s in economics as well as other disciplines.
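A nonsense regression is easily generated live in the classroom. The sketch below regresses one random walk on another, independent, random walk many times and records how often a conventional t-test "finds" a relation, then repeats the exercise in differences. The sample size, number of replications and the 1.96 critical value are illustrative assumptions.

```python
import numpy as np

def slope_t(x, y):
    """t-statistic on the slope in a regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - 2)
    return b[1] / np.sqrt(sigma2 * xtx_inv[1, 1])

rng = np.random.default_rng(3)
T, reps = 100, 1000
reject_levels = reject_diffs = 0
for _ in range(reps):
    y = np.cumsum(rng.normal(size=T))          # I(1): accumulated shocks
    z = np.cumsum(rng.normal(size=T))          # independent of y by construction
    reject_levels += abs(slope_t(z, y)) > 1.96
    reject_diffs += abs(slope_t(np.diff(z), np.diff(y))) > 1.96

rate_levels = reject_levels / reps
rate_diffs = reject_diffs / reps
print(f"rejection frequency in levels:      {rate_levels:.2f}")
print(f"rejection frequency in differences: {rate_diffs:.2f}")
```

In levels the nominal 5% test rejects the true null of no relation in the majority of samples, whereas after differencing the rejection frequency is close to its nominal size.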
Linear combinations of several I(1) processes are usually I(1) as well, which led to some researchers modelling variables in differences rather than levels. Were it the case that relationships between I(1) variables could only be developed in their differences, it would imply that there could be no stable economic equilibrium relationships between I(1) variables. However, stochastic trends can cancel between series to yield an I(0) outcome, and this is called cointegration (Engle & Granger, 1987). Consider the first-order autoregressive-distributed lag model $y_t = a_1 y_{t-1} + b_0 z_t + b_1 z_{t-1} + u_t$ when both $y_t$ and $z_t$ are I(1) variables with $|a_1| < 1$. Then, the re-parameterized model $\Delta y_t = b_0 \Delta z_t + \alpha (y_{t-1} - \beta z_{t-1}) + u_t$, where $\alpha = (a_1 - 1)$ and $\beta = (b_0 + b_1)/(1 - a_1)$, will consist entirely of I(0) variables if $(y_{t-1} - \beta z_{t-1})$ is I(0), and thus forms a cointegrating relationship. In economics, integrated-cointegrated data seem almost inevitable because of the Granger (1981) Representation Theorem which shows that cointegration between variables must occur if there are fewer decision variables (e.g. your income and bank account balance) than the number of decisions (e.g. hundreds of shopping items: see Hendry, 2004, for an explanation). Cointegrated relationships define a "long-run equilibrium trajectory" for the economy, departures from which induce "equilibrium correction" that moves the economy back towards that path. Prior to Granger (1981) and Engle and Granger (1987) defining and developing the concept of cointegration, Davidson et al. (1978) had been using what they called "error correction" models which had essentially the same characteristics as the cointegration "equilibrium correction" models.
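For students, the step from the autoregressive-distributed lag model to its equilibrium-correction form is worth writing out. The derivation below is a sketch that labels the adjustment coefficient $\alpha$ and the long-run coefficient $\beta$ (symbols assumed here for illustration); it only subtracts $y_{t-1}$ from both sides and regroups the terms in $z$.

```latex
\begin{align*}
y_t &= a_1 y_{t-1} + b_0 z_t + b_1 z_{t-1} + u_t \\
y_t - y_{t-1} &= (a_1 - 1)\, y_{t-1} + b_0 (z_t - z_{t-1}) + (b_0 + b_1)\, z_{t-1} + u_t \\
\Delta y_t &= b_0 \Delta z_t + (a_1 - 1)\left( y_{t-1} - \frac{b_0 + b_1}{1 - a_1}\, z_{t-1} \right) + u_t \\
           &= b_0 \Delta z_t + \alpha \left( y_{t-1} - \beta z_{t-1} \right) + u_t ,
\qquad \alpha = a_1 - 1, \quad \beta = \frac{b_0 + b_1}{1 - a_1} .
\end{align*}
```

Since $|a_1| < 1$ implies $\alpha < 0$, a positive departure of $y_{t-1}$ from $\beta z_{t-1}$ lowers $\Delta y_t$, which is the sense in which disequilibria are corrected.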
A model that has played an important role in the modelling of econometric time series since the publication of Engle and Granger (1987), and especially the subsequent more detailed statistical analysis of cointegrated systems including a test of the order of cointegration in Johansen (1988, 1995), is the vector equilibrium correction model (VEqCM) given by $\Delta \mathbf{x}_t = \boldsymbol{\gamma} + \sum_{i=1}^{m-1} \boldsymbol{\Gamma}_i \Delta \mathbf{x}_{t-i} + \boldsymbol{\alpha} \boldsymbol{\beta}' \mathbf{x}_{t-m} + \boldsymbol{\epsilon}_t$, where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are $k \times r$ matrices of rank $r$ and $\boldsymbol{\beta}' \mathbf{x}_{t-m}$ are the $r$ I(0) cointegrating relations. This reveals that modelling only in differences to take account of the I(1) non-stationarity in $\mathbf{x}_t$ ignores important levels information in $\boldsymbol{\beta}' \mathbf{x}_{t-m}$ and so is inefficient. PcGive and CATS in RATS (see Hansen & Juselius, 1995) provide full implementations of the statistical analysis of VEqCMs. Hendry and Juselius (2000, 2001) provide surveys of the literature.
If the only source of non-stationarity were the presence of I($q$) processes with $q = 1$ or 2, then a combination of differencing and cointegrating relationships would bring the analysis back to I(0) processes. Other sources of non-stationarity also matter, however, especially shifts in the means of data distributions of I(0) variables, including equilibrium corrections and growth rates. There is a tendency in the econometrics literature to identify "non-stationarity" with integrated data (unit roots), and so incorrectly claim that differencing a time series induces stationarity. There are many other sources of non-stationarity, so we refer to wide-sense non-stationarity to include both stochastic trends and location shifts, the combination of which causes numerous problems for econometric modelling.
In the VEqCM above, a location shift must occur when the intercept $\boldsymbol{\gamma}$ changes with other parameters constant, or those parameters shift with $\boldsymbol{\gamma}$ constant. Failure to model, or remove, such shifts can have a pernicious effect on the quality of an estimated model, as shown in Castle and Hendry (2014a). Moreover, as Hendry and Mizon (2014) demonstrate, inter-temporal economic theory fails when unanticipated location shifts occur, with the law of iterated expectations no longer applying, and "rational expectations" being biased. Fortunately for empirical modelling, SIS provides an automatic selection method to detect and "neutralize" location shifts in-sample. Also, analogous to cointegration cancelling unit roots to deliver an I(0) relation, co-breaking can cancel location shifts in linear combinations of variables (see Hendry & Massmann, 2007). Such an occurrence suggests a tight connection between the variables involved.
Stochastic trends and location shifts in economic time series can also adversely affect forecast accuracy. The methods used in practical forecasting have to rely on currently available information about the past and present, to extrapolate into the future. Even if the analysis of the available information and the representation of it in models is exemplary, accurate and reliable, forecasting requires that the future resembles the present in its essential attributes. Unfortunately, intermittent unanticipated shifts entail that this is rarely true. Though attempts have been made to predict future shifts (see Castle, Fawcett, & Hendry, 2010), that still remains an important research agenda item. While the most parsimonious, congruent and encompassing model in-sample usually would dominate in forecasting out of sample if there were no location shifts, such models have to be made robust to location shifts, which leads to a different class of model, and one that need not even be congruent in-sample. Thus, while automatic Gets aims to locate the LDGP, doing so successfully need not improve forecasting in the face of unanticipated location shifts. However, a congruent encompassing model, although it may require robustification in order to forecast accurately, can still form a useful basis for doing so and help retain valuable causal information. Further, I(1) processes lead to much higher forecast uncertainty using the correct in-sample model, compared to even mis-specified models on I(0) data; and models with deterministic linear trends on either data type seriously understate the correct uncertainty. Indeed, the poor record of econometric forecasts as compared with (say) the time series models of Box and Jenkins (1970/1976) led to the realization that it is important to robustify forecasting models by exploiting the fact that location shifts, or shifts in time trends, can be reduced to impulses by an appropriate order of differencing.
Clements and Hendry (1998, 1999) provide extensive discussions. Hendry (2015) stresses that much of the historical variation in economic time series has been due to "non-economic" factors such as changes in social mores, legislation, technology, medicine and finance as well as wars, only partly influenced by economic variables like prices and incomes. Since change is the norm, and that book is aimed at teaching undergraduates, it offers simple explanations for unit-root non-stationarity, cointegration, location shifts and co-breaking, to which the reader is referred.
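The point that differencing reduces a location shift to an impulse, and so robustifies forecasts, can be shown with a small experiment. The sketch below assumes a constant-mean process whose mean shifts unexpectedly at $t = T_0$, and compares one-step-ahead forecasts from the recursively estimated in-sample mean with those from a differenced device that simply projects the last observation. The shift size, sample lengths and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
T0, H, mu, shift = 100, 20, 0.0, 5.0           # illustrative values (assumed)
y = mu + rng.normal(size=T0 + H)
y[T0:] += shift                                 # unanticipated location shift at t = T0

err_mean, err_robust = [], []
for t in range(T0, T0 + H):
    err_mean.append(y[t] - y[:t].mean())        # in-sample mean, re-estimated each period
    err_robust.append(y[t] - y[t - 1])          # differenced device: forecast is last value

msfe_mean = np.mean(np.square(err_mean))
msfe_robust = np.mean(np.square(err_robust))
print(f"MSFE, in-sample mean model: {msfe_mean:.2f}")
print(f"MSFE, differenced device:   {msfe_robust:.2f}")
```

The differenced device makes one large error at the shift date and then adapts, at the cost of a larger error variance in normal times, whereas the mean model continues to forecast systematically in the wrong direction.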

Non-linearity
So far, all the DGPs and models considered have been linear in the variables, albeit we have generally assumed that linearity holds after log-transforms of the basic aggregate measures. There are also many models that are non-linear in parameters, such as threshold models (see Teräsvirta, Tjøstheim, & Granger, 2011) and regime-switching models (see e.g. Hamilton, 2015). Relations that are non-linear in the variables cannot be excluded a priori, but "non-linear" covers everything other than linear, so comprises an infinite number of possibilities, thereby posing an impossible modelling task. To cut that Gordian knot, Castle and Hendry (2010, 2014b) propose a low-dimensional approach, using squares, cubes and exponential functions of the individual elements in the principal components of all $N$ original variables in the GUM. Denoting those components by $v_{i,t}$, they add $v_{i,t}^2$, $v_{i,t}^3$ and $v_{i,t} \exp(-|v_{i,t}|)$ to the set of candidate variables, in order to capture the most important sources of departure from linearity, including asymmetry and sign-preserving reactions, using "only" $3N$ more variables.
A valuable advantage is that there is no collinearity between elements of $v_{i,t}$, and by demeaning the higher order terms, little between those either. A drawback is the difficulty of interpreting any non-linearities discovered, but whenever a preferred theory specification is available, such as a logistic smooth transition formulation, an encompassing test against that is easy to conduct. The outcome could reveal that the preferred model accounts for all the non-linearity captured by the low-dimensional approach, or is significant but some non-linearity remains, or is insignificant so is not the correct non-linear specification.
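Constructing the $3N$ additional candidate variables is mechanical. The sketch below is one possible implementation, assuming the principal components are computed from the correlation matrix of the standardised variables and scaled to unit variance; the function name `nonlinear_candidates` and the simulated data are hypothetical.

```python
import numpy as np

def nonlinear_candidates(X):
    """Return (v, extras): principal components and the 3N demeaned non-linear terms."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)            # standardise each variable
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]                    # largest eigenvalue first
    v = Z @ eigvec[:, order] / np.sqrt(eigval[order])   # unit-variance components
    extras = np.column_stack([v ** 2, v ** 3, v * np.exp(-np.abs(v))])
    return v, extras - extras.mean(axis=0)              # demean the non-linear terms

rng = np.random.default_rng(8)
T, N = 200, 4
X = rng.normal(size=(T, N)) @ rng.normal(size=(N, N))   # correlated regressors (assumed)
v, extras = nonlinear_candidates(X)
print("components:", v.shape, "non-linear candidates:", extras.shape)
print("max off-diagonal correlation between components:",
      np.abs(v.T @ v / T - np.eye(N)).max())
```

The components are mutually orthogonal by construction, which is the absence of collinearity noted in the text; the `extras` columns would then be added to the candidate set for selection.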
However, GUMs with many variables, long lags on those, the non-linear components just described, and IIS and/or SIS contain vast numbers of candidate variables, often many more than there are observations. Tackling them requires a powerful selection tool, the topic to which we now turn.

Explaining the basics of non-linearity
Few economics undergraduate econometrics courses tackle either principal components or selection in models that are non-linear in the variables, but they often include an explanation of the RESET test (see Ramsey, 1969). That test adds the square, or sometimes also the cube, of $\hat{y}_t$ to the regression, which creates a non-linear function of a linear combination of the regressors weighted by their estimated coefficients. Here, we are adding non-linear functions of linear combinations of the regressors, weighted by their importance in explaining their overall variance. To illustrate, one route would be to reuse the data in Section 2.1.1, and add, say, $x_{1,t}^2$ to the general regression to show that it is insignificant, as there is no non-linear connection. Next, find the four largest values of $x_{1,t}$ in the data-set, and add sufficiently large impulses to $y_t$ at those dates to create outliers that should now be "modelled" by a spuriously significant coefficient for $x_{1,t}^2$. Finally, add impulse dummies for those dates to demonstrate that the correct relation can be recovered (or use IIS if that is available). There are also non-parametric approaches: if these are available, it can be fun to show how they behave in this setting.
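The three steps of that exercise can be sketched as follows. Simulated data stand in for the data-set of Section 2.1.1, which is not reproduced here, so the DGP, the impulse size of 10 and all variable names are assumptions made for illustration.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and t-statistics for y = X b + e."""
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return b, b / np.sqrt(sigma2 * np.diag(xtx_inv))

rng = np.random.default_rng(4)
T = 100
x1 = rng.normal(size=T)
y = 1.0 + x1 + rng.normal(size=T)               # linear DGP: x1 squared is irrelevant
base = np.column_stack([np.ones(T), x1, x1 ** 2])

_, t_clean = ols(base, y)                       # step 1: squared term in the clean data

top4 = np.argsort(x1)[-4:]                      # step 2: outliers at the four largest x1
y_out = y.copy()
y_out[top4] += 10.0
_, t_spur = ols(base, y_out)

dummies = np.eye(T)[:, top4]                    # step 3: impulse dummies for those dates
b_dum, t_dum = ols(np.column_stack([base, dummies]), y_out)

keep = np.setdiff1d(np.arange(T), top4)         # equivalent to dropping the four dates
b_drop, _ = ols(base[keep], y[keep])

print("t on x1^2, clean data:      ", round(t_clean[2], 2))
print("t on x1^2, with outliers:   ", round(t_spur[2], 2))
print("t on x1^2, impulses added:  ", round(t_dum[2], 2))
```

Including an impulse dummy for an observation gives the same coefficient estimates as omitting that observation, so the last regression recovers the relation estimated from the 96 uncontaminated dates.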

Model selection
When modelling economic or social systems, it is impossible to capture everything that matters empirically, so we focus on influences that "matter substantively", albeit that must be context and sample size dependent. The previous sections have explained the framework and concepts that have led us to seek congruent, parsimonious encompassing representations, obtained by simplifying an initial general unrestricted model, or GUM, that captures the main data properties, such as autocorrelation, non-stationarity and regime shifts. In wide-sense non-stationary processes, ceteris paribus cannot apply empirically, so commencing with too few variables in the candidate set may make it impossible to find a constant parameter model. Undertaking selection from many variables rapidly exceeds what any human can achieve, so automatic selection methods have become essential for successful econometric modelling. They are capable of investigating empirically a much wider range of possibilities than even the greatest experts, and of doing so efficiently when the automatic searches are well structured. Doornik (2009) explains how "general-to-specific" selection algorithms operate, of which Autometrics is the latest version. This uses block multi-path searches in a tree structure, essentially classifying effects into those that are significant given all other selected and retained variables, and those that are not. This approach allows Autometrics to select models even when there are more candidate variables than observations as discussed in Doornik and Hendry (2015). To maintain congruence, diagnostic testing is undertaken throughout the simplification process, as well as checking encompassing of the (local) GUM by each terminal model, backtracking to an earlier, less simple, model if any tests reject.
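The operating characteristics of selection can be examined by simulation in class. The sketch below is deliberately much simpler than the multi-path tree search just described: it estimates the GUM once and keeps every variable whose t-statistic exceeds a critical value (a one-step cut), then records the retention rate of irrelevant variables (the gauge) and of relevant variables. The design, with 3 relevant and 20 irrelevant orthogonal regressors, is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
T, n_relevant, n_irrelevant = 100, 3, 20
reps, crit = 500, 2.58                          # crit of about 2.58 targets roughly 1%
N = n_relevant + n_irrelevant
beta = np.r_[np.full(n_relevant, 0.5), np.zeros(n_irrelevant)]

kept = np.zeros(N)
for _ in range(reps):
    X = rng.normal(size=(T, N))                 # candidate variables in the GUM
    y = X @ beta + rng.normal(size=T)
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    se = np.sqrt(resid @ resid / (T - N) * np.diag(xtx_inv))
    kept += np.abs(b / se) > crit               # one-step cut on the GUM t-statistics

gauge = kept[n_relevant:].sum() / (reps * n_irrelevant)
retention_relevant = kept[:n_relevant].sum() / (reps * n_relevant)
print(f"gauge (irrelevant variables retained): {gauge:.3f}")
print(f"retention rate of relevant variables:  {retention_relevant:.3f}")
```

The gauge comes out close to the nominal significance level, so with 20 irrelevant candidates only about one in five selected models retains any irrelevant variable by chance, while the relevant variables are almost always kept.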
The advantages of automatic methods are described in Hendry and Doornik (2014), so here we will consider model selection when a set of variables suggested by a prior theory are to be retained without selection. Section 2.1 distinguished between the target of model selection, which has to be the LDGP, and the object of the analysis, usually an economic theory model. A natural reconciliation is to nest "data-driven" and "theory-driven" approaches in a common framework, where the theory model is retained, but not imposed, and a wide range of influences that potentially could matter are selected over. To ensure the theory model parameter estimates that result have exactly the same distribution as when a complete and correct theory is fitted directly to the same data, prior to selection, Hendry and Johansen (2015) regress all the other candidate variables on the theory variables and replace the former by the resulting residuals which are thereby orthogonal to the theory variables. As is well known from generalizations of the famous theorem in Frisch and Waugh (1933), parameter estimates are unaffected by the inclusion or exclusion of orthogonal regressors, all of which would be irrelevant when the theory model was complete and correct. In the more likely setting that the theory is incomplete, a better representation of the LDGP can be discovered, so in both states of nature, such an approach is either costless or beneficial.
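The orthogonalisation step can be verified numerically. In the sketch below, the theory variables are a constant and one regressor, and six other candidates are correlated with the theory regressor; all names and values are illustrative assumptions. After replacing the candidates by their residuals from regressions on the theory variables, the theory coefficients are identical whichever candidates are included.

```python
import numpy as np

def ols_coef(A, b):
    """Least-squares coefficients from regressing b on A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

rng = np.random.default_rng(5)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])   # retained theory variables
W = rng.normal(size=(T, 6)) + 0.5 * X[:, [1]]            # other candidates, correlated with theory
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)        # theory complete and correct here

W_orth = W - X @ ols_coef(X, W)                 # residuals of candidates on theory variables
theory_only = ols_coef(X, y)

estimates = []
for subset in ([0], [1, 4], [0, 1, 2, 3, 4, 5]):
    full = ols_coef(np.column_stack([X, W_orth[:, subset]]), y)
    estimates.append(full[:2])                  # coefficients on the theory variables

raw = ols_coef(np.column_stack([X, W]), y)[:2]  # without orthogonalisation, for contrast
print("theory model alone:     ", theory_only)
print("with orthogonalised W:  ", estimates[-1])
print("with raw W:             ", raw)
```

Because the orthogonalised candidates have exactly zero sample covariance with the theory variables, whichever of them are retained by selection leaves the theory estimates unchanged, whereas adding the raw candidates alters them.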

Explaining the basics of model selection
If you undertook the example in Section 2.1, or used Autometrics at any other stage, then you have already done model selection, and presumably explained the steps involved. Thus, this part ends where it began, highlighting a perennial problem for teaching: all the concepts are closely inter-related. This usually leads to a simple-to-general approach in teaching, and like all such methods, the stopping point can be arbitrary. The resulting danger is leaving students with a seriously naive view of econometric modelling when they only study at an elementary level. This paper, Hendry and Nielsen (2007, 2010) and Hendry (2015) attempt to convey the complexities of real-world economic time series, and provide exciting tools to build models that at least avoid the most egregious mistakes.
In teaching, we often use Figure 1 from Hendry and Doornik (2014) as a summary of all the stages above. Starting at the top right with the DGP, which is bound to be unknown however good the accompanying economic analysis, the reductions lead down to the LDGP for the variables to be modelled, the topic of Section 2.1. Moving to the upper left-hand side of the diagram, the GUM must be specified sufficiently generally to nest the LDGP and embed the theory-based variables, so that it is congruent in order to sustain valid inferences during selection, as discussed in Section 2.2, and for conditional models in Section 2.4. While this is ideal, often a GUM may not be sufficiently general to nest the LDGP, in which case an approximation will result, as discussed in Castle, Doornik, and Hendry (2011). In practice, with more variables than observations (e.g. when IIS or SIS are used, as discussed in Section 2.6), congruence can only be checked after some simplification. Inevitably, the GUM will contain some redundant candidate variables and indicators, so general-to-specific model selection (considered in Section 2.3) is used to find a congruent, parsimonious encompassing representation in a final specific model, the subject of Section 2.5 and Section 2.8, possibly requiring checks of linearity (Section 2.7). When the GUM nests the LDGP, and the final model parsimoniously encompasses the GUM, then it should also parsimoniously encompass the unknown LDGP. Hence, the search has discovered what actually matters, and the researcher can legitimately evaluate the theory model.

Econometric software
We now present a very brief overview of some of the computer software that has been used in the teaching of econometrics since the late 1970s: Renfro (2009) provides an extensive history of econometric computing. Early mainframe computer programmes for the analysis of single equation time series models that could illustrate empirical work for teaching included TSP (see Hall & Cummins, 2005, for a recent release), MODLER (see Renfro, 1996, for a retrospective) and Give (see Hendry & Srba, 1980), providing estimation methods for models with endogenous explanatory variables via two and three stage least squares (TSLS and 3SLS) and instrumental variables (IV). Fiml, a companion to Give implemented full information maximum likelihood estimation of systems of simultaneous equations based on Hendry (1976). A feature of Give and Fiml was that they also incorporated mis-specification test statistics as these were developed and shown to be effective in model evaluation, particularly those related to the testing of concepts described in Section 2. With the advent of PCs, PcGive was developed for this more flexible medium, initially to complement Give but eventually to supersede it and Fiml. MicroTSP (initially developed by David M. Lilien) was introduced as the PC version of TSP, later integrated into EViews (see QMS, 2005). Among others, Microfit (see Pesaran & Pesaran, 1987) and RATS (see Enders, 1996) extended the available range, the latter more so after the development of CATS in RATS implemented multivariate cointegration analysis (see Hansen & Juselius, 1995). At a more basic level, STATA is often used in undergraduate courses, as are spreadsheets (the use of which is likely to lead to serious errors); whereas at a professional level the R language is popular, as is Ox.
The evolution of the PcGive software discussed in Hendry and Doornik (1999) can be traced via the many editions of the accompanying manuals, beginning with Hendry (1984) through to the latest (Doornik & Hendry, 2013a, 2013b), which are part of the OxMetrics suite and incorporate Autometrics. Previous publications by us on the teaching of econometrics using PcGive and the OxMetrics suite of programmes include Hendry (1986, 1990) and Hendry and Nielsen (2007, 2010). Hendry (2015) proposes a major change in the curriculum for undergraduate econometrics, to include a range of topics essential for understanding and undertaking empirical research, explicitly addressing the need to discover what influences actually matter in practice:

Illustrating new econometrics teaching
... the notion of empirical model discovery in economics may seem to be an unlikely idea, but it is a natural evolution from existing practices. Despite the paucity of explicit research on model discovery, there are large literatures on closely related approaches, including model evaluation (implicitly discovering what is wrong); robust statistics (discovering which sub-sample is reliable); non-parametric methods (discovering the relevant functional form); identifying time series models (discovering which model in a well-defined class best characterizes the available data); model selection (discovering which model best satisfies the given criteria), but rarely framed as discovery.
Model selection in the face of changing economies is at the centre of Hendry (2015), using a UK macroeconomics database over 1860-2011. Not only do economies evolve, so does economic theory, making it hazardous to impose extant theory on empirical models if that theory might be discarded shortly. As outlined in Section 2.8, rather than adopt either a "theory-driven" or a "data-driven" approach, empirical model discovery embeds the best available theory to be retained during selection while investigating many other potentially relevant variables, longer lags, non-linear functions and both outliers and location shifts. When the theory is complete and correct for the sample under analysis, the distributions of the parameter estimates will be identical to those obtained by directly fitting the theory to the data: see Hendry and Johansen (2015). However, if the theory is incorrect, or more usually, incomplete, but the extended specification includes substantively relevant variables, an improved representation will result. As illustrated in Hendry and Mizon (2011), failing to correctly determine dynamics and outliers can lead to a model that is so seriously mis-specified that the empirical results appear to reject the theory from which the model was derived, yet after taking account of those effects, the evidence is strongly consistent with the same theory. When systematically conducted, model discovery (or "data mining" as it is sometimes pejoratively called) can improve theory-based specifications. Think of undertaking extensive explorations while controlling for adventitious significance as answering all likely seminar questions in advance.
Recent advances in computer power and speed, and improvements in search algorithms, facilitate a modified general-to-specific modelling strategy even if the initial number of candidate explanatory variables, N, exceeds the available number of observations, T. Thus, we now provide an example of how one might teach a systematic approach to undertaking empirical time series research, albeit simplified to modelling a single variable dependent on a few explanatory variables using artificial data. The simplicity is to sustain live demonstrations and class participation, either directly with each student undertaking their own data generation, modelling, then simulation, or enabling questions to be addressed by showing their impact on the instructor's models.
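To show students why having more candidate variables than observations need not be fatal, a minimal sketch of split-half impulse-indicator saturation can help. It assumes the simplest possible model, a constant mean with independent normal errors, and it is not the Autometrics algorithm, which uses a more elaborate block and tree search. The function name `split_half_iis`, the seed and the planted outlier are all hypothetical choices made for illustration.

```python
import random
from statistics import mean, stdev


def split_half_iis(y, critical_value=2.6):
    """Return 1-based dates of retained impulse indicators.

    Assumed model: y_t = mu + e_t. Adding an impulse for every
    observation in one half is equivalent to estimating mu and sigma
    from the other half and testing each saturated observation
    against that fit.
    """
    T = len(y)
    half = T // 2
    blocks = [(range(0, half), range(half, T)),
              (range(half, T), range(0, half))]
    candidates = set()
    for saturated, estimation in blocks:
        sample = [y[i] for i in estimation]
        mu_hat, sigma_hat, n = mean(sample), stdev(sample), len(sample)
        se = sigma_hat * (1.0 + 1.0 / n) ** 0.5
        for i in saturated:
            if abs((y[i] - mu_hat) / se) > critical_value:
                candidates.add(i)
    # Final stage: re-estimate with the union of retained indicators,
    # which again amounts to fitting on the remaining observations.
    clean = [y[i] for i in range(T) if i not in candidates]
    mu_hat, sigma_hat, n = mean(clean), stdev(clean), len(clean)
    se = sigma_hat * (1.0 + 1.0 / n) ** 0.5
    return sorted(i + 1 for i in candidates
                  if abs((y[i] - mu_hat) / se) > critical_value)


rng = random.Random(0)
y = [rng.gauss(0.0, 1.0) for _ in range(100)]
y[64] += 8.0  # hypothetical outlier planted at observation 65
found = split_half_iis(y)
print(found)
```

Each block regression has only about half as many indicators as observations, so it is always feasible, even though the full set of indicators plus the constant exceeds the sample size.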
The aim of the following sections is to illustrate the roles of the various components discussed above. Subject-matter theory is discussed in Section 3.1, the database and software in Section 3.2, computing the estimates of the DGP parameters and testing congruence in Section 3.3, the formulation of the general unrestricted model (GUM, although we will eschew orthogonalization here) and selection with indicator saturation in Section 3.4, then testing parameter constancy in Section 3.5 and exogeneity in Section 3.6. Parsimonious encompassing is implemented automatically during selection, and we will briefly note testing for a non-linear representation in Section 3.7. Finally, the use of simulation to investigate the outcome will be described in Section 3.8.

The economic theory
We consider a mimic of a demand model for a perishable commodity like fish (see e.g. Graddy, 2006, and the use of her data in Hendry & Nielsen, 2007). Let $p_t$ denote the price of the specific variety of fish in the market, available in a quantity $q_t$, where lower case letters denote the logs of the variables.

The database and software
The database here has two components. First, the model in Hendry and Nielsen (2007) of the Fulton Fish Market time series data collected by Graddy (1995) on the daily prices and quantities of whiting sold by a wholesaler from 2 December 1991 to 8 May 1992, and associated weather-related measures, sets the scene for creating a simulated data-set. Second, students are given the task of generating an artificial data-set to mimic such a market. The reason for the artificial data is that the closeness of any claimed model to its DGP can be judged, whereas there are no known "correct answers" with empirical data.
To generate the artificial data, we use the PcNaive module within PcGive: see Doornik and Hendry (2013b), and also the explanations for its use in Hendry (2015, Chap. 8.10). The role of PcNaive is to ease the design of simulation experiments, and its output is a computer programme in Ox, which can be run by Ox Professional. In PcGive, select Monte Carlo, Advanced Experiment option, and create a DGP with two endogenous and two exogenous variables, with a break from observations 40-50, which we use to mimic a prolonged period of stormy weather that reduces the supply of fish. Choose the "simultaneous equations" formulation and create the two equations (5) and (6), using the parameter values $\beta_1 = -0.5$ and $\beta_2 = 0.5$ (which deliver a long-run price elasticity of minus unity), and $\gamma_1 = 0.4$, $\gamma_2 = 0.4$ and $\gamma_3 = 0.025$. The intercepts matter in reality, but we will set them to zero here, an effect that could be achieved by appropriate choices of units. Finally, set the two error standard deviations to 0.01, again dependent on units, but in a log-linear model these represent error standard deviations of 1%, so the storm shift is $2.5\sigma$. Set $T = 100$, and select "save data" so the final simulation trial can be analysed as if it were empirical data. We used $M = 10{,}000$ replications for Figure 2, but only a few replications are needed.
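For readers without access to PcNaive, the data-generation step can be mimicked in a few lines. The sketch below is ours, not PcNaive output, and it rests on assumed functional forms: a quantity equation with the current price and lagged quantity, and a price equation with lagged price, lagged quantity and a storm dummy. The coefficient names, the allocation of the two 0.4 coefficients to the lags, and the seed are hypothetical; only the numerical values are taken from the text.

```python
import random


def simulate_fish_market(T=100, seed=1, storm=(40, 50),
                         beta1=-0.5, beta2=0.5,
                         gamma1=0.4, gamma2=0.4, gamma3=0.025,
                         sigma=0.01):
    """Simulate log price p_t and log quantity q_t for t = 1..T.

    Assumed forms (an illustration, not confirmed by the text):
      p_t = gamma1 * p_{t-1} + gamma2 * q_{t-1} + gamma3 * storm_t + v_t
      q_t = beta1 * p_t + beta2 * q_{t-1} + e_t
    with zero intercepts and independent normal errors.
    """
    rng = random.Random(seed)
    p_prev, q_prev = 0.0, 0.0
    rows = []
    for t in range(1, T + 1):
        storm_t = 1.0 if storm[0] <= t <= storm[1] else 0.0
        p = (gamma1 * p_prev + gamma2 * q_prev
             + gamma3 * storm_t + rng.gauss(0.0, sigma))
        q = beta1 * p + beta2 * q_prev + rng.gauss(0.0, sigma)
        rows.append({"t": t, "p": p, "q": q, "storm": storm_t})
        p_prev, q_prev = p, q
    return rows


data = simulate_fish_market()

# Checks on the design values quoted in the text.
long_run_elasticity = -0.5 / (1.0 - 0.5)  # beta1 / (1 - beta2)
storm_in_sigmas = 0.025 / 0.01            # gamma3 / sigma
print(long_run_elasticity, storm_in_sigmas)
```

Giving each student a different `seed` is one simple way to hand out different draws from the same DGP.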
Conceptually labelling the panels of Figure 2 a-k along successive rows, a and b record the standardized data time series, denoted $Ya_t$ and $Yb_t$, and $Za_t$ and $Zb_t$; c-e show the sampling distributions of the estimators of $\beta_0$, $\beta_1$, $\beta_2$; f-h show the distributions of their conventionally computed estimated standard errors (denoted ESE); and finally i-k show the means over the $M$ replications of the parameter estimates, with bands of ±2 ESE and of ±2 "true" standard errors (based on the distributions shown in c-e). The wide scatter of the possible estimates of the parameters is noticeable, as is the potential variability in their ESEs.
Every student can be given a different draw by changing the number of replications each is assigned. We will return later to do a multi-replication study to compare the estimation of the DGP model with that resulting from selection from a much larger GUM. The role of the two exogenous variables will be to create additional irrelevant variables.

Computing the estimates of the DGP parameters
Load the created data-set back into PcGive, and use the calculator to form a dummy variable StormDum equal to unity from observation 41 to 51 and zero elsewhere (the date shift is due to how PcNaive times events). First check that the data graphs match the general form of those shown in Figure 2, then estimate the two DGP Equations (5) and (6). Here, we found the estimates reported in (7) and (8), with diagnostic statistics including $F(1, 98) = 0.001$, $\chi^2(2) = 3.25$, $F(4, 95) = 0.33$ and $F(2, 95) = 1.75$. In (7) and (8), $\hat{\sigma}$ is the residual standard deviation, and $R^2$ is the squared multiple correlation, with coefficient standard errors shown in parentheses. The mis-specification test statistics have the form $F_j(k, T - l)$, denoting an approximate $F$-test against the alternative hypothesis $j$, and comprise: $k$th-order serial correlation ($F_{\text{ar}}$: see Godfrey, 1978); $k$th-order autoregressive conditional heteroskedasticity ($F_{\text{arch}}$: ARCH, see Engle, 1982); heteroskedasticity ($F_{\text{het}}$: see White, 1980); $F_{\text{reset}}$, which is the RESET test (see Ramsey, 1969); and a chi-square test for normality ($\chi^2_{\text{nd}}(2)$: see Doornik & Hansen, 2008).
Parsimonious encompassing of the feasible GUM will be checked during selection. Tests of parameter constancy over $k$ periods (the Chow test: see Chow, 1960), of super exogeneity (based on IIS: see Hendry & Santos, 2010), and the low-dimensional test for non-linearity could be added, as discussed in Sections 3.5-3.7. These estimates are recognizably close to the DGP parameter values used, with no mis-specification tests significant by chance at the 1% level, so there are no important departures from congruent representations.
Both GUMs have more regressors than observations, $T$, but this does not pose any problems for an automatic model-selection approach like that in Autometrics, as explained in Hendry and Doornik (2014) and Doornik and Hendry (2015). The theory in Hendry and Johansen (2015) proposes orthogonalizing all the additional regressors against the theory-model variables, so the latter are not "contaminated" by selection, but here we will simply retain them and the intercept during selection. For the $Ya_t$ GUM in (9), selection at 1% exactly reproduces (7) despite all the added irrelevant variables. This is slightly lucky, since with 110 candidate variables in total, on average one should be significant by chance.
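The "slightly lucky" remark can be made concrete with a back-of-envelope calculation. The sketch below assumes, purely for illustration, that all 110 candidates are irrelevant and that their significance tests are independent, each rejecting with probability equal to the nominal 1% level; neither assumption holds exactly in practice.

```python
# Nominal significance level and number of candidate variables
# (10 regressors plus 100 impulse indicators).
alpha = 0.01
n_candidates = 110

# Expected number of irrelevant variables retained by chance.
expected_false = alpha * n_candidates

# Probability that none is retained, under assumed independence.
prob_none = (1.0 - alpha) ** n_candidates

print(f"expected chance retentions: {expected_false:.2f}")
print(f"probability of retaining none: {prob_none:.3f}")
```

Under these assumptions, retaining no irrelevant variable at all happens in roughly one third of samples, so reproducing (7) exactly is fortunate but far from remarkable.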
For the $Yb_t$ GUM in (10), selection at 1% finds Equation (11). The storm from 41-51 is approximated by five similar-magnitude, same-sign impulses, thereby missing some of its intermediate effects but "picking up" an earlier start. When the shift is just $2.5\sigma$, a positive shock is likely to make an impulse indicator less significant than the 1% critical value of 2.6. Thus, finding 5 of the 11 storm impulses is about what would be expected on average. In addition, IIS finds outliers at observations 65 and 98, which are clearly visible in Figure 3(a). On average, roughly one should be significant by chance, but the missing storm impulses have somewhat biased the regression estimates, which may have created spurious outliers (as will transpire to be the case).
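The number of storm impulses one should expect to retain can be approximated directly. The sketch below is a simplification that we label as such: it treats each indicator's t-ratio as a standard normal draw centred on the shift measured in error standard deviations, ignoring parameter estimation and the model's dynamics. The function name `retention_probability` is hypothetical.

```python
from statistics import NormalDist


def retention_probability(shift_in_sigmas, critical_value):
    """Chance that |t| exceeds the critical value when t ~ N(shift, 1)."""
    z = NormalDist()
    upper = 1.0 - z.cdf(critical_value - shift_in_sigmas)
    lower = z.cdf(-critical_value - shift_in_sigmas)
    return upper + lower


# Shift of 2.5 standard deviations, selection at about 1% (c = 2.6).
p_keep = retention_probability(2.5, 2.6)

# The storm dummy covers 11 observations (41 to 51).
expected_found = 11 * p_keep

print(f"retention probability per impulse: {p_keep:.3f}")
print(f"expected storm impulses retained: {expected_found:.2f}")
```

Because the shift lies just below the critical value, each storm impulse is retained slightly less than half the time, so retaining about five of the eleven is typical under this approximation.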
None of the irrelevant regressors was retained. Equation (11) slightly overfits when all the impulse indicators are entered freely, although that can be mitigated by a bias correction (see e.g. Johansen & Nielsen, 2009). Imposing a common coefficient across the contiguous indicators found, but from 39 to 48, leads to an outcome almost identical to that from using StormDum. However, $1_{98}$ now becomes insignificant at 1%, and dropping it leads to Equation (12), with $\hat{\sigma} = 0.0093$. Thus, eliminating $1_{98}$ creates some residual autocorrelation at 5%: such "trade-offs" between keeping insignificant variables and congruence often occur in empirical research. Notice that all the theory-based variables have been selected in Equation (12), so the same results would be delivered when those were retained.
$$Yb_t = \lambda_0 + \lambda_1 Ya_{t-1} + \lambda_2 Yb_{t-1} + \lambda_3 Ya_{t-2} + \lambda_4 Yb_{t-2} + \lambda_5 Za_t + \lambda_6 Za_{t-1} + \lambda_7 Za_{t-2} + \lambda_8 Zb_t + \lambda_9 Zb_{t-1} + \lambda_{10} Zb_{t-2} + \cdots \qquad (10)$$
To summarize, despite a lack of knowledge of dynamic reactions, relevant variables, location shifts or outliers, which led to a GUM with 110 variables for 98 observations (after lags), only one irrelevant effect, namely $1_{65}$, was significant by chance, which is what one would anticipate at 1%. Inspecting the residuals from (8) would have shown the same outlier, and applying IIS to that equation would also have revealed that $1_{39}$ was significant and reduced the significant (spurious) residual autocorrelation.

Testing parameter constancy
Like exogeneity in the next section, testing the constancy of a model's parameters usually only happens after modelling, but with indicator saturation, it can now take place jointly with other selections. So far we have not applied step-indicator saturation (SIS), which uses increasing step indicators that essentially cumulate the corresponding impulses up to that time, but for a step shift like the simulated stormy period, it is an effective device. For $Yb_t$, SIS yields Equation (13). The storm is captured from observation 40 to 48, and the outlier at 65 by the two offsetting steps: replacing them by $1_{65}$ gives $S_{40} \approx -S_{48}$ with $\hat{\sigma} = 0.0093$. So why does SIS not get the correct timing? The answer is that, because the shift is $2.5\sigma$ and we are selecting at 1% with a critical value of about $c = 2.6$, positive draws can leave an apparent shift of less than $c$, which would not be selected. In effect, the storm does not show up in the data at that point, and indeed $R^2$ is slightly lower in (12) with the correct dummy than in the variant of (13) using $1_{65}$: in finite samples, the DGP need not be the "best" model, illustrating that modelling with a single sample of observations may capture a feature particular to that sample which is not part of the LDGP.
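The relation between impulse and step indicators is easy to verify numerically. The sketch below assumes the convention that a step indicator dated j equals one up to and including observation j and zero afterwards; the helper names `impulse` and `step` are hypothetical. It shows that a step is the sum of the impulses up to its date, and that a block dummy such as StormDum is the difference of two steps, which is why two retained steps with roughly equal and opposite coefficients can represent a temporary shift.

```python
T = 100


def impulse(j, T):
    """Impulse indicator: 1 at observation j (1-based), 0 elsewhere."""
    return [1 if t == j else 0 for t in range(1, T + 1)]


def step(j, T):
    """Step indicator (assumed convention): 1 for t <= j, else 0."""
    return [1 if t <= j else 0 for t in range(1, T + 1)]


# A step indicator cumulates the impulses up to its date.
cumulated = [sum(col) for col in zip(*(impulse(i, T) for i in range(1, 49)))]
steps_match = cumulated == step(48, T)

# A block dummy for observations 41 to 51 is a difference of two steps.
storm_dum = [a - b for a, b in zip(step(51, T), step(40, T))]
first_storm_obs = storm_dum.index(1) + 1

print(steps_match, first_storm_obs, sum(storm_dum))
```

With this convention, SIS needs only two indicators to capture an eleven-period shift, whereas IIS needs eleven, which helps explain the power advantage of SIS for location shifts.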

Testing exogeneity
The only contemporaneous regressor in the two models is $Yb_t$ in (7), so that will be the focus of our test. Clearly, the conditional analysis was conducted under the assumption that $Yb_t$ was weakly exogenous, as that hypothesis cannot be tested until the relevant equation has been established. The basis of our approach will be the automatic IIS test of super exogeneity, namely the joint hypothesis of constancy and weak exogeneity, proposed in Hendry and Santos (2010). This involves locating any shifts in the process of the conditional variable, here $Yb_t$, and testing their significance in the conditional model, here (7). Equation (11) revealed eight indicators in the former, so we test their inclusion in (7). This delivered $F(8, 89) = 1.26$, which is insignificant, confirming the validity of conditioning on $Yb_t$ in the model for $Ya_t$.

Testing for a non-linear representation
As with constancy tests, there are many available approaches, a number of which are offered within PcGive. Here, we calculate the low-dimensional test for non-linearity based on the squares, cubes and exponential functions of the principal components of the data series. These deliver $F(6, 90) = 1.50$ and $F(9, 83) = 1.06$ for (7) and (12), respectively, so neither reveals signs of non-linearity, which is the appropriate null outcome. An alternative approach we now prefer is to include those non-linear functions in the GUM and select at a tight significance level.

Re-simulating the model selection exercise
Since the final selections for $Ya_t$ and $Yb_t$ closely match their DGP equations, simulating (7) would deliver an outcome like that in Figure 2. However, simulating the GUM, or a version thereof selected using IIS, could be worthwhile for comparison with the DGP outcome, although $M = 10{,}000$ will take a considerable time on a PC. Note that the constant needs to be included as unrestricted for technical reasons, even though $\beta_0 = 0$, as does $Yb_t$ because the programme otherwise mishandles it as being endogenous. The Autometrics selection should be at 1%, which is approximately the reciprocal of the total number of candidate variables, $1/(T + N)$. The chosen output should not include recursive estimates, nor need it include saving the final replication. Figure 4 records the resulting distributions of parameter estimates and their estimated standard errors for $Ya_t$: the output is essentially identical to Figure 2.

Conclusions
The various manifestations of the KISS principle (keep it simple, stupid) correctly emphasize parsimony, but fail to note that in his "razor", William of Occam stressed avoiding adding unnecessary features. In a high-dimensional, possibly non-linear and wide-sense non-stationary world, facing both stochastic trends and distributional shifts, empirical models must be sufficiently general to capture all the substantive influences, or they could end up badly mis-specified. Teaching the empirical econometric analysis of large, complicated models can be demanding, but we have tried to steer a route through all the key steps, exploiting the amazing power of modern software. Not confronting the complications, and hence the need to discover what matters empirically while retaining the best available theory insights, will leave students with a dangerously naive view of how to model macroeconomic time series, so we strongly advocate changing the curriculum to address all these issues. Astute readers will have noticed the gulf between the focus of our paper on concepts and model formulations, as against the usual textbook sequence of recipes for estimating pre-specified models. Appropriate estimation techniques are certainly necessary, but are far from sufficient if the model in question is not well specified. Since economic reality is complicated, pre-specification is unlikely to be perfect, so discovering a good model seems to be the only viable way ahead.