Cluster Regularization via a Hierarchical Feature Regression

This paper proposes a novel graph-based regularized regression estimator, the hierarchical feature regression (HFR), which mobilizes insights from the domains of machine learning and graph theory to estimate robust parameters for a linear regression. The estimator constructs a supervised feature graph that decomposes parameters along its edges, adjusting first for common variation and successively incorporating idiosyncratic patterns into the fitting process. The graph structure has the effect of shrinking parameters towards group targets, where the extent of shrinkage is governed by a hyperparameter, and group compositions as well as shrinkage targets are determined endogenously. The method offers rich resources for the visual exploration of the latent effect structure in the data, and demonstrates good predictive accuracy and versatility when compared to a panel of commonly used regularization techniques across a range of empirical and simulated regression tasks.


Introduction
In this paper, I propose a new solution to the old problem of obtaining robust parameter estimates in a high-dimensional regression with nonorthogonal predictors. I decompose the estimates of an ordinary least squares regression along a supervised hierarchical graph, then optimally shrink the edges of the graph to achieve a group-wise regularization of the parameter space. The resulting estimator has several useful properties: (i) It solves the problem of group shrinkage in an elegant and efficient manner, where the composition of parameter groups as well as group shrinkage targets are determined endogenously; (ii) The estimator offers intuitive tools for the visual inspection of the model effects structure; (iii) It exhibits significant versatility, performing well (in terms of prediction accuracy) both in sparse, as well as dense regression settings; Finally, (iv) the estimator encodes the prior expectation of a world governed by hierarchical processes, making it uniquely suitable for several empirical applications, particularly in the domains of economics and finance.
A substantial literature exists on regularized regression techniques, the main thrust of which comprises variants of penalized or latent variable regressions, and which finds its most general expression in the extensive field of Bayesian regression analysis. With increasing availability of data, regularized regressions have steadily grown in importance in many fields, and underpin developments in domains as seemingly disparate as bioinformatics, finance or deep learning. Economic applications in particular are often characterized by high-dimensional, multicollinear data sets, and regularized machine learning algorithms are well established as computationally efficient means of obtaining accurate parameter estimates when the number of predictors relative to observations is high. The hierarchical feature regression (HFR) contributes to this body of knowledge, combining elements of graph theory and machine learning to inform a novel group shrinkage estimator.
The HFR constructs a parsimonious information graph, using a supervised hierarchical clustering algorithm that groups predictors based on the similarity of their explanatory content with respect to a dependent variable. The information graph is translated into a parameter hierarchy, consisting of several chains of coefficients (edges in the graph) that capture increasingly nuanced signal. The coefficient chains adjust first for shared variation, with each lower element introducing a further degree of idiosyncrasy. By shrinking the chain of coefficients, the HFR achieves group shrinkage, removing idiosyncratic information from the fitting process and giving a higher weight to shared effect patterns.
An economic case study highlights how the structure introduced by the hierarchical graph can be exploited to garner insights into latent effect dynamics in the fitted model, with rich resources for visual exploration. Furthermore, the HFR exhibits robust predictive accuracy, comparing favorably against a panel of benchmark regularized regression techniques. The results also indicate a high degree of versatility in the simulated setting, with good performance across different types of regression settings (e.g. sparse, latent factors, grouped). This flexibility is a key advantage: where related methods tend to be best suited to specialized types of tasks, the HFR can produce accurate parameter estimates across a spectrum of data generating processes.
The remainder of this paper is structured as follows: Section 2 introduces important literature relating to the field of regularized regression. The HFR is developed in Section 3, while Sections 4 and 5 explore its performance both in empirical and simulated settings. Finally, Section 6 concludes the paper.

Literature review
Nobel prize laureate Herbert Simon posits that complex systems tend to evolve in a hierarchic manner and, as a result, encompass hierarchical structures (Simon, 1962). This proposition is supported by an understanding of highly integrated markets and economies driven in part by deeper global undercurrents, e.g. global business cycles (Diebold & Yilmaz, 2015; Kose et al., 2003) or global financial cycles (Rey, 2015), and is reflected in the popularity of latent variable methods (e.g. dynamic factor models for macroeconometric analysis) and, increasingly, deep learning methods for nonlinear prediction tasks. The HFR utilizes empirical data hierarchies with the objective of achieving an optimal group mean shrinkage that captures the hierarchical nature of the data generating processes and, in turn, attains more robust out-of-sample performance. It is therefore located squarely within the regularization literature. A plethora of approaches to parameter regularization have been developed in this domain. Penalized regressions, termed "Lasso and friends" in Varian (2014), receive some attention in this paper as natural benchmarks for the HFR. These approaches introduce a constraint on the parameter norm, by adding a penalty function P_λ(β) to the least squares loss of a regression of y on x:

β̂ = argmin_β { (1/N) Σ_{i=1}^{N} (y_i − x_i β)² + P_λ(β) }. (2.1)

Here β is a vector of parameter estimates and N is the sample size. The penalty function depends on a hyperparameter λ governing the weight given to the penalty, and typically takes the form P_λ(β) = λ Σ_i |β_i|^q, where q = 1 is a Lasso and q = 2 is a ridge regression. Important contributions to this literature include James & Stein (1961), Hoerl (1962), Hoerl & Kennard (1970), Tibshirani (1996) and Efron et al. (2004), as well as multiple variants, including Zou & Hastie (2005), Zou (2006) and Zou & Zhang (2009). An introductory overview is found in Friedman et al. (2001).
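As a concrete sketch of Eq. 2.1, the q = 2 (ridge) case has a closed-form solution. The numpy example below (illustrative only; the 1/N scaling is folded into λ) shows that the penalty shrinks the parameter norm relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 5
x = rng.standard_normal((N, K))
y = x @ rng.standard_normal(K) + rng.standard_normal(N)

# OLS: minimizes the squared error alone
beta_ols = np.linalg.solve(x.T @ x, x.T @ y)

# ridge (q = 2): adds the penalty lam * sum(beta_i ** 2); closed-form solution
lam = 10.0
beta_ridge = np.linalg.solve(x.T @ x + lam * np.eye(K), x.T @ y)

# the penalty shrinks the parameter norm towards zero
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```

The Lasso (q = 1) has no closed form and is typically solved by coordinate descent; the ridge case is shown here because its solution makes the shrinkage mechanism transparent.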
Penalized regressions, particularly those based on the ℓ1-norm (q = 1), have been extended to permit group shrinkage (Bondell & Reich, 2008; Tibshirani et al., 2005; Turlach et al., 2005; Yuan & Lin, 2006; Zeng & Figueiredo, 2013). A good review of available approaches is given in Bach et al. (2012). Group shrinkage typically aims to shrink disjoint or overlapping groups of variables towards zero, often requiring prior knowledge of groups. The HFR differs from these methods in that sparsity is not an objective and group compositions are estimated endogenously without the need for external structures.
Conceptually, group shrinkage can be achieved in a penalized regression framework, for instance, by generalizing the ridge regression to the following form (Hansen, 2019; van Wieringen, 2020):

β̂ = argmin_β { ||y − xβ||² + (β − β_0)′ Δ (β − β_0) },

where Δ governs the speed and direction of shrinkage for each parameter individually, and β_0 contains a shrinkage target for each parameter. The target values can be set in such a way as to induce group-wise shrinkage, by selecting the same shrinkage target for groups of variables, and specifying penalties in Δ on a group-specific basis. This requires a priori definitions of group compositions and target values, reducing its practicality.
A second broad class of regularization techniques are latent variable regressions. Examples include the principal components regression (PCR) described in Friedman et al. (2001), the partial least squares regression (PLSR) developed by Wold in the 1960s and 70s (see Wold (2001) and Martens (2001)), or, in the econometric setting, the dynamic factor model surveyed in Stock & Watson (2016a) and Stock & Watson (2016b). These methods reduce the dimensionality of the predictor set by removing low variance components in the case of principal components based methods, or components with a low response correlation in the case of PLSR (Jolliffe, 2002). Unlike penalized regressions, latent variable regressions are mostly unsupervised in their construction of latent factors. Some exceptions exist, for instance the aforementioned PLSR, or Bair et al. (2006), who introduce a (semi-)supervised PCR, by using a supervised process of pre-filtering the predictor set before performing principal components analysis.
The HFR constructs factors using a hierarchical transformation of the predictors. The concept of feature hierarchies has been applied in the machine learning domain to visual and text classification tasks, where general features (e.g. objects, phrases) are learned first, with subsequent fine-tuning for lower level representations (e.g. pixels, words) (Epshtein & Ullman, 2005; Girshick et al., 2014).
The HFR ports this concept to the linear regression setting, where the feature hierarchy can be exploited to increase the robustness of parameter estimates in a manner not unrelated to its role in learning invariant representations in text and image data. The HFR decomposes the data generating process (DGP) into a signal graph, estimating parameters for general (shared) signal patterns separately from the idiosyncratic contribution of each individual predictor.
Hierarchical clustering algorithms (a sub-field of unsupervised machine learning) present an approach to estimating the type of signal graphs used by the HFR, and have been applied in multiple domains, including financial time series (Di Matteo et al., 2004; León et al., 2017; Mantegna, 1999; Tola et al., 2008; Tumminello et al., 2010). Recent applications in the financial portfolio construction literature have resulted in an interesting conceptual pendant to the HFR (Lopez de Prado, 2016; Pfitzinger & Katzke, 2019; Raffinot, 2016). The authors find that portfolios of financial assets can be enhanced by replacing pairwise correlations with group-wise correlations of asset return series. This reasoning is not unlike the mechanism by which the HFR achieves more robust parameter estimates.

Syntax of feature hierarchies
Before introducing the HFR estimator, this section provides a brief overview of the graph theoretical concepts and definitions drawn on in the subsequent discussions.
A hierarchical representation is taken to mean the arrangement of predictors into clusters of two or more, which are merged at nodes to form higher levels.The predictors are the leaf nodes (i.e. they represent the lowest nodes in the hierarchy), while nodes at higher levels are called internal nodes.
The process of merging is repeated at each level until all predictors are contained within a single cluster called the root node. The node directly above any node is typically referred to as the parent node, while the nodes below are the children. Adjacent nodes that share a single parent are siblings. The chain of preceding parent nodes for any node is its branch.
Hierarchies can be depicted graphically in dendrograms, or mathematically in summing matrices.
Figure 3.1 portrays a simple hierarchy dendrogram of the illustration introduced in Section 3.2.
There are K = 4 predictors (leaf nodes), and two subsets grouping two predictors each. The root node completes the dendrogram. The corresponding hierarchy summing matrix S (right panel, Fig. 3.1) consists of D × K dimensions, where D = 7 is the total number of nodes and K = 4 is the number of predictors. S is invariant to the ordering of rows (i.e. child and parent nodes do not have to be arranged in any particular order). However, to simplify the discussion it is presented in a top-down order throughout this paper, starting with the root node and ending with the leaf nodes.
Hierarchies can be cut along the y-axis of the dendrogram by drawing a horizontal line at any height of Fig. 3.1. The nodes directly beneath the cut describe a level. In the discussions that follow, an arbitrary level is denoted ℓ, and L is the total number of levels. Fig. 3.2 shows a cut in the dendrogram and the summing matrix associated with that level. A predictor hierarchy conveys information about the interrelatedness of predictors, grouping similar predictors closely together. In the context of the HFR, coefficients on predictors whose paths merge within the hierarchy experience shrinkage towards a common target. The higher in the hierarchy the merge is located, the stronger the shrinkage. In Sections 3.2 to 3.4, the HFR is introduced under the assumption of a given optimal hierarchy, while Section 3.5 introduces an algorithm to estimate S.
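A minimal sketch of the summing matrix for the four-predictor example (the cluster compositions follow the description of Fig. 3.1; the row layout is the top-down ordering used throughout the paper):

```python
# summing matrix S for the K = 4 example: root, two pair clusters, four leaves
S = [
    [1, 1, 1, 1],   # level 1: root node
    [1, 1, 0, 0],   # level 2: internal node grouping {x1, x2}
    [0, 0, 1, 1],   # level 2: internal node grouping {x3, x4}
    [1, 0, 0, 0],   # level 3: leaf nodes (identity)
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
D, K = len(S), len(S[0])
assert (D, K) == (7, 4)

# cutting the dendrogram at a level keeps only that level's rows
level_rows = {1: [0], 2: [1, 2], 3: [3, 4, 5, 6]}
S_2 = [S[r] for r in level_rows[2]]
assert S_2 == [[1, 1, 0, 0], [0, 0, 1, 1]]
```

Each row flags the leaf nodes contained beneath a node, so a level cut is simply a row subset of S.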

A framework for group shrinkage
The hierarchical feature regression is introduced using a simple example, and following two steps: First, a decomposition of the ordinary least squares (OLS) estimator into a sequence of node-specific estimates in a hierarchical graph is proposed. Second, shrinkage is introduced to the levels of the graph, resulting in the HFR estimator. The simple example is eventually generalized in the subsequent sections.
To introduce the decomposition of the OLS estimator into hierarchical components, take again the setting described above with K = 4 standardized predictors, x = {x_i}_{i=1,...,N} ∈ R^K, which are clustered into one, two and four groups, resulting in the summing matrix in Fig. 3.1, assumed to represent an optimal graph. The matrix S can be divided into sub-matrices, denoted S_ℓ, that describe the individual levels within the feature hierarchy.
For the three levels in the example, with ℓ = 1, 2, 3, the sub-matrices are given by

S_1 = (1 1 1 1),   S_2 = (1 1 0 0; 0 0 1 1),   S_3 = I_4.

Here the lowest level (S_3) is an identity matrix containing the leaf nodes.
The level-specific hierarchical features are now defined as z_ℓ = x S_ℓ′, and the complete hierarchical feature set is given by z = x S′ = [z_1 z_2 z_3]. The hierarchical features, z, represent factor estimates of the common variance contained in the child features (i.e. the features associated with child nodes). Under the assumption that the covariance between the predictors' idiosyncratic components is low (such that the mean converges to zero), the sum of the predictors represents an estimate of the common component that is consistent up to a constant scale. See Stock & Watson (2016b) for a discussion of the role of feature averaging in factor estimation.
Using the level-specific factor estimates and the regression response variable y = {y_i}_{i=1,...,N} ∈ R, define the residual maker matrix M_ℓ = I_N − P_ℓ = I_N − z_ℓ (z_ℓ′ z_ℓ)^{−1} z_ℓ′, with M_0 = I_N. Here I_N is an N × N dimensional identity matrix. The role of M_{ℓ−1} is to partial out the effect of each node's branch, resulting in a regression that updates parameter estimates using only the new information introduced at each level. Note that in a nested hierarchical graph where each level contains strictly more information than the preceding levels, it holds that

M_{ℓ−1} M_{ℓ−2} · · · M_0 = M_{ℓ−1}.

Thus, the information of the entire branch can be partialled out using only M_{ℓ−1}. Now, with β̂_ols denoting OLS estimates for a regression of y on x, a top-down hierarchical decomposition of the OLS estimator for our problem is given by

β̂_ols = Σ_{ℓ=1}^{3} S_ℓ′ b̂_ℓ, (3.2)

where b̂_ℓ are level-specific estimates that account for the new variation introduced at level ℓ. The level-specific estimates are defined simply as the least squares estimates for z_ℓ conditional on the path of each node:

b̂_ℓ = (z_ℓ′ z_ℓ)^{−1} z_ℓ′ M_{ℓ−1} y. (3.3)

Proposition 1 stacks the above decomposition, and shows that the resulting estimates are numerically equivalent to OLS estimates:

Proposition 1. Consider a simple regression decomposition for the case of L = 3 with a given summing matrix S, hierarchical features z defined as above, Q_zz = (z′z) ∘ H as defined below, and Q_zy = z′y.
Now the coefficient estimates β̂ = S′ Q_zz^{−1} Q_zy represent optimal least squares estimates of the linear slope coefficients β of a regression of y on x.
The proof of Proposition 1 is given in Appendix A. Note that Q_zz can be written as Q_zz = (z′z) ∘ H, where ∘ is the element-wise multiplication operator, and H is a matrix of ones with the block-wise upper triangle set to zero:

H = ( 1 0 0 0 0 0 0 ;
      1 1 1 0 0 0 0 ;
      1 1 1 0 0 0 0 ;
      1 1 1 1 1 1 1 ;
      1 1 1 1 1 1 1 ;
      1 1 1 1 1 1 1 ;
      1 1 1 1 1 1 1 ).
The matrix H eliminates bottom-up conditional effects from the precision matrix, which are represented by the upper block-triangular entries. Conversely, the lower block-triangular entries represent conditional effects flowing down the hierarchy from the root node towards the leaf nodes (i.e. top-down effects).
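The equivalence stated in Proposition 1 can be checked numerically. The sketch below (numpy; the feature mapping z = xS′ and the block structure of H follow the definitions above) builds the example hierarchy and verifies that S′ Q_zz^{−1} Q_zy reproduces the OLS coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 50, 4
x = rng.standard_normal((N, K))
y = rng.standard_normal(N)

# summing matrix S (D x K): root, two pair clusters, four leaves
S = np.array([
    [1, 1, 1, 1],          # level 1: root
    [1, 1, 0, 0],          # level 2: cluster {x1, x2}
    [0, 0, 1, 1],          # level 2: cluster {x3, x4}
    [1, 0, 0, 0],          # level 3: leaves (identity)
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)
D = S.shape[0]

z = x @ S.T                # hierarchical features, one column per node

# H: ones with the block-wise upper triangle set to zero
sizes = [1, 2, 4]          # nodes per level
H = np.ones((D, D))
start = 0
for s in sizes:
    H[start:start + s, start + s:] = 0.0   # zero the blocks above each diagonal block
    start += s

Q_zz = (z.T @ z) * H       # element-wise product, as in the definition of Q_zz
Q_zy = z.T @ y
beta_decomposed = S.T @ np.linalg.solve(Q_zz, Q_zy)

beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]
assert np.allclose(beta_decomposed, beta_ols)   # Proposition 1: numerically identical
```

Because Q_zz is block lower-triangular with invertible diagonal blocks, the system can be solved level by level from the root downwards, which is exactly the chain interpretation of the decomposition.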
As shown in Appendix A, Proposition 1 is equivalent to the chain of level-specific estimates introduced in Eq. 3.2, with

β̂_ols = Σ_{ℓ=1}^{3} S_ℓ′ b̂_ℓ = S′ Q_zz^{−1} Q_zy.

In sum, therefore, the hierarchical decomposition consists of an additive chain of level-specific estimates (Eq. 3.2) that iteratively adjust for idiosyncratic variation in the fitting process in a top-down manner (Eq. 3.3), until at the final level (ℓ = 3 in the example) all explainable variation is accounted for. While this decomposition seems trivial at first glance, it can be used as the basis for a regularized regression. The HFR estimator shrinks the extent to which levels are permitted to adjust for new variation, resulting in estimates that are biased towards higher-level representations in the form of group targets for clusters of predictors, with lower levels not permitted to adjust fully to the variation contained in them.
In the simplest form, one could add a shrinkage coefficient to Eq. 3.2, such that

β̂_hfr = Σ_{ℓ=1}^{3} θ_ℓ S_ℓ′ b̂_ℓ, (3.5)

where θ_ℓ is the ℓth shrinkage coefficient, with 0 ≤ θ_ℓ ≤ θ_{ℓ−1} and 0 ≤ θ_1 ≤ 1. For instance, if θ_2 = θ_3 = 0 and θ_1 = 1, the estimates are reduced to β̂_hfr = S_1′ b̂_1, which is equivalent to a single group mean across all parameters. The monotonicity constraint on θ ensures that, given that the hierarchy represents a nested information set, information that is removed at one level is not subsequently reintroduced at a lower level. In the stacked form of Proposition 1, Eq. 3.5 introduces a shrinkage matrix, such that

β̂_hfr = S′ (z′z ∘ H ∘ Θ)^{−1} Q_zy. (3.6)

Here Θ governs the extent of shrinkage for each level ℓ. Eq. 3.6 is the HFR estimator under the assumption that the hierarchy (S) as well as the extent of shrinkage (Θ) are given. Since the framework permits the exclusion of entire levels from the regression (by setting θ_ℓ = 0), it can be used as a tool to select a parsimonious hierarchy based on a (potentially large) set of input levels. As shown in subsequent sections, this property will be useful for the estimation of S.
The framework described by Eq. 3.6 can become arbitrarily complex, including a large number of levels that permit a high degree of nuance with respect to the nature and strength of regularization.
At its core, however, it remains a decomposition of each parameter into a chain of parameters, which captures successively more idiosyncratic signal, and which is subsequently regularized, resulting in overall shrinkage towards a more general and less idiosyncratic representation of the data generating process.A key ingredient for this form of group shrinkage is determining the optimal extent of shrinkage for each hierarchical level (i.e. for the elements of the parameter chain).
Section 3.3 discusses an appropriate loss function that can be used to obtain optimal shrinkage coefficients.

Optimal shrinkage
Generalizing the definition in Eq. 3.6 to an arbitrary hierarchy, the hierarchical feature regression estimates, β̂_hfr, are given by

β̂_hfr = S′ (z′z ∘ H ∘ Θ)^{−1} z′y, (3.7)

where H, as before, is a matrix of ones and zeros, which ensures that the block-wise upper triangle of z′z is zero. Θ is a D × D matrix controlling the extent of shrinkage on a level-by-level basis.
Letting Θ_ℓ contain that subset of columns of Θ associated with nodes in level ℓ, Θ_ℓ is defined as a block of ones scaled by the inverse of the level-specific shrinkage coefficient, Θ_ℓ = θ_ℓ^{−1} 1. The extent of shrinkage is therefore governed entirely by the L × 1 vector of level-specific shrinkage coefficients, θ. When θ_ℓ = 1 there is no shrinkage, leading to the OLS solution as shown in Proposition 1. If any θ_ℓ < 1, the parameters associated with level ℓ are regularized, where θ_ℓ → 0 constitutes maximum shrinkage. Note that when θ_1 < 1, the entire parameter-norm is shrunken.
When S includes the maximum possible number of levels, with L = K so that each level comprises exactly one more cluster than the preceding level, the shrinkage vector θ has the useful property that its sum (i.e. the sum of all level-specific adjustments to new variation) is equal to the effective model size, as captured by the effective degrees of freedom (ν_eff):

Proposition 2. With an HFR projection matrix given by P_hfr = z(z′z ∘ H ∘ Θ)^{−1} z′, and the effective model degrees of freedom defined in the usual manner using the trace of the projection matrix, ν_eff = tr(P_hfr), it holds that

ν_eff = Σ_{ℓ=1}^{L} θ_ℓ

when L = K distinct levels are included in the hierarchy described by S.
The proof of Proposition 2 is given in Appendix B.
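Proposition 2 can be illustrated numerically. The sketch below assumes the model-average representation P_hfr = Σ_ℓ φ_ℓ P_ℓ with weights φ_ℓ = θ_ℓ − θ_{ℓ+1}, where P_ℓ projects onto the level-ℓ features; these weights are an assumption of this illustration (they follow from the transformation introduced in Section 3.4), not a formula stated here:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 30, 4
x = rng.standard_normal((N, K))

# L = K levels, each level adding exactly one more cluster
levels = [
    [[0, 1, 2, 3]],
    [[0, 1], [2, 3]],
    [[0, 1], [2], [3]],
    [[0], [1], [2], [3]],
]

def proj(z):
    """Projection matrix onto the column span of z."""
    return z @ np.linalg.solve(z.T @ z, z.T)

theta = np.array([1.0, 0.8, 0.5, 0.2])         # monotone non-increasing shrinkage
phi = theta - np.append(theta[1:], 0.0)         # assumed weights: phi_l = theta_l - theta_{l+1}

P_hfr = np.zeros((N, N))
for ell, clusters in enumerate(levels):
    z = np.column_stack([x[:, c].sum(axis=1) for c in clusters])
    P_hfr += phi[ell] * proj(z)

# effective degrees of freedom equal the sum of the shrinkage vector
assert np.isclose(np.trace(P_hfr), theta.sum())
```

Each level-ℓ projection contributes ℓ degrees of freedom (one per cluster), so tr(P_hfr) = Σ_ℓ ℓ φ_ℓ, which telescopes to Σ_ℓ θ_ℓ.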
With this definition in hand, an information theoretically motivated approach to the determination of an optimal shrinkage vector, θ*, is to impose a constraint on the effective model size. Defining a hyperparameter, κ, that represents the effective model size (normalized to a value between 0 and 1), the optimal shrinkage vector is the solution that maximizes fit subject to the constraint

ν_eff = Σ_{ℓ=1}^{L} θ_ℓ ≤ κK.

When κ = 1, the problem is unconstrained, with ν_eff = K and β̂_hfr = β̂_ols. Conversely, when κ < 1, the model fit is maximized given a predetermined value for ν_eff. Expressing the optimization in terms of the HFR loss, the optimal extent of shrinkage conditional on the hyperparameter κ is given by

θ*_κ = argmin_θ ||y − x β̂_hfr(θ)||²  subject to  Σ_ℓ θ_ℓ ≤ κK,  0 ≤ θ_ℓ ≤ θ_{ℓ−1},  0 ≤ θ_1 ≤ 1. (3.9)

Eq. 3.9 trades off goodness-of-fit against parsimony, where the hyperparameter κ tilts the global trade-off towards goodness-of-fit as κ → 1, or parsimony as κ → 0. Fig. 3.5 plots the complete dendrogram for the example problem, with L = 4 levels, where ℓ = 3 had been omitted previously for the sake of simplicity. The total height of the dendrogram is now exactly equal to the effective model size (κ = 1). In fact, the definition of κ ensures that the hyperparameter represents the overall size of the optimal HFR graph as a percentage of K, with a shallower hierarchy as κ → 0. The following section demonstrates how the optimal shrinkage vector θ*_κ can be obtained efficiently for any given value of κ using quadratic programming to solve Eq. 3.9.

Recasting the HFR as a model average
The HFR estimates in Eq. 3.7 can be restated as the dot product of level-specific estimates and a transformed shrinkage vector, such that

β̂_hfr = B̂ φ. (3.10)

Here B̂ stacks unconditional level-specific estimates (unconditional with respect to preceding levels in the hierarchy), such that

B̂ = [S_1′ ŵ_1, . . . , S_L′ ŵ_L], with ŵ_ℓ = (z_ℓ′ z_ℓ)^{−1} z_ℓ′ y.

Note that ŵ_ℓ is an unconditional counterpart to b̂_ℓ, where the effect of each node's branch has not been partialled out. Furthermore, φ is a transformation of θ that satisfies the equality θ = ω φ, where ω is a lower triangular matrix, resulting in

φ_ℓ = θ_ℓ − θ_{ℓ+1}, with θ_{L+1} ≡ 0.

The derivation of Eq. 3.10 is given in Appendix C, and follows directly from the introduction of shrinkage weights to the calculations in Appendix A. By reformulating the problem in an unconditional manner, ŵ_ℓ can be computed in parallel for each level, and the optimization of θ can be split into two consecutive steps: (i) estimating level-specific regressions (B̂), and (ii) constructing the optimal shrinkage hierarchy by optimizing φ.
Eq. 3.10 resembles a model-averaging estimator, where the models B̂ are averaged by the weights φ. Mallows model averaging (MMA), for instance, represents a close mathematical pendant, where the weighting vector is obtained by minimizing the Mallows information criterion (Hansen, 2007; Mallows, 1973). The optimal shrinkage problem of the HFR can correspondingly be thought of as the minimization of a custom information criterion (Eq. 3.9) to determine the optimal vector φ*_κ (which, by extension, yields θ*_κ). Importantly, the information theoretic model-averaging problem is quadratic in its weights (i.e. quadratic in φ), and can be solved efficiently using quadratic programming algorithms.
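A sketch of the model-average form: the columns of B̂ are taken here to be the level-specific estimates mapped back to the predictor space via S_ℓ′ ŵ_ℓ, with ŵ_ℓ the unconditional regression of y on z_ℓ (an interpretation assumed for this illustration). Putting all weight on the finest level recovers OLS, while all weight on the root yields a single group mean:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 60, 4
x = rng.standard_normal((N, K))
y = rng.standard_normal(N)

levels = [
    [[0, 1, 2, 3]],            # root
    [[0, 1], [2, 3]],          # two pair clusters
    [[0], [1], [2], [3]],      # leaves
]

B = []
for clusters in levels:
    S_l = np.zeros((len(clusters), K))
    for r, c in enumerate(clusters):
        S_l[r, c] = 1.0
    z_l = x @ S_l.T
    w_l = np.linalg.lstsq(z_l, y, rcond=None)[0]   # unconditional level regression
    B.append(S_l.T @ w_l)                          # map back to predictor space
B = np.column_stack(B)                              # K x L matrix of level estimates

beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]

# phi = (0, 0, 1): all weight on the leaf level reproduces OLS
assert np.allclose(B @ np.array([0.0, 0.0, 1.0]), beta_ols)

# phi = (1, 0, 0): all weight on the root gives one common group mean
root_model = B @ np.array([1.0, 0.0, 0.0])
assert np.allclose(root_model, root_model[0])
```

Intermediate weight vectors trace out estimates between these two extremes, which is exactly the averaging that the quadratic program over φ optimizes.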
Following this reasoning, the level-specific coefficients in B̂ are used to reformulate the optimization:

φ*_κ = argmin_φ ||y − x B̂ φ||²  subject to  (ω φ)′ 1 ≤ κK,  φ_ℓ ≥ 0,  Σ_ℓ φ_ℓ ≤ 1. (3.12)

Since the original shrinkage vector can be expressed as θ = ω φ, the constraints in Eq. 3.12 are identical to the constraints in Eq. 3.9, and θ*_κ = ω φ*_κ. Note that the monotonicity constraint collapses to a simple weight constraint on φ.
The HFR estimates given by β̂_hfr = B̂ φ*_κ have thus far assumed a given hierarchy, encoded in S. The aim of the HFR is to estimate S in a supervised manner, which conceptually requires selecting the composition of predictor groups at each level that minimizes Eq. 3.9. This is a computationally intractable combinatorial problem. Instead, the following section suggests a feasible and computationally efficient algorithm for arriving at a graph estimate based on the similarity of the predictors' explanatory structure in y, using supervised hierarchical clustering.

Graph estimation
The graph-based decomposition of linear regression parameters introduced in the preceding sections assumes a hierarchical arrangement of predictors into L = K levels that are captured in S. Here S contains the maximum number of levels possible in a nested hierarchical tree, while θ*_κ selects a parsimonious hierarchy by reducing the weight of individual levels, or removing levels from the hierarchy entirely. In order to estimate S, I propose a supervised hierarchical clustering algorithm that merges variables based on the similarity of their explanatory component with respect to y.
A typical approach to (unsupervised) hierarchical clustering constructs a dissimilarity matrix D that encodes information about the predictor set (e.g. the (inverse) pairwise correlation coefficients, or distances), and recursively merges the predictors or clusters with the lowest cluster distance (Maimon & Rokach, 2010). The aim of a supervised rendition of a hierarchical clustering algorithm is to merge those clusters that maximize the goodness-of-fit of a regression of y on the appropriate cluster features z_ℓ at each level ℓ. Two predictors or clusters are deemed similar if merging them leads to a comparatively small increase in the regression error, or conversely, a comparatively small decline in the goodness-of-fit.
Consider the previous example of a regression of y on four predictors x, with the estimated regression fit given by

ŷ = β̂_1 x_1 + β̂_2 x_2 + β̂_3 x_3 + β̂_4 x_4.

A merge of any two predictors i, j results in

ŷ = β̂_{ij} (x_i + x_j) + x_{−ij} β̂_{−ij}, (3.14)

where x_{−ij} contains all remaining predictors.
A merge is therefore akin to the imposition of an equality constraint on the associated coefficients β_i and β_j. This equality constraint is least costly (in terms of goodness-of-fit) when the conditional effect of x_i and x_j on y is similar. That is, when

r_{x_i, y | x_−ij} ≈ r_{x_j, y | x_−ij}. (3.15)

Here, r_{x_i, y | x_−ij} is the partial correlation between x_i and y conditional on x_{−ij}.
An intuitively appealing and computationally feasible alternative to the estimation of regression fits for each possible cluster combination (Eq. 3.14) is therefore to examine the similarity of the partial correlation coefficients. If the within-cluster variance of partial correlations is small (Eq. 3.15), the cost of the equality constraint, and, by extension, the reduction in goodness-of-fit, can be expected to be low. Ward (1963) outlines an agglomerative clustering algorithm that achieves just this: merging clusters based on the minimum additional within-cluster variance introduced by the merge. The author shows that the approach can be reduced to a clustering based on the Euclidean distances between the input vectors. The algorithm begins by placing each row in D into a cluster of its own, and iteratively merges those clusters that result in the minimum increase in overall within-cluster variance. Clusters are merged a total of K − 1 times, until all rows in D are contained in a single cluster, and L = K levels have been formed. A detailed description of Ward (1963) clustering can be found in Kaufman & Rousseeuw (2005) and Everitt et al. (2011). The algorithm is implemented using the cluster package in the statistical computing language R (Maechler et al., 2019; R Core Team, 2018).
Substituting partial correlations for D results in a supervised hierarchical clustering algorithm.
However, since conditioning on x_{−ij} is at best imprecise and at worst unfeasible in the high-dimensional setting, an approximate supervised dissimilarity matrix can instead be defined based on bivariate partial correlations, such that

[D_y]_{ij} = r_{x_i, y | x_j}, i ≠ j. (3.16)

Note that diag(D_y) is undefined so that, letting d_i denote the ith row, ||d_i − d_j|| measures the distance between {r_{x_i, y | x_k}}_{k ∉ {i,j}} and {r_{x_j, y | x_k}}_{k ∉ {i,j}} (i.e. the distance between the bivariate partial correlations conditioning on all predictors in x_{−ij} individually).
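A sketch of the bivariate partial-correlation dissimilarity in numpy; the resulting row distances would be passed to a standard Ward implementation (e.g. the cluster package in R, as described above). The data-generating design with two latent factors is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 4
# two clusters of predictors sharing signal: (x1, x2) load f1, (x3, x4) load f2
f1, f2 = rng.standard_normal((2, N))
x = np.column_stack([
    f1 + 0.5 * rng.standard_normal(N),
    f1 + 0.5 * rng.standard_normal(N),
    f2 + 0.5 * rng.standard_normal(N),
    f2 + 0.5 * rng.standard_normal(N),
])
y = f1 + f2 + 0.5 * rng.standard_normal(N)

R = np.corrcoef(np.column_stack([x, y]), rowvar=False)
r_xy = R[:K, K]        # r(x_i, y)
r_xx = R[:K, :K]       # r(x_i, x_k)

# bivariate partial correlations r(x_i, y | x_k), one conditioning variable at a time
P = np.full((K, K), np.nan)    # diagonal left undefined, as in the text
for i in range(K):
    for k in range(K):
        if i == k:
            continue
        P[i, k] = (r_xy[i] - r_xx[i, k] * r_xy[k]) / np.sqrt(
            (1 - r_xx[i, k] ** 2) * (1 - r_xy[k] ** 2))

def dissim(i, j):
    """Distance between rows i and j, excluding the columns i and j."""
    mask = np.ones(K, dtype=bool)
    mask[[i, j]] = False
    return np.linalg.norm(P[i, mask] - P[j, mask])

# predictors sharing a latent factor have similar explanatory content
assert dissim(0, 1) < dissim(0, 2)
assert dissim(2, 3) < dissim(1, 3)
```

Rows of P play the role of the d_i vectors: predictors that explain y through the same latent component end up close together and are merged early by the Ward algorithm.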
The matrix D_y results in a sign-sensitive clustering of parameters (positive and negative coefficients tend to be clustered separately). However, at the highest levels in the hierarchy, clusters will invariably contain effects with mixed signs. To ensure sign-invariance, with shrinkage towards absolute group targets, each row S_i of the unadjusted (positive-only) summing matrix S+ is sign-adjusted using the correlation matrix ρ = cor(x). This ensures that when coefficients with opposite signs are contained in a single cluster, their effect is mirrored and not averaged. 5 The combination of Ward (1963) clustering and partial correlations between y and x produces a supervised hierarchical clustering algorithm that merges clusters based on the within-cluster variance of the partial correlations, a method that is analogous to the minimization of the cost of the hierarchical constraint encoded in S. Since the hierarchical constraint increases the regression error at each merge, its minimization is analogous to a selection of cluster-splits using a goodness-of-fit criterion, but can be implemented within the efficient framework of agglomerative clustering algorithms.

Deterministic terms, standard errors and further issues
The preceding discussions have abstracted from deterministic elements in the regression.Including these is exceedingly simple, and can be achieved by adjusting the level-specific regressions in B.
Letting M = {M_i}_{i=1,...,N} ∈ R^M be a matrix of M deterministic elements (e.g. a vector of ones), with the associated parameter estimates m̂, the level-specific regression becomes a regression of y on the augmented feature set z̃_ℓ = [M, z_ℓ], and S̃ expands S with an identity block for the deterministic elements, such that

S̃ = ( I_M  0 ;  0  S ).

Since deterministic elements are exogenous to the estimation of the hierarchy, the corresponding parameters are not regularized. Apart from a regression constant, deterministic elements can include statistical features such as trends or dummy variables, or simply predictors that, for one reason or another, are better represented outside of the feature hierarchy. All applications in this paper contain a deterministic element in the form of a regression constant.
The analogy of the HFR to a model average over level-specific regressions can furthermore be extended to obtain approximate standard errors of the parameter estimates. Since level-specific standard errors, ŝe(ŵ_ℓ), are readily retrieved from the level-specific regressions, the average standard errors ŝe(β̂_hfr) can be obtained following Burnham & Anderson (2004), with

ŝe(β̂_hfr) = Σ_ℓ φ_ℓ [ ŝe(S_ℓ′ ŵ_ℓ)² + (S_ℓ′ ŵ_ℓ − β̂_hfr)² ]^{1/2},

where the weighted average parameters ŵ φ are simply the HFR estimates β̂_hfr. It is important to note that for purposes of inference ŝe(β̂_hfr) are understated. For instance, the graph estimation error embedded in S is omitted entirely. Nonetheless, the standard errors provide valuable information about the average significance along the branch of each variable in the hierarchy, and can be useful to prune noise clusters and to inform sparse model selection, as illustrated in the following section. Once again, in the absence of shrinkage, with θ_ℓ = 1, the standard errors ŝe(β̂_hfr) are equivalent to the standard errors of the OLS regression.

(Footnote 5: As an aside, the HFR can be made entirely sign-invariant, permitting negatively correlated predictors with a similar explanatory effect on y, albeit with opposite signs, to be clustered adjacently. This is achieved by using the absolute partial correlation matrix, |D_y|. Such an approach is useful when the sign is not deemed to convey meaningful information, with only the absolute size of the coefficients being relevant.)
An additional tool in understanding the role of the optimal parameter graph is to examine the level-wise decomposition of the coefficient of determination. Letting the model fit up to the ℓ-th level be given by ŷ_→ℓ, the cumulative coefficient of determination can be defined in the usual manner, with R²_→ℓ = 1 − Σ_i (y_i − ŷ_→ℓ,i)² / Σ_i (y_i − ȳ)². (3.21) When ℓ = L, this simply results in the total R² of the HFR fit. However, the level-wise formulation in Eq. 3.21 also yields the contribution of each individual level to the overall coefficient of determination, given by the increment R²_→ℓ − R²_→ℓ−1. In the plots in Section 4, the level contributions are added to the dendrograms as bars, with darker colors suggesting a larger contribution of that level, as illustrated in Fig. 3.6. As a final issue, the discussion has thus far assumed K < N − M. When K ≥ N − M, the level-specific regressions for all ℓ ≥ N − M cannot be computed. Since the lowest levels group predictors with the highest similarity, the simplest remedy is to prune all levels where ℓ ≥ N − M.
This leaves a total of L = N − M − 1 levels with no effect on the structure of the HFR, with the sole exception that the constraint in Eq. 3.12 substitutes κ(N − M − 1) for κK. An implementation of the HFR algorithm and of the issues discussed in this paper is provided in the hfr package, available on the Comprehensive R Archive Network (CRAN) for the statistical computing language R (Pfitzinger, 2022).
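The additivity of the level contributions can be checked numerically. The sketch below uses a sequence of nested OLS fits as a simplified stand-in for the HFR levels (here each "level" just adds one predictor) and verifies that the level-specific increments sum to the total R²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 60, 4
X = rng.normal(size=(n, k))
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=n)

def fitted(Z, y):
    """OLS fitted values for a design Z augmented with an intercept."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return Z1 @ beta

tss = np.sum((y - y.mean()) ** 2)

# Cumulative R^2 up to each nested "level" (Eq. 3.21 in spirit).
r2 = [1 - np.sum((y - fitted(X[:, :l], y)) ** 2) / tss
      for l in range(1, k + 1)]

# Level-specific contributions are the increments of the cumulative R^2;
# by construction (a telescoping sum) they add up to the total R^2.
contrib = np.diff([0.0] + r2)
print(np.allclose(contrib.sum(), r2[-1]))  # True
```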

A case study: Determinants of economic growth
The HFR is useful both as a regression estimator and as a tool to garner insights into the effect structure underlying the estimated statistical model. In this section, I propose an analysis workflow that uses the HFR to understand an empirical problem and to obtain robust out-of-sample predictions. The data is taken from Sala-I-Martin et al. (2004), who, in their seminal paper on the determinants of economic growth, compile a cross-country data set comprising GDP per capita growth rates between 1960-1996 for a sample of 88 countries, alongside 67 potential explanatory variables. The data set has become a workhorse for testing high-dimensional regression techniques, particularly in the Bayesian literature (Eicher et al., 2011; Hofmarcher et al., 2011; Ley, 2008; Sala-I-Martin et al., 2004; Schneider & Wagner, 2012). The econometric techniques that have been employed include Bayesian model averaging, as well as various model selection and shrinkage methods such as the ElasticNet and Lasso estimators. A description of the variables contained in the data set is provided in Table D.1.6 As a starting point, Fig. 4.2 depicts hierarchical graphs for 4 different settings of κ, the hyperparameter governing the size of the optimal graph. The unconstrained regression graph is plotted in the top-left panel, with a total height of 67 (κ = 1) and each level contributing to a maximum extent (θ = 1). The graph is highly complex, reflecting the dimensionality of the problem, and is difficult to interpret in a meaningful manner. The regression coefficients themselves, which are represented by the leaf nodes, are estimated with substantial variance (see Fig. 4.1).
6 Since the HFR as well as the benchmark methods require the ranges of the input variables to be similar, the 67 predictors in the data set are scaled to an interval of [−1, 1]. Dummy variables are normalized to a range of [−0.5, 0.5]. This is done to dampen the otherwise overstated effect of the dummy variables in the hierarchy. The GDP per capita growth variable is not transformed, to ensure that a comparison to previous research is possible. All specifications discussed in this section include an intercept term.

In contrast to the complex unconstrained structure, Fig. 4.3 displays the estimated optimal shrinkage tree for the regression. The height of the tree is 10.1, with ν_eff = 10.1 + 1 determined using a 10-fold cross-validation procedure. The distance between the levels reflects the shrinkage weights θ_ℓ, and the vertical bar on the right is shaded based on the contribution of each level to the overall coefficient of determination of the HFR fit. Fig. 4.3 suggests that the primary contribution to model fit is derived from the upper levels.
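A minimal sketch of the preprocessing described in the footnote above, scaling continuous predictors to [−1, 1] and dummies to [−0.5, 0.5]. The function name and column handling are illustrative, not taken from the hfr package:

```python
import numpy as np

def scale_growth_predictors(X, dummy_cols=()):
    """Min-max scale continuous columns to [-1, 1] and shift binary
    dummy columns to [-0.5, 0.5], mirroring the described preprocessing."""
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        unit = (X[:, j] - lo) / (hi - lo)  # map to [0, 1]
        X[:, j] = unit - 0.5 if j in dummy_cols else 2.0 * unit - 1.0
    return X

# Column 0 is continuous, column 1 is a 0/1 dummy.
X = np.array([[0.0, 1.0], [5.0, 0.0], [10.0, 1.0]])
print(scale_growth_predictors(X, dummy_cols={1}))
```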
Examining the level-wise contributions directly in Fig. 4.4 shows that only the first 18 levels contribute to the fitting process, and the first four levels account for over 85% of the explained variation. The plot is analogous to scree plots produced for principal components regressions, with the summation over the level-specific contributions yielding the total R² of the HFR fit. As illustrated in the bottom-right panel of Fig. 4.2, the first four hierarchical levels divide the sample into four latent signal factors that explain a significant portion of the response variation.
The factors appear to identify regional or topical sub-clusters, as well as consolidated noise components. The first cluster contains variables that identify the East Asian region (e.g. BUDDHA, CONFUC, EAST). The second cluster appears to group mostly institutional quality measures and some related variables (e.g. H60, CIV72, OPENDEC1, ECORG). The third cluster groups variables that presumably identify developing economies (e.g. MALFAL66, SAFRICA, TROPPOP) and several closely related economic measures (e.g. RERD, IPRICE1, PRIEXP70). Finally, the fourth cluster contains a large group of variables with coefficients close to zero, suggesting that these measures represent primarily noise components.
The fact that the upper clusters enter with a much higher importance than their corresponding leaf nodes may suggest that the common (as opposed to the idiosyncratic) information in the predictor groups determines growth disparities. For instance, rather than malaria prevalence entering as a growth determinant in its own right, the variable (MALFAL66) helps to identify an underlying geographic factor that drives economic growth. Fig. 4.5 displays the most important variables identified by the BACE and BEN models, as well as all HFR coefficients with an indicative p-value < 0.05.7 As an auxiliary comparison, the model selected using a Lasso estimator is also displayed.8 The HFR identifies a total of 14 growth determinants grouped into two blocks: those associated with cluster one and those associated with cluster three. The variable set closely resembles related studies (with 12 of 14 overlapping drivers), but reflects the clustering inherent to the HFR. The model selected using the Lasso is almost identical to the HFR, with all but one of the growth determinants taken from the two relevant effect clusters discovered by the HFR. A key consideration for the validity of the uncovered model and the quality of the HFR estimates is the method's predictive performance. In order to assess this systematically, I employ a sampling setup closely resembling Hofmarcher et al. (2011). Observations are randomly sampled to form training, validation and testing sets containing 68/10/10 observations, respectively.9 Parameters are estimated using the training sample, hyperparameters are determined via a grid search minimization of the validation MSE, and the performance is calculated as the test sample MSE. Samples are drawn in 500 iterations with hyperparameters determined independently in each run. The benchmark methods include penalized regressions in the form of the ridge regression, Lasso, Adaptive Lasso (AdaLasso) and ElasticNet, latent variable regressions in the form of PCR and PLSR, and finally OLS.
In addition, Table 4.1 displays the distribution of the MSEs alongside the results of the BACE and the BEN. The estimation of BACE and BEN is not replicated; instead, the results are taken directly from the table presented in Hofmarcher et al. (2011), page 10.
9 These proportions are roughly equivalent to those used in Hofmarcher et al. (2011), but with the addition of a validation sample, which is obtained by reducing the size of both the training and testing samples slightly.

10 Ridge, Lasso and ElasticNet are implemented using the glmnet package in the statistical computing language R, described in Friedman et al. (2010). For a discussion of the AdaLasso, see Zou (2006). PCR and PLSR are implemented using the pls package in the statistical computing language R, described in Mevik & Wehrens (2019).

The HFR outperforms all benchmark methods, with the ElasticNet regression achieving the highest mean accuracy among the panel of non-Bayesian benchmarks in Fig. 4.6. When compared to the performance of the BACE and BEN models, the HFR is again found to achieve lower mean and median prediction errors. The results provide justification for the approach taken by the HFR, suggesting that the aggregation of growth determinants into a low-dimensional set of latent factors is indeed appropriate.
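The evaluation protocol (random 68/10/10 splits, validation-based grid search, test MSE over repeated draws) can be sketched as follows. A closed-form ridge regression stands in for the full panel of estimators, and all names are illustrative:

```python
import numpy as np

def ridge_beta(X, y, lam):
    """Closed-form ridge coefficients (no intercept, for brevity)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def repeated_split_mse(X, y, grid, n_iter=100, sizes=(68, 10, 10), seed=0):
    """Random train/validation/test splits; the hyperparameter is chosen
    per run by validation MSE, and performance is scored as test MSE."""
    rng = np.random.default_rng(seed)
    n_tr, n_va, n_te = sizes
    out = []
    for _ in range(n_iter):
        idx = rng.permutation(len(y))
        tr = idx[:n_tr]
        va = idx[n_tr:n_tr + n_va]
        te = idx[n_tr + n_va:n_tr + n_va + n_te]
        lam = min(grid, key=lambda l: np.mean(
            (y[va] - X[va] @ ridge_beta(X[tr], y[tr], l)) ** 2))
        beta = ridge_beta(X[tr], y[tr], lam)
        out.append(np.mean((y[te] - X[te] @ beta) ** 2))
    return np.array(out)

# Synthetic stand-in for the 88-country growth data set.
rng = np.random.default_rng(1)
X = rng.normal(size=(88, 20))
y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=88)
mse = repeated_split_mse(X, y, grid=[0.01, 0.1, 1.0, 10.0])
print(mse.mean())
```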
In sum, the HFR offers a dual benefit: (i) it generates robust out-of-sample predictions, while (ii) the parsimonious hierarchy in which the estimates are embedded produces meta-insights about the underlying latent signals that explain observed response variation. In the case of the determinants of economic growth, several regional and topical sub-clusters may suffice to offer robust explanations of observed growth disparities. This ability to distinguish between the types of explanatory variation (shared or idiosyncratic) within a fully supervised framework is unique to the HFR, and can provide valuable additional insight with respect to the data generating process. The following section tests the generality of the observed predictive accuracy in simulated conditions.

Simulations
I use four simulations, largely replicated from related work, that cover different types of regression tasks to compare the performance of the HFR to similar methods. The benchmark methods include penalized regressions in the form of the ridge regression, Lasso, Adaptive Lasso (AdaLasso) and ElasticNet, latent variable regressions in the form of PCR and PLSR, and finally OLS.11 The simulations show that each benchmark method is particularly well suited to certain regression tasks and poorly suited to others, as is generally observed in the related literature (Tibshirani, 1996). The HFR, by contrast, exhibits a high degree of versatility, outperforming or matching the benchmark methods in all simulations. Three of the simulations are based on Tibshirani (1996) and Zou & Hastie (2005) and have been applied occasionally in similar research (Bondell & Reich, 2008). The final simulation is new.
Data is simulated from the true model y = Xβ + ε, with ε ∼ N(0, σ²I). Observations are divided into training, validation and testing samples, where the training sample is used to estimate the models, the validation sample is used to determine optimal hyperparameters, and the testing sample is used for performance evaluation. Model performance is assessed by calculating the mean squared error (MSE) over the testing sample. Sample sizes are denoted by •/•/•, where the dots represent training, validation and testing samples, respectively. In each case, the results of 500 simulation runs are plotted.
Hyperparameters include κ for the HFR, the size of the penalty (λ) for the penalized estimators (ridge, Lasso, AdaLasso, ElasticNet), the mixing parameter (α) for the ElasticNet, and the number of latent components for the PCR and PLSR. Optimal hyperparameter values are determined using an extensive grid search, with selection based on the minimum validation MSE. Hyperparameter tuning is performed individually for each method in each simulation run.

11 Ridge, Lasso and ElasticNet are implemented using the glmnet package in the statistical computing language R, described in Friedman et al. (2010). For a discussion of the AdaLasso, see Zou (2006). PLSR and PCR are implemented using the pls package in the statistical computing language R, described in Mevik & Wehrens (2019).

Simulation setup
Simulation (a) is taken from Tibshirani (1996), where it was originally used to demonstrate the performance of the ridge regression.
As before, the pairwise correlation between x_i and x_j is 0.5^|i−j|, and σ² = 15. The sample size is set to 100/100/400. Since the DGP is sparse, the task is likely to be solved well with a Lasso, AdaLasso or ElasticNet. Variations on this simulation are used in Section 5.4 to explore scenarios for which the HFR is less suitable.
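The correlated designs used throughout this section can be drawn from a multivariate normal with correlation 0.5^|i−j| between columns i and j. A minimal sketch (the Gaussian design is an assumption consistent with the cited simulation studies):

```python
import numpy as np

def simulate_design(n, k, rho=0.5, seed=0):
    """Draw an n-by-k Gaussian design whose columns have pairwise
    correlation rho^|i-j|, via a Cholesky factor of the correlation matrix."""
    rng = np.random.default_rng(seed)
    idx = np.arange(k)
    sigma = rho ** np.abs(np.subtract.outer(idx, idx))  # AR(1)-style matrix
    return rng.normal(size=(n, k)) @ np.linalg.cholesky(sigma).T

X = simulate_design(100, 8)
```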
Simulation (c) is taken from Zou & Hastie (2005), who study the effect of grouped predictors.

Trace plots
In order to explore the shrinkage behavior of the HFR, Fig. 5.3 presents coefficient trace plots for the HFR alongside PLSR, ridge and Lasso regressions.
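Trace plots of this kind track each coefficient as the hyperparameter varies. The sketch below computes an analogous trace for a ridge regression (a stand-in only: the HFR's κ-trace requires the hfr package itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 8
X = rng.normal(size=(n, k))
y = X @ np.linspace(0.5, 2.0, k) + rng.normal(size=n)

# One row of coefficients per hyperparameter value: as lambda grows,
# every coefficient path is pulled towards zero.
lambdas = np.logspace(-2, 3, 30)
paths = np.array([
    np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y) for lam in lambdas
])
print(paths.shape)  # (30, 8)
```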
Here the second and fourth parameter blocks represent evenly spaced sequences on the intervals [1, 3] and [−3, −1], respectively. This requires the HFR to construct a shallower hierarchy, since the effect of the predictors on y is more heterogeneous and more idiosyncratic information must be included. A shallower hierarchy limits the feasible extent of regularization, and results in higher effective degrees of freedom.
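Since the exact block sizes of Simulation (e) are not restated here, the following sketch assumes four blocks of five coefficients each (a hypothetical choice), with blocks two and four set to the evenly spaced sequences described above:

```python
import numpy as np

m = 5  # assumed block size (hypothetical; see lead-in)
beta = np.concatenate([
    np.zeros(m),              # block 1: left at zero purely for illustration
    np.linspace(1, 3, m),     # block 2: evenly spaced on [1, 3]
    np.zeros(m),              # block 3: left at zero purely for illustration
    np.linspace(-3, -1, m),   # block 4: evenly spaced on [-3, -1]
])
print(beta[m:2 * m])  # [1.  1.5 2.  2.5 3. ]
```

Within-block heterogeneity of this kind is what forces the hierarchy to retain more idiosyncratic (lower-level) information.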
Simulation (f) is identical to Simulation (b), but with a pairwise correlation between x_i and x_j of 0.5. The high degree of correlation between noise and signal predictors results in an extremely noisy D_y matrix that cannot be clustered in any meaningful manner. The HFR can thus only poorly distinguish between signal and noise predictors, and hierarchy construction becomes essentially random.
The results are plotted in Fig. 5.4 and Table 5.2. In Simulation (e), the variability of all methods increases; however, the HFR again outperforms the benchmarks, suggesting that the method can achieve good out-of-sample results even when the scope for shrinkage is reduced. In Simulation (f), the HFR exhibits an average performance, roughly on par with the PCR and PLSR, but worse than the penalized regressions, suggesting that significant value is added by a meaningful clustering of predictors into hierarchical groups. In sum, the simulations make a compelling case for the use of the HFR estimator. The versatility of the method across a spectrum of different types of regression tasks is a key strength when compared to the benchmarks, which are typically tailored to serve specialized purposes.

Concluding remarks
Prediction tasks with high-dimensional multicollinear predictor sets are challenging for least squares based fitting procedures, and a large, productive literature exists advancing various regularized approaches to address the issue. The HFR is a novel contribution to this body of knowledge, presenting a method of shrinking coefficients towards group targets along the branches of an optimal predictor graph. Given a hyperparameter, which is conveniently interpreted as the effective model size and bounded between 0 and 1, the HFR is able to estimate both a supervised graph, as well as the optimal regularized coefficients associated with that graph.
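As a sketch of this interpretation, using hypothetical shrinkage weights and the convention from Section 4 that the effective degrees of freedom equal the graph height plus the deterministic intercept:

```python
# Hypothetical level shrinkage weights theta_l in [0, 1]; the graph height
# (the sum of the weights) equals kappa * K, so kappa = height / K.
theta = [1.0, 0.8, 0.3, 0.05]
K = 4
height = sum(theta)   # 2.15
kappa = height / K    # effective model size, bounded in (0, 1]
nu_eff = height + 1   # plus one intercept term, as in Section 4
print(kappa, nu_eff)
```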
The characteristics of the HFR make it particularly well-suited to regression applications with an underlying hierarchical or grouped data generating process, such as high-dimensional factor modeling in econometric analysis (e.g. nowcasting with dynamic factor models) or in finance (e.g. multifactor asset pricing). Applications similar to the gene selection problem discussed in Zou & Hastie (2005) may also prove particularly suitable to the HFR. The ability to plot the estimated hierarchy and explore the effect of individual clusters or levels in the regression provides a wealth of auxiliary insights into the underlying effect structure.
Both the empirical case study and the simulations presented in this paper suggest that the HFR provides an interesting complement to widely used regularized regression algorithms such as the Lasso or PLS regressions. The HFR achieves lower out-of-sample prediction errors than a panel of benchmark methods across a spectrum of different regression tasks, recommending it both in terms of performance and versatility. The method can be thought of as a structured hybrid between a penalized regression and a supervised latent factor regression, combining some benefits of both classes of algorithms with potentially good performance across a wider range of data generating processes.
Proposition 1 can be shown to hold by demonstrating the equivalency to ordinary least squares coefficients. Expanding the regression equation and calculating the inverse product results in Eq. A.4, where Q_ij = z_i' z_j and Q_iy = z_i' y. Using the definition of the projection matrix P_i = z_i Q_ii^{-1} z_i', and substituting the definitions of Q_ij and Q_iy, the above can be simplified accordingly. In addition, φ is the transformation of the vector of shrinkage weights described in Section 3.4.
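For intuition, in the two-level case the level-specific estimates take a partialled-out form consistent with the definitions of Q_ij and P_i above (a sketch; the general level-ℓ estimate partials out all preceding levels):

```latex
\hat{w}_1 = S_1 Q_{11}^{-1} z_1^{\top} y, \qquad
\hat{w}_2 = S_2 Q_{22}^{-1} z_2^{\top} (I - P_1)\, y, \qquad
P_1 = z_1 Q_{11}^{-1} z_1^{\top}.
```

The Frisch-Waugh-Lovell logic of the second expression is what makes the stacked system equivalent to OLS in the absence of shrinkage.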
Eq. C.2 begins by restating Eq. A.2 with shrinkage weights. Calculating the inverse and multiplying out in a manner analogous to Eqs. A.3 & A.4 in Appendix A reduces the expression accordingly. With the addition of an arbitrary number of levels, this result generalizes to the definition presented in Section 3.4.

Figure 3.2: Example of a cut hierarchy dendrogram (left) and the corresponding level-specific summing matrix (right).

Fig. 3.3 plots a dendrogram of the decomposition, expanding the root node such that each level is represented by a band of unit width. As shown later in the section, the width of each level-specific band will come to represent the proportion to which information introduced at that level is incorporated into the HFR estimates. Each b_ℓ adjusts the coefficients based on the new cluster information at level ℓ, with a single cluster at b_1, two clusters at b_2 and four clusters at b_3.

Fig. 3.4 plots two shrunken dendrograms with the degree of shrinkage represented by the distance between two levels and equal to θ_ℓ. The left panel of Fig. 3.4 represents moderate shrinkage, while the right panel removes an entire level.

Figure 3.5: Dendrogram of the level-wise decomposition of the OLS estimator with L = K levels.

Figure 3.6: Illustrative dendrogram of the level-wise decomposition of the HFR estimates with level-specific contributions on the right.
As demonstrated in Fig. 4.1, the substantial variance of the unconstrained estimates highlights the need for a regularized approach. The remaining panels of Fig. 4.2 show different degrees of shrinkage, leading to successively simpler hierarchies. Each lower value of κ increases the strength of shrinkage (and hence the parameter bias), while in turn decreasing the variability of the estimates.

Figure 4.1: Approximate standard errors of the HFR estimates using four different settings for κ (log scale). As the bias of the estimates increases with lower κ, the variance decreases. The standard errors represent weighted averages over the level-specific standard errors, as described in Section 3.6.

Figure 4.2: Dendrograms for the HFR coefficients obtained from four different settings for κ. The top-left panel represents the unconstrained case, while the bottom-left panel uses the hyperparameter that minimizes a 10-fold cross-validated MSE.

Figure 4.3: Dendrogram for the HFR coefficients for the hyperparameter that minimizes a 10-fold cross-validated MSE.

Figure 4.4: Cumulative R² over the levels in the optimal hierarchy (dashed line), and level-specific contributions to R² (bars). Only levels ℓ < 25 are plotted; the remaining levels do not contribute to the model fit.
Examining the individual growth drivers more closely, Fig. 4.5 displays the most important variables identified by Sala-I-Martin et al. (2004) (BACE) and Hofmarcher et al. (2011) (BEN), as well as all HFR coefficients with an indicative p-value < 0.05.

Figure 4.5: Model selection using the approximate p-values of the HFR coefficients compared to BACE, BEN and Lasso regressions.

Figure 4.6 plots the MSE and the average rank for the HFR and a panel of benchmark methods.

Figure 4.6: Comparison of prediction accuracy of HFR, Ridge, PLSR, PCR, ElasticNet, Lasso, and AdaLasso for GDP per capita growth from 1960-1996. MSE (left panel), rank of estimators (right panel). Statistics plotted as mean and 95% confidence interval based on 500 random training, validation and testing samples. Prediction errors are multiplied by 1e4.

performance of the ridge regression. True parameter values are set to β_j = 0.85, ∀ j = 1, ..., K, with K = 8. The sample size is 20/20/200, σ² = 3 and the pairwise correlation between x_i and x_j is 0.5^|i−j|. Simulation (b) is again based on Tibshirani (1996) and is a sparse regression used to illustrate the Lasso's ability to eliminate noise features. There are 40 predictors with parameters set to β = (0, ..., grouped feature selection task. The sample consists of 50/50/400 observations and 40 1), i = 11, ..., 15, x_i ∼ N(0, 1), i = 16, ..., 40. The simulation is designed to illustrate the ability of the ElasticNet to deal with grouped variables and variable selection simultaneously, and should therefore see the ElasticNet performing well. Simulation (d) is designed to test predictive performance in the presence of latent factors. The sample consists of 20/20/200 observations. Simulation (d) draws from a true model y = f β +

Figure 5.1 plots the model accuracy for Simulations (a) to (d). The HFR outperforms or closely matches the benchmarks in all simulations. Good performance in the cases when no predictor groups exist in the true DGP (Simulations (a) & (b)), or when an overlapping grouping structure exists (Simulation (d)), illustrates the versatility of the HFR in estimating robust parameters. The feature selection tasks (Simulations (b) & (c)) demonstrate how the ability to group noise features can lead to good performance even when compared to methods that explicitly perform variable selection, such as the Lasso, AdaLasso and ElasticNet regressions.

Figure 5.3: Trace plots for HFR, PLSR, ridge and Lasso for K = 8 predictors.The hyperparameter values are κ for the HFR, the number of components for the PLSR, and λ for ridge and Lasso.True parameter values are represented by the dashed line.
ŵ_3 = S_3 Q_33^{-1} z_3'[I − P_2][I − P_1] y gives the level-specific estimates, where the preceding levels are partialled out of each respective level-specific estimate. Note also that the nested nature of the hierarchical features implies that [I − P_2][I − P_1] y = [I − P_2] y, making the above exactly analogous to a stacked version of Eq. 3.3. Multiplying the matrix and using the simple trick that S_i (z_i' z_i)^{-1} z_i' z_j = S_j ∀ i > j, the effective degrees of freedom — the rank of P — simplify to tr(P_hfr). It follows that β̂_hfr can be reformulated to remove path-dependence from the level-specific estimates, with β̂_hfr = Bφ. (C.1) To derive this result, recall once again the case with K = 4 and L = 3 presented in Section 3.2 and in Appendix A. Here B = [ŵ_1 ⋯ ŵ_L], with ŵ_ℓ = S_ℓ Q_ℓℓ^{-1} Q_ℓy, using the notation in Appendix A. Using the definition of ŵ_ℓ yields β̂_hfr = ŵ_1(θ_1 − θ_2) + ŵ_2(θ_2 − θ_3) + ŵ_3 θ_3.

Table 5.2: Prediction accuracy (median MSE) for simulations (e)-(f) based on 500 simulation runs. Standard errors in parentheses, calculated using 500 bootstrap resamplings of the estimated MSE. In each case the two best methods are highlighted.