Topological regularization with information filtering networks

A methodology to perform topological regularization via information filtering network is introduced. This methodology can be directly applied to covariance selection problem providing an instrument for sparse probabilistic modeling with both linear and non-linear multivariate probability distributions such as the elliptical and generalized hyperbolic families. It can also be directly implemented for $L_0$-norm regularized multicollinear regression. In this paper, I describe in detail an application to sparse modeling with multivariate Student-t. A specific $L_0$-norm regularized expectation-maximization likelihood maximization procedure is proposed for this sparse Student-t case. Examples with real data from stock prices log-returns and from artificially generated data demonstrate the applicability, performances, and potentials of this methodology.


Introduction
In this paper I propose the use of information filtering networks (IFN) for L 0 norm topological regularization. IFN are a class of networks introduced in  and Tumminello et al. (2005) to analyze the relevant 'backbone' structure of interrelations in complex systems comprising a large number of interacting elements. They have been shown to provide a meaningful characterization of the structure of many systems in different domains from finance (Tumminello et al., 2007;Aste et al., 2010;Pozzi et al., 2013;Musmeci et al., 2014Musmeci et al., , 2015Procacci and Aste, 2019), to psychology Christensen, 2018) and biology (Song et al., 2008(Song et al., , 2012. Recently Massara et al. (2017) have introduced a class of IFN that is chordal and therefore particularly suited for probabilistic inference modeling. These networks are clique forests and they can be generated in a computationally efficient way through a clique expansion algorithm (Massara and . The ability to efficiently use IFN for L 0 norm regularization introduces into the machine learning and inverse problem contexts a novel element of direct interpretability with meaningful structures. In data-driven modeling regularization consists in adding a penalization to the objective function in order to control for the complexity of the model with the aim of reducing overfitting. The idea was originally introduced by Tikhonov (1943) and since then it has permeated the field of inverse problems and machine learning. There are different possible regularizations depending on the form of the penalization. The original Tikhonov approach (also known as ridge regression) was introduced in the context of multicollinear regression and consisted in penalizing the sum-of-square loss function by adding the sum of square of the regression coefficients (the L 2 norm) giving in this way preference to models with smaller coefficients. Other forms of penalization can of course be implemented and a particularly successful one uses of the L 1 norm instead of the L 2 norm and it was named 'lasso' by Tibshirani (1996). L 1 norm penalization forces certain coefficients to be set to zero producing therefore sparse models opening the way to better interpretability. If sparsity and interpretability are the objective, then an intuitive approach is to penalize the objective function directly with a L 0 norm that introduces a cost for the number of coefficients in the model. An advantage of the L 0 norm penalization is that the non-zero coefficients are not shrank in value allowing, in some cases, the use of local optimization methods. However, L 0 norm regularization has been proven to be in general challenging. Indeed, optimization under L 0 norm penalty is computationally intractable being non-differentiable and having a combinatorically large number of possible configurations to be explored. However, in Barfuss et al. (2016) this issue was successfully addressed by topologically regularizing multivariate normal models introducing a local-global procedure named LoGo. In the LoGo approach, the model is sparsified accordingly with a given sparse network structure and the coefficients of the sparse inverse covariance are retrieved from local inversions on the network structure. This methodology was proven to be extremely effective producing sparse models with better performances than lasso and with lower computational burden. A key element of the LoGo approach is the choice of the sparse network structure. In Bayesian terms this corresponds to the choice of a good prior inference model. In the original paper, LoGo utilizes a class of chordal graphs, named TMFG (Massara et al., 2017), that belongs to the broader category of information filtering networks and have clique-tree structure consisting of 4-cliques separated by 3-cliques.
In this paper I propose a methodology to make use of the meaningful and interpretable structure of IFN for L 0 norm regularization. For this purpose I adopt an approach analogous to LoGo but applied to elliptical distributions and using a more general IFN construction, named MFCF (Massara and , that produces clique forests with cliques of different sizes. In particular, I provide a detailed application to multivariate Student-t sparse models that are of interest in the financial domain where probability distribution of price log-returns are often modeled with such distributions. I show explicitly that the expectation-maximization (EM) procedure commonly used to optimize Student-t models (Dempster et al., 1977;Bishop, 2006) can be extended also to this L 0 norm regularized sparse model.
Examples with real data from financial equity prices and synthetic data demonstrate the applicability, robustness and validity of this topological regularization methodology.
The paper is organize as follows. In section 2, I recall the Student-t multivariate and some of its relevant properties; I also introduce some notation. The construction of information filtering networks is presented in Section 3. The topological regularization methodology is introduced in Section 4. The construction of the L 0 norm IFN-penalized set of coefficients for Student-t multivariate modeling is introduced in Section 5. Section 6 provides a demonstration that the expectation-maximization procedure can be extended to maximize likelihood of sparse Student-t models. Examples of the application of this methodology to real and synthetic data are provided in Section 7. Conclusions and perspectives are provided in Section 8. Appendix A contains some proofs.

Student-t distribution
Let me start from the definition of the multivariate Student-t probability density function.
Definition 1 (Multivariate Student-t distribution) Given of a set of random variables Z ∈ R p Z ×1 the multivariate Student-t probability density function has the following canonical general expression (Kotz and Nadarajah, 2004): where µ ∈ R p Z ×1 is the vector of location parameters; Ω ∈ R p Z ×p Z is a positively defined matrix called shape matrix; and ν > 0 is a scalar called degrees of freedom.

If we consider explicitly two subsets of variables
Where Ω XX ∈ R p X ×p X and Ω YY ∈ R p Y ×p Y are two diagonal block elements of the shape matrix Ω The conditional distribution of Y given X is instead and with Ω XY ∈ R p X ×p Y the off-diagonal block element of Ω and Ω YX its transpose.

Remark 2
The equalities for the marginals and the conditional distributions (Eqs.2,3 and 5) are the consequence of the completion of the square in the (z−µ) T Ω −1 (z − µ) term. They are general and, in particular, expressions Eqs. 6 and 7 for the conditional mean and shape matrix are valid for the entire family of elliptical distributions. Indeed, these are multivariate probability distribution functions that can be written as a function of (z − µ) T Ω −1 (z − µ) (Fang, 2018).
Remark 3 Eq. 6 is the expression for the multi collinear regression of Y given X = x in linear modelling. Interestingly, such a regression has the same form also in non-linear models such as the current Student-t or any distribution of the elliptical family. However, the meaning of the coefficients in Ω is not the same and their estimation is different for different models.

Information filtering network learning
The structure of chordal IFN can be learned by using a clique expansion procedure as described in Massara and , where a clique forest is constructed by means of a clique expansion algorithm which starts from a seed structure and includes vertices into the forest one by one accordingly with a given gain function. The resulting network is named MFCF. Such a clique forest network is made of a set of cliques C that are the 'vertices' in the clique-forest structure, the 'edges' of the clique-forest structure are instead a set S of separators that are cliques themselves with the property that by removing one of them the connected component becomes separated into two components. Clique forests are chordal graphs.
The MFCF complexity can be constrained by limiting the minimum and maximum clique sizes. By increasing the clique sizes ones increases the number of edges in the network making it denser. The full network is retrieved when the minimum clique size equals the total number of vertices.
The gain function that I use in this paper for the MFCF construction is the sum of the squares of the coefficients of the Kendall correlation matrix. This is a very simple gain function that lead to networks with all cliques with maximum size. Indeed, with this kind of additive gain the algorithm always gains by enlarging the clique, if allowed. I choose Kendall correlations because they describe dependency for a broader class of multivariate random variables than the Pearson's correlations. They, are non-linear and have been proven to be effective in practical applications (Pozzi et al., 2013). The IFN structure is learned before the maximum likelihood estimate of the parameters and it is passed to the optimization procedure as a Bayesian prior. This approach is analogous to the LoGo methodology introduced in Barfuss et al. (2016). However, here we apply it to non-normal models and this has important implications. Indeed, outside normal modeling the structure of the IFN graph does no longer represents conditional independence and the sparse probability distribution function no longer factorizes over the IFN clique and separator structure (see Lauritzen (1996) and Eq.22 in Appendix A for this factorization for the multivariate normal probability distribution function case).

Topological regularization
As discussed in the introduction, regularization is a methodology used to reduce overfitting by favoring models with a lower number of parameters or with smaller values of them. This is done by penalizing the introduction of extra or large coefficients in a model by modifying the objective function reducing it in proportion to number and/or the weights of the parameters in the model. In this paper I use as objective function the log-likelihood of the Student-t distribution and I panelize it with the number of non-zero parameters in the inverse shape matrix J = Ω −1 . When the penalizing function is linear, this is a case of L 0 norm regularization.
I want to maximize the log-likelihood under a form of penalization that depends on the number of parameters in the model.
Definition 6 (Penalized log-likelihood) The objective function to maximize is where J 0 = k is the number of terms different from zero in the inverse shape matrix J.
The simplest version of such penalization is λ( J 0 ) = J 0 = k and it corresponds to the penalization proposed by Akaike (1974Akaike ( , 1998 and Sakamoto et al. (1986) when base-two logarithm are used in Eq.8. A non linear version of this penalizer, proposed by Cavanaugh (1997) and Burnham et al. (2011), consists in k + k(k + 1)/(q − k − 1) which accounts also for the number of observations q and it becomes relevant for small observation sets when this number is small. The penalised likelihood in Eq.10 is an instance of L 0 norm topological regularization (Tikhonov, 1963b,a;Witten and Tibshirani, 2009;Zou and Hastie, 2005) that can overcome overfitting issues and provide better modeling. However, maximization of this penalized likelihood is not in general achievable neither analytically nor numerically for systems with a large number of variables. Indeed, at each level of sparsity, one must investigate the combinatorially large set of all positively defined symmetric matrices and, select among them the one that maximize ℓ(µ, J, ν) and iterate over sparsity to find the absolute maximum for a given penalizing function λ( J 0 ).
In this paper, I propose to simplify the exploration of all sparse J structures by restricting them to a set of IFN with different sparsity. The proposed methodology starts from a set of prior sparse MFCF structures with a range of maximum clique sizes, then the likelihood of the sparse models is maximized and finally the best performing model is selected. Specifically the methodology consists in the following three steps procedure: 1. generate a set of IFN prior graphs using the MFCF construction by Massara and  with structures generated for a range of different max-clique sizes using a sumof-square Kendall correlation coefficient gain function estimated over the train-set; 2. compute optimal sparse Student-t models for each of the IFN by maximizing likelihood on the train-set with the expectation-maximization procedure applied on the sparse IFN structure; 3. select the model with largest objective function on the validation-set.
Performance of the sparse IFN-Student-t model is then tested on the test-set and compared with other models.
Remark 7 The IFN structures are constructed and the model coefficients are optimized on the train-set; the models are selected on the validation-set; and finally results are evaluated and compared with other models on the test-set.
Remark 8 In this procedure there are two regularizations: the first is topological and it is the sparse IFN prior structure and the second is the L 0 norm penalization of the likelihood. These two regularizations interplay when the IFN prior is selected by choosing the network with largest penalized likelihood.

Sparse inverse shape matrix decomposition
The likelihood of multivariate Student-t models can be maximized though a procedure known as expectation-maximization (EM) introduced by Dempster et al. (1977) (see also Bishop (2006) Chap.9). Hereafter I shall show that such a procedure can be applied identically also to the maximization the likelihood of the sparse Student-t model for any given chordal IFN structure. Indeed, one of the important properties the sparse matrix J constructed over the chordal IFN network is that it can be computed as a sum of the inverse of local portions Ω c and Ω s of the shape matrix Ω that are respectively populated by elements in the clique set c ∈ C and separator set s ∈ S of the IFN structure.
Theorem 9 (Shape matrix inversion decomposition) For a given IFN clique forest, with clique set C and separator set S, for any given couple of vertices i, j associated with an edge of the IFN graph one has: Otherwise J i,j = 0 for all other couples of i, j not belonging to cliques.
The proof of this theorem uses the representation of the multivariate Student-t as a normal mixture and it is provided in Appendix A. Equation 11 transforms the global problem of estimating the whole matrix inverse into a set of local problems at clique and separator levels.
Corollary 10 (Positive definitness) The sparse inverse shape matrix J constructed from Eq.11 is positively defined if Ω c and Ω s are positively defined.
Proof A sum of positively defined matrices is positively defined.

Remark 11
The decomposition in Eq.11 allows to use, in combination with the L 0 norm regularization, any other L 1 or L 2 shrinkage penalization term applied to the local inversions of Ω c and Ω s .

Remark 12
The sparsification of J through Eq.11 provides a way to overcome the curse of dimensionality in the estimation of covariances from observations. Indeed, independently on the overall dimension of the system of variables X, the sparse matrix J is positively defined as far as Ω c and Ω s are positively defined. When the sparse inverse covariance J is estimated from data, it is then sufficient to have a number of observations, q, larger than the size of the largest clique, which is independent from the dimension, p, of X. Therefore, through Eq.11 one can obtain well conditioned covariance matrices even when q ≪ p.

EM maximum likelihood estimation of the sparse Student-t model
I shall now make use of the decomposition in Eq.11 to maximize Student-t likelihood over the sparse IFN structure. This can be done by following step-by-step the conventional EM procedure for a given ν (Bishop, 2006). For the maximization step, the coefficients that maximize likelihood can be obtained by differentiating with respect to µ i and the non-zero J i,j . In the present sparse model, the only difference is that one does not differentiates over the J i,j that are not associated with edges in the IFN and have assigned zero value. By differentiating with respect to the location parameters µ i and equalling to zero one obtains the following expression for the maximum: Whereas, by differentiating with respect to the parameters of the shape matrix Ω i,j and equalling to zero one obtains: Where the weights w(t) are: withd whereĴ i,j are obtained by substituting Eqs.13 and 14 into Eq.11. Solutions 12 and 13 are in an implicit form and depend one from the other and must be evaluated recursively (the e-step). Starting point of the recursion is the set of weights w(t) (Eq.14). In the present approach I first estimate the coefficients with the sample means and the sample covariance by using the decomposition in Eq.11 with Ω c and Ω s estimated from the Kendall correlation coefficients multiplied by the product of standard deviations in the rows and columns and all multiplied by (ν − 2)/ν (in this paper I will always consider ν > 2). Then the computation is reiterated until convergence to a stable set of coefficients. Convergence is guaranteed (Theorem 2 in Dempster 1977) although it can be slow. The sparsity ofĴ is not affecting the form of the solutions which have the same form also in the full case. However, in the sparse case only the elements belonging to cliques must be computed which reduces computational complexity from O(p 2 ) to O(p). The rest of the shape matrixΩ is automatically populated by invertingĴ.
The parameter ν can also be computed through the EM procedure. However, I prefer to estimate it independently from the tail exponent via a fit with a power law of the left and right tails of the probability distribution of all the univariate marginals of X. Indeed, all marginal Student-t distributions of X behave as a power law on both left and right tails with tail-exponent α = ν.
As a consequence of Theorem 9 (Eq.11) one has the following two useful results.

(16)
Corollary 14 (Decomposition of the Mahalanobis distance) The proofs of these two corollaries use the representation of the multivariate Student-t as a normal mixture and are provided in Appendix A.

Experiments
In order to test the novel topological regularization methodology introduced with this paper I collected daily prices from 623 stocks continuously traded on the US equity market between 01/02/1999 and 20/03/2020 for a total of 5515 trading days. For each stock 'i' (= 1, ..., p) I computed the log-returns,x i (t) = log P rice i (t) − log P rice i (t − 1), (t = 1, ..., q). Results are computed over 100 random re-sampled datasets generated by randomly picking with repetitions p = 100 return series among the 623 stocks. For each random choice of the p = 100 return series I randomly sampled 3×q returns without repetition using q observations for the training set, q observations for the validation set and q observations for the test set. I performed two sets of experiments with q = 150 and q = 600 respectively. I also tested the procedure on synthetic datasets artificially generated from multivariate normal distributions and multivariate Student-t distributions. In these cases I used the empirical covariance and means from the real data as parameters to generate artificial datasets with properties consistent with the real data. For each dataset I first computed the Kendall's correlation and from it I generated a set of IFN graphs using the MFCF algorithm with maximum clique sizes in the range from 2 to 100. As MFCF gain function I chose the sum of the squares correlations, which is one of the simplest choices that produces cliques all of sizes equal to the maximum clique size. I then  (2018). J 0 is the number of edges in the IFN graph. Models have been optimized to maximize likelihood on the train-set by using the EM procedure described in section 5. On the lefthand-side are reported the log likelihoods computed on the train-set. On the right-hand-side are reported the likelihoods of the same models of the left figure but computed on the validation-set. The bands around the lines are the 10% and 90% quantiles computed over 100 re-sampling. Figure 2: Mean penalized log-likelihood L/q = ℓ(µ, J, ν)/q − J 0 results for the same a set of Student-t models as in Fig.1. The bands around the lines are the 10% and 90% quantiles computed over 100 re-samplings.
computed the sparse inverse shape matrix on the training set using Eq.11 from the Kendall correlation coefficients multiplied by the standard deviations and by the factor (ν − 2)/ν. As degrees of freedom I empirically investigated the tails of the marginal distributions across the whole dataset retrieving a tail exponent ν = 2.2 as a good average estimator for the degrees of freedom. I verified that results are qualitatively little sensitive to this parameter although quantitatively the likelihood changes sensibly with ν. I optimized the parameters to maximize likelihood on the train set by using the EM procedure described in section 5. Results for the log-likelihood (Eq.8) computed on the train set as function of the number of edges J 0 in the sparse inverse shape matrix are reported in Fig. 1 (left). One can note that the mean log-likelihood (ℓ/q) computed over the train dataset increases with the number of parameters in the model and it is larger for the shorter observation set (q = 150). This is expected because optimized models with larger number of parameters and lower number of observations are overfitting the dataset describing more the noise than the actual information.
To account for overfitting and test the goodness of the models I computed again the mean loglikelihood using the same models with the same parameters but with observations from the validation dataset. In this way overfitting is eliminated because the model is constructed and optimized over the train dataset but then performance are measured over validation dataset. Results are reported in Fig.1 (right). One can see that on the validation dataset the models gain in likelihood becoming denser only up to a point and then they start overfitting the data and they perform less well than the sparse counterparts. As expected, on the validation set the model trained with less data (q = 150) performs less well than than model trained with more data (q = 600). Note that the last points on the right of the two plots are the full models (max clique = 100) with the complete (non-sparse) inverse shape matrix. As one can see, for small observation sets (q = 150) the sparse models largely over-perform the complete models. Whereas, for larger observation sets (q = 600) the difference is smaller. Note that the data reported are averages over 100 re-samplings and the bands are the 10% and 90% quantiles. The re-sampling picks randomly both the time series and the returns. Therfore the observed consistency and the relatively narrow quantile band are strong indications of statistical robustness of the results. Models with smaller number of parameters are preferable. This can be quantified by adopting the Akaike information criterion and compute the penalized likelihood L = ℓ(µ, J, ν) − J 0 . Results for the penalized likelihood computed from the validation set are shown in Fig.2 as function of the maximum clique size in the IFN-MFCF graph. One can see two well defined maxima of the penalized likelihood at max clique size = 5 and 10 respectivelly for the two sample sizes q = 150 and 600.
Results for the test set are reported in Table.1 together with results for the synthetic datasets. The table reports also comparisons with the complete (non-sparse) models, named FULL, and with topological; regularization with another class of IFNs called TMFG (Massara et al., 2017) generated with the same gain function (sum of correlation coefficient square) but with a slightly different procedure and with fixed max clique size equal to four. The reported results are means computed on the test set over 100 different re-samplings for the real data and new realizations for the artificial data. We can see that the TMFG preforms well with results for L consistently similar to the best performing MFCF models. This indicates that the TMFG is in general a very good tool to generate sparse models. The advantage is that the TMFG has a simplified construction with respect to the MFCF and having a fixed  Table 1: Summary of results for the mean likelihood ℓ/q (Eq.8), mean penalized likelihood L/q (Eq.10, with λ( J 0 ) = J 0 ) and IFN-MFCF max-clique-size computed on the test set for the MFCF, TMFG and FULL models. The MFCF are sparse models with IFN network learned on the train set, parameters optimized on the train set and max clique size selected on the validation set. The TMFG is also learned and optimized on the train set. The FULL is a non-sparse model optimized on the train set. Results are means computed on the test set over 100 different re-samplings for the real data and new realizations for the artificial data. Columns respectively refer to: real log-return data (Real data) , multivariate normal synthetic data (Normal data), multivariate Student-t synthetic data (Student-t data); all for both q = 150 and 600.
clique size does not need structure selection on the validation set making the procedure less computationally and data intensive. Let me note the consistency of the results for both real data and synthetic data.

Conclusions and perspectives
In this paper I have introduced a methodology for topological L 0 norm regularization with information filtering networks and I have applied it in detail for a case of penalized likelihood in Student-t sparse modeling. The regularization methodology consists in keeping different from zero only the parameters of the multivariate distribution that correspond to edges in the IFN. By using clique forests IFN, one guarantees positive definiteness and decomposition into local parts of the inverse shape matrix (Eq.11). This is an important property associated with this kind of IFN and it applies to the vast class of models belonging to the elliptical family (Fang, 2018) which includes the Student-t but also the normal, the Laplace and the multivariate stable distributions. This L 0 norm topological regularization methodology strongly improves model interpretability because IFN structures are known to meaningfully represent relevant interrelations in complex data structures with a vast literature reporting their successful applications to various domains from finance to biology.
I have shown that the expectation-maximization methodology commonly used to estimate the maximum likelihood coefficients in Student-t models can be used also for this L 0 norm regularized sparse models with the advantage that in this case computation must be done only for the coefficients corresponding to IFN edges reducing computation complexity from O(p 2 ) to O(p).
Experiments on real datasets from equity prices and multivariate synthetic datasets demonstrate that the proposed methodology is directly applicable to a range of practically relevant problems. Results demonstrate that L 0 norm topologically regularized models outperform the full models and reveal that smaller observation sets selects sparser IFN models.
The present paper reports exclusively on L 0 norm regularization via IFN priors, however, the nature of this sparsification allows to combine straightforwardly L 1 and L 2 regularizations as well within this methodology. Indeed, Theorem 9 provides the sparse inverse shape matrix as sum of local inversions of matrices associated with the clique and separator sets. On such local inversions, shrinkage and lasso regularization can be applied directly.
I think it is important to stress that, although I have focused the illustration of this topological regularization methodology to the very specific case of likelihood maximization for a multivariate Student-t model, the method itself is general. For instance, Eq.6 provides the expression for the L 0 norm IFN penalized multilinear regression for the elliptical family. However, this will be the subject of future works. a gamma distribution defined in the domain r ∈ [0, ∞). This leads to the probability distribution function: The canonical representation of the Student-t with degrees of freedom ν is retrieved when both shape and scale parameters are equal to α = β = ν/2.

Proof of Theorem 9
Consider the case when the inverse shape matrix Ω −1 = J is a sparse matrix obtained from the IFN clique forest construction. This implies that Σ −1 = rΩ −1 = rJ is sparse as well with the same structure. There are several known results for normal distributions with sparse precision matrices Lauritzen (1996). For instance, the multivariate Normal Eq.19 can be decomposed as From this decomposition in Eq.22 it follows directly that the precision matrix can be computed as a sum of local matrix inverse (see Lauritzen (1996)): Consequently the inverse shape matrix J = Ω −1 = Σ −1 /r must have the same decomposition that is Eq.11.

Proof of Corollary 13
From the decomposition in Eq.22 it follows directly that the determinant can be decomposed as (see Lauritzen (1996)): The covariance matrices are related to the shape matrix with Σ c = rΩ c and Σ s = rΩ s . The right-hand side of Eq.24 has therefore a factor r p with p = sum of the clique sizes minus sum of the separator sizes times k s − 1 which is actually equal to the total number of variables. The left-hand side is |Σ| = r p |Ω| and therefore the equality Eq.16 follows.