A Mathematical Programming Approach to Sparse Canonical Correlation Analysis

Recent developments in the interplay between Operational Research and Statistics allowed to exploit advances in Mixed Integer Optimization (MIO) solvers to improve the quality of statistical analysis. In this work we tackle Canonical Correlation Analysis (CCA), a dimensionality reduction method that summarises multiple data sources jointly, retaining their dependency structure. We propose a new technique to encode Sparsity in CCA by means of a new mathematical programming formulation which allows to obtain an exact solution using readily available solvers (like Gurobi). We show a preliminary investigation of the performance of our proposal through a simulation study, which highlights the potential of our approach.


Introduction
Despite the great deal of overlap between Operational Research (OR) algorithms and Data Science (DS) methods, the interaction between the two fields is still under-exploited. In recent years there have been more efforts on both sides to borrow from each others literature, bridging the gap between the two disciplines. 5 From the OR side, Machine Learning (ML) techniques have been used to improve solver performances. Among the contributions in this direction we recall as examples [1], which recasts the solver's algorithm selection for Mixed-Integer Quadratic Programming (MIQP) as a classification procedure; [2], which adopts supervised learning to predict if a Mixed-Integer Programming (MIP) instance 10 will be solved within a given time limit; [3], which provides a survey on ML approaches to variable and node selection in Branch-and-Bound algorithms.
On the other side, the one this work falls in, OR methods have been instrumental in both developing and improving fitting procedures of statistical models.
Among the most established strategies to improve statistical fitting procedure 15 by means of advances in the OR literature, we recall [4] where a mixed integer quadratic optimization (MIQO) approach has been adopted for designing high quality linear regression models balancing many competing desired properties; [5] which presents a novel Mixed-Integer Linear Programming (MILP) formulation of the optimal classification tree problem or [6] that proposes a Branch-and- 20 Bound algorithm, based on principles of Mixed-Integer Optimization (MIO), to solve the Sparse Principal Component Analysis (SPCA) to optimality, just to mention a few. Indeed, improvements in hardware and in optimization methods in the last 30 years produced impressive advances in MIO solver performances.
Consequently, several statistical problems that have a natural MIO formulation 25 considered intractable in the past, now can be solved to optimality.
In this work, inspired by [4], we adopt a mathematical programming approach to recast sparsity in Canonical Correlation Analysis (CCA), a dimensionality reduction method that allows to summarise multiple sources of data simultaneously while retaining their dependency structure. We propose a new fitting 30 procedure which, exploiting advances of MIO, provides an exact solution for Sparse CCA, a problem which, to the best of our knowledge is only approximated in the DS literature. More in details, we implement our proposal in Gurobi 9.0, whose latest update includes a bilinear solver for problems with quadratic constraints. This new solver has the noticeable advantage of being 35 able to find a global optimum, which guarantees the exactness of the solution 2 to our problem. The remainder of this paper is organized as follows. Section 2 reviews the OR literature on sparsity for different statistical methods and the general MIO framework in which the sparsity can be embedded. Section 3 first recalls the CCA technique and the meaning of sparsity in this context and then 40 presents the proposed MIQO formulation. Section 4 reports the results obtained testing the model on simulated data sets.

State of the Art
Nowadays large quantities of data, related to every field, like in marketing, social networks, telecommunications, medicine and health care and others, are avail-45 able. However, in order to extract useful knowledge from them, it is necessary to summarise such huge amount of information and discard all non-informative elements. To this end the concept of sparsity is very important in statistical models. Indeed, sparsity is the property which guarantees that only a reduced number of parameters (or predictors) play an important role in the statisti-50 cal model under consideration, thus greatly improving its interpretability. It is therefore no coincidence that in the research area that studies the use of advances in operational research techniques to improve the performance of statistical models, sparsity was one of the first properties to be analyzed under a "modern optimization lens".

55
In this section we focus on the articles related to sparsity embedded in optimization approaches proposed to improve the quality of several statistical models and on the main contributions of this work. After the seminal work [4], by Bertsimas and King, on MIQO formulations of linear regression, in [6] a MIO approach to sparse principal component analysis has been presented. The authors pro-60 posed a Branch-and-Bound algorithm able to prove optimality or to find high quality solutions in seconds, depending on the sample size, explaining a higher portion of variance with respect to other existing methods in literature. In [7] the sparse regression problem has been reformulated as a pure integer convex optimization problem. A cutting plane algorithm to solve it on instances with 65 3 number of samples and regressors in the 100000s is presented. Moreover, the authors observed that the sparse regression problem has the property that, as the sample size (n) increases, the problem becomes easier in the resolution perfectly recovering the support of the true signal, faster than LASSO, whereas for small n values, their approach takes a large amount of time to solve the problem. and randomized rounding schemes to obtain feasible solutions of high quality (very close to the optimal one) within minutes for a number of variables p = 100s or hours for p = 1000s. In [9] the authors formulated the cardinality constrained maximum likelihood problem for the sparse inverse covariance matrix as a MIO model and proposed a combination of outer-approximation algorithm 80 and first-order methods to solve it. Their approach provides near optimal solutions fast, and a guarantee on the solutions suboptimality if the method is terminated early. It delivers near optimal solutions in a matter of seconds, and provably optimal solutions in a matter of minutes for p in the 100s and k in the 10s. The algorithm also provides high-quality solutions to problems in the 85 1000s, but a certificate of optimality is more computationally expensive for those sizes. In [10] Blanquero et al. studied sparse optimal randomized classification trees adopting a new optimization approach based on the one presented in [11] with oblique cuts. This is a continuous optimization model which is able to find a trade-off between accuracy and global sparsity. Indeed, the problem of 90 building optimal decision trees is NP-complete and for this reason in literature greedy procedures have been proposed in which at each branch node of the tree, some purity criterion is (locally) optimized like CARTs. However, these approaches provide good local sparsity making classic decision trees locally easy to interpret but this is not true at a global level. Mathematical optimization 95 is again the key to build optimal, deterministic and randomized classification trees and address issues like global sparsity as shown in [10]. Another recent example to be mentioned among the works in literature focusing on sparsity in statistical models dealt with an optimization approach is [12]. It focuses on the sparse polynomial regression problem, also named sparse hierarchical regression 100 problem, which consists in determining a polynomial of degree r that depends on at most k inputs counting at most l monomial terms minimizing the sum of squares of its prediction errors. For this problem the authors presented a two-step approach: first a fast input ranking heuristic discards the irrelevant inputs, then a cutting plane method solves the integer nonlinear optimization 105 model used to formulate the remaining reduced sparse hierarchical regression problem. The experimental results show that the proposed method is able to deal with problems of practical size (n=10000 and p=1000), for which its computational complexity is on par with LASSO, outperforming heuristic methods in both finding all relevant non-linearities as well as rejecting obfuscating ones.

110
In this paper, we study another statistical model, Canonical Correlation Analysis, for which we embed sparsity by adopting a mathematical programming approach. To the best of our knowledge this is still an uncovered topic by the existing literature related to OR advances application for improving fitting and quality of statistical methods. The main contributions of this work are a new 115 MIO formulation for Sparse Canonical Correlation Analysis and experimental results based on simulated dataset showing that the proposed approach is able to discern between informative and non informative variables.

Mixed Integer Optimization and Sparsity
In this section we give a glimpse of mixed integer optimization (MIO) and of its 120 use in modelling sparsity.
The general formulation of a MIO problem is as follows: MIO models with a quadratic objective but without quadratic constraints are We focus in particular on the case of sparsity, that is when we require some of the variables to be set to 0, following the approach defined by Bertsimas.
According to [13], the notion of sparsity can in fact be formalized by considering 140 the variable x = (y, z) by composed of two subsets of equal dimension. The first one y = (y 1 , . . . , y p ) ∈ R p represents the statistical parameters of interest, while the second one z = (z 1 , . . . , z p ) ∈ {0, 1} p can be taken as auxiliary variables, which allow to identify which elements of the first subset are different than 0.
The presence of sparsity is guaranteed by constraints (10), which imply some of the elements of y are automatically set to zero, while the amount of sparsity is regulated by constraint (8).
In the past decades, because of the difficulty in scaling solution algorithms for MIO problems, the research to deal with statistical sparsity focused on methods which solve convex approximation of the original problem. We now briefly recall the most famous example of this approach, the LASSO [14], in which a penalty based on the L 1 norm is used to regularize the standard least square fit of a regression problem.
Given a sample of n observations, the standard linear regression model can be written as where y = (y 1 , . . . , y n ) is the response (or dependent) variable, ε = (ε 1 , . . . , ε n ) is the stochastic error term and X is (n × p) the matrix containing the values of p independent variables (also known as covariates, or predictors) observed on the sample and β = (β 1 , . . . , β p ) their corresponding coefficients. The standard approach to fit this model is to minimize the error term, typically l2.
Sparsity is enforced in the coefficients β by adding the following constraint The rationale for inducing sparsity in this kind of model is, in fact, reducing the number of predictors by assessing which independent variables are relevant for recovering the dependent one. Each element β i of the coefficient β represents the impact of the i − th covariate on the response y. If β i is 0, the corresponding 160 covariate does not play any role in explaining the dependent variable y. Solving 7 12-13 was not considered to be approachable directly, as it is a NP hard problem.
The basic idea of the LASSO thus consisted in approximating it by While the LASSO rightfully enjoys popularity in the statistical community due 165 to its good predictive properties, and depending on the signal-to-noise ratio in the data, it may even outperform an exact solution of 12-13 [15], it still provides only an approximated solution to the original problem of sparsity. However, [13] showed that it is possible to obtain an exact solution to the problem 12-13 by adding a binary variable z and reformulating it as a MI(Q)O problem: where the parameters C and k represent respectively the maximum value coefficients β i can take and the maximum number of non zero β coefficients. Variable 175 z i is equal to 0 when the corresponding coefficient β i is 0 and 1 otherwise. Constraint 17 insures that if z i is 0, β i is 0 as well. It is easy to see that (16)- (19) is a special case of the framework of (7)- (11). In the last 30 years the compu-

A MIQO formulation for Canonical Correlation Analysis
Canonical Correlation Analysis (CCA) is a multivariate statistical technique designed to investigate the relationship between two sets of variables. The basic framework of CCA is that where we have two datasets X 1 ∈ R n×p1 and X 2 ∈ R n×p2 , taken to be i.i.d. realization of two random vectors Each component of these vectors, and therefore each column of the datasets, represents a different source of information or attribute. Example of interest in CCA are the cases of genomic data [16], microbiome data [17] and EEG [18].
In its original formulation, given two scaled datasets X 1 and X 2 , CCA's goal is to find a lower dimensional linear representation X 1 w 1 = Y 1 and X 2 w 2 = Y 2 so that the resulting transformed variables Y 1 ∈ R n and Y 2 ∈ R n are maximally correlated. After imposing that X 1 and X 2 are centered, i.e. they have columnwise 0 mean, the empirical correlation between the two new variables Y 1 and Y 2 can be expressed as The coefficients w 1 ∈ R p1 and w 2 ∈ R p2 that guarantee that the correlation 190 between Y 1 and Y 2 is maximized, can thus be found by solving the following optimization problem max w1,w2 where the constraints 21 are imposed to ensure identifiability, since solutions to problem 20 are invariant with respect to orthogonal rotation. Notice that, as a 195 consequence of these constraints, formulation 20-21 can be simplified as: The new variables Y 1 and Y 2 , usually called Canonical Variables, provide a vector representation of the original matrices X 1 and X 2 , and can be used as that are close to 0, identify attributes that can be neglected for explaining the dependence between the two sets.
In the context of CCA, inducing sparsity consists of forcing some elements of the coefficients w 1 and w 2 to be 0. Considering only a subset of the original 210 attributes in our analysis, thus constraining the number of non-zero coefficients to be smaller than a maximum acceptable value, can lead to more informative Canonical Weights and Variables. This especially true when p 1 and p 2 become large, and the interpretability of the weights, as well as the Canonical Variables, becomes harder.

215
As in the case of the LASSO for regression problems, sparsity has most often be induced by adding a regularization term based on the L 1 norm in CCA as well.
The most common formulation for Sparse CCA in the statistical and machine learning literature is thus the following: Several algorithms have been introduced to efficiently solve this convex optimization problem. For example [19] propose an approach based on a penalized matrix decomposition and the resulting regularized version of the singular value decomposition. In [20], the solution to the problem 24-26 is obtained by means 225 of a linearized Bregman iterative method. Similar problems to 24-26 is also considered in [21] and more recently in [22] and in [23]. In a Bayesian fashion, probabilistic approaches to sparse CCA have also been considered [24]. Methods that tackle sparsity by means of L 0 loss, and are thus more similar in spirit to our work, are far less common in the literature. Among these, [25] introduces a 230 provable algorithm for estimating canonical vector with an L 0 penalization, but it is limited by the assumption that each of the two data objects X 1 and X 2 , although dependent from the other object, consists of independent variables.
[26] introduces a two stage procedure to solve sparse CCA, and provides an algorithm to identify active entries of the L 0 -penalized canonical vectors, i.e.

235
non-zero elements of w 1 and w 2 . Values of these active entries must be then computed through a separate procedure. Finally, [27] suggests an alternating iterative algorithm to solve the L 0 sparse CCA formulation by using a sparse projection strategy.

The MIQO formulation 240
In order to obtain an exact solution to Sparse Canonical Correlation, we recast problem 24-26 as a MIQO, adding auxiliary binary variables in the fashion of [13]. The resulting mathematical programming formulation is as follows: where w 1 ∈ R p1 and w 2 ∈ R p2 are sets of continuous variables bounded by Gurobi module for this class of problems.

Experimental Results & Discussion
We evaluate the performance of our proposal on simulated data. Following [19] we generate data according to the model where each component of the error matrix ε i follows a Normal distribution with 260 mean 0 and standard deviation 0.1. We consider 2 different scenarios w.r.t the number of variables p 1 and p 2 and test them at different levels of sparsity: a standard one and extreme one. Details are given in Table 1.
We generate the u vector from a n variate normal distribution, with 0 mean vector and identity and variance-covariance matrix. We also consider 3 different 265 values for the sample size n: 50, 150, 500.
Even though still very limited, results shown in Figures 1-4 highlight the potential of our proposal. The sparsity pattern appears to be captured well in almost Scenario 1 Scenario 2 p 1 = 5 p 2 = 3 p 1 = 20 p 2 = 9 Sparse k 1 = 3 k 2 = 2 k 1 = 4 k 2 = 3 Very Sparse k 1 = 1 k 2 = 1 k 1 = 1 k 2 = 1   Interestingly, the model performances seem to be affected by the sample size more than the sparsity level, and they are reassuring even in presence of extreme sparsity. Figure 4 shows in fact how the model seems to be able to deal with a sparsity level of more than 90%. As we could expect, the estimated values, denoted by the coloured circles, become closer to the real values, indicated by 275 the black star, as the sample size n increases. This is because when the sample size is small, the empirical covariance X t 1 X 2 in the objective function (27)   In conclusion, additional investigation is needed, but these preliminary results suggest that our approach is promising and encourage us to further explore it in more challenging settings as well as real data examples.