Perspective Maximum Likelihood-Type Estimation via Proximal Decomposition

We introduce an optimization model for maximum likelihood-type estimation (M-estimation) that generalizes a large class of existing statistical models, including Huber's concomitant M-estimator, Owen's Huber/Berhu concomitant estimator, the scaled lasso, support vector machine regression, and penalized estimation with structured sparsity. The model, termed perspective M-estimation, leverages the observation that convex M-estimators with concomitant scale as well as various regularizers are instances of perspective functions. Such functions are amenable to proximal analysis, which leads to principled and provably convergent optimization algorithms via proximal splitting. Using a geometrical approach based on duality, we derive novel proximity operators for several perspective functions of interest. Numerical experiments on synthetic and real-world data illustrate the broad applicability of the proposed framework.


Introduction
High-dimensional regression methods play a pivotal role in modern data analysis. A large body of statistical work has focused on estimating regression coefficients under various structural assumptions, such as sparsity of the regression vector [36]. In the standard linear framework, however, the regression coefficients constitute only one aspect of the model. A more fundamental objective in statistical inference is to estimate both the location (i.e., the regression coefficients) and the scale (e.g., the standard deviation of the noise) of the statistical model from the data. A common approach is to decouple this estimation process by designing and analyzing individual estimators for the scale and location parameters (see, e.g., [21, p. 140], [41]), because joint estimation often leads to non-convex formulations [14,34]. One important exception has been proposed in robust statistics in the form of a maximum likelihood-type estimator (M-estimator) for location with concomitant scale [21, p. 179], which couples both parameters via a convex objective function. To discuss this approach more precisely, we introduce the linear heteroscedastic mean shift regression model. This data formation model will be used throughout the paper.

Model 1.1
The vector y = (η_i)_{1≤i≤n} ∈ R^n of observations is

y = Xb + o + Ce,    (1.1)

where X ∈ R^{n×p} is a known design matrix with rows (x_i)_{1≤i≤n}, b ∈ R^p is the unknown regression vector (location), o ∈ R^n is the unknown mean shift vector containing outliers, e ∈ R^n is a vector of realizations of i.i.d. zero-mean random variables, and C ∈ [0, +∞[^{n×n} is a diagonal matrix whose diagonal entries are the (unknown) standard deviations. One obtains the homoscedastic mean shift model when the diagonal entries of C are identical.
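To make the data formation model concrete, the following minimal Python sketch draws observations from Model 1.1; the function name and the choice of standard normal noise are ours for illustration.

```python
import numpy as np

def mean_shift_data(X, b, o, diag_C, rng=None):
    """Draw y = Xb + o + Ce from Model 1.1; diag_C holds the diagonal of C."""
    rng = np.random.default_rng(rng)
    e = rng.standard_normal(X.shape[0])  # i.i.d. zero-mean noise realizations
    return X @ b + o + diag_C * e        # C is diagonal, so Ce = diag_C * e
```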
The concomitant M-estimator proposed in [21, p. 179] is based on the objective function

(b, σ) ↦ Σ_{i=1}^{n} σ h_{ρ₁}((x_i^⊤ b − η_i)/σ) + nδσ,    (1.2)

where h_{ρ₁} is the Huber function [20] with parameter ρ₁ ∈ ]0, +∞[, δ ∈ [0, +∞[, and the scalar σ is a scale. The objective function, which we also refer to as the homoscedastic Huber M-estimator function, is jointly convex in b and σ and is hence amenable to global optimization. Under suitable assumptions, this estimator can identify the outliers o and can estimate a scale that is proportional to the diagonal entries of C in the homoscedastic case.

In [2], it was proposed that the joint convex optimization of the regression vector and the standard deviation may also be advantageous in sparse linear regression. There, the objective function is

(b, σ) ↦ ‖Xb − y‖₂²/(2σ) + nσ/2 + α₁‖b‖₁,    (1.3)

where the term ‖·‖₁ promotes sparsity of the regression estimate, α₁ ∈ ]0, +∞[ is a tuning parameter, and σ is an estimate of the standard deviation. This objective function is at the heart of the scaled lasso estimator [35]. The resulting estimator is not robust to outliers but is equivariant, which makes the tuning parameter α₁ independent of the noise level.

In [29], an extension of (1.2) was introduced that includes a new penalization function as well as concomitant scale estimation for the regression vector. The objective function is

(b, σ, τ) ↦ Σ_{i=1}^{n} ( σ h_{ρ₁}((x_i^⊤ b − η_i)/σ) + δ₁σ ) + α₁ Σ_{k=1}^{p} ( τ b_{ρ₂}(β_k/τ) + δ₂τ ),    (1.4)

where b_{ρ₂} is the reverse Huber (Berhu) function [29] with parameter ρ₂ ∈ ]0, +∞[, δ₁ ∈ ]0, +∞[ and δ₂ ∈ ]0, +∞[ are constants, and α₁ ∈ ]0, +∞[ is a tuning parameter. This objective function is jointly convex in b and the scalar parameters σ and τ. The estimator inherits the equivariance and robustness of the previous estimators. In addition, the Berhu penalty is advantageous when the design matrix contains correlated predictors [23]. In [10], it was observed that these objective functions turn out to be instances of the class of composite "perspective functions" [8], a powerful construct that extends a convex function of a single variable to a jointly convex one in terms of an additional scale variable (see Section 2.2 for a formal definition). Let us add that perspective functions are also implicitly present in many data analysis models in the form of regularization penalties for structured sparsity [3,25,26].
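For reference, the Huber and Berhu functions and the perspective construction of (2.15) can be evaluated numerically as follows. This is an illustrative sketch of ours; the recession values used at σ = 0 (ρ|·| for Huber, the indicator of {0} for Berhu) follow from the linear and quadratic tail growth of the two functions.

```python
import numpy as np

def huber(t, rho):
    """Huber function h_rho: quadratic for |t| <= rho, linear beyond."""
    a = np.abs(t)
    return np.where(a <= rho, 0.5 * t * t, rho * a - 0.5 * rho * rho)

def berhu(t, rho):
    """Reverse Huber (Berhu) b_rho: linear for |t| <= rho, quadratic beyond."""
    a = np.abs(t)
    return np.where(a <= rho, a, (t * t + rho * rho) / (2.0 * rho))

def perspective(phi, recession, sigma, t):
    """Perspective of phi as in (2.15): sigma*phi(t/sigma) for sigma > 0,
    the recession function of phi at sigma = 0, and +inf for sigma < 0."""
    if sigma > 0:
        return sigma * phi(t / sigma)
    if sigma == 0:
        return recession(t)
    return np.inf

# Example: perspective of the Huber function, whose recession is rho*|t|
rho = 1.0
val = perspective(lambda u: huber(u, rho), lambda u: rho * np.abs(u), 2.0, 3.0)
```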
In the present paper, we bring to light the ubiquity of perspective functions in statistical M-estimation and introduce a new statistical optimization model, perspective M-estimation. The proposed model, put forward in detail in Section 3, uses perspective functions as fundamental building blocks to couple scale and regression variables in a jointly convex fashion. It includes in particular the formulations discussed in [10], as well as the M-estimators discussed above, as special cases, and it will be seen to cover a wide range of models beyond those. In [10], an algorithm was proposed to solve a specific formulation involving perspective functions in the context of generalized TREX estimation. To date, however, there exists no provably convergent algorithm to solve composite convex optimization problems involving general perspective functions. To fill this gap, we construct in Section 4 a new proximal splitting algorithm tailored to the perspective M-estimation problem and rigorously establish the convergence of its iterates. Since the proximity operators of perspective functions are known only in limited cases [10], another important contribution of our work is to derive new ones in order to broaden the effective scope of the proposed perspective M-estimation framework. Using geometrical insights revealed by the dual problem, we derive in Section 2 new proximity operators for several perspective functions, including the generalized scaled lasso, the generalized Huber, the abstract Vapnik, and the generalized Berhu functions. These developments lead to a unifying algorithmic framework for the global optimization of the proposed model using modern splitting techniques. The model also allows for the seamless integration of a large class of regularizers for structured sparsity and of novel robust heteroscedastic estimators of location and scale. Numerical experiments on synthetic and real-world data illustrate the applicability of the proposed framework in Section 5.

Proximity operators of perspective functions
The general perspective M-estimation model to be proposed in Problem 3.1 hinges on the notion of a perspective function (see (2.15) below). Since perspective functions are nonsmooth, to solve Problem 3.1 we need to bring into play the machinery of proximal methods [4] and must therefore be able to compute the proximity operators of these functions. A few examples of such computations were presented in [10]. In this section, using a novel geometric approach, we derive a number of important new instances. Since these results are of general interest beyond statistical analysis, throughout, H is a real Hilbert space with scalar product ⟨· | ·⟩ and associated norm ‖·‖.

Notation and background on convex analysis
The closed ball with center x ∈ H and radius ρ ∈ ]0, +∞[ is denoted by B(x; ρ). Let C be a subset of H. Then ι_C : x ↦ 0 if x ∈ C, +∞ otherwise, is the indicator function of C, d_C : x ↦ inf_{y∈C} ‖x − y‖ is the distance function to C, and σ_C : x ↦ sup_{y∈C} ⟨x | y⟩ is the support function of C. If C is nonempty, closed, and convex, then, for every x ∈ H, there exists a unique point proj_C x ∈ C, called the projection of x onto C, such that ‖x − proj_C x‖ = d_C(x). The normal cone to C at x ∈ C is N_C(x) = { u ∈ H | (∀y ∈ C) ⟨y − x | u⟩ ≤ 0 }. We denote by Γ₀(H) the class of proper lower semicontinuous convex functions from H to ]−∞, +∞]. Let ϕ ∈ Γ₀(H). The conjugate of ϕ is ϕ* : u ↦ sup_{x∈H} ( ⟨x | u⟩ − ϕ(x) ). It also belongs to Γ₀(H) and ϕ** = ϕ. The Moreau subdifferential of ϕ is the set-valued operator ∂ϕ : x ↦ { u ∈ H | (∀y ∈ H) ⟨y − x | u⟩ + ϕ(x) ≤ ϕ(y) }. We have u ∈ ∂ϕ(x) ⇔ x ∈ ∂ϕ*(u). Moreover, x minimizes ϕ if and only if 0 ∈ ∂ϕ(x). If ϕ is Gâteaux differentiable at x ∈ dom ϕ, with gradient ∇ϕ(x), then ∂ϕ(x) = {∇ϕ(x)}. The infimal convolution of ϕ and ψ ∈ Γ₀(H) is ϕ □ ψ : x ↦ inf_{y∈H} ( ϕ(y) + ψ(x − y) ). Given any z ∈ dom ϕ, the recession function of ϕ is rec ϕ : x ↦ lim_{α→+∞} ( ϕ(z + αx) − ϕ(z) )/α. Finally, the proximity operator of ϕ is [27]

prox_ϕ : H → H : x ↦ argmin_{y∈H} ( ϕ(y) + (1/2)‖x − y‖² ).    (2.14)

For detailed accounts of convex analysis, see [4,31].
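Definition (2.14) can be checked numerically for scalar functions. The following brute-force sketch (ours, using SciPy) evaluates prox_{γϕ} by direct minimization and verifies it against the classical closed form for ϕ = |·|, namely soft thresholding.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox_num(phi, x, gamma=1.0):
    """Numerical proximity operator (2.14) of a scalar phi:
    argmin_y gamma*phi(y) + (1/2)(x - y)^2.  Reference implementation only."""
    return minimize_scalar(lambda y: gamma * phi(y) + 0.5 * (x - y) ** 2).x

# Sanity check against the prox of |.|, the soft-thresholding operator
x, gamma = 1.7, 0.5
soft = np.sign(x) * max(abs(x) - gamma, 0.0)
assert abs(prox_num(abs, x, gamma) - soft) < 1e-6
```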

The geometry of proximity operators of perspective functions
Let ϕ ∈ Γ₀(H). The perspective of ϕ is

ϕ̃ : R ⊕ H → ]−∞, +∞] : (σ, x) ↦ σϕ(x/σ) if σ > 0; (rec ϕ)(x) if σ = 0; +∞ otherwise.    (2.15)

We have ϕ̃ ∈ Γ₀(R ⊕ H) [8, Proposition 2.3]. The following property is useful to establish existence results for problems involving perspective functions.
Let us now turn to the proximity operator of ϕ̃.
When dom ϕ* is not open, Lemma 2.2 is not applicable. To deal with such cases, we propose a geometric construction that computes prox_{γϕ̃} via the projection onto a certain convex set. It is based on the following property, which reduces the evaluation of the proximity operator of ϕ̃ to a projection problem in R² when ϕ is radially symmetric.
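Before stating the result, it is useful to have an independent numerical reference. The following sketch (ours) evaluates prox_{γϕ̃} at a point (σ, x) ∈ R² by direct minimization over the half-plane s ≥ 0; the closed-form operators derived below can be validated against it.

```python
import numpy as np
from scipy.optimize import minimize

def prox_perspective_num(phi, recession, sigma, x, gamma=1.0):
    """Brute-force prox of the perspective (2.15) of a scalar phi at (sigma, x):
    minimize gamma*phi~(s, u) + (1/2)((s - sigma)^2 + (u - x)^2) over s >= 0, u in R.
    Reference implementation only; the paper derives closed forms instead."""
    def objective(z):
        s, u = z
        if s < 0.0:
            return np.inf
        val = recession(u) if s == 0.0 else s * phi(u / s)
        return gamma * val + 0.5 * ((s - sigma) ** 2 + (u - x) ** 2)
    # several starting points guard against the nonsmooth boundary s = 0
    starts = [(1.0, x), (0.1, 0.0), (2.0, -abs(x))]
    best = min((minimize(objective, z0, method="Nelder-Mead") for z0 in starts),
               key=lambda r: r.fun)
    return best.x
```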

Examples
We provide several examples that are relevant to the statistical problems under consideration. Each example specifies a function ϕ ∈ Γ₀(H) and, given σ ∈ R and x ∈ H, provides prox_{γϕ̃}(σ, x) in closed form via a case analysis.

Optimization model and examples
Let us first recall that our data formation model is Model 1.1. We now introduce our perspective M-estimation model, which enables the simultaneous estimation of the regression vector b = (β_k)_{1≤k≤p} ∈ R^p as well as of the scale vectors s = (σ_i)_{1≤i≤N} ∈ R^N and t = (τ_i)_{1≤i≤P} ∈ R^P. If robust data fitting functions are used, the outlier vector in Model 1.1 can be identified from the solution of (3.2) below. For instance, if the Huber function is used for data fitting, one can estimate the mean shift vector o in (1.1) [1,33].
The proposed perspective M-estimation optimization problem is as follows.
Problem 3.1 Let N and P be strictly positive integers, let ς ∈ Γ₀(R^N), let ϖ ∈ Γ₀(R^P), let θ ∈ Γ₀(R^p), let (n_i)_{1≤i≤N} be strictly positive integers such that Σ_{i=1}^{N} n_i = n, and let (p_i)_{1≤i≤P} be strictly positive integers. For every i ∈ {1, . . . , N}, let ϕ_i ∈ Γ₀(R^{n_i}), let X_i ∈ R^{n_i×p}, and let y_i ∈ R^{n_i} be such that y = (y_1, . . . , y_N). Finally, for every i ∈ {1, . . . , P}, let ψ_i ∈ Γ₀(R^{p_i}) and let L_i ∈ R^{p_i×p}. The objective of perspective M-estimation is to

minimize over (s, t, b) ∈ R^N × R^P × R^p:  ς(s) + ϖ(t) + θ(b) + Σ_{i=1}^{N} ϕ̃_i(σ_i, X_i b − y_i) + Σ_{i=1}^{P} ψ̃_i(τ_i, L_i b),    (3.2)

where s = (σ_i)_{1≤i≤N} and t = (τ_i)_{1≤i≤P}.

(ii) It is also possible to use "scaleless" non-perspective functions of the variables (X_i b − y_i)_{1≤i≤N} and (L_i b)_{1≤i≤P}. For instance, given i ∈ {1, . . . , N}, the term ϕ_i(X_i b − y_i) is obtained by using ϕ̃_i(σ_i, X_i b − y_i) and imposing σ_i = 1 via ς.
(iii) We attach individual scale variables to each of the functions (ϕ̃_i)_{1≤i≤N} and (ψ̃_i)_{1≤i≤P} for flexibility in the case of heteroscedastic models, but also for computational reasons. Indeed, the proximal tools we propose in Sections 4 and 5 handle separable functions better. For instance, it is hard to process the non-separable function (σ, b) ↦ Σ_{i=1}^{N} ϕ̃_i(σ, X_i b − y_i), which uses a single common scale, via proximal tools, whereas the equivalent separable function (s, b) ↦ Σ_{i=1}^{N} ϕ̃_i(σ_i, X_i b − y_i), with the coupling σ_1 = · · · = σ_N of the scales imposed via ς, is much easier to handle (see the sketch below).
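As an illustration of such couplings: when ς contains the indicator of the diagonal subspace D = { s ∈ R^N | σ_1 = · · · = σ_N }, the corresponding proximal step is simply the projection onto D, which averages the scales. A one-line sketch (ours):

```python
import numpy as np

def proj_equal_scales(s):
    """Projection onto D = {s : s_1 = ... = s_N}: every scale becomes the mean."""
    return np.full_like(s, np.mean(s))
```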
We now present some important instantiations of Problem 3.1.
It derives from Problem 3.1 by setting the functions and operators as in (3.8). For q = 1, we obtain the fused lasso model [38], while q = 2 yields the smooth lasso formulation of [18]. Let us note that one obtains alternative formulations, such as that of [37], by suitably redefining the operators (L_i)_{1≤i≤P} in (3.8).

Example 3.5
Given ρ₁ and ρ₂ in ]0, +∞[, the formulation proposed in [29] is the minimization of the objective function (1.4), where h_{ρ₁} and b_{ρ₂} are the Huber and Berhu functions of (2.30) and (2.59), respectively. From a convex optimization viewpoint, we reformulate this problem more formally in terms of the lower semicontinuous perspective functions of (2.15) to obtain

minimize over (b, σ, τ) ∈ R^p × R × R:  Σ_{i=1}^{n} ( h̃_{ρ₁}(σ, x_i^⊤ b − η_i) + δ₁σ ) + α₁ Σ_{k=1}^{p} ( b̃_{ρ₂}(τ, β_k) + δ₂τ ).    (3.10)

This is a special case of Problem 3.1. If one omits the right-most summation in (3.10), one recovers Huber's concomitant model [21]. The proximity operator of ς is straightforward to evaluate, and prox_ϖ is computed likewise. On the other hand, the proximity operators of h̃_{ρ₁} and b̃_{ρ₂} are provided in Examples 2.5 and 2.6, respectively.
An estimator based on an objective of this form was proposed in [40] under the name "natural lasso." It can be cast in the framework of Problem 3.1. The proximity operators of θ and ϖ are given in [13].

Example 3.9
Given α and ε in ]0, +∞[, define v_i : R → R : η ↦ α + max(|η| − ε, 0). Using the perspective function derived in Example 2.8, we can rewrite the linear ν-support vector regression problem of [32] as an instance of Problem 3.1 in which the data fitting terms are the perspectives ṽ_i(σ, x_i^⊤ b − η_i) and the common scale σ is coupled via ς. The proximity operator of ṽ_i is given in Example 2.8 and that of ς in (3.12). The concomitant parameter σ scales the width of the "tube" in ν-support vector regression and trades off model complexity and slack variables [32].
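A short numerical illustration (ours) of the Vapnik-type function and of how its perspective rescales the tube: for σ > 0, σ v_i(η/σ) = ασ + max(|η| − εσ, 0), so the insensitivity region grows linearly with σ.

```python
import numpy as np

def vapnik(eta, alpha, eps):
    """v(eta) = alpha + max(|eta| - eps, 0): eps-insensitive loss plus a constant."""
    return alpha + np.maximum(np.abs(eta) - eps, 0.0)

# Perspective identity: the tube width eps*sigma scales with the concomitant sigma
alpha, eps, sigma, eta = 0.5, 1.0, 2.0, 1.5
lhs = sigma * vapnik(eta / sigma, alpha, eps)
rhs = alpha * sigma + max(abs(eta) - eps * sigma, 0.0)
assert np.isclose(lhs, rhs)
```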
The next two examples are novel M-estimators that will be employed in Section 5.

Example 3.10
In connection with (3.1), we introduce a generalized heteroscedastic scaled lasso with N data blocks, which employs the perspective c̃_{i,q} derived in Example 2.4. Recall that n_i is the number of data points in the ith block and let α₁ ∈ [0, +∞[. The objective couples the residual X_i b − y_i of each block with its own scale σ_i through c̃_{i,q} and adds the sparsity-promoting penalty α₁‖b‖₁; this is a special case of Problem 3.1 with the identifications in (3.26). The choice of the exponent q ∈ ]1, +∞[ reflects prior distributional assumptions on the noise, and the model can handle generalized normal distributions. The proximity operator of c̃_{i,q} is provided in Example 2.4.
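For q = 2, the block objective can be cross-checked with a generic conic solver, since cvxpy's quad_over_lin atom is itself a perspective function. The sketch below (ours) uses the standard per-block scaled lasso form ‖X_i b − y_i‖²/(2σ_i) + n_i σ_i/2 plus α₁‖b‖₁; it is an independent check under that assumed form, not the proximal algorithm of Section 4, and the function name is ours.

```python
import cvxpy as cp

def scaled_lasso_blocks(Xs, ys, alpha1):
    """Heteroscedastic scaled lasso with N blocks (q = 2 case, assumed form).
    quad_over_lin(r, s) = ||r||^2 / s is the perspective of the squared norm."""
    p = Xs[0].shape[1]
    b = cp.Variable(p)
    s = cp.Variable(len(Xs), nonneg=True)  # one scale per block
    blocks = [cp.quad_over_lin(Xs[i] @ b - ys[i], s[i]) / 2
              + Xs[i].shape[0] * s[i] / 2 for i in range(len(Xs))]
    prob = cp.Problem(cp.Minimize(sum(blocks) + alpha1 * cp.norm1(b)))
    prob.solve()
    return b.value, s.value
```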

Example 3.11
In connection with (3.1), we introduce a generalized heteroscedastic Huber M-estimator with J scale variables (σ_j)_{1≤j≤J}, which employs the perspective derived in Example 2.5. Each scale σ_j is attached to a group of m_j data points, hence Σ_{j=1}^{J} m_j = n. Let α₁ and α₂ be in [0, +∞[, let δ, ρ₁, and ρ₂ be in ]0, +∞[, and denote by h_{ρ₁,q} the function in (2.31), where H = R. The objective attaches the perspective of h_{ρ₁,q} with scale σ_j to the jth group of residuals and combines it with the penalties governed by α₁, α₂, and ρ₂; see (3.28). The choice of the exponent q ∈ ]1, +∞[ reflects prior distributional assumptions on the noise. This model handles generalized normal distributions and can identify outliers.

Remark 3.12 Particular instances of perspective M-estimation models come with statistical guarantees. For the scaled lasso, initial theoretical guarantees are given in [35]. In [22,23], results are provided for the homoscedastic Huber M-estimator with adaptive ℓ₁ penalty and the adaptive Berhu penalty. In [17], explicit bounds on the estimation and prediction errors of "convex loss lasso" problems are given, which cover the scaled homoscedastic lasso, the least absolute deviation model, and the homoscedastic Huber model. For the heteroscedastic M-estimators presented above, statistical guarantees are, to the best of our knowledge, not yet available.

Algorithm
Recall from (3.2) that the problem of perspective M-estimation is to

minimize over (s, t, b) ∈ R^N × R^P × R^p:  ς(s) + ϖ(t) + θ(b) + Σ_{i=1}^{N} ϕ̃_i(σ_i, X_i b − y_i) + Σ_{i=1}^{P} ψ̃_i(τ_i, L_i b).    (4.1)

This minimization problem is quite complex, as it involves a sum of several terms, compositions with linear operators, and perspective functions. In addition, none of the functions present in the model is assumed to be smooth or to have full domain. In this section, we reformulate (4.1) in a suitable higher-dimensional product space through a series of reparametrizations. The resulting reformulation is shown to be solvable by the Douglas-Rachford splitting algorithm.
Once reformulated in the original scale/regression space, this algorithm yields a new proximal splitting method that requires only the separate use of the proximity operators of the functions ς, ϖ, θ, (ϕ̃_i)_{1≤i≤N}, and (ψ̃_i)_{1≤i≤P}, together with applications of simple linear transformations. It will be shown to produce sequences (s_k)_{k∈N}, (t_k)_{k∈N}, and (b_k)_{k∈N} that converge, respectively, to vectors s, t, and b solving (4.1). A generic sketch of the underlying Douglas-Rachford iteration follows.
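The following sketch (ours) shows the standard Douglas-Rachford iteration for minimizing f + g from the proximity operators of f and g; the product-space reformulation of Section 4 reduces (4.1) to this template, but the specific construction is not reproduced here.

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, z0, gamma=1.0, n_iter=1000):
    """Generic Douglas-Rachford splitting for: minimize f(x) + g(x).
    prox_f and prox_g take (point, gamma) and return prox_{gamma*f or g}(point)."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        x = prox_f(z, gamma)              # x_k = prox_{gamma f}(z_k)
        y = prox_g(2.0 * x - z, gamma)    # reflected step through g
        z = z + (y - x)                   # z_{k+1} = z_k + y_k - x_k
    return prox_f(z, gamma)

# Example: minimize |x| + (1/2)(x - 3)^2; the solution is x = 2
soft = lambda v, g: np.sign(v) * np.maximum(np.abs(v) - g, 0.0)  # prox of g*|.|
quad = lambda v, g: (v + 3.0 * g) / (1.0 + g)  # prox of g*(1/2)(x - 3)^2
x_star = douglas_rachford(soft, quad, z0=np.zeros(1))
```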

Numerical illustrations on low-dimensional data
Our algorithmic approach to perspective M-estimation can effortlessly handle non-smooth data fitting terms. To illustrate this property, we consider a partially noiseless data formation model in low dimensions. We instantiate the data model (1.1) as follows. We consider a design matrix X ∈ R^{n×p} with p = 3 and sample size n = 18. The entries of the design matrix and the noise vector e ∈ R^n are sampled from a standard normal distribution N(0, 1). The matrix C ∈ R^{n×n} is a diagonal matrix with N = 2 groups. We set s = [σ₁, σ₂]^⊤ = [3, 0]^⊤. The ith diagonal entry of C is set to σ₁ for i ∈ {1, . . . , 9} and to σ₂ for i ∈ {10, . . . , 18}, resulting in noise-free observations for the second group.
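The setup above can be reproduced in a few lines (ours); the true regression vector and the absence of a mean shift are not specified in the text, so both are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 18, 3
X = rng.standard_normal((n, p))          # design entries ~ N(0, 1)
e = rng.standard_normal(n)               # noise realizations ~ N(0, 1)
diag_C = np.concatenate([np.full(9, 3.0), np.full(9, 0.0)])  # s = [3, 0]
b_true = rng.standard_normal(p)          # assumption: b is not specified in the text
y = X @ b_true + diag_C * e              # partially noiseless observations, o = 0 assumed
```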

Numerical illustrations for correlated designs and outliers
To illustrate the efficacy of the different M-estimators, we instantiate the full data formation model (1.1) as follows. We consider a design matrix X ∈ R^{n×p} with p = 64 and sample size n = 75, where each row X_i is sampled from a correlated normal distribution N(0, Σ) with off-diagonal entries 0. The presence of outliers, correlation in the design, and heteroscedasticity poses a considerable challenge for regression estimation and support recovery with standard models such as the lasso. We consider instances of the perspective M-estimation model of increasing complexity that can cope with various aspects of the data formation model. Specifically, we use the models described in Examples 3.10 and 3.11 (with α₂ = 0) in homoscedastic and heteroscedastic mode. For all models, we compute the minimally achievable mean absolute error (MAE) ‖Xb̂ − Xb‖₁/n across the α₁-regularization path, where α₁ ∈ {0.254, . . . , 25.42}, with 50 values equally spaced on a log-linear grid. The convergence criterion is ε = 5·10⁻⁴.
Homoscedastic models. We first consider homoscedastic instances of Examples 3.10 and 3.11, in which we jointly estimate a regression vector and a single concomitant scale parameter in the data fitting part. We consider the generalized scaled lasso of Example 3.10 and the generalized Huber M-estimator of Example 3.11 with exponents q ∈ {3/2, 2}. Figure 3 presents the estimation results for b over the relevant α₁-path.
Heteroscedastic models. We consider the same model instances as previously described, but in the heteroscedastic setting. We jointly estimate regression vectors and concomitant scale parameters for each of the three groups. Figure 4 presents the results for the heteroscedastic lasso and Huber estimations of b across the relevant α₁-path. The convergence criterion is ε = 10⁻⁴.
The numerical experiments indicate that only the heteroscedastic M-estimators are able to produce convincing estimates of b (as captured by lower MAE). The heteroscedastic Huber model with q = 3/2 (see the lower right panel of Figure 4) achieves the best MAE among all tested models.

Robust regression for gene expression data
We consider a high-dimensional linear regression problem from genomics [6]. The design matrix X consists of p = 4088 highly correlated gene expression profiles for n = 71 different strains of Bacillus subtilis (B. subtilis). The response y ∈ R^{71} comprises standardized riboflavin (vitamin B2) log-production rates for each strain. The statistical task is to identify a small set of genes that is highly predictive of the riboflavin production rate. No grouping of the different strain measurements is available. We thus consider the homoscedastic models from Example 3.6 with α₂ = 0 and Example 3.11 with α₂ = 0. We optimize the corresponding perspective M-estimation models over the α₁-path, where α₁ ∈ {0.623, . . . , 6.23} with 20 values equally spaced on a log-linear grid. We compare the resulting models with the standard lasso in terms of in-sample prediction performance. Figure 5 summarizes the results for the in-sample prediction of the three different models at identical model complexity (twelve non-zero entries in b). To assess model quality, we compute the minimally achievable mean absolute error (MAE) ‖Xb̂ − y‖₁/n for these three models. The Huber model achieves a markedly improved MAE (0.24) compared to the lasso (0.32). The Huber model also identifies 26 non-zero components in the outlier vector o (shown in red in the rightmost panel of Figure 5).