Perspective Functions: Proximal Calculus and Applications in High-Dimensional Statistics

Perspective functions arise explicitly or implicitly in various forms in applied mathematics and in statistical data analysis. To date, no systematic strategy is available to solve the associated, typically nonsmooth, optimization problems. In this paper, we fill this gap by showing that proximal methods provide an efficient framework to model and solve problems involving perspective functions. We study the construction of the proximity operator of a perspective function under general assumptions and present important instances in which the proximity operator can be computed explicitly or via straightforward numerical operations. These results constitute central building blocks in the design of proximal optimization algorithms. We showcase the versatility of the framework by designing novel proximal algorithms for state-of-the-art regression and variable selection schemes in high-dimensional statistics.


Introduction
Perspective functions appear, often implicitly, in various problems in areas as diverse as statistics, control, computer vision, mechanics, game theory, information theory, signal recovery, transportation theory, machine learning, disjunctive optimization, and physics (see the companion paper [7] for a detailed account). In the setting of a real Hilbert space G, the most useful form of a perspective function, first investigated in Euclidean spaces in [24], is the following: given ϕ ∈ Γ0(G), its perspective is

ϕ̃ : R ⊕ G → ]−∞, +∞] : (η, y) → η ϕ(y/η), if η > 0; (rec ϕ)(y), if η = 0; +∞, if η < 0,   (1.1)

where rec ϕ denotes the recession function of ϕ. Many scientific problems result in minimization problems that involve perspective functions. In statistics, a prominent instance is the modeling of data via "maximum likelihood-type" estimation (or M-estimation) with a so-called concomitant parameter [17]. In this context, ϕ is a likelihood function, η takes the role of the concomitant parameter, e.g., an unknown scale or location of the assumed parametric distribution, and y comprises unknown regression coefficients. The statistical problem is then to simultaneously estimate the concomitant variable and the regression vector from data via optimization. Another important example in statistics [15], signal recovery [5], and physics [16] is the Fisher information of a function x : R^N → ]0, +∞[, namely

∫_{R^N} ( ‖∇x(t)‖²₂ / x(t) ) dt,   (1.2)

which hinges on the perspective function of the squared Euclidean norm (see [7] for further discussion).
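As a concrete numerical illustration, the perspective of the squared Euclidean norm evaluates to ‖y‖²₂/η for η > 0, which is precisely the integrand of the Fisher information (1.2). A minimal sketch (the function name and conventions are ours, not from the paper):

```python
import math

def perspective_sq_norm(eta, y):
    """Perspective of phi = ||.||_2^2: eta * phi(y/eta) = ||y||_2^2 / eta
    for eta > 0; for eta = 0 the recession function of phi gives the
    value 0 at y = 0 and +inf elsewhere; the function is +inf for eta < 0."""
    if eta > 0:
        return sum(v * v for v in y) / eta
    if eta == 0 and all(v == 0 for v in y):
        return 0.0
    return math.inf
```

The piecewise definition mirrors (1.1): the η > 0 branch is the classical perspective, while the boundary values make the function lower semicontinuous.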
In the literature, problems involving perspective functions are typically solved with a wide range of ad-hoc methods. Despite the ubiquity of perspective functions, no systematic structuring framework has been available to approach these problems. The goal of this paper is to fill this gap by showing that they are amenable to solution by proximal methods, which offer a broad array of splitting algorithms to solve complex nonsmooth problems with attractive convergence guarantees [1,8,11,14]. The central element in the successful implementation of a proximal algorithm is the ability to compute the proximity operator of the functions present in the optimization problem. We therefore propose a systematic investigation of proximity operators for perspective functions and show that the proximal framework can efficiently solve perspective-function based problems, unveiling in particular new applications in high-dimensional statistics.
In Section 2, we introduce basic concepts from convex analysis and review essential properties of perspective functions. We then study the proximity operator of perspective functions in Section 3, where we establish a characterization of the proximity operator and provide examples of computation for concrete instances. Section 4 unveils new applications of perspective functions in high-dimensional statistics and demonstrates the flexibility and potency of the proposed framework to both model and solve complex problems in statistical data analysis.

Notation and background

Notation and elements of convex analysis
Throughout, H, G, and K are real Hilbert spaces and H ⊕ G denotes their Hilbert direct sum. The symbol ‖·‖ denotes the norm of a Hilbert space and ⟨· | ·⟩ the associated scalar product. The closed ball with center x ∈ K and radius ρ ∈ ]0, +∞[ is denoted by B(x; ρ). We denote by Γ0(K) the class of lower semicontinuous convex functions f : K → ]−∞, +∞] with nonempty domain dom f = {x ∈ K | f(x) < +∞}, and we let f ∈ Γ0(K). The conjugate of f is the function

f* : K → ]−∞, +∞] : u → sup_{x∈K} ( ⟨x | u⟩ − f(x) ).   (2.1)

It also belongs to Γ0(K) and f** = f. The subdifferential of f is the set-valued operator

∂f : K → 2^K : x → { u ∈ K | (∀y ∈ K) ⟨y − x | u⟩ + f(x) ≤ f(y) }.

The infimal convolution operation is denoted by □. Now let C be a subset of K. Then ι_C (equal to 0 on C and to +∞ on its complement) is the indicator function of C, d_C is the distance function to C, and σ_C is the support function of C. If C is nonempty, closed, and convex then, for every x ∈ K, there exists a unique point P_C x ∈ C, called the projection of x onto C, such that ‖x − P_C x‖ = d_C(x). The normal cone to C maps each x ∈ C to { u ∈ K | (∀y ∈ C) ⟨y − x | u⟩ ≤ 0 } and each x ∉ C to ∅. For further background on convex analysis, see [1,24].
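To make the projection and distance notions concrete, here is a small sketch for the closed ball B(center; ρ), for which both operators have closed forms (the function names are ours):

```python
import math

def project_ball(x, center, rho):
    """Projection P_C of x onto the closed ball C = B(center; rho):
    the unique point of C closest to x."""
    d = [xi - ci for xi, ci in zip(x, center)]
    nd = math.sqrt(sum(v * v for v in d))
    if nd <= rho:
        return list(x)  # x already lies in C
    return [ci + rho * di / nd for ci, di in zip(center, d)]

def dist_ball(x, center, rho):
    """Distance function d_C(x) = ||x - P_C x||."""
    p = project_ball(x, center, rho)
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))
```

For x outside the ball, the projection moves x radially toward the center until it hits the boundary, so d_C(x) is the excess of ‖x − center‖ over ρ.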

Proximity operators
The proximity operator of f ∈ Γ0(K) is

prox_f : K → K : x → argmin_{y∈K} ( f(y) + (1/2)‖x − y‖² ).

This operator was introduced by Moreau in 1962 [20] to model problems in unilateral mechanics. In [12], it was shown to play an important role in the investigation of various data processing problems, and it has become increasingly prominent in the general area of data analysis [10,25]. We review basic properties and refer the reader to [1] for a more complete account.
Let f ∈ Γ0(K) and let x ∈ K. Then p = prox_f x is characterized by the inclusion x − p ∈ ∂f(p). (2.14) If C is a nonempty closed convex subset of K, then prox_{ι_C} = P_C. (2.15) The following facts will also be needed.
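For f = γ‖·‖₁, the proximity operator is the componentwise soft-thresholding map; a brute-force scalar version computed straight from the defining minimization can serve as a sanity check. An illustrative sketch (names and grid parameters are ours):

```python
import math

def prox_l1(x, gamma):
    """prox of gamma*||.||_1: componentwise soft thresholding
    sign(x_i) * max(|x_i| - gamma, 0)."""
    return [math.copysign(max(abs(v) - gamma, 0.0), v) for v in x]

def prox_scalar_numeric(f, x, gamma, lo=-10.0, hi=10.0, n=200001):
    """Grid approximation of prox_{gamma f}(x) directly from the definition
    argmin_y f(y) + (1/(2*gamma)) * (x - y)**2."""
    best_y, best_val = lo, float("inf")
    for k in range(n):
        y = lo + (hi - lo) * k / (n - 1)
        val = f(y) + (x - y) ** 2 / (2.0 * gamma)
        if val < best_val:
            best_y, best_val = y, val
    return best_y
```

For instance, `prox_scalar_numeric(abs, 3.0, 1.0)` agrees with the closed-form soft threshold up to the grid resolution.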

Perspective functions
We review here some essential properties of perspective functions.
(2.23) (iv) Suppose that dom ϕ* is open or that ϕ is supercoercive, let η ∈ R, and let y ∈ G. Then ∂ϕ̃(η, y) is given by (2.24). We refer to the companion paper [7] for further properties of perspective functions as well as examples. Here are two important instances of (composite) perspective functions that will play a central role in Section 4.

Main result
We start with a characterization of the proximity operator of a perspective function when dom ϕ * is open.
Next, we compute the proximity operator of a special case of the perspective function introduced in Lemma 2.5.
• y = γv: As seen in (3.33), p = v. Using (3.30) and (3.31), (3.32) can be rewritten as an identity which, in view of (3.27), is equivalent to (3.36). Upon taking the norm on both sides of the second equality, we obtain a scalar equation in ψ. We note that, since φ* is convex, ψ is the derivative of a strongly convex function. Consequently, ψ is strictly increasing [1, Proposition 17.13], hence invertible. In turn, (3.36) yields (3.28).

Example 3.6 Define
Proof. This is a special case of Corollary 3.5 with δ = 0 and v = 0.
Proof. Set z = (x, y). It follows from Lemma 2.3(i) that ϕ̃ ∈ Γ0(R ⊕ G). The result is therefore obtained by applying Lemma 2.1 with K = R ⊕ G and K = H ⊕ G.
Remark 3.10 Proposition 3.9 provides a general setting for computing the proximity operators of abstract integral functionals by reducing them to the computation of the proximity operator of the integrand. In particular, by suitably choosing the underlying measure space and the integrand, it provides a framework for computing the proximity operators of the integral functions based on perspective functions discussed in [7], which include general divergences. For instance, discrete N-dimensional divergences are obtained by setting Ω = {1, . . . , N} and F = 2^Ω, and letting µ be the counting measure (hence H = G = R^N) and G = R. While completing the present paper, it has come to our attention that the computation of the proximity operators of discrete divergences has also been recently addressed in [13].

Applications in high-dimensional statistics
Sections 2 and 3 provide a unifying framework to model a variety of problems around the notion of a perspective function. By applying the results of Section 3 in existing proximal algorithms, we obtain efficient methods to solve complex problems. To illustrate this point, we focus on a specific application area: high-dimensional regression in the statistical linear model.

Penalized linear regression
We consider the standard statistical linear model

z = Xb + σe,   (4.1)

where z = (ζ_i)_{1≤i≤n} ∈ R^n is the response, X ∈ R^{n×p} a design (or feature) matrix, b = (β_j)_{1≤j≤p} ∈ R^p a vector of regression coefficients, σ ∈ ]0, +∞[, and e = (ε_i)_{1≤i≤n} the noise vector; each ε_i is the realization of a random variable with mean zero and variance 1. Henceforth, we denote by X_{i:} the ith row of X and by X_{:j} the jth column of X. In the high-dimensional setting where p > n, a typical assumption about the regression vector b is sparsity. In this scenario, the Lasso [27] has become a fundamental tool for variable selection and predictive modeling. It is based on solving the penalized least-squares problem

minimize_{b∈R^p} (1/2)‖Xb − z‖²₂ + λ‖b‖₁,   (4.2)

where λ ∈ [0, +∞[ is a regularization parameter that controls the sparsity of the solution. The Lasso has strong performance guarantees in terms of support recovery, estimation, and prediction if one takes λ ∝ σ‖X⊤e‖_∞. In the high-dimensional setting, two shortcomings of the Lasso are the bias introduced in the final estimates by the ℓ1 norm, and the lack of knowledge of σ, which necessitates tuning λ via model selection strategies that depend on σ. Bias reduction can be achieved by using a suitably weighted ℓ1 norm, resulting in the adaptive Lasso [30] formulation

minimize_{b∈R^p} (1/2)‖Xb − z‖²₂ + λ ∑_{j=1}^p w_j |β_j|,   (4.3)

where the fixed weights w_j ∈ ]0, +∞[ are estimated from data. In [30], it was shown that, for suitable choices of w_j, the adaptive Lasso produces (asymptotically) unbiased estimates of b. One of the first methods to alleviate the σ-dependency of the Lasso was the Sqrt-Lasso [2], which is based on the formulation

minimize_{b∈R^p} (1/√n)‖Xb − z‖₂ + λ‖b‖₁.   (4.4)

This optimization problem can be cast as a second-order cone program (SOCP) [2].
The modification of the objective function can be interpreted as an (implicit) scaling of the Lasso objective function by the estimate ‖Xb − z‖₂/√n of σ [19]. In [2], it was shown that the tuning parameter λ of the Sqrt-Lasso does not depend on σ.
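The Lasso problem above can be solved by the proximal-gradient (ISTA) iteration, which alternates a gradient step on the least-squares term with soft thresholding. A minimal dependency-free sketch (step size, iteration count, and names are ours, for illustration only):

```python
import math

def soft(v, t):
    """Scalar soft thresholding: sign(v) * max(|v| - t, 0)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def lasso_ista(X, z, lam, step, iters=500):
    """Proximal-gradient iteration for
    minimize_b (1/2)*||X b - z||_2^2 + lam*||b||_1,
    with step <= 1 / ||X^T X||; X is a list of rows."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        r = [sum(X[i][j] * b[j] for j in range(p)) - z[i] for i in range(n)]
        g = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        b = [soft(b[j] - step * g[j], step * lam) for j in range(p)]
    return b
```

When X is orthonormal, a single iteration with unit step recovers the closed-form solution soft(X⊤z, λ) componentwise.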
Alternative approaches rely on the idea of simultaneously and explicitly estimating b and σ from the data. The scaled Lasso [26], a robust hybrid of ridge and Lasso regression [23], and the TREX [19] are important instances. In the following, we show that these estimators are based on perspective functions under the unifying statistical framework of concomitant estimation. We introduce a novel family of estimators and show how the corresponding optimization problems can be solved using proximal algorithms. In particular, we derive novel proximal algorithms for solving both the standard TREX and a novel generalized version of the TREX that includes the Sqrt-Lasso as a special case.

Penalized concomitant M-estimators
In statistics, the task of simultaneously estimating a regression vector b and an additional model parameter is referred to as concomitant estimation. In [17], Huber introduced a generic method for formulating "maximum likelihood-type" estimators (or M-estimators) with a concomitant parameter from a convex criterion. Using perspective functions, we can extend Huber's construction and introduce the class of penalized concomitant M-estimators, defined through a convex optimization problem with concomitant variables σ and τ under the assumptions outlined in Theorem 3.1 and in Section 3.2.
Here, ϕ_i ∈ Γ0(R), ψ_j ∈ Γ0(R), and a_j ∈ R^p. The ϕ_i are data fitting terms and the ψ_j are penalty terms. A prominent instance of this family of estimators is the scaled Lasso [26], which yields estimates equivalent to those of the Sqrt-Lasso: setting ϕ_i = |·|²/(2n) + 1/2 and ψ_j = λ|·| leads to the scaled (or concomitant) Lasso formulation (see Lemma 2.5, Corollary 3.5, and [21]). Other choices of functions recover well-known estimators. For instance, taking each ϕ_i to be the Huber function (see Example 3.12) and each ψ_j to be the Berhu (reversed Huber) function recovers the robust Lasso variant introduced and discussed in [23]. Setting each ψ_j = λ|w_j ·| to be a weighted ℓ1 component results in the "Huber + adaptive Lasso" estimator analyzed theoretically in [18]. Note that, for the latter two approaches, no dedicated optimization algorithms with provable convergence guarantees have been available. Combining the proximity operators introduced here with proximal algorithms enables us to design such algorithms. To exemplify this framework, we focus next on a particular instance of a penalized concomitant M-estimator, the TREX, and derive the corresponding proximity operators and proximal algorithms.
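For reference, the Huber and Berhu functions mentioned above can be written down directly; the threshold ρ is a free parameter, and the default value 1.345 below is the conventional Huber tuning constant, chosen here purely for illustration:

```python
def huber(t, rho=1.345):
    """Huber function: quadratic near zero, linear in the tails."""
    return 0.5 * t * t if abs(t) <= rho else rho * abs(t) - 0.5 * rho * rho

def berhu(t, rho=1.345):
    """Berhu (reversed Huber) penalty: linear near zero, quadratic in the tails."""
    return abs(t) if abs(t) <= rho else (t * t + rho * rho) / (2.0 * rho)
```

The two functions are mirror images of each other: Huber tempers the influence of large residuals in the loss, while Berhu strengthens the penalty on large coefficients.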

Proximal algorithms for the TREX
The TREX [19] extends the Sqrt-Lasso and the scaled Lasso by taking into account the unknown noise distribution of e. Recalling that a theoretically desirable tuning parameter for the Lasso is λ ∝ σ‖X⊤e‖_∞, the TREX scales the Lasso objective by an estimate of this quantity, leading to problem (4.8). The parameter α > 0 can be set to a constant value (α = 1/2 being the default choice). In [19], promising statistical results were reported: an approximate version of the TREX, with no tuning of α, was shown to be a valid alternative to the Lasso. A major technical challenge in the TREX formulation is the non-convexity of the optimization problem. In [3], this difficulty is overcome by observing that problem (4.8), although non-convex, can be solved by finding the best solution among 2p convex problems of the form (4.9). Each subproblem can be reformulated as a standard SOCP and numerically solved using generic SOCP solvers [3]. Next we show how our perspective function approach allows us to derive proximal algorithms not only for the TREX subproblems but also for novel generalized versions of the TREX. The proximal algorithms construct a sequence (b_k)_{k∈N} that is guaranteed to converge to a solution to (4.9).
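Structurally, the strategy of [3] reduces to solving the 2p convex subproblems and keeping the candidate with the smallest objective value. A scaffold of this outer loop, with the subproblem solver and the TREX objective left as caller-supplied placeholders (both names are ours):

```python
def trex_by_enumeration(solve_subproblem, objective, p):
    """Solve the non-convex TREX problem by enumerating the 2p convex
    subproblems indexed by (j, s), j = 0..p-1, s in {-1, +1}, and
    returning the candidate with the smallest objective value.
    `solve_subproblem` and `objective` are caller-supplied callables."""
    best_b, best_val = None, float("inf")
    for j in range(p):
        for s in (-1, 1):
            b = solve_subproblem(j, s)
            val = objective(b)
            if val < best_val:
                best_b, best_val = b, val
    return best_b
```

The 2p subproblems are independent, so in practice this loop parallelizes trivially across (j, s) pairs.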

Proximal operators for the TREX subproblem
We first note that the data fitting term of the TREX subproblem (4.9) is a special case of (2.25) in which H = R^p, G = R^n, q = 2, L = X, r = z, u = X⊤x_j, and ρ = x_j⊤z. Given α ∈ ]0, +∞[, the data fitting term of the TREX subproblem thus assumes the form (4.10), and the corresponding TREX subproblem is (4.11). Now consider the linear transformation M_j of (4.12) and the function g_j of (4.13). Then f_j = g_j ∘ M_j. Upon setting h = ‖·‖₁, we see that (4.11) is of the form (4.14). Next, we determine the proximity operators prox_{g_j} and prox_h, as only these are needed in modern proximal splitting methods [8,11] to solve (4.14). The proximity operator prox_h is the standard soft-thresholding operator. A formula for prox_{g_j} is provided by Example 3.8 up to a shift by (x_j⊤z, z). Let γ ∈ ]0, +∞[ and let g be as in (3.48). Combining Example 3.8 and [1, Proposition 23.29(ii)], we obtain, for every η ∈ R and every y ∈ R^n, an explicit expression for prox_{γg}(η, y) in which t is the unique solution in ]0, +∞[ of a depressed cubic equation.
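The q = 2 case thus hinges on extracting the unique positive root of a depressed cubic t³ + pt + q₀ = 0. Since we do not reproduce the exact coefficients here, the following sketch takes them as inputs and assumes the bracketing condition q₀ < 0, which guarantees a sign change on ]0, +∞[ (names are ours):

```python
def positive_cubic_root(p, q0):
    """Unique root in ]0, +inf[ of f(t) = t**3 + p*t + q0, found by
    geometric bracket expansion followed by bisection; assumes f(0) < 0
    and that the positive root is unique, as in the prox characterization."""
    f = lambda t: t ** 3 + p * t + q0
    lo, hi = 0.0, 1.0
    while f(hi) < 0.0:
        hi *= 2.0  # expand until the root is bracketed
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Cardano's formula would give the root in closed form, but bisection on the bracketed interval is numerically robust and fast enough inside an iterative proximal scheme.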

Proximal operators for generalized TREX estimators
Thus far, we have shown that the data-fitting function in the TREX subproblem (4.9) is a special case of (2.25). However, the full potential of (2.25) is revealed by taking a general q ∈ ]1, +∞[, leading to the composite perspective function f_{j,q} of (4.18). This function is the data fitting term of a generalized TREX subproblem associated with the global generalized TREX objective (4.19), which defines a novel family of generalized TREX estimators parameterized by q.
The first important observation is that, in the limiting case q → 1, the generalized TREX estimator collapses to the Sqrt-Lasso (4.4). Second, particular choices of q allow for very efficient computation of the proximity operators of the generalized TREX subproblems. Considering the linear transformation M_j of (4.12) and the function g_{j,q} of (4.20), we arrive at f_{j,q} = g_{j,q} ∘ M_j. Setting h = ‖·‖₁, the corresponding problem is (4.21). The proximity operator prox_{g_{j,q}} is provided by Example 3.7, with δ = 0 and v = 0, up to a shift by (x_j⊤z, z). Let g be the function in (3.44) and let γ ∈ ]0, +∞[. Set q* = q/(q − 1), set ϱ = (α(1 − 1/q*))^{q*−1}, and take (η, y) ∈ R × G. If q*γ^{q*−1}(η − x_j⊤z) + ϱ‖y − z‖₂^{q*} > 0 and y ≠ z, let t ∈ ]0, +∞[ be the unique solution to the polynomial equation (4.22). Then we derive from Example 3.7 an explicit expression for prox_{γg_{j,q}}(η, y). The key step in the calculation of the proximity operator is to solve (4.22) efficiently. The solution is explicit for q = 2, as discussed in Example 3.8. For q = 3, we obtain a quartic equation that can also be solved explicitly. For q ∈ {(i + 1)/i | i ∈ N, i ≥ 2}, (4.22) is a polynomial equation with integer exponents and is thus amenable to efficient root-finding algorithms. For general q, a one-dimensional search for the root of a monotone function on a bounded interval must be performed.
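For general q, the scalar equation for t involves a non-integer exponent, but since its left-hand side is strictly increasing in t, bracketing followed by bisection suffices. A generic sketch of this one-dimensional search (the concrete test function with exponent 1.5 is only a stand-in for the actual equation):

```python
def increasing_root(f, lo=0.0, hi=1.0, iters=200):
    """Root of a continuous, strictly increasing f with f(lo) < 0:
    expand the bracket geometrically, then bisect."""
    while f(hi) < 0.0:
        lo, hi = hi, 2.0 * hi  # expand until a sign change is bracketed
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each bisection step halves the bracket, so a fixed iteration budget yields machine-precision roots at negligible cost per proximal iteration.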

Douglas-Rachford for generalized TREX subproblems
Problem (4.14) is a standard composite problem and can be solved via several proximal splitting methods that require only the ability to compute prox g j and prox h ; see [6] and references therein.
For large scale problems, one could also employ recent algorithms that benefit from block-coordinate [11] or asynchronous block-iterative implementations [8], while still guaranteeing the convergence of their sequence (b k ) k∈N of iterates to a solution to the problem. In this section, we focus on a simple implementation based on the Douglas-Rachford splitting method [1] in the context of the generalized TREX estimation to illustrate the applicability and versatility of the tools presented in Sections 2 and 3.
Then we can rewrite (4.14) in the form (4.25). Let γ ∈ ]0, +∞[, let y_0 ∈ R^{p+n+1}, and let (μ_k)_{k∈N} be a sequence in ]0, 2[ such that inf_{k∈N} μ_k > 0 and sup_{k∈N} μ_k < 2. The Douglas-Rachford algorithm is then given by (4.26), and the sequence (x_k)_{k∈N} is guaranteed to converge to a solution to (4.25) [1, Corollary 27.4]. In view of (2.15), the proximity operator of the indicator term is the projection operator onto V. Hence, (b_k)_{k∈N} converges to a solution b to (4.14) or (4.21). Note that the matrix R_j needs to be precomputed only once by inverting a positive definite symmetric matrix.
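In coordinates, the Douglas-Rachford recursion with constant relaxation μ_k ≡ μ can be sketched as follows; the test problem below minimizes (1/2)‖x − a‖² + ‖x‖₁, whose proximity operators are explicit (all names and the test problem are illustrative, not the TREX operators themselves):

```python
def douglas_rachford(prox_f, prox_g, y0, gamma=1.0, mu=1.0, iters=200):
    """Douglas-Rachford splitting for minimize f(x) + g(x):
        x_k     = prox_{gamma g}(y_k)
        y_{k+1} = y_k + mu * (prox_{gamma f}(2 x_k - y_k) - x_k).
    Under standard assumptions, (x_k) converges to a minimizer."""
    y = list(y0)
    x = list(y0)
    for _ in range(iters):
        x = prox_g(y, gamma)
        w = prox_f([2.0 * xi - yi for xi, yi in zip(x, y)], gamma)
        y = [yi + mu * (wi - xi) for yi, wi, xi in zip(y, w, x)]
    return x
```

In the TREX setting, prox_f and prox_g would be replaced by the projection onto V and by the operators derived in the preceding subsections.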

Numerical illustrations
We illustrate the convergence behavior of the Douglas-Rachford algorithm for TREX problems and the statistical performance of generalized TREX estimators using numerical experiments. All presented algorithms and experimental evaluations are implemented in MATLAB and are available at http://github.com/muellsen/TREX. All algorithms are run in MATLAB 2015a on a MacBook Pro with 2.8 GHz Intel Core i7 and 16 GB 1600 MHz DDR3 memory.
In practice, the Douglas-Rachford algorithm for the TREX subproblem can be enhanced by an online sign selection rule (DR-Sel). When a TREX subproblem for fixed X_{:j} is considered, we solve the problems for both signs s ∈ {−1, 1} concurrently for a small number k_0 of iterations (standard setting k_0 = 50) and then select the signed optimization problem with the best progress in terms of objective function value.
We compare the run time scaling and solution quality of Douglas-Rachford and DR-Sel with those of the state-of-the-art Splitting Conic Solver (SCS). SCS is a general-purpose first-order proximal method that provides numerical solutions to several standard classes of optimization problems, including SOCPs and Semidefinite Programs (SDPs). We use SCS in indirect mode [22] to solve the SOCP formulation of the TREX subproblem [3] with convergence tolerance 10 −4 .
The run time scaling results are shown in Figure 1. We emphasize that the scaling experiments are not meant to measure absolute algorithmic performance but rather the relative efficiency of the different optimization formulations and of the solvers applied to them. We observe that SCS with the SOCP formulation of the TREX compares favorably with Douglas-Rachford and DR-Sel in low dimensions while, for p > 200, both Douglas-Rachford variants perform better. DR-Sel outperforms Douglas-Rachford by a factor of 2 to 4 and always selects the correct signed subproblem (data not shown). The TREX solutions found by SCS and Douglas-Rachford are close in terms of ‖b^(DR) − b^(SCS)‖, with Douglas-Rachford typically reaching slightly lower objective values than SCS. Values for the first 40 dimensions of a typical solution b_K in p = 2000 dimensions are shown in Figure 1 (right panels).

Behavior of generalized TREX estimators
We next study the effect of the exponent q on the statistical behavior of the generalized TREX estimator. We use the synthetic setting outlined in [29] to study the phase transition behavior of the different generalized TREX estimators. We generate data from the linear model (4.1) with p = 64, m = ⌈0.4 p^{3/4}⌉ nonzero variables, regression vector b* = [−1, 1, −1, . . . , 0⊤_{p−m}]⊤, feature vectors X_{i:} ∼ N(0, Σ) with Σ_{ii} = 1 and Σ_{ij} = 0 for i ≠ j, and Gaussian noise e with σ = 0.5. Each column X_{:j} is normalized to have norm √n. We define the rescaled sample size θ(n, p, m) = n/(2m log(p − m)) and consider θ(n, p, m) ∈ {0.2, 0.4, . . . , 1.6}. At θ(n, p, m) = 1, the probability of exact recovery of the support of b* is 0.5 for the (Sqrt-)Lasso with oracle regularization parameter [29]. We consider the generalized TREX with exponents q ∈ {9/8, 7/6, 3/2, 2} and the Sqrt-Lasso as the limiting case q = 1. For all generalized TREX estimators we consider regularization parameters α ∈ {0.1, 0.15, . . . , 2}; for the Sqrt-Lasso we use the standard regularization path setting outlined in [21]. We solve all generalized TREX problems with the Douglas-Rachford scheme using the previously described parameter and convergence settings. We measure the probability of exact support recovery and the Hamming distance to the true support over d = 12 repetitions, thresholding all "numerical zeros" in the generalized TREX solution vectors at level 0.05. For the solutions closest to the true support in terms of Hamming distance, we also calculate the estimation error ‖b_K − b*‖²₂/n and the prediction error ‖Xb_K − Xb*‖²₂/n. Figure 2 shows average performance results across all repetitions. We observe several interesting phenomena for the family of generalized TREX estimators. In terms of exact recovery, the performance is slightly better than predicted by theory (see the gray dashed line in the top left panel of Figure 2), with a decrease in performance for increasing q.
This is also consistent with the average Hamming distance measurements (top right panel). The generalized TREX oracle solutions (selected by the minimum Hamming distance criterion) show the best performance in terms of estimation and prediction error for exponents q ∈ {9/8, 7/6}, followed by q ∈ {3/2, 2}. The present numerical experiments highlight the usefulness of the family of generalized TREX estimators for sparse linear regression problems. Further theoretical research is needed to derive asymptotic properties of the generalized TREX. A central prerequisite for establishing the generalized TREX as a statistical estimator is the ability to solve the underlying optimization problem with provable guarantees. We have shown that our perspective function framework, along with efficient computation of proximity operators, accomplishes this important task in a seamless way.