Open Journal of Mathematical Optimization

Piecewise Linear-Quadratic (PLQ) penalties are widely used to develop models in statistical inference, signal processing, and machine learning. Common examples of PLQ penalties include least squares, Huber, Vapnik, 1-norm, and their asymmetric generalizations. Properties of these estimators depend on the choice of penalty and its shape parameters, such as degree of asymmetry for the quantile loss, and transition point between linear and quadratic pieces for the Huber function. In this paper, we develop a statistical framework that can help the modeler to automatically tune shape parameters once the shape of the penalty has been chosen. The choice of the parameter is informed by the basic notion that each PLQ penalty should correspond to a true statistical density. The normalization constant inherent in this requirement helps to inform the optimization over shape parameters, giving a joint optimization problem over these as well as primary parameters of interest. A second contribution is to consider optimization methods for these joint problems. We show that basic first-order methods can be immediately brought to bear, and design specialized extensions of interior point (IP) methods for PLQ problems that can quickly and efficiently solve the joint problem. Synthetic problems and larger-scale practical examples illustrate the utility of the approach. Code for the new IP method is implemented using the Julia language ( https://github.com/UW-AMO/shape-parameters/tree/ojmo ).

Digital Object Identifier: 10.5802/ojmo.10

Acknowledgments. The authors acknowledge the Washington Research Foundation Data Science Professorship.


Introduction
Piecewise Linear-Quadratic (PLQ) functions and their superclass, Quadratic Support functions [1], form a rich class of penalties that are used for a range of applications. The choice of penalty plays a key role in the properties of the final estimate. Using the Huber penalty (Figure 1(a)) rather than a quadratic to measure data misfit is a typical approach to make an estimate robust to outliers [17]. The 1-norm applied directly to parameters makes the final solution sparse, and is used in Lasso regression [29] and compressed sensing [15]. The asymmetric quantile loss (see Figure 1(b)) is used for many regression applications [20]. The correspondence of the shape of a PLQ function to its role in estimation is well-understood. For example, it is the linear tails of the Huber penalty that limit the effects of large residuals on the final estimate when used as a misfit, and it is the nonsmooth behavior at the origin of the 1-norm that promotes sparse solutions when used as a regularizer. Likewise, it is the asymmetry of the quantile loss that makes it useful for financial applications, where loss and gain are treated asymmetrically.
Here, we consider the choice of parameters that fully specify this shape. At what value should the Huber transition between quadratic and linear? How asymmetric do we need the quantile loss to be? We focus on questions that can be answered once we have model estimates and corresponding residuals in hand, but seem as difficult a priori as any other parameter selection problem. Unknown parameters in regression and inverse problems are usually estimated using cross-validation or grid search. Standard methods require multiple solutions of any given learning problem, where a held-out dataset is used to evaluate each configuration [5]. More recently, Bayesian optimization [9,18,19,28] and random search [10,22] have come to the forefront as two popular techniques used to obtain meta-parameters in a very wide range of contexts. While these techniques can be used for the problems we consider, as well as much more general problems, their implementation is more expensive than solving a single problem instance (cross-validation, random search, and Bayesian optimization all require many instance evaluations).
Here we take a different tack, and build on the statistical perspective developed in [1] to develop a simple modification of the PLQ estimation problem to infer shape parameters simultaneously with variables of interest. The most relevant works related to this paper focus on the relation between the quantile penalty and the asymmetric Laplace distribution (ALD) [8,30,33]. In particular, [8] jointly estimates the model and shape parameters for the quantile penalty, while [30] infers the joint posterior distribution of these parameters.
To explain the general idea, we look at two simple examples.

Classic variance estimation in linear regression
As a warm-up, we consider the classic problem of estimating regression variables jointly with the noise variance parameter in a linear Gaussian model. Take the basic statistical model

y = Ax + ε,  ε_i ∼ N(0, σ²),  i = 1, …, n.  (1)

We can find the maximum likelihood estimator for (x, σ²) by maximizing the likelihood

p(y | x, σ²) = (2πσ²)^(−n/2) exp(−‖y − Ax‖² / (2σ²)),

or equivalently by minimizing the negative log of this expression:

(n/2) log(2πσ²) + ‖y − Ax‖² / (2σ²).  (2)

In this simple case, the problem separates, and regardless of σ² we have x̂ = argmin_x ‖y − Ax‖². Minimizing (2) with respect to σ², we also get σ̂² = ‖y − Ax̂‖²/n, the average of the squared empirical residuals. The main point here is that the likelihood arising from the statistical model (1) enables us to estimate σ². The idea becomes much more interesting when parameters are coupled. For example, in pharmacokinetic modeling, several different datasets with unknown scalar variance parameters may inform a shared parameter vector x [7]. Another example, where the shape parameter affects the estimate of x, is presented in the next section.

Figure 2. Left panel: regression from data with asymmetric errors. Data and true model are shown in black; the linear regression estimate in black dash, and the new auto-tuned quantile estimate in red dash. Right panel: graph of log(1/τ + 1/(1 − τ)) in (4). This term, derived in Section 3, acts as a barrier pushing the quantile estimator into (0, 1).
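The separable MLE above can be sketched numerically; the problem sizes and values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance: 200 observations, 3 regression variables.
n, p = 200, 3
A = rng.standard_normal((n, p))
x_true = np.array([1.0, -2.0, 0.5])
sigma_true = 0.3
y = A @ x_true + sigma_true * rng.standard_normal(n)

# The joint MLE separates: x_hat is the ordinary least squares solution,
# and sigma2_hat is the mean of the squared residuals.
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
sigma2_hat = np.mean((y - A @ x_hat) ** 2)

print(x_hat, sigma2_hat)
```

The estimate σ̂² converges to the true σ² = 0.09 as n grows.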

Simultaneous quantile estimation and regression.
Consider Figure 2, where linear observations have been contaminated with asymmetric errors, i.e. they are more likely to be positive than negative. The data generating mechanism is shown in black. The linear regression model for the data {y_i, a_i} is equivalent to the least squares problem min_x ‖Ax − y‖², where x contains the slope and intercept. As expected, classic regression fails to capture the true mechanism because the errors are asymmetric; see the black dash in Figure 2.
In this case, we can actually recover the "true" line by assuming the errors arise from an asymmetric generalization of the Laplace distribution:

p(r; τ) ∝ exp(−ρ_τ(r)),  (3)

where ρ_τ is the penalty shown in Figure 1(b), with τ ∈ [0, 1] controlling the relative slopes to the left and right of the origin. Solving for a joint maximum likelihood estimator of (x, τ), we get the fit shown in red dash in Figure 2. The estimator is derived by considering the density corresponding to (3), and has the simple expression

min_{x, τ}  Σ_{i=1}^m ρ_τ(y_i − ⟨a_i, x⟩) + m log(1/τ + 1/(1 − τ)),  (4)

where the last term arises from the normalization constant required to make exp(−ρ_τ( · )) a true statistical density, and is derived in Section 3, where the general approach is developed. In this simple example, the right solution is found by solving the single problem (4) once, rather than considering multiple problem instances as we would expect from classic approaches. The optimization problem itself is nonsmooth and nonconvex but, as we will see, straightforward to solve. The main thrust of the paper is to develop a systematic way of formulating estimators such as (4), understand their properties, and consider different algorithms for solving these problems.
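With residuals held fixed, the τ subproblem of (4) can be sketched numerically; the sampler and grid search below are illustrative assumptions, not the paper's solver:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantile_loss(r, tau):
    # Pinball loss: slope tau for positive residuals, 1 - tau for negative.
    return np.where(r >= 0, tau * r, (tau - 1) * r)

# Draw residuals from the asymmetric Laplace density exp(-rho_tau(r)) / n_c(tau):
# a positive Exp(tau) branch with probability 1 - tau, and a negative
# Exp(1 - tau) branch with probability tau.
tau_true, m = 0.2, 2000
pos = rng.random(m) < (1 - tau_true)
r = np.where(pos,
             rng.exponential(1 / tau_true, m),
             -rng.exponential(1 / (1 - tau_true), m))

# Objective (4) with residuals held fixed: sum of losses plus m * log n_c(tau),
# where n_c(tau) = 1/tau + 1/(1 - tau) acts as a barrier on (0, 1).
taus = np.linspace(0.01, 0.99, 197)
obj = [quantile_loss(r, t).sum() + m * np.log(1 / t + 1 / (1 - t)) for t in taus]
tau_hat = taus[int(np.argmin(obj))]
print(tau_hat)
```

The minimizer lands near the generating value τ = 0.2, illustrating how the barrier term balances the data term.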

Contributions and Roadmap
Building on the toolkit of PLQ penalties and corresponding densities developed in [1], we use the bridge between penalties and densities to develop an extended likelihood formulation over both shape parameters θ and regression variables x. The key idea is encoded in the normalization constant, which arises from the statistical interpretation and essentially acts as a balance against the data, ensuring that the final shape parameters still yield a statistically valid model. We consider properties of the extended likelihood formulations over (x, θ). In many cases, we show that standard techniques (including proximal alternating minimization and variable projection) can be applied. We also build on the interior point method of [1] to develop a new extended interior point (IP) approach tailored to these joint estimators. Code for the new IP method is made available in the Julia language.
The paper proceeds as follows. In Section 2 we review PLQ functions, along with their conjugate representations, and examples. Conjugate representations are essential for solving PLQ models using interior point methods. In Section 3, we focus on the statistical interpretation, and derive the normalization constant needed to create joint estimators such as (4). In Section 4, we consider optimization methods that can be applied to the new class of estimators, and derive a new customized interior point method for the class. In Section 5 we provide synthetic verifications that show we can recover shape parameters and regression variables in simple examples. We empirically show that the new joint estimation approach is consistent, and performs as well as simple least squares estimation when errors are Gaussian. Section 6 contains some applications of these ideas to real datasets, particularly focusing on self-tuning penalties for robust PCA.

PLQ Functions and Their Conjugate Representations
Piecewise linear-quadratic (PLQ) functions [27,Definition 10.20] are convex functions whose domain can be represented as the union of finitely many polyhedral sets, relative to each of which the function can be written as a convex quadratic. These functions have a convenient representation using their convex conjugates.
Definition 1 (Convex conjugate). The convex conjugate of a function f is given by

f*(u) = sup_x { ⟨u, x⟩ − f(x) }.

A PLQ function is defined as the conjugate of a quadratic over a polyhedral set [27, Example 11.18]. Examples of common penalties used in statistical modeling, machine learning, and inverse problems can all be written using simple conjugates, as shown in Table 1. Quadratic support (QS) functions [1] generalize PLQ functions by removing the polyhedral restriction on the set. Here we focus on PLQ examples, since they are computationally tractable. We can explicitly define PLQ functions using their conjugate representation.

Definition 2 (PLQ Functions and penalties). A piecewise linear-quadratic (PLQ) function is given by

ρ(r) = sup_{u ∈ U} { ⟨u, b̄ + Br⟩ − (1/2) uᵀMu },  (5)

where M ⪰ 0, B is an injective linear map, and U := {u : Cᵀu ≤ c̄} is a polyhedral set. When 0 ∈ U, it follows immediately that ρ must be non-negative; in this case we call it a penalty.
PLQ functions are closed under sums and affine compositions [1], and have a representation calculus that makes it easy to compute conjugates for a PLQ problem that combines several terms, such as measurement models and regularizers.
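A quick numerical check of Definition 2: the conjugate representation with M = 1, B = 1, b̄ = 0, and U = [−κ, κ] should reproduce the closed-form Huber penalty (a brute-force grid over U stands in for the exact sup):

```python
import numpy as np

def plq_value(r, M, B, b, u_grid):
    # Evaluate rho(r) = sup_{u in U} <u, b + B r> - 0.5 u^T M u
    # by brute force over a grid covering the (one-dimensional) set U.
    vals = u_grid * (b + B * r) - 0.5 * M * u_grid ** 2
    return vals.max()

# Huber with threshold kappa: M = 1, B = 1, b = 0, U = [-kappa, kappa].
kappa = 1.0
u_grid = np.linspace(-kappa, kappa, 2001)

def huber(r, kappa):
    return r ** 2 / 2 if abs(r) <= kappa else kappa * abs(r) - kappa ** 2 / 2

for r in [-3.0, -0.5, 0.2, 2.5]:
    assert abs(plq_value(r, 1.0, 1.0, 0.0, u_grid) - huber(r, kappa)) < 1e-3
print("conjugate representation matches closed-form Huber")
```

The quadratic region corresponds to the sup being attained in the interior of U, and the linear tails to attainment on the boundary.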
A statistical interpretation of PLQs requires the notion of coercivity.

Definition 3 (Coercivity). A function f : Rⁿ → R ∪ {∞} is coercive if f(x) → ∞ whenever ‖x‖ → ∞.
Coercivity of PLQs can be expressed in terms of their conjugate representations, and this topic is addressed by [1,Theorem 10]. In the next section, we use coercive PLQs to define corresponding densities, which enable the joint likelihood approach developed in this paper.

Statistical Model and Properties of Joint Objective
To formulate a learning problem over spaces of penalties, we first review the relationship between penalties and corresponding residual distributions. We then use this relationship to develop a joint formulation for model and shape parameter inference, and characterize properties of the resulting objective function.

The relationship between penalties and associated densities can be made precise. Every coercive PLQ penalty can be viewed as the negative logarithm of an integrable density. Specifically, given a coercive penalty ρ(r; θ) whose shape is parametrized by θ, we define an associated density

p(r; θ) = (1/n_c(θ)) exp(−ρ(r; θ)),  n_c(θ) = ∫ exp(−ρ(r; θ)) dr.  (6)

The term n_c(θ) is a normalization constant that ensures that p(r; θ) in (6) is a true statistical density, i.e. it integrates to 1. These observations yield a simple recipe for extending a negative log likelihood in x (based on optimizing a PLQ penalty) to inform shape parameters θ. Specifically, given an optimization problem of the form

min_x  Σ_{i=1}^m ρ(y_i − ⟨a_i, x⟩; θ),

we consider the extended problem

min_{x, θ ∈ D}  Σ_{i=1}^m ρ(y_i − ⟨a_i, x⟩; θ) + m log[n_c(θ)],  (7)

where θ may be restricted to a domain D. In the least squares case, this idea reduces to the classic variance estimation example in Section 1.1. The quantile penalty example in Section 1.2 recovers the formulation studied by [8]. The general approach is a new way to estimate shape parameters of error distributions corresponding to PLQ penalties.

The quantile regression problem in Section 1.2 yields a closed form for n_c(τ). We have D = [0, 1] and

n_c(τ) = 1/τ + 1/(1 − τ).

The log(n_c) term acts as a barrier (see Figure 2), pushing τ away from the boundary points 0 and 1 into the interior (0, 1), and favoring τ = 0.5. From an algorithm design perspective, the log of the normalization constant for the quantile regression problem is a smooth and strongly convex function of the shape parameter τ, but its gradient is not Lipschitz continuous. In the remainder of this section, we characterize theoretical properties of the general objective (7).
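The normalization constant can be checked numerically; this sketch verifies the closed form n_c(τ) = 1/τ + 1/(1 − τ) for the quantile penalty by quadrature:

```python
import numpy as np
from scipy.integrate import quad

def rho_tau(r, tau):
    # Quantile (pinball) penalty: slope tau to the right, 1 - tau to the left.
    return tau * r if r >= 0 else (tau - 1) * r

for tau in [0.1, 0.3, 0.5, 0.9]:
    f = lambda r: np.exp(-rho_tau(r, tau))
    # Split at the kink r = 0 so each quadrature piece is smooth.
    nc_numeric = quad(f, -np.inf, 0)[0] + quad(f, 0, np.inf)[0]
    nc_closed = 1 / tau + 1 / (1 - tau)
    assert abs(nc_numeric - nc_closed) < 1e-6
print("n_c(tau) matches 1/tau + 1/(1 - tau)")
```

The same quadrature recipe applies to any coercive PLQ penalty whose normalization constant lacks a closed form.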

Assumption 4.
To ensure the validity of the statistical viewpoint, we require ρ to satisfy:
1. ρ(r; θ) ≥ 0 is a PLQ penalty for every θ ∈ D, so we have non-negativity.
2. ρ( · ; θ) is coercive for every θ ∈ D, so we have integrability.
3. For any θ₀ ∈ D, ρ(r; θ) is C² around θ₀ for almost every r ∈ R (smoothness in θ).
Under these assumptions, we can obtain formulas for the first and second derivatives of n_c(θ).
Theorem 5 (smoothness of n_c(θ)). For n_c(θ) in (6), suppose Assumption 4 holds and that for θ₀ ∈ D there exist integrable functions g_k(r), k = 1, 2, such that, for all θ in a neighborhood of θ₀ and almost every r,

‖∇_θ exp[−ρ(r; θ)]‖ ≤ g₁(r),  ‖∇²_θ exp[−ρ(r; θ)]‖ ≤ g₂(r).

Then n_c is twice differentiable at θ₀, with

∇n_c(θ₀) = ∫ ∇_θ exp[−ρ(r; θ₀)] dr,  ∇²n_c(θ₀) = ∫ ∇²_θ exp[−ρ(r; θ₀)] dr.  (8)

The proof is elementary, and is included below for completeness.
Proof. From Assumption 4, we know that for any θ₀ ∈ D, ∇_θ exp[−ρ(r; θ₀)] and ∇²_θ exp[−ρ(r; θ₀)] exist for almost every r ∈ R. Take any h with ‖h‖ small enough that θ₀ + h stays in the neighborhood of θ₀. Applying the mean value theorem, we have

n_c(θ₀ + h) − n_c(θ₀) = ∫ (exp[−ρ(r; θ₀ + h)] − exp[−ρ(r; θ₀)]) dr = ∫ ⟨∇_θ exp[−ρ(r; θ̄)], h⟩ dr,

where θ̄ lies in the segment with end points θ₀ and θ₀ + h. The first and third assumptions allow us to apply the dominated convergence theorem to get

lim_{α→0⁺} (n_c(θ₀ + αv) − n_c(θ₀))/α = ∫ ⟨∇_θ exp[−ρ(r; θ₀)], v⟩ dr,

where we set h = αv, let α → 0⁺, and keep v fixed as a unit vector. From the definition of the gradient we conclude that ∇n_c(θ₀) = ∫ ∇_θ exp[−ρ(r; θ₀)] dr. Following the same steps, we can show that ∇²n_c(θ₀) exists and satisfies the second formula in (8).

The derivative formulas (8) are used by first- and second-order methods to infer x and θ. The parametrization conditions in θ are satisfied by all piecewise linear-quadratic (PLQ) penalties. The theorem can also be applied to densities that are not log-concave. For instance, the Student's t density and associated penalty satisfy all assumptions of Theorem 5.
In the quantile case (4), the term log[n_c(θ)] is convex. We characterize sufficient conditions for convexity of log[n_c(θ)] for a general class of penalties ρ. The results can be derived using [12, Chapter 3.5].

Theorem 6 (convexity of log[n_c(θ)]). Let n_c(θ) be defined as in Theorem 5, and suppose Assumption 4 holds. We have the following results:
1. If ρ(r; θ) is jointly convex in r and θ, then log[n_c(θ)] is a concave function of θ.
2. If ρ(r; θ) is concave with respect to θ for every r, then log[n_c(θ)] is a convex function.
Theorems 5 and 6 tell an interesting story. The log-normalization constant log[n_c(θ)] is nearly always smooth, even when the loss ρ is nonsmooth in x. The inference problem (7) is never guaranteed to be jointly convex in (x, θ): if ρ(x; θ) is jointly convex, then log[n_c(θ)] will be concave. This is intuitive, as we are attempting to learn both the model and the error structure at the same time. Objective (7) is generally nonsmooth and nonconvex. In the next section, we show how to optimize it using first- and second-order methods.

Figure 4. Level sets of the value function v(θ) (9) for the quantile Huber model. The black star is the maximum likelihood estimator, while the red dot represents the true parameters in the simulation.

Level sets of the objective function and maximum likelihood estimate
Although (7) is non-convex, in many cases we can still find the global minimum in θ. To illustrate, we generate samples ε_i from the distribution defined by the quantile Huber function with κ = 1 and τ = 0.05, select a set of weights x, generate data y_i = ⟨a_i, x⟩ + ε_i, and plot the so-called value function for (7):

v(θ) = min_x  Σ_{i=1}^m ρ(y_i − ⟨a_i, x⟩; θ) + m log[n_c(θ)].  (9)

Evaluating v(θ) requires solving a smooth convex problem in x, and thus problem (7) reduces to minimizing (9) in θ. Results for the simple 2D case are shown in Figure 4. v(θ) is non-convex (note the level sets), but there is a unique minimum that is close to the true parameters, and it was found in every run by a local search. In the next section, we design methods that are far more efficient than directly optimizing v(θ).
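A minimal illustration of the value function: for an intercept-only model with the pure quantile penalty (a simplifying assumption; the paper's experiment uses quantile Huber with a 2-D θ), the inner minimizer is a sample quantile, so v can be evaluated on a grid:

```python
import numpy as np

rng = np.random.default_rng(2)

def pinball(r, tau):
    return np.where(r >= 0, tau * r, (tau - 1) * r)

# Intercept-only data with asymmetric Laplace errors (tau_true = 0.1):
tau_true, m, c = 0.1, 3000, 5.0
pos = rng.random(m) < (1 - tau_true)
eps = np.where(pos, rng.exponential(1 / tau_true, m),
               -rng.exponential(1 / (1 - tau_true), m))
y = c + eps

# For fixed tau, the inner minimizer of sum_i pinball(y_i - x, tau) is the
# tau-quantile of y, so the value function v(tau) is cheap to evaluate.
taus = np.linspace(0.02, 0.98, 97)
v = [pinball(y - np.quantile(y, t), t).sum() + m * np.log(1 / t + 1 / (1 - t))
     for t in taus]
tau_hat = taus[int(np.argmin(v))]
x_hat = np.quantile(y, tau_hat)
print(tau_hat, x_hat)
```

The grid minimizer of v recovers both the asymmetry level and the intercept, mirroring the level-set experiment.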

Optimization Methods for Joint PLQ and Shape Parameter Estimation.
In this section, we discuss algorithms for (7), and develop a new interior point method building on the PLQ optimization strategy of [1]. When ρ is smooth in x and θ, we show how to apply Proximal Alternating Linearized Minimization (PALM) [11] and Proximal Alternating Minimization (PAM) [6]. We also discuss variable projection (VP) type algorithms [3,4]. All three algorithms can be applied to the joint estimation problem, as long as we carefully consider the problem structure. PAM has a disadvantage compared to the other two, since it requires a fast solver for a regularized PLQ problem. Such a solver was developed in [1], but in this case we might as well use the new IP method developed in Section 4.2. In contrast, PALM can be readily implemented, as long as we have smoothness in the term that couples x and θ; we use PALM for the large-scale experiments in Section 6. One catch is that the log normalization constant log[n_c(θ)] must be treated carefully, as its gradient does not have a global Lipschitz constant. The VP variant we consider here is also easy to implement, since it focuses on projecting out θ rather than x. In Section 4.2 we extend the interior point methods developed in [1] to problem (7). These methods can simultaneously fit PLQ problems (without e.g. smoothness requirements) and estimate their shape parameters. The reader will recall that the quantile regression example in Section 1.2 uses a nonsmooth PLQ penalty ρ. The extended IP approach can solve nonconvex problems with primal, conjugate, dual, and shape parameters together by directly working with the associated system of optimality conditions. We have released the implementation of the method in the Julia language.

PALM, PAM, and Variable Projection
Several algorithms from the existing literature can be brought to bear on problem (7) when it is of the form

min_{x, θ}  H(x, θ) + r₁(x) + r₂(θ),  (10)

under some mild assumptions on H. Here we briefly review the PALM [11] and PAM [6] algorithms, as well as similar algorithms that partially minimize the objective in θ, known as partial minimization or variable projection [3,4]. The PALM and PAM algorithms can be used to minimize (10) when H is C¹ with globally Lipschitz partial gradients, while the functions r₁ and r₂ are proper and lower semicontinuous; in particular, they are not required to be convex, finite-valued, or smooth. The term r₁(x) is useful if we need simple regularization and constraints on x, such as sparsity or non-negativity. Even though log[n_c(θ)] is smooth (see Theorem 5), it must be relegated to r₂(θ), since otherwise it violates the Lipschitz assumptions on the partial gradients of H. Therefore, to apply PALM or PAM to (7), we take

H(x, θ) = Σ_{i=1}^m ρ(y_i − ⟨a_i, x⟩; θ),  r₁(x) = 0,  r₂(θ) = m log[n_c(θ)] + δ_D(θ).

Here δ_D is the indicator function for the set D, which is zero on D and +∞ off it. The PALM algorithm is detailed in Algorithm 1. The steps c_k and d_k are obtained from Lipschitz constants of the (partial) gradients of H. The PAM algorithm is given in Algorithm 2, with the same step sizes to facilitate an easier comparison. PAM essentially requires the ability to partially minimize a quadratically regularized H with respect to both x and θ. In many cases, one of these variables may be easier to solve for than the other; in Example 3.1, θ has dimension 2. Variable projection algorithms [3,4] exploit the ability to fully minimize in one variable while applying an iterative approach in the other. Algorithm 3 shows an example where we have chosen to optimize out θ for every iteration in x.
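The PALM update structure can be sketched on a toy coupled problem; the H, r₁, and r₂ below are illustrative stand-ins, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy instance of min_{x,theta} H(x,theta) + r1(x) + r2(theta) with
#   H = 0.5 * ||A x - theta * b||^2,  r1 = lam * ||x||_1,
#   r2 = indicator of the box [0.5, 2] (its prox is a projection).
m, n = 30, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
lam = 0.1

def soft(v, t):                      # prox of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0)

x, theta = np.zeros(n), 1.0
c = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad_x H
d = np.linalg.norm(b) ** 2           # Lipschitz constant of grad_theta H
for _ in range(500):
    # x block: forward (gradient) step on H, backward (prox) step on r1
    x = soft(x - (A.T @ (A @ x - theta * b)) / c, lam / c)
    # theta block: gradient step on H, projection onto the box for r2
    theta = np.clip(theta + (b @ (A @ x - theta * b)) / d, 0.5, 2.0)
print(theta, np.linalg.norm(A @ x - theta * b))
```

Each block update is a proximal-gradient step with step size set by the partial Lipschitz constant, matching the structure of Algorithm 1.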
The PALM algorithm works well for large-scale shape inference problems with smooth ρ. We use it for the self-tuning RPCA experiments in Section 6.

Interior Point method for nonsmoothly coupled nonconvex joint QS inference
In this section, we use conjugate representations of PLQ penalties to develop an interior point method for the extended inference problem (7). The approach converges superlinearly in practice, but each iteration requires solving a linear system. We solve these systems directly; scaling issues are explored in Table 3. In practice, large-scale linear systems are solved iteratively, often using preconditioners [25]; we leave these developments for future work.
To optimize PLQ penalties parametrized by θ, we allow b̄ and c̄ in (5) to be affine functions of θ, and assume D is also polyhedral; in particular,

c̄(θ) = Hᵀθ + c,  D = {θ : Sᵀθ ≤ s}.

We then seek a saddle point (13) over the primal variables x, conjugate variables u, and shape parameters θ. For example, the self-tuning quantile penalty (4) admits such a representation.

Algorithm 4 Interior point method for QS with θ estimation
Require: A, y, B, b, C

Interior point (IP) methods apply damped Newton to a relaxation of the optimality conditions of (13); see [21,24,32]. Let F(z) denote the optimality conditions for problem (13), with z = (x, u, θ, λ), where λ are dual variables for the inequality constraints Cᵀu ≤ Hᵀθ + c and Sᵀθ ≤ s, and let F_μ(z) denote the relaxed system obtained by using a logarithmic barrier with parameter μ. The IP method is summarized in Algorithm 4.
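The damped Newton iteration on the relaxed system F_μ can be illustrated on a hypothetical one-dimensional problem; this is a sketch of the μ-decreasing scheme, not the paper's full KKT system (16):

```python
import numpy as np

# Toy problem:  min 0.5 * (x - 2)^2  subject to  x <= 1.
# Optimality conditions: x - 2 + lam = 0,  s = 1 - x,  lam * s = mu,
# with lam, s > 0 and mu relaxed toward 0.
x, lam, s = 0.0, 1.0, 1.0
for _ in range(30):
    mu = 0.1 * lam * s                      # fraction of current complementarity
    F = np.array([x - 2 + lam, 1 - x - s, lam * s - mu])
    J = np.array([[1.0, 1.0, 0.0],          # Jacobian of the relaxed system
                  [-1.0, 0.0, -1.0],
                  [0.0, s, lam]])
    dx, dlam, ds = np.linalg.solve(J, -F)
    # Damped step keeping lam and s strictly positive.
    alpha = 1.0
    while lam + alpha * dlam <= 0 or s + alpha * ds <= 0:
        alpha *= 0.5
    x, lam, s = x + alpha * dx, lam + alpha * dlam, s + alpha * ds
print(x, lam)
```

The iterates converge rapidly to the constrained solution x = 1 with multiplier λ = 1, illustrating why the full method needs few (but expensive) iterations.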
We introduce log-barriers for the conjugate variables u and the shape parameters θ.
As μ ↓ 0, the barriers approach true indicator functions for the sets D and U. The parameter μ is decreased to a specified tolerance as the optimization proceeds. For fixed μ, the approximate subproblem can itself be written as a saddle point system. Introducing dual variables q and slack variables d, where the Diag operator acting on a vector v returns a diagonal matrix with diagonal v, we form the resulting KKT system (16) and its Jacobian matrix F′_μ. Algorithm 4 is a damped Newton iteration for finding a stationary point of F_μ. At each iteration, μ is taken to be a fraction of the current average of the complementarity products, just as in the implementations used in [1,2]. We can apply block Gaussian elimination to obtain conditions that make the algorithm implementable. We state the result as a simple theorem.

Theorem 7 (IP implementability). Suppose that the PLQ penalty is nondegenerate, that is, null(C) ∩ null(M) = {0}, and that the linear model A is full-rank. Then the interior point iteration is implementable whenever a certain square symmetric matrix T₃ in (17) is invertible; T₃ has dimension equal to the length of the parameter vector θ.
Proof. Implementing the IP iteration is equivalent to applying block Gaussian elimination to the system (16). The invertibility of D is enforced by Algorithm 4, which keeps the slack variables d strictly positive. Invertibility of T₁ is guaranteed by the nondegeneracy hypothesis for the PLQ; see [1, Theorem 14]. Since B must be injective (see Definition 2), the full rank condition on A guarantees that T₂ is invertible. The row operations yield a block upper triangular matrix, which is invertible if and only if T₃ in (17) is invertible, given the other hypotheses.
The full rank condition on A may seem restrictive. However, in settings with rank deficient systems (e.g. in variable selection problems), the matrix A is naturally augmented to include the (PLQ) regularizer. For example, in the Lasso, the 1-norm of x is part of the objective, and so the PLQ linear system stacks the measurement matrix with two copies of the identity [1, Remark 5]. Thus, a rank-deficient A in the PLQ construction always points to an ill-posed statistical model.
In the next section, we present synthetic examples that show the applicability of the approach for inferring simple shape parameters.

Synthetic Data Experiments
We illustrate the proposed approach using a linear regression example. We work in a simple regression setting, focusing on the quantile Huber family (Figure 1). Where appropriate, we make comparisons with least squares and least absolute deviation (LAD) formulations. Least squares estimates have a closed form solution, while we compute LAD estimates using Convex.jl. In Section 5.1 we discuss the parametrization of the quantile Huber family. In Section 5.2 we study consistency of the estimates for x in the Gaussian regime, as well as for errors generated from a particular quantile Huber distribution. In particular, we compare our estimates, as the number of observations increases, with the least squares and LAD estimates. In Section 5.3, we consider a moderate number of observations, and compare the performance of the quantile Huber approach to those of least squares and LAD across a range of different quantile Huber parameters. In Section 5.4 we compare the performance of PALM to that of the interior point method for the quantile Huber problem. Finally, in Section 5.5 we compare the new approach to a grid search, a simple generic method applicable to the problem but unaware of the special problem structure.

Parametrization of the quantile Huber family
The quantile Huber family is parametrized by τ, which controls the slopes of the penalty, and κ, the robustness threshold. We want to fit the regression model x as well as obtain the correct parameters τ and κ. When κ > 0, the quantile Huber penalty ρ(x; θ) is smooth, and we can use the PALM algorithm from Section 4.1. The quantile Huber penalty is PLQ, so we can also apply the proposed IP method from Section 4.2.
The primal form of the quantile Huber penalty is

ρ(r; [τ, κ]) =
  τ(r − τκ/2),               r > τκ,
  r²/(2κ),                   −(1 − τ)κ ≤ r ≤ τκ,
  (1 − τ)(−r − (1 − τ)κ/2),  r < −(1 − τ)κ.

We must choose a parametrization in terms of θ. One option is to take θ = [τ, κ]ᵀ. This parametrization could not be solved using the interior point approach, since the shape parameters would not be affine functions of θ. It would also require a modification of the PALM algorithm, since the θ block as given would not have a global Lipschitz constant. An interesting insight here is that we can find an affine reparametrization of the shape parameters which allows the problem to be solved using the interior point method we developed, and which satisfies the assumptions of the basic PALM algorithm. Specifically, we can take θ₁ = τκ, θ₂ = (1 − τ)κ, the transition points of the penalty. These new parameters must be non-negative. The primal objective can then be written as

min_{x, θ ≥ 0}  Σ_{i=1}^m ρ(⟨a_i, x⟩ − y_i; θ) + m log[n_c(θ)],

where the a_i ∈ Rⁿ are random Gaussian vectors, x ∈ Rⁿ is the model parameter vector, and y ∈ Rᵐ is the observed data vector contaminated by outliers. From Theorem 5, n_c(θ) is C² smooth. From Theorem 6, the objective in θ is the sum of a concave term ρ(Ax − y; θ) and a convex term m log[n_c(θ)]. The joint problem in (x, θ) is nonconvex, but both first- and second-order methods from Section 4 can be applied.
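The penalty can be implemented directly; the piecewise form below is the standard quantile Huber (an assumption for illustration), quadratic on [−(1 − τ)κ, τκ] with linear tails of slopes τ and 1 − τ:

```python
import numpy as np

def quantile_huber(r, tau, kappa):
    # Quadratic r^2 / (2*kappa) on [-(1-tau)*kappa, tau*kappa];
    # linear tails with slopes tau (right) and 1 - tau (left),
    # offset so the function is continuous at the transition points.
    r = np.asarray(r, dtype=float)
    right = r > tau * kappa
    left = r < -(1 - tau) * kappa
    out = r ** 2 / (2 * kappa)
    out = np.where(right, tau * (r - tau * kappa / 2), out)
    out = np.where(left, (1 - tau) * (-r - (1 - tau) * kappa / 2), out)
    return out

print(quantile_huber(np.array([-3.0, 0.0, 3.0]), 0.3, 1.0))
```

Continuity at the two transition points is what makes the θ block smooth, which is the property PALM needs.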

Consistency
In this section, we test the performance of the estimation framework as the number of observations increases, for both Gaussian and quantile Huber errors. In both regimes, we want to estimate the regression parameters x more accurately as the number of observations increases. For the Gaussian case, the idea is to check that performance is reasonable, ensuring the framework is applicable in standard cases, i.e. in the absence of outliers or asymmetry. For quantile Huber, we want to know that we indeed recover all parameters as we see more observations. We consider x ∈ Rⁿ for n = 50, and evaluate the performance of the method for m ∈ {1000, 1500, …, 4500, 5000}. Errors in all experiments in this regime are generated from the N(0, 1) distribution.

Figure 5. Box plots summarize results across 10 realizations of each experiment. The quality of quantile Huber results is identical to that of the least squares solution, which is optimal in this nominal case. LAD solutions are comparable but always worse in terms of relative error.
We look at the relative error in the inferred x across 10 realizations for each experiment, and plot this relative error as a function of the number of observations m in Figure 5. The quality of the quantile Huber solution is on par with the least squares solution, which is optimal in this case. The LAD solution is worse in comparison to the other approaches, but also improves with additional observations, as expected.
In the nominal case, the inferred κ parameter pushes all residuals into the "inlier" region. To show this, we plot the minimum quadratic region against the 95% confidence interval for the true residuals in Figure 6. The minimal quadratic region discovered across all problem realizations is above 3, which means all of the residuals fall into the quadratic region of the objective function. This result also means that the values of τ are irrelevant, and uninformed by the estimated residuals. This is evident from the model results; while the τ estimates still center at 0.5, they vary widely regardless of the available measurements, with a 95% interval from 0.2 to 0.7 across the experiments.

For the quantile Huber experiment, we generate asymmetric errors from the quantile Huber distribution with parameters τ* = 0.1 and κ* = 1.0. Just as in the Gaussian case, we plot the relative error in the recovery of x across 10 realizations for each experiment. The results are shown in Figure 7, and the inferred shape parameters, compared to ground truth, are shown in Figure 8. As the number of observations increases, the ability to infer shape parameters also improves. The quality of the quantile Huber results is far superior to those of both the least squares and LAD solutions across all observation counts. The least squares solution is particularly vulnerable to outliers, but the LAD solution is also affected by the asymmetry of the errors.

Shape-Optimized Quantile Huber vs. Least Squares and LAD
In this section, we stay with problem size n = 50 and m = 1000, and compare performance for errors generated from different quantile Huber distributions. The measurement errors are sampled from quantile Huber distributions to verify that the approach is able to recover "ground truth" values of the (τ, κ) parameters. We denote the ground truth parameters by x_t, τ_t, κ_t, while x*, τ*, κ* are the solutions obtained by solving (7). We provide two reference solutions: x_LS is the least squares solution, and x_M is the solution obtained by minimizing ‖Ax − b‖₁ using Convex.jl. For each κ and τ setting, we run the simulation 10 times, and show the average of the results in Table 2. Results shown are obtained by the IP method.

Table 2. Joint inference of the shape and model parameters for the quantile Huber loss. Columns 2 and 3 show results for the proposed method; r(x) = ‖x − x_t‖/‖x_t‖ denotes relative error. Column 2 contains τ, κ estimated using joint optimization (compare to the ground truth simulation parameters in column 1), solved by the proposed IP algorithm with KKT tolerance set to 10⁻⁶. Column 3 shows the relative error of the new estimate; compare to Columns 4 and 5, which are the relative errors of the LS and minimum 1-norm estimates.

PALM vs. IP for Penalty Optimization
We also compared the performance of PALM and IP over m ∈ {100, 500, 1000, 2000} and n ∈ {50, 100, 200, 500}, in terms of both iterations and run time. The results are shown in Figure 9 and Table 3. From Table 3, IP converges within 20 iterations independently of problem dimension. However, its total run time exceeds that of PALM as the problem size increases. The current interior point method is implemented by solving explicit linear systems, and is dominated by the cost of forming and solving them. For our problems, with x ∈ Rⁿ and y ∈ Rᵐ with m ≥ n, the cost of forming and solving T₁ (see Theorem 7) is O(mn² + n³), and it is updated in every iteration. In contrast, PALM uses linearization, so the cost of each iteration is proportional to that of a matrix-vector multiply, which is O(mn) in this case. Each iteration is thus faster, but PALM needs more of them. The balance between IP and PALM can be shifted in favor of IP by implementing iterative solvers with preconditioners; this would be particularly valuable for nonsmooth ρ, where PALM cannot be applied. However, this effort is outside the scope of the paper.

Figure 9. Convergence history, log(f − f*) vs. iterations, for PALM (green) and the interior point method (black). The experiment shown is for τ = 0.1, κ = 1. This behavior is representative; in particular, the proposed IP method converges in fewer than 20 iterations in all cases.

Table 4. Column 3 shows time (s)/iterations for gradient descent on a single instance of the quantile Huber problem. Columns 4 and 5 show the time/iterations for PALM on the joint problem over (x, θ) and the resulting estimates of (κ, τ). Columns 6 and 7 show the time and estimates obtained by a 5 × 5 grid search over (κ, τ). Penalty optimization finds good estimates very quickly relative to the baseline; grid search takes an order of magnitude more time.

Penalty Optimization vs. Grid Search
Finally, we compared the run time and shape-parameter estimates for the quantile Huber family against a grid search method. We construct a 5 × 5 grid over τ ∈ {0.1, 0.2, 0.5, 0.8, 0.9} and κ ∈ {0.1, 0.5, 1.0, 1.5, 1.9}, split the data into training and validation sets (80% and 20%), and pick the parameters that perform best on the validation data. We also provide the run time for a single instance evaluation with fixed θ (using gradient descent) as a baseline comparison. Results are shown in Table 4. From the table we can see that both methods find good estimates of the shape parameters. However, grid search clearly suffers from the "curse of dimensionality" even in two dimensions: it takes approximately 20 times longer to complete.
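A minimal sketch of this style of validation grid search follows (Python, illustrative: a symmetric Huber fit via iteratively reweighted least squares stands in for the paper's quantile Huber solver, a 1-D slice over κ stands in for the full 5 × 5 grid, and all data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 500, 20
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.1 * rng.standard_normal(m)
b[: m // 10] += 8.0                      # 10% gross outliers

def huber_fit(A, b, kappa, iters=50):
    """Huber regression via iteratively reweighted least squares (IRLS)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        w = np.minimum(1.0, kappa / np.maximum(np.abs(r), 1e-12))
        Aw = A * w[:, None]              # rows scaled by Huber weights
        x = np.linalg.solve(Aw.T @ A, Aw.T @ b)
    return x

# 80/20 train/validation split, grid over kappa
idx = rng.permutation(m)
tr, va = idx[: int(0.8 * m)], idx[int(0.8 * m):]
grid = [0.1, 0.5, 1.0, 1.5, 1.9]
scores = {}
for kappa in grid:
    x = huber_fit(A[tr], b[tr], kappa)
    # median absolute validation residual: robust to outliers in the split
    scores[kappa] = np.median(np.abs(A[va] @ x - b[va]))
best = min(scores, key=scores.get)
```

Every grid point requires a full solve, so the cost scales as the product of the per-parameter grid sizes (here 5 per dimension); joint penalty optimization replaces this with a single extended solve.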

Real Data Example
In this section, we consider large-scale examples, developing approaches for using "self-tuning" penalties for robust principal component analysis (RPCA). RPCA has applications to alignment of occluded images [26], scene triangulation [34], model selection [14], face recognition [31], and document indexing [13]. We develop a self-tuning background separation approach. Given a sequence of images [23], our goal is to separate the moving objects from the background. We pick 202 images from the data set, convert them to grey scale, and reshape them into the columns of a matrix Y ∈ R^{20480×202}. We model the data Y as the sum of a low-rank component L and sparse noise S; we expect the moving objects to be captured by S. We take advantage of the fact that RPCA is equivalent to regularized Huber regression [16],

    min_L  Σ_{i,j} ρ((Y − L)_{ij}; [κ, σ]) + λ‖L‖_*,

where ρ( · ; [κ, σ]) is the scaled Huber function given by

    ρ(r; [κ, σ]) = { κ|r|/σ − κ^2/2,   |r| > κσ,
                   { r^2/(2σ^2),       |r| ≤ κσ.
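The equivalence rests on the fact that the scaled Huber function is a Moreau envelope of a weighted absolute value: eliminating the sparse component entrywise, i.e., minimizing (r − s)^2/(2σ^2) + (κ/σ)|s| over s, yields exactly ρ(r; [κ, σ]). A quick numerical check of this identity (Python, with illustrative parameter values):

```python
import numpy as np

def scaled_huber(r, kappa, sigma):
    """Scaled Huber: quadratic for |r| <= kappa*sigma, linear tails beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) > kappa * sigma,
                    kappa * np.abs(r) / sigma - kappa**2 / 2,
                    r**2 / (2 * sigma**2))

def envelope(r, kappa, sigma):
    """min_s (r - s)^2 / (2 sigma^2) + (kappa / sigma) |s|;
    the minimizer is the soft-threshold of r at level kappa * sigma."""
    s = np.sign(r) * np.maximum(np.abs(r) - kappa * sigma, 0.0)
    return (r - s) ** 2 / (2 * sigma**2) + (kappa / sigma) * np.abs(s)

r = np.linspace(-5, 5, 1001)
assert np.allclose(scaled_huber(r, 1.2, 0.7), envelope(r, 1.2, 0.7))
```

The minimizing s plays the role of the sparse component S, which is why S can be eliminated from the RPCA objective, leaving a Huber regression in L alone.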
When the variance σ 2 of the observation noise is known, the problem reduces to Huber threshold estimation; when it is not known, it is an additional shape parameter we can estimate using the proposed framework. In the latter case, there is no simple parametrization that makes the PLQ representation an affine function of θ.
To obtain a more efficient numerical scheme, we model the low-rank component as L = U^T V, where U ∈ R^{k×m} and V ∈ R^{k×n}. Substituting this factorization into the regularized Huber regression problem, with the normalization term in (κ, σ) added as in the preceding sections, gives the self-tuning RPCA formulation. We solve the problem with PALM, obtaining the result in Figure 10(b). As the optimization proceeds, κ and σ decrease to 0 with a fixed ratio α = κ/σ; the self-tuning Huber thus becomes the scaled 1-norm, recovering the original RPCA formulation [13]. The result in Figure 10(b) is an improvement over the result with the initial (κ, σ) values, shown in Figure 10(a). As in Section 5, we compared the run time of solving the joint (self-tuning) problem with that of solving a single instance of RPCA. The results are shown in Table 5: the extended self-tuning problem takes only about twice the run time of a single instance. Cross-validation, grid search, and black-box optimization, by contrast, typically require many solutions of the original optimization problem to find good parameter values.
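The factored formulation can be illustrated on a small synthetic problem (Python; a hedged sketch with fixed (κ, σ) and alternating gradient steps on U and V, omitting the self-tuning terms — all sizes and constants here are illustrative, not the paper's video data or solver):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 60, 50, 2
U0 = rng.standard_normal((k, m))
V0 = rng.standard_normal((k, n))
L0 = U0.T @ V0                                    # low-rank "background"
S0 = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05
S0[mask] = 8.0 * rng.choice([-1.0, 1.0], mask.sum())  # sparse "moving objects"
Y = L0 + S0

kappa, sigma = 1.0, 1.0

def huber_grad(r):
    """Elementwise derivative of the scaled Huber function."""
    return np.clip(r / sigma**2, -kappa / sigma, kappa / sigma)

# Initialize (U, V) from a truncated SVD of Y, then refine by alternating
# gradient steps on f(U, V) = sum_ij rho((Y - U^T V)_ij; [kappa, sigma]).
P, s, Qt = np.linalg.svd(Y, full_matrices=False)
U = (P[:, :k] * np.sqrt(s[:k])).T
V = np.sqrt(s[:k])[:, None] * Qt[:k]
err_init = np.linalg.norm(U.T @ V - L0) / np.linalg.norm(L0)

for _ in range(500):
    G = huber_grad(Y - U.T @ V)                   # m x n residual gradient
    U += V @ G.T * sigma**2 / np.linalg.norm(V, 2) ** 2   # block step in U
    G = huber_grad(Y - U.T @ V)
    V += U @ G * sigma**2 / np.linalg.norm(U, 2) ** 2     # block step in V

err = np.linalg.norm(U.T @ V - L0) / np.linalg.norm(L0)
```

Because the Huber gradient is clipped at κ/σ, the sparse spikes exert only bounded influence, and the refined factorization recovers the low-rank component more accurately than the plain truncated SVD initialization.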

Discussion
In this paper we developed a simple approach that extends a regression problem with PLQ penalties to also infer unknown shape parameters. We used the statistical interpretation of the PLQ penalty to introduce an additional term that relates the shape parameters to a normalization constant. Many existing algorithms can be brought to bear on the extended problem. For smooth penalties, the PALM algorithm proved quite useful in both synthetic and real examples, particularly for large-scale problems. We also developed an interior point method for the PLQ class that uses conjugate representations of these penalties to solve an augmented saddle point system.
The approach offers several interesting avenues for future research. The nonconvex coupling between convex PLQ penalties and their shape parameters is interesting in its own right, and finding conditions under which the approach is guaranteed to work in general remains an open question. The maximum likelihood criterion itself is just one option, and may have limitations as the number of shape parameters grows relative to the amount of data. Characterizing these limitations and exploring alternatives (such as marginal likelihood) is left to future work.