A Hamilton-Jacobi-based Proximal Operator

First-order optimization algorithms are widely used today. Two standard building blocks in these algorithms are proximal operators (proximals) and gradients. Although gradients can be computed for a wide array of functions, explicit proximal formulas are only known for limited classes of functions. We provide an algorithm, HJ-Prox, for accurately approximating such proximals. It is derived from a collection of relations between proximals, Moreau envelopes, Hamilton-Jacobi (HJ) equations, heat equations, and Monte Carlo sampling. In particular, HJ-Prox smoothly approximates the Moreau envelope and its gradient, and the degree of smoothing can be adjusted so that HJ-Prox acts as a denoiser. Our approach applies even when functions are only accessible by (possibly noisy) blackbox samples. We show HJ-Prox is effective numerically via several examples.

The rise of computational power and the availability of big data have brought great interest to first-order optimization methods. Second-order methods (e.g. Newton's method) are effective for moderately sized problems, but generally do not scale well: memory requirements grow quadratically with problem size and computational costs grow cubically. First-order methods are often composed of gradient and proximal operations, which are typically cheap to evaluate relative to problem size. Although gradients can be computed for many functions (or numerically approximated), the computation of proximals involves solving a small optimization problem. In special cases (e.g. with ℓ1 norms), these subproblems admit closed-form solutions that can be quickly evaluated (e.g. see (1)). These formulas yield great utility in many applications. However, we are presently interested in the class of problems with (potentially nondifferentiable) objectives for which proximal formulas are unavailable.
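For instance, for the ℓ1 norm f = ∥·∥₁, the proximal is given coordinatewise by the well-known soft-thresholding (shrinkage) formula

$$\operatorname{prox}_{t\|\cdot\|_1}(x)_i = \operatorname{sign}(x_i)\,\max(|x_i| - t,\ 0), \qquad i = 1, \dots, n.$$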
We propose a new approach to compute proximal operators and corresponding Moreau envelopes for functions f. We leverage the fact that the Moreau envelope of f is the solution to a Hamilton-Jacobi (HJ) equation (2). The core idea is to add artificial viscosity to HJ equations and obtain explicit formulas for the proximal and Moreau envelope using the Cole-Hopf transformation (2, Sec. 4.5.2). This approach enables proximals and Moreau envelopes of arbitrary f to be approximated. Our proposed proximal approximations (called HJ-Prox) are computed using only function evaluations and can thus be used in a zeroth-order fashion when integrated within an optimization algorithm. Finally, an importance sampling procedure is employed to mitigate the curse of dimensionality when estimating the HJ-Prox in dimensions higher than three. Numerical experiments show HJ-Prox is effective within optimization algorithms when an explicit proximal is unavailable and when objectives are only accessible via blackbox oracles. Our work can generally be applied to first-order proximal-based algorithms such as the Alternating Direction Method of Multipliers (ADMM) and its variants (3–6), and operator splitting algorithms (7–11).

Proximal Operators and Moreau Envelopes
Consider a function f : ℝ^n → ℝ and a time t > 0. The proximal prox_{tf} and the Moreau envelope u of f (12, 13) are defined by

$$\operatorname{prox}_{tf}(x) \triangleq \operatorname*{argmin}_{y \in \mathbb{R}^n}\ f(y) + \frac{1}{2t}\|y - x\|^2 \qquad [1]$$

and

$$u(x,t) \triangleq \min_{y \in \mathbb{R}^n}\ f(y) + \frac{1}{2t}\|y - x\|^2. \qquad [2]$$

The proximal is the set of minimizers defining the envelope. As shown in Figure 1, the envelope u widens the valleys of f while sharing its global minimizers. A well-known result (e.g. see (1, 14)) states that, if the envelope u is differentiable at x, then

$$\nabla u(x,t) = \frac{1}{t}\left(x - \operatorname{prox}_{tf}(x)\right). \qquad [3]$$

Rearranging reveals

$$\operatorname{prox}_{tf}(x) = x - t\nabla u(x,t). \qquad [4]$$

A key idea we use is to estimate the proximal by replacing u with a smooth approximation u^δ ∈ C^∞(ℝ^n), derived from a Hamilton-Jacobi (HJ) equation.
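As a quick sanity check of [3] and [4] (a standard computation, included here for illustration), take f(y) = ∥y∥²/2. Then

$$\operatorname{prox}_{tf}(x) = \frac{x}{1+t}, \qquad u(x,t) = \frac{\|x\|^2}{2(1+t)}, \qquad \nabla u(x,t) = \frac{x}{1+t},$$

and one verifies directly that x − t∇u(x,t) = x/(1+t) = prox_{tf}(x), as [4] predicts.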

Hamilton-Jacobi Connection
The envelope u is a special case of the Hopf-Lax formula (2). Fix any T > 0. For all t ∈ [0, T], the envelope u is a viscosity solution (e.g. see (15, Theorem 3.2)) to the HJ equation

$$u_t + \frac{1}{2}\|\nabla_x u\|^2 = 0 \ \text{ in } \mathbb{R}^n \times (0, T], \qquad u(x,0) = f(x). \qquad [5]$$

Fixing δ > 0, the associated viscous HJ equation is

$$u^\delta_t + \frac{1}{2}\|\nabla_x u^\delta\|^2 = \frac{\delta}{2}\Delta u^\delta \ \text{ in } \mathbb{R}^n \times (0, T], \qquad u^\delta(x,0) = f(x). \qquad [6]$$

If f is bounded and Lipschitz, Crandall and Lions (16) show u^δ approximates u, i.e. u^δ → u uniformly as δ → 0^+.
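Continuing the quadratic example f(y) = ∥y∥²/2 (our own computation, via a Gaussian integral and the explicit formula [10] derived below), the viscous solution is

$$u^\delta(x,t) = \frac{\|x\|^2}{2(1+t)} + \frac{n\delta}{2}\log(1+t) = u(x,t) + \frac{n\delta}{2}\log(1+t),$$

so u^δ and u differ by a constant that vanishes as δ → 0^+; in particular, ∇u^δ = ∇u here, and the resulting proximal estimate is exact for quadratics.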

Contribution
Many objectives do not admit explicit proximal formulas (e.g. when objectives are nonconvex or only accessible via an oracle). Yet, using only (possibly noisy) objective samples, we give a formula that accurately approximates such proximals.

Cole-Hopf Transformation
Using the transformation v^δ ≜ exp(−u^δ/δ), originally attributed to Cole and Hopf (2, 17), the function v^δ solves the heat equation

$$v^\delta_t = \frac{\delta}{2}\Delta v^\delta \ \text{ in } \mathbb{R}^n \times (0, T], \qquad v^\delta(x,0) = \exp(-f(x)/\delta). \qquad [7]$$

(Indeed, v^δ_t = −(u^δ_t/δ)v^δ and Δv^δ = (∥∇u^δ∥²/δ² − Δu^δ/δ)v^δ, so [7] is equivalent to [6].) This transformation is of interest since v^δ can be expressed via the convolution formula (e.g. see (2))

$$v^\delta(x,t) = \int_{\mathbb{R}^n} \Phi_{\delta t}(x - y)\,\exp(-f(y)/\delta)\,dy, \qquad [8]$$

where Φ_{δt} is a fundamental solution to [7], i.e.
$$\Phi_{\delta t}(x) \triangleq \begin{cases} \dfrac{1}{(2\pi\delta t)^{n/2}}\exp\!\left(-\dfrac{\|x\|^2}{2\delta t}\right) & \text{if } t > 0, \\[4pt] 0 & \text{otherwise.} \end{cases} \qquad [9]$$

Using algebraic manipulations, we recover the viscous solution

$$u^\delta(x,t) = -\delta \log v^\delta(x,t) = -\delta \log\left(\frac{1}{(2\pi\delta t)^{n/2}}\int_{\mathbb{R}^n} \exp\!\left(-\frac{1}{\delta}\Big(f(y) + \frac{1}{2t}\|x - y\|^2\Big)\right)dy\right). \qquad [10]$$

Differentiating reveals

$$\nabla u^\delta(x,t) = \frac{1}{t}\left(x - \frac{\int_{\mathbb{R}^n} y\,\exp\!\left(-\frac{1}{\delta}\big(f(y) + \frac{1}{2t}\|x - y\|^2\big)\right)dy}{\int_{\mathbb{R}^n} \exp\!\left(-\frac{1}{\delta}\big(f(y) + \frac{1}{2t}\|x - y\|^2\big)\right)dy}\right). \qquad [11]$$

Importance Sampling
At first glance, the integral formulas for u^δ and ∇u^δ in [10] and [11] may appear to require use of a grid for numerical estimation. However, we may avoid such grids by noting v^δ can be written as an expectation, i.e.

$$v^\delta(x,t) = \mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t)}\big[\exp(-f(y)/\delta)\big], \qquad [12a]$$

and, combining [4], [10], and [11],

$$\operatorname{prox}_{tf}(x) \approx x - t\nabla u^\delta(x,t) = \frac{\mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t)}\big[y\exp(-f(y)/\delta)\big]}{\mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t)}\big[\exp(-f(y)/\delta)\big]}, \qquad [12b]$$

where y ∼ N(x, δt) denotes that y ∈ ℝ^n is sampled from a normal distribution with mean x and covariance (δt)·I (i.e. per-coordinate standard deviation √(δt)). In practice, finitely many samples y_i ∼ N(x, δt) are used to estimate [12b]. Sampling from the Gaussian centered at x concentrates samples where the kernel Φ_{δt} carries its mass, which can greatly reduce sampling complexity (18, 19).
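To make the sampling procedure concrete, the following is a minimal sketch of the Monte Carlo estimator in [12b] (our own illustration rather than the authors' released code; the function name and defaults are chosen for exposition):

```python
import numpy as np

def hj_prox(f, x, t=0.1, delta=0.1, num_samples=1000, rng=None):
    """Monte Carlo estimate of prox_{tf}(x) via the expectation ratio [12b].

    f : callable taking an (m, n) array of points, returning (m,) values
    x : (n,) query point
    """
    rng = np.random.default_rng(rng)
    # Importance sampling: draw y_i ~ N(x, (delta * t) * I)
    y = x + np.sqrt(delta * t) * rng.standard_normal((num_samples, x.size))
    # Weights proportional to exp(-f(y_i)/delta); subtracting the max
    # exponent (a log-sum-exp shift) guards against underflow
    exponents = -f(y) / delta
    w = np.exp(exponents - exponents.max())
    return (w @ y) / w.sum()

# Sanity check: for f = ||.||_1 the estimate should approach the
# soft-thresholding operator applied to x.
f = lambda y: np.abs(y).sum(axis=-1)
x = np.array([0.3, -1.5, 0.05])
print(hj_prox(f, x, t=0.1, delta=0.05))  # roughly [0.2, -1.4, 0.0]
```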

Numerical Considerations
A possible numerical challenge in our formulation is addressing instabilities that arise when the exponential term underflows with limited numerical precision, due to either δ being small or f(y) being large. To this end, note the proximal formula may equivalently be re-scaled via

$$\operatorname{prox}_{tf}(x) = \operatorname{prox}_{(t/\alpha)(\alpha f)}(x) \approx \frac{\mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t/\alpha)}\big[y\exp(-\alpha f(y)/\delta)\big]}{\mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t/\alpha)}\big[\exp(-\alpha f(y)/\delta)\big]},$$

where t is replaced by t/α and f by αf in [15]. In this case, if f/δ becomes too large with respect to numerical precision limitations, it may be scaled down with a corresponding α. To make the implementation stable, we check whether exp(−αf(y)/δ) underflows and rescale α using a linesearch-like approach. In particular, we recursively halve α until exp(−αf(y)/δ) > ε for a tolerance ε (see line 7 of Algorithm 1). However, small α makes the variance large, and more samples may be required to accurately estimate the expectations. Another mitigation is to adaptively rescale f based on the number of recursive steps taken in HJ-Prox. Large δ can be used to smooth approximations and mitigate the stochastic characteristics of HJ-Prox. Another potential instability arises when f is negative in parts of the domain, in which case exp(−αf(y)/δ) may overflow. To remedy this, we check whether f(y) is negative and recursively shift the function until it is nonnegative (see line 5 of Algorithm 1).
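The two safeguards above can be sketched as follows (a simplified rendering in the spirit of lines 5 and 7 of Algorithm 1; the shift bookkeeping, tolerance, and function name are ours and may differ from the paper's exact logic):

```python
import numpy as np

def hj_prox_stable(f, x, t=0.1, delta=0.1, num_samples=1000,
                   tol=1e-12, rng=None):
    """HJ-prox with two safeguards: shift f to be nonnegative (overflow)
    and recursively halve alpha on underflow."""
    rng = np.random.default_rng(rng)
    alpha, shift = 1.0, 0.0
    while True:
        # Replacing t by t/alpha and f by alpha*f leaves prox_{tf} unchanged
        y = x + np.sqrt(delta * t / alpha) * rng.standard_normal(
            (num_samples, x.size))
        fy = alpha * (f(y) + shift)
        if fy.min() < 0:          # cf. line 5 of Algorithm 1: shift f up
            shift -= fy.min() / alpha
            continue
        w = np.exp(-fy / delta)
        if w.max() <= tol:        # cf. line 7: underflow, so halve alpha
            alpha /= 2.0          # (smaller alpha increases sample variance)
            continue
        return (w @ y) / w.sum()
```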

Convergence Analysis
The arguments above give intuition for a proximal approximation. However, with the formula [15] now in hand, we may formalize its utility without reference to differential equations. Below we define two standard classes of functions used in optimization.
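Stated in their standard forms (and consistent with the proof of Step 1 below, which uses θ ≜ 1/t − ρ strong convexity), a function f : ℝ^n → ℝ is said to be

$$\text{(i) convex if } f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y) \text{ for all } x, y \in \mathbb{R}^n,\ \lambda \in [0,1];$$
$$\text{(ii) } \rho\text{-weakly convex (for } \rho \ge 0\text{) if } f + \tfrac{\rho}{2}\|\cdot\|^2 \text{ is convex.}$$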
Our main result shows HJ-Prox converges to the proximal.
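In the notation above, and under the hypotheses of the theorem (which, per Step 1 of the proof, include ρ-weak convexity of f and 1/t > ρ), the claim is that the expectation ratio in [12b] recovers the proximal in the vanishing-viscosity limit:

$$\lim_{\delta \to 0^+}\ \frac{\mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t)}\big[y\exp(-f(y)/\delta)\big]}{\mathbb{E}_{y \sim \mathcal{N}(x,\,\delta t)}\big[\exp(-f(y)/\delta)\big]} = \operatorname{prox}_{tf}(x).$$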
The theoretical results in our work are closely related to the study of asymptotics, as δ → 0, of integrals containing expressions of the form exp(−f/δ), i.e. Laplace's method (2). Moreover, the idea of adding artificial diffusion to Burgers' equation and then applying the Cole-Hopf transformation to approximate the gradient of the solution to the HJ equation was largely developed in (2) in the context of obtaining solutions to conservation laws in 1D. The connections between Hopf-Lax and Cole-Hopf have been observed in the contexts of machine learning (17), image denoising and Bayesian inference (41–43), and global optimization (20).

Fig. 4 (caption). Since there is no analytic proximal formula, we obtain the "true" proximal by solving the optimization problem [1] using gradient descent. The HJ-based proximal is a good approximation of the true proximal operator and can even be applied when only (potentially noisy) samples are available. As in the analytic case, we obtain a C^∞ approximation of the underlying function f in the noisy case. Here, δ = 0.1 for the noiseless case, and δ₁ = 0.5 and δ₂ = 0.1 for the noisy case.

Numerical Experiments
Examples herein show HJ-Prox (Algorithm 1) can
▶ approximate proximals and smooth noisy samples,
▶ converge comparably to existing algorithms, and
▶ solve a new class of zeroth-order optimization problems.
Each item is addressed by a set of experiments. Regarding the last item, to our knowledge, HJ-Prox is the first tool to enable faithful solution estimation for constrained problems where the objective is only accessible via noisy blackbox samples.

Proximal and Moreau Envelope Estimation.
Herein we compare HJ-Prox to known proximal operators. Figure 2 shows HJ-Prox for three functions (absolute value, quadratic, and log barrier) whose proximals are known. In the leftmost column (a), we show the Moreau envelope u(x,t) given by [2] and its HJ-Prox estimate u^δ(x,t). Given the close connection between proximals and Moreau envelopes, we believe this visual is a natural and intuitive way to gauge whether the proximal operator is accurate. Column (b) juxtaposes the true proximal and HJ-Prox. Column (c) shows the accuracy of HJ-Prox across different dimensions and numbers of samples. In the rightmost column (d), we estimate Moreau envelopes using HJ-Prox with noisy function values. The resulting envelopes are smooth since u^δ is a smooth (i.e. C^∞) approximation of u. Thus, HJ-Prox can be used to obtain smooth estimates from noisy observations. Figure 3 shows Moreau envelopes for nonconvex functions f. As in the previous example, the HJ-based estimates also accurately approximate the Moreau envelopes. Note these proximals may be well-defined only for small time t (as the proximal operator objective in [1] is strongly convex for small t). Lastly, in Figure 4 we apply HJ-Prox to a function that has no analytic formula for its proximal or Moreau envelope. In this experiment, we obtain a "true" Moreau envelope and proximal operator by solving the minimization problem [1] iteratively via gradient descent. Faithful recovery is shown in Figures 4a and 4b, and smoothing in Figure 4c.
Optimization with Proximable Function. This experiment juxtaposes HJ-prox and an analytic proximal formula in an optimization algorithm. Consider the Lasso problem (44)

$$\min_{x \in \mathbb{R}^{1000}}\ \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \qquad [18]$$

for a regularization weight λ > 0, where the entries of A ∈ ℝ^{500×1000} and b ∈ ℝ^{500} are i.i.d. Gaussian samples. The iterative soft thresholding algorithm (ISTA) (45) defines a sequence of solution estimates {x^k} for all k ∈ ℕ via

$$x^{k+1} = \operatorname{shrink}\left(x^k - tA^\top(Ax^k - b);\ \lambda t\right), \qquad [19]$$

where the shrink operator is defined element-wise by

$$\operatorname{shrink}(x; t) \triangleq \operatorname{sign}(x)\max(0, |x| - t). \qquad [20]$$
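For concreteness, a minimal reference implementation of the iteration [19] with the shrink operator [20] (our own sketch; λ and the iteration count are illustrative, not the paper's exact settings):

```python
import numpy as np

def shrink(x, t):
    """Elementwise soft-thresholding, as in [20]."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam=1.0, num_iters=500):
    """Iterate [19] for the Lasso problem [18]."""
    t = 1.0 / np.linalg.norm(A.T @ A, 2)  # step size 1 / ||A^T A||_2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        x = shrink(x - t * A.T @ (A @ x - b), lam * t)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 1000))
b = rng.standard_normal(500)
x_hat = ista(A, b)  # swapping shrink(...) for an HJ-prox call gives the
                    # approximate-proximal variant compared in Fig. 5
```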
Fig. 5 (caption). Solving [18] with ISTA, juxtaposing use of an analytic proximal formula, gradient descent (i.e. ignoring the proximal), and the approximate HJ-prox (Algorithm 1). Plots with HJ-prox show averaged results from 30 trials with distinct random seeds. To ensure the proximal plays a role in the optimization process, we also show the function value history of gradient descent applied to the unregularized least squares problem in [18] (i.e. with no ℓ1 norm term).

Fig. 6 (caption). Relative errors for HJ-MM using noisy oracles, i.e. solving [21] with the linearized method of multipliers and HJ-prox (Algorithm 1). Each plot shows averaged results from 30 trials with distinct random seeds. Due to the noise, we observe in (a) that a larger δ = 10 leads to a better approximation, but a δ that is too large (δ = 100) leads to oversmoothing and reduces accuracy. We find δ = 10 to work best, and (b) shows that more samples lead to more accurate approximations (as expected). To ensure the proximal plays a role in the optimization process, we also show the relative error when gradient descent is applied to the constraint residual in [21] (i.e. we only minimize the constraint residual). Indeed, gradient descent performs poorly by comparison.

Optimization with Noisy Objective Oracles. Consider a constrained minimization problem where objective values f can only be accessed via a noisy oracle* O. Our task is to solve

$$\min_{x \in \mathbb{R}^{1000}}\ \mathbb{E}\big[\mathcal{O}(x)\big] \quad \text{subject to} \quad Ax = b, \qquad [21]$$

where A and b are as in the prior experiment and the expectation E is over the oracle noise. To model "difficult" settings (e.g. when a singular value decomposition of A is unavailable), we do not use any projections onto the feasible set. As the structure of O is unknown to the solver, we emphasize schemes for solving [21] must use zeroth-order optimization (29). Here, each oracle call returns

$$\mathcal{O}(x) = (1 + \sigma\varepsilon)\,\|Wx\|_1, \qquad [22]$$

with a new noise sample ε ∼ N(0, 1) drawn for each oracle evaluation, σ = 0.005, and W ∈ ℝ^{1000×1000} a fixed Gaussian matrix. In words, the noise has magnitude 0.5% of ∥Wx∥₁. Although the oracle structure is shown by [22], our task is to solve [21] without such knowledge. We do this with the linearized method of multipliers (e.g. see Section 3.5 in (9)). Specifically, for each index k ∈ ℕ, the update formulas for the solution estimates {x^k} and corresponding dual variables {u^k} are

$$x^{k+1} = \operatorname{prox}_{t\mathcal{O}}\left(x^k - tA^\top\big(u^k + \lambda(Ax^k - b)\big)\right), \qquad u^{k+1} = u^k + \lambda(Ax^{k+1} - b), \qquad [23]$$

with step sizes t = 1/∥A^⊤A∥₂ and λ = 1/2. Without noise ε, convergence occurs if tλ∥A^⊤A∥₂ < 1 (9), justifying our choices of t and λ. The proximal prox_{tO} is estimated by HJ-prox.

*Here O is a noisy function, not to be confused with the "big O" notation often used to describe limiting behaviors.
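Combining the pieces, the following is a sketch of the updates [23] with prox_{tO} estimated by the Monte Carlo HJ-prox sketched earlier (the oracle below implements our reading of [22]; the constants follow the text):

```python
import numpy as np

def hj_prox(f, x, t, delta, num_samples, rng):
    """Monte Carlo HJ-prox estimate of prox_{tf}(x), as in [12b]."""
    y = x + np.sqrt(delta * t) * rng.standard_normal((num_samples, x.size))
    e = -f(y) / delta
    w = np.exp(e - e.max())              # log-sum-exp shift for stability
    return (w @ y) / w.sum()

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((500, n))
b = rng.standard_normal(500)
W = rng.standard_normal((n, n))
sigma = 0.005

def oracle(y):
    """Noisy zeroth-order oracle; our reading of [22]. y has shape (m, n)."""
    norms = np.abs(y @ W.T).sum(axis=-1)              # ||W y_i||_1 per sample
    return (1.0 + sigma * rng.standard_normal(norms.shape)) * norms

t = 1.0 / np.linalg.norm(A.T @ A, 2)
lam = 0.5                                             # t * lam * ||A^T A||_2 < 1
x, u = np.zeros(n), np.zeros(A.shape[0])
for _ in range(100):
    z = x - t * A.T @ (u + lam * (A @ x - b))         # linearized primal step
    x = hj_prox(oracle, z, t=t, delta=10.0, num_samples=100, rng=rng)
    u = u + lam * (A @ x - b)                         # dual ascent step
```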
We separately solve the optimization problem using full knowledge of the objective ∥Wx∥₁ without noise; doing so enables us to plot the relative error of the sequence {x^k} in Figure 6. All the plots show {x^k} converges to the optimal x⋆, up to an error threshold, regardless of the choice of δ and the number of samples N. Notice Figure 6a shows "small" values of δ give comparable accuracy, but oversmoothing with "large" δ = 100 degrades the performance of the algorithm. These plots also illustrate that the HJ-prox formula is efficient with respect to calls to the oracle O. Indeed, the plots in Figure 6b that show decreasing relative error use, at each iteration, only 0.1, 1, and 10 oracle calls per problem dimension, respectively! We hypothesize the smoothing effect of the viscous solution u^δ and the averaging effect of importance sampling contribute to the observed convergence. In this experiment, HJ-prox converges to within an error tolerance, is efficient with respect to oracle calls, and smooths Gaussian noise.

Conclusion
We propose a novel algorithm, HJ-prox, for efficiently approximating proximal operators. It is derived by approximating Moreau envelopes via viscosity solutions to Hamilton-Jacobi (HJ) equations, as given via the Hopf-Lax formula. Upon rewriting this approximation in terms of expectations, we use importance sampling to avoid discretizing the integrals, thereby mitigating the curse of dimensionality. Our numerical examples show HJ-Prox is effective for a collection of functions, both with and without known proximal formulas. Moreover, HJ-prox can be used effectively in constrained optimization problems even when only noisy objective values are available.
Step 1. The numerator and denominator in the definition [25] of σ^δ are nonnegative, making σ^δ ≥ 0 everywhere. Here ϕ_t(y) ≜ f(y) + ∥y − x∥²/(2t) denotes the objective in [1] and ϕ⋆_t ≜ ϕ_t(ξ⋆) its minimum value. By the choice of t, ϕ_t is θ-strongly convex with θ ≜ 1/t − ρ, and so it admits a unique minimizer ξ⋆ = prox_{tf}(x) and satisfies

$$\phi_t(y) \ge \phi_t^\star + \frac{\theta}{2}\|y - \xi^\star\|^2 \quad \text{for all } y \in \mathbb{R}^n. \qquad [28]$$

Consequently,

$$0 \le \exp(-\phi_t(y)/\delta) \le \exp(-\phi_t^\star/\delta)\,\exp\!\left(-\frac{\theta}{2\delta}\|y - \xi^\star\|^2\right) \quad \text{for all } y \in \mathbb{R}^n. \qquad [29]$$

Since the upper bound above is an exponential that decays quadratically, the middle term in [29] is integrable over ℝ^n. As ϕ⋆_t = u(x, t) ≥ 0 by hypothesis, the denominator in the definition of σ^δ is positive. Then [26] readily follows.
Step 2. A classic result in analysis (e.g. see (46, Exercise 3.4)) states that L^p norms converge to the L^∞ norm as p → ∞, and so

$$\lim_{\delta \to 0^+}\left(\int_{\mathbb{R}^n} \exp(-\phi_t(y)/\delta)\,dy\right)^{\!\delta} = \lim_{\delta \to 0^+}\ \big\|e^{-\phi_t}\big\|_{L^{1/\delta}} = \big\|e^{-\phi_t}\big\|_{L^{\infty}} = e^{-\phi_t^\star},$$

where the L^{1/δ} norm is always finite by Step 1 and the final equality holds since the exponential is maximized at the minimizer ξ⋆ of ϕ_t.