1 Introduction

In this paper, we consider the optimization of a nonsmooth function \(f:\mathbb {R}^n\rightarrow \mathbb {R}\) over a closed convex set, namely

$$\begin{aligned} \begin{array}{ll} \min &{}f(x)\\ s.t. &{} x\in X. \end{array} \end{aligned}$$
(1)

We assume that f is locally Lipschitz continuous and that first order information is unavailable or impractical to obtain.

For nonsmooth problems, the aim of the optimization is to find Clarke-stationary points [6, 9].

The literature on derivative-free methods for smooth constrained optimization (i.e. when f is differentiable even though derivatives are not available) is extensive. Several approaches, based on the pattern search methods dating back to [13], have been developed for bound and linearly constrained problems in [16] and [17] and for more general types of constraints in [18]. Other works stem from the research presented in [11, 21], whose line-search approach has been extended to box and linearly constrained problems in [20] and [22], while more general constraints are covered in [19]. An interesting work in the field of global optimization is [8]. We refer the reader interested in derivative-based methods, which are not considered in this work, to [7].

The nonsmooth case has seen less development. One of the two major approaches that have emerged is Mesh Adaptive Direct Search (MADS), which dates back to [3, 4] and was later modified in [1, 5]. This method combines a dense search with an extreme barrier to deal with the constraints. A second main approach, proposed in [10], is instead based on an exact penalty function.

In this case, the feasible set is expressed by a possibly nonsmooth set of inequalities \(g:\mathbb {R}^n\rightarrow \mathbb {R}^m\) and the original problem is replaced by the penalized version, for a given \(\epsilon >0\),

$$\begin{aligned} \min _x f(x)+\frac{1}{\epsilon }\sum _{i=1}^{m}\text {max}\{0,g_i(x)\}. \end{aligned}$$
(2)

It can be shown (see Proposition 3.6 in [10]) that, under suitable assumptions, a value \(\epsilon ^*\) exists such that \(\forall \epsilon \in (0, \epsilon ^*]\) every Clarke-stationary point \({\bar{x}}\) of (2) is also a stationary point of the original problem. Thus, any algorithm for nonsmooth unconstrained optimization can be applied. In [10], however, a line-search-based algorithm that employs a dense set of directions alongside the 2n coordinate directions (CS-DFN) is proposed. This reformulation, combined with CS-DFN, is shown to compare favorably against state-of-the-art MADS-based software such as NOMAD [15].

The value of \(\epsilon ^*\) is, however, in general unknown, and choosing a proper value of \(\epsilon \) can be a difficult task. Setting a wrong value of \(\epsilon \) can be extremely harmful to the performance of the algorithm. For example, if f is unbounded outside the feasible set, too high a value of \(\epsilon \) can drive the algorithm towards minus infinity. On the other hand, too small a value (\(<\epsilon ^*\)), even if theoretical convergence is assured, can yield extremely poor performance, because the algorithm may be forced to take very small steps near the boundary of the feasible set.

In this work we propose a novel way of treating convex constraints that is not based on penalty functions. We assume that the feasible set X is a closed convex set and that a projection operator onto the feasible set is available. We do not require X to have an analytical expression nor make any other regularity assumptions. However, we make the assumption that the computational effort needed to compute the projection is negligible compared to the evaluation of the objective function. Indeed, the only computational cost we consider is the number of evaluations of the objective function.

The paper is organized as follows. In Sect. 2 we recall some necessary definitions and known results before introducing the proposed reformulation in Sect. 3. Then, in Sect.  4, we prove the equivalence between the novel formulation and the original problem. Some numerical results are given in Sect. 5. In particular, we propose a comparison between the exact penalty approach of [10] and the proposed reformulation. We make use of the CS-DFN algorithm, used in [10], for both formulations to make a fair comparison. Finally, we give some concluding remarks in Sect. 6.

2 Preliminary background

We recall that optimality conditions for nonsmooth problems can be given in terms of the Clarke generalized directional derivative [9]. In particular, in the unconstrained case, we have the following.

Definition 1

(Clarke stationarity–unconstrained case) A point \({\bar{x}}\in \mathbb {R}^n\) is Clarke stationary w.r.t. problem \(\min _x f(x)\) if

$$\begin{aligned} f^{\circ }({\bar{x}}; d) \ge 0 \quad \forall d\in \mathbb {R}^n, \end{aligned}$$

where

$$\begin{aligned} f^{\circ }(x; d) = \limsup _{\begin{array}{c} y\rightarrow x,~ t \rightarrow 0 \end{array}}{\frac{f(y+td)-f(y)}{t}} \end{aligned}$$

is the Clarke generalized directional derivative.

We follow [6] for the treatment of Clarke stationarity for constrained problems. First, we define the cone of the hyper-tangent directions.

Definition 2

(Hyper-Tangent Cone) A vector \(d\in \mathbb {R}^n\) is said to be a hyper-tangent vector to the set X at \({\bar{x}}\in X\) if there exists \(\epsilon >0\) such that

$$\begin{aligned} y+tw\in X \quad \forall y\in X\cap B({\bar{x}}, \epsilon ), \quad w\in B(d, \epsilon ), \quad t \in (0, \epsilon ). \end{aligned}$$

The set of all hyper-tangent vectors is called the hyper-tangent cone to X at \({\bar{x}}\) and is denoted \(T_X^H({\bar{x}})\). For a more detailed treatment we refer the reader to [6]; Figure 6.5 of [6] offers a graphical illustration of the hyper-tangent cone.

Then we can give a definition of Clarke-stationary points as expressed by the following.

Definition 3

(Clarke stationarity-constrained case) A point \({\bar{x}}\in X\) is Clarke stationary w.r.t. problem \(\min _{x\in X} f(x)\) if

$$\begin{aligned} f^{\circ }({\bar{x}}; d) \ge 0 \quad \forall d\in T_X^H({\bar{x}}), \end{aligned}$$

where

$$\begin{aligned} f^{\circ }(x; d) = \limsup _{\begin{array}{c} y\rightarrow x, ~y \in X \\ t \rightarrow 0 \end{array}}{\frac{f(y+td)-f(y)}{t}}. \end{aligned}$$

We will make use of the following result, which relates the Clarke derivative to the classical directional derivative in the case of convex functions. For the proof we refer the reader to Theorem 3.42 in [14].

Theorem 1

Let \(f: \mathbb {R}^n \rightarrow \mathbb {R}\) be a convex functional which is Lipschitz continuous at some \({\bar{x}} \in X\). Then the Clarke derivative of f at \({\bar{x}}\) coincides with the directional derivative of f at \({\bar{x}}\), that is

$$\begin{aligned} f^{\circ }({\bar{x}}; d) = \limsup _{\begin{array}{c} y\rightarrow {\bar{x}},~ t \rightarrow 0 \end{array}}{\frac{f(y+td)-f(y)}{t}} = \lim _{t\rightarrow 0} \frac{f({\bar{x}}+td)-f({\bar{x}})}{t} = f^\prime ({\bar{x}}; d). \end{aligned}$$

3 A novel formulation

Broadly speaking, all the approaches that have been proposed in the literature to deal with general constraints try to steer the search towards the feasible set by adding to the objective function (possibly in a sequential manner) some kind of penalty \(\phi \), which, in its most general form, can be described by

$$\begin{aligned} \phi (x) = {\left\{ \begin{array}{ll} 0 &{} \text {if } x\in X\\ > 0 &{} \textit{otherwise}. \end{array}\right. } \end{aligned}$$

Such a function can be a smooth quadratic penalty, an exact nonsmooth penalty, or a hard barrier that takes the value \(+\infty \) outside the feasible set. The problem is thus rewritten as

$$\begin{aligned} \min _x P(x; \epsilon ) = f(x) + \epsilon \phi (x), \end{aligned}$$

where \(\epsilon >0\) is a parameter that must be set.

Consider a local minimum \(x^*\) of the original problem. We have that, for some neighborhood \(B(x^*, \delta )\),

$$\begin{aligned} f(x^*)\le f(x) ~\forall x \in B(x^*, \delta )\cap X. \end{aligned}$$

The strategy of penalty-based approaches is making the penalty large enough (either by making \(\epsilon \) large or by using a hard barrier) so that

$$\begin{aligned} f(x)+\epsilon \phi (x) \ge f(x^*)+\epsilon \phi (x^*) = f(x^*), \quad \forall x\in B(x^*, \delta ) \end{aligned}$$

and, hence, \(x^*\) is also a local minimum of the penalized problem.

The idea behind the proposed reformulation is, instead, to avoid penalties by “assigning” to a point x outside the feasible set the value of the objective function computed at its projection \(\pi (x)\). In this way, we never compute f outside the feasible set and we do not need to “correct” f with a penalty for points outside it.

Let \(\pi \) be the projection operator onto X defined as

$$\begin{aligned} \pi (x) \in \mathop {\mathrm {argmin}}\limits _{z\in X} \left\| z-x \right\| . \end{aligned}$$

Notice that since X is closed and convex the projection exists and is unique. A proof of the uniqueness of the projection for convex sets, alongside other properties of the projection, can be found in Proposition 2.1.3 of [7]. One property that will be used extensively in the following is the nonexpansiveness of the projection, i.e. \(\left\| \pi (x)-\pi (y) \right\| \le \left\| x-y \right\| ~\forall x, ~y \in \mathbb {R}^n\). Dealing with nonconvex sets would require a more complex treatment since the projection may not be unique. We leave the study of a possible extension to future work.

We can thus define the problem

$$\begin{aligned} \min _x f(\pi (x)) \end{aligned}$$

where each point outside the feasible set assumes the value of f at its projection. In the latter formulation it is guaranteed that no point outside the feasible set can take a value lower than the infimum of f over X. We have, however, that all the points sharing the same projection (consider a ray perpendicular to the constraints) share the same function value. To overcome this issue it is sufficient to add to the previous formulation a term that penalizes the distance of a point from its projection. We thus propose to replace the original problem by

$$\begin{aligned} \min _x \tilde{f}(x) = f(\pi (x)) + d_X(x), \end{aligned}$$
(3)

where \(d_X(x) = \left\| x-\pi (x) \right\| \) is the distance from x to the feasible set X.

We note also that, since the projection operator is continuous, \(\tilde{f}\) is continuous. Moreover, if f is bounded from below on the feasible set X then \(\tilde{f}\) is bounded from below on \(\mathbb {R}^n\), since \(\tilde{f}(x)\ge f(\pi (x))\ge \inf _{x\in X}f(x)\). On the contrary, in the penalty approach, \(f(x)+\frac{1}{\epsilon }\sum _{i=1}^{m}\text {max}\{0,g_i(x)\}\) can be unbounded from below.
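As an illustration of how cheap the reformulation is to set up, the following minimal Python sketch (our own illustration, not the code released with the paper) builds \(\tilde{f}\) from a black-box f and a projection operator; here the feasible set is assumed to be a box, for which the projection is just a componentwise clip.

```python
import numpy as np

def make_reformulated_objective(f, project):
    """Return x -> f(pi(x)) + ||x - pi(x)||, i.e. the objective of problem (3)."""
    def f_tilde(x):
        x = np.asarray(x, dtype=float)
        p = project(x)                       # pi(x): projection onto X
        return f(p) + np.linalg.norm(x - p)  # f(pi(x)) + d_X(x)
    return f_tilde

# Illustrative feasible set X = {x : 0 <= x <= 6}: the projection is a componentwise clip.
project_box = lambda x: np.clip(x, 0.0, 6.0)
# A simple nonsmooth objective whose unconstrained minimizer (7, 3) lies outside X.
f = lambda x: abs(x[0] - 7.0) + abs(x[1] - 3.0)

f_tilde = make_reformulated_objective(f, project_box)
print(f_tilde(np.array([7.0, 3.0])))  # f evaluated at the projection (6, 3), plus distance 1
```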

Consider, as a simple example, the problem HS224 from the test suite [26].

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _x &{}2x_1^2 + x_2^2-48 x_1-40 x_2\\ s.t.&{} x_1 + 3x_2\ge 0\\ &{} 18- x_1-3 x_2 \ge 0\\ &{} x_1 + x_2 \ge 0\\ &{} 8 - x_1 - x_2 \ge 0\\ &{}0 \le x \le 6. \end{array} \end{aligned}$$

The level curves of the original objective function f and of the modified problem \({\tilde{f}}\) are shown in Fig. 1.

Fig. 1

Level curves of f (left) and \(\tilde{f}\) (right) for the two-variables problem HS224. The color bar indicates the objective function values. The feasible set is represented by the area shadowed in gray. The solution is indicated by the red dot

Notice how the solution to the problem \(x^*=(4, 4)\) becomes an unconstrained global minimum in the proposed formulation.
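When, as in HS224, the feasible set is described by linear inequalities and bounds, the projection itself is a small quadratic program. The following sketch (our own illustration, assuming SciPy is available; the paper does not prescribe how the projection is computed) evaluates \(\tilde{f}\) for HS224 by solving that QP with SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

# HS224 feasible set written as {z : A z >= b, 0 <= z <= 6}.
A = np.array([[1.0, 3.0], [-1.0, -3.0], [1.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, -18.0, 0.0, -8.0])
bounds = [(0.0, 6.0), (0.0, 6.0)]

def project(x):
    """Euclidean projection onto the HS224 feasible set via a small QP solve."""
    res = minimize(lambda z: 0.5 * np.sum((z - x) ** 2), np.clip(x, 0.0, 6.0),
                   jac=lambda z: z - x, method='SLSQP', bounds=bounds,
                   constraints=[{'type': 'ineq', 'fun': lambda z: A @ z - b,
                                 'jac': lambda z: A}])
    return res.x

f = lambda x: 2.0 * x[0]**2 + x[1]**2 - 48.0 * x[0] - 40.0 * x[1]

def f_tilde(x):
    p = project(x)
    return f(p) + np.linalg.norm(x - p)

x = np.array([7.0, 7.0])       # an infeasible point
print(project(x), f_tilde(x))  # the projection of (7, 7) is (4, 4), the red dot of Fig. 1
```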

4 Equivalence of the formulations

In this section, we prove the equivalence between the original constrained problem (1) and the proposed formulation (3) in terms of both local/global minima and stationary points. We also show that by modifying the objective function we do not lose Lipschitz continuity so that if f is (locally) Lipschitz \(\tilde{f}\) is (locally) Lipschitz too. We start with the latter.

Lemma 1

Let f be locally Lipschitz continuous. Then the modified function \(\tilde{f}(x)= f(\pi (x)) + \left\| x-\pi (x) \right\| \) is also locally Lipschitz.

Proof

Let \(x_0 \in \mathbb {R}^n\). Since f is locally Lipschitz there exist \(L_0>0\) and \(\delta _0>0\) such that

$$\begin{aligned} |f(x)-f(x_0)| \le L_0 \left\| x-x_0 \right\| \quad \forall x \in B(x_0, \delta _0). \end{aligned}$$

Now consider \(\tilde{f}\). For every \(x \in B(x_0, \delta _0)\) we have

$$\begin{aligned} |\tilde{f}(x)- \tilde{f}(x_0)|&= |f(\pi (x)) + \left\| x-\pi (x) \right\| - f(\pi (x_0)) - \left\| x_0-\pi (x_0) \right\| |\\&\le |f(\pi (x))-f(\pi (x_0))| + |\left\| x-\pi (x) \right\| -\left\| x_0-\pi (x_0) \right\| |\\&\le L_0 \left\| \pi (x)-\pi (x_0) \right\| + |\left\| x-\pi (x) \right\| -\left\| x_0-\pi (x_0) \right\| |\\&\le L_0 \left\| x-x_0 \right\| + |\left\| x-\pi (x) \right\| -\left\| x_0-\pi (x_0) \right\| |, \end{aligned}$$

where we have used the local Lipschitz continuity of f and the nonexpansiveness of the projection operator. Now, by the triangle inequality we have \(\left\| x_0 -\pi (x_0) \right\| \le \left\| x_0-x \right\| +\left\| x-\pi (x) \right\| +\left\| \pi (x)-\pi (x_0) \right\| \) so that

$$\begin{aligned} \left\| x_0 -\pi (x_0) \right\| -\left\| x-\pi (x) \right\|&\le \left\| x_0-x \right\| +\left\| \pi (x)-\pi (x_0) \right\| \\&\le 2\left\| x_0-x \right\| . \end{aligned}$$

The same reasoning applies to the opposite sign \(-\left\| x_0 -\pi (x_0) \right\| +\left\| x-\pi (x) \right\| \), so that \(|\left\| x-\pi (x) \right\| -\left\| x_0-\pi (x_0) \right\| | \le 2\left\| x_0-x \right\| \) and we can conclude that

$$\begin{aligned} |\tilde{f}(x)- \tilde{f}(x_0)| \le (2+ L_0)\left\| x - x_0 \right\| \quad \forall x \in B(x_0, \delta _0). \end{aligned}$$

\(\square \)

We now consider the relationship between the global and local minima of the two formulations. We first prove, in Propositions 1 and 2, that each global (local) minimum of the original problem is also a global (local) minimum of the proposed formulation.

Proposition 1

Every global minimum of problem (1) is also a global minimum for problem (3).

Proof

Let \(x^*\in X\) be a global minimum for problem (1). Suppose by contradiction that there exists \({\bar{x}} \in \mathbb {R}^n\) such that \(\tilde{f}({\bar{x}}) < \tilde{f}(x^*)\). Then

$$\begin{aligned} \tilde{f}({\bar{x}}) = f(\pi ({\bar{x}})) + \left\| {\bar{x}}-\pi ({\bar{x}}) \right\| < \tilde{f}(x^*) = f(\pi (x^*)) + \left\| x^*-\pi (x^*) \right\| = f(x^*). \end{aligned}$$

Thus, we have found a point \(y=\pi ({\bar{x}})\in X\) s.t.

$$\begin{aligned} f(y)< f(x^*) -\left\| {\bar{x}}-y \right\| < f(x^*), \end{aligned}$$

which is a contradiction. \(\square \)

Proposition 2

Every local minimum of problem (1) is also a local minimum for problem (3).

Proof

Let \(x^*\in X\) be a local minimum for problem (1). Then there exists a ball \(\mathcal {B}(x^*, \rho )\) with \(\rho >0\) s.t.

$$\begin{aligned} f(x^*) \le f(x) \quad \forall x \in \mathcal {B}(x^*, \rho ) \cap X. \end{aligned}$$

Let \({\bar{x}} \in \mathcal {B}(x^*, \rho )\) and suppose by contradiction that \(\tilde{f}({\bar{x}}) < \tilde{f}(x^*)\). Thus we have

$$\begin{aligned} \tilde{f}({\bar{x}}) = f(\pi ({\bar{x}})) + \left\| {\bar{x}}-\pi ({\bar{x}}) \right\| < \tilde{f}(x^*) = f(\pi (x^*)) + \left\| x^*-\pi (x^*) \right\| = f(x^*). \end{aligned}$$

Let \(y=\pi ({\bar{x}})\in X\). We have, by the properties of the projection operator that

$$\begin{aligned} \left\| y-\pi (x^*) \right\| = \left\| y-x^* \right\| \le \left\| {\bar{x}} -x^* \right\| \le \rho , \end{aligned}$$

so that \(y\in \mathcal {B}(x^*, \rho ) \cap X.\) Moreover, it holds that

$$\begin{aligned} f(y)< f(x^*) -\left\| {\bar{x}}-y \right\| < f(x^*), \end{aligned}$$

which is a contradiction. \(\square \)

Furthermore, since f and \(\tilde{f}\) take the same values on X, it holds that any global (local) minimum of the modified problem which belongs to the feasible set is also a global (local) minimum of the original problem.

Proposition 3

Every global (local) minimum \(x\in X\) for problem (3) is also a global (local) minimum of problem (1).

We now show, in Lemma 2, that no minimum point exists outside the feasible region X, so that we have a perfect equivalence between global and local minima in the two formulations, as remarked in Corollary 1.

Lemma 2

Suppose \({\hat{x}} \in \mathbb {R}^n\setminus X\) then \({\hat{x}}\) is not a global or local minimum for problem (3). In particular, \(d=\pi ({\hat{x}})- {\hat{x}}\) is a descent direction for \(\tilde{f}\) at \({\hat{x}}\).

Proof

We will prove the thesis by showing that there exists a descent direction at \({\hat{x}}\) and hence \({\hat{x}}\) cannot be a minimum.

Consider the projection of \({\hat{x}}\) onto the feasible set, \({\bar{x}} = \pi ({\hat{x}})\). Let \(d={\bar{x}}-{\hat{x}}\) and consider a point \({\hat{x}}+ \alpha d\) with \(\alpha \in (0, 1]\). For every \(y\in X\) we have

$$\begin{aligned} \begin{aligned} ({\hat{x}}+\alpha d-{\bar{x}})^T(y-{\bar{x}})&= ({\hat{x}} + \alpha ({\bar{x}}-{\hat{x}})-{\bar{x}})^T(y-{\bar{x}}) \\&= (1-\alpha )({\hat{x}}-{\bar{x}})^T(y-{\bar{x}}) \le 0, \end{aligned} \end{aligned}$$
(4)

where the inequality holds because \({\bar{x}}=\pi ({\hat{x}})\) and \(y\in X\). Since (4) is precisely the variational characterization of the (unique) projection onto a convex set, we conclude that \({\bar{x}} = \pi ({\hat{x}}) = \pi ({\hat{x}}+\alpha d) ~\forall \alpha \in [0, 1]\). Thus we have that

$$\begin{aligned} \tilde{f}({\hat{x}}) - \tilde{f}({\hat{x}}+ \alpha d)&= f({\bar{x}}) + \left\| {\hat{x}}-{\bar{x}} \right\| - f(\pi ({\hat{x}} + \alpha d)) - \left\| {\hat{x}}+ \alpha d - \pi ( {\hat{x}} + \alpha d) \right\| \\&= f({\bar{x}}) +\left\| {\hat{x}} - {\bar{x}} \right\| - f({\bar{x}}) - \left\| {\hat{x}}+\alpha ({\bar{x}}-{\hat{x}}) - {\bar{x}} \right\| \\&= \left\| {\hat{x}}-{\bar{x}} \right\| - (1-\alpha )\left\| {\hat{x}}-{\bar{x}} \right\| \\&= \alpha \left\| {\hat{x}}-{\bar{x}} \right\| \\&>0, \end{aligned}$$

so that d is a descent direction at \({\hat{x}}\). \(\square \)
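The chain of equalities in the proof gives the exact identity \(\tilde{f}({\hat{x}})-\tilde{f}({\hat{x}}+\alpha d)=\alpha \left\| {\hat{x}}-{\bar{x}} \right\| \) for \(\alpha \in (0,1]\), which is easy to verify numerically. A small sanity check follows (our own illustration, taking the unit ball as feasible set and an arbitrary nonsmooth f).

```python
import numpy as np

# Feasible set: the Euclidean unit ball, whose projection is a radial shrink.
def project(x):
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

f = lambda x: np.abs(x[0]) + 0.5 * np.abs(x[1])   # any locally Lipschitz f works here

def f_tilde(x):
    p = project(x)
    return f(p) + np.linalg.norm(x - p)

x_hat = np.array([2.0, 2.0])                      # infeasible point
d = project(x_hat) - x_hat                        # the direction of Lemma 2
for alpha in (0.25, 0.5, 1.0):
    lhs = f_tilde(x_hat) - f_tilde(x_hat + alpha * d)
    rhs = alpha * np.linalg.norm(d)
    print(alpha, lhs, rhs)                        # lhs equals rhs up to rounding
```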

By putting together Propositions 1, 2 and 3 and Lemma 2, we establish the perfect equivalence of the two formulations in terms of local and global minima, as expressed by the following.

Corollary 1

Every global (local) minimum of problem (1) is also a global (local) minimum of problem (3), and vice versa.

To conclude the discussion, we investigate the relationship between stationary points. In particular, it is interesting to check whether Clarke-stationary points of the modified problem are also Clarke-stationary for the original problem.

We are able to prove the latter under the following assumption.

Assumption 1

We assume that X is such that

$$\begin{aligned} \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} {\left\| \frac{\pi (y+td)-\pi (y)}{t}-d \right\| } = 0, \end{aligned}$$

for every hyper-tangent direction \(d\in T_X^H({\bar{x}})\) and every feasible point \({\bar{x}}\in X\).

We start by showing that any Clarke-stationary point of the modified problem must belong to the feasible set.

Lemma 3

Let \({\bar{x}}\) be a Clarke-stationary point of problem (3). Then \({\bar{x}} \in X\).

Proof

Since \({\bar{x}}\) is Clarke-stationary we have, by definition, that

$$\begin{aligned} \tilde{f}^{\circ }({\bar{x}}; d) = \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{\tilde{f}(y+td)-\tilde{f}(y)}{t}\ge 0, \quad \forall d\in \mathbb {R}^n. \end{aligned}$$
(5)

In particular, the latter must also hold for the direction \(d=\pi ({\bar{x}})-{\bar{x}}\).

Now, let us suppose by contradiction that \({\bar{x}}\not \in X\). Letting \({\hat{d}}=\pi (y)-y\) we can write

$$\begin{aligned} \frac{\tilde{f}(y+td)-\tilde{f}(y)}{t}&= \frac{\tilde{f}(y+t{\hat{d}})-\tilde{f}(y)}{t} + \frac{\tilde{f}(y+td)-\tilde{f}(y+t{\hat{d}})}{t}\\&\le -\left\| \pi (y)-y \right\| + \tilde{L} \left\| d-{\hat{d}} \right\| , \end{aligned}$$

where we have used that \({\hat{d}}\) is a descent direction for \(\tilde{f}\) at y (Lemma 2) for the first term, and that \(\tilde{f}\) is locally Lipschitz continuous with constant \(\tilde{L}\) (Lemma 1) for the second one.

Now, as \(y \rightarrow {\bar{x}}\), we have that \(\left\| d-{\hat{d}} \right\| = \left\| \pi ({\bar{x}})-{\bar{x}}-\pi (y)+y \right\| \rightarrow 0\) and that \(\left\| \pi (y)-y \right\| \rightarrow \left\| \pi ({\bar{x}})-{\bar{x}} \right\| = M>0\), since \({\bar{x}}\not \in X\). Thus we have that

$$\begin{aligned} \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0}\frac{\tilde{f}(y+td)-\tilde{f}(y)}{t} \le -M < 0, \end{aligned}$$

which contradicts (5).

\(\square \)

Proposition 4

Let \({\bar{x}}\) be a Clarke-stationary point of problem (3). Then, under Assumption 1, \({\bar{x}}\) is also a Clarke-stationary point of problem (1).

Proof

Let \({\bar{x}}\) be a Clarke-stationary point for problem (3). By Lemma 3 it must be that \({\bar{x}}\in X\). Then from Definition 3 we have

$$\begin{aligned} \tilde{f}^{\circ }({\bar{x}}; d) \ge 0, \end{aligned}$$

for every hyper-tangent direction \(d\in T_X^H({\bar{x}})\).

We can calculate

$$\begin{aligned} 0&\le \tilde{f}^{\circ }({\bar{x}}; d) = \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{\tilde{f}(y+td)-\tilde{f}(y)}{t}\\&\le \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{f\left( \pi (y+td)\right) -f\left( \pi (y)\right) }{t} + \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{d_X(y+td)-d_X(y)}{t} \end{aligned}$$

Since \(d_X\) is a convex function we have, by Theorem 1, that \(d_X^\circ = d_X^\prime \). Moreover, since d is hyper-tangent to X at \({\bar{x}}\in X\), we have \({\bar{x}}+td\in X\), and hence \(d_X({\bar{x}}+td)=0\), for all sufficiently small \(t>0\). Thus

$$\begin{aligned} \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{d_X(y+td)-d_X(y)}{t} = \lim _{t\rightarrow 0} \frac{d_X({\bar{x}}+td)-d_X({\bar{x}})}{t} = 0. \end{aligned}$$

Hence

$$\begin{aligned} 0&\le \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{f\left( \pi (y+td)\right) -f\left( \pi (y)\right) }{t}\\&\le \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{f\left( \pi (y)+td\right) -f\left( \pi (y)\right) }{t} + \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{f\left( \pi (y+td)\right) -f\left( \pi (y)+td\right) }{t}\\&\le \limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0, ~y\in X} \frac{f\left( y+td\right) -f\left( y\right) }{t} + L\limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} \frac{\left\| \pi (y+td)-\pi (y)-td \right\| }{t}\\&\le f^\circ ({\bar{x}}; d) + L\limsup _{y\rightarrow {\bar{x}}, ~t\rightarrow 0} {\left\| \frac{\pi (y+td)-\pi (y)}{t}-d \right\| }. \end{aligned}$$

where L is a local Lipschitz constant of f around \({\bar{x}}\). Thus, because of Assumption 1, we conclude that

$$\begin{aligned} f^\circ ({\bar{x}}; d) \ge 0. \end{aligned}$$

\(\square \)

5 Numerical experiments

In the following, we propose some numerical experiments to investigate the advantages of the proposed formulation, comparing it against the exact penalty approach of [10]. To make a fair comparison we used the same algorithm, namely the CS-DFN algorithm proposed in [10], to solve both formulations. In the following, we call a solver the combination of an algorithm and a particular formulation of a given problem. So we compare the exact penalty solver, i.e. the CS-DFN algorithm applied to the penalized formulation, and the Projection-based Penalty Method (PPM) solver, i.e. the CS-DFN algorithm applied to the proposed formulation.

Test problems We set up a benchmark composed of 28 problems belonging to different classes: general nonlinear functions subject to (1) non-degenerate linear constraints from the collection [12, 26]; (2) degenerate linear constraints from [2]; (3) general convex constraints, again from [12, 26]; and (4) minmax programs under linear constraints from [23]. The problems are listed in Table 1.

Table 1 Benchmark problems details

Performance metric To compare the results we employ data profiles. Data profiles for benchmarking derivative-free algorithms were proposed in [24]; they address the case of unconstrained problems. Consider a test set \(\mathcal {P}\) of problems. We fix a tolerance parameter \(\tau \) and say that a problem has been solved by a solver if

$$\begin{aligned} f(x)-f_L \le \tau \left( f(x_0)-f_L\right) , \end{aligned}$$
(6)

where \(f_L\) is an accurate estimate of the minimum of the problem (in our benchmark we can set \(f_L=f^*\) since \(f^*\) is available). Note that, here, function values refer to the original problem even if the solver employs a different formulation.

Then for each solver s we define its data profile as

$$\begin{aligned} d_s(\alpha ) = \frac{1}{|\mathcal {P}|}\left| \left\{ p\in \mathcal {P}~s.t.~ \frac{nf(\tau )}{n_p+1}\le \alpha \right\} \right| , \end{aligned}$$

where \(n_p\) is the number of variables of problem p and \(nf(\tau )\) is the number of function evaluations needed by solver s on problem p to satisfy the convergence criterion (6).

Data profiles are extracted for different values of \(\tau \) to compare the solvers against different balances of speed versus accuracy.
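To fix ideas, the following sketch (our own, operating on fabricated run histories purely for illustration) computes \(d_s(\alpha )\) for one solver from the best objective value recorded after each function evaluation, using criterion (6).

```python
import numpy as np

def evals_to_solve(history, f0, f_L, tau):
    """Smallest number of evaluations after which criterion (6) holds,
    i.e. best-f-so-far - f_L <= tau * (f0 - f_L); np.inf if it never does."""
    target = f_L + tau * (f0 - f_L)
    hits = np.nonzero(np.asarray(history) <= target)[0]
    return hits[0] + 1 if hits.size else np.inf

def data_profile(runs, tau, alphas):
    """runs: one record per problem for a given solver, each a dict with keys
    'history' (best objective value after each evaluation), 'f0', 'f_L', 'n'."""
    ratios = np.array([evals_to_solve(r['history'], r['f0'], r['f_L'], tau) / (r['n'] + 1)
                       for r in runs])
    return [float(np.mean(ratios <= a)) for a in alphas]

# Hypothetical usage on two toy problems with fabricated histories:
runs = [
    {'history': [10.0, 4.0, 1.5, 1.01, 1.001], 'f0': 10.0, 'f_L': 1.0, 'n': 2},
    {'history': [5.0, 4.9, 4.9, 4.9, 4.9],     'f0': 5.0,  'f_L': 0.0, 'n': 3},
]
print(data_profile(runs, tau=1e-1, alphas=[1, 2, 5]))  # fraction of problems solved within each alpha
```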

In the constrained case, however, the proposed scheme is not readily applicable. We propose to modify condition (6) by considering also the constraint violation as follows:

$$\begin{aligned} (1-\beta )(f(x)-f_L) + \beta \left\| g_+(x) \right\| \le \tau (1-\beta )(f(x_0)-f_L) + \tau \beta \left\| g_+(x_0) \right\| , \end{aligned}$$
(7)

where \(\beta \in (0, 1)\) is a new parameter which balances function value and constraint violation, and \(g_+(x)\) denotes the vector of constraint violations with components \(\max \{0,g_i(x)\}\).

Note that, by violating the constraints, the function values can be lower than \(f^*\). Naturally, by choosing a high value for \(\beta \), this situation can be arbitrarily penalized as long as f does not go to \(-\infty \); these cases must be removed before computing the profiles.
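Criterion (7) is then a drop-in replacement for the convergence test used in the profile computation; a possible sketch (again our own illustration), where the violation terms are the norms \(\left\| g_+(\cdot ) \right\| \):

```python
import numpy as np

def solved_constrained(f_x, f_x0, f_L, viol_x, viol_x0, tau, beta):
    """Convergence test (7): a convex combination (weight beta) of the optimality
    gap and the constraint-violation norm, both measured against the start point."""
    lhs = (1.0 - beta) * (f_x - f_L) + beta * viol_x
    rhs = tau * ((1.0 - beta) * (f_x0 - f_L) + beta * viol_x0)
    return lhs <= rhs

# For constraints written as g(x) <= 0, the violation norm is
# np.linalg.norm(np.maximum(0.0, g(x))).
print(solved_constrained(1.0, 10.0, 0.0, 0.0, 2.0, tau=1e-1, beta=0.99))  # True
```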

We extract different data profiles for different values of both \(\tau \) and \(\beta \). In particular, we extract the curves for \(\tau =10^{-k}\) with \(k\in \{1, 3, 5, 7\}\) and \(\beta \in \{0.9, 0.99\}\).

Solver details As already mentioned, for both solvers we employ the CS-DFN algorithm proposed in [10]. We report the pseudo-code of the method in Algorithm 1.

Algorithm 1 CS-DFN (pseudo-code)

We set \(\delta =0.5, ~\gamma =10^{-6}, \eta =10^{-6}\) for Algorithm 1. The dense sequence of directions \(\{d_k\}\) was obtained with the implementation available at [25] of the Sobol quasi-random sequence generation proposed in [27].
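We do not reproduce Algorithm 1 here; the following heavily simplified sketch (our own, not the authors' implementation) only conveys the structure we rely on: a derivative-free line search with sufficient decrease \(\gamma \alpha ^2\) along the 2n coordinate directions, each with its own tentative step, plus a search along a Sobol-generated dense direction when all coordinate steps fail. The expansion rule, stopping test and names are our simplifications, and SciPy's Sobol generator stands in for the implementation at [25].

```python
import numpy as np
from scipy.stats import qmc   # Sobol sequence; stands in for the generator of [25]

def dense_directions(n, num, seed=0):
    """Unit-norm directions obtained from a scrambled Sobol sequence in [-1, 1]^n."""
    pts = 2.0 * qmc.Sobol(d=n, scramble=True, seed=seed).random(num) - 1.0
    pts[np.all(pts == 0.0, axis=1)] = 1.0     # guard against a zero row
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)

def line_search(f, x, d, alpha, gamma):
    """Accept alpha if it yields the sufficient decrease gamma * alpha**2,
    then keep doubling the step while the decrease persists."""
    fx = f(x)
    if f(x + alpha * d) > fx - gamma * alpha ** 2:
        return 0.0
    while f(x + 2.0 * alpha * d) <= fx - gamma * (2.0 * alpha) ** 2:
        alpha *= 2.0
    return alpha

def csdfn_like(f, x0, max_evals=2000, delta=0.5, gamma=1e-6, eta=1e-6, seed=0):
    n = len(x0)
    x = np.asarray(x0, dtype=float)
    budget = [max_evals]
    def fc(z):                                 # budget-counting wrapper
        budget[0] -= 1
        return f(z)
    alphas = np.ones(2 * n)                    # one tentative step per coordinate direction
    D = np.vstack([np.eye(n), -np.eye(n)])     # +/- coordinate directions
    dense = iter(dense_directions(n, 2048, seed))
    while budget[0] > 0 and alphas.max() > eta:
        moved = False
        for i in range(2 * n):                 # coordinate phase
            step = line_search(fc, x, D[i], alphas[i], gamma)
            if step > 0.0:
                x, alphas[i], moved = x + step * D[i], step, True
            else:
                alphas[i] *= delta             # shrink the tentative step
        if not moved:                          # dense-direction phase
            d = next(dense)
            step = line_search(fc, x, d, alphas.max(), gamma)
            if step > 0.0:
                x = x + step * d
    return x

# e.g. x_best = csdfn_like(f_tilde, np.array([3.0, 1.0]))  # f_tilde built as in Sect. 3
```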

For the exact penalty solver we employ the adaptive strategy for tuning \(\epsilon \) proposed in [10], which is deemed to be the better choice.

The implementation of the algorithms, alongside the code needed to reproduce all the following experiments, is available as Python code at https://github.com/jth3galv/dfppm.

5.1 A first comparison

We start by comparing the PPM solver against the exact penalty solver. We let both solvers run for up to \(10^4\) function evaluations and then extract the data profiles, which are reported in Fig. 2. From the plots we can see that the PPM solver enjoys generally better performance both in terms of speed and robustness (number of problems eventually solved). We note, however, that neither solver manages to solve more than \(65\%\) of the test problems within the budget of function evaluations when a relatively high precision (\(\tau =10^{-7}\)) is required.

Fig. 2

Data profiles for the PPM solver and the exact penalty

5.2 A parametrization

In this section, we introduce a scale factor \(\epsilon \) that controls how much points outside the feasible region are penalized. Namely, we modify our formulation as

$$\begin{aligned} \min _x f(\pi (x)) + \epsilon \left\| x-\pi (x) \right\| . \end{aligned}$$

To understand the effect of \(\epsilon \) we start with a qualitative analysis. In Fig. 3 we show the iterates of the algorithm when run on the same problem with different values of \(\epsilon \).

Fig. 3

Iterates (in red) of CS-DFN for \(\epsilon =0.1\) (left) and \(\epsilon =10\) (right) for problem HS224. The feasible set is in gray, the initial point in green. Each dot is obtained after a search along the 2n coordinate directions and possibly a direction from the dense sequence

From Fig. 3 we can see that the iterates for larger values of \(\epsilon \) stay closer to the feasible set, while for small values of \(\epsilon \) the algorithm is allowed to move farther from it.

To understand whether staying closer to the feasible set has a good or bad effect on the overall optimization process, we measure the performance of the solver for different values of \(\epsilon \). Namely, we try \(\epsilon \in \{0.1, 1, 10, 100\}\). We use the same setup as in the previous experiments. In Fig. 4 we give the data profiles for the different solvers (we also include a configuration \((\epsilon =10, ~\sigma =2)\), which is explained later in the manuscript).

Fig. 4

Data profiles for different values of the parameter \(\epsilon \)

From Fig. 4 we can see that choosing a high value of \(\epsilon \) yields good performance when low precision is required, although performance quickly degrades when higher precision (smaller \(\tau \)) is demanded. For instance, for \(\tau =10^{-7}, ~\beta =0.99\), the algorithm with \(\epsilon =100\) manages to satisfy the convergence criterion only in roughly \(40\%\) of the test problems. The opposite is true for small values of \(\epsilon \): the algorithm is generally slower but can achieve very good solutions if a greater number of function evaluations is allowed.

It is, thus, natural to ask if employing an adaptive strategy for \(\epsilon \) may be advantageous. For example, one could start with \(\epsilon \) set to a large number and gradually decrease it to get to accurate solutions.

Coming up with a good schedule for \(\epsilon \) that works well for all problems can be a hard task. However, the CS-DFN algorithm offers a good indicator of the regime in which it is working: the length of the steps it takes at each iteration. We thus propose to set the value of \(\epsilon \) as a function of the step length \(\alpha _k\). More precisely, we set

$$\begin{aligned} \epsilon _{k+1} = \sigma \alpha _k. \end{aligned}$$
(8)

In this way, at the beginning of the algorithm we can start with a large value \(\epsilon _0\) and then let it decrease as the algorithm takes smaller steps.

We found, by manual tuning, that good performance can be obtained by setting \(\epsilon _0=10, \sigma =2\), although other configurations perform similarly well. As we can see, again in Fig. 4, this configuration performs almost equally well when low or high precision is required. Moreover, notice that when high precision is required we go from less than \(70\%\) of solved problems to more than \(90\%\).
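In code, update (8) amounts to recomputing the scale factor from the step length once per iteration; the sketch below (our own, with a fabricated sequence of step lengths standing in for the \(\alpha _k\) produced by CS-DFN) only shows where the update is applied.

```python
import numpy as np

def make_ppm_objective(f, project, epsilon):
    """Parameterized reformulation: x -> f(pi(x)) + epsilon * ||x - pi(x)||."""
    def obj(x):
        p = project(x)
        return f(p) + epsilon * np.linalg.norm(x - p)
    return obj

f = lambda x: abs(x[0] - 7.0) + abs(x[1] - 3.0)   # toy nonsmooth objective
project = lambda x: np.clip(x, 0.0, 6.0)          # toy feasible set: the box [0, 6]^2

epsilon, sigma = 10.0, 2.0                        # epsilon_0 = 10, sigma = 2 (tuned values)
for alpha_k in [1.0, 0.5, 0.5, 0.25, 0.125]:      # fabricated step lengths, for illustration only
    obj = make_ppm_objective(f, project, epsilon)
    # ... one CS-DFN iteration using `obj` would run here, yielding the step length alpha_k ...
    epsilon = sigma * alpha_k                     # update (8)
    print(f"alpha_k = {alpha_k:5.3f} -> epsilon = {epsilon:6.3f}")
```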

5.3 Final comparison

To end the discussion, we propose a final comparison of the PPM solver, in its parameterized version equipped with the adaptive strategy for tuning \(\epsilon \), against the exact penalty approach. The results are reported in Fig. 5.

Fig. 5

Data profiles for the PPM and exact penalty solvers

We can see that the PPM solver enjoys better performance for every threshold of accuracy, although the exact penalty can be faster on some problems when a relatively low precision is required. We note, furthermore, that the exact penalty fails to reach accurate solutions for a large portion of the test problems, whereas, as already noticed, the PPM solver manages to solve more than \(90\%\) of the problems even when high precision is required. We also report, for completeness, in Table 2 the distance of the objective function from the optimum value and the constraint violation after the total budget of function evaluations has been used.

Table 2 Details for the PPM and exact penalty solvers after the maximum budget of function evaluations has been used: \(\varDelta f^*\) is the difference between the final objective function value and the minimum value \(f^*\), and \(\Vert g_+\Vert \) is the norm of the constraint violations

6 Conclusion

In this work we proposed to rewrite a nonsmooth optimization problem subject to convex constraints as an unconstrained, parameter-free problem. This formulation is proven to be equivalent to the original problem in terms of global and local minima. Furthermore, we were able to prove, under suitable assumptions, that any Clarke-stationary point of the proposed formulation is also a Clarke-stationary point of the original problem. The formulation can be solved by any algorithm for nonsmooth optimization.

We compared the proposed formulation against a state-of-the-art approach for constrained nonsmooth optimization, namely the exact penalty method. We used the same solver, which has been shown to deliver state-of-the-art performance on the penalized problem, to solve the proposed formulation. The results clearly show the advantages of the proposed formulation.

Future work will be devoted to (1) handling a mix of convex and nonconvex constraints, by combining the proposed formulation with an exact penalty to deal with the nonconvex part of the constraints, and (2) studying cases where the projection operation is expensive, so that a truncated projection must be employed.