1 Introduction

We present and analyze a new generalized Frank–Wolfe method [8, 14, 18,19,20, 24, 39] for the following composite optimization problem:

$$\begin{aligned} (P): \ \ \ F^*:= {\min }_{x\in {\mathbb {R}}^n} \;[F(x):= f(\mathsf {A}x) + h(x)] \end{aligned}$$
(1.1)

where \(f: {\mathbb {R}}^m\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is an extended real-valued convex function that is differentiable on its domain \(\mathsf {dom}\,f:=\{u \in \mathbb {R}^m : f(u) < +\infty \}\), \(\mathsf {A}:{\mathbb {R}}^{n}\rightarrow {\mathbb {R}}^{m}\) is a linear operator (though not necessarily invertible or surjective) and the function \(h:{\mathbb {R}}^n\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is proper, closed and convex (but possibly non-smooth), for which \(\mathsf {dom}\,h\) is a nonempty compact convex set. Furthermore, and in contrast to the standard setting where f is assumed to be L-smooth on \(\mathsf {dom}\,F\) (i.e., its gradient is L-Lipschitz on \(\mathsf {dom}\,F\)), our focus is on the setting where f belongs to a particularly special and important class of functions that arise in practice, namely \(\theta \)-logarithmically-homogeneous self-concordant barrier functions (whose definition and properties will be reviewed below). For convenience we distinguish between \(u \mapsto f(u)\) and \(x \mapsto f(\mathsf {A}x)\) by defining

$$\begin{aligned} \bar{f}(x):=f(\mathsf {A}x) \ . \end{aligned}$$
(1.2)

The Frank–Wolfe method was developed in 1956 in the seminal paper of Frank and Wolfe [24] for the case \(h = \iota _{\mathcal {X}}\), where \(\iota _{\mathcal {X}}\) denotes the indicator function of \(\mathcal {X}\) (i.e., \(\iota _{\mathcal {X}}(x) = 0\) for \(x\in \mathcal {X}\) and \(+\infty \) otherwise), and \(\mathcal {X}\) is a (bounded) polytope. (In particular, \((P)\) then is the constrained problem \(\min _{x\in \mathcal {X}}\, \bar{f}(x)\).) The Frank–Wolfe method was a significant focus of research up through approximately 1980, during which time it was generalized to handle more general compact sets \(\mathcal {X}\), see e.g., [14, 18,19,20]. Each iteration of the Frank–Wolfe method computes the minimizer of the first-order (gradient) approximation of \(\bar{f}(x)\) on \(\mathcal {X}\), and constructs the next iterate by moving towards the minimizer. Just in the last decade, and due to the importance of optimization in modern computational statistics and machine learning, the Frank–Wolfe method has again attracted significant interest (see [25,26,27, 31, 34, 46] among many others), for at least two reasons. First, in many modern application settings, \(\mathcal {X}\) is a computationally “simple” set for which the linear optimization sub-problem in the Frank–Wolfe method is easier to solve compared to the standard projection sub-problems required by other first-order methods, see e.g., [45]. Secondly, the Frank–Wolfe method naturally produces “structured” (such as sparse, or low-rank) solutions in several important settings, which is very useful in the high-dimensional regime in machine learning. This is because each iterate of the Frank–Wolfe method is a convex combination of all the previous linear optimization sub-problem solutions, and hence if the extreme points of \(\mathcal {X}\) are unit coordinate vectors (for example), then the k-th iterate will have at most k non-zeroes. This statement can be made more precise in specific settings, and also generalizes to the matrix setting where \(\mathcal {X}\) is the nuclear norm ball [31]—in this setting the k-th (matrix) iterate will have rank at most k.

More recently, the Frank–Wolfe method has been generalized to the composite setting, where the function h is a general convex non-smooth function with compact domain \(\mathcal {X}\), see e.g., [4, 27, 46]. In this generalized framework, the sub-problem solved at iteration k is

$$\begin{aligned} {{\,\mathrm{minimize}\,}}_{x \in {\mathbb {R}}^n}\; \langle \nabla \bar{f}(x^k), x \rangle + h(x) \ , \end{aligned}$$
(1.3)

which specializes to the standard Frank–Wolfe sub-problem in the case when \(h = \iota _\mathcal {X}\). In certain situations, this minimization problem admits (relatively) easily computable solutions despite the presence of the non-smooth function h. For example, if \(h= \bar{h}+ \iota _\mathcal {P}\), where \(\bar{h}\) is a polyhedral function and \(\mathcal {P}\) is a polytope, then (1.3) can be reformulated as a linear optimization problem (LP), which can be solved efficiently if it has moderate size or a special structure, e.g., network flow structure [31]. For more such examples we refer the reader to [46].

In addition, there has been recent research work on using the Frank–Wolfe method to solve the projection sub-problems (which are constrained quadratic problems) that arise in various optimization algorithms. For example, [40] presents a projected Newton method for solving a class of problems that is somewhat different from (but related to) (1.1); specifically [40] assumes that the linear operator \(\mathsf {A}\) is invertible and the function f is self-concordant but is not necessarily a logarithmically-homogeneous barrier. The Frank–Wolfe method is used therein to solve each projection sub-problem in the projected Newton method, and [40] shows that the total number of linear minimization sub-problems needed is \(O(\varepsilon ^{-(1+o(1))})\). Another such example is in [17, Section 5], which develops an affine-invariant trust-region type of method for solving a class of convex composite optimization problems in a similar form as (1.1), with the key difference being that in [17] f is assumed to be twice differentiable with Lipschitz Hessian on \(\mathsf {dom}\,h\). The Frank–Wolfe method is used in [17] to solve each projection sub-problem, wherein it is shown that the total number of linear minimization sub-problems needed is \(O(\varepsilon ^{-1})\).

When analyzing the convergence of the standard or generalized Frank–Wolfe method, almost all such analyses rely on the L-smooth assumption of the function f. Perhaps accidentally, the first specific attempt to extend the Frank–Wolfe method and analysis beyond the case of L-smooth functions is due to Khachiyan [36]. In the specific case of the D-optimal design problem [23], Khachiyan [36] developed a “barycentric coordinate descent” method with an elegant computational complexity analysis, and it turns out that this method is none other than the Frank–Wolfe method with exact line-search [56, 61]. Khachiyan’s proof of his complexity result (essentially \(O(n^2/\varepsilon )\) iterations) used clever arguments that do not easily carry over elsewhere, and hence begged the question of whether or not any of the arguments in [36] might underly any general themes beyond D-optimal design, and if so what might be the mathematical structures driving any such themes? In this work, we provide affirmative answers to these questions, by considering the D-optimal design problem as a special instance of the broader class of composite optimization problem (P). The second attempt was the recent paper of Dvurechensky et al. [21], which presented and analyzed an adaptive step-size Frank–Wolfe method for tackling the problem \(\min _{x \in \mathcal {X}} \bar{f}(x)\) where \({\bar{f}}\) is assumed to be a non-degenerate (i.e., with positive-definite Hessians) self-concordant function and \(\mathcal {X}\) is a compact convex set, and was the first paper to study the Frank–Wolfe method for these special functions. The set-up in [21] can be seen as an instance of \((P)\) with h being the indicator function of \(\mathcal {X}\), namely, \(h=\iota _\mathcal {X}\), and the additional assumption that \({\bar{f}}\) is non-degenerate, which we do not need. (Note that in our setting, this amounts to assuming that \(\mathsf {A}= \mathsf {I}\), namely the identity operator or that the linear operator \(\mathsf {A}\) is invertible.) However, unlike [21], we additionally assume that f is \(\theta \)-logarithmically homogeneous. As our analysis will show, this last property — which holds true for all applications that we are aware of — is the key property that leads to relatively simple and natural computational guarantees for the Frank–Wolfe method in this expanded relevant setting.

Let us now review the formal definition of a \(\theta \)-logarithmically-homogeneous self-concordant barrier function. Let \(\mathcal {K}\subsetneqq {\mathbb {R}}^m\) be a regular cone, i.e., \(\mathcal {K}\) is closed, convex, pointed (\(\mathcal {K}\) contains no line), and has nonempty interior (\(\mathsf {int}\,\mathcal {K}\ne \emptyset \)). We say that f is a \(\theta \)-logarithmically-homogeneous (non-degenerate) self-concordant barrier on \(\mathcal {K}\) for some \(\theta \ge 1\) and we write “\(f \in \mathcal {B}_\theta (\mathcal {K})\)”, if f is three-times differentiable and strictly convex on \(\mathsf {int}\,\mathcal {K}\) and satisfies the following three properties:

  1. (P1)

    \(\vert D^3f(u)[w,w,w]\vert \le 2(\langle {H(u)w},{w}\rangle )^{3/2}\)\(\forall \,u\in \mathsf {int}\,\mathcal {K}\), \(\forall \,w\in {\mathbb {R}}^m\),

  2. (P2)

    \(f(u_k)\rightarrow \infty \) for any \(\{u_k\}_{k\ge 1}\subseteq \mathsf {int}\,\mathcal {K}\) such that \(u_k\rightarrow u\in \mathsf {bd}\,\mathcal {K}\), and

  3. (P3)

    \(f(tu) = f(u) - \theta \ln (t)\)\(\forall \,u\in \mathsf {int}\,\mathcal {K}\), \(\forall \,t>0\) ,

where H(u) denotes the Hessian of f at \(u\in \mathsf {int}\,\mathcal {K}\). For details on these properties, we refer readers to Nesterov and Nemirovski [47, Section 2.3.3] and Renegar [51, Section 2.3.5]. Properties (P1) and (P2) correspond to f being a (standard, strongly) self-concordant function on \(\mathsf {int}\,\mathcal {K}\) (cf. [47, Remark 2.1.1]), and property (P3) corresponds to f being a \(\theta \)-logarithmically-homogeneous barrier function on \(\mathcal {K}\). Here \(\theta \) is called the complexity parameter of f in the terminology of Renegar [51]. The two prototypical examples of such functions are (i) \(-\ln \det (U)\) for \(U \in \mathcal {K}:= \mathbb {S}_{+}^{k}\) and \(\theta = k\), and (ii) \(-\sum _{j=1}^m w_j \ln (u_j)\) for \(u \in \mathcal {K}:= \mathbb {R}_{+}^{m}\) and \(\theta = \sum _{j=1}^m w_j\) where \(w_1, \ldots , w_n \ge 1\), see [47, 51].

We now present some application examples of \((P)\) where \(f \in \mathcal {B}_\theta (\mathcal {K})\), including the aforementioned D-optimal design problem.

1.1 Applications

1. Poisson image de-blurring with total variation (TV) regularization [10, 15, 33]. Let the \(m\times n\) matrix X be the true representation of an image, such that each entry \(X_{ij}\ge 0\) represents the intensity of the pixel at location \((i,j)\in [m]\times [n]\), and \(X_{ij}\in \{0,1,\ldots ,M\}\), where \(M:=2^b-1\) for b-bit images. In many applications, ranging from microscopy to astronomy, we observe a blurred image contaminated by Poisson noise, which we denote by Y, and we wish to estimate the true image X from Y. The generative model of Y from X is presumed to be as follows. Let \(\mathsf {A}:{\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{m\times n}\) denote the 2D discrete convolutional (linear) operator with periodic boundary conditions, which is assumed to be known. This convolutional operator is defined by a \(p\times p\) 2D convolutional kernel with a size \(q:=p^2\) that is typically much smaller than the size of image \(N:= mn\). (For an illustration of the 2D convolution, see [32] for example.) The blurred image \(\widetilde{Y}\) is obtained by passing X through \(\mathsf {A}\), i.e., \(\widetilde{Y}:= \mathsf {A}(X)\), and the observed image Y results from adding independent entry-wise Poisson noise to \(\widetilde{Y}\), i.e., \( Y_{ij}\sim \mathsf{Poiss}(\widetilde{Y}_{ij})\), for all \((i,j)\in [m]\times [n]\), and \(\{Y_{ij}\}_{(i,j)\in [m]\times [n]}\) are assumed to be independent.

It will be preferable to work with vectors in addition to matrices, whereby we equivalently describe the above model using vector notation as follows. We denote \(X = [x_1 \;\cdots \; x_m]^\top \), where \(x_i^\top \) denotes the i-th row of X, and let \(\mathsf{vec}:{\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}^{mn}\) denote the vectorization operator that sequentially concatenates X into the column vector \( \mathsf{vec}(X) := x :=[x_1^\top \cdots ~ x_m^\top ]^\top \), and let \(\mathsf{mat}(x)\) denote the inverse operator of \(\mathsf{vec}\), so that \(\mathsf{mat}(x) = X\). Define \(y := \mathsf{vec}(Y)\) and \(\widetilde{y}:= \mathsf{vec}(\widetilde{Y})\). In addition, we represent \(\mathsf {A}\) in its matrix form \(A\in {\mathbb {R}}^{N\times N}\) (recall \(N := m n\)), such that \(\widetilde{y}:= A x\). Furthermore, let us represent \(A:= [a_1\;\ldots \;a_N]^\top \), where \(a_l^\top \) denotes the l-th row of A for \(l \in [N]\). Note that A is a sparse doubly-block-circulant matrix, such that each row \(a_l^\top \) of A has at most q non-zeros, where \(q\ll N\) denotes the size of the 2D convolution kernel. Finally, we have \(y_{l}\sim \mathsf{Poiss}(\widetilde{y}_{l})\) for all \(l\in [N]\), and \(\{y_l\}_{l\in [N]}\) are independent.

The maximum likelihood (ML) estimator of X from the observed image Y is the optimal solution of the following optimization problem:

$$\begin{aligned} \min _{x\in {\mathbb {R}}^{N}} \;\; -\sum _{l=1}^{N} y_l\ln (a_l^{\top } x) + (\sum _{l=1}^{N} a_l)^{\top } x \quad {{\,\mathrm{s.t.}\,}}\;\; 0\le x\le Me \ , \end{aligned}$$
(1.4)

where e denotes the vector with all entries equal to one. In addition, following [52], in order to recover a smooth image with sharp edges, we add Total Variation (TV) regularization to the objective function in (1.4), which yields the following regularized ML estimation problem:

$$\begin{aligned} {\min }_{x\in {\mathbb {R}}^N}&\;\; \bar{F}(x):= -\textstyle \sum _{l=1}^{N} y_l\ln (a_l^\top x) + (\sum _{l=1}^{N} a_l)^\top x + \lambda \mathrm{TV}(x)\quad \nonumber \\ {{\,\mathrm{s.t.}\,}}&\;\; 0\le x\le Me \ , \end{aligned}$$
(1.5)

where

$$\begin{aligned} \mathrm{TV}(x)&:= \textstyle \sum _{i=1}^m \sum _{j=1}^{n-1} \left|[\mathsf{mat}(x)]_{i,j} - [\mathsf{mat}(x)]_{i,j+1}\right|\\&\qquad + \textstyle \sum _{i=1}^{m-1} \sum _{j=1}^{n} \left|[\mathsf{mat}(x)]_{i,j} - [\mathsf{mat}(x)]_{i+1,j}\right| \\&:= \textstyle \sum _{i=1}^m \sum _{j=1}^{n-1} \left|X_{i,j} - X_{i,j+1}\right| + \textstyle \sum _{i=1}^{m-1} \sum _{j=1}^{n} \left|X_{i,j} - X_{i+1,j}\right| \end{aligned}$$

is a standard formulation of the total variation. Here we see that (1.5) is an instance of \((P)\) with \(f(u) := -\textstyle \sum _{l=1}^N y_l\ln \big (u_l)\), \(\mathcal {K}:= {\mathbb {R}}^N_+\), \(h(x) := (\sum _{l=1}^{N} a_l)^\top x + \lambda \mathrm{TV}(x) + \iota _{\mathcal {X}}\) where \(\mathcal {X}= \{ x \in {\mathbb {R}}^N : 0 \le x \le Me \}\), \(\mathsf {A}\) is defined by \((\mathsf {A}x)_l := a_l^\top x\), \(l=1, \ldots , N\), and \(\theta = \sum _{l=1}^N y_l\). We note that \(y_l \ge 1\) whenever \(y_l \ne 0\) for all \(l \in [N]\), and hence \(f \in \mathcal {B}_\theta (\mathcal {K})\). In Sect. 4.1 we will discuss how the Frank–Wolfe sub-problem (1.3) associated with (1.5) can be efficiently solved.

2. Positron emission tomography (PET) [6, 54]. PET is a medical imaging technique that measures the metabolic activities of human tissues and organs. Typically, radioactive materials are injected into the organ of interest, and these materials emit (radioactive) events that can be detected by PET scanners. The mathematical model behind this process is described as follows. Suppose that an emission object (e.g., a human organ) has been discretized into n voxels. The number of events emitted by voxel i (\(i\in [n]\)) is a Poisson random variable \({\tilde{X}}_i\) with unknown mean \(x_i \ge 0\) and so \({\tilde{X}}_i\sim \mathsf{Poiss}(x_i)\), and furthermore \(\{\tilde{X}_i\}_{i=1}^n\) are assumed to be independent. We also have a scanner array with m bins. Each event emitted by voxel i has a known probability \(p_{ij}\) of being detected by bin j (\(j\in [m]\)), and we assume that \(\sum _{j=1}^m p_{ij} = 1\), i.e., the event will be detected by exactly one bin. Let \({\tilde{Y}}_j\) denote the total number of events detected by bin j, whereby

$$\begin{aligned} \mathbb {E} [{\tilde{Y}}_j ]:= y_j := \textstyle \sum _{i=1}^n p_{ij} x_i \ . \end{aligned}$$
(1.6)

By Poisson thinning and superposition, it follows that \(\{\tilde{Y}_j\}_{j=1}^m\) are independent random variables and \({\tilde{Y}}_j\sim \mathsf{Poiss}(y_j)\) for all \(j\in [m]\).

We seek to perform maximum-likelihood (ML) estimation of the unknown means \(\{x_i\}_{i=1}^n\) based on observations \(\{Y_j\}_{j=1}^m\)of the random variables \(\{{\tilde{Y}}_j\}_{j=1}^m\). From the model above, we easily see that the log-likelihood of observing \(\{Y_j\}_{j=1}^m\) given \(\{{\tilde{X}}_i\}_{i=1}^n\) is (up to some constants)

$$\begin{aligned} l(x):= - \textstyle \sum _{i=1}^n x_i + \textstyle \sum _{j=1}^mY_j\ln \big (\sum _{i=1}^n p_{ij} x_i\big ) \ , \end{aligned}$$
(1.7)

and therefore an ML estimate of \(\{x_i\}_{i=1}^n\) is given by an optimal solution \(x^*\) of

$$\begin{aligned} {\max }_{x\ge 0} \;l(x) \ . \end{aligned}$$
(1.8)

It follows from the first-order optimality conditions that any optimal solution x must satisfy

$$\begin{aligned} \textstyle \sum _{i=1}^n x_i = S := \sum _{j=1}^m Y_j \ , \end{aligned}$$
(1.9)

and by incorporating (1.9) into (1.8) and defining the re-scaled variable \(z:= x/S\), (1.8) can be equivalently written as

$$\begin{aligned} {\min }_z \; L(z) := -\textstyle \sum _{j=1}^m Y_j\ln \big (\sum _{i=1}^n p_{ij} z_i\big )\quad {{\,\mathrm{s.t.}\,}}\;\; {z\in \Delta _n} \ , \end{aligned}$$
(1.10)

where \(\Delta _n := \{ z \in \mathbb {R}^n : \sum _{i=1}^n z_i =1, \ z \ge 0 \}\) is the unit simplex in \(\mathbb {R}^n\). Here we see that (1.10) is an instance of (1.1) with \(f(u) := -\textstyle \sum _{j=1}^m Y_j\ln \big (u_j)\), \(\mathcal {K}:= {\mathbb {R}}^m_+\), \(h := \iota _{\Delta _n}\), \(\mathsf {A}\) defined by \((\mathsf {A}z)_j := \sum _{i=1}^n p_{ij} z_i\), \(j=1, \ldots , m\), and \(\theta = \sum _{j=1}^m Y_j\). We note that \(Y_j \ge 1\) whenever \(Y_j \ne 0\) for all \(j \in [m]\), and hence \(f \in \mathcal {B}_\theta (\mathcal {K})\).

3. Poisson phase retrieval [48]. In Poisson phase retrieval, we seek to estimate an unknown unit complex signal \(x\in \mathbb {C}^n\), namely \(\Vert x\Vert _2:= (x^H x)^{1/2}=1\) where \(x^H\) denotes the conjugate transpose of x. We estimate x using m linear measurements that are subject to Poisson noise; for \(j\in [m]\), the j-th measurement vector is denoted by \(a_j\in \mathbb {C}^n\), and the measurement outcome \(y_j\) is a Poisson random variable such that \(y_j\sim \mathsf{Poiss}(\widetilde{y}_j)\), where \(\widetilde{y}_j:= \vert \langle {a_j},{x}\rangle \vert ^2\). Oder et al. [48] proposed to estimate x by solving the following matrix optimization problem:

$$\begin{aligned} {\min }_X&\;\; -\textstyle \sum _{j=1}^m y_j\ln \langle {a_j a_j^H},{X}\rangle + \langle {\sum _{j=1}^m a_j a_j^H},{X}\rangle \nonumber \\ {{\,\mathrm{s.t.}\,}}&\;\; {X\in \mathcal {X}:= \{X\in \mathbb {H}_+^n:{{\,\mathrm{tr}\,}}(X)\le c\}} \ , \end{aligned}$$
(1.11)

where \(\langle {\cdot },{\cdot }\rangle \) denotes the Frobenius matrix inner product, \(\mathbb {H}_+^n\) denotes the set of complex Hermitian positive semi-definite matrices of order n, \({{\,\mathrm{tr}\,}}(X)\) denotes the trace of X, and the parameter \(c>0\) is typically chosen as \(c = (1/m)\sum _{j=1}^m y_j\). Let \(X^*\) be the optimal solution of (1.11). One then computes a unit eigenvector \({\bar{x}}\) associated with the largest eigenvalue of \(X^*\) and uses \({\bar{x}}\) as the estimate of x. Note that (1.11) has a similar form to (1.10) except in two ways: first, the objective function in (1.11) has an additional linear term, and second, the constraint set is the intersection of a nuclear norm ball with the positive semi-definite cone, instead of a simplex. Therefore, using the same arguments as above, we see that (1.11) is an instance of (1.1). To solve (1.11), [48] proposed a Frank–Wolfe method with a pre-determined step-size sequence, and showed that this method converges with rate O(C/k), where C depends on several factors including (i) the diameter of \(\mathcal {X}\) under the spectral norm, (ii) \(\max _{j\in [m]} \Vert a_j\Vert ^2_2\), (iii) \(\max _{j\in [m]}\max _{X\in \mathcal {X}} \langle {a_j a_j^H},{X}\rangle \), and (iv) \(\min _{j\in [m]} \langle {a_j a_j^H},{X_0}\rangle \) where \(X_0\in \mathcal {X}\) denotes the starting point of the Frank–Wolfe method.

4. Optimal expected log investment [1, 12, 59]. In this problem we consider n stocks in the market, and let \(R_i\) denote the random per-unit capital return on investing in stock i, for \(i\in [n]\). The random vector \(R:= (R_1,\ldots ,R_n)\) has unknown distribution P. An investor allocates her investment capital over these n stocks, and let \(w_i\) denote the (nonnegative) proportion of capital invested in stock i, whereby \(w_i\ge 0\) for all \(i\in [n]\) and \(\sum _{i=1}^n w_i=1\). Define \(w:=(w_1,\ldots ,w_n)\). The goal of the investor is to maximize her expected log return \(f(w):= \mathbb {E}_{R\sim P}[\ln (w^\top R)]\) subject to the constraint \(w\in \Delta _n\) where \(\Delta _n :=\{w \in {\mathbb {R}}^n : w \ge 0, \ e^Tw = 1\}\). The naturalness of this objective can be justified from several perspectives involving the principle that money compounds multiplicatively rather than additively, see the discussion and references in [1, 12]. Since P is unknown, one can use a (historical) data-driven empirical distribution such as \(\widehat{P}_m:= \sum _{j=1}^m p_j \delta _{r_j}\), where \(p_j > 0\), \(\sum _{j=1}^m p_j=1\), \(r_j\in {\mathbb {R}}^n\) is a realization of R and \(\delta _{r_j}\) denotes the unit point mass at \(r_j\) for \(j \in [m]\). Under this empirical distribution, the investor instead solves the problem:

$$\begin{aligned} {\min }_{w\in \Delta _n}&\; -\textstyle \sum _{j=1}^{m} p_j\ln (r_j^\top w) \ . \end{aligned}$$
(1.12)

Note that (1.12) has the same basic format as the PET problem in (1.10). Indeed, both of these problems fall under a more general class of problems called “positive linear inverse problems” [59]. Define \(p_{\min } := \min _{j \in [m]}\{p_j\}>0\) and consider re-scaling the objective function of (1.12) by \(1/p_{\min }\), which ensures the coefficient in front of each \(\ln (\cdot )\) term is at least 1. Then this re-scaled problem is an instance of \((P)\) with \(f(u) := -\textstyle \sum _{j=1}^m (p_j/p_{\min }) \ln (u_j)\), \(\mathcal {K}:= {\mathbb {R}}^m_+\), \(h := \iota _{\Delta _n}\), \(\mathsf {A}\) defined by \((\mathsf {A}w)_j := r_j^\top w\) for \(j \in [m]\), \(\theta = 1/p_{\min }\), and \(f \in \mathcal {B}_\theta (\mathcal {K})\).

5. Computing the analytic center [44]. Given a nonempty solid polytope \(\mathcal {Q}= \{x\in {\mathbb {R}}^n: a_i^\top x\ge d_i,\;i=1,\ldots ,m\}\), the function

$$\begin{aligned} b(x) := -\textstyle \sum _{i=1}^m \ln (a_i^\top x- d_i), \quad x\in \mathcal {Q}\ , \end{aligned}$$
(1.13)

is an m-self-concordant barrier \(\mathcal {Q}\), see [47]. We wish to compute the analytic center of \(\mathcal {Q}\) under b, which is the optimal solution to the problem \({\min }_{x\in \mathcal {Q}}\, b(x)\). We can transform this problem into an instance of (P) as follows. Define

$$\begin{aligned} f(x,t) := -\textstyle \sum _{i=1}^m \ln (a_i^\top x- td_i) = -\sum _{i=1}^m \ln (a_i^\top (x/t)- d_i) - m\ln (t),\nonumber \\ \end{aligned}$$
(1.14)

which is a m-logarithmically-homogeneous self-concordant barrier on the conic hull of \(\mathcal {Q}\), denoted by \(\text {cone}\,(\mathcal {Q})\) and defined by

$$\begin{aligned} \text {cone}\,(\mathcal {Q})&:= \mathsf {cl}\,\{(x,t)\in {\mathbb {R}}^{n+1}: x/t\in \mathcal {Q},\; t>0\}\\&= \{x\in {\mathbb {R}}^n: a_i^\top x\ge td_i,\;i=1,\ldots ,m,\; t \ge 0\} \ . \end{aligned}$$

Then we can formulate the analytic center problem as

$$\begin{aligned} {\min }_{(x,t)\in {\mathbb {R}}^{n+1}}\; f(x,t)\quad {{\,\mathrm{s.t.}\,}}\quad (x,t)\in \text {cone}\,(\mathcal {Q}),\;t=1 \ , \end{aligned}$$

which is an instance of (P) with \(f(u) := -\textstyle \sum _{i=1}^m \ln (u_i)\), \(\mathcal {K}:= {\mathbb {R}}^m_+\), \(h = \iota _{\mathcal {X}}\) where \(\mathcal {X}= \{(x,t)\in {\mathbb {R}}^{n+1} : x \in \mathcal {Q}, \ t=1\}\), \(\mathsf {A}\) defined by \((\mathsf {A}(x,t))_i := a_i^\top x- td_i\) for \(i \in [m]\), and \(\theta = m\).

The above formulation can be generalized to any nonempty convex compact set \(\mathcal {Q}\subseteq {\mathbb {R}}^{n}\) equipped with a (standard strongly) \(\vartheta \)-self-concordant barrier b on \(\mathcal {Q}\). From [47, Proposition 5.1.4], there exists a constant \(c\le 20\) for which

$$\begin{aligned} f(x,t):= c^2(b(x/t)-2\vartheta \ln t) \end{aligned}$$
(1.15)

is a \((2c^2\vartheta )\)-logarithmically-homogeneous self-concordant barrier on \(\text {cone}\,(\mathcal {Q})\). Therefore using (1.15) the analytic center problem can be formulated as an instance of \((P)\) in a similar way as above.

6. D-optimal design [23]. Given m points \(a^1,\ldots ,a^m \in \mathbb {R}^n\) whose affine hull is \(\mathbb {R}^n\), the D-optimal design problem is:

$$\begin{aligned} \min \; h(x):= -\ln \det \big (\textstyle \sum _{i=1}^m x_i a_ia_i^T\big ) \quad {{\,\mathrm{s.t.}\,}}\;\; x\in \Delta _m \ . \end{aligned}$$
(1.16)

In the domain of statistics the D-optimal design problem corresponds to maximizing the determinant of the Fisher information matrix \(\mathbb {E}(aa^T)\), see [37, 2], as well as the exposition in [7]. And in computational geometry, D-optimal design arises as a Lagrangian dual problem of the minimum volume covering ellipsoid (MVCE) problem, which dates back at least 70 years to [35], see Todd [57] for a modern treatment. Indeed, (1.16) is useful in a variety of different application areas, for example, computational statistics [13] and data mining [38].

A recent extension of (1.16) is the design of a supervised learning pipeline for new datasets to maximize the Fisher information, see the recent paper by Yang et al. [60]. The optimization problem they consider is the following variant of (1.16):

$$\begin{aligned} \min&\; h(x):= -\ln \det \big (\textstyle \sum _{i=1}^m x_i a_ia_i^T\big ) \end{aligned}$$
(1.17)
$$\begin{aligned} \quad {{\,\mathrm{s.t.}\,}}&\;\; \textstyle \sum _{i=1}^n {\bar{t}}_i x_i \le \tau \ \end{aligned}$$
(1.18)
$$\begin{aligned} \quad&\;\;x_i \in [0,1] \ \text{ for } \ i \in [m] \ , \end{aligned}$$
(1.19)

where the decision variable \(x_i\) models the decision to fit model i or not, \({\bar{t}}_i\) is the estimated pipeline running time of pipeline i, \(\tau \) is the runtime limit, and \(a_i\) is the vector of latent meta-features of model i for \(i \in [m]\). The constraints (1.19) are a linear relaxation of the (computationally unattractive) desired combinatorial constraints \(x_i \in \{0,1\}\), \(i \in [m]\). We refer the interested reader to [60] for further details and model variations. Here we see that (1.17)–(1.19) is an instance of (1.1) with \(f(U) := -\ln \det (U)\), \(\theta = n\), \(\mathsf {A}x := \textstyle \sum _{i=1}^m x_i a_ia_i^T\), and h is the indicator function of the feasible region of the constraints (1.18)–(1.19).

As mentioned earlier, the D-optimal design problem was one of the primary motivators for the research in this paper. Indeed, Khachiyan proved that his “barycentric coordinate descent” method for this problem—which turns out to be precisely the Frank–Wolfe method (with exact line search)—has a computational guarantee that is essentially \(O\big (n^2/\varepsilon \big )\) iterations to produce an \(\varepsilon \)-approximate solution. What has been surprising about this result is that the D-optimal design problem violates the basic assumption underlying the premise for the analysis of the Frank–Wolfe method, namely L-smoothness. Khachiyan’s proof used original and rather clever arguments that do not easily carry over elsewhere, which has begged the question of whether or not any of Khachiyan’s arguments might underlie any general themes beyond D-optimal design, and if so what might be the mathematical structures driving any such themes? We will answer these questions in the affirmative in Sect. 2 by showing that the Frank–Wolfe method achieves essentially \(O((\theta + R_h)^2/\varepsilon )\) iteration complexity when used to tackle any problem of the form (1.1), where \(R_h\) is the variation of h on its domain (\(R_h:= {\max }_{x,y\in \mathcal {X}}\; \vert h(x) - h(y)\vert \)) and f is a \(\theta \)-logarithmically homogeneous self-concordant barrier. When specialized to the D-optimal design problem, we have \(R_h= 0\) (since h is an indicator function), and \(\theta =n\), whereby we recover Khachiyan’s \(O(n^2/\varepsilon )\) dependence on \(\varepsilon \). In this respect, our results reveal certain intrinsic connections between \(\theta \)-logarithmic homogeneity and the Frank–Wolfe method.

Interestingly, we note that historically the theory of self-concordant functions was initially developed to present a general underlying theory for Newton’s method for barrier methods in convex optimization. However, the results in this paper indicate that a subclass of self-concordant functions, namely the class of \(\theta \)-logarithmically homogeneous self-concordant barriers, are also tied to an underlying theory for the Frank–Wolfe method.

By way of concluding this discussion, we note that the relevant literature contains some first-order methods other than Frank–Wolfe have been proposed to solve problems similar to (1.1). For example, [58] considered (1.1) with the linear operator \(\mathsf {A}\) being invertible and the function f being standard self-concordant but not necessarily a logarithmically-homogeneous barrier. The authors in [58] proposed a proximal gradient method for solving this problem, and showed that the method globally and asymptotically converges to the unique optimal solution. However, the global convergence rate of this method was not shown.

1.2 Contributions

We summarize our contributions as follows:

  1. 1.

    We propose a generalized Frank–Wolfe method for solving \((P)\) with \(f\in \mathcal {B}_\theta (\mathcal {K})\). We show that the Frank–Wolfe method requires \(O((\delta _0 + \theta + R_h)\ln (\delta _0) + (\theta +R_h)^2/\varepsilon )\) iterations to produce an \(\varepsilon \)-approximate solution of \((P)\), namely \(x\in \mathsf {dom}\,F\) such that \(F(x) - F^*\le \varepsilon \), where \(\delta _0\) denotes the initial optimality gap and \(R_h\) denotes the variation of h on its domain. This iteration complexity bound depends on just three (natural) quantities associated with \((P)\): (i) the initial optimality gap \(\delta _0\), (ii) the complexity parameter \(\theta \) of f, and (iii) the variation of h on \(\mathcal {X}\). When h is the sum of a convex quadratic function and an indicator function, our algorithm specializes to that in Dvurechensky et al. [21]. However, our iteration complexity bounds are quite different from that in [21]—in particular our bounds are affine invariant, more naturally interpretable, and are typically easy to estimate and can yield an a priori bound on the number of iterations needed to guarantee a desired optimality tolerance \(\varepsilon \). These issues are discussed in details in Remark 3.

  2. 2.

    Our analysis also yields \(O((\delta _0 + \theta + R_h)\ln (\delta _0) + (\theta +R_h)^2/\varepsilon )\) iteration complexity to produce \(x\in \mathsf {dom}\,F\) whose Frank–Wolfe gap (defined in (2.3) below) is no larger than \(\varepsilon \). Since the Frank–Wolfe gap is constructed at each iteration and is often used as the stopping criterion for the Frank–Wolfe method, our result provides a further constructive bound on the number of iterations required to detect a desired optimality tolerance.

  3. 3.

    When specialized to the D-optimal design problem, our general algorithm almost exactly recovers the iteration complexity of Khachiyan’s specialized method for D-optimal design in [36]. Indeed, the complexities of these two methods have identical dependence on \(\varepsilon \), namely \(n^2/\varepsilon \). However, Khachiyan’s specialized method has an improved “fixed” term over our general method by a factor of \(\ln (m/n)\); see Remark 2 for details.

  4. 4.

    We present a mirror descent method with adaptive step-size applied to the (Fenchel) dual problem \((D)\) of \((P)\). The dual problem \((D)\) shares a somewhat similar structure to \((P)\) in that its objective function is non-smooth and non-Lipschitz. However, unlike (P), the objective function has unbounded domain. Although these features make the direct analysis of mirror descent rather difficult, through the duality of mirror descent and the Frank–Wolfe method we provide a computational complexity bound for this mirror descent method via the Frank–Wolfe method applied to \((P)\). An application of our mirror descent method for \((D)\) arises in the sub-problem in Bregman proximal-point methods.

  5. 5.

    We apply our method to the TV-regularized Poisson image de-blurring problem. We present computational experiments that point to the potential usefulness of our generalized Frank–Wolfe method on this imaging problem in Sect. 4.1.

1.3 Outline and notation

Outline. The paper is organized as follows. In Sect. 2 we present and analyze our generalized Frank–Wolfe method for \((P)\) when \(f \in \mathcal {B}_\theta (\mathcal {K})\), using an adaptive step-size strategy that is a natural extension of the strategy developed in [21]. In Sect. 3 we study the (Fenchel) dual \((D)\) of \({(P)}\) and derive and analyze a dual mirror descent method for solving \((D)\) based on the generalized Frank–Wolfe method for solving (P). In Sect. 4 we present computational experiments that point to the potential usefulness of our generalized Frank–Wolfe method on Poisson image de-blurring problems with TV regularization, and we also present computational experiments on the PET problem.

Notation. Let \(\mathbb {R}^n_+ := \{ x \in \mathbb {R}^n : x \ge 0 \}\) and \(\mathbb {R}^n_{++} := \{ x \in \mathbb {R}^n : x > 0 \}\). The set of integers \(\{1,\ldots ,n\}\) is denoted by [n]. The domain of a convex function f is denoted by \(\mathsf {dom}\,f :=\{x \in \mathbb {R}^n : f(x) < \infty \}\). We use H(x) to denote the Hessian of the function f. The interior and relative interior of a set \(\mathcal {S}\) are denoted by \(\mathsf {int}\,\mathcal {S}\) and \(\mathsf {ri}\,\mathcal {S}\), respectively. We use e to denote the vector with entries all equal to 1, \(\mathsf {diag}\,(x)\) to denote the diagonal matrix whose diagonal entries correspond to those of x, and \(\Delta _n\) to denote the standard \((n-1)\)-dimensional simplex in \(\mathbb {R}^n\), namely \(\Delta _n := \{ x \in \mathbb {R}^n : \sum _{i=1}^n x_i =1, \ x \ge 0 \}\). We use \(\mathbb {S}_{+}^{n}\) (\(\mathbb {S}_{++}^{n}\)) to denote the set of \(n\times n\) symmetric positive semidefinite (positive definite) matrices, and write \(B\in \mathbb {S}_{+}^{n}\) as \(B\succeq 0\) and \(B\in \mathbb {S}_{++}^{n}\) as \(B\succ 0\). The p-norm of a vector \(x\in {\mathbb {R}}^n\) is denoted and defined by \(\Vert x\Vert _p = (\sum _{i=1}^n \vert x_i\vert ^p)^{1/p}\).

2 A generalized Frank–Wolfe method for \((P)\) when f is a \(\theta \)-logarithmically-homogeneous self-concordant barrier

In this section we present and analyze a generalized Frank–Wolfe method for the composite optimization problem \((P)\) in the case when \(f \in \mathcal {B}_\theta (\mathcal {K})\), using an adaptive step-size strategy that is a natural extension of the strategy developed in Dvurechensky et al. [21]. We assume throughout that \(\mathcal {X}:= \mathsf {dom}\,h\) is a convex and compact set, and that \(\mathsf {A}(\mathcal {X})\cap \mathsf {dom}\,f\ne \emptyset \). These two assumptions together with the differentiability of f on \(\mathsf {int}\,\mathcal {K}\) ensure that \((P)\) has at least one optimal solution which we denote by \(x^*\) and hence \(F^*=F(x^*)\).

Let us introduce some important notation. For any \(u \in \mathsf {int}\,\mathcal {K}\), the Hessian H(u) of f is used to define the local (Hilbert) norm \(\Vert \cdot \Vert _u\) defined by:

$$\begin{aligned} \Vert w \Vert _u := \sqrt{\langle {w},{H(u)w}\rangle } \ \ \text {for~all~} w \in \mathbb {R}^m \ . \end{aligned}$$

The Dikin ball \(\mathcal {D}(u,1)\) at \(u\in \mathsf {int}\,\mathcal {K}\) is defined by

$$\begin{aligned} \mathcal {D}(u,1) := \{v\in \mathcal {K}:\Vert v-u\Vert _u<1\} \ , \end{aligned}$$

and it can be shown that \(\mathcal {D}(u,1)\subseteq \mathsf {int}\,\mathcal {K}\), see Nesterov and Nemirovski [47, Theorem 2.1.1]. The self-concordant function f is well-behaved inside the Dikin ball as we will review shortly. Define the univariate function \(\omega \) and its Fenchel conjugate \(\omega ^*\) as follows:

$$\begin{aligned} \omega (a) := -a - \ln (1-a) \;\; \forall \, a < 1 \ , \ \ \ \text {and} \ \ \omega ^*(a) := a - \ln (1+a) \;\; \forall \, a > -1 .\nonumber \\ \end{aligned}$$
(2.1)

It turns out that f can be nicely upper-bounded inside the Dikin ball, namely:

$$\begin{aligned} f(v)&\le f(u) + \langle {\nabla f(u)},{v-u}\rangle + \omega \left( \Vert v-u\Vert _u\right) \ \forall \,u\in \mathsf {int}\,\mathcal {K}, \;\forall \, v \in \mathcal {D}(u,1) \ , \end{aligned}$$
(2.2)

see Nesterov [44, Theorem 4.1.8].

We now develop our generalized Frank–Wolfe method for the composite optimization problem \((P)\) under the condition that \(f \in \mathcal {B}_\theta (\mathcal {K})\), and whose formal rules are presented in Algorithm 1. First, we choose any starting point \(x^0\in \mathcal {X}\) such that \(\mathsf {A}x^0\in \mathsf {dom}\,f(=\mathsf {int}\,\mathcal {K})\), namely \(x^0\in \bar{\mathcal {X}}:=\mathcal {X}\cap \mathsf {A}^{-1}(\mathsf {dom}\,f)\). Indeed, from the description below, we will see that the whole sequence of iterates \(\{x^k\}_{k\ge 0}\) generated by our method lies in \(\bar{\mathcal {X}}\). Given the current iterate \(x^k\in \bar{\mathcal {X}}\), the method first computes the gradient \(\nabla f({\mathsf {A}}x^k)\) and then solves for a minimizer \(v^k\) of the generalized Frank–Wolfe sub-problem given by

$$\begin{aligned} v^k\in {\arg \,min}_{x\in {\mathbb {R}}^n} \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}x}\rangle + h(x)\ . \end{aligned}$$

The next iterate is then determined as a convex combination of \(x^k\) and \(v^k\): \(x^{k+1} \leftarrow x^k + \alpha _k (v^k - x^k)\) for some \(\alpha _k \in [0,1]\), where \(\alpha _k\) is the step-size at iteration k. For L-smooth functions f, the step-size can be chosen by a variety of strategies, including simple rules such as \(\alpha _k = \tfrac{2}{(k+2)}\), exact line-search to minimize \(f(x^k + \alpha (v^k - x^k))\) on \(\alpha \in [0,1]\), or an adaptive strategy based on curvature information, etc. Here we present an adaptive strategy based on the upper bound model (2.2), which can also be viewed as an extension of the adaptive strategy used in [21] and which itself is an extension of the adaptive strategy in Demyanov and Rubinov [14]. Define the Frank–Wolfe gap (“FW-gap") \(G_k\) by

$$\begin{aligned} G_k := \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}(x^k - v^k)}\rangle + h(x^k) - h(v^k) \ , \end{aligned}$$
(2.3)

and the optimality gap \(\delta _k := F(x^k) - F^*\). Note that \(G_k \ge 0\) and in fact by the convexity of f it holds that

$$\begin{aligned} \delta _k&= (f(\mathsf {A}x^k) - f(\mathsf {A}x^*)) + (h(x^k) - h(x^*))\nonumber \\&\le \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}(x^k - x^*)}\rangle + (h(x^k) - h(x^*))\nonumber \\&\le \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}(x^k - v^k)}\rangle + (h(x^k) - h(v^k)) = G_k \ , \end{aligned}$$
(2.4)

hence \(G_k\) is indeed an upper bound on the optimality gap \(\delta _k\). Denoting \(D_k := \Vert \mathsf {A}(v^k-x^k)\Vert _{\mathsf {A}x^k}\), we then have that for any \(\alpha \ge 0\),

$$\begin{aligned} f(\mathsf {A}x^{k} + \alpha \mathsf {A}(v^k-x^k)) \le f(\mathsf {A}x^k) - \alpha \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}(x^k - v^k)}\rangle + \omega \left( \alpha D_k\right) .\nonumber \\ \end{aligned}$$
(2.5)

(Note that if \(\alpha <1/D_k\), then (2.5) follows from (2.2); otherwise, by the definition of \(\omega \) in (2.1), we have \(\omega \left( \alpha D_k\right) =+\infty \) and (2.5) still holds.) Also, by the convexity of h, we have

$$\begin{aligned} h(x^{k} + \alpha (v^k-x^k))\le (1-\alpha )h(x^{k}) + \alpha h(v^k) = h(x^{k}) - \alpha (h(x^k) - h(v^{k})) .\nonumber \\ \end{aligned}$$
(2.6)

Adding (2.5) and (2.6) together, we obtain

$$\begin{aligned} F(x^{k} + \alpha (v^k-x^k))\le F(x^k) - \alpha G_k + \omega \left( \alpha D_k\right) \ , \quad \forall \,\alpha \ge 0 \ , \end{aligned}$$
(2.7)

and optimizing the right-hand-side over \(\alpha \in [0,1]\) yields the step-size:

$$\begin{aligned} \alpha _k := \min \left\{ \frac{G_k}{D_k(G_k+D_k)}, 1\right\} . \end{aligned}$$

Notice that this step-size specializes precisely to the adaptive step-size developed in [21] in the case when the function h is the indicator function \(\iota _{\mathcal {X}}\) of a compact convex set \(\mathcal {X}\), since in this case \(h(x^k)=h(v^k)=0\) for all k and hence \(G_k\) turns out to be the standard “gap function” \(\langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}(x^k - v^k)}\rangle \) as used in [21]. In addition, note that the step-size \(\alpha _k\) ensures that \(x^{k+1}\in \bar{\mathcal {X}}\).

figure a

Before presenting our analysis of Algorithm 1 we make two remarks. First, notice that the complexity parameter \(\theta \) of f is not needed to run Algorithm 1, but it will play a central role in analyzing the iteration complexity of the algorithm. In a sense, Algorithm 1 automatically adapts to the value of \(\theta \). Second, for most applications—including all of the applications discussed in Sect. 1.1—the computational cost of computing \(D_k\) (in Step 2) is of the same order as computing \(G_k\), and grows linearly in the ambient dimension of the variable x. (Note that here our discussion focuses on the dependence of the computational cost on the dimension of x only.) To see this, let us fix any \(v,x\in \mathsf {dom}\,F\). For the first application in Sect. 1.1 (where N denotes the dimension), both \(u:=\mathsf {A}x\) and \(w:=\mathsf {A}v\) can be computed in O(qN) time, due to the fact that the matrix representation of \(\mathsf {A}\) has only O(qN) nonzeros. Since \(D_k^2= \sum _{l=1}^N (u_l-w_l)^2/u_l^2\), it can be computed in O(qN) time. Using similar reasoning, we easily see that for the second, third and fourth applications (where n denotes the dimension), \(D_k\) can be computed in O(mn) time. For the last application (namely the D-optimal design problem), where the dimension is denoted by m, \(D_k\) can be computed in \(O(mn^2+n^3)\) time for \(k=0\) and \(O(n^2)\) time for \(k\ge 1\); for details see Appendix 1.

2.1 Computational guarantees for Algorithm 1

We now present our computational guarantees for Algorithm 1. These guarantees depend on only three natural quantities associated with \((P)\): (i) the initial optimality gap \(\delta _0\), (ii) the complexity parameter \(\theta \) of f, and (iii) the variation of h on \(\mathcal {X}\), which is defined as:

$$\begin{aligned} R_h:= {\max }_{x,y\in \mathcal {X}}\; \vert h(x) - h(y)\vert \ . \end{aligned}$$
(2.9)

Regarding the variation \(R_h\) we mention two cases in particular:

  1. 1.

    when \(h = \iota _{\mathcal {X}}\), we have \(R_h= 0\), and

  2. 2.

    when h is L-Lipschitz on \(\mathcal {X}\) with respect to some norm \(\Vert \cdot \Vert \), we have \(R_h\le L D_{\mathcal {X},\Vert \cdot \Vert }\), where

    $$\begin{aligned} D_{\mathcal {X},\Vert \cdot \Vert }:= {\sup }_{x,x'\in \mathcal {X}}\,\Vert x-x'\Vert <+\infty \end{aligned}$$
    (2.10)

    is the diameter of \(\mathcal {X}\) under \(\Vert \cdot \Vert \). And in particular if \(h=\Vert \cdot \Vert \), then \(R_h\le D_{\mathcal {X},\Vert \cdot \Vert }\).

Theorem 1

Suppose that \(f \in \mathcal {B}_\theta (\mathcal {K})\) and that Algorithm 1 is initiated at \(x^0 \in \bar{\mathcal {X}}\). Let \(\delta _0\) denote the initial optimality gap.

  1. 1.

    At iteration k of Algorithm 1 the following hold:

    1. (a)

      If \(G_k > \theta + R_h\), then the optimality gap decreases at least linearly at the iteration:

      $$\begin{aligned} \delta _{k+1}\le \left( 1-\frac{1}{5.3(\delta _0+\theta +R_h)}\right) \delta _k \ , \end{aligned}$$
      (2.11)
    2. (b)

      If \(G_k \le \theta + R_h\), then the inverse optimality gap increases by at least a constant at the iteration:

      $$\begin{aligned} \frac{1}{\delta _{k+1}} \ge \frac{1}{\delta _k} + \frac{1}{12 (\theta +R_h)^2} \ , \ \ \ \text{ and } \end{aligned}$$
      (2.12)
    3. (c)

      The number of iterations \(K_{\text {Lin}}\) where \(G_k > \theta +R_h\) occurs is bounded from above as follows: \(K_{\text {Lin}} \le \lceil 5.3(\delta _0 + \theta +R_h)\ln (10.6\delta _0) \rceil \).

  2. 2.

    Let \(K_\varepsilon \) denote the number of iterations required by Algorithm 1 to obtain \(\delta _k \le \varepsilon \). Then:

    $$\begin{aligned} K_\varepsilon \le \lceil 5.3(\delta _0 + \theta +R_h)\ln (10.6\delta _0) \rceil + \left\lceil 12(\theta +R_h)^2 \max \left\{ \frac{1}{\varepsilon } - \frac{1}{\delta _0} \ , 0 \right\} \right\rceil .\nonumber \\ \end{aligned}$$
    (2.13)
  3. 3.

    Let \(\text {FWGAP}_\varepsilon \) denote the number of iterations required by Algorithm 1 to obtain \(G_k \le \varepsilon \). Then:

    $$\begin{aligned} \text {FWGAP}_\varepsilon \le \lceil 5.3(\delta _0 + \theta +R_h)\ln (10.6\delta _0) \rceil + \left\lceil \frac{24(\theta +R_h)^2}{\varepsilon } \right\rceil . \end{aligned}$$
    (2.14)

\(\square \)

Before we present our proof, let us make a few remarks about the results in Theorem 1.

Remark 1

Theorem 1 indicates that if \(f\in \mathcal {B}_\theta (\mathcal {K})\), then the iteration complexity to obtain an \(\varepsilon \)-optimal solution using Algorithm 1 is

$$\begin{aligned} O\big ((\delta _0 + \theta + R_h)\ln (\delta _0) + (\theta + R_h)^2/\varepsilon \big ) \ , \end{aligned}$$
(2.15)

which only depends on three measures, namely (i) the logarithmic homogeneity constant (also known as the “complexity value”) \(\theta \) of f, (ii) the initial optimality gap \(\delta _0\), and (iii) the variation \(R_h\) of h on its domain, in addition to the desired optimality gap \(\varepsilon \). Furthermore, in most of the applications discussed in Sect. 1.1 (namely applications (2.), (4.), (5.), and (6.)) we have \(h = \iota _{\mathcal {X}}\) for some \(\mathcal {X}\) and hence \(R_h= 0\), and so the iteration complexity depends only on \(\theta \) and \(\delta _0\).

It is interesting to note in the case when \(h=\iota _\mathcal {X}\) is the indicator function of a compact region \(\mathcal {X}\), that the iteration complexity bound (2.15) does not rely on any measure of size of the feasible region \(\mathcal {X}\), since in this case \(R_h= 0\). And even when \(R_h>0\), the complexity bound (2.15) has no specific dependence on the behavior of \({\bar{f}}\) on \(\mathcal {X}\). In this way we see that the only way that the behavior of \({\bar{f}}\) enters into the iteration complexity is simply through the value of \(\theta \). (This is in contrast to the traditional set-up of the Frank–Wolfe method for L-smooth optimization, where the fundamental iteration complexity depends on a bound on the curvature of the function \({\bar{f}}\) on the feasible region \(\mathcal {X}\) – which we will discuss later in Remark 3 – which in turn can grow quadratically in the diameter of the feasible region, see for example [25, 34].)

We also note that the iteration complexity bound (2.15) is also independent of any properties of the linear operator \(\mathsf {A}\), and in this way its dependence on a specific data instance \(\mathsf {A}\) is only through the initial optimality gap \(\delta _0\). Therefore (2.15) is data-instance independent except for the way that the data \(\mathsf {A}\) affects the initial optimality gap.

Remark 2

[Comparison with Khachiyan [36] for D-optimal design] Let us specialize the complexity bound in (2.15) to the D-optimal design problem in (1.16), and compare it with the complexity bound in Khachiyan [36]. Note that for the problem in (1.16) we have \(\theta =n\), i.e., the dimension of the ambient space of the points \(a_1, \ldots , a_m\). In addition, if we choose the starting point \(p^0 = (1/m)e\), where \(e:=(1,\ldots ,1)\in {\mathbb {R}}^m\), then

$$\begin{aligned} \delta _0\le n\ln (m/n) \end{aligned}$$
(2.16)

(which we show in Appendix 1), and then based on (2.16) and \(R_h=0\), the itration complexity bound in (2.15) becomes

$$\begin{aligned} O\big (n\ln (m/n)(\ln n + \ln \ln (m/n)) + n^2/\varepsilon \big ) \ . \end{aligned}$$
(2.17)

Using the same starting point, Khachiyan’s Frank–Wolfe method [36] uses exact line-search (based on a clever observation from the Inverse Matrix Update formula [30]), and attains the complexity bound

$$\begin{aligned} O\big (n(\ln n + \ln \ln (m/n)) + n^2/\varepsilon \big ) \ . \end{aligned}$$
(2.18)

Observe that (2.17) has the exact same dependence on \(\varepsilon \) as (2.18), namely \(O(n^2/\varepsilon )\), but its first term is inferior to (2.18) by the factor \(O(\ln (m/n))\). The improvement in the first term of Khachiyan’s bound over the bound in (2.17) is due to his improved estimate of the linear convergence rate in the case \(G_k > \theta \) in Algorithm 1, which arises from exploiting an exact line-search on f. This is in contrast with our method, which only does an exact line-search on the upper bound model of f in (2.7). A detailed analysis of this last point is given in Remark 4 at the end of this section.

Remark 3

[Comparison with Dvurechensky et al. [21]] The recent work of Dvurechensky et al. [21] considers the Frank–Wolfe method for the problem \(\min _{x\in \mathcal {X}}\, {\bar{F}}(x)\), where \(\mathcal {X}\) is nonempty, convex and compact and \({\bar{F}}\) is a non-degenerate (strongly) M-self-concordant function for some \(M>0\). This means \(\bar{F}\) is convex and three-times differentiable on \(\mathsf {dom}\,\bar{F}\), \(\nabla ^2 \bar{F}(x)\succ 0\) for all \(x\in \mathsf {dom}\,\bar{F}\) (this is the definition that \(\bar{F}\) is non-degenerate), and

$$\begin{aligned} \vert D^3\bar{F}(x)[z,z,z]\vert \le 2M^{-1/2}(\langle {\nabla ^2 \bar{F}(x)z},{z}\rangle )^{3/2} \ \quad \forall \,x\in \mathsf {dom}\,\bar{F}\ , \;\;\forall \,z\in {\mathbb {R}}^n .\nonumber \\ \end{aligned}$$
(2.19)

For convenience of comparison, henceforth let \(M=1\). When h is the sum of a convex quadratic function and an indicator function, i.e., \(h(x) = \tfrac{1}{2}\langle {x},{Qx}\rangle + \xi ^\top x + \iota _\mathcal {X}(x)\) for some \(Q \succeq 0\) and \(\xi \in {\mathbb {R}}^n\), then our method coincides with that in Dvurechensky et al. [21] with \(\bar{F}(x) = f(\mathsf {A}x) + \tfrac{1}{2}\langle {x},{Qx}\rangle + \xi ^\top x\). The complexity bound developed in [21] for computing an \(\varepsilon \)-optimal solution is:

$$\begin{aligned} O\left( \sqrt{L(x^0)}D_{\mathcal {X},\Vert \cdot \Vert _2}\ln \left( \frac{\delta _0}{\sqrt{L(x^0)}D_{\mathcal {X},\Vert \cdot \Vert _2}}\right) + \frac{L(x^0)D_{\mathcal {X},\Vert \cdot \Vert _2}^2}{\varepsilon }\right) \ , \end{aligned}$$
(2.20)

where

$$\begin{aligned} \mathcal {S}(x^0)&:= \big \{x\in \mathsf {dom}\,\bar{F}\cap \mathcal {X}\,:\, \bar{F}(x) \le \bar{F}(x^0) \big \} \ , \text{ and } \end{aligned}$$
(2.21)
$$\begin{aligned} L(x^0)&:= {\max }_{x\in \mathcal {S}(x^0)} \;\;\lambda _{\max } (\nabla ^2 \bar{F}(x) ) <+\infty \ . \end{aligned}$$
(2.22)

In (2.22) \(\lambda _{\max } (\nabla ^2 \bar{F}(x))\) denotes the largest eigenvalue of \(\nabla ^2 \bar{F}(x)\), and in [21] it is further assumed that \(\bar{F}\) is non-degenerate (which necessarily presumes that \({{\,\mathrm{rank}\,}}(\mathsf {A}) = n \)) in order to ensure the compactness of \(\mathcal {S}(x^0)\), and hence the finiteness of \(L(x^0)\).

It is instructive to compare the two iteration complexity bounds (2.15) and (2.20), and we note the following:

  • Affine invariance. Affine invariance is an important structural property of certain algorithms; for instance Newton’s method is affine invariant whereas the steepest descent method is not. It is well known that the Frank–Wolfe method is affine invariant. Current state-of-the-art complexity analysis of the Frank–Wolfe method for L-smooth functions yields an appropriate affine-invariant complexity bound by utilizing the so-called curvature constant \(C_{\bar{F}}\) of Clarkson [11] defined by

    $$\begin{aligned} C_{\bar{F}}:= \max _{x,y\in \mathcal {X},\alpha \in [0,1]}\,\frac{2}{\alpha ^2}(\bar{F}(x+\alpha (y-x))-\bar{F}(x) - \alpha \langle {\nabla \bar{F}(x)},{y-x}\rangle ) ,\nonumber \\ \end{aligned}$$
    (2.23)

    which is a finite affine-invariant quantity [11] when \({\bar{F}}\) is L-smooth. (This is the same curvature which was alluded to in Remark 1.) Of course \(C_{\bar{F}}\) is not typically finite when \(\bar{F}\) is self-concordant. The complexity bound in (2.20) depends on measures that are tied to the Euclidean inner product and norm, namely \(D_{\mathcal {X},\Vert \cdot \Vert _2}\) and \(L(x^0)\), and are not affine-invariant, even though the Euclidean norm plays no part in the algorithm. (Under an invertible affine transformation of the variables these measures will change but the performance of the Frank–Wolfe method will not change.) In contrast, all the quantities in our complexity bounds, namely \(\delta _0\), \(\theta \) and \(R_h\) are affine-invariant, therefore the complexity bounds in (2.15) are affine-invariant.

  • Interpretability. Apart from \(\varepsilon \), our complexity result only depends on \(\delta _0\), \(\theta \), and \(R_h\), all of which admit natural behavioral interpretations. Specifically, \(\delta _0\) measures the initial sub-optimality gap, \(\theta \) is the “complexity parameter” of the barrier f (in the lexicon of Renegar [51]), and \(R_h\) measures the variation of h over the set \(\mathcal {X}\).

  • Ease of parameter estimation. Note that all of the three parameters \(\delta _0\), \(\theta \), and \(R_h\) in our complexity bound (2.15) are either easy to know or easy to appropriately bound. Given a logarithmically-homogeneous self-concordant barrier f, its complexity value \(\theta \) is typcially known a priori or can be easily determined using (P7) of Lemma 1. Since \(\delta _0 \le G_0\), a natural upper bound on \(\delta _0\) is the initial FW-gap \(G_0\), which is computed in the second step at iteration \(k=0\) of Algorithm 1. Regarding \(R_h\), note that \(R_h= 0 \) when h is the indicator function of a convex set \(\mathcal {X}\), namely \(h = \iota _\mathcal {X}\). In the case when \(h(x) := \xi ^\top x + \iota _\mathcal {X}(x)\) then \(R_h\) can be computed exactly as the difference of two linear optimization optimal values on \(\mathcal {X}\) or can be upper bounded using

    $$\begin{aligned} R_h= {\max }_{x,x'\in \mathcal {X}}\,\vert \xi ^\top (x-x')\vert \le \Vert \xi \Vert _*D_{\mathcal {X},\Vert \cdot \Vert } \ , \end{aligned}$$

    in the case when the norm \(\Vert \cdot \Vert \) can possibly be chosen to yield easily computable values of \(D_{\mathcal {X},\Vert \cdot \Vert }\). Apart from this case, there also exist many other cases where the simple nature of h and \(\mathcal {X}\) yield easily computable upper-bounds on \(R_h\).

2.2 Proof of Theorem 1

We first state some facts about \(\theta \)-logarithmically homogeneous self-concordant barrier functions.

Lemma 1

[see Nesterov and Nemirovskii [47, Corollary 2.3.1, Proposition 2.3.4, and Corollary 2.3.3]] If \(f\in \mathcal {B}_\theta (\mathcal {K})\), then for any \(u\in \mathsf {int}\,\mathcal {K}\), we have

  1. (P4)

    \(\left|\langle {\nabla f(u)},{w}\rangle \right| \le \sqrt{\theta }\Vert w\Vert _u\) \(\quad \forall \,w\in {\mathbb {R}}^m\),

  2. (P5)

    \(\Vert v\Vert _u \le -\langle {\nabla f(u)},{v}\rangle \) \(\quad \forall \,v\in \mathcal {K}\),

  3. (P6)

    \(\langle {\nabla f(u)},{w}\rangle = -\langle {H(u)u},{w}\rangle \) \(\quad \forall \,w\in {\mathbb {R}}^m\),

  4. (P7)

    \(\langle {\nabla f(u)},{u}\rangle = -\theta \), and

  5. (P8)

    \(\theta \ge 1\). \(\square \)

We also introduce some properties of the function \(\omega ^*\) (cf. (2.1)) and present an “old” property of the logarithm function.

Proposition 1

The function \(\omega ^*\) is strictly increasing on \([0,+\infty )\), and

$$\begin{aligned} \omega ^*(s)&\ge s^2/3 \quad \;\; \ \forall s \in [0,1/2] \ , \ \text{ and } \end{aligned}$$
(2.24)
$$\begin{aligned} \omega ^*(s)&\ge s/5.3 \quad \;\; \forall s \ge 1/2\ . \end{aligned}$$
(2.25)

Proof

See Appendix 1. \(\square \)

Proposition 2

$$\begin{aligned} \ln (1+s)\ge s - \frac{s^2}{2(1-\left|s\right|)} \quad \forall \ s\in (-1,1) . \end{aligned}$$
(2.26)

Proof

See Appendix 1. \(\square \)

We have the following inequality concerning values of \(D_k\) and \(G_k\) computed in Step 2 of Algorithm 1. For convenience, in the following, define

$$\begin{aligned} \widetilde{G}_k := \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}(x^k - v^k)}\rangle \quad \text{ and }\quad \beta _k := h(x^k) - h(v^k) \ , \end{aligned}$$

so that \(G_k = \widetilde{G}_k + \beta _k\). Also, by the definition of \(R_h\), we know that \(\vert \beta _k\vert \le R_h\).

Proposition 3

For all \(k \ge 0\) it holds that

$$\begin{aligned} D_k \le G_k +\theta + R_h\ . \end{aligned}$$
(2.27)

Proof

We have:

$$\begin{aligned} D_k^2&= \langle {H(\mathsf {A}x^k)\mathsf {A}v^k},{\mathsf {A}v^k}\rangle - 2\langle {H(\mathsf {A}x^k)\mathsf {A}x^k},{\mathsf {A}v^k}\rangle + \langle {H(\mathsf {A}x^k)\mathsf {A}x^k},{\mathsf {A}x^k}\rangle \ . \end{aligned}$$
(2.28)

By (P5) we see that

$$\begin{aligned} \langle {H(\mathsf {A}x^k)\mathsf {A}v^k},{\mathsf {A}v^k}\rangle\le & {} \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}v^k}\rangle ^2 = (-\widetilde{G}_k + \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}x^k}\rangle )^2\nonumber \\= & {} (\widetilde{G}_k + \theta )^2 , \end{aligned}$$
(2.29)

where the last equality above uses (P7). In addition, from (P6) and (P7) we have

$$\begin{aligned} -2\langle {H(\mathsf {A}x^k)\mathsf {A}x^k},{\mathsf {A}v^k}\rangle + \langle {H(\mathsf {A}x^k)\mathsf {A}x^k},{\mathsf {A}x^k}\rangle&= 2\langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}v^k}\rangle - \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}x^k}\rangle \nonumber \\&= -2\widetilde{G}_k + \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}x^k}\rangle \nonumber \\&= -2\widetilde{G}_k - \theta \ . \end{aligned}$$
(2.30)

Combining (2.28), (2.29) and (2.30), we have

$$\begin{aligned} D_k^2&\le (\widetilde{G}_k + \theta )^2 -2\widetilde{G}_k - \theta \nonumber \\&= (G_k -\beta _k + \theta )^2 + 2\beta _k - (2G_k + \theta )\nonumber \\&\le (G_k + \theta )^2 + \beta _k^2 - 2\beta _k (G_k + \theta -1) \end{aligned}$$
(2.31)
$$\begin{aligned}&\le (G_k + \theta )^2 + R_h^2 + 2R_h(G_k + \theta ) \end{aligned}$$
(2.32)
$$\begin{aligned}&= (G_k + \theta + R_h)^2 \ , \end{aligned}$$
(2.33)

where in (2.31) we use \(2G_k + \theta \ge 0\) and in (2.32) we use \(\vert \beta _k\vert \le R_h\) and \(G_k + \theta \ge \theta \ge 1\). \(\square \)

The basic iteration improvement inequality for the Frank–Wolfe was presented in (2.5), and the step-size in Algorithm 1 is given by (2.8). In the case when \(\alpha _k < 1\), it follows from substituting \(\alpha _k = \tfrac{G_k}{D_k(G_k+D_k)}\) from (2.8) into (2.7) that the iteration improvement bound is

$$\begin{aligned} F(x^{k+1}) \le F(x^k) -\omega ^*\left( \frac{G_k}{D_k}\right) . \end{aligned}$$
(2.34)

Using the notation \(\Delta _k := F(x^k) - F(x^{k+1}) = \delta _k-\delta _{k+1}\), we can write this improvement as:

$$\begin{aligned} \Delta _k = \delta _{k} - \delta _{k+1} = F(x^k) - F(x^{k+1})&\ge \omega ^*\left( \frac{G_k}{D_k}\right) \ge 0 \ \text{ when } \ \alpha _k \ < 1 \ . \end{aligned}$$
(2.35)

Let us now prove (2.11) and part 1 of Theorem 1. Since \(G_k > \theta + R_h\), by (P4) in Lemma 1 we have

$$\begin{aligned} D_k = \Vert \mathsf {A}v^k - \mathsf {A}x^k\Vert _{\mathsf {A}x^k}&\ge \vert \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}v^k-\mathsf {A}x^k}\rangle \vert /\sqrt{\theta } \nonumber \\&\ge \widetilde{G}_k/\sqrt{\theta } = (G_k-\beta _k)/\sqrt{\theta }\ge (G_k-R_h)/\sqrt{\theta } > \sqrt{\theta }\ge 1 \ . \end{aligned}$$
(2.36)

As a result, \(\alpha _k = \tfrac{G_k}{D_k(G_k+D_k)}<1\). Consequently, by (2.35) we have

$$\begin{aligned} F(x^{k+1})\le F(x^k) - \omega ^*\left( \frac{G_k}{D_k}\right) \le F(x^k) - \omega ^*\left( \frac{G_k}{G_k + \theta +R_h} \right) \ , \end{aligned}$$
(2.37)

where the last inequality uses (2.27) and the monotonicity of \(\omega ^*\). Now notice from the condition \(G_k > \theta + R_h\) that \({G_k}/(G_k+\theta + R_h)>1/2\), whereby invoking (2.25) we have

$$\begin{aligned} \Delta _k = \delta _k - \delta _{k+1} = F(x^{k}) - F(x^{k+1})&\ge \omega ^*\left( \frac{G_k}{G_k+\theta +R_h}\right) \ge \frac{G_k}{5.3(G_k+\theta +R_h)} \ . \end{aligned}$$
(2.38)

In addition we have

$$\begin{aligned} \frac{G_k}{5.3(G_k+\theta + R_h)}\ge \frac{\delta _k}{5.3(\delta _k+\theta + R_h)}\ge \frac{\delta _k}{5.3(\delta _0+\theta + R_h)} \ , \end{aligned}$$
(2.39)

where the first inequality uses the strict monotonicity of the function \(c \mapsto c/(c+\theta + R_h)\) on \([0,+\infty )\), and the second inequality uses the monotonicity of the sequence \(\{\delta _k\}_{k\ge 0}\) (see (2.35)). Combining (2.38) and (2.39), we obtain

$$\begin{aligned} \delta _{k+1}\le \left( 1-\frac{1}{5.3(\delta _0+\theta + R_h)}\right) \delta _k \ , \end{aligned}$$
(2.40)

which proves (2.11). Furthermore, \(G_k>\theta + R_h\) implies that

$$\begin{aligned} \delta _k\ge \Delta _k \ge \frac{G_k}{5.3(G_k+\theta + R_h)} > \frac{G_k}{5.3(G_k+G_k)} = \frac{1}{10.6} \ . \end{aligned}$$
(2.41)

Now let \(K_{\text {Lin}}\) denote the number of iterations of Algorithm 1 where \(G_k > \theta +R_h\) occurs. By (2.40) and (2.41) it follows that

$$\begin{aligned} \frac{1}{10.6} < \delta _0 \left( 1-\frac{1}{5.3(\delta _0+\theta +R_h)}\right) ^{K_{\text {Lin}}-1} \ , \end{aligned}$$
(2.42)

which then implies that \(K_{\text {Lin}} \le \lceil 5.3(\delta _0 + \theta +R_h)\ln (10.6\delta _0) \rceil \) and thus proving part 1 of Theorem 1.

Let us now prove (2.12) of Theorem 1. Towards doing so, we will establish:

$$\begin{aligned} G_k \le \theta +R_h\ \implies \ \Delta _k \ge \frac{G_k^2}{12(\theta +R_h)^2} \ . \end{aligned}$$
(2.43)

We first consider the case where \(\alpha _k=1\), whereby \(D_k(G_k+D_k) \le G_k\), which implies that \(D_k < 1\), and also can be rearranged to yield

$$\begin{aligned} G_k \ge \frac{D_k^2}{1-D_k} \ . \end{aligned}$$
(2.44)

In addition, by (2.7) we obtain

$$\begin{aligned} \Delta _k = F(x^k)-F(x^{k+1}) \ge G_k - \omega (D_k)= G_k +D_k + \ln (1-D_k) \ . \end{aligned}$$
(2.45)

By (2.45), Proposition 2, and (2.44), we have

$$\begin{aligned} \Delta _k \ge G_k - \frac{D_k^2}{2(1-D_k)} \ge \frac{G_k}{2} \ , \end{aligned}$$
(2.46)

which then implies that

$$\begin{aligned} \Delta _k \ge \frac{G_k}{2} \ge \frac{G_k^2}{2(\theta +R_h)} \ge \frac{G_k^2}{2(\theta +R_h)^2} \ge \frac{G_k^2}{12(\theta +R_h)^2} \ , \end{aligned}$$
(2.47)

where the second inequality used \(G_k \le \theta +R_h\) and the third inequality used \(\theta +R_h\ge 1\). This establishes (2.43) for the case when \(\alpha _k = 1\).

We next consider the case where \(\alpha _k <1\), whereby \(\alpha _k = \frac{G_k}{D_k(G_k+D_k)}\), and then by (2.35) we have

$$\begin{aligned} \Delta _k&= F(x^k) - F(x^{k+1}) \nonumber \\&\ge \omega ^*\left( \frac{G_k}{D_k}\right) \ge \omega ^*\left( \frac{G_k}{G_k + \theta +R_h}\right) \ge \frac{G_k^2}{3(G_k+\theta +R_h)^2}\ge \frac{G_k^2}{12(\theta +R_h)^2} \ , \end{aligned}$$
(2.48)

where the second inequality uses (2.27) and the monotonicity of \(\omega ^*\), the third inequality uses (2.24) in conjunction with \(G_k/(G_k + \theta +R_h) \le 1/2\), and the fourth inequality uses \(G_k \le \theta +R_h\). This establishes (2.43) for the case when \(\alpha _k < 1\), completing the proof of (2.43). It thus follows for \(G_k \le \theta +R_h\) that

$$\begin{aligned} \delta _k - \delta _{k+1} = \Delta _k \ge \frac{G_k^2}{12(\theta +R_h)^2} \ge \frac{\delta _k \delta _{k+1}}{12(\theta +R_h)^2} \ , \end{aligned}$$

where the last inequality follows from \(\delta _{k+1} \le \delta _k \le G_k\), and dividing both sides by \( \delta _k \delta _{k+1}\) and rearranging yields the inequality (2.12).

We next prove (2.13) and (2.14). If \(\delta _0 \le \varepsilon \) the result follows trivially; thus we assume that \(\delta _0 > \varepsilon \). Let \({\bar{K}}\) denote the expression on the right-side of (2.13), and suppose Algorithm 1 has been run for \({\bar{K}}\) iterations. Let \(N:= \lceil 12(\theta +R_h)^2\left( {1}/{\varepsilon } - {1}/{\delta _0} \right) \rceil \), whereby it follows from part 1 of Theorem 1 that among the first \({\bar{K}}\) iterations, the number of iterations where \(G_k \le \theta +R_h\) is at least N. Thus from (2.12) it follows that

$$\begin{aligned} \frac{1}{\delta _{K_\varepsilon }} \ge \frac{1}{\delta _0} + \frac{N}{12 (\theta +R_h)^2} \ge \frac{1}{\delta _0} + \left( \frac{1}{\varepsilon } - \frac{1}{\delta _0} \right) = \frac{1}{\varepsilon } \ , \end{aligned}$$

and rearranging yields part 1 of Theorem 1.

Let \(k_0< k_1< k_2 < \cdots \) denote indices where \(G_k \le \theta + R_h\). From (2.43), (2.4), and the monotonicity of the sequence \(\{\delta _k\}_{k\ge 0}\), it follows for all \(j\ge 0\) that

$$\begin{aligned} \delta _{k_{j+1}} \le \delta _{k_j+1} \le \delta _{k_{j}} - \frac{G_{k_{j}}^2}{12 (\theta +R_h)^2} \ \ \ \text{ and } \ \ \ G_{k_{j}} \ge \delta _{k_{j}} \ . \end{aligned}$$

Let \(d_j := \delta _{k_{j}}\) and \(g_j := G_{k_{j}}\) for all \(j\ge 0\), then the nonnegative sequences \(\{ d_j \}_{j\ge 0}\) and \(\{ g_j \}_{j\ge 0}\) satisfy for all \(j\ge 0\):

$$\begin{aligned} d_{{j+1}} \le d_{{j}} - \frac{g_{{j}}^2}{12 (\theta +R_h)^2} \ \ \ \text{ and } \ \ \ g_{{j}} \ge d_{{j}} \ . \end{aligned}$$

Thus \(\{ d_j \}_{j\ge 0}\) and \(\{ g_j \}_{j\ge 0}\) satisfy the hypotheses of the following elementary sequence proposition using \(M=12(\theta +R_h)^2\). (This proposition is a slight extension of the standard sequence property for Frank–Wolfe type sequences, and we provide a proof in Appendix 1.)

Proposition 4

Suppose the two nonnegative sequences \(\{ d_j \}_{j\ge 0}\) and \(\{ g_j \}_{j\ge 0}\) satisfy for all \(j \ge 0\):

  • \(d_{j+1} \le d_j - g_j^2/M\) for some \(M>0\), and

  • \(g_j \ge d_j\).

Then for all \(j \ge 0\) the following holds:

$$\begin{aligned} d_{j} \le \frac{M}{j + \frac{M}{d_0}} < \frac{M}{j} \ , \end{aligned}$$
(2.49)

and

$$\begin{aligned} \min \{g_0, \ldots , g_j\} < \frac{2M}{j} \ . \end{aligned}$$
(2.50)

\(\square \)

Let \(\text {FWGAP}_\varepsilon \) be as given in part 1 of Theorem 1, and let \({\tilde{K}}\) denote the expression on the right-side of (2.14). Suppose Algorithm 1 has been run for \({\tilde{K}}\) iterations. Let \({\tilde{N}}:= \lceil {24(\theta +R_h)^2}/{\varepsilon } \rceil \), whereby it follows from part 1 of Theorem 1 that among the first \({\tilde{K}}\) iterations, the number of iterations where \(G_k \le \theta +R_h\) is at least \(\tilde{N}\). Then, it follows that

$$\begin{aligned} \min \{G_0, \ldots , G_{{\tilde{K}}}\} \le \min \{G_{k_{0}}, \ldots , G_{k_{{\tilde{N}}}}\}&= \min \{g_0, \ldots , g_{{\tilde{N}}}\}\\&< \frac{2M}{{\tilde{N}}} = \frac{24(\theta +R_h)^2}{{\tilde{N}}} \le \varepsilon \ , \end{aligned}$$

where the strict inequality uses Proposition 4. This shows (2.14) and completes the proof of Theorem 1. \(\square \)

Remark 4

[Continued discussion from Remark 2 comparing Theorem 1 with Khachiyan [36] for D-optimal design] Here \(h = \iota _{\Delta _n}\), whereby \(R_h= 0\), and the rate of linear convergence for iterates where \(G_k > \theta \) in (2.11) is order \(O(1-1/(n+\delta _0))\) as compared to the rate of \(O(1-1/n)\) proved in [36] specifically for the D-optimal design problem with exact line-search. Due to the very special structure of the D-optimal design problem, the exact line-search is in closed-form, and it enables Khachiyan [36] to show that the optimality gap improvement bound (2.35) is instead

$$\begin{aligned} \delta _{k+1} \le \delta _{k} - \omega \left( \frac{G_k}{G_k+\theta }\right) \ . \end{aligned}$$
(2.51)

Notice that \(\omega \) is larger than \(\omega ^*\), and all the moreso for larger values of its argument, which corresponds to \(G_k > \theta \); this then leads to an improved guaranteed linear rate of Khachiyan’s algorithm in the case when \(G_k > \theta \). However, we stress that the stronger estimate in (2.51) is rather specific to the D-optimal design problem, and we do not expect it to hold in general for \(f\in \mathcal {B}_\theta (\mathcal {K})\).

3 A mirror descent method for the dual problem

In this section we present a mirror descent method with adaptive step-size applied to the (Fenchel) dual problem of \((P)\). We denote the dual problem of \((P)\) by \((D)\), which is given by:

$$\begin{aligned} (D):\ \ \ -d^*:= -{\min }_{y\in {\mathbb {R}}^m} \;[d(y):= f^*(y) + h^*(-\mathsf {A}^*y)] \ , \end{aligned}$$
(3.1)

where \(f^*\) and \(h^*\) are the Fenchel conjugates of f and h, respectively, and \(\mathsf {A}^*:{\mathbb {R}}^{m}\rightarrow {\mathbb {R}}^{n}\) denotes the adjoint of \(\mathsf {A}\). We observe the following properties related to \((D)\):

  1. 1.

    \(f^*\) is a \(\theta \)-logarithmically-homogeneous self-concordant barrier on the polar of \({\mathcal {K}}\), namely \({\mathcal {K}}^\circ := \{y\in {\mathbb {R}}^m:\langle {y},{u}\rangle \le 0\;\forall \,u\in {\mathcal {K}}\}\), and \(\mathcal {K}^\circ \) is also a regular cone. This follows from [47, Theorem 2.4.4].

  2. 2.

    \(h^*\) is Lipschitz (but not necessarily differentiable) on \({\mathbb {R}}^n\). Indeed, let \(\Vert \cdot \Vert \) be a given norm on the space of x variables, and define \(R_\mathcal {X}:= \max _{x \in \mathcal {X}} \Vert x\Vert \). Since \(\mathcal {X}=\mathsf {dom}\,h\) is compact and h is closed, it follows that \(R_\mathcal {X}<+\infty \) and \(h^*\) is \(R_\mathcal {X}\)-Lipschitz on \({\mathbb {R}}^n\).

  3. 3.

    \(F^*= -d^*\) and \((D)\) has at least one optimal solution. Indeed, since \(\mathsf {A}(\mathcal {X})\cap \mathsf {dom}\,f\ne \emptyset \) and f is continuous on \(\mathsf {dom}\,f\), the strong duality and attainment follows from [49, Theorem 3.51].

Although \((D)\) has a similar structure as \((P)\), certain key differences are unattractive in regards to the application of first-order methods for solving \((D)\). One key difference is that the domain of the dual function d is unbounded, which is in obvious contrast to the bounded domain of the primal function F. This makes it difficult or prohibitive to apply a Frank–Wolfe type method to solve \((D)\). Furthermore, and similar to \((P)\), \(\nabla f^*\) does not satisfy either uniform boundedness or uniform Lipschitz continuity on \({\mathcal {K}}^\circ \), thereby preventing the application of most other types of first-order methods. Nevertheless, below we present a mirror descent method with adaptive step-size for \((D)\). Although the lack of good properties prevents the direct analysis of mirror descent in the usual manner, through the duality of mirror descent and the Frank–Wolfe method we provide a computational complexity bound for this mirror descent method.

Before presenting our mirror descent method for tackling \((D)\), we first present an important application of \((D)\) in the Bregman proximal-point method.

Application in the Bregman proximal-point method (BPPM) [3, 9, 22]. Consider the convex non-smooth optimization problem \(\min _{y\in {\mathbb {R}}_+^m}\, \xi (y)\), where \(\xi :{\mathbb {R}}^m\rightarrow {\mathbb {R}}\) is assumed to be Lipschitz on \({\mathbb {R}}^m\). At the k-th iteration of BPPM, one solves the following problem:

$$\begin{aligned} y^{k+1}:= {\mathrm{arg min}}_{y\in {\mathbb {R}}^m}\; \xi (y) + \beta _k^{-1} D_\zeta (y,y^k) \ , \end{aligned}$$
(3.2)

where \(y^k\) is the k-th iterate of BPPM, \(\beta _k>0\) is the step-size, and \(\zeta :{\mathbb {R}}_{++}^m\rightarrow {\mathbb {R}}\) is the prox function that induces the Bregman divergence

$$\begin{aligned} D_{\zeta }(y,y^k) := \zeta (y) - \zeta (y^k) - \langle {\nabla \zeta (y^k)},{y- y^k}\rangle \ . \end{aligned}$$
(3.3)

As pointed out in [3], one of the standard choices of \(\zeta \) is \(\zeta (y)= -\sum _{i=1}^m \ln (y_i)\), and under this choice, if \(y^0\in {\mathbb {R}}_{++}^m = \mathsf {int}\,{\mathbb {R}}_{+}^m\), then \(y^k\in {\mathbb {R}}_{++}^m \) for all \(k\ge 1\), and so the constraint set \({\mathbb {R}}_+^m\) is automatically taken care of by the prox-function \(\zeta \). From (3.2) and (3.3) we note that (3.2) is in the form of \((D)\) with \(f^*(y) := \zeta (y)= -\sum _{i=1}^m \ln (y_i)\), \(\mathsf {A}=-\mathsf {I}\) and \(h^*(-\mathsf {A}^*y) =\beta _k \xi (y) -\langle {\nabla \zeta (y^{k})},{y}\rangle \).

Our mirror descent method for \((D)\) is shown in Algorithm 2, and is based on using the function \(f^{*}\) itself as the prox function to induce the Bregman divergence:

$$\begin{aligned} D_{f^*}(y,y^k) := f^*(y) - f^*(y^k) - \langle {\nabla f^*(y^k)},{y- y^k}\rangle \ , \end{aligned}$$

(When f is L-smooth, similar ideas have appeared in some previous works, for example Grigas [28], Bach [4], as well as Lu and Freund [41].) In step 1, we compute a subgradient of the dual function d at \(y^k\), which is denoted by \(g^k\). In step 2 we update the primal variables \(z^k\) which are used in the method to adaptively determine the step-size in the next step. In step 3, we compute the step-size \(\gamma _k\), which we will show to be same as the step-size \(\alpha _k\) in the Frank–Wolfe method (shown in Algorithm 1). Equipped with \(y^k\), \(g^k\) and \(\gamma _k\), in step 4 we perform a Bregman proximal minimization step to obtain \(y^{k+1}\). We emphasize that different from the classical mirror descent method (e.g., [43]), in step 4 we use \(f^{*}\) (which is part of the objective function) as the prox function to induce the Bregman divergence \(D_{f^*}(\cdot ,\cdot )\). Also notice that the domain of the sub-problem (3.7) is \(\mathsf {int}\,\mathcal {K}^\circ \) and it is perhaps not so obvious without further analysis that (3.7) has an optimal solution.

At first glance it appears that Algorithm 2 might not be efficient to implement, since it involves working with a system of linear equations to determine \(z^0\) in the Input, and also involves solving the minimization sub-problem in step 4. However, as we show below, Algorithm 2 corresponds exactly to the generalized Frank–Wolfe method in Algorithm 1 for solving \((P)\), which does not involve these computationally expensive steps. This of course implies that Algorithm 2 can be implemented via Algorithm 1 to obtain the primal iterate sequence \(\{x^k\}_{k\ge 0}\), and then the dual iterate sequence \(\{y^k\}_{k\ge 0}\) is determined by the simple rule \(y^k = \nabla f(\mathsf {A}x^k)\) for \(k \ge 0\).

figure b

Theorem 2

Algorithms 1 and 2 are equivalent in the following sense: If the starting points \(x^0\) in Algorithm 1 and \(z^0\) in Algorithm 2 satisfy \(x^0 = z^0\) and \(y^0 =\nabla f(\mathsf {A}x^0)\), then an iterate sequence of either algorithm exactly corresponds to an iterate sequences of the other.

Before we prove this theorem, we first recall some properties of conjugate functions. Let \(w:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\cup \{+\infty \}\) be a closed convex function and let \(w^*\) denote its conjugate function, which is defined by \(w^*(g) := \max _u\{\langle {g},{u}\rangle -w(u)\}\). Then \(w^*:{\mathbb {R}}^p\rightarrow {\mathbb {R}}\cup \{+\infty \}\) is a closed convex function, and

$$\begin{aligned} g \in \partial w(u) \Longleftrightarrow \ u \in \partial w^*(g) \Longleftrightarrow \langle {g},{u}\rangle = w(u) + w^*(g) \ . \end{aligned}$$
(3.8)

Proof of Theorem 2. Let \(\{y^k\}_{k\ge 0}\) be the sequence of iterates of Algorithm 2, and let us also collect the sequences \(\{z^k\}_{k\ge 0}\), \(\{s^k\}_{k\ge 0}\), \(\{g^k\}_{k\ge 0}\), \(\{\bar{G}_k\}_{k\ge 0}\), \(\{\bar{D}_k\}_{k\ge 0}\), and \(\{\gamma ^k\}_{k\ge 0}\) generated in Algorithm 2, and use these sequences to define the following five sequences by the simple assignments \(x^k := z^k\), \(v^k := s^k\), \(\alpha ^k := \gamma ^k\), \(G_k := \bar{G}_k\), \(D_k := \bar{D}_k\), for \(k \ge 0\). We now show that these five sequences correspond to an iterate sequence of Algorithm 1. Our argument will rely on the following identity:

$$\begin{aligned} y^k = \nabla f(\mathsf {A}z^k) \ \ \forall k \ge 0 \ , \end{aligned}$$
(3.9)

which we will prove by induction. Notice that (3.9) is true for \(k=0\) by supposition in the statement of the theorem. Now let us assume that (3.9) holds for a given iteration k, and let us examine the properties of our sequences. We have

$$\begin{aligned} v^k :=s^k\in \partial h^*(-\mathsf {A}^*y^k)&\;\;\Longrightarrow \;\; -\mathsf {A}^*y^k \in \partial h(v^k) \end{aligned}$$
(3.10)
$$\begin{aligned}&\;\;\Longrightarrow \;\; v^k \in {\arg \,min}_{x\in {\mathbb {R}}^n}\; \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}x}\rangle + h(x) \ , \end{aligned}$$
(3.11)

where (3.10) follows from the conjugacy properties (3.8). This shows that \(v^k\) satisfies Step 2 of iteration k in Algorithm 1. We also have

$$\begin{aligned} G_k := \bar{G}_k :=&\langle {g^k},{y^k}\rangle + h(z^k) - h(s^k) \end{aligned}$$
(3.12)
$$\begin{aligned} =&\langle {\nabla f^*(y^k) - \mathsf {A}s^k},{\nabla f(\mathsf {A}z^k) }\rangle + h(z^k) - h(s^k) \ , \end{aligned}$$
(3.13)
$$\begin{aligned} =&\langle {\mathsf {A}x^k - \mathsf {A}v^k},{\nabla f(\mathsf {A}x^k) }\rangle + h(x^k) - h(v^k) \ \end{aligned}$$
(3.14)

satisfies the definition of \(G_k\) in Step 2 of iteration k of Algorithm 1. Similarly, we have

$$\begin{aligned} D_k := \bar{D}_k := \Vert g^k\Vert _{\nabla f^*(y^k)} = \Vert \nabla f^*(y^k) - \mathsf {A}s^k\Vert _{\mathsf {A}z^k }&= \Vert \mathsf {A}z^k - \mathsf {A}s^k\Vert _{\mathsf {A}z^k }\nonumber \\&= \Vert \mathsf {A}x^k - \mathsf {A}v^k\Vert _{\mathsf {A}x^k } \end{aligned}$$
(3.15)

satisfies the the definition of \(D_k\) in Step 2 of iteration k in Algorithm 1, which then implies similarly for the formula for \(\alpha _k\) in (2.8) of Algorithm 1. Last of all, we prove the inductive step of the equality (3.9). From the optimality conditions of the optimization problem in (3.7) we have

$$\begin{aligned}&\nabla f^*(y^{k+1}) = \nabla f^*(y^k) - \gamma _k g_k \;\;\\ \Longrightarrow \;\;&\nabla f^*(y^{k+1}) = \nabla f^*(y^{k}) - \gamma _k g_k = (1-\gamma _k) \mathsf {A}z^k + \gamma _k \mathsf {A}s^k, \end{aligned}$$

where in the last step we use \(\nabla f^*(y^{k}) = \mathsf {A}z^k\) from (3.9) and (3.4). Since \(z^{k+1}:= (1-\gamma _{k})z^{k} + \gamma _{k}s^{k}\), we have \(\nabla f^*(y^{k+1}) = \mathsf {A}z^{k+1}\) implies \(y^{k+1} = \nabla f(\mathsf {A}z^{k+1})\) by conjugacy and completes the proof of (3.9). This then shows that an iterate sequence of Algorithm 2 corresponds to an iterate sequence of Algorithm 1. The reverse implication can also be proved using identical logic as above. \(\square \)

We now leverage the equivalence of Algorithms 1 and 2 to analyze the iteration complexity of Algorithm 2. The following proposition relating the duality gap to the Frank–Wolfe gap will be useful.

Proposition 5

\(G_k = d(y^k) + F(x^k)\) for all \(k \ge 0\).

Proof

We have for all \(k \ge 0\):

$$\begin{aligned} G_k&= \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}x^k}\rangle - \langle {\nabla f(\mathsf {A}x^k)},{\mathsf {A}v^k}\rangle + h(x^k) - h(v^k) \end{aligned}$$
(3.16)
$$\begin{aligned}&= f(\mathsf {A}x^k)+ h(x^k) + f^*(y^k)+ \langle {-\mathsf {A}^*y^k},{ v^k}\rangle - h(v^k) \end{aligned}$$
(3.17)
$$\begin{aligned}&= f(\mathsf {A}x^k)+ h(x^k) + f^*(y^k)+h^*(-\mathsf {A}^*y^k) = d(y^k) + F(x^k) \ . \end{aligned}$$
(3.18)

where in (3.17) we use the conjugacy property in (3.8) and \(y^k =\nabla f(\mathsf {A}x^k)\), and in (3.18) we use \(-\mathsf {A}^*y^k\in \partial h(v^k)\) (by (3.10) and \(s^k = v^k\)) and (3.8). \(\square \)

Since \(G_k\) upper bounds the dual optimality gap \(d_k:= d(y^k) - d(y^{*})\), from part 1 of Theorem 1 we immediately have the following corollary.

Corollary 1

Let \(\text {DGAP}_\varepsilon \) denote the number of iterations required by Algorithm 2 to obtain \(d_k \le \varepsilon \). Then:

$$\begin{aligned} \text {DGAP}_\varepsilon \le \lceil 5.3(\delta _0 + \theta +R_h)\ln (10.6\delta _0) \rceil + \left\lceil \frac{24(\theta +R_h)^2}{\varepsilon } \right\rceil \ . \end{aligned}$$
(3.19)

\(\square \)

We end this section with some remarks. First, if one considers \((D)\) directly then its “primitive” objects are \(f^*\) and \(h^*\), and implementing Algorithm 2 via Algorithm 1 requires knowing \(h = (h^*)^*\) and also the Hessian of \(f = (f^*)^*\) (used to compute the step-size \(\gamma _k\)). While these objects are not part of the primitive description of \((D)\), it follows from conjugacy of self-concordant barriers that \(\nabla f = (\nabla f^*)^{-1}\) and \(H(\cdot ) = H^*(\nabla f(\cdot ))^{-1}\) where \(H^*\) is the Hessian of \(f^*\) (see [51, Theorem 3.3.4]), and therefore one can work directly with \(\nabla f\) and H through the primitive objects \(\nabla f^*\) and \(H^*\). Of course, for standard logarithmically-homogeneous barriers \(f^*\) such as \(f^*(y)= -\sum _{i=1}^m \ln y_i\), where \(y\in {\mathbb {R}}_{++}^m\) or \(f^*(Y) = -\ln \det Y\), where \(Y\in \mathbb {S}_{++}^p\) (and \(m = p(p+1)/2\)), its Fenchel conjugate of f and the Hessian of f are well-known. In addition, for many simple non-smooth functions \(h^*\), e.g., \(h^*(w) = \Vert w\Vert _p\) (\(p\in [1,+\infty ]\)), its Fenchel conjugate either is well-known or can be easily computed. Therefore, for many problems of interest implementing Algorithm 2 via Algorithm 1 is likely to be quite viable. Second, note that the step-size \(\gamma _k\) used in Algorithm 2 is adaptive, and is different from a standard step-size that is monotone decreasing in k, e.g., \(\gamma _k = O(1/\sqrt{k})\) or \(\gamma _k = O(1/{k})\) (cf. [43]). This poses difficulties in attempting to directly analyze Algorithm 2 via standard approaches which involves using \(D_{f^*}(y^k,y^*)\) as the potential function (see e.g., [4]). Nevertheless, the convergence guarantee for \(G_k\) derived from Algorithm 1 enables us to analyze the converge of the dual optimality gap \(d_k\) in Algorithm 2.

4 Computational experiments

In this section we present the results of some basic numerical experiments where we evaluate the performance of our generalized Frank–Wolfe method in Algorithm 1 on the Poisson image de-blurring problem with TV regularization (Application 1 in Sect. 1.1), and also on the PET problem (Application 2 of Sect. 1.1).

4.1 First numerical experiment: Poisson image de-blurring with TV regularization

We consider the Poisson image de-blurring problem with TV regularization as described in Application 1 of Sect. 1.1, where the formulation was presented in equation (1.5). Observe that \(f:u \mapsto -\textstyle \sum _{l=1}^N y_l\ln \big (u_l)\) is neither Lipschitz nor L-smooth on the set \(\{u\in {\mathbb {R}}^N: u = \mathsf {A}x, \ 0 \le x \le Me\}\), and \(\mathrm{TV}(\cdot )\) does not have an efficiently computable proximal operator, which prevents most standard first-order methods from being applicable to tackle (1.5). As a result, very few methods have been proposed to solve (1.5) in the literature. In [15] an ad-hoc expectation-maximization (EM) method was proposed to solve (1.5), however the method is not guaranteed to converge due to the non-smoothness of the function \(\mathrm{TV}(\cdot )\) (see e.g., [50]). Both [33] and [10] proposed methods to solve a “perturbed” version of (1.5) by adding a small offset to each \(\ln (\cdot )\) term, which of course, compromises the original objective function \(\bar{F}(x)\) near \(x=0\). (Such a perturbed version is not needed for the theoretical analysis in [10], but seems to be used to improve practical performance.) Using the generalized Frank–Wolfe method of Algorithm 1, we are able to directly solve formulation (1.5), the details of which we now describe.

4.1.1 Implementation of generalized Frank–Wolfe method for solving (1.5)

We first re-describe the total variation function \(\mathrm{TV}(x)\) by introducing some network definitions and terminology. The TV function penalizes potential differences on the horizontal and vertical grid arcs of the standard \(m \times n\) pixel grid. Considering each pixel location as a node, let \(\mathcal {V}\) denote these nodes enumerated as \(\mathcal {V}= [N] =\{1, \ldots , N = m \times n\}\), and consider the undirected graph \(\mathcal {G}= (\mathcal {V},\mathcal {E})\) where \(\mathcal {E}\) is the set of horizontal and vertical edges of the grid. These horizontal and vertical edges can be described as \(\mathcal {E}_\text {h}\) and \(\mathcal {E}_\text {v}\), respectively, where

$$\begin{aligned} \mathcal {E}_\text {h}:= \{\{l,l+1\}: l \in [N],\;\; l \!\!\! \mod n \ne 0\}\quad \text{ and }\quad \mathcal {E}_\text {v}:= \{\{l,l+n\}: l \in [N-m] \} . \end{aligned}$$

With this notation we have

$$\begin{aligned} \mathrm{TV}(x) = \textstyle {\sum }_{\{i,j\}\in \mathcal {E}} \;\vert x_i-x_j\vert \ . \end{aligned}$$
(4.1)

Based on \(\mathcal {G}\), we define the directed graph \({\widetilde{\mathcal {G}}}= (\mathcal {V},{\widetilde{\mathcal {E}}})\) where \({\widetilde{\mathcal {E}}}\) is obtained by replacing each (undirected) edge in \(\mathcal {E}\) with two directed edges of opposite directions. Then from (4.1) we have

$$\begin{aligned} \mathrm{TV}(x) = \textstyle \sum _{(i,j)\in {\widetilde{\mathcal {E}}}}\;\; \max \{x_i-x_j,0\} = \min&\;\; e^\top r\nonumber \\ {{\,\mathrm{s.t.}\,}}&\;\;r_{ij}\ge x_i-x_j \ ,\;\; \forall \;(i,j) \in {\widetilde{\mathcal {E}}}\end{aligned}$$
(4.2)
$$\begin{aligned} = \min&\;\; e^\top r\;\;\nonumber \\ {{\,\mathrm{s.t.}\,}}&\;\;r\ge B^\top x \ , \end{aligned}$$
(4.3)

where B is the node-arc incidence matrix of the network \({\widetilde{\mathcal {E}}}\).

The Frank–Wolfe subproblem (1.3) associated with (1.5) is:

$$\begin{aligned} {\min }_{x\in {\mathbb {R}}^N}&\;\; \langle {\nabla \bar{f}(x^k)},{x}\rangle + \textstyle (\sum _{l=1}^{N} a_l)^\top x + \lambda \mathrm{TV}(x)\qquad {{\,\mathrm{s.t.}\,}}\;\; 0\le x\le M e \ , \end{aligned}$$
(4.4)

where

$$\begin{aligned} \bar{f}(x): = -\textstyle \sum _{l=1}^{N} y_l\ln (a_l^\top x) \ . \end{aligned}$$
(4.5)

Based on (4.3), we can rewrite (4.4) as the following linear optimization problem:

$$\begin{aligned} \min _{(x,r)\in {\mathbb {R}}^N\times {\mathbb {R}}^{2\vert \mathcal {E}\vert }}&\;\; \textstyle \langle {\nabla \bar{f}(x^k)},{x}\rangle + (\sum _{l=1}^{N} a_l)^\top x + \lambda e^\top r \nonumber \\ {{\,\mathrm{s.t.}\,}}&\;\; 0\le x\le M e, \;\;r\ge B^\top x \ . \end{aligned}$$
(4.6)

This linear problem can be solved as a linear program (LP) using available software, or as a constrained dual network flow problem using available network flow software. In our implementation we solved (4.6) using the primal simplex method in Gurobi [29]. Note that in (4.6), the variable r in the constraint set appears to be unbounded. However, from the form of the objective function and the definition of B, it is easy to see that any optimal solution \((x^*,r^*)\) must satisfy \(0\le r^*\le Me\), and hence (4.6) always admits an optimal solution.

Given the representation of \(\mathrm{TV}(\cdot )\) in (4.3), we can also rewrite (1.5) as

$$\begin{aligned} \min _{x\in {\mathbb {R}}^N, \;\;r\in {\mathbb {R}}^{2\vert \mathcal {E}\vert }}&\;\; \bar{f}(x) + \textstyle (\sum _{l=1}^{N} a_l)^\top x + \lambda e^\top r \nonumber \\ {{\,\mathrm{s.t.}\,}}&\;\; 0\le x\le M e, \;\;r\ge B^\top x. \end{aligned}$$
(4.7)

(Note that the Frank–Wolfe sub-problem (1.3) associated with (4.7) is the same as that associated with (1.5), which is shown in (4.6).) In the following, we will apply our Frank–Wolfe method to solve the reformulated problem (4.7). The advantage of (4.7) lies in that its structure yields an efficient procedure for an exact line-search to compute the step-size \(\alpha _k\). Specifically, given \((x^k,r^k)\), let \((v^k,w^k)\) be an optimal solution of (4.6); then the exact line-search problem using (4.7) is:

$$\begin{aligned} \alpha _k&= {\mathrm{arg min}}_{\alpha \in [0,1]} \; \bar{f}(x^k + \alpha (v^k - w^k))\nonumber \\&\quad + \alpha \big (\textstyle (\sum _{l=1}^{N} a_l)^\top (v^k - x^k) + \lambda e^\top (w^k - r^k)\big ) \ . \end{aligned}$$
(4.8)

The detailed description of the exact line-search procedure for problems of the format (4.8) are presented in Appendix 1. Henceforth, we denote our generalized Frank–Wolfe method for Poisson de-blurring with the adaptive step-size in (2.8) as FW-Adapt, and we denote our Frank–Wolfe method with exact line-search of (4.8) as FW-Exact. Note that since in each iteration, FW-Exact makes no less progress than FW-Adapt in terms of the objective value, the computational guarantees in Theorem 1 (which are proved for FW-Adapt) also apply to FW-Exact.

4.1.2 Results

We tested FW-Adapt and FW-Exact on the Shepp-Logan phantom image [53] of size \(100\times 100\) (hence \(N = 10,000\)). The true image X is shown in Fig. 1a; this image acts as a 2D slice of a representative 3D human brain image, and is a standard test image in image reconstruction algorithms. We generated the blurred noisy image Y shown in Fig. 1b using the methodology exactly as described in Sect. 1.1. For both FW-Adapt and FW-Exact, we chose the starting point \(x^0 = \mathsf{vec}(Y)\), and we set \(\lambda = 0.01\). In order to have a accurate computation of optimality gaps, we used CVXPY [16] to (approximately) find the optimal objective value \(\bar{F}^*\) of (1.5). All computations were performed in Python 3.8 on a laptop computer.

Fig. 1
figure 1

True and noisy \(100\times 100\) versions of the Shepp-Logan phantom image

Figure 2a, b show (in log-log scale) the empirical optimality gaps \(\bar{F}(x^k) - \bar{F}^*\) obtained by FW-Adapt and FW-Exact both as a function of the wall-clock time (in seconds) and as a function of the iteration counter k, respectively. From the figure we observe that FW-Exact converges faster than FW-Adapt, although the difference between the empirical optimality gaps of these two methods gradually lessens. This is expected since with exact line-search, FW-Exact can take a potentially larger step at each iteration than FW-Adapt, and likely therefore has faster convergence. The recovered images computed using FW-Adapt and FW-Exact are shown in Fig. 3a, b, respectively. We observe that the image recovered by FW-Adapt has similar but slightly inferior quality compared to that recovered by FW-Exact. This is consistent with the algorithms’ performance shown Fig. 2, as FW-Adapt has a larger empirical optimality gap at termination compared to that of FW-Exact.

Fig. 2
figure 2

Comparison of empirical optimality gaps of FW-Adapt (FW-A) and FW-Exact (FW-E) for image recovery of the Shepp-Logan phantom image [53] of size \(100\times 100\)

Fig. 3
figure 3

Recovered images computed using FW-Adapt and FW-Exact

4.2 Second numerical experiment: positron emission tomography

We consider the positron emission tomography (PET) problem as described in Application 2 of Sect. 1.1, where the formulation was presented in equation (1.10). We generated artificial data instances of this problem according to the following data generation process. Because the events emitted from each voxel i can only be detected by a small proportion of bins, it follows that the probability matrix P should be highly sparse. We chose a sparsity value of \(5\%\). Given the number of voxels n and the number of bins m, for each \(i \in [n]\) we randomly chose (without replacement) a subset of [m] denoted by \(\mathcal {J}_i\), such that \(\vert \mathcal {J}_i\vert =\lfloor {m/20}\rfloor \). Next, for all \(j\in \mathcal {J}_i\) we then generated i.i.d. entries \({\bar{p}}_{ij} \sim U[0,1]\) and normalized the values to obtain \(p_{ij} := \bar{p}_{ij}/\sum _{j'\in \mathcal {J}_i} \bar{p}_{ij'}\). For all \(j\not \in \mathcal {J}_i\) we set \(p_{ij}=0\). We generated the mean values \(x_i\) for \(i \in [n]\) by first generating i.i.d. values \({\bar{x}}_i \sim N(100,9)\) and then set \(x_i = \vert \bar{x}_i\vert \) for \(i\in [n]\). We then simulated the event counts \(X_i\) at each voxel i by generating \(X_i \sim \mathsf{Poisson}(x_i)\) for \(i\in [n]\). Finally, using P and \(\{X_i\}_{i\in [n]}\), we generated the number of observed events \(Y_j\) detected at bin j by independently generating values \({\tilde{Y}}_j \sim \mathsf{Poisson}(y_j)\) with \(y_j := \textstyle \sum _{i=1}^n p_{ij} X_i\) for \(j\in [m]\). Since \(Y_j\in \{0,1,2,\ldots \}\), by omitting bins for which \(Y_j = 0\) we ensure that \(Y_j\ge 1\) for all \(j\in [m]\) in the PET problem (1.10).

4.2.1 Comparison of algorithms

We solved instances of the PET problem (1.10) using the following five algorithms/variants:

  • FW-Adapt—our generalized Frank–Wolfe method in Algorithm 1 with the adaptive step-size as stated in (2.8),

  • FW-Exact—our generalized Frank–Wolfe method in Algorithm 1 with the adaptive step-size (2.8) replaced by an exact line-search as described in detail in Appendix 1,

  • RSGM-Fixed – relatively smooth gradient method with fixed step-size [5, 42],

  • RSGM-LSBack – relatively smooth gradient method with backtracking line-search [55],

  • EM – a simple algorithm developed by Cover in 1984 specifically for a problem that is equivalent to the normalized PET problem [12].

We excluded mirror descent from our computational comparisons because the sparsity of P violates the basic assumption needed to apply the mirror descent (MD) method to (1.10) (see e.g., [6]). We now review the relevant details of the three algorithms RSGM-F, RSGM-LS and EM.

1. RSGM-Fixed [5, 42]. Although the objective function L of (1.10) is differentiable on its domain, its gradient \(\nabla L\) is not Lipschitz on the constraint set \(\Delta _n\). Therefore standard gradient methods (or accelerated versions) [45] are not applicable. As a remedy for this situation, we can use the relatively-smooth gradient method [5, 42] to solve (1.10), which is designed in particular for problems whose objective functions have certain types of structured non-smooth gradients. Indeed, as shown in Bauschke et al. [5], L is \({\bar{Y}}\)-smooth relative to the reference function

$$\begin{aligned} r(z) := -\textstyle \sum _{i=1}^n \ln (z_i) \ \text{ for } \ z\in {\mathbb {R}}^n_{++}:=\{x\in {\mathbb {R}}^n:x>0\} \ , \end{aligned}$$
(4.9)

where \({\bar{Y}} := \sum _{j=1}^m Y_j\). Specifically, this means that

$$\begin{aligned} \nabla ^2 L(z) \preceq {\bar{Y}} \nabla ^2 r(z) \quad \forall \,z\in {\mathbb {R}}^n_{++}\; \ . \end{aligned}$$
(4.10)

Algorithm 3 describes RSGM specified to solve the PET problem (1.10) using the reference function r with relative-smoothness parameter \({\bar{Y}}\), where \(\mathsf {ri}\,\Delta _n\) denotes the relative interior of \(\Delta _n\) in the Input, and \(D_r(\cdot ,\cdot )\) denotes the Bregman divergence in (4.11) which is defined similarly as in (3.3). We set the step-size \(\alpha _k =1/\bar{Y}\) for all \(k\ge 0\) in the fixed-step-size version of the method. Regarding the sub-problem (4.11) that needs to be solved at each iteration, its optimal solution is unique and lies in \(\mathsf {ri}\,\Delta _n\) since the reference function r is Legendre with domain \({\mathbb {R}}_{++}^n\) (see [5, Section 2.1] for details). Therefore we have \(z^k\in \mathsf {ri}\,\Delta _n\) for all \(k\ge 0\). To efficiently solve (4.11), we used the approach in [42, pp. 341–342], which reduces (4.11) to finding the unique positive root of a strictly decreasing univariate function on \((0,+\infty )\).

figure c

2. RSGM-LSBack [55]. This method is a version of the relatively smooth gradient method shown in Algorithm 3, with the extension that a backtracking line-search procedure is employed to compute the local relative-smoothness parameter \({\bar{Y}}_k\) at \(z^k\) and then the step-size is chosen as \(\alpha _k = 1/{\bar{Y}}_k\) at each iteration. (The details of this procedure can be found in [55, Algorithm 1].) Note that depending on the location of \(z^k\), \({\bar{Y}}_k\) may be (significantly) smaller than the (global) relative-smoothness parameter \(\bar{Y}\).

3. EM [12]. This surprisingly simple algorithm was developed by Cover in 1984 specifically for a problem that is equivalent to the following normalized PET problem (see [12]):

$$\begin{aligned} \max \; \bar{L}(z) := \textstyle \sum _{j=1}^m \bar{Y}_j\ln \big (\sum _{i=1}^n p_{ij} z_i\big )\quad {{\,\mathrm{s.t.}\,}}\;\; {z\in \Delta _n} \ , \end{aligned}$$
(4.12)

where \(\bar{Y}_j:= Y_j/\sum _{j'=1}^m Y_{j'}>0\) for all \(j\in [m]\) (recall that we have assumed without loss of generality that \(Y_j\ge 1\) for all \(j\in [m]\)). The method starts with \(z^0\in \mathsf {ri}\,\Delta _n\) and at each iteration k updates \(z^k\) as follows:

$$\begin{aligned} z^{k+1}_i := z^k_i \nabla _i \bar{L}(z^k) = \sum _{j=1}^m \bar{Y}_j \frac{p_{ij} z^k_i }{\sum _{l=1}^n p_{lj} z_l^k} \ , \quad \forall \,i\in [n] \ . \end{aligned}$$
(4.13)

Note that since \(\bar{Y}_j>0\) for all \(j\in [m]\), we easily see that \(z^k\in \mathsf {ri}\,\Delta _n\) for all \(k\ge 0\).

4.2.2 Results

We report results for problems of dimensions \(n=m=1000\), as we observed that the algorithms’ relative performance were not particularly sensitive to the dimensions m and n. We ran the algorithms using two different choices of starting points: (i) \(z^0 = z^{\text {bd}}\in \mathsf {ri}\,\Delta _n\) chosen very close to \(\mathsf {bd}\,{\mathbb {R}}^n_+\) (the boundary of \({\mathbb {R}}^n_+\)), and (ii) \(z^0 = z^{\text {ct}}:=\tfrac{1}{n}e\) which is the barycenter of \(\Delta _n\). To determine \(z^{\text {bd}}\) we first used a greedy heuristic to determine a low- (or lowest-)cardinality index set \(\mathcal {I}\subseteq [n]\) for which \(\sum _{i\in \mathcal {I}} \, p_j>0\), where \(p_j\) is the j-th column of P. Define \({\bar{\delta }} := 10^{-6}/n\), and we then set

$$\begin{aligned} z^{\text {bd}}_i := \left\{ \begin{array}{ll} {\bar{\delta }} &{} \text{ for } \ \ i\not \in \mathcal {I}\ , \\ (1 - (n-\vert \mathcal {I}\vert ){\bar{\delta }})/\vert \mathcal {I}\vert \ \ &{} \text{ for } \ \ i\in \mathcal {I}\ . \end{array} \right. \end{aligned}$$

Note that this ensures \(z^{\text {bd}}\in \mathsf {ri}\,\Delta _n\). Similar to Sect. (4.1), we used CVXPY [16] to (approximately) compute the optimal objective function value \(L^*\) of (1.10) in order to accurately report optimality gaps. Again, all computations were performed in Python 3.8 on a laptop computer.

Fig. 4
figure 4

Comparison of optimality gaps of FW-Adapt (FW-A), FW-Exact (FW-E), RSGM-Fixed (RSGM-F), RSGM-LSBack (RSGM-LS) and EM, with \(z^0 = z^{\text {bd}}\)

Figures 4 and 5 show plots of the empirical optimality gaps \(L(z^k) -L^*\) of all five methods with the starting points \(z^{\text {bd}}\) and \(z^{\text {ct}}\), respectively. From Fig. 4 we observe the following:

  1. (i)

    The relatively smooth gradient methods, namely RSGM-Fixed and RSGM-LSBack, make very little progress in reducing the empirical optimality gap. For RSGM-Fixed, this is because the relative smoothness parameter \(\bar{Y}= \sum _{j=1}^m Y_j\) is typically very large, implying that the step-size \(1/\bar{Y}\) is very small. Since the starting point \(z^{\text {bd}}\) is very close to \(\mathsf {bd}\,{\mathbb {R}}^n_+\), where the local relative smoothness parameter of the objective function L is close to \(\bar{Y}\), RSGM-LSBack exhibits similar behavior to RSGM-Fixed. (In other words, the backtracking line-search does not help much in this case.)

  2. (ii)

    The two versions of our generalized Frank–Wolfe methods, namely FW-Adapt and FW-Exact, outperform the relatively smooth gradient methods. In addition, FW-Exact converges faster than FW-Adapt initially, but when close to the optimum (for example, when the empirical optimality gap falls below 10), these two methods have the same convergence behavior. Indeed, this observation also appears on the Poisson image de-blurring problem with TV-regularization (see Sect. 4.1.2).

  3. (iii)

    The EM algorithm outperforms all the other methods which is rather surprising at first glance. (And in fact it is unknown from the literature whether the method has any type of non-asymptotic convergence guarantee.) However, we note that this method, which uses a multiplicative form of update (see Eq. (4.13)), is specifically designed for problems with the PET problem structure in (1.10), and it is not clear how to suitably generalize this method to more general problems of the form \((P)\) such as the Poisson image de-blurring problem with TV-regularization in (1.5).

Figure 5 shows the performance of the five algorithms when \(z^0 = z^{\text {ct}}\) is the barycenter. We observe that the performance of the methods is mostly similar to when started at \(z^{\text {bd}}\) except for one significant difference, namely that RSGM-LSBack exhibits significantly faster convergence compared to when started at \(z^{\text {ct}}\). Indeed, RSGM-LSBack outperforms all the other general purpose methods, namely RSGM-Fixed, FW-Adapt and FW-Exact. This is likely because on the “central region” of \(\Delta _n\) the local relative smoothness parameter of L is probably much smaller than the global bound \(\bar{Y}\), and therefore RSGM-LSBack is able to take a much larger steps. This also indicates that the convergence behavior of RSGM-LSBack is likely to be sensitive to the choice of starting point, which is not the case for the other methods (including our generalized Frank–Wolfe methods).

Fig. 5
figure 5

Comparison of optimality gaps of FW-Adapt (FW-A), FW-Exact (FW-E), RSGM-Fixed (RSGM-F), RSGM-LSBack (RSGM-LS) and EM, with \(z^0 = z^{\text {ct}}\)