Smoothed Variable Sample-size Accelerated Proximal Methods for Nonsmooth Stochastic Convex Programs

We consider minimizing $f(x) = \mathbb{E}[f(x,\omega)]$ when $f(x,\omega)$ is possibly nonsmooth and either strongly convex or convex in $x$. (I) Strongly convex. When $f(x,\omega)$ is $\mu$-strongly convex in $x$, we propose a variable sample-size accelerated proximal scheme (VS-APM) and apply it on $f_{\eta}(x)$, the ($\eta$-)Moreau smoothed variant of $\mathbb{E}[f(x,\omega)]$; we term such a scheme (mVS-APM). We consider three settings. (a) Bounded domains. In this setting, mVS-APM displays linear convergence in inexact gradient steps, each of which requires utilizing an inner (SSG) scheme. Specifically, mVS-APM achieves an optimal oracle complexity in SSG steps. (b) Unbounded domains. In this regime, under a weaker assumption of suitable state-dependent bounds on subgradients, an unaccelerated variant (mVS-PM) is linearly convergent. (c) Smooth ill-conditioned $f$. When $f$ is $L$-smooth and $\kappa = L/\mu \gg 1$, we employ mVS-APM, where increasingly accurate gradients $\nabla_x f_{\eta}(x)$ are obtained by VS-APM. Notably, mVS-APM displays linear convergence and near-optimal complexity in inner proximal evaluations (up to a log factor) compared to VS-APM. Moreover, unlike a direct application of VS-APM, this scheme is characterized by larger steplengths and better empirical behavior. (II) Convex. When $f(x,\omega)$ is merely convex but smoothable, by suitable choices of the smoothing, steplength, and batch-size sequences, smoothed VS-APM (or sVS-APM) produces sequences for which the expected sub-optimality diminishes at the rate of $\mathcal{O}(1/k)$ with an optimal oracle complexity of $\mathcal{O}(1/\epsilon^2)$. Finally, sVS-APM and VS-APM produce sequences that converge almost surely to a solution of the original problem.

$f(x) \triangleq \mathbb{E}[\tilde f(x,\xi(\omega))]$, where $\xi : \Omega \to \mathbb{R}^o$, $\tilde f : \mathbb{R}^n \times \mathbb{R}^o \to \mathbb{R}$, $g$ is a closed, convex, and proper deterministic function with an efficient proximal evaluation, $(\Omega, \mathcal{H}, \mathbb{P})$ denotes the associated probability space, and $\mathbb{E}[\cdot]$ denotes the expectation with respect to the probability measure $\mathbb{P}$. Throughout, we refer to $\tilde f(x,\xi(\omega))$ by $\tilde f(x,\omega)$, whereas $\tilde F(x,\omega) \triangleq \tilde f(x,\omega) + g(x)$. We consider settings where $\tilde f(\cdot,\omega)$ is nonsmooth strongly convex/convex in $x$ for every $\omega$, generalizing the focus beyond the structured nonsmooth setting where the "stochastic part" is smooth. Specifically, structured nonsmooth problems require minimizing $f(x) + g(x)$, where $f$ is smooth and $g$ is nonsmooth with an efficient prox evaluation (which allows for capturing constrained problems over closed and convex sets). Amongst the earliest avenues for resolving (1) is stochastic approximation [34, 20], which has proven effective on a breadth of stochastic computational problems, including convex optimization problems. [33] developed an averaging scheme in convex differentiable settings, deriving the optimal convergence rate of $\mathcal{O}(1/\sqrt{K})$ under classical assumptions, where $K$ is the number of iterations. Amongst the cleanest of early complexity requirements for the minimization of expectation-valued $\mu$-strongly convex and convex functions over a closed and convex set $X$ were $\max\{M^2/\mu^2, \|x_0 - x^*\|^2\}(1/\epsilon)$ (to ensure that $\mathbb{E}[\|x_k - x^*\|^2] \le \epsilon$) and $\mathcal{O}(M D_X/\epsilon^2)$ (to ensure that the expected optimality gap is less than $\epsilon$), respectively, where $S(x,\omega)$ denotes a measurable selection from $\partial_x \tilde f(x,\omega)$, $\sup_{x \in X} \mathbb{E}[\|S(x,\omega)\|^2] \le M^2$, and $D_X \triangleq \max_{x \in X} \|x_0 - x\|$. Of these, the former was presented by [38], whereas the latter is the result of an optimal robust constant-steplength SA scheme suggested by [23]. When $f$ is both $L$-smooth and $\mu$-strongly convex, an improved complexity requirement (from a constant-factor standpoint) of $\mathcal{O}(\sqrt{L\|x_0 - x^*\|^2/\epsilon} + \nu^2/(\mu\epsilon))$ was provided by [15].
This contrasts sharply with the deterministic regime where $\mathcal{O}(\log(1/\epsilon))$ and $\mathcal{O}(1/\sqrt{\epsilon})$ steps are required in smooth strongly convex and smooth convex regimes to compute an $\epsilon$-accurate solution ($\epsilon$-solution in terms of mean-squared error) and an $\epsilon$-optimal solution ($\epsilon$-solution in terms of expected sub-optimality), respectively. In structured nonsmooth regimes, there has been an effort to employ the stochastic generalization of an accelerated proximal gradient method to minimize $f + g$ when $f$ is smooth. Reliant on a first-order oracle that produces a sampled gradient $\nabla_x \tilde f(x,\omega)$ and given an $x_0$, our proposed variable sample-size accelerated proximal gradient scheme (VS-APM) (also see [16] and [19]) is stated as follows, where the true gradient is replaced by a sample average $(\nabla_x f(x_k) + \bar w_{k,N_k})$ with batch size $N_k$:
$$y_{k+1} := P_{\gamma_k g}\left(x_k - \gamma_k (\nabla_x f(x_k) + \bar w_{k,N_k})\right), \qquad x_{k+1} := y_{k+1} + \beta_k (y_{k+1} - y_k), \tag{2}$$
where $\bar w_{k,N_k} \triangleq \frac{\sum_{j=1}^{N_k} \left(\nabla_x \tilde f(x_k,\omega_{j,k}) - \nabla_x f(x_k)\right)}{N_k}$, $P_{\eta g}(y) \triangleq \arg\min_x \left\{\frac{1}{2}\|x - y\|^2 + \frac{1}{2\eta} g(x)\right\}$, and $\gamma_k$ and $\beta_k$ are suitably defined steplengths. Our approach produces linearly convergent iterates in strongly convex regimes and achieves a convergence rate of $\mathcal{O}(1/K^2)$ in merely convex and smooth regimes, where $K$ is the total number of iterations, matching the deterministic results by [2] and [24]. The avenue represented by (2) has two key distinctions: (i) increasingly exact gradients through increasing batch sizes $N_k$ of sampled gradients, allowing for progressive variance reduction; (ii) larger (non-diminishing) step-sizes in accordance with deterministic accelerated schemes. Collectively, (i) and (ii) allow for recovering fast (i.e., deterministic) convergence rates (in an expected-value sense) when $N_k$ grows sufficiently fast.
Additionally, such schemes have a more muted reliance on the condition number $\kappa = L/\mu$ (in $\mu$-strongly convex and $L$-smooth regimes); specifically, in accelerated schemes, such dependence reduces to $\sqrt{\kappa}$ in comparison with $\kappa$ in unaccelerated counterparts (cf. [27]).
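As a concrete illustration of update (2), the following Python sketch runs a variable sample-size accelerated proximal iteration on a toy stochastic composite problem; the geometric batch-size rule, the momentum constant, and the $\ell_1$ regularizer weight are illustrative assumptions, not the parameter schedules prescribed later in the text.

```python
import numpy as np

def prox_l1(y, t):
    # Proximal operator of t*||.||_1 (soft-thresholding); stands in for P_{gamma g}.
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def vs_apm(x0, sample_grad, prox, L, mu, K, rho=0.9, rng=None):
    """Sketch of update (2): batch-averaged sampled gradients with geometrically
    increasing batch sizes N_k, constant step gamma = 1/(2L), and a fixed
    momentum constant (illustrative choices, not the paper's exact schedules)."""
    rng = np.random.default_rng(rng)
    gamma = 1.0 / (2.0 * L)
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    x, y_prev = x0.copy(), x0.copy()
    for k in range(K):
        N_k = int(np.ceil(rho ** (-(k + 1))))            # increasing batch size
        g_avg = np.mean([sample_grad(x, rng) for _ in range(N_k)], axis=0)
        y = prox(x - gamma * g_avg, gamma)               # proximal gradient step
        x = y + beta * (y - y_prev)                      # momentum extrapolation
        y_prev = y
    return y

# Toy problem: f(x) = E[0.5*||x - omega||^2], omega ~ N(x_star, I), g = 0.1*||x||_1;
# the minimizer is the soft-thresholded x_star, i.e. (0.9, -1.9, 0).
x_star = np.array([1.0, -2.0, 0.0])
grad = lambda x, rng: x - (x_star + rng.standard_normal(3))
sol = vs_apm(np.zeros(3), grad, lambda y, t: prox_l1(y, 0.1 * t), L=1.0, mu=1.0, K=60, rng=0)
```

Note how the variance of the averaged gradient shrinks as $N_k$ grows, which is what permits the constant steplength.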

Gaps and Contributions.
Unfortunately, when $\tilde f(\cdot,\omega)$ is a nonsmooth strongly convex/convex function, stochastic subgradient schemes, subsequently defined in (SSG), while a de facto standard, generally display poor empirical behavior, since they utilize diminishing steplengths and noisy gradients. We develop two distinct avenues for combining smoothing with acceleration and variance reduction in strongly convex and convex regimes that ameliorate these concerns while achieving optimal rates.
(I) (mVS-APM) for strongly convex nonsmooth $f$. In Section 2, our smoothing framework relies on a variable sample-size accelerated proximal method (VS-APM), which requires smoothness of $f$ while displaying linear convergence and optimal oracle complexity. In two distinct settings, we propose applying (VS-APM) (or an unaccelerated variant) on the Moreau envelope of $F$, denoted by $F_\eta$, where $F_\eta$ is $\frac{1}{\eta}$-smooth and retains the minimizers of $F$. (a) Compact domains. Under the assumption that the domain of $g$ is bounded and $\mathbb{E}[\|S(x,\omega)\|^2] \le M^2$ for all $x \in \mathbb{R}^n$, where $S(x,\omega)$ is a measurable selection from $\partial \tilde f(x,\omega)$, i.e. $S(x,\omega) \in \partial \tilde f(x,\omega)$, we show that (mVS-APM) produces a linearly convergent sequence with an iteration complexity of $\mathcal{O}(\log(1/\epsilon))$ in inexact gradient steps $\nabla_x F_\eta(x_k)$, where increasingly exact gradients $\nabla_x F_\eta(x)$ are obtained by employing a (prox-SSG) scheme. In particular, our variance-reduced scheme obtains increasingly exact gradients by progressively reducing the bias in the gradients (since we utilize an increasing number of SSG steps); such a benefit does not appear in a naive implementation of SSG. Moreover, the overall complexity in subgradient evaluations (and consequently sample or oracle complexity) is $\mathcal{O}(1/\epsilon)$, matching the optimal complexity in subgradient steps achieved by (SSG) schemes. (b) Unbounded domains. When domains are possibly unbounded, assuming that $\mathbb{E}[\|S(x,\omega)\|^2] \le \tilde M^2 \|x\|^2 + M^2$, where $S(x,\omega) \in \partial \tilde F(x,\omega)$, the proposed (unaccelerated) variable sample-size proximal method (mVS-PM) achieves an iteration complexity of $\mathcal{O}(\log(1/\epsilon))$ (in gradient steps with $\nabla_x F_\eta$) and an overall complexity in subgradient steps of $\mathcal{O}(1/\epsilon)$. (II) (sVS-APM) for convex nonsmooth $f$. In this setting, in Section 3, we develop an iterative smoothing-based extension of (VS-APM), denoted by (sVS-APM). By reducing the smoothing and steplength parameters at a suitable rate, $\mathbb{E}[F(y_K) - F(x^*)] \le \mathcal{O}(1/K)$.
Notably, (sVS-APM) produces asymptotically accurate solutions (unlike the scheme by [26], which produces approximate solutions via a fixed smoothing parameter) and is characterized by the optimal oracle complexity of $\mathcal{O}(1/\epsilon^2)$. When $f$ is convex and smooth, we may specialize these results to obtain an optimal rate of $\mathcal{O}(1/K^2)$ and an optimal sample complexity of $\mathcal{O}(1/\epsilon^2)$. When $f$ is deterministic but nonsmooth, (s-APM) matches the rate by [26] but produces asymptotically exact solutions. Additionally, we prove that for suitable (but distinct) choices of steplength and smoothing sequences, (sVS-APM) and (VS-APM) produce sequences that converge a.s. to a solution of (1), a convergence statement that was unavailable thus far, matching deterministic results by [29] and [5], which leverage Moreau smoothing; we provide a result for $(\alpha,\beta)$-smoothable functions (see [1]).
Notation: A vector $x$ is assumed to be a column vector, while $\|x\|$ denotes the Euclidean vector norm, i.e., $\|x\| = \sqrt{x^T x}$. $P_{\eta g}(x)$ denotes the prox with respect to $g$ with prox parameter $\frac{1}{2\eta}$ at $x$. We abbreviate "almost surely" by a.s., and $\mathbb{E}[z]$ denotes the expectation of a random variable $z$. We let $X^*$ denote the set of optimal solutions of (1).
In this section, we develop rate and complexity analysis for nonsmooth strongly convex optimization problems via techniques that combine smoothing, acceleration, and variance reduction. In Section 2.1, we review a linearly convergent variance-reduced accelerated proximal scheme (VS-APM) for smooth stochastic convex optimization; this scheme will serve as our subproblem solver.
In Section 2.2, we present a Moreau-smoothed variant of (VS-APM), referred to as (mVS-APM), which relies on minimizing the Moreau envelope $F_\eta(x)$ of the strongly convex nonsmooth function $F(x)$ by (VS-APM). In Section 2.3, we then derive rate and complexity guarantees for (mVS-APM), where $\nabla_x F_\eta(x)$ is approximated with increasing accuracy by a stochastic subgradient (SSG) scheme. Finally, in Section 2.4, we derive analogous statements when applying an unaccelerated variable sample-size proximal method (mVS-PM) under possibly non-compact domains and a (weaker) state-dependent bound on the subgradient (see Table 1 for a summary of findings).

Background on (VS-APM)
Consider (1), where $f$, $g$, and the initial point $x_0$ satisfy the following assumption.
(i) $f$ is a $\mu$-strongly convex function and $g$ is a closed, convex, and proper deterministic function. In a subset of regimes, we impose an $L$-smoothness assumption on $f$.
We utilize a variable sample-size accelerated proximal scheme (VS-APM), as defined in Algorithm 1, which can process such problems and differs from a standard accelerated proximal method in that we employ an inexact gradient $\nabla_x f(x_k) + \bar w_{k,N_k}$, where the bound on the second moment of $\bar w_{k,N_k}$ diminishes with $k$, a consequence of using variance reduction.
We outline the assumptions on the first and second moments of $\bar w_k$.
(VS-APM) can be shown to achieve linear convergence akin to that by [27] by combining inexact gradients with inexactness that is driven to zero by increasing the sample size used in estimating the gradients. This avenue also allows for achieving the optimal oracle complexity to obtain an $\epsilon$-accurate solution. These differences lead to a slightly modified set of update rules in contrast with those developed by [27] and require that $\gamma_k = 1/(2L)$ rather than $1/L$. This scheme serves as a subproblem solver in subsequent sections, and we now state a lemma and the associated complexity statement for (VS-APM). The proof is similar to that by [27] and is in the Appendix. Importantly, this scheme allows for a possibly biased estimate of the gradient. Lemma 1. Suppose Assumptions 1, 2, and 3(i) hold. Consider the iterates generated by (VS-APM), where $\gamma_k = \frac{1}{2L}$ for all $k \ge 0$, $\kappa = \frac{L}{\mu}$, and $\bar\alpha = \frac{1}{2\sqrt{\kappa}}$. Then the following holds for all $K$.
The following theorem characterizes the iteration and oracle complexity of (VS-APM).
We know of no other result for variance-reduced accelerated proximal schemes in strongly convex (or even convex) smooth regimes that allows for biased oracles. For instance, [35] impose unbiasedness in strongly convex regimes. Next, we show that adding the unbiasedness requirement, i.e. $\mathbb{E}[\bar w_k \mid \mathcal{H}_k] = 0$ a.s. for all $k$, improves the constants in these bounds.
In addition, (VS-APM) needs $\mathcal{O}(\sqrt{\kappa}\log(1/\epsilon))$ steps to obtain an $\epsilon$-accurate solution. The application of (VS-APM) is afflicted by the need for the $L$-smoothness of $f$ as well as the availability of $L$, the Lipschitz constant. Naturally, in many settings, the problem may not be smooth, and even if $L$-smoothness holds, an estimate of $L$ may be unavailable. Consequently, to broaden the reach of the scheme, an approach that obviates the need for $L$ or the imposition of the smoothness assumption is necessitated. This prompts the subsequent smoothed scheme (mVS-APM). This scheme can always be implemented if the strong convexity modulus (denoted by $\mu$) is known but the function is either nonsmooth or smooth with an unknown Lipschitz constant $L$. It is worth noting that estimating $\mu$ is challenging; if $\mu$ is indeed unknown, then in Section 3 we introduce an iteratively smoothed VS-APM (sVS-APM) method which necessitates neither knowledge of the Lipschitz constant $L$, nor smoothness of $f$, nor the strong convexity modulus $\mu$.
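To see why geometrically increasing batch sizes are compatible with the optimal oracle complexity, the following sketch counts steps and total samples for a hypothetical linear rate $\rho$; the contraction factor and batch rule here are illustrative assumptions, not the exact constants of the scheme.

```python
import math

def vsapm_budget(eps, kappa, rho=None):
    """Illustrative budget count for a linearly convergent scheme with
    geometric batches: K = O(sqrt(kappa) log(1/eps)) steps, while the batch
    sizes N_k = ceil(rho**-(k+1)) sum to O(1/eps) total samples.
    (The contraction factor below is an assumed stand-in.)"""
    if rho is None:
        rho = 1.0 - 1.0 / (2.0 * math.sqrt(kappa))   # assumed contraction per step
    K = math.ceil(math.log(1.0 / eps) / math.log(1.0 / rho))
    total = sum(math.ceil(rho ** -(k + 1)) for k in range(K))
    return K, total

K1, S1 = vsapm_budget(1e-2, kappa=100)
K2, S2 = vsapm_budget(1e-4, kappa=100)
```

Squaring the target accuracy roughly doubles the step count (logarithmic growth) while multiplying the sample count by about $1/\epsilon$, consistent with $\mathcal{O}(\log(1/\epsilon))$ gradient steps and $\mathcal{O}(1/\epsilon)$ oracle complexity.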

A Moreau-smoothed Inexact Accelerated Framework (mVS-APM)
When $\tilde f(\cdot,\omega)$ is a nonsmooth strongly convex function for almost every $\omega$, the standard approach lies in utilizing stochastic subgradient schemes (SSG), where convergence relies on choosing square-summable but non-summable steplength sequences. The choice of the parameters in such sequences can have a debilitating impact on performance in some settings (cf. [38]). Specifically, while choosing $\gamma_k = \frac{1}{\mu k}$ minimizes the mean-squared error, over-estimating $\mu$ can have catastrophic impact, as seen in [38, Sec. 5.9, Ex. 5.36]. More generally, such choices are often characterized by poor asymptotic behavior, a consequence that arises in part from the diminishing nature of the steplength sequences and the noisy subgradients. We consider a distinct avenue reliant on minimizing the Moreau envelope of a closed, convex, and proper function $F$ (cf. [22]), denoted by $F_\eta(x)$ and defined next.
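For contrast, here is a minimal sketch of the classical (SSG) baseline with $\gamma_k = 1/(\mu k)$ and iterate averaging; the toy objective and the averaging rule are illustrative assumptions.

```python
import numpy as np

def ssg(x0, sample_subgrad, mu, K, rng=None):
    """Classical stochastic subgradient iteration with gamma_k = 1/(mu*k)
    and a running average of iterates; the noisy directions and vanishing
    steps are exactly what the smoothed schemes are designed to avoid."""
    rng = np.random.default_rng(rng)
    x = x0.copy()
    avg = np.zeros_like(x0)
    for k in range(1, K + 1):
        x = x - (1.0 / (mu * k)) * sample_subgrad(x, rng)
        avg += (x - avg) / k                 # running average of iterates
    return avg

# Toy 1-strongly convex nonsmooth problem: f(x) = E[|x - omega|] + 0.5*x^2,
# omega ~ N(0,1); by symmetry the minimizer is x* = 0.
sub = lambda x, rng: np.sign(x - rng.standard_normal(1)) + x
x_hat = ssg(np.array([5.0]), sub, mu=1.0, K=4000, rng=1)
```

Even on this benign instance, the scheme relies on the correct $\mu$ in the steplength; over-estimating $\mu$ shrinks the steps too quickly, which is the failure mode noted above.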
Notably, this smoothing retains the minimizer of $F(x)$ when $F$ is strongly convex. Consequently, we minimize the $\tilde\mu$-strongly convex and $\frac{1}{\eta}$-smooth function $F_\eta$, which is not necessarily an easy task since computing $\nabla_x F_\eta(x)$ necessitates solving nonsmooth stochastic optimization problems. We adopt an inexact accelerated proximal scheme for minimizing $F_\eta$. In contrast with (SSG) schemes applied to minimizing $F$, we control the smoothness of the outer problem by choosing $\eta$ and utilize (i) larger non-diminishing steplengths, (ii) acceleration, and (iii) increasingly exact gradients, all of which are distinct from (SSG), as shown next. (SSG): $\gamma_k \to 0$ and $u_k$ is a noisy subgradient. (mVS-APM): non-diminishing $\gamma_k$, increasingly exact gradients, and acceleration via $x_{k+1} := y_{k+1} + \beta_k (y_{k+1} - y_k)$.
Importantly, $\nabla_x F_\eta(x_k) + \bar w_{k,N_k}$ represents an approximation of the gradient of the Moreau envelope. The true gradient of the Moreau envelope $F_\eta(x)$ is given by $\nabla_x F_\eta(x) = \frac{1}{\eta}\left(x - \text{prox}_{\eta F}(x)\right)$. But $\text{prox}_{\eta F}(x)$ cannot be computed in finite time since $F$ is a nonsmooth expectation-valued convex function. Instead, via stochastic approximation, we compute an approximate solution of $\text{prox}_{\eta F}(x)$, denoted by $\widehat{\text{prox}}_{\eta F}(x)$, implying that the inexact gradient of $F_\eta(x)$ is given by $\frac{1}{\eta}\left(x - \widehat{\text{prox}}_{\eta F}(x)\right)$. In Algorithm 1, the inexact gradient $\nabla_x F_\eta(x_k) + \bar w_{k,N_k}$ is defined in this fashion. We now proceed to develop (mVS-APM) for compact domains in Section 2.3 and then weaken the compactness requirement in Section 2.4 for an unaccelerated variant.
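A minimal sketch of this inexact-gradient construction, in which the inner proximal problem is solved by a finite number of subgradient steps, is given below; the inner steplength rule and the deterministic sanity check are illustrative assumptions rather than the exact (prox-SSG) specification.

```python
import numpy as np

def inexact_moreau_grad(x, subgrad_sample, eta, N, rng=None):
    """Approximate grad F_eta(x) = (x - prox_{eta F}(x))/eta by running N
    (stochastic) subgradient steps on the (1/eta)-strongly convex inner problem
    min_z F(z) + ||z - x||^2/(2*eta); a stand-in for the (prox-SSG) subroutine,
    with illustrative 1/(mu*j) inner steps."""
    rng = np.random.default_rng(rng)
    mu_inner = 1.0 / eta                       # strong convexity of the inner problem
    z = np.asarray(x, dtype=float).copy()
    for j in range(1, N + 1):
        g = subgrad_sample(z, rng) + (z - x) / eta   # subgradient of the inner objective
        z = z - (1.0 / (mu_inner * j)) * g
    return (x - z) / eta                       # inexact gradient of the envelope

# Deterministic sanity check with F(z) = |z|: prox_{eta F} is soft-thresholding,
# so grad F_eta(3.0) with eta = 1 equals clip(3.0/eta, -1, 1) = 1.
g_hat = inexact_moreau_grad(np.array([3.0]), lambda z, rng: np.sign(z), eta=1.0, N=200)
```

Increasing $N$ reduces the bias of the returned gradient, which is the variance-reduction mechanism exploited by (mVS-APM).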

Linear Convergence of (mVS-APM): Compact Domains
The proximal point $\text{prox}_{\eta F}(x)$, defined in (7), is generally unavailable in closed form and requires exactly solving a strongly convex nonsmooth stochastic optimization problem. Instead, one may solve (6) inexactly using (prox-SSG), a slightly extended variant of the (SSG) scheme [38]. In particular, we propose (mVS-APM) with update rules for $k \ge 1$ in which $\widehat{\text{prox}}_{\eta F}(x_k)$ is obtained by taking a finite number of steps of (prox-SSG), with a sample size of one at each step, using the corresponding update rule for $j = 0, \ldots, N_k - 1$. Next, we state our assumptions and present the main result of this section. The constant in the rate and complexity bounds depends on $\tilde\kappa$; unlike the condition number $\kappa$ in smooth regimes, $\tilde\kappa$ is user-specified and can be relatively small. For instance, $\tilde\kappa = 2$ when $\eta = 1/\mu$. We employ a measurable selection from $\partial \tilde f(x,\omega)$ as a stochastic subgradient in (SSG) and impose the following assumption.
The function $g$ has a compact domain, i.e., there exists $\Delta > 0$ such that $\|x\| \le \Delta$ for any $x \in \text{dom}(g)$.
Minimizing the convergence bound (15) in $\eta$ is possible via a less obvious coercivity and strict convexity claim for the nonsmooth function $C(\eta)$ (see Appendix for proof).
Then the following hold. Remark 2. Lemma 3 allows for claiming that $C(\eta)$ has a unique minimizer $\eta^*$; in fact, such a minimizer can be computed by a standard semismooth Newton method [13]. Fig. 1 provides a schematic of $C(\eta)$ for different values of $\mu$, where $\eta^*$ is computed by a semismooth Newton method. We note that when $\mu$ is larger, $\eta^*(\mu)$ tends to be smaller. In such cases, obtaining an optimal $\eta^*$ is particularly useful. However, when $\mu \ll 1$, we observe that $\eta^*(\mu) \gg 1$; consequently, this leads to a rescaling of the step $\gamma_k$ to $\frac{\gamma_k}{\eta}$, resulting in poorer behavior. Therefore, if $\mu \ll 1$, we employ $\eta = 1$, which has far better empirical behavior, as seen in the numerics.

Linear Convergence of (mVS-PM): Non-compact Domains
In this subsection, we derive rate and complexity guarantees when (VS-PM), an unaccelerated variant of (VS-APM), is applied on a Moreau-smoothed problem under possibly non-compact domains and a (weaker) state-dependent bound on the subgradient (Assumption 5). When the subgradient is characterized by a state-dependent bound, the bound on the cumulative error in the accelerated method builds up due to a recursive relation; see (57) in the Appendix. Hence, in this section, we consider a more general case in which Assumption 5 imposes a state-dependent bound, weakening Assumption 4. By employing an unaccelerated method, we derive a similar oracle complexity to that in Section 2.3. To obtain rate results, we apply (VS-PM), whose update rule takes $N_k$ (stochastic) subgradient steps. Consider the sequence of iterates $\{x_k\}$ generated by applying an inexact gradient scheme on the following strongly convex smooth optimization problem.
In effect, given an $x_0 \in \mathbb{R}^n$, the inexact gradient scheme generates a sequence $\{x_k\}$. Given an $x_k$, we denote the update with the exact gradient by $\hat x_{k+1}$, defined as follows.
In other words, $z_k^*$ denotes the exact proximal point. Since $\text{prox}_{\eta F}(x_k)$ is unavailable in closed form, we may compute increasingly exact analogs. Consequently, at major iteration $k$, the inexact gradient of $F_\eta(x)$ is given by $\frac{1}{\eta}(x_k - z_{k,N_k})$. We proceed to derive a bound on the conditional second moment of $G(z_{k,j}, \omega_{k,j}) \triangleq S(z_{k,j}, \omega_{k,j}) + \frac{z_{k,j} - x_k}{\eta}$. This requires defining the history up to iteration $j$ at outer iteration $k$ by $\mathcal{F}_{k,j}$ as follows.
We now outline an assumption bounding the stochastic subgradient that scales with $\|x\|$, allowing for non-compact domains.
With these constructs, the following are assumed to hold.
Consequently, we have the following bound. Based on Assumption 5 and inspired by a proof technique from [7], amongst others, we derive a rate statement for (SSG) (see Appendix for proof).

Proposition 1. Consider (17), where $\tilde F(\cdot,\omega)$ is a $\mu$-strongly convex function and $S(z,\omega) \in \partial \tilde F(z,\omega)$ for any $z$. Suppose Assumption 5 holds and $\hat a^2 \triangleq 4 + 4M^2$. Then the following holds for $j \ge \bar J$.
We now show the convergence of (mVS-PM) when $\nabla_x F_\eta(x)$ is approximated via (SSG) (see Appendix for proof).
Theorem 3 ((mVS-PM) under state-dependent bound on subgradients). Suppose Assumptions 1 and 5 hold. Consider the iterates generated by (VS-PM) applied on $F_\eta(x)$. Then the following hold. (i) (Rate). For all $k \ge 1$, the following holds.
Remark 3. We observe that when $\rho > p_0$, we achieve the optimal oracle complexity in subgradient steps, akin to the statement in the regime of bounded subgradients. Notably, $\tilde\kappa$ can be controlled since $\eta$ is any nonnegative scalar. For instance, if $\eta = \frac{1}{\mu}$, then $\tilde\kappa = 2$.

Iteratively Smoothed VS-APM for Nonsmooth Convex Problems
Thus far, we have considered settings where $f$ is a strongly convex function. However, there are many instances when the function $f$ is neither smooth nor strongly convex. In fact, in strongly convex regimes, estimating the strong convexity parameter may often be challenging. In such settings, if the function $f$ is subdifferentiable, then subgradient methods provide an avenue for resolving such problems in stochastic regimes but display a significantly poorer rate of convergence. [26] showed that for a subclass of problems, an accelerated gradient scheme may be applied to a suitably smoothed problem, where the smoothing leads to a differentiable problem with Lipschitz continuous gradients (with known Lipschitz constants). If the smoothing parameter is chosen suitably, the convergence rate to an approximate solution can be improved to $\mathcal{O}(1/K)$ from $\mathcal{O}(1/\sqrt{K})$ in terms of expected sub-optimality. However, since the smoothing parameter is kept fixed, Nesterov's approach can provide approximate solutions at best, but not asymptotically exact solutions. Subsequently, [25] considered a primal-dual smoothing technique where the smoothing parameter is reduced at every step, while extensions and generalizations have been considered more recently by [40] and [41]. In this section, we develop an iteratively smoothed variable sample-size accelerated proximal gradient scheme that can contend with expectation-valued objectives and is asymptotically convergent. This can be viewed as a variant of the primal smoothing scheme introduced by [26] in which the smoothing parameter is reduced after every step; this scheme is shown to admit a rate of $\mathcal{O}(1/K)$, matching the finding by [26]; however, our scheme enjoys asymptotic guarantees rather than providing approximate solutions.
After introducing the requisite smoothing machinery in Section 3.1, we derive rate and complexity statements in Section 3.2 for the iteratively smoothed VS-APM (or sVS-APM), recovering the optimal rate of $\mathcal{O}(1/K^2)$ with the optimal oracle complexity of $\mathcal{O}(1/\epsilon^2)$ under smoothness. Finally, in Section 3.3, under suitable choices of smoothing sequences, (sVS-APM) produces sequences that converge a.s. to an optimal solution.

Smoothing Techniques
In this section, we consider minimizing $f(x) + g(x)$ such that $f$ and $g$ are convex and may be nonsmooth, while $g$ has an efficient prox evaluation (or is "proximable") but $f$ is not proximable. Note that this setting is more general than structured nonsmooth problems, where the function $f$ is considered to be convex and smooth. In contrast to the previous section, we assume that $\nabla_x \tilde f_{\eta_k}(x_k, \omega_k)$ is generated by the stochastic oracle, where $\eta_k$ is a smoothing parameter at iteration $k$ whose sequence is diminishing. [3] define an $(\alpha,\beta)$-smoothable function as follows.
There are a host of smoothing functions, depending on the nature of $h$ (see [3] for examples). Recall that when $h$ is a proper, closed, and convex function, the Moreau envelope is defined as above; $h$ is $(1, B^2)$-smoothable when $h_\eta$ is given by the Moreau envelope (see [3]), where $B$ denotes a uniform bound in $x$ on $\|s\|$ for $s \in \partial h(x)$. There are a range of other smoothing techniques, including Nesterov smoothing (see [26]) and inf-conv smoothing (see [1]); our approach is agnostic to the choice of smoothing. In particular, if $\tilde f(\cdot,\omega)$ is a proper, closed, and convex function in $x$ for every $\omega$, then $\tilde f(\cdot,\omega)$ is $(1, B^2)$-smoothable for every $\omega$, where $\tilde f_\eta(\cdot,\omega)$ is a suitable smoothing. In fact, if $\tilde f(\cdot,\omega)$ satisfies the following smoothability assumption, then smoothability of $f$ follows, as shown by Lemma 4. It is worth emphasizing that the smoothing of $f$, denoted by $f_\eta$, is defined as $f_\eta(x) \triangleq \mathbb{E}[\tilde f_\eta(x,\omega)]$, where $\tilde f_\eta(\cdot,\omega)$ is a smoothing of $\tilde f(\cdot,\omega)$.
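As a quick numerical illustration of $(1, B^2)$-smoothability, the Moreau envelope of $h(x) = |x|$ is the Huber function, and the sandwich bounds $h_\eta \le h \le h_\eta + \eta B^2$ (with $B = 1$, since subgradients of $|\cdot|$ lie in $[-1,1]$) can be verified directly:

```python
import numpy as np

def huber(x, eta):
    # Moreau envelope of h(x) = |x|: x^2/(2*eta) on |x| <= eta, |x| - eta/2 outside.
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= eta, x * x / (2 * eta), np.abs(x) - eta / 2)

eta, B = 0.5, 1.0                     # subgradients of |.| are bounded by B = 1
xs = np.linspace(-3.0, 3.0, 1001)
gap = np.abs(xs) - huber(xs, eta)     # pointwise smoothing error h - h_eta
```

The maximal gap equals $\eta/2$, comfortably within the $\eta B^2$ budget, so shrinking $\eta$ trades smoothness ($1/\eta$-Lipschitz gradients) against approximation error.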
We proceed to develop a smoothed variant of (VS-APM), referred to as (sVS-APM), in which $\nabla_x \tilde f_{\eta_k}(x_k, \omega_k)$ is generated by the stochastic oracle and $\eta_k$ is driven to zero at a sufficient rate (see Algorithm 2).
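A compact sketch of such an iteratively smoothed accelerated loop appears below; the particular schedules $\eta_k = k^{-1/2}$, $\gamma_k = \eta_k/2$, and $N_k = \lceil k^2 \rceil$, as well as the toy objective, are illustrative assumptions satisfying the requirements used later (e.g., $\gamma_k \le \eta_k/2$, $a + b > 1$), not Algorithm 2's exact choices.

```python
import numpy as np

def svs_apm(x0, smoothed_grad, K, a=2.0, c=2.0, rng=None):
    """Iteratively smoothed accelerated sketch: the smoothing eta_k and the
    step gamma_k = eta_k / c are driven to zero, batch sizes N_k grow
    polynomially, and lambda_k follows the usual accelerated recursion.
    (All schedules are illustrative stand-ins for Algorithm 2.)"""
    rng = np.random.default_rng(rng)
    x, y_prev, lam = x0.copy(), x0.copy(), 1.0
    for k in range(1, K + 1):
        eta_k = k ** -0.5                    # diminishing smoothing (b = 1/2)
        gamma_k = eta_k / c                  # enforces gamma_k <= eta_k / 2
        N_k = int(np.ceil(k ** a))           # growing batch size (a + b > 1)
        g_avg = np.mean([smoothed_grad(x, eta_k, rng) for _ in range(N_k)], axis=0)
        y = x - gamma_k * g_avg              # prox step (identity, since g = 0 here)
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0
        x = y + ((lam - 1.0) / lam_next) * (y - y_prev)   # momentum extrapolation
        y_prev, lam = y, lam_next
    return y_prev

# Toy problem: f(x) = E[|x - omega|], omega ~ N(0,1); the minimizer is the
# median x* = 0, and the Huber-smoothed sample gradient is clip((x-omega)/eta, -1, 1).
sg = lambda x, eta, rng: np.clip((x - rng.standard_normal(1)) / eta, -1.0, 1.0)
x_hat = svs_apm(np.array([2.0]), sg, K=60, rng=2)
```

The key structural point is that the steplength tracks the smoothing parameter, so the iteration remains stable as the surrogate problem becomes progressively less smooth.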

Rate and Complexity Analysis
In this subsection, we develop rate and oracle complexity statements for Algorithm 2 when $f$ is $(1, B^2)$-smoothable and then specialize these results to both the deterministic nonsmooth and the stochastic smooth regimes. We begin with a modified assumption.
Note that Assumption 6 represents a set of sufficiency conditions for f to be smoothable; here, we directly assume that f is smoothable to ease the exposition.
Lemma 5. Suppose Assumption 7 holds. Consider the iterates generated by (sVS-APM) on $F(x)$. Suppose Assumption 3 holds for $f_{\eta_k}(x)$. If $\{\gamma_k\}$ is a decreasing sequence and $\gamma_k \le \eta_k/2$, then the following holds for all $K \ge 2$. Proof. By the update rule in Algorithm 2, we have the following. From the optimality condition for (23), $0 \in \partial g(y_{k+1}) + \frac{1}{\gamma_k}(y_{k+1} - x_k) + \nabla_x f_{\eta_k}(x_k) + \bar w_k$. By convexity of $g(x)$, we have that $g(x) \ge g(y_{k+1}) + s^T(x - y_{k+1})$ for all $s \in \partial g(y_{k+1})$. Hence, we obtain the following.
Now, by using Lemma 8, we obtain the next bound. By invoking the convexity of $f_{\eta_k}$ and the Lipschitz continuity of $\nabla_x f_{\eta_k}$, we obtain (25), where the last equality follows from adding and subtracting $\bar w_k$. By adding (24) and (25), we obtain the subsequent inequality, where the last inequality follows from Lemma 8 with a suitable choice of parameters. By setting $x = y_k$ in (26), we obtain one relation; similarly, by letting $x = x^*$, we obtain another. Consequently, (27) can be further bounded as follows. Similarly, we have an analogous bound. We multiply (29) by $(\lambda_k - 1)$ and add the result to (30). Again by using Lemma 8, we may express the terms in (32) accordingly. From the update rule of the algorithm, $x_{k+1} = y_{k+1} + \frac{\lambda_k - 1}{\lambda_{k+1}}(y_{k+1} - y_k)$, we obtain the next relation. By multiplying both sides by $\gamma_k$ and assuming $\gamma_k \le \gamma_{k-1}$, we obtain (35). By assuming $\gamma_k \le \frac{\eta_k}{2}$, we obtain (36). Summing (36) from $k = 1$ to $K - 1$ and taking expectations, we note that the last term on the right is zero (under a zero-bias assumption), leading to (37), where in the last inequality we used the fact that $\|y - x^*\| \le C$ for all $y \in \text{dom}(g)$ and $\frac{k}{2} \le \lambda_k \le k$, which may be shown inductively.
We are now ready to prove our main rate result and oracle complexity bound for (sVS-APM).
By invoking the $(1, B^2)$-smoothability of $f$ and $\eta_K = 1/K$, we have that $F_{\eta_K}(y_{K+1}) \le F(y_{K+1})$ and $-F_{\eta_K}(x^*) \le -F(x^*) + \eta_K B^2$. Hence, the required bound follows from (37). (b) $a = 1$. Recall that the convergence rate is given by the preceding expression; taking limits yields the claim. We again consider two cases for the oracle complexity. (a) $a = 1 + \delta$, where $\delta \in [\delta_L, \delta_U]$. Since $\frac{\tilde C}{K} \le \epsilon$, it follows that $K = \tilde C/\epsilon$. To obtain the oracle complexity, we require $\sum_{k=1}^K N_k$ gradients. Hence, the following holds for sufficiently small $\epsilon$ such that $2 \le \tilde C/\epsilon$. (b) $a = 1$. Computing $K$ such that $\frac{a + b\log(K)}{K} \le \epsilon$ is not immediately obvious, but such a $K$ may be obtained via the Lambert function [8]. For purposes of simplicity, suppose $a = 0$ and $b = 1$. Then we have the following.
By definition of the Lambert function, we have that $e^{W(x)} = \frac{x}{W(x)}$, implying the displayed bound, where the first inequality follows from (3) in [8]. Hence, the oracle complexity for $a = 1$ is $\mathcal{O}\left(\frac{\log^2(1/\epsilon)}{\epsilon^2}\right)$, which is near optimal (where optimal is $\mathcal{O}(1/\epsilon^2)$). We now consider two cases of Theorem 4 for which similar rate statements are available. Case 1. Structured stochastic nonsmooth optimization with $f$ smooth. Consider problem (1), where $f(x)$ is a smooth function. Recall that we considered such a problem in Section 2 for strongly convex $f$; here, we consider the merely convex case. When $f$ is deterministic, accelerated gradient methods, first proposed by [24], and their proximal generalizations, suggested by [2], were characterized by the optimal convergence rate of $\mathcal{O}(1/K^2)$. When $f$ is expectation-valued, [16] presented the first known accelerated scheme for stochastic convex optimization, in which the optimal rate of $\mathcal{O}(1/K^2)$ was shown for the expected sub-optimality error. This rate required choosing the simulation length $K$ and setting $N_k = k^2 K$, which led to the optimal oracle complexity of $\mathcal{O}(1/\epsilon^2)$. However, this method is somewhat different from (VS-APM); in particular, every step requires two prox evaluations (rather than one for VS-APM). [19] developed an accelerated proximal scheme for convex problems with a similar algorithm but allow for state-dependent noise. The weakening of the noise requirement still allows for deriving the optimal rate of $\mathcal{O}(1/K^2)$ but necessitates choosing $N_k = k^3 (\ln k)$. As a consequence, the oracle complexity is slightly poorer than the optimal level and is given by $\mathcal{O}\left(\epsilon^{-2} \ln^2(\epsilon^{-0.5})\right)$. We note that (VS-APM) displays the optimal oracle complexity $\mathcal{O}(\epsilon^{-2})$ by choosing $N_k = k^2 K$, while by choosing $N_k = k^a$ for $a = 3 + \delta$, the oracle complexity can be made arbitrarily close to optimal and is given by $\mathcal{O}(\epsilon^{-2-\delta/2})$. However, (VS-APM) imposes a stronger assumption on the noise, as formalized next.
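Since the threshold $\frac{a + b\log(K)}{K} \le \epsilon$ has no elementary closed-form solution, a short routine can locate the smallest such $K$ numerically; the doubling-plus-bisection strategy below is an illustrative stand-in for evaluating the Lambert function.

```python
import math

def min_iters(a, b, eps):
    """Smallest integer K with (a + b*log(K))/K <= eps, via doubling followed
    by bisection; this numerically mirrors the Lambert-function bound in the
    text (illustrative routine, not the bound from [8])."""
    f = lambda K: (a + b * math.log(K)) / K
    hi = 2
    while f(hi) > eps:          # f is decreasing for the a, b used here
        hi *= 2
    lo = max(1, hi // 2)
    while lo < hi:              # bisect to the first K meeting the threshold
        mid = (lo + hi) // 2
        if f(mid) <= eps:
            hi = mid
        else:
            lo = mid + 1
    return hi

K = min_iters(1.0, 1.0, 1e-3)   # smallest K with (1 + log K)/K <= 1e-3
```

Consistent with the $\mathcal{O}\!\left(\frac{\log^2(1/\epsilon)}{\epsilon^2}\right)$ discussion, $K$ exceeds $1/\epsilon$ by roughly a logarithmic factor.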

Corollary 2 (Rate and oracle complexity bounds with smooth $f$ for (VS-APM)). Suppose Assumptions 2, 3, and 7 hold.
Then the following holds.
Proof. (i) Similar to the proof of Lemma 5, by defining $\delta_k \triangleq F(y_k) - F(x^*)$, we may derive the claimed recursion. Let $N_k = \lceil k^a \rceil \ge \frac{1}{2} k^a$ and $\gamma_k = \gamma$. Then the following holds with $C \triangleq \frac{2\nu^2 \gamma (a-2)}{a-3}$, where the first inequality follows from bounding the summation by the corresponding integral. Suppose $y_{K+1}$ satisfies $\mathbb{E}[F(y_{K+1}) - F(x^*)] \le \epsilon$, implying that $\frac{C}{K^2} \le \epsilon$, i.e., $K = C^{1/2}/\epsilon^{1/2}$. If $\epsilon \le C/2$, then the oracle complexity can be bounded as follows. (ii) Similarly to part (i), we may bound the expected sub-optimality with $\tilde C \triangleq 2\nu^2\gamma + \frac{4C^2}{\gamma}$.
Since $K = C^{1/2}/\epsilon^{1/2}$, the oracle complexity may be bounded analogously. Case 2: Deterministic nonsmooth convex optimization. When the function $f$ in (1) is deterministic but possibly nonsmooth, [26] showed that applying an accelerated scheme to a suitably smoothed problem (with a fixed smoothing parameter) leads to a convergence rate of $\mathcal{O}(1/K)$. In contrast with Theorem 4, utilizing a fixed smoothing parameter leads to an approximate solution at best, and such a scheme is not characterized by asymptotic convergence guarantees. In addition, we observe that the rate statement for the deterministic counterpart of (sVS-APM), denoted by (s-APM), is global (valid for all $k$), while any statement with constant smoothing holds only for the prescribed $K$. Finally, the rate statement obtained with an appropriately chosen constant smoothing and steplength parameter matches that obtained by selecting suitable smoothing and steplength sequences.

Remark 4.
Recalling that $f_\eta(x) \triangleq \mathbb{E}[\tilde f_\eta(x,\omega)]$, using Theorem 7.47 in [38] (interchangeability of the derivative and the expectation), and noting that $\tilde f_\eta(\cdot,\omega)$ is differentiable in $x$ for every $\omega$, we have that $\nabla_x f_\eta(x) = \mathbb{E}[\nabla_x \tilde f_\eta(x,\omega)]$. Therefore, such a gradient estimator is unbiased and our assumption holds. We now derive bounds on the second moments for some common smoothings in Table 2.
Proof. From inequality (34), we have that the following holds.
Dividing both sides of the previous inequality by γ k , we obtain the following relationship.
where in the last inequality we use $b \in (0, 1/2]$. By taking conditional expectations and recalling that $\eta_k = c\gamma_k$, where $c > 1$, we obtain the following.
If $\gamma_k = k^{-b}$ where $b \in (0, 1/2]$ and $N_k = \lfloor k^a \rfloor$ where $a + b > 1$, by Lemma 7, we have that $\sum_{k=1}^{\infty} \tfrac{\gamma_k \nu^2}{N_k} < \infty$ and the following holds for $\eta_k = ck^{-b}$, $c > 1$, and $b \in (0, 1/2]$. Furthermore, from (40), it follows that $\sum_{k=1}^{\infty} \alpha_k = \infty$ for $b \in (0, 1/2]$ and $a + b > 1$. Additionally, for any $b \in (0, 1/2]$, the remaining summability requirements can be verified as follows. Therefore, Lemma 5 can be applied and the claim follows. The next proposition provides a similar a.s. convergence guarantee for (VS-APM) that can accommodate structured nonsmooth optimization where $f(x)$ is a smooth, merely convex function. The proof of this result is similar to that of Proposition 2, but $\delta_k$ in this case is defined as $\delta_k = F(y_k) - F(x^*)$.
$A(\omega) = \bar A + W \in \mathbb{R}^{n\times n}$, and the elements of $W$ have an i.i.d. normal distribution with mean zero and standard deviation (std) 0.1. Similarly, $\beta(\omega) = \bar\beta + w \in \mathbb{R}^n$, where $w$ is a random vector. Since tractable prox evaluations are not available for (41), we compute approximate gradients $\nabla_x f_\eta$ using (SSG). We set $N_k = \lceil \rho^{-k}\rceil$, where $\rho \triangleq 1 - \tfrac{1}{2a\sqrt{\kappa}}$ and $a = 2.01$. Using a budget of 1e5 and 10 replications, we provide results in Table 3 (L), while Figure 2 shows the behavior of (mVS-APM) with different smoothing parameters $\eta$ versus (SSG). When the strong convexity modulus $\mu$ is small, (mVS-APM) performs significantly better than (SSG) and is far more stable. For instance, when $\eta = 1$, (mVS-APM) terminates with an empirical error of approximately 4.8e-3 and 5.5e-3 for $\mu = 1$ and $\mu =$ 1e-4, while the corresponding errors for (SSG) are 7.8e-3 and 6.3. As one can see, $\eta = 1$ appears to be a reasonable practical choice for (mVS-APM) across problem settings. Note that in this table, $\eta^*$ is chosen according to Lemma 3, where we note that as $\mu \to 1$, the benefit of utilizing $\eta^*$ is muted. Next, we consider the unconstrained variant of (41), where $x \in \mathbb{R}^n$. Since the subgradient is unbounded, we use the unaccelerated method (mVS-PM). In Table 3 (R), the behavior of (mVS-PM) is compared with (SSG) for different choices of $\mu$. As suggested after Theorem 3, we set $\eta = \tfrac{1}{\mu} + 10^{-3} > \tfrac{1}{\mu}$.
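The mechanism by which (mVS-APM)/(mVS-PM) obtain approximate gradients can be sketched as follows: since $\nabla f_\eta(x) = (x - \mathrm{prox}_{\eta f}(x))/\eta$, an inner subgradient scheme on the strongly convex proximal subproblem supplies an inexact gradient. The snippet below is a minimal sketch, assuming a deterministic scalar $f(u) = |u|$ and a weighted-averaging subgradient method as the inner (SSG)-style solver; the step-size rule and iteration count are illustrative, not the paper's:

```python
def inner_ssg(f_subgrad, x, eta, iters=2000):
    """Approximate prox_{eta f}(x) = argmin_u { f(u) + ||u - x||^2 / (2*eta) }
    via a subgradient method with weighted averaging on the (1/eta)-strongly
    convex inner problem (an (SSG)-style inner loop)."""
    u = float(x)
    weighted_sum, weight_total = 0.0, 0.0
    mu = 1.0 / eta                         # strong convexity modulus of inner problem
    for t in range(1, iters + 1):
        g = f_subgrad(u) + (u - x) / eta   # subgradient of the inner objective
        u -= 2.0 / (mu * (t + 1)) * g      # classical O(1/(mu*t)) step sizes
        weighted_sum += t * u              # t-weighted averaging of iterates
        weight_total += t
    return weighted_sum / weight_total

def moreau_grad(f_subgrad, x, eta, iters=2000):
    """Inexact gradient of the Moreau envelope: (x - prox_{eta f}(x)) / eta."""
    return (x - inner_ssg(f_subgrad, x, eta, iters)) / eta
```

For $f(u) = |u|$, $\eta = 1$, and $x = 2$, the exact prox is the soft-thresholded value $1$, so the returned Moreau gradient is approximately $1$.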
In Table 4, we compare (mVS-APM) with (SSG) for different choices of the standard deviation of the noise and the dimension ($n$). In Table 4 (L), we set $\mu = 0.1$ and $n = 20$, while in Table 4 (R), we set $\mu = 0.1$ and the std. dev. to 0.1. We run both schemes with a total budget of 1e5 subgradient evaluations and 10 replications, and observe that (mVS-APM) outperforms (SSG).
where $\omega_i$ are i.i.d. normal random variables with mean zero and variance one and $v_i, s_i \in (0, 1)$. Table 5 shows similar behavior as in Example 1. In Table 6, we compare (mVS-APM) with (SSG) for different choices of std. dev. and dimension ($n$). In Table 6 (L), we set $\mu = 0.1$ while $n = 20$ and in

(sVS-APM).
Convex and smoothable f. Example 4. In this setting, we compare the performance of (sVS-APM) for merely convex problems on Example 2 with $\mu = 0$. The $\delta$-smoothed approximation of $\varphi(t)$ provided by [3] is given by $\varphi_\delta(t) = \delta \log\left(\sum_{i=1}^m e^{(v_i + s_i t)/\delta}\right)$. In Table 7, we generate 20 replications for (sVS-APM) with fixed and diminishing smoothing sequences with $\eta_k = \delta_k/2$, $N_k = \lceil k^{3.001}\rceil$, and a sampling budget of 1e6. In Figure 3, we compare trajectories for (sVS-APM) with those for constant smoothing for $n = 200$.
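The log-sum-exp smoothing above can be evaluated directly; the following fragment (variable names are ours) computes $\varphi_\delta$ in a numerically stable shifted form and illustrates the standard uniform bound $\varphi(t) \le \varphi_\delta(t) \le \varphi(t) + \delta\log m$:

```python
import math

def phi_max(t, v, s):
    """Nonsmooth piecewise-linear max: phi(t) = max_i (v_i + s_i * t)."""
    return max(vi + si * t for vi, si in zip(v, s))

def phi_delta(t, v, s, delta):
    """delta-smoothed log-sum-exp approximation of phi(t):
    phi_delta(t) = delta * log(sum_i exp((v_i + s_i * t)/delta)),
    computed with a max-shift for numerical stability."""
    m = phi_max(t, v, s)
    z = sum(math.exp((vi + si * t - m) / delta) for vi, si in zip(v, s))
    return m + delta * math.log(z)
```

A smaller $\delta$ tightens the approximation (the gap is at most $\delta\log m$) at the price of a larger smoothness constant, which is exactly the trade-off the smoothing sequence $\{\delta_k\}$ negotiates.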

Key observations. The empirical behavior of (sVS-APM) appears to be better on this test problem. One rationale may be drawn from noting that (sVS-APM) allows for larger steplengths early on (since $\eta_k \le \delta_k$), while in the fixed smoothing technique, $\eta_k \le \delta$ (where the fixed $\delta$ may be quite small). This can be seen in the trajectories, where early progress by the iterative smoothing scheme can be observed. A larger $\delta_k$ allows for larger steplengths but leads to a coarser approximation of the original problem, while a smaller $\delta_k$ leads to poorer progress but better approximations (see Table 7 and Figure 3). 4. a.s. convergence. Next, we implemented (sVS-APM) on the stochastic utility problem with $n = 20$ and $m = 10$ for different choices of the smoothing sequences. Specifically, we allow $\delta_k \in \{1/k, 1/\sqrt{k}, 1/k^{0.25}\}$ (where $\delta_k = 1/k$ is required for convergence in mean and $\delta_k = 1/k^b$ with $b \in (0, 1/2]$ for a.s. convergence). We employ $N_k = \lceil k^{3.001}\rceil$. For each experiment, the mean of 20 replications and the 95% confidence intervals are plotted in Figures 4 and 5. It can be seen that when $\delta_k \to 0$ at the slower rates mandated by the a.s. convergence result, the confidence bands are tighter; this becomes more apparent in Figure 4, where the variance is 5. Furthermore, our numerical studies have revealed that even for less aggressive choices of $N_k$, such as $N_k = \lceil k^a\rceil$ with $a > 1$, the trajectories show the desired behavior in accordance with Prop. 2.
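A practical detail in these experiments is that the polynomial batch-size rule $N_k = \lceil k^{3.001}\rceil$ exhausts a fixed sampling budget after relatively few outer iterations. The following accounting sketch (our own, not from the paper) computes how many outer steps fit within a budget:

```python
import math

def outer_iterations(budget, a=3.001):
    """Count outer iterations K that fit within a total sampling budget when
    batch sizes grow as N_k = ceil(k^a) (a > 1, as in Prop. 2);
    returns (K, samples_used)."""
    used, k = 0, 0
    while True:
        nk = math.ceil((k + 1) ** a)
        if used + nk > budget:
            return k, used
        used += nk
        k += 1
```

Since $\sum_{k=1}^K k^3 \approx K^4/4$, a budget of 1e6 with $a = 3.001$ admits roughly $K \approx 44$ outer iterations.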

Concluding Remarks
Drawing motivation from the generally poor behavior of (SSG) schemes on general (rather than structured) nonsmooth stochastic convex optimization problems, we develop two sets of accelerated proximal variance-reduced schemes, both of which rely on a variable sample-size accelerated proximal method (VS-APM) for smooth convex problems. In nonsmooth strongly convex regimes, we present three sets of schemes, each of which produces linearly convergent sequences and is characterized by an overall complexity in subgradients (or proximal evaluations in the third case) that is optimal (or near-optimal). First, on compact domains, we propose (mVS-APM), an avenue that requires applying (VS-APM) to the Moreau envelope of $F(x)$, where increasingly exact gradients are computed via an inner (SSG) scheme. Second, on unbounded domains, we apply an unaccelerated variable sample-size proximal method (VS-PM), which also relies on (SSG) for approximating gradients to increasing accuracy. When $\tilde f(\cdot, \omega)$ is smoothable and convex, our smoothed (VS-APM) scheme (or sVS-APM) admits an optimal rate and oracle complexity. Our findings, when specialized to smooth and convex $f$, provide an optimal accelerated rate of $\mathcal{O}(1/K^2)$ with optimal oracle complexity, matching the findings of [16] and [19]. When $f$ is deterministic, our rate matches that obtained by [26] but does so while providing asymptotically convergent schemes. Preliminary numerics suggest that the schemes compare well with existing techniques, both in terms of complexity and in terms of sensitivity to problem parameters.

Appendix
Lemma 7. For any real number $y \ge 1$, we have that $\lfloor y \rfloor \ge \tfrac{1}{2} y$.
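Reading the lemma with the floor function (the natural interpretation, since it is used to lower-bound batch sizes via $\lfloor k^a\rfloor \ge \tfrac12 k^a$), a quick numerical sanity check:

```python
import math

def floor_half_bound(y):
    """Lemma 7 (floor reading): for any real y >= 1, floor(y) >= y / 2.
    Worst case is y just below 2, where floor(y) = 1 >= y/2 < 1."""
    return math.floor(y) >= 0.5 * y
```

The bound is tight in the limit $y \uparrow 2$ and holds since, for $y \in [n, n+1)$ with $n \ge 1$, we have $\lfloor y \rfloor = n$ and $y < n + 1 \le 2n$.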
Lemma 8. Given a symmetric positive definite matrix $Q$, we have the following for any

Lemma 9. Suppose Assumptions 1 and 3(i) hold. Furthermore, suppose $\gamma_k = 1/(2L)$ for all $k$.
Proof. Since $y_{k+1} \triangleq \arg\min_x \psi_k(x)$, $\nabla_x \psi_k(x)$ may be expressed as $\nabla_x \psi_k(x) = \nabla_x f(x_k) + 2L(x - x_k) + \bar w_{k,N_k}$. By the optimality condition of (42), we have $0 \in \partial g(y_{k+1}) + \nabla \psi_k(y_{k+1})$. Hence, by the convexity of $g$, we obtain the following. Consequently, by using the definitions of $\psi_k(x)$ and $h(x)$, we have the following. Since $f$ is a $\mu$-strongly convex function, the next inequality holds.
From the definition of $h(x_k)$, the identity $L\|y_{k+1} - x_k\|^2 = \tfrac{1}{4L}\|h(x_k)\|^2$, and inequality (43), we have the following, where (45) follows from the definition of $h(x_k)$ and (46) follows by using the fact that the next relation holds, where (47) follows from $2a^{T}b + \|a\|^{2} \ge -\|b\|^{2}$. By substituting (47) in (46), the result follows.
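The elementary inequality invoked for (47) is just the expansion of a square; for completeness:

```latex
0 \le \|a + b\|^{2} = \|a\|^{2} + 2a^{T}b + \|b\|^{2}
\quad\Longrightarrow\quad
2a^{T}b + \|a\|^{2} \ge -\|b\|^{2}.
```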
It is worth emphasizing that in the proof of Lemma 9, we employ a simple bound to ensure that the term $\bar w_{k,N_k}^T (y_{k+1} - x_k)$ does not appear in the final bound. Instead, the term $\|\bar w_{k,N_k}\|^2$ emerges, and this allows for deriving the optimal (rather than a sub-optimal) oracle complexity. Next, we define a set of parameter sequences that form the basis for updating the iterates.
Definition 2 (Definition of $v_k$, $\alpha_k$, $\tau_k$). Given $v_0$, $\tau_0$, the sequences $\{v_k, \tau_k, \alpha_k\}$ are defined as follows. We employ this set of parameters in showing that the update rule (3) in Algorithm 1 can be recast using the parameters $\tau_k$, $\alpha_k$, and $v_k$. This observation is crucial in the analysis of the update.
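A sketch of how such a parameter sequence can be generated numerically (our own reconstruction from the coupling $\tau_{k+1} = (1-\alpha_k)\tau_k + \tfrac12\alpha_k\mu = 2L\alpha_k^2$ that appears in the subsequent proofs; the routine simply takes the positive root of the resulting quadratic in $\alpha_k$):

```python
import math

def next_alpha(tau_k, L, mu):
    """Positive root of 2*L*a^2 = (1 - a)*tau_k + 0.5*a*mu, i.e. of
    2*L*a^2 + (tau_k - mu/2)*a - tau_k = 0 (quadratic formula)."""
    b = tau_k - 0.5 * mu
    return (-b + math.sqrt(b * b + 8.0 * L * tau_k)) / (4.0 * L)

def parameter_sequence(tau1, L, mu, K):
    """Generate {tau_k, alpha_k} via the coupled recursion
    tau_{k+1} = (1 - alpha_k)*tau_k + 0.5*alpha_k*mu = 2*L*alpha_k^2."""
    taus, alphas = [tau1], []
    for _ in range(K):
        a = next_alpha(taus[-1], L, mu)
        alphas.append(a)
        taus.append(2.0 * L * a * a)
    return taus, alphas
```

The fixed point of this recursion is $\tau^* = \mu/2$ with $\alpha^* = \sqrt{\mu/(4L)} = \tfrac{1}{2\sqrt{\kappa}}$, the familiar limiting steplength parameter of accelerated schemes for strongly convex problems.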
(ii) Suppose $\alpha_k = \tfrac{1}{\lambda_k}$ for all $k$. Then the update rule (1b) in Algorithm 1 with $\sigma_k \triangleq (1 - \tfrac{1}{4\kappa})\lambda_{k+1}$ for all $k$ is equivalent to the following. Proof. (i) The update rule on the right in (i) can be recast as follows. Now, by substituting the expression for $v_k$ from (51) in (48) and recalling that $\tau_{k+1} = (1 - \alpha_k)\tau_k + \tfrac{1}{2}\alpha_k\mu = 2L\alpha_k^2$ and $h(x_k) = 2L(x_k - y_{k+1})$, we obtain the following sequence of equalities.
We now show that the update rule for $x_{k+1}$ on the left is equivalent to that on the right in (i). From (49) and (50), and by choosing $\alpha_k = \tfrac{1}{\lambda_k}$, we have the following. From the update rule for $\lambda_k$, we obtain (55). By substituting (55) in (54), we obtain the expression involving $\alpha_k(1-\alpha_k)$. Hence, (53) can be written as follows.
We now utilize the previous lemma in defining an auxiliary function sequence $\{\phi_{k+1}(x)\}$ and a sequence $\{p_k\}$. These sequences form the basis for carrying out the final rate analysis. Let $\phi_1(x) \triangleq F(y_1) + \tfrac{\tau_1}{2}\|x - x_0\|^2$ and $p_1 = 0$. If $\phi_k(x)$ and $p_k$ are defined as follows for $k \ge 1$: Proof. We begin by showing that $\nabla^2 \phi_k(x) = \tau_k I$, where $I$ denotes the identity matrix. For $k = 1$, $\nabla^2 \phi_1(x) = \tau_1 I$. Suppose this holds for $k$; we proceed to show that it holds for $k + 1$. By choosing $\tau_{k+1} = (1 - \alpha_k)\tau_k + \tfrac{1}{2}\alpha_k\mu$, the required claim follows. Next, we show that the sequence $\phi_k(x)$ can be written as $\phi_k(x) = \phi_k^* + \tfrac{\tau_k}{2}\|x - v_k\|^2$, where $\phi_k^* = \min_x \phi_k(x)$ and $v_k = \arg\min_x \phi_k(x)$. Since $\phi_{k+1}(x)$ is a convex quadratic function by definition, we may represent it as $\phi_{k+1}(x) = a + b^T x + \tfrac{\tau_{k+1}}{2}\|x - v_{k+1}\|^2$, and (59) has been shown to be true for all $k$. Next, we proceed to obtain the recursive rules for $v_{k+1}$ and $\phi^*_{k+1}$. By using the optimality conditions for the unconstrained strongly convex problem $\min_x \phi_k(x)$, we obtain the following. By using equations (56) and (59), we obtain the next relation, and the expression on the right can be further simplified as follows. Next, we inductively prove that $\phi_k^* \ge F(y_k) - p_k$, where $p_k$ is defined in (57). This holds for $k = 1$, where $p_1 = 0$. Assuming it is true for $k$, we prove it holds for $k + 1$ by invoking Lemma 9 for $x = y_k$, where the last inequality follows from noting that terms (a) and (b) are zero, by recalling that $2L\alpha_k^2 = \tau_{k+1}$ and the definition of $x_k$. Before analyzing the rate of convergence, we proceed to examine the limiting behavior of the sequence $\{\lambda_k\}$ and show that $\lambda_k \to \sqrt{\kappa}$, where $\kappa$ denotes the condition number of the problem.
Proof. First, by induction, we show that the sequence $\{\lambda_k\}$ is bounded above by $2\sqrt{\kappa}$. By assumption, $\lambda_1 \le 2\sqrt{\kappa}$; we assume $\lambda_k \le 2\sqrt{\kappa}$ and proceed to show that $\lambda_{k+1} \le 2\sqrt{\kappa}$. Second, we show that the sequence $\{\lambda_k\}$ is increasing, i.e., $\lambda_{k+1} \ge \lambda_k$, which can be written equivalently by replacing the recursive rule for $\lambda_{k+1}$ as follows. Since the sequence is increasing and bounded above, its limit exists; suppose $\lim_{k\to\infty} \lambda_{k+1} = \lambda$. We are now in a position to provide our main proposition, which provides a bridge towards deriving rate statements and oracle complexity bounds.
Proof of Lemma 1.
Proof. We have the following. By rearranging terms and setting $x = x^*$ in the inequality above, we obtain the following sequence of inequalities. By using Lemma 11 and (62), we may obtain the next bound, where we used the fact that $\tau_1 = \mu$ and $\alpha_k \in [\bar\alpha, 1)$. Next, we derive a bound on $\mathbb{E}[p_k]$. By definition, we have the following. By taking expectations and invoking Assumptions 1 and 3(i), we obtain (64). By substituting (64) in (63), we obtain the desired result.
Proof of Theorem 1.
(ii) We observe that, for $\eta > 0$, the following holds. Therefore, we have that $Q(\eta)$ is a.e. twice differentiable and its Clarke generalized gradient and Hessian are defined as follows.
We now proceed to show that $H > 0$ for all $H \in \partial^2 C(\eta)$ and for all $\eta > 0$. Case 1: $0 < \eta < \bar\eta$. In this setting, $Q'(\eta) = Q''(\eta) = 0$. It follows that $\partial^2 C(\eta)$ is a singleton given by the scalar $H$, and it suffices to show that $H > 0$. This follows as shown next.
$\le t_0 e_0 + (cJ + 3)d_k \triangleq t_0 e_0 + \tilde d_k$, where (75) follows from $\sum_{j=1}^{\infty} \tfrac{1}{(j+1)\log^2(j+1)} \le 3$. Next, we derive a bound on $e_0 = \mathbb{E}[\|z_{k,0} - z^*\|^2]$, where the last inequality is a result of $x_k$ being $\mathcal{F}_k$-measurable and the nonexpansivity of the proximal operator. Similarly, $d_k$ can be bounded as follows.
Proof of Theorem 3.
where (76) follows from $\gamma_k = \eta$. By Prop. 1, the first term on the right can be bounded as follows, where $N_k$ denotes the number of stochastic subgradient steps taken at major iteration $k$. Then, by taking unconditional expectations, we have the next bound. Let $p_k \triangleq (1+\delta)q + \tfrac{(1+1/\delta)\hat a^2}{N_k}$ and $N_k = N_0\rho^{-k}$ for $k \ge 0$, where $N_0 > \tfrac{(1+1/\delta)\hat a^2}{1-(1+\delta)q}$. Note that $p_0 < 1$ and $\{p_k\}$ is a decreasing sequence based on the choice of $N_0$ and $\{N_k\}$. We consider two cases. Case (a). Let $\rho = p_0$ with $\rho \in (0, 1)$. In this instance, we obtain the following result.
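The batch-size rule and the contraction sequence $\{p_k\}$ above can be sketched as follows; the particular constants $q$, $\hat a$, and $\delta$ supplied in the usage are placeholders, not values from the paper:

```python
import math

def batch_schedule(q, a_hat, delta, K):
    """Geometric batch-size rule from the proof: N_k = ceil(N_0 * rho^{-k})
    with rho = p_0 (Case (a)), and N_0 chosen (strict inequality) so that
    p_k = (1 + delta)*q + (1 + 1/delta)*a_hat**2 / N_k starts below one."""
    c = (1.0 + 1.0 / delta) * a_hat ** 2
    N0 = math.ceil(c / (1.0 - (1.0 + delta) * q)) + 1   # enforce N_0 strictly large enough
    rho = (1.0 + delta) * q + c / N0                    # rho = p_0, lies in (0, 1)
    Ns = [math.ceil(N0 * rho ** (-k)) for k in range(K)]
    ps = [(1.0 + delta) * q + c / n for n in Ns]
    return Ns, ps
```

With geometrically growing $N_k$, each $p_k$ stays below $p_0 < 1$, which is what drives the linear (geometric) decrease of the error recursion.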