Solving variational inequalities with Stochastic Mirror-Prox algorithm

In this paper we consider iterative methods for stochastic variational inequalities (s.v.i.) with monotone operators. Our basic assumption is that the operator possesses both smooth and nonsmooth components. Further, only noisy observations of the problem data are available. We develop a novel Stochastic Mirror-Prox (SMP) algorithm for solving s.v.i. and show that with the convenient stepsize strategy it attains the optimal rates of convergence with respect to the problem parameters. We apply the SMP algorithm to Stochastic composite minimization and describe particular applications to Stochastic Semidefinite Feasability problem and Eigenvalue minimization.

1. Introduction.Let Z be a convex compact set in Euclidean space E with inner product •, • , • be a norm on E (not necessarily the one associated with the inner product), and F : Z → E be a monotone mapping: We are interested to approximate a solution to the variational inequality (v.i.) find z * ∈ Z : F (z), z * − z 0 ∀z ∈ Z (1.2) associated with Z, F .Note that since F is monotone on Z, the condition in (1.2) is implied by F (z * ), z − z * 0 for all z ∈ Z, which is the standard definition of a (strong) solution to the v.i.associated with Z, F .The inverse -a solution to v.i. as defined by (1.2) (a "weak" solution) is a strong solution as well -also is true, provided, e.g., that F is continuous.An advantage of the concept of weak solution is that such a solution always exists under our assumptions (F is well defined and monotone on a convex compact set Z).
We quantify the inaccuracy of a candidate solution z ∈ Z by the error Err vi (z) := max u∈Z F (u), z − u ; (1.3) note that this error is always 0 and equals zero iff z is a solution to (1.2).
In what follows we impose on F , aside of the monotonicity, the requirement with some known constants L 0, M 0. From now on, is the norm conjugate to • .
We are interested in the case where (1.2) is solved by an iterative algorithm based on a stochastic oracle representation of the operator F (•). Specifically, when solving the problem, the algorithm acquires information on F via subsequent calls to a black box ("stochastic oracle", SO).At i-th call, i = 0, 1, ..., the oracle gets as input a search point z i ∈ Z (this point is generated by the algorithm on the basis of the information accumulated so far) and returns the vector Ξ(z i , ζ i ), where {ζ i ∈ R N } ∞ i=1 is a sequence of i.i.d.(and independent of the queries of the algorithm) random variables.We suppose that the Borel function Ξ(z, ζ) is such that We call a monotone v.i.(1.1), augmented by a stochastic oracle (SO), a stochastic monotone v.i.(s.v.i.).
To motivate our goal, let us start with known results [5] on the limits of performance of iterative algorithms for solving large-scale stochastic v.i.'s.To "normalize" the situation, assume that Z is the unit Euclidean ball in E = R n and that n is large.In this case, the rate of convergence of a whatever algorithm for solving v.i.'s cannot be better than O(1) L t + M+N √ t .In other words, for a properly chosen positive absolute constant C, for every number of steps t, all large enough values of n and any algorithm B for solving s.v.i.'s on the unit ball of R n , one can point out a monotone s.v.i.satisfying (1.4), (1.6) and such that the expected error of the approximate solution zt generated by B after t steps , applied to such s.v.i., is at least c L t + M+N √ t for some c > 0. To the best of our knowledge, no one of existing algorithms allows to achieve, uniformly in the dimension, this convergence rate.In fact, the "best approximations" available are given by Robust Stochastic Approximation (see [3] and references therein) with the guaranteed rate of convergence O(1) L+M+N √ t and extra-gradient-type algorithms for solving deterministic monotone v.i.'s with Lipschitz continuous operators (see [6,9,10,11]), which attains the accuracy O(1) L t in the case of M = N = 0 or O(1) M √ t when L = N = 0.The goal of this paper is to demonstrate that a specific Mirror-Prox algorithm [6] for solving monotone v.i.'s with Lipschitz continuous operators can be extended onto monotone s.v.i.'s to yield, uniformly in the dimension, the optimal rate of convergence O(1) L t + M+N √ t .We present the corresponding extension and investigate it in details: we show how the algorithm can be "tuned" to the geometry of the s.v.i. in question, derive bounds for the probability of large deviations of the resulting error, etc.We also present a number of applications where the specific structure of the rate of convergence indeed "makes a difference".
The main body of the paper is organized as follows: in Section 2, we describe several special cases of monotone v.i.'s we are especially interested in (convex Nash equilibria, convex-concave saddle point problems, convex minimization).We single out these special cases since here one can define a useful "functional" counterpart Err N (•) of the just defined error Err vi (•); both Err N and Err vi will participate in our subsequent efficiency estimates.Our main development -the Stochastic Mirror Prox (SMP) algorithm -is presented in Section 3. Some general results obout the performance of the SMP are presented in Section 3.2.Then in Section 4 we present SMP for Stochastic composite minimization and discuss its applications to Stochastic Semidefinite Feasability problem and Eigenvalue minimization.All technical proofs are collected in the appendix.
Notations.In the sequel, lowercase Latin letters denote vectors (and sometimes matrices).Script capital letters, like E, Y, denote Euclidean spaces; the inner product in such a space, say, E, is denoted by •, • E (or merely •, • , when the corresponding space is clear from the context).Linear mappings from one Euclidean space to another, say, from E to F , are denoted by boldface capitals like A (there are also some reserved boldface capitals, like E for expectation, R k for the k-dimensional coordinate space, and S k for the space of k × k symmetric matrices).A * stands for the conjugate to mapping A: if A : E → F, then A * : F → E is given by the identity f, Ae F = A * f, e E for f ∈ F, e ∈ E. When both the origin and the destination space of a linear map, like A, are the standard coordinate spaces, the map is identified with its matrix A, and A * is identified with A T .For a norm • on E, • * stands for the conjugate norm, see (1.5).

Preliminaries.
2.1.Nash v.i.'s and functional error.In the sequel, we shall be especially interested in a special case of v.i.(1.2) -in a Nash v.i.coming from a convex Nash Equilibrium problem, and in the associated functional error measure.The Nash Equilibrium problem can be described as follows: there are m players, i-th of them choosing a point z i from a given set Z i .The loss of i-th player is a given function With slight abuse of notation, we use for φ i (z) also the notation φ i (z i , z i ), where z i is the collection of choices of all but the i-th players.Players are interested to minimize their losses, and Nash equilibrium z is a point from Z such that for every i the function φ i (z i , z i ) attains its minimum in z i ∈ Z i at z i = z i (so that in the state z no player has an incentive to change his choice, provided that the other players stick to their choices).
We call a Nash equilibrium problem convex, if for every i Z i is a compact convex set, φ i (z i , z i ) is a Lipschitz continuous function convex in z i and concave in z i , and the function Φ(z) = m i=1 φ i (z) is convex.It is well known (see, e.g., [8]) that setting where ∂ zi φ i (z i , z i ) is the subdifferential of the convex function φ i (•, z i ) at a point z i , we get a monotone operator such that the solutions to the corresponding v.i.(1.2) are exactly the Nash equilibria.Note that since φ i are Lipschitz continuous, the associated operator F can be chosen to be bounded.For this v.i.one can consider, along with the v.i.-accuracy measure Err vi (z), the functional error measure This accuracy measure admits a transparent justification: this is the sum, over the players, of the incentives for a player to change his choice given that other players stick to their choices.
Special cases: saddle points and minimization.An important by its own right particular case of Nash Equilibrium problem is an antagonistic 2-person game, where m = 2 and Φ(z) ≡ 0 (i.e., φ 2 (z) ≡ −φ 1 (z)).The convex case of this problem corresponds to the situation when φ(z 1 , z 2 ) ≡ φ 1 (z 1 , z 2 ) is a Lipschitz continuous function which is convex in z 1 ∈ Z 1 and concave in z 2 ∈ Z 2 , the Nash equilibria are exactly the saddle points (min in z 1 , max in z 2 ) of φ on Z 1 × Z 2 , and the functional error becomes Recall that the convex-concave saddle point problem min z1∈Z1 max z2∈Z2 φ(z 1 , z 2 ) gives rise to the "primal-dual" pair of convex optimization problems (P ) : min where The optimal values Opt(P ) and Opt(D) in these problems are equal, the set of saddle points of φ (i.e., the set of Nash equilibria of the underlying convex Nash problem) is exactly the direct product of the optimal sets of (P ) and (D), and Err N (z 1 , z 2 ) is nothing but the sum of non-optimalities of z 1 , z 2 considered as approximate solutions to respective optimization problems: Finally, the "trivial" case m = 1 of the convex Nash Equilibrium is the problem of minimizing a Lipschitz continuous convex function φ(z) = φ 1 (z 1 ) over the convex compact set Z = Z 1 , In this case, the functional error becomes the usual residual in terms of the objective: In the sequel, we refer to the v.i.(1.2) coming from a convex Nash Equilibrium problem as Nash v.i., and to the two just outlined particular cases of the Nash v.i. as the Saddle Point and the Minimization v.i., respectively.It is easy to verify that in the Saddle Point/Minimization case the functional error Err N (z) is Err vi (z); this is not necessary so for a general Nash v.i.
2.2.Prox-mapping.We once for ever fix a norm • on E; • * stands for the conjugate norm, see (1.5).A distance-generating function for Z is, by definition, a continuous convex function ω(•) : Z → R such that 1. if Z o be the set of all points z ∈ Z such that the subdifferential ∂ω(z) of ω(•) at z is nonempty, then the subdifferential of ω admits a continuous selection on Z o : there exists a continuous on Z o vector-valued function ω ′ (z) such that ω ′ (z) ∈ ∂ω(z) for all z ∈ Z o ; 2. for certain α > 0, ω(•) is strongly convex, modulus α, w.r.t. the norm • : In the sequel, we fix a distance-generating function ω(•) for Z and assume that ω(•) and Z "fit" each other, meaning that one can easily solve problems of the form The prox-function associated with the distance-generating function ω is defined as Note that z c is well defined (since Z is a convex compact set and ω(•) is continuous and strongly convex on Z) and belongs to Z o (since 0 ∈ ∂ω(z c )).Note also that due to the strong convexity of ω and the origin of z c we have in particular we see that Prox-mapping.Given z ∈ Z o , we associate with this point and ω(•) the proxmapping We illustrate the just-defined notions with three basic examples.
Example 1: Euclidean setup.Here E is R N with the standard inner product, • 2 is the standard Euclidean norm on R N (so that • * = • ) and ω(z) = 1 2 z T z (i.e., Z o = Z, α = 1).Assuming for the sake of simplicity that 0 ∈ Z, z c = 0, Ω = max z∈Z z 2 and Θ = 1  2 Ω 2 .The prox-function and the prox-mapping are given by V (z, u) = 1 2 z − u 2 2 , P (z, ξ) = argmin u∈Z (z − ξ) − u 2 .Example 2: Simplex setup.Here E is R N , N > 1, with the standard inner product, z = z 1 := N j=1 |z j | (so that ξ * = max j |ξ j |), Z is a closed convex subset of the standard simplex containing its barycenter, and ω(z) = N j=1 z j ln z j is the entropy.Then It is easily seen (see, e.g., [3]) that here (the latter inequality becomes equality when Z contains a vertex of D N ), and thus Ω √ 2 ln N .The prox-function is and the prox-mapping is easy to compute when Z = D N : Example 3: Spectahedron setup.This is the "matrix analogy" of the Simplex setup.Specifically, now E is the space of N × N block-diagonal symmetric matrices, N > 1, of a given block-diagonal structure equipped with the Frobenius inner product a, b F = Tr(ab) and the trace norm are the eigenvalues of a symmetric N × N matrix a; the conjugate norm |a| ∞ is the usual spectral norm (the largest singular value) of a. Z is assumed to be a closed convex subset of the spectahedron S = {z ∈ E : z 0, Tr(z) = 1} containing the matrix N −1 I N .The distance-generating function is the matrix entropy . When Z = S, it is relatively easy to compute the prox-mapping (see [2,6]); this task reduces to the singular value decomposition of a matrix from E. It should be added that the matrices from S are exactly the matrices of the form a = H(b) ≡ (Tr(exp{b})) −1 exp{b} with b ∈ E. Note also that when Z = S, the prox-mapping becomes "linear in matrix logarithm": if z = H(a), then P (z, ξ) = H(a − ξ).

Stochastic Mirror-Prox algorithm.
3.1.Mirror-Prox algorithm with erroneous information.We are about to present the Mirror-Prox algorithm proposed in [6].In contrast to the original version of the method, below we allow for errors when computing the values of F -we assume that given a point z ∈ Z, we can compute an approximation F (z) ∈ E of F (z).The t-step Mirror-Prox algorithm as applied to (1.2) is as follows: Algorithm 3.1.

At step t, output
The preliminary technical result on the outlined algorithm is as follows.Theorem 3.2.Consider t-step algorithm 3.1 as applied to a v.i.(1.2) with a monotone operator F satisfying (1.4).For τ = 1, 2, ..., let us set for z belonging to the trajectory {r 0 , w 1 , r 1 , ..., w t , r t } of the algorithm, let and let {y τ ∈ Z o } t τ =0 be the sequence given by the recurrence where Err vi ( Finally, when (1.2) is a Nash v.i., one can replace Err vi ( z t ) in (3.5) with Err N ( z t ).

Main result.
From now on, we focus on the case when Algorithm 3.1 solves monotone v.i.(1.2), and the corresponding monotone operator F is represented by a stochastic oracle.Specifically, at the i-th call to the SO, the input being z ∈ Z, the oracle returns the vector F = Ξ(z, ζ i ),, where i=1 is a sequence of i.i.d.random variables, and Ξ(z, ζ) : Z × R N → E is a Borel function.We refer to this specific implementation of Algorithm 3.1 as to Stocastic Mirror Prox (SMP) algorithm.
In the sequel, we impose on the SO in question the following assumption, slightly milder than (1.6): In some cases, we augment Assumption I by the following Assumption II: For all z ∈ Z and all i we have by the Jensen inequality.Remark 3.3.Observe that that the accuracy of Algorithm 3.1 (cf.(3.6)) depends in the same way on the "size" of perturbation ǫ z = F (z)−F (z) * and the bound M of (1.4) on the variation of the non-Lipschitz component of F .This is why, to simplify the presentation, we decided to use the same bound M for the scale of perturbation (3.8).
Remark 3.4.From now on, we assume that the starting point r 0 in Algorithm 3.1 is the minimizer z c of ω(•) on Z. Further, to avoid unnecessarily complicated formulas (and with no harm to the efficiency estimates) we stick to the constant stepsize policy γ τ ≡ γ, 1 τ t, where t is a fixed in advance number of iterations of the algorithm.Our main result is as follows: Theorem 3.5.Let v.i.(1.2) with monotone operator F satisfying (1.4) be solved by t-step Algorithm 3.1 using a SO, and let the stepsizes where M is the constant from (1.4) and Ω is given by (2.3).
When optimizing the bound (3.9) in γ, we get the following Corollary 3.6.In the situation of Theorem 3.5, let the stepsizes γ τ ≡ γ be chosen according to Then under Assumption I one has .12) (see (2.3)).Under Assumptions I, II, one has, in addition to (3.12), for any Λ > 0, In the case of a Nash v.i., Err vi (•) in (3.12), (3.13) can be replaced with Err N (•).

Comparison with Robust
Mirror SA Algorithm.Consider the case of a Nash s.v.i. with operator F satisfying (1.4) with L = 0, and let the SO be unbiased (i.e., µ = 0).In this case, the bound (3.12) reads where The bound (3.14) looks very much like the efficiency estimate (from now on, all O(1)'s are appropriate absolute positive constants) for the approximate solution zt of the t-step Robust Mirror SA (RMSA) algorithm [3] 1) .In the latter estimate, Ω is exactly the same as in (3.14), and M is given by .
Note that we always have M 2M , and typically M and M are of the same order of magnitude; it may happen, however (think of the case when F is "almost constant"), that M ≪ M .Thus, the bound (3.14) never is worse, and sometimes can be much better than the SA bound (3.15).It should be added that as far as implementation is concerned, the SMP algorithm is not more complicated than the RMSA (cf. the description of Algorithm 3.1 with the description of the RMSA).
The just outlined advantage of SMP as compared to the usual Stochastic Approximation is not that important, since "typically" M and M are of the same order.We believe that the most interesting feature of the SMP algorithm is its ability to take advantage of a specific structure of a stochastic optimization problem, namely, insensitivity to the presence in the objective of large, but smooth and well-observable components.
We are about to consider several less straightforward applications of the outlined insensitivity of the SMP algorithm to smooth well-observed components in the objective.where 1) In this reference, only the Minimization and the Saddle Point problems are considered.However, the results of [3] can be easily extended to s.v.i.'s.

Application to
1. X ⊂ X is a convex compact; the embedding space X is equipped with a norm • x , and X -with a distance-generating function ω x (x) with certain parameters α x , Θ x , Ω x w.r.t. the norm • x ; 2. φ ℓ (x) : X → E ℓ , 1 ℓ m, are Lipschitz continuous mappings taking values in Euclidean spaces E ℓ equipped with norms (not necessarily the Euclidean ones) • (ℓ) with conjugates • (ℓ, * ) and with closed convex cones K ℓ .We suppose that φ ℓ are K ℓ -convex, i.e. for any x, x ′ ∈ X, λ ∈ [0, 1], where the notation In addition to these structural restrictions, we assume that for all and certain nonnegative constants L x and M x .3. Functions φ ℓ (•) are represented by an unbiased SO.At i-th call to the oracle, x ∈ X being the input, the oracle returns vectors such that for any x ∈ X and i = 1, 2, ..., x , 1 ℓ m. 2) For a K-convex function φ :  where P jℓ are given p ℓ × q j matrices, and λ max (A) is the maximal eigenvalue of a symmetric matrix A. Observing that for a symmetric p × q matrix A one has where S q = {S ∈ S q + : Tr(S) = 1}.When denoting by Y the set of all symmetric positive semidefinite block-diagonal matrices y = Diag{y 1 , ..., y k } with unit trace and diagonal blocks y j of sizes q j × q j , we can represent (P ) in the form of (4.1), (4.4) with Φ(u) := max (we put A ℓ y = k j=1 P T jℓ y j P jℓ ).The set Y is the spectahedron in the space S q of symmetric block-diagonal matrices with k diagonal blocks of the sizes q j × q j , 1 j k.When equipping Y with the spectahedron setup, we get α y = 1, Θ y = ln( k j=1 q j ) and Ω y = 2 ln( k j=1 q j ), see Section 2.2.Observe that in the simplest case of k = m, p j = q j , 1 j m and P jℓ equal to I p for j = ℓ and to 0 otherwise, the SMMP problem becomes If, in addition, p j = q j = 1 for all j, we arrive at the usual ("scalar") minimax problem with convex real-valued functions φ ℓ .
Observe that in the case of (4.4), the optimization problem (4.1) is nothing but the primal problem associated with the saddle point problem and the cost function in the latter problem is Lipschitz continuous and convex-concave due to the K ℓ -convexity of φ ℓ (•) and the condition A ℓ y + b ℓ ∈ K * ℓ whenever y ∈ Y .The associated Nash v.i. is given by the domain Z and the monotone mapping The advantage of the v.i.reformulation of (4.1) is that F is linear in φ ℓ (•), so that the initial unbiased SO for φ ℓ induces an unbiased stochastic oracle for F , specifically, the oracle We are about to use this oracle in order to solve the stochastic composite minimization problem (4.1) by the SMP algorithm.

4.2.
Setup for the SMP as applied to (4.9).In retrospect, the setup for SMP we are about to present is a kind of the best -resulting in the best possible efficiency estimate (3.12) -we can build from the entities participating in the description of the problem (4.1).Specifically, we equip the space E = X × Y with the norm the conjugate norm clearly is Finally, we equip Z = X × Y with the distance-generating function The SMP-related properties of our setup are summarized in the following ) Combining Lemma 4.1 with Corollary (3.6) we get explicit efficiency estimates for the SMP algorithm as applied to the Stochastic composite minimization problem (4.1).

Application to Stochastic
Semidefinite Feasibility problem.Assume we are interested to solve a feasible system of matrix inequalities where m > 1, X ⊂ X is as in the description of the Stochastic composite problem, and ψ ℓ (•) take values in the spaces E ℓ = S p ℓ of symmetric p ℓ × p ℓ matrices.We equip E ℓ with the Frobenius inner product, the semidefinite cone K ℓ = S p ℓ + and the spectral norm that |A| ∞ is the maximal singular value of matrix A).We assume that ψ ℓ are Lipschitz continuous and K ℓ = S p ℓ + -convex functions on X such that for all x, x ′ ∈ X and for all ℓ one has max x ∈ X, with some known nonnegative constants L ℓ , M ℓ .
We assume that ψ ℓ (•) are represented by an SO which at i-th call, the input being x ∈ X, returns the matrices f ℓ (x, ζ i ) ∈ S p ℓ and the linear maps Given a number t of steps of the SMP algorithm, let us act as follows.
A. We compute the m quantities µ ℓ = ΩxL ℓ √ t + M ℓ , ℓ = 1, ..., m, and set Note that by construction β ℓ 1 and L x /L ℓ β ℓ , M x /M ℓ β ℓ for all ℓ, so that the functions φ ℓ satisfy (4.2) with the just defined L x , M x .Further, the SO for ψ ℓ (•)'s can be converted into an SO for φ ℓ (•)'s by setting Combining Lemma 4.1, Corollary 3.6 and taking into account the origin of the quantities L x , M x , and that A = 1, B = 0 3) , we arrive at the following result: 3) See ( 4 When applying to (4.21) the t-step SMP algorithm with the constant stepsizes γ τ ≡ γ (cf.(3.11) and note that we are in the situation α = Θ = 1), we get an approximate solution z t = ( x t , y t ) such that (3.12) and take into account that we are in the case of Ω = √ 2, while the optimal value in (4.20) is nonpositive, since (4.16) is feasible).Furthermore, if assumptions (4.18.b,c) are strengthened to then, in addition to (4.22), we have for any Λ > 0: Prob max Discussion.Imagine that instead of solving the system of matrix inequalities (4.16), we were interested to solve just a single matrix inequality ψ ℓ (x) 0, x ∈ X.When solving this inequality by the SMP algorithm as explained above, the efficiency estimate would be (recall that the matrix inequality in question is feasible), where x ℓ t is the resulting approximate solution.Looking at (4.22), we see that the expected accuracy of the SMP as applied, in the aforementioned manner, to (4.16) is only by a logarithmic in ℓ p ℓ factor worse: Thus, as far as the quality of the SPM-generated solution is concerned, passing from solving a single matrix inequality to solving a system of m inequalities is "nearly costless".As an illustration, consider the case where some of ψ ℓ are "easy" -smooth and easy-to-observe (M ℓ = 0), while the remaining ψ ℓ are "difficult", i.e., might be non-smooth and/or difficult-to-observe (L ℓ = 0).In this case, (4.23) reads In other words, the violations of the easy and the difficult constraints in (4.16) converge to 0 as t → ∞ with the rates O(1/t) and O(1/ √ t), respectively.It should be added that when X is the unit Euclidean ball in X = R n and X, X are equipped with the Euclidean setup, the rates of convergence O(1/t) and O(1/ √ t) are the best rates one can achieve without imposing bounds on n and/or imposing additional restrictions on ψ ℓ 's.

Eigenvalue optimization via SMP. The problem we are interested in now is
where A 0 , A 1 , ..., A n , n > 1, are given symmetric matrices with common blockdiagonal structure (p 1 , ..., p m ).I.e., all A j are block-diagonal with diagonal blocks A ℓ j of sizes p ℓ × p ℓ , 1 ℓ m.We denote Setting we represent (4.24) as a particular case of the Matrix Minimax problem (4.6), with all functions φ ℓ (x) being affine and X being the standard simplex in X = R n .Now, since A j are known in advance, there is nothing stochastic in our problem, and it can be solved either by interior point methods, or by "computationally cheap" gradient-type methods; these latter methods are preferable when the problem is largescale and medium accuracy solutions are sought.For instance, one can apply the t-step (deterministic) Mirror Prox algorithm from [6] to the saddle point reformulation (4.8) of our specific Matrix Minimax problem, i.e., to the saddle point problem min The accuracy of the approximate solution xt of the (deterministic This efficiency estimate is the best known so far among those attainable with "computationally cheap" deterministic methods.On the other hand, the complexity of one step of the algorithm is dominated, up to an absolute constant factor, by the necessity, given x ∈ X and y ∈ Y , 1. to compute the matrix A 0 + n j=1 x j A j and the vector [Tr(Y A 1 ); ...; Tr(Y A n )]; 2. to compute the eigenvalue decomposition of y.When using the standard Linear Algebra, the computational effort per step is (2) + p (3) ] arithmetic operations.
We are about to demonstrate that one can equip the deterministic problem in question by an "artificial" SO in such a way that the associated SMP algorithm, under certain circumstances, exhibits better performance than deterministic algorithms.Let us consider the following construction of the SO for F (different from the SO (4.10)!).Observe that the monotone operator associated with the saddle point problem (4.25) is [Tr(yA 1 ); ...; Tr(yA n )] y); F y (x, y)] as follows: 1. we generate a realization  of a random variable taking values 1, ..., n with probabilities x 1 , ..., x n (recall that x ∈ X, the standard simplex, so that x indeed can be seen as a probability distribution), and set The just defined random estimate Ξ of F (x, y) can be expressed as a deterministic function Ξ(x, y, η) of (x, y) and random variable η uniformly distributed on [0, 1].Given x, y and η, the value of this function can be computed with the arithmetic cost O(1)(n(p max ) 2 + p (2) ) (indeed, O(1)(n + p (1) ) operations are needed to convert η into ı and , O(1)p (2) operations are used to write down the y-component −A 0 − A  of Ξ, and O(1)n(p max ) 2 operations are needed to compute Ξ x ).Now consider the SO's Ξ k (k is a positive integer) obtained by averaging the outputs of k calls to our basic oracle Ξ.Specifically, at the i-t call to the oracle Ξ k , z = (x, y) ∈ Z = X × Y being the input, the oracle returns the vector where ζ i = [η i1 ; ...; η ik ] and {η is } 1 i, 1 s k are independent random variables uniformly distributed on [0, 1].Note that the arithmetic cost of a single call to Ξ k is ).
The Nash v.i.associated with the saddle point problem Besides this, for any (z ∈ Z, i = 1, 2, ..., It follows that the function is continuously differentiable on [0, 1] with the derivative g(t).Since the function attains its minimum on [0, 1] at t = 0, we have g(0) 0, which is exactly (5.1).
At least the first statement of the following Lemma is well-known: Lemma 5.1.For every z ∈ Z o , the mapping ξ → P (z, ξ) is a single-valued mapping of E onto Z o , and this mapping is Lipschitz continuous, specifically, 2α . (5.3) Setting u = w in (5.4) and u = v in (5.5), we get and (5.2) follows.This relation, as a byproduct, implies that P (z, •) is single-valued.
To prove (5.3), let v = P (z, ζ).We have as required in (a) of (5.3).The bound (b) of (5.3) is obtained from (5.3) using the Young inequality: Indeed, observe that by definition, V (z, •) is strongly convex with parameter α, and We have the following simple corollary of Lemma 5.1: Corollary 5.2.Let ξ 1 , ξ 2 , ... be a sequence of elements of E. Define the sequence {y τ } ∞ τ =0 in Z o as follows: Then y τ is a measurable function of y 0 and ξ 1 , ..., ξ τ such that Proof.Using the bound (b) of (5.3) with ζ = ξ t and z = y t−1 (so that y t = P (y t−1 , ξ t ) we obtain for any u ∈ Z: Note that Further, due to the strong convexity of V , When summing up from τ = 1 to τ = t we arrive at the corollary.We also need the following result.Then for all u ∈ Z one has (5.8) Proof.(a): this is nothing but (5.2). (b): Using (a) of (5.3) in Lemma 5.1 we can write for u = r + : This results in Using (5.3) with η substituted for ζ we get due to the strong convexity of V .To conclude the bound (b) of (5.8) it suffices to note that by the Young inequality, We are able now to prove Theorem 3.2.By (1.4) we have that When summing up from τ = 1 to τ = t we obtain Hence, for all u ∈ Z, where y τ are given by (3.3).Since the sequences {y τ }, {ξ τ = γ τ ∆ τ } satisfy the premise of Corollary 5.2, we have and thus (5.11) implies that for any u ∈ Z To complete the proof of (3.5) in the general case, note that since F is monotone, (5.12) implies that for all u ∈ Z, where ζ M(τ )+1 .Denoting by E i the expectation w.r.t.ζ i , we conclude that under assumption I we have and under assumption II, in addition, We conclude by (5.15) that where the concluding inequality follows from the fact that Z is contained in the •ball of radius Ω = 2Θ/α centered at z c , see (2.5).From (5.18) it follows that Combining the latter relation, (5.13), (5.14) and (5.17), we arrive at (3.9).(i) is proved.
To prove (ii), observe, first, that setting At the same time, we can write where ξ j 0 is a deterministic function of ζ I(j) for certain increasing sequence of integers {I(j)}.Moreover, when denoting by E j conditional expectation over ζ I(j) , ζ I(j)+1 ..., ζ I(j)−1 being fixed, we have  We have Then by (4.2), Combining the latter bounds with (5.32) we conclude (4.30.b).

(4. 3 ) 4 .
Φ(•) is a convex function on E = E 1 × ... × E m given by the representationΦ(u 1 , ..., u m ) = max y∈Y m ℓ=1 u ℓ , A ℓ y + b ℓ E ℓ − Φ * (y) (4.4) for u ℓ ∈ E ℓ , 1 ℓ m.Here (a) Y ⊂ Y is a convex compact set containing the origin; the embedding Euclidean space Y is equipped with a norm • y , and Y -with a distancegenerating function ω y (y) with parameters α y , Θ y , Ω y w.r.t. the norm • y ; (b) The affine mappings y → A ℓ y + b ℓ : Y → E ℓ are such that A ℓ y + b ℓ ∈ K * ℓ for all y ∈ Y and all ℓ; here K * ℓ is the cone dual to K ℓ ; as it is the case almost everywhere on int X), one has∂φ(x) ∂x ∈ ∂ K φ(x).

1 .
Proof of Theorem 3.2.We start with the following simple observation: if r e is a solution to (2.2), then ∂ Z ω(r e ) contains −e and thus is nonempty, so that r e ∈ Z o .Moreover, one has ω ′ (r e ) − e, u − r e 0 ∀u ∈ Z. (5.1)Indeed, by continuity argument, it suffices to verify the inequality in the case when u ∈ rint(Z) ⊂ Z o .For such an u, the convex functionf (t) = ω(r e + t(u − r e )) + r e + t(u − r e ), e , t ∈ [0, 1]is continuous on [0, 1] and has a continuous on [0, 1] field of subgradients g(t) = ω ′ (r e + t(u − r e )) + e, u − r e .