Stochastic Block-Coordinate Gradient Projection Algorithms for Submodular Maximization

We consider a large-scale stochastic continuous submodular optimization problem, which arises naturally in many applications such as machine learning. Due to high-dimensional data, the computation of the whole gradient vector can become prohibitively expensive. To reduce the complexity and memory requirements, we propose a stochastic block-coordinate gradient projection algorithm for maximizing continuous submodular functions, which chooses a random subset of the gradient vector's coordinates and updates the estimates along the positive gradient direction. We prove that the estimates generated by the algorithm converge to a stationary point with probability 1. Moreover, we show that the proposed algorithm achieves the tight ((p_min/2)F* − ε) approximation guarantee after O(1/ε²) iterations for DR-submodular functions by choosing appropriate step sizes. Furthermore, we show that the algorithm achieves the tight ((γ²/(1 + γ²))p_min F* − ε) approximation guarantee after O(1/ε²) iterations for weakly DR-submodular functions with parameter γ by choosing diminishing step sizes.


Introduction
In this paper, we focus on submodular function maximization, which has recently attracted significant attention since submodularity is a crucial concept in combinatorial optimization. Submodular functions arise in a variety of areas, such as the social sciences, algorithmic game theory, signal processing, machine learning, and computer vision. They have also found many applications in applied mathematics and computer science, such as probabilistic models [1,2], crowd teaching [3,4], representation learning [5], data summarization [6], document summarization [7], recommender systems [8], product recommendation [9,10], sensor placement [11], network monitoring [12,13], the design of structured norms [14], clustering [15], dictionary learning [16], active learning [17], and utility maximization in sensor networks [18].
In addition, many polynomial-time algorithms exist for approximately maximizing submodular functions with approximation guarantees, such as local search and greedy algorithms [22][23][24][25]. Despite this progress, these methods rely on combinatorial techniques, which have some limitations [26]. For this reason, a new approach based on multilinear relaxation has been proposed [27], which lifts submodular optimization problems into the continuous domain. Continuous optimization techniques can then be used to minimize submodular functions exactly, or maximize them approximately, in polynomial time. Recently, a substantial body of literature has been devoted to continuous submodular optimization [28][29][30][31]. The algorithms cited above, however, need to compute all the (sub)gradients.
However, the computation of all (sub)gradients can become prohibitively expensive when dealing with huge-scale optimization problems, where the decision vectors are high-dimensional. For this reason, the coordinate descent method and its variants have been proposed for solving convex optimization problems efficiently [32]. At each iteration, these methods update only a (block of) coordinate(s) of the decision vector, which keeps the per-iteration cost low.

The remainder of this paper is organized as follows. We describe the mathematical background in Section 2. We formulate the problem of interest and propose a stochastic block-coordinate gradient projection algorithm in Section 3. In Section 4, we state the main results of the paper. The detailed proofs of the main results are provided in Section 5. The conclusion of the paper is presented in Section 6.

Mathematical Background
Given a ground set V consisting of n elements, a set function f : 2^V → R_+ is called submodular if it satisfies

f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B)

for all subsets A, B ⊆ V. The notion of submodularity is mostly used in the discrete domain, but it can be extended to the continuous domain [49]. Consider a subset of R_+^n of the form X = ∏_{i=1}^{n} X_i, where each set X_i is a compact subset of R_+. A continuous function F : X → R_+ is called a submodular continuous function if, for all x, y ∈ X, the following inequality holds:

F(x) + F(y) ≥ F(x ∨ y) + F(x ∧ y),

where x ∨ y ≜ max{x, y} (coordinate-wise) and x ∧ y ≜ min{x, y} (coordinate-wise). Moreover, if x ⪯ y implies F(x) ≤ F(y) for all x, y ∈ X, then the submodular continuous function F is called monotone on X. Furthermore, a differentiable submodular continuous function F is called DR-submodular if, for all x, y ∈ X such that x ⪯ y, we have ∇F(x) ⪰ ∇F(y); i.e., ∇F(⋅) is an antitone mapping [29]. When the continuous function F is twice differentiable, F is submodular if and only if all off-diagonal components of its Hessian matrix are nonpositive [28]; i.e., for all x ∈ X,

∂²F(x)/(∂x_i ∂x_j) ≤ 0 for all i ≠ j.

Furthermore, if the submodular function F is DR-submodular, then all second derivatives are nonpositive [29]; i.e., for all x ∈ X,

∂²F(x)/(∂x_i ∂x_j) ≤ 0 for all i, j.

In addition, twice differentiability over the compact set X implies that F is smooth [50]. Moreover, we say that a function F is L-smooth if, for all x, y ∈ X, we have

‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖.

Note that the above definition is equivalent to

|F(y) − F(x) − ⟨∇F(x), y − x⟩| ≤ (L/2) ‖y − x‖² for all x, y ∈ X.

Finally, a monotone differentiable function F is called weakly DR-submodular with parameter γ if

γ = inf_{x⪯y} min_{i∈{1,...,n}} [∇F(y)]_i / [∇F(x)]_i.

More details about weakly DR-submodular functions are available in [29].
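To make the definitions above concrete, the following minimal Python sketch (our own illustration, not part of the paper; the quadratic objective and all variable names are hypothetical) numerically checks two of the characterizations on a toy function: the lattice inequality and the antitone-gradient property.

```python
import numpy as np

# Toy function F(x) = <a, x> - 0.5 * x^T H x with a >= 0 and H >= 0 (entrywise).
# Its Hessian is -H, so all second derivatives are nonpositive: F is DR-submodular.
rng = np.random.default_rng(0)
n = 4
a = rng.uniform(1.0, 2.0, n)
H = rng.uniform(0.0, 0.2, (n, n))
H = (H + H.T) / 2                          # symmetric, entrywise nonnegative

def F(v):
    return a @ v - 0.5 * v @ H @ v

def grad_F(v):
    return a - H @ v

# Antitone-gradient characterization: x <= y implies grad F(x) >= grad F(y).
x = rng.uniform(0.0, 0.5, n)
y = x + rng.uniform(0.0, 0.5, n)           # y dominates x coordinate-wise
assert np.all(grad_F(x) >= grad_F(y) - 1e-12)

# Lattice inequality: F(u) + F(w) >= F(u v w) + F(u ^ w).
u, w = rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)
assert F(u) + F(w) >= F(np.maximum(u, w)) + F(np.minimum(u, w)) - 1e-12
```

Any function of this form with a ⪰ 0 and H entrywise nonnegative passes both checks, since its Hessian −H has only nonpositive entries.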

Problem Formulation and Algorithm Design
In this section, we first describe the problem of our interest, and then we design an algorithm to efficiently solve the problem.
In this paper, we focus on the following constrained optimization problem:

max_{x∈K} F(x) ≜ E_{θ∼D} [F_θ(x)],   (8)

where K denotes the constraint set, D denotes an unknown distribution, and F_θ is a submodular continuous function for every realization θ of D. Moreover, we assume that the constraint set K = ∏_{i=1}^{n} K_i ⊆ X is convex, where each K_i ⊆ R_+ is a convex and closed set for all i = 1, . . ., n. This problem has recently been introduced in [30]. In addition, we use the notation F* to denote the optimal value of F(x) over K, i.e., F* ≜ max_{x∈K} F(x). Furthermore, the function F(x) is a submodular function because each function F_θ is a submodular continuous function for every realization θ of D [28].
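As a concrete instance of problem (8), the sketch below (a hypothetical illustration of ours; the choice of D and all names are assumptions) builds a monotone DR-submodular stochastic objective F_θ(x) = log(1 + ⟨θ, x⟩) with θ ⪰ 0. Its Hessian −θθᵀ/(1 + ⟨θ, x⟩)² is entrywise nonpositive, so each realization is DR-submodular, and hence so is the expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

def sample_theta():
    # One realization theta ~ D (here: nonnegative uniform weights).
    return rng.uniform(0.0, 1.0, n)

def F_theta(x, theta):
    # F_theta(x) = log(1 + <theta, x>): monotone and DR-submodular for theta >= 0.
    return np.log1p(theta @ x)

def F(x, num_samples=2000):
    # Monte Carlo estimate of F(x) = E_theta[F_theta(x)].
    return np.mean([F_theta(x, sample_theta()) for _ in range(num_samples)])
```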
To solve problem (8), the projected stochastic gradient methods form a class of efficient algorithms [31]. However, in this work we focus on the case where the decision vectors x are high-dimensional, i.e., the dimensionality n is large. Full gradient computations are then prohibitively expensive and become the computational bottleneck. Therefore, we propose a stochastic block-coordinate gradient method that combines the attractive features of block-coordinate updates and stochastic gradients. We assume that the components of the decision variables are arbitrarily assigned but fixed for each processor. Furthermore, at each iteration, each processor randomly chooses a subset of the (stochastic) gradient coordinates, rather than all of them. The detailed description of the proposed algorithm is as follows. Starting from an initial value x(0) ∈ K, at each iteration t = 0, 1, 2, . . ., each coordinate i = 1, . . ., n is updated as

x_i(t + 1) = Π_{K_i}[x_i(t) + α(t) χ_i(t) g_i(t)],   (11)

where α(t) is the step size, Π_{K_i}(⋅) denotes the Euclidean projection onto the set K_i, the χ_i(t) are independent and identically distributed Bernoulli random variables with P(χ_i(t) = 1) = p_i for all t = 0, 1, 2, . . . and i = 1, . . ., n, and g_i(t) denotes an unbiased estimate of the partial gradient ∇_i F(x(t)), i.e., the i-th coordinate of ∇F(x(t)).
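A minimal Python sketch of update (11) follows (our own rendering of the description above; the box constraints, step-size schedule, and function names are assumptions for illustration). Each coordinate i is activated independently with probability p_i and, when active, takes a step along the positive direction of the unbiased partial-gradient estimate, followed by projection onto K_i.

```python
import numpy as np

def sbcgp(grad_oracle, lower, upper, p, T, step):
    """Stochastic block-coordinate gradient projection, cf. update (11).

    grad_oracle(x) returns an unbiased estimate g(t) of grad F(x(t)).
    lower, upper describe assumed box constraints K_i = [lower_i, upper_i].
    p[i] is the activation probability of coordinate i; step(t) is alpha(t).
    """
    rng = np.random.default_rng(2)
    x = lower.copy()                      # initial value x(0) in K
    for t in range(T):
        g = grad_oracle(x)                # stochastic gradient estimate g(t)
        chi = rng.random(len(x)) < p      # Bernoulli coordinate activations chi_i(t)
        x_new = x + step(t) * chi * g     # move along the positive gradient direction
        x = np.clip(x_new, lower, upper)  # coordinate-wise projection onto K
    return x
```

For box constraints K_i = [lower_i, upper_i], the Euclidean projection Π_{K_i} reduces to coordinate-wise clipping, which is why np.clip suffices in this sketch.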
We introduce the diagonal matrix P ≜ diag(p_1, . . ., p_n), which collects the activation probabilities of the coordinates.
Let F_t denote the history of all random variables generated by the proposed algorithm (11) up to time t. In this paper, we adopt the following assumption on the random variables χ_i(t).

Assumption 1. For all i, j, t, s with (i, t) ≠ (j, s), the random variables χ_i(t) and χ_j(s) are independent of each other. Furthermore, the random variables {χ_i(t)} are independent of F_{t−1} and of g(t) for any decision variable x that is measurable with respect to F_{t−1}.
In addition, we assume that the function F and the sets K_i satisfy the following conditions.

Assumption 2. Assume that the following properties hold:

(a) The constraint set K ⊆ X is convex, and each set K_i ⊆ R_+ is convex and closed for all i ∈ {1, . . ., n}.

(b) The function F : X → R_+ is monotone and weakly DR-submodular with parameter γ over X.

(c) The function F is differentiable and L-smooth with respect to the norm ‖ ⋅ ‖.
Next, we make the following assumption about the stochastic oracle g(t).

Assumption 3. Assume that the stochastic oracle g(t) satisfies the following conditions:

E[g(t) | F_{t−1}] = ∇F(x(t))

and

E[‖g(t)‖² | F_{t−1}] ≤ G².

The above assumption implies that the stochastic oracle g(t) is an unbiased estimate of ∇F(x(t)) with a uniformly bounded second moment.
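One standard way to realize an oracle satisfying Assumption 3 is minibatch sampling, sketched below for the toy objective introduced after problem (8) (the batch size B and all names are assumptions of ours): averaging B i.i.d. per-sample gradients keeps the estimate unbiased while reducing its variance by a factor of B.

```python
import numpy as np

def grad_F_theta(x, theta):
    # Gradient of F_theta(x) = log(1 + <theta, x>).
    return theta / (1.0 + theta @ x)

def minibatch_oracle(x, sample_theta, B=8):
    # Unbiased: E[g] = E_theta[grad F_theta(x)] = grad F(x), matching the
    # unbiasedness condition in Assumption 3; averaging over B i.i.d. samples
    # shrinks the variance of the estimate by a factor of B.
    return np.mean([grad_F_theta(x, sample_theta()) for _ in range(B)], axis=0)
```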
In this section, we first formulate an optimization problem, and then design an optimization method to solve it. Moreover, we also give some standard assumptions to analyze the performance of the proposed method.

Main Results
In this section, we first present the convergence properties of the proposed algorithm. To this end, we introduce the definition of a stationary point, as defined in [31].
Definition 4. A vector x ∈ K is called a stationary point of a function F : X → R_+ over the constraint set K if max_{y∈K} ⟨y − x, ∇F(x)⟩ ≤ 0.

Given Definition 4, the convergence of our proposed algorithm is characterized by the following theorem.

Theorem 5. Let Assumptions 1-3 hold. Assume that the set of stationary points is nonempty and that 0 < α(t) < 2/L. Moreover, let the sequence {x(t)} be generated by the stochastic block-coordinate gradient projection algorithm (11). Then, the sequence {x(t)} converges to some stationary point x ∈ K with probability 1.
The proof can be found in the next section. The above result shows that the iterates converge to a stationary point with probability 1.
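For intuition, the stationarity condition in Definition 4 can be checked numerically when K is a box, since the inner maximization max_{y∈K} ⟨y − x, ∇F(x)⟩ then separates across coordinates and admits a closed form. The sketch below is our own illustration under this box assumption.

```python
import numpy as np

def stationarity_gap(x, grad, lower, upper):
    # max over y in K of <y - x, grad>, computed coordinate-wise for a box K:
    # each coordinate picks upper_i if grad_i > 0 and lower_i otherwise.
    y = np.where(grad > 0, upper, lower)
    return (y - x) @ grad

# x is (approximately) stationary when the gap is at most a small tolerance.
```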
The proof of Theorem 6 can be found in the next section. From this result, we can see that, for any initial value, an objective value in expectation of at least (p_min/2)F* − ε is obtained after O(1/ε²) iterations of the stochastic block-coordinate gradient projection algorithm (11) whenever the objective is DR-submodular, where p_min ≜ min_i p_i.
The proof of Theorem 7 can be found in the next section. It shows that the stochastic block-coordinate gradient projection algorithm yields, from any initial value, an objective value whose expectation is at least ((γ²/(1 + γ²))p_min)F* − ε after O(1/ε²) iterations for any weakly DR-submodular function with parameter γ.

Performance Analysis
In this section, we provide the detailed proofs of the main results. We first analyze the convergence performance of the stochastic block-coordinate gradient projection algorithm.

To prove Theorem 6, we first present the following lemmas. The first lemma follows from [52] and is stated as follows.
Lemma 8. For all z ∈ K, any y ∈ R_+^n, and any diagonal matrix W with nonnegative diagonal entries, we have

⟨z − Π_K(y), W(y − Π_K(y))⟩ ≤ 0,

where the projection Π_K(⋅) is applied coordinate-wise.
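As a quick sanity check of this projection property for box sets (our own illustration, assuming the coordinate-wise reading above), the following snippet verifies the inequality numerically.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
lower, upper = np.zeros(n), np.ones(n)   # box K = prod_i [lower_i, upper_i]

y = rng.normal(0.0, 2.0, n)              # arbitrary point, possibly outside K
proj_y = np.clip(y, lower, upper)        # coordinate-wise projection onto K
z = rng.uniform(lower, upper)            # any feasible point z in K
w = rng.uniform(0.0, 1.0, n)             # nonnegative diagonal entries of W

# <z - proj_y, W (y - proj_y)> <= 0 for every z in K and nonnegative diagonal W,
# since each coordinate's factors (z_i - proj_i) and (y_i - proj_i) cannot share a sign.
assert ((z - proj_y) * w * (y - proj_y)).sum() <= 1e-12
```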
The next lemma, due to [53], lower-bounds the inner product ⟨∇F(x), y − x⟩ in terms of the function values F(x) and F(y) for monotone weakly DR-submodular functions.

In addition, we also have the following lemma (Lemma 10).
With Lemmas 8 and 10 in place, we have the following result (Lemma 11).
Proof. From the result in Lemma 10, we have inequality (40). In addition, it follows from Lemma 8 that inequality (41) holds, which implies (42). Combining inequalities (40) and (42) yields (43), where the last inequality follows from the condition imposed on the step size α(t).

Taking the conditional expectation of the above inequality with respect to F_t, we obtain (44). Thus, after some algebraic manipulations, inequality (39) is obtained.
Next, we start to prove Theorem 6.
Proof of Theorem 6. Setting z = x* in Lemma 11, where x* is the globally optimal solution of problem (8), i.e., x* ≜ arg max_{x∈K} F(x), we obtain inequality (46). Taking the conditional expectation of (46) with respect to F_t yields (47), which implies (48). Setting x = x(t) and y = x* in Lemma 9 and taking the conditional expectation with respect to F_t, we obtain (49). Thus, plugging inequality (49) into relation (48), we get (50). Taking the total expectation in (50) and using some algebraic manipulations, we have (51), where we have used the relation E[g(t)] = ∇F(x(t)) to obtain the first inequality. Summing both sides of (51) over t = 1, . . ., T, we obtain (52), where in the last inequality we have used the facts that E[‖x(t) − x*‖²_{P⁻¹}] ≤ D²/p_min (with D the diameter of K) and 1/α(t) − 1/α(t + 1) ≤ 0 for all t = 1, . . ., T. On the other hand, we also have (53), where F* = max_{x∈K} F(x) and the last inequality is due to (52).

Plugging the above inequality into (53) and dividing both sides by 2T, we obtain (54), where we have used the step-size choice α(t) = 1/(L + √t) in the last inequality. Furthermore, the above inequality implies (55). In addition, the output sample x(τ) for τ ∈ {1, . . ., T} is obtained by choosing x(1) and x(T + 1) each with probability 1/(2T) and each of the other decision vectors with probability 1/T; we therefore have (56). This completes the proof of the theorem.
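The output-sampling rule used in the last step (pick x(1) and x(T + 1) each with probability 1/(2T), and every other iterate with probability 1/T) can be transcribed directly; the sketch below is a literal rendering of that rule with our own variable names. Note that the probabilities sum to 2 ⋅ 1/(2T) + (T − 1) ⋅ 1/T = 1.

```python
import numpy as np

def sample_output(iterates, rng=np.random.default_rng(4)):
    # iterates = [x(1), x(2), ..., x(T+1)]; T is the number of iterations.
    T = len(iterates) - 1
    probs = np.full(T + 1, 1.0 / T)
    probs[0] = probs[-1] = 1.0 / (2 * T)   # the two endpoints get half weight
    assert np.isclose(probs.sum(), 1.0)
    idx = rng.choice(T + 1, p=probs)
    return iterates[idx]
```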
We now start to prove Theorem 7.
In this section, we provided detailed proofs of the main results of the paper. The conclusion of the paper is given in the next section.

Conclusion
In this paper, we have considered a stochastic optimization problem for continuous submodular functions, which is important in many areas such as machine learning and the social sciences. Since the data is high-dimensional, the usual algorithms based on computing the whole approximate gradient vector, such as stochastic gradient methods, are prohibitively expensive. For this reason, we proposed the stochastic block-coordinate gradient projection algorithm for maximizing submodular functions, which randomly chooses a subset of the approximate gradient vector's coordinates. Moreover, we studied the convergence of the proposed algorithm. We proved that the iterates converge to some stationary point with probability 1 under suitable step sizes. Furthermore, we showed that the algorithm achieves a tight ((p_min/2)F* − ε) approximation guarantee after O(1/ε²) iterations when the submodular functions are DR-submodular and suitable step sizes are used. More generally, we also showed that the algorithm achieves a tight ((γ²/(1 + γ²))p_min F* − ε) approximation guarantee after O(1/ε²) iterations when the submodular functions are weakly DR-submodular with parameter γ and appropriate step sizes are used.