Scalable Distributed Algorithms for Size-Constrained Submodular Maximization in the MapReduce and Adaptive Complexity Models

Distributed maximization of a submodular function in the MapReduce (MR) model has received much attention, culminating in two frameworks that allow a centralized algorithm to be run in the MR setting without loss of approximation, as long as the centralized algorithm satisfies a certain consistency property -- which had previously only been known to be satisfied by the standard greedy and continuous greedy algorithms. A separate line of work has studied parallelizability of submodular maximization in the adaptive complexity model, where each thread may have access to the entire ground set. For the size-constrained maximization of a monotone and submodular function, we show that several sublinearly adaptive (highly parallelizable) algorithms satisfy the consistency property required to work in the MR setting, which yields practical, parallelizable and distributed algorithms. Separately, we develop the first distributed algorithm with linear query complexity for this problem. Finally, we provide a method to increase the maximum cardinality constraint for MR algorithms at the cost of additional MR rounds.


Introduction
Submodular maximization has become an important problem in data mining and machine learning, with real-world applications ranging from video summarization [Mirzasoleiman et al., 2018] and mini-batch selection [Joseph et al., 2019] to more complex tasks such as active learning [Rangwani et al., 2021] and federated learning [Balakrishnan et al., 2022]. In this work, we study the size-constrained maximization of a monotone, submodular function (SMCC), formally defined in Section 1.3. Because of the ubiquity of problems requiring the optimization of a submodular function, a vast literature on submodular optimization exists; we refer the reader to the surveys [Liu et al., 2020, Liu, 2020].
A foundational result of Nemhauser et al. [1978] shows that a simple greedy algorithm (Greedy, pseudocode in Appendix H) achieves the optimal approximation ratio for SMCC of 1 − 1/e ≈ 0.63 in the value query model, in which the submodular function f is available to the algorithm as an oracle that returns f(S) when queried with set S. However, because of the modern revolution in data [Mao et al., 2021, Ettinger et al., 2021], there is a need for more scalable algorithms. Specifically, the standard greedy algorithm is a centralized algorithm that needs access to the whole dataset, makes many adaptive rounds (defined below) of queries to the submodular function, and has quadratic total query complexity in the worst case.
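For reference, the following is a minimal sketch of this standard greedy procedure under a value oracle; the toy coverage objective and the function names are illustrative and not taken from the paper.

```python
# Minimal sketch of the standard greedy algorithm for SMCC under a value oracle.
# One adaptive round per iteration and O(nk) oracle queries in the worst case.

def greedy(f, ground_set, k):
    """Repeatedly add the element with the largest marginal gain f(S + {x}) - f(S)."""
    S = set()
    for _ in range(k):
        base = f(S)
        best_x, best_gain = None, 0.0
        for x in ground_set:
            if x in S:
                continue
            gain = f(S | {x}) - base
            if gain > best_gain:
                best_x, best_gain = x, gain
        if best_x is None:  # no element has positive marginal gain
            break
        S.add(best_x)
    return S

if __name__ == "__main__":
    # Toy monotone submodular objective: number of nodes covered by the chosen elements.
    covers = {0: {1, 2, 3}, 1: {3, 4}, 2: {5}, 3: {1, 2, 3, 4, 5}}
    f = lambda S: len(set().union(*(covers[x] for x in S))) if S else 0
    print(greedy(f, list(covers), k=2))  # prints {3}: it already covers every node
```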
Distributed Algorithms for SMCC. Massive data sets are often too large to fit on a single machine and are distributed over a cluster of many machines. In this context, there has been a line of work developing algorithms for SMCC in the MapReduce model, defined formally in Section 1.3 (see Table 1 and the Related Work section below). Of these, PAlg [Barbosa et al., 2015] is a general framework that allows a centralized algorithm Alg to be converted to the MapReduce distributed setting with nearly the same theoretical approximation ratio, as long as Alg satisfies a technical, randomized consistency property (RCP), defined in Section 1.3.
In addition, DistributedDistorted [Kazemi et al., 2019] is also a general framework that can adapt any Alg satisfying RCP, as we show in Section 4.2. However, to the best of our knowledge, the only centralized algorithms previously known to satisfy this property are Greedy [Nemhauser et al., 1978] and the continuous greedy algorithm of Calinescu et al. [2007].
Parallelizable Algorithms for SMCC. A separate line of work, initiated by Balkanski and Singer [2018], has taken an orthogonal approach to scaling up the standard greedy algorithm: parallelizable algorithms for SMCC, as measured by the adaptive complexity of the algorithm, formally defined in Section 1.3. In this model, each thread may have access to the entire ground set, and hence these algorithms do not apply in a distributed setting. Altogether, these algorithms have exponentially improved the adaptivity of standard greedy from O(n) to O(log(n)) while obtaining nearly the same 1 − 1/e approximation ratio Fahrbach et al. [2019a], Balkanski et al. [2019], Ene and Nguyen [2019], Chekuri and Quanrud [2019]. Recently, highly practical sublinearly adaptive algorithms have been developed without compromising theoretical guarantees Breuer et al. [2020], Chen et al. [2021]. However, none of these algorithms have been shown to satisfy the RCP.
Combining Parallelizability and Distributed Settings. In this work, we study parallelizable, distributed algorithms for SMCC; each node of the distributed cluster likely has many processors, so we seek to take advantage of this situation by parallelizing the algorithm within each machine in the cluster. In the existing MapReduce algorithms, the number of machines ℓ could be set to the number of processors available to ensure the usage of all available resources; this would consider each processor as a separate machine in the cluster. However, there are a number of disadvantages to this approach: 1) A large number of machines severely restricts the size of the cardinality constraint k, since, in all of the MapReduce algorithms, a set of size kℓ must be stored on a single machine; therefore, we must have k ≤ O(Ψ/ℓ) = O(n/ℓ²). For example, with n = 10^6, ℓ = 256 implies that k ≤ 15·C, for some constant C. However, if each machine has 8 processor cores, the number of machines ℓ can be set to 32 (and the algorithm parallelized on each machine), which still uses 256 processors but allows k ≤ 976·C. As the number of cores and processors per physical machine continues to grow, large practical benefits can be obtained by a parallelized and distributed algorithm. 2) Modern CPU architectures have many cores that all share a common memory, and it is inefficient to treat communication between these cores as being as expensive as communication between separate machines in a cluster.
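As a quick check of the arithmetic above (assuming the machine memory Ψ is on the order of n/ℓ, so that k is bounded by roughly n/ℓ² up to the constant C):

```python
# Back-of-the-envelope limits on k for n = 10^6 under the k <= n / ell^2 restriction.
n = 10**6
for ell in (256, 32):
    print(f"ell = {ell:3d}: k <= {n // ell**2} * C")
# ell = 256: k <= 15 * C
# ell =  32: k <= 976 * C
```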
In addition to the MapReduce and adaptive complexity models, the total query complexity of the algorithm to the submodular oracle is also relevant to the scalability of the algorithm. Frequently, evaluation of the submodular function is expensive, so the time spent answering oracle queries dominates the other parts of the computation. Therefore, the following questions are posed. Q1: Is it possible to design constant-factor approximation distributed algorithms with 1) a constant number of MapReduce rounds; 2) sublinear adaptive complexity (highly parallelizable); and 3) nearly linear total query complexity? Q2: Can we design practical distributed algorithms that also meet these three criteria?

Contributions
In overview, the contributions of the paper are: 1) an exponential improvement in adaptivity (from Ω(n) to O(log n)) for an algorithm with a constant number of MR rounds; 2) the first linear-time, constant-factor algorithm with constant MR rounds; 3) a way to increase the size constraint limitation at the cost of additional MR rounds; and 4) an empirical evaluation on a cluster of 64 machines that demonstrates that an order-of-magnitude improvement in runtime is obtained by combining the MR and adaptive complexity models.
To develop MR algorithms with sublinear adaptivity, we first modify and analyze the low-adaptive algorithm ThreshSeqMod from Chen et al. [2021] and show that our modification satisfies the RCP (Section 2). Next, ThreshSeqMod is used to create a low-adaptive greedy procedure LAG; LAG then satisfies the RCP, and so can be used within PAlg Barbosa et al. [2016] to yield a 1 − 1/e − ε approximation in O(1/ε²) MR rounds, with adaptivity O(log n). We show in Section 4.2 that DistributedDistorted Kazemi et al. [2021] also works with any Alg satisfying RCP, which yields an improvement in the number of MR rounds needed to achieve the same ratio. We term the resulting algorithm G-DASH (Section 4); it achieves ratio 1 − 1/e − ε with O(log² n) adaptivity and O(1/ε) MR rounds.
Although G-DASH achieves nearly the optimal ratio in a constant number of MR rounds with sublinear adaptivity, the number of MR rounds is high. To obtain truly practical algorithms, we develop constant-factor algorithms with two MR rounds: R-DASH and T-DASH, with O(log(k) log(n)) and O(log n) adaptivity, respectively, and with approximation ratios of (1/2)(1 − 1/e − ε) (≈ 0.316) and 3/8 − ε. R-DASH is our most practical algorithm and may be regarded as a parallelized version of RandGreeDI Barbosa et al. [2015], the first MapReduce algorithm for SMCC and the state-of-the-art MR algorithm in terms of empirical performance. T-DASH is a novel algorithm that improves the theoretical properties of R-DASH at the cost of empirical performance (see Table 1 and Section 8). Notably, the adaptivity of T-DASH is O(log(n)), which is close to the best known adaptivity for a constant-factor algorithm, O(log(n/k)), due to Chen et al. [2021].
Although our MR algorithms are nearly linear time (within a polylogarithmic factor of linear), all of the existing MR algorithms are superlinear time, which raises the question of whether an algorithm exists that has linear query complexity together with a constant number of MR rounds and a constant approximation factor. We answer this question affirmatively by adapting a linear-time algorithm of Kuhnle [2021] and Chen et al. [2021], and showing that our adaptation satisfies the RCP (Section 3). Subsequently, we develop the first MR algorithm (L-Dist) with an overall linear query complexity.
Our next contribution is MED, a general plug-in framework for distributed algorithms, which increases the maximum size of the cardinality constraint at the cost of more MR rounds. As discussed above, the maximum k value of any prior MR algorithm for SMCC is O(n/ℓ²), where ℓ is the number of machines; MED increases this to O(n/ℓ). We also show that, under certain assumptions on the objective function (which are satisfied by all of the empirical applications evaluated in Section 8), MED can be run with k_max = n, which removes any cardinality restriction. When used in conjunction with a γ-approximation MR algorithm, MED provides a (1 − e^{−γ})-approximate solution.
Finally, an extensive empirical evaluation of our algorithms and the current state of the art, on a 64-machine cluster with 32 cores each and data instances ranging up to 5 million nodes, shows that R-DASH is orders of magnitude faster than state-of-the-art MR algorithms and demonstrates an exponential improvement in scaling to larger k. Moreover, we show that MR algorithms slow down as the number of machines increases past a certain point, even if enough memory is available, which further motivates distributing a parallelizable algorithm over a smaller number of machines. This observation also motivates the development of the MED+Alg framework. In our evaluation, we found that the MED+Alg framework delivers solutions significantly faster than Alg alone for large constraints, showcasing its superior performance and efficiency.
A previous version of this work was published in a conference Dey et al. [2023]. In that version, RCP was claimed to be shown for the ThresholdSeq algorithm of Chen et al. [2021]. Unfortunately, RCP does not hold for this algorithm. In this version, we provide a modified version of the ThresholdSeq algorithm of Chen et al. [2021], for which we show RCP. Additionally, in this version, we introduce the first linear-time MR algorithm, LinearTime-Distributed (L-Dist). Finally, in our empirical evaluation, we expand upon the conference version by conducting two additional experiments using a larger cluster comprising 64 machines. These additional experiments evaluate the performance of L-Dist and investigate the effectiveness of MED in handling increasing cardinality constraints, respectively.

Organization
The rest of this paper is organized as follows. In Section 2, we present the low-adaptive procedures and show they satisfy the RCP. In Section 3, we analyze an algorithm with linear query complexity and show it satisfies the RCP. In Section 4, we show how to use algorithms satisfying the RCP to obtain MR algorithms. In Section 5, we improve the theoretical properties of our 2-round MR algorithm. In Section 6, we detail the linear-time MR algorithm. In Section 7, we show how to increase the supported constraint size by adding additional MR rounds. In Section 8, we conduct our empirical evaluation.

Preliminaries
A submodular set function captures the diminishing-returns property: the gain of adding an element to a set decreases as the set grows. Formally, let N be a finite set of size n. A non-negative set function f : 2^N → R≥0 is submodular if, for all A ⊆ B ⊆ N and x ∈ N \ B, Δ(x | A) ≥ Δ(x | B), where Δ(x | S) = f(S ∪ {x}) − f(S) denotes the marginal gain of x with respect to S; f is monotone if f(A) ≤ f(B) whenever A ⊆ B. In this paper, we study the following optimization problem for submodular optimization under a cardinality constraint (SMCC): given a value oracle for a monotone, submodular f and an integer k, find arg max_{|S| ≤ k} f(S).

MapReduce Model. The MapReduce (MR) model is a formalization of distributed computation into MapReduce rounds of communication between machines. A dataset of size n is distributed over ℓ machines, each with memory to hold at most Ψ elements of the ground set. The total memory of the machines is constrained to be Ψ · ℓ = O(n). After each round of computation, a machine may send O(Ψ) amount of data to other machines. We assume ℓ ≤ n^{1−c} for some constant c ≥ 1/2.
Next, we formally define the RCP needed to use the two frameworks to convert a centralized algorithm to the MR setting, as discussed previously.
Property 1 (Randomized Consistency Property of Barbosa et al. [2016]). Let q be a fixed sequence of random bits, and let Alg be a randomized algorithm with randomness determined by q. Suppose Alg(N, q) returns a pair of sets (AlgSol(N, q), AlgRel(N, q)), where AlgSol(N, q) is the feasible solution and AlgRel(N, q) is a set providing additional information. Let A and B be two disjoint subsets of N, and suppose that for each b ∈ B, AlgRel(A ∪ {b}, q) = AlgRel(A, q). Also, suppose that Alg(A, q) terminates successfully. Then Alg(A ∪ B, q) terminates successfully and AlgSol(A ∪ B, q) = AlgSol(A, q).

Adaptive Complexity. The adaptive complexity of an algorithm is the minimum number of sequential adaptive rounds, each of at most polynomially many queries to the submodular function, into which the queries can be arranged such that the queries in each round depend only on the results of queries in previous rounds.

Related Work
MapReduce Algorithms. Addressing the coverage maximization problem (a special case of SMCC), Chierichetti et al. [2010] proposed an approximation algorithm achieving a ratio of 1 − 1/e with polylogarithmically many MapReduce (MR) rounds. This was later enhanced by Blelloch et al. [2012], who reduced the number of rounds to log²(n). Kumar et al. [2013] contributed further advancements with a (1 − 1/e)-approximation algorithm, significantly reducing the number of MR rounds to logarithmic. Additionally, they introduced a (1/2 − ε)-approximation algorithm that operates within O(1/δ) MR rounds, albeit with a logarithmic increase in communication complexity.
For SMCC, Mirzasoleiman et al. [2013] introduced the two-round distributed greedy algorithm (GreedI), demonstrating its efficacy through empirical studies across various machine learning applications. However, its worst-case approximation guarantee is 1/Θ(min{√k, ℓ}). Subsequent advancements led to the development of constant-factor algorithms by Barbosa et al. [2015], who notably introduced randomization into the distributed greedy algorithm GreedI, resulting in the RandGreeDI algorithm. This algorithm achieves a (1/2)(1 − 1/e) approximation ratio with two MR rounds and O(nk) queries. Mirrokni and Zadimoghaddam [2015] also introduced an algorithm with two MR rounds, where the first round computes randomized composable core-sets and the second round applies a tie-breaking rule; this algorithm improves the approximation ratio from (1/2)(1 − 1/e) to 0.545 − ε. Later on, Barbosa et al. [2016] proposed an O(1/ε²)-round algorithm with the optimal (1 − 1/e − ε) approximation ratio without data duplication. Then, Kazemi et al. [2021] improved its space and communication complexity by a factor of O(1/ε) with the same approximation ratio. Another two-round algorithm, proposed by Liu and Vondrak [2019], achieves a (1/2 − ε) approximation ratio, but requires data duplication, with four times more elements distributed to each machine in the first round. As a result, distributed setups have a more rigid memory restriction when running this algorithm.
Parallelizable Algorithms. A separate line of work considers the parallelizability of the algorithm, as measured by its adaptive complexity. Balkanski and Singer [2018] introduced the first O(log(n))-adaptive algorithm, achieving a (1/3 − ε) approximation ratio with O(nk² log³(n)) query complexity for SMCC. Then, Balkanski et al. [2019] enhanced the approximation ratio to 1 − 1/e − ε with O(nk² log²(n)) query complexity while maintaining the same sublinear adaptivity. In the meantime, Ene and Nguyen [2019], Chekuri and Quanrud [2019], and Fahrbach et al. [2019a] achieved the same approximation ratio and adaptivity but with improved query complexities: O(n poly(log(n))) for Ene and Nguyen [2019], O(n log(n)) for Chekuri and Quanrud [2019], and O(n) for Fahrbach et al. [2019a]. Recently, highly practical sublinearly adaptive algorithms have been developed without compromising on the approximation ratio. FAST, introduced by Breuer et al. [2020], operates with O(log(n) log²(log(k))) adaptive rounds and O(n log(log(k))) queries, while LS+PGB, proposed by Chen et al. [2021], runs with O(log(n)) adaptive rounds and O(n) queries. However, none of these algorithms have been shown to satisfy the RCP.
Fast (Low Query Complexity) Algorithms. Since a query to the oracle for the submodular function is typically an expensive operation, the total query complexity of an algorithm is highly relevant to its scalability. The work of Badanidiyuru and Vondrak [2014] reduced the query complexity of the standard greedy algorithm from O(kn) to O(n log(n)) while nearly maintaining the ratio of 1 − 1/e. Subsequently, Mirzasoleiman et al. [2015] utilized random sampling on the ground set to achieve the same approximation ratio in expectation with optimal O(n) query complexity. Afterwards, Kuhnle [2021] obtained a deterministic algorithm that achieves ratio 1/4 in exactly n queries, and the first deterministic algorithm with O(n) query complexity and a (1 − 1/e − ε) approximation ratio. Since our goal is to develop practical algorithms, we develop parallelizable and distributed algorithms with nearly linear query complexity (i.e., within a polylogarithmic factor of linear). Further, we develop the first linear-time algorithm with a constant number of MR rounds, although it does not parallelize well.
Non-monotone Algorithms. If the submodular function is no longer assumed to be monotone, the SMCC problem becomes more difficult. Here, we highlight a few works in each of the categories of distributed, parallelizable, and query-efficient algorithms. Lee et al. [2009] provided the first constant-factor approximation algorithm, with an approximation ratio of (1/4 − ε) and a query complexity of O(nk⁵ log(k)). Building upon that, Gupta et al. [2010] reduced the query complexity to O(nk) by replacing the local search procedure with a greedy approach, at the cost of a slightly worse approximation ratio. Subsequently, Buchbinder et al. [2014] incorporated randomness to achieve an expected 1/e approximation ratio with O(nk) query complexity. Furthermore, Buchbinder et al. [2017] improved this to O(n) query complexity while maintaining the same approximation ratio in expectation. For parallelizable algorithms in the adaptive complexity model, Ene and Nguyen [2020] achieved the current best (1/e − ε) approximation ratio with O(log(n)) adaptivity and O(nk² log²(n)) queries to the continuous oracle (a continuous relaxation of the original value oracle). Later, Chen and Kuhnle [2024] achieved the current best sublinear adaptivity and nearly linear query complexity.

[Algorithm 1: Low-Adaptive Threshold Algorithm (ThreshSeqMod). Input: evaluation oracle f : 2^N → R+, subset X ⊆ N, constraint k, confidence δ, error ε, threshold τ, and a finite set of sequences of random bits q ← {σ_1, . . ., σ_{M+1}}; the candidate prefix sizes λ ∈ Λ are processed in parallel.]

Low-Adaptive Algorithms That Satisfy the RCP

In this section, we analyze two low-adaptive procedures, ThreshSeqMod (Alg. 1) and LAG (Alg. 2), variants of low-adaptive procedures proposed in Chen et al. [2021]. This analysis enables their use in the distributed, MapReduce setting. For convenience, we regard the randomness of the algorithms to be determined by a sequence of random bits q, which is an input to each algorithm.
Observe that the randomness of both ThreshSeqMod and LAG comes only from the random permutations of V_j on Line 8 of ThreshSeqMod, since LAG employs ThreshSeqMod as a subroutine. Consider an equivalent version of these algorithms in which the entire ground set N is permuted randomly, from which the permutation of V_j is extracted. That is, if σ is the permutation of N, the permutation of V is given by v < w iff σ(v) < σ(w), for v, w ∈ V. Then, the random vector q specifies a sequence of permutations of N: (σ_1, σ_2, . . .), which completely determines the behavior of both algorithms.
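A minimal sketch of this viewpoint, assuming the random bits q are modeled as a seed: one global permutation of N is drawn from q, and every subset inherits its order from it, so that runs on different subsets are consistent. This is an illustration, not the paper's implementation.

```python
import random

def global_permutation(ground_set, q):
    """Permute the entire ground set using only the fixed random bits (seed) q."""
    rng = random.Random(q)           # q fully determines the randomness
    perm = list(ground_set)
    rng.shuffle(perm)
    return {x: rank for rank, x in enumerate(perm)}   # element -> sigma(x)

def induced_order(V, sigma):
    """Order of a subset V inherited from the global permutation sigma."""
    return sorted(V, key=lambda x: sigma[x])

if __name__ == "__main__":
    sigma = global_permutation(range(10), q=42)
    A = {0, 2, 4, 6, 8}
    # Adding element 9 does not change the relative order of A's elements.
    print(induced_order(A, sigma))
    print(induced_order(A | {9}, sigma))
```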

Low-Adaptive Threshold (ThreshSeqMod) Algorithm
This section presents the analysis of the low-adaptive threshold algorithm, ThreshSeqMod (Alg. 1; a variant of ThresholdSeq from Chen et al. [2021]). In ThreshSeqMod, the randomness depends explicitly on a random vector q. This modification ensures the consistency property in the distributed setting. Besides the addition of the randomness q, ThreshSeqMod employs an alternative strategy of prefix selection within the for loop. Instead of identifying the failure point (where the average marginal gain is under the threshold) after the final success point (where the average marginal gain is above the threshold), as in Chen et al. [2021], ThreshSeqMod determines the first failure point. This adjustment not only addresses the consistency problem but also preserves the theoretical guarantees below.
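To illustrate the first-failure-point rule, here is a minimal sketch under stated assumptions: perm_V is the candidate set in the q-determined order, marginal is the marginal-gain oracle, lambdas is the grid of prefix lengths, and the Boolean test mirrors B[λ] (at most an ε-fraction of the prefix has marginal gain below τ). The names and the sequential loop are illustrative; in the algorithm the prefix tests are evaluated in parallel within one adaptive round.

```python
def first_failure_prefix(perm_V, S, marginal, tau, eps, lambdas):
    """Select the longest tested prefix that ends before the first failing test."""
    def prefix_ok(lam):                              # plays the role of B[lam]
        prefix = perm_V[:lam]
        bad = sum(1 for i, x in enumerate(prefix)
                  if marginal(x, S | set(prefix[:i])) < tau)
        return bad <= eps * lam

    chosen = 0
    for lam in lambdas:                              # conceptually evaluated in parallel
        if not prefix_ok(lam):                       # first failure point: stop here
            break
        chosen = lam
    return list(perm_V[:chosen])
```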
Theorem 1. Suppose ThreshSeqMod is run with input (f, X, k, δ, ε, τ, q). Then, the algorithm has adaptive complexity O(log(n/δ)/ε³) and outputs S, R ⊆ N, where S is the solution set with |S| ≤ k and R provides additional information with |R| = O(k). The following properties hold: 1) The algorithm succeeds with probability at least 1 − δ/n.
Proof. Consider that the algorithm runs for all M + 1 iterations of the outer for loop; if the algorithm would have returned at iteration j, the values S_i and V_i, for i > j, keep their values from when the algorithm would have returned. The proof relies upon the fact that every call to ThreshSeqMod(·, q) uses the same sequence of permutations of N: {σ_1, σ_2, . . ., σ_{M+1}}. We refer to iterations of the outer for loop on Line 4 of Alg. 1 simply as iterations. Since the only randomness of Alg. 1 is the random permutation of N at each iteration, the randomness of ThreshSeqMod(·, q) is determined by q, which satisfies the hypothesis of Property 1.

For the two sets returned by ThreshSeqMod(N, q), S_{M+1} = ThreshSeqModSol(N, q) represents the feasible solution, and R_{M+1} = ThreshSeqModRel(N, q) is the set that provides additional information. We consider the runs of (1) ThreshSeqMod(A, q), (2) ThreshSeqMod(A ∪ {b}, q), and (3) ThreshSeqMod(A ∪ B, q) together. Variables of (1) are given the notation defined in the pseudocode; variables of (2) are given the superscript b; and variables of (3) are given the superscript ′. We need to show that S′_{M+1} = S_{M+1} and that ThreshSeqMod(A ∪ B, q) terminates successfully. Let P(i) be the statement for iteration i of the outer for loop on Line 4 that (i) S_i = S′_i, and that the analogous relations (ii)–(iv) hold for the remaining variables. If P(M + 1) is true, and ThreshSeqMod(A, q) and ThreshSeqMod(A ∪ B, q) terminate at the same iteration (implying that ThreshSeqMod(A ∪ B, q) also terminates successfully), then the lemma holds immediately. In the following, we prove these two statements by induction.

[Figure: runs on subsets of the same ground set share the same sequence order, the same subsequences, and hence the same first failure point.]

The inductive step is that, if P(i − 1) is true, then P(i) is also true, and if ThreshSeqMod(A, q) terminates at iteration i, then ThreshSeqMod(A ∪ B, q) also terminates at iteration i.
Firstly, we show that (iii) and (iv) of P(i) hold. Since P(i − 1) holds, the intermediate solutions of the three runs agree for any b ∈ B by (i) and (ii) of P(i − 1). So, (iii) and (iv) of P(i) clearly hold, since the sets used to update the candidate sets at iteration i are the same. Secondly, we prove that (i) and (ii) of P(i) hold and that, if ThreshSeqMod(A, q) terminates at iteration i, ThreshSeqMod(A ∪ B, q) also terminates at iteration i.

If ThreshSeqMod(A, q) terminates at iteration i because |S_{i−1}| = k, then the other two runs of ThreshSeqMod also terminate at iteration i, since their intermediate solutions have the same size. Therefore, b has been filtered out before iteration i or will be filtered out at iteration i in both ThreshSeqMod(A ∪ {b}, q) and ThreshSeqMod(A ∪ B, q), since the sets S^b_{i−1} and S^B_{i−1} involved in updating V^b_i and V^B_i are the same. Consequently, V′_i is also an empty set and ThreshSeqMod(A ∪ B, q) terminates at iteration i. Furthermore, (i) and (ii) of P(i) are true, since S′_i, S^b_i, S_i, R^b_i, and R_i do not update. Finally, consider the case that ThreshSeqMod(A, q) does not terminate at iteration i, where V_i, V^b_i, and V′_i are subsets of the same random permutation of N. From Equality 2 and λ′_i < λ^#, it holds that the selected prefixes coincide, and so Properties (i) and (ii) hold.

Analysis of Guarantees
There are three adjustments in ThreshSeqMod compared with ThresholdSeq in Chen et al. [2021]. First, we made the randomness q explicit, which still allows us to consider each permutation on Line 8 as a random step. Therefore, the analysis of the theoretical guarantees is not influenced by adopting the randomness q. Second, the prefix selection step is changed from identifying the failure point after the final success point to identifying the first failure point. Note that this change is necessary for the algorithm to satisfy RCP (Property 1). However, by making this change, fewer elements could be filtered out at the next iteration. Fortunately, we are still able to filter out a constant fraction of the candidate set. Third, the algorithm returns one more set that provides extra information. Since the additional set does not affect the functioning of the algorithm, it does not influence the theoretical guarantees. In the following, we provide the detailed analysis of the theoretical guarantees. As demonstrated by Lemma 2 below, an increase in the number of selected elements results in a corresponding increase in the number of filtered elements. Consequently, there exists a point, say t, such that a given constant fraction of V_j can be filtered out if we are adding more than t elements. The proof of this lemma, which is quite similar to the proof of Lemma 12 in Chen et al. [2021], can be found in Appendix C.
Although we use a smaller prefix, based on Lines 16 and 18, compared to ThresholdSeq, it still holds that, with probability at least 1/2, either a constant fraction of V_j will be filtered out at the beginning of iteration j + 1 or the algorithm terminates with |S_j| = k; this is given as Lemma 3 in the following. Intuitively, if there are enough such iterations, ThreshSeqMod will succeed. The calculation of the query complexity also follows directly from Lemma 3. Since the analyses of success probability, adaptivity, and query complexity simply follow the analysis in Chen et al. [2021], we provide these analyses in Appendix C. Next, we prove that Lemma 3 always holds.

Proof. By the definition of A_i in Lemma 2, A_{λ*_j} will be filtered out from V_j at the next iteration. Equivalently, A_{λ*_j} = V_j \ V_{j+1}. After Line 8 at iteration j + 1, by Lemma 2, there exists a t such that t = min{i ∈ N : |A_i| ≥ βε|V_j|}. We say an iteration j succeeds if λ*_j ≥ min{s, t}. In this case, it holds that either |V_j \ V_{j+1}| ≥ βε|V_j| or |S_j| = k. In other words, iteration j succeeds if there does not exist λ ≤ min{s, t} such that B[λ] = False, by the selection of λ*_j on Line 18.

Instead of directly calculating the success probability, we consider the failure probability. Define an element v_i to be bad if Δ(v_i | S_{j−1} ∪ T_{i−1}) < τ, and good otherwise. Consider the random permutation of V_j as a sequence of dependent Bernoulli trials, with success if the element is bad and failure otherwise. When i ≤ min{s, t}, the probability that v_i is bad is less than βε. If B[λ] = False, there are at least ελ bad elements in T_λ. Let {Y_i}_{i=1}^∞ be a sequence of independent and identically distributed Bernoulli trials, each with success probability βε. Then, for any λ ≤ min{s, t}, it holds that Pr(B[λ] = False) = Pr(B′[λ] = False). In the following, we bound the probability that more than an ε-fraction of the first λ trials are bad, for any λ ≤ s. Subsequently, the probability of failure at iteration j is calculated as Pr(iteration j fails) = Pr(∃λ ≤ min{s, t} s.t. B[λ] = False).

As for Properties (3) and (4) in Theorem 1, since the structure of the solution set is the same as the solution of ThresholdSeq, the analysis is similar to that in Chen et al. [2021]. We provide these proofs in Appendix C.

Low-Adaptive Greedy (LAG) Algorithm
Another building block for our distributed algorithms is a simple, low-adaptive greedy algorithm, LAG (Alg. 2). This algorithm is an instantiation of the ParallelGreedyBoost framework of Chen et al. [2021], and it relies heavily on the low-adaptive procedure ThreshSeqMod (Alg. 1).
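For intuition, the following is a minimal sketch of the descending-threshold structure that LAG instantiates; thresh_seq stands in for ThreshSeqMod, and the stopping rule and constants are illustrative assumptions rather than the exact pseudocode of Alg. 2.

```python
def lag_sketch(f, ground_set, k, eps, thresh_seq):
    """Low-adaptive greedy: sweep geometrically decreasing thresholds,
    delegating each threshold to a low-adaptive subroutine (ThreshSeqMod)."""
    delta_star = max(f({x}) for x in ground_set)     # maximum singleton value
    S = set()
    tau = delta_star
    while len(S) < k and tau >= eps * delta_star / k:
        # thresh_seq is assumed to add elements whose marginal gain w.r.t. S
        # is (roughly) at least tau, respecting the remaining budget k - |S|.
        S |= thresh_seq(f, ground_set, S, k - len(S), tau)
        tau *= (1 - eps)
    return S
```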
Guarantees. Following the analysis of Theorem 3 in Chen et al. [2021], LAG achieves a (1 − 1/e − ε) approximation ratio with O(log(n) log(k)) adaptivity and a query complexity of O(n log(k)/ε⁴).

Lemma 4. LAG satisfies the randomized consistency property (Property 1).
Proof of Lemma 4. Observe that the only randomness in LAG is from the calls to ThreshSeqMod. Since LAG(A, q) succeeds, every call to ThreshSeqMod must succeed as well. Moreover, considering that q is used to permute the underlying ground set N, changing the set argument A of LAG does not change the sequence received by each call to ThreshSeqMod.
Next, we provide Claim 1 and its analysis below.
Claim 1. Let q be a fixed sequence of random bits, A ⊆ N, and b ∈ N \ A. Then, b ∉ ThreshSeqModRel(A ∪ {b}, q) if and only if ThreshSeqModRel(A ∪ {b}, q) = ThreshSeqModRel(A, q).
Proof. It is obvious that if ThreshSeqModRel(A ∪ {b}, q) = ThreshSeqModRel(A, q), then b ∉ ThreshSeqModRel(A ∪ {b}, q). In the following, we prove the reverse statement. Following the analysis of RCP for ThreshSeqMod in Section 2.1.1, we consider the runs of (1) ThreshSeqMod(A, q) and (2) ThreshSeqMod(A ∪ {b}, q). Variables of (1) are given the notation defined in the pseudocode; variables of (2) are given the superscript b. We analyze the following statement P(i) for each iteration i of the outer for loop on Line 4 in Alg. 1.

Thus, the analysis in Section 2.1.1 also holds in this case. By the inductive argument, P(i) is true for each i, and in particular P(M + 1) holds, which indicates ThreshSeqModRel(A ∪ {b}, q) = ThreshSeqModRel(A, q).

Consistent Linear Time (Linear-TimeCardinality) Algorithm
In this section, we present the analysis of Linear-TimeCardinality (Alg. 3, LTC), a consistent linear-time algorithm which is an extension of the highly adaptive linear-time algorithm (Alg. 3) in Chen et al. [2021]. Notably, to ensure consistency within a distributed setting and bound the solution size, LTC incorporates the randomness q and initializes the solution set with the maximum singleton. The algorithm facilitates the creation of linear-time MapReduce algorithms, enhancing overall computational efficiency beyond the capabilities of current state-of-the-art superlinear algorithms.
Theorem 2. Let (f, k) be an instance of SM. The algorithm Linear-TimeCardinality outputs S ⊆ N such that the following properties hold: 1) there are O(n) oracle queries and O(n) adaptive rounds; 2) the last k elements S′ of S satisfy a constant-factor guarantee with respect to an optimal solution O of the instance (f, k); 3) the size of S is limited to O(k log(n)).

Analysis of Consistency
The highly adaptive linear-time algorithm (Alg. 3) outlined in Chen et al. [2021] commences with an empty set and adds an element x to it if Δ(x | S) ≥ f(S)/k. Without introducing any randomness, this algorithm can be deterministic only if the order of the ground set is fixed. Additionally, to limit the solution size, we initialize the solution set with the maximum singleton, a choice that also impacts the algorithm's consistency. However, by selecting the first element that maximizes the objective value, the algorithm maintains its consistency.
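A minimal sketch of this selection rule under the stated modifications (order fixed by q, initialization with the first maximizer of the singleton value, and the add-if-gain-at-least-f(S)/k rule); it is an illustration rather than a literal transcription of Alg. 3.

```python
import random

def ltc_sketch(f, ground_set, k, q):
    """One-pass, consistent selection in an order fixed by the random bits q."""
    rng = random.Random(q)
    order = list(ground_set)
    rng.shuffle(order)                                # order determined by q

    # Initialize with the *first* maximizer of the singleton value (for consistency).
    best_val = max(f({x}) for x in order)
    a = next(x for x in order if f({x}) == best_val)
    S = {a}

    # Add x iff its marginal gain is at least f(S)/k; each addition multiplies
    # f(S) by at least (1 + 1/k), which bounds the solution size by O(k log n).
    for x in order:
        if x not in S and f(S | {x}) - f(S) >= f(S) / k:
            S.add(x)
    return S
```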
In the following, we provide the analysis of randomized consistency property of LTC.
Proof. Consider the runs of (1) LTC(A, q), (2) LTC(A ∪ {b}, q), and (3) LTC(A ∪ B, q). With the same sequence of random bits q, after the random permutation, A, A ∪ {b}, and A ∪ B are subsets of the same sequence. For any x ∈ N, let i_x be the index of x; then N_{i_x} denotes the elements before x (including x). Define S_{i_x} to be the intermediate solution of (1) after we consider element x; if x ∉ A, define S_{i_x} = S_{i_x − 1}. Similarly, define S′_{i_x} and S^b_{i_x} to be the intermediate solutions of (2) and (3), respectively. If S_{i_x} = S^b_{i_x} = S′_{i_x} for any x ∈ N and b ∈ B, then LTC(A ∪ B, q) = LTC(A, q). Next, we prove that the above statement holds.
For any b ∈ B, since LTC(A, q) = LTC(A ∪ {b}, q), it holds that either f(b) < max_{x∈A} f(x), or f(b) = max_{x∈A} f(x) and i_a < i_b, where a is the first element in arg max_{x∈A} f(x). Therefore, a is also the first element in arg max_{x∈A∪B} f(x). Furthermore, it holds that S_0 = S^b_0 = S′_0 = {a}.
Suppose that the statement holds for all elements preceding x.

[Algorithm 4: Randomized-DASH (R-DASH). Input: evaluation oracle f : 2^N → R, constraint k, error ε, available machines M ← {1, 2, ..., ℓ}. Each element e ∈ N is assigned to a machine; machine i runs on its assigned elements N_i and sends S_i, R_i to the primary machine, which computes the final solution.]

Analysis of Guarantees
Query Complexity and Adaptivity. Alg. 3 makes oracle queries on Lines 4 and 6: there are n oracle queries on Line 4 and one oracle query for each element received on Line 6. Therefore, the query complexity and the adaptivity are both O(n).
Solution Size. Given that f is a monotone function, it holds that arg max_{S⊆N} f(S) = N. Furthermore, since f({a}) = max_{x∈N} f({x}), submodularity implies that f({a}) ≥ f(N)/n. Each element added to S increases the solution value by a factor of at least (1 + 1/k). Therefore, the number of additions, and hence the size of S, is at most log_{1+1/k}(n) = O(k log(n)). For any x ∈ N, let S_x be the intermediate solution before we process x; the stated bound then follows, where Inequality (1) follows from monotonicity and submodularity, and Inequality (2) follows from submodularity.
Low-Adaptive Algorithms with Constant MR Rounds (Greedy-DASH and Randomized-DASH)

Once we have the randomized consistency property of LAG, we can almost immediately use it to obtain parallelizable MapReduce algorithms.
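To make the two-round pattern concrete, here is a minimal sketch of R-DASH with a sequential stand-in for the per-machine calls (in the MR setting, each call of lag runs on its own machine with the shared random bits q); the helper signatures are assumptions for illustration.

```python
import random

def r_dash_sketch(f, ground_set, k, ell, lag, q=0):
    """Round 1: random partition, run LAG on each machine's data.
    Round 2: run LAG on the pooled solutions on the primary machine."""
    rng = random.Random(q)
    parts = [[] for _ in range(ell)]
    for x in ground_set:                          # round 1: random partition
        parts[rng.randrange(ell)].append(x)

    machine_sols = [lag(f, part, k, q) for part in parts]   # parallel per machine in MR

    pooled = set().union(*machine_sols)           # sent to the primary machine
    primary_sol = lag(f, pooled, k, q)            # round 2

    # Return the best of the primary solution and the per-machine solutions.
    return max(machine_sols + [primary_sol], key=f)
```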
Proof. Query Complexity and Adaptivity. R-DASH runs with two MR rounds. In the first MR round, Alg is invoked ℓ times in parallel, each with a ground set of size n/ℓ. So, during the first MR round, the number of queries is ℓΦ(n/ℓ) and the number of adaptive rounds is Ψ(n/ℓ). Then, the second MR round calls Alg once, handling at most n/ℓ elements. Consequently, the total number of queries of R-DASH is (ℓ + 1)Φ(n/ℓ), and the total number of adaptive rounds is 2Ψ(n/ℓ).

Approximation Ratio. Let R-DASH be run with input (f, k, ε, M). Since ℓ ≤ n^{1−c}, an application of Chernoff's bound shows that the size |N_i| is concentrated.
Let N(1/ℓ) denote the random distribution over subsets of N in which each element is included independently with probability 1/ℓ. For x ∈ N, let p_x be defined accordingly; the claimed bound then follows, where Inequality 3 follows from Lemma 8 and the convexity of F.

[Algorithm 5: Greedy-DASH (G-DASH). In each round r, X_{r,i} ← elements assigned to machine i chosen uniformly at random, N_{r,i} ← X_{r,i} ∪ C_{r−1}, and S_{r,i}, R_{r,i} are sent to each machine.]
communication complexity, and probability at least 1 − n^{1−2c} such that

Greedy-DASH
Next, we obtain nearly the optimal ratio by applying LAG and randomized consistency to DistributedDistorted, proposed by Kazemi et al. [2021]. DistributedDistorted is a distributed algorithm for regularized submodular maximization that relies upon (a distorted version of) the standard greedy algorithm. For SMCC, the distorted greedy algorithm reduces to the standard greedy algorithm Greedy. In the following, we show that any algorithm that satisfies randomized consistency can be used in place of Greedy, as stated in Theorem 4. By introducing LAG into DistributedDistorted, G-DASH achieves the near-optimal (1 − 1/e − ε) expected ratio in ⌈1/ε⌉ MapReduce rounds, O(log(n) log(k)) adaptive rounds, and O(n log(k)) total queries. We also generalize this to any algorithm that satisfies randomized consistency.
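Before detailing the framework, here is a minimal sketch of its multi-round structure: in each of ⌈1/ε⌉ rounds, the data is re-partitioned, each machine runs the consistent subroutine on its fresh data plus the elements carried over from earlier rounds, and the best solution seen so far is kept. What exactly is carried over (C_{r−1}) is simplified here, so this is an illustration of the shape of the computation, not the precise algorithm.

```python
import math
import random

def multi_round_sketch(f, ground_set, k, eps, ell, alg, q=0):
    """ceil(1/eps) MR rounds of partition-and-solve with carried-over elements."""
    rng = random.Random(q)
    rounds = math.ceil(1 / eps)
    carried = set()                       # stand-in for C_{r-1}
    best = set()
    for _ in range(rounds):
        parts = [[] for _ in range(ell)]
        for x in ground_set:              # uniformly random partition each round
            parts[rng.randrange(ell)].append(x)
        round_sols = [alg(f, set(part) | carried, k, q) for part in parts]
        best = max(round_sols + [best], key=f)
        carried |= set().union(*round_sols)
    return best
```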
The Framework of Kazemi et al. [2021]. DistributedDistorted has ⌈1/ε⌉ MR rounds, where each round r works as follows. First, it distributes the ground set over the machines uniformly at random. Then, each machine i runs Greedy (with the modular term ℓ(·) = 0) on the data N_{r,i}, which combines the elements distributed at the start of round r, X_{r,i}, with the elements forwarded from the previous rounds, C_{r−1}, to get the solution S_{r,i}. At the end, the final solution, which is the best among S_{r,1} and all the previous solutions, is returned. To improve the adaptive rounds of DistributedDistorted, we replace the standard greedy with LAG to get G-DASH.

Proof. Query Complexity and Adaptivity. G-DASH operates with ⌈1/ε⌉ MR rounds, where each MR round calls Alg ℓ times in parallel. As each call of Alg works on a ground set of size at most n/ℓ, the total number of queries for G-DASH is (ℓ/ε)Φ(n/ℓ), and the total number of adaptive rounds is (1/ε)Ψ(n/ℓ).

Approximation Ratio. Let G-DASH be run with input (f, k, ε, M). Similar to the analysis of Theorem 3, the size of N_i is concentrated. Let O be the optimal solution. For any x ∈ N, define the quantities used in the following lemma.
Lemma 6. For any x ∈ O, the following bound holds.

Proof. Pr(x ∈ ·) is bounded as stated.

The rest of the proof bounds f(S_{r,1}) in the following two ways. Similarly, it holds that f(S_{r,1}) ≥ αf(O_{r,3}), and for any o ∈ O, by Lemma 6, the corresponding bound holds. By Inequalities 4 and 5, we bound the approximation ratio of G-DASH, where Inequality (a) follows from Lemma 8 and the convexity of F.

Threshold-DASH (T-DASH)

T-DASH (Alg. 6) is a two-MR-round algorithm that runs ThreshSeqMod concurrently on every machine with a specified threshold value of αOPT/k. The primary machine then builds up its solution S_1 by adding elements with ThreshSeqMod from the pool of solutions returned by the other machines. Notice that there is a small amount of data duplication, as elements of the ground set are not randomly partitioned in the same way as in the other algorithms. This version of the algorithm requires knowledge of the value OPT; in Appendix E, we show how to remove this assumption. In Appendix I, we further discuss the similar two-MR-round algorithm of Liu and Vondrak [2019], which provides an improved 1/2-approximation but requires four times the data duplication of T-DASH.

[Algorithm 6: Threshold-DASH with known OPT (T-DASH). Each element e ∈ N is assigned to each machine independently with probability 1/ℓ; machine i runs ThreshSeqMod on its assigned elements N_i and sends S_i, R_i to the primary machine, which combines them into R and computes the final solution.]

Proof. Approximation Ratio. In Algorithm 6, there are ℓ + 1 independent calls of ThreshSeqMod. With |N_i| ≥ n^c, the success probability of each call of ThreshSeqMod is larger than 1 − 1/(n^c(ℓ + 1)). Thus, Algorithm 6 succeeds with probability larger than 1 − n^{−c}. For the remainder of the analysis, we condition on the event that all calls to ThreshSeqMod succeed.
In the case that |T′| = k, the claim follows by Theorem 1 in Section 2.1. Otherwise, we consider the case that |T′| < k in the following. Let (TSMSol(N, q), TSMRel(N, q)) be the pair of sets returned by ThreshSeqMod(N, q), and consider any x ∈ N.

Linear-Time Distributed Algorithm (L-Dist)

L-Dist runs LTC on each machine in the first MR round; in the second round, the primary machine runs LTC on the pooled solutions and then applies ThresholdGreedy, a (1 − 1/e − ε) approximation algorithm, to boost the objective value. We provide the theoretical guarantees of L-Dist as Theorem 6. The proof involves a minor modification of the proof provided for Theorem 3.
Theorem 6. Let (f, k) be an instance of SM where k < Ψ/(ℓ log(Ψ)). L-Dist returns a set V with two MR rounds, O(n/ℓ) adaptive rounds, O(n) total queries, and O(n) communication complexity, such that

Analysis of Query Complexity and Adaptivity
L-Dist runs with two MR rounds. In the first MR round, LTC is invoked ℓ times in parallel, each call making O(n/ℓ) queries in O(n/ℓ) adaptive rounds by Theorem 2. So, during the first MR round, the number of queries is O(n) and the number of adaptive rounds is O(n/ℓ). Then, the second MR round calls LTC and ThresholdGreedy once each, handling at most n/ℓ elements. By Theorems 2 and 8, the number of queries is O(n/ℓ) and the number of adaptive rounds is O(n/ℓ). Consequently, the total number of queries of L-Dist is O(n), and the total number of adaptive rounds is O(n/ℓ).

Analysis of Approximation Ratio
Let L-Dist be executed with input (f, k, ε, M). Since ℓ ≤ n^{1−c}, the expected size of each subset N_i is n/ℓ ≥ n^c. To ensure that the size |N_i| is concentrated, we apply Chernoff's bound. Let N(1/ℓ) denote the random distribution over subsets of N in which each element is included independently with probability 1/ℓ, and let p ∈ [0, 1]^n be the vector defined coordinate-wise for each x ∈ N. The claimed bound then follows, where Inequality 7 follows from Lemma 8 and the convexity of F.

Post-Processing
In this algorithm, we employ a simple post-processing procedure. Since T′_1, which has size k, is a 1/2-approximation with respect to T_1, it is also a 1/2-approximation for the instance (f, k) on the ground set T_1. By running, on T_1, any linear-time algorithm that has a better approximation ratio, we are able to boost the objective value returned by the algorithm while keeping the same theoretical guarantees. By Theorem 8, ThresholdGreedy achieves a (1 − 1/e − ε)-approximation in linear time given a guess of the optimal solution value. Therefore, with input α = 1/2 and Γ = f(T′_1), ThresholdGreedy can effectively enhance the objective value.
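A minimal sketch of this post-processing pattern: a stronger centralized routine (such as ThresholdGreedy) is run on the small ground set T_1, seeded with the guess Γ = f(T′_1), and the better of the two solutions is returned. The better_alg signature, including its guess parameter, is a hypothetical placeholder.

```python
def boost_solution(f, T1, T1_prime, k, better_alg):
    """Post-process: run a stronger algorithm on the restricted ground set T1,
    then keep whichever of the two solutions is better."""
    # T1_prime is already a 1/2-approximation on T1, so f(T1_prime) serves as
    # a guess of the optimal value for the stronger algorithm.
    boosted = better_alg(f, T1, k, guess=f(T1_prime))
    return boosted if f(boosted) >= f(T1_prime) else T1_prime
```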
Towards Larger k: A Memory-Efficient, Distributed Framework (MED)

In this section, we propose a general-purpose plug-in framework for distributed algorithms, MemoryEfficientDistributed (MED, Alg. 8). MED increases the largest possible constraint value from O(n/ℓ²) to O(n/ℓ) in the value oracle model. Under some additional assumptions, we remove all restrictions on the constraint value.
As discussed in Section 1, the k value is limited to a fraction of the machine memory for MapReduce algorithms, k ≤ O(n/ℓ²), since those algorithms need to merge a group of solutions and pass it to a single machine. MED works around this limitation as follows: MED can be thought of as a greedy algorithm that uses an approximate greedy selection through Alg. One machine manages the partial solutions {S_i}, built up over m iterations. In this way, each call to Alg is within the constraint restriction of Alg, i.e., O(n/ℓ²), but a larger solution of up to size O(n/ℓ) can be constructed.
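A minimal sketch of this pattern in the generic (uncompressed) setting: the constraint k is split into blocks no larger than the constraint k_alg that the underlying distributed routine supports, and each block is selected against the residual objective g(X) = f(S ∪ X) − f(S). The distributed call is replaced by a plain function call, and the signature of dist_alg is an assumption.

```python
import math

def med_sketch(f, ground_set, k, k_alg, dist_alg):
    """Build a size-k solution from ceil(k / k_alg) blocks, each chosen by dist_alg."""
    S = set()
    for _ in range(math.ceil(k / k_alg)):
        block_size = min(k_alg, k - len(S))
        # Residual objective: marginal value on top of the current solution S.
        g = lambda X, S=S: f(set(X) | S) - f(S)
        remaining = [x for x in ground_set if x not in S]
        S |= set(dist_alg(g, remaining, block_size))
        if len(S) >= k:
            break
    return S
```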
The restriction on k of MED of O(n/ℓ) comes from passing the data of the current solution to the next round.Intuitively, if we can send some auxiliary information about the function instead of the elements selected, the k value can be unrestricted.
Assumption 1. Let f be a set function with ground set N of size n. If, for all S ⊆ N, there exists a bit vector v_S such that the function g(X) = f(S ∪ X) − f(S) can be computed from X and v_S, then f satisfies Assumption 1.
We show in Appendix F that all four applications evaluated in Section 8 satisfy Assumption 1. As an example, consider MaxCover, which can be expressed as f(S) = Σ_{i∈N} f_i(S), where f_i(S) = 1{i is covered by S}. Let g_i(X) = f_i(S ∪ X) − f_i(S), and v_S = (f_1(S), . . ., f_n(S)). Then, f_i(S ∪ X) = 1{i is covered by S} ∨ 1{i is covered by X}, where the first term is given by v_S and the second term is calculated from X. Therefore, since g(X) = Σ_{i∈N} g_i(X), g(X) can be computed from X and v_S.
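A minimal sketch of this compression for MaxCover: the bit vector v_S records which nodes are already covered by S, so the residual gain g(X) = f(S ∪ X) − f(S) can be computed from X and v_S alone, without shipping S itself.

```python
def coverage_bitvector(S, covers, num_nodes):
    """v_S[i] = 1 iff node i is covered by some element of S."""
    v = [0] * num_nodes
    for x in S:
        for i in covers[x]:
            v[i] = 1
    return v

def residual_gain(X, v_S, covers):
    """g(X) = f(S | X) - f(S), computed from X and v_S only."""
    newly_covered = set()
    for x in X:
        for i in covers[x]:
            if not v_S[i]:
                newly_covered.add(i)
    return len(newly_covered)

if __name__ == "__main__":
    covers = {0: {0, 1}, 1: {1, 2}, 2: {3}}        # element -> nodes it covers
    v_S = coverage_bitvector({0}, covers, num_nodes=4)
    print(residual_gain({1, 2}, v_S, covers))      # nodes 2 and 3 are new -> prints 2
```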
Theorem 7. Let (f, k) be an instance of SMCC distributed over ℓ machines. For generic objectives, where the data of the current solution needs to be passed to the next round, k ≤ min{(ℓΨ − n)/(ℓ − 1) + 1, Ψ − ℓ + 1}; for special objectives, where only one piece of compressed data needs to be passed to the next round, k ≤ n. Let S_m be the set returned by MED. Then E[f(S_m)] ≥ (1 − e^{−γ}) OPT, where γ is the expected approximation ratio of Alg.
Proof. Let O be an optimal solution and O_1, O_2, ..., O_m be a partition of O into m pieces, each of size at most k/m. The per-iteration gain then satisfies the stated bound, where Inequality (a) follows since the residual instance is an instance of SMCC and Alg is a γ-approximation. Unfixing S_i, the expected bound follows.

To analyze the memory requirements of MED, consider the following. To compute g(·) ← f(· ∪ S_i) − f(S_i), the current solution S_i needs to be passed to each machine in M for the next call of Alg. Suppose |S| = k − x, where 1 ≤ x ≤ k. The size of the data stored on any non-primary machine in the cluster M, as well as on the primary machine of M, can be bounded accordingly. Therefore, if Ψ ≥ 2n/ℓ, MED can run, since ℓ ≤ n/ℓ in our MapReduce model. Under the alternative assumption, it is clear that MED can run for all k ≤ n.
Empirical Evaluation

All other algorithms use 2 machines (with 4 cores each).
Experiment 2 was conducted on a larger 64-machine cluster, each machine featuring 32 CPU cores. Experiment 4 operated on a cluster of 32 single-core machines. Notably, in Experiment 1, our algorithms used ℓ = 8 machines, while prior MapReduce (MR) algorithms employed ℓ = 32, fully utilizing 32 cores. MPICH version 3.3a2 was installed on every machine, and we used the Python library mpi4py for implementing and parallelizing all algorithms with the Message Passing Interface (MPI). These algorithms were executed using the mpirun command, with runtime tracked using mpi4py.MPI.Wtime() at the algorithm's start and completion.

Datasets. Table 2 presents the dataset sizes for our experiments. In Experiments 1 and 3 (Fig. 4(b)), we assess algorithm performance on small datasets, varying from n = 10,000 to 100,000. These sizes enable evaluation of computationally intensive algorithms such as DDist, T-DASH, and G-DASH. Experiments 2, 3 (Fig. 4(a)), and 4 focus on larger datasets, ranging from n = 50,000 to 5 million.
Experiment Objectives. The primary objective of our experiment set is to comprehensively evaluate the practicality of the distributed algorithms across varying cluster and dataset sizes. The specific objectives of each experiment are outlined below:
• Experiment 1: Baseline experiment aimed at assessing the performance of the algorithms using small datasets and a small cluster setup.
• Experiment 2: Assess the performance of the algorithms on large datasets and cluster setup.
• Experiment 3: Investigate the influence of the number of nodes in a cluster on algorithm performance.
• Experiment 4: Examine the impact of increasing cardinality constraints on the performance of MED.

Experiment 1 -Comparative Analysis on Small Datasets
The results of Experiment 1 (Fig. 2) show that all algorithms provide similar solution values (with T-DASH being slightly worse than the others). However, there is a large difference in runtime, with R-DASH the fastest by orders of magnitude. The availability of only 4 threads per machine severely limits the parallelization of T-DASH, resulting in longer runtime; access to log_{1+ε}(k) threads per machine should result in a faster runtime than R-DASH.

[Figure 2 notes: RandGreeDI is run as in Barbosa et al. [2015] to ensure the 1/2(1 − 1/e) ratio; all Greedy implementations used lazy greedy to improve the runtime; timeout for each application: 6 hours per algorithm.]

Experiment 2 -Performance Analysis of R-DASH, RandGreeDI and L-Dist on a Large Cluster
This section presents a comparative analysis of R-DASH, L-Dist, and RandGreeDI on a large 64-node cluster, each node equipped with 32 cores.We assess two versions of L-Dist and RandGreeDI, one with ℓ = 64 and the other with ℓ = 2048, with a predetermined time limit of 3 hours for each application.The plotted results depict the instances completed by each algorithm within this time constraint.
In terms of solution quality, as depicted in Figure 3(a), the algorithms return statistically indistinguishable solution values.

Experiment 3 -Scalability Assessment
Figure 4(a) illustrates a linear speedup for R-DASH as the number of machines ℓ increases. Figure 4(b) highlights an intriguing observation: despite sufficient available memory, increasing ℓ can result in inferior performance when k > n/ℓ². Specifically, as depicted in Figure 4(b), we initially witness the expected faster execution of RandGreeDI with ℓ = 32 compared to RandGreeDI with ℓ = 8. However, once k > n/32², the relative performance of RandGreeDI with ℓ = 32 rapidly deteriorates. This decline can be attributed to the degradation of RandGreeDI's performance beyond k = n/ℓ². When running RandGreeDI with a single thread on each machine, the total running time on ℓ machines can be computed from two components: first, the running time for one machine in the first MR round, which is proportional to (n/ℓ − (k − 1)/2)k; and second, the running time for the primary machine in the second MR round (post-processing step), which is proportional to (kℓ − (k − 1)/2)k. Consequently, the total running time is proportional to nk/ℓ + ℓk² − k(k − 1). Optimal performance is achieved when ℓ = √(n/k), which justifies the preference for parallelization within a machine to maintain a lower ℓ rather than distributing the data across separate processors. Furthermore, in Experiment 5 (Section 8.4), we demonstrate that utilizing MED enables MR algorithms to produce solutions much faster with no compromise in solution value, particularly when solving for k > n/ℓ². These results provide further support for the advantage of incorporating MED in achieving efficient and effective parallelization in MR algorithms.

This section presents experimental results comparing the performance of MED+RG and the vanilla RandGreeDI algorithm. In terms of solution value, MED+RG consistently provides nearly identical solutions to the vanilla RandGreeDI algorithm across all instances for the three applications studied. Regarding runtime, the following observations are made: initially, both algorithms exhibit similar execution times up to a threshold of k = n/ℓ². However, beyond this threshold, the performance gap between MED+RG and RandGreeDI widens linearly. MED+RG achieves notable average speedup factors of 1.8, 2.2, and 2.3 over RandGreeDI for the respective applications. Moreover, beyond k = n/ℓ², MED+RG outperforms RandGreeDI in terms of completing instances within a 12-hour timeout, completing 77% more instances of k across all three applications. These findings highlight the promising performance of MED+RG, demonstrating comparable solution values while significantly improving runtime efficiency compared to the vanilla RandGreeDI algorithm.
The empirical findings from this experiment provide insights into the capabilities of the MED+Alg framework.Our results indicate that MED+Alg achieves solution quality that is practically indistinguishable from the vanilla Alg algorithm, even for significantly larger values of k.Additionally, even with sufficient available memory to run Alg beyond k = n ℓ 2 , MED+Alg demonstrates noteworthy computational efficiency, surpassing that of the Alg algorithm.This combination of comparable solution quality and improved computational efficiency positions MED+Alg as a highly promising framework.These findings have significant implications and underscore the potential of MED+Alg to address complex problems in a distributed setting with large-scale values of k.

Discussion and Conclusion
Prior to this work, no MR algorithms for SMCC could parallelize within a machine; these algorithms require many sequential queries. Moreover, increasing the number of machines to the number of threads available in the cluster may actually harm performance, as we showed empirically; intuitively, this is because the size of the set on the primary machine scales linearly with the number of machines ℓ. In this paper, we have addressed this limitation by introducing a suite of algorithms that are both parallelizable and distributed. Specifically, we have presented R-DASH, T-DASH, and G-DASH, which are the first MR algorithms with sublinear adaptive complexity (highly parallelizable). Moreover, our algorithms have nearly linear query complexity over the entire computation. We also provide the first distributed algorithm with O(n) total query complexity, improving on the O(n polylog(n)) of our other algorithms and the algorithm of Liu and Vondrak [2019].
When RandGreeDI was introduced by Mirzasoleiman et al. [2013], the empirical performance of the algorithm was emphasized, with theoretical guarantees unproven until the work of Barbosa et al. [2015]. Since that time, RandGreeDI has remained the most practical algorithm for distributed SMCC. Our R-DASH algorithm may be regarded as a version of RandGreeDI that is 1) parallelized; 2) nearly linear time in total (RandGreeDI is quadratic); and 3) empirically orders of magnitude faster in parallel wall time in our evaluation. R-DASH achieves an approximation ratio of (1 − 1/e)/2 in two MR rounds, the first round merely serving to distribute the data. We provide G-DASH to close the gap to the (1 − 1/e) ratio in a constant number of rounds. However, MR rounds are expensive (as shown by our Experiment 1). The current best ratio achieved in two rounds is the 0.545-approximation algorithm of Mirrokni and Zadimoghaddam [2015]. Therefore, a natural question for future research is what is the best ratio achievable in two MR rounds. Moreover, non-monotone or partially monotone objective functions and more sophisticated constraint systems are directions for future research.

A Lovász Extension of Submodular Function
Given a submodular function f, the Lovász extension F of f is defined as follows: for z ∈ [0, 1]^N, F(z) = E_{θ∼U(0,1)}[f({x ∈ N : z_x ≥ θ})]. The Lovász extension satisfies the following properties: (1) F is convex; (2) F(cz) ≥ cF(z) for any c ∈ (0, 1). Moreover, we will require the following simple lemma.

Lemma 8 (Barbosa et al. [2015]). Let S be a random set, and suppose that E[1_S] = c·z for some c ∈ (0, 1) and z ∈ [0, 1]^N. Then E[f(S)] ≥ c·F(z).

B Probability Lemma and Concentration Bounds
Lemma 9 (Chen et al. [2021]). Suppose there is a sequence of n Bernoulli trials X_1, X_2, . . ., X_n, where the success probability of X_i depends on the results of the preceding trials X_1, . . ., X_{i−1}, and suppose it holds that Pr(X_i = 1 | X_1 = x_1, . . ., X_{i−1} = x_{i−1}) ≥ η, where η > 0 is a constant and x_1, . . ., x_{i−1} are arbitrary.
Then, if Y_1, . . ., Y_n are independent Bernoulli trials, each with probability η of success, it holds that Pr(Σ_{i=1}^n X_i ≤ b) ≤ Pr(Σ_{i=1}^n Y_i ≤ b), where b is an arbitrary integer. Moreover, let A be the first occurrence of success in the sequence (X_i). Then E[A] ≤ 1/η.

Lemma 10 (Chernoff bounds, Mitzenmacher and Upfal [2017]). Suppose X_1, . . ., X_n are independent binary random variables such that Pr(X_i = 1) = p_i. Let μ = Σ_{i=1}^n p_i and X = Σ_{i=1}^n X_i. Then for any δ ≥ 0, we have Pr(X ≥ (1 + δ)μ) ≤ (e^δ / (1 + δ)^{1+δ})^μ. Moreover, for any 0 ≤ δ ≤ 1, we have Pr(X ≤ (1 − δ)μ) ≤ e^{−μδ²/2}.

C Omitted Proofs for ThreshSeqMod

Lemma 2. At an iteration j, let A_i be the set of elements filtered from V_j when a prefix of length i is selected.

Proof. After the filtration on Line 5, it holds that, for any x ∈ V_j, Δ(x | S_{j−1}) ≥ τ and Δ(x | V_j ∪ S_{j−1}) = 0. Therefore, the claimed bound on |A_i| follows.

Proof of Success Probability. The algorithm successfully terminates if, at some point, |V_j| = 0 or |S_j| = k. To analyze the success probability, we consider a variant of the algorithm that does not terminate once |V_j| = 0 or |S_j| = k; in this case, the algorithm keeps running with s = 0 and selects the empty set after the inner for loop in the following iterations. Thus, with probability 1, Lemma 3 holds for all M + 1 iterations of the outer for loop.
If the algorithm fails, there can be no more than m = ⌈log_{1−βε}(1/n)⌉ successful iterations in which λ*_j ≥ min{s, t}; otherwise, by Lemma 3, there exists an iteration j at which the algorithm would have terminated. Let X be the number of successful iterations. Then X is the sum of M + 1 dependent Bernoulli random variables, where each variable has a success probability of more than 1/2. Let Y be the sum of M + 1 independent Bernoulli random variables with success probability 1/2. By the probability lemmata, the failure probability of Alg. 1 satisfies Pr(Alg. 1 fails) ≤ Pr(X ≤ m).

Proof of Adaptivity and Query Complexity. In Alg. 1, oracle queries are incurred on Lines 5 and 14 and can be done in parallel. Thus, there are a constant number of adaptive rounds within one iteration of the outer for loop, and the adaptivity is O(M) = O(log(n/δ)/ε³). Consider an iteration j: there are no more than |V_{j−1}| + 1 queries on Line 5 and log_{1+ε}(|V_j|) + 2 queries on Line 14. Let Y_i be the i-th successful iteration; the expected number of oracle queries then follows.

Proof of Property (3). For each iteration j, consider the two cases of λ*_j arising from the choice of λ*_j on Line 18. In the first case the bound is immediate; in the second case, the last inequality follows from (1 + ε)^{u+1} ≥ λ*_j > 1/ε. Therefore, by the above two inequalities and the monotonicity of f, the objective value of the returned solution can be bounded as claimed.

Proof of Property (4). If the output set S satisfies |S| < k, then any x ∈ N is filtered out at some point. Let x be discarded at iteration j_x. Then,
If |S| = k at termination, let S_j be the intermediate solution after iteration j of the while loop, and let τ_j be the corresponding threshold. Suppose that S_j \ S_{j−1} ≠ ∅. Then each element added during iteration j provides a minimum incremental benefit of τ_j. Next, we consider iteration j − 1. If j > 1, since the algorithm does not terminate during iteration j − 1, each element in N is either added to the solution or omitted due to a small marginal gain with respect to the current solution. Therefore, for any o ∈ O \ S_{j−1}, it holds that ∆(o | S_{j−1}) < τ_{j−1} by submodularity. Therefore, the same bound holds for any iteration j in which elements are added to the solution. Moreover, we know that any two machines select elements independently, where Inequality (a) follows from Pr(o ∉ TSMRel(N_1) | o ∉ TSMRel(N_1 ∪ {o})) = 1. Thus, we can bound the probability as follows,

Description. The T-DASH algorithm described in Alg. 10 is a two-MR-round algorithm using the AFD approach that runs ThreshSeqMod concurrently on every machine for log_{1+ε}(k) different guesses of the threshold τ_{i,j} in the range [α∆*_i/k, α∆*_i], where α is the approximation ratio of T-DASH and ∆*_i is the maximum singleton value in N_i. Every solution returned to the primary machine is placed into a bin based on its corresponding threshold guess τ_{i,j}, where ∆* is the maximum singleton value in N. Since there must exist a threshold τ* that is close enough to αOPT/k, running ThreshSeqMod (Line 21) on every bin in the range [α∆*/k, α∆*] and selecting the best solution guarantees the α-approximation of T-DASH in O(log(n)) adaptive rounds.

Algorithm 10 Threshold-DASH with no knowledge of OPT (T-DASH)
1: Input: evaluation oracle f : 2^N → R, constraint k, error ε, available machines M ← {1, 2, ..., ℓ}
2: Initialize δ ← 1/(ℓ + 1), q ← a fixed sequence of random bits
3: Set α ← 3/8, (q_{i,j})_{i∈[ℓ+1], j∈[log_{1+ε}(k)]} ← q
4: for e ∈ N do
5:    Assign e to each machine independently with probability 1/ℓ
6: for i ∈ M do
7:    ▷ On machine i
8:    Let N_i be the elements assigned to machine i
9:    Set ∆*_i ← max{f(e) : e ∈ N_i}
10:   for j ← 0 to log_{1+ε}(k) in parallel do
11:   Send ∆*_i and all (τ_{i,j}, S_{i,j}, R_{i,j}) to the primary machine
14:   ▷ On primary machine
15:   Set ∆* ← max{∆*_i : 1 ≤ i ≤ ℓ}
16:   for x ← 0 to ⌈log_{1+ε}(k)⌉ + 1 in parallel do
17:   x) : 0 ≤ x ≤ log_{1+ε}(k)}
24: return T

Overview of Proof. Alg. 10 is inspired by Alg. 6 in Section 5, which is a version of the algorithm that knows the optimal solution value. With ∆* = max{f(e) : e ∈ N}, there exists an x_0 such that τ_{x_0} ≤ αOPT(1 + ε)/k ≤ τ_{x_0+1}, using O(log(k)/ε) guesses. Then, on each machine i, we only consider sets S_{i,j} and R_{i,j} such that τ_{x_0} ≤ τ_{i,j} ≤ τ_{x_0+1}. If such a τ_{i,j} exists, (S_{i,j}, R_{i,j}) works like (S_i, R_i) in Alg. 6. If no such τ_{i,j} exists, then for any e ∈ N_i it holds that f(e) < αOPT/k, which means that ThreshSeqMod(N_i) with τ = αOPT/k would return an empty set. Since each call of ThreshSeqMod with a different guess of τ is executed in parallel, the adaptivity remains the same and the query complexity increases by a factor of log(k)/ε.
On a machine that does not return a τ_{i,(j_0)}, we consider it as running ThreshSeqMod(N_i, τ_{x_0+1}) and returning two empty sets; hence max{f(e) : e ∈ N_i} < τ_{x_0}. Let R'_{x_0} = ∪_{i∈M} R'_{i,(j_0)} and O_2 = R'_{x_0} ∩ O. Then Lemma 7 in Appendix E still holds in this case, and we can calculate the approximation ratio as follows with ε ≤ 2/3,

F Experiment Setup

F.1 Applications
Given a constraint k, the objectives of the applications are defined as follows:

F.1.1 Max Cover
Maximize the number of nodes covered by choosing a set S of size at most k, where a node is covered if it has at least one neighbour in S. The application is run on synthetic random graphs of ground-set size 100,000, 1,000,000, and 5,000,000, generated using the Barabási-Albert (BA) model, for the centralized and distributed experiments respectively.
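For concreteness, a minimal sketch of this coverage objective on a networkx graph (toy sizes only; the BA parameter m below is an arbitrary choice for illustration):

```python
import networkx as nx

def max_cover_objective(G, S):
    """Number of nodes of G that have at least one neighbour in S."""
    S = set(S)
    return sum(1 for v in G.nodes() if any(u in S for u in G.neighbors(v)))

# Toy Barabasi-Albert graph; the experiments use much larger ground sets.
G = nx.barabasi_albert_graph(n=1000, m=2, seed=42)
print(max_cover_objective(G, [0, 1, 2]))
```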

F.1.2 Image Summarization on CIFAR-10 Data
Given a large collection of images, find a subset of size at most k that is representative of the entire collection. The objective used for the experiments is a monotone variant of the image-summarization objective of Fahrbach et al. [2019b]. For a ground set of N images, it is defined in terms of s_{i,j}, the cosine similarity of the pixel values between image i and image j. The data for the image-summarization experiments consists of 10,000 and 50,000 CIFAR-10 [Krizhevsky et al., 2009] color images for the centralized and distributed experiments respectively.
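One common monotone summarization objective built from pairwise similarities s_{i,j} is the facility-location form sketched below; this specific form is an illustrative assumption and not necessarily the exact objective of Fahrbach et al. [2019b]:

```python
import numpy as np

def image_summarization_objective(S, sim):
    """Facility-location-style value of a set S of image indices.

    sim: (N, N) matrix of pairwise cosine similarities s_{i,j}.
    Each image i is credited with its similarity to the closest selected image.
    """
    if len(S) == 0:
        return 0.0
    S = list(S)
    return float(np.sum(np.max(sim[:, S], axis=1)))

# Toy usage with random unit-norm feature vectors standing in for images.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sim = feats @ feats.T  # cosine similarities
print(image_summarization_objective({3, 17, 42}, sim))
```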
F.1.3 Influence Maximization on a Social Network
Maximize the aggregate influence when promoting a topic by selecting a set S of social-network influencers of size at most k. The probability that a user i is influenced by the set of influencers S is a function of |N_S(i)|, the number of neighbors of node i in S. We use the Epinions data set consisting of 27,000 users from Rossi and Ahmed [2015] for the centralized experiments and the YouTube online social network data of Yang and Leskovec [2012], consisting of more than 1 million users, for the distributed experiments. The value of p is set to 0.01.
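A commonly used instantiation of such an objective, given here purely as an illustrative assumption, sets Pr(i influenced) = 1 − (1 − p)^{|N_S(i)|} and sums over users:

```python
import networkx as nx

def influence_objective(G, S, p=0.01):
    """Expected number of influenced users under an independent-activation model.

    Assumed form (illustrative only): user i is influenced with probability
    1 - (1 - p) ** |N_S(i)|, where N_S(i) are i's neighbors in S.
    """
    S = set(S)
    total = 0.0
    for i in G.nodes():
        k_in_S = sum(1 for nbr in G.neighbors(i) if nbr in S)
        total += 1.0 - (1.0 - p) ** k_in_S
    return total

G = nx.barabasi_albert_graph(n=1000, m=3, seed=1)
print(influence_objective(G, {0, 1, 2}, p=0.01))
```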
F.1.4 Revenue Maximization on YouTube
Maximize the revenue of a product by selecting a set of users S of size at most k, whose network neighbors will then be advertised the product. The objective is based on the function of Mirzasoleiman et al. [2016]. For a given set of users X, with w_{i,j} denoting the influence between users i and j, the objective is defined in terms of V(S), the expected revenue from a user, which is a function of the sum of influences from that user's neighbors in S; α with 0 < α < 1 is a rate-of-diminishing-returns parameter for increased coverage.
We use the YouTube data set from Mirzasoleiman et al. [2016], consisting of 18,000 users, for the centralized experiments. For the distributed experiments we perform the empirical evaluation on the Orkut online social network data from Yang and Leskovec [2012], consisting of more than 3 million users. The value of α is set to 0.3.
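A standard instantiation of this revenue objective, given here as an illustrative assumption with V(y) = y^α:

```python
import networkx as nx

def revenue_objective(G, S, alpha=0.3):
    """Revenue value of a seed set S on a (possibly weighted) graph G.

    Assumed form (illustrative only): f(S) = sum over non-seed users i of
    V(sum_{j in S} w_{i,j}) with V(y) = y ** alpha, 0 < alpha < 1.
    Edge weights default to 1.0 if the graph is unweighted.
    """
    S = set(S)
    total = 0.0
    for i in G.nodes():
        if i in S:
            continue
        y = sum(G[i][j].get("weight", 1.0) for j in G.neighbors(i) if j in S)
        total += y ** alpha
    return total

G = nx.barabasi_albert_graph(n=1000, m=3, seed=2)
print(revenue_objective(G, {0, 1, 2}, alpha=0.3))
```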

G Replicating the Experiment Results
Our experiments can be replicated by following the steps below:
• Install MPICH version 3.3a2 (DO NOT install OpenMPI; ensure mpirun uses MPICH by running mpirun -version (Ubuntu)).
• Install pandas, mpi4py, scipy, and networkx.
• Set up an MPI cluster using the following tutorial: https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
• Create and update the host file ../nodesFileIPnew to store the IP addresses of all the connected MPI machines before running any experiments (the first machine being the primary machine). NOTE: place nodesFileIPnew inside the MPI shared repository, "cloud/" in this case (at the same level as the code-base directory). DO NOT place it inside the code base "DASH-Distributed SMCC-python/" directory.
• Clone the DASH-Distributed SMCC-python repository inside the MPI shared repository (/cloud when using the given tutorial). NOTE: clone the "DASH-Distributed SMCC-python" repository and execute the following commands on a machine with sufficient memory (RAM), capable of generating the large datasets. This repository NEED NOT be the primary repository ("/cloud/DASH-Distributed SMCC-python/") on the shared cluster directory that will be used for the experiments.
• Additional datasets for Experiment 1: download the image similarity matrix file "images 10K mat.csv" (https://drive.google.com/file/d/1s9PzUhV-C5dW8iL4tZPVjSRX4PBhrsiJ/view?u) and place it in the data/data exp1/ directory.
• To generate the decentralized data for Experiments 2 and 3, follow the steps below:
  - Execute bash GenerateDistributedData.bash nThreads nNodes.
  - The previous command should generate nNodes directories in the loading data/ directory (with names machine<nodeNo>Data).
  - Copy the data exp2 split/ and data exp3 split/ directories within each machine<i>Data directory to the corresponding machine M_i and place the directories outside /cloud (the directory created after setting up an MPI cluster using the given tutorial).
To run all experiments in the paper, please read the README.md file in the "DASH-Distributed SMCC-python" repository (Code/Data Appendix) for detailed information.
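Before launching the experiments, a quick sanity check that the MPI cluster and the mpi4py installation are working may be helpful (a minimal sketch, assuming the host file and setup described above; the script name mpi_check.py is hypothetical):

```python
# Save as mpi_check.py and run with:
#   mpirun -hostfile ../nodesFileIPnew -np <numMachines> python3 mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()

# Each process reports in; rank 0 plays the role of the primary machine.
hosts = comm.gather((rank, name), root=0)
if rank == 0:
    print(f"MPI cluster is up with {size} process(es):")
    for r, h in sorted(hosts):
        print(f"  rank {r} on {h}")
```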
H The Greedy Algorithm (Nemhauser and Wolsey [1978])

The standard Greedy algorithm starts with an empty set and proceeds by adding elements to the set over k iterations. In each iteration the algorithm selects the element with the maximum marginal gain ∆(e | A_{i−1}), where ∆(e | A) = f(A ∪ {e}) − f(A). Algorithm 11 is a formal statement of the standard Greedy algorithm; the intermediate solution set A_i represents the solution after iteration i. The (1 − 1/e ≈ 0.632) approximation ratio of the Greedy algorithm is the best possible for monotone objectives.
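A minimal Python sketch of the standard greedy described above (our own illustration; Algorithm 11 is the paper's formal pseudocode, and the experiments additionally use a lazy-greedy speedup that is omitted here):

```python
def greedy(f, ground_set, k):
    """Standard greedy for size-constrained monotone submodular maximization.

    f: set function, callable on a Python set.
    Returns a set A with |A| <= k built by k greedy selections.
    """
    A = set()
    current_value = f(A)
    for _ in range(k):
        best_gain, best_elem = 0.0, None
        for e in ground_set:
            if e in A:
                continue
            gain = f(A | {e}) - current_value  # marginal gain Delta(e | A)
            if gain > best_gain:
                best_gain, best_elem = gain, e
        if best_elem is None:  # no element improves the value
            break
        A.add(best_elem)
        current_value += best_gain
    return A

# Toy coverage example: each element covers a small subset of {0,...,9}.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {5, 6, 7, 8}, 3: {0, 9}}
f = lambda A: float(len(set().union(*(sets[i] for i in A)))) if A else 0.0
print(greedy(f, list(sets), k=2))  # e.g. {0, 2}
```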

4.1 Randomized-DASH

R-DASH (Alg. 4) is a two-MR-round algorithm obtained by plugging LAG into the RandGreeDI algorithm of Mirzasoleiman et al. [2013], Barbosa et al. [2015]. R-DASH runs in two MapReduce rounds and O(log(k) log(n)) adaptive rounds, and guarantees the ratio (1/2)(1 − 1/e − ε) (≈ 0.316).
Description. The ground set is initially distributed at random by R-DASH across all machines M. In its first MR round, R-DASH runs LAG on every machine to obtain S_i, R_i in O(log(k) log(|N_i|)) adaptive rounds. The solutions from every machine are then returned to the primary machine, where LAG selects the output solution that guarantees the (1/2)(1 − 1/e − ε) approximation in O(log(k) log(|R|)) adaptive rounds, as stated in Corollary 1. First, we provide Theorem 3 and its analysis, obtained by plugging any randomized algorithm Alg that satisfies RCP (Property 1) into the framework of RandGreeDI; the proof is a minor modification of the proof of Barbosa et al. [2015] to incorporate randomized consistency. Corollary 1 for R-DASH then follows immediately.
Theorem 3. Let (f, k) be an instance of SM where k = O(ψ/ℓ), and let Alg be a randomized algorithm that satisfies the Randomized Consistency Property with approximation ratio α, query complexity Φ(n), and adaptivity Ψ(n). Replacing LAG with Alg, R-DASH returns a set V in two MR rounds, 2Ψ(n/ℓ) adaptive rounds, (ℓ + 1)Φ(n/ℓ) total queries, and O(n) communication complexity such that
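A minimal single-process sketch of this two-round structure (simulating the machines with lists and substituting any centralized algorithm for LAG; this is purely illustrative and not the paper's implementation):

```python
import random

def two_round_distributed(f, ground_set, alg, k, num_machines=4, seed=0):
    """RandGreeDI-style two-round skeleton.

    alg(f, data, k) is any centralized algorithm (e.g. the greedy sketch in
    Appendix H); R-DASH uses LAG in this role.
    """
    rng = random.Random(seed)

    # MR round 1: send each element to one machine chosen uniformly at random,
    # then run alg on each machine's local data.
    parts = [[] for _ in range(num_machines)]
    for e in ground_set:
        parts[rng.randrange(num_machines)].append(e)
    local_solutions = [alg(f, part, k) for part in parts]

    # MR round 2: the primary machine pools the returned solutions, runs alg
    # once more on the pooled data, and keeps the best candidate solution.
    pooled = list(set().union(*local_solutions))
    candidates = local_solutions + [alg(f, pooled, k)]
    return max(candidates, key=lambda S: f(set(S)))

# Usage with the greedy sketch and coverage objective from Appendix H:
#   best = two_round_distributed(f, list(sets), greedy, k=2)
```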

Figure 2: Performance comparison of distributed algorithms on ImageSumm, InfluenceMax, RevenueMax, and MaxCover; RandGreeDI (RG) is run with Greedy as the algorithm Alg [Barbosa et al., 2015] to ensure the (1/2)(1 − 1/e) ratio. All Greedy implementations used lazy greedy to improve the runtime. Timeout for each application: 6 hours per algorithm.
partially supported by Texas A&M University. The authors have received no third-party funding in direct support of this work. The authors have no additional revenues from other sources related to this work.

Theorem 9. Let (f, k) be an instance of SM where k log(k) < εψ/ℓ. T-DASH with no knowledge of OPT returns a set T′ in two MR rounds, O(log(n)/ε³) adaptive rounds, O(n log(k)/ε⁴) total queries, and O(n) communication complexity, with probability at least 1 − n^{−c}, such that E[f(T′)] ≥ (3/8 − ε) OPT.
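As a small illustration of the geometric grid of threshold guesses used by T-DASH, a sketch assuming guesses of the form τ_x = (α∆*/k)(1 + ε)^x over the range [α∆*/k, α∆*] (the exact endpoints and indexing are our assumption):

```python
import math

def threshold_guesses(delta_star, k, alpha=3/8, eps=0.1):
    """Geometric grid of O(log(k)/eps) threshold guesses in [alpha*d/k, alpha*d].

    For any target value t in this range there is a guess tau with
    tau <= t <= (1 + eps) * tau, which is what the bin-matching step relies on.
    """
    lo = alpha * delta_star / k
    num_guesses = math.ceil(math.log(k) / math.log(1 + eps)) + 1
    return [lo * (1 + eps) ** x for x in range(num_guesses)]

taus = threshold_guesses(delta_star=10.0, k=100, eps=0.1)
print(len(taus), taus[0], taus[-1])  # roughly alpha*Delta*/k up to alpha*Delta*
```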

Table 2: Environment Setup.

Table 3: Small and Large Data.
E Threshold-DASH with no Knowledge of OPT

Lemma 7. For any o ∈ O, it holds that Pr(o ∈ O_1 ∪ O_2) ≥ 3/4.
Proof. By the definition of p_x in Section 5, it holds that Pr(o ∈ O_1) = p_o. Since o is assigned to each machine randomly with probability 1/ℓ, Pr(o ∈ O