Learning on Bandwidth Constrained Multi-Source Data with MIMO-inspired DPP MAP Inference

This paper proposes a distributed version of Determinantal Point Process (DPP) inference to enhance multi-source data diversification under limited communication bandwidth. DPP is a popular probabilistic approach that improves data diversity by enforcing repulsion among the elements of selected subsets. The well-studied Maximum A Posteriori (MAP) inference for DPP aims to identify the subset with the highest diversity as quantified by DPP. However, this approach presumes that all data samples are available at a single location, which hinders its applicability to real-world settings such as traffic datasets, where samples are distributed across sources and communication between them is band-limited. Inspired by techniques used in Multiple-Input Multiple-Output (MIMO) communication systems, we propose a strategy for performing MAP inference among distributed sources. Specifically, we show that a lower bound of the diversity-maximized distributed sample selection problem can be treated as a power allocation problem in MIMO systems. A determinant-preserving sparse representation of the selected samples is used to precode samples at the local sources before they are processed by DPP. Our method requires no raw-data exchange among sources; it uses only a band-limited feedback channel to send lightweight diversity measures, analogous to the CSI message in MIMO systems, from the center to the data sources. Experiments show that our scalable approach outperforms baseline methods, including random selection, uninformed individual DPP with no feedback, and DPP with SVD-based feedback, in both i.i.d. and non-i.i.d. setups. Specifically, it achieves a 1 to 6 log-difference diversity gain in the latent representations of the CIFAR-10, CIFAR-100, StanfordCars, and GTSRB datasets.


Introduction
Determinantal Point Process (DPP) is a well-known probabilistic approach for generating diverse sets of data points. DPP stands out among other point processes, such as the Poisson point process, for its unique property of being determined solely by the correlation among elements. It assigns higher probability to sets of points with low mutual similarity, making it a valuable tool for tasks such as dimensionality reduction and representative sample selection from large datasets [1]. Additionally, by leveraging properties of linear algebra, DPP can sample subsets from a given dataset efficiently [2][3][4]. DPP is used in a wide range of applications, such as recommender systems [5], document summarization [6], image processing [7], and topological analysis of wireless networks [8]. It also has inherent connections to other principles, such as randomized numerical linear algebra [9] and information theory [10].
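As a toy illustration of this repulsion property (a minimal numpy sketch, not from the paper): the determinant of the Gram matrix of a near-duplicate pair of feature vectors is close to zero, while an orthogonal pair attains the maximum score.

```python
import numpy as np

# Two near-duplicate unit-norm feature vectors vs. two orthogonal ones.
similar = np.array([[1.0, 0.0], [0.99, 0.14]])   # almost parallel
diverse = np.array([[1.0, 0.0], [0.0, 1.0]])     # orthogonal

def dpp_score(Z):
    """Unnormalized DPP probability det(L_A) with the Gram kernel L = Z Z^T."""
    L = Z @ Z.T
    return np.linalg.det(L)

print(dpp_score(similar))  # close to 0: a redundant pair is unlikely
print(dpp_score(diverse))  # 1.0: an orthogonal (maximally diverse) pair is favored
```

Geometrically, det(Z Z^T) is the squared volume spanned by the selected rows, so redundancy collapses the score toward zero.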
In a class of practical applications, such as recommender systems, the goal is to identify and select the most diverse subset that is representative of the entire population. Such tasks are performed by assigning high selection probability to diverse sets using DPP, known as the Maximum A Posteriori (MAP) inference problem. Recent studies have implemented a centralized version of this algorithm, where all samples are available in the same location [5,11,12]. However, in many cases, data samples are generated by different sources at different positions, and cross-source communication is challenging or costly. Similar to Federated Learning, in our scenario the communication bandwidth between the data sources and the processing center is constrained. Although we assume that the selected items can be transmitted losslessly to the center, the transmission budget allows sending only a subset of the entire dataset. Therefore, conventional DPP MAP inference [5], which traverses all data samples, is infeasible. In this paper, we present a scheduling strategy (shown in Fig. 1) that generates transmission candidates locally, without exchanging raw samples among sources and only by transmitting a few sparse extra messages from the center unit back to each local source. Our method is inspired by precoding and channel power allocation in Multiple-Input Multiple-Output (MIMO) systems, which are common in contemporary wireless communication systems, including the Wi-Fi 802.11n family, 4G LTE, and 5G.

Contributions. In summary, we propose a strategy for performing MAP inference on distributed data under limited communication constraints with the following steps. First, we reformulate a lower bound of the diversity as a power allocation problem, with the total number of selected items playing the role of the power constraint. Then, we devise a determinant-preserving approximation based on the Cauchy-Binet formula to address bandwidth-limited transmission. This allows computing the sparse CSI feedback to be sent from the center to each source and used for precoding the samples. The precoded messages are then utilized by each source to select items that maximize global diversity.
2 Background Knowledge

2.1 Determinantal Point Process (DPP)

DPP is a probability measure on all 2^|S| subsets of a ground set S, where |S| denotes the cardinality of S. Suppose a finite dataset is represented by Z with Gram kernel L = ZZ⊤. An ensemble DPP assigns to an arbitrary subset A drawn from the entire set the probability

P(A) ∝ det(L_A),

where L_A denotes the submatrix of L with rows and columns indexed by the set A. The MAP inference for K-DPP is formulated as

A* = arg max_{A ⊆ S, |A| = K} det(L_A),

where A denotes the index set of selected samples, the constant K denotes the given fixed cardinality, and K ≤ rank(L) holds to ensure the determinant is greater than 0 [13]. Since MAP inference is an NP-hard problem, the popular solution is a greedy search over the following sub-modular objective,

j = arg max_{i ∈ S \ A} log det(L_{A ∪ {i}}) − log det(L_A),

where j denotes the index selected in each round. The currently fastest greedy search, proposed in [5], is based on the Cholesky decomposition and requires O(|L|³) complexity for initialization and O(K²|L|) to return K items. We denote selection by this method, given the Gram matrix L and the set cardinality K, as A* = MAP-DPP(L, K).
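The greedy MAP inference above can be sketched as follows. This is a simple but inefficient version that recomputes determinants from scratch each round, for clarity only; it is not the fast Cholesky-based implementation of [5].

```python
import numpy as np

def map_dpp_greedy(L, K):
    """Naive greedy MAP inference for K-DPP: in each round, add the index j
    maximizing log det(L_{A ∪ {j}}) - log det(L_A). A readable sketch; the
    O(K^2 |L|) Cholesky-based method of [5] is preferable in practice."""
    n = L.shape[0]
    A = []
    for _ in range(K):
        base = np.linalg.slogdet(L[np.ix_(A, A)])[1] if A else 0.0
        best_j, best_gain = None, -np.inf
        for j in range(n):
            if j in A:
                continue
            idx = A + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = (logdet if sign > 0 else -np.inf) - base
            if gain > best_gain:
                best_gain, best_j = gain, j
        A.append(best_j)
    return A

# Toy check: with two duplicate rows, greedy never picks both.
Z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
L = Z @ Z.T
print(map_dpp_greedy(L, 2))  # -> [0, 2]: one copy of the duplicate plus the orthogonal row
```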

Multiple-Input Multiple-Output (MIMO) systems
Before delving into the design of our selection scheduling for DPP inference on distributed data, let us review the fundamental principles of MIMO systems, whose techniques serve as valuable inspiration for our proposed approach. In a MIMO system, a signal vector s is transmitted over M transmit antennas and received by N receive antennas. The link between a transmitting antenna TX_i and a receiving antenna RX_j is represented by an element H_{i,j} of the channel matrix H ∈ C^{M×N}. The channels are influenced by factors such as multi-path fading and interference, causing different link conditions. Mathematically, a MIMO system can be presented as r = Hs + n, where n ∈ C^N is additive noise that follows a complex Gaussian distribution CN(0, σ²I) with zero mean and covariance σ²I. When the channels are highly correlated and the rank of H is low, the equations for recovering s from r become under-determined. To mitigate the interference between different antennas at the receiver, precoding is employed to orthogonalize data between channels by utilizing channel state information (CSI).
We assume M = N and let Q = E[ss†] denote the covariance matrix of s, where s† denotes the conjugate transpose of s. The inequality trace(Q) ≤ ρ always holds, preserving the overall power constraint. The capacity of the system measures the maximum amount of information that can be transferred with arbitrarily small error. According to [14], the capacity is

C = max_{Q: trace(Q) ≤ ρ} log det(I + (1/σ²) HQH†).   (2)

The SVD of H is denoted by USV† = svd(H), and λ_i represents the i-th singular value of H, which corresponds to the i-th diagonal element of S. Consequently, the optimal solution of Q can be expressed as VPV†, where P is a diagonal matrix with elements p_i. The problem in Eq. (2) can then be reformulated as

C = max_{p: Σ_i p_i ≤ ρ, p_i ≥ 0} Σ_i log(1 + λ_i² p_i / σ²).   (3)

Note that each term log(1 + λ_i² p_i / σ²) is concave and represents the capacity of a single-input single-output (SISO) channel. The original problem is transformed by step (a) in Eq. (3) and solved by standard optimization algorithms, resulting in a water-filling solution [15]. Additionally, an SVD-based precoding scheme can be applied to precode the transmitted signal as V P^{1/2} s, which helps orthogonalize the data across channels.
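The water-filling solution of Eq. (3) can be sketched as follows. Here `gains` stands for the squared singular values λ_i², and bisection on the water level is one standard way to satisfy the power constraint; the specific numbers are illustrative.

```python
import numpy as np

def water_filling(gains, power, sigma2=1.0):
    """Classic water-filling: maximize sum_i log(1 + p_i * g_i / sigma2)
    subject to sum_i p_i <= power and p_i >= 0, where g_i = lambda_i^2.
    Solved by bisection on the water level mu, with p_i = (mu - sigma2/g_i)^+."""
    g = np.asarray(gains, dtype=float)
    lo, hi = 0.0, power + sigma2 / g.min() + 1.0  # bracket for the water level
    for _ in range(100):
        mu = 0.5 * (lo + hi)
        p = np.maximum(mu - sigma2 / g, 0.0)
        if p.sum() > power:   # too much water: lower the level
            hi = mu
        else:
            lo = mu
    return p

p = water_filling([4.0, 1.0, 0.25], power=3.0)
# Stronger channels (larger singular values) receive more power;
# the weakest channel here receives none.
```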

Problem Formulation
Suppose there are multiple data sources with disjoint index sets s_1, s_2, ..., s_i, ..., s_N, and let S = s_1 ∪ s_2 ∪ ... ∪ s_N represent the indices of the entire set. The total number of samples is n = Σ_i n_i, where n_i = |s_i| denotes the cardinality of the set s_i. Also, let Z ∈ R^{n×m} be the data matrix of the entire set, where each sample z_i has dimension m. Recall that to maximize diversity, we need to solve the following problem to select a subset A:

A* = arg max_{A ⊆ S, |A| = K} log det(L_A), with L = ZZ⊤,

where, again, L_A denotes the rows and columns of L indexed by A. Likewise, X_{A,B} denotes a submatrix of X with rows and columns indexed by A and B, respectively; if A = B, it is denoted X_A. Conventional MAP methods require transmitting all samples to the center to solve this optimization problem, which is impractical or costly when data is distributed across different sources; constructing the full L at the center is therefore not realistic. However, L_{s_1}, ..., L_{s_N} can easily be obtained at the different sources, where L_{s_i} = Z_{s_i} Z⊤_{s_i}. Hence, the MAP inference can be broken down into subproblems to be solved locally. Here, (L_{s_i})_{a_i} denotes the matrix L_{s_i} indexed by a_i. The problem now is to collectively select an a_i for each source i so as to maximize the determinant of the jointly selected set.

Methodology

MIMO-like DPP Power Allocation
Similar to [10,16], since L_A is always a positive-definite Hermitian matrix, we can use the approximation det((1/ϵ)L_A) ≈ det((1/ϵ)L_A + I) for a very small ϵ, and rewrite the optimization problem accordingly. The validity of step (a) in Eq. (6) can be established through the SVD decomposition, and when N ≪ 1/ϵ holds, the objective can be approximated as maximizing L = log det((1/ϵ)L_A + I). By the concavity of f(A) = log det A for positive definite Hermitian matrices [13], we have log det(αA + (1 − α)B) ≥ α log det A + (1 − α) log det B for α ∈ (0, 1), and the lower bound of Eq. (6) follows, with g(a_i) := log det((1/ϵ)(L_{s_i})_{a_i} + I) denoting the local diversity term of source i; the proof is provided in Appendix A. We can thus maximize L by maximizing its lower bound L_lower = Σ_i g(a_i), and in each source we select a_i by performing DPP locally, which approximately maximizes g(a_i) in the greedy search. The formulation of the lower bound in Eq. (7) can be seen as a modified version of the channel power allocation problem of MIMO systems in Eq. (3), where g(a_i) represents the capacity of one "DPP channel". Figures 2(b)(c) exhibit exemplary lower-bound surfaces for two and three sources. The total power constraint can be viewed as Σ_i |a_i| = K, and Lemma 1, derived from Proposition 1, states that L_lower is pseudo-concave with respect to the allocation (|a_1|, ..., |a_N|). While water-filling can be used to solve this problem (see Appendix B), it becomes computationally expensive when the number of sources is large. Therefore, we directly allocate identical power to each source, ensuring that every source can send an equal number of samples; this choice is justified by the pseudo-concavity of L_lower stated in Lemma 1. We present the results of an ablation analysis, namely Random Wt. in our experiments, to highlight the advantage of our approach over random power allocation.
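The lower bound above rests on the concavity of log det over positive definite matrices. This inequality can be checked numerically; the matrices below are random stand-ins, not the paper's kernels.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spd(m):
    """A well-conditioned random symmetric positive definite matrix."""
    A = rng.normal(size=(m, m))
    return A @ A.T + m * np.eye(m)

# Concavity of log det on SPD matrices:
# log det(a*A + (1-a)*B) >= a*log det(A) + (1-a)*log det(B).
A, B = random_spd(5), random_spd(5)
for a in (0.1, 0.5, 0.9):
    lhs = np.linalg.slogdet(a * A + (1 - a) * B)[1]
    rhs = a * np.linalg.slogdet(A)[1] + (1 - a) * np.linalg.slogdet(B)[1]
    assert lhs >= rhs - 1e-9  # the inequality holds for every mixing weight
```

Applied with α = 1/N across the N sources, this is exactly what lets the global log det objective be bounded below by a sum of per-source terms.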

MIMO-like CSI for Local DPP Precoding
Similar to MIMO systems, we aim here to alleviate the interference between different sources. Ideally, to maximize global diversity, each source should perform its selection process individually, regardless of the selections of other sources, which is analogous to data orthogonalization in MIMO. Decoupling the correlations of the sources helps minimize the gap between the approximated and original problems. We can achieve this by precoding the samples in each source as Z̃_{a_i} = Z_{a_i} W_i. However, learning W_i by accessing all data samples in all sources may not be feasible under the limited communication budget. Instead, we propose a sparse diversity measurement of the selected samples that serves as the precoding matrix to guide the selection process. Recall that A represents the collection of all selected samples after the selection process completes. We denote the samples selected by some moment as B ⊆ A and define Y_i as the index set of samples selected by this moment that are not from source s_i, i.e., Y_i = B \ a_i. To avoid losing information on samples selected from the current local source (there is information loss during the compression of the diversity measurement at the center), we use selection with replacement; this ensures that selected samples are not dropped at the source after sending. We can then rewrite the approximation of the problem in Eq. (6) conditioned on Y as in Eq. (8), where equality (a) holds when A = a_i ∪ Y_i = B and (b) defines the conditional problem. The lower bound of L(· | Y) can be obtained as in Eq. (9), where equality (a) again holds when A = a_i ∪ Y_i = B. According to Eqs. (7), (8), and (9), L_lower(· | Y) is a tighter bound than L_lower, the lower bound of L previously developed in Eq. (7). In our situation, B is based only on the samples at the center; Y_t is formed after receiving the data samples of the first t intervals. According to the right side of (a) in Eq. (9), we should maximize log det((1/ϵ) Z_{a_i ∪ Y_i} Z⊤_{a_i ∪ Y_i} + I) individually in each source to maximize L_lower(· | Y). Fortunately, this problem rolls back to a MAP inference, and by Schur's complement it can be rewritten in terms of a matrix H_i. Without loss of information, we can send H_i (which can be viewed as the CSI) from the center to each source and precode Z_{a_i} as Z̃_{a_i} = Z_{a_i} H_i^{1/2}, where H_i is defined through this Schur complement. To further accommodate the band-limited communication requirements, we seek to send a sparse representation of H_i rather than the entire H_i. To this end, we use the Cauchy-Binet formula to rewrite the optimization problem as a sum over subsets of [m], where [m] denotes the set {1, ..., m}. After applying the Cauchy-Binet formula and the Cauchy-Schwarz inequality, the term det(Z_{a_i} H_i Z⊤_{a_i}) can be bounded as in Eq. (13), where Ĥ_i denotes the approximation of H_i by its sparse representation. Substituting Eq. (13) into Eq. (12) preserves the determinant up to this bound. If a sub-matrix (H_i)_C is allowed to be transmitted with a constrained cardinality r_0 of C, then to minimize the gap between the left and right terms of (a) in Eq. (14), we can immediately obtain a suboptimal solution by DPP greedy search as C* = MAP-DPP(H_i, r_0), since the greedy search must include at least the first r_0 − k_i subsets with the largest det((H_i)_{S_1}). Here, C* denotes the index set of the selected representative dimensions. In the experiment section, we show that this selection scheme outperforms both selecting representative dimensions by DPP sampling with kernel H_i and random selection.
We define the tolerable sparsity Rm as the number of elements that can be losslessly transmitted from the center to each source. Transmitting the symmetric matrix (H_i)_{C*} requires (r_0² + r_0)/2 elements, corresponding to the lower-triangular part of (H_i)_{C*}. Furthermore, when additional sparsity can be exploited for compression, we can compress the residual matrix over the complement set C̄* = [m] \ C* through singular value decomposition (SVD). By keeping only the first r_1 singular vectors and values, the compression requires a sparsity of only r_1 m. Hence, the constraint on the tolerable sparsity Rm can be expressed as (r_0² + r_0)/2 + r_1 m ≤ Rm. The data samples in each source can then be precoded as Z̃_{a_i} = Z_{a_i} Ĥ_i^{1/2}. In fact, since the CSI information is not completely reliable, we precode the data samples conservatively in a momentum-like fashion, Z̃_{a_i} = Z_{a_i} W_i = Z_{a_i}(I + Ĥ_i^{1/2}). A summary of our approach is shown in Algorithm 1.
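The sparse-CSI pipeline of this section can be sketched end to end as follows. The matrix `H` is a random PSD stand-in for the true CSI H_i, and the simple greedy determinant selection stands in for MAP-DPP(H_i, r_0); the residual SVD step is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
m, r0 = 8, 3

# H: an m x m PSD stand-in for the CSI matrix H_i computed at the center.
A = rng.normal(size=(m, m))
H = A @ A.T

# Pick r0 representative dimensions C* by a greedy log det heuristic
# (standing in for C* = MAP-DPP(H_i, r0) in the paper).
C = []
for _ in range(r0):
    gains = []
    for j in range(m):
        if j in C:
            gains.append(-np.inf)
            continue
        idx = C + [j]
        gains.append(np.linalg.slogdet(H[np.ix_(idx, idx)])[1])
    C.append(int(np.argmax(gains)))

# Sparse CSI payload: only the (r0^2 + r0)/2 lower-triangular entries of (H)_{C*}.
H_sub = H[np.ix_(C, C)]
payload = H_sub[np.tril_indices(r0)]

# Source side: reconstruct H_hat (zero outside C*) and apply the
# momentum-style precoding Z_tilde = Z (I + H_hat^{1/2}).
H_hat = np.zeros((m, m))
H_hat[np.ix_(C, C)] = H_sub
w, U = np.linalg.eigh(H_hat)
H_half = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T  # PSD square root
Z = rng.normal(size=(5, m))
Z_tilde = Z @ (np.eye(m) + H_half)
```

The payload size matches the (r_0² + r_0)/2 count in the text, and the eigendecomposition gives the symmetric PSD square root needed for Ĥ_i^{1/2}.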

Baselines
In our experiments, we use the exact greedy search proposed in [5] across all samples as the Ground Truth, similar to [11]. We consider multiple baselines, including Maximum Diversity Source, Random Selection, Uniform w/o Precoding, Ours+SVD, Ours+Sampling Sketch, Ours+Random Sketch, Random Wt.+MAP Sketch, Random Wt.+SVD, and Random Wt.+Sampling Sketch. Maximum Diversity Source performs the exact greedy search in the single source with the largest RD-based diversity, measured as log det(I + (m/(|s_i|ϵ)) Z⊤_{s_i} Z_{s_i}) [10]. Random Selection randomly selects samples from each source to send to the center (with the same total number as the other methods). Uniform w/o Precoding assigns the same power (i.e., the same number of samples to be sent) to each source, but without receiving feedback from the center for precoding the local data. We then consider the following setups for an ablation analysis of the different components of our proposed method; all of them are equally constrained by the communication bandwidth (i.e., the tolerable sparsity) and the total power (i.e., the total number of samples to send). Ours+SVD uses uniform power allocation but replaces the proposed way of compressing H_i with SVD-based compression. We consider this method because SVD is the optimal solution for preserving the Frobenius norm at a given tolerable rank; notably, if we set r_1/R = 1, our approach becomes equivalent to this approach. Ours+Sampling Sketch uses uniform power allocation but generates C in Eq. (15) by exact K-DPP sampling (implemented by [17]). Ours+Random Sketch uses uniform power allocation but generates C by randomly sampling from [m]; as before, [m] = {1, ..., m} represents the index set of dimensions. Random Wt.+MAP Sketch allocates random power to each source but compresses H_i consistently with our proposed approach in Eq. (15). Random Wt.+SVD uses random power allocation and compresses H_i by SVD. Random Wt.+Sampling Sketch uses random power allocation but generates C by exact K-DPP sampling.

Dataset and Experiment Setup
For the sake of completeness and to avoid dataset bias, the experiments are conducted on four popular datasets: CIFAR10 [18], StanfordCars [19], CIFAR100 [18], and GTSRB [20]. Image datasets were preferred for the following reasons: i) they have relatively high raw data dimensions, which enables DPP to choose subsets with larger numbers of samples; ii) the dimensions can be controlled using a pre-trained feature extractor without requiring any high-complexity matrix factorization (often required when preparing data for recommender systems); and iii) image datasets are compatible with various potential applications such as drone-based aerial monitoring and AI-based traffic monitoring [21]. As a proof-of-concept experiment, we used a pre-trained ResNet-50 to extract the latent features of the images and set m = 300. We implement both non-i.i.d. and i.i.d. scenarios for robust evaluation and comparison of our proposed methods. We employed the following configurations to replicate a non-i.i.d. distribution across sources. For the CIFAR10 dataset, we used 500 samples, with each source containing one non-overlapping class. For the StanfordCars dataset with N = 5 sources, each source consisted of 20 non-overlapping classes; with N = 10 sources, each source contained 19 non-overlapping classes. Conversely, to simulate an i.i.d. distribution among sources, we conducted experiments on the CIFAR100 and GTSRB datasets, with 500 samples allocated for each experiment.
We set the center to receive K = 40 samples in each interval for 5 consecutive intervals, resulting in a total of 40 × 5 = 200 samples. The tolerable sparsity level is set to R = K/2 = 20; note that R ≥ K leads to a trivial scenario in which all originally received samples could be sent back to each source. We set r_1/R = 0.2 for the CIFAR-10 and CIFAR-100 datasets and r_1/R = 0.8 for the StanfordCars and GTSRB datasets. All experiments are repeated about 100 times, and the average value and standard deviation (as the error bar) are reported.

Results
To provide a comprehensive overview, we present the results of our experiments for a general setting with N = 10, as illustrated in Table 1; the corresponding results for N = 5 are included in Appendix D. The proposed approach is highlighted in yellow in the table. The original DPP diversity of a selected subset A is defined as det(Z_A Z⊤_A), where a higher value represents a higher level of diversity. For ease of comparison, performance is presented as the log-difference of the diversity drop with respect to the ground truth: if the ground truth is A* and the subset inferred by an approach is A, the log-difference is defined as log det(Z_{A*} Z⊤_{A*}) − log det(Z_A Z⊤_A). At first glance, we observe that focusing on only one diverse source does not reach reasonable performance; it is even worse than random selection over distributed sources. Our approach outperforms all baselines and alternative methods in most experimental conditions, including different numbers of sources, time intervals, and data distribution assumptions. First, our approach beats approaches with random power allocation (i.e., methods with Random Wt.) by a considerable margin. For example, when T = 5, approaches utilizing random power allocation consistently exhibit a 6 to 40 decline in performance across all datasets compared to our approach. Additionally, approaches utilizing Random Wt. frequently exhibit an undesirable level of performance variance: when T = 5, our proposed approach has a standard deviation of only 2.77, whereas approaches employing Random Wt. display standard deviations spanning from 9 to 85 on the CIFAR10 dataset. Comparable outcomes are observed across all other datasets. This observation confirms our claim in Section 4.1. The results also exhibit the important role of precoding: after the 5th interval, our approach obtains a gain ranging from 1 to 6 over Uniform w/o Precoding for all datasets. Fig. 3 reports the real gain, defined as det(Z_A Z⊤_A)/det(Z_{A_bs} Z⊤_{A_bs}), where A_bs is selected by Uniform w/o Precoding. It demonstrates that the gain of approaches with precoding mostly increases over consecutive intervals (T going from 1 to 5).
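The log-difference metric used throughout the results can be computed as a small numpy sketch; `Z` and the index sets below are placeholders, not the paper's data.

```python
import numpy as np

def log_diversity(Z, idx):
    """log det(Z_A Z_A^T) for a selected index set A."""
    ZA = Z[list(idx)]
    return np.linalg.slogdet(ZA @ ZA.T)[1]

def log_difference(Z, idx_gt, idx_method):
    """Diversity drop of a method's selection w.r.t. the ground truth
    (lower is better; 0 means the method matched the ground truth)."""
    return log_diversity(Z, idx_gt) - log_diversity(Z, idx_method)

# Toy example: the orthogonal triple is the ground truth; swapping in a
# nearly redundant row incurs a positive diversity drop.
Z = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.9, 0.1, 0.1]])
print(log_difference(Z, [0, 1, 2], [0, 1, 3]))  # positive drop
```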
Table 1: Comparison of performance when N = 10. The numerical result is reported as the log-difference of the diversity drop w.r.t. the ground truth (↓). Our approach achieves at least a 3 log-difference gain over all other precoding schemes. Moreover, for baselines with other precoding schemes, performance may be even worse than without precoding; for example, when T = 5 and N = 10 on CIFAR10, Ours+SVD has a 2 log-difference decay.

Ablation Analysis
We have already discussed the ablation analysis of the different components of our approach in Section 5.3. Here, we perform an ablation analysis for r_1/R ∈ {0, 0.2, 0.4, 0.6, 0.8, 1}, where r_1/R = 1 means the compression of H_i is based only on SVD (i.e., the approaches with +SVD). The results in Fig. 4 show that SVD is often the worst case for compression, while the other cases (i.e., r_1/R ≠ 1) show only reasonable performance fluctuation.

Discussion
Potential Applications. This work has potential applications in learning-based systems with limited networking resources, such as AI-based UAV aerial monitoring and RSU-based traffic monitoring, where a data center pools data from multiple sources to build a model. For example, when pooling traffic video from roadside cameras for traffic safety analysis under limited bandwidth, only a few sources with preferably diverse information should be selected. Additionally, our approach can be used in the coreset construction of replay-based continual learning, where data from old tasks can be treated as distributed sources; representative data from these sources can then facilitate learning for upcoming tasks.
Limitations. One limitation of the presented work comes from the low-rank problem of DPP, which constrains the number of selections K by K ≤ rank(L) ≈ min(n, m). While our proposed design aims to select sufficient samples from distributed sources to serve downstream tasks, it may be constrained by this inherent limitation of DPP. Extending the dimensions of the data with techniques such as using different feature extractors is possible, but it could introduce extra overhead. As a direction for future work, we suggest designing a kernel that satisfies various downstream tasks.

Conclusion
DPP is a principled method to enhance data diversity for learning-based systems. However, it requires access to the entire dataset in one place, which limits its applicability to diverse sources in real-world applications. To address this key challenge, we implemented a DPP MAP inference for distributed data from multiple sources as a universal diversity-maximizing data-sharing strategy that requires only a lightweight feedback channel from the center to the sources, with no cross-source communication requirement. To this end, a novel scheduling policy inspired by MIMO systems is proposed. Specifically, we demonstrated that a lower bound of the original diversity maximization problem, which maximizes global diversity, can be transformed into the power allocation problem of MIMO. Additionally, approximating the lower bound of the original problem can be treated as receiving CSI and precoding. Under communication bandwidth constraints, we derived a sparse CSI representation that preserves the determinant via the Cauchy-Binet formula. Our experiments demonstrate that our scalable approach compares favorably with the baselines under various data distribution conditions, achieving a 1 to 6 log-difference gain over the approach without precoding for all datasets when N = 10. We expect our approach can substantially influence the design of future AI-based networking platforms, which require efficient processing of large-scale data from distributed sources.

A Proof of Proposition 1
This proposition follows from submodularity. For the greedy search for the original k-DPP, please refer to [5]. Here, we only illustrate the solution to max_A det(L_A + I) in a similar way. Suppose the selection set is denoted as Λ* and the entire set as Z. Likewise, L_{Λ*} denotes the Gram matrix of the samples indexed by Λ*. In each step, we greedily search for a new sample j, add it to Λ*, and evaluate the gradient of g(Λ*). Since L_{Λ*} + I is always positive definite, we can apply the Cholesky decomposition. Note that V, c_i, and d_i here may differ from those in [5]. By applying the Schur complement, we still obtain the same conclusion as [5]. We can then obtain the candidate j+ in this round, and we have g′(Λ* ∪ {j}) = d²_{j+} ≥ d′²_i for i ∈ Z \ (Λ* ∪ {j}). From Eq. (25), d′²_i ≤ d²_i, so we can easily obtain the desired inequality, and the proposition is proved.

B Water-filling(WF)-like Optimization
The problem in Eq. (9) can be stated as maximizing Σ_i g_i(k_i) subject to Σ_i k_i = K and k_i ≥ 0, where we write g_i(k_i) := g(a_i) with k_i = |a_i| (because of Proposition 1). For simplicity, we treat g_i(k_i) as a continuous, smooth, concave function.

We introduce Lagrange multipliers λ ⪰ 0 ∈ R^N and ν ∈ R, and the Lagrangian F is presented as

F(k, λ, ν) = −Σ_i g_i(k_i) − λ⊤k + ν(1⊤k − K).

The KKT conditions give g′_i(k_i) = ν − λ_i together with λ_i k_i = 0 and λ_i ≥ 0. Simplifying, g′_i(k⋆_i) = ν whenever k⋆_i > 0. Writing g′_i(k_i) := ∂g_i(k_i)/∂k_i, we have k⋆_i = (g′_i)^{−1}(ν) when this value is positive, and k⋆_i = 0 otherwise. Now k⋆_i is a function of ν, and since Σ_i k⋆_i = K, we can search for each k⋆_i by adjusting ν. This is a modified version of water-filling, since k⋆_i is no longer a linear function of the water level. Fortunately, the inverse function of g′_i(k_i), denoted g′⁻¹_i(c), can be obtained quickly, so we can still rapidly search for a sub-optimal solution to this problem.
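The bisection over ν can be sketched as follows. The concave g_i used here, g_i(k) = w_i log(1 + k), is an illustrative assumption (not the paper's g); it has the closed-form inverse derivative g′⁻¹_i(c) = w_i/c − 1 that the appendix relies on.

```python
import numpy as np

def allocate(weights, K):
    """Modified water-filling from Appendix B: bisect on nu so that
    sum_i k_i = K with k_i = (g_i'^{-1}(nu))^+ . Assumes the illustrative
    g_i(k) = w_i * log(1 + k), hence g_i'^{-1}(c) = w_i / c - 1."""
    w = np.asarray(weights, dtype=float)
    lo, hi = 1e-9, w.max()          # nu in (0, max_i g_i'(0)] brackets the root
    for _ in range(200):
        nu = 0.5 * (lo + hi)
        k = np.maximum(w / nu - 1.0, 0.0)
        if k.sum() > K:             # allocation too large: raise the level nu
            lo = nu
        else:
            hi = nu
    return k

k = allocate([3.0, 2.0, 1.0], K=6.0)
# Sources whose diversity g_i grows faster receive larger allocations k_i.
```

For these weights the exact solution is k = (3.5, 2.0, 0.5) at ν = 2/3, which the bisection recovers to numerical precision.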

Figure 1 :
Figure 1: The schematic diagram of the proposed method. The highlighted color indicates that the component is active. (a) The center computes and sends sparse CSI to every source. (b) Intra-source precoding. (c) Sources send data to the center after optimal power allocation.

Figure 2 :
Figure 2: An example of selection by the DPP greedy search. (a) The gradient of g(Λ*), defined as g(Λ*_{j+1}) − g(Λ*_j), with respect to k = |Λ*_j|. (b) The illustration of the lower bound for N = 2. (c) The illustration of the lower bound for N = 3.

Lemma 1. Suppose k = [k_1, k_2, ..., k_N]⊤, where k_i = |a_i| denotes the power allocation of source i, 0 ≤ k_i, and 1⊤k = K. Then L_lower(k) = Σ_i g(a_i) is pseudo-concave with respect to k.

Algorithm 1 :
DDPP: MAP Inference for Distributed Data Sources. Input: Source data Z_{s_1}, ..., Z_{s_N}; center information Y = {Y_1, ..., Y_N}; the number of items selected in each interval, K; sparsity parameters r_0, r_1. Output: The index set of selection A. The center computes the information Y = {Y_1, ..., Y_N} (see the definition in the text above Eq. (8)); then, for each source i in 1 : N, precoding and local selection are performed in parallel.

Figure 4 :
Figure 4: The ablation analysis for different r_1/R. The case r_1/R = 1 is labeled 'SVD'. The y-axis is the log-difference of the diversity drop w.r.t. the ground truth (↓). (a) CIFAR10, (b) StanfordCars, (c) CIFAR100, and (d) GTSRB.

5.5 Complexity Analysis

Computing the CSI information H_i at the center requires only O(m³) complexity, including both the SVD decomposition and the DPP MAP inference. Following the data acquisition request from the center, every source requires O(n_i m² + n_i² m) ≈ O(m n_i max(n_i, m)) complexity for computing the precoded Gram matrix in each round. Subsequently, a local DPP inference by greedy search requires O(n_i³) for initialization and O(n_i (K/N)²) to return the K/N items. Generally, we have n_i > m; thus, the overall computational complexity for each source in each interval is O(n_i³). Importantly, each source can conduct precoding and local DPP inference concurrently, resulting in the high scalability of our approach.

det(L_{Λ* ∪ {i}} + I) = det(VV⊤) · d²_i = det(L_{Λ*} + I) · d²_i.

Since the greedy search selects j, we can say d²_j ≥ d²_i for i ∈ Z \ Λ*. Note that here d²_j = g(Λ* ∪ {j}) − g(Λ*) is the "gradient" of the function g(Λ*). Then, in the next round, c′_i and d′²_i can be updated incrementally via

[V 0; c⊤_j d_j] c′_i = L_{Λ* ∪ {j}, i} = [L_{Λ*, i}; L_{j,i}].   (23)