Hardness and Approximability of Dimension Reduction on the Probability Simplex

Dimension reduction is a technique used to transform data from a high-dimensional space into a lower-dimensional space, aiming to retain as much of the original information as possible. This approach is crucial in many disciplines, such as engineering, biology, astronomy, and economics. In this paper, we consider the following instance of dimensionality reduction: given an n-dimensional probability distribution p and an integer m < n, find the m-dimensional probability distribution q that is closest to p, using the Kullback-Leibler divergence as the measure of closeness. We prove that the problem is strongly NP-hard, and we present an approximation algorithm that returns a solution whose distance from p exceeds the minimum possible distance by at most 1.


Introduction
Dimension reduction [1,2] is a methodology for mapping data from a high-dimensional space to a lower-dimensional space, while approximately preserving the original information content. This process is essential in fields such as engineering, biology, astronomy, and economics, where large datasets with high-dimensional points are common.
It is often the case that the computational complexity of the algorithms employed to extract relevant information from these datasets depends on the dimension of the space where the points lie. Therefore, it is important to find a representation of the data in a lower-dimensional space that still (approximately) preserves the information content of the original data, as per given criteria.
A special case of the general issue illustrated above arises when the elements of the dataset are n-dimensional probability distributions, and the problem is to approximate them by lower-dimensional ones. This question has been extensively studied in different contexts. In [3,4], the authors address the problem of dimensionality reduction on sets of probability distributions with the aim of preserving specific properties, such as pairwise distances. In [5], Gokhale considers the problem of finding the distribution that minimizes, subject to a set of linear constraints on the probabilities, the "discrimination information" with respect to a given probability distribution. Similarly, in [6], Globerson et al. address the dimensionality reduction problem by introducing a nonlinear method aimed at minimizing the loss of mutual information from the original data. In [7], Lewis explores dimensionality reduction for reducing storage requirements and proposes an approximation method based on the maximum entropy criterion. Likewise, in [8], Adler et al. apply dimensionality reduction to storage applications, focusing on the efficient representation of large-alphabet probability distributions. More closely related to the dimensionality reduction that we deal with in this paper are the works [9–12]. In [10,11], the authors address task scheduling problems where the objective is to allocate the tasks of a project in a way that maximizes the likelihood of completing the project by the deadline. They formalize the problem in terms of random variable approximation, using the Kolmogorov distance as the measure of distance, and present an optimal algorithm for the problem. In contrast, in [12], Vidyasagar defines a metric distance between probability distributions on two distinct finite sets of possibly different cardinalities, based on the Minimum Entropy Coupling (MEC) problem. Informally, in the MEC, given two probability distributions p and q, one seeks to find a joint distribution ϕ that has p and q as marginal distributions and also has minimum entropy. Unfortunately, computing the MEC is NP-hard, as shown in [13]. However, numerous works in the literature present efficient algorithms for computing couplings with entropy within a constant number of bits from the optimal value [14–18]. We note that computing the coupling of a pair of distributions can be seen as essentially the inverse of dimension reduction. Specifically, given two distributions p and q, one constructs a third, larger distribution ϕ, such that p and q are derived from ϕ or, more formally, are aggregations of ϕ. In contrast, the dimension reduction problem addressed in this paper involves starting with a distribution p and creating another, smaller distribution that is derived from p or, more formally, is an aggregation of p.
Moreover, in [12], the author demonstrates that, according to the defined metric, any optimal reduced-order approximation must be an aggregation of the original distribution. Consequently, the author provides an approximation algorithm based on the total variation distance, using an approach similar to the one we will employ in Section 4. Similarly, in [9], Cicalese et al. examine dimensionality reduction using the same distance metric introduced in [12]. They propose a general criterion for approximating p with a shorter vector q, based on concepts from Majorization theory, and provide an approximation approach to solve the problem.
We also mention that analogous problems arise in scenario reduction [19], where the task is to (best) approximate a given discrete distribution by another distribution with fewer atoms, in the compression of probability distributions [20], and elsewhere [21–23]. Moreover, we refer the reader to the survey [24] for further application examples.
In this paper, we study the following instantiation of the general problem described above: given an n-dimensional probability distribution $p = (p_1, \dots, p_n)$ and an integer $m < n$, find the m-dimensional probability distribution $q = (q_1, \dots, q_m)$ that is closest to $p$, where the measure of closeness is the well-known relative entropy [25] (also known as the Kullback-Leibler divergence). In Section 2, we formally state the problem. In Section 3, we prove that the problem is strongly NP-hard, and in Section 4, we provide an approximation algorithm returning a solution whose distance from $p$ is at most 1 plus the minimum possible distance.

Statement of the Problem and Mathematical Preliminaries
Let $P_n = \{(x_1, \dots, x_n) : x_i \ge 0 \text{ for } i = 1, \dots, n, \ \sum_{i=1}^{n} x_i = 1\}$ be the $(n-1)$-dimensional probability simplex. Given two probability distributions $p \in P_n$ and $q \in P_m$, with $m < n$, we say that $q$ is an aggregation of $p$ if each component of $q$ can be expressed as the sum of distinct components of $p$. More formally, $q$ is an aggregation of $p$ if there exists a partition $\Pi = (\Pi_1, \dots, \Pi_m)$ of $\{1, \dots, n\}$ such that

$$q_i = \sum_{j \in \Pi_i} p_j, \quad \text{for each } i = 1, \dots, m. \tag{1}$$

Notice that the aggregation operation corresponds to the following operation on random variables: given a random variable $X$ that takes values in a finite set $\mathcal{X} = \{x_1, \dots, x_n\}$, such that $\Pr\{X = x_i\} = p_i$ for $i = 1, \dots, n$, any function $f : \mathcal{X} \to \mathcal{Y}$, with $\mathcal{Y} = \{y_1, \dots, y_m\}$ and $m < n$, induces a random variable $f(X)$ whose probability distribution $q = (q_1, \dots, q_m)$ is an aggregation of $p$. Dimension reduction of random variables through the application of deterministic functions is a common technique in the area (e.g., [10,12,26]). Additionally, the problem also arises in the area of "hard clustering" [27], where one seeks a deterministic mapping $f$ from data, generated by an r.v. $X$ taking values in a set $\mathcal{X}$, to "labels" in some set $\mathcal{Y}$, where typically $|\mathcal{Y}| \ll |\mathcal{X}|$.
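To make the aggregation operation concrete, here is a minimal Python sketch; the function name `aggregate` and the example partition are ours, chosen purely for illustration.

```python
from fractions import Fraction

def aggregate(p, partition):
    """Aggregate distribution p according to a partition of its index set.

    `partition` is a list of m disjoint index sets covering {0, ..., n-1};
    class i produces the component q_i = sum of the p_j with j in that class.
    """
    return [sum(p[j] for j in block) for block in partition]

# Example: n = 4, m = 2. The function f maps outcomes x_1, x_3 to y_1 and
# outcomes x_2, x_4 to y_2, inducing the aggregation computed below.
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
q = aggregate(p, [{0, 2}, {1, 3}])
print(q)  # components 5/8 and 3/8: a valid aggregation of p
```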
For any probability distribution $p \in P_n$ and an integer $m < n$, let us denote by $A_m(p)$ the set of all $q \in P_m$ that are aggregations of $p$. Throughout the paper, we assume without loss of generality that the components of $p$ are ordered non-increasingly, that is, $p_1 \ge p_2 \ge \dots \ge p_n$. Our goal is to solve the following optimization problem.

Problem 1. Given $p \in P_n$ and $m < n$, find $q^* \in A_m(p)$ such that

$$D(q^* \| p) = \min_{q \in A_m(p)} D(q \| p), \tag{2}$$

where $D(q \| p)$ is the relative entropy [25], given by

$$D(q \| p) = \sum_{i=1}^{m} q_i \log \frac{q_i}{p_i},$$

and the logarithm is of base 2.
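Since $A_m(p)$ is a finite set, Problem 1 can in principle be solved by exhaustive search over all partitions of the index set into $m$ classes. The sketch below is our own and purely illustrative (it runs in time exponential in $n$); it also makes explicit the convention that $D(q\|p)$ compares $q$ against the first $m$ components of $p$.

```python
import math
from itertools import product

def kl(q, p):
    """Relative entropy D(q || p) = sum_i q_i log2(q_i / p_i), where q in P_m
    is compared against the first m components of p."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def brute_force_opt(p, m):
    """Solve Problem 1 exhaustively: try all m^n assignments of the
    components of p to the m classes of a partition."""
    best_q, best_d = None, float("inf")
    for assignment in product(range(m), repeat=len(p)):
        q = [0.0] * m
        for j, cls in enumerate(assignment):
            q[cls] += p[j]
        if all(qi > 0 for qi in q):      # keep only proper m-class partitions
            d = kl(q, p)
            if d < best_d:
                best_q, best_d = q, d
    return best_q, best_d

p = [0.4, 0.3, 0.2, 0.1]                 # ordered non-increasingly
print(brute_force_opt(p, 2))             # optimal aggregation and its distance
```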
An additional motivation to study Problem 1 comes from the fundamental paper [28], in which the principle of minimum relative entropy (called therein the minimum cross entropy principle) is derived in an axiomatic manner. The principle states that, among the distributions $q$ that satisfy given constraints (in our case, that $q \in A_m(p)$), one should choose the one with the least relative entropy "distance" from the prior $p$.
Before establishing the computational complexity of Problem 1, we present a simple lower bound on the optimal value.

Lemma 1. For each $p \in P_n$ and $q \in P_m$, $m < n$, it holds that

$$D(q \| p) \ \ge\ D(lb(p) \| p) \ =\ \log \frac{1}{\sum_{i=1}^{m} p_i}, \tag{3}$$

where

$$lb(p) = \left( \frac{p_1}{\sum_{j=1}^{m} p_j}, \dots, \frac{p_m}{\sum_{j=1}^{m} p_j} \right) \in P_m. \tag{4}$$

Proof. Given an arbitrary $p \in P_n$, one can see that

$$D(lb(p) \| p) = \sum_{i=1}^{m} \frac{p_i}{\sum_{j=1}^{m} p_j} \log \frac{p_i / \sum_{j=1}^{m} p_j}{p_i} = \log \frac{1}{\sum_{j=1}^{m} p_j}.$$

Moreover, for any $p \in P_n$ and $q \in P_m$, the Jensen inequality applied to the log function gives the following:

$$D(q \| p) = -\sum_{i=1}^{m} q_i \log \frac{p_i}{q_i} \ \ge\ -\log \sum_{i=1}^{m} q_i \, \frac{p_i}{q_i} = \log \frac{1}{\sum_{i=1}^{m} p_i}. \qquad \square$$
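Lemma 1 is easy to check numerically. The sketch below (our own naming, reusing `kl` and `brute_force_opt` from the previous snippet) computes the bound of (3) and verifies that the brute-force optimum never falls below it.

```python
def lb(p, m):
    """The lower-bound distribution lb(p) of Equation (4)."""
    s = sum(p[:m])
    return [pi / s for pi in p[:m]]

p, m = [0.4, 0.3, 0.2, 0.1], 2
bound = math.log2(1 / sum(p[:m]))     # D(lb(p) || p) = log(1/0.7), about 0.515
_, opt = brute_force_opt(p, m)
assert opt >= bound - 1e-12           # Lemma 1: OPT is never below the bound
print(bound, opt)
```

Note that $lb(p)$ itself need not be an aggregation of $p$, which is why (3) is, in general, only a lower bound.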

Hardness
In this section, we prove that the optimization problem (2) described in Section 2 is strongly NP-hard. We accomplish this by a reduction from the 3-PARTITION problem, a well-known strongly NP-hard problem [29], described as follows.
3-PARTITION: Given a multiset $S = \{a_1, \dots, a_n\}$ of $n = 3m$ positive integers for which $\sum_{i=1}^{n} a_i = mT$, for some integer $T$, and such that $T/4 < a_i < T/2$ for each $i$ (the problem remains strongly NP-hard under this restriction [29]), the problem is to decide whether $S$ can be partitioned into $m$ triplets such that the sum of each triplet is exactly $T$. More formally, the problem is to decide whether there exist $S_1, \dots, S_m \subseteq S$ such that the following conditions hold:

1. $|S_j| = 3$ for each $j \in \{1, \dots, m\}$;
2. $S_j \cap S_k = \emptyset$ for all $j \ne k$, and $\bigcup_{j=1}^{m} S_j = S$;
3. $\sum_{a_i \in S_j} a_i = T$ for each $j \in \{1, \dots, m\}$.

Theorem 1. The 3-PARTITION problem can be reduced in polynomial time to the problem of finding an aggregation $q^* \in P_m$ of some $p \in P_n$ for which $D(q^* \| p) = \min_{q \in A_m(p)} D(q \| p)$.

Proof. The idea behind the following reduction can be summarized as follows: given an instance of 3-PARTITION, we transform it into a probability distribution $p$ such that the lower bound $lb(p)$ is an aggregation of $p$ if and only if the original instance of 3-PARTITION admits a solution. Let an arbitrary instance of 3-PARTITION be given; that is, let $S$ be a multiset $\{a_1, \dots, a_n\}$ of $n = 3m$ positive integers with $\sum_{i=1}^{n} a_i = mT$. Without loss of generality, we assume that the integers $a_i$ are ordered in a non-increasing fashion. We construct a valid instance of our Problem 1 by setting $p \in P_{n+m}$ as follows:

$$p = \Bigg( \underbrace{\frac{1}{m+1}, \dots, \frac{1}{m+1}}_{m \text{ times}},\ \frac{a_1}{mT(m+1)}, \dots, \frac{a_n}{mT(m+1)} \Bigg). \tag{5}$$

Note that $p$ is a probability distribution. In fact, since $\sum_{i=1}^{n} a_i = mT$, we have

$$\sum_{i=1}^{m} \frac{1}{m+1} + \sum_{i=1}^{n} \frac{a_i}{mT(m+1)} = \frac{m}{m+1} + \frac{1}{m+1} = 1.$$

Note also that the components of $p$ are ordered non-increasingly, since $a_i \le mT$ implies $a_i/(mT(m+1)) \le 1/(m+1)$. Moreover, from (4) and (5), the probability distribution $lb(p) \in P_m$ associated with $p$ is as follows:

$$lb(p) = \left( \frac{1}{m}, \dots, \frac{1}{m} \right). \tag{6}$$

To prove the theorem, we show that the starting instance of 3-PARTITION is a YES instance if and only if it holds that

$$\min_{q \in A_m(p)} D(q \| p) = \log \frac{m+1}{m}, \tag{7}$$

where $p$ is given in (5). We begin by assuming the given instance of 3-PARTITION is a YES instance; that is, there is a partition of $S$ into triplets $S_1, \dots, S_m$ such that

$$\sum_{a_i \in S_j} a_i = T, \quad \forall j \in \{1, \dots, m\}, \tag{8}$$

and we show that $\min_{q \in A_m(p)} D(q \| p) = \log \frac{m+1}{m}$. By Lemma 1, (5), and equality (6), we have

$$\min_{q \in A_m(p)} D(q \| p) \ \ge\ \log \frac{1}{\sum_{i=1}^{m} p_i} = \log \frac{m+1}{m}. \tag{9}$$

From (8), we have

$$\sum_{a_i \in S_j} \frac{a_i}{mT(m+1)} = \frac{1}{m(m+1)}, \quad \forall j \in \{1, \dots, m\}. \tag{10}$$

Let us define $q' \in P_m$ as follows:

$$q'_j = \frac{1}{m+1} + \sum_{a_i \in S_j} \frac{a_i}{mT(m+1)}, \quad j = 1, \dots, m, \tag{11}$$

where, by (10),

$$q'_j = \frac{1}{m+1} + \frac{1}{m(m+1)} = \frac{1}{m}, \quad \forall j \in \{1, \dots, m\}. \tag{12}$$

From (11) and from the fact that $S_1, \dots, S_m$ are a partition of $\{a_1, \dots, a_n\}$, we obtain $q' \in A_m(p)$; that is, $q'$ is a valid aggregation of $p$ (cf. (5)). Moreover,

$$D(q' \| p) = \sum_{j=1}^{m} \frac{1}{m} \log \frac{1/m}{1/(m+1)} = \log \frac{m+1}{m}.$$

Therefore, by (9) and the fact that $q' \in A_m(p)$, we obtain

$$\min_{q \in A_m(p)} D(q \| p) = \log \frac{m+1}{m},$$

as required.
To prove the opposite implication, we assume that $p$ (as given in (5)) is a YES instance; that is,

$$\min_{q \in A_m(p)} D(q \| p) = \log \frac{m+1}{m}. \tag{13}$$

We show that the original instance of 3-PARTITION is also a YES instance; that is, there is a partition of $S$ into triplets $S_1, \dots, S_m$ such that

$$\sum_{a_i \in S_j} a_i = T, \quad \forall j \in \{1, \dots, m\}. \tag{14}$$

Let $q^*$ be the element in $A_m(p)$ that achieves the minimum in (13). Consequently, we have

$$\log \frac{m+1}{m} = D(q^* \| p) = \sum_{i=1}^{m} q_i^* \log \frac{q_i^*}{1/(m+1)} = \log(m+1) - H(q^*), \tag{15}$$

where $H(q^*) = -\sum_{i=1}^{m} q_i^* \log q_i^*$ is the Shannon entropy of $q^*$. From (15), we obtain that $H(q^*) = \log m$; hence,

$$q^* = \left( \frac{1}{m}, \dots, \frac{1}{m} \right) \tag{16}$$

(see [30], Thm. 2.6.4). Recalling that $q^* \in A_m(p)$, we obtain that the uniform distribution is an aggregation of $p$. We note that the first $m$ components of $p$, as defined in (5), cannot be aggregated among themselves to obtain (16), because $2/(m+1) > 1/m$ for $m \ge 2$; moreover, since the last $n$ components of $p$ sum to only $1/(m+1) < 1/m$, each aggregation class must contain exactly one of the first $m$ components. Therefore, in order to obtain (16) as an aggregation of $p$, there must exist a partition $S_1, \dots, S_m$ of $S = \{a_1, \dots, a_n\}$ for which

$$\frac{1}{m+1} + \sum_{a_i \in S_j} \frac{a_i}{mT(m+1)} = \frac{1}{m}, \quad \forall j \in \{1, \dots, m\}. \tag{17}$$

From (17), we obtain

$$\sum_{a_i \in S_j} \frac{a_i}{mT(m+1)} = \frac{1}{m} - \frac{1}{m+1} = \frac{1}{m(m+1)}, \quad \forall j \in \{1, \dots, m\}. \tag{18}$$
From this, it follows that

$$\sum_{a_i \in S_j} a_i = T, \quad \forall j \in \{1, \dots, m\}. \tag{19}$$

We note that, for (19) to be true, there cannot exist any $S_j$ for which $|S_j| \ne 3$. Indeed, if there were a subset $S_j$ for which $|S_j| \ne 3$, there would be at least a subset $S_k$ for which $|S_k| > 3$ (recall that the $S_j$ partition the $n = 3m$ elements of $S$). Thus, for such an $S_k$, since $a_i > T/4$ for each $i$, we would have

$$\sum_{a_i \in S_k} a_i > 4 \cdot \frac{T}{4} = T,$$

contradicting (19). Therefore, it holds that

$$|S_j| = 3, \quad \forall j \in \{1, \dots, m\}. \tag{20}$$

Moreover, from (19) and (20), we obtain

$$\sum_{a_i \in S_j} a_i = T \quad \text{and} \quad |S_j| = 3, \quad \forall j \in \{1, \dots, m\}. \tag{21}$$

Thus, from (21), it follows that the subsets $S_1, \dots, S_m$ give a partition of $S$ into triplets such that $\sum_{a_i \in S_j} a_i = T$, $\forall j \in \{1, \dots, m\}$.
Therefore, the starting instance of 3-PARTITION is a YES instance.
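For concreteness, the reduction is also simple to implement. The sketch below (our own code, written for illustration) builds the distribution $p$ of Equation (5) from a small YES instance of 3-PARTITION and exhibits the uniform aggregation achieving $\log \frac{m+1}{m}$.

```python
import math
from fractions import Fraction

def reduction(a, m):
    """Build the instance p in P_{n+m} of Equation (5) from a 3-PARTITION
    instance a_1 >= ... >= a_n with n = 3m and sum(a) = m*T."""
    T = sum(a) // m
    p = [Fraction(1, m + 1)] * m
    p += [Fraction(ai, m * T * (m + 1)) for ai in a]
    assert sum(p) == 1
    return p

# A YES instance with m = 2 and T = 15: {7, 6, 5, 4, 4, 4} splits into the
# triplets {7, 4, 4} and {6, 5, 4}, both summing to 15.
a, m = [7, 6, 5, 4, 4, 4], 2
p = reduction(a, m)

# Aggregating each component 1/(m+1) with one triplet yields the uniform
# distribution (1/2, 1/2), exactly as in the proof of Theorem 1.
q = [p[0] + p[2] + p[5] + p[6],   # 1/3 + (7 + 4 + 4)/90 = 1/2
     p[1] + p[3] + p[4] + p[7]]   # 1/3 + (6 + 5 + 4)/90 = 1/2
d = sum(float(qi) * math.log2(qi / pi) for qi, pi in zip(q, p[:m]))
print(q, d, math.log2((m + 1) / m))   # d equals log(3/2), about 0.585
```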

Approximation
Given $p \in P_n$ and $m < n$, let OPT denote the optimal value of the optimization problem (2); that is,

$$\mathrm{OPT} = \min_{q \in A_m(p)} D(q \| p). \tag{22}$$

In this section, we design a greedy algorithm that computes an aggregation $q \in A_m(p)$ of $p$ such that

$$D(q \| p) \le \mathrm{OPT} + 1. \tag{23}$$

The idea behind our algorithm is to view the problem of computing an aggregation $q \in A_m(p)$ as a bin packing problem with "overstuffing" (see [31] and the references quoted therein), that is, a bin packing problem in which the bins may be overfilled. In the classical bin packing problem, one is given a set of items, with their associated weights, and a set of bins with their associated capacities (usually, equal for all bins). The objective is to place all the items in the bins while minimizing a given cost function.
In our case, we have $n$ items (corresponding to the components of $p$) with weights $p_1, \dots, p_n$, respectively, and $m$ bins, corresponding to the components of $lb(p)$ (as defined in (4)), with capacities $lb(p)_1, \dots, lb(p)_m$. Our objective is to place all $n$ components of $p$ into the $m$ bins without exceeding the capacity $lb(p)_j$ of each bin $j$, $j = 1, \dots, m$, by more than $\left(\sum_{i=1}^{m} p_i\right) lb(p)_j$. For this purpose, the idea behind Algorithm 1 is quite straightforward. It behaves like classical First-Fit bin packing: to place the $i$-th item, it chooses the first bin $j$ in which the item can be inserted without exceeding the bin's capacity by more than $\left(\sum_{i=1}^{m} p_i\right) lb(p)_j$. In the following, we show that such a bin always exists and that fulfilling this objective is sufficient to ensure the approximation guarantee (23) we are seeking; a code sketch of this strategy is given just below.
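Before the formal listing, here is our own Python rendering of the First-Fit strategy just described, using a plain linear scan over the bins (hence $O(nm)$ time; the $O(n \log m)$ tree-based implementation discussed below is omitted for brevity). The `kl` function from the sketch in Section 2 is reused at the end.

```python
def greedy_approx(p, m):
    """First-Fit placement of the components of p into m bins.

    Bin j has capacity lb(p)_j and may be overstuffed by at most
    (p_1 + ... + p_m) * lb(p)_j.  Assumes p is ordered non-increasingly.
    """
    s = sum(p[:m])
    lb = [pi / s for pi in p[:m]]        # bin capacities, Equation (4)
    limit = [(1 + s) * c for c in lb]    # capacity plus allowed overstuffing
    q = [0.0] * m                        # bin contents, lb^i_j in the text
    for item in p:
        for j in range(m):
            if q[j] + item < limit[j]:   # first bin that still fits
                q[j] += item
                break
        else:                            # never reached, by Lemma 2 below
            raise AssertionError("no feasible bin found")
    return q

p = [0.4, 0.3, 0.2, 0.1]
q = greedy_approx(p, 2)
print(q, kl(q, p))   # distance guaranteed to be within 1 bit of OPT
```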
Algorithm 1: GreedyApprox

1. Compute $lb(p) = \left( p_1 / \sum_{j=1}^{m} p_j, \dots, p_m / \sum_{j=1}^{m} p_j \right)$;
2. Let $lb^i_j$ be the content of bin $j$ after the first $i$ components of $p$ have been placed ($lb^0_j = 0$ for each $j \in \{1, \dots, m\}$);
3. For $i = 0, \dots, n-1$: let $j$ be the smallest bin index for which $lb^i_j + p_{i+1} < \left(1 + \sum_{\ell=1}^{m} p_\ell\right) lb(p)_j$ holds, and place $p_{i+1}$ into the $j$-th bin; that is, set $lb^{i+1}_j = lb^i_j + p_{i+1}$ and $lb^{i+1}_k = lb^i_k$ for each $k \ne j$;
4. Output $q = (lb^n_1, \dots, lb^n_m)$.

Step 3 of GreedyApprox operates as in the classical First-Fit bin packing algorithm. Therefore, it can be implemented to run in $O(n \log m)$ time, as discussed in [32]. In fact, each iteration of the loop in step 3 can be implemented in $O(\log m)$ time by using a balanced binary search tree with height $O(\log m)$ that has a leaf for each bin and in which each node keeps track of the largest remaining capacity of all the bins in its subtree.

Lemma 2. GreedyApprox computes a valid aggregation $q \in A_m(p)$ of $p \in P_n$. Moreover, it holds that

$$D(q \| p) \le \mathrm{OPT} + 1. \tag{24}$$

Proof. We first prove that each component $p_i$ of $p$ is placed in some bin. This implies that $q \in A_m(p)$. For $i < m$, the item $p_{i+1}$ can always be placed: at most $i$ bins are non-empty, so there is an empty bin $k$ with $k \le i+1$, and $p_{i+1}$ fits in it, since $p_{i+1} \le p_k = \left(\sum_{\ell=1}^{m} p_\ell\right) lb(p)_k < \left(1 + \sum_{\ell=1}^{m} p_\ell\right) lb(p)_k$.
Let us consider an arbitrary step $i$, $m \le i < n$, in which the algorithm has placed the first $i$ components of $p$ and needs to place $p_{i+1}$ into some bin. We show that, also in this case, there is always a bin $j$ into which the algorithm can place the item $p_{i+1}$ without exceeding the capacity $lb(p)_j$ of bin $j$ by more than $\left(\sum_{\ell=1}^{m} p_\ell\right) lb(p)_j$. First, notice that at each step $i$, $m \le i < n$, there is at least one bin $k$ whose content $lb^i_k$ does not exceed its capacity $lb(p)_k$, that is, for which $lb^i_k < lb(p)_k$ holds. Were this not the case, we would have $lb^i_j \ge lb(p)_j$ for all bins $j$; then, we would also have

$$\sum_{j=1}^{m} lb^i_j \ \ge\ \sum_{j=1}^{m} lb(p)_j = 1,$$

which is not possible, since we have placed only the first $i < n$ components of $p$, and therefore $\sum_{j=1}^{m} lb^i_j = \sum_{\ell=1}^{i} p_\ell < 1$. Consequently, let $k$ be the smallest integer for which the content of the $k$-th bin does not exceed its capacity, i.e., for which $lb^i_k < lb(p)_k$. For such a bin $k$, we obtain

$$\left(1 + \sum_{j=1}^{m} p_j\right) lb(p)_k = lb(p)_k + \left(\sum_{j=1}^{m} p_j\right) lb(p)_k = lb(p)_k + p_k > lb^i_k + p_{i+1},$$

where the last inequality follows from $lb^i_k < lb(p)_k$ and from $p_{i+1} \le p_k$ (recall that the components of $p$ are ordered non-increasingly and $i + 1 > m \ge k$). Hence, the item $p_{i+1}$ can always be placed into some bin.