The Minimum Consistent Spanning Subset Problem on Trees

Given a vertex-colored edge-weighted graph, the minimum consistent subset (MCS) problem asks for a minimum subset S of vertices such that every vertex v ∉ S has the same color as its nearest neighbor in S. This problem is NP-complete. A recent result of Dey, Maheshwari, and Nandy (2021) gives a polynomial-time algorithm for the MCS problem on two-colored trees. A block is a maximal connected set of vertices of the same color. We introduce a variant of the MCS problem, namely the minimum consistent spanning subset problem, for which we require the set S to contain a vertex from every block of the graph such that every vertex v ∉ S has a nearest neighbor in S that is in the same block as v. We observe that this problem is NP-hard on general graphs. We present a polynomial-time algorithm for this problem on trees.


Introduction
Let G be a simple undirected edge-weighted graph whose vertex set V is partitioned into sets V_1, ..., V_k. The distance between two vertices u, v ∈ V is defined as the weight of a shortest path between u and v in G. The Minimum Consistent Subset (MCS) problem asks for a minimum cardinality subset S of V such that, for every i ∈ {1, ..., k} and for every v ∈ V_i, a nearest neighbor of v in S belongs to V_i; see [4,5,6]. We stress that v could have multiple nearest neighbors in S, but for the purpose of the MCS problem it suffices to have only one of them in V_i. The set S is a representative for the structure of the entire graph G. We may assume that the vertices of V are colored by k different colors such that the vertices in each V_i have the same color and the vertices in V_i and V_j, with i ≠ j, have different colors. Hence the MCS problem asks for a minimum cardinality subset S of V such that for every v ∈ V a nearest neighbor of v in S has the same color as v. See Fig. 1(a) for an example where all edges have the same weight.
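The definition above can be checked mechanically. The following is a minimal sketch, not part of the original paper, that tests whether a given set S is a consistent subset of a connected edge-weighted graph. The names `shortest_dists` and `is_consistent_subset`, and the adjacency-list input format, are assumptions made for illustration; distances are compared exactly, so integer weights are assumed.

```python
import heapq

def shortest_dists(adj, src):
    """Dijkstra from src; adj maps vertex -> list of (neighbor, weight)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_consistent_subset(adj, color, S):
    """True if every vertex has at least one nearest neighbor in S of its own color.

    Assumes the graph is connected and weights are integers (exact ties).
    """
    if not S:
        return False
    dist_from = {s: shortest_dists(adj, s) for s in S}
    for v in adj:
        best = min(dist_from[s][v] for s in S)
        # One same-colored vertex among the nearest ones suffices.
        if not any(dist_from[s][v] == best and color[s] == color[v] for s in S):
            return False
    return True
```

For instance, on the unit-weight path 0-1-2-3 colored red, red, blue, blue, the set {0, 3} passes the check while {0} does not. The check only verifies consistency; it says nothing about minimality.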
The MCS problem was first introduced by Hart [9] (in 1968) for points in the Euclidean plane. In this version G is assumed to be a complete graph, the vertices are represented by points in the plane, and edge weights are Euclidean distances between the endpoints [2,7,8,9]. As every vertex of V has the same color as its nearest neighbor in S, in the Voronoi diagram of S all points in each Voronoi cell have the same color as the center of the cell; see Fig. 1(b).
The MCS problem finds applications in solving nearest neighbor problems [3,7,8,13], finding the optimal number of clusters in k-clustering problems such as k-means and k-nearest neighbors [6], and finding an optimal set of classifiers in classification algorithms [10]. The MCS problem is also useful in the field of pattern recognition, such as speech and handwriting recognition [6,11].
When all edges of G have the same unit weight, we say that G is unweighted. In this case the distance between two vertices u and v is the number of edges in a shortest path between u and v. The MCS problem is NP-complete even for two-colored unweighted planar graphs [1,5]; this is shown by a reduction from the minimum dominating set problem. The Euclidean version of the MCS problem is also NP-complete [12], even for two-colored points [11].
There has not been much progress on the MCS problem from the algorithmic point of view. Recently, Dey, Maheshwari, and Nandy [4] solved this problem in polynomial time for some simple two-colored (also known as bicolored and bichromatic) unweighted trees such as paths, caterpillars, spiders, and combs, with respective running times O(n), O(n), O(n^2), and O(n^2). In a companion paper [5] they present an O(n^4)-time algorithm for general two-colored unweighted trees. See Fig. 2(a) and 2(c) for examples of MCS on bicolored trees.
A minimum consistent subset for a bicolored tree may consist of only two vertices, no matter how large the tree is; see for example the tree in Fig. 2(c). For the purpose of clustering and classification, such a solution does not accurately reflect the structure of the entire tree. To capture the entire structure of the tree we need a stronger version of the MCS problem, which we introduce below.
We define a block to be a maximal connected set of vertices of the same color in a tree. The tree in Fig. 2(b) consists of seven blocks denoted B_1, ..., B_7. The solution of the MCS problem may not contain representatives (i.e., vertices) from all blocks in the tree; see Fig. 2(a) and Fig. 2(c). Therefore, a minimum consistent subset may not capture the structure of the entire tree. In order to have a better representative for the tree we introduce a more constrained version of the MCS problem. In this version, which we call the minimum consistent spanning subset (MCSS) problem, the solution S must contain at least one vertex (i.e., a representative) from each block of the tree such that every vertex v ∉ S has a nearest neighbor in S that is in the same block as v. A solution to the MCSS problem spans over all blocks in the tree. The constraint of having a nearest neighbor in the same block is natural for clustering purposes. See Fig. 2(d) for an example of an MCSS of a two-colored tree.
In this paper we first observe that the MCSS problem is NP-hard for general graphs. Then we turn our attention to trees. We present a simple 2-approximation algorithm for the MCSS problem on trees. Our main contribution, which is summarized below, is a polynomial-time algorithm that solves the problem optimally on trees.
Theorem 1 A minimum consistent spanning subset on a vertex-colored weighted tree with n vertices can be computed in O(n^4) time.
In Section 2 we review some related works and results. In Section 3 we present preliminaries for our algorithm. The algorithm itself is presented in Section 4. We describe our algorithm for general trees. For special trees such as paths, spiders, and combs we achieve better running times, namely O(n), O(n^2), and O(n^3), respectively, in Section 5.

Hardness of the MCSS problem on general graphs
The MCSS problem on a vertex-colored edge-weighted general graph G can be defined in a similar fashion, where each block is a maximal connected subset of vertices of the same color in G. We observe that this problem is NP-hard. This follows from the NP-hardness proof of the MCS problem for two-colored edge-weighted general graphs, due to Banerjee, Bhore, and Chitnis [1]. They use a reduction from the dominating set problem in connected graphs as follows. Given an instance of the dominating set problem on a connected graph G, construct an instance of the MCS problem consisting of two copies of G, namely G_1 and G_2, and a vertex v. The edges of G_1 and G_2 have weight 1. Connect every vertex of G_1 to all vertices of G_2 by edges of weight 2 − 3ϵ, and every vertex of G_2 to v by edges of weight ϵ. Then color the vertices of G_1 red, and the vertices of G_2 and v blue.
In the above construction the vertices of the same color form a block, and any solution for the MCS problem must contain vertices from both blocks. This matches the requirements of our MCSS problem. Thus the same reduction implies that the MCSS problem on general graphs is also NP-hard, even for graphs that consist of two blocks.
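To make the construction concrete, here is a small sketch that builds the instance described above from a graph G given as an edge list. The function name `build_reduction`, the vertex labels, and the default value of ϵ are assumptions for illustration only; the text does not state which values of ϵ are admissible, so the default below is a placeholder.

```python
def build_reduction(n, edges, eps=0.1):
    """Build the two-colored weighted instance from a graph G on vertices 0..n-1.

    Returns (weighted_edges, color), where weighted_edges is a list of
    (u, w, weight) triples. The value of eps is a placeholder assumption.
    """
    g1 = [("g1", i) for i in range(n)]  # first copy of G, colored red
    g2 = [("g2", i) for i in range(n)]  # second copy of G, colored blue
    v = "v"                             # the extra blue vertex
    weighted_edges = []
    for a, b in edges:                  # edges inside each copy have weight 1
        weighted_edges.append((("g1", a), ("g1", b), 1.0))
        weighted_edges.append((("g2", a), ("g2", b), 1.0))
    for x in g1:                        # complete join between the two copies
        for y in g2:
            weighted_edges.append((x, y, 2 - 3 * eps))
    for y in g2:                        # every vertex of G_2 is joined to v
        weighted_edges.append((y, v, eps))
    color = {x: "red" for x in g1}
    color.update({y: "blue" for y in g2})
    color[v] = "blue"
    return weighted_edges, color
```

For a path on three vertices this yields 2·2 + 9 + 3 = 16 weighted edges, three red vertices, and four blue vertices, which form exactly two blocks.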

Previous Work
In this section we discuss some related works for both general graphs and geometric graphs. Assume that the number of vertices of the input graph is n.
In a recent study, Dey et al. [4] show how to solve the MCS problem on some simple two-colored unweighted trees such as paths, caterpillars, spiders, and combs, with respective running times O(n), O(n), O(n^2), and O(n^2). In a companion paper [5] they show how to solve the MCS problem on two-colored unweighted trees in O(n^4) time using a dynamic programming algorithm. They reduce each instance of the problem to a shortest path problem in a graph. Their algorithm is highly dependent on specific subtrees called gates, where each gate consists of three vertices p, q, v such that p and q are of different colors and equidistant to a vertex v of degree larger than 2; see Fig. 2(a) and Fig. 2(c).
There are major differences between our MCSS algorithm and the MCS algorithm of [5] in terms of both objectives and features: (i) the two algorithms solve two different problems which have different objectives, (ii) our algorithm works for multicolored weighted trees, while it is unclear how one could generalize the algorithm of [5] to more than two colors or to weighted trees without blowing up the running time, mainly due to the notion of gates, (iii) in contrast to that of [5], our algorithm is based on a recursive formulation of the problem; it neither transforms the original problem to a shortest path problem nor uses any gates.
Recall that the MCS problem was originally introduced for points in the Euclidean plane by Hart [9]. In a recent study regarding the Euclidean MCS, Biniaz et al. [2] provide some complex algorithms to find the MCS of colored point sets in the plane. These include a sub-exponential algorithm for determining the MCS of points in the plane, an algorithm to compute the Euclidean MCS for collinear points in O(n) time, and two dynamic programming algorithms to find the MCS for multi-colored and two-colored points lying on two parallel lines in the plane, with respective running times O(n^6) and O(n^4). Furthermore, they propose an algorithm with time complexity O(n log n) to determine whether the size of the MCS for a set of two-colored points is two, and to find such a subset if it exists [2]. In the next section, some preliminaries will be discussed before presenting our algorithm.

Preliminaries for the Algorithm
For simplicity we present our algorithm for unweighted trees. In the end we show how to extend the algorithm to weighted trees. It is easily seen that an algorithm for the MCSS problem on bicolored trees would also work on multicolored trees because any solution should contain a vertex from each block regardless of the color of neighboring blocks. In other words, one can think of the tree as a collection of blocks. Therefore, in our figures (but not in the description of the algorithm) we only consider bicolored trees. In this section we establish some properties that will be used to design our algorithm in the subsequent section. Let T be a tree and let S be a consistent spanning subset of T. We say that a vertex v ∈ T is covered by the vertex u ∈ S if u is a vertex of S that is closest to v. Analogously, we say that u covers v.
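As an illustration of these notions, the sketch below checks whether a set S is a consistent spanning subset of an unweighted tree, i.e., whether S contains a representative of every block and every vertex is covered by some vertex of S in its own block. The function names and the input format are hypothetical, and the block labels are assumed to be given (they are passed in as the dictionary `block`).

```python
from collections import deque

def bfs_dist(adj, src):
    """Unweighted distances from src; adj maps vertex -> list of neighbors."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_consistent_spanning_subset(adj, block, S):
    """block maps each vertex to a label of its block (assumed precomputed)."""
    # Spanning condition: every block needs at least one representative in S.
    if {block[s] for s in S} != set(block.values()):
        return False
    dist_from = {s: bfs_dist(adj, s) for s in S}
    for v in adj:
        best = min(dist_from[s][v] for s in S)
        # v must be covered by some vertex of S that lies in v's own block.
        if not any(dist_from[s][v] == best and block[s] == block[v] for s in S):
            return False
    return True
```

On the path 0-1-2-3-4 with blocks {0, 1, 2} and {3, 4}, the set {0, 3} spans both blocks but fails the check because vertex 2 is covered only by vertex 3, which lies in the other block; the set {2, 3} passes.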
Observation 1 If all vertices of T have the same color, i.e., T is monochromatic, then every single vertex of T forms a minimum consistent spanning subset for T.
Recall the definition of a block as a maximal subset of connected vertices of the same color in T. In view of Observation 1 we may assume that T has more than one block. Let k ≥ 2 be the number of blocks in T. We define the block tree of T, denoted by B(T), as the tree with k vertices, each corresponding to a block of T, in which there is an edge between two vertices if their corresponding blocks are neighbors in T. In other words, B(T) is obtained by contracting all blocks of T. Notice that each vertex of B(T) corresponds to a block of T, and vice versa. We refer to a block of T as a leaf block if its corresponding vertex in B(T) is a leaf. A block of T that is not a leaf block is called an internal block; see Fig. 3.
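The block tree can be computed by a single traversal. The following sketch, with hypothetical names and an assumed adjacency-list input, labels the blocks of a vertex-colored tree, collects the edges of B(T), and reports the leaf blocks.

```python
from collections import deque

def block_tree(adj, color):
    """Return (block, bt_edges, leaf_blocks) for a vertex-colored tree.

    adj:   dict vertex -> list of neighbors (the tree T)
    color: dict vertex -> color label
    """
    block = {}
    num_blocks = 0
    for start in adj:
        if start in block:
            continue
        # Grow a maximal connected same-colored set from start.
        block[start] = num_blocks
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in block and color[v] == color[u]:
                    block[v] = num_blocks
                    queue.append(v)
        num_blocks += 1
    # An edge of T whose endpoints lie in different blocks is an edge of B(T).
    bt_edges = {tuple(sorted((block[u], block[v])))
                for u in adj for v in adj[u] if block[u] != block[v]}
    degree = [0] * num_blocks
    for p, q in bt_edges:
        degree[p] += 1
        degree[q] += 1
    leaf_blocks = {b for b in range(num_blocks) if degree[b] == 1}
    return block, bt_edges, leaf_blocks
```

For a path on six vertices colored red, red, blue, blue, red, red, this gives three blocks, the block-tree edges (0, 1) and (1, 2), and the leaf blocks 0 and 2. The two red blocks receive different labels even though they share a color.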
We denote the shortest-path distance between two vertices u and v in T by dist(u, v).
Lemma 1 Any minimum consistent spanning subset of a tree T contains exactly one vertex from each leaf block of T .
Proof: We use contradiction to prove this lemma. Consider a minimum consistent spanning subset S* that contains more than one vertex from some leaf block, say B. Let N be the neighboring block of B. Let a be a vertex of S* in B that is closest to N, as in Fig. 4. Let S′ be the set obtained from S* by removing all the vertices of B except for a. We claim that S′ is a consistent spanning subset for T; this would contradict S* being minimum. To prove the above claim, it suffices to show that a is a vertex of S′ that is closest to every vertex of B. Let n be a vertex of S* in N that is closest to B. Let v be the last vertex of B on the path from a to n, as depicted in Fig. 4. As S* is a consistent spanning subset, we have dist(v, a) ≤ dist(v, n). Now consider any vertex x ∈ B. The path between x and n passes through v. Therefore dist(x, a) ≤ dist(x, v) + dist(v, a) ≤ dist(x, v) + dist(v, n) = dist(x, n), which proves the claim. □

An approximation algorithm
Recall the block tree B(T); see Fig. 3. Let S be the empty set. Every edge e of B(T) corresponds to an edge of T whose two endpoints lie in different blocks; for every such e our algorithm adds these two endpoints to S. It is easily seen that S is a feasible solution for the MCSS problem. We claim that S is a 2-approximate solution.
Let b denote the number of blocks in T, which is the same as the number of vertices of B(T). Since B(T) is a tree, it has b − 1 edges. Therefore |S| ≤ 2b − 2. On the other hand, the size of any optimal solution S* is at least b because it must contain at least one vertex from each block of T, and hence |S| ≤ 2b − 2 < 2b ≤ 2|S*|. We note that this elementary analysis is the best possible for this algorithm: for example, if T is a path with blocks of size three, then our algorithm picks two vertices for each block (except for the two leaf blocks) while the optimal solution picks one vertex (the middle vertex) from each block.
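The approximation algorithm is short enough to state directly in code. This is a minimal sketch under the assumption that T has at least two blocks (for a monochromatic tree it returns the empty set, and Observation 1 applies instead); the function name and input format are illustrative.

```python
def two_approx_mcss(adj, color):
    """Return the endpoints of all tree edges that join two different blocks.

    adj:   dict vertex -> list of neighbors (the tree T)
    color: dict vertex -> color label
    In a tree, an edge joins two different blocks exactly when its
    endpoints have different colors.
    """
    S = set()
    for u in adj:
        for v in adj[u]:
            if color[u] != color[v]:
                S.update((u, v))
    return S
```

On the path 0, ..., 8 with three blocks of size three, the sketch returns {2, 3, 5, 6}, which has size 2b − 2 = 4, whereas the three middle vertices {1, 4, 7} form an optimal solution.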

The Algorithm
Lemma 1 suggests a more constrained version of the MCSS problem, in the sense that we can fix a leaf block B and enforce exactly one vertex of B to be in the solution. As we do not know in advance which vertex of B is in the optimal solution, we try all of them and report the best answer.
Our algorithm employs a nontrivial dynamic programming approach. First we introduce the subproblems that will be generated throughout the algorithm and then we will show how to solve the subproblems recursively.

Defining the subproblems
We denote each subproblem by T(a, c) where a and c are two given vertices of T. Consider the path δ between a and c in T and let x be the neighbor of c in δ (it might be the case that a = x). By removing the edge (x, c) from T we obtain two subtrees. Let T_c be the subtree containing c; see Fig. 5(a). Let T′ be the union of δ and T_c as in Fig. 5(b). We define T(a, c) to be the MCSS problem on T′ with the following constraints: a must be in the solution, and all the vertices from a to x on δ must be covered by a.
These constraints imply that the vertices from a to x should have the same color.
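The tree T′ of a subproblem can be extracted as follows. This is only an illustrative sketch: the function name `subproblem_tree` and its return format are assumptions, and it requires a ≠ c.

```python
def subproblem_tree(adj, a, c):
    """Return (delta, x, vertices of T') for the subproblem T(a, c), a != c.

    adj maps each vertex of the tree to the list of its neighbors.
    """
    # Find the unique a-c path delta via parent pointers from a.
    parent = {a: None}
    stack = [a]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    delta = [c]
    while delta[-1] != a:
        delta.append(parent[delta[-1]])
    delta.reverse()
    x = delta[-2]  # neighbor of c on delta (possibly a itself)
    # T_c: everything reachable from c without crossing the edge (x, c).
    t_c = {c}
    stack = [c]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in t_c and not (u == c and v == x):
                t_c.add(v)
                stack.append(v)
    return delta, x, set(delta) | t_c
```

A caller would additionally verify that all vertices of δ from a to x have the same color as a, since otherwise the subproblem is infeasible by the remark above.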

Solving the subproblems
We denote the size of the (constrained) MCSS for T(a, c) by S(a, c). If T(a, c) has no solution then we set S(a, c) = +∞. To solve T(a, c) we proceed as follows.
If T′ is monochromatic, then (by Observation 1) we return a as the solution. In this case S(a, c) = 1. Assume that T′ is not monochromatic. We root T′ at a.
Lemma 2 If T(a, c) has a solution, then any solution of T(a, c) contains a vertex z in the same block as a or in a neighboring block of a such that all vertices on the path from a to z are covered by a or by z.
Proof: As T′ is multicolored, any solution of T(a, c) should contain at least two vertices. In particular it should contain at least one vertex from each neighboring block of a. In any solution of T(a, c), a vertex other than a that is closest to a satisfies the statement of the lemma for z. □
Let z be any vertex of T′ that satisfies the constraints of Lemma 2; see Fig. 6. (It might be the case that z = c. Also z could be in a's block or in a's neighboring block.) If such a vertex z does not exist then T(a, c) has no solution and thus we set S(a, c) = +∞.
Let a ⇝ z denote the path from a to z. By Lemma 2 all vertices of a ⇝ z are covered by a or z. Since x must be covered by a (as imposed by the definition of T(a, c)), we must have dist(x, a) ≤ dist(x, z), and thus dist(z, a) ≥ 2 · dist(x, a). Moreover, if a ⇝ z has 2k vertices (including a and z) then the first k vertices must be covered by a and the second k vertices by z. If a ⇝ z has 2k + 1 vertices then the first k vertices must be covered by a, the last k vertices by z, and the middle vertex, say m, must be covered by one of a and z that has the same color as m.
Now we are in a problem setting where both a and z must be in the solution, and all vertices on a ⇝ z must be covered by a and z. We denote this more constrained version of T(a, c) by problem T(a, c, z) (we do not call this a subproblem for a reason to be explained later). In other words, T(a, c, z) is the MCSS problem on T′ with the following constraints: (i) a and z are in the solution, (ii) z is in the same block as a or in a neighboring block of a, and (iii) all vertices on a ⇝ z are covered by a or z. We refer to any vertex z that satisfies the above constraints as a valid pair for a.
Now we show how to solve T(a, c, z). We denote the size of the solution for T(a, c, z) by S(a, c, z). Let A be the set of all the vertices on the path c ⇝ z that are closer to a than to z, as in Fig. 6. Let Z be the set of all the vertices on c ⇝ z that are closer to z than to a. If a vertex m has the same distance to a and z, then we put it in the set that has the same color as m.
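The splitting rule for the path a ⇝ z can be sketched as follows for unweighted trees. The function name `split_path` is hypothetical, and the sketch splits the whole path a ⇝ z rather than only its part c ⇝ z; it checks colors along the path only, not the remaining constraints of T(a, c, z). When the middle vertex has the color of both a and z, the sketch arbitrarily assigns it to a.

```python
def split_path(path, color):
    """Split path = [a, ..., z] into the parts covered by a and by z.

    Returns (covered_by_a, covered_by_z), or None when some vertex on the
    path cannot be covered by a same-colored endpoint as required.
    """
    a, z = path[0], path[-1]
    half = len(path) // 2
    by_a = list(path[:half])               # first k vertices
    by_z = list(path[len(path) - half:])   # last k vertices
    if len(path) % 2 == 1:
        m = path[half]                     # equidistant middle vertex
        if color[m] == color[a]:
            by_a.append(m)
        elif color[m] == color[z]:
            by_z.insert(0, m)
        else:
            return None
    # A contiguous same-colored run starting at an endpoint stays in its block.
    if any(color[v] != color[a] for v in by_a):
        return None
    if any(color[v] != color[z] for v in by_z):
        return None
    return by_a, by_z
```

For the path [0, 1, 2, 3, 4] colored red, red, blue, blue, blue, the middle vertex 2 is blue and is therefore assigned to z = 4.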
To solve T(a, c, z) we define two sets A′ and Z′ as follows. For each vertex v in A, we add to A′ all children of v that are not on the path c ⇝ z. For each vertex v in Z, we add to Z′ all children of v that are not on the path c ⇝ z. Then the solution of T(a, c, z) is obtained by taking the union of {a, z} with the solutions of T(a, v′) for all v′ ∈ A′ and the solutions of T(z, v′) for all v′ ∈ Z′. It might happen that some choices of z do not lead to a valid solution, but then a subsequent subproblem, e.g. T(z, v′), will be infeasible and S(z, v′) = +∞.
One might wonder whether, for two vertices v′_1, v′_2 ∈ Z′, the solutions of T(z, v′_1) and T(z, v′_2) may affect each other. However, this cannot happen because by the definition of T(·, ·) all vertices on paths from z to the parents of v′_1 and v′_2 must be covered by z, and thus any vertex in the solutions of T(z, v′_1) and T(z, v′_2) lies in the same level as z or in a lower level (i.e., higher depth) in T′. The same argument applies to vertices in A′. Notice that some vertex, say m, might have equal distance
algorithm from B_1 in Fig. 8, the first try would be T(v_1, v_2) in which v_3 is chosen as a valid pair for v_1. Then the algorithm tries to solve T(v_3, v_4) but the vertex z = v_9 in the neighboring block does not satisfy the lemma's outcome that all vertices from v_3 to v_9 are covered by v_3 or v_9, because v_6 is covered by v_8.

Figure 8: An optimal solution assuming that the constraint of having a nearest neighbor in the same block has been dropped. The vertex v_6 does not have a nearest neighbor in its own block, however it has a nearest neighbor, v_8, in a different block but of the same color.

Figure 1: (a) An example of a 2-colored graph G where all edges have the same weight. The set S = {s_1, s_2, s_3, s_4} is an MCS for G. The vertex s_1 is a nearest neighbor of a, b, c, and d in S. (b) The Euclidean MCS where the circled points belong to S.

Figure 2: (a) An MCS for a two-colored unweighted tree. (b) Blocks in a tree. (c) An MCS for an unweighted tree that has size 2, and (d) an MCSS for the same tree.

Figure 3: (a) Example of a tree T together with its blocks. (b) The block tree of T.

Figure 5: (a) The tree T, and (b) the tree T′.

Figure 6: Solving T(a, c) recursively in terms of T(a, v′) and T(z, v′) where z is a valid pair for a.