Community Detection in Networks with Node Features

Many methods have been proposed for community detection in networks, but most of them do not take into account additional information on the nodes that is often available in practice. In this paper, we propose a new joint community detection criterion that uses both the network edge information and the node features to detect community structures. One advantage our method has over existing joint detection approaches is the flexibility of learning the impact of different features which may differ across communities. Another advantage is the flexibility of choosing the amount of influence the feature information has on communities. The method is asymptotically consistent under the block model with additional assumptions on the feature distributions, and performs well on simulated and real networks.

Community detection is a fundamental problem in network analysis, extensively studied in a number of domains; see (1) and (2) for some examples of applications. A number of approaches to community detection are based on probabilistic models for networks with communities, such as the stochastic block model (3), the degree-corrected stochastic block model (4), and the latent factor model (5). Other approaches work by optimizing a criterion measuring the strength of community structure in some sense, often through spectral approximations. Examples include normalized cuts (6), modularity (7; 8), and many variants of spectral clustering, e.g., (9).
Many of the existing methods detect communities based only on the network adjacency matrix. However, we often have additional information on the nodes (node features), and sometimes edges as well, for example, (10), (11) and (12). In many networks the distribution of node features is correlated with community structure (13), and thus a natural question is whether we can improve community detection by using the node features. Several generative models for jointly modeling the edges and the features have been proposed, including the network random effects model (14), the embedding feature model (15), the latent variable model (16), the discriminative approach (17), the latent multi-group membership graph model (18), the social circles model for ego networks (13), the communities from edge structure and node attributes (CESNA) model (19), the Bayesian Graph Clustering (BAGC) model (20) and the topical communities and personal interest (TCPI) model (22). Most of these models are designed for specific feature types, and their effectiveness depends heavily on the correctness of model specification. Model-free approaches include weighted combinations of the network and feature similarities (23; 24), attribute-structure mining (25), simulated annealing clustering (26), and compressive information flow (27). Most methods in this category use all the features in the same way without determining which ones influence the community structure and which do not, and lack flexibility in how to balance the network information with the information coming from its node features, which do not always agree. Including irrelevant node features can only hurt community detection by adding in noise, while selecting features that by themselves cluster strongly may not correspond to features that correlate with the community structure present in the adjacency matrix.
In this paper, we propose a new joint community detection criterion that uses both the network adjacency matrix and the node features. The idea is that by properly weighing edges according to the feature similarities of their end nodes, we strengthen the community structure in the network, making it easier to detect. Rather than using all available features in the same way, we learn from data which features are most helpful in identifying the community structure. Intuitively, our method looks for an agreement between the clusters suggested by the two data sources, the adjacency matrix and the node features. Numerical experiments on simulated and real networks show that our method performs well compared to methods that use either the network or the node features alone.

The joint community detection criterion
Our method is designed to look for assortative community structure, that is, communities in which nodes are more likely to connect to each other when they belong to the same community, so that there are more edges within communities than between them. This is a very common intuitive definition of communities, incorporated in many community detection criteria, for example modularity (8). Our goal is to take a community detection criterion based on the adjacency matrix alone and add feature-based edge weights to improve detection. Several criteria using the adjacency matrix alone are available, but a criterion linear in the adjacency matrix makes optimization much more feasible in our particular situation, and we propose a new criterion which turns out to work particularly well for our purposes. Let A denote the adjacency matrix, with A_ij = 0 if there is no edge between nodes i and j, and otherwise A_ij > 0, equal to 1 for unweighted networks or to the edge weight for weighted networks. The community detection criterion we start from is a very simple analogue of modularity, to be maximized over all possible label assignments e:

R(e) = Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} A_ij .    (1.1)

Here e is the vector of node labels, with e_i = k if node i belongs to community k, for k = 1, ..., K, E_k = {i : e_i = k}, and |E_k| is the number of nodes in community k. We assume each node belongs to exactly one community, and the number of communities K is fixed and known. Rescaling by |E_k|^α rules out trivial solutions that put all nodes in the same community, and α > 0 is a tuning parameter. When α = 2, the criterion is approximately the sum of edge densities within communities, and when α = 1, it is the sum of average within-community degrees; both intuitively represent community structure. This criterion can be shown to be consistent under the stochastic block model by checking the conditions of the general theorem in (28).
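As a concrete illustration, criterion (1.1) can be computed in a few lines. This is a minimal sketch under our reading of the criterion; the function and variable names are ours, not the paper's:

```python
def base_criterion(A, e, K, alpha=1.0):
    """Criterion (1.1): sum over communities of (total within-community
    edge weight) / |E_k|^alpha. A is an n x n adjacency matrix (list of
    lists); e is a label list with values in {0, ..., K-1}."""
    n = len(A)
    score = 0.0
    for k in range(K):
        members = [i for i in range(n) if e[i] == k]
        if not members:
            continue  # empty communities contribute nothing
        # within counts each undirected edge twice (i,j) and (j,i)
        within = sum(A[i][j] for i in members for j in members)
        score += within / len(members) ** alpha
    return score
```

On a toy network made of two disconnected pairs, splitting into the two pairs scores higher than putting everyone in one community, illustrating how the |E_k|^α rescaling rules out the trivial one-community solution.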
The ideal use of features with this criterion would be to up-weight edges within communities and down-weight edges between them, thus enhancing the community structure in the observed network and making it easier to detect. However, node features may not be perfectly correlated with community structure, different communities may be driven by different features, as pointed out by (13), and the features themselves may be noisy. Thus we need to learn the impact of different features on communities, as well as balance the roles of the network itself and its features. Let f_i denote the p-dimensional feature vector of node i. We propose the joint community detection criterion (JCDC)

R(e, β; w_n) = Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} A_ij W(f_i, f_j, β_k; w_n) ,    (1.2)

where α is a tuning parameter as in (1.1), β_k ∈ R^p is the coefficient vector that defines the impact of different features on the k-th community, and β := {β_1, ..., β_K}. The criterion is then maximized over both e and β. Having a different β_k for each k allows us to learn the roles different features may play in different communities. The balance between the information from A and F := {f_1, ..., f_n} is controlled by w_n, another tuning parameter, which in general may depend on n.
For the sake of simplicity, we model the edge weight W(f_i, f_j, β_k; w_n) as a function of the node features f_i and f_j via a p-dimensional vector of their similarity measures φ_ij = φ(f_i, f_j). The choice of similarity measures in φ depends on the type of f_i (for example, on whether the features are numerical or categorical) and is determined on a case-by-case basis; the only important property is that φ assigns higher values to features that are more similar. Note that this trivially allows the inclusion of edge features as well as node features, as long as they are converted to some sort of similarity. To eliminate potential differences in units and scales, we standardize all φ_ij along each feature dimension. Finally, the function W should be increasing in ⟨φ_ij, β_k⟩, which can be viewed as the "overall similarity" between the nodes, and for optimization purposes it is convenient to take W to be concave. Here we use the exponential function

W(f_i, f_j, β_k; w_n) = w_n − exp(−⟨φ_ij, β_k⟩) .    (1.3)

One can use other functions of similar shape, for example a logistic-type function, which we found empirically to perform similarly.
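The exponential weight can be sketched as follows. The form w_n − exp(−⟨φ_ij, β_k⟩) is our reading of the properties stated in the text (bounded above by w_n, equal to w_n − 1 at β_k = 0, and increasing and concave in the overall similarity):

```python
import math

def edge_weight(phi_ij, beta_k, w_n):
    """Exponential edge weight: w_n - exp(-<phi_ij, beta_k>).
    Bounded above by w_n, equals w_n - 1 at beta_k = 0, and is
    increasing and concave in the overall similarity <phi_ij, beta_k>."""
    s = sum(p * b for p, b in zip(phi_ij, beta_k))  # <phi_ij, beta_k>
    return w_n - math.exp(-s)
```

Note that the condition log w_n > M_φ M_β in the consistency section keeps this weight strictly positive whenever the similarities and coefficients are bounded as assumed there.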

Estimation
The joint community detection criterion needs to be optimized over both the community assignments e and the feature parameters β. Using block coordinate descent, we optimize JCDC by alternately optimizing over the labels with fixed parameters and over the parameters with fixed labels, and iterating until convergence.
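The alternating scheme can be sketched as a block coordinate ascent loop. This is a hedged skeleton with hypothetical subroutine signatures; `update_labels` and `update_beta` stand in for the two subroutines described in the following subsections:

```python
import random

def fit_jcdc(A, F, K, update_labels, update_beta, w_n=5.0, alpha=1.0, max_iter=20):
    """Block coordinate ascent skeleton for JCDC: alternate a label update
    (beta fixed) with a coefficient update (labels fixed) until the labels
    stop changing or max_iter is reached."""
    n = len(A)
    e = [random.randrange(K) for _ in range(n)]        # random initial labels
    beta = [[0.0] * len(F[0]) for _ in range(K)]       # one vector per community
    for _ in range(max_iter):
        beta = update_beta(A, F, e, beta, w_n, alpha)      # concave subproblem
        new_e = update_labels(A, F, e, beta, w_n, alpha)   # greedy label switching
        if new_e == e:
            break
        e = new_e
    return e, beta
```

In practice one would run the loop from several random starting labelings and keep the assignment with the highest criterion value, since the label subproblem is non-convex.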

Optimizing over label assignments with fixed weights
When the parameters β are fixed, all edge weights w_ijk = W(f_i, f_j, β_k; w_n) can be treated as known constants. It is infeasible to search over all K^n possible label assignments, and, like many other community detection methods, we rely on a greedy label-switching algorithm to optimize over e, specifically tabu search (29), which updates the label of one node at a time. Since our criterion involves the number of nodes in each community |E_k|, no easy spectral approximations are available. Fortunately, our method allows for a simple local approximate update which does not require recalculating the entire criterion. For a given node i considered for label switching, the algorithm will assign it to community k rather than l if

(S_kk + 2 S_{i↔k}) / (|E_k| + 1)^α + S_ll / |E_l|^α > S_kk / |E_k|^α + (S_ll + 2 S_{i↔l}) / (|E_l| + 1)^α ,    (2.1)

where S_kk is twice the total edge weight within community k, and S_{i↔k} is the sum of edge weights between node i and all the nodes in E_k. When |E_k| and |E_l| are large, we can ignore the +1 in the denominators, and (2.1) becomes

S_{i↔k} / |E_k|^α > S_{i↔l} / |E_l|^α ,    (2.2)

which allows for a "local" update of the label of node i without calculating the entire criterion. This also highlights the impact of the tuning parameter α: when α = 1, the two sides of (2.2) can be viewed as the average weights of all edges connecting node i to communities E_k and E_l, respectively, so our method assigns node i to the community with which it has the strongest connection. When α ≠ 1, the left-hand side of (2.2) is multiplied by a factor (|E_k|/|E_l|)^{1−α}. Suppose |E_k| is larger than |E_l|; then choosing 0 < α < 1 indicates a preference for assigning a node to the larger community, while α > 1 favors smaller communities. A detailed numerical investigation of the role of α is provided in the Supplemental Material. The edge weights involved in (2.2) depend on the tuning parameter w_n. When β = 0, all weights are equal to w_n − 1; on the other hand, w_ijk ≤ w_n for all values of β. Therefore, w_n/(w_n − 1) is the maximum factor by which our method can reweigh an edge.
When w n is large, w n /(w n − 1) ≈ 1, and thus the information from the network structure dominates. When w n is close to 1, the ratio is large and the feature-driven edge weights have a large impact. See the Supplemental Material for more details on the choice of w n .
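The local rule (2.2) can be sketched as follows. This is a simplified illustration; `weight(i, j, m)` is a placeholder for the reweighted edge A_ij · w_ijm, and the name is ours:

```python
def better_label(i, k, l, weight, e, alpha=1.0):
    """Local rule (2.2): prefer community k over l for node i when
    S_{i<->k} / |E_k|^alpha > S_{i<->l} / |E_l|^alpha, where S_{i<->m}
    is the total (reweighted) edge weight between i and community m."""
    def side(m):
        members = [j for j, ej in enumerate(e) if ej == m and j != i]
        total = sum(weight(i, j, m) for j in members)
        return total / max(len(members), 1) ** alpha
    return side(k) > side(l)
```

With α = 1 this compares the average connection strength of node i to each community, matching the interpretation given above.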
While the tuning parameter w_n controls the amount of influence features can have on community detection, it does not affect the estimated parameters β for a fixed community assignment. This is easy to see by rearranging the terms in (1.2):

R(e, β; w_n) = w_n Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} A_ij − g(e, β; A, F) ,

where the function g(e, β; A, F) = Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} A_ij exp(−⟨φ_ij, β_k⟩) does not depend on w_n. Note that the term containing w_n does not depend on β.

Optimizing over weights with fixed label assignments
Since we chose a concave edge weight function (1.3), for a given community assignment e the joint criterion is a concave function of β_k, and it is straightforward to optimize over β_k by gradient ascent. The role of β_k is to control the impact of different features on each community. One can show by a Taylor-series type expansion around the maximum (details omitted), and also observe empirically, that the estimated β̂_k's are correlated with the feature similarities between nodes in community k. In other words, our method tends to produce large estimated coefficients in communities where the corresponding similarities are large, and can drive coefficients to extreme values in communities where most of the similarities φ_ij are negative (recall that the similarities are standardized, so this cannot happen in all communities). To avoid these extreme solutions, we subtract a penalty term λ‖β‖_1 from the criterion (1.2) while optimizing over β. We use a very small value of λ (λ = 10^{−5} everywhere in the paper), which safeguards against numerically unstable solutions but has very little effect on the other estimated coefficients.
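The per-community subproblem can be sketched as plain gradient ascent. This is a simplified reading: we use the exponential weight form discussed above, handle the ℓ1 penalty with a plain subgradient, and all names are ours:

```python
import math

def fit_beta_k(pairs, w_n=5.0, lam=1e-5, lr=0.1, steps=200, p=2):
    """Gradient ascent for one community's coefficients beta_k, maximizing
    sum over within-community edges of a_ij * (w_n - exp(-<phi_ij, beta_k>))
    minus lam * ||beta_k||_1. `pairs` is a list of (a_ij, phi_ij) for the
    edges inside the community; a hedged sketch of the concave subproblem."""
    beta = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for a, phi in pairs:
            g = a * math.exp(-sum(ph * b for ph, b in zip(phi, beta)))
            for d in range(p):
                grad[d] += g * phi[d]
        for d in range(p):
            # subgradient of the l1 penalty (0 at beta[d] == 0)
            grad[d] -= lam * (1 if beta[d] > 0 else -1 if beta[d] < 0 else 0)
            beta[d] += lr * grad[d]
    return beta
```

A feature whose similarities are consistently positive on within-community edges receives a clearly positive coefficient, while a feature with zero similarity contributes nothing and its coefficient stays at zero.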

Consistency
The proposed JCDC criterion (1.2) is not model-based, but under certain models it is asymptotically consistent. We consider the setting where the network A and the features F are generated independently from a stochastic block model and a uniformly bounded distribution, respectively. Let P(A_ij = 1) = ρ_n P_{c_i c_j}, where ρ_n is a factor controlling the overall edge density and c = (c_1, ..., c_n) is the vector of true labels. Assume the following regularity conditions hold:

1. There exist global constants M_φ and M_β such that ‖φ_ij‖_2 ≤ M_φ and ‖β_k‖_2 ≤ M_β for all k, and the tuning parameter w_n satisfies log w_n > M_φ M_β.

2. Let C_k = {i : c_i = k} denote the true communities. There exists a global constant π_0 such that |C_k| ≥ π_0 n > 0 for all k.

3. For all k ≠ l, 2(K − 1) P_kl < min(P_kk, P_ll).
Condition 1 states that node feature similarities are uniformly bounded. This is a mild condition in many applications, as the node features are often themselves uniformly bounded. In practice, for numerical stability, the user may want to standardize node features and discard individual features with very low variance before calculating the corresponding similarities φ. Condition 2 guarantees that communities do not vanish asymptotically. Condition 3 enforces assortativity. Since the estimated labels e are only defined up to an arbitrary permutation of communities, we measure the agreement between e and c by

d(e, c) = min_{σ ∈ P_K} (1/n) Σ_{i=1}^n 1(σ(e_i) ≠ c_i) ,

where P_K is the set of all permutations of {1, ..., K}.

Theorem 1 (Consistency of JCDC). Under conditions 1, 2 and 3, if nρ_n → ∞, w_n ρ_n → ∞, and the tuning parameter α satisfies

max_{k≠l} 2(K − 1) P_kl / min(P_kk, P_ll) ≤ α ≤ 1 ,

then for any fixed δ > 0, P(d(ê, c) ≥ δ) → 0 as n → ∞. The proof is given in the Supplemental Material.
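The permutation-minimized disagreement d(e, c) can be computed by brute force over the K! relabelings, which is fine for the small K used here; a minimal sketch with our own names:

```python
from itertools import permutations

def label_distance(e, c, K):
    """d(e, c): smallest fraction of disagreements between label vectors
    e and c over all relabelings of the K communities."""
    n = len(e)
    best = n
    for sigma in permutations(range(K)):
        best = min(best, sum(sigma[e[i]] != c[i] for i in range(n)))
    return best / n
```

For example, two partitions that differ only by swapping the community names have distance zero.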

Simulation studies
We compare JCDC to three representative benchmark methods which use both the adjacency matrix and the node features: CASC (Covariate-Assisted Spectral Clustering, (24)), CESNA (Communities from Edge Structure and Node Attributes, (19)), and BAGC (BAyesian Graph Clustering, (20)). In addition, we also include two standard methods that use either the network alone (SC, spectral clustering on the Laplacian regularized with a small constant τ = 10^{−7}, as in (21)) or the node features alone (KM, K-means performed on the node feature vectors, with 10 random initial starting values). We generate networks with n = 150 nodes and K = 2 communities of sizes 100 and 50 from the degree-corrected stochastic block model as follows. The edges are generated independently, with probability θ_i θ_j p if nodes i and j are in the same community and r θ_i θ_j p if they are in different communities. We set p = 0.1 and vary r from 0.25 to 0.75. We set 5% of the nodes in each community to be "hub" nodes with degree correction parameter θ_i = 10, and set θ_i = 1 for the remaining nodes. All resulting probabilities are thresholded at 0.99 to ensure there are no values over 1. These settings result in an average expected node degree ranging approximately from 22 to 29. For each node i, we generate two features, one "signal" feature related to the community structure and one "noise" feature whose distribution is the same for all nodes. The "signal" feature follows the distribution N(μ, 1) for nodes in community 1 and N(−μ, 1) for nodes in community 2, with μ varying from 0.5 to 2 (larger μ corresponds to stronger signal). For use with CESNA, which only allows categorical node features, we discretize the continuous node features by partitioning the real line into 20 bins at the 0.05, 0.1, ..., 0.95 quantiles.
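The simulation setting above can be sketched as a generator; this is a simplified reproduction of the setup, with parameter names of our own choosing:

```python
import random

def simulate(n1=100, n2=50, p=0.1, r=0.5, mu=1.0, hub_frac=0.05, seed=0):
    """Degree-corrected SBM with hub nodes (theta = 10) plus one signal
    and one noise Gaussian feature per node, as in the simulation study."""
    rng = random.Random(seed)
    n = n1 + n2
    c = [0] * n1 + [1] * n2
    theta = [1.0] * n
    for start, size in ((0, n1), (n1, n2)):          # 5% hubs per community
        for i in range(start, start + int(hub_frac * size)):
            theta[i] = 10.0
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = theta[i] * theta[j] * p * (1.0 if c[i] == c[j] else r)
            prob = min(prob, 0.99)                   # threshold at 0.99
            if rng.random() < prob:
                A[i][j] = A[j][i] = 1
    F = [[rng.gauss(mu if c[i] == 0 else -mu, 1.0),  # signal feature
          rng.gauss(0.0, 1.0)]                       # noise feature
         for i in range(n)]
    return A, F, c
```

With the default settings the realized average degree lands in the range reported in the text.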
For JCDC, based on the study of the tuning parameters in the Supplemental Material, we use α = 1 and compare two values of w_n, w_n = 1.5 and w_n = 5. Finally, agreement between the estimated communities and the true community labels is measured by normalized mutual information (NMI), a measure commonly used in the network literature which ranges between 0 (random guessing) and 1 (perfect agreement). For each configuration, we repeat the experiment 30 times and record the average NMI over the 30 replications. Figure 1 shows heatmaps of the average NMI for all methods under these settings, as a function of r and μ. As one would expect, the performance of spectral clustering (c), which uses only the network information, is only affected by r (the larger r is, the harder the problem), and the performance of K-means (d), which uses only the features, is only affected by μ (the larger μ is, the easier the problem). JCDC is able to take advantage of both network and feature information by estimating the coefficients β from data, and its performance only deteriorates when neither source is informative. The informative features are more helpful with a larger value of w_n (a), and conversely uninformative features affect performance slightly more with a lower value of w_n (b), but this effect is not strong. CASC (e) appears to inherit a sharp phase transition from spectral clustering, which forms the basis of CASC; the sharp transition is perhaps due to the different community sizes and hub nodes, which are both challenging for spectral clustering. CESNA (f) and BAGC (g) do not perform as well overall, with BAGC often clustering all the hub nodes into one community.
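For completeness, NMI can be computed from the joint label counts. The paper does not specify which of the common normalizations it uses; this sketch uses the arithmetic mean of the two entropies:

```python
import math
from collections import Counter

def nmi(x, y):
    """Normalized mutual information between two label vectors,
    normalized by the arithmetic mean of the two entropies (one of
    several common normalizations)."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    mi = sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
             for (a, b), c in pxy.items())
    hx = -sum((c / n) * math.log(c / n) for c in px.values())
    hy = -sum((c / n) * math.log(c / n) for c in py.values())
    if hx + hy == 0:
        return 1.0  # both partitions trivial: treat as perfect agreement
    return 2 * mi / (hx + hy)
```

NMI is invariant to relabeling of the communities, which is why no permutation matching is needed here.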

The world trade network
The world trade network (30) connects 80 countries based on the amount of trade in metal manufactures between them in 1994 or, when not available for that year, in 1993 or 1995. Nodes are countries, and edges represent a positive amount of import and/or export between two countries. Each country also has three categorical features: the continent (Africa, Asia, Europe, N. America, S. America, and Oceania), the country's structural position in the world system in 1980 (core, strong semi-periphery, weak semi-periphery, periphery), and in 1994 (core, semi-periphery, periphery). Figures 2 (a) to (c) show the adjacency matrix rearranged by sorting the nodes on each of the features. The partition by continent (Figure 2(a)) clearly shows community structure, whereas the other two features show hubs (core-status countries trade with everyone) and no assortative community structure. We therefore compare the partitions found by all competing methods to the continents, and omit the three Oceania countries from further analysis because no method is likely to detect such a small community. The two world position variables ('80 and '94) are used as features, treated as ordinal variables. The results for all methods are shown in Figure 2, along with NMI values comparing the detected partition to the continents. All methods were run with the true value K = 5.
The result of spectral clustering agrees much better with the continents than that of K-means, indicating that the community structure in the adjacency matrix is closer to the continents than the structure contained in the node features. JCDC obtains the highest NMI value, CASC performs similarly to spectral clustering, whereas CESNA and BAGC both fail to recover the continent partition. Note that no method was able to estimate Africa well, likely due to the disassortative nature of its trade seen in Figure 2 (a). Figure 2 (e) indicates that JCDC estimated N. America, S. America and Asia with high accuracy, but split Europe into two communities, since it was run with K = 5 and could not pick up Africa due to its disassortative structure. Table 1 contains the estimated feature coefficients, suggesting that the 1980 "world position" had the most influence on the estimated communities.

The lawyer friendship network
The second dataset we consider is a friendship network of 71 lawyers in a New England corporate law firm (31). Node features include status (partner or associate), office location, years with the firm, age, practice, law school attended, and gender. The partition by status (Figure 3(a)) shows a strong assortative structure, and so does the partition by office (Figure 3(c)) restricted to Boston and Hartford, but the small Providence office does not show any kind of structure. Thus we chose the status partition as a reference point for comparisons, though other partitions are certainly also meaningful. Communities estimated by the different methods are shown in Figure 3 (i)-(o), all run with K = 2. Spectral clustering and K-means have equal and reasonably high NMI values, indicating that both the adjacency matrix and the node features contain community information. JCDC obtains the highest NMI value, with w_n = 5 performing slightly better than w_n = 1.5. CASC improves upon spectral clustering by using the feature information, with NMI just slightly lower than that of JCDC with w_n = 1.5. CESNA and BAGC have much lower NMI values, possibly because of hub nodes, or because they detect communities corresponding to something other than status.
The estimated feature coefficients are shown in Table 2. Office location, years with the firm, and age appear to be the features most correlated with the community structure of status, for both partners and associates, which is natural. Practice, school, and gender are less important, though it may be hard to estimate the influence of gender accurately since there are relatively few women in the sample.

Discussion
Our method incorporates feature-based weights into a community detection criterion, improving detection compared to using just the adjacency matrix or the node features alone, if the cluster structure in the features is related to the community structure in the adjacency matrix. It has the ability to estimate coefficients for each feature within each community and thus learn which features are correlated with the community structure. This ability guards against including noise features which can mislead community detection. The community detection criterion we use is designed for assortative community structure, with more connections within communities than between, and benefits the most from using features that have a similar clustering structure.
This work can be extended in several directions. Variation in node degrees, often modeled via the degree-corrected stochastic block model (4), which regards degrees as independent of community structure, may in some cases be correlated with node features, and accounting for degree variation jointly with features can potentially further improve detection. Another useful extension is to overlapping communities. One possible way to do that is to optimize each summand in JCDC (1.2) separately and in parallel, which can create overlaps, but would require careful initialization. Statistical models that specify exactly how features are related to community assignments and edge probabilities can also be useful, though empirically we found no standard models of this kind that could compete with the non-model-based JCDC on real data. This suggests that more involved and perhaps data-specific modeling will be necessary to accurately describe real networks, and some of the techniques we proposed, such as community-specific feature coefficients, could be useful in that context.

Acknowledgments

E.L. is partially supported by NSF grants DMS-1106772 and DMS-1159005. J.Z. is partially supported by NSF grant DMS-1407698 and NIH grant R01GM096194.

A.1 Choice of tuning parameters
The JCDC method involves two user-specified tuning parameters, α and w n . In this section, we investigate the impact of these tuning parameters on community detection results via numerical experiments.
First we study the impact of α, which determines the algorithm's preference for larger or smaller communities. We study its effect on the estimated community sizes as well as on the accuracy of the estimated community labels. We generate data from a stochastic block model with n = 120 nodes and K = 2 communities of sizes n_1 and n_2 = n − n_1. We set the within-community edge probabilities to 0.3 and the between-community edge probabilities to 0.15, and vary n_1 from 60 to 110. Since α is not related to feature weights, we set the features to a constant, resulting in unweighted networks. The results are averaged over 50 replications and shown in Figure 4.

Figure 4: Estimation accuracy measured by NMI as a function of the tuning parameter α. Solid lines correspond to JCDC and horizontal dotted lines correspond to spectral clustering, which does not depend on α.
We report the size of the larger estimated community in Figure 4(a), and the accuracy of community detection as measured by normalized mutual information (NMI) in Figure 4(b). For comparison, we also record the results from spectral clustering (horizontal lines in Figure 4), which do not depend on α. When communities are balanced (n 1 = n 2 = 60), JCDC performs well for all values of α, producing balanced communities and uniformly outperforming spectral clustering in terms of NMI. In general, larger values of α in JCDC result in more balanced communities, while smaller α's tend to produce a large and a small community. In terms of community detection accuracy, Figure 4(b) shows that the JCDC method outperforms spectral clustering over a range of values of α, and this range depends on how unbalanced the communities are. For simplicity and ease of interpretation, we set α = 1 for all the simulations and data analysis reported in the main manuscript; however, it can be changed by the user if information about community sizes is available.
Next, we investigate the impact of w_n, which controls the influence of the features. To study the tradeoff between the two sources of information (network and features), we generate two different community partitions. Specifically, we consider two communities of sizes n_1 and n_2, with n_1 + n_2 = n = 120. We generate two label vectors c^A and c^F, with c^A_i = 1 for i = 1, ..., n_1 and c^A_i = 2 for i = n_1 + 1, ..., n, while the other label vector has c^F_i = 1 for i = 1, ..., n_2 and c^F_i = 2 for i = n_2 + 1, ..., n. Then the edges are generated from the stochastic block model based on c^A, and the node features are generated based on c^F. We generate two node features: one feature is sampled from the distribution N(μ, 1) if c^F_i = 1 and N(0, 1) if c^F_i = 2; the other feature is sampled from N(0, 1) if c^F_i = 1 and N(−μ, 1) if c^F_i = 2. We fix μ = 3 and set α = 1, as discussed above. We set the within- and between-community edge probabilities to 0.3 and 0.15, respectively, the same as in the previous simulation, and vary the value of w_n from 1.1 to 10. Finally, we look at the agreement between the estimated communities ê and each of c^A and c^F, as measured by normalized mutual information. The results are shown in Figure 5.

Figure 5: NMI between the estimated community structure ê and the network community structure c^A (solid lines) and the feature community structure c^F (dotted lines). Note that when n_1 = n_2 = 60, c^A = c^F, so the solid and dotted lines coincide.
As expected, smaller values of w_n give more influence to the features, and thus the estimated community structure agrees better with c^F than with c^A. As w_n increases, the estimate ê becomes closer to c^A. In the manuscript, we compare two values of w_n, 1.5 and 5.

A.2 Proofs
We start by summarizing notation. Let E_1, ..., E_K be the estimated communities corresponding to the label vector e, and C_1, ..., C_K the true communities corresponding to the label vector c. Recall that we estimate e by maximizing the criterion R over e and β, where

R(e, β; w_n) = Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} A_ij W(f_i, f_j, β_k; w_n) ,

and define ê = argmax_e max_β R(e, β; w_n), where ê and the corresponding β̂ are defined up to a permutation of community labels. Recall that we assumed A and F are conditionally independent given c, and define R_0, the "population version" of R, as

R_0(e, β; w_n) = Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} ρ_n P_{c_i c_j} E[W(f_i, f_j, β_k; w_n)] .

The expectation in R_0 is taken with respect to the distribution of the node features, which determine the similarities φ_ij.

Lemma 2. Under conditions 1 and 2, if w_n ρ_n → ∞ and 0 < α ≤ 2, we have

max_{e,β} |R(e, β; w_n) − R_0(e, β; w_n)| / (w_n ρ_n n^{2−α}) = O_p(1/√(w_n ρ_n)) .
Proof of Lemma 2. We first bound the difference between R and R_0 for fixed e and β, using Hoeffding's inequality and the fact that 2[n/2] ≥ n − 1, where [x] is the integer part of x. Taking t = w_n ρ_n n^{2−α} |E_k|^{α−2} δ and applying the union bound controls the deviation for each fixed (e, β). Next, we take a uniform bound over β. Consider a finite set B_ε; it is straightforward to verify that B_ε is an ε-net on [−M_β, M_β]^p, the space of β_k's. For each β_k, let β(β_k, B_ε) be the best approximation to β_k in B_ε. The approximation error term vanishes because of the choice of ε and the fact that |E_k| < n. Finally, taking a union bound over all possible community assignments completes the proof of Lemma 2.
We now proceed to investigate the "population version" of our criterion, R_0. Define U ∈ R^{K×K} by U_kl = Σ_{i=1}^n 1[e_i = k, c_i = l]/n, and let D be the diagonal K × K matrix with π_1, ..., π_K on the diagonal, where π_k = Σ_{i=1}^n 1[c_i = k]/n is the fraction of nodes in community C_k. Roughly speaking, U is the confusion matrix between e and c, and U = DO for a permutation matrix O means the estimation is perfect. Define

g(U) = Σ_{k=1}^K (Σ_{l,l'} U_kl U_kl' P_{ll'}) / (Σ_l U_kl)^α .

Each estimated community assignment e induces a unique U = U(e). It is not difficult to verify that

g(U(e)) = Σ_{k=1}^K Σ_{i,j ∈ E_k} P_{c_i c_j} / (|E_k|^α n^{2−α}) .

Lemma 3. Under conditions 1 and 2, there exists a constant C_2 such that

max_{e,β} |R_0(e, β; w_n) / (w_n ρ_n n^{2−α}) − g(U(e))| ≤ C_2 / w_n .

Proof of Lemma 3. By definition, the left-hand side equals max_{e,β} Σ_{k=1}^K |E_k|^{−α} Σ_{i,j ∈ E_k} P_{c_i c_j} E[exp(−⟨φ_ij, β_k⟩)] / (w_n n^{2−α}) ≤ C_2 / w_n, where C_2 := K π_0^{α−2} exp(M_φ M_β) max_{kl} P_kl, and the two inequalities needed follow from conditions 1 and 2, respectively.
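The population criterion g can be evaluated numerically; this sketch uses our reconstruction of the definition implied by the expression for g(U(e)) above, and checks Lemma 4's conclusion on a small assortative example:

```python
def g(U, P, alpha=1.0):
    """Population criterion g(U) = sum_k (sum_{l,l'} U_kl U_kl' P_{ll'})
    / (sum_l U_kl)^alpha, for a K x K confusion matrix U and block
    probability matrix P (our reconstruction of the text's definition)."""
    K = len(U)
    total = 0.0
    for k in range(K):
        row = sum(U[k])
        if row <= 0:
            continue  # empty estimated community
        quad = sum(U[k][a] * U[k][b] * P[a][b]
                   for a in range(K) for b in range(K))
        total += quad / row ** alpha
    return total
```

For a balanced two-block example with P_kk = 0.3 and P_kl = 0.05, the condition of Lemma 4 holds at α = 1 (2(K−1)P_kl / min(P_kk, P_ll) = 1/3 ≤ 1), and g is indeed maximized at the diagonal matrix D over confusion matrices with the same column sums.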

Lemma 4.
Under condition 3, if α ∈ [max_{1≤k<l≤K} 2(K−1)P_kl / min(P_kk, P_ll), 1], then over all U satisfying Σ_{k=1}^K U_kl = π_l for 1 ≤ l ≤ K, g(U) is uniquely maximized at U = DO for O ∈ O_K, where O_K denotes the set of K × K permutation matrices.
Proof of Lemma 4. We begin with the decomposition (6.1) of g(D) − g(U). For 0 < α ≤ 1, since U_kl ≥ 0 for all k and l, we have (Σ_{k=1}^K U_kl)^{2−α} ≥ Σ_{k=1}^K U_kl^{2−α}. By the mean value theorem, there exists ξ_kl ∈ (0, Σ_{a≠l} U_ka) at which the relevant increment can be evaluated. Finally, we will need the following inequality (6.3): for 0 < α ≤ 2 and x, y ≥ 0 satisfying x + y ≤ u, with equality holding at x = y = 0. To verify (6.3) when 0 < x + y ≤ u, divide through by u^{3−α}; the first inequality in the resulting display implies that a necessary condition for equality to hold in (6.3) is xy = 0. We now lower bound the first term on the right-hand side of (6.1).
where the last equality is obtained by applying (6.3) with x = U_kl, y = U_kl' and u = Σ_{a=1}^K U_ka. Plugging (6.4) into (6.1), we have g(D) − g(U) ≥ 0.
It remains to show that equality holds only if U = DO for some O ∈ O_K. Note that the last inequality in (6.4) is obtained from (6.3), where equality holds only when xy = 0. The corresponding condition for equality to hold in (6.4) is thus U_kl U_kl' = 0 for all k and all l ≠ l'. Therefore, for each k there is only one l such that U_kl ≠ 0, i.e., U = DO for some O ∈ O_K.
Proof of Theorem 1. By Lemma 2 and Lemma 3, we have

max_{e,β} |R(e, β; w_n) / (w_n ρ_n n^{2−α}) − g(U(e))| = O_p(1/√(w_n ρ_n)) .    (6.5)

It is straightforward to verify that, for any e, 2 d(e, c) = min_{O∈O_K} ‖U(e) − DO‖_1, where ‖Q‖_1 = Σ_{k,l} |Q_kl|. We now show, by contradiction, that x_n → 0 implies y_n → 0. First, note that y_n is non-increasing. Now if y_0 = lim_{n→∞} y_n > 0, by compactness of the set U_{y_0} = {U : min_{O∈O_K} ‖U − DO‖_1 ≥ y_0} and continuity of the function g, the supremum of g(U) over U ∈ U_{y_0}, which equals g(D), is attained in U_{y_0}. This contradicts Lemma 4. Now let x_n = 1/(w_n ρ_n)^{1/4}. By the assumptions of Theorem 1, x_n → 0, which yields y_n → 0. Also, x_n / (1/√(w_n ρ_n)) = (w_n ρ_n)^{1/4} → ∞, so by (6.5) the criterion gap is eventually smaller than x_n with probability tending to one, and therefore d(ê, c) → 0 in probability.