Planted Hitting Set Recovery in Hypergraphs

In various application areas, networked data is collected by measuring interactions involving some specific set of core nodes. This results in a network dataset containing the core nodes along with a potentially much larger set of fringe nodes that all have at least one interaction with a core node. In many settings, this type of data arises for structures that are richer than graphs, because they involve the interactions of larger sets; for example, the core nodes might be a set of individuals under surveillance, where we observe the attendees of meetings involving at least one of the core individuals. We model such scenarios using hypergraphs, and we study the problem of core recovery: if we observe the hypergraph but not the labels of core and fringe nodes, can we recover the "planted" set of core nodes in the hypergraph? We provide a theoretical framework for analyzing the recovery of such a set of core nodes and use our theory to develop a practical and scalable algorithm for core recovery. The crux of our analysis and algorithm is that the core nodes are a hitting set of the hypergraph, meaning that every hyperedge has at least one node in the set of core nodes. We demonstrate the efficacy of our algorithm on a number of real-world datasets, outperforming competitive baselines derived from network centrality and core-periphery measures.


Core and fringe nodes in networks
The data that we collect in practice is typically incomplete in several fundamental and recurring ways, and this is particularly true for graph and network data modeling complex systems throughout the social and biological sciences [29,33,40,42,43]. One common type of incompleteness arises from a particular data collection procedure where one wants to record all interactions involving a set of core nodes C. This results in what is sometimes called a 'core-fringe' structure [6], where the resulting network structure contains the core nodes along with a (potentially) much larger set of fringe nodes F that are observed in some interaction with a node in C. For example, in survey data, C might represent a set of respondents, and F a set of individuals with whom they interact; this is a common scenario in snowball and respondent-driven sampling [29,31,35]. Another situation arises from restrictions on surveillance. For example, the Enron [39] and Avocado email datasets are common data sources for email and networks research. Both datasets come from collecting emails involving a core C representing the employees of a company. Through this process, the data includes emails involving a fringe F of people outside the company who receive email from or send email to members of C. However, we would not have access to emails sent between people outside the company. We will see in our data analysis later that the size of the fringe is often much larger than the size of the core in these datasets.
One can imagine similar scenarios in a variety of settings. In intelligence data, one might record all of the attendees at meetings that involve at least one of a set of individuals under surveillance. In a similar manner, telecommunications providers can observe group text messages where only some of the members of the group are subscribers. Furthermore, measurements of Internet structure at the IP-layer from individual service providers provide only a partial measurement [56], and large-scale Web crawls do not encapsulate the entire network structure of hyperlinks [7,8].

Figure 1. (Right) A planted hitting set (red nodes) in a hypergraph, constituting the core nodes. Hyperedges are illustrated by ellipses encompassing nodes, which include both core and non-core (fringe) nodes. We study how well we can recover the core (red) nodes if they are not identified.
While a dataset may be equipped with labels of which nodes are members of the core and which are members of the fringe, there are several situations in which such labels are unavailable. For example, consider the email datasets described above. Such data could be released by a hacker from outside the organization, or a leaker from inside it, who has collected email from a set of core accounts but not released the identity of the accounts [36]. Similarly, in the case of intelligence data that records attendees of meetings, the data could be released with the identities of the individuals under surveillance redacted. In other cases, the metadata may be lost simply to issues of data maintenance; such concerns are central to the research areas of data provenance and preservation [13,45,53,55].
The question of recovering the identity of the core nodes in a dataset with core-fringe structure is therefore a fundamental question arising in different applications, and for multiple different reasons.

Core-fringe recovery in hypergraphs. Here, we study the problem of identifying the set of core nodes, when the labels of which nodes are core and which are fringe are unavailable and the data is represented by a hypergraph, i.e., a set of nodes along with a set of hyperedges that each connect multiple nodes. Figure 1 illustrates this model. The hypergraph model is apt for many of the scenarios described above, including emails, meeting attendees, and group messages. While the hypergraph model is richer than a more standard graph model, the higher-order structure also presents challenges. If a hyperedge connects a (latent) core node and several fringe nodes, an algorithm trying to identify the core has to consider all nodes in the hyperedge as possible candidates.
We first provide theoretical results that motivate the development of an algorithm for recovery of the core. The key structural property of our analysis is that the core nodes constitute a hitting set, meaning every hyperedge in the hypergraph has at least one node in the core (figure 1). This property comes from how the data is measured: every included hyperedge comes from an interaction of a member of the core with some other set of nodes. We thus think of the core as a 'planted' hitting set that we must find in the data. Our first theoretical results rely on two assumptions (both relaxable, to some extent): the planted hitting set corresponding to the core is (i) minimal, that is, no node could be removed to make the hitting set smaller; and (ii) relatively small, i.e., bounded by a constant k. We analyze how large the union of all minimal hitting sets, denoted U(k), can be. We build upon results in fixed-parameter tractability [18,19] to show that |U(k)| is Θ(k^r) in the worst case, where r is the size of the largest hyperedge. Importantly, this bound is independent of the hypergraph's size, and U(k) is guaranteed to contain our planted hitting set if it meets the modeling assumptions.
Furthermore, we prove that a classical greedy approximation algorithm for set cover relates to partial recovery of the planted hitting set, even in cases where the above assumptions do not hold. Specifically, in a hypergraph where r is the size of the largest hyperedge, we show that the output of the algorithm must overlap the planted hitting set in at least a 1/(cr) fraction of its nodes, where c bounds the ratio of the planted hitting set size to the minimum hitting set size.
Motivated by these two main results, we design a core recovery algorithm, which we call the union of minimal hitting sets (UMHS). This algorithm is practical and scalable, using only simple subroutines and having computational complexity roughly linear in the size of the data. The idea is to run the greedy algorithm multiple times with different random initializations to find several minimal hitting sets and then to simply take the union of these hitting sets. The first part of our theory says that the output should not grow too large, and the second part of our theory says the output should also overlap the planted hitting set. We show that our method consistently outperforms a number of baselines derived from network centrality and core-periphery structure for hypergraphs on several real-world datasets.
Our approach to the planted hitting set problem is purely combinatorial. This contrasts with typical planted problems such as the stochastic block model (SBM) [1,21,48] and planted clique recovery [2,23], which are based on probabilistic structure. Thus, the methods we use are fundamentally different than what is common for these types of problems. Moreover, we show that if random block-model-type structure exists for our problem, then the problem becomes substantially easier.

Problem setup and theoretical results for core recovery
We start with some formal definitions to facilitate our presentation of the theoretical results. A hypergraph G consists of a set of nodes V and a set of hyperedges E, each a subset of V of size at least two. The rank r of a hypergraph is the maximum size of any hyperedge, i.e., r = max e∈E |e|. A hypergraph is called r-uniform if all hyperedges have cardinality equal to r.
In our setup, there is some unidentified subset C ⊆ V that is designated as the set of 'core nodes.' Our goal is to find C, knowing that C is a hitting set of G (i.e., for every e ∈ E, there is some v ∈ e ∩ C). We say that C is planted since we do not know its identity. Absent any additional information, we are helpless: it could be that C = V. However, we can do better by assuming two properties of C (both relaxable): (i) C is a minimal hitting set in G and (ii) C is bounded in size, i.e., |C| ≤ k. Under these constraints, it is certainly true that C is contained in the union of all minimal hitting sets having at most k nodes, which is a principal object of our analysis.

Definition 1. Given a hypergraph G, U(k) is the union of all minimal hitting sets of G of size at most k.
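As a concrete illustration of definition 1, the following brute-force sketch (our own, purely illustrative code, not the recovery algorithm developed later) computes U(k) directly on a tiny hypergraph; it is exponential in the number of nodes and intended only to make the definitions precise.

```python
from itertools import combinations

def is_hitting_set(S, hyperedges):
    # A hitting set contains at least one node of every hyperedge.
    return all(S & e for e in hyperedges)

def union_of_minimal_hitting_sets(nodes, hyperedges, k):
    # Brute-force U(k): union of all minimal hitting sets of size <= k.
    hyperedges = [set(e) for e in hyperedges]
    U = set()
    for size in range(1, k + 1):
        for cand in combinations(nodes, size):
            S = set(cand)
            # Minimal means no single node can be dropped from S.
            if is_hitting_set(S, hyperedges) and not any(
                is_hitting_set(S - {v}, hyperedges) for v in S
            ):
                U |= S  # S is a minimal hitting set of size <= k
    return U

# Tiny example: the only minimal hitting sets of size <= 2 are
# {1, 2} and {1, 5}, so U(2) = {1, 2, 5} and nodes 3, 4 are excluded.
edges = [{1, 3}, {1, 4}, {2, 5}]
print(union_of_minimal_hitting_sets([1, 2, 3, 4, 5], edges, k=2))
```

Note that U(2) here is a strict subset of the node set: any minimal hitting set of size at most 2 must use node 1 to cover the first two hyperedges, which is exactly the kind of structure the theory below exploits.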
We next show bounds on |U(k)| that are independent of the size of the graph. This says that C is contained in a relatively small set if it satisfies the conditions. Furthermore, we can find U(k) in time exponential in k but polynomial in the size of the graph [18]; however, we develop more practical algorithms in section 3. Afterwards, we relax the assumptions on minimality and obtain results on partial recovery, i.e., algorithms that are guaranteed to find a part of C. Finally, we show how these results improve under a random hypergraph model.

Minimal hitting sets
First let us suppose that we are given G and asked to find C, where we know that C is minimal and |C| ≤ k. We ask the following question: is it possible to find a small set H ⊆ V, whose size is independent of the size of G, such that C ⊆ H? In this section, we answer the question in the affirmative.
We at least know that the union of all minimal hitting sets of size at most k, denoted U(k) (definition 1), must contain C. But how large can this set be? The following result says that U(k) actually cannot be too large.

Lemma 2.
In a hypergraph of rank r, U(k) has size Θ(k^r) in the worst case.
Damaschke and Molokov established most parts of this result using sophisticated techniques from parameterized complexity theory [18,19]. Chitnis et al independently established similar (albeit looser) results for streaming data applications [14]. Here, we provide a concrete, complete characterization and proof that should be accessible to anyone familiar with basic graph theory by using the celebrated sunflower lemma of Erdős and Rado [25]. A consequence of lemma 2 is the following theorem about how small a set we can find that is guaranteed to contain the core:

Theorem 3. Let C be a planted minimal hitting set with |C| ≤ k in a hypergraph of rank r. Then we can find a set D of size O(k^r) that is guaranteed to contain C.

We begin by providing a self-contained proof of the upper bound in lemma 2 and theorem 3. Namely, let G be a hypergraph of rank r, and let U(k) be the union of all minimal hitting sets in G of size at most k. We will find a function g(r, k) = O(k^r) so that |U(k)| ≤ g(r, k). To find such a function, we use a combinatorial structure called a sunflower [10]. A collection of r sets {X_1, X_2, . . . , X_r} is an r-sunflower if for all pairs of distinct sets X_i, X_j, the pairwise intersection X_i ∩ X_j is equal to the mutual intersection ∩_{i=1}^{r} X_i. The term comes from the fact that we can think of the X_i as the 'petals' of the sunflower: they share a 'center' ∩_{i=1}^{r} X_i and are otherwise pairwise disjoint from each other. A famous theorem of Erdős and Rado asserts that every sufficiently large collection of sets must contain a large sunflower.
Lemma 4 (sunflower lemma [25]). Define the function σ(r, k) = r!(k − 1)^r. Any collection of more than σ(r, k) sets, each of size at most r, must contain a k-sunflower.
We use this result to find the desired function g(r, k) = O(k^r). Later, we will establish an asymptotically matching lower bound. We now state and prove the upper bound.

Lemma 5. Let G be a hypergraph of rank r. Then |U(k)| ≤ r · σ(r, k + 1) = O(k^r).

Proof. We assume that G has at least one hitting set of size at most k, as otherwise the statement of lemma 5 holds with U(k) equal to the empty set. We modify G according to the following iterative procedure, which operates in discrete phases. We start by defining H_0 = G, and produce a succession of hypergraphs H_1, H_2, . . . , with H_t the hypergraph at the end of phase t. For each hypergraph H_t, let C(H_t) denote the set of all hitting sets of size at most k in H_t, and let C*(H_t) denote the set of all minimal hitting sets of size at most k in H_t. We will establish by induction that each hypergraph H_t has at least one hitting set of size at most k.
At the start of phase t, we ask whether H_{t−1} has more than σ(r, k + 1) hyperedges, where σ is the function from lemma 4. If it does not, we declare the procedure to be over. If it does, then we find a (k + 1)-sunflower in the set of hyperedges, consisting of hyperedges X_{t,1}, X_{t,2}, . . . , X_{t,k+1}. Now, the core of the sunflower found in phase t, which we will denote by Y_t = ∩_{i=1}^{k+1} X_{t,i}, must be nonempty, for if it were empty, then H_{t−1} would contain k + 1 pairwise disjoint hyperedges, and hence could not have a hitting set of size at most k. We define H_t as follows: starting from H_{t−1}, we remove the hyperedges X_{t,1}, X_{t,2}, . . . , X_{t,k+1} and add the hyperedge Y_t. We observe the following properties: (a) every hitting set of H_{t−1} of size at most k contains a node of Y_t; and (b) C(H_{t−1}) = C(H_t), and consequently C*(H_{t−1}) = C*(H_t).
To see property (a), note that by the definition of a sunflower, the sets X_{t,i}\Y_t are pairwise disjoint for i = 1, 2, . . . , k + 1. Thus, if C is a set of size at most k that does not contain any node of Y_t, then C must be disjoint from at least one of the sets X_{t,i}, and hence it cannot be a hitting set for H_{t−1}. To see property (b), recall that H_t is obtained from H_{t−1} by removing the hyperedges X_{t,1}, X_{t,2}, . . . , X_{t,k+1} and adding the hyperedge Y_t; since H_{t−1} and H_t are hypergraphs on the same node set, their respective collections of hitting sets C(H_{t−1}) and C(H_t) are over the same sets of elements. If C ∈ C(H_{t−1}), then by property (a) C contains a node of Y_t, so C hits the new hyperedge Y_t, and every other hyperedge of H_t already appears in H_{t−1}; hence C ∈ C(H_t). Conversely, if C ∈ C(H_t), then C contains a node of Y_t ⊆ X_{t,i} for each i, so C hits every hyperedge of H_{t−1}; hence C ∈ C(H_{t−1}). Thus C(H_{t−1}) and C(H_t) are equal as sets, and consequently C*(H_{t−1}) = C*(H_t).
We apply the same process in each phase, producing H_1 from G = H_0, then producing H_2 from H_1, and so forth. By induction using property (b), since H_{t−1} contains at least one hitting set of size at most k, so does H_t, as required. Since the procedure reduces the number of hyperedges in each phase, it must eventually terminate, in some phase t*. We write H* = H_{t*}, and by applying property (b) transitively,

C*(G) = C*(H*).   (1)

We say that a node is isolated if it does not belong to any hyperedge. Let T* be the set of non-isolated nodes of H*. Since the procedure stopped when faced with H*, H* had at most σ(r, k + 1) hyperedges. Since each hyperedge has at most r elements, we have

|T*| ≤ r · σ(r, k + 1) = r · r! · k^r,   (2)

and hence |T*| is bounded by O(k^r). Finally, recall that U(k) is the union of all minimal hitting sets in G of size at most k. We can write U(k) = ∪_{C∈C*(G)} C, and by equation (1), U(k) = ∪_{C∈C*(H*)} C. No minimal hitting set C in H* can contain an isolated node v, since then C\{v} would also be a hitting set in H*. From this, U(k) ⊆ T*, and |U(k)| ≤ |T*| = O(k^r) by equation (2).
Next, we show asymptotic tightness for constant hypergraph rank r, by constructing a family of hypergraphs for which |U(k)| = Ω(k r ). The following lemma combined with lemma 5 proves theorem 3.

Lemma 6. For each constant r ≥ 2, there exists an infinite family of rank-r hypergraphs G and parameters k, with both the number of nodes in G and k going to infinity, for which |U(k)| = Ω(k^r).
Proof. For a parameter b, consider a graph Γ_b that consists of b disjoint complete b-ary trees of depth r. We will refer to the roots of the b trees in Γ_b as the root nodes of Γ_b, and to the leaves of the trees in Γ_b as leaf nodes. Let T_b be the hypergraph on the node set of Γ_b whose hyperedges consist of all root-to-leaf paths in Γ_b, from the root of one of the b trees in Γ_b to a leaf of the corresponding tree (the example of T_2 for r = 3 is in figure 2). All hyperedges in T_b have size r. Let λ be any function that maps nodes of Γ_b to numbers in {1, 2, . . . , b} with the following properties: (i) λ is a bijection from the roots of Γ_b to {1, 2, . . . , b}; and (ii) for each non-leaf node v in Γ_b, λ is a bijection from the children of v to {1, 2, . . . , b}. For each node v, we will call λ(v) its label (with respect to λ), and we will say that λ is a consistent labeling if it satisfies (i) and (ii). Let v*_λ be the unique leaf node for which all the nodes on its root-to-leaf path have the label b. For a consistent labeling λ, let C*_λ be the set of all nodes v satisfying two properties: every node lying strictly above v on its root-to-leaf path has label b, and either λ(v) ≠ b or v = v*_λ. We observe the following two facts.
(a) For every consistent labeling λ, the set C*_λ is a minimal hitting set for T_b. (b) Every node v of T_b belongs to at least one set of the form C*_λ for some consistent labeling λ. To prove (a), we first claim that C*_λ is a hitting set. Consider any hyperedge X of T_b, consisting of a root-to-leaf path P in Γ_b. If all nodes on P have label b, then the leaf node of P is v*_λ, and hence X intersects C*_λ. Otherwise, consider the first node v on the path P that has a label unequal to b. Then v ∈ C*_λ, so again X intersects C*_λ, and C*_λ is a hitting set. We now argue that it is minimal. If we delete v*_λ from C*_λ, then C*_λ\{v*_λ} would be disjoint from the hyperedge X consisting of the root-to-leaf path to v*_λ. If we delete some other node v from C*_λ, then a hyperedge X that passes through v does not pass through any other node of C*_λ on its way from root to leaf, and so C*_λ\{v} would be disjoint from X.
To prove (b), we simply choose any consistent labeling λ such that λ(u) = b for all nodes u on the unique path from a root of Γ_b to v (other than v itself), and λ(v) ≠ b. Now, we consider the size of a set of the form C*_λ. Since C*_λ contains b − 1 nodes from each of the first r − 1 levels of the trees in Γ_b, and b from the lowest level, we have |C*_λ| = (r − 1)(b − 1) + b ≤ rb. Since each set C*_λ is a minimal hitting set of size k = (r − 1)(b − 1) + b, from (b) we see that the union U(k) of all minimal hitting sets of size at most k is the entire node set of T_b.
The number of nodes in T_b is greater than b^r, and hence |U(k)| > b^r. Since k ≤ rb, we have that |U(k)| > b^r ≥ (k/r)^r = r^{−r} k^r.
These theoretical results establish justification for why a set of core nodes might be identifiable: if the core is a minimal hitting set, it is certainly contained in a set whose size is independent of the size of the graph. Next, we consider the case when the core may not be a minimal hitting set.

Non-minimal hitting sets
Here we present results on which types of nodes in a planted hitting set C must be contained in U(|C|) when the hitting set is itself not minimal; this matters in practice, where C is often not minimal.

Greedy matchings intersect the planted hitting set
Our first results are based on finding hitting sets with a greedy algorithm for the set cover problem. The classical algorithm is simple (algorithm 1): loop through each hyperedge, and if no vertex in the current hyperedge is in the current cover S, add all of the nodes in the hyperedge to S. The greedy algorithm produces a set that is both a hitting set and a maximal matching. To see that it is a hitting set, suppose some hyperedge h contains no node of S. Then no node of h was in S when the algorithm processed h, so all nodes of h would have been added to S, a contradiction. The output is a maximal matching because the hyperedges whose nodes were added to S are pairwise disjoint, and any other hyperedge intersects one of them: otherwise, its nodes would have been added to S when the algorithm processed it.
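The greedy procedure just described is only a few lines in Python; this sketch (with names of our own choosing) processes the hyperedges in the order given:

```python
def greedy_hitting_set(hyperedges):
    """Greedy maximal matching: whenever the current hyperedge is not
    yet hit, add ALL of its nodes to S. The hyperedges whose nodes get
    added are pairwise disjoint, so S is simultaneously a hitting set
    and the node set of a maximal matching."""
    S = set()
    for e in hyperedges:
        e = set(e)
        if not S & e:   # hyperedge not yet hit
            S |= e      # add every node of the hyperedge
    return S

edges = [{1, 3, 4}, {2, 4, 5}, {2, 6, 7}]
S = greedy_hitting_set(edges)
# The first hyperedge contributes {1, 3, 4}; the second is already hit
# (it contains node 4); the third contributes {2, 6, 7}.
assert all(S & e for e in edges)
```

Processing the hyperedges in a different order generally yields a different hitting set, which is exactly the degree of freedom exploited by the practical algorithm in section 3.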
Let k* be the minimum hitting set size (the smallest size of all minimal hitting sets). The output of algorithm 1 is an r-approximation if the hypergraph G is an r-uniform hypergraph: the hyperedges whose nodes were added to S are pairwise disjoint, so any hitting set contains at least one of the r nodes from each of them, and we therefore have k* ≥ |S|/r.

Algorithm 1. Maximal matching r-approximation to the minimum hitting set that intersects the core.
1: Input: hypergraph G = (V, E)
2: Output: hitting set S
3: S ← ∅
4: for each hyperedge e = (u_1, . . . , u_i) ∈ E do
5: if e ∩ S = ∅ then
6: S ← S ∪ {u_1, . . . , u_i}
7: end if
8: end for
9: return S

It turns out that this greedy algorithm must also partially overlap with the planted hitting set. We formalize this in the following lemma.

Lemma 7.
Let B be a hitting set with |B| ≤ bk*. If the input to algorithm 1 is a rank-r hypergraph, then the output S satisfies |S ∩ B| ≥ |B|/(br).

Proof. The output S is the union of h pairwise disjoint hyperedges for some h. Since each hyperedge contains at most r vertices, h ≥ |S|/r, and since S is itself a hitting set, |S| ≥ k*, so h ≥ k*/r. Since B is a hitting set, it must contain at least one vertex from each of the h disjoint hyperedges comprising S, and each such vertex belongs to S. From this, we obtain |S ∩ B| ≥ h ≥ k*/r ≥ (1/(br))|B|.

Lemma 7 guarantees that the greedy algorithm will output a result that overlaps any hitting set, including the planted one. Thus, we immediately get the following corollary.

Corollary 8.
If the planted hitting set C has size |C| ≤ ck*, then the output of algorithm 1 is a set S with |S ∩ C|/|C| ≥ 1/(cr), i.e., S intersects at least a 1/(cr) fraction of C. We can process the hyperedges in any order within algorithm 1, and this will be important for our practical recovery algorithm that we develop later. Another immediate corollary of lemma 7 says that regardless of the processing order, the outputs must be fairly similar.

Corollary 9. Let S_1 and S_2 be outputs of algorithm 1 on an r-uniform hypergraph under any two orders of processing the hyperedges. Then |S_1 ∩ S_2| ≥ |S_2|/r^2.

Classes of nodes that must be in U(|C|)
Our next results concern nodes that must appear in U(|C|). First, a node u ∈ C must be in U(|C|) if there is a hyperedge containing u in which all of the other nodes are not in C.

Lemma 10.
Let G be a rank-r hypergraph with planted hitting set C. If u ∈ C and there exists a hyperedge (u, v_1, . . . , v_i) where v_1, . . . , v_i ∉ C, then u ∈ U(|C|).

Proof. We perform a pruning of C as follows. Check whether there exists a node w ∈ C such that C\{w} is still a hitting set; if this is the case, update C to C\{w} and continue pruning. Upon termination, the remaining set is a minimal hitting set C' ⊆ C. Clearly, |C'| ≤ |C|, so that C' ⊆ U(|C|). However, u can never be pruned: the hyperedge (u, v_1, . . . , v_i) contains no node of C other than u, so removing u would leave this hyperedge unhit. Hence u ∈ C' ⊆ U(|C|).

The above lemma provides a class of nodes in a planted hitting set which are guaranteed to be in the UMHS of size no more than that of the planted hitting set itself. Although this lemma is purely combinatorial, it will be useful for analyzing random hypergraphs in section 2.3.
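The pruning step used in this proof is easy to implement; a sketch (our own illustrative code) is below. A single pass over the nodes suffices: once a node cannot be removed, some hyperedge is hit only by that node, and subsequent removals only shrink the set, so that hyperedge remains hit only by it.

```python
def prune_to_minimal(S, hyperedges):
    # Repeatedly drop any node whose removal leaves a hitting set;
    # the result is a minimal hitting set contained in S.
    S = set(S)
    hyperedges = [set(e) for e in hyperedges]
    for v in sorted(S):  # iterate over a snapshot of S
        if all((S - {v}) & e for e in hyperedges):
            S.remove(v)
    return S

# {1, 2, 3} is a non-minimal hitting set of the two hyperedges below;
# pruning leaves the single node 2, which alone hits both.
print(prune_to_minimal({1, 2, 3}, [{1, 2}, {2, 3}]))
```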
Next, we show that nodes adjacent to other nodes deeply integrated into C must also be in U(|C|). We do not use this result anywhere else in the text, but we include it for theoretical interest and intuition about unions of minimal hitting sets. Formally, define a node v to be in the interior of a hitting set C if all hyperedges containing v consist entirely of nodes in C. Our next result is that a node u ∈ C must appear in U(|C|) if all the other nodes in some hyperedge containing u are in the interior of C.
Lemma 11. Let G be a rank-r hypergraph with planted hitting set C. If u ∈ C and there exists a hyperedge (u, v_1, . . . , v_i) where v_1, . . . , v_i are in the interior of C, then u ∈ U(|C|).

Proof.
Since v_1, . . . , v_i are in the interior of C, C_0 = C\{v_1, . . . , v_i} is a hitting set as well. Now perform pruning on C_0 as in the previous proof, the output of which is a minimal hitting set C' ⊆ C_0; since clearly |C'| ≤ |C|, we must have that C' ⊆ U(|C|). Clearly u was not deleted during pruning, since the hyperedge (u, v_1, . . . , v_i) contains no node of C_0 other than u. Hence u ∈ C' ⊆ U(|C|).

Thus far, we have made no assumptions on the structure of the hypergraph. In the next section, we show that assuming a stochastic-block-model-type hypergraph structure gives substantial improvements in recovery results.

Recovery in a random hypergraph model
In this section we show that if we make additional assumptions on the structure of the problem, it becomes much easier. Most research that studies guarantees on planted structure recovery in graphs (e.g., community detection or planted clique detection) already makes these types of assumptions in order to make the problem feasible. The fact that we can do anything without such assumptions is quite remarkable.
More specifically, we show here that under certain stochastic-block-model-like assumptions for three-uniform hypergraphs laid out in this section-which would already be standard in other graph recovery problems-we can reduce the size of the UMHS (and hence of the upper bound of the size of the output of our algorithm) from O(k^3) in the general case to O(k^2 log k) in this special structured case.
We emphasize that our main results and the algorithms we use for analyzing data are purely combinatorial, making no assumption on this type of probabilistic structure. And the algorithms we develop later perform extremely well even when no such structure is assumed. Instead, we simply wish to highlight that some of our bounds can be improved under random hypergraph models. We also have an additional goal of introducing our problem to the rich literature on finding planted structure in random graphs.
Let us assume that the hyperedges are generated according to an SBM for hypergraphs [28]. One block will be the core C and the other the fringe nodes F. We assume that there is zero probability of a hyperedge containing only nodes in F. Explicitly, the probability of a hyperedge among nodes in C is p, while the probability of a hyperedge containing at least one node in C and at least one node in F is q.
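A sketch of a sampler for this kind of model (function and parameter names are ours, not from the cited reference): every size-r set contained in C appears independently with probability p, every mixed core-fringe set with probability q, and fringe-only sets never appear, so C is a hitting set of the sample by construction.

```python
import random
from itertools import combinations

def sample_sbm_hypergraph(core, fringe, p, q, r=3, seed=0):
    rng = random.Random(seed)
    core, fringe = set(core), set(fringe)
    edges = []
    for cand in combinations(sorted(core | fringe), r):
        n_core = sum(v in core for v in cand)
        if n_core == 0:
            continue            # fringe-only hyperedges never occur
        prob = p if n_core == r else q
        if rng.random() < prob:
            edges.append(set(cand))
    return edges

edges = sample_sbm_hypergraph(core={0, 1, 2}, fringe={3, 4, 5}, p=0.9, q=0.3)
# Every sampled hyperedge contains at least one core node.
assert all(e & {0, 1, 2} for e in edges)
```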
We first provide a lemma on the independence number of hypergraphs drawn from the hypergraph SBM. This will help us control the size of U(|C|).

Lemma 12. Let α(G) be the independence number of an r-uniform hypergraph G drawn from the SBM on n vertices. Then Pr(α(G) ≥ k) ≤ (n choose k)(1 − p)^(k choose r).

Proof. We follow a proof technique common in the combinatorics literature [41]. Let z = (n choose k), let S_1, . . . , S_z be the size-k subsets of vertices of G, and let X_i be an indicator random variable for S_i being an independent set of G, so X_i = 1 happens only when none of the (k choose r) possible hyperedges among the vertices of S_i appears in the hyperedge set of G. This happens with probability at most (1 − p)^(k choose r). By a union bound, Pr(α(G) ≥ k) = Pr(Σ_i X_i ≥ 1) ≤ Σ_i Pr(X_i = 1) ≤ z(1 − p)^(k choose r).

This lemma helps us prove a result for three-uniform hypergraphs.

Theorem 13. For three-uniform hypergraphs, with x as defined in [19].

Algorithm 2. Union of minimal hitting sets (UMHS).
1: Input: hypergraph G = (V, E), number of iterations N
2: S ← ∅
3: for j = 1, . . . , N do
4: Randomly order the hyperedges of G
5: S' ← algorithm 1(G)
6: Prune S' to be a minimal hitting set of G
7: S ← S ∪ S'
8: end for
9: return S

Theorem 13 enables us to find a small set J that contains the core C with high probability.

Theorem 14.
Let C with |C| = k be a planted hitting set in a three-uniform hypergraph drawn from the SBM parameterized by p and q with ck nodes in total, for some c > 1. Then with high probability in k, we can find a set J with C ⊆ J whose size is bounded as in theorem 13.

Proof. The number of ways to link a given vertex in the core to two nodes outside the core is ((c − 1)k choose 2), so the probability of a given core node having at least one hyperedge to two nodes outside the core is w = 1 − (1 − q)^((c − 1)k choose 2). Thus w^k is the probability of each of the k nodes in the core having at least one hyperedge to two nodes outside the core. This probability tends to 1 as k → ∞, so by using lemma 10, C ⊆ U(|C|) with probability tending to 1. Setting J = U(|C|) and using theorem 13 gives the result.
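As a small verification of the limit used in the proof (in terms of the quantities w, c, k, and q defined there), Bernoulli's inequality gives

```latex
w^k \;=\; \Bigl(1 - (1-q)^{\binom{(c-1)k}{2}}\Bigr)^{k}
     \;\ge\; 1 - k\,(1-q)^{\binom{(c-1)k}{2}} \;\longrightarrow\; 1
\quad \text{as } k \to \infty,
```

since the binomial coefficient grows as Θ(k²), so the subtracted term decays like k·e^{−Θ(k²)} for any fixed q ∈ (0, 1).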
Of course, the above theorem is only useful if we have a bound on the quantity x, since only in that case are we able to improve our bound on the size of J from O(k^3) to O(k^2 log k). Absent any additional information, x ≤ k* [19].
In the following section, we build a practical algorithm for recovery of planted hitting sets based on the UMHS, the theory for which has been examined so far.

Union of minimal hitting sets (UMHS): a practical core recovery algorithm
Based on the theoretical results described above, we now develop a practical algorithm for recovering a planted hitting set in a hypergraph. To put our theory into practice, we place several of our theoretical results in context. First, lemma 7 says that algorithm 1 produces outputs that must overlap with the planted hitting set. Moreover, this was true regardless of the order in which algorithm 1 processed the hyperedges. Thus, the basic idea of our algorithm is straightforward: we find many hitting sets S from algorithm 1 and take their union.
By corollary 8, if the planted hitting set C is close to a minimum hitting set, then any output from algorithm 1 will recover a large fraction of it. Furthermore, by corollary 9, the outputs of algorithm 1 must overlap by a modest amount, so the union of the outputs cannot grow too quickly. To limit growth further, we can prune the output of algorithm 1 to be a minimal hitting set (this also tends to give better results in practice). Now, theorem 3 says that the union of pruned outputs will be bounded, provided the graph is large enough and the outputs are small enough. This turns out to be the case in our experimental results, as we will see in the next section. To summarize, our procedure is as follows: (a) find a hitting set S from algorithm 1, processing the hyperedges in a random order; (b) prune S to be a minimal hitting set S'; and (c) repeat steps (a) and (b) several times and output the union of the sets S' found in step (b).
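Putting steps (a)-(c) together, a compact sketch of the procedure in Python (our own illustrative code, combining the greedy and pruning subroutines):

```python
import random

def umhs(hyperedges, n_iter=20, seed=0):
    """Union of minimal hitting sets: run the greedy maximal-matching
    algorithm under random hyperedge orderings, prune each output to a
    minimal hitting set, and return the union of the pruned outputs."""
    rng = random.Random(seed)
    hyperedges = [set(e) for e in hyperedges]

    def hits_all(S):
        return all(S & e for e in hyperedges)

    U = set()
    for _ in range(n_iter):
        rng.shuffle(hyperedges)             # (a) random processing order
        S = set()
        for e in hyperedges:                # (a) greedy hitting set
            if not S & e:
                S |= e
        for v in sorted(S):                 # (b) prune to a minimal hitting set
            if hits_all(S - {v}):
                S.remove(v)
        U |= S                              # (c) union over repetitions
    return U

edges = [{1, 2, 6}, {2, 3, 7}, {1, 3, 8}]
U = umhs(edges)
assert all(U & e for e in edges)  # the output is itself a hitting set
```

Each iteration costs one greedy pass plus one pruning pass, matching the linear-in-|E| per-iteration behavior discussed below.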
This method is formalized in algorithm 2. Again, the key rationale is that lemma 7 holds regardless of the ordering in which the hyperedges are processed. A similar algorithm for the special case of two-uniform hypergraphs (i.e., graphs) was recently analyzed [6].
The output of algorithm 2 is by construction a UMHS of the input hypergraph G, and we will refer to the algorithm as UMHS. The theoretical guarantees from section 2.1 apply here. Specifically, under the assumption that the planted hitting set C is itself minimal, we are guaranteed full recovery with sufficient iterations. However, planted hitting sets are rarely minimal in practice (in fact, they are not for the datasets we consider). Nevertheless, theorem 3 still guarantees that the output of the algorithm will not grow very quickly, and our experiments verify that this is the case in practice.
Importantly, algorithm 2 runs in time linear in the number of hyperedges. In particular, suppose the input is a hypergraph G = (V, E) of rank r with a minimum hitting set of size k* and that we invoke algorithm 1 N times. Then the algorithm produces an output in O(N r^2 k* |E|) time. This is because every call to algorithm 1 takes O(r|E|) time, while pruning the output to be a minimal hitting set of G requires O(r^2 k* |E|) time in the worst case (in practice, it is much faster). Thus, in the worst case, pruning to a minimal hitting set takes more time than one iteration of the greedy algorithm. However, this increase in processing time is not prohibitive, since it only increases the running time by a factor of rk*. Nonetheless, pruning is crucial, as in practice we find that it drastically reduces the output size.

Table 1. Summary statistics of core-fringe hypergraph datasets. We construct r-uniform hypergraphs from six corpora for rank r = 3, 4, 5 along with non-uniform hypergraphs where the rank is at most 25. The DBLP and Math tags datasets are collections of 50 hypergraphs, so we report the value range. The fringe tends to be much larger than the core in these datasets.

          Number of nodes                Size of core (|C|)         Number of hyperedges
          r=3    r=4    r=5    ≤25      r=3   r=4   r=5   ≤25      r=3     r=4     r=5    ≤25
Enron     1283   976    869    4423     84    73    77    132      2361    1048    614    15 653
Avocado   5521   3510   2965   36 244   227   218   211   243      21 690  12 455  6973   96 966
W3C       1778   749    308    14 317   353   237   125   1509     1882    389     88     19 821
DBLP      …

Experimental results
We now test algorithm 2 on a number of real-world datasets and compare it to several baselines derived from network centrality and core-periphery structure, finding that it consistently outperforms these baselines. Lastly, we show that algorithm 2 does not need many iterations (the parameter N) to reach its reported performance levels.

Data
The five datasets we use fall into three broad classes based on the types of planted hitting set problems we derive from them. The first group is three email datasets (Enron, Avocado, and W3C), where the planted hitting set is a group of people at an organization, and hyperedges come from emails involving multiple addresses. Next, in the Digital Bibliography & Library Project (DBLP) dataset, a planted hitting set is a set of authors who publish at the same conference in a given year, and we consider several such conferences. Finally, in the Math Stack Exchange tagging dataset, the core is a one-hop neighborhood of a node, which is a hitting set for the two-hop neighborhood of that node; again, we consider several hitting sets. In this sense, the DBLP and tagging datasets are collections of hypergraphs with core-fringe structure. We now provide additional details, and table 1 lists summary statistics of the datasets.
Enron [39], Avocado, and W3C [16]. The hyperedges in these datasets are emails, where the nodes are the sender and all receivers (more specifically, their email addresses); repeated hyperedges are discarded. In the Enron data, the core is a set of 150 employees whose email was made public when the company was under United States federal investigation. The Avocado corpus comes from a now-defunct technology company, where the core is the set of employee email accounts (we exclude accounts associated with conference rooms or mailing lists from the core). Finally, the W3C dataset comes from emails on mailing lists related to W3C; the core is the set of all email addresses with a w3c.org domain.
DBLP. DBLP is an online computer science bibliography. We construct hyperedges from coauthorship data in conference proceedings. We randomly sampled 50 conferences and constructed a set of core nodes C from the authors of a given conference in a randomly selected year. We then took the core-fringe hypergraph to be all hyperedges involving at least one of these authors.
In total, we have 50 core-fringe hypergraphs.
Math tags [5]. This dataset comes from tags applied to questions on the Math Stack Exchange website. Hyperedges are sets of tags that have been applied to the same question. We sampled 50 tags with relatively small cores and then formed a core set of nodes C from the one-hop neighborhood of each tag, i.e., from the tag and all tags that appear as co-tags with it. The set C is then a planted hitting set for the two-hop neighborhood of that tag (i.e., all hyperedges containing at least one tag from the planted hitting set C). In total, we again have 50 core-fringe hypergraphs.
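The tag-neighborhood construction, and more generally the way each core-fringe hypergraph is assembled from a core set C, can be sketched as follows (a simplified illustration; the function names are ours):

```python
def one_hop_core(hyperedges, seed_node):
    # Core C = the seed tag plus every tag that co-appears with it
    # in some hyperedge (its one-hop neighborhood).
    C = {seed_node}
    for e in hyperedges:
        if seed_node in e:
            C |= set(e)
    return C

def core_fringe_hypergraph(hyperedges, C):
    # Keep exactly the hyperedges that touch the core. By construction,
    # C is then a hitting set of the resulting hypergraph, and every
    # other node in it is a fringe node.
    return [e for e in hyperedges if C & set(e)]
```

For the tags data, the hyperedges kept by `core_fringe_hypergraph` are the two-hop neighborhood of the seed tag; the same second step describes the email and DBLP constructions once C is fixed.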
We constructed two types of filtered hypergraph datasets. First, we constructed three-, four-, and five-uniform hypergraphs (derived as sub-hypergraphs). By imposing uniformity, we can compare UMHS performance to that of centrality-based algorithms that work only on uniform hypergraphs (in particular, H- and Z-eigenvector centrality). This also lets us study the performance of our algorithm as we change the rank of the hypergraph. In addition, since our algorithm naturally handles nonuniform hypergraphs, we also constructed nonuniform sub-hypergraphs of rank at most 25 (in the Stack Exchange data, posts have at most five tags, so we include the entire hypergraph for this data). Limiting the rank to 25 cuts out outliers; in particular, it removes large mailing lists in the email data and specialized papers with many authors. This subsampling omits a relatively small fraction of hyperedges: 11% for the Enron dataset, 1% for the Avocado dataset, and 0.2% for the W3C dataset.

Table 2. Planted hitting set recovery performance for our algorithm and competitive baseline algorithms. We compare our proposed UMHS method against five hypergraph centrality measures (degree [38], clique-motif graph eigenvector [4], a projected-graph version of PageRank centrality [47], Z-eigenvector [4], and H-eigenvector [4]) as well as two core-periphery measures (Borgatti-Everett [12] and k-core [52]). Each method produces an ordering of vertices, and we measure performance by precision at the core size (the fraction of the top-|C| ranked nodes that are in C) and AUPRC for r-uniform hypergraphs (r = 3, 4, 5), as well as nonuniform hypergraphs with at most 25 nodes in each hyperedge. The DBLP and tags datasets are collections of 50 hypergraphs, and we report the mean and standard deviation for these. UMHS scores outperforming all baselines by at least an 8% relative improvement are bold; any baseline score outperforming UMHS by at least 8% is also bold. Our UMHS method performs best on all the non-uniform hypergraphs and in many of the uniform cases.
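The sub-hypergraph filtering described above is simple to state in code (a minimal sketch; the function names are ours):

```python
def uniform_sub_hypergraph(hyperedges, r):
    # Keep only hyperedges with exactly r distinct nodes,
    # yielding an r-uniform sub-hypergraph.
    return [e for e in hyperedges if len(set(e)) == r]

def rank_limited_sub_hypergraph(hyperedges, max_rank=25):
    # Keep hyperedges with at most max_rank nodes, trimming outliers
    # such as large mailing lists or papers with many authors.
    return [e for e in hyperedges if len(set(e)) <= max_rank]
```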

Recovery results
Figure 3. Planted hitting set recovery improves with more iterations before leveling off (top row), and similarly, the output size increases with more iterations before leveling off (bottom row). Fewer than 50 iterations typically reaches peak performance in our email datasets. The leveling of the output size is consistent with theorem 3 and corollary 8, which say that our algorithm's output cannot grow too large.

We tested the UMHS algorithm on several datasets, and it consistently outperforms the other algorithms. For baselines, we use two families of techniques. First, we use notions of network centrality developed for hypergraphs, with the idea that nodes in the core could be identified via large centrality scores; specifically, we compare against the five hypergraph centrality measures listed in table 2.
Second, we use notions of network core-periphery decompositions, where we expect that nodes in C will be identified as 'core' in this sense. We use two algorithms: (a) Borgatti-Everett network core-periphery scores [12] in the weighted clique graph; and (b) the k-core decomposition [52] based on hypergraph degree.
All of these methods work for both uniform and non-uniform hypergraphs, except for Z-eigenvector and H-eigenvector centrality, which are designed only for uniform hypergraphs. All of these methods also induce an ordering on the nodes. We induce an ordering from the output of algorithm 2 by sorting its nodes by degree and then appending the remaining nodes (those not in the output of algorithm 2), also sorted by degree. Given an ordering on the nodes, we measure performance in terms of precision at core size, i.e., the fraction of the first |C| nodes in the ordering that are in C, as well as the area under the precision-recall curve (AUPRC). We use AUPRC as opposed to the area under the receiver operating characteristic (ROC) curve due to class imbalance [20]; namely, most nodes are not in the core. Table 2 reports the results of all methods on all datasets. On the nonuniform hypergraphs, UMHS is superior to all other baselines in terms of both precision at core size and AUPRC by substantial margins. On the uniform sub-hypergraphs, UMHS outperforms the baselines by wide margins on the Enron, Avocado, and Math tags datasets for all uniformities r in terms of precision at core size. UMHS also does well on the three-uniform W3C hypergraphs. On DBLP and the other W3C hypergraphs, projected PageRank and the simple degree heuristic perform well, although our algorithm still outperforms them on a large share of samples. The trends are similar in terms of AUPRC, with UMHS performing best on the same datasets as above; in this case, UMHS is more competitive even on the datasets where it is weakest: the four-uniform and five-uniform W3C hypergraphs, and even the DBLP hypergraphs.
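The induced ordering and the two evaluation measures can be sketched as follows. This is an illustrative implementation with function names of our choosing; here average precision serves as the AUPRC summary, a common convention, though the paper's exact computation may differ:

```python
def induced_ordering(umhs_nodes, all_nodes, degree):
    # Order the UMHS output by degree (descending), then append the
    # remaining nodes, also sorted by degree.
    inside = sorted((v for v in all_nodes if v in umhs_nodes),
                    key=lambda v: -degree[v])
    outside = sorted((v for v in all_nodes if v not in umhs_nodes),
                     key=lambda v: -degree[v])
    return inside + outside

def precision_at_core_size(ordering, core):
    # Fraction of the top-|C| ranked nodes that truly lie in the core C.
    k = len(core)
    return sum(1 for v in ordering[:k] if v in core) / k

def average_precision(ordering, core):
    # Average precision over the ranking: precision is accumulated at
    # each rank where a core node appears, then normalized by |C|.
    hits, ap = 0, 0.0
    for i, v in enumerate(ordering, start=1):
        if v in core:
            hits += 1
            ap += hits / i
    return ap / len(core)
```

For example, with ordering [1, 2, 3, 4] and core {1, 3}, precision at core size is 0.5 (one of the top two nodes is in the core) and the average precision is (1/1 + 2/3)/2 = 5/6.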
Overall, the performance of many algorithms degrades as we increase the uniformity of the hypergraph. This makes sense in general, as the core nodes are increasingly hidden in larger hyperedges. In the case of UMHS, we have a more specific interpretation. As the hyperedges get larger, there may be hyperedges that contain just one node in the core C, with the rest in the fringe; by the greedy structure of algorithm 1, all nodes of such a hyperedge, including the fringe nodes, may be placed in the hitting set. Furthermore, the bounds in corollaries 8 and 9 also degrade as we increase the uniformity r. Nevertheless, our proposed UMHS algorithm still outperforms the baselines at an aggregate level.

Recovery as a function of output size
Our UMHS algorithm (algorithm 2) has a single tuning parameter: the number of iterations N, i.e., the number of calls to the sub-routine for greedy maximal matching (algorithm 1). Here we examine the performance of UMHS as a function of N. Specifically, we analyze (a) the fraction of core nodes in the planted hitting set that are recovered and (b) the output size of algorithm 2, both as a function of the number of iterations N. Figure 3 shows these statistics for the three email datasets. We highlight two important findings. First, we only need around 50 iterations to achieve high recovery rates; each iteration is fast, and the entire algorithm takes at most a few minutes on the larger datasets. Second, the UMHS size tends to increase sharply over the first few iterations and then levels off. These results are consistent with our theory: theorem 3 and corollary 8 both provide theoretical justification for why the output should not grow too large.

Related work
On the theoretical side, our problem can be thought of as an instance of a 'planted' problem, where a certain type of structure is planted or hidden in a graph and one must recover the latent structure given the graph. Well-studied problems in this space include planted clique, where one tries to find a clique placed in a sample from a G_{n,1/2} graph [2,22,23,26], and the planted partition or stochastic block model (SBM) recovery problem, where a random graph is sampled with probabilities dependent on latent labels of nodes, and the goal is to recover these labels [1,46,48,57]. These planted problems are based on some random process by which the graph was sampled. In our case of planted hitting sets, the hypergraph is deterministic, although our results could likely be sharpened under a random hypergraph model. Most related to our results is recent work on planted vertex covers; vertex covers are the special case of hitting sets for graphs (which are mathematically the same as two-uniform hypergraphs) [6]. As discussed above, the hypergraph model is more realistic for many datasets (especially email), given its ability to represent groups of more than two individuals at a time.
Within the field of network science, the idea of a small planted hitting set fits with two related ideas: node centrality and core-periphery structure. The former concept deals with finding important nodes (or ranking them) based on their connections, often modeled as a graph [9,11,30]. Nodes in hitting sets are central to hypergraphs almost by definition-every hyperedge must contain at least one of these nodes. Thus, we expect them to be 'central' in some sense. However, we found that existing measures of node centrality in hypergraphs did not recover planted hitting sets at the same levels as our UMHS algorithm.
Core-periphery structure is a mesoscale property of many networks, in which a densely connected core set of nodes coexists with a loosely connected periphery [17]. Such structure has been studied in sociology [24,44] and international trade [54], where the core-periphery distinction arises from differential status. More recently, core-periphery identification has become a broader tool for analyzing structure in general networks [32,37,49,51]. The planted hitting set that we aim to recover corresponds to an extreme type of core-periphery structure: due to the way in which we assume the hypergraph is measured, nodes on the periphery (the 'fringe' nodes) cannot be connected without a core node as an intermediary in a hyperedge.
Finally, core-fringe structure itself has received some attention. For example, the behavior of a core group of employees at a hedge fund has been analyzed in the context of their relationships with contacts outside of the company [50], and the core-fringe structure has been shown to influence graph-based link prediction algorithms [3]. Our research highlights additional richness to the problem when the underlying data model is a hypergraph.

Discussion
Network data is a partial view of a larger system [43]. A common case is when the interactions of some specified set of actors or nodes are under surveillance. This gives the network a 'core-fringe' structure: we can see all interactions involving the core but only the interactions of the fringe with the core. When data is leaked or metadata is lost over time due to data provenance issues, we would like to be able to recover the core and fringe labels for security or data maintenance purposes.
Here, we have studied this problem in the setting where the network data is a hypergraph, so the core is a hitting set. This setting is common in email data or in situations where groups of people meet. We used co-authorship as a proxy for the latter, but one can imagine settings in which one records the groups of attendees of meetings involving someone under surveillance. Theoretically, we showed that the UMHS cannot be too large and that the output of the well-known approximation algorithm for minimum hitting sets must overlap the core.
Using these results as motivation, we developed an extremely simple algorithm for recovering the core: take the union of the minimal hitting sets output by randomly initialized instances of the approximation algorithm. This method outperformed several strong baselines, including methods based on centrality and core-periphery structure.
However, our simple algorithm opens several avenues for improvement in future work. For instance, our model assumed an undirected and unweighted hypergraph structure; there are models of directed and weighted hypergraphs [27], as well as other hypergraph generalizations [15], that could improve the recovery algorithm in practice. In addition, theory on the number of calls to the approximation algorithm subroutine would be useful: in practice, only a few calls are sufficient, and assuming particular structure on the hypergraph could perhaps yield additional theoretical insight.