Model-based clustering in simple hypergraphs through a stochastic blockmodel

We propose a model to address the overlooked problem of node clustering in simple hypergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in the same hyperedge, as in co-authorship datasets. Our model generalizes the stochastic blockmodel for graphs: it assumes the existence of latent node groups and that hyperedges are conditionally independent given these groups. We first establish the generic identifiability of the model parameters. We then develop a variational approximation Expectation-Maximization algorithm for parameter inference and node clustering, and derive a statistical criterion for model selection. To illustrate the performance of our R package HyperSBM, we compare it with other node clustering methods on synthetic data generated from the model, as well as on data from a line clustering experiment and a co-authorship dataset.


Introduction
Over the past two decades, a wide range of models has been developed to capture pairwise interactions represented in graphs. However, modern applications in various fields have highlighted the necessity to consider high-order interactions, which involve groups of three or more nodes. Simple examples include triadic and larger group interactions in social networks (whose importance has been recognized early on, see Simmel, 1950), scientific co-authorship (Estrada and Rodríguez-Velázquez, 2006), interactions among more than two species in ecological systems (Muyinda et al., 2020;Singh and Baruah, 2021), or high-order correlations between neurons in brain networks (Chelaru et al., 2021). To formalize these high-order interactions, hypergraphs provide the most general framework. Similar to a graph, a hypergraph consists of a set of nodes and a set of hyperedges, where each hyperedge is a subset of nodes involved in an interaction.
In this context, it is important to distinguish simple hypergraphs from multiset hypergraphs, where hyperedges can contain repeated nodes. Multisets are a generalization of sets, allowing elements to appear with varying multiplicities. Recent reviews on high-order interactions can be found in the works of , Bick et al. (2021), and Torres et al. (2021).
Despite the increasing interest in high-order interactions, the statistical literature on this topic remains limited. Some graph-based statistics, such as centrality or clustering coefficient, have been extended to hypergraphs to aid in understanding the structure and extracting information from the data (Estrada and Rodríguez-Velázquez, 2006). However, these statistics do not fulfill the need for random hypergraph models.
Early analyses of hypergraphs have relied on their embedding into the space of bipartite graphs (see, e.g., ). Hypergraphs with self-loops and multiple hyperedges (i.e., weighted hyperedges with integer-valued weights) are equivalent to bipartite graphs. However, bipartite graph models were not specifically designed for hypergraphs and may introduce artifacts (see Section A in the Supplementary Material for more details).
Generalizing Erdős-Rényi's model of random graphs leads to uniformly random hypergraphs.
This model involves drawing uniformly at random from the set of all m-uniform hypergraphs (hypergraphs with hyperedges of fixed cardinality m) over a set of n nodes. However, similar to Erdős-Rényi's model for graphs, this hypergraph model is too simplistic and homogeneous to be used for statistical analysis of real-world datasets. In the configuration model for random graphs, the graphs are generated by drawing uniformly at random from the set of all possible graphs over a set of n nodes, while satisfying a given prescribed degree sequence. In the context of hypergraphs, configuration models were proposed in Ghoshal et al. (2009), focusing on tripartite and 3-uniform hypergraphs. Later,  extended the configuration model to a more general hypergraph setup. In these references, both the node degrees and the hyperedge sizes are kept fixed (a consequence of the fact that they rely on bipartite representations of hypergraphs). The configuration model is useful for sampling (hyper)graphs with the same degree sequence (and hyperedge sizes) as an observed dataset through shuffling algorithms. Therefore, it is often employed as a null model in statistical analyses. However, sampling exactly (rather than approximately) from this model poses challenges, particularly in the case of hypergraphs.
For a comprehensive discussion on this issue, we refer readers to Section 4 in .
Another popular approach for extracting information from heterogeneous data is clustering.
In the context of graphs, stochastic blockmodels (SBMs) were introduced in the early eighties (Frank and Harary, 1982;Holland et al., 1983) and have since evolved in various directions.
These models assume that nodes are grouped into clusters, and the probabilities of connections between nodes are determined by their cluster memberships. Variants of SBMs have been developed to handle weighted graphs and degree-corrected versions, among others. In the context of hypergraphs, Ghoshdastidar and Dukkipati (2014) introduced a planted partition model for m-uniform hypergraphs, which is a specific case of SBM. They developed a spectral partitioning method and established its consistency in the case where nodes are grouped into equally-sized clusters, and two parameters determine the connection probabilities within clusters and between clusters, with the former being larger than the latter. This result was further extended to the non-uniform and weighted sparse setting, where most weights are close to zero, in Ghoshdastidar and Dukkipati (2017). By introducing hypergraphons, Balasubramanian (2021) extended the ideas of hypergraph SBMs to a nonparametric setting. In a parallel vein, Turnbull et al. (2021) proposed a latent space model for hypergraphs, generalizing random geometric graphs to hypergraphs, although it was not specifically designed to capture clustering. An approach linked to SBMs is presented in Vazquez (2009), where nodes belong to latent groups and participate in a hyperedge with a probability that depends on both their group and the specific hyperedge.
Modularity is a widely used criterion for clustering entities in the context of interaction data. It aims to identify specific clusters, known as communities, characterized by high within-group connection probabilities and low between-group connection probabilities (Ghoshdastidar and Dukkipati, 2014). However, in the hypergraph context, the definition of modularity is not unique. In particular,  introduced a "strict" modularity criterion, where only hyperedges with all their nodes belonging to the same group contribute to an increase in modularity. Their criterion measures the deviation of the number of these homogeneous hyperedges from a new null model called the configuration-like model for hypergraphs, where the average values of the degrees are fixed. Building upon this,  proposed a degree-corrected hypergraph SBM and introduced two new modularity criteria. Similar to , one of these criteria utilizes an "all-or-nothing" affinity function that distinguishes whether a given hyperedge is entirely contained within a single cluster or not. In this setup, they established a connection between approximate maximum likelihood estimation and their modularity criterion. This work is reminiscent of the work of Newman (2016) in the graph context. However, the estimators proposed by  do not guarantee maximum likelihood estimation, as they constrained the parameter space by assuming a symmetric affinity function.
It is important to highlight that the developments presented in  and  are specifically conducted in the context of multiset hypergraphs, where hyperedges can contain repeated nodes with certain multiplicities. The use of multiset hypergraphs simplifies some of the challenges associated with computing modularity. However, to the best of our knowledge, modularity approaches still lack instantiation in the case of simple hypergraphs where each node can only appear once in a hyperedge. More specifically, the null model used in hypergraph modularity criteria relies on a model for multiset hypergraphs, similar to how the null model used in classical graph modularity is based on graphs with self-loops. While it is known in the case of graphs that this assumption is inadequate, as it induces a stronger deviation than expected and affects sparse networks as well (Massen and Doye, 2005;Cafieri et al., 2010;Squartini and Garlaschelli, 2011), the assumption of multisets has not yet been discussed in the context of hypergraph modularity.
In the context of community detection, random walk approaches have also been utilized for hypergraph clustering (Swan and Zhan, 2021). Additionally, low-rank tensor decompositions have been explored (Ke et al., 2020). The misclassification rate for the community detection problem in hypergraphs and its limits have been analyzed in various contexts (see, for instance, Ahn et al., 2018;Chien et al., 2019;Cole and Zhu, 2020). It is worth mentioning that a recent approach has been proposed to cluster hyperedges instead of nodes (Ng and Murphy, 2022). However, our focus in this work is on clustering nodes.
The literature on high-order interactions often discusses simplicial complexes alongside hypergraphs . However, the unique characteristic of simplicial complexes, where each subset of an occurring interaction should also occur, places them outside the scope of this introduction, which is specifically focused on hypergraphs.
In this article, our focus is on model-based clustering for simple hypergraphs, specifically studying stochastic hypergraph blockmodels. We formulate a general stochastic blockmodel for simple hypergraphs, along with various submodels, and outline the key differences compared to previous proposals (Section 2.1). We provide the first result on the generic identifiability of parameters in a hypergraph stochastic blockmodel (Section 2.2). Parameter inference and node clustering are performed using a variational Expectation-Maximization (VEM) algorithm (Section 2.3) that approximates the maximum likelihood estimator. Model selection for the number of groups is based on an integrated classification likelihood (ICL) criterion (Section 2.4). To illustrate the performance of our method, we conduct experiments on synthetic sparse hypergraphs, including a comparison with hypergraph spectral clustering (HSC) and modularity approaches (Section 3). Notably, the line clustering experiment (Section 3.4) highlights the significant differences between our approach and the one proposed by . As a by-product, our synthetic experiments demonstrate that the detectability thresholds for non-uniform sparse hypergraphs, also known as Kesten-Stigum thresholds (Angelini et al., 2015; Stephan and Zhu, 2022), cannot be deduced from the uniform case (Section 3.2). We also analyze a co-authorship dataset, presenting conclusions that differ from spectral clustering and bipartite stochastic blockmodels (Section 4). The proofs of all theoretical results are provided in Section 5. An R package, HyperSBM, which implements our method in efficient C++ code, as well as all associated scripts, are available online (see Section 6). This manuscript is accompanied by Supplementary Material (SM) that contains additional information and experiments.

Model formulation
Let H = (V, E) represent a binary hypergraph, where V = {1, . . . , n} is a set of n nodes and E is the set of hyperedges. In this context, a hyperedge of size m ≥ 2 is defined as a collection of m distinct nodes from V. We do not allow for hyperedges to be multisets or self-loops. Let M = max_{e ∈ E} |e| denote the largest possible size of hyperedges in E, with M ≥ 2 (for graphs, M = 2). We define the set of size-m node subsets as V^{(m)} = {{i_1, . . . , i_m} : i_1, . . . , i_m ∈ V pairwise distinct}, and the set of size-m hyperedges as E ∩ V^{(m)}. For each node subset {i_1, . . . , i_m} ∈ V^{(m)}, we define the indicator variable Y_{i_1,...,i_m} = 1 if {i_1, . . . , i_m} ∈ E and Y_{i_1,...,i_m} = 0 otherwise. We can then represent a random hypergraph as Y = (Y_{i_1,...,i_m})_{{i_1,...,i_m} ∈ V^{(m)}, 2 ≤ m ≤ M}.
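For concreteness, the indicator representation above can be sketched in a few lines of Python; the function name and the data layout below are illustrative choices, not part of the model:

```python
from itertools import combinations

def indicator_tensor(n, E, M):
    """Map every m-subset of {1, ..., n} (2 <= m <= M) to its indicator
    Y_{i1,...,im} = 1 if the subset is a hyperedge of E, 0 otherwise."""
    edges = {frozenset(e) for e in E}
    Y = {}
    for m in range(2, M + 1):
        for subset in combinations(range(1, n + 1), m):
            Y[frozenset(subset)] = int(frozenset(subset) in edges)
    return Y

# A toy simple hypergraph on n = 4 nodes with one edge and one size-3 hyperedge.
Y = indicator_tensor(4, [{1, 2}, {1, 3, 4}], M=3)
```

Note that the representation stores one indicator per possible subset, so absent hyperedges (zeros) are recorded as well, which matters for the likelihood below.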
Similar to the formulation of the stochastic blockmodel (SBM) for graphs, we assume that the nodes in the hypergraph belong to Q unobserved groups. We use Z_1, . . . , Z_n to denote n independent and identically distributed latent variables, where Z_i follows the prior distribution given by π_q = P(Z_i = q) for each q = 1, . . . , Q. The values π_q satisfy π_q ≥ 0 and Σ_{q=1}^{Q} π_q = 1. To simplify notation, we sometimes represent Z_i as a binary vector Z_i = (Z_{i1}, . . . , Z_{iQ}) ∈ {0, 1}^Q, where only one element, Z_{iq}, equals 1. We also define Z = (Z_1, . . . , Z_n). For every m-subset of nodes, we associate a latent configuration that represents the set of latent groups to which these nodes belong. We define Q^{(m)} = {{q_1, . . . , q_m} : q_1, . . . , q_m ∈ {1, . . . , Q}} as the set of all possible latent configurations for elements in V^{(m)}. Note that the values of the latent groups within a configuration may be repeated. Now, given the latent variables Z, all indicator variables Y_{i_1,...,i_m} are assumed to be independent and to follow a Bernoulli distribution whose parameter depends on the latent configuration:

Y_{i_1,...,i_m} | {Z_{i_1}, . . . , Z_{i_m}} = {q_1, . . . , q_m} ~ Bernoulli(B^{(m,n)}_{q_1,...,q_m}).

Here B^{(m,n)}_{q_1,...,q_m} represents the probability that m unordered nodes, with latent configuration {q_1, . . . , q_m}, are connected into a hyperedge. We emphasize the dependency of those parameters on both m and n, having in mind two possible sparsity regimes. First, as the number of nodes increases, it is natural to assume that the probability of a hyperedge may decrease; otherwise, we would only observe dense hypergraphs. Second, while it is implicit that B^{(m,n)}_{q_1,...,q_m} depends on m through q_1, . . . , q_m, we emphasize this dependency as it is likely that real data contain fewer hyperedges of larger size m. Each B^{(m,n)} is a fully symmetric tensor of order m, namely B^{(m,n)}_{q_1,...,q_m} = B^{(m,n)}_{q_σ(1),...,q_σ(m)} for all q_1, . . . , q_m and every permutation σ of {1, . . . , m}.
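A minimal generative sketch may help fix ideas: latent groups are drawn i.i.d. from π, then each m-subset is connected independently with the probability attached to its unordered group configuration. The parameter values below are illustrative, not taken from the paper:

```python
import random
from itertools import combinations

def sample_hsbm(n, pi, B, M, rng):
    """Sample (Z, E) from the HSBM. Nodes are labelled 0..n-1;
    B maps (m, sorted tuple of groups) -> Bernoulli parameter B^{(m,n)}_{q1..qm}."""
    Z = [rng.choices(range(len(pi)), weights=pi)[0] for _ in range(n)]
    E = []
    for m in range(2, M + 1):
        for subset in combinations(range(n), m):
            config = tuple(sorted(Z[i] for i in subset))  # unordered configuration
            if rng.random() < B[(m, config)]:
                E.append(frozenset(subset))
    return Z, E

rng = random.Random(0)
# Two groups, affiliation-type probabilities (illustrative values).
B = {(2, (0, 0)): 0.5, (2, (1, 1)): 0.5, (2, (0, 1)): 0.1,
     (3, (0, 0, 0)): 0.3, (3, (1, 1, 1)): 0.3,
     (3, (0, 0, 1)): 0.05, (3, (0, 1, 1)): 0.05}
Z, E = sample_hsbm(20, [0.6, 0.4], B, M=3, rng=rng)
```

Because B is indexed by the sorted group tuple, the required symmetry of the tensor holds by construction.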
We denote the parameter vector as θ = (π_q, B^{(m,n)}_{q_1,...,q_m})_{q, m, q_1 ≤ ··· ≤ q_m} and the corresponding probability distribution and expectation as P_θ and E_θ, respectively.

Lemma 1. The number of different parameters in each tensor B^{(m,n)} = (B^{(m,n)}_{q_1,...,q_m})_{q_1 ≤ ··· ≤ q_m} equals the number of multisets of size m over Q groups, namely the binomial coefficient C(Q + m − 1, m). As a result, the total number of parameters in our hypergraph stochastic blockmodel (HSBM) is given by (Q − 1) + Σ_{m=2}^{M} C(Q + m − 1, m).

As shown in Table 1, the number of B^{(m,n)}_{q_1,...,q_m} parameters increases rapidly as the values of Q and m grow. To reduce the complexity of the model, we introduce submodels by assuming equality of certain conditional probabilities B^{(m,n)}_{q_1,...,q_m}. In particular, we consider two affiliation submodels given by

(Aff-m): for each size m, B^{(m,n)}_{q_1,...,q_m} = α^{(m,n)} whenever q_1 = ··· = q_m, and B^{(m,n)}_{q_1,...,q_m} = β^{(m,n)} whenever |{q_1, . . . , q_m}| ≥ 2;

(Aff): in addition, α^{(m,n)} = α^{(n)} and β^{(m,n)} = β^{(n)} for all sizes m.

The number of parameters drops to (Q − 1) + 2(M − 1) and to (Q − 1) + 2 under Assumptions (Aff-m) and (Aff), respectively. These submodels align with the concepts discussed in  and , where only hyperedges with all their nodes belonging to the same group contribute to the increase in modularity. Additionally, these submodels correspond to the scenarios in which Ghoshdastidar and Dukkipati (2014, 2017) obtained their results.

Table 1: Number C(Q + m − 1, m) of distinct parameters in the tensor B^{(m,n)}, for hyperedge sizes m = 3, . . . , 7 (rows) and numbers of groups Q = 2, . . . , 7 (columns).

m \ Q    2     3     4     5     6     7
3        4    10    20    35    56    84
4        5    15    35    70   126   210
5        6    21    56   126   252   462
6        7    28    84   210   462   924
7        8    36   120   330   792  1716

Generalizations. Our model can accommodate self-loops without significant changes by allowing m = 1. Furthermore, it can easily be extended to handle hypergraphs with multiple hyperedges (with or without self-loops) by placing a zero-inflated or -deflated Poisson distribution on the conditional distribution of the hyperedges. In a more general setting, the conditional Bernoulli distribution can be replaced with any parametric distribution to handle weighted hypergraphs, and covariates could also easily be incorporated. This flexibility allows for the adaptation of our model to various types of hypergraph data.
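The counts in Lemma 1 and Table 1 are multiset coefficients, which can be checked directly; a small sketch (function names are ours):

```python
from math import comb

def n_B_params(Q, m):
    """Number of distinct entries of the symmetric tensor B^{(m,n)}:
    multisets of size m over Q groups, i.e. C(Q + m - 1, m)."""
    return comb(Q + m - 1, m)

def n_hsbm_params(Q, M):
    """Total HSBM parameter count: (Q - 1) proportions plus one tensor per size m."""
    return (Q - 1) + sum(n_B_params(Q, m) for m in range(2, M + 1))

def n_affiliation_params(Q, M, shared=False):
    """(Aff-m): one (alpha, beta) pair per size m = 2..M; (Aff): one shared pair."""
    return (Q - 1) + (2 if shared else 2 * (M - 1))
```

For instance, `n_B_params(3, 3)` reproduces the Table 1 entry for Q = 3, m = 3.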
Distinction from previous proposals. We here highlight the distinctive methodological aspects of our approach compared to other existing methods. Ghoshdastidar and Dukkipati (2014, 2017) obtained error bounds that converge to zero only for specific cases of the (Aff-m) model with equally-sized groups. Their results are limited to these specific scenarios, whereas our approach extends beyond such constraints. References such as Ke et al. (2020); Ahn et al. (2018); Chien et al. (2019) primarily focus on community detection, which means they only identify clusters that correspond to communities. In contrast, our method is not limited to detecting community-like structures, but can find clusters that may not align with traditional notions of communities (e.g. disassortative clusters).  introduced a comprehensive degree-corrected SBM for multiple hypergraphs. However, their inference method solves the clustering problem in the space of multiset hypergraphs, whereas our approach is specifically designed for simple hypergraphs. While both models are valid, they lead to different statistical analyses, and the choice of which model to use depends on the specific characteristics of the data. Furthermore, it is worth noting that while  initially employ a maximum likelihood approach, they deviate from that setting during their analysis. In contrast, our method retains the maximum likelihood framework throughout the inference process. Lastly, the SBM for hypergraphs presented in Balasubramanian (2021) is highly general. However, their least-squares estimator for a hypergraphon model is computationally infeasible. Additionally, their Algorithm 1 is dedicated to community detection and does not provide general cluster recovery.
In Sections 3 and 4, we provide numerical illustrations that highlight the differences between our algorithm and those proposed by ; Ghoshdastidar and Dukkipati (2017), as well as the bipartite SBM.

Parameter identifiability
We first establish the generic identifiability of the parameters in a HSBM that is restricted to simple m-uniform hypergraphs for any m ≥ 2. In a parametric context, generic identifiability implies that every parameter θ, except possibly for some parameters in a subset of dimension strictly smaller than the full parameter space, uniquely determines the distribution P_θ. In other words, if we randomly select a parameter θ ∈ Θ according to the Lebesgue measure, it almost surely uniquely defines P_θ. Identifiability is established up to label switching on the node groups, as is common in discrete latent variable models. For the case m = 2, the identifiability result corresponds to Theorem 2 in . Our proof follows similar principles, building upon a key result by . In our case, we also crucially rely on a sufficient condition for a sequence of nonnegative integers to be the degree sequence of a simple m-uniform hypergraph, as established by .
Theorem 2. For any m ≥ 2 and any integer Q, the parameter θ = (π_q, B^{(m,n)}_{q_1,...,q_m})_{1 ≤ q ≤ Q, 1 ≤ q_1 ≤ ··· ≤ q_m ≤ Q} of the HSBM restricted to m-uniform simple hypergraphs over n nodes is generically identifiable, up to label switching on the node groups, for n large enough (depending only on m and Q).
This result does not provide specific insights into the identifiability when we restrict the model to fixed group proportions, such as equal group proportions π q = 1/Q. Indeed, it does not explicitly characterize the subspace of the parameter space where identifiability may not hold, although we know that its dimension is smaller than that of the full parameter space. When we impose fixed group proportions, we are precisely working within a lower-dimensional subspace, and identifiability may not be guaranteed without additional considerations.
Similarly, this result says nothing about the affiliation cases (Aff-m) and (Aff ), which correspond to further restrictions of the parameter space to lower-dimensional subspaces.
The result we have established for m-uniform hypergraphs also implies a similar result for non-uniform simple hypergraphs, as stated in the following corollary.

Corollary 3. For any integer Q, the parameter θ of the HSBM over simple hypergraphs with hyperedges of size at most M is generically identifiable, up to label switching on the node groups, for n large enough.
Our proof of Corollary 3 relies on the assumption that all the π_q's are distinct, which is a generic condition. This condition is not explicitly stated in the corollary, but it is required for the proof to hold. Consequently, the result of generic identifiability does not provide any insight in cases where the group proportions are equal, as the argument is then not sufficient to identify the parameters separately for each value of m.
Additional technical work is thus needed to establish whether a HSBM with equal group proportions, or whether the affiliation submodels have identifiable parameters.

Parameter estimation via variational Expectation-Maximization
The likelihood of the model is given by

P_θ(Y) = Σ_{z ∈ {1,...,Q}^n} P_θ(Z = z) P_θ(Y | Z = z) = Σ_{z ∈ {1,...,Q}^n} Π_{i=1}^{n} π_{z_i} Π_{m=2}^{M} Π_{{i_1,...,i_m} ∈ V^{(m)}} (B^{(m,n)}_{z_{i_1},...,z_{i_m}})^{Y_{i_1,...,i_m}} (1 − B^{(m,n)}_{z_{i_1},...,z_{i_m}})^{1 − Y_{i_1,...,i_m}}.  (2)

The computation of the model likelihood is generally intractable: Equation (2) involves a summation over all Q^n possible latent configurations, which becomes computationally prohibitive when n and Q are large. In the context of latent variable models, the Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is commonly used to address this issue.
However, the EM algorithm cannot be directly applied to SBMs. This is because the E-step, which involves computing the conditional posterior distribution of the latent variables P(Z|Y), is itself intractable in SBMs (see, e.g., Matias and Robin, 2014). One possible solution is to employ variational approximations of the EM algorithm, known as Variational EM (VEM; Jordan et al., 1999).
The complete-data log-likelihood is

log P_θ(Y, Z) = Σ_{i=1}^{n} Σ_{q=1}^{Q} Z_{iq} log π_q + Σ_{m=2}^{M} Σ_{{i_1,...,i_m} ∈ V^{(m)}} [ Y_{i_1,...,i_m} log B^{(m,n)}_{Z_{i_1},...,Z_{i_m}} + (1 − Y_{i_1,...,i_m}) log(1 − B^{(m,n)}_{Z_{i_1},...,Z_{i_m}}) ]  (3)

= Σ_{i=1}^{n} Σ_{q=1}^{Q} Z_{iq} log π_q + Σ_{m=2}^{M} Σ_{q_1 ≤ ··· ≤ q_m} Σ_{{i_1,...,i_m} ∈ V^{(m)}} 1{{Z_{i_1}, . . . , Z_{i_m}} = {q_1, . . . , q_m}} [ Y_{i_1,...,i_m} log B^{(m,n)}_{q_1,...,q_m} + (1 − Y_{i_1,...,i_m}) log(1 − B^{(m,n)}_{q_1,...,q_m}) ].  (4)

Note that the final equality ensures that each parameter value appears only once. The key principle underlying the variational method is to adopt the same iterative two-step structure as the EM algorithm, but to replace the intractable posterior distribution P_θ(Z|Y) with its best approximation, in terms of Kullback-Leibler divergence, within a class of simpler, often factorized, distributions. We introduce the class of factorized probability distributions Q_τ over Z = (Z_1, . . . , Z_n) given by

Q_τ(Z) = Π_{i=1}^{n} Q_{τ_i}(Z_i), with Q_τ(Z_i = q) = τ_{iq},  (5)

where the variational parameters satisfy τ_{iq} ∈ [0, 1] and Σ_{q=1}^{Q} τ_{iq} = 1 for any i = 1, . . . , n and q = 1, . . . , Q. The expectation under the distribution Q_τ is denoted by E_{Q_τ}, and H(Q_τ) represents the entropy of Q_τ. We now define the evidence lower bound (ELBO)

J(θ, τ) = E_{Q_τ}[log P_θ(Y, Z)] + H(Q_τ),

which satisfies the relation

J(θ, τ) = log P_θ(Y) − KL(Q_τ(·) || P_θ(· | Y)),  (6)

where KL(·||·) denotes the Kullback-Leibler divergence. Thus, J serves as a lower bound on the model log-likelihood log P_θ(Y). The VEM algorithm iterates between the following two steps until a suitable convergence criterion is met:

• VE-Step maximizes J(θ, τ) with respect to τ:

τ̂ = arg max_τ J(θ, τ).  (7)

This step finds the best approximation of the conditional distribution P_θ(Z|Y) by minimizing the Kullback-Leibler divergence term in (6).
• M-Step maximizes J(θ, τ) with respect to θ:

θ̂ = arg max_θ J(θ, τ).  (8)

This step updates the values of the model parameters π_q and B^{(m,n)}_{q_1,...,q_m}.
In the following we provide the solutions of the two maximization problems in Equations (7) and (8).
Proposition 4 (VE-Step). Given the current model parameters (π_q, B^{(m,n)}_{q_1,...,q_m})_{q,m,q_1,...,q_m} at any iteration of the VEM algorithm, the corresponding optimal values of the variational parameters (τ̂_{iq})_{i,q} defined in Equation (7) satisfy the fixed-point equation

τ̂_{iq} = c_i π_q Π_{m=2}^{M} Π_{i_2,...,i_m : {i,i_2,...,i_m} ∈ V^{(m)}} Π_{q_2,...,q_m} [ (B^{(m,n)}_{q,q_2,...,q_m})^{Y_{i,i_2,...,i_m}} (1 − B^{(m,n)}_{q,q_2,...,q_m})^{1 − Y_{i,i_2,...,i_m}} ]^{τ̂_{i_2 q_2} ··· τ̂_{i_m q_m}}  (9)

for any 1 ≤ i ≤ n and 1 ≤ q ≤ Q, where the c_i are normalising constants ensuring Σ_q τ̂_{iq} = 1.
Remark. From Proposition 4, the τ̂_i's are obtained through a fixed-point algorithm. Although in all the situations we encountered the algorithm converged in a reasonable number of iterations, we have no guarantee of either existence or uniqueness of a solution to (9).
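For intuition, in the graph case (M = 2) the fixed-point update of Proposition 4 reduces to the classical SBM variational update. The following pure-Python sketch of that special case is illustrative only and is not the HyperSBM implementation:

```python
import math

def ve_fixed_point(Y, pi, B, tau, n_iter=50):
    """Iterate tau_iq ∝ pi_q * prod_{j != i} prod_l
    [B_ql^{Y_ij} (1 - B_ql)^{1 - Y_ij}]^{tau_jl}  (graph case, M = 2)."""
    n, Q = len(Y), len(pi)
    for _ in range(n_iter):
        new = []
        for i in range(n):
            logs = []
            for q in range(Q):
                s = math.log(pi[q])
                for j in range(n):
                    if j == i:
                        continue
                    for l in range(Q):
                        p = B[q][l]
                        s += tau[j][l] * (Y[i][j] * math.log(p)
                                          + (1 - Y[i][j]) * math.log(1 - p))
                logs.append(s)
            mx = max(logs)          # log-sum-exp normalisation for stability
            w = [math.exp(v - mx) for v in logs]
            tot = sum(w)
            new.append([v / tot for v in w])
        tau = new
    return tau

# Two planted communities on 6 nodes (block-diagonal toy adjacency).
Y = [[0] * 6 for _ in range(6)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    Y[i][j] = Y[j][i] = 1
tau0 = [[0.6, 0.4] if i < 3 else [0.4, 0.6] for i in range(6)]
tau = ve_fixed_point(Y, [0.5, 0.5], [[0.9, 0.05], [0.05, 0.9]], tau0)
```

On this toy example the iteration sharpens the initial soft assignment into the two planted blocks.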
Proposition 5 (M-Step). Given the variational parameters (τ_{iq})_{i,q} at any iteration of the VEM algorithm, the corresponding optimal values of the model parameters (π̂_q, B̂^{(m,n)}_{q_1...q_m})_{q,m,q_1 ≤ ··· ≤ q_m} defined in Equation (8) are given by

π̂_q = (1/n) Σ_{i=1}^{n} τ_{iq},    B̂^{(m,n)}_{q_1,...,q_m} = Σ_{{i_1,...,i_m} ∈ V^{(m)}} Y_{i_1,...,i_m} Q_τ({Z_{i_1},...,Z_{i_m}} = {q_1,...,q_m}) / Σ_{{i_1,...,i_m} ∈ V^{(m)}} Q_τ({Z_{i_1},...,Z_{i_m}} = {q_1,...,q_m}),

that is, B̂^{(m,n)}_{q_1,...,q_m} is the τ-weighted proportion of present hyperedges among the m-subsets with latent configuration {q_1, . . . , q_m}. We now express the solutions of the M-Step under the submodels given by (Aff-m) and (Aff).
Note that the VE-Step is unchanged under these settings.
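Before turning to the affiliation case, the M-step of Proposition 5 can be illustrated in the graph case (M = 2), where it reduces to weighted empirical frequencies; again an illustrative sketch, not the package code:

```python
def m_step_graph(Y, tau):
    """Graph case (M = 2): pi_q is the mean of tau_iq, and B_ql is the
    tau-weighted fraction of present edges among pairs with configuration {q, l}."""
    n, Q = len(tau), len(tau[0])
    pi = [sum(tau[i][q] for i in range(n)) / n for q in range(Q)]
    B = [[0.0] * Q for _ in range(Q)]
    for q in range(Q):
        for l in range(q, Q):
            num = den = 0.0
            for i in range(n):
                for j in range(i + 1, n):
                    # Q_tau-probability that the unordered pair {i, j}
                    # has latent configuration {q, l}.
                    w = tau[i][q] * tau[j][l] if q == l else \
                        tau[i][q] * tau[j][l] + tau[i][l] * tau[j][q]
                    num += w * Y[i][j]
                    den += w
            B[q][l] = B[l][q] = num / den
    return pi, B

# Hard assignments on a toy two-block graph recover the block structure exactly.
Y = [[0] * 6 for _ in range(6)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    Y[i][j] = Y[j][i] = 1
tau_hard = [[1.0, 0.0] if i < 3 else [0.0, 1.0] for i in range(6)]
pi_hat, B_hat = m_step_graph(Y, tau_hard)
```

With hard (one-hot) τ the updates coincide with the maximum likelihood estimates under known groups, which is the expected sanity check.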
Proposition 6 (M-Step, affiliation setup). In the affiliation submodels given by (Aff-m) and (Aff) respectively, given the variational parameters (τ_{iq})_{i,q} at any iteration of the VEM algorithm, the corresponding optimal values of (α̂^{(m,n)}, β̂^{(m,n)})_m and (α̂^{(n)}, β̂^{(n)}) maximising J as in Equation (8) are the τ-weighted proportions of present hyperedges among the m-subsets whose latent configuration is homogeneous (for the α̂'s) or contains at least two distinct groups, i.e. configurations q_1 ≤ ··· ≤ q_m with |{q_1, . . . , q_m}| ≥ 2 (for the β̂'s). Under Assumption (Aff), these proportions are additionally pooled over all sizes m = 2, . . . , M.

Algorithm initialization. We choose to begin the algorithm with its M-step, which requires an initial value for τ. This allows us to leverage smart initialization strategies based on a preliminary clustering of the nodes. Specifically, we employ three different initialization strategies and select the best result, that is, the one maximizing the ELBO criterion J.

Random initialization: This naive method involves drawing each (τ_{iq})_{1≤q≤Q} uniformly from (0, 1) for every node i and normalizing the vector τ_i.
"Soft" spectral clustering: We utilize Algorithm 1 from Ghoshdastidar and Dukkipati (2017) combined with soft k-means. In this approach, we compute a hypergraph Laplacian and construct the column matrix X consisting of its leading Q orthonormal eigenvectors. The rows of X are then normalized to have unit norm (following steps 1 to 3 in Algorithm 1 from Ghoshdastidar and Dukkipati, 2017). We subsequently perform a soft k-means algorithm on the rows of X to obtain τ iq , which represents the posterior probability of node i belonging to cluster q.
Graph-component absolute spectral clustering: This strategy focuses on edges in the hypergraph (m = 2) and the corresponding adjacency matrix. We apply the absolute spectral clustering method (Rohe et al., 2011) to this adjacency matrix. It should be noted that this initialization only uses information from hyperedges of size m = 2, excluding hyperedges with larger sizes.
However, absolute spectral clustering is considered superior to spectral clustering as it captures disassortative groups.
Stopping criteria. The iterations of the VEM algorithm should be terminated when the ELBO J and the sequence of model parameter vectors θ^{(t)} = (θ^{(t)}_s)_s have converged. However, in practice, we have observed that the algorithm sometimes stops prematurely, when the VE-Step still requires a few iterations to reach a fixed point. In such cases, continuing the VEM iterations often leads to higher values of the ELBO and better parameter estimates. To address this, we additionally require that the fixed point in the VE-Step be reached at its first iteration, which reduces the chance of converging to local maxima of J. To summarize, we stop the algorithm either when all these convergence conditions are met, or when the maximum number of iterations has been reached. Section D in SM contains additional details about the algorithm's implementation.
Algorithm complexity and choice of M. The complexity of our algorithm is of the order O(Q^M n^M), which can be quite prohibitive for large datasets, especially when M becomes large. It is important to emphasize that, when analyzing a dataset, the value of M is not necessarily the maximum observed size of the hyperedges, but rather a modeling choice. Indeed, while an occurring hyperedge Y_{i_1,...,i_m} with node clusters {q_1, . . . , q_m} contributes log B^{(m,n)}_{q_1,...,q_m} to the likelihood, a non-occurring one contributes log(1 − B^{(m,n)}_{q_1,...,q_m}), and the statistical information they bring about the parameter is of the same nature (see Equations (3) and (4)). Consider, for example, a co-authorship dataset where we observe n authors and at most 3 co-authors per paper.
The absence of hyperedges of size 4 provides as much information for a HSBM as if all possible size-4 hyperedges were present. Similarly, the information contained in a dataset with all but 5 possible size-4 hyperedges present is the same as the information contained in a dataset with only 5 occurring size-4 hyperedges. In other words, 0 and 1 values play a symmetric role.
As a consequence, the choice of M is left to the discretion of the statistician, depending on the characteristics of the dataset and the available computational resources. In particular, if there are hyperedges with very large sizes M , the statistician may decide not to consider them, just as it is justified not to take into account the absence of hyperedges of size M + 1, where M is the largest observed size. It is important to note that choosing M > 2 already represents an improvement in terms of considering more information compared to a graph analysis of the data. Therefore, for large datasets, we recommend limiting the analysis to smaller values of M , such as M = 3 or M = 4, to reduce computational burden and improve efficiency. Ghoshdastidar and Dukkipati (2017) propose a method for selecting the number of groups based on the spectral gap, whereas our approach relies on a statistical framework to construct a model selection criterion.

Model selection
After obtaining the estimated parameters θ̂ and (τ̂_i)_i from the VEM algorithm, we assign each node i to its estimated group Ẑ_i = arg max_q τ̂_{iq}. We then define the integrated classification likelihood (ICL, Biernacki et al., 2000) for the full model and the submodels (Aff-m) and (Aff) by penalizing the complete-data log-likelihood log P_θ̂(Y, Ẑ) with BIC-type terms: half the number of group-proportion parameters times log n, plus half the number of hyperedge-probability parameters (given by Lemma 1 for the full model, and by its affiliation counterparts for the submodels) times the logarithm of the number of possible hyperedges. We then determine the number of groups Q̂ as the value that maximizes the corresponding ICL criterion: Q̂ = arg max_Q ICL(Q).
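The selection step itself is a simple argmax over fitted models. The numbers and the penalty in the sketch below are illustrative stand-ins of the usual BIC-type ICL form; the exact penalty used by HyperSBM is the one derived in the paper:

```python
from math import comb, log

def icl(complete_loglik, Q, n, M):
    """BIC-type ICL sketch for the full HSBM: penalize the complete-data
    log-likelihood by half a log-factor per free parameter (illustrative)."""
    n_B = sum(comb(Q + m - 1, m) for m in range(2, M + 1))   # tensor entries (Lemma 1)
    n_obs = sum(comb(n, m) for m in range(2, M + 1))          # possible hyperedges
    return complete_loglik - 0.5 * (Q - 1) * log(n) - 0.5 * n_B * log(n_obs)

def select_Q(fits, n, M):
    """fits: dict mapping Q -> complete-data log-likelihood at the VEM solution."""
    return max(fits, key=lambda Q: icl(fits[Q], Q, n, M))

# Toy numbers: the likelihood gain from Q = 2 to Q = 3 does not cover the penalty.
Q_hat = select_Q({1: -1500.0, 2: -1200.0, 3: -1195.0}, n=50, M=3)
```

The design mirrors the classical SBM ICL trade-off: larger Q always raises the fitted likelihood, so the penalty must grow with the parameter count of Lemma 1.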

Synthetic data
We conducted a simulation study to evaluate the performance of the HyperSBM package. We generated hypergraphs under the HSBM with two or three latent groups (Q = 2 or Q = 3). The group proportions were non-uniform, with π = (0.6, 0.4) for Q = 2 and π = (0.4, 0.3, 0.3) for Q = 3. We set the largest hyperedge size M to 3 and considered different numbers of nodes n. To simplify the latent structure, we assumed the (Aff-m) submodel and parametrized the model through the ratios ρ^(m) of within-group size-m hyperedges over between-groups size-m hyperedges (assumed constant with n; see Section E in SM for details). We analyzed two different scenarios:
A. Communities: in this scenario, we focus on community detection and consider the case of more within-group than between-groups size-m hyperedges, i.e. ρ^(m) > 1 for m = 2, 3.
B. Disassortative: in this scenario, we focus on disassortative behaviour and consider the case of fewer within-group than between-groups size-m hyperedges, i.e. ρ^(m) < 1 for m = 2, 3.
Each setting is a combination of a scenario (X = A, B) and a number of groups (Q = 2, 3) and is denoted XQ. In each setting, the values of α^(m,n) and β^(m,n) decrease as n increases, so as to keep the quantities n α^(2,n) and n β^(2,n), as well as n² α^(3,n) and n² β^(3,n), constant. This implies that the number of size-m hyperedges (both within and between groups) grows linearly with n. We have explored a total of 5 different settings, denoted A2, A3, B2, B3 and A3', and we present below the most striking results. In the case of scenario A (communities) with Q = 3 groups, we pushed the limits and present two different settings (namely A3 and A3'), with setting A3 being highly sparse, i.e. sparser than the already sparse setting A3'. Details of the parametrization, specific parameter values and numbers of hyperedges are fully given in Section E in SM, where additional results are also provided.
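The sparsity scaling described above (keeping n α^(2,n) and n² α^(3,n) constant) makes the expected number of size-m hyperedges grow linearly with n, which can be checked directly under illustrative constants:

```python
from math import comb

def expected_hyperedges(n, m, c):
    """Expected count of size-m hyperedges when the connection probability
    scales as c / n^(m-1), so that n^(m-1) * B^{(m,n)} stays constant."""
    return comb(n, m) * c / n ** (m - 1)

# Doubling n roughly doubles the expected counts, for both m = 2 and m = 3.
r2 = expected_hyperedges(400, 2, 1.0) / expected_hyperedges(200, 2, 1.0)
r3 = expected_hyperedges(400, 3, 1.0) / expected_hyperedges(200, 3, 1.0)
```

Here C(n, m) grows like n^m / m!, so dividing the probability by n^(m-1) leaves an expected count of order n.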
For each setting and each value of n, we randomly draw 50 different hypergraphs. We estimate the parameters using the full HSBM formulation with our VEM algorithm, relying on soft spectral clustering (for Scenario A) and graph-component absolute spectral clustering (for Scenario B) initializations (see paragraph "Algorithm initialization" above).

Clustering and estimation under HSBM with a fixed number of groups
In this part we focus on clustering and parameter estimation with a known number of groups.
The performance of HyperSBM is evaluated based on its ability to accurately recover the true clustering and estimate the original parameters. We also compare our results with hypergraph spectral clustering, relying on Algorithm 1 from Ghoshdastidar and Dukkipati (2017), denoted HSC below.
Clustering results. The quality of the classification is evaluated using the Adjusted Rand Index (ARI, Hubert and Arabie, 1985). The ARI measures the similarity between the true node clustering and the estimated clustering. It is upper bounded by 1, where a value of 1 indicates perfect agreement between the clusterings, and negative values indicate less agreement than expected by chance. Figure 1 displays boxplots of the ARI for settings A2 to B3. Our HyperSBM consistently outperforms HSC, obtaining higher ARI values overall and markedly lower variances in most cases; the exception is setting B3, where HyperSBM exhibits a larger variance but still yields substantially better results than HSC. We also observe that increasing the number of nodes n does not appear to significantly enhance the clustering results of HyperSBM.
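A self-contained sketch of the ARI computation from the contingency table of two labelings, illustrating in particular its invariance to label switching:

```python
from collections import Counter
from math import comb

def ari(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie) from the contingency table."""
    n = len(labels_a)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)       # chance-level agreement
    max_index = (pairs_a + pairs_b) / 2
    return (pairs_joint - expected) / (max_index - expected)
```

For instance, `ari([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])` returns 1.0: the two labelings describe the same partition up to a label permutation.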
This behavior could be attributed to our simulation setting, where the numbers of size-m hyperedges (m = 2, 3) are kept linearly increasing with n. However, it is worth noting that the variances of the ARI obtained by HyperSBM tend to decrease with an increasing number of nodes n. One final comment pertains to the relatively poor clustering performance obtained by both methods in setting B3. This setting appears to be particularly challenging, indicating the necessity for further research on detectability thresholds in the context of non-uniform hypergraphs.
We briefly discuss this topic in the next paragraph.
Detectability thresholds. A threshold for community detection in m-uniform sparse hypergraphs was conjectured in Angelini et al. (2015), and Stephan and Zhu (2022) established that it is efficiently achieved by a spectral algorithm based on the non-backtracking operator. We also mention the work by Zhang and Tan (2023) in a different regime.

Figure 1: Boxplots of Adjusted Rand Indexes for different settings XQ (where X = A, B is the scenario and Q = 2, 3 is the number of groups), number of nodes n (along x-axis) and 2 methods: our HyperSBM (left boxplot) and HSC (right boxplot). First row (resp. first column) shows scenario A with communities (resp. Q = 2) while second row (resp. second column) shows scenario B with disassortative behaviour (resp. Q = 3).

In the specific case of a symmetric m-uniform HSBM (which corresponds to the (Aff) submodel) with equal group proportions (see Example 1 in Stephan and Zhu, 2022), the Kesten-Stigum threshold for possible recovery in the sparse regime is denoted KS_m below. When the group proportions are not uniform, we conjecture a modified threshold, denoted KS'_m. It appears that none of the 5 settings explored satisfies the KS_m or KS'_m threshold when restricting attention to either m = 2 or m = 3 (see Section E.3 in SM). Thus, in theory, the clusters cannot be recovered in those scenarios by looking only at the 2-uniform or 3-uniform hypergraphs. Our results seem to indicate that non-uniform sparse hypergraphs behave differently from the superposition of their uniform layers, and that the existence of hyperedges of different sizes improves the detectability of the clusters.
Parameter estimation accuracy. We also evaluate the accuracy of parameter estimation. As the parameter values may be extremely small (see Section E in SM), we choose to compute the Mean Squared Relative Error (MSRE) between the true parameters (in the full model) and the estimated values, both for the prior probabilities π_q and the probabilities of hyperedge occurrence B^(m)_{q_1,...,q_m}. Specifically, we compute the aggregated MSRE over all the components of θ as

MSRE = (1/n_rep) Σ_{i=1}^{n_rep} Σ_k [ (θ̂_k^{(i)} − θ_k) / θ_k ]²,

where θ̂^{(i)} = ({π̂_q^{(i)}}_q, {B̂^{(m),(i)}_{q_1,...,q_m}}_{m,q_1,...,q_m}) is the set of free parameters estimated on the i-th dataset by the full model and n_rep = 50 is the number of replicates.
The corresponding results are summarized through the boxplots in Figure 2. The relative errors are rather small, decreasing and showing a lower variance as the number of nodes increases.
Note that the absolute values of the MSRE cannot be compared between the cases Q = 2 (first column) and Q = 3 (second column). Indeed, in the first case, the relative error is cumulated over a total of 1 + 3 + 4 = 8 free parameters (in the full model), while this increases to 2 + 6 + 10 = 18 free parameters when Q = 3.
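The aggregated MSRE above can be sketched as follows; here theta_true stacks all free parameters in a fixed order, and theta_hats holds one estimate vector per replicate (names are ours):

```python
import numpy as np

def aggregated_msre(theta_true, theta_hats):
    """Mean over replicates of the squared relative errors, summed over all
    free parameters: (1/n_rep) * sum_i sum_k ((hat_theta_k^i - theta_k)/theta_k)^2."""
    theta_true = np.asarray(theta_true, dtype=float)
    return np.mean([(((np.asarray(t) - theta_true) / theta_true) ** 2).sum()
                    for t in theta_hats])
```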

Performance of model selection
In this section we assess the performance of ICL as a model selection criterion. The simulated data is fitted with our HyperSBM with a number of latent states ranging from 1 to 5.  In Table 2 we show the frequency of the selected number of groups for setting A3'. The correct model is selected in 74% of cases for n = 50, in 98% of cases for n = 100 and in 100% of cases for n = 150, 200. We also compute the value of ARI of the classification obtained with 3 clusters depending on the selected number of latent groups. This value is always equal to 1 when the correct model is recovered, thus confirming the optimal behavior of HyperSBM already shown in Section 3.2.

Line clustering through hypergraphs
Following Leordeanu and Sminchisescu (2012) and earlier references, we here explore the use of hypergraphs to detect line clusters of points in R². Similarly to the construction of pairwise similarity measures, we resort to third-order affinity measures, since pairwise measures are useless for detecting the alignment of points. Thus, for any triplet of points {i, j, k}, we use the mean distance to the best-fitting line as a dissimilarity measure d(i, j, k) and transform it through a Gaussian kernel into a similarity measure.
We performed two different experiments, with either 2 or 3 lines. In each setting, we randomly generated the same number of points per line in the square [−0.5, 0.5]², perturbed with centered Gaussian noise with standard deviation 0.01. We then added noisy points generated from the uniform distribution on [−0.5, 0.5]². The particular settings of each experiment are described in Table 3, and Figure 3 shows the resulting sets of points. For both settings, we generated 100 different 3-uniform hypergraphs using the following procedure. We randomly selected 3 points {i, j, k} and calculated the mean distance d(i, j, k) to the best-fitting line. We then measured their similarity using a Gaussian kernel exp(−d(i, j, k)²/σ²) with σ² = 0.04. If the similarity was greater than a threshold ϵ = 0.999, we constructed a hyperedge {i, j, k}. This procedure resulted in both signal hyperedges, where all points belonged to the same line cluster, and noise hyperedges, where the points were sufficiently aligned without belonging to the same line. The signal-to-noise ratio of hyperedges was set to 2 for each hypergraph. We specifically simulated sparse hypergraphs, and the average number of hyperedges is presented in Table 3. Additionally, isolated nodes in the hypergraph were excluded from the clustering analysis.
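The hyperedge-construction step can be sketched as follows (best-fitting line via total least squares/SVD; function names are ours, and we enumerate all triplets for simplicity, whereas the experiment samples them at random):

```python
import numpy as np
from itertools import combinations

def line_dissimilarity(pts):
    """Mean distance of points to their total-least-squares best-fitting line."""
    P = np.asarray(pts, dtype=float)
    C = P - P.mean(axis=0)
    _, _, Vt = np.linalg.svd(C, full_matrices=False)  # Vt[1] = unit normal of the line
    return np.abs(C @ Vt[1]).mean()

def build_hyperedges(points, sigma2=0.04, eps=0.999):
    """Keep triplet {i, j, k} as a hyperedge when its Gaussian-kernel
    similarity exp(-d(i,j,k)^2 / sigma2) exceeds the threshold eps."""
    edges = []
    for i, j, k in combinations(range(len(points)), 3):
        d = line_dissimilarity([points[i], points[j], points[k]])
        if np.exp(-d ** 2 / sigma2) > eps:
            edges.append((i, j, k))
    return edges
```

With ϵ = 0.999 and σ² = 0.04, a triplet is kept only when its mean distance to the best-fitting line is below roughly 0.006, i.e. only nearly aligned triplets form hyperedges.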
We applied our HyperSBM algorithm to cluster the nodes of these 3-uniform and sparse hypergraphs.

Analysis of a co-authorship dataset

Dataset description
We analyze a co-authorship dataset available at http://vlado.fmf.uni-lj.si/pub/networks/ data/2mode/Sandi/Sandi.htm. The dataset originates from the bibliography of the book "Prod-

Analysis with HyperSBM
We conducted an analysis of this dataset using our HyperSBM package. The model selection based on the ICL criterion determined that there are 2 groups (Q = 2). One group consists of only 8 authors, while the remaining 71 authors belong to the second group. Table 4 gives the distribution of the number of distinct co-authors per author. Coming back to the bipartite graph of authors and (co-authored) papers, we looked at the degree distribution of the authors, given in Table 5. This corresponds to the distribution of the number of co-authored papers per author. We observed that 5 of the 8 authors from our first group are the ones that co-published the most, the three others also having a high degree (one of degree 5 and two of degree 4). Thus, our first group is made of authors (among) the most collaborative ones, which are also (among) the most prolific ones.

Nb co-authors  1  2  3  4  5  6  7  8  10  11  12
Count         23 27 13  6  2  2  1  1   2   1   1

Table 4: Distribution of the number of distinct co-authors per author. The first group contains the 6 authors having the largest number of distinct co-authors (between 7 and 12) plus 2 authors with 4 co-authors each.
Author degree  1  2  3  4  5  6  7  8  10  13
Count         44 14  6  6  4  1  1  1   1   1

Table 5: Degree distribution of the authors in the bipartite graph, i.e. number of co-authored papers per author.

Neither the first nor the second group inferred by HyperSBM is a community. Indeed, we obtained the following estimated values from the size-2 hyperedges: B̂^(2)_{11} ≃ 4.2% is of the same order as B̂^(2)_{12} ≃ 5.1%, while B̂^(2)_{22} ≃ 0.8% is around five times smaller. (We have dropped from our notation B̂^(m,n) the index n and only use the size m of the hyperedges to distinguish the parameter values.) This means that the first group contains authors that have written with authors from both groups, while the second group is made of authors who have fewer co-authored papers with people of their own group. Looking now at size-3 hyperedges, the largest estimated frequency is B̂^(3)_{112}, which concerns 2 authors of the small first group co-authoring a paper with one author of the large second group. The second most important estimated frequency is B̂^(3)_{122}, obtained for one author from the small first group co-authoring a paper with two authors of the large second group. The remaining frequencies of size-3 hyperedges are negligible. This further characterizes the first group as being composed of authors that co-author with their own group as well as with authors from the second one.
Finally, looking now at size-4 hyperedges, the only non-negligible estimated frequency is obtained for B̂^(4)_{1222} ≃ 4 · 10^−6. We note here that the quantities B^(3)'s and B^(4)'s are intrinsically on different scales, as are the quantities B^(2)'s and B^(3)'s. So again, authors from group one co-authored with the other authors. (Note that the first group is not large enough for a B^(4) frequency with at least 2 authors in group 1 to be non-negligible.)

Comparison with 2 other methods
We first compared our approach with the hypergraph spectral clustering (HSC) algorithm proposed in Ghoshdastidar and Dukkipati (2017). Let us recall that spectral clustering does not come with a statistical criterion to select the number of groups. Looking at the partition obtained with Q = 2 groups, spectral clustering outputs groups with sizes 24 and 55, respectively. These groups are characterized neither by the number of co-authors nor by their degrees in the bipartite graph (see details in SM). Indeed, in our case the best clusters are not communities and their sizes are very different, while we recall that spectral clustering tends to: i) extract communities; ii) favor groups of similar size.
We then analysed the same dataset as a bipartite graph of authors/papers with the R package sbm through the function estimateBipartiteSBM. This method infers a latent blockmodel (that in fact corresponds to an SBM for bipartite graphs) and automatically selects the number of groups on both parts (authors and papers). Hereafter, we refer to this method as the Bipartite-SBM implementation. Let us underline here that while the bipartite stochastic blockmodel can be written as a particular case of a HSBM, the converse is not true (see Section A.3 in the SM).
The Bipartite-SBM also selected 2 groups of authors (and one group of papers). There was one small group with 4 authors, entirely contained in our first small group; it corresponds to authors that have the highest degree in the bipartite graph and the highest number of co-authors.
So, Bipartite-SBM output a very small group of the most prolific and the most collaborative authors in this dataset. Further details about the distinctions between these groups and the ones obtained by HyperSBM are given in SM.
As a conclusion, we see that while the outputs of Bipartite-SBM and HyperSBM may seem close on this specific dataset, they are nonetheless different. On the other hand, and still on this specific dataset, the spectral clustering approach outputs results that are completely different from those of HyperSBM.

Proofs
Proof of Lemma 1. We consider a fixed value of m ≥ 2 and denote by ⟦a, b⟧ the set of integer values between a and b. Let us recall that B^(m,n) is a fully symmetric tensor (1), so the number of free parameters in B^(m,n) is equal to the number of ordered sequences q_1 ≤ · · · ≤ q_m of elements in ⟦1, Q⟧. We denote by Q_+ this set of sequences. Then we define a function f which, to any such sequence q = (q_1, . . . , q_m), associates the value l = f(q) defined by f(q) = (q_1, q_2 + 1, q_3 + 2, . . . , q_m + m − 1).
We let L_+ denote the set of sequences l = (l_1, . . . , l_m) with coordinates in ⟦1, Q + m − 1⟧ and such that l_1 < l_2 < · · · < l_m. Thus, for any q ∈ Q_+ we get that f(q) ∈ L_+.
Conversely, we define a function g on L_+ by g(l) = (l_1, l_2 − 1, l_3 − 2, . . . , l_m − m + 1), which has non-decreasing coordinates in ⟦1, Q⟧, so that g maps L_+ to Q_+. The functions f and g are such that their compositions are the identity: f ∘ g = Id on L_+ and g ∘ f = Id on Q_+. These are thus one-to-one functions mapping Q_+ to L_+ and conversely.
This implies that the cardinalities of these two sets are equal. But an element of L_+ is exactly a subset of size m of ⟦1, Q + m − 1⟧, so that the cardinality of L_+ is the number of subsets of size m of ⟦1, Q + m − 1⟧, namely the binomial coefficient C(Q + m − 1, m). This concludes the proof of the lemma.
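The bijection between non-decreasing sequences and strictly increasing sequences can be checked numerically for small Q and m:

```python
from itertools import combinations, combinations_with_replacement
from math import comb

Q, m = 3, 3
Q_plus = list(combinations_with_replacement(range(1, Q + 1), m))  # q1 <= ... <= qm
f = lambda q: tuple(q_i + i for i, q_i in enumerate(q))           # shift to strict increase
L_plus = set(combinations(range(1, Q + m), m))                    # m-subsets of [1, Q+m-1]

assert {f(q) for q in Q_plus} == L_plus       # f is a bijection onto L_plus
assert len(Q_plus) == comb(Q + m - 1, m)      # hence the number of free parameters
```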
Proof of Theorem 2. The proof closely follows the structure of the proof of Theorem 2 in  for the SBM on simple graphs, generalized to the simple m-uniform HSBM.
The complete details of this generalization can be found in Section C of the SM. In this section, we focus on the key element that distinguishes our proof from the proof of Theorem 2 in . This crucial element involves constructing a set of degree sequences with specific properties. Due to the fundamental differences in characterizing which integer sequences can be realized as degree sequences between graphs and m-uniform hypergraphs, our construction differs from the one presented in the proof of Theorem 2 in .
Consider the following set D of n_0-length integer-valued sequences.

Lemma 7. The set D of n_0-length integer sequences satisfies: (i) for each i ∈ {1, . . . , n_0}, the set of i-th coordinates {d_i | d ∈ D} has cardinality at most Q; (ii) for large enough n_0 (depending on Q, m), any d ∈ D is the degree sequence of a simple m-uniform hypergraph over n_0 nodes; (iii) |D| ≥ Q^{n_0}.
Note that conditions (i) and (iii) together imply that each set {d_i | d ∈ D} has cardinality exactly Q and that |D| = Q^{n_0}.
Proof of Lemma 7. Points (i) and (iii) are a consequence of the definition of D. For any integer sequence d, a necessary condition for d to be the degree sequence of a simple m-uniform hypergraph over n_0 nodes is that m divides Σ_i d_i. Here, we rather need sufficient conditions in order to prove (ii). We rely on Corollary 2.2 in .
Corollary 2.2 in . Let d be an integer-valued sequence with maximum term ∆ and let p be an integer such that ∆ ≤ C(p − 1, m − 1). If m divides Σ_i d_i and Σ_i d_i ≥ (∆ − 1)p + 1, then d is the degree sequence of a simple m-uniform hypergraph.

Now, for any m ≠ m′, it remains to be able to jointly order the parameters θ^(m) and θ^(m′) up to a permutation on {1, . . . , Q}. If all the π_q's are different, which is a generic condition, this can easily be done because θ^(m) and θ^(m′) share the same distinct π_q's.
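This sufficient condition can be checked mechanically. In the sketch below we pick the smallest admissible p (our choice for illustration, since the condition Σ_i d_i ≥ (∆ − 1)p + 1 is easiest to satisfy for small p):

```python
from math import comb

def satisfies_sufficient_condition(d, m):
    """Check the sufficient condition of Corollary 2.2: m divides sum(d) and
    sum(d) >= (Delta - 1) * p + 1, for the smallest p with Delta <= C(p-1, m-1)."""
    Delta, s = max(d), sum(d)
    if s % m != 0:
        return False
    p = m                                  # smallest candidate: C(m-1, m-1) = 1
    while comb(p - 1, m - 1) < Delta:
        p += 1
    return s >= (Delta - 1) * p + 1
```

For example, d = (1, 1, 1) with m = 3 passes the check, consistently with the single hyperedge {1, 2, 3}.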
Proof of Proposition 4. We want to maximize J(θ, τ) with respect to τ_iq under the constraint Σ_{q=1}^Q τ_iq = 1 for all i. Using the method of Lagrange multipliers, this is equivalent to maximizing with respect to τ_iq the Lagrangian function

Λ(θ, τ, λ) = J(θ, τ) + Σ_{i=1}^n λ_i (Σ_{q=1}^Q τ_iq − 1).

Computing the partial derivative of Λ(θ, τ, λ) with respect to τ_iq and setting it to zero yields a fixed-point equation for τ_iq, in which the term e^{λ_i − 1} is the normalizing constant ensuring that Σ_{q=1}^Q τ_iq = 1 for each i. Finally, let us remark that the Lagrangian function Λ is concave with respect to each τ_iq, being the sum of a concave term (τ_iq log(π_q/τ_iq)) and linear terms. The critical point is then a maximum.
Proof of Proposition 5. For the prior probabilities π_q, we want to maximize J(θ, τ) with respect to π_q subject to the constraint Σ_{q=1}^Q π_q = 1. Using again Lagrange multipliers, this is equivalent to maximizing the Lagrangian function

Λ(θ, τ, λ) = J(θ, τ) + λ (Σ_{q=1}^Q π_q − 1).

Noting that the second term of J(θ, τ) does not depend on π_q, the partial derivative of Λ(θ, τ, λ) with respect to π_q reduces to Σ_{i=1}^n τ_iq/π_q + λ. This quantity is equal to 0 if

π_q = (1/n) Σ_{i=1}^n τ_iq,

where λ = −n is the normalizing constant ensuring that Σ_{q=1}^Q π_q = 1. Note that the Lagrangian function Λ is concave with respect to each π_q, being the sum of concave terms (τ_iq log(π_q/τ_iq)), of a linear term (λ Σ_{q=1}^Q π_q) and of a constant. The critical point is then a maximum.
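The closed-form update for the group proportions is a simple column average of the variational probabilities, as this minimal sketch illustrates:

```python
import numpy as np

tau = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.6, 0.4]])        # variational probabilities tau_iq, rows sum to 1
pi_hat = tau.mean(axis=0)           # M-step: pi_q = (1/n) * sum_i tau_iq
```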
Finally, we compute the partial derivative of the Lagrangian with respect to B^(m,n)_{q_1,...,q_m}. Through some basic algebraic manipulations, this quantity equals 0 if

B^(m,n)_{q_1,...,q_m} = Σ_{{i_1,...,i_m} ∈ V^(m)} τ_{i_1 q_1} · · · τ_{i_m q_m} Y_{i_1,...,i_m} / Σ_{{i_1,...,i_m} ∈ V^(m)} τ_{i_1 q_1} · · · τ_{i_m q_m},

i.e. the τ-weighted average of the hyperedge indicators. Again, the Lagrangian function is the sum of a concave term (in log B^(m,n)_{q_1,...,q_m}) and of constant terms, and is thus a concave function. The critical point is then a maximum.

A Limits of the bipartite graphs representation of hypergraphs
A.1 Bipartite graphs and multiple hypergraphs with self-loops equivalence

Some early analyses of hypergraphs rely on the embedding of hypergraphs into the space of bipartite graphs. Indeed, any hypergraph H = (V, E), where V is the set of nodes and E the set of hyperedges, may be represented as a bipartite graph with two parts.
The top part is simply the set V of hypergraph nodes, while the bottom part is the set E of hyperedges and there is a link between v ∈ V and e ∈ E whenever node v belongs to hyperedge e in the original hypergraph H. Now, it is possible to define a "converse" application from bipartite graphs to hypergraphs.
Indeed, any bipartite graph can be projected into two distinct hypergraphs, by choosing one of the two parts as the node set and forming a hyperedge with any set of nodes that are neighbors (in the bipartite graph) of a same node (belonging to the second part). A major difference appears depending on whether we consider simple hypergraphs or multiple hypergraphs with self-loops. In multiple hypergraphs (not to be confused with multiset hypergraphs), hyperedges may appear several times, so that these are weighted hypergraphs with integer-valued weights. We also allow for self-loops, i.e. hyperedges of cardinality 1. Then, this application from bipartite graphs to hypergraphs slightly differs depending on whether we allow the image of a bipartite graph to be a multiple hypergraph with self-loops or a simple hypergraph. In the first case, all the information from the bipartite graph is encoded in the multiple hypergraph with self-loops, while in the second case, part of the information is lost. This is illustrated on a toy example in Figure 6.
The embedding of the simple hypergraphs space into the bipartite graphs space is not the inverse of the natural projection of bipartite graphs into simple hypergraphs. Thus, models of bipartite graphs are inappropriate to handle simple hypergraphs, as the former generally put mass on any bipartite graph, notwithstanding the fact that not all of these may be realized as the image of a simple hypergraph. For the same reason, preferential attachment models of bipartite graphs (Guillaume and Latapy, 2004) may not be directly used for simple hypergraphs as they would produce unconstrained bipartite graphs that do not necessarily come from simple hypergraphs.

A.2 Artefacts induced by bipartite graphs models
In order to view a bipartite graph as a hypergraph, one first needs to select the top and bottom parts. Swapping the role of the two parts will in general give another hypergraph. Statistical models of bipartite graphs handle the two parts symmetrically and do not differentiate between a top and a bottom part. They are thus inadequate for modeling hypergraphs.
One may also note that most random bipartite graph models are designed for fixed part sizes, which induces, on top of a fixed number of nodes, a fixed number of hyperedges in the corresponding hypergraph model, an artifact which is not always desirable. In contrast, for instance, the uniformly random hypergraph model allows for any possible density of hyperedges.
A last example of inadequacy is given by configuration models on bipartite graphs that induce configuration models on hypergraphs. In these models, the degree distributions in each part are kept fixed. When projected into the hypergraph space, this means that the degrees of the nodes and the sizes of the hyperedges are kept fixed. Then, relying on shuffling algorithms to explore the space of this configuration model, one will lose the labels on the bottom part (the hyperedges part), as these are automatically induced by the new edges of the bipartite graph and the labelling of the top part (the nodes part). As a consequence, if a specific node tends to take part in large-size hyperedges, this information is lost in the configuration model issued from bipartite graphs.
To our knowledge, there is no configuration model on hypergraphs that only keeps the node degree sequence fixed. We mention that Section 4 of Chodrow (2020) provides a discussion of the limitations of the embedding approach in terms of the types of hypergraph null models from which we can conveniently sample. In particular, it establishes that there is no obvious route for vertex-label sampling in hypergraphs through bipartite random graphs.

A.3 HSBM is not a bipartite SBM
In this section, we briefly outline that (i) while the bipartite stochastic blockmodel can be seen as a particular case of HSBM, (ii) the converse is not true in general.
Let us first detail point (i). Consider a bipartite SBM on a bipartite graph G with two parts: a set V of n nodes, each node i carrying a latent group Z_i among Q groups drawn with proportions π, and a set U of e nodes, each node u carrying a latent group W_u among R groups. The model is then given by a connectivity matrix M of size Q × R whose entries M_qr are the conditional probabilities that a node in V from group q connects a node in U from group r. In other words, M_qr = P(X_iu = 1 | Z_i = q, W_u = r), where X = (X_iu) is the n × e incidence matrix of G. Now consider the hypergraph H constructed on the set of nodes V and whose hyperedges are obtained by looking at the sets of nodes in V connected to a same node in U. (A similar construction could be made by swapping the roles of V and U.) Then, the probability distribution of H under the induced bipartite SBM is exactly a HSBM with Q groups, with group proportions π and parameters

B^(m,n)_{q_1,...,q_m} = P(Y_{i_1,...,i_m} = 1 | Z_{i_1} = q_1, . . . , Z_{i_m} = q_m) = P(X_{i_1,u} = 1, . . . , X_{i_m,u} = 1 | Z_{i_1} = q_1, . . . , Z_{i_m} = q_m),

where u is the node that connects {i_1, . . . , i_m} into a hyperedge. So we see that the bipartite SBM induces a HSBM with constrained connection probabilities.
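The induced connectivity tensor, and its product form when R = 1, can be checked numerically; the notation ρ for the group proportions on the U part is ours:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, R = 2, 3
rho = rng.dirichlet(np.ones(R))      # latent group proportions on the U part
M = rng.uniform(size=(Q, R))         # M[q, r] = P(X_iu = 1 | Z_i = q, W_u = r)

# Induced size-3 connectivity: B_{q1 q2 q3} = sum_r rho_r * M[q1,r] * M[q2,r] * M[q3,r]
B = np.einsum('r,ar,br,cr->abc', rho, M, M, M)

# With R = 1 the tensor factorizes into a product over the m group indices
B1 = np.einsum('r,ar,br,cr->abc', np.ones(1), M[:, :1], M[:, :1], M[:, :1])
assert np.allclose(B1, np.einsum('a,b,c->abc', M[:, 0], M[:, 0], M[:, 0]))
```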
Let us now explain why (ii) the converse is not true in general, starting from a HSBM. We first remark that the bipartite SBM fit on the co-authorship dataset (from Section 4) selected R = 1, thus inducing hyperedge connectivity parameters with a product form B^(m)_{q_1,...,q_m} = Π_{j=1}^m M_{q_j 1}. Our fitted HSBM on this same dataset did not result in hyperedge connectivity parameters with a product form, which establishes that the models are clearly different. Now, more generally, we could ask whether, for given parameters (B^(m)_{q_1,...,q_m})_m of a HSBM, these can be realized by a bipartite SBM as above.

B The need for simple hypergraphs models
In this section, we discuss modeling differences between multiset hypergraphs, where multiset hyperedges are allowed, and simple hypergraphs, where hyperedges are subsets of nodes. We recall that multiset hypergraphs allow for repeated nodes in a same hyperedge, the latter being defined as a multiset of nodes. Multiset hyperedges generalize in some sense the notion of self-loops in graphs and thus are a natural extension to consider. However, they are not appropriate in all contexts. For instance, a co-authorship dataset cannot contain hyperedges with repeated nodes (but may contain a self-loop of a unique author). In the same way, a social interaction hypergraph does not contain multiset hyperedges; triadic interactions are restricted to 3 different individuals and self-loops are not allowed. In the meantime, multiset hyperedges are natural in other contexts; consider, e.g., chemical reaction hypergraphs where the multiplicity plays the role of the stoichiometric coefficients (Flamm et al., 2015). We first argue that multiset-hypergraph models are inappropriate for analysing simple hypergraphs.

B.1 A motivating example
Let us first focus on parameter estimates rather than on clustering. For the sake of simplicity, we restrict our attention to 3-uniform hypergraphs on a set of n nodes and consider two different models. The first one, denoted MH, acts on 3-uniform multiset hypergraphs and draws a hyperedge between any 3 nodes, not necessarily distinct, with probability p_MH. The second one acts on 3-uniform simple hypergraphs and draws a hyperedge between any 3 distinct nodes with some probability. On a toy dataset, the estimates of these probabilities may differ markedly, e.g. because the only possible size-3 hyperedge is observed while many multiset hyperedges are not. As a consequence, the statistical conclusions drawn on such a dataset will highly differ depending on whether we restrict attention to simple hypergraphs or work with more general multiset hypergraphs. This choice of the ambient space has to be made according to the specificities of the dataset. This simple and elementary example shows that it is not possible to statistically analyse a simple hypergraph with a multiset hypergraph model without erroneous conclusions.
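A minimal numeric illustration, assuming a toy dataset with n = 3 nodes and a single observed hyperedge {1, 2, 3} (the toy dataset and the name p_SH for the simple-hypergraph probability are our assumptions); the maximum likelihood estimate of the hyperedge probability is the observed count over the number of possible hyperedges, which differs between the two ambient spaces:

```python
from math import comb

n, observed = 3, 1
possible_simple = comb(n, 3)             # 3-subsets of n nodes: 1
possible_multiset = comb(n + 2, 3)       # 3-multisets of n nodes: 10

p_SH_hat = observed / possible_simple    # estimate under the simple-hypergraph model
p_MH_hat = observed / possible_multiset  # estimate under the multiset model
```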

B.2 Null model for multiset hypergraphs in modularity computations
The main technical difference between multiset hypergraph and simple hypergraph analysis comes from the enumeration of size-m subsets of nodes. In the multiset hypergraph setting, the summations over tuples of nodes (i_1, . . . , i_m) ∈ {1, . . . , n}^m factorize into m independent sums. On the contrary, in the simple hypergraph setting, the summations involve sets of nodes {i_1, . . . , i_m} that are constrained to be distinct. As a consequence, such a factorization is impossible.
Let us consider a concrete example. We already emphasized the fact that modularity criteria for hypergraphs have been proposed only in the multiset hypergraph setting. Modularities are generally constructed as deviation measures of the number of hyperedges from their expected number under a null model. For instance, in the graph context, the Newman and Girvan modularity of a partition (C_1, . . . , C_Q) of the nodes into Q clusters is computed as

(1/2|E|) Σ_{i,j in the same cluster} [A_ij − d_i d_j / (2|E|)],

where A = (A_ij)_{i,j} is the graph adjacency matrix, d_i is the degree of node i, and 2|E| = Σ_i d_i is twice the number of edges. While the first part of these criteria enumerates only the occurring hyperedges, a quantity that is small in general as most hypergraph datasets are sparse, the second part needs to account for all subsets of nodes in the graph (or at least in the same group C_q). In the case of graphs allowing self-loops, this second term factorizes to Σ_q Vol(q)²/(2|E|), where the computation of the volume Vol(q) = Σ_{i∈C_q} d_i has time complexity O(n). Similarly, for multiset hypergraphs, the modularity computed in  uses two main terms: the first is a cut term that depends only on occurring hyperedges, while the second relies on volumes of latent configurations of the nodes (see Eq. (12) and (13) therein). On the contrary, in the simple hypergraph setting, enumerating all subsets of nodes constrained to be distinct requires enumerating Σ_{m=2}^M C(n, m) elements for a hypergraph with n nodes and maximum hyperedge size M. This quantity is huge and represents the main computational limit when analysing hypergraphs (our approach to this issue is detailed in Section D of the SM).
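For graphs, the Newman-Girvan criterion and its volume-based null term can be sketched as follows:

```python
import numpy as np

def newman_girvan_modularity(A, labels):
    """(1/2|E|) * sum over same-cluster pairs (i,j) of (A_ij - d_i * d_j / 2|E|)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                        # node degrees
    two_E = d.sum()                          # 2|E|
    same = np.equal.outer(labels, labels)    # indicator of same-cluster pairs
    return ((A - np.outer(d, d) / two_E) * same).sum() / two_E
```

The null-model part summed over same-cluster pairs equals Σ_q Vol(q)²/(2|E|), which is why it can be evaluated in O(n) once the degrees are known; it is this factorization that fails for simple hypergraphs.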

C The complete proof of Theorem 2
For the sake of completeness, we provide here the complete proof of Theorem 2. This proof mostly reproduces the proof of Theorem 2 in .
The strategy relying on Kruskal's result. The proof strongly relies on an algebraic result from  that has appeared to be a powerful tool to establish identifiability results in various models whose common feature is the presence of discrete latent groups and at least three conditionally independent random variables. We first rephrase Kruskal's result in a statistical context. Consider a latent random variable V with state space {1, . . . , r} and distribution given by the column vector v = (v_1, . . . , v_r). Assume that there are three observable random variables U_j for j = 1, 2, 3, each with finite state space {1, . . . , κ_j}. The U_j's are moreover assumed to be independent conditional on V. Let M_j, j = 1, 2, 3, be the stochastic matrix of size r × κ_j whose i-th row is m^j_i = P(U_j = · | V = i). Then consider the 3-dimensional array (or tensor) with dimensions κ_1 × κ_2 × κ_3, denoted [v; M_1, M_2, M_3], whose (s, t, u) entry (for any 1 ≤ s ≤ κ_1, 1 ≤ t ≤ κ_2, 1 ≤ u ≤ κ_3) is Σ_{i=1}^r v_i m^1_{is} m^2_{it} m^3_{iu}, i.e. the probability P(U_1 = s, U_2 = t, U_3 = u). Now, the Kruskal rank of a matrix M, denoted rank_K M, is the largest number I such that every set of I rows of M is independent. Note that for any matrix M, its Kruskal rank is necessarily less than its rank, namely rank_K M ≤ rank M, and equality of rank and Kruskal rank does not hold in general. However, in the particular case when a matrix M of size p × q has rank p, it also has Kruskal rank p. Now, let I_j = rank_K M_j.  established the following result. If

I_1 + I_2 + I_3 ≥ 2r + 2,    (11)

then the tensor [v; M_1, M_2, M_3] uniquely determines v and the M_j, up to simultaneous permutation of the rows. In other words, the set of parameters {(v, P(U_j = · | V))} is uniquely identified, up to label switching on the latent groups, from the distribution of the random variables (U_1, U_2, U_3).
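The tensor [v; M_1, M_2, M_3] is just the joint law of (U_1, U_2, U_3), as this small numeric sketch illustrates (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
r, k1, k2, k3 = 2, 3, 3, 4
v = rng.dirichlet(np.ones(r))                  # distribution of the latent V
M1, M2, M3 = (rng.dirichlet(np.ones(k), size=r) for k in (k1, k2, k3))

# T[s, t, u] = sum_i v_i * M1[i, s] * M2[i, t] * M3[i, u] = P(U1=s, U2=t, U3=u)
T = np.einsum('i,is,it,iu->stu', v, M1, M2, M3)
```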
Now, to obtain generic identifiability, it is sufficient to exhibit a single parameter value for which (11) is satisfied. Indeed, the set of parameter values for which rank_K M_j is fixed can be expressed through a Boolean combination of polynomial inequalities (≠, or rather non-equalities) involving matrix minors in those parameters. In the same way, the converse condition of (11), namely the inequality I_1 + I_2 + I_3 ≤ 2r + 1, is a finite Boolean combination of polynomial non-equalities on the model parameters. This means that this set of parameters is an algebraic variety.
But an algebraic variety can only be either the whole parameter space (in which case exhibiting a single value where (11) is satisfied would not be possible) or a proper subvariety, thus a subspace of dimension strictly lower than that of the whole parameter space.
The strategy of the proof for showing identifiability of certain discrete latent class models developed in  and other papers by the same authors is to embed these models in the context of Kruskal's result just described. Applying Kruskal's result to the embedded model, the authors derive partial identifiability results on the embedded model, and then, using details of the embedding, relate these to the original model.
Embedding the HSBM into Kruskal's setup. For some number of nodes n (to be specified later), we let V = (Z_1, Z_2, . . . , Z_n) be the latent random variable, with state space {1, . . . , Q}^n, and denote by v the corresponding vector of its probability distribution. The entries of v are of the form π_1^{n_1} · · · π_Q^{n_Q} for some integers n_q ≥ 0 such that Σ_q n_q = n. We fix m ≥ 2 and consider simple m-uniform hypergraphs on the set of nodes V = {1, . . . , n}. Recall that V^(m) is the set of all distinct m-tuples of nodes in V and {Y_{i_1,...,i_m}; {i_1, . . . , i_m} ∈ V^(m)} the set of all indicator variables corresponding to possible (simple) hyperedges of an m-uniform hypergraph over V. Now, we will construct below subsets H_1, H_2, H_3 ⊂ V^(m) of distinct m-tuples of nodes such that H_i ∩ H_j = ∅ for any i ≠ j. Then, we choose the 3 observed variables U_j (1 ≤ j ≤ 3) as the vectors of indicator variables U_j = (Y_{i_1,...,i_m})_{{i_1,...,i_m} ∈ H_j}. This induces that κ_j = 2^{|H_j|} (where |H_j| is the cardinality of H_j). As the subsets H_1, H_2, H_3 do not share any m-tuple of nodes, the random variables U_j are conditionally independent given V. We are thus in the statistical context of Kruskal's result.
The goal is now to construct the 3 subsets H j of m-tuples such that their pairwise intersections are empty and such that condition (11) is satisfied (for at least one parameter value of the embedded model and thus generically for this embedded model). This construction of the H j 's proceeds in two steps: the base case and an extension step.
Starting with a small set $V_0 = \{1, \dots, n_0\}$ of nodes, we define a matrix $A$ of dimension $Q^{n_0} \times 2^{\binom{n_0}{m}}$. Its rows are indexed by latent configurations $v \in \{1, \dots, Q\}^{n_0}$ of the nodes in $V_0$, its columns by the set of all possible states of the vector of indicator variables $(Y_{i_1,\dots,i_m})$ over the $m$-tuples of $V_0$, and the entries of $A$ give the probability of observing the specified states of the vector of indicator variables, conditional on the latent configuration $v$. Thus each column index corresponds to a different simple $m$-uniform hypergraph on $V_0$. The base case consists in exhibiting a value of $n_0$ such that this matrix $A$ generically has full row rank. Then, in an extension step relying on $n = n_0^2$ nodes, we construct the subsets $H_1, H_2, H_3$ with the desired properties (namely, their pairwise intersections are empty and (11) is generically satisfied). From Kruskal's theorem, we obtain that the vector $v$ and the matrices $M_1, M_2, M_3$ are generically uniquely determined from the distribution of a simple $m$-uniform HSBM, up to simultaneous permutation of the rows.
With these embedded parameters $v, M_1, M_2, M_3$ in hand, it is still necessary to recover the initial parameters of the simple $m$-uniform HSBM: the group proportions $\pi_q$ and the connectivity parameters $B^{(m,n)} = (B^{(m,n)}_{q_1,\dots,q_m})_{1 \le q_1 \le \dots \le q_m \le Q}$. This will be done in the conclusion of the proof.
The base case. The initial step consists in finding a value of $n_0$ such that the matrix $A$ of size $Q^{n_0} \times 2^{\binom{n_0}{m}}$, containing the probabilities of any simple $m$-uniform hypergraph over these $n_0$ nodes conditional on the hidden node states, generically has full row rank.
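As an illustration, the matrix $A$ can be assembled explicitly for a toy case. The Python sketch below (with arbitrary parameter values, not those of the paper) builds $A$ for $n_0 = 4$, $Q = 2$, $m = 2$ and checks that each row is a probability distribution over the $2^{\binom{n_0}{m}}$ hypergraphs.

```python
from itertools import combinations, product
from math import comb

import numpy as np

# Illustrative sketch: build the matrix A for a toy case (n0 = 4 nodes, Q = 2
# latent groups, m = 2). Rows are indexed by latent configurations v, columns
# by the 2^C(n0,m) simple m-uniform hypergraphs; entries are conditional
# probabilities. The parameter values in B are arbitrary, for illustration only.
rng = np.random.default_rng(0)
n0, Q, m = 4, 2, 2
tuples = list(combinations(range(n0), m))       # all m-tuples of nodes
B = rng.uniform(0.1, 0.9, size=(Q,) * m)
B = (B + B.T) / 2                               # symmetric in the group labels (m = 2)

rows = []
for v in product(range(Q), repeat=n0):          # one row per latent configuration
    row = []
    for edges in product([0, 1], repeat=len(tuples)):   # one column per hypergraph
        p = 1.0
        for t, y in zip(tuples, edges):
            b = B[tuple(v[i] for i in t)]
            p *= b if y == 1 else 1.0 - b       # hyperedges independent given v
        row.append(p)
    rows.append(row)
A = np.array(rows)

assert A.shape == (Q ** n0, 2 ** comb(n0, m))   # Q^{n0} x 2^{C(n0,m)}
assert np.allclose(A.sum(axis=1), 1.0)          # each row sums to 1
print("rank of A:", np.linalg.matrix_rank(A))
```

Whether this rank equals $Q^{n_0}$ for a given $n_0$ is exactly the question the base case settles.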
The condition of having full row rank can be expressed as the non-vanishing of at least one $Q^{n_0} \times Q^{n_0}$ minor of $A$. Composing the map sending the parameters $\{B_{q_1,\dots,q_m}\}$ to $A$ with this collection of minors gives polynomials in the parameters of the model. To see that these polynomials are not identically zero, and thus are non-zero for generic parameters, it is enough to exhibit a single choice of the $\{B_{q_1,\dots,q_m}\}$ for which the corresponding matrix $A$ has full row rank. We consider parameters of the form
$$B_{q_1,\dots,q_m} = \frac{s_{q_1} s_{q_2} \cdots s_{q_m}}{s_{q_1} s_{q_2} \cdots s_{q_m} + t_{q_1} t_{q_2} \cdots t_{q_m}}, \quad \text{so that} \quad \bar{B}_{q_1,\dots,q_m} = \frac{t_{q_1} t_{q_2} \cdots t_{q_m}}{s_{q_1} s_{q_2} \cdots s_{q_m} + t_{q_1} t_{q_2} \cdots t_{q_m}},$$
with $s_q, t_q > 0$ to be chosen later. Since the property of having full row rank is unchanged under non-zero rescaling of the rows of $A$, and all entries of $A$ are monomials of total degree $\binom{n_0}{m}$ in the $\{B_{q_1,\dots,q_m}, \bar{B}_{q_1,\dots,q_m}\}$, we may simplify the entries of $A$ by removing denominators and consider the matrix (still denoted $A$) with entries in terms of $B_{q_1,\dots,q_m} = s_{q_1} s_{q_2} \cdots s_{q_m}$ and $\bar{B}_{q_1,\dots,q_m} = t_{q_1} t_{q_2} \cdots t_{q_m}$. In order to prove that the matrix $A$ has full row rank, it is enough to exhibit $Q^{n_0}$ independent columns of $A$. To this aim, we introduce polynomial functions whose independence is equivalent to that of the corresponding columns.

The rows of $A$ are indexed by the latent configurations $v \in \{1, \dots, Q\}^{n_0}$. For each node $i \in \{1, \dots, n_0\}$ and each latent group $q \in \{1, \dots, Q\}$, introduce an indeterminate $X_{i,q}$ and a $Q^{n_0}$-size row vector $X = (\prod_{1 \le i \le n_0} X_{i,v(i)})_{v \in \{1,\dots,Q\}^{n_0}}$. For each degree sequence $d$, denote by $A_d$ a column of $A$ whose index corresponds to a hypergraph with degree sequence $d$; then $X A_d$ is a polynomial in the indeterminates $X_{i,q}$. To prove the independence of a collection of such columns, we suppose that a linear combination vanishes,
$$\sum_{d} a_d \, X A_d = 0, \qquad (12)$$
and show that necessarily all $a_d = 0$. This will be given by the following lemma from . The lemma is originally formulated for a set $D$ of degree sequences; however, it is not specific to degree sequences and applies to any set $D$ of integer sequences indexed by $\{1, \dots, n_0\}$, so we phrase it in this way. We refer to  for its proof. The next step is to construct a set $D$ of $n_0$-length integer sequences that satisfies:
• for each $i \in \{1, \dots, n_0\}$, the set of $i$-th coordinates $\{d_i \mid d \in D\}$ has cardinality at most $Q$ (condition in Lemma 8);
• any $d \in D$ may be the degree sequence of a simple $m$-uniform hypergraph;
• $|D| \ge Q^{n_0}$.
With such a set at hand, by choosing one column of $A$ associated to each degree sequence in $D$, we obtain a collection of $|D| \ge Q^{n_0}$ different columns of $A$. These columns are independent since for each sequence $d^\star \in D$, by Lemma 8 we can choose values of the indeterminates $\{X_{i,q}\}_{1 \le i \le n_0, 1 \le q \le Q}$ such that all polynomials $X A_d$ vanish except $X A_{d^\star}$, leading to $a_{d^\star} = 0$ in equation (12). Thus, exhibiting such a set $D$ is the last step to prove that $A$ generically has full row rank. This is where our proof strongly differs from that of Theorem 2 in .
Indeed, the characterizations of degree sequences for graphs and for simple $m$-uniform hypergraphs are completely different. Relying on a result by , we exhibited such a set in Lemma 7.
This concludes the proof of the base case.
The extension step. The extension step builds on the base case in order to construct a larger set of $n = n_0^2$ nodes and subsets $H_1, H_2, H_3 \subset V^{(m)}$ of distinct $m$-tuples of nodes in $V = \{1, \dots, n\}$ with the desired properties. This step was first stated as Lemma 16 in Allman et al. (2009) in the context of the SBM for simple graphs, and we extend it below to our case.
Let us recall that we want to construct pairwise disjoint $H_1, H_2, H_3 \subset V^{(m)}$. Then, with notation from above, we choose the 3 observed variables $U_j$ ($1 \le j \le 3$) as the vectors of indicator variables $U_j = (Y_{i_1,\dots,i_m})_{\{i_1,\dots,i_m\} \in H_j}$. As the subsets $H_1, H_2, H_3$ do not share any $m$-tuple of nodes, the random variables $U_j$ are conditionally independent given $V = (Z_1, \dots, Z_n)$.
We let $M_j$ denote the $Q^n \times 2^{|H_j|}$ matrix of conditional probabilities of $U_j$ given $Z$.
Lemma 9. Suppose that for some number of nodes $n_0$, the matrix $A$ of size $Q^{n_0} \times 2^{\binom{n_0}{m}}$ defined above generically has full row rank. Then with $n = n_0^2$ there exist pairwise disjoint subsets $H_1, H_2, H_3 \subset V^{(m)}$ of $m$-tuples of nodes in $V = \{1, \dots, n\}$ such that for each $j$ the $Q^n \times 2^{|H_j|}$ matrix $M_j$ generically has full row rank ($Q^n$).
Proof of Lemma 9. Let us describe the construction of the $H_j$. We partition the $n_0^2$ nodes into $n_0$ groups of size $n_0$ in three different ways, each way leading to one $H_j$. Each $H_j$ is then the union, over the $n_0$ groups, of the sets of all $m$-tuples within a group. Thus each $H_j$ has cardinality $n_0 \binom{n_0}{m}$. Labeling the nodes by $(u, v) \in \{1, \dots, n_0\} \times \{1, \dots, n_0\}$, we picture the nodes as lattice points in a square grid. We take as the partition leading to $H_1$ the rows of the grid, as the partition leading to $H_2$ the columns of the grid, and as the partition leading to $H_3$ the diagonals.
In other words, $H_1$ is the union over the $n_0$ rows of all $m$-tuples of nodes within each row, and similarly with columns and diagonals. Explicitly, we define two functions $u, v$ that associate to any node $i \in \{1, \dots, n\}$ its coordinates $(u(i), v(i))$ on the $n_0 \times n_0$ grid. Then
$$H_1 = \{\{i_1, \dots, i_m\} \in V^{(m)} : u(i_1) = \dots = u(i_m)\}, \quad H_2 = \{\{i_1, \dots, i_m\} \in V^{(m)} : v(i_1) = \dots = v(i_m)\},$$
$$H_3 = \{\{i_1, \dots, i_m\} \in V^{(m)} : v(i_1) - u(i_1) \equiv \dots \equiv v(i_m) - u(i_m) \pmod{n_0}\}.$$
The $H_j$ are pairwise disjoint as required: two distinct nodes cannot agree on two of the three coordinates $u$, $v$ and $v - u \bmod n_0$.
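The grid construction can be checked directly. The sketch below (assuming, as one plausible convention, wrap-around diagonals $v - u \bmod n_0$; the proof only needs disjointness) builds $H_1, H_2, H_3$ and verifies their cardinalities and pairwise disjointness.

```python
from itertools import combinations
from math import comb

# Sketch of the extension-step construction: nodes of V = {0, ..., n0^2 - 1}
# are placed on an n0 x n0 grid; each H_j collects all m-tuples lying inside
# one part of a partition (rows, columns, or wrap-around diagonals -- the
# diagonal convention here is our assumption).
def build_H(n0, m):
    nodes = range(n0 * n0)
    u = lambda i: i // n0                            # row coordinate
    v = lambda i: i % n0                             # column coordinate
    keys = [u, v, lambda i: (v(i) - u(i)) % n0]      # rows, columns, diagonals
    H = []
    for key in keys:
        Hj = set()
        for c in range(n0):
            part = [i for i in nodes if key(i) == c]
            Hj.update(combinations(part, m))         # all m-tuples within a part
        H.append(Hj)
    return H

H1, H2, H3 = build_H(n0=4, m=3)
assert all(len(Hj) == 4 * comb(4, 3) for Hj in (H1, H2, H3))  # n0 * C(n0, m)
assert not (H1 & H2) and not (H1 & H3) and not (H2 & H3)      # pairwise disjoint
```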
The matrix $M_j$ of conditional probabilities of $U_j$ given $Z$ has $Q^n$ rows indexed by composite states of all $n = n_0^2$ nodes, and $2^{n_0 \binom{n_0}{m}}$ columns indexed by states of the $m$-tuples in $H_j$. Observe that with an appropriate ordering of the rows and columns (which depends on $j$), $M_j$ has the block structure
$$M_j = A \otimes A \otimes \dots \otimes A \qquad (n_0 \text{ factors}).$$
(Note that since $A$ is $Q^{n_0} \times 2^{\binom{n_0}{m}}$, the tensor product on the right is $(Q^{n_0})^{n_0} \times (2^{\binom{n_0}{m}})^{n_0}$, which is $Q^{n_0^2} \times 2^{n_0 \binom{n_0}{m}}$, the size of $M_j$.) That $M_j$ is this tensor product is most easily seen by noting that the partitioning of the $n_0^2$ nodes into $n_0$ disjoint sets (rows, columns or diagonals of the grid) gives rise to $n_0$ copies of the matrix $A$, one for each set of the partition.
Conclusion for the original model. The entries of $v$ are of the form $\pi_1^{n_1} \cdots \pi_Q^{n_Q}$ with $\sum_q n_q = n$, while the entries of the $M_j$ contain information on the $B^{(m,n)}_{q_1,\dots,q_m}$. Although the ordering of the rows of the $M_j$ is arbitrary, crucially we do know how the rows of $M_j$ are paired with the entries of $v$.
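The dimension and rank bookkeeping for the tensor product $M_j = A \otimes \dots \otimes A$ can be sanity-checked numerically on a random stand-in for $A$ (the sizes below are arbitrary placeholders, not the actual $Q^{n_0} \times 2^{\binom{n_0}{m}}$ dimensions):

```python
import numpy as np

# Random stand-in for A (hypothetical sizes: 4 rows playing the role of Q^{n0},
# 6 columns playing the role of 2^{C(n0,m)}), with n0 = 3 tensor factors.
rng = np.random.default_rng(1)
A = rng.random((4, 6))
M = np.kron(np.kron(A, A), A)                # A tensor A tensor A

assert M.shape == (4 ** 3, 6 ** 3)           # rows and columns multiply
assert np.linalg.matrix_rank(A) == 4         # a generic A has full row rank
assert np.linalg.matrix_rank(M) == 4 ** 3    # rank(kron) = product of ranks
```

This illustrates why full row rank of $A$ transfers to $M_j$: the rank of a Kronecker product is the product of the ranks of its factors.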
By focusing on one of the matrices, say $M_1$, and summing appropriate columns of it, we can obtain marginal conditional probabilities of single hyperedge variables, namely a column vector with values $(\mathbb{P}_\theta(Y_{i_1,\dots,i_m} = 1 \mid (Z_1, \dots, Z_n) = v))_v$ for any $m$-tuple $\{i_1, \dots, i_m\}$. Indeed, this vector is obtained by summing all the columns of $M_1$ corresponding to simple $m$-uniform hypergraphs with $Y_{i_1,\dots,i_m} = 1$. Thus, we recover the set of values $\{B^{(m,n)}_{q_1,\dots,q_m}\}_{1 \le q_1 \le \dots \le q_m \le Q}$, but without order: we do not yet know the $B^{(m,n)}_{q_1,\dots,q_m}$ up to a permutation on $\{1, \dots, Q\}$ only, but rather up to a permutation on $\{1, \dots, Q\}^n$.
In the following, we assume without loss of generality, as it is a generic condition, that all the $B^{(m,n)}_{q_1,\dots,q_m}$ are distinct. We first recover the set of parameters $\{B^{(m,n)}_{q,\dots,q};\ 1 \le q \le Q\}$ from the complete set of parameters. Note that the corresponding entries of $v$ are given by $\pi_q^m$. So we also recover the paired values $\{(\pi_q, B^{(m,n)}_{q,\dots,q});\ 1 \le q \le Q\}$. Then, we continue with the sets $R_v$ with cardinality two: these are of the form $\{B^{(m,n)}_{q,\dots,q,l};\ 1 \le q \ne l \le Q\}$. By induction, we recover the set of parameters $\{B^{(m,n)}_{q,\dots,q,l,l'};\ 1 \le q, l, l' \le Q \text{ and } q, l, l' \text{ distinct}\}$, and so on, ending with the set of parameters $\{B^{(m,n)}_{q_1,\dots,q_m};\ 1 \le q_1 < q_2 < \dots < q_m \le Q\}$. This means that we finally obtain the parameters $\{\pi_q, B^{(m,n)}_{q_1,\dots,q_m}\}_{1 \le q \le Q;\ 1 \le q_1 \le \dots \le q_m \le Q}$ up to a permutation over $\{1, \dots, Q\}$. Finally, note that all generic aspects of this argument, in the base case and in the requirement that the parameters $B^{(m,n)}_{q_1,\dots,q_m}$ be distinct, concern only the $B^{(m,n)}_{q_1,\dots,q_m}$. Thus if the group proportions $\pi_q$ are fixed to any specific values, the theorem remains valid.
Remark. The requirement of a large enough $n$ is more precisely given as $n \ge Q^2 p^2$, where $p$ is the smallest integer such that $\binom{p-1}{m-1} \ge Qm$. A rough approximation gives that $p$ is of the order $(Qm)^{1/(m-1)}$, so that $n$ should be larger than roughly $Q^2 (Qm)^{2/(m-1)}$.
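To make the remark concrete, the following sketch computes $p$ and the resulting lower bound on $n$ for given $Q$ and $m$:

```python
from math import comb

# p is the smallest integer with C(p-1, m-1) >= Q*m; the identifiability
# argument then requires n >= Q^2 * p^2 nodes.
def min_nodes(Q, m):
    p = m                                   # C(p-1, m-1) = 1 when p = m
    while comb(p - 1, m - 1) < Q * m:
        p += 1
    return p, Q * Q * p * p

p, n_min = min_nodes(Q=2, m=3)
assert (p, n_min) == (5, 100)               # C(4,2) = 6 >= 6 while C(3,2) = 3 < 6
```

So for $Q = 2$ groups and 3-uniform hypergraphs, $n \ge 100$ nodes suffice for the argument.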

D Computational details on the algorithm's implementation
In order to provide an efficient implementation, the whole estimation algorithm is implemented in C++ using the Armadillo library for linear algebra. Moreover, the implementation is made available in R by means of the R packages Rcpp (Eddelbuettel and François, 2011; Eddelbuettel, 2013) and RcppArmadillo (Eddelbuettel and Sanderson, 2014). In the following we discuss some of the most relevant computational details.
Dealing with heavy computational cost. Since it deals with very large data structures, the main drawback of the proposed algorithm is its intensive computational effort, in terms of both the execution time needed to converge and the required memory space. The clearest example is the computation of the products $\tau_{i_1 q_1} \cdots \tau_{i_m q_m}$, required both in the VE-step (see Proposition 4, for $\tau_{iq}$) and in the M-step (see Proposition 5, for $B^{(m,n)}_{q_1,\dots,q_m}$). The huge computational cost of this calculation derives from the large number of potential unordered node tuples even for rather small values of $n$ and $m$; indeed $|V^{(m)}| = \binom{n}{m}$. A first possibility is to compute all the products $\tau_{i_1 q_1} \cdots \tau_{i_m q_m}$ recursively at the beginning of each VEM iteration and to store them in a matrix. Although this is very beneficial for the computational time, the resulting matrix is huge, having $\binom{n}{m}$ rows and $\binom{Q+m-1}{m}$ columns. The result is a structure that is intractable except for very small values of $n$, $Q$ and (especially) $m$. Taking into account that every element requires 8 bytes, we report some examples in Table 6 in order to clarify the magnitude of the quantity to store. Given the impossibility of storing a matrix of such size, the computation of the required products $\tau_{i_1 q_1} \cdots \tau_{i_m q_m}$ is instead implemented directly inside the VE- and M-steps through nested loops; this involves a significant increase in computing time but requires a minimal amount of memory.
Table 6: Memory size of the matrix containing the products $\tau_{i_1 q_1} \cdots \tau_{i_m q_m}$ for given values of $n$ (number of nodes), $Q$ (number of latent groups) and $m$ (hyperedge size).
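The memory figures behind Table 6 follow from a simple count; a sketch (the example value of $n$ is ours, not necessarily taken from the table):

```python
from math import comb

# The matrix of products tau_{i1 q1} ... tau_{im qm} has C(n, m) rows (one per
# unordered m-tuple of nodes), C(Q+m-1, m) columns (one per unordered m-tuple
# of groups, with repetitions), and 8 bytes per entry.
def tau_matrix_bytes(n, Q, m):
    return 8 * comb(n, m) * comb(Q + m - 1, m)

# Example: n = 100 nodes, Q = 2 groups, size-3 hyperedges.
assert comb(100, 3) == 161_700
assert tau_matrix_bytes(100, 2, 3) == 8 * 161_700 * 4   # about 5 MB already
```

The count makes clear why $m$ dominates: both factors grow like the $m$-th power of their arguments.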
Floating point underflow. Another crucial aspect is the possible numerical instability deriving from the multiplication of many small values in the computation of $\tau_{iq}$. A simple remedy is to compute $\log \tau_{iq}$ instead of $\tau_{iq}$. So, denoting $b_{iq} = \log(\tau_{iq}) - c_i$ for some constant $c_i$, we compute $\tau_{iq}$ relying on
$$\tau_{iq} = \frac{\exp(b_{iq} - b_{\max,i})}{\sum_{q'=1}^{Q} \exp(b_{iq'} - b_{\max,i})},$$
where $b_{\max,i} = \max_{q=1,\dots,Q} b_{iq}$ prevents the denominator from growing excessively large, thus avoiding new potential numerical issues related to floating point underflow.
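This is the standard log-sum-exp trick; a minimal numpy sketch:

```python
import numpy as np

# Underflow-safe normalization of tau: work on the log scale and subtract the
# row-wise maximum b_{max,i} before exponentiating.
def normalize_tau(b):
    """b: n x Q array of log tau_iq up to an additive constant per row."""
    b_max = b.max(axis=1, keepdims=True)         # b_{max,i}
    e = np.exp(b - b_max)                        # largest exponent is 0: no overflow
    return e / e.sum(axis=1, keepdims=True)

b = np.array([[-1000.0, -1001.0], [-3.0, -1.0]])
tau = normalize_tau(b)
assert np.allclose(tau.sum(axis=1), 1.0)         # rows are probability vectors
# A naive exp(b) would underflow to 0/0 on the first row; here the value is exact.
assert abs(tau[0, 0] - 1 / (1 + np.exp(-1.0))) < 1e-12
```

The additive constant $c_i$ cancels in the ratio, so any convenient log-scale representation of $\tau_{iq}$ can be used.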

E.1 Settings
We generated hypergraphs under the HSBM with two or three latent groups (Q = 2 or Q = 3).
To choose the parameter values of our scenarios, we thus start by setting $\alpha_0$ and a constant ratio $\rho$ that is used both for pairwise and three-wise interactions ($m = 2, 3$). Then we simply obtain
$$\beta_0 = \alpha_0 \rho \frac{\sum_q \pi_q^2}{1 - \sum_q \pi_q^2} \quad \text{and} \quad c = \sum_q \pi_q^2.$$
The initial choices of $\alpha_0, \rho$ in each setting are reported in ; scenarios A exhibit assortative behaviours ($\rho > 1$) while scenarios B exhibit disassortative behaviours ($\rho < 1$). The numbering of the settings (e.g. "2" in A2) indicates the number of groups $Q$.
Hyperparameter settings. All the experiments were run with the following hyperparameters. Concerning the soft spectral clustering initialization, the k-means algorithm (on the rows of the matrix of leading eigenvectors) is run with 100 initializations. The tolerance threshold $\epsilon$ used to stop the fixed-point and the VEM algorithms is set to $10^{-6}$. The maximum numbers of iterations for the fixed-point and the VEM algorithms were set to $U_{\max} = 50$ and $T_{\max} = 50$, respectively.
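For concreteness, here is a hedged numpy sketch of such a spectral initialization: our reading of the procedure (k-means on the rows of the matrix of $Q$ leading eigenvectors, with repeated initializations), with a simple farthest-point-seeded k-means written out. The actual HyperSBM implementation and its parameter values may differ.

```python
import numpy as np

# Hypothetical sketch of a spectral initialization for Q groups: embed nodes
# via the Q leading eigenvectors of an adjacency matrix, then run k-means
# several times and keep the partition with the smallest inertia.
def spectral_init(adj, Q, n_init=10, seed=0):
    rng = np.random.default_rng(seed)
    vals, vecs = np.linalg.eigh(adj)
    X = vecs[:, np.argsort(vals)[-Q:]]              # rows = node embeddings
    best, best_inertia = None, np.inf
    for _ in range(n_init):
        # farthest-point seeding: first center random, next ones maximally spread
        centers = [X[rng.integers(len(X))]]
        while len(centers) < Q:
            d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
            centers.append(X[np.argmax(d)])
        centers = np.stack(centers)
        for _ in range(100):                        # Lloyd iterations
            labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            new = np.stack([X[labels == q].mean(0) if (labels == q).any() else centers[q]
                            for q in range(Q)])
            if np.allclose(new, centers):
                break
            centers = new
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best, best_inertia = labels.copy(), inertia
    return best

# Two disconnected 5-cliques should be split perfectly.
adj = np.zeros((10, 10))
adj[:5, :5] = 1; adj[5:, 5:] = 1; np.fill_diagonal(adj, 0)
labels = spectral_init(adj, Q=2)
assert len(set(labels[:5])) == 1 and len(set(labels[5:])) == 1
assert labels[0] != labels[5]
```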

E.2 Additional synthetic results
In this section, we provide additional synthetic results and comments. First, we explore clustering estimation in scenario A3'. Figure 7 shows boxplots of the ARI for both our method HyperSBM and HSC. We observe that while HyperSBM again outperforms HSC (obtaining higher ARI values overall and significantly lower variances), both methods obtain very good results in this setting (with ARI always larger than 0.9). Thus this setting, while being sparse (see the values of $E^{(m,n)}$ in Table 8), appears to be an easy one from the clustering point of view. This is why we chose to push the limits and introduced setting A3, which is even sparser than the sparse scenario A3'. Indeed, the ratio between the number of hyperedges $E^{(m,n)}$ in settings A3' and A3 is equal to 2.7. In this much sparser regime, which we call highly sparse below, we have shown that clustering still works with HyperSBM while it fails with HSC.
However, setting A3 appears to be too sparse for model selection: indeed, we always under-estimate the number of groups. It is beyond the scope of the present work to further explore the reasons for this, or to characterize highly sparse with respect to sparse regimes.
Table 9: Resulting parameter values (within-group $\alpha^{(m,n)}$ and between-group $\beta^{(m,n)}$, rounded to the fifth decimal) and expected number $E^{(m,n)}$ of size-$m$ hyperedges (rounded to integer value) for Scenarios B and either $Q = 2$ (setting B2) or $Q = 3$ (setting B3).
At this point, we conclude that ICL performs well in dense settings (data not shown) as well as sparse ones, but might not in highly sparse regimes.
In Table 11, we indicate the values of the thresholds $KS_m$, $\overline{KS}_m$ in the different settings explored. We recall that these thresholds are compared to 1, and we observe values that are very small and far from 1.

F Analyses on the co-authorship dataset
The original dataset has 274 papers and 314 authors, with 1 paper having 6 authors and 1 paper having 5 authors. We decided to consider $M = 4$ and to discard these 2 papers with more than 4 authors.
Figure 8: Integrated Classification Likelihood index resulting from fitting the HSBM to the co-authorship dataset with number of latent groups ranging from 2 to 5.
We ran HyperSBM with $Q$ ranging from 2 to 5, with 2 different initialisations: 1 random and 1 relying on soft spectral clustering. The random initialisation always gave the best result, and the results were robust across different runs. ICL selected $Q = 2$ groups, as shown in Figure 8.
We obtained a first, small group with only 8 authors (the remaining 71 authors being in the second, large group). Inspecting more closely the variational parameters $\tau_{iq}$ for all the nodes, we found that a total of 4 nodes could be considered as ambiguously classified, while all other nodes had posterior probabilities larger than 0.8 of belonging to one of the groups. More precisely, in the first (small) group, 2 nodes had posterior probabilities of belonging to that group equal to 0.54 and 0.63, respectively; while in the second (large) group, 2 nodes had posterior probabilities of belonging to that group equal to 0.56 and 0.72, respectively.
We discussed in the main text the number of co-authors and the degrees in the bipartite graph (i.e. the number of co-published papers) of the first, small group of authors. We noticed that the 2 authors in this group with the smallest number of co-authors (namely 4) and the smallest degree (also 4) are the ones that are ambiguously clustered in this group, while the 2 other ambiguously clustered authors, in the second, large group, have 6 and 4 co-authors, respectively, and both have degree 4. This reinforces the conclusion that, on this dataset, HyperSBM has set apart the authors that are both the most collaborative and the most prolific.
Then we ran the spectral clustering algorithm on our dataset. We looked at the spectral gap, which suggested 15 groups, though the gap is not clear-cut. We then looked at the clustering obtained with $Q = 2$ groups. Spectral clustering output groups with sizes 24 and 55, respectively; we recall that spectral clustering tends to output groups of comparable sizes. The small group contains the only author with 12 co-authors, the remaining authors in it having between 1 and 4 co-authors; the second, large group has numbers of co-authors ranging from 1 to 11. The small group contains authors with small degree in the bipartite graph, i.e. having few co-published papers (all but one author have degree less than 4, the last author having degree 7), while the large group contains the 3 authors with the largest degrees, the rest of its authors having degrees ranging from 1 to 6. Thus, these groups are characterized neither by the number of co-authors nor by the degrees in the bipartite graph.
Finally, we analyzed the same dataset as a bipartite graph under a Bipartite SBM. We relied on the R package sbm, through the function estimateBipartiteSBM.
The Bipartite-SBM also selected 2 groups of authors (and one group of papers). There was one small group with 4 authors, which are exactly the ones that have the highest degree in the bipartite graph and also correspond to the 4 authors having the highest number of co-authors.
Here, 2 nodes could be considered as ambiguously classified: one node from the first, small (resp. second, large) group had a posterior probability of belonging to that group of only 0.73 (resp. 0.67). These 2 nodes were not ambiguously classified by HyperSBM and both appeared in our first, small group.
It is interesting to compare the situation of three particular authors. The author with index 48 has 7 co-authors (the 6th highest number) and 6 co-authored papers (the 5th highest). It lies outside the small first group with the Bipartite-SBM method (posterior probability $1 - 0.67 = 0.33$ of belonging to that group), while HyperSBM clusters it unambiguously in the first, small group. Similarly, the author with index 27 has 12 co-authors (the highest number) and only 7 co-authored papers (the 4th highest). This node was ambiguously classified by the Bipartite-SBM method in the first, small group (posterior probability only 0.73), while HyperSBM clusters it unambiguously in the first, small group. Conversely, the author with index 35 has 8 co-authors (the 6th highest number) and 5 co-authored papers (also the 5th highest). This author is unambiguously clustered by both methods, but while HyperSBM puts it in the first, small group, Bipartite-SBM excludes it from that group. The examination of these 3 borderline cases seems to show that, on this dataset, Bipartite-SBM was more sensitive to the authors' degrees in the bipartite graph, while HyperSBM paid more attention to the sizes of the hyperedges (i.e. the numbers of co-authors) an author was involved in.

We also looked at the estimated connection probabilities in the bipartite SBM. The authors from the first, small group of Bipartite-SBM have many papers: their estimated connection probability with the unique group of papers in the bipartite graph is 11.5%, whereas it is only 2.5% for the other, large group. Finally, we computed the parameter values $B^{(m,n)}_{q_1,\dots,q_m}$ obtained with the groups estimated by Bipartite-SBM. We obtained with $m = 2$ that $\hat{B}^{(2)}_{11} \simeq 16.6\%$ (to be compared with 4.2% in HyperSBM), while $\hat{B}^{(2)}_{12} \simeq 7\%$ and $\hat{B}^{(2)}_{22} \simeq 1\%$ (more similar to the results of HyperSBM, which are 5.1% and 0.8%, respectively). In this case, the first group of authors behaves differently with respect to within-group connections compared to between-group connections.