Identifiability of Level-1 Species Networks from Gene Tree Quartets

When hybridization or other forms of lateral gene transfer have occurred, evolutionary relationships of species are better represented by phylogenetic networks than by trees. While inference of such networks remains challenging, several recently proposed methods are based on quartet concordance factors—the probabilities that a tree relating a gene sampled from the species displays the possible 4-taxon relationships. Building on earlier results, we investigate what level-1 network features are identifiable from concordance factors under the network multispecies coalescent model. We obtain results on both topological features of the network, and numerical parameters, uncovering a number of failures of identifiability related to 3-cycles in the network. Addressing these identifiability issues is essential for designing statistically consistent inference methods.


Introduction
Statistical inference of phylogenetic networks, showing evolutionary relationships between species when hybridization or other horizontal gene transfer has occurred, poses substantial theoretical and practical problems. With data in the form of many sequenced and aligned genes, standard phylogenetic methods can be used to infer gene trees. However, due to both horizontal inheritance and the population genetic effect of incomplete lineage sorting, these gene trees reflect the species network topology only indirectly. Extracting the network signal within an acceptable computational time, and even determining what aspects of the network can be inferred under the Network Multispecies Coalescent (NMSC) model, is challenging.
Several recently developed network inference methods utilize summaries of (inferred) gene trees through counts of their displayed quartet trees, that is, empirical quartet concordance factors (CFs). SNaQ (Solís-Lemus and Ané 2016) uses pseudolikelihood on these CFs to pick an optimal network among those of level 1. NANUQ (Allman et al. 2019) also uses quartet counts in the level-1 setting, but avoids pseudolikelihood computations by conducting hypothesis tests for each quartet, followed by a distance-based approach to avoid searching over networks. (PhyloNet (Yu and Nakhleh 2015) similarly uses pseudolikelihood, though with rooted triple counts and without the level-1 restriction.)
While these methods strike a balance between thorough statistical analysis and computational effort, a complete exploration of what level-1 network features are identifiable from CFs under the NMSC has yet to be undertaken. First results in this direction (Solís-Lemus and Ané 2016) showed certain semidirected level-1 network topologies were distinguishable from those obtained by dropping a hybrid edge, and that in some cases numerical parameters were identifiable up to a finite number of possibilities, i.e., were locally identifiable. Topological identifiability was later investigated (Baños 2019), establishing that semidirected level-1 network topologies are identifiable up to contraction of 2- and 3-cycles and directions of hybrid edges in 4-cycles, for generic parameters. While these works provide our starting point, we seek to fill in unaddressed gaps. Appendix A gives more detail on how this work complements its predecessors, and discusses the claims and arguments in Solís-Lemus et al. (2020) for work described in Solís-Lemus and Ané (2016).
We rigorously establish what can be identified, and what cannot, from quartet CFs under the NMSC. Our concern here is with the theoretical question of identifiability. We thus delineate what might be consistently inferred by a method using quartet counts, although particular methods may not be able to do so. Our results also imply parameter identifiability results for data types from which quartet counts can be obtained (e.g., topological gene trees, or metric gene trees), although for such data it is possible that stronger identifiability claims could be established.
Our main results address identifiability of the full semidirected topology of a binary network, including hybrid edge directions (Theorem 19), and the numerical parameters of edge lengths and hybridization (or inheritance) probabilities (Theorem 30). One interesting aspect is that the presence of a 3-cycle can generally be detected, but whether the hybrid node of that cycle can be identified or not depends on the numerical parameters. The subsets of parameters on which this question has a positive or negative answer both have positive measure, and thus neither set can be dismissed as nongeneric. This means, in particular, that the direction of gene flow between recently diverged populations may simply be unknowable from CFs in some instances. (We do not, however, suggest CFs be abandoned as a useful data summary, as the limits of identifiability from alternative data summaries have not been similarly investigated.) The precise statements of these main theorems have exceptions for cycles adjacent to pendant edges. However, these simplify if one has multiple samples per taxon. Then the semidirected network topology is generically identifiable except for the presence of 2-cycles and (sometimes) hybrid nodes in 3-cycles. If the semidirected network topology with 2-cycles removed is known, then all numerical parameters except those relating to 3-cycles and their adjacent edges are generically identifiable.
Underlying our results are analyses of algebraic varieties associated with certain small networks using computational algebra software. These lead to algebraic (polynomial equality) tests of quartet CFs for different network substructures. However, 3-cycle identifiability results depend on semialgebraic tests (polynomial inequalities). These tests were motivated by equalities found for related networks, but their construction is not purely computational.
Our identifiability results for numerical parameters are based on explicit rational formulas for parameters in terms of CFs, so if a topological network is known or proposed, one could in principle estimate numerical parameters with them. But while some of these formulas are simple, others are quite complicated, and should not be expected to provide good estimates from data. These formulas may, however, give useful initial estimates of parameters that could then be refined through optimization, such as with likelihood methods.
Sections 2 and 3 give definitions and earlier results that we use as our starting point. In Sect. 4 we study topological identifiability of level-1 binary networks from quartet CFs, and in Sect. 5 the identifiability of numerical network parameters from the same information. Section 6 discusses implications for data analysis.
Appendix A explains how our results complement earlier work, and Appendix B catalogs the computational results for specific networks that underlie our arguments.

Rooted and Unrooted Phylogenetic Networks
A topological binary rooted phylogenetic network N⁺ is a finite rooted graph, with all edges directed away from the root, whose non-root internal nodes form two classes: tree nodes have indegree 1 and outdegree 2, while hybrid nodes have indegree 2 and outdegree 1. Hybrid edges and tree edges are classified according to their child nodes. Leaves of the network are bijectively labelled by taxa in a set X. A network is metric if in addition each tree edge is assigned a positive length, and each hybrid edge a non-negative length and a positive probability γ, such that for every pair of hybrid edges e, e′ with a common child, γ + γ′ = 1. More formal definitions of phylogenetic networks appear in Solís-Lemus and Ané (2016), Baños (2019), and Steel (2016). We often depict these networks with their root at the top, referring to edges and nodes as above or below one another in the natural way.
As explained in Ané et al. (2024), for gene quartet-based methods of inference a useful form of an unrooted network is more subtle than that for a tree. Substructures above the least stable ancestor (LSA) of the taxa (Steel 2016) are undetectable by these methods, as is the LSA itself. The topological unrooted phylogenetic network induced from N⁺ is the semidirected network N = N⁻, obtained from N⁺ by deleting all edges and nodes above the LSA, undirecting tree edges, and suppressing the LSA. Since our concern in this work is the identifiability of unrooted phylogenetic networks, we will often use N rather than the more cumbersome N⁻ to denote them. We refer to N as unrooted or semidirected interchangeably. Note that N naturally inherits a metric structure if N⁺ has one.
Figure 1 shows an example of a network N⁺ and its semidirected network N⁻. While in that example all leaves are equidistant from the root of N⁺, we do not assume ultrametricity generally.
Tree edges can be further partitioned into cut and noncut edges, according to whether their deletion results in a graph with 2 connected components or not. Note that hybrid edges are never cut edges.
Of particular interest are unrooted networks on four taxa obtained from a larger network by restricting the taxon set. Recall that a binary unrooted topological tree on four taxa a, b, c, d is called a quartet ab|cd if deletion of its sole internal edge gives connected components with taxa {a, b} and {c, d}. When n ≥ 4, an n-taxon tree displays a quartet ab|cd if the induced unrooted tree on the four taxa is ab|cd. A formal extension of this concept to quartet networks follows.
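Before turning to that extension, the tree case can be made concrete with a short sketch. It assumes a hypothetical encoding of an unrooted tree by its internal-edge splits (one side of each bipartition); the function name and encoding are illustrative only and are not taken from any of the cited software.

```python
def displayed_quartet(splits, a, b, c, d):
    """Return the quartet displayed on {a, b, c, d}, or None if unresolved.

    `splits` lists, for each internal edge of an unrooted tree, the taxa on
    one side of that edge (a hypothetical encoding chosen for illustration).
    The quartet xy|zw is returned as a frozenset of its two cherries.
    """
    four = {a, b, c, d}
    for side in splits:
        inside = four & set(side)
        # An internal edge separating the four taxa 2-2 determines the quartet.
        if len(inside) == 2:
            return frozenset({frozenset(inside), frozenset(four - inside)})
    return None


# The 5-taxon tree ((a,b),c,(d,e)) has two internal edges.
example_splits = [{"a", "b"}, {"d", "e"}]
quartet = displayed_quartet(example_splits, "a", "b", "c", "d")
```

Because the splits of a tree are pairwise compatible, any internal edge that separates the four taxa two against two yields the same quartet, so returning at the first such edge is sufficient.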
Definition 1 Let N⁺ be a rooted network on X, and let a, b, c, d ∈ X. The induced quartet network N|Q on Q = {a, b, c, d} is the unrooted network obtained by
An analogous definition for induced quartet networks of N is given in Baños (2019), which also shows that the quartet networks induced from N⁺ and N are isomorphic. Figure 2 shows some metric quartet networks induced from the networks of Fig. 1.

Level-1 Networks
We restrict our study to the family of level-1 phylogenetic networks. These have been the focus of many works (Solís-Lemus and Ané 2016; Allman et al. 2019; Baños 2019; Huber et al. 2017; Huson et al. 2010; Gusfield et al. 2007; Rosselló and Valiente 2009), though only a few of these incorporate the coalescent model that is central here.
By a cycle in either a rooted or unrooted phylogenetic network we mean a set of edges and nodes that form a cycle when all edges are treated as undirected.
Definition 2 Let N be a (rooted or unrooted) binary topological network. If no two cycles in the undirected graph of N share a node, then N is level-1.
In some works level-1 networks are defined as those in which no cycles share an edge; that is, cycles are required only to be edge-disjoint, rather than to satisfy the stricter vertex-disjoint condition we adopt. However, in our context of binary networks these are equivalent (Rosselló and Valiente 2009).
In a level-1 network a cycle that is composed of m edges (2 hybrid edges and m − 2 tree edges) is said to be an m-cycle. More specifically, it is an m_k-cycle if there are exactly k taxa descended from its unique hybrid node (Baños 2019). This terminology can be used for semidirected networks, since 'descended from a hybrid node' is unambiguous, regardless of where the network is rooted.
Let N be an unrooted level-1 network on X with an m-cycle C. Then C induces a partition of X into m subsets according to the connected components obtained by deleting all edges in the cycle. Elements of this partition are the blocks of C. The hybrid block of C is the block of taxa descended from the hybrid node in C. If the blocks of C have n_1, n_2, ..., n_m taxa, then we say C induces an (n_1, n_2, ..., n_m) partition. Finally, for a cut edge e = (v, w) in a semidirected network, the taxon block below w is the set of taxon labels in the subgraph that contains w when e is deleted from the network.

The Network Multispecies Coalescent Model and Quartet Concordance Factors
The Network Multispecies Coalescent (NMSC) model (Meng and Kubatko 2009) mechanistically describes the formation of gene trees within a species network, as gene lineages are traced backward in time to common ancestors in the edge populations of the network. Under it, gene trees may differ in topology from any displayed trees on the species network. Given a metric rooted phylogenetic network, the NMSC assigns positive probabilities to all resolved metric gene trees, and, through marginalization, to topological gene trees and induced gene quartet topologies.
Definition 3 Let N⁺ be a metric rooted network on a taxon set X, and let A, B, C, D be genes sampled from individuals in species a, b, c, d ∈ X, respectively. The (scalar) quartet concordance factor CF_{ab|cd} = CF_{ab|cd}(N⁺) is the probability under the NMSC on N⁺ that a gene tree displays the quartet AB|CD. The (vector) quartet concordance factor CF_{abcd} = CF_{abcd}(N⁺) is the triple of concordance factors of each possible quartet on the taxa a, b, c, d.
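In data analysis the CF vector of Definition 3 is estimated by the empirical CFs mentioned in the Introduction, that is, by normalized quartet counts. A minimal Python sketch, assuming the three quartet topologies on a, b, c, d have already been tallied across gene trees (the function name and argument order are illustrative assumptions, not the interface of any cited software):

```python
def empirical_cf(n_ab_cd, n_ac_bd, n_ad_bc):
    """Empirical CF vector from counts of the three quartet topologies.

    The counts are the numbers of gene trees displaying ab|cd, ac|bd and
    ad|bc; unresolved gene quartets are assumed to have been discarded.
    """
    total = n_ab_cd + n_ac_bd + n_ad_bc
    if total == 0:
        raise ValueError("no resolved gene quartets for this set of four taxa")
    return (n_ab_cd / total, n_ac_bd / total, n_ad_bc / total)


# 60 gene trees display ab|cd, 25 display ac|bd, 15 display ad|bc.
cf_abcd = empirical_cf(60, 25, 15)
```

The three entries sum to 1, mirroring the trivial linear relation satisfied by the theoretical CFs.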
That CFs for quartet networks depend only on the semidirected quartet network was proved in Baños (2019). That result implies the following.
Lemma 1 Under the NMSC on a level-1 network N⁺ the values of the quartet CFs depend only on the induced semidirected network N.
Following on the first steps investigating level-1 network identifiability from quartet CFs taken in Solís-Lemus and Ané (2016), the next result, that most topological features of a level-1 species network are identifiable from quartet CFs, appeared in Baños (2019).
Theorem 2 (Baños 2019, Theorem 4) Let N be a binary semidirected metric level-1 species network on taxon set X with |X| ≥ 4. Let N′ be the semidirected topological network obtained from N by contracting all 2- and 3-cycles, suppressing degree-2 nodes, and undirecting hybrid edges in 4-cycles. Under the NMSC model with generic numerical parameters, the network N′ is identifiable from quartet CFs for N.
We take this theorem as our starting point, and in Sect. 4 focus on the remaining questions of topological identifiability: From quartet CFs can any aspects of 2-cycles or 3-cycles be identified, and for 4-cycles can the hybrid node be identified? In Sect. 5 we turn to identifiability of the numerical parameters of edge lengths and hybridization probabilities. While these were not a focus in Baños (2019), partial results on local identifiability of numerical parameters were given in Solís-Lemus and Ané (2016). Note that for 4-cycle quartet networks, the map to CFs is overparameterized, and Gröbner basis methods easily yield the following.
Lemma 3 Under the NMSC on a semidirected 4-taxon 4-cycle network with generic parameters, neither the hybrid node nor individual numerical parameters are identifiable from CFs.
Proof Consider the 4-taxon, 4-cycle network on taxa Q = {a, b, c, d}, with a the hybrid descendant, obtained from Fig. 12 (center) by setting a = a_1 and removing taxon a_2. By Baños (2019), the hybrid node is not identifiable. But even if the hybrid node is known, the hybrid edge probabilities h_1, h_2 do not appear in the formulas for the CFs, so they cannot be identified. Computational algebra software (Decker et al. 2022; Grayson 2002) shows the elimination ideals retaining exactly one of the parameters γ, x_1, x_2 are generated by CF_{ab|cd} + CF_{ac|bd} + CF_{ad|bc} − 1. Thus no nontrivial formula relating CFs and a single parameter exists.
Unless explicitly stated otherwise, we assume that exactly 1 gene lineage is sampled per taxon. If 2 lineages were sampled for a taxon, say a, 'pseudotaxa' a_1 and a_2 can be introduced by attaching a cherry leading to these at the leaf a of the network. Under the NMSC, CFs for the modified network with 1 sample from each a_i are identical to those for the original network with 2 samples from a. Sampling more than 2 lineages per taxon only introduces new CFs in which 3 or 4 pseudotaxa from the same taxon appear, but due to exchangeability of lineages under the NMSC these CFs are always 1/3. Thus identifiability results for any multiple sampling scheme will follow from the single sample case on a modified network. No edge lengths are needed in the pseudotaxa cherries, since no coalescent event may occur on them.
Under the NMSC one can derive formulas for CFs for any fixed network in terms of the numerical parameters. These have the form of polynomials in the hybridization parameters γ and the exp(−t) for all edge lengths t. The expression exp(−t) has a simple interpretation as the probability that two gene lineages entering an edge of length t coalescent units (tracing time backwards) do not coalesce within that edge. By reparameterizing using edge probabilities ℓ = exp(−t) ∈ (0, 1] rather than lengths t ∈ [0, ∞), all formulas for CFs are given by polynomial formulas in the ℓs and γs.
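The simplest instance of such formulas is the standard multispecies coalescent result for a 4-taxon species tree ab|cd with internal edge length t: the two lineages on one side of the internal edge fail to coalesce within it with probability exp(−t), after which all three gene quartets are equally likely. A short Python sketch (the function and variable names are illustrative assumptions):

```python
import math


def quartet_tree_cfs(t):
    """CFs (CF_ab|cd, CF_ac|bd, CF_ad|bc) for the species quartet tree ab|cd.

    `t` is the internal edge length in coalescent units, and `ell` is the
    corresponding edge probability exp(-t), so the CFs are linear in `ell`.
    """
    if t < 0:
        raise ValueError("edge length must be non-negative")
    ell = math.exp(-t)
    return (1 - 2 * ell / 3, ell / 3, ell / 3)


cfs_short = quartet_tree_cfs(0.0)   # star-like limit: all three CFs equal
cfs_long = quartet_tree_cfs(2.0)    # longer edge: ab|cd dominates
```

At t = 0 the edge probability is 1 and all three CFs equal 1/3, while as t grows the CF of the displayed quartet tends to 1.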
The 3\binom{n}{4} scalar quartet CFs for a fixed topological network N on n taxa then define a polynomial map from the numerical parameter space into ℝ^{3\binom{n}{4}}. Extending the map to allow complex ℓ, γ gives a parameterized algebraic variety. The set of multivariate polynomials in the CFs that vanish on the parameterization's image is an ideal, denoted I(N) = I(N⁺) = I(N⁻). The zero set V(N) of the polynomials in I(N) is the Zariski closure of the parameterized variety. These notions from applied algebraic geometry provide a framework for our work. Elements of I(N) are called invariants, and depend only on the network topology, and not on its numerical parameters.
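The coordinates of this map can be enumerated directly: each choice of four taxa contributes three scalar CFs. A brief Python sketch, in which the label format is an illustrative assumption:

```python
from itertools import combinations
from math import comb


def scalar_cf_labels(taxa):
    """List labels of all scalar quartet CFs on a taxon set.

    Each 4-subset {a, b, c, d} (taken in sorted order) contributes the
    three quartets ab|cd, ac|bd and ad|bc.
    """
    labels = []
    for a, b, c, d in combinations(sorted(taxa), 4):
        labels += [f"{a}{b}|{c}{d}", f"{a}{c}|{b}{d}", f"{a}{d}|{b}{c}"]
    return labels


labels_5 = scalar_cf_labels(["a", "b", "c", "d", "e"])
```

For 5 taxa this gives 3 · 5 = 15 coordinates, and for 6 taxa 3 · 15 = 45, which is the ambient dimension in which the varieties for the small networks studied below live.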
Our arguments use symbolic computations with CFs from specific networks, performed and verified by the software Singular (Decker et al. 2022) and Macaulay2 (Grayson 2002). Despite their essential role, for brevity all computational results are stated in Appendix B. That section also contains an exposition of certain linear invariants that can be derived without computation, and which simplify both computations and statements of results.

2-Cycles
We first show 2-cycles (parallel edges) in level-1 networks are never identifiable. By replacing a 2-cycle with parental node u and child node v by an edge, or suppressing a 2-cycle, we mean removing its two edges, introducing a new directed edge (u, v) with a specified edge probability, and suppressing resulting nodes of degree 2.
The content of the following Lemma was essentially given in Solís-Lemus and Ané (2016), and has appeared in other works subsequently, including a generalization to 2-blobs (Ané et al. 2024, Theorem 4).
Lemma 4 Let N⁺ be a level-1 rooted binary metric phylogenetic network, with a 2-cycle composed of hybrid edges with edge probabilities h_1, h_2, and corresponding hybridization parameters γ_1, γ_2 = 1 − γ_1. Then quartet CFs for N⁺ under the NMSC are unchanged if the 2-cycle is replaced by an edge with edge probability ℓ ∈ (0, 1) determined by the equation
Since varying the 2-cycle parameters in the above expression causes ℓ to range over the full interval (0, 1), we obtain the following.
Corollary 5 Using quartet CFs, under the NMSC a topological level-1 phylogenetic network N with a 2-cycle cannot be distinguished from the network N′ obtained by replacing that 2-cycle with an edge.
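To illustrate how a 2-cycle can be absorbed into a single edge, the sketch below computes the probability that two lineages entering a 2-cycle leave it without coalescing. This expression is an assumption derived here by conditioning on which hybrid edge each lineage follows (both in the first edge, both in the second, or one in each); it is offered as a plausible reading of the replacement edge probability, not as a reproduction of the displayed equation of Lemma 4, and the function name is hypothetical.

```python
def collapsed_edge_probability(gamma1, h1, h2):
    """Probability that two lineages pass through a 2-cycle uncoalesced.

    ASSUMPTION: derived by conditioning on the lineages' hybrid edges,
    not copied from the source. Both lineages take edge 1 with probability
    gamma1**2 and then fail to coalesce with probability h1; similarly for
    edge 2; lineages in different edges cannot coalesce within the cycle.
    """
    if not 0 <= gamma1 <= 1:
        raise ValueError("gamma1 must lie in [0, 1]")
    gamma2 = 1 - gamma1
    return gamma1**2 * h1 + gamma2**2 * h2 + 2 * gamma1 * gamma2


ell_example = collapsed_edge_probability(0.4, 0.5, 0.2)
```

Under this reading, different triples (γ_1, h_1, h_2) giving the same value are indistinguishable, which is consistent with the non-identifiability stated in Corollary 5.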

3-Cycles
The first study of whether 3-cycles on networks were detectable from CFs (Solís-Lemus and Ané 2016) introduced notions of "good" and "bad" triangles, corresponding to the networks in Fig. 6 (right) and Figs. 3 and 5, with the terminology indicating whether the presence of a 3-cycle and partial information about its numerical parameters could be detected from CFs. Although we do not use these terms here, in Sect. 6 we discuss issues concerning 3-cycle inference from CFs relevant to that work.
Using Theorem 2, the question of identifying topological 3-cycles in a network is reduced to distinguishing between the network that theorem identifies, and networks obtained from it by replacing some set of non-cycle tree nodes with 3-cycles. We focus here on level-1 networks with 5 or more taxa, as the 4-taxon case is fully studied in Baños (2019).

3-Cycles Near Leaves
We begin with a non-identifiability result, for certain 3-cycles adjacent to two pendant edges of a network, as shown in Fig. 3.
Proposition 6 Let N be a level-1 network on n taxa with a 3-cycle C adjacent to two pendant edges, and let N′ be the network obtained from N by contracting C to a node.
1. If C is a 3_1-cycle, so its hybrid node has only 1 descendant taxon, the topologies of N and N′ cannot be distinguished using quartet CFs. That is, for any choice of parameters on one of these networks, there exist parameters on the other giving identical CFs. Moreover, the parameters other than those associated to C and internal edges adjacent to C may be chosen to be identical on both networks.
2. If C is a 3_k-cycle with k = n − 2 ≥ 2, and the parameter spaces are extended to allow all real edge lengths in the CF formulas, then for any choice of extended parameters on N there are extended parameters on N′ giving identical CFs, and vice versa. Moreover, the parameters other than those associated to C and internal edges adjacent to C may be chosen to be identical on N and N′. Furthermore, for strictly positive edge lengths on N and N′, there are two positive-measure subsets of parameters, Θ_1, Θ_2, for N, such that on Θ_1 the topologies of N and N′ are not distinguishable using quartet CFs, and on Θ_2 they are distinguishable.
Note that case (1) implies that if N is as shown in Fig. 3 (left) then N and N′ also cannot be distinguished from the network obtained from N by interchanging the a and b labels. In case (2), if parameters are such that N is not distinguishable from N′, then case (1) implies that they are also not distinguishable from the two networks obtained by redesignating the hybrid node in the 3-cycle to have a single descendant taxon. When N is distinguishable from N′, then by case (1) it is distinguishable from those two other networks as well.
Proof Let a, b denote the taxa in the singleton blocks. For case (1), we may assume the network is rooted, with the root outside C and not on the pendant edges leading to a, b (Fig. 3 (left)). Then under the NMSC there is a probability p ∈ (0, 1), depending on the numerical parameters of the 3-cycle, that lineages a and b fail to coalesce before leaving the 3-cycle. Replacing the 3-cycle and its adjacent edges by a 3-leaf tree where the edge leading toward the n − 2 taxa has edge probability p leaves the distribution of topological gene trees, and hence quartet CFs, unchanged. Varying parameters over the 3-cycle or over the 3-leaf tree allows all probabilities p ∈ (0, 1) to be achieved.
In case (2), let v denote the hybrid node in the 3-cycle, so a, b are not descendants of v for any rooting (Fig. 3 (right)). The value of any CF involving at most one of a, b is determined by the network and numerical parameters below v, since as a gene tree forms either a coalescent event occurs below v, or 3 or 4 lineages reach v, so that all three gene quartet topologies have probability 1/3. Thus the 3-cycle only affects values of CFs involving both a and b, and only through events in which no coalescence has occurred below v. We may thus replace the cycle and its adjacent edges to a, b with any graphical structure and parameters that produce the same probabilities of gene quartet topologies when exactly two lineages enter at v. These conditional probabilities are the CFs of the quartet network shown in Fig. 4: (1)
Fig. 4 Figure for the proof of Proposition 6, case (2). A 3_2-cycle quartet network with internal cut edge contracted to length 0, other edge probabilities h_1, h_2, x, and hybridization parameter γ
Note that we have dropped the subscripts 1, 2 from the c taxa, since by exchangeability of those lineages under the NMSC, they may be assigned arbitrarily. Now a quartet tree with topology ab|cc and internal edge probability z yields so, using (1), without changing the CFs the 3-cycle and edges to a, b in N could be replaced by a 3-leaf tree with an edge leading to an ab cherry having edge probability provided 0 < z < 1. Since this inequality holds on a set of positive measure in parameter space, on that set the topologies N and N′ are not distinguishable. However, z > 1 also occurs on a set of positive measure. Suppose in this case that the edge e = (v, w) below v has as its child w a node outside of a cycle, and let c_1, c_2 be taxa chosen from distinct taxon blocks below that node. Then if parameters on N are in the set determined by z > 1 and the edge probability p for e satisfies pz > 1, then for N, CF_{ac|bc} = pz/3 > 1/3. Since for a quartet tree CF_{ac|bc} < 1/3, N is distinguishable from N′ on this set.
If w is instead in a cycle, a similar argument applies. This proof essentially follows arguments given in Baños (2019) for quartet networks with a 3_1-cycle and 3_2-cycle. In case (2) the parameters for which the 3-cycle is topologically identifiable are ones that make the quartet network anomalous, in the sense of Ané et al. (2024).

3-Cycles on Small Networks: Algebraic Conditions
Figure 5 shows a 5-taxon tree, T_5, and two 5-taxon networks with 3-cycles, N_{5−3_1}, N_{5−3_2}. Propositions 32-34 of Appendix B give computational results on the ideals I(T_5), I(N_{5−3_1}), and I(N_{5−3_2}), showing that the polynomial f_{abc} (Eq. (2)) is in I(T_5), but not in I(N_{5−3_1}) nor I(N_{5−3_2}). Using expressions for CFs in terms of parameters from Proposition 32, f_{abc} can be interpreted as expressing that the total internal path length in the tree T_5 is the sum of the lengths of the two internal edges. This polynomial, and variants of it, will play an important role in identifying 3-cycles. The first result in this direction is the following.
Theorem 7 Under the NMSC model, the vanishing of f_{abc} distinguishes a 5-taxon unrooted tree T_5 from the 5-taxon semidirected networks with a central 3-cycle whose contraction yields the tree T_5, for generic numerical parameters.
Proof Consider the networks of Fig. 5 and a fourth obtained by interchanging the a, b taxa in Fig. 5 (right). Since f_{abc} ∉ I(N) for each non-tree network N, it does not vanish identically on the parameter space of N, and is zero only on a set of measure zero in that parameter space. Thus generically the vanishing of f_{abc} distinguishes T_5 from the others.
Propositions 32-34 also show that the two 5-taxon networks of Fig. 5 have the same associated ideals, I(N_{5−3_1}) = I(N_{5−3_2}) ⊂ I(T_5). As a result, there is no purely algebraic means (using only polynomial equalities) of distinguishing them using CFs.
Computational results for the 6-taxon networks T_6 and N_a of Fig. 6 appear in Propositions 35 and 36. Note that I(T_6) contains 3 polynomials, f_{abc}, f_{bca}, f_{cab}, none of which are in I(N_a), expressing three different internal path length relationships in T_6. While the vanishing of any of the three f_{abc}, f_{bca}, f_{cab} (and hence all) distinguishes the tree T_6 from N_a, N_b, and N_c, that was already implicit in Theorem 7.

3-Cycles on Small Networks: Semialgebraic Conditions
While Sect. 4.2.2 has shown the presence of a 3-cycle can be detected in some networks, that result pertains only to the undirected cycle. To obtain information on the hybrid node, we use a semialgebraic approach, focusing on polynomial inequalities.
Proposition 8 Let N be one of T_5, N_{5−3_1}, N_{5−3_2} of Fig. 5, or the network N′_{5−3_2} obtained from interchanging the a, b taxon labels on N_{5−3_2}. Let f_{abc} be as in Eq. (2). Then for generic numerical parameters under the NMSC, N = T_5 if, and only if, f_{abc} = 0. Moreover, f_{abc} is identical on the networks N_{5−3_2} and N′_{5−3_2} for the same parameter values, so f_{abc} gives no information to distinguish between these.
Finally, there are positive measure subsets of the numerical parameter space for N_{5−3_2} on which f_{abc} < 0 and on which f_{abc} > 0.
Proof Theorem 7 states that f_{abc} = 0 for generic parameters if, and only if, N = T_5. If N = N_{5−3_1} then using the formulas for CFs in Proposition 33 gives, for γ, x, ℓ_1, ℓ_2 ∈ (0, 1),
Since f_{abc} is invariant under interchanging the a's and b's, its values for N_{5−3_2} and N′_{5−3_2} are the same.
Specific examples of parameters on N_{5−3_2} show both f_{abc} < 0 and f_{abc} > 0 can occur, and by continuity there are positive measure subsets of parameter space on which these occur.
If a 5-taxon network does have a 3-cycle C, then this proposition may provide some information on the hybrid node's location. For instance, f_{abc} < 0 implies that the taxon c, which is not in a cherry on the tree obtained by contracting C to a vertex, is also not a hybrid descendant of the 3-cycle. However, for other numerical parameters f_{abc} > 0, in which case there is no information on the hybrid location.
To further develop semialgebraic tests for 3-cycle hybrid nodes, we again consider the 6-taxon networks N_a, N_b, N_c described in Fig. 6. Define the following functions of the CFs, building on the f_{xyz}:
Proposition 9 Under the NMSC, for CFs arising from the tree T_6, G_{xyz} = 0 for all x, y, z, while G_{xyz} > 0 for CFs arising from the network N_x.
If a network is known to have one of the topologies N_a, N_b, N_c, then at least one of these topologies can be ruled out by the signs of G_{abc}, G_{cab}, G_{bca}: If G_{xyz} < 0 then the network is not N_x.
Finally, there are positive measure subsets of the numerical parameter space for N_y and N_z on which G_{xyz} < 0 and on which G_{xyz} > 0.
Since G_{abc} + G_{cab} + G_{bca} = 0 and one of these terms is positive for each of N_a, N_b, N_c, at least one is negative.
One can find specific parameters on N_y for which G_{xyz} < 0 and others for which G_{xyz} > 0, and by continuity these conditions hold on sets of positive measure.
Remark 1 It is natural to ask if f_{xyz} or G_{xyz} could be used to detect 3-cycles in situations where incomplete lineage sorting is negligible, so that all gene trees are displayed on the species network. This scenario is modeled by immediate coalescence of gene lineages on entering a common network edge or, equivalently, by a limiting model of the NMSC, in which all edge probabilities go to 0. (See Allman et al. 2022, Sect. 6.2 for more details.) The formulae for CFs given in this work still apply with all edge probabilities set to 0, and one finds that for all the 5-taxon networks of Proposition 8 f_{abc} = 0, while for all the 6-taxon networks of Proposition 9 G_{xyz} = 0. Indeed, these functions depend only on CFs for quartet trees not displayed on the networks, which are therefore all zero.
Another model of interest, the common inheritance coalescent model (Gerard et al. 2011), gives only gene trees arising from the coalescent process on the displayed trees of the species network, with the probability of each displayed tree the product of its edges' hybridization parameters. For this model, the functions f_{abc}, G_{xyz} are generically non-zero, and produce a figure similar to Fig. 7 (calculations and figure not shown). Although our investigation of identifiability of that model and its generalization to the correlated inheritance coalescent model (Fogg et al. 2023) is not complete, this illustrates how a coalescent model of some sort allows for greater identifiability of network structure.
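Returning to the sign test of Proposition 9, the exclusion rule "if G_{xyz} < 0 then the network is not N_x" can be sketched in a few lines of Python. The sketch assumes the three values G_{abc}, G_{bca}, G_{cab} have already been computed from CFs using the definitions preceding Proposition 9; the function name, the string labels, and the tolerance argument (intended for estimated rather than exact CFs) are illustrative additions.

```python
def ruled_out_topologies(g_abc, g_bca, g_cab, tol=0.0):
    """Return the topologies among N_a, N_b, N_c excluded by the sign test.

    G_xyz is positive on N_x, so a value below -tol excludes N_x.
    `tol` is a hypothetical slack for noisy, estimated CFs.
    """
    values = {"N_a": g_abc, "N_b": g_bca, "N_c": g_cab}
    return {name for name, g in values.items() if g < -tol}


# Values summing to zero, as the three G functions do on these networks.
excluded = ruled_out_topologies(0.02, -0.015, -0.005)
```

Since the three values sum to zero and are not all zero on a network with a 3-cycle, at least one topology is always excluded, but the test need not single out one network.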
Proposition 9 and Fig. 7 suggest that determining which of N_a, N_b, or N_c produced certain numerical CFs may be impossible, which we rigorously show by the following example.
Example 1 (Non-identifiability of the hybrid node in a 3-cycle) Consider the network N_a with parameters where the parameters for N_b are as shown for N_a in Fig. 6 but with taxon labels Then the CFs of N_a and N_b are equal. Specifically, for both N_a and N_b,
In fact, there is a neighborhood in V(N_a) = V(N_b) of the CF point of this example contained in the image of the parameterizations of both N_a and N_b. Indeed, a computation of the Jacobians for the two parameterization maps at the example parameters shows that locally the images are of dimension 6, which matches the dimension of the variety. A sufficiently small neighborhood of the CF point is thus in the image of the parameterizations for both N_a and N_b, with inverse images of positive measure. One may similarly show, using a CF point that arises only from N_a (lying in a uniformly colored sector in Fig. 7), that there is a set of positive measure in the N_a parameter space which gives CFs in the image of the parameterization of N_a only. We combine these results formally in the following theorem.
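Before the formal statement, the Jacobian-rank dimension count used above can be illustrated on a much smaller map. The sketch below is a toy, not the computation for N_a or N_b: it takes a quartet tree whose internal edge is split into two consecutive edges with probabilities ℓ_1, ℓ_2 (an unsuppressed degree-2 node, introduced here purely for illustration), so that the CFs depend only on the product ℓ_1ℓ_2. The numerical Jacobian then has rank 1 although there are 2 parameters. All function names are hypothetical, and finite differences stand in for the symbolic computation used in the paper.

```python
def toy_cf_map(params):
    """CFs of a quartet tree whose internal edge probability is l1 * l2."""
    l1, l2 = params
    ell = l1 * l2
    return [1 - 2 * ell / 3, ell / 3, ell / 3]


def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x (rows: outputs, cols: inputs)."""
    columns = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        columns.append([(p - m) / (2 * eps) for p, m in zip(f_plus, f_minus)])
    return [list(row) for row in zip(*columns)]


def matrix_rank(matrix, tol=1e-8):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        if rank == len(m):
            break
        pivot = max(range(rank, len(m)), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(rank + 1, len(m)):
            factor = m[i][col] / m[rank][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


jac = numerical_jacobian(toy_cf_map, [0.5, 0.8])
local_dimension = matrix_rank(jac)
```

Here the rank is smaller than the number of parameters, signalling that ℓ_1 and ℓ_2 are not individually identifiable. In Example 1 the same kind of count is used in the opposite direction, to confirm that the image locally has the full dimension of the variety.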
Theorem 10 There exists a positive measure subset of the numerical parameter space of N a for which it is distinguishable from T 6 , N b , and N c , and a positive measure subset of the parameter space for which only the undirected network can be distinguished from T 6 , with 1 node in the 3-cycle determined to be non-hybrid.
Again using the parameter values in Example 1, an analog of this result for 5-taxon networks with a single 3-cycle can be established.
Theorem 11 There exist positive measure subsets of the numerical parameter spaces of N 5−3 1 and N 5−3 2 for which the semidirected network topologies are distinguishable from the other networks among T 5 , N 5−3 1 , N 5−3 2 , N 5−3 2 , and positive measure subsets of the parameter spaces for which they are not distinguishable from at least one other of N 5−3 1 , N 5−3 2 , N 5−3 2 .
Proof First, suppose the network is N 5−3 2 . Dropping a taxon to pass to a quartet network with a 3 2 -cycle, Proposition 6 implies that the semidirected topology is identifiable on some positive measure subset of parameters. That there is such a set on which the semidirected topology is not identifiable follows from using the parameter values of Example 1 (after dropping an appropriately chosen taxon) on such networks with different hybrid cherries, and computing Jacobians to verify that an open set of such examples exists.
To investigate identifiability for the network N 5−3 1 , consider the function We first show that f < 0 for all parameters on N 5−3 2 . Using Proposition 34 to expand in terms of parameters, Since h 1 appears linearly in this expression with a negative coefficient, we set h 1 = 0 to bound f above. The coefficient of h 2 , which also appears linearly, may be positive or negative, so we consider h 2 = 0 and 1. If It is easy, however, to find an open set of parameters for N 5−3 1 for which f > 0, and on that set c is identifiable as the hybrid block. We obtain a set on which the semidirected topology of N 5−3 1 is not identifiable by again using the parameter values in Example 1.

Large Networks with 3-Cycles
After considering specific 5- and 6-taxon networks with a single 3-cycle, we shift focus to 3-cycles in general networks N + . We extend the previous results on semialgebraic identifiability of both cycles and hybrid nodes, using a decomposition of N + into 4 subnetworks, as in Fig. 8. A similar decomposition is used in Gross et al. (2023), of a level-1 network into trees and 'sunlets,' but that work does not model coalescence, so the details are quite different. Our decomposition extends to larger cycles but we present only the 3-cycle case needed here.
The subnetworks in Fig. 8  Note that a, b, c are each in two of these subnetworks.Since the root must be above D's hybrid node, and the semidirected network is unchanged by moving the root along tree edges, we may assume the root lies in B or C, and, after renaming, in C.
The C F of any quartet under the NMSC on N + has an algebraic decomposition into terms associated to the subnetworks A, B, C, D, which we next develop. We use two facts about coalescent events between 4 lineages leading to gene quartets:
1. The first coalescent event between 2 of the lineages determines the gene quartet tree that forms, and
2. Conditioned on 3 or 4 lineages reaching a common node with no previous coalescence, by exchangeability of lineages each quartet has probability 1/3.
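The two facts above can be checked by direct enumeration in the simplest setting. The following is a minimal sketch, assuming four exchangeable lineages in a single population, so that each of the 6 pairs is equally likely to coalesce first. The function names are hypothetical and are not taken from this work.

```python
from fractions import Fraction
from itertools import combinations


def quartet_from_first_pair(pair, taxa):
    """Return the quartet tree fixed by the first coalescing pair (fact 1)."""
    rest = tuple(t for t in taxa if t not in pair)
    # A quartet is an unordered split, so store its two sides as a frozenset.
    return frozenset([frozenset(pair), frozenset(rest)])


def quartet_probabilities(taxa=("w", "x", "y", "z")):
    """Quartet probabilities when every pair is equally likely to coalesce first (fact 2)."""
    pairs = list(combinations(taxa, 2))
    probs = {}
    for pair in pairs:
        quartet = quartet_from_first_pair(pair, taxa)
        probs[quartet] = probs.get(quartet, Fraction(0)) + Fraction(1, len(pairs))
    return probs
```

Each of the three quartets arises from exactly 2 of the 6 pairs, which gives the probability 1/3 stated in fact 2. The same count applies when only 3 of the lineages are present at the node, since the 3 possible first pairs then yield 3 different quartets.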
For S ∈ {A, B, C, D} and a gene quartet x y|zw where x, y, z, w ∈ X are taxa on N + , we define an event, denoted C S → x y|zw, that captures whether the behavior of gene lineages in S ensures that under the coalescent model the gene tree x y|zw is formed, or will be, with a determined probability. This may be due to a coalescent event occurring in S, or 3 or 4 lineages reaching a common node in S without having yet coalesced. Since any coalescent event between any of the four lineages that occurs below S would already determine the quartet tree, we define events conditional on the lineages from those of {x, y, z, w} that are below S not having coalesced below S. For instance, in Fig. 8 and c 1 on C then we condition on b 1 , b 2 not having coalesced in B. More formally, consider the following events, described backwards in time: , {x, y, z, w}) = No coalescence between any of the lineages x, y, z, w that may enter S occurs below their entry to S. F = F(S, x y|zw) = There is a node v in S that 3 or 4 of the x, y, z, w lineages reach with no coalescence occurring below v, and then x y|zw forms. G = G(S, x y|zw) = A first coalescence occurs at some node v in S so that x y|zw forms, without 3 or 4 lineages having reached a common node below v.
Let P(C S → x y|zw) denote the conditional probability of the event C S → x y|zw. Then with a i , b i , c i distinct taxa from A, B, C, respectively, a few example decompositions of C Fs are: For calculating probabilities associated to D, we suppress indices on taxa. This is allowable since, conditioned on distinct lineages entering D, a 1 , a 2 are exchangeable, as are b 1 , b 2 . Thus, for instance, Significantly, all C Fs for N + can be computed using only the following probabilities associated to D together with expressions dependent only on A, B, C: The 6 linearly independent polynomials, p 1 , p 2 , . . ., p 6 parameterize a variety, V(D). Combined with the previous discussion of decomposing C F formulas, this yields the following.
Proposition 12 Let V N be the C F variety for a semidirected network (not necessarily level-1) N with the form shown in Fig. 8, and numerical parameter space with the p i given above. Then the map C F : where π is the map projecting (N ) onto the numerical parameters on A, B, C only.
Proposition 37 shows that V D = C 6 , and thus φ is an infinite-to-1 map, establishing the following.
Corollary 13 Consider a semidirected topological network N with a 3-cycle, with decomposition as in Fig. 8(left).Then no test using polynomial equalities in quartet C Fs can identify the hybrid node in the 3-cycle.
Specifically, if N 's root must be in the subnetwork C because of the semidirected topology of C, then the network N B which has A, B interchanged from N = N A , so that B is below the 3-cycle's hybrid node, leads to the same ideal of invariants, that is, I(N A ) = I(N B ). If the semidirected topology of N allows for rooting in either subnetwork B or C, then
Proof If deleting the 3-cycle from the network induces an (n 1 , n 2 , n 3 ) partition of the taxa with all n i ≥ 2, then the corollary follows directly from Proposition 12 and Proposition 37. Cases with n i = 1 then follow by deleting taxa from an appropriate network with all n i ≥ 2, intersecting the ideals with a ring generated by fewer CFs.
Note that this corollary applies to networks with more than one 3-cycle. However, when multiple cycles are present, the location of one cycle's hybrid node indicates that one of the nodes in a descendant cycle cannot be hybrid. Thus for a network with k 3-cycles, there are between 2^k and 3^k networks differing only in the choice of hybrid nodes in the 3-cycles, all of which are algebraically indistinguishable using C Fs.
Nonetheless, using semialgebraic tests, we can obtain additional information on hybrid node location, as the following generalization of Proposition 9 shows.
Proposition 14 Consider a partition of a taxon set X into three blocks of size at least 2. For any network N (not necessarily level-1) with a node or 3-cycle inducing these blocks, denote the node or 3-cycle and its adjacent edges by D, and the subgraphs attached to D as A, B, C (as in Sect. 4.2.4 for a cycle).
Let G abc , G cab , G bca be as defined by Eq. (3), for any distinct taxa a i on A, b i on B, and c i on C. Then D is:
a 3-cycle and adjacent edges with x below the hybrid node if G xyz > 0, G yzx ≤ 0, G zxy ≤ 0 for {x, y, z} = {a, b, c},
a 3-cycle and adjacent edges with x or y below the hybrid node if G xyz > 0, G yzx > 0, G zxy < 0 for {x, y, z} = {a, b, c}.
Moreover, for a network N with a 3-cycle D and descendants of x forming its hybrid block, there exist positive measure subsets of parameters on D on which G yzx and G zx y satisfy both of the above sign conditions.
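The sign conditions of Proposition 14 amount to a simple decision rule, sketched below under stated assumptions. The function name and the tolerance `tol` are hypothetical. The three inputs are assumed to be exact values of G abc , G bca , G cab (no sampling error), only cyclic relabelings of (a, b, c) are handled, and the "node" outcome for three vanishing values is an assumption read off from the proof, where G abc is a multiple of G abc (T 6 ) = 0.

```python
def classify_D(G_abc, G_bca, G_cab, tol=1e-12):
    """Classify D from the signs of G_abc, G_bca, G_cab.

    Returns (kind, candidates): kind is "node" or "3-cycle", and candidates
    is the set of blocks that may lie below the hybrid node.
    """
    # Key each G_xyz by its first index x, following the cyclic order a, b, c.
    values = {"a": G_abc, "b": G_bca, "c": G_cab}
    positive = {k for k, v in values.items() if v > tol}
    negative = {k for k, v in values.items() if v < -tol}

    if not positive and not negative:
        # Assumption taken from the proof: all three vanish when D is a node.
        return "node", set()
    if len(positive) == 1:
        # G_xyz > 0 with the other two <= 0: x is below the hybrid node.
        return "3-cycle", positive
    if len(positive) == 2 and len(negative) == 1:
        # G_xyz > 0, G_yzx > 0, G_zxy < 0: x or y is below the hybrid node.
        return "3-cycle", positive
    raise ValueError("sign pattern not covered by the stated conditions")
```

With exactly one positive value the hybrid block is determined, while two positive values only exclude one block, which matches the two cases of the proposition.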
Proof First suppose D is a 3-cycle and, without loss of generality, A is below the hybrid node. Then we decompose formulas for C Fs for N as But the terms in the last factor, all of which depend only on D, arise as multiples of C Fs on the network N 6−3 2 of Fig. 8 Thus, by Proposition 9, G abc (N ) = Since G abc + G cab + G bca = 0, either 1 or 2 of these terms are positive, and the two cases for 3-cycle Ds follow. The case of the network N with D a node is obtained by setting x = h 1 = h 2 = 0 in the formulas for any of the 3-cycle networks, showing, for instance, that G abc (N ) is a multiple of G abc (T 6 ) = 0.
The final statement on positive measure subsets of parameter space follows from Proposition 9.
Proposition 14 yields the following generalization of Theorem 10.
Theorem 15 Consider a partition of a taxon set X into three blocks of size at least 2. Then for all networks (not necessarily level-1) with a node or 3-cycle inducing these blocks, the presence of the node or the (undirected) 3-cycle along with one non-hybrid block is identifiable. If the network has a 3-cycle then there are positive measure subsets of its parameter space on which the hybrid node can be determined, and on which it cannot.
Proof By Proposition 14, an undirected 3-cycle is signaled by the non-vanishing of at least one of G abc , G bca or G cab , and for a 3-cycle, a non-hybrid block is identifiable since one of the Gs must be negative. That the hybrid node can be identified on a positive measure set follows from the existence of such a set for which only one G is positive. That the hybrid node cannot be identified on another set is seen by choosing specific parameters on 3-cycles with different hybrid nodes (e.g., using parameters given in Example 1 for the 3-cycle and adjacent edge parameters) which produce the same values for the p i .
If n i = 1 for some i, then arguments similar to those given for Proposition 14 and Theorem 15 show that the function f xyz can identify the presence of a 3-cycle, but possibly not its hybrid node. While we omit the proof, we state the result.
Proposition 16 Consider a partition of a taxon set X into three blocks of size 1, n 1 , n 2 with n i ≥ 2. For any network N (not necessarily level-1) with a node or 3-cycle inducing these blocks, let D denote the node or 3-cycle and adjacent edges, and A, B, C the subgraphs attached to D by the adjacent edges, with C being a single node. Let f abc be as in Proposition 8, for any distinct a i on A, b i on B, and c on C. Then for generic numerical parameters, D is:
a node and adjacent edges if, and only if, f abc = 0,
a 3-cycle and adjacent edges with A or B below the hybrid node if f abc < 0,
a 3-cycle and adjacent edges with A, B, or C below the hybrid node if f abc > 0.
Finally, there are positive measure subsets of the numerical parameter space for the networks with a 3-cycle D and either A or B below its hybrid node on which f abc < 0 and on which f abc > 0.
For identifying the hybrid node in a 3-cycle inducing a (1, n 1 , n 2 ) partition when there is a single descendant of the hybrid node, we generalize Theorem 11.
Theorem 17 Consider a partition of a taxon set X into three blocks of sizes 1, n 1 , n 2 with n i ≥ 2. Then for all networks (not necessarily level-1) with a 3-cycle inducing these blocks, there are positive measure subsets of the parameter space on which the hybrid node of the 3-cycle is identifiable, and on which it is not.
Proof Let f be as defined in Eq. (4). We first show the result for a general network N with a 3-cycle with a single hybrid descendant. For such a network, using decompositions as in Sect. 4.2.4 but with C a hybrid singleton taxon and the root in B, we find that for any choices of two taxa in the A and B blocks where N 5−3 1 is given parameters from the 3-cycle and adjacent edges of N . Similarly, if a non-hybrid block C is the singleton where N 5−3 2 is given parameters from the 3-cycle and adjacent edges of N . Thus the signs of f on N can be used as in the proof of Theorem 11 to obtain the claim when the hybrid block is a singleton.
If the singleton block is not hybrid on N the claim is established as for Theorem 11, by passing to a subnetwork with a 3 2 -cycle and using the parameters of Example 1.

4-Cycles
To study topological 4-cycle identifiability beyond the results of Solís-Lemus and Ané (2016) and Baños (2019), we consider first the networks N s , N w , N n on 5 taxa of Fig. 9, called good (N w ) and bad (N s , N n ) diamonds in Solís-Lemus and Ané (2016). Note that hybrid edge probabilities are not labeled for the networks N w and N n , since no coalescence can occur in those edges as they have only one descendant taxon.
For any network N , an ideal J (N ) of linear invariants is easy to construct from certain symmetries in C Fs under the NMSC. There are, for example, trivial invariants like 1 − (C F ab|cd + C F ac|bd + C F ad|bc ), as well as cut invariants derived from cut edges in N and exchange invariants derived from exchangeable lineages under the NMSC. These linear invariants form a subideal J (N ) of the full ideal I(N ) for the network variety, and depend only on N 's undirected topology. See Appendix B.1 for full details.
Fig. 9 The semidirected 5-taxon binary networks with a single 4-cycle, up to taxon labelling. We denote these by N s , N w , N n from left to right, according to compass directions for the a 1 , a 2 cherry when the hybrid node is located at south. Note that N e is omitted since, up to taxon labelling, it is the same as N w . Edge probabilities and the hybridization parameter γ are shown next to edges.
Fig. 10 The semidirected 5-taxon level-1 binary networks with a single 4-cycle and 3-cycle, up to taxon labelling.
Since N s , N w and N n all have the same undirected topology, they share the same ideal J of linear invariants, so the location of the hybrid node cannot be determined using these invariants. However, computations of the full ideals of invariants for these three networks, presented as Propositions 38-40, with additional computation, yield the following identifiability result for hybrid nodes.

Proposition 18 Consider a semidirected binary level-1 network on n ≥ 5 taxa whose topology is known up to contracting 2- and 3-cycles and undirecting hybrid edges in 4-cycles. Then for generic numerical parameter values on the network, the 4-cycle hybrid edge directions are identifiable from C Fs.
Proof Suppose first a network N has exactly 5 taxa, and a 4-cycle. Then after contracting 2-cycles N yields N s , N w , N n , or one of the five networks shown in Fig. 10. Although we do not know whether N has a 3-cycle, if it does then by Proposition 6 it has the same associated variety as the network with that 3-cycle contracted, so we investigate the relationships of the varieties V(N s ), V(N w ), and V(N n ).
Propositions 38 to 40 show that V(N s ), V(N w ), and V(N n ) have dimensions 5, 4, and 3, respectively. Moreover, V(N s ) contains both V(N w ) and V(N n ). Additional computations show that the intersection Thus generic parameters for N s give points neither on V(N w ) nor V(N n ), while generic parameters for N w give points not on V(N n ), and generic parameters for N n give points not on V(N w ). Thus for generic parameters, the hybrid node in the 4-cycle can be determined by testing invariants to see whether the C Fs lie on V(N n ) or V(N w ), or neither.
If N has more than 5 taxa, choose one taxon from each of 3 of the taxon blocks determined by a 4-cycle, and 2 from the remaining block, and pass to the induced network on these 5 taxa to apply the result for 5-taxon networks.
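The test described in the proof can be sketched as a small decision rule. This is only an illustration under assumptions: the inputs are taken to be the values, at an exact 5-taxon C F point already known to lie on V(N s ), of generators of the ideals for N w and N n (the generators themselves come from Propositions 39 and 40 and are not reproduced here). The function names and the tolerance `tol` are hypothetical.

```python
def on_variety(invariant_values, tol=1e-9):
    """Treat a point as lying on a variety if all supplied invariants vanish."""
    return all(abs(v) <= tol for v in invariant_values)


def infer_4cycle_network(inv_w_values, inv_n_values, tol=1e-9):
    """Decide among N_s, N_w, N_n from invariant values at a CF point.

    inv_w_values: values of invariants for V(N_w) at the point (assumed given).
    inv_n_values: values of invariants for V(N_n) at the point (assumed given).
    Returns "N_s", "N_w", "N_n", or None for a non-generic point.
    """
    on_w = on_variety(inv_w_values, tol)
    on_n = on_variety(inv_n_values, tol)
    if on_w and on_n:
        # The point lies in the intersection of V(N_w) and V(N_n): not generic.
        return None
    if on_w:
        return "N_w"
    if on_n:
        return "N_n"
    # On neither subvariety: generic parameters on N_s.
    return "N_s"
```

With empirical C Fs the exact vanishing test would have to be replaced by a statistical one, which this sketch does not attempt.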

Remark 2
The components V 1 , V 2 , and V 3 of V(N w ) ∩ V(N n ) arise naturally from the parameterizations. Restricting to γ = 1 on N w and γ = 0 on N n , essentially giving the unrooted tree ((a 1 , a 2 ), (b, c), d) for both, yields V 1 . V 2 arises from γ = 0 on N w and γ = 1 on N n which gives the unrooted tree ((a 1 , a 2 ), b, (c, d)). V 3 arises from = 0 on both N w and N n , which by corresponding to an infinite edge length, ensures a 1 , a 2 form a cherry in any gene tree involving those two taxa, and for those involving only one a i , gives C Fs from a 4 1 -cycle with a non-identifiable hybrid node.

Summary of Topological Identifiability
The results of this section combined with Theorem 2 and Lemma 3 yield the following theorem.
Theorem 19 (Topological Identifiability from quartet C Fs) Let N + be a binary level-1 phylogenetic network on n ≥ 4 taxa, with generic numerical parameters. Then no 2-cycle on the semidirected network can be identified from C Fs, so let N be the topological semidirected network induced by N + with all 2-cycles replaced with edges. Then the topological structure of N , including directions of hybrid edges, is identifiable from quartet C Fs of N + , with the following exceptions:
1. If a 3-cycle induces a (1, 1, n − 2) partition of taxa, then if the hybrid node has a single descendant taxon the network cannot be distinguished from the network in which the cycle is contracted to a node, or from the network in which the hybrid and other singleton block are interchanged. If the hybrid node has n − 2 descendant taxa, then there are positive-measure subsets of parameters on which the semidirected 3-cycle is and is not identifiable.
2. If a 3-cycle induces a (1, n 1 , n 2 ) partition with n 1 , n 2 ≥ 2 then the undirected 3-cycle can be identified. There are positive measure subsets of parameters on which the semidirected 3-cycle is and is not identifiable.
3. If a 3-cycle induces an (n 1 , n 2 , n 3 ) partition with all n i ≥ 2, then the undirected 3-cycle can be identified, and at least 1 of the 3-cycle nodes can be determined not to be hybrid, but there are positive measure subsets of parameters on which the semidirected 3-cycle is and is not identifiable.
4. If a 4-cycle induces a (1, 1, 1, 1) partition, then the location of the hybrid node is not identifiable.
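The 3-cycle exceptions of Theorem 19 depend only on the sizes of the taxon blocks and, when two blocks are singletons, on which block is hybrid. The following lookup is a minimal sketch of that case analysis. The function name, argument names, and the returned dictionary keys are hypothetical, and the value `None` marks a property on which the theorem as stated is silent.

```python
def three_cycle_exception(block_sizes, hybrid_block_size=None):
    """Summarize which exception of Theorem 19 applies to a 3-cycle.

    block_sizes: sizes of the three taxon blocks induced by the 3-cycle.
    hybrid_block_size: size of the block below the hybrid node; it is only
        needed (and only checked) when two of the blocks are singletons.
    """
    sizes = sorted(block_sizes)
    if len(sizes) != 3 or sizes[0] < 1 or sum(sizes) < 4:
        raise ValueError("expected three positive block sizes on n >= 4 taxa")
    singletons = sizes.count(1)

    if singletons == 2:  # exception 1: a (1, 1, n - 2) partition
        if hybrid_block_size not in (1, sizes[2]):
            raise ValueError("hybrid_block_size must be 1 or n - 2 here")
        if hybrid_block_size == 1:
            # Indistinguishable from the contracted cycle and from swapping
            # the hybrid block with the other singleton block.
            return {"exception": 1, "undirected_cycle": False,
                    "semidirected_cycle": "never"}
        return {"exception": 1, "undirected_cycle": None,
                "semidirected_cycle": "sometimes"}
    if singletons == 1:  # exception 2: a (1, n1, n2) partition
        return {"exception": 2, "undirected_cycle": True,
                "semidirected_cycle": "sometimes"}
    # exception 3: all blocks of size >= 2
    return {"exception": 3, "undirected_cycle": True,
            "semidirected_cycle": "sometimes",
            "non_hybrid_node_known": True}
```

Here "sometimes" stands for the existence of positive-measure parameter sets on which the semidirected 3-cycle is identifiable and others on which it is not.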

Identifiability of Numerical Parameters
To address identifiability of numerical parameters (both edge lengths and hybridization parameters), we assume the network has no 2-cycles, as these are not identifiable.
For the remainder of the section we thus study N , the semidirected metric binary phylogenetic network induced from a rooted network N + , with 2-cycles replaced by edges. In showing an edge in N has identifiable length, we are showing that if the original network did have a 2-cycle, then an "effective" length of an edge resulting from replacing the cycle as in Lemma 4 is identifiable.
Since we assume exactly one sample per taxon for each gene, no coalescent event can occur in pendant edges. Thus no pendant edge length appears in C F parameterizations, and such lengths cannot be identified from C Fs, yielding the following.
Proposition 20 Let N be a semidirected phylogenetic network. Then pendant edge lengths are not identifiable from quartet C Fs under the NMSC model with one sample per taxon.

Lengths of Edges Defined by 4 Taxa
We focus first on edges in N for which it is simple to identify edge lengths.
With Q = {a, b, c, d} a set of 4 taxa from X , let N (Q) denote the subgraph of N obtained in the same way as the induced quartet graph N | Q in Definition 1, but without suppressing degree-2 nodes.
Definition 4 Let e be an edge in N . If Q = {a, b, c, d} is a set of 4 taxa and N (Q) the subgraph of N described above, then we say that e is defined by a set Q if:
1. Edge e lies in the subnetwork N (Q),
2. Edge e is a cut edge of N (Q) separating pairs of taxa, say a, b from c, d, and
3. In N (Q) there are 4 cut edges adjacent to e, separating each of a, b, c, d, respectively, from the others.
In an unrooted tree, every internal edge is defined by some Q, even if the tree is not binary. But for a network, even if binary and level-1 as in Fig. 11, this is not the case. In such a network, a k-cycle, with k ≥ 5, has k − 4 edges in it that are defined by such sets, with the hybrid edges and those adjacent to them exceptions, as will be proved in the next proposition. Edges descended from hybrid nodes are also never defined by a set Q. These examples show edges defined by a set Q need not be cut edges, and not all cut edges are defined by a set Q. For a binary network, an alternate characterization of edges defined by sets Q can be given.
Proposition 21 For a binary level-1 semidirected network N , there is a set Q of 4 taxa defining an edge e if, and only if, e is an internal edge that is neither hybrid nor adjacent to a hybrid edge.
Proof Suppose e is defined by Q. If e were either hybrid or adjacent to a hybrid edge, then Q would contain a descendant of a hybrid node. But then N (Q) contains all edges of the cycle in which the hybrid edge lies. This contradicts that both e and its adjacent edges are cut edges in N (Q), since the hybrid edges are not cut.
Conversely, suppose e is neither hybrid nor adjacent to a hybrid edge. If none of these 5 edges is in a cycle in N , then choosing one taxon in each component obtained by deleting e and its incident nodes and adjacent edges gives a set Q defining e.
If any one of these edges is in a cycle, then since N is level-1 and binary, exactly one of the following holds: a) e is in a cycle, together with exactly 2 adjacent edges, one at each endpoint of e, b) e is not in a cycle, but exactly one cycle contains two edges adjacent to e at the same endpoint of e, or c) e is not in a cycle, but all 4 edges adjacent to e are, with e adjacent to two different cycles.
For case (a), the 2 edges adjacent to e that are not in the cycle must be cut edges, and the two adjacent to e that are in the cycle must be adjacent to 2 other distinct cut edges not in the cycle. Choosing taxa from the non-e components left by deleting these 4 cut edges gives a set Q defining e.
In case (b), the two edges in the cycle must be adjacent to distinct cut edges other than e which are not in the cycle. Choosing taxa from the non-e components of the graph obtained by deleting these two edges and the two non-cycle edges adjacent to e gives a quartet defining e. Case (c) is similar, treating each cycle the same way.
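The characterization in Proposition 21 is easy to apply mechanically. The sketch below uses a hypothetical edge-list representation, with each edge a triple (u, v, is_hybrid), and returns the internal edges that are neither hybrid nor share an endpoint with a hybrid edge. The input is assumed to be a binary level-1 semidirected network, which the code does not verify. The helper `sunlet` builds a k-cycle with one leaf per cycle node, purely as a test case.

```python
from collections import defaultdict


def edges_defined_by_quartets(edges):
    """Return edges (u, v) that are internal, non-hybrid, and not adjacent to a hybrid edge."""
    degree = defaultdict(int)
    touches_hybrid = set()  # nodes incident to at least one hybrid edge
    for u, v, is_hybrid in edges:
        degree[u] += 1
        degree[v] += 1
        if is_hybrid:
            touches_hybrid.update((u, v))

    defined = set()
    for u, v, is_hybrid in edges:
        internal = degree[u] > 1 and degree[v] > 1  # neither endpoint is a leaf
        adjacent = u in touches_hybrid or v in touches_hybrid
        if internal and not is_hybrid and not adjacent:
            defined.add((u, v))
    return defined


def sunlet(k):
    """Edges of a k-cycle sunlet: cycle nodes c1..ck, leaf li on ci, hybrid node c1."""
    edges = [(f"c{i}", f"l{i}", False) for i in range(1, k + 1)]    # pendant edges
    edges += [(f"c{i}", f"c{i + 1}", False) for i in range(2, k)]   # tree edges in the cycle
    edges += [("c2", "c1", True), (f"c{k}", "c1", True)]            # the two hybrid edges
    return edges
```

On these sunlets the function returns k − 4 cycle edges for k ≥ 5, in agreement with the count stated before the proposition, and none for k = 4.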
For any network, regardless of level or other special structure, lengths of edges defined by sets Q are easily identified.
Proposition 22 If an edge e in a metric network N is defined by a set Q of 4 taxa, then its length is identifiable from quartet C Fs.
Proof If e, defined by Q = {a, b, c, d}, has length t and induces the split ab|cd in N | Q , then C F ac|bd = exp(−t)/3, so t = − log(3C F ac|bd ).
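The formula in this proof can be checked numerically with a short round trip. In the sketch below the function names are hypothetical. The value exp(−t)/3 for C F ac|bd is the one used in the proof, and the value of C F ab|cd is filled in from the standard facts that C F ad|bc equals C F ac|bd by symmetry and that the three C Fs sum to 1.

```python
import math


def tree_quartet_cfs(t):
    """CFs (ab|cd, ac|bd, ad|bc) when the edge inducing ab|cd has length t.

    t is in coalescent units, so exp(-t) is the probability that two
    lineages entering the edge fail to coalesce within it.
    """
    minor = math.exp(-t) / 3
    return (1 - 2 * minor, minor, minor)


def edge_length_from_cf(cf_ac_bd):
    """Recover t from the CF of a quartet not matching the split: t = -log(3 CF)."""
    return -math.log(3 * cf_ac_bd)
```

For example, `edge_length_from_cf(tree_quartet_cfs(0.7)[1])` recovers 0.7 up to floating-point error, and a C F of 1/3 corresponds to t = 0.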

Numerical Parameters Associated to 3-Cycles
Edges either in or adjacent to a 3-cycle are always adjacent to a hybrid edge. Thus in binary networks, these edges are not defined by sets of 4 taxa, so Proposition 22 does not apply. Propositions 33(c), 34(c) and 36(c) illustrate that, at least for specific small networks, the numerical parameters associated to 3-cycles are not identifiable. More generally, we obtain the following.
Proposition 23 If C is a 3-cycle on a semidirected binary level-1 network N , then neither the hybridization parameters nor the lengths of any edges in or adjacent to C can be identified from quartet C Fs.
Proof Suppose first the 3-cycle induces an (n 1 , n 2 , n 3 )-partition of the taxa with all n i ≥ 2. Then using Propositions 12 and 37(a) we see that the map from numerical parameters to C Fs factors by sending the 7 numerical parameters associated to the 3-cycle and its adjacent edges into a 6-dimensional variety. This implies that the numerical parameters cannot all be identifiable. To see that no single parameter can be identified, first observe that from the factorization of maps in Eq. (5), if a single parameter were identifiable, it would have to be identifiable from a point in V D . However, Proposition 37(b) shows that is not the case.
If a 3-cycle induces a (1, n 2 , n 3 )- or (1, 1, n 3 )-partition of taxa, then by considering samples of 2 individuals for each gene from the singleton taxa, we can modify the network by attaching cherries of pseudotaxa for each singleton. Since in this case we already know that numerical parameters around the 3-cycle are not identifiable from all C Fs, with access only to C Fs using only one of the pseudotaxa, they are still not identifiable. But that means they are not identifiable for the original network.

Other Numerical Parameters
The remaining numerical parameters on a binary level-1 network to be considered include lengths of hybrid edges, lengths of edges adjacent to hybrid edges, and hybridization parameters, all when the relevant cycle is of size ≥ 4.

Proposition 24 Let N be a level-1 metric binary semidirected network with no 2-cycles, containing a k-cycle C with k ≥ 5. Then hybridization parameters and lengths of the cycle edges adjacent to the hybrid edges in C can be identified from quartet C Fs. If the hybrid node of C has at least 2 descendant taxa, the lengths of the hybrid edges can also be identified. If the hybrid node has only one descendant taxon then the lengths of the hybrid edges are not identifiable.
Proof From Proposition 22 we already know that the k − 4 edges in the cycle that are not hybrid or adjacent to a hybrid edge have identifiable lengths. If the taxon blocks for the cycle are, proceeding from the hybrid around the cycle, X 1 , X 2 , . . ., X k , then pick one taxon from each of X 1 , X 2 , X 3 , X 4 , and X k and pass to the induced subnetwork. Replacing any 2-cycles with edges, we may assume we have a 5-cycle sunlet network as in Fig. 12(left), in which the edge probability y of the edge opposite the hybrid node is known, and the edge probability x is that of the edge in C which is adjacent to a hybrid edge, lying between blocks X 2 and X 3 .
Using y and C Fs we can identify γ , and then x through Similarly, the other edge in C adjacent to a hybrid edge has identifiable length.
If the hybrid node has at least 2 descendant taxa, then by picking two taxa from X 1 and one from each of X 2 , X 3 , X k we pass to an induced subnetwork which, after replacing 2-cycles by edges, has the form of the network of Fig. 12(center) or (right) with the same hybrid edge lengths as the full network. In case (center), with a cherry below the hybrid node, applying the result of Proposition 38(c) on N s identifies the hybrid edge lengths from C Fs using the already identified γ. In case (right), a 3 1 -cycle below the hybrid node, by Proposition 6 all C Fs are unchanged if the 3-cycle is contracted to a node and the edge length above it modified appropriately. Then the identifiability of the hybrid edge lengths follows from the cherry case.
If the hybrid node has only 1 descendant taxon, then at most 1 lineage may enter (going backwards in time) the hybrid edges of C, so no coalescent events may occur on the hybrid edges. Thus the C Fs do not depend on the lengths of those edges, which are therefore not identifiable from C Fs.
We next turn to cut edges adjacent to a single hybrid edge.

Proposition 25
Let N be a level-1 metric binary semidirected network with no 2-cycles, containing an internal cut edge e adjacent to exactly one hybrid edge (at its non-hybrid node), with the hybrid edge in a k-cycle. If k ≥ 4, then the length of e is identifiable.
Proof If k ≥ 4, by passing to the induced network on a subset of the taxa, we may assume k = 4. Since e is not pendant, and not adjacent to a hybrid edge of another cycle, after again passing to an induced subnetwork and replacing any 2-cycles with single edges, we may assume the network has the structure of N w in Fig. 9 (center), with e the edge joining the cherry to the 4-cycle. But then Proposition 39(c) gives the claim.
If an edge is adjacent to hybrid edges at both of its endpoints, but neither endpoint is a hybrid node, as in Fig. 13 (left), then the following applies.

Proposition 26
Let N be a level-1 metric binary semidirected network with no 2-cycles, containing an edge adjacent to exactly two hybrid edges which lie in two different cycles. If the sizes of both cycles are ≥ 4, then the length of this edge is identifiable.
Proof If both cycles are of size ≥ 4, then the network has an induced subnetwork which, after suppressing 2-cycles has the form shown in Fig. 13 (left), with the central edge arising from , with edge probability .
Using Proposition 39 on the induced network after dropping taxon f we may identify γ, x 1 , x 2 and the product y 1 .Similarly, dropping a we may identify y 1 , which then gives .
Next we consider edges adjacent to two hybrid edges at one endpoint, that is, edges with a hybrid node as an endpoint, as in Fig. 13 (center, right).If the hybrid node is in a large cycle we obtain the following.

Proposition 27
Let N be a level-1 metric binary semidirected network with no 2-cycles, containing an edge q whose parent is the hybrid node of a k-cycle with k ≥ 5. If q has at least two descendant taxa, and the child node of q is not in a 3-cycle, then the length of q is identifiable.
Proof Since the cycle is of size ≥ 5, by Proposition 24 its hybridization parameter γ is identified.
First suppose the child node of q is not incident to a hybrid edge. If q has two descendant taxa, there is an induced subnetwork which, after replacing 2-cycles by edges, has the form of N s of Fig. 9 (left), with q the child edge of the hybrid node. With γ in hand, by Proposition 38(c) the length of q is identified.
If instead the child node of q is incident to a hybrid edge, assume that edge lies in a cycle of size ≥ 4. We may then pass to a network with the structure of Fig. 13 (right) where q is the edge joining the two cycles. But dropping taxon f again yields a network of form N s , so using γ we identify y 1 . Instead dropping b from Fig. 13 (right), by Proposition 6, the 3-cycle on this can then be contracted to a node, adjusting the edge length of q (now possibly negative) so C Fs are unchanged. Then Proposition 39 can be applied to identify y 1 . Thus the length of q is identifiable.
The remaining parameters to consider are the edge probabilities and hybridization parameter in 4-cycles, and the edge probability of the child edge of the hybrid node in a 4-cycle. Identifiability of these is more complicated, as it can depend on the sizes of the taxon blocks of the cycle. In handling these cases, we use the following.
Lemma 28 Consider a 6-taxon semidirected network with a 4-cycle, a cherry below the cycle's hybrid node, and one other cherry, as shown in Fig. 14. Then all numerical parameters are identifiable from quartet C Fs.
Proof Consider Fig. 14 (left), N sw . Then the subnetwork obtained by dropping taxon a 2 has the form of N w , and Proposition 39 shows γ, x 1 , x 2 , 2 are identifiable. But the network obtained by dropping taxon b 2 has the form of N s , so using Proposition 38 and the known value of γ identifies h 1 , h 2 , 1 .
The identifiability of all parameters for Fig. 14(right), N sn , follows from another computation, presented as Proposition 41.

Proposition 29
Let N be a level-1 metric binary semidirected network on n ≥ 4 taxa with no 2-cycles, containing a 4-cycle, as shown in Fig. 15, with taxon blocks A, B, C, D of size n A , n B , n C , n D and edge probabilities and hybridization parameters on and below the cycle as shown. Then the parameters x 1 , x 2 , h 1 , h 2 , γ, are identifiable according to the following cases, at least one of which must hold.
(i) the child of the edge with probability is not in a 3-cycle: all identifiable
(ii) the child of the edge with probability is in a 3-cycle: x 1 , x 2 , h 1 , h 2 , γ identifiable, not identifiable
Simple instances of the 5 cases in the proposition may be helpful to consider. The network N s falls under case a), N n under b)i), N w under b)ii), and N sw and N sn under c)i). Examples for case c)ii) are obtained from N sw and N sn by replacing the cherry below the hybrid edge with a 3-cycle. The proof of the proposition leverages computational results for these to obtain more general statements.
Proof That at least one of these cases must hold is most easily seen by noting that case c) is the complement of the union of a) and b).We consider each case to establish its claim.

Case a):
The 4-cycle determines a hybrid block of taxa $A$ and three taxa, $b, c, d$, in singleton blocks. If $n_A = 1$, then the result is Lemma 3. If $n_A \geq 2$, the only CFs dependent on the parameters $\theta = (x_1, x_2, h_1, h_2, \gamma, \ell)$ are those involving at most two elements of $A$, since with 3 or 4 elements of $A$ either a coalescence has occurred below the hybrid node, or at least 3 lineages reach it and are then exchangeable, giving probabilities 1/3 for each quartet tree. Those CFs dependent on $\theta$ decompose into sums of products of expressions involving only parameters outside of $\theta$ or only parameters in $\theta$, similar to the approach in Sect. 4.2.4. The expressions involving only parameters in $\theta$ can even be chosen from the CFs for the network $N_s$ of Proposition 38. But that proposition shows the parameters in $\theta$ are not identifiable from the CFs for $N_s$, so they cannot be identified from those for $N$.

Case b)i):
The 4-cycle determines a hybrid singleton $a$, two adjacent singleton blocks of $b$ and $d$, and a larger subnetwork $C$ opposite the hybrid. Viewing the network as rooted in $C$, the CFs for $N$ depend on parameters $x_1, x_2, h_1, h_2, \gamma, \ell$ only through the various probabilities of first coalescent events among subsets of $\{a, b, d\}$ determining the quartet tree before lineages leave the 4-cycle and enter $C$. Using $D$ to denote the subnetwork below $C$ which contains the 4-cycle, these are

Since these probabilities are linear functions of $p_1, p_2$, and none of $\gamma, x_1, x_2$ are identifiable from $p_1, p_2$, none of the parameters are identifiable from CFs for $N$.

Case b)ii):
Pick two taxa in one of the blocks adjacent to the hybrid one, and one taxon in all others. Passing to the induced subnetwork and removing 2-cycles yields either a network with the form $N_w$ or one where the cherry in $N_w$ is replaced by a 3-cycle. Using Proposition 6, we may replace such a 3-cycle with a node without changing CFs, provided we modify the edge length leading to the 4-cycle, including allowing for a possibly negative branch length. But then the network has the form $N_w$ and applying Proposition 39 shows $\gamma, x_1, x_2$ can be identified.
Since there is only one taxon descended from the hybrid node, there can be no coalescent event in either of the hybrid edges or their descendant, and thus these edge lengths do not appear in the formulas for the CFs for $N$. Therefore these parameters cannot be identifiable.

Case c)i):
Pick two taxa in one of the non-hybrid blocks, two taxa in the hybrid block, and one taxon from each of the others. Passing to the induced subnetwork on these 6 taxa, and removing any 2-cycles, we obtain a network of one of the forms in Fig. 14, or ones where 3-cycles appear in place of one or both cherry nodes. If there are 3-cycles, by Proposition 6 we may replace them with nodes without changing CFs (provided we modify edge lengths leading to the 4-cycle). Then using Lemma 28 we can identify $\gamma, x_1, x_2, h_1, h_2$.
To identify $\ell$, let $v$ be the child node of the edge with this probability. If $v$ is not in a cycle in $N$, then picking one taxon descended from each of its child edges and passing to an induced subnetwork, $\ell$ is identifiable by Lemma 28.
If $v$ is in a cycle, it is of size ≥ 4. Passing to an induced subnetwork, we may assume that $v$ is in a 4-cycle. Note that $v$ cannot be the hybrid node of that cycle, else the semidirected network would not be rootable. If $v$ is opposite the hybrid node, then we may pass to an induced subnetwork which, after replacing 2-cycles with edges, has a cherry below $v$, and follow the previous argument. If $v$ is adjacent to the hybrid node, then the subnetwork has the form of Fig. 13 (right). Since $\gamma$ is identified, the argument used in Proposition 27 then shows $\ell$ is identifiable.

Case c)ii):
The argument of the first paragraph for Case c)i) shows $\gamma, x_1, x_2, h_1, h_2$ are identifiable. Since the edge descending from the hybrid node of the 4-cycle is incident to a 3-cycle, its length is not identifiable by Proposition 23.
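The case analysis can be summarized as a small lookup. The following Python sketch reads the case conditions off the proof above, so they are an interpretation of the proof rather than a restatement of the proposition. The function name `classify_four_cycle`, its argument names, and the case labels are hypothetical.

```python
def classify_four_cycle(n_a, n_b, n_c, n_d, child_in_3cycle=False):
    """Return (case, identifiable, not_identifiable) for a 4-cycle.

    Assumptions (read off the proof, not stated verbatim in the text):
    n_a is the size of the hybrid block, n_b and n_d the sizes of the
    blocks adjacent to it, and n_c the size of the opposite block.
    `child_in_3cycle` says whether the child node of the edge below the
    hybrid node (edge probability "l") lies in a 3-cycle.
    """
    cycle = ["x1", "x2", "h1", "h2", "gamma"]
    everything = cycle + ["l"]
    if min(n_a, n_b, n_c, n_d) < 1:
        raise ValueError("every taxon block must be non-empty")
    if n_b == n_c == n_d == 1:
        # Case a): all non-hybrid blocks are singletons.
        return "a", [], everything
    if n_a == 1:
        if n_b == 1 and n_d == 1:
            # Case b)i): only the block opposite the hybrid is larger.
            return "b-i", [], everything
        # Case b)ii): a block adjacent to the hybrid has >= 2 taxa.
        return "b-ii", ["x1", "x2", "gamma"], ["h1", "h2", "l"]
    # Case c): hybrid block and some other block both have >= 2 taxa.
    if child_in_3cycle:
        return "c-ii", cycle, ["l"]
    return "c-i", everything, []
```

Under these assumptions, block sizes `(1, 1, 1, 1)` correspond to $N_s$ and land in case a), `(1, 1, 2, 1)` to case b)i), `(1, 2, 1, 1)` to case b)ii), and `(2, 2, 1, 1)` or `(2, 1, 2, 1)` to case c)i).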

Summary of Numerical Parameter Identifiability
We summarize this section's results with the following.
Theorem 30 (Numerical parameter identifiability from quartet CFs) Let $N$ be a level-1 metric binary semidirected network on $X$ with no 2-cycles, and $|X| \geq 4$. Then from quartet CFs under the NMSC with one sample per taxon all numerical parameters on $N$ are identifiable except for the following, which are not identifiable:
1. Pendant edge lengths,
2. For 3-cycles, hybridization parameters and the lengths of the six edges in and adjacent to the cycle,
3. For 4-cycles, the hybridization parameter and edge lengths in the cycle and descended from the hybrid node, as stated in Proposition 29.
If two individuals are sampled in some taxon $x$, as discussed earlier this can be modeled by attaching a cherry of pseudotaxa $x_1, x_2$ at the leaf $x$. Doing so for all taxa resolves the non-identifiability issues of Items 1 and 3, yielding the following.

Corollary 31
Let $N$ be a level-1 metric binary semidirected network with no 2-cycles. Then from quartet CFs under the NMSC with two or more samples for all taxa, all numerical parameters on $N$ are identifiable except hybridization parameters and lengths of edges in and adjacent to 3-cycles.
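The two-sample construction can be illustrated on a toy edge list. The sketch below is only an illustration: the representation (a list of `(parent, child)` pairs), the function names, and the `_1`/`_2` naming of pseudotaxa are all hypothetical choices. It shows that after attaching a cherry at each leaf, every formerly pendant edge is no longer pendant.

```python
def attach_pseudotaxa(edges, leaves):
    """Attach a cherry of two pseudotaxa at every leaf of a network.

    `edges` is a list of (parent, child) pairs and `leaves` the taxon
    labels; both are a toy representation chosen for illustration.
    Returns the new edge list and the new leaf labels.
    """
    new_edges = list(edges)
    new_leaves = []
    for x in leaves:
        for i in (1, 2):
            pseudo = f"{x}_{i}"
            new_edges.append((x, pseudo))  # old leaf x is now internal
            new_leaves.append(pseudo)
    return new_edges, new_leaves


def pendant_edges(edges, leaves):
    """Edges whose child is a leaf."""
    leaf_set = set(leaves)
    return [e for e in edges if e[1] in leaf_set]


# A quartet tree ((a,b),(c,d)) with internal nodes u, v.
tree = [("u", "a"), ("u", "b"), ("u", "v"), ("v", "c"), ("v", "d")]
edges2, leaves2 = attach_pseudotaxa(tree, ["a", "b", "c", "d"])
```

Here the edge `("u", "a")` is pendant in `tree` but internal in `edges2`, whose pendant edges are exactly the eight new edges to pseudotaxa.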

Implications for Data Analysis
Attempting to infer the non-identifiable can either be misleading (unless all possible alternatives are reported), or very slow (spending computational time considering equally good possibilities), so our results here should inform development and use of CF-based inference methods.
The issues with identifiability of 3-cycles from CFs under the NMSC shown here are perhaps the greatest source of problems for practical inference. Hybridization or gene flow is generally believed to occur most frequently among recently diverged populations, and when this occurs between sister populations it leads to a 3-cycle. Thus these cycles may commonly underlie empirical data. We have shown that in many cases CFs may indicate the presence of a 3-cycle, though not necessarily its hybrid node, but that the numerical parameters associated to it are not identifiable.
This poses particular issues for likelihood and pseudolikelihood approaches. Quartet CFs may carry signals of undirected 3-cycles (even in certain "bad triangle" cases not considered in SNaQ's search), and ignoring the possibility of such cycles could have unknown consequences under these optimality criteria. Since for some parameter values there is a signal of a 3-cycle's hybrid node in the CFs, the search cannot be limited to undirected 3-cycles in all circumstances.
Even if only the network topology is sought, these criteria require optimization over numerical parameters, so these must be dealt with in a search. However, since the numerical parameters are not identifiable, searching over them directly will be slow. Reducing the over-parameterization at 3-cycles (e.g., from 7 parameters to 6 when the 3-cycle is not near a leaf) is desirable, but how to do so while maintaining the same range of CFs is unclear. Even if this were accomplished, as the numerical parameters vary, the semidirected topology may pass between identifiable and non-identifiable regimes, and the boundaries of these are not known. Without such information, one must consider all possibilities for the location of a hybrid node in a 3-cycle throughout, but allow for multiple optimal networks.
SNaQ (Solís-Lemus and Ané 2016), with its default settings, restricts its search for 3-cycles in networks to those with all cycle blocks of size at least 2 ("good triangles"), corresponding to our Theorem 19 (3). It addresses the numerical overparameterization at 3-cycles by setting the edge probability below the putative hybrid node ($\ell_1$ in Fig. 6, right) to 1, reducing the number of numerical parameters to be estimated to six ($\gamma, h_1, h_2, x, \ell_2, \ell_3$), matching the dimension of the variety. The estimated parameters are then composite parameters, which are functions of the original ones producing the same CFs. While our computations (not shown) indicate that this parameterization of CFs is 2-to-1, unlike the 1-to-1 map suggested by Proposition 37, that is not necessarily problematic but could result in multiple optima.
More worrisome is the fact that both of these maps are only guaranteed to give complex parameterizations of the relevant variety, and when restricted to stochastic parameters do not necessarily produce the same collection of CFs as the original map. We experimented with 40,000 sets of 7 parameters ($\gamma$ and 6 edge probabilities) chosen uniformly at random in [0, 1] for the network in Fig. 6 (right), and found that in 93.4% of the cases there were no stochastic parameters with $\ell_1 = 1$ producing the same CFs. In 6.4% there was a single stochastic parameter choice producing the CFs, and in the remaining 0.2% there were two. A full numerical parameter search with the SNaQ approach thus requires examining non-stochastic parameter values (negative, > 1, or even complex), and then verifying optimal values give CFs that also arise from some set of stochastic parameters (without $\ell_1$ restricted to 1). While limiting the search space to both be lower dimensional and only give CFs arising from stochastic parameters would be desirable, how to do so is an open problem.
Since exactly what information in CFs is extracted by maximizing the pseudolikelihood function is difficult to analyze theoretically, the impact of 3-cycles on inference needs to be studied thoroughly using simulation, both for SNaQ and for PhyloNet's similar inference from rooted triples (Yu and Nakhleh 2015). NANUQ (Allman et al. 2019) does not suffer from these problems, as its inference goal is more modest, providing a statistically consistent estimate only of larger cycle topology, without any search over the numerical parameter space. Whether NANUQ can be supplemented to extract CF information on the existence of 3-cycles should also be explored.
Although our goal in this work was to understand the theoretical question of parameter identifiability from CFs under the NMSC for level-1 networks, some of our results for small networks also address the question of practical identifiability for networks with any number of taxa. For example, a true reticulate evolutionary history for a large number of taxa might be described by a graph containing a 4-cycle in which some or all of the cut edges leading to cycle blocks are long. Long branch lengths (in coalescent units) can arise from either small population sizes (bottlenecks) or long times in generations. Regardless, the probability that all lineages entering those long edges coalesce before entering the cycle may be close to 1, almost ensuring that only a single lineage reaches the 4-cycle from such a cut edge. This reduces what parameters may be practically identifiable from a finite data set, with the extreme case of a single lineage from each of the 4 cut edges yielding only the undirected topological 4-cycle, and no numerical information. Using standard likelihood-based approaches, network details such as these may be inferred and reported, even when there is little signal in the data supporting them.
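To make the effect of long cut edges concrete, the sketch below computes the probability that $k$ lineages entering an edge of length $t$ (in coalescent units) have all coalesced into one by its end. It assumes the standard Kingman coalescent, under which the waiting time while $j$ lineages remain is exponential with rate $j(j-1)/2$. The function name is hypothetical, and the formula used is the usual distribution function of a sum of independent exponentials with distinct rates.

```python
import math


def prob_all_coalesce(k, t):
    """P(k lineages coalesce to one within time t), Kingman coalescent.

    The total waiting time is a sum of independent exponentials with
    rates j*(j-1)/2 for j = 2..k, all distinct, so its distribution
    function has the closed hypoexponential form used here.
    """
    if k < 1 or t < 0:
        raise ValueError("need k >= 1 and t >= 0")
    rates = [j * (j - 1) / 2 for j in range(2, k + 1)]
    survival = 0.0
    for i, lam_i in enumerate(rates):
        weight = 1.0
        for j, lam_j in enumerate(rates):
            if j != i:
                weight *= lam_j / (lam_j - lam_i)
        survival += weight * math.exp(-lam_i * t)
    return 1.0 - survival


# Tabulate the probability for a few edge lengths and lineage counts.
for t in (0.5, 2.0, 5.0):
    print(t, [round(prob_all_coalesce(k, t), 4) for k in (2, 3, 5)])
```

For $k = 2$ this reduces to $1 - e^{-t}$, which is one minus the edge probability if, as assumed here, edge probabilities are $e^{-t}$. The alternating weights make the formula numerically delicate for large $k$, so it is meant for small lineage counts only.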
In closing we remark that identifiability theorems needed to justify network inference methods from data types other than CFs are largely lacking. Studies of the parameter identifiability question for these data are urgently needed as well.
…for networks with $n \geq 4$ taxa, we restrict our focus to the case when $N'$ is the network topology obtained from $N$ by removing a single hybrid edge of interest. … The presence of the hybridization of interest can be detected if the quartet CFs from $N$ cannot all be equal to the quartet CFs from $N'$ simultaneously.
In other words, that work focused on distinguishability of the 2-element set $\{N, N'\}$ using quartet CFs. While investigating this question was certainly a strong first contribution to the broader question of identifiability of level-1 network topology from CFs, addressing the full question would require showing distinguishability of the set of all possible level-1 networks on a taxon set $X$, including networks with completely unrelated topologies. More fully, one would need to show this for sets $X$ of arbitrary size. The approach taken to prove detectability by Solís-Lemus and Ané (2016), however, depends on equating formulas for CFs in terms of numerical parameters on the two networks and solving the system, and it is unclear how this could be applied to the full identifiability question.
Note that while Solís-Lemus and Ané (2016) states that their detectability results extend to level-1 networks with many cycles (where multiple hybrid edges may be removed to get the set of distinguishable networks), a justification for this claim was only given in Solis-Lemus et al. (2020), with Lemma 3 of that work being key. Unfortunately, that lemma is incorrect as stated. (See Sect. A.2 of this appendix for a counterexample.) While it might be possible to obtain a correct justification using similar ideas, doing so seems unnecessary given the results of Baños (2019), which we describe next.
After presenting a detailed study of all level-1 networks on 4 taxa and their CFs, Baños (2019) used combinatorial arguments to show that information about larger networks could be obtained through induced quartet networks, and hence CFs. In particular, the topological identifiability result stated in this paper as Theorem 2 was established. This work also clarified several points that were either implicit or unaddressed in Solís-Lemus and Ané (2016): First, the semidirected network was formally defined, highlighting that network structure above the LSA needed to be excised. Second, identifiability of large cycles (more than 4 edges) was explicitly addressed (although Solis-Lemus et al. 2020 later provided an argument for detectability). Third, identifiability for networks with multiple cycles was shown. However, because it considered only one CF at a time to deduce information about an induced quartet network, without exploring whether additional information might be found in the relationship of CFs for overlapping sets of taxa, it did not obtain the strongest possible result. As an example, while Solís-Lemus and Ané (2016) showed the distinguishability of the hybrid node of a 4-cycle in certain sufficiently large networks, Baños (2019) simply left all 4-cycles undirected in its network identifiability result.
Moreover, Solís-Lemus and Ané (2016) showed that in some cases 3-cycles were detectable, while the main result of Baños (2019) omits 3-cycles in its identifiability result. It did contain a theorem, though, suggesting a $3_2$-cycle on a 4-taxon network may be identifiable for some parameter values. In both works, then, questions about 3-cycle identifiability were left open.
In short, the general question, in arbitrary level-1 networks, of topological identifiability of both directed and undirected 3-cycles and of directed 4-cycles, the focus of Sects. 4.2 and 4.3 of this work, remained.
Although not considered by Baños (2019), the study of identifiability of numerical parameters was also initiated by Solís-Lemus and Ané (2016). Using calculations of Gröbner bases of ideals of polynomial relationships between CFs, arguments in Solís-Lemus and Ané (2016) (with a technical matter corrected in Solis-Lemus et al. 2020) investigated whether the dimension of the associated algebraic variety allowed for all numerical parameters to be identifiable. If this dimension is less than the number of parameters, then not all parameters can be identified, though it is possible that a subset are. If the dimension and number of parameters are equal, then the parameterization map must be generically finite-to-1. For specific networks Solís-Lemus and Ané (2016) determined whether or not the parameterization was finite-to-1, but passing to networks with more than 1 cycle again depended on the faulty Lemma 3 of Solis-Lemus et al. (2020). Moreover, the calculations in Solís-Lemus and Ané (2016) are focused on numerical parameters associated to cycles (cycle edge lengths, hybridization parameters, and possibly lengths of edges adjacent to a cycle). Although the arguments do not seem to cover edges not adjacent to any cycles, this omission is easily overcome (for instance as in our Proposition 22). On the other hand, it is unclear how the computations can be applied to answer whether the edges adjacent to hybrid edges in two different cycles have identifiable lengths. Moreover, when a dimension computation leads to a valid conclusion that a full set of numerical parameters cannot be identified, it still is possible that some subset of the parameters could be. This issue was not explored.
Finally, establishing that there can only be finitely many choices of parameters that yield the CFs on a network is a local identifiability result, and Solís-Lemus and Ané (2016) emphasized it leaves open the question of global identifiability: Does this finite set actually contain only one such choice? This cannot be settled by dimension computations of the sort in Solís-Lemus and Ané (2016). There are in fact simple statistical models (outside of phylogenetics) where parameterizations are finite-to-1, but not 1-to-1, in surprising ways (Allman et al. 2015). Investigating global identifiability is thus highly desirable.
Several questions of identifiability of numerical parameters thus remained open: identifiability of certain edge lengths, especially in multicycle networks; identifiability of subsets of parameters when a full set was not identifiable; and the broad question of global identifiability. These questions are the focus of Sect. 5.
When identifiability of individual parameters fails, it remains possible that composite parameters (i.e., functions of the parameters) might be identifiable and used in inference. For instance, for the network in Lemma 3, Solís-Lemus and Ané (2016) … In optimizing the pseudolikelihood function, SNaQ makes use of composite parameters in some situations when the original ones are not identifiable. While this can be an important issue in algorithm design, we do not focus on it in this work, though some of these known composite parameters appear in our arguments.
As the manuscript for this work was being completed, the preprint (Tiley and Solis-Lemus 2023) appeared. This includes work on distinguishability of several different 6-taxon networks with 4-cycles and several cherries, but not on the full network identifiability questions even for 6-taxon networks with 4-cycles. It does however include simulation work to investigate practical identifiability, the extent to which with finite data sets one can infer such cycles, using several pseudolikelihood methods.

A.2 An Example
Consider the two networks shown in Fig. 16, where the right network is obtained from the left by removing a hybrid edge in the top 4-cycle. This is an instance of the construction implied in the statement of Lemma 3 of Solis-Lemus et al. (2020), which is used to justify analyzing CFs only of level-1 networks with a single cycle to understand those with multiple cycles. We show here that the statement of Lemma 3 is not correct for this pair of networks.
We first note that this lemma states that the CFs for the two networks which depend on parameters associated to the cycle are equal. This is not strictly true as stated: Focusing on the lower cycle of the left network, for instance, $CF_{ab|ce}$ depends on $\gamma, x_1$ as well as on $\delta$ for the left network, but for the right network has no dependence on $\delta$. Other interpretations of this statement are that the described set of CFs for the two networks produce the same values as parameters vary, or that the varieties defined by the parameterized CFs are the same. As this last interpretation is the broadest, we show it is also false.

Fig. 16 (left) A network with a 4-cycle of interest at bottom, and (right) a network with a single cycle obtained by removing a hybrid edge from each pair not in the cycle of interest

Consider $CF_{ac|df}$ and $CF_{ac|ef}$ for the two networks. For the left network

which are not equal for generic parameter values. However, for the right network $CF_{ac|df} = CF_{ac|ef}$, since $d, e$ can be exchanged in the graph because they form a cherry. Thus the variety for the right network has a defining equation that does not hold for the left.
This example illustrates an important point: in passing to smaller induced networks, a cycle other than the one of immediate focus may have an impact on CFs. In particular, a more detailed treatment of networks with multiple cycles, such as our work in Sect. 4.3, seems to be a necessary part of addressing the full identifiability question.
entries add to 1 to be trivial invariants, as they hold for all network topologies. If only 4 taxa $a, b, c, d$ are considered, then the trivial invariant is $CF_{ab|cd} + CF_{ac|bd} + CF_{ad|bc} - 1$, while for $n$ taxa there are $\binom{n}{4}$ such invariants, one for each possible quartet.
Second, if there is a cut edge on an induced quartet network separating two of the taxa, $a, b$ from the others $c, d$, then the CFs associated with the discordant topologies $ac|bd$ and $ad|bc$ must be equal. This was shown in the level-1 case in Baños (2019) and for general networks in Allman et al. (2022). The invariant expressing this is $CF_{ac|bd} - CF_{ad|bc}$. We call such polynomials cut invariants.
Third, when a network has one or more cherries (2 taxa joined by pendant edges to a common node), we will label the taxa in each cherry by the same subscripted letter, such as $a_1, a_2$. (See for instance Fig. 5.) In CFs involving such taxa, we may then suppress the subscripts, since under the NMSC model on a suitably rooted version of the semidirected network the taxa $a_1$ and $a_2$ are exchangeable, giving the same CF values when they are interchanged. Thus, for example, on a network with a cherry formed by $a_1, a_2$, the polynomial $CF_{a_1 b|cd} - CF_{a_2 b|cd}$ vanishes for any other taxa $b, c, d$. We call these exchange invariants, but note that some of these are also cut invariants. For computations, our notational simplification of suppressing subscripts allows for their omission.
Definition 5 For a topological phylogenetic network $N$, the ideal generated by all trivial, cut, and exchange invariants of $N$ is denoted $J(N)$.
Note that $J(N)$ depends on the topology of the network $N$ due to the cut and exchange invariants. However, it only depends on the undirected topology. Also, while $J(N) \subseteq I(N)$, these ideals are typically not equal. Since the generators of $J(N)$ are linear, and simple to enumerate, we use them for removing some CFs from calculations of $I(N)$, and for stating results on ideal generators more succinctly.
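As a quick numerical illustration of trivial and cut invariants, the sketch below evaluates them on a 4-taxon species tree $ab|cd$. It assumes the standard multispecies coalescent formulas for a quartet tree with internal edge length $t$: the concordant CF is $1 - \tfrac{2}{3}e^{-t}$ and each discordant CF is $\tfrac{1}{3}e^{-t}$. All function names are hypothetical.

```python
import math


def quartet_tree_cfs(t):
    """CFs (ab|cd, ac|bd, ad|bc) for species tree ab|cd, internal length t."""
    ell = math.exp(-t)  # probability of no coalescence on the internal edge
    return {"ab|cd": 1 - 2 * ell / 3, "ac|bd": ell / 3, "ad|bc": ell / 3}


def trivial_invariant(cf):
    """The three CFs of a quartet sum to 1, so this should vanish."""
    return cf["ab|cd"] + cf["ac|bd"] + cf["ad|bc"] - 1


def cut_invariant(cf):
    """The cut edge separates {a, b} from {c, d}; discordant CFs agree."""
    return cf["ac|bd"] - cf["ad|bc"]


def num_trivial_invariants(n):
    """One trivial invariant per 4-taxon subset of n taxa."""
    return math.comb(n, 4)
```

Both invariants evaluate to zero (up to rounding) for any $t \geq 0$, and on 6 taxa there are `num_trivial_invariants(6)` = 15 trivial invariants.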

B.2 Propositions from 3-Cycle Computations
Proposition 32 Let $T_5 = ((a_1, a_2), c, (b_1, b_2))$ be a 5-taxon tree, as shown in Fig. 5 (left). Then

(c) None of the numerical parameters $\gamma, h_1, h_2, x, \ell_1, \ell_2$ can be determined from quartet CFs. They can be determined with three degrees of freedom. If $\gamma$, $\ell_1$ and $\ell_2$ are known then:

Proposition 35 Let $T_6$ be the 6-taxon tree with a central node and 3 cherries, as in Fig. 6 (left). Then,
(a) The quartet concordance factors of $T_6$ are
The variety $V_{T_6}$ has dimension 3.
(c) The numerical parameters $\ell_1, \ell_2, \ell_3$ can be determined from quartet CFs by:

Proposition 36 Let $N_{6-3_2} = N_a$ be a 6-taxon level-1 semidirected network with a central $3_2$-cycle and 3 cherries, as shown in Fig. 6

(b) The ideal defining $V_{N_s} \subset \mathbb{C}^{12}$ is $I(N_s) = J(N_s)$. The variety $V_{N_s}$ has dimension 5.
(c) None of the numerical parameters $x_1, x_2, h_1, h_2, \ell, \gamma$ can be determined from quartet CFs. They can be determined with 1 degree of freedom, for instance, if $\gamma$ is given:

but none of $x_1$, $x_2$, and $\gamma$ can be. However, the quantities $x_1\gamma - \gamma$ and $x_2(\gamma - 1) - \gamma$ can be determined from the quartet CFs, so if $\gamma$ is known:

Proposition 41 Let $N_{sn}$ be a 6-taxon level-1 network with a central 4-cycle as shown in Fig. 14 (right). Then
(a) The quartet concordance factors of $N_{sn}$ are
The variety $V_{N_{sn}}$ has dimension 7.
(c) The numerical parameters $\gamma, h_1, h_2, x_1, x_2, \ell_1, \ell_2$ can be determined from quartet CFs:

Fig. 1 (left) A rooted network $N^+$ on $X$ with root $r = \mathrm{LSA}(X)$, and (right) the unrooted network $N^-$ obtained from $N^+$

Fig. 2 Several semidirected quartet networks induced from the network in Fig. 1

Fig. 5 (left) The 5-taxon unrooted binary tree $T_5$; (center) the 5-taxon network $N_{5-3_1}$ with a $3_1$-cycle; and (right) the 5-taxon network $N_{5-3_2}$ with a $3_2$-cycle, with numerical parameters shown. Edge probabilities of hybrid edges in $N_{5-3_1}$ and of pendant edges in networks are omitted, since they do not appear in formulas for the CFs

Fig. 6 (left) The 6-taxon tree $T_6$ with three cherries. (right) The 6-taxon network $N_a$ with a central 3-cycle surrounded by 3 cherries, with $a_1, a_2$ descending from the hybrid node. The network $N_b$ is obtained by 'rotating' the three pairs of taxa so $b_1, b_2$ descend from the hybrid node, and similarly for $N_c$

Fig. 7 Values of $(G_{abc}, G_{bca}, G_{cab})$ plotted in three dimensions, for random numerical parameter values on each of the three networks $N_a, N_b, N_c$. Color indicates network topology. Plotted points lie in the plane $x + y + z = 0$, which is viewed orthogonally. The three coordinate planes $x = 0$, $y = 0$, $z = 0$ intersect this plane in the colored lines, separating the points by color into overlapping half-planes. Numerical parameters for networks were chosen uniformly from the interval [0, 1] (Color figure online)

Fig. 8 (left) A decomposition of a level-1 network $N^+$ with a 3-cycle into 4 subnetworks, denoted $A, B, C, D$, with root in $C$. (right) The semidirected 3-cycle network $N_{6-3_2}$ with 3 cherries, which is a simple instance of the network on the left

are:
D: The 3-cycle and its three adjacent cut edges, with pendant vertices $a, b, c$, where $a$ is the child of the hybrid node of the cycle;
A, B, C: The connected components containing $a, b, c$, respectively, when the edges and internal nodes of $D$ are deleted from $N^+$.

Fig. 11 A semidirected network with edges defined by sets $Q$ of 4 taxa highlighted in blue (Color figure online)

Fig. 12 Subnetworks used in the proof of Proposition 24

Fig. 13 Semidirected binary networks on 6 taxa with two 4-cycles joined by an edge adjacent to two or more hybrid edges

Fig. 14 Semidirected binary networks on 6 taxa with a 4-cycle and two cherries: (left) $N_{sw}$ and (right) $N_{sn}$

The ideal defining $V_{T_6} \subset \mathbb{C}^{18}$ is $I(T_6) = J(T_6) + \langle f_{abc}, f_{bca}, f_{cab} \rangle$, where
$f_{abc} = 3\,CF_{ab|ac}\,CF_{ab|bc} - CF_{ab|ab}$,
$f_{bca} = 3\,CF_{ab|bc}\,CF_{ac|bc} - CF_{bc|bc}$,
$f_{cab} = 3\,CF_{ab|ac}\,CF_{ac|bc} - CF_{ac|ac}$.
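The three quadratic generators $f_{abc}, f_{bca}, f_{cab}$ for $T_6$ can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's computation. It takes the internal edges leading to the cherries $a$, $b$, $c$ to have edge probabilities $\ell_1, \ell_2, \ell_3$ (with $\ell = e^{-t}$), uses the standard quartet-tree fact that each discordant CF equals one third of the product of the edge probabilities along the internal path, and reads $CF_{ab|ac}$ as the CF of the split $a_1 b \mid a_2 c$ with subscripts suppressed. Function names are hypothetical.

```python
import random


def t6_discordant_cfs(l1, l2, l3):
    """Discordant CFs of T6 that appear in f_abc, f_bca, f_cab."""
    return {
        "ab|ac": l1 / 3,       # quartet a1, a2, b, c: internal edge l1
        "ab|bc": l2 / 3,       # quartet a, b1, b2, c: internal edge l2
        "ac|bc": l3 / 3,       # quartet a, b, c1, c2: internal edge l3
        "ab|ab": l1 * l2 / 3,  # quartet a1, a2, b1, b2: path l1 * l2
        "bc|bc": l2 * l3 / 3,  # quartet b1, b2, c1, c2: path l2 * l3
        "ac|ac": l1 * l3 / 3,  # quartet a1, a2, c1, c2: path l1 * l3
    }


def t6_generators(cf):
    """Evaluate the three quadratic generators on a dict of CFs."""
    f_abc = 3 * cf["ab|ac"] * cf["ab|bc"] - cf["ab|ab"]
    f_bca = 3 * cf["ab|bc"] * cf["ac|bc"] - cf["bc|bc"]
    f_cab = 3 * cf["ab|ac"] * cf["ac|bc"] - cf["ac|ac"]
    return f_abc, f_bca, f_cab


random.seed(0)
params = [random.random() for _ in range(3)]
cf = t6_discordant_cfs(*params)
residuals = t6_generators(cf)  # all three should vanish up to rounding
recovered = [3 * cf["ab|ac"], 3 * cf["ab|bc"], 3 * cf["ac|bc"]]
```

Under these assumptions the residuals vanish and each $\ell_i$ is three times a single discordant CF, which is consistent with the three edge probabilities of $T_6$ being determined by quartet CFs.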