Ancient gene linkages support ctenophores as sister to other animals

A central question in evolutionary biology is whether sponges or ctenophores (comb jellies) are the sister group to all other animals. These alternative phylogenetic hypotheses imply different scenarios for the evolution of complex neural systems and other animal-specific traits1–6. Conventional phylogenetic approaches based on morphological characters and increasingly extensive gene sequence collections have not been able to definitively answer this question7–11. Here we develop chromosome-scale gene linkage, also known as synteny, as a phylogenetic character for resolving this question12. We report new chromosome-scale genomes for a ctenophore and two marine sponges, and for three unicellular relatives of animals (a choanoflagellate, a filasterean amoeba and an ichthyosporean) that serve as outgroups for phylogenetic analysis. We find ancient syntenies that are conserved between animals and their close unicellular relatives. Ctenophores and unicellular eukaryotes share ancestral metazoan patterns, whereas sponges, bilaterians, and cnidarians share derived chromosomal rearrangements. Conserved syntenic characters unite sponges with bilaterians, cnidarians, and placozoans in a monophyletic clade to the exclusion of ctenophores, placing ctenophores as the sister group to all other animals. The patterns of synteny shared by sponges, bilaterians, and cnidarians are the result of rare and irreversible chromosome fusion-and-mixing events that provide robust and unambiguous phylogenetic support for the ctenophore-sister hypothesis. These findings provide a new framework for resolving deep, recalcitrant phylogenetic problems and have implications for our understanding of animal evolution.


Interpreting two-species comparisons in a multi-species context
Because gene translocation during evolution is a stochastic process 144,145, uninterrupted conserved gene linkages across species are likely relicts of ancient gene linkages 144. These properties lead us to use conserved gene linkage across many species as a conservation score to determine whether a given gene pair colocalized on a chromosome reflects an ancestral configuration or a derived rearrangement.
Consider a chromosome that contains 1000 genes in an ancestral species whose descendants diverge into 26 independently evolving species $s_A, \ldots, s_Z$ with a complex evolutionary history. During the chromosome evolution of these species there are many unique and independent heritable gene translocation events. Visualizing the synteny between any two genomes at the tips of that evolutionary tree, say those of species $s_A$ and $s_B$, will show that gene translocations have occurred since the two genomes diverged from the ancestral state. However, if we look at the positions of the genes translocated between $s_A$ and $s_B$ in the genomes of the other 24 extant species $s_C, \ldots, s_Z$, we may see that those genes tend to exist on the same chromosomes in those 24 species. In this case, we know that the genes translocated between $s_A$ and $s_B$ are conserved on the same chromosome in most other species, and therefore likely share an evolutionary history 144. This degree of conservation is defined at the gene-gene level; it can also be calculated at the level of gene communities by considering many gene-gene conservations in a network. This concept is represented in Extended Data Figure 7; a formal definition of the ortholog network conservation scores follows in Supplementary Information 11.2.2-11.2.4; and our findings are presented in Supplementary Information 11.3. The program that computes these scores is available in the FigShare repository under the name orthology conservation score.py.

Ortholog data structure
To perform these analyses, we first estimate orthogroups among many species using Orthofinder v2.5.4 with blastp. See Supplementary Information 10 for details of our analyses and the species included in the Orthofinder analysis performed for this study.
Because we assess the conservation of gene colocalizations between two species based on gene colocalization in other species, the Orthofinder analysis must include at least three species. We denote the set of species included in the Orthofinder analysis as $s$.
Below, we will analyze the macrosynteny of species $s_1$ and $s_2$ in the context of the orthologs' conservation in the species $[s_3, \ldots, s_k]$.
Every species $s_k$ has the additional property of a list of chromosome/scaffold IDs. Let $p$ be the number of scaffolds in the genome assembly and annotation for a given species $s_k \in s$.
Let us represent each orthogroup resulting from an Orthofinder analysis, or from another method of orthology finding, as a node $V_i$ in the set of nodes $V$:
$$V = \{\, v : v \text{ is an orthogroup in the ortholog-finding analysis of } \forall s_k \in s \,\} \qquad (3)$$
Each orthogroup $V_i \in V$ will have $m$ genes for a given species $s_k$, and each gene has a chromosome $x_m$ on which it resides. We access the chromosomes on which each gene resides with a lookup function over orthogroups and species.
In this analysis, we are comparing the macrosynteny of species $s_1$ and $s_2$. To avoid chromosome misidentifications, for each species $s_k$ we select only orthogroups $V_i \in V$ that contain one gene, or that contain multiple genes only on one chromosome. We define a function returning boolean values indicating whether the gene(s) in each orthogroup $V_i$ exist(s) on a single chromosome in species $s_k$.
Moving forward, we limit our analyses to orthogroups $V_i \in V$ if and only if this single-chromosome function, evaluated on $(V_i, s_k)$, is satisfied for both species $s_1$ and $s_2$. We call this subset of orthogroups $W$.
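The filtering step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the dictionary layout, the function names `on_single_chromosome` and `select_W`, and the toy orthogroups are all assumptions made for the example. It also assumes that an orthogroup with no gene in a species fails the test for that species.

```python
from typing import Dict, List

# Assumed layout: orthogroup ID -> species -> chromosome ID of each gene copy.
Orthogroups = Dict[str, Dict[str, List[str]]]


def on_single_chromosome(orthogroup: Dict[str, List[str]], species: str) -> bool:
    """True if the orthogroup has at least one gene in `species`
    and all of its genes in that species lie on one chromosome."""
    chroms = orthogroup.get(species, [])
    return len(chroms) > 0 and len(set(chroms)) == 1


def select_W(orthogroups: Orthogroups, s1: str, s2: str) -> Orthogroups:
    """Keep only orthogroups that pass the single-chromosome test in both s1 and s2."""
    return {
        og_id: og
        for og_id, og in orthogroups.items()
        if on_single_chromosome(og, s1) and on_single_chromosome(og, s2)
    }


# Hypothetical toy data for illustration only.
orthogroups: Orthogroups = {
    "OG1": {"s1": ["chr1"], "s2": ["chrA"], "s3": ["c1"]},
    "OG2": {"s1": ["chr1", "chr1"], "s2": ["chrA"], "s3": ["c1", "c2"]},
    "OG3": {"s1": ["chr1", "chr2"], "s2": ["chrB"], "s3": ["c2"]},  # split in s1
    "OG4": {"s1": ["chr2"], "s3": ["c2"]},                          # absent from s2
}

W = select_W(orthogroups, "s1", "s2")
print(sorted(W))
```

Note that the filter is applied only to $s_1$ and $s_2$; an orthogroup such as OG2 may still be spread over several chromosomes in the other species.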

Conservation score
The set $W$ contains orthogroups whose genes exist on a single chromosome in both species $s_1$ and $s_2$. In the first step toward constructing a pairwise conservation score, we select pairs of orthogroups that exist on the same pair of chromosomes $p$ and $q$ in the species $s_1$ and $s_2$; we denote the orthogroups located on chromosome $p$ in $s_1$ and chromosome $q$ in $s_2$ as $W_{p,q}$.
We also define a set of edges $E_{p,q}$ that connect the orthogroups in $W_{p,q}$. The set of all edges in the graph is denoted $E$.
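As a hedged sketch of this grouping step, the Python below bins the orthogroups of $W$ by their chromosome pair and builds an edge list for each bin. It assumes that every pair of orthogroups sharing a chromosome pair is joined by an edge (a complete graph per bin), which the text does not state explicitly; the function name `chromosome_pair_edges` and the toy data are likewise illustrative.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def chromosome_pair_edges(
    W: Dict[str, Dict[str, List[str]]], s1: str, s2: str
) -> Dict[Tuple[str, str], List[Edge]]:
    """Group orthogroups by (chromosome in s1, chromosome in s2), then
    connect every pair of orthogroups within a group (assumed complete graph)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for og_id, og in W.items():
        p = og[s1][0]  # safe: all genes of an orthogroup in W share one chromosome
        q = og[s2][0]
        groups[(p, q)].append(og_id)
    return {pq: list(combinations(sorted(ids), 2)) for pq, ids in groups.items()}


# Hypothetical toy subset W for illustration only.
W = {
    "OG1": {"s1": ["chr1"], "s2": ["chrA"]},
    "OG2": {"s1": ["chr1", "chr1"], "s2": ["chrA"]},
    "OG5": {"s1": ["chr1"], "s2": ["chrA"]},
    "OG6": {"s1": ["chr2"], "s2": ["chrB"]},
}

E = chromosome_pair_edges(W, "s1", "s2")
for pq, edges in sorted(E.items()):
    print(pq, edges)
```

A chromosome pair holding a single orthogroup, such as (chr2, chrB) here, contributes no edges.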
We designate an identity function to test whether two chromosome IDs $x_i$ and $x_j$ are the same. For example, we later use this function to see if two orthogroups $W_i$ and $W_j$ exist on the same chromosome in species $s_k$.
For a given edge $(W_i, W_j)$ in the set of edges $E_{p,q}$, and a species $s_k$ in $s_3, \ldots, s_{|s|}$, we quantify whether the orthogroups $W_i$ and $W_j$, which are present on the same chromosomes in species $s_1$ and $s_2$, are present on the same chromosome in species $s_k$. This value is returned by the function $c(W_i, W_j, s_k)$.
To quantify the changes in chromosome position that may have occurred during the evolutionary history of orthogroups, and because orthogroups sometimes have multiple sequences for a single species, we calculate the value of $c(W_i, W_j, s_k)$ in the range $[0, 1]$. The value $c(W_i, W_j, s_k) = 0$ means that none of the sequences in the orthogroups $W_i$ and $W_j$ exist on the same chromosome in species $s_k$. Conversely, the value $c(W_i, W_j, s_k) = 1$ means that all of the sequences in the orthogroups $W_i$ and $W_j$ exist on the same chromosome in species $s_k$.
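One way to obtain a score with these endpoints is sketched below. It assumes, as an illustration only, that the score is the fraction of cross-orthogroup gene pairs that share a chromosome in $s_k$; the exact formula is not recoverable from the text here. The names `same_chromosome` and `c`, the dictionary layout, and the toy data are hypothetical, as is the choice to return `None` when either orthogroup is absent from $s_k$.

```python
from itertools import product
from typing import Dict, List, Optional


def same_chromosome(x_i: str, x_j: str) -> int:
    """Identity test on chromosome IDs: 1 if identical, else 0."""
    return 1 if x_i == x_j else 0


def c(W_i: Dict[str, List[str]], W_j: Dict[str, List[str]], s_k: str) -> Optional[float]:
    """Assumed score: fraction of (gene in W_i, gene in W_j) pairs that lie
    on the same chromosome in species s_k. None if either orthogroup is absent."""
    chroms_i = W_i.get(s_k, [])
    chroms_j = W_j.get(s_k, [])
    if not chroms_i or not chroms_j:
        return None
    pairs = list(product(chroms_i, chroms_j))
    return sum(same_chromosome(a, b) for a, b in pairs) / len(pairs)


# Hypothetical orthogroups: species -> chromosome ID of each gene copy.
W_i = {"s3": ["c1"], "s4": ["c5", "c5"]}
W_j = {"s3": ["c1", "c2"], "s4": ["c5"]}

print(c(W_i, W_j, "s3"))  # one of two pairs shares a chromosome
print(c(W_i, W_j, "s4"))  # every pair shares a chromosome
print(c(W_i, W_j, "s5"))  # neither orthogroup is present in s5
```

Under this assumption the score is 0 exactly when no pair of sequences shares a chromosome and 1 exactly when all sequences of both orthogroups sit on one chromosome, matching the two endpoints described above.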
Because we wish to analyze the conservation of gene organization between species $s_1$ and $s_2$ relative to many other species, we calculate the gene pair conservation scores, $c(W_i, W_j, s_k)$, for the species in $[s_3, \ldots, s_{|s|}]$. We chose to summarize the results of ortholog-ortholog conservation across many species with the median of those values. The median is robust against outliers in a scenario in which the gene pair is conserved on single chromosomes in most, but not all, species in $[s_3, \ldots, s_{|s|}]$. Therefore, the expression of the gene colocalization conservation score of two orthogroups $W_i$ and $W_j$ is:
$$C((W_i, W_j)) = \operatorname{median}\{\, c(W_i, W_j, s_k) : s_k \in [s_3, \ldots, s_{|s|}] \,\}$$
From the measurement of ortholog-ortholog conservation on a single chromosome in many species, we can also estimate a measure of conservation of a single ortholog, $C(W_i)$, in the context of all genes located on the same chromosome pair in species $s_1$ and $s_2$. This measure approximates the percentage of genes in a single chromosome pair that are conserved.
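The effect of summarizing with a median rather than a mean can be seen in a small sketch. The function name `pair_conservation`, the per-species scores, and the decision to skip species in which a score is undefined (`None`) are assumptions made for this illustration.

```python
from statistics import mean, median
from typing import Dict, Optional


def pair_conservation(scores_by_species: Dict[str, Optional[float]]) -> Optional[float]:
    """Median of the per-species scores c(W_i, W_j, s_k) over the comparison
    species, skipping species where the score is undefined (assumption)."""
    values = [v for v in scores_by_species.values() if v is not None]
    return median(values) if values else None


# Hypothetical per-species scores for one orthogroup pair:
# colocalized in most comparison species, separated in one.
scores = {"s3": 1.0, "s4": 1.0, "s5": 0.0, "s6": 1.0, "s7": 0.5, "s8": None}

print(pair_conservation(scores))                           # median of the defined scores
print(mean(v for v in scores.values() if v is not None))   # mean, for comparison
```

Here the single species in which the pair is separated lowers the mean but leaves the median at 1.0, which is the robustness property described above.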

Significance testing
The measures $C((W_i, W_j))$ and $C(W_i)$ are the two measures of gene-gene colocalization conservation introduced in the preceding section. Either of these two measures can now be used in Fisher's exact test to test for the significance of macrosyntenic relationships between two species.
To test for significance using single gene-gene conservation scores, we apply Fisher's exact test to the counts of gene-gene conservation scores $C((W_i, W_j))$ that are greater than or equal to a threshold $t$. The threshold $t$ lies in the same range as the conservation scores: $[0, 1]$. The Fisher's exact test table is constructed as shown below. We use the notation $\neg p$ to refer to all chromosomes of $s_1$ except $p$. Likewise, $\neg q$ refers to all chromosomes of $s_2$ except $q$.
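| | Inside chromosome $p$ of $s_1$ | Outside chromosome $p$ of $s_1$ |
|---|---|---|
| Inside chromosome $q$ of $s_2$ | $\lvert\{e \in E_{p,q} \mid C(e) \geq t\}\rvert$ | $\lvert\{e \in \bigcup E_{\neg p,q} \mid C(e) \geq t\}\rvert$ |
| Outside chromosome $q$ of $s_2$ | $\lvert\{e \in \bigcup E_{p,\neg q} \mid C(e) \geq t\}\rvert$ | $\lvert\{e \in \bigcup E_{\neg p,\neg q} \mid C(e) \geq t\}\rvert$ |

The resulting Fisher's exact test p-value is Bonferroni-corrected by multiplying it by the number of chromosome pairs tested, that is, the number of chromosomes in $s_1$ times the number of chromosomes in $s_2$. This method of significance testing may be highly sensitive in that it only rewards conservation across many species.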
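The test on one chromosome pair can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' implementation: it uses a one-sided (enrichment) alternative, which the text does not specify, and the helper names, the threshold, the edge scores, and the chromosome counts are all hypothetical. In practice a library routine such as `scipy.stats.fisher_exact` could be used instead.

```python
from math import comb
from typing import List


def count_at_least(scores: List[float], t: float) -> int:
    """Number of edges whose conservation score C(e) is >= t."""
    return sum(1 for s in scores if s >= t)


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]]:
    probability of a top-left count >= a with all margins fixed."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def bonferroni(p: float, n_chrom_s1: int, n_chrom_s2: int) -> float:
    """Multiply by the number of chromosome pairs tested, capped at 1."""
    return min(1.0, p * n_chrom_s1 * n_chrom_s2)


# Hypothetical edge scores C(e) for the four cells of the table.
t = 0.75
E_p_q = [1.0] * 8 + [0.2, 0.1]
E_notp_q = [1.0, 0.9, 0.3]
E_p_notq = [0.8, 0.0, 0.1]
E_notp_notq = [1.0] * 5 + [0.5] * 4

a = count_at_least(E_p_q, t)
b = count_at_least(E_notp_q, t)
c = count_at_least(E_p_notq, t)
d = count_at_least(E_notp_notq, t)

p_raw = fisher_exact_greater(a, b, c, d)
p_adj = bonferroni(p_raw, n_chrom_s1=3, n_chrom_s2=4)  # assumed chromosome counts
print((a, b, c, d), p_raw, p_adj)
```

With these toy numbers the raw p-value falls below 0.05 but the corrected one does not, which illustrates how quickly the correction grows with the number of chromosome pairs.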
Applying Fisher's exact test to the ortholog-network-level conservation $C(W_i)$ measures gene conservation in the context of the entire chromosome, and is therefore more sensitive to derived chromosome breaks in $s_1$ or $s_2$.