Comparative analyses identify genomic features potentially involved in the evolution of birds-of-paradise

Abstract The diverse array of phenotypes and courtship displays exhibited by birds-of-paradise have long fascinated scientists and nonscientists alike. Remarkably, almost nothing is known about the genomics of this iconic radiation. There are 41 species in 16 genera currently recognized within the birds-of-paradise family (Paradisaeidae), most of which are endemic to the island of New Guinea. In this study, we sequenced genomes of representatives from all five major clades within this family to characterize genomic changes that may have played a role in the evolution of the group's extensive phenotypic diversity. We found genes important for coloration, morphology, and feather and eye development to be under positive selection. In birds-of-paradise with complex lekking systems and strong sexual dimorphism, the core birds-of-paradise, we found Gene Ontology categories for “startle response” and “olfactory receptor activity” to be enriched among the gene families expanding significantly faster compared to the other birds in our study. Furthermore, we found novel families of retrovirus-like retrotransposons active in all three de novo genomes since the early diversification of the birds-of-paradise group, which might have played a role in the evolution of this fascinating group of birds.

Page 12 Line 22 -I think the "Other positively selected genes and enriched GO categories" section could be much reduced. Include the function of genes in the list of genes under positive selection rather than including that in the main text here.
We deleted this section, as we removed the GO enrichment from the analysis.
Supplementary Table 3 -do intron size and number really not vary between the genomes?
We added that we rounded the values. They do vary, but only marginally. In general, bird genomes are highly conserved.
Reviewer #2: This is an interesting study that looks at the genomics of five different species of bird of paradise. These are extremely interesting, and under-studied, set of species (as the authors note), and the initial sequencing of these birds can certainly help to resolve the phylogeny surrounding them.
The study itself has two main aspects -firstly to resolve the phylogeny of these birds by sequencing three (and resequencing two) species, and secondly to attempt to identify the genes that underlie the range of different phenotypes that are exhibited by this set of species. The phylogeny resolution results and the conclusions drawn from them are sound and not over-interpreted. The major issues with this manuscript concern the search for positive selection and the use of GO terms to extrapolate the genes underlying the various phenotypes.
We removed the GO enrichment for the positive selection analysis, and carried out FDR correction.

Major Issues 1)
In the search for positively selected genes the authors identify 178 genes using a nominal p-value significance if I understand their methods correctly, as they state the distribution of p-values gave an excess of non-significant p-values so therefore felt it was not appropriate to use a bonferonni or other multiple testing correction. Given that they test 8133 genes, some form of multiple testing correction is definitely required. Permutation testing seems the most obvious approach as that will be robust against non-normal distributions (see Nielsen, RCD et al, 2005, A scan for positively selected genes in the genomes of humans and chimpanzees. PloS Biology 3:723-733 for one example, but I am sure may others exist).
It was Rasmus Nielsen, first author of the mentioned paper, that told us not to perform any multiple-testing corrections in the first place. After further discussions, we decided to carry out a FDR correction, which should at least be less biased than methods like bonferroni. This is a heavily debated topic and it will be interesting to see if people will come up with better methods to test the reliability of positive selection signals.

2)
In terms of using this technique (dN/dS ratios) to identify single genes that are being positively selected, this technique is more robust when all the genes in a genome are considered more to identify overall patterns (see Hartl and Clark 'Principles of Population Genetics'). By only having one representative per species as in this case, it should be mentioned that some of the variation present for that species may be being overlooked, which can bias the estimate accordingly.
We acknowledge that methods based on multiple individuals such as MKT, or QTL methods are more accurate and powerful. However, getting even one sample is a huge endeavor for some/most birds-of-paradise species and getting multiple samples will in many cases be almost impossible. In order to avoid some of the issues associated with using one individuals per species, we only performed positive selection tests on the branch leading to the birds-of-paradise, as we do have at least five representatives for this branch. 3) The use of genes with some signature of positive selection as a means of determining genes underlying phenotypic traits is almost certainly over-extrapolated here. By taking a laundry list of genes and then picking potentially interesting putative functions and ascribing them to different traits without further proof is over-speculation (e.g. collagen and extracellular matrix genes are identified, but these genes can potentially affect a gamut of different functions, the authors then say that they are possibly affecting feather formation, craniofacial development, etc). As a further example, the authors state structural colours as being typified by these birds, and that some of these genes are potentially related to structural (and other) colouration. However, no actual genes for structural colours have been identified that I am aware of, so it is unlikely that any genes identified by previous means are the exact ones that control structural colour variation. In general the traits the authors state are of interest are complex quantitative traits, so the QTL controlling the inter-specific variation are most likely in non-coding regions, and depending on the linkage disequilibrium in the genomes of these birds, easily missed in any case when only considering coding regions. Therefore without additional analysis this speculation over gene function should be drastically reduced.
For reasons mentioned above, we had to use simple single-individual comparative genomics methods. We toned down the inferences based on the results and added a sentence on the issues related to comparative genomics methods. We further highlighted that some of the most interesting traits are most likely complex traits and will require more in depth analyses. In general, we only discussed genes which had known and tested functions to avoid over interpretation of the results.

4)
Similarly, looking at GO enriched terms can also easily be extrapolated. For example, in a study by Pavlidis, P. et al. (A critical assessment of storytelling: GO categories and the importance of validating genomic scans. Mol Biol Evol 29 (10):3237-3248, 2012), a set of selective sweeps were generated at random in a genome. In each instance, the genes were then analysed for GO enrichment and plausible functions ascribed in each case. This means that GO terms should be used as a complement to, rather than as the basis for the whole study.
The mentioned paper mostly shows that it is possible to get false positive selection signals from neutral data. However, we do acknowledge that GO enrichment has flaws and that there is no statistical framework to test any results, and so we removed it from the positive selection analysis.
In summary, the novel aspects of the sequencing of these genomes is very interesting, their use to resolve the phylogeny is strong, and they are an important step in being able to bring the full power of modern genomics to bear on these species. However, the use of just single representatives of each species to then attempt to prescribe putative phenotypic functions and relationships between the genes that are identified purely based on dN/dS ratios and then speculations regarding the genes identified is highly speculative and undermines the study as it currently is. See above.
Reviewer #3: In this manuscript, Prost et al. perform whole-genome sequencing of five species of birds-of-paradise. Three of these are assembled de-novo, while the other two are mapped to one of the de-novo assembles genomes; all five are subsequently annotated. Special attention is paid to the evolution of repeats, while other analyses include phylogenetic relationships among the birds-of-paradise, the detection of positive selection and gene family size evolution. Genes inferred to have evolved under positive selection, as well as those being among rapidly evolving gene families, were next subjected to Gene Ontology tests, and a substantial part of the Discussion relates to specific genes and gene families whose evolution may have been relevant to that of birds-of-paradise. I think the paper contains appropriate and solid analyses, and only have relatively Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation minor comments. I do not find the results of the GO and related analyses particularly arresting and thought the Discussion was at parts quite dull while discussing gene after gene, but I realize the limitations of what one can do in genome papers like this one -I think that the authors overall strike a reasonable balance, and do not make unwarranted statements (except regarding TE evolution -see below). I would consider the overall writing somewhat subpar, as the paper contains relatively many errors and some sloppy, uncareful writing -some instances are pointed out in the specific comments below. Generally, I would recommend a careful rereading of the manuscript by the authors.
We shortened the discussion significantly and removed the GO enrichment analysis for the positive selection tests. Furthermore, we toned down the interpretation of the TE results.
Specific comments: *I would suggest to make Table S1 into Table 1, i.e. to put it in the main part of the paper, and potentially to even combine it with Table S3 (and put the resulting table in the main part of the paper). The genome assembly and annotation is the backbone of this paper.
Done. * p. 3, l. 30-34: "Fortunately, the current revolution in sequencing technologies and laboratory methods does not only enable us to sequence whole-genome data from non-model organisms, but it also allows us to harvest genome information from specimens in museum collections [10]." -This point, made in the Introduction, is not being returned to elsewhere in the paper, as far as I could see. I would appreciate it if the authors could make clear in Methods, Results and/or Discussion what it is, specifically, in recent technology/methods that facilitated the use of museum specimens in this manuscript, and perhaps also discuss the promise/limitations/future developments in this area.
We hope, we clarified this enough in the text. * Discussion section "Repeats and their possible role in the evolution of birds-ofparadise" (page 6-7). I find this section quite speculative and insufficiently carefully worded. -First, "A growing body of literature is emerging" -this sentence only cites a 9 year-old and a 10 year-old paper, which is not very convincing support of that statement.
We removed this sentence.
-Second, to see if the paper cited in the next sentence (ref. 30, which is from 2011) would provide an additional and more recent reference supporting the statement in the previous sentence, I had a look at it. It is cited as "Bursts of TE activity are often lineage and species-specific, which highlights their potential role in speciation", but the words "TE", "transposable", or even "repeat" or not to be found in this paper. Please clarify/change accordingly.
We cited the wrong paper by accident.
-"The timing [of a burst of TE activity] fits the emergence and radiation of birds-ofparadise". The fact that the emergence (ca. 35 ma) and radiation (I guess somewhere between 10 and 20-25 ma) are mentioned here already indicates that this statement is quite general, and I think a similar statement could have been made if the burst of TE activity fell anywhere between 10 and about ma 40 ago. The question thus is how meaningful this coincidence in timing really is, especially considering that, presumably, the likely statistical inference of such TE bursts may restricted to a certain time window (?).
We changed this section to address the reviewers comments.

Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation
-"It is thus likely that the diversification of birds-of-paradise was influenced by lineage specificity of their TE repertoires through retroviral germline invasions and smaller activity bursts". An in my view completely unwarranted statement, mainly due to the qualifier "likely". Instead, I would suggest a qualifier along the lines of "An intriguing hypothesis that warrants further investigation".
Changed. * In the discussion of genes and GO categories, please state more clearly in which GO test the genes in question were flagged: positive selection or gene family size evolution.
We removed the GO enrichment for the positive selection analysis. * p. 6, l. 34-36: "Dale and colleagues recently showed that sexual selection on male ornamentation in birds has antagonistic effects, where male coloration is increasing, while females show a strong reduction in coloration [37]. This is very apparent in polygynous core birds-of-paradise, where females between species and sometimes even between genera look highly similar." Increasing and decreasing "coloration"? "Ornamentation" would be a better choice of word. Also, a reduction in ornamentation/coloration is not the same thing as a reduced tempo of evolution, which is clearly referred to in the second sentence. Thus, expand/clarify. We changed "coloration" to "ornamentation". We only discuss phenotypes, not the tempo of their evolution in the second sentence.
* Methods, Gene Annotation section: "We did not train the de novo gene predictor Augustus [109] because no training data set for birds was available." Please expandnone of the gene sets from previously published bird genomes can be used?
We added a sentence, why we did not carry out the training at the time, which was decided after a discussion with Mario Stanke, the author of Augustus. * Methods, Phylogenetic Analysis section: "We used the individual exon phylip files for gene tree reconstruction" -Why were only single exons used for each gene? I worry whether these were sufficiently informative, especially considering the high discordance among gene trees. * Methods, Intron Calling section: "To do so we used the extract_intron_gff3_from_gff3.py script (https://github.com/irusri/Extract-intron-from-gff3) to include intron coordinates into the gff file. We then parsed out all intron coordinates and extracted the intron sequences from the genomes using the exttract_seq_from_gff3.pl script (https://github.com/irusri/Extract-intron-from-gff3)." -The same script is referenced twice in the actual link, but with a different name. Furthermore, I assume "exttract_seq_from_gff3.pl" has a typo ("exttract").
Both scripts can be found using the same link. Furthermore, the "exttract" is a typo by the author of the script, so we did not correct it as this is the actual name of the script. *Methods, Inference of Positive Selection section: "due to the fact that branch-site tests result in an access of non-significant p-values" -Change "access" to "excess".
Changed. The spectacular morphological and behavioral diversity found in birds-of-paradise (Paradisaeidae) form one of the most remarkable examples in the animal kingdom of traits that are thought to have evolved via forces of sexual selection and female choice. The family is comprised of 41 recognized species divided into 16 genera [2], all of which are confined to the Australo-Papuan realm and the Moluccas islands (Indonesia). The birds-of-paradise have adapted to a wide variety of habitats ranging from tropical lowlands to high-altitude mountain forests [3] and in the process acquired a diverse set of morphological traits. Some species are sexually monomorphic and crow-like in appearance with simple mating systems, whereas others have complex courtship behaviors and display strong sexual dimorphism with males exhibiting elaborate feather ornaments that serve as secondary sexual traits [3]. As such, strong sexual and natural selection have likely acted in concert to produce the exquisite phenotypic diversity among members the Paradisaeidae.
While having attracted substantial attention from systematists for centuries, the evolutionary processes and genomic mechanisms that have shaped these phenotypes remain largely unknown. In the past, the evolutionary history of birds-of-paradise has been studied with morphological data [1], molecular distances [4,5], and a single mitochondrial gene [6], but the conclusions have been largely incongruent. The most comprehensive phylogenetic study at present includes all 41 species and is based on DNA-sequence data from both mitochondrial (cytochrome B) and nuclear genes (ornithine decarboxylase introns ODC6 and ODC7) [7]. This study suggested that the birds-of-paradise started to diverge during late Oligocene or early Miocene and could be divided into five main clades. The sexually monomorphic genera Manucodia, Phonygammus, and Lycocorax form a monophyletic clade (Clade A; Fig. 1 in Irestedt et al. [7]), which was suggested to be sister to the other four clades that include species, many of which show strong sexual dimorphism (here referred to as "core birds-of-paradise"). Among the latter four clades, the genera Pteridophora and Parotia were suggested to form the earliest diverging clade (Clade B; Fig. 1 in Irestedt et al. [7]), followed by a clade consisting of the genera Seleucidis, Drepanornis, Semioptera, Ptiloris, and Lophorina (Clade C; Fig. 1 in Irestedt et al. [7]). The last two sister clades are comprised of Epimachus, Paradigalla, and Astrapia (Clade D; Fig.  1 in Irestedt et al. [7]), and Diphyllodes, Cicinnurus, and Paradisaea (Clade E; Fig. 1 in Irestedt et al. [7]), respectively. Overall, the phylogenetic hypothesis presented in Irestedt et al. [7] receives strong branch support (posterior probabilities), but several nodes are still weakly supported and there is incongruence among gene trees. Recently, Irestedt and colleagues [8] and Scholes and Laman [9] argued for the Superb birds-of-paradise to be split into several species, based on genetics, morphology and courtship behavior. Thus, while preliminary genetic analyses have outlined the major phylogenetic divisions, the interspecific relationships of birds-ofparadise remain largely unresolved.
Birds-of-paradise are most widely known for their extravagant feather types, coloration and mating behaviors [3] . In addition, they also exhibit an array of bill shapes (often specialized to their respective foraging behavior), body morphologies, and sizes [3]. Ornament feather types include 'wire-type' tail feathers (e.g. twelve-wired bird-of-paradise (Seleucidis melanoleuca)), erectile head plumes (e.g. king of Saxony bird-of-paradise (Pteridophora alberti)), significantly elongated tail feathers (e.g. ribbon-tailed Astrapia (Astrapia mayeri)) or finely-filamental flank plumes (e.g. lesser bird-of-paradise (Paradisaea minor); see Frith and Beehler [3]). Feathers and coloration are crucial components of their mating displays. Polygynous birds-of-paradise show aggregated leks high in tree tops, less aggregated leks on lower levels or the forest floor (often exploded leks), and even solitary mating displays [3].
Coloration in birds-of-paradise involve both pigment-based and structural mechanisms. Coloration via pigmentation is achieved by pigment absorption of diffusely scattered light in a specific wavelength range. Pigments such as carotenoids are frequently associated with red and yellow hues in birds, whereas light absorption by various classes of melanin pigments give rise to black plumage features common in many birds-of-paradise species [10]. On the other hand, structural coloration is caused constructive interference and light reflection from quasiordered spongy structures of the feather barbs and melanosomes in feather barbules [11,12]. The plumages of male birds-of-paradise feature both coloration types to various degrees, and some species such as the Lawe's parotia (Parotia lawesii) use angular-dependent spectral color shifts of their structural feathers in their elaborate display rituals to attract females [13,14] Dale and colleagues recently showed that sexual selection on male ornamentation in birds has antagonistic effects, where male ornamentation is increasing, while females show a strong reduction in ornamentation [15]. This is very apparent in polygynous core birds-of-paradise, where females between species and sometimes even between genera look highly similar.
In the current study, we made use of these technological advances to generate de novo genomes for three birds-of-paradise species from fresh samples and re-sequenced the genomes of two other species from museum samples. Using these genomes, we were able to contrast the trajectory of genome evolution across passerines and simultaneously evaluate which genomic features have evolved during the radiation of birds-of-paradise. We identified a set of candidate genes and genomic features that might have contributed to the extraordinary diversity in phenotypic traits found in birds-of-paradise.

Assembly and gene annotation
We de novo assembled the genomes of Lycocorax pyrrhopterus, Ptiloris paradiseus and Astrapia rothschildi using paired-end and mate pair Illumina sequence data, and performed referencebased mapping for Pteridophora alberti and Paradisaea rubra. Scaffold N50 ranged from 4.2 Mb (L. pyrrhopterus) to 7.7 Mb (A. rothschildi), and the number of scaffolds from 2,062 (P. paradiseus) to 3,216 (L. pyrrhopterus; Table 1). All assemblies showed a genome assembly size around 1 Gb (1.03 to 1.07 Gb, see Table 1). BUSCO2 [21] scores for complete genes (using the aves_odb9 database) found in the respective assemblies ranged from 93.8% to 95.1%, indicating a high completeness (Supplementary Table S1). Next, we annotated the genomes using homology to proteins of closely related species as well as de novo gene prediction. Gene numbers ranged from 16,260 (A. rothschildi) to 17,269 (P. paradiseus; see Supplementary Table  S2).

Genome synteny to the collared flycatcher
We found strong synteny of the three de novo assembled birds-of-paradise genomes (Lycocorax pyrrhopterus, Ptiloris paradiseus, Astrapia rothschildi) to that of the collared flycatcher ( Fig. 2 and Supplementary Figures S1-3). Only few cases were found where individual scaffolds of the birds-of-paradise genomes mapped to multiple chromosomes in the collared flycatcher genome.

Phylogeny
We found 4,656 one-to-one orthologous genes to be present in all eight sampled bird genomes (5 birds-of-paradise and 3 outgroup songbirds). A phylogeny inferred using these orthologs shows a topology with high bootstrap scores ( Fig. 3 and Supplementary Figure S4). However, the sole use of bootstrapping or Bayesian posterior probabilities in analyses of large scale data sets has come into question in recent years [25] . Studies based on genome-wide data have shown that phylogenetic trees with full bootstrap or Bayesian posterior probability support can exhibit different topologies (e.g. Jarvis et al. [26] and Prum et al. [27]; discussed in Suh [25]). Thus, next we performed a concordance analysis by comparing gene trees for the 4,656 singlecopy orthologs to the inferred species topology. We find strong concordance for the older splits in our phylogeny (see Supplementary Figure S4). However, the splits between Ptiloris and its sister clade, which contains Astrapia and Paradisaea, and the split between Astrapia and Paradisaea itself showed much lower concordance values, 0.31 and 0.26, respectively. Only ~10% of the gene trees exactly matched the topology of the inferred species tree and we find an average Robinson-Foulds distance of 3.92 for all gene trees compared to the species tree (Supplementary Table S5). A Robinson-Foulds distance of 0 would indicate that the two tree topologies (species to gene tree) are identical. The highest supported species tree topology ( Fig. 3 and Supplementary Figure S4) is concordant to the birds-of-paradise species tree constructed in Irestedt et al. 2009 [7]. Overall, we find that the birds-of-paradise form a monophyletic clade, with the crow (Corvus cornix) being the most closely related sister taxon, in most gene trees (74%). Within the birds-of-paradise clade, we further distinguish a core birds-of-paradise clade, which consist of 4 of the 5 species in our sample (excluding only the paradise crow, Lycocorax pyrrhopterus).

Positive selection in the birds-of-paradise
We carried out positive selection analyses using all previously ascertained orthologous genes (8,134 genes present in at least seven out of the eight species) on the branch leading to the birds-of-paradise. First, we investigated saturation by calculating pairwise dN/dS ratios. The inferred values did not show any signs of saturation (Supplementary Table S6). To infer positive selection on the branch of the birds-of-paradise, we used the BUSTED model in HyPhy (similar to branch-sites model; [28]). We found 213 genes to be under selection (p value < 0.05; gene symbol annotation for 212 of the 213 genes can be found in Supplementary Table S7). Multipletesting correction was carried out using FDR in R (function p.adjust() from the stats package).

Gene gain and loss
We identified 9,012 gene families across all 8 species. Using CAFE [29] we inferred 98 rapidly evolving families within the birds-of-paradise clade. Supplementary Table S8 summarizes the   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 gene family changes for all 8 species (also see Fig. 3). Zebra finch had the highest average expansion rate across all families at 0.0916, while the hooded crow had the lowest average expansion rate at -0.1724, meaning that they have the most gene family contractions. Gene gain/loss rates can be found in Supplementary Table S9. Next, we tested for enrichment of GO terms in the set of families rapidly evolving in the birds-of-paradise clade. Gene families were assigned GO terms based off the Ensembl GO predictions for flycatcher and zebra finch. In all, we were able to annotate 6,350 gene families with at least one GO term. Using a Fisher's exact test on the set of 98 rapidly evolving families in the birds-of-paradise, we find 25 enriched GO terms in 20 families (FDR 0.05; Supplementary Table S10). All the gene gain and loss results can be found online (https://cafe-portal.gitlab.io/bop/).

Discussion
Renowned for their extravagant plumage and elaborate courtship displays, the birds-of-paradise are among the most prominent examples of how sexual selection can give rise to extreme phenotypic diversity. Despite extensive work on systematics and a long-standing interest in the evolution of their different mating behaviors, the genomic changes that underlie this phenotypic radiation have received little attention. Here, we have assembled representative genomes for the five main birds-of-paradise clades and characterized differences in genome evolution within the family and relative to other avian groups. We reconstructed the main structure of the family phylogeny, inferred the TE (transposable element) landscape and identified a list of genes under selection and gene families significantly expanded or contracted potentially involved in many phenotypic traits for which birds-of-paradise are renowned. Below, we discuss these different genomic features and how they might have contributed to the evolution of birds-of-paradise.

Genome synteny and phylogeny
We found genome synteny (here in comparison to the collared flycatcher [30]) to be highly conserved for all three de novo assembled genomes ( Fig. 2 and Supplementary Figures S1-3). Only few cases were recorded where regions of scaffolds of the birds-of-paradise genomes aligned to different chromosomes of the collared flycatcher. These could be artifacts of the genome assembly process or be caused by translocations. Passerine birds show variable numbers of chromosomes (72-84 [22]). However, they do not vary as much as other groups', such as Charadriiformes (shorebirds, 40-100 [22]). In general, studies have shown a high degree of genome synteny even between Galloanseres (galliform and anseriform birds) and Neoaves (most other birds including passerines; approx. 80-90 mya divergence; reviewed in Ellegren [31]; Poelstra et al. [17]). However, genomes with higher continuity, generated with long-read technologies or using long-range scaffolding methods (such as Hi-C [32]), or a combination thereof, will be needed to get a more detailed view of rearrangements in birds-of-paradise genomes (see Peona et a. [33]).  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 Our analyses reconstructed a phylogenetic tree topology congruent with the one presented in Irestedt et al. [7] (Fig. 3 and Supplementary Figure 4). However, while bootstrapping found strong support for the topology of the species tree, congruence analysis found high discordance for the two most recent branches (Ptiloris and its sister clade (Astrapia and Paradisaea) and the split between Astrapia and Paradisaea). Furthermore, we found the highest supported tree topology to be based on only 10% of all gene trees (Supplementary Table S5). This could be caused by incomplete lineage sorting (ILS), which refers to the persistence of ancestral polymorphisms across multiple speciation events [34]. Jarvis et al. [26] and Suh et al. [35] showed that ILS is a common phenomenon on short branches in the avian Tree of Life. Another possibility could be hybridization, a phenomenon frequently recorded in birds-of-paradise [3]. Overall, most gene tree topologies (74%) support the monophyly for the birds-of-paradise and the core birds-of-paradise clades.

Repeats and their possible role in the evolution of birds-of-paradise
Bursts of TE activity are often lineage-or species-specific, which highlights their potential to affect speciation [30]. This is further supported by the fact that TE activity bursts often correlate with the speciation timing of the respective species or species group [30]. Similarly, we find a burst of LINE activity within all three de novo assembled genomes (Lycocorax pyrrhopterus, Ptiloris paradiseus, Astrapia rothschildi) dating back to about 24 mya (Fig. 1). The timing roughly corresponds to or slightly predates the emergence and radiation of birds-of-paradise (see Fig  3). The notion that we found 16 novel families of retroviral LTRs suggests multiple recent germline invasions of the birds-of-paradise lineage by retroviruses. The recent cessation of activity of CR1 LINEs and instead recent activity of retroviral LTRs (Fig. 1) is in line with similar trends in collared flycatcher and hooded crow [22,24]. This suggests that recent activity of retroviral LTRs might be a general genomic feature of songbirds, however, with different families of retroviruses being present and active in each songbird lineage. Thus, an intriguing hypothesis that warrants further investigation is that the diversification of birds-of-paradise (and probably songbirds in general) was influenced by lineage specificity of their TE repertoires through recurrent retroviral germline invasions and smaller activity bursts.

Coloration, feather and skeletal development in birds-of-paradise
Given the strong sexual dimorphism and array of morphological phenotypes found in birds-ofparadise and their important role in mating success, we would expect genes important for coloration, morphology and feather structure to be under selection in the birds-of-paradise. Indeed, we found several genes potentially involved in these phenotypic traits to show signatures of positive selection. However, most of them were non-significant after the multiple-testing correction (see Supplementary Table S7). Given that these corrections are often very conservative, these genes might still warrant more detailed follow-up investigations. One such gene is ADAMTS20, which is crucial for melanocyte development. ADAMTS20 has been shown to 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 cause white belt formation in the lumbar region of mice [36]. Nonsense or missense mutations in this gene disrupt the function of KIT, a protein that regulates pigment cell development [36]. In mammals and birds, pigment patterns are largely influenced by melanocytes. Thus, this gene could be a strong candidate for differential coloration in the birds-of-paradise. Another gene under positive selection with a potential role in coloration is ATP7B. It is a copper-transporting Ptype ATPase and thought to translate into a melanosomal protein (see Bennett and Lamoreux [37] for a review). Copper is crucial for melanin synthesis because tyrosinase contains copper and thus ATP7B might play a crucial role in pigment formation. Candidate genes for positive selection, which are known to have functions in feather development (in chicken) and morphology (in chicken and mammals) include FGR1, SPECC1L, BMPR1A, GAB2, PAPSS2, DCST2, ALDH3A2, MYF5 and APOBEC2. For example, FGFR1 (Fibroblast growth factor receptor 1) is implicated in feather development [38]. In humans it has further been shown to be involved in several diseases associated with craniofacial defects (OMIM; http://www.ncbi.nlm.nih.gov/omim/). ALDH3A2 (aldehyde dehydrogenase 3 family member A2), a membrane-associated protein and SPECC1L are implicated in craniofacial disorders (e.g. Van Der Woude Syndrome) in humans [39]. APOBEC2 seems to play a role in muscle development (skeletal and heart muscle) in chickens [40].

APOBECs and their potential role in the immune system
Intriguingly, APOBECs have also been shown to have important functions in the immune systems of vertebrates, where they act as restriction factors in the defense against a range of retroviruses and retrotransposons [41,42]. Functioning as cytosine deaminases they act against endogenous retroviruses (ERVs), especially Long terminal repeat retrotransposons (LTR) by interfering with the reverse transcription and by hypermutating retrotransposon DNA. A recent study on 123 vertebrates showed that birds have the strongest hypermutation signals, especially oscine passerines [43] (such as zebra finch and medium ground finch). This study also demonstrated that edited retrotransposons may preferentially be retained in active regions of the genome, such as exons and promoters because hypermutation decreases their potential for mobility. Thus, it seems very likely that retrotransposon editing via APOBECs has an important role in the innate immunity of vertebrates as well as in genome evolution. In line with this hypothesis, we found a burst in recent activity of retroviral LTRs in the genomes of birds-ofparadise, a signal also found in other passerines [22,24]. This could also explain why we found APOBEC2 to be under positive selection.

Visual system
Genes that showed positive selection signals and are known to have known roles in eye function and development include CABP4, NR2E3, IMPG1, AKAP13, MGARP, GNB1, ATP6AP2 and MYOC, of which the latter three remained significant after multiple-testing correction. Interestingly, there are no obvious explanations for selection on vision in birds-of-paradise. Evidence for co-evolution between coloration and vision in birds is weak (see e.g. Lind et al. [44], Price [45], but see Mundy et al. [46] and Bloch [47]). A phenotype that could be associated with selection on vision is the diverse array of mating displays in some core birds-of-paradise. Many species, such as Lawes's parotia (Parotia lawesii) modify color by changing the angle of the light reflection [13,14], which requires the visual system to be able to process the fine nuances of these color changes. However, the fact that (color) vision serves many purposes (including e.g. foraging, etc.) makes it very difficult to establish co-evolution between coloration and color vision [44]. We can thus only speculate at this point about the potential role of coloration or mating displays in the selection of vision genes found in birds-of-paradise.

Olfactory system
Another often overlooked sensory system in birds is odor perception. Olfactory receptors (ORs) are important in odor perception and detection of chemical cues. In many animal taxa, including birds, it has been shown that olfaction is crucial to identify species [48], relatedness [49], individuals [50], as well as for mate choice [51] and in foraging [52]. In concordance with previous studies, we found this gene family to expand rapidly in the zebra finch [53]. Even more so, the zebra finch showed the strongest expansion (+17 genes) in our data set. Furthermore, we find a rapid expansion on the branch leading to the core birds-of-paradise (+5 genes) and further in Astrapia (+6 genes). Interestingly, olfactory receptor genes show rapid contractions in the paradise crow (-6 genes), the hooded crow (-9 genes) and the collared flycatcher (-5 genes). This is in line with a study that suggested poor olfactory development in a different corvid species, the Japanese jungle crow (Corvus macrorhynchos) [54]. Olfactory could serve many functions in birds-of-paradise such as species recognition (to avoid extensive hybridization), individual recognition, mating or foraging (given their extensive diet breadth).

Startle response and adult locomotory behavior
Startle response is an important behavioral trait. It is the ability to react quickly to the presence of a stimulus. We find a gene family associated with startle response and adult locomotory behavior to be evolving significantly faster than under a neutral model on the branch leading the core birds-of-paradise (+5 genes). It is even further expanded in Paradisaea (+3 genes). This gene family is contracted in the two outgroups, the zebra finch (-4 genes) and the collared flycatcher (-2 genes), as well as the monochromatic, non-lekking paradise crow (-1 genes). We found no expansion or contraction in the hooded crow genome. For core birds-of-paradise that show extravagant lekking behavior, fast response to stimuli or increased locomotory behavior could have several advantages. For example, being highly visible during leks means that lekking birds need to be able to look out for predators and react to them quickly, and indeed Frith & Beehler [3] mention that lekking birds-of-paradise appear to be constantly on the lookout for predators. Interestingly, species of the genus Paradisaea have aggregated leks high in emergent trees and thus may be more visible to the numerous birds of prey that inhabit the region [3]. It could further be important for interaction between males or between the sexes during leks. 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 We found indications of a conserved synteny between birds-of-paradise and other passerine birds, such as the collared flycatcher. Similar to other passerine genomes, we also found signatures of recent activity of novel retroviral LTRs in the genomes. Furthermore, several genes with known function in coloration, feather, and skeletal development showed signatures of positive selection in birds-of-paradise. This is in accordance with our prediction that phenotypic evolution in birds-of-paradise should have left genomic signatures. While these genes all are obvious candidates for evolution of birds-of-paradise's phenotypic and behavioral diversity, we also found positively selected genes that are not as straightforward to explain. These include genes involved in development and function of the visual system. Gene gain/loss analyses further revealed significant expansions in gene families associated with 'startle response/locomotory behavior' and olfactory function.

Conclusions
Although the efforts to document the phenotypic and behavioral diversity in this model system of sexual selection continues to generate intense interest in birds-of-paradise, we still have limited understanding of the processes that have shaped their evolution. Here, we provide a first glimpse into genomic features underlying the diverse array of species found in birds-of-paradise. However, our analyses concentrate on comparative genomics, and we acknowledge that analyses based on multiple individuals per species will likely result in more reliable inferences. Furthermore, some of the most interesting traits in birds-of-paradise, such as lekking behavior, are likely complex traits and will require analyses based on many individuals (such as QTL mapping). In general, more in-depth analyses will be needed to establish a causal relationship between signatures of selection in the birds-of-paradise genome and the unique diversity of phenotypic traits, or to investigate changes in genome structure with higher resolution. Fortunately, technologies keep advancing, and along with decreasing costs for sequencing, we will soon be able to gain more information about this fascinating, but genetically understudied family of birds.

Genome Sequencing, Assembly, and Quality Assessment
We prepared two paired-end (overlapping and 450 bp average insert size) and two mate pair libraries (3 kb and 8 kb average insert size) for each of the three de novo assemblies (Ptiloris paradiseus, Astrapia rothschildi, and Lycocorax pyrrhopterus). All libraries, for the de novo and the reference-based mapping approaches were sequenced on a HiSeq2500 v4 at SciLifeLab Stockholm, Sweden. We generated two lanes of sequencing for each de novo assembly and pooled the two reference-based samples on one lane. We first assessed the read qualities for all species using the program FastQC (FastQC, RRID:SCR_014583) [57]. For the three species, Ptiloris paradiseus, Astrapia rothschildi, and Lycocorax pyrrhopterus we then used the preqc [58] function of the SGA [59] assembler to (1) estimate the predicted genome size, (2) find the predicted correlation between k-mer sizes and N50 contig length and (3) assess different error statistics implemented in preqc. For Ptiloris paradiseus, Astrapia rothschildi, and Lycocorax pyrrhopterus reads were assembled using Allpaths-LG [60]. To improve the assemblies, especially in repeat regions, GapCloser (GapCloser, RRID:SCR_015026; part of the SOAPdenovo package [61]) was used to fill in gaps in the assembly. Assemblies were then compared using CEGMA (CEGMA, RRID:SCR_015055) [62] and BUSCO2 (BUSCO (RRID:SCR_015008)) [21]. We added BUSCO2 scores for better comparisons at a later stage of the project. For the reference-based mapping, we mapped all reads back to the Ptiloris paradiseus assembly using BWA (BWA, RRID:SCR_010910) [63] (mem option), the resulting sam file was then processed using samtools [64]. To do so, we first converted the sam file generated by BWA to the bam format, then sorted and indexed the file. Next we removed duplicates using Picard (Picard, RRID:SCR_006525) (http://broadinstitute.github.io/picard/) and realigned reads around indels using GATK (GATK , RRID:SCR_001876) [65]. The consensus sequence for each of the two genomes was then called using ANGSD [66] (using the option -doFasta 3). [Genomes will be deposited in GenBank]

Genome Synteny
We also inferred genome architecture changes (synteny) between our three de novo assembled genomes and the chromosome-level assembly of the collared flycatcher [30]. To do so, we first performed pairwise alignments using Satsuma [79] and then plotted the synteny using Circos plots [80]. More precisely, we first performed asynchronous 'battleship'-like local alignments using SatsumaSynteny to allow for time efficient pairwise alignments of the entire genomes. In order to avoid signals from repetitive elements we used masked assemblies for the alignments. Synteny between genomes was then plotted using Circos and in-house perl scripts.

Gene Annotation
We masked repeats (only transposable elements) in the genome prior to gene annotation. Contrary to the repeat annotation step, we did not mask simple repeats in this approach. Those were later soft-masked as part of the gene annotation pipeline Maker2 [81], to allow for more efficient mapping during gene annotation. Gene annotation was performed using ab-initio gene prediction and homology-based gene annotation. To do so we used the genome annotation pipeline Maker2 [81], which is able to perform all the aforementioned genome annotation strategies. Previously published protein evidence (genome annotations) from Zhang et al. [19] were used for the homology-based gene prediction. To improve the genome annotation we used CEGMA to train the ab-initio gene predictor SNAP [82] before running Maker2 [81]. We did not train the de novo gene predictor Augustus [83] because no training data set for birds was available, and it was not recommended at the time to use potentially lower quality annotations for training of Augustus (M. Stanke, personal communication).  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64

Ortholog Gene Calling
In the next step we inferred orthologous genes using PoFF [84]. We included all five birds-ofparadise, as well as the hooded crow (Corvus cornix) [17], the zebra finch (Taeniopygia guttata) [85] and the collared flycatcher (Ficedula albicollis) [30] as outgroups. We ran PoFF using both the transcript files (in fasta format) and the transcript coordinates file (in gff3 format). The gff files were used (flag -synteny) to calculate the distances between paralogous genes to accurately distinguish between orthologous and paralogous genes. We then extracted the sequences for all one-to-one orthologs using a custom python script. Next, we determined the number of genes with missing data in order to maximize the number of genes included in the subsequent analyses. For a gene to be included in our analyses, it had to be present in at least 75% of all species (6 out of 8 species), which resulted in a set of 8,134 genes. In order to minimize false positives in the subsequent positive selection analysis caused by alignment errors, we used the codon-based alignment algorithm of Prank [86] and further masked sites with possible alignment issues using Aliscore [87]. Aliscore uses Monte Carlo resampling within sliding windows to identify low-quality alignments in amino acid alignments (converted by the program). The identified potential alignment issues were then removed from the nucleotide alignments using ALICUT [88].

Intron Calling
In addition to exons, we also extracted intron information for the birds-of-paradise genomes (see Supplementary Table S2). To do so we used the extract_intron_gff3_from_gff3.py script (https://github.com/irusri/Extract-intron-from-gff3) to include intron coordinates into the gff file. We then parsed out all intron coordinates and extracted the intron sequences from the genomes using the exttract_seq_from_gff3.pl script (https://github.com/irusri/Extract-intron-from-gff3). All introns for the same gene were then concatenated using a custom python script.

Phylogenetic analysis
The individual alignment files (we used exon sets without missing species, which resulted in 4,656 alignments) were then (1) converted to the phylip format individually, and (2) 200 randomly selected exon alignments were concatenated and then converted to phylip format using the catfasta2phyml.pl script (https://github.com/nylander/catfasta2phyml). We used the individual exon phylip files (all exons combined per gene) for gene tree reconstruction using RaxML [89] (using the GTR + G model). Subsequently, we combined the gene trees into a species tree using a multi-species coalescent model and carried out bootstrapping using Astral [90,91]. Astral does not provide branch lengths needed for calibrating phylogenetic trees (used for the gene gain/loss analysis). So, we subsampled our data, and constructed a ML tree based on 200 randomly chosen and concatenated exons using ExaML [92]. We then calibrated the species tree using the obtained branch lengths along with calibration points obtained from timetree.org using r8s [93]. These calibration points are the estimated 44 mya divergence time between flycatcher and zebra finch and the 37 mya divergence time between crow and the birds-of-paradise. Next, we performed a concordance analysis. First, we rooted the gene trees based on the outgroup (collared flycatcher, zebra finch). Then, for each node in the species tree we counted the number of gene trees that contained that node and divided that by the total number of gene trees. We next counted the number of gene trees that support a given topology (see Supple-1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65   mentary Table S65) and further calculated the Robison-Foulds distance between gene trees using RaxML.

Inference of Positive selection
We inferred genes under positive selection using dN/dS ratios of 8,133 orthologs. First, we investigated saturation of synonymous sites in the phylogenetic sampling using pairwise comparisons in CodeML [94]. The pairwise runmode of CodeML estimates dN and dS ratios using a ML approach between each species pair (Supplementary Table S6). We then investigated positive selection on the branch to the birds-of-paradise using the BUSTED model [28] (branch-site model) implemented in HyPhy [95]. The branch-site test allows for inference of positive selection in specific branches (foreground branches) compared to the rest of the phylogeny (background branches). The significance of the model comparisons was determined using likelihoodratio tests (LRT). We carried out multiple testing corrections using FDR [96] instead of Bonferroni correction, as the branch-site tests result in an excess of non-significant p-values, which violates the assumption of a uniform distribution in multiple testing correction methods such as the Bonferroni correction. The genes were then assigned gene symbols. To do so, we first extracted all the respective zebra finch or collared flycatcher GeneBank protein accessions. We then converted the accessions into gene symbols using the online conversion tool bioDBnet [97]. GO terms were obtained for the flycatcher assembly and assigned to orthologs that had a corresponding flycatcher transcript ID in Ensembl (7,305 genes out of 8,133). To determine enriched GO categories in positively selected genes, GO terms in genes inferred to have undergone positive selection were then compared to GO terms in all genes (with a GO term) using Fisher's exact test with a false discovery rate cut-off of 0.05. We found 262 GO terms enriched in positively selected genes before FDR correction and 47 enriched after.

Gene Gain-Loss
In order to identify rapidly evolving gene families in the birds-of-paradise we used the peptide annotations from all five birds-of-paradise species, along with the three outgroup species in our analysis: hooded crow, zebra finch and collared flycatcher. The crow genes were obtained from NCBI and the zebra finch and flycatcher genes were acquired from ENSEMBL 86 [98]. To ensure that each gene was counted only once, we used only the longest isoform of each protein in each species. We then performed an all-vs-all BLAST [76] search on these filtered sequences.
The resulting e-values from the search were used as the main clustering criterion for the MCL program to group peptides into gene families [99]. This resulted in 13,289 clusters. We then removed all clusters only present in a single species, resulting in 9,012 gene families. Since CAFE requires an ultrametric time tree as input, we used r8s to smooth the phylogenetic tree with calibration points based on the divergence time of crow and the birds-of-paradise at 37 mya and of flycatcher and zebra finch at 44 mya [100]. With the gene family data and ultrametric phylogeny ( Figure 3) as input, we estimated gene gain and loss rates (λ) with CAFE v3.0 [101]. This version of CAFE is able to estimate the amount of assembly and annotation error (ε) present in the input data using a distribution across the observed gene family counts and a pseudo-likelihood search. CAFE is then able to correct for this error and obtain a more accurate estimate of λ. We find an ε of about 0.01, which implies that 3% of gene families have observed counts that are not equal to their true counts. After correcting for this error rate, we find λ = 0.0021. This value for λ is considerably higher than those reported for other distantly related groups (Supplementary Table S9). GO terms were assigned to genes within families based on flycatcher and zebra finch gene IDs from Ensembl. We used these GO assignments to determine molecular functions that may be enriched in gene families that are rapidly evolving along the ancestral BOP lineage (Node BOP11 in Figure S1). GO terms in genes in families that are rapidly evolving along the BOP lineage were compared to all other GO terms using a Fisher's exact test (FDR cut-off of 0.05). We found 36 genes in 26 families to have enriched GO terms before FDR correction and 25 genes in 20 families after.

DATA AVAILABILITY
All genomes and supporting data are available via the GigaScience repository, GigaDB [102], and raw read data via NCBI (SRA archive; BioProject: PRJNA506819, BioSample: SAMN10474331-SAMN10474335, SRA Study: SRP173170). All results of the gene gain-loss analyses can be found online [103].