First de novo whole genome sequencing and assembly of the bar-headed goose

Background The bar-headed goose (Anser indicus) mainly inhabits the plateau wetlands of Asia. As a specialized high-altitude species, bar-headed geese can migrate between South and Central Asia and annually fly twice over the Himalayan mountains along the central Asian flyway. The physiological, biochemical and behavioral adaptations of bar-headed geese to high-altitude living and flying have raised much interest. However, to date, there is still no genome assembly information publicly available for bar-headed geese. Methods In this study, we present the first de novo whole genome sequencing and assembly of the bar-headed goose, along with gene prediction and annotation. Results 10X Genomics sequencing produced a total of 124 Gb sequencing data, which can cover the estimated genome size of bar-headed goose for 103 times (average coverage). The genome assembly comprised 10,528 scaffolds, with a total length of 1.143 Gb and a scaffold N50 of 10.09 Mb. Annotation of the bar-headed goose genome assembly identified a total of 102 Mb (8.9%) of repetitive sequences, 16,428 protein-coding genes, and 282 tRNAs. In total, we determined that there were 63 expanded and 20 contracted gene families in the bar-headed goose compared with the other 15 vertebrates. We also performed a positive selection analysis between the bar-headed goose and the closely related low-altitude goose, swan goose (Anser cygnoides), to uncover its genetic adaptations to the Qinghai-Tibetan Plateau. Conclusion We reported the currently most complete genome sequence of the bar-headed goose. Our assembly will provide a valuable resource to enhance further studies of the gene functions of bar-headed goose. The data will also be valuable for facilitating studies of the evolution, population genetics and high-altitude adaptations of the bar-headed geese at the genomic level.


INTRODUCTION
High-altitude environments impose severe physiological challenges on vertebrates, owing to the decrease in oxygen, pressure, and temperature relative to lowland habitats. Understanding how vertebrates cope with these harsh conditions can provide important insights into the process of adaptive evolution (Zhang et al., 2016a;Zhang et al., 2016b). Among the high-altitude-adapted vertebrates, birds in general have excellent hypoxia tolerance and can maintain the highest basal metabolic rates during hypoxia too severe for most mammals to survive (Tucker, 1968;Faraci, 1991;Monge & Leon-Velarde, 1991), and are therefore especially compelling subjects for studies of high-altitude adaptation (Projecto-Garcia et al., 2013;Qu et al., 2013;Qu et al., 2015;Zhu et al., 2018).
One of the most well-known high-altitude bird species is the bar-headed goose (Anser indicus). Bar-headed geese breed in selected wetlands on the Qinghai-Tibetan Plateau (Takekawa et al., 2009) and Mongolian Plateau (Batbayar et al., 2014), and winter in the south-central Tibetan (Bishop et al., 1997) and India (Javed et al., 2000). The total worldwide populations of bar-headed geese were estimated at 97,000-118,000 individuals, with the highest numbers occurring in China (67.5-69.2%) and .5%) (Liu et al., 2017). Bar-headed geese migrate along the central Asian flyway, and a certain proportion of individuals fly between central Asia (breeding areas) and India (wintering areas), and annually fly twice across the Himalayas (mean elevation 4,500 m) (Hawkes et al., 2011), where the partial pressure of oxygen is one-third of that at sea level (Scott & Milsom, 2007). Interestingly, bar-headed geese can sustain the high metabolic rates and the high rates of oxygen consumption needed for flapping flight in severe hypoxia during their migration across the Himalayas (Ward et al., 2002). Therefore, this species has become renowned for an example of high-altitude adaptation.
Many studies have sought to determine the physiological, molecular and behavioral basis for the successful adaptation of bar-headed geese to high-altitude flying and living (Butler, 2010;Scott et al., 2015). For example, bar-headed geese have evolved multiple mechanisms that enhance the uptake, circulation and peripheral diffusion of oxygen during hypoxia, including the increased lung mass and total ventilation (Scott & Milsom, 2007), hemoglobin with an increased oxygen affinity because of a single amino acid point mutation in the alpha polypeptide chain (McCracken, Barger & Sorenson, 2010;Natarajan et al., 2018), greater capillary density in flight and cardiac muscles increasing oxygen supply , and a higher proportion of mitochondria in a subsarcolemmal location reducing oxygen diffusion distances (Snyder, Byers & Kayar, 1984). In addition, bar-headed geese were found to take a roller coaster strategy, rising and falling with the underlying terrain, to conserve energy during Himalayan migrations (Bishop et al., 2015).
With the rapid evolution of genome sequencing technologies over the past decade, whole genome sequences could be obtained in a more economical and efficient way (Weisenfeld et al., 2017;Paajanen et al., 2019). For example, the high-quality reference genomes of the African wild dog (Lycaon pictus) were generated at a low cost by using the linked-read 10× Genomics Chromium system (Armstrong Taylor et al., 2019). However, to date, there is still no genome assembly information publicly available for bar-headed geese. Ottenburghs et al. (2016) and Ottenburghs et al. (2017) sequenced nineteen goose genomes (including bar-headed goose) and used an exon-based phylogenomic approach (41,736 exons, representing 5,887 genes) to unravel the evolutionary history of geese group. However, the high quality of bar-headed goose genome assembly and annotation were unavailable in Ottenburghs et al. (2016) and Ottenburghs et al. (2017)'s work. In this study, we present the de novo whole genome sequencing and assembly of the bar-headed goose, along with gene prediction and annotation. We anticipate that these data will provide a valuable resource for future investigation of the genetic adaptation of bar-headed geese to high altitude, and particularly, for the comparative analysis of genome biology among several related birds (e.g., the pink-footed goose (Anser brachyrhynchus) (Pujolar et al., 2018), the swan goose (Anser cygnoides) (Lu et al., 2015) and the mallard (Huang et al., 2013)).

Ethics statement
This study conformed to the guidelines for the care and use of experimental animals established by the Ministry of Science and Technology of the People's Republic of China (Approval number: 2006-398). The research protocol was reviewed and approved by the Ethical Committee of Qinghai University.

Sampling and genome sequencing
A female bar-headed goose was collected in Huangzhong, Qinghai province of China, at an elevation of 3,000 m. Sampling (heart, liver, and lung) was done upon approved agreement of Xiulan Liu (20180411). The tissues were stored in liquid nitrogen and transported to the sequencing center (Novogene Bioinformatics Institute, Beijing, China). High-quality genomic DNA was extracted from the heart using the Qiagen DNA purification kit (Qiagen, Valencia, CA, USA) under the recommended protocol provided by the manufacturer. Standard 10X Genomic library was performed according to the manufacturer's instructions described in the Chromium Genome User Guide Rev A (https://support.10xgenomics.com/de-novo-assembly/sample-prep/doc/user-guidechromium-genome-reagent-kit-v1-chemistry). De novo sequencing was conducted with an Illumina HiSeq X Ten platform.

Genome assembly and completeness evaluation
After removing adapter sequences and the unqualified sequences in the raw data, we de novo assembled the bar-headed goose genome from the high-quality clean reads using Supernova (version 1.1.3) (Weisenfeld et al., 2017). Supernova is a software package for de novo assembly from Chromium Linked-Reads that are made from a single whole-genome library from an individual DNA source (https://support.10xgenomics.com/de-novoassembly/software/overview/latest/welcome). Standard assembly statistics were generated including: the length and number of contigs and scaffolds, number of scaffolds >2,000 bp, N50, N60, N70, N80 and N90, length of the longest contigs and scaffolds, and the total assembly length of contigs and scaffolds. The sequencing coverage and the guanine-cytosine (GC) content were also calculated.
After assembly, we evaluated the completeness of the bar-headed goose genome assembly using Core Eukaryotic Genes Mapping Approach software (CEGMA), which compared a set of 248 core eukaryotic genes to the assembled sequence (Parra, Bradnam & Korf, 2007). BUSCO analysis was also performed to assess the assembly quality by searching against the Benchmarking Universal Single-Copy Orthologs called BUSCOs (version 3.0) (Simao et al., 2015). In our study, the BUSCO dataset consisted of 2,586 single-copy orthologs genes.
Second, repetitive sequences in the bar-headed goose genome were identified using a combination of homology-based and de novo approaches. We used RepeatMasker and the associated RepeatProteinMask (Tarailo-Graovac & Chen, 2009) for homolog-based comparison by screening the published RepBase database (Bao, Kojima & Kohany, 2015). We used RepeatModeler (Smit & Hubley, 2018) to build a de novo repeat library. Then RepeatMasker was employed to align sequences from the bar-headed goose assembly to this de novo library for identifying repeat sequences. Tandem repeats, adjacent copies of a repeating pattern of nucleotides in the genomic sequences, were de novo predicted using Tandem Repeats Finder (TRF) tool (Benson, 1999).
Last, functional annotation of all genes was undertaken by BLASTP (evalue 1E-05). Protein domains were annotated by searching InterPro (version 32.0) (Mulder & Apweiler, 2007) and Pfam (Version 27.0) databases (Finn et al., 2014), using InterProScan (Version 4.8) and Hmmer (Potter et al., 2018) (Version 3.1) respectively. Gene Ontology terms for each gene were obtained from the corresponding InterPro or Pfam entry. The pathways, in which the gene might be involved, were assigned by blast against the KEGG database (release 53) (Ogata et al., 1999), with an E-value cutoff of 1E-05.

Data access
The sequencing data in the fastq format have been deposited in NCBI Sequence Read Archive (SRA) with accession number SRP199943 and Bioproject accession PRJNA542959. The assembled draft genome has been deposited at DDBJ/ENA/GenBank under the accession VDDG00000000. The version described in this paper is version VDDG01000000. The annotation results of repeated sequences, gene structure, noncoding RNAs and functional prediction were deposited in Figshare database (https: //doi.org/10.6084/m9.figshare.8229083.v1).

Genome sequencing, assembly and annotation
We sequenced the genome of a female bar-headed goose from Huangzhong, Qinghai, China using the linked-read 10× Genomics Chromium system and the Illumina HiSeq X Ten platform. We obtained a total of 124 Gb sequencing data (∼103-fold coverage) with a read length of 150 bp and an insert size of 600 bp (Table S1). We then used these data to de novo assemble the bar-headed goose genome using the 10× Genomics Supernova assembler. The final total assembled genome length was 1.14 Gb with a contig N50 length of 120 Kb, and a scaffold N50 length of 10.09 Mb (Table 1). The average GC content of the bar-headed goose genome was 41.87% (Table S2). CEGMA and BUSCO analyses were performed to assess the completeness of the assembled bar-headed goose genome. Using CEGMA on our assembly, we found that 211 conserved genes (85.08%) in bar-headed goose genome were successfully assembled, when compared to the 248 evolutionarily conserved core gene sets from six eukaryotic model organisms (Table S3). According to BUSCO analysis, 97.5% of the 2,586 conserved genes were identified in our assembled bar-headed goose genome (Table 2). These results indicated that the assembly of the bar-headed goose genome showed a high level of completeness. We detected 8.9% repeat sequences in bar-headed goose genome (Table S4), including long interspersed nuclear elements (LINEs, 6.20%), long terminal repeats (LTRs, 1.73%), and DNA transposons 0.27%. We also predicted 342 microRNAs (miRNAs), 49 rRNAs, 282 tRNAs, and 288 small nuclear RNAs (snRNAs) in the bar-headed goose genome (Table 3). By combining homologous comparison, de novo prediction and RNA-seq assisted methods, we predicted 16,428 protein-coding genes in the bar-headed goose genome (Table 4). The average length of genes were 25,274 bp with an average of 10 exons per gene (Table 4). Of these protein-coding genes, a total of 15,790 genes (96.1%) were functionally annotated according to NCBI's Reference Sequence (RefSeq) database (O'Leary et al., 2016), Swiss-prot, KEGG and InterPro databases (Table 5).

Comparative genomic analysis
We identified a total of 18,914 common gene families from the 16 available vertebrates' genomes, including swan goose (Anser cygnoides, Acy), crested ibis (Nipponia nippon, Nni), mallard (Anas platyrhynchos, Apl), red junglefowl (Gallus gallus, Gga), turkey (Meleagris     S1). Furthermore, the number of unique or shared orthologous genes among bar-headed goose, swan goose, mallard and red junglefowl was provided by a Venn diagram (Fig. 1). Of these four species, 361 orthologous genes were found to be unique in the bar-headed goose genome. 50% and 27.32% of these specific orthologous genes were enriched in gene ontology (GO) terms related to molecular functions and biological process, respectively. The molecular functions were involved in protein binding (GO: 0005515) and ATP binding (GO: 0005524), while the biological process was involved in signal transduction (GO: 0007165) and G-protein coupled receptor signaling pathway (GO: 0007186) (Table S5).

Expansion and contraction of gene families
Compared with the other 15 vertebrates, we identified 63 and 20 gene families that were substantially expanded and contracted in the bar-headed goose, respectively (Fig. 2). Functional identities were presented for all significantly expanded and contracted gene families in Tables S6 and S7, respectively. Most of the significantly expanded gene families were associated with the functional categories of adhesion, transport, hydrolase activities and binding (Table S6) adenyl nucleotide binding (GO: 0030554) and Rho GTPase binding (GO: 0017048), among others. For gene families that were significantly contracted, the main functional categories, as indicated by GO annotations, also included adhesion, transport, hydrolase activities and binding (Table S7). However, gene family members belonging to the specific GO terms that showed contraction differed from those that showed significant expansion.

Positive selection in bar-headed goose
Identification of genes subject to positive selection in the bar-headed goose genome is key to understand its adaptation to high-altitude environments. Compared with swan goose, a relative species living at low altitude areas, we identified 1,715 positively selected genes (PSGs) in the bar-headed goose genome using the branch-site likelihood ratio test (Table S8). The GO annotation for these PSGs was shown in Fig. 3. Molecular functions contained genes mainly involved in binding (711 genes, GO: 0005488) and catalytic activity (330 genes, GO: 0003824). Genes associated with cellular components were primarily cell (236 genes, GO: 0005623), cell part (236 genes, GO: 0044464), membrane (207 genes, GO: 0016020) and organelle (149 gene, GO: 0043226). Genes related to biological process were mainly involved in cellular process (453 genes, GO: 0009987), metabolic process (396 genes, GO: 0008152), single-organism process (324 genes, GO: 0044699), biological regulation (206 genes, GO: 0065007) and response to stimulus (135 genes, GO: 0050896). These different functional categories resulting from the GO annotations indicated that a substantial diversity of PSGs were present in the bar-headed goose genome. We found several PSGs related to the adaptation of bar-headed goose to the harsh high-altitude environment. For example, there were nine PSGs (IFNGR2, CDKN1B, Figure 2 Gene family expansion and contraction in the bar-headed goose genome. The number of expanded (blue) and contracted (red) gene families are shown along branches and nodes. MRCA, most recent common ancestor; Ain, bar-headed goose; Acy, swan goose; Nni, crested ibis; Apl, mallard; Gga, red junglefowl; Mga, turkey; Cca, common cuckoo; Cli, rock pigeon; Phu, ground tit; Cfl, bananaquit; Pma, great tit; Tgu, zebra finch; Ppu, ruff; Fpe, peregrine falcon; Bmu, yak; Pho, tibetan antelope.
Full-size DOI: 10.7717/peerj.8914/ fig-2 PDHA1, EIF4E2, EP300, IFNG, NPPA, PIK3R5 and MAPK1) annotated in the HIF-1 (Hypoxia-Inducible Factor-1) signaling pathway (map04066). Several PSGs associated with response to ultraviolet (UV) radiation were annotated in the cellular response to DNA damage stimulus (16 genes, GO: 0006974) and the DNA repair (15 genes, GO: 0006281). PSGs, which were annotated in the ATP binding (79 genes, GO: 0005524), GTP binding (29 genes, GO: 0005525) and endoplasmic reticulum (10 genes, GO: 0005783), were related to the energy metabolism rate under hypoxic conditions. In addition, we also found several other PSGs, which might be involved in response to high-altitude environment. VASH2, annotated in angiogenesis (GO: 0001525), might be a mediator in hypoxia-induced angiogenesis. PGR gene annotated in steroid hormone receptor activity (GO: 0003707), which might play a key role in diverse events associated with reproduction in female geese under the high-altitude condition. The positively selected MYOG (annotated in GO: 0007517) and TNNT3 (annotated in GO: 0006936) might have important implications in the skeletal development. However, all these genes need to be validated by future functional experiments.

DISCUSSION
The number of bird genomes accessible has increased dramatically during the last couple of years due to the Bird 10,000 Genomes (B10K) Project (https://b10k.genomics.cn), which was aimed to generate representative draft genome sequences from all extant bird species . Comparative genomic analyses of 48 consistently annotated bird genomes from the first phase of the B10K project provided new perspectives on avian evolution, ecology and conservation (Jarvis et al., 2014;Zhang et al., 2014). With the decrease in sequencing costs and improvements in genome assembly algorithms, many more bird genome sequences will be published at a striking pace (Ekblom & Wolf, 2014;Kraus & Wink, 2015). Bar-headed goose, a species occupying a wide variety of habitats on the Qinghai-Tibetan Plateau, is considered a good model for studying highaltitude adaptation in birds. Much of what is known about the mechanisms of highaltitude adaptation in bar-headed geese comes from studies that have taken physiological, biochemical, and morphological methods (Snyder, Byers & Kayar, 1984;Scott & Milsom, 2007;Scott et al., 2009;Butler, 2010;McCracken, Barger & Sorenson, 2010;Scott et al., 2015;Natarajan et al., 2018). An understanding of the genetic basis of this adaptation, however, has lagged behind due to the unavailability of the bar-headed goose genome. Therefore, we sequenced, assembled, and annotated a bar-headed goose genome for the first time to provide a valuable tool for future population genetics and comparative genomics studies associated with this species. Birds have the smallest genomes among amniotes, ranging from an estimated 0.9 Gb to 2.1 Gb (Gregory, 2019). In the current study, the genome size of the bar-headed goose was estimated to be 1.2 Gb and a total of 124 Gb (∼103-fold coverage) sequencing data was generated. The assembled sequence was 1.143 Gb, consisting of 10,528 scaffolds with a scaffold N50 of 10.09 Mb. In comparison with the swan goose genome (Lu et al., 2015) and pink-footed goose genome (Pujolar et al., 2018), our bar-headed goose genome exhibits longer contig N50 lengths and scaffold N50 lengths. The assembly was well supported by the BUSCO and CEGMA assessments. The average GC content of the bar-headed goose genome was around 41.87%, similar to that of the other birds such as Sichuan white goose (Anser cygnoides Linn. var domestica) (Gao et al., 2016) and buff-throated partridge (Tetraophasis szechenyii) (Zhou et al., 2019). Repeat sequences comprised approximately 8.9% of the bar-headed goose genome, which was higher than that of the Sichuan white goose genome at 6.9%. These results were consistent with previous reports indicating that almost all avian genomes present lower levels of repeat elements (∼4 to 10% of each genome) than those of other tetrapod vertebrates (for example, 34 to 52% in mammals) (Bohne et al., 2008;Kapusta & Suh, 2017). We predicted 16,428 protein-coding genes in the bar-headed goose, 96.1% of which were functionally annotated by public databases. The gene number of bar-headed goose was intermediate between the gene numbers of swan goose (16,150 genes) and pink-footed goose (26,134 genes).
Furthermore, a gene family clustering of four bird species (bar-headed goose, swan goose, mallard and red junglefowl) from the three clades within the phylogenetic tree identified 361 unique orthologous genes in the bar-headed goose genome comparing with the other three species. We then examined large-scale differences in gene families within 14 species of birds and between birds and two mammals using the genome data. In total, we determined that there were 63 expanded and 20 contracted gene families in the bar-headed goose compared with the other 15 vertebrates. The detailed evolutionary history and the function of each gene family needs further investigation. We performed a positive selection analysis between the bar-headed goose and the closely related lowaltitude goose, swan goose, to uncover its genetic adaptations to the Qinghai-Tibetan Plateau. The main stressor at high altitude is hypoxia. Most responses to hypoxia are mediated by the activation of a family of transcription factors named hypoxia-inducible factors (HIFs). In hypoxic conditions, HIFs are stabilized and enter the nucleus, where they bind to the Hypoxia Response Elements (HRE) and finally induce transcription of a large number of target genes. In our study, nine PSGs were annotated in the HIF-1 (Hypoxia-Inducible Factor-1) signaling pathway. These genes may be important for the bar-headed goose to survive in the high-altitude environment. For example, elevated expression of CDKN1B (p27Kip1) during hypoxia could arrest cells in G 0 /G 1 phase and may prevent the inappropriate proliferation of genetically damaged cells (Kumar & Vaidya, 2016). Pyruvate dehydrogenase (PDH) occupies a central crossroad between glycolysis and the tricarboxylic acid cycle. It was reported that cardiac PDHA1 (Pyruvate Dehydrogenase E1 Alpha 1 Subunit) deficiency impairs ischemic AMP-activated protein kinase signaling and sensitizes hearts to the toxicological actions of ischemic stress (Sun et al., 2016). EP300 (E1A Binding Protein P300), functions as a histone acetyltransferase, was identified as a co-activator of HIF-1A, and thus plays a role in the stimulation of hypoxia-induced genes (Dames et al., 2002). Natriuretic peptides are a well-described family of hormones with a major role in sodium and body volume homeostasis through their actions on renal hemodynamics and tubular function (Nakao et al., 1996). These positively selected genes in the bar-headed goose genome laid a solid foundation for understanding the specific adaptations to the high-altitude environments, in addition to other characteristics known from other high-altitude bird species (Scott, 2011;Qu et al., 2013). The functional role of these genes can be investigated in the future.

CONCLUSIONS
In summary, this is the first report describing the sequencing, assembly, and annotation of the bar-headed goose genome and contributes to the genomic resources for studying its genome evolution and the adaptive mechanisms that confer its resistance to the harsh environments on the Qinghai-Tibetan Plateau. Functional experimental assays will be needed to validate the actual functional significance of many of the positively selected genes.
National Wetland Park in Qinghai Province (grant No. 2018-THY-010203). Dr. Wen Wang and Shuo Feng were both supported by ''1000 Talent'' programs of Qinghai Province. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:

Competing Interests
The authors declare there are no competing interests. Rongkai Hao is employed by Novogene Bioinformatics Institute.

Author Contributions
• Wen Wang conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.
• Fang Wang, Kirill Sharshov and Alexey Druzyaka analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.
• Rongkai Hao and Zhuoma Lancuo analyzed the data, prepared figures and/or tables, and approved the final draft.
• Aizhen Wang performed the experiments, analyzed the data, prepared figures and/or tables, and approved the final draft.
• Yuetong Shi performed the experiments, prepared figures and/or tables, and approved the final draft.
• Shuo Feng conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Animal Ethics
The following information was supplied relating to ethical approvals (i.e., approving body and any reference numbers): This study conformed to the guidelines for the care and use of experimental animals established by the Ministry of Science and Technology of the People's Republic of China (Approval number: 2006-398). The research protocol was reviewed and approved by the Ethical Committee of Qinghai University.

Field Study Permissions
The following information was supplied relating to field study approvals (i.e., approving body and any reference numbers): Xiulan Liu granted access to her private field for the the sample collection (20180411).

Data Availability
The following information was supplied regarding data availability: The sequencing data are available in NCBI Sequence Read Archive (SRA): (SRP199943) and Bioproject (PRJNA542959).
The assembled draft genome is available at DDBJ/ENA/GenBank: VDDG00000000. The version in this article is VDDG00000000.
The annotation results of repeated sequences, gene structure, non-coding RNAs and functional prediction are available in FigShare: Wang, Wen (2019)

Supplemental Information
Supplemental information for this article can be found online at http://dx.doi.org/10.7717/ peerj.8914#supplemental-information.