A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Zhou, Shengyang; Xia, Tian; Gao, Xiaodong; Lyu, Tianshu; Wang, Lidong; Wang, Xibao; Shi, Lupeng; Dong, Yuehuan; Zhang, Honghai

doi:10.1038/s41597-023-02142-x

Download PDF

Data Descriptor
Open access
Published: 04 May 2023

A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Shengyang Zhou¹^na1,
Tian Xia ORCID: orcid.org/0000-0002-0097-6949¹^na1,
Xiaodong Gao¹^na1,
Tianshu Lyu¹,
Lidong Wang¹,
Xibao Wang¹,
Lupeng Shi¹,
Yuehuan Dong¹ &
…
Honghai Zhang¹

Scientific Data volume 10, Article number: 254 (2023) Cite this article

1826 Accesses
2 Altmetric
Metrics details

Subjects

Abstract

Aythya marila is one of the few species of Anatidae, and the only Aythya to live in the circumpolar. However, there is a relative lack of research on genetics of this species. In this study, we reported and assembled the first high-quality chromosome-level genome assembly of A. marila. This genome was assembled using Nanopore long reads, and errors corrected using Illumina short reads, with a final genome size of 1.14 Gb, scaffold N50 of 85.44 Mb, and contig N50 of 32.46 Mb. 106 contigs were clustered and ordered onto 35 chromosomes based on Hi-C data, covering approximately 98.28% of the genome. BUSCO assessment showed that 97.0% of the highly conserved genes in aves_odb10 were present intact in the genome assembly. In addition, a total of 154.94 Mb of repetitive sequences were identified. 15,953 protein-coding genes were predicted in the genome, and 98.96% of genes were functionally annotated. This genome will be a valuable resource for future genetic diversity and genomics studies of A. marila.

Chromosome-level genome assembly of watershield (Brasenia schreberi)

Article Open access 19 July 2023

Haplotype-resolved chromosomal-level assembly of wasabi (Eutrema japonicum) genome

Article Open access 11 July 2023

A chromosome-level genome assembly reveals genomic characteristics of the American mink (Neogale vison)

Article Open access 16 December 2022

Background & Summary

Greater scaup (Aythya marila), the diving duck of the family Anatidae, which are distributed across Eurasia and North America. A. marila is one of the few Anatidae that breed in the Arctic, and the only duck belonging to the Aythya. A. marila which is a typical winter migrant. In winter, they migrate to the south to overwinter in shallow waters near the coast, estuaries, inland lakes, reservoirs, etc^1,2,3. Little attention has been paid to this species because of extremely large range and population size. In fact, the population of A. marila has been declining in the past decades due to lack of effective conservation and reduced food quality^4,5.

In recent years, with the development of high-throughput sequencing technology, genomes of more birds have been sequenced, providing a rich genetic resource for the study of species evolutionary mechanisms^6,7,8,9,10. However, the published data does not include this species. Studies at the genetic level in this species were extremely rare due to lack of attention, and other few relevant studies were also related to diet, populations, and avian viruses^4,5,11.

In this study, we present the first genome assembly of A. marila based on Nanopore sequencing and Hi-C sequencing technology. We used Nanopore sequencing data, Illumina sequencing data, and Hi-C sequencing data to assemble a genome (1,135.19 Mb) with scaffold N50 of 85.44 Mb, and 98.28% of the assembled bases were associated with the 35 chromosomes. This genome will be a valuable resource for many studies, including convergent evolution, population genomics, adaptive evolution, and comparative genomics, among others.

Materials and methods

Ethics statement

All animal experimental procedures were approved by the Biomedical Ethics Committee of Qufu Normal University (approval number: 2022001).

Sampling and sequening

The experimental sample is a wounded male duck found during the wild bird survey in Jiangsu, China, which died unexpectedly during rescue. We dissected the sample and used the muscle for genome sequencing. Additionally, six transcriptomic samples (heart, kidney, liver, lung, pancreas, and spleen) from the same individual were collected and stored in an ultra-low temperature refrigerator at −80 °C, until samples were used for extracted genomic DNA and RNA.

Using Illumina® TruSeq® Nano DNA Library Prep kits to generate sequencing libraries of genomic DNA, we constructed a paired-end library with an insertion size of 350 bp, which was sequenced using Illumina HiSeq 4000 platform. Finally, a total of 60.77 Gb data (coverage of 45.30×) of 150 bp paired-end reads were generated. Simultaneously, using the PromethION platform with single-molecule real-time sequencing to sequence long reads. 122.55 Gb data were obtained, which covered 91.36 fold of the greater scaup’s genome. To assemble a chromosome-level genome of A. marila, a high-through chrome configuration capture (Hi-C) library was built and sequenced using Illumina PE150 platform. In total, we obtained 63.43 Gb raw reads. The six transcriptomic samples were used for RNA and sequenced using Illumina NovaSeq 6000 Platform, 43.9 Gb raw reads of all samples were obtained and used for subsequent gene prediction.

Genome size estimation and assembly

To estimate the genome size, heterozygosity and the proportion of repetitive sequences in A. marila, we performed K-mer analysis using Illumina sequencing data by jellyfish (v2.2.7). While K = 17, the estimated genome size was 1,341.4 Mb, the heterozygosity was 0.47%, and the proportion of repetitive sequences was 42.28% (Table 1). Then, we used NextDenovo (v2.4.0) (https://github.com/Nextomics/NextDenovo) software to assemble the genome with Oxford nanopore technologies (ONT) long reads, and NextPolish¹² (v1.3.1) was used to increase the precision of single base with Illumina short reads. We obtained a preliminarily assembly with size of 1,135.19 Mb after de novo assembly and base correction, which included 194 contigs, contig N50 was 34.42 Mb, and the proportion of GC was 41.94%.

Table 1 Genomic characteristics statistics while K = 17.

Full size table

ALLHiC¹³ (v0.9.8) and Juicebox (v1.11.08) software were used to mount the contigs in preliminarily assembly onto chromosomes. Contigs were sorted, pruned, and optimized based on the signal strength after Hi-C data were utilized to identify the interaction signals between Contigs. The interaction signals were then shown as a heat map using ALLHiC (Fig. 1). Using Juicebox to correct the position of contigs according to the intensity of chromosome interaction, and finally get a chromosome-level genome assembly with scaffold N50 of 85.44 Mb (Table 2). A total of 106 contigs were mapped to 35 chromosomes with lengths ranging from 2.11 Mb to 207.93 Mb, and 98.28% of the sequences were mapped to the chromosomes (Table 3).

Table 2 Assembly statistics of A. marila.

Full size table

Table 3 Summary of assembled 35 chromosomes of A. marila.

Full size table

Assessment of the genome assemblies

To assess the accuracy of genome assembly, we used Burrows-Wheeler aligner¹⁴ (BWA) (v0.7.8) to map Illumina reads to the genome of A. marila. The reads matching rate was approximately 98.80%, indicating a good agreement between the reads and the assembled genome. Merqury¹⁵ (v1.3) was ran to evaluate assembly quality value (QV), and a high QV (42.14) indicates that this assembly is of good quality. Benchmarking Universal Single-Copy Orthologs¹⁶ (BUSCO) (v5.4.4) (use option “--augustus”) and Core Eukaryotic Genes Mapping Approach¹⁷ (CEGMA) (v2.5) were also used to assess the integrity of the assembled genome. The BUSCO results showed that 97.0% of the complete BUSCOs and 0.8% of the fragmented BUSCOs were found in 8338 single-copy orthologs of avers_odb10, and 2.2% of BUSCOs was missing (Fig. 2). Moreover, 238 of 248 core eukaryotic genes were detected using CEGMA. The above results indicate that the genome assembly of A. marila was complete and of high quality.

Comparison with tufted duck genome assembly

Mummer¹⁸ (v4.0.0) was used to identify the synteny between A. marila and tufted duck¹⁹ (Aythya fuligula) genomes to determine orthologous chromosome pairs, and we used TBtools²⁰ (v1.112) to draw the synteny between their chromosomes. The synteny analysis shows that the chromosome sequences of the two species have good synteny, and each chromosome corresponds to multiple chromosomes of another species. There is one sex chromosome Z (Chr4) in A. marila, and it’s multiple autosome have some synteny with the sex chromosome W of A. fuligula (Fig. 3). Most of the chromosomes can correspond to the chromosome of the tufted duck, however both Chr31 and Chr32 shared synteny with chromosome 34 of the tufted duck, and the synteny was poor (Fig. 3d). Simultaneously, they also have synteny with some contigs. This may be due to the poor assembly quality caused by too small chromosomes size (Chr31, 2.05 Mb; Chr32, 1.03 Mb). In gap analysis, the chromosomes of A. fuligula genome assembly had a higher proportion of N bases (Fig. 3c). In general, the genome we assembled was of high quality.

Annotation of repetitive sequences

We used a combined approach of de novo prediction and homology-based prediction to annotate repetitive sequences in the genome of A. marila. For de novo prediction, we used Tandem Repeats Finder²¹ (TRF) (v4.09) to detect tandem repeat sequences in the genome. RepeatModeler (v2.0.3), RepeatScout²² (v1.0.6) and RECON (v1.08) were used to build a database of transposable element (TE) family for de nove prediction. RepeatProteinMask and RepeatMasker (v4.1.2-p1) were used for homology prediction with Repbase database²³ and Dfam database²⁴, the species parameter used was chicken. Ultimately, we identified 154.94 Mb of repetitive sequences by all methods, accounting for 13.65% of the assembled genome (Table 4).

Table 4 Classification of repeat elements in A. marila genome.

Full size table

Gene structure prediction

We used three methods to predict the protein-coding sequences in the genome of A. marila, ab initio prediction, homology-based prediction and RNA-Seq prediction. Three ab initio prediction software were used in the study, Augustus²⁵ (v3.3.2), GlimmerHMM²⁶ (v3.0.4), and Geneid²⁷ (v1.4.5). For the six transcriptomic samples, the raw data were quality controlled and filtered out bad reads using fastp²⁸ (v0.23.1) to obtain clean data. Then, the paired-end reads are assembled using SPAdes²⁹ (v3.15.3). Next, the candidate coding regions in the transcript sequences are identified using TransDecoder³⁰ (v5.5.0), and the sequences were clustered using CD-hit³¹ (v4.8.1) to remove redundant sequences. Eventually, the obtained protein sequences were used for subsequent prediction along with other downloaded sequences of related species. Genomes and annotation files of six related species^{19,32,33,34,35,36} (Anser cygnoides, Anas platyrhynchos, Aythya fuligula, Cygnus olor, Cygnus atratus, Gallus gallus) were downloaded from NCBI database to extract the longest transcript and translated them into protein sequences. Their protein sequences were matched to A. marila genome using Spaln³⁷ (v2.4.6), and then the matched sequences were accurately spliced against the homologous protein-coding sequences using GeneWise³⁸ (v2.4.1) software. Homology-based prediction results, RNA-Seq prediction results and ab initio prediction results are integrated using EvidenceModeler³⁹ (EVM) (v1.1.1) to generate gene sets, and including masked repeats as input to the gene predictions. There were 19,257 genes after EVM integration. The BUSCO results showed that complete BUSCOs was 88.2%, fragmented BUSCOs was 1.5%, and missing BUSCOs was 10.3%, which was lower than the BUSCO results of genome (Fig. 4a). Thus, for some single-copy orthologs of aves_odb10, that ware complete in genome but missing or fragmented in predictive gene set, their annotation file were integrated into the gene set. Overall, a total of 19,533 protein-coding genes were predicted in A. marila genome, including 159 prematurely terminated genes and 273 partial genes. The average transcript length of this gene set was 23,126.79 bp, and the average coding sequence (CDS) length was 1,558.71 bp (Table 5). The BUSCO results showed 95.5% of the complete BUSCOs, 0.3% of the fragmented BUSCOs and 4.2% of the missing BUSCOs (Fig. 4b).

Table 5 The statistics of protein-coding genes annotated in A. marila genome.

Full size table

Gene function annotation

Next, function annotation of predicted genes for A. marila using diamond⁴⁰ against SwissProt⁴¹, TrEMBL⁴², Non-Redundant Protein Sequence Database (NR), Gene Ontology⁴² (GO), and Kyoto Encyclopedia of Genes and Genomes Orthology⁴³ (KO) databases with e-value cutoff of 1e-5. The InterPro database⁴⁴ is also used for function annotation by classifying proteins into families and predicting domains and important sites with Interproscan⁴⁵ (v5.53). Finally, 15,789 genes were annotated, and accounted for 81.50% of all predicted genes of A. marila (Table 6).

Table 6 Annotation of predicted genes.

Full size table

Filtering and verification of gene set

We filtered the predicted gene set, because the number of predicted genes was more than in the six related species, and more than 3,000 genes could not be function annotated. We used OrthoFinder to find the orthologs of A. marila and six related species. There were 4,086 unassigned genes, and 3,421 of them were not annotated in any database (Table 7). Most of these genes (3,417/3,421) were predicted using at least one of the de novo prediction software, and only four genes are supported by other evidence. The BUSCO test was performed after removed unassigned genes without function annotations, the results showed no effect compared with before removed (Table 8; Fig. 4b). These suggests that these genes may only have specific sequences and patterns, rather than real genes⁴⁶. After filtering the unassigned genes without annotation and 159 prematurely terminated genes, 15,953 genes remain, including 182 partial genes, and 98.96% of them were annotated. The filtered gene set of A. marila was compared with the longest transcript set of related species, it has similar structure (Table 9 and Fig. 5). Except G. gallu, the average transcript length, average CDS length, average exons number, and average intron length of A. marila were smaller than those of others. Simultaneously, the average exon length of A. Marila was the longest among all species. However, these differences were not significant. This may be due to A. Marila has a higher proportion of short genes (Table 5a).

Table 7 Orthofinder Statistics Result of PerSpecies.

Full size table

Table 8 Busco results of the filtered gene set.

Full size table

Table 9 The comparison of the genetic structure from A. marila genome and related species.

Full size table

Data Records

All sequencing data had been uploaded to NCBI database

The genomic Illumina sequencing data, Nanopore sequencing data and Hi-C sequencing data were deposited at the Sequence Read Archive database of NCBI (accession numbers: SRR21672225⁴⁷, SRR21672223⁴⁸, and SRR21672224⁴⁹).

The transcriptome data of 6 samples were deposited at the Sequence Read Archive database of NCBI (accession numbers: SRR21700073 - SRR21700078^{50,51,52,53,54,55}).

The genome assembly of A. marila were deposited at GenBank database under the accession JAOQIG000000000⁵⁶.

The annotation results of repeats sequences, gene structure and functional prediction were deposited at the Figshare database⁵⁷.

Technical Validation

Quality control was performed on Illumina, Nanopore, Hi-C, and RNA sequencing data. The Q20 of Illumina sequencing data were greater than 96% and Q30 were greater than 90%. The Q7 of Nanopore sequencing data were greater than 92%. The Q20 of Hi-C sequencing data were greater than 94% and Q30 were greater than 96%. And the Q20 of RNA sequencing data were greater than 97%, and Q30 were greater than 93%. These evidences suggest that the data are reliable and can be used in subsequent analyses (Tables 10, 11).

Table 10 Statistical information of Illumina, Hi-C and RNA sequencing data.

Full size table

Table 11 Statistical information of Nanopore sequencing data.

Full size table

Code availability

No special codes or scripts were used in this work, all data processing commands and parameters were executed according to the manuals and protocols of the corresponding bioinformatics software.

References

Liu Y. & Chen S. H. The CNG Field Guide to the Birds of China [China]. Hunan Science & Technology Press, HuNan, (2021).
Youngman, J. A., Flaspohler, D. J., Knowlton, J. L. & Dombroski, L. J. Autumn waterbird migration over Lake Superior: Numbers, species, and timing. Journal of Great Lakes Research 43, 1186–1190 (2017).
Article Google Scholar
Cervencl, A. & Alvarez-Fernandez, S. Winter distribution of Greater Scaup Aythya marila in relation to available food resources. Journal of Sea Research 73, 41–48 (2012).
Article ADS Google Scholar
Marchowski, D. et al. Effectiveness of the European Natura 2000 network to sustain a specialist wintering waterbird population in the face of climate change. Scientific Reports 10, 20286 (2020).
Article CAS PubMed PubMed Central Google Scholar
Marchowski, D. et al. The Importance of Non-Native Prey, the Zebra Mussel Dreissena polymorpha, for the Declining Greater Scaup Aythya marila: A Case Study at a Key European Staging and Wintering Site. PLoS One 10, e0145496 (2015).
Article PubMed PubMed Central Google Scholar
Cole, T. L. et al. Genomic insights into the secondary aquatic transition of penguins. Nature Communications 13, 3912 (2022).
Article CAS PubMed PubMed Central ADS Google Scholar
Luo, H. et al. Genomic insight into the nocturnal adaptation of the black-crowned night heron (Nycticorax nycticorax). BMC Genomics 23, 683 (2022).
Article CAS PubMed PubMed Central Google Scholar
Sætre, C. L. C. et al. A Chromosome-Level Genome Assembly of the Reed Warbler (Acrocephalus scirpaceus). Genome Biology and Evolution 13, evab212 (2021).
Article PubMed PubMed Central Google Scholar
Recuerda, M. et al. Chromosome-Level Genome Assembly of the Common Chaffinch (Aves: Fringilla coelebs): A Valuable Resource for Evolutionary Biology. Genome Biology and Evolution 13, evab034 (2021).
Article PubMed PubMed Central Google Scholar
Zhang, G. et al. Genomics: Bird sequencing project takes off. Nature 522, 34 (2015).
Article CAS PubMed ADS Google Scholar
Wille, M., Muradrasoli, S., Nilsson, A. & Järhult, J. D. High Prevalence and Putative Lineage Maintenance of Avian Coronaviruses in Scandinavian Waterfowl. PLoS One 11, e0150198 (2016).
Article PubMed PubMed Central Google Scholar
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
Article CAS PubMed Google Scholar
Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nature Plants 5, 833–845 (2019).
Article CAS PubMed Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–60 (2009).
Article CAS PubMed PubMed Central Google Scholar
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245 (2020).
Article CAS PubMed PubMed Central Google Scholar
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Article PubMed Google Scholar
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Article CAS PubMed Google Scholar
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biology 14, e1005944 (2018).
Article PubMed PubMed Central Google Scholar
Formenti, G., Uliano Da Silva, M. & Kraus, R. H. S. Aythya fuligula isolate bAytFul2, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:WNMM00000000 (2019).
Chen, C. et al. TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data. Molecular Plant 13, 1194–1202 (2020).
Article CAS PubMed Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573–580 (1999).
Article CAS PubMed PubMed Central Google Scholar
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Article CAS PubMed Google Scholar
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6, 11 (2015).
Article PubMed PubMed Central Google Scholar
Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA 12, 2 (2021).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res 33, W465–7 (2005).
Article CAS PubMed PubMed Central Google Scholar
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–9 (2004).
Article CAS PubMed Google Scholar
Parra, G., Blanco, E. & Guigó, R. GeneID in Drosophila. Genome Res 10, 511–515 (2000).
Article CAS PubMed PubMed Central Google Scholar
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Article PubMed PubMed Central Google Scholar
Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A. & Korobeynikov, A. Using SPAdes De Novo Assembler. Current Protocols in Bioinformatics 70, e102 (2020).
Article CAS PubMed Google Scholar
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8, 1494–1512 (2013).
Article CAS PubMed Google Scholar
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Article CAS PubMed PubMed Central Google Scholar
Li, Q., Zhao, X. & Gao, G. Anser cygnoides breed Sichuan white goose isolate SCWG-2014, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:LABU00000000 (2017).
Li, J. Anas platyrhynchos breed Pekin duck isolate Z2, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JACGAL000000000 (2020).
Formenti, G., Fedrigo, O. & Kraus, R. Cygnus olor isolate bCygOlo1, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:WNMI00000000 (2021).
Cheng, Y., Bruxner, T. J. C. & Karawita, A. C. Cygnus atratus isolate AKBS03, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JABXOC000000000 (2020).
Fedrigo, O., Formenti, G. & Warren, W. Gallus gallus breed Cross of Broiler mother + white leghorn layer father isolate bGalGal1, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JAENSK010000000 (2021).
Gotoh, O. A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Research 36, 2630–2638 (2008).
Article CAS PubMed PubMed Central Google Scholar
Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res 14, 988–995 (2004).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008).
Article PubMed PubMed Central Google Scholar
Buchfink, B., Reuter, K. & Drost, H. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods 18, 366–368 (2021).
Article CAS PubMed PubMed Central Google Scholar
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research 28, 45–48 (2000).
Article CAS PubMed PubMed Central Google Scholar
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000).
Article CAS PubMed PubMed Central Google Scholar
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45, D353–D361 (2017).
Article CAS PubMed Google Scholar
Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Research 49, D344–D354 (2021).
Article CAS PubMed Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Article CAS PubMed PubMed Central Google Scholar
Scalzitti, N., Jeannin-Girardon, A., Collet, P., Poch, O. & Thompson, J. D. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics 21, 293 (2020).
Article CAS PubMed PubMed Central Google Scholar
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21672225 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21672223 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21672224 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21700073 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21700074 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21700075 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21700076 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21700077 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR21700078 (2023).
Zhou, S. Aythya marila isolate BB-2018, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JAOQIG000000000 (2023).
Zhou, S. A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila). figshare. https://doi.org/10.6084/m9.figshare.22500445.v1 (2023).

Download references

Acknowledgements

This research was supported by the National Natural Science Fundation of China (32200407, 31872242, 32070405, 32270444, and 32170530).

Author information

These authors contributed equally: Shengyang Zhou, Tian Xia, Xiaodong Gao.

Authors and Affiliations

College of Life Sciences, Qufu Normal University, Qufu, 273165, Shandong, China
Shengyang Zhou, Tian Xia, Xiaodong Gao, Tianshu Lyu, Lidong Wang, Xibao Wang, Lupeng Shi, Yuehuan Dong & Honghai Zhang

Authors

Shengyang Zhou
View author publications
You can also search for this author in PubMed Google Scholar
Tian Xia
View author publications
You can also search for this author in PubMed Google Scholar
Xiaodong Gao
View author publications
You can also search for this author in PubMed Google Scholar
Tianshu Lyu
View author publications
You can also search for this author in PubMed Google Scholar
Lidong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xibao Wang
View author publications
You can also search for this author in PubMed Google Scholar
Lupeng Shi
View author publications
You can also search for this author in PubMed Google Scholar
Yuehuan Dong
View author publications
You can also search for this author in PubMed Google Scholar
Honghai Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Shengyang Zhou, Tian Xia, and Honghai Zhang designed this study; Xiaodong Gao, Xibao Wang, Lupeng Shi and Yuehuan Dong dissected and collected the samples. Tianshu Lyu and Lidong Wang extracted DNA. Shengyang Zhou, Tian Xia and Xiaodong Gao performed data analysis. Shengyang Zhou wrote the paper. All authors had read, revised, and finally agreed to submit this manuscript.

Corresponding author

Correspondence to Honghai Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zhou, S., Xia, T., Gao, X. et al. A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila). Sci Data 10, 254 (2023). https://doi.org/10.1038/s41597-023-02142-x

Download citation

Received: 28 October 2022
Accepted: 11 April 2023
Published: 04 May 2023
DOI: https://doi.org/10.1038/s41597-023-02142-x