Chromosome-Level Genome Assemblies: Expanded Capabilities for Conservation Biology Research

Genome assemblies are becoming increasingly important for understanding genetic diversity in threatened species. However, because budgets in conservation biology are limited, genome assemblies, when available, tend to be highly fragmented, with tens of thousands of scaffolds. The recent advent of high-throughput chromosome conformation capture (Hi-C) makes it possible to generate more contiguous assemblies containing scaffolds that are the length of entire chromosomes. Such assemblies greatly facilitate the analysis and visualization of genome-wide features. We compared genetic diversity in seven threatened species that had both draft genome assemblies and newer chromosome-level assemblies available. Chromosome-level assemblies allowed better estimation of genetic diversity, as well as localization and, especially, visualization of low heterozygosity regions in the genomes.


Introduction
Conservation biology aims to maintain, protect, and restore biodiversity across genetic, species, and ecosystem levels, and thereby prevent extinction. One of the most important aspects of species conservation is genetic diversity, which is affected by demographic history and is essential for ensuring adaptive potential. Reductions in sequencing costs have facilitated the estimation of genetic diversity in multiple species and their populations using whole genome resequencing approaches. However, analysis of whole genome resequencing data requires a reference genome assembly from either the same species or a closely related species. The current trend is to use chromosome-level assemblies, which offer a set of useful advantages. Conservation biology deals with a huge number of non-model species, but the corresponding genomic studies usually have significantly smaller budgets than those in biomedical or agricultural sciences, resulting in a continuous trade-off between the quality of the generated data and its cost. Recently, a USD 1,000 approach for generating chromosome-level assemblies from one short-insert Illumina paired-end library and an in situ high-throughput chromosome conformation capture (Hi-C) library was proposed [1], which might provide a temporary solution to this problem for the next several years. Here, we compared genetic diversity in seven threatened mammalian species for which previous highly fragmented scaffold assemblies and recently generated chromosome-level assemblies (including those generated by the USD 1,000 approach) were available. We show that the newer, more contiguous assemblies allowed better estimation of genetic diversity, as well as localization and visualization of low heterozygosity regions in the genomes.

Heterozygosity Visualization
Filtered genetic variants were split into single nucleotide polymorphism (SNP) and insertion-deletion (indel) categories. All subsequent analyses were based on SNPs only. Indels were excluded because indel calls from short reads are of low quality. Counts of heterozygous SNPs were calculated in non-overlapping windows of 100 kbp and 1 Mbp and scaled to SNPs per kbp. Heatmaps and boxplots were drawn using custom scripts based on the Matplotlib 2 library [16].
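The windowed counting step described above can be sketched as follows. This is a minimal illustration and not the custom scripts used in the study: the function name het_snp_density, the 1-based coordinate convention, and the choice to drop a trailing partial window are all assumptions made for this example.

```python
from collections import Counter


def het_snp_density(positions, scaffold_length, window_size=100_000):
    """Heterozygous SNPs per kbp in each full non-overlapping window.

    positions: 1-based coordinates of heterozygous SNPs on one scaffold.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    n_windows = scaffold_length // window_size
    # Map every SNP to the 0-based index of the window that contains it.
    counts = Counter((pos - 1) // window_size for pos in positions)
    kbp_per_window = window_size / 1000
    return [counts.get(i, 0) / kbp_per_window for i in range(n_windows)]


# Toy scaffold of 250 kbp: two full 100 kbp windows, the last 50 kbp is ignored.
toy_positions = [10, 50_000, 100_000, 100_001, 240_000]
densities = het_snp_density(toy_positions, scaffold_length=250_000)
print(densities)
```

On the toy data, the first window holds three SNPs and the second holds one, so the densities are 0.03 and 0.01 SNPs per kbp. The SNP at position 240,000 falls in the incomplete third window and is not counted.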

Evaluation of Genome Assemblies
This study involved analysis of genomes from seven threatened species representing three different IUCN (International Union for Conservation of Nature) Red List categories (NT, Near Threatened; VU, Vulnerable; EN, Endangered): sea otter (Enhydra lutris), cheetah (Acinonyx jubatus), clouded leopard (Neofelis nebulosa), giant otter (Pteronura brasiliensis), red panda (Ailurus fulgens), Asian small-clawed otter (Aonyx cinereus), and American bison (Bison bison) (Table 1). Each species was represented by two genome assemblies: the initial draft assembly and a chromosome-level assembly generated from the draft using Hi-C scaffolding. The draft assemblies were generated using different sequencing and assembly approaches, resulting in assemblies with differing contiguity and integrity. The scaffold N50 of the draft assemblies ranged from 0.10 Mbp for A. cinereus to 38.75 Mbp for E. lutris. Total gap lengths also varied considerably among the assemblies, from 1.4 Mbp in P. brasiliensis to 195.77 Mbp in B. bison. Hi-C scaffolding did not substantially increase total gap length in absolute terms (at most 14.15 Mbp was added, in the case of A. cinereus), and for E. lutris it even decreased, probably due to extensive correction of misassemblies preceding the scaffolding stage. The chromosome-level assemblies included large scaffolds, whose number corresponded to the haploid chromosome number (1n) of the species, along with a high number of smaller scaffolds. The lengths of these two categories differed by 1-2 orders of magnitude. Chromosome-length scaffolds were ordered by length, from longest to shortest, without assignment to the species-specific karyotype. Because the included assemblies were generated from both male and female individuals, we excluded sex chromosomes from further analysis.
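For readers unfamiliar with the two contiguity metrics used above, the following sketch shows how scaffold N50 and total gap length are commonly computed. It is an illustration under stated assumptions and not the tooling used in the study: the function names are hypothetical, and gaps are assumed to be encoded as runs of N characters in the scaffold sequences.

```python
def scaffold_n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half the assembly."""
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def total_gap_length(sequences):
    """Count N/n characters, assumed here to mark gaps between contigs."""
    return sum(seq.upper().count("N") for seq in sequences)


# Toy assembly: total length 100, so half is 50.
# The longest scaffold (40) is not enough; adding the next (30) reaches 70.
toy_n50 = scaffold_n50([40, 30, 20, 10])
toy_gaps = total_gap_length(["ACGTNNNNACGT", "nnACGT"])
print(toy_n50, toy_gaps)
```

In the toy example the N50 is 30 and the total gap length is 6. Joining scaffolds with Hi-C data typically inserts a fixed-size run of N between them, which is one reason why total gap length can grow slightly after scaffolding.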

Heterozygosity Estimations and Visualization
Genome-wide genetic diversity is usually estimated as heterozygosity, that is, the proportion of sites across the genome that contain heterozygous single nucleotide variants. This yields a single numerical value but does not reveal how variant sites are distributed across the genome, which may be critical for identifying hotspots and cold spots of genetic diversity. A more informative approach is to calculate mean or median heterozygosity in overlapping windows of fixed size. The window size is a matter of choice that depends on the integrity of the assembly and on the planned analysis and visualization, but commonly used sizes fall in the 50 to 5000 kbp range. A substantial part of the genome must be represented in windows for heterozygosity estimates to be reliable. Among the studied species, P. brasiliensis and A. cinereus, with N50 values of 0.17 and 0.1 Mbp, respectively (Table 1), had the most fragmented draft assemblies, which significantly affected the number of 1 Mbp and even 100 kbp windows (Table 2) and the assessment of heterozygosity distribution (Figure 1). At the lower bound, window size is limited by the number of heterozygous SNPs present in most windows, which in turn limits the number of windows that can be used for heterozygosity estimation and visualization. For mammalian genomes with a typical size of 2.5-3.0 Gbp, the number of 100 kbp windows exceeds 20,000 for assemblies of high contiguity. For a window size of 1 Mbp, the number of windows is 10-fold lower, which allows for easy visualization of SNP density and heterozygosity on chromosomal scaffolds (Figure 2). Such plots are impossible for draft assemblies due to the high number of scaffolds. However, we note that variant counts are similar between draft and chromosome-level assemblies.

Table 2. Counts of single nucleotide polymorphisms (SNPs) and windows for draft and chromosome-level assemblies of the analyzed genomes. Two species with the lowest window counts are in bold.

The species we analyzed include those well known for extremely low levels of heterozygosity, such as the sea otter (Figure 2a) and cheetah (Figure 2b), as well as species that have higher genetic diversity but are also considered threatened: American bison, Asian small-clawed otter, and red panda (Figure 2e-g). Despite significant differences in mean heterozygosity (Figure 1), all genomes showed regions with very low diversity (blue and dark blue regions in Figure 2). The most striking difference in heterozygosity between regions of the genome was found in the giant otter. Despite having ~2.5 times higher mean heterozygosity, the giant otter assembly showed long homozygous stretches (dark blue in Figure 2d) on more than half of its chromosomes.
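The dependence of window counts on assembly contiguity, discussed earlier in this section, can be illustrated with a small sketch. The function name window_coverage and the toy scaffold lengths are assumptions made for this example, and only full windows within a single scaffold are counted, which may differ from the exact rules applied in the study.

```python
def window_coverage(scaffold_lengths, window_size):
    """Return (number of full windows, fraction of the assembly inside them)."""
    n_windows = sum(length // window_size for length in scaffold_lengths)
    total = sum(scaffold_lengths)
    covered = n_windows * window_size
    return n_windows, (covered / total if total else 0.0)


# The same 10 Mbp of sequence arranged in two ways (toy numbers).
contiguous = [10_000_000]
fragmented = [150_000] * 60 + [50_000] * 20  # 9 Mbp + 1 Mbp

for name, lengths in (("contiguous", contiguous), ("fragmented", fragmented)):
    for window_size in (100_000, 1_000_000):
        n, fraction = window_coverage(lengths, window_size)
        print(name, window_size, n, fraction)
```

In this toy case the single long scaffold yields 100 windows of 100 kbp and 10 windows of 1 Mbp, covering all of the sequence. The fragmented arrangement yields only 60 windows of 100 kbp, covering 60% of the sequence, and no 1 Mbp windows at all, which mirrors the loss of windows seen for the most fragmented draft assemblies.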

Conclusions
Chromosome-level genome assemblies provide a more informative way to directly visualize genome-wide genetic diversity. Such assemblies can be generated using various sequencing technologies (long-read and short-read), but because many researchers have limited budgets, short-read drafts followed by Hi-C scaffolding offer a relatively inexpensive approach for many species of conservation concern in the near future.