A near-chromosome-scale genome assembly of the gemsbok (Oryx gazella): an iconic antelope of the Kalahari desert

Abstract Background The gemsbok (Oryx gazella) is one of the largest antelopes in Africa. Gemsbok are heterothermic and thus highly adapted to live in the desert, changing their feeding behavior when faced with extreme drought and heat. A high-quality genome sequence of this species will assist efforts to elucidate these and other important traits of gemsbok and facilitate research on conservation efforts. Findings Using 180 Gbp of Illumina paired-end and mate-pair reads, a 2.9 Gbp assembly with scaffold N50 of 1.48 Mbp was generated using SOAPdenovo. Scaffolds were extended using Chicago library sequencing, which yielded an additional 114.7 Gbp of DNA sequence. The HiRise assembly using SOAPdenovo + Chicago library sequencing produced a scaffold N50 of 47 Mbp and a final genome size of 2.9 Gbp, representing 90.6% of the estimated genome size and including 93.2% of expected genes according to Benchmarking Universal Single-Copy Orthologs analysis. The Reference-Assisted Chromosome Assembly tool was used to generate a final set of 47 predicted chromosome fragments with N50 of 86.25 Mbp and containing 93.8% of expected genes. A total of 23,125 protein-coding genes and 1.14 Gbp of repetitive sequences were annotated using de novo and homology-based predictions. Conclusions Our results provide the first high-quality, chromosome-scale genome sequence assembly for gemsbok, which will be a valuable resource for studying adaptive evolution of this species and other ruminants.

protein-coding genes and 1.14 Gbp of repetitive sequences were annotated using de novo and homology-based predictions. Conclusions. Our results provide the first high-quality, chromosomescale genome sequence assembly for gemsbok, which will be a valuable resource for studying adaptive evolution of this species and other ruminants.

Background information
The Gemsbok (Oryx gazella) is the largest antelope in the genus Oryx, and a member of the Hippotraginae tribe of ruminants [1] (Figure 1). The gemsbok's biogeographical distribution includes Botswana and Namibia, traditionally inhabiting the Kalahari and Karoo Deserts in Southern Africa [2].
The climate of these regions is highly seasonal, with cool winters (10°C -15°C) and hot summers (43°C -46°C) when most of the annual rainfall occurs (90 -100 mm). High evaporation rates and low precipitation result in a semi-arid climate in both deserts [3]. Living in such extreme environments, gemsbok have evolved to be highly adapted to drought and extreme heat by minimizing water demand and loss. All the species in the Oryx genus are heterotherms, i.e., they can increase their body temperature from ~36°C to ~45°C in order to delay evaporative cooling [4]. Oryx species can also change their feeding behavior from grazing to browsing and digging when faced by extreme environmental conditions [5]. Male and female gemsbok are characterized by their low sexual dimorphism, with both sexes having horns and other shared secondary sexual traits [6], making them highly sought after by trophy hunters.
The gemsbok karyotype has 2n=56 chromosomes, with two Robertsonian translocations compared to cattle [7]. Gemsbok populations have high genetic diversity [8], consistent with other African bovids [9,10]. Here we report a chromosome-scale gemsbok genome sequence that will be useful for elucidating the unique adaptations that allow gemsbok to live in arid climates. Several of the large scaffolds are chromosome-length or near chromosome-length, which will facilitate detailed studies of genome evolution in ruminants. The high quality, chromosome scale assembly of the gemsbok contribute to the goals of the Genome 10K Project [11] and the Earth BioGenome Project [12].

Data description Library construction, sequencing and filtering
Genomic DNA was extracted from a captive born female Gemsbok from San Diego Safari Park (USA) using heart muscle collected at necropsy (NCBI BioSample ID SAMN09604855). High-molecular weight genomic DNA was obtained using the phenol/chloroform protocol as previously described [13].  Table 1). Reads were trimmed based on low base quality, and reads with more than 5% of uncalled ("N") bases were removed, providing a total of 179.64 Gbp of filtered read data for genome assembly.
Two Chicago libraries were generated (Dovetail Genomics, Santa Cruz, CA) as previously described [14]. Briefly, high-molecular-weight DNA was assembled into chromatin in vitro and then chemically cross-linked before being restriction digested. The overhangs were filled in with a biotinylated nucleotide, and the chromatin was incubated in a proximity-ligation reaction. The crosslinks were then reversed, and the DNA purified from chromatin. After sequencing these libraries on the Illumina Hiseq 4000 platform, we obtained ~382 million 150 bp read pairs.

Evaluation of genome size
We used k-mer analysis to estimate the size of gemsbok's genome. A k-mer refers to an artificial sequence division of K nucleotides iteratively from sequencing reads. A raw sequence read with L bp contains (L-K+1) k-mers if the length of each k-mer is K bp. The frequency of each k-mer 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 can be calculated from the genome sequence reads. Typically, k-mer frequencies plotted against the sequence depth gradient follow a Poisson distribution in any given dataset, whereas sequencing errors may lead to a higher representation of low frequencies.  Figure 2a). To assess assembly quality, approximately 98 Gbp (representing genome coverage of 34x) high quality short-insert size reads were aligned to the assembly using Burrows-Wheeler Aligner (BWA, with parameters of -t 1 -I) [16]. A total of 95.3% reads could be mapped, covering 97.8% of the assembly excluding gaps; 82.1% of these reads were properly paired with an expected insert size associated with the different libraries.
To increase the contiguity of the assembly we used sequence information from the Chicago libraries and the HiRise (version 2.0) scaffolder ( Figure 2a) [14]. A total of 5,411 new joins were produced, resulting in a superscaffold N50 of 47.03 Mbp (Table 1).

Evaluation of SOAPdenovo assembly
To further evaluate the structure of the SOAPdenovo scaffolds we used the information provided by RACA (Figure 2b). The RACA evaluation allowed identification of problematic regions in scaffolds with low read physical coverage and not supported by syntenic information from either the reference and the outgroup genomes. As we previously showed [17,19], 20 to 60 percent of the flagged problematic scaffolds are chimeric and, therefore, not existent in the genome. In gemsbok, only 12 SOAPdenovo scaffolds were identified as putatively chimeric after running RACA ( Table 1).
The HiRise assembler also pinpointed putatively chimeric SOAPdenovo scaffolds using the Chicago libraries sequence information (Figure 2b). A total of 17 regions in 16 SOAPdenovo scaffolds were identified in this manner. Among the 16 problematic SOAPdenovo scaffolds identified using Chicago library sequence information, four were also flagged by RACA, while four SOAPdenovo scaffolds were not included in the RACA assembly because they were smaller than 10 Kbp. Seven SOAPdenovo scaffolds were broken in the SOAPdenovo + Chicago assembly, but one of the fragments was below the 150 Kbp resolution chosen to run RACA and therefore not reported in the RACA output. Only two complete disagreements between the SOAPdenovo + Chicago and SOAPdenovo + RACA assemblies were identified.

Evaluation of SOAPdenovo + Chicago assembly
To assess the SOAPdenovo + Chicago assembly, RACA was used to identify putative chimeric superscaffolds ( Figure 2b). Because there is no physical or genetic map for gemsbok, we were not able to verify the scaffold adjacencies in PCFs predicted by RACA, and therefore, the PCFs were used as a tool to evaluate the SOAPdenovo + Chicago assembly. In this assessment, cattle and human genomes served as the reference and outgroup, respectively, and the SOAPdenovo + Chicago assembly as input. A total of 47 PCFs were reconstructed with N50 of 86.25 Mbp (  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 was orthologous to cattle chromosome 11, but in the new assembly it was fragmented into two PCFs, one of ~186 Kbp containing sequence not present in the SOAPdenovo + RACA assembly. More than 98% of the scaffold joins introduced in the SOAPdenovo + Chicago assembly were consistent with RACA results and are thus likely to be accurate. However, RACA introduced 50 breaks in 25 SOAPdenovo + Chicago scaffolds, suggesting that these scaffolds might be chimeric ( Figure 2b). Of the 50 breaks, 27 comprised joins of SOAPdenovo scaffolds into superscaffolds made using the HiRise assembler. The other 23 breaks were inside single SOAPdenovo scaffolds, with five being also broken in the SOAPdenovo + RACA assembly, while the rest were either not used (4 cases) or below the 150 Kbp resolution of the SOAPdenovo + RACA assembly (14 cases). Although physical or genetic maps for gemsbok are not available to verify the SOAPdenovo + Chicago + RACA assembly, we previously showed that RACA produces highly accurate chromosome assemblies when compared to meiotic linkage [20] or cytogenetic physical maps [19], suggesting that the 47 PCFs of the gemsbok assembly accurately represent scaffold order and orientation on the gemsbok chromosomes. Therefore, using RACA allowed us to identify putatively chimeric scaffolds and superscaffolds, as well as to align components of chimeric scaffolds to their likely location on the gemsbok genome.

Genome annotation
To annotate the gemsbok genome, we started by mapping transposable elements (TEs). The TEs were predicted in the genome by homology to RepBase sequences using RepeatProteinMask  Table 2).
To assign functions to the newly annotated genes in the gemsbok genome, we aligned them to SwissProt database using blastp with an (E)-value cutoff of 1 e -5 . A total of 19,949 genes (86.27% of the total annotated genes) had a Swissprot match. Publicly available databases (including Pfam, PRINTS, PROSITE, ProDom, and SMART) were used to annotate motifs and domains using InterPro, producing a total of 17,112 genes annotated with domain information (74%). By searching the KEGG database using a best hit for each gene, 9,696 genes were mapped to a known pathway (41.93% of the genes). Finally, we assigned a gene ontology term to 14,196 genes, representing 61.39% of the whole set. Overall, 20,008 genes (86.52%) had at least one functional annotation (Supplementary Table 3).

Availability of supported data
The

Figure2
Click here to access/download; Figure;

Figure3
Click here to access/download; Figure;Figure3.pdf