An integrated chromosome-scale genome assembly of the Masai giraffe (Giraffa camelopardalis tippelskirchi)

Abstract Background The Masai giraffe (Giraffa camelopardalis tippelskirchi) is the largest-bodied giraffe and the world's tallest terrestrial animal. With its extreme size and height, the giraffe's unique anatomical and physiological adaptations have long been of interest to diverse research fields. Giraffes are also critical to ecosystems of sub-Saharan Africa, with their long neck serving as a conduit to food sources not shared by other herbivores. Although the genome of a Masai giraffe has been sequenced, the assembly was highly fragmented and suboptimal for genome analysis. Herein we report an improved giraffe genome assembly to facilitate evolutionary analysis of the giraffe and other ruminant genomes. Findings Using SOAPdenovo2 and 170 Gbp of Illumina paired-end and mate-pair reads, we generated a 2.6-Gbp male Masai giraffe genome assembly, with a scaffold N50 of 3 Mbp. The incorporation of 114.6 Gbp of Chicago library sequencing data resulted in a HiRise SOAPdenovo + Chicago assembly with an N50 of 48 Mbp and containing 95% of expected genes according to BUSCO analysis. Using the Reference-Assisted Chromosome Assembly tool, we were able to order and orient scaffolds into 42 predicted chromosome fragments (PCFs). Using fluorescence in situ hybridization, we placed 153 cattle bacterial artificial chromosomes onto giraffe metaphase spreads to assess and assign the PCFs on 14 giraffe autosomes and the X chromosome resulting in the final assembly with an N50 of 177.94 Mbp. In this assembly, 21,621 protein-coding genes were identified using both de novo and homology-based predictions. Conclusions We have produced the first chromosome-scale genome assembly for a Giraffidae species. This assembly provides a valuable resource for the study of artiodactyl evolution and for understanding the molecular basis of the unique adaptive traits of giraffes. In addition, the assembly will provide a powerful resource to assist conservation efforts of Masai giraffe, whose population size has declined by 52% in recent years.


Abstract: Background
The Masai giraffe (Giraffa camelopardalis tippelskirchi) is the largest-bodied giraffe and the world's tallest terrestrial animal. With its extreme size and height, the giraffe's unique anatomical and physiological adaptations have long been of interest to diverse research fields. Giraffes are also critical to ecosystems of sub-Saharan Africa, with their long neck serving as a conduit to food sources not shared by other herbivores.
Although the genome of a Masai giraffe has been sequenced, the assembly was highly fragmented and unsuitable for the analysis of chromosome evolution. Herein we report an improved giraffe genome assembly to facilitate evolutionary analysis of the giraffe and other ruminant genomes.

Findings
Using SOAPdenovo2 and 170 Gbp of Illumina paired-end and mate-pair reads we generated a 2.6 Gbp female Masai giraffe genome assembly, with a scaffold N50 of 3 Mbp. The incorporation of 114.6 Gbp of Chicago library sequencing data resulted in a HiRise SOAPdenovo + Chicago assembly with an N50 of 48 Mbp and containing 95% of expected genes according to BUSCO analysis. Using the Reference-Assisted Chromosome Assembly tool, we were able to order and orient scaffolds into 42 predicted chromosome fragments (PCFs). Using fluorescence in situ hybridization we placed 153 cattle BACs onto giraffe metaphase spreads to assess and assign the PCFs on 14 giraffe autosomes and the X chromosome. In this assembly, 21,621 protein-coding genes were identified using both de novo and homology-based predictions.

Conclusions
We have produced the first chromosome-scale genome assembly for a Giraffidae species. This assembly provides a valuable resource for the study of artiodactyl evolution and for understanding the molecular basis of the unique adaptive traits of giraffes. In addition, the assembly will provide a powerful resource to assist conservation efforts of Masai giraffe, whose population size has declined by 52% in recent years. autosomes and the X chromosome. In this assembly, 21,621 protein-coding genes were identified using both de novo and homology-based predictions. Conclusions. We have produced the first chromosome-scale genome assembly for a Giraffidae species. This assembly provides a valuable resource for the study of artiodactyl evolution and for understanding the molecular basis of the unique adaptive traits of giraffes. In addition, the assembly will provide a powerful resource to assist conservation efforts of Masai giraffe, whose population size has declined by 52% in recent years.

Background information
Giraffes (Giraffa) are a genus of even-toed ungulate mammals comprising four species [1]. They are members of the family Giraffidae, which also includes the okapi (Okapia johnstoni). The Masai giraffe (also known as Kilimanjaro giraffe; Giraffa camelopardalis tippelskirchi; Figure 1) is native to East Africa and distributed throughout Tanzania and Kenya [2]. Masai giraffes are not only the largestbodied giraffes [3] but also the tallest terrestrial animals. Giraffes present several distinctive anatomical characteristics, such as their long neck and legs, horn-like ossicones and coat patterns, which together with their unique cardiovascular and musculoskeletal adaptations have interested researchers in many fields [3][4][5][6].
The giraffe genome comprises 15 pairs of chromosomes (2n = 30) that are believed to have originated by multiple Robertsonian fusions from the pecoran ancestral karyotype (2n = 58) [7,8]. In 2016, Agaba and colleagues generated the first genome sequence of a female Masai giraffe and compared it with the genome sequence of an okapi [9]. This study identified candidate genes and pathways involved in the giraffes' unique skeletal and cardiovascular adaptations [9]. The reported genome was fragmented, which hinders its use for studies of overall genome architecture and evolution. Missing and fragmented genes also limit the utility of the assembly for study of the genetic basis of the giraffe's unique adaptations. Here we report a chromosome-scale assembly of a female Masai giraffe genome sequenced de novo. This assembly will facilitate studies of ruminant genome evolution and will be a powerful resource for further elucidation of the genetic basis for the giraffe's characteristic features. Furthermore, having another Masai giraffe genome sequence will assist conservation efforts for this species, whose population has declined by more than 52% in recent decades [2,10].

Data description Library construction, sequencing, and filtering
Genomic DNA was extracted from a fibroblast cell culture of a female Masai giraffe (Taxonomy ID:  Table 1). To improve read quality, low-quality bases from both ends of the reads were trimmed, duplicated reads and those with more than 5% of uncalled ("N") bases were removed. A total of 171.09 Gbp of filtered read data were used for genome assembly (Supplementary Table 1).
Two Chicago libraries were generated by Dovetail Genomics (Santa Cruz, CA) as previously described [11]. Briefly, high-molecular-weight DNA was assembled into chromatin in vitro, chemically cross-linked and digested by restriction enzymes. The resulting digestion overhangs were filled in with a biotinylated nucleotide, and the chromatin was incubated in a proximity-ligation reaction. The cross-links were then reversed, and the DNA purified from the chromatin. These libraries were sequenced in one flow-cell lane using the Illumina HiSeq 4000 platform, resulting in the generation of ~385 million read pairs or 114.60 Gbp of sequence data (Supplementary Table 1).

Evaluation of genome size
The Masai giraffe genome size was estimated by k-mer analysis. A k-mer refers to an artificial sequence division of K nucleotides iteratively from sequencing reads. A raw sequence read with L bp

Genome assembly
We applied SOAPdenovo version 2.04 (SOAPdenovo, RRID:SCR_010752) with default parameters to construct contigs and scaffolds as described previously [13]. All reads were aligned against each other to produce contigs which were further assembled in scaffolds using the paired-end information.
The generated Masai giraffe genome assembly was 2.55 Gbp long, including 76.82 Mbp (3%) of unknown bases ("Ns"). The contig and scaffold N50 lengths were 21.78 Kbp and 3.00 Mbp, respectively (Table 1). To assess the assembly quality, approximately 90 Gbp (representing 35.6x genome coverage) high-quality short-insert size reads were aligned to the SOAPdenovo assembly using BWA (BWA, RRID:SCR_010910), with parameters of -t 1 -I. A total of 98.9% reads could be mapped and covered 98.9% of the assembly, excluding gaps. Approximately 92% of these reads were properly paired, having an expected insert size associated with the libraries of origin.
To increase the contiguity of the assembly we used the HiRise2.1 scaffolder [11] and sequence information from the Chicago libraries and SOAPdenovo assembly as inputs. The SOAPdenovo + Chicago assembly introduced a total of 56 breaks in 54 SOAPdenovo scaffolds, and formed 3,200 new scaffold joints, resulting in an increased scaffold N50 length of 57.20 Mbp ( Table 1).

Evaluation of the SOAPdenovo genome assembly and PCR verification of putatively chimeric scaffolds
To identify putatively chimeric scaffolds, we utilized the Masai giraffe SOAPdenovo genome assembly to obtain predicted chromosome fragments (PCFs) using Reference-Assisted Chromosome Assembly (RACA) software [14]. The RACA tool uses a combination of comparative information and sequencing data to order and orient scaffolds of target species and generate PCFs. The cattle (Bos taurus, bosTau6) and human (Homo sapiens, hg19) genome assemblies were used as a reference and outgroup, respectively, and all Illumina paired-end and mate-pair libraries were included in the RACA assembly. The read libraries were aligned to the SOAPdenovo scaffolds using Bowtie2 (Bowtie, RRID:SCR_005476) [15]. The cattle-giraffe and cattle-human pairwise alignments were performed using lastZ and UCSC Kent utilities [16], as previously described [14,17]. The RACA software was used at a minimum resolution of 150 Kbp for syntenic fragment (SF) detection. Only SOAPdenovo scaffolds >10 Kbp were used as input for RACA, comprising 95% of the assembly length.
After an initial run of RACA with default parameters, we tested the structure of 32/41 (76%) RACA-split SF adjacencies corresponding to 40 SOAPdenovo scaffolds flagged as putatively chimeric.
Chimerism was evaluated using PCR amplification of Masai giraffe DNA with primers that flank the RACA-defined split of SF joint boundaries (Supplementary Table 2 and Supplementary Table 3).
Because we were only able to test 76% of the putatively chimeric SOAPdenovo scaffolds, we mapped short-and long-insert size read libraries to the SOAPdenovo assembly to establish a minimum physical coverage of reads that mapped across the SF joint intervals, following previous publications [18]. By comparing the PCR results and the read mapping coverage, we established 158x as the minimum physical coverage that allowed differentiation of scaffolds that were likely to be chimeric from those that were likely to be authentic (Supplementary Table 2). This threshold was used to update the parameters of a second round of RACA (stage 2 RACA), which resulted in the generation of 47 PCFs, of which 13 were homologous to complete cattle chromosomes. The stage 2 RACA assembly had an N50 length of 85.22 Mbp. This assembly comprised 1,283 SOAPdenovo scaffolds, representing 93% of the original SOAPdenovo assembly, of which 33 were split by RACA, and two were manually split as they had been shown to be chimeric by PCR (Table 1). These results indicate the power of comparative information for improving assembly contiguity and for identifying problematic regions in de novo assemblies.

Evaluation of the HiRise SOAPdenovo + Chicago assembly
More than 94% of the joints introduced in the SOAPdenovo + Chicago assembly were concordant with the RACA assembly, 4% were inconsistent between the two assemblies, and 1% represented extra adjacencies with intervening scaffolds located at the ends of PCFs. Among the 54 SOAPdenovo scaffolds broken in the SOAPdenovo + Chicago assembly, 26 were also broken in the RACA assembly.
Among the remaining 28 scaffolds, five were not included in PCFs because they were under the 150
BAC clone coordinates for cattle (bosTau6) assembly were downloaded from NCBI CloneDB [20] and converted to coordinates in the giraffe SOAPdenovo + Chicago + RACA PCFs using the UCSC Genome Browser LiftOver tool [21]. A total of 153 BACs were successfully mapped to the giraffe assembly and were retained for the following analysis. To evaluate the 146 scaffold joints introduced by RACA, a reliability score was further calculated considering four components: (i) the relative positions of the BACs in giraffe metaphase spreads compared to the PCFs (Figure 2), (ii) if the joint was supported by sequence reads from Chicago libraries, (iii) physical coverage of Illumina pair-end reads, and (iv) comparative syntenic information. Different weights were given to each component of the score, ranging from 10% for the comparative syntenic information to 40% for the physical map using BAC data (Supplementary Table 4). Only those joints with a reliability score >30% were considered as authentic, indicating that at least FISH or Chicago library read support was present. More than 89% (N=130) of the adjacencies had FISH and/or Chicago support, while six (4%) adjacencies had syntenic support only (Supplementary Figure 1). The final genome assembly comprised PCFs placed on 14 giraffe autosomes and 10 chromosome X fragments (Table 1). Because chromosome X in Cetartiodactyls (including giraffe, cattle, and pigs) has been highly rearranged during evolution [19], tools such as RACA, that use a reference-assisted assembly approach, will have limited success in increasing the contiguity of the assembly of sex chromosomes in the Cetartiodactyl clade.

Completeness evaluation of genome assemblies using BUSCO
We evaluated genome completeness using the Benchmarking Universal Single-Copy Orthologs (BUSCO, RRID:SCR_015008; version 3.0) software [22]. Although comparing BUSCO results on different versions of genome assemblies might be inappropriate due to difference in parameter estimations [23], we found a high agreement between genome assemblies, with only 34 BUSCO single copy genes present in the SOAPdenovo assembly reported missing in the final assembly, while 42 BUSCO genes reported as fragmented and an additional 14 reported as missing in the SOAPdenovo assembly were labelled as complete in the final assembly. Overall, approximately 95% of the core mammalian gene set was complete in the SOAPdenovo and SOAPdenovo + Chicago assemblies; SOAPdenovo + RACA included 94% of the mammalian gene set, while the final chromosome-level assembly contained 95% complete BUSCO genes, similar to other reference-quality ruminant assemblies (94% for cattle ARS-UCD1.2 and goat ARS1). In comparison, the Masai giraffe genome assembly reported by Agaba and colleagues [9] included 87% of BUSCO genes (Figure 3). These results show that the genome assemblies we generated are of high completeness and accuracy, and a significant improvement over the genome assembly currently available for Masai giraffe.

Genome annotation
To annotate transposable elements (TEs) in the Masai giraffe genome, we started by predicting TEs by homology to RepBase sequences using RepeatProteinMask and RepeatMasker (RepeatMasker, RRID:SCR_012954) [26] with default parameters. Results from both types of software were combined to produce a non-redundant final set of TEs. Approximately 40% of the Masai giraffe's genome is comprised of TEs, with LINEs being the most frequent group (24%, Supplementary Table 6).
The remainder of the SOAPdenovo genome assembly was annotated using both homology-based and de novo methods. For the homology-based prediction, human, mouse, cow, and horse proteins were downloaded from Ensembl (Ensembl, RRID:SCR_002344), release 64, and mapped onto the genome using tblastn. The homologous genome sequences were aligned against the matching proteins using GeneWise (GeneWise, RRID:SCR_015054) [27] to define gene models. For de novo prediction, Augustus (Augustus: Gene Prediction, RRID:SCR_008417) [28], GENSCAN (GENSCAN, RRID:SCR_012902) [29], and SNAP (SNAP, RRID:SCR_007936)) [30] were applied to predict coding genes as described in Zhang et al. 2018 [31]. Finally, homology-based and de novo derived gene sets were merged to form a comprehensive and non-redundant reference gene set using GLEAN [32]. We obtained a reference gene set that contained 21,621 genes (Supplementary Table 7).
To assign functions to the newly annotated genes in the Masai giraffe genome we aligned them Finally, we assigned a gene ontology term to 12,263 genes, representing 56.72% of the full gene set.

Genome evolution
The position of the Giraffidae family in the Ruminantia has been highly debated, with some studies using mitochondrial DNA or SNPchip data suggesting that Giraffidae are an outgroup to Bovidae and Cervidae [33,34], while palaeontological and biochemical evidence suggested that Giraffidae and Cervidae are sister taxa [35,36]. To shed light on the giraffe phylogeny, we first used the TreeFam methodology [37] to define gene families in eight mammalian genomes (cattle, sheep, gemsbok, yak, giraffe, Pere David's deer, horse, and human) using newly defined or available gene annotations. We applied the same pipeline and parameters as described by Kim and co-workers [38].
A total of 16,148 gene families, of which 1,327 are single-copy orthologous families, were obtained.
Branch reliability was assessed by 1,000 bootstrap replicates. Finally, PAML mcmctree [40] was used to determine divergence times with the approximate likelihood calculation method and data from TimeTree [41]. The resulting tree suggests that Giraffidae are a sister taxon to the Cervidae, diverging ~21.5 million year ago ( Figure 4); however, further studies using more deer species and other ruminants, such as pronghorn, as well as other methodologies to detect orthologous genes, will be needed to clarify the ruminant phylogeny.

Conclusions
Herein, we report a de novo chromosome-scale genome assembly for Masai giraffe using a combination of sequencing and assembly methodologies aided by physical mapping of 153 BACs onto giraffe metaphase chromosomes. Gene and repeat annotation of the assembly identified a similar number of genes and transposable elements as found in other ruminant species. Following the example of the sable antelope [42] and the California condor [43], the new giraffe genome assembly will foster research into conservation of this charismatic species, serving as a foundation for characterizing the genetic diversity of wild and captive populations. Furthermore, the high quality, chromosome-scale assembly described in this report contributes to the goals of the Genome 10K Project [24] and the Earth BioGenome Project [25].

Note added in proof
The underlying giraffe SOAPdenovo assembly described in this paper is the same as the one used by Chen and co-workers [45].

Competing interests
The authors declare that they have no competing interests.