Genome of Russian wheat aphid an economically important cereal aphid

Although the hemipterans (Aphididae) are comprised of roughly 50,000 extant insect species, only four have sequenced genomes that are publically available, namely Acyrthosiphon pisum (pea aphid), Rhodnius prolixus (Kissing bug), Myzus persicae (Green peach aphid) and Diuraphis noxia (Russian wheat aphid). As a significant proportion of agricultural pests are phloem feeding aphids, it is crucial for sustained global food security that a greater understanding of the genomic and molecular functioning of this family be elucidated. Recently, the genome of US D. noxia biotype US2 was sequenced but its assembly only incorporated ~ 32% of produced reads and contained a surprisingly low gene count when compared to that of the model/first sequenced aphid, A. pisum. To this end, we present here the genomes of two South African Diuraphis noxia (Kurdjumov, Hemiptera: Aphididae) biotypes (SA1 and SAM), obtained after sequencing the genomes of the only two D. noxia biotypes with documented linked genealogy. To better understand overall targets and patterns of heterozygosity, we also sequenced a pooled sample of 9 geographically separated D. noxia populations (MixIX). We assembled a 399 Mb reference genome (PRJNA297165, representing 64% of the projected genome size 623 Mb) using ± 28 Gb of 101 bp paired-end HiSeq2000 reads from the D. noxia biotype SAM, whilst ± 13 Gb 101 bp paired-end HiSeq2000 reads from the D. noxia biotype SA1 were generated to facilitate genomic comparisons between the two biotypes. Sequencing the MixIX sample yielded ±26 Gb 50 bp paired-end SOLiD reads which facilitated SNP detection when compared to the D. noxia biotype SAM assembly. Ab initio gene calling produced a total of 31,885 protein coding genes from the assembled contigs spanning ~ 399 Mb (GCA_001465515.1).


Introduction
Diuraphis noxia (Kurdjumov), commonly known as the Russian wheat aphid, is an economically important hemipteran pest species (Hemiptera: Aphididae) afflicting wheat and barley yield in dry-land production regions [1]. Diuraphis noxia was first reported as a pest of small grains in South Africa during 1978 [2]. In 1986, D. noxia was detected in the US Texas Panhandle [3], where after it spread to 16 other states and two Canadian provinces within a few years. In 1988, D. noxia was recorded in Chile, by 1992 in Argentina [4] and finally spread to Australia in 2016 [5].The feeding of D. noxia results in foliar damage which include distinct white, yellow, purple or reddish-purple longitudinal streaks (chlorotic streaking), with severe leaf rolling in fully expanded leaves and the inhibition of leaf unfolding of developing leaves. This inability of the leaves to unfold traps the developing spike of the plant (termed "head-trapping") which results in no seeds being produced [6,7]. The rolling of leaves has the added unwanted effect of protecting the aphid from harsh environmental conditions (such as insecticide spraying or extreme temperatures) and from natural predators [8]. Overall, D. noxia infested wheat also suffers from stunted growth leading to a lowered biomass and a decrease in the number of tillers produced [6] thereby greatly affecting yield potential. Seed obtained from D. noxia infested wheat also tend to have lowered protein content and other negative attributes for the flour industry [9] which only adds to the economic injury of this pest. In D. noxia, it is common for mothers to carry both their daughters and granddaughters, as parthenogenetically produced granddaughter embryos develop directly within daughters, even before their own birth. This process allows for short D.
noxia generation times and rapid population growth in favorable environments [1], but is thought to limit the available diversity possible within D. noxia populations [10]. Since its appearance in South Africa in the late 1970's, D. noxia has undergone several biotypification events as there are currently five different biotypes recognized in South Africa [11,12] and eight in the USA [13]. Biotypification, as referenced here, is when an aphid population is able to overcome previously established resistance within wheat [14]. Recently, the genome of the United States Diuraphis noxia biotype US2 was released [15] with an assembly size of~395 Mb (296 Mb represented by contigs) and containing 19,097 genes. While the study was able to produce a total of 1.3 Gb of sequence data, it could only incorporate~32% of this into an assembly comprising7 0% of their predicted genome size. A partial assembly due to an under estimation of genome size may explain why their values differ so greatly to that of the closest relative of D. noxia, A. pisum (37, 865 genes and 541 Mb assembly).
Here we present the genomes of the most virulent [11] South African D. noxia biotype SAM and its progenitor, the least virulent South African D. noxia biotype, SA1 [16], as well as information on the heterozygosity within geographically separated D. noxia populations. This study forms part of a larger survey encompassing global D. noxia genomic variation.

Classification and features
Diuraphis noxia Kurdjumov (Hemiptera: Aphididae) ( Table 1) is a phloem feeding Hemipteran that predominantly feeds on winter wheat and spring barley [17], with the ability to utilize other grasses as alternate hosts [3,16]. It is pale green and up to 2 mm long with short and rounded cornicles (Fig. 1). Cornicles are structures limited to aphids on the posterior abdomen and its presence is used to assist in the identification of D. noxia [18]. The cornicles above the cauda give the aphid the appearance of having two tails and it is believed that these structures help aphids with predator defense [3]. Alignments using whole mitochondrial genomes [19] indicate that the closest relative of D. noxia is Acyrthosiphon pisum (Fig. 2). Reproduction of D. noxia can either be holocyclic (sexually reproducing males and females), as in areas where D. noxia is deemed endemic such as Hungary and Russia [20,21], or anholocyclic (parthenogenic females), where D. noxia is deemed invasive [8]. Reproduction through asexual means can lead to a fecundity rate of between 3 and 5 aphids per day with an average lifespan of roughly 50 days, of which 9 are spent as nymphs [20]. Both forms of reproduction can lead to two morphological morphs, namely alatae (wingless forms) and apterae (winged form), with the latter form responsible for the wider geographical dispersal of the aphid [6].

Genome project history
The genome of the most virulent South African D. noxia biotype, SAM, was sequenced, along with that of its less virulent progenitor, biotype SA1, in an attempt to determine the genomic factors responsible for biotypification. With this, a pooled sample comprising of geographically separated D. noxia populations (MixIX) was also sequenced to ascertain the scope of heterogeneity experienced by the species as a whole. The draft genome , not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [31] sequence, as well as that all produced sequences, has been deposited at the NCBI in the GenBank database under ID GCA_001465515.1 and BioProject PRJNA297165. The project information and its association with MIGS version 2.0 compliance are summarized in Table 2 [22].

Growth conditions and genomic DNA preparation
Insectary reared strains of D. noxia, kept at ± 22°C and natural lighting, were utilized for all confirmed D. noxia populations' genomic DNA extractions. Genomic DNA from adult aphids of the South African D. noxia biotypes SAM [11] and SA1 [16], and that of the pooled MixIX sample, was used for next generation sequencing (NGS) during the study. The genomic DNA extraction of aphid DNA was conducted as follows; whole aphids were flash frozen in liquid nitrogen, ground and DNA extracted using the Qiagen DNeasy Blood and Tissue kit according to the manufacturer's protocol [19]. The MixIX sample consisted of 2 μg of genomic DNA for each D. noxia representative, which consisted of three field collected South African D. noxia populations (SA1 < SA2 < SA3 in order of increasing virulence); one field collected Czech D. noxia population; two insectary reared US D. noxia populations (US1 < US2 in order of increasing virulence); one field collected Syrian D. noxia population; and two field collected Argentinian D. noxia populations. The integrity of the extracted DNA was then verified through electrophoresis, making use of a 1.5% agarose gel, and quantified using a Qubit v2.0 fluorometer. Raw sequences obtained from the Illumina HiSeq2000 sequencing of the D. noxia SAM biotype, and from the SOLiD system for the MixIX sample, were trimmed and filtered so that all bases had a minimum Phred score of 20. Reads mapping to Buchnera aphidicola of D. noxia [CP013259.1] and that of the mitochondrion of D. noxia [19] were removed from further analysis. Optimal k-mer length for the D. noxia biotype SAM assembly was determined using KMERGENIE [23], while using DSK [24] to estimate the optimal k-mer frequency cut-off. GCE [25] was utilized to estimate the genome size of D. noxia through using the optimal k-mer size generated by KMERGENIE and the frequency of the optimal k-mer size as determined by DSK. The D. noxia genome of biotype SAM was assembled using the SOAP de novo software package [26]. After contig assembly, scaffolds were constructed by realignment of useable paired-end reads onto the contig sequences. Trimmed D. noxia biotype SAM reads were also iteratively mapped (3 iterations), using the Geneious (v7.1.5) software package [27], against the assembled scaffolds of Acyrthosiphon pisum (Acyr2.0) obtained from ENSEMBL [28]. The consensus sequences obtained from the reference mapping were then compared through the use of the BLASTn [29] application to the de novo contigs. Any sequences that produced no match through use of BLASTn were added to the contigs obtained from the SAM de novo assembly to build the final draft genome. A total of 190,686 contigs greater than 300 bp in length was produced with an average coverage of 44.8×, representing 83% of the total reads generated. Using the assembled contigs, BUSCO v1.1 [30] was utilized to assess the completeness of the assembly and found that of the 2675 single-copy orthologues 85% were present, 7% were fragmented and 8% were missing (Fig. 3a). To allow for comparison, analysis using BUSCO was also performed with the scaffolds of the D. noxia biotype RWA2 genome (GCA_001186385.1) [15] and that of Acyrthosiphon pisum (GCA_000142985.2) (Fig. 3b and c respectively).  [56] utilizing whole mitochondrial genomes, that was aligned with MAFFT [57], illustrating Diuraphis noxia's close association with Acyrthosiphon pisum  to obtain the putative Gene Ontology (GO) [33] of the D. noxia protein coding genes predicted by Augustus. KOG [34] functional categories were assigned to predicted protein coding genes through use of the NCBI's RPS-BLAST [35] and Conserved Domain KOG database [34], with an E-value smaller than 10e-3 accepted as significant. Protein coding genes were analyzed for their amino acid content through use of the Geneious (v7.1.5) platform [27] and CRISPR sites were predicted using the CRISPR Recognition Tool v1.1 [36]. Reads obtained from the SOLiD system were then mapped to the predicted protein coding genes of the SAM assembly to facilitate nucleotide variant calling using the Geneious (v7.1.5) software package [27]. The minimal criteria for assigning a single nucleotide polymorphism (SNP) required that the area in question had a mapping coverage of more than ×10, the variant was present in at least 2 sequences, and that the p-value predicted for the SNP should be smaller than 1 × 10 −6 (calculated by first averaging the base quality of each base equal to the proposed SNP and averaging the qualities of each base not equal to the proposed SNP).  The total is based on the total number of protein coding genes in the genome  An EDTA-mediated exudation protocol [37] was used to collect phloem from uninfested, susceptible Triticum aestivum subsp. aestivum L cultivar Gamtoos-S leaves in triplicate. The exudates were blown to dryness under nitrogen at 55°C and the residues were reconstituted in 200 ul 1 M (pH 8.0) borate buffer containing internal amino acid standards by the Central Analytical Facilities (CAF), Stellenbosch University. Ten microlitres of each reconstituted sample was derivatized using the Waters AQC derivatization kit. Derivatized amino acids were then separated and detected using a Waters Acquity UPLC fitted with an UltraTag C18 column and a photodiode array detector. Peaks were detected and integrated by the MassLynx software (Waters Corporation).

Genome properties
The genome of female D. noxia consists of 5 holocentric chromosome pairs (4 autosomes and 1 sex or X chromosome) giving it an XX/XO sex determination system [38]. The final assembly totaled 399,704,836 bp which represents~64% of the predicted genome size of between 593 and 623 Mb obtained through using GCE  Where synonymous SNPs cause no amino acid change, substitution SNPs cause a single amino acid substitution, truncation SNPs introduces of a stop codon, frame shit SNPs disrupt the reading frame through deletions and/or insertions of 1 or 2 bases; insertion SNPs introduces an additional codon; deletion SNPs is where a codon is removed and extension SNPs disrupt existing stop codons [25], the optimal k-mer size (KMERGENIE [23]) and distribution graphs (DSK [24]) ( Table 3). The assembly GC content was 29.5% and ab initio gene calling, through Augustus [31], identified 31,885 protein coding genes greater than 32 amino acids in length. The total gene complement represented 66,633,929 bp of the assembly (16.67%) of which 20,316,122 (5.08%) consisted of coding domain sequence. Amino acid usage in the protein coding genes complement of D. noxia (Table 4) indicated that leucine followed by serine are the most frequently used amino acids, while tryptophan was the least frequently occurring amino acid. Of the 31,885 protein coding genes, 27,386 (~86%) sequences were putatively identified through BLASTx and BLASTp and only 12,791 (~47%) of these had a GO term assigned to them through Blast2GO [32].

Insights from the genome sequence
With an AT content of 70.5%, D. noxia is the most ATrich insect genome sequenced to date. This is very similar to its closest aphid relative, A. pisum, which has an AT content of 70.4%. A cursory comparison of the genic complement between D. noxia biotypes SAM and SA1 shows no differences, with SA1 reads mapping to all predicted protein coding genes, and no indication of genomic rearrangements. Genome size estimations, utilizing GCE and k-mer counting, were also inconclusive with both biotypes predicted to have roughly equal genome sizes. The predicted genome size of roughly 623 Mb containing 31,885 protein coding genes is also comparable to that of A. pisum which currently has an assembly size of 542 Mb with 36,195 protein coding genes assigned to it. In order to assess whether there is a bias for selected amino acids during transcription in the D. noxia genome, we analyzed the frequency of specific amino acids and codon usage (Table 4). From the data it was evident that leucine followed by serine are the most frequently used amino acids in the predicted protein coding genes, while tryptophan was the least frequently occurring amino acid.
With regards to codon usage, leucine codons were used in the following order from most used TTA > TTG > CTT > CTG > CTA > CTC, while in the case of serine they were as follows TCA > TCT > AGT > TCG > AGC > TCC. Codons with low usage include tryptophan (TGG), cysteine (TGT > TGC) and histidine (CAT > CAC). The start codon (methionine, ATG) and stop codons (TAA, TGA and TAG) also occurred as expected at lower frequencies.
When comparing the amino acid usage of D. noxia protein coding genes to that of the free amino acid composition of wheat phloem (Fig. 4), it was interesting to note that of the ten most abundant amino acids present in D. noxia protein coding genes (in order: Leu > Ser > Lys > Ile > Glu > Val > Asn > Thr > Asp > Ala), seven were also Fig. 6 Relative abundance of SNPs within genes assigned to KOG functional categories most abundant in wheat phloem (i.e., Asp, Glu, Ser, Ile, Thr, Leu and Ala). Previous studies that either utilized an EDTA mediated phloem exudation method and/or stylectomy to investigate wheat phloem reported similar levels in unchallenged wheat plants (Additional file 1: Figure S1) [37,39]. The apparent organization of D. noxia protein coding genes around the availability of free amino acids within its diet could illustrate the adaptation of the aphid towards its limited host range.
Assigning predicted D. noxia protein coding genes into KOG categories revealed that, out of the 31,885 predicted genes, 13,523 (42.42%) were successfully assigned of which the largest comprised of the general function category R (12.98%) > the signal transduction mechanisms category T (12.26%) > the transcription category K (7.61%) > posttranslational modification, protein turnover and chaperones category O (7.29%) > the intracellular trafficking, secretion, and vesicular transport category U (6.15%) ( Table 5; Fig. 5). The large grouping of genes associated with protein modification and turnover is interesting in that it has been shown previously that phloem feeding aphids, despite low levels of heterogeneity, display various levels of virulence towards single host cultivars [40][41][42][43], as is the case for Diuraphis noxia biotypes SA1 and SAM [11,44]. The basis for this observed variance may include the adaptability of the aphid's salivary cohort in response to its feeding environment [45,46] as this is central to the molecular interaction between aphids and their hosts. In a study by Lapitan et al. [47], where fractionated aphid extracts from different D. noxia biotypes were injected into resistant and susceptible wheat cultivars, it was found that the D. noxia effector(s) modulating aphid-host interactions was proteinaceous in nature and differed between biotypes. Thus D. noxia, as well as other Hemipterans, would require an adaptive and responsive salivary enzyme cohort that is able to adjust for their continually changing feeding environment [48].

Extended insights
The pooling, and subsequent sequencing, of different D. noxia geographically separated populations was performed to give a clearer indication of the level of variation present overall within the species. The total number of polymorphic sites identified between the predicted protein coding genes of the South African D. noxia biotype SAM assembly and the MixIX sample was 92,125 ( Table 6). The majority of these polymorphic sites were either synonymous (19.9%) or resulted in an amino acid substitution (68.4%). Other SNPs resulted in major underlying protein effects such as the introduction of aberrant stop codons leading to truncated transcripts (7.4%), frame shift alterations (2.6%), in-frame insertions (0.6%) and deletions (0.5%) and the extension of transcripts through disrupting stop codons (0.5%). In total, out of the predicted 31,885 protein coding genes 10,934 (34.29%) contained SNPs. The KOG general function category R (1657 genes) was assigned the most genes, followed by the translation, ribosomal structure and biogenesis category T (1434 genes); the replication, recombination and repair category L (1092); the posttranslational modification, protein turnover, chaperones category O (1061); the transcription category K (1025); and the cell cycle control, cell division, chromosome partitioning category D (938) (Fig. 6). Again, genes allocated to the general category of protein modification and turnover features prominently within the top genes containing the most SNPs, especially so when comparing SNPs with an underlying protein effect.
The overall low number of SNPs leading to protein content variation (i.e., insertion and deletion of in-frame codons) was the least represented. This may indicate a conservation of local amino acid identity within proteins. Although SNPs resulting in an amino acid substitution were the highest recorded type of all the SNP effect types, these generally don't incur significant functional changes. Substitutions involving amino acids possessing similar properties would constrain protein folding and target specificity. Any prediction on the underlying protein effects of these types of SNPs would also require site specific information and corroborating molecular evidence. Truncation SNPs, polymorphisms introducing aberrant stop codons within the coding domain of genes, was the third most prevalent SNP type observed (after synonymous and substitution type SNPs). Arguably, the effect of these types of SNPs can be considered more significant as they have the potential of producing transcripts of varying lengths, possibly altering the molecular action and target affinities of proteins and their underlying complexes. This could potentially afford the aphid with a wider array of "molecular machinery" to adapt to defensive responses from its host.

Conclusions
The genome of the South African Diuraphis noxia biotype SAM was successfully assembled into contigs spanning roughly 400 Mb and predicted to contain 31,885 protein coding genes. A large proportion of predicted genes were assigned to KOG functional categories relating to protein modification and turnover that may help explain the differential adaptability of different D. noxia biotypes towards their host. The overall low variation across the genome of D. noxia is consistent with previous studies that have found limited variation between biotypes [48,49]. It is though interesting that most of the functional nucleotide variation observed was predominantly present in genes governing protein modification and turnover which in turn is supportive of the adaptability of D. noxia when facing resistance mechanisms from its host.